#### *Description: The dataset contains the ratings and comments for a single electronic product sold on Amazon.*
##### *Columns of the dataset:*

1. reviewerID: ID of the customer
2. asin: ID of the product
3. reviewerName: Name of the customer
4. helpful: Votes on the comment (positive, negative)
5. reviewText: Text of the comment
6. overall: Rating
7. summary: Summary of the comment
8. unixReviewTime: Date of the comment as a Unix timestamp
9. reviewTime: Date of the comment
10. day_diff: Number of days between the analysis date and the comment date
11. helpful_yes: Number of positive votes on the comment (cast by other customers)
12. total_vote: Total number of votes on the comment

##### *First we will calculate the average rating and a time-based weighted average rating, and see how recent comments affect the average rating.<br>Finally, we will sort the comments more reliably using the Wilson lower bound score.*


#### Import the Dataset

In [1]:
import pandas as pd
import math
import scipy.stats as st
from sklearn.preprocessing import MinMaxScaler

In [2]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/amazon-reviews/amazon_reviews.csv


In [3]:
df = pd.read_csv("/kaggle/input/amazon-reviews/amazon_reviews.csv")

In [4]:
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,day_diff,helpful_yes,total_vote
0,A3SBTW3WS4IQSN,B007WTAJTO,,"[0, 0]",No issues.,4.0,Four Stars,1406073600,2014-07-23,138,0,0
1,A18K1ODH1I2MVB,B007WTAJTO,0mie,"[0, 0]","Purchased this for my device, it worked as adv...",5.0,MOAR SPACE!!!,1382659200,2013-10-25,409,0,0
2,A2FII3I2MBMUIA,B007WTAJTO,1K3,"[0, 0]",it works as expected. I should have sprung for...,4.0,nothing to really say....,1356220800,2012-12-23,715,0,0
3,A3H99DFEG68SR,B007WTAJTO,1m2,"[0, 0]",This think has worked out great.Had a diff. br...,5.0,Great buy at this price!!! *** UPDATE,1384992000,2013-11-21,382,0,0
4,A375ZM4U047O79,B007WTAJTO,2&amp;1/2Men,"[0, 0]","Bought it with Retail Packaging, arrived legit...",5.0,best deal around,1373673600,2013-07-13,513,0,0


#### First, we compute the average rating and save it for later.

In [5]:
par = df["overall"].mean()
df["overall"].mean()

4.587589013224822

#### Next we check the quantiles of the day difference, i.e. the number of days between the analysis date and the comment date. Recent comments are scarce: the 3% quantile is already about 64 days and the 5% quantile is 98 days, so at most about 5% of the comments are from the last three months. That is why we rescale the column "day_diff" with a min-max scaler to the range 1 to 360, so that all comments fit into roughly one year, and store the result as "day_diff_scaled".

In [6]:
df["day_diff"].describe([0.03, 0.05, 0.01, 0.25, 0.75, 0.90, 0.95, 0.99])

count    4915.000000
mean      437.367040
std       209.439871
min         1.000000
1%          6.000000
3%         64.420000
5%         98.000000
25%       281.000000
50%       431.000000
75%       601.000000
90%       708.000000
95%       748.000000
99%       943.000000
max      1064.000000
Name: day_diff, dtype: float64

In [7]:
df["day_diff_scaled"] = MinMaxScaler(feature_range=(1,360)).fit(df[["day_diff"]]).transform(df[["day_diff"]])
df["day_diff_scaled"] = df["day_diff_scaled"].astype(int)
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,day_diff,helpful_yes,total_vote,day_diff_scaled
0,A3SBTW3WS4IQSN,B007WTAJTO,,"[0, 0]",No issues.,4.0,Four Stars,1406073600,2014-07-23,138,0,0,47
1,A18K1ODH1I2MVB,B007WTAJTO,0mie,"[0, 0]","Purchased this for my device, it worked as adv...",5.0,MOAR SPACE!!!,1382659200,2013-10-25,409,0,0,138
2,A2FII3I2MBMUIA,B007WTAJTO,1K3,"[0, 0]",it works as expected. I should have sprung for...,4.0,nothing to really say....,1356220800,2012-12-23,715,0,0,242
3,A3H99DFEG68SR,B007WTAJTO,1m2,"[0, 0]",This think has worked out great.Had a diff. br...,5.0,Great buy at this price!!! *** UPDATE,1384992000,2013-11-21,382,0,0,129
4,A375ZM4U047O79,B007WTAJTO,2&amp;1/2Men,"[0, 0]","Bought it with Retail Packaging, arrived legit...",5.0,best deal around,1373673600,2013-07-13,513,0,0,173
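
To make the rescaling concrete, here is a minimal, dependency-free sketch of the min-max formula that `MinMaxScaler(feature_range=(1, 360))` applies. It assumes the minimum (1) and maximum (1064) of `day_diff` from the quantile output above. The helper name `scale_day_diff` is made up for illustration and is not part of the notebook.

```python
def scale_day_diff(day_diff, lo=1, hi=1064, new_lo=1, new_hi=360):
    """Min-max scale one day_diff value to [new_lo, new_hi], then truncate to int.

    lo and hi are the observed min and max of day_diff (assumed from the
    describe() output); new_lo and new_hi mirror feature_range=(1, 360).
    """
    scaled = new_lo + (day_diff - lo) / (hi - lo) * (new_hi - new_lo)
    # int() truncates toward zero, like .astype(int) does for positive floats.
    return int(scaled)


# day_diff values taken from the first three rows of df.head()
for d in (138, 409, 715):
    print(d, "->", scale_day_diff(d))
```

The three results should match the `day_diff_scaled` values shown in the first three rows of the output above (47, 138 and 242). Note that the conversion to `int` truncates rather than rounds.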


#### Now we can compute a time-based weighted average rating. The idea is to give recent comments more weight, so that the result reflects the current rating.<br><br> What is the time-based weighted average?<br><br> All comments and ratings are now scaled into one year. We split the scaled day difference into 4 intervals and give each interval a weight, with the weights summing to 1, as below:<br>(You can choose your own weights, this is completely subjective) <br><br> *within 1 month:* 0.28<br>*between 1 and 3 months:* 0.26<br>*between 3 and 6 months:* 0.24<br>*more than 6 months:* 0.22<br>

In [8]:
tbwa = df.loc[df["day_diff_scaled"] <= 30, "overall"].mean() * 28 / 100 + \
df.loc[(df["day_diff_scaled"] > 30) & (df["day_diff_scaled"] <= 90), "overall"].mean() * 26 / 100 + \
df.loc[(df["day_diff_scaled"] > 90) & (df["day_diff_scaled"] <= 180), "overall"].mean() * 24 / 100 + \
df.loc[df["day_diff_scaled"] > 180, "overall"].mean() * 22 / 100

In [9]:
print("Time-Based Weighted Average Rating: " + f'{tbwa:.2f}' + "\n"
      "Average Rating: " + f'{par:.2f}')

Time-Based Weighted Average Rating: 4.65
Average Rating: 4.59


#### The time-based weighted average rating (4.65) is higher than the plain average rating (4.59). This suggests that recent ratings are more positive than older ones.
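
The weighting scheme can also be wrapped in a small reusable function. The following is only a sketch in plain Python: the function name `time_weighted_rating`, the `(age_in_days, rating)` input format and the toy data are assumptions made for illustration. Unlike the pandas expression above, it skips empty intervals and renormalizes the remaining weights, which is a design choice of this sketch and not something the notebook does.

```python
def time_weighted_rating(
    reviews,
    buckets=((30, 0.28), (90, 0.26), (180, 0.24), (float("inf"), 0.22)),
):
    """Weighted average of per-interval mean ratings.

    reviews: iterable of (age_in_days, rating) pairs (hypothetical format).
    buckets: (upper_bound_in_days, weight) pairs, sorted by upper bound.
    """
    sums = [0.0] * len(buckets)
    counts = [0] * len(buckets)
    for age, rating in reviews:
        # Put the rating into the first interval whose upper bound covers it.
        for i, (upper, _) in enumerate(buckets):
            if age <= upper:
                sums[i] += rating
                counts[i] += 1
                break

    total, weight_used = 0.0, 0.0
    for (_, weight), s, c in zip(buckets, sums, counts):
        if c == 0:
            continue  # assumption: ignore empty intervals instead of returning NaN
        total += weight * s / c
        weight_used += weight
    if weight_used == 0:
        raise ValueError("no reviews given")
    return total / weight_used


# Made-up toy data: (scaled age in days, rating)
toy_reviews = [(10, 5), (20, 4), (60, 4), (120, 3), (200, 5), (300, 3)]
print(round(time_weighted_rating(toy_reviews), 2))
```

For this toy data the interval means are 4.5, 4.0, 3.0 and 4.0, so the weighted result is about 3.9, while the plain mean of the six ratings is 4.0.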

#### We are not finished yet! Comments can also be sorted, and here we use the Wilson lower bound score.<br><br>The Wilson lower bound score provides a way to rank items based on positive and negative votes.<br>Each comment has positive (helpful_yes) and negative (helpful_no) votes, and we want to estimate how helpful the comment would be rated across all customers.<br>We estimate the share of positive votes with 95% confidence, which gives a confidence interval, and we use its lower limit as the score.<br>To make this clear: a comment with many negative votes is pushed down even if its positive votes outnumber the negative ones, because what matters is having few negative votes. Keep in mind that the votes evaluate the comments. Now let's see how the Wilson lower bound sorting looks.
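
For reference, the score is the lower limit of the Wilson score interval. Writing $\hat{p}$ for the observed share of positive votes, $n$ for the total number of votes and $z$ for the standard normal quantile of the chosen confidence level (about 1.96 for 95%), the quantity computed by the `wilson_lower_bound` function in this notebook can be written as:

$$
\text{WLB} = \frac{\hat{p} + \dfrac{z^2}{2n} - z\sqrt{\dfrac{\hat{p}(1-\hat{p}) + \dfrac{z^2}{4n}}{n}}}{1 + \dfrac{z^2}{n}},
\qquad \hat{p} = \frac{\text{up}}{n}, \quad n = \text{up} + \text{down}
$$

When $n = 0$ the function returns 0 instead of evaluating the formula.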

In [10]:
df["helpful_no"] = df["total_vote"] - df["helpful_yes"]
up = df["helpful_yes"].tolist()
down = df["helpful_no"].tolist()
comments = pd.DataFrame({"up": up, "down": down})

In [11]:
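def wilson_lower_bound(up, down, confidence=0.95):
    # Lower bound of the Wilson score interval for the share of up votes.
    n = up + down
    if n == 0:
        # No votes at all: score the comment as 0.
        return 0
    # Two-sided z-score for the given confidence level (about 1.96 for 0.95).
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    phat = 1.0 * up / n  # observed share of up votes
    return (phat + z * z / (2 * n) - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)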

In [12]:
comments["wlb"] = comments.apply(lambda x: wilson_lower_bound(x["up"], x["down"]), axis=1)
df["wilson_lower_bound"] = comments["wlb"]
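
To see how down votes pull a score down, here is a small standalone sketch of the same formula. It uses `statistics.NormalDist` from the standard library instead of `scipy.stats` for the z-score, which is a substitution made here so the example runs without extra packages. The name `wilson_lower_bound_stdlib` is hypothetical, and the vote counts 7/0 and 14/2 are chosen for illustration.

```python
import math
from statistics import NormalDist


def wilson_lower_bound_stdlib(up, down, confidence=0.95):
    """Wilson lower bound for the share of up votes (standard-library variant)."""
    n = up + down
    if n == 0:
        return 0.0
    # Two-sided z-score, e.g. about 1.96 for confidence=0.95.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = up / n
    return (
        phat + z * z / (2 * n)
        - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    ) / (1 + z * z / n)


few_clean = wilson_lower_bound_stdlib(7, 0)    # 7 up votes, no down votes
more_mixed = wilson_lower_bound_stdlib(14, 2)  # 14 up votes, 2 down votes
print(round(few_clean, 3), round(more_mixed, 3))
print(few_clean > more_mixed)  # True: the 2 down votes outweigh the extra up votes

# Vote counts of the top comment in the notebook (1952 up, 68 down)
print(wilson_lower_bound_stdlib(1952, 68))
```

The last value should be close to the 0.957544 that the notebook reports for the comment with 1952 up votes and 68 down votes.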

#### Here is the list of the top 20 comments.<br>As you can see, comments with low ratings also rank high in our list, because people found these comments useful and up-voted them.<br>Take the third row (index 4212): it has 1568 up votes (positive) and 126 down votes (negative). That means people found this comment useful, which suggests they agree with its 1.0 rating.<br>For comparison, take index numbers 1609 and 4302. The comment with 7 up votes (positive) ranks higher than the one with 14 up votes (positive), because the first has no down votes (negative) while the second has 2 down votes (negative). The down votes pull the comment down, as mentioned above.

In [13]:
df_top_comments = df.sort_values("wilson_lower_bound",ascending=False).head(20)
df_top_comments[["overall", "summary", "helpful_yes", "helpful_no", "wilson_lower_bound"]]

Unnamed: 0,overall,summary,helpful_yes,helpful_no,wilson_lower_bound
2031,5.0,UPDATED - Great w/ Galaxy S4 & Galaxy Tab 4 10...,1952,68,0.957544
3449,5.0,Top of the class among all (budget-priced) mic...,1428,77,0.936519
4212,1.0,1 Star reviews - Micro SDXC card unmounts itse...,1568,126,0.912139
317,1.0,"Warning, read this!",422,73,0.818577
4672,5.0,Super high capacity!!! Excellent price (on Am...,45,4,0.808109
1835,5.0,I own it,60,8,0.784651
3981,5.0,"Resolving confusion between ""Mobile Ultra"" and...",112,27,0.732136
3807,3.0,"Good buy for the money but wait, I had an issue!",22,3,0.700442
4306,5.0,Awesome Card!,51,14,0.670334
4596,1.0,Designed incompatibility/Don't support SanDisk,82,27,0.663595
