# __Rating Product & Sorting Reviews in Amazon__

## __Business Problem__

Accurately calculating a product's post-sale rating is one of the most important problems in e-commerce. Solving it means higher customer satisfaction for the e-commerce site, better product visibility for sellers, and a smoother shopping experience for buyers. A second problem is ordering product reviews correctly. Because misleading reviews that rise to the top directly affect a product's sales, they cause both financial loss and lost customers. Solving these two problems increases sales for the e-commerce site and its sellers and lets customers complete their purchases without trouble.

## __Story of the Dataset__

This dataset of Amazon product data includes product categories and various metadata. It contains the user ratings and reviews for the most-reviewed product in the Electronics category.

- __reviewerID__ : Reviewer ID
- __asin__ : Product ID
- __reviewerName__ : Reviewer Name
- __helpful__ : Helpfulness votes, stored as [helpful_yes, total_vote]
- __reviewText__ : Evaluation text
- __overall__ : Product Rating
- __summary__ : Review Summary
- __unixReviewTime__ : Review Time
- __reviewTime__ : Review Time Raw
- __day_diff__ : Number of days since the review was posted
- __helpful_yes__ : Number of votes that found the review helpful
- __total_vote__ : Total number of votes the review received

Import Necessary Libraries

In [1]:
import pandas as pd
import math
import scipy.stats as st
from sklearn.preprocessing import MinMaxScaler

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 500)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [2]:
df_=pd.read_csv("amazon_review.csv")
df=df_.copy()
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,day_diff,helpful_yes,total_vote
0,A3SBTW3WS4IQSN,B007WTAJTO,,"[0, 0]",No issues.,4.0,Four Stars,1406073600,2014-07-23,138,0,0
1,A18K1ODH1I2MVB,B007WTAJTO,0mie,"[0, 0]","Purchased this for my device, it worked as adv...",5.0,MOAR SPACE!!!,1382659200,2013-10-25,409,0,0
2,A2FII3I2MBMUIA,B007WTAJTO,1K3,"[0, 0]",it works as expected. I should have sprung for...,4.0,nothing to really say....,1356220800,2012-12-23,715,0,0
3,A3H99DFEG68SR,B007WTAJTO,1m2,"[0, 0]",This think has worked out great.Had a diff. br...,5.0,Great buy at this price!!! *** UPDATE,1384992000,2013-11-21,382,0,0
4,A375ZM4U047O79,B007WTAJTO,2&amp;1/2Men,"[0, 0]","Bought it with Retail Packaging, arrived legit...",5.0,best deal around,1373673600,2013-07-13,513,0,0


__Average Rating__ 

Direct Average Rating

In [3]:
df.overall.mean()

4.587589013224822

Obtaining a time-weighted average rating based on review dates
- First, the review time is converted to datetime and the age of each review in days is computed relative to the most recent review.
- Second, the quartiles of this age are found.
- Finally, the average rating within each quartile group is multiplied by a weight and the results are summed to give the weighted average rating.
    - The reasoning behind this approach is the assumption that more recent ratings are more up to date and better reflect the product's current state.
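
The steps above can be sketched as a reusable function. This is a minimal sketch, not the notebook's own code: the function name `time_based_weighted_average`, its parameters, and the toy data are assumptions made for illustration, while the default weights match the ones used in the notebook.

```python
import pandas as pd

def time_based_weighted_average(dataframe, weights=(0.35, 0.30, 0.25, 0.10),
                                age_col="day_diff", rating_col="overall"):
    """Weighted sum of per-quartile mean ratings, newest quartile group first."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    age = dataframe[age_col]
    q1, q2, q3 = age.quantile([0.25, 0.50, 0.75])
    # Four age groups, from newest to oldest reviews.
    groups = [
        age <= q1,
        (age > q1) & (age <= q2),
        (age > q2) & (age <= q3),
        age > q3,
    ]
    # Note: an empty group would yield NaN; this sketch does not handle that case.
    return sum(dataframe.loc[mask, rating_col].mean() * w
               for mask, w in zip(groups, weights))

# Toy data (made up): newer reviews rate the product higher.
toy = pd.DataFrame({
    "day_diff": [1, 2, 3, 4, 5, 6, 7, 8],
    "overall":  [5, 5, 5, 5, 4, 4, 3, 3],
})
weighted = time_based_weighted_average(toy)
plain = toy["overall"].mean()
print(weighted, plain)
```

On the toy data the weighted average is higher than the plain mean, because the newest reviews carry the largest weights and also have the highest ratings.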

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4915 entries, 0 to 4914
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   reviewerID      4915 non-null   object 
 1   asin            4915 non-null   object 
 2   reviewerName    4914 non-null   object 
 3   helpful         4915 non-null   object 
 4   reviewText      4914 non-null   object 
 5   overall         4915 non-null   float64
 6   summary         4915 non-null   object 
 7   unixReviewTime  4915 non-null   int64  
 8   reviewTime      4915 non-null   object 
 9   day_diff        4915 non-null   int64  
 10  helpful_yes     4915 non-null   int64  
 11  total_vote      4915 non-null   int64  
dtypes: float64(1), int64(4), object(7)
memory usage: 460.9+ KB


In [5]:
# Convert the raw review date to datetime and recompute each review's age in days
# relative to the most recent review in the dataset.
df["reviewTime"] = pd.to_datetime(df["reviewTime"])
current_date = df["reviewTime"].max()
df["day_diff"] = (current_date - df["reviewTime"]).dt.days
# Quartile boundaries of the review age.
q1, q2, q3 = df["day_diff"].quantile([0.25, 0.5, 0.75])
# Time-weighted average: newer quartile groups get larger weights (weights sum to 1).
(df.loc[df["day_diff"] <= q1, "overall"].mean() * 0.35
 + df.loc[(df["day_diff"] > q1) & (df["day_diff"] <= q2), "overall"].mean() * 0.30
 + df.loc[(df["day_diff"] > q2) & (df["day_diff"] <= q3), "overall"].mean() * 0.25
 + df.loc[df["day_diff"] > q3, "overall"].mean() * 0.10)

4.62191041603578

Inspecting the average rating of each quartile group

In [6]:
df.loc[df["day_diff"]<=q1,"overall"].mean()

4.6957928802588995

In [7]:
df.loc[(df["day_diff"]>q1)&(df["day_diff"]<=q2),"overall"].mean()


4.636140637775961

In [8]:
df.loc[(df["day_diff"]>q2)&(df["day_diff"]<=q3),"overall"].mean()


4.571661237785016

In [9]:
df.loc[(df["day_diff"]>q3),"overall"].mean()


4.4462540716612375

__Comment__
- The average rating rises as the review period gets closer to the present, from about 4.45 in the oldest quartile group to about 4.70 in the newest. This suggests that customers now view the product more favorably.

__Specifying 10 reviews for the product to be displayed on the product detail page__

Number of not-helpful votes per review

In [10]:
df["helpful_no"]=df.total_vote-df.helpful_yes

There are three different approaches to finding useful reviews. However, we will use only the Wilson Lower Bound score, for these reasons:
- Ranking by the plain difference between up votes and down votes pushes down reviews that are useful but have few votes.
- Ranking by the second function (the share of up votes) can push reviews with very few votes to the top.
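
To make these two failure modes concrete, here is a minimal sketch with made-up vote counts. The review labels `A`, `B`, `C` and the helper names `vote_difference` and `helpful_ratio` are hypothetical; they only mirror the idea of the first two scoring functions.

```python
# Hypothetical (up, down) vote counts, made up for illustration.
reviews = {
    "A": (600, 400),  # many votes, only 60% helpful
    "B": (50, 2),     # fewer votes, about 96% helpful
    "C": (1, 0),      # a single vote, 100% helpful
}

def vote_difference(up, down):
    # Plain difference between up and down votes.
    return up - down

def helpful_ratio(up, down):
    # Share of up votes; 0.0 when there are no votes at all.
    total = up + down
    return up / total if total else 0.0

by_difference = sorted(reviews, key=lambda r: vote_difference(*reviews[r]), reverse=True)
by_ratio = sorted(reviews, key=lambda r: helpful_ratio(*reviews[r]), reverse=True)

print(by_difference)  # ['A', 'B', 'C']
print(by_ratio)       # ['C', 'B', 'A']
```

The difference puts the heavily voted but divisive review `A` first, while the ratio puts the single-vote review `C` first. Neither ranking favors `B`, which has both a high share of helpful votes and a reasonable number of votes.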

In [11]:
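def score_pos_neg_diff(up, down):
    # Approach 1: plain difference between up votes and down votes.
    return up - down


def score_average_rating(up, down):
    # Approach 2: share of up votes among all votes (0 if there are no votes).
    if up + down == 0:
        return 0
    return up / (up + down)


def wilson_lower_bound(up, down, confidence=0.95):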
    """
    Wilson Lower Bound Score Calculation

    - The lower limit of the confidence interval to be calculated for the Bernoulli parameter p is accepted as the WLB score.
    - The score to be calculated is used for review ranking.

    Parameters
    ----------
    up: int
        up count
    down: int
        down count
    confidence: float
        confidence

    Returns
    -------
    wilson score: float

    """
    n = up + down
    if n == 0:
        return 0
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    phat = 1.0 * up / n
    return (phat + z * z / (2 * n) - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)
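
Written out as a formula, the expression returned above appears to be the lower bound of the Wilson score interval. Here $\hat{p} = \text{up}/n$ is the observed share of helpful votes, $n = \text{up} + \text{down}$ is the total number of votes, and $z$ is the standard normal quantile for the chosen confidence level (about 1.96 for 0.95):

$$
\text{WLB} = \frac{\hat{p} + \dfrac{z^2}{2n} - z\sqrt{\dfrac{\hat{p}(1-\hat{p}) + \dfrac{z^2}{4n}}{n}}}{1 + \dfrac{z^2}{n}}
$$

As $n$ grows, the terms in $z^2/n$ shrink and the bound approaches $\hat{p}$. With few votes, the bound sits well below $\hat{p}$.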


__Wilson Lower Bound Score Examples__

In [12]:
# Votes with same ratio
print(wilson_lower_bound(1, 2, confidence=0.95))
print(wilson_lower_bound(100, 200, confidence=0.95))

0.061491944720396215
0.2823934472922627


In [13]:
# Votes with same difference
print(wilson_lower_bound(3, 6, confidence=0.95))
print(wilson_lower_bound(100, 106, confidence=0.95))

0.1205838183869109
0.4180809445791726


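- As can be seen above, the Wilson Lower Bound score takes into account both the ratio of helpful votes and the number of votes behind it (the wisdom of the crowd). With the same ratio or the same vote difference, the review with more votes scores higher.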

In [14]:
df["wilson_lower_bound"]=df.apply(lambda x:wilson_lower_bound(x["helpful_yes"],x["helpful_no"]),axis=1)
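
For larger datasets, the row-by-row `apply` could be replaced by a vectorized computation. The following is only a sketch under assumptions: the function name `wilson_lower_bound_vectorized` and the small `votes` frame are hypothetical, and `statistics.NormalDist` from the standard library stands in for `st.norm.ppf` to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from statistics import NormalDist

def wilson_lower_bound_vectorized(up, down, confidence=0.95):
    """Wilson lower bound for whole columns of up/down vote counts."""
    up = np.asarray(up, dtype=float)
    down = np.asarray(down, dtype=float)
    n = up + down
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Replace n == 0 by 1 to avoid division by zero; those rows are set to 0 below.
    safe_n = np.where(n == 0, 1.0, n)
    phat = up / safe_n
    centre = phat + z * z / (2 * safe_n)
    margin = z * np.sqrt((phat * (1 - phat) + z * z / (4 * safe_n)) / safe_n)
    score = (centre - margin) / (1 + z * z / safe_n)
    return np.where(n == 0, 0.0, score)

# Vote counts taken from the examples above, plus a review without votes.
votes = pd.DataFrame({"helpful_yes": [1, 100, 0], "helpful_no": [2, 200, 0]})
votes["wilson_lower_bound"] = wilson_lower_bound_vectorized(
    votes["helpful_yes"], votes["helpful_no"]
)
print(votes)
```

The first two rows should reproduce the scores printed in the same-ratio example (about 0.0615 and 0.2824), and the review without votes gets a score of 0, as in the scalar function.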

In [15]:
df.sort_values(by="wilson_lower_bound",ascending=False).head(10)

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,day_diff,helpful_yes,total_vote,helpful_no,wilson_lower_bound
2031,A12B7ZMXFI6IXY,B007WTAJTO,"Hyoun Kim ""Faluzure""","[1952, 2020]",[[ UPDATE - 6/19/2014 ]]So my lovely wife boug...,5.0,UPDATED - Great w/ Galaxy S4 & Galaxy Tab 4 10...,1367366400,2013-01-05,701,1952,2020,68,0.95754
3449,AOEAD7DPLZE53,B007WTAJTO,NLee the Engineer,"[1428, 1505]",I have tested dozens of SDHC and micro-SDHC ca...,5.0,Top of the class among all (budget-priced) mic...,1348617600,2012-09-26,802,1428,1505,77,0.93652
4212,AVBMZZAFEKO58,B007WTAJTO,SkincareCEO,"[1568, 1694]",NOTE: please read the last update (scroll to ...,1.0,1 Star reviews - Micro SDXC card unmounts itse...,1375660800,2013-05-08,578,1568,1694,126,0.91214
317,A1ZQAQFYSXL5MQ,B007WTAJTO,"Amazon Customer ""Kelly""","[422, 495]","If your card gets hot enough to be painful, it...",1.0,"Warning, read this!",1346544000,2012-02-09,1032,422,495,73,0.81858
4672,A2DKQQIZ793AV5,B007WTAJTO,Twister,"[45, 49]",Sandisk announcement of the first 128GB micro ...,5.0,Super high capacity!!! Excellent price (on Am...,1394150400,2014-07-03,157,45,49,4,0.80811
1835,A1J6VSUM80UAF8,B007WTAJTO,goconfigure,"[60, 68]",Bought from BestBuy online the day it was anno...,5.0,I own it,1393545600,2014-02-28,282,60,68,8,0.78465
3981,A1K91XXQ6ZEBQR,B007WTAJTO,"R. Sutton, Jr. ""RWSynergy""","[112, 139]",The last few days I have been diligently shopp...,5.0,"Resolving confusion between ""Mobile Ultra"" and...",1350864000,2012-10-22,776,112,139,27,0.73214
3807,AFGRMORWY2QNX,B007WTAJTO,R. Heisler,"[22, 25]",I bought this card to replace a lost 16 gig in...,3.0,"Good buy for the money but wait, I had an issue!",1361923200,2013-02-27,648,22,25,3,0.70044
4306,AOHXKM5URSKAB,B007WTAJTO,Stellar Eller,"[51, 65]","While I got this card as a ""deal of the day"" o...",5.0,Awesome Card!,1339200000,2012-09-06,822,51,65,14,0.67033
4596,A1WTQUOQ4WG9AI,B007WTAJTO,"Tom Henriksen ""Doggy Diner""","[82, 109]",Hi:I ordered two card and they arrived the nex...,1.0,Designed incompatibility/Don't support SanDisk,1348272000,2012-09-22,806,82,109,27,0.66359
