In [None]:
## Rating Products & Sorting Reviews on AMAZON ##

In [None]:
## PROJECT STEPS ##
# 1. Business Problem
# 2. Data Understanding
# 3. Calculating product ratings 
# 4. Sorting customer reviews 

In [None]:
# 1. Business Problem

# One of the most important problems in e-commerce is correctly calculating the ratings given to products after a sale. (Rating Products)
# Solving this problem increases customer satisfaction, makes products more visible for sellers, and gives buyers a smoother shopping experience.

# Another problem is correctly ordering the reviews given to products. (Sorting Reviews)
# Because misleading reviews that rise to the top directly affect a product's sales, they cause both financial loss and customer loss.
# When these two basic problems are solved, e-commerce sites and sellers increase their sales, and customers complete their purchasing journey without problems.

In [None]:
## Dataset Story

# This dataset, which contains Amazon product data, includes product categories and various metadata.
# This dataset contains user ratings and reviews of the most reviewed product in the electronics category.

## Variables:
# reviewerID: User ID
# asin: Product ID
# reviewerName: User Name
# helpful: Helpfulness votes as a list: [helpful votes, total votes]
# reviewText: Review
# overall: Product rating
# summary: Review summary
# unixReviewTime: Review time (Unix timestamp)
# reviewTime: Review time (raw date)
# day_diff: Number of days since review
# helpful_yes: Number of times review was found helpful
# total_vote: Number of votes given to review

In [None]:
# 2. Understanding Data

In [1]:
# import the libraries

import pandas as pd
import math
import scipy.stats as st
from sklearn.preprocessing import MinMaxScaler

In [2]:
pd.set_option("display.max_columns", None)  
pd.set_option("display.max_rows", 100)      
pd.set_option('display.width', 500)         
pd.set_option("display.precision", 2)       
pd.set_option('display.expand_frame_repr', False)

In [3]:
df = pd.read_csv("amazon_review.csv")

In [4]:
# Exploratory Data Analysis Function : Displays basic characteristics of the DataFrame.

def check_df(dataframe, head=5):
    print("__________________________________________________________________ FIRST 5 ROWS __________________________________________________________________ ")
    print(dataframe.head(head))
    print("__________________________________________________________________  LAST 5 ROWS __________________________________________________________________ ")
    print(dataframe.tail(head))
    print("__________________________________________________________________  DATA SHAPE ___________________________________________________________________ ")
    print(dataframe.shape)
    print("_________________________________________________________________  GENERAL INFO __________________________________________________________________ ")
    dataframe.info()  # info() prints its report itself; wrapping it in print() would add a stray "None"
    print("__________________________________________________________________  NULL VALUES __________________________________________________________________ ")
    print(dataframe.isnull().sum().sort_values(ascending=False))
    print("_______________________________________________________________  DUPLICATED VALUES _______________________________________________________________ ")
    print(dataframe.duplicated().sum())
    print("____________________________________________________________________ DESCRIBE ____________________________________________________________________ ")
    print(dataframe.describe([0, 0.05, 0.1, 0.25, 0.50, 0.95, 0.99, 1]).T)

In [5]:
check_df(df)

__________________________________________________________________ FIRST 5 ROWS __________________________________________________________________ 
       reviewerID        asin  reviewerName helpful                                         reviewText  overall                                 summary  unixReviewTime  reviewTime  day_diff  helpful_yes  total_vote
0  A3SBTW3WS4IQSN  B007WTAJTO           NaN  [0, 0]                                         No issues.      4.0                              Four Stars      1406073600  2014-07-23       138            0           0
1  A18K1ODH1I2MVB  B007WTAJTO          0mie  [0, 0]  Purchased this for my device, it worked as adv...      5.0                           MOAR SPACE!!!      1382659200  2013-10-25       409            0           0
2  A2FII3I2MBMUIA  B007WTAJTO           1K3  [0, 0]  it works as expected. I should have sprung for...      4.0               nothing to really say....      1356220800  2012-12-23       715            0     

In [None]:
# 3. Rating Products (Calculation of product ratings)

In [6]:
# 1) Average

# Calculating the average rating of the product from all current reviews.

df["overall"].mean()

4.587589013224822

In [9]:
# 2) Time-Based Weighted Average

# Calculating the time-based weighted average of the ratings.
# Time-based weighted average: recent ratings receive more weight than older ones, so that new, successful, and trending products can stand out.

# Converting the reviewTime variable to the datetime data type.
df["reviewTime"] = pd.to_datetime(df["reviewTime"])

# Setting the reference ("today") date to one day after the last review date.
df["reviewTime"].max()  # last date: ('2014-12-07 00:00:00')
current_date = pd.to_datetime('2014-12-08 00:00:00')

# Creating a days column: subtracting each review date from current_date and converting the difference to days.
df["days"] = (current_date - df["reviewTime"]).dt.days


In [20]:
# Average of ratings in the last 30 days in the Dataset:

df.loc[df["days"] <= 30, "overall"].mean()

4.742424242424242

In [21]:
# Average of the ratings in the Data Set for the last 1-3 months:

df.loc[(df["days"] > 30) & (df["days"] <= 90), "overall"].mean()

4.803149606299213

In [22]:
# Average of the ratings in the last 3-6 months in the Data Set:

df.loc[(df["days"] > 90) & (df["days"] <= 180), "overall"].mean()

4.649484536082475

In [23]:
# Average of ratings older than 6 months in the Dataset:

df.loc[(df["days"] > 180), "overall"].mean()

4.573373327180434

In [24]:
## Weighted Score Average Calculation Function According to Scoring Times

def time_based_weighted_average(dataframe, w1=28, w2=26, w3=24, w4=22):
    return dataframe.loc[dataframe["days"] <= 30, "overall"].mean() * w1 / 100 + \
           dataframe.loc[(dataframe["days"] > 30) & (dataframe["days"] <= 90), "overall"].mean() * w2 / 100 + \
           dataframe.loc[(dataframe["days"] > 90) & (dataframe["days"] <= 180), "overall"].mean() * w3 / 100 + \
           dataframe.loc[(dataframe["days"] > 180), "overall"].mean() * w4 / 100

In [25]:
# Weighted Point Average by Date:

time_based_weighted_average(df)

4.6987161061560725

In [None]:
# The time-based weighted average is 4.70, while the plain average of all ratings in the DataFrame was 4.59.
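
In [None]:
# The four time windows above can also be built in one pass with pd.cut. The cell below is a minimal sketch of that idea, not the method used in this notebook.
# Assumptions: the helper name time_based_weighted_average_binned and the toy data are made up for illustration; the window limits (30, 90, 180 days) and the weights (28, 26, 24, 22) are the ones used above.

```python
import numpy as np
import pandas as pd

def time_based_weighted_average_binned(dataframe, weights=(28, 26, 24, 22)):
    """Hypothetical helper: same four windows as above, built with pd.cut."""
    # Right-inclusive bins: (-inf, 30], (30, 90], (90, 180], (180, inf)
    bins = [-np.inf, 30, 90, 180, np.inf]
    labels = ["0-30", "31-90", "91-180", "180+"]
    period = pd.cut(dataframe["days"], bins=bins, labels=labels)
    # observed=False keeps all four windows, even if one has no ratings
    means = dataframe.groupby(period, observed=False)["overall"].mean()
    # Weights are percentages and are expected to sum to 100
    return float(np.dot(means.to_numpy(), np.array(weights) / 100))

# Toy data (made up for illustration, not taken from amazon_review.csv)
toy = pd.DataFrame({"days": [5, 20, 60, 120, 400],
                    "overall": [5, 4, 5, 3, 4]})
result = time_based_weighted_average_binned(toy)
print(round(result, 2))
```

# On the toy data the window means are 4.5, 5.0, 3.0 and 4.0, so the weighted average is 4.5*0.28 + 5.0*0.26 + 3.0*0.24 + 4.0*0.22 = 4.16.
# As in the function above, a window with no ratings produces NaN, which propagates to the result.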

In [None]:
## Comparing and interpreting the average of each time period in weighted scoring

# Recent ratings are higher than older ones. This rise in the average rating over time may indicate a slight increase in product popularity or user satisfaction.

# Last 30 days: the average rating is 4.74. Users may have been happy with the product recently, or a particular feature of the product may have been noticed or promoted.
# 31-90 days ago: the average rating is 4.80, slightly higher than in the last 30 days.
# 91-180 days ago: the average rating is 4.65, slightly lower but still high.
# More than 180 days ago: the average rating is 4.57, the lowest of the four periods.

In [None]:
# 4. Sorting Reviews 

In [None]:
# We don't care about the low/high rating of the review or product. We try to provide the most useful result to the user.

# Determining 20 Reviews to be displayed on the Product Detail Page for the Product.

In [31]:
## Step 1. Generating the "helpful_no" Variable

# total_vote variable: Total number of votes given to a review (up ratings + down ratings). "Up" means helpful.
# helpful_yes variable: Number of helpful votes. (up ratings)

# Generating the helpful_no variable: we need both up and down ratings, but the data only has helpful_yes (up ratings) and total_vote.

df["helpful_no"] = df["total_vote"] - df["helpful_yes"]

In [None]:
## Step 2. Calculating and Adding the Following Scores to the Data:

# 1) Up-Down Difference Score = (up ratings) − (down ratings)
# 2) Average rating Score = (up ratings) / (all ratings)
# 3) Wilson Lower Bound Score

In [33]:
# 1) Up-Down Difference Score = (up ratings) − (down ratings)

# Scoring reviews by the difference between helpful (up) and not helpful (down) votes.

def score_up_down_diff(helpful_yes, helpful_no):
    return helpful_yes - helpful_no

# Creating a new column in the data frame by applying the function we created above.

df["score_pos_neg_diff"] = df.apply(lambda x: score_up_down_diff(x["helpful_yes"], x["helpful_no"]), axis=1)

In [44]:
# 2) Average rating Score = (up ratings) / (all ratings)

def score_average_rating(helpful_yes, helpful_no):
    if helpful_yes + helpful_no == 0:
        return 0
    return helpful_yes / (helpful_yes + helpful_no)

# Creating a new column in the data frame by applying the function we created above.

df["score_average_rating"] = df.apply(lambda x: score_average_rating(x["helpful_yes"], x["helpful_no"]), axis=1)
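
In [None]:
# Before moving on, it may help to see why neither of the two scores above is enough on its own. The cell below is a small sketch on made-up vote counts (the toy DataFrame and its index labels are assumptions, not rows from amazon_review.csv).
# It computes both scores with column arithmetic instead of apply, which gives the same values as the functions above.

```python
import pandas as pd

# Made-up vote counts for four hypothetical reviews
toy = pd.DataFrame(
    {"helpful_yes": [600, 5, 1, 0], "helpful_no": [400, 0, 0, 0]},
    index=["A", "B", "C", "D"],
)

# 1) Up-Down Difference Score
toy["score_pos_neg_diff"] = toy["helpful_yes"] - toy["helpful_no"]

# 2) Average Rating Score, returning 0 when a review has no votes
total = toy["helpful_yes"] + toy["helpful_no"]
toy["score_average_rating"] = (toy["helpful_yes"] / total.where(total > 0)).fillna(0)

print(toy)
```

# The difference score puts A first (200) even though only 60% of its votes are helpful, because it ignores the ratio.
# The average rating score gives B and C the same top score (1.0) although C has a single vote, because it ignores the number of votes.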

In [None]:
# 3) Wilson Lower Bound Score

# The Wilson Lower Bound (WLB) score is a statistical method used in product or review rankings.
# It takes into account both the share of helpful votes a review received and the total number of votes behind that share.
# The number of votes indicates how reliable the observed share is: the fewer the votes, the wider the confidence interval and the lower the score.
# The method computes the Wilson score confidence interval for a Bernoulli parameter p (here, the probability that a vote is "helpful") and uses the lower bound of that interval as the score.
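
In [None]:
# For reference, the score computed by the wilson_lower_bound function in this notebook can be written as the formula below.
# Here n = helpful_yes + helpful_no, p-hat is the observed share of helpful votes, and z is the standard normal quantile at 1 - (1 - confidence) / 2 (about 1.96 for confidence = 0.95).

```latex
\[
\mathrm{WLB} =
\frac{\hat{p} + \dfrac{z^{2}}{2n} - z\sqrt{\dfrac{\hat{p}(1-\hat{p}) + \dfrac{z^{2}}{4n}}{n}}}
     {1 + \dfrac{z^{2}}{n}},
\qquad
\hat{p} = \frac{\text{helpful\_yes}}{n},
\quad
n = \text{helpful\_yes} + \text{helpful\_no}
\]
```

# When n = 0 the formula is undefined, and the function returns 0 in that case.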

In [46]:
def wilson_lower_bound(helpful_yes, helpful_no, confidence=0.95):
    
    """
    Calculating Wilson Lower Bound Score

    - The lower limit of the confidence interval calculated for the Bernoulli parameter p is accepted as the WLB score.
    - The calculated score is used to rank products or reviews.
    - Note: If the scores are between 1-5, 1-3 can be marked as negative and 4-5 as positive to make them suitable for Bernoulli.
      This brings some problems with it. For this reason, a Bayesian average rating is needed in that case.

    Parameters
    ----------
    helpful_yes: int
        number of helpful (up) votes
    helpful_no: int
        number of not helpful (down) votes
    confidence: float
        confidence level of the interval (default 0.95)

    Returns
    -------
    wilson score: float

    """
   
    n = helpful_yes + helpful_no
    if n == 0:
        return 0
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    phat = 1.0 * helpful_yes / n
    return (phat + z * z / (2 * n) - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)

# Creating a new column in the data frame by applying the function we created above.

df["wilson_lower_bound"] = df.apply(lambda x: wilson_lower_bound(x["helpful_yes"], x["helpful_no"]), axis=1)
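
In [None]:
# The row-wise apply above is fine for a dataset of this size. For larger data, the same formula could be evaluated on whole columns at once.
# The cell below is a sketch of that idea under stated assumptions: the function name wilson_lower_bound_vectorized and the toy DataFrame are hypothetical. The first two vote counts are copied from the top of the ranking shown in this notebook; the last two are made up.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

def wilson_lower_bound_vectorized(up, down, confidence=0.95):
    """Hypothetical column-wise version of wilson_lower_bound."""
    up = np.asarray(up, dtype=float)
    down = np.asarray(down, dtype=float)
    n = up + down
    # Replace n == 0 by 1 only to avoid division by zero; those rows are set to 0 below
    safe_n = np.where(n == 0, 1.0, n)
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    phat = up / safe_n
    score = (phat + z * z / (2 * safe_n)
             - z * np.sqrt((phat * (1 - phat) + z * z / (4 * safe_n)) / safe_n)) / (1 + z * z / safe_n)
    return np.where(n == 0, 0.0, score)

toy = pd.DataFrame({"helpful_yes": [1952, 45, 2, 0],
                    "helpful_no": [68, 4, 0, 0]})
toy["wilson_lower_bound"] = wilson_lower_bound_vectorized(toy["helpful_yes"], toy["helpful_no"])
print(toy.round(2))
```

# The first two rows come out at about 0.96 and 0.81, matching the wilson_lower_bound column in the ranking. The review with only 2 helpful votes and no other votes scores far lower despite its 100% helpful share, and the review with no votes scores 0.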

In [47]:
## Step 3. Identifying the 20 Reviews to Display

top_20_reviews = df.sort_values(by="wilson_lower_bound", ascending=False).head(20)

In [50]:
top_20_reviews

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,day_diff,helpful_yes,total_vote,days,helpful_no,score_pos_neg_diff,score_average_rating,wilson_lower_bound
2031,A12B7ZMXFI6IXY,B007WTAJTO,"Hyoun Kim ""Faluzure""","[1952, 2020]",[[ UPDATE - 6/19/2014 ]]So my lovely wife boug...,5.0,UPDATED - Great w/ Galaxy S4 & Galaxy Tab 4 10...,1367366400,2013-01-05,702,1952,2020,702,68,1884,0.97,0.96
3449,AOEAD7DPLZE53,B007WTAJTO,NLee the Engineer,"[1428, 1505]",I have tested dozens of SDHC and micro-SDHC ca...,5.0,Top of the class among all (budget-priced) mic...,1348617600,2012-09-26,803,1428,1505,803,77,1351,0.95,0.94
4212,AVBMZZAFEKO58,B007WTAJTO,SkincareCEO,"[1568, 1694]",NOTE: please read the last update (scroll to ...,1.0,1 Star reviews - Micro SDXC card unmounts itse...,1375660800,2013-05-08,579,1568,1694,579,126,1442,0.93,0.91
317,A1ZQAQFYSXL5MQ,B007WTAJTO,"Amazon Customer ""Kelly""","[422, 495]","If your card gets hot enough to be painful, it...",1.0,"Warning, read this!",1346544000,2012-02-09,1033,422,495,1033,73,349,0.85,0.82
4672,A2DKQQIZ793AV5,B007WTAJTO,Twister,"[45, 49]",Sandisk announcement of the first 128GB micro ...,5.0,Super high capacity!!! Excellent price (on Am...,1394150400,2014-07-03,158,45,49,158,4,41,0.92,0.81
1835,A1J6VSUM80UAF8,B007WTAJTO,goconfigure,"[60, 68]",Bought from BestBuy online the day it was anno...,5.0,I own it,1393545600,2014-02-28,283,60,68,283,8,52,0.88,0.78
3981,A1K91XXQ6ZEBQR,B007WTAJTO,"R. Sutton, Jr. ""RWSynergy""","[112, 139]",The last few days I have been diligently shopp...,5.0,"Resolving confusion between ""Mobile Ultra"" and...",1350864000,2012-10-22,777,112,139,777,27,85,0.81,0.73
3807,AFGRMORWY2QNX,B007WTAJTO,R. Heisler,"[22, 25]",I bought this card to replace a lost 16 gig in...,3.0,"Good buy for the money but wait, I had an issue!",1361923200,2013-02-27,649,22,25,649,3,19,0.88,0.7
4306,AOHXKM5URSKAB,B007WTAJTO,Stellar Eller,"[51, 65]","While I got this card as a ""deal of the day"" o...",5.0,Awesome Card!,1339200000,2012-09-06,823,51,65,823,14,37,0.78,0.67
4596,A1WTQUOQ4WG9AI,B007WTAJTO,"Tom Henriksen ""Doggy Diner""","[82, 109]",Hi:I ordered two card and they arrived the nex...,1.0,Designed incompatibility/Don't support SanDisk,1348272000,2012-09-22,807,82,109,807,27,55,0.75,0.66


In [None]:
# We have raised the reviews that are most helpful to customers to the top. The ranking is not affected by whether a review is positive or negative.

# Example:
# Customers in reviews 1 and 2 provided positive feedback about the product’s performance and gave it a rating of 5.
# Customers in reviews 3 and 4 indicated that the product had specific issues or drawbacks and gave it a rating of 1.