# Business Problem

* One of the most important problems in e-commerce is correctly calculating the ratings that products receive after sales.
* Solving this problem improves customer satisfaction for the e-commerce site, helps sellers' products stand out, and gives shoppers a smooth buying experience.
* Another problem is sorting product reviews correctly. Because misleading reviews that rise to the top directly affect sales, poor sorting causes both financial loss and customer loss.
* Solving these two problems lets the e-commerce site and its sellers increase sales, while customers complete their purchase journey without friction.

## Dataset Story
* This dataset contains **Amazon** product data, including product categories and various metadata.
* It covers the user ratings and reviews of the product with the most reviews in the electronics category.

## Variables
* **_reviewerID_**: User ID
* **_asin_**: Product ID
* **_reviewerName_**: Username
* **_helpful_**: Helpfulness votes, stored as `[helpful_yes, total_vote]`
* **_reviewText_**: Review text
* **_overall_**: Product rating
* **_summary_**: Review summary
* **_unixReviewTime_**: Review time (Unix timestamp)
* **_reviewTime_**: Raw review time
* **_day_diff_**: Number of days since the review
* **_helpful_yes_**: Number of votes that found the review helpful
* **_total_vote_**: Total number of votes the review received
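The relationship between `helpful`, `helpful_yes` and `total_vote` can be illustrated with a small sketch. It assumes that `helpful` arrives from the CSV as a string such as `"[1952, 2020]"`. The `sample` frame is a toy stand-in and not part of the dataset.

```python
import ast

import pandas as pd

# Toy rows mimicking the `helpful` column as strings read from a CSV file.
sample = pd.DataFrame({"helpful": ["[0, 0]", "[1952, 2020]", "[45, 49]"]})

# Parse each string into a Python list: [helpful votes, total votes].
parsed = sample["helpful"].apply(ast.literal_eval)

# Split the two list positions into separate numeric columns.
sample["helpful_yes"] = parsed.apply(lambda votes: votes[0])
sample["total_vote"] = parsed.apply(lambda votes: votes[1])

print(sample)
```

In the actual dataset both derived columns already exist, so this step is only needed when starting from the raw `helpful` field.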

In [1]:
# Importing necessary libraries and cosmetic settings
import pandas as pd
import math
import scipy.stats as st
from sklearn.preprocessing import MinMaxScaler

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 500)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [2]:
# Reading and Analyzing Data
df_ = pd.read_csv("/Users/hikmetburakozcan/pythonProject1/dsmlbc_9_abdulkadir/Homeworks/burak_ozcan/3_Olcumleme_Problemleri/amazon_review.csv")
df = df_.copy()

In [3]:
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,day_diff,helpful_yes,total_vote
0,A3SBTW3WS4IQSN,B007WTAJTO,,"[0, 0]",No issues.,4.0,Four Stars,1406073600,2014-07-23,138,0,0
1,A18K1ODH1I2MVB,B007WTAJTO,0mie,"[0, 0]","Purchased this for my device, it worked as adv...",5.0,MOAR SPACE!!!,1382659200,2013-10-25,409,0,0
2,A2FII3I2MBMUIA,B007WTAJTO,1K3,"[0, 0]",it works as expected. I should have sprung for...,4.0,nothing to really say....,1356220800,2012-12-23,715,0,0
3,A3H99DFEG68SR,B007WTAJTO,1m2,"[0, 0]",This think has worked out great.Had a diff. br...,5.0,Great buy at this price!!! *** UPDATE,1384992000,2013-11-21,382,0,0
4,A375ZM4U047O79,B007WTAJTO,2&amp;1/2Men,"[0, 0]","Bought it with Retail Packaging, arrived legit...",5.0,best deal around,1373673600,2013-07-13,513,0,0


In [4]:
df.shape

(4915, 12)

In [5]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
overall,4915.0,4.58759,0.99685,1.0,5.0,5.0,5.0,5.0
unixReviewTime,4915.0,1379465001.66836,15818574.32275,1339200000.0,1365897600.0,1381276800.0,1392163200.0,1406073600.0
day_diff,4915.0,437.36704,209.43987,1.0,281.0,431.0,601.0,1064.0
helpful_yes,4915.0,1.31109,41.61916,0.0,0.0,0.0,0.0,1952.0
total_vote,4915.0,1.52146,44.12309,0.0,0.0,0.0,0.0,2020.0


In [6]:
df.isnull().sum()

reviewerID        0
asin              0
reviewerName      1
helpful           0
reviewText        1
overall           0
summary           0
unixReviewTime    0
reviewTime        0
day_diff          0
helpful_yes       0
total_vote        0
dtype: int64

In [7]:
# Average of the product rating 
df["overall"].mean()

4.587589013224822

### Calculation of the weighted average score by date

In [8]:
df["reviewTime"] = pd.to_datetime(df["reviewTime"])

In [10]:
current_date = df["reviewTime"].max()
current_date

Timestamp('2014-12-07 00:00:00')
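Because every later step depends on `reviewTime`, it may be worth cross-checking it against `unixReviewTime`. The sketch below uses two `(unixReviewTime, reviewTime)` pairs copied from this notebook's outputs (index 0 and index 2031). The `sample` frame and the `dates_agree` column are illustrative names, not part of the original analysis.

```python
import pandas as pd

# Two (unixReviewTime, reviewTime) pairs as they appear in the notebook outputs.
sample = pd.DataFrame({
    "unixReviewTime": [1406073600, 1367366400],
    "reviewTime": ["2014-07-23", "2013-01-05"],
})

# Unix timestamps are seconds since 1970-01-01 UTC.
from_unix = pd.to_datetime(sample["unixReviewTime"], unit="s")
from_text = pd.to_datetime(sample["reviewTime"])

# Compare calendar dates only (normalize() drops the time of day).
sample["dates_agree"] = from_unix.dt.normalize() == from_text

print(sample.assign(from_unix=from_unix))
```

If the two columns disagree for some rows, one possible explanation is that day and month were swapped when the text dates were parsed. In that case, deriving the date from `unixReviewTime` could be the safer choice.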

In [11]:
df["days"] = (current_date - df["reviewTime"]).dt.days

In [12]:
df["days"].quantile([.25, .5, .75])

0.25000   280.00000
0.50000   430.00000
0.75000   600.00000
Name: days, dtype: float64

In [13]:
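# Time-weighted average: more recent reviews get a larger weight (28/26/24/22).
# Note: the first three filters are cumulative ("< 430" also includes "< 280"),
# and "> 600" leaves out reviews with days == 600, so the bands are not disjoint quartiles.
df.loc[df["days"] < 280, "overall"].mean() * 28 / 100 + \
df.loc[df["days"] < 430, "overall"].mean() * 26 / 100 + \
df.loc[df["days"] < 600, "overall"].mean() * 24 / 100 + \
df.loc[df["days"] > 600, "overall"].mean() * 22 / 100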

4.618390829533052
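The expression above filters with `< 280`, `< 430` and `< 600`, so the first three terms overlap. A variant with disjoint recency bands could look like the sketch below. The function name `time_based_weighted_average`, its parameters and the toy data are assumptions made for illustration. Only the weights 28/26/24/22 come from the notebook.

```python
import pandas as pd


def time_based_weighted_average(dataframe, w1=28, w2=26, w3=24, w4=22):
    """Weighted mean of `overall` over four disjoint bands split at the quartiles of `days`."""
    q1, q2, q3 = dataframe["days"].quantile([0.25, 0.50, 0.75])
    days = dataframe["days"]
    overall = dataframe["overall"]
    return (
        overall[days <= q1].mean() * w1 / 100                    # most recent reviews
        + overall[(days > q1) & (days <= q2)].mean() * w2 / 100
        + overall[(days > q2) & (days <= q3)].mean() * w3 / 100
        + overall[days > q3].mean() * w4 / 100                   # oldest reviews
    )


# Toy data: ratings fall as reviews get older.
toy = pd.DataFrame({
    "days": [10, 20, 30, 40, 50, 60, 70, 80],
    "overall": [5, 5, 4, 4, 3, 3, 2, 2],
})

result = time_based_weighted_average(toy)
print(round(result, 4))
```

On this toy frame the band means are 5, 4, 3 and 2, so the result is 3.6, compared with a plain mean of 3.5. The bands use the same right-closed intervals that `pd.qcut` produces, so each band mean should match the corresponding per-segment mean.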

In [14]:
# Comparing the average of each time period in the weighted calculation
df["segment"] = pd.qcut(df["days"], 4, labels=["A", "B", "C", "D"])
df.groupby("segment").agg({"overall": "mean"})

segment,overall
A,4.69579
B,4.63614
C,4.57166
D,4.44625


### Top 20 reviews to display on the product detail page

In [15]:
# Generating the helpful_no variable
df["helpful_no"] = df["total_vote"] - df["helpful_yes"]

In [16]:
# Calculation of the score_pos_neg_diff, score_average_rating and wilson_lower_bound scores
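def score_pos_neg_diff(up, down):
    """Return the difference between helpful and unhelpful votes."""
    return up - down

def score_average_rating(up, down):
    """Return the share of helpful votes, or 0 when there are no votes."""
    if up + down == 0:
        return 0
    return up / (up + down)

def wilson_lower_bound(up, down, confidence=0.95):
    """Return the lower bound of the Wilson score interval for the helpful-vote ratio.

    Returns 0 when there are no votes.
    """
    n = up + down
    if n == 0:
        return 0
    # z-score for a two-sided interval at the given confidence level
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    phat = 1.0 * up / n
    return (phat + z * z / (2 * n) - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)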

df["score_pos_neg_diff"] = df.apply(lambda x: score_pos_neg_diff(x["helpful_yes"], x["helpful_no"]), axis=1)
df["score_average_rating"] = df.apply(lambda x: score_average_rating(x["helpful_yes"], x["helpful_no"]), axis=1)
df["wilson_lower_bound"] = df.apply(lambda x: wilson_lower_bound(x["helpful_yes"], x["helpful_no"]), axis=1)
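To see why the three scores can rank reviews differently, the sketch below compares them on two hypothetical vote counts. It is self-contained, and as an assumption for portability it uses `statistics.NormalDist` from the standard library in place of `scipy.stats.norm.ppf`. The formula itself is the same as in `wilson_lower_bound` above.

```python
import math
from statistics import NormalDist


def wilson_lower_bound(up, down, confidence=0.95):
    """Lower bound of the Wilson score interval for the proportion up / (up + down)."""
    n = up + down
    if n == 0:
        return 0.0
    # Two-sided z-score; NormalDist().inv_cdf stands in for scipy's norm.ppf.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = up / n
    return (phat + z * z / (2 * n)
            - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)


# Hypothetical (helpful, unhelpful) vote counts, chosen only for illustration.
reviews = {"few_votes": (2, 0), "many_votes": (100, 10)}

for name, (up, down) in reviews.items():
    diff = up - down               # score_pos_neg_diff
    ratio = up / (up + down)       # score_average_rating
    wlb = wilson_lower_bound(up, down)
    print(f"{name}: diff={diff}, ratio={ratio:.3f}, wlb={wlb:.3f}")
```

The review with two votes has the higher raw ratio, but the review with 110 votes has the higher Wilson lower bound, because the bound shrinks when evidence is scarce. The plain difference ignores the proportion entirely, so it favors whichever review collected more votes.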

In [17]:
# Determining and sorting the top 20 reviews according to wilson_lower_bound
df.sort_values("wilson_lower_bound", ascending=False).head(20)

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,day_diff,helpful_yes,total_vote,days,segment,helpful_no,score_pos_neg_diff,score_average_rating,wilson_lower_bound
2031,A12B7ZMXFI6IXY,B007WTAJTO,"Hyoun Kim ""Faluzure""","[1952, 2020]",[[ UPDATE - 6/19/2014 ]]So my lovely wife boug...,5.0,UPDATED - Great w/ Galaxy S4 & Galaxy Tab 4 10...,1367366400,2013-01-05,702,1952,2020,701,D,68,1884,0.96634,0.95754
3449,AOEAD7DPLZE53,B007WTAJTO,NLee the Engineer,"[1428, 1505]",I have tested dozens of SDHC and micro-SDHC ca...,5.0,Top of the class among all (budget-priced) mic...,1348617600,2012-09-26,803,1428,1505,802,D,77,1351,0.94884,0.93652
4212,AVBMZZAFEKO58,B007WTAJTO,SkincareCEO,"[1568, 1694]",NOTE: please read the last update (scroll to ...,1.0,1 Star reviews - Micro SDXC card unmounts itse...,1375660800,2013-05-08,579,1568,1694,578,C,126,1442,0.92562,0.91214
317,A1ZQAQFYSXL5MQ,B007WTAJTO,"Amazon Customer ""Kelly""","[422, 495]","If your card gets hot enough to be painful, it...",1.0,"Warning, read this!",1346544000,2012-02-09,1033,422,495,1032,D,73,349,0.85253,0.81858
4672,A2DKQQIZ793AV5,B007WTAJTO,Twister,"[45, 49]",Sandisk announcement of the first 128GB micro ...,5.0,Super high capacity!!! Excellent price (on Am...,1394150400,2014-07-03,158,45,49,157,A,4,41,0.91837,0.80811
1835,A1J6VSUM80UAF8,B007WTAJTO,goconfigure,"[60, 68]",Bought from BestBuy online the day it was anno...,5.0,I own it,1393545600,2014-02-28,283,60,68,282,B,8,52,0.88235,0.78465
3981,A1K91XXQ6ZEBQR,B007WTAJTO,"R. Sutton, Jr. ""RWSynergy""","[112, 139]",The last few days I have been diligently shopp...,5.0,"Resolving confusion between ""Mobile Ultra"" and...",1350864000,2012-10-22,777,112,139,776,D,27,85,0.80576,0.73214
3807,AFGRMORWY2QNX,B007WTAJTO,R. Heisler,"[22, 25]",I bought this card to replace a lost 16 gig in...,3.0,"Good buy for the money but wait, I had an issue!",1361923200,2013-02-27,649,22,25,648,D,3,19,0.88,0.70044
4306,AOHXKM5URSKAB,B007WTAJTO,Stellar Eller,"[51, 65]","While I got this card as a ""deal of the day"" o...",5.0,Awesome Card!,1339200000,2012-09-06,823,51,65,822,D,14,37,0.78462,0.67033
4596,A1WTQUOQ4WG9AI,B007WTAJTO,"Tom Henriksen ""Doggy Diner""","[82, 109]",Hi:I ordered two card and they arrived the nex...,1.0,Designed incompatibility/Don't support SanDisk,1348272000,2012-09-22,807,82,109,806,D,27,55,0.75229,0.66359
