# Rating Product & Sorting Reviews in Amazon

### Business Problem
One of the most important problems in e-commerce is correctly calculating the ratings that products receive after a sale. Solving it means higher customer satisfaction for the e-commerce site, better visibility for sellers' products, and a smooth shopping experience for buyers. A second problem is ordering product reviews correctly. Misleading reviews that rise to the top directly affect a product's sales, so they cause both financial loss and customer loss. Solving these two core problems lets the e-commerce site and its sellers increase sales, while customers complete their purchase journey without trouble.



### Dataset Story
This dataset of Amazon product data includes product categories along with various metadata. It contains the user ratings and reviews for the most-reviewed product in the Electronics category.
- reviewerID: User ID
- asin: Product ID
- reviewerName: User name
- helpful: Helpfulness rating of the review (stored as a `[helpful_yes, total_vote]` pair)
- reviewText: Review text
- overall: Product rating
- summary: Review summary
- unixReviewTime: Review time (raw Unix timestamp)
- reviewTime: Review time
- day_diff: Number of days since the review
- helpful_yes: Number of times the review was found helpful
- total_vote: Number of votes the review received
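
The `helpful` field is stored as text such as `[0, 0]`. A minimal sketch of splitting it into its two counts, assuming the first number is the helpful count and the second is the total number of votes (this ordering is an assumption, though it agrees with the `helpful_yes` and `total_vote` values in this dataset). `parse_helpful` is a hypothetical helper, not part of the original notebook:

```python
import ast

def parse_helpful(raw):
    """Split a string like "[1952, 2020]" into (helpful_yes, total_vote)."""
    helpful_yes, total_vote = ast.literal_eval(raw)
    return int(helpful_yes), int(total_vote)

yes, total = parse_helpful("[1952, 2020]")
helpful_no = total - yes  # votes that did not find the review helpful
print(yes, total, helpful_no)
```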

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import datetime as dt
import scipy.stats as st
import math

In [2]:
df_ = pd.read_csv("D:\\Rating Product&SortingReviewsinAmazon\\amazon_review.csv")
df = df_.copy()

In [3]:
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,day_diff,helpful_yes,total_vote
0,A3SBTW3WS4IQSN,B007WTAJTO,,"[0, 0]",No issues.,4.0,Four Stars,1406073600,2014-07-23,138,0,0
1,A18K1ODH1I2MVB,B007WTAJTO,0mie,"[0, 0]","Purchased this for my device, it worked as adv...",5.0,MOAR SPACE!!!,1382659200,2013-10-25,409,0,0
2,A2FII3I2MBMUIA,B007WTAJTO,1K3,"[0, 0]",it works as expected. I should have sprung for...,4.0,nothing to really say....,1356220800,2012-12-23,715,0,0
3,A3H99DFEG68SR,B007WTAJTO,1m2,"[0, 0]",This think has worked out great.Had a diff. br...,5.0,Great buy at this price!!! *** UPDATE,1384992000,2013-11-21,382,0,0
4,A375ZM4U047O79,B007WTAJTO,2&amp;1/2Men,"[0, 0]","Bought it with Retail Packaging, arrived legit...",5.0,best deal around,1373673600,2013-07-13,513,0,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4915 entries, 0 to 4914
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   reviewerID      4915 non-null   object 
 1   asin            4915 non-null   object 
 2   reviewerName    4914 non-null   object 
 3   helpful         4915 non-null   object 
 4   reviewText      4914 non-null   object 
 5   overall         4915 non-null   float64
 6   summary         4915 non-null   object 
 7   unixReviewTime  4915 non-null   int64  
 8   reviewTime      4915 non-null   object 
 9   day_diff        4915 non-null   int64  
 10  helpful_yes     4915 non-null   int64  
 11  total_vote      4915 non-null   int64  
dtypes: float64(1), int64(4), object(7)
memory usage: 460.9+ KB


In [5]:
df["overall"].mean()
# average rating of the product

4.587589013224822

In [6]:
df.loc[df["day_diff"] <= 30, "overall"].mean()
# average rating over the last 30 days

4.742424242424242

In [7]:
df.loc[df["day_diff"] <= 30, "overall"].count()
# number of ratings in the last 30 days

66

Calculating the Time-Based Weighted Average Rating

In [9]:
# Time-based weighted average: ratings are weighted by how recent they are
def time_based_weighted_average(dataframe, w1=28, w2=26, w3=24, w4=22):
    # w1..w4 are percentage weights (they sum to 100); newer buckets weigh more
    return dataframe.loc[dataframe["day_diff"] <= 30, "overall"].mean() * w1 / 100 + \
           dataframe.loc[(dataframe["day_diff"] > 30) & (dataframe["day_diff"] <= 90), "overall"].mean() * w2 / 100 + \
           dataframe.loc[(dataframe["day_diff"] > 90) & (dataframe["day_diff"] <= 180), "overall"].mean() * w3 / 100 + \
           dataframe.loc[dataframe["day_diff"] > 180, "overall"].mean() * w4 / 100

In [10]:
time_based_weighted_average(df)

4.6987161061560725

In [11]:
df.loc[df["day_diff"] <= 30, "overall"].mean()
# last 30 days

4.742424242424242

In [12]:
df.loc[(df["day_diff"] > 30) & (df["day_diff"] <= 90), "overall"].mean()
# more than 30 and up to 90 days

4.803149606299213

In [13]:
df.loc[(df["day_diff"] > 90) & (df["day_diff"] <= 180), "overall"].mean()
# more than 90 and up to 180 days

4.649484536082475

In [14]:
df.loc[df["day_diff"] > 180, "overall"].mean()
# more than 180 days

4.573373327180434
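
As a sanity check, the value returned by `time_based_weighted_average(df)` can be reproduced by hand from the four bucket means and the default weights (28, 26, 24, 22). A minimal sketch, with the means copied from the cell outputs above:

```python
# Bucket means copied from the outputs above, newest bucket first
bucket_means = [
    4.742424242424242,  # day_diff <= 30
    4.803149606299213,  # 30 < day_diff <= 90
    4.649484536082475,  # 90 < day_diff <= 180
    4.573373327180434,  # day_diff > 180
]
weights = [28, 26, 24, 22]  # percentages, summing to 100

weighted = sum(mean * w / 100 for mean, w in zip(bucket_means, weights))
print(round(weighted, 4))
```

The result agrees with the 4.6987 reported earlier and sits above the plain mean of about 4.5876, because the more recent buckets have higher averages and carry more weight.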

Subtracting the number of helpful votes from the total number of votes gives the number of votes that did not find the review helpful.

In [16]:
df["helpful_no"] = df["total_vote"] - df["helpful_yes"]
# create the helpful_no variable

### Up-Down Diff

In [17]:
def score_up_down_diff(df):
    return df["helpful_yes"] - df["helpful_no"]

In [18]:
df["score_pos_neg_diff"] = score_up_down_diff(df)

### Average Rating

In [19]:
def score_average_rating(up,total):
    if total == 0:
        return 0 
    return up / total

In [20]:
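# NOTE: the second argument passed here is helpful_no, not total_vote, so this column
# holds the yes/no ratio (it can exceed 1) and is 0 whenever helpful_no is 0.
# Pass x["total_vote"] instead to get the share of helpful votes, between 0 and 1.
df["score_average_rating"] = df.apply(lambda x: score_average_rating(x["helpful_yes"], x["helpful_no"]), axis=1)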
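
To see how the two simple scores can disagree, here is a minimal sketch on made-up vote counts (the numbers are illustrative and not taken from the dataset). It computes the up-down difference and the share of helpful votes, `up / (up + down)`, for two hypothetical reviews. `up_down_diff` and `helpful_share` are names introduced here for illustration:

```python
def up_down_diff(up, down):
    """Helpful votes minus unhelpful votes."""
    return up - down

def helpful_share(up, down):
    """Share of helpful votes among all votes; 0.0 when there are no votes."""
    total = up + down
    if total == 0:
        return 0.0
    return up / total

# Hypothetical reviews: (helpful votes, unhelpful votes)
reviews = {"A": (600, 400), "B": (5500, 4500)}
for name, (up, down) in reviews.items():
    print(name, up_down_diff(up, down), round(helpful_share(up, down), 2))
```

The difference puts B first simply because it has more votes overall, while the share puts A first. Neither score reflects how many votes stand behind the estimate.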

In [21]:
df

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,day_diff,helpful_yes,total_vote,helpful_no,score_pos_neg_diff,score_average_rating
0,A3SBTW3WS4IQSN,B007WTAJTO,,"[0, 0]",No issues.,4.0,Four Stars,1406073600,2014-07-23,138,0,0,0,0,0.0
1,A18K1ODH1I2MVB,B007WTAJTO,0mie,"[0, 0]","Purchased this for my device, it worked as adv...",5.0,MOAR SPACE!!!,1382659200,2013-10-25,409,0,0,0,0,0.0
2,A2FII3I2MBMUIA,B007WTAJTO,1K3,"[0, 0]",it works as expected. I should have sprung for...,4.0,nothing to really say....,1356220800,2012-12-23,715,0,0,0,0,0.0
3,A3H99DFEG68SR,B007WTAJTO,1m2,"[0, 0]",This think has worked out great.Had a diff. br...,5.0,Great buy at this price!!! *** UPDATE,1384992000,2013-11-21,382,0,0,0,0,0.0
4,A375ZM4U047O79,B007WTAJTO,2&amp;1/2Men,"[0, 0]","Bought it with Retail Packaging, arrived legit...",5.0,best deal around,1373673600,2013-07-13,513,0,0,0,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4910,A2LBMKXRM5H2W9,B007WTAJTO,"ZM ""J""","[0, 0]",I bought this Sandisk 16GB Class 10 to use wit...,1.0,Do not waste your money.,1374537600,2013-07-23,503,0,0,0,0,0.0
4911,ALGDLRUI1ZPCS,B007WTAJTO,Zo,"[0, 0]",Used this for extending the capabilities of my...,5.0,Great item!,1377129600,2013-08-22,473,0,0,0,0,0.0
4912,A2MR1NI0ENW2AD,B007WTAJTO,Z S Liske,"[0, 0]",Great card that is very fast and reliable. It ...,5.0,Fast and reliable memory card,1396224000,2014-03-31,252,0,0,0,0,0.0
4913,A37E6P3DSO9QJD,B007WTAJTO,Z Taylor,"[0, 0]",Good amount of space for the stuff I want to d...,5.0,Great little card,1379289600,2013-09-16,448,0,0,0,0,0.0


### Wilson Lower Bound Score

- The lower bound of the confidence interval computed for the Bernoulli parameter p is taken as the WLB score.
- The resulting score is used for ranking; here it ranks the reviews of the product.
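
For reference, the lower bound of the Wilson score interval can be written as follows, where $\hat{p}$ is the observed share of helpful votes, $n$ is the number of votes, and $z$ is the standard normal quantile for the chosen confidence level (about 1.96 for 95%). It is written out here as a reading aid and matches the expression returned by `wilson_lower_bound`:

$$
\text{WLB} = \frac{\hat{p} + \dfrac{z^2}{2n} - z\sqrt{\dfrac{\hat{p}(1-\hat{p}) + \dfrac{z^2}{4n}}{n}}}{1 + \dfrac{z^2}{n}}
$$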

In [22]:
def wilson_lower_bound(up,down,confidence=0.95):
    n = up+down
    if n==0:
        return 0
    z = st.norm.ppf(1-(1-confidence)/2)
    phat = 1.0 * up / n
    return (phat + z * z / (2 * n) - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)


In [23]:
df["wilson_lower_bound"] = df.apply(lambda x: wilson_lower_bound(x["helpful_yes"], x["helpful_no"]), axis=1)
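
A quick way to build intuition is to evaluate the bound for a few vote counts. The sketch below re-implements the same formula with only the standard library: `statistics.NormalDist` replaces `scipy.stats.norm.ppf`, a substitution made here so the example is self-contained. It uses two (helpful, not helpful) pairs that occur in this dataset, plus one made-up review with 2 helpful votes out of 2:

```python
import math
from statistics import NormalDist

def wilson_lower_bound(up, down, confidence=0.95):
    """Lower bound of the Wilson score interval for the share of helpful votes."""
    n = up + down
    if n == 0:
        return 0.0
    # Two-sided standard normal quantile for the given confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = up / n
    centre = phat + z * z / (2 * n)
    margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)

# (helpful, not helpful): the last pair is hypothetical
for up, down in [(1952, 68), (45, 4), (2, 0)]:
    print(up, down, round(wilson_lower_bound(up, down), 4))
```

The review with 2 out of 2 helpful votes has a perfect share, yet its lower bound is only about 0.34, well below the roughly 0.96 and 0.81 of the heavily voted reviews. That penalty for small samples is why the bound is useful for sorting.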

In [24]:
df.sort_values("wilson_lower_bound", ascending=False).head(20)
# top 20 reviews by WLB score

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,day_diff,helpful_yes,total_vote,helpful_no,score_pos_neg_diff,score_average_rating,wilson_lower_bound
2031,A12B7ZMXFI6IXY,B007WTAJTO,"Hyoun Kim ""Faluzure""","[1952, 2020]",[[ UPDATE - 6/19/2014 ]]So my lovely wife boug...,5.0,UPDATED - Great w/ Galaxy S4 & Galaxy Tab 4 10...,1367366400,2013-01-05,702,1952,2020,68,1884,28.705882,0.957544
3449,AOEAD7DPLZE53,B007WTAJTO,NLee the Engineer,"[1428, 1505]",I have tested dozens of SDHC and micro-SDHC ca...,5.0,Top of the class among all (budget-priced) mic...,1348617600,2012-09-26,803,1428,1505,77,1351,18.545455,0.936519
4212,AVBMZZAFEKO58,B007WTAJTO,SkincareCEO,"[1568, 1694]",NOTE: please read the last update (scroll to ...,1.0,1 Star reviews - Micro SDXC card unmounts itse...,1375660800,2013-05-08,579,1568,1694,126,1442,12.444444,0.912139
317,A1ZQAQFYSXL5MQ,B007WTAJTO,"Amazon Customer ""Kelly""","[422, 495]","If your card gets hot enough to be painful, it...",1.0,"Warning, read this!",1346544000,2012-02-09,1033,422,495,73,349,5.780822,0.818577
4672,A2DKQQIZ793AV5,B007WTAJTO,Twister,"[45, 49]",Sandisk announcement of the first 128GB micro ...,5.0,Super high capacity!!! Excellent price (on Am...,1394150400,2014-07-03,158,45,49,4,41,11.25,0.808109
1835,A1J6VSUM80UAF8,B007WTAJTO,goconfigure,"[60, 68]",Bought from BestBuy online the day it was anno...,5.0,I own it,1393545600,2014-02-28,283,60,68,8,52,7.5,0.784651
3981,A1K91XXQ6ZEBQR,B007WTAJTO,"R. Sutton, Jr. ""RWSynergy""","[112, 139]",The last few days I have been diligently shopp...,5.0,"Resolving confusion between ""Mobile Ultra"" and...",1350864000,2012-10-22,777,112,139,27,85,4.148148,0.732136
3807,AFGRMORWY2QNX,B007WTAJTO,R. Heisler,"[22, 25]",I bought this card to replace a lost 16 gig in...,3.0,"Good buy for the money but wait, I had an issue!",1361923200,2013-02-27,649,22,25,3,19,7.333333,0.700442
4306,AOHXKM5URSKAB,B007WTAJTO,Stellar Eller,"[51, 65]","While I got this card as a ""deal of the day"" o...",5.0,Awesome Card!,1339200000,2012-09-06,823,51,65,14,37,3.642857,0.670334
4596,A1WTQUOQ4WG9AI,B007WTAJTO,"Tom Henriksen ""Doggy Diner""","[82, 109]",Hi:I ordered two card and they arrived the nex...,1.0,Designed incompatibility/Don't support SanDisk,1348272000,2012-09-22,807,82,109,27,55,3.037037,0.663595
