![](https://www.algolia.com/doc/assets/images/guides/solutions/solutions-bayesian-average-ranking-f2f6cbb5.jpg)
Image source: [Algolia documentation](https://www.algolia.com/doc/guides/managing-results/must-do/custom-ranking/how-to/bayesian-average/)


## Sorting Products

The Bayesian average score is a probabilistic weighted average computed from a product's rating distribution, so a product with only a handful of ratings is not ranked on its raw average alone.
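As a rough sketch, the `bayesian_average_rating` function used later in this notebook can be written as the formula below. The notation is an assumption made here for illustration: $n_k$ is the number of $k$-star ratings, $K$ is the number of star levels, $N = \sum_k n_k$, and $z$ is the two-sided normal quantile for the chosen confidence level. The added 1 in each count acts as a pseudo-count, so every star level is treated as if it had been observed once.

$$
\hat{\mu} = \sum_{k=1}^{K} k \, \frac{n_k + 1}{N + K},
\qquad
\hat{m}_2 = \sum_{k=1}^{K} k^2 \, \frac{n_k + 1}{N + K},
\qquad
\text{score} = \hat{\mu} - z \sqrt{\frac{\hat{m}_2 - \hat{\mu}^2}{N + K + 1}}
$$

The score is therefore a lower bound on the smoothed mean rating: the fewer ratings a product has, the larger the penalty subtracted from its mean.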

For example, let's say two people rated a product with an average of 5, and another product was rated by 40 people with an average of 4.6.

Which average is more meaningful, and which product should rank higher?
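The question above can be checked with a small, self-contained sketch. It mirrors the scoring logic of the `bayesian_average_rating` function defined later in this notebook, but uses `statistics.NormalDist` from the standard library instead of `scipy.stats` so it runs without extra dependencies. The function name `bayesian_lower_bound` and the two rating distributions are assumptions chosen only to match the averages in the example (2 ratings averaging 5.0, and 40 ratings averaging 4.6).

```python
import math
from statistics import NormalDist


def bayesian_lower_bound(counts, confidence=0.95):
    """Lower bound of a smoothed star-rating mean.

    counts[k] is the number of (k + 1)-star ratings for one product.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    levels = len(counts)
    # Two-sided z-value for the chosen confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Each count gets a pseudo-count of 1 before averaging.
    mean = sum((k + 1) * (c + 1) / (total + levels) for k, c in enumerate(counts))
    mean_sq = sum((k + 1) ** 2 * (c + 1) / (total + levels) for k, c in enumerate(counts))
    return mean - z * math.sqrt((mean_sq - mean ** 2) / (total + levels + 1))


# Hypothetical distributions that reproduce the averages in the example.
few_ratings = [0, 0, 0, 0, 2]      # 2 ratings, average 5.0
many_ratings = [0, 0, 0, 16, 24]   # 40 ratings, average 4.6

score_few = bayesian_lower_bound(few_ratings)
score_many = bayesian_lower_bound(many_ratings)
print(f"2 ratings:  {score_few:.2f}")
print(f"40 ratings: {score_many:.2f}")
```

Under these assumptions the product with 40 ratings scores clearly higher, even though its raw average is lower, because two ratings carry too little evidence to support a 5.0 average.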

In this project, we will examine the social proof effect through a ranking of scores calculated using the Bayes score.

*Social proof is a psychological and social phenomenon wherein people copy the actions of others in choosing how to behave in a given situation*

### Dataset Details

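- itemId: product ID
- category: product category
- name: name of the user who wrote the review
- rating: rating the user gave the product
- upVotes: number of up votes the review received
- downVotes: number of down votes the review received
- reviewTitle: title of the review
- reviewContent: text of the review
- originalRating
- likeCount
- helpful
- relevanceScore
- boughtDate
- clientType
- retrievedDate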

In [15]:
import pandas as pd
import math
import scipy.stats as st

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

df = pd.read_csv("/kaggle/input/lazada-indonesian-reviews/20191002-reviews.csv")

In [3]:
# We are trying to understand the data.

def check_df(dataframe, head=5):
    print("################### Shape ####################")
    print(dataframe.shape)
    print("#################### Info #####################")
    print(dataframe.info())
    print("################### Nunique ###################")
    print(dataframe.nunique())
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("################## Quantiles #################")
    print(dataframe.describe([0, 0.05, 0.50, 0.95, 0.99, 1]).T)
    print("#################### Head ####################")
    print(dataframe.head(head))

check_df(df)

################### Shape ####################
(203787, 15)
#################### Info #####################
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203787 entries, 0 to 203786
Data columns (total 15 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   itemId          203787 non-null  int64  
 1   category        203787 non-null  object 
 2   name            203787 non-null  object 
 3   rating          203787 non-null  int64  
 4   originalRating  8 non-null       float64
 5   reviewTitle     23404 non-null   object 
 6   reviewContent   107029 non-null  object 
 7   likeCount       203787 non-null  int64  
 8   upVotes         203787 non-null  int64  
 9   downVotes       203787 non-null  int64  
 10  helpful         203787 non-null  bool   
 11  relevanceScore  203787 non-null  float64
 12  boughtDate      196680 non-null  object 
 13  clientType      203787 non-null  object 
 14  retrievedDate   203787 non-null  object 

In [16]:
# Data Preparation

# We drop the columns we do not need; several of them also contain missing values.
df = df.drop(columns=["originalRating", "boughtDate", "likeCount", "helpful", "relevanceScore", "clientType", "retrievedDate"])

# To calculate the Bayesian score, we need the rating distribution of each product,
# that is, how many 1-, 2-, 3-, 4- and 5-star ratings it received.
df["rating2"] = df["rating"]
df_bay = df.pivot_table(values="rating2", index="itemId", columns="rating", aggfunc="count")


df_bay.head()


| itemId / rating | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 6068 | NaN | NaN | NaN | NaN | 25.0 |
| 6070 | NaN | NaN | NaN | 5.0 | 40.0 |
| 19946 | NaN | NaN | NaN | 2.0 | 15.0 |
| 19949 | NaN | NaN | 1.0 | 4.0 | 28.0 |
| 25844 | NaN | NaN | NaN | 4.0 | 4.0 |
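The per-product rating counts can also be built with `pd.crosstab`, which fills absent combinations with 0 instead of NaN. The sketch below uses a tiny, made-up `toy` frame (not the Lazada data) and additionally reindexes the columns so that all five star levels exist even when one of them never occurs; this is a hedged alternative, not the approach taken in the notebook.

```python
import pandas as pd

# Toy reviews (hypothetical data, used only to illustrate the reshaping step).
toy = pd.DataFrame({
    "itemId": [1, 1, 1, 2, 2],
    "rating": [5, 5, 4, 3, 5],
})

# Count how many times each rating occurs per product.
counts = pd.crosstab(toy["itemId"], toy["rating"])

# Make sure every star level 1..5 is present as a column, even if unobserved.
counts = counts.reindex(columns=list(range(1, 6)), fill_value=0)
counts.columns = [f"{star}_point" for star in counts.columns]

print(counts)
```

Reindexing first keeps a later column rename safe: assigning five names to a table that has fewer than five rating columns would raise an error.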


In [17]:
# We are changing the names of the variables we created.
df_bay.columns=["1_point","2_point","3_point","4_point","5_point"]
df_bay = df_bay.reset_index()

# We are filling in missing values with 0.
df_bay = df_bay.fillna(0)


df_bay.head(10)

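|   | itemId | 1_point | 2_point | 3_point | 4_point | 5_point |
|---|---|---|---|---|---|---|
| 0 | 6068 | 0.0 | 0.0 | 0.0 | 0.0 | 25.0 |
| 1 | 6070 | 0.0 | 0.0 | 0.0 | 5.0 | 40.0 |
| 2 | 19946 | 0.0 | 0.0 | 0.0 | 2.0 | 15.0 |
| 3 | 19949 | 0.0 | 0.0 | 1.0 | 4.0 | 28.0 |
| 4 | 25844 | 0.0 | 0.0 | 0.0 | 4.0 | 4.0 |
| 5 | 25850 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 |
| 6 | 40339 | 0.0 | 0.0 | 0.0 | 5.0 | 25.0 |
| 7 | 44569 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
| 8 | 45076 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 9 | 49780 | 2.0 | 0.0 | 34.0 | 88.0 | 208.0 |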


In [18]:
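# Bayes score calculation function
def bayesian_average_rating(n, confidence=0.95):
    # n holds the number of 1-, 2-, ..., K-star ratings of one product.
    if sum(n) == 0:
        return 0
    K = len(n)
    # Two-sided z-value for the chosen confidence level.
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    N = sum(n)
    first_part = 0.0   # smoothed mean rating
    second_part = 0.0  # smoothed mean of the squared ratings
    # n_k is the count for star level k + 1; the added 1 is a pseudo-count.
    for k, n_k in enumerate(n):
        first_part += (k + 1) * (n_k + 1) / (N + K)
        second_part += (k + 1) * (k + 1) * (n_k + 1) / (N + K)
    # Lower bound: smoothed mean minus z standard errors.
    score = first_part - z * math.sqrt((second_part - first_part * first_part) / (N + K + 1))
    return score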


# We are creating the Bayes score under the variable name bar_score.
df_bay["bar_score"] = df_bay.apply(lambda x: bayesian_average_rating(x[["1_point",
                                                                        "2_point",
                                                                        "3_point",
                                                                        "4_point",
                                                                        "5_point"]]), axis=1)

In [19]:
# We are listing the top 20 products according to their Bayes score.

df_bay.sort_values(ascending=False, by="bar_score").head(20)

|   | itemId | 1_point | 2_point | 3_point | 4_point | 5_point | bar_score |
|---|---|---|---|---|---|---|---|
| 2095 | 170127119 | 0.0 | 0.0 | 0.0 | 5.0 | 305.0 | 4.91601 |
| 3380 | 439142140 | 0.0 | 0.0 | 4.0 | 28.0 | 704.0 | 4.91563 |
| 1991 | 160033631 | 0.0 | 0.0 | 0.0 | 10.0 | 305.0 | 4.89943 |
| 3602 | 487388679 | 0.0 | 0.0 | 0.0 | 4.0 | 224.0 | 4.89158 |
| 4032 | 563778338 | 0.0 | 0.0 | 0.0 | 8.0 | 260.0 | 4.89059 |
| 2498 | 354135263 | 0.0 | 4.0 | 6.0 | 44.0 | 848.0 | 4.88964 |
| 3571 | 480270305 | 8.0 | 4.0 | 8.0 | 12.0 | 884.0 | 4.87866 |
| 3572 | 480274479 | 0.0 | 0.0 | 4.0 | 16.0 | 376.0 | 4.87768 |
| 3360 | 436732295 | 0.0 | 0.0 | 0.0 | 8.0 | 226.0 | 4.87515 |
| 1921 | 156289469 | 0.0 | 0.0 | 10.0 | 25.0 | 530.0 | 4.87097 |


## Sorting Reviews

The Wilson Lower Bound (WLB) score method calculates a confidence interval for the Bernoulli parameter p and takes the lower limit of this interval as the WLB score.

Bernoulli is a probability distribution. It is mostly used to calculate the probability of binary events, such as a review receiving an up vote or a down vote.
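For reference, the lower limit computed by the `wilson_lower_bound` function in this notebook can be written as follows. The symbols are assumptions introduced here for illustration: $n$ is the total number of votes (up plus down), $\hat{p}$ is the observed share of up votes, and $z$ is the two-sided normal quantile for the chosen confidence level.

$$
\text{WLB} = \frac{\hat{p} + \dfrac{z^2}{2n} - z \sqrt{\dfrac{\hat{p}\,(1 - \hat{p}) + \dfrac{z^2}{4n}}{n}}}{1 + \dfrac{z^2}{n}},
\qquad
\hat{p} = \frac{\text{up}}{n}
$$

When $n$ is small the correction terms dominate and the bound stays well below $\hat{p}$; as $n$ grows, the bound approaches $\hat{p}$.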



In [20]:
# We are trying to understand the data.

check_df(df)

################### Shape ####################
(203787, 9)
#################### Info #####################
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203787 entries, 0 to 203786
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   itemId         203787 non-null  int64 
 1   category       203787 non-null  object
 2   name           203787 non-null  object
 3   rating         203787 non-null  int64 
 4   reviewTitle    23404 non-null   object
 5   reviewContent  107029 non-null  object
 6   upVotes        203787 non-null  int64 
 7   downVotes      203787 non-null  int64 
 8   rating2        203787 non-null  int64 
dtypes: int64(5), object(4)
memory usage: 14.0+ MB
None
################### Nunique ###################
itemId            4422
category             5
name             40099
rating               5
reviewTitle       5725
reviewContent    38071
upVotes            125
downVotes           57
rating2    

In [22]:
# We drop rating2, the helper column created in the previous section.
df = df.drop(columns="rating2")

# upVotes cannot be negative, so we keep only the rows where it is 0 or greater.
df = df[df["upVotes"] > -1]

df.head()

|   | itemId | category | name | rating | reviewTitle | reviewContent | upVotes | downVotes |
|---|---|---|---|---|---|---|---|---|
| 0 | 100002528 | beli-harddisk-eksternal | Kamal U. | 5 | NaN | bagus mantap dah sesui pesanan | 0 | 0 |
| 1 | 100002528 | beli-harddisk-eksternal | yofanca m. | 4 | NaN | Bagus, sesuai foto | 0 | 0 |
| 2 | 100002528 | beli-harddisk-eksternal | Lazada Customer | 5 | ok mantaaapppp barang sesuai pesanan.. good | okkkkk mantaaaaaaapppp ... goood | 0 | 0 |
| 3 | 100002528 | beli-harddisk-eksternal | Lazada Customer | 4 | NaN | bagus sesuai | 0 | 0 |
| 4 | 100002528 | beli-harddisk-eksternal | Yosep M. | 5 | NaN | NaN | 0 | 0 |


In [23]:
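# Wilson lower bound calculation function
def wilson_lower_bound(up, down, confidence=0.95):
    # up and down are the vote counts of one review; n is the total number of votes.
    n = up + down
    if n == 0:
        return 0
    # Two-sided z-value for the chosen confidence level.
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    # Observed share of up votes.
    phat = 1.0 * up / n
    # Lower limit of the Wilson score interval for the up-vote probability.
    return (phat + z * z / (2 * n) - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)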


# We are creating the Wilson lower bound score under the variable name wilson_lower_bound.
df["wilson_lower_bound"] = df.apply(lambda x: wilson_lower_bound(x["upVotes"], x["downVotes"]), axis=1)


# We are listing the 20 reviews with the highest WLB score.
df.sort_values(ascending=False, by="wilson_lower_bound").head(20)

|   | itemId | category | name | rating | reviewTitle | reviewContent | upVotes | downVotes | wilson_lower_bound |
|---|---|---|---|---|---|---|---|---|---|
| 185060 | 160020846 | shop-televisi-digital | alfonsus y. | 1 | PELAYANAN BURUK | pesan barang Dari tgl 12 Feb 2017, sampai deti... | 1387 | 8 | 0.98872 |
| 126168 | 160020846 | jual-flash-drives | alfonsus y. | 1 | PELAYANAN BURUK | pesan barang Dari tgl 12 Feb 2017, sampai deti... | 1387 | 8 | 0.98872 |
| 15062 | 160020846 | beli-harddisk-eksternal | alfonsus y. | 1 | PELAYANAN BURUK | pesan barang Dari tgl 12 Feb 2017, sampai deti... | 1387 | 8 | 0.98872 |
| 90592 | 160020846 | beli-smart-tv | alfonsus y. | 1 | PELAYANAN BURUK | pesan barang Dari tgl 12 Feb 2017, sampai deti... | 1387 | 8 | 0.98872 |
| 15063 | 160020846 | beli-harddisk-eksternal | M M. | 1 | SUDAH 2 MINGGU BARANG BELUM DATANG | Sesuai detail pesanan seharusnya barang sudah ... | 1776 | 12 | 0.98831 |
| 185061 | 160020846 | shop-televisi-digital | M M. | 1 | SUDAH 2 MINGGU BARANG BELUM DATANG | Sesuai detail pesanan seharusnya barang sudah ... | 1776 | 12 | 0.98831 |
| 126169 | 160020846 | jual-flash-drives | M M. | 1 | SUDAH 2 MINGGU BARANG BELUM DATANG | Sesuai detail pesanan seharusnya barang sudah ... | 1776 | 12 | 0.98831 |
| 90593 | 160020846 | beli-smart-tv | M M. | 1 | SUDAH 2 MINGGU BARANG BELUM DATANG | Sesuai detail pesanan seharusnya barang sudah ... | 1776 | 12 | 0.98831 |
| 15980 | 160022809 | beli-harddisk-eksternal | Tekad A. | 5 | Melebihi Ekspektasi | Kualitas barang sangat mengesankan, walau di s... | 768 | 4 | 0.98675 |
| 91510 | 160022809 | beli-smart-tv | Tekad A. | 5 | Melebihi Ekspektasi | Kualitas barang sangat mengesankan, walau di s... | 768 | 4 | 0.98675 |
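To see how the Wilson lower bound treats vote volume, here is a small standalone sketch. It repeats the same formula as the notebook's `wilson_lower_bound`, but takes the z-value from `statistics.NormalDist` in the standard library rather than `scipy.stats`, so it runs on its own. The vote counts 2/0 and 100/10 are hypothetical; the pair 1387/8 is taken from the first row of the table above.

```python
import math
from statistics import NormalDist


def wilson_lower_bound(up, down, confidence=0.95):
    """Lower limit of the Wilson score interval for the up-vote probability."""
    n = up + down
    if n == 0:
        return 0.0
    # Two-sided z-value for the chosen confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = up / n
    numerator = phat + z * z / (2 * n) - z * math.sqrt(
        (phat * (1 - phat) + z * z / (4 * n)) / n
    )
    return numerator / (1 + z * z / n)


few_votes = wilson_lower_bound(2, 0)       # hypothetical: 100% up votes, but only 2 votes
many_votes = wilson_lower_bound(100, 10)   # hypothetical: about 91% up votes, 110 votes
top_review = wilson_lower_bound(1387, 8)   # counts from the first row of the table above

print(f"2 up / 0 down:    {few_votes:.5f}")
print(f"100 up / 10 down: {many_votes:.5f}")
print(f"1387 up / 8 down: {top_review:.5f}")
```

The review with two unanimous up votes ends up far below the one with 110 votes, which is the intended behaviour: a perfect ratio backed by very few votes is weak evidence. The last value should agree with the `wilson_lower_bound` column of the table to five decimal places.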
