# Measurement Problems

Social proof is a psychological phenomenon in which people decide how to behave in a given situation by imitating what others do. Social proof increases a company's credibility because it comes from actual consumers, and since fewer and fewer people trust traditional advertising, consumer opinions often carry more weight than the firm's own claims. The important question is what makes a product sell.

This notebook examines rating products, sorting products, sorting reviews, and A/B testing.

In [1]:
# libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import math
from sklearn.preprocessing import MinMaxScaler
import scipy.stats as st

# visual settings
pd.set_option("display.max_columns",None)
pd.set_option("display.max_rows",None)
pd.set_option("display.width",500)
pd.set_option("display.float_format",lambda x: '%.5f' % x)
sns.set(rc={"figure.figsize":(12,12)})

In [2]:
data = pd.read_csv("datas/course_reviews.csv")
data.head()

Unnamed: 0,Rating,Timestamp,Enrolled,Progress,Questions Asked,Questions Answered
0,5.0,2021-02-05 07:45:55,2021-01-25 15:12:08,5.0,0.0,0.0
1,5.0,2021-02-04 21:05:32,2021-02-04 20:43:40,1.0,0.0,0.0
2,4.5,2021-02-04 20:34:03,2019-07-04 23:23:27,1.0,0.0,0.0
3,5.0,2021-02-04 16:56:28,2021-02-04 14:41:29,10.0,0.0,0.0
4,4.0,2021-02-04 15:00:24,2020-10-13 03:10:07,10.0,0.0,0.0


In [3]:
data.shape

(4323, 6)

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4323 entries, 0 to 4322
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rating              4323 non-null   float64
 1   Timestamp           4323 non-null   object 
 2   Enrolled            4323 non-null   object 
 3   Progress            4323 non-null   float64
 4   Questions Asked     4323 non-null   float64
 5   Questions Answered  4323 non-null   float64
dtypes: float64(4), object(2)
memory usage: 202.8+ KB


In [5]:
# distribution of the ratings
data["Rating"].value_counts()

5.00000    3267
4.50000     475
4.00000     383
3.50000      96
3.00000      62
1.00000      15
2.00000      12
2.50000      11
1.50000       2
Name: Rating, dtype: int64

In [6]:
data["Questions Asked"].value_counts()

0.00000     3867
1.00000      276
2.00000       80
3.00000       43
4.00000       15
5.00000       13
6.00000        9
8.00000        5
9.00000        3
14.00000       2
11.00000       2
7.00000        2
10.00000       2
15.00000       2
22.00000       1
12.00000       1
Name: Questions Asked, dtype: int64

Most users did not ask any questions: 3867 of the 4323 reviews have 0 questions asked.

In [7]:
data.groupby("Questions Asked").agg({"Questions Asked":"count",
                                    "Rating":"mean"})

Questions Asked,Questions Asked (count),Rating (mean)
0.0,3867,4.76519
1.0,276,4.74094
2.0,80,4.80625
3.0,43,4.74419
4.0,15,4.83333
5.0,13,4.65385
6.0,9,5.0
7.0,2,4.75
8.0,5,4.9
9.0,3,5.0


## Time-Based Weighted Average

The Timestamp column has dtype object, so it needs to be converted to datetime.

In [8]:
data["Timestamp"] = pd.to_datetime(data["Timestamp"])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4323 entries, 0 to 4322
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Rating              4323 non-null   float64       
 1   Timestamp           4323 non-null   datetime64[ns]
 2   Enrolled            4323 non-null   object        
 3   Progress            4323 non-null   float64       
 4   Questions Asked     4323 non-null   float64       
 5   Questions Answered  4323 non-null   float64       
dtypes: datetime64[ns](1), float64(4), object(1)
memory usage: 202.8+ KB


In [9]:
current_date = pd.to_datetime("2021-02-10 0:0:0") # max datetime

data["Days"] = (current_date - data["Timestamp"]).dt.days
data.head()

Unnamed: 0,Rating,Timestamp,Enrolled,Progress,Questions Asked,Questions Answered,Days
0,5.0,2021-02-05 07:45:55,2021-01-25 15:12:08,5.0,0.0,0.0,4
1,5.0,2021-02-04 21:05:32,2021-02-04 20:43:40,1.0,0.0,0.0,5
2,4.5,2021-02-04 20:34:03,2019-07-04 23:23:27,1.0,0.0,0.0,5
3,5.0,2021-02-04 16:56:28,2021-02-04 14:41:29,10.0,0.0,0.0,5
4,4.0,2021-02-04 15:00:24,2020-10-13 03:10:07,10.0,0.0,0.0,5


##### Average of evaluations made in the last thirty days

In [10]:
data.loc[data["Days"] <= 30, "Rating"].mean()

4.775773195876289

In [11]:
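# Each comparison needs its own parentheses, because `&` binds tighter than `<=`.
# The value printed below came from the unparenthesized mask, which selects every row.
data.loc[(data["Days"] > 30) & (data["Days"] <= 90), "Rating"].mean()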

4.764284061993986

In [12]:
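# Each comparison needs its own parentheses, because `&` binds tighter than `<=`.
# The value printed below came from the unparenthesized mask, which selects every row.
data.loc[(data["Days"] > 90) & (data["Days"] <= 180), "Rating"].mean()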

4.764284061993986

In [13]:
data.loc[data["Days"] > 180, "Rating"].mean()

4.76641586867305
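
In Python, `&` binds more tightly than comparison operators, so a range filter written as `(a > x) & a <= y` is parsed as `((a > x) & a) <= y`. The scalar sketch below (the value 200 is purely illustrative) shows the effect. The same precedence applies to pandas Series, which would explain why cells 11 and 12 print an identical mean.

```python
days = 200  # illustrative: a review older than 90 days

# Without inner parentheses, `&` is evaluated before `<=`:
# (days > 30) & days  ->  True & 200  ->  1 & 200  ->  0, and 0 <= 90 is True.
unparenthesized = (days > 30) & days <= 90

# With each comparison parenthesized, the review is correctly excluded.
parenthesized = (days > 30) & (days <= 90)

print(unparenthesized, parenthesized)  # True False
```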

In [14]:
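def time_based_weighted_average(dataframe, w1, w2, w3, w4):
    # Weights are percentages for the 0-30, 31-90, 91-180 and 180+ day buckets.
    # Each comparison is parenthesized because `&` binds tighter than `<=`.
    # The results printed below came from masks without the inner parentheses,
    # which select every row for the two middle buckets.
    return dataframe.loc[dataframe["Days"] <= 30, "Rating"].mean() * w1/100 + \
    dataframe.loc[(dataframe["Days"] > 30) & (dataframe["Days"] <= 90), "Rating"].mean() * w2/100 + \
    dataframe.loc[(dataframe["Days"] > 90) & (dataframe["Days"] <= 180), "Rating"].mean() * w3/100 + \
    dataframe.loc[dataframe["Days"] > 180, "Rating"].mean() * w4/100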

In [15]:
time_based_weighted_average(data,30,26,22,22)

4.7681997996280705
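
The bucketing idea can also be sketched without pandas for an arbitrary number of age buckets. This is a minimal illustration, not the notebook's implementation: the function name `time_weighted_mean`, its parameters, and the four sample reviews are all hypothetical.

```python
from bisect import bisect_left

def time_weighted_mean(days, ratings, edges=(30, 90, 180), weights=(30, 26, 22, 22)):
    """Weighted mean of per-bucket average ratings.

    Bucket i holds reviews with edges[i-1] < days <= edges[i]; the last
    bucket holds everything older than edges[-1]. Weights are percentages.
    """
    if len(weights) != len(edges) + 1:
        raise ValueError("need exactly one weight per bucket")
    if sum(weights) != 100:
        raise ValueError("weights must sum to 100")
    sums = [0.0] * len(weights)
    counts = [0] * len(weights)
    for d, r in zip(days, ratings):
        i = bisect_left(edges, d)  # first bucket whose upper edge is >= d
        sums[i] += r
        counts[i] += 1
    if 0 in counts:
        raise ValueError("every bucket needs at least one review")
    return sum(w / 100 * s / c for w, s, c in zip(weights, sums, counts))

# Hypothetical reviews: one per bucket
result = time_weighted_mean(days=[5, 40, 100, 200], ratings=[5.0, 4.0, 3.0, 2.0])
print(round(result, 2))  # 3.64
```

Raising on an empty bucket is a design choice for this sketch; the pandas version would instead return `NaN` for that bucket's mean.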

## User-Based Weighted Average

A user who has watched 10% of the course and one who has watched 80% should not carry the same weight in the rating.

In [16]:
print(data.groupby("Progress").agg({"Rating":"mean"}))

           Rating
Progress         
0.00000   4.67391
1.00000   4.64269
2.00000   4.65476
3.00000   4.66355
4.00000   4.77733
5.00000   4.69821
6.00000   4.75510
7.00000   4.73256
8.00000   4.74194
9.00000   4.83125
10.00000  4.74569
11.00000  4.83333
12.00000  4.83333
13.00000  4.59375
14.00000  4.86957
15.00000  4.71458
16.00000  4.90000
17.00000  4.85000
18.00000  4.87500
19.00000  4.81250
20.00000  4.78352
21.00000  4.80000
22.00000  4.60000
23.00000  4.84615
24.00000  4.73333
25.00000  4.81687
26.00000  5.00000
27.00000  4.81250
28.00000  4.94444
29.00000  4.57692
30.00000  4.79808
31.00000  4.91667
32.00000  5.00000
33.00000  4.93750
34.00000  4.75000
35.00000  4.85959
36.00000  4.33333
37.00000  5.00000
38.00000  4.90000
39.00000  4.58333
40.00000  4.76786
41.00000  3.16667
42.00000  5.00000
43.00000  5.00000
44.00000  4.95455
45.00000  4.80769
46.00000  5.00000
47.00000  5.00000
48.00000  5.00000
49.00000  5.00000
50.00000  4.80198
51.00000  4.50000
52.00000  5.00000
53.00000  

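The averages suggest that users who have progressed further in the course tend to give somewhat higher ratings, although individual progress levels vary considerably (for example, 3.17 at 41% progress).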

In [17]:
def user_weighted_average_rating(dataframe, w1, w2, w3, w4):
    
    return dataframe.loc[dataframe["Progress"] <= 10, "Rating"].mean() * w1/100 + \
    dataframe.loc[(dataframe["Progress"] > 10) & (dataframe["Progress"] <= 45), "Rating"].mean() * w2/100 + \
    dataframe.loc[(dataframe["Progress"] > 45) & (dataframe["Progress"] <= 75), "Rating"].mean() * w3/100 + \
    dataframe.loc[dataframe["Progress"] > 75, "Rating"].mean() * w4/100

In [18]:
user_weighted_average_rating(data, 20, 24, 26, 30)

4.803286469062915

## Weighted Rating

In [19]:
def weighted_rating(dataframe, time_weighted = 50, user_weight = 50):
    
    return time_based_weighted_average(dataframe,30,26,22,22) * time_weighted/100 + user_weighted_average_rating(dataframe,20, 24, 26, 30) * user_weight/100

In [20]:
weighted_rating(data)

4.785743134345493

In [21]:
weighted_rating(data, time_weighted= 40, user_weight= 60)

4.7892518012889775

# Sorting Products

## Sorting by Rating

Sorting by user rating is a basic interaction pattern across the Web: it is used to rank products (Amazon), posts (Reddit), businesses (Yelp), videos (YouTube), and more. To use this pattern, designers must define a sensible total ordering over the items based on each item's distribution of ratings.

In [22]:
data_course = pd.read_csv("datas/product_sorting.csv")
data_course.head()

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point
0,(50+ Saat) Python A-Z™: Veri Bilimi ve Machine...,Veri Bilimi Okulu,17380,4.8,4621,3466,924,185,46,6
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45
2,5 Saatte Veri Bilimci Olun (Valla Billa),Instructor_1,18693,4.4,2362,1582,567,165,24,24
3,R ile Veri Bilimi ve Machine Learning (35 Saat),Veri Bilimi Okulu,6626,4.6,1027,688,257,51,10,21
4,(2020) Python ile Makine Öğrenmesi (Machine Le...,Veri Bilimi Okulu,11314,4.6,969,717,194,38,10,10


In [23]:
data_course.sort_values(by="rating", ascending=False).head(15)

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point
0,(50+ Saat) Python A-Z™: Veri Bilimi ve Machine...,Veri Bilimi Okulu,17380,4.8,4621,3466,924,185,46,6
10,İleri Düzey Excel|Dashboard|Excel İp Uçları,Veri Bilimi Okulu,9554,4.8,2266,1654,499,91,22,0
19,Alıştırmalarla SQL Öğreniyorum,Veri Bilimi Okulu,3155,4.8,235,200,31,4,0,0
5,Course_1,Instructor_2,4601,4.8,213,164,45,4,0,0
6,Course_2,Instructor_3,3171,4.7,856,582,205,51,9,9
14,Uçtan Uca SQL Server Eğitimi,Veri Bilimi Okulu,12893,4.7,2425,1722,510,145,24,24
8,A'dan Z'ye Apache Spark (Scala & Python),Veri Bilimi Okulu,6920,4.7,214,154,41,13,2,4
13,Course_5,Instructor_6,6056,4.7,144,82,46,12,1,3
27,Course_15,Instructor_1,1164,4.6,98,65,24,6,0,3
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45


Sorting by rating alone can overlook important information. For example, *Course_6* has a high score of *4.6* with only *140* purchases and *20* reviews, whereas the *Python and Machine Learning* course just below it also scores *4.6*, but with *11314* purchases and *969* reviews. Treating the two scores as equivalent in the ranking would not be right. For this reason, the ranking logic should take other attributes into account rather than the score alone.

## Sorting by Rating, Comment and Purchase

In [24]:
data_course["purchase_count_scaler"] = MinMaxScaler(feature_range=(1,5)).fit(data_course[["purchase_count"]]).transform(data_course[["purchase_count"]])

In [25]:
data_course["commment_count_scaled"] = MinMaxScaler(feature_range=(1,5)).fit(data_course[["commment_count"]]).transform(data_course[["commment_count"]])
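
`MinMaxScaler(feature_range=(1,5))` linearly maps the smallest value in a column to 1 and the largest to 5, which puts the counts on the same scale as the rating. A plain-Python sketch of that mapping follows; the helper name `min_max_scale` and the sample values are hypothetical.

```python
def min_max_scale(values, low=1.0, high=5.0):
    """Linearly rescale values so that min(values) -> low and max(values) -> high."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Choice made for this sketch: reject a constant column rather than guess.
        raise ValueError("cannot scale a constant column")
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

scaled = min_max_scale([0, 50, 100])  # hypothetical purchase counts
print(scaled)  # [1.0, 3.0, 5.0]
```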

In [26]:
data_course.head()

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point,purchase_count_scaler,commment_count_scaled
0,(50+ Saat) Python A-Z™: Veri Bilimi ve Machine...,Veri Bilimi Okulu,17380,4.8,4621,3466,924,185,46,6,2.43801,5.0
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45,5.0,4.8847
2,5 Saatte Veri Bilimci Olun (Valla Billa),Instructor_1,18693,4.4,2362,1582,567,165,24,24,2.54684,3.04161
3,R ile Veri Bilimi ve Machine Learning (35 Saat),Veri Bilimi Okulu,6626,4.6,1027,688,257,51,10,21,1.54669,1.88427
4,(2020) Python ile Makine Öğrenmesi (Machine Le...,Veri Bilimi Okulu,11314,4.6,969,717,194,38,10,10,1.93525,1.83398


In [27]:
(data_course["purchase_count_scaler"] * 26/100 + data_course["commment_count_scaled"] * 32/100 + data_course["rating"] * 42/100)

0    4.24988
1    4.79510
2    3.48349
3    2.93711
4    3.02204
5    2.75165
6    2.85721
7    2.52239
8    2.75990
9    2.81623
10   3.42792
11   2.98739
12   2.52869
13   2.72186
14   3.50198
15   3.36577
16   2.51798
17   2.28232
18   2.45807
19   2.72659
20   3.68156
21   2.51906
22   2.53828
23   2.35459
24   2.26479
25   2.02105
26   2.43612
27   2.56168
28   2.54467
29   1.92584
30   1.92400
31   2.27376
dtype: float64
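
As a check on the weighting (26% purchases, 32% comments, 42% rating), the score for row 1 can be worked out by hand from the scaled values shown in the table above:

$$
0.26 \times 5.0 + 0.32 \times 4.8847 + 0.42 \times 4.6 = 1.3 + 1.563104 + 1.932 = 4.795104 \approx 4.79510
$$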

In [28]:
def weighted_sorting_score(dataframe, w1=32, w2=26, w3=42):
    # w1, w2, w3 are percentage weights for comments, purchases and rating
    return((dataframe["commment_count_scaled"] * w1/100 +
            dataframe["purchase_count_scaler"] * w2/100 +
            dataframe["rating"] * w3/100))

In [29]:
data_course["weighted_sorting_score"] = weighted_sorting_score(data_course)

In [30]:
data_course.sort_values("weighted_sorting_score", ascending=False).head(20)

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point,purchase_count_scaler,commment_count_scaled,weighted_sorting_score
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45,5.0,4.8847,4.7951
0,(50+ Saat) Python A-Z™: Veri Bilimi ve Machine...,Veri Bilimi Okulu,17380,4.8,4621,3466,924,185,46,6,2.43801,5.0,4.24988
20,Course_9,Instructor_3,12946,4.5,3371,2191,877,203,33,67,2.07051,3.91634,3.68156
14,Uçtan Uca SQL Server Eğitimi,Veri Bilimi Okulu,12893,4.7,2425,1722,510,145,24,24,2.06612,3.09623,3.50198
2,5 Saatte Veri Bilimci Olun (Valla Billa),Instructor_1,18693,4.4,2362,1582,567,165,24,24,2.54684,3.04161,3.48349
10,İleri Düzey Excel|Dashboard|Excel İp Uçları,Veri Bilimi Okulu,9554,4.8,2266,1654,499,91,22,0,1.78937,2.95839,3.42792
15,Uygulamalarla SQL Öğreniyorum,Veri Bilimi Okulu,11397,4.5,2353,1435,705,165,24,24,1.94213,3.03381,3.36577
4,(2020) Python ile Makine Öğrenmesi (Machine Le...,Veri Bilimi Okulu,11314,4.6,969,717,194,38,10,10,1.93525,1.83398,3.02204
11,Course_3,Instructor_4,24809,4.3,250,95,87,51,12,5,3.05375,1.21066,2.98739
3,R ile Veri Bilimi ve Machine Learning (35 Saat),Veri Bilimi Okulu,6626,4.6,1027,688,257,51,10,21,1.54669,1.88427,2.93711


# Bayesian Average Rating Score

In [31]:
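def bayesian_average_rating(n, confidence=0.95):
    # n: rating counts ordered from the lowest score to the highest,
    # e.g. [1_point, 2_point, 3_point, 4_point, 5_point]
    if sum(n) == 0:
        return 0

    K = len(n)                                 # number of rating levels
    z = st.norm.ppf(1 - (1 - confidence) / 2)  # two-sided normal quantile
    N = sum(n)                                 # total number of ratings

    first_part = 0.0   # expected rating, with one pseudo-count per level
    second_part = 0.0  # expected squared rating

    # iterate over the values directly so that n can be a list or a pandas Series
    for k, n_k in enumerate(n):
        first_part += (k + 1) * (n_k + 1) / (N + K)
        second_part += (k + 1) * (k + 1) * (n_k + 1) / (N + K)

    # expected rating minus z times its estimated standard deviation
    score = first_part - z * math.sqrt((second_part - first_part * first_part) / (N + K + 1))

    return score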
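
Written out as a formula, the function computes the following, where $s_k = k$ is the value of rating level $k$, $n_k$ is the number of ratings at that level, $N = \sum_k n_k$, $K$ is the number of levels, and $z$ is the normal quantile for the chosen confidence. This is a transcription of the code, not a derivation:

$$
\text{score} = \sum_{k=1}^{K} s_k \frac{n_k + 1}{N + K} \;-\; z \sqrt{\frac{\displaystyle\sum_{k=1}^{K} s_k^2 \frac{n_k + 1}{N + K} - \left(\sum_{k=1}^{K} s_k \frac{n_k + 1}{N + K}\right)^{2}}{N + K + 1}}
$$

The `+ 1` in each numerator adds one pseudo-rating per level, so items with few ratings are pulled toward the middle of the scale and penalized by a wider uncertainty term.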

In [32]:
data_course["bar_score"] = data_course.apply(lambda x: bayesian_average_rating(x[["1_point",
                                                                                 "2_point",
                                                                                 "3_point",
                                                                                 "4_point",
                                                                                 "5_point"]]), axis=1)

In [33]:
data_course.sort_values("weighted_sorting_score", ascending=False).head(20)

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point,purchase_count_scaler,commment_count_scaled,weighted_sorting_score,bar_score
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45,5.0,4.8847,4.7951,4.51604
0,(50+ Saat) Python A-Z™: Veri Bilimi ve Machine...,Veri Bilimi Okulu,17380,4.8,4621,3466,924,185,46,6,2.43801,5.0,4.24988,4.66586
20,Course_9,Instructor_3,12946,4.5,3371,2191,877,203,33,67,2.07051,3.91634,3.68156,4.48063
14,Uçtan Uca SQL Server Eğitimi,Veri Bilimi Okulu,12893,4.7,2425,1722,510,145,24,24,2.06612,3.09623,3.50198,4.56816
2,5 Saatte Veri Bilimci Olun (Valla Billa),Instructor_1,18693,4.4,2362,1582,567,165,24,24,2.54684,3.04161,3.48349,4.51521
10,İleri Düzey Excel|Dashboard|Excel İp Uçları,Veri Bilimi Okulu,9554,4.8,2266,1654,499,91,22,0,1.78937,2.95839,3.42792,4.64168
15,Uygulamalarla SQL Öğreniyorum,Veri Bilimi Okulu,11397,4.5,2353,1435,705,165,24,24,1.94213,3.03381,3.36577,4.45481
4,(2020) Python ile Makine Öğrenmesi (Machine Le...,Veri Bilimi Okulu,11314,4.6,969,717,194,38,10,10,1.93525,1.83398,3.02204,4.59567
11,Course_3,Instructor_4,24809,4.3,250,95,87,51,12,5,3.05375,1.21066,2.98739,3.87774
3,R ile Veri Bilimi ve Machine Learning (35 Saat),Veri Bilimi Okulu,6626,4.6,1027,688,257,51,10,21,1.54669,1.88427,2.93711,4.48208


In [34]:
data_course.sort_values("bar_score", ascending=False).head(20)

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point,purchase_count_scaler,commment_count_scaled,weighted_sorting_score,bar_score
19,Alıştırmalarla SQL Öğreniyorum,Veri Bilimi Okulu,3155,4.8,235,200,31,4,0,0,1.25901,1.19766,2.72659,4.72913
0,(50+ Saat) Python A-Z™: Veri Bilimi ve Machine...,Veri Bilimi Okulu,17380,4.8,4621,3466,924,185,46,6,2.43801,5.0,4.24988,4.66586
10,İleri Düzey Excel|Dashboard|Excel İp Uçları,Veri Bilimi Okulu,9554,4.8,2266,1654,499,91,22,0,1.78937,2.95839,3.42792,4.64168
5,Course_1,Instructor_2,4601,4.8,213,164,45,4,0,0,1.37886,1.17859,2.75165,4.63448
4,(2020) Python ile Makine Öğrenmesi (Machine Le...,Veri Bilimi Okulu,11314,4.6,969,717,194,38,10,10,1.93525,1.83398,3.02204,4.59567
14,Uçtan Uca SQL Server Eğitimi,Veri Bilimi Okulu,12893,4.7,2425,1722,510,145,24,24,2.06612,3.09623,3.50198,4.56816
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45,5.0,4.8847,4.7951,4.51604
2,5 Saatte Veri Bilimci Olun (Valla Billa),Instructor_1,18693,4.4,2362,1582,567,165,24,24,2.54684,3.04161,3.48349,4.51521
6,Course_2,Instructor_3,3171,4.7,856,582,205,51,9,9,1.26033,1.73602,2.85721,4.50797
3,R ile Veri Bilimi ve Machine Learning (35 Saat),Veri Bilimi Okulu,6626,4.6,1027,688,257,51,10,21,1.54669,1.88427,2.93711,4.48208
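
The top `bar_score` in the table can be reproduced with the standard library alone. The sketch below is a compact restatement for checking purposes, with `statistics.NormalDist` swapped in for `scipy.stats.norm`; the variable names are illustrative, and the counts are taken from the "Alıştırmalarla SQL Öğreniyorum" row.

```python
import math
from statistics import NormalDist

def bayesian_average_rating(counts, confidence=0.95):
    """Lower bound on the expected rating; counts run from 1-star to K-star."""
    total = sum(counts)
    if total == 0:
        return 0.0
    levels = len(counts)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # One pseudo-rating is added to every level before averaging.
    mean = sum((k + 1) * (c + 1) for k, c in enumerate(counts)) / (total + levels)
    mean_sq = sum((k + 1) ** 2 * (c + 1) for k, c in enumerate(counts)) / (total + levels)
    return mean - z * math.sqrt((mean_sq - mean ** 2) / (total + levels + 1))

# 1_point .. 5_point counts for "Alıştırmalarla SQL Öğreniyorum"
sql_course = [0, 0, 4, 31, 200]
print(round(bayesian_average_rating(sql_course), 5))  # 4.72913
```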


# Hybrid Sorting

In [35]:
def hybrid_sorting_score(dataframe, bar_w = 60, wss_w = 40):
    
    bar_score = dataframe.apply(lambda x: bayesian_average_rating(x[["1_point",
                                                                                 "2_point",
                                                                                 "3_point",
                                                                                 "4_point",
                                                                                 "5_point"]]), axis=1)
    
    wss = weighted_sorting_score(dataframe)
    
    return (bar_score * bar_w/100) + (wss * wss_w /100) 

In [36]:
data_course["hybrid_sorting_score"] = hybrid_sorting_score(data_course)

data_course.sort_values("hybrid_sorting_score", ascending=False).head(20)

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point,purchase_count_scaler,commment_count_scaled,weighted_sorting_score,bar_score,hybrid_sorting_score
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45,5.0,4.8847,4.7951,4.51604,4.62766
0,(50+ Saat) Python A-Z™: Veri Bilimi ve Machine...,Veri Bilimi Okulu,17380,4.8,4621,3466,924,185,46,6,2.43801,5.0,4.24988,4.66586,4.49947
20,Course_9,Instructor_3,12946,4.5,3371,2191,877,203,33,67,2.07051,3.91634,3.68156,4.48063,4.161
10,İleri Düzey Excel|Dashboard|Excel İp Uçları,Veri Bilimi Okulu,9554,4.8,2266,1654,499,91,22,0,1.78937,2.95839,3.42792,4.64168,4.15618
14,Uçtan Uca SQL Server Eğitimi,Veri Bilimi Okulu,12893,4.7,2425,1722,510,145,24,24,2.06612,3.09623,3.50198,4.56816,4.14169
2,5 Saatte Veri Bilimci Olun (Valla Billa),Instructor_1,18693,4.4,2362,1582,567,165,24,24,2.54684,3.04161,3.48349,4.51521,4.10252
15,Uygulamalarla SQL Öğreniyorum,Veri Bilimi Okulu,11397,4.5,2353,1435,705,165,24,24,1.94213,3.03381,3.36577,4.45481,4.0192
4,(2020) Python ile Makine Öğrenmesi (Machine Le...,Veri Bilimi Okulu,11314,4.6,969,717,194,38,10,10,1.93525,1.83398,3.02204,4.59567,3.96622
19,Alıştırmalarla SQL Öğreniyorum,Veri Bilimi Okulu,3155,4.8,235,200,31,4,0,0,1.25901,1.19766,2.72659,4.72913,3.92811
5,Course_1,Instructor_2,4601,4.8,213,164,45,4,0,0,1.37886,1.17859,2.75165,4.63448,3.88135
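
With the default weights (60% `bar_score`, 40% `weighted_sorting_score`), the hybrid score of the first row can be verified by hand from the values in the table:

$$
0.60 \times 4.51604 + 0.40 \times 4.79510 = 2.709624 + 1.918040 = 4.627664 \approx 4.62766
$$

The same arithmetic for "Alıştırmalarla SQL Öğreniyorum" gives $0.60 \times 4.72913 + 0.40 \times 2.72659 \approx 3.92811$, which shows how a course with the highest `bar_score` still drops to ninth place once purchase and comment volume are weighed in.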
