# General Information
- This section covers Rating Products, one of the solutions to measurement problems in online purchasing.

# When buying a product, what does the buyer pay attention to?

 - Product rating.
 - Distribution of star ratings (how many votes each star level received).
 - Number of product reviews.
 - Number of product sales.
 - Social proof (e.g. the most helpful comment).
 
 - Here we examine the measurements that influence a buyer's decision to purchase a product and, from the marketplace's side, the measurements we use to surface the most suitable product for that buyer.
 
 - **The user** wants the best product in terms of price and performance. As a marketplace, we want to make that choice easier by applying suitable measurement methods.
 
**When ranking products in a marketplace, the following factors are taken into account:**
 
- Product Ratings
- Number of product reviews
- Number of purchases
- Number of useful comments

In this section we will try to solve measurement problems using the **Rating Products method**.


# Rating Products
- We will compute weighted product ratings that take several factors into account. In this section, we will learn how companies calculate the scores their websites show for products, using the following methods:

- Average
- Time-Based Weighted Average
- User-Based Weighted Average
- Weighted Rating

# Business Problem
- A website that offers online training wants to calculate the course score for one of its courses using four methods: Average, Time-Based Weighted Average, User-Based Weighted Average, and Weighted Rating, which combines the two weighting methods.

# Data Story
- The dataset contains rating data for the **(50+ Hours) Python A-Z™: Data Science and Machine Learning** training course.

**Variables in this data set**

- **Rating:** Product Rating.
- **Timestamp:** Date of product rating.
- **Enrolled:** Date of product purchase.
- **Progress:** Percentage of training content watched (%)
- **Questions Asked:** Number of questions about training content.
- **Questions Answered:** Number of answers given to questions asked.

In [1]:
# import Required Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import datetime as dt
import math
import scipy.stats as st

import matplotlib.pyplot as plt
import plotly.express as px
import warnings

from sklearn.preprocessing import MinMaxScaler

warnings.simplefilter(action="ignore")

In [2]:
# Adjusting Row Column Settings

pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)
pd.set_option('display.width', 500)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [3]:
# Loading the Data Set

df = pd.read_csv("/kaggle/input/course-reviewscsv/course_reviews.csv")

In [4]:
df.head()

Unnamed: 0,Rating,Timestamp,Enrolled,Progress,Questions Asked,Questions Answered
0,5.0,2021-02-05 07:45:55,2021-01-25 15:12:08,5.0,0.0,0.0
1,5.0,2021-02-04 21:05:32,2021-02-04 20:43:40,1.0,0.0,0.0
2,4.5,2021-02-04 20:34:03,2019-07-04 23:23:27,1.0,0.0,0.0
3,5.0,2021-02-04 16:56:28,2021-02-04 14:41:29,10.0,0.0,0.0
4,4.0,2021-02-04 15:00:24,2020-10-13 03:10:07,10.0,0.0,0.0


In [5]:
# Preliminary examination of the data set

def check_df(dataframe, head=5):
    print('##################### Shape #####################')
    print(dataframe.shape)
    print('##################### Types #####################')
    print(dataframe.dtypes)
    print('##################### Head #####################')
    print(dataframe.head(head))
    print('##################### Tail #####################')
    print(dataframe.tail(head))
    print('##################### NA #####################')
    print(dataframe.isnull().sum())
    print('##################### Quantiles #####################')
    print(dataframe.describe([0, 0.05, 0.50, 0.95, 0.99, 1]).T)

check_df(df)

##################### Shape #####################
(4323, 6)
##################### Types #####################
Rating                float64
Timestamp              object
Enrolled               object
Progress              float64
Questions Asked       float64
Questions Answered    float64
dtype: object
##################### Head #####################
   Rating            Timestamp             Enrolled  Progress  Questions Asked  Questions Answered
0 5.00000  2021-02-05 07:45:55  2021-01-25 15:12:08   5.00000          0.00000             0.00000
1 5.00000  2021-02-04 21:05:32  2021-02-04 20:43:40   1.00000          0.00000             0.00000
2 4.50000  2021-02-04 20:34:03  2019-07-04 23:23:27   1.00000          0.00000             0.00000
3 5.00000  2021-02-04 16:56:28  2021-02-04 14:41:29  10.00000          0.00000             0.00000
4 4.00000  2021-02-04 15:00:24  2020-10-13 03:10:07  10.00000          0.00000             0.00000
##################### Tail #####################
    

In [6]:
# Let's check the distribution of points given to the product.

df["Rating"].value_counts()

5.00000    3267
4.50000     475
4.00000     383
3.50000      96
3.00000      62
1.00000      15
2.00000      12
2.50000      11
1.50000       2
Name: Rating, dtype: int64

In [7]:
# Let's check the distribution of questions asked about the product.

df["Questions Asked"].value_counts() 

0.00000     3867
1.00000      276
2.00000       80
3.00000       43
4.00000       15
5.00000       13
6.00000        9
8.00000        5
9.00000        3
14.00000       2
11.00000       2
7.00000        2
10.00000       2
15.00000       2
22.00000       1
12.00000       1
Name: Questions Asked, dtype: int64

In [8]:
# Let's look at the number of people asking questions and the average score.

df.groupby("Questions Asked").agg({"Questions Asked": "count",
                                   "Rating": "mean"})

Questions Asked,Questions Asked (count),Rating (mean)
0.0,3867,4.76519
1.0,276,4.74094
2.0,80,4.80625
3.0,43,4.74419
4.0,15,4.83333
5.0,13,4.65385
6.0,9,5.0
7.0,2,4.75
8.0,5,4.9
9.0,3,5.0


# Average

In [9]:
# Average Score

df["Rating"].mean()

4.764284061993986

# Time-Based Weighted Average

- **Question:** What can we do to better reflect the current trend in the average?
- **Answer:** Use a time-based weighted average. If we weight ratings by when they were given, for example giving one weight to ratings from the last 30 days and a different weight to older ones, recent ratings can influence the score more than old ones.
- In this way, the average reflects the current trend more faithfully.
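The idea can be written compactly. This is a sketch in our own notation (the symbols are assumptions, not taken from the course material): the ratings are split into $K$ time intervals, $I_k$ is the set of ratings in interval $k$, $n_k$ is their count, $\bar{r}_k$ is their mean, and $w_k$ is the weight assigned to that interval.

```latex
\text{score} = \sum_{k=1}^{K} w_k \, \bar{r}_k,
\qquad \bar{r}_k = \frac{1}{n_k} \sum_{i \in I_k} r_i,
\qquad \sum_{k=1}^{K} w_k = 1
```

Because the weights sum to 1, the score stays on the same scale as the individual ratings. Giving larger weights to the more recent intervals makes recent ratings count more.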

In [10]:
# Let's convert the Timestamp variable to the datetime type.

df["Timestamp"] = pd.to_datetime(df["Timestamp"])

In [11]:
df["Timestamp"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 4323 entries, 0 to 4322
Series name: Timestamp
Non-Null Count  Dtype         
--------------  -----         
4323 non-null   datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 33.9 KB


In [12]:
# Let's convert the Enrolled variable to the datetime type.

df["Enrolled"]=pd.to_datetime(df["Enrolled"])

In [13]:
df["Enrolled"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 4323 entries, 0 to 4322
Series name: Enrolled
Non-Null Count  Dtype         
--------------  -----         
4323 non-null   datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 33.9 KB


In [14]:
# Present Date (date of analysis)

today_date = pd.to_datetime('2021-02-10 0:0:0')

In [15]:
today_date

Timestamp('2021-02-10 00:00:00')

In [16]:
# Let's create a variable called days that records how many days ago each rating was made.

df["days"] = (today_date - df["Timestamp"]).dt.days

In [17]:
df.head()

Unnamed: 0,Rating,Timestamp,Enrolled,Progress,Questions Asked,Questions Answered,days
0,5.0,2021-02-05 07:45:55,2021-01-25 15:12:08,5.0,0.0,0.0,4
1,5.0,2021-02-04 21:05:32,2021-02-04 20:43:40,1.0,0.0,0.0,5
2,4.5,2021-02-04 20:34:03,2019-07-04 23:23:27,1.0,0.0,0.0,5
3,5.0,2021-02-04 16:56:28,2021-02-04 14:41:29,10.0,0.0,0.0,5
4,4.0,2021-02-04 15:00:24,2020-10-13 03:10:07,10.0,0.0,0.0,5
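Note that `.dt.days` keeps only the whole-day part of each time difference, which is why the first row shows 4 even though its date is five calendar days before the analysis date. A minimal sketch using the first two timestamps from the table above:

```python
import pandas as pd

# Analysis date from the notebook; the timestamps are rows 0 and 1 of df.head().
today_date = pd.to_datetime("2021-02-10 0:0:0")
timestamps = pd.to_datetime(pd.Series(["2021-02-05 07:45:55", "2021-02-04 21:05:32"]))

elapsed = today_date - timestamps   # Timedeltas: 4 days 16:14:05 and 5 days 02:54:28
days = elapsed.dt.days              # whole-day component only, hours are dropped

print(days.tolist())
```

A rating made 4 days and 16 hours ago therefore lands in the same bucket as one made exactly 4 days ago.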


In [18]:
# Let's check the number of comments in the last 30 days.

df[df["days"] <= 30].count()

Rating                194
Timestamp             194
Enrolled              194
Progress              194
Questions Asked       194
Questions Answered    194
days                  194
dtype: int64

In [19]:
# Let's check the average of comments made in the last 30 days.

df.loc[df["days"] <= 30, "Rating"].mean()

4.775773195876289

In [20]:
# Let's check the average of ratings made more than 30 and up to 90 days ago.

df.loc[(df["days"] > 30) & (df["days"] <= 90), "Rating"].mean()

4.763833992094861

In [21]:
# Let's check the average of ratings made more than 90 and up to 180 days ago.

df.loc[(df["days"] > 90) & (df["days"] <= 180), "Rating"].mean()

4.752503576537912

In [22]:
# Let's check the average of ratings made more than 180 days ago.

df.loc[(df["days"] > 180), "Rating"].mean()

4.76641586867305
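The four interval means above can also be computed in one pass, together with the number of ratings in each interval. The following is a minimal sketch on toy data, not the course data set; the `toy` frame, the `interval` column, and the labels are made up for illustration. It assumes `days` is never negative.

```python
import pandas as pd

# Toy data (hypothetical): days since each rating, and the rating itself.
toy = pd.DataFrame({
    "days":   [3, 30, 31, 90, 120, 180, 181, 400],
    "Rating": [5.0, 4.0, 4.5, 3.5, 5.0, 3.0, 4.0, 2.0],
})

# Right-closed bins reproduce the notebook's conditions:
# days <= 30, 30 < days <= 90, 90 < days <= 180, days > 180.
bins = [-1, 30, 90, 180, float("inf")]
labels = ["0-30", "31-90", "91-180", "180+"]
toy["interval"] = pd.cut(toy["days"], bins=bins, labels=labels)

# Count and mean per interval; observed=False keeps empty intervals visible.
summary = toy.groupby("interval", observed=False)["Rating"].agg(["count", "mean"])
print(summary)
```

Seeing the counts next to the means helps judge how much each interval mean can be trusted: a mean built from a handful of ratings is far noisier than one built from hundreds.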

- **Information:** As you can see above, the interval averages differ only in the second or third decimal place. Is this important for us? Yes, it is. In e-commerce, every decimal place matters. Splitting the ratings by time interval makes the score sensitive to these differences.

- In the following steps we treat different time intervals differently: to reflect the effect of time in the score, we assign a separate weight to each interval.
- These weights are entirely our own choice, based on our interpretation of the data set.

**In this data set, we chose the following weights:**
- **28%** for ratings made in the **last 30 days**
- **26%** for ratings made **between 30 and 90 days** ago
- **24%** for ratings made **between 90 and 180 days** ago
- **22%** for ratings made **more than 180 days** ago

The weights sum to 100%, so the result stays on the same scale as the ratings themselves.

In [23]:
df.loc[df["days"] <= 30, "Rating"].mean() * 28/100 + \
    df.loc[(df["days"] > 30) & (df["days"] <= 90), "Rating"].mean() * 26/100 + \
    df.loc[(df["days"] > 90) & (df["days"] <= 180), "Rating"].mean() * 24/100 + \
    df.loc[(df["days"] > 180), "Rating"].mean() * 22/100

4.765025682267194
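As a worked check, the same number can be reproduced from the four interval means printed in the earlier cells. This sketch only copies those printed values; the list names are ours.

```python
# Interval means copied from the outputs of the four cells above, newest interval first.
means = [4.775773195876289, 4.763833992094861, 4.752503576537912, 4.76641586867305]
weights = [28, 26, 24, 22]  # percentage weights from the text

# The weights must cover exactly 100% of the score.
assert sum(weights) == 100

# Weighted sum of the interval means.
score = sum(m * w / 100 for m, w in zip(means, weights))
print(score)
```

The result agrees with the value above and sits very close to the plain average (4.76428), because the four interval means are themselves nearly identical in this data set.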

In [24]:
# The same calculation wrapped in a reusable function.

def time_based_weighted_average(dataframe, w1=28, w2=26, w3=24, w4=22):
    # w1..w4 are percentage weights for the newest to the oldest interval; they should sum to 100.
    return dataframe.loc[dataframe["days"] <= 30, "Rating"].mean() * w1 / 100 + \
           dataframe.loc[(dataframe["days"] > 30) & (dataframe["days"] <= 90), "Rating"].mean() * w2 / 100 + \
           dataframe.loc[(dataframe["days"] > 90) & (dataframe["days"] <= 180), "Rating"].mean() * w3 / 100 + \
           dataframe.loc[(dataframe["days"] > 180), "Rating"].mean() * w4 / 100

In [25]:
# we use the default weights

time_based_weighted_average(df)

4.765025682267194

In [26]:
# if we use different weights

time_based_weighted_average(df, 30, 26, 22, 22)

4.765491074653962
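Both weight sets used above sum to 100. The functions in this notebook do not check this, so weights that sum to more or less than 100 would silently push the score off the rating scale. A minimal sketch of a guard, using a hypothetical helper `weighted_bucket_score` that is not part of the notebook:

```python
def weighted_bucket_score(bucket_means, weights):
    """Combine per-bucket mean ratings with percentage weights (hypothetical helper)."""
    if len(bucket_means) != len(weights):
        raise ValueError("need exactly one weight per bucket")
    if sum(weights) != 100:
        raise ValueError(f"weights sum to {sum(weights)}, expected 100")
    return sum(m * w / 100 for m, w in zip(bucket_means, weights))

# If every bucket averages 5.0, a valid weighting should return (about) 5.0.
perfect = [5.0, 5.0, 5.0, 5.0]
valid_score = weighted_bucket_score(perfect, [30, 26, 22, 22])

# Weights summing to 102 would exceed the 5-star ceiling, so they are rejected.
try:
    weighted_bucket_score(perfect, [30, 26, 24, 22])
    rejected = False
except ValueError:
    rejected = True

print(valid_score, rejected)
```

Whether to raise an error or to rescale the weights is a design choice; raising makes a typo in the weights visible immediately.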

- **In the study above, we scored the product by weighting each rating according to how recent it is, which is one of the methods used to score products sold in a marketplace.**

- **The value of this approach is that it lets us measure how much the most recent ratings affect the product score.**

# User-Based Weighted Average 
- User quality

- In the previous section, we gave more weight to the most recent ratings.

- **Question:** Should all user ratings carry the same weight? Should the score from someone who watched the whole course count the same as the score from someone who watched 1 percent of it?
- One user may have opened the course and closed it right away, while another watched all of it. Should their scores be weighted equally? This is our starting point.
- We assume they should not, so we weight ratings differently according to how much of the course the user has watched.

- In other words, we want to weight the ratings according to course progress.

In [27]:
df.head()

Unnamed: 0,Rating,Timestamp,Enrolled,Progress,Questions Asked,Questions Answered,days
0,5.0,2021-02-05 07:45:55,2021-01-25 15:12:08,5.0,0.0,0.0,4
1,5.0,2021-02-04 21:05:32,2021-02-04 20:43:40,1.0,0.0,0.0,5
2,4.5,2021-02-04 20:34:03,2019-07-04 23:23:27,1.0,0.0,0.0,5
3,5.0,2021-02-04 16:56:28,2021-02-04 14:41:29,10.0,0.0,0.0,5
4,4.0,2021-02-04 15:00:24,2020-10-13 03:10:07,10.0,0.0,0.0,5


In [28]:
# Let's look at the average rating at each level of course progress.

df.groupby("Progress").agg({"Rating": "mean"})

Progress,Rating
0.00000,4.67391
1.00000,4.64269
2.00000,4.65476
3.00000,4.66355
4.00000,4.77733
...,...
94.00000,5.00000
95.00000,4.79412
97.00000,5.00000
98.00000,5.00000


**Weights by course progress**
- **22%** for the average rating of users who watched **10% or less** of the course.
- **24%** for the average rating of users who watched **more than 10% and up to 45%**.
- **26%** for the average rating of users who watched **more than 45% and up to 75%**.
- **28%** for the average rating of users who watched **more than 75%**.

Combining these gives a user-based weighted average.
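To see how this weighting behaves, here is a minimal sketch on toy data (the `toy` frame is made up for illustration and is not the course data set). Two users who barely started the course rate it low, while more engaged users rate it high.

```python
import pandas as pd

# Toy data (hypothetical): progress in percent and the rating each user gave.
toy = pd.DataFrame({
    "Progress": [1, 5, 30, 60, 90, 100],
    "Rating":   [1.0, 2.0, 4.0, 4.5, 5.0, 5.0],
})

plain_mean = toy["Rating"].mean()

# Same bucket edges and percentage weights as in the text.
buckets = [
    (toy["Progress"] <= 10, 22),
    ((toy["Progress"] > 10) & (toy["Progress"] <= 45), 24),
    ((toy["Progress"] > 45) & (toy["Progress"] <= 75), 26),
    (toy["Progress"] > 75, 28),
]

# Weighted sum of the per-bucket mean ratings.
weighted = sum(toy.loc[mask, "Rating"].mean() * w / 100 for mask, w in buckets)

print(round(plain_mean, 4), round(weighted, 4))
```

In this toy case the weighted score is higher than the plain mean because the low ratings come from the least engaged users. Note that the weights apply to bucket means, not to individual ratings, so a bucket with very few ratings still receives its full share of the score.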

In [29]:
# Let's calculate the score weighted by course progress.

df.loc[df["Progress"] <= 10, "Rating"].mean() * 22 / 100 + \
    df.loc[(df["Progress"] > 10) & (df["Progress"] <= 45), "Rating"].mean() * 24 / 100 + \
    df.loc[(df["Progress"] > 45) & (df["Progress"] <= 75), "Rating"].mean() * 26 / 100 + \
    df.loc[(df["Progress"] > 75), "Rating"].mean() * 28 / 100

4.800257704672543

In [30]:
# The same calculation wrapped in a reusable function.

def user_based_weighted_average(dataframe, w1=22, w2=24, w3=26, w4=28):
    return dataframe.loc[dataframe["Progress"] <= 10, "Rating"].mean() * w1 / 100 + \
           dataframe.loc[(dataframe["Progress"] > 10) & (dataframe["Progress"] <= 45), "Rating"].mean() * w2 / 100 + \
           dataframe.loc[(dataframe["Progress"] > 45) & (dataframe["Progress"] <= 75), "Rating"].mean() * w3 / 100 + \
           dataframe.loc[(dataframe["Progress"] > 75), "Rating"].mean() * w4 / 100

In [31]:
# we use the default weights

user_based_weighted_average(df)

4.800257704672543

In [32]:
# if we use different weights

user_based_weighted_average(df, 20, 24, 26, 30)

4.803286469062915

# Weighted Rating

- In the two studies above, we scored the product with the **Time-Based Weighted Average** and the **User-Based Weighted Average** separately.
- Now we will combine both methods into a single product score.

In [33]:
def course_weighted_rating(dataframe, time_w=50, user_w=50):
    return time_based_weighted_average(dataframe) * time_w/100 + user_based_weighted_average(dataframe)*user_w/100

In [34]:
# we use the default weights

course_weighted_rating(df)

4.782641693469868

In [35]:
# if we use different weights

course_weighted_rating(df, time_w=40, user_w=60)

4.786164895710403
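As a worked check, both results can be reproduced from the two scores printed earlier in the notebook. The `combine` function below is a hypothetical stand-alone version of the blend inside `course_weighted_rating`; it takes the two scores directly instead of a dataframe.

```python
# Scores copied from the notebook outputs above (default weights in both functions).
time_based = 4.765025682267194   # time_based_weighted_average(df)
user_based = 4.800257704672543   # user_based_weighted_average(df)

def combine(time_score, user_score, time_w=50, user_w=50):
    # Percentage-weighted blend of the two scores; time_w + user_w should equal 100.
    return time_score * time_w / 100 + user_score * user_w / 100

equal_blend = combine(time_based, user_based)
user_heavy = combine(time_based, user_based, time_w=40, user_w=60)

print(equal_blend, user_heavy)
```

Shifting weight toward the user-based score raises the result slightly, because the user-based score is the higher of the two for this course.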