<div style="border-radius:10px; border:#4E5672 solid; padding: 15px; background-color: #F8F1E8; font-size:100%; text-align:left">

<h3 align="left"><font color='#4E5672'>📝 Description:</font></h3>

* I will apply several scoring methods to the US trending videos data and sort the videos by each score.
    1. Way 1 - Difference between Like and Dislike 
    1. Way 2 - Percentage of Frequency
    1. Way 3 - The Wilson Lower Bound Score
    1. Way 4 - Custom Weights on Different Parameters
    1. Way 5 - Weighted Rating
    

In [89]:
import pandas as pd
import numpy as np
import scipy.stats as st
import math
from sklearn.preprocessing import MinMaxScaler

In [54]:
df = pd.read_csv("/kaggle/input/youtube-new/USvideos.csv", 
                 usecols=["video_id","views","likes","dislikes","comment_count"])
df.head()

index,video_id,views,likes,dislikes,comment_count
0,2kyS6SvSYSE,748374,57527,2966,15954
1,1ZAPwfrtAFY,2418783,97185,6146,12703
2,5qpjK5DgCt4,3191434,146033,5339,8181
3,puqaWrEC7tY,343168,10172,666,2146
4,d380meD0W0M,2095731,132235,1989,17518


<div style="border-radius:10px; border:#6B8BA0 solid; padding: 15px; background-color: #F2EADF; font-size:100%; text-align:left">

<h3 align="left"><font color='#6B8BA0'>👀 Features: </font></h3>
  


1. **`video_id`:**
   - A string that uniquely identifies the video.

2. **`views`:**
   - Indicates the number of times a video has been viewed. This number increases with each view of the video.

3. **`likes`:**
   - Represents the number of likes received for the video. As users like a video, this number increases.

4. **`dislikes`:**
   - Represents the number of dislikes received for the video. This number increases when users express dislike for the video.

5. **`comment_count`:**
   - Indicates the number of comments made under the video. This is influenced by users commenting on the video.

These features can be utilized to measure and rank the performance of a YouTube video. For instance, the `views` (number of views) feature can be a significant indicator of a video's popularity. The `likes` and `dislikes` features can reflect the overall satisfaction of users with the video content. The `comment_count` measures the tendency of viewers to comment under the video.


In [55]:
print(f"Dropped Duplicated {df.duplicated().sum()} Rows")
df.drop_duplicates(inplace=True)

Dropped Duplicated 48 Rows


In [39]:
df.isnull().sum()

video_id         0
views            0
likes            0
dislikes         0
comment_count    0
dtype: int64

<div style="border-radius:10px; border:#3B3C37 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

<h3 align="left"><font color='#545450'>📝 Notes</font></h3>

* We only need `views`, `likes`, `dislikes` and `comment_count` for simple sorting algorithms.
* We dropped duplicate rows and checked that there are no null values.

In [20]:
df.sort_values(by="views", ascending=False).head(10)

index,video_id,views,likes,dislikes,comment_count
38547,VYOjWnS4cMY,225211923,5023450,343541,517232
38345,VYOjWnS4cMY,220490543,4962403,338105,512337
38146,VYOjWnS4cMY,217750076,4934188,335462,509799
37935,VYOjWnS4cMY,210338856,4836448,326902,501722
37730,VYOjWnS4cMY,205643016,4776680,321493,496211
37531,VYOjWnS4cMY,200820941,4714942,316129,491005
37333,VYOjWnS4cMY,196222618,4656929,311042,485797
37123,VYOjWnS4cMY,190950401,4594931,305435,479917
36913,VYOjWnS4cMY,184446490,4512326,298157,473039
36710,VYOjWnS4cMY,179045286,4437175,291098,466470


<div style="border-radius:10px; border:#3B3C37 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

<h3 align="left"><font color='#545450'>📝 Notes</font></h3>

* It looks like we have several counts for the same video at different times; let's get only the maximum counts for each video.

In [56]:
df = df.groupby("video_id").agg({"views":"max",
                                "likes":"max",
                                "dislikes":"max",
                                "comment_count": "max"})
df.sort_values(by="views", ascending=False).head(10)

video_id,views,likes,dislikes,comment_count
VYOjWnS4cMY,225211923,5023450,343541,517232
FlsCjmMhFmw,149376127,3093544,1643059,827755
ffxKSjUwKdU,148689896,3094021,129502,242039
zEf423kYfqk,139334502,1425496,119798,83941
7C2z4GqqS5E,123010920,5613827,206892,1228655
M4ZoCHID9GI,122544931,1427436,40837,55320
TyHvyGVs42U,102012605,2376636,117196,134224
xTlNMmZKwpA,94254507,1816753,102474,101077
6ZfuNTqbHE8,91933007,2625661,53709,350458
-BQJo3vK8O8,87264467,815369,71494,35945


<div style="border-radius:10px; border:#3B3C37 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

<h3 align="left"><font color='#545450'>📝 Notes</font></h3>

* Let's try simple sorting algorithms first.

In [41]:
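def score_pos_neg_diff(col1, col2):
    # Way 1: raw difference between positive and negative votes
    return col1 - col2

def score_average_rating(col1, col2):
    # Way 2: share of positive votes; 0 when there are no votes at all
    if col1 + col2 == 0:
        return 0
    return col1 / (col1 + col2)

def wilson_lower_bound(up, down, confidence=0.95):
    # Way 3: lower bound of the Wilson score interval for the like ratio
    n = up + down
    if n == 0:
        return 0
    z = st.norm.ppf(1 - (1 - confidence) / 2)  # two-sided z value
    phat = 1.0 * up / n  # observed share of likes
    return (phat + z * z / (2 * n) - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)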

In [57]:
df["score_like_disslike"] = score_pos_neg_diff(df["likes"], df["dislikes"])
df.sort_values(by="score_like_disslike", ascending=False).head(10)


video_id,views,likes,dislikes,comment_count,score_like_disslike
7C2z4GqqS5E,123010920,5613827,206892,1228655,5406935
VYOjWnS4cMY,225211923,5023450,343541,517232,4679909
ffxKSjUwKdU,148689896,3094021,129502,242039,2964519
kTlv5_Bs8aw,36857298,2729292,47896,546100,2681396
p8npDG2ulKQ,29741771,2700800,29341,371864,2671459
OK3GJ0WIQ8s,23416810,2672431,29088,477233,2643343
6ZfuNTqbHE8,91933007,2625661,53709,350458,2571952
aJOTlE1K90k,66529577,2488565,43464,142410,2445101
TyHvyGVs42U,102012605,2376636,117196,134224,2259440
kX0vO4vlJuU,17624672,2162749,18342,227648,2144407


<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #FFF0F4; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>🔍 Way 1 - Difference between Like and Dislike </font></h3>

* When we rank by likes minus dislikes, videos that have accumulated a large number of votes dominate the top of the list, which tends to favor older videos.
* This score captures frequency (raw counts) but not percentage (the share of likes).
    
<h3 align="left"><font color='#5E5273'>🎞️ Rank 1 : BTS - 'FAKE LOVE'
</font></h3>

    
<center><a href="http://www.youtube.com/watch?feature=player_embedded&v=7C2z4GqqS5E" target="_blank">
 <img src="http://img.youtube.com/vi/7C2z4GqqS5E/mqdefault.jpg" alt="Watch the video" width="540" />
    </a></center>

In [58]:
df["score_average_rating"] = df.apply(lambda x: score_average_rating(x["likes"],
                                                                     x["dislikes"]), axis=1)
df.sort_values(by="score_average_rating", ascending=False).head(10)

video_id,views,likes,dislikes,comment_count,score_like_disslike,score_average_rating
p7KHGUwqF24,4351,10,0,0,10,1.0
yirvgC-kMq0,3311,42,0,2,42,1.0
yq4mgb1PDTI,10345,39,0,0,39,1.0
Eafi9hqm3OY,5817,22,0,1,22,1.0
a9i2wTxsxV0,6122,16,0,8,16,1.0
OmM425PFd3Y,1402,20,0,0,20,1.0
aHjS9YBXzXU,5454,60,0,4,60,1.0
aHsfKnrNCG4,3896,9,0,1,9,1.0
3rhw4KgcvFM,1426,14,0,0,14,1.0
k-w8GxCpsCU,3491,19,0,2,19,1.0


<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #FFF0F4; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>🔍 Way 2 - Percentage of Frequency </font></h3>

* When we divide the number of likes by the total number of likes and dislikes, videos with very few votes reach the top of the list: a video with 10 likes and no dislikes gets a perfect score of 1.0. This tends to favor new, less watched videos.
* This score captures percentage but not frequency.
    
<h3 align="left"><font color='#5E5273'>🎞️ Rank 2 : Hugh Jackman On Keeping His 21-Year Marriage Strong</font></h3>
    
<center><a href="http://www.youtube.com/watch?feature=player_embedded&v=yirvgC-kMq0" target="_blank">
 <img src="http://img.youtube.com/vi/yirvgC-kMq0/mqdefault.jpg" alt="Watch the video" width="540" />
    </a></center>    
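
To make the contrast between Way 1 and Way 2 concrete, here is a minimal sketch that uses three rows copied from the output tables above. It re-implements the two scoring functions in plain Python; the dictionary `sample` and the helper variable names are illustrative and not part of the notebook.

```python
# (likes, dislikes) copied from the notebook's output tables
sample = {
    "7C2z4GqqS5E": (5613827, 206892),  # first in the like-dislike difference list
    "p8npDG2ulKQ": (2700800, 29341),   # fifth in the like-dislike difference list
    "p7KHGUwqF24": (10, 0),            # first in the like-ratio list
}

def score_pos_neg_diff(likes, dislikes):
    # Way 1: raw difference
    return likes - dislikes

def score_average_rating(likes, dislikes):
    # Way 2: share of likes, 0 when there are no votes
    total = likes + dislikes
    return 0 if total == 0 else likes / total

diff = {vid: score_pos_neg_diff(*votes) for vid, votes in sample.items()}
ratio = {vid: score_average_rating(*votes) for vid, votes in sample.items()}

# Sort the same videos by each score, best first
by_diff = sorted(sample, key=diff.get, reverse=True)
by_ratio = sorted(sample, key=ratio.get, reverse=True)

print(by_diff)
print(by_ratio)
```

On this small sample the two orderings are exact opposites: the difference rewards sheer vote volume, while the ratio puts the video with only 10 votes first.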

In [59]:
df["wilson_lower_bound"] = df.apply(lambda x: wilson_lower_bound(x["likes"],
                                                                 x["dislikes"]), axis=1)
df.sort_values(by="wilson_lower_bound", ascending=False).head(10)

video_id,views,likes,dislikes,comment_count,score_like_disslike,score_average_rating,wilson_lower_bound
p1af9PKM8Eo,43597,5975,4,436,5971,0.999331,0.998281
ONI_06wGbsQ,180546,29537,48,2409,29489,0.998378,0.99785
CFwXUarN-wg,83200,15262,26,881,15236,0.998299,0.997509
6ixU_vdE0Es,280065,20289,38,939,20251,0.998131,0.997435
Q48VduIflPk,2997335,381809,963,23925,380846,0.997484,0.99732
J41qe-TM1DY,7200045,1021328,2716,117897,1018612,0.997348,0.997246
floMqK_yHf8,2412639,382528,1063,27385,381465,0.997229,0.997057
8Jmd7-1quDM,98413,7471,14,313,7457,0.99813,0.996863
X5YJU6_Mfpg,141366,15728,37,1734,15691,0.997653,0.996767
KyB0zVHm2P0,1365473,100090,290,2427,99800,0.997111,0.996759


<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #FFF0F4; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>🔍 Way 3 -  The Wilson Lower Bound Score</font></h3>

* The Wilson Lower Bound score seems fair: it is the lower bound of a 95% confidence interval for the like ratio, so a video with few votes is penalized for uncertainty regardless of its age.
* Frequency and percentage are taken into account together.
    
<h3 align="left"><font color='#5E5273'>🎞️ Rank 1 : Jonghyun "Lonely" - Piano Cover</font></h3>

    
<center><a href="http://www.youtube.com/watch?feature=player_embedded&v=p1af9PKM8Eo" target="_blank">
 <img src="http://img.youtube.com/vi/p1af9PKM8Eo/mqdefault.jpg" alt="Watch the video" width="540" />
    </a></center>
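
The effect of the vote count on the Wilson score can be checked on two rows from the tables above. This is a minimal sketch of the same formula; as an assumption made to keep it dependency-free, it takes the z value from the standard library's `statistics.NormalDist` instead of `st.norm.ppf`.

```python
import math
from statistics import NormalDist

def wilson_lower_bound(up, down, confidence=0.95):
    n = up + down
    if n == 0:
        return 0.0
    # Two-sided z value; NormalDist().inv_cdf stands in for scipy's st.norm.ppf
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = up / n  # observed share of likes
    centre = phat + z * z / (2 * n)
    margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)

# (likes, dislikes) copied from the notebook's output tables
few_votes = wilson_lower_bound(10, 0)     # p7KHGUwqF24: like ratio 1.0 from 10 votes
many_votes = wilson_lower_bound(5975, 4)  # p1af9PKM8Eo: like ratio 0.999331 from 5979 votes

print(round(few_votes, 4), round(many_votes, 6))
```

The video with a perfect ratio but only 10 votes drops to roughly 0.72, while the video with 5979 votes keeps a score close to its raw ratio, which is why it ranks first under Way 3.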

In [63]:
view_score = MinMaxScaler(feature_range=(1, 10)).fit_transform(df[["views"]])
likes_score = MinMaxScaler(feature_range=(1, 10)).fit_transform(df[["likes"]])
dislikes_score = MinMaxScaler(feature_range=(1, 10)).fit_transform(df[["dislikes"]])
comment_count_score = MinMaxScaler(feature_range=(1, 10)).fit_transform(df[["comment_count"]])

df["custom_weight"] = (view_score * 10/100 + \
                       likes_score * 40/100 + \
                       dislikes_score * 30/100 + \
                       comment_count_score * 20/100)
df.sort_values(by="custom_weight", ascending=False).head(10)

video_id,views,likes,dislikes,comment_count,score_like_disslike,score_average_rating,wilson_lower_bound,custom_weight
FlsCjmMhFmw,149376127,3093544,1643059,827755,1450485,0.653114,0.652686,7.324467
7C2z4GqqS5E,123010920,5613827,206892,1228655,5406935,0.964456,0.964305,7.049467
QwZT7T-TXT0,37539570,1402578,1674420,1361580,-271842,0.455827,0.45527,6.549452
VYOjWnS4cMY,225211923,5023450,343541,517232,4679909,0.93599,0.935783,6.359143
ffxKSjUwKdU,148689896,3094021,129502,242039,2964519,0.959826,0.959611,4.107109
oWjxSkJpxFU,24286474,1988746,497847,658130,1490899,0.799788,0.79929,4.045204
kTlv5_Bs8aw,36857298,2729292,47896,546100,2681396,0.982754,0.9826,3.696685
6ZfuNTqbHE8,91933007,2625661,53709,350458,2571952,0.979955,0.979786,3.601061
OK3GJ0WIQ8s,23416810,2672431,29088,477233,2643343,0.989233,0.989109,3.48514
p8npDG2ulKQ,29741771,2700800,29341,371864,2671459,0.989253,0.98913,3.38972


<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #FFF0F4; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>🔍 Way 4 - Custom Weights on Different Parameters</font></h3>

* Here, we scaled `views`, `likes`, `dislikes` and `comment_count` to the range 1 to 10 and combined them with custom weights of 10%, 40%, 30% and 20%. Because dislikes are added with a positive weight, they count as engagement rather than as a penalty, so a heavily disliked video can still reach the top of the list, as rank 3 shows (it has more dislikes than likes).
    
<h3 align="left"><font color='#5E5273'>🎞️ Rank 1 : YouTube Rewind: The Shape of 2017</font></h3>

    
<center><a href="http://www.youtube.com/watch?feature=player_embedded&v=FlsCjmMhFmw" target="_blank">
 <img src="http://img.youtube.com/vi/FlsCjmMhFmw/mqdefault.jpg" alt="Watch the video" width="540" />
    </a></center>
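
The scaling and weighting step can be sketched in plain Python. This is a minimal illustration under stated assumptions: `min_max_scale` is a hypothetical stand-in for `MinMaxScaler(feature_range=(1, 10))`, and it is fitted only on four rows copied from the tables above, so the resulting numbers differ from the notebook's `custom_weight` column.

```python
# (views, likes, dislikes, comment_count) copied from the notebook's output tables
sample = {
    "FlsCjmMhFmw": (149376127, 3093544, 1643059, 827755),
    "7C2z4GqqS5E": (123010920, 5613827, 206892, 1228655),
    "QwZT7T-TXT0": (37539570, 1402578, 1674420, 1361580),
    "kTlv5_Bs8aw": (36857298, 2729292, 47896, 546100),
}
# Weights from the notebook cell: views, likes, dislikes, comment_count
WEIGHTS = (0.10, 0.40, 0.30, 0.20)

def min_max_scale(values, low=1.0, high=10.0):
    # Map the smallest value to `low` and the largest to `high`
    lo, hi = min(values), max(values)
    if hi == lo:
        return [low for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

# Scale each feature column separately, then regroup the values per video
columns = list(zip(*sample.values()))
scaled_rows = list(zip(*(min_max_scale(col) for col in columns)))

# Per-feature contribution and total score for each video
contributions = {
    vid: [w * s for w, s in zip(WEIGHTS, row)]
    for vid, row in zip(sample, scaled_rows)
}
custom_weight = {vid: sum(parts) for vid, parts in contributions.items()}

for vid, parts in contributions.items():
    print(vid, [round(p, 3) for p in parts], round(custom_weight[vid], 3))
```

Printing the per-feature contributions shows where each score comes from. For `QwZT7T-TXT0`, the dislike term is the largest of its four contributions, because it has the most dislikes in the sample and the dislike weight is positive.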

In [64]:
def weighted_rating(r, v, M, C):
    return (v / (v + M) * r) + (M / (v + M) * C)

In [67]:
average = (df["likes"] + df["dislikes"])/df.shape[0]
vote_count = df["likes"] + df["dislikes"]
M = 2500
C = average.mean()
df["degree"] = weighted_rating(average, vote_count, M, C)
df.sort_values(by="degree", ascending=False).head(10)

video_id,views,likes,dislikes,comment_count,score_like_disslike,score_average_rating,wilson_lower_bound,custom_weight,degree
7C2z4GqqS5E,123010920,5613827,206892,1228655,5406935,0.964456,0.964305,7.049467,916.114829
VYOjWnS4cMY,225211923,5023450,343541,517232,4679909,0.93599,0.935783,6.359143,844.673201
FlsCjmMhFmw,149376127,3093544,1643059,827755,1450485,0.653114,0.652686,7.324467,745.415728
ffxKSjUwKdU,148689896,3094021,129502,242039,2964519,0.959826,0.959611,4.107109,507.175316
QwZT7T-TXT0,37539570,1402578,1674420,1361580,-271842,0.455827,0.45527,6.549452,484.104501
kTlv5_Bs8aw,36857298,2729292,47896,546100,2681396,0.982754,0.9826,3.696685,436.898606
p8npDG2ulKQ,29741771,2700800,29341,371864,2671459,0.989253,0.98913,3.38972,429.490946
OK3GJ0WIQ8s,23416810,2672431,29088,477233,2643343,0.989233,0.989109,3.48514,424.984347
6ZfuNTqbHE8,91933007,2625661,53709,350458,2571952,0.979955,0.979786,3.601061,421.496938
aJOTlE1K90k,66529577,2488565,43464,142410,2445101,0.982834,0.982674,3.120069,398.297807


<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #FFF0F4; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>🔍 Way 5 - Weighted Rating </font></h3>

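* Here, we calculated a weighted rating that pulls each video's value toward the dataset-wide mean `C`, using `M = 2500` as the vote threshold. Note that `average` in this cell is each video's total votes divided by the number of videos, not a like ratio, so the resulting order closely follows the total number of likes and dislikes.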
    
<h3 align="left"><font color='#5E5273'>🎞️ Rank 1 : BTS - 'FAKE LOVE'
</font></h3>

    
<center><a href="http://www.youtube.com/watch?feature=player_embedded&v=7C2z4GqqS5E" target="_blank">
 <img src="http://img.youtube.com/vi/7C2z4GqqS5E/mqdefault.jpg" alt="Watch the video" width="540" />
    </a></center>
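
For comparison, the same `weighted_rating` formula is commonly used with `r` set to a per-item rating rather than a vote count. The sketch below is a variant under that assumption, not what the cell above computes: it uses the like ratio as `r` and a hypothetical prior `C = 0.9` chosen only for illustration (the notebook's `C` is `average.mean()`).

```python
def weighted_rating(r, v, M, C):
    # Same formula as the notebook: blend r with the prior C according to vote count v
    return (v / (v + M)) * r + (M / (v + M)) * C

M = 2500  # vote threshold, as in the notebook
C = 0.9   # hypothetical prior like ratio, assumed for this sketch only

# (likes, dislikes) copied from the notebook's output tables
sample = {
    "7C2z4GqqS5E": (5613827, 206892),
    "p7KHGUwqF24": (10, 0),
}

scores = {}
for vid, (likes, dislikes) in sample.items():
    votes = likes + dislikes
    like_ratio = likes / votes
    scores[vid] = weighted_rating(like_ratio, votes, M, C)

for vid, score in scores.items():
    print(vid, round(score, 4))
```

With these assumptions the video with only 10 votes is pulled almost entirely to the prior `C`, while the video with millions of votes keeps a score close to its own like ratio.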