# Thesis project notebook 

##### Uncomment and run the cell below to automatically install the required libraries

In [1]:
# %pip install -r requirements.txt

Start by importing the required libraries

In [2]:
import json
import os
import random
from multiprocessing import Pool
from itertools import repeat
import pandas as pd
import utils
import langdetect as ld

Since the real analysis is going to take a long time to complete, we'll start with a toy example which takes about a minute. 

Let's load comments from a json file and see how many there are

In [3]:
from get_video import load_comments

comments = load_comments("comments_4DCbZJh-Gk4.json")
print(f"There are {len(comments)} comments in this dataset")
print(comments[:10])

Reading from local file <comments_4DCbZJh-Gk4.json>
There are 100 comments in this dataset
['Outstanding!', 'my lord i imagine a war space i love this emg videos john browne are the best ever   ...ever suppppport  djent\nim from argentina say love metal instrumental', "1:31 that switch up is fucking timeless. I've known this song for about 5 years now and it still never gets old.", 'And yet we still search for proof of alien life 🤷🏻\u200d♂️', 'After almost 7 years this playthrough still brings tears to my eyes. So good and hope to see Monuments live again soon!', 'Djesus Christ', 'Masterpiece!!', 'god DAMN that part beginning at 1:58 is SOOO fucking good like holy FUCK man', 'All finger breaker chords. All the time.', 'Bruh']


##### Now we can start analyzing them!

We will analyze the comments with BERT first and save the results to a list. We pass all the comments at once because the BERT classifier pipeline works better when given a list of comments rather than a single comment

In [4]:
bert_scores = utils.bert_classifier(comments)

Now we analyze the comments using Vader and Textblob

In [5]:
df_list = []

# Score every comment with both models, spreading the work across CPU cores
with Pool() as p:
    vader_scores = p.map(utils.vader_classifier, comments)
    textblob_scores = p.map(utils.textblob_classifier, comments)

# Build one row per comment
for comment, v_score, t_score in zip(comments, vader_scores, textblob_scores):
    df_dict = {
        "Comment": comment,
        "Vader score": v_score,
        "TextBlob score": t_score,
    }
    df_list.append(df_dict)

df = pd.DataFrame(df_list)

We should merge the results of all three analyses into a single dataframe by adding the BERT scores as a new column

In [6]:
df = pd.DataFrame(df_list)

# Add the BERT scores next to the Vader and TextBlob scores
df['BERT score'] = pd.Series(bert_scores)

Let's inspect the dataframe. Start by looking at the first 10 rows

In [7]:
df.head(10)

Unnamed: 0,Comment,Vader score,TextBlob score,BERT score
0,Outstanding!,0.6476,0.625,0.896599
1,my lord i imagine a war space i love this emg ...,0.8658,0.666667,0.894711
2,1:31 that switch up is fucking timeless. I've ...,0.0,-0.25,-0.551378
3,And yet we still search for proof of alien lif...,0.0,-0.25,0.367565
4,After almost 7 years this playthrough still br...,0.7428,0.435227,0.736152
5,Djesus Christ,0.0,0.0,0.272853
6,Masterpiece!!,0.6892,0.0,0.879835
7,god DAMN that part beginning at 1:58 is SOOO f...,-0.1518,0.15,-0.496737
8,All finger breaker chords. All the time.,0.0,0.0,0.759309
9,Bruh,0.0,0.0,-0.386926


Scores are useful for computation, but not so much for legibility. It would be good to have labels so we know what the scores mean exactly. The Vader team has ranges defined on their GitHub repository, so assigning labels is going to be very easy. BERT returns a label, but it's not formatted in a way that we like. TextBlob doesn't return labels at all, and ideal ranges are not defined in the documentation. Let's add some labels now

In [8]:
vader_labels = utils.generate_labels(df["Vader score"], "vader")
blob_labels = utils.generate_labels(df["TextBlob score"], "blob")
bert_labels = utils.generate_labels(df["BERT score"], "bert")

vader_labels_series = pd.Series(vader_labels)
blob_labels_series = pd.Series(blob_labels)
bert_labels_series = pd.Series(bert_labels)

df['Vader label'] = vader_labels_series
df['TextBlob label'] = blob_labels_series
df['BERT label'] = bert_labels_series
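As a rough illustration of what a score-to-label mapping can look like, the sketch below applies the ranges suggested in the VADER README (a compound score of 0.05 or more is positive, -0.05 or less is negative, anything in between is neutral). The function name `label_from_score` is hypothetical, and this is not necessarily how `utils.generate_labels` is implemented.

```python
def label_from_score(score: float, threshold: float = 0.05) -> str:
    """Map a compound score in [-1, 1] to a sentiment label.

    The default threshold follows the ranges suggested in the VADER README;
    the function itself is a hypothetical stand-in for utils.generate_labels.
    """
    if score >= threshold:
        return "Positive"
    if score <= -threshold:
        return "Negative"
    return "Neutral"


# Vader scores taken from three of the rows shown earlier
example_scores = [0.6476, 0.0, -0.1518]
example_labels = [label_from_score(s) for s in example_scores]
print(example_labels)
```

TextBlob and BERT would need their own thresholds, since no recommended ranges are documented for them.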

Let's rearrange the columns and see what it looks like

In [9]:
df = df[["Comment","Vader score", "Vader label", "TextBlob score", "TextBlob label", "BERT score", "BERT label"]]

df.head(10)

Unnamed: 0,Comment,Vader score,Vader label,TextBlob score,TextBlob label,BERT score,BERT label
0,Outstanding!,0.6476,Positive,0.625,Positive,0.896599,Positive
1,my lord i imagine a war space i love this emg ...,0.8658,Positive,0.666667,Positive,0.894711,Positive
2,1:31 that switch up is fucking timeless. I've ...,0.0,Neutral,-0.25,Negative,-0.551378,Negative
3,And yet we still search for proof of alien lif...,0.0,Neutral,-0.25,Negative,0.367565,Positive
4,After almost 7 years this playthrough still br...,0.7428,Positive,0.435227,Positive,0.736152,Positive
5,Djesus Christ,0.0,Neutral,0.0,Netural,0.272853,Netural
6,Masterpiece!!,0.6892,Positive,0.0,Netural,0.879835,Positive
7,god DAMN that part beginning at 1:58 is SOOO f...,-0.1518,Negative,0.15,Netural,-0.496737,Negative
8,All finger breaker chords. All the time.,0.0,Neutral,0.0,Netural,0.759309,Positive
9,Bruh,0.0,Neutral,0.0,Netural,-0.386926,Negative


And now let's check the stats for our numerical columns 

In [10]:
print(df.describe())

       Vader score  TextBlob score  BERT score
count   100.000000      100.000000  100.000000
mean      0.245066        0.177426    0.246427
std       0.460005        0.421838    0.564940
min      -0.726200       -1.000000   -0.967942
25%       0.000000        0.000000   -0.289669
50%       0.080000        0.129018    0.385416
75%       0.682900        0.482924    0.723057
max       0.957100        1.000000    0.951091


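The score distributions look broadly similar, but summary statistics cannot tell us whether the models agree on individual comments. We should check this mathematically using the Pearson correlation between each pair of score columns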

In [11]:
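# numeric_only=True skips the comment and label columns; recent pandas versions raise an error otherwise
print(df.corr(method='pearson', numeric_only=True))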

                Vader score  TextBlob score  BERT score
Vader score        1.000000        0.667741    0.465176
TextBlob score     0.667741        1.000000    0.561260
BERT score         0.465176        0.561260    1.000000
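For reference, the Pearson coefficient is the covariance of two columns divided by the product of their standard deviations. Below is a minimal sketch of that computation using only the standard library. The helper name `pearson` is ours, and the inputs are just the first five Vader and TextBlob scores shown earlier, so the result will differ from the figure for all 100 comments.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


vader_sample = [0.6476, 0.8658, 0.0, 0.0, 0.7428]
textblob_sample = [0.625, 0.666667, -0.25, -0.25, 0.435227]
r = pearson(vader_sample, textblob_sample)
print(f"r = {r:.3f}")
```

A value near 1 means the two models rank comments similarly, a value near 0 means their scores are unrelated.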


## Real analysis

Now we have an idea of what our analysis is going to look like. However, one video is not enough. Let's build a bigger dataset using the top 100 most searched terms on YouTube

In [12]:
terms = pd.read_csv('search.tsv', sep='\t') # obtained from https://ahrefs.com/blog/top-youtube-searches/
terms.head(10)

Unnamed: 0,#,Keyword,Search Volume
0,1,bts,16723304
1,2,pewdiepie,16495659
2,3,asmr,14655088
3,4,billie eilish,13801247
4,5,baby shark,12110100
5,6,old town road,10456524
6,7,music,10232134
7,8,badabun,10188997
8,9,blackpink,9580131
9,10,fortnite,9117342


We can use these search terms to get YouTube video links which we can then use to gather comments. To save some time (and some precious API quota) we'll use the top 50 searched terms

This cell should not be run more than once a month. To make sure you never have to run it again, save the output to a file and load it for subsequent runs

In [13]:
# from get_video import get_videos
# half = terms.head(50)
# search_terms = half["Keyword"]
# urls = []

# for term in search_terms:
#     urls.append(get_videos(term))

# links = list(set(item for sublist in urls for item in sublist))

# with open("videos__.json", "w") as f:
#     json.dump(links, f)

In [14]:
with open("videos__.json", "r") as f:
    links = json.load(f)

print(f"This file contains <{len(links)}> links")

This file contains <911> links


This is a lot of links, and therefore a lot of data. Many of these videos have at least 20,000 comments, and it takes this program around 30 seconds to analyze 100 comments, so analyzing every comment would take days.

Instead of waiting for 9 days, we can make a compromise: we can take a sample of 500 comments per video.
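One way to draw such a sample is `random.sample`, which picks items without replacement. This is only a sketch: the helper name `sample_comments` and the fixed seed are our assumptions, not part of the project code.

```python
import random


def sample_comments(comments, k=500, seed=None):
    """Return up to k comments chosen uniformly at random without replacement."""
    rng = random.Random(seed)  # a local generator leaves the global random state untouched
    if len(comments) <= k:
        return list(comments)  # fewer comments than requested: keep them all
    return rng.sample(comments, k)


all_comments = [f"comment {i}" for i in range(20_000)]  # stand-in for one video's comments
subset = sample_comments(all_comments, k=500, seed=42)
print(len(subset))
```

Fixing the seed makes the sample reproducible, which helps when the analysis has to be rerun.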

Now that we have our list of links, we can start collecting data. Let's do the fast part first: collecting the like/dislike ratio and saving it somewhere convenient for later. 

As before, this cell should ideally be run only once, as it takes around 15 minutes. If the notebook needs to be restarted, you should load the csv file instead.

In [15]:
# from get_video import get_likes
# import time
# start_time = time.time()

# with open("videos__.json", "r") as f:
#     links = json.load(f)
# df_list = []

# checkpoints = {int(len(links) * perc): perc * 100 for perc in [0.25, 0.5, 0.75, 0.9, 1]} # same as before

# for i, link in enumerate(links):
#     current_time = time.time()
#     elapsed_time = current_time - start_time
#     data = get_likes(link)
#     df_list.append(data)
#     time.sleep(1) # otherwise we will get an HTTP error
#     if i + 1 in checkpoints:
#             print(f"{checkpoints[i+1] :.0f}% of the videos have been analyzed\nTime elapsed: {elapsed_time :.2f} seconds")  
     
# df = pd.DataFrame(df_list)
# utils.save_analysis(df, "top50_ratio")

Let's look at the first 10 entries

In [16]:
likes_df = pd.read_csv("analysis_top50_ratio.csv", index_col = 0)
likes_df = likes_df.reset_index()
likes_df.head(10)

Unnamed: 0,URL,likes,dislikes
0,https://www.youtube.com/watch?v=feLcMaiClOw,312761,19093
1,https://www.youtube.com/watch?v=tASwv-tkMlc,35026,165
2,https://www.youtube.com/watch?v=3U2dNKBM28o,126427,6010
3,https://www.youtube.com/watch?v=1cPDfXU95Xw,23783,18
4,https://www.youtube.com/watch?v=pptIU_ZLlOo,4600,7
5,https://www.youtube.com/watch?v=ClQ-ymoXJZc,1079676,8874
6,https://www.youtube.com/watch?v=8EJ3zbKTWQ8,11371069,1420624
7,https://www.youtube.com/watch?v=t99KH0TR-J4,1435738,33092
8,https://www.youtube.com/watch?v=DovdIspaqmw,146007,5324
9,https://www.youtube.com/watch?v=DQQRjFzB8gY,870202,49922


This is good data, but it is not complete. We are still missing the ratio of likes to dislikes. Let's compute that now!

In [17]:
likes, dislikes = likes_df['likes'], likes_df['dislikes']

ratios = []

for like, dislike in zip(likes, dislikes):
    # The ratio cannot be computed without dislikes, so such videos get a ratio of 0
    if like == 0 or dislike == 0:
        ratio = 0
    else:
        ratio = round(like / dislike, 3)
    ratios.append(ratio)

ratios = pd.Series(ratios)
ratios.head(10)

0      16.381
1     212.279
2      21.036
3    1321.278
4     657.143
5     121.667
6       8.004
7      43.386
8      27.424
9      17.431
dtype: float64
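The same computation can also be written without an explicit loop. The sketch below is an alternative that keeps the convention used in the loop (a ratio of 0 whenever likes or dislikes is 0). The first three sample rows are copied from the table shown earlier, and the last row is made up to show the edge case.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "likes": [312761, 35026, 23783, 10],  # last row is made up for the edge case
        "dislikes": [19093, 165, 18, 0],
    }
)

# Rows where the ratio is defined (both counts are non-zero)
valid = (sample["likes"] != 0) & (sample["dislikes"] != 0)

# Replace the divisor by 1 on undefined rows so the division never produces inf,
# then overwrite those rows with 0, matching the loop-based version
safe_dislikes = sample["dislikes"].where(valid, 1)
sample["ratio"] = (sample["likes"] / safe_dislikes).round(3).where(valid, 0)
print(sample)
```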

Let's add these ratios to our dataframe and compute the mean ratio

In [18]:
likes_df['ratio'] = ratios.astype('float64') 
mean_ratio = ratios.mean()
print(f"Mean like/dislike ratio = {mean_ratio :.2f}")
likes_df.head(10)

Mean like/dislike ratio = 123.55


Unnamed: 0,URL,likes,dislikes,ratio
0,https://www.youtube.com/watch?v=feLcMaiClOw,312761,19093,16.381
1,https://www.youtube.com/watch?v=tASwv-tkMlc,35026,165,212.279
2,https://www.youtube.com/watch?v=3U2dNKBM28o,126427,6010,21.036
3,https://www.youtube.com/watch?v=1cPDfXU95Xw,23783,18,1321.278
4,https://www.youtube.com/watch?v=pptIU_ZLlOo,4600,7,657.143
5,https://www.youtube.com/watch?v=ClQ-ymoXJZc,1079676,8874,121.667
6,https://www.youtube.com/watch?v=8EJ3zbKTWQ8,11371069,1420624,8.004
7,https://www.youtube.com/watch?v=t99KH0TR-J4,1435738,33092,43.386
8,https://www.youtube.com/watch?v=DovdIspaqmw,146007,5324,27.424
9,https://www.youtube.com/watch?v=DQQRjFzB8gY,870202,49922,17.431


We will use the data in this dataframe as the baseline for the sentiment analysis.

The sentiment analysis itself is going to take a long time to complete, even with multiprocessing, so let's start by collecting the comments. The results will be saved to a file, so there's no need to run this cell more than once if everything goes well.

In [19]:
# from get_video import get_id, get_comments
# import googleapiclient.discovery

# urls = list(likes_df["URL"])

# target_dir = f"{os.getcwd()}/Data"
# if not os.path.exists(target_dir):
#     os.mkdir(target_dir)

# for url in urls:
#     video_id = get_id(url)
#     try:
#         comments = get_comments(url)
#     except googleapiclient.errors.HttpError:
#         continue
#     utils.move_dir(f"comments_{video_id}.json", target_dir)

Written <600> comments to <comments_feLcMaiClOw.json>
Moved <comments_feLcMaiClOw.json> to <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments/Data\comments_feLcMaiClOw.json>
Written <600> comments to <comments_tASwv-tkMlc.json>
Moved <comments_tASwv-tkMlc.json> to <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments/Data\comments_tASwv-tkMlc.json>
Written <600> comments to <comments_3U2dNKBM28o.json>
Moved <comments_3U2dNKBM28o.json> to <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments/Data\comments_3U2dNKBM28o.json>
Written <100> comments to <comments_1cPDfXU95Xw.json>
Written <100> comments to <comments_1cPDfXU95Xw.json>
Moved <comments_1cPDfXU95Xw.json> to <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments/Data\comments_1cPDfXU95Xw.json>
Written <0> comments to <comments_pptIU_ZLlOo.json>
Written <0> comments to <comments_pptIU_ZLlOo.json>
Moved <comments_pptIU_ZLlOo.json> to <f:\Deskt

Let's look at one of these files

In [28]:
from get_video import load_comments

target_dir = os.path.join(os.getcwd(), "Data")  # portable across operating systems
data = os.listdir(target_dir)
filename = os.path.join(target_dir, data[2])

print(f"There are <{len(data)}> files in this directory")

comment = load_comments(filename)
print(comment[:10])
print(f"This file contains <{len(comment)}> comments")

There are <823> files in this directory
Reading from local file <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments\Data\comments_-8MQVIHTjIQ.json>
['Cuantos años tendra esta pelicula', 'Como del 83', 'Lo me jor q he visto.', 'Como se llama la musica del minuto 41: 00', 'Ya se volvió loco XD que buena broma 😝', 'Yo la vii hace unos años y desde el 2021 la estaba buscando y al fin la subiste new seguidor.', 'Soy deportista y me encanta peliculas de básquetball😼💘', 'Excelente película me gustó mucho un saludo para vos y tu familia 👏❤✨🎆', 'Ya lavi 100 beses', 'Mejor película']
This file contains <49> comments


The comments in this file are in Spanish, not English, which is problematic for Vader and TextBlob since they are not multilingual models like BERT. We can work around this by translating the comments into English using a reputable translator. Ideally, we would use language-specific models for each language. However, we don't have the time or resources for that, so we will have to use the translator and accept some loss of information.

In [29]:
langs = []
for c in comment:
    try:
        lang = ld.detect(c)
    except ld.lang_detect_exception.LangDetectException:
        lang = "NA"
    langs.append(lang)

with Pool() as p:
    tr_comment = p.starmap(utils.translate, zip(comment, repeat(langs)))

print(tr_comment[:10])

['How old is this movie?', 'As of 83', 'Lo me jor q he visto.', 'What is the name of the music of the minute 41: 00', "He's already gone crazy XD what a good joke 😝", 'I saw it a few years ago and since 2021 I was looking for it and finally you uploaded it new follower.', 'I am sporty and I love basketball movies😼💘.', 'Excellent movie I liked it very much a greeting to you and your family 👏❤✨🎆.', 'Ya lavi 100 beses', 'Best Film']
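The detect-then-translate idea can be sketched offline as follows. Everything here is a hypothetical stand-in: `translate_non_english` and `toy_translate` are not the project's `utils.translate`, and the toy dictionary replaces a real translation service. The sketch pairs each comment with its own detected language and leaves English or undetectable comments untouched.

```python
def translate_non_english(comments, langs, translate):
    """Translate every comment whose detected language is not English.

    `translate` is any callable taking (text, source_language) and returning
    English text; it is injected so the sketch runs without network access.
    """
    translated = []
    for text, lang in zip(comments, langs):
        if lang in ("en", "NA"):
            translated.append(text)  # already English, or language unknown: keep as is
        else:
            translated.append(translate(text, lang))
    return translated


# Toy dictionary-based translator standing in for a real translation service
toy_dictionary = {("Mejor película", "es"): "Best film"}


def toy_translate(text, lang):
    # Fall back to the original text when no translation is known
    return toy_dictionary.get((text, lang), text)


demo_comments = ["Mejor película", "Outstanding!", "😝"]
demo_langs = ["es", "en", "NA"]
demo_result = translate_non_english(demo_comments, demo_langs, toy_translate)
print(demo_result)
```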


In [30]:
bert_scores = utils.bert_classifier(comment)

df_list = []
scores_list = []

with Pool() as p:
    analysis1 = p.map(utils.vader_classifier, tr_comment)
    analysis2 = p.map(utils.textblob_classifier, tr_comment)
    scores_list.append(analysis1)
    scores_list.append(analysis2)

for c, v_score, t_score in zip(comment, scores_list[0], scores_list[1]):
    df_dict = {
        "Comment": c,
        "Vader score": v_score,
        "TextBlob score": t_score,
    }
    df_list.append(df_dict)

df = pd.DataFrame(df_list)

In [31]:
series_bert = pd.Series(bert_scores)
df['BERT score'] = series_bert
utils.save_analysis(df, "test")
df.head(10)

saved to <analysis_test.csv>


Unnamed: 0,Comment,Vader score,TextBlob score,BERT score
0,Cuantos años tendra esta pelicula,0.0,0.1,0.482494
1,Como del 83,0.0,0.0,0.276382
2,Lo me jor q he visto.,0.0,0.0,0.249014
3,Como se llama la musica del minuto 41: 00,0.0,0.0,0.419252
4,Ya se volvió loco XD que buena broma 😝,0.8038,0.05,0.408127
5,Yo la vii hace unos años y desde el 2021 la es...,0.0,-0.021212,0.660254
6,Soy deportista y me encanta peliculas de básqu...,0.8979,0.5,0.814831
7,Excelente película me gustó mucho un saludo pa...,0.9419,0.62,0.708752
8,Ya lavi 100 beses,0.0,0.0,-0.407686
9,Mejor película,0.6369,1.0,0.896647


The results aren't perfect, but they're good enough. However, there is a problem: the file has 49 comments, and we need at least 500. So, we will have to remove all files with fewer than 500 comments and replace them with new files. Let's start by removing files

In [32]:
# import time

# target_dir = os.path.join(os.getcwd(), "Data")
# trash = os.path.join(os.getcwd(), "Trash")

# if not os.path.exists(trash):
#     os.mkdir(trash)

# data = os.listdir(target_dir)

# # Progress checkpoints are based on the files we iterate over, not on the number of links
# checkpoints = {int(len(data) * perc): perc * 100 for perc in [0.25, 0.5, 0.75, 0.9, 1]}

# start =  time.time()

# for i, file in enumerate(data):
#     current_time = time.time()
#     elapsed_time = current_time - start
#     filepath = os.path.join(target_dir, file)
#     content = load_comments(filepath)
#     if len(content) < 500:
#         utils.move_dir(filename = file, destination = trash, source = target_dir)
#     if i + 1 in checkpoints:
#         print(f"{checkpoints[i+1] :.0f}% of the videos have been analyzed\nTime elapsed: {elapsed_time :.2f} seconds") 
# trash_list = os.listdir(trash)

# print(f"<{len(trash_list)}> files moved")

Reading from local file <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments\Data\comments_-0OJwoguQzY.json>
Moved <comments_-0OJwoguQzY.json> to <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments\Trash\comments_-0OJwoguQzY.json>
Reading from local file <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments\Data\comments_-6-AnhFG3cI.json>
Reading from local file <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments\Data\comments_-8MQVIHTjIQ.json>
Moved <comments_-8MQVIHTjIQ.json> to <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments\Trash\comments_-8MQVIHTjIQ.json>
Reading from local file <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments\Data\comments_-8VfKZCOo_I.json>
Reading from local file <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments\Data\comments_-BjZmE2gtdo.json>
Reading from local file <f:\Desktop\Scuola\Uni\Y
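The filtering step can be tried safely on throwaway data before touching the real files. The sketch below is a stand-in built on assumptions: it uses `json`, `shutil.move`, and a temporary directory instead of the project's `load_comments` and `utils.move_dir`, it assumes each file holds a JSON list of comments, and the helper name `move_small_files` is hypothetical.

```python
import json
import os
import shutil
import tempfile


def move_small_files(source_dir, trash_dir, min_comments=500):
    """Move every JSON file with fewer than min_comments entries to trash_dir."""
    os.makedirs(trash_dir, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(source_dir)):
        path = os.path.join(source_dir, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        with open(path, "r", encoding="utf-8") as f:
            comments = json.load(f)  # assumes each file holds a JSON list of comments
        if len(comments) < min_comments:
            shutil.move(path, os.path.join(trash_dir, name))
            moved.append(name)
    return moved


# Demo on a temporary directory so no real data is touched
with tempfile.TemporaryDirectory() as tmp:
    demo_data_dir = os.path.join(tmp, "Data")
    demo_trash_dir = os.path.join(tmp, "Trash")
    os.makedirs(demo_data_dir)
    for name, n in [("comments_a.json", 600), ("comments_b.json", 49)]:
        with open(os.path.join(demo_data_dir, name), "w", encoding="utf-8") as f:
            json.dump(["x"] * n, f)
    moved_files = move_small_files(demo_data_dir, demo_trash_dir)
    remaining_files = os.listdir(demo_data_dir)
    print(moved_files, remaining_files)
```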

Quick sanity check:

In [34]:
target_dir = os.path.join(os.getcwd(), "Data")  # portable across operating systems

data = os.listdir(target_dir)
print(f"This directory contains <{len(data)}> files")
for file in data:
    filepath = os.path.join(target_dir, file)
    content = load_comments(filepath)
    print(len(content))

This directory contains <569> files
Reading from local file <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments\Data\comments_-6-AnhFG3cI.json>
600
Reading from local file <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments\Data\comments_-8VfKZCOo_I.json>
594
Reading from local file <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments\Data\comments_-BjZmE2gtdo.json>
600
Reading from local file <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments\Data\comments_-CmadmM5cOk.json>
600
Reading from local file <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments\Data\comments_-eGEKFMHGr8.json>
600
Reading from local file <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments\Data\comments_-fWXLAkHmxA.json>
600
Reading from local file <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments\Data\comments_-Hqp5Mwq3Zs.json>
600
Reading from

Great! We have over 500 videos left, each with at least 500 comments. All that's left to do is run the sentiment analysis on each video. This is going to take a lot of time, which means we'll have to do it in batches. 100 videos per batch seems feasible, so let's do that. 
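A simple way to form such batches is to slice the file list in steps of 100. Below is a minimal sketch, where `make_batches` is a hypothetical helper and the file names are placeholders for the 569 files counted in the sanity check.

```python
def make_batches(items, batch_size=100):
    """Split a list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


placeholder_files = [f"comments_{i}.json" for i in range(569)]  # placeholder names
batches = make_batches(placeholder_files, batch_size=100)
print([len(b) for b in batches])
```

With 569 files this gives five full batches and a final, smaller batch of 69. Saving the results after each batch means a crash costs at most one batch of work.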