# Thesis project notebook 

##### Uncomment and run the cell below to automatically install the required libraries

In [1]:
# %pip install -r requirements.txt

Start by importing the required libraries

In [53]:
import json
import os
import random
from multiprocessing import Pool
from itertools import repeat
import pandas as pd
import utils
import langdetect as ld

Since the real analysis is going to take a long time to complete, we'll start with a toy example that takes about a minute.

Let's load comments from a json file and see how many there are

In [3]:
from get_video import load_comments

comments = load_comments("comments_4DCbZJh-Gk4.json")
print(f"There are {len(comments)} comments in this dataset")
print(comments[:10])

Reading from local file <comments_4DCbZJh-Gk4.json>
There are 100 comments in this dataset
['Outstanding!', 'my lord i imagine a war space i love this emg videos john browne are the best ever   ...ever suppppport  djent\nim from argentina say love metal instrumental', "1:31 that switch up is fucking timeless. I've known this song for about 5 years now and it still never gets old.", 'And yet we still search for proof of alien life 🤷🏻\u200d♂️', 'After almost 7 years this playthrough still brings tears to my eyes. So good and hope to see Monuments live again soon!', 'Djesus Christ', 'Masterpiece!!', 'god DAMN that part beginning at 1:58 is SOOO fucking good like holy FUCK man', 'All finger breaker chords. All the time.', 'Bruh']


##### Now we can start analyzing them!

We will analyze the comments with BERT first and save the results to a list. We pass the whole list in one call because the BERT classifier pipeline works better on a list of comments than on a single comment.

In [4]:
bert_scores = utils.bert_classifier(comments)

Now we analyze the comments using Vader and TextBlob

In [5]:
df_list = []
scores_list = []

with Pool() as p:
    analysis1 = p.map(utils.vader_classifier, comments)
    analysis2 = p.map(utils.textblob_classifier, comments)
    scores_list.append(analysis1)
    scores_list.append(analysis2)

for comment, v_score, t_score in zip(comments, scores_list[0], scores_list[1]):
    df_dict = {
        "Comment": comment,
        "Vader score": v_score,
        "TextBlob score": t_score,
    }
    df_list.append(df_dict)

df = pd.DataFrame(df_list)

We should merge the BERT scores with the Vader and TextBlob results into a single dataframe

In [6]:
df = pd.DataFrame(df_list)
series_bert = pd.Series(bert_scores)

df['BERT score'] = series_bert

Let's inspect the dataframe. Start by looking at the first 10 rows

In [7]:
df.head(10)

Unnamed: 0,Comment,Vader score,TextBlob score,BERT score
0,Outstanding!,0.6476,0.625,0.896599
1,my lord i imagine a war space i love this emg ...,0.8658,0.666667,0.894711
2,1:31 that switch up is fucking timeless. I've ...,0.0,-0.25,-0.551378
3,And yet we still search for proof of alien lif...,0.0,-0.25,0.367565
4,After almost 7 years this playthrough still br...,0.7428,0.435227,0.736152
5,Djesus Christ,0.0,0.0,0.272853
6,Masterpiece!!,0.6892,0.0,0.879835
7,god DAMN that part beginning at 1:58 is SOOO f...,-0.1518,0.15,-0.496737
8,All finger breaker chords. All the time.,0.0,0.0,0.759309
9,Bruh,0.0,0.0,-0.386926


Scores are useful for computation, but they are hard to read at a glance. Labels would make it clear what each score means. The Vader team defines score ranges in their GitHub repository, so assigning Vader labels is easy. BERT returns a label, but not in the format we want. TextBlob returns no labels at all, and its documentation does not define recommended ranges. Let's add some labels now

In [8]:
vader_labels = utils.generate_labels(df["Vader score"], "vader")
blob_labels = utils.generate_labels(df["TextBlob score"], "blob")
bert_labels = utils.generate_labels(df["BERT score"], "bert")

vader_labels_series = pd.Series(vader_labels)
blob_labels_series = pd.Series(blob_labels)
bert_labels_series = pd.Series(bert_labels)

df['Vader label'] = vader_labels_series
df['TextBlob label'] = blob_labels_series
df['BERT label'] = bert_labels_series
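
For reference, here is a minimal sketch of how a score can be turned into a label. The ±0.05 cut-offs are the ones the VADER README suggests for its compound score. The function name `label_from_score` and its parameters are hypothetical, and this is not necessarily what `utils.generate_labels` does internally; the ranges used for TextBlob and BERT are not shown in this notebook, so they are left as adjustable arguments.

```python
def label_from_score(score: float, pos: float = 0.05, neg: float = -0.05) -> str:
    """Map a polarity score in [-1, 1] to a coarse sentiment label.

    The default thresholds follow the VADER README; other models would
    need their own (here unspecified) cut-offs.
    """
    if score >= pos:
        return "Positive"
    if score <= neg:
        return "Negative"
    return "Neutral"


# Vader scores taken from rows 0, 2 and 7 of the dataframe shown earlier
sample_scores = [0.6476, 0.0, -0.1518]
print([label_from_score(s) for s in sample_scores])
# prints ['Positive', 'Neutral', 'Negative']
```

These three labels agree with the Vader labels that appear for the same rows in the output of the next cell.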

Let's rearrange the columns and see what it looks like

In [9]:
df = df[["Comment","Vader score", "Vader label", "TextBlob score", "TextBlob label", "BERT score", "BERT label"]]

df.head(10)

Unnamed: 0,Comment,Vader score,Vader label,TextBlob score,TextBlob label,BERT score,BERT label
0,Outstanding!,0.6476,Positive,0.625,Positive,0.896599,Positive
1,my lord i imagine a war space i love this emg ...,0.8658,Positive,0.666667,Positive,0.894711,Positive
2,1:31 that switch up is fucking timeless. I've ...,0.0,Neutral,-0.25,Negative,-0.551378,Negative
3,And yet we still search for proof of alien lif...,0.0,Neutral,-0.25,Negative,0.367565,Positive
4,After almost 7 years this playthrough still br...,0.7428,Positive,0.435227,Positive,0.736152,Positive
5,Djesus Christ,0.0,Neutral,0.0,Netural,0.272853,Netural
6,Masterpiece!!,0.6892,Positive,0.0,Netural,0.879835,Positive
7,god DAMN that part beginning at 1:58 is SOOO f...,-0.1518,Negative,0.15,Netural,-0.496737,Negative
8,All finger breaker chords. All the time.,0.0,Neutral,0.0,Netural,0.759309,Positive
9,Bruh,0.0,Neutral,0.0,Netural,-0.386926,Negative


And now let's check the stats for our numerical columns 

In [10]:
print(df.describe())

       Vader score  TextBlob score  BERT score
count   100.000000      100.000000  100.000000
mean      0.245066        0.177426    0.246427
std       0.460005        0.421838    0.564940
min      -0.726200       -1.000000   -0.967942
25%       0.000000        0.000000   -0.289669
50%       0.080000        0.129018    0.385416
75%       0.682900        0.482924    0.723057
max       0.957100        1.000000    0.951091


The models seem to be predicting very similarly, but we should still check mathematically using Pearson correlation

In [11]:
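# select the numeric columns explicitly: recent pandas versions raise an error on text columns
print(df[["Vader score", "TextBlob score", "BERT score"]].corr(method='pearson'))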

                Vader score  TextBlob score  BERT score
Vader score        1.000000        0.667741    0.465176
TextBlob score     0.667741        1.000000    0.561260
BERT score         0.465176        0.561260    1.000000
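
To make the coefficients above less of a black box, here is a minimal pure-Python sketch of the Pearson correlation: the covariance of the two score lists divided by the product of their standard deviations. The helper name `pearson` is made up for illustration, and the five-value sample is simply the Vader and TextBlob scores of rows 0 to 4 from the table shown earlier, so its result will not match the full 100-comment figure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations; the 1/n factors cancel in the ratio
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Vader and TextBlob scores for rows 0-4 of the toy dataframe
vader = [0.6476, 0.8658, 0.0, 0.0, 0.7428]
blob = [0.625, 0.666667, -0.25, -0.25, 0.435227]
print(f"r = {pearson(vader, blob):.3f}")
```

A value near 1 means the two models rank comments similarly, a value near 0 means their scores are unrelated, and a value near -1 means they disagree systematically.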


## Real analysis

Now we have an idea of what our analysis is going to look like. However, one video is not enough. Let's build a bigger dataset using the top 100 most searched terms on YouTube

In [12]:
terms = pd.read_csv('search.tsv', sep='\t') # obtained from https://ahrefs.com/blog/top-youtube-searches/
terms.head(10)

Unnamed: 0,#,Keyword,Search Volume
0,1,bts,16723304
1,2,pewdiepie,16495659
2,3,asmr,14655088
3,4,billie eilish,13801247
4,5,baby shark,12110100
5,6,old town road,10456524
6,7,music,10232134
7,8,badabun,10188997
8,9,blackpink,9580131
9,10,fortnite,9117342


We can use these search terms to get YouTube video links which we can then use to gather comments. To save some time (and some precious API quota) we'll use the top 50 searched terms

This cell should not be run more than once a month. To make sure you never have to run it again, save the output to a file and load it for subsequent runs

In [13]:
# from get_video import get_videos
# half = terms.head(50)
# search_terms = half["Keyword"]
# urls = []

# for term in search_terms:
#     urls.append(get_videos(term))

# links = list(set(item for sublist in urls for item in sublist))

# with open("videos__.json", "w") as f:
#     json.dump(links, f)

In [14]:
with open("videos__.json", "r") as f:
    links = json.load(f)

print(f"This file contains {len(links)} links")

This file contains 911 links
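
The "run once, then load from disk" pattern used in the two cells above can also be wrapped in a small helper. This is only a sketch: `load_or_build` and its parameters are hypothetical names, and `build` stands in for whatever expensive step (such as the `get_videos` loop) produces the data. The demo uses a temporary directory and a single URL from this notebook so that it runs without touching the API.

```python
import json
import os
import tempfile


def load_or_build(path, build):
    """Return the JSON data stored at `path`; call `build()` only if the file is missing."""
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    data = build()
    with open(path, "w") as f:
        json.dump(data, f)
    return data


calls = []  # records how often the expensive step actually runs


def expensive_build():
    calls.append(1)
    return ["https://www.youtube.com/watch?v=ttWQK5VXskA"]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "videos_demo.json")
    first = load_or_build(cache_path, expensive_build)   # builds and saves
    second = load_or_build(cache_path, expensive_build)  # loads from disk

print(first == second, len(calls))
# prints True 1
```

With a helper like this the quota-consuming call only happens when the cache file does not exist yet, so re-running the notebook is safe.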


This is a lot of links (and therefore a lot of data). However, given the rate of analysis and the number of comments per video, processing everything would take far too long: many of these videos have at least 20,000 comments, and the program needs around 30 seconds to analyze 100 comments, so analyzing all of them would take days.

Instead of waiting for 9 days, we can make a compromise: we can use 500 videos and 500 comments per video.
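
As a rough sanity check on the reduced workload, the arithmetic can be written out. This is a back-of-the-envelope sketch only: it assumes the rate of about 30 seconds per 100 comments quoted above holds throughout, treats 500 comments per video as an upper bound, and ignores both API time and any speedup from multiprocessing. The function and constant names are made up for illustration.

```python
SECONDS_PER_100_COMMENTS = 30  # approximate rate quoted in the text
N_VIDEOS = 500
COMMENTS_PER_VIDEO = 500


def estimated_hours(n_videos, comments_per_video,
                    seconds_per_100=SECONDS_PER_100_COMMENTS):
    """Sequential analysis time in hours for the given workload."""
    total_comments = n_videos * comments_per_video
    return total_comments / 100 * seconds_per_100 / 3600


print(f"{N_VIDEOS * COMMENTS_PER_VIDEO} comments")
# prints 250000 comments
print(f"{estimated_hours(N_VIDEOS, COMMENTS_PER_VIDEO):.1f} hours")
# prints 20.8 hours
```

Under these assumptions the capped dataset is a matter of hours rather than days.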

Let's start by randomly selecting 500 URLs.

In [15]:
random.seed(19)

half_links = random.sample(links, k=500)
links = list(half_links)
print(f"This file contains {len(links)} links")

with open("links.json", "w") as f:
    json.dump(links, f)

This file contains 500 links
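
The call to `random.seed(19)` is what makes this selection repeatable. The short sketch below illustrates the idea with placeholder IDs (not real links): seeding with the same value before sampling from the same list yields the same sample, and `random.sample` draws without replacement.

```python
import random

# Placeholder population standing in for the 911 links
population = [f"video_{i}" for i in range(911)]

random.seed(19)
first_draw = random.sample(population, k=500)

random.seed(19)
second_draw = random.sample(population, k=500)

print(first_draw == second_draw)    # same seed and same input order
# prints True
print(len(set(first_draw)) == 500)  # no link is picked twice
# prints True
```

Note that the result also depends on the order of the input list, which is one more reason to load the links from the saved JSON file rather than rebuilding them from a `set` each time.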


Now that we have our list of links, we can start collecting data. Let's do the fast part first: collecting the like/dislike ratio and saving it somewhere convenient for later. 

As before, this cell should ideally be run only once, as it takes around 5 minutes. If the notebook needs to be restarted, you should load the CSV file instead.

In [16]:
# import time
# from get_video import get_likes

# start_time = time.time()

# with open("links.json", "r") as f:
#     links = json.load(f)
# df_list = []

# # number of processed videos -> percentage done, used for the progress messages
# checkpoints = {int(len(links) * perc): perc * 100 for perc in [0.25, 0.5, 0.75, 0.9, 1]}

# for i, link in enumerate(links):
#     data = get_likes(link)
#     df_list.append(data)
#     time.sleep(0.5) # otherwise we will get an HTTP error
#     if i + 1 in checkpoints:
#         elapsed_time = time.time() - start_time
#         print(f"{checkpoints[i+1] :.0f}% of the videos have been analyzed\nTime elapsed: {elapsed_time :.2f} seconds")

# df = pd.DataFrame(df_list)
# utils.save_analysis(df, "top50_ratio")
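
The progress messages in the commented-out cell above rely on a dictionary that maps "number of items processed" to "percentage done". A minimal standalone sketch of that idea follows; `make_checkpoints` is a hypothetical helper name, and it uses integer percentages instead of floats to avoid rounding surprises.

```python
def make_checkpoints(total, percentages=(25, 50, 75, 90, 100)):
    """Map the item count at which each percentage is reached to that percentage."""
    return {total * pct // 100: pct for pct in percentages}


checkpoints = make_checkpoints(500)
print(checkpoints)
# prints {125: 25, 250: 50, 375: 75, 450: 90, 500: 100}

reported = []
for i in range(500):          # stand-in for the loop over links
    if i + 1 in checkpoints:  # i is zero-based, counts are one-based
        reported.append(checkpoints[i + 1])

print(reported)
# prints [25, 50, 75, 90, 100]
```

Looking up `i + 1` in a dictionary keeps the loop body cheap and prints each milestone exactly once.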

Let's look at the first 10 entries

In [17]:
likes_df = pd.read_csv("analysis_top50_ratio.csv", index_col = 0)
likes_df.head(10)

Unnamed: 0,URL,likes,dislikes
0,https://www.youtube.com/watch?v=ttWQK5VXskA,784677,25441
1,https://www.youtube.com/watch?v=5ula1NjaHUA,37445,723
2,https://www.youtube.com/watch?v=CPK_IdHe1Yg,3781735,145753
3,https://www.youtube.com/watch?v=QVLvD5_6ld0,224538,16484
4,https://www.youtube.com/watch?v=4bMOTTJqGgM,25754,9165
5,https://www.youtube.com/watch?v=KCwK2ECPDLA,42762,1603
6,https://www.youtube.com/watch?v=L40EFeYQmv8,85259,387
7,https://www.youtube.com/watch?v=Gl6ekgobG2k,444028,59264
8,https://www.youtube.com/watch?v=bheqvt6Z8S8,866,28
9,https://www.youtube.com/watch?v=evy4n3gW7LU,1015,22


This is good data, but it is not complete. We are still missing the ratio of likes to dislikes. Let's compute that now!

In [18]:
likes, dislikes = likes_df['likes'], likes_df['dislikes']

ratios = []

for like, dislike in zip(likes, dislikes):
    if like == 0 or dislike == 0:
        ratio = 0
        ratios.append(ratio)
    else:
        ratio = round(like / dislike, 3)
        ratios.append(ratio)

ratios = pd.Series(ratios)
ratios.head(10)

0     30.843
1     51.791
2     25.946
3     13.622
4      2.810
5     26.676
6    220.307
7      7.492
8     30.929
9     46.136
dtype: float64

Let's add these ratios to our dataframe and compute the mean ratio

In [19]:
likes_df['ratio'] = ratios 
mean_ratio = ratios.mean()
print(f"Mean like/dislike ratio = {mean_ratio :.2f}")
likes_df.head(10)

Mean like/dislike ratio = 946.91


Unnamed: 0,URL,likes,dislikes,ratio
0,https://www.youtube.com/watch?v=ttWQK5VXskA,784677,25441,30.843
1,https://www.youtube.com/watch?v=5ula1NjaHUA,37445,723,51.791
2,https://www.youtube.com/watch?v=CPK_IdHe1Yg,3781735,145753,25.946
3,https://www.youtube.com/watch?v=QVLvD5_6ld0,224538,16484,13.622
4,https://www.youtube.com/watch?v=4bMOTTJqGgM,25754,9165,2.81
5,https://www.youtube.com/watch?v=KCwK2ECPDLA,42762,1603,26.676
6,https://www.youtube.com/watch?v=L40EFeYQmv8,85259,387,220.307
7,https://www.youtube.com/watch?v=Gl6ekgobG2k,444028,59264,7.492
8,https://www.youtube.com/watch?v=bheqvt6Z8S8,866,28,30.929
9,https://www.youtube.com/watch?v=evy4n3gW7LU,1015,22,46.136


We will use the data in this dataframe as the baseline for the sentiment analysis.

This process is going to take a long time to complete, even with multiprocessing. 

Let's start by collecting the comments.
The results will be saved to a file, so there's no need to run this cell more than once if everything goes well.

In [20]:
# import time
# from get_video import get_id, get_comments
# import googleapiclient.discovery

# start_time = time.time()

# target_dir = os.path.join(os.getcwd(), "Data")
# if not os.path.exists(target_dir):
#     os.mkdir(target_dir)

# urls = list(likes_df["URL"])
# checkpoints = {int(len(urls) * perc): perc * 100 for perc in [0.25, 0.5, 0.75, 0.9, 1]}

# for i, url in enumerate(urls):
#     video_id = get_id(url)
#     try:
#         comments = get_comments(url)
#     except googleapiclient.errors.HttpError:
#         continue  # skip videos whose comments the API refuses to return
#     utils.move_dir(f"comments_{video_id}.json", target_dir)
#     if i + 1 in checkpoints:
#         elapsed_time = time.time() - start_time
#         print(f"{checkpoints[i+1] :.0f}% of the comments have been collected\nTime elapsed: {elapsed_time :.2f} seconds")


10 minutes and 12 MB later, we have a folder full of JSON files containing comments.

Let's look at one of these files

In [21]:
from get_video import load_comments

target_dir = os.path.join(os.getcwd(), "Data")  # portable across operating systems
data = os.listdir(target_dir)
filename = os.path.join(target_dir, data[0])

print(f"There are <{len(data)}> files in this directory")

comment = load_comments(filename)
print(comment[:10])
print(f"This file contains <{len(comment)}> comments")

There are <452> files in this directory
Reading from local file <f:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments\Data\comments_-0OJwoguQzY.json>
['Si la película no fue de tu gusto no te preocupes, tenemos muchas más, y subimos películas (todas las semanas) en la mejor calidad para tu entretenimiento. Recuerda: Nosotros no producimos las películas, solo las distribuimos gratis para que tú puedas verlas.\r\nTrabajamos para para ti, así que no te vayas sin dejar un comentario y un me gusta 👍Eso nos motiva a traer nuevo contenido. ¡Gracias! 💛💕\r\nSuscríbete para que no te pierdas la próxima película. ✔\r\nhttps://www.youtube.com/c/CanalCinelodeon', 'Perronsicima', 'Muy buena actriz Jenifer Garner, muy buena película, felicitaciones por la distribución que hacen ustedes, queremos de la misma calidad....', 'Exelente pelicula..gracias muchas gracias por subirla....desde q empieza y acaba es pura accion de la buena...la recomiendo 100 por ciento...', 'BUENISIMA LA R

The comments in this file are not in English, which is a problem for Vader and TextBlob because, unlike BERT, they are not multilingual models. We can work around this by translating the comments into English with a reputable translator. Ideally, we would use a language-specific model for each language. However, we don't have the time or resources for that, so we will use the translator and accept that some information is lost.

In [50]:
langs = []
for c in comment:
    try:
        lang = ld.detect(c)
    except ld.lang_detect_exception.LangDetectException:
        lang = "NA"
    langs.append(lang)

with Pool() as p:
    tr_comment = p.starmap(utils.translate, zip(comment, repeat(langs)))

print(tr_comment[:10])

["If the movie was not to your liking don't worry, we have many more, and we upload movies (every week) in the best quality for your entertainment. Remember: We don't produce the movies, we just distribute them for free so you can watch them.\r\nWe work for you, so don't leave without leaving a comment and a like 👍That motivates us to bring you new content. thanks! 💛💕\r\nSubscribe so you don't miss the next movie. ✔\r\nhttps://www.youtube.com/c/CanalCinelodeon", 'Perronsicima', 'Very good actress Jenifer Garner, very good movie, congratulations for the distribution that you do, we want the same quality....', 'Exelente pelicula..gracias muchas gracias por subirla....desde q begins and ends is pure action of the good ...la recomiendo 100 por ciento....', 'I HIGHLY RECOMMEND IT, I SAW IT SOME TIME AGO AND I SAW IT AGAIN BECAUSE IT IS SO GOOD.', 'How I see it on my PC', 'Ave maria parcer@s super elegant buenicima good audio, good image, good quality, thank you very much and many successes.
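
One possible way to organize this step is to pair every comment with its own detected language and skip the ones that need no translation. The sketch below is an illustration under stated assumptions: the signature of `utils.translate` is not shown in this notebook, so `translate_if_needed`, its `skip` parameter, and the stand-in `fake_translate` are all hypothetical, and the translator is assumed to take a text and a source language code.

```python
def translate_if_needed(comments, langs, translate_fn, skip=("en", "NA")):
    """Translate each comment unless its detected language is in `skip`."""
    if len(comments) != len(langs):
        raise ValueError("comments and langs must have the same length")
    return [
        text if lang in skip else translate_fn(text, lang)
        for text, lang in zip(comments, langs)
    ]


def fake_translate(text, lang):
    """Stand-in translator so the sketch runs without network access."""
    return f"[{lang}->en] {text}"


sample_comments = ["Muy buena", "Outstanding!", "👍"]
sample_langs = ["es", "en", "NA"]  # "NA" marks comments where detection failed
print(translate_if_needed(sample_comments, sample_langs, fake_translate))
# prints ['[es->en] Muy buena', 'Outstanding!', '👍']
```

Skipping English and undetectable comments avoids unnecessary translator calls and leaves emoji-only comments untouched.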

In [54]:
bert_scores = utils.bert_classifier(comment)

df_list = []
scores_list = []

with Pool() as p:
    analysis1 = p.map(utils.vader_classifier, tr_comment)
    analysis2 = p.map(utils.textblob_classifier, tr_comment)
    scores_list.append(analysis1)
    scores_list.append(analysis2)

for c, v_score, t_score in zip(comment, scores_list[0], scores_list[1]):
    df_dict = {
        "Comment": c,
        "Vader score": v_score,
        "TextBlob score": t_score,
    }
    df_list.append(df_dict)

df = pd.DataFrame(df_list)

In [55]:
series_bert = pd.Series(bert_scores)
df['BERT score'] = series_bert
utils.save_analysis(df, "test")
df.head(10)

saved to <analysis_test.csv>


Unnamed: 0,Comment,Vader score,TextBlob score,BERT score
0,Si la película no fue de tu gusto no te preocu...,0.9794,0.398052,0.287919
1,Perronsicima,0.0,0.0,0.221338
2,"Muy buena actriz Jenifer Garner, muy buena pel...",0.8968,0.606667,0.646682
3,Exelente pelicula..gracias muchas gracias por ...,0.4404,0.338095,0.878554
4,BUENISIMA LA RECOMIENDO LA VI AC TIEMPO Y LA V...,0.7604,0.43,0.747503
5,Cómo la veo en mi PC,0.0,0.0,0.585962
6,Ave maria parcer@s super elegante buenicima bu...,0.9674,0.527619,0.822938
7,Excelente!!!!la vi hoy,0.0,0.0,0.853855
8,Muy buena,0.4927,0.91,0.743148
9,Excelente pelicula muy buena recomenda,0.7841,0.955,0.790435


The results aren't perfect (some comments, such as row 3, were only partially translated), but they're good enough to proceed.