# Thesis project notebook 

##### Uncomment and run the cell below to automatically install the required libraries

In [20]:
# %pip install -r requirements.txt

Start by importing the required libraries

In [21]:
import time
import pandas as pd
import utils
import json
import random
from multiprocessing import Pool

Let's load comments from a json file and see how many there are

In [22]:
from get_video import load_comments

comments = load_comments("comments_4DCbZJh-Gk4.json")
print(f"There are {len(comments)} comments in this dataset")
print(comments[:10])

Reading from local file <comments_4DCbZJh-Gk4.json>
['Outstanding!', 'my lord i imagine a war space i love this emg videos john browne are the best ever   ...ever suppppport  djent\nim from argentina say love metal instrumental', "1:31 that switch up is fucking timeless. I've known this song for about 5 years now and it still never gets old.", 'And yet we still search for proof of alien life 🤷🏻\u200d♂️', 'After almost 7 years this playthrough still brings tears to my eyes. So good and hope to see Monuments live again soon!', 'Djesus Christ', 'Masterpiece!!', 'god DAMN that part beginning at 1:58 is SOOO fucking good like holy FUCK man', 'All finger breaker chords. All the time.', 'Bruh']
There are 100 comments in this dataset


This might take some time depending on the size of our dataset, so let's define some checkpoints

In [23]:
checkpoints = {int(len(comments) * perc): perc * 100 for perc in [0.25, 0.5, 0.75, 0.9, 1]} # shamelessly stolen from Computational Linguistics Notebook 8
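
To make the checkpoint idea concrete, here is a minimal sketch of how such a dictionary can drive progress messages inside a loop. The names `make_checkpoints` and `run_with_progress` are made up for this illustration, and the loop body is a placeholder for the real per-item work.

```python
def make_checkpoints(n_items, fractions=(0.25, 0.5, 0.75, 0.9, 1)):
    """Map item counts to the percentage of the work they represent."""
    return {int(n_items * frac): frac * 100 for frac in fractions}


def run_with_progress(items):
    """Go through the items and collect one message per checkpoint reached."""
    checkpoints = make_checkpoints(len(items))
    messages = []
    for i, _item in enumerate(items):
        # ... the real per-item work would happen here ...
        if i + 1 in checkpoints:
            messages.append(f"{checkpoints[i + 1]:.0f}% of the items have been processed")
    return messages


progress = run_with_progress(list(range(100)))
for line in progress:
    print(line)
```

Looking up `i + 1` in the dictionary keeps the loop cheap: nothing is printed except at the handful of positions that were computed up front.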

##### Now we can start analyzing them!

We will analyze the comments with BERT first and save the results to a list. We pass the whole list in one call because the BERT classifier pipeline works better on a batch of comments than on one comment at a time

In [45]:
bert_scores = utils.bert_classifier(comments)
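
`utils.bert_classifier` is not shown in this notebook, so the following is only a sketch of one way a batch of classifier results could be turned into signed scores like the ones used below. It assumes each result is a dict with a `label` (`"POSITIVE"` or `"NEGATIVE"`) and a confidence `score` between 0 and 1; the function name `to_signed_scores` is hypothetical and the actual helper may work differently.

```python
def to_signed_scores(results):
    """Turn label/confidence pairs into scores between -1 and 1.

    Assumed input shape (not confirmed by this notebook):
    [{"label": "POSITIVE", "score": 0.9}, {"label": "NEGATIVE", "score": 0.5}, ...]
    """
    signed = []
    for result in results:
        confidence = float(result["score"])
        # Negative predictions get a negative sign, everything else stays positive
        sign = -1.0 if result["label"].upper().startswith("NEG") else 1.0
        signed.append(sign * confidence)
    return signed


# Hand-written example results, not real model output
example_results = [
    {"label": "POSITIVE", "score": 0.9},
    {"label": "NEGATIVE", "score": 0.5},
]
print(to_signed_scores(example_results))
```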

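Now we analyze the comments using Vader and TextBlob. Each classifier is spread over a pool of worker processes, and the two pools run one after the other, so the analysis is (almost) concurrent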

In [46]:
start_time = time.time()

df_list = []
scores_list = []

# Score every comment with Vader, spreading the work over a pool of worker processes
with Pool() as p:
    analysis = p.map(utils.vader_classifier, comments)
    scores_list.append(analysis)

# Do the same with TextBlob; p.map returns results in the same order as `comments`
with Pool() as p:
    analysis = p.map(utils.textblob_classifier, comments)
    scores_list.append(analysis)

# Build one row per comment holding both scores
for comment, v_score, t_score in zip(comments, scores_list[0], scores_list[1]):
    df_dict = {
        "Comment": comment,
        "Vader score": v_score,
        "TextBlob score": t_score,
    }
    df_list.append(df_dict)

df = pd.DataFrame(df_list)

We should merge the BERT scores with the Vader and TextBlob scores into a single dataframe

In [47]:
df = pd.DataFrame(df_list)
series_bert = pd.Series(bert_scores)

df['BERT score'] = series_bert

Let's inspect the dataframe. Start by looking at the first 10 rows

In [48]:
df.head(10)

Unnamed: 0,Comment,Vader score,TextBlob score,BERT score
0,Outstanding!,0.6476,0.625,0.896599
1,my lord i imagine a war space i love this emg ...,0.8658,0.666667,0.894711
2,1:31 that switch up is fucking timeless. I've ...,0.0,-0.25,-0.551378
3,And yet we still search for proof of alien lif...,0.0,-0.25,0.367565
4,After almost 7 years this playthrough still br...,0.7428,0.435227,0.736152
5,Djesus Christ,0.0,0.0,0.272853
6,Masterpiece!!,0.6892,0.0,0.879835
7,god DAMN that part beginning at 1:58 is SOOO f...,-0.1518,0.15,-0.496737
8,All finger breaker chords. All the time.,0.0,0.0,0.759309
9,Bruh,0.0,0.0,-0.386926


Scores are useful for computation, but not so much for legibility. Labels would make it clearer what each score means. The Vader team defines score ranges on their GitHub repository (positive at 0.05 and above, negative at -0.05 and below, neutral in between), so assigning labels is going to be very easy. BERT returns a label, but it's not formatted in a way that we like. TextBlob doesn't return labels at all, and its documentation doesn't define ideal ranges. For both TextBlob and BERT we therefore use our own cutoffs: positive at 0.33 and above, neutral from 0 up to 0.33, and negative below 0. Let's add some labels now

In [49]:
# Vader first

vader_labels = []

for score in df["Vader score"]:
    if score >= 0.05:
        vader_labels.append("Positive")
    elif -0.05 < score < 0.05:
        vader_labels.append("Neutral")
    elif score <= -0.05:
        vader_labels.append("Negative")

vader_labels_series = pd.Series(vader_labels)

# Now TextBlob and BERT

blob_labels, bert_labels = [], []

for score in df["TextBlob score"]:
    if score >= 0.33:
        blob_labels.append("Positive")
    elif 0 <= score < 0.33:
        blob_labels.append("Neutral")
    elif score < 0:
        blob_labels.append("Negative")

for score in df["BERT score"]:
    if score >= 0.33:
        bert_labels.append("Positive")
    elif 0 <= score < 0.33:
        bert_labels.append("Neutral")
    elif score < 0:
        bert_labels.append("Negative")

blob_labels_series = pd.Series(blob_labels)
bert_labels_series = pd.Series(bert_labels)

df['Vader label'] = vader_labels_series
df['TextBlob label'] = blob_labels_series
df['BERT label'] = bert_labels_series

Let's rearrange the columns and see what it looks like

In [50]:
df = df[["Comment","Vader score", "Vader label", "TextBlob score", "TextBlob label", "BERT score", "BERT label"]]

df.head(10)

Unnamed: 0,Comment,Vader score,Vader label,TextBlob score,TextBlob label,BERT score,BERT label
0,Outstanding!,0.6476,Positive,0.625,Positive,0.896599,Positive
1,my lord i imagine a war space i love this emg ...,0.8658,Positive,0.666667,Positive,0.894711,Positive
2,1:31 that switch up is fucking timeless. I've ...,0.0,Neutral,-0.25,Negative,-0.551378,Negative
3,And yet we still search for proof of alien lif...,0.0,Neutral,-0.25,Negative,0.367565,Positive
4,After almost 7 years this playthrough still br...,0.7428,Positive,0.435227,Positive,0.736152,Positive
5,Djesus Christ,0.0,Neutral,0.0,Neutral,0.272853,Neutral
6,Masterpiece!!,0.6892,Positive,0.0,Neutral,0.879835,Positive
7,god DAMN that part beginning at 1:58 is SOOO f...,-0.1518,Negative,0.15,Neutral,-0.496737,Negative
8,All finger breaker chords. All the time.,0.0,Neutral,0.0,Neutral,0.759309,Positive
9,Bruh,0.0,Neutral,0.0,Neutral,-0.386926,Negative
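
The three labelling loops apply the same pattern with different cutoffs, so the logic can also be written as one small function. This is only a sketch: `label_score` and its parameter names are made up for illustration, and the cutoffs are the ones used in the labelling cell (Vader: 0.05 and -0.05, with -0.05 itself counted as negative; TextBlob and BERT: 0.33 and 0).

```python
def label_score(score, positive_from, negative_below, include_bound=False):
    """Return "Positive", "Neutral" or "Negative" for a single score.

    include_bound=True also counts a score exactly equal to
    negative_below as negative (the Vader convention).
    """
    if score >= positive_from:
        return "Positive"
    if score < negative_below or (include_bound and score == negative_below):
        return "Negative"
    return "Neutral"


# Scores taken from the first rows of the table
vader_examples = [0.6476, 0.0, -0.1518]
blob_examples = [0.625, 0.15, -0.25]

print([label_score(s, 0.05, -0.05, include_bound=True) for s in vader_examples])
print([label_score(s, 0.33, 0.0) for s in blob_examples])
```

With pandas, such a function could be applied to a whole column via `Series.apply`, which would avoid building the label lists by hand.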


And now let's check the stats for our numerical columns 

In [51]:
print(df.describe())

       Vader score  TextBlob score  BERT score
count   100.000000      100.000000  100.000000
mean      0.245066        0.177426    0.246427
std       0.460005        0.421838    0.564940
min      -0.726200       -1.000000   -0.967942
25%       0.000000        0.000000   -0.289669
50%       0.080000        0.129018    0.385416
75%       0.682900        0.482924    0.723057
max       0.957100        1.000000    0.951091


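The summary statistics look broadly similar across the three models, but similar averages don't guarantee that the models agree on individual comments. We should check this mathematically using the Pearson correlation between the score columns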

In [52]:
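# Keep only the numeric score columns; recent pandas versions raise an error if text columns are passed to corr
print(df.select_dtypes(include='number').corr(method='pearson'))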

                Vader score  TextBlob score  BERT score
Vader score        1.000000        0.667741    0.465176
TextBlob score     0.667741        1.000000    0.561260
BERT score         0.465176        0.561260    1.000000
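
For reference, the Pearson coefficient of two columns is their covariance divided by the product of their standard deviations, which gives a value between -1 (perfectly opposed) and 1 (perfectly aligned). The plain-Python sketch below spells that out on toy numbers; the function `pearson` is written for this illustration and is not how pandas computes it internally.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Deviations of every value from its column mean
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return numerator / denominator


# Toy data: the second list is exactly twice the first, the third is the first reversed
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson([1, 2, 3, 4], [4, 3, 2, 1]))
```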


Now we have an idea of what our analysis is going to look like. However, one video is not enough. Let's build a bigger dataset using the top 100 most searched terms on YouTube

In [53]:
terms = pd.read_csv('search.tsv', sep='\t') # obtained from https://ahrefs.com/blog/top-youtube-searches/
terms.head(10)

Unnamed: 0,#,Keyword,Search Volume
0,1,bts,16723304
1,2,pewdiepie,16495659
2,3,asmr,14655088
3,4,billie eilish,13801247
4,5,baby shark,12110100
5,6,old town road,10456524
6,7,music,10232134
7,8,badabun,10188997
8,9,blackpink,9580131
9,10,fortnite,9117342


We can use these search terms to get YouTube video links which we can then use to gather comments. To save some time (and some precious API quota) we'll use the top 50 searched terms

This cell should not be run more than once a month. To make sure you never have to run it again, save the output to a file and load it for subsequent runs

In [54]:
# from get_video import get_videos
# half = terms.head(50)
# search_terms = half["Keyword"]
# urls = []

# for term in search_terms:
#     urls.append(get_videos(term))

# links = list(set(item for sublist in urls for item in sublist))

# with open("videos__.json", "w") as f:
#     json.dump(links, f)
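
The "run once, then load from disk" advice can also be wrapped in a small helper. The sketch below is an assumption about how one might do it, not code from this project: `load_or_fetch` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the quota-consuming search so that the example runs offline.

```python
import json
import os
import tempfile


def load_or_fetch(path, fetch):
    """Return cached JSON data from `path`, calling `fetch()` only if the file is missing."""
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    data = fetch()
    with open(path, "w") as f:
        json.dump(data, f)
    return data


calls = []


def fake_fetch():
    # Stand-in for the expensive API calls; records every time it is invoked
    calls.append("fetched")
    return ["https://www.youtube.com/watch?v=ttWQK5VXskA"]


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "videos_cache.json")
    first = load_or_fetch(cache_path, fake_fetch)   # file missing: fetches and saves
    second = load_or_fetch(cache_path, fake_fetch)  # file present: loads from disk

print(f"fetch was called {len(calls)} time(s)")
```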

In [70]:
with open("videos__.json", "r") as f:
    links = json.load(f)

print(f"This file contains {len(links)} links")

This file contains 911 links


This is a lot of links, and therefore a lot of data. Given the rate of analysis and the number of comments per video, processing everything would take a long time: many of these videos have 20,000 comments or more, and this program needs around 30 seconds to analyze 100 comments, so analyzing all comments would take days.

Instead of waiting for 9 days, we can make a compromise: we can use 500 videos and 500 comments per video.
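
As a rough sanity check on this compromise, the measured rate of about 30 seconds per 100 comments can be turned into a back-of-the-envelope estimate. The sketch assumes that the time grows linearly with the number of comments and ignores download time and any other overhead; `estimate_seconds` is a hypothetical helper.

```python
def estimate_seconds(n_comments, seconds_per_batch=30, batch_size=100):
    """Linear estimate of the analysis time from the measured rate."""
    return n_comments * seconds_per_batch / batch_size


n_videos = 500
comments_per_video = 500
total_comments = n_videos * comments_per_video
seconds = estimate_seconds(total_comments)

print(f"{total_comments} comments -> about {seconds / 3600:.1f} hours")
```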

Let's start by randomly selecting 500 URLs.

In [71]:
random.seed(19)

half_links = random.sample(links, k=500)
links = list(half_links)

with open("links.json", "w") as f:
    json.dump(links, f)
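
Seeding the generator is what makes this selection repeatable: the same seed and the same input list always give the same 500 links. Here is a small sketch with made-up links (the `example.com` URLs are placeholders), using a local `random.Random` instance rather than the global seed:

```python
import random

# Placeholder links, one per position in the original list of 911
fake_links = [f"https://example.com/watch?v={i}" for i in range(911)]

# Two generators with the same seed draw exactly the same sample
first_draw = random.Random(19).sample(fake_links, k=500)
second_draw = random.Random(19).sample(fake_links, k=500)

print(first_draw == second_draw)
print(len(set(first_draw)))
```

Because `sample` draws without replacement, no link appears twice in the selection.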

Now that we have our list of links, we can start collecting data. Let's do the fast part first: collecting the like and dislike counts for each video and saving them somewhere convenient for later.

As before, this cell should ideally be run only once, as it takes around 5 minutes. If the notebook needs to be restarted, you should load the csv file instead.

In [57]:
# from get_video import get_likes

# start_time = time.time()

# with open("links.json", "r") as f:
#     links = json.load(f)
# df_list = []

# checkpoints = {int(len(links) * perc): perc * 100 for perc in [0.25, 0.5, 0.75, 0.9, 1]} # same as before

# for i, link in enumerate(links):
#     data = get_likes(link)
#     df_list.append(data)
#     time.sleep(0.5) # otherwise we will get an HTTP error
#     if i + 1 in checkpoints:
#         elapsed_time = time.time() - start_time # measured after the request so it includes the current video
#         print(f"{checkpoints[i+1]:.0f}% of the videos have been analyzed\nTime elapsed: {elapsed_time:.2f} seconds")

# df = pd.DataFrame(df_list)
# utils.save_analysis(df, "top50_ratio")

Let's look at the first 10 entries

In [72]:
likes_df = pd.read_csv("analysis_top50_ratio.csv", index_col = 0)
likes_df.head(10)

Unnamed: 0,URL,likes,dislikes
0,https://www.youtube.com/watch?v=ttWQK5VXskA,784677,25441
1,https://www.youtube.com/watch?v=5ula1NjaHUA,37445,723
2,https://www.youtube.com/watch?v=CPK_IdHe1Yg,3781735,145753
3,https://www.youtube.com/watch?v=QVLvD5_6ld0,224538,16484
4,https://www.youtube.com/watch?v=4bMOTTJqGgM,25754,9165
5,https://www.youtube.com/watch?v=KCwK2ECPDLA,42762,1603
6,https://www.youtube.com/watch?v=L40EFeYQmv8,85259,387
7,https://www.youtube.com/watch?v=Gl6ekgobG2k,444028,59264
8,https://www.youtube.com/watch?v=bheqvt6Z8S8,866,28
9,https://www.youtube.com/watch?v=evy4n3gW7LU,1015,22


This is good data, but it is not complete. We are still missing the ratio of likes to dislikes. Let's compute that now!

In [73]:
likes, dislikes = likes_df['likes'], likes_df['dislikes']

ratios = []

for like, dislike in zip(likes, dislikes):
    # Videos with no likes or no dislikes get a ratio of 0 (this also avoids dividing by zero)
    if like == 0 or dislike == 0:
        ratios.append(0)
    else:
        ratios.append(round(like / dislike, 3))

ratios = pd.Series(ratios)
ratios.head(10)

0     30.843
1     51.791
2     25.946
3     13.622
4      2.810
5     26.676
6    220.307
7      7.492
8     30.929
9     46.136
dtype: float64
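
The same rule can be written as a small function and checked by hand against the first rows of the table. `like_dislike_ratio` is a name chosen for this sketch; it mirrors the cell above, including the choice to report 0 when either count is 0.

```python
def like_dislike_ratio(likes, dislikes, ndigits=3):
    """Likes divided by dislikes, or 0 when either count is 0."""
    if likes == 0 or dislikes == 0:
        return 0
    return round(likes / dislikes, ndigits)


# The first three pairs come from the table; the last one is made up to show the zero case
examples = [(784677, 25441), (866, 28), (1015, 22), (120, 0)]
for n_likes, n_dislikes in examples:
    print(n_likes, n_dislikes, like_dislike_ratio(n_likes, n_dislikes))
```

Note that a video with likes but no dislikes ends up with the same ratio as a video with no likes at all, which is worth keeping in mind when reading the mean.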

Let's add these ratios to our dataframe and compute their mean

In [77]:
likes_df['ratio'] = ratios
mean_ratio = likes_df['ratio'].mean()
print(f"Mean like/dislike ratio = {mean_ratio :.2f}")
likes_df.head(10)

Mean like/dislike ratio = 946.91


Unnamed: 0,URL,likes,dislikes,ratio
0,https://www.youtube.com/watch?v=ttWQK5VXskA,784677,25441,30.843
1,https://www.youtube.com/watch?v=5ula1NjaHUA,37445,723,51.791
2,https://www.youtube.com/watch?v=CPK_IdHe1Yg,3781735,145753,25.946
3,https://www.youtube.com/watch?v=QVLvD5_6ld0,224538,16484,13.622
4,https://www.youtube.com/watch?v=4bMOTTJqGgM,25754,9165,2.81
5,https://www.youtube.com/watch?v=KCwK2ECPDLA,42762,1603,26.676
6,https://www.youtube.com/watch?v=L40EFeYQmv8,85259,387,220.307
7,https://www.youtube.com/watch?v=Gl6ekgobG2k,444028,59264,7.492
8,https://www.youtube.com/watch?v=bheqvt6Z8S8,866,28,30.929
9,https://www.youtube.com/watch?v=evy4n3gW7LU,1015,22,46.136


We will use the data in this dataframe as the baseline for the sentiment analysis.

Speaking of which, we need to repeat the analysis from earlier on the videos we collected. This process is going to take a long time to complete, even with multiprocessing. Of course, we will need to collect the comments first, so let's do that here