# Thesis project notebook

##### Uncomment and run the cell below to automatically install the required libraries

In [1]:
# %pip install -r requirements.txt

Start by importing the required libraries

In [21]:
import time
import pandas as pd
import utils
import json

Let's load the comments from a JSON file and see how many there are

In [3]:
from get_video import load_comments

comments = load_comments("comments_4DCbZJh-Gk4.json")
print(comments[:10])
print(f"There are {len(comments)} comments in this dataset")

Reading from local file <comments_4DCbZJh-Gk4.json>
['Outstanding!', 'my lord i imagine a war space i love this emg videos john browne are the best ever   ...ever suppppport  djent\nim from argentina say love metal instrumental', "1:31 that switch up is fucking timeless. I've known this song for about 5 years now and it still never gets old.", 'And yet we still search for proof of alien life 🤷🏻\u200d♂️', 'After almost 7 years this playthrough still brings tears to my eyes. So good and hope to see Monuments live again soon!', 'Djesus Christ', 'Masterpiece!!', 'god DAMN that part beginning at 1:58 is SOOO fucking good like holy FUCK man', 'All finger breaker chords. All the time.', 'Bruh']
There are 100 comments in this dataset


Analyzing the comments might take some time depending on the size of our dataset, so let's define some checkpoints to report progress along the way

In [4]:
checkpoints = {int(len(comments) * perc): perc * 100 for perc in [0.25, 0.5, 0.75, 0.9, 1]} # shamelessly stolen from Computational Linguistics Notebook 8
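
To make the lookup concrete, here is a small sketch of what this dictionary contains. The helper name `build_checkpoints` is ours and exists only for illustration; the logic mirrors the cell above.

```python
def build_checkpoints(n_comments, fractions=(0.25, 0.5, 0.75, 0.9, 1)):
    """Map 'number of comments processed' to 'percentage done'."""
    return {int(n_comments * frac): frac * 100 for frac in fractions}


# For 100 comments, a progress message fires after these comment counts.
checkpoints_100 = build_checkpoints(100)
print(sorted(checkpoints_100))

# With a very small dataset, int() truncation can make two fractions share
# a key, in which case the later fraction overwrites the earlier one.
checkpoints_3 = build_checkpoints(3)
print(checkpoints_3)
```

For datasets of a few hundred comments or more the collision does not occur, so the simple dictionary is enough here.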

##### Now we can start analyzing them!

We will analyze the comments with BERT first and save the results to a list, because the BERT classifier pipeline works better when given a list of comments than when given one comment at a time

In [5]:
bert_scores = utils.bert_classifier(comments)

Now we analyze the comments one by one using VADER and TextBlob

In [6]:
start_time = time.time()

df_list = []

for i, comment in enumerate(comments):
    lang = utils.detect_language(comment)

    # VADER
    vader_score = utils.vader_classifier(comment)

    # TextBlob - polarity is a float between -1 and 1, where -1 is negative and 1 is positive
    blob_score = utils.textblob_classifier(comment)

    df_dict = {
        "Comment": comment,
        "Language": lang,
        "Vader score": vader_score,
        "Textblob score": blob_score,
    }

    df_list.append(df_dict)

    if i + 1 in checkpoints:
        # Measure the elapsed time after the current comment has been processed
        elapsed_time = time.time() - start_time
        print(f"{checkpoints[i + 1]:.0f}% of the comments have been analyzed\nTime elapsed: {elapsed_time:.2f} seconds")


25% of the comments have been analyzed
Time elapsed: 16.59 seconds
50% of the comments have been analyzed
Time elapsed: 31.92 seconds
75% of the comments have been analyzed
Time elapsed: 47.19 seconds
90% of the comments have been analyzed
Time elapsed: 56.23 seconds
100% of the comments have been analyzed
Time elapsed: 62.50 seconds


We should merge the results of both analyses into a single dataframe

In [7]:
df = pd.DataFrame(df_list)
series_bert = pd.Series(bert_scores)

df['BERT score'] = series_bert
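
One detail worth keeping in mind: assigning a `Series` to a DataFrame column matches values by index label, not by position. Both objects above use the default 0..n-1 index, so each score lands on its own row. The toy example below (all names and values are made up for illustration) sketches what happens when the index has gaps.

```python
import pandas as pd

toy_df = pd.DataFrame({"Comment": ["a", "b", "c"]})  # default index 0, 1, 2
toy_scores = pd.Series([0.9, -0.5, 0.1])             # default index 0, 1, 2

# Same index on both sides: every score ends up next to its comment.
toy_df["BERT score"] = toy_scores

# After filtering, the index keeps its original labels (0 and 2 here),
# and the assignment still matches on those labels.
filtered = toy_df[toy_df["Comment"] != "b"].copy()
filtered["Realigned"] = pd.Series([0.9, -0.5, 0.1])
print(filtered)
```

If the comments were ever filtered or reordered between the two analyses, checking that `len(bert_scores) == len(df)` before the assignment would be a cheap safeguard.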

Let's inspect the dataframe. Start by looking at the first 10 rows

In [8]:
df.head(10)

| | Comment | Language | Vader score | Textblob score | BERT score |
|---|---|---|---|---|---|
| 0 | Outstanding! | de | 0.6476 | 0.625 | 0.896599 |
| 1 | my lord i imagine a war space i love this emg ... | en | 0.8658 | 0.666667 | 0.894711 |
| 2 | 1:31 that switch up is fucking timeless. I've ... | en | 0.0 | -0.25 | -0.551378 |
| 3 | And yet we still search for proof of alien lif... | en | 0.0 | -0.25 | 0.367565 |
| 4 | After almost 7 years this playthrough still br... | en | 0.7428 | 0.435227 | 0.736152 |
| 5 | Djesus Christ | et | 0.0 | 0.0 | 0.272853 |
| 6 | Masterpiece!! | ro | 0.6892 | 0.0 | 0.879835 |
| 7 | god DAMN that part beginning at 1:58 is SOOO f... | en | -0.1518 | 0.15 | -0.496737 |
| 8 | All finger breaker chords. All the time. | en | 0.0 | 0.0 | 0.759309 |
| 9 | Bruh | id | 0.0 | 0.0 | -0.386926 |


And now let's check the summary statistics for our numerical columns

In [9]:
print(df.describe())

       Vader score  Textblob score  BERT score
count   100.000000      100.000000  100.000000
mean      0.245066        0.177426    0.246427
std       0.460005        0.421838    0.564940
min      -0.726200       -1.000000   -0.967942
25%       0.000000        0.000000   -0.289669
50%       0.080000        0.129018    0.385416
75%       0.682900        0.482924    0.723057
max       0.957100        1.000000    0.951091


The mean scores of the three models are close (roughly 0.18 to 0.25), but similar averages do not guarantee that the models agree on individual comments, so let's check using the Pearson correlation

In [10]:
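# numeric_only=True skips the text columns (Comment, Language), which recent pandas versions no longer drop silently
print(df.corr(method='pearson', numeric_only=True))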

                Vader score  Textblob score  BERT score
Vader score        1.000000        0.667741    0.465176
Textblob score     0.667741        1.000000    0.561260
BERT score         0.465176        0.561260    1.000000
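
As a reminder of what these numbers mean, the Pearson coefficient is the covariance of two score columns divided by the product of their standard deviations. Below is a minimal sketch in plain Python; `pearson_r` is a hypothetical helper written for illustration (pandas does this for us above), and the two lists are the VADER and TextBlob scores of the first five rows shown earlier.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equally long sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)


vader_sample = [0.6476, 0.8658, 0.0, 0.0, 0.7428]
textblob_sample = [0.625, 0.666667, -0.25, -0.25, 0.435227]
r_sample = pearson_r(vader_sample, textblob_sample)
print(f"Pearson r on the five-row sample: {r_sample:.3f}")
```

The value for this small sample will differ from the 0.667741 in the table, which is computed over all 100 comments.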


Now we have an idea of what our analysis is going to look like. However, one video is not enough. Let's build a bigger dataset using the top 100 most searched terms on YouTube

In [13]:
terms = pd.read_csv('search.tsv', sep='\t')  # obtained from https://ahrefs.com/blog/top-youtube-searches/
terms.head(10)

| | # | Keyword | Search Volume |
|---|---|---|---|
| 0 | 1 | bts | 16723304 |
| 1 | 2 | pewdiepie | 16495659 |
| 2 | 3 | asmr | 14655088 |
| 3 | 4 | billie eilish | 13801247 |
| 4 | 5 | baby shark | 12110100 |
| 5 | 6 | old town road | 10456524 |
| 6 | 7 | music | 10232134 |
| 7 | 8 | badabun | 10188997 |
| 8 | 9 | blackpink | 9580131 |
| 9 | 10 | fortnite | 9117342 |


We can use these search terms to get YouTube video links, which we can then use to gather comments. Here we only query the first 50 terms

In [None]:
from get_video import get_videos

# Only use the first 50 search terms
half = terms.head(50)
search_terms = half["Keyword"]
urls = []

for term in search_terms:
    urls.append(get_videos(term))

# Flatten the per-term results and drop duplicate links
links = list(set(item for sublist in urls for item in sublist))
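
The last line flattens the list of lists and removes duplicates in one go. A `set` does not keep insertion order, though, so the order of `links` can change between runs. The sketch below uses made-up placeholder strings instead of real links to compare it with an order-preserving alternative based on `dict.fromkeys`.

```python
from itertools import chain

# Hypothetical stand-ins for the per-term results collected in `urls`.
per_term_links = [["video_a", "video_b"], ["video_b", "video_c"], []]

# Same idea as the cell above: unique links, arbitrary order.
unique_unordered = list(set(chain.from_iterable(per_term_links)))

# Alternative: unique links in order of first appearance.
unique_ordered = list(dict.fromkeys(chain.from_iterable(per_term_links)))
print(unique_ordered)
```

Either version yields the same set of links; the ordered one is only preferable if the saved file should be reproducible from run to run.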


In [26]:
print(f"{len(links)} links have been collected")

with open("videos__.json", "w") as f:
    json.dump(links, f)

911 links have been collected
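
A quick way to confirm that a saved file can be read back is a round trip through `json.load`. The sketch below writes to a temporary directory with placeholder data so that it does not touch `videos__.json`; the file name `links_demo.json` and the sample values are assumptions for illustration.

```python
import json
import os
import tempfile

sample_links = ["video_a", "video_b", "video_c"]  # hypothetical stand-ins

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "links_demo.json")

    # Write the list, then read it back from disk.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample_links, f)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

# A list of strings survives the round trip unchanged.
print(reloaded == sample_links)
```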
