# Thesis project notebook 

##### Uncomment and run the cell below to automatically install the required libraries

In [2]:
# %pip install -r requirements.txt

Start by importing the required libraries

In [3]:
from multiprocessing import Pool
import pandas as pd
import utils
import langdetect as ld

Since the real analysis is going to take a long time to complete, we'll start with a toy example that takes about a minute.

Let's load comments from a JSON file and see how many there are

In [4]:
from get_video import load_comments

comments = load_comments("comments_4DCbZJh-Gk4__.json")
print(f"There are {len(comments)} comments in this dataset")
print(comments[:10])

Reading from local file <comments_4DCbZJh-Gk4__.json>
There are 100 comments in this dataset
['Outstanding!', 'my lord i imagine a war space i love this emg videos john browne are the best ever   ...ever suppppport  djent\nim from argentina say love metal instrumental', "1:31 that switch up is fucking timeless. I've known this song for about 5 years now and it still never gets old.", 'And yet we still search for proof of alien life 🤷🏻\u200d♂️', 'After almost 7 years this playthrough still brings tears to my eyes. So good and hope to see Monuments live again soon!', 'Djesus Christ', 'Masterpiece!!', 'god DAMN that part beginning at 1:58 is SOOO fucking good like holy FUCK man', 'All finger breaker chords. All the time.', 'Bruh']


##### Now we can start analyzing them!

We will analyze the comments with BERT first and save the results to a list. We pass the whole list at once because the BERT classifier pipeline works better on a list of comments than on one comment at a time

In [5]:
bert_scores = utils.bert_classifier(comments)

Now we analyze the comments using Vader and TextBlob

In [6]:
with Pool() as p:
    # Pool.map returns results in input order, so scores stay aligned with comments
    vader_scores = p.map(utils.vader_classifier, comments)
    textblob_scores = p.map(utils.textblob_classifier, comments)

# One row per comment, holding the score from each classifier
df_list = [
    {"Comment": comment, "Vader score": v_score, "TextBlob score": t_score}
    for comment, v_score, t_score in zip(comments, vader_scores, textblob_scores)
]

df = pd.DataFrame(df_list)

Next, we add the BERT scores so that the results of all three analyses sit in a single dataframe

In [7]:
# df was already built in the previous cell; add the BERT scores as a new column
df['BERT score'] = pd.Series(bert_scores)

Let's inspect the dataframe. Start by looking at the first 10 rows

In [8]:
df.head(10)

| | Comment | Vader score | TextBlob score | BERT score |
|---|---|---|---|---|
| 0 | Outstanding! | 0.6476 | 0.625 | 0.896599 |
| 1 | my lord i imagine a war space i love this emg ... | 0.8658 | 0.666667 | 0.894711 |
| 2 | 1:31 that switch up is fucking timeless. I've ... | 0.0 | -0.25 | -0.551378 |
| 3 | And yet we still search for proof of alien lif... | 0.0 | -0.25 | 0.367565 |
| 4 | After almost 7 years this playthrough still br... | 0.7428 | 0.435227 | 0.736152 |
| 5 | Djesus Christ | 0.0 | 0.0 | 0.272853 |
| 6 | Masterpiece!! | 0.6892 | 0.0 | 0.879835 |
| 7 | god DAMN that part beginning at 1:58 is SOOO f... | -0.1518 | 0.15 | -0.496737 |
| 8 | All finger breaker chords. All the time. | 0.0 | 0.0 | 0.759309 |
| 9 | Bruh | 0.0 | 0.0 | -0.386926 |


Scores are useful for computation, but they are hard to read at a glance, so we want a label for each score. The Vader team defines score ranges in their GitHub repository, which makes assigning Vader labels easy. BERT returns a label, but not in the format we want. TextBlob doesn't return labels at all, and its documentation does not define recommended ranges. Let's add some labels now

In [9]:
vader_labels = utils.generate_labels(df["Vader score"], "vader")
blob_labels = utils.generate_labels(df["TextBlob score"], "blob")
bert_labels = utils.generate_labels(df["BERT score"], "bert")

vader_labels_series = pd.Series(vader_labels)
blob_labels_series = pd.Series(blob_labels)
bert_labels_series = pd.Series(bert_labels)

df['Vader label'] = vader_labels_series
df['TextBlob label'] = blob_labels_series
df['BERT label'] = bert_labels_series
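
As a rough illustration of how a score becomes a label, here is a minimal sketch using the ranges given in the Vader README: a compound score of 0.05 or higher is positive, -0.05 or lower is negative, and anything in between is neutral. The helper name `label_from_score` is hypothetical and is not the implementation of `utils.generate_labels`; the ranges used for TextBlob and BERT in this notebook may well differ.

```python
def label_from_score(score, threshold=0.05):
    """Map a score in [-1, 1] to a sentiment label.

    Hypothetical helper for illustration only. The default threshold
    follows the ranges documented for Vader's compound score.
    """
    if score >= threshold:
        return "Positive"
    if score <= -threshold:
        return "Negative"
    return "Neutral"


# Three Vader scores taken from the first rows of the dataframe shown earlier
for score in (0.6476, 0.0, -0.1518):
    print(score, label_from_score(score))
```

Keeping the threshold as a parameter makes it easy to try other ranges for a model that does not document its own.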

Let's rearrange the columns and see what it looks like

In [10]:
df = df[[
    "Comment",
    "Vader score", "Vader label",
    "TextBlob score", "TextBlob label",
    "BERT score", "BERT label",
]]

df.head(10)

| | Comment | Vader score | Vader label | TextBlob score | TextBlob label | BERT score | BERT label |
|---|---|---|---|---|---|---|---|
| 0 | Outstanding! | 0.6476 | Positive | 0.625 | Positive | 0.896599 | Positive |
| 1 | my lord i imagine a war space i love this emg ... | 0.8658 | Positive | 0.666667 | Positive | 0.894711 | Positive |
| 2 | 1:31 that switch up is fucking timeless. I've ... | 0.0 | Neutral | -0.25 | Negative | -0.551378 | Negative |
| 3 | And yet we still search for proof of alien lif... | 0.0 | Neutral | -0.25 | Negative | 0.367565 | Positive |
| 4 | After almost 7 years this playthrough still br... | 0.7428 | Positive | 0.435227 | Positive | 0.736152 | Positive |
| 5 | Djesus Christ | 0.0 | Neutral | 0.0 | Neutral | 0.272853 | Neutral |
| 6 | Masterpiece!! | 0.6892 | Positive | 0.0 | Neutral | 0.879835 | Positive |
| 7 | god DAMN that part beginning at 1:58 is SOOO f... | -0.1518 | Negative | 0.15 | Neutral | -0.496737 | Negative |
| 8 | All finger breaker chords. All the time. | 0.0 | Neutral | 0.0 | Neutral | 0.759309 | Positive |
| 9 | Bruh | 0.0 | Neutral | 0.0 | Neutral | -0.386926 | Negative |


And now let's check the stats for our numerical columns 

In [11]:
print(df.describe())

       Vader score  TextBlob score  BERT score
count   100.000000      100.000000  100.000000
mean      0.245066        0.177426    0.246427
std       0.460005        0.421838    0.564940
min      -0.726200       -1.000000   -0.967942
25%       0.000000        0.000000   -0.289669
50%       0.080000        0.129018    0.385416
75%       0.682900        0.482924    0.723057
max       0.957100        1.000000    0.951091


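The summary statistics look broadly similar (the Vader and BERT means are almost identical), but similar distributions do not guarantee that the models agree on individual comments, so we check how closely the scores track each other using Pearson correlation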

In [12]:
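# Select the numeric columns explicitly: recent pandas versions raise an error
# if corr() is called on a dataframe that still contains text columns
score_columns = ["Vader score", "TextBlob score", "BERT score"]
print(df[score_columns].corr(method='pearson'))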

                Vader score  TextBlob score  BERT score
Vader score        1.000000        0.667741    0.465176
TextBlob score     0.667741        1.000000    0.561260
BERT score         0.465176        0.561260    1.000000
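
For reference, Pearson's r between two score columns is their covariance divided by the product of their standard deviations. The sketch below computes it by hand in plain Python for the ten Vader and TextBlob scores shown in the first rows of the dataframe. The helper name `pearson_r` is illustrative and not part of `utils`. Because it uses only ten rows, its result will generally differ from the full-dataset values in the matrix above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations; the 1/n factors cancel in the ratio
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Scores copied from the first ten rows of the dataframe
vader = [0.6476, 0.8658, 0.0, 0.0, 0.7428, 0.0, 0.6892, -0.1518, 0.0, 0.0]
textblob = [0.625, 0.666667, -0.25, -0.25, 0.435227, 0.0, 0.0, 0.15, 0.0, 0.0]

print(round(pearson_r(vader, textblob), 3))
```

A value near 1 means the two models rank comments in almost the same way, a value near 0 means their scores are largely unrelated, and a negative value means they tend to disagree.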
