# Thesis project notebook 

##### Uncomment and run the cell below to automatically install the required libraries

In [117]:
# %pip install -r requirements.txt

Start by importing the required libraries

In [118]:
import json
import os
import random
from multiprocessing import Pool
import pandas as pd
import utils
import time

Let's load in the search terms we need to use

In [119]:
terms = pd.read_csv(f'{os.getcwd()}\\Data\\search.tsv', sep='\t') # obtained from https://ahrefs.com/blog/top-youtube-searches/
terms.head(10)

Unnamed: 0,#,Keyword,Search Volume
0,1,bts,16723304
1,2,pewdiepie,16495659
2,3,asmr,14655088
3,4,billie eilish,13801247
4,5,baby shark,12110100
5,6,old town road,10456524
6,7,music,10232134
7,8,badabun,10188997
8,9,blackpink,9580131
9,10,fortnite,9117342


We can use these search terms to get YouTube video links, which we can then use to gather comments. To save some time (and some precious API quota), we'll use only the 50 most-searched terms.

This cell should not be run more than once a month. To make sure you never have to run it again, save the output to a file and load it for subsequent runs

In [120]:
# from get_video import get_videos
# half = terms.head(50)
# search_terms = half["Keyword"]
# urls = []

# for term in search_terms:
#     urls.append(get_videos(term))

# links = list(set(item for sublist in urls for item in sublist))

# with open(f"{os.getcwd()}\\Data\\videos__.json", "w") as f:
#     json.dump(links, f)

In [121]:
with open(f"{os.getcwd()}\\Data\\videos__.json", "r") as f:
    links = json.load(f)

print(f"This file contains <{len(links)}> links")

This file contains <911> links
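
The save-once, load-afterwards pattern used for `videos__.json` can be wrapped in a small helper, so that the expensive cell does not have to be commented out by hand. This is only a sketch: `load_or_compute` and `fake_search` are hypothetical names for illustration and are not part of `get_video` or `utils`.

```python
import json
import os
import tempfile


def load_or_compute(path, compute):
    """Return cached JSON from `path`, or call `compute()` once and cache the result."""
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    result = compute()
    with open(path, "w") as f:
        json.dump(result, f)
    return result


# Stand-in for the expensive API call (hypothetical data)
calls = []


def fake_search():
    calls.append(1)
    return ["https://www.youtube.com/watch?v=feLcMaiClOw"]


cache_path = os.path.join(tempfile.mkdtemp(), "videos_demo.json")
first = load_or_compute(cache_path, fake_search)   # computes and writes the file
second = load_or_compute(cache_path, fake_search)  # reads the file, no new call
```

With this in place, rerunning the notebook only hits the API when the cache file is missing.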


This is a lot of links (and therefore a lot of data). However, once we take into account the rate of analysis and the number of comments per video, we realize that it's going to take a lot of time: many of these videos have at least 20,000 comments. Knowing that it takes this program around 30 seconds to analyze 100 comments, analyzing all comments would take days.

Instead of waiting for 9 days, we can make a compromise: we can take a sample of 500 comments per video. 
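
To see what the compromise buys us per video, here is a back-of-the-envelope estimate. It only uses the figures quoted above (30 seconds per 100 comments, 20,000 comments for a large video, a 500-comment sample). The function name is hypothetical, and real throughput will vary with the machine and with multiprocessing.

```python
SECONDS_PER_100_COMMENTS = 30  # throughput quoted above


def analysis_seconds(n_comments, seconds_per_100=SECONDS_PER_100_COMMENTS):
    """Rough time needed to analyze `n_comments` at a constant rate."""
    return n_comments / 100 * seconds_per_100


full_video = analysis_seconds(20_000)  # one video with 20,000 comments
sampled_video = analysis_seconds(500)  # one video capped at 500 comments

print(f"Full video:    {full_video / 60:.1f} minutes")
print(f"Sampled video: {sampled_video / 60:.1f} minutes")
```

Under these assumptions the 500-comment cap cuts the per-video cost by a factor of 40.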

Now that we have our list of links, we can start collecting data. Let's do the fast part first: collecting the like/dislike ratio and saving it somewhere convenient for later. 

As before, this cell should ideally be run only once, since it takes around 15 minutes. If the notebook needs to be restarted, load the CSV file instead.

In [122]:
# from get_video import get_likes
# start_time = time.time()

# with open(f"{os.getcwd()}\\Data\\videos__.json", "r") as f:  # f-string, so the path is actually interpolated
#     links = json.load(f)
# df_list = []

# checkpoints = {int(len(links) * perc): perc * 100 for perc in [0.25, 0.5, 0.75, 0.9, 1]} # links processed -> percentage done

# for i, link in enumerate(links):
#     current_time = time.time()
#     elapsed_time = current_time - start_time
#     data = get_likes(link)
#     df_list.append(data)
#     time.sleep(1) # otherwise we will get an HTTP error
#     if i + 1 in checkpoints:
#         print(f"{checkpoints[i+1] :.0f}% of the videos have been analyzed\nTime elapsed: {elapsed_time :.2f} seconds")

# df = pd.DataFrame(df_list)
# utils.save_analysis(df, "top50_ratio")
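
The progress messages in the cell above rely on a dictionary that maps "number of items processed" to "percentage done". Here is a minimal, self-contained sketch of that idea; `make_checkpoints` is a hypothetical helper name, and 911 is the link count reported earlier.

```python
def make_checkpoints(n_items, fractions=(0.25, 0.5, 0.75, 0.9, 1)):
    """Map the item count at which to report progress to the percentage reached."""
    return {int(n_items * frac): frac * 100 for frac in fractions}


checkpoints = make_checkpoints(911)

reported = []
for i in range(911):
    # i is zero-based, so i + 1 is the number of items processed so far
    if i + 1 in checkpoints:
        reported.append(round(checkpoints[i + 1]))
```

Because the lookup is a dictionary membership test, it adds essentially no overhead to the loop.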

Let's look at the first 10 entries

In [123]:
likes_df = pd.read_csv(f"{os.getcwd()}\\Data\\analysis_top50_ratio.csv", index_col = 0)
likes_df = likes_df.reset_index()
likes_df.head(10)

Unnamed: 0,URL,likes,dislikes
0,https://www.youtube.com/watch?v=feLcMaiClOw,312761,19093
1,https://www.youtube.com/watch?v=tASwv-tkMlc,35026,165
2,https://www.youtube.com/watch?v=3U2dNKBM28o,126427,6010
3,https://www.youtube.com/watch?v=1cPDfXU95Xw,23783,18
4,https://www.youtube.com/watch?v=pptIU_ZLlOo,4600,7
5,https://www.youtube.com/watch?v=ClQ-ymoXJZc,1079676,8874
6,https://www.youtube.com/watch?v=8EJ3zbKTWQ8,11371069,1420624
7,https://www.youtube.com/watch?v=t99KH0TR-J4,1435738,33092
8,https://www.youtube.com/watch?v=DovdIspaqmw,146007,5324
9,https://www.youtube.com/watch?v=DQQRjFzB8gY,870202,49922


This is good data, but it is not complete. We are still missing the ratio of likes to dislikes. Let's compute that now!

In [124]:
likes, dislikes = likes_df['likes'], likes_df['dislikes']

ratios = []

for like, dislike in zip(likes, dislikes):
    # videos with no likes or no dislikes get a ratio of 0 (this also avoids dividing by zero)
    if like == 0 or dislike == 0:
        ratio = 0
    else:
        ratio = round(like / dislike, 3)
    ratios.append(ratio)

ratios = pd.Series(ratios)
ratios.head(10)

0      16.381
1     212.279
2      21.036
3    1321.278
4     657.143
5     121.667
6       8.004
7      43.386
8      27.424
9      17.431
dtype: float64
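
The same ratios can also be computed without an explicit loop. The sketch below is a vectorized equivalent, assuming the same convention as the loop (a ratio of 0 whenever likes or dislikes are 0). `like_dislike_ratio` is a hypothetical name, and the demo values are the first two rows shown above plus two made-up edge cases.

```python
import pandas as pd


def like_dislike_ratio(likes, dislikes):
    """Likes divided by dislikes, rounded to 3 decimals; 0 if either count is 0."""
    valid = (likes != 0) & (dislikes != 0)
    safe_dislikes = dislikes.where(valid, 1)  # placeholder divisor for invalid rows
    ratio = (likes / safe_dislikes).round(3)
    return ratio.where(valid, 0.0)            # overwrite invalid rows with 0


demo_likes = pd.Series([312761, 35026, 0, 10])
demo_dislikes = pd.Series([19093, 165, 5, 0])
demo = like_dislike_ratio(demo_likes, demo_dislikes)
```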

Let's add this series to our dataframe and compute the mean ratio (videos with zero likes or zero dislikes count as 0 in the mean)

In [125]:
likes_df['ratio'] = ratios.astype('float64') 
mean_ratio = ratios.mean()
print(f"Mean like/dislike ratio = {mean_ratio :.2f}")
likes_df.head(10)

Mean like/dislike ratio = 123.55


Unnamed: 0,URL,likes,dislikes,ratio
0,https://www.youtube.com/watch?v=feLcMaiClOw,312761,19093,16.381
1,https://www.youtube.com/watch?v=tASwv-tkMlc,35026,165,212.279
2,https://www.youtube.com/watch?v=3U2dNKBM28o,126427,6010,21.036
3,https://www.youtube.com/watch?v=1cPDfXU95Xw,23783,18,1321.278
4,https://www.youtube.com/watch?v=pptIU_ZLlOo,4600,7,657.143
5,https://www.youtube.com/watch?v=ClQ-ymoXJZc,1079676,8874,121.667
6,https://www.youtube.com/watch?v=8EJ3zbKTWQ8,11371069,1420624,8.004
7,https://www.youtube.com/watch?v=t99KH0TR-J4,1435738,33092,43.386
8,https://www.youtube.com/watch?v=DovdIspaqmw,146007,5324,27.424
9,https://www.youtube.com/watch?v=DQQRjFzB8gY,870202,49922,17.431


We will use the data in this dataframe as the baseline for the sentiment analysis.

The sentiment analysis itself is going to take a long time to complete, even with multiprocessing.

Let's start by collecting the comments.
The results will be saved to a file, so there's no need to run this cell more than once if everything goes well.

In [126]:
# from get_video import get_id, get_comments
# import googleapiclient.discovery

# urls = list(likes_df["URL"])

# target_dir = f"{os.getcwd()}/Data"
# if not os.path.exists(target_dir):
#     os.mkdir(target_dir)

# for url in urls:
#     video_id = get_id(url)
#     try:
#         comments = get_comments(url)
#     except googleapiclient.errors.HttpError:
#         continue
#     utils.move_dir(f"comments_{video_id}.json", target_dir)

There is a problem: some of the files have very few comments, and we need at least 500 per video. So, we will have to remove all files with fewer than 500 comments

In [127]:
# from get_video import load_comments
# target_dir = f"{os.getcwd()}\\Data"
# trash = f"{os.getcwd()}\\Trash"

# if not os.path.exists(trash):
#     os.mkdir(trash)

# data = os.listdir(target_dir)

# # progress checkpoints, based on the number of files we iterate over
# checkpoints = {int(len(data) * perc): perc * 100 for perc in [0.25, 0.5, 0.75, 0.9, 1]}

# start =  time.time()

# for i, file in enumerate(data):
#     current_time = time.time()
#     elapsed_time = current_time - start
#     filepath = os.path.join(target_dir, file)
#     content = load_comments(filepath)
#     if len(content) < 500:
#         utils.move_dir(filename = file, destination = trash, source = target_dir)
#     if i + 1 in checkpoints:
#         print(f"{checkpoints[i+1] :.0f}% of the videos have been analyzed\nTime elapsed: {elapsed_time :.2f} seconds") 
# trash_list = os.listdir(trash)

# print(f"<{len(trash_list)}> files moved")
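
For reference, here is a self-contained sketch of the same filtering step that does not depend on `utils.move_dir` or `load_comments`. It assumes that each `comments_*.json` file holds a JSON list with one entry per comment, which may not match the real file layout. `move_small_files` is a hypothetical name, and the demo runs on throwaway files in a temporary directory.

```python
import json
import shutil
import tempfile
from pathlib import Path

MIN_COMMENTS = 500


def move_small_files(data_dir, trash_dir, min_comments=MIN_COMMENTS):
    """Move comment files with fewer than `min_comments` entries into `trash_dir`."""
    trash_dir.mkdir(exist_ok=True)
    moved = []
    for path in sorted(data_dir.glob("comments_*.json")):
        with open(path, "r") as f:
            comments = json.load(f)  # assumption: a JSON list of comments
        if len(comments) < min_comments:
            shutil.move(str(path), str(trash_dir / path.name))
            moved.append(path.name)
    return moved


# Demo on throwaway files
root = Path(tempfile.mkdtemp())
data_dir, trash_dir = root / "Data", root / "Trash"
data_dir.mkdir()
for name, n in [("comments_aaa.json", 10), ("comments_bbb.json", 500)]:
    with open(data_dir / name, "w") as f:
        json.dump(["text"] * n, f)

moved = move_small_files(data_dir, trash_dir)
kept = [p.name for p in data_dir.iterdir()]
```

Restricting the glob to `comments_*.json` keeps other files in the data folder (such as the search terms or the likes CSV) out of the check.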

Quick sanity check:

In [128]:
# target_dir = f"{os.getcwd()}\\Data"

# data = os.listdir(target_dir)
# print(f"This directory contains <{len(data)}> files")
# for file in data:
#     filepath = os.path.join(target_dir, file)
#     content = load_comments(filepath)
#     print(len(content))

Great! We have over 500 videos left, each with at least 500 comments. All that's left to do is run the sentiment analysis on each video. This is going to take a lot of time, which means we'll have to do it in batches. 100 videos per batch seems feasible, so let's do that. 

In [129]:
# %run split_folder__.py "F:\Desktop\Scuola\Uni\Y3\S2\Thesis\Project\Main project\GetYoutubeComments\Data" 
# #Obtained from https://gist.github.com/zupo/5849843
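
The script above does the actual splitting on disk. The underlying idea is just chunking a list of files into groups of 100, which can be sketched as follows. `split_into_batches` is a hypothetical helper, and the 550 file names are made up for the demo.

```python
def split_into_batches(items, batch_size=100):
    """Split `items` into consecutive chunks of at most `batch_size` elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


files = [f"comments_{n}.json" for n in range(550)]  # made-up file names
batches = split_into_batches(files)
sizes = [len(batch) for batch in batches]
```

The last batch is simply smaller than the others when the number of files is not a multiple of 100.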

Another sanity check:

In [130]:
# target_dir = f"{os.getcwd()}\\Data\\Batch_1"
# data = os.listdir(target_dir)
# print(len(data))

Now we can finally start the analysis

In [131]:
# from get_video import load_comments

# batch_6 = f"{os.getcwd()}\\Data\\Batch_6"
# data_6 = os.listdir(batch_6)

# os.chdir(batch_6)

# for file in data_6:
#     if file.startswith("analysis_"): 
#         continue
#     else:
#         start = time.time()
#         filename = f"{batch_6}\\{file}"
#         content = load_comments(filename)

#         with Pool() as p:
#             tr_comment = p.map(utils.translate, content)

#         bert_scores = utils.bert_classifier(content)

#         df_list = []
#         scores_list = []

#         with Pool() as p:
#             analysis1 = p.map(utils.vader_classifier, tr_comment)
#             analysis2 = p.map(utils.textblob_classifier, tr_comment)
#             scores_list.append(analysis1)
#             scores_list.append(analysis2)

#         for c, v_score, t_score in zip(content, scores_list[0], scores_list[1]):
#             df_dict = {
#                 "Comment": c,
#                 "Vader score": v_score,
#                 "TextBlob score": t_score,
#             }
#             df_list.append(df_dict)

#         df = pd.DataFrame(df_list)

#         series_bert = pd.Series(bert_scores)
#         df["BERT score"] = series_bert

#         vader_labels = utils.generate_labels(df["Vader score"], "vader")
#         blob_labels = utils.generate_labels(df["TextBlob score"], "blob")
#         bert_labels = utils.generate_labels(df["BERT score"], "bert")

#         vader_labels_series = pd.Series(vader_labels)
#         blob_labels_series = pd.Series(blob_labels)
#         bert_labels_series = pd.Series(bert_labels)

#         df["Vader label"] = vader_labels_series
#         df["TextBlob label"] = blob_labels_series
#         df["BERT label"] = bert_labels_series

#         df = df[
#             [
#                 "Comment",
#                 "Vader score",
#                 "Vader label",
#                 "TextBlob score",
#                 "TextBlob label",
#                 "BERT score",
#                 "BERT label",
#             ]
#         ]

#         end = time.time()
#         time_taken = end - start
#         print(f"Processed file <{file}>\nTime taken: <{time_taken}> seconds")
#         utils.save_analysis(df, file)

# os.chdir("F:\\Desktop\\Scuola\\Uni\\Y3\\S2\\Thesis\\Project\\Main project\\GetYoutubeComments") # Reset cwd for next batch
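
The cell above turns each score into a label through `utils.generate_labels`, which is not shown in this notebook. As an illustration only, here is how a score in the range -1 to 1 is commonly mapped to a label. The ±0.05 cut-off is the convention usually applied to VADER's compound score; the thresholds actually used by `utils.generate_labels` may differ from model to model.

```python
def label_from_score(score, threshold=0.05):
    """Map a polarity score in [-1, 1] to a label (threshold is an assumption)."""
    if score >= threshold:
        return "Positive"
    if score <= -threshold:
        return "Negative"
    return "Neutral"


labels = [label_from_score(s) for s in (0.6369, 0.0, -0.2939)]
```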

Let's copy and merge the CSV files for easier analysis of overall trends

In [132]:
# import shutil

# folders = ["Batch_1", "Batch_2", "Batch_3", "Batch_4", "Batch_5", "Batch_6"]
# data_dir = f"{os.getcwd()}\\Data"
# target_dir = f"{os.getcwd()}\\CSV"

# if not os.path.exists(target_dir):
#     os.mkdir(target_dir)

# for folder in folders:
#     batch = f"{data_dir}\\{folder}"
#     all_data = os.listdir(batch)
#     for file in all_data:
#         if file.endswith(".csv"):
#             filename = os.path.join(batch, file)
#             shutil.copy(filename, target_dir)            

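# # only merge the per-video analysis files, so a rerun does not pick up combined_csv.csv itself
# files = [f for f in os.listdir(target_dir) if f.startswith("analysis_")]

# combined_csv = pd.concat([pd.read_csv(f"{target_dir}\\{f}") for f in files])
# combined_csv.to_csv(f"{target_dir}\\combined_csv.csv", index=False, encoding="utf-8-sig") # write where the next line reads from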

# all_files = pd.read_csv(f"{target_dir}\\combined_csv.csv")
# all_files.head(10)

In [133]:
# target_dir = f"{os.getcwd()}\\CSV"
# f = pd.read_csv(f"{target_dir}\\combined_csv.csv")

# f = f.replace({"Netural": "Neutral"})

# utils.save_analysis(f, filename = "all_files")

Let's compute, for each model, the share of comments across all videos that falls into each polarity label

In [134]:
target_dir = f"{os.getcwd()}\\Data\\CSV"
all_files = pd.read_csv(f"{target_dir}\\analysis_all_files.csv")

columns = [all_files["Vader label"], all_files["TextBlob label"], all_files["BERT label"]]
models = ["Vader", "TextBlob", "BERT"]
df_list = []

for column, model in zip(columns, models):
    col = column.value_counts(normalize=True)
    d = {"Model": model, "Positive": col['Positive'], "Neutral": col['Neutral'], "Negative": col['Negative']}
    df_list.append(d)

all_ratios = pd.DataFrame(df_list)
print("Ratios: ")
print(all_ratios, "\n")

Ratios: 
      Model  Positive   Neutral  Negative
0     Vader  0.507953  0.357239  0.134808
1  TextBlob  0.259344  0.646309  0.094347
2      BERT  0.452432  0.244917  0.302651 
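
One caveat with indexing the result of `value_counts` by label: if a label never occurs in a column, the lookup raises a `KeyError`. A slightly more defensive variant reindexes the shares against the full label list first. `label_shares` is a hypothetical helper and the demo series is made up.

```python
import pandas as pd

LABELS = ["Positive", "Neutral", "Negative"]


def label_shares(labels):
    """Share of each polarity label, with 0.0 for labels that never occur."""
    shares = labels.value_counts(normalize=True)
    return shares.reindex(LABELS, fill_value=0.0)


demo = pd.Series(["Positive", "Positive", "Neutral", "Positive"])
shares = label_shares(demo)
```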



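Vader and BERT label a similar share of comments as positive (about 51% and 45%), although BERT predicts more than twice as many negatives (about 30% versus 13%). TextBlob does not follow the trend: it labels almost 65% of comments as neutral.
Let's check if the predictions correlate: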

In [135]:
print("Statistics: ")
print(all_files.describe(), "\n")

print("Correlation matrix: ")
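# select the score columns explicitly: recent pandas versions raise an error if text columns are passed to corr()
print(all_files[["Vader score", "TextBlob score", "BERT score"]].corr("pearson"))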

Statistics: 
         Vader score  TextBlob score     BERT score
count  339364.000000   339364.000000  339364.000000
mean        0.255158        0.149271       0.204303
std         0.448403        0.324981       0.459851
min        -1.000000       -1.000000      -0.988600
25%         0.000000        0.000000      -0.293853
50%         0.090000        0.000000       0.308578
75%         0.636900        0.350000       0.522687
max         1.000000        1.000000       0.991498 

Correlation matrix: 
                Vader score  TextBlob score  BERT score
Vader score        1.000000        0.524019    0.447403
TextBlob score     0.524019        1.000000    0.446724
BERT score         0.447403        0.446724    1.000000
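
For reference, each entry of this matrix is a Pearson coefficient: the covariance of two score columns divided by the product of their standard deviations. The sketch below computes it by hand with NumPy on made-up numbers; `pearson_r` is a hypothetical name, and `DataFrame.corr` already does this for every pair of columns.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # center both variables
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


a = [0.0, 0.5, 1.0, -0.5]  # made-up scores from one model
b = [0.1, 0.4, 0.9, -0.6]  # made-up scores from another model
r = pearson_r(a, b)
```

A value of 1 means a perfect increasing linear relationship, 0 means no linear relationship, and -1 a perfect decreasing one.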


All three pairs of scores show a moderate positive correlation (Pearson r between roughly 0.45 and 0.52), so the models tend to agree on the direction of the sentiment without being interchangeable. With this in mind, let's start predicting the dislikes for each video

Let's start by making a copy of the likes_df dataframe and adding the filename as a column of the new dataframe

In [287]:
data_dir = f"{os.getcwd()}\\Data"
csv_dir = f"{data_dir}\\CSV"

csv_list = os.listdir(csv_dir)
likes_df_short = likes_df.copy(deep = True)

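# the video ID is the last 11 characters of each URL
url_id = [url[-11:] for url in likes_df_short["URL"]]
# strip the "analysis_comments_" prefix (18 characters) and the 9-character suffix from each file name
file_id = [filename[18:-9] for filename in csv_list if filename.startswith("analysis_comments")]

test = sorted(set(url_id).intersection(file_id))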

print(test)
# likes_df_short["ID order"] = pd.Series(url_id)
# likes_df_short = likes_df_short.set_index("ID order")
# likes_df_short.loc[url_id]


# likes_df_short.head(10)               

['-6-AnhFG3cI', '-8VfKZCOo_I', '-BjZmE2gtdo', '-CmadmM5cOk', '-Hqp5Mwq3Zs', '-L1PrZf6OTU', '-eGEKFMHGr8', '-fWXLAkHmxA', '-s23GrMq-wA', '-tJYN-eG1zk', '0CWFO8cMxO4', '0HDdjwpPM3Y', '0zBSw5Rg41w', '1-xGerv5FOk', '1D5cxbDbKMc', '1GS7wxWPaxc', '1__CAdTJ5JU', '1_zgKRBrT0Y', '1ekZEVeXwek', '2S24-y0Ij3Y', '2Vv-BfVoq4g', '2Z0Put0teCM', '2cyzCReoNgU', '2fDVXFlHtdE', '2j9RcvNRywQ', '2sn45qRAEzE', '2zNB3CotyxM', '32si5cfrCNc', '34u9lJWBLMc', '3CxtK7-XtE0', '3R0AlwgcyDE', '3U2dNKBM28o', '3X9wEwulYhk', '3f2X_gIJ_u4', '3farSqm-AxM', '3pgWNs7lay0', '3qHhGmiAQoY', '3tmd-ClpJxA', '3vDuSr2VLmw', '4-TbQnONe_w', '4GuqB1BQVr4', '4Rgqc-9MPIY', '4UDaT5LVnmI', '4iQxG8ZjYO8', '4j725wScY0E', '58u-zkDLNPg', '5EQ8ctJmmUo', '5GJWxDKyk3A', '5TBW-uCkvtA', '5UMCrq-bBCg', '5X-Mrc2l1d0', '5eBBdBEbbXQ', '5kHq6hqgROc', '5ula1NjaHUA', '5unbAtbxvEM', '60ItHLz5WEA', '6DRkf3kZMXw', '6Dh-RL__uN4', '6G7GulQJUIU', '6Mgqbai3fKo', '6txSmc8napQ', '72UO0v5ESUo', '75PpxGzR7s0', '7Gs9zvu5FyE', '7JmprpRIsEY', '7MHlhHaLcTA', '7ZP_Fubq

In [None]:
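# urls was only defined in a commented-out cell above, so rebuild it here
urls = list(likes_df["URL"])
url_ids = {url[-11:] for url in urls}  # set of video IDs for fast lookup

target_files = []
id_lst = []
for file in csv_list:
    if file.startswith("analysis_comments"):
        id_ = file[18:-9]
        if id_ in url_ids:
            target_files.append(file)
            id_lst.append(id_)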
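
The fixed slices (`url[-11:]` and `file[18:-9]`) work as long as every URL and file name has exactly the expected shape. A sketch of a more explicit alternative is shown below. It assumes that URLs carry the ID in the `v` query parameter and that file names look like `analysis_comments_<id>.json.csv`; the suffix is inferred from the 9-character slice, so treat it as an assumption. Both function names are hypothetical.

```python
from urllib.parse import parse_qs, urlparse

PREFIX = "analysis_comments_"
SUFFIX = ".json.csv"  # assumption, inferred from the [18:-9] slice


def id_from_url(url):
    """Read the video ID from the `v` query parameter of a watch URL."""
    return parse_qs(urlparse(url).query)["v"][0]


def id_from_filename(name):
    """Return the video ID embedded in an analysis file name, or None."""
    if name.startswith(PREFIX) and name.endswith(SUFFIX):
        return name[len(PREFIX):-len(SUFFIX)]
    return None


demo_url = "https://www.youtube.com/watch?v=3U2dNKBM28o"
demo_file = "analysis_comments_3U2dNKBM28o.json.csv"
same_video = id_from_url(demo_url) == id_from_filename(demo_file)
```

Files that do not match the pattern return `None` instead of a silently wrong ID.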

