# YouTube Comments Analysis

## Imports

In [1]:
# My own modules
from models.text_models import TextModelManager
from models.llm_api import LLM
from api.youtube_api import YoutubeAPI
from analysis.classification_analysis import ClassificationAnalyzer
from analysis.statements_analysis import StatementsAnalyzer
from analysis.clustering import ClusteringAnalyzer



In [2]:
# Logging
import logging
logging.basicConfig(
    level=logging.INFO,  # Set the logging level
    format='%(asctime)s.%(msecs)03d - %(name)s - %(levelname)s - %(message)s',  # Define the log format with milliseconds
    datefmt='%Y-%m-%d %H:%M:%S'  # Define the date and time format without milliseconds
)

## Load Models

In [3]:
# Initialize classification models
text_model_manager = TextModelManager()

2024-08-01 10:06:58.357 - models.text_models - INFO - Instantiating TextModelManager.


## Set up LLM

In [4]:
llm = LLM()

2024-08-01 10:06:58.365 - models.llm_api - INFO - Instantiating LLM.


## YouTube API

In [5]:
youtube = YoutubeAPI()

2024-08-01 10:06:58.400 - api.youtube_api - INFO - Instantiating YoutubeAPI.
2024-08-01 10:06:58.403 - googleapiclient.discovery_cache - INFO - file_cache is only supported with oauth2client<4.0.0


In [6]:
yt_video_test_id_tomato = "9WQnap-UAiQ"
yt_video_test_id_10k_comments = "2-XxbdR3Nik"
yt_video_test_id_4500_comments = "-ih0B9yn32Q"
yt_video_test_id_4k_comments_beard_meets_schnitzel = "qPd9qPUR2_U"
yt_video_test_id_2000_comments = "rX2tK-qSVpk"
yt_video_test_id_700_comments = "VCXqELB3UPg"
yt_video_test_id_300_comments = "yQqJafC7xv0"
yt_video_test_id_25_comments = "kiF0wgM8zGc"
yt_video_test_id_50_comments = "LHQMIuzjl48"

yt_video_id = yt_video_test_id_700_comments
youtube.set_current_video(yt_video_id)

In [7]:
youtube.get_title()

'You can mix 10 marbles until they sort themselves. Why not 100?'

In [8]:
youtube.get_creator_name()

'AlphaPhoenix'

In [9]:
# Get comments (for testing)
comments = youtube.get_comments(yt_video_id)

2024-08-01 10:06:58.619 - api.youtube_api - INFO - Starting raw comment retrieval.


Starting comments retrieval for video ID VCXqELB3UPg ('You can mix 10 marbles until they sort themselves. Why not 100?')


2024-08-01 10:06:58.827 - api.youtube_api - INFO - Received 100 top-level comments.
2024-08-01 10:06:58.828 - api.youtube_api - INFO - Requesting another page (page 2 of at most 14) ...
2024-08-01 10:06:59.005 - api.youtube_api - INFO - Received 100 top-level comments.
2024-08-01 10:06:59.006 - api.youtube_api - INFO - Requesting another page (page 3 of at most 14) ...
2024-08-01 10:06:59.169 - api.youtube_api - INFO - Received 100 top-level comments.
2024-08-01 10:06:59.170 - api.youtube_api - INFO - Requesting another page (page 4 of at most 14) ...
2024-08-01 10:06:59.339 - api.youtube_api - INFO - Received 100 top-level comments.
2024-08-01 10:06:59.340 - api.youtube_api - INFO - Requesting another page (page 5 of at most 14) ...
2024-08-01 10:06:59.504 - api.youtube_api - INFO - Received 100 top-level comments.
2024-08-01 10:06:59.507 - api.youtube_api - INFO - Requesting another page (page 6 of at most 14) ...
2024-08-01 10:06:59.673 - api.youtube_api - INFO - Received 100 top-le
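
The log above suggests that comments arrive in pages of up to 100, with a cap on the number of pages requested. The internals of `YoutubeAPI.get_comments` are not shown in this notebook, so the following is only a minimal sketch of token-based pagination under stated assumptions: `iter_comment_pages` and `fetch_page` are hypothetical names, and the response is assumed to carry `items` and an optional `nextPageToken`, as YouTube Data API v3 list responses do.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional


def iter_comment_pages(
    fetch_page: Callable[[Optional[str]], Dict[str, Any]],
    max_pages: int,
) -> Iterator[List[Any]]:
    """Yield one list of items per page until pages run out or the cap is hit.

    `fetch_page` is a hypothetical callable that takes a page token (None for
    the first page) and returns a dict with "items" and, if more pages exist,
    "nextPageToken".
    """
    page_token: Optional[str] = None
    for _ in range(max_pages):
        response = fetch_page(page_token)
        yield response.get("items", [])
        page_token = response.get("nextPageToken")
        if page_token is None:
            # No further pages are available.
            break
```

With the `googleapiclient` client, `fetch_page` could plausibly wrap a call such as `client.commentThreads().list(part="snippet", videoId=video_id, maxResults=100, pageToken=token).execute()`, where `client` and `video_id` are placeholders; whether the project does exactly this is an assumption.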

## Clustering

The goal here is to identify trends and common themes in the comments.

In [10]:
clustering_analyzer = ClusteringAnalyzer(video_id=yt_video_id, comments=comments)

2024-08-01 10:07:01.414 - googleapiclient.discovery_cache - INFO - file_cache is only supported with oauth2client<4.0.0


In [11]:
clustering_analyzer.cluster()

Calculating embeddings ...: 100%|██████████| 1399/1399 [01:28<00:00, 15.85it/s]
Clustering ...: 100%|██████████| 40/40 [01:17<00:00,  1.94s/it]
2024-08-01 10:09:47.459 - analysis.clustering - INFO - Best clustering out of 26 is with 2 clusters, with a mean Silhouette coefficient of 0.04565867781639099 (function was <function cluster_spectral_clustering at 0x7f067cdd3e20>).
Find cluster topics ...: 100%|██████████| 2/2 [00:00<00:00,  2.09it/s]
Fusing groups ...: 100%|██████████| 2/2 [00:00<00:00, 43464.29it/s]
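
The log reports that the best of 26 candidate clusterings was chosen by its mean Silhouette coefficient (about 0.046 here). The coefficient ranges from -1 to 1, and values near 0 usually indicate overlapping clusters, so the two-cluster split above should be read with caution. The code of `ClusteringAnalyzer` is not shown, so the sketch below is only an illustration of the selection idea in plain Python, using Euclidean distance as an assumption; `mean_silhouette` and `pick_best_labeling` are hypothetical helper names.

```python
import math
from typing import Dict, List, Sequence


def mean_silhouette(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean Silhouette coefficient over all points (Euclidean distance assumed)."""
    clusters: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("Silhouette needs at least two clusters.")

    scores: List[float] = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            # By convention, a singleton cluster contributes a score of 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(scores) / len(scores)


def pick_best_labeling(
    points: Sequence[Sequence[float]],
    candidates: Sequence[Sequence[int]],
) -> Sequence[int]:
    """Return the candidate labeling with the highest mean Silhouette coefficient."""
    return max(candidates, key=lambda labels: mean_silhouette(points, labels))
```

In practice the points would be the comment embeddings and each candidate would come from a different algorithm or cluster count. This pairwise version is quadratic in the number of points, so it is meant for illustration rather than for large comment sets.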


In [12]:
clustering_analyzer.describe_clusters()

2024-08-01 10:09:48.512 - analysis.clustering - INFO - -------------------------Cluster Description (Label 0)--------------------------
2024-08-01 10:09:48.513 - analysis.clustering - INFO - - Topic: Understanding and Debating Entropy
2024-08-01 10:09:48.515 - analysis.clustering - INFO - - Cluster size: 600 (42.89%)
2024-08-01 10:09:48.515 - analysis.clustering - INFO - --------------------------------------------------------------------------------
2024-08-01 10:09:48.515 - analysis.clustering - INFO - 
2024-08-01 10:09:48.516 - analysis.clustering - INFO - -------------------------Cluster Description (Label 1)--------------------------
2024-08-01 10:09:48.517 - analysis.clustering - INFO - - Topic: Science and Humor
2024-08-01 10:09:48.518 - analysis.clustering - INFO - - Cluster size: 799 (57.11%)
2024-08-01 10:09:48.519 - analysis.clustering - INFO - --------------------------------------------------------------------------------
2024-08-01 10:09:48.521 - analysis.clustering - INF

## LLM Statement Extraction

In [13]:
statements_analyzer = StatementsAnalyzer(
    video_id=yt_video_id,
    comments=comments
)

2024-08-01 10:09:48.530 - googleapiclient.discovery_cache - INFO - file_cache is only supported with oauth2client<4.0.0


In [14]:
statements_analyzer.run_analysis(
    limit_statements=2,  # For testing, limit the number of statements
    comment_top_k=2  # For testing, use a reduced count
)

Grouping by sentiment ...: 100%|██████████| 887/887 [00:29<00:00, 30.00it/s]
Measuring statement agreement with comments ...: 100%|██████████| 12/12 [00:23<00:00,  1.95s/it]
2024-08-01 10:10:44.904 - analysis.statements_analysis - INFO - Score for statement 'The video's explanation of probability and entropy is flawed and doesn't consider certain exceptions.' -> -1.49
2024-08-01 10:10:44.905 - analysis.statements_analysis - INFO - Score for statement 'The concept of randomness and probability is fascinating and makes me wonder about the origins of life.' -> 0.59
2024-08-01 10:10:44.906 - analysis.statements_analysis - INFO - Score for statement 'The video provided a great explanation of entropy.' -> 1.49
2024-08-01 10:10:44.906 - analysis.statements_analysis - INFO - Score for statement 'The video's creative demonstrations helped to illustrate complex concepts.' -> 1.49
2024-08-01 10:10:44.908 - analysis.statements_analysis - INFO - Score for statement 'The video's premise is flawed be

## Classification Analysis

In [15]:
classification_analyzer = ClassificationAnalyzer(comments)
print(classification_analyzer.run_all_analyses())

Determining computated facts (Sentiment; argmax=False) ...: 100%|██████████| 887/887 [00:00<00:00, 626743.20it/s]
Determining computated facts (Sentiment; argmax=True) ...: 100%|██████████| 887/887 [00:00<00:00, 114924.86it/s]
Determining computated facts (Sentiment; argmax=False) ...: 100%|██████████| 1399/1399 [00:17<00:00, 79.32it/s] 
Determining computated facts (Sentiment; argmax=True) ...: 100%|██████████| 1399/1399 [00:00<00:00, 142284.95it/s]
Determining computated facts (Toxicity; argmax=False) ...: 100%|██████████| 887/887 [00:55<00:00, 15.97it/s]
Determining computated facts (Toxicity; argmax=True) ...: 100%|██████████| 887/887 [00:00<00:00, 83799.16it/s]
Determining computated facts (Toxicity; argmax=False) ...: 100%|██████████| 1399/1399 [00:32<00:00, 43.64it/s]
Determining computated facts (Toxicity; argmax=True) ...: 100%|██████████| 1399/1399 [00:00<00:00, 91906.01it/s]
Determining computated facts (Emotion; argmax=False) ...: 100%|██████████| 887/887 [00:57<00:00, 15.4

All results are weighted by comment likes.
Classification (Sentiment) analysis for top-level comments:
(Soft) Mean Sentiment for 887 comments:
negative:           36.75%
neutral:            30.22%
positive:           33.03%
(Hard) Mean Sentiment for 887 comments:
negative:           39.70%
neutral:            34.61%
positive:           25.69%

Classification (Sentiment) analysis for all comments:
(Soft) Mean Sentiment for 1399 comments:
negative:           37.90%
neutral:            28.69%
positive:           33.42%
(Hard) Mean Sentiment for 1399 comments:
negative:           42.57%
neutral:            29.44%
positive:           28.00%
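
The "Soft" and "Hard" figures above appear to correspond to the `argmax=False` and `argmax=True` passes in the progress output: a soft mean averages the predicted class probabilities, while a hard mean first replaces each prediction with its most likely class. The sketch below illustrates that difference with like-based weighting. It is not the code of `ClassificationAnalyzer`; the function name is hypothetical, and the weight `likes + 1` is an assumption made so that comments with zero likes still count.

```python
from typing import Dict, List, Sequence


def weighted_mean_distribution(
    probabilities: Sequence[Dict[str, float]],
    likes: Sequence[int],
    hard: bool = False,
) -> Dict[str, float]:
    """Like-weighted mean class distribution over a set of comments.

    hard=False averages the raw probabilities (soft mean).
    hard=True converts each prediction to a one-hot vector first (hard mean).
    """
    if not probabilities:
        raise ValueError("At least one prediction is required.")
    labels: List[str] = list(probabilities[0].keys())
    totals = {label: 0.0 for label in labels}
    total_weight = 0.0
    for probs, like_count in zip(probabilities, likes):
        weight = like_count + 1  # Assumption: zero-like comments get weight 1.
        if hard:
            winner = max(labels, key=lambda label: probs[label])
            probs = {label: 1.0 if label == winner else 0.0 for label in labels}
        for label in labels:
            totals[label] += weight * probs[label]
        total_weight += weight
    return {label: totals[label] / total_weight for label in labels}
```

Because the hard mean discards how confident each prediction was, the two summaries can diverge noticeably, which would be consistent with the gap between the soft and hard percentages reported above.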

10 most negative comments: 
Comment(@edwardmacnab354 @ 2024-05-26T06:02:27+00:00: 'I fgn hate reality ! where the hell are we ! Some kinda weird funhouse constructed by a lunatic ?') (0 likes; 0 replies)
Comment(@Tibyon @ 2024-05-05T13:03:56+00:00: 'Veritasium is extremely evil') (0 likes; 0 replies)
Comment(@Susul-lj2wm @ 2024-05-07T18:08:29+00:00: 'i 


