### Mission Statement
* In this simulation, a popular YouTuber has asked us to find the top 10 things being talked about in the comments on one of his videos
* He wants to use this information to see what caught people's attention, if people are talking about the adverts in the video, and anything else useful we might find

General Directions from the YouTuber:
* We want the number of likes on a comment to inform the process somehow
* There are obviously more than 10 things discussed in the comments, but he wants the top 10
    * If there's a good reason for more, he wants more

### General Process
* Pull the data from YouTube
    * Get all comments and the number of likes for each comment
* Clean the data
    * Remove any duplicates
        * The pinned comment is the only true duplicate
        * If the same comment is posted by different users, combine it into one comment with the total number of likes
    * Translate all comments to English
* Get text embedding of each unique comment
* Perform dimensionality reduction
    * Clustering algorithms generally perform better in relatively low dimensional space
* Cluster the comments
    * Use a weighted clustering algorithm with the number of likes acting as the sample weight
* Perform topic modeling (TF-IDF) to determine the topic of each cluster
* Depending on the number of clusters/topics, we can report the "top 10 things"
    * If there are more than 10 "things", report more only when there is a good reason to

In [None]:
import os
import time
from pyyoutube import Api
from azure.core.credentials import AzureKeyCredential
from azure.ai.translation.text import TextTranslationClient
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import sklearn

# APIs
GOOGLE_API_KEY = os.environ.get("GOOGLE_KEY")
AZURE_KEY = os.environ.get("AZURE_KEY")
AZURE_ENDPOINT = "https://api.cognitive.microsofttranslator.com"
AZURE_REGION = "southcentralus"

# Video ID
VIDEO_ID = "KOEfDvr4DcQ"

# Save the intermediate steps so we don't have to do reprocessing
CACHE_DIR = "./data/youtube_comments"
os.makedirs(CACHE_DIR, exist_ok=True)
RAW_COMMENTS_PATH = os.path.join(CACHE_DIR, VIDEO_ID + "_raw.csv")
CLEANED_COMMENTS_PATH = os.path.join(CACHE_DIR, VIDEO_ID + "_cleaned.csv")
TRANSLATED_COMMENTS_PATH = os.path.join(CACHE_DIR, VIDEO_ID + "_translated.csv")
FINAL_COMMENTS_PATH = os.path.join(CACHE_DIR, VIDEO_ID + "_final.csv")
EMBEDDINGS_PATH = os.path.join(CACHE_DIR, VIDEO_ID + "_embeddings.npy")

# Model IDs
EMBEDDING_MODEL_ID = "Alibaba-NLP/gte-Qwen2-7B-instruct"
EMBEDDING_MAX_LENGTH = 8192

In [None]:
# Collect the data
# The text to be clustered are all comments on Mr Beast's most watched video of 2024
#   His most watched video of 2024 as of 02/2025 is "Face Your Biggest Fear To Win $800,000"
# At time of data collection, the video has ~300M views and ~215k comments

# If the comments have already been pulled and saved, just load them
if os.path.exists(RAW_COMMENTS_PATH):
    raw_comments = pd.read_csv(RAW_COMMENTS_PATH)
    raw_comments["Text"] = raw_comments["Text"].fillna("")

# If they haven't been saved, pull them then save them
else:
    responses = []
    token = None

    # Request comment threads in batches of up to 1000
    # Responses are paginated: if response.nextPageToken is not None, there are more comments
    # Keep requesting pages until there are no more (response.nextPageToken is None)
    api = Api(api_key=GOOGLE_API_KEY)
    while True:
        response = api.get_comment_threads(
            video_id=VIDEO_ID,
            count=1000,
            page_token=token,
            text_format="plainText"
        )

        # Append and get the next page
        responses.append(response)
        token = response.nextPageToken

        # If this is the last page, break
        if not token:
            break
    
    # Get the text, number of likes, and ID of each comment. Save it to a csv
    raw_comments = [
        (thread.snippet.topLevelComment.snippet.textDisplay, thread.snippet.topLevelComment.snippet.likeCount, thread.id)
        for response in responses for thread in response.items
    ]
    raw_comments = pd.DataFrame(raw_comments, columns=["Text", "Likes", "Id"])
    raw_comments.to_csv(RAW_COMMENTS_PATH, index=False)

### Data Cleaning
* Of the ~215k comments, there are only ~144k unique comment texts (many repeated comments)
    * This seems to be mostly a combination of bots reposting comments and trivial messages (often just an emoji)
    * However, some are the same (non-trivial) comment being posted naturally several times
* We can combine two rows if their text is the same, and add their likes together
* Of the 144k unique comments, only ~33k have at least 1 like
    * We can assume the unliked comments are noise, so we can remove them
* Since we will be doing topic modeling with TF-IDF we want all comments to be in English
    * To translate, we can just run all the comments through Microsoft Cognitive Services Text Translation
    * Only ~70% of the comments were English
* After translating, there are duplicates. We can combine them just as before
    * "Hola" and "Hello" wouldn't originally be combined, but after translating both now say "Hello" and can be combined
    * We can combine an additional ~1.8% of the ~33k comments after translating

In [None]:
# Remove duplicates and comments with 0 likes
if os.path.exists(CLEANED_COMMENTS_PATH):
    cleaned_comments = pd.read_csv(CLEANED_COMMENTS_PATH)
    cleaned_comments["Text"] = cleaned_comments["Text"].fillna("")

else:
    # Mr Beast's pinned comment is the only true duplicated comment
    cleaned_comments = raw_comments.drop_duplicates()

    # Combine two or more comments if their text is identical. Sum the likes from all combined comments
    #   Keep the ID of the comment with more likes
    cleaned_comments = cleaned_comments.sort_values(by="Likes", ascending=False)
    cleaned_comments = cleaned_comments.groupby("Text", as_index=False).agg({
        "Likes": "sum",
        "Id": "first"
    })

    # Most comments still have 0 likes. Anything with zero likes is considered noise
    cleaned_comments = cleaned_comments[cleaned_comments["Likes"] > 0]

    # Save the cleaned comments to disk
    cleaned_comments.to_csv(CLEANED_COMMENTS_PATH, index=False)

In [None]:
# Translate all comments to English
#   We need all comments in English so we can properly perform TF-IDF later
#   We want to translate before embedding since language is likely represented in the embedding
#       Unfortunately, the translation isn't perfect, but it will be good enough
# We translate after combining so we don't translate the same exact piece of text multiple times

# Free tier of translation service has a rate limit of 33k characters per minute. The documentation only mentions a 2M
#   limit per hour, but if I send more than 33k characters per minute (2M per hour average) I get an "exceeded request
#   limits" error.
# To make the code as simple as possible since it only needs to run once, I found that batching the requests into 200
#   comments never exceeds 33k characters. I can then wait a minute between each batch to avoid the rate limit.
# If this code was going to be run repeatedly, I'd use a variable batch size and pack each batch based on the character
#   count of each comment in the batch.
def language_and_enlish_translation(texts, batch_size = 200):
    """ Given a list of texts returns their source language and the translation of each item to English """
    client = TextTranslationClient(credential=AzureKeyCredential(AZURE_KEY), region=AZURE_REGION, endpoint=AZURE_ENDPOINT)
    response = []

    for i in range(0, len(texts), batch_size):
        texts_batch = texts[i:i + batch_size]
        response += client.translate(body=texts_batch, to_language=["en"])
        time.sleep(60)

    return [
        (item["detectedLanguage"]["language"], item["translations"][0]["text"])
        for item in response
    ]

if os.path.exists(TRANSLATED_COMMENTS_PATH):
    translated_comments = pd.read_csv(TRANSLATED_COMMENTS_PATH)
    translated_comments[["Text", "English Text"]] = translated_comments[["Text", "English Text"]].fillna("")

else:
    translated_comments = cleaned_comments.copy()
    translated_comments.loc[:, ["Language", "English Text"]] = language_and_enlish_translation(translated_comments["Text"].tolist())
    translated_comments.to_csv(TRANSLATED_COMMENTS_PATH, index=False)
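
The fixed batch size of 200 is enough for a one-off run. If the translation step had to run repeatedly, the batches could instead be packed by character count, as the comments above suggest. The following is a minimal sketch of that idea, not part of the Azure SDK: `pack_batches` and `max_chars` are hypothetical names, and it assumes the service counts characters the same way Python's `len` does.

```python
def pack_batches(texts, max_chars=33_000):
    """Greedily group texts into batches whose total character count stays within max_chars."""
    batches, current, current_chars = [], [], 0
    for text in texts:
        n_chars = len(text)
        # Start a new batch if adding this text would exceed the budget
        if current and current_chars + n_chars > max_chars:
            batches.append(current)
            current, current_chars = [], 0
        # A single text longer than max_chars still gets a batch of its own
        current.append(text)
        current_chars += n_chars
    if current:
        batches.append(current)
    return batches


example = ["a" * 10, "b" * 25, "c" * 10, "d" * 40]
batch_sizes = [len(batch) for batch in pack_batches(example, max_chars=40)]
print(batch_sizes)  # [2, 1, 1]
```

Each packed batch could then be sent through `client.translate` with the same one-minute pause between requests.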

In [None]:
# Pie Chart of Distribution of Languages of Unique Comments with 1 or more Likes

# Top Languages to show
language_map = {
    "en": "English",
    "es": "Spanish",
    "ru": "Russian",
    "so": "Somali (Arabic)",
    "ar": "Arabic",
    "pt": "Portuguese (Brazil)",
}

# Get the count per language. Combine less common languages into "Other"
# Note:
#   This is the count of unique comments with 1+ likes
#   We already removed duplicate texts and comments with 0 likes
language_counts = translated_comments['Language'].replace(language_map).value_counts()
top_languages = language_counts.nlargest(len(language_map))
other_count = language_counts.iloc[len(language_map):].sum()
top_languages['Other'] = other_count

fig, ax = plt.subplots(figsize=(8, 6))
explode = [0] + [0.1]*len(language_map)
top_languages.plot.pie(
    ax=ax,
    startangle=54.36,
    explode=explode,
    labeldistance=1.03,
    autopct="%.1f%%",
    pctdistance=0.80
)
ax.set_ylabel('')
fig.suptitle("Distribution of Languages\nof Unique Comments with 1+ Likes", x=0.55, y=0.95)
ax.axis('equal')
plt.show()

In [None]:
# Pie Chart of Total Likes per Language
language_map = {
    "en": "English",
    "es": "Spanish",
}

# Get the count of likes per language. Combine less common languages into "Other"
# Note:
#   We preserved likes when removing duplicate texts and only removed comments with 0 likes
#   So the count of likes per language is the same here as it would be before any data filtering
likes_per_language = translated_comments.groupby('Language')['Likes'].sum()
likes_per_language = likes_per_language.sort_values(ascending=False)
likes_per_language = likes_per_language.rename(index=language_map)
top_likes = likes_per_language.nlargest(len(language_map))
other_likes = likes_per_language.iloc[len(language_map):].sum()
top_likes['Other'] = other_likes

# Plot the pie chart
fig, ax = plt.subplots(figsize=(8, 6))
explode = [0] + [0.1]*len(language_map)
top_likes.plot.pie(
    ax=ax,
    startangle=5.6,
    labeldistance=1.03,
    explode=explode,
    autopct="%.1f%%",
    pctdistance=0.80
)
ax.set_ylabel('')
fig.suptitle("Total Likes per Language", x=0.525, y=0.90)
ax.axis('equal')
plt.show()

In [None]:
# Combine again after translating
# Now that everything is English, there may be more repeated comments
#   If the same comment was in two different languages, by translating all to English they would now be redundant
#   Previously "Hola" and "Hello" would not be combined. After translating both would be "Hello" and would be combined

if os.path.exists(FINAL_COMMENTS_PATH):
    final_comments = pd.read_csv(FINAL_COMMENTS_PATH)
    final_comments[["Text", "English Text"]] = final_comments[["Text", "English Text"]].fillna("")

else:
    # Combine if the English Text is the same and sum the likes
    final_comments = translated_comments.copy()
    final_comments = final_comments.sort_values(by="Likes", ascending=False)
    final_comments = final_comments.groupby("English Text", as_index=False).agg({
        "Likes": "sum",
        "Text": "first",
        "Language": "first",
        "Id": "first"
    })

    # Reset the index after cleaning is complete
    final_comments = final_comments.reset_index(drop=True)
    
    # Save the final comments to disk
    final_comments.to_csv(FINAL_COMMENTS_PATH, index=False)

In [None]:
# Print top 10 most liked comments
pd.set_option("display.max_colwidth", None)
pd.set_option("display.width", 1000)
final_comments[["English Text", "Likes"]].nlargest(10, "Likes")

In [None]:
# Histogram of Tokenized Lengths
tokenizer = AutoTokenizer.from_pretrained(EMBEDDING_MODEL_ID)
token_lengths = [len(tokenizer.encode(comment)) for comment in final_comments["English Text"]]
plt.hist(token_lengths,
    bins=np.arange(100) - 0.5,
    density=True
)
plt.show()
print("Percentage longer than 100:", np.round(sum(1 for x in token_lengths if x > 100) / len(token_lengths) * 100, 2))

### Distribution of Likes
* There are ~1.1M likes across all 33k comments.
* Looking at the histogram, we see most of the comments have only a few likes
    * The 100 most-liked comments (those with more than 1600 likes) are omitted from the histogram to make it more readable
    * The comment with the most likes is Mr Beast's pinned comment with ~114k likes
        * There are only 10 comments with more than 10k likes. These 10 comments have ~330k likes total (31% of all comment likes)
        * There are only 25 comments with more than 5k likes. These 25 comments have ~430k likes total (40% of all comment likes)
        * There are only 100 comments with more than 1600 likes. These 100 comments have ~630k likes total (59% of all comment likes)
* Looking at the cumulative sum of likes vs. the number of top comments
    * The top 1 comment has ~11% of all comment likes
    * The top 1% of comments have 77% of all comment likes
    * The top 2% of comments have 86% of all comment likes
    * The top 5% of comments have 93% of all comment likes
    * The top 10% of comments have 96% of all comment likes
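
The percentages above come from sorting the comments by likes and taking a cumulative share. A small sketch of that calculation follows; `share_of_likes` is a hypothetical helper and `toy_likes` is made-up data, not the real like counts. In the notebook, the input would be `final_comments["Likes"]`.

```python
import numpy as np


def share_of_likes(likes, top_fraction):
    """Fraction of all likes held by the top `top_fraction` of comments."""
    sorted_likes = np.sort(np.asarray(likes, dtype=float))[::-1]
    # Always count at least one comment so that tiny fractions still work
    n_top = max(1, int(round(top_fraction * len(sorted_likes))))
    return sorted_likes[:n_top].sum() / sorted_likes.sum()


# Toy data with a heavy head, loosely mimicking the shape described above
toy_likes = [100, 50, 25, 10, 5, 4, 3, 1, 1, 1]
for fraction in (0.1, 0.2, 0.5):
    print(f"top {fraction:.0%}: {share_of_likes(toy_likes, fraction):.1%}")
```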

In [None]:
# Histogram of number of likes
plt.figure(figsize=(10, 5))
plt.hist(final_comments["Likes"], bins=np.arange(0, 1600, 16), log=True)
plt.xlabel("Number of Likes")
plt.ylabel("Frequency (Log Scale)")
plt.show()

In [None]:
# Cumulative Distribution of Likes Plot
sorted_likes = final_comments["Likes"].sort_values(ascending=False).to_numpy()
cumulative_likes = np.cumsum(sorted_likes) / sum(sorted_likes)
cumulative_likes = np.insert(cumulative_likes, 0, 0.0)

plt.plot(np.linspace(0, 1, len(cumulative_likes), endpoint=True), cumulative_likes)
plt.grid()
plt.xlabel("Proportion of Top Comments Considered")
plt.ylabel("Cumulative Proportion of Likes")
plt.title("Cumulative Distribution of Likes")
plt.show()

# If the embedding model distinguishes query and document inputs, use the default mode if it is the document mode (also sometimes called passage). Probably always use document
https://huggingface.co/spaces/mteb/leaderboard
* Clustering task uses k-means (with Euclidean distance), so good performance there indicates good performance for this application

* MTEB is a benchmark of many embedding tasks.
* Filtered down to just the Clustering task (several datasets)
* Within clustering, the most related task is Stack Exchange, as it is the only dataset collected from the web domain (datasets from online sources such as Wikipedia or similar are considered encyclopaedic, non-fiction, etc.). The Stack Exchange dataset is unfortunately entirely English. The YouTube comments are primarily English, but also include other languages
* Top 2 performing models on the Stack Exchange clustering task (as of the time of writing this) are gte-Qwen1.5-7B-instruct and gte-Qwen2-7B-instruct with scores of 80.60 and 80.26 respectively
* The scores are likely close enough that the Stack Exchange Clustering task alone could not definitively tell us which model would be better on our dataset
* Given that, I have chosen to use the Qwen2 model as it generally has better performance than the Qwen1.5 model

MTEB just uses normal k-means (scikit-learn's MiniBatchKMeans):
clustering_model = sklearn.cluster.MiniBatchKMeans(
    n_clusters=len(set(self.labels)),
    batch_size=self.clustering_batch_size,
    n_init="auto",
)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

In [None]:
import pickle
with open("./final_comments.pickle", "wb") as f:
    pickle.dump(final_comments, f, pickle.HIGHEST_PROTOCOL)

In [None]:
import pickle
with open("./final_comments.pickle", "rb") as f:
    final_comments = pickle.load(f)

In [None]:
if os.path.exists(EMBEDDINGS_PATH):
    embeddings = np.load(EMBEDDINGS_PATH)
else:
    model = SentenceTransformer(EMBEDDING_MODEL_ID, trust_remote_code=True)
    model.max_seq_length = EMBEDDING_MAX_LENGTH
    embeddings = model.encode(final_comments["English Text"].tolist(), batch_size=1, show_progress_bar=True)
    np.save(EMBEDDINGS_PATH, embeddings)
    del model

# TODO:
* PCA for variance explainability to show dimensionality vs information
    * Show that we can reduce to a relatively small dimensionality (cut 99% of dimensions) without losing the same proportion of information (well under 99% of information lost; it should be more like 20%)
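
One way to produce the dimensionality-vs-information picture is the cumulative explained variance of a PCA fit. The sketch below uses synthetic data as a stand-in for the real embeddings; `toy_embeddings` and the 80% threshold are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for the real embeddings: 500 points that mostly live in an
# 8-dimensional subspace of a 64-dimensional space, plus a little noise
latent = rng.normal(size=(500, 8))
projection = rng.normal(size=(8, 64))
toy_embeddings = latent @ projection + 0.05 * rng.normal(size=(500, 64))

pca = PCA().fit(toy_embeddings)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that keeps at least 80% of the variance
n_components_80 = int(np.searchsorted(cumulative_variance, 0.80) + 1)
print(n_components_80, "components explain at least 80% of the variance")
```

Plotting `cumulative_variance` against the component index would give the curve described in the TODO item.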

* UMAP on original embeddings down to a 10-dimensional space
    * Make a subplot grid like the "UMAP Parameters" section of https://pair-code.github.io/understanding-umap/
        * min_dist from 0, 0.01, 0.05, 0.1, 0.5, 1
        * n_neighbors from 5, 15, 30, 50, 100

* DBSCAN
    * Sklearn's implementation of DBSCAN comes out of the box with sample weighting
    * DBSCAN will also "find" the number of clusters, so we don't have to know it a priori. DBSCAN also works with noise, and we expect most of the comments to be noise
    * For eps, two methods
        * K-distance plot
            * Plot the distance to the k-th nearest neighbor and look for an elbow in the graph
            * Elbow plot
            * "Kneedle Algorithm" says the elbow is where the largest change in slope of the curve occurs (largest 2nd derivative)
            * code:
            from sklearn.neighbors import NearestNeighbors
            k = 5
            nbrs = NearestNeighbors(n_neighbors=k).fit(embeddings)
            distances, indices = nbrs.kneighbors(embeddings)
            distances = np.sort(distances[:, k-1], axis=0)
            gradients = np.diff(distances)
            sharpest_gradient_index = np.argmax(np.diff(gradients))
        * silhouette_score (preferred?)
            * grid search over eps (and maybe min_samples)
    * For min_samples, maybe use some proportion of the Likes.
        * I.e. we don't want to consider a cluster if it has less than 1% of all likes
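
The DBSCAN notes above can be sketched on toy data as follows. This is an illustration under stated assumptions, not a tuned configuration: the 2-D points stand in for the reduced embeddings, and the like counts, `eps=0.5`, and the 10% threshold are made up. It relies on scikit-learn's `DBSCAN.fit(X, sample_weight=...)`, where a sample whose weight is at least `min_samples` counts as a core sample by itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Toy stand-in for the reduced embeddings: two tight blobs, scattered noise,
# and one isolated point far away from everything else
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.1, size=(50, 2))
noise = rng.uniform(low=-10, high=10, size=(20, 2))
isolated = np.array([[20.0, 20.0]])
points = np.vstack([blob_a, blob_b, noise, isolated])

# Toy like counts: 1 like each, except the isolated point, which has 50
likes = np.ones(len(points))
likes[-1] = 50

# min_samples as a share of all likes (hypothetical 10% threshold for this toy data)
min_samples = int(0.10 * likes.sum())

labels = DBSCAN(eps=0.5, min_samples=min_samples).fit(points, sample_weight=likes).labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("isolated point is noise:", labels[-1] == -1)
```

In this toy setup the heavily liked isolated point forms a cluster on its own. The same would presumably happen with any real comment whose like count exceeds `min_samples`, which is worth keeping in mind when choosing the threshold.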

* Grid search over min_dist, n_neighbors, and eps (set min_samples to something like 5.5k which is 0.5% of all Likes)
    * Plot most informative 2 variables (best eps for a given min_dist or n_neighbors, best min_dist for a given eps and n_neighbors, etc)
        * for a given eps and n_neighbors, min_dist is likely to be optimal at low values like 0 or 0.01. So we will likely want to have the plot be eps vs n_neighbors and we can pick the best min_dist since we know we always want a low value

* Topic Modeling (Naturally Occurring Topics)
    * Weighted TF-IDF
    from sklearn.feature_extraction.text import TfidfVectorizer
    vectorizer = TfidfVectorizer(max_df=0.99, min_df=2, stop_words='english', ngram_range=(1, 5))
    # idf to normalize occurrences of words across all samples
    vectorizer.fit(final_comments["English Text"].tolist()) # to do weighting?
    feature_names = vectorizer.get_feature_names_out()
    # `cluster` is the subset of final_comments rows assigned to one cluster
    X_cluster = vectorizer.transform(cluster["English Text"].tolist()) # for each cluster: TF-IDF matrix, one row per comment
    X_sum = np.asarray(X_cluster.sum(axis=0)).flatten() # for each cluster: TF-IDF score of each n-gram summed over the comments
    X_df = pd.DataFrame({'ngram': feature_names, 'score': X_sum}) # topic and tf-idf score

* TODO: Topic Modeling
    * Should stemming/lemmatization be performed?
        * Does MMR take care of this?

* MMR - Maximal Marginal Relevance
    * https://github.com/MaartenGr/BERTopic/blob/master/bertopic/representation/_mmr.py
    * Fine tuning of topic modeling
    * Take top topics (ngrams) out of TF-IDF
    * Assume there are similar topics (Ex: car and cars)
    * Find subset of topic models that maximize mmr
    * I think we want to use the query embeddings for the words here
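
The MMR step can be sketched with NumPy alone. This is a simplified illustration rather than BERTopic's implementation: the `mmr` function, the 2-D toy vectors, and the use of a cluster centroid as the query vector are assumptions. It picks the most relevant candidate first, then repeatedly adds the candidate with the best trade-off between relevance to the query and similarity to what is already selected.

```python
import numpy as np


def mmr(candidate_embeddings, query_embedding, top_n=3, diversity=0.5):
    """Return indices of candidates chosen by Maximal Marginal Relevance."""
    candidates = np.asarray(candidate_embeddings, dtype=float)
    query = np.asarray(query_embedding, dtype=float)

    # Cosine similarity via unit-normalized vectors
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    relevance = candidates @ query
    similarity = candidates @ candidates.T

    selected = [int(np.argmax(relevance))]
    remaining = [i for i in range(len(candidates)) if i != selected[0]]
    while remaining and len(selected) < top_n:
        # Redundancy is the similarity to the closest already-selected candidate
        redundancy = similarity[np.ix_(remaining, selected)].max(axis=1)
        scores = (1 - diversity) * relevance[remaining] - diversity * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy 2-D "embeddings": "car" and "cars" nearly coincide, "snake" points elsewhere
ngrams = ["car", "cars", "snake"]
vectors = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
cluster_centroid = np.array([1.0, 0.4])

picked = mmr(vectors, cluster_centroid, top_n=2, diversity=0.7)
print([ngrams[i] for i in picked])  # ['cars', 'snake']
```

With `diversity=0` the selection falls back to plain relevance ranking, so the near-duplicate "car" would be picked second instead of "snake".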

* Topic Modeling (Artificial Topics)
    * embedding model has a query embedding
    * Get a gauge of the intra-cluster distance for each natural cluster
        * Probably mean intra-cluster distance and mean distance to cluster centroid
            * Maybe add 2-3 standard deviations to mean as well
    * Get query embedding for samples near "Feastables", "Feastables Chocolate", "Shopify", "Advertisement", etc.
    * For each query embedding:
        * Find all comments that are within some distance from the query embedding
            * Natural intra-cluster distance is a good start for "some distance"
            * Note: a comment can be in more than one of these artificial clusters
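
The artificial-topic lookup can be sketched as a radius search around each query embedding. The notes above do not fix a distance metric, so this example assumes cosine distance; `comments_near_query`, the 2-D toy vectors, and the `max_distance=0.3` threshold are hypothetical stand-ins for the real embeddings and the measured intra-cluster distance.

```python
import numpy as np


def comments_near_query(comment_embeddings, query_embedding, max_distance):
    """Indices of comments whose cosine distance to the query is at most max_distance."""
    comments = np.asarray(comment_embeddings, dtype=float)
    query = np.asarray(query_embedding, dtype=float)
    comments = comments / np.linalg.norm(comments, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    cosine_distance = 1.0 - comments @ query
    return np.flatnonzero(cosine_distance <= max_distance)


# Toy 2-D vectors standing in for comment and query embeddings
toy_comments = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
toy_queries = {
    "Feastables": np.array([1.0, 0.0]),
    "Shopify": np.array([0.0, 1.0]),
}

artificial_clusters = {
    name: comments_near_query(toy_comments, query, max_distance=0.3).tolist()
    for name, query in toy_queries.items()
}
print(artificial_clusters)  # {'Feastables': [0, 1, 3], 'Shopify': [2, 3]}
```

Comment 3 lands in both groups, which matches the note that a comment can belong to more than one artificial cluster.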
        