## An analysis of Reddit comments about HBO’s Euphoria to understand viewers’ experiences and reactions
### Analysis based on posts and comments on the `r/euphoria` subreddit  

#### 3.a Topic Modeling with `BERTopic`

#### SEASON 2

*input*: corpus, embedding model  
*output*: topics, visualization

*tools*:  
`BERTopic`

*about*:  
`BERTopic` is a topic modeling technique that leverages BERT embeddings and a class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
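
To make the class-based TF-IDF step concrete, here is a minimal pure-Python sketch. It assumes the weighting described in the `BERTopic` documentation, W(t, c) = tf(t, c) * log(1 + A / f(t)), where tf(t, c) is the count of term t in cluster c, A is the average number of words per cluster, and f(t) is the count of term t across all clusters. The toy clusters and the helper name `c_tf_idf` are made up for illustration; this is not the library's implementation, which also normalizes the term counts.

```python
import math
from collections import Counter

def c_tf_idf(classes):
    """classes maps a cluster id to the tokens of all its documents joined together."""
    tf = {c: Counter(tokens) for c, tokens in classes.items()}
    # f(t): count of each term across all clusters
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # A: average number of words per cluster
    avg_words = sum(total.values()) / len(classes)
    return {
        c: {t: n * math.log(1 + avg_words / total[t]) for t, n in counts.items()}
        for c, counts in tf.items()
    }

# toy clusters (hypothetical tokens, not taken from the corpus)
toy = {
    0: "rue jules elliot rue like like".split(),
    1: "nate maddy cassie nate like like".split(),
}
weights = c_tf_idf(toy)
top_0 = max(weights[0], key=weights[0].get)
print(top_0, weights[0]["rue"] > weights[0]["like"])
```

In the toy data `like` is as frequent in cluster 0 as `rue`, but because it also appears in cluster 1 it receives a lower weight, which is how cluster-specific words end up in the topic descriptions.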


In [None]:
# %pip install bertopic
# %pip install "bertopic[visualization]"
# quote the extra: unquoted, the shell (e.g. zsh) can fail with "no matches found" for bertopic[visualization]

In [1]:
import pandas as pd
import numpy as np
# going to try modeling with raw comments and cleaned comments
# data_raw = pd.read_pickle('../dat/s2_rue_comments.pkl')
# data_raw = list(data_raw[0])
data_clean = pd.read_pickle('../dat/corpus.pkl')
# data_clean = list(data_clean[0])

In [2]:
from bertopic import BERTopic

Modeling with raw data wasn't ideal. Spam ended up becoming topics.  

---
**Modeling with clean data**

In [3]:
# need to set environment variable to disable tokenizer parallelism
# see issue https://github.com/huggingface/transformers/issues/5486
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [4]:
data_clean2 = data_clean.dropna()

In [None]:
model = BERTopic(calculate_probabilities=True, nr_topics=30)
topics, probabilities = model.fit_transform(list(data_clean2[0]))

In [6]:
model.get_topic_freq().head()

| | Topic | Count |
|---|---|---|
| 0 | -1 | 10151 |
| 1 | 0 | 1280 |
| 2 | 1 | 477 |
| 3 | 2 | 426 |
| 4 | 3 | 412 |


In [7]:
model.get_topic(0)

[('elliot', 0.05302321466892508),
 ('jule', 0.03420611570571211),
 ('rue', 0.03373225387824607),
 ('be', 0.026852654990114987),
 ('she', 0.026845743070334153),
 ('he', 0.024036766953895016),
 ('and', 0.023875877911608428),
 ('to', 0.02220195014943216),
 ('that', 0.01957623195687411),
 ('with', 0.01955138713236753)]

In [None]:
# save model
model.save("../models/bertopic_clean")

In [None]:
# load model
# model_clean = BERTopic.load('../models/bertopic_clean')

**Visualization**

In [8]:
model.visualize_topics()

In [11]:
model.visualize_distribution(probabilities[0])

---

**K-MEANS**
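
Unlike HDBSCAN, k-means has no notion of noise: every document is assigned to its nearest centroid, so no document ends up in the `-1` outlier topic. The sketch below illustrates that assignment rule with hypothetical 2-D points and centroids; the helper `assign_to_nearest` is introduced here for illustration and does not reflect how `BERTopic` or scikit-learn implement clustering internally.

```python
import math

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of the closest centroid (Euclidean distance)."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

# hypothetical reduced embeddings and centroids, for illustration only
points = [(0.1, 0.2), (0.9, 1.1), (5.0, 5.0)]  # the last point is far from both centroids
centroids = [(0.0, 0.0), (1.0, 1.0)]
labels = assign_to_nearest(points, centroids)
print(labels)
```

Even the distant third point receives a regular cluster label, which is the trade-off of this approach: no outliers, but noisy documents are forced into a topic.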

In [12]:
import spacy
nlp = spacy.load('en_core_web_sm')
sw_spacy = nlp.Defaults.stop_words | {'rt', 'via', '…'}
add_stopwords = ['i', 'just','did', 'ab', 'amp', 'ml', 'xb','abc', 'abcb', 'abcny', 'abd', 'abdabca', 'fs', 
                  'zpqxhxhzanapjsjbf', 'zqcsrpwsge', 'zqnuhckwdqwrhkuo', 'zs', 'zshwbhethehenozxfyqg',
                  'zsmkbrmwngzsibrntkt', 'zy', 'zwhnrmujykdxmntiub', 'afqjcnguytghbsuvixmglpwzqbg', 'ebecadcbdfcbafbdb',
                  'abfbmltmqspf', 'abfafebfbad', 'episode', 'season', 's', 'lol']
 
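# using spacy stopwords instead of sklearn
# convert to a list: recent scikit-learn versions reject a set for CountVectorizer's stop_words
stop_words = list(sw_spacy.union(add_stopwords))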

In [29]:
# redoing with kmeans
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# unigrams and bigrams, with the custom stop-word list
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=stop_words)

# the KMeans instance passed as hdbscan_model replaces HDBSCAN as the clustering step
cluster_model = KMeans(n_clusters=10)
topic_model = BERTopic(hdbscan_model=cluster_model, verbose=True, diversity=1, top_n_words=10,
                       embedding_model="all-mpnet-base-v2", nr_topics=10,
                       vectorizer_model=vectorizer_model, calculate_probabilities=True)
topics, probabilities = topic_model.fit_transform(list(data_clean2[0]))

Batches: 100%|██████████| 596/596 [16:10<00:00,  1.63s/it] 
2022-08-10 23:03:04,709 - BERTopic - Transformed documents to Embeddings
2022-08-10 23:03:16,184 - BERTopic - Reduced dimensionality
2022-08-10 23:03:16,556 - BERTopic - Clustered reduced embeddings
2022-08-10 23:03:21,776 - BERTopic - Reduced number of topics from 10 to 10


In [30]:
topic_model.get_topic_freq().head()

| | Topic | Count |
|---|---|---|
| 0 | 5 | 3344 |
| 1 | 3 | 2967 |
| 2 | 0 | 2829 |
| 3 | 7 | 2436 |
| 4 | 6 | 2238 |


In [31]:
topic_model.visualize_topics()

In [32]:
topic_model.visualize_heatmap(top_n_topics=10)

Reduce dimensionality of embeddings

In [33]:
from umap import UMAP

sentence_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = sentence_model.encode(list(data_clean2[0]), show_progress_bar=False)

reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_documents(list(data_clean2[0]), reduced_embeddings=reduced_embeddings)

Topic hierarchy

In [34]:
topic_model.visualize_hierarchy()

---

MODEL 2 - ROBERTA  
N = 20

In [56]:
# 25 k-means clusters, reduced to 20 topics via nr_topics
cluster_model = KMeans(n_clusters=25)
topic_model_r = BERTopic(hdbscan_model=cluster_model, verbose=True, diversity=1, top_n_words=10,
                         embedding_model="all-distilroberta-v1", nr_topics=20,
                         vectorizer_model=vectorizer_model, calculate_probabilities=True)
topics_r, probabilities_r = topic_model_r.fit_transform(list(data_clean2[0]))


Batches: 100%|██████████| 596/596 [07:36<00:00,  1.30it/s] 
2022-08-11 00:15:20,313 - BERTopic - Transformed documents to Embeddings
2022-08-11 00:15:31,255 - BERTopic - Reduced dimensionality
2022-08-11 00:15:32,225 - BERTopic - Clustered reduced embeddings
2022-08-11 00:15:43,936 - BERTopic - Reduced number of topics from 25 to 20


In [57]:
topic_model_r.get_topic_freq().head()

| | Topic | Count |
|---|---|---|
| 0 | 0 | 1810 |
| 1 | 1 | 1622 |
| 2 | 2 | 1571 |
| 3 | 3 | 1184 |
| 4 | 4 | 1165 |


In [58]:
topic_model_r.visualize_topics()

In [59]:
topic_model_r.visualize_hierarchy()

**Topic Reduction**

This model has some overlapping topics, so the number of topics is reduced after training.

In [60]:
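# topic reduction after training
# pass this model's own assignments (topics_r, probabilities_r); the original run passed
# `topics, probabilities` from the mpnet model, which likely explains the counts recorded below
new_topics, new_probs = topic_model_r.reduce_topics(list(data_clean2[0]), topics_r, probabilities_r, nr_topics=5)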

2022-08-11 00:22:04,671 - BERTopic - Reduced number of topics from 20 to 5


In [61]:
topic_model_r.get_topic_info()

| | Topic | Count | Name |
|---|---|---|---|
| 0 | 0 | 4606 | 0_think_people_euphoria_want |
| 1 | 1 | 4570 | 1_rue_drug_like_lexi |
| 2 | 2 | 4462 | 2_nate_maddy_rue_drug |
| 3 | 3 | 2967 | 3_elliot_think_relationship_feel |
| 4 | 4 | 2436 | 4_thank_leak_people_point |
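
One rough way to check whether the reduced topics still overlap is the Jaccard similarity of their top words. The sketch below uses the four words embedded in each topic name in the table above; the helper `jaccard` is introduced here for illustration and is not part of `BERTopic`.

```python
from itertools import combinations

def jaccard(a, b):
    """Share of words two topics have in common, relative to all words in either topic."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# top words read off the topic names reported by get_topic_info()
top_words = {
    0: ["think", "people", "euphoria", "want"],
    1: ["rue", "drug", "like", "lexi"],
    2: ["nate", "maddy", "rue", "drug"],
    3: ["elliot", "think", "relationship", "feel"],
    4: ["thank", "leak", "people", "point"],
}

# pairwise overlap between all topics
overlap = {
    (i, j): jaccard(top_words[i], top_words[j])
    for i, j in combinations(sorted(top_words), 2)
}
most_similar = max(overlap, key=overlap.get)
print(most_similar, round(overlap[most_similar], 2))
```

Topics 1 and 2 share `rue` and `drug`, so even after reduction to five topics the representations are not fully distinct.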


In [62]:
topic_model_r.visualize_barchart()

---

In [49]:
# topic representation - n-grams (unigrams up to trigrams)
# topic_model_r.update_topics(list(data_clean2[0]), new_topics, n_gram_range=(1, 3))

In [50]:
# get new topic representation
# topic_model_r.get_topic_freq().head()

| | Topic | Count |
|---|---|---|
| 0 | 0 | 5241 |
| 1 | 1 | 4342 |
| 2 | 2 | 3344 |
| 3 | 3 | 3147 |
| 4 | 4 | 2967 |


Not happy with these results...

---

Use a custom CountVectorizer instead:

In [None]:
# from sklearn.feature_extraction.text import CountVectorizer

# cv = CountVectorizer(ngram_range=(1, 3), stop_words="english")
# model.update_topics(list(data_clean2[0]), topics, vectorizer_model=cv)

In [None]:
# model.get_topic_freq().head()

In [None]:
# model.get_topic(0)

Try using the 'auto' option for number of topics:

In [None]:
# automatically reduce topics
# model_auto = BERTopic(calculate_probabilities=True, nr_topics='auto')
# topics_auto, probabilities_auto = model_auto.fit_transform(list(data_clean2[0]))

In [None]:
# model_auto.get_topic_freq().head()

In [None]:
# model_auto.get_topic(2)

In [None]:
# visualize topics
# model_auto.visualize_distribution(probabilities_auto[0])
# bad probabilities

---

**model with raw data**

In [None]:
# need to set environment variable to disable tokenizer parallelism
# see issue https://github.com/huggingface/transformers/issues/5486
# import os
# os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [None]:
# according to fitting error, there is a NaN in the data
# sequence item 57: expected str instance, float found

import numpy as np
# data_raw2 = data_raw.dropna()

In [None]:
# model = BERTopic(nr_topics=30)
# topics, probabilities = model.fit_transform(list(data_raw2[0]))

In [None]:
# model.get_topic_freq().head()

Topic -1 collects all outliers, that is, documents without an assigned topic. Forcing these documents into a topic could lead to poor performance, so Topic -1 is ignored.
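
A quick way to see how much of the corpus ends up in Topic -1 is to compute its share of the topic assignments. The sketch below is a minimal example with a hypothetical list of topic ids standing in for the `topics` returned by `fit_transform`; `outlier_share` is a helper name introduced here.

```python
from collections import Counter

def outlier_share(topics):
    """Fraction of documents assigned to the outlier topic -1."""
    counts = Counter(topics)
    return counts[-1] / len(topics) if topics else 0.0

# hypothetical assignments, for illustration only
example_topics = [-1, 0, 0, 1, -1, 2, -1, 0]
share = outlier_share(example_topics)
print(f"{share:.1%} of documents are outliers")

# drop outliers before inspecting the remaining topics
kept = [t for t in example_topics if t != -1]
```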

In [None]:
# let's look at topic 0

# model.get_topic(0)

In [None]:
# save model
# model.save('../models/bertopic_model_raw')

Many stop words end up in the topics, so the modeling is repeated with clean data (see above).
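
The effect of stop words on topic terms can be illustrated without any library. The sketch below removes stop words first and then builds unigrams and bigrams from the remaining tokens, which is the order scikit-learn's `CountVectorizer` follows when both `stop_words` and `ngram_range=(1, 2)` are set. The whitespace tokenization is a simplification, and the stop-word list, the sample comment, and the helper name are made up for illustration.

```python
from collections import Counter

def ngrams_without_stopwords(text, stop_words, n_max=2):
    """Lowercase, split on whitespace, drop stop words, then count 1..n_max-grams."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# hypothetical stop words and comment, for illustration only
sample_stop_words = {"and", "she", "to", "with", "be", "the"}
counts = ngrams_without_stopwords("Rue and Jules be with Elliot", sample_stop_words)
print(sorted(counts))
```

Because filtering happens before the n-grams are built, bigrams such as `rue jules` can join words that were not adjacent in the original comment.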