## An analysis of Reddit comments about HBO’s Euphoria to understand viewers’ experiences and reactions
### Analysis based on posts and comments on the `r/euphoria` subreddit  

#### 3.a Topic Modeling with `BERTopic`

*input*: corpus, embedding model  
*output*: topics, visualization

*tools*:  
`BERTopic`

*about*:  
`BERTopic` is a topic modeling technique that leverages BERT embeddings and a class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
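To make the class-based TF-IDF step more concrete, here is a minimal sketch in plain Python. It assumes the weighting described in the BERTopic paper, `tf(t, c) * log(1 + A / f(t))`, where `A` is the average number of words per cluster and `f(t)` is the frequency of word `t` across all clusters. The toy clusters and the function name `c_tf_idf` are made up for illustration, raw counts are used for `tf`, and this is not BERTopic's internal code.

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    """Score each word per cluster: tf(t, c) * log(1 + A / f(t))."""
    # Treat every cluster as one long document by pooling its texts
    counts = {c: Counter(" ".join(docs).split()) for c, docs in clusters.items()}
    # f(t): frequency of each word across all clusters
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    # A: average number of words per cluster
    avg_words = sum(total.values()) / len(counts)
    return {
        c: {w: tf * math.log(1 + avg_words / total[w]) for w, tf in counter.items()}
        for c, counter in counts.items()
    }

# Hypothetical toy clusters, not taken from the corpus
clusters = {
    0: ["rue relapse scene", "rue narrate scene"],
    1: ["jules train scene", "jules leave"],
}
scores = c_tf_idf(clusters)
top_words = {c: max(s, key=s.get) for c, s in scores.items()}
print(top_words)
```

A word such as "scene" that appears in both clusters is down-weighted relative to words concentrated in a single cluster, which is why the topic descriptions tend to stay interpretable.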


In [None]:
# %pip install bertopic
# %pip install "bertopic[visualization]"
# "no matches found" is likely the shell (e.g. zsh) expanding the brackets as a glob;
# quoting the package spec avoids that

In [3]:
import pandas as pd
import numpy as np
# going to try modeling with raw comments and cleaned comments
# data_raw = pd.read_pickle('../dat/s2_rue_comments.pkl')
# data_raw = list(data_raw[0])
data_clean = pd.read_pickle('../dat/corpus_s1.pkl')
# data_clean = list(data_clean[0])

In [4]:
from bertopic import BERTopic

Modeling with raw data wasn't ideal. Spam ended up becoming topics.  

---
**Modeling with clean data**

In [5]:
# set the TOKENIZERS_PARALLELISM environment variable to disable tokenizer parallelism
# see issue https://github.com/huggingface/transformers/issues/5486
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [6]:
data_clean2 = data_clean.dropna()

In [None]:
# model = BERTopic(calculate_probabilities=True, nr_topics=20)
# topics, probabilities= model.fit_transform(list(data_clean2[0]))

---

**USING K-MEANS TO DETERMINE CLUSTERS**

In [7]:
import spacy
nlp = spacy.load('en_core_web_sm')
sw_spacy = nlp.Defaults.stop_words | {'rt', 'via', '…'}
add_stopwords = ['i', 'just','did', 'ab', 'amp', 'ml', 'xb','abc', 'abcb', 'abcny', 'abd', 'abdabca', 'fs', 
                  'zpqxhxhzanapjsjbf', 'zqcsrpwsge', 'zqnuhckwdqwrhkuo', 'zs', 'zshwbhethehenozxfyqg',
                  'zsmkbrmwngzsibrntkt', 'zy', 'zwhnrmujykdxmntiub', 'afqjcnguytghbsuvixmglpwzqbg', 'ebecadcbdfcbafbdb',
                  'abfbmltmqspf', 'abfafebfbad', 'episode', 'season', 's', 'lol']
 
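# using spacy stopwords instead of sklearn's built-in list
# recent scikit-learn versions expect stop_words as a list, so convert the merged set
stop_words = list(sw_spacy.union(add_stopwords))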

In [18]:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# unigrams and bigrams, filtered with the custom stop word list
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=stop_words)

# k-means replaces HDBSCAN, so every document is assigned to one of 10 clusters
cluster_model = KMeans(n_clusters=10)
model = BERTopic(hdbscan_model=cluster_model, verbose=True, diversity=1, top_n_words=10,
                 embedding_model="all-mpnet-base-v2", nr_topics=10,
                 vectorizer_model=vectorizer_model, calculate_probabilities=True)
topics, probabilities = model.fit_transform(list(data_clean2[0]))

Batches: 100%|██████████| 115/115 [04:28<00:00,  2.33s/it]
2022-08-10 22:31:37,730 - BERTopic - Transformed documents to Embeddings
2022-08-10 22:32:15,137 - BERTopic - Reduced dimensionality
2022-08-10 22:32:15,523 - BERTopic - Clustered reduced embeddings
2022-08-10 22:32:21,172 - BERTopic - Reduced number of topics from 10 to 10
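
The number of clusters above is fixed at 10 by hand. One common way to sanity-check such a choice is to compare silhouette scores for several values of `n_clusters`. The sketch below uses synthetic 2-D points in place of the real sentence embeddings (an assumption made so that it runs on its own); `best_k` and the candidate range are illustrative names and values, not part of the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for document embeddings: three well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit k-means for each candidate k and record the silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# The k with the highest silhouette score is the best-separated clustering
best_k = max(scores, key=scores.get)
print(best_k)
```

On real comment embeddings the scores are usually much flatter than on this toy data, so the result is better read as a rough guide than as a definitive answer.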


In [19]:
model.get_topic_freq().head()

| | Topic | Count |
|---|---|---|
| 0 | 2 | 594 |
| 1 | 0 | 551 |
| 2 | 1 | 464 |
| 3 | 5 | 459 |
| 4 | 7 | 378 |


In [20]:
model.get_topic(0)

[('like', 0.02258075301328096),
 ('zendaya', 0.016841340131655447),
 ('dead', 0.01536669696803268),
 ('scene', 0.015157739429053317),
 ('end', 0.013516754465779988),
 ('tell', 0.011607131515903002),
 ('people', 0.010407089614879444),
 ('narate', 0.00970440468335484),
 ('unreliable', 0.00876901769536764),
 ('think rue', 0.008671067924547594)]

In [21]:
# save model
model.save("../models/bertopic_s1")

In [None]:
# load model
# model_clean = BERTopic.load('../models/bertopic_s1')

**Visualization**

In [22]:
model.visualize_topics()

In [23]:
model.visualize_heatmap(top_n_topics=10)

In [24]:
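from umap import UMAP

# note: these embeddings come from all-MiniLM-L6-v2, not the all-mpnet-base-v2 model
# the topics were fit with, so the 2D layout is only an approximate view of the clusters
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(list(data_clean2[0]), show_progress_bar=True)

# reduce the embeddings to 2D for plotting
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
model.visualize_documents(list(data_clean2[0]), reduced_embeddings=reduced_embeddings)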

**Topic Reduction**

This model has a few overlapping topics.

In [25]:
# topic reduction after training
new_topics, new_probs = model.reduce_topics(list(data_clean2[0]), topics, probabilities, nr_topics=8)

2022-08-10 22:37:08,880 - BERTopic - Reduced number of topics from 10 to 8
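
One way to see which topics overlap before choosing `nr_topics` is to compare topic vectors with cosine similarity; the most similar pair is a natural candidate for merging. The sketch below uses made-up topic vectors rather than the fitted model's c-TF-IDF matrix, and `most_similar_pair` is a hypothetical helper. It only illustrates the idea and is not BERTopic's reduction code.

```python
import numpy as np

def most_similar_pair(topic_vectors):
    """Return (i, j, similarity) for the two most similar rows."""
    # Normalize rows so the dot product equals cosine similarity
    norms = np.linalg.norm(topic_vectors, axis=1, keepdims=True)
    unit = topic_vectors / norms
    sim = unit @ unit.T
    # Ignore self-similarity on the diagonal
    np.fill_diagonal(sim, -1.0)
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    return int(i), int(j), float(sim[i, j])

# Hypothetical word-weight vectors for four topics
topic_vectors = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.8, 0.2, 0.0],
    [0.8, 0.2, 0.0, 0.1],
    [0.0, 0.0, 0.1, 0.9],
])
i, j, similarity = most_similar_pair(topic_vectors)
print(i, j, round(similarity, 3))
```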


In [31]:
model.get_topic_info()

| | Topic | Count | Name |
|---|---|---|---|
| 0 | 0 | 777 | 0_rue_character_know_zendaya |
| 1 | 1 | 594 | 1_jule_friend_feel_sem |
| 2 | 2 | 572 | 2_like_rue_music_finale |
| 3 | 3 | 464 | 3_thank_yeah_hope_stil |
| 4 | 4 | 378 | 4_jule_like_know_cal |
| 5 | 5 | 348 | 5_addict_think_like_people |
| 6 | 6 | 266 | 6_like_think_jule_people |
| 7 | 7 | 261 | 7_cassie_like_think_nate |


In [35]:
model.visualize_barchart(top_n_topics=8)

In [None]:
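# topic representation - unigrams to trigrams
# use the labels returned by reduce_topics, since `topics` still holds the 10-topic assignment
model.update_topics(list(data_clean2[0]), new_topics, n_gram_range=(1, 3))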

In [None]:
# get new topic representation
model.get_topic_freq().head()

In [None]:
model.get_topic(0)

Use a custom CountVectorizer instead:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1, 3), stop_words="english")
# pass the reduced labels so the 8-topic assignment is kept
model.update_topics(list(data_clean2[0]), new_topics, vectorizer_model=cv)

In [None]:
model.get_topic_freq().head()

In [None]:
model.get_topic(0)

Try using the 'auto' option for number of topics:

In [None]:
# automatically reduce topics
model_auto = BERTopic(calculate_probabilities=True, nr_topics='auto')
topics_auto, probabilities_auto = model_auto.fit_transform(list(data_clean2[0]))

In [None]:
model_auto.get_topic_freq().head()

In [None]:
model_auto.get_topic(2)

In [None]:
# visualize the topic probability distribution of the first document
model_auto.visualize_distribution(probabilities_auto[0])
# bad probabilities

---

**Modeling with raw data**

In [None]:
# set the TOKENIZERS_PARALLELISM environment variable to disable tokenizer parallelism
# see issue https://github.com/huggingface/transformers/issues/5486
# import os
# os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [None]:
# according to fitting error, there is a NaN in the data
# sequence item 57: expected str instance, float found

import numpy as np
# data_raw2 = data_raw.dropna()
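
The error quoted above is the message `str.join` raises when a sequence of strings contains a float such as `NaN`, which is how pandas stores missing text. A minimal sketch with a made-up list reproduces the failure and shows the filtering that `dropna()` performs on a text column; the sample comments are hypothetical.

```python
# Made-up comments; float("nan") stands in for a missing value from pandas
comments = ["great finale", float("nan"), "rue is an unreliable narrator"]

try:
    " ".join(comments)
except TypeError as err:
    # Same kind of message as the fitting error, with this list's item index
    error_message = str(err)

# Keep only real strings, which is the effect dropna() has on a text column
clean_comments = [c for c in comments if isinstance(c, str)]
print(error_message)
print(clean_comments)
```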

In [None]:
# model = BERTopic(nr_topics=30)
# topics, probabilities = model.fit_transform(list(data_raw2[0]))

In [None]:
# model.get_topic_freq().head()

Topic -1 collects the outliers, i.e. documents that do not have a topic assigned. Forcing those documents into a topic could lead to poor performance, so we ignore Topic -1.
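
As a small illustration of setting the outlier topic aside, the sketch below drops the -1 assignments from a list of topic labels before counting topic sizes. The labels are invented for the example; with a fitted model they would come from `fit_transform`.

```python
from collections import Counter

# Hypothetical topic assignments; -1 marks documents left without a topic
topics = [0, -1, 2, 0, 1, -1, 0, 2]

# Share of documents that were treated as outliers
outlier_share = topics.count(-1) / len(topics)

# Topic sizes with the outlier bucket excluded
topic_sizes = Counter(t for t in topics if t != -1)
print(outlier_share)
print(topic_sizes.most_common())
```

Checking the outlier share first is a cheap way to judge how much of the corpus the remaining topics actually describe.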

In [None]:
# let's look at topic 0

# model.get_topic(0)

In [None]:
# save model
# model.save('../models/bertopic_model_raw')

The raw-data topics included a lot of stop words, so the modeling is repeated with clean data (see above).