## An analysis of Reddit comments about HBO’s Euphoria to understand viewers’ experiences and reactions
### Analysis based on posts and comments on the `r/euphoria` subreddit  

#### 3.a Topic Modeling with `BERTopic`

*input*: corpus, embedding model  
*output*: topics, visualization

*tools*:  
`BERTopic`

*about*:  
`BERTopic` is a topic modeling technique that uses BERT embeddings and a class-based TF-IDF (c-TF-IDF) to form dense clusters, which yields easily interpretable topics while keeping important words in the topic descriptions.
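
To make the class-based TF-IDF idea concrete, here is a minimal pure-Python sketch of the weighting as it is usually written, `W(t, c) = tf(t, c) * log(1 + A / f(t))`, where `A` is the average number of words per class and `f(t)` is the frequency of term `t` across all classes. The function name `c_tf_idf` and the toy tokens are made up for illustration, and the library's own implementation may differ in details such as normalization.

```python
import math
from collections import Counter


def c_tf_idf(classes):
    """Weight terms per class (cluster).

    `classes` maps a class label to the list of tokens from all documents
    in that class, i.e. each cluster is treated as one long document.
    """
    # term frequency inside each class
    tf = {label: Counter(tokens) for label, tokens in classes.items()}

    # frequency of each term across all classes
    total = Counter()
    for counts in tf.values():
        total.update(counts)

    # average number of words per class
    avg_words = sum(len(tokens) for tokens in classes.values()) / len(classes)

    return {
        label: {
            term: count * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
        for label, counts in tf.items()
    }


# hypothetical toy clusters, not taken from the subreddit data
toy_classes = {
    "relationships": ["rue", "jules", "relationship", "rue", "love"],
    "addiction": ["rue", "relapse", "drugs", "rehab", "drugs"],
}

weights = c_tf_idf(toy_classes)
ranked = sorted(weights["addiction"].items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

In the toy example, `rue` appears in both clusters and is weighted down in the `addiction` cluster, while `drugs` is frequent there and absent elsewhere, so it ranks first.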


In [None]:
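# %pip install bertopic
# %pip install "bertopic[visualization]"
# unquoted, the brackets are likely read by the shell (e.g. zsh) as a glob pattern,
# which gives "no matches found"; quoting the requirement avoids that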

In [1]:
import pandas as pd
# going to try modeling with raw comments and cleaned comments
data_raw = pd.read_pickle('../dat/s2_rue_comments.pkl')
# data_raw = list(data_raw[0])
data_clean = pd.read_pickle('../dat/corpus.pkl')
# data_clean = list(data_clean[0])

In [2]:
from bertopic import BERTopic

---

**model with raw data**

In [10]:
# disable tokenizer parallelism through an environment variable
# (assigning a plain Python variable has no effect)
# see issue https://github.com/huggingface/transformers/issues/5486
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [19]:
# the fitting error points to a NaN in the data:
# "sequence item 57: expected str instance, float found"
# drop the missing rows before fitting

data_raw2 = data_raw.dropna()
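
The error arises because a missing value is stored as a float `NaN`, and joining strings fails as soon as one item is not a `str`. As a sketch of an alternative to `dropna()`, the check below works on a plain list and also drops `None` and blank entries. The helper name `keep_text` and the sample comments are hypothetical.

```python
import math


def keep_text(docs):
    """Keep only non-empty strings; drops NaN floats, None, and blank entries."""
    return [d for d in docs if isinstance(d, str) and d.strip()]


# hypothetical sample, standing in for the comment column
sample = ["rue and jules", math.nan, None, "   ", "that finale"]
cleaned = keep_text(sample)
print(cleaned)  # ['rue and jules', 'that finale']
```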

In [None]:
model = BERTopic()
topics, probabilities = model.fit_transform(list(data_raw2[0]))

In [22]:
model.get_topic_freq().head()

|   | Topic | Count |
|---|-------|-------|
| 0 | -1    | 6487  |
| 1 | 0     | 2945  |
| 2 | 1     | 936   |
| 3 | 2     | 879   |
| 4 | 3     | 433   |


Topic -1 collects the outliers, i.e. documents that were not assigned to any topic. Forcing these documents into a topic could degrade topic quality, so we ignore Topic -1.
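
Before setting Topic -1 aside, it can help to check how large a share of the documents it holds. The sketch below works on a plain list of topic assignments such as the `topics` list returned by `fit_transform`. The helper name `topic_summary` and the toy assignments are illustrative, not results from this notebook.

```python
from collections import Counter


def topic_summary(topics):
    """Return (topic counts without the outlier topic, share of outliers)."""
    counts = Counter(topics)
    outliers = counts.pop(-1, 0)  # -1 is the outlier label
    share = outliers / len(topics) if topics else 0.0
    return counts.most_common(), share


# hypothetical assignments for eight documents
toy_topics = [-1, 0, 0, 1, -1, 2, 0, -1]
ranked_topics, outlier_share = topic_summary(toy_topics)
print(ranked_topics)
print(f"outlier share: {outlier_share:.1%}")
```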

In [23]:
# let's look at topic 0

model.get_topic(0)

[('jules', 0.018488351630427384),
 ('rue', 0.010843775538094154),
 ('elliot', 0.008510807692288277),
 ('her', 0.008107649019755862),
 ('she', 0.00797861742524745),
 ('relationship', 0.006408999281948819),
 ('but', 0.005923509477956057),
 ('with', 0.005843405826163349),
 ('that', 0.005798595447183756),
 ('and', 0.005695279532346802)]

In [34]:
# save model
model.save('../models/bertopic_model_raw')



The top words include many stop words, so we repeat the modeling with the cleaned data.

**model with clean data**
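
To see what topic 0 looks like without the stop words, the sketch below filters the top words reported above for the raw model. The weights are copied from the output above (rounded). The stop word set is a small hand-picked assumption, not a standard list.

```python
# top words of topic 0 from the raw model (weights rounded)
topic_0_raw = [
    ("jules", 0.0185), ("rue", 0.0108), ("elliot", 0.0085),
    ("her", 0.0081), ("she", 0.0080), ("relationship", 0.0064),
    ("but", 0.0059), ("with", 0.0058), ("that", 0.0058), ("and", 0.0057),
]

# assumption: a tiny hand-picked stop word set, for illustration only
stop_words = {"her", "she", "but", "with", "that", "and"}

content_words = [word for word, _ in topic_0_raw if word not in stop_words]
print(content_words)  # ['jules', 'rue', 'elliot', 'relationship']
```

Another route, not tried here, is to leave the comments as they are and pass a scikit-learn `CountVectorizer(stop_words="english")` to `BERTopic` through its `vectorizer_model` argument, so stop words are removed only from the topic descriptions.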

In [9]:
data_clean2 = data_clean.dropna()

In [13]:
model_clean = BERTopic(calculate_probabilities=True)
topics_clean, probabilities_clean = model_clean.fit_transform(list(data_clean2[0]))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

In [14]:
model_clean.get_topic_freq().head()

|   | Topic | Count |
|---|-------|-------|
| 0 | -1    | 7832  |
| 1 | 0     | 702   |
| 2 | 1     | 559   |
| 3 | 2     | 342   |
| 4 | 3     | 338   |


In [15]:
model_clean.get_topic(0)

[('jules', 0.016852136491006028),
 ('rue', 0.01200333675432912),
 ('her', 0.007974592971334682),
 ('she', 0.007736034988256822),
 ('relationship', 0.006417450236841112),
 ('not', 0.005662989810768734),
 ('is', 0.005289312245192388),
 ('to', 0.004987891617658675),
 ('but', 0.004985063193370417),
 ('that', 0.004910840313542042)]
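
Topic 0 of the clean model still lists stop words such as `not`, `is`, `to`, `but`, and `that`, and it looks close to topic 0 of the raw model. A rough way to quantify that is the Jaccard overlap of the two top-10 word lists, sketched below with the words copied from the two outputs above. The helper name `jaccard` is illustrative.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


# top-10 words of topic 0, copied from the raw and clean model outputs
raw_words = ["jules", "rue", "elliot", "her", "she",
             "relationship", "but", "with", "that", "and"]
clean_words = ["jules", "rue", "her", "she", "relationship",
               "not", "is", "to", "but", "that"]

shared = sorted(set(raw_words) & set(clean_words))
overlap = jaccard(raw_words, clean_words)
print(shared)
print(f"Jaccard overlap: {overlap:.2f}")
```

Seven of the ten words are shared, which suggests the cleaning step changed the makeup of this topic only slightly.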

In [33]:
# save model
# model_clean.save('../models/bertopic_clean')



In [4]:
# load model
model_clean = BERTopic.load('../models/bertopic_clean')

In [None]:
# experiment with a different embedding

**Visualization**

In [None]:
# %pip install nbformat

In [5]:
model_clean.visualize_topics()

**visualize probabilities**

In [17]:
model_clean.visualize_distribution(probabilities_clean[0])

**Topic Reduction**

This model has several overlapping topics, so we reduce the number of topics after training.

In [20]:
# topic reduction after training: merge the fitted topics down to 30
new_topics, new_probs = model_clean.reduce_topics(list(data_clean2[0]), topics_clean, probabilities_clean, nr_topics=30)

In [21]:
model_clean.visualize_distribution(new_probs[0])