## Reddit Online Misogyny - BERTopic Modelling

Inspired by: 

[Tutorial - BERTopic Best Practices](https://colab.research.google.com/drive/1BoQ_vakEVtojsd2x_U6-_x52OOuqruj2?usp=sharing)

[Documentation - BERTopic Tips & Tricks](https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html)

In [None]:
# !pip install bertopic
import pandas as pd
from tqdm.auto import tqdm

from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import AgglomerativeClustering

First, we load the preprocessed dataset:

In [1]:
import pandas as pd
df = pd.read_csv("../data/processed/data_clean.csv")

print("Loaded dataset with", len(df), "rows")
set(df['subreddit'].tolist())

Loaded dataset with 24610 rows


{'AskMen', 'MensRights', 'TheRedPill'}

We prepare the documents and pre-compute their embeddings before passing them to BERTopic. This way we avoid recomputing the embeddings on every run, which is a time-consuming step.

In [10]:
df = df[df['subreddit'].isin(['MensRights', 'TheRedPill'])]
print("Loaded dataset with", len(df), "rows")

Loaded dataset with 10085 rows


In [11]:
print("Original rows:", len(df))
df = df.drop_duplicates(subset=["summary"])
print("After dropping duplicate summaries:", len(df))

Original rows: 10085
After dropping duplicate summaries: 9995


In [13]:
df["text_for_topics"] = df["summary"].fillna(df["text_clean"])
docs = df["text_for_topics"].astype(str).tolist()

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)
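
To avoid re-encoding even across notebook restarts, the pre-computed embeddings could also be cached on disk. The following is a minimal sketch, not part of the original pipeline: the helper name `load_or_compute_embeddings` and the cache path are hypothetical, and the cache check only compares the number of rows, so it would not notice documents whose text has changed.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_or_compute_embeddings(docs, encode_fn, cache_path):
    """Return cached embeddings if they match len(docs), else compute and cache them.

    Hypothetical helper: `cache_path` should end in ".npy" so np.save keeps the name.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        cached = np.load(cache_path)
        if len(cached) == len(docs):
            return cached
    embeddings = np.asarray(encode_fn(docs))
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    np.save(cache_path, embeddings)
    return embeddings


# Demo with a stand-in encoder so the sketch runs without downloading a model.
calls = []

def fake_encode(texts):
    calls.append(len(texts))
    return np.ones((len(texts), 4))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.npy"
    first = load_or_compute_embeddings(["a", "b"], fake_encode, path)
    second = load_or_compute_embeddings(["a", "b"], fake_encode, path)  # served from cache
```

In the notebook, `encode_fn` would be `embedding_model.encode` and `docs` the list built above.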


We use UMAP to reduce the dimensionality of the embeddings before clustering.

In [209]:
from umap import UMAP

umap_model = UMAP(n_neighbors=15, n_components=10, min_dist=0.0, metric='cosine', random_state=42)

We control the number of topics using HDBSCAN. After tuning, we found `min_cluster_size=20` to work best.

In [None]:
from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(min_cluster_size=20, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

We improve the topic representations by removing stop words. We do this with CountVectorizer rather than as a preprocessing step, because the transformer-based embedding models we use need the full context to create accurate embeddings.

CountVectorizer processes the documents only after the embeddings have been generated and the documents clustered, so it affects the topic keywords but not the clustering.

In [237]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english", min_df=5, ngram_range=(1, 2))

Some words appear frequently in every topic but are not part of the stop-word list used by `CountVectorizer(stop_words="english")`. To further reduce the weight of these frequent words, we set `reduce_frequent_words=True` in the c-TF-IDF model.

In [238]:
from bertopic.vectorizers import ClassTfidfTransformer

ctfidf_model = ClassTfidfTransformer(bm25_weighting=True,reduce_frequent_words=True)
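
The two options can be illustrated with a simplified NumPy sketch of the class-based TF-IDF weighting as it is described in the BERTopic documentation. This is our own re-implementation for illustration, not the library's code; the function name `ctfidf` and the toy count matrix are assumptions.

```python
import numpy as np


def ctfidf(counts, bm25_weighting=False, reduce_frequent_words=False):
    """Simplified class-based TF-IDF on a (classes x terms) count matrix."""
    counts = np.asarray(counts, dtype=float)
    term_freq = counts.sum(axis=0)         # frequency of each term across all classes
    avg_words = counts.sum(axis=1).mean()  # average number of words per class
    if bm25_weighting:
        idf = np.log(1 + (avg_words - term_freq + 0.5) / (term_freq + 0.5))
    else:
        idf = np.log(1 + avg_words / term_freq)
    tf = counts / counts.sum(axis=1, keepdims=True)  # L1-normalised term frequency
    if reduce_frequent_words:
        tf = np.sqrt(tf)  # dampens very frequent terms
    return tf * idf


# Term 0 occurs in both classes; terms 1 and 2 are specific to one class each.
toy_counts = [[4, 4, 0],
              [4, 0, 4]]
weights = ctfidf(toy_counts, bm25_weighting=True, reduce_frequent_words=True)
print(weights.round(3))
```

In this toy matrix, the shared term receives a much lower weight than the class-specific terms, even though its raw count within each class is the same.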

We define additional representations, such as the KeyBERT-inspired model, which reduces the appearance of stop words and can improve the topic representation.

We also use MMR (Maximal Marginal Relevance), because the top words of a topic might contain many words that mean the same thing. We therefore allow for more diverse keywords by setting diversity=0.3.

In [None]:
import openai
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, OpenAI

# KeyBERT
keybert_model = KeyBERTInspired()

# MMR
mmr_model = MaximalMarginalRelevance(diversity=0.3)

# GPT-3.5
prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic: <topic label>
"""
client = openai.OpenAI(api_key="sk-...")
openai_model = OpenAI(client, model="gpt-3.5-turbo", exponential_backoff=True, chat=True, prompt=prompt)

# All representation models
representation_model = {
    "KeyBERT": keybert_model,
    "OpenAI": openai_model,
    "MMR": mmr_model
}
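
The idea behind MMR can be sketched in a few lines of NumPy: keywords are picked one by one, trading off similarity to the topic against similarity to the keywords already chosen. This is a generic sketch of the technique, not BERTopic's implementation; the function name `mmr` and the two-dimensional toy vectors are made up for illustration.

```python
import numpy as np


def mmr(doc_vec, word_vecs, words, top_n=2, diversity=0.3):
    """Greedy Maximal Marginal Relevance over cosine similarities."""
    def unit(x):
        x = np.asarray(x, dtype=float)
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    doc, vecs = unit(doc_vec), unit(word_vecs)
    relevance = vecs @ doc    # similarity of each word to the topic
    pairwise = vecs @ vecs.T  # similarity between words

    selected = [int(np.argmax(relevance))]
    candidates = [i for i in range(len(words)) if i != selected[0]]
    while candidates and len(selected) < top_n:
        # Redundancy: highest similarity to any already selected word.
        redundancy = pairwise[np.ix_(candidates, selected)].max(axis=1)
        scores = (1 - diversity) * relevance[candidates] - diversity * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return [words[i] for i in selected]


# Toy vectors: "feminist" is a near-duplicate of "feminism".
words = ["feminism", "feminist", "custody"]
vectors = [[1.0, 0.0], [1.0, 0.1], [0.6, 0.8]]
topic = [1.0, 0.0]

no_diversity = mmr(topic, vectors, words, diversity=0.0)
high_diversity = mmr(topic, vectors, words, diversity=0.7)
print(no_diversity, high_diversity)
```

With `diversity=0.0` the two near-synonyms are selected, while a high diversity value replaces the second one with a less redundant word. At a moderate value such as 0.3, the near-synonym still wins in this toy example, so the setting is a mild push towards variety rather than a hard constraint.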

Now we train the model using the pre-calculated embeddings to speed up the processing.

In [None]:
from bertopic import BERTopic

topic_model = BERTopic(
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,
  representation_model=representation_model,
  ctfidf_model=ctfidf_model,
  nr_topics=30,
  top_n_words=10,
  verbose=True
)

topics, probs = topic_model.fit_transform(docs, embeddings)

2025-12-06 19:25:30,912 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-12-06 19:25:55,392 - BERTopic - Dimensionality - Completed ✓
2025-12-06 19:25:55,395 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-06 19:25:56,009 - BERTopic - Cluster - Completed ✓
2025-12-06 19:25:56,010 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2025-12-06 19:25:56,506 - BERTopic - Representation - Completed ✓
2025-12-06 19:25:56,507 - BERTopic - Topic reduction - Reducing number of topics
2025-12-06 19:25:56,533 - BERTopic - Representation - Fine-tuning topics using representation models.
100%|██████████| 30/30 [00:22<00:00,  1.36it/s]
2025-12-06 19:26:20,504 - BERTopic - Representation - Completed ✓
2025-12-06 19:26:20,508 - BERTopic - Topic reduction - Reduced number of topics from 64 to 30


We set the OpenAI-generated labels, which are based on each topic's representative documents, as custom topic labels:

In [None]:
chatgpt_topic_labels = {topic: " | ".join(list(zip(*values))[0]) for topic, values in topic_model.topic_aspects_["OpenAI"].items()}
chatgpt_topic_labels[-1] = "Outlier Topic"
topic_model.set_topic_labels(chatgpt_topic_labels)
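
The dictionary comprehension above is compact, so here is the same logic unrolled on a toy dictionary. We assume that `topic_aspects_["OpenAI"]` maps each topic ID to a list of `(label, score)` tuples; the toy values below are copied from the topic table in this notebook, but the exact shape of the real entries may differ between BERTopic versions.

```python
# Toy stand-in for topic_model.topic_aspects_["OpenAI"] (assumed structure).
toy_aspects = {
    -1: [("Gender dynamics and relationships", 1)],
    0: [("Men's Rights vs Feminism", 1)],
    1: [("Family Custody and Support Battles", 1)],
}

labels = {}
for topic, values in toy_aspects.items():
    texts, scores = zip(*values)         # split the tuples into labels and scores
    labels[topic] = " | ".join(texts)    # join the labels if there are several

labels[-1] = "Outlier Topic"             # overwrite the label of the outlier topic
print(labels)
```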

Getting topic information:

In [None]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,CustomName,Representation,KeyBERT,OpenAI,MMR,Representative_Docs
0,-1,4867,-1_shit_female_power_really,Outlier Topic,"[shit, female, power, really, woman, did, girl...","[feminists, relationship, girlfriend, woman, f...",[Gender dynamics and relationships],"[female, power, girls, post, gender, relations...",[My boyfriend has become a victim of abominabl...
1,0,975,0_feminism_rights_mra_feminists,Men's Rights vs Feminism,"[feminism, rights, mra, feminists, feminist, m...","[feminism, feminists, feminist, masculinity, g...",[Men's Rights vs Feminism],"[feminism, rights, mra, feminists, feminist, e...",[men and women are not interchangeable. \n Yea...
2,1,534,1_child_child support_father_mother,Family Custody and Support Battles,"[child, child support, father, mother, kids, s...","[child support, custody, mothers, divorce, dau...",[Family Custody and Support Battles],"[child support, support, custody, children, co...",[Served ex wife with paperwork to modify child...
3,2,508,2_confidence_self_change_games,Self-improvement for Confidence and Success,"[confidence, self, change, games, confident, g...","[self improvement, self esteem, self, motivati...",[Self-improvement for Confidence and Success],"[confidence, self, goals, improve, value, succ...",[is that cultivating your ability to have a g...
4,3,432,3_alpha_pill_beta_red pill,Gender dynamics in social interactions,"[alpha, pill, beta, red pill, red, betas, redp...","[red pill, blue pill, pill, redpill, philosoph...",[Gender dynamics in social interactions],"[alpha, red pill, betas, redpill, blue pill, s...",[Red pill theory explains historical observati...
5,4,352,4_rape_victims_violence_victim,Understanding violence and rape issues,"[rape, victims, violence, victim, culture, fal...","[rape, raped, accused, accusations, feminists,...",[Understanding violence and rape issues],"[rape, victims, violence, accusations, raped, ...",[is people don't think critically and watch to...
6,5,260,5_marriage_married_slut_marry,Marriage Dynamics and Hypergamy,"[marriage, married, slut, marry, hypergamy, sl...","[marry, marriage, divorced, divorce, married, ...",[Marriage Dynamics and Hypergamy],"[marriage, married, marry, hypergamy, sluts, d...",[Women don't love us. \n I don't know where li...
7,6,247,6_read_title_read thing_lazy,Critical Reading and Writing Skills,"[read, title, read thing, lazy, just read, par...","[read, reading, read thing, just read, written...",[Critical Reading and Writing Skills],"[read, read thing, lazy, just read, reading, s...",[don't be a lazy cunt. Read the whole thing I ...
8,7,244,7_ltr_girls_girl_hot,Dating and Self-Improvement,"[ltr, girls, girl, hot, fun, meet, value, weir...","[advice, gf, relationship, dating, comfort, do...",[Dating and Self-Improvement],"[ltr, girls, meet, value, effort, smv, chicks,...","[be a beast, never stop working, never put up ..."
9,8,229,8_trp_rp_bp_helped,TRP and Self-Improvement Journey,"[trp, rp, bp, helped, knowledge, works, truth,...","[trp, self improvement, improving, improve, rp...",[TRP and Self-Improvement Journey],"[trp, bp, knowledge, drug, theory, success, me...",[10 months into TRP I get laid after a very lo...


In [None]:
topic_info = topic_model.get_topic_info()
topic_info.to_csv("../data/processed/BERTopic_info.csv", index=False)

In [None]:
fig = topic_model.visualize_hierarchy(custom_labels=True)
fig.write_html("../dashboard/data/topics_hierarchy.html")
fig.show()

In [206]:
topic_model.visualize_topics()

We use the topic information from `topic_model.get_topic_info()` to add topic IDs and topic names to our original dataframe for further analysis.

In [None]:
df['Topic'] = topics
topics_info = topic_model.get_topic_info()
topic_names = topics_info[["Topic", "CustomName"]]
df = df.merge(topic_names, on="Topic", how="left")
df.to_csv("../data/processed/data_with_topics.csv", index=False)
df.head()

Unnamed: 0,author,body,normalizedBody,subreddit,subreddit_id,id,content,summary,text,text_clean,text_for_topics,Topic,CustomName
0,mythin,"edit: this is long, I tried to explain in a po...","edit: this is long, I tried to explain in a po...",MensRights,t5_2qhk3,c6kg614,"edit: this is long, I tried to explain in a po...","what I'm trying to say: \n \n Currently, my ex...","edit: this is long, I tried to explain in a po...","edit: this is long, i tried to explain in a po...","what I'm trying to say: \n \n Currently, my ex...",-1,Outlier Topic
1,Always_Doubtful,i gave up after one line.\n\nTLDR please,i gave up after one line. \n TLDR please \n,MensRights,t5_2qhk3,c8ojg6e,i gave up after one line.,please,i gave up after one line.,i gave up after one line.,please,6,Critical Reading and Writing Skills
2,jpaul3211,It's really wonderful to hear that you're prou...,It's really wonderful to hear that you're prou...,MensRights,t5_2qhk3,c8oonwp,It's really wonderful to hear that you're prou...,you're awesome.,It's really wonderful to hear that you're prou...,it's really wonderful to hear that you're prou...,you're awesome.,-1,Outlier Topic
3,dr_pepper_35,Get a lawyer. Get a lawyer. Get a lawyer. G...,Get a lawyer. Get a lawyer. Get a lawyer. G...,MensRights,t5_2qhk3,c8t624p,Get a lawyer. Get a lawyer. Get a lawyer. G...,Get a lawyer. \n *edit-GET. A. LAWYER. -thank...,Get a lawyer. Get a lawyer. Get a lawyer. G...,get a lawyer. get a lawyer. get a lawyer. g...,Get a lawyer. \n *edit-GET. A. LAWYER. -thank...,20,Legal Representation and Advice
4,Mitschu,It's still a good idea to read the sidebar bef...,It's still a good idea to read the sidebar bef...,MensRights,t5_2qhk3,c9aiodw,It's still a good idea to read the sidebar bef...,Read the sidebar before asking commonly asked ...,It's still a good idea to read the sidebar bef...,it's still a good idea to read the sidebar bef...,Read the sidebar before asking commonly asked ...,-1,Outlier Topic
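
The merge step can be checked on a toy example. The sketch below uses made-up rows and adds `validate="many_to_one"`, a pandas option that raises an error if a topic ID appears more than once in the lookup table; this check is our suggestion and is not part of the cell above.

```python
import pandas as pd

# Made-up documents with assigned topic IDs.
toy_df = pd.DataFrame({"summary": ["a", "b", "c"], "Topic": [0, -1, 0]})

# Lookup table in the shape of topics_info[["Topic", "CustomName"]].
toy_names = pd.DataFrame({
    "Topic": [-1, 0],
    "CustomName": ["Outlier Topic", "Men's Rights vs Feminism"],
})

# A left merge keeps every document and its original order.
merged = toy_df.merge(toy_names, on="Topic", how="left", validate="many_to_one")
print(merged)
```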


We visualize documents and authorship per topic in a two-dimensional projection of the embeddings:

In [None]:
authors = list(df['author'])
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings) # reducing dimensionality of embeddings helps visualizing iteratively
fig = topic_model.visualize_documents(authors, reduced_embeddings=reduced_embeddings, custom_labels=False)
fig.write_html("../dashboard/data/topics_visualize_documents.html")

In [280]:
# Count the occurrences of each topic
df_plot = df[df['CustomName'] != 'Outlier Topic']
topic_counts = df_plot['CustomName'].value_counts()

# Calculate the ratio of each topic
topic_ratios = topic_counts / topic_counts.sum()
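
As a quick sanity check of this calculation, the same ratios can be obtained with `value_counts(normalize=True)`. The sketch below uses a made-up series of topic names rather than the real dataframe.

```python
import pandas as pd

# Made-up topic assignments, for illustration only.
toy = pd.Series(["A", "A", "B", "Outlier Topic", "A", "B", "C"], name="CustomName")

kept = toy[toy != "Outlier Topic"]           # drop outliers before computing shares
ratios = kept.value_counts(normalize=True)   # counts divided by their sum, sorted descending
print(ratios)
```

Because the outliers are removed first, the ratios describe the share of each topic among the documents that were assigned to a topic, not among all documents.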

Finally, we create a pie chart of the topic distribution across r/MensRights and r/TheRedPill.

In [None]:
import plotly.express as px

plot_data = topic_ratios.reset_index()
plot_data.columns = ['Topic', 'Ratio']

# Sort topics from largest to smallest share
plot_data = plot_data.sort_values(by='Ratio', ascending=False).reset_index(drop=True)

colors = px.colors.qualitative.Pastel + px.colors.qualitative.Pastel2 + px.colors.qualitative.Pastel1 + px.colors.qualitative.Set3

fig = px.pie(plot_data, names='Topic', values='Ratio', title='Topic Distribution',
             color_discrete_sequence=colors)

fig.update_traces(textposition='inside', textinfo='percent')
fig.update_layout(
    legend_title_text='Topics',
    legend=dict(
        orientation="v",
        yanchor="top",
        y=-5,
        xanchor="left",
        x=0,
        traceorder="normal"
    ),
    width=800,
    height=1200

)
fig.write_html("../dashboard/data/topic_distribution.html")

fig.show()