# Topic Modeling – Twitter Case Study

This notebook performs topic modeling on the Twitter dataset using BERTopic, as described in the paper:
**"A Network-Driven Framework for Bidimensional Analysis of Information Dissemination on Social Media Platforms"**

The pipeline includes preprocessing, embedding generation, dimensionality reduction, clustering, and topic refinement.


In [None]:
# Enable autoreload for development
%load_ext autoreload
%autoreload 2

# Standard libraries
import collections
import itertools
import json
import pickle as pkl
import random
import re
from multiprocessing import Pool

# NLP libraries
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

import spacy
from spacy.lang.pt import Portuguese
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer

# Topic modeling and vectorization
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from bertopic.vectorizers import ClassTfidfTransformer
from umap import UMAP
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import silhouette_score
from hdbscan import HDBSCAN

# Data manipulation and visualization
import scipy
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm, trange

# Utility
import unidecode
import os


## Data Preprocessing – Twitter

We perform basic text cleaning to prepare tweets for topic modeling. This includes:
- Removing emojis and user mentions (e.g., @username),
- Stripping line breaks, tabs, and extra whitespace,
- Filtering out very short tweets (fewer than 15 characters),
- Optionally excluding tweets that contain spam or irrelevant terms (the list is empty in this case).

These steps ensure cleaner inputs for sentence embedding and improve topic coherence.
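
As a quick illustration of the cleaning rules listed above, the following minimal sketch applies mention removal, whitespace normalization, and the 15-character filter to two made-up tweets. The helper name `clean_tweet` and the sample strings are hypothetical and only serve the example; emoji removal is left out to keep it short.

```python
import re

MENTION = re.compile(r"@\w+")   # user mentions such as @username
WHITESPACE = re.compile(r"\s+")  # any run of spaces, tabs, or line breaks

def clean_tweet(text: str) -> str:
    """Drop mentions, then collapse all whitespace into single spaces."""
    text = MENTION.sub("", text)
    text = text.replace("\\n", " ")  # literal backslash-n left over from exports
    return WHITESPACE.sub(" ", text).strip()

# Hypothetical sample tweets
samples = ["@maria  bom dia\n\tpessoal, tudo bem?", "@joao ok"]
cleaned = [clean_tweet(s) for s in samples]

# Keep only tweets with at least 15 characters after cleaning
kept = [t for t in cleaned if len(t) >= 15]
print(kept)
```

The second tweet shrinks to two characters once the mention is removed, so only the first one passes the length filter.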


In [None]:
import pandas as pd
import re

# Load raw Twitter data
file_path = './data/Twitter_data_frequency_2023-01-08.csv'
messages = pd.read_csv(file_path)

# Function to remove unwanted characters and mentions
def remove_special_characters(text):
    if not isinstance(text, str):
        return str(text)

    # Remove emojis
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # Emoticons
        "\U0001F300-\U0001F5FF"  # Symbols & pictographs
        "\U0001F680-\U0001F6FF"  # Transport & map symbols
        "\U0001F700-\U0001F77F"  # Alchemical symbols
        "\U0001F780-\U0001F7FF"  # Geometric shapes extended
        "\U0001F800-\U0001F8FF"  # Supplemental arrows-C
        "\U0001F900-\U0001F9FF"  # Supplemental symbols and pictographs
        "\U0001FA00-\U0001FA6F"  # Chess symbols, etc.
        "\U00002700-\U000027BF"  # Dingbats
        "\U0001F1E0-\U0001F1FF"  # Flags
        "]+",
        flags=re.UNICODE
    )
    text = emoji_pattern.sub(r'', text)

    # Remove @mentions
    text = re.sub(r'@\w+', '', text)

    # Replace literal "\n" sequences, then collapse every run of whitespace
    # (line breaks, tabs, repeated spaces) into a single space
    text = re.sub(r'\\n', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Optional: list of noisy terms to remove (empty for now)
unwanted_terms = []

def contains_unwanted_terms(text, terms):
    return any(term.lower() in text.lower() for term in terms)

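# Cast to str first so the term filter never receives NaN/float values
messages['texto'] = messages['texto'].astype(str)

# Apply text filtering (a no-op while unwanted_terms is empty)
messages = messages[~messages['texto'].apply(lambda x: contains_unwanted_terms(x, unwanted_terms))].copy()

print("Before cleaning:", messages.shape)

# Apply cleaning
messages['texto'] = messages['texto'].apply(remove_special_characters)

# Debugging info
print("Sample cleaned texts:")
print(messages['texto'].head().to_string())
print("\nChecking for remaining newlines:")
# Computed outside the f-string: backslashes inside f-string expressions
# are a SyntaxError before Python 3.12
n_newlines = messages['texto'].str.contains('\n', regex=False).sum()
print(f"Found {n_newlines} newlines")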

# Remove very short tweets
messages = messages[messages['texto'].str.len() >= 15]
print("After filtering short tweets:", messages.shape)

# Save cleaned data
messages.to_csv("./data/Twitter_message_pre_processed.csv", index=False)


## Embedding Generation with BERTimbau (GPU)

To prepare for topic modeling, we generate dense sentence embeddings using a pre-trained transformer model.

- We use the `"neuralmind/bert-large-portuguese-cased"` model (BERTimbau-large), suitable for Brazilian Portuguese.
- The embeddings are computed on the GPU (`device="cuda"`) because of the high compute and memory demands of the BERT-large architecture.
- Deduplication is optional and left disabled here, so embeddings are generated for every tweet that survives preprocessing.

All generated embeddings and corresponding documents are saved for reuse.
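
Since the embeddings are saved for reuse, a small caching helper can avoid re-encoding the corpus on every run. The sketch below is one possible pattern, not part of the original pipeline: `load_or_compute` and `fake_encode` are hypothetical names, and `fake_encode` stands in for the expensive `sentence_model.encode(docs)` call.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the pickled object at `path`, computing and saving it if missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def fake_encode():
    # Stand-in for the expensive embedding step (hypothetical values)
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "topic_analysis", "embeddings.pkl")
    first = load_or_compute(cache, fake_encode)   # computes and writes the cache
    second = load_or_compute(cache, fake_encode)  # reads the cache, no recompute

print(len(calls))
```

The second call reads the pickle instead of invoking the encoder again, which is the behavior we want when re-running later cells.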


In [15]:
import os
import pickle as pkl
import pandas as pd
from sentence_transformers import SentenceTransformer

# Load preprocessed Twitter messages
file_path = "./data/Twitter_message_pre_processed.csv"
messages = pd.read_csv(file_path)
messages.head(20)

# Define embedding model
embedding_model_name = "neuralmind/bert-large-portuguese-cased"

# Optional: deduplicate if needed
unique_messages = messages  # .drop_duplicates(subset=['texto'], keep='first')
docs = list(unique_messages['texto'])

# Generate embeddings using GPU
sentence_model = SentenceTransformer(embedding_model_name, device="cuda")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# Save embeddings and corresponding documents
os.makedirs('topic_analysis', exist_ok=True)

with open('topic_analysis/twitter_embeddings.pkl', 'wb') as f:
    pkl.dump(embeddings, f)

with open('topic_analysis/twitter_docs.pkl', 'wb') as f:
    pkl.dump(docs, f)

# Optionally reload later
with open('topic_analysis/twitter_embeddings.pkl', 'rb') as f:
    embeddings = pkl.load(f)

with open('topic_analysis/twitter_docs.pkl', 'rb') as f:
    docs = pkl.load(f)


## BERTopic Model Configuration and Topic Assignment

We now configure and apply BERTopic to the Twitter dataset.

- We use the BERTimbau-large model (`neuralmind/bert-large-portuguese-cased`) for embeddings.
- UMAP and HDBSCAN are used to project the embeddings and identify topic clusters.
- The vectorizer is limited to 1024 frequent terms, excluding very rare or overly common words.
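- The BERTopic model is configured to allow fine-grained topics by setting `min_topic_size = 10` (matched by `min_cluster_size=10` in HDBSCAN) and `diversity = 0.9`; note that the cell below does not pass a `diversity` argument.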
- A custom BERTopic subclass is used to remove non-serializable components (locks, SSL contexts) for saving the model.
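
To make the vectorizer thresholds from the list above concrete, here is a minimal sketch of document-frequency filtering in plain Python. It mimics how fractional `min_df` / `max_df` values are commonly interpreted (a term is kept when the share of documents containing it lies within the bounds); the function name and toy corpus are hypothetical, and the thresholds are deliberately coarser than the 0.01 / 0.99 used in this notebook so the effect is visible on four documents. To our understanding, BERTopic fits the vectorizer on one concatenated document per topic, so in practice the fractions would be relative to the number of topics rather than the number of tweets.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms whose document-frequency fraction lies in [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term at most once per document
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        term for term, count in doc_freq.items()
        if min_df <= count / n_docs <= max_df
    )

# Hypothetical toy corpus
toy_docs = [
    "vacina covid hoje",
    "vacina eleição hoje",
    "vacina futebol hoje",
    "vacina covid",
]
vocab = filter_by_document_frequency(toy_docs, min_df=0.5, max_df=0.99)
print(vocab)
```

Here "vacina" appears in every document and is dropped as too common, while "eleição" and "futebol" appear only once and are dropped as too rare.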

Once fitted, we assign the resulting topic labels to the full dataset and persist the model and outputs.
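
The serialization workaround relies on Python's standard `__getstate__` / `__setstate__` hooks. The following sketch shows the pattern in isolation on a toy class holding a `threading.RLock`, which cannot be pickled; the class names are hypothetical and unrelated to BERTopic internals.

```python
import pickle
import threading

class NaiveResource:
    """Holds a lock and defines no pickling hooks."""
    def __init__(self, name):
        self.name = name
        self.lock = threading.RLock()

class Resource(NaiveResource):
    """Same data, but drops the lock when pickled and recreates it on load."""
    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("lock", None)  # remove the non-picklable attribute
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.lock = threading.RLock()  # fresh lock for the restored object

try:
    pickle.dumps(NaiveResource("x"))
    naive_ok = True
except TypeError:
    # RLock objects cannot be pickled
    naive_ok = False

restored = pickle.loads(pickle.dumps(Resource("twitter_model")))
print(naive_ok, restored.name)
```

Anything removed in `__getstate__` has to be recreated (or defaulted) in `__setstate__`; otherwise the restored object is missing that attribute.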


In [None]:
import ssl
import threading
import pickle as pkl
import pandas as pd
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic import BERTopic
from nltk.corpus import stopwords

# Custom BERTopic class to allow model serialization
class CustomBERTopic(BERTopic):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def __getstate__(self):
        state = self.__dict__.copy()
        for key in ['representation_model', 'ssl_context', 'lock']:
            state.pop(key, None)
        return state

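    def __setstate__(self, state):
        self.__dict__.update(state)
        # Recreate the components dropped in __getstate__
        self.__dict__.setdefault('representation_model', None)
        self.ssl_context = ssl.create_default_context()
        self.lock = threading.RLock()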

# Set parameters
embedding_model_name = "neuralmind/bert-large-portuguese-cased"

# UMAP configuration
umap_model = UMAP(
    n_neighbors=3,
    n_components=10,
    min_dist=0.0,
    metric='cosine',
    random_state=42
)

# HDBSCAN configuration
hdbscan_model = HDBSCAN(
    min_cluster_size=10,
    min_samples=10,
    metric='euclidean',
    prediction_data=True
)

# Vectorizer configuration
stop_words_list = list(set(stopwords.words('portuguese')).union(set(stopwords.words('english'))))
vectorizer_model = CountVectorizer(
    max_features=1024,
    min_df=0.01,
    max_df=0.99,
    stop_words=stop_words_list
)

# TF-IDF transformer
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Sentence transformer model (reused)
sentence_model = SentenceTransformer(embedding_model_name)

# Load previously saved docs and embeddings
with open('topic_analysis/twitter_docs.pkl', 'rb') as f:
    docs = pkl.load(f)
with open('topic_analysis/twitter_embeddings.pkl', 'rb') as f:
    embeddings = pkl.load(f)

# Instantiate and fit BERTopic model
model = CustomBERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    min_topic_size=10,
    top_n_words=30,
    verbose=True
)

topics, probabilities = model.fit_transform(docs, embeddings)

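# Map topics back to full message dataset
messages = pd.read_csv("./data/Twitter_message_pre_processed.csv")
# docs was built row by row from this CSV, so labels align by position;
# a text -> topic dict would collapse duplicate tweets onto one label
if len(messages) == len(topics):
    messages['topic'] = list(topics)
else:
    topic_map = dict(zip(docs, topics))
    messages['topic'] = messages['texto'].map(topic_map)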

# Save outputs
with open('topic_analysis/twitter_model.pkl', 'wb') as f:
    pkl.dump(model, f, protocol=pkl.HIGHEST_PROTOCOL)

with open('topic_analysis/twitter_messages_with_topics.pkl', 'wb') as f:
    pkl.dump(messages, f, protocol=pkl.HIGHEST_PROTOCOL)

with open('topic_analysis/twitter_probabilities.pkl', 'wb') as f:
    pkl.dump(probabilities, f, protocol=pkl.HIGHEST_PROTOCOL)

with open('topic_analysis/twitter_topics.pkl', 'wb') as f:
    pkl.dump(topics, f, protocol=pkl.HIGHEST_PROTOCOL)


## Topic Refinement and Final Outputs

After initial topic modeling, we refine the results in two stages:

1. **Outlier Reduction:**  
   Using the `reduce_outliers()` function with the c-TF-IDF strategy, we reassign semantically ambiguous documents to better-fitting clusters. This improves topic cohesion by removing noise.

2. **Topic Merging:**  
   After outlier reassignment, `reduce_topics()` is used to consolidate overlapping or highly similar topics.
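
The idea behind threshold-based outlier reassignment can be sketched in a few lines: each outlier document is compared with every topic representation and moved to the most similar topic, provided the similarity reaches the threshold. This is a simplified illustration under our own assumptions (toy three-dimensional vectors, hypothetical function names), not BERTopic's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def reassign_outliers(doc_vectors, topics, topic_vectors, threshold):
    """Move documents labeled -1 to their most similar topic if similar enough."""
    new_topics = []
    for vec, topic in zip(doc_vectors, topics):
        if topic != -1:
            new_topics.append(topic)  # already assigned: keep as is
            continue
        best_topic, best_sim = max(
            ((t, cosine(vec, tv)) for t, tv in topic_vectors.items()),
            key=lambda pair: pair[1],
        )
        new_topics.append(best_topic if best_sim >= threshold else -1)
    return new_topics

# Hypothetical term-weight vectors for two topics and three documents
topic_vectors = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0]}
doc_vectors = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]
topics = [0, -1, -1]

new_topics = reassign_outliers(doc_vectors, topics, topic_vectors, threshold=0.1)
print(new_topics)
```

The second document is close to topic 1 and gets reassigned, while the third shares no terms with any topic and stays an outlier.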

Once updated, we map the refined topics back to all tweets and persist the outputs:
- Final labeled dataset,
- Updated BERTopic model,
- Topic probability distributions,
- A CSV file summarizing each topic.
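
One detail worth keeping in mind when mapping labels back to tweets: a text-to-topic dictionary keeps a single label per distinct text, so if two identical tweets ever receive different labels, the later one wins for both. The toy example below, with hypothetical texts and labels, contrasts the dictionary lookup with positional assignment.

```python
# Hypothetical documents and topic labels; the first and last texts are identical
docs = ["bom dia a todos", "vacina já disponível", "bom dia a todos"]
topics = [3, 7, -1]

# Lookup by text: duplicate keys are overwritten, so the last label wins
by_text = dict(zip(docs, topics))
mapped_by_text = [by_text[d] for d in docs]

# Assignment by position: every row keeps its own label
mapped_by_position = list(topics)

print(mapped_by_text)
print(mapped_by_position)
```

When the documents and the dataset rows are in the same order, positional assignment preserves the per-row labels exactly.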


In [None]:
import pickle as pkl
import pandas as pd
import ssl
import threading

# Load initial model and topic outputs
with open('topic_analysis/twitter_model.pkl', 'rb') as f:
    model = pkl.load(f)

with open('topic_analysis/twitter_messages_with_topics.pkl', 'rb') as f:
    messages = pkl.load(f)

with open('topic_analysis/twitter_probabilities.pkl', 'rb') as f:
    probabilities = pkl.load(f)

with open('topic_analysis/twitter_topics.pkl', 'rb') as f:
    topics = pkl.load(f)

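# docs and vectorizer_model are not defined in this cell: reload the documents
# and reuse the vectorizer stored on the fitted model
with open('topic_analysis/twitter_docs.pkl', 'rb') as f:
    docs = pkl.load(f)
vectorizer_model = model.vectorizer_model

# Inspect original topics
model.get_topic_info().head(10)

# --- Step 1: Outlier reduction ---
new_topics = model.reduce_outliers(docs, topics, strategy="c-tf-idf", threshold=0.1)

# Update model with reassigned topics (model_updated is an alias, not a copy)
model_updated = model
model_updated.update_topics(docs, topics=new_topics, vectorizer_model=vectorizer_model)
model_updated.get_topic_info().head(20)

# --- Step 2: Topic merging ---
# reduce_topics() updates the model in place; with no nr_topics argument
# the library's default target number of topics is used
model_updated.reduce_topics(docs)
topics_updated = model_updated.topics_
model_updated.update_topics(docs, topics=topics_updated, vectorizer_model=vectorizer_model)
model_updated.get_topic_info().head(20)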

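# --- Step 3: Reassign final topics to full dataset ---
# Assign by position when rows and labels line up; fall back to the text lookup otherwise
if len(messages) == len(topics_updated):
    messages['topic'] = list(topics_updated)
else:
    topic_map = dict(zip(docs, topics_updated))
    messages['topic'] = messages['texto'].map(topic_map)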

# --- Step 4: Save final outputs ---
with open('topic_analysis/twitter_messages_FINAL.pkl', 'wb') as f:
    pkl.dump(messages, f, protocol=pkl.HIGHEST_PROTOCOL)

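with open('topic_analysis/twitter_model_FINAL.pkl', 'wb') as f:
    pkl.dump(model_updated, f, protocol=pkl.HIGHEST_PROTOCOL)

# Probabilities go to their own file so each pickle loads with a single call
# (file name chosen to match the other FINAL outputs; these are still the
# probabilities returned by the initial fit)
with open('topic_analysis/twitter_probabilities_FINAL.pkl', 'wb') as f:
    pkl.dump(probabilities, f, protocol=pkl.HIGHEST_PROTOCOL)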

with open('topic_analysis/twitter_topics_FINAL.pkl', 'wb') as f:
    pkl.dump(topics_updated, f, protocol=pkl.HIGHEST_PROTOCOL)

# Save topic overview as CSV
model_updated.get_topic_info().to_csv("topic_analysis/twitter_TopicDescription.csv", index=False)

## Topic Characterization and Class Distribution Analysis

We finalize the topic modeling process by analyzing:

- The proportion of classified vs. outlier messages,
- The distribution of message classes across topics.

The heatmap shows class counts z-scored within each topic (row-wise), highlighting which classes are over- or under-represented in each semantic cluster.
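
For reference, row-wise z-score normalization subtracts each topic's mean class count and divides by that topic's population standard deviation. The sketch below reproduces the computation in plain Python on hypothetical counts; returning zeros for a constant row is our own choice for the example (a library routine may return NaN instead).

```python
from statistics import mean, pstdev

def zscore_row(counts):
    """Z-score one row using the population standard deviation (ddof=0)."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        # Constant row: no class stands out (assumption made for this sketch)
        return [0.0 for _ in counts]
    return [(c - mu) / sigma for c in counts]

# Hypothetical class counts per topic (four classes)
topic_class_counts = {1: [10, 20, 30, 40], 2: [5, 5, 5, 25]}
normalized = {t: zscore_row(c) for t, c in topic_class_counts.items()}

for topic_id, row in normalized.items():
    print(topic_id, [round(z, 3) for z in row])
```

Because each row is normalized independently, values are comparable across classes within a topic, not across topics.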


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
import pandas as pd
import pickle as pkl

ignored_topics = [-1]

# Loading the pickle file
with open('topic_analysis/twitter_messages_FINAL.pkl', 'rb') as file:
    df = pkl.load(file)

# Defining the original class columns and their new names
class_columns_original = ['classe_1', 'classe_2', 'classe_3', 'classe_4']
class_columns_renamed = ['Class 1', 'Class 2', 'Class 3', 'Class 4']
rename_map = dict(zip(class_columns_original, class_columns_renamed))

# Filtering out topics that are in the ignored list
# (.copy() avoids a SettingWithCopyWarning on the assignment below)
df_filtered = df[~df['topic'].isin(ignored_topics)].copy()

# Shift topic IDs so that they start from one
df_filtered['topic'] = df_filtered['topic'] + 1

# Grouping by topic and summing the class counts
df_pivot = df_filtered.groupby('topic')[class_columns_original].sum()

# Renaming the columns after filtering
df_pivot = df_pivot.rename(columns=rename_map)

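# Normalizing values using z-score per row; wrapping the array result keeps
# the topic index and class labels regardless of what DataFrame.apply returns
df_pivot_normalized = pd.DataFrame(
    scipy.stats.zscore(df_pivot.to_numpy(dtype=float), axis=1),
    index=df_pivot.index,
    columns=df_pivot.columns,
)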

# Generating the heatmap
plt.figure(figsize=(4.5, 4))
ax = sns.heatmap(
    df_pivot_normalized,
    linewidths=0.5,
    linecolor='black',
    cmap='coolwarm',
    vmin=-1.3,
    vmax=1.3,
    cbar_kws=dict(
        use_gridspec=False,
        location="top",
        label='# messages (normalized)',
        shrink=0.5
    )
)

# Axis configuration
ax.set_xlabel('Classes')
ax.set_ylabel('Topic ID')
ax.set_xticklabels(class_columns_renamed)

# Saving and displaying the plot
plt.savefig('twitter_heatmap_classes_topic.pdf', bbox_inches='tight')
plt.show()