# Lab 7: Topic Modeling

---
## 1. Notebook Overview

### 1.1 Objective
Build and interpret topic models on the tweet dataset using Gensim's LDA implementation, then visualize and evaluate the discovered topics.

### 1.2 Prerequisites
This notebook assumes you have already executed:
- **Lab 2**: Data preprocessing → `../Data/multi_label/tweets_preprocessed_*.parquet`
- **Lab 3**: Language modeling
- **Lab 4**: Feature extraction → `../Data/top_1000_vocabulary.json`
- **Lab 5**: Neural network classification pipeline 
- **Lab 5 & 6**: Understanding of the dataset labels to compare topics with classes

### 1.3 Resources
- Gensim LDA documentation: https://radimrehurek.com/gensim/models/ldamodel.html
- pyLDAvis usage guide: https://pyldavis.readthedocs.io/en/latest/

### 1.4 Task Summary
- Train one or more LDA topic models with Gensim (consider preprocessing choices and reusing prior pipelines).
- Visualize the learned topics via pyLDAvis (`pyLDAvis.gensim_models.prepare(...)`).
- Interpret and critique the topics: which terms describe them? How do they align (or not) with the tweet label taxonomy?
- Experiment with multiple topic counts to see how granularity affects interpretability.

### 1.5 Section Roadmap
1. Section 2: Environment setup, data access, and preprocessing recap
2. Section 3: Build and train the Gensim LDA model
3. Section 4: Visualize topics with pyLDAvis
4. Section 5: Interpret topics, compare topic counts, and discuss insights

---
## 2. Data Loading & Preparation

### 2.1 Libraries
- `gensim` for LDA (`LdaModel`, `Dictionary`, `corpora`)
- `pyLDAvis.gensim_models` for interactive topic visualization
- `pandas`, `numpy`, `datasets` (HuggingFace) for loading the tweet splits

### 2.2 Installation Steps
1. Install/confirm `gensim`, `pyLDAvis`, and any preprocessing dependencies in the environment.
2. Ensure the preprocessed tweet file `../Data/multi_label/tweets_preprocessed_train.parquet` and HuggingFace API access (`cardiffnlp/tweet_topic_multi`) are available.
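
A quick pre-flight check can confirm step 1 before the install cell is run. The sketch below uses only the standard library; the helper name `missing_packages` is hypothetical, and the module names are assumed to match the import names used in this notebook.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the top-level modules that cannot be found in the current environment."""
    return [name for name in module_names if find_spec(name) is None]


# Import names assumed from this notebook's import cell.
required = ["gensim", "pyLDAvis", "pandas", "numpy", "datasets"]
missing = missing_packages(required)
print("Missing:", missing if missing else "none")
```

An empty result means the `%pip install` cell can be skipped; otherwise install the listed packages and restart the kernel.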

In [13]:
%pip install --quiet gensim pyLDAvis numpy pandas datasets pyarrow

Note: you may need to restart the kernel to use updated packages.


In [14]:
import json
import re
import warnings
from pathlib import Path
from typing import List, Sequence, Union

import gensim
import numpy as np
import pandas as pd
import pyLDAvis
import pyLDAvis.gensim_models
from datasets import load_dataset
from gensim.corpora import Dictionary
# CoherenceModel is exposed by gensim.models, not defined in gensim.models.ldamodel.
from gensim.models import CoherenceModel, LdaModel

warnings.filterwarnings("ignore")

TRAIN_DATA_PATH = "../Data/multi_label/tweets_preprocessed_train.parquet"
TEST_DATA_PATH = "../Data/multi_label/tweets_preprocessed_test.parquet"
VAL_DATA_PATH = "../Data/multi_label/tweets_preprocessed_validation.parquet"
VOCABULARY_PATH = "../Data/top_1000_vocabulary.json"
RANDOM_STATE = 42

In [15]:
# Load the Top 1000 vocabulary from Lab 4
print("Loading vocabulary from Lab 4...")
with open(VOCABULARY_PATH, 'r', encoding='utf-8') as f:
    vocab_data = json.load(f)

TOP_VOCABULARY = vocab_data['tokens']

print(f"✓ Loaded vocabulary from: {VOCABULARY_PATH}")
print(f"✓ Vocabulary size: {len(TOP_VOCABULARY)}")
print(f"✓ First 10 tokens: {TOP_VOCABULARY[:10]}")
print(f"✓ Description: {vocab_data.get('description', 'N/A')}")

Loading vocabulary from Lab 4...
✓ Loaded vocabulary from: ../Data/top_1000_vocabulary.json
✓ Vocabulary size: 1000
✓ First 10 tokens: ['new', 'game', 'day', 'good', 'year', 'love', 'time', 'win', 'come', 'happy']
✓ Description: Top 1000 most frequent tokens from preprocessed tweets (Lab 4)


In [16]:
# Load preprocessed training data from Lab 2
print("\nLoading preprocessed training data from Lab 2...")
df_train = pd.read_parquet(TRAIN_DATA_PATH)
print(f"Training samples: {len(df_train):,}")
print(f"Columns: {df_train.columns.tolist()}")
df_train.head(3)


Loading preprocessed training data from Lab 2...
Training samples: 5,465
Columns: ['text', 'label_name', 'label']


|   | text | label_name | label |
|---|------|------------|-------|
| 0 | lumber beat rapid game western division final ... | ['sports'] | [0 0 0 0 0 1] |
| 1 | hear eli gold announce auburn game dumbass | ['sports'] | [0 0 0 0 0 1] |
| 2 | phone away try look home game ticket october | ['sports'] | [0 0 0 0 0 1] |


In [17]:
print("Loading preprocessed test and validation data...")
df_test = pd.read_parquet(TEST_DATA_PATH)
df_val = pd.read_parquet(VAL_DATA_PATH)
print(f"Test samples: {len(df_test):,}")
print(f"Validation samples: {len(df_val):,}")

Loading preprocessed test and validation data...
Test samples: 1,511
Validation samples: 178


---
## 3. Train LDA Topic Models

### 3.1 Dictionary and Corpus
The Lab 2 output is already preprocessed, so tweets are only split on whitespace and then restricted to the Lab 4 vocabulary, which lets LDA concentrate on the most frequent, meaningful tokens. The `Dictionary` is built directly from these filtered tokens without further pruning (no `filter_extremes`), so all 1,000 vocabulary terms, including the rarer ones, remain available for the later interpretation work.
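
To make the bag-of-words step concrete, here is a minimal sketch in plain Python of what a token-to-id mapping and a `doc2bow`-style conversion produce. It is a stand-in for illustration, not Gensim's implementation: the helpers `build_token_ids` and `to_bow` and the toy documents are hypothetical, and Gensim may assign ids in a different order.

```python
from collections import Counter


def build_token_ids(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token_ids = {}
    for doc in docs:
        for tok in doc:
            token_ids.setdefault(tok, len(token_ids))
    return token_ids


def to_bow(doc, token_ids):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token_ids[tok] for tok in doc if tok in token_ids)
    return sorted(counts.items())


toy_docs = [["beat", "game", "final", "hit"], ["game", "win", "game"]]
token_ids = build_token_ids(toy_docs)
toy_corpus = [to_bow(doc, token_ids) for doc in toy_docs]
print(token_ids)
print(toy_corpus)
```

Each document becomes a sparse list of `(id, count)` pairs, which is the shape LDA consumes; word order inside a tweet is discarded.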

### 3.2 Topic Search Grid
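To satisfy the task requirement of experimenting with multiple topic counts, we sweep over a `topic_grid`. Model selection prioritizes the `c_v` score from `CoherenceModel`; it is computed on the training tokens, because the validation split is small and its tweets become very short after vocabulary filtering. As a secondary signal we report `log_perplexity` on the validation corpus, which is Gensim's per-word likelihood bound (values closer to zero are better), and we complement both metrics with manual inspection.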
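
The selection logic with its fallback can be sketched as a small function. This is an illustration under assumptions, not the code used in this notebook: `pick_best_k` and the key `val_bound` are hypothetical, the second metric is assumed to be a score where higher is better (as with Gensim's per-word likelihood bound), and the numbers are copied from this notebook's sweep output.

```python
import math


def pick_best_k(rows):
    """Pick k by highest coherence; if no row has a usable coherence, use the bound instead."""
    usable = [r for r in rows if not math.isnan(r["coherence_c_v"])]
    if usable:
        return max(usable, key=lambda r: r["coherence_c_v"])["topics"]
    # Fallback: highest per-word likelihood bound (closest to zero).
    return max(rows, key=lambda r: r["val_bound"])["topics"]


with_coherence = [
    {"topics": 19, "coherence_c_v": 0.367, "val_bound": -14.324},
    {"topics": 45, "coherence_c_v": 0.451, "val_bound": -14.982},
]
# Simulate the case where coherence could not be computed.
without_coherence = [dict(r, coherence_c_v=float("nan")) for r in with_coherence]

print(pick_best_k(with_coherence))
print(pick_best_k(without_coherence))
```

Note that the two criteria can disagree, as they do here, which is why manual inspection of the topics remains part of the decision.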


In [18]:
def ensure_tokens(sentence: Union[Sequence[str], str]) -> List[str]:
    """Whitespace tokenizer reused from previous labs."""
    if isinstance(sentence, str):
        sentence = sentence.split()
    return list(sentence)

def tokenize_column(series, vocabulary=None):
    """Tokenize tweets and optionally restrict tokens to a curated vocabulary."""
    docs, kept_idx = [], []
    for idx, row in enumerate(series):
        tokens = ensure_tokens(row)
        if vocabulary is not None:
            tokens = [tok for tok in tokens if tok in vocabulary]
        if tokens:
            docs.append(tokens)
            kept_idx.append(idx)
    return docs, kept_idx

def normalize_label(entry):
    """Reduce multi-hot or string labels to a single representative tag."""
    if isinstance(entry, list) and entry:
        return entry[0]
    if isinstance(entry, str):
        matches = re.findall(r"[A-Za-z0-9_&]+", entry)
        if matches:
            return matches[0]
        return entry.strip()
    return "unknown"

In [19]:
# Restrict documents to the curated Lab 4 vocabulary for cleaner topics.
top_vocab = set(TOP_VOCABULARY)

train_df = pd.read_parquet(TRAIN_DATA_PATH)
val_df = pd.read_parquet(VAL_DATA_PATH)

train_tokens, train_idx = tokenize_column(train_df["text"], vocabulary=top_vocab)
val_tokens, val_idx = tokenize_column(val_df["text"], vocabulary=top_vocab)

# Align labels with the filtered token lists so later evaluation uses matching rows.
train_labels = [normalize_label(train_df["label_name"].iloc[i]) for i in train_idx]
val_labels = [normalize_label(val_df["label_name"].iloc[i]) for i in val_idx]

print(f"Docs kept → train: {len(train_tokens)}, val: {len(val_tokens)}")
print(f"Sample tokens: {train_tokens[0][:15]}")



Docs kept → train: 5458, val: 178
Sample tokens: ['beat', 'game', 'final', 'hit']


In [20]:
dictionary = gensim.corpora.Dictionary(train_tokens)

train_corpus = [dictionary.doc2bow(text) for text in train_tokens]
val_corpus = [dictionary.doc2bow(text) for text in val_tokens]

print(f"Dictionary size: {len(dictionary)}")
print(f"Train documents: {len(train_corpus)}, Validation documents: {len(val_corpus)}")


Dictionary size: 1000
Train documents: 5458, Validation documents: 178


In [21]:
def train_lda(num_topics, passes=12):
    return LdaModel(
        corpus=train_corpus,
        num_topics=num_topics,
        id2word=dictionary,
        random_state=RANDOM_STATE,
        chunksize=2048,
        passes=passes,
        alpha="auto",
        eta="auto",
        minimum_probability=0.0,
    )

topic_grid = [15, 19, 35, 45] # label-aligned, mid, and high-granularity k
lda_runs, rows = {}, []

for k in topic_grid:
    model = train_lda(k)
    lda_runs[k] = model

    coherence_model = CoherenceModel(
        model=model,
        texts=train_tokens, # val tokens too sparse for stable c_v
        dictionary=dictionary,
        coherence="c_v",
    )
    coherence = coherence_model.get_coherence()
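    # log_perplexity returns the per-word likelihood bound (closer to zero is better), not perplexity itself.
    perplexity = model.log_perplexity(val_corpus)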

    rows.append(
        {
            "topics": k,
            "coherence_c_v": coherence,
            "val_perplexity": perplexity,
        }
    )
    print(f"k={k:2d} → coherence={coherence:.3f}, perplexity={perplexity:.3f}")




k=15 → coherence=0.324, perplexity=-13.723
k=19 → coherence=0.367, perplexity=-14.324
k=35 → coherence=0.412, perplexity=-15.080
k=45 → coherence=0.451, perplexity=-14.982


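### 3.3 Topic Sweep (k = 15, 19, 35, 45) with Coherence and Perplexity
- 15 topics serve as a coarser setting below the label count.
- 19 topics mirror the 19 supervised labels in the dataset, so we can compare them directly.
- 35 topics act as a mid-granularity level that already splits subthemes.
- 45 topics gave the best trade-off between perplexity and interpretable coherence in earlier runs.

We compute coherence on `train_tokens` because the validation split is small (178 tweets) and its documents become very short after the vocabulary filter, so `c_v` would be unstable or return NaN there; the test set stays untouched. The `val_perplexity` column holds the output of `log_perplexity`, a per-word likelihood bound for which values closer to zero indicate a better fit.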
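
The `perplexity` values printed by the sweep are negative because Gensim's `log_perplexity` returns a per-word likelihood bound rather than perplexity itself. Gensim's own log message derives the perplexity estimate as 2 raised to the negative bound; the sketch below applies that conversion to the bounds reported above. The helper name `bound_to_perplexity` is hypothetical.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word log-likelihood bound (base 2) into a perplexity estimate."""
    return 2.0 ** (-per_word_bound)


# Bounds copied from the sweep output above.
val_bounds = {15: -13.723, 19: -14.324, 35: -15.080, 45: -14.982}

for k, bound in val_bounds.items():
    print(f"k={k:2d} -> bound={bound:.3f}, perplexity~{bound_to_perplexity(bound):,.0f}")

# Lower perplexity is better, which corresponds to the bound closest to zero.
best_k = min(val_bounds, key=lambda k: bound_to_perplexity(val_bounds[k]))
print("Lowest perplexity at k =", best_k)
```

Read this way, held-out fit is best at the smallest k in the grid, while coherence favors the largest, so the two metrics pull in opposite directions on this dataset.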

In [22]:
metrics_df = (
    pd.DataFrame(rows)
    # Both metrics are "higher is better": the bound closest to zero should win ties.
    .sort_values(["coherence_c_v", "val_perplexity"], ascending=[False, False])
    .reset_index(drop=True)
)
display(metrics_df)

best_topic_count = int(metrics_df.iloc[0]["topics"])
best_model = lda_runs[best_topic_count]
print(f"Chosen topic count: {best_topic_count}")

|   | topics | coherence_c_v | val_perplexity |
|---|--------|---------------|----------------|
| 0 | 45 | 0.450866 | -14.982145 |
| 1 | 35 | 0.411792 | -15.080116 |
| 2 | 19 | 0.366885 | -14.324221 |
| 3 | 15 | 0.323755 | -13.723411 |


Chosen topic count: 45


---
## 4. Visualize and Inspect Topics

### 4.1 Interactive pyLDAvis
Use `pyLDAvis.gensim_models.prepare(...)` to render the intertopic distance map (bubble plot) and the top-term bar chart for any trained model. Bubble size encodes how prevalent a topic is in the corpus, the distance between bubbles reflects how similar the topics are, and the λ slider controls whether the term list emphasizes a term's frequency within the topic (λ ≈ 1) or its exclusivity to the topic (λ ≈ 0). The exported HTML dashboards are saved to `../Data/pyldavis_k19.html` and `../Data/pyldavis_k45.html`.
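
The effect of the λ slider can be reproduced by hand. The sketch below implements the relevance score from the LDAvis paper (Sievert and Shirley), λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)), in plain Python. The probabilities are made up for illustration and `relevance` and `rank_terms` are hypothetical helpers, not part of the pyLDAvis API.

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """Relevance of a term for a topic: lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(p_word_given_topic / p_word)


# Hypothetical probabilities: term -> (p(w|t), p(w) in the whole corpus).
terms = {
    "game": (0.050, 0.040),     # frequent in the topic, but common everywhere
    "playoff": (0.010, 0.002),  # rarer, but concentrated in this topic
}


def rank_terms(lam):
    """Order the toy terms from most to least relevant for a given lambda."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)


print("lambda=1.0:", rank_terms(1.0))
print("lambda=0.0:", rank_terms(0.0))
```

At λ = 1 the generic but frequent term ranks first; at λ = 0 the topic-specific term does, which is why intermediate λ values are often the most useful for labeling topics.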

### 4.2 Tabular Keyword Summary
Because notebooks are often shared as static reports, a compact table of the top-N keywords per topic is generated alongside the visualizations. It’s handy for quick scans (e.g., pasting into slides) or when pyLDAvis can’t be opened.


In [23]:
print("Preparing pyLDAvis visualization for k=19 topics ...")
panel_k19 = pyLDAvis.gensim_models.prepare(
    lda_runs[19],
    train_corpus,
    dictionary,
    sort_topics=False,
)
pyLDAvis.enable_notebook()
# The last expression in the cell renders the panel; a second bare `panel_k19` is not needed.
pyLDAvis.display(panel_k19)

Preparing pyLDAvis visualization for k=19 topics ...


In [25]:
print("Preparing pyLDAvis visualization for k=45 topics ...")
panel_k45 = pyLDAvis.gensim_models.prepare(
    lda_runs[45],
    train_corpus,
    dictionary,
    sort_topics=False,
)
pyLDAvis.display(panel_k45)

Preparing pyLDAvis visualization for k=45 topics ...


In [26]:
# Save interactive pyLDAvis visualizations as html files
pyLDAvis.save_html(panel_k19, "../Data/pyldavis_k19.html")
pyLDAvis.save_html(panel_k45, "../Data/pyldavis_k45.html")

def summarize_topics(model: LdaModel, num_words: int = 12) -> pd.DataFrame:
    """Collect top-n keywords per topic for quick, non-interactive inspection."""
    rows = []
    for topic_id, word_info in model.show_topics(
        num_topics=-1, num_words=num_words, formatted=False
    ):
        # Join the per-topic (word, weight) tuples into a readable comma-separated string.
        tokens = ", ".join(word for word, _ in word_info)
        rows.append({"topic": topic_id, "keywords": tokens})
    return pd.DataFrame(rows).sort_values("topic")


print("Top keywords for k=19 model")
topics_k19 = summarize_topics(lda_runs[19], num_words=12)
display(topics_k19)

print("Top keywords for k=45 model")
topics_k45 = summarize_topics(lda_runs[45], num_words=12)
display(topics_k45)

Top keywords for k=19 model


|   | topic | keywords |
|---|-------|----------|
| 0 | 0 | change, climate, check, brown, mask, country, ... |
| 1 | 1 | vote, fire, break, bad, movie, bring, set, mon... |
| 2 | 2 | team, win, game, good, go, time, league, year,... |
| 3 | 3 | red, heart, mark, family, love, double, sparkl... |
| 4 | 4 | love, man, look, get, watch, harry, fall, want... |
| 5 | 5 | hour, late, bay, wait, coach, daily, head, nfl... |
| 6 | 6 | week, social, half, tweet, lead, round, second... |
| 7 | 7 | day, happy, st, good, love, wish, enjoy, lockd... |
| 8 | 8 | news, open, say, trump, people, hear, god, rea... |
| 9 | 9 | star, power, war, month, ask, year, understand... |


Top keywords for k=45 model


|   | topic | keywords |
|---|-------|----------|
| 0 | 0 | say, don, people, school, bill, government, le... |
| 1 | 1 | week, stream, find, home, away, way, try, flor... |
| 2 | 2 | go, look, good, boy, weekend, lose, to, forwar... |
| 3 | 3 | red, heart, sparkle, love, huge, good, true, r... |
| 4 | 4 | song, dance, twitter, app, tweet, party, hit, ... |
| 5 | 5 | thank, open, late, place, daily, didn, excited... |
| 6 | 6 | state, call, ohio, pick, playoff, protest, pay... |
| 7 | 7 | st, wish, coach, summer, service, learn, cowbo... |
| 8 | 8 | th, wait, head, hour, leave, follow, club, la,... |
| 9 | 9 | trump, power, president, easter, understand, b... |


---
## 5. Interpret Topics and Align Them with Dataset Labels

### 5.1 Topic → Label Alignment

Each tweet is assigned to the topic with the highest posterior probability and paired with its supervised label from the dataset. Aggregating these `(topic, label)` pairs lets us inspect, per topic, which labels dominate, how often a label appears, and how confident the model was when it picked that topic. The resulting table exposes:

- **Primary labels per topic** – high `count` + `topic_share` values show that, say, Topic 10 is mostly `sports`, while Topic 5 blends `diaries_&_daily_life` and `news_&_social_concern`.
- **Mixed topics** – topics whose top three labels all have similar shares indicate overlapping vocabulary or label definitions (useful for revisiting the taxonomy).
- **Confidence signals** – `avg_confidence` highlights whether the dominant label-topic pairing is strong (>0.4) or tentative (<0.25), guiding where further preprocessing or more topics might help.
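
The aggregation behind these three signals can be sketched without pandas or Gensim. The helper `align_topics` and the toy distributions below are hypothetical and only illustrate how `count`, `topic_share`, and `avg_confidence` relate to each other.

```python
from collections import Counter, defaultdict


def align_topics(doc_topic_probs, labels):
    """Assign each document to its most probable topic and tally labels per topic."""
    label_counts = defaultdict(Counter)
    confidences = defaultdict(list)
    for probs, label in zip(doc_topic_probs, labels):
        topic = max(range(len(probs)), key=probs.__getitem__)
        label_counts[topic][label] += 1
        confidences[(topic, label)].append(probs[topic])

    table = []
    for topic, counts in sorted(label_counts.items()):
        total = sum(counts.values())
        for label, count in counts.most_common():
            conf = confidences[(topic, label)]
            table.append(
                {
                    "topic": topic,
                    "label": label,
                    "count": count,
                    "topic_share": count / total,
                    "avg_confidence": sum(conf) / len(conf),
                }
            )
    return table


# Hypothetical per-document topic distributions over two topics.
toy_probs = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.9, 0.1]]
toy_labels = ["sports", "sports", "music", "news"]

for row in align_topics(toy_probs, toy_labels):
    print(row)
```

In this toy example topic 0 is two-thirds `sports`, yet the single `news` tweet was assigned with higher confidence, which shows why share and confidence should be read together.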

In [27]:
def topic_label_alignment(model: LdaModel, corpus, labels, top_n: int = 3) -> pd.DataFrame:
    assignments = []
    for bow, label in zip(corpus, labels):
        topic_id, topic_prob = max(
            model.get_document_topics(bow, minimum_probability=0.0),
            key=lambda item: item[1],
        )
        assignments.append((topic_id, label, topic_prob))

    alignment = (
        pd.DataFrame(assignments, columns=["topic", "label", "probability"])
        .groupby(["topic", "label"])
        .agg(count=("probability", "size"),
             avg_confidence=("probability", "mean"))
        .reset_index()
    )
    alignment["topic_share"] = alignment.groupby("topic")["count"].transform(
        lambda counts: counts / counts.sum()
    )
    return alignment.sort_values(["topic", "count"], ascending=[True, False]) \
                    .groupby("topic").head(top_n)


print("Topic ↔ label alignment (k=19)")
alignment_k19 = topic_label_alignment(lda_runs[19], train_corpus, train_labels)
display(alignment_k19)

print("Topic ↔ label alignment (k=45)")
alignment_k45 = topic_label_alignment(lda_runs[45], train_corpus, train_labels)
display(alignment_k45)

Topic ↔ label alignment (k=19)


|   | topic | label | count | avg_confidence | topic_share |
|---|-------|-------|-------|----------------|-------------|
| 4 | 0 | news_&_social_concern | 171 | 0.468392 | 0.546326 |
| 0 | 0 | celebrity_&_pop_culture | 46 | 0.456204 | 0.146965 |
| 1 | 0 | diaries_&_daily_life | 35 | 0.403486 | 0.111821 |
| 10 | 1 | news_&_social_concern | 75 | 0.439523 | 0.270758 |
| 8 | 1 | film_tv_&_video | 72 | 0.486784 | 0.259928 |
| 7 | 1 | diaries_&_daily_life | 41 | 0.446835 | 0.148014 |
| 17 | 2 | sports | 337 | 0.475933 | 0.575085 |
| 16 | 2 | news_&_social_concern | 70 | 0.396207 | 0.119454 |
| 13 | 2 | diaries_&_daily_life | 57 | 0.46066 | 0.09727 |
| 19 | 3 | diaries_&_daily_life | 36 | 0.464097 | 0.290323 |


Topic ↔ label alignment (k=45)


|   | topic | label | count | avg_confidence | topic_share |
|---|-------|-------|-------|----------------|-------------|
| 4 | 0 | news_&_social_concern | 118 | 0.291747 | 0.548837 |
| 0 | 0 | celebrity_&_pop_culture | 27 | 0.305466 | 0.125581 |
| 1 | 0 | diaries_&_daily_life | 25 | 0.269342 | 0.116279 |
| 6 | 1 | celebrity_&_pop_culture | 24 | 0.295133 | 0.242424 |
| 7 | 1 | diaries_&_daily_life | 22 | 0.316984 | 0.222222 |
| ... | ... | ... | ... | ... | ... |
| 258 | 43 | film_tv_&_video | 51 | 0.360419 | 0.300000 |
| 259 | 43 | music | 49 | 0.406213 | 0.288235 |
| 266 | 44 | news_&_social_concern | 71 | 0.343144 | 0.533835 |
| 263 | 44 | diaries_&_daily_life | 25 | 0.319519 | 0.187970 |


### 5.2 Discussion and Observations

- **k = 19 (label-aligned)**: Most latent topics map cleanly to a single supervised class: `sports`, `music`, and `news_&_social_concern` dominate individual topics with ~70–80 % share. This makes the model easy to explain to stakeholders, yet some clusters still mix politics with general news or lifestyle chatter, showing that 19 topics can’t perfectly isolate every class.
- **k = 45 (metric best)**: Coherence is the highest in the sweep (c_v = 0.451), but the validation likelihood bound is lower than at k = 19 (-14.98 vs. -14.32), so the gain in coherence does not come with a better held-out fit. The pyLDAvis view shows tightly packed bubbles for climate activism, NBA trade rumors, celebrity obituaries, pandemic updates, etc. These subtopics are great for exploratory analysis but fragment the supervised classes (multiple sports/politics bubbles), so this model trades simplicity for detail.
- **Topic divergences**: Several topics capture conversational filler (“good morning”, “happy birthday”) or sustained climate discussions that have no dedicated label, suggesting blind spots in the taxonomy. Climate-related tweets drifting away from general news at k=45 hint that an “environment” class could be useful.

### 5.3 How the Topics Inform the Dataset

- **Alignment with classes**: When k equals the class count, topics mostly reflect the label definitions, confirming that the preprocessing from earlier labs preserved the label structure.
- **Where to use which model**: Use the k=19 model when you need consistency with the supervised classifier or want to annotate new tweets with the existing classes. Use the k=45 model for editorial research, content audits, or to surface emerging subthemes ahead of creating new labels.
- **Practical takeaway**: Topic modeling highlights both well-covered areas (sports, entertainment) and underrepresented concepts (climate, casual chatter). This helps prioritize where to extend the label inventory or build specialized classifiers.

### 5.4 Visual Exploration

Inspect the exported pyLDAvis dashboards to drill into word distributions and bubble overlaps:

```bash
open grundlagen-des-nlp-ws25_26/Abgabe/Data/pyldavis_k19.html
open grundlagen-des-nlp-ws25_26/Abgabe/Data/pyldavis_k45.html
```
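
The `open` command above is the macOS launcher. As a platform-independent alternative, the standard-library `webbrowser` module can open the same files. This is a minimal sketch: `dashboard_uri` and `open_dashboard` are hypothetical helpers, and the relative path is assumed from the save cell in Section 4.

```python
import webbrowser
from pathlib import Path


def dashboard_uri(html_path):
    """Return a file:// URI for a saved dashboard, resolved to an absolute path."""
    return Path(html_path).resolve().as_uri()


def open_dashboard(html_path):
    """Open the dashboard in the default browser; returns False if no browser could be launched."""
    return webbrowser.open(dashboard_uri(html_path))


# Path assumed from the save cell in Section 4; adjust it to your checkout.
print(dashboard_uri("../Data/pyldavis_k19.html"))
# open_dashboard("../Data/pyldavis_k19.html")  # uncomment to launch the browser
```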

