# 🔠 Topic Modeling with Gensim LDA

This notebook demonstrates how to perform topic modeling using Gensim's LDA. It includes data preprocessing, training the model, extracting topics, and evaluating the quality of those topics.

---
## 🧭 How to Use This Notebook

Some cells require **user input**, such as setting file paths or choosing how many topics the model should output. These are clearly commented with:

```python
# 🔧 USER: Change this value if needed
```

Check and update these parts before running the notebook. Everything else can be run without changes.

---
## 📂 Input File:
- Either an **Excel file** (e.g. `Data.xlsx`) or a **CSV file** (e.g. `Data.csv`) with a column named `'content'`

## 📁 Output Files:
- `top_docs_df.csv`: Shows top documents per topic
- `processed_data_with_topics.csv`: Shows all documents with their topic proportions
- `top_keywords_per_topic.csv`: Top words that define each topic
- `lda_visualization.html`: Interactive topic visualization

---

## 1. Import Required Libraries

In [1]:
# We import all necessary libraries for text processing, topic modeling, and visualization.
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from gensim import corpora, models
from gensim.models import CoherenceModel
import pyLDAvis.gensim_models
import pyLDAvis

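# Download required NLTK resources: 'stopwords' for stopword removal and 'wordnet' for lemmatization.
# 'punkt' is not used by the RegexpTokenizer below; it is only needed by NLTK's word_tokenize/sent_tokenize.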
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mb582\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mb582\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mb582\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## 2. Set Number of Topics

In [None]:
# 🔧 USER: Set how many topics the model should discover.
# This determines how many themes will be generated from the text.
NUM_TOPICS = 5

## 3. Load Dataset (Choose One: Excel or CSV)

In [None]:
# 🔧 USER: Use only one of the two options below. Comment out the one you don't use.
# Make sure the file has a column named 'content'.

# Option A: Load from Excel
df = pd.read_excel('Data.xlsx')  # 🔧 Change file path if needed

# Option B: Load from CSV
# df = pd.read_csv('Data.csv')   # 🔧 Uncomment this line if using a CSV file

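# Remove rows where the 'content' field is missing (NaN). Blank strings are
# dropped later by the minimum-token filter. Copy to avoid SettingWithCopyWarning.
df = df[df['content'].notna()].copy()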
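
The two loading options above could also be wrapped in a single function that picks the reader from the file extension. The sketch below is one possible way to do that; `load_dataset` and its `text_column` parameter are hypothetical names, not part of the notebook.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path, text_column="content"):
    """Hypothetical helper: choose the pandas reader from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in {".xlsx", ".xls"}:
        frame = pd.read_excel(path)
    elif suffix == ".csv":
        frame = pd.read_csv(path)
    else:
        raise ValueError(f"Unsupported file type: {suffix!r}")

    if text_column not in frame.columns:
        raise KeyError(f"Expected a column named {text_column!r}")

    # Keep rows whose text is present and not blank after stripping whitespace.
    text = frame[text_column]
    keep = text.notna() & (text.astype(str).str.strip() != "")
    return frame[keep].copy()
```

With this helper, the cell would reduce to something like `df = load_dataset('Data.xlsx')`, and a missing `'content'` column would fail immediately with a clear error.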

## 4. Preprocess the Text

In [None]:
# This step tokenizes the text, converts it to lowercase,
# removes stopwords (both standard and custom), and lemmatizes each word.

tokenizer = RegexpTokenizer(r'\w+')
lemmatizer = WordNetLemmatizer()

# Define stopwords from NLTK and custom list
nltk_stopwords = set(stopwords.words('english'))
custom_stopwords = {
    'one', 'know', 'could', 'like'
}
all_stopwords = nltk_stopwords.union(custom_stopwords)

# Function to clean each document
def preprocess(text):
    if not isinstance(text, str):
        return []
    tokens = tokenizer.tokenize(text.lower())
    return [lemmatizer.lemmatize(token) for token in tokens if token not in all_stopwords and len(token) > 2]

# Apply preprocessing function
df['processed'] = df['content'].apply(preprocess)

# Remove documents with fewer than 3 tokens after cleaning
df = df[df['processed'].apply(len) >= 3].copy()
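
To see what the filtering steps do to a single sentence, here is a minimal standard-library sketch. It assumes a tiny illustrative stopword list (`DEMO_STOPWORDS`, a made-up name) instead of NLTK's English list, and it skips lemmatization entirely, so the output is not identical to what `preprocess` returns.

```python
import re

# Assumption: a tiny illustrative stopword list, not NLTK's full English list.
DEMO_STOPWORDS = {"the", "and", "are", "was", "like", "one"}


def preprocess_demo(text, stopwords=DEMO_STOPWORDS, min_length=3):
    """Lowercase, split on word characters, drop stopwords and short tokens."""
    if not isinstance(text, str):
        return []
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords and len(t) >= min_length]


print(preprocess_demo("The cats and dogs are in the garden, like one big family!"))
# ['cats', 'dogs', 'garden', 'big', 'family']
```

In the notebook itself, the WordNet lemmatizer would typically reduce plural nouns such as `cats` to their singular form as well.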

## 5. Create Dictionary and Corpus

In [None]:
# Gensim requires input in the form of a dictionary and a corpus.
# The dictionary maps each word to an integer ID, and the corpus represents
# each document as a list of (word ID, count) pairs (a bag of words).
dictionary = corpora.Dictionary(df['processed'])
corpus = [dictionary.doc2bow(text) for text in df['processed']]
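
The bag-of-words format can be illustrated without Gensim. The sketch below mimics the idea behind `corpora.Dictionary` and `doc2bow` in plain Python; `build_vocabulary` and `to_bag_of_words` are hypothetical names, and Gensim may assign IDs in a different order.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def to_bag_of_words(doc, token_to_id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())


docs = [["cat", "dog", "cat"], ["dog", "bird"]]
vocab = build_vocabulary(docs)
bows = [to_bag_of_words(doc, vocab) for doc in docs]
print(vocab)  # {'cat': 0, 'dog': 1, 'bird': 2}
print(bows)   # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation; only how often each word occurs in a document is kept.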

## 6. Train the LDA Model

In [None]:
# This is where the topic modeling happens.
# We specify the number of topics and let the model learn patterns in the text.

lda_model = models.LdaModel(
    corpus=corpus,
    num_topics=NUM_TOPICS,
    id2word=dictionary,
    passes=50,
    iterations=400,
    alpha='auto',
    eta='auto',
    random_state=42
)

## 7. Assign Topics to Each Document

In [None]:
# This gives each document a list of (topic ID, probability) pairs.
# Topics below the model's minimum_probability (0.01 by default) are omitted.
df['Document Topics'] = [lda_model.get_document_topics(doc) for doc in corpus]
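
Each entry in `Document Topics` is a list of `(topic_id, probability)` pairs. If a single label per document is wanted, one option is to take the pair with the highest probability. This is a minimal sketch; `dominant_topic` is a hypothetical helper and the numbers are made up.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability, or None."""
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])


example = [(0, 0.12), (2, 0.81), (4, 0.07)]
print(dominant_topic(example))  # (2, 0.81)
```

A single label discards the mixture information, so it is probably best treated as a convenience for sorting or filtering rather than a replacement for the full distribution.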

## 8. View Topic Proportions in the Whole Dataset

In [None]:
# Calculate the average proportion of each topic across all documents.
topic_counts = np.zeros(NUM_TOPICS)
for doc_topics in df['Document Topics']:
    for topic_num, prop in doc_topics:
        topic_counts[topic_num] += prop
topic_proportions = topic_counts / len(corpus)

# Print topic proportions
for topic_num, proportion in enumerate(topic_proportions):
    print(f"Topic {topic_num}: {proportion:.5f}")
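
The averaging step can be checked on toy data. The sketch below uses made-up probabilities and a hypothetical function name; it also shows that when small topic probabilities have been left out of a document's list, the averaged proportions sum to slightly less than 1.

```python
def average_topic_proportions(all_doc_topics, num_topics):
    """Average each topic's probability over all documents."""
    totals = [0.0] * num_topics
    for doc_topics in all_doc_topics:
        for topic_id, prob in doc_topics:
            totals[topic_id] += prob
    n_docs = len(all_doc_topics)
    return [total / n_docs for total in totals] if n_docs else totals


toy = [
    [(0, 0.9), (1, 0.1)],
    [(1, 0.995)],  # topic 0 (0.005) fell below the cutoff and is missing
]
props = average_topic_proportions(toy, num_topics=2)
print(props, sum(props))
```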

## 9. Extract Top Documents for Each Topic

In [None]:
# For each topic, find the top 4 documents that represent it the most.
top_docs_per_topic = {i: [] for i in range(NUM_TOPICS)}
for idx, topics in enumerate(df['Document Topics']):
    for topic_num, prop in topics:
        top_docs_per_topic[topic_num].append((idx, prop))
for topic_num in top_docs_per_topic:
    top_docs_per_topic[topic_num] = sorted(top_docs_per_topic[topic_num], key=lambda x: x[1], reverse=True)[:4]

top_docs_df = pd.DataFrame([
    (topic_num, df.iloc[idx]['content'], prop)
    for topic_num, docs in top_docs_per_topic.items()
    for idx, prop in docs
], columns=['Topic', 'Content', 'Proportion'])

# Save result to CSV
top_docs_df.to_csv('top_docs_df.csv', index=False)
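
As an alternative to sorting every topic's full list, the top-k selection can be sketched with `heapq.nlargest` from the standard library. The function name and toy data are illustrative; unlike the cell above, topics that appear in no document are simply absent from the result.

```python
import heapq
from collections import defaultdict


def top_documents_per_topic(all_doc_topics, k=4):
    """Map each topic ID to its k highest-probability (doc_index, prob) pairs."""
    by_topic = defaultdict(list)
    for doc_index, doc_topics in enumerate(all_doc_topics):
        for topic_id, prob in doc_topics:
            by_topic[topic_id].append((doc_index, prob))
    return {
        topic_id: heapq.nlargest(k, pairs, key=lambda pair: pair[1])
        for topic_id, pairs in by_topic.items()
    }


toy = [[(0, 0.7), (1, 0.3)], [(0, 0.2), (1, 0.8)], [(0, 0.9)]]
print(top_documents_per_topic(toy, k=2))
# {0: [(2, 0.9), (0, 0.7)], 1: [(1, 0.8), (0, 0.3)]}
```

The document indices here are positional, which matches the `df.iloc[idx]` lookup used in the notebook.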

## 10. Save Full Topic Proportions for Each Document

In [None]:
# Save each document with its topic probabilities, including very small ones
# (minimum_probability=0 disables the default 0.01 cutoff).
df['Topic Proportions'] = [
    lda_model.get_document_topics(doc, minimum_probability=0) for doc in corpus
]
df.to_csv('processed_data_with_topics.csv', index=False)
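
In the CSV, the `Topic Proportions` column ends up as the text of a Python list, which can be awkward to parse in other tools. One option is to expand it into one numeric column per topic before saving. The sketch below shows the per-row conversion; `to_wide_row` and the `topic_` prefix are hypothetical choices.

```python
def to_wide_row(doc_topics, num_topics, prefix="topic_"):
    """Turn [(topic_id, prob), ...] into {'topic_0': p0, ...}, filling gaps with 0.0."""
    row = {f"{prefix}{i}": 0.0 for i in range(num_topics)}
    for topic_id, prob in doc_topics:
        row[f"{prefix}{topic_id}"] = float(prob)
    return row


print(to_wide_row([(0, 0.25), (2, 0.75)], num_topics=3))
# {'topic_0': 0.25, 'topic_1': 0.0, 'topic_2': 0.75}
```

A list of such rows could then be passed to `pd.DataFrame` and joined to the original data by position.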

## 11. Show Top Keywords per Topic

In [None]:
# For each topic, print and save the top 10 most representative words.
top_keywords_per_topic = {i: lda_model.show_topic(i, 10) for i in range(NUM_TOPICS)}
for topic_num, keywords in top_keywords_per_topic.items():
    print(f"\nTopic {topic_num}:")
    for word, weight in keywords:
        print(f"{word}: {weight:.4f}")

top_keywords_df = pd.DataFrame([
    (topic, word, weight)
    for topic, keywords in top_keywords_per_topic.items()
    for word, weight in keywords
], columns=['Topic', 'Word', 'Weight'])

top_keywords_df.to_csv('top_keywords_per_topic.csv', index=False)
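
For a quick overview, each topic's `(word, weight)` list can be collapsed into a one-line summary. This is only a small formatting sketch with made-up keywords; `summarize_topic` is a hypothetical name.

```python
def summarize_topic(keywords, max_words=5):
    """Join the first max_words words of a [(word, weight), ...] list."""
    return ", ".join(word for word, _weight in keywords[:max_words])


demo_keywords = [("price", 0.051), ("market", 0.043), ("stock", 0.038)]
print(summarize_topic(demo_keywords, max_words=2))  # price, market
```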

## 12. Visualize Topics Using pyLDAvis

In [None]:
# Create an interactive visualization that shows topic relationships and top words.
lda_visualization = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary, sort_topics=False)
pyLDAvis.save_html(lda_visualization, 'lda_visualization.html')
pyLDAvis.display(lda_visualization)

## 13. Evaluate Model Quality

In [None]:
# Evaluate the model using two common metrics:
# - Per-word log-likelihood bound (closer to zero is better). Gensim's
#   log_perplexity() returns this bound, not the perplexity itself.
# - Coherence (higher is better)
print("Per-word log-likelihood bound:", lda_model.log_perplexity(corpus))
coherence_model = CoherenceModel(model=lda_model, texts=df['processed'].tolist(), dictionary=dictionary, coherence='c_v')
print("Coherence Score:", coherence_model.get_coherence())
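
Gensim documents the value returned by `log_perplexity` as a per-word likelihood bound, and reports perplexity as 2 raised to the negative of that bound. The conversion can be sketched as follows; `perplexity_from_bound` is a hypothetical helper and the input value is made up.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound into perplexity: 2 ** (-bound)."""
    return 2.0 ** (-per_word_bound)


print(perplexity_from_bound(-8.0))  # 256.0
```

A bound closer to zero therefore corresponds to a lower perplexity. Both numbers are most meaningful when comparing models on the same held-out documents.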