##### Topic Modeling with LDA

In this notebook, we perform unsupervised topic modeling with Gensim's implementation of Latent Dirichlet Allocation (LDA).  
We then visualize the discovered topics with pyLDAvis.

Steps:
- Preprocess documents
- Create dictionary + corpus
- Train LDA model
- Visualize topics

**Sample Documents**

We use six short synthetic sentences (about sports, consumer electronics, and photography) to see how LDA groups words into topics.

In [11]:
documents = [
    "I love love watching football and cricket.",
    "The player scored a stunning goal in the final match.",
    "Messi and Ronaldo are football legends.",
    "I just bought a new iPhone and a MacBook.",
    "Apple devices are expensive but premium quality.",
    "I love photography with my new DSLR and lens."
]

In [12]:
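import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import gensim
from gensim import corpora

# One-time downloads of the NLTK data used in the preprocessing step.
nltk.download('stopwords', quiet=True)  # English stopword list
nltk.download('punkt', quiet=True)      # tokenizer models for word_tokenize
nltk.download('punkt_tab', quiet=True)  # assumption: required by newer NLTK releases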


**Preprocessing**

We lowercase each document, tokenize it, and keep only alphabetic tokens that are not English stopwords. This drops punctuation and numbers before modeling.


In [13]:
stop_words = set(stopwords.words('english'))

def preprocess(text):
    tokens = word_tokenize(text.lower())
    return [word for word in tokens if word.isalpha() and word not in stop_words]

processed_docs = [preprocess(doc) for doc in documents]

In [14]:
processed_docs

[['love', 'love', 'watching', 'football', 'cricket'],
 ['player', 'scored', 'stunning', 'goal', 'final', 'match'],
 ['messi', 'ronaldo', 'football', 'legends'],
 ['bought', 'new', 'iphone', 'macbook'],
 ['apple', 'devices', 'expensive', 'premium', 'quality'],
 ['love', 'photography', 'new', 'dslr', 'lens']]

**Dictionary and Corpus**

LDA uses:
- Dictionary: a mapping from each unique word to an integer ID
- Corpus: a list of documents, each stored as a bag-of-words vector of (word ID, count) pairs


In [15]:
dictionary = corpora.Dictionary(processed_docs)
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]


In [16]:
print(dictionary.token2id)

{'cricket': 0, 'football': 1, 'love': 2, 'watching': 3, 'final': 4, 'goal': 5, 'match': 6, 'player': 7, 'scored': 8, 'stunning': 9, 'legends': 10, 'messi': 11, 'ronaldo': 12, 'bought': 13, 'iphone': 14, 'macbook': 15, 'new': 16, 'apple': 17, 'devices': 18, 'expensive': 19, 'premium': 20, 'quality': 21, 'dslr': 22, 'lens': 23, 'photography': 24}


In [17]:
corpus

[[(0, 1), (1, 1), (2, 2), (3, 1)],
 [(4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)],
 [(1, 1), (10, 1), (11, 1), (12, 1)],
 [(13, 1), (14, 1), (15, 1), (16, 1)],
 [(17, 1), (18, 1), (19, 1), (20, 1), (21, 1)],
 [(2, 1), (16, 1), (22, 1), (23, 1), (24, 1)]]
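
To make the (word ID, count) pairs concrete, here is a minimal pure-Python sketch that rebuilds the mapping and vectors for the first three documents. It is an illustration only, not Gensim's implementation: the helper names `build_token2id` and `to_bow` are hypothetical, and sorting tokens within each document is an assumption made to match the ID order printed above.

```python
from collections import Counter

# First three preprocessed documents, copied from the output above.
sample_docs = [
    ['love', 'love', 'watching', 'football', 'cricket'],
    ['player', 'scored', 'stunning', 'goal', 'final', 'match'],
    ['messi', 'ronaldo', 'football', 'legends'],
]

def build_token2id(docs):
    """Assign the next free integer ID to each unseen token."""
    token2id = {}
    for doc in docs:
        # Assumption: visit new tokens in sorted order per document,
        # which reproduces the ID order shown in this notebook.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Count how often each known token ID occurs in one document."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

sketch_token2id = build_token2id(sample_docs)
sketch_corpus = [to_bow(doc, sketch_token2id) for doc in sample_docs]

print(sketch_token2id)
for bow in sketch_corpus:
    print(bow)
```

The pair `(2, 2)` in the first vector records that "love" (ID 2) appears twice in the first document. Word order is discarded, which is what "bag of words" means.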

**Train LDA Model**

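We use `gensim.models.LdaModel` to extract topics from the corpus. `num_topics` sets how many topics to fit, `passes` sets how many times training goes over the corpus, and `random_state` fixes the seed so that repeated runs give the same topics.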


In [21]:
from gensim.models import LdaModel

lda_model = LdaModel(corpus=corpus,
                     id2word=dictionary,
                     num_topics=2,  # try changing this to 3 or 4
                     random_state=42,
                     passes=10)

# Print each topic as a weighted list of its top words (10 per topic by default)
for idx, topic in lda_model.print_topics():
    print(f"Topic {idx + 1}:", topic)


Topic 1: 0.123*"love" + 0.114*"football" + 0.068*"cricket" + 0.068*"watching" + 0.068*"ronaldo" + 0.068*"messi" + 0.068*"legends" + 0.024*"lens" + 0.024*"photography" + 0.024*"dslr"
Topic 2: 0.077*"new" + 0.046*"quality" + 0.046*"expensive" + 0.046*"premium" + 0.046*"apple" + 0.046*"devices" + 0.046*"stunning" + 0.046*"macbook" + 0.046*"iphone" + 0.046*"bought"
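
Each topic string lists only the highest-weighted words, so the printed weights do not add up to 1. The sketch below parses the Topic 1 string from the output above and sums the weights that are shown. The helper `parse_topic` is hypothetical, and the printed weights are rounded, so the totals are approximate.

```python
import re

# Topic 1 string copied from the output above.
topic_1_string = (
    '0.123*"love" + 0.114*"football" + 0.068*"cricket" + 0.068*"watching" + '
    '0.068*"ronaldo" + 0.068*"messi" + 0.068*"legends" + 0.024*"lens" + '
    '0.024*"photography" + 0.024*"dslr"'
)

def parse_topic(topic_string):
    """Turn 'weight*"word" + ...' into a list of (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

shown_terms = parse_topic(topic_1_string)
shown_mass = sum(weight for _, weight in shown_terms)

vocabulary_size = 25  # the dictionary above has IDs 0 through 24
hidden_words = vocabulary_size - len(shown_terms)

print(f"Words shown: {len(shown_terms)}")
print(f"Weight shown: {shown_mass:.3f}")
print(f"Weight spread over the other {hidden_words} words: {1 - shown_mass:.3f}")
```

In the notebook itself there is no need to parse strings: `lda_model.show_topic(topic_id)` returns the (word, weight) pairs directly.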


**Visualize Topics**

We use `pyLDAvis` to explore topics interactively.

In [19]:
import pyLDAvis.gensim_models
import pyLDAvis

pyLDAvis.enable_notebook()

vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
vis