================================================================================
 TOPIC MODELING SETUP - ENVIRONMENT DESCRIPTION
================================================================================

 Purpose:
 --------
 This setup prepares a complete Python environment for topic modeling using 
 popular libraries such as Gensim, scikit-learn, and pyLDAvis.

 Module Breakdown:
 -----------------
 ▸ TEXT PREPROCESSING
   - Provides tools to clean and normalize raw text data.
   - Operations include removing HTML tags, punctuation, numbers, stopwords, 
     and short words.

 ▸ TOPIC MODELING (Gensim)
   - Dictionary: Creates a mapping of word tokens to unique IDs.
   - LDA Model: Builds a Latent Dirichlet Allocation model to extract topics
     from a text corpus.

 ▸ DATA SOURCES & DIMENSIONALITY REDUCTION (Scikit-learn)
   - fetch_20newsgroups: Loads a classic benchmark text dataset.
   - TruncatedSVD: Optionally reduces dimensions of sparse matrices using 
     Latent Semantic Analysis (LSA).

 ▸ MATRIX SUPPORT (SciPy)
   - Enables efficient storage and manipulation of large sparse matrices
     generated from text corpora.

 ▸ GENERAL UTILITIES
   - pandas & numpy: Essential for data manipulation, transformation, and 
     numerical operations.

 ▸ VISUALIZATION
   - pyLDAvis: Generates interactive visualizations to explore and interpret 
     the topics discovered by LDA.

 Usage:
 ------
 Run this environment as the foundation for any topic modeling pipeline.
 Recommended: Integrate with a preprocessing → corpus creation → LDA training 
 → visualization workflow.

================================================================================


In [1]:
from gensim.utils import tokenize
from gensim.parsing.preprocessing import (
    preprocess_string,
    strip_tags,
    strip_punctuation,
    strip_numeric,
    remove_stopwords,
    strip_short)

from gensim.corpora.dictionary import Dictionary
from gensim import models
from gensim.models.ldamodel import LdaModel

from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD

from scipy.sparse import csr_matrix
import pandas as pd
import numpy as np

import pyLDAvis
import pyLDAvis.gensim_models


================================================================================
📦  DATASET FETCH: 20 Newsgroups (scikit-learn)
================================================================================

▶️ PURPOSE
---------
Loads the full 20 Newsgroups dataset — a collection of approximately 18,000 
newsgroup posts across 20 topics. Ideal for tasks like text classification, 
topic modeling (e.g. LDA), and document clustering.

--------------------------------------------------------------------------------
🔧 PARAMETERS
--------------------------------------------------------------------------------

subset = 'all'
  ▪ Loads both the training and test datasets.
  ▪ Options: 'train', 'test', or 'all'.

shuffle = False
  ▪ Keeps document order as-is.
  ▪ Set to True for randomized document order.

random_state = 32
  ▪ Ensures consistent shuffling (if enabled).
  ▪ Use any integer for reproducibility.

remove = ('headers', 'footers', 'qutes')
  ▪ Removes metadata from emails to reduce noise.
  ▪ ⚠️ Typo Detected: 'qutes' should be 'quotes'.
     Incorrect spelling leads to quoted content **not** being removed.
     Valid options: 'headers', 'footers', 'quotes'

--------------------------------------------------------------------------------
🧠 WHY USE REMOVE?
--------------------------------------------------------------------------------
- headers: Removes the header block of each post (sender address, subject line, routing info).
- footers: Removes trailing signature blocks and disclaimers.
- quotes: Removes lines quoted from earlier posts (typically starting with '>').

This is crucial for **clean topic extraction** and to avoid dominant tokens 
that aren’t representative of true document content.

--------------------------------------------------------------------------------
📝 NOTES
--------------------------------------------------------------------------------
• This setup is typically the **first step** in any NLP pipeline.
• Cleaning irrelevant text leads to more interpretable topic clusters.
• Misspelled entries in `remove` will be silently ignored — always verify.

================================================================================
✅ STATUS CHECK
================================================================================
Typo present in 'remove' → replace 'qutes' with 'quotes' for proper execution.
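
Because misspelled entries are silently ignored, a small guard can catch them
before the download call. The following is only a minimal sketch: `VALID_REMOVE`
and `check_remove` are hypothetical helper names introduced here, not part of
scikit-learn.

```python
# Hypothetical helper (not a scikit-learn API): validate the `remove` tuple
VALID_REMOVE = {"headers", "footers", "quotes"}

def check_remove(remove):
    """Raise ValueError if `remove` contains an entry that would be ignored."""
    unknown = sorted(set(remove) - VALID_REMOVE)
    if unknown:
        raise ValueError(
            f"Unknown remove option(s): {unknown}; valid: {sorted(VALID_REMOVE)}"
        )
    return tuple(remove)

try:
    check_remove(("headers", "footers", "qutes"))
except ValueError as err:
    print(err)  # the message names the misspelled entry
```

The validated tuple could then be passed on as the `remove` argument.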


In [2]:
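# 'qutes' is a typo for 'quotes': it is silently ignored, so the outputs below still contain quoted text
dataset = fetch_20newsgroups(subset='all', shuffle=False, random_state=32, remove=('headers', 'footers', 'qutes'))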

In [3]:
dataset.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [4]:
print(dataset.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [5]:
len(dataset.target_names)

20

In [6]:
dataset_df = pd.DataFrame({'News':dataset.data, 'Label' : dataset.target})

In [7]:
dataset_df.head()

Unnamed: 0,News,Label
0,gajarsky@pilot.njin.net writes:\n\nmorgan and ...,9
1,"Well, I just got my Centris 610 yesterday. It...",4
2,Archive-name: cryptography-faq/part10\nLast-mo...,11
3,> ATTENTION: Mac Quadra owners: Many storage i...,4
4,bobbe@vice.ICO.TEK.COM (Robert Beauchaine) wri...,0


================================================================================
🧾 OPERATION DESCRIPTION: Label Mapping to Class Names
================================================================================

▶️ Purpose:
Converts numeric category labels into their corresponding string class names
from the 20 Newsgroups dataset. Adds a new column called 'Label_name' to the 
existing DataFrame.

--------------------------------------------------------------------------------
🔁 How It Works:
- Takes each numeric label from the 'Label' column.
- Uses it as an index to retrieve the corresponding class name from 
  'dataset.target_names'.
- Applies this mapping row-by-row using a lambda function.
- The result is stored in a new column 'Label_name'.

--------------------------------------------------------------------------------
🎯 Use Case:
Essential for interpretability — numeric labels like `4`, `7`, or `15` are 
not informative for analysis or visualization. Mapping them to human-readable 
topic names like `'comp.sys.ibm.pc.hardware'` or `'talk.politics.mideast'` 
makes downstream tasks like grouping, filtering, or topic modeling easier to 
understand.

--------------------------------------------------------------------------------
📝 Notes:
- Assumes that 'dataset_df' already contains a 'Label' column with valid 
  integer indices matching the order in 'dataset.target_names'.
- If labels are misaligned or contain out-of-bound indices, an IndexError 
  may occur.
================================================================================
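
The index-based mapping described above can be sketched in plain Python. One
detail worth knowing: Python lists accept negative indices, so a label of -1
would not raise an IndexError but would silently map to the last class. The
helper name `map_labels` and the shortened `target_names` list below are
illustrative assumptions.

```python
def map_labels(labels, target_names):
    """Return the class name for each integer label, rejecting invalid indices."""
    names = []
    for label in labels:
        # Lists accept negative indices, so check the range explicitly
        if not 0 <= label < len(target_names):
            raise IndexError(f"label {label} is outside 0..{len(target_names) - 1}")
        names.append(target_names[label])
    return names

target_names = ['alt.atheism', 'comp.graphics', 'sci.crypt']  # shortened toy list
print(map_labels([2, 0, 1], target_names))
# ['sci.crypt', 'alt.atheism', 'comp.graphics']
```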


In [8]:
dataset_df['Label_name'] = dataset_df['Label'].apply(lambda x: dataset.target_names[x])

In [9]:
dataset_df.head()

Unnamed: 0,News,Label,Label_name
0,gajarsky@pilot.njin.net writes:\n\nmorgan and ...,9,rec.sport.baseball
1,"Well, I just got my Centris 610 yesterday. It...",4,comp.sys.mac.hardware
2,Archive-name: cryptography-faq/part10\nLast-mo...,11,sci.crypt
3,> ATTENTION: Mac Quadra owners: Many storage i...,4,comp.sys.mac.hardware
4,bobbe@vice.ICO.TEK.COM (Robert Beauchaine) wri...,0,alt.atheism


In [10]:
dataset_df['Clean_news'] = dataset_df['News'].apply(preprocess_string)

In [11]:
dataset_df.head()

Unnamed: 0,News,Label,Label_name,Clean_news
0,gajarsky@pilot.njin.net writes:\n\nmorgan and ...,9,rec.sport.baseball,"[gajarski, pilot, njin, net, write, morgan, gu..."
1,"Well, I just got my Centris 610 yesterday. It...",4,comp.sys.mac.hardware,"[got, centri, yesterdai, took, week, place, or..."
2,Archive-name: cryptography-faq/part10\nLast-mo...,11,sci.crypt,"[archiv, cryptographi, faq, modifi, faq, sci, ..."
3,> ATTENTION: Mac Quadra owners: Many storage i...,4,comp.sys.mac.hardware,"[attent, mac, quadra, owner, storag, industri,..."
4,bobbe@vice.ICO.TEK.COM (Robert Beauchaine) wri...,0,alt.atheism,"[bobb, vice, ico, tek, com, robert, beauchain,..."


================================================================================
🧹 TEXT CLEANING PROCESS: Applying Gensim Filters with preprocess_string()
================================================================================

▶️ Purpose:
Cleans raw news text by applying a custom sequence of text filters using 
Gensim’s `preprocess_string()` function. The result is stored in a new column 
called 'Clean_news1'.

--------------------------------------------------------------------------------
🧪 Filters Applied:
  1. Lowercasing:
     → Converts all text to lowercase to ensure consistency.

  2. strip_tags:
     → Removes any HTML or XML tags present in the text.

  3. strip_punctuation:
     → Eliminates punctuation marks (e.g., ., !, ?, :, etc.).

  4. strip_numeric:
     → Removes all digits (e.g., 2023, or the 100 in 100%).

  5. remove_stopwords:
     → Filters out common words like “the”, “is”, “and” that carry little 
        semantic weight in NLP tasks.

  6. strip_short:
     → Removes short tokens (default length < 3), which are often noise.

--------------------------------------------------------------------------------
📤 Output:
The resulting text is a cleaned list of meaningful words, stored under the 
column 'Clean_news1'. Each row contains a list of preprocessed tokens.

Example:
Raw:     "Breaking News: NASA launches rocket in 2023!"
Cleaned: ['breaking', 'news', 'nasa', 'launches', 'rocket']

--------------------------------------------------------------------------------
📌 Use Case:
This preprocessing step is critical for tasks like topic modeling, word 
embedding training, and classification. It helps reduce dimensionality and 
improves model performance by stripping irrelevant content.

--------------------------------------------------------------------------------
📝 Notes:
- Token output is in list format.
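- No stemming or lemmatization is applied here, unlike the default 
  `preprocess_string()` call in cell 10, whose filter list includes stemming 
  (hence tokens such as 'gajarski' and 'yesterdai' in 'Clean_news').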
- Can be extended with custom functions if needed.

================================================================================
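
To make the filter order concrete, here is a rough plain-Python approximation
of the chain described above. It is not Gensim's implementation: the regular
expressions are simplified, and the tiny `STOPWORDS` set is an assumption made
for illustration (Gensim ships a much longer list).

```python
import re
import string

STOPWORDS = {"the", "is", "and", "in", "a", "of", "to"}  # tiny illustrative list

def clean(text, minsize=3):
    """Approximate the six filters: lowercase, tags, punctuation, digits, stopwords, short words."""
    text = text.lower()
    text = re.sub(r"<[^>]*>", " ", text)                             # strip tags
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)   # strip punctuation
    text = re.sub(r"[0-9]+", "", text)                               # strip digits
    tokens = [t for t in text.split() if t not in STOPWORDS]         # remove stopwords
    return [t for t in tokens if len(t) >= minsize]                  # drop short tokens

print(clean("Breaking News: NASA launches rocket in 2023!"))
# ['breaking', 'news', 'nasa', 'launches', 'rocket']
```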


In [12]:
filters = [lambda x: x.lower(), strip_tags, strip_punctuation, strip_numeric, remove_stopwords, strip_short]
dataset_df['Clean_news1'] = dataset_df['News'].apply(lambda x: preprocess_string(x, filters))

In [13]:
dataset_df.head()

Unnamed: 0,News,Label,Label_name,Clean_news,Clean_news1
0,gajarsky@pilot.njin.net writes:\n\nmorgan and ...,9,rec.sport.baseball,"[gajarski, pilot, njin, net, write, morgan, gu...","[gajarsky, pilot, njin, net, writes, morgan, g..."
1,"Well, I just got my Centris 610 yesterday. It...",4,comp.sys.mac.hardware,"[got, centri, yesterdai, took, week, place, or...","[got, centris, yesterday, took, weeks, placing..."
2,Archive-name: cryptography-faq/part10\nLast-mo...,11,sci.crypt,"[archiv, cryptographi, faq, modifi, faq, sci, ...","[archive, cryptography, faq, modified, faq, sc..."
3,> ATTENTION: Mac Quadra owners: Many storage i...,4,comp.sys.mac.hardware,"[attent, mac, quadra, owner, storag, industri,...","[attention, mac, quadra, owners, storage, indu..."
4,bobbe@vice.ICO.TEK.COM (Robert Beauchaine) wri...,0,alt.atheism,"[bobb, vice, ico, tek, com, robert, beauchain,...","[bobbe, vice, ico, tek, com, robert, beauchain..."


In [14]:
dataset_dictionary = Dictionary(dataset_df['Clean_news1'])

In [15]:
len(dataset_dictionary)

96459

In [16]:
print(list(dataset_dictionary.token2id.items())[:10])

[('castillo', 0), ('cubs', 1), ('era', 2), ('gajarsky', 3), ('good', 4), ('guzman', 5), ('harkey', 6), ('hibbard', 7), ('higher', 8), ('idiots', 9)]


================================================================================
🧮 BAG-OF-WORDS VECTORIZATION: Creating Gensim Corpus
================================================================================

▶️ Purpose:
Transforms the tokenized and cleaned text in 'Clean_news1' into a corpus of 
Bag-of-Words (BoW) vectors using a Gensim dictionary. This prepares the data 
for topic modeling algorithms like LDA.

--------------------------------------------------------------------------------
📤 Input:
- dataset_df['Clean_news1']: 
  A column containing lists of preprocessed tokens for each document.

- dataset_dictionary:
  A Gensim Dictionary object mapping unique words to integer IDs.

--------------------------------------------------------------------------------
🔁 Process:
- Iterates through each list of tokens (each document).
- Converts the list into a BoW representation using `doc2bow()`.
- Each document is now represented as a sparse list of (word_id, count) tuples.

Example:
Original Tokens → ['computer', 'hardware', 'driver']
BoW Vector      → [(10, 1), (45, 1), (128, 1)]

This means:
- Word ID 10 appears once.
- Word ID 45 appears once.
- Word ID 128 appears once.

--------------------------------------------------------------------------------
📦 Output:
- `dataset_corpus_bow`: 
  A list of BoW vectors — one for each document.
  This list forms the input for LDA or other topic modeling techniques.

--------------------------------------------------------------------------------
📝 Notes:
- Only words that exist in `dataset_dictionary` will be included in the BoW.
- Words not present in the dictionary (e.g., due to filtering) are ignored.
- The resulting corpus is typically passed into models like `LdaModel`.

================================================================================
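
The dictionary and `doc2bow()` steps can be mimicked with the standard library.
This is a conceptual sketch, not Gensim's code; `build_token2id` and the
free-standing `doc2bow` function are hypothetical names. IDs are assigned in
sorted order within each document, which appears to match the alphabetical
run of IDs printed in cell 16.

```python
from collections import Counter

def build_token2id(docs):
    """Assign consecutive integer IDs to tokens, document by document."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow(doc, token2id):
    """Count known tokens and return sorted (word_id, count) pairs."""
    counts = Counter(t for t in doc if t in token2id)  # unknown tokens are ignored
    return sorted((token2id[t], n) for t, n in counts.items())

docs = [["computer", "hardware", "driver", "computer"], ["driver", "update"]]
token2id = build_token2id(docs)
print(doc2bow(docs[0], token2id))               # [(0, 2), (1, 1), (2, 1)]
print(doc2bow(["driver", "unknown"], token2id)) # [(1, 1)]
```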


In [17]:
# Create the corpus: one Bag-of-Words vector per document
dataset_corpus_bow = [dataset_dictionary.doc2bow(text) for text in dataset_df['Clean_news1']]

In [18]:
len(dataset_corpus_bow)

18846

In [19]:
print(dataset_corpus_bow[1])

[(22, 1), (23, 1), (24, 2), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 2), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 2), (44, 3), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 2), (52, 1), (53, 2), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 2), (66, 1), (67, 1), (68, 2), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 2), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 2), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1)]


================================================================================
📈 TF-IDF TRANSFORMATION: Converting BoW Corpus to Weighted Representation
================================================================================

▶️ Purpose:
Transforms the Bag-of-Words (BoW) corpus into a Term Frequency–Inverse Document 
Frequency (TF-IDF) representation. This helps highlight words that are important 
within a document but not overly common across all documents.

--------------------------------------------------------------------------------
🔁 Process Overview:

1. TfidfModel Initialization:
   - Scans through the BoW corpus (`dataset_corpus_bow`) and calculates the 
     inverse document frequency (IDF) for each term.
   - Builds a model that can convert BoW vectors to TF-IDF weighted vectors.

2. Transformation:
   - Applies the TF-IDF model to the BoW corpus.
   - Each document is transformed into a list of (term_id, tf-idf weight) tuples.
   - This representation emphasizes terms that are unique and downweights 
     commonly occurring ones.

--------------------------------------------------------------------------------
🎯 Why Use TF-IDF?
- Reduces the impact of high-frequency, low-value words.
- Enhances the significance of rare but meaningful terms.
- Particularly useful when feeding into topic modeling algorithms that are 
  sensitive to term frequency distributions (e.g., LSA).

--------------------------------------------------------------------------------
📦 Output:
- `dataset_corpus_tfidf`:
  A transformed version of the original BoW corpus where each document is 
  represented by its TF-IDF vector.

--------------------------------------------------------------------------------
📝 Notes:
- The original dictionary (`dataset_dictionary`) remains unchanged.
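- The output format is still accepted by Gensim topic models such as LSI and LDA.
  Note that LDA is formulated over word counts, so TF-IDF weights are a 
  heuristic input for it rather than something the model assumes.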
- Can be serialized for re-use or saved using Gensim’s corpus I/O utilities.

================================================================================
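
The two steps above (IDF estimation, then transformation) can be sketched in
plain Python. The weighting below, tf * log2(N / df) followed by L2
normalization of each document, is what Gensim's `TfidfModel` does with its
default settings as far as I understand them; treat it as an assumption and
check the library documentation before relying on exact values.
`tfidf_transform` is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_transform(corpus_bow):
    """Convert BoW vectors to L2-normalized TF-IDF vectors."""
    n_docs = len(corpus_bow)
    # Document frequency: number of documents each term ID appears in
    df = Counter(term_id for doc in corpus_bow for term_id, _ in doc)
    result = []
    for doc in corpus_bow:
        weights = [(tid, tf * math.log2(n_docs / df[tid])) for tid, tf in doc]
        norm = math.sqrt(sum(w * w for _, w in weights))
        # Terms occurring in every document get weight 0 and are dropped
        result.append([(tid, w / norm) for tid, w in weights if w > 0] if norm > 0 else [])
    return result

corpus = [[(0, 1), (1, 2)], [(0, 1), (2, 1)]]
print(tfidf_transform(corpus))  # term 0 occurs in both documents, so it is dropped
```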


In [20]:
tfidf = models.TfidfModel(dataset_corpus_bow)
dataset_corpus_tfidf = tfidf[dataset_corpus_bow]

In [21]:
len(dataset_corpus_tfidf)

18846

In [22]:
print(dataset_corpus_tfidf[1])

[(22, 0.12794312043780054), (23, 0.1032933602529823), (24, 0.1437906445046912), (25, 0.19446130648981633), (26, 0.09972437101248886), (27, 0.19446130648981633), (28, 0.056593976938038), (29, 0.09712742378308543), (30, 0.11391287593244794), (31, 0.08578843198010519), (32, 0.19446130648981633), (33, 0.10142837384435295), (34, 0.09687548915310225), (35, 0.10125120507490903), (36, 0.09133742977605598), (37, 0.061307837508357034), (38, 0.10891119725810347), (39, 0.06606330836365701), (40, 0.08855334717656915), (41, 0.07391023272465086), (42, 0.19446130648981633), (43, 0.11381832135728932), (44, 0.3110653713924378), (45, 0.11566217615948503), (46, 0.0969847936661337), (47, 0.0694228263237942), (48, 0.09383842074796034), (49, 0.09300779209433964), (50, 0.0455655162646588), (51, 0.10391838009790301), (52, 0.07696272067182856), (53, 0.10763740852872361), (54, 0.05222277834588352), (55, 0.05433687654103351), (56, 0.05404373403325169), (57, 0.15338363279035241), (58, 0.08979736303341844), (59, 0.

# Topic Modelling with Latent Dirichlet Allocation (LDA)

================================================================================
🧠 LDA TOPIC MODEL TRAINING (BOW): Gensim LdaModel Initialization
================================================================================

▶️ Purpose:
Trains a Latent Dirichlet Allocation (LDA) topic model using the Bag-of-Words 
representation of the document corpus. This model identifies abstract topics 
present across the collection of documents.

--------------------------------------------------------------------------------
⚙️ Parameters:

- dataset_corpus_bow:
  ▪ The input corpus where each document is represented as a list of 
    (word_id, frequency) pairs.
  ▪ Generated from Gensim's Dictionary and doc2bow process.

- num_topics=20:
  ▪ The model will extract 20 distinct topics.
  ▪ Each topic is a probability distribution over words.

- id2word=dataset_dictionary:
  ▪ A Gensim Dictionary mapping word IDs to actual words.
  ▪ Required to interpret topic output in human-readable form.

- random_state=42:
  ▪ Ensures reproducibility of the model output across runs.
  ▪ Controls the random initialization of topic distributions.

--------------------------------------------------------------------------------
📦 Output:
- lda_bow:
  ▪ A trained Gensim LdaModel object.
  ▪ Capable of:
     → Inferring topic distribution for new documents.
     → Displaying top words in each topic.
     → Calculating coherence and perplexity.
     → Visualizing topics using pyLDAvis.

--------------------------------------------------------------------------------
📌 Use Case:
- Ideal for exploring hidden thematic structures in large text datasets.
- Frequently used in NLP pipelines, content clustering, and document 
  classification.

--------------------------------------------------------------------------------
📝 Notes:
- Input corpus must match the dictionary used for model initialization.
- The model assumes that documents are mixtures of multiple topics.
- Topics are multinomial distributions over the vocabulary.

================================================================================
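
The last two notes (documents are mixtures of topics; topics are distributions
over the vocabulary) combine into one formula: p(w | d) = sum over k of
theta[d][k] * phi[k][w]. The toy numbers below are made up for illustration,
and `word_probs` is a hypothetical helper, not a Gensim function.

```python
def word_probs(theta, phi):
    """Word distribution of one document: p(w | d) = sum_k theta[k] * phi[k][w]."""
    vocab = phi[0].keys()
    return {w: sum(t * topic[w] for t, topic in zip(theta, phi)) for w in vocab}

# Two toy topics over a three-word vocabulary (each sums to 1)
phi = [
    {"god": 0.7, "key": 0.1, "image": 0.2},
    {"god": 0.1, "key": 0.8, "image": 0.1},
]
theta = [0.25, 0.75]  # the document is 25% topic 0 and 75% topic 1

probs = word_probs(theta, phi)
print(probs)                # "key" dominates because topic 1 dominates
print(sum(probs.values()))  # a valid distribution: sums to 1 up to rounding
```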


In [23]:

lda_bow = LdaModel(dataset_corpus_bow,num_topics=20,id2word=dataset_dictionary,random_state=42)

In [24]:
lda_topics_bow = lda_bow.print_topics(num_words=8)
for topic in lda_topics_bow:
  print(topic)

(0, '0.021*"edu" + 0.019*"writes" + 0.015*"article" + 0.006*"like" + 0.006*"appears" + 0.006*"com" + 0.005*"new" + 0.004*"think"')
(1, '0.018*"file" + 0.007*"khz" + 0.007*"pov" + 0.006*"gems" + 0.006*"livesey" + 0.006*"bibliography" + 0.005*"incoming" + 0.005*"writes"')
(2, '0.015*"writes" + 0.014*"article" + 0.010*"edu" + 0.008*"like" + 0.007*"know" + 0.007*"com" + 0.005*"time" + 0.004*"good"')
(3, '0.018*"god" + 0.010*"people" + 0.007*"believe" + 0.007*"think" + 0.006*"writes" + 0.006*"know" + 0.005*"jesus" + 0.005*"article"')
(4, '0.014*"key" + 0.013*"government" + 0.009*"chip" + 0.009*"clipper" + 0.009*"encryption" + 0.007*"keys" + 0.007*"use" + 0.007*"security"')
(5, '0.008*"health" + 0.006*"medical" + 0.006*"care" + 0.005*"think" + 0.005*"disease" + 0.005*"like" + 0.005*"money" + 0.004*"drug"')
(6, '0.014*"image" + 0.012*"file" + 0.011*"edu" + 0.011*"graphics" + 0.009*"ftp" + 0.008*"available" + 0.007*"use" + 0.007*"jpeg"')
(7, '0.019*"israel" + 0.013*"jewish" + 0.012*"jews" + 0.

In [25]:
# No random_state is passed here, so the TF-IDF topics will differ between runs
lda_tfidf = LdaModel(dataset_corpus_tfidf, id2word=dataset_dictionary, num_topics=20)

In [26]:
lda_topics_tfidf = lda_tfidf.print_topics(num_words=8)
for topic in lda_topics_tfidf:
  print(topic)

(0, '0.002*"infante" + 0.001*"idle" + 0.001*"ulf" + 0.001*"diamond" + 0.001*"twin" + 0.001*"cruel" + 0.001*"hawk" + 0.001*"ranger"')
(1, '0.003*"font" + 0.003*"fonts" + 0.002*"livesey" + 0.002*"siggraph" + 0.002*"bradley" + 0.002*"timer" + 0.002*"lee" + 0.002*"pluto"')
(2, '0.003*"espn" + 0.003*"mhz" + 0.003*"sky" + 0.003*"baseball" + 0.002*"motherboard" + 0.002*"cpu" + 0.002*"bios" + 0.002*"pin"')
(3, '0.002*"gant" + 0.002*"eliot" + 0.002*"tires" + 0.002*"icon" + 0.002*"nixon" + 0.002*"mormon" + 0.002*"strike" + 0.001*"nissan"')
(4, '0.002*"fluid" + 0.002*"cds" + 0.001*"ics" + 0.001*"sco" + 0.001*"cylinder" + 0.001*"novell" + 0.001*"autodesk" + 0.001*"ericsson"')
(5, '0.005*"hst" + 0.002*"bmp" + 0.002*"iisi" + 0.002*"ranck" + 0.002*"array" + 0.001*"pds" + 0.001*"baud" + 0.001*"bony"')
(6, '0.002*"zionism" + 0.002*"ads" + 0.002*"laserwriter" + 0.001*"map" + 0.001*"pregnancy" + 0.001*"hart" + 0.001*"quicktime" + 0.001*"cbr"')
(7, '0.003*"subscribe" + 0.001*"mask" + 0.001*"hernandez" + 0

# Topic Modelling with Latent Semantic Analysis/Indexing (LSA/LSI)

In [27]:
# Convert Gensim BoW corpus to sparse matrix
data, rows, cols = [], [], []
for doc_idx, doc in enumerate(dataset_corpus_bow):
    for term_idx, value in doc:
        rows.append(doc_idx)
        cols.append(term_idx)
        data.append(value)
sparse_matrix = csr_matrix((data, (rows, cols)), shape=(len(dataset_corpus_bow), len(dataset_dictionary)))

# Train TruncatedSVD (LSI equivalent)
svd = TruncatedSVD(n_components=20, random_state=42)
svd.fit(sparse_matrix)

# Print top words for each topic
terms = np.array(list(dataset_dictionary.values()))
for topic_idx, topic in enumerate(svd.components_):
    top_indices = topic.argsort()[-8:][::-1]
    top_words = terms[top_indices]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")

Topic 0: max, giz, bhj, qax, biz, nrhj, bxn, nuy
Topic 1: jpeg, file, edu, image, dos, use, available, ftp
Topic 2: dos, windows, microsoft, tcp, mouse, amiga, software, higher
Topic 3: jpeg, image, file, gif, images, format, color, quality
Topic 4: jpeg, dos, gif, people, god, said, know, color
Topic 5: god, jehovah, lord, elohim, christ, jesus, father, mcconkie
Topic 6: file, gun, jehovah, god, output, control, congress, lord
Topic 7: use, wire, wiring, window, subject, new, like, space
Topic 8: new, hockey, space, league, team, nhl, season, mail
Topic 9: edu, people, file, com, armenians, went, said, armenian
Topic 10: hockey, team, league, edu, nhl, stephanopoulos, season, year
Topic 11: image, space, data, earth, planet, program, venus, spacecraft
Topic 12: edu, disk, drive, scsi, writes, hard, article, drives
Topic 13: disk, drive, scsi, hard, drives, said, hockey, jehovah
Topic 14: output, entry, program, rules, oname, contest, build, eof
Topic 15: wire, wiring, ground, neutral,

================================================================================
📊 TOPIC MODELING PIPELINE: TruncatedSVD + pyLDAvis Visualization (LSI Approach)
================================================================================

▶️ Goal:
Perform Latent Semantic Indexing (LSI) using `TruncatedSVD` on a sparse 
Bag-of-Words matrix and visualize the resulting topics using `pyLDAvis`.

--------------------------------------------------------------------------------
🧮 STAGE 1: BoW to Sparse Matrix Conversion
- Iterates through each document and term pair in the BoW corpus.
- Converts the corpus into a compressed sparse row (CSR) matrix format.
- Rows represent documents; columns represent terms.
- Each cell contains the frequency of a term in a document.

--------------------------------------------------------------------------------
🔬 STAGE 2: TruncatedSVD for Topic Extraction
- Applies dimensionality reduction using `TruncatedSVD` (LSA/LSI equivalent).
- `n_components=20` extracts 20 latent topics.
- Produces:
  • `doc_topic`: Document-topic matrix.
  • `topic_term`: Topic-term matrix.

--------------------------------------------------------------------------------
🧽 STAGE 3: Normalization
- Applies non-negativity constraints by clipping all negative values to 0.
- Normalizes both `doc_topic` and `topic_term` matrices row-wise so that 
  each row sums to 1 (i.e., probability distribution format).
- Ensures the result is interpretable and usable for visualization.

--------------------------------------------------------------------------------
📦 STAGE 4: Metadata Preparation for pyLDAvis
- `term_frequency`: Total occurrence of each term across all documents.
- `doc_lengths`: Total number of words per document.
- `vocab`: List of terms (in dictionary order) used in the corpus.

--------------------------------------------------------------------------------
📈 STAGE 5: Visualization with pyLDAvis
- Constructs a `vis_data` object using the normalized matrices and metadata.
- Renders an interactive topic visualization; topic positions are laid out 
  with MMDS (metric multidimensional scaling).
- `enable_notebook(local=True)` ensures that visualization works inline 
  (especially important for environments like Kaggle or Jupyter).

--------------------------------------------------------------------------------
📝 Notes:
- Unlike LDA, TruncatedSVD is a linear algebra-based topic model and does 
  not assume probabilistic word distributions.
- LSI works well for dimensionality reduction and document similarity tasks, 
  but topics may be harder to interpret than in LDA.
- Negative values are expected in SVD; clipping them improves compatibility 
  with tools like pyLDAvis which expect non-negative distributions.

================================================================================
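
Stage 1 can be pictured without SciPy. A CSR matrix is stored as three flat
arrays: the non-zero values (`data`), their column indices (`indices`), and
row boundaries (`indptr`). The sketch below builds those arrays from a toy BoW
corpus; `bow_to_csr_arrays` is a hypothetical helper. SciPy's `csr_matrix`
also accepts such a `(data, indices, indptr)` triple, as an alternative to the
`(data, (rows, cols))` form used in the cell below.

```python
def bow_to_csr_arrays(corpus_bow):
    """Flatten a BoW corpus into the three arrays of the CSR layout."""
    data, indices, indptr = [], [], [0]
    for doc in corpus_bow:
        for term_id, count in doc:
            indices.append(term_id)  # column = term ID
            data.append(count)       # value = term frequency
        indptr.append(len(indices))  # where this document's row ends
    return data, indices, indptr

corpus = [[(0, 2), (3, 1)], [], [(1, 1)]]  # the second document is empty
data, indices, indptr = bow_to_csr_arrays(corpus)
print(data)     # [2, 1, 1]
print(indices)  # [0, 3, 1]
print(indptr)   # [0, 2, 2, 3]
```

An empty document shows up as two equal consecutive entries in `indptr`. Such
rows sum to zero, which is why the normalization stage has to handle them.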


In [36]:
# Convert Gensim BoW corpus to sparse matrix
data, rows, cols = [], [], []
for doc_idx, doc in enumerate(dataset_corpus_bow):
    for term_idx, value in doc:
        rows.append(doc_idx)
        cols.append(term_idx)
        data.append(value)
sparse_matrix = csr_matrix((data, (rows, cols)), shape=(len(dataset_corpus_bow), len(dataset_dictionary)))

# Train TruncatedSVD (LSI equivalent)
svd = TruncatedSVD(n_components=20, random_state=42)
doc_topic = svd.fit_transform(sparse_matrix)  # Document-topic matrix
topic_term = svd.components_  # Topic-term matrix

# Normalize topic-term distribution
topic_term = np.maximum(topic_term, 0)  # Clip negative values
topic_term /= topic_term.sum(axis=1, keepdims=True) + 1e-10  # Normalize to sum to 1

# Normalize document-topic distribution, handling zero-sum rows
doc_topic = np.maximum(doc_topic, 0)  # Clip negative values
row_sums = doc_topic.sum(axis=1, keepdims=True)
# Divide only where the row sum is positive (avoids divide-by-zero warnings);
# all-zero rows fall back to a uniform distribution
uniform = np.full_like(doc_topic, 1.0 / doc_topic.shape[1])
doc_topic = np.divide(doc_topic, row_sums, out=uniform, where=row_sums > 0)
# Ensure rows sum to 1
doc_topic /= doc_topic.sum(axis=1, keepdims=True) + 1e-10

# Prepare pyLDAvis data
term_frequency = np.array(sparse_matrix.sum(axis=0)).flatten()  # Term frequencies
doc_lengths = np.array(sparse_matrix.sum(axis=1)).flatten()  # Document lengths
vocab = list(dataset_dictionary.values())

# Create pyLDAvis visualization
vis_data = pyLDAvis.prepare(
    topic_term_dists=topic_term,
    doc_topic_dists=doc_topic,
    doc_lengths=doc_lengths,
    vocab=vocab,
    term_frequency=term_frequency,
    mds='mmds'  # Metric MDS for the topic layout (pyLDAvis defaults to 'pcoa')
)

# Ensure inline display in Kaggle
pyLDAvis.enable_notebook(local=True)  # Force local JavaScript rendering
pyLDAvis.display(vis_data)



# Topic Modelling Visualization with pyLDAvis

In [31]:
pyLDAvis.enable_notebook()

In [32]:
vis_bow = pyLDAvis.gensim_models.prepare(lda_bow, dataset_corpus_bow, dataset_dictionary)
vis_bow

In [33]:
vis_tfidf = pyLDAvis.gensim_models.prepare(lda_tfidf, dataset_corpus_tfidf, dataset_dictionary)
vis_tfidf

# Model evaluation for Topic Modelling

Topic coherence is a quantitative measure of topic quality: it captures how semantically similar a topic's top words are to one another and, by extension, how interpretable the topic is to humans. Coherence is computed by aggregating pairwise scores over the words w1, …, wn used to describe the topic. A coherence measure can be intrinsic (computed from the modeled corpus itself) or extrinsic (computed against an external reference corpus). For this session, two measures are computed with the coherence model in Gensim: u_mass (based on how often two words occur together in the same documents, with values between -14 and 14) and c_v (with values between 0 and 1).
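
To make the pairwise scoring concrete, here is a simplified plain-Python
sketch of a u_mass-style score for a single topic. It is not Gensim's
implementation: the smoothing constant `eps`, the averaging over pairs, and
the helper name `umass` are assumptions made for illustration. It also assumes
every topic word occurs in at least one document.

```python
import math
from itertools import combinations

def umass(topic_words, docs, eps=1.0):
    """Average of log((D(w_i, w_j) + eps) / D(w_j)) over ordered word pairs."""
    doc_sets = [set(d) for d in docs]

    def doc_freq(*words):
        # Number of documents containing all of the given words
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = []
    for j, i in combinations(range(len(topic_words)), 2):  # j < i: w_j is ranked higher
        w_j, w_i = topic_words[j], topic_words[i]
        scores.append(math.log((doc_freq(w_i, w_j) + eps) / doc_freq(w_j)))
    return sum(scores) / len(scores)

docs = [["god", "jesus"], ["god", "jesus", "key"], ["key", "chip"]]
print(umass(["god", "jesus"], docs))  # words that co-occur: higher score
print(umass(["god", "chip"], docs))   # words that never co-occur: lower score
```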

In [38]:
from gensim.models import CoherenceModel
cm_lda_bow_umass = CoherenceModel(model=lda_bow,texts=dataset_df['Clean_news1'], corpus=dataset_corpus_bow, coherence='u_mass')
cm_lda_bow_umass.get_coherence()

-4.161846954541558

In [39]:
# Build a ranked word list per SVD topic from its topic-term weights.
# Note: CoherenceModel scores only the first `topn` words of each list (20 by default).
topics = []
for topic in topic_term:
    # Include all words with non-zero weights
    word_weights = [(vocab[idx], weight) for idx, weight in enumerate(topic) if weight > 0]
    # Sort by weight (descending) to prioritize important words
    word_weights.sort(key=lambda x: x[1], reverse=True)
    # Extract only the words (CoherenceModel expects a list of words)
    topic_words = [word for word, weight in word_weights]
    topics.append(topic_words)

# Compute u_mass coherence score
cm_svd_umass = CoherenceModel(
    topics=topics,
    texts=dataset_df['Clean_news1'],
    dictionary=dataset_dictionary,
    coherence='u_mass'
)
coherence_score = cm_svd_umass.get_coherence()
print(f"U_mass Coherence Score (All Words): {coherence_score}")

U_mass Coherence Score (All Words): -3.947714470235133


In [40]:
texts= dataset_df['Clean_news1']
texts = [x for x in texts if x]

In [41]:
cm_lda_bow_cv = CoherenceModel(model=lda_bow,texts=texts,dictionary=dataset_dictionary,coherence='c_v')
cm_lda_bow_cv.get_coherence()

0.5010949388755367

In [42]:
# Compute c_v coherence score
cm_svd_cv = CoherenceModel(
    topics=topics,
    texts=dataset_df['Clean_news1'],
    dictionary=dataset_dictionary,
    coherence='c_v'
)
coherence_score = cm_svd_cv.get_coherence()
print(f"C_v Coherence Score (All Words): {coherence_score}")

C_v Coherence Score (All Words): 0.49307350070929357
