# Topic Modeling

In today’s digital world, huge amounts of text are created every moment, from news and social media to customer feedback and research articles. `Topic modeling` is an `unsupervised NLP technique` that automatically discovers hidden themes, or topics, within a large collection of documents. This allows us to understand the main ideas in big text datasets without reading everything manually.

Topic modeling has a wide range of NLP use cases, such as document tagging, social media monitoring, and news categorization. Because it relies on unsupervised learning, it needs no human-annotated data, which makes it cost-effective. It answers the question "What is this text collection about?" by finding patterns in the collection and grouping words that frequently appear together into clusters, called *topics*. These topics then help us understand the main themes of the text data.

# Key Topic Modeling Techniques

1. **Latent Dirichlet Allocation (LDA)**
   - LDA is a probabilistic generative model for documents. It assumes:
     * Each document is a mixture of topics
     * Each topic is a probability distribution over words
   - It aims to find the distribution of topics in each document and the distribution of words in each topic.
   - Basically, it figures out:
     - Which topics belong to each document
     - Which words belong to each topic<br><br>
   - For example, if many documents contain "game", "score", and "team", these words might form a Sports topic.
2. **Non-Negative Matrix Factorization (NMF)**
   - NMF takes the document-term matrix (which records how often each word occurs in each document) and decomposes it into two smaller matrices whose entries are all non-negative.
   - ***In simpler terms, NMF is a math-based method that breaks down the document-word table into two smaller tables:***
     * One table shows topics and the words that describe them.
     * The other table shows how much each document belongs to each topic.<br><br>
3. **Latent Semantic Analysis/Latent Semantic Indexing (LSA/LSI)**
   - LSA is a technique that uses Singular Value Decomposition (SVD) to simplify the term-document matrix. <br>
     This process reduces the original high-dimensional data into a smaller, "latent semantic space." In this reduced space, the model identifies and captures the underlying relationships (co-occurrence patterns) between words and documents.<br>
     The result is that documents that are positioned close together in this new low-dimensional space are considered to be related because they share similar underlying topics or concepts.
   - You can think of it as:
        * Compressing the text data,
        * Keeping only the most important patterns,
        * Removing noise or unimportant details.
   - If two documents talk about similar ideas, LSA places them near each other even if they don’t share exact words.
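
The two LDA assumptions listed above can be read as a recipe for producing a document: pick a topic from the document's topic mixture, then pick a word from that topic's word distribution, and repeat. The sketch below walks through that recipe with the standard library only. The topic names, words, and probabilities are hypothetical values chosen by hand, and the sketch skips the step where LDA draws these distributions from Dirichlet priors, so treat it as an illustration of the assumption rather than an implementation of LDA.

```python
import random

# Hypothetical toy parameters, chosen by hand for illustration only.
topic_word = {
    "sports": {"game": 0.4, "score": 0.3, "team": 0.3},
    "food": {"pizza": 0.4, "cheese": 0.3, "tomato": 0.3},
}

def generate_document(doc_topic, n_words, rng):
    """Sample words the way LDA assumes documents are produced."""
    topics = list(doc_topic)
    words = []
    for _ in range(n_words):
        # 1. Pick a topic according to the document's topic mixture.
        topic = rng.choices(topics, weights=[doc_topic[t] for t in topics])[0]
        # 2. Pick a word according to that topic's word distribution.
        vocab = list(topic_word[topic])
        word = rng.choices(vocab, weights=[topic_word[topic][w] for w in vocab])[0]
        words.append(word)
    return words

rng = random.Random(42)
mostly_sports = generate_document({"sports": 0.9, "food": 0.1}, n_words=8, rng=rng)
only_sports = generate_document({"sports": 1.0, "food": 0.0}, n_words=5, rng=rng)
print(mostly_sports)
print(only_sports)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and word distributions that most plausibly produced them.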

# Use Cases of Topic Modeling

1. <b><u>Customer Feedback Analysis</u></b>
    - Companies receive thousands of reviews, survey responses, and complaints.<br>
    Topic modeling helps group them into themes such as delivery issues, product quality, pricing concerns, etc., allowing businesses to understand customer pain points quickly.<br><br>

2. <b><u>Social Media Monitoring</u></b>
    - Brands can analyze tweets, posts, or comments to detect trending topics — for example, what users are saying about a product launch or a public event.<br><br>

3. <b><u>Research & Academic Paper Analysis</u></b>
    - Researchers can process thousands of scientific papers and group them into topics such as machine learning, climate change, genomics, etc., making literature review much easier.<br><br>

4. <b><u>News Categorization</u></b>
    - News organizations can automatically organize articles into topics like politics, sports, technology, or entertainment without manual labeling.<br><br>

5. <b><u>Improving Search Engines</u></b>
    - Search engines can identify the themes in documents, improving search relevance by showing results based on topics instead of just keywords.<br><br>

6. <b><u>Document Summarization</u></b>
    - Topic modeling can highlight the main themes of long reports, making it easier to summarize and understand the content.<br><br>

7. <b><u>Recommender Systems</u></b>
    - Platforms like YouTube, Netflix, or e-commerce stores can recommend items based on topics extracted from user behavior or descriptions of products.<br><br>

8. <b><u>Fraud Detection in Text</u></b>
    - Banks and insurance companies can detect unusual patterns or topics in large collections of written claims, reports, or emails.<br><br>

9. <b><u>Healthcare & Clinical Notes Analysis</u></b>
    - Hospitals can analyze doctors' notes to identify recurring themes like symptoms, diagnoses, or medications to improve patient care.<br><br>

10. <b><u>Legal Document Organization</u></b>
    - Law firms can scan thousands of case files and automatically cluster them by topics like contracts, intellectual property, or criminal cases.<br><br>

In [6]:
import pandas as pd
import numpy as np

import re
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from autocorrect import Speller
import contractions

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF, TruncatedSVD
from gensim.models import LdaModel
from gensim import corpora

import warnings
warnings.filterwarnings('ignore')



In [7]:
docs = [
    "Machine learning algorithms use statistical methods to improve performance through experience",
    "Deep learning utilizes neural networks with multiple layers for complex pattern recognition",
    "Predictive analytics uses historical data to forecast future outcomes and trends",
    "Natural language processing enables computers to understand and process human languages",
    "Supervised learning trains models using labeled datasets with known outcomes",
    "Unsupervised learning discovers hidden patterns in data without predefined labels",
    "Regression analysis predicts continuous values based on input variables",
    "Classification algorithms categorize data into distinct classes or groups",
    "Clustering techniques group similar data points together based on features",
    "Data preprocessing involves cleaning and transforming raw data for analysis",
    "basketball players shoot hoops and score points in games",
    "football team throws passes and runs for touchdowns",
    "soccer players kick ball and score goals in matches",
    "pizza has cheese and tomato sauce with delicious crust",
    "burger with beef patty and lettuce tomato on bun",
    "tacos with meat and cheese and vegetables in shell"
]

for i,text in enumerate(docs):
    print(f'{i+1}. {text}')

1. Machine learning algorithms use statistical methods to improve performance through experience
2. Deep learning utilizes neural networks with multiple layers for complex pattern recognition
3. Predictive analytics uses historical data to forecast future outcomes and trends
4. Natural language processing enables computers to understand and process human languages
5. Supervised learning trains models using labeled datasets with known outcomes
6. Unsupervised learning discovers hidden patterns in data without predefined labels
7. Regression analysis predicts continuous values based on input variables
8. Classification algorithms categorize data into distinct classes or groups
9. Clustering techniques group similar data points together based on features
10. Data preprocessing involves cleaning and transforming raw data for analysis
11. basketball players shoot hoops and score points in games
12. football team throws passes and runs for touchdowns
13. soccer players kick ball and score go

In [8]:
def preprocess_text(text):
    # Lowercase and keep only letters and whitespace
    text = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    tokens = word_tokenize(text)
    # Drop English stop words and very short tokens
    sw = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in sw and len(word) > 2]
    # Lemmatize each token (noun lemmas by default, so "uses" becomes "us")
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Optional steps, disabled here:
    # tokens = [contractions.fix(word) for word in tokens]
    # spellcheck = Speller(lang='en')
    # tokens = [spellcheck(word) for word in tokens]
    return ' '.join(tokens)

new_doc = [preprocess_text(doc) for doc in docs]
new_doc

['machine learning algorithm use statistical method improve performance experience',
 'deep learning utilizes neural network multiple layer complex pattern recognition',
 'predictive analytics us historical data forecast future outcome trend',
 'natural language processing enables computer understand process human language',
 'supervised learning train model using labeled datasets known outcome',
 'unsupervised learning discovers hidden pattern data without predefined label',
 'regression analysis predicts continuous value based input variable',
 'classification algorithm categorize data distinct class group',
 'clustering technique group similar data point together based feature',
 'data preprocessing involves cleaning transforming raw data analysis',
 'basketball player shoot hoop score point game',
 'football team throw pass run touchdown',
 'soccer player kick ball score goal match',
 'pizza cheese tomato sauce delicious crust',
 'burger beef patty lettuce tomato bun',
 'taco meat 

# Using LDA

## Sklearn

In [11]:
cv = CountVectorizer(max_features=20, stop_words='english')
doc_term_matrix = cv.fit_transform(new_doc)

feature_names = cv.get_feature_names_out()
feature_names

array(['algorithm', 'analysis', 'based', 'cheese', 'data', 'group',
       'language', 'learning', 'outcome', 'pattern', 'player', 'point',
       'predefined', 'predictive', 'predicts', 'preprocessing', 'process',
       'score', 'tomato', 'unsupervised'], dtype=object)
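
To make the document-term matrix less abstract, here is a minimal hand-rolled version that uses only the standard library on three of the preprocessed documents. It is a sketch of the idea, not of `CountVectorizer` itself, which returns a sparse matrix and, with `max_features=20`, keeps only the most frequent terms in the corpus. The names `sample_docs`, `vocabulary`, and `matrix` are introduced here for illustration.

```python
from collections import Counter

# Three preprocessed documents taken from the output above.
sample_docs = [
    "basketball player shoot hoop score point game",
    "soccer player kick ball score goal match",
    "pizza cheese tomato sauce delicious crust",
]

tokenized = [doc.split() for doc in sample_docs]
# One column per distinct word, in alphabetical order.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per document: how often each vocabulary word occurs in it.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocabulary])

player_col = vocabulary.index("player")
print([row[player_col] for row in matrix])  # [1, 1, 0]
```

The column for "player" is non-zero only in the two sports documents, and this kind of co-occurrence pattern is what LDA, NMF, and LSA work from.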

In [12]:
lda = LatentDirichletAllocation(
    n_components=3,          # Number of topics
    random_state=42,         # For reproducible results
    max_iter=15              # Number of iterations
)

lda.fit(doc_term_matrix)

In [13]:
for topic_idx, topic in enumerate(lda.components_):
    top_words_idx = topic.argsort()[-5:][::-1]  # Get top 5 words
    top_words = [feature_names[i] for i in top_words_idx]
    print(f"Topic {topic_idx + 1}: {', '.join(top_words)}")

Topic 1: data, group, algorithm, outcome, learning
Topic 2: learning, score, player, pattern, cheese
Topic 3: language, tomato, based, analysis, process


In [14]:
doc_topic_lda = lda.transform(doc_term_matrix)
df_lda = pd.DataFrame(
    doc_topic_lda.round(3),
    columns=[f'Topic {i+1}' for i in range(3)],
    index=[f'Doc {i+1}' for i in range(len(new_doc))]
)
df_lda.T

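| | Doc 1 | Doc 2 | Doc 3 | Doc 4 | Doc 5 | Doc 6 | Doc 7 | Doc 8 | Doc 9 | Doc 10 | Doc 11 | Doc 12 | Doc 13 | Doc 14 | Doc 15 | Doc 16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Topic 1 | 0.762 | 0.119 | 0.832 | 0.084 | 0.762 | 0.085 | 0.089 | 0.832 | 0.857 | 0.86 | 0.087 | 0.333 | 0.111 | 0.112 | 0.167 | 0.167 |
| Topic 2 | 0.126 | 0.769 | 0.084 | 0.084 | 0.126 | 0.859 | 0.084 | 0.084 | 0.071 | 0.068 | 0.829 | 0.333 | 0.777 | 0.43 | 0.167 | 0.665 |
| Topic 3 | 0.112 | 0.112 | 0.084 | 0.833 | 0.112 | 0.056 | 0.827 | 0.084 | 0.072 | 0.072 | 0.084 | 0.333 | 0.112 | 0.459 | 0.666 | 0.168 |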
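
A common next step is to label each document with its dominant topic. The sketch below does this for three columns copied from the output above. The helper `dominant_topic` and its `min_margin` threshold are hypothetical and not part of scikit-learn. The threshold is there because a flat row such as Doc 12 (0.333 for every topic) carries no real signal. A plausible explanation is that none of Doc 12's words survived the `max_features=20` vocabulary, so the model had nothing to go on for that document.

```python
# Topic shares copied from the output above (Topic 1, Topic 2, Topic 3).
doc_topic_shares = {
    "Doc 1": [0.762, 0.126, 0.112],
    "Doc 11": [0.087, 0.829, 0.084],
    "Doc 12": [0.333, 0.333, 0.333],
}

def dominant_topic(shares, min_margin=0.1):
    """Return the 1-based index of the top topic, or None if no topic stands out."""
    ranked = sorted(range(len(shares)), key=lambda i: shares[i], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    # Refuse to label documents whose top two topics are nearly tied.
    if shares[best] - shares[runner_up] < min_margin:
        return None
    return best + 1

labels = {doc: dominant_topic(shares) for doc, shares in doc_topic_shares.items()}
print(labels)  # {'Doc 1': 1, 'Doc 11': 2, 'Doc 12': None}
```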


## gensim

In [16]:
def preprocess_text(text):
    # Same cleaning steps as before, but return a list of tokens,
    # which is the input format gensim expects
    text = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words and len(word) > 2]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Optional steps, disabled here:
    # tokens = [contractions.fix(word) for word in tokens]
    # spellcheck = Speller(lang='en')
    # tokens = [spellcheck(word) for word in tokens]
    return tokens

new_doc = [preprocess_text(doc) for doc in docs]
new_doc

[['machine',
  'learning',
  'algorithm',
  'use',
  'statistical',
  'method',
  'improve',
  'performance',
  'experience'],
 ['deep',
  'learning',
  'utilizes',
  'neural',
  'network',
  'multiple',
  'layer',
  'complex',
  'pattern',
  'recognition'],
 ['predictive',
  'analytics',
  'us',
  'historical',
  'data',
  'forecast',
  'future',
  'outcome',
  'trend'],
 ['natural',
  'language',
  'processing',
  'enables',
  'computer',
  'understand',
  'process',
  'human',
  'language'],
 ['supervised',
  'learning',
  'train',
  'model',
  'using',
  'labeled',
  'datasets',
  'known',
  'outcome'],
 ['unsupervised',
  'learning',
  'discovers',
  'hidden',
  'pattern',
  'data',
  'without',
  'predefined',
  'label'],
 ['regression',
  'analysis',
  'predicts',
  'continuous',
  'value',
  'based',
  'input',
  'variable'],
 ['classification',
  'algorithm',
  'categorize',
  'data',
  'distinct',
  'class',
  'group'],
 ['clustering',
  'technique',
  'group',
  'similar',
 

In [17]:
dictionary = corpora.Dictionary(new_doc)
corpus = [dictionary.doc2bow(doc) for doc in new_doc]

In [51]:
lda_gensim = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=5,
    random_state=42,
    alpha='auto',      # Let model learn topic distribution
    eta='auto',        # Let model learn word distribution
    passes=10,         # Number of passes through corpus
    per_word_topics=True
)

In [53]:
for idx, topic in lda_gensim.print_topics(-1, num_words=5):
    print(f"Topic {idx + 1}: {topic}")

Topic 1: 0.043*"hoop" + 0.043*"basketball" + 0.043*"game" + 0.043*"point" + 0.043*"score"
Topic 2: 0.024*"trend" + 0.024*"score" + 0.024*"player" + 0.024*"forecast" + 0.024*"us"
Topic 3: 0.083*"data" + 0.035*"learning" + 0.035*"group" + 0.019*"pattern" + 0.019*"point"
Topic 4: 0.045*"learning" + 0.045*"language" + 0.025*"statistical" + 0.025*"performance" + 0.025*"use"
Topic 5: 0.058*"tomato" + 0.058*"cheese" + 0.032*"delicious" + 0.032*"lettuce" + 0.032*"burger"


In [55]:
doc_topic_distributions = []
for i, doc_bow in enumerate(corpus):
    doc_topics = lda_gensim.get_document_topics(doc_bow, minimum_probability=0)
    topic_probs = [prob for _, prob in doc_topics]
    doc_topic_distributions.append(topic_probs)
    # print(f"Doc {i+1}: {[f'{p:.3f}' for p in topic_probs]}")

df_gensim = pd.DataFrame(
    doc_topic_distributions,
    columns=[f'Topic {i+1}' for i in range(5)],
    index=[f'Doc {i+1}' for i in range(len(new_doc))]
)
df_gensim.round(3)

| | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
| --- | --- | --- | --- | --- | --- |
| Doc 1 | 0.005 | 0.008 | 0.01 | 0.97 | 0.007 |
| Doc 2 | 0.004 | 0.008 | 0.009 | 0.973 | 0.006 |
| Doc 3 | 0.005 | 0.971 | 0.01 | 0.007 | 0.007 |
| Doc 4 | 0.005 | 0.008 | 0.01 | 0.97 | 0.007 |
| Doc 5 | 0.005 | 0.008 | 0.973 | 0.007 | 0.007 |
| Doc 6 | 0.005 | 0.008 | 0.973 | 0.007 | 0.007 |
| Doc 7 | 0.005 | 0.968 | 0.011 | 0.008 | 0.008 |
| Doc 8 | 0.006 | 0.011 | 0.965 | 0.009 | 0.009 |
| Doc 9 | 0.005 | 0.008 | 0.973 | 0.007 | 0.007 |
| Doc 10 | 0.005 | 0.01 | 0.97 | 0.008 | 0.008 |


# Using NMF

In [58]:
processed_docs_text = [' '.join(tokens) for tokens in new_doc]
processed_docs_text

['machine learning algorithm use statistical method improve performance experience',
 'deep learning utilizes neural network multiple layer complex pattern recognition',
 'predictive analytics us historical data forecast future outcome trend',
 'natural language processing enables computer understand process human language',
 'supervised learning train model using labeled datasets known outcome',
 'unsupervised learning discovers hidden pattern data without predefined label',
 'regression analysis predicts continuous value based input variable',
 'classification algorithm categorize data distinct class group',
 'clustering technique group similar data point together based feature',
 'data preprocessing involves cleaning transforming raw data analysis',
 'basketball player shoot hoop score point game',
 'football team throw pass run touchdown',
 'soccer player kick ball score goal match',
 'pizza cheese tomato sauce delicious crust',
 'burger beef patty lettuce tomato bun',
 'taco meat 

In [60]:
tfidf_vectorizer = TfidfVectorizer(max_features=20, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(processed_docs_text)
tfidf_features = tfidf_vectorizer.get_feature_names_out()

nmf = NMF(
    n_components=3,
    random_state=42,
    max_iter=200
)

nmf.fit(tfidf_matrix)

In [62]:
for topic_idx, topic in enumerate(nmf.components_):
    top_words_idx = topic.argsort()[-5:][::-1]
    top_words = [tfidf_features[i] for i in top_words_idx]
    print(f"Topic {topic_idx + 1}: {', '.join(top_words)}")

doc_topic_nmf = nmf.transform(tfidf_matrix)
df_nmf = pd.DataFrame(
    doc_topic_nmf.round(3),
    columns=[f'Topic {i+1}' for i in range(3)],
    index=[f'Doc {i+1}' for i in range(len(new_doc))]
)
df_nmf

Topic 1: learning, data, algorithm, pattern, outcome
Topic 2: cheese, tomato, based, analysis, predicts
Topic 3: score, player, point, based, group


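| | Topic 1 | Topic 2 | Topic 3 |
| --- | --- | --- | --- |
| Doc 1 | 0.423 | 0.0 | 0.0 |
| Doc 2 | 0.405 | 0.0 | 0.0 |
| Doc 3 | 0.32 | 0.0 | 0.0 |
| Doc 4 | 0.0 | 0.0 | 0.0 |
| Doc 5 | 0.4 | 0.0 | 0.0 |
| Doc 6 | 0.443 | 0.0 | 0.0 |
| Doc 7 | 0.097 | 0.0 | 0.043 |
| Doc 8 | 0.389 | 0.0 | 0.017 |
| Doc 9 | 0.258 | 0.0 | 0.229 |
| Doc 10 | 0.294 | 0.0 | 0.0 |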
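
Unlike LDA's output, NMF weights are not probabilities: the rows do not sum to 1, and a row can be entirely zero. To compare them with LDA's topic shares, one option is to rescale each row so it sums to 1. The sketch below does that for three rows copied from the output above. The helper `to_shares` is hypothetical, and returning `None` for an all-zero row is a design choice made here, not something scikit-learn does.

```python
# NMF weights copied from the output above (Topic 1, Topic 2, Topic 3).
nmf_weights = {
    "Doc 1": [0.423, 0.0, 0.0],
    "Doc 4": [0.0, 0.0, 0.0],
    "Doc 9": [0.258, 0.0, 0.229],
}

def to_shares(weights):
    """Scale non-negative weights so they sum to 1; return None for all-zero rows."""
    total = sum(weights)
    if total == 0:
        return None  # the three topics captured nothing from this document
    return [round(w / total, 3) for w in weights]

shares = {doc: to_shares(w) for doc, w in nmf_weights.items()}
print(shares)
```

After rescaling, Doc 9 splits roughly 53% to 47% between Topic 1 and Topic 3, while Doc 4 stays unassigned instead of getting an arbitrary label.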


# Using LSA

In [65]:
lsa = TruncatedSVD(
    n_components=3,
    random_state=42
)

lsa.fit(tfidf_matrix)

In [67]:
for topic_idx, topic in enumerate(lsa.components_):
    top_words_idx = topic.argsort()[-5:][::-1]
    top_words = [tfidf_features[i] for i in top_words_idx]
    print(f"Topic {topic_idx + 1}: {', '.join(top_words)}")

doc_topic_lsa = lsa.transform(tfidf_matrix)
df_lsa = pd.DataFrame(
    doc_topic_lsa.round(3),
    columns=[f'Topic {i+1}' for i in range(3)],
    index=[f'Doc {i+1}' for i in range(len(new_doc))]
)

df_lsa

Topic 1: learning, data, algorithm, pattern, outcome
Topic 2: tomato, cheese, based, learning, predictive
Topic 3: score, player, point, data, group


| | Topic 1 | Topic 2 | Topic 3 |
| --- | --- | --- | --- |
| Doc 1 | 0.603 | 0.0 | -0.253 |
| Doc 2 | 0.568 | -0.0 | -0.385 |
| Doc 3 | 0.473 | 0.0 | 0.028 |
| Doc 4 | 0.0 | -0.0 | 0.0 |
| Doc 5 | 0.563 | -0.0 | -0.318 |
| Doc 6 | 0.637 | -0.0 | -0.216 |
| Doc 7 | 0.172 | -0.0 | 0.226 |
| Doc 8 | 0.598 | -0.0 | 0.202 |
| Doc 9 | 0.472 | 0.0 | 0.5 |
| Doc 10 | 0.453 | 0.0 | 0.24 |
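
The negative values above are expected: LSA coordinates are positions along SVD directions, not non-negative weights, so what matters is how close documents sit to each other. The sketch below illustrates this with plain NumPy on a hypothetical five-document toy corpus (not the notebook's data). Doc 0 and doc 1 share no words, but both co-occur with doc 2, so they end up pointing the same way in the reduced space.

```python
import numpy as np

terms = ["car", "automobile", "pizza", "cheese"]
# Rows are hypothetical toy documents, columns follow `terms`.
X = np.array([
    [1, 0, 0, 0],  # doc 0: "car"
    [0, 1, 0, 0],  # doc 1: "automobile"
    [1, 1, 0, 0],  # doc 2: "car automobile"
    [0, 0, 1, 1],  # doc 3: "pizza cheese"
    [0, 0, 0, 1],  # doc 4: "cheese"
], dtype=float)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Full SVD, then keep the 2 strongest components (the "truncated" part of LSA).
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_coords = X @ Vt[:k].T  # each document as a point in the 2-D latent space

raw_similarity = cosine(X[0], X[1])                         # no shared words
latent_similarity = cosine(doc_coords[0], doc_coords[1])    # same latent direction
cross_topic_similarity = cosine(doc_coords[0], doc_coords[3])
print(raw_similarity, round(latent_similarity, 3), round(cross_topic_similarity, 3))
```

In the raw word space the two vehicle documents have similarity 0, while in the latent space their similarity is close to 1 and the vehicle and food documents stay close to orthogonal. The sign of each SVD direction is arbitrary, which is why cosine similarity is used here instead of reading the raw coordinates.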


# Comparisons of Models

In [70]:
print("\nTOPIC COMPARISON ACROSS METHODS:")
print("\nGensim LDA Topics:")
for idx, topic in lda_gensim.print_topics(-1, num_words=5):
    print(f"  Topic {idx + 1}: {topic}")

print("\nNMF Topics:")
for topic_idx, topic in enumerate(nmf.components_):
    top_words_idx = topic.argsort()[-5:][::-1]
    top_words = [tfidf_features[i] for i in top_words_idx]
    print(f"  Topic {topic_idx + 1}: {', '.join(top_words)}")

print("\nLSA Topics:")
for topic_idx, topic in enumerate(lsa.components_):
    top_words_idx = topic.argsort()[-5:][::-1]
    top_words = [tfidf_features[i] for i in top_words_idx]
    print(f"  Topic {topic_idx + 1}: {', '.join(top_words)}")


TOPIC COMPARISON ACROSS METHODS:

Gensim LDA Topics:
  Topic 1: 0.043*"hoop" + 0.043*"basketball" + 0.043*"game" + 0.043*"point" + 0.043*"score"
  Topic 2: 0.024*"trend" + 0.024*"score" + 0.024*"player" + 0.024*"forecast" + 0.024*"us"
  Topic 3: 0.083*"data" + 0.035*"learning" + 0.035*"group" + 0.019*"pattern" + 0.019*"point"
  Topic 4: 0.045*"learning" + 0.045*"language" + 0.025*"statistical" + 0.025*"performance" + 0.025*"use"
  Topic 5: 0.058*"tomato" + 0.058*"cheese" + 0.032*"delicious" + 0.032*"lettuce" + 0.032*"burger"

NMF Topics:
  Topic 1: learning, data, algorithm, pattern, outcome
  Topic 2: cheese, tomato, based, analysis, predicts
  Topic 3: score, player, point, based, group

LSA Topics:
  Topic 1: learning, data, algorithm, pattern, outcome
  Topic 2: tomato, cheese, based, learning, predictive
  Topic 3: score, player, point, data, group


In [30]:
print("""
1. GENSIM LDA:
   - Probabilistic model using Bayesian inference
   - Shows topic probabilities for each document
   - Provides coherence scores for evaluation
   - Better for understanding uncertainty

2. SCIKIT-LEARN NMF:
   - Linear algebra approach with non-negativity constraints
   - Often produces clearer, more distinct topics
   - Works well with TF-IDF
   - Faster training than LDA

3. SCIKIT-LEARN LSA:
   - Uses Singular Value Decomposition (SVD)
   - Fastest method
   - Good for dimensionality reduction
   - Can produce negative values in components

PRACTICAL ADVICE:
- Use Gensim LDA when you need probabilistic interpretations
- Use NMF when you want clear, distinct topics quickly  
- Use LSA for fast dimensionality reduction
- Always preprocess your text data carefully
- Experiment with different numbers of topics
""")


1. GENSIM LDA:
   - Probabilistic model using Bayesian inference
   - Shows topic probabilities for each document
   - Provides coherence scores for evaluation
   - Better for understanding uncertainty

2. SCIKIT-LEARN NMF:
   - Linear algebra approach with non-negativity constraints
   - Often produces clearer, more distinct topics
   - Works well with TF-IDF
   - Faster training than LDA

3. SCIKIT-LEARN LSA:
   - Uses Singular Value Decomposition (SVD)
   - Fastest method
   - Good for dimensionality reduction
   - Can produce negative values in components

PRACTICAL ADVICE:
- Use Gensim LDA when you need probabilistic interpretations
- Use NMF when you want clear, distinct topics quickly  
- Use LSA for fast dimensionality reduction
- Always preprocess your text data carefully
- Experiment with different numbers of topics
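
One hedged way to act on the last piece of advice is to fit the same model with several topic counts and compare a score. The sketch below uses scikit-learn's `LatentDirichletAllocation.perplexity` on six of the preprocessed documents. The candidate values `(2, 3, 4)` and the name `perplexity_by_k` are assumptions made for illustration. Perplexity measured on the training data tends to favor more topics, so scoring a held-out set or using a topic-coherence measure is usually a better guide. Treat the numbers as a rough comparison only.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Six preprocessed documents taken from the notebook's earlier output.
sample_docs = [
    "machine learning algorithm use statistical method improve performance experience",
    "unsupervised learning discovers hidden pattern data without predefined label",
    "basketball player shoot hoop score point game",
    "soccer player kick ball score goal match",
    "pizza cheese tomato sauce delicious crust",
    "burger beef patty lettuce tomato bun",
]

counts = CountVectorizer().fit_transform(sample_docs)

perplexity_by_k = {}
for k in (2, 3, 4):  # candidate topic counts, chosen arbitrarily here
    model = LatentDirichletAllocation(n_components=k, random_state=42, max_iter=15)
    model.fit(counts)
    # Lower perplexity means the model finds the data less "surprising".
    perplexity_by_k[k] = model.perplexity(counts)

for k, value in perplexity_by_k.items():
    print(f"{k} topics: perplexity = {value:.1f}")
```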


