# What is Topic Modeling?

Topic modeling is an unsupervised machine learning technique for analyzing large volumes of text by automatically discovering latent themes, or topics, within a collection of documents. It extracts hidden semantic structure that would be difficult or time-consuming to identify manually.

## Why is Topic Modeling Important?

- **Handling Unstructured Data**: A major challenge in working with text data is that it’s often unlabeled and unstructured, making it unsuitable for traditional supervised learning approaches. Topic modeling provides a solution to this problem.
- **Scalability**: It allows for the analysis of massive text collections that would be impractical to process manually.
- **Insight Discovery**: Topic modeling can reveal unexpected patterns and relationships in the data, leading to new insights and research directions.
- **Document Organization**: It provides a way to automatically organize and summarize large document collections.
- **Content Recommendation**: Topic models can be used to build recommendation systems based on content similarity.

## Common Topic Modeling Algorithms

- **Latent Dirichlet Allocation (LDA)**: A generative probabilistic model and probably the most widely used topic modeling algorithm
- **Non-Negative Matrix Factorization (NMF)**: Often used for short texts
- **Latent Semantic Analysis (LSA)**: An older method based on singular value decomposition

## Text Analysis and Preprocessing with spaCy

Before we dive into topic modeling, it’s crucial to properly prepare our text data. In this section, we’ll use spaCy, an industrial-strength Natural Language Processing library, to clean and preprocess our text data.


In [1]:
# Importing Libraries
import os

import spacy 
from spacy import displacy

import gensim
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel, LsiModel, HdpModel

In [2]:
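# Gathering Data
# Locate the sample corpus that ships with gensim's test data
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
print(test_data_dir)

lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
print(lee_train_file)

# Read the whole corpus; the context manager closes the file handle
with open(lee_train_file) as f:
    text = f.read()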

c:\Users\itexp\anaconda3\Lib\site-packages\gensim\test\test_data
c:\Users\itexp\anaconda3\Lib\site-packages\gensim\test\test_data\lee_background.cor


In [4]:
# Textual Data Cleaning
nlp = spacy.load('en_core_web_sm')

In [5]:
# Stop words for newspaper corpus
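# "'s" is the possessive clitic that spaCy splits off as its own token
# (the original '\s' was an invalid escape sequence)
my_stop_words = ['say', "'s", 'mr', 'Mr', 'said', 'says', 'saying', 'today', 'be']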

for stopword in my_stop_words:
    lexeme = nlp.vocab[stopword]
    lexeme.is_stop = True
    
doc = nlp(text)



In [6]:
# Computational Linguistics

# Sample sentence
sent = nlp('Last Thursday, Manchester United defeated AC Milan at San Siro.')

### POS-Tagging — (Part Of Speech)

POS tagging identifies the grammatical parts of speech for each word in a sentence. This is crucial for understanding the role each word plays in the sentence structure.


In [7]:
# POS (Part-of-Speech) Tagging

for token in sent:
    print(token.text, token.pos_, token.tag_)

Last ADJ JJ
Thursday PROPN NNP
, PUNCT ,
Manchester PROPN NNP
United PROPN NNP
defeated VERB VBD
AC PROPN NNP
Milan PROPN NNP
at ADP IN
San PROPN NNP
Siro PROPN NNP
. PUNCT .


### NER-Tagging — (Named Entity Recognition)

NER identifies and classifies named entities (like persons, organizations, locations) in text. This is valuable for information extraction and can provide context for topic modeling.


In [8]:
# NER (Named Entity Recognition) Tagging

for token in sent:
    print(token.text, token.ent_type_)

Last DATE
Thursday DATE
, 
Manchester ORG
United ORG
defeated 
AC ORG
Milan ORG
at 
San GPE
Siro GPE
. 


In [9]:
for ent in sent.ents:
    print(ent.text, ent.label_)

Last Thursday DATE
Manchester United ORG
AC Milan ORG
San Siro GPE


In [10]:
displacy.render(sent, style='ent', jupyter=True)

NER can be particularly useful when you want to focus on specific types of entities in your topic modeling or when you want to exclude certain entity types.
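As a minimal sketch of that idea, the snippet below filters tokens by entity type. To stay self-contained it hard-codes the `(token.text, token.ent_type_)` pairs printed above instead of re-running spaCy, and the helper names `drop_entity_types` and `keep_entity_types` are hypothetical, not part of spaCy.

```python
# (text, entity type) pairs copied from the token-level NER output above;
# an empty string means the token is not part of any entity.
tagged_tokens = [
    ("Last", "DATE"), ("Thursday", "DATE"), (",", ""),
    ("Manchester", "ORG"), ("United", "ORG"), ("defeated", ""),
    ("AC", "ORG"), ("Milan", "ORG"), ("at", ""),
    ("San", "GPE"), ("Siro", "GPE"), (".", ""),
]

def drop_entity_types(tokens, excluded):
    """Return token texts whose entity type is NOT in `excluded`."""
    return [text for text, ent_type in tokens if ent_type not in excluded]

def keep_entity_types(tokens, wanted):
    """Return token texts whose entity type IS in `wanted`."""
    return [text for text, ent_type in tokens if ent_type in wanted]

without_dates = drop_entity_types(tagged_tokens, {"DATE"})
orgs_only = keep_entity_types(tagged_tokens, {"ORG"})
print(without_dates)
print(orgs_only)
```

With a live spaCy `Doc`, the same filter would read `token.ent_type_` directly from each token.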


### Dependency Parsing

Dependency parsing analyzes the grammatical structure of a sentence, establishing relationships between “head” words and words that modify those heads.


In [11]:
# Dependency parsing

for chunk in sent.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)

Manchester United United nsubj defeated
AC Milan Milan dobj defeated
San Siro Siro pobj at


In [12]:
for token in sent:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
         [child for child in token.children])

Last amod Thursday PROPN []
Thursday npadvmod defeated VERB [Last]
, punct defeated VERB []
Manchester compound United PROPN []
United nsubj defeated VERB [Manchester]
defeated ROOT defeated VERB [Thursday, ,, United, Milan, at, .]
AC compound Milan PROPN []
Milan dobj defeated VERB [AC]
at prep defeated VERB [Siro]
San compound Siro PROPN []
Siro pobj at ADP [San]
. punct defeated VERB []


In [13]:
displacy.render(sent, style='dep', jupyter=True, options={'distance':90})

Dependency parsing can be useful for more advanced text analysis tasks, such as relation extraction or sentiment analysis that takes sentence structure into account.
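To make the relation-extraction idea concrete, here is a rough sketch that pulls a subject-verb-object triple out of the parse shown above. It works on hard-coded `(token.text, token.dep_, token.head.text)` rows copied from that output rather than on a live spaCy `Doc`, and it matches heads by their text, which is a simplification that would break if a word occurred twice in the sentence. The function names are hypothetical.

```python
# (token.text, token.dep_, token.head.text) rows copied from the output above.
parsed = [
    ("Last", "amod", "Thursday"), ("Thursday", "npadvmod", "defeated"),
    (",", "punct", "defeated"), ("Manchester", "compound", "United"),
    ("United", "nsubj", "defeated"), ("defeated", "ROOT", "defeated"),
    ("AC", "compound", "Milan"), ("Milan", "dobj", "defeated"),
    ("at", "prep", "defeated"), ("San", "compound", "Siro"),
    ("Siro", "pobj", "at"), (".", "punct", "defeated"),
]

def expand_compound(word, rows):
    """Prefix `word` with any 'compound' modifiers attached to it."""
    modifiers = [text for text, dep, head in rows
                 if dep == "compound" and head == word]
    return " ".join(modifiers + [word])

def subject_verb_object(rows):
    """Pair each verb's nominal subject with its direct object."""
    subjects = {head: text for text, dep, head in rows if dep == "nsubj"}
    objects = {head: text for text, dep, head in rows if dep == "dobj"}
    return [
        (expand_compound(subjects[verb], rows), verb,
         expand_compound(objects[verb], rows))
        for verb in subjects if verb in objects
    ]

relations = subject_verb_object(parsed)
print(relations)
```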

These linguistic features provided by spaCy offer a deeper understanding of text structure and meaning. While we won’t directly use all of these in our topic modeling process, understanding these concepts can help in interpreting results and developing more sophisticated NLP pipelines in the future.


In [14]:
# Data cleaning

texts, article = [], []

for word in doc:
    
    if word.text != '\n' and not word.is_stop and not word.is_punct\
                         and not word.like_num and word.text != 'I':
        article.append(word.lemma_)
        
    if word.text == '\n':
        texts.append(article)
        article = []
        
print(texts[0])

['hundred', 'people', 'force', 'vacate', 'home', 'Southern', 'Highlands', 'New', 'South', 'Wales', 'strong', 'wind', 'push', 'huge', 'bushfire', 'town', 'Hill', 'new', 'blaze', 'near', 'Goulburn', 'south', 'west', 'Sydney', 'force', 'closure', 'Hume', 'Highway', '4:00pm', 'AEDT', 'mark', 'deterioration', 'weather', 'storm', 'cell', 'move', 'east', 'Blue', 'Mountains', 'force', 'authority', 'decision', 'evacuate', 'people', 'home', 'outlying', 'street', 'Hill', 'New', 'South', 'Wales', 'southern', 'highland', 'estimated', 'resident', 'leave', 'home', 'nearby', 'Mittagong', 'New', 'South', 'Wales', 'Rural', 'Fire', 'Service', 'weather', 'condition', 'cause', 'fire', 'burn', 'finger', 'formation', 'ease', 'fire', 'unit', 'Hill', 'optimistic', 'defend', 'property', 'blaze', 'burn', 'New', 'Year', 'Eve', 'New', 'South', 'Wales', 'fire', 'crew', 'call', 'new', 'fire', 'Gunning', 'south', 'Goulburn', 'detail', 'available', 'stage', 'fire', 'authority', 'close', 'Hume', 'Highway', 'direction',

Sometimes, treating certain word pairs as single tokens can improve the interpretability of our topics. For example, “New York” is more meaningful as a single entity than “New” and “York” separately. We can use Gensim’s Phrases model to detect common bigrams:


In [16]:
# Creating Bigrams

bigram = gensim.models.phrases.Phrases(texts)
texts = [bigram[line] for line in texts]
print(texts[0])

['hundred', 'people', 'force', 'vacate', 'home', 'Southern', 'Highlands', 'New_South_Wales', 'strong', 'wind', 'push', 'huge', 'bushfire', 'town', 'Hill', 'new', 'blaze', 'near', 'Goulburn', 'south_west', 'Sydney', 'force', 'closure', 'Hume', 'Highway', '4:00pm', 'AEDT', 'mark', 'deterioration', 'weather', 'storm', 'cell', 'move', 'east', 'Blue_Mountains', 'force', 'authority', 'decision', 'evacuate', 'people', 'home', 'outlying', 'street', 'Hill', 'New_South_Wales', 'southern', 'highland', 'estimated', 'resident', 'leave', 'home', 'nearby', 'Mittagong', 'New_South_Wales', 'Rural_Fire_Service', 'weather_condition', 'cause', 'fire_burn', 'finger', 'formation', 'ease', 'fire', 'unit', 'Hill', 'optimistic', 'defend', 'property', 'blaze', 'burn', 'New', 'Year', 'Eve', 'New_South_Wales', 'fire', 'crew', 'call', 'new', 'fire', 'Gunning', 'south', 'Goulburn', 'detail', 'available', 'stage', 'fire', 'authority', 'close', 'Hume', 'Highway', 'direction', 'new', 'fire', 'Sydney', 'west', 'long', 

Notice how “New”, “South”, and “Wales” have been combined into the single token “New_South_Wales”.
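How does a phrase detector decide which pairs to join? The sketch below is a simplified, pure-Python take on the count-based scoring that this kind of collocation detection relies on: a pair scores highly when it co-occurs more often than its individual word counts would suggest. The function name `score_bigrams`, the toy sentences, and the exact way the vocabulary size is counted are assumptions for illustration; Gensim's own bookkeeping differs in its details.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score adjacent word pairs: (pair_count - min_count) / (count_a * count_b) * vocab_size."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (n_ab - min_count) / unigrams[a] / unigrams[b] * vocab_size
        for (a, b), n_ab in bigrams.items()
    }

# Toy corpus (made up for this example).
toy_sentences = [
    ["fire", "crew", "new", "york"],
    ["new", "york", "fire"],
    ["new", "york", "storm", "crew"],
]

scores = score_bigrams(toy_sentences)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 2))
```

Pairs whose score exceeds a chosen threshold are then merged into one token joined by an underscore.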

### Creating the Dictionary and Corpus

Gensim requires two main components for topic modeling:

- A Dictionary: mapping between words and their integer ids
- A Corpus: a list of documents, where each document is represented as a bag-of-words


In [17]:
# Creating dictionary and corpus

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus[1])

[(69, 1), (81, 1), (89, 1), (91, 1), (92, 1), (106, 1), (107, 1), (108, 1), (109, 4), (110, 1), (111, 1), (112, 1), (113, 1), (114, 2), (115, 1), (116, 1), (117, 3), (118, 1), (119, 1), (120, 1), (121, 2), (122, 3), (123, 1), (124, 2), (125, 2), (126, 1), (127, 1), (128, 1), (129, 1), (130, 1), (131, 1), (132, 1), (133, 1), (134, 1), (135, 1), (136, 3), (137, 1), (138, 1), (139, 1), (140, 2), (141, 1), (142, 1), (143, 1), (144, 1), (145, 1), (146, 1), (147, 1), (148, 3), (149, 3), (150, 1), (151, 1), (152, 2), (153, 1), (154, 1), (155, 2), (156, 1), (157, 1), (158, 1), (159, 1), (160, 1), (161, 1), (162, 1), (163, 1), (164, 1), (165, 1), (166, 1), (167, 1), (168, 2), (169, 1), (170, 1), (171, 1), (172, 1), (173, 1), (174, 1)]


Each tuple in this output represents (word_id, word_count). For example, (69, 1) means the word with id 69 appears once in this document.
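To show what this representation contains, here is a minimal pure-Python sketch of the same idea: assign each distinct word an integer id, then count ids per document. The helper names `build_token2id` and `to_bow` are hypothetical, and the ids they assign will not match the ones Gensim's `Dictionary` produces.

```python
from collections import Counter

def build_token2id(documents):
    """Assign each distinct token the next free integer id."""
    token2id = {}
    for doc_tokens in documents:
        for token in doc_tokens:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc_tokens, token2id):
    """Return sorted (word_id, word_count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[tok] for tok in doc_tokens if tok in token2id)
    return sorted(counts.items())

toy_docs = [["fire", "wind", "fire"], ["wind", "storm"]]
token2id = build_token2id(toy_docs)
toy_corpus = [to_bow(doc_tokens, token2id) for doc_tokens in toy_docs]
print(token2id)
print(toy_corpus)
```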

### Why This Preparation Matters

This data preparation is vital for several reasons:

- **Efficiency**: Representing words as integers is much more memory-efficient than using strings.
- **Compatibility**: Many text analysis algorithms, including Gensim’s topic models, expect input in this format.
- **Information Retention**: Although the bag-of-words format looks cryptic and discards word order, it preserves the word counts that these topic models rely on.
- **Noise Reduction**: By removing stopwords, lemmatizing, and creating bigrams, we’ve reduced noise and increased the semantic value of our tokens.

## Topic Modeling

Now that we have prepared our corpus, we can apply various topic modeling techniques. We’ll explore three popular methods: Latent Semantic Indexing (LSI), Hierarchical Dirichlet Process (HDP), and Latent Dirichlet Allocation (LDA).

### LSI — Latent Semantic Indexing

LSI, also known as Latent Semantic Analysis (LSA), is a technique that uncovers latent topics by decomposing the term-document matrix using Singular Value Decomposition (SVD).
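The decomposition can be illustrated on a tiny dense matrix. This sketch assumes NumPy is available (it is not used elsewhere in this notebook) and uses a made-up 4-term, 4-document count matrix; Gensim's `LsiModel` works on sparse, streamed input rather than a dense array, so treat this only as a picture of the underlying math.

```python
import numpy as np

# Rows are terms, columns are documents (made-up raw counts).
terms = ["fire", "wind", "match", "wicket"]
X = np.array([
    [3, 2, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 3],
    [0, 0, 1, 2],
], dtype=float)

# Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                      # number of latent topics to keep
topic_term = U[:, :k]                      # term weights per topic
doc_topic = (np.diag(s[:k]) @ Vt[:k]).T    # document coordinates in topic space

for topic in range(k):
    weights = " + ".join(
        f"{w:.3f}*{t}" for w, t in zip(topic_term[:, topic], terms)
    )
    print(f"topic {topic}: {weights}")
```

The sign of each singular vector is arbitrary, which is why LSI topics mix positive and negative weights.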


In [18]:
# LSI Model

lsi_model = LsiModel(corpus=corpus, num_topics=10, id2word=dictionary)
lsi_model.show_topics(num_topics=5)

[(0,
  '-0.237*"israeli" + -0.213*"Arafat" + -0.180*"force" + -0.174*"palestinian" + -0.160*"kill" + -0.153*"attack" + -0.146*"people" + -0.137*"official" + -0.125*"day" + -0.117*"Afghanistan"'),
 (1,
  '0.331*"israeli" + 0.323*"Arafat" + 0.258*"palestinian" + 0.173*"Sharon" + -0.168*"Australia" + 0.164*"Israel" + -0.160*"Afghanistan" + -0.118*"day" + 0.114*"Hamas" + -0.113*"year"'),
 (2,
  '-0.285*"Afghanistan" + -0.240*"force" + 0.176*"fire" + -0.172*"bin_Laden" + -0.167*"Pakistan" + 0.142*"win" + 0.137*"Sydney" + 0.127*"Australia" + -0.127*"Taliban" + -0.114*"afghan"'),
 (3,
  '0.397*"fire" + 0.282*"area" + 0.213*"Sydney" + -0.205*"Australia" + 0.182*"firefighter" + 0.167*"north" + 0.157*"wind" + 0.137*"New_South_Wales" + 0.136*"south" + 0.116*"storm"'),
 (4,
  '0.281*"company" + 0.167*"union" + 0.152*"Qantas" + -0.149*"test" + -0.134*"match" + -0.134*"wicket" + -0.126*"win" + -0.121*"day" + 0.120*"cent" + 0.120*"Australian"')]

### HDP — Hierarchical Dirichlet Process

HDP is a nonparametric Bayesian approach to topic modeling. Unlike LSI and LDA, HDP can automatically determine the number of topics.
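One way to build intuition for "nonparametric" here is the stick-breaking construction of a Dirichlet process, the building block behind HDP: topic weights are carved off a unit-length stick one piece at a time, so there is no fixed topic count, only weights that shrink toward zero. The sketch below is an illustration of that construction only, not of Gensim's `HdpModel` inference; the function name, the `tolerance` cut-off, and the `alpha` values are assumptions.

```python
import random

def stick_breaking(alpha, tolerance=1e-3, seed=0):
    """Break pieces off a unit stick until less than `tolerance` remains."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    while remaining > tolerance:
        fraction = rng.betavariate(1.0, alpha)  # share of what is left
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights

for alpha in (1.0, 10.0):
    topic_weights = stick_breaking(alpha)
    print(f"alpha={alpha}: {len(topic_weights)} topics, "
          f"total weight {sum(topic_weights):.3f}")
```

In practice implementations cap the number of topics at a truncation level, so the number of topics is flexible up to that limit.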


In [19]:
# HDP Model

hdp_model = HdpModel(corpus=corpus, id2word=dictionary)
hdp_model.show_topics()[:5]

[(0,
  '0.004*israeli + 0.003*airport + 0.003*Arafat + 0.003*Taliban + 0.003*Sharon + 0.002*military + 0.002*kill + 0.002*night + 0.002*force + 0.002*official + 0.002*target + 0.002*palestinian + 0.002*civilian + 0.002*Kandahar + 0.002*early + 0.002*hit + 0.001*wound + 0.001*choose + 0.001*warplane + 0.001*Lali'),
 (1,
  '0.003*match + 0.003*Krishna + 0.002*ashe + 0.002*Hare + 0.002*team + 0.002*member + 0.002*day + 0.002*Rafter + 0.002*Harrison + 0.002*Australia + 0.002*ask + 0.002*play + 0.002*israeli + 0.002*devotee + 0.002*tennis + 0.002*France + 0.002*hour + 0.002*river + 0.002*Ganges + 0.001*benare'),
 (2,
  '0.006*company + 0.003*director + 0.003*Friedli + 0.002*staff + 0.002*reply + 0.002*entitlement + 0.002*Austar + 0.002*receive + 0.002*know + 0.002*holy + 0.002*review + 0.002*end + 0.002*administrator + 0.002*Foley + 0.002*Australians + 0.002*job + 0.001*responsibility + 0.001*royal + 0.001*payment + 0.001*trip'),
 (3,
  '0.003*India + 0.002*Adventure_World + 0.002*guide + 0

### LDA — Latent Dirichlet Allocation

LDA is perhaps the most popular topic modeling technique. It views documents as mixtures of topics and topics as mixtures of words.
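The "mixture" view is easiest to see by running LDA's generative story forward: for each word position, pick a topic from the document's topic mix, then pick a word from that topic's word distribution. The sketch below does exactly that with made-up topics and probabilities; the names `toy_topics` and `generate_document` are hypothetical, and real LDA works in the opposite direction, inferring these distributions from observed documents.

```python
import random

# Made-up topics: each maps words to probabilities that sum to 1.
toy_topics = {
    "bushfire": {"fire": 0.5, "wind": 0.3, "Sydney": 0.2},
    "cricket": {"match": 0.5, "wicket": 0.3, "win": 0.2},
}

def generate_document(doc_topic_mix, length, seed=0):
    """Sample `length` words: first a topic per word, then a word from it."""
    rng = random.Random(seed)
    topic_names = list(doc_topic_mix)
    topic_weights = [doc_topic_mix[name] for name in topic_names]
    words = []
    for _ in range(length):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        vocab = toy_topics[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return words

# A document that is 80% about bushfires and 20% about cricket.
sample_words = generate_document({"bushfire": 0.8, "cricket": 0.2}, length=12)
print(sample_words)
```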


In [20]:
# LDA Model

lda_model = LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)
lda_model.show_topics()

[(0,
  '0.005*"israeli" + 0.004*"force" + 0.004*"people" + 0.003*"fire" + 0.003*"go" + 0.003*"Australia" + 0.003*"area" + 0.003*"give" + 0.003*"time" + 0.003*"report"'),
 (1,
  '0.005*"day" + 0.004*"area" + 0.004*"new" + 0.003*"official" + 0.003*"attack" + 0.003*"wind" + 0.003*"fire" + 0.003*"firefighter" + 0.003*"Sydney" + 0.003*"year"'),
 (2,
  '0.006*"year" + 0.004*"Arafat" + 0.004*"world" + 0.004*"economy" + 0.003*"tell" + 0.003*"Australia" + 0.003*"end" + 0.003*"good" + 0.003*"set" + 0.003*"israeli"'),
 (3,
  '0.009*"Australia" + 0.007*"people" + 0.005*"man" + 0.004*"day" + 0.004*"force" + 0.004*"think" + 0.003*"Afghanistan" + 0.003*"United_States" + 0.003*"attack" + 0.003*"report"'),
 (4,
  '0.005*"group" + 0.005*"israeli" + 0.005*"year" + 0.004*"come" + 0.004*"attack" + 0.004*"arrest" + 0.003*"palestinian" + 0.003*"know" + 0.003*"believe" + 0.003*"call"'),
 (5,
  '0.006*"report" + 0.005*"company" + 0.004*"people" + 0.004*"year" + 0.003*"good" + 0.003*"fire" + 0.003*"HIH" + 0.003

### Interpreting the Results

Looking at the outputs, we can see some clear themes emerging:

- **Afghanistan war**: “Afghanistan”, “Taliban”, “Al_Qaeda”, and “bin_Laden” are prominent.
- **Australian news**: “Australia”, “Sydney”, “New_South_Wales” appear in multiple topics.
- **Business news**: Words like “company”, “union”, and “worker” suggest business-related topics.

Each model offers a different perspective on the underlying topics in our corpus. LSI provides a mathematical decomposition, HDP offers a flexible number of topics, and LDA gives a probabilistic distribution of topics.


## Visualizing Topics with pyLDAvis

After creating our topic models, it’s crucial to interpret and communicate the results effectively. pyLDAvis, a Python port of the R LDAvis package, offers an excellent interactive visualization tool for this purpose.


In [22]:
# Visualizing Topic Modeling Results

import pyLDAvis
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)

### Interpreting the Visualization

The pyLDAvis visualization consists of two main parts:

**Left panel**: A global view of the topic model

- Circles represent topics
- Size of circles represents the prevalence of topics
- Distance between circles represents the similarity between topics

**Right panel**: A detailed view of the selected topic

- Bar chart shows the most relevant terms for the selected topic
- You can adjust the λ (lambda) value to change the relevance metric
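The relevance metric behind the λ slider blends two quantities: how probable a term is within the topic, and how much more probable it is there than in the corpus overall (its lift). A small sketch of that formula, relevance = λ·log p(w|t) + (1 − λ)·log(p(w|t) / p(w)), follows; the probabilities are made up and the function names are hypothetical.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """lam * log p(w|t) + (1 - lam) * log lift, where lift = p(w|t) / p(w)."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Made-up probabilities: (p(word | topic), p(word) in the whole corpus).
candidates = {
    "year": (0.020, 0.018),         # frequent everywhere
    "firefighter": (0.008, 0.001),  # rarer, but concentrated in this topic
}

def rank_terms(lam):
    """Order candidate terms from most to least relevant for a given lambda."""
    return sorted(
        candidates,
        key=lambda word: relevance(*candidates[word], lam),
        reverse=True,
    )

print("lambda = 1.0:", rank_terms(1.0))
print("lambda = 0.0:", rank_terms(0.0))
```

Lowering λ pushes topic-specific terms up the bar chart, while λ = 1 ranks purely by within-topic probability.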

### Key aspects to explore:

- **Topic prevalence**: Look at the size of the circles to understand which topics are most common in your corpus.
- **Topic similarity**: Closely positioned circles indicate similar topics.
- **Term relevance**: Examine the most relevant terms for each topic to understand its theme.
- **Intertopic distances**: The 2D projection helps visualize how topics relate to each other.
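The distances behind that layout come from comparing topics' word distributions; by default pyLDAvis bases them on the Jensen-Shannon divergence before projecting to two dimensions. The sketch below computes only that divergence on made-up distributions over a four-word vocabulary and leaves out the projection step; the variable names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Symmetric divergence: average KL of p and q to their midpoint."""
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)

# Made-up word distributions for three topics over the same vocabulary.
topic_a = [0.50, 0.30, 0.10, 0.10]
topic_b = [0.40, 0.40, 0.10, 0.10]
topic_c = [0.05, 0.05, 0.40, 0.50]

print("a vs b:", round(jensen_shannon(topic_a, topic_b), 4))
print("a vs c:", round(jensen_shannon(topic_a, topic_c), 4))
```

Topics with similar word distributions get a small divergence and end up close together on the map.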

By using pyLDAvis, we can gain deeper insights into our topic model, understand the relationships between topics, and effectively communicate the results to others. This visualization complements the numerical outputs we saw earlier, providing a more intuitive understanding of the topics present in our corpus.


### Key takeaways from this journey include:

- The importance of thorough text preprocessing using spaCy’s advanced NLP capabilities.
- The versatility of Gensim in implementing various topic modeling algorithms like LSI, HDP, and LDA.
- The interpretability gains from visualizing topic models with pyLDAvis.


## Interpreting results from topic modeling using LLMs

The topics returned are just words, but what do they mean? Let’s use a language model to interpret the topics.


In [52]:
# Get the top words for each topic
for topic_id in range(lda_model.num_topics):
    top_words = lda_model.show_topic(topic_id, topn=10)  # (word, probability) pairs
    words = [word for word, _ in top_words]
    print(words)

['israeli', 'force', 'people', 'fire', 'go', 'Australia', 'area', 'give', 'time', 'report']
['day', 'area', 'new', 'official', 'attack', 'wind', 'fire', 'firefighter', 'Sydney', 'year']
['year', 'Arafat', 'world', 'economy', 'tell', 'Australia', 'end', 'good', 'set', 'israeli']
['Australia', 'people', 'man', 'day', 'force', 'think', 'Afghanistan', 'United_States', 'attack', 'report']
['group', 'israeli', 'year', 'come', 'attack', 'arrest', 'palestinian', 'know', 'believe', 'call']
['report', 'company', 'people', 'year', 'good', 'fire', 'HIH', 'boat', 'early', 'continue']
['tell', 'think', 'force', 'Taliban', 'people', 'year', 'day', 'metre', 'child', 'kill']
['year', 'israeli', 'people', 'attack', 'day', 'month', 'new', 'force', 'launch', 'meeting']
['Australia', 'Government', 'kill', 'official', 'australian', 'day', 'israeli', 'attack', 'force', 'people']
['Afghanistan', 'day', 'force', 'Pakistan', 'Australia', 'union', 'claim', 'company', 'take', 'year']


In [78]:
print("Document Id: 0")
print("Topics in Document 0:")
doc_id = 0
doc_topics = lda_model.get_document_topics(corpus[doc_id])
print(doc_topics)

# Map the word ids in the document's bag-of-words back to words
doc_words = [dictionary[word_id] for word_id, _ in corpus[doc_id]]

# Get the top words for a given topic
def get_top_words(topic_id, words_length=10):
    return [word for word, _ in lda_model.show_topic(topic_id, topn=words_length)]

print("\nTop 10 words for each topic:")
for topic_id, _ in doc_topics:  # each entry is (topic_id, probability)
    print(f"Topic {topic_id}: {get_top_words(topic_id)}")

print("\nWords in Document 0:")
print(doc_words)

Document Id: 0
Topics in Document 0:
[(1, 0.27983212), (4, 0.7144448)]

Top 10 words for each topic:
Topic 1: ['day', 'area', 'new', 'official', 'attack', 'wind', 'fire', 'firefighter', 'Sydney', 'year']
Topic 4: ['group', 'israeli', 'year', 'come', 'attack', 'arrest', 'palestinian', 'know', 'believe', 'call']

Words in Document 0:
['4:00pm', 'AEDT', 'Blue_Mountains', 'Bureau', 'Claire', 'Cranebrook', 'Eve', 'Goulburn', 'Gunning', 'Highlands', 'Highway', 'Hill', 'Hume', 'Hunter', 'Illawarra', 'Meteorology', 'Mittagong', 'New', 'New_South_Wales', 'Richards', 'Rural_Fire_Service', 'Southern', 'Sydney', 'Valley', 'Year', 'area', 'associate', 'authority', 'available', 'blaze', 'burn', 'bushfire', 'call', 'cause', 'cell', 'close', 'closure', 'coast', 'concern', 'crew', 'decision', 'defend', 'detail', 'deterioration', 'direction', 'ease', 'east', 'effort', 'estimated', 'evacuate', 'fact', 'fall', 'far', 'finger', 'fire', 'fire_burn', 'firefighter', 'force', 'formation', 'generally', 'gust', 
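The `(topic_id, probability)` pairs returned for a document can be summarized in a couple of lines. This sketch reuses the values printed above as a hard-coded list so it runs on its own; with a live model you would pass the result of `get_document_topics` instead.

```python
# (topic_id, probability) pairs copied from the output for document 0.
doc_topics = [(1, 0.27983212), (4, 0.7144448)]

# The dominant topic is the pair with the highest probability.
dominant_topic, dominant_weight = max(doc_topics, key=lambda pair: pair[1])

# Topics below the model's minimum probability are omitted from the list,
# so the listed probabilities need not sum to exactly 1.
covered = sum(weight for _, weight in doc_topics)

print(f"Dominant topic: {dominant_topic} ({dominant_weight:.1%})")
print(f"Probability mass listed: {covered:.3f}")
```

Here topic 4 dominates even though the document's words are mostly about bushfires, a reminder that assignments from a quickly trained model deserve a sanity check against the text.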

## Using OpenAI's GPT-4o mini model to interpret topic modeling results

The prompt used to generate the results is:

```
Document Id: 0
Topics in Document 0:
[(1, 0.27983212), (4, 0.7144448)]

Top 10 words for each topic:
Topic 1: ['day', 'area', 'new', 'official', 'attack', 'wind', 'fire', 'firefighter', 'Sydney', 'year']
Topic 4: ['group', 'israeli', 'year', 'come', 'attack', 'arrest', 'palestinian', 'know', 'believe', 'call']

Words in Document 0:
['4:00pm', 'AEDT', 'Blue_Mountains', 'Bureau', 'Claire', 'Cranebrook', 'Eve', 'Goulburn', 'Gunning', 'Highlands', 'Highway', 'Hill', 'Hume', 'Hunter', 'Illawarra', 'Meteorology', 'Mittagong', 'New', 'New_South_Wales', 'Richards', 'Rural_Fire_Service', 'Southern', 'Sydney', 'Valley', 'Year', 'area', 'associate', 'authority', 'available', 'blaze', 'burn', 'bushfire', 'call', 'cause', 'cell', 'close', 'closure', 'coast', 'concern', 'crew', 'decision', 'defend', 'detail', 'deterioration', 'direction', 'ease', 'east', 'effort', 'estimated', 'evacuate', 'fact', 'fall', 'far', 'finger', 'fire', 'fire_burn', 'firefighter', 'force', 'formation', 'generally', 'gust', 'hamper', 'highland', 'home', 'huge', 'hundred', 'isolated', 'leave', 'little', 'long', 'mark', 'millimetre', 'move', 'near', 'nearby', 'new', 'north', 'optimistic', 'outlying', 'part', 'people', 'place', 'probably', 'property', 'push', 'rain', 'relief', 'resident', 'significant', 'south', 'south_west', 'southern', 'stage', 'state', 'storm', 'street', 'strong', 'threaten', 'thunderstorm', 'town', 'unit', 'vacate', 'weather', 'weather_condition', 'west', 'wind']

generate only detailed topic titles for each topic
```

The results are:

- **Fire Danger and Emergency Response in New South Wales**: Weather Conditions, Evacuation Measures, and Firefighting Efforts
- **Israeli-Palestinian Conflict**: Arrests, Attacks, and Political Dynamics in the Current Year
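Rather than pasting notebook output by hand, a prompt in the same layout can be assembled from the values computed earlier. The sketch below shows one way to do that; the function name `build_topic_prompt` is hypothetical, the word lists are shortened for brevity, and the step of actually sending the prompt to a model is deliberately left out.

```python
def build_topic_prompt(doc_id, doc_topics, topic_words, doc_words):
    """Assemble a prompt that mirrors the layout of the one shown above."""
    n_words = len(next(iter(topic_words.values())))
    lines = [
        f"Document Id: {doc_id}",
        f"Topics in Document {doc_id}:",
        str(doc_topics),
        "",
        f"Top {n_words} words for each topic:",
    ]
    lines += [f"Topic {topic_id}: {topic_words[topic_id]}"
              for topic_id, _ in doc_topics]
    lines += [
        "",
        f"Words in Document {doc_id}:",
        str(doc_words),
        "",
        "generate only detailed topic titles for each topic",
    ]
    return "\n".join(lines)

# Shortened example inputs (the real lists are longer).
prompt = build_topic_prompt(
    doc_id=0,
    doc_topics=[(1, 0.27983212), (4, 0.7144448)],
    topic_words={1: ["day", "area", "fire"], 4: ["group", "israeli", "attack"]},
    doc_words=["bushfire", "evacuate", "wind"],
)
print(prompt)
```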
