# Topic Modeling on News Snippets Using LDA
**Author:** Virginia Herrero

## Import Libraries and Download Resources

Import the libraries needed for text preprocessing and topic modeling, and download the required NLTK resources.

In [12]:
# Text processing
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from gensim.utils import simple_preprocess

# Download NLTK resources
nltk.download("stopwords")
nltk.download("wordnet")

# Topic Modeling
import gensim
import gensim.corpora as corpora
from gensim.models import LdaMulticore, CoherenceModel

# Utilities
from pprint import pprint

[nltk_data] Downloading package stopwords to C:\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Define the Corpus

In natural language processing, a corpus is a collection of written or spoken texts that serves as the dataset for language-related tasks. The corpus is analyzed to identify language patterns and typically requires preprocessing and transformation into a format suitable for machine learning models.

In [13]:
corpus = [
    "The stock market closed higher today as tech shares rallied amid strong earnings reports.",
    "A major earthquake struck the coastal city early this morning, causing widespread damage.",
    "The government announced new policies aimed at reducing carbon emissions by 2030.",
    "Scientists discovered a new species of dinosaur in the remote mountains of Argentina.",
    "The local football team won the championship after a thrilling final match.",
    "Health officials urge citizens to get vaccinated as flu season approaches.",
    "A breakthrough in renewable energy technology promises cheaper solar panels.",
    "International leaders met to discuss trade agreements and economic cooperation.",
    "A popular film festival opened this weekend, showcasing independent movies from around the world.",
    "The city council approved plans for a new public park to promote green spaces."
]

## Text Preprocessing

After defining the corpus, the next step is text preprocessing: the raw text is tokenized and lowercased, stopwords are removed, and the remaining tokens are lemmatized and stemmed so that the data is suitable for modeling.

In [14]:
# Set stopwords
stop_w = set(stopwords.words("english"))

In [15]:
# Tokenize the corpus
def doc_to_tokens(texts):
    """
    Tokenize a list of documents into clean lowercase words.

    Parameters:
    ----------
    texts (list of str): List of raw text documents.

    Yields:
    ----------
    list of str: Tokenized and lowercased words from each document,
                 with punctuation removed.
    """
    for doc in texts:
        yield simple_preprocess(doc, deacc=True)

tokens = list(doc_to_tokens(corpus))

In [16]:
# Remove stopwords
def rm_stopwords(docs):
    """
    Remove English stopwords from tokenized documents.

    Parameters:
    ----------
    docs (list of list of str): Tokenized documents (list of words).

    Returns:
    ----------
    list of list of str: Tokenized documents with stopwords removed.
    """
    return [[word for word in doc if word not in stop_w] for doc in docs]

tokens = rm_stopwords(tokens)

In [17]:
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Lemmatize and stem tokens
docs = []
for doc in tokens:
    word_list = []
    for token in doc:
        lemm = lemmatizer.lemmatize(token)
        stem = stemmer.stem(lemm)
        word_list.append(stem)
    docs.append(word_list)

print("Sample processed documents:")
print(docs[:2])

Sample processed documents:
[['stock', 'market', 'close', 'higher', 'today', 'tech', 'share', 'ralli', 'amid', 'strong', 'earn', 'report'], ['major', 'earthquak', 'struck', 'coastal', 'citi', 'earli', 'morn', 'caus', 'widespread', 'damag']]
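
To make the first two preprocessing steps more concrete, here is a rough standard-library approximation of tokenization and stopword removal. It is only a sketch: `rough_tokenize`, `drop_stopwords`, and `MINI_STOPWORDS` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, and the length bounds are assumed to mirror the defaults of `simple_preprocess`. Lemmatization and stemming are not reproduced here.

```python
import re

# Hypothetical mini stopword set for illustration; NLTK's English list is much longer.
MINI_STOPWORDS = {"the", "as", "a", "this", "to", "of", "in"}

def rough_tokenize(text, min_len=2, max_len=15):
    """Lowercase the text and keep alphabetic tokens within the length bounds."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if min_len <= len(w) <= max_len]

def drop_stopwords(tokens, stopword_set):
    """Return the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopword_set]

sentence = ("The stock market closed higher today as tech shares "
            "rallied amid strong earnings reports.")
cleaned = drop_stopwords(rough_tokenize(sentence), MINI_STOPWORDS)
print(cleaned)
```

The result still contains inflected forms such as `closed` and `shares`; in the notebook, the lemmatizer and stemmer reduce these to `close` and `share`.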


## Create Dictionary

A dictionary in natural language processing is a mapping between unique words (tokens) in the corpus and their integer IDs. It serves as a vocabulary reference that converts text data into numerical formats required by machine learning models. In topic modeling, the dictionary helps translate words into a consistent numeric representation used to build the corpus and train models like LDA.

In [18]:
# Create a dictionary representation of the documents
word_dict = corpora.Dictionary(docs)

# Print the first 10 (id, token) pairs
print("Sample dictionary token-id pairs:")
print(list(word_dict.items())[:10])

Sample dictionary token-id pairs:
[(0, 'amid'), (1, 'close'), (2, 'earn'), (3, 'higher'), (4, 'market'), (5, 'ralli'), (6, 'report'), (7, 'share'), (8, 'stock'), (9, 'strong')]
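
The idea behind the dictionary can be sketched in plain Python. The helper `build_token_ids` below is hypothetical and not part of gensim. It assumes that new tokens receive consecutive IDs in alphabetical order within each document, which is consistent with the output above but should be treated as an assumption about gensim's behavior.

```python
def build_token_ids(docs):
    """Assign consecutive integer IDs to tokens, document by document."""
    token2id = {}
    for doc in docs:
        # Assumption: unseen tokens in a document get IDs in alphabetical order
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

# The two processed documents printed earlier in the notebook
sample_docs = [
    ["stock", "market", "close", "higher", "today", "tech",
     "share", "ralli", "amid", "strong", "earn", "report"],
    ["major", "earthquak", "struck", "coastal", "citi", "earli",
     "morn", "caus", "widespread", "damag"],
]

token2id = build_token_ids(sample_docs)
id2token = {token_id: token for token, token_id in token2id.items()}
print([(i, id2token[i]) for i in range(10)])
```

For the first document, this reproduces the ten (id, token) pairs shown in the notebook output.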


## Create Bag-of-Words

A bag of words (BoW) is a simple and widely used way to represent text in natural language processing. It treats a document as a "bag" of individual words, ignoring grammar and word order but keeping track of how many times each word appears. Each document is converted into a vector of word counts over the vocabulary defined by the dictionary; gensim stores this vector sparsely as a list of (token ID, count) pairs. This numerical representation is what is fed to the topic model.


In [19]:
# Create the bag-of-words corpus
bow_corpus = [word_dict.doc2bow(doc) for doc in docs]

# Print the bag-of-words for the first document
print("Sample bag-of-words representation for first document:")
print(bow_corpus[0])

Sample bag-of-words representation for first document:
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1)]
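
Every count in the output above is 1 because no token repeats within the first document. The following minimal sketch shows what happens when a token does repeat. The helper `to_bow` and the three-word vocabulary are hypothetical and used only for illustration; skipping tokens that are missing from the vocabulary is an assumption of this sketch.

```python
from collections import Counter

def to_bow(doc, token2id):
    """Convert a token list into a sorted list of (token_id, count) pairs."""
    # Tokens missing from the vocabulary are skipped (assumption of this sketch)
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary and document
token2id = {"citi": 0, "park": 1, "green": 2}
doc = ["citi", "park", "citi", "unknown"]

bow = to_bow(doc, token2id)
print(bow)  # [(0, 2), (1, 1)]
```

The token `citi` appears twice, so its pair carries a count of 2, while `green` does not occur and is simply absent from the sparse representation.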


## LDA Model

Topic modeling is the process of uncovering hidden thematic structures in a collection of documents. LDA (Latent Dirichlet Allocation) is one of the most commonly used algorithms for this task. It identifies groups of words that frequently occur together and uses them to define topics, allowing each document to be represented as a mixture of these topics.

In [20]:
# Train the LDA model
lda_model = LdaMulticore(
    corpus = bow_corpus,        # The BoW representation of the documents
    id2word = word_dict,        # The dictionary mapping of word IDs
    num_topics = 3,             # Number of topics to extract
    random_state = 42,          # For reproducibility
    passes = 10,                # Number of passes through the corpus during training
)

# Display the discovered topics (pprint was imported in the first cell)
pprint(lda_model.print_topics(num_words = 10))

[(0,
  '0.038*"new" + 0.022*"argentina" + 0.022*"mountain" + 0.022*"scientist" + '
  '0.022*"remot" + 0.022*"discov" + 0.022*"dinosaur" + 0.022*"speci" + '
  '0.021*"reduc" + 0.021*"emiss"'),
 (1,
  '0.034*"citi" + 0.020*"space" + 0.020*"public" + 0.020*"promot" + '
  '0.020*"approv" + 0.020*"plan" + 0.020*"green" + 0.020*"council" + '
  '0.020*"park" + 0.020*"strong"'),
 (2,
  '0.028*"film" + 0.028*"weekend" + 0.028*"showcas" + 0.028*"open" + '
  '0.028*"popular" + 0.028*"world" + 0.028*"independ" + 0.028*"around" + '
  '0.028*"movi" + 0.028*"festiv"')]
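
The phrase "a mixture of topics" can be illustrated with a small worked example. Under the LDA view, the probability of a word in a document is the sum, over topics, of the document's topic weight times the topic's word probability. All numbers and topic names below are invented for illustration and do not come from the trained model.

```python
# Hypothetical per-topic word distributions (each sums to 1)
topic_word = {
    "finance": {"stock": 0.6, "market": 0.3, "film": 0.1},
    "culture": {"stock": 0.1, "market": 0.1, "film": 0.8},
}

# Hypothetical topic mixture for one document (sums to 1)
doc_topic = {"finance": 0.9, "culture": 0.1}

def word_probability(word, doc_topic, topic_word):
    """Mix the per-topic word probabilities using the document's topic weights."""
    return sum(
        weight * topic_word[topic].get(word, 0.0)
        for topic, weight in doc_topic.items()
    )

for word in ("stock", "market", "film"):
    print(f"p({word} | doc) = {word_probability(word, doc_topic, topic_word):.2f}")
```

A document dominated by the finance topic therefore gives `stock` a much higher probability than `film`, even though `film` is the most likely word of the culture topic.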


## Evaluate the Model

Model evaluation measures how well a topic model produces coherent and meaningful topics. Here, topic coherence (the `c_v` measure) is used: higher scores indicate topics whose top words tend to occur together, which generally makes them easier to interpret.

In [21]:
# Build coherence model
coherence_model_lda = CoherenceModel(
    model = lda_model, 
    texts = docs,          
    dictionary = word_dict,
    coherence = "c_v"      
)

# Compute coherence score
coherence_score = coherence_model_lda.get_coherence()

print(f"Coherence Score: {coherence_score:.4f}")

Coherence Score: 0.2541
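
For intuition about what a coherence measure rewards, the sketch below computes normalized pointwise mutual information (NPMI) between word pairs from document-level co-occurrence. This is not the `c_v` computation itself: `c_v` is commonly described as building on NPMI together with sliding windows and a cosine-similarity step, none of which are reproduced here. The function `npmi` and the toy documents are hypothetical.

```python
import math

def npmi(word_a, word_b, docs):
    """NPMI of two words based on how often they share a document."""
    n = len(docs)
    p_a = sum(word_a in doc for doc in docs) / n
    p_b = sum(word_b in doc for doc in docs) / n
    p_ab = sum(word_a in doc and word_b in doc for doc in docs) / n
    if p_ab == 0:
        return -1.0  # the words never co-occur
    if p_ab == 1:
        return 0.0   # undefined (0/0); returning 0.0 is a convention of this sketch
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)

# Hypothetical toy documents, each represented as a set of stems
toy_docs = [
    {"film", "festiv", "movi"},
    {"film", "movi"},
    {"stock", "market"},
    {"stock", "earn"},
]

print(npmi("film", "movi", toy_docs))    # always together
print(npmi("film", "festiv", toy_docs))  # sometimes together
print(npmi("film", "stock", toy_docs))   # never together
```

Word pairs that always appear together score close to 1, and pairs that never appear together score -1, so a topic whose top words rarely share a document pulls the overall coherence down.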


The coherence score of 0.2541 is fairly low. The next step is to train several LDA models with different numbers of topics, compute the coherence score of each, and select the number of topics with the highest score.

In [22]:
# Train multiple LDA models to find the best
def compute_coherence_values(dictionary, corpus, texts, start = 2, limit = 10, step = 1):
    """
    Train LDA models with varying number of topics and compute their coherence scores.

    Parameters:
    ----------
    - dictionary (gensim.corpora.Dictionary): Mapping of word IDs to words.
    - corpus (list of list of (int, int)): Bag-of-words representation of documents.
    - texts (list of list of str): Preprocessed tokenized documents.
    - start (int): Minimum number of topics to try (inclusive).
    - limit (int): Maximum number of topics to try (inclusive).
    - step (int): Step size between topic numbers.

    Returns:
    ----------
    - model_list (list): List of trained LDA models.
    - coherence_values (list): List of coherence scores corresponding to each model.
    """
    coherence_values = []
    model_list = []
    
    for num_topics in range(start, limit + 1, step):
        print(f"Training LDA with {num_topics} topics...")
        model = LdaMulticore(
            corpus = corpus,
            id2word = dictionary,
            num_topics = num_topics,
            random_state = 42,
            passes = 10,
        )
        model_list.append(model)
        
        coherence_model = CoherenceModel(
            model = model,
            texts = texts,
            dictionary = dictionary,
            coherence = "c_v"
        )
        coherence_score = coherence_model.get_coherence()
        coherence_values.append(coherence_score)
        
        print(f"Coherence Score for {num_topics} topics: {coherence_score:.4f}\n")
    
    return model_list, coherence_values

In [23]:
# Run coherence evaluation for topics 2 to 10
model_list, coherence_values = compute_coherence_values(word_dict, bow_corpus, docs, start = 2, limit = 10, step = 1)

# Print summary
for num, score in zip(range(2, 11), coherence_values):
    print(f"Num Topics = {num} => Coherence Score = {score:.4f}")

Training LDA with 2 topics...
Coherence Score for 2 topics: 0.2105

Training LDA with 3 topics...
Coherence Score for 3 topics: 0.2541

Training LDA with 4 topics...
Coherence Score for 4 topics: 0.2814

Training LDA with 5 topics...
Coherence Score for 5 topics: 0.2298

Training LDA with 6 topics...
Coherence Score for 6 topics: 0.3579

Training LDA with 7 topics...
Coherence Score for 7 topics: 0.3620

Training LDA with 8 topics...
Coherence Score for 8 topics: 0.4415

Training LDA with 9 topics...
Coherence Score for 9 topics: 0.4346

Training LDA with 10 topics...
Coherence Score for 10 topics: 0.4009

Num Topics = 2 => Coherence Score = 0.2105
Num Topics = 3 => Coherence Score = 0.2541
Num Topics = 4 => Coherence Score = 0.2814
Num Topics = 5 => Coherence Score = 0.2298
Num Topics = 6 => Coherence Score = 0.3579
Num Topics = 7 => Coherence Score = 0.3620
Num Topics = 8 => Coherence Score = 0.4415
Num Topics = 9 => Coherence Score = 0.4346
Num Topics = 10 => Coherence Score = 0.4009

In [24]:
# The best model
# Find the index of the best coherence score
best_model_index = coherence_values.index(max(coherence_values))

# Select the best model
best_model = model_list[best_model_index]

# Print summary of the best model
print(f"\nBest LDA Model has {best_model.num_topics} topics with coherence score {coherence_values[best_model_index]:.4f}\n")

# Pretty-print the topics of the best model
pprint(best_model.print_topics(num_words = 20))


Best LDA Model has 8 topics with coherence score 0.4415

[(0,
  '0.011*"new" + 0.011*"local" + 0.011*"remot" + 0.011*"renew" + 0.011*"match" '
  '+ 0.011*"scientist" + 0.011*"citi" + 0.011*"team" + 0.011*"govern" + '
  '0.011*"breakthrough" + 0.011*"polici" + 0.011*"discu" + 0.011*"citizen" + '
  '0.011*"approach" + 0.011*"reduc" + 0.011*"emiss" + 0.011*"final" + '
  '0.011*"carbon" + 0.011*"argentina" + 0.011*"promis"'),
 (1,
  '0.011*"new" + 0.011*"team" + 0.011*"citi" + 0.011*"final" + 0.011*"local" + '
  '0.011*"championship" + 0.011*"thrill" + 0.011*"panel" + 0.011*"renew" + '
  '0.011*"season" + 0.011*"scientist" + 0.011*"match" + 0.011*"govern" + '
  '0.011*"remot" + 0.011*"approach" + 0.011*"footbal" + 0.011*"emiss" + '
  '0.011*"promis" + 0.011*"polici" + 0.011*"breakthrough"'),
 (2,
  '0.054*"weekend" + 0.054*"open" + 0.054*"independ" + 0.054*"film" + '
  '0.054*"around" + 0.054*"showcas" + 0.054*"popular" + 0.054*"movi" + '
  '0.054*"world" + 0.054*"festiv" + 0.006*"new" + 

The model was trained with 2 to 10 topics. Coherence generally rose with the number of topics, although not monotonically (it dipped at 5 topics), and peaked at 8 topics with a score of 0.4415. By this metric, the 8-topic model is the best of those tested. The topics include groups of related words representing themes such as environment, finance, and health, and some are easy to read, for example topic 2 (`film`, `festiv`, `movi`). Others are not: topics 0 and 1 assign the same weight (0.011) to a mix of unrelated words. With only ten short documents, the coherence estimates are likely noisy, so there is still room to improve the results with further tuning or preprocessing.

## Extract Document Topic Distributions

The topic distribution shows how much each topic contributes to a given document. After training the LDA model, each document is represented as a mixture of topics with associated probabilities. This helps identify the dominant themes in each document and understand how content is distributed across topics.

In [None]:
# Get topic distributions for all documents in the BoW corpus
doc_topics = [best_model.get_document_topics(doc) for doc in bow_corpus]

# Print original document with its topic distribution
for i in range(len(bow_corpus)):
    print(f"\nDocument {i + 1}:\n{corpus[i]}")
    print("Topic Distribution:")
    for topic_id, prob in doc_topics[i]:
        print(f"  Topic {topic_id}: {prob:.4f}")


Document 1:
The stock market closed higher today as tech shares rallied amid strong earnings reports.
Topic Distribution:
  Topic 6: 0.9327

Document 2:
A major earthquake struck the coastal city early this morning, causing widespread damage.
Topic Distribution:
  Topic 0: 0.0114
  Topic 1: 0.0114
  Topic 2: 0.0114
  Topic 3: 0.0114
  Topic 4: 0.0114
  Topic 5: 0.9205
  Topic 6: 0.0114
  Topic 7: 0.0114

Document 3:
The government announced new policies aimed at reducing carbon emissions by 2030.
Topic Distribution:
  Topic 0: 0.0139
  Topic 1: 0.0139
  Topic 2: 0.0139
  Topic 3: 0.0139
  Topic 4: 0.0139
  Topic 5: 0.0139
  Topic 6: 0.0139
  Topic 7: 0.9028

Document 4:
Scientists discovered a new species of dinosaur in the remote mountains of Argentina.
Topic Distribution:
  Topic 0: 0.0139
  Topic 1: 0.0139
  Topic 2: 0.0139
  Topic 3: 0.0139
  Topic 4: 0.0139
  Topic 5: 0.0139
  Topic 6: 0.0139
  Topic 7: 0.9028

Document 5:
The local football team won the championship after a thri
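
Document 1 lists only Topic 6, while the other documents list all eight topics. One plausible explanation, assuming gensim's default `minimum_probability` of 0.01 is in effect, is that topics below that threshold are omitted from the result. The sketch below reproduces the effect with a hypothetical helper, `filter_topics`, and a distribution reconstructed from the printed value 0.9327 by spreading the remaining mass evenly (an assumption).

```python
def filter_topics(distribution, minimum_probability=0.01):
    """Keep only the (topic_id, probability) pairs at or above the threshold."""
    return [
        (topic_id, prob)
        for topic_id, prob in distribution
        if prob >= minimum_probability
    ]

# Assumed full distribution for Document 1: 0.9327 on topic 6,
# with the remaining mass spread evenly over the other seven topics
rest = (1 - 0.9327) / 7  # about 0.0096, just below 0.01
full_distribution = [(topic_id, rest) for topic_id in range(8)]
full_distribution[6] = (6, 0.9327)

print(filter_topics(full_distribution))            # only topic 6 survives
print(len(filter_topics(full_distribution, 0.0)))  # all 8 topics are kept
```

If the full distribution is wanted for every document, passing `minimum_probability=0` to `get_document_topics` is the usual way to request it.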

## Identify Dominant Topics

In [None]:
# Identify the dominant topic in each document
dominant_topics = []
for i, topics in enumerate(doc_topics):
    if topics:
        # Get the topic with the highest probability
        dominant_topic = max(topics, key=lambda x: x[1])
        dominant_topics.append((i, dominant_topic[0], dominant_topic[1]))
        print(f"Document {i+1}: Dominant Topic = {dominant_topic[0]}, Score = {dominant_topic[1]:.4f}")
    else:
        print(f"Document {i+1}: No dominant topic found")

Document 1: Dominant Topic = 6, Score = 0.9327
Document 2: Dominant Topic = 5, Score = 0.9205
Document 3: Dominant Topic = 7, Score = 0.9028
Document 4: Dominant Topic = 7, Score = 0.9028
Document 5: Dominant Topic = 7, Score = 0.8906
Document 6: Dominant Topic = 7, Score = 0.9125
Document 7: Dominant Topic = 7, Score = 0.9028
Document 8: Dominant Topic = 6, Score = 0.9028
Document 9: Dominant Topic = 2, Score = 0.9205
Document 10: Dominant Topic = 5, Score = 0.9205
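
A quick way to read this output is to group documents by their dominant topic. The sketch below copies the (document number, dominant topic) pairs from the output above into a list, so it runs without the trained model; the variable names are illustrative.

```python
from collections import defaultdict

# (document number, dominant topic) pairs copied from the output above
dominant = [
    (1, 6), (2, 5), (3, 7), (4, 7), (5, 7),
    (6, 7), (7, 7), (8, 6), (9, 2), (10, 5),
]

# Collect the document numbers assigned to each dominant topic
docs_by_topic = defaultdict(list)
for doc_number, topic_id in dominant:
    docs_by_topic[topic_id].append(doc_number)

for topic_id in sorted(docs_by_topic):
    print(f"Topic {topic_id}: documents {docs_by_topic[topic_id]}")
```

Only four of the eight topics are dominant for any document, and Topic 7 alone covers documents 3 to 7, whose subjects range from carbon policy to football. This suggests that the 8-topic model does not separate the themes as cleanly as its coherence score alone might imply.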
