# Topic Modeling using Gensim-LDA in Python

Topic modeling is a technique for extracting the hidden topics from large volumes of text. A topic model is a probabilistic model that captures which themes occur in the text.

Example: a newspaper corpus may have topics such as economics, sports, politics, and weather.

Topic models are useful for document clustering, organizing large blocks of textual data, information retrieval from unstructured text, and feature selection. Finding good topics depends on the quality of the text preprocessing, the choice of topic modeling algorithm, and the number of topics specified.

LDA (Latent Dirichlet Allocation) treats each document as a mixture of topics and each topic as a distribution over keywords. Once you give the algorithm the number of topics, it iteratively adjusts the topic distribution within each document and the keyword distribution within each topic until it reaches a good topic-keyword composition.

A topic is simply a collection of its most prominent keywords, that is, the words with the highest probability in that topic. These keywords help identify what the topic is about.
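To make the two distributions concrete, here is a minimal sketch with made-up numbers. The topic names, words, and probabilities are illustrative assumptions, not output from the model trained later in this notebook. It computes the probability of a word in a document as the sum over topics of P(topic | document) times P(word | topic), which is the mixture that LDA assumes.

```python
# Toy numbers only: two hypothetical topics over a four-word vocabulary.
topic_word = {
    "politics": {"congress": 0.5, "tax": 0.3, "game": 0.1, "score": 0.1},
    "sports": {"congress": 0.05, "tax": 0.05, "game": 0.5, "score": 0.4},
}

# Hypothetical topic mixture for a single document.
doc_topic = {"politics": 0.8, "sports": 0.2}


def word_probability(word, doc_topic, topic_word):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        weight * topic_word[topic].get(word, 0.0)
        for topic, weight in doc_topic.items()
    )


for word in ["congress", "game"]:
    print(word, round(word_probability(word, doc_topic, topic_word), 3))
```

Training LDA amounts to estimating both tables from the observed words, given only the number of topics.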

In [54]:
import newspaper
import re
import numpy as np
import pandas as  pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
# pyLDAvis.gensim was renamed to pyLDAvis.gensim_models in pyLDAvis 3.x; it is imported as gensimvis below
import matplotlib.pyplot as plt
%matplotlib inline
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
import pyLDAvis.gensim_models as gensimvis
from gensim.models import Phrases
from gensim.models.phrases import Phraser

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet


In [55]:
import nltk
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/pierluigi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/pierluigi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/pierluigi/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/pierluigi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Getting the article from the URL

In [56]:
url = "https://www.foxnews.com/politics/republicans-respond-after-irs-whistleblower-says-hunter-biden-investigation-being-mishandled"

In [57]:
def get_article_info(url):
    # Create a newspaper Article object
    article = newspaper.Article(url)

    # Download and parse the article
    article.download()
    article.parse()

    # Extract the title, subtitle, description, and main text
    title = article.title.strip()
    subtitle = article.meta_data.get("description", "").strip()
    description = article.meta_description.strip()
    text = article.text.strip()

    # Set the subtitle to the description if it is empty
    if not subtitle:
        subtitle = description.strip()

    # Concatenate the extracted strings
    article_text = f"{title}\n\n{subtitle}\n\n{text}"

    # Return the concatenated string
    return article_text

In [58]:
article = get_article_info(url)

# Removing emails, newline characters and punctuation

In [59]:
# Remove emails
text_data = re.sub(r'\S+@\S+', '', article)

# Remove newline characters
text_data = re.sub(r'\n', ' ', text_data)

# Remove punctuation
text_data = re.sub(r'[^\w\s]', '', text_data)

print(text_data)

Republicans respond after IRS whistleblower says Hunter Biden investigation is being mishandled  Members of Congress are calling for more transparency from the Biden administration after an IRS whistleblower said an investigation into Hunter Biden is being mishandled  Lawmakers on Capitol Hill are calling for the Biden administration to be held accountable for blocking Congress and the public from learning more about Biden family members business deals with China  The congressional outcries come as a whistleblower within the Internal Revenue Service alleges an investigation into Hunter Biden is being mishandled by the Biden administration The whistleblower also alleges clear conflicts of interest in the investigation  Its deeply concerning that the Biden Administration may be obstructing justice by blocking efforts to charge Hunter Biden for tax violations House Committee on Oversight and Accountability Chairman James Comer told Fox News on Wednesday  Comer RKy also said deceptive shad

# Tokenize words and clean up the text, removing stop words

In [60]:
# Tokenize the input string
tokens = nltk.word_tokenize(text_data)

# Define the stop words to be removed
stop_words = set(stopwords.words('english'))

# Remove stop words
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print(filtered_tokens)


['Republicans', 'respond', 'IRS', 'whistleblower', 'says', 'Hunter', 'Biden', 'investigation', 'mishandled', 'Members', 'Congress', 'calling', 'transparency', 'Biden', 'administration', 'IRS', 'whistleblower', 'said', 'investigation', 'Hunter', 'Biden', 'mishandled', 'Lawmakers', 'Capitol', 'Hill', 'calling', 'Biden', 'administration', 'held', 'accountable', 'blocking', 'Congress', 'public', 'learning', 'Biden', 'family', 'members', 'business', 'deals', 'China', 'congressional', 'outcries', 'come', 'whistleblower', 'within', 'Internal', 'Revenue', 'Service', 'alleges', 'investigation', 'Hunter', 'Biden', 'mishandled', 'Biden', 'administration', 'whistleblower', 'also', 'alleges', 'clear', 'conflicts', 'interest', 'investigation', 'deeply', 'concerning', 'Biden', 'Administration', 'may', 'obstructing', 'justice', 'blocking', 'efforts', 'charge', 'Hunter', 'Biden', 'tax', 'violations', 'House', 'Committee', 'Oversight', 'Accountability', 'Chairman', 'James', 'Comer', 'told', 'Fox', 'News

# Bigram and Trigram models

A bigram model is a language model that uses one preceding word to predict the next word. It is a type of n-gram model, where n is the length of the word sequence considered: the word being predicted plus n - 1 words of history. For example, given the preceding word "the", a bigram model might predict "dog", producing "the dog".

A trigram model, on the other hand, uses the two preceding words to predict the next word. For example, given the preceding two words "the quick", a trigram model might predict "jumps", producing "the quick jumps".

Bigrams and trigrams add local context, which can improve natural language processing tasks such as text classification, sentiment analysis, and machine translation. Note that the cell below does not train a predictive model: nltk.bigrams and nltk.trigrams simply enumerate every run of two or three adjacent tokens.
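The prediction idea described above can be sketched with plain counts. This is a minimal illustration on a made-up sentence, assuming a maximum-frequency rule with no smoothing. The helper names train_bigram_counts and predict_next are hypothetical and are not part of NLTK or Gensim.

```python
from collections import Counter, defaultdict


def train_bigram_counts(tokens):
    """Count how often each word follows each preceding word."""
    following = defaultdict(Counter)
    for previous, current in zip(tokens, tokens[1:]):
        following[previous][current] += 1
    return following


def predict_next(following, previous):
    """Return the most frequent follower of `previous`, or None if unseen."""
    if previous not in following:
        return None
    return following[previous].most_common(1)[0][0]


tokens = "the dog barks and the dog runs and the cat sleeps".split()
model = train_bigram_counts(tokens)
print(predict_next(model, "the"))
```

A trigram version would key the counts on the two preceding words instead of one.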

In [61]:
bigrams = list(nltk.bigrams(filtered_tokens))
trigrams = list(nltk.trigrams(filtered_tokens))

# Print the results
print("Bigrams:")
print(bigrams)
print("Trigrams:")
print(trigrams)

Bigrams:
[('Republicans', 'respond'), ('respond', 'IRS'), ('IRS', 'whistleblower'), ('whistleblower', 'says'), ('says', 'Hunter'), ('Hunter', 'Biden'), ('Biden', 'investigation'), ('investigation', 'mishandled'), ('mishandled', 'Members'), ('Members', 'Congress'), ('Congress', 'calling'), ('calling', 'transparency'), ('transparency', 'Biden'), ('Biden', 'administration'), ('administration', 'IRS'), ('IRS', 'whistleblower'), ('whistleblower', 'said'), ('said', 'investigation'), ('investigation', 'Hunter'), ('Hunter', 'Biden'), ('Biden', 'mishandled'), ('mishandled', 'Lawmakers'), ('Lawmakers', 'Capitol'), ('Capitol', 'Hill'), ('Hill', 'calling'), ('calling', 'Biden'), ('Biden', 'administration'), ('administration', 'held'), ('held', 'accountable'), ('accountable', 'blocking'), ('blocking', 'Congress'), ('Congress', 'public'), ('public', 'learning'), ('learning', 'Biden'), ('Biden', 'family'), ('family', 'members'), ('members', 'business'), ('business', 'deals'), ('deals', 'China'), ('Ch

# Lemmatization

Lemmatization is a text preprocessing technique that involves reducing words to their base or dictionary form, which is known as a lemma. In contrast to stemming, which just chops off the ends of words to create a root form, lemmatization uses a vocabulary and morphological analysis to determine the lemma of each word based on its context. For example, the lemma of "am", "are", and "is" is "be". 

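The lemmatization process reduces inflected words to their base form, which helps to group together related words and reduce the dimensionality of the feature space in natural language processing tasks. Note that WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed to lemmatize(), which is why verb forms such as "said" and "calling" are left unchanged in the output below.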
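The contrast between stemming and lemmatization can be illustrated with a toy sketch. The lemma table below is a hand-written, hypothetical stand-in for a real lexicon such as WordNet, and crude_stem is a deliberately naive suffix chopper, not a real stemming algorithm.

```python
# Hypothetical, hand-written lemma table; a real lemmatizer uses a full lexicon.
LEMMA_TABLE = {"am": "be", "are": "be", "is": "be", "said": "say", "deals": "deal"}


def crude_stem(word):
    """Chop a few common suffixes without consulting any vocabulary."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the lemma table, falling back to the word itself."""
    return LEMMA_TABLE.get(word.lower(), word)


for word in ["is", "said", "deals", "calling"]:
    print(word, crude_stem(word), toy_lemmatize(word))
```

The stemmer handles regular suffixes but cannot map an irregular form such as "said" to "say", while the lookup can only handle words its vocabulary covers.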

In [62]:
def lemmatize_bigrams(bigrams):
    lemmatizer = WordNetLemmatizer()
    
    lemmatized_bigrams = []
    for bigram in bigrams:
        lemma1 = lemmatizer.lemmatize(bigram[0])
        lemma2 = lemmatizer.lemmatize(bigram[1])
        lemmatized_bigrams.append([lemma1, lemma2])  # Append as a list instead of a tuple, for the dictionary (list of lists)
    return lemmatized_bigrams


In [63]:
lemmatized_bigrams = lemmatize_bigrams(bigrams)

print(lemmatized_bigrams)

[['Republicans', 'respond'], ['respond', 'IRS'], ['IRS', 'whistleblower'], ['whistleblower', 'say'], ['say', 'Hunter'], ['Hunter', 'Biden'], ['Biden', 'investigation'], ['investigation', 'mishandled'], ['mishandled', 'Members'], ['Members', 'Congress'], ['Congress', 'calling'], ['calling', 'transparency'], ['transparency', 'Biden'], ['Biden', 'administration'], ['administration', 'IRS'], ['IRS', 'whistleblower'], ['whistleblower', 'said'], ['said', 'investigation'], ['investigation', 'Hunter'], ['Hunter', 'Biden'], ['Biden', 'mishandled'], ['mishandled', 'Lawmakers'], ['Lawmakers', 'Capitol'], ['Capitol', 'Hill'], ['Hill', 'calling'], ['calling', 'Biden'], ['Biden', 'administration'], ['administration', 'held'], ['held', 'accountable'], ['accountable', 'blocking'], ['blocking', 'Congress'], ['Congress', 'public'], ['public', 'learning'], ['learning', 'Biden'], ['Biden', 'family'], ['family', 'member'], ['member', 'business'], ['business', 'deal'], ['deal', 'China'], ['China', 'congress

# Create Dictionary and Corpus needed for Topic Modeling

The next cell does the following:

- Creates a Gensim dictionary object id2word from the list of lemmatized bigrams, lemmatized_bigrams. The dictionary assigns a unique integer id to each distinct word.

- Assigns lemmatized_bigrams to texts.

- Creates a Gensim corpus object corpus from the dictionary and texts. The corpus is a list of bags-of-words, one per document, where each bag-of-words is a list of tuples. Each tuple holds a word id and the frequency of that word in the corresponding document.

- Prints the corpus.

This code prepares the data for topic modeling with LDA by creating a dictionary of all unique words and a corpus with bag-of-words representations of the documents. Because lemmatized_bigrams is a list of two-word lists, each bigram is treated as its own two-word document.

Each inner list in the output therefore corresponds to one bigram, and each tuple inside it describes one word of that bigram. The first element of each tuple is the ID of the word in the dictionary (id2word) and the second element is the number of times the word appears in that document. The dictionary assigns IDs to individual words, not to bigrams.

For example, the first entry [(0, 1), (1, 1)] represents the bigram ['Republicans', 'respond']: Republicans has ID 0 in the dictionary, respond has ID 1, and each occurs once in that two-word document.
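The bag-of-words conversion described above can be sketched in plain Python. This is a simplified illustration, not Gensim's implementation: ids are assigned in first-seen order here, which is an assumption made for brevity, and the third document is a made-up example that shows a count greater than 1.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign each distinct token an integer id in first-seen order."""
    token_to_id = {}
    for document in documents:
        for token in document:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def to_bag_of_words(document, token_to_id):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token_to_id[token] for token in document)
    return sorted(counts.items())


documents = [["Republicans", "respond"], ["respond", "IRS"], ["IRS", "IRS"]]
token_to_id = build_vocabulary(documents)
print(token_to_id)
print([to_bag_of_words(doc, token_to_id) for doc in documents])
```

Word order inside a document is discarded; only the id and the count survive, which is all LDA uses.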

In [64]:
# Create Dictionary 
id2word = corpora.Dictionary(lemmatized_bigrams)  

# Create Corpus 
texts = lemmatized_bigrams

corpus = [id2word.doc2bow(text) for text in texts]

print(corpus)


[[(0, 1), (1, 1)], [(1, 1), (2, 1)], [(2, 1), (3, 1)], [(3, 1), (4, 1)], [(4, 1), (5, 1)], [(5, 1), (6, 1)], [(6, 1), (7, 1)], [(7, 1), (8, 1)], [(8, 1), (9, 1)], [(9, 1), (10, 1)], [(10, 1), (11, 1)], [(11, 1), (12, 1)], [(6, 1), (12, 1)], [(6, 1), (13, 1)], [(2, 1), (13, 1)], [(2, 1), (3, 1)], [(3, 1), (14, 1)], [(7, 1), (14, 1)], [(5, 1), (7, 1)], [(5, 1), (6, 1)], [(6, 1), (8, 1)], [(8, 1), (15, 1)], [(15, 1), (16, 1)], [(16, 1), (17, 1)], [(11, 1), (17, 1)], [(6, 1), (11, 1)], [(6, 1), (13, 1)], [(13, 1), (18, 1)], [(18, 1), (19, 1)], [(19, 1), (20, 1)], [(10, 1), (20, 1)], [(10, 1), (21, 1)], [(21, 1), (22, 1)], [(6, 1), (22, 1)], [(6, 1), (23, 1)], [(23, 1), (24, 1)], [(24, 1), (25, 1)], [(25, 1), (26, 1)], [(26, 1), (27, 1)], [(27, 1), (28, 1)], [(28, 1), (29, 1)], [(29, 1), (30, 1)], [(3, 1), (30, 1)], [(3, 1), (31, 1)], [(31, 1), (32, 1)], [(32, 1), (33, 1)], [(33, 1), (34, 1)], [(34, 1), (35, 1)], [(7, 1), (35, 1)], [(5, 1), (7, 1)], [(5, 1), (6, 1)], [(6, 1), (8, 1)], [(6, 

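Gensim creates a unique id for each word in the vocabulary.
Each bag-of-words is a mapping from word_id to word_frequency. For example, a tuple such as (8, 2) would mean that the word with id 8 occurs twice in that document. In the output shown above, the visible counts are all 1 because each document contains only two words.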

The LDA model in the next cell is built with the following parameters:

- corpus is the bag-of-words corpus created earlier, where each document is represented as a list of tuples with (term_id, frequency).

- id2word is the dictionary created earlier, which maps each term to a unique integer ID.
- num_topics specifies the number of topics to be generated by the LDA model.
- random_state sets the random seed for reproducibility of the results.
- update_every controls how often the model is updated during training. 0 selects batch learning, and values of 1 or more select online learning.
- chunksize is the number of documents to be used in each training chunk.
- passes specifies the total number of passes through the corpus during training.
- alpha is the prior on the per-document topic distribution. 'symmetric' (the default) uses a fixed value of 1/num_topics for every topic, whereas 'auto' learns an asymmetric prior from the corpus.
- per_word_topics, when True, makes the model also compute, for each word in a document, the list of its most likely topics in descending order.



After initializing the model, we can continue training it on additional documents by calling the lda_model.update() method with a new bag-of-words corpus.

In [65]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

Each topic is a combination of keywords, and each keyword contributes a certain weight to the topic.

lda_model.print_topics() shows the top keywords for each topic together with their weights.

The keywords listed are the words with the highest probability in the topic, and the numbers are the probabilities of those words in the topic's word distribution.

In [66]:
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.004*"Biden" + 0.004*"mishandled" + 0.004*"Department" + 0.004*"either" + '
  '0.004*"choice" + 0.004*"single" + 0.004*"every" + 0.004*"need" + '
  '0.004*"potential" + 0.004*"behavior"'),
 (1,
  '0.239*"whistleblower" + 0.130*"learning" + 0.038*"held" + 0.020*"come" + '
  '0.020*"Hill" + 0.002*"whether" + 0.002*"privy" + 0.002*"protected" + '
  '0.002*"disclosure" + 0.002*"scheme"'),
 (2,
  '0.510*"Hunter" + 0.002*"Neither" + 0.002*"federal" + 0.002*"son" + '
  '0.002*"Biden" + 0.002*"say" + 0.002*"within" + 0.002*"whistleblower" + '
  '0.002*"single" + 0.002*"either"'),
 (3,
  '0.125*"Committee" + 0.099*"Oversight" + 0.040*"Revenue" + 0.040*"Service" + '
  '0.040*"Chairman" + 0.040*"James" + 0.021*"Internal" + 0.021*"deal" + '
  '0.002*"work" + 0.002*"investigation"'),
 (4,
  '0.141*"Fox" + 0.102*"scheme" + 0.097*"public" + 0.097*"told" + '
  '0.065*"deceptive" + 0.033*"shady" + 0.028*"respond" + 0.015*"allowed" + '
  '0.015*"Republicans" + 0.002*"SARs"'),
 (5,
  '0.004*"beh

# Explanation: example of the first topic id of the output

The output is a list of (topic_id, word_distribution) pairs. For the first entry, the topic_id is 0.

The word_distribution is a formatted string of weight*"word" terms joined by +. Each weight is the probability of the word within the topic, so higher weights indicate more significant words.

So for the first topic, the top 10 words and their weights are:

- "Biden": 0.004
- "mishandled": 0.004
- "Department": 0.004
- "either": 0.004
- "choice": 0.004
- "single": 0.004
- "every": 0.004
- "need": 0.004
- "potential": 0.004
- "behavior": 0.004

All ten weights are identical and very small, which suggests that topic 0 is close to uniform over the vocabulary and did not capture a distinct theme, so the order of these words carries little information. Topics with a few dominant weights, such as topic 2 ("Hunter" at 0.510) or topic 3 ("Committee" and "Oversight"), are easier to interpret. Even so, without more context about the corpus and the LDA model itself, it is difficult to interpret these results in a meaningful way.
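If the weights are needed as numbers rather than as text, the formatted string can be parsed with a regular expression. The sketch below uses the first three terms of topic 1 from the output above as its sample. The helper name parse_topic_string is hypothetical. Gensim's lda_model.show_topic(topic_id) returns (word, probability) pairs directly, so parsing is only needed when all you have is the printed string.

```python
import re


def parse_topic_string(topic_string):
    """Extract (word, weight) pairs from a print_topics-style string."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


sample = '0.239*"whistleblower" + 0.130*"learning" + 0.038*"held"'
for word, weight in parse_topic_string(sample):
    print(f"{word}: {weight}")
```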

# Compute model Perplexity and Coherence score

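- Coherence measures how semantically consistent the top words of each topic are. Higher coherence scores indicate more interpretable topics. CoherenceModel is the Gensim class that computes the coherence of a topic model. The cell below uses the c_v measure.

- Perplexity measures how well the LDA model predicts the corpus. Lower perplexity indicates better predictions. Note that log_perplexity, a method of the LdaModel class, does not return the perplexity itself but a per-word likelihood bound, which is why the value printed below is negative. Perplexity is 2 raised to the power of the negated bound, so a higher (less negative) bound corresponds to lower perplexity.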

In [67]:
# Compute the per-word likelihood bound (perplexity = 2 ** (-bound))
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  
# a higher (less negative) bound means lower perplexity, i.e. a better fit.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts = lemmatized_bigrams, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -21.178629464587097

Coherence Score:  0.739709594537693
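The value printed as "Perplexity" above is the per-word likelihood bound returned by log_perplexity. Gensim's documentation gives the relationship perplexity = 2^(-bound), and a minimal sketch of that conversion follows. The function name perplexity_from_bound is hypothetical.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound into perplexity, 2 ** (-bound)."""
    return 2.0 ** (-per_word_bound)


bound = -21.178629464587097  # the value printed by the cell above
print(perplexity_from_bound(bound))
```

The resulting number is large because the corpus consists of two-word documents spread over 20 topics. It is most useful for comparing models trained on the same data rather than as an absolute score.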


# LDA model

The visualization has two panels. The left panel, the intertopic distance map, shows each topic generated by the LDA model as a circle. The area of the circle represents the prevalence of the topic in the corpus, and the distance between circles reflects how similar the topics are.

The right panel shows a horizontal bar chart of the most relevant terms for the selected topic. For each term, one bar shows its overall frequency in the corpus and an overlaid bar shows its estimated frequency within the selected topic.

You can interact with the visualization by hovering over or clicking a circle to see the terms associated with that topic, and by hovering over a term to see which topics it appears in. You can also adjust the relevance metric with the λ slider to change how the terms are ranked.

In [68]:
# Visualize the topics
pyLDAvis.enable_notebook(local=True)
# gensimvis is pyLDAvis.gensim_models, the name used by pyLDAvis 3.x
vis = gensimvis.prepare(lda_model, corpus, id2word)
# pyLDAvis.display(vis)
pyLDAvis.save_html(vis, 'lda_plot.html')

Each bubble on the left-hand side represents a topic. The larger the bubble, the more prevalent the topic. A good topic model has fairly large, non-overlapping bubbles scattered across different quadrants rather than clustered in one quadrant.

A model with too many topics typically has many overlapping, small bubbles clustered in one region of the chart.
If you move the cursor over the different bubbles, you can see the keywords associated with each topic.
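The relevance metric mentioned above comes from the LDAvis method that pyLDAvis implements. It ranks a term for a topic by λ · log p(word | topic) + (1 - λ) · log(p(word | topic) / p(word)). The sketch below applies that formula to two made-up terms. The probabilities and the helper names relevance and rank_terms are illustrative assumptions, not values or functions from pyLDAvis.

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1.0 - lam) * math.log(lift)


def rank_terms(terms, lam):
    """Sort terms from most to least relevant for a given lam in [0, 1]."""
    return sorted(
        terms,
        key=lambda word: relevance(terms[word][0], terms[word][1], lam),
        reverse=True,
    )


# Hypothetical (p(word | topic), p(word)) pairs for one topic.
terms = {"biden": (0.10, 0.08), "whistleblower": (0.06, 0.01)}

print(rank_terms(terms, lam=1.0))  # ranks by in-topic probability only
print(rank_terms(terms, lam=0.0))  # ranks by lift only
```

With λ = 1 the most frequent word in the topic ranks first. With λ = 0 the ranking favours words that are rare elsewhere in the corpus, which is why moving the slider can surface more topic-specific keywords.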
