# Topic Modeling using Gensim-LDA in Python

Topic modeling is a technique for extracting hidden topics from large volumes of text. A topic model is a probabilistic model that captures information about the themes present in the text.

Example: a newspaper corpus may contain topics such as economics, sports, politics, and weather.

Topic models are useful for document clustering, organizing large blocks of textual data, information retrieval from unstructured text, and feature selection. The quality of the topics depends on the quality of the text preprocessing, the choice of topic modeling algorithm, and the number of topics specified.

LDA (Latent Dirichlet Allocation) treats each document as a mixture of topics and each topic as a distribution over keywords. Given the number of topics, the algorithm iteratively adjusts the topic distribution within documents and the keyword distribution within topics until it reaches a good composition of both.

A topic is a collection of prominent keywords, that is, the words with the highest probability in that topic, which helps identify what the topic is about.
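The "documents are mixtures of topics, topics are distributions over words" idea can be sketched with a few hand-picked numbers. The probabilities and the helper names `word_probability` and `top_keywords` below are assumptions made up for illustration; they are not fitted to any corpus and are not part of Gensim.

```python
# Toy numbers chosen for illustration only; they are not learned from data.
topic_word = {
    "politics": {"election": 0.5, "senate": 0.4, "match": 0.1},
    "sports": {"election": 0.05, "senate": 0.05, "match": 0.9},
}
# One document described as a mixture of the two topics.
doc_topic = {"politics": 0.8, "sports": 0.2}


def word_probability(word, doc_topic, topic_word):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(weight * topic_word[topic].get(word, 0.0)
               for topic, weight in doc_topic.items())


def top_keywords(topic, topic_word, n=2):
    """Return the n words with the highest probability in a topic."""
    words = topic_word[topic]
    return sorted(words, key=words.get, reverse=True)[:n]


for word in ("election", "senate", "match"):
    print(word, round(word_probability(word, doc_topic, topic_word), 3))
print(top_keywords("politics", topic_word))
```

LDA works in the opposite direction: it only sees the words in the documents and estimates both tables from them.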

In [1]:
import newspaper
import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt
%matplotlib inline
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
import pyLDAvis.gensim_models as gensimvis


In [2]:
import nltk
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))
nltk.download('punkt')

nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/pierluigi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/pierluigi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Getting the article from the URL

In [3]:
url = "https://www.foxnews.com/politics/republicans-respond-after-irs-whistleblower-says-hunter-biden-investigation-being-mishandled"

In [4]:
def get_article_info(url):
    # Create a newspaper Article object
    article = newspaper.Article(url)

    # Download and parse the article
    article.download()
    article.parse()

    # Extract the title, subtitle, description, and main text
    title = article.title.strip()
    subtitle = article.meta_data.get("description", "").strip()
    description = article.meta_description.strip()
    text = article.text.strip()

    # Set the subtitle to the description if it is empty
    if not subtitle:
        subtitle = description.strip()

    # Concatenate the extracted strings
    article_text = f"{title}\n\n{subtitle}\n\n{text}"

    # Return the concatenated string
    return article_text

In [5]:
article = get_article_info(url)

# Removing emails and newline characters

In [6]:
# Remove emails
text_data = re.sub(r'\S+@\S+', '', article)

# Remove newline characters
text_data = re.sub(r'\n', ' ', text_data)

print(text_data)

Republicans respond after IRS whistleblower says Hunter Biden investigation is being mishandled  Members of Congress are calling for more transparency from the Biden administration after an IRS whistleblower said an investigation into Hunter Biden is being mishandled.  Lawmakers on Capitol Hill are calling for the Biden administration to be held accountable for "blocking" Congress and the public from learning more about Biden family members’ business deals with China.  The congressional outcries come as a whistleblower within the Internal Revenue Service alleges an investigation into Hunter Biden is being mishandled by the Biden administration. The whistleblower also alleges "clear conflicts of interest" in the investigation.  "It’s deeply concerning that the Biden Administration may be obstructing justice by blocking efforts to charge Hunter Biden for tax violations," House Committee on Oversight and Accountability Chairman James Comer told Fox News on Wednesday.  Comer, R-Ky., also s
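The two substitutions can be checked on a short string before running them on a full article. This is a minimal sketch: the sample sentence and the helper name `clean_text` are made up for illustration, and the final whitespace-collapsing step is an extra that the cell above does not perform (the double spaces visible in the output come from the replaced newlines).

```python
import re


def clean_text(raw):
    # Same two substitutions as in the cell above
    no_emails = re.sub(r'\S+@\S+', '', raw)
    no_newlines = re.sub(r'\n', ' ', no_emails)
    # Extra step, not in the original cell: collapse runs of whitespace
    return re.sub(r'\s+', ' ', no_newlines).strip()


# Hypothetical sample text, not taken from the article
sample = "Contact tips@example.com for details.\n\nLawmakers responded on Wednesday."
print(clean_text(sample))
```

Note that `\S+@\S+` removes everything attached to the address up to the next whitespace, including any trailing punctuation.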

# Tokenize words and clean up the text

In [7]:
# Tokenize and lowercase the text; punctuation is dropped and deacc=True strips accents
data_words = simple_preprocess(text_data, deacc=True)

# Print the processed text
print(data_words)


['republicans', 'respond', 'after', 'irs', 'whistleblower', 'says', 'hunter', 'biden', 'investigation', 'is', 'being', 'mishandled', 'members', 'of', 'congress', 'are', 'calling', 'for', 'more', 'transparency', 'from', 'the', 'biden', 'administration', 'after', 'an', 'irs', 'whistleblower', 'said', 'an', 'investigation', 'into', 'hunter', 'biden', 'is', 'being', 'mishandled', 'lawmakers', 'on', 'capitol', 'hill', 'are', 'calling', 'for', 'the', 'biden', 'administration', 'to', 'be', 'held', 'accountable', 'for', 'blocking', 'congress', 'and', 'the', 'public', 'from', 'learning', 'more', 'about', 'biden', 'family', 'members', 'business', 'deals', 'with', 'china', 'the', 'congressional', 'outcries', 'come', 'as', 'whistleblower', 'within', 'the', 'internal', 'revenue', 'service', 'alleges', 'an', 'investigation', 'into', 'hunter', 'biden', 'is', 'being', 'mishandled', 'by', 'the', 'biden', 'administration', 'the', 'whistleblower', 'also', 'alleges', 'clear', 'conflicts', 'of', 'interest'

# Bigram and Trigram models

A bigram is a pair of adjacent words and a trigram is a sequence of three adjacent words; both are n-grams, where n is the number of words in the sequence. In language modeling, a bigram model predicts the next word from the one preceding word (for example, "dog" after "the"), and a trigram model predicts it from the two preceding words (for example, "brown" after "the quick").

Here the goal is different: Gensim's Phrases model detects words that frequently occur together and joins them into a single token, by default with an underscore, so that a pair such as "capitol hill" could become "capitol_hill". The min_count and threshold arguments control how readily phrases are formed; a higher threshold produces fewer phrases.

Treating such phrases as single tokens keeps more context in the data and can help downstream tasks such as topic modeling, text classification, and sentiment analysis capture the meaning of the text.

In [8]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# Print the bigram and trigram words for all the data words
for i in range(len(data_words)):
    print(f"Data words {i}:")
    print(f"Bigram words: {bigram_mod[data_words[i]]}")
    print(f"Trigram words: {trigram_mod[bigram_mod[data_words[i]]]}")


Data words 0:
Bigram words: ['r', 'e', 'p', 'u', 'b', 'l', 'i', 'c', 'a', 'n', 's']
Trigram words: ['r', 'e', 'p', 'u', 'b', 'l', 'i', 'c', 'a', 'n', 's']
Data words 1:
Bigram words: ['r', 'e', 's', 'p', 'o', 'n', 'd']
Trigram words: ['r', 'e', 's', 'p', 'o', 'n', 'd']
Data words 2:
Bigram words: ['a', 'f', 't', 'e', 'r']
Trigram words: ['a', 'f', 't', 'e', 'r']
Data words 3:
Bigram words: ['i', 'r', 's']
Trigram words: ['i', 'r', 's']
Data words 4:
Bigram words: ['w', 'h', 'i', 's', 't', 'l', 'e', 'b', 'l', 'o', 'w', 'e', 'r']
Trigram words: ['w', 'h', 'i', 's', 't', 'l', 'e', 'b', 'l', 'o', 'w', 'e', 'r']
Data words 5:
Bigram words: ['s', 'a', 'y', 's']
Trigram words: ['s', 'a', 'y', 's']
Data words 6:
Bigram words: ['h', 'u', 'n', 't', 'e', 'r']
Trigram words: ['h', 'u', 'n', 't', 'e', 'r']
Data words 7:
Bigram words: ['b', 'i', 'd', 'e', 'n']
Trigram words: ['b', 'i', 'd', 'e', 'n']
Data words 8:
Bigram words: ['i', 'n', 'v', 'e', 's', 't', 'i', 'g', 'a', 't', 'i', 'o', 'n']
Trigra
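The character-level output above arises because data_words is a flat list of tokens: each `data_words[i]` is a single string, and iterating over a string yields its characters. Phrases and Phraser expect every document to be a list of tokens, so the input should be a list of token lists. One possible way to get that shape is to treat each sentence as a document. The sketch below is an assumption, not the notebook's method: `split_into_documents` is a hypothetical helper, and its regex tokenizer only stands in for `simple_preprocess`.

```python
import re


def split_into_documents(text):
    """Split text into sentences and tokenize each one (hypothetical helper)."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    docs = [re.findall(r'[a-z]+', sentence.lower()) for sentence in sentences]
    return [doc for doc in docs if doc]  # drop empty sentences


text = ("An IRS whistleblower said the investigation is being mishandled. "
        "Lawmakers are calling for more transparency.")

# Flat token list, as in the notebook: one token is just a string
flat_tokens = re.findall(r'[a-z]+', text.lower())
print(list(flat_tokens[0]))  # iterating a string gives single characters

# List of token lists: the shape Phrases, Dictionary and doc2bow work with
documents = split_into_documents(text)
print(documents)
```

A list such as `documents` could then be passed to `gensim.models.Phrases`, and the resulting model applied one document at a time. Wrapping the flat list as `[data_words]` would also give a valid shape, but it treats the whole article as a single document.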

# Remove Stopwords, make bigrams and lemmatize

In [9]:
# removing stopwords
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatize(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each text with spaCy, keeping only tokens whose POS tag is allowed (see https://spacy.io/api/annotation)."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out


Lemmatization is a text preprocessing technique that involves reducing words to their base or dictionary form, which is known as a lemma. In contrast to stemming, which just chops off the ends of words to create a root form, lemmatization uses a vocabulary and morphological analysis to determine the lemma of each word based on its context. For example, the lemma of "am", "are", and "is" is "be". 

In [10]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

data_words_lemmatized = lemmatize(data_words_bigrams)

print(data_words_lemmatized)


[[], ['respond'], [], [], ['whistleblower'], ['say'], ['hunter'], [], ['investigation'], [], [], ['mishandle'], ['member'], [], [], [], ['call'], [], [], ['transparency'], [], [], [], ['administration'], [], [], [], ['whistleblower'], ['say'], [], ['investigation'], [], ['hunter'], [], [], [], ['mishandle'], ['lawmaker'], [], [], ['hill'], [], ['call'], [], [], [], ['administration'], [], [], ['hold'], ['accountable'], [], ['block'], [], [], [], ['public'], [], ['learn'], [], [], [], ['family'], ['member'], ['business'], ['deal'], [], [], [], ['congressional'], ['outcry'], ['come'], [], ['whistleblower'], [], [], ['internal'], ['revenue'], ['service'], ['allege'], [], ['investigation'], [], ['hunter'], [], [], [], ['mishandle'], [], [], [], ['administration'], [], ['whistleblower'], ['also'], ['allege'], ['clear'], ['conflict'], [], ['interest'], [], [], ['investigation'], [], ['deeply'], ['concern'], [], [], [], ['administration'], [], [], ['obstruct'], ['justice'], [], ['block'], ['e

# Create Dictionary and Corpus needed for Topic Modeling

- Creates a Gensim dictionary object id2word from the list of preprocessed and lemmatized texts data_words_lemmatized. The dictionary assigns a unique id to each word in the corpus.

- Assigns the preprocessed and lemmatized texts to texts.

- Creates a Gensim corpus object corpus from the dictionary and the list of preprocessed and lemmatized texts. The corpus is a list of bags-of-words, where each bag-of-words is a list of tuples. Each tuple represents a term and its frequency in the corresponding document.

- The first bag-of-words can then be inspected with corpus[0].

This code prepares the data for topic modeling with LDA by creating a dictionary of all unique words in the corpus and a corpus object with bag-of-words representations of the documents.

In [11]:
# Create Dictionary 
id2word = corpora.Dictionary(data_words_lemmatized)  

# Create Corpus 
texts = data_words_lemmatized

corpus = [id2word.doc2bow(text) for text in texts]


Gensim assigns a unique id to each word in the corpus.
Each document in the corpus is then a list of (word_id, word_frequency) pairs. For example, (8, 2) means that the word with id 8 occurs twice in that document.
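The (word_id, word_frequency) representation can be reproduced in a few lines of plain Python. This is a sketch of the idea only: `build_vocabulary` and `to_bow` are hypothetical helpers, the two sample documents are made up, and the ids assigned here will generally differ from the ones Gensim's Dictionary assigns.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in texts:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return a sorted list of (word_id, word_frequency) pairs for one document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Hypothetical lemmatized documents
sample_texts = [
    ["whistleblower", "allege", "investigation", "investigation"],
    ["administration", "block", "investigation"],
]
vocab = build_vocabulary(sample_texts)
bows = [to_bow(doc, vocab) for doc in sample_texts]
print(vocab)
print(bows)
```

In the first document the pair for "investigation" has a frequency of 2, which is the same kind of entry as the (8, 2) example.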

The LDA model is built with the following parameters:

- corpus is the bag-of-words corpus we created earlier, where each document is represented as a list of (term_id, frequency) tuples.
- id2word is the dictionary created earlier, which maps each term to a unique integer ID.
- num_topics specifies the number of topics to be generated by the LDA model.
- random_state sets the random seed for reproducibility of the results.
- update_every specifies how often the model parameters are updated (0 for batch learning, 1 or more for online learning).
- chunksize is the number of documents to be used in each training chunk.
- passes specifies the total number of passes through the corpus during training.
- alpha is the prior belief over the per-document topic distributions. 'auto' learns an asymmetric prior from the corpus, whereas 'symmetric' (the default) uses a fixed prior of 1/num_topics.
- per_word_topics makes the model also return, for each word in a document, its most likely topics sorted in descending order.



After initializing the model, we can train it further on additional documents by calling the update() method with a new corpus.

In [12]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

Each topic is a combination of keywords, and each keyword contributes a certain weight to the topic.

lda_model.print_topics() shows the top keywords for each topic together with their weights.

The keywords listed are the words with the highest probability in the topic, and the numbers are the probabilities of those words under the topic's word distribution.

In [13]:
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.139*"also" + 0.038*"deal" + 0.005*"indicator" + 0.005*"indicate" + '
  '0.005*"department" + 0.005*"behavior" + 0.005*"single" + 0.005*"choice" + '
  '0.005*"wrongdoe" + 0.005*"alone"'),
 (1,
  '0.142*"mishandle" + 0.068*"allege" + 0.004*"indicator" + 0.004*"department" '
  '+ 0.004*"behavior" + 0.004*"single" + 0.004*"wrongdoe" + 0.004*"indicate" + '
  '0.004*"either" + 0.004*"alone"'),
 (2,
  '0.005*"single" + 0.005*"either" + 0.005*"dealing" + 0.005*"alone" + '
  '0.005*"indicate" + 0.005*"wrongdoe" + 0.005*"indicator" + 0.005*"behavior" '
  '+ 0.005*"department" + 0.005*"need"'),
 (3,
  '0.043*"congressional" + 0.005*"single" + 0.005*"either" + 0.005*"alone" + '
  '0.005*"indicate" + 0.005*"wrongdoe" + 0.005*"indicator" + 0.005*"behavior" '
  '+ 0.005*"department" + 0.005*"need"'),
 (4,
  '0.043*"service" + 0.005*"single" + 0.005*"either" + 0.005*"alone" + '
  '0.005*"indicate" + 0.005*"wrongdoe" + 0.005*"indicator" + 0.005*"behavior" '
  '+ 0.005*"department" + 0.005*"ne
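The topic strings above are convenient to read but awkward to process. As a small sketch, a regular expression can turn one of them back into (keyword, weight) pairs; `parse_topic` is a hypothetical helper and the string is the beginning of topic 1 from the output above. Gensim's `show_topic()` method returns such pairs directly, so parsing is only needed when the formatted strings are all that is available.

```python
import re


def parse_topic(topic_string):
    """Turn '0.142*"mishandle" + ...' into [('mishandle', 0.142), ...]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


# First keywords of topic 1, copied from the printed output
topic_1 = '0.142*"mishandle" + 0.068*"allege" + 0.004*"indicator" + 0.004*"department"'
keywords = parse_topic(topic_1)
print(keywords)
```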

# Compute model Perplexity and Coherence score

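Perplexity and topic coherence provide a convenient way to measure how good a given topic model is.

The lower the perplexity, the better the model. Note that log_perplexity() returns the per-word likelihood bound rather than the perplexity itself, so a value closer to zero corresponds to a lower perplexity.
The higher the topic coherence, the more human-interpretable the topics are.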

In [14]:
# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  
# log_perplexity returns the per-word likelihood bound; perplexity is 2**(-bound),
# so a bound closer to zero means a lower (better) perplexity.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts = data_words_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -28.48042166238409

Coherence Score:  0.8090537520529155
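The value printed as "Perplexity" above is the per-word likelihood bound returned by log_perplexity(). Gensim reports the corresponding perplexity estimate as 2 raised to the negative bound, which can be reproduced as follows; `perplexity_from_bound` is a hypothetical helper name.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity estimate."""
    return 2 ** (-per_word_bound)


bound = -28.48042166238409  # value printed by log_perplexity above
perplexity = perplexity_from_bound(bound)
print(f"{perplexity:.3e}")
```

For this bound the estimate is on the order of 10^8. The figure is computed on the training corpus itself, so it says little about how the model would do on unseen text.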


In [19]:
# Visualize the topics
pyLDAvis.enable_notebook(local=True)
vis = gensimvis.prepare(lda_model, corpus, id2word)  # gensimvis is pyLDAvis.gensim_models, imported in the first cell
# pyLDAvis.display(vis)
pyLDAvis.save_html(vis, 'lda_plot.html')

Each bubble on the left-hand side represents a topic. The larger the bubble, the more prevalent the topic. A good topic model has fairly large, non-overlapping bubbles scattered across the chart rather than clustered in one quadrant.

A model with too many topics typically has many overlapping, small bubbles clustered in one region of the chart.
If you move the cursor over the different bubbles, you can see the keywords associated with each topic.
