# Topic Modeling using Gensim-LDA in Python

Topic modeling is a technique for extracting hidden topics from large volumes of text. A topic model is a probabilistic model that captures information about the themes present in the text.

Example: a newspaper corpus may contain topics such as economics, sports, politics, and weather.

Topic models are useful for document clustering, organizing large blocks of textual data, information retrieval from unstructured text, and feature selection. Finding good topics depends on the quality of the text preprocessing, the choice of topic modeling algorithm, and the number of topics specified for that algorithm.

LDA (Latent Dirichlet Allocation) treats each document as a mixture of topics and each topic as a distribution over keywords. Once you give the algorithm the number of topics, it iteratively rearranges the topic distribution within documents and the keyword distribution within topics until it reaches a good composition of topic-keyword distributions.
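
To make the "document as a mixture of topics" idea concrete, here is a minimal sketch in plain Python. The topic names, words, and probabilities are made up for illustration and are not the output of any trained model; `word_probability` is a hypothetical helper, not part of gensim.

```python
# Hypothetical topic-keyword distributions: each topic assigns a
# probability to every word, and each topic's probabilities sum to 1.
topics = {
    "sports": {"match": 0.5, "team": 0.4, "budget": 0.1},
    "economics": {"budget": 0.6, "market": 0.3, "team": 0.1},
}

# Hypothetical document-topic distribution: this document is
# 70% about sports and 30% about economics.
doc_topic_mix = {"sports": 0.7, "economics": 0.3}

vocabulary = sorted({word for dist in topics.values() for word in dist})


def word_probability(word, doc_topic_mix, topics):
    """Probability of a word in the document: sum over topics of
    P(topic | document) * P(word | topic)."""
    return sum(
        weight * topics[topic].get(word, 0.0)
        for topic, weight in doc_topic_mix.items()
    )


for word in vocabulary:
    print(f"{word}: {word_probability(word, doc_topic_mix, topics):.2f}")
```

LDA works in the opposite direction: it only sees the words in each document and has to infer both sets of distributions.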

A topic is simply a collection of its most prominent keywords, that is, the words with the highest probability in that topic, which help identify what the topic is about.

In [5]:
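import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel
from gensim.utils import simple_preprocess
import spacy
import newspaper  # article download and parsing (e.g. the newspaper3k package)
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # named pyLDAvis.gensim in pyLDAvis < 3.0
import matplotlib.pyplot as plt
%matplotlib inline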

In [6]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/pierluigi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# Load the 'en_core_web_sm' model from spaCy
nlp = spacy.load('en_core_web_sm')

# Function to preprocess the text data
def preprocess(text):
    # Create a spaCy doc object
    doc = nlp(text)
    # Remove stop words and punctuation, and lemmatize the tokens
    tokens = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct]
    return tokens

# Function to generate LDA model and topics
def generate_lda_topics(text, num_topics=5):
    # Preprocess the text data
    tokens = preprocess(text)

    # Create a dictionary from the preprocessed tokens
    dictionary = Dictionary([tokens])

    # Create a bag-of-words representation of the preprocessed text data
    # (the whole article is treated as a single document here)
    corpus = [dictionary.doc2bow(tokens)]

    # Train the LDA model on the corpus
    lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)

    # Print the topics and their top words
    for topic in lda_model.show_topics():
        print(topic)

    # Compute the coherence score of the LDA model
    coherence_model = CoherenceModel(model=lda_model, texts=[tokens], dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    print(f'Coherence score: {coherence_score}')

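    # Visualize the LDA model with pyLDAvis; return the display object so the
    # notebook renders it when this call is the last expression in a cell
    vis = gensimvis.prepare(lda_model, corpus, dictionary)
    return pyLDAvis.display(vis)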

# Get the article URL from user input
article_url = input('Enter the URL of the article: ')

# Download and extract the article text using the newspaper library
article = newspaper.Article(url=article_url)
article.download()
article.parse()
article_text = article.text

# Generate the LDA model and topics
generate_lda_topics(article_text)
