# Topic Modeling on News Snippets Using LDA
**Author:** Virginia Herrero

## Import Libraries and Download Resources

Import the libraries needed for text preprocessing and topic modeling, and download the required NLTK resources.

In [1]:
# Text processing
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from gensim.utils import simple_preprocess

# Download NLTK resources
nltk.download("stopwords")
nltk.download("wordnet")

# Topic Modeling
import gensim
import gensim.corpora as corpora
from gensim.models import LdaMulticore, CoherenceModel

# Utilities
from pprint import pprint

[nltk_data] Downloading package stopwords to C:\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Define the Corpus

In natural language processing, a corpus is a collection of written or spoken texts that serves as the dataset for language-related tasks. The corpus is analyzed to identify language patterns and typically requires preprocessing and transformation into a format suitable for machine learning models.

In [2]:
corpus = [
    "The stock market closed higher today as tech shares rallied amid strong earnings reports.",
    "A major earthquake struck the coastal city early this morning, causing widespread damage.",
    "The government announced new policies aimed at reducing carbon emissions by 2030.",
    "Scientists discovered a new species of dinosaur in the remote mountains of Argentina.",
    "The local football team won the championship after a thrilling final match.",
    "Health officials urge citizens to get vaccinated as flu season approaches.",
    "A breakthrough in renewable energy technology promises cheaper solar panels.",
    "International leaders met to discuss trade agreements and economic cooperation.",
    "A popular film festival opened this weekend, showcasing independent movies from around the world.",
    "The city council approved plans for a new public park to promote green spaces."
]

## Text Preprocessing

After defining the corpus, the next step is text preprocessing. This step cleans the raw text by tokenizing each document, removing stopwords, and reducing words to a base form so that the data is suitable for modeling.

In [3]:
# Set stopwords
stop_w = set(stopwords.words("english"))

In [4]:
# Tokenize the corpus
def doc_to_tokens(texts):
    """
    Tokenize a list of documents into clean lowercase words.

    Parameters:
    ----------
    texts (list of str): List of raw text documents.

    Yields:
    ----------
    list of str: Lowercased tokens for each document, with punctuation,
                 digits, and accents removed (simple_preprocess also drops
                 tokens shorter than 2 or longer than 15 characters).
    """
    for doc in texts:
        yield simple_preprocess(doc, deacc=True)

tokens = list(doc_to_tokens(corpus))
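
To make the tokenization step concrete, the sketch below approximates the effect of `simple_preprocess(doc, deacc=True)` using only the standard library. The function name `rough_preprocess` is hypothetical, and the regular expression is an approximation rather than Gensim's actual implementation. The length limits of 2 and 15 characters are assumed to match Gensim's defaults.

```python
import re
import unicodedata


def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate Gensim-style tokenization (illustrative sketch only)."""
    # Strip accents: decompose characters, then drop combining marks
    decomposed = unicodedata.normalize("NFD", text)
    no_accents = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Keep runs of word characters that contain no digits
    words = re.findall(r"[^\W\d]+", no_accents.lower())
    # Drop tokens that are too short, too long, or start with an underscore
    return [
        w for w in words
        if min_len <= len(w) <= max_len and not w.startswith("_")
    ]


sample = "The government announced new policies aimed at reducing carbon emissions by 2030."
print(rough_preprocess(sample))
```

Note that the year "2030" disappears because tokens containing digits are not kept, which is worth remembering when numbers carry meaning in a corpus.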

In [5]:
# Remove stopwords
def rm_stopwords(docs):
    """
    Remove English stopwords from tokenized documents.

    Parameters:
    ----------
    docs (list of list of str): Tokenized documents (list of words).

    Returns:
    ----------
    list of list of str: Tokenized documents with stopwords removed.
    """
    return [[word for word in doc if word not in stop_w] for doc in docs]

tokens = rm_stopwords(tokens)

In [6]:
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Lemmatize each token (WordNetLemmatizer treats tokens as nouns by default),
# then stem the lemma to collapse any remaining inflected forms
docs = [
    [stemmer.stem(lemmatizer.lemmatize(token)) for token in doc]
    for doc in tokens
]

print("Sample processed documents:")
print(docs[:2])

Sample processed documents:
[['stock', 'market', 'close', 'higher', 'today', 'tech', 'share', 'ralli', 'amid', 'strong', 'earn', 'report'], ['major', 'earthquak', 'struck', 'coastal', 'citi', 'earli', 'morn', 'caus', 'widespread', 'damag']]


## Create Dictionary

In Gensim, a dictionary is a mapping between each unique word (token) in the corpus and an integer ID. It serves as the vocabulary that converts text into the numerical form required by machine learning models. In topic modeling, the dictionary gives every token a consistent numeric identifier, which is then used to build the bag-of-words corpus and train models like LDA.

In [7]:
# Create a dictionary representation of the documents
word_dict = corpora.Dictionary(docs)

# Print the first 10 (id, token) pairs
print("Sample dictionary token-id pairs:")
print(list(word_dict.items())[:10])

Sample dictionary token-id pairs:
[(0, 'amid'), (1, 'close'), (2, 'earn'), (3, 'higher'), (4, 'market'), (5, 'ralli'), (6, 'report'), (7, 'share'), (8, 'stock'), (9, 'strong')]
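
The mapping idea can be sketched in plain Python. The helper `build_vocab` below is hypothetical and is not Gensim's implementation. It assigns IDs to unseen tokens in alphabetical order within each document, which is an assumption chosen to mirror the ordering visible in the output above.

```python
def build_vocab(tokenized_docs):
    """Map each unique token to an integer ID (illustrative sketch only)."""
    token2id = {}
    for doc in tokenized_docs:
        # Visit each document's unique tokens alphabetically
        for token in sorted(set(doc)):
            if token not in token2id:
                # The next free ID equals the current vocabulary size
                token2id[token] = len(token2id)
    return token2id


toy_docs = [
    ["stock", "market", "close", "higher"],
    ["market", "earthquak", "struck"],
]
vocab = build_vocab(toy_docs)
print(vocab)
```

Tokens already seen in an earlier document, such as "market" here, keep their original ID, so the same word always maps to the same number.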


## Create Bag-of-Words

A bag of words (BoW) is a simple, widely used representation of text in natural language processing. It treats a document as a "bag" of individual words, ignoring grammar and word order but keeping track of how many times each word appears. Each document becomes a vector of word counts over the dictionary's vocabulary. Gensim stores this vector sparsely as a list of (token ID, count) pairs, omitting words that do not occur in the document. The result is a numerical representation of word frequency that can be fed to machine learning models.


In [8]:
# Create the bag-of-words corpus
bow_corpus = [word_dict.doc2bow(doc) for doc in docs]

# Print the bag-of-words for the first document
print("Sample bag-of-words representation for first document:")
print(bow_corpus[0])

Sample bag-of-words representation for first document:
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1)]
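
Every count in the output above is 1 because no token repeats within the first document. The following standard-library sketch shows how counts greater than 1 arise. The helper `to_bow` and the toy vocabulary are hypothetical, and skipping unknown tokens is an assumption about how a fixed vocabulary would be applied.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Convert tokens to sorted (token_id, count) pairs (illustrative sketch)."""
    # Count occurrences by ID, skipping tokens missing from the vocabulary
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


token2id = {"market": 0, "stock": 1, "tech": 2}
doc = ["stock", "market", "stock", "ralli"]
bow = to_bow(doc, token2id)
print(bow)  # [(0, 1), (1, 2)]
```

Here "stock" appears twice and so gets a count of 2, "tech" is absent and therefore omitted, and "ralli" is dropped because it is not in the toy vocabulary.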
