# Topic Modeling
In topic modeling, we pass text to a (usually unsupervised) machine learning algorithm, which groups words into topics based on how they co-occur in the text. The most popular topic modeling algorithm is Latent Dirichlet Allocation (LDA).

## Load prerequisites

In [2]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

from nltk import download
download('punkt_tab')
download('wordnet')
download('stopwords')


[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/juliusc/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/juliusc/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/juliusc/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Define sample data

In [None]:
# Sample text data
documents = [
    "I love to eat pizza. Pizza is my favorite food.",
    "The cat is playing with the ball.",
    "I enjoy reading books on machine learning.",
    "The dog is chasing the cat.",
    "Pizza and pasta are popular Italian dishes."
]

## Preprocess data
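Preprocessing shrinks the vocabulary, and with it the dimensionality of the problem: the text is lowercased, punctuation and common stop words are removed, and each remaining word is reduced to its dictionary form (lemmatization). Note that `WordNetLemmatizer` treats every word as a noun by default, so verb forms such as "chasing" are left unchanged.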

In [None]:
# Preprocess the documents
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    tokens = word_tokenize(text.lower())
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word.isalnum() and word not in stop_words]
    return tokens

preprocessed_docs = [preprocess(doc) for doc in documents]
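
To see what the individual steps do without downloading any NLTK data, here is a minimal standard-library sketch. The small `TOY_STOP_WORDS` set and the regex tokenizer are stand-ins assumed for illustration, not NLTK's stop-word list or tokenizer, and no lemmatization is performed.

```python
import re

# Hypothetical, hand-picked stop words for illustration only;
# NLTK's English list is much longer.
TOY_STOP_WORDS = {"i", "to", "is", "my", "the", "with", "on", "and", "are"}

def toy_preprocess(text):
    """Lowercase, split into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in TOY_STOP_WORDS]

print(toy_preprocess("I love to eat pizza. Pizza is my favorite food."))
# ['love', 'eat', 'pizza', 'pizza', 'favorite', 'food']
```

Repeated words such as "pizza" are kept on purpose: the word counts are what the later steps work with.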

## Create dictionary and corpus
First, a dictionary is created that maps each unique token to an integer id. Then every document is converted into a bag-of-words vector, a list of `(token_id, count)` pairs, and these vectors together form the corpus.

In [4]:
from gensim import corpora

# Create a dictionary representation of the documents.
dictionary = corpora.Dictionary(preprocessed_docs)

# Convert each document into a bag-of-words vector: a list of (token_id, count) pairs.
corpus = [dictionary.doc2bow(doc) for doc in preprocessed_docs]
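
The mapping from tokens to ids and from documents to `(token_id, count)` pairs can be sketched in plain Python. This is only an illustration of the idea, not gensim's implementation: `token2id`, `toy_docs`, and `toy_doc2bow` are hypothetical names, and gensim may assign ids in a different order.

```python
from collections import Counter

# Two already-preprocessed documents (hand-written for this sketch).
toy_docs = [
    ["love", "eat", "pizza", "pizza", "favorite", "food"],
    ["pizza", "pasta", "popular", "italian", "dish"],
]

# Assign each unique token an integer id in order of first appearance.
# (gensim's own id assignment may order tokens differently.)
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def toy_doc2bow(doc):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_corpus = [toy_doc2bow(doc) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```

In the first document "pizza" appears twice, so its pair carries a count of 2. Tokens that never occurred in the training documents have no id and are dropped.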

## Build LDA Model
Based on the dictionary and corpus, we can now train an LDA model. Training is randomized, so the topics and weights vary between runs unless `random_state` is fixed.

In [5]:
from gensim.models.ldamodel import LdaModel

# Set parameters for LDA
num_topics = 2  # The number of topics to find
passes = 15     # The number of passes through the corpus during training

# Train the LDA model
lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=passes)


## Explore Topics
Each topic is a weighted list of words. The weight is the probability of the word within that topic, so the highest-weighted words are the ones most closely associated with it.

In [10]:
# Print the topics
for i, topic in lda_model.print_topics(num_words=4):
    print(f"Topic {i}: {topic}")


Topic 0: 0.122*"cat" + 0.073*"chasing" + 0.073*"dog" + 0.073*"ball"
Topic 1: 0.125*"pizza" + 0.073*"eat" + 0.073*"favorite" + 0.073*"learning"
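
If the words and weights are needed as data rather than as a formatted string, the printed form can be parsed with a regular expression. The sketch below works on the strings copied from the output above; `parse_topic` is a hypothetical helper, and gensim's `show_topic` method is an alternative that returns such pairs directly from the model.

```python
import re

# Topic strings copied from the output above.
topic_strings = {
    0: '0.122*"cat" + 0.073*"chasing" + 0.073*"dog" + 0.073*"ball"',
    1: '0.125*"pizza" + 0.073*"eat" + 0.073*"favorite" + 0.073*"learning"',
}

def parse_topic(topic):
    """Turn a 'weight*"word" + ...' string into (word, weight) pairs."""
    return [(word, float(weight))
            for weight, word in re.findall(r'([0-9.]+)\*"([^"]+)"', topic)]

for topic_id, topic in topic_strings.items():
    print(topic_id, parse_topic(topic))
```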


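## Infer topics for new text
We can pass new text to the trained model to estimate its topic distribution, that is, the probability that the text belongs to each topic. Words that are not in the dictionary are ignored by `doc2bow`.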

In [7]:
new_doc = "I love eating Italian pizza"
new_doc_preprocessed = preprocess(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc_preprocessed)

# Infer the topic distribution for the new document
topics = lda_model.get_document_topics(new_doc_bow)
print("Topic distribution:", topics)


Topic distribution: [(0, 0.40136445), (1, 0.59863555)]
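
To turn the distribution into a single label, one simple option is to pick the topic with the highest probability. The sketch below uses the values copied from the output above; a fresh training run will likely give different numbers.

```python
# Topic distribution copied from the output above.
topics = [(0, 0.40136445), (1, 0.59863555)]

# Pick the (topic_id, probability) pair with the highest probability.
best_topic, best_prob = max(topics, key=lambda pair: pair[1])
print(f"Most likely topic: {best_topic} (p = {best_prob:.2f})")
# Most likely topic: 1 (p = 0.60)
```

Here the new document leans towards topic 1, the "pizza" topic, but with roughly 60% against 40% the assignment is not clear-cut.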
