# Topic Modeling

**Topic modeling** is a technique in natural language processing (NLP) that aims to identify hidden thematic structures or topics within a collection of documents. It is an unsupervised machine learning approach that automatically clusters words or documents together based on their co-occurrence patterns, statistical distributions, or semantic similarities.

The goal of topic modeling is to discover latent topics that can explain the main themes or subjects present in the text data. Each topic is represented as a probability distribution over words, indicating the likelihood of certain words appearing in that topic. Documents are then represented as a mixture of these topics, indicating the proportion of each topic present in a particular document.
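
To make the "mixture of topics" idea concrete, here is a minimal sketch in plain Python. The vocabulary, the two topics, and all the probabilities are made up for illustration; they are not learned from any data in this notebook.

```python
# Hypothetical toy vocabulary and two made-up topics.
# Each topic is a probability distribution over the vocabulary (sums to 1).
vocab = ["goal", "match", "vote", "election"]
topic_word = {
    "sports":   [0.50, 0.40, 0.05, 0.05],
    "politics": [0.05, 0.05, 0.40, 0.50],
}

# A made-up document that is 75% "sports" and 25% "politics".
doc_topic = {"sports": 0.75, "politics": 0.25}

# The document's word probabilities are the topic distributions
# weighted by the document's topic proportions.
doc_word = [
    sum(doc_topic[t] * topic_word[t][j] for t in topic_word)
    for j in range(len(vocab))
]

for word, p in zip(vocab, doc_word):
    print(f"{word:10} {p:.4f}")
```

Because the document leans toward the "sports" topic, words such as `goal` and `match` end up more likely than `vote` or `election`, but the latter still carry some probability.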

In [None]:
# Import libraries
import os
import re
import numpy as np
import pandas as pd
from pprint import pprint
import random
import string
from nltk import word_tokenize
from nltk.corpus import stopwords
from gensim.corpora import Dictionary
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_colwidth', 100)

Let's read in the Wikipedia data we've used in previous notebooks.

In [None]:
df = pd.read_csv("../data/supplementary_content/people_wiki.csv")

We'll define a few functions that will help us clean the text and remove any stop words:

In [None]:
def clean(text):
    """
    Cleans up text data using a series of regular expression patterns.
    """
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)  # Remove text in square brackets
    text = re.sub(r'\(.*?\)', '', text)  # Remove text in parentheses
    text = re.sub(r'\w*\d\w*', '', text)  # Remove tokens containing digits
    text = re.sub(r"\w+…|…", "", text)  # Remove ellipsis (and last word)
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)  # Remove punctuation
    text = re.sub(r"\s+", " ", text).strip()  # Collapse whitespace left by the removals
    return text
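
To see what the individual patterns do, the sketch below applies most of them one at a time to a made-up sentence and prints the intermediate results. The sample text and the step labels are illustrative only, the ellipsis pattern is left out, and the whitespace collapse runs last so that gaps left by the removals are cleaned up too.

```python
import re
import string

# Made-up sample sentence, not taken from the dataset.
sample = "Jane Doe (born 1970) is a Physicist[1] who won 2 awards."

# Each step is a (label, function) pair applied in order.
steps = [
    ("lowercase", lambda t: t.lower()),
    ("drop [bracketed] text", lambda t: re.sub(r"\[.*?\]", "", t)),
    ("drop (parenthesized) text", lambda t: re.sub(r"\(.*?\)", "", t)),
    ("drop tokens containing digits", lambda t: re.sub(r"\w*\d\w*", "", t)),
    ("drop punctuation",
     lambda t: re.sub(f"[{re.escape(string.punctuation)}]", "", t)),
    ("collapse whitespace", lambda t: re.sub(r"\s+", " ", t).strip()),
]

text = sample
for label, step in steps:
    text = step(text)
    print(f"{label:30} -> {text!r}")
```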

In [None]:
# Build the stopword set once instead of on every call
STOPWORDS = set(stopwords.words("english"))

def remove_stopwords_and_tokenize(text):
    """
    Tokenizes text, then removes NLTK English stopwords and single-character tokens.
    """
    tokens = word_tokenize(text)  # Tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]  # Remove stopwords
    tokens = [t for t in tokens if len(t) > 1]  # Remove short tokens
    return tokens

In [None]:
# Apply the functions to our text data
df["clean_text"] = df.text.apply(clean)
df["tokens"] = df.clean_text.apply(remove_stopwords_and_tokenize)

In [None]:
# Quickly visualize 
df.head()

## Apply LDA 
One popular topic modeling algorithm is **Latent Dirichlet Allocation (LDA)**, which assumes that each document is a mixture of topics and each topic is a distribution over words. LDA identifies the latent topics and their corresponding word distributions by iteratively learning the topic-word and document-topic distributions. We'll use the `gensim` library as the Python interface to apply LDA.
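
For reference, the generative story that LDA assumes can be written compactly. The notation is introduced here and follows the usual textbook presentation rather than anything specific to this notebook: $K$ topics, $D$ documents, $N_d$ words in document $d$, Dirichlet parameters $\alpha$ and $\beta$, topic-word distributions $\phi_k$, document-topic distributions $\theta_d$, topic assignments $z_{d,n}$, and observed words $w_{d,n}$.

```latex
\begin{aligned}
\phi_k   &\sim \mathrm{Dirichlet}(\beta),        & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha),       & d &= 1, \dots, D \\
z_{d,n}  &\sim \mathrm{Categorical}(\theta_d),   & n &= 1, \dots, N_d \\
w_{d,n}  &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
```

Training works in the opposite direction: given only the observed words $w_{d,n}$, it estimates the $\theta_d$ and $\phi_k$ that best explain them.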

In [None]:
# Create a dictionary representation of the documents.
dictionary = Dictionary(df["tokens"])

# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)
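
The idea behind this filter can be sketched in plain Python. The function below is a hypothetical stand-in that mirrors the documented meaning of `no_below` and `no_above`; it is not gensim's code, and gensim's own implementation may differ in details such as rounding. The toy documents are made up.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Return tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents (hypothetical helper)."""
    n_docs = len(docs)
    # Count each token once per document it appears in.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return {
        token
        for token, freq in doc_freq.items()
        if freq >= no_below and freq <= no_above * n_docs
    }

# Made-up tokenized documents.
toy_docs = [
    ["singer", "album", "music"],
    ["singer", "band", "music"],
    ["player", "team", "music"],
    ["player", "league", "music"],
]

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

Here `music` is dropped because it appears in every document, and words like `album` are dropped because they appear in only one, which leaves `player` and `singer`.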

In [None]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in df["tokens"]]
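
A bag-of-words vector is a list of `(token_id, count)` pairs. The sketch below reproduces that shape with a hypothetical `token2id` mapping and a small helper written for illustration; it shows the idea only and does not call gensim.

```python
from collections import Counter

# Hypothetical token-to-id mapping standing in for a fitted dictionary.
token2id = {"album": 0, "music": 1, "singer": 2}

def to_bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs.
    Tokens missing from the mapping are ignored."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# "guitar" is not in the mapping, so it is skipped.
bow = to_bow(["singer", "music", "music", "guitar"], token2id)
print(bow)
```

Word order is discarded: only which words occur, and how often, is passed on to the model.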

In [None]:
from gensim.models import LdaModel

# Build LDA model
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, random_state=100,
                chunksize=200, passes=1)

In [None]:
def get_document_topic_table(lda_model, corpus, texts=None):
    """
    Generates a table with the dominant topic of each document

    @params
    lda_model: gensim.models.ldamodel.LdaModel,
    corpus: list, bag-of-words corpus
    texts: unused, kept so that existing calls keep working

    @returns
    pd.DataFrame: dominant topic, its proportion, and its keywords for each document
    """
    # Collect one row per document, then build the DataFrame in one go
    rows = []

    # Get main topic in each document
    for i, row_list in enumerate(lda_model[corpus]):
        # row_list holds (topic_id, proportion) pairs; keep the largest proportion
        topic_num, prop_topic = max(row_list, key=lambda x: x[1])
        wp = lda_model.show_topic(topic_num)
        topic_keywords = ", ".join(word for word, prop in wp)
        rows.append({
            'best_topic': topic_num,
            'prop_topic': prop_topic,
            'topic_keywords': topic_keywords,
            'document_num': i,
        })
    return pd.DataFrame(
        rows, columns=['best_topic', 'prop_topic', 'topic_keywords', 'document_num']
    )


We'll create a document table that lists, for each document, its `best_topic`, `prop_topic`, and `topic_keywords`, along with a `document_num` that serves as a document identifier.

In [None]:
document_topic_df = get_document_topic_table(lda_model=lda_model, corpus=corpus, texts=df["tokens"])
document_topic_df
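
The core step in building this table is picking the topic with the largest proportion for each document. A minimal sketch of that step, using made-up `(topic_id, proportion)` pairs in the shape the model returns for a single document:

```python
# Made-up topic proportions for one document: (topic_id, proportion) pairs.
doc_topics = [(3, 0.12), (7, 0.61), (15, 0.27)]

# The dominant topic is the pair with the largest proportion.
best_topic, prop_topic = max(doc_topics, key=lambda pair: pair[1])
print(best_topic, prop_topic)
```

Keeping only the dominant topic is a simplification: a document with proportions such as 0.35 and 0.33 for two topics is labeled with the first one even though both matter almost equally.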

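Now that we have our LDA model and our document table, we can write a few functions that look up a document's dominant topic and return the `k` documents in which that topic is most prominent. In our example, each document describes a `person` from the Wikipedia dataset, so the result is a list of people who share the same dominant topic.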

In [None]:
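def get_topic_id(doc_id):
    """
    Gets the id of the dominant topic of the selected document

    @params:
    doc_id: str, document identifier (URI)

    @returns:
    topic id of the document's dominant topic, or -1 if doc_id is not found
    """
    # Look up the row index of the document by its URI
    matches = df.index[df["URI"] == doc_id]
    if len(matches) == 0:
        return -1
    return document_topic_df["best_topic"][matches[0]]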

def get_matching_topics_docs(topic_id):
    """
    Gets the documents whose dominant topic is topic_id

    @params:
    topic_id: int, topic to look up

    @returns:
    matched_topics: list of (topic_id, prop_topic, document_num) tuples
    """
    matched_topics = []
    for i, row in document_topic_df.iterrows():
        if row["best_topic"] == topic_id:
            topic_prop_doc = (topic_id, row["prop_topic"], i)
            matched_topics.append(topic_prop_doc)

    return matched_topics

def get_top_k_topics(matched_topics, k):
    """
    Gets the k matched documents with the highest topic proportion

    @params:
    matched_topics: list of (topic_id, prop_topic, document_num) tuples
    k: int, number of documents to return

    @returns:
    k_topics_df: pd.DataFrame of the top k matched documents
    """
    # Sort by topic proportion, largest first
    top_k = sorted(matched_topics, key=lambda x: x[1], reverse=True)
    k_topics_df = pd.DataFrame(columns=["doc_id", "topic_id", "topic_prop", "title"])
    for i, (topic_id, topic_prop, doc_num) in enumerate(top_k[:k]):
        k_topics_df.at[i, 'doc_id'] = df["URI"][doc_num]
        k_topics_df.at[i, 'topic_id'] = topic_id
        k_topics_df.at[i, 'topic_prop'] = topic_prop
        k_topics_df.at[i, 'title'] = df["name"][doc_num]
    return k_topics_df

def recommend_k_topics(doc_id, k):
    """
    Identifies the document's dominant topic and gets the k documents
    in which that topic is most prominent

    @params:
    doc_id: str, document id
    k: int, number of matched documents to return

    @returns:
    pd.DataFrame, or None if doc_id is not found
    """
    topic_id = get_topic_id(doc_id)
    if topic_id == -1:
        return None
    matched_topics = get_matching_topics_docs(topic_id)
    return get_top_k_topics(matched_topics, k)


In [None]:
k_topics_df=recommend_k_topics(doc_id="<http://dbpedia.org/resource/Alfred_J._Lewy>",k=10)
k_topics_df
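
The lookup performed by the functions above boils down to three steps: find the query document's dominant topic, keep the documents that share it, and rank them by topic proportion. Here is a self-contained sketch of that logic on made-up rows; the names `person_a` to `person_e`, the topic ids, and the proportions are all hypothetical.

```python
# Hypothetical rows: (document name, dominant topic id, proportion of that topic).
docs = [
    ("person_a", 4, 0.91),
    ("person_b", 2, 0.75),
    ("person_c", 4, 0.48),
    ("person_d", 4, 0.66),
    ("person_e", 2, 0.52),
]

def recommend(name, docs, k):
    """Return up to k rows sharing the dominant topic of `name`,
    ordered by topic proportion (largest first)."""
    topic_by_name = {doc_name: topic for doc_name, topic, _ in docs}
    if name not in topic_by_name:
        return []
    topic = topic_by_name[name]
    same_topic = [row for row in docs if row[1] == topic]
    return sorted(same_topic, key=lambda row: row[2], reverse=True)[:k]

top = recommend("person_c", docs, k=2)
print(top)
```

As in the functions above, the query document is not excluded from its own results, so it can show up in the list when its topic proportion is high enough.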

Note that for the sake of example, we trained our LDA model for only one pass. In real-life applications, the number of passes would typically be much higher, which generally yields better topics.