# Topic Modeling

### **What is Topic Modeling?**
* A coarse-level analysis of what's in a text collection
* **Topic**: the subject (theme) of a discourse
* Each topic is represented as a probability distribution over words
* A document is assumed to be a mixture of topics
* What's known:
    * The text collection or corpus
    * The number of topics
* What's not known:
    * The actual topics
    * Topic distribution for each document
* Essentially, topic modeling is a text clustering problem: documents and words are clustered simultaneously
* Different topic modeling approaches are available:
    * Probabilistic Latent Semantic Analysis (PLSA)
    * Latent Dirichlet Allocation (LDA)
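
To make "a topic is a word distribution" and "a document is a mixture of topics" concrete, here is a minimal sketch. The vocabulary, the two topics, and all probabilities are hypothetical values chosen for illustration, not learned from any corpus.

```python
# Hypothetical toy example: two topics over a four-word vocabulary.
vocab = ["game", "team", "software", "court"]

# Each topic is a probability distribution over the vocabulary (each row sums to 1).
topics = {
    "sports": [0.50, 0.40, 0.05, 0.05],
    "tech_law": [0.05, 0.05, 0.50, 0.40],
}

# A document is a mixture of topics (the weights sum to 1).
doc_mixture = {"sports": 0.7, "tech_law": 0.3}

# p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)
word_probs = [
    sum(doc_mixture[t] * topics[t][j] for t in topics)
    for j in range(len(vocab))
]

for word, p in zip(vocab, word_probs):
    print(f"{word}: {p:.3f}")
```

In topic modeling the situation is reversed: only the words of each document are observed, and both `topics` and `doc_mixture` have to be inferred.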

### **Latent Dirichlet Allocation (LDA)**
* Generative model for a document `d`:
    * Choose the length of document `d`
    * Choose a mixture of topics for document `d`
    * For each topic in the mixture, draw words from that topic's multinomial distribution until that topic's quota of the document is filled
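
The generative story above can be sketched with NumPy. This is an illustration under stated assumptions, not a training procedure: the toy vocabulary, the Dirichlet parameters `alpha` and `beta`, and the Poisson-distributed document length are all choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and number of topics.
vocab = np.array(["game", "team", "olympic", "software", "microsoft", "court"])
num_topics = 2

# Assumed Dirichlet priors for topic mixtures (alpha) and topic-word distributions (beta).
alpha = np.full(num_topics, 0.5)
beta = np.full(len(vocab), 0.5)

# One word distribution per topic, shape (num_topics, vocab_size).
topic_word = rng.dirichlet(beta, size=num_topics)

def generate_document(mean_length=8):
    """Sample one bag-of-words document following the three steps above."""
    length = max(1, rng.poisson(mean_length))   # 1. choose the document length
    theta = rng.dirichlet(alpha)                # 2. choose the topic mixture
    quotas = rng.multinomial(length, theta)     # number of words owed to each topic
    words = []
    for k, quota in enumerate(quotas):          # 3. fill each topic's quota
        word_ids = rng.choice(len(vocab), size=quota, p=topic_word[k])
        words.extend(vocab[word_ids].tolist())
    return theta, quotas, words

theta, quotas, words = generate_document()
print("topic mixture:", theta.round(2))
print("words per topic:", quotas)
print("document:", words)
```

Word order carries no meaning here, since the model treats a document as a bag of words. Fitting LDA means running this story in reverse: given only the documents, estimate `topic_word` and each document's `theta`.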

### **Topic Modeling in Practice**
* How many topics? Finding, or even guessing, the right number of topics is hard
* Interpreting the topics
    * Topics are just word distributions
    * Making sense of words / generating labels is subjective

### **Important Concepts**
* Topic Modeling is a great tool for exploratory text analysis - What are the documents (tweets, reviews, news, articles, ...) about?
* Several Python libraries make topic modeling straightforward to apply

### **Working with LDA in Python**
* Several packages are available, such as `gensim` and `lda`
* Pre-processing text:
    * Tokenize, normalize (lowercase)
    * Stop word removal
    * Stemming
* Convert tokenized documents to a document-term matrix
* Build LDA models on the doc-term matrix
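
Before turning to gensim, it may help to see what the pre-processing and document-term steps produce. The sketch below uses only the standard library. The regex tokenizer, the tiny stop list, and the three sentences are hypothetical stand-ins, and stemming is left out to keep the example dependency-free.

```python
import re
from collections import Counter

# Hypothetical mini stop list; in practice use a fuller one such as NLTK's.
STOP_WORDS = {"the", "a", "in", "of", "and", "to", "is"}

def preprocess(text):
    """Lowercase, tokenize on alphabetic runs, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # A stemmer (for example NLTK's PorterStemmer) would be applied to each kept token here.
    return [t for t in tokens if t not in STOP_WORDS]

docs = [
    "The team won the game in overtime.",
    "Microsoft released a security update.",
    "The court upheld the ruling, and the team appealed.",
]

tokenized = [preprocess(d) for d in docs]

# Vocabulary: one column per distinct term, in sorted order.
vocab = sorted({t for doc in tokenized for t in doc})

# Document-term matrix: one row per document, one count per vocabulary term.
doc_term = []
for doc in tokenized:
    counts = Counter(doc)
    doc_term.append([counts[term] for term in vocab])

print(vocab)
for row in doc_term:
    print(row)
```

gensim stores the same information sparsely: `Dictionary` plays the role of `vocab`, and `doc2bow` returns only the non-zero `(term_id, count)` pairs of each row.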

In [10]:
import pandas as pd
pd.set_option('display.width', 1000)
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')

# sample the news data stratifying by topic
news_data = (
    pd.read_csv('../datasets/News_Classification/train.csv')
    .groupby('Class Index', group_keys=False)
    .apply(lambda x: x.sample(n=50))['Description']
    .str.lower()
    .replace(r'\(.*\)', '', regex=True)
    .replace(r'[^a-zA-Z\ ]', '', regex=True)
    .to_list()
)

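# Pre-process data: tokenize, then drop stop words plus a few corpus-specific filler words.
# Note: word_tokenize needs the NLTK 'punkt' tokenizer data ('punkt_tab' on recent NLTK releases).
stop_words = {*stopwords.words('english'), 'said', 'us', 'ap', 'one', 'new'}  # build the set once, not per word
for i, text in enumerate(news_data):
    news_data[i] = [word for word in word_tokenize(text) if word not in stop_words]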

[nltk_data] Downloading package stopwords to /home/user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [11]:
from gensim.models.ldamodel import LdaModel
from gensim.corpora import Dictionary

num_topics = 8

dictionary = Dictionary(news_data)
corpus = [dictionary.doc2bow(doc) for doc in news_data]
ldamodel = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=50)
for i in range(num_topics):
    print(ldamodel.print_topic(topicno=i, topn=5))

0.005*"security" + 0.005*"game" + 0.004*"southern" + 0.004*"day" + 0.004*"european"
0.007*"president" + 0.007*"olympic" + 0.005*"first" + 0.005*"committee" + 0.004*"home"
0.009*"pm" + 0.005*"three" + 0.005*"china" + 0.005*"search" + 0.004*"st"
0.006*"nation" + 0.005*"sunday" + 0.005*"iraq" + 0.004*"last" + 0.004*"today"
0.005*"software" + 0.004*"companies" + 0.004*"would" + 0.004*"state" + 0.004*"court"
0.006*"president" + 0.005*"first" + 0.004*"last" + 0.004*"million" + 0.004*"corp"
0.006*"group" + 0.006*"wednesday" + 0.005*"thursday" + 0.005*"yesterday" + 0.005*"prime"
0.009*"microsoft" + 0.006*"state" + 0.006*"first" + 0.004*"may" + 0.004*"company"


### **Important Concepts**
* Topic modeling is an exploratory tool frequently used for text mining
* Latent Dirichlet Allocation is a generative model used extensively for modeling large text corpora
* LDA can also be used as a feature extraction technique: each document's topic proportions can serve as features for text classification and other tasks