# Topic Modeling: Latent Dirichlet Allocation (LDA)

This notebook introduces **Latent Dirichlet Allocation (LDA)**, a widely used unsupervised learning technique for **topic modeling**.

In topic modeling, we typically have a large amount of **unlabeled** text (a corpus) divided into many documents. The goal is to cluster those documents into topic groups that must be discovered: we don't know the topics in advance, since detecting them is the job of the method.

The [LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) approach was published in 2003 by Blei, Ng & Jordan:

`../literature/BleiNgJordan_LDA_2003.pdf`

The method uses the [Dirichlet distribution](https://en.wikipedia.org/wiki/Dirichlet_distribution), hence the name; the authors did not invent the distribution. The Dirichlet distribution is a multivariate distribution whose samples are vectors of non-negative components that sum to 1.
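To make the sum-to-1 property concrete, here is a minimal sketch that draws a few samples with NumPy. The concentration values in `alpha` are illustrative assumptions, not values from the paper or the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Concentration parameters of a 3-dimensional Dirichlet (illustrative values)
alpha = [0.5, 0.5, 0.5]

# Each row is one draw: non-negative components that add up to 1
samples = rng.dirichlet(alpha, size=5)

print(samples.round(3))
print(samples.sum(axis=1))  # every entry is 1 up to floating-point error
```

Each row can be read as a set of topic weights for one document, which is how LDA uses the distribution.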

LDA makes these **assumptions**:

- Documents with similar topics use similar groups of words.
- Latent topics can be found by searching for groups of words that frequently occur together in the documents.

These assumptions are translated into probability distributions:

- Documents are probability distributions over `K` latent topics; these latent topics are like bins, and for each document, the weight of each topic is discovered.
- Topics are probability distributions over words. The idea is analogous to the previous point. In practice, once a topic has been discovered, we check the top 10 words of its topic-word distribution and infer what the theme is.

![LDA Topic-Document-Word Distributions](../pics/LDA_distributions.png)
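The two kinds of distributions can be sketched with a toy example. Everything here (the vocabulary, `K = 3`, and the Dirichlet concentration of 0.5) is an assumption chosen for illustration; the snippet only shows how a document could be generated from a document-topic distribution and a set of topic-word distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and number of latent topics (illustrative values)
vocabulary = ["election", "vote", "health", "patients", "music", "album"]
K = 3

# Topic-word distributions: one row per topic, each row sums to 1
topic_word = rng.dirichlet(np.full(len(vocabulary), 0.5), size=K)

# Document-topic distribution of a single document: sums to 1
doc_topic = rng.dirichlet(np.full(K, 0.5))

# Generate a 10-word document: pick a topic, then pick a word from that topic
words = []
for _ in range(10):
    k = rng.choice(K, p=doc_topic)
    w = rng.choice(len(vocabulary), p=topic_word[k])
    words.append(vocabulary[w])

print(doc_topic.round(2))
print(words)
```

Fitting LDA goes the other way: given only the words, it estimates `doc_topic` and `topic_word`.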

LDA is initialized as follows:

- We have `M` documents, each document `d` with `N_d` words in it, from a corpus vocabulary consisting of `W` words.
- We set a fixed number of latent topics `K` to be discovered, similar to the number of clusters in K-means; e.g., `K = 50`.
- We randomly assign to each document a weight/percentage for each topic: `alpha_k`, `k = 1:K`, `sum(alpha_k) = 1`.
- From this random document-topic assignment, we derive for each topic the weights/percentages associated with each word: `beta_w`, `w = 1:W`, `sum(beta_w) = 1`.

Notes on the initialization:
- The LDA approach assumes that the documents were generated from these distributions; although that's not true, it's a useful construct.
- The first assignment is arbitrary and does not carry any meaning, but we iterate to improve it.

Once we have performed the initialization, the optimization algorithm works as follows:
1. We iterate over every **word** in every **document**, and for each **topic** we compute:
    - `p(topic k | document d) = proportion of words in document d that are assigned to topic k`.
    - `p(word w | topic k) = proportion of assignments to topic k, over all documents, that come from word w`.
2. We re-assign each **word** a new **topic**, where we choose topic k with this probability:
    - `p(topic k | document d) * p(word w | topic k): probability that topic k generated word w`.
    - The document-topic proportions are re-computed after the words have been re-assigned.
3. We repeat steps 1 and 2 until the assignments reach a steady state.
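The loop above can be sketched as a small sampler over a toy corpus. This is an illustration under assumptions, not the notebook's method: the corpus, the smoothing constant `smooth`, and the fixed 50 sweeps are made up, and scikit-learn's `LatentDirichletAllocation` (used later) fits the model with a variational Bayes method rather than with this sampling loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is a list of word ids from a vocabulary of size W
docs = [[0, 1, 0, 2], [3, 4, 3, 5], [0, 2, 1, 1], [4, 5, 3, 3]]
W, K = 6, 2
smooth = 0.1  # small smoothing constant (assumption) to avoid zero probabilities

# Initialization: assign a random topic to every word occurrence
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]

# Count tables derived from the assignments
doc_topic = np.zeros((len(docs), K))  # words in document d assigned to topic k
topic_word = np.zeros((K, W))         # occurrences of word w assigned to topic k
for d, doc in enumerate(docs):
    for n, w in enumerate(doc):
        doc_topic[d, z[d][n]] += 1
        topic_word[z[d][n], w] += 1

for _ in range(50):  # fixed number of sweeps instead of a convergence check
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            # Remove the current assignment of this word from the counts
            k_old = z[d][n]
            doc_topic[d, k_old] -= 1
            topic_word[k_old, w] -= 1

            # Step 1: p(topic k | document d) and p(word w | topic k)
            p_topic_doc = (doc_topic[d] + smooth) / (doc_topic[d].sum() + K * smooth)
            p_word_topic = (topic_word[:, w] + smooth) / (topic_word.sum(axis=1) + W * smooth)

            # Step 2: sample a new topic proportionally to the product
            p = p_topic_doc * p_word_topic
            k_new = int(rng.choice(K, p=p / p.sum()))

            # Put the word back with its new topic
            z[d][n] = k_new
            doc_topic[d, k_new] += 1
            topic_word[k_new, w] += 1

# Normalized document-topic proportions after sampling
theta = (doc_topic + smooth) / (doc_topic + smooth).sum(axis=1, keepdims=True)
print(theta.round(2))
```

Removing a word's current assignment before computing the probabilities keeps the word from voting for its own topic.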

Note that before applying LDA it is convenient to (1) remove stop words and (2) reduce the tokens/words to a base form using stemming or lemmatization.
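As a rough sketch of these two preprocessing steps: the stop-word list below is scikit-learn's built-in English list, while `crude_stem` and `preprocess` are hypothetical helpers written only for this illustration. `crude_stem` is a toy suffix stripper, not a real stemming algorithm; in practice a stemmer or lemmatizer from a library such as NLTK or spaCy would be used instead.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def crude_stem(token):
    """Toy suffix stripper (assumption); stands in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and reduce tokens to a crude base form."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS]


print(preprocess("The voters were voting in the elections"))
```

The result shows why a real stemmer matters: the toy rule turns "voting" into "vot" and leaves "voter" as a separate token.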

Overview of contents:
1. Load the Dataset
2. Create a Document-Term Matrix (DTM) and Fit the Latent Dirichlet Allocation (LDA) Model to It
3. Explore the Discovered Topics
4. Assign Topics to Articles

*Disclaimer: I made this notebook while following the Udemy course [NLP - Natural Language Processing with Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python/) by José Marcial Portilla. The original course notebooks and materials were provided with a download link; I haven't found a repository to fork from.*

## 1. Load the Dataset 

In [1]:
import pandas as pd

In [8]:
# Load the NPR dataset: around 12k articles; we want to discover and assign topics
npr = pd.read_csv('../data/npr.csv')

In [9]:
npr.shape

(11992, 1)

In [10]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [24]:
# Text of an article i
i = 10
print(npr["Article"][i])



## 2. Create a Document-Term Matrix (DTM) and Fit the Latent Dirichlet Allocation (LDA) Model to It

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

In [18]:
# Important parameters of CountVectorizer
# max_df: When building the vocabulary 
#   ignore terms that have a document frequency strictly higher than the given threshold,
#   i.e., corpus-specific stop words.
#   If float, the parameter represents a proportion of documents, integer absolute counts.
# min_df: When building the vocabulary
#   ignore terms that have a document frequency strictly lower than the given threshold.
#   This value is also called cut-off in the literature.
#   If float, the parameter represents a proportion of documents, integer absolute counts
# Stop words: we remove them
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [19]:
# We build the Document-Term Matrix
# We don't split into train/test sets here: this is unsupervised learning, so there are no labels
dtm = cv.fit_transform(npr['Article'])

In [20]:
from sklearn.decomposition import LatentDirichletAllocation

In [22]:
# Latent Dirichlet Allocation
# n_components: number of topics
LDA = LatentDirichletAllocation(n_components=7,random_state=42)

In [23]:
# We fit the Latent Dirichlet Allocation model to our Document-Term Matrix
# This can take a while, we're dealing with a large number of documents!
LDA.fit(dtm)

LatentDirichletAllocation(n_components=7, random_state=42)

## 3. Explore the Discovered Topics

In [26]:
# Get all words in the DTM, i.e., our vocabulary
len(cv.get_feature_names_out())

54777

In [28]:
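# Explore some of those vocabulary words
import random
# get_feature_names() was removed in recent scikit-learn versions; use get_feature_names_out()
feature_names = cv.get_feature_names_out()
for i in range(10):
    random_word_id = random.randint(0, len(feature_names) - 1)
    print(feature_names[random_word_id])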

masks
rehabbed
baffled
patti
tiered
repayment
vladmir
leil
weissman
discreet


In [32]:
# Matrix with topic-word weights/probabilities: topics x words
LDA.components_.shape

(7, 54777)
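In scikit-learn, the rows of `components_` hold unnormalized topic-word weights (pseudo-counts), so they do not sum to 1 on their own. A minimal sketch of the normalization follows, using a small hypothetical array in place of `LDA.components_`:

```python
import numpy as np

# Hypothetical pseudo-counts standing in for LDA.components_ (2 topics x 4 words)
components = np.array([
    [8.0, 1.0, 0.5, 0.5],
    [0.2, 0.3, 4.5, 5.0],
])

# Divide each row by its sum to obtain a probability distribution over words
topic_word_probs = components / components.sum(axis=1, keepdims=True)

print(topic_word_probs.round(2))
print(topic_word_probs.sum(axis=1))  # each row sums to 1
```

The ranking of words within a topic is the same before and after normalization, so the top-word listings below are unaffected.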

In [34]:
# Take a single topic k
k = 0
single_topic = LDA.components_[k]

In [37]:
# Get indices that sort this array: [0.9 0.7 0.3] -> [2 1 0]
# These are the word indices ordered according to their weight for topic k
# Mind the order: argsort sorts ascending by default, so the highest-weight words come last
single_topic.argsort()

array([21349, 37109, 17024, ..., 47210, 43172, 42993])
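A short sketch of the `argsort` ordering on a small array; the weights are arbitrary illustrative values:

```python
import numpy as np

weights = np.array([0.9, 0.7, 0.3, 0.8])

ascending = weights.argsort()  # indices from smallest to largest weight
top_2 = ascending[-2:]         # the two largest, still in ascending order
descending = ascending[::-1]   # reversed: largest weight first

print(ascending)   # [2 1 3 0]
print(top_2)       # [3 0]
print(descending)  # [0 3 1 2]
```

Slicing with `[-15:]` therefore returns the 15 highest-weight words with the most important one last.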

In [40]:
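# Get the 15 most significant words per topic
# We can see how we could assign topics/themes:
# 1: Economy & Finances
# 2: Military and Security Affairs
# 3: Family & Resources
# 4: Health
# 5: Politics and Elections
# 6: Lifestyle
# 7: Education
# get_feature_names() was removed in recent scikit-learn versions; use get_feature_names_out()
feature_names = cv.get_feature_names_out()
for index, topic in enumerate(LDA.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index+1}')
    # argsort is ascending, so the last 15 indices are the highest-weight words
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')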

THE TOP 15 WORDS FOR TOPIC #1
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']


THE TOP 15 WORDS FOR TOPIC #2
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


THE TOP 15 WORDS FOR TOPIC #3
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


THE TOP 15 WORDS FOR TOPIC #4
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


THE TOP 15 WORDS FOR TOPIC #5
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']


THE TOP 15 WORDS FOR TOPIC #6
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think',

## 4. Assign Topics to Articles

In [41]:
# In order to assign a topic to each article, we need to combine
# - the DTM matrix: articles x words
# - the LDA weights: topics x words
# Conceptually: (articles x topics) <- (articles x words) x (words x topics)
# transform() does not perform a plain matrix product; it infers the topic
# distribution of each article, and each row of the result sums to 1
topic_results = LDA.transform(dtm)
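
The shape and row sums of the `transform` output can be checked on a toy corpus. The documents and `n_components=2` are assumptions made for this sketch; which topic index ends up attached to which theme depends on the fit, so no particular assignment is claimed here.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration)
toy_docs = [
    "stocks market investors shares",
    "market shares investors profit",
    "team match goal player",
    "player goal team season",
]

toy_dtm = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_lda.fit(toy_dtm)

# transform() returns one row per document and one column per topic
toy_topics = toy_lda.transform(toy_dtm)

print(toy_topics.shape)           # (documents, topics)
print(toy_topics.sum(axis=1))     # each row sums to 1
print(toy_topics.argmax(axis=1))  # most probable topic index per document
```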

In [42]:
dtm.shape

(11992, 54777)

In [43]:
LDA.components_.shape

(7, 54777)

In [44]:
topic_results.shape

(11992, 7)

In [46]:
# Topic weights / probabilities of a single article
d = 0
topic_results[d].round(2)

array([0.02, 0.68, 0.  , 0.  , 0.3 , 0.  , 0.  ])

In [47]:
# Topic index with the highest probability / weight
# Indices are 0-based: index 1 corresponds to "TOPIC #2" in the listing above
topic_results[d].argmax()

1

In [49]:
# Create new column with most probable topic index per article
npr['Topic'] = topic_results.argmax(axis=1)

In [50]:
npr.head()

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
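The numeric topic index can be mapped to the theme names inferred earlier from the top words. The sketch below is an assumption-laden illustration: `topic_labels` is a hypothetical dictionary built from those hand-assigned themes, and the small `toy` DataFrame stands in for the `Topic` column of `npr`. Note that `argmax` yields 0-based indices, while the topic listing is numbered from 1.

```python
import pandas as pd

# Hand-assigned theme names; keys are 0-based topic indices (listing number minus 1)
topic_labels = {
    0: "Economy & Finances",
    1: "Military and Security Affairs",
    2: "Family & Resources",
    3: "Health",
    4: "Politics and Elections",
    5: "Lifestyle",
    6: "Education",
}

# Stand-in for the `Topic` column of the first five articles
toy = pd.DataFrame({"Topic": [1, 1, 1, 1, 2]})

# Translate each topic index into its human-readable label
toy["Topic Label"] = toy["Topic"].map(topic_labels)
print(toy)
```

The same `map` call could be applied to `npr['Topic']`, assuming the theme names really match the fitted topics.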
