`Zoumana KEITA, Data Scientist`

# Latent Dirichlet Allocation (LDA)

**Note**: you will need to unzip the data from the `data` folder in order to follow this notebook.  

LDA is a probabilistic model that discovers topics in a collection of documents and assigns each document to those topics.  
It relies on two sets of probabilities: 
- **P(word | topic)**: the probability that a particular word is associated with a particular topic. This first set of probabilities forms the **Word X Topic** matrix.  
- **P(topic | document)**: the probability that a particular topic is associated with a particular document. This second set of probabilities forms the **Topics X Documents** matrix.   

These probabilities are estimated for all words, topics and documents.    
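
To make the two matrices concrete, here is a minimal sketch on a four-headline toy corpus. The headlines and the `toy_` names are made up for illustration and are not part of the dataset. In scikit-learn, `components_` stores unnormalized topic-word weights, so each row is divided by its sum before reading it as P(word | topic).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Made-up headlines, for illustration only
toy_docs = [
    "police charge man over crash",
    "court hears murder case",
    "council election results announced",
    "govt urged to fund council",
]

toy_cv = CountVectorizer()
toy_dtm = toy_cv.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)

# P(topic | document): one row per document, one column per topic
doc_topic = toy_lda.fit_transform(toy_dtm)

# P(word | topic): normalize each row of the raw topic-word weights
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)

print(doc_topic.shape, topic_word.shape)
print(doc_topic.sum(axis=1))   # each document's topic probabilities sum to 1
print(topic_word.sum(axis=1))  # each topic's word probabilities sum to 1
```

Both matrices are stored with documents or topics as rows, so the "Word X Topic" view corresponds to the transpose of `topic_word`.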

For this tutorial, we will be using the Australian Broadcasting Corporation headlines dataset, available on Kaggle:   
https://www.kaggle.com/therohk/million-headlines 

## Import Useful Libraries 

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

## Load the Dataset

In [2]:
news_data = pd.read_csv("./data/news-data.csv")
news_data.shape

(1226258, 2)

In [3]:
news_data.sample(5)

| | publish_date | headline_text |
|---|---|---|
| 1215218 | 20200925 | beetaloo gas development lands council asked t... |
| 527404 | 20100324 | land buy gives threatened birds space to spread |
| 742690 | 20121122 | ashburton council sacks ceo |
| 766154 | 20130227 | daisy smith interviews casey palmer |
| 519049 | 20100212 | prison escapee shot recaptured |


Our data has over a million records and two columns: 
- the date on which a headline was published.  
- the headline itself.   

Looking at these 5 sampled rows, we can see that the topic of each headline is not given! So, we will use LDA to try to discover clusters (topics) in the news.   
Over **a million** records is a lot of data, so we will use only **20000** records to make the computation faster. You can increase the number of observations if you wish. 

## Preprocessing

In [7]:
NUM_SAMPLES = 20000 # The number of samples to use 
sample_df = news_data.sample(NUM_SAMPLES, replace=False).reset_index(drop=True)

In [8]:
sample_df.shape

(20000, 2)
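
Because `sample` draws rows at random, every run of the notebook selects different headlines, and therefore a different vocabulary and different topics. If reproducible results matter, one option is to pass a seed through the `random_state` parameter of `DataFrame.sample`. The sketch below uses a small made-up frame (`toy_news`) and an arbitrary seed value; neither comes from the original notebook.

```python
import pandas as pd

# Made-up stand-in for the real dataset
toy_news = pd.DataFrame({"headline_text": [f"headline {i}" for i in range(100)]})

SEED = 2021  # arbitrary value, chosen only for illustration

# Two draws with the same seed return exactly the same rows
first_draw = toy_news.sample(10, replace=False, random_state=SEED).reset_index(drop=True)
second_draw = toy_news.sample(10, replace=False, random_state=SEED).reset_index(drop=True)

print(first_draw.equals(second_draw))
```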

In [10]:
sample_df.sample(5) # randomly show 5 rows

| | publish_date | headline_text |
|---|---|---|
| 6779 | 20100722 | the super seed chia takes off |
| 9010 | 20160111 | tasmanian mother sentenced over death of young... |
| 16006 | 20141109 | a league live streaming updates |
| 17312 | 20181011 | renewed push to remove abortion from crime law... |
| 7497 | 20050802 | shires form regional local govt |


We are not interested in the **publish_date** column, since we will only be using the **headline_text** data.    

**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.     


By defining the **CountVectorizer** object as below, we ignore:   
- all terms that appear in more than 95% of the documents in our corpus. Terms above this threshold are considered not significant; most of them behave like `stopwords`.   

- all terms that appear in fewer than three documents.  

- the common English stop words, through `stop_words="english"`.  
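
The effect of the two thresholds is easier to see on a tiny example. The sketch below uses four made-up headlines and a lower `min_df` (2 instead of 3) so that something survives the filtering; only the thresholds differ from the vectorizer used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up headlines: "news" is in every document, "perth", "fire" and "flood" in only one
toy_docs = [
    "news rain sydney",
    "news rain perth",
    "news fire sydney",
    "news flood",
]

# max_df=0.95 drops terms found in more than 95% of documents ("news")
# min_df=2 drops terms found in fewer than 2 documents ("perth", "fire", "flood")
toy_cv = CountVectorizer(max_df=0.95, min_df=2)
toy_cv.fit(toy_docs)

kept_terms = sorted(toy_cv.vocabulary_)
print(kept_terms)  # ['rain', 'sydney']
```

Both thresholds count documents, not total occurrences: a word repeated ten times in a single headline still has a document frequency of 1.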

In [11]:
cv = CountVectorizer(max_df=0.95, min_df=3, stop_words="english")
dtm = cv.fit_transform(sample_df['headline_text'])

In [14]:
dtm

<20000x6425 sparse matrix of type '<class 'numpy.int64'>'
	with 89645 stored elements in Compressed Sparse Row format>

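We can observe that our Document X Term Matrix (dtm) has:  
- 20000 documents, and  
- 6425 distinct words   

We can also get all those words using the `get_feature_names()` method. In scikit-learn 1.0 and later this method is called `get_feature_names_out()`, and the old name was removed in version 1.2.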

In [18]:
feature_names = cv.get_feature_names()
len(feature_names) # show the total number of distinct words

6425

Let's have a look at some of the features that have been extracted from the documents.  

In [23]:
feature_names[6420:]

['zimbabwe', 'zimbabwean', 'zone', 'zones', 'zoo']
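
If the notebook has to run on both old and recent scikit-learn versions, a small compatibility helper can hide the rename. `vocabulary_terms` is a hypothetical helper written for this sketch, not a scikit-learn function, and the two short documents are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer


def vocabulary_terms(vectorizer):
    """Return the fitted vocabulary as a list of strings, whatever the scikit-learn version."""
    if hasattr(vectorizer, "get_feature_names_out"):
        # scikit-learn >= 1.0
        return [str(term) for term in vectorizer.get_feature_names_out()]
    # older scikit-learn releases
    return list(vectorizer.get_feature_names())


toy_cv = CountVectorizer().fit(["zoo zone", "zimbabwe zone"])
terms = vocabulary_terms(toy_cv)
print(terms)  # ['zimbabwe', 'zone', 'zoo']
```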

## LDA     
From our DTM, we can now build our LDA model to extract topics from the underlying texts. The number of topics to extract is a hyperparameter, so we do not know it in advance. In our case, we will use 7 topics.   
LDA is an iterative algorithm; we will run 30 iterations here, while the default value is 10.  

In [25]:
# Set the number of topics
NB_TOPICS = 7 

# Create the model
LDA_model = LatentDirichletAllocation(n_components = NB_TOPICS, max_iter = 30, random_state = 2021)

# Fit the model on the dtm
LDA_model.fit(dtm)

LatentDirichletAllocation(max_iter=30, n_components=7, random_state=2021)
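
Since the number of topics is a hyperparameter, one rough way to compare candidate values is the `perplexity` method of the fitted model (lower is better). The sketch below is only an illustration on a made-up toy corpus: the candidate values are arbitrary, and scoring on the training data, as done here for brevity, is optimistic. A held-out set would give a fairer comparison.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Made-up headlines, for illustration only
toy_docs = [
    "police charge man over crash",
    "man killed in car crash",
    "court hears murder case",
    "man guilty in murder case",
    "council election results announced",
    "govt urged to fund council",
    "australia wins world cup final",
    "england face australia in cup final",
]
toy_dtm = CountVectorizer().fit_transform(toy_docs)

perplexities = {}
for n_topics in (2, 3, 4):  # arbitrary candidate values
    model = LatentDirichletAllocation(n_components=n_topics, max_iter=30, random_state=2021)
    model.fit(toy_dtm)
    perplexities[n_topics] = model.perplexity(toy_dtm)

for n_topics, value in perplexities.items():
    print(n_topics, round(value, 2))
```

Perplexity does not always agree with how interpretable the topics look, so it is best used alongside a manual inspection of the top words.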

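### Show Stored Words   
Let's randomly have a look at some of the words that have been stored. Note that the vocabulary size shown below (6512) differs from the 6425 reported earlier, presumably because these cells were run on a different random sample.  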

In [14]:
len(feature_names)

6512

In [15]:
import random 
for index in range(15):
    # Draw a valid index whatever the vocabulary size
    random_word_ID = random.randint(0, len(feature_names) - 1)
    print(feature_names[random_word_ID])

critical
cover
prepare
named
day
gender
cold
chancellor
danny
quizzed
offices
november
check
vline
downturn


### Top Words Per Topic

In [16]:
len(LDA_model.components_[0])

6512

In [17]:
# Pick a single topic 
a_topic = LDA_model.components_[0]

# Get the indices that would sort this array
a_topic.argsort()

array([3639, 4609, 1216, ...,  488, 3071, 5073])

In [18]:
# Weight of the word at index 597 for this topic (note: per the argsort output above, the least representative word is at index 3639, not 597)
a_topic[597]

1.1428829090017396

In [19]:
# Weight of the word at index 3598 for this topic (note: per the argsort output above, the most representative word is at index 5073, not 3598)
a_topic[3598]

2.140750629758621
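
`argsort` returns indices ordered from the smallest weight to the largest, so the least representative word is at the first position of the result and the most representative at the last. The sketch below shows this on a made-up five-word topic; the words and weights are illustrative values, not model output.

```python
import numpy as np

# Made-up vocabulary and topic weights, for illustration only
toy_words = np.array(["body", "says", "home", "iraq", "australia"])
toy_weights = np.array([0.2, 3.1, 0.5, 1.7, 2.4])

order = toy_weights.argsort()   # indices sorted by increasing weight
least_idx = order[0]            # index of the smallest weight
most_idx = order[-1]            # index of the largest weight

# Last three indices, reversed so the most representative word comes first
top3_words = toy_words[order[-3:][::-1]]

print(order)                                       # [0 2 3 4 1]
print(toy_words[least_idx], toy_words[most_idx])   # body says
print(list(top3_words))
```

Reading the indices directly from `argsort` avoids hard-coding positions that change whenever the model is refitted on a different sample.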

Let's have a look at the top 10 words for the topic we picked above. They are printed in increasing order of weight, so the last word is the most representative.

In [20]:
top_10_words_indices = a_topic.argsort()[-10:]

for i in top_10_words_indices:
    print(cv.get_feature_names()[i])

support
home
government
pm
body
mp
new
australia
iraq
says


These words suggest a topic about government and politics. Let's have a look at all 7 topics found. 

In [26]:
for i, topic in enumerate(LDA_model.components_):
    print("THE TOP {} WORDS FOR TOPIC #{}".format(10, i))
    print([cv.get_feature_names()[index] for index in topic.argsort()[-10:]])
    print("\n")

THE TOP 10 WORDS FOR TOPIC #0
['weather', 'election', 'labor', 'urged', 'qld', 'act', 'council', 'nsw', 'new', 'govt']


THE TOP 10 WORDS FOR TOPIC #1
['report', 'indigenous', 'country', 'rural', 'charged', 'accused', 'new', 'health', 'calls', 'says']


THE TOP 10 WORDS FOR TOPIC #2
['queensland', 'charges', 'hospital', 'case', 'guilty', 'child', 'sex', 'murder', 'man', 'court']


THE TOP 10 WORDS FOR TOPIC #3
['face', 'england', 'years', 'win', 'australian', 'talks', 'wins', 'final', 'cup', 'world']


THE TOP 10 WORDS FOR TOPIC #4
['probe', 'dead', 'woman', 'killed', 'dies', 'car', 'crash', 'man', 'interview', 'police']


THE TOP 10 WORDS FOR TOPIC #5
['live', 'return', 'care', 'residents', 'test', 'australia', 'new', 'change', 'workers', 'day']


THE TOP 10 WORDS FOR TOPIC #6
['news', 'search', 'west', 'market', 'coronavirus', 'national', 'gold', 'farmers', 'sydney', 'coast']




### Attach Discovered Topic Labels to Original News

In [36]:
# Link documents to topics
final_topics = LDA_model.transform(dtm)

# Show the shape of the object 
print(final_topics.shape)

(20000, 7)

In [37]:
final_topics

array([[0.78546277, 0.03584818, 0.03573262, ..., 0.03573895, 0.03575926,
        0.03572223],
       [0.02396784, 0.85686803, 0.02382918, ..., 0.02381215, 0.02389657,
        0.02381165],
       [0.02042289, 0.02043044, 0.87724226, ..., 0.02047303, 0.0204082 ,
        0.0204082 ],
       ...,
       [0.02041586, 0.31497376, 0.02040817, ..., 0.02042139, 0.02040817,
        0.58295903],
       [0.07142857, 0.07142857, 0.07142857, ..., 0.57142857, 0.07142857,
        0.07142857],
       [0.02061265, 0.5724096 , 0.02043014, ..., 0.02040818, 0.32524833,
        0.02041491]])

**final_topics** contains, for each of our 20,000 documents, the probability that the document belongs to each of the 7 topics. This is a Document X Topics matrix. 
For example, below are the probability values for the document at index 4 (the fifth document).

In [33]:
final_topics[4]

array([0.02046722, 0.87731625, 0.02041653, 0.02040818, 0.0204244 ,
       0.02050817, 0.02045925])

In [34]:
final_topics[4].argmax()

1

This value (1) means that our LDA model thinks the document at index 4 most likely belongs to topic 1, i.e. the second topic, since topic indices start at 0.
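
The sketch below reuses the seven probabilities printed above to show how `argmax` turns them into a single topic index, and how to read that index. The variable names are introduced here for illustration.

```python
import numpy as np

# Probabilities printed above for the document at index 4
doc_probs = np.array([0.02046722, 0.87731625, 0.02041653, 0.02040818,
                      0.0204244, 0.02050817, 0.02045925])

topic_index = int(doc_probs.argmax())       # 0-based index of the largest probability
topic_position = topic_index + 1            # 1-based position: the "second" topic
confidence = float(doc_probs[topic_index])  # probability of the winning topic

print(topic_index, topic_position, round(confidence, 3))  # 1 2 0.877
```

Keeping the winning probability alongside the index is useful: a document whose best topic only reaches, say, 0.3 is a much weaker assignment than this one.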

### Combination with the original data
Let's create a new column called **Topic N°** that holds the index of the topic to which each document most likely belongs.

In [38]:
sample_df["Topic N°"] = final_topics.argmax(axis=1)

In [39]:
sample_df.head()

| | publish_date | headline_text | Topic N° |
|---|---|---|---|
| 0 | 20120712 | rac welcomes new laws | 0 |
| 1 | 20090608 | aboriginal groups praised for weed removal | 1 |
| 2 | 20171211 | josh homme queens of the stone age kicks photo... | 2 |
| 3 | 20061106 | iraq likely to be top issue for us voters | 3 |
| 4 | 20150423 | qld country hour 23 april 2015 | 1 |


According to our LDA model:   
- the first document belongs to topic 0.  
- the second document belongs to topic 1. 
- the third document belongs to topic 2.  

and so on.   
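
Topic numbers are easier to work with once they carry a readable name and once we know how many documents fall into each topic. The sketch below rebuilds the five rows shown above in a small frame. The label names in `topic_labels` are hypothetical: they are one possible reading of the top words per topic, not something the model produces.

```python
import pandas as pd

# The five rows displayed above (headline text copied as shown, including the truncation)
toy_df = pd.DataFrame({
    "headline_text": [
        "rac welcomes new laws",
        "aboriginal groups praised for weed removal",
        "josh homme queens of the stone age kicks photo...",
        "iraq likely to be top issue for us voters",
        "qld country hour 23 april 2015",
    ],
    "Topic N°": [0, 1, 2, 3, 1],
})

# Hypothetical, hand-written labels based on the top words of each topic
topic_labels = {0: "government", 1: "rural / health", 2: "courts / crime", 3: "sport"}
toy_df["Topic label"] = toy_df["Topic N°"].map(topic_labels)

# Number of documents assigned to each topic
topic_counts = toy_df["Topic N°"].value_counts().sort_index()

print(toy_df[["headline_text", "Topic label"]])
print(topic_counts)
```

Note that the headline about Iraq lands in the topic labelled "sport" here, a reminder that the assignments are model guesses and worth checking by eye.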

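## Some Visualization
We will be using the `pyLDAvis` module to visualize the topics associated with our documents. Note that recent pyLDAvis releases (3.4 and later) expose the scikit-learn helper as `pyLDAvis.lda_model` instead of `pyLDAvis.sklearn`.   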

In [40]:
import pyLDAvis.sklearn

In [41]:
pyLDAvis.enable_notebook() # To enable the visualization on the notebook

In [42]:
panel = pyLDAvis.sklearn.prepare(LDA_model, dtm, cv, mds='tsne') # Create the panel for the visualization
panel



### Some Comments On The Graphic     

- By selecting a particular term on the right, we can see which topic(s) it belongs to.    
- Conversely, by choosing a topic on the left, we can see all its terms, from most to least relevant.  

**If you liked this kernel, please upvote. I am also open to suggestions**