`Zoumana KEITA, Data Scientist`

# Latent Dirichlet Allocation (LDA)

**Note**: you will need to unzip the data from the `data` folder in order to follow this notebook.  

LDA is a probabilistic model that discovers topics in a collection of documents and assigns each document to those topics.  
It relies on two sets of probabilities: 
- **P(word | topic)**: the probability that a particular word is associated with a particular topic. This first set of probabilities forms the **Word X Topic** matrix.  
- **P(topic | document)**: the probability that a particular topic is associated with a particular document. This second set of probabilities forms the **Topic X Document** matrix.   

These probabilities are estimated for all words, topics and documents.    
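
To make the two matrices concrete, here is a minimal sketch with made-up numbers. The vocabulary, the probabilities and the variable names are illustrative assumptions, not values from this notebook. Under LDA, the probability of a word in a document is obtained by mixing the topic-word distributions with the document's topic proportions:

```python
import numpy as np

# Hypothetical toy setup: 2 topics over a 4-word vocabulary.
vocab = ["police", "court", "rain", "farmers"]

# P(word | topic): one row per topic, each row sums to 1.
p_word_given_topic = np.array([
    [0.50, 0.40, 0.05, 0.05],   # a "crime"-like topic
    [0.05, 0.05, 0.40, 0.50],   # a "rural"-like topic
])

# P(topic | document): one row per document, each row sums to 1.
p_topic_given_doc = np.array([
    [0.9, 0.1],   # document 0 is mostly about topic 0
    [0.2, 0.8],   # document 1 is mostly about topic 1
])

# P(word | document) = sum over topics of P(topic | document) * P(word | topic)
p_word_given_doc = p_topic_given_doc @ p_word_given_topic

for doc_id, row in enumerate(p_word_given_doc):
    print(doc_id, dict(zip(vocab, row.round(3).tolist())))
```

Here the matrices are stored with one row per topic and one row per document, which is also the orientation scikit-learn uses.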

For this tutorial, we will use the Australian Broadcasting Corporation headlines dataset, available on Kaggle:   
https://www.kaggle.com/therohk/million-headlines 

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [3]:
news_data = pd.read_csv("./data/news-data.csv")

In [5]:
news_data.shape

(1103663, 2)

In [6]:
news_data.head()

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


Our data contains over a million records and two columns: 
- the date on which a headline was published.  
- the headline itself.   

The first 5 rows show that the headlines come without a topic label, so we will use LDA to try to discover clusters of news.   
**A million** records is a lot of data, so we will use only **12000** records to make the computation faster.   

## Preprocessing.    

In [7]:
NUM_SAMPLES = 12000 # The number of samples to use

In [8]:
sample_df = news_data.sample(NUM_SAMPLES, replace=False).reset_index(drop=True)

In [9]:
sample_df.shape

(12000, 2)

In [10]:
sample_df.head()

Unnamed: 0,publish_date,headline_text
0,20060531,closer am1
1,20121017,national rural news for wednesday 171012
2,20090109,bail has been granted in a perth court to three
3,20170814,barnaby joyce caught up in citizenship debacle
4,20070423,police investigate portland abduction attempt sex


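We are not interested in the **publish_date** column, since we will only use the **headline_text** data. To turn the headlines into word counts, we use scikit-learn's `CountVectorizer`. Two of its parameters control which terms end up in the vocabulary:  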

**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

By defining the **CountVectorizer** object as below, we ignore:   
- all terms that appear in more than 95% of the documents in our corpus. Terms above this threshold carry little signal; most of them are `stopwords`.   

- all terms that appear in fewer than two documents.  
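
The two thresholds can be illustrated with a minimal pure-Python sketch. The helper `build_vocabulary` and the toy headlines are hypothetical, and the whitespace tokenization is a simplification of what `CountVectorizer` actually does. The point is that both thresholds apply to the number of documents containing a term, not to its total number of occurrences:

```python
from collections import Counter

def build_vocabulary(docs, max_df=0.95, min_df=2):
    """Keep terms whose document frequency lies in [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    max_count = max_df * n_docs  # a float max_df is a proportion of documents
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)

docs = [
    "police charge man",
    "man faces court",
    "police say man missing",
    "rain helps farmers",
]
print(build_vocabulary(docs, max_df=0.5, min_df=2))
```

With these toy settings, "man" is dropped because it appears in 3 of the 4 documents (more than 50%), and every word seen in a single document is dropped by `min_df=2`, which leaves only "police".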

In [27]:
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words="english")

In [28]:
dtm = cv.fit_transform(sample_df['headline_text'])

In [29]:
dtm

<12000x6506 sparse matrix of type '<class 'numpy.int64'>'
	with 54104 stored elements in Compressed Sparse Row format>

We can observe that our Document X Term Matrix (dtm) has:  
- 12000 documents, and  
- 6506 distinct words.   
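
The output above also reports 54104 stored elements, which tells us how sparse the matrix is. A quick back-of-the-envelope calculation, using only the numbers printed above:

```python
n_docs, n_terms = 12000, 6506   # shape of the DTM reported above
stored = 54104                  # non-zero entries reported above

density = stored / (n_docs * n_terms)   # fraction of cells that are non-zero
avg_terms_per_doc = stored / n_docs     # distinct vocabulary terms per headline

print(f"density: {density:.4%}")
print(f"average distinct vocabulary terms per headline: {avg_terms_per_doc:.2f}")
```

Fewer than 0.1% of the cells are non-zero, which is why a sparse format is used: each headline keeps only a handful of vocabulary terms.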

We can also get all those words using the `get_feature_names()` function (deprecated in scikit-learn 1.0 in favour of `get_feature_names_out()`, and removed in 1.2).

In [30]:
feature_names = cv.get_feature_names()
len(feature_names) # show the total number of distinct words

6506

Let's have a look at some of the features that have been extracted from the documents.  

In [31]:
feature_names[6500:]

['zidane', 'zimbabwe', 'zone', 'zoo', 'zoos', 'zvonareva']

## LDA.     
From our DTM, we can now build our LDA model to extract topics from the underlying texts. The number of topics to extract is a hyperparameter, so we do not know it in advance. In our case, we will use 7 topics.   
LDA is an iterative algorithm. We will run 30 iterations in our case; the default value of `max_iter` is 10.  

In [32]:
NUM_TOPICS = 7

In [33]:
LDA_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=30, random_state=42)

In [39]:
LDA_model.fit(dtm)

LatentDirichletAllocation(max_iter=30, n_components=7, random_state=42)

## Show Stored Words.   
Let's have a look at a random selection of the words that have been stored.  

In [41]:
len(feature_names)

6506

In [42]:
import random 
for index in range(15):
    # randrange excludes the upper bound, so the index is always valid
    random_word_ID = random.randrange(len(feature_names))
    print(feature_names[random_word_ID])

roper
victory
rescues
firefighting
passionate
hired
blackouts
vegie
area
spurs
swimming
jamaican
verdict
seven
rolf


### Top Words Per Topic

In [48]:
len(LDA_model.components_[0])

6506

In [50]:
# Pick a single topic 
a_topic = LDA_model.components_[0]

# Get the indices that would sort this array
a_topic.argsort()

array([ 597, 3660, 5316, ..., 5070, 3921, 3598])

In [54]:
# The word least representative of this topic
a_topic[597]

0.14285721794371803

In [53]:
# The word most representative of this topic
a_topic[3598]

67.14227858485086
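
Note that the values stored in `components_` are pseudo-counts rather than probabilities, which is why they can exceed 1. The smallest value above (about 0.1429) is close to 1/7, which is consistent with scikit-learn's default topic-word prior of `1 / n_components`. Dividing each row by its sum turns the pseudo-counts into P(word | topic). The sketch below uses a small made-up matrix as a stand-in for `LDA_model.components_`:

```python
import numpy as np

# Hypothetical stand-in for LDA_model.components_: 2 topics x 4 words.
components = np.array([
    [67.1, 12.3, 0.14, 0.14],
    [0.14, 3.5, 40.2, 25.0],
])

# Divide each row by its sum so every topic becomes a distribution over words.
topic_word_probs = components / components.sum(axis=1, keepdims=True)

print(topic_word_probs.round(3))
print(topic_word_probs.sum(axis=1))  # each row sums to 1, up to rounding error
```

The normalization does not change the ranking of words within a topic, so `argsort()` gives the same order on either matrix.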

Let's have a look at the top 10 words for the topic we picked above.

In [57]:
top_10_words_indices = a_topic.argsort()[-10:]

for i in top_10_words_indices:
    print(cv.get_feature_names()[i])

news
local
kills
govt
land
sydney
time
says
new
market


This looks like a topic about government and local news. Let's have a look at all 7 topics found. 

In [60]:
for i, topic in enumerate(LDA_model.components_):
    print("THE TOP {} WORDS FOR TOPIC #{}".format(10, i))
    print([cv.get_feature_names()[index] for index in topic.argsort()[-10:]])
    print("\n")

THE TOP 10 WORDS FOR TOPIC #0
['news', 'local', 'kills', 'govt', 'land', 'sydney', 'time', 'says', 'new', 'market']


THE TOP 10 WORDS FOR TOPIC #1
['north', 'rise', 'wa', 'calls', 'continues', 'health', 'council', 'crash', 'qld', 'new']


THE TOP 10 WORDS FOR TOPIC #2
['child', 'report', 'government', 'dies', 'gold', 'set', 'workers', 'win', 'coast', 'interview']


THE TOP 10 WORDS FOR TOPIC #3
['death', 'wa', 'begins', 'australian', 'power', 'high', 'election', 'plan', 'says', 'water']


THE TOP 10 WORDS FOR TOPIC #4
['abc', 'rural', 'takes', 'rain', 'farmers', 'defends', 'hour', 'country', 'nsw', 'day']


THE TOP 10 WORDS FOR TOPIC #5
['face', 'cup', 'deal', 'faces', 'mp', 'hit', 'urged', 'group', 'council', 'govt']


THE TOP 10 WORDS FOR TOPIC #6
['attack', 'charges', 'death', 'accused', 'charged', 'woman', 'murder', 'court', 'man', 'police']




### Attach Discovered Topic Labels to Original News

In [61]:
final_topics = LDA_model.transform(dtm)
final_topics.shape

(12000, 7)

**final_topics** contains, for each of our 12000 documents, the probability that the document belongs to each of the 7 topics. This is a Document X Topic matrix. 
For example, below are the probability values for the first document.

In [63]:
final_topics[0]

array([0.04761906, 0.04761906, 0.04761906, 0.04761906, 0.71428564,
       0.04761906, 0.04761906])

In [64]:
final_topics[0].argmax()

4

This value (4) means that our LDA model assigns the first document to topic #4. Topic numbers are zero-based, so this is the fifth topic in the list printed earlier.
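
As a quick check, the row printed above can be copied into a small array to confirm that it behaves like a probability distribution and that `argmax()` returns a zero-based position:

```python
import numpy as np

# Topic probabilities of the first document, copied from the output above.
first_doc = np.array([0.04761906, 0.04761906, 0.04761906, 0.04761906,
                      0.71428564, 0.04761906, 0.04761906])

print(first_doc.sum())      # approximately 1: the row is a distribution
print(first_doc.argmax())   # zero-based position of the largest probability
```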

### Combination with the original data     
Let's create a new column that will contain the topic value for each document.   

In [65]:
sample_df["Topic N°"] = final_topics.argmax(axis=1)

In [66]:
sample_df.head()

Unnamed: 0,publish_date,headline_text,Topic N°
0,20060531,closer am1,4
1,20121017,national rural news for wednesday 171012,4
2,20090109,bail has been granted in a perth court to three,6
3,20170814,barnaby joyce caught up in citizenship debacle,1
4,20070423,police investigate portland abduction attempt sex,6


According to our LDA model:   
- the first document belongs to topic #4.  
- the second document belongs to topic #4. 
- the third document belongs to topic #6.  

etc.   
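
Once every document has a dominant topic, a natural next step is to count how many documents fall into each topic. The sketch below uses a small made-up Document X Topic matrix, so the numbers are purely illustrative:

```python
import numpy as np

# Hypothetical Document X Topic matrix: 5 documents, 3 topics, rows sum to 1.
doc_topic = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
])

labels = doc_topic.argmax(axis=1)   # dominant topic of each document
counts = np.bincount(labels, minlength=doc_topic.shape[1])  # documents per topic

print(labels.tolist())
print(counts.tolist())
```

On the notebook's dataframe, `sample_df["Topic N°"].value_counts()` would give the same kind of summary.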

# Some Visualization       
We will use the `pyLDAvis` module to visualize the topics associated with our documents.   

In [68]:
#!pip install pyldavis

In [75]:
import pyLDAvis.sklearn

In [71]:
pyLDAvis.enable_notebook()

In [72]:
#transformed_vector = dtm
#lda_model = final_topics

In [77]:
panel = pyLDAvis.sklearn.prepare(LDA_model, dtm, cv, mds='tsne')

In [78]:
panel

### Some Comments On The Graphic     

- By selecting a particular term on the right, we can see which topic(s) it belongs to.    
- Conversely, by choosing a topic on the left, we can see all its terms, from the most to the least relevant.  

# Congratulations!