In [1]:
from sklearn.datasets import fetch_20newsgroups
from lda_model import LDA
from preprocessing import Preprocessing
import pandas as pd

# Latent Dirichlet Allocation : an application

## Authors :
- Mathis Demay
- Luqman Ferdjani

## Purpose of the notebook :

The idea behind this notebook is to apply Latent Dirichlet Allocation (LDA), a generative model for document topic prediction. We shall first give a short explanation of what LDA is before applying it to a well-known dataset available in sklearn, the 20 newsgroups dataset.

## What is LDA ? A succinct explanation

LDA is a model which posits that each document is generated as a mixture of topics, where the proportions involved (words within a topic, topics within a document) are latent random variables following Dirichlet distributions.

This DAG presents the model : 
    
<img src="images/dag_lda.png"
     alt="DAG of lda"/>

With :

<ul> 
    <li>N : the number of words in a document</li> 
    <li>M : the total number of documents</li>
    <li>$\alpha$ : the concentration parameters of the Dirichlet distribution used to generate $\theta$</li>
    <li>$\theta$ : the topic mixture of a document, i.e. a vector of topic proportions within that document. The topic of each word is drawn from a multinomial distribution with parameter $\theta$</li>
    <li>z : the topic assigned to a word</li>
    <li>$\beta$ : a matrix of size k $\times$ V where V is the total number of different words and k is the total number of different topics. Each row is indexed by a topic and each column by a word; an entry is the probability of that word under that topic.</li>
</ul>
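
The generative story described by these variables can be sketched in a few lines. This is only an illustration under stated assumptions, not the code of `lda_model`: the helper names (`sample_dirichlet`, `generate_document`), the toy vocabulary and the toy $\beta$ matrix are all hypothetical, and the Dirichlet draw is obtained by normalising Gamma draws from the standard library.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def generate_document(alpha, beta, vocab, n_words, rng):
    """Generate one document following the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)        # topic mixture of this document
    topic_ids = list(range(len(beta)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]   # topic of this word
        w = rng.choices(vocab, weights=beta[z])[0]     # word drawn from topic z
        words.append(w)
    return theta, words

rng = random.Random(0)
vocab = ["dna", "rna", "engine", "wheel"]       # hypothetical vocabulary, V = 4
# Hypothetical beta with k = 2 topics; each row sums to 1.
beta = [[0.5, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5]]
theta, doc = generate_document([0.1, 0.1], beta, vocab, n_words=8, rng=rng)
print("theta:", theta)
print("document:", doc)
```

Fitting LDA goes the other way: only the words are observed, and $\theta$, z and $\beta$ have to be inferred from them.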


<b>Important note</b> : LDA uses unigrams. Each document is treated as a bag of words: only the number of occurrences of each individual word is kept, and word order is discarded. This works because the way words are arranged within a document is not necessary to find its topic. For example, when quickly skimming a biology article that mentions "dna", "rna" and "genomics", one could assume with high probability that the article is about molecular biology, without working out how these notions are linked within the document.
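
As a small illustration of the bag-of-words assumption, the sketch below counts unigrams with `collections.Counter` and shows that two texts made of the same words in a different order get the same representation. The function name `bag_of_words` and the two sentences are made up for the example.

```python
from collections import Counter

def bag_of_words(text):
    """Count unigrams; word order is discarded."""
    return Counter(text.lower().split())

doc_a = "dna encodes rna and rna encodes proteins"
doc_b = "proteins and rna encodes rna encodes dna"   # same words, shuffled

print(bag_of_words(doc_a))
print(bag_of_words(doc_a) == bag_of_words(doc_b))
```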

More detailed information on what LDA is, the idea behind the model, the model itself, its applications and the inference methods can be found in the detailed report present in the repository.

## Presentation of the newsgroup dataset

The version of the dataset shipped with sklearn contains roughly 18,000 newsgroup posts spread over 20 different topics. Why choose this dataset ? 

- It is easy to find as it is loadable via the sklearn API
- It has great documentation on how to pre-process it
- It is a dataset composed of documents categorized by topic, which is exactly what LDA is made for. Additionally, with these provided topics, we can assess the quality of our predictions.

In [2]:
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

In [3]:
# Example of a newsgroup post

print(train.data[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







We first start by stripping the documents of headers, footers and quotes. Why ? Because these parts of newsgroup posts are irrelevant to topic prediction. Headers and footers typically contain information about the author, references to other newsgroups, locations, etc. These features could cause our model to overfit.
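
To make the idea concrete, here is a simplified sketch of header stripping: everything up to the first blank line is treated as the header and dropped. The function `strip_header` and the sample message are hypothetical and only illustrate the principle; in the notebook the stripping is delegated to the `remove` argument of `fetch_20newsgroups`.

```python
def strip_header(text):
    """Drop everything up to and including the first blank line."""
    _header, _blank, body = text.partition("\n\n")
    return body

# Hypothetical raw post: header lines, a blank line, then the body.
raw = (
    "From: someone@example.com\n"
    "Subject: WHAT car is this!?\n"
    "Lines: 15\n"
    "\n"
    "I was wondering if anyone out there could enlighten me on this car."
)

print(strip_header(raw))
```

Note that with this simple rule a text without any blank line is reduced to an empty string.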

In [4]:
data = fetch_20newsgroups(shuffle=True, remove=('headers', 'footers', 'quotes'))

In [5]:
print("The training dataset contains", len(data.data), "training examples")

The training dataset contains 11314 training examples


In [6]:
# Example of a newsgroup post, once headers, footers and quotes are removed

print(data.data[0])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


We can also notice something else about the dataset: the presence of special characters and of many words that are not relevant to topic analysis, such as determiners and common verbs. These words pertain to many different topics and don't give any clear indication of the topic. They will be removed in the pre-processing part of our notebook.

In [7]:
# Target topics, already encoded as integers

data.target[:10]

array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])

In [8]:
# Names of the topics; data.target indexes into this list

data.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

These topics can get very specific, so we shall boil them down to broader topics to simplify the job for our model. We can already notice broader topics such as politics, religion, technology, sports and sales.
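
One possible way to group the 20 labels into broader topics is sketched below. The grouping rules in `broad_topic` are our own illustrative choice (for instance, putting `rec.autos` and `rec.motorcycles` under "vehicles"), not a mapping defined by the dataset.

```python
target_names = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
    "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
    "sci.space", "soc.religion.christian", "talk.politics.guns",
    "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc",
]

def broad_topic(name):
    """Map a newsgroup name to a broader, hand-picked topic (illustrative rules)."""
    if "religion" in name or "atheism" in name:
        return "religion"
    if "politics" in name:
        return "politics"
    if name.startswith("comp."):
        return "technology"
    if name.startswith("rec.sport"):
        return "sports"
    if name.startswith("rec."):
        return "vehicles"
    if name.startswith("sci."):
        return "science"
    if name == "misc.forsale":
        return "sales"
    return "other"

# Group the original labels by broad topic.
groups = {}
for name in target_names:
    groups.setdefault(broad_topic(name), []).append(name)

for topic, names in groups.items():
    print(topic, "->", names)
```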

## Data pre-processing

The preprocessing pipeline is implemented in the `preprocessing` module. We split texts into lists of unigrams, then lemmatize them and remove stop words, before indexing the words and transforming the documents into bags of words.
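
The steps of such a pipeline can be sketched with the standard library alone. This is not the code of the `Preprocessing` class: the function names, the tiny stop-word list and the two sample sentences are assumptions made for the example, and lemmatization is left out to keep the sketch dependency-free. The output format (an index from ids to words, and one list of `(word id, count)` pairs per document) is assumed to mirror the `d` and `bow` objects used in this notebook.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a much larger one.
STOP_WORDS = {"the", "a", "is", "this", "on", "i", "was", "to", "of", "and"}

def preprocess(text):
    """Lowercase, split into alphabetic unigrams and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_bow(corpus_tokens):
    """Index the vocabulary and turn each document into (word id, count) pairs."""
    word_to_id = {}
    bows = []
    for tokens in corpus_tokens:
        counts = Counter()
        for tok in tokens:
            idx = word_to_id.setdefault(tok, len(word_to_id))  # new id on first sight
            counts[idx] += 1
        bows.append(sorted(counts.items()))
    id_to_word = {i: w for w, i in word_to_id.items()}
    return id_to_word, bows

docs = ["The car was a sports car.", "This engine is the engine of a car."]
tokens = [preprocess(doc) for doc in docs]
id_to_word, bows = build_bow(tokens)
print(tokens)
print(id_to_word)
print(bows)
```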

In [9]:
pp = Preprocessing()
proc_corpus = pp.corpus_preproc(data["data"][:1000])
d, bow = pp.build_bow(proc_corpus)

[nltk_data] Downloading package wordnet to /Users/Lucky/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Example of a pre-processed document :

In [None]:
print(proc_corpus[0])

How the bag-of-words representation works :

In [None]:
print("In text 10 : ")
# Each entry of bow[10] is a (word index, count) pair
for word_id, count in bow[10]:
    print("Word", d[word_id], "occurs", count, "times")

## Fitting LDA

We instantiate the LDA class. We feed it the number of topics we wish to learn, the documents in bag-of-words form and the word index. We can also choose a value for the alpha hyperparameter of the LDA model. The lower alpha is, the more the prior favours documents dominated by a single topic. The higher alpha is, the more uniform the prior over the distribution of topics inside each document.
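
The effect of alpha can be checked numerically with a small simulation. The sketch below assumes a symmetric Dirichlet prior (the same alpha for every topic) and measures, on average, how large the share of the dominant topic is in a sampled topic mixture. The helper names and the number of samples are arbitrary choices for the illustration.

```python
import random

def sample_symmetric_dirichlet(alpha, k, rng):
    """Draw a k-dimensional sample from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [x / total for x in draws]

def mean_largest_share(alpha, k=5, n_samples=2000, seed=0):
    """Average proportion taken by the dominant topic over many sampled mixtures."""
    rng = random.Random(seed)
    shares = (max(sample_symmetric_dirichlet(alpha, k, rng)) for _ in range(n_samples))
    return sum(shares) / n_samples

for alpha in (0.1, 1.0, 10.0):
    print("alpha =", alpha, "mean share of dominant topic:", round(mean_largest_share(alpha), 2))
```

With 5 topics, a perfectly uniform mixture gives each topic a share of 0.2, so the closer the printed value is to 0.2, the more uniform the sampled mixtures are.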

In [10]:
lda = LDA(5, bow, d, alpha=0.1, set_alpha=True)

lda.estimation(max_iter_em=100, max_iter_var=10)

Iteration: 0
E-Step
E-step through 0 documents
E-step through 100 documents
E-step through 200 documents
E-step through 300 documents
E-step through 400 documents
E-step through 500 documents
E-step through 600 documents
E-step through 700 documents
E-step through 800 documents
E-step through 900 documents
M-Step
iteration 0 -63068.791621604796
Iteration: 1
E-Step
E-step through 0 documents
E-step through 100 documents
E-step through 200 documents
E-step through 300 documents
E-step through 400 documents
E-step through 500 documents
E-step through 600 documents
E-step through 700 documents
E-step through 800 documents
E-step through 900 documents
M-Step
iteration 1 121337.30541499938
Iteration: 2
E-Step
E-step through 0 documents
E-step through 100 documents
E-step through 200 documents
E-step through 300 documents
E-step through 400 documents
E-step through 500 documents
E-step through 600 documents
E-step through 700 documents
E-step through 800 documents
E-step through 900 documents

KeyboardInterrupt: 

The `display_word_topic_association` method displays, for every topic, the 20 words with the highest probability of occurring.

In [11]:
lda.display_word_topic_association()

network number share store suppos pretti word simpl packag leav futur condit obvious teach sale american grant publish consider normal 
better basic warn make total abl generat success ask stori user voic restrict strong special notic grant regard individu associ 
figur mayb prove tell class consid appreci attempt happi claim practic order forget polit self easi million explan law paper 
call earli answer appear code class evid bibl creat water edit reason book contain packag resourc proper mere will sit 
anybodi display mayb basic ignor delet fact true hard moral pretti place repli maintain violat packag wouldn measur demonstr determin 
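
Word lists like the ones above can be obtained by sorting each row of the topic-word matrix $\beta$ and keeping the most probable entries. The sketch below shows the idea on a hypothetical 2 $\times$ 4 matrix; `top_words`, the toy index and the toy probabilities are assumptions for the example and do not reproduce the internals of `display_word_topic_association`.

```python
def top_words(beta_row, id_to_word, n=20):
    """Return the n most probable words of one topic (one row of beta)."""
    ranked = sorted(range(len(beta_row)), key=lambda i: beta_row[i], reverse=True)
    return [id_to_word[i] for i in ranked[:n]]

# Hypothetical index and topic-word matrix (k = 2 topics, V = 4 words).
id_to_word = {0: "dna", 1: "rna", 2: "engine", 3: "wheel"}
beta = [[0.45, 0.40, 0.10, 0.05],
        [0.05, 0.05, 0.50, 0.40]]

for k, row in enumerate(beta):
    print("Topic", k, ":", " ".join(top_words(row, id_to_word, n=2)))
```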
