In [18]:
%load_ext autoreload
%autoreload 2

from sklearn.datasets import fetch_20newsgroups
from lda_model import LDA
from preprocessing import Preprocessing
import pandas as pd



# Latent Dirichlet Allocation : an application

## Authors :
- Mathis Demay
- Luqman Ferdjani

## Purpose of the notebook :

The idea behind this notebook is to apply Latent Dirichlet Allocation (LDA), a generative model for document topic prediction. We shall first give a short explanation of what LDA is before applying it to a well-known dataset available through sklearn, the 20 newsgroups dataset.

## What is LDA ? A succinct explanation

LDA is a generative model which posits that each document is a mixture of topics and each topic a distribution over words. The proportions involved (topics within a document, words within a topic) are treated as latent random variables following Dirichlet distributions.

This DAG presents the model : 
    
<img src="images/dag_lda.png"
     alt="DAG of lda"/>

With :

<ul> 
    <li>N : the number of words in a document</li> 
    <li>M : the total number of documents</li>
    <li>$\alpha$ : the concentration parameters of the Dirichlet distribution used to generate $\theta$</li>
    <li>$\theta$ : a topic mixture, i.e. the vector of topic proportions within a document. The topic of each word is drawn from a multinomial distribution with parameter $\theta$</li>
    <li>z : the topic assigned to a word</li>
    <li>$\beta$ : a matrix of size k $\times$ V where V is the total number of different words (the vocabulary size) and k is the total number of different topics. Each row is indexed by a topic and each column by a word.</li>
</ul>
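
To make the notation concrete, here is a minimal sketch of the generative process for a single document, using NumPy. The sizes, the value of $\alpha$ and the randomly drawn $\beta$ are toy assumptions chosen for illustration; they are not the values used by the `LDA` class of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): k topics, V vocabulary words, N words in the document
k, V, N = 3, 8, 12
alpha = np.full(k, 0.5)  # Dirichlet concentrations for the topic mixture

# beta: k x V matrix, each row is the word distribution of one topic.
# Drawn at random here purely for illustration.
beta = rng.dirichlet(np.ones(V), size=k)

def generate_document(n_words):
    """Sample one document following the LDA generative story."""
    theta = rng.dirichlet(alpha)                          # topic proportions of the document
    z = rng.choice(k, size=n_words, p=theta)              # one topic per word position
    w = np.array([rng.choice(V, p=beta[t]) for t in z])   # one word id per position
    return theta, z, w

theta, z, w = generate_document(N)
print("theta:", theta.round(2))
print("topics:", z)
print("word ids:", w)
```

Inference goes the other way: only the words `w` are observed, and the model has to recover plausible values for `theta`, `z` and `beta`.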


<b>Important note</b> : LDA uses unigrams. Each document is treated as a bag of words: only the count of each individual word matters, not the order in which words appear. This works because how words are articulated within a document is not needed to find its topic. For example, when skimming quickly through a biology article that mentions "dna", "rna" and "genomics", one could assume with high probability that the article is about molecular biology, even without working out how these notions are linked within the document.
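
As a small illustration of the bag-of-words idea, the sketch below counts unigrams with the standard library. The two toy documents are made up for the example; they contain the same words in a different order and therefore get the same representation.

```python
from collections import Counter

doc_a = "dna rna genomics dna"
doc_b = "genomics dna rna dna"   # same words, different order

def bag_of_words(text):
    """Count how many times each whitespace-separated word occurs."""
    return Counter(text.split())

print(bag_of_words(doc_a))
print(bag_of_words(doc_a) == bag_of_words(doc_b))
```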

More detailed information on what LDA is, the idea behind the model, the model itself, its applications and the inference methods can be found in the detailed report present in the repository.

## Presentation of the newsgroup dataset

The dataset contains about 18,000 newsgroup posts (18,846 across the train and test splits) on 20 different topics. Why choose this dataset ?

- It is easy to find as it is loadable via the sklearn API
- It has great documentation on how to pre-process it
- It is a dataset composed of documents categorized by topic, which is exactly what LDA is made for. Additionally, with these provided topics, we can assess the quality of our predictions.

In [2]:
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

In [3]:
#Example of a newsgroup 

print(train.data[0])

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

We start by stripping the documents of headers, footers and quotes. Why ? Because these parts of newsgroup posts are irrelevant to topic prediction. Headers and footers typically contain information about the author, references to other newsgroups, locations, etc. These features could cause our model to overfit.

In [4]:
train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

In [5]:
print("The training dataset contains", len(train.data), "training examples")
print("The testing dataset contains", len(test.data), "test examples")

The training dataset contains 11314 training examples
The testing dataset contains 7532 test examples


In [6]:
#Example of a newsgroup

print(train.data[0])

'I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.'

We can also notice something else about the dataset: the presence of special characters and of many words which are not relevant to topic analysis, such as determiners, common verbs and other words that pertain to many different topics and give no clear indication of any one of them. These will be handled in the pre-processing part of our notebook.
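
A minimal sketch of this kind of cleaning is given here. It is not the `Preprocessing` class used later in the notebook, whose internals are not shown; the `clean` function and the tiny `STOP_WORDS` set are hypothetical and only illustrate the idea of dropping special characters and uninformative words.

```python
import re

# Tiny illustrative stop-word list (assumption); a real pipeline would use a fuller one.
STOP_WORDS = {
    "a", "an", "the", "i", "it", "was", "is", "to", "on", "this", "of",
    "if", "me", "be", "from", "in", "out", "there", "could", "anyone",
}

def clean(text):
    """Lowercase, keep alphabetic tokens only, drop stop words and very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]

sample = "I was wondering if anyone out there could enlighten me on this car I saw"
print(clean(sample))
```

A fuller pipeline would typically also lemmatize the remaining tokens so that inflected forms of a word share a single entry in the vocabulary.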

In [7]:
#Target topics, these are already coded by numbers

train.target[:10]

array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])

In [8]:
# Topics for all newsgroups

train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

These topics can get very specific, so we shall boil them down to broader topics to simplify the job for our model. We can already notice broader topics about politics, religion, technology, sports, sales, etc.
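
One possible way to group the 20 labels is sketched here. The `broad_topic` function and the grouping itself are assumptions made for illustration; other groupings are equally defensible, and this is not necessarily the one used in the rest of the notebook.

```python
# The 20 labels, copied from train.target_names above
TARGET_NAMES = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
    "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med", "sci.space",
    "soc.religion.christian", "talk.politics.guns", "talk.politics.mideast",
    "talk.politics.misc", "talk.religion.misc",
]

def broad_topic(name):
    """Map a fine-grained newsgroup name to a hypothetical broader label."""
    if "religion" in name or name == "alt.atheism":
        return "religion"
    if name.startswith("talk.politics"):
        return "politics"
    if name.startswith("rec.sport"):
        return "sports"
    if name.startswith("rec."):
        return "vehicles"
    if name.startswith("comp."):
        return "technology"
    if name.startswith("sci."):
        return "science"
    if name == "misc.forsale":
        return "sales"
    return "other"

broad = {name: broad_topic(name) for name in TARGET_NAMES}
print(sorted(set(broad.values())))
```

With the `train` object loaded above, the broad label of each document could then be obtained with `[broad_topic(train.target_names[t]) for t in train.target]`.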

## Data pre-processing

In [9]:
pp = Preprocessing()
proc_corpus = pp.corpus_preproc(train)
d, bow = pp.build_bow(proc_corpus)

[nltk_data] Downloading package wordnet to /Users/Lucky/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Fitting LDA

In [19]:
lda = LDA(10, bow, d, alpha=1, set_alpha=True)

lda.estimation()

Iteration: 0
E-Step
E-step through 0 documents
Iteration 0 of variational parameters estimation
-10.708642172506947
Iteration 1 of variational parameters estimation
-10.708642172506947
Iteration 0 of variational parameters estimation
-10.708642172506947
Iteration 1 of variational parameters estimation
-10.708642172506947
Iteration 0 of variational parameters estimation
-10.708642172506947
Iteration 1 of variational parameters estimation
-10.708642172506947
Iteration 0 of variational parameters estimation
-10.708642172506947
Iteration 1 of variational parameters estimation
-10.708642172506947
Iteration 0 of variational parameters estimation
-10.708642172506947
Iteration 1 of variational parameters estimation
-10.708642172506947
M-Step
0.0
iteraton 0 -53.543210862534735
Iteration: 1
E-Step
E-step through 0 documents
Iteration 0 of variational parameters estimation
-10.708642172506947
Iteration 1 of variational parameters estimation
-10.708642172506947
Iteration 0 of variational parameter

(array([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]]),
 array([[1.2, 1.2, 1.2, 1.2, 1.2],
        [1.2, 1.2, 1.2, 1.2, 1.2],
        [1.2, 1.2, 1.2, 1.2, 1.2],
        [1.2, 1.2, 1.2, 1.2, 1.2],
        [1.2, 1.2, 1.2, 1.2, 1.2]]),
 array([[[0., 0., 0., 0., 0.]],
 
        [[0., 0., 0., 0., 0.]],
 
        [[0., 0., 0., 0., 0.]],
 
        [[0., 0., 0., 0., 0.]],
 
        [[0., 0., 0., 0., 0.]]]))

## Results evaluation