In [17]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

# Latent Dirichlet Allocation : an application

## Authors :
- Mathis Demay
- Luqman Ferdjani

## Purpose of the notebook :

The idea behind this notebook is to apply Latent Dirichlet Allocation (LDA), a generative model for document topic prediction. We shall first give a short explanation of what LDA is before applying it to a well-known dataset available in sklearn, the 20 newsgroups dataset.

## What is LDA ? A succinct explanation

LDA is a generative model which posits that each document is a mixture of topics and that each topic is a distribution over words. All the proportions involved (words within a topic, topics within a document) are drawn from latent Dirichlet random variables.

This DAG presents the model : 
    
<img src="images/dag_lda.png"
     alt="DAG of lda"/>

With :

<ul> 
    <li>N the number of words in a document; each word has its own topic assignment z</li> 
    <li>M the number of documents in total</li>
    <li>$\alpha$ : the concentration parameters of the Dirichlet distribution used to generate $\theta$</li>
    <li>$\theta$ : a topic mixture, i.e. a vector of topic proportions within a document. Within a document, topics are drawn from a multinomial distribution with parameter $\theta$</li>
    <li>z : the topic assigned to a given word</li>
    <li>$\beta$ : a matrix of size k * V, where V is the number of distinct words in the vocabulary and k the number of topics. Each row is indexed by a topic and each column by a word.</li>
</ul>
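
To make the notation concrete, here is a minimal sketch of how a single document could be generated under this model. The toy sizes, the symmetric value of $\alpha$ and the randomly drawn $\beta$ are all assumptions made for illustration; they are not taken from the report.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): k topics, V vocabulary words, N words in the document
k, V, N = 3, 8, 10

alpha = np.full(k, 0.5)                        # Dirichlet concentrations (symmetric here)
beta = rng.dirichlet(np.full(V, 0.5), size=k)  # k x V matrix, row i = word distribution of topic i

theta = rng.dirichlet(alpha)                   # topic proportions of one document
z = rng.choice(k, size=N, p=theta)             # one topic drawn per word position
w = np.array([rng.choice(V, p=beta[t]) for t in z])  # one word id drawn from the chosen topic

print("theta:", theta)
print("topics z:", z)
print("word ids w:", w)
```

Repeating the last three sampling steps M times would yield a whole corpus; inference in LDA goes the other way and recovers $\theta$ and $\beta$ from the observed words.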


<b>Important note</b> : LDA uses unigrams. Each document is treated as a bag of words: only the count of each individual word matters, not the order in which the words appear. This works because the way words are articulated within a document is not needed to find its topic. For example, when quickly skimming a biology article that mentions "dna", "rna" and "genomics", one could assume with high probability that the article is about molecular biology, without working out how these notions are linked within the document.
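
As a small illustration of the bag-of-words assumption, the sketch below counts unigrams with the standard library only. The helper name `bag_of_words` and the two toy documents are hypothetical.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace and count each unigram."""
    return Counter(text.lower().split())

doc_a = "dna rna genomics dna"
doc_b = "genomics dna dna rna"   # same words as doc_a, in a different order

print(bag_of_words(doc_a))
# Word order is discarded, so both documents have the same representation
print(bag_of_words(doc_a) == bag_of_words(doc_b))
```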

More detailed information on what LDA is, the idea behind the model, the model itself, its applications and the inference methods can be found in the detailed report present in the repository.

## Presentation of the newsgroup dataset

The dataset contains close to 20,000 newsgroup posts spread across 20 different newsgroups, each of which corresponds to a topic. Why choose this dataset ? 

- It is easy to find as it is loadable via the sklearn API
- It has great documentation on how to pre-process it
- It is a dataset comprised of documents categorized by topic, which is exactly the setting LDA is made for



In [39]:
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

In [40]:
# Example of a newsgroup post

train.data[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

We start by stripping the documents of headers, footers and quotes. Why ? Because these parts of newsgroup posts are irrelevant to topic prediction. Headers and footers typically contain information about the author, references to other newsgroups, locations, etc ... These features could cause our model to overfit.

In [41]:
train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

In [45]:
print("The training dataset contains", len(train.data), "training examples")
print("The testing dataset contains", len(test.data), "testing examples")

The training dataset contains 11314 training examples
The testing dataset contains 7532 testing examples


In [42]:
# Example of a newsgroup post, after stripping

train.data[0]

'I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.'

We can also notice something else about the dataset: the presence of special characters and of many words which are not relevant to topic analysis, such as determiners, common verbs, etc ... These words pertain to many different topics and don't give any clear indication. Removing them is done in the pre-processing part of our notebook.
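
To give an idea of what this cleaning could look like, here is a minimal sketch using only the standard library. The `clean` helper and the tiny hand-picked stop-word list are assumptions made for illustration; a real pipeline would rely on a fuller stop-word list.

```python
import re

# Tiny illustrative stop-word list (an assumption, far from complete)
STOP_WORDS = {"i", "was", "if", "the", "a", "an", "on", "this", "to", "be",
              "it", "of", "is", "me", "there", "out", "could"}

def clean(text):
    """Keep lowercase alphabetic tokens, dropping stop words and very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())   # drops digits and special characters
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]

print(clean("I was wondering if anyone out there could enlighten me on this car I saw"))
# ['wondering', 'anyone', 'enlighten', 'car', 'saw']
```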

In [36]:
# Target topics, already encoded as integers

train.target[:10]

array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])

In [33]:
# Names of all the newsgroups (topics)

train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

These topics can get very specific, so we shall boil them down to broader topics to simplify the job for our model. We can already notice broader topics about politics, religion, technology, sports, sales, etc ...
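
One possible way to express such a grouping is a prefix-based lookup, sketched below. The broad labels and the way newsgroups are assigned to them are hypothetical choices, not a grouping fixed by the dataset.

```python
# Hypothetical grouping: ordered (prefix, broad label) pairs, first match wins
PREFIX_TO_BROAD = [
    ("talk.politics", "politics"),
    ("talk.religion", "religion"),
    ("soc.religion", "religion"),
    ("alt.atheism", "religion"),
    ("rec.sport", "sports"),
    ("rec.", "vehicles"),        # rec.autos, rec.motorcycles
    ("comp.", "technology"),
    ("sci.", "science"),
    ("misc.forsale", "sales"),
]

def broad_topic(name):
    """Map a newsgroup name such as 'rec.autos' to a broader label."""
    for prefix, label in PREFIX_TO_BROAD:
        if name.startswith(prefix):
            return label
    return "other"

print(broad_topic("rec.autos"))
print(broad_topic("talk.politics.guns"))
```

Applied to the dataset, the broad label of each document would then be `broad_topic(train.target_names[t])` for each `t` in `train.target`.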

## Data pre-processing

The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.
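
The effect can be seen on a toy corpus with the hand-written sketch below. It uses the textbook weighting tf * log(N / df), which is an assumption made for readability: sklearn's `TfidfVectorizer` applies a smoothed idf and L2 normalization by default, so its values differ. The `tf_idf` helper and the three toy documents are hypothetical.

```python
import math
from collections import Counter

# Toy tokenized corpus: "car" appears in every document
docs = [
    ["car", "engine", "car"],
    ["car", "space", "orbit"],
    ["car", "god", "faith"],
]

def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)  # raw term frequencies
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in counts.items()})
    return weights

weights = tf_idf(docs)
print(weights[0])
```

Although "car" is the most frequent token of the first document, its weight is zero because it occurs in every document, whereas "engine" keeps a positive weight.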

## Fitting LDA

## Results evaluation