# Latent Dirichlet allocation


In [13]:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
import pickle
import numpy as np
import json
from collections import defaultdict

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.mllib.clustering.LDA.html#pyspark.mllib.clustering.LDA

https://spark.apache.org/docs/1.6.2/mllib-clustering.html#latent-dirichlet-allocation-lda



## 1. Topics extraction

Pre-processed courses: a list of documents, each with its word-count vector (represented as a dict `{word: count}`).

In [14]:
with open('data/ppcourses.pkl', 'rb') as f:
    pre_processed_courses = pickle.load(f)

In [15]:
len(pre_processed_courses)

854

Set of words: the set of all occurring words

In [16]:
with open('data/set_of_words.pkl', 'rb') as f1:
    set_of_words = pickle.load(f1)

In [17]:
set_of_words = list(set_of_words.keys())

In [18]:
len(set_of_words)

5806

In [19]:
data = sc.parallelize(pre_processed_courses)

In [20]:
#Create a count vector for each document (shape: 854x5806, one entry per word in set_of_words)
parsedData = data.map(lambda d: Vectors.dense([d['description'][w] for w in set_of_words]))
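
The mapping from a `{word: count}` dict to a dense count vector can be sketched without Spark as follows. The toy vocabulary, the toy documents and the helper `to_count_vector` are made up for illustration, and `dict.get(w, 0)` is an assumption that suits plain dicts; the notebook's own lookup `d['description'][w]` only works if the stored structure returns a count for every vocabulary word.

```python
import numpy as np

# Hypothetical toy vocabulary and documents (not the course data).
vocabulary = ["signal", "finance", "robot"]
documents = [
    {"description": {"signal": 2, "robot": 1}},
    {"description": {"finance": 3}},
]

def to_count_vector(doc, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = doc["description"]
    # Assumption: words absent from the document count as 0.
    return np.array([counts.get(w, 0) for w in vocabulary], dtype=float)

count_matrix = np.vstack([to_count_vector(d, vocabulary) for d in documents])
print(count_matrix.shape)  # (number of documents, vocabulary size)
```

Each row has the same length and word order, which is what lets column `i` of the topics matrix later be read as "word `i` of the vocabulary".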

In [21]:
#Create an id for each document
corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

In [43]:
# Cluster the documents into 10 topics using LDA
ldaModel = LDA.train(corpus, k=10)

In [44]:
#topics matrix: for each of the 5806 words, we obtain a size-10 vector with the word's weight in each topic
topics = ldaModel.topicsMatrix()

In [34]:
#Prints the top-m words of the first l topics
def print_topics_top_words(ldaModel, topics, l, m):
    for topic in range(l):
        print("- Topic " + str(topic+1) + ":")
        #Column `topic` of the topics matrix: one weight per vocabulary word
        w = [topics[word][topic] for word in range(ldaModel.vocabSize())]
        #Word indices sorted by decreasing weight
        x = np.flip(np.argsort(w))
        for i in range(m):
            print(set_of_words[x[i]])
        print('----------------------')
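
The ranking step used above can be checked on a small matrix without training a model. This is a minimal sketch: the vocabulary, the weights and the helper name `top_words` are hypothetical, and the matrix is laid out the same way as the topics matrix (rows are words, columns are topics).

```python
import numpy as np

def top_words(topics_matrix, vocabulary, topic, m):
    """Return the m highest-weighted words of one topic (one matrix column)."""
    weights = np.asarray(topics_matrix)[:, topic]
    order = np.argsort(weights)[::-1][:m]  # indices sorted by descending weight
    return [vocabulary[i] for i in order]

# Hypothetical 4-word vocabulary and 2 topics (rows: words, columns: topics).
vocabulary = ["noise", "finance", "filter", "asset"]
topics_matrix = np.array([
    [5.0, 0.5],
    [0.2, 4.0],
    [3.0, 0.1],
    [0.1, 2.5],
])

print(top_words(topics_matrix, vocabulary, topic=0, m=2))
print(top_words(topics_matrix, vocabulary, topic=1, m=2))
```

Returning a list instead of printing makes the ranking easy to reuse, for instance to compare the top words of two topics.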

In [46]:
print_topics_top_words(ldaModel, topics, 10, 10)

-Topic 1:
drug
cancer
business
doctoral
disease
open
area
priority
pathway
doctoral student
----------------------
-Topic 2:
signal processing
matlab
linear algebra
finite
fluid
regression
estimation
discrete
speech
coding
----------------------
-Topic 3:
building
urban
structural
studio
reading
identify
beam
thinking
architectural
making
----------------------
-Topic 4:
noise
low
cmos
analog
wave
acoustic
fiber
coupling
mode
filter
----------------------
-Topic 5:
financial
chain
pricing
finance
option
portfolio
choice
asset
fracture
supply
----------------------
-Topic 6:
treatment
robot
c
quality
production
plasma
object
wastewater
animal
student learn
----------------------
-Topic 7:
policy
conversion
social
thermodynamic
rate
equilibrium
energy conversion
medium
cycle
balance
----------------------
-Topic 8:
progress
written report
neuroscience
acquired
waste
adapt
progress plan
plan adapt
ass progress
adapt plan
----------------------
-Topic 9:
theorem
calculus
common
en
integral

#### Comments:

Some labels for each topic:
- Topic 1: Pharma
- Topic 2: Signal Processing 
- Topic 3: Civil Engineering
- Topic 4: Electronics
- Topic 5: Business Administration/Economics
- Topic 6: Various Subjects
- Topic 7: SHS/Thermodynamics 
- Topic 8: Study Methods
- Topic 9: Maths
- Topic 10: Materials Engineering

We may obtain a different result every time we train the model, since no seed is fixed here.
Some topics are hard to label (such as 6 and 7) because they seem to cover several subjects.

LDA seems to perform better than LSI here, with clearer and better-separated topics.

## 2. Dirichlet hyperparameters

In the PySpark library, the hyperparameters are named as follows:
- $\alpha$: docConcentration
- $\beta$: topicConcentration

$\alpha$ is the concentration parameter of the Dirichlet prior on each document's distribution over topics, and $\beta$ is the concentration parameter of the Dirichlet prior on each topic's distribution over words.

In [29]:
#Trains an LDA model with hyperparameters alpha, beta and k, then prints the top 3 words of the first 3 topics
def LDA_with_hyperparameters(alpha, beta, k=10):
    #Pass k through (it was previously hard-coded to 10); the fixed seed makes runs comparable
    ldaModel = LDA.train(corpus, k=k, docConcentration=alpha, topicConcentration=beta, seed=0)
    topics = ldaModel.topicsMatrix()
    print_topics_top_words(ldaModel, topics, l=3, m=3)

In [33]:
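#baseline values (for k = 10, Spark's EM defaults are 50/k + 1 = 6.0 for alpha and 0.1 + 1 = 1.1 for beta; beta = 1.01 is used here):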
alpha = 6.0
beta = 1.01
#test values:
alphas = [1.01, 2.0, 6.0, 10.0, 50.0, 100.0, 500.0]
betas = [1.01, 1.51, 3.0, 6.0, 10.0, 50.0, 100.0]

The choice of test values is based on the following:
- A high alpha value means that each document is likely to contain a mixture of most of the topics rather than any single topic specifically, while a low value means that a document is likely to concentrate on just a few topics.
- A high beta value means that each topic is likely to contain a mixture of most of the words rather than any word specifically, while a low value means that a topic may contain a mixture of just a few of the words.

(Note: with its default EM optimizer, this implementation of LDA requires both concentration values to be > 1.0.)
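
The effect of the concentration parameter can be illustrated by sampling topic mixtures directly from a symmetric Dirichlet prior. This is only a sketch of the prior, not of Spark's training procedure: the helper `mean_largest_share` is hypothetical, and the value 0.1 is included purely to show the low-concentration regime even though Spark's EM optimizer would not accept it.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10  # number of topics

def mean_largest_share(concentration, n_samples=2000):
    """Average weight of the dominant topic in sampled topic mixtures."""
    samples = rng.dirichlet(np.full(k, concentration), size=n_samples)
    return samples.max(axis=1).mean()

# A flat mixture over 10 topics gives a dominant share close to 1/k = 0.1;
# a mixture concentrated on one topic gives a dominant share close to 1.
for concentration in [0.1, 1.01, 6.0, 100.0]:
    print(concentration, round(mean_largest_share(concentration), 3))
```

The same reasoning applies to beta, with words in place of topics and topics in place of documents.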

In [31]:
#1. Vary alpha
for a in alphas:
    print("#### alpha = " + str(a) + " ####")
    LDA_with_hyperparameters(a, beta)

#### alpha = 1.01 ####
-Topic 1:
low
analog
noise
----------------------
-Topic 2:
security
finance
final exam
----------------------
-Topic 3:
supply
culture
grid
----------------------
#### alpha = 2.0 ####
-Topic 1:
noise
low
analog
----------------------
-Topic 2:
finance
security
final exam
----------------------
-Topic 3:
culture
supply
chain
----------------------
#### alpha = 6.0 ####
-Topic 1:
noise
low
finite
----------------------
-Topic 2:
final exam
discrete
finance
----------------------
-Topic 3:
culture
supply
fluid
----------------------
#### alpha = 10.0 ####
-Topic 1:
noise
analog
low
----------------------
-Topic 2:
security
discrete
calculus
----------------------
-Topic 3:
fluid
chain
supply
----------------------
#### alpha = 50.0 ####
-Topic 1:
business
financial
theorem
----------------------
-Topic 2:
speech
food
option
----------------------
-Topic 3:
snow
electronics
chain
----------------------
#### alpha = 100.0 ####
-Topic 1:
business
production
metal
---

#### Comments: 

Varying $\alpha$ slightly (between 1 and 10) has little effect: the top words of each topic stay largely the same, with some reordering. The effect becomes visible for larger values: at $\alpha = 50$, every topic changes substantially. Beyond that, the result tends to stabilize even when $\alpha$ changes a lot, with small changes but the same top words.

In [32]:
#2. Vary beta
for b in betas:
    print("#### beta = " + str(b) + " ####")
    LDA_with_hyperparameters(alpha, b)

#### beta = 1.01 ####
-Topic 1:
noise
low
finite
----------------------
-Topic 2:
final exam
discrete
finance
----------------------
-Topic 3:
culture
supply
fluid
----------------------
#### beta = 1.51 ####
-Topic 1:
low
noise
analog
----------------------
-Topic 2:
financial
finance
security
----------------------
-Topic 3:
chain
supply
sample
----------------------
#### beta = 3.0 ####
-Topic 1:
discrete
business
multi
----------------------
-Topic 2:
speech
signal processing
discrete
----------------------
-Topic 3:
snow
chain
electronics
----------------------
#### beta = 6.0 ####
-Topic 1:
large
final exam
reading
----------------------
-Topic 2:
large
quality
final exam
----------------------
-Topic 3:
reading
quality
large
----------------------
#### beta = 10.0 ####
-Topic 1:
large
quality
reading
----------------------
-Topic 2:
large
quality
reading
----------------------
-Topic 3:
reading
quality
large
----------------------
#### beta = 50.0 ####
-Topic 1:
reading
large
qu

#### Comments:

In contrast, small variations of a small $\beta$, even by 0.5, affect the topics' _leading_ words, even though the top words stay roughly the same. We observe a lot of change across $\beta \in \{1.01, 1.51, 3.0, 6.0\}$. From $\beta = 6.0$ onward, however, all topics start containing the same top words, which is expected: a large $\beta$ makes every topic more likely to contain a mixture of all words.
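
One way to see why large values pull topics together is to treat the prior as a pseudo-count added to every word of every topic. The sketch below is a simplified additive-smoothing picture, not Spark's EM algorithm: the word counts and the helper names are hypothetical, and the counts are held fixed, whereas real training would also reassign words between topics.

```python
import numpy as np

# Hypothetical word counts assigned to two topics (5-word vocabulary).
topic_word_counts = np.array([
    [40.0, 30.0, 5.0, 3.0, 2.0],
    [2.0, 4.0, 6.0, 35.0, 45.0],
])

def smoothed_distributions(counts, smoothing):
    """Add the same pseudo-count to every word, then normalise each topic."""
    smoothed = counts + smoothing
    return smoothed / smoothed.sum(axis=1, keepdims=True)

def total_variation(p, q):
    """Half the L1 distance between two distributions (0 means identical)."""
    return 0.5 * np.abs(p - q).sum()

# The distance between the two topics shrinks as the pseudo-count grows.
for smoothing in [0.01, 1.0, 10.0, 100.0]:
    dists = smoothed_distributions(topic_word_counts, smoothing)
    print(smoothing, round(total_variation(dists[0], dists[1]), 3))
```

As the pseudo-count grows, both topics drift toward the uniform distribution over words, so their differences carry less and less weight.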

## 3. EPFL's taught subjects

In [48]:
corpus.count()

854

EPFL has about 7 faculties and roughly 15 sections. Our corpus has 854 documents and 5806 words.

In our case, every document describes a specific course, so it is likely to cover one or a few topics. And since each topic seems to correspond to a scientific subject, a topic may contain a mixture of just a few words.



- k: There are many faculties and different fields of engineering taught at EPFL, so we pick a rather large k to make sure topics are well separated. We also want to avoid a topic containing several unrelated subjects.
- alpha: rather small, because each document contains a small number of topics.
- beta: small, because it produces quite specific topics in terms of words, which fits the rather large k we chose.

In [103]:
alpha = 6.0
beta = 1.01
k = 15

In [104]:
ldaModelEPFL = LDA.train(corpus, k, docConcentration=alpha, topicConcentration=beta, seed=2)
topics = ldaModelEPFL.topicsMatrix()

In [105]:
print_topics_top_words(ldaModelEPFL, topics, k, 5)

-Topic 1:
fluid
layer
thermodynamic
conversion
bio
----------------------
-Topic 2:
electronics
low
analog
brain
wireless
----------------------
-Topic 3:
progress
written report
acquired
obtained
professor
----------------------
-Topic 4:
reactor
kinetics
equilibrium
law
liquid
----------------------
-Topic 5:
chain
business
supply
regression
urban
----------------------
-Topic 6:
financial
calculus
rate
finance
option
----------------------
-Topic 7:
policy
industry
coding
speech
recognition
----------------------
-Topic 8:
quality
waste
discipline
reading
sample
----------------------
-Topic 9:
cycle
membrane
conversion
corporate
energy conversion
----------------------
-Topic 10:
drug
doctoral
selected
disease
food
----------------------
-Topic 11:
robot
reach
production
set objective
action plan
----------------------
-Topic 12:
spectroscopy
tissue
diffraction
ray
studio
----------------------
-Topic 13:
semiconductor
fiber
3d
stress
fracture
----------------------
-Topic 14:
stru

#### Comments:
$\alpha = 6.0, \beta = 1.01, k = 15$

Labels for topics:

- Topic 1: Thermodynamics
- Topic 2: Electronics 
- Topic 3: Study Methods
- Topic 4: Physics
- Topic 5: Business
- Topic 6: Finance
- Topic 7: Business Ventures for AI
- Topic 8: Environmental Science
- Topic 9: Biology
- Topic 10: Medical Research
- Topic 11: Robotics
- Topic 12: Human Medicine
- Topic 13: Materials Engineering
- Topic 14: Structure
- Topic 15: Maths

_Note:_ Some topics are not well defined or are hard to define, as they contain words that could belong to several topics. Also, the topics are not as clear and well separated as desired.