# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** *B*

**Names:**

* *Albert Koppelmaa*
* *Edouard Lacroix*
* *Guillem Pruñonosa Soler*

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
%config Completer.use_jedi = False
import scipy.sparse

In [2]:
from utils import load_json

courses = load_json('data/courses.txt')


In [3]:
!hdfs dfs -put data/bagofwords.txt

put: `bagofwords.txt': File exists


In [4]:
import json
bagofwords = sc.textFile("bagofwords.txt").map(json.loads)

In [5]:
n_gram_list = list(bagofwords.take(1)[0])

In [6]:
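def convert_to_index(bow):
    """Convert one bag-of-words dict into a sparse count vector.

    Position i in the vector corresponds to the i-th key of the dict, so every
    document must list the n-grams in the same order as n_gram_list.
    """
    # Keep only the non-zero counts, keyed by the n-gram's position.
    nonzero_counts = {index: count
                      for index, count in enumerate(bow.values())
                      if count != 0.0}
    return Vectors.sparse(len(bow), nonzero_counts)


# RDD of sparse vectors, one per course description.
bag_indexes = bagofwords.map(convert_to_index)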
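
The conversion above relies on every document listing its n-grams in the same order. As a minimal sketch of the same idea without Spark, the block below looks each n-gram up in an explicit vocabulary index instead. The helper `to_sparse_pairs` and the toy data are assumptions made for illustration, not part of the lab code.

```python
# Illustrative sketch only: `to_sparse_pairs` and the toy data are assumptions,
# not part of the lab code.

def to_sparse_pairs(bow, vocabulary_index):
    """Return (vector size, {position: count}) for the non-zero counts of one document."""
    nonzero = {
        vocabulary_index[ngram]: count
        for ngram, count in bow.items()
        if count != 0.0
    }
    return len(vocabulary_index), nonzero


# Fixed vocabulary shared by all documents.
vocabulary = ["linear algebra", "office hour", "signal processing"]
vocabulary_index = {ngram: position for position, ngram in enumerate(vocabulary)}

# The keys are deliberately in a different order than the vocabulary.
document = {"signal processing": 2.0, "linear algebra": 0.0, "office hour": 1.0}

size, nonzero = to_sparse_pairs(document, vocabulary_index)
print(size, nonzero)
```

The returned pair has the same shape as the arguments passed to `Vectors.sparse(size, {index: value})` in the cell above, so the positions no longer depend on the key order of each document.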

In [7]:
bag_with_document_index = bag_indexes.zipWithIndex().map(lambda x : [x[1],x[0]])

In [8]:
lda = LDA.train(bag_with_document_index, k=10, seed=1, maxIterations=30)

## Exercise 4.8: Topics extraction

In [9]:
topics = lda.describeTopics(10)

In [10]:
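def printTopics(topics):
    """Print the top n-grams of each topic returned by describeTopics."""
    for counter, topic in enumerate(topics, start=1):
        print("Topic " + str(counter) + ": ")
        # topic is a (term indices, term weights) pair; only the indices are needed.
        for word in topic[0]:
            print(n_gram_list[word])
        print("----------------")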


In [11]:
printTopics(topics)

Topic 1: 
case study
expected activity
transversal skill
linear algebra
solid waste
office hour
mechanical property
problem set
final exam
final grade
----------------
Topic 2: 
important start
transversal skill
expected activity
chemical engineering
written exam
mass spectrometry
office hour
supervision office
evaluate source
solar cell
----------------
Topic 3: 
transversal skill
heat transfer
important start
office hour
assistant forum
hour assistant
supervision office
expected activity
data structure
monte carlo
----------------
Topic 4: 
signal processing
expected activity
transversal skill
office hour
important start
supervision office
hour assistant
written exam
assistant forum
evaluate source
----------------
Topic 5: 
transversal skill
written report
electron microscopy
expected activity
plan adapt
adapt plan
progress plan
scientific technical
oral presentation
transmission electron
----------------
Topic 6: 
transversal skill
expected activity
important start
oral presentatio

Each of the ten proposed topics is described by the following words:

1. **Mathematics**
expected activity
case study
transversal skill
linear algebra
mechanical property
solid waste
final grade
problem set
office hour
important start

2. **Chemistry** 
important start
transversal skill
expected activity
chemical engineering
evaluate source
mass spectrometry
office hour
written exam
supervision office
hour assistant

3. **Computer science** 
heat transfer
office hour
transversal skill
important start
assistant forum
hour assistant
supervision office
expected activity
monte carlo
data structure

4. **Communication systems**
signal processing
expected activity
transversal skill
office hour
important start
supervision office
hour assistant
assistant forum
evaluate source
written exam

5. **Microtechnics** 
transversal skill
written report
electron microscopy
expected activity
oral presentation
plan adapt
adapt plan
progress plan
scientific technical
continuous control

6. **??? No distinguishing topic (does not fit any of the other topics)**
transversal skill
expected activity
important start
oral presentation
supervision office
hour assistant
office hour
case study
general domain
domain specific

7. **Biology**
transversal skill
expected activity
office hour
supervision office
stem cell
hour assistant
assistant forum
activity make
make optimal
carry activity

8. **Physics ?**
energy conversion
transversal skill
office hour
supervision office
hour assistant
expected activity
assistant forum
written exam
evaluate source
case study

9. **PhD program**
transversal skill
expected activity
supervision office
office hour
hour assistant
assistant forum
note open
note note
important start
doctoral priority

10. **Material sciences** 
expected activity
important start
case study
supervision office
office hour
magnetic material
transversal skill
hour assistant
intellectual property
resource bibliography

As we can see, many n-grams appear in several, or even almost all, of the topics. This is the same kind of issue we had with LSI. Because the topics share so many terms, we are forced to guess each label from the few terms that distinguish a topic, such as "linear algebra", "mass spectrometry" or "signal processing".
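
The guessing step can be made more systematic by hiding the terms that several topics share. The sketch below is one way to do this, assuming a hypothetical helper `distinguishing_terms`. For brevity it only uses the first five terms of topics 1 to 3 from the printed output of `printTopics(topics)`.

```python
# Illustrative sketch only: `distinguishing_terms` is a hypothetical helper.
from collections import Counter


def distinguishing_terms(topics, max_topics=1):
    """Keep, for each topic, the terms that occur in at most `max_topics` topics."""
    # Number of topics in which each term occurs.
    topic_frequency = Counter(term for terms in topics for term in set(terms))
    return [
        [term for term in terms if topic_frequency[term] <= max_topics]
        for terms in topics
    ]


# First five terms of topics 1 to 3, copied from the printed output.
topics_sample = [
    ["case study", "expected activity", "transversal skill", "linear algebra", "solid waste"],
    ["important start", "transversal skill", "expected activity", "chemical engineering", "written exam"],
    ["transversal skill", "heat transfer", "important start", "office hour", "assistant forum"],
]

for number, terms in enumerate(distinguishing_terms(topics_sample), start=1):
    print("Topic", number, terms)
```

On such a truncated sample, generic terms like "office hour" still pass the filter. In the full top-10 lists they occur in several topics and would be removed.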



## Exercise 4.9: Dirichlet hyperparameters

### 4.9.1 Fixing beta (topicConcentration) = 1.01 and varying alpha (docConcentration)

In [12]:
lda2 = LDA.train(bag_with_document_index, k=10, seed=1,topicConcentration = 1.01,
                 docConcentration = 1.01,maxIterations=30)


In [13]:
printTopics(lda2.describeTopics(10))

Topic 1: 
transversal skill
expected activity
office hour
important start
supervision office
solid waste
hour assistant
linear algebra
assistant forum
policy regulation
----------------
Topic 2: 
transversal skill
important start
expected activity
office hour
supervision office
hour assistant
chemical engineering
scientific technical
evaluate source
written exam
----------------
Topic 3: 
transversal skill
important start
expected activity
office hour
supervision office
hour assistant
assistant forum
heat transfer
linear algebra
written exam
----------------
Topic 4: 
transversal skill
expected activity
office hour
supervision office
hour assistant
important start
assistant forum
oral presentation
evaluate source
case study
----------------
Topic 5: 
transversal skill
expected activity
written report
oral presentation
hour assistant
supervision office
office hour
electron microscopy
important start
signal processing
----------------
Topic 6: 
case study
transversal skill
expected activ

In [14]:
lda3 = LDA.train(bag_with_document_index, k=10, seed=1,topicConcentration = 1.01,
                 docConcentration = 2.01,maxIterations=30)

In [15]:
printTopics(lda3.describeTopics(10))

Topic 1: 
transversal skill
expected activity
office hour
important start
supervision office
hour assistant
solid waste
linear algebra
assistant forum
policy regulation
----------------
Topic 2: 
transversal skill
important start
expected activity
office hour
supervision office
hour assistant
chemical engineering
scientific technical
evaluate source
differential equation
----------------
Topic 3: 
transversal skill
important start
expected activity
office hour
hour assistant
supervision office
assistant forum
heat transfer
written exam
data structure
----------------
Topic 4: 
transversal skill
expected activity
office hour
supervision office
hour assistant
important start
assistant forum
oral presentation
evaluate source
case study
----------------
Topic 5: 
transversal skill
expected activity
written report
oral presentation
electron microscopy
hour assistant
supervision office
office hour
signal processing
important start
----------------
Topic 6: 
case study
transversal skill
expec

In [16]:
lda4 = LDA.train(bag_with_document_index, k=10, seed=1,topicConcentration = 1.01,
                 docConcentration = 10.01,maxIterations=30)

In [17]:
printTopics(lda4.describeTopics(10))

Topic 1: 
transversal skill
expected activity
office hour
linear algebra
solid waste
mechanical property
monte carlo
important start
policy regulation
final grade
----------------
Topic 2: 
important start
transversal skill
expected activity
chemical engineering
office hour
scientific technical
supervision office
evaluate source
written exam
hour assistant
----------------
Topic 3: 
important start
office hour
transversal skill
assistant forum
hour assistant
supervision office
heat transfer
expected activity
exam supervision
monte carlo
----------------
Topic 4: 
expected activity
transversal skill
office hour
supervision office
hour assistant
signal processing
oral presentation
important start
evaluate source
scientific technical
----------------
Topic 5: 
transversal skill
written report
expected activity
oral presentation
electron microscopy
transmission electron
life science
plan carry
activity make
optimal time
----------------
Topic 6: 
case study
transversal skill
expected activ

#### 4.9.1 results
As alpha increases, the topics seem to become easier to distinguish. A small alpha tends to map each document to a small set of dominant topics. Our data, however, consists of course descriptions from a technical university, which often contain typical administrative or logistic information such as office hours, as well as technical jargon shared between different classes. Each document is therefore likely to be part of several "topics" and needs a higher alpha.
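
The effect of alpha on the prior can be illustrated without Spark. The sketch below uses only the standard library and hypothetical helper names. It samples document-topic mixtures from a symmetric Dirichlet for the three values of alpha tried above and reports the average weight of the dominant topic. It illustrates the prior only, under the assumption that `docConcentration` plays the role of this alpha; it does not reproduce the training results.

```python
# Illustrative sketch only: it samples from the Dirichlet prior itself and does
# not reproduce Spark's LDA training.
import random


def sample_symmetric_dirichlet(alpha, k, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) over k topics."""
    # Normalising independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [draw / total for draw in draws]


def mean_largest_share(alpha, k=10, samples=2000, seed=1):
    """Average weight of the dominant topic over many sampled documents."""
    rng = random.Random(seed)
    largest = [max(sample_symmetric_dirichlet(alpha, k, rng)) for _ in range(samples)]
    return sum(largest) / samples


for alpha in (1.01, 2.01, 10.01):
    print(alpha, round(mean_largest_share(alpha), 3))
```

The dominant topic's average share shrinks as alpha grows, which matches the intuition that a higher alpha lets each course description spread over more topics.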

### 4.9.2 Fixing alpha (docConcentration) = 6 and varying beta (topicConcentration)


In [18]:
lda = LDA.train(bag_with_document_index, k=10, seed=1,topicConcentration = 1.01,
                 docConcentration = 6.0,maxIterations=30)
printTopics(lda.describeTopics(10))

Topic 1: 
transversal skill
expected activity
office hour
linear algebra
solid waste
important start
mechanical property
monte carlo
supervision office
hour assistant
----------------
Topic 2: 
transversal skill
important start
expected activity
office hour
supervision office
chemical engineering
scientific technical
hour assistant
differential equation
evaluate source
----------------
Topic 3: 
transversal skill
important start
office hour
hour assistant
supervision office
assistant forum
heat transfer
expected activity
data structure
exam supervision
----------------
Topic 4: 
expected activity
transversal skill
office hour
supervision office
hour assistant
oral presentation
important start
evaluate source
assistant forum
scientific technical
----------------
Topic 5: 
transversal skill
written report
expected activity
oral presentation
electron microscopy
signal processing
life science
activity make
carry activity
optimal time
----------------
Topic 6: 
case study
transversal skill


In [19]:
lda = LDA.train(bag_with_document_index, k=10, seed=1,topicConcentration = 2.01,
                 docConcentration = 6.0,maxIterations=30)
printTopics(lda.describeTopics(10))

Topic 1: 
case study
transversal skill
expected activity
office hour
supervision office
hour assistant
assistant forum
important start
evaluate source
problem set
----------------
Topic 2: 
expected activity
office hour
important start
transversal skill
supervision office
hour assistant
assistant forum
written exam
evaluate source
case study
----------------
Topic 3: 
expected activity
office hour
important start
supervision office
hour assistant
transversal skill
assistant forum
written exam
linear algebra
case study
----------------
Topic 4: 
expected activity
important start
office hour
supervision office
hour assistant
transversal skill
signal processing
assistant forum
written exam
linear algebra
----------------
Topic 5: 
transversal skill
expected activity
scientific technical
oral presentation
written report
activity make
make optimal
optimal time
carry activity
find optimal
----------------
Topic 6: 
expected activity
transversal skill
office hour
important start
supervision o

In [20]:
lda = LDA.train(bag_with_document_index, k=10, seed=1,topicConcentration = 4.01,
                 docConcentration = 6.0,maxIterations=30)
printTopics(lda.describeTopics(10))

Topic 1: 
transversal skill
expected activity
office hour
supervision office
important start
hour assistant
assistant forum
case study
written exam
evaluate source
----------------
Topic 2: 
transversal skill
expected activity
office hour
supervision office
important start
hour assistant
assistant forum
case study
written exam
evaluate source
----------------
Topic 3: 
transversal skill
expected activity
office hour
supervision office
important start
hour assistant
assistant forum
case study
written exam
evaluate source
----------------
Topic 4: 
transversal skill
expected activity
office hour
supervision office
important start
hour assistant
assistant forum
case study
written exam
evaluate source
----------------
Topic 5: 
transversal skill
expected activity
office hour
supervision office
important start
hour assistant
assistant forum
case study
written exam
evaluate source
----------------
Topic 6: 
transversal skill
expected activity
office hour
supervision office
important start
ho

In [132]:
lda = LDA.train(bag_with_document_index, k=10, seed=1,topicConcentration = 6.01,
                 docConcentration = 6.0,maxIterations=30)
printTopics(lda.describeTopics(10))

Topic 1: 
transversal skill
expected activity
office hour
supervision office
important start
hour assistant
assistant forum
case study
written exam
evaluate source
----------------
Topic 2: 
transversal skill
expected activity
office hour
supervision office
important start
hour assistant
assistant forum
case study
written exam
evaluate source
----------------
Topic 3: 
transversal skill
expected activity
office hour
supervision office
important start
hour assistant
assistant forum
case study
written exam
evaluate source
----------------
Topic 4: 
transversal skill
expected activity
office hour
supervision office
important start
hour assistant
assistant forum
case study
written exam
evaluate source
----------------
Topic 5: 
transversal skill
expected activity
office hour
supervision office
important start
hour assistant
assistant forum
case study
written exam
evaluate source
----------------
Topic 6: 
transversal skill
expected activity
office hour
supervision office
important start
ho

#### 4.9.2 results
As beta increases, more and more words appear in several or even all topics; for beta = 4.01 and beta = 6.01 the topics shown are identical. This is because beta is the hyperparameter of the prior on each topic's word distribution: a higher beta means that a word often belongs to multiple topics. However, we know that course descriptions contain many words that clearly distinguish between topics. For example, "mitochondria" or "dna" would most likely only appear in a topic related to biology. To better analyse the course descriptions we therefore need a low beta.
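
The overlap between topics can also be expressed as a single number, which makes different settings of beta easier to compare. The sketch below computes the mean pairwise Jaccard similarity of the top-term lists, using a hypothetical helper `mean_pairwise_jaccard`. The data are the first five terms of topics 1 to 3 from the outputs printed above for `topicConcentration = 1.01` and `topicConcentration = 4.01`.

```python
# Illustrative sketch only: `jaccard` and `mean_pairwise_jaccard` are hypothetical helpers.
from itertools import combinations


def jaccard(first, second):
    """Jaccard similarity of two term lists: |intersection| / |union|."""
    first, second = set(first), set(second)
    return len(first & second) / len(first | second)


def mean_pairwise_jaccard(topics):
    """Average Jaccard similarity over all pairs of topics (1.0 means identical term sets)."""
    pairs = list(combinations(topics, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# First five terms of topics 1 to 3 printed for topicConcentration = 1.01 ...
low_beta = [
    ["transversal skill", "expected activity", "office hour", "linear algebra", "solid waste"],
    ["transversal skill", "important start", "expected activity", "office hour", "supervision office"],
    ["transversal skill", "important start", "office hour", "hour assistant", "supervision office"],
]
# ... and for topicConcentration = 4.01, where the printed topics coincide.
high_beta = [
    ["transversal skill", "expected activity", "office hour", "supervision office", "important start"],
] * 3

print(mean_pairwise_jaccard(low_beta), mean_pairwise_jaccard(high_beta))
```

A score close to 1.0 indicates that the topics have collapsed onto the same terms, so a lower score is preferable when the goal is to tell topics apart.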

## Exercise 4.10: EPFL's taught subjects

As explained in 4.9, for maximum discernibility we should use a low beta and a high alpha for the LDA. We try a few different values to find a reasonable setting. Our first intuition is that k has to be at least the number of study plans (https://www.epfl.ch/education/studies/en/rules-and-procedures/study_plans/) or the number of schools (https://www.epfl.ch/schools/). Since we also have to account for other topics such as administration, and since every study plan covers multiple topics, k has to be quite large.

In [22]:
lda = LDA.train(bag_with_document_index, k=25, seed=1,topicConcentration = 1.01,
                 docConcentration = 20.0)
printTopics(lda.describeTopics(10))

Topic 1: 
problem set
case study
solid waste
transversal skill
oral presentation
waste management
exchange rate
expected activity
waste engineering
office hour
----------------
Topic 2: 
office hour
transversal skill
expected activity
life cycle
hour assistant
important start
supervision office
topic include
assistant forum
science engineering
----------------
Topic 3: 
transversal skill
expected activity
office hour
supervision office
hour assistant
important start
assistant forum
written exam
evaluate source
case study
----------------
Topic 4: 
case study
transversal skill
expected activity
large area
office hour
supervision office
important start
sensor actuator
hour assistant
assistant forum
----------------
Topic 5: 
monte carlo
asset pricing
linear algebra
social medium
transversal skill
financial market
expected activity
important start
academic year
probability measure
----------------
Topic 6: 
transversal skill
expected activity
office hour
supervision office
hour assistant


## Exercise 4.11: Wikipedia structure

In [24]:
wiki_data  = sc.textFile("/ix/wikipedia-for-schools.txt").map(json.loads)

In [101]:
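# Collect the vocabulary once so that the index mapping and the lookup list
# are built from the same ordering (recomputing the RDD is not guaranteed to
# return the tokens in the same order).
unique_tokens = wiki_data.flatMap(lambda x : x['tokens']).distinct().collect()

In [103]:
# Map each token to its position in unique_tokens.
token_to_index = {token: index for index, token in enumerate(unique_tokens)}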

In [113]:
def count_tokens(tokens):
    """Count token occurrences in one article and return them as a sparse vector."""
    token_count_dict = dict()
    for token in tokens:
        index = token_to_index[token]
        # Increment the count for this vocabulary index (starting from 0).
        token_count_dict[index] = token_count_dict.get(index, 0) + 1
    return Vectors.sparse(len(unique_tokens), token_count_dict)


# RDD of sparse term-count vectors, one per Wikipedia article.
wiki_counted = wiki_data.map(lambda x: count_tokens(x['tokens']))

In [115]:
wiki_with_document_index = wiki_counted.zipWithIndex().map(lambda x : [x[1],x[0]])

In [129]:
lda_wiki = LDA.train(wiki_with_document_index, k=5, seed=1, maxIterations=30, docConcentration = 1.21,topicConcentration=6.0)

In [130]:
def printWikiTopics(topics):
    counter = 1
    for topic in topics:
        print("Topic "+str(counter) + ": ")
        counter +=1
        words = topic[0]
        for word in words:
            print(unique_tokens[word])
        print("----------------")

In [131]:
printWikiTopics(lda_wiki.describeTopics(10))

Topic 1: 
crossfade
bilsonlegge
sratha
modified
eurasias
student
coin
comgall
invalided
icon
----------------
Topic 2: 
outbreak
britainin
cottage
multiplatform
resemble
rowsynchronizing
comgall
cries
highbrow
circa
----------------
Topic 3: 
rask
td
firefight
mannkind
flopped
conservationtoday
governmentindustry
speedtime
arenediazonium
relationshipbased
----------------
Topic 4: 
nameplates
rodrigo
sydney
komics
cidermaking
favour
eadwulf
alps
secretary
bearer
----------------
Topic 5: 
nyårsdagen
britons
agothe
type
stores
interviews
waybudding
pennants
morisots
regions
----------------


## Results

We noticed that, for every k we could try, the order of the topics changed massively with every seed and the topics did not seem to make much sense. A much larger k may be needed, since Wikipedia covers so many different topics, but we were limited by computing resources, more specifically memory, and could not label the topics. We chose a low alpha because each Wikipedia article is about one specific subject. For beta we chose a higher value because, with so many documents, each word is likely to appear in many of them. Better preprocessing might also help.
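
One possible form of such preprocessing is a document-frequency filter, which drops tokens that are either very rare or present in almost every article. The sketch below is an assumption about what could be tried, not something that was run for this lab. The helper name, the thresholds and the toy articles are all hypothetical.

```python
# Illustrative sketch only: the thresholds and the helper name are assumptions.
from collections import Counter


def filter_by_document_frequency(documents, min_docs=2, max_fraction=0.9):
    """Drop tokens that occur in fewer than `min_docs` documents or in more
    than `max_fraction` of all documents."""
    # Count each token at most once per document.
    document_frequency = Counter(token for tokens in documents for token in set(tokens))
    max_docs = max_fraction * len(documents)
    kept = {
        token
        for token, count in document_frequency.items()
        if min_docs <= count <= max_docs
    }
    return [[token for token in tokens if token in kept] for tokens in documents]


# Toy articles (hypothetical tokens, not taken from the Wikipedia data set).
articles = [
    ["the", "roman", "empire", "senate"],
    ["the", "roman", "army", "legion"],
    ["the", "cell", "nucleus", "army"],
]

print(filter_by_document_frequency(articles))
```

Removing rare tokens would also shrink the vocabulary, and with it the size of the sparse vectors, which may ease the memory limits mentioned above.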