# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** *O*

**Names:**

* *Argelaguet Franquelo, Pau*
* *du Bois de Dunilac, Vivien*

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors

import pickle
import numpy as np
import string
import collections
import operator
import math

from functools import reduce
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl, save_pkl

In [11]:
courses = load_json('data/courses.txt')
preprocessed = load_pkl('data/preprocess.pckl')
terms = load_pkl('data/terms.pckl')

## Exercise 4.8: Topics extraction

In [3]:
print(list(list(preprocessed.items())[0][1]['description'].items()))

[('adapt', 2), ('adapt composit', 2), ('applic', 2), ('biocomposit', 2), ('composit', 12), ('composit applic', 2), ('cost', 2), ('develop', 2), ('nanocomposit', 2), ('perform', 2), ('polym', 2), ('product', 3), ('team', 2)]


In [4]:
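def createTermVector(bagOfWord):
    # Build a dense count vector aligned with the global `terms` vocabulary;
    # terms absent from the bag of words get a count of 0.
    countList = [bagOfWord.get(term, 0) for term in terms]
    return Vectors.dense(countList)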

In [5]:
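# Pair every course with a unique integer id: LDA.train expects [id, vector] rows.
vectorList = []
for docId, (_, course) in enumerate(preprocessed.items(), start=1):
    vectorList.append([docId, createTermVector(course['description'])])
# `sc` is assumed to be the SparkContext provided by the notebook environment.
rdd = sc.parallelize(vectorList)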
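
The bag of words printed above holds only 13 non-zero terms, so a dense vector over the whole vocabulary is mostly zeros. Here is a minimal sketch of a sparse alternative. The helper `toSparsePairs` and the toy vocabulary are hypothetical and not part of the lab code. The helper keeps only the non-zero counts as sorted `(index, count)` pairs, a form that `Vectors.sparse(size, pairs)` should accept.

```python
def toSparsePairs(bagOfWord, termIndex):
    """Return sorted (index, count) pairs for the terms present in the vocabulary."""
    pairs = [(termIndex[term], count)
             for term, count in bagOfWord.items()
             if term in termIndex and count != 0]
    return sorted(pairs)

# Toy vocabulary and bag of words (hypothetical data, for illustration only).
toyTerms = ['adapt', 'applic', 'composit', 'cost', 'polym']
toyIndex = {term: i for i, term in enumerate(toyTerms)}
toyBag = {'composit': 12, 'cost': 2, 'unknown term': 5}

sparsePairs = toSparsePairs(toyBag, toyIndex)
print(sparsePairs)
```

Building the `term -> index` dictionary once also avoids scanning the full vocabulary for every course.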

In [9]:
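def pTop(ldaModel, nWords=10):
    # Print the nWords highest-weighted terms of every topic.
    # describeTopics() yields one (term indices, term weights) pair per topic.
    for termIndices, _ in ldaModel.describeTopics(maxTermsPerTopic=nWords):
        print([terms[i] for i in termIndices])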

### Varying ALPHA

In [93]:
ldaModel = LDA.train(rdd, k=10, topicConcentration=1.01, docConcentration=1.01)
pTop(ldaModel)

['heat', 'communic', 'chemic', 'electron', 'applic', 'reaction', 'algorithm', 'linear', 'comput', 'flow']
['energi', 'report', 'skill', 'evalu', 'data', 'plan', 'convers', 'comput', 'week', 'technolog']
['physic', 'equat', 'problem', 'mass', 'properti', 'structur', 'group', 'applic', 'solv', 'mechan']
['energi', 'paper', 'engin', 'discuss', 'industri', 'busi', 'cell', 'assess', 'team', 'technolog']
['problem', 'program', 'comput', 'optim', 'numer', 'research', 'linear', 'algorithm', 'plan', 'signal']
['data', 'network', 'robot', 'control', 'algorithm', 'magnet', 'water', 'assess', 'exam', 'architectur']
['electron', 'stochast', 'financi', 'control', 'architectur', 'applic', 'function', 'comput', 'linear', 'theori']
['simul', 'studi', 'case', 'risk', 'data', 'manag', 'assess', 'case studi', 'mechan', 'activ']
['biolog', 'chemic', 'organ', 'structur', 'chemistri', 'develop', 'reaction', 'scientif', 'evalu', 'molecular']
['optic', 'microscopi', 'electron', 'mechan', 'quantum', 'spectrosco

In [85]:
ldaModel = LDA.train(rdd, k=10, topicConcentration=1.01, docConcentration=3.0)
pTop(ldaModel)

['report', 'flow', 'linear', 'problem', 'skill', 'numer', 'data', 'scientif', 'optim', 'experiment']
['electron', 'quantum', 'theori', 'cell', 'microscopi', 'applic', 'magnet', 'structur', 'introduct', 'risk']
['circuit', 'architectur', 'devic', 'field', 'sensor', 'activ', 'signal', 'robot', 'integr', 'techniqu']
['optic', 'biolog', 'paper', 'laser', 'discuss', 'chemic', 'protein', 'physic', 'chemistri', 'interact']
['case', 'studi', 'manag', 'cell', 'power', 'applic', 'case studi', 'engin', 'group', 'wast']
['energi', 'engin', 'thermodynam', 'numer', 'physic', 'dure', 'convers', 'environment', 'comput', 'technolog']
['control', 'mechan', 'structur', 'properti', 'function', 'fundament', 'fractur', 'measur', 'problem', 'statist']
['organ', 'structur', 'engin', 'machin', 'network', 'properti', 'transport', 'physic', 'program', 'metal']
['data', 'develop', 'week', 'innov', 'research', 'technolog', 'plan', 'assess', 'evalu', 'differ']
['imag', 'stochast', 'theori', 'probabl', 'introduct', 

In [86]:
ldaModel = LDA.train(rdd, k=10, topicConcentration=1.01, docConcentration=5.0)
pTop(ldaModel)

['statist', 'program', 'data', 'linear', 'optim', 'problem', 'algorithm', 'theori', 'comput', 'function']
['control', 'engin', 'circuit', 'robot', 'filter', 'oper', 'week', 'power', 'imag', 'integr']
['paper', 'electron', 'control', 'field', 'data', 'water', 'discuss', 'solid', 'energi', 'devic']
['report', 'skill', 'laser', 'research', 'quantum', 'evalu', 'laboratori', 'data', 'activ', 'organ']
['simul', 'market', 'price', 'flow', 'scientif', 'architectur', 'develop', 'research', 'theori', 'physic']
['optic', 'energi', 'fourier', 'risk', 'solv', 'algebra', 'imag', 'physic', 'problem', 'measur']
['mechan', 'magnet', 'applic', 'structur', 'properti', 'physic', 'organ', 'metal', 'communic', 'electron']
['stochast', 'numer', 'reactor', 'architectur', 'develop', 'visual', 'final', 'imag', 'assess', 'urban']
['biolog', 'cell', 'protein', 'signal', 'structur', 'molecular', 'optic', 'dynam', 'applic', 'note']
['studi', 'case', 'chemic', 'assess', 'group', 'case studi', 'reaction', 'class', 'p

In [87]:
ldaModel = LDA.train(rdd, k=10, topicConcentration=1.01, docConcentration=7.5)
pTop(ldaModel)

['electron', 'sensor', 'techniqu', 'applic', 'product', 'industri', 'communic', 'devic', 'principl', 'technolog']
['linear', 'statist', 'probabl', 'introduct', 'stochast', 'price', 'financi', 'imag', 'measur', 'market']
['paper', 'teach', 'scienc', 'space', 'theori', 'architectur', 'comput', 'research', 'group', 'scale']
['biolog', 'field', 'engin', 'structur', 'problem', 'assess', 'physic', 'research', 'skill', 'energi']
['optic', 'structur', 'mechan', 'quantum', 'laser', 'molecular', 'protein', 'function', 'statist', 'applic']
['energi', 'control', 'optim', 'program', 'data', 'risk', 'network', 'engin', 'communic', 'power']
['case', 'imag', 'studi', 'integr', 'function', 'group', 'activ', 'topic', 'fourier', 'circuit']
['chemic', 'flow', 'reaction', 'equat', 'chemistri', 'magnet', 'heat', 'physic', 'thermodynam', 'transfer']
['report', 'data', 'experiment', 'week', 'skill', 'scientif', 'laboratori', 'robot', 'written', 'stabil']
['develop', 'semest', 'innov', 'architectur', 'numer', 

In [88]:
ldaModel = LDA.train(rdd, k=10, topicConcentration=1.01, docConcentration=10.0)
pTop(ldaModel)

['optic', 'laser', 'reaction', 'studi', 'theori', 'microscopi', 'problem', 'reactor', 'evalu', 'measur']
['data', 'imag', 'program', 'comput', 'statist', 'algorithm', 'signal', 'final', 'practic', 'visual']
['electron', 'communic', 'simul', 'robot', 'energi', 'flow', 'equat', 'introduct', 'circuit', 'price']
['biolog', 'protein', 'comput', 'molecular', 'interact', 'quantum', 'architectur', 'function', 'dynam', 'theori']
['numer', 'engin', 'magnet', 'teach', 'equat', 'mathemat', 'probabl', 'biolog', 'practic', 'properti']
['paper', 'semest', 'discuss', 'product', 'assess', 'environment', 'econom', 'research', 'innov', 'studi']
['report', 'evalu', 'develop', 'research', 'scientif', 'data', 'problem', 'experiment', 'optim', 'skill']
['cell', 'control', 'chemic', 'energi', 'metal', 'thermodynam', 'mechan', 'organ', 'electron', 'convers']
['structur', 'energi', 'mechan', 'manag', 'control', 'group', 'risk', 'inform', 'assess', 'dynam']
['physic', 'field', 'devic', 'differ', 'industri', 'act

### Varying BETA

In [89]:
ldaModel = LDA.train(rdd, k=10, topicConcentration=1.01, docConcentration=6.0)
pTop(ldaModel)

['structur', 'mechan', 'comput', 'quantum', 'numer', 'cell', 'dynam', 'simul', 'applic', 'solv']
['control', 'protein', 'biolog', 'imag', 'research', 'network', 'develop', 'activ', 'chemic', 'studi']
['report', 'problem', 'evalu', 'plan', 'data', 'skill', 'optim', 'scientif', 'assess', 'studi']
['flow', 'physic', 'practic', 'teach', 'skill', 'theori', 'test', 'equat', 'product', 'technolog']
['electron', 'devic', 'imag', 'microscopi', 'structur', 'techniqu', 'properti', 'chemistri', 'integr', 'organ']
['risk', 'manag', 'magnet', 'market', 'differ', 'financi', 'case', 'price', 'assess', 'introduct']
['engin', 'paper', 'field', 'biolog', 'discuss', 'reaction', 'communic', 'chemic', 'activ', 'circuit']
['energi', 'mass', 'industri', 'thermodynam', 'heat', 'environment', 'evalu', 'water', 'chemic', 'convers']
['statist', 'signal', 'linear', 'probabl', 'stochast', 'theori', 'data', 'robot', 'time', 'control']
['optic', 'data', 'cell', 'algorithm', 'laser', 'note', 'function', 'program', 'ap

In [90]:
ldaModel = LDA.train(rdd, k=10, topicConcentration=3.0, docConcentration=6.0)
pTop(ldaModel)

['cell', 'biolog', 'applic', 'organ', 'chemic', 'physic', 'signal', 'function', 'reaction', 'activ']
['problem', 'numer', 'comput', 'linear', 'theori', 'optim', 'algorithm', 'program', 'statist', 'equat']
['optic', 'microscopi', 'electron', 'applic', 'imag', 'physic', 'principl', 'equat', 'control', 'theori']
['energi', 'heat', 'convers', 'thermodynam', 'architectur', 'magnet', 'transfer', 'physic', 'case', 'technolog']
['network', 'technolog', 'applic', 'theori', 'control', 'activ', 'communic', 'techniqu', 'imag', 'develop']
['data', 'report', 'research', 'evalu', 'scientif', 'skill', 'plan', 'assess', 'experiment', 'problem']
['structur', 'week', 'class', 'engin', 'control', 'assess', 'group', 'activ', 'develop', 'evalu']
['laser', 'quantum', 'physic', 'mechan', 'biolog', 'architectur', 'chemic', 'electron', 'engin', 'principl']
['electron', 'applic', 'theori', 'mechan', 'signal', 'structur', 'techniqu', 'introduct', 'devic', 'properti']
['imag', 'applic', 'physic', 'mechan', 'engin'

In [91]:
ldaModel = LDA.train(rdd, k=10, topicConcentration=5.0, docConcentration=6.0)
pTop(ldaModel)

['data', 'problem', 'assess', 'activ', 'control', 'optic', 'evalu', 'comput', 'energi', 'applic']
['electron', 'energi', 'imag', 'applic', 'structur', 'physic', 'biolog', 'control', 'activ', 'mechan']
['problem', 'comput', 'data', 'assess', 'applic', 'evalu', 'activ', 'theori', 'structur', 'optim']
['biolog', 'structur', 'physic', 'mechan', 'applic', 'chemic', 'energi', 'assess', 'cell', 'engin']
['energi', 'applic', 'structur', 'data', 'optic', 'electron', 'theori', 'assess', 'problem', 'physic']
['structur', 'optic', 'problem', 'applic', 'electron', 'data', 'biolog', 'physic', 'theori', 'activ']
['data', 'engin', 'problem', 'structur', 'activ', 'theori', 'assess', 'applic', 'evalu', 'comput']
['electron', 'optic', 'energi', 'applic', 'structur', 'physic', 'mechan', 'control', 'theori', 'properti']
['structur', 'optic', 'problem', 'applic', 'comput', 'theori', 'biolog', 'physic', 'mechan', 'activ']
['report', 'data', 'evalu', 'research', 'assess', 'problem', 'technolog', 'skill', 'act

In [92]:
ldaModel = LDA.train(rdd, k=10, topicConcentration=7.0, docConcentration=6.0)
pTop(ldaModel)

['structur', 'electron', 'applic', 'data', 'problem', 'physic', 'mechan', 'energi', 'assess', 'activ']
['energi', 'data', 'problem', 'assess', 'structur', 'applic', 'evalu', 'activ', 'comput', 'theori']
['data', 'energi', 'evalu', 'assess', 'problem', 'activ', 'report', 'applic', 'structur', 'comput']
['optic', 'applic', 'problem', 'energi', 'activ', 'electron', 'assess', 'theori', 'structur', 'comput']
['data', 'optic', 'structur', 'applic', 'problem', 'theori', 'assess', 'activ', 'comput', 'evalu']
['structur', 'applic', 'problem', 'theori', 'assess', 'data', 'activ', 'comput', 'evalu', 'physic']
['problem', 'structur', 'applic', 'comput', 'theori', 'data', 'assess', 'activ', 'physic', 'mechan']
['structur', 'data', 'problem', 'assess', 'applic', 'engin', 'activ', 'evalu', 'theori', 'comput']
['data', 'applic', 'structur', 'problem', 'assess', 'engin', 'energi', 'activ', 'evalu', 'comput']
['assess', 'structur', 'evalu', 'problem', 'applic', 'data', 'activ', 'engin', 'theori', 'resea

### Optimized parameters

In [137]:
# k: number of topics.
# alpha (docConcentration): a higher alpha spreads each document over more topics.
# beta (topicConcentration): a higher beta spreads each topic over more words.
ldaModel = LDA.train(rdd, k=40, docConcentration=1.51, topicConcentration=1.01, seed=10)

# Print the top 10 terms of each topic.
pTop(ldaModel, 10)

['measur', 'field', 'wast', 'snow', 'solid', 'physic', 'properti', 'solid wast', 'manag', 'engin']
['laser', 'network', 'digit', 'architectur', 'generat', 'studio', 'form', 'communic', 'urban', 'applic']
['imag', 'diffract', 'structur', 'mass', 'scienc', 'electron', 'xray', 'organ', 'spectrometri', 'mass spectrometri']
['case', 'assess', 'communic', 'class', 'surfac', 'chain', 'suppli', 'reactor', 'studi', 'oper']
['polici', 'time', 'environment', 'evalu', 'power', 'communic', 'modul', 'regul', 'solut', 'recycl']
['space', 'product', 'inform', 'neuron', 'function', 'modul', 'communic', 'biominer', 'algebra', 'applic']
['structur', 'mechan', 'protein', 'properti', 'studi', 'applic', 'local', 'biolog', 'optim', 'chemic']
['solv', 'numer', 'robot', 'problem', 'equat', 'practic', 'comput', 'program', 'distribut', 'linear']
['cell', 'stabil', 'algorithm', 'structur', 'function', 'data', 'week', 'dynam', 'cancer', 'lim']
['paper', 'quantum', 'discuss', 'topic', 'oral', 'assess', 'molecular',
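
The effect of a Dirichlet concentration parameter can be illustrated without Spark. The sketch below is not the lab's method: it draws samples from a symmetric Dirichlet by normalizing Gamma draws and reports the average weight of the dominant component. The helper names and the two concentration values are assumptions chosen only to make the contrast visible. A low concentration yields samples dominated by one component, while a high concentration yields nearly uniform samples. The same intuition applies to alpha (topics per document) and beta (words per topic).

```python
import random

def sampleSymmetricDirichlet(k, concentration, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def meanLargestShare(k, concentration, nSamples=500, seed=0):
    """Average weight of the dominant component over nSamples draws."""
    rng = random.Random(seed)
    return sum(max(sampleSymmetricDirichlet(k, concentration, rng))
               for _ in range(nSamples)) / nSamples

# Illustrative concentrations (hypothetical, not the values used in the notebook).
lowConcentration = meanLargestShare(k=10, concentration=0.1)
highConcentration = meanLargestShare(k=10, concentration=10.0)
print(lowConcentration, highConcentration)
```

This matches the trend in the outputs above: as `topicConcentration` grows, the topics' top words become increasingly similar.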

In [136]:
def extractTopics(ldaModel, nWords=10):
    # Return, for every topic, the list of its nWords highest-weighted terms.
    topicList = []
    for termIndices, _ in ldaModel.describeTopics(maxTermsPerTopic=nWords):
        topicList.append([terms[i] for i in termIndices])
    return topicList
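
To make the data shape explicit: `describeTopics` returns, for each topic, a pair of matching arrays (term indices, term weights). The sketch below mimics that shape with a hypothetical `ToyLdaModel` stand-in and a three-word toy vocabulary, so the index-to-term mapping can be tried without a Spark session. Both are illustrative assumptions, not the trained model.

```python
class ToyLdaModel:
    """Hypothetical stand-in mimicking the (indices, weights) pairs of describeTopics."""

    def describeTopics(self, maxTermsPerTopic=None):
        topics = [([2, 0, 1], [0.5, 0.3, 0.2]),
                  ([1, 2, 0], [0.6, 0.3, 0.1])]
        if maxTermsPerTopic is None:
            return topics
        return [(idx[:maxTermsPerTopic], w[:maxTermsPerTopic]) for idx, w in topics]

toyTerms = ['energi', 'optic', 'laser']

def topTerms(model, vocabulary, nWords):
    """Map each topic's leading term indices back to vocabulary strings."""
    return [[vocabulary[i] for i in indices[:nWords]]
            for indices, _ in model.describeTopics()]

toyTopics = topTerms(ToyLdaModel(), toyTerms, 2)
print(toyTopics)
```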

In [129]:
courseTopics = extractTopics(ldaModel, 25)
topicLabels = range(len(courseTopics))
#topicLabels = ['Imagery', 'Scientific methodology', 'Statistics', 'Biology/Chemistry', 'Signal/Image processing', 'Energy', 'Lab environment', 'Maths', 'Stochastic models', 'Programming', 'Business', 'Organization']

In [131]:
# A course is tagged with a topic when more than 4 of its description terms
# appear among that topic's top terms.
for course in preprocessed.values():
    topics = []
    for i, topicTerms in enumerate(courseTopics):
        count = sum(1 for word in course['description'] if word in topicTerms)
        if count > 4:
            topics.append(topicLabels[i])
    print(course['name'], topics)

Composites technology []
Image Processing for Life Science []
Global business environment [2, 5, 7, 13, 16, 27, 33]
Electrochemical nano-bio-sensing and bio/CMOS interfaces []
Structural mechanics (for MT) []
Théorie et critique du projet MA2 (Boltshauser) [0, 1, 12, 16, 31, 32]
Advanced principles and applications of systems biology [4]
Mass spectrometry [30]
Principles of digital communications []
Hardware systems modeling I [13]
Quantitative systems modeling techniques [21]
Medical radiation physics []
Bio-nano-chip design []
Principles of powder and densification processing []
Fundamentals of solid-state materials [29]
Microeconomics []
Polymer chemistry and macromolecular engineering []
Physical chemistry of polymeric materials [30]
Fundamentals of neuroengineering []
In Silico neuroscience []
Materials selection [2, 5, 31]
Optics laboratories I [3, 4, 7, 9, 13, 25, 31, 33]
Fracture mechanics [12, 23, 27]
Air pollution and climate change [8]
Philosophy of life sciences I [5, 7, 1

Financial Econometrics (EDFI) []
Interdisciplinary / disciplinary project for chemical master []
Social media [0]
Fixed income analysis [17]
Parallelism and concurrency []
Optical fibers and fiber devices []
Surface and thin films processes []
Optimization and simulation [21]
Financial & managerial accounting []
Innovation management [5, 7, 13, 16, 22, 27]
Image analysis and pattern recognition []
Stochastic calculus I [11, 17]
Piezoelectric materials, properties and devices []
Limnology []
Fundamentals of biosensors and electronic biochips [33]
Signal processing for communications [10, 19, 21, 27, 33]
Soft Microsystems Processing and Devices [10, 12, 24, 26, 29]
A Network Tour of Data Science [15, 21, 31, 34]
Mass spectrometry, principles and applications []
Development engineering [0, 6, 7, 16, 31, 33]
Assembly techniques [5, 6, 10, 12, 27, 31]
Nonlinear Optics []
Nonlinear Spectroscopy []
Advanced analog and RF integrated circuits design II []
RF MEMS for communications applications
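
Many courses above end up with an empty topic list, which suggests that the overlap threshold strongly affects the result. Here is a minimal sketch of the same overlap rule with the threshold exposed as a parameter. The function `assignTopics`, its `minOverlap` argument and the toy data are hypothetical, introduced only for illustration.

```python
def assignTopics(descriptionTerms, topicTermLists, minOverlap):
    """Return indices of topics sharing more than minOverlap terms with the description."""
    described = set(descriptionTerms)
    return [i for i, topicTerms in enumerate(topicTermLists)
            if len(described & set(topicTerms)) > minOverlap]

# Toy topics and course description (hypothetical data).
toyTopics = [['optic', 'laser', 'quantum'], ['market', 'price', 'risk']]
toyDescription = {'optic': 3, 'laser': 2, 'risk': 1}

strictLabels = assignTopics(toyDescription, toyTopics, minOverlap=1)
looseLabels = assignTopics(toyDescription, toyTopics, minOverlap=0)
print(strictLabels, looseLabels)
```

Lowering the threshold tags more courses, at the cost of weaker matches. The rule counts distinct shared terms only, so term frequencies in the description are ignored.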

## Exercise 4.9: Dirichlet hyperparameters

## Exercise 4.10: EPFL's taught subjects

## Exercise 4.11: Wikipedia structure