# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** *O*

**Names:**

* *Argelaguet Franquelo, Pau*
* *du Bois de Dunilac, Vivien*

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors

import json
import pickle
import numpy as np
import string
import collections
import operator
import math

from functools import reduce
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl, save_pkl

In [2]:
courses = load_json('data/courses.txt')
preprocessed = load_pkl('data/preprocess.pckl')
terms = load_pkl('data/terms.pckl')

In [3]:
mat = load_pkl('data/mat.pckl')
doc = load_pkl('data/documents.pckl')
stopwords = load_pkl('data/stopwords.pkl')

## Exercise 4.8: Topics extraction

In [4]:
# Create a dense vector of word counts from a dict of (word, wordCount) pairs.
# Position i holds the count of terms[i]; terms absent from the dict count as 0.
def createTermVector(bagOfWord):
    countList = []
    for term in terms:
        countList.append(bagOfWord.get(term, 0))
    return Vectors.dense(countList)

In [5]:
# For each topic of the model print the nWords top words
def pTop(ldaModel, nWords=10):
    for topic in ldaModel.describeTopics():
        termList = []
        for i in range(nWords):
            termList.append(terms[topic[0][i]])
        print(termList)

In [6]:
# Create an rdd of term vectors for the documents
vectorList = []
counter = 1
for it in list(preprocessed.items()):
    vectorList.append([counter, createTermVector(it[1]['description'])])
    counter += 1
rdd = sc.parallelize(vectorList)

In [7]:
# Extract 10 topics with default parameters
ldaModel = LDA.train(rdd, k=10)
pTop(ldaModel)

['case', 'risk', 'manag', 'innov', 'studi', 'exam', 'assess', 'class', 'market', 'evalu']
['structur', 'program', 'algorithm', 'comput', 'solut', 'properti', 'languag', 'cell', 'machin', 'techniqu']
['statist', 'cell', 'linear', 'probabl', 'biolog', 'mass', 'engin', 'time', 'data', 'physic']
['report', 'data', 'skill', 'energi', 'evalu', 'numer', 'problem', 'laboratori', 'heat', 'optim']
['chemic', 'chemistri', 'electron', 'reaction', 'organ', 'product', 'thermodynam', 'kinet', 'engin', 'theori']
['engin', 'space', 'introduct', 'integr', 'dure', 'theori', 'stochast', 'differenti', 'equat', 'differ']
['structur', 'research', 'circuit', 'test', 'mechan', 'energi', 'architectur', 'session', 'develop', 'speech']
['paper', 'control', 'dynam', 'protein', 'molecular', 'quantum', 'robot', 'discuss', 'oral', 'biolog']
['optic', 'laser', 'applic', 'control', 'communic', 'power', 'physic', 'network', 'digit', 'signal']
['imag', 'magnet', 'physic', 'develop', 'applic', 'technolog', 'teach', 'techn

With the default parameters, the topics are not very precise. Some of them combine several subjects, while others contain words of little significance such as 'semest', 'includ' and 'scientif'. Overall, LSI seems to perform better than LDA with its default parameters: the topics obtained by LSI are more precise and do not group completely unrelated words together.

Labels for the LDA topics could be:
1. Physics
2. General Engineering
3. Data and statistics
4. Biology and chemistry
5. Risks, management
6. Electronics and robotics
7. Imagery
8. Biology
9. Optics
10. Computing and signal processing

## Exercise 4.9: Dirichlet hyperparameters

### Varying ALPHA

The alpha parameter (`docConcentration` in the code) is the parameter of the prior distribution of topics per document. Increasing alpha means spreading topics more uniformly across the documents (more topics per document).
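
The effect of alpha on the per-document topic mixture can be illustrated without Spark. The following is a minimal sketch, assuming a symmetric Dirichlet prior over 10 topics; the helper names `sample_dirichlet` and `mean_largest_share` are hypothetical, and the sketch only illustrates the prior itself, not MLlib's training procedure. It measures how much of a sampled document, on average, is taken up by its dominant topic.

```python
import random

def sample_dirichlet(alpha, k, rng):
    # A symmetric Dirichlet sample is obtained by normalizing k Gamma(alpha, 1) draws
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=10, n_docs=2000, seed=0):
    # Average weight of the dominant topic over n_docs sampled topic mixtures
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_docs)) / n_docs

low_alpha = mean_largest_share(1.01)   # smallest alpha tried in this notebook
high_alpha = mean_largest_share(10.0)  # largest alpha tried in this notebook
print(low_alpha, high_alpha)
```

With the larger alpha the dominant topic takes a smaller share of each sampled mixture, which matches the intuition of "more topics per document".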

In [8]:
ldaModel = LDA.train(rdd, k=10, topicConcentration=1.01, docConcentration=1.01, seed=10)
pTop(ldaModel)

['architectur', 'reaction', 'develop', 'organ', 'chemistri', 'differ', 'studi', 'assess', 'semest', 'comput']
['biolog', 'cell', 'chemic', 'protein', 'engin', 'structur', 'molecular', 'develop', 'research', 'cancer']
['optic', 'laser', 'technolog', 'architectur', 'applic', 'wast', 'electron', 'principl', 'theori', 'physic']
['physic', 'paper', 'electron', 'chemic', 'organ', 'properti', 'techniqu', 'week', 'applic', 'transport']
['signal', 'imag', 'equat', 'filter', 'engin', 'theori', 'problem', 'chemistri', 'applic', 'data']
['energi', 'heat', 'circuit', 'transfer', 'risk', 'convers', 'principl', 'molecular', 'assess', 'metal']
['report', 'scientif', 'mechan', 'skill', 'experiment', 'structur', 'laboratori', 'electron', 'problem', 'evalu']
['algorithm', 'comput', 'problem', 'linear', 'numer', 'optim', 'simul', 'case', 'theori', 'studi']
['microscopi', 'imag', 'optic', 'theori', 'electron', 'introduct', 'structur', 'fourier', 'stochast', 'applic']
['data', 'control', 'program', 'magnet'

In [9]:
ldaModel = LDA.train(rdd, k=10, topicConcentration=1.01, docConcentration=3.0, seed=10)
pTop(ldaModel)

['reaction', 'develop', 'architectur', 'differ', 'data', 'assess', 'communic', 'semest', 'studi', 'chemistri']
['biolog', 'cell', 'protein', 'chemic', 'engin', 'develop', 'research', 'molecular', 'robot', 'practic']
['optic', 'laser', 'architectur', 'technolog', 'reactor', 'principl', 'applic', 'theori', 'solid', 'wast']
['paper', 'chemic', 'organ', 'physic', 'network', 'techniqu', 'week', 'electron', 'busi', 'industri']
['equat', 'signal', 'engin', 'imag', 'theori', 'filter', 'space', 'problem', 'differenti', 'chemistri']
['energi', 'heat', 'report', 'technolog', 'circuit', 'transfer', 'molecular', 'thermodynam', 'assess', 'risk']
['mechan', 'electron', 'report', 'structur', 'skill', 'experiment', 'scientif', 'flow', 'properti', 'measur']
['algorithm', 'comput', 'linear', 'numer', 'problem', 'optim', 'simul', 'program', 'case', 'studi']
['optic', 'microscopi', 'stochast', 'imag', 'theori', 'introduct', 'applic', 'fourier', 'electron', 'financi']
['data', 'control', 'magnet', 'program'

In [10]:
ldaModel = LDA.train(rdd, k=10, topicConcentration=1.01, docConcentration=5.0, seed=10)
pTop(ldaModel)

['data', 'reaction', 'develop', 'communic', 'differ', 'architectur', 'assess', 'semest', 'research', 'studi']
['biolog', 'cell', 'protein', 'chemic', 'engin', 'develop', 'teach', 'research', 'practic', 'robot']
['optic', 'laser', 'architectur', 'technolog', 'reactor', 'principl', 'manag', 'solid', 'theori', 'state']
['paper', 'chemic', 'organ', 'physic', 'week', 'techniqu', 'industri', 'network', 'busi', 'electron']
['equat', 'engin', 'imag', 'signal', 'theori', 'space', 'problem', 'filter', 'chemistri', 'differenti']
['energi', 'report', 'heat', 'molecular', 'technolog', 'thermodynam', 'transfer', 'circuit', 'evalu', 'assess']
['mechan', 'electron', 'structur', 'skill', 'flow', 'report', 'measur', 'properti', 'experiment', 'scientif']
['algorithm', 'comput', 'linear', 'numer', 'problem', 'optim', 'program', 'simul', 'case', 'studi']
['optic', 'stochast', 'microscopi', 'introduct', 'theori', 'applic', 'imag', 'fourier', 'financi', 'price']
['data', 'control', 'magnet', 'plan', 'group',

In [11]:
ldaModel = LDA.train(rdd, k=10, topicConcentration=1.01, docConcentration=7.5, seed=10)
pTop(ldaModel)

['data', 'communic', 'reaction', 'develop', 'differ', 'assess', 'architectur', 'signal', 'research', 'semest']
['biolog', 'cell', 'chemic', 'engin', 'protein', 'develop', 'teach', 'practic', 'scienc', 'robot']
['optic', 'laser', 'architectur', 'technolog', 'manag', 'reactor', 'integr', 'polici', 'solid', 'principl']
['paper', 'organ', 'chemic', 'techniqu', 'week', 'structur', 'electron', 'physic', 'industri', 'busi']
['equat', 'engin', 'theori', 'imag', 'risk', 'space', 'problem', 'chemistri', 'signal', 'probabl']
['energi', 'report', 'molecular', 'heat', 'transfer', 'evalu', 'thermodynam', 'technolog', 'research', 'assess']
['mechan', 'electron', 'structur', 'flow', 'report', 'experiment', 'measur', 'skill', 'properti', 'scientif']
['algorithm', 'comput', 'linear', 'numer', 'problem', 'simul', 'program', 'control', 'studi', 'case']
['optic', 'stochast', 'microscopi', 'introduct', 'applic', 'fourier', 'structur', 'theori', 'financi', 'physic']
['data', 'magnet', 'group', 'plan', 'dure'

In [12]:
ldaModel = LDA.train(rdd, k=10, topicConcentration=1.01, docConcentration=10.0, seed=10)
pTop(ldaModel)

['data', 'communic', 'reaction', 'differ', 'develop', 'assess', 'signal', 'architectur', 'research', 'semest']
['biolog', 'cell', 'chemic', 'engin', 'protein', 'develop', 'teach', 'practic', 'scienc', 'robot']
['laser', 'optic', 'architectur', 'technolog', 'manag', 'reactor', 'polici', 'solid', 'integr', 'generat']
['organ', 'paper', 'structur', 'chemic', 'techniqu', 'electron', 'week', 'applic', 'busi', 'optim']
['engin', 'equat', 'theori', 'risk', 'imag', 'probabl', 'space', 'chemistri', 'problem', 'differenti']
['energi', 'report', 'molecular', 'heat', 'transfer', 'evalu', 'thermodynam', 'research', 'technolog', 'comput']
['electron', 'mechan', 'structur', 'report', 'flow', 'experiment', 'scientif', 'skill', 'measur', 'properti']
['algorithm', 'comput', 'linear', 'numer', 'problem', 'simul', 'control', 'program', 'case', 'studi']
['optic', 'stochast', 'microscopi', 'applic', 'introduct', 'properti', 'mechan', 'physic', 'fourier', 'electron']
['data', 'magnet', 'group', 'plan', 'dure

We can observe that a low value of alpha results in topics that are very different from each other, but that do not always make intuitive sense. Several subjects end up in the same topic because the model assumes that a document is very unlikely to contain more than one topic.

A high value of alpha gives more mixed topics. The algorithm creates topics for words of little meaning because it tries to find too many topics in some documents. A balance needs to be found between identifying too few or too many topics in the documents to produce meaningful topics.

### Varying BETA

The beta parameter (`topicConcentration` in the code) is the parameter of the prior distribution of words per topic. Increasing beta spreads each topic over more words, which means considering that more words of a document belong to its topics.

In [13]:
ldaModel = LDA.train(rdd, k=10, topicConcentration=1.01, docConcentration=6.0, seed=10)
pTop(ldaModel)

['data', 'communic', 'reaction', 'develop', 'differ', 'assess', 'architectur', 'research', 'semest', 'signal']
['biolog', 'cell', 'protein', 'chemic', 'engin', 'develop', 'practic', 'teach', 'research', 'scienc']
['optic', 'laser', 'architectur', 'technolog', 'reactor', 'manag', 'principl', 'solid', 'integr', 'state']
['paper', 'chemic', 'organ', 'physic', 'week', 'techniqu', 'industri', 'electron', 'busi', 'structur']
['equat', 'engin', 'theori', 'imag', 'signal', 'space', 'problem', 'chemistri', 'differenti', 'filter']
['energi', 'report', 'heat', 'molecular', 'technolog', 'thermodynam', 'transfer', 'evalu', 'circuit', 'assess']
['mechan', 'electron', 'structur', 'flow', 'skill', 'report', 'measur', 'properti', 'experiment', 'scientif']
['algorithm', 'comput', 'linear', 'numer', 'problem', 'simul', 'program', 'optim', 'case', 'studi']
['optic', 'stochast', 'microscopi', 'introduct', 'applic', 'fourier', 'theori', 'financi', 'structur', 'imag']
['data', 'magnet', 'control', 'group', '

In [14]:
ldaModel = LDA.train(rdd, k=10, topicConcentration=3.0, docConcentration=6.0, seed=10)
pTop(ldaModel)

['data', 'architectur', 'electron', 'circuit', 'techniqu', 'comput', 'structur', 'problem', 'research', 'applic']
['biolog', 'protein', 'chemic', 'cell', 'molecular', 'structur', 'mechan', 'physic', 'function', 'applic']
['optic', 'laser', 'principl', 'physic', 'applic', 'technolog', 'activ', 'architectur', 'theori', 'wast']
['structur', 'week', 'mechan', 'equat', 'electron', 'numer', 'applic', 'stabil', 'paper', 'theori']
['imag', 'signal', 'applic', 'theori', 'structur', 'physic', 'comput', 'problem', 'techniqu', 'algorithm']
['report', 'engin', 'evalu', 'data', 'assess', 'plan', 'research', 'technolog', 'problem', 'skill']
['mechan', 'structur', 'problem', 'comput', 'electron', 'chemic', 'reaction', 'theori', 'organ', 'molecular']
['energi', 'risk', 'problem', 'linear', 'applic', 'comput', 'studi', 'control', 'theori', 'techniqu']
['cell', 'structur', 'biolog', 'theori', 'electron', 'applic', 'mechan', 'physic', 'properti', 'stochast']
['data', 'assess', 'develop', 'control', 'progr

In [15]:
ldaModel = LDA.train(rdd, k=10, topicConcentration=5.0, docConcentration=6.0, seed=10)
pTop(ldaModel)

['data', 'structur', 'problem', 'applic', 'comput', 'assess', 'electron', 'theori', 'activ', 'techniqu']
['structur', 'biolog', 'applic', 'cell', 'physic', 'mechan', 'data', 'chemic', 'theori', 'activ']
['optic', 'applic', 'physic', 'laser', 'activ', 'theori', 'energi', 'assess', 'evalu', 'problem']
['structur', 'data', 'applic', 'problem', 'electron', 'energi', 'mechan', 'comput', 'assess', 'theori']
['structur', 'applic', 'problem', 'theori', 'physic', 'imag', 'comput', 'data', 'mechan', 'assess']
['data', 'report', 'engin', 'evalu', 'assess', 'problem', 'energi', 'activ', 'technolog', 'applic']
['structur', 'problem', 'theori', 'mechan', 'comput', 'applic', 'data', 'assess', 'activ', 'evalu']
['energi', 'problem', 'applic', 'comput', 'assess', 'structur', 'activ', 'evalu', 'theori', 'control']
['structur', 'cell', 'applic', 'electron', 'theori', 'mechan', 'physic', 'biolog', 'activ', 'chemic']
['data', 'assess', 'problem', 'evalu', 'energi', 'activ', 'applic', 'comput', 'structur', 

In [16]:
ldaModel = LDA.train(rdd, k=10, topicConcentration=7.0, docConcentration=6.0, seed=10)
pTop(ldaModel)

['data', 'structur', 'problem', 'applic', 'assess', 'activ', 'comput', 'evalu', 'theori', 'electron']
['structur', 'applic', 'data', 'problem', 'physic', 'assess', 'activ', 'biolog', 'energi', 'mechan']
['optic', 'applic', 'energi', 'activ', 'assess', 'problem', 'structur', 'evalu', 'theori', 'data']
['data', 'structur', 'problem', 'applic', 'energi', 'assess', 'comput', 'evalu', 'activ', 'electron']
['structur', 'applic', 'problem', 'data', 'theori', 'comput', 'assess', 'physic', 'activ', 'mechan']
['data', 'problem', 'evalu', 'energi', 'assess', 'engin', 'report', 'activ', 'applic', 'comput']
['structur', 'problem', 'data', 'applic', 'assess', 'evalu', 'activ', 'comput', 'theori', 'mechan']
['energi', 'problem', 'applic', 'structur', 'assess', 'activ', 'evalu', 'comput', 'data', 'theori']
['structur', 'applic', 'theori', 'physic', 'electron', 'mechan', 'problem', 'optic', 'activ', 'assess']
['data', 'problem', 'assess', 'applic', 'energi', 'structur', 'evalu', 'activ', 'comput', 'the

In [17]:
ldaModel = LDA.train(rdd, k=10, topicConcentration=10.0, docConcentration=6.0, seed=10)
pTop(ldaModel)

['data', 'structur', 'problem', 'applic', 'assess', 'activ', 'evalu', 'energi', 'comput', 'theori']
['structur', 'data', 'applic', 'problem', 'assess', 'energi', 'activ', 'evalu', 'theori', 'comput']
['optic', 'applic', 'data', 'problem', 'structur', 'assess', 'energi', 'activ', 'evalu', 'theori']
['data', 'structur', 'problem', 'applic', 'assess', 'energi', 'activ', 'evalu', 'comput', 'theori']
['data', 'structur', 'applic', 'problem', 'assess', 'activ', 'theori', 'comput', 'evalu', 'physic']
['data', 'problem', 'applic', 'assess', 'energi', 'evalu', 'structur', 'activ', 'engin', 'comput']
['data', 'structur', 'problem', 'applic', 'assess', 'evalu', 'activ', 'comput', 'theori', 'optic']
['energi', 'problem', 'applic', 'structur', 'data', 'assess', 'activ', 'evalu', 'comput', 'theori']
['structur', 'applic', 'problem', 'assess', 'theori', 'activ', 'data', 'energi', 'evalu', 'physic']
['data', 'problem', 'structur', 'applic', 'assess', 'energi', 'evalu', 'activ', 'comput', 'theori']


With a low value of beta we can see that the topics are very different from one another for the five most significant words. For less significant words they make less sense and share many words. This makes sense because a low value of beta means that only the most frequent words in each document are considered meaningful.

With a high value of beta the topics become more similar. This makes sense as well because considering too many words as being important will add meaningless words to the topics. A balance needs to be found between making sure that all of the important words are considered as such and making sure that no meaningless words are considered important.
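
The growing similarity between topics could be quantified rather than judged by eye. The sketch below is one possible measure, not part of the original analysis: the mean pairwise Jaccard overlap between the top-word lists of the topics. The function name `mean_pairwise_jaccard` is hypothetical; the example lists are the first two topics printed above for `topicConcentration=1.01` and for `topicConcentration=10.0`.

```python
from itertools import combinations

def mean_pairwise_jaccard(topics):
    # Average Jaccard similarity |A ∩ B| / |A ∪ B| over all pairs of top-word lists
    sets = [set(t) for t in topics]
    pairs = [(a, b) for a, b in combinations(sets, 2) if a | b]
    if not pairs:
        return 0.0
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# First two topics obtained with topicConcentration=1.01
low_beta = [
    ['data', 'communic', 'reaction', 'develop', 'differ', 'assess', 'architectur', 'research', 'semest', 'signal'],
    ['biolog', 'cell', 'protein', 'chemic', 'engin', 'develop', 'practic', 'teach', 'research', 'scienc'],
]
# First two topics obtained with topicConcentration=10.0
high_beta = [
    ['data', 'structur', 'problem', 'applic', 'assess', 'activ', 'evalu', 'energi', 'comput', 'theori'],
    ['structur', 'data', 'applic', 'problem', 'assess', 'energi', 'activ', 'evalu', 'theori', 'comput'],
]

print(mean_pairwise_jaccard(low_beta), mean_pairwise_jaccard(high_beta))
```

Applied to the full output of `describeTopics()` for each run, a value close to 1 would indicate that the topics have collapsed onto the same words.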

## Exercise 4.10: EPFL's taught subjects

### Optimized parameters

In [18]:
# k = number of topics, alpha(doc) higher alpha means more different topics per document
# beta(topic) higher beta means more words per topic
ldaModel = LDA.train(rdd, k=12, docConcentration=1.8, topicConcentration=3.5, seed=10)

# print the top x terms for each topic
pTop(ldaModel, 10)

['electron', 'microscopi', 'spectroscopi', 'electron microscopi', 'devic', 'forc', 'print', 'scan', 'transmiss electron', 'polici']
['data', 'problem', 'optim', 'statist', 'program', 'linear', 'exam', 'skill', 'evalu', 'teach']
['circuit', 'analog', 'convert', 'rf', 'diffract', 'amplifi', 'nois', 'tem', 'transistor', 'mos']
['cell', 'chemic', 'chemistri', 'reaction', 'organ', 'molecular', 'structur', 'kinet', 'note', 'biolog']
['optic', 'control', 'signal', 'applic', 'communic', 'laser', 'sensor', 'principl', 'activ', 'practic']
['energi', 'heat', 'power', 'transfer', 'thermodynam', 'reactor', 'convers', 'mass', 'main', 'principl']
['biolog', 'research', 'structur', 'discuss', 'report', 'scientif', 'paper', 'week', 'semest', 'evalu']
['equat', 'theori', 'comput', 'function', 'physic', 'flow', 'mechan', 'numer', 'stochast', 'algorithm']
['imag', 'magnet', 'speech', 'recognit', 'applic', 'code', 'plasma', 'digit', 'techniqu', 'magnet materi']
['quantum', 'parallel', 'algorithm', 'memori'

After some testing we found values for alpha and beta that produce twelve relatively meaningful topics. Labels for those topics are:
1. Imagery
2. Data
3. Electronics
4. Biology and chemistry
5. Optics
6. Thermodynamics
7. Research, scientific methodology
8. Maths
9. Image and speech processing
10. Computing
11. Management
12. Hardware and metabolism

The first eleven are all meaningful and coherent: they do not seem to contain off-topic words and do not overlap. The twelfth topic, however, looks like a mix-up between computer hardware and biology.

In [19]:
topicLabels = dict(zip(range(12), ['Imagery','Data','Electronics','Bio/Chem','Optics','Thermodynamics','Research',
                                  'Math','Image/Speech processing','Computing','Management','Hardware/Brain?']))

In [20]:
tm = ldaModel.topicsMatrix()

These next cells will compute a score for each topic/course pair and display a list of the top topics for every course.

In [21]:
# Compute a score for each topic/course pair
# The score is the sum of the bag of words' terms multiplied by their weight in the model's topic matrix
# normalized by the number of words
def computeTopicScores(topicsMatrix, bagOfWords):
    topicScores = np.zeros(topicsMatrix.shape[1])
    wordCount = 0
    for i in range(topicsMatrix.shape[0]):
        occ = bagOfWords.get(terms[i], 0)
        wordCount += occ
        for j in range(topicsMatrix.shape[1]):
            topicScores[j] += occ * topicsMatrix[i][j]
    
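    # Avoid dividing by zero when the description shares no term with the vocabulary
    if wordCount == 0:
        return topicScores
    return topicScores / wordCount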

# Create a list of courses with a sorted list of topics indices and scores
def associateTopicScores(topicsMatrix):
    res = list()
    for k,v in preprocessed.items():
        scores = computeTopicScores(topicsMatrix, v['description'])
        res.append((v['name'], sorted(list(zip(list(range(topicsMatrix.shape[1])), scores)), key=lambda x:x[1])))
        
    return res

# Print all the topics that have a score above the given threshold or the best topic if none satisfy the threshold
def printCourse(course, thresh = 10):
    tList = []
    for tup in course[1]:
        if tup[1] > thresh:
            tList.append(tup)
    if len(tList) > 0:
        print(course[0], "->", list(map(lambda x: topicLabels[x[0]], sorted(tList, key=lambda x:x[1], reverse=True))))
    else:
        print(course[0], "->", topicLabels[course[1][-1][0]])
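
The score described above is a count-weighted average of the rows of the topic matrix, so it can also be written as a single vector-matrix product. The following is a minimal sketch under that reading; `topic_scores`, the toy vocabulary and the toy weights are hypothetical and only serve to make the example self-contained.

```python
import numpy as np

def topic_scores(topics_matrix, bag_of_words, vocabulary):
    # Word counts aligned with the rows of topics_matrix (one row per term)
    counts = np.array([bag_of_words.get(term, 0) for term in vocabulary], dtype=float)
    total = counts.sum()
    if total == 0:
        # No known word in the document: every topic gets a score of zero
        return np.zeros(topics_matrix.shape[1])
    # Weighted sum of the term rows, normalized by the number of words
    return counts @ topics_matrix / total

# Toy data (hypothetical): 3 terms, 2 topics
toy_vocabulary = ['algorithm', 'cell', 'optic']
toy_matrix = np.array([[1.0, 0.0],
                       [0.0, 2.0],
                       [3.0, 1.0]])
scores = topic_scores(toy_matrix, {'algorithm': 2, 'optic': 1}, toy_vocabulary)
print(scores)
```

This form avoids the two nested Python loops, which may matter when the vocabulary is large.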

In [22]:
courseToScores = associateTopicScores(tm)

In [23]:
print("List of courses with top topics for each of them \n")
for c in courseToScores:
    printCourse(c)

List of courses with top topics for each of them 

Composites technology -> ['Management', 'Data', 'Optics']
Image Processing for Life Science -> ['Data', 'Image/Speech processing', 'Optics']
Global business environment -> ['Management', 'Data']
Electrochemical nano-bio-sensing and bio/CMOS interfaces -> Imagery
Structural mechanics (for MT) -> ['Data', 'Research']
Théorie et critique du projet MA2 (Boltshauser) -> ['Management', 'Research', 'Data']
Advanced principles and applications of systems biology -> ['Data', 'Research', 'Optics', 'Math', 'Management']
Mass spectrometry -> ['Math', 'Research', 'Data', 'Thermodynamics', 'Bio/Chem']
Principles of digital communications -> ['Optics', 'Math']
Hardware systems modeling I -> Math
Quantitative systems modeling techniques -> ['Data', 'Research', 'Math', 'Management', 'Optics']
Medical radiation physics -> ['Optics', 'Image/Speech processing', 'Math']
Bio-nano-chip design -> ['Imagery', 'Bio/Chem']
Principles of powder and densification 

General physics III -> ['Math', 'Optics', 'Management', 'Research']
Methods of asymptotic analysis in mechanics -> ['Math']
Environment chemical and biological technology -> ['Research', 'Data', 'Bio/Chem', 'Management']
Evolutionary robotics -> ['Optics', 'Data', 'Math']
Analysis and modelling of locomotion -> ['Optics']
Advanced topics in electromagnetic compatibility -> ['Optics', 'Research', 'Management', 'Data']
Fundamentals of Biometrics -> ['Data']
Advanced computer architecture -> ['Management']
Laboratory information management systems (LIMS) -> ['Management', 'Data', 'Research']
Advanced topics in financial econometrics -> ['Data', 'Math', 'Management']
Harmonic analysis -> ['Optics', 'Math']
Fundamentals of biophotonics -> ['Optics', 'Data', 'Research', 'Management']
Engineering of musculoskeletal system and rehabilitation -> ['Management']
Corporate strategy -> ['Management', 'Data', 'Research']
Introduction to the finite elements method -> ['Math', 'Data', 'Optics']
Transi

## Exercise 4.11: Wikipedia structure

In [24]:
rdd_wiki = sc.textFile("/ix/wikipedia-for-schools.txt").map(json.loads) # load the wikipedia data

In [25]:
# First we need to transform the data into the correct format for LDA

# Aggregate the rdd to get a set of all words
def seq(a,b):
    for el in b['tokens']:
        a.add(el)
    return a
def comb(a,b):
    return a.union(b)
wikiTerms = rdd_wiki.aggregate(set(), seq, comb)
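
For readers unfamiliar with `aggregate`: it folds each partition with the first function, starting from the zero value, and then merges the per-partition results with the second function. The sketch below mimics this in plain Python on hypothetical toy partitions; it is an illustration of the pattern, not of Spark's actual execution.

```python
from functools import reduce

def seq(acc, doc):
    # Fold one document into the running set of words (on a copy, to leave the zero value intact)
    acc = set(acc)
    acc.update(doc['tokens'])
    return acc

def comb(a, b):
    # Merge the word sets of two partitions
    return a | b

# Toy data (hypothetical): two partitions of tokenized documents
partitions = [
    [{'tokens': ['war', 'king']}, {'tokens': ['king', 'music']}],
    [{'tokens': ['river']}],
]

partials = [reduce(seq, part, set()) for part in partitions]  # one set per partition
vocabulary = reduce(comb, partials, set())                    # merged across partitions
print(sorted(vocabulary))
```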

In [26]:
# Sorted list of all words
wikiTermsList = sorted(list(wikiTerms))
# Mappings from words to their alphabet index
wikiTermsDict = dict(zip(wikiTermsList, range(len(wikiTerms))))

In [27]:
# Filters for the data

# Checks if given word is a number
def is_number(word):
    try:
        float(word)
        return True
    except ValueError:
        return False

    
# If the word passes the filters and should be in the dataset, returns True, False otherwise
def filter_word(word):
    # Words of len 1
    if len(word) < 2:
        return False
    # Months, directions
    rem_words = {'january', 'february', 'march', 'april', 'may', 'june', 'july', 'august', 'september',
                 'october', 'november', 'december', 'north', 'south', 'east', 'west'}
    if word.lower() in rem_words:
        return False
    
    # Removing words consisting of a punctuation sign
    if word in string.punctuation:
        return False
    # Removing numbers
    if is_number(word):
        return False
    return True

In [28]:
# Create a bag of words from an rdd entry
def createBOW(termList):
    resDict = {}
    for term in termList['tokens']:
        if (filter_word(term)):
            resDict[term] = resDict.get(term, 0) + 1
    return resDict

# Create a sparse term vector from a bag of words
def createSparseTermVector(bow):
    values = []
    indices = []
    for k,v in sorted(bow.items(), key=lambda x:x[0]):
        values.append(v)
        indices.append(wikiTermsDict[k])
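    # The first argument is the full vector length (the vocabulary size), not the number of non-zero entries
    return Vectors.sparse(len(wikiTermsList), indices, values)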

def mapFunc(el):
    return [el['page_id'], createSparseTermVector(createBOW(el))]

vectorsRDD = rdd_wiki.map(mapFunc).persist()
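
A sparse vector is described by three things: its total length (here the vocabulary size), the sorted indices of its non-zero entries, and the corresponding values. The following Spark-free sketch shows that mapping on toy data; `to_sparse` and the small vocabulary are hypothetical and stand in for the functions and data used in this notebook.

```python
from collections import Counter

def to_sparse(tokens, vocab_index):
    # Count only the tokens that belong to the vocabulary
    counts = Counter(t for t in tokens if t in vocab_index)
    # Sort by vocabulary index so that the indices are strictly increasing
    pairs = sorted((vocab_index[t], c) for t, c in counts.items())
    indices = [i for i, _ in pairs]
    values = [float(c) for _, c in pairs]
    return len(vocab_index), indices, values

# Toy vocabulary (hypothetical), indexed in alphabetical order
vocab = sorted({'war', 'king', 'music', 'river'})
vocab_index = {word: i for i, word in enumerate(vocab)}

size, indices, values = to_sparse(['war', 'king', 'war'], vocab_index)
print(size, indices, values)
```

The triple returned here has the same shape as the arguments expected by `Vectors.sparse(size, indices, values)`.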

In [29]:
# Print the top 10 words for each topic from the lda model's topic description
def topWords(ldaTopicDesc, wordList):
    for topic in ldaTopicDesc:
        termList = []
        for i in range(10):
            termList.append(wordList[topic[0][i]])
            
        print(termList)

We tried two sets of hyperparameters for the Wikipedia dataset. The first aims to find more precise topics, for instance a country or different kinds of art; to this end it uses balanced values of alpha (3.4) and beta (3.5). The second uses a high value of alpha (10) and a low value of beta (1.91), with the aim of finding very broad topics like history, sports or science. We chose a higher value of k for the first idea (20) than for the second (10).

In [30]:
# Train the model and print the top words for each topic

wiki_ldaModel = LDA.train(vectorsRDD, k=20, docConcentration=3.4, topicConcentration=3.5, seed=10)
topDesc = wiki_ldaModel.describeTopics()
topWords(topDesc, wikiTermsList)

['war', 'government', 'army', 'battle', 'empire', 'military', 'republic', 'king', 'forces', 'city']
['time', 'universe', 'theory', 'years', 'matter', 'number', 'made', 'work', 'series', 'big']
['river', 'world', 'century', 'population', 'german', 'state', 'area', 'soviet', 'people', 'part']
['light', 'form', 'glass', 'called', 'system', 'number', 'made', 'time', 'high', 'standard']
['blood', 'disease', 'cells', 'cell', 'common', 'process', 'windows', 'treatment', 'system', 'cancer']
['united', 'england', 'war', 'world', 'american', 'city', 'president', 'british', 'government', 'won']
['species', 'water', 'oil', 'plants', 'acid', 'found', 'food', 'form', 'plant', 'birds']
['game', 'time', 'work', 'system', 'theory', 'life', 'called', 'law', 'number', 'set']
['number', 'numbers', 'theory', 'function', 'set', 'mathematics', 'real', 'space', 'called', 'time']
['energy', 'nuclear', 'gas', 'number', 'reaction', 'chemical', 'water', 'element', 'form', 'temperature']
['music', 'album', 'time',

The idea behind the first set of parameters was to find a good balance between alpha and beta in order to find a lot of meaningful topics. Its parameter values were inspired by the results of the EPFL courses topic extraction. We raised alpha because we estimated that Wikipedia articles would probably contain more topics than EPFL courses. We thought about lowering beta to prevent the many words of the articles from polluting the topics, but decided against it because we thought that the higher number of meaningful words would balance it.

It worked well for some topics and less well for others. We can see that several topics seem to be related to war and governments in general (1, 6, 13, 18) and a few others seem linked to history and countries (12, 14). Some of the topics, on the other hand, are very precise and not corrupted by weird outliers, like topic 11, which is clearly about music, and topic 20, which talks about planets, stars and space in general.

Overall the results are not very good, and too many topics are nonsensical or not distinguishable enough to be interpreted in a meaningful way.

In [31]:
# Train the model and print the top words for each topic

wiki_ldaModel = LDA.train(vectorsRDD, k=10, docConcentration=10.0, topicConcentration=1.91, seed=10)
topDesc = wiki_ldaModel.describeTopics()
topWords(topDesc, wikiTermsList)

['time', 'states', 'hurricane', 'storm', 'people', 'island', 'tropical', 'system', 'number', 'united']
['water', 'energy', 'form', 'chemical', 'acid', 'called', 'number', 'temperature', 'process', 'high']
['music', 'american', 'film', 'time', 'played', 'years', 'player', 'play', 'year', 'world']
['king', 'english', 'century', 'church', 'england', 'calendar', 'war', 'great', 'empire', 'god']
['government', 'war', 'united', 'states', 'world', 'british', 'union', 'africa', 'country', 'economic']
['chinese', 'china', 'language', 'century', 'population', 'islands', 'world', 'river', 'republic', 'number']
['species', 'found', 'years', 'early', 'time', 'birds', 'made', 'large', 'family', 'genus']
['time', 'earth', 'sun', 'years', 'star', 'stars', 'mass', 'solar', 'system', 'planet']
['war', 'american', 'city', 'united', 'president', 'british', 'states', 'army', 'john', 'time']
['city', 'area', 'time', 'set', 'theory', 'number', 'centre', 'lake', 'large', 'called']


The idea behind the second set of parameters was to find all of the topics that were mentioned in the documents (high alpha) but only consider the most relevant words of each document (low beta). We hoped that this would result in very general topics.

The results are quite good, although three topics (4, 5 and 9) look similar without being exactly the same. Topic 7 looks weird as well and does not include many meaningful words that could give us a good idea of what its subject is.

Labels for the produced topics could be:
1. Tropical weather
2. Chemistry
3. Entertainment
4. War (older, American independence)
5. War (newer, WW2)
6. China
7. Species classification
8. Space
9. War
10. Urban environment

Overall these results are quite promising and we think that the second idea is the better one to interpret this dataset.