# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** *B*

**Names:**

* *Keijiro Tajima*
* *Mahammad Shirinov*
* *Stephen Zhao*


## Exercise 4.11: Wikipedia structure


The time to train LDA on the entire Wikipedia dataset was prohibitive (only one iteration completed after 6 hours of running, with uninteresting results). This was probably due to the immense vocabulary size of the data (~500K), which defines the size of the input vectors and slows down the fitting process drastically. In this notebook, we therefore reduced the dictionary by removing very infrequent words (as well as a few very frequent ones) and fitted LDA on the resulting dataset. The results are convincing: most of the topics found on the reduced dataset are easy to label.
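
To make the effect of the vocabulary size concrete, here is a rough back-of-the-envelope estimate. It uses the article count and the vocabulary sizes reported in this notebook, and it assumes that each document is stored as a dense vector of 8-byte floats (an assumption about the storage cost, not a measurement). The helper `dense_size_gb` is ours, for illustration only.

```python
# Rough memory estimate for dense bag-of-words vectors.
N_DOCS = 5554            # number of articles counted in this notebook
BYTES_PER_VALUE = 8      # assumption: one 8-byte float per vocabulary entry


def dense_size_gb(n_docs, vocab_size, bytes_per_value=BYTES_PER_VALUE):
    """Size in gigabytes (10**9 bytes) of an n_docs x vocab_size dense matrix."""
    return n_docs * vocab_size * bytes_per_value / 1e9


full_vocab_gb = dense_size_gb(N_DOCS, 494494)   # full vocabulary
small_vocab_gb = dense_size_gb(N_DOCS, 12241)   # vocabulary after heavy filtering

print('full vocabulary:     {:.1f} GB'.format(full_vocab_gb))
print('filtered vocabulary: {:.2f} GB'.format(small_vocab_gb))
```

Under these assumptions the full vocabulary needs roughly 40 times more memory than the filtered one, which is consistent with the slowdown we observed.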

### 1. Report the values for k, α and β that you chose a priori and why you picked them.

Before any testing, here is our intuition. Wikipedia for Schools probably covers 20 to 30 topics: the subjects of an average school curriculum, plus possibly additional topics such as Art or Movies/Entertainment. Since these articles are probably longer and more general than EPFL course descriptions, we might use a larger $\alpha$ value here (a document is likely to touch on many topics). As for terms, since the content is designed for schools, the articles probably contain less jargon, and most articles use most of the common English words. Thus, we can use a larger $\beta$ value (topics dominated by a few eccentric words are less likely). <br/>
We will go with $k=20$, $\alpha=6.0$ and $\beta=1.4$.
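
The intuition about the concentration parameter can be checked with a small simulation. This is only a sketch using NumPy's Dirichlet sampler, not Spark's LDA: it draws topic mixtures from a symmetric Dirichlet prior and measures how much weight the dominant topic gets on average. The helper `mean_largest_share` and the comparison value `0.1` are our own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 20  # number of topics, as chosen above


def mean_largest_share(concentration, n_samples=2000):
    """Average weight of the dominant topic in draws from a symmetric Dirichlet."""
    samples = rng.dirichlet([concentration] * k, size=n_samples)
    return samples.max(axis=1).mean()


sparse_mix = mean_largest_share(0.1)   # small alpha: a single topic tends to dominate
smooth_mix = mean_largest_share(6.0)   # large alpha: weight is spread over many topics

print('alpha=0.1 -> dominant topic share: {:.2f}'.format(sparse_mix))
print('alpha=6.0 -> dominant topic share: {:.2f}'.format(smooth_mix))
```

With the larger value the dominant topic holds a much smaller share of each document, which matches the idea that a general article touches many topics. The same reasoning applies to $\beta$ and the distribution of words within a topic.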

In [45]:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
import pickle
import json
import copy
import numpy as np
from utils import load_json, load_pkl, save_pkl

In [2]:
wikipedia = sc.textFile('/ix/wikipedia-for-schools.txt')

In [3]:
c = wikipedia.map(lambda x: 1).reduce(lambda a, b: a+b)
c

5554

In [112]:
# given a trained model and the dictionary used in its dataset, 
# prints the most relevant terms for each topic classified by the model
def describe_topics(model, terms, n_words_per_topic=10, topic_labels=None):
    # get the predictions of topic term distributions from the LDAmodel
    lda_topic_predictions = model.describeTopics(maxTermsPerTopic=n_words_per_topic)
    for i, topic in enumerate(lda_topic_predictions):
        if topic_labels:
            print('== Topic {} - {} =='.format(i+1, topic_labels[i]))
        else:
            print('== Topic {} =='.format(i+1))
        most_relevant_terms = topic[0]
        most_relevant_terms_relevance = topic[1]
        
        print('{:>12} | {:10}'.format('Term', 'Score'))
        for term, relevance in zip(most_relevant_terms, most_relevant_terms_relevance):
            print('{:>12} | {:10}'.format(terms[term], relevance))

# given a wikipedia article as a json string (json dump), returns its list of tokens
def extract_tokens(json_dump):
    article = json.loads(json_dump)
    return article['tokens']

# given an array of tokens, returns the frequency dict of the tokens
def to_freq_dict(tokens_array):
    freq = {} 
    for term in tokens_array: 
        if (term in freq): 
            freq[term] += 1
        else: 
            freq[term] = 1
    return freq

# given two frequency dicts (containing keys and their frequencies), returns their 'union'
def freq_dict_union(d1, d2):
    for x in d1:
        if x in d2:
            d2[x] += d1[x]
        else:
            d2[x] = d1[x]
            
    return d2

In [6]:
wiki_tokens = wikipedia.map(extract_tokens).map(to_freq_dict).reduce(freq_dict_union)
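
The same map-and-reduce counting pattern can be sketched without Spark using `collections.Counter`, which may be easier to test locally. The toy documents and the helper `merge_counters` are hypothetical; `functools.reduce` stands in for the RDD's `reduce`.

```python
from collections import Counter
from functools import reduce


def merge_counters(c1, c2):
    """Add the counts of c2 into c1 and return c1 (the accumulator is mutated)."""
    c1.update(c2)
    return c1


# Toy token lists standing in for article['tokens'] (made up for illustration).
docs = [['war', 'city', 'war'], ['city', 'water']]

# Map each document to its own Counter, then fold them into one global count.
total = reduce(merge_counters, map(Counter, docs))
print(total.most_common())
```

`Counter` also provides `most_common(n)`, which gives the sorted list of frequent terms directly.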

In [94]:
len(wiki_tokens)

494494

This is too large. As the counts below show, the vast majority of these words are extremely rare.

In [104]:
print('Number of tokens that appear less than 10 times in the entire dataset: ',len([1 for x in wiki_tokens if wiki_tokens[x] < 10]))

Number of tokens that appear less than 10 times in the entire dataset:  441918


In [105]:
print('Number of tokens that appear less than 100 times in the entire dataset: ',len([1 for x in wiki_tokens if wiki_tokens[x] < 100]))

Number of tokens that appear less than 100 times in the entire dataset:  482172


So we set out to remove very infrequent words. We will generate and experiment with two datasets: a very small one, containing only tokens with frequencies from $100$ up to (but excluding) $7313$, and a somewhat larger one, containing tokens with frequencies between $30$ and $10000$ (inclusive). The upper bounds were chosen by inspecting the most frequent terms, listed below.
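
The frequency-band filter can be written as one small reusable function. This is a sketch: `filter_by_frequency` is a hypothetical helper, it uses a half-open band `[low, high)`, and the toy dictionary mixes three counts taken from this notebook with one made-up rare word.

```python
def filter_by_frequency(freq, low, high):
    """Keep only the terms whose total count lies in the half-open band [low, high)."""
    return {term: count for term, count in freq.items() if low <= count < high}


# 'time', 'species' and 'de' use counts from this notebook; 'rareword' is made up.
toy_freq = {'time': 22247, 'species': 8727, 'de': 8398, 'rareword': 3}

kept = filter_by_frequency(toy_freq, low=100, high=10000)
print(sorted(kept))
```

Parameterising the bounds makes it easy to build both datasets with the same code and avoids mixing `<` and `<=` by accident.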

In [107]:
import operator
sorted_wiki_tokens = sorted(wiki_tokens.items(), key=operator.itemgetter(1), reverse=True)

In [108]:
sorted_wiki_tokens[:100]

[('–', 22751),
 ('time', 22247),
 ('years', 19551),
 ('world', 19172),
 ('war', 18962),
 ('united', 17844),
 ('american', 17580),
 ('states', 17140),
 ('city', 17107),
 ('century', 15808),
 ('called', 14269),
 ('number', 14253),
 ('made', 14231),
 ('government', 13835),
 ('people', 13312),
 ('british', 13210),
 ('state', 12877),
 ('early', 12791),
 ('part', 12745),
 ('including', 12743),
 ('south', 12339),
 ('system', 12160),
 ('year', 11546),
 ('large', 11530),
 ('national', 11276),
 ('north', 11180),
 ('water', 10860),
 ('english', 10710),
 ('form', 10705),
 ('great', 10623),
 ('found', 10395),
 ('french', 10188),
 ('high', 10113),
 ('area', 10071),
 ('work', 9853),
 ('major', 9624),
 ('population', 9615),
 ('life', 9217),
 ('march', 9161),
 ('general', 9029),
 ('january', 8934),
 ('common', 8922),
 ('history', 8885),
 ('john', 8846),
 ('power', 8789),
 ('species', 8727),
 ('related', 8711),
 ('small', 8668),
 ('king', 8654),
 ('modern', 8571),
 ('de', 8398),
 ('july', 8395),
 ('engl

### The smaller dataset

In [80]:
# create a new dictionary of the needed words
wiki_tokens_filtered_alot = {}
for x in wiki_tokens:
    if wiki_tokens[x] >= 100 and wiki_tokens[x] < 7313:
        wiki_tokens_filtered_alot[x] = wiki_tokens[x]

In [81]:
new_small_dict = list(wiki_tokens_filtered_alot.keys())
len(wiki_tokens_filtered_alot)

12241

In [65]:
# given a json dump of a wiki article, removes unneeded tokens from it and returns article object
def remove_infreq_terms(json_dump):
    article = json.loads(json_dump)
    article['tokens'] = [token for token in article['tokens'] if token in new_dict]
    return article

# same as above, only for the smaller dataset
# note that in this case, a smaller dictionary is used to get rid of tokens
def remove_infreq_terms_2(json_dump):
    article = json.loads(json_dump)
    article['tokens'] = [token for token in article['tokens'] if token in new_small_dict]
    return article

# given an article object, returns an (id, bag_of_words_vector) pair as required by the Spark LDA function
def prepare_for_lda(article):
    bag_of_words = [0 for _ in range(len(new_dict))]
    for term in article['tokens']:
        bag_of_words[new_dict.index(term)] += 1
    return [article['page_id'], Vectors.dense(bag_of_words)]

# same as above, only for the smaller dataset
# note that in this case, the smaller dictionary is used to index the tokens
def prepare_for_lda_small(article):
    bag_of_words = [0 for _ in range(len(new_small_dict))]
    for term in article['tokens']:
        bag_of_words[new_small_dict.index(term)] += 1
    return [article['page_id'], Vectors.dense(bag_of_words)]
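
The functions above look up each token with `list.index`, which scans the whole dictionary for every token. A possible alternative, sketched here in plain Python, is to precompute a term-to-index mapping and to store only the non-zero counts. The helpers `build_term_index` and `to_sparse_counts` and the toy vocabulary are hypothetical; we did not use this variant in the notebook.

```python
from collections import Counter


def build_term_index(vocabulary):
    """Map each term to its position so that lookups take constant time."""
    return {term: i for i, term in enumerate(vocabulary)}


def to_sparse_counts(tokens, term_index):
    """Return {vocabulary index: count} for the tokens found in the vocabulary."""
    counts = Counter(term_index[t] for t in tokens if t in term_index)
    return dict(sorted(counts.items()))


# Toy vocabulary and article; 'rocket' is outside the vocabulary and is dropped.
vocabulary = ['earth', 'sun', 'moon', 'church']
term_index = build_term_index(vocabulary)
sparse_counts = to_sparse_counts(['sun', 'earth', 'sun', 'rocket'], term_index)
print(sparse_counts)

# With PySpark available, such a dict can be turned into a sparse vector with
# Vectors.sparse(len(vocabulary), sparse_counts) instead of Vectors.dense(...).
```

Because the out-of-vocabulary check and the indexing happen in one pass, this also makes a separate filtering step unnecessary.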

In [33]:
# filter data and transform into form required by the LDA.train function
wikipedia_filtered_alot = wikipedia.map(remove_infreq_terms_2)
small_wikipedia_ready_for_lda = wikipedia_filtered_alot.map(prepare_for_lda_small)
small_wiki_lda_models = {}

In [None]:
# note: this piece of code was run in the cell below, 
# and was carried here to rerun describe_topics with labels, without losing the model
n_topics = 20
small_wiki_lda_models['b=1.3,a=6.0,high-freq-words-removed'] = LDA.train(small_wikipedia_ready_for_lda, 
                                                                         seed=None, 
                                                                         docConcentration=6.0, 
                                                                         topicConcentration=1.3, 
                                                                         maxIterations=20, 
                                                                         k=n_topics)

In [113]:
# Topic labels added later after analysis
topic_labels = ['(?) Italy', '(?)', 'Cosmos / Solar System', 'Zoology', 'Wars', 'Literature', '(ancient) History', 'Politics', 'Economics', '(?) General civitizenship info ?', 'Geography', 'Geography', 'Entertainment', 'Religion', '(?) Wales and Islam', 'Technology/Computers', 'Microbiology and Natural Phenomena', 'Energy/General Science', 'Art', 'Mathematics']
describe_topics(small_wiki_lda_models['b=1.3,a=6.0,high-freq-words-removed'], new_small_dict, n_words_per_topic=10, topic_labels=topic_labels)

== Topic 1 - (?) Italy ==
        Term | Score     
    calendar | 0.012410440632909536
    february | 0.010220568025804505
       italy | 0.004590426379301236
          ii | 0.004523965758868512
     italian | 0.004391296723165262
        pope | 0.0042627814455455226
        flag | 0.004135060955597644
        paul | 0.0041176258016767065
        died | 0.0038478387588980695
      writer | 0.0035922444893741394
== Topic 2 - (?) ==
        Term | Score     
    february | 0.008847059776020032
       actor | 0.007315594107590482
     actress | 0.004860964123977584
      soviet | 0.004680961898359112
           $ | 0.004510262120405354
       prize | 0.0041271193560028915
    calendar | 0.004055931021639767
       nobel | 0.003981890000821442
    canadian | 0.003807959056410869
        york | 0.0034224949845110225
== Topic 3 - Cosmos / Solar System ==
        Term | Score     
       earth | 0.010664187092534068
         sun | 0.007507146007678085
        moon | 0.007460918605179476
    

### 2. Are you convinced by the results? Give labels to the topics if possible. 

We are indeed very pleased with the results: 16 of the 20 topics can be labelled easily (see the labels in the output above).
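
One way to make the judgement less subjective is to measure how much the top terms of two topics overlap, since topics that share many terms are harder to tell apart. The sketch below computes the Jaccard similarity of the top-10 term lists of Topics 1 and 2 from the output above. The helper `jaccard` is ours, and this check was not part of our original analysis.

```python
def jaccard(terms_a, terms_b):
    """Jaccard similarity of two term lists: |intersection| / |union|."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Top-10 terms of Topic 1 and Topic 2, copied from the output above.
topic_1 = ['calendar', 'february', 'italy', 'ii', 'italian',
           'pope', 'flag', 'paul', 'died', 'writer']
topic_2 = ['february', 'actor', 'actress', 'soviet', '$',
           'prize', 'calendar', 'nobel', 'canadian', 'york']

shared = sorted(set(topic_1) & set(topic_2))
overlap = jaccard(topic_1, topic_2)
print(shared, round(overlap, 3))
```

The two topics that we could not label confidently share the date-related terms `calendar` and `february`, which suggests that both absorbed part of the "list of events by date" articles.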

Below are the other tests we ran with different datasets and/or model hyperparameters. They are experimental and not annotated like the chosen model above.

In [116]:
print('Tried combinations:')
list(small_wiki_lda_models.keys())

Tried combinations:


['b=1.4,a=5.0,high-freq-words-removed',
 'b=1.6,a=6.5,high-freq-words-removed',
 'b=1.3,a=6.0,high-freq-words-removed']

In [117]:
list(wiki_lda_models.keys())

['b=1.4,a=5.0']

### Appendix A: The bigger dataset

In [97]:
# create a new dictionary of the needed words
wiki_tokens_filtered = {}
for x in wiki_tokens:
    if wiki_tokens[x] >= 30 and wiki_tokens[x] <= 10000:
        wiki_tokens_filtered[x] = wiki_tokens[x]

In [106]:
new_dict = list(wiki_tokens_filtered.keys())
len(new_dict)

26445

In [99]:
wikipedia_filtered = wikipedia.map(remove_infreq_terms)
wikipedia_ready_for_lda = wikipedia_filtered.map(prepare_for_lda)

wiki_lda_models = {}

In [101]:
n_topics = 20
wiki_lda_models['b=1.4,a=5.0'] = LDA.train(wikipedia_ready_for_lda, seed=None, docConcentration=5.0, topicConcentration=1.4, maxIterations=15, k=n_topics)

In [109]:
describe_topics(wiki_lda_models['b=1.4,a=5.0'], new_dict, n_words_per_topic=10)

== Topic 1 ==
        Term | Score     
         set | 0.0031196105561355912
           = | 0.0028349265739742433
     numbers | 0.0022906792451880794
        work | 0.002119326411685812
       group | 0.0018831516575068307
     general | 0.0017318297809583406
       sheep | 0.0017237004313329944
        life | 0.001534086612159949
       power | 0.0015292081916620515
         art | 0.001492103603869243
== Topic 2 ==
        Term | Score     
   president | 0.0030990026721133606
       march | 0.0029861868500002794
      battle | 0.002788773516528671
        june | 0.0025196996659033306
        army | 0.00248045098824769
     january | 0.002474489499467883
        king | 0.0024387891891476158
       house | 0.0023444145023939485
        july | 0.0023279129159369436
    november | 0.002307022112571861
== Topic 3 ==
        Term | Score     
        acid | 0.005527571968248706
    chemical | 0.004576397673119443
      carbon | 0.004103875247906003
         gas | 0.003976860087041761
    

### Appendix B: Other models (on the small dataset) with slightly worse results

In [85]:
n_topics = 20
small_wiki_lda_models['b=1.4,a=5.0,high-freq-words-removed'] = LDA.train(small_wikipedia_ready_for_lda, seed=None, docConcentration=5.0, topicConcentration=1.4, maxIterations=15, k=n_topics)

In [87]:
describe_topics(small_wiki_lda_models['b=1.4,a=5.0,high-freq-words-removed'], new_small_dict, n_words_per_topic=10)

== Topic 1 ==
        Term | Score     
        game | 0.00415069804290561
    computer | 0.003459626522167465
     systems | 0.0024257744338514802
       games | 0.002402588222501744
    software | 0.0020078039115267567
       shark | 0.0018905574898610304
   computers | 0.0018869806879530852
     company | 0.0018509332645076924
      sharks | 0.0017422855479495732
      famine | 0.0016537680320677576
== Topic 2 ==
        Term | Score     
     disease | 0.005157199232101361
        food | 0.0024236365316099885
       human | 0.002300684014700783
      island | 0.0021607708878110116
       virus | 0.002052227992131479
     animals | 0.002031851427541036
   treatment | 0.0019568643346892767
   infection | 0.0019564194835861976
         ice | 0.001941484119743118
       cases | 0.0018516428847988903
== Topic 3 ==
        Term | Score     
       henry | 0.004026632001088412
         air | 0.002660808907688986
    pressure | 0.0019752782063153116
       blood | 0.0017563704454166495
   

In [92]:
n_topics = 18
small_wiki_lda_models['b=1.6,a=6.5,high-freq-words-removed'] = LDA.train(small_wikipedia_ready_for_lda, seed=None, docConcentration=6.5, topicConcentration=1.6, maxIterations=15, k=n_topics)
describe_topics(small_wiki_lda_models['b=1.6,a=6.5,high-freq-words-removed'], new_small_dict, n_words_per_topic=10)

== Topic 1 ==
        Term | Score     
           • | 0.0038233101032554323
     windows | 0.003556828555775314
         law | 0.002584417188352744
    economic | 0.002396568152885979
   countries | 0.0019435304404840257
       trade | 0.0017372447725013144
distribution | 0.001727067714302235
      public | 0.001712602875972321
      soviet | 0.0016915539759367166
     central | 0.0016422084623953087
== Topic 2 ==
        Term | Score     
      church | 0.007612197257441332
       roman | 0.0038473283341109287
           = | 0.00209058556682233
         god | 0.002055196486715275
      empire | 0.002027453672713741
    churches | 0.0019716634911392814
    catholic | 0.001919159955027288
    function | 0.001870833331571155
        book | 0.0018679871075588028
       greek | 0.0018377441300623776
== Topic 3 ==
        Term | Score     
       party | 0.0027194424933237355
       house | 0.002587449813092005
        army | 0.002320826103842772
      church | 0.0021880148926918834
      