# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** *B*

**Names:**

* *Keijiro Tajima*
* *Mahammad Shirinov*
* *Stephen Zhao*

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
import pickle
import json
import numpy as np
from utils import load_json, load_pkl, save_pkl

## Exercise 4.8: Topics extraction

In [2]:
# load data
courses = load_pkl('courses_prepped.pkl')
course_ids = load_pkl('courses.pkl')
terms = load_pkl('terms.pkl')
TFIDF_matrix = load_pkl('tfidx_matrix.pkl')

In [3]:
courses[0]

{'courseId': 'MSE-440',
 'description': 'latest develop gener organ composit discuss nanocomposit adapt composit biocomposit product develop cost analysi studi market practic team work basic composit materi constitu composit composit structur current develop nanocomposit textil composit biocomposit adapt composit applic drive forc market cost analysi aerospac automot sport keyword composit applic nanocomposit biocomposit adapt composit cost prerequisit requir cours notion polym recommend cours polym composit outcom end propos suitabl product perform criteria product composit part appli basic equat mechan properti composit materi discuss main type composit applic transvers skill work methodolog task gener domain specif IT resourc tool commun effect profession disciplin evalu one perform team receiv respond appropri feedback teach cathedra invit speaker group session exercis work project expect activ attend lectur composit part bibliographi search written exam report oral class',
 'name'

In [4]:
n_topics = 10
courses_rdd = sc.parallelize(courses)

# transform each course (a dict with 'courseId' and 'description') into an
# (id, bag_of_words_vector) pair, as required by the Spark LDA function
course_term_counts_rdd = courses_rdd.map(
    lambda x: [course_ids.index(x['courseId']), Vectors.dense([x['description'].split(" ").count(t) for t in terms])]
)
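
The bag-of-words construction above calls `list.count` once per vocabulary term, so the description is rescanned for every term. A minimal sketch of an equivalent single-pass count is shown below on a toy vocabulary. The names `bag_of_words`, `toy_terms` and `toy_description` are illustrative and not part of the lab code.

```python
from collections import Counter


def bag_of_words(description, vocabulary):
    """Return a list of term counts aligned with `vocabulary`."""
    # Count every token once, then look each vocabulary term up in the counter.
    counts = Counter(description.split(" "))
    return [counts[t] for t in vocabulary]


# Toy data (assumption): stems borrowed from the course description shown earlier.
toy_terms = ["composit", "cost", "polym"]
toy_description = "composit cost analysi composit polym composit"

print(bag_of_words(toy_description, toy_terms))
```

The resulting list could be wrapped in `Vectors.dense(...)` exactly as in the cell above; only the counting step changes.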

In [34]:
# given a trained model and the dictionary used in its dataset, 
# prints the most relevant terms for each topic classified by the model
def describe_topics(model, terms=terms, n_words_per_topic=10, topic_labels=None):
    # get the predictions of topic term distributions from the LDAmodel
    lda_topic_predictions = model.describeTopics(maxTermsPerTopic=n_words_per_topic)
    for i, topic in enumerate(lda_topic_predictions):
        if topic_labels:
            print('== Topic {} - {} =='.format(i+1, topic_labels[i]))
        else:
            print('== Topic {} =='.format(i+1))
        most_relevant_terms = topic[0]
        most_relevant_terms_relevance = topic[1]
        
        print('{:>12} | {:10}'.format('Term', 'Score'))
        for term, relevance in zip(most_relevant_terms, most_relevant_terms_relevance):
            print('{:>12} | {:10}'.format(terms[term], relevance))

### 1. Print k = 10 topics extracted using LDA and give them labels.

In [66]:
# train LDA and analyze
lda_model = LDA.train(course_term_counts_rdd, k=n_topics)
describe_topics(lda_model)

==Topic 1==
evalu 0.010669837590552034
discuss 0.009675077506243876
activ 0.009392266510662063
protein 0.009241553811236806
paper 0.008792438999123804
class 0.008718554152576913
innov 0.00831550010084958
econom 0.008036919126768158
inform 0.007766603228178431
case 0.007562558337403219
==Topic 2==
cell 0.02208054589687132
structur 0.019984748784781906
mechan 0.012756361951031964
lectur 0.011724059844961589
molecular 0.010650153218159477
biolog 0.010214146327898332
cours 0.006710692532689354
organ 0.006246905843776829
paper 0.006212328874749429
data 0.006086404713174448
==Topic 3==
materi 0.03922285083880257
electron 0.0185608444942165
properti 0.015869220629654988
applic 0.01396705228201581
physic 0.012823439248618127
devic 0.011861043239506496
mechan 0.008913664667393567
metal 0.008911402816150764
organ 0.008729623782723489
technolog 0.008617578564704034
==Topic 4==
chemic 0.01910028211790694
reaction 0.014122305038484259
equat 0.013224979652019124
energi 0.011444030291372758
quantum 0

In [67]:
lda_topic_labels = ['Mechanical Engineering',
                    'Biology',
                    'Material Science/Mechanical Engineering',
                    'Chemistry/Theoretical Physics',
                    'Control Theory/Robotics',
                    'Optics',
                    'Statistics/Data Science/Optimization',
                    'Industrial Engineering',
                    'Semester project course',
                    'Architecture'
                   ]

In [69]:
for i, label in enumerate(lda_topic_labels):
    print("Topic {:2d}: {}".format(i+1, label))

Topic  1: Mechanical Engineering
Topic  2: Biology
Topic  3: Material Science/Mechanical Engineering
Topic  4: Chemistry/Theoretical Physics
Topic  5: Control Theory/Robotics
Topic  6: Optics
Topic  7: Statistics/Data Science/Optimization
Topic  8: Industrial Engineering
Topic  9: Semester project course
Topic 10: Architecture


### 2. How does it compare with LSI?

LDA gives better results than LSI in this case: it identifies more 'stand-alone' topics (topics that can be interpreted on their own), and together these cover a larger part of the subject areas.

### Appendix A: Experimenting with TF-IDF matrix in LDA (no good results)

In [8]:
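# same (id, vector) pairs as before, but using each course's TF-IDF row instead of raw term counts
lda_model_tfidf = LDA.train(
    courses_rdd.map(
        lambda x: [course_ids.index(x['courseId']),
                   Vectors.dense(TFIDF_matrix[course_ids.index(x['courseId'])])]
    )
)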

In [10]:
lda_tfidf_topic_predictions = lda_model_tfidf.describeTopics(maxTermsPerTopic=7)

In [11]:
lda_tfidf_topic_predictions[2][0]

[449, 415, 94, 594, 605, 45, 352]

In [12]:
for i, topic in enumerate(lda_tfidf_topic_predictions):
    print('==Topic {}=='.format(i+1))
    most_relevant_terms = topic[0]
    most_relevant_terms_relevance = topic[1]
    for term, relevance in zip(most_relevant_terms, most_relevant_terms_relevance):
        print(terms[term], relevance)

==Topic 1==
seepag 0.004185591890461483
divis 0.003932087156210787
ludwig 0.003932074920224155
manufactur 0.0028498377731108328
opac 0.0028227959654593677
preciou 0.0026215709153107974
secret 0.0025396975833506066
==Topic 2==
seepag 0.004185699216141779
ludwig 0.003932364834475524
divis 0.00393225468082244
manufactur 0.0028498371099158477
opac 0.002822787331343592
preciou 0.0026215626225565007
secret 0.002539695172621642
==Topic 3==
seepag 0.004185351720065871
divis 0.003932276916026282
ludwig 0.003932116112949313
manufactur 0.0028498379556460714
opac 0.002822791811652461
preciou 0.0026215653018713445
secret 0.0025396972914469816
==Topic 4==
seepag 0.004186252524402281
ludwig 0.003932348455717187
divis 0.003932328017467629
manufactur 0.002849836814659058
opac 0.002822784552920593
preciou 0.0026215642749753696
secret 0.0025396918500062374
==Topic 5==
seepag 0.004185793747017829
divis 0.003932462854878184
ludwig 0.003932276979960304
manufactur 0.002849836529185334
opac 0.0028227877449402

## Exercise 4.9: Dirichlet hyperparameters

### Fixed $\beta$, varying $\alpha$

In [21]:
lda_models = {}

In [77]:
lda_models = {}
lda_models['b=1.01,a=1.01']= LDA.train(course_term_counts_rdd, seed=0, docConcentration=1.01, topicConcentration=1.01, k=n_topics)
describe_topics(lda_models['b=1.01,a=1.01'])

==Topic 1==
data 0.01693356916839985
algorithm 0.016653920324806667
statist 0.013899380945637547
analysi 0.011403829595696597
linear 0.011395347705165598
commun 0.01081830701791149
program 0.010580411989205207
basic 0.008768534359729576
network 0.007797193154697797
cours 0.007406423173374682
==Topic 2==
theori 0.015320227863476504
optic 0.015176692476802194
equat 0.01501814341936468
problem 0.01307034480592989
basic 0.011722357987655202
comput 0.010921238040423638
numer 0.009355462464348789
time 0.008885440975917381
linear 0.008822223440801883
solv 0.008604499299498225
==Topic 3==
work 0.012177308657006508
project 0.011127074457116097
manag 0.010452322985413265
develop 0.01024204247209789
data 0.010020777107118801
group 0.009250379496782424
teach 0.009187733416662139
evalu 0.008933236193114412
plan 0.007940015195277253
skill 0.007629996272806817
==Topic 4==
energi 0.02263552608500508
circuit 0.01482570794487184
devic 0.011511060741240083
power 0.011357120336288358
materi 0.011318736522

In [82]:
lda_models['b=1.01,a=2.0']= LDA.train(course_term_counts_rdd, seed=0, docConcentration=2.0, topicConcentration=1.01, k=n_topics)
describe_topics(lda_models['b=1.01,a=2.0'])

==Topic 1==
data 0.01791278626056386
algorithm 0.01728384013879444
statist 0.013870164775550602
commun 0.012136109594508443
program 0.011447027689368902
analysi 0.011265083556964152
linear 0.010486287817560964
network 0.010251011647280477
machin 0.008328979819095063
basic 0.008245030474640203
==Topic 2==
theori 0.018756452433457322
optic 0.01599780965268055
equat 0.014988045786638867
problem 0.014430287506702881
basic 0.013103234346978816
comput 0.011309741269538319
linear 0.010435406768282978
time 0.009752825176744143
numer 0.009493293039480363
solv 0.009261244117899683
==Topic 3==
work 0.012819176963929268
project 0.012553841748684612
develop 0.011146558177129841
manag 0.011088542713840098
data 0.010228886419625912
teach 0.010190086862332436
group 0.010140961552322057
evalu 0.009732045842941692
plan 0.008680067882409168
skill 0.008599185869545193
==Topic 4==
energi 0.023572886750785817
circuit 0.015558753449222586
materi 0.012675709548761413
power 0.01238182869295958
devic 0.01205269

In [83]:
lda_models['b=1.01,a=5.0']= LDA.train(course_term_counts_rdd, seed=0, docConcentration=5.0, topicConcentration=1.01, k=n_topics)
describe_topics(lda_models['b=1.01,a=5.0'])

==Topic 1==
algorithm 0.014738860591886463
commun 0.014142654754678104
data 0.013725234998361258
network 0.012831099078895165
program 0.010576402713085836
machin 0.009077989471487819
statist 0.00892872534700303
analysi 0.007792414721466698
topic 0.007718782172910809
cours 0.007468790288550348
==Topic 2==
theori 0.019611441277177517
optic 0.015542205976591006
basic 0.014715076948032481
problem 0.014256359879202995
linear 0.013886607459813457
equat 0.012228280324249944
time 0.01108779629356173
comput 0.010679397764261326
function 0.009131780426212486
probabl 0.009011895282834501
==Topic 3==
teach 0.012730408647083417
work 0.012546286925835344
data 0.011365802706802547
develop 0.011197344239032299
project 0.010706498330717481
manag 0.009962079414831336
evalu 0.009563876861238798
group 0.009559425547405843
skill 0.009147942200462332
inform 0.008074808875649422
==Topic 4==
energi 0.022174469589021
circuit 0.015200197766376156
materi 0.014462315845079366
technolog 0.011999962207657294
power 

In [86]:
lda_models['b=1.01,a=10.0']= LDA.train(course_term_counts_rdd, seed=0, docConcentration=10.0, topicConcentration=1.01, k=n_topics)
describe_topics(lda_models['b=1.01,a=10.0'])

==Topic 1==
commun 0.011478016541771477
architectur 0.010056359202580894
network 0.007786693805789433
develop 0.007719784508890853
cover 0.00727440328654429
lectur 0.006804253658142563
machin 0.006582675652335703
topic 0.006363468410019855
program 0.0063355600370064765
cours 0.006278556335708221
==Topic 2==
theori 0.013305248363343295
optic 0.012610420766109619
basic 0.011587816668166539
linear 0.01108477860669207
time 0.009711505133986269
stochast 0.009214974228282408
statist 0.00879133167487063
problem 0.008121044640306866
probabl 0.007942832079146162
algorithm 0.007829820551499823
==Topic 3==
teach 0.013835721778608957
work 0.010048474241582249
data 0.009766046562345735
develop 0.009087893128286909
project 0.00895315437029976
skill 0.008317393292021249
evalu 0.008183986134858798
scienc 0.007953684419569259
practic 0.006934879947068465
polici 0.006816284741992784
==Topic 4==
energi 0.015646545955890635
materi 0.012837256319952058
circuit 0.012234959847241978
mechan 0.0119365037627397

In [84]:
lda_models['b=1.01,a=20.0']= LDA.train(course_term_counts_rdd, seed=0, docConcentration=20.0, topicConcentration=1.01, k=n_topics)
describe_topics(lda_models['b=1.01,a=20.0'])

==Topic 1==
architectur 0.011676844435610591
develop 0.006480080293946295
project 0.006147177182420959
lectur 0.006142327967808092
materi 0.006068400002464426
commun 0.005805321846210051
cours 0.005585146356524503
cover 0.005308077035720984
field 0.004819764522609284
work 0.00473744964590716
==Topic 2==
optic 0.009798802520715523
basic 0.0071152025835888606
cours 0.006343790636957809
theori 0.0060034585884822175
analysi 0.005838108922294955
lectur 0.005712410407273863
exercis 0.005494660848239926
concept 0.005417467605412012
applic 0.005221417257814845
teach 0.005168260193495428
==Topic 3==
teach 0.009038384794035457
project 0.00784599400958329
work 0.006751933605056293
lectur 0.006464584817978856
skill 0.006114293505875337
evalu 0.0059872763669786095
data 0.005965000360445143
develop 0.005939814710851218
analysi 0.0058218053702770185
polici 0.005535797995269567
==Topic 4==
energi 0.0068952156918536504
materi 0.006573075958228096
mechan 0.0063613612199348015
applic 0.00575882809029164


In [85]:
lda_models['b=1.01,a=200.0']= LDA.train(course_term_counts_rdd, seed=0, docConcentration=200.0, topicConcentration=1.01, k=n_topics)
describe_topics(lda_models['b=1.01,a=200.0'])

==Topic 1==
lectur 0.005958053279532689
cours 0.005828513366728731
project 0.005639014783610474
basic 0.00551368272256892
analysi 0.005279701279677533
concept 0.005059937866601577
activ 0.005058728193898165
end 0.005005191039315808
outcom 0.004876034960734945
work 0.004875021244762496
==Topic 2==
project 0.005882747755461574
lectur 0.005832918975759106
cours 0.00575202421847043
analysi 0.00557071682396475
basic 0.005286359581550977
work 0.0052789778307774414
concept 0.005121959371809552
teach 0.005089381165511895
end 0.0050658494979980056
outcom 0.0048248998312704705
==Topic 3==
project 0.006367574419657913
lectur 0.006058448874913022
cours 0.005792866709488918
analysi 0.005370812267057648
basic 0.00533559543322523
concept 0.005259781438367095
end 0.005134021372663105
work 0.005128730592043604
teach 0.004965793430867089
exercis 0.004894768264699
==Topic 4==
analysi 0.005890650736790517
cours 0.005796660722231947
project 0.005741996021602136
lectur 0.005702235232277099
end 0.00550350462

A very high $\alpha$ value forces every document to have an almost uniform topic distribution: each document contains a nearly equal amount of each topic. As can be seen from the $\alpha=200$ case, this forces the topics to be defined by very general terms that are present in almost all documents in similar amounts. The same effect is also present for $\alpha=20$, although in this case some topic-specific words like *optic* and *commun* manage to survive.

There is little observable difference in the topics for $\alpha \in \{1.01, 2, 5\}$, at least with regard to the ten most relevant terms in each topic: the set of terms (and often the order too!) stays largely the same. For $\alpha=10$, one could still recover the same topics as for the smaller $\alpha$ values, but we can already see how terms that are very specific to certain domains (e.g. 'algorithm') diminish and sometimes disappear from the top terms, and the relevance scores of the top terms get smaller, closer to uniform.
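
To illustrate why a large $\alpha$ flattens the per-document topic mixture, the following sketch samples topic proportions from a symmetric Dirichlet prior and reports how much weight the dominant topic receives on average. It only samples from the prior with NumPy, not from the trained Spark models, and the helper name `mean_top_share` and the sample size are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10  # number of topics, as in the experiments above


def mean_top_share(alpha, n_docs=2000):
    """Average weight of the dominant topic over sampled documents."""
    # Each row is one document's topic mixture drawn from Dirichlet(alpha, ..., alpha).
    theta = rng.dirichlet([alpha] * k, size=n_docs)
    return theta.max(axis=1).mean()


shares = {alpha: mean_top_share(alpha) for alpha in (1.01, 5.0, 20.0, 200.0)}
for alpha, share in shares.items():
    print("alpha = {:6.2f}: dominant topic holds {:.2f} of a document on average".format(alpha, share))
```

With $k=10$, a perfectly uniform mixture gives every topic a share of 0.1, so the closer the printed value is to 0.1, the less a document can be attributed to any single topic.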

### Fixed $\alpha$, varying $\beta$

In [9]:
lda_models['b=1.01,a=6.0']= LDA.train(course_term_counts_rdd, seed=0, docConcentration=6.0, topicConcentration=1.01, k=n_topics)
describe_topics(lda_models['b=1.01,a=6.0'])

==Topic 1==
commun 0.014037739220928654
algorithm 0.01278290753201815
network 0.01230586060226602
data 0.0115893591939151
program 0.009711015898293995
machin 0.008840724177645533
topic 0.007756155568422742
architectur 0.007707950293464971
cours 0.0072066937645633294
cover 0.00719039700972224
==Topic 2==
theori 0.0187819824820313
optic 0.01480961585529588
basic 0.014464912030173485
linear 0.014187644724877217
problem 0.013130108749837888
time 0.01116233804623396
equat 0.011143980682588888
comput 0.010084060695819972
statist 0.009450583811176654
probabl 0.009263395724096825
==Topic 3==
teach 0.013303581315887483
work 0.012083551554819319
data 0.011485498938022668
develop 0.01090039078898205
project 0.010129598121454017
evalu 0.009296651659040563
manag 0.009068386431788655
skill 0.00903921458008353
group 0.008792777333610923
inform 0.0077446185721930195
==Topic 4==
energi 0.021114334921159272
circuit 0.014851992182929085
materi 0.014541857034849939
mechan 0.011773488263242902
technolog 0.

In [7]:
lda_models['b=2.0,a=6.0']= LDA.train(course_term_counts_rdd, seed=0, docConcentration=6.0, topicConcentration=2.0, k=n_topics)
describe_topics(lda_models['b=2.0,a=6.0'])

==Topic 1==
algorithm 0.009947864649107352
comput 0.008687859160916814
linear 0.008154186874410375
analysi 0.007990754583215487
data 0.007476387771741616
commun 0.007131159808340019
signal 0.007091357790668333
cours 0.007009998823305165
lectur 0.006830549203424182
basic 0.006658662972356873
==Topic 2==
optic 0.01956260223101345
basic 0.00853458750627916
theori 0.008120450147083302
equat 0.007510973526429205
cours 0.0068722029030414975
exercis 0.006221600932723454
applic 0.005945083114366682
concept 0.005633119383962713
lectur 0.005441908289894008
problem 0.0053307048933143595
==Topic 3==
project 0.016622114349436747
data 0.010733084356221646
evalu 0.009020210984747873
work 0.008481338680239812
develop 0.008279973448081828
plan 0.007803590473323234
report 0.0076468527588994094
skill 0.007449132278450803
teach 0.007245610132746762
research 0.0068378392629459455
==Topic 4==
energi 0.01655318870226515
technolog 0.009373808020448062
case 0.006099915313353643
studi 0.00607009884193102
integr

In [16]:
lda_models['b=4.0,a=6.0']= LDA.train(course_term_counts_rdd, seed=0, docConcentration=6.0, topicConcentration=4.0, k=n_topics)
describe_topics(lda_models['b=4.0,a=6.0'])

==Topic 1==
analysi 0.006604644943004321
lectur 0.006472158602541652
cours 0.0063481715580603155
basic 0.006028867977190321
comput 0.005863878671280857
project 0.0057990146295897665
concept 0.005559167896981306
data 0.005464938825555469
problem 0.005451892674105337
end 0.005383657514812891
==Topic 2==
optic 0.0074480392431172885
basic 0.005928640328397294
cours 0.005769265210727811
lectur 0.005191357493729489
concept 0.004935526668504323
analysi 0.004925928475343011
theori 0.004868384637416166
exercis 0.004759461722493213
applic 0.004731017924960213
end 0.004713207410358484
==Topic 3==
project 0.010493302950844993
data 0.006715064522527589
work 0.0063942989266937205
analysi 0.006195823846642419
evalu 0.006146595204100685
lectur 0.0060735917892812095
teach 0.005844870367094399
skill 0.0056565913967624665
cours 0.005633232484751072
activ 0.005535444821285911
==Topic 4==
energi 0.006695331639255742
analysi 0.00542690177992964
lectur 0.005416418271229072
work 0.005340660053170465
cours 0.0

In [12]:
lda_models['b=8.0,a=6.0']= LDA.train(course_term_counts_rdd, seed=0, docConcentration=6.0, topicConcentration=8.0, k=n_topics)
describe_topics(lda_models['b=8.0,a=6.0'])

==Topic 1==
project 0.006781382631167716
lectur 0.006623587553720297
cours 0.006497818285076023
analysi 0.006457210074147337
basic 0.005995937057102911
concept 0.005815716886013363
end 0.005672348573951081
work 0.005569965583536346
activ 0.005495611638377166
data 0.005365541078369795
==Topic 2==
cours 0.005101975933956876
basic 0.004941628617768604
lectur 0.004798837100761532
analysi 0.004664348002513567
end 0.0044404031663378175
concept 0.00443315252350136
project 0.004321613811482582
outcom 0.004241313193459503
teach 0.004203483773919622
exercis 0.004154618139997959
==Topic 3==
project 0.006983090952583957
lectur 0.005718736729902692
cours 0.0056676345967399
analysi 0.0055906090751631325
work 0.005300738984999717
end 0.005180245801786062
concept 0.005127297845874806
basic 0.005047129830491975
teach 0.005033499533246998
activ 0.004980709399219188
==Topic 4==
cours 0.0054983300562326505
lectur 0.005412789868928923
analysi 0.005399164435415369
project 0.005349073745619435
basic 0.005066

In [22]:
lda_models['b=20.0,a=6.0']= LDA.train(course_term_counts_rdd, seed=0, docConcentration=6.0, topicConcentration=20.0, k=n_topics)
describe_topics(lda_models['b=20.0,a=6.0'])

== Topic 1 ==
        Term | Score     
     project | 0.005699927451791348
       cours | 0.005675352472968332
      lectur | 0.005558527318683942
     analysi | 0.005489682581417621
       basic | 0.005304933161058678
     concept | 0.0050746027007829275
         end | 0.005030079536310235
        work | 0.004886578770034396
       activ | 0.004772441531754302
      outcom | 0.004732445703836478
== Topic 2 ==
        Term | Score     
     project | 0.00569123588499827
       cours | 0.005649642066900051
      lectur | 0.0055489268237016495
     analysi | 0.005411514982275412
       basic | 0.005256182311629518
     concept | 0.00500888367483297
         end | 0.004999375315879999
        work | 0.004855530804159495
       activ | 0.0047433640770882775
      outcom | 0.00470799558829096
== Topic 3 ==
        Term | Score     
     project | 0.005949126437614987
       cours | 0.005887021575211258
      lectur | 0.005846505303861788
     analysi | 0.005648556197456802
       basic | 0

In [23]:
lda_models['b=200.0,a=6.0']= LDA.train(course_term_counts_rdd, seed=0, docConcentration=6.0, topicConcentration=200.0, k=n_topics)
describe_topics(lda_models['b=200.0,a=6.0'])

== Topic 1 ==
        Term | Score     
     project | 0.005972819281965187
       cours | 0.005806646481399336
      lectur | 0.0057224626107993115
     analysi | 0.00559626368374838
       basic | 0.00537036852811633
     concept | 0.005179809580943456
         end | 0.0051409959857404005
        work | 0.00504894818337188
       activ | 0.00490600409126154
      outcom | 0.004823214955728706
== Topic 2 ==
        Term | Score     
     project | 0.005976824665717996
       cours | 0.005808937952166187
      lectur | 0.005726594249777617
     analysi | 0.005596967850533619
       basic | 0.005368922864393692
     concept | 0.00517959620704644
         end | 0.005142097642985559
        work | 0.005051404021266235
       activ | 0.004907014927721276
      outcom | 0.004824290327587884
== Topic 3 ==
        Term | Score     
     project | 0.0059747830900159955
       cours | 0.005807972378430564
      lectur | 0.005727300759799595
     analysi | 0.005596379793706528
       basic | 0.0

$\beta$ controls the distribution of terms per topic, and very high $\beta$ values force all topics to have a nearly uniform distribution over all terms: every topic is (almost) equally explained by every term. We can indeed see that for $\beta \ge 8$, the topics don't mean anything and have very general and ubiquitous terms as their most relevant terms. (Strictly, a high $\beta$ only means that each topic should have a uniform distribution; the ubiquitous terms get slightly higher relevance scores because of the not-so-small $\alpha$ value and the optimization process.)

The same holds for $\beta=4$ as well, although some topic-specific words like 'optic' and 'physic' survive. For the smaller values $\beta \in \{1.01, 2\}$, the topics seem to be about the same things, although they are a bit more uniform for $\beta=2$ (in the sense that it is harder to identify a topic as orthogonal to the others).
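
One possible way to quantify how much the topics collapse onto each other is the Jaccard overlap between their top-term sets. The sketch below applies it to the top-10 terms of Topics 1 and 2 copied from the outputs above for $\beta=1.01$ and $\beta=200$. The helper `jaccard` is an illustration, not something used in the lab code.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Top-10 terms of Topics 1 and 2, copied from the printed outputs above.
beta_low = (
    ["commun", "algorithm", "network", "data", "program",
     "machin", "topic", "architectur", "cours", "cover"],
    ["theori", "optic", "basic", "linear", "problem",
     "time", "equat", "comput", "statist", "probabl"],
)
beta_high = (
    ["project", "cours", "lectur", "analysi", "basic",
     "concept", "end", "work", "activ", "outcom"],
    ["project", "cours", "lectur", "analysi", "basic",
     "concept", "end", "work", "activ", "outcom"],
)

print("beta = 1.01:", jaccard(*beta_low))   # no shared top terms
print("beta = 200 :", jaccard(*beta_high))  # identical top terms
```

An overlap near 0 means the two topics are described by disjoint vocabularies, while an overlap near 1 means they are indistinguishable from their top terms alone.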

## Exercise 4.10: EPFL's taught subjects

In [62]:
# fix an anomaly in the dataset that we only spotted now: a course with a wrong courseId
for i, course in enumerate(courses):
    if 'Caution, these contents corresponds to the coursebooks of last year' in course['courseId']:
        course['courseId'] = 'MGT-641(c)'
        course_ids[i] = 'MGT-641(c)'
        break

### Tests; please scroll further to see the insights

In [60]:
course_code_prefixes = set()
for course in course_ids:
    course_code_prefixes.add(course.split('-')[0])

In [61]:
course_code_prefixes

{'AR',
 'BIO',
 'BIOENG',
 'CH',
 'CIVIL',
 'COM',
 'CS',
 'ChE',
 'EE',
 'ENG',
 'ENV',
 'FIN',
 'HUM',
 'MATH',
 'ME',
 'MGT',
 'MICRO',
 'MSE',
 'PENS',
 'PHYS'}

From the above, it looks like there are ~14 different clusters of subjects covered once closely related prefixes are merged, so we will set $k = 14$. We want the distribution of topics per document to be not too close to uniform, but rather a bit selective, so we'll consider $\alpha \le 5$. The distribution of terms per topic should also be rather selective, so we'll favor smaller $\beta$ values.

In [119]:
n_topics = 14
lda_models['b=1.15,a=4.5'] = LDA.train(course_term_counts_rdd, seed=None, docConcentration=4.5, topicConcentration=1.15, maxIterations=40, k=n_topics)
describe_topics(lda_models['b=1.15,a=4.5'], n_words_per_topic=5)

==Topic 1==
control 0.03593080079137186
signal 0.02633415282678221
analysi 0.013154283512921229
filter 0.012726635262974169
cours 0.012152661097704038
==Topic 2==
power 0.023410193365175712
circuit 0.021208731862498043
structur 0.0180605345265911
commun 0.014137518019115726
test 0.013871207994107506
==Topic 3==
energi 0.050831044022439514
product 0.02001230886463833
quantum 0.017733441877510366
main 0.016928277222564923
concept 0.014464179573715484
==Topic 4==
data 0.04294023848527326
comput 0.027764510568993043
program 0.0268450897238096
optim 0.026375253653842558
algorithm 0.024285479517452135
==Topic 5==
project 0.02609530043894829
research 0.02580412672603687
report 0.018561634897625242
plan 0.01448345106432537
technolog 0.013297710157601037
==Topic 6==
equat 0.02526550006453647
numer 0.02242767623457437
flow 0.022290071443818154
network 0.015987386465149807
comput 0.015203927008683853
==Topic 7==
optic 0.042807739761181386
imag 0.03260217445726084
laser 0.01636056872490135
microsc

In [132]:
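def describe_term(term, model):
    # topicsMatrix() is indexed here as (term, topic): one row per vocabulary term
    topics_matrix = model.topicsMatrix()
    term_topic_distribution = topics_matrix[terms.index(term)]
    # indices of the 6 topics with the largest weight for this term, in descending order
    top_topics = np.argsort(term_topic_distribution)[-6:][::-1]
    print('== Term: {} =='.format(term))
    for topic in top_topics:
        # the weights are unnormalized, so they need not sum to 1
        print('{} * Topic {}'.format(term_topic_distribution[topic], topic+1))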

In [145]:
describe_term('statist', lda_models['b=1.15,a=4.5'])

== Term: statist ==
216.17962572633863 * Topic 12
21.26798135821673 * Topic 4
8.058328555339271 * Topic 1
5.5451397114354615 * Topic 3
0.782457751731907 * Topic 10
0.4977168455872002 * Topic 6
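
The weights printed above are unnormalized, which makes them hard to compare across terms. A small sketch of turning them into relative shares follows. It uses only the six values printed above, so the shares are relative to those six topics and slightly overstate the true proportions over all 14 topics. The variable names are illustrative.

```python
# Weights printed by describe_term('statist', ...) above, keyed by topic number.
weights = {
    12: 216.17962572633863,
    4: 21.26798135821673,
    1: 8.058328555339271,
    3: 5.5451397114354615,
    10: 0.782457751731907,
    6: 0.4977168455872002,
}

# Divide each weight by the total so that the shares sum to 1.
total = sum(weights.values())
shares = {topic: w / total for topic, w in weights.items()}

for topic, share in shares.items():
    print("Topic {:2d}: {:.3f}".format(topic, share))
```

Read this way, the term 'statist' is carried almost entirely by Topic 12, with Topic 4 a distant second.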


### 1. Find the combination of k, α and β that gives the most interpretable topics.

After some testing, and in agreement with the above conjectures, we found that $k=14$, $\alpha=4.5$ and $\beta=1.15$ produce good topic extraction results (the topic-term distributions can be found just above).

In [150]:
print('Tested values:')
list(lda_models.keys())

Tested values:


['b=2.0,a=6.0',
 'b=1.2,a=4.0',
 'b=1.1,a=3.0',
 'b=1.1,a=1.01',
 'b=1.01,a=1.01',
 'b=1.01,a=4.0',
 'b=1.3,a=4.0',
 'b=1.4,a=4.5',
 'b=1.15a=4.5',
 'b=1.15,a=4.5']

### 2. Explain why you chose these values.

From the course code prefixes (like `MATH-` or `COM-`), it looks like there are ~14 different clusters of subjects covered. Although there are 20 different prefixes, some of them are too close (`BIO` and `BIOENG`, `CH` and `ChE`), and others wouldn't contain many specific words (e.g. `ENG`), so we reduced the number to 14. We want the distribution of topics per document to be rather far from uniform (thus a bit selective), so we chose $\alpha = 4.5$, using the intuition for tuning $\alpha$ from the exercise above. The distribution of terms per topic should also be rather selective, even more so than the distribution of topics per document, so we chose a small $\beta = 1.15$ (again, using the intuition gained above).

### 3. Report the values of the hyperparameters that you used and your labels for the topics.

$k=14$, $\alpha =4.5$, $\beta = 1.15$ 

In [174]:
topic_labels = ['Signal Processing/Control Theory', 
                'Circuits/EE', 'Energy', 
                'Computer Science', 
                'Project (/Semester Project)', 
                '(Applied) Math', 
                'Optics', 'Materials Science', 
                'Chemistry', 
                'lack of specific terms', 
                'Architecture', 
                'Probability and Statistics', 
                '¯\_(ツ)_/¯', 
                'Management and Environmental Sciences']
describe_topics(lda_models['b=1.15,a=4.5'], n_words_per_topic=6, topic_labels=topic_labels)

== Topic 1 - Signal Processing/Control Theory ==
        Term | Score     
     control | 0.03593080079137186
      signal | 0.02633415282678221
     analysi | 0.013154283512921229
      filter | 0.012726635262974169
       cours | 0.012152661097704038
     exercis | 0.011235938287739888
== Topic 2 - Circuits/EE ==
        Term | Score     
       power | 0.023410193365175712
     circuit | 0.021208731862498043
    structur | 0.0180605345265911
      commun | 0.014137518019115726
        test | 0.013871207994107506
      mechan | 0.013403908996972172
== Topic 3 - Energy ==
        Term | Score     
      energi | 0.050831044022439514
     product | 0.02001230886463833
     quantum | 0.017733441877510366
        main | 0.016928277222564923
     concept | 0.014464179573715484
      theori | 0.013493624711263079
== Topic 4 - Computer Science ==
        Term | Score     
        data | 0.04294023848527326
      comput | 0.027764510568993043
     program | 0.0268450897238096
       optim | 

## Exercise 4.11: Wikipedia structure

### NOTE: please find the proper solution of this exercise in a separate notebook called lab4-3-lda-11-wikipedia.
What follows is an initial unsuccessful attempt to run LDA on the entire dataset. The time to train LDA on this dataset was prohibitive (only one iteration was completed after 6 hours of running, and with uninteresting results). This was probably due to the immense vocabulary size of the data (~500K), which defines the size of the input vectors and slows down the model drastically. In the supplementary notebook, we reduced the dictionary size and removed very infrequent words (as well as a few very frequent ones) and fitted LDA on the resulting dataset. The results were very impressive.
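
The vocabulary reduction described above can be sketched in plain Python as follows. This is not the code from the supplementary notebook: the thresholds `min_df` and `max_df_ratio`, the helper names and the toy documents are all assumptions chosen for illustration. The sketch drops terms that appear in too few or too many documents and stores each document as sparse `(term_index, count)` pairs, using a dictionary lookup instead of `list.index`.

```python
from collections import Counter


def prune_vocabulary(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms whose document frequency lies in [min_df, max_df_ratio * n_docs]."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    max_df = max_df_ratio * len(docs)
    return sorted(t for t, n in df.items() if min_df <= n <= max_df)


def to_sparse_counts(tokens, index):
    """Map a token list to sorted (term_index, count) pairs, skipping pruned terms."""
    counts = Counter(index[t] for t in tokens if t in index)
    return sorted(counts.items())


# Toy corpus (assumption): four tiny "articles".
toy_docs = [
    ["the", "war", "city", "war"],
    ["the", "city", "century"],
    ["the", "war", "seepag"],
    ["the", "city", "war"],
]

vocab = prune_vocabulary(toy_docs, min_df=2, max_df_ratio=0.9)
index = {t: i for i, t in enumerate(vocab)}  # O(1) term lookup

print(vocab)
print(to_sparse_counts(toy_docs[0], index))
```

In a Spark pipeline, pairs of this form could presumably be passed to `Vectors.sparse(len(vocab), pairs)` instead of building a dense vector with one entry per vocabulary term.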

In [6]:
wikipedia = sc.textFile('/ix/wikipedia-for-schools.txt')

In [8]:
c = wikipedia.map(lambda x: 1).reduce(lambda a, b: a+b)
c

5554

In [9]:
def extract_tokens(json_dump):
    article = json.loads(json_dump)
    return [article['page_id'], article['tokens']]

def list_union(l1, l2):
    l1.extend(l2)
    return list(set(l1))

def tokenize(json_dump):
    article = json.loads(json_dump)
    bag_of_words = [0 for _ in range(len(wiki_tokens))]
    for term in article['tokens']:
        bag_of_words[wiki_tokens.index(term)] += 1
    return [article['page_id'], Vectors.dense(bag_of_words)]

In [11]:
wiki_tokens = wikipedia.map(extract_tokens).map(lambda x: x[1]).reduce(list_union)
len(wiki_tokens)

494494

In [31]:
wiki_articles_rdd = wikipedia.map(tokenize)

Before any testing, here is our intuition. Wikipedia for Schools would probably cover 20 to 30 topics, which would include the subjects in an average school curriculum plus additional topics such as Art or Movies/Entertainment(?). Bearing in mind that these articles will probably be longer and more general than EPFL course descriptions, we might use a larger $\alpha$ value here (a document is likely to touch on a lot of topics). As for terms, since the content is designed for schools, there would probably be less jargon in the articles, and most articles would use most of the common words in English. Thus, we can use a larger $\beta$ value (eccentric words are less likely).

Let's try these hypotheses out.

In [14]:
wiki_lda_models = {}

In [32]:
n_topics = 10
wiki_lda_models['b=1.8,a=7.0'] = LDA.train(wiki_articles_rdd, seed=None, docConcentration=7.0, topicConcentration=1.8, maxIterations=1, k=n_topics)

In [35]:
describe_topics(wiki_lda_models['b=1.8,a=7.0'], terms=wiki_tokens, n_words_per_topic=10)

== Topic 1 ==
        Term | Score     
        time | 0.0023735188240640507
           – | 0.002364930242138507
       years | 0.0020634430380079482
       world | 0.002008218961200823
         war | 0.0019737366021995895
      united | 0.0018580881145960236
    american | 0.0018430435216597327
      states | 0.001832750629942985
        city | 0.0017487375824974947
     century | 0.0016448350305299621
== Topic 2 ==
        Term | Score     
           – | 0.0023655416471026615
        time | 0.0022554488678780465
       world | 0.0020824941497861217
       years | 0.00205408907571085
         war | 0.0019864630599920555
      united | 0.0018429172557723082
      states | 0.0018081105418515319
    american | 0.0018009983664294524
        city | 0.0016626472735262485
     century | 0.0016382912191693127
== Topic 3 ==
        Term | Score     
           – | 0.0023602070609512293
        time | 0.0023070381727559987
       years | 0.0020665037405691545
       world | 0.00198011861901586