# Topic Modeling (Prepare)

On Monday we talked about summarizing your documents using just token counts. Today, we're going to learn about a much more sophisticated approach - learning 'topics' from documents. Topics are a latent structure. They are not directly observable in the data, but we know they're there by reading them.

> **latent**: existing but not yet developed or manifest; hidden or concealed.

## Use Cases
Primary use case: what the hell are your documents about? Who might want to know that in industry - 
* Identifying common themes in customer reviews
* Discovering the needle in a haystack 
* Monitoring communications (Email - State Department) 

## Learning Objectives
*At the end of the lesson you should be able to:*
* Part 0: Warm-Up
* Part 1: Describe how an LDA Model works
* Part 2: Estimate a LDA Model with Gensim
* Part 3: Interpret LDA results & Select the appropriate number of topics

# Part 0: Warm-Up
How do we do a grid search? 

In [27]:
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

  and should_run_async(code)


In [2]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
# Load training data
newsgroups_train = fetch_20newsgroups(subset='train', 
                                      remove=('headers', 'footers', 'quotes'))

# Load testing data
newsgroups_test = fetch_20newsgroups(subset='test', 
                                     remove=('headers', 'footers', 'quotes'))

print(f'Training Samples: {len(newsgroups_train.data)}')
print(f'Testing Samples: {len(newsgroups_test.data)}')

Training Samples: 11314
Testing Samples: 7532


In [4]:
newsgroups_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [5]:
newsgroups_train['target_names']

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [6]:
newsgroups_train['data'][1000]

"Anybody seen mouse cursor distortion running the Diamond 1024x768x256 driver?\nSorry, don't know the version of the driver (no indication in the menus) but it's a recently\ndelivered Gateway system.  Am going to try the latest drivers from Diamond BBS but wondered\nif anyone else had seen this.\n\npost or email"

### GridSearch on Just Classifier
* Fit the vectorizer and prepare BEFORE it goes into the gridsearch

In [7]:
# Instantiate vectorizer
vect = TfidfVectorizer()

# Transform the training data
X_train = vect.fit_transform(newsgroups_train.data)
print(X_train.shape)

(11314, 101631)


In [8]:
params_1 = {
    'min_samples_leaf': [1, 2, 5, 10]
}

# Instantiate classifier
clf = RandomForestClassifier()

# GridSearch
gs1 = GridSearchCV(clf, params_1, cv=5, n_jobs=-1, verbose=1)
gs1.fit(X_train, newsgroups_train['target'])

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 out of  20 | elapsed:  1.5min remaining:    9.6s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:  1.5min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              rando

In [9]:
gs1.best_score_

0.6574153539838395

In [10]:
gs1.best_params_

{'min_samples_leaf': 2}

In [11]:
test_sample = vect.transform(["The new york yankees are the best team in the region."])
test_sample.shape

(1, 101631)

In [12]:
# Category number
gs1.predict(test_sample)[0]

9

In [13]:
# Category name
newsgroups_train['target_names'][9]

'rec.sport.baseball'

### GridSearch with BOTH the Vectoizer & Classifier

In [14]:
from sklearn.pipeline import Pipeline

# 1. Create a pipeline with a vectorize and a classifier
# 2. Use Grid Search to optimize the entire pipeline
pipe = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', RandomForestClassifier(random_state=42))
])

params_2 = {
    'vect__stop_words': (None, 'english'),
    'vect__min_df': (2, 5),
    'clf__max_depth': (10, None)
}

gs2 = GridSearchCV(pipe, params_2, cv=5, n_jobs=-1, verbose=1)
gs2.fit(newsgroups_train['data'], newsgroups_train['target'])

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  1.5min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        no

In [15]:
gs2.best_score_

0.6607746264533867

In [16]:
gs2.best_params_

{'clf__max_depth': None, 'vect__min_df': 2, 'vect__stop_words': 'english'}

In [17]:
pred = gs2.predict(["The new york yankees are the best team in the region."])
pred

array([9])

In [18]:
newsgroups_train['target_names'][pred[0]]

'rec.sport.baseball'

Advantages to using GS with the Pipe:
* Allows us to make predictions on raw text increasing reproducibility. :)
* Allows us to tune the parameters of the vectorizer along side the classifier. :D 

# Part 1: Describe how an LDA Model works

[Your Guide to Latent Dirichlet Allocation](https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d)

[LDA Topic Modeling](https://lettier.com/projects/lda-topic-modeling/)

[Topic Modeling with Gensim](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/)

In [19]:
# Download spacy model
import spacy.cli
spacy.cli.download("en_core_web_lg")

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [20]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import spacy
import pyLDAvis  # Will break jupyter lab
import pyLDAvis.gensim 
import matplotlib.pyplot as plt
%matplotlib inline

In [21]:
df = pd.DataFrame({
    'content': newsgroups_train['data'],
    'target': newsgroups_train['target'],
    'target_names': [newsgroups_train['target_names'][i] for i in newsgroups_train['target']]
})
print(df.shape)

(11314, 3)


  and should_run_async(code)


In [22]:
pd.set_option('display.max_colwidth', 0)
df.sample(3)

  and should_run_async(code)


Unnamed: 0,content,target,target_names
998,"\nEssential tremor is a progressive hereditary tremor that gets worse\nwhen the patient tries to use the effected member. All limbs, vocal\ncords, and head can be involved. Inderal is a beta-blocker and\nis usually effective in diminishing the tremor. Alcohol and mysoline\nare also effective, but alcohol is too toxic to use as a treatment.\n-- \n----------------------------------------------------------------------------\nGordon Banks N3JXP | ""Skepticism is the chastity of the intellect, and\ngeb@cadre.dsl.pitt.edu | it is shameful to surrender it too soon.""",13,sci.med
1210,"\n\nI recommend the book ""Adams _v_ Texas"", the story of a man (Adams) who\nwas sentenced to death for a crime he didn't commit. Most of the book\nis the story of the long appeals process, and the problems and delays\ncaused by not being able to introduce new evidence in certain courts.\n",18,talk.politics.misc
5309,"\nGlad to hear this, just a note, Osiris, Mithras and many other\ncult gods resurrected as well, so there's a good chance for all of\nus to maybe end up in a virtual reality simulator, and live forever,\nhurrah!\n\nSorry, this was a joke, some sort of one anyway. I'm the first\nthat connected Osiris with a virtual reality personality database.\nTime to write a book.\n\n\nCheers,\nKent",19,talk.religion.misc


In [50]:
# For reference on regex: https://docs.python.org/3/library/re.html

# From 'content' column: 

    # 1. Remove new line characters
df['clean_text'] = df['content'].apply(lambda x: re.sub('\s+', ' ', x))

    # 2. Remove Emails
df['clean_text'] = df['content'].apply(lambda x: re.sub('From: \S+@\S+', '', x))
    
    # 3. Remove non-alphanumeric characters
df['clean_text'] = df['content'].apply(lambda x: re.sub('[^a-zA-z]', ' ', x))  

    # 4. Remove extra whitespace 
df['clean_text'] = df['content'].apply(lambda x: ' '.join(x.split()))

In [51]:
df.sample(3)

Unnamed: 0,content,target,target_names,clean_text,lemmas
11241,I heard FASTMicro went out of business. Is this true? \nThey don't answer their 800 number. It's 800-821-9000.\n,3,comp.sys.ibm.pc.hardware,I heard FASTMicro went out of business. Is this true? They don't answer their 800 number. It's 800-821-9000.,"[ , hear, FASTMicro, go, business, , true, \n, answer, 800, number, , 800, 821, 9000, \n]"
9167,"\n\nThe down side is that when I'm in my cage, I have on numerous occasions\nslammed my hand into the rolled up window in an effort to wave at\na passing biker. Ow.\n",8,rec.motorcycles,"The down side is that when I'm in my cage, I have on numerous occasions slammed my hand into the rolled up window in an effort to wave at a passing biker. Ow.","[\n\n, cage, numerous, occasion, \n, slam, hand, roll, window, effort, wave, \n, pass, biker, , ow, \n]"
1461,"\n Ah, so you finally found a use for that super slo-mo and frame advance\nother than scrutinizing ""Sorority Babes in Heat"". Congrats! \n\n\n Trust me, you'd have a helluva time manipulating them. Besides, if you\nconverted the film to video you'd have all kinds of artifacts because of the\ndifference in frame rate (unless you're an expert at doing 3/2 pulldown for\na laserdisc company or something). \n\n\n Hey, no fair! What about 'Fettucine' Alfredo Griffin? The guy practically\nhas to pivot the bat around along with his body. \n\n\n Daulton doesn't strike me as all that strange. He's a little bit quiet at \nthe plate but, like Franco, gets the bat through the hitting zone on a level\nplane. The first time I watched Julio Franco, I didn't think *anyone* could\nhit like that. Now I marvel at how easy he makes it look; every time he makes\ncontact, it's *solid*. He's got good power to all fields and rarely is he\ncaught not ready for a pitch. \n\n I wonder if Phil Plantier had a severe bout with hemorrhoids and had to\npractice his swing while 'on the throne'? :-) Sure looks like it :-) \n\n How 'bout one to add to your list: Travis Fryman? The guy plants his front\nfoot and seems to swing *across* his body. He generates a lot of power, but\nI keep thinking he could generate even more if he could get a better pivot\nout of his hips. \n\n\n Well, they're already spoken for (by several people), but .. \n\n I'd add Robbie Alomar's name to the list, among others. I really like Dean\nPalmer's swing, for some twisted reason, as well as Pedro Munoz's swing. \n\n\n A thought about May: It looks like they've taught him to turn on the ball.\nIMHO, he's going to fall in love with his newfound power and start pulling\noff the ball to the point that he's going to see *lots* of sinkers/sliders\nlow and away. Unless he adjusts quickly and starts rifling doubles to left \nand left-center, IMHO you're going to see a good number of weak grounders to \nthe right side of the infield in the next month. \n",9,rec.sport.baseball,"Ah, so you finally found a use for that super slo-mo and frame advance other than scrutinizing ""Sorority Babes in Heat"". Congrats! Trust me, you'd have a helluva time manipulating them. Besides, if you converted the film to video you'd have all kinds of artifacts because of the difference in frame rate (unless you're an expert at doing 3/2 pulldown for a laserdisc company or something). Hey, no fair! What about 'Fettucine' Alfredo Griffin? The guy practically has to pivot the bat around along with his body. Daulton doesn't strike me as all that strange. He's a little bit quiet at the plate but, like Franco, gets the bat through the hitting zone on a level plane. The first time I watched Julio Franco, I didn't think *anyone* could hit like that. Now I marvel at how easy he makes it look; every time he makes contact, it's *solid*. He's got good power to all fields and rarely is he caught not ready for a pitch. I wonder if Phil Plantier had a severe bout with hemorrhoids and had to practice his swing while 'on the throne'? :-) Sure looks like it :-) How 'bout one to add to your list: Travis Fryman? The guy plants his front foot and seems to swing *across* his body. He generates a lot of power, but I keep thinking he could generate even more if he could get a better pivot out of his hips. Well, they're already spoken for (by several people), but .. I'd add Robbie Alomar's name to the list, among others. I really like Dean Palmer's swing, for some twisted reason, as well as Pedro Munoz's swing. A thought about May: It looks like they've taught him to turn on the ball. IMHO, he's going to fall in love with his newfound power and start pulling off the ball to the point that he's going to see *lots* of sinkers/sliders low and away. Unless he adjusts quickly and starts rifling doubles to left and left-center, IMHO you're going to see a good number of weak grounders to the right side of the infield in the next month.","[\n , ah, finally, find, use, super, slo, mo, frame, advance, \n, scrutinize, Sorority, Babes, Heat, Congrats, \n\n\n , trust, helluva, time, manipulate, \n, convert, film, video, kind, artifact, \n, difference, frame, rate, expert, 3/2, pulldown, \n, laserdisc, company, \n\n\n , hey, fair, Fettucine, Alfredo, Griffin, guy, practically, \n, pivot, bat, body, \n\n\n , Daulton, strike, strange, little, bit, quiet, \n, plate, like, Franco, get, bat, hit, zone, level, \n, plane, time, watch, Julio, Franco, think, \n, hit, like, marvel, easy, make, look, time, make, \n, contact, solid, get, good, power, field, rarely, \n, catch, ready, pitch, \n\n , wonder, Phil, Plantier, severe, bout, hemorrhoid, ...]"


In [52]:
nlp = spacy.load("en_core_web_lg")

In [53]:
# Leverage tqdm for progress_apply
from tqdm import tqdm
tqdm.pandas()

# If you're on macOS, Linux, or python session executed from Windows Subsystem for Linux (WSL)
# conda activate U4-S1-NLP
# pip install pandarallel
#
# from pandarallel import pandarallel
# pandarallel.initialize(progress_bar=True)
#
# df['lemmas'] = df['content'].parallel_apply(get_lemmas)
#
# Ref: https://github.com/nalepae/pandarallel

In [54]:
# Create 'lemmas' column
def get_lemmas(x):
    lemmas = []
    for token in nlp(x):
        if (token.is_stop!=True) and (token.is_punct!=True):
            lemmas.append(token.lemma_)
    return lemmas

df['lemmas'] = df['clean_text'].progress_apply(get_lemmas)

100%|████████████████████████████████████████████████████████████████████████████| 11314/11314 [06:31<00:00, 28.89it/s]


In [55]:
df.head()

Unnamed: 0,content,target,target_names,clean_text,lemmas
0,"I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.",7,rec.autos,"I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail.","[wonder, enlighten, car, see, day, 2-door, sport, car, look, late, 60s/, early, 70, call, Bricklin, door, small, addition, bumper, separate, rest, body, know, tellme, model, engine, spec, year, production, car, history, info, funky, look, car, e, mail]"
1,"A fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't answered this\npoll. Thanks.",4,comp.sys.mac.hardware,"A fair number of brave souls who upgraded their SI clock oscillator have shared their experiences for this poll. Please send a brief message detailing your experiences with the procedure. Top speed attained, CPU rated speed, add on cards and adapters, heat sinks, hour of usage per day, floppy disk functionality with 800 and 1.4 m floppies are especially requested. I will be summarizing in the next two days, so please add to the network knowledge base if you have done the clock upgrade and haven't answered this poll. Thanks.","[fair, number, brave, soul, upgrade, SI, clock, oscillator, share, experience, poll, send, brief, message, detail, experience, procedure, speed, attain, CPU, rate, speed, add, card, adapter, heat, sink, hour, usage, day, floppy, disk, functionality, 800, 1.4, m, floppy, especially, request, summarize, day, add, network, knowledge, base, clock, upgrade, answer, poll, thank]"
2,"well folks, my mac plus finally gave up the ghost this weekend after\nstarting life as a 512k way back in 1985. sooo, i'm in the market for a\nnew machine a bit sooner than i intended to be...\n\ni'm looking into picking up a powerbook 160 or maybe 180 and have a bunch\nof questions that (hopefully) somebody can answer:\n\n* does anybody know any dirt on when the next round of powerbook\nintroductions are expected? i'd heard the 185c was supposed to make an\nappearence ""this summer"" but haven't heard anymore on it - and since i\ndon't have access to macleak, i was wondering if anybody out there had\nmore info...\n\n* has anybody heard rumors about price drops to the powerbook line like the\nones the duo's just went through recently?\n\n* what's the impression of the display on the 180? i could probably swing\na 180 if i got the 80Mb disk rather than the 120, but i don't really have\na feel for how much ""better"" the display is (yea, it looks great in the\nstore, but is that all ""wow"" or is it really that good?). could i solicit\nsome opinions of people who use the 160 and 180 day-to-day on if its worth\ntaking the disk size and money hit to get the active display? (i realize\nthis is a real subjective question, but i've only played around with the\nmachines in a computer store breifly and figured the opinions of somebody\nwho actually uses the machine daily might prove helpful).\n\n* how well does hellcats perform? ;)\n\nthanks a bunch in advance for any info - if you could email, i'll post a\nsummary (news reading time is at a premium with finals just around the\ncorner... :( )\n--\nTom Willis \ twillis@ecn.purdue.edu \ Purdue Electrical Engineering",4,comp.sys.mac.hardware,"well folks, my mac plus finally gave up the ghost this weekend after starting life as a 512k way back in 1985. sooo, i'm in the market for a new machine a bit sooner than i intended to be... i'm looking into picking up a powerbook 160 or maybe 180 and have a bunch of questions that (hopefully) somebody can answer: * does anybody know any dirt on when the next round of powerbook introductions are expected? i'd heard the 185c was supposed to make an appearence ""this summer"" but haven't heard anymore on it - and since i don't have access to macleak, i was wondering if anybody out there had more info... * has anybody heard rumors about price drops to the powerbook line like the ones the duo's just went through recently? * what's the impression of the display on the 180? i could probably swing a 180 if i got the 80Mb disk rather than the 120, but i don't really have a feel for how much ""better"" the display is (yea, it looks great in the store, but is that all ""wow"" or is it really that good?). could i solicit some opinions of people who use the 160 and 180 day-to-day on if its worth taking the disk size and money hit to get the active display? (i realize this is a real subjective question, but i've only played around with the machines in a computer store breifly and figured the opinions of somebody who actually uses the machine daily might prove helpful). * how well does hellcats perform? ;) thanks a bunch in advance for any info - if you could email, i'll post a summary (news reading time is at a premium with finals just around the corner... :( ) -- Tom Willis \ twillis@ecn.purdue.edu \ Purdue Electrical Engineering","[folk, mac, plus, finally, give, ghost, weekend, start, life, 512k, way, 1985, sooo, market, new, machine, bit, sooner, intend, look, pick, powerbook, 160, maybe, 180, bunch, question, hopefully, somebody, answer, anybody, know, dirt, round, powerbook, introduction, expect, hear, 185c, suppose, appearence, summer, hear, anymore, access, macleak, wonder, anybody, info, anybody, hear, rumor, price, drop, powerbook, line, like, one, duo, go, recently, impression, display, 180, probably, swing, 180, get, 80mb, disk, 120, feel, well, display, yea, look, great, store, wow, good, solicit, opinion, people, use, 160, 180, day, day, worth, take, disk, size, money, hit, active, display, realize, real, subjective, question, ...]"
3,\nDo you have Weitek's address/phone number? I'd like to get some information\nabout this chip.\n,1,comp.graphics,Do you have Weitek's address/phone number? I'd like to get some information about this chip.,"[Weitek, address, phone, number, like, information, chip]"
4,"From article <C5owCB.n3p@world.std.com>, by tombaker@world.std.com (Tom A Baker):\n\n\nMy understanding is that the 'expected errors' are basically\nknown bugs in the warning system software - things are checked\nthat don't have the right values in yet because they aren't\nset till after launch, and suchlike. Rather than fix the code\nand possibly introduce new bugs, they just tell the crew\n'ok, if you see a warning no. 213 before liftoff, ignore it'.",14,sci.space,"From article <C5owCB.n3p@world.std.com>, by tombaker@world.std.com (Tom A Baker): My understanding is that the 'expected errors' are basically known bugs in the warning system software - things are checked that don't have the right values in yet because they aren't set till after launch, and suchlike. Rather than fix the code and possibly introduce new bugs, they just tell the crew 'ok, if you see a warning no. 213 before liftoff, ignore it'.","[article, <, c5owcb.n3p@world.std.com, >, tombaker@world.std.com, Tom, Baker, understanding, expect, error, basically, know, bug, warning, system, software, thing, check, right, value, set, till, launch, suchlike, fix, code, possibly, introduce, new, bug, tell, crew, ok, warning, 213, liftoff, ignore]"


### The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus.

In [56]:
# Create Dictionary - Maps a number to a word
id2word = corpora.Dictionary(df['lemmas'] )

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in df['lemmas']]

In [57]:
# How many words do we have?
len(id2word.keys())

117191

In [58]:
# Let's remove extreme values from the dataset
id2word.filter_extremes(no_below=5, no_above=0.75)

In [59]:
# How many words do we have?
len(id2word.keys())

14653

In [60]:
id2word[76]

'Engineering'

In [61]:
df['clean_text'][5]

'Of course. The term must be rigidly defined in any bill. I doubt she uses this term for that. You are using a quote allegedly from her, can you back it up? I read the article as presenting first an argument about weapons of mass destruction (as commonly understood) and then switching to other topics. The first point evidently was to show that not all weapons should be allowed, and then the later analysis was, given this understanding, to consider another class.'

In [62]:
corpus[5]

[(112, 1),
 (170, 1),
 (186, 1),
 (210, 1),
 (213, 1),
 (214, 1),
 (215, 1),
 (216, 1),
 (217, 1),
 (218, 1),
 (219, 1),
 (220, 1),
 (221, 1),
 (222, 1),
 (223, 1),
 (224, 1),
 (225, 1),
 (226, 1),
 (227, 1),
 (228, 1),
 (229, 1),
 (230, 1),
 (231, 1),
 (232, 1),
 (233, 1),
 (234, 2),
 (235, 1),
 (236, 1),
 (237, 2)]

In [63]:
id2word[252]

'40mb'

In [64]:
id2word[276]

'fact'

In [65]:
# Human readable format of corpus (term-frequency)
[(id2word[word_id], word_count) for word_id, word_count in corpus[5]]

[('impression', 1),
 ('213', 1),
 ('ok', 1),
 ('evidently', 1),
 ('point', 1),
 ('present', 1),
 ('quote', 1),
 ('read', 1),
 ('switch', 1),
 ('term', 1),
 ('topic', 1),
 ('understand', 1),
 ('weapon', 1),
 ('Sean', 1),
 ('September', 1),
 ('Sharon', 1),
 ('accidentally', 1),
 ('bounce', 1),
 ('delete', 1),
 ('directly', 1),
 ('file', 1),
 ('glad', 1),
 ('hmmm', 1),
 ('instead', 1),
 ('publicly', 1),
 ('respond', 2),
 ('rm', 1),
 ('rn', 1),
 ('sure', 2)]

# Part 2: Estimate a LDA Model with Gensim

 ### Train an LDA model

In [66]:
%%time
lda_multicore = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                        id2word=id2word,
                                                        num_topics=20, 
                                                        chunksize=100,
                                                        passes=10,
                                                        per_word_topics=True,
                                                        workers=12)

# https://radimrehurek.com/gensim/models/ldamulticore.html

KeyboardInterrupt: 

In [67]:
lda_multicore.save('lda_multicore.model')

NameError: name 'lda_multicore' is not defined

In [None]:
from gensim import models
lda_multicore =  models.LdaModel.load('lda_multicore.model')

### View the topics in LDA model

In [None]:
newsgroups_train.target_names

In [None]:
# Prints the top 10 words in each topic
pprint(lda_multicore.print_topics())
doc_lda = lda_multicore[corpus]

In [None]:
doc_lda

In [None]:
distro = [lda[d] for d in corpus]

### What is topic Perplexity?
Perplexity is a statistical measure of how well a probability model predicts a sample. As applied to LDA, for a given value of , you estimate the LDA model. Then given the theoretical word distributions represented by the topics, compare that to the actual topic mixtures, or distribution of words in your documents.

### What is topic coherence?
Topic Coherence measures score a single topic by measuring the degree of semantic similarity between high scoring words in the topic. These measurements help distinguish between topics that are semantically interpretable topics and topics that are artifacts of statistical inference.
A set of statements or facts is said to be coherent, if they support each other. Thus, a coherent fact set can be interpreted in a context that covers all or most of the facts. An example of a coherent fact set is “the game is a team sport”, “the game is played with a ball”, “the game demands great physical efforts”

In [None]:
# Compute Perplexity
print('\nPerplexity: ', lda_multicore.log_perplexity(corpus))  # a measure of how good the model is. lower the better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_multicore, 
                                     texts=df['lemmas'], 
                                     dictionary=id2word, 
                                     coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

# Part 3: Interpret LDA results & Select the appropriate number of topics

In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_multicore, corpus, id2word)
pyLDAvis.display(vis)

In [None]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                        id2word=id2word,
                                                        num_topics=num_topics, 
                                                        chunksize=100,
                                                        passes=10,
                                                        per_word_topics=True,
                                                        workers=12)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

In [None]:
%%time
model_list, coherence_values = compute_coherence_values(dictionary=id2word, 
                                                        corpus=corpus, 
                                                        texts=df['lemmas'], 
                                                        start=2, 
                                                        limit=40, 
                                                        step=6)

In [None]:
coherence_values = [0.5054, 0.5332, 0.5452, 0.564, 0.5678, 0.5518, 0.519]

In [None]:
limit=40; start=2; step=6;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(("coherence_values"), loc='best')
plt.show()

In [None]:
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))

In [None]:
# Select the model and print the topics
#optimal_model = model_list[4]
optimal_model =  models.LdaModel.load('optimal_model.model')
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))