# Topic Modeling (Prepare)

On Monday we talked about summarizing your documents using just token counts. Today, we're going to learn about a much more sophisticated approach - learning 'topics' from documents. Topics are a latent structure. They are not directly observable in the data, but we know they're there by reading them.

> **latent**: existing but not yet developed or manifest; hidden or concealed.

## Use Cases
Primary use case: what the hell are your documents about? Who might want to know that in industry - 
* Identifying common themes in customer reviews
* Discovering the needle in a haystack 
* Monitoring communications (Email - State Department) 

## Learning Objectives
*At the end of the lesson you should be able to:*
* Part 0: Warm-Up
* Part 1: Describe how an LDA Model works
* Part 2: Estimate a LDA Model with Gensim
* Part 3: Interpret LDA results & Select the appropriate number of topics

# Part 0: Warm-Up
How do we do a grid search? 

In [1]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [2]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
# Dataset
categories = ['sci.electronics',
              'rec.sport.baseball',
              'rec.sport.hockey']
# Load training data
newsgroups_train = fetch_20newsgroups(subset='train', 
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)
# Load testing data
newsgroups_test = fetch_20newsgroups(subset='test', 
                                     remove=('headers', 'footers', 'quotes'),
                                     categories=categories)
print(f'Training Samples: {len(newsgroups_train.data)}')
print(f'Testing Samples: {len(newsgroups_test.data)}')

Training Samples: 1788
Testing Samples: 1189


In [4]:
newsgroups_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [5]:
newsgroups_train['target_names']

['rec.sport.baseball', 'rec.sport.hockey', 'sci.electronics']

In [6]:
newsgroups_train['data'][1000]

"<lots of pretty good stuff about how the huge towers near most nuclear\npower plants are there to cool the used steam back into near ambient\ntemperature water deleted>\n\n\n\n    as a point of info, some of the early nuclear power plants in this\ncountry used the fission pile as a first stage to get the water hot, and\nthen had a second stage -fossil fuel- step to get the water (actually\nsteam) VERY HOT.\n\n   I remember seeing this at Con Edison's Indian Point #1 power plant,\nwhich is about 30 miles north of NYC, and built more or less 1958.\n\n\ndannyb@panix.com"

### GridSearch on Just Classifier
* Fit the vectorizer and prepare BEFORE it goes into the gridsearch

In [7]:
# Instantiate vectorizer
vect = TfidfVectorizer()

# Transform the training data
X_train = vect.fit_transform(newsgroups_train['data'])
print(X_train.shape)

(1788, 19009)


In [8]:
params_1 = {
    'min_samples_leaf': [1, 2, 5, 10]
}

# Instantiate classifier
clf = RandomForestClassifier()

# GridSearch
gs1 = GridSearchCV(clf, params_1, cv=5, n_jobs=-1, verbose=1)
gs1.fit(X_train, newsgroups_train['target'])

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:   13.4s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              rando

In [9]:
gs1.best_score_

0.8439572477035506

In [10]:
gs1.best_params_

{'min_samples_leaf': 2}

In [11]:
test_sample = vect.transform(["The new york yankees are the best team in the region."])
test_sample.shape

(1, 19009)

In [12]:
gs1.predict(test_sample)[0]

0

In [13]:
newsgroups_train['target_names'][1]

'rec.sport.hockey'

### GridSearch with BOTH the Vectoizer & Classifier

In [14]:
from sklearn.pipeline import Pipeline

# 1. Create a pipeline with a vectorize and a classifier
# 2. Use Grid Search to optimize the entire pipeline
pipe = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', RandomForestClassifier(random_state=42))
])

params_2 = {
    'vect__stop_words': (None, 'english'),
    'vect__min_df': (2,5),
    'clf__max_depth': (10, None)
}

gs2 = GridSearchCV(pipe, params_2, cv=5, n_jobs=-1, verbose=1)
gs2.fit(newsgroups_train['data'], newsgroups_train['target'])

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:   18.9s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        no

In [15]:
gs2.best_score_

0.858508990188254

In [16]:
gs2.best_params_

{'clf__max_depth': None, 'vect__min_df': 2, 'vect__stop_words': 'english'}

In [17]:
pred = gs2.predict(["The new york yankees are the best team in the region."])
pred

array([0])

In [18]:
newsgroups_train['target_names'][pred[0]]

'rec.sport.baseball'

Advantages to using GS with the Pipe:
* Allows us to make predictions on raw text increasing reproducibility. :)
* Allows us to tune the parameters of the vectorizer along side the classifier. :D 

# Part 1: Describe how an LDA Model works

[Your Guide to Latent Dirichlet Allocation](https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d)

[LDA Topic Modeling](https://lettier.com/projects/lda-topic-modeling/)

[Topic Modeling with Gensim](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/)

In [19]:
# Download spacy model
import spacy.cli
spacy.cli.download("en_core_web_lg")

[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_lg')


In [20]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import spacy
import pyLDAvis
import pyLDAvis.gensim 
import matplotlib.pyplot as plt
%matplotlib inline

In [21]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

  and should_run_async(code)


In [22]:
df = pd.DataFrame({
    'content': newsgroups_train['data'],
    'target': newsgroups_train['target'],
    'target_names': [newsgroups_train['target_names'][i] for i in newsgroups_train['target']]
})
print(df.shape)

(1788, 3)


In [23]:
pd.set_option('display.max_colwidth', 0)
df.sample(3)

Unnamed: 0,content,target,target_names
1744,"Is it just me, or does Bichette look totally lost in the outfield? He \nmisplayed Martinez fly-out into a double against the Expos, misplayed\nAlou's single into a triple (Alou tagged out at 3rd after over-sliding \nthe bag) and now he misplays another out into a 3 run triple...add in his\nwonderful batting average and we have one heck of a player!",0,rec.sport.baseball
510,"\ne,\n\nIf memory serves me well, Alicea hit it, and damn near tied the game.\nTorre obviously knows his players better than you do. \n\n\nSee y'all at the ballyard\nGo Braves\nChop Chop\n\nMichael Mule'\n",0,rec.sport.baseball
1731,"\nAmusing, isn't it? Seems only the SDCNs realize how much baseball is\na *team* game, combining efforts from every player for the win.\n\nConsider the Red Sox game last night. The Sox won 4-3 in the bottom\nof the 13th. Who won the game?\n\n-Clemens pitched a strong nine (?) innings, allowing only two runs.\n-Ryan pitched a couple shutout innings, though he needed some excellent\n defensive plays behind him to do so.\n-Quantrill pitched a couple of innings, gave up the go-ahead run, and\n got credited with the win when the Sox scored two in the bottom of\n the inning.\n\nLooks like a team effort to me! Yet only Quantrill got credit for\nthe win.\n\nHow about the offense?\n-Dawson and Vaughn hit (I think) HRs early in the game. Without either\n one, the Sox would have lost in nine.\n-Quintana led off the 13th with a solid single.\n-Zupcic pinch-ran for Quintana, providing the speed to go from first\n to third when...\n-Cooper ripped a *second* single in the inning.\n-Melvin avoided the DP, getting the run home with a sac fly. Not much of\n a help, but it was something.\n-Scrub Richardson then hit a double, scoring the speedy Cooper all the\n way from first! (Hill's lack of defense helped.)\n\nCooper and Zupcic were credited with runs, Melvin and Richardson were\ncredited with RBIs. But it seems to me that it was Quintana's hit\nthat set up the whole inning! And did Melvin really contribute as\nmuch as Richardson?\n\nFurthermore, people seem to consider RBIs to be more significant than\nruns. Did Melvin contribute more than Cooper? Cooper provided the\ngame-winning baserunner, and moved the tying run to third base with\nonly one out!\n\nAssigning credit based on Runs and RBIs is clearly ridiculous. You\ncan argue that OBP and SLG don't show you who came through in the\nclutch, but R&RBI don't do any better. At least OBP and SLG don't\n*claim* to try to tell you that.\n\nHere's to the Red Sox who contributed to last night's victory.\nAll 20 of them!",0,rec.sport.baseball


In [24]:
# For reference on regex: https://docs.python.org/3/library/re.html

# From 'content' column: 
# 1. Remove new line characters
df['clean_text'] = df['content'].apply(lambda x: re.sub('\s+', ' ', x))

# 2. Remove Emails
df['clean_text'] = df['clean_text'].apply(lambda x: re.sub('From: \S+@\S+', '', x))

# 3. Remove non-alphanumeric characters
df['clean_text'] = df['clean_text'].apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))

# 4. Remove extra whitespace 
df['clean_text'] = df['clean_text'].apply(lambda x: ' '.join(x.split()))

In [25]:
df.sample(3)

Unnamed: 0,content,target,target_names,clean_text
1223,"It was my impression watching the Mets & Rockies that umpires were\ncalling strikes above the belt, too, but not as far up as the letters.\nIt would be nice if this were the case.",0,rec.sport.baseball,It was my impression watching the Mets Rockies that umpires were calling strikes above the belt too but not as far up as the letters It would be nice if this were the case
130,"\n\nGretzky, Lemieux, Gilmour etc do not play the role of checking centreman.\nThey play an offensive role as opposed to a defensive one. If they\nwere used as defensive centres it would be a waste of their offensive\nabilities. \n\nWhen you compare Gretzky et al to Jarvis, Gainey etc you are comparing \napples and oranges. It is like me telling you that Felix Potvin isn't \nvery good because a team would be better if the had Lemieux instead of\nhim. Sure Lemieux is a better player, but he is a different type of\nplayer. For a team to be successful, they need to have all types of\nplayers- this includes defensive forwards.\n\nWhen compared with other defensive forwards, Bob Gainey is the greatest\ndefensive forward ever. He is the player who's talents best suited being\na defensive forward- who completely dominated the game when he played.\n\nMaybe if a more talented player such as Gretzky had decided to waste his\noffensive talents and play defensively, he could have been a better\ndefensive forward, but he wasn't.\n\nBob Gainey is the best defensive forward that has ever played hockey.",1,rec.sport.hockey,Gretzky Lemieux Gilmour etc do not play the role of checking centreman They play an offensive role as opposed to a defensive one If they were used as defensive centres it would be a waste of their offensive abilities When you compare Gretzky et al to Jarvis Gainey etc you are comparing apples and oranges It is like me telling you that Felix Potvin isn t very good because a team would be better if the had Lemieux instead of him Sure Lemieux is a better player but he is a different type of player For a team to be successful they need to have all types of players this includes defensive forwards When compared with other defensive forwards Bob Gainey is the greatest defensive forward ever He is the player who s talents best suited being a defensive forward who completely dominated the game when he played Maybe if a more talented player such as Gretzky had decided to waste his offensive talents and play defensively he could have been a better defensive forward but he wasn t Bob Gainey is the best defensive forward that has ever played hockey
302,"\n\tBeing a proud BU alumnus, I'd like to get a list of BU players in \nthe NHL so I can keep an eye on their progress. A lot of Terriers are\ngraduating this year so I hope to see them soon in the NHL. If somebody\ncould post or send me a list, I'd appreciate it. Please note if the player\ngraduated from here or not.\n",1,rec.sport.hockey,Being a proud BU alumnus I d like to get a list of BU players in the NHL so I can keep an eye on their progress A lot of Terriers are graduating this year so I hope to see them soon in the NHL If somebody could post or send me a list I d appreciate it Please note if the player graduated from here or not


In [26]:
nlp = spacy.load("en_core_web_lg")

In [27]:
# Leverage tqdm for progress_apply
from tqdm import tqdm
tqdm.pandas()

# If you're on macOS, Linux, or python session executed from Windows Subsystem for Linux (WSL)
# conda activate U4-S1-NLP
# pip install pandarallel
#
# from pandarallel import pandarallel
# pandarallel.initialize(progress_bar=True)
#
# df['lemmas'] = df['content'].parallel_apply(get_lemmas)
#
# Ref: https://github.com/nalepae/pandarallel

  from pandas import Panel


In [28]:
# Create 'lemmas' column
def get_lemmas(x):
    lemmas = []
    for token in nlp(x):
        if (token.is_stop!=True) and (token.is_punct!=True):
            lemmas.append(token.lemma_)
    return lemmas

df['lemmas'] = df['clean_text'].progress_apply(get_lemmas)

100%|██████████| 1788/1788 [01:08<00:00, 26.18it/s]


In [29]:
df.head()

Unnamed: 0,content,target,target_names,clean_text,lemmas
0,"\nOh yeah, how come Dino could never take the Caps out of the Patrick\nDivision? He choked up 3 games to 1 last year and got swept away in\nthe second round two years ago. He rarely, if ever, makes it out of the\ndivision.\n\n\nSo are the Islanders, but they can still pull it out. Vancouver has Winnipeg's\n number, so it really doesn't matter.\n\n\n\n Kings always seem to go at least 6 or 7, they never play a four or five\ngame serious. There's a difference between battling it out and pulling it\nout, as I take Calgary to pull it out in 7.",1,rec.sport.hockey,Oh yeah how come Dino could never take the Caps out of the Patrick Division He choked up games to last year and got swept away in the second round two years ago He rarely if ever makes it out of the division So are the Islanders but they can still pull it out Vancouver has Winnipeg s number so it really doesn t matter Kings always seem to go at least or they never play a four or five game serious There s a difference between battling it out and pulling it out as I take Calgary to pull it out in,"[oh, yeah, come, Dino, cap, Patrick, Division, choke, game, year, get, sweep, away, second, round, year, ago, rarely, make, division, Islanders, pull, Vancouver, Winnipeg, s, number, doesn, t, matter, king, play, game, s, difference, battle, pull, Calgary, pull]"
1,"Does anyone know where Billy Taylor is? Richmond or Syracuse? He was taken\nby the Jays in the Rule V draft, but not kept on the roster. Baseball Weekly\nsaid that he was demoted to Syracuse, but a Toronto paper indicated that\nthe Braves took him back. Is there an Atlanta fan, or anyone reading this,\nwho knows?",0,rec.sport.baseball,Does anyone know where Billy Taylor is Richmond or Syracuse He was taken by the Jays in the Rule V draft but not kept on the roster Baseball Weekly said that he was demoted to Syracuse but a Toronto paper indicated that the Braves took him back Is there an Atlanta fan or anyone reading this who knows,"[know, Billy, Taylor, Richmond, Syracuse, take, Jays, Rule, v, draft, keep, roster, Baseball, Weekly, say, demote, Syracuse, Toronto, paper, indicate, Braves, take, Atlanta, fan, read, know]"
2,"\n\n\nWhy are you fooling around with analog for this job? A single chip\nmicro and a crystal will do the job reliably and easily. An 8748 only\ncosts about $5. That and a $1 crystal and you're in business. Embed\nthe whole thing in a foam insulated blanket, power it from a solar cell,\nuse the excess power to heat the assembly during the day and rely\non the insulation to hold the heat during darkness. If you don't want\nto try thermal management, contact someone like ICL and have them cut\nyou a special low temperature crystal. It'll cost at most $20.\n\nIf you use a single chip micro, you're looking at a parts count of \nmaybe 7. A processor, a crystal, two caps on the crystal, a power FET\nto fire the solenoid a flyback diode and a battery. This is fewer parts than \nyou can build an analog timer for and is infinitely more reliable. Add\na power zener diode (for heat) and a solar cell and the parts count\nscreams up to 9.\n\nPD assemblers are available for all the common single chip micros. This\napplication is so trivial you could even look up the op codes in the \nprogrammer's guide and create the binary with a hex editor.\n\nJohn",2,sci.electronics,Why are you fooling around with analog for this job A single chip micro and a crystal will do the job reliably and easily An only costs about That and a crystal and you re in business Embed the whole thing in a foam insulated blanket power it from a solar cell use the excess power to heat the assembly during the day and rely on the insulation to hold the heat during darkness If you don t want to try thermal management contact someone like ICL and have them cut you a special low temperature crystal It ll cost at most If you use a single chip micro you re looking at a parts count of maybe A processor a crystal two caps on the crystal a power FET to fire the solenoid a flyback diode and a battery This is fewer parts than you can build an analog timer for and is infinitely more reliable Add a power zener diode for heat and a solar cell and the parts count screams up to PD assemblers are available for all the common single chip micros This application is so trivial you could even look up the op codes in the programmer s guide and create the binary with a hex editor John,"[fool, analog, job, single, chip, micro, crystal, job, reliably, easily, cost, crystal, business, Embed, thing, foam, insulate, blanket, power, solar, cell, use, excess, power, heat, assembly, day, rely, insulation, hold, heat, darkness, don, t, want, try, thermal, management, contact, like, ICL, cut, special, low, temperature, crystal, will, cost, use, single, chip, micro, look, part, count, maybe, processor, crystal, cap, crystal, power, FET, fire, solenoid, flyback, diode, battery, few, part, build, analog, timer, infinitely, reliable, add, power, zener, diode, heat, solar, cell, part, count, scream, PD, assembler, available, common, single, chip, micro, application, trivial, look, op, code, programmer, s, guide, create, ...]"
3,"\nCan anybody name a player who was 'rushed' to the majors (let's, for\nargument's sake, define ""rushed"" as brought up to the majors for more than\na cup of coffee prior at age 22 or younger, and performing below\nexpectations), whose career was damaged by this rushing? I'm serious; I\ntend to agree with David that bringing the player up sooner is better, but\nI'd like to look at players for whom this theory didn't work, if there are\nany. I'd prefer players within the last 10 years or so, because then I can\nlook up their minor league stats. (It's important to distinguish between\nplayers who legitimately had careers below what their minor league numbers\nwould have projected, as opposed to players who were hyped and failed, but\nactually had careers not out of line with their minor league numbers). \n\nLet's kick it off with an example of a player who was ""rushed"", although\nthere doesn't seem to have been any damage to his career. Jay Bell was\ngiven 135 PAs in the major leagues at age 21, and performed well below what\nyou would expect from his AAA numbers the same season. He got 236 PAs the\nnext year at age 22, and still underperformed. However, the next year, at\nage 24, his performance improved, and he won the everyday shortstop job,\nand has been there ever since. It's really hard for me to see where he\nwould have been better off staying in the minor league (where he was\nperformed quite well in AAA) during this time, rather than being ""rushed"";\nCleveland might have been better off, I suppose, because they might have\nbeen less likely to give up on him.\n\nYes, if you bring a player up early, he's likely going to struggle. But\ndoes that delay the time at which he stops struggling, and starts\nperforming up to expectations?",0,rec.sport.baseball,Can anybody name a player who was rushed to the majors let s for argument s sake define rushed as brought up to the majors for more than a cup of coffee prior at age or younger and performing below expectations whose career was damaged by this rushing I m serious I tend to agree with David that bringing the player up sooner is better but I d like to look at players for whom this theory didn t work if there are any I d prefer players within the last years or so because then I can look up their minor league stats It s important to distinguish between players who legitimately had careers below what their minor league numbers would have projected as opposed to players who were hyped and failed but actually had careers not out of line with their minor league numbers Let s kick it off with an example of a player who was rushed although there doesn t seem to have been any damage to his career Jay Bell was given PAs in the major leagues at age and performed well below what you would expect from his AAA numbers the same season He got PAs the next year at age and still underperformed However the next year at age his performance improved and he won the everyday shortstop job and has been there ever since It s really hard for me to see where he would have been better off staying in the minor league where he was performed quite well in AAA during this time rather than being rushed Cleveland might have been better off I suppose because they might have been less likely to give up on him Yes if you bring a player up early he s likely going to struggle But does that delay the time at which he stops struggling and starts performing up to expectations,"[anybody, player, rush, major, let, s, argument, s, sake, define, rush, bring, major, cup, coffee, prior, age, young, perform, expectation, career, damage, rush, m, tend, agree, David, bring, player, sooner, well, d, like, look, player, theory, didn, t, work, d, prefer, player, year, look, minor, league, stat, s, important, distinguish, player, legitimately, career, minor, league, number, project, oppose, player, hype, fail, actually, career, line, minor, league, number, let, s, kick, example, player, rush, doesn, t, damage, career, Jay, Bell, give, pas, major, league, age, perform, expect, AAA, number, season, get, pas, year, age, underperform, year, age, performance, improve, win, everyday, ...]"
4,"3. With Soderstrom and Roussel, why the hell would the Flyers want to\n pick up an older and slumping Roy?\n\n(BYW, I could come up with a group of players they'd trade for.... but\nthey wouldn't be from the same team.)\n",1,rec.sport.hockey,With Soderstrom and Roussel why the hell would the Flyers want to pick up an older and slumping Roy BYW I could come up with a group of players they d trade for but they wouldn t be from the same team,"[Soderstrom, Roussel, hell, Flyers, want, pick, old, slump, Roy, BYW, come, group, player, d, trade, wouldn, t, team]"


### The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus.

In [30]:
# Create Dictionary
id2word = corpora.Dictionary(df['lemmas'] )

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in df['lemmas']]

In [31]:
# How many words do we have?
len(id2word.keys())

16001

In [32]:
# Let's remove extreme values from the dataset
id2word.filter_extremes(no_below=5, no_above=0.75)

In [33]:
# How many words do we have?
len(id2word.keys())

3358

In [34]:
id2word[300]

'Message'

In [35]:
df['clean_text'][5]

'OK I m sure that this has been asked s of times before but I have wondered since I heard it Where the hell did the nickname of the Habs come from for the Montreal Canadiens'

In [36]:
corpus[5]

[(12, 1),
 (27, 1),
 (169, 1),
 (193, 1),
 (206, 1),
 (213, 1),
 (214, 1),
 (215, 1),
 (216, 1),
 (217, 1),
 (218, 1),
 (219, 1),
 (220, 1),
 (221, 1)]

In [37]:
id2word[252]

'Philadelphia'

In [38]:
id2word[276]

'White'

In [39]:
# Human readable format of corpus (term-frequency)
[(id2word[word_id], word_count) for word_id, word_count in corpus[5]]

[('difference', 1),
 ('second', 1),
 ('Roy', 1),
 ('room', 1),
 ('Baltimore', 1),
 ('Chicago', 1),
 ('Cincinnati', 1),
 ('City', 1),
 ('Colorado', 1),
 ('Cubs', 1),
 ('DODGERS', 1),
 ('Detroit', 1),
 ('Diego', 1),
 ('Dodgers', 1)]

# Part 2: Estimate a LDA Model with Gensim

 ### Train an LDA model

In [47]:
%%time
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20, 
                                            chunksize=100,
                                            passes=10,
                                            per_word_topics=True)

# https://radimrehurek.com/gensim/models/ldamodel.html

IndexError: index 3358 is out of bounds for axis 1 with size 3358

In [58]:
lda_model.save('lda_model.model')

In [48]:
%%time
lda_multicore = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                        id2word=id2word,
                                                        num_topics=20, 
                                                        chunksize=100,
                                                        passes=10,
                                                        per_word_topics=True,
                                                        workers=12)

# https://radimrehurek.com/gensim/models/ldamulticore.html

Process ForkPoolWorker-7:
Process ForkPoolWorker-8:
Traceback (most recent call last):
Process ForkPoolWorker-9:
Traceback (most recent call last):
  File "/opt/anaconda3/envs/U4-S1-NLP/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
Traceback (most recent call last):
  File "/opt/anaconda3/envs/U4-S1-NLP/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/anaconda3/envs/U4-S1-NLP/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/anaconda3/envs/U4-S1-NLP/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/anaconda3/envs/U4-S1-NLP/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/anaconda3/envs/U4-S1-NLP/lib/python3.7/multiprocessing/pool.py", line 105, in worker
    initializer(*initargs)
  File "/opt/anaconda3/envs/U4-S1-NLP/lib/pyth

  File "/opt/anaconda3/envs/U4-S1-NLP/lib/python3.7/multiprocessing/pool.py", line 105, in worker
    initializer(*initargs)
  File "/opt/anaconda3/envs/U4-S1-NLP/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/anaconda3/envs/U4-S1-NLP/lib/python3.7/multiprocessing/pool.py", line 105, in worker
    initializer(*initargs)
  File "/opt/anaconda3/envs/U4-S1-NLP/lib/python3.7/site-packages/gensim/models/ldamulticore.py", line 337, in worker_e_step
    worker_lda.do_estep(chunk)  # TODO: auto-tune alpha?
  File "/opt/anaconda3/envs/U4-S1-NLP/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/anaconda3/envs/U4-S1-NLP/lib/python3.7/site-packages/gensim/models/ldamulticore.py", line 337, in worker_e_step
    worker_lda.do_estep(chunk)  # TODO: auto-tune alpha?
  File "/opt/anaconda3/envs/U4-S1-NLP/lib/python3.7/site-packages/gensim/models/ldamodel.py", line 680, in inference
    ex

  File "/opt/anaconda3/envs/U4-S1-NLP/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/anaconda3/envs/U4-S1-NLP/lib/python3.7/multiprocessing/pool.py", line 105, in worker
    initializer(*initargs)
  File "/opt/anaconda3/envs/U4-S1-NLP/lib/python3.7/site-packages/gensim/models/ldamodel.py", line 680, in inference
    expElogbetad = self.expElogbeta[:, ids]
  File "/opt/anaconda3/envs/U4-S1-NLP/lib/python3.7/site-packages/gensim/models/ldamulticore.py", line 337, in worker_e_step
    worker_lda.do_estep(chunk)  # TODO: auto-tune alpha?
IndexError: index 3390 is out of bounds for axis 1 with size 3358
  File "/opt/anaconda3/envs/U4-S1-NLP/lib/python3.7/site-packages/gensim/models/ldamodel.py", line 742, in do_estep
    gamma, sstats = self.inference(chunk, collect_sstats=True)
  File "/opt/anaconda3/envs/U4-S1-NLP/lib/python3.7/site-packages/gensim/models/ldamodel.py", line 680, in inference
    expElogbetad = self.exp

In [49]:
lda_multicore.save('lda_multicore.model')

In [42]:
from gensim import models
lda_multicore =  models.LdaModel.load('lda_multicore.model')

### View the topics in LDA model

In [43]:
newsgroups_train.target_names

['rec.sport.baseball', 'rec.sport.hockey', 'sci.electronics']

In [44]:
pprint(lda_multicore.print_topics())
doc_lda = lda_multicore[corpus]

[(0,
  '0.009*"people" + 0.007*"Israel" + 0.006*"government" + 0.006*"turkish" + '
  '0.006*"armenian" + 0.006*"Jews" + 0.006*"right" + 0.005*"state" + '
  '0.005*"Armenians" + 0.005*"israeli"'),
 (1,
  '0.005*"antenna" + 0.005*"unit" + 0.005*"sphere" + 0.005*"74" + 0.004*"SI" + '
  '0.004*"iranian" + 0.003*"garage" + 0.003*"Mark" + 0.003*"CPU" + '
  '0.003*"plane"'),
 (2,
  '0.069*"1" + 0.043*"maxaxaxaxaxaxaxaxaxaxaxaxaxaxax" + 0.043*"0" + 0.042*"2" '
  '+ 0.026*"3" + 0.021*"4" + 0.016*"5" + 0.014*"7" + 0.013*"6" + 0.011*"25"'),
 (3,
  '0.029*"not" + 0.014*"say" + 0.013*"go" + 0.012*"gun" + 0.011*"people" + '
  '0.010*"know" + 0.009*"come" + 0.009*"s" + 0.007*"think" + 0.007*"tell"'),
 (4,
  '0.008*"Center" + 0.007*"report" + 0.006*"patient" + 0.006*"1993" + '
  '0.005*"Health" + 0.005*"April" + 0.005*"disease" + 0.005*"child" + '
  '0.005*"increase" + 0.005*"University"'),
 (5,
  '0.005*"irq" + 0.004*"fold" + 0.003*"MO" + 0.003*"overlap" + 0.003*"draw" + '
  '0.003*"line" + 0.003*"sk

In [55]:
lda_multicore[corpus[5]][0]

In [56]:
distro = [lda_multicore[d][0] for d in corpus]

In [57]:
distro[0]

### What is topic Perplexity?
Perplexity is a statistical measure of how well a probability model predicts a sample. As applied to LDA, for a given value of , you estimate the LDA model. Then given the theoretical word distributions represented by the topics, compare that to the actual topic mixtures, or distribution of words in your documents.

### What is topic coherence?
Topic Coherence measures score a single topic by measuring the degree of semantic similarity between high scoring words in the topic. These measurements help distinguish between topics that are semantically interpretable topics and topics that are artifacts of statistical inference.
A set of statements or facts is said to be coherent, if they support each other. Thus, a coherent fact set can be interpreted in a context that covers all or most of the facts. An example of a coherent fact set is “the game is a team sport”, “the game is played with a ball”, “the game demands great physical efforts”

In [52]:
# Compute Perplexity
print('\nPerplexity: ', lda_multicore.log_perplexity(corpus))  # a measure of how good the model is. lower the better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_multicore, 
                                     texts=df['lemmas'], 
                                     dictionary=id2word, 
                                     coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

# Part 3: Interpret LDA results & Select the appropriate number of topics

In [59]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_multicore, corpus, id2word)
pyLDAvis.display(vis)

In [60]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                        id2word=id2word,
                                                        num_topics=num_topics, 
                                                        chunksize=100,
                                                        passes=10,
                                                        per_word_topics=True,
                                                        workers=12)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

In [None]:
%%time
model_list, coherence_values = compute_coherence_values(dictionary=id2word, 
                                                        corpus=corpus, 
                                                        texts=df['lemmas'], 
                                                        start=2, 
                                                        limit=40, 
                                                        step=6)

In [None]:
coherence_values = [0.5054, 0.5332, 0.5452, 0.564, 0.5678, 0.5518, 0.519]

In [None]:
limit=40; start=2; step=6;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(("coherence_values"), loc='best')
plt.show()

In [None]:
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))

In [None]:
# Select the model and print the topics
#optimal_model = model_list[4]
optimal_model =  models.LdaModel.load('optimal_model.model')
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))