# Topic Modeling (Prepare)

On Monday we talked about summarizing your documents using just token counts. Today, we're going to learn about a much more sophisticated approach - learning 'topics' from documents. Topics are a latent structure. They are not directly observable in the data, but we know they're there by reading them.

> **latent**: existing but not yet developed or manifest; hidden or concealed.

## Use Cases
Primary use case: what the hell are your documents about? Who might want to know that in industry - 
* Identifying common themes in customer reviews
* Discovering the needle in a haystack 
* Monitoring communications (Email - State Department) 

## Learning Objectives
*At the end of the lesson you should be able to:*
* Part 0: Warm-Up
* Part 1: Describe how an LDA Model works
* Part 2: Estimate a LDA Model with Gensim
* Part 3: Interpret LDA results & Select the appropriate number of topics

# Part 0: Warm-Up
How do we do a grid search? 

In [25]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
# Load training data
newsgroups_train = fetch_20newsgroups(subset='train', 
                                      remove=('headers', 'footers', 'quotes'))

# Load testing data
newsgroups_test = fetch_20newsgroups(subset='test', 
                                     remove=('headers', 'footers', 'quotes'))

print(f'Training Samples: {len(newsgroups_train.data)}')
print(f'Testing Samples: {len(newsgroups_test.data)}')

Training Samples: 11314
Testing Samples: 7532


In [4]:
newsgroups_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [5]:
newsgroups_train['target_names']

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [6]:
newsgroups_train['data'][1000]

"Anybody seen mouse cursor distortion running the Diamond 1024x768x256 driver?\nSorry, don't know the version of the driver (no indication in the menus) but it's a recently\ndelivered Gateway system.  Am going to try the latest drivers from Diamond BBS but wondered\nif anyone else had seen this.\n\npost or email"

### GridSearch on Just Classifier
* Fit the vectorizer and prepare BEFORE it goes into the gridsearch

In [7]:
# Instantiate vectorizer
vect = TfidfVectorizer()

# Transform the training data
X_train = vect.fit_transform(newsgroups_train['data'])
print(X_train.shape)

(11314, 101631)


In [8]:
params_1 = {
    'min_samples_leaf': [1, 2, 5, 10]
}

# Instantiate classifier
clf = RandomForestClassifier()

# GridSearch
gs1 = GridSearchCV(clf, params_1, cv=5, n_jobs=-1, verbose=1)
gs1.fit(X_train, newsgroups_train['target'])

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:  4.8min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              rando

In [9]:
gs1.best_score_

0.6536154404866977

In [10]:
gs1.best_params_

{'min_samples_leaf': 2}

In [11]:
test_sample = vect.transform(["The new york yankees are the best team in the region."])
test_sample.shape

(1, 101631)

In [12]:
gs1.predict(test_sample)[0]

9

In [13]:
newsgroups_train['target_names'][9]

'rec.sport.baseball'

### GridSearch with BOTH the Vectoizer & Classifier

In [15]:
from sklearn.pipeline import Pipeline

# 1. Create a pipeline with a vectorize and a classifier
# 2. Use Grid Search to optimize the entire pipeline
pipe = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', RandomForestClassifier(random_state=42))
    
])

params_2 = {
    'vect__stop_words': (None, 'english'),# , SPACY_STOP_WORDS, CUSTOM_STOP_WORDS),
    'vect__min_df': (2, 5),
    'clf__max_depth': (10, None)

}

gs2 = GridSearchCV(pipe, params_2, cv=5, n_jobs=-1, verbose=1)
gs2.fit(newsgroups_train['data'], newsgroups_train['target'])

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  4.4min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        no

In [16]:
gs2.best_score_

0.6607746264533867

In [17]:
gs2.best_params_

{'clf__max_depth': None, 'vect__min_df': 2, 'vect__stop_words': 'english'}

In [18]:
pred = gs2.predict(["The new york yankees are the best team in the region."])
pred

array([9])

In [19]:
newsgroups_train['target_names'][pred[0]]

'rec.sport.baseball'

Advantages to using GS with the Pipe:
* Allows us to make predictions on raw text increasing reproducibility. :)
* Allows us to tune the parameters of the vectorizer along side the classifier. :D 

# Part 1: Describe how an LDA Model works

[Your Guide to Latent Dirichlet Allocation](https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d)

[LDA Topic Modeling](https://lettier.com/projects/lda-topic-modeling/)

[Topic Modeling with Gensim](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/)

In [20]:
# Download spacy model
import spacy.cli
spacy.cli.download("en_core_web_lg")

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [21]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import spacy
import pyLDAvis
import pyLDAvis.gensim 
import matplotlib.pyplot as plt
%matplotlib inline

In [22]:
df = pd.DataFrame({
    'content': newsgroups_train['data'],
    'target': newsgroups_train['target'],
    'target_names': [newsgroups_train['target_names'][i] for i in newsgroups_train['target']]
})
print(df.shape)

(11314, 3)


In [23]:
pd.set_option('display.max_colwidth', 0)
df.sample(3)

Unnamed: 0,content,target,target_names
8141,"I recently bought an apparantly complete Expansion Chassis by Mountain\nComputer Inc. It consists of a box with 8 Apple ][+ compatible slots,\npowersupply brick, interface card and ribbon cable to attach it to the computer\nto be expanded. There was also included a small card with empty sockets on top\nand pins on the bottom that looks like it would plug into the ][+ motherboard\nsomewhere after pulling a chip. There's an empty socket also on the interface\ncard and a short 16-pin DIP jumper like the ones used with ][+ language cards.\n \nThis technological marvel came with no docs and I haven't a clue as how to hook\nthis thing up. If anyone has docs and/or users disk of any sort for this I\ncould really use copies of them or at least some help.\n \nI need to know:\n \no How to orient the ribbon cable between the card and the chassis.\no How to attach the short cable from the motherboard to the card\n and if the small card is used.\no The purposes of the various jumper-pins on the card (it has more\n of those than my CMS SCSI card!)\n",12,sci.electronics
9323,"Seems that the Mile-Long Billboard and any other inflateble space\nobject/station or what ever have the same problems. (other than being a little\nbit different than the ""normal"" space ideas, such as trusses and shuttles)\n\nBut also dag and such.. Why not combine the discussion of how and fesibility to\nthe same topic?\n\nI personnelly liek the idea of a billboard in space. But problem. How do you\nservice it? fly a shuttle/DC-1 to near it and then dismount and ""fly"" to it?\nOr what?? or havign a special docking section for shuttle/DC-1 docking?\n\nAlso what if the billboard springs a leak? Self sealing and such??\n\n\nJust thinking (okay rambling)..\n\nAlso why must the now inflated billboard, not be covered in the inside by a\nharder substance (such as a polymer or other agent) and then the now ""hard""\nbillboard would be a now giant docking structure/space dock/station??",14,sci.space
1616,"\n Just because the 68070 can run upto 15Mhz doesn't mean the CD-I\nis running at that speed. I said -> I understand it is a 68070 running\nat something like 7Mhz. I am not sure, but I think I read this a long\ntime ago.\n\n Anyway, still with 15Mhz, you need sprites for a lot of tricks for\nmaking cool awesome games (read psygnosis).\n",1,comp.graphics


In [28]:
# For reference on regex: https://docs.python.org/3/library/re.html

# From 'content' column: 

# 1.. Remove new line characters
df['clean_text']=df['content'].apply(lambda x: re.sub('\s+', ' ', x)) #includes /n.. look into 

# 2.. Remove Emails
df['clean_text']=df['clean_text'].apply(lambda x: re.sub('From: \S+@\S+', '', x))

# 3. Remove non-alphanumeric characters
df['clean_text']=df['clean_text'].apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))



# 4.. Remove extra whitespace 
df['clean_text']=df['clean_text'].apply(lambda x: ' '.join(x.split()))


In [29]:
df.sample(3)

Unnamed: 0,content,target,target_names,clean_text
5988,"^^^^^\n\nWhat an incrediblt sexist remark! Come now, Mike, what ever possessed you to\nmake such a un-PC remark? I hope all women out there reading this are as\nincensed as I am. Remember, WOMAN ARE JUST AS GOOD AS MEN!!!! \n\nWomen stand up for your right to be just as stupid as men. In fact, insist on\nevery oppurtunity to be even more stupid than men! You've got the right, use\nit!\n\nHey, it's a slow afternoon and I really don't want to get back to that\nreport...;)\n\nBTW: mega-smileys for the humor impaired...",16,talk.politics.guns,What an incrediblt sexist remark Come now Mike what ever possessed you to make such a un PC remark I hope all women out there reading this are as incensed as I am Remember WOMAN ARE JUST AS GOOD AS MEN Women stand up for your right to be just as stupid as men In fact insist on every oppurtunity to be even more stupid than men You ve got the right use it Hey it s a slow afternoon and I really don t want to get back to that report BTW mega smileys for the humor impaired
9444,Diamond engagement ring. 14k gold band. 33point diamond. appraised at\n1900 dollars. Will sell for 600 dollars. Appraisal available upon request.\nsend e-mail to yb025@uafhp.uark.edu\n,6,misc.forsale,Diamond engagement ring k gold band point diamond appraised at dollars Will sell for dollars Appraisal available upon request send e mail to yb uafhp uark edu
7217,"Avi,\n For your information, Islam permits freedom of religion - there is\nno compulsion in religion. Does Judaism permit freedom of religion\n(i.e. are non-Jews recognized in Judaism). Just wondering.",17,talk.politics.mideast,Avi For your information Islam permits freedom of religion there is no compulsion in religion Does Judaism permit freedom of religion i e are non Jews recognized in Judaism Just wondering


In [31]:
nlp = spacy.load("en_core_web_lg")

In [32]:
# Leverage tqdm for progress_apply
from tqdm import tqdm
tqdm.pandas()

# If you're on macOS, Linux, or python session executed from Windows Subsystem for Linux (WSL)
# conda activate U4-S1-NLP
# pip install pandarallel
#
# from pandarallel import pandarallel
# pandarallel.initialize(progress_bar=True)
#
# df['lemmas'] = df['content'].parallel_apply(get_lemmas)
#
# Ref: https://github.com/nalepae/pandarallel

In [34]:
# Create 'lemmas' column
def get_lemmas(x):
    lemmas = []
    for token in nlp(x):
        if (token.is_stop!=True) and (token.is_punct!=True):
            lemmas.append(token.lemma_)
    return lemmas

df['lemmas'] = df['clean_text'].progress_apply(get_lemmas)


  0%|                                                 | 0/11314 [00:00<?, ?it/s][A
  0%|                                         | 2/11314 [00:00<17:45, 10.62it/s][A
  0%|                                         | 3/11314 [00:00<18:20, 10.27it/s][A
  0%|                                         | 4/11314 [00:00<23:22,  8.06it/s][A
  0%|                                         | 8/11314 [00:00<18:24, 10.23it/s][A
  0%|                                        | 10/11314 [00:00<18:51,  9.99it/s][A
  0%|                                        | 12/11314 [00:01<18:56,  9.94it/s][A
  0%|                                        | 14/11314 [00:01<19:40,  9.57it/s][A
  0%|                                        | 16/11314 [00:01<18:23, 10.24it/s][A
  0%|                                        | 19/11314 [00:01<19:51,  9.48it/s][A
  0%|                                        | 23/11314 [00:01<15:19, 12.27it/s][A
  0%|                                        | 27/11314 [00:02<12:32, 15.00

  3%|█                                      | 312/11314 [00:25<16:14, 11.29it/s][A
  3%|█                                      | 314/11314 [00:25<15:51, 11.55it/s][A
  3%|█                                      | 316/11314 [00:25<16:46, 10.93it/s][A
  3%|█                                      | 318/11314 [00:25<15:18, 11.98it/s][A
  3%|█                                      | 320/11314 [00:26<13:50, 13.25it/s][A
  3%|█                                      | 322/11314 [00:26<14:46, 12.40it/s][A
  3%|█                                      | 324/11314 [00:26<18:30,  9.90it/s][A
  3%|█                                      | 326/11314 [00:26<17:24, 10.52it/s][A
  3%|█▏                                     | 328/11314 [00:27<21:36,  8.47it/s][A
  3%|█▏                                     | 330/11314 [00:27<18:26,  9.93it/s][A
  3%|█▏                                     | 332/11314 [00:27<17:23, 10.53it/s][A
  3%|█▏                                     | 334/11314 [00:27<15:56, 11.48i

  5%|██                                     | 604/11314 [00:53<11:26, 15.60it/s][A
  5%|██                                     | 608/11314 [00:53<09:58, 17.88it/s][A
  5%|██                                     | 612/11314 [00:53<08:21, 21.35it/s][A
  5%|██                                     | 616/11314 [00:53<08:09, 21.83it/s][A
  5%|██▏                                    | 622/11314 [00:53<06:43, 26.50it/s][A
  6%|██▏                                    | 629/11314 [00:53<05:37, 31.70it/s][A
  6%|██▏                                    | 634/11314 [00:53<05:22, 33.10it/s][A
  6%|██▏                                    | 639/11314 [00:54<06:27, 27.52it/s][A
  6%|██▏                                    | 645/11314 [00:54<05:30, 32.31it/s][A
  6%|██▏                                    | 650/11314 [00:54<06:18, 28.21it/s][A
  6%|██▎                                    | 654/11314 [00:54<06:49, 26.03it/s][A
  6%|██▎                                    | 658/11314 [00:54<08:34, 20.71i

  8%|███                                    | 897/11314 [01:12<18:00,  9.64it/s][A
  8%|███                                    | 899/11314 [01:12<17:07, 10.14it/s][A
  8%|███                                    | 901/11314 [01:13<24:35,  7.06it/s][A
  8%|███                                    | 903/11314 [01:13<20:31,  8.45it/s][A
  8%|███                                    | 906/11314 [01:13<17:09, 10.11it/s][A
  8%|███▏                                   | 908/11314 [01:13<16:34, 10.47it/s][A
  8%|███▏                                   | 910/11314 [01:13<16:21, 10.60it/s][A
  8%|███▏                                   | 912/11314 [01:13<14:15, 12.16it/s][A
  8%|███▏                                   | 914/11314 [01:13<13:09, 13.17it/s][A
  8%|██▉                                  | 916/11314 [01:16<1:15:08,  2.31it/s][A
  8%|███▏                                   | 921/11314 [01:16<58:20,  2.97it/s][A
  8%|███▏                                   | 923/11314 [01:17<44:07,  3.93i

 10%|███▊                                  | 1135/11314 [01:35<46:34,  3.64it/s][A
 10%|███▊                                  | 1137/11314 [01:35<35:28,  4.78it/s][A
 10%|███▊                                  | 1138/11314 [01:36<35:18,  4.80it/s][A
 10%|███▊                                  | 1141/11314 [01:36<26:33,  6.39it/s][A
 10%|███▊                                  | 1143/11314 [01:36<22:13,  7.63it/s][A
 10%|███▊                                  | 1145/11314 [01:36<26:16,  6.45it/s][A
 10%|███▊                                  | 1147/11314 [01:37<31:24,  5.40it/s][A
 10%|███▊                                  | 1148/11314 [01:37<29:28,  5.75it/s][A
 10%|███▊                                  | 1149/11314 [01:37<31:45,  5.33it/s][A
 10%|███▊                                  | 1152/11314 [01:37<24:37,  6.88it/s][A
 10%|███▉                                  | 1154/11314 [01:37<21:37,  7.83it/s][A
 10%|███▉                                  | 1156/11314 [01:38<21:55,  7.72i

 13%|████▉                                 | 1452/11314 [01:56<13:26, 12.23it/s][A
 13%|████▉                                 | 1454/11314 [01:57<14:50, 11.07it/s][A
 13%|████▉                                 | 1457/11314 [01:57<12:30, 13.13it/s][A
 13%|████▉                                 | 1459/11314 [01:57<12:02, 13.63it/s][A
 13%|████▋                               | 1461/11314 [01:59<1:09:27,  2.36it/s][A
 13%|████▉                                 | 1463/11314 [02:00<54:33,  3.01it/s][A
 13%|████▉                                 | 1465/11314 [02:00<45:34,  3.60it/s][A
 13%|████▉                                 | 1467/11314 [02:00<34:25,  4.77it/s][A
 13%|████▉                                 | 1469/11314 [02:00<28:30,  5.76it/s][A
 13%|████▉                                 | 1471/11314 [02:01<27:46,  5.91it/s][A
 13%|████▉                                 | 1473/11314 [02:01<22:13,  7.38it/s][A
 13%|████▉                                 | 1476/11314 [02:01<17:24,  9.42i

 16%|██████                                | 1800/11314 [02:36<04:53, 32.40it/s][A
 16%|██████                                | 1807/11314 [02:36<04:55, 32.21it/s][A
 16%|██████                                | 1813/11314 [02:37<04:14, 37.37it/s][A
 16%|██████                                | 1821/11314 [02:37<03:35, 44.06it/s][A
 16%|██████▏                               | 1828/11314 [02:37<03:18, 47.88it/s][A
 16%|██████▏                               | 1834/11314 [02:37<07:45, 20.36it/s][A
 16%|██████▏                               | 1843/11314 [02:38<05:59, 26.36it/s][A
 16%|██████▏                               | 1849/11314 [02:38<05:25, 29.06it/s][A
 16%|██████▏                               | 1854/11314 [02:38<05:11, 30.41it/s][A
 16%|██████▏                               | 1859/11314 [02:38<06:00, 26.26it/s][A
 16%|██████▎                               | 1863/11314 [02:38<05:37, 28.01it/s][A
 17%|██████▎                               | 1867/11314 [02:38<05:51, 26.84i

 19%|███████▍                              | 2202/11314 [02:57<23:10,  6.55it/s][A
 19%|███████▍                              | 2204/11314 [02:57<18:48,  8.07it/s][A
 19%|███████▍                              | 2206/11314 [02:57<18:21,  8.27it/s][A
 20%|███████▍                              | 2212/11314 [02:57<13:44, 11.03it/s][A
 20%|███████▍                              | 2216/11314 [02:57<11:13, 13.50it/s][A
 20%|███████▍                              | 2219/11314 [02:58<09:47, 15.48it/s][A
 20%|███████▍                              | 2225/11314 [02:58<08:08, 18.59it/s][A
 20%|███████▍                              | 2229/11314 [02:58<07:51, 19.28it/s][A
 20%|███████▍                              | 2232/11314 [02:58<07:16, 20.81it/s][A
 20%|███████▌                              | 2238/11314 [02:58<05:53, 25.66it/s][A
 20%|███████▌                              | 2242/11314 [02:58<05:21, 28.22it/s][A
 20%|███████▌                              | 2246/11314 [02:58<05:03, 29.88i

 23%|████████▋                             | 2593/11314 [03:17<11:20, 12.81it/s][A
 23%|████████▋                             | 2597/11314 [03:17<09:13, 15.76it/s][A
 23%|████████▋                             | 2602/11314 [03:17<07:33, 19.23it/s][A
 23%|████████▊                             | 2610/11314 [03:17<05:54, 24.59it/s][A
 23%|████████▊                             | 2615/11314 [03:17<07:27, 19.42it/s][A
 23%|████████▊                             | 2619/11314 [03:18<06:33, 22.10it/s][A
 23%|████████▊                             | 2623/11314 [03:18<06:37, 21.87it/s][A
 23%|████████▊                             | 2626/11314 [03:18<09:57, 14.55it/s][A
 23%|████████▊                             | 2630/11314 [03:18<09:18, 15.56it/s][A
 23%|████████▊                             | 2633/11314 [03:19<10:26, 13.85it/s][A
 23%|████████▊                             | 2635/11314 [03:19<10:21, 13.96it/s][A
 23%|████████▊                             | 2638/11314 [03:19<09:10, 15.75i

 25%|█████████▍                            | 2804/11314 [03:41<28:05,  5.05it/s][A
 25%|█████████▍                            | 2806/11314 [03:41<21:59,  6.45it/s][A
 25%|█████████▍                            | 2808/11314 [03:41<18:02,  7.86it/s][A
 25%|█████████▍                            | 2810/11314 [03:42<16:29,  8.60it/s][A
 25%|█████████▍                            | 2812/11314 [03:42<14:51,  9.54it/s][A
 25%|█████████▍                            | 2814/11314 [03:42<13:39, 10.37it/s][A
 25%|█████████▍                            | 2816/11314 [03:42<13:54, 10.18it/s][A
 25%|█████████▍                            | 2818/11314 [03:42<14:47,  9.57it/s][A
 25%|█████████▍                            | 2820/11314 [03:43<14:46,  9.58it/s][A
 25%|█████████▍                            | 2822/11314 [03:43<13:11, 10.72it/s][A
 25%|█████████▍                            | 2824/11314 [03:43<16:43,  8.46it/s][A
 25%|█████████▍                            | 2827/11314 [03:44<19:59,  7.08i

 28%|██████████▋                           | 3175/11314 [04:00<06:52, 19.74it/s][A
 28%|██████████▋                           | 3179/11314 [04:01<09:03, 14.96it/s][A
 28%|██████████▋                           | 3182/11314 [04:01<08:00, 16.91it/s][A
 28%|██████████▋                           | 3185/11314 [04:01<07:12, 18.78it/s][A
 28%|██████████▋                           | 3190/11314 [04:01<05:54, 22.90it/s][A
 28%|██████████▋                           | 3194/11314 [04:01<05:24, 25.00it/s][A
 28%|██████████▋                           | 3200/11314 [04:04<24:48,  5.45it/s][A
 28%|██████████▊                           | 3203/11314 [04:05<19:52,  6.80it/s][A
 28%|██████████▊                           | 3206/11314 [04:05<16:18,  8.29it/s][A
 28%|██████████▊                           | 3209/11314 [04:05<12:54, 10.46it/s][A
 28%|██████████▊                           | 3212/11314 [04:05<11:44, 11.50it/s][A
 28%|██████████▊                           | 3214/11314 [04:05<13:53,  9.72i

 31%|███████████▏                        | 3520/11314 [04:30<1:04:02,  2.03it/s][A
 31%|███████████▊                          | 3523/11314 [04:30<46:16,  2.81it/s][A
 31%|███████████▊                          | 3525/11314 [04:31<34:30,  3.76it/s][A
 31%|███████████▊                          | 3527/11314 [04:31<27:36,  4.70it/s][A
 31%|███████████▊                          | 3529/11314 [04:31<22:39,  5.73it/s][A
 31%|███████████▊                          | 3532/11314 [04:31<17:12,  7.54it/s][A
 31%|███████████▉                          | 3537/11314 [04:31<13:04,  9.92it/s][A
 31%|███████████▉                          | 3540/11314 [04:31<10:38, 12.17it/s][A
 31%|███████████▉                          | 3544/11314 [04:31<09:11, 14.08it/s][A
 31%|███████████▉                          | 3547/11314 [04:32<10:05, 12.83it/s][A
 31%|███████████▉                          | 3551/11314 [04:32<08:06, 15.97it/s][A
 31%|███████████▉                          | 3554/11314 [04:32<08:05, 15.98i

 34%|████████████▉                         | 3835/11314 [04:53<17:07,  7.28it/s][A
 34%|████████████▉                         | 3838/11314 [04:53<13:24,  9.29it/s][A
 34%|████████████▉                         | 3840/11314 [04:54<28:19,  4.40it/s][A
 34%|████████████▉                         | 3842/11314 [04:54<22:20,  5.57it/s][A
 34%|████████████▉                         | 3844/11314 [04:54<20:22,  6.11it/s][A
 34%|████████████▉                         | 3846/11314 [04:55<20:32,  6.06it/s][A
 34%|████████████▉                         | 3848/11314 [04:55<16:50,  7.39it/s][A
 34%|████████████▉                         | 3850/11314 [04:55<16:35,  7.50it/s][A
 34%|████████████▉                         | 3852/11314 [04:55<14:48,  8.40it/s][A
 34%|████████████▉                         | 3854/11314 [04:55<14:48,  8.40it/s][A
 34%|████████████▉                         | 3856/11314 [04:56<12:37,  9.84it/s][A
 34%|████████████▉                         | 3858/11314 [04:56<15:05,  8.23i

 37%|█████████████▉                        | 4165/11314 [05:19<11:18, 10.54it/s][A
 37%|█████████████▉                        | 4167/11314 [05:19<14:11,  8.39it/s][A
 37%|█████████████▎                      | 4168/11314 [05:24<2:54:20,  1.46s/it][A
 37%|█████████████▎                      | 4171/11314 [05:24<2:03:22,  1.04s/it][A
 37%|█████████████▎                      | 4173/11314 [05:24<1:28:59,  1.34it/s][A
 37%|█████████████▎                      | 4175/11314 [05:24<1:07:06,  1.77it/s][A
 37%|██████████████                        | 4177/11314 [05:25<56:41,  2.10it/s][A
 37%|██████████████                        | 4178/11314 [05:25<45:39,  2.60it/s][A
 37%|██████████████                        | 4180/11314 [05:25<34:14,  3.47it/s][A
 37%|██████████████                        | 4182/11314 [05:25<26:48,  4.43it/s][A
 37%|██████████████                        | 4184/11314 [05:25<21:33,  5.51it/s][A
 37%|██████████████                        | 4188/11314 [05:25<16:08,  7.36i

 39%|██████████████▊                       | 4420/11314 [05:48<09:44, 11.78it/s][A
 39%|██████████████▊                       | 4422/11314 [05:48<09:08, 12.57it/s][A
 39%|██████████████▊                       | 4424/11314 [05:48<08:42, 13.18it/s][A
 39%|██████████████▊                       | 4427/11314 [05:48<10:53, 10.54it/s][A
 39%|██████████████▉                       | 4430/11314 [05:49<09:19, 12.30it/s][A
 39%|██████████████▉                       | 4435/11314 [05:49<07:26, 15.41it/s][A
 39%|██████████████▉                       | 4438/11314 [05:49<06:43, 17.04it/s][A
 39%|██████████████▉                       | 4441/11314 [05:49<06:01, 19.00it/s][A
 39%|██████████████▉                       | 4444/11314 [05:49<08:38, 13.26it/s][A
 39%|██████████████▉                       | 4446/11314 [05:49<08:16, 13.85it/s][A
 39%|██████████████▉                       | 4448/11314 [05:50<07:48, 14.66it/s][A
 39%|██████████████▉                       | 4450/11314 [05:50<07:36, 15.02i

 42%|███████████████▉                      | 4733/11314 [06:20<04:36, 23.84it/s][A
 42%|███████████████▉                      | 4737/11314 [06:20<04:17, 25.53it/s][A
 42%|███████████████▉                      | 4741/11314 [06:20<03:57, 27.72it/s][A
 42%|███████████████▉                      | 4745/11314 [06:21<04:01, 27.19it/s][A
 42%|███████████████▉                      | 4749/11314 [06:21<03:59, 27.38it/s][A
 42%|███████████████▉                      | 4754/11314 [06:21<03:28, 31.43it/s][A
 42%|███████████████▉                      | 4758/11314 [06:21<04:06, 26.57it/s][A
 42%|███████████████▉                      | 4762/11314 [06:21<04:22, 24.94it/s][A
 42%|████████████████                      | 4765/11314 [06:21<04:23, 24.85it/s][A
 42%|████████████████                      | 4769/11314 [06:21<04:15, 25.59it/s][A
 42%|████████████████                      | 4774/11314 [06:28<45:44,  2.38it/s][A
 42%|████████████████                      | 4776/11314 [06:28<34:18,  3.18i

 44%|████████████████▊                     | 5001/11314 [06:49<09:32, 11.03it/s][A
 44%|████████████████▊                     | 5004/11314 [06:49<08:46, 11.99it/s][A
 44%|████████████████▊                     | 5006/11314 [06:49<07:58, 13.18it/s][A
 44%|████████████████▊                     | 5009/11314 [06:49<07:27, 14.09it/s][A
 44%|████████████████▊                     | 5012/11314 [06:49<07:07, 14.73it/s][A
 44%|████████████████▊                     | 5014/11314 [06:50<06:58, 15.04it/s][A
 44%|████████████████▊                     | 5016/11314 [06:50<06:58, 15.04it/s][A
 44%|████████████████▊                     | 5018/11314 [06:50<06:47, 15.45it/s][A
 44%|████████████████▊                     | 5020/11314 [06:50<06:44, 15.56it/s][A
 44%|████████████████▊                     | 5022/11314 [06:50<06:42, 15.63it/s][A
 44%|████████████████▉                     | 5025/11314 [06:50<06:55, 15.15it/s][A
 44%|████████████████▉                     | 5027/11314 [06:51<10:13, 10.25i

 46%|█████████████████▍                    | 5207/11314 [07:16<38:26,  2.65it/s][A
 46%|█████████████████▍                    | 5209/11314 [07:16<31:07,  3.27it/s][A
 46%|█████████████████▍                    | 5210/11314 [07:16<26:48,  3.80it/s][A
 46%|█████████████████▌                    | 5211/11314 [07:17<23:35,  4.31it/s][A
 46%|█████████████████▌                    | 5212/11314 [07:17<20:41,  4.92it/s][A
 46%|█████████████████▌                    | 5213/11314 [07:17<18:32,  5.49it/s][A
 46%|█████████████████▌                    | 5214/11314 [07:17<18:16,  5.56it/s][A
 46%|█████████████████▌                    | 5215/11314 [07:18<28:03,  3.62it/s][A
 46%|█████████████████▌                    | 5217/11314 [07:18<22:47,  4.46it/s][A
 46%|█████████████████▌                    | 5218/11314 [07:18<22:16,  4.56it/s][A
 46%|█████████████████▌                    | 5220/11314 [07:18<17:59,  5.64it/s][A
 46%|█████████████████▌                    | 5221/11314 [07:18<16:08,  6.29i

 48%|██████████████████                    | 5384/11314 [07:42<13:12,  7.48it/s][A
 48%|██████████████████                    | 5385/11314 [07:42<13:14,  7.46it/s][A
 48%|██████████████████                    | 5388/11314 [07:42<10:28,  9.43it/s][A
 48%|██████████████████                    | 5390/11314 [07:42<09:07, 10.82it/s][A
 48%|██████████████████                    | 5392/11314 [07:42<09:19, 10.58it/s][A
 48%|██████████████████                    | 5394/11314 [07:43<09:33, 10.32it/s][A
 48%|██████████████████                    | 5396/11314 [07:43<12:07,  8.14it/s][A
 48%|██████████████████▏                   | 5397/11314 [07:43<14:38,  6.73it/s][A
 48%|██████████████████▏                   | 5399/11314 [07:43<12:24,  7.94it/s][A
 48%|██████████████████▏                   | 5402/11314 [07:43<09:56,  9.91it/s][A
 48%|██████████████████▏                   | 5404/11314 [07:44<09:15, 10.64it/s][A
 48%|██████████████████▏                   | 5406/11314 [07:44<11:01,  8.93i

 49%|██████████████████▊                   | 5583/11314 [08:05<07:19, 13.03it/s][A
 49%|██████████████████▊                   | 5585/11314 [08:05<06:58, 13.69it/s][A
 49%|██████████████████▊                   | 5587/11314 [08:05<10:57,  8.72it/s][A
 49%|██████████████████▊                   | 5589/11314 [08:06<23:05,  4.13it/s][A
 49%|██████████████████▊                   | 5590/11314 [08:06<19:01,  5.01it/s][A
 49%|██████████████████▊                   | 5592/11314 [08:06<15:05,  6.32it/s][A
 49%|██████████████████▊                   | 5594/11314 [08:07<12:42,  7.50it/s][A
 49%|██████████████████▊                   | 5596/11314 [08:07<10:57,  8.70it/s][A
 49%|██████████████████▊                   | 5598/11314 [08:07<09:23, 10.14it/s][A
 49%|██████████████████▊                   | 5600/11314 [08:07<08:34, 11.11it/s][A
 50%|██████████████████▊                   | 5602/11314 [08:07<09:01, 10.55it/s][A
 50%|██████████████████▊                   | 5604/11314 [08:08<13:48,  6.90i

 51%|███████████████████▍                  | 5776/11314 [08:27<11:01,  8.37it/s][A
 51%|███████████████████▍                  | 5778/11314 [08:27<09:25,  9.80it/s][A
 51%|███████████████████▍                  | 5780/11314 [08:27<09:56,  9.28it/s][A
 51%|███████████████████▍                  | 5782/11314 [08:28<10:32,  8.74it/s][A
 51%|███████████████████▍                  | 5783/11314 [08:28<14:47,  6.23it/s][A
 51%|███████████████████▍                  | 5785/11314 [08:28<11:46,  7.83it/s][A
 51%|███████████████████▍                  | 5787/11314 [08:28<09:44,  9.46it/s][A
 51%|███████████████████▍                  | 5789/11314 [08:28<09:58,  9.23it/s][A
 51%|███████████████████▍                  | 5791/11314 [08:29<09:55,  9.28it/s][A
 51%|███████████████████▍                  | 5795/11314 [08:29<08:30, 10.81it/s][A
 51%|███████████████████▍                  | 5797/11314 [08:29<08:13, 11.17it/s][A
 51%|███████████████████▍                  | 5800/11314 [08:29<06:56, 13.25i

 53%|████████████████████                  | 5963/11314 [08:52<16:16,  5.48it/s][A
 53%|████████████████████                  | 5965/11314 [08:52<13:53,  6.42it/s][A
 53%|████████████████████                  | 5967/11314 [08:53<11:16,  7.90it/s][A
 53%|████████████████████                  | 5969/11314 [08:53<10:19,  8.63it/s][A
 53%|████████████████████                  | 5971/11314 [08:53<08:40, 10.27it/s][A
 53%|████████████████████                  | 5973/11314 [08:53<08:07, 10.95it/s][A
 53%|████████████████████                  | 5975/11314 [08:53<07:11, 12.36it/s][A
 53%|████████████████████                  | 5977/11314 [08:53<09:48,  9.07it/s][A
 53%|████████████████████                  | 5979/11314 [08:54<09:52,  9.01it/s][A
 53%|████████████████████                  | 5981/11314 [08:54<08:55,  9.97it/s][A
 53%|████████████████████                  | 5983/11314 [08:54<07:46, 11.43it/s][A
 53%|████████████████████                  | 5985/11314 [08:54<07:13, 12.29i

 54%|████████████████████▋                 | 6164/11314 [09:14<14:04,  6.10it/s][A
 54%|████████████████████▋                 | 6166/11314 [09:14<12:01,  7.13it/s][A
 55%|████████████████████▋                 | 6168/11314 [09:14<10:03,  8.52it/s][A
 55%|████████████████████▋                 | 6171/11314 [09:15<08:20, 10.28it/s][A
 55%|████████████████████▋                 | 6173/11314 [09:15<08:00, 10.69it/s][A
 55%|████████████████████▋                 | 6175/11314 [09:15<14:12,  6.03it/s][A
 55%|████████████████████▋                 | 6177/11314 [09:16<12:02,  7.11it/s][A
 55%|████████████████████▊                 | 6179/11314 [09:16<10:30,  8.15it/s][A
 55%|████████████████████▊                 | 6181/11314 [09:16<09:23,  9.11it/s][A
 55%|████████████████████▊                 | 6183/11314 [09:16<08:11, 10.45it/s][A
 55%|████████████████████▊                 | 6185/11314 [09:16<07:20, 11.64it/s][A
 55%|████████████████████▊                 | 6187/11314 [09:16<06:39, 12.84i

 56%|█████████████████████▎                | 6357/11314 [09:42<06:18, 13.09it/s][A
 56%|█████████████████████▎                | 6359/11314 [09:42<06:34, 12.56it/s][A
 56%|█████████████████████▎                | 6361/11314 [09:42<06:08, 13.45it/s][A
 56%|█████████████████████▎                | 6363/11314 [09:42<05:50, 14.13it/s][A
 56%|█████████████████████▍                | 6365/11314 [09:42<07:06, 11.61it/s][A
 56%|█████████████████████▍                | 6367/11314 [09:43<08:18,  9.91it/s][A
 56%|█████████████████████▍                | 6369/11314 [09:43<10:53,  7.57it/s][A
 56%|█████████████████████▍                | 6370/11314 [09:43<14:40,  5.61it/s][A
 56%|█████████████████████▍                | 6371/11314 [09:43<13:33,  6.08it/s][A
 56%|█████████████████████▍                | 6373/11314 [09:44<10:44,  7.67it/s][A
 56%|█████████████████████▍                | 6375/11314 [09:44<09:22,  8.79it/s][A
 56%|█████████████████████▍                | 6377/11314 [09:44<07:57, 10.35i

 58%|██████████████████████                | 6556/11314 [10:05<10:36,  7.48it/s][A
 58%|██████████████████████                | 6558/11314 [10:06<10:52,  7.28it/s][A
 58%|██████████████████████                | 6560/11314 [10:06<09:02,  8.76it/s][A
 58%|██████████████████████                | 6562/11314 [10:06<08:57,  8.85it/s][A
 58%|██████████████████████                | 6564/11314 [10:06<08:10,  9.67it/s][A
 58%|██████████████████████                | 6566/11314 [10:06<07:25, 10.66it/s][A
 58%|██████████████████████                | 6568/11314 [10:07<12:41,  6.23it/s][A
 58%|██████████████████████                | 6569/11314 [10:07<11:34,  6.83it/s][A
 58%|██████████████████████                | 6571/11314 [10:07<09:42,  8.15it/s][A
 58%|██████████████████████                | 6573/11314 [10:07<08:40,  9.12it/s][A
 58%|██████████████████████                | 6575/11314 [10:08<09:04,  8.70it/s][A
 58%|██████████████████████                | 6577/11314 [10:08<07:48, 10.11i

 60%|██████████████████████▋               | 6770/11314 [10:30<09:48,  7.72it/s][A
 60%|██████████████████████▋               | 6772/11314 [10:30<08:29,  8.92it/s][A
 60%|██████████████████████▊               | 6774/11314 [10:30<07:06, 10.64it/s][A
 60%|██████████████████████▊               | 6776/11314 [10:31<10:11,  7.42it/s][A
 60%|██████████████████████▊               | 6778/11314 [10:31<08:38,  8.76it/s][A
 60%|██████████████████████▊               | 6780/11314 [10:31<07:24, 10.21it/s][A
 60%|██████████████████████▊               | 6782/11314 [10:31<06:21, 11.87it/s][A
 60%|██████████████████████▊               | 6784/11314 [10:31<06:08, 12.30it/s][A
 60%|██████████████████████▊               | 6786/11314 [10:31<06:49, 11.05it/s][A
 60%|██████████████████████▊               | 6788/11314 [10:32<09:13,  8.17it/s][A
 60%|██████████████████████▊               | 6790/11314 [10:32<08:28,  8.90it/s][A
 60%|██████████████████████▊               | 6792/11314 [10:32<07:22, 10.22i

 62%|███████████████████████▍              | 6974/11314 [10:57<06:33, 11.04it/s][A
 62%|███████████████████████▍              | 6977/11314 [10:57<05:45, 12.56it/s][A
 62%|███████████████████████▍              | 6979/11314 [10:57<06:36, 10.93it/s][A
 62%|███████████████████████▍              | 6981/11314 [10:58<06:51, 10.52it/s][A
 62%|███████████████████████▍              | 6983/11314 [10:58<06:03, 11.92it/s][A
 62%|███████████████████████▍              | 6985/11314 [10:58<05:45, 12.52it/s][A
 62%|███████████████████████▍              | 6987/11314 [10:58<08:55,  8.07it/s][A
 62%|███████████████████████▍              | 6989/11314 [11:03<54:29,  1.32it/s][A
 62%|███████████████████████▍              | 6992/11314 [11:03<39:13,  1.84it/s][A
 62%|███████████████████████▍              | 6994/11314 [11:03<29:20,  2.45it/s][A
 62%|███████████████████████▍              | 6996/11314 [11:03<21:49,  3.30it/s][A
 62%|███████████████████████▌              | 6998/11314 [11:03<16:32,  4.35i

 64%|████████████████████████▏             | 7185/11314 [11:29<08:32,  8.06it/s][A
 64%|████████████████████████▏             | 7187/11314 [11:29<08:12,  8.39it/s][A
 64%|████████████████████████▏             | 7189/11314 [11:29<07:12,  9.53it/s][A
 64%|████████████████████████▏             | 7191/11314 [11:30<08:53,  7.72it/s][A
 64%|████████████████████████▏             | 7193/11314 [11:30<07:24,  9.28it/s][A
 64%|████████████████████████▏             | 7195/11314 [11:30<07:29,  9.16it/s][A
 64%|████████████████████████▏             | 7197/11314 [11:31<09:01,  7.61it/s][A
 64%|████████████████████████▏             | 7198/11314 [11:31<10:05,  6.79it/s][A
 64%|████████████████████████▏             | 7200/11314 [11:31<08:07,  8.44it/s][A
 64%|████████████████████████▏             | 7202/11314 [11:31<09:12,  7.45it/s][A
 64%|████████████████████████▏             | 7204/11314 [11:31<07:45,  8.82it/s][A
 64%|████████████████████████▏             | 7206/11314 [11:32<07:29,  9.14i

 65%|████████████████████████▊             | 7396/11314 [11:51<05:16, 12.36it/s][A
 65%|████████████████████████▊             | 7398/11314 [11:52<05:01, 12.99it/s][A
 65%|████████████████████████▊             | 7400/11314 [11:52<04:50, 13.47it/s][A
 65%|████████████████████████▊             | 7403/11314 [11:52<04:30, 14.48it/s][A
 65%|████████████████████████▊             | 7405/11314 [11:52<04:50, 13.46it/s][A
 65%|████████████████████████▉             | 7407/11314 [11:52<04:32, 14.35it/s][A
 65%|████████████████████████▉             | 7409/11314 [11:52<04:28, 14.52it/s][A
 66%|████████████████████████▉             | 7411/11314 [11:53<07:15,  8.97it/s][A
 66%|████████████████████████▉             | 7413/11314 [11:53<07:44,  8.40it/s][A
 66%|████████████████████████▉             | 7415/11314 [11:53<06:59,  9.30it/s][A
 66%|████████████████████████▉             | 7417/11314 [11:53<06:03, 10.72it/s][A
 66%|████████████████████████▉             | 7419/11314 [11:53<05:47, 11.22i

 67%|█████████████████████████▍            | 7588/11314 [12:17<13:01,  4.76it/s][A
 67%|█████████████████████████▍            | 7591/11314 [12:17<10:08,  6.12it/s][A
 67%|█████████████████████████▌            | 7593/11314 [12:17<08:38,  7.18it/s][A
 67%|█████████████████████████▌            | 7595/11314 [12:17<07:42,  8.04it/s][A
 67%|█████████████████████████▌            | 7597/11314 [12:17<07:30,  8.25it/s][A
 67%|█████████████████████████▌            | 7599/11314 [12:17<06:36,  9.36it/s][A
 67%|█████████████████████████▌            | 7601/11314 [12:18<05:34, 11.10it/s][A
 67%|█████████████████████████▌            | 7603/11314 [12:18<06:05, 10.15it/s][A
 67%|█████████████████████████▌            | 7605/11314 [12:18<05:16, 11.73it/s][A
 67%|█████████████████████████▌            | 7608/11314 [12:18<04:36, 13.38it/s][A
 67%|█████████████████████████▌            | 7611/11314 [12:18<04:04, 15.15it/s][A
 67%|█████████████████████████▌            | 7613/11314 [12:18<04:26, 13.87i

 69%|██████████████████████████▏           | 7795/11314 [12:35<04:48, 12.19it/s][A
 69%|██████████████████████████▏           | 7797/11314 [12:36<05:02, 11.63it/s][A
 69%|██████████████████████████▏           | 7799/11314 [12:36<06:24,  9.13it/s][A
 69%|██████████████████████████▏           | 7801/11314 [12:36<05:29, 10.66it/s][A
 69%|██████████████████████████▏           | 7803/11314 [12:36<05:19, 10.98it/s][A
 69%|██████████████████████████▏           | 7805/11314 [12:36<04:38, 12.62it/s][A
 69%|██████████████████████████▏           | 7808/11314 [12:36<04:14, 13.76it/s][A
 69%|██████████████████████████▏           | 7810/11314 [12:37<04:18, 13.58it/s][A
 69%|██████████████████████████▏           | 7812/11314 [12:37<04:24, 13.25it/s][A
 69%|██████████████████████████▏           | 7814/11314 [12:37<03:57, 14.72it/s][A
 69%|██████████████████████████▎           | 7817/11314 [12:37<03:43, 15.62it/s][A
 69%|██████████████████████████▎           | 7820/11314 [12:37<03:19, 17.53i

 71%|██████████████████████████▊           | 7991/11314 [13:00<06:23,  8.67it/s][A
 71%|██████████████████████████▊           | 7993/11314 [13:01<06:48,  8.13it/s][A
 71%|██████████████████████████▊           | 7995/11314 [13:01<05:58,  9.25it/s][A
 71%|██████████████████████████▊           | 7997/11314 [13:01<05:04, 10.90it/s][A
 71%|██████████████████████████▊           | 7999/11314 [13:01<04:46, 11.56it/s][A
 71%|██████████████████████████▊           | 8001/11314 [13:01<04:24, 12.54it/s][A
 71%|██████████████████████████▉           | 8004/11314 [13:01<03:52, 14.24it/s][A
 71%|██████████████████████████▉           | 8006/11314 [13:02<03:53, 14.19it/s][A
 71%|██████████████████████████▉           | 8008/11314 [13:02<04:05, 13.47it/s][A
 71%|██████████████████████████▉           | 8012/11314 [13:02<03:25, 16.05it/s][A
 71%|██████████████████████████▉           | 8015/11314 [13:02<03:06, 17.71it/s][A
 71%|██████████████████████████▉           | 8018/11314 [13:02<03:01, 18.20i

 73%|███████████████████████████▌          | 8223/11314 [13:28<22:47,  2.26it/s][A
 73%|███████████████████████████▋          | 8225/11314 [13:29<17:24,  2.96it/s][A
 73%|███████████████████████████▋          | 8227/11314 [13:29<17:01,  3.02it/s][A
 73%|███████████████████████████▋          | 8230/11314 [13:29<12:29,  4.11it/s][A
 73%|███████████████████████████▋          | 8233/11314 [13:30<09:17,  5.52it/s][A
 73%|███████████████████████████▋          | 8237/11314 [13:30<07:10,  7.14it/s][A
 73%|███████████████████████████▋          | 8239/11314 [13:30<05:52,  8.73it/s][A
 73%|███████████████████████████▋          | 8241/11314 [13:30<05:01, 10.18it/s][A
 73%|███████████████████████████▋          | 8245/11314 [13:30<03:59, 12.83it/s][A
 73%|███████████████████████████▋          | 8248/11314 [13:30<03:29, 14.61it/s][A
 73%|███████████████████████████▋          | 8251/11314 [13:30<03:16, 15.59it/s][A
 73%|███████████████████████████▋          | 8254/11314 [13:30<03:00, 16.93i

 75%|████████████████████████████▍         | 8467/11314 [13:46<03:45, 12.61it/s][A
 75%|████████████████████████████▍         | 8469/11314 [13:47<05:25,  8.74it/s][A
 75%|████████████████████████████▍         | 8471/11314 [13:47<04:43, 10.04it/s][A
 75%|████████████████████████████▍         | 8473/11314 [13:47<04:18, 10.99it/s][A
 75%|████████████████████████████▍         | 8475/11314 [13:47<04:36, 10.25it/s][A
 75%|████████████████████████████▍         | 8477/11314 [13:47<06:17,  7.52it/s][A
 75%|████████████████████████████▍         | 8478/11314 [13:48<06:07,  7.71it/s][A
 75%|████████████████████████████▍         | 8479/11314 [13:48<06:21,  7.44it/s][A
 75%|████████████████████████████▍         | 8482/11314 [13:48<05:08,  9.18it/s][A
 75%|████████████████████████████▍         | 8484/11314 [13:48<04:22, 10.79it/s][A
 75%|████████████████████████████▌         | 8488/11314 [13:48<03:27, 13.61it/s][A
 75%|████████████████████████████▌         | 8491/11314 [13:48<03:14, 14.55i

 77%|█████████████████████████████▏        | 8695/11314 [14:17<03:55, 11.13it/s][A
 77%|█████████████████████████████▏        | 8698/11314 [14:17<03:10, 13.71it/s][A
 77%|█████████████████████████████▏        | 8701/11314 [14:18<03:07, 13.97it/s][A
 77%|█████████████████████████████▏        | 8703/11314 [14:18<03:14, 13.41it/s][A
 77%|█████████████████████████████▏        | 8707/11314 [14:18<02:42, 16.02it/s][A
 77%|█████████████████████████████▎        | 8710/11314 [14:18<02:32, 17.13it/s][A
 77%|█████████████████████████████▎        | 8713/11314 [14:18<02:26, 17.69it/s][A
 77%|█████████████████████████████▎        | 8716/11314 [14:19<03:59, 10.84it/s][A
 77%|█████████████████████████████▎        | 8718/11314 [14:19<03:35, 12.05it/s][A
 77%|█████████████████████████████▎        | 8720/11314 [14:19<03:13, 13.38it/s][A
 77%|█████████████████████████████▎        | 8724/11314 [14:19<02:38, 16.29it/s][A
 77%|█████████████████████████████▎        | 8727/11314 [14:19<02:22, 18.14i

 79%|██████████████████████████████        | 8969/11314 [14:36<02:18, 16.97it/s][A
 79%|██████████████████████████████▏       | 8973/11314 [14:36<02:05, 18.68it/s][A
 79%|██████████████████████████████▏       | 8976/11314 [14:36<01:56, 20.12it/s][A
 79%|██████████████████████████████▏       | 8979/11314 [14:36<01:44, 22.27it/s][A
 79%|██████████████████████████████▏       | 8982/11314 [14:37<02:31, 15.35it/s][A
 79%|██████████████████████████████▏       | 8984/11314 [14:37<02:27, 15.78it/s][A
 79%|██████████████████████████████▏       | 8986/11314 [14:37<05:18,  7.31it/s][A
 79%|██████████████████████████████▏       | 8988/11314 [14:38<04:26,  8.74it/s][A
 79%|██████████████████████████████▏       | 8990/11314 [14:38<04:02,  9.58it/s][A
 79%|██████████████████████████████▏       | 8993/11314 [14:38<03:24, 11.38it/s][A
 80%|██████████████████████████████▏       | 8995/11314 [14:38<04:33,  8.48it/s][A
 80%|██████████████████████████████▏       | 8998/11314 [14:38<03:36, 10.72i

 81%|██████████████████████████████▉       | 9208/11314 [15:08<03:00, 11.64it/s][A
 81%|██████████████████████████████▉       | 9210/11314 [15:08<02:41, 13.04it/s][A
 81%|██████████████████████████████▉       | 9213/11314 [15:08<02:27, 14.22it/s][A
 81%|██████████████████████████████▉       | 9216/11314 [15:08<02:09, 16.21it/s][A
 81%|██████████████████████████████▉       | 9220/11314 [15:08<01:54, 18.27it/s][A
 82%|██████████████████████████████▉       | 9223/11314 [15:09<02:22, 14.65it/s][A
 82%|██████████████████████████████▉       | 9225/11314 [15:09<02:30, 13.89it/s][A
 82%|██████████████████████████████▉       | 9227/11314 [15:09<02:20, 14.89it/s][A
 82%|██████████████████████████████▉       | 9229/11314 [15:09<02:09, 16.05it/s][A
 82%|███████████████████████████████       | 9233/11314 [15:09<01:50, 18.86it/s][A
 82%|███████████████████████████████       | 9236/11314 [15:09<02:05, 16.56it/s][A
 82%|███████████████████████████████       | 9238/11314 [15:09<02:01, 17.09i

 83%|███████████████████████████████▌      | 9415/11314 [15:30<04:27,  7.11it/s][A
 83%|███████████████████████████████▋      | 9417/11314 [15:30<03:47,  8.34it/s][A
 83%|███████████████████████████████▋      | 9419/11314 [15:30<03:21,  9.41it/s][A
 83%|███████████████████████████████▋      | 9421/11314 [15:30<03:15,  9.66it/s][A
 83%|███████████████████████████████▋      | 9423/11314 [15:30<02:45, 11.43it/s][A
 83%|███████████████████████████████▋      | 9425/11314 [15:32<10:01,  3.14it/s][A
 83%|███████████████████████████████▋      | 9427/11314 [15:32<07:35,  4.14it/s][A
 83%|███████████████████████████████▋      | 9429/11314 [15:32<05:49,  5.39it/s][A
 83%|███████████████████████████████▋      | 9432/11314 [15:32<04:37,  6.78it/s][A
 83%|███████████████████████████████▋      | 9436/11314 [15:32<03:29,  8.94it/s][A
 83%|███████████████████████████████▋      | 9439/11314 [15:33<03:20,  9.33it/s][A
 83%|███████████████████████████████▋      | 9443/11314 [15:33<02:44, 11.40i

 85%|████████████████████████████████▏     | 9599/11314 [15:51<04:03,  7.05it/s][A
 85%|████████████████████████████████▏     | 9600/11314 [15:51<04:04,  7.02it/s][A
 85%|████████████████████████████████▏     | 9602/11314 [15:51<03:33,  8.01it/s][A
 85%|████████████████████████████████▎     | 9604/11314 [15:51<03:24,  8.37it/s][A
 85%|████████████████████████████████▎     | 9606/11314 [15:51<02:57,  9.61it/s][A
 85%|████████████████████████████████▎     | 9608/11314 [15:51<03:01,  9.39it/s][A
 85%|████████████████████████████████▎     | 9610/11314 [15:52<02:32, 11.17it/s][A
 85%|████████████████████████████████▎     | 9612/11314 [15:52<02:23, 11.87it/s][A
 85%|████████████████████████████████▎     | 9614/11314 [15:52<02:17, 12.36it/s][A
 85%|████████████████████████████████▎     | 9616/11314 [15:52<02:12, 12.84it/s][A
 85%|████████████████████████████████▎     | 9618/11314 [15:52<02:05, 13.51it/s][A
 85%|████████████████████████████████▎     | 9620/11314 [15:52<02:35, 10.92i

 87%|█████████████████████████████████     | 9854/11314 [16:19<01:48, 13.46it/s][A
 87%|█████████████████████████████████     | 9857/11314 [16:19<01:32, 15.82it/s][A
 87%|█████████████████████████████████     | 9860/11314 [16:19<01:21, 17.86it/s][A
 87%|█████████████████████████████████▏    | 9863/11314 [16:19<01:41, 14.28it/s][A
 87%|█████████████████████████████████▏    | 9865/11314 [16:20<01:48, 13.38it/s][A
 87%|█████████████████████████████████▏    | 9867/11314 [16:20<01:40, 14.46it/s][A
 87%|█████████████████████████████████▏    | 9869/11314 [16:20<01:37, 14.83it/s][A
 87%|█████████████████████████████████▏    | 9871/11314 [16:21<03:48,  6.33it/s][A
 87%|█████████████████████████████████▏    | 9873/11314 [16:21<03:39,  6.55it/s][A
 87%|█████████████████████████████████▏    | 9875/11314 [16:21<03:22,  7.09it/s][A
 87%|█████████████████████████████████▏    | 9877/11314 [16:21<03:03,  7.83it/s][A
 87%|█████████████████████████████████▏    | 9881/11314 [16:21<02:32,  9.43i

 89%|█████████████████████████████████    | 10098/11314 [16:38<02:02,  9.92it/s][A
 89%|█████████████████████████████████    | 10100/11314 [16:39<02:55,  6.93it/s][A
 89%|█████████████████████████████████    | 10101/11314 [16:39<02:48,  7.21it/s][A
 89%|█████████████████████████████████    | 10104/11314 [16:39<02:14,  9.01it/s][A
 89%|█████████████████████████████████    | 10106/11314 [16:39<01:54, 10.57it/s][A
 89%|█████████████████████████████████    | 10108/11314 [16:39<01:58, 10.15it/s][A
 89%|█████████████████████████████████    | 10110/11314 [16:39<01:43, 11.61it/s][A
 89%|█████████████████████████████████    | 10112/11314 [16:39<01:30, 13.22it/s][A
 89%|█████████████████████████████████    | 10114/11314 [16:40<01:54, 10.48it/s][A
 89%|█████████████████████████████████    | 10116/11314 [16:40<01:49, 10.91it/s][A
 89%|█████████████████████████████████    | 10118/11314 [16:40<01:37, 12.22it/s][A
 89%|█████████████████████████████████    | 10121/11314 [16:40<01:23, 14.25i

 91%|█████████████████████████████████▊   | 10322/11314 [17:04<01:08, 14.49it/s][A
 91%|█████████████████████████████████▊   | 10324/11314 [17:05<01:06, 14.90it/s][A
 91%|█████████████████████████████████▊   | 10326/11314 [17:05<01:01, 15.95it/s][A
 91%|█████████████████████████████████▊   | 10329/11314 [17:05<00:55, 17.74it/s][A
 91%|█████████████████████████████████▊   | 10331/11314 [17:05<01:41,  9.71it/s][A
 91%|█████████████████████████████████▊   | 10333/11314 [17:05<01:42,  9.53it/s][A
 91%|█████████████████████████████████▊   | 10335/11314 [17:06<01:26, 11.26it/s][A
 91%|█████████████████████████████████▊   | 10337/11314 [17:06<01:44,  9.34it/s][A
 91%|█████████████████████████████████▊   | 10339/11314 [17:06<01:33, 10.40it/s][A
 91%|█████████████████████████████████▊   | 10341/11314 [17:06<01:41,  9.54it/s][A
 91%|█████████████████████████████████▊   | 10343/11314 [17:06<01:42,  9.49it/s][A
 91%|█████████████████████████████████▊   | 10345/11314 [17:07<01:41,  9.53i

 93%|██████████████████████████████████▌  | 10572/11314 [17:29<00:47, 15.64it/s][A
 93%|██████████████████████████████████▌  | 10574/11314 [17:29<01:06, 11.12it/s][A
 93%|██████████████████████████████████▌  | 10576/11314 [17:30<01:13, 10.04it/s][A
 93%|██████████████████████████████████▌  | 10578/11314 [17:30<01:56,  6.33it/s][A
 94%|██████████████████████████████████▌  | 10581/11314 [17:30<01:30,  8.13it/s][A
 94%|██████████████████████████████████▌  | 10586/11314 [17:31<01:08, 10.60it/s][A
 94%|██████████████████████████████████▋  | 10589/11314 [17:31<00:59, 12.24it/s][A
 94%|██████████████████████████████████▋  | 10592/11314 [17:31<00:53, 13.48it/s][A
 94%|██████████████████████████████████▋  | 10595/11314 [17:31<00:47, 15.19it/s][A
 94%|██████████████████████████████████▋  | 10598/11314 [17:31<00:41, 17.21it/s][A
 94%|██████████████████████████████████▋  | 10601/11314 [17:31<00:53, 13.31it/s][A
 94%|██████████████████████████████████▋  | 10603/11314 [17:32<00:51, 13.81i

 96%|███████████████████████████████████▍ | 10843/11314 [17:57<01:42,  4.59it/s][A
 96%|███████████████████████████████████▍ | 10844/11314 [17:57<01:33,  5.03it/s][A
 96%|███████████████████████████████████▍ | 10845/11314 [17:57<01:32,  5.10it/s][A
 96%|███████████████████████████████████▍ | 10847/11314 [17:57<01:25,  5.46it/s][A
 96%|███████████████████████████████████▍ | 10849/11314 [17:58<01:31,  5.09it/s][A
 96%|███████████████████████████████████▍ | 10851/11314 [17:58<01:13,  6.27it/s][A
 96%|███████████████████████████████████▍ | 10853/11314 [17:58<01:04,  7.13it/s][A
 96%|███████████████████████████████████▍ | 10855/11314 [17:58<01:07,  6.80it/s][A
 96%|███████████████████████████████████▌ | 10856/11314 [17:58<01:11,  6.44it/s][A
 96%|███████████████████████████████████▌ | 10857/11314 [17:59<01:40,  4.53it/s][A
 96%|███████████████████████████████████▌ | 10858/11314 [17:59<01:52,  4.05it/s][A
 96%|███████████████████████████████████▌ | 10860/11314 [17:59<01:31,  4.96i

 98%|████████████████████████████████████▏| 11051/11314 [18:17<00:16, 15.95it/s][A
 98%|████████████████████████████████████▏| 11053/11314 [18:18<00:17, 15.16it/s][A
 98%|████████████████████████████████████▏| 11055/11314 [18:18<00:16, 16.11it/s][A
 98%|████████████████████████████████████▏| 11057/11314 [18:18<00:15, 16.55it/s][A
 98%|████████████████████████████████████▏| 11061/11314 [18:18<00:13, 19.22it/s][A
 98%|████████████████████████████████████▏| 11064/11314 [18:18<00:12, 20.06it/s][A
 98%|████████████████████████████████████▏| 11067/11314 [18:18<00:13, 18.44it/s][A
 98%|████████████████████████████████████▏| 11070/11314 [18:18<00:13, 18.12it/s][A
 98%|████████████████████████████████████▏| 11074/11314 [18:19<00:11, 21.24it/s][A
 98%|████████████████████████████████████▏| 11077/11314 [18:19<00:14, 16.79it/s][A
 98%|████████████████████████████████████▏| 11080/11314 [18:19<00:13, 17.70it/s][A
 98%|████████████████████████████████████▎| 11085/11314 [18:19<00:11, 20.32i

 99%|████████████████████████████████████▋| 11230/11314 [18:41<00:10,  7.86it/s][A
 99%|████████████████████████████████████▋| 11231/11314 [18:43<00:39,  2.09it/s][A
 99%|████████████████████████████████████▋| 11235/11314 [18:43<00:27,  2.89it/s][A
 99%|████████████████████████████████████▊| 11239/11314 [18:43<00:18,  3.99it/s][A
 99%|████████████████████████████████████▊| 11244/11314 [18:43<00:12,  5.44it/s][A
 99%|████████████████████████████████████▊| 11247/11314 [18:43<00:09,  6.73it/s][A
 99%|████████████████████████████████████▊| 11250/11314 [18:43<00:07,  8.24it/s][A
 99%|████████████████████████████████████▊| 11253/11314 [18:43<00:05, 10.24it/s][A
 99%|████████████████████████████████████▊| 11256/11314 [18:44<00:05, 10.98it/s][A
100%|████████████████████████████████████▊| 11258/11314 [18:44<00:04, 12.33it/s][A
100%|████████████████████████████████████▊| 11260/11314 [18:44<00:08,  6.55it/s][A
100%|████████████████████████████████████▊| 11262/11314 [18:45<00:07,  6.62i

In [36]:
df.head()

Unnamed: 0,content,target,target_names,clean_text,lemmas
0,"I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.",7,rec.autos,I was wondering if anyone out there could enlighten me on this car I saw the other day It was a door sports car looked to be from the late s early s It was called a Bricklin The doors were really small In addition the front bumper was separate from the rest of the body This is all I know If anyone can tellme a model name engine specs years of production where this car is made history or whatever info you have on this funky looking car please e mail,"[wonder, enlighten, car, see, day, , , door, sport, car, , look, late, , s, , early, , s, , call, Bricklin, , door, small, , addition, , bumper, separate, rest, body, , know, , tellme, model, , engine, spec, , year, production, , car, , history, , info, funky, look, car, , e, mail]"
1,"A fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't answered this\npoll. Thanks.",4,comp.sys.mac.hardware,A fair number of brave souls who upgraded their SI clock oscillator have shared their experiences for this poll Please send a brief message detailing your experiences with the procedure Top speed attained CPU rated speed add on cards and adapters heat sinks hour of usage per day floppy disk functionality with and m floppies are especially requested I will be summarizing in the next two days so please add to the network knowledge base if you have done the clock upgrade and haven t answered this poll Thanks,"[fair, number, brave, soul, upgrade, SI, clock, oscillator, share, experience, poll, , send, brief, message, detail, experience, procedure, , speed, attain, , cpu, rate, speed, , add, card, adapter, , heat, sink, , hour, usage, day, , floppy, disk, functionality, , , m, floppy, especially, request, , summarize, day, , add, network, knowledge, base, clock, upgrade, haven, t, answer, poll, , thank]"
2,"well folks, my mac plus finally gave up the ghost this weekend after\nstarting life as a 512k way back in 1985. sooo, i'm in the market for a\nnew machine a bit sooner than i intended to be...\n\ni'm looking into picking up a powerbook 160 or maybe 180 and have a bunch\nof questions that (hopefully) somebody can answer:\n\n* does anybody know any dirt on when the next round of powerbook\nintroductions are expected? i'd heard the 185c was supposed to make an\nappearence ""this summer"" but haven't heard anymore on it - and since i\ndon't have access to macleak, i was wondering if anybody out there had\nmore info...\n\n* has anybody heard rumors about price drops to the powerbook line like the\nones the duo's just went through recently?\n\n* what's the impression of the display on the 180? i could probably swing\na 180 if i got the 80Mb disk rather than the 120, but i don't really have\na feel for how much ""better"" the display is (yea, it looks great in the\nstore, but is that all ""wow"" or is it really that good?). could i solicit\nsome opinions of people who use the 160 and 180 day-to-day on if its worth\ntaking the disk size and money hit to get the active display? (i realize\nthis is a real subjective question, but i've only played around with the\nmachines in a computer store breifly and figured the opinions of somebody\nwho actually uses the machine daily might prove helpful).\n\n* how well does hellcats perform? ;)\n\nthanks a bunch in advance for any info - if you could email, i'll post a\nsummary (news reading time is at a premium with finals just around the\ncorner... :( )\n--\nTom Willis \ twillis@ecn.purdue.edu \ Purdue Electrical Engineering",4,comp.sys.mac.hardware,well folks my mac plus finally gave up the ghost this weekend after starting life as a k way back in sooo i m in the market for a new machine a bit sooner than i intended to be i m looking into picking up a powerbook or maybe and have a bunch of questions that hopefully somebody can answer does anybody know any dirt on when the next round of powerbook introductions are expected i d heard the c was supposed to make an appearence this summer but haven t heard anymore on it and since i don t have access to macleak i was wondering if anybody out there had more info has anybody heard rumors about price drops to the powerbook line like the ones the duo s just went through recently what s the impression of the display on the i could probably swing a if i got the Mb disk rather than the but i don t really have a feel for how much better the display is yea it looks great in the store but is that all wow or is it really that good could i solicit some opinions of people who use the and day to day on if its worth taking the disk size and money hit to get the active display i realize this is a real subjective question but i ve only played around with the machines in a computer store breifly and figured the opinions of somebody who actually uses the machine daily might prove helpful how well does hellcats perform thanks a bunch in advance for any info if you could email i ll post a summary news reading time is at a premium with finals just around the corner Tom Willis twillis ecn purdue edu Purdue Electrical Engineering,"[folk, , mac, plus, finally, give, ghost, weekend, start, life, , k, way, , sooo, , m, market, new, machine, bit, sooner, intend, , m, look, pick, powerbook, , maybe, , bunch, question, , hopefully, , somebody, answer, , anybody, know, dirt, round, powerbook, introduction, expect, , d, hear, , c, suppose, appearence, , summer, , haven, t, hear, anymore, , don, t, access, macleak, , wonder, anybody, info, , anybody, hear, rumor, price, drop, powerbook, line, like, one, duo, s, go, recently, , s, impression, display, , probably, swing, , get, , Mb, disk, , don, t, feel, , ...]"
3,\nDo you have Weitek's address/phone number? I'd like to get some information\nabout this chip.\n,1,comp.graphics,Do you have Weitek s address phone number I d like to get some information about this chip,"[Weitek, s, address, phone, number, , d, like, information, chip]"
4,"From article <C5owCB.n3p@world.std.com>, by tombaker@world.std.com (Tom A Baker):\n\n\nMy understanding is that the 'expected errors' are basically\nknown bugs in the warning system software - things are checked\nthat don't have the right values in yet because they aren't\nset till after launch, and suchlike. Rather than fix the code\nand possibly introduce new bugs, they just tell the crew\n'ok, if you see a warning no. 213 before liftoff, ignore it'.",14,sci.space,From article C owCB n p world std com by tombaker world std com Tom A Baker My understanding is that the expected errors are basically known bugs in the warning system software things are checked that don t have the right values in yet because they aren t set till after launch and suchlike Rather than fix the code and possibly introduce new bugs they just tell the crew ok if you see a warning no before liftoff ignore it,"[article, , C, owcb, n, p, world, std, com, , tombaker, world, std, com, , Tom, Baker, , understanding, , expect, error, , basically, know, bug, warning, system, software, , thing, check, don, t, right, value, aren, t, set, till, launch, , suchlike, , fix, code, possibly, introduce, new, bug, , tell, crew, , ok, , warning, , liftoff, , ignore, ]"


### The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus.

In [37]:
# Create Dictionary
id2word = corpora.Dictionary(df['lemmas'] )

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in df['lemmas']]

In [38]:
# How many words do we have?
len(id2word.keys())

77754

In [39]:
# Let's remove extreme values from the dataset
id2word.filter_extremes(no_below=5, no_above=0.75)

In [40]:
# How many words do we have?
len(id2word.keys())

14778

In [48]:
id2word[249]

'prob'

In [53]:
df['content'][4]



In [43]:
corpus[5]

[(0, 11),
 (117, 1),
 (177, 1),
 (193, 1),
 (221, 1),
 (225, 1),
 (226, 1),
 (227, 1),
 (228, 1),
 (229, 1),
 (230, 1),
 (231, 1),
 (232, 1),
 (233, 1),
 (234, 1),
 (235, 1),
 (236, 1),
 (237, 1),
 (238, 1),
 (239, 1),
 (240, 1),
 (241, 1),
 (242, 1),
 (243, 1),
 (244, 1),
 (245, 1),
 (246, 2),
 (247, 1),
 (248, 1),
 (249, 2)]

In [44]:
id2word[252]

'rm'

In [45]:
id2word[276]

'controller'

In [46]:
# Human readable format of corpus (term-frequency)
[(id2word[word_id], word_count) for word_id, word_count in corpus[5]]

[('  ', 11),
 ('helpful', 1),
 ('address', 1),
 ('ignore', 1),
 ('course', 1),
 ('evidently', 1),
 ('later', 1),
 ('mass', 1),
 ('point', 1),
 ('present', 1),
 ('quote', 1),
 ('read', 1),
 ('switch', 1),
 ('term', 1),
 ('topic', 1),
 ('understand', 1),
 ('weapon', 1),
 ('News', 1),
 ('Sean', 1),
 ('September', 1),
 ('Sharon', 1),
 ('accidentally', 1),
 ('bounce', 1),
 ('couldn', 1),
 ('delete', 1),
 ('directly', 1),
 ('file', 2),
 ('glad', 1),
 ('instead', 1),
 ('prob', 2)]

# Part 2: Estimate a LDA Model with Gensim

 ### Train an LDA model

In [51]:
%%time
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20, 
                                            chunksize=100,
                                            passes=10,
                                            per_word_topics=True)

# https://radimrehurek.com/gensim/models/ldamodel.html

IndexError: index 14778 is out of bounds for axis 1 with size 14778

In [50]:
lda_model.save('lda_model.model')

NameError: name 'lda_model' is not defined

In [None]:
%%time
lda_multicore = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                        id2word=id2word,
                                                        num_topics=20, 
                                                        chunksize=100,
                                                        passes=10,
                                                        per_word_topics=True,
                                                        workers=12)

# https://radimrehurek.com/gensim/models/ldamulticore.html

In [None]:
lda_multicore.save('lda_multicore.model')

In [None]:
from gensim import models
lda_multicore =  models.LdaModel.load('lda_multicore.model')

### View the topics in LDA model

In [None]:
newsgroups_train.target_names

In [None]:
pprint(lda_multicore.print_topics())
doc_lda = lda_multicore[corpus]

In [None]:
lda_multicore[corpus[5]][0]

In [None]:
doc_lda

In [None]:
distro = [lda[d] for d in corpus]

In [None]:
#distro = [lda_multicore[corpus[d]][0] for d in corpus]

In [None]:
distro[0]

### What is topic Perplexity?
Perplexity is a statistical measure of how well a probability model predicts a sample. As applied to LDA, for a given value of , you estimate the LDA model. Then given the theoretical word distributions represented by the topics, compare that to the actual topic mixtures, or distribution of words in your documents.

### What is topic coherence?
Topic Coherence measures score a single topic by measuring the degree of semantic similarity between high scoring words in the topic. These measurements help distinguish between topics that are semantically interpretable topics and topics that are artifacts of statistical inference.
A set of statements or facts is said to be coherent, if they support each other. Thus, a coherent fact set can be interpreted in a context that covers all or most of the facts. An example of a coherent fact set is “the game is a team sport”, “the game is played with a ball”, “the game demands great physical efforts”

In [None]:
# Compute Perplexity
print('\nPerplexity: ', lda_multicore.log_perplexity(corpus))  # a measure of how good the model is. lower the better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_multicore, 
                                     texts=df['lemmas'], 
                                     dictionary=id2word, 
                                     coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

# Part 3: Interpret LDA results & Select the appropriate number of topics

In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_multicore, corpus, id2word)
pyLDAvis.display(vis)

In [None]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                        id2word=id2word,
                                                        num_topics=num_topics, 
                                                        chunksize=100,
                                                        passes=10,
                                                        per_word_topics=True,
                                                        workers=12)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

In [None]:
%%time
model_list, coherence_values = compute_coherence_values(dictionary=id2word, 
                                                        corpus=corpus, 
                                                        texts=df['lemmas'], 
                                                        start=2, 
                                                        limit=40, 
                                                        step=6)

In [None]:
coherence_values = [0.5054, 0.5332, 0.5452, 0.564, 0.5678, 0.5518, 0.519]

In [None]:
limit=40; start=2; step=6;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(("coherence_values"), loc='best')
plt.show()

In [None]:
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))

In [None]:
# Select the model and print the topics
#optimal_model = model_list[4]
optimal_model =  models.LdaModel.load('optimal_model.model')
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))