<a href="https://colab.research.google.com/github/jproctor-rebecca/DS/blob/main/module4-topic-modeling/DSPT6_LS_DS_414_Topic_Modeling_RJProctor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling (Prepare)

On Monday we talked about summarizing your documents using just token counts. Today, we're going to learn about a much more sophisticated approach: learning 'topics' from documents. Topics are a latent structure: they are not directly observable in the data, but we know they're there from reading the documents.

> **latent**: existing but not yet developed or manifest; hidden or concealed.

## Use Cases
Primary use case: what the hell are your documents about? Who in industry might want to know that?
* Identifying common themes in customer reviews
* Discovering the needle in a haystack 
* Monitoring communications (Email - State Department) 

## Learning Objectives
*At the end of the lesson you should be able to:*
* Part 0: Warm-Up
* Part 1: Describe how an LDA Model works
* Part 2: Estimate an LDA Model with Gensim
* Part 3: Interpret LDA results & Select the appropriate number of topics

In [1]:
# Dependencies for the week (instead of conda)
!wget https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-1-NLP/main/requirements.txt
!pip install -r requirements.txt

--2020-10-29 17:27:59--  https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-1-NLP/main/requirements.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 137 [text/plain]
Saving to: ‘requirements.txt.1’


2020-10-29 17:27:59 (3.57 MB/s) - ‘requirements.txt.1’ saved [137/137]



In [2]:
!python -m spacy download en_core_web_lg  # Can do lg, takes a while
# Also on Colab, need to restart runtime after this step!

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [3]:
!pip install gensim



In [None]:
!pip install pandarallel

# Part 0: Warm-Up
How do we do a grid search? 
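In short: a grid search tries every combination of the candidate values and scores each combination with cross-validation. As a minimal sketch in plain Python (this is not scikit-learn's implementation, and the variable names are only illustrative), the enumeration for the `p1` grid used in this warm-up looks like this:

```python
from itertools import product

# Same grid as `p1` in this warm-up: two values for each hyperparameter
param_grid = {
    "n_estimators": [10, 20],
    "max_depth": [None, 7],
}
cv_folds = 5

# Every combination of one value per hyperparameter is one "candidate"
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

for candidate in candidates:
    print(candidate)

# Each candidate is fit once per cross-validation fold
total_fits = len(candidates) * cv_folds
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

The count matches what `GridSearchCV` reports when it runs with `cv=5` on this grid: 4 candidates, 20 fits.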

In [4]:
import warnings
warnings.filterwarnings("ignore")

In [5]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

import re
import numpy as np
from pprint import pprint

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from gensim import models

# Download spacy model
import spacy.cli
spacy.cli.download("en_core_web_lg")

import pyLDAvis
import pyLDAvis.gensim 
import matplotlib.pyplot as plt
%matplotlib inline

from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)

In [6]:
data = fetch_20newsgroups()

In [7]:
data.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [8]:
data['target_names']

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [9]:
data['data'][1000]

"From: dabl2@nlm.nih.gov (Don A.B. Lindbergh)\nSubject: Diamond SS24X, Win 3.1, Mouse cursor\nOrganization: National Library of Medicine\nLines: 10\n\n\nAnybody seen mouse cursor distortion running the Diamond 1024x768x256 driver?\nSorry, don't know the version of the driver (no indication in the menus) but it's a recently\ndelivered Gateway system.  Am going to try the latest drivers from Diamond BBS but wondered\nif anyone else had seen this.\n\npost or email\n\n--Don Lindbergh\ndabl2@lhc.nlm.nih.gov\n"

### GridSearch on Just Classifier
* Fit the vectorizer and prepare the features BEFORE they go into the grid search

In [10]:
v1 = TfidfVectorizer()
X_train = v1.fit_transform(data['data'])


In [11]:
y_train = data['target']

In [12]:
X_train

<11314x130107 sparse matrix of type '<class 'numpy.float64'>'
	with 1787565 stored elements in Compressed Sparse Row format>

In [13]:
# toarray() returns a dense ndarray (todense() returns a numpy matrix instead)
X_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
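A word of caution on converting to a dense array: almost every cell in this matrix is zero, so the dense copy is far larger than the sparse one. The rough arithmetic below uses only the shape and stored-element count reported above, and assumes 8 bytes per `float64` value.

```python
n_rows, n_cols = 11314, 130107   # shape reported for X_train
stored = 1787565                 # non-zero entries reported for the sparse matrix
bytes_per_float64 = 8            # assumption: one float64 per cell

n_cells = n_rows * n_cols
density = stored / n_cells
dense_gb = n_cells * bytes_per_float64 / 1e9

print(f"cells: {n_cells:,}")
print(f"density: {density:.4%}")
print(f"dense size: {dense_gb:.1f} GB")
```

This is why the classifier is fit on the sparse `X_train` directly rather than on the dense version.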

In [14]:
p1 = {
    'n_estimators':[10,20],
    'max_depth': [None, 7]
}

In [15]:
X_train.shape

(11314, 130107)

In [16]:
clf = RandomForestClassifier()
gs1 = GridSearchCV(clf, p1, cv=5,n_jobs=-1, verbose=1)
gs1.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:  1.3min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              rando

In [17]:
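# gs1.predict(["Sample text"]) raises an error here, because gs1 was fit on
# vectorized features rather than raw text.
# So we transform the text with the fitted vectorizer outside the GridSearch object.
# (If the vectorizer is included in the pipeline, we can input raw text directly.)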

In [18]:
test_sample = v1.transform(["Sample text"])
test_sample.shape

(1, 130107)

In [19]:
pred = gs1.predict(test_sample)
pred

array([2])

In [20]:
data['target_names'][pred[0]]

'comp.os.ms-windows.misc'

### GridSearch with BOTH the Vectorizer & Classifier

In [21]:
#RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
#                       criterion='gini', max_depth=15, max_features=500,
#                       max_leaf_nodes=None, max_samples=None,
#                       min_impurity_decrease=0.0, min_impurity_split=None,
#                       min_samples_leaf=1, min_samples_split=2,
#                       min_weight_fraction_leaf=0.0, n_estimators=100,
#                       n_jobs=None, oob_score=False, random_state=None,
#                       verbose=0, warm_start=False)

In [22]:


# 1. Create a pipeline with a vectorizer and a classifier
# Create Pipeline Components
# create vectorizer
vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
# create classifier
rfc = RandomForestClassifier()

pipe = Pipeline([
                 ('vect', vect), 
                 ('clf', rfc)
               ])

parameters = {
    'vect__max_features': (1000,5000),
    'clf__max_depth': (15, 20),
    'clf__n_estimators':(100, 200),
}

# 2. Use Grid Search to optimize the entire pipeline
grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=1)
grid_search.fit(data['data'], y_train)


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 2),
                                                        no

In [23]:
pred = grid_search.predict(["Sample text"])

In [24]:
data['target_names'][pred[0]]

'sci.electronics'

Advantages of using GridSearch with the Pipeline:
* Allows us to make predictions on raw text, increasing reproducibility. :)
* Allows us to tune the parameters of the vectorizer alongside the classifier. :D 
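The `step__parameter` keys in `parameters` are what tell the grid search which pipeline step each value belongs to. The sketch below is plain Python that only mimics that naming convention; the `route` helper is hypothetical and not part of scikit-learn.

```python
from itertools import product

# Same grid as `parameters` in the pipeline cell
parameters = {
    "vect__max_features": (1000, 5000),
    "clf__max_depth": (15, 20),
    "clf__n_estimators": (100, 200),
}

def route(candidate):
    """Group one candidate's settings by pipeline step name (hypothetical helper)."""
    per_step = {}
    for key, value in candidate.items():
        # 'vect__max_features' -> step 'vect', parameter 'max_features'
        step, _, param = key.partition("__")
        per_step.setdefault(step, {})[param] = value
    return per_step

candidates = [dict(zip(parameters, values)) for values in product(*parameters.values())]
print(len(candidates), "candidates, so", len(candidates) * 5, "fits with cv=5")
print(route(candidates[0]))
```

Because the vectorizer is refit for every candidate and fold, this search is noticeably slower than tuning the classifier alone.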

# Part 1: Describe how an LDA Model works

[Your Guide to Latent Dirichlet Allocation](https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d)

[LDA Topic Modeling](https://lettier.com/projects/lda-topic-modeling/)

[Topic Modeling with Gensim](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/)
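The core idea in those guides is that LDA assumes each document was written by a simple random process: pick a mixture of topics for the document, then for each word pick a topic from that mixture and a word from that topic. Here is a toy sketch of that process using only the standard library; the two topics, their word probabilities, and the `alpha` value are all made up for illustration.

```python
import random

random.seed(0)

# Two hand-made "topics": each is a probability distribution over words
topics = {
    "space": {"orbit": 0.4, "launch": 0.4, "engine": 0.2},
    "autos": {"engine": 0.4, "car": 0.4, "dealer": 0.2},
}

def sample_dirichlet(alpha, k):
    """Draw a k-dimensional symmetric Dirichlet sample by normalizing Gamma draws."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5):
    """Generate one document the way LDA assumes documents are written."""
    names = list(topics)
    # 1. Pick this document's mixture of topics
    mixture = sample_dirichlet(alpha, len(names))
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position, according to the mixture
        topic = random.choices(names, weights=mixture)[0]
        # 3. Pick a word from that topic's word distribution
        vocab = topics[topic]
        words.append(random.choices(list(vocab), weights=list(vocab.values()))[0])
    return mixture, words

mixture, words = generate_document(8)
print([round(p, 2) for p in mixture])
print(words)
```

Fitting an LDA model runs this story in reverse: given only the words, it infers the per-document topic mixtures and the per-topic word distributions.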

In [27]:
df = pd.DataFrame({
    'content': data['data'],
    'target': data['target'],
    'target_names': [data['target_names'][i] for i in data['target']]
})
df

| | content | target | target_names |
|---|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 | rec.autos |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | comp.sys.mac.hardware |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 | comp.sys.mac.hardware |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 | comp.graphics |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 | sci.space |
| ... | ... | ... | ... |
| 11309 | From: jim.zisfein@factory.com (Jim Zisfein) \n... | 13 | sci.med |
| 11310 | From: ebodin@pearl.tufts.edu\nSubject: Screen ... | 4 | comp.sys.mac.hardware |
| 11311 | From: westes@netcom.com (Will Estes)\nSubject:... | 3 | comp.sys.ibm.pc.hardware |
| 11312 | From: steve@hcrlgw (Steven Collins)\nSubject: ... | 1 | comp.graphics |


In [28]:
# For reference on regex: https://docs.python.org/3/library/re.html

# From 'content' column: 
# 1. Remove leading/trailing whitespace
df['content'] = df['content'].apply(lambda text: text.strip())
# 2. Remove the sender's email address from the 'From:' header
#    (raw strings avoid the invalid-escape warning for \S)
df['content'] = df['content'].apply(lambda text: re.sub(r'From: \S+@\S+', '', text))
# 3. Remove new line characters (note: this joins the words on either side of a line break)
df['content'] = df['content'].apply(lambda text: re.sub(r'\n', '', text))
# 4. Remove non-alphanumeric characters (spaces are kept)
df['content'] = df['content'].apply(lambda text: re.sub(r'[^0-9 a-zA-Z]+', '', text))

df['content'] = df['content'].apply(lambda text: text.strip())


In [29]:
df['content'].head()

0    wheres my thingSubject WHAT car is thisNntpPos...
1    Guy KuoSubject SI Clock Poll  Final CallSummar...
2    Thomas E WillisSubject PB questionsOrganizatio...
3    Joe GreenSubject Re Weitek P9000 Organization ...
4    Jonathan McDowellSubject Re Shuttle Launch Que...
Name: content, dtype: object
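To see what each cleaning step does, here is the same sequence applied to a short made-up post. One assumption differs from the cleaning cell: newlines are replaced with a space rather than an empty string, which keeps words on adjacent lines from fusing together (as in `thingSubject` in the output above).

```python
import re

# Made-up post in the same shape as a newsgroup message
sample = "From: someone@example.com (A. Person)\nSubject: Test post\n\nHello, world!\n"

text = sample.strip()
text = re.sub(r"From: \S+@\S+", "", text)    # drop the sender address
text = re.sub(r"\n+", " ", text)             # newlines -> single space (assumption)
text = re.sub(r"[^0-9 a-zA-Z]+", "", text)   # keep only letters, digits, spaces
text = re.sub(r" +", " ", text).strip()      # collapse repeated spaces

print(text)
```

Whether to keep header words such as `Subject` at all is a separate choice; they will show up in the topics unless they are filtered out.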

In [30]:
df['content'] = df['content'].apply(lambda text: text.strip())
df['content'].head()

0    wheres my thingSubject WHAT car is thisNntpPos...
1    Guy KuoSubject SI Clock Poll  Final CallSummar...
2    Thomas E WillisSubject PB questionsOrganizatio...
3    Joe GreenSubject Re Weitek P9000 Organization ...
4    Jonathan McDowellSubject Re Shuttle Launch Que...
Name: content, dtype: object



In [None]:
nlp = spacy.load("en_core_web_lg")

In [None]:
# Tokenizer: return the lemma of every token that is not a stop word or punctuation
def tokenize(doc):
  return [token.lemma_ for token in nlp(doc) if not token.is_stop and not token.is_punct]
  

In [None]:
TfidfVectorizer(tokenizer=tokenize)

In [None]:
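# Create 'lemmas' column from the cleaned 'content' column
# (a list per document, not a generator, so the lemmas can be iterated more than once)
df['lemmas'] = df['content'].parallel_apply(lambda x: [token.lemma_ for token in nlp(x) if not token.is_stop and not token.is_punct])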


In [None]:
df.head()

### The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus.
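As a rough picture of what these two inputs contain, the sketch below rebuilds them in plain Python for three tiny made-up documents. It only imitates the idea behind `corpora.Dictionary` and `doc2bow` (an integer id for every unique token, then `(token_id, count)` pairs per document); Gensim's actual id assignment may differ, and the `_toy` names are illustrative.

```python
from collections import Counter

# Made-up, already-tokenized documents
docs = [
    ["engine", "car", "engine"],
    ["orbit", "launch", "engine"],
    ["car", "dealer"],
]

# Dictionary: one integer id per unique token, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))
id2word_toy = {i: t for t, i in token2id.items()}

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

corpus_toy = [to_bow(doc) for doc in docs]
print(id2word_toy)
print(corpus_toy)
```

Word order is discarded at this point: the model only sees which tokens occur in each document and how often.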

In [None]:
# Create Dictionary
id2word = corpora.Dictionary(df['lemmas'] )

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in df['lemmas']]

In [None]:
id2word[200]

In [None]:
df['content'][5]

In [None]:
corpus[5]

In [None]:
id2word[252]

In [None]:
id2word[276]

In [None]:
# Human readable format of corpus (term-frequency)
[(id2word[word_id], word_count) for word_id, word_count in corpus[5]]

# Part 2: Estimate an LDA Model with Gensim

### Train an LDA model

In [None]:
# %%time
# lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
#                                            id2word=id2word,
#                                            num_topics=20, 
#                                            chunksize=100,
#                                            passes=10,
#                                            per_word_topics=True)

# # https://radimrehurek.com/gensim/models/ldamodel.html

In [None]:
# lda_model.save('lda_model.model')

In [None]:
# %%time
# lda_multicore = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
#                                                         id2word=id2word,
#                                                         num_topics=20, 
#                                                         chunksize=100,
#                                                         passes=10,
#                                                         per_word_topics=True,
#                                                         workers=12)

# # https://radimrehurek.com/gensim/models/ldamulticore.html

In [None]:
# lda_multicore.save('lda_multicore.model')

In [None]:

lda_multicore =  models.LdaModel.load('lda_multicore.model')

### View the topics in LDA model

In [None]:
pprint(lda_multicore.print_topics())
doc_lda = lda_multicore[corpus]


### What is topic Perplexity?
Perplexity is a statistical measure of how well a probability model predicts a sample. As applied to LDA, for a given value of k (the number of topics), you estimate the LDA model. Then, given the theoretical word distributions represented by the topics, you compare them to the actual topic mixtures, or distribution of words, in your documents. Lower perplexity means the model predicts the sample better.
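One practical detail: Gensim's `log_perplexity` reports a per-word likelihood bound rather than the perplexity itself, and the perplexity is 2 raised to the negative of that bound. A small worked example of the conversion, using made-up bound values (not results from this notebook):

```python
# Hypothetical per-word bounds for two models (made-up numbers)
bound_a = -8.0
bound_b = -7.5

def perplexity(bound):
    """Convert a per-word log2 likelihood bound into perplexity."""
    return 2 ** (-bound)

print(perplexity(bound_a))
print(perplexity(bound_b))
```

The model with the bound closer to zero ends up with the lower perplexity, so a higher bound and a lower perplexity say the same thing.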

### What is topic coherence?
Topic coherence measures score a single topic by measuring the degree of semantic similarity between the high-scoring words in the topic. These measurements help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference.
A set of statements or facts is said to be coherent if they support each other. Thus, a coherent fact set can be interpreted in a context that covers all or most of the facts. An example of a coherent fact set is “the game is a team sport”, “the game is played with a ball”, “the game demands great physical efforts”.
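To make the idea concrete, here is a simplified, UMass-style coherence score in plain Python. It is not the `c_v` measure used later in this notebook; it is only a sketch of the shared intuition that a topic's top words should tend to appear in the same documents. The documents and word lists are made up.

```python
from itertools import combinations
from math import log

# Made-up tokenized documents (as sets, since only presence matters here)
docs = [
    {"game", "team", "ball"},
    {"game", "ball", "score"},
    {"team", "game", "coach"},
    {"orbit", "launch", "engine"},
]

def doc_freq(*words):
    """Number of documents containing all of the given words."""
    return sum(all(w in doc for w in words) for doc in docs)

def umass_style_coherence(top_words):
    """Sum of log((co-occurrences + 1) / occurrences of the earlier word) over word pairs."""
    return sum(
        log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
        for earlier, later in combinations(top_words, 2)
    )

coherent = umass_style_coherence(["game", "team", "ball"])
mixed = umass_style_coherence(["game", "orbit", "coach"])
print(round(coherent, 3), round(mixed, 3))
```

With this score, values closer to zero are more coherent: the word list that reads like one theme scores higher than the list that mixes unrelated words.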

In [None]:
# Compute Perplexity
print('\nPerplexity: ', lda_multicore.log_perplexity(corpus))  # log_perplexity returns the per-word likelihood bound; perplexity is 2**(-bound), and lower perplexity is better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_multicore, 
                                     texts=df['lemmas'], 
                                     dictionary=id2word, 
                                     coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

# Part 3: Interpret LDA results & Select the appropriate number of topics

In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_multicore, corpus, id2word)
pyLDAvis.display(vis)

In [None]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for a range of topic counts

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics (exclusive)
    start : Min num of topics
    step : Increment between successive numbers of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                        id2word=dictionary,
                                                        num_topics=num_topics, 
                                                        chunksize=100,
                                                        passes=10,
                                                        per_word_topics=True,
                                                        workers=12)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

In [None]:
# %%time
# model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=df['lemmas'], start=2, limit=40, step=6)

In [None]:
coherence_values = [0.5054, 0.5332, 0.5452, 0.564, 0.5678, 0.5518, 0.519]

In [None]:
limit = 40
start = 2
step = 6
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')  # a list, so the label is not split into characters
plt.show()

In [None]:
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))
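Picking the number of topics from these scores comes down to finding the position of the highest coherence value and mapping it back to a topic count. A minimal sketch using the recorded values and the same `start`, `limit`, and `step` (the `best_*` names are illustrative):

```python
start, limit, step = 2, 40, 6
coherence_values = [0.5054, 0.5332, 0.5452, 0.564, 0.5678, 0.5518, 0.519]

# Topic counts that were tried, in the same order as the scores
topic_counts = list(range(start, limit, step))

# Position of the highest coherence score
best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
best_num_topics = topic_counts[best_index]

print(f"best index: {best_index}, num_topics: {best_num_topics}, "
      f"coherence: {coherence_values[best_index]}")
```

The highest score is a starting point rather than a rule: neighbouring topic counts score almost as well here, so it is worth reading the topics themselves before settling on one.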

In [None]:
# Select the model and print the topics
#optimal_model = model_list[4]
optimal_model =  models.LdaModel.load('optimal_model.model')
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))