# Topic Modeling Assessment Project

Welcome to your Topic Modeling Assessment! For this project you will be working with a dataset of over 400,000 Quora questions that have no labeled category, and attempting to find an optimal number of categories to assign these questions to. The .csv file of these text questions can be found in the NLP folder.


#### Task: Import pandas and read in the quora_questions.csv file.

In [44]:
import pandas as pd
import numpy as np
# Plotting tools

import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
df = pd.read_csv('quora_questions.csv')

In [5]:
df.shape

(404289, 1)

In [6]:
df.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [10]:
df.isnull().sum()

Question    0
dtype: int64

# Preprocessing

#### Task: Create a vectorized document term matrix. 

- How do you want to clean up your text with regard to stopwords, special characters, and other situations?
- Do you want to use a `CountVectorizer` or a `TfidfVectorizer`?
- You may want to explore the `max_df` and `min_df` parameters.


In [13]:
import re 
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

# Create our list of punctuation marks
punctuations = string.punctuation

# Load the small English pipeline and reuse the stop-word set imported above
nlp = spacy.load('en_core_web_sm')
stop_words = STOP_WORDS

def spacy_tokenizer(text):
    """
    Returns the text as a list of lowercased lemmas, with HTML tags,
    stop words, and punctuation removed.
    
    Param text: string of text to be tokenized.
    """
    
    # remove html tags from all of the text before processing
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', text)
    
    # Creating our token object, which is used to create documents with linguistic annotations.
    # we disabled the parser and ner parts of the pipeline in order to speed up parsing
    mytokens = nlp(cleantext, disable=['parser', 'ner'])

    # Lemmatizing each token and converting it to lowercase; spaCy 2.x gives pronouns
    # the placeholder lemma "-PRON-", so keep their lowercased surface form instead
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Removing stop words
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens
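
The HTML-stripping step at the top of `spacy_tokenizer` can be checked on its own, without loading a spaCy model. The sketch below is only an illustration: `strip_html`, `TAG_PATTERN`, and the sample string are hypothetical names introduced here, not part of the notebook.

```python
import re

# Non-greedy pattern: each tag is matched separately instead of one match
# spanning from the first '<' to the last '>' in the string.
TAG_PATTERN = re.compile(r'<.*?>')

def strip_html(text):
    """Remove anything that looks like an HTML tag from `text`."""
    return TAG_PATTERN.sub('', text)

sample = "<b>How</b> do I <i>learn</i> Python?"
cleaned = strip_html(sample)
print(cleaned)
```

One caveat of this pattern is that a question containing literal `<` and `>` characters (for example a comparison such as `3 < 5 and 7 > 2`) would lose the text between them, so it may be worth spot-checking a few questions before relying on it.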

In [14]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline

**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

   **Let's make sure to try some different values for `max_df` and `min_df` later**
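
To get a feel for what the two thresholds do, here is a minimal sketch on a made-up four-document corpus (not the Quora data). The names `docs`, `toy_cv`, and `toy_dtm` are hypothetical, and the default scikit-learn tokenizer is used instead of `spacy_tokenizer` to keep the example light.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus (not from the Quora data).
docs = [
    "how to learn python fast",
    "how to learn java",
    "how to cook rice",
    "best way to travel",
]

# min_df=2    -> drop terms that appear in fewer than 2 documents
# max_df=0.75 -> drop terms that appear in more than 75% of the documents
toy_cv = CountVectorizer(min_df=2, max_df=0.75)
toy_dtm = toy_cv.fit_transform(docs)

# vocabulary_ maps each surviving term to its column index.
kept_terms = sorted(toy_cv.vocabulary_)
print(kept_terms)
print(toy_dtm.shape)   # (n_documents, n_kept_terms)
```

Here "to" occurs in every document and is pruned by `max_df`, while words that occur in a single document are pruned by `min_df`, leaving only "how" and "learn".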

In [16]:
cv = CountVectorizer(tokenizer=spacy_tokenizer, max_df=0.90, min_df=10, stop_words='english')
tfidfv = TfidfVectorizer(tokenizer=spacy_tokenizer,max_df=0.90, min_df=10, stop_words='english')

In [21]:
%time cv_data = cv.fit_transform(df['Question'])

CPU times: user 18min 9s, sys: 6.06 s, total: 18min 15s
Wall time: 18min 34s


In [22]:
%time tfidfv_data = tfidfv.fit_transform(df['Question'])

CPU times: user 17min 50s, sys: 5.12 s, total: 17min 56s
Wall time: 18min 15s


In [43]:
cv_data.shape

(404289, 11917)

I want to pick a random subset of roughly 25% of this data; otherwise the modeling will take far too long. Thus, I'll generate

$$\left\lfloor \frac{404289}{4} \right\rfloor = 101072$$ random row indexes in the half-open range $[0, 404289)$.

In [45]:
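# Seeded generator for reproducibility (a RandomState that is created but
# never assigned to a name does not seed the np.random functions)
rng = np.random.RandomState(seed=1)

# Random row indexes, drawn with replacement from [0, 404289)
rand_is = rng.randint(0, 404289, 101072)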

In [46]:
cv_subset = cv_data[rand_is]

In [47]:
cv_subset.shape

(101072, 11917)

In [48]:
tfidfv_subset = tfidfv_data[rand_is]
tfidfv_subset.shape

(101072, 11917)

Beautiful. Now we can use these subsets for modeling, which spares my computer's cores.
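
One thing to keep in mind is that `randint` draws with replacement, so some rows can land in the subset more than once. If distinct rows are wanted, a possible alternative is `choice` with `replace=False`. The sketch below only compares the two kinds of index arrays; `with_replacement` and `without_replacement` are hypothetical names.

```python
import numpy as np

n_rows = 404289            # number of questions in the full dataset
n_sample = n_rows // 4     # 25% of the rows

rng = np.random.RandomState(seed=1)

# Drawing with replacement (what randint does) can repeat row indexes.
with_replacement = rng.randint(0, n_rows, n_sample)

# Drawing without replacement guarantees that every sampled row is distinct.
without_replacement = rng.choice(n_rows, size=n_sample, replace=False)

print(len(np.unique(with_replacement)), "unique rows with replacement")
print(len(np.unique(without_replacement)), "unique rows without replacement")
```

Either index array can be used to slice the sparse matrices in the same way as `rand_is`.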

# LDA Modelling

#### TASK: Using Scikit-Learn create an instance of LDA. 

- You can manually run and tune your model, then evaluate the resulting clusters. 
- Or you can use gridsearch to try and identify the best number of topics to use. 


**NOTE:** We may want to take a random sample of the data! Maybe 25% of the rows because 400,000 could take too long. For example:

`df.sample(frac=0.25, random_state=99)`

In [51]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

In [53]:
# Define Search Param
search_params = {'n_components': [15, 20, 25],'learning_decay': [.5, .7]}

# Init the Model
lda = LatentDirichletAllocation(max_iter=25, 
                                random_state=100, 
                                evaluate_every = -1)

# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params, cv=3, verbose=2, n_jobs = -1)

# Do the Grid Search
%time model.fit(cv_subset)

Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed: 41.6min finished


CPU times: user 6min 23s, sys: 4.71 s, total: 6min 27s
Wall time: 48min 25s


GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=LatentDirichletAllocation(batch_size=128,
                                                 doc_topic_prior=None,
                                                 evaluate_every=-1,
                                                 learning_decay=0.7,
                                                 learning_method='batch',
                                                 learning_offset=10.0,
                                                 max_doc_update_iter=100,
                                                 max_iter=25,
                                                 mean_change_tol=0.001,
                                                 n_components=10, n_jobs=None,
                                                 perp_tol=0.1, random_state=100,
                                                 topic_word_prior=None,
                                                 total_samples=1000000.0,
                

In [54]:
# Init Grid Search Class
model2 = GridSearchCV(lda, param_grid=search_params, cv=3, verbose=2, n_jobs = -1)

# Do the Grid Search
%time model2.fit(tfidfv_subset)

Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed: 38.6min finished


CPU times: user 5min 35s, sys: 3.43 s, total: 5min 38s
Wall time: 44min 35s


GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=LatentDirichletAllocation(batch_size=128,
                                                 doc_topic_prior=None,
                                                 evaluate_every=-1,
                                                 learning_decay=0.7,
                                                 learning_method='batch',
                                                 learning_offset=10.0,
                                                 max_doc_update_iter=100,
                                                 max_iter=25,
                                                 mean_change_tol=0.001,
                                                 n_components=10, n_jobs=None,
                                                 perp_tol=0.1, random_state=100,
                                                 topic_word_prior=None,
                                                 total_samples=1000000.0,
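
Beyond the best estimator, a fitted grid search also stores the score of every candidate in `cv_results_`, which makes it easier to see how close the alternatives were. The following is a sketch on a small random count matrix standing in for the document-term matrix; `toy_dtm`, `toy_grid`, and `results` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

# Small random count matrix standing in for a document-term matrix.
rng = np.random.RandomState(0)
toy_dtm = rng.poisson(lam=1.0, size=(60, 30))

toy_grid = GridSearchCV(
    LatentDirichletAllocation(max_iter=5, random_state=100),
    param_grid={'n_components': [2, 4], 'learning_decay': [0.5, 0.7]},
    cv=3,
)
toy_grid.fit(toy_dtm)

# One row per parameter combination, best (rank 1) first.
results = (
    pd.DataFrame(toy_grid.cv_results_)
    [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
)
print(results)
print("Best params:", toy_grid.best_params_)
```

According to scikit-learn's documentation, `learning_decay` controls the online update, so with the default `learning_method='batch'` the two decay values are expected to score identically. If that holds, the search effectively compares only the values of `n_components`.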
                

#### Task: Evaluate the different models you have run and determine which model you think produces the best clusters.


The evaluation part could involve:
- Printing out the top 15 most common words for each of the topics and seeing if they make sense.
- Using the perplexity and log-likelihood scores.
- Using the pyLDAvis tool to investigate the different clusters. 
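
For the second option, it helps to know how the two scores relate. In scikit-learn, `score` returns an approximate log-likelihood (higher is better) and `perplexity` is the exponential of the negative per-word value of that bound (lower is better). The sketch below checks this on toy counts; `toy_dtm` and `toy_lda` are hypothetical names and the data is random, not the Quora questions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy count matrix, not the Quora data.
rng = np.random.RandomState(0)
toy_dtm = rng.poisson(lam=1.0, size=(60, 30))

toy_lda = LatentDirichletAllocation(n_components=3, max_iter=5, random_state=100)
toy_lda.fit(toy_dtm)

log_likelihood = toy_lda.score(toy_dtm)      # approximate log-likelihood bound
perplexity = toy_lda.perplexity(toy_dtm)     # lower is better

# Divide the bound by the total count in the matrix, negate, and exponentiate.
per_word = log_likelihood / toy_dtm.sum()
print(log_likelihood, perplexity, np.exp(-per_word))
```

Because the normalizer is the sum of the matrix entries, a TF-IDF matrix (fractional weights) and a count matrix are not normalized on the same scale, so their perplexities are probably best compared within one vectorizer rather than across the two.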

### Let's see the best topic models for each type of vectorizer. We can compare their parameters and performance.

In [55]:
# Best Model for CountVectorizer
best_cv_lda_model = model.best_estimator_

# Model Parameters
print("Best Model's Params: ", model.best_params_)

# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)

# Perplexity
print("Model Perplexity: ", best_cv_lda_model.perplexity(cv_subset))

Best Model's Params:  {'learning_decay': 0.5, 'n_components': 15}
Best Log Likelihood Score:  -1333847.1039681332
Model Perplexity:  2689.6030766376007


In [57]:
# Best Model for TfidfVectorizer
best_tfidfv_lda_model = model2.best_estimator_

# Model Parameters
print("Best Model's Params: ", model2.best_params_)

# Log Likelihood Score
print("Best Log Likelihood Score: ", model2.best_score_)

# Perplexity
print("Model Perplexity: ", best_tfidfv_lda_model.perplexity(tfidfv_subset))

Best Model's Params:  {'learning_decay': 0.5, 'n_components': 15}
Best Log Likelihood Score:  -694352.7635593917
Model Perplexity:  11339.599058857213


In [None]:
# Build LDA Model
# lda_model = LatentDirichletAllocation(n_components=20,               # Number of topics
#                                       max_iter=20,               # Max learning iterations
#                                       learning_method='online',   
#                                       random_state=100,          # Random state
#                                       batch_size=128,            # n docs in each learning iter
#                                       evaluate_every = -1,       # compute perplexity every n iters, default: Don't
#                                       n_jobs = 1,               # Use a single CPU (-1 would use all available CPUs)
#                                  )

# print(lda_model)  # Model attributes

In [58]:
# GridSearchCV has already refit the best estimators, so transform() is enough
cv_lda_output = best_cv_lda_model.transform(cv_subset)
tfidfv_lda_output = best_tfidfv_lda_model.transform(tfidfv_subset)

KeyboardInterrupt: 

In [59]:
# For CountVectorizer
cv_feature_names = cv.get_feature_names()  # look the vocabulary up once, not once per word
for index, topic in enumerate(best_cv_lda_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([cv_feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['body', 'month', 'exercise', 'fast', 'gain', 'hate', 'reduce', 'high', 'school', 'fat', 'india', 'weight', 'lose', 'way', 'good']


THE TOP 15 WORDS FOR TOPIC #1
['eye', 'international', 'possible', 'speed', 'light', 'example', 'visit', 'youtube', 'laptop', 'time', 'travel', 'video', 'place', 'student', 'good']


THE TOP 15 WORDS FOR TOPIC #2
['window', 'code', 'sentence', 'difference', 'different', 'phone', 'period', 'learn', 'die', 'word', 'pro', 'number', 'programming', 'language', 'use']


THE TOP 15 WORDS FOR TOPIC #3
['sell', 'benefit', 'pass', 'health', 'skill', 'social', 'center', 'help', 'drug', 'test', 'good', 'english', 'improve', 'study', 'work']


THE TOP 15 WORDS FOR TOPIC #4
['process', 'interview', 'song', 'course', 'software', 'college', 'company', 'phone', 'android', 'website', 'india', 'engineering', 'app', 'job', 'good']


THE TOP 15 WORDS FOR TOPIC #5
['hair', 'career', 'start', 'dream', 'year', 'new', 'relationship', 'feel', '2017', 

In [60]:
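# For TfidfVectorizer
tfidfv_feature_names = tfidfv.get_feature_names()  # use the TF-IDF vectorizer's own vocabulary
for index, topic in enumerate(best_tfidfv_lda_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([tfidfv_feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')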

THE TOP 15 WORDS FOR TOPIC #0
['body', 'exercise', 'mean', 'energy', 'reduce', 'gain', 'way', 'hate', 'food', 'fast', 'fat', 'eat', 'good', 'weight', 'lose']


THE TOP 15 WORDS FOR TOPIC #1
['cheap', 'possible', 'example', 'eye', 'speed', 'porn', 'light', 'laptop', 'time', 'video', 'visit', 'youtube', 'travel', 'good', 'place']


THE TOP 15 WORDS FOR TOPIC #2
['java', 'pregnant', 'window', 'code', 'phone', 'app', 'sentence', 'good', 'android', 'word', 'iphone', 'learn', 'programming', 'language', 'use']


THE TOP 15 WORDS FOR TOPIC #3
['vote', 'india', 'successful', 'writing', 'drug', 'good', 'study', 'skill', 'hillary', 'clinton', 'donald', 'president', 'english', 'trump', 'improve']


THE TOP 15 WORDS FOR TOPIC #4
['start', 'business', 'science', 'career', 'computer', 'india', 'software', 'school', 'engineer', 'company', 'student', 'college', 'engineering', 'job', 'good']


THE TOP 15 WORDS FOR TOPIC #5
['resolution', 'stop', 'like', 'year', 'cat', 'relationship', 'dream', 'car', '20
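
Note that the lists above run from the weakest to the strongest of the top 15 words, because `argsort` sorts in ascending order and `[-15:]` keeps the tail. A small sketch with made-up weights shows the difference between the two orderings; `vocab` and `topic_weights` are hypothetical.

```python
import numpy as np

# Hypothetical vocabulary and one topic's word weights (made-up numbers).
vocab = np.array(['weight', 'lose', 'india', 'job', 'good', 'python'])
topic_weights = np.array([4.0, 6.5, 0.3, 1.2, 9.1, 0.1])

n_top = 3
ascending = topic_weights.argsort()[-n_top:]         # weakest of the top 3 first
descending = topic_weights.argsort()[::-1][:n_top]   # strongest first

print(vocab[ascending].tolist())
print(vocab[descending].tolist())
```

Reversing the sorted indexes (or sorting the negated weights) puts the most characteristic word first, which is usually easier to read.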

In [64]:
help(best_cv_lda_model)

Help on LatentDirichletAllocation in module sklearn.decomposition.online_lda object:

class LatentDirichletAllocation(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin)
 |  LatentDirichletAllocation(n_components=10, doc_topic_prior=None, topic_word_prior=None, learning_method='batch', learning_decay=0.7, learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=None, verbose=0, random_state=None)
 |  
 |  Latent Dirichlet Allocation with online variational Bayes algorithm
 |  
 |  .. versionadded:: 0.17
 |  
 |  Read more in the :ref:`User Guide <LatentDirichletAllocation>`.
 |  
 |  Parameters
 |  ----------
 |  n_components : int, optional (default=10)
 |      Number of topics.
 |  
 |  doc_topic_prior : float, optional (default=None)
 |      Prior of document topic distribution `theta`. If the value is None,
 |      defaults to `1 / n_components`.
 |      In [1]_, this i

In [None]:
# Create Document - Topic Matrix
lda_output = best_cv_lda_model.transform(cv_subset)

# column names
topicnames = ["Topic" + str(i) for i in range(best_cv_lda_model.n_components)]

# index names
docnames = ["Doc" + str(i) for i in range(cv_subset.shape[0])]

# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)

# Get dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['dominant_topic'] = dominant_topic

In [72]:
# Styling
def color_red(val):
    color = 'red' if (val > .1) or (val in list(df_document_topic.dominant_topic)) else 'black'
    return 'color: {col}'.format(col=color)

def make_bold(val):
    weight = 700 if (val > .1) or (val in list(df_document_topic.dominant_topic)) else 400
    return 'font-weight: {weight}'.format(weight=weight)

# Apply Style
df_document_topics = df_document_topic.head(15).style.applymap(color_red).applymap(make_bold)
df_document_topics

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,Topic10,Topic11,Topic12,Topic13,Topic14,dominant_topic
Doc0,0.01,0.81,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,1
Doc1,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.53,0.03,0.03,0.03,0.03,0.03,0.03,0.03,7
Doc2,0.01,0.81,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,1
Doc3,0.77,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0
Doc4,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.77,0.02,0.02,0.02,0.02,0.02,0.02,8
Doc5,0.01,0.01,0.01,0.01,0.01,0.48,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.42,0.01,5
Doc6,0.52,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.27,0.02,0.02,0.02,0.02,0.02,0.02,0
Doc7,0.48,0.02,0.02,0.02,0.02,0.02,0.3,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0
Doc8,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.84,14
Doc9,0.01,0.01,0.01,0.01,0.28,0.01,0.01,0.01,0.01,0.64,0.01,0.01,0.01,0.01,0.01,9


### Review topics distribution across documents

In [73]:
df_topic_distribution = df_document_topic['dominant_topic'].value_counts().reset_index(name="Num Documents")
df_topic_distribution.columns = ['Topic Num', 'Num Documents']
df_topic_distribution

Unnamed: 0,Topic Num,Num Documents
0,4,8803
1,11,8173
2,2,7778
3,7,7767
4,9,7157
5,0,6912
6,13,6895
7,12,6747
8,10,6539
9,1,6215


In [76]:
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(best_cv_lda_model, cv_subset, vectorizer=cv, mds='tsne')
panel



### Get the top 15 keywords for each topic

In [82]:
# Show top n keywords for each topic
def show_topics(vectorizer=cv, lda_model=best_cv_lda_model, n_words=20):
    # Use the arguments rather than the global objects
    keywords = np.array(vectorizer.get_feature_names())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        # Negate the weights so that argsort returns the largest first
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords

topic_keywords = show_topics(vectorizer=cv, lda_model=best_cv_lda_model, n_words=15)        

# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords

Unnamed: 0,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11,Word 12,Word 13,Word 14
Topic 0,good,way,lose,weight,india,fat,school,high,reduce,hate,gain,fast,exercise,month,body
Topic 1,good,student,place,video,travel,time,laptop,youtube,visit,example,light,speed,possible,international,eye
Topic 2,use,language,programming,number,pro,word,die,learn,period,phone,different,difference,sentence,code,window
Topic 3,work,study,improve,english,good,test,drug,help,center,social,skill,health,pass,benefit,sell
Topic 4,good,job,app,engineering,india,website,android,phone,company,college,software,course,song,interview,process
Topic 5,prepare,stop,long,girl,car,exam,2017,feel,relationship,new,year,dream,start,career,hair
Topic 6,people,person,engineer,like,instagram,difference,eat,know,view,work,favorite,life,animal,chinese,differ
Topic 7,quora,question,ask,people,answer,mean,time,need,message,come,photo,white,google,delete,number
Topic 8,note,indian,india,500,1000,state,rs,rupee,ban,government,card,compare,live,average,hotel
Topic 9,know,love,business,cause,exist,energy,like,mind,start,fact,believe,god,life,difference,people


#### TASK: Add a new column to the original quora dataframe that labels each question into one of the topic categories.

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


Unnamed: 0,Question,Topic
0,What is the step by step guide to invest in sh...,5
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,16
2,How can I increase the speed of my internet co...,17
3,Why am I mentally very lonely? How can I solve...,11
4,"Which one dissolve in water quikly sugar, salt...",14
5,Astrology: I am a Capricorn Sun Cap moon and c...,1
6,Should I buy tiago?,0
7,How can I be a good geologist?,10
8,When do you use シ instead of し?,19
9,Motorola (company): Can I hack my Charter Moto...,17
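
The cell that produced the `Topic` column is not shown. One common way to build such a column is to transform the document-term matrix and take the most probable topic per row. The sketch below does this on a toy dataframe; all names prefixed with `toy_` are hypothetical, and the labels above go up to 19, so they appear to come from a model with at least 20 topics rather than from this sketch.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the Quora dataframe (made-up questions).
toy_df = pd.DataFrame({'Question': [
    "How do I lose weight fast?",
    "What is the best way to lose weight?",
    "How can I learn Python programming?",
    "Which programming language should I learn first?",
]})

toy_cv = CountVectorizer(stop_words='english')
toy_dtm = toy_cv.fit_transform(toy_df['Question'])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=100)
toy_lda.fit(toy_dtm)

# transform() returns one row of topic probabilities per document;
# argmax picks the most probable topic for each question.
topic_probs = toy_lda.transform(toy_dtm)
toy_df['Topic'] = topic_probs.argmax(axis=1)
print(toy_df)
```

For the real data, the same two lines would be applied to the full document-term matrix (all 404,289 rows) rather than the 25% subset, so that every question in `df` receives a label.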


# Great job!