# Topic Modeling Assessment Project

Welcome to your Topic Modeling Assessment! For this project you will be working with a dataset of over 400,000 Quora questions that have no labeled category, and attempting to find an optimal number of categories to assign these questions to. The .csv file of these text questions can be found in the NLP folder.


#### Task: Import pandas and read in the quora_questions.csv file.

In [1]:
import pandas as pd

# Plotting tools
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
%matplotlib inline



In [2]:
df = pd.read_csv('quora_questions.csv')

In [3]:
df.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


# Preprocessing

#### Task: Create a vectorized document term matrix. 

- How do you want to clean up your text with regard to stopwords, special characters, and other situations?
- Do you want to use a CountVectorizer or a TfidfVectorizer?
- You may want to explore the max_df and min_df parameters.
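
To make the effect of `max_df` and `min_df` concrete, here is a minimal sketch on a hypothetical four-document corpus (the sentences are made up for illustration and are not taken from quora_questions.csv). It assumes scikit-learn's default token pattern, which ignores one-character tokens such as "i".

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not taken from quora_questions.csv
toy_docs = [
    "how do i learn python",
    "how do i learn java",
    "how do i lose weight",
    "how do i lose weight fast",
]

# max_df=0.9 drops terms found in more than 90% of documents ("how", "do").
# min_df=2 drops terms found in fewer than 2 documents ("python", "java", "fast").
count_vec = CountVectorizer(max_df=0.9, min_df=2)
counts = count_vec.fit_transform(toy_docs)
kept_terms = sorted(count_vec.vocabulary_)
print(kept_terms)

# The same filters apply to TfidfVectorizer; only the cell values differ
# (raw counts versus tf-idf weights).
tfidf_vec = TfidfVectorizer(max_df=0.9, min_df=2)
tfidf = tfidf_vec.fit_transform(toy_docs)
print(counts.shape, tfidf.shape)
```

LDA is usually fitted on raw counts, which is one reason a CountVectorizer is a common choice here, but that is a modelling decision rather than a requirement.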


In [4]:
import re 
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

# Create our list of punctuation marks
punctuations = string.punctuation

# Load the small English pipeline and reuse spaCy's built-in stop word set
nlp = spacy.load('en_core_web_sm')
stop_words = STOP_WORDS

def spacy_tokenizer(text):
    # remove html tags from all of the text before processing
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', text)
    # Creating our token object, which is used to create documents with linguistic annotations.
    # we disabled the parser and ner parts of the pipeline in order to speed up parsing
    mytokens = nlp(cleantext, disable=['parser', 'ner'])

    # Lemmatizing each token and converting each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Removing stop words
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens
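
The HTML-stripping step above relies on the non-greedy pattern `<.*?>`. The sketch below shows why the `?` matters, using only the standard library. It is a simplified stand-in for the spaCy tokenizer (no lemmatization), and both the sample sentence and the `TOY_STOP_WORDS` set are hypothetical.

```python
import re

HTML_TAG = re.compile(r"<.*?>")            # non-greedy: matches one tag at a time
TOY_STOP_WORDS = {"why", "is", "the"}      # hypothetical stand-in for spaCy's STOP_WORDS

def simple_tokenizer(text):
    """Strip HTML tags, lowercase, split on non-alphanumerics, drop stop words."""
    text = HTML_TAG.sub("", text)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in TOY_STOP_WORDS]

sample = "<p>Why is the <b>sky</b> blue?</p>"
print(simple_tokenizer(sample))

# A greedy pattern would match from the first "<" to the last ">" and
# delete the whole sentence along with the tags.
greedy_result = re.sub(r"<.*>", "", sample)
print(repr(greedy_result))
```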

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
cv = CountVectorizer(tokenizer=spacy_tokenizer, max_df=0.90, min_df=10, stop_words='english')

In [13]:
dtm = cv.fit_transform(df['Question'])

In [14]:
dtm

<404289x11984 sparse matrix of type '<class 'numpy.int64'>'
	with 1838887 stored elements in Compressed Sparse Row format>

# LDA Modelling

#### TASK: Using Scikit-Learn create an instance of LDA. 

- You can manually run and tune your model, then evaluate the resulting clusters. 
- Or you can use gridsearch to try and identify the best number of topics to use. 
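
As a rough illustration of the grid search option, the sketch below runs `GridSearchCV` over `n_components` on a hypothetical eight-document corpus. With no `scoring` argument, the search falls back on the estimator's own `score` method, which for LDA is an approximate log-likelihood. The corpus, the candidate values, and `cv=2` are assumptions chosen to keep the example fast; on the full Quora matrix each fit would take far longer.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Hypothetical toy corpus, not taken from quora_questions.csv
toy_docs = [
    "learn python programming fast",
    "learn java programming language",
    "best programming language to learn",
    "python or java for beginners",
    "lose weight with diet plan",
    "safe diet plan to lose weight",
    "weight loss exercise plan",
    "exercise and diet for weight",
]
toy_dtm = CountVectorizer().fit_transform(toy_docs)

# Candidate topic counts to compare (illustrative values only)
param_grid = {"n_components": [2, 3, 4]}

search = GridSearchCV(
    LatentDirichletAllocation(learning_method="online", max_iter=5, random_state=100),
    param_grid=param_grid,
    cv=2,  # small fold count because the toy corpus is tiny
)
search.fit(toy_dtm)

print("Best params:", search.best_params_)
print("Mean held-out log-likelihood per candidate:", search.cv_results_["mean_test_score"])
```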


In [15]:
from sklearn.decomposition import LatentDirichletAllocation

In [16]:
# Build LDA Model
lda_model = LatentDirichletAllocation(n_components=20,           # Number of topics
                                      max_iter=20,               # Max learning iterations
                                      learning_method='online',  # Mini-batch updates; suits large corpora
                                      random_state=100,          # Random state
                                      batch_size=128,            # Docs per mini-batch; can be increased
                                      evaluate_every=-1,         # Compute perplexity every n iters; -1 disables it
                                      n_jobs=-1,                 # Use all available CPUs
                                     )

print(lda_model)  # Model attributes

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='online', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=20,
                          mean_change_tol=0.001, n_components=20, n_jobs=-1,
                          perp_tol=0.1, random_state=100, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)


In [17]:
# This can take a while; we're dealing with a large number of documents!

lda_output = lda_model.fit_transform(dtm)


## Saving Model

In [65]:
#save model to local folder
import pickle 
  
# Save the trained model as a pickle string. 
saved_lda_model = pickle.dumps(lda_model) 
  


In [None]:
# # Load the pickled model
# lda_from_pickle = pickle.loads(saved_lda_model)

In [66]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
# Save the model as a pickle in a file 
joblib.dump(lda_model, 'saved_lda_model.pkl') 



['saved_lda_model.pkl']

In [None]:
# Load the model from the file 
# lda_model___ = joblib.load('saved_lda_model.pkl')  

In [None]:
# Use the loaded model to infer topic distributions (LDA provides transform(), not predict())
# lda_model___.transform(dtm)

#### Task: Evaluate the different models you have run and decide which one you think produces the best clusters.


The evaluation part could involve:
- Printing out the top 15 most common words for each of the topics and seeing if they make sense.
- Using the perplexity and log-likelihood scores.
- Using the pyLDAvis tool to investigate the different clusters. 

In [18]:
# Log Likelihood: Higher the better
print("Log Likelihood: ", lda_model.score(dtm))

# Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda_model.perplexity(dtm))

# See model parameters
print(lda_model.get_params())

Log Likelihood:  -15566059.752121642
Perplexity:  3637.476085169163
{'batch_size': 128, 'doc_topic_prior': None, 'evaluate_every': -1, 'learning_decay': 0.7, 'learning_method': 'online', 'learning_offset': 10.0, 'max_doc_update_iter': 100, 'max_iter': 20, 'mean_change_tol': 0.001, 'n_components': 20, 'n_jobs': -1, 'perp_tol': 0.1, 'random_state': 100, 'topic_word_prior': None, 'total_samples': 1000000.0, 'verbose': 0}
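
The two scores printed above are tied together by the formula in the code comment: perplexity = exp(-log-likelihood / number of words). As a sanity check, the sketch below inverts that relation to recover the word count the two numbers imply. It assumes both scores were computed on the same matrix (as in the cell above) and that the word count is the sum of all entries in `dtm`.

```python
import math

# Values copied from the printed output of lda_model.score(dtm) and lda_model.perplexity(dtm)
log_likelihood = -15566059.752121642
perplexity = 3637.476085169163

# perplexity = exp(-log_likelihood / n_words)  =>  n_words = -log_likelihood / ln(perplexity)
implied_word_count = -log_likelihood / math.log(perplexity)
per_word_log_likelihood = log_likelihood / implied_word_count

print("Implied total word count:", round(implied_word_count))
print("Log-likelihood per word:", per_word_log_likelihood)
```

The implied count should be at least 1,838,887, the number of stored elements in `dtm`, because every stored count is at least 1. Note that the raw log-likelihood scales with corpus size, so the per-word figure or perplexity is the safer quantity to compare across datasets.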


In [32]:
import random
feature_names = cv.get_feature_names()  # fetch the vocabulary once instead of on every iteration
for i in range(10):
    random_word_id = random.randint(0, len(feature_names) - 1)
    print(feature_names[random_word_id])

protest
ola
jerry
quietly
grill
list
context
secure
hedge
impractical


### simple exploration

In [21]:
len(lda_model.components_)

20

In [22]:
lda_model.components_

array([[1.33966081e+01, 5.00000010e-02, 5.00000011e-02, ...,
        5.00000000e-02, 5.00000000e-02, 5.00000001e-02],
       [5.00000000e-02, 5.00000003e-02, 5.00000003e-02, ...,
        5.00000003e-02, 5.00000002e-02, 5.00000002e-02],
       [5.00000000e-02, 1.30833382e+01, 5.00000007e-02, ...,
        5.00000000e-02, 5.00000000e-02, 5.00000000e-02],
       ...,
       [5.00000000e-02, 5.00000001e-02, 5.71120221e+01, ...,
        5.00000003e-02, 5.00000000e-02, 5.00000002e-02],
       [5.00000000e-02, 5.00000000e-02, 5.00000002e-02, ...,
        5.00000000e-02, 5.00000000e-02, 5.00000001e-02],
       [5.00000002e-02, 5.00000002e-02, 5.00000001e-02, ...,
        5.00000000e-02, 5.00000000e-02, 5.00000001e-02]])

In [37]:
len(lda_model.components_[0])

11984

In [24]:
single_topic = lda_model.components_[0]

In [25]:
# Returns the indices that would sort this array.
single_topic.argsort()

array([  875,  8783,  4902, ..., 11715,  7557,  6487])

In [27]:
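# Weight of word index 1802 in this topic. A value of about 0.05 (the default topic_word_prior, 1 / n_components)
# means the word is barely associated with the topic; the least representative index is single_topic.argsort()[0].
single_topic[1802]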

0.05000000002925991

In [33]:
single_topic.argsort()[-10:]

array([ 8225, 10752,  9460,    24,  1546,   153,  3331, 11715,  7557,
        6487])

In [28]:
top_word_indices = single_topic.argsort()[-10:]

In [29]:
for index in top_word_indices:
    print(cv.get_feature_names()[index])

plan
term
safe
1
big
2
different
weight
number
lose
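
Because `argsort()` sorts in ascending order, the slice `[-10:]` lists the top words from weakest to strongest, which is why "lose" is printed last. A minimal sketch with a hypothetical five-word vocabulary and made-up weights shows how to reverse the slice so the strongest word comes first.

```python
import numpy as np

# Hypothetical vocabulary and topic weights, for illustration only
vocab = np.array(["plan", "safe", "weight", "number", "lose"])
topic_weights = np.array([3.0, 0.05, 12.5, 20.0, 41.0])

ascending = topic_weights.argsort()   # indices from smallest to largest weight
top3 = ascending[-3:][::-1]           # three largest, strongest first
least = ascending[0]                  # index of the smallest weight

print(vocab[top3].tolist())
print(vocab[least])
```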


### Further Exploration

In [30]:
feature_names = cv.get_feature_names()  # fetch the vocabulary once instead of per word
for index, topic in enumerate(lda_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['week', 'today', 'reduce', 'option', 'god', 'plan', 'term', 'safe', '1', 'big', '2', 'different', 'weight', 'number', 'lose']


THE TOP 15 WORDS FOR TOPIC #1
['wear', 'tip', 'process', 'skill', '2017', 'interview', 'government', '3', 'improve', 'man', 'old', 'woman', 'job', 'indian', 'year']


THE TOP 15 WORDS FOR TOPIC #2
['complete', 'handle', 'offer', 'drug', 'best', 'kill', 'common', 'startup', 'pro', 'code', 'affect', '10', 'pay', 'important', 'write']


THE TOP 15 WORDS FOR TOPIC #3
['bollywood', 'gift', 'area', 'expect', 'film', 'break', 'require', 'iit', 'fact', 'remove', 'hotel', '5', 'earth', 'study', 'happen']


THE TOP 15 WORDS FOR TOPIC #4
['java', 'salary', 'culture', 'favorite', 'child', 'learn', 'programming', 'game', 'high', 'stop', 'language', 'love', 'new', 'book', 'difference']


THE TOP 15 WORDS FOR TOPIC #5
['universe', 'open', 'city', 'future', 'speak', 'china', 'increase', 'download', 'food', 'experience', 'great', 'software', 'son

#### TASK: Add a new column to the original Quora dataframe that assigns each question to one of the topic categories.

In [42]:
import numpy as np  # needed for np.round and np.argmax below

# Create Document - Topic Matrix
lda_output = lda_model.transform(dtm)

# column names
topicnames = ["Topic" + str(i) for i in range(lda_model.n_components)]

# index names
docnames = ["Doc" + str(i) for i in range(len(df['Question']))]

# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)

# Get dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['dominant_topic'] = dominant_topic
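
The cell above stores the dominant topic in a separate document-topic dataframe. To attach it to the original dataframe as the task asks, one option is to take the argmax of the unrounded topic matrix and assign it as a new column. The sketch below uses a hypothetical three-question dataframe and a made-up 3x3 matrix in place of `df` and `lda_output`; the column name `Topic` is also an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df and lda_output
toy_df = pd.DataFrame({"Question": ["q0", "q1", "q2"]})
toy_lda_output = np.array([
    [0.100, 0.700, 0.200],
    [0.346, 0.300, 0.354],   # topics 0 and 2 both round to 0.35
    [0.050, 0.050, 0.900],
])

# Row order matches the dataframe, so the argmax can be assigned directly.
toy_df["Topic"] = toy_lda_output.argmax(axis=1)

# Rounding first can create ties, and argmax then returns the first tied index.
rounded_argmax = np.round(toy_lda_output, 2).argmax(axis=1)

print(toy_df)
print(rounded_argmax.tolist())
```

Taking the argmax before rounding avoids ties such as a row showing 0.35 for two topics, where the rounded values alone cannot tell which topic actually scored higher.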

In [47]:
# Styling
def color_green(val):
    color = 'green' if val > .1 else 'black'
    return 'color: {col}'.format(col=color)

def make_bold(val):
    weight = 700 if val > .1 else 400
    return 'font-weight: {weight}'.format(weight=weight)

# Apply Style
df_document_topics = df_document_topic.head(30).style.applymap(color_green).applymap(make_bold)
df_document_topics

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,Topic10,Topic11,Topic12,Topic13,Topic14,Topic15,Topic16,Topic17,Topic18,Topic19,dominant_topic
Doc0,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.76,0.01,0.13,0.01,0.01,0.01,0.01,13
Doc1,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.81,0.01,0.01,0.01,0.01,0.01,0.01,13
Doc2,0.01,0.01,0.01,0.01,0.01,0.15,0.01,0.01,0.72,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,8
Doc3,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.76,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,10
Doc4,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.89,0.01,0.01,0.01,0.01,0.01,14
Doc5,0.01,0.01,0.01,0.01,0.01,0.89,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,5
Doc6,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.52,0.03,0.03,0.03,0.03,0.03,0.03,13
Doc7,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.52,19
Doc8,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.35,0.02,0.02,0.02,0.02,0.35,0.02,0.02,0.02,0.02,0.02,0.02,8
Doc9,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.61,0.21,0.01,0.01,0.01,0.01,0.01,0.01,12


In [60]:
df_document_topic.dominant_topic.value_counts()

0     37992
1     31724
4     28047
2     25469
3     25120
8     24814
5     23467
13    22409
6     21329
10    20254
14    20135
7     18206
16    16759
11    16624
9     14505
19    13816
12    12547
15    10940
17    10436
18     9696
Name: dominant_topic, dtype: int64

In [63]:
df_document_topic.loc[df_document_topic['dominant_topic']==1]

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,...,Topic11,Topic12,Topic13,Topic14,Topic15,Topic16,Topic17,Topic18,Topic19,dominant_topic
Doc25,0.01,0.84,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,...,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,1
Doc36,0.01,0.86,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,...,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,1
Doc80,0.00,0.82,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.10,0.00,1
Doc133,0.01,0.72,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,...,0.01,0.01,0.01,0.15,0.01,0.01,0.01,0.01,0.01,1
Doc135,0.03,0.52,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,...,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,1
Doc166,0.01,0.61,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,...,0.01,0.01,0.01,0.01,0.01,0.21,0.01,0.01,0.01,1
Doc187,0.01,0.51,0.01,0.13,0.01,0.01,0.01,0.01,0.01,0.01,...,0.01,0.01,0.01,0.26,0.01,0.01,0.01,0.01,0.01,1
Doc222,0.02,0.68,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,...,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,1
Doc252,0.01,0.38,0.01,0.13,0.01,0.01,0.13,0.01,0.01,0.01,...,0.13,0.01,0.01,0.13,0.01,0.01,0.01,0.01,0.01,1
Doc262,0.01,0.61,0.01,0.01,0.21,0.01,0.01,0.01,0.01,0.01,...,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,1


### Review topics distribution across documents

In [48]:
df_topic_distribution = df_document_topic['dominant_topic'].value_counts().reset_index(name="Num Documents")
df_topic_distribution.columns = ['Topic Num', 'Num Documents']
df_topic_distribution

Unnamed: 0,Topic Num,Num Documents
0,0,37992
1,1,31724
2,4,28047
3,2,25469
4,3,25120
5,8,24814
6,5,23467
7,13,22409
8,6,21329
9,10,20254


In [52]:
import pyLDAvis
import pyLDAvis.sklearn

In [64]:
# Plotting tools

pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model, dtm, cv, mds='tsne')
panel



### Get the top 15 keywords for each topic

In [None]:
# Show top n keywords for each topic
import numpy as np

# Defaults point at the objects defined earlier in this notebook (cv and lda_model)
def show_topics(vectorizer=cv, lda_model=lda_model, n_words=20):
    keywords = np.array(vectorizer.get_feature_names())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        # Negate the weights so argsort returns the largest values first
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords

topic_keywords = show_topics(vectorizer=cv, lda_model=lda_model, n_words=15)

# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords

# Great job!