___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Topic Modeling Assessment Project

Welcome to your Topic Modeling Assessment! For this project you will work with a dataset of over 400,000 Quora questions that have no labeled category, and attempt to find 20 categories to assign these questions to. The .csv file of these questions can be found in the Topic-Modeling folder.

Remember you can always check the solutions notebook and video lecture for any questions.

#### Task: Import pandas and read in the quora_questions.csv file.

In [1]:
import pandas as pd
npr = pd.read_csv('quora_questions.csv')
print(npr.head())
npr.shape

                                            Question
0  What is the step by step guide to invest in sh...
1  What is the story of Kohinoor (Koh-i-Noor) Dia...
2  How can I increase the speed of my internet co...
3  Why am I mentally very lonely? How can I solve...
4  Which one dissolve in water quikly sugar, salt...


(404289, 1)

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = cv.fit_transform(npr['Question'])
print(dtm[0])
print(dtm.shape)
print(len(cv.get_feature_names()))

  (0, 32971)	2
  (0, 15464)	1
  (0, 18192)	1
  (0, 31209)	1
  (0, 21408)	1
  (0, 17507)	1
(404289, 38669)
38669


In [26]:
feature_names = cv.get_feature_names()
non_zero_counts = dtm[0]

# Iterate through the non-zero entries and print the terms and their counts
for col_index in non_zero_counts.indices:
    term = feature_names[col_index]
    count = non_zero_counts[0, col_index]
    print(f"Term: {term}, Count: {count}")

Term: step, Count: 2
Term: guide, Count: 1
Term: invest, Count: 1
Term: share, Count: 1
Term: market, Count: 1
Term: india, Count: 1
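The same term lookup can be written without indexing the sparse row once per column. The sketch below is only an illustration on a hypothetical two-document corpus (`toy_docs`, `toy_cv`, and `index_to_term` are made-up names, not part of the assessment): it pairs the `indices` and `data` arrays of one CSR row and inverts the vectorizer's `vocabulary_` mapping to recover the terms.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the Quora questions.
toy_docs = [
    "step by step guide to the share market",
    "guide to the market",
]

toy_cv = CountVectorizer(stop_words="english")
toy_dtm = toy_cv.fit_transform(toy_docs)

# vocabulary_ maps term -> column index; invert it to look terms up by column.
index_to_term = {idx: term for term, idx in toy_cv.vocabulary_.items()}

# For a single CSR row, .indices holds the non-zero columns and .data the counts.
row = toy_dtm[0]
counts = {index_to_term[i]: int(c) for i, c in zip(row.indices, row.data)}

for term, count in counts.items():
    print(f"Term: {term}, Count: {count}")
```

Using `vocabulary_` also avoids depending on `get_feature_names()`, which newer scikit-learn releases replace with `get_feature_names_out()`.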


In [20]:
from sklearn.decomposition import LatentDirichletAllocation
LDA = LatentDirichletAllocation(n_components=7,random_state=42)
LDA.fit(dtm)

LatentDirichletAllocation(n_components=7, random_state=42)
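As a rough illustration of what a fitted LDA model exposes, the sketch below fits a two-topic model on a hypothetical four-document corpus (all `toy_*` names are made up for this example). It assumes only the standard scikit-learn behaviour that `fit_transform` returns one topic distribution per document and that `components_` has one row per topic.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the Quora questions.
toy_docs = [
    "how do I invest in the share market",
    "best way to invest money in shares",
    "how do I learn python programming",
    "best book to learn programming",
]

toy_cv = CountVectorizer(stop_words="english")
toy_dtm = toy_cv.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)

# One row per document, one column per topic; each row is a probability distribution.
doc_topics = toy_lda.fit_transform(toy_dtm)

print(doc_topics.shape)            # (number of documents, number of topics)
print(toy_lda.components_.shape)   # (number of topics, vocabulary size)
print(doc_topics.sum(axis=1))      # each row sums to approximately 1

# The most probable topic for each document.
dominant_topic = doc_topics.argmax(axis=1)
print(dominant_topic)
```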

In [24]:
print(len(LDA.components_))
print(len(LDA.components_[0]))
print(LDA.components_[:, :5])  
print(LDA.components_.shape)
for index, topic in enumerate(LDA.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([cv.get_feature_names()[i] for i in topic.argsort()[::-1][:15]])
    print('\n')

7
38669
[[1.43027512e-01 1.81171965e-01 1.46422411e-01 5.14254964e+00
  1.72316092e+00]
 [1.43597885e-01 6.86689433e+01 1.42857928e-01 1.42857314e-01
  1.42857667e-01]
 [5.30622659e+00 1.50086747e-01 1.42857911e-01 1.42857311e-01
  1.42857660e-01]
 [1.43676091e+01 7.53570572e+02 1.42857822e-01 1.43026583e-01
  1.42857604e-01]
 [1.43251681e-01 1.43133965e-01 1.42857771e-01 1.42994613e-01
  1.43458562e-01]
 [2.47595566e-01 1.42937359e-01 2.13928835e+00 1.42857249e-01
  5.60978550e-01]
 [2.66486917e+01 1.43154531e-01 1.42857806e-01 1.42857290e-01
  1.43829037e-01]]
(7, 38669)
THE TOP 15 WORDS FOR TOPIC #0
['best', 'india', 'phone', 'use', 'good', 'does', 'engineering', 'android', 'app', 'google', 'software', 'mobile', 'using', 'company', 'free']


THE TOP 15 WORDS FOR TOPIC #1
['best', 'money', 'learn', 'way', 'make', 'english', '500', 'online', 'notes', '1000', 'improve', 'language', 'stop', 'programming', 'ways']


THE TOP 15 WORDS FOR TOPIC #2
['life', 'does', 'time', 'good', 'best', '


# Preprocessing

#### Task: Use TF-IDF Vectorization to create a vectorized document term matrix. You may want to explore the max_df and min_df parameters.

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm2 = tfidf.fit_transform(npr['Question'])
print(dtm2[0])
print(dtm2.shape)
print(len(tfidf.get_feature_names()))

  (0, 17507)	0.19115261267972286
  (0, 21408)	0.3003349492828469
  (0, 31209)	0.3291744678610915
  (0, 18192)	0.31649207559225107
  (0, 15464)	0.3894800676252344
  (0, 32971)	0.7162693694576815
(404289, 38669)
38669


In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf2 = TfidfVectorizer(max_df=0.9, min_df=3, stop_words='english')
dtm3 = tfidf2.fit_transform(npr['Question'])
print(dtm3[0])
print(dtm3.shape)
print(len(tfidf2.get_feature_names()))

  (0, 13568)	0.19115261267972286
  (0, 16555)	0.3003349492828469
  (0, 24181)	0.3291744678610915
  (0, 14125)	0.31649207559225107
  (0, 11995)	0.3894800676252344
  (0, 25551)	0.7162693694576815
(404289, 29917)
29917
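In the two runs above, tightening the settings from `max_df=0.95, min_df=2` to `max_df=0.9, min_df=3` shrinks the vocabulary from 38669 to 29917 terms, while the weights of the first question stay the same because none of its six terms is dropped. The sketch below shows the same mechanism on a hypothetical four-document corpus (the names `toy_docs`, `loose`, `strict`, and `capped` are illustrative only).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; "best" and "to" appear in three of the four documents.
toy_docs = [
    "best phone to buy",
    "best laptop to buy",
    "best book to read",
    "rare zebra question",
]

# min_df=1 keeps every term that appears in at least one document.
loose = TfidfVectorizer(min_df=1).fit(toy_docs)

# min_df=2 drops terms that appear in fewer than two documents.
strict = TfidfVectorizer(min_df=2).fit(toy_docs)

# max_df=0.5 drops terms that appear in more than half of the documents.
capped = TfidfVectorizer(min_df=1, max_df=0.5).fit(toy_docs)

print(len(loose.vocabulary_), sorted(loose.vocabulary_))
print(len(strict.vocabulary_), sorted(strict.vocabulary_))
print(len(capped.vocabulary_), sorted(capped.vocabulary_))
```

Raising `min_df` removes rare terms, while lowering `max_df` removes terms so common that they carry little topical signal.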


<404289x38669 sparse matrix of type '<class 'numpy.float64'>'
	with 2002912 stored elements in Compressed Sparse Row format>

# Non-negative Matrix Factorization

#### TASK: Using Scikit-Learn, create an instance of NMF with 20 expected components. (Use random_state=42.)

In [4]:
from sklearn.decomposition import NMF
nmf_model = NMF(n_components=20,random_state=42)
nmf_model.fit(dtm2)



NMF(n_components=20, random_state=42)

In [5]:
nmf_model.components_.shape

(20, 38669)
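The shape `(20, 38669)` is one row per topic and one column per vocabulary term. As a hedged sketch of the factorization behind it, the example below fits NMF on a hypothetical toy corpus and checks that the document-topic matrix `W` and the topic-term matrix `H` are non-negative and multiply back to the shape of the input. The `toy_*` names, `W`, `H`, and `max_iter=500` are choices made for this example, not settings from the assessment.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the Quora questions.
toy_docs = [
    "best way to make money online",
    "how to earn money from home",
    "what is the purpose of life",
    "what is the meaning of life and death",
]

toy_tfidf = TfidfVectorizer(stop_words="english")
X = toy_tfidf.fit_transform(toy_docs)

toy_nmf = NMF(n_components=2, random_state=42, max_iter=500)
W = toy_nmf.fit_transform(X)   # document-topic weights: (n_docs, n_topics)
H = toy_nmf.components_        # topic-term weights:     (n_topics, n_terms)

print(W.shape, H.shape)

# W @ H approximates the TF-IDF matrix; a smaller norm means a closer fit.
reconstruction_error = np.linalg.norm(X.toarray() - W @ H)
print(reconstruction_error)
```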

In [6]:
for index,topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([tfidf.get_feature_names()[i] for i in topic.argsort()[::-1][:15]])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['best', 'movies', 'book', 'books', '2016', 'ways', 'movie', 'laptop', 'buy', 'phone', 'places', 'visit', 'place', 'read', 'thing']


THE TOP 15 WORDS FOR TOPIC #1
['does', 'mean', 'work', 'feel', 'long', 'cost', 'compare', 'really', 'exist', 'use', 'differ', 'looking', 'sex', 'recruit', 'majors']


THE TOP 15 WORDS FOR TOPIC #2
['quora', 'questions', 'question', 'ask', 'answer', 'answers', 'google', 'asked', 'delete', 'improvement', 'easily', 'post', 'needing', 'answered', 'add']


THE TOP 15 WORDS FOR TOPIC #3
['money', 'make', 'online', 'earn', 'ways', 'youtube', 'easy', 'home', 'free', 'internet', 'black', 'friends', 'investment', 'website', 'using']


THE TOP 15 WORDS FOR TOPIC #4
['life', 'purpose', 'meaning', 'thing', 'important', 'real', 'moment', 'change', 'want', 'live', 'changed', 'death', 'day', 'earth', 'balance']


THE TOP 15 WORDS FOR TOPIC #5
['india', 'pakistan', 'war', 'spotify', 'job', 'available', 'olympics', 'country', 'business', 'chi

In [7]:
topic_results = nmf_model.transform(dtm2)
print(topic_results.shape)
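# Note: this binds a second name to the same DataFrame (no copy), so 'Topic' is added to npr as well
npr2 = npr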
npr2['Topic'] = topic_results.argmax(axis=1)
print(npr.head(20))

(404289, 20)
                                             Question  Topic
0   What is the step by step guide to invest in sh...      5
1   What is the story of Kohinoor (Koh-i-Noor) Dia...     16
2   How can I increase the speed of my internet co...     17
3   Why am I mentally very lonely? How can I solve...     11
4   Which one dissolve in water quikly sugar, salt...     14
5   Astrology: I am a Capricorn Sun Cap moon and c...      1
6                                 Should I buy tiago?      0
7                      How can I be a good geologist?     10
8                     When do you use シ instead of し?     19
9   Motorola (company): Can I hack my Charter Moto...     17
10  Method to find separation of slits using fresn...      2
11        How do I read and find my YouTube comments?      3
12               What can make Physics easy to learn?      3
13        What was your first sexual experience like?      9
14  What are the laws to change your status from a...      1
15  What wo

In [30]:
# ERROR: these topics come from nmf_model.fit(dtm), i.e. the CountVectorizer matrix, which is the wrong dtm for this task (TF-IDF)

for index,topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([cv.get_feature_names()[i] for i in topic.argsort()[::-1][:15]])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['best', 'book', 'books', 'ways', 'movies', 'buy', 'laptop', '2016', 'places', 'visit', 'place', 'online', 'phone', 'movie', 'coaching']


THE TOP 15 WORDS FOR TOPIC #1
['does', 'mean', 'work', 'feel', 'compare', 'long', 'cost', 'differ', 'really', 'looking', 'universities', 'recruit', 'use', 'grads', 'majors']


THE TOP 15 WORDS FOR TOPIC #2
['india', 'pakistan', 'war', 'china', 'country', 'spotify', 'available', 'job', 'olympics', 'buy', 'company', 'business', 'start', 'world', 'engineering']


THE TOP 15 WORDS FOR TOPIC #3
['people', 'think', 'don', 'ask', 'questions', 'believe', 'world', 'mind', 'easily', 'hate', 'google', 'blowing', 'use', 'say', 'white']


THE TOP 15 WORDS FOR TOPIC #4
['500', 'notes', '1000', 'rs', 'indian', 'rupee', 'black', 'banning', 'ban', 'government', 'think', '2000', 'currency', 'modi', 'economy']


THE TOP 15 WORDS FOR TOPIC #5
['like', 'feel', 'look', 'culture', 'companies', 'work', 'different', 'girl', 'sex', 'live', 'girl

In [8]:
nmf_model2 = NMF(n_components=20,random_state=42)
nmf_model2.fit(dtm3)
for index,topic in enumerate(nmf_model2.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([tfidf2.get_feature_names()[i] for i in topic.argsort()[::-1][:15]])
    print('\n')



THE TOP 15 WORDS FOR TOPIC #0
['best', 'movies', 'books', 'book', '2016', 'ways', 'movie', 'laptop', 'buy', 'phone', 'places', 'visit', 'place', 'read', 'thing']


THE TOP 15 WORDS FOR TOPIC #1
['does', 'mean', 'work', 'feel', 'long', 'cost', 'compare', 'really', 'exist', 'use', 'differ', 'looking', 'recruit', 'sex', 'grads']


THE TOP 15 WORDS FOR TOPIC #2
['quora', 'questions', 'question', 'ask', 'answer', 'answers', 'google', 'asked', 'delete', 'improvement', 'easily', 'post', 'needing', 'answered', 'add']


THE TOP 15 WORDS FOR TOPIC #3
['money', 'make', 'online', 'earn', 'ways', 'youtube', 'easy', 'home', 'free', 'internet', 'black', 'friends', 'com', 'investment', 'website']


THE TOP 15 WORDS FOR TOPIC #4
['life', 'purpose', 'meaning', 'thing', 'important', 'real', 'moment', 'want', 'change', 'live', 'changed', 'death', 'day', 'earth', 'balance']


THE TOP 15 WORDS FOR TOPIC #5
['india', 'pakistan', 'war', 'spotify', 'job', 'business', 'available', 'olympics', 'country', 'start'

In [9]:
topic_results2 = nmf_model2.transform(dtm3)
print(topic_results2.shape)
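# Note: npr3 is another name for npr (no copy), so this overwrites any existing 'Topic' column in npr
npr3 = npr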
npr3['Topic'] = topic_results2.argmax(axis=1)
print(npr.head(20))

(404289, 20)
                                             Question  Topic
0   What is the step by step guide to invest in sh...      5
1   What is the story of Kohinoor (Koh-i-Noor) Dia...     16
2   How can I increase the speed of my internet co...     18
3   Why am I mentally very lonely? How can I solve...     11
4   Which one dissolve in water quikly sugar, salt...     14
5   Astrology: I am a Capricorn Sun Cap moon and c...      1
6                                 Should I buy tiago?      0
7                      How can I be a good geologist?     10
8                     When do you use シ instead of し?      2
9   Motorola (company): Can I hack my Charter Moto...     18
10  Method to find separation of slits using fresn...      2
11        How do I read and find my YouTube comments?      3
12               What can make Physics easy to learn?      8
13        What was your first sexual experience like?      9
14  What are the laws to change your status from a...      1
15  What wo

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=20, random_state=42, shuffle=False, solver='cd', tol=0.0001,
  verbose=0)

#### TASK: Print out the top 15 most common words for each of the 20 topics.
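Two slicing idioms for picking the top words give the same set in opposite order: `argsort()[-n:]` lists the strongest term last, while `argsort()[::-1][:n]`, used in the cells earlier in this notebook, lists it first. The reference listing under this task is in the first (ascending) order. A minimal sketch with made-up weights (`toy_terms` and `toy_topic` are hypothetical):

```python
import numpy as np

# Hypothetical vocabulary and one topic's term weights.
toy_terms = np.array(["best", "book", "india", "life", "money"])
toy_topic = np.array([0.9, 0.4, 0.0, 0.1, 0.7])

n_top = 3

# argsort() returns indices from smallest to largest weight.
ascending = toy_terms[toy_topic.argsort()[-n_top:]]        # strongest term last
descending = toy_terms[toy_topic.argsort()[::-1][:n_top]]  # strongest term first

print(list(ascending))
print(list(descending))
```

Either order is fine for reading topics; it only matters to be consistent when comparing listings.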

THE TOP 15 WORDS FOR TOPIC #0
['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']


THE TOP 15 WORDS FOR TOPIC #1
['majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']


THE TOP 15 WORDS FOR TOPIC #2
['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']


THE TOP 15 WORDS FOR TOPIC #3
['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']


THE TOP 15 WORDS FOR TOPIC #4
['balance', 'earth', 'day', 'death', 'changed', 'live', 'want', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']


THE TOP 15 WORDS FOR TOPIC #5
['reservation', 'engineering', 'minister', 'president', 'company', 'china', 'business', 'country', 

#### TASK: Add a new column to the original quora dataframe that labels each question into one of the 20 topic categories.


| | Question | Topic |
|---|---|---|
| 0 | What is the step by step guide to invest in sh... | 5 |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | 16 |
| 2 | How can I increase the speed of my internet co... | 17 |
| 3 | Why am I mentally very lonely? How can I solve... | 11 |
| 4 | Which one dissolve in water quikly sugar, salt... | 14 |
| 5 | Astrology: I am a Capricorn Sun Cap moon and c... | 1 |
| 6 | Should I buy tiago? | 0 |
| 7 | How can I be a good geologist? | 10 |
| 8 | When do you use シ instead of し? | 19 |
| 9 | Motorola (company): Can I hack my Charter Moto... | 17 |
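The labeling step comes down to taking, for each question, the index of its largest topic weight. The sketch below shows this on a hypothetical 3x3 weight matrix (`toy_df`, `toy_topic_results`, and `labeled` are illustrative names). It also uses `.copy()`, on the assumption that you want the unlabeled frame left untouched; a plain assignment such as `npr2 = npr` only creates a second name for the same DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical questions and document-topic weights (rows: questions, columns: topics).
toy_df = pd.DataFrame({
    "Question": [
        "How do I earn money online?",
        "What is the best book to read?",
        "What is the purpose of life?",
    ]
})
toy_topic_results = np.array([
    [0.02, 0.31, 0.00],
    [0.40, 0.05, 0.10],
    [0.00, 0.00, 0.22],
])

# Copy first so the original frame stays unchanged.
labeled = toy_df.copy()

# argmax(axis=1) picks the column (topic) with the largest weight in each row.
labeled["Topic"] = toy_topic_results.argmax(axis=1)

print(labeled)
print("Topic" in toy_df.columns)
```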


# Great job!