# Topic Modeling with NMF

Documents can be represented as a bag-of-words matrix with a TF-IDF transformation applied.
The matrix has shape (number of documents, vocabulary size).

NMF approximates a non-negative matrix as the product W*H, where W has shape (number of documents, number of latent factors/topics) and H has shape (number of latent factors/topics, vocabulary size). Both W and H are constrained to be non-negative.

We can assign each document to a topic by taking the largest entry in its row of W, and inspect what each topic is about by looking at the words with the highest values in the corresponding row of H.
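
To make the shapes concrete, here is a minimal sketch on a made-up four-document corpus. The corpus, the variable names (`docs_toy`, `tfidf_toy`, `nmf_toy`) and the choices `n_components=2`, `init="nndsvda"` and `max_iter=500` are assumptions for illustration only, not settings used later in this notebook.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only
docs_toy = [
    "cats and dogs are popular pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors buy stocks and bonds",
]

tfidf_toy = TfidfVectorizer()
X = tfidf_toy.fit_transform(docs_toy)   # shape: (num docs, vocabulary size)

nmf_toy = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = nmf_toy.fit_transform(X)            # shape: (num docs, num topics)
H = nmf_toy.components_                 # shape: (num topics, vocabulary size)

print(X.shape, W.shape, H.shape)
print(W.argmax(axis=1))                 # index of the dominant topic per document
```

The product `W @ H` has the same shape as `X` and only approximates it; how close the approximation is depends on the number of components.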

#### Import pandas and read in the quora_questions.csv file.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

In [2]:
data = pd.read_csv('data/quora_questions.csv')

In [3]:
data.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [26]:
data.shape

(404289, 1)

## Preprocessing

#### Use TF-IDF Vectorization to create a vectorized document term matrix.

In [7]:
tfidf = TfidfVectorizer(stop_words='english', max_df=0.9, min_df=2)

In [9]:
df_transformed = tfidf.fit_transform(data['Question'])

In [10]:
df_transformed

<404289x38669 sparse matrix of type '<class 'numpy.float64'>'
	with 2002912 stored elements in Compressed Sparse Row format>
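
The `max_df` and `min_df` arguments prune the vocabulary by document frequency. As a rough sketch of their effect, the toy corpus below (hypothetical, as are the names `docs_small` and `vec_small`) uses the same thresholds: a float `max_df=0.9` drops terms that appear in more than 90% of documents, and an integer `min_df=2` drops terms that appear in fewer than 2 documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "apple" is in every document, "date" in only one
docs_small = [
    "apple banana",
    "apple cherry",
    "apple banana cherry",
    "apple date",
]

vec_small = TfidfVectorizer(max_df=0.9, min_df=2)
X_small = vec_small.fit_transform(docs_small)

# vocabulary_ maps each retained term to its column index
kept_terms = sorted(vec_small.vocabulary_)
print(kept_terms)      # ['banana', 'cherry']
print(X_small.shape)   # (4, 2)
```

The last document keeps none of its terms, so its row in the matrix is all zeros.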

## Non-negative Matrix Factorization

#### Using scikit-learn, create an instance of NMF with 20 components.

In [15]:
nmf = NMF(n_components=20, random_state=42)

In [17]:
nmf.fit(df_transformed)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=20, random_state=42, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

#### Print out the 15 highest-weighted words for each of the 20 topics.

In [39]:
nmf.components_.shape
# NMF approximates the original matrix as W*H.
# components_ is the matrix H with shape (topics, vocabulary size): how strongly each word in the corpus is associated with each topic.

(20, 38669)

In [41]:
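# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions
len(tfidf.get_feature_names())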

38669

In [20]:
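# Look up the vocabulary once instead of on every iteration
feature_names = tfidf.get_feature_names()
for index, topic in enumerate(nmf.components_):
    print('THE TOP 15 WORDS FOR TOPIC #{}'.format(index))
    # argsort is ascending, so the last 15 indices are the highest-weighted words
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')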

THE TOP 15 WORDS FOR TOPIC #0
['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']


THE TOP 15 WORDS FOR TOPIC #1
['majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']


THE TOP 15 WORDS FOR TOPIC #2
['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']


THE TOP 15 WORDS FOR TOPIC #3
['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']


THE TOP 15 WORDS FOR TOPIC #4
['balance', 'earth', 'day', 'death', 'changed', 'live', 'want', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']


THE TOP 15 WORDS FOR TOPIC #5
['reservation', 'engineering', 'minister', 'president', 'company', 'china', 'business', 'country', 
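
Each list above is in ascending order of weight, so the strongest word for a topic comes last (for example, 'best' for topic #0). This follows from `argsort` sorting in ascending order. A small sketch with a made-up topic row and vocabulary (both hypothetical) shows the difference between taking the last k indices and reversing first:

```python
import numpy as np

# Hypothetical topic row from H and a matching 5-word vocabulary
topic_row = np.array([0.1, 0.7, 0.0, 0.4, 0.9])
vocab = np.array(["alpha", "beta", "gamma", "delta", "epsilon"])

# Last 3 indices of an ascending sort: weakest of the top 3 first
top3_ascending = vocab[topic_row.argsort()[-3:]].tolist()

# Reverse the sort first to list the strongest word first
top3_descending = vocab[topic_row.argsort()[::-1][:3]].tolist()

print(top3_ascending)    # ['delta', 'beta', 'epsilon']
print(top3_descending)   # ['epsilon', 'beta', 'delta']
```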

#### Add a new column to the original Quora dataframe that assigns each question to one of the 20 topics.

In [27]:
topics = nmf.transform(df_transformed)
# transform returns the matrix W for the given documents
# W holds the weight of each topic in each document

In [34]:
print(topics.shape)
print(topics.argmax(axis=1).shape)
# Each document (row) is a mix of topics; we take the index of the largest value in each row as its main topic.

(404289, 20)
(404289,)
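
As a small illustration of the row-wise `argmax`, the sketch below uses a made-up W with 3 documents and 4 topics (`W_toy` and its values are hypothetical). Note that `argmax` returns the first index on ties, so a row of all zeros is labelled 0.

```python
import numpy as np

# Hypothetical W: 3 documents (rows) by 4 topics (columns)
W_toy = np.array([
    [0.00, 0.12, 0.03, 0.01],
    [0.20, 0.05, 0.00, 0.25],
    [0.00, 0.00, 0.00, 0.00],   # no topic weight at all
])

# Index of the largest value in each row
labels = W_toy.argmax(axis=1)
print(labels.tolist())   # [1, 3, 0]
```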


In [35]:
data['Topic'] = topics.argmax(axis=1)

In [38]:
data.head(20)

Unnamed: 0,Question,Topic
0,What is the step by step guide to invest in sh...,5
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,16
2,How can I increase the speed of my internet co...,17
3,Why am I mentally very lonely? How can I solve...,11
4,"Which one dissolve in water quikly sugar, salt...",14
5,Astrology: I am a Capricorn Sun Cap moon and c...,1
6,Should I buy tiago?,0
7,How can I be a good geologist?,10
8,When do you use シ instead of し?,19
9,Motorola (company): Can I hack my Charter Moto...,17
