# Topic Modeling - No Preprocessing

This notebook explores topic modeling at a high level using Non-Negative Matrix Factorization (NMF). I try to identify themes in questions posted on [Quora](https://www.quora.com/) with minimal text preprocessing, both to get a better grasp of text vectorization and to see what interesting topics turn up before applying the usual preprocessing steps. Later I will explore more preprocessing techniques to understand how they affect the results.

In [1]:
import numpy as np
import pandas as pd

This dataset was provided in one of the Udemy courses I took and contains over 400K questions from Quora. In this notebook I go beyond the course assignments and extend what was taught by experimenting with the hyperparameters of the text vectorization and the NMF model.

In [5]:
df = pd.read_csv('data/quora_questions.csv')

In [10]:
len(df)

404289

First five questions in the dataset...

In [8]:
for i in range(5):
    print(df['Question'][i])

What is the step by step guide to invest in share market in india?
What is the story of Kohinoor (Koh-i-Noor) Diamond?
How can I increase the speed of my internet connection while using a VPN?
Why am I mentally very lonely? How can I solve it?
Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?


# Text Vectorization

Vectorizing the questions in the corpus. First, I am going to set min_df to 0.5%, which discards any word that appears in fewer than 0.5% of the questions. We are trying to discover topics that are common in the corpus, so perhaps we don't care about words that appear in very few documents. Excluding infrequent words can also save some execution time.

On the flip side, depending on how you look at it, we might want to keep certain words that aren't mentioned across many of the documents, because these words might be more specific to a smaller subset of documents and can be used to identify the topic(s) of these subsets.
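To make the trade-off concrete, here is a minimal sketch of the document-frequency filter that `min_df` applies, written in plain Python on a made-up four-question corpus. The corpus, the whitespace tokenizer, and the `doc_freq` and `kept` names are illustrative assumptions; scikit-learn's own tokenization and stop-word handling differ in the details.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from the Quora dataset).
docs = [
    "how do i learn python",
    "how do i learn guitar",
    "what is the best python book",
    "why is the sky blue",
]

# Document frequency: the number of documents each word appears in.
# set() ensures a word repeated within one document is counted once.
doc_freq = Counter(word for doc in docs for word in set(doc.split()))

# A fractional min_df keeps words whose share of documents is at least min_df.
min_df = 0.5
n_docs = len(docs)
kept = sorted(w for w, df in doc_freq.items() if df / n_docs >= min_df)

print(kept)
```

Raising `min_df` shrinks `kept`. Words such as `guitar`, which would identify a narrow subset of questions, are the first to go.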


In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [12]:
tf_idf = TfidfVectorizer(min_df = 0.005, stop_words = 'english')

In [13]:
data = df['Question']
vectorized = tf_idf.fit_transform(data)

In [14]:
vectorized

<404289x109 sparse matrix of type '<class 'numpy.float64'>'
	with 478090 stored elements in Compressed Sparse Row format>

# Non-negative Matrix Factorization

Implementing Non-Negative Matrix Factorization to see what topics it might be able to identify, first starting with 20 components. 
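As a rough intuition for what the model does: NMF approximates a non-negative document-term matrix V as the product of two smaller non-negative matrices, W (documents x topics) and H (topics x words). The sketch below uses the classic multiplicative update rules on a small random matrix. It is an illustration only: scikit-learn's `NMF` defaults to a coordinate descent solver, and the matrix sizes, seed, and iteration count here are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-negative "document-term" matrix: 6 documents, 5 words.
V = rng.random((6, 5))
k = 2  # number of topics (components)

# Random non-negative starting factors.
W = rng.random((6, k))
H = rng.random((k, 5))
eps = 1e-10  # guards against division by zero

def frobenius_error(V, W, H):
    """Reconstruction error ||V - WH|| in the Frobenius norm."""
    return np.linalg.norm(V - W @ H)

initial_error = frobenius_error(V, W, H)

for _ in range(200):
    # Multiplicative updates keep every entry non-negative.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

final_error = frobenius_error(V, W, H)
print(f"error before: {initial_error:.3f}, after: {final_error:.3f}")
```

The rows of `H` play the role of `components_` in scikit-learn, and the rows of `W` correspond to the per-document topic weights that `transform` returns.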

In [15]:
from sklearn.decomposition import NMF

In [16]:
nmf_mod = NMF(n_components=20, random_state=42)

In [17]:
nmf_mod.fit(vectorized)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=20, random_state=42, shuffle=False, solver='cd', tol=0.0001,
  verbose=0)

Below, the top 15 words are printed for each of the 20 topics. Each component is a vector whose length equals the number of words in the vocabulary, and each entry represents the weight, or importance, of that word to the associated topic.

In the output below, it's difficult to come up with a theme for any of the topics based on the words the model weighted most heavily. I would like to see these clusters contain words that appear to be more associated with each other. For example, Topic 19 and the documents that fall within it could be about the 2016 presidential race between Hillary Clinton and Trump, but there are other words that muddle our ability to determine a theme for this topic.
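For reference, this is how `argsort` picks out the top words from a component vector. The vocabulary and weights below are made up for illustration. Note that `argsort()[-n:]` returns indices in ascending order of weight, so the most important word is printed last.

```python
import numpy as np

# Hypothetical vocabulary and one topic's weights over it.
vocab = np.array(["book", "india", "movie", "trump", "weight"])
topic = np.array([0.1, 0.7, 0.05, 0.9, 0.3])

# Indices of the 3 largest weights, ordered from smallest to largest.
top_idx = topic.argsort()[-3:]
top_words = vocab[top_idx].tolist()

# Reversing the slice lists the heaviest word first.
top_words_desc = vocab[top_idx[::-1]].tolist()

print(top_words)       # ascending weight
print(top_words_desc)  # descending weight
```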

In [18]:
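# Look up the vocabulary once instead of on every loop iteration.
# On scikit-learn >= 1.0 use tf_idf.get_feature_names_out(); get_feature_names() was removed in 1.2.
feature_names = tf_idf.get_feature_names()
for idx, topic in enumerate(nmf_mod.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{idx}:')
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')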

THE TOP 15 WORDS FOR TOPIC #0:
['app', 'website', 'company', 'free', 'engineering', 'phone', 'movie', '2016', 'online', 'buy', 'ways', 'movies', 'books', 'book', 'best']


THE TOP 15 WORDS FOR TOPIC #1:
['weight', 'facebook', 'increase', 'girl', 'person', 'need', 'sex', 'help', 'really', 'new', 'long', 'feel', 'compare', 'mean', 'does']


THE TOP 15 WORDS FOR TOPIC #2:
['happen', 'online', 'start', 'engineering', 'president', '1000', '500', 'business', 'notes', 'country', 'company', 'buy', 'job', 'war', 'india']


THE TOP 15 WORDS FOR TOPIC #3:
['online', 'year', 'examples', 'learning', 'college', 'business', 'job', 'movies', 'buy', 'book', 'engineering', 'bad', 'ways', 'books', 'good']


THE TOP 15 WORDS FOR TOPIC #4:
['google', 'just', 'really', 'facebook', 'website', 'world', 'person', 'don', 'women', 'sex', 'girl', 'live', 'different', 'feel', 'like']


THE TOP 15 WORDS FOR TOPIC #5:
['sex', 'different', 'love', 'just', 'feel', 'important', 'stop', 'want', 'google', 'instagram', 'b

## Testing

Below I'm going to play around with the min_df and max_df parameters of the TF-IDF vectorizer and the number of components in the NMF model to see if I can extract more identifiable themes. One heuristic worth noting: if the model finds many of the same words across topics, consider lowering k (the number of components).

Above I stated the cases for and against taking out words that don't show up frequently in the corpus. We might also want to disregard words that appear in a majority of documents; otherwise they might show up in several of the latent topics found by the model, making it difficult to interpret any differences between the topics.

In [50]:
from random import randint
###HELPER FUNCTIONS###

def fit_nmf(tf_idf_vec, n_components=10):
    '''Fits NMF model with specified number of components (default=10) and returns the fitted model and its components'''
    mod = NMF(n_components=n_components, random_state=42)
    mod.fit(tf_idf_vec)
    
    nmf_components = mod.components_
    return mod, nmf_components

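def print_top15(tf_idf, nmf_components):
    '''Prints top 15 words for each topic (most heavily weighted word last)'''
    # Look up the vocabulary once; on scikit-learn >= 1.0 use get_feature_names_out() instead.
    feature_names = tf_idf.get_feature_names()
    for idx, topic in enumerate(nmf_components):
        print(f'THE TOP 15 WORDS FOR TOPIC #{idx}:')
        print([feature_names[i] for i in topic.argsort()[-15:]])
        print('\n')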

def print_topic_sents(df, topic_dict, n=10):
    '''Prints set of n random questions for each topic the model found'''
    topics = topic_dict.values()
    for topic in topics:
        topic_questions = df[df['Topic'] == topic]['Question']
        max_range = len(topic_questions)
        print(f'{n} RANDOM SENTENCES FOR TOPIC: {topic}')
        
        for i in range(n):
            # randint includes both endpoints, so the last valid position is max_range - 1
            rand_num = randint(0, max_range - 1)
            print(topic_questions.iloc[rand_num])
            
        print('\n\n')
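
A note on drawing random positions, since it is easy to trip over: `random.randint(a, b)` includes both endpoints, whereas valid positions for a sequence of length `n` run from `0` to `n - 1`. The sketch below contrasts the two standard-library calls on a hypothetical list of three questions.

```python
import random

random.seed(0)  # fixed seed so repeated runs draw the same positions

questions = ["q0", "q1", "q2"]  # hypothetical stand-in for a topic's questions
n = len(questions)

# randint includes the upper bound, so the last valid position is n - 1.
draws_randint = [random.randint(0, n - 1) for _ in range(1000)]

# randrange excludes the upper bound, matching range() semantics.
draws_randrange = [random.randrange(n) for _ in range(1000)]

# Every drawn position can be used to index the list safely.
sampled = [questions[i] for i in draws_randint[:5]]
print(sampled)
```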

## min_df=0.01, n_components=10

In [29]:
tf_idf2 = TfidfVectorizer(min_df = 0.01, stop_words = 'english')

In [30]:
vectorized2 = tf_idf2.fit_transform(data)

In [31]:
vectorized2

<404289x30 sparse matrix of type '<class 'numpy.float64'>'
	with 258260 stored elements in Compressed Sparse Row format>

In [33]:
nmf2, nmf2_components = fit_nmf(vectorized2, 10)

The topics generated this round aren't very clear; they share many of the same words, such as indian, trump, and world. Because the same words appear in a lot of the topics, it's difficult to assign a theme to any of them. Let's investigate this more...
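One way to quantify that overlap is to count how many top words each pair of topics shares. The sketch below does this for Topics 0 to 2, with the word lists copied from the printed output of this run; the `shared` dictionary and the other variable names are my own.

```python
from itertools import combinations

# Top-15 words for Topics 0-2, copied from the min_df=0.01 run.
topics = {
    0: ['work', 'want', 'know', 'year', 'use', 'new', 'indian', 'start',
        'world', 'english', 'time', 'online', 'learn', 'way', 'best'],
    1: ['world', 'want', 'year', 'make', 'indian', 'know', 'think', 'money',
        'trump', 'use', 'time', 'new', 'work', 'mean', 'does'],
    2: ['use', 'time', 'year', 'work', 'want', 'better', 'world', 'new',
        'online', 'start', 'indian', 'trump', 'money', 'think', 'india'],
}

# Number of top words shared by each pair of topics.
shared = {
    (a, b): len(set(topics[a]) & set(topics[b]))
    for a, b in combinations(sorted(topics), 2)
}

for (a, b), n_shared in shared.items():
    print(f"Topics {a} and {b} share {n_shared} of 15 top words")
```

Some overlap is unavoidable in this setup: 10 topics with 15 printed words each make 150 slots, and with a vocabulary of only 30 words each word has to appear in 5 topic lists on average.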

In [34]:
print_top15(tf_idf2, nmf2_components)

THE TOP 15 WORDS FOR TOPIC #0:
['work', 'want', 'know', 'year', 'use', 'new', 'indian', 'start', 'world', 'english', 'time', 'online', 'learn', 'way', 'best']


THE TOP 15 WORDS FOR TOPIC #1:
['world', 'want', 'year', 'make', 'indian', 'know', 'think', 'money', 'trump', 'use', 'time', 'new', 'work', 'mean', 'does']


THE TOP 15 WORDS FOR TOPIC #2:
['use', 'time', 'year', 'work', 'want', 'better', 'world', 'new', 'online', 'start', 'indian', 'trump', 'money', 'think', 'india']


THE TOP 15 WORDS FOR TOPIC #3:
['think', 'new', 'use', 'trump', 'want', 'online', 'year', 'indian', 'start', 'english', 'way', 'work', 'learn', 'time', 'good']


THE TOP 15 WORDS FOR TOPIC #4:
['better', 'trump', 'think', 'learn', 'new', 'want', 'english', 'use', 'year', 'know', 'time', 'indian', 'world', 'work', 'like']


THE TOP 15 WORDS FOR TOPIC #5:
['english', 'money', 'year', 'learn', 'better', 'indian', 'time', 'want', 'trump', 'new', 'use', 'world', 'think', 'know', 'people']


THE TOP 15 WORDS FOR TOPIC

## max_df=0.0005, n_components=20

One thing I noticed above is that although setting min_df to 0.01 seems like a low threshold (any word that appears in fewer than 1% of the questions is thrown out), the dimension of the sparse matrix produced by TF-IDF is 404289x30. The second dimension indicates that only 30 words remain after setting min_df to 0.01. Because this corpus is so large and questions on Quora vary so much, very few words are shared widely across questions, which might make it hard to capture any themes. But let's still see where we can get.

Setting max_df to 0.0005 (discard words that appear in more than 0.05% of the questions) leaves 65,778 words in the vocabulary, each of which shows up in at most 0.05% of the documents. This again suggests that the Quora questions contain a large number of rarely used words.
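For a sense of scale, the fractional thresholds can be converted into document counts. The sketch below assumes scikit-learn's documented behaviour that a float `min_df` or `max_df` is interpreted as a proportion of documents, and that a term is kept when its document frequency lies between the two resulting bounds, inclusive. The helper names are my own.

```python
import math

n_docs = 404289  # number of questions in the dataset

def min_docs_required(min_df, n_docs):
    """Smallest document count a word needs to survive a fractional min_df."""
    return math.ceil(min_df * n_docs)

def max_docs_allowed(max_df, n_docs):
    """Largest document count a word may have under a fractional max_df."""
    return math.floor(max_df * n_docs)

print(min_docs_required(0.005, n_docs))
print(min_docs_required(0.01, n_docs))
print(max_docs_allowed(0.0005, n_docs))
```

In other words, `min_df=0.01` keeps only words found in at least 4,043 questions, while `max_df=0.0005` keeps only words found in at most 202.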

In [36]:
tf_idf3 = TfidfVectorizer(max_df = 0.0005, stop_words = 'english')
vectorized3 = tf_idf3.fit_transform(data)
vectorized3

<404289x65778 sparse matrix of type '<class 'numpy.float64'>'
	with 673714 stored elements in Compressed Sparse Row format>

Knowing how varied the wording of Quora questions is, it makes sense to increase the number of components in the NMF model so that it can hopefully identify more themes across these questions. Below, I set n_components to 20.

In [40]:
nmf3, nmf3_components = fit_nmf(vectorized3, 20)

Simply recognizing that these questions vary a lot in the words they use has pointed me toward values that might work well for vectorizing the text and fitting the model. As an aside, I believe iteration is the rule, not the exception, when it comes to NLP: the results you get early in the process help fine-tune the preprocessing and the decisions made in later iterations.

The output below isn't perfect, but even with minimal text preprocessing and some adjustments to the hyperparameters, I think we can identify a couple of topics prevalent on Quora. Some of the topics below contain words that appear to cluster well with each other.

For example, Topic 1 appears to be about people asking about suicide and instant and painless death. Questions that fall under Topic 11 might be about how to build good habits to be productive, or about inspirational quotes.

In [41]:
print_top15(tf_idf3, nmf3_components)

THE TOP 15 WORDS FOR TOPIC #0:
['processor', 'cellphone', '17k', 'costing', 'earns', 'commute', '20k', 'coolest', 'august', 'lakh', 'shoot', '10k', 'bicycle', 'smartphones', '15k']


THE TOP 15 WORDS FOR TOPIC #1:
['pubic', 'killing', 'hanging', 'efficient', 'oneself', 'fairly', 'peaceful', 'dying', 'instantly', 'painful', 'neurological', 'disabling', 'untreatable', 'quickest', 'painless']


THE TOP 15 WORDS FOR TOPIC #2:
['ejaculate', 'reduces', 'ill', 'virginity', 'procrastination', 'stamina', 'sperm', 'sin', 'harm', 'teenager', 'habit', 'pimples', 'caught', 'forever', 'masturbating']


THE TOP 15 WORDS FOR TOPIC #3:
['finish', 'achieved', 'drugs', 'officers', 'forever', 'excited', 'distracted', 'vegetarian', 'tier', 'struggling', 'motivates', 'keeps', 'minute', 'goals', 'motivated']


THE TOP 15 WORDS FOR TOPIC #4:
['direction', 'destroyed', 'kinetic', 'appears', 'slower', 'transforms', 'scientists', 'potentially', 'expand', 'forever', 'conserved', 'potentiality', 'expansion', 'expa

Next, I assign each question to a topic number and do my best to label each topic in topic_dict. Some clusters are hard to categorize, so I label those as unclassified. With the labels in place, I map the topic numbers to their respective topic names.
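The assignment step boils down to taking, for each question, the topic with the largest weight. A minimal sketch with a made-up document-topic matrix is shown below; the values and the `labels` mapping are hypothetical. One caveat worth knowing: a row of all zeros still gets assigned to topic 0, because `argmax` returns the first position among ties. That could matter when aggressive vocabulary pruning leaves some questions with no remaining terms.

```python
import numpy as np

# Hypothetical document-topic weights: 4 documents, 3 topics.
doc_topic = np.array([
    [0.02, 0.90, 0.10],
    [0.70, 0.05, 0.20],
    [0.00, 0.00, 0.00],  # no weight on any topic
    [0.10, 0.10, 0.60],
])

# Index of the largest weight in each row.
topic_num = doc_topic.argmax(axis=1)

# Flag rows where every weight is zero, which argmax silently maps to 0.
no_signal = ~doc_topic.any(axis=1)

labels = {0: "topic_a", 1: "topic_b", 2: "topic_c"}  # hypothetical names
assigned = [labels[i] for i in topic_num.tolist()]

print(topic_num, no_signal, assigned)
```

Checking a mask like `no_signal` before labelling is one way to keep such questions out of whichever topic happens to be numbered 0.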

In [42]:
topic_results = nmf3.transform(vectorized3)
df['Topic_Num'] = topic_results.argmax(axis=1)
df.head()

| | Question | Topic_Num |
|---|---|---|
| 0 | What is the step by step guide to invest in sh... | 15 |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | 15 |
| 2 | How can I increase the speed of my internet co... | 2 |
| 3 | Why am I mentally very lonely? How can I solve... | 7 |
| 4 | Which one dissolve in water quikly sugar, salt... | 10 |


In [43]:
topic_dict = {0: 'technology', 1: 'suicide', 2: 'puberty', 3: 'unclassified', 4: 'science', 5: 'boredom',
              6: 'communication', 7: 'productivity', 8: 'documentary_genres', 9: 'unclassified2', 10: 'login_help', 
              11: 'good_habits', 12: 'unclassified3', 13: 'jokes', 14: 'unclassified4', 15: 'entrepreneurship',
             16: 'unclassified5', 17: 'music', 18: 'language', 19: 'sex'}

In [44]:
df['Topic'] = df['Topic_Num'].map(topic_dict)
df.head()

| | Question | Topic_Num | Topic |
|---|---|---|---|
| 0 | What is the step by step guide to invest in sh... | 15 | entrepreneurship |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | 15 | entrepreneurship |
| 2 | How can I increase the speed of my internet co... | 2 | puberty |
| 3 | Why am I mentally very lonely? How can I solve... | 7 | productivity |
| 4 | Which one dissolve in water quikly sugar, salt... | 10 | login_help |


Number of questions per topic:

In [46]:
df['Topic'].value_counts()

technology            110680
entrepreneurship       51990
unclassified5          34789
login_help             34700
unclassified4          29990
music                  17056
language               15617
documentary_genres     13359
communication          12111
sex                    11917
science                10724
good_habits            10700
jokes                   9662
puberty                 9180
unclassified3           8939
unclassified            8032
productivity            5063
boredom                 4678
suicide                 2737
unclassified2           2365
Name: Topic, dtype: int64

Below are 10 randomly picked questions per topic. Although the model was able to group words that appear associated with each other, most of the questions below still don't seem to belong to the categories they were placed in. I believe this is because the questions in the dataset are too varied for the number of components used, so there aren't enough appropriate themes to place them under.

From here, I plan to explore additional text preprocessing, increasing the number of components, and/or working with a smaller subset of the dataset, to see whether I can find underlying themes where a majority of the questions fall under an appropriate topic.

In [57]:
print_topic_sents(df, topic_dict, n=10)

10 RANDOM SENTENCES FOR TOPIC: technology
What does the world not know about India and why?
What should be my resolution for 2017?
What is architecture all about?
Why can't I be myself?
What is the best brand of stethoscope for a medical student?
How soon can humans move to and live on Mars?
What are some of the best whatsapp status?
If you were president, what is the first thing you would do?
What is the way to make a girl fall in love with you?
How can you describe Taos material?



10 RANDOM SENTENCES FOR TOPIC: suicide
What is the quickest way to get meth out your system?
How do you stop a sunburn from peeling?
Where can I find a surf (Speeded Up Robust Features) MATLAB Code for Keypoint detector and keypoint descriptor?
What is the best way to stop talking to oneself?
What are some things new employees should know going into their first day at PPL?
If you could get now instantly and completely guaranteed right and complete answer to ANY of your questions, which one question would 