# Topic Modeling

The goal of this project is to **assign over 400 000 Quora questions to different categories**, or topics.

For that, we'll be using three different methods:
* **Latent Dirichlet Allocation (LDA)**
* **Latent Semantic Analysis (LSA)**
* **Non-Negative Matrix Factorization (NMF)**

#### 1. Perform initial imports

In [1]:
import pandas as pd

#### 2. Load data

In [2]:
quora = pd.read_csv("data/quora_questions.csv")

#### 3. Check the dataframe

In [3]:
quora.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


#### 4. Check missing values

In [4]:
quora.isnull().sum()

Question    0
dtype: int64

There are no missing questions.

#### 5. Check empty strings

In [5]:
# using the isspace() method

empty_strings = []

for i, q in quora.itertuples():
    if q.isspace():
        empty_strings.append(i)

In [6]:
print(empty_strings)
print(len(empty_strings))

[]
0


There are no questions that correspond to empty strings.

In [7]:
# check length

len(quora)

404289

We have 404 289 Quora questions. Our dataset is clean and we can now perform topic modeling with LDA, LSA and NMF.

## LDA

In Latent Dirichlet Allocation, each topic is represented as a probability distribution over words, i.e., the probability that each term in the vocabulary occurs in that topic. Each document is in turn represented as a mixture of these topics. Both the topic proportions within a document and the word probabilities within a topic are assumed to be drawn from Dirichlet prior distributions.
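To make the generative idea concrete, here is a minimal sketch using only the standard library. The toy vocabulary, the number of topics and the concentration value of 0.5 are assumptions chosen for illustration, and this is not how gensim or scikit-learn fit the model; it only shows how a document could be generated once topics and mixtures are given.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(vocabulary, topic_word, doc_topic, n_words, rng):
    """For each position, pick a topic from the document mixture, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(len(doc_topic)), weights=doc_topic)[0]
        word = rng.choices(vocabulary, weights=topic_word[topic])[0]
        words.append(word)
    return words

rng = random.Random(42)
vocabulary = ["money", "invest", "market", "love", "friend", "girl"]  # toy vocabulary (assumption)
n_topics = 2  # assumption, far fewer than the 20 topics used in this notebook

# Each topic is a distribution over the vocabulary, drawn from a Dirichlet prior
topic_word = [sample_dirichlet([0.5] * len(vocabulary), rng) for _ in range(n_topics)]
# Each document is a mixture of topics, also drawn from a Dirichlet prior
doc_topic = sample_dirichlet([0.5] * n_topics, rng)

document = generate_document(vocabulary, topic_word, doc_topic, n_words=8, rng=rng)
print(doc_topic)
print(document)
```

Fitting LDA goes in the opposite direction: given only the documents, it estimates the topic-word distributions and the per-document mixtures.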

### The gensim way

#### 7. Preprocess text

In [8]:
# imports

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer



In [9]:
# define stemmer and lemmatizer

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

In [10]:
# define function to tokenize, lemmatize and stem the text

def preprocess(text):
    result=[]
    for token in simple_preprocess(text):
        if token not in STOPWORDS: # remove stopwords
            token_lemmatized = lemmatizer.lemmatize(token, pos='v')
            token_stemmed = stemmer.stem(token_lemmatized)
            result.append(token_stemmed)
    return result

In [11]:
# preprocess questions

quora['Question_gensim'] = quora['Question'].apply(preprocess)

In [12]:
quora.head()

Unnamed: 0,Question,Question_gensim
0,What is the step by step guide to invest in sh...,"[step, step, guid, invest, share, market, india]"
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,"[stori, kohinoor, koh, noor, diamond]"
2,How can I increase the speed of my internet co...,"[increas, speed, internet, connect, vpn]"
3,Why am I mentally very lonely? How can I solve...,"[mental, lone, solv]"
4,"Which one dissolve in water quikly sugar, salt...","[dissolv, water, quik, sugar, salt, methan, ca..."


We have created a new column `Question_gensim` in our dataframe where each question is now a list of tokens ready to be further processed with the help of gensim.

#### 8. Create a bag of words

In [13]:
from gensim.corpora import Dictionary

In [14]:
dct = Dictionary(quora['Question_gensim'])

In [15]:
len(dct)

46376

In [16]:
example_count = 0

for i, word in dct.items():
    print(i, word)
    example_count += 1
    if example_count > 10:
        break

0 guid
1 india
2 invest
3 market
4 share
5 step
6 diamond
7 koh
8 kohinoor
9 noor
10 stori


In [17]:
print(dct)

Dictionary(46376 unique tokens: ['guid', 'india', 'invest', 'market', 'share']...)


Our dictionary contains **46 376 unique tokens (our vocabulary)**, and each token has a corresponding id.

In [18]:
# create a bag of words

bow = [dct.doc2bow(question) for question in quora['Question_gensim']]

In [19]:
# first question

quora['Question_gensim'][0]

['step', 'step', 'guid', 'invest', 'share', 'market', 'india']

In [20]:
# bag of words for the first question

bow[0]

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2)]

All the words with **ids from 0 to 4 appear once** in this question, and the word with **id=5 (step) appears twice**.
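As a rough illustration of what the dictionary and `doc2bow` are doing, the same result can be reproduced with a `Counter`. The helper names `build_dictionary` and `to_bow` are hypothetical, and assigning ids to new tokens in sorted order is an assumption made so that the ids match the output above; it is not a statement about gensim's internals.

```python
from collections import Counter

def build_dictionary(documents):
    """Map each token to an integer id; new tokens of a document get ids in sorted order."""
    token2id = {}
    for tokens in documents:
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(tokens, token2id):
    """Return a sorted list of (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# the first two preprocessed questions from the dataframe above
docs = [
    ['step', 'step', 'guid', 'invest', 'share', 'market', 'india'],
    ['stori', 'kohinoor', 'koh', 'noor', 'diamond'],
]

token2id = build_dictionary(docs)
bow_first = to_bow(docs[0], token2id)
print(bow_first)  # [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2)]
```

Word order is discarded in this representation: only the token ids and their counts are kept.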

#### 9. Use `LdaMulticore` to create our LDA model

In [21]:
import multiprocessing

# number of cores
multiprocessing.cpu_count()

4

In [22]:
%%time

# workers = number of cores - 1
lda_model = gensim.models.LdaMulticore(bow, 
                                       num_topics=20, 
                                       id2word = dct, 
                                       passes = 2, 
                                       workers=3)

Wall time: 1min 33s


#### 10. Check how we can have access to words and topics

In [23]:
# topic distribution for a given document and, with per_word_topics=True,
# the most likely topics for each word plus the per-word topic weights (phi values)
# example with first question

lda_model.get_document_topics(bow=bow[0], per_word_topics=True)

([(2, 0.21110655), (11, 0.67639345)],
 [(0, [11]), (1, [11, 2]), (2, [11]), (3, [11]), (4, [11]), (5, [2, 11])],
 [(0, [(11, 1.0)]),
  (1, [(2, 0.10889442), (11, 0.8911056)]),
  (2, [(11, 0.99999994)]),
  (3, [(11, 1.0000001)]),
  (4, [(11, 1.0)]),
  (5, [(2, 1.526031), (11, 0.47396907)])])

In [24]:
# most relevant topics for a given word

lda_model.get_term_topics('thing', minimum_probability=0)
# note: the docstring names this parameter word_id, but passing the word itself works here

[(0, 0.0007628809),
 (10, 0.009788418),
 (16, 2.4625226e-07),
 (18, 7.290221e-06),
 (19, 0.048436616)]

In [25]:
# most relevant words (id) for a given topic

lda_model.get_topic_terms(topicid=1, topn=10)

[(116, 0.057453986),
 (316, 0.042467263),
 (671, 0.041471988),
 (155, 0.039485935),
 (153, 0.028141282),
 (1, 0.027203513),
 (516, 0.022662492),
 (857, 0.021236362),
 (291, 0.020539965),
 (737, 0.020228669)]

In [26]:
# most relevant words (string) for a given topic

lda_model.show_topic(topicid=1, topn=10)

[('money', 0.057453986),
 ('indian', 0.042467263),
 ('note', 0.041471988),
 ('year', 0.039485935),
 ('old', 0.028141282),
 ('india', 0.027203513),
 ('black', 0.022662492),
 ('bad', 0.021236362),
 ('help', 0.020539965),
 ('earn', 0.020228669)]

In [27]:
# lda_model.get_topics() gives us the probability for each word in each topic

lda_model.get_topics().shape

(20, 46376)

#### 11. Print out the top 15 words for each of the 20 topics

In [28]:
for topic, top_words in lda_model.print_topics(num_topics=20, num_words=15):
    print(f'TOP 15 WORDS FOR TOPIC #{topic+1}') #topic+1 to start with Topic #1 instead of #0
    print(top_words)
    print("\n")

TOP 15 WORDS FOR TOPIC #1
0.084*"time" + 0.045*"love" + 0.032*"friend" + 0.022*"travel" + 0.018*"card" + 0.016*"girl" + 0.015*"stori" + 0.015*"know" + 0.014*"follow" + 0.013*"girlfriend" + 0.012*"want" + 0.012*"benefit" + 0.012*"polit" + 0.012*"talk" + 0.011*"peopl"


TOP 15 WORDS FOR TOPIC #2
0.057*"money" + 0.042*"indian" + 0.041*"note" + 0.039*"year" + 0.028*"old" + 0.027*"india" + 0.023*"black" + 0.021*"bad" + 0.021*"help" + 0.020*"earn" + 0.020*"rs" + 0.020*"govern" + 0.019*"rupe" + 0.019*"ban" + 0.014*"hair"


TOP 15 WORDS FOR TOPIC #3
0.087*"like" + 0.054*"work" + 0.038*"feel" + 0.023*"school" + 0.022*"stop" + 0.020*"look" + 0.020*"high" + 0.014*"law" + 0.014*"startup" + 0.012*"fall" + 0.011*"stay" + 0.011*"program" + 0.009*"text" + 0.008*"hire" + 0.008*"live"


TOP 15 WORDS FOR TOPIC #4
0.034*"develop" + 0.029*"video" + 0.021*"free" + 0.020*"safe" + 0.020*"download" + 0.019*"end" + 0.018*"web" + 0.017*"hotel" + 0.015*"polic" + 0.015*"period" + 0.014*"digit" + 0.011*"kill" + 0.0

This is one possible approach with **gensim**. Let's now see how we can do something similar with scikit-learn.

### The scikit-learn way

#### 12. Create a vectorized document-term matrix with `CountVectorizer`

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [10]:
dtm = cv.fit_transform(quora['Question'])

In [11]:
dtm

<404289x38669 sparse matrix of type '<class 'numpy.int64'>'
	with 2002912 stored elements in Compressed Sparse Row format>

We now have a sparse matrix for the 404 289 questions, with a total of 38 669 different words. Let's try to group these questions into 20 different topics with the LDA method.
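The vocabulary size is smaller than with gensim partly because of the `min_df` and `max_df` settings. As a hedged sketch of the filtering they imply (an integer `min_df` is a minimum number of documents, a float `max_df` is a maximum proportion of documents), here is a plain-Python version on a toy corpus. The function name `filter_vocabulary` and the toy documents are made up for illustration.

```python
def filter_vocabulary(tokenized_docs, min_df=2, max_df=0.95):
    """Keep terms found in at least min_df documents and in at most max_df (a fraction) of them."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        for token in set(tokens):  # count each term once per document
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )

toy_docs = [
    ["invest", "share", "market", "question"],
    ["speed", "internet", "question"],
    ["invest", "money", "question"],
]

kept = filter_vocabulary(toy_docs)
print(kept)  # ['invest']
```

Here "question" is dropped because it appears in every document, and the terms that appear in a single document are dropped because they fall below `min_df=2`.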

#### 13. Create an instance of LatentDirichletAllocation with 20 expected components and fit it

In [12]:
from sklearn.decomposition import LatentDirichletAllocation

In [13]:
LDA = LatentDirichletAllocation(n_components=20,random_state=42)

In [14]:
%%time

LDA.fit(dtm)

Wall time: 8min 5s


LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=20, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

#### 14. Check how we can have access to words and topics 

In [15]:
# words

print(len(cv.get_feature_names()))

38669


In [16]:
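import random

# get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn versions
feature_names = cv.get_feature_names()

for i in range(10):
    random_word_id = random.randint(0, len(feature_names) - 1)
    print(feature_names[random_word_id])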

spoil
smayan
close
stops
candidature
journeys
annie
stories
rabid
crunchbase


As we saw before, we have a total of 38 669 different words. We've printed 10 of them at random.

In [17]:
# topics

len(LDA.components_)

20

In [18]:
# words

len(LDA.components_[0])

38669

As expected, we have 20 different topics. And for each topic, we have a weight for each of our 38 669 words.

In [19]:
# top 10 words for topic #0 in descending order

top10_word_indices = LDA.components_[0].argsort()[-10:][::-1] #[::-1] to reverse order

for index in top10_word_indices:
    print(cv.get_feature_names()[index])

best
service
history
career
social
india
company
good
google
media


These are the top 10 words for topic #0.

#### 15. Print out the top 15 words for each of the 20 topics

In [20]:
for index,topic in enumerate(LDA.components_):
    print(f'TOP 15 WORDS FOR TOPIC #{index+1}') #index+1 to start with Topic #1 instead of #0
    print([cv.get_feature_names()[i] for i in topic.argsort()[-15:]][::-1])
    print('\n')

TOP 15 WORDS FOR TOPIC #1
['best', 'service', 'history', 'career', 'social', 'india', 'company', 'good', 'google', 'media', 'services', 'open', 'code', 'development', 'sydney']


TOP 15 WORDS FOR TOPIC #2
['500', 'notes', '1000', 'indian', 'black', 'english', 'rs', 'money', 'word', 'india', 'rupee', 'government', 'making', 'process', 'economy']


TOP 15 WORDS FOR TOPIC #3
['does', 'average', 'cost', 'india', 'good', 'purpose', 'man', 'compare', 'state', 'home', 'legal', 'center', 'alcohol', 'ones', 'current']


TOP 15 WORDS FOR TOPIC #4
['new', 'iphone', 'does', 'tv', 'exist', 'big', 'worth', 'interesting', 'looking', 'series', 'mind', 'apple', 'facts', 'year', 'answers']


TOP 15 WORDS FOR TOPIC #5
['job', 'car', 'india', 'differences', 'college', 'jobs', 'apply', 'mba', 'canada', 'visa', 'student', 'students', 'usa', 'overcome', 'australia']


TOP 15 WORDS FOR TOPIC #6
['world', 'long', 'does', 'war', 'india', 'math', 'like', 'china', 'pakistan', 'countries', 'culture', 'relationship

#### 16. Visualize topics and most relevant terms per topic

In [22]:
import pyLDAvis
import pyLDAvis.sklearn

pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore")

In [23]:
%%time

pyLDAvis.sklearn.prepare(LDA, dtm, cv, sort_topics=False)
# sort_topics = False to keep the original topic order

Wall time: 18min 39s


In the pyLDAvis panel, the $\lambda$ slider controls how terms are ranked. With $\lambda=1$, the terms are ranked by their probabilities within each topic (as we had in our top 15 words for each of the 20 topics). With $\lambda=0$, the terms are ranked only by their lift, the ratio between a term's probability within a topic and its marginal probability across the corpus. For intermediate values of $\lambda$, the terms are ranked by their relevance, a weighted combination of those two quantities on a log scale.
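The relevance score can be written as $\lambda \log p(w \mid t) + (1-\lambda) \log \frac{p(w \mid t)}{p(w)}$. The following sketch shows how the ranking of two terms can flip as $\lambda$ changes. The probabilities are invented for illustration and are not taken from the fitted model.

```python
import math

def relevance(p_term_given_topic, p_term, lam):
    """lam * log p(w|t) + (1 - lam) * log lift, with lift = p(w|t) / p(w)."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)

# hypothetical probabilities for two terms within one topic
terms = {
    "india": {"p_topic": 0.030, "p_corpus": 0.020},  # probable in the topic, but common everywhere
    "rupee": {"p_topic": 0.019, "p_corpus": 0.001},  # less probable, but very specific to the topic
}

def rank(lam):
    """Return the terms sorted by decreasing relevance for a given lambda."""
    return sorted(
        terms,
        key=lambda t: relevance(terms[t]["p_topic"], terms[t]["p_corpus"], lam),
        reverse=True,
    )

print(rank(1.0))  # ['india', 'rupee']
print(rank(0.0))  # ['rupee', 'india']
```

Lower values of $\lambda$ therefore favor terms that are distinctive for a topic, even if they are not its most frequent words.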

As we can see in our intertopic distance map, there is **some overlapping between different topics**. Let's try to avoid this overlapping by **reducing the total number of topics to 15**.

In [27]:
LDA = LatentDirichletAllocation(n_components=15,random_state=42)

In [28]:
%%time

LDA.fit(dtm)

Wall time: 9min 30s


LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=15, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In [29]:
%%time

pyLDAvis.sklearn.prepare(LDA, dtm, cv, sort_topics=False)

Wall time: 19min 44s


#### 17. Create a dataframe with each question and the corresponding topic

In [30]:
topic_results = LDA.transform(dtm)

In [31]:
topic_results.shape

(404289, 15)

In [32]:
# for question #2

topic_results[2]

array([0.00952381, 0.00952381, 0.00952381, 0.00952381, 0.00952381,
       0.00952381, 0.00952381, 0.00952381, 0.72894152, 0.00952388,
       0.00952381, 0.00952381, 0.00952381, 0.00952381, 0.14724888])

In [33]:
topic_results[2].round(2)

array([0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.73, 0.01, 0.01,
       0.01, 0.01, 0.01, 0.15])

In [34]:
topic_results[2].argmax()

8

According to this, question #2 belongs to topic #8 when counting from zero, which is topic #9 in the 1-based numbering used below.

In [35]:
# for all questions

quora_lda = quora.copy()
quora_lda['Topic'] = topic_results.argmax(axis=1)+1 # +1 so that we have topics 1 to 15

In [38]:
quora_lda.head()

Unnamed: 0,Question,Topic
0,What is the step by step guide to invest in sh...,8
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,5
2,How can I increase the speed of my internet co...,9
3,Why am I mentally very lonely? How can I solve...,11
4,"Which one dissolve in water quikly sugar, salt...",3


We've managed to assign each question to one of the 15 topics. Let's do the same thing with the LSA method.

## LSA

In Latent Semantic Analysis, we can break down our TF-IDF document-term matrix into **three simpler matrices** using singular value decomposition (SVD). We can then truncate those matrices (remove some rows and columns), which reduces the number of dimensions we have to deal with in our vector space model - this is called **truncated singular value decomposition**.

If we multiply these new truncated matrices, we don't get the exact same TF-IDF matrix, but a new representation of the documents that contains the essence or "latent semantics" of those documents.
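In symbols, the decomposition can be sketched as follows. The names $A$, $U$, $\Sigma$, $V$, $m$, $n$ and $k$ are notation introduced here for illustration, not taken from the notebook. For a TF-IDF matrix $A$ with $m$ documents and $n$ terms:

$$A = U \Sigma V^{T}, \qquad A \approx A_k = U_k \Sigma_k V_k^{T}$$

Here $U_k$ is $m \times k$ (documents by topics), $\Sigma_k$ is a $k \times k$ diagonal matrix holding the $k$ largest singular values, and $V_k^{T}$ is $k \times n$ (topics by terms). With scikit-learn's `TruncatedSVD`, the `components_` attribute should correspond to $V_k^{T}$ and `transform` to the projection $A V_k = U_k \Sigma_k$. Unlike LDA probabilities, these values can be negative.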

#### 18. Create a vectorized document-term matrix with `TfidfVectorizer`

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [41]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [42]:
dtm = tfidf.fit_transform(quora['Question'])

In [43]:
dtm

<404289x38669 sparse matrix of type '<class 'numpy.float64'>'
	with 2002912 stored elements in Compressed Sparse Row format>

We have now used `TfidfVectorizer` instead of `CountVectorizer` to create our document-term matrix.
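To show what changes compared with raw counts, here is a small standard-library sketch of a TF-IDF weighting. It assumes the smoothed formula $\text{idf} = \ln\frac{1+n}{1+\text{df}} + 1$ followed by L2 normalization, which is what `TfidfVectorizer` is understood to use with its default settings. The function name and the toy documents are illustrative only.

```python
import math

def tfidf_vector(doc_tokens, all_docs):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    n_docs = len(all_docs)
    weights = {}
    for term in set(doc_tokens):
        tf = doc_tokens.count(term)                       # raw term count in this document
        df = sum(1 for d in all_docs if term in d)        # number of documents containing the term
        idf = math.log((1 + n_docs) / (1 + df)) + 1       # smoothed inverse document frequency
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

toy_docs = [
    ["best", "way", "learn"],
    ["best", "book"],
    ["best", "money", "online"],
]

vec = tfidf_vector(toy_docs[0], toy_docs)
for term, weight in sorted(vec.items()):
    print(term, round(weight, 3))
```

The word "best" occurs in every toy document, so it ends up with a lower weight than "way" or "learn", whereas a plain count would treat all three equally.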

#### 19. Create an instance of TruncatedSVD with 15 expected components and fit it

In [44]:
from sklearn.decomposition import TruncatedSVD

In [45]:
lsa_model = TruncatedSVD(n_components=15, random_state=42)

In [46]:
%%time

lsa_model.fit(dtm)

Wall time: 3.59 s


TruncatedSVD(algorithm='randomized', n_components=15, n_iter=5, random_state=42,
             tol=0.0)

#### 20. Print out the top 15 words for each of the 15 topics

In [51]:
for index,topic in enumerate(lsa_model.components_):
    print(f'TOP 15 WORDS FOR TOPIC #{index+1}')
    print([tfidf.get_feature_names()[i] for i in topic.argsort()[-15:]][::-1])
    print('\n')

TOP 15 WORDS FOR TOPIC #1
['best', 'way', 'learn', 'india', 'money', 'make', 'does', 'online', 'quora', 'life', 'books', 'ways', 'book', 'programming', 'language']


TOP 15 WORDS FOR TOPIC #2
['does', 'quora', 'money', 'make', 'people', 'like', 'mean', 'questions', 'feel', 'life', 'work', 'online', 'question', 'earn', 'ask']


TOP 15 WORDS FOR TOPIC #3
['quora', 'questions', 'people', 'question', 'ask', 'money', 'answers', 'answer', 'google', 'make', 'asked', 'online', 'easily', 'earn', 'delete']


TOP 15 WORDS FOR TOPIC #4
['money', 'make', 'online', 'earn', '500', 'way', '1000', 'notes', 'ways', 'black', 'youtube', 'india', 'rupee', 'easy', 'rs']


TOP 15 WORDS FOR TOPIC #5
['life', 'people', 'india', 'trump', 'know', 'donald', 'think', 'good', 'like', 'did', 'purpose', 'things', 'love', 'thing', 'important']


TOP 15 WORDS FOR TOPIC #6
['india', 'trump', 'donald', 'people', 'president', 'clinton', 'hillary', 'think', 'win', '500', 'notes', '1000', 'did', 'election', 'pakistan']


TO

#### 21. Create a dataframe with each question and the corresponding topic

In [53]:
topic_results = lsa_model.transform(dtm)

In [54]:
quora_lsa = quora.copy()

In [55]:
quora_lsa['Topic'] = topic_results.argmax(axis=1)+1

In [56]:
quora_lsa.head()

Unnamed: 0,Question,Topic
0,What is the step by step guide to invest in sh...,6
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,14
2,How can I increase the speed of my internet co...,2
3,Why am I mentally very lonely? How can I solve...,2
4,"Which one dissolve in water quikly sugar, salt...",2


It's now time to do the same thing with the NMF method.

## NMF

With Non-Negative Matrix Factorization, our TF-IDF document-term matrix is decomposed into **two factors** whose product approximates the original, with the constraint that every value in both factors is non-negative (positive or zero). One factor relates documents to topics and the other relates topics to terms.
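As an illustration of the idea, the sketch below factorizes a tiny matrix with the classic multiplicative update rules. This is a different solver from the coordinate descent used by scikit-learn's `NMF` in this notebook, and the toy matrix, helper names and iteration count are assumptions chosen to keep the example self-contained.

```python
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh)) ** 0.5

def nmf(V, n_components, n_iter=200, seed=42, eps=1e-9):
    """Approximate V (documents x terms) as W (documents x topics) times H (topics x terms)."""
    rng = random.Random(seed)
    W = [[rng.random() for _ in range(n_components)] for _ in range(len(V))]
    H = [[rng.random() for _ in range(len(V[0]))] for _ in range(n_components)]
    initial_error = frobenius_error(V, W, H)
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)] for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)] for rw, rn, rd in zip(W, num, den)]
    return W, H, initial_error

# toy document-term matrix with two obvious groups of documents
V = [
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
]

W, H, initial_error = nmf(V, n_components=2)
final_error = frobenius_error(V, W, H)
print(round(initial_error, 3), round(final_error, 3))
```

Because the updates only multiply by non-negative ratios, `W` and `H` stay non-negative throughout, which is what makes the resulting topics additive and comparatively easy to read.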

#### 22. Create an instance of NMF with 15 expected components and fit it

In [57]:
from sklearn.decomposition import NMF

In [58]:
nmf_model = NMF(n_components=15, random_state=42)

In [59]:
%%time

nmf_model.fit(dtm)

Wall time: 37 s


NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=15, random_state=42, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

#### 23. Print out the top 15 words for each of the 15 topics

In [60]:
for index,topic in enumerate(nmf_model.components_):
    print(f'TOP 15 WORDS FOR TOPIC #{index+1}')
    print([tfidf.get_feature_names()[i] for i in topic.argsort()[-15:]][::-1])
    print('\n')

TOP 15 WORDS FOR TOPIC #1
['best', 'way', 'movies', 'book', 'books', '2016', 'movie', 'laptop', 'buy', 'ways', 'time', 'phone', 'places', 'visit', 'place']


TOP 15 WORDS FOR TOPIC #2
['does', 'mean', 'work', 'feel', 'long', 'cost', 'compare', 'really', 'time', 'exist', 'sex', 'use', 'looking', 'differ', 'recruit']


TOP 15 WORDS FOR TOPIC #3
['quora', 'questions', 'question', 'ask', 'answer', 'answers', 'google', 'asked', 'delete', 'improvement', 'easily', 'post', 'needing', 'answered', 'add']


TOP 15 WORDS FOR TOPIC #4
['money', 'make', 'online', 'earn', 'way', 'ways', 'youtube', 'easy', 'home', 'easiest', 'free', 'internet', 'black', 'friends', 'facebook']


TOP 15 WORDS FOR TOPIC #5
['life', 'purpose', 'meaning', 'thing', 'important', 'real', 'moment', 'change', 'want', 'live', 'day', 'changed', 'death', 'did', 'earth']


TOP 15 WORDS FOR TOPIC #6
['india', 'pakistan', 'war', 'world', 'start', 'spotify', 'country', 'job', 'business', 'available', 'olympics', 'china', 'engineering'

#### 24. Create a dataframe with each question and the corresponding topic

In [61]:
topic_results = nmf_model.transform(dtm)

In [62]:
quora_nmf = quora.copy()

In [63]:
quora_nmf['Topic'] = topic_results.argmax(axis=1)+1

In [64]:
quora_nmf.head()

Unnamed: 0,Question,Topic
0,What is the step by step guide to invest in sh...,6
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,9
2,How can I increase the speed of my internet co...,4
3,Why am I mentally very lonely? How can I solve...,12
4,"Which one dissolve in water quikly sugar, salt...",15


We've managed to assign each question to its corresponding topic using **three different methods**. So, which is the best model? As is almost always the case, it depends. We have to consider the corpus we are dealing with and the goal we're trying to achieve.

Generally speaking, **LDA is the better choice when you want topic vectors that are easier to explain**, but **LSA and NMF are usually much faster to compute**.