# Topic Modeling

The goal of this project is to **assign over 400 000 Quora questions to different categories**, or topics.

For that, we'll be using two different methods:
* **Latent Dirichlet Allocation (LDA)**
* **Non-Negative Matrix Factorization (NMF)**

#### 1. Perform initial imports

In [1]:
import pandas as pd

#### 2. Load data

In [2]:
quora = pd.read_csv("data/quora_questions.csv")

#### 3. Check the dataframe

In [3]:
quora.head()

| | Question |
|---|---|
| 0 | What is the step by step guide to invest in sh... |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... |
| 2 | How can I increase the speed of my internet co... |
| 3 | Why am I mentally very lonely? How can I solve... |
| 4 | Which one dissolve in water quikly sugar, salt... |


#### 4. Check missing values

In [4]:
quora.isnull().sum()

Question    0
dtype: int64

There are no missing questions.

#### 5. Check empty strings

In [5]:
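# a question counts as empty if nothing is left after stripping whitespace
# (isspace() alone would miss the zero-length string "")

empty_strings = []

for i, q in quora.itertuples():
    if not q.strip():
        empty_strings.append(i)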

In [6]:
print(empty_strings)
print(len(empty_strings))

[]
0


There are no questions that correspond to empty strings.
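
The same check can also be written without an explicit loop. The sketch below uses a small made-up dataframe (`toy` and its rows are illustrative, not taken from the dataset) and pandas string methods. Stripping before comparing means that both whitespace-only strings and the zero-length string `""` are flagged.

```python
import pandas as pd

# Made-up stand-in for the `quora` dataframe (hypothetical rows)
toy = pd.DataFrame(
    {"Question": ["What is NLP?", "   ", "", "How do I learn Python?"]}
)

# Strip surrounding whitespace, then flag rows where nothing is left.
# Note that "".isspace() is False, so strip() is the safer test here.
is_blank = toy["Question"].str.strip().eq("")

# Collect the index labels of the blank rows
blank_indices = toy.index[is_blank].tolist()

print(blank_indices)
print(len(blank_indices))
```

On the real data, the same expression would presumably be applied to `quora['Question']`.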

#### 6. Check the number of questions

In [7]:
# check length

len(quora)

404289

We have 404 289 Quora questions. Our dataset is clean, and we can now perform topic modeling with LDA and NMF.

## LDA

#### 7. Create a vectorized document term matrix with `CountVectorizer`

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [10]:
dtm = cv.fit_transform(quora['Question'])

In [11]:
dtm

<404289x38669 sparse matrix of type '<class 'numpy.int64'>'
	with 2002912 stored elements in Compressed Sparse Row format>

We now have a sparse document-term matrix with one row for each of the 404 289 questions and one column for each of the 38 669 distinct words. Let's try to group these questions into 20 different topics with the LDA method.
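
To see what `max_df`, `min_df` and `stop_words` do to the vocabulary, here is a minimal sketch on a made-up four-sentence corpus (the sentences and the names `docs`, `toy_cv`, `toy_dtm` are illustrative only). With these settings, a word is dropped if it appears in more than 95% of the documents, in fewer than 2 documents, or in the built-in English stop word list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus (not from the Quora data)
docs = [
    "best laptop for programming",
    "best phone for photos",
    "best way to learn programming",
    "best laptop for photos",
]

toy_cv = CountVectorizer(max_df=0.95, min_df=2, stop_words="english")
toy_dtm = toy_cv.fit_transform(docs)

# Newer scikit-learn releases replaced get_feature_names() with
# get_feature_names_out(), so use whichever one is available.
if hasattr(toy_cv, "get_feature_names_out"):
    vocab = list(toy_cv.get_feature_names_out())
else:
    vocab = list(toy_cv.get_feature_names())

# "best" is in 4 of 4 documents (more than 95%), so max_df removes it.
# "phone", "way" and "learn" are in only 1 document, so min_df removes them.
# "for" and "to" are English stop words.
print(vocab)
print(toy_dtm.toarray())
```

Only the words that occur in exactly two or three of the four sentences survive, which is the same filtering that reduces the Quora questions to their final vocabulary.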

#### 8. Create an instance of LDA with 20 expected components and fit it

In [12]:
from sklearn.decomposition import LatentDirichletAllocation

In [13]:
LDA = LatentDirichletAllocation(n_components=20,random_state=42)

In [14]:
LDA.fit(dtm)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=20, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

#### 9. Check how to access the words and topics

In [15]:
# words

print(len(cv.get_feature_names()))

38669


In [16]:
import random

for i in range(10):
    random_word_id = random.randint(0,38668)
    print(cv.get_feature_names()[random_word_id])

schooled
backup
shitty
shepherds
azealia
curly
intervening
arsenal
15days
darkseid


As we have seen before, the vocabulary contains 38 669 different words. We've printed out 10 random words from it.

In [17]:
# topics

len(LDA.components_)

20

In [18]:
len(LDA.components_[0])

38669

As expected, we have 20 different topics. Each topic is represented by a vector of 38 669 weights, one per word in the vocabulary.
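
The raw values in `components_` are weights rather than probabilities. According to the scikit-learn documentation, dividing each row by its sum turns it into a distribution over words for that topic. A small sketch with a made-up 2-topic, 3-word matrix standing in for `LDA.components_` (the numbers are illustrative):

```python
import numpy as np

# Hypothetical stand-in for LDA.components_: 2 topics x 3 words
toy_components = np.array([[2.0, 6.0, 2.0],
                           [1.0, 1.0, 2.0]])

# Divide every row by its own sum so that each topic's weights add up to 1
topic_word_dist = toy_components / toy_components.sum(axis=1, keepdims=True)

print(topic_word_dist)
print(topic_word_dist.sum(axis=1))
```

The ranking of words inside a topic is unchanged by this normalization, so the top-word listings in this notebook can be computed on the raw weights.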

In [19]:
# top 10 words for topic #0

top10_word_indices = LDA.components_[0].argsort()[-10:]

for index in top10_word_indices:
    print(cv.get_feature_names()[index])

media
google
good
company
india
social
career
history
service
best


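These are the top 10 words for topic #0, listed in ascending order of weight (`argsort()` sorts from smallest to largest), so `best` is the strongest word.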
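
The ordering produced by `argsort()[-n:]` can be checked on a tiny hand-made example. The weights and vocabulary below are hypothetical. The point is only that the slice returns the indices of the largest values with the smallest of them first, and that reversing it gives a strongest-first ranking.

```python
import numpy as np

# Hypothetical word weights for one topic and a matching toy vocabulary
weights = np.array([0.2, 3.1, 0.7, 5.4, 1.9])
vocab = np.array(["apple", "bank", "car", "data", "exam"])

# Indices of the 3 largest weights, ordered from smallest to largest
top3_ascending = weights.argsort()[-3:]

# Reverse the slice to list the strongest word first
top3_descending = top3_ascending[::-1]

print(vocab[top3_ascending].tolist())   # weakest of the top 3 first
print(vocab[top3_descending].tolist())  # strongest first
```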

#### 10. Print out the top 15 words for each of the 20 topics

In [20]:
feature_names = cv.get_feature_names()  # build the vocabulary list once

for index, topic in enumerate(LDA.components_):
    print(f'TOP 15 WORDS FOR TOPIC #{index}')
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')

TOP 15 WORDS FOR TOPIC #0
['sydney', 'development', 'code', 'open', 'services', 'media', 'google', 'good', 'company', 'india', 'social', 'career', 'history', 'service', 'best']


TOP 15 WORDS FOR TOPIC #1
['economy', 'process', 'making', 'government', 'rupee', 'india', 'word', 'money', 'rs', 'english', 'black', 'indian', '1000', 'notes', '500']


TOP 15 WORDS FOR TOPIC #2
['current', 'ones', 'alcohol', 'center', 'legal', 'home', 'state', 'compare', 'man', 'purpose', 'good', 'india', 'cost', 'average', 'does']


TOP 15 WORDS FOR TOPIC #3
['answers', 'year', 'facts', 'apple', 'mind', 'series', 'looking', 'interesting', 'worth', 'big', 'exist', 'tv', 'does', 'iphone', 'new']


TOP 15 WORDS FOR TOPIC #4
['australia', 'overcome', 'usa', 'students', 'student', 'visa', 'canada', 'mba', 'apply', 'jobs', 'college', 'differences', 'india', 'car', 'job']


TOP 15 WORDS FOR TOPIC #5
['different', 'russia', 'win', 'relationship', 'culture', 'countries', 'pakistan', 'china', 'like', 'math', 'india',

#### 11. Create a dataframe with each question and the corresponding topic

In [21]:
topic_results = LDA.transform(dtm)

In [22]:
topic_results.shape

(404289, 20)

In [23]:
# for question #2

topic_results[2]

array([0.00714286, 0.00714286, 0.00714286, 0.00714286, 0.00714286,
       0.00714286, 0.00714286, 0.00714286, 0.72140237, 0.1500262 ,
       0.00714286, 0.00714286, 0.00714286, 0.00714286, 0.00714286,
       0.00714286, 0.00714286, 0.00714286, 0.00714286, 0.00714286])

In [24]:
topic_results[2].round(2)

array([0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.72, 0.15, 0.01,
       0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01])

In [25]:
topic_results[2].argmax()

8

Topic #8 has the highest probability (0.72), so question #2 is assigned to topic #8.
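
The numbers in this row can be sanity-checked by hand. The sketch below copies the printed distribution for question #2, confirms that it sums to 1, and recovers the winning topic. It also tries to explain the repeated value 0.00714286 under two assumptions that the notebook does not state explicitly: that `doc_topic_prior=None` means a prior of `1 / n_components` (0.05 here), and that question #2 contributes 6 counted words after vectorization. The second assumption is inferred from the numbers only.

```python
import numpy as np

# Distribution for question #2, copied from the output above
q2 = np.array([0.00714286] * 8 + [0.72140237, 0.1500262] + [0.00714286] * 10)

total = q2.sum()                    # a probability distribution sums to 1
best_topic = int(q2.argmax())       # index of the largest probability
runner_up = int(q2.argsort()[-2])   # index of the second largest

# Assumption: default prior of 1 / n_components for 20 topics
prior = 1 / 20
# Assumption: 6 counted words, so the normalizing total is 6 + 20 * 0.05 = 7
floor = prior / (6 + 20 * prior)

print(round(total, 6), best_topic, runner_up, round(floor, 8))
```

If those assumptions hold, every topic that received essentially no words from the question sits at the prior-only floor of 0.05 / 7, and the remaining mass is shared between topics #8 and #9.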

In [26]:
# for all questions

quora_lda = quora.copy()
quora_lda['Topic'] = topic_results.argmax(axis=1)

In [27]:
quora_lda.head()

| | Question | Topic |
|---|---|---|
| 0 | What is the step by step guide to invest in sh... | 16 |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | 17 |
| 2 | How can I increase the speed of my internet co... | 8 |
| 3 | Why am I mentally very lonely? How can I solve... | 19 |
| 4 | Which one dissolve in water quikly sugar, salt... | 17 |


We've managed to assign each question to one of the 20 topics. Let's do the same thing with the NMF method.

## NMF

#### 12. Create a vectorized document term matrix with `TfidfVectorizer`

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [29]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [30]:
dtm = tfidf.fit_transform(quora['Question'])

In [31]:
dtm

<404289x38669 sparse matrix of type '<class 'numpy.float64'>'
	with 2002912 stored elements in Compressed Sparse Row format>

We have now used `TfidfVectorizer` instead of `CountVectorizer` to create our document term matrix. Let's try to group these questions into 20 different topics with the NMF method.
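
The matrix has the same shape and sparsity pattern as before, but its entries are now floating-point TF-IDF weights instead of integer counts. A minimal sketch on a made-up corpus (the sentences and the names `docs`, `toy_tfidf` are illustrative) shows one consequence of the default settings: each non-empty row is scaled to unit Euclidean length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus (not from the Quora data)
docs = [
    "best laptop for programming",
    "best phone for photos",
    "best way to learn programming",
    "best laptop for photos",
]

toy_tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words="english")
dense = toy_tfidf.fit_transform(docs).toarray()

# With the default norm='l2', every row with at least one term has length 1
row_norms = np.linalg.norm(dense, axis=1)

print(dense.round(2))
print(row_norms)
```

Because of this scaling, long and short questions contribute on a comparable scale to the NMF factorization.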

#### 13. Create an instance of NMF with 20 expected components and fit it

In [32]:
from sklearn.decomposition import NMF

In [33]:
nmf_model = NMF(n_components=20, random_state=42)

In [34]:
nmf_model.fit(dtm)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=20, random_state=42, shuffle=False, solver='cd', tol=0.0001,
  verbose=0)

#### 14. Print out the top 15 words for each of the 20 topics

In [35]:
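# the NMF model was fitted on the TF-IDF matrix, so read the vocabulary from `tfidf`
feature_names = tfidf.get_feature_names()

for index, topic in enumerate(nmf_model.components_):
    print(f'TOP 15 WORDS FOR TOPIC #{index}')
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')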

TOP 15 WORDS FOR TOPIC #0
['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']


TOP 15 WORDS FOR TOPIC #1
['majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']


TOP 15 WORDS FOR TOPIC #2
['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']


TOP 15 WORDS FOR TOPIC #3
['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']


TOP 15 WORDS FOR TOPIC #4
['balance', 'earth', 'day', 'death', 'changed', 'live', 'want', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']


TOP 15 WORDS FOR TOPIC #5
['reservation', 'engineering', 'minister', 'president', 'company', 'china', 'business', 'country', 'olympics', 'available',

#### 15. Create a dataframe with each question and the corresponding topic

In [36]:
topic_results = nmf_model.transform(dtm)

In [37]:
quora_nmf = quora.copy()

In [38]:
quora_nmf['Topic'] = topic_results.argmax(axis=1)

In [39]:
quora_nmf.head()

| | Question | Topic |
|---|---|---|
| 0 | What is the step by step guide to invest in sh... | 5 |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | 16 |
| 2 | How can I increase the speed of my internet co... | 17 |
| 3 | Why am I mentally very lonely? How can I solve... | 11 |
| 4 | Which one dissolve in water quikly sugar, salt... | 14 |
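
Topic numbers are arbitrary within each model, so LDA topic #17 and NMF topic #17 are unrelated, and the two `Topic` columns cannot be compared value by value. One possible way to relate them is a contingency table. The sketch below uses only the five labels shown in the two tables above. The names `lda_topics`, `nmf_topics` and `overlap` are illustrative.

```python
import pandas as pd

# Topic labels of the first five questions, copied from the tables above
lda_topics = pd.Series([16, 17, 8, 19, 17], name="LDA topic")
nmf_topics = pd.Series([5, 16, 17, 11, 14], name="NMF topic")

# Rows are LDA topics, columns are NMF topics, cells count shared questions
overlap = pd.crosstab(lda_topics, nmf_topics)

print(overlap)
```

On the full data, one would presumably pass `quora_lda['Topic']` and `quora_nmf['Topic']` instead. Cells with large counts then suggest which LDA topic and which NMF topic describe roughly the same group of questions.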
