# **Text Classification using LDA**

We approach this task through topic modeling, which offers a way to organize, scan, and summarize large collections of text. In this notebook we build a classifier that uses word counts as features and groups questions by the similarities in those counts. The algorithm we use is Latent Dirichlet Allocation (LDA), a statistical model that aims to discover the topics present in each document.
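To make the idea behind LDA concrete, here is a minimal sketch of the generative story it assumes: every topic is a distribution over words, every document is a mixture of topics, and each word is drawn by first picking a topic and then picking a word from that topic. The vocabulary, the number of topics, and the Dirichlet parameter 0.5 are all made up for illustration and are not taken from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and topic count (illustration only)
vocab = ["india", "money", "phone", "app", "love", "people"]
n_topics = 2

# One word distribution per topic (each row sums to 1)
topic_word = rng.dirichlet(np.full(len(vocab), 0.5), size=n_topics)

# Topic mixture of a single document (sums to 1)
doc_topic = rng.dirichlet(np.full(n_topics, 0.5))

# Generate 8 words: pick a topic, then pick a word from that topic
words = []
for _ in range(8):
    topic = rng.choice(n_topics, p=doc_topic)
    word_index = rng.choice(len(vocab), p=topic_word[topic])
    words.append(vocab[word_index])

print(doc_topic.round(2))
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic-word distributions and the per-document topic mixtures.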

In [351]:
# Load the dataset with pandas; we read 200000 rows, i.e. 200000 questions
import os
import pandas as pd


dataframe = pd.read_csv('quora_questions.csv', nrows=200000)
dataframe.columns = ["questions"]
print('We have',len(dataframe), 'questions in the data')




We have 200000 questions in the data


In [352]:
a = 100
for i in range(a,a+10):
    print(dataframe.questions[i])
    print()

Will there really be any war between India and Pakistan over the Uri attack? What will be its effects?

Did Ronald Reagan have a mannerism in his speech?

What were the war strategies of the Union and the Confederates during the Civil War?

Which is the best fiction novel of 2016?

Can I recover my email if I forgot the password?

Will the recent demonetisation results in higher GDP? If so how much?

Have you ever heard of travel hacking?

What's the difference between love and pity?

How competitive is the hiring process at Republic Bank?

How Google helps in spam ranking adjustment of the search results?



In [353]:
dataframe.head()

Unnamed: 0,questions
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


We use scikit-learn's CountVectorizer to turn each question into a vector of word counts. As part of this preprocessing step we also configure it to drop very common words, so that high-frequency words do not dominate the topics.

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

To create the vectorizer we set two parameters, max_df and min_df, which control which words are dropped from the vocabulary. min_df ignores words that occur in very few documents: with min_df=2, any word that appears in fewer than 2 documents is removed. Similarly, max_df ignores words that are too common: with max_df=0.95, any word that appears in more than 95% of the documents is removed. We also remove English stop words by setting stop_words="english".

In [0]:
count_vectorizer = CountVectorizer(min_df=2, stop_words="english", max_df=0.95)
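The effect of min_df and max_df is easiest to see on a tiny corpus. The sketch below uses four made-up sentences (not from the dataset) and compares the vocabulary with and without the two thresholds. stop_words is left out here so that only the document-frequency filters are at work.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (illustration only)
toy_docs = [
    "how do cats sleep",
    "how do dogs sleep",
    "how do birds fly",
    "how do cats hunt",
]

# No filtering: every token of two or more characters is kept
unfiltered = CountVectorizer().fit(toy_docs)

# Same thresholds as in the notebook
filtered = CountVectorizer(min_df=2, max_df=0.95).fit(toy_docs)

print(sorted(unfiltered.vocabulary_))
print(sorted(filtered.vocabulary_))  # ['cats', 'sleep']
```

"how" and "do" occur in all four documents (100% > 95%), so max_df removes them. "dogs", "birds", "fly", and "hunt" occur in only one document each, so min_df removes them.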

We use fit_transform to learn the vocabulary and transform the data into a sparse document-term matrix.

In [0]:
doc_term_matrix = count_vectorizer.fit_transform(dataframe['questions'])

In [357]:
doc_term_matrix  # 200000 questions (rows) and 27884 vocabulary words (columns)

<200000x27884 sparse matrix of type '<class 'numpy.int64'>'
	with 981746 stored elements in Compressed Sparse Row format>
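As a small illustration of what this sparse matrix holds, the sketch below builds a document-term matrix for two made-up documents and prints it in dense form. Each row is a document, each column is a vocabulary word, and each entry is how often that word occurs in that document. Only the non-zero entries are actually stored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents (illustration only)
toy_docs = ["apple banana apple", "banana cherry"]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Column order follows the alphabetically sorted vocabulary
print(sorted(toy_vectorizer.vocabulary_.items(), key=lambda kv: kv[1]))
print(toy_matrix.shape)      # (2, 3): 2 documents, 3 words
print(toy_matrix.nnz)        # 4 stored (non-zero) entries
print(toy_matrix.toarray())  # dense view, only sensible for tiny matrices
```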

In [0]:
from sklearn.decomposition import LatentDirichletAllocation  # import the LDA model from scikit-learn

We build an LDA model with the LatentDirichletAllocation class, which learns the topics along with their probability distributions. The n_components parameter sets the number of topics we want to divide the text into, and random_state seeds the random number generator so that the results are reproducible. As a starting point we use 10 topics.

In [0]:
lda = LatentDirichletAllocation(n_components=10,random_state=1)

In [360]:
lda.fit(doc_term_matrix)  # learn the topic-word distributions from the data

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=10, n_jobs=None,
                          perp_tol=0.1, random_state=1, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)
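To see what n_components and random_state do without waiting for the full fit, here is a minimal sketch on four made-up documents. The corpus and the choice of 2 topics are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (illustration only)
toy_docs = [
    "phone app android phone",
    "android app download phone",
    "india rupee notes india",
    "rupee notes government india",
]
toy_counts = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=1)
toy_doc_topics = toy_lda.fit_transform(toy_counts)

# One row per topic, one column per vocabulary word
print(toy_lda.components_.shape)
# One row per document; each row is a distribution over the topics
print(toy_doc_topics.round(2))

# Fitting again with the same random_state gives the same result
repeat = LatentDirichletAllocation(n_components=2, random_state=1).fit_transform(toy_counts)
print(np.allclose(toy_doc_topics, repeat))
```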

We compute the log-likelihood and the perplexity to estimate how well the model performs. The log-likelihood measures how probable the observed data is under the model (scikit-learn reports an approximate value), and the perplexity measures how well the model predicts the data. A higher log-likelihood is better, and a lower perplexity is better.

In [361]:
print("Log Likelihood: ", lda.score(doc_term_matrix))

Log Likelihood:  -8509084.295125658


In [362]:
print("Perplexity: ", lda.perplexity(doc_term_matrix))

Perplexity:  4577.806873046729
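The two numbers are closely related. As far as we understand scikit-learn's implementation, perplexity is the exponential of the negative log-likelihood per word, that is exp(-score / total word count). The sketch below checks this on a made-up toy corpus. Treat the relationship as an assumption about the library's internals, not as documented behaviour.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (illustration only)
toy_docs = [
    "phone app android phone",
    "android app download phone",
    "india rupee notes india",
    "rupee notes government india",
]
toy_counts = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=1).fit(toy_counts)

log_likelihood = toy_lda.score(toy_counts)   # approximate log-likelihood, higher is better
perplexity = toy_lda.perplexity(toy_counts)  # lower is better
total_words = toy_counts.sum()               # total number of counted word occurrences

# Perplexity recomputed by hand from the log-likelihood
per_word_perplexity = np.exp(-log_likelihood / total_words)
print(perplexity, per_word_perplexity)
```

Because perplexity is normalized by the word count, it is the more convenient number when comparing models fitted with different values of n_components on the same data.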



**Word Vocabulary**

In [364]:
len(count_vectorizer.get_feature_names())  # number of words in the vocabulary learned by count_vectorizer

27884

In [365]:
count_vectorizer.get_feature_names()[610]

'484'

In [366]:
lda.components_

array([[2.15054241e+01, 1.76760456e-01, 1.00000001e-01, ...,
        1.00000000e-01, 3.09998705e+00, 1.00000000e-01],
       [2.54817362e+00, 3.69781128e+02, 1.00118556e-01, ...,
        1.00000000e-01, 1.00012946e-01, 1.00000000e-01],
       [1.00044631e-01, 1.23420584e+01, 1.00000001e-01, ...,
        1.00000000e-01, 1.00000000e-01, 1.00000000e-01],
       ...,
       [1.00004729e-01, 1.00003751e-01, 1.00000001e-01, ...,
        1.00000000e-01, 1.00000000e-01, 1.00000000e-01],
       [2.46309533e-01, 1.00004442e-01, 1.00000001e-01, ...,
        1.00000000e-01, 1.00000000e-01, 1.00000801e-01],
       [1.00001356e-01, 1.00021371e-01, 1.00000001e-01, ...,
        2.10000000e+00, 1.00000000e-01, 5.09999888e+00]])

In [367]:
lda.components_.shape  # one row per topic, one column per word; each entry is the weight of that word in that topic

(10, 27884)
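Note that the entries of lda.components_ printed above (for example 3.69781128e+02) are larger than 1, so they are unnormalized weights, not probabilities. Dividing each row by its sum turns a topic into a proper probability distribution over the words. The sketch below shows this on a small made-up weight matrix that stands in for lda.components_.

```python
import numpy as np

# Hypothetical 2-topic, 3-word weight matrix standing in for lda.components_
toy_components = np.array([
    [2.0, 1.0, 1.0],
    [0.1, 0.1, 0.8],
])

# Divide every row by its own sum so that each topic sums to 1
topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)

print(topic_word_probs)
print(topic_word_probs.sum(axis=1))  # [1. 1.]
```

The ranking of words inside a topic is the same before and after normalization, so argsort on the raw weights still finds the top words.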

In [0]:
first_topic = lda.components_[0]

In [369]:
first_topic.argsort()  # vocabulary indices sorted from lowest to highest weight

array([  910, 26915, 10865, ..., 18522, 14630, 14032])
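Since argsort returns indices in ascending order of weight, the last entries point at the heaviest words. A minimal sketch with made-up weights shows how slicing picks the top entries and how to reverse them if the heaviest word should come first.

```python
import numpy as np

# Hypothetical weights of four words in one topic
weights = np.array([0.1, 5.0, 2.0, 0.3])

order = weights.argsort()  # indices from lowest to highest weight
top_two = order[-2:]       # the two heaviest words, heaviest last
top_two_desc = top_two[::-1]  # same two words, heaviest first

print(order)         # [0 3 2 1]
print(top_two)       # [2 1]
print(top_two_desc)  # [1 2]
```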

In [370]:
# Print the top words of each topic; argsort is ascending, so the heaviest word is printed last
feature_names = count_vectorizer.get_feature_names()  # look the vocabulary up once, not once per word
probability_list = []  # vocabulary indices of all printed words

top_number = 20
topic_count = 0

for probability_number in lda.components_:
    text_message = f'Top words for topic {topic_count} are : '
    print(text_message)

    for number in probability_number.argsort()[-top_number:]:
        print([feature_names[number]], end='')

        probability_list.append(number)
    print('\n')
    topic_count+=1

Top words for topic 0 are : 
['love']['guy']['stop']['earth']['going']['difference']['read']['new']['year']['don']['old']['day']['best']['books']['things']['girl']['does']['people']['like']['know']

Top words for topic 1 are : 
['laptop']['download']['tv']['favorite']['did']['watch']['mobile']['free']['energy']['buy']['iphone']['does']['app']['android']['movies']['new']['movie']['phone']['make']['best']

Top words for topic 2 are : 
['good']['writing']['skills']['think']['salary']['black']['modi']['rupee']['money']['government']['2016']['english']['rs']['prepare']['improve']['1000']['500']['notes']['indian']['india']

Top words for topic 3 are : 
['data']['earn']['learning']['company']['computer']['programming']['like']['business']['way']['india']['language']['make']['start']['engineering']['job']['good']['online']['money']['learn']['best']

Top words for topic 4 are : 
['suicide']['snapchat']['thing']['like']['sentence']['real']['die']['people']['purpose']['bad']['does']['china']['wri

In [371]:
# Use a for loop to show the top 20 words of each topic; raise top_number to show more
top_number = 20
count = 0
for probability_number in lda.components_:
    print(f"Top words for topic {count} are : ")    
    for number in probability_number.argsort()[-top_number:]:
        print([count_vectorizer.get_feature_names()[number]], end= "")
    print("\n")
    count += 1

Top words for topic 0 are : 
['love']['guy']['stop']['earth']['going']['difference']['read']['new']['year']['don']['old']['day']['best']['books']['things']['girl']['does']['people']['like']['know']

Top words for topic 1 are : 
['laptop']['download']['tv']['favorite']['did']['watch']['mobile']['free']['energy']['buy']['iphone']['does']['app']['android']['movies']['new']['movie']['phone']['make']['best']

Top words for topic 2 are : 
['good']['writing']['skills']['think']['salary']['black']['modi']['rupee']['money']['government']['2016']['english']['rs']['prepare']['improve']['1000']['500']['notes']['indian']['india']

Top words for topic 3 are : 
['data']['earn']['learning']['company']['computer']['programming']['like']['business']['way']['india']['language']['make']['start']['engineering']['job']['good']['online']['money']['learn']['best']

Top words for topic 4 are : 
['suicide']['snapchat']['thing']['like']['sentence']['real']['die']['people']['purpose']['bad']['does']['china']['wri

As we can see from the records above, topic 7 is related to Politics, Law, Government, and Judiciary. Now we add the relevant topic number to the dataframe.

In [0]:
textfile_topics = lda.transform(doc_term_matrix)

In [373]:
textfile_topics[0].round(2)

array([0.01, 0.01, 0.89, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01])

In [374]:
textfile_topics[0].argmax()

2

In [0]:
examined_topic = lda.components_[2]

In [376]:
# Show more words for better topic selection for this topic (2)
for index in examined_topic.argsort()[-50:]:
    print(count_vectorizer.get_feature_names()[index], end=" ")

real prime narendra does stock rupees affect pro indians gate invest foreign marks cat jee iit decision help 2017 note banning ban market exam currency 2000 score new difference economy good writing skills think salary black modi rupee money government 2016 english rs prepare improve 1000 500 notes indian india 

In [0]:
topic_list = []
# textfile_topics is an array with one row per question;
# each row holds that question's probability for every topic
for popular_index_pos in textfile_topics:
    # Get the index of the most probable topic in each row
    # and add it to topic_list
    topic_list.append(popular_index_pos.argmax())

# Add a new column to the dataframe
dataframe["Topic number"] = topic_list
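The same assignment can also be written without an explicit loop, because NumPy can take the argmax of every row at once. The sketch below compares both variants on a small made-up document-topic array that stands in for textfile_topics.

```python
import numpy as np

# Hypothetical document-topic probabilities: 3 documents, 3 topics
toy_doc_topics = np.array([
    [0.01, 0.89, 0.10],
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
])

# Loop version, one argmax per row
loop_topics = [row.argmax() for row in toy_doc_topics]

# Vectorized version, argmax along each row
vectorized_topics = toy_doc_topics.argmax(axis=1)

print(loop_topics)
print(vectorized_topics)  # [1 0 2]
```

On 200000 rows the vectorized form is usually noticeably faster, and its result can be assigned to a dataframe column directly.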

In [378]:
dataframe



Unnamed: 0,questions,Topic number
0,What is the step by step guide to invest in sh...,2
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,0
2,How can I increase the speed of my internet co...,5
3,Why am I mentally very lonely? How can I solve...,7
4,"Which one dissolve in water quikly sugar, salt...",1
...,...,...
199995,Why was the Battle of Vimy Ridge so important?,1
199996,Which of these TV shows should I watch next?,1
199997,Should I change my name?,6
199998,Should I buy the new MacBook 2016 or one from ...,2


In [0]:
topic_list = {0: "Art, Design, and Style", 
              1: "Humanities", 
              2: "Life, Relationships, and Self", 
              3: "Business, Work, and Careers", 
              4: "Recreation, Sports, Travel, and Activities", 
              5: "Science, Technology, Engineering, and Mathematics", 
              6: "horoscopes", 
              7: "Politics, Law, Government, and Judiciary", 
              8: "Literature, Languages, and Communication", 
              9: "Medicine and Healthcare", 
            }



In [0]:
topic_no_to_topic = dataframe["Topic number"].map(topic_list)

In [0]:

dataframe["Topic desc"] = topic_no_to_topic
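Series.map with a dictionary looks up every value of the column as a key and returns the matching label. A small sketch with a made-up frame and a deliberately incomplete label dictionary shows one thing to watch for: topic numbers that are missing from the dictionary become NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical miniature frame (illustration only)
toy = pd.DataFrame({"Topic number": [2, 0, 7]})

# Deliberately incomplete mapping: topic 7 has no label here
toy_labels = {
    0: "Art, Design, and Style",
    2: "Life, Relationships, and Self",
}

toy["Topic desc"] = toy["Topic number"].map(toy_labels)
print(toy)

# Count rows that did not receive a label
print(toy["Topic desc"].isna().sum())  # 1
```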

In [382]:
dataframe.head(10)

Unnamed: 0,questions,Topic number,Topic desc
0,What is the step by step guide to invest in sh...,2,"Life, Relationships, and Self"
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,0,"Art, Design, and Style"
2,How can I increase the speed of my internet co...,5,"Science, Technology, Engineering, and Mathematics"
3,Why am I mentally very lonely? How can I solve...,7,"Politics, Law, Government, and Judiciary"
4,"Which one dissolve in water quikly sugar, salt...",1,Humanities
5,Astrology: I am a Capricorn Sun Cap moon and c...,5,"Science, Technology, Engineering, and Mathematics"
6,Should I buy tiago?,1,Humanities
7,How can I be a good geologist?,3,"Business, Work, and Careers"
8,When do you use シ instead of し?,7,"Politics, Law, Government, and Judiciary"
9,Motorola (company): Can I hack my Charter Moto...,3,"Business, Work, and Careers"
