## Assignment Details
Name        : **Lavish Thomas** <br> 
Student ID  : **L00150445** <br>
Course      : MSc in Big Data Analytics and Artificial Intelligence <br>
Module      : Artificial Intelligence 2 <br> 
Assignment  : NLP CA 2 <br>
File used   : **quora_questions.csv**

--------------

### Libraries Used:
This section describes the libraries used in this project.

#### Numpy
NumPy is the fundamental package for scientific computing with Python. It provides an efficient multi-dimensional container for generic data.

In [1]:
import numpy as np

#### Scikit-learn
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

##### Pandas
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

In [2]:
import pandas as pd

##### CountVectorizer 
CountVectorizer splits the questions into individual words and builds a sparse matrix that holds the count of each word in each question.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
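
As a minimal sketch of what CountVectorizer produces, the toy questions below are turned into a sparse document-term matrix. The questions and the names `toy_questions`, `toy_vectorizer` and `toy_matrix` are made up for illustration and are not taken from quora_questions.csv.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy questions, not from the dataset
toy_questions = [
    "How do I learn Python?",
    "How do I learn physics?",
    "Why is the sky blue?",
]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_questions)

# vocabulary_ maps each word to its column index in the matrix
vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocabulary)
print(toy_matrix.shape)      # (number of questions, number of distinct words)
print(toy_matrix.toarray())  # dense view, only sensible for tiny examples
```

With the default settings the text is lowercased and one-letter tokens such as "I" are dropped, so they never appear as columns.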

##### Latent Dirichlet Allocation
The latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.

In [4]:
from sklearn.decomposition import LatentDirichletAllocation

##### Grid Search
Hyper-parameters are parameters that are not learnt directly within estimators. Grid search is used to find the optimal hyper-parameters for the models.

In [5]:
from sklearn.model_selection import GridSearchCV
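
The sketch below shows how a grid search over LDA hyper-parameters might look on a small random count matrix. The matrix `toy_counts`, the reduced grid `toy_params`, and the settings `cv=2`, `max_iter=5` and `random_state=0` are assumptions chosen only to keep the example fast. They are not the settings used in this assignment.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

# Hypothetical random count matrix standing in for a document-term matrix
rng = np.random.RandomState(0)
toy_counts = rng.randint(0, 5, size=(40, 12))

toy_params = {"n_components": [2, 3], "learning_decay": [0.5, 0.7]}
toy_search = GridSearchCV(
    LatentDirichletAllocation(max_iter=5, random_state=0),
    param_grid=toy_params,
    cv=2,
)
# No target is needed: LDA is unsupervised
toy_search.fit(toy_counts)

print(toy_search.best_params_)
print(len(toy_search.cv_results_["params"]))  # 2 x 2 = 4 combinations
```

When no `scoring` argument is given, GridSearchCV ranks candidates by the estimator's own `score` method, which for LatentDirichletAllocation is an approximate log-likelihood. Note that `learning_decay` is documented as a parameter of the online learning method, so it is unlikely to influence results when `learning_method='batch'`.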

#### MultinomialNB
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification).

In [6]:
from sklearn.naive_bayes import MultinomialNB
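
A minimal sketch of the classifier on hand-made word counts follows. The two columns, the labels and the names `toy_counts`, `toy_labels` and `toy_classifier` are hypothetical and only illustrate how counts map to a predicted class.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts: column 0 = "money", column 1 = "rocket"
toy_counts = np.array([[3, 0], [2, 0], [0, 3], [0, 2]])
toy_labels = np.array([0, 0, 1, 1])  # 0 = finance-like, 1 = science-like

toy_classifier = MultinomialNB()
toy_classifier.fit(toy_counts, toy_labels)

# One document mentioning only "money" and one mentioning only "rocket"
toy_predictions = toy_classifier.predict(np.array([[4, 0], [0, 4]]))
print(toy_predictions)
```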

#### Train and Test split
Split arrays or matrices into random train and test subsets.

In [7]:
from sklearn.model_selection import train_test_split

#### Metrics
The metrics module implements several loss, score, and utility functions to measure classification performance.

In [8]:
from sklearn import metrics


--------------


### Pre-Processing steps

#### Reading the CSV file with questions
The file is read and loaded into a dataframe using the pandas built-in function `read_csv`.

In [9]:
raw_dataframe = pd.read_csv("quora_questions.csv", encoding='utf-8')

#### Size of the data set

In [10]:
print("The size of the dataset is " + str(len(raw_dataframe)))

The size of the dataset is 808578


#### Sample data

In [11]:
raw_dataframe.head(5)

Unnamed: 0,question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


#### Deleting the null rows

In [12]:
bool_series = pd.notnull(raw_dataframe["question"])
not_null_data_frame = raw_dataframe[bool_series]

#### Sampling the data set 
200000 questions are randomly selected from the given dataset. A fixed `random_state` makes the sample reproducible.

In [13]:
data_frame = not_null_data_frame.sample(n = 200000, random_state = 100).sort_index()

In [14]:
### Cross checking the size.
len(data_frame)

200000

In [15]:
### Sample dataset
data_frame.head(5)

Unnamed: 0,question
12,What can make Physics easy to learn?
13,What was your first sexual experience like?
14,What are the laws to change your status from a...
18,Why are so many Quora users posting questions ...
20,Why do rockets look white?


In [16]:
# making sure there are no null rows
data_frame["question"].isnull().sum()

0

### Question 1

#### Aim:


#### Method:


#### Expected output:

--------------


**CountVectorizer** is used with **max_df=0.9**, which ignores words that appear in more than 90% of the questions, and **min_df=4**, which ignores rare words that appear in fewer than 4 questions. Common English stop words are removed as well.
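
The effect of the two thresholds can be seen on a toy corpus. This is only a sketch: the four short documents are made up, and `min_df=2` is used instead of 4 because the corpus is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not from the dataset
toy_questions = [
    "quora question india",
    "quora question money",
    "quora india money",
    "quora rocket",
]

# max_df=0.9 drops words found in more than 90% of documents ("quora"),
# min_df=2 drops words found in fewer than 2 documents ("rocket")
toy_vectorizer = CountVectorizer(max_df=0.9, min_df=2)
toy_vectorizer.fit(toy_questions)

print(sorted(toy_vectorizer.vocabulary_))
```

Only "india", "money" and "question" survive, since each appears in two of the four documents.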

In [17]:
# max_df is a float in [0, 1] (proportion of documents) or an int (absolute document count)
count_vectorizer = CountVectorizer(max_df=0.90, min_df=4, stop_words="english")
doc_term_matrix = count_vectorizer.fit_transform(data_frame["question"])
len(count_vectorizer.get_feature_names())

17151

In [18]:
count_vectorizer.get_feature_names()[0:20]

['00',
 '000',
 '001',
 '01',
 '02',
 '03',
 '04',
 '05',
 '07',
 '08',
 '09',
 '10',
 '100',
 '1000',
 '10000',
 '100000',
 '1000rs',
 '1000s',
 '1000â',
 '100k']

In [19]:
doc_term_matrix = count_vectorizer.fit_transform(data_frame["question"])

In [20]:
doc_term_matrix

<200000x17151 sparse matrix of type '<class 'numpy.int64'>'
	with 963944 stored elements in Compressed Sparse Row format>

In [21]:
# Define the params that we want to use
search_params = {"n_components": [10,12,15,20], "learning_decay": [ .5, .7, .9]}

# Init the model
lda_comparison = LatentDirichletAllocation()

# Init Grid Search Class
lda_comparison = GridSearchCV(lda_comparison, param_grid=search_params)

# Run the grid search
lda_comparison.fit(doc_term_matrix)

GridSearchCV(cv=None, error_score=nan,
             estimator=LatentDirichletAllocation(batch_size=128,
                                                 doc_topic_prior=None,
                                                 evaluate_every=-1,
                                                 learning_decay=0.7,
                                                 learning_method='batch',
                                                 learning_offset=10.0,
                                                 max_doc_update_iter=100,
                                                 max_iter=10,
                                                 mean_change_tol=0.001,
                                                 n_components=10, n_jobs=None,
                                                 perp_tol=0.1,
                                                 random_state=None,
                                                 topic_word_prior=None,
                                                 tota

In [22]:
# Best model which gives the highest score
best_lda_model = lda_comparison.best_estimator_      
# Metrics - log likelihood - higher score = better
print("Log likelihood : ", best_lda_model.score(doc_term_matrix))
# Perplexity - lower = better. 
# = exp(-1 * log likelihood per word)
print("Perplexity: ", best_lda_model.perplexity(doc_term_matrix))

Log likelihood :  -8118499.544938984
Perplexity:  3565.4918336214614
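
The relation noted in the comments, perplexity = exp(-1 * log likelihood per word), can be checked on a small model. This is a sketch under assumptions: the random matrix `toy_counts` and the settings `n_components=3`, `max_iter=5` and `random_state=0` are hypothetical and unrelated to the model fitted above.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical random count matrix standing in for a document-term matrix
rng = np.random.RandomState(0)
toy_counts = rng.randint(0, 5, size=(30, 10))

toy_lda = LatentDirichletAllocation(n_components=3, max_iter=5, random_state=0)
toy_lda.fit(toy_counts)

log_likelihood = toy_lda.score(toy_counts)  # approximate log-likelihood
total_words = toy_counts.sum()              # total word count in the matrix
manual_perplexity = np.exp(-log_likelihood / total_words)

# Both values should agree up to floating-point error
print(manual_perplexity, toy_lda.perplexity(toy_counts))
```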


In [23]:
best_lda_model.components_.shape

(10, 17151)

In [24]:
lda_comparison.best_params_

{'learning_decay': 0.7, 'n_components': 10}

In [25]:
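word_list = []
probability_list = []

top_number = 50
topic_count = 0

# Look up the vocabulary once instead of on every loop iteration
# (on scikit-learn 1.0+ use get_feature_names_out() instead)
feature_names = count_vectorizer.get_feature_names()

for probability_number in best_lda_model.components_:
    text_message = f"Top words for topic {topic_count} are : "
    print(text_message)
    # argsort is ascending, so the last top_number indices are the highest-weighted words
    for number in probability_number.argsort()[-top_number:]:
        print([feature_names[number]], end="")
        # note: this stores the word's column index, not a probability
        probability_list.append(number)
    print("\n")
    topic_count += 1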

Top words for topic 0 are : 
['effect']['rupees']['happen']['theory']['help']['banning']['ban']['cost']['value']['decision']['stock']['sentence']['power']['currency']['2000']['note']['china']['country']['economy']['gmail']['new']['modi']['countries']['market']['think']['did']['rupee']['pakistan']['password']['does']['meaning']['real']['money']['rs']['word']['iphone']['government']['water']['start']['math']['used']['business']['war']['black']['1000']['indian']['500']['notes']['world']['india']

Top words for topic 1 are : 
['access']['like']['happen']['answers']['types']['questions']['jio']['pay']['hate']['win']['email']['app']['sim']['delete']['search']['difference']['vote']['election']['internet']['effects']['think']['youtube']['hack']['online']['mobile']['com']['bank']['using']['india']['better']['different']['whatsapp']['number']['does']['earn']['google']['card']['facebook']['hillary']['quora']['phone']['clinton']['president']['money']['instagram']['use']['account']['donald']['peopl
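
The `argsort()[-top_number:]` idiom used above can be illustrated with plain NumPy. The vocabulary and weights below are hypothetical values chosen for the example.

```python
import numpy as np

# Hypothetical vocabulary and topic weights
toy_vocabulary = np.array(["water", "india", "phone", "money", "war"])
toy_weights = np.array([0.1, 2.5, 0.3, 1.7, 0.9])

top_n = 3
# argsort sorts ascending, so the last top_n indices are the largest weights
top_indices = toy_weights.argsort()[-top_n:]
print(toy_vocabulary[top_indices])        # weakest of the top words first
print(toy_vocabulary[top_indices[::-1]])  # strongest word first
```

This is why the strongest word of each topic appears last in the printed lists.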

In [29]:
topics = best_lda_model.transform(doc_term_matrix)

### Topic selection
Based on the top words, an arbitrary topic name is given to each topic.

In [35]:
topic_desc_list = {0: "Education", 
              1: "Research", 
              2: "Law", 
              3: "Sport", 
              4: "Finance", 
              5: "Health", 
              6: "horoscopes", 
              7: "Environment", 
              8: "Economy", 
              9: "Various", 
              10: "Sport", 
              }



In [30]:
len(topics)

200000

Each question is assigned the topic number with the highest probability.

In [31]:
topic_list = []
# topics is a 2D array with one row per question;
# each row holds that question's probability for every topic
for popular_index_pos in topics:
    # Get the index of the highest probability in the row
    # and add it to topic_list
    topic_list.append(popular_index_pos.argmax())
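
The same assignment can also be written without an explicit loop. The sketch below uses a small hypothetical probability matrix, `toy_topics`, to show that `argmax(axis=1)` gives the same result as the row-by-row loop.

```python
import numpy as np

# Hypothetical document-topic probabilities: 3 documents, 4 topics
toy_topics = np.array([
    [0.10, 0.60, 0.20, 0.10],
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.05, 0.45],
])

loop_result = [row.argmax() for row in toy_topics]  # one row at a time
vectorized_result = toy_topics.argmax(axis=1)       # all rows at once
print(vectorized_result)
```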

In [33]:
# Add a new column to the dataframe
data_frame["Topic number"] = topic_list

In [36]:
topic_no_to_topic_desc = data_frame["Topic number"].map(topic_desc_list)

In [37]:
data_frame["Topic desc"] = topic_no_to_topic_desc

In [39]:
data_frame.to_csv(r'quora_with_topic.csv', index = False)

In [40]:
data_frame

Unnamed: 0,question,Topic number,Topic desc
12,What can make Physics easy to learn?,4,Finance
13,What was your first sexual experience like?,6,horoscopes
14,What are the laws to change your status from a...,2,Law
18,Why are so many Quora users posting questions ...,1,Research
20,Why do rockets look white?,6,horoscopes
...,...,...,...
808557,What Does It Feel Like to have antisocial pers...,8,Economy
808563,What is a utilities expense in accounting? How...,5,Health
808571,What will the CPU upgrade to the 2016 Apple Ma...,3,Sport
808574,Is it true that there is life after death?,6,horoscopes


### Question 2

#### Aim:


#### Method:


#### Expected output:

--------------


In [41]:
data_frame = pd.read_csv("quora_with_topic.csv", encoding='utf-8')
data_frame.head(5)


Unnamed: 0,question,Topic number,Topic desc
0,What can make Physics easy to learn?,4,Finance
1,What was your first sexual experience like?,6,horoscopes
2,What are the laws to change your status from a...,2,Law
3,Why are so many Quora users posting questions ...,1,Research
4,Why do rockets look white?,6,horoscopes


In [42]:
# max_df is a float in [0, 1] (proportion of documents) or an int (absolute document count)
count_vectorizer = CountVectorizer(max_df=0.90, min_df=4, stop_words="english")
doc_term_matrix = count_vectorizer.fit_transform(data_frame["question"])
len(count_vectorizer.get_feature_names())


17151

In [43]:
doc_term_matrix = count_vectorizer.fit_transform(data_frame["question"])


In [44]:
target_topic = data_frame['Topic number']

In [45]:
X_train, X_test, y_train, y_test = train_test_split(doc_term_matrix,target_topic, test_size = 0.3, random_state = 1)
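
As a small sketch of how `test_size = 0.3` divides the data, the example below splits ten hypothetical samples. The arrays `toy_features` and `toy_targets` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

toy_features = np.arange(20).reshape(10, 2)  # 10 hypothetical samples
toy_targets = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_targets, test_size=0.3, random_state=1
)
print(X_tr.shape, X_te.shape)  # 7 training rows and 3 test rows
```

Applied to the 200000 sampled questions, the same ratio leaves 140000 questions for training and 60000 for testing.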

### Classifier 
A MultinomialNB classifier is created and trained on the training data.

In [46]:
mnc_classifier = MultinomialNB()
mnc_classifier.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### Prediction
The model is now trained and can predict the topic number of an unseen question. The **test set** is fed into the classifier for prediction.

In [47]:
mnc_model_predictions = mnc_classifier.predict(X_test)

### Confusion Matrix
A confusion matrix is used to evaluate the topic classifier. Rows are the true topic numbers and columns are the predicted topic numbers.

In [48]:
print(metrics.confusion_matrix(y_test, mnc_model_predictions))

[[4369  206  112  176  143   78  374  118   56   80]
 [  97 5203  104  185   88  133  227  106   66   75]
 [ 126  147 3906  199  160   74  284  223  110  106]
 [  60  142   74 5851  149  122  130  125   88   70]
 [  66   74  109  229 4575   62  318  125   87  106]
 [ 102  190   92  235  110 3991  441   68   83   80]
 [  96  177  102  102  140  104 7731   85   84  131]
 [ 132  131  142  135  154   79  232 4998   65   99]
 [  75   76  105  157  153   65  263  303 3469   78]
 [ 101  145  110  149  143   89  483  151   82 3599]]


In [49]:
print(metrics.classification_report(y_test, mnc_model_predictions))

              precision    recall  f1-score   support

           0       0.84      0.76      0.80      5712
           1       0.80      0.83      0.81      6284
           2       0.80      0.73      0.77      5335
           3       0.79      0.86      0.82      6811
           4       0.79      0.80      0.79      5751
           5       0.83      0.74      0.78      5392
           6       0.74      0.88      0.80      8752
           7       0.79      0.81      0.80      6167
           8       0.83      0.73      0.78      4744
           9       0.81      0.71      0.76      5052

    accuracy                           0.79     60000
   macro avg       0.80      0.79      0.79     60000
weighted avg       0.80      0.79      0.79     60000
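
The figures in the classification report can be derived from the confusion matrix printed earlier. The sketch below copies that matrix and recomputes precision, recall and accuracy with NumPy. The variable names are assumptions introduced for the example.

```python
import numpy as np

# Confusion matrix copied from the printed output (rows = true, columns = predicted)
confusion = np.array([
    [4369, 206, 112, 176, 143, 78, 374, 118, 56, 80],
    [97, 5203, 104, 185, 88, 133, 227, 106, 66, 75],
    [126, 147, 3906, 199, 160, 74, 284, 223, 110, 106],
    [60, 142, 74, 5851, 149, 122, 130, 125, 88, 70],
    [66, 74, 109, 229, 4575, 62, 318, 125, 87, 106],
    [102, 190, 92, 235, 110, 3991, 441, 68, 83, 80],
    [96, 177, 102, 102, 140, 104, 7731, 85, 84, 131],
    [132, 131, 142, 135, 154, 79, 232, 4998, 65, 99],
    [75, 76, 105, 157, 153, 65, 263, 303, 3469, 78],
    [101, 145, 110, 149, 143, 89, 483, 151, 82, 3599],
])

true_positives = np.diag(confusion)
recall = true_positives / confusion.sum(axis=1)     # row sums = true count per topic
precision = true_positives / confusion.sum(axis=0)  # column sums = predicted count per topic
accuracy = true_positives.sum() / confusion.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(round(accuracy, 2))
```

For topic 0 this gives a recall of 4369 / 5712 and a precision of 4369 / 5224, which round to the 0.76 and 0.84 shown in the report.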

