## Assignment Details
Name        : **Lavish Thomas** <br> 
Student ID  : **L00150445** <br>
Course      : MSc in Big Data Analytics and Artificial Intelligence <br>
Module      : Artificial Intelligence 2 <br> 
Assignment  : NLP CA 2 <br>
File used   : **quora_questions.csv**

--------------

### Libraries Used:
This section explains the various libraries used in this project.

#### Numpy
NumPy is the fundamental package for scientific computing with Python. It provides an efficient multi-dimensional container for generic data.

In [1]:
import numpy as np

#### Scikit-learn
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

#### Pandas
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

In [2]:
import pandas as pd

#### CountVectorizer 
CountVectorizer is used to split the questions into words and build a sparse matrix holding the count of each word in each question.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
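As a minimal sketch of what CountVectorizer produces, the toy questions below (invented for illustration, not taken from quora_questions.csv) are turned into a sparse count matrix. The names `toy_questions`, `toy_vectorizer` and `toy_matrix` are assumptions of this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy questions, used only for illustration
toy_questions = [
    "How do I learn Python?",
    "How do I learn to cook?",
    "What is Python used for?",
]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_questions)

# vocabulary_ maps each lower-cased token to its column index
toy_vocabulary = sorted(toy_vectorizer.vocabulary_)
print(toy_vocabulary)

# One row per question, one column per distinct token
print(toy_matrix.shape)
print(toy_matrix.toarray())
```

With the default settings, tokens are lower-cased and single-character tokens such as "I" are dropped, so they never reach the vocabulary.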

#### Latent Dirichlet Allocation
The latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.

In [4]:
from sklearn.decomposition import LatentDirichletAllocation

#### Grid Search
Hyper-parameters are parameters that are not directly learnt within estimators. Grid Search is used to find the optimal hyper-parameters for a model.

In [5]:
from sklearn.model_selection import GridSearchCV

#### MultinomialNB
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification).

In [6]:
from sklearn.naive_bayes import MultinomialNB

#### Train and Test split
Split arrays or matrices into random train and test subsets.

In [7]:
from sklearn.model_selection import train_test_split

#### Metrics
Metrics module implements several loss, score, and utility functions to measure classification performance. 

In [8]:
from sklearn import metrics


--------------


### Pre-Processing steps

#### Reading the CSV file with questions
The file is read and loaded into a dataframe using the pandas built-in function `read_csv`.

In [9]:
raw_dataframe = pd.read_csv("quora_questions.csv", encoding='utf-8')

#### Size of the data set

In [10]:
print("The size of the dataset is " + str(len(raw_dataframe)))

The size of the dataset is 808578


#### Sample data

In [11]:
raw_dataframe.head(5)

Unnamed: 0,question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


#### Deleting the null rows

In [12]:
bool_series = pd.notnull(raw_dataframe["question"])
not_null_data_frame = raw_dataframe[bool_series]
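The masking step can be sketched on a small frame. The three-row `toy_frame` is hypothetical; the comparison with `dropna` is only there to show that both routes keep the same rows.

```python
import pandas as pd

# Hypothetical three-row frame with one missing question
toy_frame = pd.DataFrame({"question": ["What is NLP?", None, "Why is the sky blue?"]})

# Boolean mask: True where a question is present
mask = pd.notnull(toy_frame["question"])
filtered = toy_frame[mask]

# dropna restricted to the same column keeps the same rows
dropped = toy_frame.dropna(subset=["question"])

print(len(filtered), filtered.equals(dropped))
```

Note that the original index labels are preserved, so the filtered frame keeps gaps where rows were removed.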

#### Sampling the data set 
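A random sample of 200,000 questions is selected from the dataset. A fixed `random_state` makes the sample reproducible, and `sort_index` restores the original row order.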

In [13]:
data_frame = not_null_data_frame.sample(n = 200000, random_state = 101).sort_index()
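A small sketch of why the seed matters, assuming a hypothetical 100-row frame: two calls with the same `random_state` return the same rows, and sampling is without replacement by default.

```python
import pandas as pd

# Hypothetical frame standing in for the question data
toy_frame = pd.DataFrame({"question": [f"question {i}" for i in range(100)]})

# Same seed, same rows: the sample is reproducible
first = toy_frame.sample(n=10, random_state=101).sort_index()
second = toy_frame.sample(n=10, random_state=101).sort_index()

print(first.index.tolist() == second.index.tolist())

# sort_index puts the sampled rows back in their original order
print(first.index.is_monotonic_increasing)
```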

In [14]:
### Cross checking the size.
len(data_frame)

200000

In [15]:
### Sample dataset
data_frame.head(5)

Unnamed: 0,question
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
10,Method to find separation of slits using fresn...
11,How do I read and find my YouTube comments?
12,What can make Physics easy to learn?


In [16]:
# Make sure there are no null rows left
data_frame["question"].isnull().sum()

0

### Question 1

#### Aim:

To evaluate unsupervised machine learning methods for text classification that are suitable for the provided dataset, quora questions, which is extracted from Quora, a popular Q&A platform on the internet.

#### Method:
Two unsupervised machine learning methods suited to this kind of topic discovery are available in the scikit-learn library: <br>
1)	LatentDirichletAllocation (LDA) <br>
2)	Non-Negative Matrix Factorisation (NMF) <br>

In this library, LDA provides a log-likelihood score and a perplexity score that can be used to evaluate the hyper-parameter settings of a model. In the currently released version of the library, NMF does not have an equivalent scoring mechanism. Hence, the LDA method is employed in this project.

#### Expected output:
The expected artefact of this question is a dataset in which every question is categorised under one of several topics. The topic names are selected from the most probable words in each topic, based on empirical knowledge.

---------------

#### Creation of Document matrix

This creates a sparse matrix holding the occurrence count of each word for each row of the data frame.<br> <br>
**CountVectorizer** is used with the options: <br> 
**max_df=0.9**, which ignores words that appear in more than 90% of the questions, <br> 
**min_df=4**, which ignores words that appear in fewer than 4 questions, and <br>
**stop_words="english"**, which removes common English stop words.


In [None]:
# max_df is between 0-1 or an INT
count_vectorizer = CountVectorizer(max_df=0.90, min_df=4, stop_words="english")
doc_term_matrix = count_vectorizer.fit_transform(data_frame["question"])
# get_feature_names_out() requires scikit-learn 1.0+; older releases use get_feature_names()
len(count_vectorizer.get_feature_names_out())

There are **17150** unique words in the corpus.
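The effect of `max_df` and `min_df` can be sketched on a toy corpus. The four documents below are hypothetical and `min_df=2` is chosen only so that the pruning is visible on so little data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "python" is in every document, "rare" in only one
toy_docs = [
    "python pandas numpy",
    "python pandas sklearn",
    "python numpy sklearn",
    "python rare sklearn",
]

# A float max_df is a proportion of documents; an int min_df is a document count
pruned_vectorizer = CountVectorizer(max_df=0.9, min_df=2)
pruned_vectorizer.fit(toy_docs)

# "python" (in 100% of documents) and "rare" (in 1 document) are pruned
kept_terms = sorted(pruned_vectorizer.vocabulary_)
print(kept_terms)
```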

--------------------------

### Sample word list in the corpus 

In [None]:
count_vectorizer.get_feature_names_out()[0:20]

In [None]:
doc_term_matrix

The document terms matrix is generated.

### Hyperparameter search 

A dictionary is defined with the candidate hyper-parameter values that the grid search will try.

In [None]:
# Define the params that we want to search over
search_params = {"n_components": [10,12,15,20], "learning_decay": [ .5, .7, .9]}

A **LatentDirichletAllocation** estimator is created and wrapped in **GridSearchCV** together with the search parameters defined above (the plausible hyper-parameters).

In [None]:
# Init the model
lda_comparison = LatentDirichletAllocation()
# Init Grid Search Class
lda_comparison = GridSearchCV(lda_comparison, param_grid=search_params)

### Model training/fitting

Calling the fit function runs the grid search: for every combination of **n_components** and **learning_decay** in the search dictionary, an LDA model is fitted to the document-term matrix and scored by cross-validation.

In [None]:
# Run the grid search
lda_comparison.fit(doc_term_matrix)
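The grid search can be sketched on a toy corpus that runs in a few seconds. Everything here is an assumption made for illustration: the documents, the smaller parameter grid, `cv=3`, `max_iter=5` and in particular `learning_method="online"`. scikit-learn documents `learning_decay` as a parameter of the online update, so the sketch enables that method explicitly, whereas the notebook keeps the estimator's defaults.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Hypothetical toy corpus, repeated so every cross-validation fold has data
toy_docs = [
    "stock market invest money",
    "invest money bank shares",
    "football match goal team",
    "team player goal league",
] * 3

toy_matrix = CountVectorizer().fit_transform(toy_docs)

toy_params = {"n_components": [2, 3], "learning_decay": [0.5, 0.9]}

# learning_method="online" is an assumption of this sketch
toy_lda = LatentDirichletAllocation(learning_method="online", max_iter=5, random_state=0)

# With no scoring argument, GridSearchCV ranks candidates by the estimator's
# own score method (the approximate log-likelihood for LDA)
toy_search = GridSearchCV(toy_lda, param_grid=toy_params, cv=3)
toy_search.fit(toy_matrix)

print(toy_search.best_params_)
print(len(toy_search.cv_results_["params"]))
```

The grid has 2 x 2 = 4 candidates, each fitted once per fold, which is why the full search over 200,000 questions is expensive.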

### Best model

The best model is the one trained with the hyper-parameter combination that achieved the highest cross-validated score.

In [None]:
# Best model which gives the highest score
best_lda_model = lda_comparison.best_estimator_      
# Metrics - log likelihood - higher score = better
print("Log likelihood : ", best_lda_model.score(doc_term_matrix))
# Perplexity - lower = better. 
# = exp(-1 * log likelihood per word)
print("Perplexity: ", best_lda_model.perplexity(doc_term_matrix))
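The relationship stated in the comment, perplexity = exp(-1 * log likelihood per word), can be checked numerically. This is a sketch on a hypothetical toy corpus; it assumes the default arguments of `score` and `perplexity` and takes the word count to be the sum of all entries of the document-term matrix.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
toy_docs = [
    "stock market invest money",
    "invest money bank shares",
    "football match goal team",
    "team player goal league",
]
toy_matrix = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_matrix)

log_likelihood = toy_lda.score(toy_matrix)
perplexity = toy_lda.perplexity(toy_matrix)

# Total number of word occurrences in the matrix
word_count = toy_matrix.sum()

# Perplexity recomputed from the log-likelihood per word
recomputed = np.exp(-log_likelihood / word_count)
print(np.isclose(perplexity, recomputed))
```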

In [None]:
best_lda_model.components_.shape

In [None]:
lda_comparison.best_params_

### Top Words

The 50 highest-weighted words in each topic are printed. <br> <br>
These are used to derive the topic names, using empirical knowledge of the words most present in the text corpus of each topic cluster identified.

In [None]:
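word_list = []
probability_list = []

top_number = 50

# Look up the vocabulary once instead of once per printed word
# (get_feature_names_out() requires scikit-learn 1.0+; older releases use get_feature_names())
feature_names = count_vectorizer.get_feature_names_out()

for topic_count, probability_number in enumerate(best_lda_model.components_):
    text_message = f"Top words for topic {topic_count} are : "
    print(text_message)
    # argsort is ascending, so the last top_number indices are the highest-weighted words
    for number in probability_number.argsort()[-top_number:]:
        print([str(feature_names[number])], end="")
        probability_list.append(number)
    print("\n")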

In [None]:
topics = best_lda_model.transform(doc_term_matrix)

### Topic selection
Based on the top words, an arbitrary topic name is given to each topic.

In [None]:
topic_desc_list = {0: "Education", 
                   1: "Research", 
                   2: "Law", 
                   3: "Sport", 
                   4: "Finance", 
                   5: "Health", 
                   6: "horoscopes", 
                   7: "Environment", 
                   8: "Economy", 
                   9: "Various", 
                   10: "Sport", 
              }



In [None]:
len(topics)

Assigning the topic number with the highest probability

In [None]:
topic_list = []
# topics is an array with one row per question; each row holds
# that question's probability for every topic
for popular_index_pos in topics:
    # Get the index of the most probable topic in the row
    # and add it to topic_list
    topic_list.append(popular_index_pos.argmax())
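The same selection can be written without an explicit loop. The sketch below uses a hypothetical 3 x 4 probability array in place of the real output of `transform`.

```python
import numpy as np

# Hypothetical document-topic probabilities: 3 questions, 4 topics
toy_topics = np.array([
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.40, 0.10],
    [0.60, 0.20, 0.10, 0.10],
])

# Loop version, row by row
looped = [int(row.argmax()) for row in toy_topics]

# Vectorised version: argmax along the topic axis
vectorised = toy_topics.argmax(axis=1)

print(looped == vectorised.tolist())
```

On 200,000 rows the vectorised call avoids a Python-level loop, although both give the same topic numbers.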

### Assignment of topic numbers to the questions

In [None]:
# Add a new column to the dataframe
data_frame["Topic number"] = topic_list

### Mapping the topic numbers to topic descriptions using the map function

In [None]:
topic_no_to_topic_desc = data_frame["Topic number"].map(topic_desc_list)
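A short sketch of how `map` behaves with a dictionary, using a hypothetical two-entry dictionary and a made-up series of topic numbers. It shows why the dictionary needs an entry for every topic number the model can produce.

```python
import pandas as pd

# Hypothetical, deliberately incomplete description dictionary
toy_desc = {0: "Education", 1: "Research"}
toy_numbers = pd.Series([0, 1, 1, 2], name="Topic number")

toy_mapped = toy_numbers.map(toy_desc)

# Topic 2 has no entry in the dictionary, so it maps to NaN
print(toy_mapped.tolist())
print(toy_mapped.isna().sum())
```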

### Adding the Topic Description column to the dataframe

In [None]:
data_frame["Topic desc"] = topic_no_to_topic_desc

### Sample dataframe with question, topic number and topic description

In [None]:
data_frame.head(5)

### Saving the dataframe to a CSV file for further processing in Q2

In [None]:
data_frame.to_csv(r'quora_supervised.csv', index = False)

--------------

### Question 2

#### Aim:


#### Method:


#### Expected output:

--------------

#### Reading the CSV file with the topic classification

In [17]:
data_frame = pd.read_csv("quora_supervised.csv", encoding='utf-8')
data_frame.head(5)


Unnamed: 0,question,Topic number,Topic desc
0,What can make Physics easy to learn?,4,Finance
1,What was your first sexual experience like?,6,horoscopes
2,What are the laws to change your status from a...,2,Law
3,Why are so many Quora users posting questions ...,1,Research
4,Why do rockets look white?,6,horoscopes


#### Creation of Document matrix

This creates a sparse matrix holding the occurrence count of each word for each row of the data frame.<br> <br>
**CountVectorizer** is used with the options: <br> 
**max_df=0.9**, which ignores words that appear in more than 90% of the questions, <br> 
**min_df=4**, which ignores words that appear in fewer than 4 questions, and <br>
**stop_words="english"**, which removes common English stop words.


In [18]:
# max_df is between 0-1 or an INT
count_vectorizer = CountVectorizer(max_df=0.90, min_df=4, stop_words="english")

**doc_term_matrix** is the feature list. <br>
**target_topic** is the classification expected.

In [19]:
doc_term_matrix = count_vectorizer.fit_transform(data_frame["question"])
target_topic = data_frame['Topic number']

### Train and Test split

The dataset is split into 70% for training and 30% for testing the model afterwards.

In [20]:
X_train, X_test, y_train, y_test = train_test_split(doc_term_matrix,target_topic, test_size = 0.3, random_state = 5)
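The split sizes can be checked on a small array. The ten-row `toy_X` and `toy_y` are hypothetical; the `test_size` and `random_state` values are the ones used in the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.3, random_state=5)

# 70% of the rows go to training, 30% to testing
print(X_tr.shape, X_te.shape)

# Every sample lands in exactly one of the two subsets
print(sorted(np.concatenate([y_tr, y_te]).tolist()) == list(range(10)))
```

Applied to the 200,000 sampled questions, the same proportion gives 60,000 test rows, which matches the support total in the classification report.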

### Classifier 
A MultinomialNB classifier is created and fitted to the training data.

In [21]:
mnc_classifier = MultinomialNB()
mnc_classifier.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
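A minimal sketch of the same fit-then-predict pattern on made-up data. The four questions, the two labels and the new question are all hypothetical; the point is that new text is passed through `transform` so that it reuses the fitted vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labelled questions: 0 = finance, 1 = sport (made-up labels)
toy_questions = [
    "invest money stock market",
    "bank money shares invest",
    "football team goal match",
    "league player goal team",
]
toy_labels = [0, 0, 1, 1]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_questions)
toy_classifier = MultinomialNB().fit(toy_counts, toy_labels)

# New text goes through transform (not fit_transform) to reuse the vocabulary;
# words that were never seen during fitting are ignored
new_counts = toy_vectorizer.transform(["which team scored the goal"])
toy_prediction = toy_classifier.predict(new_counts)
print(toy_prediction)
```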

### Prediction
The trained model is now used to predict the topic number of each question in the test set.

In [22]:
mnc_model_predictions = mnc_classifier.predict(X_test)

### Confusion Matrix
A confusion matrix is used to evaluate the topic classifier. Each row corresponds to a true topic and each column to a predicted topic, so the diagonal holds the correctly classified questions.

In [23]:
print(metrics.confusion_matrix(y_test, mnc_model_predictions))

[[4348  195  135  173  168   63  352  132   68   75]
 [  92 5180  116  195   90  124  213  114   46   71]
 [ 138  125 3822  198  153   81  288  231  107  107]
 [  71  129   96 6004  139  125  132  112  102   65]
 [  89   68   96  217 4548   88  305  117   99   97]
 [  94  210   88  218   95 3876  444   74   89  107]
 [  93  197  110   94  128  108 7879   69   78  128]
 [ 111  141  140  156  171   70  255 5000   79   75]
 [  80   80  102  159  157   56  262  286 3405   87]
 [ 107  166  112  132  107   80  447  158   91 3650]]
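As a worked check, the overall accuracy can be recomputed from the matrix printed above: the diagonal sum divided by the total. The array `cm` is copied from that output; the variable names are assumptions of this sketch.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true, columns = predicted)
cm = np.array([
    [4348, 195, 135, 173, 168, 63, 352, 132, 68, 75],
    [92, 5180, 116, 195, 90, 124, 213, 114, 46, 71],
    [138, 125, 3822, 198, 153, 81, 288, 231, 107, 107],
    [71, 129, 96, 6004, 139, 125, 132, 112, 102, 65],
    [89, 68, 96, 217, 4548, 88, 305, 117, 99, 97],
    [94, 210, 88, 218, 95, 3876, 444, 74, 89, 107],
    [93, 197, 110, 94, 128, 108, 7879, 69, 78, 128],
    [111, 141, 140, 156, 171, 70, 255, 5000, 79, 75],
    [80, 80, 102, 159, 157, 56, 262, 286, 3405, 87],
    [107, 166, 112, 132, 107, 80, 447, 158, 91, 3650],
])

correct = int(np.trace(cm))   # correctly classified questions (diagonal)
total = int(cm.sum())         # all test questions
accuracy = correct / total

print(correct, total, round(accuracy, 4))
```

Here 47,712 of the 60,000 test questions lie on the diagonal, an accuracy of about 0.80.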


### Classification report
The classification report can be calculated from the confusion matrix.

#### Precision: 
When it predicts it is of a particular class, how often is it correct?

#### Recall
Recall is the number of correct results divided by the number of results that should have been returned.

#### F1-score
The F1 score is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall).

#### Support
The support is the number of occurrences of each class
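The four definitions can be worked through for a single class. The figures below are read from the confusion matrix for class 0: the diagonal entry, the column 0 total and the row 0 total. The variable names are assumptions of this sketch.

```python
# Class 0 figures read from the confusion matrix
true_positives = 4348       # diagonal entry for class 0
predicted_as_class = 5223   # column 0 total: everything predicted as class 0
actual_in_class = 5709      # row 0 total: the support of class 0

precision = true_positives / predicted_as_class
recall = true_positives / actual_in_class

# Harmonic mean of precision and recall
f1_score = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1_score, 2))
```

These values should agree with the class 0 row of the classification report.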

In [24]:
print(metrics.classification_report(y_test, mnc_model_predictions))

              precision    recall  f1-score   support

           0       0.83      0.76      0.80      5709
           1       0.80      0.83      0.81      6241
           2       0.79      0.73      0.76      5250
           3       0.80      0.86      0.83      6975
           4       0.79      0.79      0.79      5724
           5       0.83      0.73      0.78      5295
           6       0.74      0.89      0.81      8884
           7       0.79      0.81      0.80      6198
           8       0.82      0.73      0.77      4674
           9       0.82      0.72      0.77      5050

    accuracy                           0.80     60000
   macro avg       0.80      0.79      0.79     60000
weighted avg       0.80      0.80      0.79     60000



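#### Accuracy
The accuracy is the fraction of all test questions that were assigned the correct class.

#### Macro avg
The macro average is the unweighted mean of the per-class scores, so every class counts equally.

#### Weighted avg
The weighted average is the mean of the per-class scores weighted by the support of each class.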