## Assignment Details
Name        : **Lavish Thomas** <br> 
Student ID  : **L00150445** <br>
Course      : MSc in Big Data Analytics and Artificial Intelligence <br>
Module      : Artificial Intelligence 2 <br>
File used   : quora_questions.csv

### Libraries Used:
This section explains the various libraries used in this project.

#### Numpy
NumPy is the fundamental package for scientific computing with Python. It provides an efficient multi-dimensional container for generic data.

In [None]:
import numpy as np

#### Scikit-learn
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

##### Pandas
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

In [None]:
import pandas as pd

##### CountVectorizer 
CountVectorizer is used to split the questions into words and build a sparse matrix holding the count of each word in each question.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
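As a minimal sketch of what CountVectorizer produces, the toy corpus below (three made-up questions, not taken from the dataset) is converted into a document-term matrix. The names `toy_questions`, `toy_vectorizer` and `toy_matrix` are hypothetical and used only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration
toy_questions = [
    "How do I learn Python?",
    "How do I learn machine learning?",
    "What is machine learning?",
]

toy_vectorizer = CountVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_questions)

# vocabulary_ maps each retained word to its column index
print(sorted(toy_vectorizer.vocabulary_))
# Rows are questions, columns are words, values are counts
print(toy_matrix.toarray())
```

Each row of the matrix corresponds to one question and each column to one word of the learned vocabulary.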

##### Latent Dirichlet Allocation
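Latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. In this project the observations are the words of each question and the unobserved groups are the topics.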

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
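The following sketch fits LDA on a tiny, made-up count matrix to show the shape of its output. The matrix `toy_counts` and the choice of two topics are assumptions for illustration only; they are not the settings used in this project.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical document-term counts: 4 documents, 6 words
toy_counts = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 0, 0, 1, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 1, 0, 2, 3, 1],
])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
# Each row is one document's distribution over the 2 topics
toy_doc_topics = toy_lda.fit_transform(toy_counts)

print(toy_doc_topics.shape)        # (4, 2)
print(toy_doc_topics.sum(axis=1))  # each row sums to 1 (up to rounding)
```

The fitted `components_` attribute has one row per topic and one column per word, which is the layout used later to list the top words per topic.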

##### Grid Search
Hyper-parameters are parameters that are not directly learnt within estimators. Grid Search is used to find the optimal hyper-parameters for the models by evaluating every combination of the candidate values.

In [None]:
from sklearn.model_selection import GridSearchCV
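To make the idea of a grid concrete, the sketch below enumerates every combination of the candidate values in plain Python. It only lists the candidates and does not fit any model; the variable names `names` and `candidates` are hypothetical.

```python
from itertools import product

# The same candidate values that are searched later in this notebook
search_params = {"n_components": [10, 12, 15, 20], "learning_decay": [0.5, 0.7, 0.9]}

# Cartesian product of all candidate values
names = list(search_params)
candidates = [dict(zip(names, values)) for values in product(*search_params.values())]

print(len(candidates))  # 4 * 3 = 12 combinations
print(candidates[0])
```

A grid search fits each of these candidates once per cross-validation fold, so the running time grows quickly with the size of the grid.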

##### Multinomial Naive Bayes
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification).

In [None]:
from sklearn.naive_bayes import MultinomialNB

##### Train/Test Split
`train_test_split` splits arrays or matrices into random train and test subsets.

In [None]:
from sklearn.model_selection import train_test_split

##### Metrics
The `metrics` module implements several loss, score, and utility functions to measure classification performance.

In [None]:
from sklearn import metrics

### Reading the CSV file with questions
The file is read and loaded into a dataframe using the built-in pandas function `read_csv`.

In [4]:
raw_dataframe = pd.read_csv("quora_questions.csv", encoding='utf-8')

### Size of the data set

In [None]:
# len() returns an int, so convert it before concatenating with a string
print("The size of the dataset is " + str(len(raw_dataframe)))

### Sample data

In [6]:
raw_dataframe.head(5)

Unnamed: 0,question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


### Deleting the null rows

In [7]:
bool_series = pd.notnull(raw_dataframe["question"])
not_null_data_frame = raw_dataframe[bool_series]
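The boolean-mask approach above can also be written with `dropna`. The sketch below, on a made-up three-row frame, suggests that both approaches keep the same rows; `toy_frame`, `masked` and `dropped` are hypothetical names.

```python
import pandas as pd

# Hypothetical frame with one missing question
toy_frame = pd.DataFrame({"question": ["What is AI?", None, "Why is the sky blue?"]})

# Approach used in this notebook: boolean mask
masked = toy_frame[pd.notnull(toy_frame["question"])]
# Alternative: drop rows whose "question" value is missing
dropped = toy_frame.dropna(subset=["question"])

print(masked.equals(dropped))  # both approaches keep the same rows
print(len(dropped))
```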

### Sampling the data set 
200,000 questions are selected at random from the given dataset. A fixed `random_state` makes the sample reproducible.

In [8]:
data_frame = not_null_data_frame.sample(n = 200000, random_state = 100).sort_index()
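A small sketch of why `random_state` and `sort_index` are used: with the same seed the same rows are drawn every time, and sorting restores the original row order. The ten-row frame is made up for illustration.

```python
import pandas as pd

# Hypothetical frame with ten questions
toy_frame = pd.DataFrame({"question": [f"question {i}" for i in range(10)]})

first = toy_frame.sample(n=4, random_state=100).sort_index()
second = toy_frame.sample(n=4, random_state=100).sort_index()

print(first.equals(second))                 # same seed, same rows
print(first.index.is_monotonic_increasing)  # sort_index restores the original order
```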

In [9]:
# Cross-check the size
len(data_frame)

200000

In [None]:
# First rows of the sampled dataset
data_frame.head(5)

In [11]:
# Make sure there are no null rows
data_frame["question"].isnull().sum()

0

## Question 1

### Aim:


### Method:

CountVectorizer is used to split the questions into words and build a sparse matrix holding the count of each word in each question.

In [12]:
# max_df is either a float in [0, 1] (proportion of documents) or an int (absolute document count)
count_vectorizer = CountVectorizer(max_df=0.90, min_df=4, stop_words="english")
doc_term_matrix = count_vectorizer.fit_transform(data_frame["question"])
len(count_vectorizer.get_feature_names())

17151

In [13]:
count_vectorizer.get_feature_names()[0:20]

['00',
 '000',
 '001',
 '01',
 '02',
 '03',
 '04',
 '05',
 '07',
 '08',
 '09',
 '10',
 '100',
 '1000',
 '10000',
 '100000',
 '1000rs',
 '1000s',
 '1000Ã¢',
 '100k']

In [14]:
doc_term_matrix = count_vectorizer.fit_transform(data_frame["question"])

In [15]:
doc_term_matrix

<200000x17151 sparse matrix of type '<class 'numpy.int64'>'
	with 963944 stored elements in Compressed Sparse Row format>

In [16]:
# Define the params that we want to search over
search_params = {"n_components": [10, 12, 15, 20], "learning_decay": [0.5, 0.7, 0.9]}

# Init the model
lda_comparison = LatentDirichletAllocation()

# Init Grid Search Class
lda_comparison = GridSearchCV(lda_comparison, param_grid=search_params)

# Run the grid search
lda_comparison.fit(doc_term_matrix)

GridSearchCV(cv=None, error_score=nan,
             estimator=LatentDirichletAllocation(batch_size=128,
                                                 doc_topic_prior=None,
                                                 evaluate_every=-1,
                                                 learning_decay=0.7,
                                                 learning_method='batch',
                                                 learning_offset=10.0,
                                                 max_doc_update_iter=100,
                                                 max_iter=10,
                                                 mean_change_tol=0.001,
                                                 n_components=10, n_jobs=None,
                                                 perp_tol=0.1,
                                                 random_state=None,
                                                 topic_word_prior=None,
                                                 tota

In [17]:
# Best model which gives the highest score
best_lda_model = lda_comparison.best_estimator_      
# Metrics - log likelihood - higher score = better
print("Log likelihood : ", best_lda_model.score(doc_term_matrix))
# Perplexity - lower = better. 
# = exp(-1 * log likelihood per word)
print("Perplexity: ", best_lda_model.perplexity(doc_term_matrix))

Log likelihood :  -8145658.213802085
Perplexity:  3664.3950402612954
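The relationship noted in the code comment, perplexity = exp(-1 * log likelihood per word), can be sketched in plain Python. The function name and the numbers below are hypothetical, and the sketch assumes that "per word" means dividing by the total number of word occurrences in the document-term matrix.

```python
import math


def perplexity_from_log_likelihood(log_likelihood, total_word_count):
    """Perplexity = exp(-log likelihood per word)."""
    return math.exp(-log_likelihood / total_word_count)


# Hypothetical values for illustration only
example = perplexity_from_log_likelihood(log_likelihood=-6000.0, total_word_count=1000)
print(example)  # equals exp(6)
```

A log likelihood closer to zero therefore gives a lower perplexity, which is why a higher score and a lower perplexity both indicate a better model.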


In [18]:
best_lda_model.components_.shape

(10, 17151)

In [19]:
lda_comparison.best_params_

{'learning_decay': 0.5, 'n_components': 10}

In [20]:
top_number = 50

# Look up the vocabulary once instead of on every loop iteration
feature_names = count_vectorizer.get_feature_names()

for topic_count, word_weights in enumerate(best_lda_model.components_):
    print(f"Top words for topic {topic_count} are : ")
    # argsort is ascending, so the last top_number indices
    # belong to the highest-weighted words of the topic
    for word_index in word_weights.argsort()[-top_number:]:
        print([feature_names[word_index]], end="")
    print("\n")

Top words for topic 0 are : 
['person']['blog']['fall']['woman']['distance']['traffic']['guys']['end']['man']['just']['feel']['created']['successful']['sleep']['height']['friends']['code']['eat']['hotel']['universe']['age']['really']['don']['guy']['police']['time']['ways']['way']['safe']['school']['girls']['relationship']['website']['know']['read']['energy']['want']['long']['earn']['girl']['increase']['learn']['does']['start']['online']['love']['like']['did']['money']['make']

Top words for topic 1 are : 
['like']['company']['apps']['site']['mba']['management']['phone']['ve']['free']['student']['service']['sentence']['engineer']['time']['worst']['digital']['bangalore']['course']['big']['software']['2016']['better']['mind']['online']['tech']['visit']['marketing']['places']['movie']['mechanical']['career']['buy']['mobile']['delhi']['favorite']['place']['learn']['life']['app']['meaning']['books']['android']['word']['book']['thing']['job']['engineering']['india']['way']['best']

Top words 
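The top-words listing relies on `argsort`, which returns indices in ascending order of weight, so the most important word appears last in each printed list. A minimal sketch with a made-up vocabulary and made-up weights (both hypothetical) shows how to obtain the words in descending order instead:

```python
import numpy as np

# Hypothetical vocabulary and one topic's word weights
toy_words = np.array(["money", "love", "india", "learn", "job"])
toy_weights = np.array([0.5, 2.0, 0.1, 3.5, 1.2])

top_n = 3
# argsort is ascending, so take the last top_n indices and reverse them
top_indices = toy_weights.argsort()[-top_n:][::-1]
top_words = toy_words[top_indices].tolist()

print(top_words)  # ['learn', 'love', 'job']
```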

In [21]:
best_lda_model

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.5,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=10, n_jobs=None,
                          perp_tol=0.1, random_state=None,
                          topic_word_prior=None, total_samples=1000000.0,
                          verbose=0)

In [None]:
topics = best_lda_model.transform(doc_term_matrix)

In [None]:
len(topics)

In [None]:
topic_list = []
# topics has one row per question; each row holds
# that question's probability for every topic
for topic_probabilities in topics:
    # The index of the largest probability is the dominant topic
    topic_list.append(topic_probabilities.argmax())

# Add the dominant topic as a new column of the dataframe
data_frame["Topic number"] = topic_list
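The per-row loop can also be expressed as a single vectorised call. The sketch below uses a made-up document-topic matrix (`toy_topics`, hypothetical) to show that `argmax(axis=1)` returns the dominant topic for every row at once.

```python
import numpy as np

# Hypothetical document-topic probabilities: 3 documents, 4 topics
toy_topics = np.array([
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.40, 0.10],
    [0.60, 0.20, 0.10, 0.10],
])

# One call replaces the explicit Python loop
dominant_topic = toy_topics.argmax(axis=1)
print(dominant_topic.tolist())  # [1, 2, 0]
```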

In [None]:
# The model has 10 topics, numbered 0 to 9
topic_names = {0: "Education",
               1: "Research",
               2: "Law",
               3: "Sport",
               4: "Finance",
               5: "Health",
               6: "horoscopes",
               7: "Environment",
               8: "Economy",
               9: "Various",
               }

topic_no_to_topic = data_frame["Topic number"].map(topic_names)

In [None]:
data_frame["Topic desc"] = topic_no_to_topic
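One property of `Series.map` with a dictionary is worth keeping in mind: a topic number that has no entry in the dictionary becomes `NaN` rather than raising an error. The sketch below uses made-up values (`toy_numbers`, `toy_labels`) to illustrate this.

```python
import pandas as pd

toy_numbers = pd.Series([0, 2, 5])
toy_labels = {0: "Education", 2: "Law"}  # 5 is deliberately missing

mapped = toy_numbers.map(toy_labels)
print(mapped.tolist())
print(mapped.isnull().sum())  # 1 unmapped value
```

Checking `isnull().sum()` on the mapped column is a quick way to confirm that every topic number received a description.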

In [None]:
data_frame.to_csv(r'QuoraWithTopic.csv', index = False)

In [None]:
data_frame

## Question 2

In [None]:
data_frame = pd.read_csv("QuoraWithTopic.csv", encoding='utf-8')
data_frame.head(5)


In [None]:
# max_df is either a float in [0, 1] (proportion of documents) or an int (absolute document count)
count_vectorizer = CountVectorizer(max_df=0.90, min_df=4, stop_words="english")
doc_term_matrix = count_vectorizer.fit_transform(data_frame["question"])
len(count_vectorizer.get_feature_names())


In [None]:
doc_term_matrix = count_vectorizer.fit_transform(data_frame["question"])


In [None]:
target_topic = data_frame['Topic number']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(doc_term_matrix,target_topic, test_size = 0.3, random_state = 1)
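With `test_size = 0.3`, 30% of the rows are held out for testing, which for 200,000 questions is 60,000 test rows and 140,000 training rows. The sketch below shows the same split on ten made-up samples; all `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

toy_X = np.arange(20).reshape(10, 2)  # 10 hypothetical samples
toy_y = np.arange(10)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=1
)

print(toy_X_train.shape, toy_X_test.shape)  # (7, 2) (3, 2)
```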

### Classifier 
A MultinomialNB classifier is created and trained on the training data, using the word counts as features and the topic number as the target.

In [None]:
mnc_classifier = MultinomialNB()
mnc_classifier.fit(X_train, y_train)
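As a minimal sketch of how MultinomialNB learns from word counts, the made-up example below trains on four tiny documents with two words and two topic labels. The data and the `toy_` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts: columns are two words, rows are documents
toy_X = np.array([[3, 0], [2, 0], [0, 3], [0, 2]])
toy_y = np.array([0, 0, 1, 1])  # topic labels

toy_classifier = MultinomialNB()
toy_classifier.fit(toy_X, toy_y)

# A document using only the first word, and one using only the second
toy_predictions = toy_classifier.predict(np.array([[4, 0], [0, 4]]))
print(toy_predictions.tolist())  # [0, 1]
```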

### Prediction
The trained model can now predict the topic number of a question it has not seen before. The held-out **test set** is fed into the classifier to obtain the predictions.

In [None]:
mnc_model_predictions = mnc_classifier.predict(X_test)

### Confusion Matrix
To evaluate the topic classifier, a confusion matrix and a classification report are computed on the test set.

In [None]:
print(metrics.confusion_matrix(y_test, mnc_model_predictions))

In [None]:
print(metrics.classification_report(y_test, mnc_model_predictions))
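To help read the confusion matrix, the sketch below builds one from made-up labels for three topics. Rows correspond to the true topic and columns to the predicted topic, so correct predictions lie on the diagonal. The `toy_` names and values are hypothetical.

```python
import numpy as np
from sklearn import metrics

toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

toy_cm = metrics.confusion_matrix(toy_true, toy_pred)
print(toy_cm)

# Correct predictions lie on the diagonal
toy_accuracy = np.trace(toy_cm) / toy_cm.sum()
print(toy_accuracy)
```

Here four of the six predictions are on the diagonal, so the accuracy is 4/6.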