# Restaurant Review Classifier
This is a simple NLP project based on the [NLP section of A-Z Machine Learning Course on Udemy](https://www.udemy.com/machinelearning/learn/v4/t/lecture/6085634?start=0).

The objective of this exercise is to identify the best model for classifying the review comments of a restaurant. We clean the reviews and convert them into vectors using the bag-of-words model.

## Index
#### 1. [Preprocessing](#preprocessing)
#### 2. [Helper Functions](#hf)
#### 3. [Gaussian Naive Bayes](#gnb)
#### 4. [Decision Tree Classifier](#dtc)
#### 5. [Random Forest Classifier](#RFC)
#### 6. [Predictor Function](#predictor)

<a id='preprocessing'></a>
### Preprocessing

##### Steps taken
- Removing punctuation and symbols
- Removing stop words
- Tokenizing and stemming the remaining words
- Building vectors from the individual reviews
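
The steps above can be sketched in plain Python. This is only an illustration of the bag-of-words idea, not the notebook's implementation: the stop-word list is a tiny hypothetical one (the notebook uses NLTK's English list), stemming is left out, and the helper names `clean` and `bag_of_words` are made up for this example.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only
STOP_WORDS = {"the", "is", "was", "this", "and", "a"}

def clean(text):
    """Keep letters only, lowercase, split into tokens, drop stop words (no stemming)."""
    tokens = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

def bag_of_words(reviews):
    """Return the sorted vocabulary and one count vector per review."""
    cleaned = [clean(r) for r in reviews]
    vocabulary = sorted({token for tokens in cleaned for token in tokens})
    vectors = []
    for tokens in cleaned:
        counts = Counter(tokens)
        # One column per vocabulary word, holding how often it occurs in the review
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(["Wow... Loved this place.", "The place was nasty, nasty."])
print(vocabulary)  # ['loved', 'nasty', 'place', 'wow']
print(vectors)     # [[1, 0, 1, 1], [0, 2, 1, 0]]
```

Each review becomes a row of word counts over a shared vocabulary, which is the kind of matrix `CountVectorizer` produces in the cells below.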

In [1]:
# importing some basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split 
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
dataset.head()

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


The dataset contains the review string followed by a binary flag indicating whether the user liked the restaurant or not.
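
The `quoting = 3` argument in the `read_csv` call above is the integer value of `csv.QUOTE_NONE`, which tells the parser to treat quote characters as ordinary text. The sketch below shows the same setting with the standard library `csv` module; the two sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Two made-up rows in the same tab-separated layout as the dataset
sample = 'Review\tLiked\n"Best" pizza ever.\t1\nCrust is not good.\t0\n'

# QUOTE_NONE: quote characters are kept as part of the field
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
header, *rows = list(reader)

print(csv.QUOTE_NONE)  # 3
print(header)          # ['Review', 'Liked']
print(rows[0])         # ['"Best" pizza ever.', '1']
```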

In [3]:
# Build the stop-word set and the stemmer once instead of on every call
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

# Removes non-letter symbols, lowercases, drops stop words, and stems the remaining words
def clean_string(review):
    review = re.sub('[^a-zA-Z]', ' ', review)
    words = review.lower().split()
    words = [ps.stem(word) for word in words if word not in stop_words]
    return ' '.join(words)

In [4]:
corpus = dataset['Review'].apply(clean_string)

In [5]:
cv = CountVectorizer(max_features = 1500)

In [6]:
x = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

In [7]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)

<a id='hf'></a>
### Helper Functions

In [8]:
from sklearn.metrics import confusion_matrix

In [9]:
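def describe_performance(model_name, y_true, y_pred):
    # confusion_matrix indexes rows by true label and columns by predicted label
    cm = confusion_matrix(y_true, y_pred)
    # The counts and ratios below are taken relative to class 0 ("not liked")
    TP = cm[0][0]  # true 0, predicted 0
    FP = cm[0][1]  # true 0, predicted 1
    TN = cm[1][1]  # true 1, predicted 1
    FN = cm[1][0]  # true 1, predicted 0
    accuracy = (TP+TN)/(TP+TN+FP+FN)
    precision = TP/(TP+FP)
    recall = TP/(TP+FN)
    f1 = (2*precision*recall)/(precision+recall)
    print('## Summary of ', model_name,' ##')
    print('The confusion matrix is :')
    print(cm)
    print('accuracy is ', accuracy)
    print('precision is ', precision)
    print('recall is ', recall)
    print('F1 Score is ', f1)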
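
For cross-checking, the textbook definitions can be written against the layout scikit-learn uses, where `cm[i][j]` counts samples whose true label is `i` and whose predicted label is `j`. The following is a minimal pure-Python sketch under that assumption; the function name `binary_metrics`, its `positive` parameter, and the toy counts are hypothetical and are not results from this notebook. It does not guard against division by zero.

```python
def binary_metrics(cm, positive=1):
    """Accuracy, precision, recall and F1 from a 2x2 matrix laid out as cm[true][predicted]."""
    negative = 1 - positive
    tp = cm[positive][positive]  # truly positive, predicted positive
    fn = cm[positive][negative]  # truly positive, predicted negative
    fp = cm[negative][positive]  # truly negative, predicted positive
    tn = cm[negative][negative]  # truly negative, predicted negative
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Toy counts for illustration only
toy_cm = [[6, 2],
          [1, 11]]
accuracy, precision, recall, f1 = binary_metrics(toy_cm, positive=1)
print(accuracy, precision, recall, f1)
```

Passing `positive=0` reports the same metrics with "not liked" as the class of interest, which makes it explicit which class a precision or recall figure refers to.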

<a id='gnb'></a>
### Gaussian Naive Bayes

In [10]:
from sklearn.naive_bayes import GaussianNB

In [11]:
gaussNB_classifier = GaussianNB()
gaussNB_classifier.fit(x_train, y_train)

GaussianNB(priors=None)

In [12]:
y_pred = gaussNB_classifier.predict(x_test)

In [13]:
describe_performance('Gaussian Naive Bayes', y_test, y_pred)

## Summary of  Gaussian Naive Bayes  ##
The confusion matrix is :
[[55 42]
 [12 91]]
accuracy is  0.73
precision is  0.567010309278
recall is  0.820895522388
F1 Score is  0.670731707317


<a id='dtc'></a>
### Decision Tree Classifier

In [14]:
from sklearn.tree import DecisionTreeClassifier

In [15]:
DTclassifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
DTclassifier.fit(x_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=0, splitter='best')

In [16]:
y_pred = DTclassifier.predict(x_test)

In [17]:
describe_performance('Decision Tree', y_test, y_pred)

## Summary of  Decision Tree  ##
The confusion matrix is :
[[74 23]
 [35 68]]
accuracy is  0.71
precision is  0.762886597938
recall is  0.678899082569
F1 Score is  0.718446601942


<a id='RFC'></a>
### Random Forest Classifier

In [18]:
from sklearn.ensemble import RandomForestClassifier

In [19]:
RFclassifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
RFclassifier.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)

In [20]:
y_pred = RFclassifier.predict(x_test)

In [21]:
describe_performance('Random Forest Classifier', y_test, y_pred)

## Summary of  Random Forest Classifier  ##
The confusion matrix is :
[[87 10]
 [46 57]]
accuracy is  0.72
precision is  0.896907216495
recall is  0.654135338346
F1 Score is  0.75652173913


<a id='predictor'></a>
### Predictor

A sample predictor was created for use in our Django app. It classifies the comment with all three models and averages their outputs: because each model returns 0 or 1, an average above 0.5 means that at least two of the three models predicted 1 (liked). The predictor takes the comment as a string.
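
The averaging rule can be illustrated without any models. This is a small sketch with made-up votes; the helper name `majority_vote` is hypothetical and stands in for the thresholding step only.

```python
def majority_vote(votes):
    """Return 1 when the mean of the binary votes exceeds 0.5, else 0."""
    return 1 if sum(votes) / len(votes) > 0.5 else 0

# Hypothetical votes from three classifiers for a single review
print(majority_vote([1, 0, 1]))  # 1 (two of three voted "liked")
print(majority_vote([0, 0, 1]))  # 0 (only one voted "liked")
```

With an odd number of binary classifiers the mean can never equal 0.5, so the rule always yields a clear majority.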

In [28]:
def predict(comment):
    cleaned = clean_string(comment)
    vector = cv.transform([cleaned]).toarray()
    result = []
    result.append(gaussNB_classifier.predict(vector)[0])
    result.append(DTclassifier.predict(vector)[0])
    result.append(RFclassifier.predict(vector)[0])
    avg_result = np.array(result).mean()
    if(avg_result>0.5):
        print("The review is good")
        return 1
    else:
        print("The review is bad")
        return 0

The next function takes an already vectorized review as input. It was created only to check the results of the combined models on the test set.

In [50]:
# Does prediction with the vectors
def predict_vector(vector):
    result = []
    result.append(gaussNB_classifier.predict(vector)[0])
    result.append(DTclassifier.predict(vector)[0])
    result.append(RFclassifier.predict(vector)[0])
    avg_result = np.array(result).mean()
    if(avg_result>0.5):
        print("The review is good")
        return 1
    else:
        print("The review is bad")
        return 0

In [51]:
y_pred = []
for vector in x_test:
    # predict() expects a 2-D array, so reshape each row to shape (1, n_features)
    y_pred.append(predict_vector(vector.reshape(1, -1)))
print(y_pred)



The review is bad
The review is bad
The review is good
The review is bad
The review is bad
...



In [52]:
describe_performance('Combined Predictor', y_test, y_pred)

## Summary of  Combined Predictor  ##
The confusion matrix is :
[[82 15]
 [32 71]]
accuracy is  0.765
precision is  0.845360824742
recall is  0.719298245614
F1 Score is  0.777251184834


##### Pickling for use in our Django project

In [24]:
import pickle

In [25]:
with open('Gaussian_model.pkl', 'wb') as g_file:
    pickle.dump(gaussNB_classifier, g_file)
with open('Decision_Tree.pkl', 'wb') as g_file:
    pickle.dump(DTclassifier, g_file)
with open('Random_Forest.pkl', 'wb') as g_file:
    pickle.dump(RFclassifier, g_file)
with open('count_vectoriser.pkl', 'wb') as g_file:
    pickle.dump(cv, g_file)
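
On the Django side, the saved objects would be restored with `pickle.load`. The sketch below shows the round trip with a stand-in dictionary and a temporary file, since the fitted models are not available outside the notebook; the object and the file name `stand_in.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model or vectorizer; any picklable object behaves the same way
stand_in_model = {"name": "stand-in", "vocabulary_size": 1500}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stand_in.pkl")  # hypothetical file name
    with open(path, "wb") as out_file:
        pickle.dump(stand_in_model, out_file)
    with open(path, "rb") as in_file:
        restored = pickle.load(in_file)

print(restored == stand_in_model)  # True
```

The count vectorizer needs to be loaded together with the classifiers so that new comments are mapped onto the same feature columns the models were trained on. Pickle files should only be loaded from a trusted source.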

### Conclusion 

None of these models classifies the reviews perfectly. Among the individual models, the Random Forest Classifier had the highest reported F1 score (0.757), although the three accuracies were close (0.71 to 0.73). The predictor function, which aggregates the three classifiers, did better still, with an accuracy of 0.765 and an F1 score of 0.777. Keep in mind that these models were built on a very limited dataset and have corresponding limitations. Altogether, the results are fairly good for a basic implementation in a web app.