# NLP for Sentiment Analysis (with a Classification Model)

We have a dataset of restaurant reviews. In this exercise, we analyse the reviews with a bag-of-words representation and the Naive Bayes algorithm to classify each review as good or bad, which makes this an example of a classification model.

**Importing the libraries and the dataset**

In [30]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [31]:
%matplotlib inline

In [32]:
data = pd.read_csv('Restaurant_Reviews.tsv',sep='\t', quoting = 3)
#quoting=3 (csv.QUOTE_NONE) makes the parser treat double quotes in the Review column as ordinary characters
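
The value `3` is the integer behind the constant `csv.QUOTE_NONE` in Python's standard library. A minimal sketch of its effect, assuming a hypothetical two-row sample in the same tab-separated layout rather than the real file:

```python
import csv
import io

import pandas as pd

# Hypothetical sample in the same tab-separated layout (not rows from the real file)
sample = 'Review\tLiked\n"Great" food, would return.\t1\nThe "special" was bland.\t0\n'

# quoting=csv.QUOTE_NONE (the integer 3) makes the parser treat " as an ordinary character
sample_df = pd.read_csv(io.StringIO(sample), sep='\t', quoting=csv.QUOTE_NONE)

print(csv.QUOTE_NONE)
print(sample_df['Review'].tolist())
```

With this setting the quotation marks stay in the review text and cannot be mistaken for field delimiters.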

In [33]:
data.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


The dataset is balanced: it contains an equal number of liked and disliked reviews (500 each).

In [34]:
data['Liked'].value_counts()

1    500
0    500
Name: Liked, dtype: int64

**Data pre-processing (Text cleaning)**

In [35]:
#Importing the required libraries
import re
import nltk
#nltk.download('stopwords') #run once if the stopword corpus is not installed yet

In [36]:
from nltk.corpus import stopwords #List of common English words to filter out
from nltk.stem.porter import PorterStemmer #Reduces related words to their common root, which helps reduce the dimensions of the sparse matrix
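
As a small illustration of what stemming does, the sketch below applies `PorterStemmer` to a few words from the first reviews shown above. The names `ps_demo`, `words` and `stems` are only used for this example.

```python
from nltk.stem.porter import PorterStemmer

ps_demo = PorterStemmer()

# Words taken from the first reviews shown above, already lowercased
words = ['loved', 'tasty', 'texture', 'nasty']
stems = [ps_demo.stem(w) for w in words]
print(stems)
```

The stems are not always dictionary words ('tasti', 'textur'). That is acceptable here, because they only serve as column labels in the bag-of-words matrix.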

In [37]:
corpus = [] #list of all clean reviews
#build the stemmer and the stopword set once, outside the loop
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
#'not' and 'but' carry sentiment, so we take them off the stopword list to keep them in the reviews
all_stopwords.remove("not")
all_stopwords.remove("but")
#the regex below replaces apostrophes with spaces, so no token contains one;
#removing these contractions therefore has no effect on the filtering
all_stopwords.remove("didn't")
all_stopwords.remove("isn't")
all_stopwords.remove("won't")
all_stopwords = set(all_stopwords)
for i in range(data.shape[0]):
    #the caret inside [] negates the class: anything that is not a letter becomes a space
    #(the range must be A-Z; A-z would also keep the characters [ \ ] ^ _ `)
    review = re.sub('[^a-zA-Z]',' ',data['Review'].iloc[i])
    #transform all the capital letters to lowercase
    review = review.lower()
    #split the review into words, which are then filtered and stemmed
    review = review.split()
    #drop the stopwords and reduce the remaining words to their roots
    review = [ps.stem(word) for word in review if word not in all_stopwords]
    #join the clean words back into a single string
    review = ' '.join(review)
    corpus.append(review)
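
A detail worth checking in the character class is the case of the last letter: the range `A-z` is wider than `A-Z`, because it also covers the six ASCII characters that sit between `Z` and `a`. The sketch below compares the two patterns on a hypothetical input string.

```python
import re

# Hypothetical input chosen to contain characters that sit between 'Z' and 'a' in ASCII
text = 'good_food [5/5]'

loose = re.sub('[^a-zA-z]', ' ', text)   # A-z spans ASCII 65-122, so [ ] and _ survive
strict = re.sub('[^a-zA-Z]', ' ', text)  # only the 52 ASCII letters survive

print(repr(loose))
print(strict.split())
```

With `A-z`, the underscore and the square brackets are kept and would end up inside tokens. With `A-Z`, only the two words remain.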

In [38]:
#Let us look at how the first 10 reviews were transformed
print(corpus[:10])

['wow love place', 'crust not good', 'not tasti textur nasti', 'stop late may bank holiday rick steve recommend love', 'select menu great price', 'get angri want damn pho', 'honeslti tast fresh', 'potato like rubber could tell made ahead time kept warmer', 'fri great', 'great touch']


**Creating the bag of words model**

In [39]:
#Tokenization 
from sklearn.feature_extraction.text import CountVectorizer
#we keep only the 1500 most frequent words to exclude rare occurrences such as 'rick', 'steve', 'holiday'
cv = CountVectorizer(max_features=1500) 

In [40]:
X = cv.fit_transform(corpus).toarray()
y = data['Liked']

In [41]:
X.shape

(1000, 1500)
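
To make the shape of `X` concrete, here is a minimal sketch that builds a bag-of-words matrix for four cleaned reviews copied from the corpus printed above. The names `mini_corpus`, `cv_demo` and `X_demo` are only used for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Four cleaned reviews copied from the corpus printed above
mini_corpus = ['wow love place', 'crust not good', 'fri great', 'great touch']

cv_demo = CountVectorizer()
X_demo = cv_demo.fit_transform(mini_corpus).toarray()

print(sorted(cv_demo.vocabulary_))   # the learned vocabulary, one column per word
print(X_demo.shape)                  # (number of reviews, number of distinct words)
print(X_demo[:, cv_demo.vocabulary_['great']])  # count of 'great' in each review
```

Each row is one review and each column counts one word, so the full matrix has 1000 rows and, with `max_features=1500`, 1500 columns.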

**Splitting dataset into training and testing sets**

In [42]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=0)
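
With `test_size=0.25`, the 1000 reviews are divided into 750 for training and 250 for testing. The sketch below checks this on placeholder arrays that have the same shape as `X` and `y` but contain dummy values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as X and y above (contents are dummy values)
X_dummy = np.zeros((1000, 1500), dtype=int)
y_dummy = np.array([0, 1] * 500)

X_tr, X_te, y_tr, y_te = train_test_split(X_dummy, y_dummy, test_size=0.25, random_state=0)
print(X_tr.shape, X_te.shape)
```

The split in this notebook is not stratified, so the two classes need not be equally represented in the test set. Passing `stratify=y` to `train_test_split` would preserve the 50/50 class proportions in both parts.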

**Implementing and evaluating Naive Bayes Classification model**

In [43]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()

In [44]:
nb.fit(X_train,y_train)
y_pred = nb.predict(X_test)

In [45]:
from sklearn.metrics import confusion_matrix, classification_report

In [46]:
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[ 67  50]
 [ 20 113]]
              precision    recall  f1-score   support

           0       0.77      0.57      0.66       117
           1       0.69      0.85      0.76       133

    accuracy                           0.72       250
   macro avg       0.73      0.71      0.71       250
weighted avg       0.73      0.72      0.71       250



Our Naive Bayes classifier correctly classifies 72% of the test reviews (180 of 250) as liked or disliked. Let us evaluate another classifier to see whether it does better.
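
The per-class precision and recall in the report follow directly from the confusion matrix, where rows are the actual classes and columns are the predicted classes. A short sketch using the numbers printed above:

```python
import numpy as np

# Confusion matrix copied from the Naive Bayes output above
# rows = actual class (0, 1), columns = predicted class (0, 1)
cm_nb = np.array([[67, 50],
                  [20, 113]])

recall = cm_nb.diagonal() / cm_nb.sum(axis=1)      # correct / actual, per class
precision = cm_nb.diagonal() / cm_nb.sum(axis=0)   # correct / predicted, per class

print(np.round(precision, 2))
print(np.round(recall, 2))
```

The low recall for class 0 reflects the 50 disliked reviews that the model labelled as liked.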

**Implementing and evaluating Random Forest Classification model**

In [47]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=101)

In [48]:
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)

In [49]:
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[106  11]
 [ 60  73]]
              precision    recall  f1-score   support

           0       0.64      0.91      0.75       117
           1       0.87      0.55      0.67       133

    accuracy                           0.72       250
   macro avg       0.75      0.73      0.71       250
weighted avg       0.76      0.72      0.71       250



**Conclusion:**

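The Random Forest classifier does not outperform the Naive Bayes classifier on this test set: both reach an accuracy of about 72% (179 versus 180 correct predictions out of 250). The two models err differently, however. Naive Bayes recalls more of the liked reviews (0.85 versus 0.55), while Random Forest recalls more of the disliked reviews (0.91 versus 0.57).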
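
The accuracies can be recomputed from the two confusion matrices printed above. A small sketch, with `accuracy` as a hypothetical helper defined only for this check:

```python
import numpy as np

# Confusion matrices copied from the outputs above
cm_nb = np.array([[67, 50], [20, 113]])   # Naive Bayes
cm_rf = np.array([[106, 11], [60, 73]])   # Random Forest

def accuracy(cm):
    # correct predictions lie on the diagonal
    return np.trace(cm) / cm.sum()

print(accuracy(cm_nb))
print(accuracy(cm_rf))
```

This gives 0.72 for Naive Bayes and 0.716 for Random Forest, matching the `accuracy` rows of both reports.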