
## Dependencies and data loading

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
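# Tab-separated file; quoting=3 is csv.QUOTE_NONE, so quote characters in reviews stay literal text
data = pd.read_csv('Restaurant_Reviews.tsv', sep='\t', quoting=3)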

In [3]:
data.head()

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In [4]:
data.tail()

| | Review | Liked |
|---|---|---|
| 995 | I think food should have flavor and texture an... | 0 |
| 996 | Appetite instantly gone. | 0 |
| 997 | Overall I was not impressed and would not go b... | 0 |
| 998 | The whole experience was underwhelming, and I ... | 0 |
| 999 | Then, as if I hadn't wasted enough of my life ... | 0 |


In [5]:
data['Liked'].value_counts()

1    500
0    500
Name: Liked, dtype: int64

## NLTK Preprocessing

In [6]:
from nltk.corpus import stopwords

In [7]:
import nltk
import re

In [8]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [9]:
# Replace every non-letter with a space (A-Z, not A-z, which would also keep [ \ ] ^ _ `)
review = re.sub('[^a-zA-Z]', ' ', data['Review'][0])
review

'Wow    Loved this place '
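
One detail of the character class used for cleaning is easy to miss: the ranges `A-Z` and `A-z` are not equivalent. `A-z` spans ASCII codes 65 to 122, which covers the punctuation characters that sit between the uppercase and lowercase letters as well as the letters themselves. The sketch below compares the two on a made-up string (the sample text is an assumption, not a row from the dataset).

```python
import re

# Hypothetical sample string, not taken from the dataset
sample = "good_food [5/5]"

# 'A-z' spans ASCII 65-122, so [ \ ] ^ _ and ` survive the substitution
loose = re.sub('[^a-zA-z]', ' ', sample)

# 'A-Z' keeps letters only
strict = re.sub('[^a-zA-Z]', ' ', sample)

print(loose.split())
print(strict.split())
```

On the first review the two patterns give the same result, since it contains none of the affected characters.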

In [10]:
review = review.lower()

In [11]:
review = review.split()

In [12]:
review = [word for word in review if word not in stopwords.words('english')]
review

['wow', 'loved', 'place']
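
A side effect of this step is worth keeping in mind for sentiment data: NLTK's English stopword list contains negations such as "not", so a review like "Crust is not good." loses the word that carries its polarity. The sketch below illustrates this with a small hard-coded stopword set (a hypothetical subset, used so the example runs without downloading NLTK data) and shows one possible workaround, removing "not" from the set before filtering. The helper name `remove_stopwords` is introduced here for illustration only.

```python
# Hypothetical subset of a stopword list, hard-coded so the sketch runs
# without downloading NLTK data; NLTK's English list also contains "not"
stop_words = {"is", "not", "the", "this", "and", "was"}

def remove_stopwords(text, stop_words):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in stop_words]

review_text = "Crust is not good"

# With the full set, the negation is dropped
print(remove_stopwords(review_text, stop_words))

# With "not" excluded from the set, the negation is kept
print(remove_stopwords(review_text, stop_words - {"not"}))
```

Whether keeping negations helps the classifier here is an open question; it would need to be checked against the test accuracy.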

In [13]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [14]:
review = [ps.stem(word) for word in review]
review

['wow', 'love', 'place']

In [15]:
review = " ".join(review)
review

'wow love place'

In [16]:
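# All preprocessing steps in one place

corpus = []

ps = PorterStemmer()
# Build the stopword set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))

for i in range(len(data)):
  # Keep letters only (A-Z, not A-z, which would also keep [ \ ] ^ _ `)
  review = re.sub('[^a-zA-Z]', ' ', data['Review'][i])
  review = review.lower()
  review = review.split()
  review = [ps.stem(word) for word in review if word not in stop_words]
  review = " ".join(review)

  corpus.append(review)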

In [28]:
len(corpus)

1000

## Bag of words: feature engineering

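In the bag-of-words model, a vocabulary is first built from the corpus. Because stopwords were removed during preprocessing, the vocabulary does not contain them. Each document (here, a review) is then represented as a vector whose length equals the vocabulary size: the position of each term holds the number of times that term occurs in the document, and every other position is 0. Note that `CountVectorizer` stores counts by default, so a term that appears twice gets 2 rather than 1. Since a single review uses only a handful of the vocabulary terms, the resulting matrix is sparse.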
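
The idea can be sketched with the standard library alone. The toy corpus below is made up for illustration, and the `vectorize` helper is a hypothetical stand-in for what a vectorizer does conceptually, not `CountVectorizer`'s actual implementation. It uses term counts, which is what `CountVectorizer` stores by default.

```python
from collections import Counter

# Toy corpus of already-preprocessed reviews (made up for illustration)
docs = ["wow love place", "crust good", "love love crust"]

# Vocabulary: sorted unique terms across all documents
vocab = sorted({word for doc in docs for word in doc.split()})

def vectorize(doc, vocab):
    """Return the term-count vector of one document over the vocabulary."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

# One row per document, one column per vocabulary term
matrix = [vectorize(doc, vocab) for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Most entries in each row are 0 even in this tiny example, which is why a real vectorizer returns a sparse matrix.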

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)

In [19]:
# Convert the sparse result to a dense array, since GaussianNB does not accept sparse input
X = cv.fit_transform(corpus).toarray()
X.shape

(1000, 1500)

In [20]:
y = data['Liked'].values  # target labels: 1 = liked, 0 = not liked
y.shape

(1000,)

## Applying an ML algorithm

In [21]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [22]:
X_train.shape,X_test.shape

((800, 1500), (200, 1500))

In [23]:
from sklearn.naive_bayes import GaussianNB

In [24]:
classifier = GaussianNB()

In [25]:
classifier.fit(X_train, y_train)

GaussianNB()

In [26]:
y_pred = classifier.predict(X_test)

In [27]:
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, y_pred))

0.73
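
Accuracy is the fraction of test reviews predicted correctly, so 0.73 on 200 test reviews corresponds to 146 correct predictions. It does not say how the errors split between false positives and false negatives. The sketch below shows how that breakdown could be computed by hand; the labels are made up for illustration and are not the notebook's actual predictions.

```python
# Made-up labels for illustration; not the notebook's actual predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 1, 1, 0, 0, 1, 1, 0]

pairs = list(zip(y_true, y_hat))

# Count each cell of the 2x2 confusion matrix
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # liked, predicted liked
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # not liked, predicted not liked
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not liked, predicted liked
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # liked, predicted not liked

# Accuracy is the share of correct predictions
accuracy = (tp + tn) / len(pairs)

print(tp, tn, fp, fn)
print(accuracy)
```

For the real model, `sklearn.metrics.confusion_matrix(y_test, y_pred)` returns the same four counts.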
