## Restaurant Reviews - Natural Language Processing (Beginner)

- The dataset contains restaurant reviews.
- Using this dataset, we train a model that predicts whether a newly uploaded review is positive or negative.

In [30]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [31]:
df = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)   # quoting = 3 is csv.QUOTE_NONE: quote characters are treated as ordinary text
df.head()

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


- The dataset file is of type '.tsv' (tab-separated values).
- We use a tsv file instead of a csv file for the following reasons:
    1. In a csv file, columns are separated by commas. Review text often contains commas, which a parser can mistake for column separators, so a single review may be split across several columns.
    2. In a tsv file, columns are separated by tabs. The review text contains no tabs, so the columns are parsed unambiguously.

- In the 'Liked' column, 1 = the review is positive; 0 = the review is negative.
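
The difference between the two separators is easy to see on a single line of text. The sketch below uses only Python's standard `csv` module and a made-up review (not taken from the dataset). It also shows that the `quoting = 3` argument passed to `read_csv` above is the numeric value of `csv.QUOTE_NONE`, which tells the parser to treat quote characters as ordinary text.

```python
import csv
import io

# Hypothetical review containing commas and quote characters, followed by its label.
line = 'The food, frankly, was "great"\t1\n'

# Naively treating commas as separators splits the review text itself.
comma_fields = line.rstrip('\n').split(',')

# Parsing as tab-separated values with quoting disabled keeps the review intact.
reader = csv.reader(io.StringIO(line), delimiter='\t', quoting=csv.QUOTE_NONE)
tab_fields = next(reader)

print(csv.QUOTE_NONE)   # the numeric value passed as quoting = 3
print(comma_fields)     # review broken into several pieces
print(tab_fields)       # review text and label as two clean fields
```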

### Data Preprocessing (Text Cleaning): 

In [32]:
print(df['Review'][0])

Wow... Loved this place.


In [33]:
import re

new = re.sub('[^a-zA-Z]', ' ', df['Review'][0])  # replace every character that is not a letter (a-z, A-Z) with a space
new = new.lower()                                # lowercase so that e.g. 'Loved' and 'loved' count as the same word
print(new)

wow    loved this place 


In [12]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords          

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Pallavi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


- The stopwords package contains words that carry little meaning for the model, such as prepositions and articles, and therefore do not help with prediction.

In [37]:
print('Type of new: ',type(new))
new = new.split()            # convert the string to a list of words so that we can iterate over them
print('Type of new after splitting: ',type(new))
print(new)

Type of new:  <class 'str'>
Type of new after splitting:  <class 'list'>
['wow', 'loved', 'this', 'place']


**Stemming:** The process of reducing each word to its root form (stem).
Example: 'love' is the root of 'loved', 'loving' and 'loves'.

In [41]:
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))   # build the stopword set once, not once per word
new = [ps.stem(word) for word in new if word not in stop_words]

In [42]:
print(new)

['wow', 'love', 'place']


- Here, the word 'this' is removed because it is a stopword and does not help the prediction.
- The word 'loved' is reduced to its root, 'love'.
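
To see how the stemmer behaves on a few more words, here is a small sketch using the same `PorterStemmer` class. The words are taken from the first few reviews of the dataset. Note that a stem need not be a dictionary word: the Porter stemmer maps 'tasty' to 'tasti', for example.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Lowercased words that occur in the first few reviews of the dataset.
words = ['loved', 'stopped', 'tasty', 'texture', 'nasty']

# Map each word to the stem produced by the Porter algorithm.
stems = {word: ps.stem(word) for word in words}

for word, stem in stems.items():
    print(f'{word} -> {stem}')
```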

In [43]:
new = ' '.join(new)         # join the list of words back into a single string
print(type(new))

<class 'str'>


In [44]:
print(new)

wow love place


#### What we have done so far:

- Step 1: We replaced every non-letter character with a space, keeping only the letters a-z and A-Z.
- Step 2: We converted the text to lowercase so that differently cased forms of a word match.
- Step 3: We removed stopwords, i.e. words that do not help the prediction, using the stopwords package.
- Step 4: We stemmed each remaining word, keeping only its root.
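
The four steps can be bundled into one helper function. The following is a minimal sketch, not the notebook's own code: `clean_review`, its `stop_words` and `stem` parameters, and the tiny `demo_stop_words` set are all names introduced here for illustration, so that the example runs with the standard library alone.

```python
import re

def clean_review(text, stop_words, stem=lambda word: word):
    """Apply steps 1-4 to one review and return the cleaned string.

    `stop_words` and `stem` are parameters introduced for this sketch;
    in the notebook they would be the nltk stopword set and `ps.stem`.
    """
    text = re.sub('[^a-zA-Z]', ' ', text)                 # step 1: keep letters only
    words = text.lower().split()                          # step 2: lowercase, then split into words
    words = [w for w in words if w not in stop_words]     # step 3: drop stopwords
    return ' '.join(stem(w) for w in words)               # step 4: stem each word and rejoin

# Tiny hand-picked stopword set, for illustration only (not the nltk list).
demo_stop_words = {'this', 'is', 'the', 'was', 'and'}

# No stemmer is passed here, so step 4 leaves the words unchanged.
cleaned = clean_review('Wow... Loved this place.', demo_stop_words)
print(cleaned)
```

Passing `set(stopwords.words('english'))` and `ps.stem` instead of the demo values should reproduce the 'wow love place' result obtained step by step above.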

**Now we apply the above process to our entire dataset.**

In [48]:
corpus = []
ps = PorterStemmer()                             # create the stemmer once
stop_words = set(stopwords.words('english'))     # build the stopword set once
for i in range(len(df)):                         # iterate over every review in the dataframe
    review = re.sub('[^a-zA-Z]', ' ', df['Review'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

In [55]:
df.head()        # Original dataset

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In [61]:
corpus[0:5]     # the first five reviews after applying all the steps above

['wow love place',
 'crust good',
 'tasti textur nasti',
 'stop late may bank holiday rick steve recommend love',
 'select menu great price']

- We can see that in the first review the word 'this' is removed, and in the second review the words 'is' and 'not' are removed.
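
Because 'not' is treated as a stopword, "Crust is not good." becomes 'crust good' and the negation disappears. One possible adjustment, which this notebook does not apply, is to take negation words out of the stopword set before filtering. The sketch below uses a small hand-picked stopword set as a stand-in for the nltk list; `negations` and `remove_stop_words` are names introduced here.

```python
# Small stand-in for the nltk English stopword list (an assumption for this sketch).
stop_words = {'is', 'not', 'this', 'the', 'and', 'was'}

# Hypothetical set of negation words that we want to keep in the text.
negations = {'not', 'no', 'nor'}

def remove_stop_words(words, stop_set):
    """Return the words that are not in stop_set, preserving their order."""
    return [w for w in words if w not in stop_set]

review = 'crust is not good'.split()

print(remove_stop_words(review, stop_words))               # negation dropped
print(remove_stop_words(review, stop_words - negations))   # negation kept
```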

### Creating the Bag of Words model:

In [73]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)      # keep only the 1500 most frequent words in the corpus
X = cv.fit_transform(corpus).toarray()
y = df.iloc[:, 1].values                       # labels: the 'Liked' column

In [74]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [75]:
X.shape      # one row per review, one column per word; each entry counts how often that word occurs in the review

(1000, 1500)
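
To make the structure of this matrix concrete, here is a small sketch that runs `CountVectorizer` on two made-up, already cleaned reviews (not from the dataset). The names `toy_corpus`, `toy_cv` and `toy_X` are introduced only for this example. It shows that each entry is a word count, so a word that appears twice in a review gives a 2.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up, already cleaned reviews (not from the dataset).
toy_corpus = ['good good food', 'bad food']

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each word to its column index; sort the words by that index.
columns = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

print(columns)   # column order of the matrix
print(toy_X)     # one row per review, one count per word
```

If only presence or absence of a word is wanted, `CountVectorizer` also accepts `binary=True`, which caps every non-zero count at 1.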

#### Splitting the dataset into training set and test set:

In [81]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

#### Fitting the model:

In [82]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [93]:
y_pred = classifier.predict(X_test)
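
The same `predict` call can classify a newly uploaded review, provided the review goes through the same cleaning steps and is vectorized with the vocabulary learned during training. The following is a minimal, self-contained sketch on made-up data; the `toy_*` names and the four training reviews are assumptions for illustration, not part of the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

# Made-up, already cleaned training reviews and labels (1 = positive, 0 = negative).
toy_corpus = ['love place', 'great food', 'bad servic', 'nasti food']
toy_labels = [1, 1, 0, 0]

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_corpus).toarray()   # learn the vocabulary on the training text
toy_clf = GaussianNB().fit(toy_X, toy_labels)

# A new review must be cleaned the same way first; this one is already cleaned.
new_review = 'love food'

# Use transform (not fit_transform) so the columns match the training vocabulary.
new_X = toy_cv.transform([new_review]).toarray()
prediction = toy_clf.predict(new_X)

print(new_X.shape, prediction)
```

In the notebook itself, the equivalent calls would be `cv.transform(...)` followed by `classifier.predict(...)` on the cleaned text of the new review.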

#### Confusion Matrix:

In [88]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[55 42]
 [12 91]]


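- The rows of the confusion matrix are the true labels and the columns are the predicted labels. The model correctly classified 55 negative reviews and 91 positive reviews; it misclassified 42 negative reviews as positive and 12 positive reviews as negative.
- Out of 200 test reviews, the model made 55 + 91 = 146 correct predictions.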

In [94]:
# Accuracy of the model:
(55+91)/200 * 100

73.0
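
Accuracy alone hides the fact that the two kinds of error are unequal here (42 false positives versus 12 false negatives). As a complement, the sketch below derives precision, recall and F1 for the positive class from the four counts in the confusion matrix above; the variable names are introduced for this example.

```python
# Counts read from the confusion matrix above (rows = true label, columns = predicted label).
tn, fp = 55, 42   # true negatives, false positives
fn, tp = 12, 91   # false negatives, true positives

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)    # of the reviews predicted positive, the share that is truly positive
recall = tp / (tp + fn)       # of the truly positive reviews, the share that was found
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}')
```

The same quantities are available in `sklearn.metrics` as `accuracy_score`, `precision_score`, `recall_score` and `f1_score`, which take `y_test` and `y_pred` directly.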