# Natural Language Processing

Natural Language Processing is about analysing text and using machines to extract information from it.

### Import libraries and dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.feature_extraction.text import CountVectorizer

Our file is a `.tsv`, so the delimiter is a tab (`\t`). A tab is a safer separator between `Review` and `Liked` than a comma, because a review may itself contain commas. I also set `quoting` to `3` (the value of `csv.QUOTE_NONE`), which tells pandas to treat quote characters as ordinary text rather than as field delimiters.

In [2]:
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
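
To see what the two settings do, here is a minimal sketch using Python's built-in `csv` module instead of pandas. The two-line `sample` string is made up for illustration and is not taken from `Restaurant_Reviews.tsv`.

```python
import csv
import io

# Hypothetical sample: a header plus one review containing quotes and a comma.
sample = 'Review\tLiked\n"Best" burger, ever.\t1\n'

# quoting=3 in read_csv corresponds to csv.QUOTE_NONE.
print(csv.QUOTE_NONE)  # 3

rows = list(csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE))
for row in rows:
    print(row)
# ['Review', 'Liked']
# ['"Best" burger, ever.', '1']
```

The comma stays inside the review because only tabs split fields, and the quotes survive as literal characters because quote handling is switched off.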

In [3]:
dataset.head(5)

|   | Review | Liked |
|---|--------|-------|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


### Clean text

Next, I will clean the text, since it may contain words or non-alphabetic characters (digits, punctuation) that carry no useful information.

In [4]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/k.bhanot/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
# Membership tests (`word in stop_words`) are faster on a set than on a list
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
corpus = []

In [6]:
for i in range(dataset.shape[0]):
    review = re.sub('[^A-Za-z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    review = [stemmer.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

Let's take a look at the first 5 reviews in their clean form.

In [7]:
corpus[0:5]

['wow love place',
 'crust good',
 'tasti textur nasti',
 'stop late may bank holiday rick steve recommend love',
 'select menu great price']
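
Notice that `'Crust is not good.'` became `'crust good'`: NLTK's English stop-word list contains "not", so the negation is dropped along with the filler words. The sketch below replays the cleaning steps on a single review to make that visible. It is a simplified illustration, not the exact pipeline above: `clean_review` and `toy_stop_words` are hypothetical names, the stop-word set is a tiny stand-in for NLTK's list, and stemming is left out so the code runs with the standard library alone.

```python
import re

def clean_review(text, stop_words):
    """Keep letters only, lowercase, split on whitespace, drop stop words."""
    letters_only = re.sub('[^A-Za-z]', ' ', text)
    tokens = letters_only.lower().split()
    return ' '.join(tok for tok in tokens if tok not in stop_words)

# Hypothetical miniature stop-word list standing in for NLTK's English list.
toy_stop_words = {'is', 'the', 'this', 'not'}

print(clean_review('Crust is not good.', toy_stop_words))
# crust good
print(clean_review('Crust is not good.', toy_stop_words - {'not'}))
# crust not good
```

Whether to keep negation words is a design choice. For sentiment data it may be worth removing "not" from the stop-word set before cleaning, though the effect on accuracy would need to be measured.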

### Create the Bag of Words

In [8]:
countVectorizer = CountVectorizer(max_features = 1500)
X = countVectorizer.fit_transform(corpus)
y = dataset.iloc[:, -1]
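
A bag of words turns each review into a row of word counts, with one column per vocabulary term. The following is a rough pure-Python sketch of that idea, assuming whitespace tokenisation. `bag_of_words` is a hypothetical helper, not scikit-learn's implementation, and `CountVectorizer` differs in details such as its tokenisation rules and sparse output. The `max_features` argument here mimics the idea of keeping only the most frequent terms.

```python
from collections import Counter

def bag_of_words(docs, max_features=None):
    """Return (vocabulary, count matrix) for whitespace-tokenised documents."""
    tokenised = [doc.split() for doc in docs]
    totals = Counter(tok for toks in tokenised for tok in toks)
    # Keep the most frequent terms, then sort them for a stable column order.
    vocabulary = sorted(term for term, _ in totals.most_common(max_features))
    index = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for toks in tokenised:
        row = [0] * len(vocabulary)
        for tok in toks:
            if tok in index:
                row[index[tok]] += 1
        matrix.append(row)
    return vocabulary, matrix

docs = ['wow love place', 'crust good', 'select menu great price love']
vocab, counts = bag_of_words(docs)
print(vocab)
# ['crust', 'good', 'great', 'love', 'menu', 'place', 'price', 'select', 'wow']
print(counts[0])
# [0, 0, 0, 1, 0, 1, 0, 0, 1]
```

Each row has as many entries as there are vocabulary terms, which is why capping the vocabulary keeps the feature matrix manageable.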

### Apply machine learning

Commonly used baseline models for text classification include **Naive Bayes**, **Decision Tree Classifier** and **Random Forest Classifier**, so I will try all three on this **Natural Language Processing** task.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
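
For intuition, a shuffled hold-out split can be sketched in a few lines of plain Python. This is an illustration of the idea only: `split_indices` is a hypothetical helper, it will not select the same rows as `train_test_split`, and the row count of 1000 is an assumption inferred from the 200-row confusion matrices at `test_size = 0.20`.

```python
import random

def split_indices(n_rows, test_size=0.20, seed=0):
    """Shuffle row indices and reserve a `test_size` share of them for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

# Assumed dataset size: 1000 reviews.
train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))
# 800 200
```

Fixing the seed plays the same role as `random_state = 0`: rerunning the notebook gives the same split, so model scores stay comparable.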

#### Naive Bayes

In [10]:
clf = GaussianNB()
clf.fit(X_train.toarray(),y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [11]:
y_pred = clf.predict(X_test.toarray())
print("Accuracy for Naive Bayes: {}".format(accuracy_score(y_test, y_pred)))
print("Confusion Matrix for Naive Bayes:")
print("{}".format(confusion_matrix(y_test, y_pred)))

Accuracy for Naive Bayes: 0.73
Confusion Matrix for Naive Bayes:
[[55 42]
 [12 91]]
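
In scikit-learn's confusion matrix, rows are the true labels and columns are the predicted labels, so for labels `0` and `1` the layout is `[[TN, FP], [FN, TP]]`. As a quick check, the sketch below recomputes the accuracy from the matrix printed above and adds precision and recall for the "liked" class.

```python
# Confusion matrix copied from the Naive Bayes output above.
cm = [[55, 42],
      [12, 91]]

(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # of the reviews predicted "liked", the share truly liked
recall = tp / (tp + fn)     # of the truly liked reviews, the share that was found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.730 precision=0.684 recall=0.883
```

The 42 false positives show that this model leans towards predicting "liked".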


#### Decision Tree Classifier

In [12]:
clf = DecisionTreeClassifier(random_state = 0)
clf.fit(X_train.toarray(),y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

In [13]:
y_pred = clf.predict(X_test.toarray())
print("Accuracy for Decision Tree: {}".format(accuracy_score(y_test, y_pred)))
print("Confusion Matrix for Decision Tree:")
print("{}".format(confusion_matrix(y_test, y_pred)))

Accuracy for Decision Tree: 0.65
Confusion Matrix for Decision Tree:
[[71 26]
 [44 59]]


#### Random Forest Classifier

In [14]:
clf = RandomForestClassifier(random_state = 0)
clf.fit(X_train.toarray(),y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [15]:
y_pred = clf.predict(X_test.toarray())
print("Accuracy for Random Forest: {}".format(accuracy_score(y_test, y_pred)))
print("Confusion Matrix for Random Forest:")
print("{}".format(confusion_matrix(y_test, y_pred)))

Accuracy for Random Forest: 0.685
Confusion Matrix for Random Forest:
[[82 15]
 [48 55]]
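
Accuracy alone hides where each model goes wrong. The sketch below takes the three confusion matrices printed in this notebook and compares how well each model recovers the negative and the positive reviews. `summarise` is a hypothetical helper written for this comparison.

```python
# Confusion matrices copied from the outputs above, laid out as [[TN, FP], [FN, TP]].
results = {
    "Naive Bayes": [[55, 42], [12, 91]],
    "Decision Tree": [[71, 26], [44, 59]],
    "Random Forest": [[82, 15], [48, 55]],
}

def summarise(cm):
    """Return accuracy and per-class recall for a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "negative_recall": tn / (tn + fp),
        "positive_recall": tp / (tp + fn),
    }

summary = {name: summarise(cm) for name, cm in results.items()}
for name, s in summary.items():
    print(f"{name:<14} acc={s['accuracy']:.3f} "
          f"neg_recall={s['negative_recall']:.3f} pos_recall={s['positive_recall']:.3f}")
```

On this split Naive Bayes has the highest accuracy and finds most of the positive reviews, while Random Forest is better at recognising negative reviews but misses many positive ones. These numbers come from a single 200-review test set, so the ranking could change with a different split.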
