## Natural Language Processing

### Index 
- [Equation and Method](#equation)
- [Preprocessing](#preprocessing)
- [Building the model](#building)
- [Result](#result)

In [6]:
# importing some basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

<a id='equation'></a>
### Equation and Method

#### Bag of words model
The bag of words model represents a collection of strings as a sparse matrix: each row is one string, and each column is one distinct word from the whole collection. Each cell holds the number of times that word occurs in that string. Every string thus becomes a numeric vector that any machine learning model can process.
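
As a rough illustration of this matrix, the sketch below builds a bag of words representation by hand for two made-up strings. The names `docs`, `vocabulary`, and `matrix` are illustrative only; the notebook itself relies on `CountVectorizer` for this step.

```python
from collections import Counter

# Two made-up strings; each becomes one row of the matrix.
docs = ["good food good service", "bad food"]

# Columns: every distinct word in the whole collection, sorted alphabetically.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Each cell counts how often the column's word occurs in the row's string.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['bad', 'food', 'good', 'service']
print(matrix)      # [[0, 1, 2, 1], [1, 1, 0, 0]]
```

Most cells are zero once the vocabulary grows, which is why the matrix is described as sparse.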

###### 1. Removing the different symbols
The first step is to remove punctuation, digits, and other non-letter symbols. The bag of words model only needs the words in the string.
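
A minimal sketch of this step, assuming that "symbols" means every character that is not an ASCII letter (the same pattern the notebook applies later):

```python
import re

text = "Wow... Loved this place."  # first review in the dataset preview

# Replace every character that is not a letter with a space,
# then lowercase and split on whitespace to get the words.
words = re.sub(r"[^a-zA-Z]", " ", text).lower().split()

print(words)  # ['wow', 'loved', 'this', 'place']
```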

###### 2. Removing the stop words
Stop words are very common words such as articles, prepositions, and pronouns (for example "the", "is", "this"). They carry little information for the vectors in the bag of words model, so we remove them.
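
The filtering itself is a simple membership test. The sketch below uses a tiny hand-picked `stop_words` set purely for illustration; it is an assumption and much smaller than the nltk list the notebook downloads.

```python
# A tiny hand-picked stop word set, used only for illustration.
# The notebook uses the full English list from nltk's stopwords corpus.
stop_words = {"this", "the", "is", "a", "and", "was"}

words = ["wow", "loved", "this", "place"]

# Keep only the words that are not in the stop word set.
filtered = [word for word in words if word not in stop_words]

print(filtered)  # ['wow', 'loved', 'place']
```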

###### 3. Stemming
Stemming reduces a word to a standard root form. Words appear in different tenses and forms; stemming maps them all to the same root, so that variants of the same word are not treated as different words in the bag of words model.
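
For example, nltk's `PorterStemmer` (the stemmer used later in this notebook) maps several forms of "love" to one root. The word list here is made up for illustration.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Three forms of the same word, chosen for illustration.
forms = ["loved", "loving", "loves"]
stems = [ps.stem(word) for word in forms]

print(stems)  # ['love', 'love', 'love']
```

After stemming, all three forms share a single column in the bag of words matrix.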

###### 4. Building the vectors
Next we build the vectors as a sparse matrix. One important consideration is removing proper nouns and other rare words. This is done by limiting the number of columns (words) in the bag of words model, which keeps only the most frequent words and drops those that rarely repeat.
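
The effect of such a limit can be sketched by hand: count every word across the corpus and keep only the most frequent ones. The corpus below is made up, and `max_features` here is a plain variable that mirrors the `CountVectorizer` parameter of the same name.

```python
from collections import Counter

# Already-cleaned, made-up reviews; "luigi" stands in for a rare proper noun.
corpus = ["good food", "good place", "good food luigi"]

# Count each word across the whole corpus: good 3, food 2, place 1, luigi 1.
counts = Counter(word for doc in corpus for word in doc.split())

# Keep only the most frequent words as columns.
max_features = 2
vocabulary = [word for word, _ in counts.most_common(max_features)]

print(vocabulary)  # ['good', 'food']
```

The rare words "place" and "luigi" no longer get a column, so they do not contribute to any vector.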

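#### Concepts for evaluating performance

In the formulas below, TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives.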

### $Accuracy = \frac{(TP+TN)}{(TP+TN+FP+FN)}$
### $Precision = \frac{(TP)}{(TP+FP)}$
### $Recall = \frac{(TP)}{(TP+FN)}$
### $F1score = \frac{2*Precision*Recall}{(Precision + Recall)}$
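
The four formulas translate directly into code. The helper `classification_metrics` below is a hypothetical name, and the counts passed to it are made up so the arithmetic is easy to follow.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Made-up counts, not results from this notebook.
accuracy, precision, recall, f1 = classification_metrics(tp=40, tn=30, fp=10, fn=20)

print(round(accuracy, 3))   # 0.7
print(round(precision, 3))  # 0.8
print(round(recall, 3))     # 0.667
print(round(f1, 3))         # 0.727
```

Note that the sketch does not guard against division by zero, which would occur if the model never predicted the positive class.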

<a id='preprocessing'></a>
### Preprocessing

In [11]:
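# train_test_split lives in sklearn.model_selection; the old
# sklearn.cross_validation module was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split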



In [1]:
import re
import nltk

We now download nltk's stop word list.

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/nevin/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [7]:
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
dataset.head()

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In [3]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [4]:
corpus = []

In [8]:
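# Create the stemmer and the stop word set once, not once per word
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))
for i in range(len(dataset)):
    # Keep letters only, then lowercase and split into words
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower().split()
    # Drop stop words and stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))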

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

In [10]:
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

<a id='building'></a>
### Building the model

Now we build a model based on the Naive Bayes Classifier.

In [14]:
from sklearn.naive_bayes import GaussianNB

In [15]:
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB(priors=None)

In [16]:
y_pred = classifier.predict(X_test)

<a id='result'></a>
### Result
Since the feature vectors have a very high dimension (up to 1500 columns), we cannot visualise them directly. Instead, we use the confusion matrix to evaluate the result.

In [17]:
from sklearn.metrics import confusion_matrix

In [18]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[55, 42],
       [12, 91]])

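The Naive Bayes classifier labels 146 of the 200 test reviews correctly (73% accuracy). Most of its errors are false positives: 42 negative reviews are predicted as positive, while only 12 positive reviews are predicted as negative.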
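
The confusion matrix above can also be unpacked into the four metrics defined earlier. This sketch copies the matrix into a plain nested list and assumes scikit-learn's layout, where rows are true labels and columns are predicted labels.

```python
# Confusion matrix reported above; rows are true labels, columns are predictions.
cm = [[55, 42],
      [12, 91]]

# For labels 0 and 1 the layout is [[TN, FP], [FN, TP]].
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.73 0.684 0.883 0.771
```

The gap between recall and precision reflects the same pattern: the model finds most positive reviews but also labels many negative ones as positive.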