# Natural Language Processing

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Importing the dataset
dataset = pd.read_csv('Dataset\\Restaurant_Reviews.tsv', delimiter='\t', quoting=3)
# The file is tab-separated (.tsv), so we pass delimiter='\t' to read_csv.
# quoting=3 (csv.QUOTE_NONE) makes the parser treat double quotes as ordinary
# characters, so quotes inside the reviews cannot cause parsing trouble later.

We use a .tsv (tab-separated values) file because reviews often contain commas as punctuation but almost never contain tabs. With a comma delimiter, a comma inside a review could split the sentence into separate fields where no split is intended.<br>
After importing the dataset, we need to clean it.
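
To see why the tab delimiter and `quoting=3` matter, here is a small sketch that uses only Python's standard `csv` module. The two sample reviews are made up for illustration and are not taken from the dataset. `csv.QUOTE_NONE` is the named constant whose value is 3, the same value passed as `quoting=3` to `read_csv`.

```python
import csv
import io

# Two made-up reviews (not from the dataset): one contains a comma,
# the other contains an unbalanced double quote.
sample = 'Review\tLiked\nGreat food, great price\t1\nThe "special was cold\t0\n'

# Splitting on tabs and treating quotes as ordinary characters
# (csv.QUOTE_NONE has the value 3) keeps every review in one piece.
rows = list(csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE))

for row in rows:
    print(row)

# Splitting the same review on commas would cut it in two.
comma_split = 'Great food, great price'.split(',')
print(comma_split)
```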

In [3]:
dataset.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


The dataset has two columns. The first column contains the review text, and the second indicates whether the review was positive (1) or negative (0).<br>
Our goal is to build a machine learning model that predicts whether a new review is positive or negative.

**Cleaning the Texts**

**Need:**<br>
Our goal is to create a bag-of-words model that contains only the relevant words from the different reviews.<br>
* We will get rid of uninformative words such as *that*, *or*, *on*, etc.
* We will also get rid of punctuation marks.
* We will also get rid of numbers unless they have a significant impact.
* We will apply `stemming`, which reduces different versions of a word to their common root, so that words with similar meanings are grouped together.
* We will also convert upper-case text to lower case.

The last step in building the bag-of-words model is tokenization.<br>
`Tokenization`: splitting each review into its individual words.<br><br>
We take all the words from all the reviews and assign one column to each word. This gives a table with many columns, in which each row is a review and each cell holds the number of times the corresponding word appears in that review.
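
The table described above can be built by hand for a few toy reviews. This is only a sketch of the idea in plain Python: the three reviews and the names `toy_reviews`, `vocabulary`, and `matrix` are hypothetical and exist only for illustration.

```python
from collections import Counter

# Three already-cleaned toy reviews (hypothetical, for illustration only).
toy_reviews = ['love place love', 'crust good', 'love crust']

# Tokenization: split each review into words.
tokenized = [review.split() for review in toy_reviews]

# One column per distinct word, in alphabetical order.
vocabulary = sorted({word for words in tokenized for word in words})

# One row per review; each cell counts how often the column's word occurs.
matrix = []
for words in tokenized:
    counts = Counter(words)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Most cells are zero even in this tiny example, because each review uses only a few of the words in the vocabulary.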

Steps of cleaning:<br>
1. Keep only the letters in the review.
 * We remove numbers, punctuation, and every other character that is not a letter.
2. Convert all letters to lower case.
3. Remove non-significant words, i.e. words that do not help the prediction, such as *the*, *that*, *and*, *in*, *all*. Each remaining word becomes a column of the sparse matrix, so dropping these words keeps the matrix smaller.
4. Stemming

In [4]:
dataset['Review'][0]

'Wow... Loved this place.'

**Retaining only letters** and **Removing irrelevant words**

Stopword list<br>
NLTK provides a list of words that are generally irrelevant (stopwords). We will check each word of a review against this list and delete the words that appear in it.

In [5]:
import re 
import nltk

If a word is present in the stopword list, we remove it from the review.

In [6]:
nltk.download('stopwords')
review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][0])
# Here we work on the first review only.

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Samir\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
print("Text with only alphabets left, numbers and punctuation being removed : ",review)

Text with only alphabets left, numbers and punctuation being removed :  Wow    Loved this place 


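Note:<br>
* Inside the square brackets, `^` means "not", so `[^a-zA-Z]` matches every character that is not a letter.
* The first parameter is the pattern to remove. Because of the `^`, we list the characters we want to keep (`a-z` and `A-Z`), and everything else is matched.
* The second parameter is the replacement for each matched character.
* The third parameter is the text in which the replacement takes place.
* If a character were removed without a replacement, the letters to its left and right would stick together, which is undesirable.
    * For example, removing the comma from `first,second` without a replacement gives `firstsecond`.
* So we pass a single space `' '` as the replacement, and every removed character is replaced by a space.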
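
As a quick check of the points in this note, the sketch below applies the same pattern to the first review twice, once with a space and once with an empty string as the replacement. The variable names `with_space` and `without_space` are chosen here for illustration.

```python
import re

text = 'Wow... Loved this place.'

# Replace every non-letter with a space: words stay separated.
with_space = re.sub('[^a-zA-Z]', ' ', text)

# Replace every non-letter with nothing: neighbouring words stick together.
without_space = re.sub('[^a-zA-Z]', '', text)

print(repr(with_space))
print(repr(without_space))
```

The extra spaces left by the first variant are harmless, because a later `split()` call treats any run of whitespace as a single separator.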

**Changing case into lower case.**

In [8]:
review = review.lower()

In [9]:
print('Case changed to lower case of the above text : ',review)

Case changed to lower case of the above text :  wow    loved this place 


Now we split the string into words and put them into a list.

In [10]:
review = review.split()

In [11]:
print('string being converted into list ',review)

string being converted into list  ['wow', 'loved', 'this', 'place']


In [12]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # build the set once for fast membership tests
review = [word for word in review if word not in stop_words]  # list comprehension

The list comprehension checks each word against NLTK's English stopword list and drops the words that are not significant.<br>
Use of the `set` function here: it takes `stopwords.words('english')`, which is a list, as its argument and converts it into a set.<br>
We do this because checking whether a word is in a `set` takes roughly constant time, whereas checking a `list` requires scanning its elements one by one. This matters when the check is repeated for every word of every review.
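
A minimal sketch of the filtering step, assuming a tiny hand-picked stopword list instead of NLTK's (so it runs without downloading anything). The names `toy_stopword_list` and `toy_stopwords` are hypothetical.

```python
# A tiny hand-picked stopword list, used here only for illustration;
# the notebook itself uses stopwords.words('english') from NLTK.
toy_stopword_list = ['this', 'is', 'the', 'a', 'and']

# Convert the list to a set before filtering, so each membership
# test is a hash lookup rather than a scan of the list.
toy_stopwords = set(toy_stopword_list)

words = ['wow', 'loved', 'this', 'place']
kept = [word for word in words if word not in toy_stopwords]
print(kept)
```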

In [13]:
print('List of relevant words : ',review)

List of relevant words :  ['wow', 'loved', 'place']


Here 'this' was irrelevant so it was removed.

**Stemming**<br>
Stemming reduces each word to its root.<br>
For example, the word *loved* comes from the word *love*; *love*, *loved*, and *loves* all carry the same meaning, so they are mapped to the same root.<br>
We do this to reduce the number of distinct words, and therefore columns, in the sparse matrix.
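
The grouping can be checked directly with NLTK's `PorterStemmer`, which needs no extra download. This is a small sketch; the word list and the `stems` dictionary are chosen here for illustration.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Different forms of the same word are reduced to one root.
# 'tasty' is included to show that a stem need not be a dictionary word.
stems = {word: ps.stem(word) for word in ['love', 'loved', 'loves', 'tasty']}

for word, stem in stems.items():
    print(word, '->', stem)
```

Stems such as `tasti` look odd, but that does not matter for the model: all that counts is that related forms end up in the same column.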

In [14]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
review = [ps.stem(word) for word in review if word not in set(stopwords.words('english'))]
# Stemming and stopword removal could have been combined in the earlier list
# comprehension with a small change; they are repeated here for learning purposes only.

In [15]:
print('After applying stemming on the above collection of words ',review)

After applying stemming on the above collection of words  ['wow', 'love', 'place']


In [16]:
# Converting the list back into a string
review = ' '.join(review)

In [17]:
print('List converted back to string : ',review)

List converted back to string :  wow love place


The steps above were applied to a single review, but we need to apply them to the whole dataset. (The steps do not need to be run separately; they were split up here only for learning purposes.)

In [18]:
corpus = []
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the stopword set once, outside the loop
for i in range(len(dataset)):  # all 1000 reviews
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)  # append the cleaned review to the corpus

In [19]:
corpus[:5]#Getting first 5 results

['wow love place',
 'crust good',
 'tasti textur nasti',
 'stop late may bank holiday rick steve recommend love',
 'select menu great price']

Creating the **Bag of Words Model**<br>
We will create a bag-of-words model from the unique words of the corpus, with one column per word. The number of rows equals the number of reviews, 1000, and each cell holds a **number**: the count of occurrences of that word in that review.<br>
The resulting table contains a lot of zeros, because each review uses only a few of the words and many words occur only once.<br>
A matrix containing a lot of zeros is called a sparse matrix, and the property of having many zeros is called sparsity.<br>
The **bag-of-words model** is the creation of this **sparse matrix** through the process of **tokenization**.<br>
This is a classification problem, because each review is labeled either 0 or 1. Note: the number of distinct words kept from the corpus equals the number of independent variables.

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
# CountVectorizer performs the tokenization
cv = CountVectorizer(max_features=1500)  # keep only the 1500 most frequent words
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values  # the dependent variable (Liked)

In [21]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [22]:
y[:10]  # first 10 values of the dependent variable

array([1, 0, 0, 1, 1, 0, 0, 0, 1, 1], dtype=int64)

In [23]:
X.shape

(1000, 1500)

Words such as people's names are irrelevant to the model. Because such words are usually rare, the `max_features` parameter lets us filter many of them out: it keeps only the most frequent words in the corpus.
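
The effect of `max_features` is easier to see on a toy corpus. The sketch below is illustrative only: the three reviews and the names `toy_corpus`, `toy_cv`, and `toy_X` are hypothetical, and the word counts are chosen so that there is no tie for the top two places.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical): 'love' occurs 3 times, 'place' twice,
# 'wow' and 'crust' once each.
toy_corpus = ['wow love place', 'love place', 'love crust']

# Keep only the 2 most frequent words as columns.
toy_cv = CountVectorizer(max_features=2)
toy_X = toy_cv.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each kept word to its column index.
kept_words = sorted(toy_cv.vocabulary_)
print(kept_words)
print(toy_X)
```

The rare words `wow` and `crust` are dropped, and the matrix has one column for each of the two words that remain.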

In [24]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

In [25]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[55, 42],
       [12, 91]], dtype=int64)

In [26]:
total = cm[0,0]+cm[0,1]+cm[1,0]+cm[1,1]
print('Our model made',cm[0,0],'correct predictions of negative reviews and',cm[1,1],'correct predictions of positive reviews.',
     cm[0,1],'negative reviews were incorrectly predicted as positive and',cm[1,0],'positive reviews were incorrectly predicted as negative, out of a total of',
     total,'reviews.')

Our model made 55 correct predictions of negative reviews and 91 correct predictions of positive reviews. 42 negative reviews were incorrectly predicted as positive and 12 positive reviews were incorrectly predicted as negative, out of a total of 200 reviews.


In [27]:
print("So the number of correct predictions is ",cm[0,0]+cm[1,1],'\nand the number of incorrect predictions is ',cm[0,1]+cm[1,0])

So the number of correct predictions is  146 
and the number of incorrect predictions is  54


In [28]:
print('Accuracy of the model is ',100*(cm[0,0]+cm[1,1])/total,'%')

Accuracy of the model is  73.0 %
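
As a cross-check of the arithmetic, here is a short sketch that recomputes the accuracy from the four cells of the confusion matrix, using the values printed earlier. It relies on the scikit-learn convention that rows are true labels and columns are predicted labels; the variable names are chosen here for illustration.

```python
# Values copied from the confusion matrix printed earlier.
cm_values = [[55, 42],
             [12, 91]]

# scikit-learn convention: rows are true labels, columns are predicted labels.
true_neg, false_pos = cm_values[0]  # negative reviews: predicted negative / positive
false_neg, true_pos = cm_values[1]  # positive reviews: predicted negative / positive

total_reviews = true_neg + false_pos + false_neg + true_pos
accuracy = (true_neg + true_pos) / total_reviews  # a fraction between 0 and 1

print(f'Accuracy: {accuracy:.2%}')
```

The accuracy is a fraction (0.73), which corresponds to 73% when expressed as a percentage.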
