# Natural Language Processing Basics

## Using NLTK and Scikit-learn



The Natural Language Toolkit (NLTK) contains many functions that help you understand Natural Language Processing (NLP).

There are many other libraries out there with tools and functions to improve your NLP model, but the concepts are the same.


### Overview

I'll walk through each step: data preprocessing, creating the corpus, building the sparse matrix, and finally training the classifier.
The data I'm using is a tsv (tab-separated) file containing 1000 reviews of a restaurant.


There are two columns:

1) The text of the review
2) A label: 1 if the review is positive, 0 if the review is negative
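
As a rough illustration of that layout, the sketch below parses a few tab-separated rows from an in-memory string instead of the real file. The sample rows after the first one and the column name `Liked` are made up for the example; only the `Review` column name is used elsewhere in this walkthrough.

```python
import csv
import io

import pandas as pd

# Invented sample in the same two-column, tab-separated layout
sample_tsv = (
    "Review\tLiked\n"
    "Wow... Loved this place.\t1\n"
    "Crust is not good.\t0\n"
    'The "special" was bland, overpriced, and cold.\t0\n'
)

# csv.QUOTE_NONE has the value 3, so this is the same as quoting=3
sample_df = pd.read_csv(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)

print(sample_df.shape)
print(sample_df.columns.tolist())
```

The commas and quote characters inside the third review stay in a single column because the only separator is the tab.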

In [1]:
# Step 1 Importing the required libraries:

import pandas as pd
import numpy as np
import nltk

# This data is tab-separated instead of comma-separated because one of the columns contains
# text which may have commas, which would confuse pandas into making more columns than there are.
# quoting=3 (csv.QUOTE_NONE) tells pandas to treat quote characters in the reviews as ordinary text.

df = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)

# We'll use the Regular Expression library to filter the text and the NLTK library to remove stop words and stem the words
# I'll explain the use of each library underneath
import re

# This step is necessary only if this is the first time you're using the nltk library
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\parth\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Regular Expressions:

A review may contain punctuation, emoticons, emojis, or numbers. When trying to analyze the sentiment of the review, these characters don't help the algorithm understand whether the review is positive or negative, so we use a regular expression to replace every character that is not a letter with a space.
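
A minimal sketch of that filtering step, using the same `[^a-zA-Z]` pattern as the cleaning code in this walkthrough. The review text here is made up for illustration.

```python
import re

text = "Great food!!! 10/10 :)"  # made-up review for illustration

# Replace every character that is not an ASCII letter with a space
letters_only = re.sub(r"[^a-zA-Z]", " ", text)

# split() collapses the runs of spaces left behind by the substitution
words = letters_only.lower().split()
print(words)  # ['great', 'food']
```

Note that this pattern only keeps the unaccented letters a-z and A-Z, so accented characters (for example the "é" in "café") are replaced by spaces as well.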

#### StopWords:

Stop words are words such as prepositions, articles, and conjunctions that are necessary for sentence structure but can skew the NLP model. NLTK ships a list of them under the name 'stopwords'. If these words are left in, the model may treat them as contributing to a positive or negative review, which is not the case.
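
The filtering itself is just a membership test. The sketch below uses a tiny hand-written set as a stand-in for `stopwords.words('english')` so that it runs without downloading anything; the real NLTK list is much longer.

```python
# A tiny stand-in for stopwords.words('english'); the real list is much longer
stop_words = {"the", "is", "this", "a", "and", "not"}

tokens = "the crust is not good".split()

# Keep only the words that are not stop words
filtered = [word for word in tokens if word not in stop_words]
print(filtered)  # ['crust', 'good']
```

One thing worth checking for sentiment tasks: NLTK's English list includes negations such as "not", so "not good" and "good" look identical after filtering. Keeping negations in may be a reasonable experiment.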

#### Stemming:

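The PorterStemmer class reduces each word to its stem by stripping common suffixes. For example, 'Loved' becomes 'love'. This way different forms of the same word end up in the same column later on. Note that a stem is not always a dictionary word.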
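
A short sketch of the stemmer on a few words. The word list is chosen for illustration; `PorterStemmer` itself needs no extra download.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

words = ["loved", "loving", "places"]  # sample words for illustration
stems = [ps.stem(word) for word in words]

# Print each word next to its stem
for word, stem in zip(words, stems):
    print(word, "->", stem)
```

Both "loved" and "loving" should map to the same stem, which is the point of this step: the model sees one feature instead of two.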

In [2]:
# Step 2: Cleaning the input to a suitable format
# First we'll clean one review as an example:

review = df['Review'][0]
print('Step 1: '+review)


review = re.sub('[^a-zA-Z]', ' ',  review)
print('Step 2:'+review)

review = review.lower()
print('Step 3: '+review)
# This separates each word into an individual string and puts them all in a list:
review = review.split()

print('Step 4: '+str(review))

# This keeps only the words in the above list that are not in the list of stop words, replaces each one with its stem,
# and returns the words as a list:
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
review = [ps.stem(word) for word in review if word not in stop_words]

# Now that we have the required words, we join them together as a string:
review = ' '.join(review)
print('Cleaned review: '+review)


Step 1: Wow... Loved this place.
Step 2:Wow    Loved this place 
Step 3: wow    loved this place 
Step 4: ['wow', 'loved', 'this', 'place']
Cleaned review: wow love place



#### Now we'll do this for all reviews:


In [3]:
corpus = []
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build once, reuse for every review
# There are 1000 reviews in this dataset
for text in df['Review']:
    review = re.sub('[^a-zA-Z]', ' ', text)
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
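
The same steps can also be wrapped in a small helper, which makes it easier to clean a new review later in exactly the same way. This is only a sketch: `clean_review` and `demo_stop_words` are hypothetical names, and the stop-word set is a stand-in so the example runs without the NLTK download.

```python
import re

from nltk.stem.porter import PorterStemmer


def clean_review(text, stop_words, stemmer=None):
    """Strip non-letters, lowercase, drop stop words, and stem (hypothetical helper)."""
    stemmer = stemmer or PorterStemmer()
    words = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return " ".join(stemmer.stem(word) for word in words if word not in stop_words)


# Stand-in stop-word set so the sketch runs without downloading the NLTK list
demo_stop_words = {"this", "the", "is"}

cleaned = clean_review("Wow... Loved this place.", demo_stop_words)
print(cleaned)  # wow love place
```

With the real stop-word set, the whole corpus could then be built with a list comprehension over `df['Review']`.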


### Now the Learning step:

The first step of the learning process is to create the Count Vectorizer matrix.
The CV matrix is a 2D array with one row for each review and one column for each unique word across all reviews.

The value of each column for a review is the number of times that word appears in that review.

Each review uses only a handful of the words, so most columns in every row have the value 0.

We can see what I mean by this here:

In [4]:
# Step 3: Learning
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()

check_shape = np.array(X)
print('Shape of the Count Vectorizer Matrix: ' + str(check_shape.shape))
print('\n\n')
shape_df = pd.DataFrame(check_shape)
shape_df.head()

Shape of the Count Vectorizer Matrix: (1000, 1565)





Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1555,1556,1557,1558,1559,1560,1561,1562,1563,1564
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Understanding the shape:

The array created by the Count Vectorizer has 1000 rows, one for each review,
and 1565 columns, as there are 1565 unique stemmed words across all the reviews once the stop words are removed.

And as you can see, most of the column values in the dataframe are 0.
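
To make the rows and columns concrete, here is a small sketch on three made-up cleaned reviews. The names `toy_corpus`, `toy_cv`, and `toy_X` are only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["wow love place", "crust good", "love love food"]  # made-up cleaned reviews

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each word to its column index; sort the words by that index
columns = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(columns)
print(toy_X)

# Fraction of cells that are zero, even in this tiny example
zero_fraction = (toy_X == 0).mean()
print(zero_fraction)
```

The third row has a 2 in the "love" column because the word occurs twice in that review, and more than half of the cells are already 0 with only six words in the vocabulary.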

The CountVectorizer accepts an argument called 'max_features' that limits the number of unique words added to the matrix.
With max_features = n, only the n most frequent words across the whole corpus get a column.


There are other arguments for the CountVectorizer function that you should explore:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
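
A small sketch of what 'max_features' does, on a made-up corpus where the word frequencies are easy to count by hand. The variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Word frequencies across the corpus: love = 3, food = 2, crust = 1
toy_corpus = ["love love food", "love food", "crust"]

limited_cv = CountVectorizer(max_features=2)
limited_X = limited_cv.fit_transform(toy_corpus).toarray()

kept_words = sorted(limited_cv.vocabulary_)
print(kept_words)       # ['food', 'love']
print(limited_X.shape)  # (3, 2)
```

The least frequent word, "crust", no longer has a column, so the third review becomes a row of zeros. That is the trade-off: a smaller matrix, at the cost of dropping rare words.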


In [5]:
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()

In [6]:
# Creating the output vector:

y = df.iloc[:, 1].values

# Splitting the dataset into training and testing sets
# Setting 80% Training, 20% testing set
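# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split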
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
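
As a quick sanity check of what the split produces, the sketch below runs the same call on placeholder arrays with 1000 rows. `demo_X` and `demo_y` are made-up stand-ins, not the review data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the review dataset
demo_X = np.zeros((1000, 5))
demo_y = np.arange(1000) % 2

X_tr, X_te, y_tr, y_te = train_test_split(demo_X, demo_y, test_size=0.20, random_state=0)

print(X_tr.shape, X_te.shape)  # (800, 5) (200, 5)
```

Fixing `random_state` makes the split reproducible, so the same 200 reviews end up in the test set on every run.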


#### Random Forest Intuition


In [7]:
# Feature Scaling
# from sklearn.preprocessing import StandardScaler
# sc = StandardScaler()
# X_train = sc.fit_transform(X_train)
# X_test = sc.transform(X_test)

# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 50, criterion = 'entropy', max_features = 1000, random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

accuracy_rate = (cm[0][0]+cm[1][1])/(cm[0][1]+cm[1][0]+cm[0][0]+cm[1][1])
print(cm)
print('\n')
print("Accuracy Rate = " + str(accuracy_rate))

[[79 18]
 [39 64]]


Accuracy Rate = 0.715


#### Understanding the confusion matrix

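The confusion matrix is a 2x2 matrix that stores the number of reviews for each combination of actual and predicted label. In scikit-learn, rows are the actual labels and columns are the predicted labels, with the '0' label first:

'''
[

    [<'0' Label predicted as '0'>, <'0' Label predicted as '1'>],
    
    [<'1' Label predicted as '0'>, <'1' Label predicted as '1'>]
                                    
]
'''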


Thus, the accuracy rate is calculated as the ratio of correct predictions (the diagonal of the matrix) to total predictions.
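
The arithmetic for the matrix printed above can be reproduced in a few lines. This sketch hard-codes the four numbers from that output instead of re-running the classifier.

```python
import numpy as np

# The confusion matrix printed above
cm = np.array([[79, 18],
               [39, 64]])

correct = np.trace(cm)  # 79 + 64 = 143 predictions on the diagonal
total = cm.sum()        # 200 reviews in the test set
accuracy = correct / total
print(accuracy)  # 0.715
```

The off-diagonal cells show where the errors are: 18 negative reviews were predicted as positive, and 39 positive reviews were predicted as negative.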

### Conclusion

The accuracy rate we got was a little over 71%, which is not great. But this is where we get to experiment.
We could test out a variety of hyperparameters so that the accuracy improves, but tuning them against the same test set might lead to overfitting.

This is where the field of Machine Learning becomes more of an art than a science. You get to play around with the number of samples to train on, the number of trees in the Random Forest Classifier, the maximum number of features in the CountVectorizer, and the train/test split of the dataset.
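
One common way to run such experiments without tuning against the test set is a cross-validated grid search. The sketch below is not something that was run on the review data: it uses synthetic data from `make_classification` so that it is self-contained, and the parameter grid is an arbitrary assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Arbitrary example grid: number of trees and features considered per split
param_grid = {"n_estimators": [10, 50], "max_features": ["sqrt", 0.5]}

search = GridSearchCV(
    RandomForestClassifier(criterion="entropy", random_state=0),
    param_grid,
    cv=3,  # 3-fold cross-validation within the data passed to fit
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

On the real data, the search would be fitted on `X_train` and `y_train` only, leaving the test set for one final accuracy check.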