# Natural Language Processing

## Dataset

### Layout

* Columns:
    * Review
        * Contains the text each customer submitted as a review
    * Liked
* Rows: hundreds of observations
    * Each row represents one review submitted by a customer and indicates whether the customer liked the restaurant or not
        * 1 = customer liked the restaurant
        * 0 = customer did not like the restaurant
* The dataset file uses tab-separated values (TSV) instead of CSV because review texts can contain commas

### Background

* The scenario: one is a data scientist working for a group of restaurants
* The owners of the restaurant group want to analyze the submitted reviews for customer sentiment to determine whether a customer liked a restaurant or not

### Goals

* Build a Bag of Words model to pre-process the text from restaurant reviews
* Build a classification model to determine if a customer will like a restaurant or not based on sentiment from reviews


## Import Libraries

In [1]:
import numpy as np
import pandas as pd

## Import Dataset

* The `Review` column of the TSV file contains free-form customer review text
* The review texts sometimes contain double quotes
* When reading the file, one must tell the parser to treat double quotes as ordinary characters; otherwise they are interpreted as field-quoting characters, which can lead to processing errors
* Set the `quoting` parameter to `3` (the value of `csv.QUOTE_NONE`) in the Pandas `read_csv` function so that double quotes receive no special handling

In [13]:
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)
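
As a rough illustration of what `quoting=3` means, the sketch below uses Python's standard `csv` module rather than Pandas; `csv.QUOTE_NONE` is the constant with the value `3`. The two sample rows are made up for this example and only mimic the dataset's layout.

```python
import csv
import io

# Hypothetical sample in the same layout as the dataset: Review <TAB> Liked
sample = 'Review\tLiked\n"Great" food, loved it\t1\nThe "special" was cold\t0\n'

# quoting=3 is the numeric value of csv.QUOTE_NONE
assert csv.QUOTE_NONE == 3

# With QUOTE_NONE, double quotes are read as ordinary characters,
# and the commas do not split fields because the delimiter is a tab
reader = csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE)
rows = list(reader)

for row in rows:
    print(row)
```

Each review comes back as a single field with its double quotes and commas intact.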

## Clean Texts

### Import Additional Libraries

* `re` library provides support for regular expressions
* `nltk` is the Natural Language Toolkit library for working with human language data
    * It provides tools for various tasks in NLP:
        * **Tokenization:** Splitting text into words or sentences
        * **Stop word removal:** Filtering out common words such as *the*, *is*, *an*, etc.
        * **Stemming:** Reducing words to their root form by stripping suffixes
        * **Part-of-speech tagging:** Labeling words with their grammatical roles
        * **Lemmatization:** Reducing words to their dictionary base form (lemma)
        * **Parsing:** Analyzing the grammatical structure of sentences

In [3]:
import re
import nltk

### Download Stop Words

In [35]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/cjaehnen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Import Downloaded Stop Words

In [5]:
from nltk.corpus import stopwords

### Import Stemming Class

* `PorterStemmer` class reduces a word to its root form (its stem), which keeps enough to indicate what the word means
    * For example, a review contains the word *loved*
        * Stemming will transform *loved* into *love*
    * The stem is not always a dictionary word; the cleaned output contains stems such as *tasti* and *servic*
    * Goal is to collapse the different conjugations of a word into a single form
* When the Bag of Words model is created:
    * One will create a sparse matrix in which each column corresponds to one distinct word found across all reviews
    * One wants to minimize the dimension of the sparse matrix
    * The dimension is the number of columns
    * Without stemming, each conjugation of a word would get its own column
    * Applying stemming merges those columns, which reduces the dimension of the sparse matrix
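
The effect on the number of columns can be sketched without NLTK. The `toy_stem` function below is a deliberately crude stand-in that strips one common suffix; it is not the Porter algorithm, and the word list is made up for illustration.

```python
def toy_stem(word):
    """Crude stand-in for a real stemmer: strip one common suffix."""
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical words as they might appear across several reviews
words = ['wait', 'waited', 'waiting', 'waits', 'cook', 'cooked', 'cooking', 'cooks']

# Each distinct word would become one column of the matrix
columns_without_stemming = sorted(set(words))
columns_with_stemming = sorted({toy_stem(word) for word in words})

print(len(columns_without_stemming), columns_without_stemming)
print(len(columns_with_stemming), columns_with_stemming)
```

Eight distinct words collapse into two stems, so the matrix would need two columns instead of eight.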

In [6]:
from nltk.stem.porter import PorterStemmer

### Initialize Variables

* `corpus` is a list that will hold the cleaned text of every review
* `stop_words` is a set of English language stop words
* `stop_words_to_remove` is a list of words to take out of the stop word set so that they are kept in the reviews
    * *not* is taken out because it reverses sentiment: *not good* must not be reduced to *good*
* `stop_word_to_remove` is the loop variable for each word in that list
* `remove` method on the `stop_words` set removes the specified word from the set

In [11]:
corpus = []
stop_words = set(stopwords.words('english'))
stop_words_to_remove = ['not']
for stop_word_to_remove in stop_words_to_remove:
    stop_words.remove(stop_word_to_remove)
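
A small sketch of why *not* is taken out of the stop word set. The stop word set here is a hypothetical, much smaller one than the NLTK list, and `discard` is used as an alternative to `remove` that does not raise `KeyError` when the word is absent.

```python
# Hypothetical, much smaller stop word set than the NLTK one
stop_words = {'the', 'was', 'not', 'a', 'is'}
words = 'the crust was not good'.split()

# If 'not' stays in the stop word set, the negation is filtered out
without_not = [word for word in words if word not in stop_words]

# discard() removes an element like remove(), but does not raise
# KeyError when the element is absent
stop_words.discard('not')
with_not = [word for word in words if word not in stop_words]

print(without_not)  # ['crust', 'good']
print(with_not)     # ['crust', 'not', 'good']
```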

### For Loop Iterating Over Reviews

* `i` is the loop variable holding the index of each review
* The upper bound of the range is the number of rows in the dataset
* `review` holds the review text as it is cleaned

### Text Cleaning

#### Step 1: Remove Punctuation

* `re.sub` function call replaces every character that is not a letter with a space
    * The first parameter is the regular expression `[^a-zA-Z]`, in which `^` at the start of the character class negates it, so it matches any character that is not a letter
        * Digits are therefore replaced as well, not only punctuation
    * The second parameter is the replacement string, a single space
    * The third parameter is the review text from the `Review` column in the dataset

#### Step 2: Normalization

* Transform all uppercase letters to lowercase
* `lower` method call performs the conversion

#### Step 3: Splitting

* Split the review text into words
* `split` method call, with no arguments, splits the text on whitespace into a list of words
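
Steps 1 to 3 can be traced on a single string. The review text below is made up for illustration; the regular expression is the one used in this notebook.

```python
import re

# Made-up review text for illustration
text = 'Paid $20 for 2 tacos... NOT worth it!'

step1 = re.sub('[^a-zA-Z]', ' ', text)  # every non-letter becomes a space
step2 = step1.lower()                   # uppercase letters become lowercase
step3 = step2.split()                   # split on runs of whitespace

print(step3)
```

The digits and the dollar sign disappear along with the punctuation, and `split` discards the extra spaces they leave behind.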

#### Step 4: Stemming and Stop Word Removal

* `ps` is an instance of the `PorterStemmer` class used to apply stemming
* A list comprehension iterates through all the words in the `review` list
* `word` is the loop variable for each review word
* To remove stop words, a list comprehension accepts a trailing `if` condition that keeps only the items for which the condition is true
    * `word` is checked against the set of English language stop words
    * If found, `word` is omitted
    * Otherwise, `word` is included
* `stem` method on the `ps` object applies stemming to a word
* After stop words are removed and the remaining words are stemmed, one needs to join the words in the `review` list back into a string
    * To join all the words in the list, separated by a space, the `join` method is called on the space string `' '` and takes the list as a parameter
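
The filtering list comprehension is equivalent to an explicit loop, as the sketch below shows. The stop word set and the `stem` lookup function are hypothetical stand-ins so that the example runs without NLTK.

```python
# Hypothetical stand-ins for the stop word set and the stemmer
stop_words = {'this', 'the', 'was'}
stems = {'loved': 'love', 'places': 'place'}

def stem(word):
    return stems.get(word, word)  # look up a stem, else keep the word

review = ['wow', 'loved', 'this', 'place']

# List comprehension with a trailing filter condition
cleaned = [stem(word) for word in review if word not in stop_words]

# Equivalent explicit loop
cleaned_loop = []
for word in review:
    if word not in stop_words:
        cleaned_loop.append(stem(word))

print(' '.join(cleaned))  # wow love place
```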

#### Step 5: Add Cleaned Text to Corpus

* Final step is to append the cleaned `review` text to the `corpus` list

In [12]:
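ps = PorterStemmer()  # create the stemmer once and reuse it for every review
for i in range(len(dataset)):
    # Step 1: replace every non-letter with a space
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    # Step 4: drop stop words, then stem the words that remain
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)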

In [14]:
print(*corpus[:25], sep='\n')

wow love place
crust not good
not tasti textur nasti
stop late may bank holiday rick steve recommend love
select menu great price
get angri want damn pho
honeslti tast fresh
potato like rubber could tell made ahead time kept warmer
fri great
great touch
servic prompt
would not go back
cashier care ever say still end wayyy overpr
tri cape cod ravoli chicken cranberri mmmm
disgust pretti sure human hair
shock sign indic cash
highli recommend
waitress littl slow servic
place not worth time let alon vega
not like
burritto blah
food amaz
servic also cute
could care less interior beauti
perform


## Create Bag of Words Model

### Tokenization

* Split text into words
* Tokenization will be performed by the `feature_extraction.text` module from scikit-learn
    * `CountVectorizer` class performs tokenization
    * `cv` is an instance of the class
        * `max_features` parameter sets the maximum number of columns in the sparse matrix, which is the maximum number of distinct words to keep
* Words exist in the texts that are not relevant to classifying reviews:
    * *Texture*
    * *Bank*
    * *Holiday*
* One way to drop many such words is to keep only the most frequently used words in the reviews, which is what `max_features` does
* **The sparse matrix is the matrix of features** used to train the upcoming classification model
    * `X` is the variable containing the matrix of features
    * `fit_transform` method on the `cv` object is called with `corpus` as its parameter
        * Fit part of the method collects all the words from all the reviews
        * Transform part of the method counts the words of each review into their columns
    * `toarray` method converts the resulting sparse matrix into a dense NumPy array
* The dependent variable vector `y` is taken from the last column in the dataset
* `number_of_words` is the number of columns in `X`, that is, the number of words kept
    * Retrieve via the `len` function from the first row of the matrix of features `X`

In [31]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, -1].values
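
To make the structure of the matrix concrete, here is a minimal sketch of a Bag of Words built with the standard library only. The `bag_of_words` function is a hypothetical, simplified stand-in for `CountVectorizer`: it does not reproduce scikit-learn's tokenization rules, and its alphabetical tie-breaking among equally frequent words is an assumption of this sketch. The three texts are the first three cleaned reviews from the corpus.

```python
from collections import Counter

corpus = ['wow love place', 'crust not good', 'not tasti textur nasti']

def bag_of_words(texts, max_features=None):
    """Return (vocabulary, count matrix) for whitespace-separated texts."""
    totals = Counter(word for text in texts for word in text.split())
    # Rank words by total frequency; ties are broken alphabetically (assumption)
    ranked = sorted(totals, key=lambda word: (-totals[word], word))
    # Keep the top max_features words; columns are ordered alphabetically
    vocabulary = sorted(ranked[:max_features])
    matrix = [[text.split().count(word) for word in vocabulary] for text in texts]
    return vocabulary, matrix

vocabulary, matrix = bag_of_words(corpus)
print(vocabulary)
for row in matrix:
    print(row)

# Keeping only the single most frequent word leaves one column
top_vocabulary, top_matrix = bag_of_words(corpus, max_features=1)
print(top_vocabulary, top_matrix)
```

Each row is one review and each column is one word, so limiting the number of kept words directly limits the number of columns.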

In [32]:
print(*X[:25], sep='\n')

[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]


In [33]:
print(*y[:25], sep='\n')

1
0
0
1
1
0
0
0
1
1
1
0
0
1
0
0
1
0
0
0
0
1
1
1
1


In [34]:
number_of_words = len(X[0])

In [36]:
print(number_of_words)

1500


* This is the number of words:
    * Resulting from the tokenization
    * Kept from all the reviews
* For each of the reviews:
    * A value in a column is the number of times that column's word occurs in the review
    * Value of `0` corresponds to a word not in the review
* Because `cv` was created with `max_features=1500`, this number can never exceed $1500$
* To choose a value for `max_features`, one can first create `cv` without the parameter, read the total number of words, and then subtract the $N$ words that are not frequent enough to be considered

## Split Dataset into Training Set and Test Set

In [37]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Train Naive Bayes Model on Training Set

In [38]:
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()
classifier.fit(X_train, y_train)

## Predict Test Set Results

In [39]:
y_pred = classifier.predict(X_test)

In [40]:
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))

[[1 0]
 [1 0]
 [1 0]
 [0 0]
 [0 0]
 [1 0]
 [1 1]
 [1 0]
 [1 0]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 1]
 [1 1]
 [1 0]
 [1 0]
 [0 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 0]
 [0 0]
 [1 0]
 [1 1]
 [1 1]
 [1 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [1 0]
 [1 0]
 [0 0]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [1 0]
 [1 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 0]
 [1 1]
 [0 1]
 [0 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [1 0]
 [1 1]
 [1 0]
 [1 1]
 [1 1]
 [1 0]
 [0 1]
 [1 1]
 [1 1]
 [1 0]
 [0 1]
 [1 0]
 [1 1]
 [1 1]
 [0 0]
 [0 1]
 [0 1]
 [1 1]
 [0 0]
 [1 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [1 0]
 [1 1]
 [0 0]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [0 1]
 [1 1]
 [1 1]

## Making the Confusion Matrix

In [41]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

In [42]:
print(cm)

[[55 42]
 [12 91]]


Results:

* $55$ true negatives
* $42$ false positives
* $12$ false negatives
* $91$ true positives

## Compute Accuracy Score

In [43]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

0.73
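
The accuracy score can be checked by hand from the confusion matrix. The sketch below copies the four counts from the printed matrix and assumes scikit-learn's layout, in which rows are actual labels and columns are predicted labels.

```python
# Confusion matrix copied from the printed output:
# rows are actual labels, columns are predicted labels
cm = [[55, 42],
      [12, 91]]

(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

# Accuracy is the share of correct predictions
accuracy = (tn + tp) / total
errors = fp + fn

print(total)     # 200
print(accuracy)  # 0.73
print(errors)    # 54
```

Of the $54$ errors, $42$ are false positives and $12$ are false negatives.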

## Interpreting Results

*

## Takeaways

*