# Natural Language Processing

## Dataset

### Layout

* Columns:
    * Review
        * Contains the text each customer submitted for a review
    * Liked
* Rows: hundreds of observations
    * Each row is one customer review, labeled with whether the customer liked the restaurant
        * 1 = customer liked a restaurant
        * 0 = customer did not like a restaurant
* Dataset file is a tab-separated values (TSV) file instead of CSV since review texts can contain commas

### Background

* The scenario: a data scientist works for a group of restaurants
* The owners of the restaurant group want to analyze customer sentiment in the submitted reviews to determine whether a customer liked a restaurant

### Goals

* Build a Bag of Words model to pre-process the text from restaurant reviews
* Build a classification model that predicts whether a customer liked a restaurant based on the sentiment of their review


## Import Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Import Dataset

* The `Review` column of the TSV file contains free-form customer review text
* The review texts sometimes contain double quotes
* The parser must be told to treat double quotes as ordinary characters; otherwise they can lead to parsing errors
* Set the `quoting` parameter of the Pandas `read_csv` function to $3$ (the value of `csv.QUOTE_NONE`) so that double quotes are not interpreted as field quoting

In [2]:
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)
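
The value $3$ is the integer behind the standard library constant `csv.QUOTE_NONE`. As a minimal sketch of its effect, the following uses the standard library `csv` reader (not Pandas) on a made-up two-row sample; the sample text and variable names are assumptions for illustration only:

```python
import csv
import io

# Hypothetical sample in the same layout as the dataset: Review <tab> Liked
sample = (
    'Review\tLiked\n'
    'The "special" was great, loved it\t1\n'
    '"Fresh" fish that was not fresh\t0\n'
)

# quoting=3 and quoting=csv.QUOTE_NONE are the same setting
print(csv.QUOTE_NONE)

# With QUOTE_NONE, double quotes are kept as ordinary characters,
# and commas inside a review do not split the field because the delimiter is a tab
reader = csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE)
rows = list(reader)
for row in rows:
    print(row)
```

Passing `quoting=csv.QUOTE_NONE` to `read_csv` is equivalent to `quoting=3` and may read more clearly.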

## Clean Texts

### Import Additional Libraries

* `re` library provides support for regular expressions
* `nltk` is the Natural Language Toolkit library for working with human language data
    * It provides tools for various tasks in NLP:
        * **Tokenization:** Splitting text into words or sentences
        * **Stop word removal:** Filtering out common words such as *the*, *is*, *an*, etc.
        * **Stemming:** Reducing words to their root form by stripping suffixes
        * **Part-of-speech tagging:** Labeling words with their grammatical roles
        * **Lemmatization:** Reducing words to their dictionary base form (lemma)
        * **Parsing:** Analyzing the grammatical structure of sentences

In [3]:
import re
import nltk

### Download Stop Words

In [4]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/cjaehnen/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

### Import Downloaded Stop Words

In [5]:
from nltk.corpus import stopwords

### Import Stemming Class

* `PorterStemmer` class reduces a word to its root form, which keeps enough of the word to convey its meaning
    * For example, a review contains the word *loved*
        * Stemming transforms *loved* into its root form *love*
    * Goal is to strip conjugations so that all variants of a word map to a single root
* When the Bag of Words model is created:
    * One will create a sparse matrix in which each column corresponds to one distinct word found across all reviews
    * The dimension of the sparse matrix is its number of columns, which one wants to minimize
    * Without stemming, each conjugation of a word would get its own column
    * With stemming, the conjugations share one column, which reduces the dimension of the sparse matrix

In [6]:
from nltk.stem.porter import PorterStemmer
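
As a small illustration of why stemming shrinks the vocabulary, the sketch below stems a few variants of one word. The word list and variable names are made up for this example:

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Hypothetical variants of the same word, as they might appear across reviews
variants = ['love', 'loved', 'loves', 'loving']

# Map each variant to its stem
stems = [ps.stem(word) for word in variants]
print(stems)

# Number of distinct columns needed without and with stemming
print(len(set(variants)), len(set(stems)))
```

`PorterStemmer` is rule-based and needs no downloaded corpus, unlike the stop word list.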

### Initialize Variables

* `corpus` is a list of all cleaned texts from all reviews

In [7]:
corpus = []

### For Loop Iterating Over Reviews

* `i` is the index of the current review
* The upper bound of the range is the number of rows in the dataset
* `review` holds the cleaned text of the current review

### Text Cleaning

#### Step 1: Remove All Punctuation

* `re.sub` function call replaces every character that is not a letter (punctuation, digits, and symbols) with a space
    * The first parameter is a regular expression; inside the brackets, the negation operator `^` makes `[^a-zA-Z]` match any character that is not a letter
    * The second parameter is the replacement string, a single space
    * The third parameter is the review text from the review column in the dataset

#### Step 2: Normalization

* Transform all uppercase letters to lowercase so that words such as *Good* and *good* are treated as the same word
* `lower` function call returns the review text with all uppercase letters converted to lowercase

#### Step 3: Tokenization

* Split each review text into its individual words
* `split` function call splits the review text on whitespace and returns a list of words

In [12]:
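for i in range(len(dataset)):
    # Step 1: replace every non-letter character with a space
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    # Step 2: convert the text to lowercase
    review = review.lower()
    # Step 3: split the text into a list of words
    review = review.split()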
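
The three cleaning steps can be tried on a single string without the dataset. The helper name `clean_review` and the sample sentence are assumptions made for this sketch, not part of the notebook:

```python
import re

def clean_review(text):
    """Apply the three cleaning steps to one review string."""
    # Step 1: replace every character that is not a letter with a space
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    # Step 2: convert to lowercase
    lowered = letters_only.lower()
    # Step 3: split on whitespace into a list of words
    return lowered.split()

# Made-up review text for illustration
tokens = clean_review('The pasta was GREAT, 10/10!')
print(tokens)
```

Note that the digits in `10/10` are dropped along with the punctuation, since the pattern keeps letters only.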

## Create Bag of Words Model
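
A minimal sketch of what a Bag of Words matrix looks like, assuming scikit-learn's `CountVectorizer` as the vectorizer (the notebook does not say which tool it uses) and a made-up, already-cleaned corpus of three reviews. The names `toy_corpus` and `X_toy` are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned reviews; the real corpus would come from the cleaning loop
toy_corpus = ['wow love place', 'crust not good', 'love food']

# Each distinct word becomes one column; each row counts the words in one review
vectorizer = CountVectorizer()
X_toy = vectorizer.fit_transform(toy_corpus).toarray()

# Columns follow the sorted vocabulary
print(sorted(vectorizer.vocabulary_))
print(X_toy.shape)
print(X_toy)
```

Calling `toarray()` converts the sparse matrix to a dense array, which is the input format `GaussianNB` works with.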

## Split Dataset into Training Set and Test Set

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Train Naive Bayes Model on Training Set

In [None]:
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()
classifier.fit(X_train, y_train)

## Predict Test Set Results

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))

## Make Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

In [None]:
print(cm)

Results:

* $0$ true negatives
* $0$ false positives
* $0$ false negatives
* $0$ true positives
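
To show how the four counts map onto the matrix that scikit-learn returns, here is a sketch on made-up labels. The toy labels and variable names are assumptions and are not results from the restaurant dataset:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration; these are not results from the dataset
y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 0]

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

# For binary labels 0/1, rows are actual classes and columns are predicted classes,
# so flattening the matrix gives TN, FP, FN, TP in that order
tn, fp, fn, tp = cm_toy.ravel()
print(tn, fp, fn, tp)

# Accuracy is the share of correct predictions: (TN + TP) / total
manual_accuracy = (tn + tp) / cm_toy.sum()
print(manual_accuracy, accuracy_score(y_true_toy, y_pred_toy))
```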

## Compute Accuracy Score

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

## Interpreting Results

*

## Takeaways

*