# Natural Language Processing

## Dataset

### Layout

* Columns:
    * Review
        * Contains the text each customer submitted as a review
    * Liked
* Rows: hundreds of observations
    * Each row represents a review submitted by a customer and indicates whether the customer liked the restaurant
        * 1 = customer liked the restaurant
        * 0 = customer did not like the restaurant
* The dataset file is a tab-separated values (TSV) file rather than a CSV file, since review texts can contain commas

### Background

* The scenario: a data scientist works for a group of restaurants
* The owners of the restaurant group want to analyze the sentiment of customer-submitted reviews to determine whether each customer liked a restaurant

### Goals

* Build a Bag of Words model to pre-process the text from restaurant reviews
* Build a classification model to determine if a customer will like a restaurant or not based on sentiment from reviews


## Import Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Import Dataset

* The Review column of the TSV file contains free-form text from customer restaurant reviews
* The review texts sometimes contain double quotes
* The parser must be told to treat double quotes as ordinary characters; otherwise they can lead to processing errors
* Set the `quoting` parameter to `3` (the value of `csv.QUOTE_NONE`) in the Pandas `read_csv` function so that double quotes receive no special handling

In [2]:
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)
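
The effect of the `quoting` setting can be checked on a small in-memory sample. This is a minimal sketch: the two review lines and the `sample_tsv` and `sample` names are hypothetical stand-ins, not rows from `Restaurant_Reviews.tsv`.

```python
import csv
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real TSV file.
sample_tsv = (
    "Review\tLiked\n"
    '"Best" tacos in town\t1\n'
    'The server said "wait" and never came back\t0\n'
)

# csv.QUOTE_NONE is the named constant behind the literal 3.
sample = pd.read_csv(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)

print(sample.shape)
print(sample["Review"][0])  # the double quotes are kept as literal characters
```

Using the named constant instead of the bare `3` is a readability choice; both select the same behavior.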

## Clean Texts
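
A minimal sketch of one common cleaning approach, assuming the goal is to keep letters only, lowercase the text, and drop frequent filler words. The `STOPWORDS` set, the `clean_review` helper, and the two sample reviews are hypothetical; a fuller pipeline would typically rely on a library stop-word list and stemming.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
# "not" is left out on purpose, since negation carries sentiment.
STOPWORDS = {"a", "an", "and", "the", "is", "was", "this", "it", "to", "of"}


def clean_review(text):
    """Keep letters only, lowercase, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return " ".join(word for word in words if word not in STOPWORDS)


# Hypothetical sample reviews, not taken from the dataset.
sample_reviews = ["Wow... Loved this place!", "The crust is not good."]
corpus = [clean_review(review) for review in sample_reviews]
print(corpus)
```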

## Create Bag of Words Model
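
A minimal sketch of a Bag of Words step using scikit-learn's `CountVectorizer`, assuming the reviews have already been cleaned into plain lowercase strings. The `sample_corpus`, `sample_labels`, and `X_sample` names and values are hypothetical; in the notebook, the feature matrix and labels used later would presumably come from the full cleaned corpus and the Liked column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned reviews and labels standing in for the real dataset.
sample_corpus = ["wow loved place", "crust not good", "loved food good service"]
sample_labels = [1, 0, 1]

vectorizer = CountVectorizer()
# fit_transform returns a sparse matrix; toarray() makes it dense,
# which GaussianNB needs because it does not accept sparse input.
X_sample = vectorizer.fit_transform(sample_corpus).toarray()

# Columns are ordered alphabetically by vocabulary term.
print(sorted(vectorizer.vocabulary_))
print(X_sample)
```

Each row is one review and each column counts how often one vocabulary word appears in it.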

## Split Dataset into Training Set and Test Set

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Train Naive Bayes Model on Training Set

In [None]:
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()
classifier.fit(X_train, y_train)

## Predict Test Set Results

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
# Show predictions (left column) next to the actual labels (right column)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), axis=1))

## Create Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

In [None]:
print(cm)

Results:

* $0$ true negatives
* $0$ false positives
* $0$ false negatives
* $0$ true positives
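
The four counts can be read straight off the matrix. The sketch below uses hypothetical labels (`y_true_demo`, `y_pred_demo`), not results from this dataset, to show the order in which scikit-learn lays them out for binary labels 0 and 1.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration; not results from this dataset.
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# Rows are actual classes and columns are predicted classes,
# so flattening the 2x2 matrix yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)
print(tn, fp, fn, tp)
```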

## Compute Accuracy Score

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)
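
Accuracy is the share of correct predictions, which equals the sum of the confusion matrix diagonal divided by the total count. A minimal sketch of that relationship, using hypothetical labels and not this dataset's results:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration; not results from this dataset.
y_true_demo = np.array([0, 0, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 1, 1, 0, 0])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# Diagonal entries are the correct predictions (true negatives and true positives).
manual_accuracy = np.trace(cm_demo) / cm_demo.sum()
print(manual_accuracy, accuracy_score(y_true_demo, y_pred_demo))
```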

## Interpreting Results

*

## Takeaways

*