# Mod 4 Code Challenge: Product Reviews

This assessment is designed to test your understanding of these areas:

1. Data Engineering
    - Understanding an existing ETL pipeline
    - Feature scaling
2. Deep Learning with Neural Networks
    - Creating a TensorFlow neural network model
    - Fitting the model on training data
    - Hyperparameter tuning
    - Model evaluation on test data
3. Business Understanding and Technical Communication
    - Advising a business on what kind of model architecture to use

**Unlike previous challenges, we have provided you some pre-existing code.**  Your work, markdown and code, should build off of the pre-existing material. 

Make sure that your code is clean and readable, and that each step of your process is documented. For this challenge each step builds upon the step before it. If you are having issues finishing one of the steps completely, move on to the next step to attempt every section.  There will be occasional hints to help you move on to the next step if you get stuck, but attempt to follow the requirements whenever possible. 

### Business Understanding

Northwind Trading Company allows customers to leave reviews, but those reviews do not have customer-facing "star ratings".  Instead, customers are free to write text, and other customers can vote on whether the review was helpful.  They find that this is a good trade-off between helping customers make informed decisions about products, and avoiding having any products go unsold because of poor ratings.

Internally, Northwind is interested to know which of these reviews are positive, and which are negative.  **A previous employee of the company has already built a Random Forest Classifier model to perform this classification task.**

Northwind management has heard great things about using Artificial Intelligence for this kind of task, especially neural networks built with libraries like TensorFlow.  **You have been instructed to build a TensorFlow model and advise the company on whether they should switch from the Random Forest Classifier to the TensorFlow model.**

In either case, you want a **classification model** that optimizes for **accuracy**.

### Data Understanding

The data has already been described, imported, and preprocessed in this notebook.

**Below is the work of a previous employee. Take a brief moment to review their work and then complete the tasks at the bottom of the notebook.**

# Product Review Classification

## Business Understanding
Our company wants a tool that will automatically classify product reviews as _positive_ or _negative_ reviews, based on the features of the review.  This will help our Product team to perform more sophisticated analyses in the future to help ensure customer satisfaction.

## Data Understanding
We have a labeled collection of 20,000 product reviews, with an equal split of positive and negative reviews. The dataset contains the following features:

 - `ProductId` Unique identifier for the product
 - `UserId` Unique identifier for the user
 - `ProfileName` Profile name of the user
 - `HelpfulnessNumerator` Number of users who found the review helpful
 - `HelpfulnessDenominator` Number of users who indicated whether they found the review helpful or not
 - `Time` Timestamp for the review
 - `Summary` Brief summary of the review
 - `Text` Text of the review
 - `PositiveReview` 1 if this was labeled as a positive review, 0 if it was labeled as a negative review

In [77]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score


In [30]:
df = pd.read_csv("reviews.csv")
df.head(3)

Unnamed: 0,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Time,Summary,Text,PositiveReview
0,B002QWHJOU,A37565LZHTG1VH,C. Maltese,1,1,1305331200,Awesome!,This is a great product. My 2 year old Golden ...,1
1,B000ESLJ6C,AMUAWXDJHE4D2,angieseashore,1,1,1320710400,Was there a recipe change?,I have been drinking Pero ever since I was a l...,0
2,B004IJJQK4,AMHHNAFJ9L958,A M,0,1,1321747200,These taste so bland.,"Look, each pack contains two servings of 120 c...",0


The data has already been cleaned, so there are no missing values

In [31]:
df.isna().sum()

ProductId                 0
UserId                    0
ProfileName               0
HelpfulnessNumerator      0
HelpfulnessDenominator    0
Time                      0
Summary                   0
Text                      0
PositiveReview            0
dtype: int64

`PositiveReview` is the target, and all other columns are features

In [32]:
X = df.drop("PositiveReview", axis=1)
y = df["PositiveReview"]

## Data Preparation

First, split into train and test sets

In [107]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train.shape

(15000, 8)

Second, prepare for modeling. The following `Pipeline` prepares all data for modeling.  It one-hot encodes the `ProductId`, applies a tf-idf vectorizer to the `Summary` and `Text`, keeps the numeric columns as-is, and drops all other columns.

The following code may take up to 1 minute to run.

In [34]:
def drop_irrelevant_columns(X):
    return X.drop(["UserId", "ProfileName"], axis=1)

pipeline = Pipeline(steps=[
    ("drop_columns", FunctionTransformer(drop_irrelevant_columns)),
    ("transform_text_columns", ColumnTransformer(transformers=[
        ("ohe", OneHotEncoder(categories="auto", handle_unknown="ignore", sparse=False), ["ProductId"]),
        ("summary-tf-idf", TfidfVectorizer(max_features=1000), "Summary"),
        ("text-tf-idf", TfidfVectorizer(max_features=1000), "Text")
    ], remainder="passthrough"))
])

X_train_transformed = pipeline.fit_transform(X_train)
X_test_transformed = pipeline.transform(X_test)

X_train_transformed.shape

(15000, 11275)

In [45]:
# Scale the features; otherwise the neural network performs horribly and it is not obvious why.
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train_final = ss.fit_transform(X_train_transformed)
X_test_final = ss.transform(X_test_transformed)


## Modeling

Fit a `RandomForestClassifier` with the best hyperparameters.  The following code may take up to 1 minute to run.

In [39]:
rfc = RandomForestClassifier(
    random_state=42,
    n_estimators=100,
    max_depth=30,
    min_samples_split=15,
    min_samples_leaf=1
)
rfc.fit(X_train_transformed, y_train)

RandomForestClassifier(max_depth=30, min_samples_split=15, random_state=42)

## Model Evaluation

We are using _accuracy_ as our metric. It is the default metric for Scikit-Learn classifiers, so we can just use the built-in `.score` method.

In [40]:
print("Train accuracy:", rfc.score(X_train_transformed, y_train))
print("Test accuracy:", rfc.score(X_test_transformed, y_test))

Train accuracy: 0.9846666666666667
Test accuracy: 0.9116


In [41]:
print("Train confusion matrix:")
print(confusion_matrix(y_train, rfc.predict(X_train_transformed)))
print("Test confusion matrix:")
print(confusion_matrix(y_test, rfc.predict(X_test_transformed)))

Train confusion matrix:
[[7323  166]
 [  64 7447]]
Test confusion matrix:
[[2286  225]
 [ 217 2272]]


## Business Interpretation

The tuned Random Forest Classifier model appears to be somewhat overfit on the training data, but nevertheless achieves 91% accuracy on the test data.  Of the 9% of reviews that are misclassified, about half are false positives and half are false negatives.

Because this is a balanced dataset, 91% accuracy is a substantial improvement over a 50% baseline.  This model is ready for production use for decision support.
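
The figures quoted above can be re-derived from the test confusion matrix. The following is a minimal sketch in plain Python. The counts are copied from the output above, the variable names are illustrative, and it assumes Scikit-Learn's layout, where rows are true labels and columns are predicted labels.

```python
# Test confusion matrix copied from the notebook output above.
# Scikit-Learn layout: rows = true label, columns = predicted label.
test_cm = [[2286, 225],
           [217, 2272]]

tn, fp = test_cm[0]  # true negatives, false positives
fn, tp = test_cm[1]  # false negatives, true positives
total = tn + fp + fn + tp

accuracy = (tn + tp) / total      # fraction of reviews classified correctly
error_rate = (fp + fn) / total    # fraction of reviews misclassified
fp_share = fp / (fp + fn)         # share of the errors that are false positives

print(f"Accuracy: {accuracy:.4f}")
print(f"Error rate: {error_rate:.4f}")
print(f"False positives as a share of errors: {fp_share:.3f}")
```

The accuracy works out to 4558 / 5000 = 0.9116, which matches the `.score` output, and the errors split almost evenly between the two types.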

# Neural Network

In [37]:
# Imports for keras
from keras.models import Sequential 
from keras.layers import Dense
from keras import optimizers

In [46]:
# first model 
model = Sequential() 
model.add(Dense(10, activation='relu', input_shape=(X_train_final.shape[1],)))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer = 'SGD',
              loss = 'binary_crossentropy',
              metrics = ['acc'])

baseline_model = model.fit(X_train_final, 
                    y_train, 
                    epochs=10, 
                    batch_size=30)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [51]:
from keras.layers import Dropout
from keras import regularizers

In [52]:
# Add dropout layers and L2 regularization so the model is less likely to overfit
model3 = Sequential()

model3.add(Dense(10, activation='relu', input_shape=(X_train_final.shape[1],)))
model3.add(Dense(4, activation = 'tanh'))
model3.add(Dropout(0.3)) # dropout layer
model3.add(Dense(4, activation = 'relu'))
model3.add(Dense(10, kernel_regularizer=regularizers.l2(0.005), activation='relu')) #l2 layer
model3.add(Dense(10, activation = 'relu'))
model3.add(Dropout(0.3)) # dropout layer
model3.add(Dense(40, activation = 'relu'))
model3.add(Dense(1, activation ='sigmoid'))

model3.compile(optimizer = 'adam',
              loss = 'binary_crossentropy',
              metrics = ['acc'])

third_model =  model3.fit(X_train_final, 
                                y_train, 
                                epochs=15, 
                                batch_size=500)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


In [None]:
# Go with model 3, since it is probably less overfit.
# Let's run it on the test set.

In [78]:
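# `predict_classes` was removed in TensorFlow 2.6; thresholding the sigmoid output at 0.5 gives the same labels
y_hat_test = (model3.predict(X_test_final) > 0.5).astype("int32").ravel()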
confusion_matrix(y_test, y_hat_test)


array([[2239,  272],
       [ 210, 2279]])

In [89]:
# Training accuracy from the final epoch (averaged over batches, with dropout active)
Accuracy = third_model.history['acc'][-1]

In [92]:
# NN
print(f'Train accuracy: {Accuracy}')
print(f'Test accuracy: {accuracy_score(y_test, y_hat_test)}')

Train accuracy: 0.9744666814804077
Test accuracy: 0.9036


In [93]:
#RandomForest
print("Train accuracy:", rfc.score(X_train_transformed, y_train))
print("Test accuracy:", rfc.score(X_test_transformed, y_test))

Train accuracy: 0.9846666666666667
Test accuracy: 0.9116


## I'd stick with the Random Forest model

I just followed the procedure the previous employee used, but you could also break out a validation set, or use cross-validation, to make sure you aren't overfitting. The dataset was pretty big, so I didn't bother. The first model was obviously overfit, since it was reaching almost 100% training accuracy. I added dropout layers and L2 regularization to reduce overfitting. This gave a lower training accuracy, but the model should be more likely to perform well on a holdout set.
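
As a rough illustration of the validation-set and cross-validation ideas mentioned above, here is a minimal sketch using Scikit-Learn. Everything in it is an assumption for demonstration purposes: the data is synthetic (`make_classification`), the classifier is a small stand-in rather than the models from this notebook, and the split sizes are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the transformed review features (assumption).
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out the test set first, then carve a validation set out of the training data.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
X_fit, X_val, y_fit, y_val = train_test_split(X_tr, y_tr, test_size=0.2, random_state=42)

# Fit on the reduced training set and compare against the validation set.
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_fit, y_fit)
fit_acc = clf.score(X_fit, y_fit)
val_acc = clf.score(X_val, y_val)
print(f"Fit accuracy: {fit_acc:.3f}, validation accuracy: {val_acc:.3f}")

# Alternative: 5-fold cross-validation on the full training set.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42), X_tr, y_tr, cv=5
)
print(f"Cross-validation accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

A large gap between the fit accuracy and the validation accuracy is the usual sign of overfitting, and it can be checked without touching the test set. With Keras, the same idea is available through the `validation_split` or `validation_data` arguments of `fit`.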

I'd keep the Random Forest with its 91% accuracy even if a NN beat it by a little bit. It offers a lot more interpretability, and you can act on the features that will help drive the number of negative reviews down. Especially for a company that probably wants to understand what's going on, it would make sense to use a Random Forest over a NN regardless of results.