# Product Review Classification

## Business Understanding
Our company wants a tool that will automatically classify product reviews as _positive_ or _negative_ reviews, based on the features of the review.  This will help our Product team to perform more sophisticated analyses in the future to help ensure customer satisfaction.

## Data Understanding
We have a labeled collection of 20,000 product reviews, with an equal split of positive and negative reviews. The dataset contains the following features:

 - `ProductId` Unique identifier for the product
 - `UserId` Unique identifier for the user
 - `ProfileName` Profile name of the user
 - `HelpfulnessNumerator` Number of users who found the review helpful
 - `HelpfulnessDenominator` Number of users who indicated whether they found the review helpful or not
 - `Time` Timestamp for the review
 - `Summary` Brief summary of the review
 - `Text` Text of the review
 - `PositiveReview` 1 if this was labeled as a positive review, 0 if it was labeled as a negative review

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

In [2]:
df = pd.read_csv("reviews.csv")
df.head(3)

Unnamed: 0,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Time,Summary,Text,PositiveReview
0,B002QWHJOU,A37565LZHTG1VH,C. Maltese,1,1,1305331200,Awesome!,This is a great product. My 2 year old Golden ...,1
1,B000ESLJ6C,AMUAWXDJHE4D2,angieseashore,1,1,1320710400,Was there a recipe change?,I have been drinking Pero ever since I was a l...,0
2,B004IJJQK4,AMHHNAFJ9L958,A M,0,1,1321747200,These taste so bland.,"Look, each pack contains two servings of 120 c...",0


The data has already been cleaned, so there are no missing values

In [3]:
df.isna().sum()

ProductId                 0
UserId                    0
ProfileName               0
HelpfulnessNumerator      0
HelpfulnessDenominator    0
Time                      0
Summary                   0
Text                      0
PositiveReview            0
dtype: int64

`PositiveReview` is the target, and all other columns are features

In [4]:
X = df.drop("PositiveReview", axis=1)
y = df["PositiveReview"]

## Data Preparation

First, split into train and test sets

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train.shape

(15000, 8)

Second, prepare for modeling. The following `Pipeline` prepares all data for modeling.  It one-hot encodes the `ProductId`, applies a tf-idf vectorizer to the `Summary` and `Text`, keeps the numeric columns as-is, and drops all other columns.

The following code may take up to 1 minute to run.

In [6]:
def drop_irrelevant_columns(X):
    return X.drop(["UserId", "ProfileName"], axis=1)

pipeline = Pipeline(steps=[
    ("drop_columns", FunctionTransformer(drop_irrelevant_columns)),
    ("transform_text_columns", ColumnTransformer(transformers=[
        ("ohe", OneHotEncoder(categories="auto", handle_unknown="ignore", sparse=False), ["ProductId"]),
        ("summary-tf-idf", TfidfVectorizer(max_features=1000), "Summary"),
        ("text-tf-idf", TfidfVectorizer(max_features=1000), "Text")
    ], remainder="passthrough"))
])

X_train_transformed = pipeline.fit_transform(X_train)
X_test_transformed = pipeline.transform(X_test)

X_train_transformed.shape

(15000, 11275)

## Modeling

Fit a `RandomForestClassifier` with the best hyperparameters.  The following code may take up to 1 minute to run.

In [7]:
rfc = RandomForestClassifier(
    random_state=42,
    n_estimators=100,
    max_depth=30,
    min_samples_split=15,
    min_samples_leaf=1
)
rfc.fit(X_train_transformed, y_train)

RandomForestClassifier(max_depth=30, min_samples_split=15, random_state=42)

## Model Evaluation

We are using _accuracy_ as our metric. It is the default metric for Scikit-Learn classifiers, so we can just use the built-in `.score` method.

In [8]:
print("Train accuracy:", rfc.score(X_train_transformed, y_train))
print("Test accuracy:", rfc.score(X_test_transformed, y_test))

Train accuracy: 0.9846666666666667
Test accuracy: 0.9116


In [9]:
print("Train confusion matrix:")
print(confusion_matrix(y_train, rfc.predict(X_train_transformed)))
print("Test confusion matrix:")
print(confusion_matrix(y_test, rfc.predict(X_test_transformed)))

Train confusion matrix:
[[7323  166]
 [  64 7447]]
Test confusion matrix:
[[2286  225]
 [ 217 2272]]
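
As a cross-check, the accuracy scores can be recovered from the confusion matrices printed above. This is a minimal sketch in plain Python; the helper name `accuracy_from_confusion_matrix` is made up for illustration and is not part of Scikit-Learn.

```python
# Confusion matrices copied from the output above.
# Rows are true labels (0, 1); columns are predicted labels (0, 1).
train_cm = [[7323, 166], [64, 7447]]
test_cm = [[2286, 225], [217, 2272]]


def accuracy_from_confusion_matrix(cm):
    """Fraction of predictions on the main diagonal (the correct ones)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


train_accuracy = accuracy_from_confusion_matrix(train_cm)
test_accuracy = accuracy_from_confusion_matrix(test_cm)
print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
```

The diagonal entries are the correctly classified reviews, so dividing their sum by the total number of reviews gives the same values that `.score` reported.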


### 1) Data Preparation

A train-test split has already been performed.

Additionally, there is already a pipeline in place that drops some columns and converts all text columns into a numeric format for modeling.

**Your only additional data preparation task is feature scaling.**  Tree-based models like Random Forest Classifiers do not require scaling, but TensorFlow neural networks do.

There are two main strategies you can take for this task:

#### Scaling within the existing pipeline

If you are comfortable with pipelines, this is the more polished/professional route.

1. Make a new pipeline, with a `StandardScaler` as the final step.  You can nest the steps of the previous pipeline inside of this new pipeline
2. Generate a new `X_train_transformed_scaled` by calling `.fit_transform` on the new pipeline
3. Generate a new `X_test_transformed_scaled` by calling `.transform` on the new pipeline

#### Scaling after the pipeline has finished

This is a better strategy if you are not as comfortable with pipelines.

1. Instantiate a `StandardScaler` object
2. Generate a new `X_train_transformed_scaled` by calling `.fit_transform` on the scaler object, after you have called `.fit_transform` on the pipeline
3. Generate a new `X_test_transformed_scaled` by calling `.transform` on the scaler object, after you have called `.transform` on the pipeline

If you are getting stuck at this step, skip it.  The model will still be able to fit, although the performance will be worse.  Keep in mind whether or not you scaled the data in your final analysis.
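
Both strategies rely on the same idea: the scaling statistics are learned from the training data only and then reused on the test data. The sketch below illustrates that idea in plain Python with made-up toy data. The helper names are hypothetical, and it only mirrors what `StandardScaler` computes by default (per-column mean and population standard deviation, with constant columns left undivided); it is not the library's implementation.

```python
import math


def fit_standard_scaler(rows):
    """Learn the per-column mean and population standard deviation."""
    n_rows, n_cols = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n_rows for j in range(n_cols)]
    stds = []
    for j in range(n_cols):
        variance = sum((row[j] - means[j]) ** 2 for row in rows) / n_rows
        # A constant column has zero variance; use 1.0 to avoid dividing by zero.
        stds.append(math.sqrt(variance) or 1.0)
    return means, stds


def apply_standard_scaler(rows, means, stds):
    """Scale rows using statistics that were learned elsewhere."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Toy data (an assumption for illustration): the second column is constant.
toy_train = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
toy_test = [[3.0, 12.0]]

means, stds = fit_standard_scaler(toy_train)                    # like .fit on train
toy_train_scaled = apply_standard_scaler(toy_train, means, stds)
toy_test_scaled = apply_standard_scaler(toy_test, means, stds)  # like .transform on test
print(toy_train_scaled)
print(toy_test_scaled)
```

Fitting on the test set as well would leak information about the holdout data into preprocessing, which is why only `.transform` is called on it.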

In [10]:
from sklearn.preprocessing import StandardScaler

In [11]:
X_train_transformed = pipeline.fit_transform(X_train)
X_test_transformed = pipeline.transform(X_test)

ss=StandardScaler()

X_train_transformed_scaled= ss.fit_transform(X_train_transformed)
X_test_transformed_scaled= ss.transform(X_test_transformed)

In [12]:
X_train_transformed_scaled.shape

(15000, 11275)

In [13]:
y_train # our outputs for target are already binarized. no need for further transformations here.

5514     1
1266     0
5864     1
15865    1
12892    1
        ..
11284    1
11964    0
5390     1
860      1
15795    0
Name: PositiveReview, Length: 15000, dtype: int64

### 2) Modeling

Build a neural network classifier.  Specifically, use the `keras` submodule of the `tensorflow` library to build a multi-layer perceptron model with the `Sequential` interface.

See the [`tf.keras` documentation](https://www.tensorflow.org/guide/keras/overview) for an overview on the use of `Sequential` models. See the [Keras layers documentation](https://keras.io/layers/core/) for descriptions of the `Dense` layer options.  

1. Instantiate a `Sequential` model
2. Add an input `Dense` layer.  You'll need to specify an `input_shape` of `(11275,)` because this is the number of features in the transformed dataset.
3. Add 2 `Dense` hidden layers.  They can have any number of units, but keep in mind that more units will require more processing power.  We recommend an initial `units` of 64 for processing power reasons.
4. Add a final `Dense` output layer.  This layer must have exactly 1 unit because we are doing a binary prediction task.
5. Compile the `Sequential` model
6. Fit the `Sequential` model on the preprocessed training data (`X_train_transformed_scaled`) with a `batch_size` of 50 and `epochs` of 5 for processing power reasons.


In [14]:
import tensorflow as tf
import tensorflow.keras 

import warnings
warnings.filterwarnings('ignore')

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [15]:
model_1= Sequential() #instantiating the model

In [16]:
model_1.add(Dense(units=110, activation='relu', input_shape= (X_train_transformed_scaled.shape[1],))) #input layer
model_1.add(Dense(units=64)) #hidden layer
model_1.add(Dense(units=64)) #hidden layer
model_1.add(Dense(units=1, activation='sigmoid',)) # output layer

In [17]:
# Compile the model, setting binary cross-entropy as the loss function and accuracy as the metric.
model_1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [18]:
model_1.fit(X_train_transformed_scaled, y_train, batch_size= 50, epochs= 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1acb2b7fe50>

In [19]:
model_1.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 110)               1240360   
_________________________________________________________________
dense_1 (Dense)              (None, 64)                7104      
_________________________________________________________________
dense_2 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 65        
Total params: 1,251,689
Trainable params: 1,251,689
Non-trainable params: 0
_________________________________________________________________
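
The `Param #` column in the summary above can be reproduced by hand: a `Dense` layer holds one weight per input-unit pair plus one bias per unit. The following sketch checks that arithmetic in plain Python; the helper `dense_param_count` is a hypothetical name, not a Keras function.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Input width followed by the units of each Dense layer in model_1.
layer_sizes = [11275, 110, 64, 64, 1]

per_layer = [
    dense_param_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(per_layer)
print(per_layer, total_params)
```

Almost all of the parameters sit in the first layer, because it connects every one of the 11,275 input features to each of its 110 units.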


In [20]:
import matplotlib.pyplot as plt
%matplotlib inline

%load_ext autoreload
%autoreload 2

In [27]:
y_hat= model_1.predict_classes(X_test_transformed_scaled)

Instructions for updating:
Please use instead:
* `np.argmax(model.predict(x), axis=-1)`, if your model does multi-class classification (e.g. if it uses a `softmax` last-layer activation).
* `(model.predict(x) > 0.5).astype("int32")`, if your model does binary classification (e.g. if it uses a `sigmoid` last-layer activation).
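
The binary-classification replacement suggested in the notice above is a threshold on the sigmoid output. Here is a minimal sketch of that rule in plain Python, using made-up probabilities in place of real `model_1.predict(...)` output:

```python
# Hypothetical sigmoid outputs, one row per review, shaped like Keras predictions.
predicted_probabilities = [[0.91], [0.08], [0.50], [0.73]]
threshold = 0.5

# Strictly greater than the threshold counts as the positive class,
# so a probability of exactly 0.5 is labeled 0.
predicted_classes = [
    [int(p > threshold) for p in row] for row in predicted_probabilities
]
print(predicted_classes)
```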


In [28]:
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.metrics import plot_confusion_matrix

In [29]:
confusion_matrix(y_test, y_hat)

array([[2275,  236],
       [ 238, 2251]], dtype=int64)

In [74]:
acc_model1= accuracy_score(y_test,y_hat)
acc_model1

0.9052

In [None]:
# Performs about as well as the random forest (0.9052 vs. 0.9116 test accuracy)

### 3) Model Tuning + Feature Engineering

If you are running out of time, skip this step.

Tune the neural network model to improve performance.  This could include steps such as increasing the units, changing the activation functions, or adding regularization.

We recommend using a `validation_split` of 0.1 to understand model performance without utilizing the test holdout set.

You can also return to the preprocessing phase, and add additional features to the model.
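
For reference, Keras takes the validation portion from the last samples of the data passed to `fit`, before any shuffling. The sketch below imitates that behavior on a list of row positions to show how many of the 15,000 training rows a `validation_split` of 0.1 would set aside. The helper `split_for_validation` is a hypothetical stand-in, not Keras code.

```python
import math


def split_for_validation(samples, validation_split):
    """Hold out the trailing fraction of samples, without shuffling."""
    split_at = int(math.floor(len(samples) * (1.0 - validation_split)))
    return samples[:split_at], samples[split_at:]


# Row positions standing in for the 15,000 training rows.
row_positions = list(range(15000))
train_part, val_part = split_for_validation(row_positions, 0.1)
print(len(train_part), len(val_part))
```

Because the held-out rows come from the end of the array, this only gives a fair estimate when the rows are not sorted in some meaningful order; here `train_test_split` has already shuffled them.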

In [22]:
from tensorflow.keras import regularizers
#from tensorflow.keras.layers import Dropout

In [31]:
# Concerned that the model may be overfitting. The initial plan was a dropout layer with a
# dropout rate of 0.3; this model uses L2 kernel regularization instead.

# TensorFlow did not let me play with parameters, so I'm going to vary the activation functions, units, and regularizers.

model_2= Sequential()
model_2.add(Dense(units=100, activation='relu', kernel_regularizer= regularizers.l2(0.005), input_shape=(X_train_transformed_scaled.shape[1],)))

model_2.add(Dense(units=64, activation= 'relu',kernel_regularizer= regularizers.l2(0.005)))

model_2.add(Dense(units=32, activation='relu',kernel_regularizer= regularizers.l2(0.005)))
model_2.add(Dense(units=1, activation='sigmoid'))
model_2.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer='adam')
model_2.fit(X_train_transformed_scaled, y_train, batch_size= 50, epochs= 5)



Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1acb9a19f40>
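
To make the effect of `kernel_regularizer=regularizers.l2(0.005)` concrete: an L2 kernel regularizer adds the factor times the sum of squared weights of that layer to the loss, so large weights are penalized. The sketch below computes that penalty for a tiny made-up weight matrix; `l2_penalty` and `toy_kernel` are hypothetical names for illustration.

```python
def l2_penalty(kernel, l2_factor):
    """L2 penalty: factor times the sum of squared weights."""
    return l2_factor * sum(w * w for row in kernel for w in row)


# A made-up 2x2 kernel; real layers in model_2 are far larger.
toy_kernel = [[1.0, -2.0], [0.5, 0.0]]
penalty = l2_penalty(toy_kernel, 0.005)
print(penalty)
```

The penalty depends only on the weights, not on the data, so it is added to the binary cross-entropy loss during training and nudges the optimizer toward smaller weights.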

In [33]:
y_hat_2= model_2.predict_classes(X_test_transformed_scaled)
confusion_matrix(y_test,y_hat_2)

array([[2292,  219],
       [ 227, 2262]], dtype=int64)

In [76]:
acc_model2 =accuracy_score(y_test, y_hat_2)
print(acc_model2)
print(acc_model1)

0.9108
0.9052


In [77]:
model_2.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_8 (Dense)              (None, 100)               1127600   
_________________________________________________________________
dense_9 (Dense)              (None, 64)                6464      
_________________________________________________________________
dense_10 (Dense)             (None, 32)                2080      
_________________________________________________________________
dense_11 (Dense)             (None, 1)                 33        
Total params: 1,136,177
Trainable params: 1,136,177
Non-trainable params: 0
_________________________________________________________________


In [43]:
model_3= Sequential()
model_3.add(Dense(units=100, activation='relu', kernel_regularizer= regularizers.l2(0.005), input_shape=(X_train_transformed_scaled.shape[1],)))

model_3.add(Dense(units=64, activation='tanh',kernel_regularizer= regularizers.l2(0.005)))

model_3.add(Dense(units=32, activation='tanh',kernel_regularizer= regularizers.l2(0.005)))
model_3.add(Dense(units =1, activation='sigmoid'))
model_3.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer='adam')
model_3.fit(X_train_transformed_scaled, y_train, batch_size= 50, epochs= 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1acbddc2790>

In [78]:
y_hat_3 = model_3.predict_classes(X_test_transformed_scaled)
acc_model3= accuracy_score(y_test, y_hat_3)
print(acc_model1, acc_model2, acc_model3)

0.9052 0.9108 0.9096


In [45]:
model_4= Sequential()
model_4.add(Dense(units=100, activation='relu', kernel_regularizer= regularizers.l2(0.005), input_shape=(X_train_transformed_scaled.shape[1],)))

model_4.add(Dense(units=64, activation='tanh',kernel_regularizer= regularizers.l2(0.005)))

model_4.add(Dense(units=32, activation='tanh',kernel_regularizer= regularizers.l2(0.005)))
model_4.add(Dense(units =1, activation='sigmoid'))
model_4.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer='sgd')
model_4.fit(X_train_transformed_scaled, y_train, batch_size= 50, epochs= 5)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1acbe14d970>

In [79]:
y_hat_4 = model_4.predict_classes(X_test_transformed_scaled)
acc_model4= accuracy_score(y_test, y_hat_4)
print(acc_model1, acc_model2, acc_model3, acc_model4)

0.9052 0.9108 0.9096 0.9048


In [48]:
model_5= Sequential()
model_5.add(Dense(units=100, activation='relu', kernel_regularizer= regularizers.l2(0.005), input_shape=(X_train_transformed_scaled.shape[1],)))
model_5.add(Dense(units=64))
model_5.add(Dense(units=64, activation='relu',kernel_regularizer= regularizers.l2(0.005)))
model_5.add(Dense(units=32, activation='relu'))
model_5.add(Dense(units=32, activation='relu',kernel_regularizer= regularizers.l2(0.005)))
model_5.add(Dense(units=16, activation='relu'))
model_5.add(Dense(units=8, activation='relu',kernel_regularizer= regularizers.l2(0.005)))
model_5.add(Dense(units =1, activation='sigmoid'))
model_5.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer='adam')
model_5.fit(X_train_transformed_scaled, y_train, batch_size= 50, epochs= 5)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1acc0f0ef40>

In [83]:
y_hat_5 = model_5.predict_classes(X_test_transformed_scaled)
acc_model5= accuracy_score(y_test, y_hat_5)
print(acc_model1, acc_model2, acc_model3, acc_model4, acc_model5)

0.9052 0.9108 0.9096 0.9048 0.9094


### 4) Model Evaluation

Choose a final `Sequential` model, add layers, and compile.  Fit the model on the preprocessed training data (`X_train_transformed_scaled`, `y_train`) and evaluate on the preprocessed testing data (`X_test_transformed_scaled`, `y_test`) using `accuracy_score`.

In [50]:
from sklearn.metrics import accuracy_score

In [52]:
model_3.fit(X_train_transformed_scaled, y_train, batch_size=50, validation_data=(X_test_transformed_scaled, y_test), epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1acc3f4e5b0>

In [53]:
y_hat_final= model_3.predict_classes(X_test_transformed_scaled) # testing final model predictions

In [55]:
final_model_accuracy= accuracy_score(y_test, y_hat_final) #final model accuracy
final_model_accuracy

0.9096

### 5) Technical Communication

Write a paragraph explaining whether Northwind Trading Company should switch to using your new neural network model, or continue to use the Random Forest Classifier.  Beyond a simple comparison of performance, try to take into account additional considerations such as:

 - Computational complexity/resource use
 - Anticipated performance on future datasets (how might the data change over time?)
 - Types of mistakes made by the two kinds of models

You can make guesses or inferences about these considerations.

**Include at least one visualization** comparing the two types of models.  Possible points of comparison could include ROC curves, colorized confusion matrices, or time needed to train.


In [62]:
#from sklearn.metrics import plot_confusion_matrix

y_rfc_hat=rfc.predict(X_test_transformed) #storing rfc predictions to compare accuracy between models

In [60]:
confusion_matrix(y_test, rfc.predict(X_test_transformed)) # confusion matrix showing accuracy performance for rfc model

array([[2286,  225],
       [ 217, 2272]], dtype=int64)

In [61]:
confusion_matrix(y_test, y_hat_3) # confusion matrix showing accuracy performance for our final neural network model. model 3


array([[2216,  295],
       [ 172, 2317]], dtype=int64)
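
One way to compare the types of mistakes is to read the false positives and false negatives off the two matrices shown above. The sketch below does that in plain Python, assuming Scikit-Learn's layout (rows are true labels, columns are predicted labels); the helper `error_breakdown` is a hypothetical name.

```python
def error_breakdown(cm):
    """Unpack a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    return {
        "false_positives": fp,        # negative reviews labeled positive
        "false_negatives": fn,        # positive reviews labeled negative
        "precision": tp / (tp + fp),  # share of predicted positives that are right
        "recall": tp / (tp + fn),     # share of actual positives that are found
    }


# Test-set confusion matrices copied from the two outputs above.
rfc_cm = [[2286, 225], [217, 2272]]
nn_cm = [[2216, 295], [172, 2317]]

rfc_errors = error_breakdown(rfc_cm)
nn_errors = error_breakdown(nn_cm)
print("Random forest: ", rfc_errors)
print("Neural network:", nn_errors)
```

On these two matrices the random forest's errors are split almost evenly, while the neural network makes more false positives and fewer false negatives, which means higher recall at the cost of lower precision.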

In [None]:
# overall both models perform very similarly

In [66]:

from sklearn.metrics import roc_curve, auc

In [68]:
print(accuracy_score(y_test, rfc.predict(X_test_transformed))) #RandomForest Accuracy score 
print(final_model_accuracy) # neural network accuracy

0.9116
0.9096


With the accuracy scores of the random forest classifier (0.9116) and the final neural network model (0.9096) within 1% of each other, I would initially recommend that Northwind Trading Company stick with its original machine learning model. It may not be worth their time to invest in a neural network that performs just as well but cannot tell analysts which features matter most when distinguishing a good review from a bad one.

That being said, the computation time for a random forest classifier can be very demanding. What you gain by not having to scale features, you lose when it fits each of its randomized decision trees. In the context of this problem, separating positive reviews from negative reviews, it might not matter as much to know which features lead the model to an accurate classification. In that case, I would highly suggest the neural network.

The neural network appears to be computationally faster, which is important if Northwind Trading Company wants to go through an even bigger dataset. My suggestion would be to use a neural network to separate good reviews from bad reviews. They could then gather a dataset about what the reviewer bought, how they bought it (online vs. in-store), and how they received it (mail carrier vs. in-store pickup), and run a random forest classifier to find which features of this process lead to positive and negative reviews, so that they can be fixed or reinforced, respectively.