# Final Project


## Problem description

Time to show off everything you learned !

You will be performing a Classification task to analyze the sentiment of product reviews.

This is similar to a prior assignment, but
- With a different dataset
- Multinomial classification with 5 classes

But, by now, you have many more tools at your disposal.

## Some possible approaches
- A review is a sequence of words.  You will need to deal with sequences in some manner.  Some suggestions
  - Pooling
  - Recurrent Neural Network
- Is there an advantage to recognizing groups of *adjacent* words ("n-grams") rather than treating the document as an unordered set of words ?
  - Consider these two sentences
    - "Machine Learning is easy not hard"
    - "Machine Learning is hard not easy"

  - Two sentences with identical words but different meaning.
  - Hint: Convolutional layer
- How should we encode words ?
  - OHE ? Embedding ?
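
To make the n-gram question concrete, here is a minimal sketch in plain Python using the two sentences above. The helper name `ngrams` is hypothetical (it is not part of `final_helper`), and splitting on whitespace is an assumption made only for illustration.

```python
def ngrams(text, n):
    """Return the list of n-grams (tuples of n adjacent words) in text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

s1 = "Machine Learning is easy not hard"
s2 = "Machine Learning is hard not easy"

# As unordered sets of single words, the two sentences are indistinguishable.
same_words = set(ngrams(s1, 1)) == set(ngrams(s2, 1))

# As sets of adjacent word pairs (bigrams), they differ.
only_in_s1 = set(ngrams(s1, 2)) - set(ngrams(s2, 2))

print("Same set of words:", same_words)
print("Bigrams only in the first sentence:", sorted(only_in_s1))
```

A Convolutional layer with kernel size 2 sees the same adjacent pairs, which is why it may be able to tell these sentences apart where pooling over single words cannot.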

We will **not specify** an approach.  Feel free to experiment.

Your goal is to produce a model whose out of sample accuracy meets the minimum given in the Grading section.



## Grading

Prior assignments evaluated you step by step.

This project is **results-based**. Your goal is to create a model
- That achieves an out of sample accuracy of at least 50%
- 60% would be better !



## Learning objectives
- Experimentation !
- Error Analysis leading to model improvement
- Appreciate how choices impact number of weights
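
One common starting point for Error Analysis is a confusion matrix, which shows *which* classes the model mixes up. The sketch below uses NumPy only; the function name `confusion_matrix`, the made-up labels, and the assumption that classes are encoded as the integers 0..4 are all illustrative and not taken from `final_helper`.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """counts[i, j] = number of examples with true class i predicted as class j."""
    counts = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(counts, (y_true, y_pred), 1)
    return counts

# Made-up labels for illustration only (assumed classes 0..4).
y_true = np.array([0, 0, 1, 2, 3, 4, 4, 4])
y_pred = np.array([0, 1, 1, 4, 4, 4, 4, 3])

cm = confusion_matrix(y_true, y_pred, num_classes=5)

# Accuracy is the fraction of examples on the diagonal.
accuracy = np.trace(cm) / cm.sum()

print(cm)
print("Accuracy:", accuracy)
```

Large off-diagonal entries point at the pairs of classes worth inspecting by hand, e.g., by reading a few misclassified reviews.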


In [10]:
from __future__ import print_function

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

import os
import re
import json
import gzip
import math

%matplotlib inline

import tensorflow as tf
print("Running TensorFlow version ",tf.__version__)

# Parse tensorflow version
version_match = re.match(r"([0-9]+)\.([0-9]+)", tf.__version__)
tf_major, tf_minor = int(version_match.group(1)) , int(version_match.group(2))
print("Version {v:d}, minor {m:d}".format(v=tf_major, m=tf_minor) )


from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, LSTM

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

from tensorflow.keras.utils import plot_model
import IPython

Running TensorFlow version  2.1.0
Version 2, minor 1


# API for students

We will define some utility routines.

This will simplify problem solving.

More importantly: it adds structure to your submission so that it may be easily graded.

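- getDataRaw: Get the raw product reviews, used as follows

  >`data_raw = helper.getDataRaw(DATA_DIR, data_file)`
- getTextClean: Extract the cleaned review text and the sentiment labels, used as follows

  >`docs, sents = helper.getTextClean(data_raw, textAttr, sentAttr)`

- encodeDocs: Encode each review as a padded sequence of integers, used as follows

  >`tok, encoded_docs, encoded_docs_padded = helper.encodeDocs(docs, vocab_size=vocab_size_sm, words_in_doc=words_in_doc_sm)`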


In [3]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

import final_helper
%aimport final_helper

helper = final_helper.HELPER()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload




## Get the reviews (as text)
- **Teacher version:** include code to read and format the raw data, producing a file for the students.  Hide this from students.

- **Student version:** read file prepared by teacher version


In [4]:
DATA_DIR = "./Data"
data_file = "Software_5.json"
data_raw = helper.getDataRaw(DATA_DIR, data_file)

In [None]:
data_raw.head()

In [None]:
data_raw.shape

In [11]:
train, test = train_test_split(data_raw, test_size=0.1, shuffle=False)
test_student = test.loc[:, test.columns != 'overall']
test_grad = test[['reviewerID', 'overall']]
train.to_csv('./Data/train.csv', index=False)
test_student.to_csv('./Data/test.csv', index=False)
test_grad.to_csv('./Data/test_grad.csv', index=False)
test_grad[:20].to_csv('./Data/submit_sample.csv', index=False)

## Data preprocessing

The reviews are in the "reviewText" attribute.



You may try to use other attributes as additional features if you choose, but we suggest that your first model use this as its only source of features.


# Get the labelled training data
- Features: docs.  Each document is a single review (sequence of characters)
- Targets/Labels: sents. Each is the sentiment associated with the review.

In [None]:
textAttr, sentAttr, titleAttr = "reviewText", "overall", "title"
docs, sents = helper.getTextClean(data_raw, textAttr, sentAttr)

print("Docs shape is ", docs.shape)
print("Sents shape is ", sents.shape)

# docs[:5]
print("\nPossible sentiment values: ",  np.unique(sents) ) 


## More data preprocessing

We will need to convert the text into a *sequence* of numbers
- Break text up into words
- Assign each word a distinct integer

Moreover, it will be easier if all sequences have the same length.
We can add a "padding" character to the front if necessary.

We do this for you below.

Our method returns
- encoded_docs_padded: A matrix of training example *features*
  - Each row is an example
  - Each row is a *sequence* of fixed length
  - Each element of the sequence is an integer, encoding a word in the vocabulary
  - The sequence length of every example is *identical* because we have prepended padding if necessary
- encoded_docs: A matrix of *unpadded* training example *features*
- tok: the Tokenizer used to
  - parse strings of characters into words
  - encode each word as an integer


You may study our method's parameters and modify them if you wish, e.g., alter the size of the vocabulary or the length of the sequences.

We suggest that your first model uses the default settings of the method, with
- encoded_docs_padded as your set of training features, e.g., X
- sents as your targets
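
The encoding and padding described above can be sketched in a few lines. This is *not* the implementation behind `helper.encodeDocs`; it is a simplified stand-in, and the function name `encode_and_pad`, the whitespace tokenization, the use of id 0 for padding, and the choice to drop out-of-vocabulary words are all assumptions made for illustration.

```python
from collections import Counter
import numpy as np

def encode_and_pad(docs, vocab_size, words_in_doc):
    """Encode each document as a fixed-length row of word ids.

    Ids 1..vocab_size-1 go to the most frequent words; 0 is the padding id.
    Words outside the vocabulary are dropped.
    """
    counts = Counter(w for d in docs for w in d.lower().split())
    word_to_id = {w: i + 1
                  for i, (w, _) in enumerate(counts.most_common(vocab_size - 1))}

    padded = np.zeros((len(docs), words_in_doc), dtype=int)
    for row, d in enumerate(docs):
        ids = [word_to_id[w] for w in d.lower().split() if w in word_to_id]
        ids = ids[-words_in_doc:]            # keep the last words if too long
        if ids:
            padded[row, -len(ids):] = ids    # right-align: padding goes in front
    return word_to_id, padded

toy_docs = ["good product", "not good", "good good good value"]
word_to_id, padded = encode_and_pad(toy_docs, vocab_size=10, words_in_doc=3)
print(padded)
```

Every row has the same length, short reviews start with the padding id, and long reviews are truncated.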



In [None]:
## set parameters:
# vocab_size : number of words in the vocabulary 
# words_in_doc: number of words in a review
vocab_size_sm, words_in_doc_sm = 400, 100

tok, encoded_docs, encoded_docs_padded = helper.encodeDocs(docs, vocab_size=vocab_size_sm, words_in_doc=words_in_doc_sm)

print("Training example features shape: ",encoded_docs_padded.shape)

print("Training example features: preview")
encoded_docs_padded[:3]


## Verify that our encoded documents are the same as the cleaned original

At this point: convince yourself that all we have done was encode words as integers and pad out all text to the same length.  The following will demonstrate this

In [None]:
helper.showEncodedDocs(tok, encoded_docs_padded)

# Split the examples into training and test (out of sample) data
- The number of test examples should be 10% of the total


In [None]:
# Set the following variables
# - X_train: ndarray of training example features
# - X_test:  ndarray of test example features
# - y_train: ndarray of training example targets
# - y_test:  ndarray of test example targets

### BEGIN SOLUTION
X_train, X_test, y_train, y_test = train_test_split(encoded_docs_padded, sents, test_size=0.10, random_state=42)

# One-hot encode (OHE) the features for the models that need them.
# The second return value is discarded; pass the targets that match each split.
X_train_OHE, _ = helper.getExamplesOHE(X_train, y_train, vocab_size_sm)
X_test_OHE, _ = helper.getExamplesOHE(X_test, y_test, vocab_size_sm)
### END SOLUTION

In [None]:
# Set two variables
# example_sequence_len: length of the sequence
# example_num_features: number of features in a single element of the sequence (of a single example)

### BEGIN SOLUTION

# If using OHE:
example_shape = X_train_OHE.shape[1:]
example_sequence_len, example_num_features = example_shape[0], example_shape[1]

assert example_sequence_len == words_in_doc_sm
assert example_num_features == vocab_size_sm

# If NOT using OHE
example_shape = X_train.shape[1:]
example_sequence_len, example_num_features = example_shape[0], tok.num_words

assert example_sequence_len == words_in_doc_sm
assert example_num_features == vocab_size_sm

### END SOLUTION

# Create your model



In [None]:
### BEGIN SOLUTION
# Set variables
# - model: a Keras Sequential model to predict sentiment

### END SOLUTION

# Sample models (for teacher review, not for students)



## Simple model: OHE + GlobalMaxPooling + Logistic Regression


In [None]:
### BEGIN SOLUTION

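def runModel(model, model_name, X_train, X_test, y_train, y_test):
    # Plot the model that was passed in (not the global model_simple)
    plot_file = helper.plotModel(model, model_name)
    # An Image created inside a function is not rendered automatically: display it explicitly
    IPython.display.display(IPython.display.Image(plot_file))

    model.summary()

    # Note: patience and min_delta are defined here but not passed to trainModelCat
    patience = 5
    min_delta = .005
    max_epochs = 30
    history = helper.trainModelCat(model, X_train, y_train, max_epochs)

    helper.plot_training(history)

    helper.eval_model(model, X_test, y_test)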


model_simple = Sequential( [ GlobalMaxPooling1D(input_shape=X_train_OHE.shape[-2:]),
                             Dense( len( np.unique(y_train) ), activation="softmax")
                           ]
                         )

runModel(model_simple, "OHE + GlobalMaxPooling", X_train_OHE, X_test_OHE, y_train, y_test)

### END SOLUTION
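
To see what this simple model "sees", the sketch below reproduces OHE followed by global max pooling in NumPy. The function name `one_hot` and the toy ids are assumptions for illustration; `helper.getExamplesOHE` may differ in detail.

```python
import numpy as np

def one_hot(encoded, vocab_size):
    """Map integer ids of shape (n, seq_len) to one-hot vectors of shape (n, seq_len, vocab_size)."""
    return np.eye(vocab_size, dtype=int)[encoded]

# Two toy "documents" containing the same word ids in a different order.
encoded = np.array([[1, 2, 3, 4],
                    [1, 4, 3, 2]])
ohe = one_hot(encoded, vocab_size=6)     # shape (2, 4, 6)

# Global max pooling takes the maximum over the sequence axis.
pooled = ohe.max(axis=1)                 # shape (2, 6)

print(ohe.shape, pooled.shape)
print(pooled)
```

Both rows of `pooled` are identical: after pooling, each example is reduced to a 0/1 indicator of which words occur, so word order is lost before the Dense layer is applied.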

## Model: OHE + LSTM

In [None]:
### BEGIN SOLUTION
lstm_size_sm = 4
model_lstm = Sequential( [
                          LSTM(lstm_size_sm, input_shape=X_train_OHE.shape[-2:], recurrent_dropout=0.),
                          Dropout(0.3),
                          Dense( len( np.unique(y_train) ), activation="softmax")
                         ]
                       )

runModel(model_lstm, "OHE + LSTM", X_train_OHE, X_test_OHE, y_train, y_test)
### END SOLUTION

## Model: Embedding + GlobalMaxPooling 


In [None]:
### BEGIN SOLUTION
embed_size_sm=16
model_simple_es = Sequential( [Embedding(tok.num_words+1, embed_size_sm, input_length=words_in_doc_sm),
                             GlobalMaxPooling1D(),
                             Dense( len( np.unique(y_train) ), activation="softmax")
                         ]
                       )

runModel(model_simple_es, "Embedding + GlobalMaxPooling", X_train, X_test, y_train, y_test)
### END SOLUTION
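
An Embedding layer can be thought of as a lookup table: row `i` of a weight matrix is the vector for word id `i`. The NumPy sketch below, with a randomly initialized matrix `E` standing in for learned weights (an assumption for illustration), checks that this lookup gives the same result as multiplying the OHE vectors by the same matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size = 6, 3

# Embedding matrix: one row of embed_size weights per word id.
E = rng.normal(size=(vocab_size, embed_size))

encoded = np.array([[0, 2, 5],
                    [4, 4, 1]])                  # (n, seq_len) integer ids

looked_up = E[encoded]                           # (n, seq_len, embed_size)
via_ohe = np.eye(vocab_size)[encoded] @ E        # same vectors via OHE

print(looked_up.shape, np.allclose(looked_up, via_ohe))
```

The lookup never materializes the OHE vectors, which is why the Embedding models can take the integer-encoded `X_train` directly.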


## Model: Embedding + LSTM (rather than GlobalMaxPooling)


In [None]:
### BEGIN SOLUTION
model_lstm_e = Sequential( [Embedding(tok.num_words+1, embed_size_sm, input_length=words_in_doc_sm),
                            LSTM(lstm_size_sm),
                          Dense( len( np.unique(y_train) ), activation="softmax")
                         ]
                       )

runModel(model_lstm_e, "Embedding + LSTM", X_train, X_test, y_train, y_test)
### END SOLUTION


# Try a larger vocab, since Embeddings are more compact


## Model: Embedding Big + GlobalMaxPooling1D

In [None]:
### BEGIN SOLUTION
vocab_size_b, words_in_doc_b, embed_size_b, lstm_size_b = int(10*vocab_size_sm), words_in_doc_sm, int(1*embed_size_sm), lstm_size_sm
tok_b, encoded_docs_b, encoded_docs_padded_b = helper.encodeDocs(docs, vocab_size=vocab_size_b, words_in_doc=words_in_doc_b)

X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(encoded_docs_padded_b, sents, test_size=0.10, random_state=42)

X_train_b.shape, y_train_b.shape


model_simple_es_b = Sequential( [Embedding(tok_b.num_words+1, embed_size_b, input_length=words_in_doc_b),
                             GlobalMaxPooling1D(),
                          Dense( len( np.unique(y_train_b) ), activation="softmax")
                         ]
                       )

runModel(model_simple_es_b, "Embedding Big + GlobalMaxPooling", X_train_b, X_test_b, y_train_b, y_test_b)

### END SOLUTION

## Model: Embedding Big + LSTM

In [None]:
### BEGIN SOLUTION
model_lstm_b = Sequential( [Embedding(tok_b.num_words+1, embed_size_b, input_length=words_in_doc_b),
                            LSTM(lstm_size_b),
                            Dropout(0.25),
                          Dense( len( np.unique(y_train_b) ), activation="softmax")
                         ]
                       )

runModel(model_lstm_b, "Embedding Big + LSTM", X_train_b, X_test_b, y_train_b, y_test_b)
### END SOLUTION

## Model: more complex




In [None]:
### BEGIN SOLUTION 
model_x = Sequential( [Embedding(tok_b.num_words+1, embed_size_b, input_length=words_in_doc_b),
                          Dropout(0.25),
                          #LSTM(lstm_size_b), Dense(50, activation="relu"),
                          GlobalMaxPooling1D(),
                          Dropout(0.25),
                           Dense(100, activation="relu"),
                         
                          Dense( len( np.unique(y_train_b) ), activation="softmax")
                         ]
                       )

runModel(model_x, "Embedding Big + Complex", X_train_b, X_test_b, y_train_b, y_test_b)
### END SOLUTION

# Submit your model


- Was the increase in number of weights compensated by a gain in accuracy when using a Recurrent Layer type compared to the Classifier only model ?
- Can you speculate why this is so ?

In [None]:
### BEGIN SOLUTION
model = model_simple

# This model uses non-standard features
X_test = X_test_OHE
### END SOLUTION

# Evaluate your model on the previously constructed test examples (out of sample)

In [None]:
loss_test, acc_test = helper.eval_model(model, X_test, y_test)
# - loss_test: Loss, out of sample
# - acc_test:  Accuracy, out of sample.  This is what you will be graded on

## How many weights in your Classifier only model ?

How many weights in your model ?

You should always be sensitive to how "big" your model is.
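
The counts can also be worked out by hand, which helps in seeing where the weights go. The sketch below is a back-of-the-envelope calculation, not output from the notebook: it assumes 5 classes, the small settings used above (vocabulary of 400, embedding size 16, LSTM size 4), an Embedding with `vocab_size + 1` rows, and the standard LSTM layout of four gates each with input weights, recurrent weights and a bias. The function names are hypothetical; compare the results against `model.summary()`.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def embedding_params(n_rows, embed_size):
    """One vector of length embed_size per row of the lookup table."""
    return n_rows * embed_size

def lstm_params(n_in, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (n_in + units) + units)

vocab_size, embed_size, lstm_size, n_classes = 400, 16, 4, 5

counts = {
    "OHE + GlobalMaxPooling":
        dense_params(vocab_size, n_classes),
    "OHE + LSTM":
        lstm_params(vocab_size, lstm_size) + dense_params(lstm_size, n_classes),
    "Embedding + GlobalMaxPooling":
        embedding_params(vocab_size + 1, embed_size)
        + dense_params(embed_size, n_classes),
    "Embedding + LSTM":
        embedding_params(vocab_size + 1, embed_size)
        + lstm_params(embed_size, lstm_size)
        + dense_params(lstm_size, n_classes),
}

for name, n in counts.items():
    print(f"{name:30s} {n:6d}")
```

Pooling and Dropout layers add no weights, so they do not appear in the arithmetic.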

In [None]:
# Set variable
# - num_weights: number of weights in your model

### BEGIN SOLUTION
num_weights = model.count_params()
### END SOLUTION

# Discussion topics
- Compare the number of weights in each model.  Did the added complexity (more weights) lead to better performance ?
- **Where** was the largest increase in weights between models ?
  - Embeddings consume a lot of weights
    - But they eliminate a dimension: a word is represented by a single integer rather than an OHE vector of length vocab_size
    - So subsequent layers may need fewer weights compared to OHE
- Should we have formulated this as a Regression task rather than a Classification task ?
  - Is the difference in rating between 0 and 1 the same as between 3 and 4 ?
    - Perhaps there are bigger *absolute* differences in satisfaction between lower ratings
      - i.e., a big difference between 0 and 1, a smaller difference between 3 and 4
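
One way to explore the Regression question is to compare a metric that ignores the ordering of the classes (accuracy) with one that respects it (mean absolute error). The labels below are made up for illustration and assume classes encoded as 0..4; the function names are hypothetical.

```python
import numpy as np

# Made-up labels for illustration only (assumed classes 0..4).
y_true    = np.array([4, 4, 0, 0])
near_miss = np.array([3, 4, 1, 0])   # wrong predictions are off by one class
far_miss  = np.array([0, 4, 4, 0])   # wrong predictions are off by four classes

def accuracy(y, y_hat):
    """Fraction of exactly correct predictions."""
    return float(np.mean(y == y_hat))

def mean_abs_error(y, y_hat):
    """Average distance between the predicted and the true class."""
    return float(np.mean(np.abs(y - y_hat)))

for name, y_hat in [("near miss", near_miss), ("far miss", far_miss)]:
    print(name, accuracy(y_true, y_hat), mean_abs_error(y_true, y_hat))
```

Both sets of predictions have the same accuracy, but the mean absolute error separates them. Note that mean absolute error still treats every step between adjacent ratings as equal, which is exactly the assumption the question above asks you to examine.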