# Final Project


## Problem description

Time to show off everything you learned !

You will be performing a Classification task to analyze the sentiment of product reviews.

This is similar to a prior assignment
- With a different dataset
- Multinomial classification with 5 classes

But, by now, you have many more tools at your disposal.

## Some possible approaches
- A review is a sequence of words.  You will need to deal with sequences in some manner.  Some suggestions
  - Pooling
  - Recurrent Neural Network
- Is there an advantage to recognizing groups of *adjacent* words ("n-grams") rather than treating the document as an unordered set of words ?
  - Consider these two sentences
    - "Machine Learning is easy not hard"
    - "Machine Learning is hard not easy"

  - Two sentences with identical words but different meaning.
  - Hint: Convolutional layer
- How should we encode words ?
  - OHE ? Embedding ?
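
The point about the two example sentences can be checked directly. The sketch below is only an illustration in plain Python (the helper name `ngrams` is hypothetical and not part of `final_helper.py`): it shows that the two sentences are indistinguishable as unordered collections of words, yet differ once adjacent word pairs are considered.

```python
from collections import Counter

def ngrams(text, n):
    """Return the n-grams (tuples of n adjacent words) of a whitespace-tokenized text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

s1 = "Machine Learning is easy not hard"
s2 = "Machine Learning is hard not easy"

# Unordered view: both sentences contain exactly the same words
same_bag = Counter(ngrams(s1, 1)) == Counter(ngrams(s2, 1))

# Ordered view: the sets of adjacent word pairs (bigrams) differ
bigrams_only_in_s1 = set(ngrams(s1, 2)) - set(ngrams(s2, 2))
bigrams_only_in_s2 = set(ngrams(s2, 2)) - set(ngrams(s1, 2))

print("Same bag of words:", same_bag)
print("Bigrams only in sentence 1:", sorted(bigrams_only_in_s1))
print("Bigrams only in sentence 2:", sorted(bigrams_only_in_s2))
```

A convolutional layer with a kernel spanning 2 or more words can, in principle, learn features that respond to such adjacent groups.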

We will **not specify** an approach.  Feel free to experiment.

Your goal is to produce a model with an out of sample accuracy meeting a minimum threshold (see the Grading section).


  
# Advice 
- Your first model should be *simple* (e.g., OHE + GlobalMaxPooling + Logistic Regression)
    - Use it to study the data and get a feel for the problem
    - It establishes a baseline from which to improve
- Use Error Analysis to understand where your model is failing.
    - Perhaps the failure cases suggest improvements ?
- Remember: this is an *iterative* process
    - Your later models can become increasingly complex (e.g., Embedding + LSTM)
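
One common starting point for Error Analysis is a confusion matrix, which counts how often each true class is predicted as each other class. The following is a minimal sketch in plain Python; the labels are made up for illustration and are not results from the project data, and `confusion_matrix` here is a hypothetical local function rather than a helper from `final_helper.py`.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count predictions: rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Hypothetical labels, for illustration only
y_true = [0, 0, 1, 2, 3, 4, 4, 4]
y_pred = [0, 1, 1, 3, 3, 4, 4, 3]
num_classes = 5

cm = confusion_matrix(y_true, y_pred, num_classes)
for row in cm:
    print(row)

# List the (count, true class, predicted class) of each kind of mistake, most frequent first
errors = sorted(
    ((cm[t][p], t, p)
     for t in range(num_classes)
     for p in range(num_classes)
     if t != p and cm[t][p] > 0),
    reverse=True,
)
print(errors)
```

Reading the misclassified reviews behind the largest off-diagonal counts may suggest what the model is missing.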

## Grading

Prior assignments evaluated you step by step.

This project is **results-based**. Your goal is to create a model
- That achieves an out of sample accuracy of at least 50%
- 60% would be better !

There are three data files in this directory:  
- `train.csv`:
    - This is the dataset on which you will train your model
- `test.csv`:
    - This is the dataset by which you will be judged !
    - It has no labels so **you** can't use it to train or test your model
        - But **we do have** the labels so we can test your accuracy
    - Once you have built your model, you will make predictions on these examples and submit them for grading
- `submit_sample.csv`:
    - The file of predictions that you will submit should be similar in format to this file

**The file that you submit for grading**
- Should be named "my_submit.csv"

**To create it: save your model's predictions in a pandas DataFrame and write it to a CSV file named "my_submit.csv"**

## Learning objectives
- Experimentation !
- Error Analysis leading to model improvement
- Appreciate how choices impact number of weights


In [None]:
from __future__ import print_function

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn
from sklearn import preprocessing

import os
import re
import json
import math

%matplotlib inline

import tensorflow as tf
print("Running TensorFlow version ",tf.__version__)

# Parse tensorflow version
version_match = re.match(r"([0-9]+)\.([0-9]+)", tf.__version__)
tf_major, tf_minor = int(version_match.group(1)) , int(version_match.group(2))
print("Version {v:d}, minor {m:d}".format(v=tf_major, m=tf_minor) )


from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, LSTM

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

from tensorflow.keras.utils import plot_model
import IPython

# API for students

We will define some utility routines.

This will simplify problem solving.

More importantly: it adds structure to your submission so that it may be easily graded.

**If you want to take a look at the API or change it, you can open it by selecting "File->Open->final_helper.py"**

`helper = final_helper.HELPER()`

### Preprocess raw dataset
- getDataRaw: get raw data.          
  >`train_raw, test_raw = helper.getDataRaw()`         
- getTextClean: clean text. 
  >`data_raw` is the raw data you get from `helper.getDataRaw()`, which is a pandas DataFrame          
  >`textAttr` is the column of text data      
  >`sentAttr` is the column of label       
  >`docs, sents = helper.getTextClean(data_raw, textAttr, sentAttr)`         
- encodeDocs: text tokenization
  >`docs` is the text data          
  >`vocab_size` is the size of vocabulary           
  >`words_in_doc` is number of words in a review           
  >`tok, encoded_docs, encoded_docs_padded = helper.encodeDocs(docs, vocab_size, words_in_doc)`        
- showEncodedDocs: display data by reversing index back to word. 
  >`tok` is an object of `Tokenizer`             
  >`encoded_docs_padded` is the text data which you have encoded and padded                  
  >`helper.showEncodedDocs(tok, encoded_docs_padded)`                   
- getExamplesOHE: one-hot encode samples. 
  >`encoded_docs_padded` is the text data which you have encoded and padded                 
  >`sents` is the labels                
  >`vocab_size` is number of words in the vocabulary           
  >`X, y = helper.getExamplesOHE(encoded_docs_padded, sents, vocab_size)`          

### Train model
- trainModelCat: train model for categorical labels
  >`patience` and `min_delta` are parameters of `EarlyStopping`        
  >`history = helper.trainModelCat(model, X_train, X_val, y_train, y_val, num_epochs=30, metric="acc", patience=5, min_delta=.005)`
  
### Save model and load model
- save model: save a model in `./models` directory
  >`helper.saveModel(model, modelName)`
- save history: save a model history in `./models` directory
  >`helper.saveHistory(history, modelName)`
- load model: load a model in `./models` directory
  >`helper.loadModel(model, modelName)`
- load history: load a model history in `./models` directory
  >`helper.loadHistory(modelName)`

### Plot models and training results
- plotModel: plot your models
  >`helper.plotModel(model, model_name)`
- plot_training: plot your training results
  >`helper.plot_training(history, metric='acc')`


In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%reload_ext autoreload
%autoreload 1

%matplotlib inline

import final_helper
%aimport final_helper

helper = final_helper.HELPER()



## Load data


In [None]:
# Load training data and test data
train_raw, test_raw = helper.getDataRaw() 

In [None]:
train_raw.head()

In [None]:
train_raw.shape

## Data preprocessing

The reviews are in the "reviewText" attribute.



You may try to use other attributes as additional features if you choose, but we suggest that your first model may use this as the only source of features.

Whenever you manipulate your training set, **DON'T FORGET** to apply the same manipulation to the **test** set, because the test set will be the input to your final model!


## Get the labelled training data
- Features: docs.  Each document is a single review (sequence of characters)
- Targets/Labels: sents. Each is the sentiment associated with the review.

In [None]:
textAttr, sentAttr, titleAttr = "reviewText", "overall", "title"

## Clean text
# training data
docs, sents = helper.getTextClean(train_raw, textAttr, sentAttr)

# We will treat the sentiment values as Categorical, rather than numeric
le = sklearn.preprocessing.LabelEncoder()
sents = le.fit_transform(sents)

print("Docs shape is ", docs.shape)
print("Sents shape is ", sents.shape)

print(docs[:5])
print("\nPossible sentiment values: ",  np.unique(sents) ) 


## More data preprocessing

We will need to convert the text into a *sequence* of numbers
- Break text up into words
- Assign each word a distinct integer

Moreover, it will be easier if all sequences have the same length.
We can add a "padding" character to the front if necessary.

We do this for you below.

Our method returns
- encoded_docs_padded: A matrix of training example *features*
  - Each row is an example
  - Each row is a *sequence* of fixed length
  - Each element of the sequence is an integer, encoding a word in the vocabulary
  - The sequence length of every example is *identical* because we have prepended padding if necessary
- encoded_docs: A matrix of *unpadded* training example *features*
- tok: the Tokenizer used to
  - parse strings of characters into words
  - encode each word as an integer
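
To make the encoding and padding concrete, here is a minimal sketch in plain Python. It is not the implementation in `final_helper.py`; the function names `build_word_index` and `encode_and_pad` are hypothetical, and the choices to reserve 0 for padding and to keep the last words of an over-long document are assumptions made for the illustration.

```python
from collections import Counter

def build_word_index(docs, vocab_size):
    """Map the most frequent words to integers starting at 1 (0 is reserved for padding)."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    most_common = [word for word, _ in counts.most_common(vocab_size - 1)]
    return {word: i + 1 for i, word in enumerate(most_common)}

def encode_and_pad(doc, word_index, words_in_doc):
    """Encode words as integers, drop out-of-vocabulary words, and pad at the front."""
    ids = [word_index[w] for w in doc.lower().split() if w in word_index]
    ids = ids[-words_in_doc:]  # keep at most words_in_doc words (here: the last ones)
    return [0] * (words_in_doc - len(ids)) + ids

docs = ["great product works great", "bad product"]
word_index = build_word_index(docs, vocab_size=10)
padded = [encode_and_pad(d, word_index, words_in_doc=5) for d in docs]

print(word_index)
print(padded)
```

Every row of `padded` has the same length, and repeated words (such as "great") map to the same integer.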


You may study our method's parameters and modify them if you wish, e.g., alter the size of the vocabulary or length of sequences.

We suggest that your first model uses
- encoded_docs_padded as your set of training features, e.g., X
- sents: as your targets
with the default settings of the method.



In [None]:
## set parameters:
# vocab_size : number of words in the vocabulary 
# words_in_doc: number of words in a review
vocab_size, words_in_doc = 400, 100

tok, encoded_docs, encoded_docs_padded = helper.encodeDocs(docs, vocab_size=vocab_size, words_in_doc=words_in_doc)

print("Training example features shape: ",encoded_docs_padded.shape)
print("Training example features preview: ")
print(encoded_docs_padded[:3])


## Verify that our encoded documents are the same as the cleaned original

At this point: convince yourself that all we have done is encode words as integers and pad out all text to the same length.  The following will demonstrate this.

In [None]:
helper.showEncodedDocs(tok, encoded_docs_padded)

# Caution !

How will you encode words ?
  
- Perhaps you want to use OHE.  If so: we provide some utility functions to help.
  >`X_train_OHE, _ = helper.getExamplesOHE(X_train, sents, vocab_size)`       
  >`X_val_OHE, _ = helper.getExamplesOHE(X_val, sents, vocab_size)` 
  
  But be **careful**: Our vocabulary is very large.  One Hot Encoding may use too much memory and your program won't run.
  
- Alternatives to OHE
 - You can try an embedding layer which is a *dense* representation of words.
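
A rough back-of-the-envelope estimate shows why One Hot Encoding can exhaust memory. The sketch below assumes 4-byte values and a hypothetical 10,000 examples (the real size of `train.csv` is not assumed here); the sequence length of 100 and the vocabulary sizes 400 and 4000 are the ones that appear in this notebook.

```python
def ohe_bytes(num_examples, words_in_doc, vocab_size, bytes_per_value=4):
    """Memory for a dense OHE tensor of shape (num_examples, words_in_doc, vocab_size)."""
    return num_examples * words_in_doc * vocab_size * bytes_per_value

def integer_bytes(num_examples, words_in_doc, bytes_per_value=4):
    """Memory for integer-encoded sequences of shape (num_examples, words_in_doc)."""
    return num_examples * words_in_doc * bytes_per_value

num_examples = 10_000  # hypothetical number of reviews, for illustration only
words_in_doc = 100

for vocab_size in (400, 4000):
    ohe = ohe_bytes(num_examples, words_in_doc, vocab_size)
    ints = integer_bytes(num_examples, words_in_doc)
    print(f"vocab_size={vocab_size}: OHE {ohe / 1e9:.1f} GB vs integers {ints / 1e6:.1f} MB")
```

The OHE tensor is `vocab_size` times larger than the integer encoding, which is why an Embedding layer fed with integers is usually the more practical choice for large vocabularies.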

# Split the examples into training and validation data

- X_train: ndarray of training example features
- X_val:  ndarray of validation example features
- y_train: ndarray of training example targets
- y_val:  ndarray of validation example targets

In [None]:
### BEGIN SOLUTION

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(encoded_docs_padded, sents, test_size=0.30, random_state=42)

# # OHE the X for some models
# vocab_size, words_in_doc = 4000, 100
# X_train_OHE, _ = helper.getExamplesOHE(X_train, sents, vocab_size)
# X_val_OHE, _ = helper.getExamplesOHE(X_val, sents, vocab_size)

### END SOLUTION

In [None]:
## Verify your training and test dataset

# Set two variables
# example_sequence_len: length of the sequence
# example_num_features: number of features in a single element of the sequence (of a single example)

# # If using OHE:
# example_shape = X_train_OHE.shape[1:]
# example_sequence_len, example_num_features = example_shape[0], example_shape[1]

# assert example_sequence_len == words_in_doc
# assert example_num_features == vocab_size

# If NOT using OHE
example_shape = X_train.shape[1:]
example_sequence_len, example_num_features = example_shape[0], tok.num_words

assert example_sequence_len == words_in_doc
assert example_num_features == vocab_size

## Create and train your model

**Note:**

- There is a `trainModelCat()` API already in the `final_helper.py` file. You can directly use it by
  >`history = helper.trainModelCat(model, X_train, X_val, y_train, y_val, num_epochs=30, metric="acc", patience=5, min_delta=.005)`
  
  You can change the `trainModelCat()` code or write your own training process if you have a better idea!   


- To see your model's performance, this API is very convenient:
  >`helper.plot_training(history)`


In [None]:
### BEGIN SOLUTION

def runModel(model, model_name, X_train, X_test, y_train, y_test):
    plot_file = helper.plotModel(model, model_name)
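    # A bare expression inside a function displays nothing, so display the image explicitly
    IPython.display.display(IPython.display.Image(plot_file))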

    model.summary()

    history = helper.trainModelCat( model, X_train, X_test, y_train, y_test )

    helper.plot_training(history)

    helper.eval_model(model, X_test, y_test)
    
    return history
    
embed_size_sm = 16
lstm_size_sm = 4
vocab_size_b, words_in_doc_b, embed_size_b, lstm_size_b = int(10*vocab_size), words_in_doc, int(1*embed_size_sm), lstm_size_sm
tok_b, encoded_docs_b, encoded_docs_padded_b = helper.encodeDocs(docs, vocab_size=vocab_size_b, words_in_doc=words_in_doc_b)

X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(encoded_docs_padded_b, sents, test_size=0.20, random_state=42)

model_x = Sequential( [Embedding(tok_b.num_words+1, embed_size_b, input_length=words_in_doc_b),
                          Dropout(0.25),
                          #LSTM(lstm_size_b), Dense(50, activation="relu"),
                          GlobalMaxPooling1D(),
                          Dropout(0.25),
                           Dense(100, activation="relu"),
                         
                          Dense( len( np.unique(y_train_b) ), activation="softmax")
                         ]
                       )

runModel(model_x, "Embedding Big + Complex", X_train_b, X_test_b, y_train_b, y_test_b)

### END SOLUTION

## How many weights in your Classifier only model ?


**Question:** How many weights in your model ?
- Set a variable `num_weights` to be the number of weights 

You should always be sensitive to how "big" your model is.
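
It can help to work out the count by hand before asking Keras for it. The sketch below reproduces the arithmetic for a model shaped like the solution model above (Embedding, then pooling, then Dense(100), then a softmax layer), assuming a vocabulary of 4000 words, an embedding size of 16, and 5 classes; Dropout and GlobalMaxPooling1D contribute no weights. The function names are hypothetical.

```python
def embedding_params(input_dim, output_dim):
    """An Embedding layer stores one vector of length output_dim per input index."""
    return input_dim * output_dim

def dense_params(num_inputs, num_units):
    """A Dense layer has one weight per (input, unit) pair plus one bias per unit."""
    return num_inputs * num_units + num_units

# Assumed sizes, matching the example model in this notebook
vocab_size_b, embed_size_b, num_classes = 4000, 16, 5

counts = {
    "embedding": embedding_params(vocab_size_b + 1, embed_size_b),
    "dense_100": dense_params(embed_size_b, 100),
    "softmax": dense_params(100, num_classes),
}
total = sum(counts.values())

for name, count in counts.items():
    print(f"{name}: {count}")
print("total:", total)
```

Under these assumptions the Embedding layer accounts for the large majority of the weights, so the vocabulary size and embedding size are the choices that matter most for model size.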

In [None]:
# Set variable
num_weights = None

### BEGIN SOLUTION
num_weights = model_x.count_params()
### END SOLUTION

print('The number of weights is :', num_weights)

# Submit your predictions for grading

Now that you have built your model, it's time to make predictions for grading.
- Make a prediction for each example in the file `test.csv`
- Create a file `my_submit.csv` with these predictions **in the same order** as the examples in `test.csv`
- The format of `my_submit.csv` should be similar to `submit_sample.csv`

**Hint:**

You may want (but are not required) to use a Pandas DataFrame to create `my_submit.csv`.
- Look up the Pandas method `to_csv` in order to create a CSV file from a DataFrame
    - Use optional argument `index=False` to prevent line numbers from being inserted in your CSV file

In [None]:
# If you use Pandas DataFrame to store your results
my_results = pd.DataFrame()

### BEGIN SOLUTION
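# test_raw was already loaded above by helper.getDataRaw()
docs_test = helper.getTextClean(test_raw, textAttr)

# Encode the test documents with the tokenizer fitted on the training data (tok_b),
# so that each word maps to the same integer the model saw during training.
# Assumption: helper.encodeDocs pads like Keras pad_sequences with default settings.
encoded_docs_test = tok_b.texts_to_sequences(docs_test)
encoded_docs_padded_test = sequence.pad_sequences(encoded_docs_test, maxlen=words_in_doc_b)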
# test_submit, _ = helper.getExamplesOHE(encoded_docs_padded_test, None, vocab_size)

my_model = model_x
# predicted_class = le.inverse_transform(np.argmax(my_model.predict(test_submit), axis=-1))
predicted_class = le.inverse_transform(np.argmax(my_model.predict(encoded_docs_padded_test), axis=-1))
my_results = pd.DataFrame({'reviewerID': test_raw['reviewerID'], 'overall':predicted_class})

### END SOLUTION

# Save your results in a csv file
my_results.to_csv('my_submit.csv', index=False)

## Discussion topics
- Compare the number of weights in each model.  Did the added complexity (more weights) lead to better performance ?
- **Where** was the largest increase in weights between models ?
  - Embeddings consume a lot of weights
    - But an embedding eliminates a dimension: a word is represented by a single integer rather than an OHE vector of length words_in_vocab
    - So subsequent layers may need fewer weights compared to OHE
- Should we have formulated this as a Regression task rather than a Classification task ?
  - Is the difference in rating between 0 and 1 the same as between 3 and 4 ?
    - Perhaps there are bigger *absolute* differences in satisfaction  in lower ratings
      - i.e., Big difference between 0 and 1, smaller difference between 3 and 4
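
One way to explore the Regression question is to compare a metric that ignores the ordering of the ratings (accuracy) with one that respects it (mean absolute error). The sketch below uses made-up predictions, for illustration only: two hypothetical classifiers with the same accuracy whose mistakes differ greatly in how far off they are.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average distance between predicted and true rating."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels and predictions, for illustration only
y_true = [0, 1, 2, 3, 4]
near_misses = [0, 1, 2, 4, 3]  # wrong predictions are off by one rating
far_misses = [0, 1, 2, 0, 0]   # wrong predictions are off by three or four ratings

for name, y_pred in (("near misses", near_misses), ("far misses", far_misses)):
    print(name, "accuracy:", accuracy(y_true, y_pred),
          "MAE:", mean_absolute_error(y_true, y_pred))
```

Accuracy treats both sets of predictions as equally good, while mean absolute error penalizes the far misses more. Note that mean absolute error itself assumes equal spacing between ratings, which is exactly the assumption the question above asks you to examine.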