# Problem 0: Other activation functions (10%)

### The leaky ReLU is defined as $\max(0.1x, x)$.
 - What is its derivative? (Please express it in "easy" format.)
 - Is it suitable for backpropagation?

### $\tanh$ is defined as $\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$.
 - What is its derivative? (Please express it in "easy" format.)
 - Is it suitable for backpropagation?
 - How is it different from the sigmoid activation?
 - What is an example of when to use it? When should you not use it?
 
Please put answers as text below
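
One way to sanity-check a hand-derived derivative is to compare it with a finite-difference estimate. The sketch below is only a checking aid, not part of the required answer; the helper names `leaky_relu`, `sigmoid`, and `numerical_derivative` are made up for this illustration.

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    # max(slope * x, x), which matches the definition above when slope=0.1
    return np.maximum(slope * x, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def numerical_derivative(f, x, eps=1e-6):
    # central finite difference; only an approximation of f'(x)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# avoid x = 0, where the leaky ReLU has a kink
xs = np.array([-2.0, -0.5, 0.5, 2.0])
for name, f in [("leaky_relu", leaky_relu), ("tanh", np.tanh), ("sigmoid", sigmoid)]:
    print(name, np.round(numerical_derivative(f, xs), 4))
```

If your closed-form derivative is correct, evaluating it at the same points should agree with these estimates to several decimal places.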

# Problem 1: The Deep Learning Recipe (40%)

In this problem, we'll follow the "deep learning recipe" covered in class on the IMDB data.

In [None]:
import numpy as np
import pandas as pd
import glob
import os
%pylab inline
np.random.seed(1234)

## Step 0: load the data

In [None]:
import sys
sys.path.insert(0, ".")
from helpers import load_imdb_data_text
# or copy the loading function from the notes

In [None]:
(train_docs, y_train), (test_docs, y_test) = load_imdb_data_text('../../data/aclImdb/')
print('found {} train docs and {} test docs'.format(len(train_docs), len(test_docs)))

Steps 
 - be one with data
 - set up e2e harness + get dumb baselines
 - overfit
 - regularize
 - tune
 - squeeze

## Step 1: be one with the data
 - make some histograms
 - calculate some summary statistics
 - read a bunch of training examples and discuss any oddities you find
 - finally, turn the data into count vectors
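
As a starting point, the first bullets could look roughly like the sketch below. It runs on a few made-up documents (`toy_docs` and `toy_labels` are hypothetical stand-ins for `train_docs` and `y_train`), and whitespace tokenization is an assumption, so treat the numbers as illustrative only.

```python
import numpy as np

# hypothetical stand-ins for train_docs and y_train
toy_docs = [
    "A wonderful little film.<br /><br />Truly great acting",
    "Terrible. I want my two hours back",
    "Not bad, not great",
]
toy_labels = np.array([1, 0, 1])

# document length in whitespace-separated tokens
lengths = np.array([len(doc.split()) for doc in toy_docs])
print("min / median / max length:", lengths.min(), np.median(lengths), lengths.max())

# class balance: fraction of positive labels
print("fraction positive:", toy_labels.mean())

# one kind of oddity to look for: leftover HTML markup
n_html = sum("<br" in doc for doc in toy_docs)
print("docs containing '<br':", n_html)
```

On the real data, a histogram of `lengths` is a natural first plot.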

In [None]:
# your code here
# make some plots, calculate some summary stats

In [None]:
# your code here
# print out some documents, find some anomalies

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(max_features=50000, lowercase=True)
# vec.fit(...
# x_train = ...
# x_test = ...

## Step 2: set up test harness and get baselines
 - state your baseline models and calculate the log loss and accuracy
   - what is the best constant guess?
   - what about a rules-based model? (e.g. checking if one of a few known words is present)
 - make a function that calculates model performance on the test set
   - `def eval_model(your_model):`
 - make a keras model
   - try to initialize the last layer appropriately (see [here](https://keras.io/api/layers/initializers/))
     - `bias_initializer=Constant(some_constant)`
   - evaluate the model with your function BEFORE training
 - examine data exactly as it is presented to the network
 - make sure you can memorize a batch
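
To make the constant-guess baseline concrete, here is a small sketch that scores several constant predictions on toy labels. The helpers `binary_log_loss` and `accuracy` are written out by hand for illustration (the notebook itself imports the scikit-learn versions), and `y_toy` is a hypothetical stand-in for `y_test`.

```python
import numpy as np

def binary_log_loss(y_true, p, eps=1e-15):
    # mean negative log-likelihood of binary labels under predicted probabilities p
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def accuracy(y_true, p, threshold=0.5):
    # fraction of examples where the thresholded prediction matches the label
    return np.mean((p >= threshold).astype(int) == y_true)

# hypothetical labels standing in for y_test
y_toy = np.array([1, 1, 1, 0])

# try several constant predictions and compare their scores
for guess in [0.25, 0.5, 0.75, 0.9]:
    p = np.full(len(y_toy), guess)
    print(guess, round(binary_log_loss(y_toy, p), 4), accuracy(y_toy, p))
```

Comparing the printed log losses against the fraction of positive labels in `y_toy` should suggest what the best constant guess is.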

In [None]:
from sklearn.metrics import log_loss, accuracy_score


In [None]:
# your code here
# calculate the accuracy and log loss for a constant guess
# calculate the accuracy and log loss for a rules based approach

In [None]:
def eval_model(m):
    # your code here
    # print or return the accuracy and log loss on the test data
    pass  # placeholder so the cell parses; replace with your implementation

In [None]:
# some other keras imports
import keras.backend as K
from keras.initializers import Constant # for last layer initialization

Hint: what value of $x$ do I need for $\sigma(x)$ to be 0.5?
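
More generally, the inverse of the sigmoid (the logit) gives the pre-activation needed for any target probability. The sketch below uses a made-up `target_rate` of 0.3 rather than this dataset's value, and it assumes the output at initialization is dominated by the bias term.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    # inverse of the sigmoid: sigmoid(logit(p)) == p for 0 < p < 1
    return np.log(p / (1.0 - p))

target_rate = 0.3            # hypothetical base rate of the positive class
bias = logit(target_rate)    # a value one might pass to Constant(...)
print(bias, sigmoid(bias))
```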

In [None]:
# make a model
# inpt = Input(shape=...)
# hidden = ... (inpt)
# hidden = ...(hidden)
# ...
# model = ...
# model.compile... # don't forget to compile it

In [None]:
# evaluate the model before training it
eval_model(model)

In [None]:
# examine data as it is presented to the network

In [None]:
# your code here
# print out a few training examples
# they should be vectors of counts.
# turn them back into words
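
Mapping a count vector back to words could look something like this sketch. The vocabulary and the count row are invented for the example; with a fitted `CountVectorizer`, the vocabulary would come from the vectorizer itself (for instance `vec.get_feature_names_out()` in recent scikit-learn versions), and a sparse row would first need to be made dense.

```python
import numpy as np

# hypothetical vocabulary, in column order
vocab = np.array(["bad", "film", "great", "the", "was"])

# hypothetical count vector for the text "the film was great great"
row = np.array([0, 1, 2, 1, 1])

def counts_to_words(row, vocab):
    # return (word, count) pairs for every nonzero entry of a count vector
    idx = np.nonzero(row)[0]
    return [(str(vocab[i]), int(row[i])) for i in idx]

print(counts_to_words(row, vocab))
```

Note that word order is lost: the network only ever sees the counts.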

In [None]:
# make sure you can "memorize" or completely fit a small batch of data
# try the first 100 training examples
# the loss should go to near 0 pretty quickly
#model.fit(...)

In [None]:
eval_model(model)
# at this point, the model is probably overfit

## Step 3: Overfit
 - make the network large, and convince yourself you can overfit the data

In [None]:
# your code here

In [None]:
# fit the model
#model.fit(...)

## Step 4: Regularize
 - use regularizers, dropout, network size, etc

In [None]:
# from keras.regularizers import ...  # choose the regularizer(s) you want
from keras.layers import Dropout

In [None]:
# model code here
# just like you did in the previous part
# add dropout, regularization, maybe remove a Dense layer

In [None]:
# fit the model

## Steps 5 - 6: Tune and Squeeze
It will take a long time to tune the number of units in the Dense layers, so we will skip the tune phase. 

### Todo
 - Retrain the model
 - Make sure to let it train long enough
 - use callbacks to make sure the network stops before overfitting too much 
 - use callbacks to reduce the learning rate appropriately. 
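
The patience idea behind these callbacks can be illustrated in plain Python. This is a simplified sketch of the stopping rule, not the Keras implementation; `early_stopping_epoch` and the validation losses are made up. `ReduceLROnPlateau` applies a similar patience test but, instead of stopping, scales the learning rate by its `factor`.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    # return the 0-based epoch at which training would stop, or None if it never stops
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # validation loss improved: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # no improvement: stop once we have waited `patience` epochs
            wait += 1
            if wait >= patience:
                return epoch
    return None

# hypothetical validation losses: improving, then drifting upward
val_losses = [0.60, 0.45, 0.40, 0.41, 0.43, 0.47]
print(early_stopping_epoch(val_losses, patience=2))
```

A larger `patience` tolerates noisier validation curves at the cost of training longer past the best epoch.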

In [None]:
# model code here

In [None]:
from keras.callbacks import EarlyStopping, ReduceLROnPlateau
# add these callbacks just like we did in class

In [None]:
# fit the model

In [None]:
eval_model(model)
# you should be able to get > 88% accuracy

# Problem 2: Transfer Learning (30 %)
In this problem we will explore a technique called transfer learning. Often, we don't have very much labeled data for the problem at hand (we call it __data-poor__), but we can find labeled data for a similar problem (which we call __data-rich__).

In transfer learning, we use the __data-rich problem__ to train a network with good performance. We then make a similar network for the __data-poor problem__ but use the weights learned from the first problem in this network. This greatly reduces the amount of data needed to train the data-poor problem. You can think of this as reducing the number of free parameters.
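
To make the "fewer free parameters" intuition concrete, here is a NumPy-only sketch that copies hidden-layer weights from one network into another and counts what is left to train. The layer sizes (784 inputs, two hidden layers of 256 units, a 10-unit and a 1-unit output) and the helpers `init_dense` and `n_params` are assumptions for illustration, and the "trained" weights are just random numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_dense(n_in, n_out):
    # small random weights and zero biases for one dense layer
    return {"W": rng.normal(0, 0.01, size=(n_in, n_out)), "b": np.zeros(n_out)}

def n_params(layers):
    # total number of weights and biases across the given layers
    return sum(layer["W"].size + layer["b"].size for layer in layers)

# hypothetical data-rich network: 784 -> 256 -> 256 -> 10 (pretend it has been trained)
rich = [init_dense(784, 256), init_dense(256, 256), init_dense(256, 10)]

# data-poor network: reuse the two hidden layers, add a fresh 1-unit output layer
hidden_transferred = [{k: v.copy() for k, v in layer.items()} for layer in rich[:2]]
new_head = init_dense(256, 1)
poor = hidden_transferred + [new_head]

print("all parameters:     ", n_params(poor))
print("trainable if frozen:", n_params([new_head]))
```

If the transferred layers are frozen, only the new output layer has to be learned from the small dataset.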

Here, we will use the MNIST digit recognition problem. We will pretend that we are interested in telling the difference between the digits `4` and `9`, but we only have 10 labeled examples. We will pretend that we have tons of labeled examples of all of the other digits.

In [None]:
import numpy as np
import pandas as pd
%pylab inline

In [None]:
# add some imports

np.random.seed(1234)

# $ \\ $

## Part 0: Subset the data into two datasets
 1. One part will have `x_train_49`, `y_train_49`, etc. which has only `4`s and `9`s. 
 2. The second part will have variables `x_train_rest` etc, which will have the rest of the data and none of the digits `4` and `9`. 

In [None]:
from keras.datasets import mnist  # needed for mnist.load_data() below
from keras.utils import to_categorical

def preprocess_training_data(data):
    data = data.reshape(data.shape[0], data.shape[1] * data.shape[2])
    data = data.astype('float32') / 255
    return data

def preprocess_targets(target, num_classes):
    return to_categorical(target, num_classes)


def subset_to_9_and_4(x, y):  # this is a new function
    mask = (y == 9) | (y == 4)
    new_x = x[mask]
    new_y = (y[mask] == 4).astype('int64')
    return new_x, new_y

def subset_to_rest(x, y):  # this is a new function
    mask = ~((y == 9) | (y == 4))
    new_x = x[mask]
    new_y = y[mask]
    return new_x, new_y


(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = preprocess_training_data(x_train)
x_test = preprocess_training_data(x_test)

num_classes = np.unique(y_train).shape[0]

y_train_ohe = preprocess_targets(y_train, num_classes)
y_test_ohe = preprocess_targets(y_test, num_classes)

train_frac = 0.8
cutoff = int(x_train.shape[0] * train_frac)
x_train, x_val = x_train[:cutoff], x_train[cutoff:]
y_train, y_val = y_train[:cutoff], y_train[cutoff:]
y_train_ohe, y_val_ohe = y_train_ohe[:cutoff], y_train_ohe[cutoff:]

x_train_49, y_train_49 = subset_to_9_and_4(x_train, y_train)
x_val_49, y_val_49 = subset_to_9_and_4(x_val, y_val)
x_test_49, y_test_49 = subset_to_9_and_4(x_test, y_test)

print(x_train_49.shape)

x_train_rest, y_train_rest = subset_to_rest(x_train, y_train)
x_test_rest, y_test_rest = subset_to_rest(x_test, y_test)

y_train_rest_ohe = to_categorical(y_train_rest, num_classes)
y_test_rest_ohe = to_categorical(y_test_rest, num_classes)
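
A quick way to convince yourself that the two subsetting functions behave as intended is to run them on a handful of toy labels. The sketch below repeats compact versions of the functions so it is self-contained; `x_toy` and `y_toy` are made-up stand-ins for the MNIST arrays.

```python
import numpy as np

def subset_to_9_and_4(x, y):
    # keep only 4s and 9s; relabel so that 4 -> 1 and 9 -> 0
    mask = (y == 9) | (y == 4)
    return x[mask], (y[mask] == 4).astype('int64')

def subset_to_rest(x, y):
    # keep everything except 4s and 9s, with the original labels
    mask = ~((y == 9) | (y == 4))
    return x[mask], y[mask]

# toy labels and one-feature inputs standing in for MNIST
y_toy = np.array([4, 9, 0, 7, 4, 1])
x_toy = np.arange(len(y_toy)).reshape(-1, 1)

x49, y49 = subset_to_9_and_4(x_toy, y_toy)
x_rest, y_rest = subset_to_rest(x_toy, y_toy)

print(y49)      # 1 marks a 4, 0 marks a 9
print(y_rest)   # no 4s or 9s remain
```

The same checks (binary labels in one subset, no `4`s or `9`s in the other, sizes adding up to the total) can be applied to the real arrays.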



# $ \\ $
## Now we will throw away most of the training data for the 4-9 problem
 - we will keep only 10 points

In [None]:
num_points = 10
x_train_49, y_train_49 = x_train_49[:num_points], y_train_49[:num_points]


# $ \\ $

## Part 1: Build a neural network to fit the `rest` data.
 - ### Include 2 densely connected hidden layers with 256 neurons each.
 - The output dimension should be either 8 or 10, depending on how you do the problem
 - ### Compute the accuracy score for this model

# $ \\ $

In [None]:
K.clear_session()
num_hidden_units = 256

In [None]:
digit_input = Input(shape=(x_train_rest.shape[1],), name='digit_input')
# add code here
#model_rest = ...
#model_rest.compile(...


### Fit the model for 10 epochs and compute the accuracy score

In [None]:
#model_rest.fit(...

In [None]:
#accuracy_score(...

# $ \\ $ 
## Part 2: Fit a model on the `4`-`9` data
 - ### Use the same 2 densely-connected layers with 256 hidden units
 - ### Here the output layer could have one or two units, depending on how you set up the problem
 - ### NB: DO NOT use `K.clear_session()` because we need stuff for later. 

In [None]:

digit_input_49 = Input(shape=(x_train_49.shape[1],), name='digit_input')
# add code here
#model49 = Model(...
#model49.compile(...


In [None]:
#model49.fit( ... (NB try epochs=1000)


In [None]:
# accuracy_score...
# f1_score...

# $ \\ $ 
## Part 3: Transfer Learning:
 - ### Make an identical model to part 2, but take the weights learned from the original model on the rest of the data.
 - ### NB: the `Dense` layer takes a `weights=` keyword argument
 - ### Try making the layers static or trainable.


In [None]:
digit_input_transfer = Input(shape=(x_train_49.shape[1],), name='digit_input')
# add code here
#model_transfer = Model(...
#model_transfer.compile(...


In [None]:
# model_transfer.fit(...    epochs=100, 
# accuracy_score...
# f1_score...

## Part 4: Analysis:
 - We only transferred the first two layers and not the last one. Why?
 - Write the answer in a markdown cell

# Problem 3: Data Augmentation (20%)
Another way to prevent overfitting is to augment the data.
More data is always better, but sometimes we can't easily collect more data. 
A set of techniques for turning our current data set into a bigger one is called `data augmentation`. 

Data augmentation can take many forms, and is specific to the data and the problem being solved. 
For example, in an image recognition problem, it is very common to rotate, crop, and zoom
images to generate new ones. We can think of this as a form of regularization, since we are, 
in some sense, imposing a penalty if the model does not have rotation/scale invariance. 
In speech recognition, this can take the form of distorting an audio clip to have higher pitches
(e.g. speeding it up), which should "teach" a model that it should be pitch invariant. 
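
For images, a bare-bones version of this idea might look like the following NumPy sketch. The helpers `random_crop` and `augment` are hypothetical, and rotation by multiples of 90 degrees is a crude stand-in for the small-angle rotations used in practice. Whether a given transform preserves the label depends on the problem.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(img, size):
    # cut a random size x size window out of a 2-D image
    h, w = img.shape
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return img[top:top + size, left:left + size]

def augment(img, crop_size):
    # one random augmentation: a rotation by a multiple of 90 degrees, then a crop
    k = int(rng.integers(0, 4))
    return random_crop(np.rot90(img, k=k), crop_size)

# stand-in for one 28 x 28 grayscale image
img = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)

# eight augmented variants of the same image
batch = np.stack([augment(img, crop_size=24) for _ in range(8)])
print(batch.shape)
```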

In text classification problems, it is typically a little more difficult to augment data. 
One common method is known as back-translation: if an automated machine translation model is 
available, we can translate our text into one language (e.g. English to French) and then back
to the original language again (French to English). This typically yields a very similar 
piece of text to the original, but with different words. 

Here we'll try a simpler approach. In a low-data setting, we do not want the model to be too sensitive
to any given word. Accordingly, we can augment our data by creating additional examples which are 
identical to our current example, but with some words set to unknown words.
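
A minimal sketch of this word-dropping augmentation is shown below. It assumes whitespace tokenization, and the names `drop_words` and `augment_docs` as well as the default `drop_frac` are choices made for this example rather than anything prescribed by the assignment.

```python
import random

def drop_words(doc, drop_frac=0.1, rng=None):
    # return a copy of doc with roughly drop_frac of its whitespace-separated words removed
    rng = rng or random.Random()
    words = doc.split()
    kept = [w for w in words if rng.random() >= drop_frac]
    # keep at least one word so the document is never empty
    return " ".join(kept) if kept else (words[0] if words else "")

def augment_docs(docs, labels, n_copies=1, drop_frac=0.1, seed=0):
    # originals plus n_copies word-dropped variants of each document, with matching labels
    rng = random.Random(seed)
    new_docs, new_labels = list(docs), list(labels)
    for _ in range(n_copies):
        for doc, label in zip(docs, labels):
            new_docs.append(drop_words(doc, drop_frac, rng))
            new_labels.append(label)
    return new_docs, new_labels

docs = ["this movie was surprisingly good", "a dull and lifeless film"]
aug_docs, aug_labels = augment_docs(docs, [1, 0], n_copies=2, drop_frac=0.2)
print(len(aug_docs), aug_labels)
```

The augmented documents can then be passed through the same vectorizer as the originals. Only the training set should be augmented, so that the test metrics stay comparable between the two models.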

This problem is more open-ended.
TODO:
 - Load and process the IMDB sentiment data
 - Train two identical models. For one of them, randomly remove some fraction of the words from the training data (this is equivalent to having the model pretend that it is seeing some fraction of unknown words, since unknown words are skipped).
 - Discuss the results. 
   - What is the result of dropping words?
   - How does it compare to the image / audio methods described here?