_Note_: The three exercises in this tutorial can be done in any order. Decide which interests you the most, and start with that one. You don't have to do all of them.

## Installation

1. If you haven't already installed Python3, get it from [Python.org](https://www.python.org/downloads/)
1. If you haven't already installed Jupyter Notebook, run `python3 -m pip install jupyter`
1. In Terminal, cd to the folder in which you downloaded this file and run `jupyter notebook`. This should open up a page in your web browser that shows all of the files in the current directory, so that you can open this file. You will need to leave this Terminal window up and running and use a different one for the rest of the instructions.
1. If you didn't install keras previously, install it now:
    1. Install the tensorflow machine learning library by typing the following into Terminal:
    `pip3 install --upgrade tensorflow`
    1. Install the keras machine learning library by typing the following into Terminal:
    `pip3 install keras`


## Documentation/Sources
* [https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/](https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/) for information on sequence classification with keras
* [https://keras.io/](https://keras.io/) Keras API documentation
* [Keras recurrent tutorial](https://github.com/Vict0rSch/deep_learning/tree/master/keras/recurrent)

## The IMDB Dataset
The [IMDB dataset](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification) consists of movie reviews (x_train) that have been marked as positive or negative (y_train). See the [Word Vectors Tutorial](https://github.com/jennselby/MachineLearningTutorials/blob/master/WordVectors.ipynb) for more details on the IMDB dataset.

In [1]:
from keras.datasets import imdb
from keras.preprocessing import sequence

Using TensorFlow backend.


In [2]:
(imdb_x_train, imdb_y_train), (imdb_x_test, imdb_y_test) = imdb.load_data()

For a standard keras model, every input has to be the same length, so we need to choose a length at which we will cut off the rest of the review. (We will also need to pad the shorter reviews with zeros to make them the same length.)

In [3]:
cutoff = 500

In [4]:
imdb_x_train_padded = sequence.pad_sequences(imdb_x_train, maxlen=cutoff)
imdb_x_test_padded = sequence.pad_sequences(imdb_x_test, maxlen=cutoff)
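
As a rough illustration of what the padding step does to a single review, here is a minimal pure-Python sketch. It assumes the default `pad_sequences` settings (`padding='pre'`, `truncating='pre'`, `value=0`), which means zeros are added at the front of short reviews and long reviews keep their last `maxlen` words. The helper name `pad_or_truncate` is hypothetical and not part of keras.

```python
def pad_or_truncate(tokens, maxlen, value=0):
    """Mimic the assumed default pad_sequences behaviour for one sequence."""
    if len(tokens) >= maxlen:
        # truncating='pre': drop tokens from the start, keep the last maxlen
        return list(tokens[len(tokens) - maxlen:])
    # padding='pre': the zeros go in front of the real tokens
    return [value] * (maxlen - len(tokens)) + list(tokens)


short_review = [11, 12, 13]
long_review = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short_review, maxlen=5))  # [0, 0, 11, 12, 13]
print(pad_or_truncate(long_review, maxlen=5))   # [3, 4, 5, 6, 7]
```

With these defaults, the end of a long review is what reaches the model, not the beginning.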

## Classification

In [5]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

Define our model.

Unlike last time, when we used convolutional layers, we're going to use an LSTM, a special type of recurrent network.

Using recurrent networks means that rather than seeing these reviews as one input happening all at once, with the convolutional layers taking into account which words are next to each other, we are going to see them as a sequence of inputs, with one word occurring at each timestep.

In [8]:
imdb_lstm_model = Sequential()
imdb_lstm_model.add(Embedding(input_dim=len(imdb.get_word_index())+3, output_dim=100, input_length=cutoff))
# return_sequences=True tells the first LSTM to output its full sequence (one output per timestep), for use
# by the next LSTM layer. The final LSTM layer should return only its last output, for use in the Dense output layer
imdb_lstm_model.add(LSTM(units=32, return_sequences=True))
imdb_lstm_model.add(LSTM(units=32))
imdb_lstm_model.add(Dense(units=1, activation='sigmoid')) # because at the end, we want one yes/no answer
imdb_lstm_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])
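
To make the effect of `return_sequences` concrete, the following sketch tracks only the tensor shapes through the model above, without running keras. The helper `recurrent_output_shape` is hypothetical, and the batch size of 64 is an assumption chosen for illustration.

```python
def recurrent_output_shape(input_shape, units, return_sequences):
    """Shape produced by a recurrent layer for a (batch, timesteps, features) input."""
    batch, timesteps, _features = input_shape
    if return_sequences:
        return (batch, timesteps, units)  # one output vector per timestep
    return (batch, units)                 # only the output at the last timestep


batch = 64                                  # assumed batch size
padded_shape = (batch, 500)                 # 500 word indices per review
embedded_shape = padded_shape + (100,)      # each index becomes a 100-d vector
first_lstm_shape = recurrent_output_shape(embedded_shape, 32, return_sequences=True)
second_lstm_shape = recurrent_output_shape(first_lstm_shape, 32, return_sequences=False)
dense_shape = (second_lstm_shape[0], 1)     # one yes/no score per review

print(first_lstm_shape)   # (64, 500, 32)
print(second_lstm_shape)  # (64, 32)
print(dense_shape)        # (64, 1)
```

The second LSTM needs a three-dimensional input, which is why the first one has to return its full sequence.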

Train the model. __This takes a while. You might not want to re-run it.__

In [7]:
imdb_lstm_model.fit(imdb_x_train_padded, imdb_y_train, epochs=1, batch_size=64)

Epoch 1/1


<keras.callbacks.History at 0x113379438>

Assess the model. __This takes a while. You might not want to re-run it.__

In [7]:
imdb_lstm_scores = imdb_lstm_model.evaluate(imdb_x_test_padded, imdb_y_test)
print('loss: {} accuracy: {}'.format(*imdb_lstm_scores))

loss: 0.6931445028114319 accuracy: 0.50072
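
The loss printed above is very close to ln 2 (about 0.6931), and the accuracy is about 50%. The sketch below, which uses made-up labels, shows that binary cross-entropy comes out to exactly ln 2 when a model predicts 0.5 for every example, so these scores look like those of a model that is guessing. One possible cause, and this is only an assumption based on the cell numbers, is that the model definition cell was re-run after training, which would reset the weights.

```python
import math


def binary_crossentropy(y_true, y_pred):
    """Mean binary cross-entropy for lists of labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


labels = [0, 1, 1, 0]                  # made-up labels, for illustration only
always_half = [0.5] * len(labels)      # a model that never commits either way
chance_loss = binary_crossentropy(labels, always_half)

print(chance_loss)   # approximately 0.6931
print(math.log(2))   # approximately 0.6931
```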


## Exploring Simple Recurrent Layers

**Before we dive into something as complicated as LSTMs, let's take a deeper look at simple recurrent layer weights.**

In [6]:
import numpy
from keras.layers import SimpleRNN

**The neurons in the recurrent layer pass their output to the next layer, but also back to themselves. The input shape says that we'll be passing in one-dimensional inputs of unspecified length (the None is what makes it unspecified).**

In [11]:
one_unit_SRNN = Sequential()
one_unit_SRNN.add(SimpleRNN(units=1, input_shape=(None, 1), activation='linear', use_bias=False))

In [12]:
one_unit_SRNN_weights = one_unit_SRNN.get_weights()
one_unit_SRNN_weights

[array([[1.2853907]], dtype=float32), array([[-1.]], dtype=float32)]

In [13]:
one_unit_SRNN_weights[0][0][0] = 2
one_unit_SRNN_weights[1][0][0] = -1
one_unit_SRNN.set_weights(one_unit_SRNN_weights)
one_unit_SRNN.get_weights()

[array([[2.]], dtype=float32), array([[-1.]], dtype=float32)]

**This passes in a single sample that has three time steps.**

In [14]:
one_unit_SRNN.predict(numpy.array([ [[-1], [3], [7]] ]))

array([[6.]], dtype=float32)

# Exercise 2a
**Figure out what the two weights in the one_unit_SRNN model control. Be sure to test your hypothesis thoroughly. Use different weights and different inputs.**

After testing with different weights and inputs, I found that the first weight multiplies each input, and the second weight multiplies the previous output before it is added to the current weighted input.
For example, with the weights and inputs above, the network runs like this:
1. 'input 1' (-1) is multiplied by 'weight 1' (2) --> 'output 1' is -2 (there is no previous output yet)
2. 'input 2' (3) is multiplied by 'weight 1' (2) --> 'weighted input 2' is 6
3. 'output 1' (-2) is multiplied by 'weight 2' (-1) and added to 'weighted input 2' (6) --> 'output 2' is 6+2=8
4. 'input 3' (7) is multiplied by 'weight 1' (2) --> 'weighted input 3' is 14
5. 'output 2' (8) is multiplied by 'weight 2' (-1) and added to 'weighted input 3' (14) --> 'output 3' is 14+(-8)=6

So the final output is 6.

To generalize for any timestep *n*:

1. (input *n*) x weight 1 = weighted input *n*
2. (weighted input *n*) + (output *n-1* x weight 2) = output *n*
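
The steps above can be sketched as a short loop in plain Python. This is a minimal re-implementation of the hypothesis, assuming a linear activation and no bias as in the model above. The function name `one_unit_srnn` is made up for illustration.

```python
def one_unit_srnn(inputs, input_weight, recurrent_weight):
    """Final output of a one-unit simple recurrent layer (linear, no bias)."""
    output = 0.0  # before the first timestep there is no previous output
    for x in inputs:
        output = x * input_weight + output * recurrent_weight
    return output


result = one_unit_srnn([-1, 3, 7], input_weight=2, recurrent_weight=-1)
print(result)  # 6.0
```

Running it with other weights and inputs, and comparing against `one_unit_SRNN.predict`, is one way to test the hypothesis more thoroughly.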


**Let's try a slightly larger simple recurrent model.**

In [15]:
two_unit_SRNN = Sequential()
two_unit_SRNN.add(SimpleRNN(units=2, input_shape=(None, 1), activation='linear', use_bias=False))

In [16]:
two_unit_SRNN_weights = two_unit_SRNN.get_weights()
two_unit_SRNN_weights

[array([[-1.0283942 ,  0.95512116]], dtype=float32),
 array([[-0.0847251 ,  0.99640435],
        [ 0.99640435,  0.0847251 ]], dtype=float32)]

In [60]:
two_unit_SRNN_weights[0][0][0] = 1
two_unit_SRNN_weights[0][0][1] = 1
two_unit_SRNN_weights[1][0][0] = 0
two_unit_SRNN_weights[1][0][1] = 0
two_unit_SRNN_weights[1][1][0] = 0
two_unit_SRNN_weights[1][1][1] = 0
two_unit_SRNN.set_weights(two_unit_SRNN_weights)
two_unit_SRNN.get_weights()

[array([[1., 1.]], dtype=float32), array([[0., 0.],
        [0., 0.]], dtype=float32)]

**This passes in a single sample with four time steps.**

In [61]:
two_unit_SRNN.predict(numpy.array([ [[1], [2], [3], [4]] ]))

array([[4., 4.]], dtype=float32)

# Exercise 2b
**What does each of the six weights of the two_unit_SRNN control? Again, test out your hypotheses carefully.**

After testing many different weights (starting with setting the 'odd' weights to 0 and the 'even' weights to 1, and vice versa), I started noticing patterns in the output. Counting the weights in the order they are printed, I realized that the 'odd' weights (the first, third, and fifth) control the first output and the 'even' weights (the second, fourth, and sixth) control the second output. For example, when I set the weights to [1,0,1,0,1,0] and the input to [1,2,3,4], the output was [10,0]. I also noticed that the first output (10) was the sum of all of the inputs.

I then tested each weight independently by setting them to 1 one-by-one, with the rest at 0. I noticed that when the first weight was 1 and the rest 0, the network outputted [last input, 0]. Similarly, when I set the second weight to 1 and the rest 0, the model outputted [0, last input]. This led me to understand which weights affected which output.
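
The observations above are consistent with the following plain-Python sketch, in which each unit's new output is its weighted input plus a weighted sum of both units' previous outputs. It assumes a linear activation and no bias, as in the model above. The name `simple_rnn` and the argument names are hypothetical.

```python
def simple_rnn(inputs, kernel, recurrent):
    """Final outputs of a simple recurrent layer (linear, no bias).

    kernel[j] weights the input for unit j.
    recurrent[i][j] weights the previous output of unit i going into unit j.
    """
    units = len(kernel)
    state = [0.0] * units  # previous outputs, all zero before the first timestep
    for x in inputs:
        state = [
            x * kernel[j] + sum(state[i] * recurrent[i][j] for i in range(units))
            for j in range(units)
        ]
    return state


# Weights set in the cell above: both input weights 1, all recurrent weights 0
print(simple_rnn([1, 2, 3, 4], [1, 1], [[0, 0], [0, 0]]))  # [4.0, 4.0]

# The [1,0,1,0,1,0] experiment: the first unit sums all of the inputs
print(simple_rnn([1, 2, 3, 4], [1, 0], [[1, 0], [1, 0]]))  # [10.0, 0.0]
```

In this layout the first, third, and fifth weights sit in the first column and feed the first output, which matches the 'odd' and 'even' pattern described above.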

Note to Jen: I understand how the two unit SRNN weights affect the data but didn't have enough time to write it down here. I have a diagram I can submit a photo of. Please let me know if you would want me to do that or finish this explanation. Thank you!

## Exploring LSTMs


In [19]:
one_unit_LSTM = Sequential()
one_unit_LSTM.add(LSTM(units=1, input_shape=(None, 1),
                       activation='linear', recurrent_activation='linear',
                       use_bias=False, unit_forget_bias=False,
                       kernel_initializer='zeros',
                       recurrent_initializer='zeros',
                       return_sequences=True))

In [20]:
one_unit_LSTM_weights = one_unit_LSTM.get_weights()
one_unit_LSTM_weights

[array([[0., 0., 0., 0.]], dtype=float32),
 array([[0., 0., 0., 0.]], dtype=float32)]

In [21]:
one_unit_LSTM_weights[0][0][0] = 1
one_unit_LSTM_weights[0][0][1] = 0
one_unit_LSTM_weights[0][0][2] = 1
one_unit_LSTM_weights[0][0][3] = 1
one_unit_LSTM_weights[1][0][0] = 0
one_unit_LSTM_weights[1][0][1] = 0
one_unit_LSTM_weights[1][0][2] = 0
one_unit_LSTM_weights[1][0][3] = 0
one_unit_LSTM.set_weights(one_unit_LSTM_weights)
one_unit_LSTM.get_weights()

[array([[1., 0., 1., 1.]], dtype=float32),
 array([[0., 0., 0., 0.]], dtype=float32)]

In [22]:
one_unit_LSTM.predict(numpy.array([ [[0], [1], [2], [4]] ]))

array([[[ 0.],
        [ 1.],
        [ 8.],
        [64.]]], dtype=float32)

# Exercise 2c
Conceptually, the [LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) has several _gates_:

* __Forget gate__: these weights allow some long-term memories to be forgotten.
* __Input gate__: these weights decide what new information will be added to the context cell.
* __Output gate__: these weights decide what pieces of the new information and updated context will be passed on to the output.

It also has a __cell__ that can hold onto information from the current input (as well as things it has remembered from previous inputs), so that it can be used in later outputs.

Identify which weights in the one_unit_LSTM model are connected with the context (the cell) and which are associated with the three gates.

_Note_: The output from the predict call is what the linked explanation calls $h_{t}$.
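
One way to test a hypothesis for this exercise is to re-implement the one-unit LSTM in plain Python and compare it with the predict output. The sketch below assumes linear activations and no bias, as in the model above. It also assumes that the four columns of each weight array are ordered input gate, forget gate, cell candidate, output gate. That ordering is passed in as the `gate_order` argument so that other orderings can be tried. The function name `one_unit_lstm` is hypothetical.

```python
def one_unit_lstm(inputs, kernel, recurrent, gate_order=("i", "f", "c", "o")):
    """Outputs h_t at every timestep for a one-unit LSTM (linear, no bias).

    kernel[k] weights the current input and recurrent[k] weights the previous
    output for whichever gate gate_order assigns to column k (an assumption).
    """
    h, c = 0.0, 0.0  # previous output and cell state start at zero
    outputs = []
    for x in inputs:
        pre = {
            name: x * kernel[k] + h * recurrent[k]
            for k, name in enumerate(gate_order)
        }
        # forget gate scales the old cell, input gate scales the new candidate
        c = pre["f"] * c + pre["i"] * pre["c"]
        # output gate scales the (linearly activated) cell
        h = pre["o"] * c
        outputs.append(h)
    return outputs


lstm_outputs = one_unit_lstm([0, 1, 2, 4], kernel=[1, 0, 1, 1], recurrent=[0, 0, 0, 0])
print(lstm_outputs)  # [0.0, 1.0, 8.0, 64.0]
```

With the weights set above, each output is the cube of the input, because the input passes through the input gate, the cell candidate, and the output gate while the forget gate is zero. Matching one set of weights does not settle the ordering, so it is worth changing one weight at a time in both the sketch and the keras model.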