# Understanding the Weights in RNNs

## Instructions
0. If you haven't already, follow [the setup instructions here](https://jennselby.github.io/MachineLearningCourseNotes/#setting-up-python3) to get all necessary software installed.
0. Look at the code in [Part A: Single Unit Simple Recurrent Layer](#Part-A:-Single-Unit-Simple-Recurrent-Layer) and complete the [Part A Exercise](#Part-A-Exercise)
0. Look at the code in [Part B: Two Unit Simple Recurrent Layer](#Part-B:-Two-Unit-Simple-Recurrent-Layer) and complete the [Part B Exercise](#Part-B-Exercise)
0. Optionally, look at the code in [Part C: LSTM Layer](#Part-C:-LSTM-Layer) and complete the [Part C Exercise](#Part-C-Exercise)

## Documentation/Sources
* [Class Notes](https://jennselby.github.io/MachineLearningCourseNotes/#recurrent-neural-networks)
* [https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/](https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/) for information on sequence classification with keras
* [https://keras.io/](https://keras.io/) Keras API documentation
* [Keras recurrent tutorial](https://github.com/Vict0rSch/deep_learning/tree/master/keras/recurrent)

## Part A: Single Unit Simple Recurrent Layer

Before we dive into something as complicated as LSTMs, let's take a deeper look at the weights of a simple recurrent layer.

In [3]:
import numpy
from keras.layers import SimpleRNN
from keras.models import Sequential
from keras.layers import LSTM

The neurons in the recurrent layer pass their output to the next layer, but also back to themselves. The `input_shape=(None, 1)` argument says that each sample is a sequence with one feature per time step and an unspecified number of time steps (the `None` is what leaves the length unspecified).

In [6]:
one_unit_SRNN = Sequential()
one_unit_SRNN.add(SimpleRNN(units=1,input_shape=(None,1), activation='linear', use_bias=False))

In [7]:
one_unit_SRNN_weights = one_unit_SRNN.get_weights()
one_unit_SRNN_weights

[array([[-0.06465936]], dtype=float32), array([[-1.]], dtype=float32)]

We can set the weights to whatever we want, to test out what happens with different weight values.

In [8]:
one_unit_SRNN_weights[0][0][0] = 1
one_unit_SRNN_weights[1][0][0] = 1
one_unit_SRNN.set_weights(one_unit_SRNN_weights)
one_unit_SRNN.get_weights()

[array([[1.]], dtype=float32), array([[1.]], dtype=float32)]

We can then pass in different input values, to see what the model outputs.

The code below passes in a single sample that has three time steps.

In [9]:
one_unit_SRNN.predict(numpy.array([ [[3], [3], [7]] ]))

array([[13.]], dtype=float32)
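The triple nesting in the input can be easy to misread. As a small illustration in plain Python (`nested_shape` is a hypothetical helper written for this note, not a Keras or NumPy function), the three levels correspond to samples, time steps, and features per time step. For a regular nested list like this one, `numpy.array(...).shape` reports the same tuple.

```python
def nested_shape(data):
    """Return the dimensions of a regularly nested list as a tuple."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0]  # descend one level; assumes all sublists have equal length
    return tuple(shape)


one_sample = [[[3], [3], [7]]]
print(nested_shape(one_sample))  # (1, 3, 1): 1 sample, 3 time steps, 1 feature

two_samples = [[[3], [3], [7]], [[1], [0], [2]]]
print(nested_shape(two_samples))  # (2, 3, 1): 2 samples, 3 time steps, 1 feature
```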

## Part A Exercise
Figure out what the two weights in the one_unit_SRNN model control. Be sure to test your hypothesis thoroughly. Use different weights and different inputs.

In [14]:
one_unit_SRNN_weights[0][0][0] = 1
one_unit_SRNN_weights[1][0][0] = 2
one_unit_SRNN.set_weights(one_unit_SRNN_weights)
one_unit_SRNN.get_weights()

[array([[1.]], dtype=float32), array([[2.]], dtype=float32)]

In [22]:
print(one_unit_SRNN.predict(numpy.array([ [[1], [0]] ])))
print(one_unit_SRNN.predict(numpy.array([ [[0], [1]] ])))

[[2.]]
[[1.]]


The above results indicate that the `[0][0][0]` weight multiplies the input at the current timestep, while the `[1][0][0]` weight multiplies the output carried over from the previous timestep. In the first case, we get $2\cdot (1\cdot 1)+1\cdot 0=2$, while in the second we get $2\cdot (1\cdot 0)+1\cdot 1=1$.
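The rule inferred above can be checked without Keras. The following is a minimal sketch in plain Python, assuming a linear activation, no bias, and an initial state of zero (matching how the model was built); `simulate_one_unit` is a hypothetical helper name, not part of Keras.

```python
def simulate_one_unit(inputs, input_weight, recurrent_weight):
    """Return the final output of a one-unit linear recurrent layer."""
    state = 0.0  # assumed initial state, before any input is seen
    for x in inputs:
        # new output = input weight * current input
        #            + recurrent weight * previous output
        state = input_weight * x + recurrent_weight * state
    return state


# Weights [0][0][0] = 1 and [1][0][0] = 2, as set in the exercise
print(simulate_one_unit([1, 0], 1, 2))  # 2.0
print(simulate_one_unit([0, 1], 1, 2))  # 1.0
# Both weights equal to 1, as in Part A
print(simulate_one_unit([3, 3, 7], 1, 1))  # 13.0
```

All three values agree with the `predict` outputs shown earlier.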

In [31]:
# This should have the value 2(2(2(...)+1)+1)+1=1+2+2^2+...+2^19=2^20-1=1,048,575
print(one_unit_SRNN.predict(numpy.array([ [[1] for i in range(20)] ])))

[[1048575.]]
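Unrolling the recurrence gives a single weighted sum: each input is multiplied once by the input weight and then by the recurrent weight once for every time step that follows it. The sketch below writes that sum directly, under the same assumptions as the model (linear activation, no bias, zero initial state); `unrolled_output` is a hypothetical name used only for illustration.

```python
def unrolled_output(inputs, input_weight, recurrent_weight):
    """Final output of a one-unit linear recurrent layer, written as one sum."""
    steps = len(inputs)
    return sum(
        # (steps - 1 - t) is the number of time steps that follow input t
        input_weight * recurrent_weight ** (steps - 1 - t) * x
        for t, x in enumerate(inputs)
    )


# Twenty inputs of 1 with input weight 1 and recurrent weight 2:
# 1 + 2 + 2**2 + ... + 2**19 = 2**20 - 1
print(unrolled_output([1] * 20, 1, 2))  # 1048575
```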


## Part B: Two Unit Simple Recurrent Layer

In [32]:
two_unit_SRNN = Sequential()
two_unit_SRNN.add(SimpleRNN(units=2, input_shape=(None, 1), activation='linear', use_bias=False))

In [33]:
two_unit_SRNN_weights = two_unit_SRNN.get_weights()
two_unit_SRNN_weights

[array([[1.1917552, 1.2786132]], dtype=float32),
 array([[ 0.73880625, -0.67391807],
        [-0.67391807, -0.7388061 ]], dtype=float32)]

In [42]:
two_unit_SRNN_weights[0][0][0] = 1
two_unit_SRNN_weights[0][0][1] = 1
two_unit_SRNN_weights[1][0][0] = 0
two_unit_SRNN_weights[1][0][1] = 1
two_unit_SRNN_weights[1][1][0] = 0
two_unit_SRNN_weights[1][1][1] = 1
two_unit_SRNN.set_weights(two_unit_SRNN_weights)
two_unit_SRNN.get_weights()

[array([[1., 1.]], dtype=float32),
 array([[0., 1.],
        [0., 1.]], dtype=float32)]

This passes in a single sample with four time steps.

In [43]:
two_unit_SRNN.predict(numpy.array([ [[3], [3], [7], [5]] ]))

array([[ 5., 31.]], dtype=float32)

## Part B Exercise
What does each of the six weights of the two_unit_SRNN control? Again, test out your hypotheses carefully.

In [41]:
two_unit_SRNN.predict(numpy.array([ [[1], [0], [0], [1]] ]))

array([[1., 0.]], dtype=float32)

In [49]:
two_unit_SRNN_weights[0][0][0] = 1
two_unit_SRNN_weights[0][0][1] = 0
two_unit_SRNN_weights[1][0][0] = 1
two_unit_SRNN_weights[1][0][1] = 0
two_unit_SRNN_weights[1][1][0] = 0
two_unit_SRNN_weights[1][1][1] = 1
two_unit_SRNN.set_weights(two_unit_SRNN_weights)
two_unit_SRNN.get_weights()

[array([[1., 0.]], dtype=float32),
 array([[1., 0.],
        [0., 1.]], dtype=float32)]

In [48]:
two_unit_SRNN.predict(numpy.array([ [[1], [0], [0], [1]] ]))

array([[2., 0.]], dtype=float32)

Write the two weights starting with `[0]` as $a$ and $b$ and put the other four weights into a matrix $M$ as they are displayed by Python. Then, at each timestep, the network performs the operation
$$y_n=M^Ty_{n-1}+x_n\begin{bmatrix}a\\b\end{bmatrix}.$$
(The transpose has no deeper mathematical significance. It appears because Keras treats the state as a row vector and computes $y_{n-1}M$, which becomes $M^Ty_{n-1}$ when written with column vectors.)
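To make the update rule concrete, here is a minimal plain-Python sketch of the same computation, assuming a linear activation, no bias, and a zero initial state. `simulate_two_unit` is a hypothetical helper, and the matrix is passed in exactly as `get_weights()` displays it.

```python
def simulate_two_unit(inputs, kernel, recurrent):
    """Final output [y0, y1] of a two-unit linear recurrent layer.

    kernel is [a, b]; recurrent is the 2x2 matrix M as displayed by get_weights().
    """
    y0, y1 = 0.0, 0.0  # assumed initial state
    a, b = kernel
    for x in inputs:
        # Row-vector form: new_y = x * [a, b] + [y0, y1] @ M
        y0, y1 = (
            a * x + y0 * recurrent[0][0] + y1 * recurrent[1][0],
            b * x + y0 * recurrent[0][1] + y1 * recurrent[1][1],
        )
    return [y0, y1]


# Weights from Part B
print(simulate_two_unit([3, 3, 7, 5], [1, 1], [[0, 1], [0, 1]]))  # [5.0, 31.0]
# Identity recurrent matrix: unit 0 simply sums the inputs
print(simulate_two_unit([1, 0, 0, 1], [1, 0], [[1, 0], [0, 1]]))  # [2.0, 0.0]
```

Row `i` of the displayed matrix holds the weights leaving unit `i`, and column `j` holds the weights arriving at unit `j`.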

In [72]:
two_unit_SRNN_weights[0][0][0] = 2
two_unit_SRNN_weights[0][0][1] = 1
two_unit_SRNN_weights[1][0][0] = -2
two_unit_SRNN_weights[1][0][1] = 3
two_unit_SRNN_weights[1][1][0] = 4
two_unit_SRNN_weights[1][1][1] = -7
two_unit_SRNN.set_weights(two_unit_SRNN_weights)
two_unit_SRNN.get_weights()

[array([[2., 1.]], dtype=float32),
 array([[-2.,  3.],
        [ 4., -7.]], dtype=float32)]

In [91]:
two_unit_SRNN.predict(numpy.array([ [[1] for i in range(10)] ]))

array([[ 14733418., -24943680.]], dtype=float32)

Denoting $\vec x=[2, 1]^T$ and $M=\begin{bmatrix}-2&3\\4&-7\end{bmatrix}^T$, the above should give the value

$$\begin{aligned}
M(M(M(\cdots)+\vec x)+\vec x)+\vec x&=(I+M+M^2+\cdots+M^{9})\vec x\\
&=\left(\sum_{i=0}^{9}M^i\right)\vec x\\
&=(M-I)^{-1}(M^{10}-I)\vec x
\end{aligned}$$

In [90]:
M=numpy.transpose(numpy.array([[-2, 3],[4, -7]]))
A=numpy.matmul(numpy.linalg.inv((M-numpy.eye(2))),(numpy.linalg.matrix_power(M, 10)-numpy.eye(2)))

numpy.matmul(A,numpy.array([2,1]))

array([ 14733418.00000001, -24943680.        ])
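The closed form used here is the matrix version of the geometric series formula. A short derivation, assuming $M-I$ is invertible (it is for this $M$, since $\det(M-I)=12$): multiplying the sum by $M-I$ makes all but two terms cancel.

$$\begin{aligned}
(M-I)\sum_{i=0}^{9}M^i&=\sum_{i=1}^{10}M^i-\sum_{i=0}^{9}M^i=M^{10}-I\\
\sum_{i=0}^{9}M^i&=(M-I)^{-1}(M^{10}-I)
\end{aligned}$$

The small discrepancy in the first component (`14733418.00000001`) is floating-point rounding from the matrix inverse, not a disagreement with the network's output.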

## Part C: LSTM Layer
### Optional

In [92]:
one_unit_LSTM = Sequential()
one_unit_LSTM.add(LSTM(units=1, input_shape=(None, 1),
                       activation='linear', recurrent_activation='linear',
                       use_bias=False, unit_forget_bias=False,
                       kernel_initializer='zeros',
                       recurrent_initializer='zeros',
                       return_sequences=True))

In [93]:
one_unit_LSTM_weights = one_unit_LSTM.get_weights()
one_unit_LSTM_weights

[array([[0., 0., 0., 0.]], dtype=float32),
 array([[0., 0., 0., 0.]], dtype=float32)]

In [94]:
one_unit_LSTM_weights[0][0][0] = 1
one_unit_LSTM_weights[0][0][1] = 0
one_unit_LSTM_weights[0][0][2] = 1
one_unit_LSTM_weights[0][0][3] = 1
one_unit_LSTM_weights[1][0][0] = 0
one_unit_LSTM_weights[1][0][1] = 0
one_unit_LSTM_weights[1][0][2] = 0
one_unit_LSTM_weights[1][0][3] = 0
one_unit_LSTM.set_weights(one_unit_LSTM_weights)
one_unit_LSTM.get_weights()

[array([[1., 0., 1., 1.]], dtype=float32),
 array([[0., 0., 0., 0.]], dtype=float32)]

In [102]:
one_unit_LSTM.predict(numpy.array([ [[0], [1], [2], [4]] ]))

array([[[ 0.],
        [ 1.],
        [ 8.],
        [64.]]], dtype=float32)

## Part C Exercise
### Optional
Conceptually, the [LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) has several _gates_:

* __Forget gate__: these weights allow some long-term memories to be forgotten.
* __Input gate__: these weights decide what new information will be added to the context cell.
* __Output gate__: these weights decide what pieces of the new information and updated context will be passed on to the output.

It also has a __cell__ that can hold onto information from the current input (as well as things it has remembered from previous inputs), so that it can be used in later outputs.

Identify which weights in the one_unit_LSTM model are connected with the context and which are associated with the three gates. This is considerably more difficult to do by looking at the inputs and outputs, so you could also treat this as a code reading exercise and look through the keras code to find the answer.

_Note_: The output from the predict call is what the linked explanation calls $h_{t}$.

In [232]:
one_unit_LSTM_weights[0][0][0] = 2 # Second sigma in diagram
one_unit_LSTM_weights[0][0][1] = 1 # First sigma in diagram
one_unit_LSTM_weights[0][0][2] = 3 # The boxed tanh
one_unit_LSTM_weights[0][0][3] = 1 # Last sigma
one_unit_LSTM_weights[1][0][0] = 0 # These four are in the same order as above but for h_{t-1} not x_t
one_unit_LSTM_weights[1][0][1] = 1
one_unit_LSTM_weights[1][0][2] = 2
one_unit_LSTM_weights[1][0][3] = 2
one_unit_LSTM.set_weights(one_unit_LSTM_weights)
one_unit_LSTM.get_weights()

[array([[2., 1., 3., 1.]], dtype=float32),
 array([[0., 1., 2., 2.]], dtype=float32)]

From investigating the code and comparing to the diagram on the linked site, the first row of weights multiplies the input $x_t$ before it is fed into the second, first, third, and last activation functions in that order. The second row of weights multiplies $h_{t-1}$ in the same way, with the results being added to the results from weighting $x_t$ before feeding the sums into the activation functions.

In [234]:
one_unit_LSTM.predict(numpy.array([ [[1],[2] ]]))

array([[[   6.],
        [1680.]]], dtype=float32)

The first value here should be
$$(\underbrace{(2\cdot 1)}_{\sigma_2}\cdot \underbrace{(3\cdot 1)}_{\tanh}+\underbrace{(1\cdot 1)}_{\sigma_1}\cdot \underbrace{0}_{C_{0}})\cdot \underbrace{(1\cdot 1)}_{\sigma_3}=6$$
Note that $C_1=6$ as well, since the final multiplier is just 1.

The second value should be
$$\left(\underbrace{(2\cdot 2)}_{\sigma_2}\cdot \underbrace{(3\cdot 2+2\cdot 6)}_{\tanh}+\underbrace{(1\cdot 2+1\cdot 6)}_{\sigma_1}\cdot \underbrace{6}_{C_{1}}\right)\cdot \underbrace{(1\cdot 2+2\cdot 6)}_{\sigma_3}=1680$$
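The hand calculations above can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming linear activations everywhere, no bias, and zero initial output and cell state (matching how `one_unit_LSTM` was built); `simulate_one_unit_lstm` is a hypothetical helper, not part of Keras. The four weights in each row are taken in the order input gate, forget gate, cell candidate, output gate.

```python
def simulate_one_unit_lstm(inputs, kernel, recurrent):
    """Outputs h_t of a one-unit LSTM with linear activations and no bias.

    kernel and recurrent each hold four weights in the order
    (input gate, forget gate, cell candidate, output gate).
    """
    h, c = 0.0, 0.0  # assumed initial output and cell state
    outputs = []
    for x in inputs:
        # Each gate sees kernel weight * x_t + recurrent weight * h_{t-1}
        i, f, g, o = (w * x + u * h for w, u in zip(kernel, recurrent))
        c = f * c + i * g  # keep part of the old cell state, add the new candidate
        h = o * c          # linear activation, so no tanh is applied to c
        outputs.append(h)
    return outputs


print(simulate_one_unit_lstm([1, 2], [2, 1, 3, 1], [0, 1, 2, 2]))
# [6.0, 1680.0]
print(simulate_one_unit_lstm([0, 1, 2, 4], [1, 0, 1, 1], [0, 0, 0, 0]))
# [0.0, 1.0, 8.0, 64.0]
```

Both results match the `predict` outputs in Part C, which supports the gate ordering described above.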