# Deep Learning with TensorFlow Module 3

[Based on this CognitiveClass Course](https://cognitiveclass.ai/courses/deep-learning-tensorflow/)

## Recurrent Neural Nets

### Sequential Data

Sequential data poses a problem for traditional feed-forward neural networks.

Feed-forward networks assume that inputs are independent of each other, so they do not take the effect of previous data points into account.

### Recurrent Neural Networks

RNNs maintain a state, or context, that summarizes what was previously calculated. At each step the model feeds the previous state back in together with the new input, which helps it predict sequential information.


$$
h_{new}=\tanh(W_h\cdot h_{prev}+W_x\cdot x)
$$

Here $h_{prev}$ is the context (the previous state) and $h_{new}$ is the new state, which yields an output $y$:

$$
y=W_y\cdot h_{new}
$$
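
The two equations above can be sketched in NumPy as a single recurrent step applied across a short sequence. This is a minimal illustration, not the course's implementation: the layer sizes, the random weights, and the function name `rnn_step` are assumptions chosen for the example.

```python
import numpy as np

def rnn_step(h_prev, x, W_h, W_x, W_y):
    """One recurrent step: returns the new state and the output."""
    h_new = np.tanh(W_h @ h_prev + W_x @ x)  # h_new = tanh(W_h . h_prev + W_x . x)
    y = W_y @ h_new                          # y = W_y . h_new
    return h_new, y

rng = np.random.default_rng(0)
hidden_size, input_size, output_size = 4, 3, 2  # illustrative sizes (assumed)
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_y = rng.normal(scale=0.5, size=(output_size, hidden_size))

sequence = rng.normal(size=(5, input_size))  # 5 time steps of random input
h = np.zeros(hidden_size)                    # initial context is all zeros
outputs = []
for x in sequence:
    # The state returned by one step becomes the context for the next step
    h, y = rnn_step(h, x, W_h, W_x, W_y)
    outputs.append(y)

outputs = np.stack(outputs)
print(outputs.shape)  # (5, 2)
```

The same weights are reused at every time step; only the state `h` changes as the sequence is consumed.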

RNNs need to keep track of state at every time step, which can become computationally expensive. They are also very sensitive to changes in parameters and can suffer from vanishing or exploding gradients.

Overall, these models can be more difficult to train.
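
To get a feel for vanishing and exploding gradients, we can use a deliberately simplified picture: assume the gradient is multiplied by the same scalar factor once per time step as it is propagated backwards. The function name `gradient_scale` and the factors used are illustrative assumptions; in a real RNN the factor depends on the weights and activations at each step.

```python
def gradient_scale(factor, steps):
    """Size of a gradient after being multiplied by `factor` once per time step."""
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale

# A factor below 1 shrinks the gradient, a factor above 1 blows it up
for factor in (0.5, 1.0, 1.5):
    print(factor, gradient_scale(factor, 50))
```

Over 50 steps a factor of 0.5 drives the gradient towards zero, while a factor of 1.5 makes it grow by many orders of magnitude, which is why long sequences are hard for a plain RNN.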

### Long Short-Term Memory

A standard RNN has difficulty learning very long sequences.

One solution is the LSTM model, which maintains a strong gradient over relatively long sequences.

An LSTM consists of a memory cell and three logistic gates, for writing, reading, and forgetting, which control the flow of data.

Manipulating these gates allows an LSTM to remember the appropriate amount of information.

We can also stack LSTMs on top of each other to create hierarchical data representations.

During training the network learns how much old and new information to remember, along with the weights and biases of each gate at each layer.

### Language Modeling

Language modeling is the process of assigning probabilities to sequences of words. We can use an RNN to build up the context from the previous words and then output the next word.
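
As a small illustration of assigning a probability to a sequence, the sketch below multiplies conditional word probabilities together (by summing their logs). It uses a toy table that conditions on only the one previous word, whereas an RNN conditions on the whole preceding context. The table `bigram_probs` and its values are made up for the example.

```python
import math

# Hypothetical conditional probabilities P(next_word | previous_word).
# The numbers are made up for illustration; a trained model would supply them.
bigram_probs = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sat"): 0.3,
}

def sequence_log_prob(words, probs):
    """Sum of log P(word | previous word) over the sequence."""
    log_prob = 0.0
    previous = "<s>"  # start-of-sentence marker
    for word in words:
        log_prob += math.log(probs[(previous, word)])
        previous = word
    return log_prob

log_p = sequence_log_prob(["the", "cat", "sat"], bigram_probs)
print(math.exp(log_p))  # approximately 0.03, i.e. 0.5 * 0.2 * 0.3
```

Working with log probabilities avoids numerical underflow when sequences get long.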

Word embedding is a way to process words by encoding each word as a specific vector. The vectors for the vocabulary are initialized randomly and then updated during training based on the contexts in which each word appears, so that words used in similar contexts end up with similar vectors.
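
An embedding can be pictured as a lookup table with one row per word. The sketch below shows only the random initialization and the lookup, not the training updates. The toy vocabulary, the embedding size, and the helper name `embed` are assumptions for the example.

```python
import numpy as np

vocab = ["the", "cat", "dog", "sat"]  # toy vocabulary (assumed)
word_to_id = {word: i for i, word in enumerate(vocab)}

embedding_size = 3
rng = np.random.default_rng(0)
# One row per word, initialized randomly; training would update these rows
embeddings = rng.uniform(-1.0, 1.0, size=(len(vocab), embedding_size))

def embed(words):
    """Look up the embedding row for each word."""
    ids = [word_to_id[w] for w in words]
    return embeddings[ids]

vectors = embed(["the", "cat", "sat"])
print(vectors.shape)  # (3, 3): three words, three values per word
```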

We pass in batches of text, and the network outputs the likelihood that a specific word follows the input. Comparing the actual next word with the model's output probabilities gives an error that is used to train the model through backpropagation.
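
One common way to compare the model's output probabilities with the actual next word is a softmax followed by a cross-entropy loss. The sketch below assumes that setup; the notes do not state which loss the course uses, and the scores in `logits` are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(probs, target_id):
    """Negative log probability assigned to the actual next word."""
    return -np.log(probs[target_id])

# Hypothetical scores from the network over a 4-word vocabulary
logits = np.array([2.0, 0.5, -1.0, 0.1])
probs = softmax(logits)

# The loss is small when the actual word (id 0 here) gets high probability
loss = cross_entropy(probs, target_id=0)
print(probs, loss)
```

Backpropagation then adjusts the weights to reduce this loss, which raises the probability of the words that actually occur.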

### Lab - LSTM Basics

An LSTM makes use of a linear unit, the information cell, which is surrounded by three logistic gates that are responsible for maintaining the data. Each gate selectively handles one operation:

- Input/Write
    - Handles writing data to the information cell
- Output/Read
    - Sends data back to the RNN
- Keep/Remember
    - Handles maintaining and modifying the data in the information cell

The gates above are analog and multiplicative, and they modify the data based on the signal they receive.

The usual flow of operations in an LSTM is as follows:

1. The Keep Gate decides whether to keep or forget the data currently in memory. It receives the input and the state of the RNN and passes them through a sigmoid activation. If $K_t$ has a value of $1$ the data is kept perfectly; $0$ means it is forgotten completely. Given the incoming state $S_{t-1}$, the incoming input $x_t$, the weights and biases of the keep gate $W_k$ and $B_k$, and the previous state of memory $Old_{t-1}$, we have the following equations

$$
K_t = \sigma(W_k\times[S_{t-1},x_t]+B_k)
$$


$$
Old_t=K_t\times Old_{t-1}
$$

We can see from this that $Old_{t-1}$ is multiplied element-wise by the current $K_t$ value.

2. The input and state are passed to the Input Gate, where another sigmoid activation is applied, the result of which is $I_t$. The result of the RNN processing the inputs is $C_t$

$$
I_t = \sigma(W_i\times[S_{t-1},x_t]+B_i)
$$


$$
New_t=I_t\times C_t
$$

$New_t$ is the data to be written to the memory cell, and it is added to whatever is already in the memory cell

$$
Cell_t=Old_t+New_t
$$

This is the updated content of the memory cell.

3. The Output Gate decides how much of the new cell content is passed on as output

$$
O_t=\sigma(W_o\times[S_{t-1},x_t]+B_o)
$$

$$
Output_t=O_t\times \tanh(Cell_t)
$$
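
The gate equations above can be put together into one NumPy step, as a rough sketch. The notes do not give a formula for $C_t$, so the code assumes the usual choice of a `tanh` layer over the same concatenated input, with its own weights `W_c` and `B_c`. The sizes, the random weights, and the name `lstm_step` are also assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, s_prev, cell_prev, params):
    """One LSTM step following the Keep, Input and Output gate equations."""
    concat = np.concatenate([s_prev, x_t])                 # [S_{t-1}, x_t]
    k_t = sigmoid(params["W_k"] @ concat + params["B_k"])  # Keep gate
    i_t = sigmoid(params["W_i"] @ concat + params["B_i"])  # Input gate
    o_t = sigmoid(params["W_o"] @ concat + params["B_o"])  # Output gate
    # Assumption: C_t is a tanh layer over the same input (not given in the notes)
    c_t = np.tanh(params["W_c"] @ concat + params["B_c"])
    old_t = k_t * cell_prev            # Old_t = K_t x Old_{t-1}
    new_t = i_t * c_t                  # New_t = I_t x C_t
    cell_t = old_t + new_t             # Cell_t = Old_t + New_t
    output_t = o_t * np.tanh(cell_t)   # Output_t = O_t x tanh(Cell_t)
    return output_t, cell_t

hidden_size, input_size = 4, 6  # illustrative sizes (assumed)
rng = np.random.default_rng(0)
params = {}
for gate in "kioc":
    params["W_" + gate] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params["B_" + gate] = np.zeros(hidden_size)

x_t = np.array([3.0, 2.0, 2.0, 2.0, 2.0, 2.0])
output_t, cell_t = lstm_step(x_t, np.zeros(hidden_size), np.zeros(hidden_size), params)
print(output_t.shape, cell_t.shape)  # (4,) (4,)
```

Because every gate value lies between 0 and 1, each gate scales its data rather than switching it fully on or off.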

#### Create an LSTM Network

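We'll create a simple LSTM to understand the network architecture. The cells below use `tf.Session` and `tf.contrib`, which belong to the TensorFlow 1.x API; `tf.contrib` was removed in TensorFlow 2.x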

In [2]:
import numpy as np
import tensorflow as tf

In [3]:
sess = tf.Session()

We want to create a network with a single LSTM cell. The cell takes two inputs, `prv_output` and `prv_state`, also known as $h$ and $c$ respectively. To provide them we initialize a `state` tuple with 2 elements, each of shape `1, 4`, which holds `prv_output` and `prv_state`

In [4]:
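# Number of hidden units in the cell (the size of both h and c)
LSTM_CELL_SIZE = 4

lstm_cell = tf.contrib.rnn.BasicLSTMCell(LSTM_CELL_SIZE,
                                         state_is_tuple=True)
# Initial state: two zero tensors of shape [batch_size, LSTM_CELL_SIZE]
state = (tf.zeros([1, LSTM_CELL_SIZE]),)*2
print(state)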

(<tf.Tensor 'zeros:0' shape=(1, 4) dtype=float32>, <tf.Tensor 'zeros:0' shape=(1, 4) dtype=float32>)


#### Define the Sample Input

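We can define a sample input with a `batch_size = 1` and six values. Because the cell is called directly on this tensor, it processes the six values as the features of a single time step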

In [5]:
sample_input = tf.constant([[3, 2, 2, 2, 2, 2]], dtype=tf.float32)
print(sess.run(sample_input))

[[ 3.  2.  2.  2.  2.  2.]]


#### Run the Sample Input

We can run the sample input by passing it to the `lstm_cell` and looking at the output

In [6]:
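with tf.variable_scope('LSTM_sample1'):
    # One step of the cell: returns the output (h) and the new state tuple (c, h)
    output, state_new = lstm_cell(sample_input, state)
# Initialize the cell's weights and biases before evaluating anything
sess.run(tf.global_variables_initializer())
print(sess.run(state_new))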

LSTMStateTuple(c=array([[-0.03636235,  0.83891821,  0.23306049, -0.55370402]], dtype=float32), h=array([[-0.00570551,  0.08357365,  0.04054929, -0.27129793]], dtype=float32))


Based on the above we can see that the LSTM state holds a $c$ and an $h$ value. We can look at the output by evaluating it as follows. Note that it is identical to the $h$ value of the state

In [7]:
print(sess.run(output))

[[-0.00570551  0.08357365  0.04054929 -0.27129793]]


In [8]:
sess.close()

#### Stacked LSTM

We can create a two layer LSTM as follows

In [9]:
sess = tf.Session()

In [10]:
input_dim = 6

Create multiple cells

In [12]:
cells = []

In [13]:
LSTM_CELL_SIZE_1 = 4
cell1 = tf.contrib.rnn.LSTMCell(LSTM_CELL_SIZE_1)
cells.append(cell1)

In [15]:
LSTM_CELL_SIZE_2 = 5
cell2 = tf.contrib.rnn.LSTMCell(LSTM_CELL_SIZE_2)
cells.append(cell2)

In [16]:
stacked_lstm = tf.contrib.rnn.MultiRNNCell(cells)

In [17]:
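# Placeholder shape: [batch_size, time_steps, input_dim]
data = tf.placeholder(tf.float32, [None, None, input_dim])
# dynamic_rnn unrolls the stacked cell over the time dimension
output, state = tf.nn.dynamic_rnn(stacked_lstm, data, dtype=tf.float32)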

The sample input for this network has shape `2, 3, 6` (batch size, time steps, features)

In [18]:
#Batch size x time steps x features.
sample_input = [[[1,2,3,4,3,2], [1,2,1,1,1,2],[1,2,2,2,2,2]],[[1,2,3,4,3,2],[3,2,2,1,1,2],[0,0,0,0,3,2]]]
sample_input

[[[1, 2, 3, 4, 3, 2], [1, 2, 1, 1, 1, 2], [1, 2, 2, 2, 2, 2]],
 [[1, 2, 3, 4, 3, 2], [3, 2, 2, 1, 1, 2], [0, 0, 0, 0, 3, 2]]]
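
As a quick sanity check, we can confirm the shape of this sample input with NumPy and work out the output shape to expect from the stacked network. The variable names below are illustrative. The expected output shape relies on the fact that the output of a stacked LSTM has the size of the last cell, here `LSTM_CELL_SIZE_2 = 5`.

```python
import numpy as np

sample_input = [[[1, 2, 3, 4, 3, 2], [1, 2, 1, 1, 1, 2], [1, 2, 2, 2, 2, 2]],
                [[1, 2, 3, 4, 3, 2], [3, 2, 2, 1, 1, 2], [0, 0, 0, 0, 3, 2]]]

batch = np.array(sample_input, dtype=np.float32)
batch_size, time_steps, features = batch.shape
print(batch.shape)  # (2, 3, 6)

# The last layer of the stack has 5 units, so each time step yields 5 values
last_layer_size = 5
expected_output_shape = (batch_size, time_steps, last_layer_size)
print(expected_output_shape)  # (2, 3, 5)
```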

#### Evaluate the Output

In [19]:
print(output)

Tensor("rnn/transpose:0", shape=(?, ?, 5), dtype=float32)


In [21]:
sess.run(tf.global_variables_initializer())
print(sess.run(output, feed_dict={data: sample_input}))

[[[-0.00457329  0.01386554  0.00200699 -0.00482764 -0.0085205 ]
  [-0.00555866  0.02601253 -0.00947261 -0.00118891 -0.01125968]
  [-0.00866533  0.04085833 -0.01180315  0.0016048  -0.01417257]]

 [[-0.00457329  0.01386554  0.00200699 -0.00482764 -0.0085205 ]
  [-0.00063138  0.01153464 -0.00904999  0.00572045 -0.00064348]
  [-0.01402552  0.02387906 -0.02340911  0.0096273  -0.00310375]]]


In [22]:
sess.close()

### Lab - LSTM with MNIST

