## Sequences

A simple example of sequence prediction - $[0, 1, 2] \rightarrow [3, 4, 5]$

Application examples:
![](assets/application_examples.jpg)

Sequence problems come in different structures:

![](assets/sequences.png)

The many to many structure can also be thought of as an encoder-decoder structure:

![](assets/quoc-le.png)

## Problems with standard dense networks

+ Fixed size inputs & outputs
+ Stateless
+ Doesn't share features learned across positions
+ Unaware of temporal structure
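
The first limitation can be seen in a minimal NumPy sketch. The sizes, the `dense_forward` helper and the flatten-then-multiply layer are illustrative assumptions, not part of the original notes: a dense layer whose weights were sized for `T_x` timesteps cannot accept a longer sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

T_x, n_x, n_out = 5, 3, 2  # illustrative sizes (assumption)

# A dense layer on a flattened sequence needs one weight row
# per (timestep, feature) pair, so its input size is fixed.
W = rng.uniform(size=(T_x * n_x, n_out))


def dense_forward(sequence):
    """Flatten a (timesteps, features) sequence and apply the dense layer."""
    return sequence.reshape(-1) @ W


short_seq = rng.uniform(size=(T_x, n_x))
long_seq = rng.uniform(size=(T_x + 2, n_x))

print(dense_forward(short_seq).shape)

try:
    dense_forward(long_seq)
    accepted_longer_sequence = True
except ValueError:
    # The weight matrix was sized for exactly T_x timesteps.
    accepted_longer_sequence = False

print(accepted_longer_sequence)
```

Each weight row here is also tied to one position in the sequence, so nothing learned at one timestep is reused at another.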


## Promise of recurrent neural networks

Network able to learn a mapping from inputs over time
- outputs become conditional on the context of the sequence

Learn the temporal dependence of data

An RNN is Turing complete
- they can simulate arbitrary programs

## Being comfortable in three dimensions

We model the temporal structure in data using a dimension in an array - by convention this is the second dimension.

Our dimensions then are: 
- $m$ = the batch dimension (number of samples)
- $T_x$ = timesteps (length of sequence)
- $n_x$ = features at each time-step

## Practical

In [1]:
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras

tf.random.set_seed(3) # for reproducible results


m = 1000
T_x = 32
n_x = 16

samples = tf.random.uniform((m, T_x, n_x))

samples.shape

TensorShape([1000, 32, 16])

Select all samples, first timestep, all features:

In [None]:
samples.numpy()[]

Last sample, all timesteps, first feature:

In [None]:
samples.numpy()[]

Ninth sample, sixth timestep, all features:

In [None]:
samples.numpy()[]
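
One possible way to write the three selections is sketched below on a small NumPy array. The array `data` and its sizes are illustrative assumptions, chosen so the results are easy to inspect; the same index expressions should apply to `samples.numpy()`.

```python
import numpy as np

m, T_x, n_x = 10, 6, 3  # smaller than the tensor above (assumption)
data = np.arange(m * T_x * n_x).reshape(m, T_x, n_x)

# All samples, first timestep, all features
first_timestep = data[:, 0, :]

# Last sample, all timesteps, first feature
last_sample_first_feature = data[-1, :, 0]

# Ninth sample, sixth timestep, all features (zero-based indices 8 and 5)
ninth_sample_sixth_step = data[8, 5, :]

print(first_timestep.shape)             # (10, 3)
print(last_sample_first_feature.shape)  # (6,)
print(ninth_sample_sixth_step.shape)    # (3,)
```

Indexing with an integer drops that dimension, while a slice (`:`) keeps it.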

## Recurrent neural networks

A recurrent neural network applies the same model at every timestep. At each timestep it passes its output (the hidden state) to itself for use at the next timestep.
![](assets/RNN.png)

### RNN cell
![](assets/rnn_step_forward.png)

Basic RNN cell. It takes as input $x^{\langle t \rangle}$ (the current input) and $a^{\langle t - 1\rangle}$ (the previous hidden state, or activation, containing information from the past), and outputs $a^{\langle t \rangle}$, which is passed to the next RNN cell and also used to predict $y^{\langle t \rangle}$.
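
A single step of such a cell might look like the sketch below. The function name `rnn_cell_forward`, the batch-first shapes, the bias terms `ba` and `by`, and the linear read-out for the prediction are all assumptions made for illustration; the figure may use a different layout or output activation.

```python
import numpy as np


def rnn_cell_forward(xt, a_prev, Wax, Waa, Wya, ba, by):
    """One step of a basic RNN cell (batch-first layout, an assumption).

    xt:     (m, n_x)   input at the current timestep
    a_prev: (m, n_a)   hidden state from the previous timestep
    Wax:    (n_x, n_a) weights applied to the input
    Waa:    (n_a, n_a) weights applied to the previous hidden state
    Wya:    (n_a, n_y) weights mapping the hidden state to the prediction
    ba:     (n_a,)     hidden-state bias
    by:     (n_y,)     prediction bias
    """
    a_next = np.tanh(xt @ Wax + a_prev @ Waa + ba)
    yt = a_next @ Wya + by  # linear read-out (assumption)
    return a_next, yt


rng = np.random.default_rng(0)
m, n_x, n_a, n_y = 4, 2, 4, 1  # illustrative sizes

xt = rng.uniform(size=(m, n_x))
a_prev = np.zeros((m, n_a))
Wax = rng.uniform(size=(n_x, n_a))
Waa = rng.uniform(size=(n_a, n_a))
Wya = rng.uniform(size=(n_a, n_y))
ba = np.zeros(n_a)
by = np.zeros(n_y)

a_next, yt = rnn_cell_forward(xt, a_prev, Wax, Waa, Wya, ba, by)
print(a_next.shape, yt.shape)  # (4, 4) (4, 1)
```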

### RNN forward pass 

You can see an RNN as the repetition of the cell described above. If your input sequence is carried over 10 time steps, then the RNN cell is applied 10 times. Each cell takes as input the hidden state from the previous cell ($a^{\langle t-1 \rangle}$) and the current time-step's input data ($x^{\langle t \rangle}$). It outputs a new hidden state ($a^{\langle t \rangle}$) and a prediction ($y^{\langle t \rangle}$) for this time-step.


![](assets/rnn.png)
Basic RNN. The input sequence $x = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, ..., x^{\langle T_x \rangle})$  is carried over $T_x$ time steps. The network outputs $y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, ..., y^{\langle T_x \rangle})$. 


## Practical 

Let's code a forward pass similar to the RNN described in the figure above, but simpler: we leave out the biases.
We also want one output for each time step, which means $T_x = T_y$.

Instructions:

1. Create a random input vector x with 4 samples, 3 time-steps and 2 features at each time-step
2. Initialize the architecture and weights with hidden size = 4
3. Create the hidden state vectors ($a$) as a vector of zeros that will store the values computed by the RNN
4. Loop over each time-step (index t)
    + Calculate the "next" hidden state using tanh as the activation function
    + Calculate the prediction y at this time-step
 

In [7]:
# Create a random input vector x with m samples, T_x time-steps and n_x features
m = 4
T_x = 3
n_x = 2

x = np.random.uniform(size=(m, T_x, n_x))
x

array([[[0.88571262, 0.59926573],
        [0.47102437, 0.55445859],
        [0.44981887, 0.09304059]],

       [[0.81935112, 0.90132869],
        [0.47578489, 0.96259905],
        [0.82384898, 0.45511733]],

       [[0.01477745, 0.35827748],
        [0.37231846, 0.68358015],
        [0.32437978, 0.41808654]],

       [[0.56037974, 0.02307087],
        [0.71701891, 0.03027807],
        [0.20014658, 0.7836883 ]]])

In [17]:
x.shape

(4, 3, 2)

In [10]:
# xt  -- your input data at timestep "t", numpy array of shape (m, n_x)
# this is the shape of the first input!
x[:,0,:].shape

(4, 2)

In [11]:
# Initialize the architecture and weights with hidden size n_a 
n_a = 4

# Weight matrix multiplying the hidden state
Waa = np.random.uniform(size=())

# Weight matrix multiplying the input x
Wax = np.random.uniform(size=())

# Weight matrix mapping the hidden state to the prediction y
Wya = np.random.uniform(size=())

a0 = np.zeros([])
a0

array([0., 0., 0., 0.])

In [None]:
# perform one iteration
a1 = np.zeros()

# calculate the value of prediction
y1 =
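
The instructions above can be sketched end to end as follows. This is one possible solution under stated assumptions, not the notebook's reference answer: the weight shapes follow a batch-first layout, the output size `n_y = 1` is invented for illustration, and the hidden state is kept as an explicit `(m, n_a)` array, whereas the `a0` output shown above is a length-4 vector that would broadcast across samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Random input: 4 samples, 3 time-steps, 2 features
m, T_x, n_x = 4, 3, 2
x = rng.uniform(size=(m, T_x, n_x))

# 2. Architecture and weights, hidden size 4 (n_y = 1 is an assumption)
n_a, n_y = 4, 1
Wax = rng.uniform(size=(n_x, n_a))  # input -> hidden
Waa = rng.uniform(size=(n_a, n_a))  # hidden -> hidden
Wya = rng.uniform(size=(n_a, n_y))  # hidden -> prediction

# 3. Arrays of zeros that store the hidden states and predictions
a = np.zeros((m, T_x, n_a))
y = np.zeros((m, T_x, n_y))

# 4. Loop over each time-step, carrying the hidden state forward
a_prev = np.zeros((m, n_a))  # initial hidden state
for t in range(T_x):
    a_prev = np.tanh(x[:, t, :] @ Wax + a_prev @ Waa)  # no biases
    a[:, t, :] = a_prev
    y[:, t, :] = a_prev @ Wya  # one prediction per time-step

print(a.shape, y.shape)  # (4, 3, 4) (4, 3, 1)
```

The same three weight matrices are reused at every time-step, which is how the network shares what it learns across positions.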

## Backprop through time

Backpropagating error requires error to flow backwards in time
- error must flow back to the first time step to calculate gradients

The loss function for a given layer depends not only on its influence on the layers below it, but also on the layer at the next time step

Backpropagating through time means unrolling the network, which makes
- the memory footprint of a recurrent neural network large
- parallel training on multiple sequences inefficient on hardware that shares memory (e.g. a GPU)
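
To make the backward flow concrete, here is a minimal sketch of backpropagation through time for a scalar RNN. The model $a_t = \tanh(w \, a_{t-1} + u \, x_t)$, the squared-error loss on the final state, and all names and values are assumptions chosen to keep the example small. The hand-written gradients are compared against finite differences as a sanity check.

```python
import numpy as np


def forward(w, u, x):
    """Unroll a scalar RNN a_t = tanh(w * a_{t-1} + u * x_t), keeping every state."""
    a = np.zeros(len(x) + 1)  # a[0] is the initial hidden state
    for t in range(1, len(x) + 1):
        a[t] = np.tanh(w * a[t - 1] + u * x[t - 1])
    return a


def loss(w, u, x, target):
    """Squared error on the final hidden state (an illustrative choice)."""
    return 0.5 * (forward(w, u, x)[-1] - target) ** 2


def bptt(w, u, x, target):
    """Gradients of the loss w.r.t. w and u, walking from the last step to the first."""
    a = forward(w, u, x)  # every state is stored: this is the memory cost
    dw, du = 0.0, 0.0
    da = a[-1] - target  # error at the final time step
    for t in range(len(x), 0, -1):
        dz = da * (1.0 - a[t] ** 2)  # back through tanh
        dw += dz * a[t - 1]          # the same w is used at every step
        du += dz * x[t - 1]
        da = dz * w                  # error flows to the previous time step
    return dw, du


x = np.array([0.5, -0.1, 0.3, 0.8])
w, u, target = 0.7, 1.2, 0.25

dw, du = bptt(w, u, x, target)

# Central finite differences as an independent check
eps = 1e-6
dw_numeric = (loss(w + eps, u, x, target) - loss(w - eps, u, x, target)) / (2 * eps)
du_numeric = (loss(w, u + eps, x, target) - loss(w, u - eps, x, target)) / (2 * eps)

print(dw, dw_numeric)
print(du, du_numeric)
```

The backward loop needs `a[t]` for every time step, so the stored states grow linearly with sequence length.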

Further reading - see *Truncated Backprop Through Time*

## Extra

Exercise on character-level language modeling: predict the next character in the word "goodbye", from Adam's notebook [link](https://github.com/ADGEfficiency/teaching-monolith/blob/master/sequences/recurrent.ipynb)