## Sequences

A simple example of sequence prediction - $[0, 1, 2] \rightarrow [3, 4, 5]$

Application examples:
![](assets/application_examples.jpg)

Sequence problems come in several different structures:

![](assets/sequences.png)

The many to many structure can also be thought of as an encoder-decoder structure:

![](assets/quoc-le.png)

## Problems with standard dense networks

+ Fixed size inputs & outputs
+ Stateless
+ Doesn't share features learned across positions
+ Unaware of temporal structure


## Promise of recurrent neural networks

The network can learn a mapping from inputs over time
- outputs become conditional on the context of the sequence

Learn the temporal dependence of data

An RNN is Turing complete
- it can simulate arbitrary programs

## Being comfortable in three dimensions

We model the temporal structure in data using a dimension in an array - by convention this is the second dimension.

Our dimensions then are: 
- m = the batch dimension (number of samples)
- T_x = timesteps (length of sequence)
- n_x = features at each time-step

## Practical

In [6]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

m = 1000
T_x = 32
n_x = 16

samples = tf.random.uniform((m, T_x, n_x))

samples.shape

TensorShape([1000, 32, 16])

Select all samples, first timestep, all features:

Last sample, all timesteps, first feature:

Ninth sample, sixth timestep, all features:
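The three selections above map directly onto array indexing. The following is a minimal sketch that uses a NumPy array as a stand-in for the `samples` tensor (an assumption, made so the example runs without TensorFlow); the same slice syntax should apply to a `tf.Tensor`.

```python
import numpy as np

m, T_x, n_x = 1000, 32, 16
# NumPy stand-in for the `samples` tensor (assumption)
samples = np.random.uniform(size=(m, T_x, n_x))

# All samples, first timestep, all features
first_step = samples[:, 0, :]

# Last sample, all timesteps, first feature
last_sample = samples[-1, :, 0]

# Ninth sample, sixth timestep, all features (indices are zero-based)
ninth_sixth = samples[8, 5, :]

print(first_step.shape, last_sample.shape, ninth_sixth.shape)
# (1000, 16) (32,) (16,)
```

Indexing with a single integer drops that dimension, which is why each result has fewer than three dimensions.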

## Recurrent neural networks

A recurrent neural network applies the same model at every timestep, so when unrolled over time it looks like a linear stack of copies of one model. At each timestep it passes its output to itself.
![](assets/RNN.png)

### RNN cell
![](assets/rnn_step_forward.png)

Basic RNN cell. It takes as input $x^{\langle t \rangle}$ (the current input) and $a^{\langle t - 1\rangle}$ (the previous hidden state, an activation value containing information from the past), and outputs $a^{\langle t \rangle}$, which is passed to the next RNN cell and also used to predict $y^{\langle t \rangle}$.
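Written out, the cell is commonly defined as below. This is the standard formulation and is only assumed to match the figure; the weight names $W_{aa}$, $W_{ax}$, $W_{ya}$, the biases $b_a$, $b_y$ and the output activation $g$ (for example a softmax) are notation introduced here.

$$a^{\langle t \rangle} = \tanh\left(W_{aa}\, a^{\langle t-1 \rangle} + W_{ax}\, x^{\langle t \rangle} + b_a\right)$$

$$y^{\langle t \rangle} = g\left(W_{ya}\, a^{\langle t \rangle} + b_y\right)$$

The same weights are reused at every time-step, which is how the cell shares what it has learned across positions in the sequence.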

### RNN forward pass 

You can see an RNN as the repetition of the cell you've just built. If your input sequence of data is carried over 10 time steps, then you will copy the RNN cell 10 times. Each cell takes as input the hidden state from the previous cell ($a^{\langle t-1 \rangle}$) and the current time-step's input data ($x^{\langle t \rangle}$). It outputs a hidden state ($a^{\langle t \rangle}$) and a prediction ($y^{\langle t \rangle}$) for this time-step.


![](assets/rnn.png)
Basic RNN. The input sequence $x = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, ..., x^{\langle T_x \rangle})$  is carried over $T_x$ time steps. The network outputs $y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, ..., y^{\langle T_x \rangle})$. 


Let's code a forward propagation similar to the RNN described in the figure above, but simpler: without biases.

Instructions:

1. Create a random input array `x` with 4 samples, 3 time-steps and 2 features at each time-step
2. Initialize the architecture and weights with 4 hidden units
3. Create an array of zeros (`a`) to store the hidden states computed by the RNN
4. Loop over each time-step (index `t`)
    + Calculate the "next" hidden state using tanh as the activation function
    + Calculate the prediction `y` at this time-step
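The steps above can be sketched in NumPy as follows. This is one possible solution, not a reference implementation: it assumes a hidden state of size 4, an output size of 2, a linear output, and the hypothetical weight names `W_ax`, `W_aa` and `W_ya`.

```python
import numpy as np

rng = np.random.default_rng(0)

m, T_x, n_x = 4, 3, 2   # samples, time-steps, features
n_a, n_y = 4, 2         # hidden state size; output size (assumption)

# 1. Random input of shape (m, T_x, n_x)
x = rng.standard_normal((m, T_x, n_x))

# 2. Weights, no biases. The batch comes first, so weights multiply on the right.
W_ax = rng.standard_normal((n_x, n_a))  # input  -> hidden
W_aa = rng.standard_normal((n_a, n_a))  # hidden -> hidden
W_ya = rng.standard_normal((n_a, n_y))  # hidden -> output

# 3. Storage for the hidden states and predictions at every time-step
a = np.zeros((m, T_x, n_a))
y = np.zeros((m, T_x, n_y))
a_prev = np.zeros((m, n_a))  # initial hidden state

# 4. Loop over time-steps, reusing the same weights each time
for t in range(T_x):
    a_next = np.tanh(x[:, t, :] @ W_ax + a_prev @ W_aa)
    y[:, t, :] = a_next @ W_ya  # linear prediction (assumption)
    a[:, t, :] = a_next
    a_prev = a_next

print(a.shape, y.shape)
# (4, 3, 4) (4, 3, 2)
```

Because the initial hidden state is zero, the first step depends on the input alone; every later step mixes the current input with the state carried over from the previous step.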
 

## Backprop through time

Backpropagating error requires error to flow backwards in time
- error must flow back to the first time step to calculate gradients

The error signal for a given layer depends not only on its influence on the layers below it, but also on its influence on the layer at the next time step
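One way to make this dependence explicit is sketched below. It assumes a bias-free tanh cell $a^{\langle t \rangle} = \tanh(W_{aa}\, a^{\langle t-1 \rangle} + W_{ax}\, x^{\langle t \rangle})$ and a total loss $L = \sum_t L^{\langle t \rangle}$, where $L^{\langle t \rangle}$ is the loss on the prediction at step $t$. The symbols $W_{aa}$, $W_{ax}$ and $L^{\langle t \rangle}$ are notation introduced here.

$$\frac{\partial L}{\partial a^{\langle t \rangle}} = \frac{\partial L^{\langle t \rangle}}{\partial a^{\langle t \rangle}} + \left(\frac{\partial a^{\langle t+1 \rangle}}{\partial a^{\langle t \rangle}}\right)^{\top} \frac{\partial L}{\partial a^{\langle t+1 \rangle}}, \qquad \frac{\partial a^{\langle t+1 \rangle}}{\partial a^{\langle t \rangle}} = \mathrm{diag}\left(1 - \left(a^{\langle t+1 \rangle}\right)^{2}\right) W_{aa}$$

The first term is the error from the prediction at step $t$; the second is the error arriving from step $t+1$. At $t = T_x$ the second term vanishes, so the recursion runs from the last time step back to the first.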

Backpropagating through time means unrolling the network, which makes
- the memory footprint of a recurrent neural network large
- parallel training on multiple sequences inefficient on hardware that shares memory (e.g. a GPU)

Further reading - see *Truncated Backprop Through Time*

## Character level language modeling

Let's use a recurrent neural network to predict the next letter in the word *goodbye!*

+ many-to-many model
+ Feeding in the entire input sequence then reading the output sequence

<img src="assets/character_model.png" alt="" width="500"/>

### Data preparation
+ create a vocabulary using unique lower case letters
+ create index dictionaries that map each character to its index and each index back to its character
+ generate sequences
+ transform sequences to one-hot vectors
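The four steps above could look like the following sketch. It does not fill in the `encode` / `make_dataset` exercise functions; the names `char_to_idx`, `idx_to_char` and `one_hot`, the window length of 3, and the use of deterministic (rather than random) windows are all assumptions made for illustration.

```python
import numpy as np

word = "goodbye!".lower()
seq_len = 3  # window length (assumption)

# 1. Vocabulary: the unique characters, sorted for a stable order
alphabet = sorted(set(word))

# 2. Index dictionaries in both directions
char_to_idx = {ch: i for i, ch in enumerate(alphabet)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}

# 3. Sequences: each input window is paired with the same window shifted by one
text = word * 2  # repeat so windows can run past the end of the word
inputs = [text[i:i + seq_len] for i in range(len(word))]
targets = [text[i + 1:i + seq_len + 1] for i in range(len(word))]

# 4. One-hot encode to shape (num_sequences, seq_len, vocab_size)
def one_hot(sequences):
    out = np.zeros((len(sequences), seq_len, len(alphabet)), dtype=np.float32)
    for n, seq in enumerate(sequences):
        for t, ch in enumerate(seq):
            out[n, t, char_to_idx[ch]] = 1.0
    return out

X, Y = one_hot(inputs), one_hot(targets)
print(inputs[0], "->", targets[0], X.shape)
# goo -> ood (8, 3, 7)
```

Each row of `X[n]` contains a single 1 marking the character at that position, and `Y[n]` holds the same window moved one character ahead.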

### Overview of the model

Your model will have the following structure: 

- Initialize parameters 
- Input layer takes sequence with `len_seq` time steps and `vocab_size` features
- RNN layer with 15 memory cells

At each time-step, the RNN tries to predict what is the next character given the previous characters. The dataset $X = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, ..., x^{\langle T_x \rangle})$ is a list of characters in the training set, while $Y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, ..., y^{\langle T_x \rangle})$ is such that at every time-step $t$, we have $y^{\langle t \rangle} = x^{\langle t+1 \rangle}$. 
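A model with this structure might be built in Keras as sketched below. The softmax `Dense` output layer, the optimizer and the loss are assumptions not stated in the overview, and the values of `len_seq` and `vocab_size` are illustrative (3-character windows over the 7 distinct symbols of *goodbye!*). Parameter initialization is left to the Keras defaults.

```python
import numpy as np
from tensorflow import keras

len_seq, vocab_size = 3, 7  # illustrative values (assumption)

model = keras.Sequential([
    keras.Input(shape=(len_seq, vocab_size)),
    # 15 recurrent units; return_sequences=True emits one output per time-step
    keras.layers.SimpleRNN(15, return_sequences=True),
    # Softmax over the vocabulary at every time-step (assumption)
    keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# Untrained forward pass on a dummy batch, only to check the shapes
dummy = np.zeros((1, len_seq, vocab_size), dtype="float32")
probs = model.predict(dummy, verbose=0)
print(probs.shape)
# (1, 3, 7)
```

With `return_sequences=True` the model produces a distribution over characters at every time-step, which matches the many-to-many setup where $y^{\langle t \rangle} = x^{\langle t+1 \rangle}$.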

In [18]:
def encode(alphabet, samples, seq_len):
    '''
    Creates one-hot encoded vectors for sequences of characters.
    
    
    Parameters:
    --------
    alphabet: list with unique characters.
    samples: array of strings to be one-hot encoded.
    seq_len: int.
    
    Returns:
    --------
    one_hot: array of one-hot encoded characters.
    '''


def make_dataset(word, seq_len):
    '''
    Generates random sequences of size `seq_len` from the base `word`, as tuples of input-output one-hot encoded.
    
    
    Parameters:
    --------
    word: string. Word used to train the model. It will be repeated 50 times.
    seq_len: int. How many characters to consider in the input sequence. 
    
    Returns:
    --------
    input: array of size 100. Each object in the array is a random sample from the base `word` with size `seq_len`. 
    output: array of size 100. Each object in the array is the target sequence of size `seq_len` for the matching input sample, i.e. the input shifted one character ahead.
    alphabet: list with unique characters from parameter `word`.
    '''
   

In [None]:
model = 

In [None]:
def decode(alphabet, encoded):
    # single sample only: `encoded` holds one row of character indices
    return ''.join(alphabet[i] for i in encoded[0])

test = encode(alphabet, np.array(['goo']), seq_len=3)
decode(alphabet, np.argmax(model.predict(test), axis=2))

In [None]:
test = encode(alphabet, np.array(['bye']), seq_len=3)
decode(alphabet, np.argmax(model.predict(test), axis=2))

In [None]:
test = encode(alphabet, np.array(['!go']), seq_len=3)
decode(alphabet, np.argmax(model.predict(test), axis=2))

## Practical

Character level language model to predict next character in a word