# Recurrent Neural Networks

###### tags: `mlb`

## 1. Time Series Features

Consider a sentence as follows : 

*I am darby. I am handsome.*

How do we extract features from this sentence ?


### 1.1 Sentence-level Features
- Extract features from each sentence : 
    - Bag-of-words

1. Build a list containing every distinct word in the sentence : 
[ "I", "am", "darby", "handsome" ]

2. Create a zero vector whose length equals the length of the list :
[ 0, 0, 0, 0 ]

3. Replace each zero with the frequency of the corresponding word : 
[ 2, 2, 1, 1 ]

- Advantage : 
    - The result is a fixed-length vector, so it fits most machine learning models.

- Disadvantage : 
    - It loses the word order, and with it the relations between words.
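
The three bag-of-words steps above can be sketched in a few lines of Python. This is only an illustration: the helper name `bag_of_words` and the simple regular-expression tokenizer are assumptions, not part of the original notes.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return (vocabulary, counts) for a piece of text."""
    # Keep runs of letters, digits, and apostrophes as tokens (case is kept as-is).
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    # Step 1: vocabulary of distinct words, in order of first appearance.
    vocabulary = list(dict.fromkeys(tokens))
    # Steps 2 and 3: one slot per vocabulary word, filled with its frequency.
    frequency = Counter(tokens)
    counts = [frequency[word] for word in vocabulary]
    return vocabulary, counts

vocabulary, counts = bag_of_words("I am darby. I am handsome.")
print(vocabulary)  # ['I', 'am', 'darby', 'handsome']
print(counts)      # [2, 2, 1, 1]
```

Note that the count vector is the same for any reordering of the words, which is exactly the disadvantage listed above.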

### 1.2 Word-level Features
- Extract features from each word
    - One-hot
    - Word embedding

- Word-level features span two domains : 
    - The time domain and the word domain.
    - For example, one-hot encoding gives the following features : 

[[1, 0, 0, 0], => I 
[0, 1, 0, 0],  => am
[0, 0, 1, 0],  => darby
[1, 0, 0, 0],  => I
[0, 1, 0, 0],  => am
[0, 0, 0, 1]]  => handsome

- We can flatten the features into a vector and feed it to a machine learning model.
    - But this does **not** let the model learn the relations between words.
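
As a rough sketch, the one-hot matrix above and its flattened form can be built with NumPy. The variable names are illustrative, and the vocabulary order (first appearance) is an assumption chosen to match the matrix shown.

```python
import numpy as np

tokens = ["I", "am", "darby", "I", "am", "handsome"]
vocabulary = list(dict.fromkeys(tokens))          # ['I', 'am', 'darby', 'handsome']
index = {word: i for i, word in enumerate(vocabulary)}

# One row per time step, one column per vocabulary word.
one_hot = np.zeros((len(tokens), len(vocabulary)), dtype=int)
one_hot[np.arange(len(tokens)), [index[t] for t in tokens]] = 1
print(one_hot.shape)   # (6, 4)

# Flattening gives a single vector of length 6 * 4 = 24.
flat = one_hot.reshape(-1)
print(flat.shape)      # (24,)
```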

## 2. Recurrent Neural Networks

### 2.1 Basic Architecture

Idea : at each time step, feed one word and the previous state to the model.

The model has two outputs : one for the prediction and another for the **hidden state**.

![](https://i.imgur.com/9BWm6wW.png)

If we want to classify a sentence, we can treat the last output as the prediction.
- Ignore the intermediate outputs.
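
A minimal NumPy sketch of this idea follows. It assumes a plain tanh cell with a linear output layer; the function names, weight shapes, and random initialization are illustrative choices, not something the notes prescribe.

```python
import numpy as np

def rnn_step(x, h_prev, params):
    """One time step: returns (output, new hidden state)."""
    W_xh, W_hh, b_h, W_hy, b_y = params
    h = np.tanh(x @ W_xh + h_prev @ W_hh + b_h)   # new hidden state
    y = h @ W_hy + b_y                            # prediction for this step
    return y, h

def classify_sequence(xs, params, hidden_size):
    h = np.zeros(hidden_size)                     # initial hidden state
    y = None
    for x in xs:                                  # the same weights are used at every step
        y, h = rnn_step(x, h, params)
    return y                                      # intermediate outputs are ignored

rng = np.random.default_rng(0)
input_size, hidden_size, num_classes = 4, 8, 2
params = (
    rng.normal(scale=0.1, size=(input_size, hidden_size)),
    rng.normal(scale=0.1, size=(hidden_size, hidden_size)),
    np.zeros(hidden_size),
    rng.normal(scale=0.1, size=(hidden_size, num_classes)),
    np.zeros(num_classes),
)
xs = np.eye(input_size)[[0, 1, 2, 0, 1, 3]]       # one-hot "I am darby I am handsome"
logits = classify_sequence(xs, params, hidden_size)
print(logits.shape)   # (2,)
```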

### 2.2 Backpropagation Through Time

In this architecture, plain backpropagation is not enough.
- Because we cannot directly optimize the intermediate outputs.

Solution : a recurrent neural network is a cyclic graph, but we can simplify it by unrolling it over time.

![](https://i.imgur.com/eEAcpJW.png)

Here, Model 1, Model 2, ..., Model n share the same weights and biases.

We can treat those models as one large, sequential model.
- $h_n$ is the output of this model.
- Ignore $h_1, h_2, ..., h_{n-1}$

Now we can use backpropagation to calculate the gradients easily.

Workflow of backpropagation through time : 
1. Unroll the recurrent neural network.
2. Calculate the gradients of all trainable variables.
3. Roll up the recurrent neural network.
4. Update the trainable variables.
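
To make the workflow concrete, here is a small sketch on a scalar RNN, $h_t = \tanh(w \cdot h_{t-1} + u \cdot x_t)$, with a squared-error loss on the final hidden state. The scalar model, the loss, and all names are assumptions chosen to keep the example short; a real network would use matrices and an automatic differentiation library.

```python
import numpy as np

def forward(w, u, xs):
    """Unroll h_t = tanh(w * h_{t-1} + u * x_t) and keep every hidden state."""
    hs = [0.0]                               # h_0
    for x in xs:
        hs.append(np.tanh(w * hs[-1] + u * x))
    return hs

def loss_fn(w, u, xs, target):
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2

def bptt(w, u, xs, target):
    """Gradients of the loss with respect to the shared weights w and u."""
    hs = forward(w, u, xs)                   # step 1: unroll
    grad_w, grad_u = 0.0, 0.0
    dh = hs[-1] - target                     # dL/dh_n
    for t in range(len(xs), 0, -1):          # step 2: walk backwards through time
        dpre = dh * (1.0 - hs[t] ** 2)       # back through tanh
        grad_w += dpre * hs[t - 1]           # shared weight: contributions add up
        grad_u += dpre * xs[t - 1]
        dh = dpre * w                        # pass the gradient on to h_{t-1}
    return grad_w, grad_u

w, u = 0.5, -0.3
xs, target = [1.0, 0.5, -1.0, 2.0], 0.2
grad_w, grad_u = bptt(w, u, xs, target)

# Steps 3 and 4: the summed gradients update the single shared copy of each weight.
learning_rate = 0.1
w_new, u_new = w - learning_rate * grad_w, u - learning_rate * grad_u
```

Because every unrolled copy shares `w` and `u`, the gradient for each weight is the sum of its contributions from all time steps.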

## 3. Long Short-Term Memory (LSTM)

### 3.1 Introduction

Reference : [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)


#### Disadvantages of recurrent neural networks : 
1. Vanishing gradients
2. Exploding gradients
3. Performance degrades when the time series is too long.

#### Solution : 
1. An additional new state : **cell state**
2. Three gates : **input gate**, **forget gate**, **output gate**

#### Simple workflow : 

Let 
(1) $x_t$ be the word at time step $t$, 
(2) $h_t$ be the hidden state at time step $t$, 
(3) $C_t$ be the cell state at time step $t$.

Consider the current word and the two previous states : $x_t, h_{t-1}, C_{t-1}$. 
An LSTM cell calculates three gate values : $i_t, f_t, o_t$.
It then uses those three values to generate the next two states : $h_t, C_t$.

#### Calculate $i_t$ (output of input gate) : 

$$
i_t = Sigmoid(W_i \cdot [h_{t-1}, x_t] + b_i)
$$

#### Calculate $f_t$ (output of forget gate) : 

$$
f_t = Sigmoid(W_f \cdot [h_{t-1}, x_t] + b_f)
$$

#### Calculate $o_t$ (output of output gate) : 

$$
o_t = Sigmoid(W_o \cdot [h_{t-1}, x_t] + b_o)
$$

#### Calculate $C_t$ (cell state) : 

$$
\bar C_t = tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \\
C_t = f_t * C_{t-1} + i_t * \bar C_t
$$

#### Calculate $h_t$ (hidden state) : 

$$
h_t = o_t * tanh(C_t)
$$
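
The equations above translate almost line by line into NumPy. The sketch below is illustrative only: it stores each weight matrix transposed (shape `(hidden + input, hidden)`) so that the product is written `concat @ W`, and the dictionary layout, sizes, and random initialization are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step following the equations above.

    W[k] has shape (hidden + input, hidden) and b[k] has shape (hidden,)
    for k in "i", "f", "o", "C".
    """
    concat = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    i_t = sigmoid(concat @ W["i"] + b["i"])       # input gate
    f_t = sigmoid(concat @ W["f"] + b["f"])       # forget gate
    o_t = sigmoid(concat @ W["o"] + b["o"])       # output gate
    C_bar = np.tanh(concat @ W["C"] + b["C"])     # candidate cell state
    C_t = f_t * C_prev + i_t * C_bar              # new cell state
    h_t = o_t * np.tanh(C_t)                      # new hidden state
    return h_t, C_t

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W = {k: rng.normal(scale=0.1, size=(hidden_size + input_size, hidden_size)) for k in "ifoC"}
b = {k: np.zeros(hidden_size) for k in "ifoC"}

h, C = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in np.eye(input_size)[[0, 1, 2, 0, 1, 3]]:   # one-hot words, one per time step
    h, C = lstm_step(x_t, h, C, W, b)
print(h.shape, C.shape)   # (3,) (3,)
```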

### 3.2 Model Architecture

![](https://i.imgur.com/7TPMPnR.png)


Here, we show a simple stacked LSTM model.

- $h_{11}, h_{12}, ..., h_{1n}$ , $h_{21}, h_{22}, ..., h_{2n}$, $h_{3n}$ are 1D vectors. They are the hidden states at each time step.
- The first two LSTM layers **return all hidden states**, while the last LSTM layer **returns only the final hidden state**.
- We feed the final hidden state $h_{3n}$ to a neural network.

We can unroll this architecture as follows : 

![](https://i.imgur.com/PcIiZEo.png)


### 3.3 Bidirectional LSTM

A basic problem for all recurrent architectures : 
- They tend to **forget** the features that were input at the beginning.
    - This becomes a problem when we pad all sequences to the same size.

For example, consider raw data as follows : 
1. I am darby. I am handsome.
2. How are you ?
3. To be or not to be, this is a question.

In practice, we will pad those sequences : 
1. i am darby i am handsome [pad] [pad] [pad] [pad]
2. how are you [pad] [pad] [pad] [pad] [pad] [pad] [pad]
3. to be or not to be this is a question

Sample 2 ends with seven [pad] tokens, so by the time the model reaches the final step it may have forgotten the three real words. This will affect model inference.
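
The padding step can be sketched as follows. `pad_sequences` here is a hypothetical helper written for this example (not a library call), and splitting on whitespace stands in for real tokenization.

```python
def pad_sequences(sequences, pad_token="[pad]"):
    """Right-pad every token list to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_token] * (max_len - len(seq)) for seq in sequences]

raw = [
    "i am darby i am handsome",
    "how are you",
    "to be or not to be this is a question",
]
padded = pad_sequences([sentence.split() for sentence in raw])
for tokens in padded:
    print(len(tokens), tokens.count("[pad]"))
# prints:
# 10 4
# 10 7
# 10 0
```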

#### Solution : 
1. Reverse sequence order
2. Bidirectional LSTM

![](https://cdn-images-1.medium.com/max/1200/1*6QnPUSv_t9BY9Fv8_aLb-Q.png)
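
The bidirectional idea can be sketched with a plain tanh RNN standing in for the LSTM (a deliberate simplification to keep the example short). One set of weights reads the sequence left to right, a second set reads it right to left, and the two hidden states for each time step are concatenated. All names and sizes are illustrative.

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh):
    """Plain tanh RNN; returns the hidden state at every time step."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in xs:
        h = np.tanh(x @ W_xh + h @ W_hh)
        states.append(h)
    return np.stack(states)

def bidirectional(xs, forward_weights, backward_weights):
    forward_states = run_rnn(xs, *forward_weights)
    # Read the sequence from the end, then flip the states back into time order.
    backward_states = run_rnn(xs[::-1], *backward_weights)[::-1]
    return np.concatenate([forward_states, backward_states], axis=1)

def make_weights(rng, input_size, hidden_size):
    return (rng.normal(scale=0.1, size=(input_size, hidden_size)),
            rng.normal(scale=0.1, size=(hidden_size, hidden_size)))

rng = np.random.default_rng(0)
input_size, hidden_size, steps = 4, 3, 6
xs = rng.normal(size=(steps, input_size))
states = bidirectional(xs,
                       make_weights(rng, input_size, hidden_size),
                       make_weights(rng, input_size, hidden_size))
print(states.shape)   # (6, 6)
```

In this sketch the backward half of the state at the first time step has already seen the whole sequence, so features from either end are never more than one pass away.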

## 4. Sequence to Sequence Learning

Paper : [Sequence to Sequence Learning with Neural Networks](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf)

Definition : feed a sequence to the model, and the model responds with a sequence. Examples : chatbots, machine translation.

- It is not a simple classification or regression model. Instead, it's a **generative model**.

### 4.1 Limitation of a Single RNN Model on Generative Problems

Can we simply use the hidden state at each time step as the output sequence ?

![](https://i.imgur.com/19GryC6.png)

- No, because the model may need to consider the whole input sequence before generating a response.

### 4.2 Encoder-Decoder Architecture

Idea : use one RNN model to encode the whole input sequence, and another RNN model to decode it into a response.
- Sequence-to-sequence model (seq2seq)

![](https://i.imgur.com/V8rxbS5.png)

1. Encoder : feed the whole input sequence to the encoder model to get its final cell state and hidden state.
2. Decoder : **initialize** the decoder's cell state and hidden state with the encoder's final cell state and hidden state.
3. Finally, feed a **start token** to the decoder, which outputs a word.
    - Feed this word back to the decoder to get the next word. Repeat this process until we get an **end token**.
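
The decoding loop in step 3 can be sketched as below. The decoder here is a hypothetical stand-in (a lookup table keyed by the previous word) so that the loop itself is runnable; a trained decoder would instead run an LSTM step and pick the most likely word. The `max_len` guard is an added assumption to avoid looping forever if no end token is produced.

```python
def greedy_decode(decoder_step, state, start_token="[start]", end_token="[end]", max_len=20):
    """Feed each predicted word back in until the end token appears."""
    word, output = start_token, []
    for _ in range(max_len):                  # guard against never seeing [end]
        word, state = decoder_step(word, state)
        if word == end_token:
            break
        output.append(word)
    return output

# Hypothetical stand-in for a trained decoder: next word looked up from the previous word.
NEXT_WORD = {"[start]": "I", "I": "am", "am": "fine", "fine": "[end]"}

def toy_decoder_step(prev_word, state):
    return NEXT_WORD[prev_word], state        # state passes through unchanged in this toy

encoder_state = ("h_n", "C_n")                # placeholder for the encoder's final states
print(greedy_decode(toy_decoder_step, encoder_state))   # ['I', 'am', 'fine']
```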

### 4.3 Input and Output Encoding

Input : feel free to use any word-level encoding method
- One-hot
- Word embedding (memory-friendly)

Output : only consider **one-hot** encoding.
- A seq2seq model treats each output word as a classification over the vocabulary.
- If the vocabulary size is 80000, the output layer of the decoder has 80000 neurons. 

Think : the larger the vocabulary, the more memory is used.
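
A quick back-of-the-envelope calculation shows the scale. The decoder hidden size of 512 and the use of 32-bit floats are assumptions made for this example; only the vocabulary size comes from the text above.

```python
vocabulary_size = 80_000
hidden_size = 512          # assumed decoder hidden size
bytes_per_float = 4        # float32

# Dense output layer: one weight per (hidden unit, word) pair plus one bias per word.
parameters = hidden_size * vocabulary_size + vocabulary_size
megabytes = parameters * bytes_per_float / 1e6
print(parameters)            # 41040000
print(round(megabytes, 1))   # 164.2
```

Under these assumptions the output layer alone holds about 41 million parameters, and doubling the vocabulary doubles that figure.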

### 4.4 More Issues

- Optimize encoder : reverse sequence order.
- Optimize decoding process : beam search algorithm.
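
Beam search keeps several candidate sequences alive instead of committing to the single best word at each step. The sketch below uses a hypothetical toy model in which the next-word probabilities depend only on the previous word; the table, the function name, and the assumption that every path reaches `[end]` within `max_len` steps are all illustrative.

```python
import math

def beam_search(next_probs, beam_width, start="[start]", end="[end]", max_len=10):
    """Keep the `beam_width` highest-scoring partial sequences at every step."""
    beams = [(0.0, [start])]                  # (log-probability, tokens)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for word, p in next_probs[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_width]:
            if tokens[-1] == end:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    # Assumes at least one sequence reached the end token.
    best_score, best_tokens = max(finished, key=lambda c: c[0])
    return best_tokens[1:-1], math.exp(best_score)

# Toy next-word probabilities (hypothetical).
NEXT_PROBS = {
    "[start]": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"z": 0.9, "[end]": 0.1},
    "x": {"[end]": 1.0},
    "y": {"[end]": 1.0},
    "z": {"[end]": 1.0},
}

print(beam_search(NEXT_PROBS, beam_width=1)[0])   # ['a', 'x']
print(beam_search(NEXT_PROBS, beam_width=2)[0])   # ['b', 'z']
```

With a beam width of 1 the search is greedy and ends with probability 0.33, while a width of 2 keeps the initially less likely "b" alive and finds the sequence with probability 0.36.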

## 5. Attention Mechanism

Let's review the seq2seq model : 

![](https://i.imgur.com/KCmJPN6.png)

For the first generated word "I", we know that it is generated from the token "[start]", hidden state $h_0$ and cell state $C_0$.
- $h_1, C_1$ are generated as well.

The second word "am" is generated from "I", $h_1, C_1$.
- $h_2, C_2$ are generated, too.

Summary : $Word_t$ is generated from $h_{t-1}, C_{t-1}, Word_{t-1}$.
- In reality, $Word_t$ may depend on state $t-n$ or state $t+n$.
- Different words have different dependencies.
- For example, consider the following machine translation : 
    - I woke up at 8:00 a.m. today. => 我今天早上8點起床
    - The order of action and time is reversed. The second Chinese word "今天" depends on the final English word "today".

### 5.1 Architecture Overview

![](https://i.imgur.com/trY1JmW.png)

A new concept : **context vector**

### 5.2 Context Vector

The hidden states of the encoder may be important.
- Generate a context vector that is a **weighted sum** of all hidden states of the encoder.

Where do the weights of the weighted sum come from ?
- They are calculated with the **current hidden state of the decoder**.

Consider all hidden states of the encoder $he_1, he_2, ... he_n$ and the current hidden state of the decoder $hd_t$. We can calculate the weights with the following equations : 

$$
u^t_i = score(he_i, hd_t) \\
U^t = [u^t_1, u^t_2, u^t_3, ... u^t_n] \\
a^t = softmax(U^t) \\
c_t = \sum_{i=1}^{n} a^t_i \times he_i
$$

$score(.)$ is a function that calculates the relation between $he_i$ and $hd_t$. For example : 

$$
score(he_i, hd_t) = v^T tanh(W_e \cdot he_i + W_d \cdot hd_t)
$$

- $v^T$ : a trainable weight vector (transposed)
- $W_e, W_d$ : trainable weight matrices
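
The context vector computation above can be sketched in NumPy as follows. The sizes, the random values, and the separate attention dimension `attn` are assumptions made for illustration; only the formulas come from the text.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def context_vector(encoder_states, hd_t, W_e, W_d, v):
    """Additive score, softmax weights, then a weighted sum of encoder states."""
    # u_i = v^T tanh(W_e he_i + W_d hd_t), computed for all i at once.
    scores = np.tanh(encoder_states @ W_e.T + hd_t @ W_d.T) @ v
    weights = softmax(scores)                     # a^t
    context = weights @ encoder_states            # c_t = sum_i a^t_i * he_i
    return context, weights

rng = np.random.default_rng(0)
n, hidden, attn = 5, 4, 3
encoder_states = rng.normal(size=(n, hidden))     # he_1 ... he_n, one per row
hd_t = rng.normal(size=hidden)                    # current decoder hidden state
W_e = rng.normal(size=(attn, hidden))
W_d = rng.normal(size=(attn, hidden))
v = rng.normal(size=attn)

c_t, a_t = context_vector(encoder_states, hd_t, W_e, W_d, v)
print(c_t.shape, a_t.shape)   # (4,) (5,)
```

The weights `a_t` are non-negative and sum to 1, so `c_t` is a convex combination of the encoder hidden states.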

### 5.3 Prediction

The prediction is calculated from the **context vector** and the **current hidden state of the decoder**.

$$
\bar {hd}_t = tanh(W_c \cdot [c_t; hd_t]) \\
y_t = softmax(W_s \cdot \bar {hd}_t)
$$
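
A short NumPy sketch of the two prediction equations, under assumed sizes (hidden size 4, vocabulary size 10) and random weights. The function name `predict_word` is illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))         # shift for numerical stability
    return e / e.sum()

def predict_word(c_t, hd_t, W_c, W_s):
    """Combine context vector and decoder state, then score every vocabulary word."""
    hd_bar = np.tanh(W_c @ np.concatenate([c_t, hd_t]))   # \bar{hd}_t
    return softmax(W_s @ hd_bar)                          # y_t

rng = np.random.default_rng(0)
hidden, vocab = 4, 10
c_t, hd_t = rng.normal(size=hidden), rng.normal(size=hidden)
W_c = rng.normal(size=(hidden, 2 * hidden))
W_s = rng.normal(size=(vocab, hidden))

y_t = predict_word(c_t, hd_t, W_c, W_s)
predicted_index = int(np.argmax(y_t))   # index of the most likely word
print(y_t.shape)   # (10,)
```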

### 5.4 More

Two kinds of attention layers : 
- Global attention
- Local attention

Attention on CNN models : 
- Self attention : calculate the context vector without the hidden states of an encoder.

Machine translation with an attention mechanism but without CNNs or RNNs : 
- Paper : [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)

## Reference

- [A Gentle Introduction to Backpropagation Through Time](https://machinelearningmastery.com/gentle-introduction-backpropagation-time/)
- [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Attention Mechanism(Seq2Seq)](https://www.slideshare.net/healess/attention-mechanismseq2seq)
- [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf)
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/pdf/1409.0473.pdf)
- [Understanding Bidirectional RNN in PyTorch](https://towardsdatascience.com/understanding-bidirectional-rnn-in-pytorch-5bd25a5dd66)

## Appendix

Here, we introduce some interesting topics on deep learning.

### A. Generative Adversarial Networks (GAN)

![](https://image.ibb.co/h5HpoA/gan-reading-list-header-image.jpg)

Train two models : a generator and a discriminator
- The generator generates fake images.
- The discriminator checks whether an image is real or not.

Goal : get a generator whose fake images the discriminator cannot tell apart from real ones.

[Meow Generator](https://ajolicoeur.wordpress.com/cats/)

![](https://ajolicoeur.files.wordpress.com/2017/07/wgan_1408epoch.png)

More : 
- Conditional GAN
- Cycle GAN (Two-domain transfer)
- Star GAN (Multi-domain transfer)

### B. Deep Reinforcement Learning

![](https://www.kdnuggets.com/images/reinforcement-learning-fig1-700.jpg)

Learn an agent that acts well in a specific environment.
- AlphaGo
- Self-driving cars

Two strategies to interact with environments : 
- Model-based methods
- Model-free methods

Three strategies to train agents : 
- Value-based methods : Q-learning
- Policy-based methods : policy gradient
- Combine value-based and policy-based methods : actor-critic

### C. Meta Learning

How to learn a learner ?

Use reinforcement learning to generate deep learning models : 
- Paper : [Neural Architecture Search with Reinforcement Learning](https://arxiv.org/pdf/1611.01578.pdf)
- Paper : [Efficient Neural Architecture Search via Parameter Sharing](https://arxiv.org/pdf/1802.03268.pdf)

### D. Security Issues

Adversarial sample : a sample that makes the model misclassify it, even though a human can still recognize it very easily.

![](https://ml.berkeley.edu/blog/assets/2017-10-31-adversarial-examples/goodfellow.png)

- FGSM Attack
- DeepFool
- One-pixel Attack
- Tool : [foolbox](https://github.com/bethgelab/foolbox)

### E. Spiking Neural Networks (keyword only)