<br>
<font size = '6'><b>Recurrent Neural Networks</b></font>

- <a href="./files/771A_lec24_slides.pdf" target="_blank">Slides</a> by Piyush Rai

<table style="border-style: hidden; border-collapse: collapse;" width = "90%"> 
    <tr style="border-style: hidden; border-collapse: collapse;">
        <td width = 60% style="border-style: hidden; border-collapse: collapse;">
             
        </td>
        <td width = 30%>
        Collected by Prof. Seungchul Lee<br>
        iSystems<br>http://isystems.unist.ac.kr/<br>
        UNIST
        </td>
    </tr>
</table>

Table of Contents
<div id="toc"></div>



# 0. Limitation of Feedforward Neural Nets

- A feedforward neural net (FFNN) cannot take into account the sequential structure in the data

- For a sequence of observations $x_1, \cdots, x_T$, the corresponding hidden units (states) $h_1, \cdots, h_T$ are assumed independent of each other

<br>
<img src="./image_files/sequence_layer.jpg" width = 400>
<br>

- Not ideal for sequential data, e.g., sentence/paragraph/document (sequence of words), video (sequence of frames), etc.

# 1. Recurrent Neural Nets (RNN)

- The hidden state at each step depends on the hidden state of the previous step

<br>
<img src="./image_files/sequence_layer_dependv2.jpg" width = 360>
<br>

- Each hidden state is typically defined as

$$ h_t = f \left( W x_t + U h_{t-1}\right) $$

$\quad \;$where $W$ maps the input $x_t$ into the hidden space, $U$ acts like a transition matrix between consecutive hidden states, and $f$ is some nonlinear function (e.g., $\tanh$).

- Now $h_t$ acts as a memory which helps us remember what happened up to step $t$

- Note: Unlike sequence data models such as HMMs, where each state is discrete, RNN states are continuous-valued. In that sense, RNNs are similar to linear Gaussian models such as Kalman filters, which also have continuous states.

- RNNs can also be extended to have more than one hidden layer.


- A more "micro" view of RNN (transition matrix $U$ connects the hidden states across observations, propagating information along the sequence)

<table style="border-style: hidden; border-collapse: collapse;" width = "96%"> 
    <tr style="border-style: hidden; border-collapse: collapse;">
        <td width = 48% style="border-style: hidden; border-collapse: collapse;">
<img src="./image_files/micro_view_rnn.jpg" width = 400>
        </td>
        <td width = 48%>
<img src="./image_files/recurrence_gif.gif" width = 400>
        </td>
    </tr>
</table>
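
The recurrence $h_t = f \left( W x_t + U h_{t-1}\right)$ can be sketched in a few lines of plain Python. This is only a minimal illustration: the sizes (2-dimensional inputs, 3-dimensional hidden state), the weight values, the zero initial state `h0`, and the helper names `matvec` and `rnn_forward` are all assumptions made for the example, with $f = \tanh$.

```python
import math

def matvec(M, v):
    """Multiply matrix M (given as a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_forward(xs, W, U, h0):
    """Return the hidden states h_1, ..., h_T for inputs x_1, ..., x_T."""
    h = h0
    states = []
    for x in xs:
        # Pre-activation W x_t + U h_{t-1}
        a = [wx + uh for wx, uh in zip(matvec(W, x), matvec(U, h))]
        h = [math.tanh(a_i) for a_i in a]  # f = tanh
        states.append(h)
    return states

# Hypothetical weights: W is 3x2 (input to hidden), U is 3x3 (hidden to hidden).
W = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
U = [[0.2, 0.0, -0.1], [0.0, 0.3, 0.1], [0.1, -0.2, 0.4]]
h0 = [0.0, 0.0, 0.0]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # a sequence of T = 3 inputs
states = rnn_forward(xs, W, U, h0)
print(states[-1])  # h_T summarizes the whole sequence
```

The same `W` and `U` are reused at every step, so the final state has the same size no matter how long the input sequence is.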

<br>

- RNN: Applications
<br>
<img src="./image_files/rnn_application.jpg" width = 700>
<br>
    - Input, output, or both, can be sequences (possibly of different lengths)
    - Different inputs (and different outputs) need not be of the same length
    - Regardless of the length of the input sequence, the RNN learns a fixed-size embedding for it.

## 1.1. Training RNN

- Trained using Backpropagation Through Time (BPTT)
    - Forward propagate from step 1 to the end, and then backward propagate from the end to step 1
   
- Think of the time dimension as additional hidden layers (the network unrolled over time); training is then just like standard backpropagation for feedforward neural nets.

<img src="./image_files/backprop_through_time.gif" width = 550>

- Black: Prediction, Yellow: Error, Orange: Gradients
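
The forward-then-backward procedure can be sketched for a scalar RNN. The setup here is an assumption made to keep the example short: scalar weights `w` and `u`, $f = \tanh$, and a squared-error loss on the final hidden state only, $L = \tfrac{1}{2}(h_T - y)^2$. The function names are illustrative.

```python
import math

def forward(xs, w, u, h0=0.0):
    """Forward pass from step 1 to T; returns [h_0, h_1, ..., h_T]."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * x + u * hs[-1]))
    return hs

def loss(xs, y, w, u):
    """Assumed loss: squared error between the final state h_T and target y."""
    return 0.5 * (forward(xs, w, u)[-1] - y) ** 2

def bptt(xs, y, w, u):
    """Backward pass from step T to 1; returns (dL/dw, dL/du)."""
    hs = forward(xs, w, u)
    dh = hs[-1] - y                      # dL/dh_T
    dw = du = 0.0
    for t in range(len(xs), 0, -1):
        da = dh * (1.0 - hs[t] ** 2)     # through tanh: derivative is 1 - h_t^2
        dw += da * xs[t - 1]             # w is shared, so contributions add up
        du += da * hs[t - 1]             # u is shared, so contributions add up
        dh = da * u                      # error passed back to h_{t-1}
    return dw, du

xs, y = [0.5, -1.0, 0.3, 0.8], 0.2
dw, du = bptt(xs, y, w=0.7, u=-0.4)
print(dw, du)
```

Because `w` and `u` are shared across time steps, each step adds its own contribution to the same gradient. A finite-difference check on `loss` is a simple way to validate such a backward pass.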

## 1.2. RNN: Vanishing/Exploding Gradients Problem
<br>
<img src="./image_files/vanishing_gradient.jpg" width = 550>
<br>

- Sensitivity of the hidden states and outputs to a given input becomes weaker as we move away from it along the sequence (weak memory)

- New inputs "overwrite" the activations of previous hidden states

- Repeated multiplications can cause the gradients to vanish or explode
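
The effect of repeated multiplication is easiest to see in a simplified setting. The sketch below assumes a linear scalar recurrence $h_t = u\, h_{t-1}$ (that is, $f$ is the identity and there are no inputs), so that $\partial h_T / \partial h_0 = u^T$; the function name is illustrative.

```python
def gradient_through_time(u, T):
    """d h_T / d h_0 for the linear scalar recurrence h_t = u * h_{t-1}."""
    grad = 1.0
    for _ in range(T):
        grad *= u  # the chain rule contributes one factor of u per time step
    return grad

for u in (0.5, 1.0, 1.5):
    print(u, [gradient_through_time(u, T) for T in (1, 10, 50)])
```

For $|u| < 1$ the gradient shrinks geometrically with $T$, and for $|u| > 1$ it grows geometrically. With $f = \tanh$ each factor becomes $u\,(1 - h_t^2)$, whose magnitude is at most $|u|$, which makes vanishing even more likely.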

## 1.3. Capturing Long-Range Dependencies

- Idea: Augment the hidden states with gates (with parameters to be learned)

- These gates can help us remember and forget information "selectively"

<img src="./image_files/gates_lstm.png" width = 550>

- The hidden states have 3 types of gates
    - input (bottom), forget (left), output (top)

- Open gate denoted by 'o', closed gate denoted by '-'

- Long Short-Term Memory (LSTM) is one such idea

## 1.4. LSTM

- Essentially an RNN, except that the hidden states are computed differently

- Recall that an RNN computes the hidden states as $h_t = f \left( W x_t + U h_{t-1}\right)$

- For RNN: State update is multiplicative (weak memory and gradient issues)

- In contrast, an LSTM maintains a "context" $C_t$ and uses it to compute the hidden states

<img src="./image_files/LSTM.png" width = 450>
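
A minimal sketch of one LSTM step follows. It assumes the commonly used formulation with input, forget, and output gates plus a candidate value, written for scalars to stay short. The parameter layout (`p` mapping a gate name to a `(w, u, b)` triple) and the function names are illustrative choices, and the figure above may differ in detail.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p maps a gate name to its (w, u, b) parameters."""
    def pre(name):
        w, u, b = p[name]
        return w * x + u * h_prev + b

    i = sigmoid(pre("input"))        # how much new content to write
    f = sigmoid(pre("forget"))       # how much of the old context to keep
    o = sigmoid(pre("output"))       # how much of the context to expose
    c_tilde = math.tanh(pre("candidate"))
    c = f * c_prev + i * c_tilde     # additive update of the context C_t
    h = o * math.tanh(c)             # hidden state read out from the context
    return h, c

# Hypothetical parameters for illustration only.
params = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, -0.2, 1.0),
    "output": (0.4, 0.2, 0.0),
    "candidate": (1.0, 0.5, 0.0),
}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

When the forget gate is close to 1 and the input gate is close to 0, the context is carried over almost unchanged, which is how the gates let the network remember selectively.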

# 2. Neural Nets for Unsupervised Learning

## 2.1. Autoencoder

- A neural net for unsupervised feature extraction

- Basic principle: Learns an encoding of the inputs so as to recover the original input from the encodings as well as possible

<table style="border-style: hidden; border-collapse: collapse;" width = "90%"> 
    <tr style="border-style: hidden; border-collapse: collapse;">
        <td width = 50% style="border-style: hidden; border-collapse: collapse;">
<img src="./image_files/autoencoder1_V2.png" width = 500>
        </td>
        <td width = 40%>
<img src="./image_files/sDA.png" width = 270>
        </td>
    </tr>
</table>


- Also used to initialize deep learning models (layer-by-layer pre-training)

## 2.2. Autoencoder: an Example

- Real-valued inputs, binary-valued encodings

<br>
<img src="./image_files/autoencoder2_V2.png" width = 500>
<br>

- Sigmoid encoder (parameter matrix $W$), linear decoder (parameter matrix $D$), learned via:

$$\arg\min_{D,W} E(D,W) = \sum_{n=1}^{N} \lVert Dz_n - x_n\rVert^2 = \sum_{n=1}^{N} \lVert D \,\sigma(Wx_n) - x_n\rVert^2$$

- If encoder is also linear, then autoencoder is equivalent to PCA
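
The objective $E(D,W)$ can be evaluated directly for given parameters. The sketch below only computes the reconstruction error; it does not perform the minimization, which would typically be done with a gradient-based optimizer. The matrix sizes, values, and function names are assumptions for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    """Multiply matrix M (given as a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def encode(W, x):
    """Sigmoid encoder: z = sigma(W x)."""
    return [sigmoid(a) for a in matvec(W, x)]

def reconstruction_error(D, W, X):
    """E(D, W) = sum_n || D sigma(W x_n) - x_n ||^2."""
    total = 0.0
    for x in X:
        x_hat = matvec(D, encode(W, x))  # linear decoder
        total += sum((xh - xi) ** 2 for xh, xi in zip(x_hat, x))
    return total

# Hypothetical sizes: 3-dimensional inputs, 2-dimensional encodings.
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]   # encoder, 2x3
D = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # decoder, 3x2
X = [[1.0, 2.0, 3.0], [2.0, 1.0, 0.0]]
print(reconstruction_error(D, W, X))
```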

## 2.3. Denoising Autoencoders

- Idea: introduce stochastic corruption to the input: 
    - Hide some features
    - Add Gaussian noise

<br>
<img src="./image_files/autoencoder3.png" width = 500>
<br>
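
The two corruption schemes listed above can be sketched as follows. The function names, the drop probability, and the noise level are illustrative assumptions. In the usual denoising setup, the autoencoder receives the corrupted input and is trained to reconstruct the clean one.

```python
import random

def mask_features(x, drop_prob, rng):
    """Hide features: set each entry to 0 with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]

def add_gaussian_noise(x, sigma, rng):
    """Add independent zero-mean Gaussian noise with standard deviation sigma."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

rng = random.Random(0)              # seeded for reproducibility
x = [0.2, 1.5, -0.7, 3.0]           # a clean input (hypothetical values)
x_masked = mask_features(x, 0.5, rng)
x_noisy = add_gaussian_noise(x, 0.1, rng)
print(x_masked)
print(x_noisy)
```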

%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')