# CS492 Special Topics in Computer Science: AI Industry and Smart Energy
## Deep Learning Practice 
#### Prof. Ho-Jin Choi
#### School of Computing, KAIST

---

# Week 3 - Recurrent neural networks (RNNs)
## Schedule of this week
5. **Basics of RNNs** <br>
    5-1. Introduction to RNNs <br>
    5-2. Word embeddings <br>
    5-3. Text classification with an RNN

6. **Applying RNNs to Other Domains** <br>
    6-1. Text generation with an RNN <br>
    6-2. Time series forecasting
---

## 5 - Basics of RNNs
### 5-1. Introduction to RNNs
#### Basic RNN structure 

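**Why RNNs?** <br>
Traditional neural networks such as feedforward neural networks (FFNNs) and CNNs have no **persistence**: they process each input independently and keep no memory of earlier inputs. Humans do the opposite; as you read this essay, you understand each word based on your understanding of the previous words. This lack of persistence is a major shortcoming of traditional neural networks when the data is sequential.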

<img src=https://miro.medium.com/max/5412/1*NKhwsOYNUT5xU7Pyf6Znhg.png>

Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist. An RNN is a class of artificial neural network in which connections between units form a directed graph along a **temporal sequence**, which allows it to exhibit temporal dynamic behavior. Unlike feedforward neural networks, RNNs can use their internal state (**memory**) to process sequences of inputs. This makes them applicable to tasks such as sentence generation, machine translation, image captioning, and speech recognition.
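
The loop that lets information persist can be sketched in a few lines of plain Python. This is only an illustration, not the implementation used later in the course: the function name `rnn_forward`, the scalar weights `w_x`, `w_h`, `b`, and the tanh update rule (the common textbook form) are all assumptions made for the example.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0, h0=0.0):
    """Run a scalar RNN over a sequence and return every hidden state.

    Assumed update rule: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b).
    The weights are illustrative values, not trained parameters.
    """
    h = h0
    states = []
    for x in inputs:
        # The new state depends on the current input AND the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Both sequences end with the same input, but the final states differ
# because the internal state remembers what came earlier.
states_a = rnn_forward([1.0, 0.0, 1.0])
states_b = rnn_forward([-1.0, 0.0, 1.0])
print(states_a[-1], states_b[-1])
```

A feedforward network given only the last input `1.0` would produce the same output in both cases; the recurrent state is what makes the two results differ.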

Even though RNNs are quite powerful, they suffer from the **vanishing gradient problem**, which hinders them from using long-term information. In practice they retain information from only the last few time steps (roughly 3-4) and give poor results when the relevant information lies further back, so plain RNNs are rarely used on their own.

#### Vanishing gradient problem

<img src=https://miro.medium.com/max/2288/1*lTeIFg5Ecl0hMd3FeNGDYA.png>



The vanishing gradient problem is a difficulty encountered when training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, each of the neural network's weights receives, in each training iteration, an update proportional to the partial derivative of the error function with respect to that weight.

The problem is that in some cases **the gradient will be vanishingly small**, effectively preventing the weight from changing its value. **In the worst case**, this **may completely stop the neural network from further training**. One cause is that traditional activation functions such as the hyperbolic tangent have derivatives in the range `(0, 1]`, and backpropagation computes gradients by the chain rule. This has the effect of **multiplying $n$ of these small numbers to compute gradients of the "front" layers in an $n$-layer network**, meaning that the gradient (error signal) decreases exponentially with $n$ and the front layers train very slowly. In an RNN unrolled over a sequence, the time steps play the role of the layers, so long sequences behave like very deep networks.
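
The exponential decay can be checked numerically. The sketch below assumes, purely for illustration, that every layer sees the same pre-activation value `z` and that all weights are 1, so the backpropagated gradient reduces to a product of tanh derivatives. The helper names `tanh_grad` and `chained_gradient` are hypothetical.

```python
import math

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2. It equals 1 only at z = 0."""
    return 1.0 - math.tanh(z) ** 2

def chained_gradient(z, n_layers):
    """Product of n_layers tanh derivatives (all weights assumed to be 1)."""
    grad = 1.0
    for _ in range(n_layers):
        grad *= tanh_grad(z)
    return grad

# The gradient reaching the "front" layer shrinks rapidly as depth grows.
for n in (1, 5, 10, 50):
    print(n, chained_gradient(1.0, n))
```

Real networks also multiply by weight values at every layer, which can slow or speed up the decay, but the repeated factor below 1 from the activation is what drives the trend shown here.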

**Example of vanishing gradient**

<img src=https://miro.medium.com/max/1550/1*qKjN8eiiW46Xs18JTGLeaA.png>

- Sentence 1: That **_cat_**, which already ate ..., **_was_** full. 
- Sentence 2: Those **_cats_**, which already ate ..., **_were_** full. 

This is an example of how language can have **long-term dependencies**: in the sentences above, **cat** or **cats**, which appears early in the sentence, determines whether **was** or **were** is correct much later.
The basic RNN is not very good at remembering such long-term dependencies.

<img src=https://miro.medium.com/max/1550/1*Kl3zMpEfTe7zbk4XKXiB_Q.png>

A basic RNN is dominated by **local influences**: each output is mostly affected by the inputs close to it in the sequence, while earlier information fades.
Thus, $y^{<3>}$ is mainly influenced by the values close to it ($x^{<1>},x^{<2>},x^{<3>}$).

In contrast, $y^{<T_y>}$ is hardly influenced by the early inputs in the sequence ($x^{<1>},x^{<2>},x^{<3>}$), because it is hard for the error to backpropagate all the way to the beginning of the sequence. This is the weakness of the basic RNN.
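
This local influence can be measured on a toy model. The sketch below assumes a scalar RNN with update rule $h_t = \tanh(w_x x_t + w_h h_{t-1})$ and hypothetical weights `w_x` and `w_h`. It applies the chain rule backwards through time to compute how strongly the final state reacts to each input. It is an illustration of the effect, not a training procedure.

```python
import math

def input_sensitivities(inputs, w_x=0.5, w_h=0.8):
    """Return |d h_T / d x_t| for every time step t of a scalar tanh RNN.

    Assumed model: h_t = tanh(w_x * x_t + w_h * h_{t-1}), with h_0 = 0.
    """
    # Forward pass: store every hidden state.
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)

    # Backward pass: walk from the last step to the first with the chain rule.
    sensitivities = [0.0] * len(inputs)
    d_h = 1.0  # d h_T / d h_T
    for t in reversed(range(len(inputs))):
        local = 1.0 - states[t] ** 2           # tanh derivative at step t
        sensitivities[t] = abs(d_h * local * w_x)
        d_h = d_h * local * w_h                # pass the gradient on to h_{t-1}
    return sensitivities

sens = input_sensitivities([1.0] * 20)
print("first input:", sens[0])
print("last input: ", sens[-1])
```

With these assumed weights, the sensitivity to the first input is many orders of magnitude smaller than the sensitivity to the last one, which is the same effect that makes it hard for the error at $y^{<T_y>}$ to reach the early inputs.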

**How to solve the vanishing gradient problem:**
[LSTM (long short-term memory)](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/LSTM) and [GRU (gated recurrent unit)](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/GRU) are both very effective at addressing the vanishing gradient problem and allow a neural network to capture much longer-range dependencies.

#### Long short-term memory (LSTM)

<img src=https://www.foundationai.org/wp-content/uploads/2018/08/948d40857c4118bc7aebffda6e5f4e57.png>

Long Short Term Memory networks – usually just called "LSTMs" – are a special kind of RNN, capable of learning long-term dependencies. They work tremendously well on a large variety of problems, and are now widely used. LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior.

A common LSTM unit is composed of a **cell**, an **input gate**, an **output gate** and a **forget gate**. The cell is responsible for _"remembering" values over arbitrary time intervals_; hence the word "memory" in LSTM. Each of the three gates can be thought of intuitively as a regulator of the flow of values through the connections of the LSTM. There are connections between these gates and the cell.
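
To make the roles of the cell and the gates concrete, here is a minimal sketch of one LSTM step on scalars. It follows the commonly used formulation, which is an assumption here since the text above only names the components. The function `lstm_step`, the parameter dictionary layout, and the "candidate" value are hypothetical names chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell.

    `p` maps each name to a tuple (input weight, recurrent weight, bias).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value for the cell

    c = f * c_prev + i * g            # cell: the "memory"
    h = o * math.tanh(c)              # hidden state passed to the next step
    return h, c

# Illustrative parameters: forget gate biased open, input gate biased shut.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0  # start with a value already stored in the cell
for x in [0.3, -0.7, 0.5] * 10:
    h, c = lstm_step(x, h, c, params)
print(c)
```

Because the cell update is additive and the forget gate stays close to 1 with these hand-picked parameters, the stored value survives 30 steps almost unchanged. In a trained LSTM the gate values are learned rather than fixed by hand.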


Further reading for a more detailed explanation of LSTMs:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/

#### Different types of RNNs

![Different types of RNNs](images/rnn-types.jpg)

Each rectangle is a vector and arrows represent functions (e.g. matrix multiply). Input vectors are in red, output vectors are in blue and green vectors hold the RNN's state.

From left to right:
- One to one: Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification).
- One to many: Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
- Many to one: Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment).
- Many to many: Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French).
- Many to many: Synced sequence input and output (e.g. video classification where we wish to label each frame of the video).

Notice that in every case there are no pre-specified constraints on the sequence lengths, because the recurrent transformation (green) is fixed and can be applied as many times as we like.
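
The point about a fixed recurrent transformation can be illustrated with a small sketch. It assumes a scalar tanh update with hypothetical weights, and it uses the raw hidden state as the output, whereas a real model would normally add an output layer. The function names are made up for the example.

```python
import math

def step(x, h, w_x=0.5, w_h=0.8):
    """The fixed recurrent transformation, shared by every time step."""
    return math.tanh(w_x * x + w_h * h)

def many_to_one(inputs):
    """Consume a whole sequence and return only the final state."""
    h = 0.0
    for x in inputs:
        h = step(x, h)
    return h

def many_to_many_synced(inputs):
    """Return one output per input (synced sequence input and output)."""
    h, outputs = 0.0, []
    for x in inputs:
        h = step(x, h)
        outputs.append(h)
    return outputs

def one_to_many(x, n_steps):
    """Start from a single input, then feed each output back in as the next input."""
    h, outputs = 0.0, []
    for _ in range(n_steps):
        h = step(x, h)
        outputs.append(h)
        x = h
    return outputs

# The same `step` function handles sequences of any length.
short_seq, long_seq = [1.0, 2.0], [0.5] * 100
print(many_to_one(short_seq), many_to_one(long_seq))
print(len(many_to_many_synced(short_seq)), len(many_to_many_synced(long_seq)))
print(len(one_to_many(1.0, 5)))
```

Only the loop around `step` changes between the modes; the parameters of `step` itself do not depend on the sequence length.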