# Recurrent NN

[The unreasonable effectiveness of recurrent neural networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) - Andrej Karpathy

## Language model

Given the words of a sentence as input, a language model estimates the probability of that word sequence. For example:

```
P("The", "weather", "is", "good", "<EOS>")
# EOS: End of Sentence
# UNK: UNKnown word
```

$$
P\big( y^{<1>}, y^{<2>}, \dots, y^{<T_y>} \big) \\
x^{<t>} = y^{<t-1>}
$$

$$
\begin{cases}
\mathcal{L}^{<t>} \big( \hat{y}^{<t>},y^{<t>}  \big) & = - \sum_i y_i^{<t>} \ \log \hat{y}_i^{<t>} \\
\mathcal{L} & = \sum_t \mathcal{L}^{<t>} \big( \hat{y}^{<t>},y^{<t>}  \big) \\
P\Big( y^{<1>}, y^{<2>}, y^{<3>} \Big) = P \big( y^{<1>} \big) \times 
P \big( y^{<2>} \ \big| \ y^{<1>} \big) \times P \big( y^{<3>} \ \big| \  y^{<1>}, y^{<2>} \big)
\end{cases}
$$
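A minimal sketch of the loss and the chain-rule factorization above, assuming one-hot targets so that each per-step loss reduces to the negative log of the probability assigned to the correct word. The names `step_probs` and `targets`, and the numbers in them, are made up for illustration.

```python
import math


def sequence_log_prob(step_probs, targets):
    """Sum over t of log P(y<t> | y<1>, ..., y<t-1>)."""
    return sum(math.log(probs[y]) for probs, y in zip(step_probs, targets))


def cross_entropy_loss(step_probs, targets):
    # With one-hot y<t>, -sum_i y_i * log(yhat_i) is just -log(yhat[target]).
    return -sequence_log_prob(step_probs, targets)


# Hypothetical softmax outputs over a 3-word vocabulary, one row per time step.
step_probs = [
    [0.5, 0.25, 0.25],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
]
targets = [0, 1, 2]  # index of the correct word at each step

loss = cross_entropy_loss(step_probs, targets)
prob = math.exp(-loss)  # equals 0.5 * 0.8 * 0.6, the product of the conditionals
```

Minimizing the summed loss is therefore the same as maximizing the probability the model assigns to the whole sentence.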

## Sample novel sequences

- Character-level language model
- Word-level language model
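Sampling works the same way at either level: draw a token from the model's output distribution, feed it back as the next input, and stop at `<EOS>`. The sketch below shows only that loop. `TOY_MODEL` is a hypothetical lookup table standing in for a trained RNN's softmax output, and `<BOS>` is an assumed start marker; a real RNN conditions on the whole history through its hidden state, not only on the previous token.

```python
import random

# Hypothetical next-token distributions (stand-in for a trained RNN).
TOY_MODEL = {
    "<BOS>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.7, "<EOS>": 0.3},
    "dog": {"sleeps": 0.6, "<EOS>": 0.4},
    "sleeps": {"<EOS>": 1.0},
}


def sample_sequence(model, rng, max_len=20):
    """Sample tokens one at a time until <EOS> or max_len is reached."""
    token = "<BOS>"
    out = []
    for _ in range(max_len):
        dist = model[token]
        # Draw the next token according to the predicted probabilities.
        token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if token == "<EOS>":
            break
        out.append(token)
    return out


rng = random.Random(0)  # fixed seed so the run is repeatable
sample = sample_sequence(TOY_MODEL, rng)
```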

## Vanishing gradients with RNNs

Exploding gradients: mitigated by gradient clipping.
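One common form of gradient clipping rescales the whole gradient when its norm exceeds a threshold. A minimal sketch, assuming the gradient is a flat list of floats; `clip_by_global_norm` and `max_norm` are illustrative names, not a specific library's API.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so that their L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)  # already small enough, leave unchanged
    scale = max_norm / total
    return [g * scale for g in grads]


# A gradient with norm 5 is rescaled to norm 1; its direction is preserved.
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```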

## GRU : Gated Recurrent Unit

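The goal is to preserve a dependency across a long sequence. For example, in the sentences below, the singular or plural subject (cat/cats) should determine the verb (was/were).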

```
The cat,  which was  surrounded by ..., was  very shy.
The cats, which were surrounded by ..., were very shy.
```

parameters: 

- $ c $: memory cell
- $ \tilde{c} $: candidate for the new value of $ c $
- $ \Gamma_u $: update gate
- $ \Gamma_r $: relevance gate

$$
\begin{cases}
c^{<t-1>}         & = a^{<t-1>} \\
\Gamma_r        & = \sigma \Big( w_r \big[ c^{<t-1>}, x^{<t>}  \big] + b_r \Big) \\
\tilde{c}^{<t>} & = \tanh \Big( w_c \big[ \ \ \Gamma_r \times c^{<t-1>}, x^{<t>} \ \ \big] + b_c \Big) \\
\Gamma_u        & = \sigma \Big( w_u \big[ c^{<t-1>}, x^{<t>}  \big] + b_u \Big) \\
c^{<t>}         & = \Gamma_u \times \tilde{c}^{<t>} + \big( 1 - \Gamma_u \big) \times c^{<t-1>} \\
a^{<t>}         & = c^{<t>} \\
\end{cases}
$$
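The GRU equations above can be sketched as a single step in plain Python. This is an illustration, not an efficient implementation: vectors are lists, weight matrices are lists of rows, and the parameter dictionary `p` with its zero weights is made up for the demo.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def gru_step(c_prev, x, p):
    concat = c_prev + x  # [c<t-1>, x<t>] (list concatenation)
    gamma_r = [sigmoid(z) for z in affine(p["w_r"], concat, p["b_r"])]
    gamma_u = [sigmoid(z) for z in affine(p["w_u"], concat, p["b_u"])]
    # The candidate uses the relevance-gated previous cell: [Gamma_r * c<t-1>, x<t>]
    gated = [r * c for r, c in zip(gamma_r, c_prev)] + x
    c_tilde = [math.tanh(z) for z in affine(p["w_c"], gated, p["b_c"])]
    # Blend candidate and previous cell with the update gate.
    return [u * ct + (1 - u) * cp for u, ct, cp in zip(gamma_u, c_tilde, c_prev)]


n_c, n_x = 2, 1
zeros = [[0.0] * (n_c + n_x) for _ in range(n_c)]
params = {
    "w_r": zeros, "b_r": [0.0] * n_c,
    "w_c": zeros, "b_c": [0.0] * n_c,
    "w_u": zeros, "b_u": [-20.0] * n_c,  # very negative bias: update gate near 0
}
c_prev = [0.5, -0.3]
c_next = gru_step(c_prev, [1.0], params)  # stays very close to c_prev
```

With the update gate close to 0, the cell is carried forward almost unchanged, which is how the unit can remember "cat" versus "cats" until the verb appears.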

## LSTM : Long Short Term Memory

parameters:

- $ \Gamma_u $: update gate
- $ \Gamma_f $: forget gate
- $ \Gamma_o $: output gate

$$
\begin{cases}
\tilde{c}^{<t>} & = \tanh \Big( w_c \big[ \ \ a^{<t-1>}, x^{<t>} \ \ \big] + b_c \Big) \\
\Gamma_u        & = \sigma \Big( w_u \big[ a^{<t-1>}, x^{<t>}  \big] + b_u \Big) \\
\Gamma_f        & = \sigma \Big( w_f \big[ a^{<t-1>}, x^{<t>}  \big] + b_f \Big) \\
\Gamma_o        & = \sigma \Big( w_o \big[ a^{<t-1>}, x^{<t>}  \big] + b_o \Big) \\
c^{<t>}         & = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>} \\
a^{<t>}         & = \Gamma_o * \tanh{\big(c^{<t>}\big)}
\end{cases}
$$

Peephole connection: $ c^{<t-1>} $ is also included in the computation of the gates $ \Gamma $.
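A minimal sketch of one LSTM step following the equations above, without the peephole connection. As with any plain-Python illustration, vectors are lists and matrices are lists of rows; the parameter dictionary `p` and its values are assumptions made for the demo.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(a_prev, c_prev, x, p):
    concat = a_prev + x  # [a<t-1>, x<t>] (list concatenation)
    c_tilde = [math.tanh(z) for z in affine(p["w_c"], concat, p["b_c"])]
    gamma_u = [sigmoid(z) for z in affine(p["w_u"], concat, p["b_u"])]
    gamma_f = [sigmoid(z) for z in affine(p["w_f"], concat, p["b_f"])]
    gamma_o = [sigmoid(z) for z in affine(p["w_o"], concat, p["b_o"])]
    # Separate update and forget gates, unlike the GRU's single update gate.
    c = [u * ct + f * cp for u, ct, f, cp in zip(gamma_u, c_tilde, gamma_f, c_prev)]
    a = [o * math.tanh(ci) for o, ci in zip(gamma_o, c)]
    return a, c


n_a, n_x = 2, 1
zeros = [[0.0] * (n_a + n_x) for _ in range(n_a)]
params = {
    "w_c": zeros, "b_c": [0.0] * n_a,
    "w_u": zeros, "b_u": [-20.0] * n_a,  # update gate near 0: ignore the candidate
    "w_f": zeros, "b_f": [20.0] * n_a,   # forget gate near 1: keep the old cell
    "w_o": zeros, "b_o": [0.0] * n_a,    # output gate exactly 0.5
}
c_prev = [0.5, -0.3]
a_next, c_next = lstm_step([0.0, 0.0], c_prev, [1.0], params)
```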

## Bidirectional RNNs

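For example, in text recognition, later words are sometimes needed to determine the meaning of an earlier word. A unidirectional RNN cannot use later words when making a prediction about an earlier one.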

$$
\hat{y}^{<t>} = g \Big( w_y \big[ \overrightarrow{a}^{<t>}, \overleftarrow{a}^{<t>} \big] + b_y \Big)
$$
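A minimal sketch of how the two activations in the formula above are obtained, assuming a scalar tanh RNN for each direction. The weights are made up, and the output layer $ g $ is left out. The backward pass runs over the reversed input and is then reversed again so that both states at index `t` refer to the same position.

```python
import math


def rnn_states(xs, w_a, w_x, b):
    """Run a scalar tanh RNN over xs and return the activation at each step."""
    a, states = 0.0, []
    for x in xs:
        a = math.tanh(w_a * a + w_x * x + b)
        states.append(a)
    return states


def bidirectional_states(xs, fwd, bwd):
    forward = rnn_states(xs, *fwd)
    # Process right-to-left, then flip back to align with the forward states.
    backward = rnn_states(xs[::-1], *bwd)[::-1]
    return list(zip(forward, backward))


xs = [0.0, 0.0, 1.0]  # the only informative input is at the last position
states = bidirectional_states(xs, fwd=(0.5, 1.0, 0.0), bwd=(0.5, 1.0, 0.0))
```

At the first position the forward state is still 0 because nothing informative has been seen yet, while the backward state is already nonzero because it has read the later input.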

For NLP problems, a bidirectional RNN with LSTM units is a common choice.

## Deep RNNs