# LSTM unit



refs: 

* https://colah.github.io/posts/2015-08-Understanding-LSTMs/
* https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714
* https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21



Consider trying to predict the last word in the text:

> “I grew up in France … (many words later) … I speak fluent ______.”

The answer is **French**. 

Recent information suggests that the next word is probably the name of a **language**, but to narrow down which language we need the context of **France**, from much further back, to predict **French**.

To do that, an RNN needs to remember **long-term dependencies**. In theory an RNN can do this, but in practice vanishing and exploding gradients prevent it from having a **good long-term memory**.
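As a rough illustration of why gradients vanish or explode, the sketch below multiplies the same per-step factor over many time steps. This is a simplification of what backpropagation through time does in an RNN; the function name `backprop_factor` and the values 0.9, 1.1 and 100 steps are assumptions chosen for illustration.

```python
def backprop_factor(per_step_gradient, steps):
    """Multiply the same per-step gradient factor `steps` times."""
    factor = 1.0
    for _ in range(steps):
        factor *= per_step_gradient
    return factor


# Hypothetical per-step factors, slightly below and slightly above 1
vanishing = backprop_factor(0.9, 100)  # shrinks toward 0
exploding = backprop_factor(1.1, 100)  # grows very large

print(f"factor below 1 after 100 steps: {vanishing:.2e}")
print(f"factor above 1 after 100 steps: {exploding:.2e}")
```

A factor only slightly different from 1 is enough: after many steps the signal from early inputs is either negligible or overwhelming.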

**Long Short Term Memory (LSTM)** networks, usually just called "LSTMs", are a special kind of RNN capable of learning long-term dependencies.

LSTMs are explicitly designed to avoid the long-term dependency problem mentioned above. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

<img src="../fig/lstm_neuron.png" width="600" align="left"/> 


An LSTM unit works like a system of pipes: the memory is the water and the valves are the activation functions. The memory (water) flowing through the memory pipe is regulated by two valves, so the unit has three parts:

* memory pipe
* forget valve
* new memory valve


## The memory pipe

The memory pipe carries the old memory state $C_{t-1}$ through the unit. The flow in this pipe determines how much of the old memory is kept and how much new memory is mixed in to define the new memory state $C_t$ of the unit.


<img src="../fig/memory_pipe.png" width="600" align="left"/> 



##  Forget valve

$
f_t = \sigma(W_f[C_{t-1}, h_{t-1}, X_t] + b_f)
$


Because $\sigma$ outputs values between 0 and 1, $f_t$ acts as a valve on the previous memory state. If $f_t = 0$, the unit completely forgets the previous memory state. Conversely, if $f_t = 1$, the unit remembers everything.


<img src="../fig/forget_valve.png" width="600" align="left"/> 
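The forget valve can be sketched for a single scalar unit as follows. This is a minimal illustration, not a trained model: the weights `w`, the bias values and the inputs are hypothetical numbers chosen so that the valve ends up almost fully closed or almost fully open.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forget_gate(prev_c, prev_h, x, w, b):
    """Scalar forget valve: one weight per entry of [C_{t-1}, h_{t-1}, X_t]."""
    z = w[0] * prev_c + w[1] * prev_h + w[2] * x + b
    return sigmoid(z)


prev_c, prev_h, x = 2.0, 0.5, 1.0
w = (0.1, 0.2, 0.3)  # hypothetical weights

f_closed = forget_gate(prev_c, prev_h, x, w, b=-10.0)  # large negative bias: valve closed
f_open = forget_gate(prev_c, prev_h, x, w, b=10.0)     # large positive bias: valve open

print("memory kept with closed valve:", f_closed * prev_c)  # close to 0
print("memory kept with open valve:  ", f_open * prev_c)    # close to prev_c
```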




## New memory valve


$
i_t = \sigma(W_i[C_{t-1}, h_{t-1}, X_t] + b_i)
$

Like $f_t$, $i_t$ is a value between 0 and 1. It controls how much of the new candidate memory $G_t$ is mixed into the old memory.

$
G_t = \tanh(W_g[C_{t-1}, h_{t-1}, X_t] + b_g)
$

The new memory state $C_t$ of the unit mixes the old memory, scaled by $f_t$, with the candidate memory $G_t$, scaled by $i_t$:


$
C_t = f_t*C_{t-1} + i_t*G_t
$

<img src="../fig/new_memory_valve.png" width="600" align="left"/>  
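To see how the two valves mix old and new memory, here is a small scalar sketch of the update $C_t = f_t*C_{t-1} + i_t*G_t$. The gate values are set by hand rather than computed from weights, and the helper name `new_memory` and the numbers are assumptions for illustration only.

```python
def new_memory(prev_c, f_t, i_t, g_t):
    """Scalar memory update: C_t = f_t * C_{t-1} + i_t * G_t."""
    return f_t * prev_c + i_t * g_t


prev_c = 2.0  # old memory state C_{t-1}
g_t = -0.8    # candidate memory G_t, in (-1, 1) because it comes from tanh

# Forget valve open, new memory valve closed: old memory passes through unchanged
keep_old = new_memory(prev_c, f_t=1.0, i_t=0.0, g_t=g_t)

# Forget valve closed, new memory valve open: memory is replaced by the candidate
overwrite = new_memory(prev_c, f_t=0.0, i_t=1.0, g_t=g_t)

# Both valves half open: 0.5 * 2.0 + 0.5 * (-0.8)
blend = new_memory(prev_c, f_t=0.5, i_t=0.5, g_t=g_t)

print(keep_old, overwrite, blend)
```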


## The output state of the unit

$
o_t = \sigma(W_o[C_{t-1}, h_{t-1}, X_t] + b_o)
$


$
h_t = o_t * \tanh(C_t)
$

The unit outputs two states:
* $h_t$, the output (hidden) state
* $C_t$, the memory state

<img src="../fig/output_state.png" width="600" align="left"/>  
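The output step can also be sketched for a scalar unit. The helper name `unit_output` and the sample values are hypothetical; the point is only that $\tanh$ squashes the memory into $(-1, 1)$ and $o_t$ then decides how much of it leaves the unit.

```python
import math


def unit_output(c_t, o_t):
    """Scalar output state: h_t = o_t * tanh(C_t)."""
    return o_t * math.tanh(c_t)


c_t = 5.0  # a large memory value (hypothetical)

h_open = unit_output(c_t, o_t=1.0)    # valve fully open: h_t = tanh(C_t), just below 1
h_half = unit_output(c_t, o_t=0.5)    # valve half open: half of tanh(C_t)
h_closed = unit_output(c_t, o_t=0.0)  # valve closed: nothing leaves the unit

print(h_open, h_half, h_closed)
```

Note that the memory state $C_t$ itself is passed on unchanged; only $h_t$ is filtered by the output valve.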


## Pseudo Code 


```python
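import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def LSTM(prev_ct, prev_ht, X_t, W, b):
    # Concatenate previous memory, previous output and current input
    combine = np.concatenate([prev_ct, prev_ht, X_t])

    ft = sigmoid(W["f"] @ combine + b["f"])         # forget valve f_t
    it = sigmoid(W["i"] @ combine + b["i"])         # new memory valve i_t
    candidate = np.tanh(W["g"] @ combine + b["g"])  # candidate memory G_t

    Ct = prev_ct * ft + candidate * it              # new memory state C_t

    ot = sigmoid(W["o"] @ combine + b["o"])         # output valve o_t
    ht = ot * np.tanh(Ct)                           # output state h_t

    return ht, Ct


# Illustrative sizes, random weights and random inputs (assumptions, not trained values)
hidden, n_in = 3, 2
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hidden, 2 * hidden + n_in)) for k in "figo"}
b = {k: np.zeros(hidden) for k in "figo"}
inputs = rng.normal(size=(5, n_in))

ct = np.zeros(hidden)
ht = np.zeros(hidden)

# Feed both states back into the unit at every time step
for Xt in inputs:
    ht, ct = LSTM(ct, ht, Xt, W, b)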
```