# Recurrent vs. Sequence Processing Approaches

### Understanding Context
The context of a text input helps to interpret its intended meaning more accurately.

Example: "He went for a walk" 

- Unknown context: interpret as a statement of fact  <br/>
- Known context: interpret as an expression of frustration  <br/>
- Context = walking as a way to cool off and clear their head


### Understanding Sequential Text Flow

Information about the natural order in which text or speech is built up:

- information: word positions and their order <br/>
- left-to-right approach: for long sequences or during sequence generation <br/>
- unidirectional problem: information is missing until the full sequence has been processed <br/>


### Understanding Sequential Text Flow - Example

- "I will be late for training..."  <br/>
- "I will be late for training if I miss the train..." <br/>
- "I will be late for training if I miss the train next week" <br/>
<br/>
- "I will be late for the train next week if I miss the training."


### Understanding Sequential Text Flow - Challenges
Challenges:<br/>
- the size of a sequence is not easy to determine <br/>
- the relevance of words can vary strongly <br/>
- hard to assign a specific meaning/rule to a position in a sequence <br/>

One approach is to use the idea of dynamic system modeling.

### Recap Properties of Dynamic System Modeling

- Approach 1: Finite Impulse Response (FIR), sequential model (feedforward network) <br/>
- Approach 2: Infinite Impulse Response (IIR), recurrent model (feedback network) <br/>

Both approaches will be presented today.

Direct processing:<br/>
- limits the number of input values (length of the input)
- no problems with instability or oscillation

Recurrent processing:<br/>
- input vector/sequence of moderate size, covering only the immediate neighbourhood; earlier inputs are carried in the internal state<br/>
- can be unstable and can oscillate<br/>


### Dynamic System Modeling - FIR

Finite Impulse Response (FIR):
It is the so-called first-order system; an example is a heat exchange process.
After one end of a metal rod is heated to a certain temperature, the temperature of the other end changes in proportion
to the temperature difference between the two ends.
In response to a temperature impulse at one end, the temperature at the other end follows in a way similar to a stochastic moving average.
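
The finite-memory behaviour can be illustrated with a plain moving average. This is a minimal sketch, not a model of the metal rod: the function name `fir_moving_average` and the window length of 4 are assumptions chosen for illustration. The response to a single impulse ends after `window` steps, which is what "finite impulse response" refers to.

```python
import numpy as np

def fir_moving_average(x, window=4):
    """Causal moving average: y[t] is the mean of the last `window` inputs."""
    kernel = np.ones(window) / window
    # Full convolution truncated to the input length keeps the filter causal.
    return np.convolve(x, kernel)[: len(x)]

# A single impulse at t = 0 (illustrative input).
impulse = np.zeros(10)
impulse[0] = 1.0

response = fir_moving_average(impulse, window=4)
print(response)  # non-zero only for the first `window` steps
```

Because the output depends on a fixed number of past inputs and never on past outputs, there is no feedback path that could become unstable.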

### Dynamic System Modeling - IIR

Infinite Impulse Response (IIR):
It is the so-called second-order system; an example is a mass fixed on an elastic
spring, such as a car and its suspension.
Depending on the damping of the spring component, this system may or may not oscillate.
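
A discrete second-order recursion shows the same behaviour. The sketch below is an assumption-laden illustration, not a model of a real suspension: the function `iir_second_order` and the parameters `r` (pole radius, playing the role of damping) and `theta` (oscillation frequency) are hypothetical names. Each output feeds back into later outputs, so a single impulse keeps echoing.

```python
import math

def iir_second_order(x, r, theta):
    """y[t] = a1*y[t-1] + a2*y[t-2] + x[t], with poles at r*exp(+/- i*theta)."""
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    y = []
    for t, x_t in enumerate(x):
        y_1 = y[t - 1] if t >= 1 else 0.0  # feedback of the previous output
        y_2 = y[t - 2] if t >= 2 else 0.0  # feedback of the output before that
        y.append(a1 * y_1 + a2 * y_2 + x_t)
    return y

impulse = [1.0] + [0.0] * 59

# r < 1: the oscillation decays; r > 1: the oscillation grows (unstable).
damped = iir_second_order(impulse, r=0.9, theta=math.pi / 8)
unstable = iir_second_order(impulse, r=1.05, theta=math.pi / 8)

print(f"damped tail:   {abs(damped[-1]):.4f}")
print(f"unstable tail: {abs(unstable[-1]):.4f}")
```

The only difference between the two runs is the pole radius, which is why feedback models need care that feedforward models do not.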

# Recurrent Neural Networks (RNN)

- process sequences of arbitrary length, in theory <br/>
- unidirectional (typically left-to-right) <br/>
- important building block for natural language and audio processing applications <br/>
- the memory problem is addressed by modified variants (most prominently LSTM)



# Recurrent Neural Networks (RNN)

$h_t$ as hidden state: semantics of the words already processed, dependent on the current input, continuously updated <br/>
$x_t$ as word in input sequence as vector <br/>
$y_t$ as vector output at particular position <br/>


$h_t = H(h_{t-1}, x_{t})$  <br/> 
$y_t = Y(h_t)$ <br/>
<br/>
Often no notion of an external output besides the hidden state ($Y$ as identity function)
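
The two equations can be sketched in NumPy as follows. The slides leave $H$ unspecified; this sketch assumes the common choice $H = \tanh$ of an affine map, and the names `W_hh`, `W_xh`, `b_h`, `rnn_step` and `run_rnn` are hypothetical. $Y$ is taken as the identity.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """One update h_t = H(h_{t-1}, x_t), with H assumed to be tanh(affine)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

def run_rnn(xs, W_hh, W_xh, b_h):
    """Apply the same cell (same weights) to every position, left to right."""
    h = np.zeros(W_hh.shape[0])
    hidden_states = []
    for x_t in xs:
        h = rnn_step(h, x_t, W_hh, W_xh, b_h)
        hidden_states.append(h)  # Y is the identity here, so y_t = h_t
    return np.stack(hidden_states)

# Random weights and inputs, for illustration only.
rng = np.random.default_rng(0)
hidden_size, input_size, seq_len = 4, 3, 6
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
b_h = np.zeros(hidden_size)
xs = rng.normal(size=(seq_len, input_size))

ys = run_rnn(xs, W_hh, W_xh, b_h)
print(ys.shape)
```

The loop accepts any sequence length because the cell itself has no notion of position; only the hidden state carries information forward.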

# Recurrent Neural Networks (RNN)
## Discrete Linear System Comparison

$h_t$ as hidden state, semantics of words already processed <br/>
$x_t$ as word in input sequence as vector <br/>
$y_t$ as vector output at particular position <br/>

Very similar to discrete linear systems: <br/>
$h_t = H(h_{t-1}, x_{t-1})$  <br/> 
- in an RNN, $H$ is typically a non-linear function and can be arbitrarily complex <br/> 
- the RNN internal state depends on $x_{t}$ and not on $x_{t-1}$ <br/>
<br/>
$y_t = Y(h_t, x_t)$ <br/>
- in an RNN, the feedthrough (non-dynamic influence of $x_t$) is retained because $y_t$ depends on the hidden state $h_t$, which already contains $x_t$ <br/>
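
For comparison, a discrete linear time-invariant system in state-space form can be written as below. The matrix names $A$, $B$, $C$, $D$ follow the usual textbook convention and are not taken from the slides; $D$ is the feedthrough term.

$$h_t = A\,h_{t-1} + B\,x_{t-1}$$  <br/>
$$y_t = C\,h_t + D\,x_t$$  <br/>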



# Recurrent Neural Networks (RNN)

$h_t = H(h_{t-1}, x_{t})$  <br/> 
$y_t = Y(h_t)$<br/>
- weights, biases and activation function stay the same in all cells
- changing input ($h_{t-n}, x_{t-n}$) per cell
- additional feedback loop of hidden layer $h_{t-n}$

<img src="NLP_NEF/NLP_NEF_1.PNG" alt="RNN Cell " />


# Recurrent Neural Networks (RNN)

- In theory arbitrary length of sequence <br/>
- In practice vanishing gradients due to "chaining" <br/>
- Difficulties for long-term memory <br/>

*In an RNN, the derivatives are passed recursively through the same network at every step, so the gradient is multiplied by similar factors again and again and tends to vanish.*
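
The effect of "chaining" can be made visible by multiplying the per-step Jacobians. This is a minimal sketch under stated assumptions: a tanh cell, random weights, and a deliberately small recurrent weight scale of 0.1 so that the shrinkage is easy to see; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size, seq_len = 8, 3, 50

# Small recurrent weights (an assumption made so the effect is clearly visible).
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))

h = np.zeros(hidden_size)
jacobian = np.eye(hidden_size)  # d h_t / d h_0, starts as the identity
norms = []
for _ in range(seq_len):
    x_t = rng.normal(size=input_size)
    h = np.tanh(W_hh @ h + W_xh @ x_t)
    # Chain rule for one step: d h_t / d h_{t-1} = diag(1 - h_t^2) @ W_hh
    step_jacobian = (1.0 - h**2)[:, None] * W_hh
    jacobian = step_jacobian @ jacobian
    norms.append(np.linalg.norm(jacobian, 2))

print(f"after 1 step: {norms[0]:.3e}, after {seq_len} steps: {norms[-1]:.3e}")
```

With larger recurrent weights the same product can grow instead of shrink, which is the exploding-gradient counterpart of the same mechanism.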




 # Long Short-Term Memory (LSTM)
 
 - add a state $c$ to support long-term memory <br/>
 - replace the RNN state vector $h$ with two state vectors $h$, $c$ <br/>
<br/>
$h_t$ as hidden state: semantics of the words already processed, and vector output <br/>
$x_t$ as word in input sequence as vector <br/>
$c_t$ as memory state
 <br/>
$$h_t = H(h_{t-1}, x_{t},c_{t})$$  <br/> 
$$c_t = C(h_{t-1}, x_{t},c_{t-1})$$  <br/> 
$$y_t = h_t$$ <br/>
 

 # Long Short-Term Memory (LSTM)
$$h_t = H(h_{t-1}, x_{t},c_{t})$$  <br/> 
$$c_t = C(h_{t-1}, x_{t},c_{t-1})$$  <br/> 
$$y_t = h_t$$ <br/>

<img src="NLP_NEF/NLP_NEF_2.PNG" alt="RNN Cell " />

 # Updating Memory and Controlling Output with Gates

- Forget Gate $f_t$: controls which information from the previous memory state $c_{t-1}$ is discarded <br/>
- Input Gate $i_t$: controls which information from the current input is written to the current memory state $c_{t}$ <br/>
- Output Gate $o_t$: controls which information is read from the memory state $c_{t}$ and passed on to the next cell <br/>

*0 = no pass-through* <br/>
*1 = full pass-through* <br/>


$$h_t = y_t = H(h_{t-1}, x_{t},c_{t}) = o_t ○ \tanh(c_t)$$  <br/> 
$$c_t = C(h_{t-1}, x_{t},c_{t-1}) = f_t ○ c_{t-1} + i_t ○ C^*(h_{t-1},x_t)$$  <br/> 
$$C^*(h_{t-1},x_t) = \tanh(W_{h_c} h_{t-1} + W_{x_c} x_t + b_c)$$

$○$ = element-wise multiplication, Hadamard product  
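
One LSTM step following these equations might look like the sketch below. The slides do not spell out how the gates are computed; the sketch assumes the widely used form, a sigmoid of an affine function of $h_{t-1}$ and $x_t$. The parameter names (`W_f`, `b_f`, ...) and helper functions are hypothetical, and each `W_*` stacks the hidden and input weights, so `W_c` corresponds to $[W_{h_c}\; W_{x_c}]$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, params):
    """One LSTM update; returns (h_t, c_t)."""
    z = np.concatenate([h_prev, x_t])  # stacked [h_{t-1}; x_t]
    # Gates (assumed form): values between 0 (block) and 1 (pass through).
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])
    # Candidate memory C*(h_{t-1}, x_t)
    c_star = np.tanh(params["W_c"] @ z + params["b_c"])
    c_t = f_t * c_prev + i_t * c_star  # element-wise (Hadamard) products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def init_params(hidden_size, input_size, rng):
    """Random illustrative parameters for the three gates and the candidate."""
    params = {}
    for name in ("f", "i", "o", "c"):
        params[f"W_{name}"] = rng.normal(
            scale=0.5, size=(hidden_size, hidden_size + input_size)
        )
        params[f"b_{name}"] = np.zeros(hidden_size)
    return params

rng = np.random.default_rng(0)
params = init_params(hidden_size=4, input_size=3, rng=rng)
h, c = np.zeros(4), np.zeros(4)
for x_t in rng.normal(size=(5, 3)):
    h, c = lstm_step(h, c, x_t, params)
print(h, c)
```

With the forget gate near 1 and the input gate near 0, $c_t$ is copied forward almost unchanged, which is how the memory state supports long-term information.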


 # Updating Memory and Controlling Output with Gates
$$h_t = y_t = H(h_{t-1}, x_{t},c_{t}) = o_t ○ \tanh(c_t)$$  <br/> 
$$c_t = C(h_{t-1}, x_{t},c_{t-1}) = f_t ○ c_{t-1} + i_t ○ C^*(h_{t-1},x_t)$$  <br/> 
$$C^*(h_{t-1},x_t) = \tanh(W_{h_c} h_{t-1} + W_{x_c} x_t + b_c)$$


<img src="NLP_NEF/NLP_LSTM.PNG" alt="LSTM Cell " />


 # Long Short-Term Memory - Summary

Where is the magic?
- Saturating activation functions such as sigmoid (range $(0,1)$) and tanh (range $(-1,1)$)
- They prevent activation values from growing arbitrarily when passed through multiple layers
- Instability is only “shadowed” by saturation (data processing by feedback networks: convergence to some values is enforced)
- LSTM is just one approach; there is more to study, e.g. Gated Recurrent Units (GRU)
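
The "shadowing" of instability by saturation can be sketched by running the same feedback loop with and without tanh. The weights are scaled up on purpose (an assumption, so that the linear loop is clearly unstable); all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
size, steps = 4, 100

# Feedback weights chosen (as an assumption) so that the linear loop diverges.
W = 1.5 * np.eye(size) + rng.normal(scale=0.05, size=(size, size))

h_linear = np.full(size, 0.1)
h_tanh = np.full(size, 0.1)
for _ in range(steps):
    h_linear = W @ h_linear       # no saturation: values keep growing
    h_tanh = np.tanh(W @ h_tanh)  # saturation bounds every entry by 1 in magnitude

print(f"linear loop norm: {np.linalg.norm(h_linear):.3e}")
print(f"tanh loop norm:   {np.linalg.norm(h_tanh):.3e}")
```

The saturated loop stays bounded, but the underlying feedback dynamics are unchanged; the state is simply pushed towards the limits of the activation function.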


 # Long Short-Term Memory - Summary

Problems/Disadvantages:
- only information from previous positions can be accessed
- the left-to-right restriction can lead to wrong semantics due to missing context
- bi-directional models could help (e.g. combining left-to-right and right-to-left processing)
- recurrent models still suffer from the unfavorable mathematical properties of feedback systems, which has led to a revival of sequence processing
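
The bi-directional idea can be sketched by running one recurrent pass in each direction and concatenating the states per position. This is a simplified illustration with a plain tanh cell rather than an LSTM; the function names and the random weights are assumptions.

```python
import numpy as np

def run_rnn(xs, W_hh, W_xh):
    """Plain tanh RNN over a sequence; returns one hidden state per position."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        states.append(h)
    return np.stack(states)

def run_bidirectional(xs, fwd, bwd):
    """Concatenate a left-to-right pass and a right-to-left pass per position."""
    left_to_right = run_rnn(xs, *fwd)
    # Process the reversed sequence, then flip back so row t matches position t.
    right_to_left = run_rnn(xs[::-1], *bwd)[::-1]
    return np.concatenate([left_to_right, right_to_left], axis=1)

rng = np.random.default_rng(0)
hidden_size, input_size, seq_len = 4, 3, 5
fwd = (rng.normal(scale=0.5, size=(hidden_size, hidden_size)),
       rng.normal(scale=0.5, size=(hidden_size, input_size)))
bwd = (rng.normal(scale=0.5, size=(hidden_size, hidden_size)),
       rng.normal(scale=0.5, size=(hidden_size, input_size)))
xs = rng.normal(size=(seq_len, input_size))

out = run_bidirectional(xs, fwd, bwd)
print(out.shape)  # one vector of size 2 * hidden_size per position
```

In this sketch the first position already sees the end of the sequence through the right-to-left half, at the cost of needing the full sequence before any output is available.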