## Recurrent Neural Networks

### Recurrent vs Feed-Forward Networks

All network architectures in the previous chapters have been **feed-forward neural networks**: The input passes through the network, is processed through the network's connections and weights, and results in an output. Such a network can be denoted as a simple function:

$$f(x_i) \to y_i$$ 

A **recurrent neural network (RNN)**, on the other hand, can be described as a function with a slightly different form:

$$f(x_t, h_t) \to (y_t, h_{t+1})$$

Note the indices $t$ and $t+1$, which indicate that inputs and outputs are ordered. This makes recurrent neural networks strong candidates for **sequence learning**, that is, learning from data points whose order matters (like text, audio, or time series data). Unlike feed-forward neural networks, a recurrent neural network carries an internal state $h$ from one input to the next and updates it with every new input. This gives the network a form of **sequential memory**.
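To make the function form $f(x_t, h_t) \to (y_t, h_{t+1})$ more concrete, here is a minimal NumPy sketch of a single recurrent step and of a loop that applies it to a sequence. The weight matrices `W_xh`, `W_hh`, `W_hy`, the layer sizes, and the `tanh` activation are assumptions chosen for illustration; the weights are random and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size, output_size = 3, 4, 2

# Hypothetical weights, randomly initialised for illustration only (not trained)
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))   # input  -> hidden
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))  # hidden -> hidden
W_hy = rng.normal(scale=0.5, size=(output_size, hidden_size))  # hidden -> output


def rnn_step(x_t, h_t):
    """One application of f(x_t, h_t) -> (y_t, h_{t+1})."""
    h_next = np.tanh(W_xh @ x_t + W_hh @ h_t)  # new state mixes input and old state
    y_t = W_hy @ h_next                        # output is read from the new state
    return y_t, h_next


def run_rnn(xs):
    """Apply the same step function to every element of a sequence, in order."""
    h = np.zeros(hidden_size)  # initial state h_0
    ys = []
    for x_t in xs:
        y_t, h = rnn_step(x_t, h)
        ys.append(y_t)
    return np.stack(ys), h


xs = rng.normal(size=(5, input_size))  # a toy sequence of 5 time steps
ys, h_final = run_rnn(xs)
print(ys.shape)       # (5, 2)
print(h_final.shape)  # (4,)
```

Because the state is passed from step to step, feeding the same inputs in a different order generally leads to a different final state, which is what distinguishes this from applying a feed-forward network to each input separately.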


![](https://miro.medium.com/max/964/1*xn5kA92_J5KLaKcP7BMRLA.gif)

_Source: animation by [Raimi Karim](https://towardsdatascience.com/animated-rnn-lstm-and-gru-ef124d06cf45)_

### Applications of RNNs

Recurrent neural networks are therefore, in principle, very powerful in what they can learn and compute. One could say they are trainable general-purpose computers. In technical terms, they have been shown to be **Turing complete**: with suitable weights, an RNN can in theory implement **any algorithm**. Whether training actually finds such weights is a separate question.

One example is playing Mario Kart:



```python
from IPython.display import HTML

# Embed the YouTube video in the notebook output
HTML("""
<iframe width="800" height="600" src="https://www.youtube.com/embed/Ipi40cb_RsI" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
""")
```


Recurrent neural networks are frequently applied in

- speech recognition
- [language translation](http://translate.google.com)
- image description
- time series forecasting

### The Short Memory Problem

When training _deep_ neural networks, we can run into the **[vanishing gradient problem](https://www.quora.com/What-is-the-vanishing-gradient-problem?share=1)**: during gradient-based training, some layers receive only tiny updates to their weights, which makes learning slow or unsuccessful. In feed-forward neural networks, the issue is caused by the depth of the network and the choice of activation functions, and can be mitigated accordingly (a popular remedy is the ReLU activation).


Now consider how an RNN processing a long input sequence is similar to a deep neural network: for a sequence of fixed length, we can replace the RNN with its _unrolled_ version, a feed-forward neural network with one layer per time step in which all layers share the same weights.


![](https://upload.wikimedia.org/wikipedia/commons/0/05/RNN.png)

_Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:RNN.png)_

Such a network also suffers from the vanishing gradient problem. And when we use the RNN for sequence learning, we see the problem as **short memory**: _As the network processes more input, it has trouble retaining information from earlier steps._
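The shrinking gradient can be observed numerically. The following sketch unrolls a small `tanh` RNN over 50 time steps and propagates a gradient backwards through time, recording its norm at each step. All sizes and weights are assumptions for illustration; in particular, the recurrent matrix `W_hh` is deliberately scaled so that its largest singular value is 0.9, which guarantees that the gradient shrinks at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size, n_steps = 8, 3, 50

# Hypothetical recurrent weights, rescaled so the largest singular value is 0.9
W_hh = rng.normal(size=(hidden_size, hidden_size))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)
W_xh = rng.normal(size=(hidden_size, input_size))

# Forward pass through the unrolled network: keep every hidden state
h = np.zeros(hidden_size)
states = []
for _ in range(n_steps):
    x_t = rng.normal(size=input_size)
    h = np.tanh(W_hh @ h + W_xh @ x_t)
    states.append(h)

# Backward pass: start with a unit-norm gradient at the last hidden state
grad = np.ones(hidden_size) / np.sqrt(hidden_size)
grad_norms = [np.linalg.norm(grad)]
for h_next in reversed(states):
    # Chain rule through h_next = tanh(W_hh @ h_prev + ...):
    # the Jacobian w.r.t. h_prev is diag(1 - h_next**2) @ W_hh
    grad = W_hh.T @ ((1.0 - h_next**2) * grad)
    grad_norms.append(np.linalg.norm(grad))

print(f"gradient norm at the last step:  {grad_norms[0]:.3f}")
print(f"gradient norm {n_steps} steps earlier: {grad_norms[-1]:.2e}")
```

The early time steps receive a gradient that is orders of magnitude smaller than the late ones, so the weights are barely adjusted based on what happened at the start of the sequence. With a recurrent matrix of larger norm the gradient does not have to vanish; it can also grow, which is the related exploding gradient problem.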

### Solutions: LSTM and GRU

The search for networks that suffer less from short memory and can retain information over much longer sequences brings us to two new network building blocks: LSTM and GRU.

**Long Short-Term Memory** is a recurrent neural network architecture. LSTM networks contain **LSTM units**, composed of a **cell** and three **gates** - an **input gate**, an **output gate** and a **forget gate**. The cell can store values and the gates regulate read, write and delete operations on the cell. _During training, the LSTM network can learn how to operate memory in order to remember the information that is relevant for the task._  



![](https://miro.medium.com/max/1125/1*goJVQs-p9kgLODFNyhl9zA.gif)

_Source: animation by [Raimi Karim](https://towardsdatascience.com/animated-rnn-lstm-and-gru-ef124d06cf45)_

This ability makes LSTMs especially suitable for processing sequences of data, including time series data for classification and prediction. Because LSTM cells can retain information over many steps, LSTM networks are good at extracting patterns from long input sequences.
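The interplay of cell and gates can be sketched in a few lines of NumPy. This follows the commonly used textbook formulation of an LSTM step, not the internals of any particular library; the stacked weight matrix `W`, the bias `b`, and the gate ordering are assumptions made for this example. To show the "remember" behaviour, the gate biases are set by hand (not learned) so that the forget gate stays open and the input gate stays closed.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_t, c_t, W, b):
    """One LSTM step. W stacks the weights for forget, input, output gate and candidate."""
    z = W @ np.concatenate([h_t, x_t]) + b
    f, i, o, g = np.split(z, 4)
    f = sigmoid(f)   # forget gate: how much of the old cell value to keep ("delete")
    i = sigmoid(i)   # input gate:  how much of the candidate to store ("write")
    o = sigmoid(o)   # output gate: how much of the cell to expose ("read")
    g = np.tanh(g)   # candidate values that could be written to the cell
    c_next = f * c_t + i * g
    h_next = o * np.tanh(c_next)
    return h_next, c_next


hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)

# Hypothetical small random weights; biases hand-set for illustration
W = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)
b[:hidden_size] = 20.0                    # forget gate close to 1: keep the cell
b[hidden_size:2 * hidden_size] = -20.0    # input gate close to 0: ignore new input

c_start = rng.normal(size=hidden_size)
h, c = np.zeros(hidden_size), c_start.copy()
for x_t in rng.normal(size=(100, input_size)):
    h, c = lstm_step(x_t, h, c, W, b)

print(np.max(np.abs(c - c_start)))  # tiny: the cell content survived 100 steps
```

In a trained network the gates are not fixed like this. Their values depend on the current input and state, so the network can decide step by step what to keep, overwrite, or output.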


An alternative architecture is the **GRU network**: **Gated Recurrent Units (GRUs)** work in a way similar to LSTM units, but use only two gates and no separate cell. They have fewer parameters and are therefore easier to train than LSTMs, but can also be more limited in their capabilities.
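The difference in parameter count can be estimated with a short calculation. The sketch below assumes the textbook formulations, in which each unit type computes a number of transformations of the concatenated vector $[h_t, x_t]$, each with its own bias: one for a simple RNN, four for an LSTM (three gates plus a candidate value), and three for a GRU (two gates plus a candidate value). The helper names are hypothetical, and concrete implementations may differ slightly, for example in how they handle biases.

```python
def recurrent_params(input_size, units, n_blocks):
    """Parameters of n_blocks weight matrices mapping [h, x] -> units, each with a bias."""
    return n_blocks * (units * (units + input_size) + units)


def simple_rnn_params(input_size, units):
    return recurrent_params(input_size, units, n_blocks=1)


def lstm_params(input_size, units):
    return recurrent_params(input_size, units, n_blocks=4)


def gru_params(input_size, units):
    return recurrent_params(input_size, units, n_blocks=3)


input_size, units = 3, 16
print(simple_rnn_params(input_size, units))  # 320
print(lstm_params(input_size, units))        # 1280
print(gru_params(input_size, units))         # 960
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer with the same number of units.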

## Recurrent Neural Networks with `keras` 

`keras` provides a number of recurrent neural network architectures as layers:

- `keras.layers.SimpleRNN`: fully-connected RNN where the output is fed back as the new input
- `keras.layers.LSTM`: LSTM layer
- `keras.layers.GRU`: GRU layer
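As a minimal sketch of how such a layer might be used, the following example builds a small sequence-to-one regression model around an `LSTM` layer and trains it briefly on random toy data. The data, the layer sizes, and the training settings are assumptions chosen only to keep the example short; it is not a tuned model. Recurrent layers expect input of shape `(batch, timesteps, features)`.

```python
import numpy as np
import keras

timesteps, n_features = 20, 3

# A recurrent layer followed by a dense output layer
model = keras.Sequential([
    keras.Input(shape=(timesteps, n_features)),
    keras.layers.LSTM(16),   # swap in keras.layers.GRU or keras.layers.SimpleRNN to compare
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Toy data (made up for illustration): predict the sum of the first feature over time
rng = np.random.default_rng(0)
X = rng.normal(size=(32, timesteps, n_features)).astype("float32")
y = X[:, :, 0].sum(axis=1, keepdims=True)

model.fit(X, y, epochs=2, batch_size=8, verbose=0)

pred = model.predict(X[:4], verbose=0)
print(pred.shape)  # (4, 1)
```

By default, these layers return only the output of the last time step. Passing `return_sequences=True` makes a layer return one output per time step, which is needed when stacking several recurrent layers.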

## References

- [Illustrated Guide to Recurrent Neural Networks: Understanding the Intuition](https://www.youtube.com/watch?v=LHXXI4-IEns)
- [Illustrated Guide to LSTM's and GRU's: A step by step explanation](https://www.youtube.com/watch?v=8HyCNIVRbSU)
- [Meaning (and proof) of “RNN can approximate any algorithm”](https://stats.stackexchange.com/questions/220907/meaning-and-proof-of-rnn-can-approximate-any-algorithm)

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_