# Recurrent Neural Networks (RNNs)

RNNs were developed to model temporal processes such as

 * Stock market prices
 * Weather forecast
 * etc.

but eventually led to major breakthroughs in

 * [Natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing)

once researchers started to use RNNs in neural machine translation.


## Generative learning

The principal idea is to use a sequence of known inputs to generate an output. By appending the generated output to the known inputs and repeating the step, a sequence of arbitrary length can be generated.
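The generation loop described above can be sketched as follows. This is a minimal illustration, not a trained network: `predict_next` is a hypothetical stand-in for a learned model and uses an exact trigonometric identity for a sinusoid, and `omega` is an assumed sampling step.

```python
import math

def predict_next(window, omega):
    """Stand-in for a trained model: exact next sample of a sinusoid.

    Uses the identity sin((t+1)w) = 2*cos(w)*sin(t*w) - sin((t-1)*w).
    """
    return 2.0 * math.cos(omega) * window[-1] - window[-2]

def generate(seed, n_steps, omega):
    """Autoregressive rollout: append each prediction to the inputs."""
    sequence = list(seed)
    for _ in range(n_steps):
        sequence.append(predict_next(sequence, omega))
    return sequence

omega = 0.1                                        # hypothetical angular step between samples
seed = [math.sin(0 * omega), math.sin(1 * omega)]  # the known inputs
wave = generate(seed, n_steps=50, omega=omega)
```

With a learned model in place of `predict_next`, prediction errors are fed back as inputs too, so they can accumulate over long rollouts.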

### Jordan RNN (1986)

There was substantial work on the topic at the end of the 1980s. For example, control theory and dynamical systems provided a suitable existing framework for studying recurrent neural networks. See, for example,

 * A.J. Robinson and F. Fallside (1987): "Static and Dynamic Error Propagation Networks with Application to Speech Coding". In the Proceedings of the Neural Information Processing Systems (NeurIPS). [PDF](https://proceedings.neurips.cc/paper/1987/file/a1d0c6e83f027327d8461063f4ac58a6-Paper.pdf)
 
The main finding was the so-called ["Backpropagation through time"](https://en.wikipedia.org/wiki/Backpropagation_through_time).
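As a point of reference, a Jordan-style network feeds the previous output back into the hidden layer. A sketch in commonly used notation (the symbols $x_t$, $h_t$, $y_t$, $W$, $U$, $b$ and $\sigma$ are assumptions, not taken from the cited paper):

$$
h_t = \sigma_h(W_h x_t + U_h y_{t-1} + b_h), \qquad y_t = \sigma_y(W_y h_t + b_y)
$$

Backpropagation through time unrolls such a recurrence over the time steps and applies ordinary backpropagation to the unrolled network.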

### Elman RNN (1990)

The first recurrent network with good practical results. It was developed during the previous wave of neural network research:

 * J.L. Elman (1990): "Finding Structure in Time", Cognitive Science, Vol. 14. [DOI Link](https://doi.org/10.1016/0364-0213(90)90002-E)
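In an Elman network the hidden state itself is fed back, so each step combines the current input with the previous hidden state. The sketch below shows only the forward pass for a scalar input and output. The weight names (`W_xh`, `W_hh`, `W_hy`) and their values are hypothetical and untrained, and `tanh` is an assumed choice of activation.

```python
import math

def elman_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One Elman update: h_t = tanh(W_xh * x_t + W_hh @ h_prev + b_h)."""
    h_t = []
    for i in range(len(h_prev)):
        recurrent = sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(W_xh[i] * x_t + recurrent + b_h[i]))
    return h_t

def elman_forward(xs, W_xh, W_hh, b_h, W_hy, b_y):
    """Run a scalar-input, scalar-output Elman RNN over a sequence."""
    h = [0.0] * len(b_h)  # initial hidden state
    ys = []
    for x_t in xs:
        h = elman_step(x_t, h, W_xh, W_hh, b_h)
        ys.append(sum(w * h_i for w, h_i in zip(W_hy, h)) + b_y)
    return ys, h

# Hypothetical, untrained weights for a 2-unit hidden layer.
W_xh = [0.5, -0.3]
W_hh = [[0.1, 0.2], [-0.2, 0.1]]
b_h = [0.0, 0.0]
W_hy = [1.0, -1.0]
b_y = 0.0

outputs, final_state = elman_forward([0.0, 0.5, 1.0], W_xh, W_hh, b_h, W_hy, b_y)
```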

**Example:** Learn to generate a sinusoidal wave with an Elman RNN

See the Colab notebook:

 * https://colab.research.google.com/drive/1uZLe9BN9Uu6kT6sZvS_SP6syzk0B_wja?usp=sharing


### Long Short-term Memory (LSTM)

The original idea was proposed in

 * S. Hochreiter, J. Schmidhuber (1997): "Long Short-term Memory", Neural Computation, Vol 9 No 8. [PDF](https://ieeexplore.ieee.org/abstract/document/6795963)

More details can be found in this excellent blog post:

 * https://colah.github.io/posts/2015-08-Understanding-LSTMs/
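A single LSTM step can be sketched as below, in the widely used formulation with a forget gate (the forget gate is a later addition to the 1997 design). For brevity the input, hidden state and cell state are scalars. The parameter dictionary `p` and its values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update for scalar input, hidden state, and cell state.

    p maps each gate name to a (w_x, w_h, b) triple.
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x_t + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # proposed new content
    c_t = f * c_prev + i * g          # additive cell-state update
    h_t = o * math.tanh(c_t)
    return h_t, c_t

# Hypothetical parameters: forget gate fully open, input gate fully closed,
# so the cell state is carried over unchanged.
p = {
    "forget": (0.0, 0.0, 100.0),
    "input": (0.0, 0.0, -100.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.7, p=p)
```

The example parameters are chosen to show the memory behaviour: with the forget gate open and the input gate closed, the cell state passes through the step essentially unchanged.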

The main motivation for LSTMs over earlier RNNs such as Elman's is that they cope better with the vanishing (or exploding) gradient problem:

 * https://medium.datadriveninvestor.com/how-do-lstm-networks-solve-the-problem-of-vanishing-gradients-a6784971a577?gi=8cc7cd9d3e1f
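The effect can be illustrated numerically with a scalar recurrence $h_t = \tanh(w\,h_{t-1} + x_t)$, an assumed toy model rather than any of the networks in the notebooks. By the chain rule, the gradient of the final state with respect to the initial state is a product of one factor per time step, and with $|w| < 1$ every factor is smaller than one in magnitude.

```python
import math

def state_gradient(w, xs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x_t in xs:
        h = math.tanh(w * h + x_t)
        grad *= w * (1.0 - h * h)  # chain rule: d h_t / d h_{t-1}
    return grad

xs = [0.5] * 50                              # hypothetical constant input
short_grad = state_gradient(0.9, xs[:5])     # gradient across 5 steps
long_grad = state_gradient(0.9, xs)          # gradient across 50 steps
```

Here `long_grad` is bounded by `0.9 ** 50`, so the early inputs barely influence learning. With larger recurrent weights and unsaturated units the same product can grow instead of shrink, which is the exploding case.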

**Example:** Learn to generate a sinusoidal wave with an LSTM RNN

See the Colab notebook:

 * https://colab.research.google.com/drive/1uZLe9BN9Uu6kT6sZvS_SP6syzk0B_wja?usp=sharing 


**Example:** Generate text with LSTM RNN

See the Colab notebook:

 * https://colab.research.google.com/drive/1jbFTXyBFe64J2xtpek-OjsCytqn0OY8Y?usp=sharing


## Sequence-to-sequence learning

Seq2Seq models differ from generative learning in that the inputs and outputs can be of different types and lengths.

### Neural machine translation

The two sequential tasks in natural language processing (NLP) considered here are word prediction and neural translation. Neural translation was originally given a different structure from word prediction. That structure led to the idea of attention, which in turn led to the Transformer architecture. A Transformer trained on next-word prediction (generative AI) can also solve machine translation, which makes it interesting: the capability "emerges" from next-word prediction.

### Encoder-decoder RNN

The idea was originally published in

 * I. Sutskever, O. Vinyals, Q.V. Le (2014): "Sequence to Sequence Learning with Neural Networks", NeurIPS 2014. [PDF](https://doi.org/10.48550/arXiv.1409.3215)
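The structure can be sketched with a toy scalar recurrence: an encoder compresses the whole input into a final state, and a decoder starts from that state and generates outputs one at a time. This is only a structural sketch under stated assumptions. The paper uses multi-layer LSTMs over word tokens, whereas the functions, weights and stopping rule below are hypothetical and untrained.

```python
import math

def rnn_step(x, h, w_x, w_h):
    """Toy scalar recurrent update used by both encoder and decoder."""
    return math.tanh(w_x * x + w_h * h)

def encode(source, w_x=0.7, w_h=0.5):
    """Compress the whole source sequence into one context value."""
    h = 0.0
    for x in source:
        h = rnn_step(x, h, w_x, w_h)
    return h

def decode(context, max_len, start=0.0, w_x=0.6, w_h=0.8, w_out=1.5, stop_below=0.05):
    """Generate outputs one at a time, feeding each back in as the next input."""
    h, y, outputs = context, start, []
    for _ in range(max_len):
        h = rnn_step(y, h, w_x, w_h)
        y = w_out * h
        if abs(y) < stop_below:  # stand-in for emitting an end-of-sequence token
            break
        outputs.append(y)
    return outputs

source = [0.2, -0.1, 0.4, 0.3, 0.9]          # five hypothetical input values
context = encode(source)
target = decode(context, max_len=3)          # output length is set by the decoder
```

The output length is decided by the decoder, not by the input length, which is what allows the source and target sequences to differ in length.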

**Example:** Machine translation with an encoder-decoder LSTM

See the Colab notebook:

 * [Colab](https://colab.research.google.com/drive/1dP6pBcxwZMd1ZeE1poOPg63yhwKfwjxL?usp=sharing)



### Attention

The attention mechanism for neural translation was proposed in

 * D. Bahdanau, K. Cho, Y. Bengio (2015): "Neural Machine Translation by Jointly Learning to Align and Translate" in Proc. of the Int'l Conf. on Learning Representations (ICLR) [PDF](https://arxiv.org/abs/1409.0473)
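In outline, the additive attention of that paper scores each encoder state $h_j$ against the previous decoder state $s_{i-1}$, normalizes the scores with a softmax, and forms a context vector as the weighted sum. Here $T_x$ is the source length, and $v_a$, $W_a$, $U_a$ are learned parameters:

$$
e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
$$

The decoder thus receives a different context vector $c_i$ at every output step, instead of a single fixed summary of the input.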

### Transformer

Finally, the Transformer architecture was introduced in

 * A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser, I. Polosukhin (2017): "Attention Is All You Need". In Proceedings of the Neural Information Processing Systems (NeurIPS). [PDF](https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html)
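The core operation of the Transformer is scaled dot-product attention, $\mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$. The sketch below is a minimal single-head version on plain Python lists, without masking, learned projections or multiple heads. The toy matrices `Q`, `K`, `V` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy inputs: 2 queries, 3 key/value pairs, dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
context, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to one, so every output row is a convex combination of the value vectors.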

## Foundation models

A year after the Transformer was introduced, it was used to show that generative pre-training supports many downstream tasks that can be solved by fine-tuning (cf. backbone networks). A seminal work was the GPT-1 paper:

 * A. Radford, K. Narasimhan, T. Salimans, I. Sutskever (2018): "Improving language understanding by generative pre-training" OpenAI technical report. [PDF](https://www.mikecaptain.com/resources/pdf/GPT-1.pdf)
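The pre-training objective in that report is plain next-token prediction: for an unlabeled token corpus $\mathcal{U} = \{u_1, \ldots, u_n\}$, a context window of size $k$ and model parameters $\Theta$, the model maximizes

$$
L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)
$$

No labels are needed for this step, since the text itself supplies the prediction targets.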

This led to the development of many other "foundation models" that are trained in a self-supervised fashion and from which other capabilities emerge through vast amounts of data and huge models. Examples beyond text:

 * Images (Dall-E by OpenAI)
 * Music (MusicLM by Google)
 * Videos (Sora by OpenAI)

## References

 * The [Wikipedia page](https://en.wikipedia.org/wiki/Recurrent_neural_network) provides an excellent historical review.

 * I. Goodfellow, Y. Bengio and A. Courville (2016): Deep Learning, MIT Press. [Web](https://www.deeplearningbook.org/) - provides a good modern introduction to RNNs; the notation here follows their book.
 * M. Baroni, G. Dinu and G. Kruszewski (2014): "Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors", Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014). [PDF](https://aclanthology.org/P14-1023/) - an early indication that next-word predictive models would work in NLP.