# What are Recurrent Neural Networks (RNNs)?

## Motivation

The No Free Lunch Theorem tells us that no single model is best for every problem: a model does well on a given type of problem when its built-in assumptions match the structure of the data. For example, much of the information in an image lies not in the individual pixel values but in how the pixels are arranged relative to one another. Models that take this spatial structure into account therefore tend to perform better on images.

Additionally, with images, you often want to detect the same pattern in different regions of the image. Convolutional networks are a type of model that takes advantage of this by using the same model parameters when processing different regions of space - we call this parameter sharing through _space_.

Another common type of data is sequential data, such as audio or text.

The order of the words in a sentence matters! The order of beats in a song matters!
Recurrent Neural Networks, or RNNs, are a type of model that takes this into account.
They do this by using the same model parameters at every step of the sequence - we call this parameter sharing through _time_.

## What are RNNs?

Recurrent neural networks (RNNs) are a type of neural network designed to process sequential data. They read a sequence one element at a time and maintain a hidden state that summarises what they have seen so far. This makes them suitable for tasks such as language translation, language generation, and speech recognition, where the input and output sequences may have variable length and the order of the elements in the sequence is important.
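
As a rough illustration, the sketch below implements a vanilla (Elman-style) RNN in plain Python, using the update $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$. The names `rnn_step`, `rnn_forward`, `W_xh`, `W_hh` and `b_h`, the layer sizes, and the random weights are all assumptions made for this example; it shows one common formulation rather than a definitive implementation.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrence step: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    from_input = matvec(W_xh, x_t)
    from_state = matvec(W_hh, h_prev)
    return [math.tanh(a + c + b) for a, c, b in zip(from_input, from_state, b_h)]

def rnn_forward(sequence, W_xh, W_hh, b_h):
    """Apply the same parameters at every time step of a sequence of any length."""
    h = [0.0] * len(b_h)  # initial hidden state, assumed here to be all zeros
    for x_t in sequence:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
    return h

rng = random.Random(0)
input_size, hidden_size = 3, 4  # illustrative sizes
W_xh = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
W_hh = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
b_h = [0.0] * hidden_size

# Two sequences of different lengths are handled by the same three parameters.
short_seq = [[rng.random() for _ in range(input_size)] for _ in range(2)]
long_seq = [[rng.random() for _ in range(input_size)] for _ in range(7)]

h_short = rnn_forward(short_seq, W_xh, W_hh, b_h)
h_long = rnn_forward(long_seq, W_xh, W_hh, b_h)
print(len(h_short), len(h_long))  # prints: 4 4
```

Because `W_xh`, `W_hh` and `b_h` are reused at every step, the number of parameters does not depend on the sequence length, and the final hidden state has the same size whether the input has 2 elements or 7.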

In recent years, RNNs have also been applied to other domains such as music generation, video classification, and even protein structure prediction.

## Strengths of RNNs

1. Ability to handle variable-length input sequences: RNNs are not limited to processing fixed-length input sequences, unlike many other types of neural networks. This makes them very flexible and able to handle a wide range of input data.

1. Ability to capture temporal dependencies: RNNs are able to capture dependencies between elements in a sequence by using their hidden state. This allows them to make use of contextual information from previous elements in the sequence when processing the current element.

1. Ability to process data in real time: RNNs can process data in a streaming fashion, updating their hidden state and making predictions as each new element becomes available.

1. Wide range of applications: RNNs have been applied to a wide range of tasks, including language modeling, machine translation, speech recognition, and many more.
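
The streaming behaviour described above can be sketched with a Python generator. This is a toy scalar RNN with hand-picked weights; `stream_rnn`, `sensor_readings`, and the values of `w_x`, `w_h` and `b` are hypothetical and chosen only for illustration.

```python
import math

def stream_rnn(inputs, w_x=0.8, w_h=0.5, b=0.0):
    """Yield one hidden state per input as soon as that input arrives.

    Only the previous hidden state is kept in memory, never the whole sequence.
    """
    h = 0.0  # initial hidden state
    for x_t in inputs:
        h = math.tanh(w_x * x_t + w_h * h + b)
        yield h

def sensor_readings():
    """Stand-in for a live data source that produces values one at a time."""
    for value in [0.1, 0.4, -0.3, 0.9]:
        yield value

outputs = []
for t, h_t in enumerate(stream_rnn(sensor_readings())):
    outputs.append(h_t)
    print(f"step {t}: hidden state = {h_t:.3f}")
```

Each output depends only on the current input and the previous hidden state, so a result is available after every element without waiting for the sequence to end.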

## Limitations of RNNs

1. Difficulty in training: RNNs can be difficult to train, especially on long sequences. This is due to the vanishing and exploding gradient problem: as gradients are backpropagated through many time steps they are repeatedly multiplied by similar factors, so they tend to become either very small or very large. This can make it difficult for the network to learn long-term dependencies.

1. Limited ability to process long-term dependencies: While RNNs are able to capture some long-term dependencies, they may struggle with very long sequences or with dependencies that are separated by a large number of elements. The hidden state is modified by every element processed in between, so information about the earlier element is gradually lost.

1. Sensitivity to initialisation: Like other neural networks, RNNs can be sensitive to the values chosen to initialise their parameters. It is not yet well understood how parameter initialisation affects optimisation or generalisation.

1. Computational complexity: RNNs can be computationally expensive, especially for long sequences, because they cannot be parallelised in the time dimension: each hidden state depends on the previous one. For a sequence of length $T$, each parameter update step therefore requires $O(T)$ sequential operations. This can be a limitation when working with large datasets or when real-time processing is required.

1. Difficulty in interpreting results: RNNs can be difficult to interpret, as it is not clear what their hidden states represent just by looking at them. This can make it difficult to understand the decisions made by the network and how it is using the input data.
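
To make the vanishing and exploding gradient problem concrete, here is a minimal sketch using a deliberately simplified scalar, linear recurrence $h_t = w \, h_{t-1}$ (no inputs and no nonlinearity, an assumption made so the effect is easy to see). In that setting the gradient of the final state with respect to the initial state is $\partial h_T / \partial h_0 = w^T$. The function name `gradient_through_time` is hypothetical.

```python
def gradient_through_time(w, num_steps):
    """Return d h_T / d h_0 for the linear recurrence h_t = w * h_{t-1}.

    Backpropagation multiplies by the local derivative (here just w) once
    per time step, so the result equals w ** num_steps.
    """
    grad = 1.0
    for _ in range(num_steps):
        grad *= w
    return grad

for w in (0.9, 1.0, 1.1):
    grads = [gradient_through_time(w, T) for T in (10, 50, 100)]
    print(f"w = {w}: " + ", ".join(f"{g:.3e}" for g in grads))
```

With `w` slightly below 1 the gradient shrinks towards zero as the number of steps grows, and with `w` slightly above 1 it grows without bound. In a real RNN the scalar `w` is replaced by a weight matrix and the derivative of the activation function, but the repeated multiplication is the same.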

## Training RNNs

### Teacher Forcing

"Teacher forcing" is a technique used when training a model that generates its output one element at a time, such as an RNN-based sequence-to-sequence (encoder-decoder) model. At each time step, the decoder receives the true target element from the previous time step as its input, rather than its own prediction from the previous step.

In other words, when teacher forcing is used, the decoder is "forced" to generate the next output based on the true target sequence, rather than on its own earlier predictions. This can help the model learn faster and more accurately. However, at inference time no true targets are available, so the model has to consume its own predictions, a situation it never encountered during training. This mismatch, often called exposure bias, may reduce the model's ability to generate reasonable output when teacher forcing is not used.

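A common compromise is to use teacher forcing for only a fraction of the training sequences. To do this, you can choose a `teacher_forcing_ratio` (the probability of using teacher forcing for a given sequence) and sample the decision during training. For example: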

```python
import random

# Assumed to be defined elsewhere: `dataset`, `encoder`, `decoder`, `optimizer`,
# `loss_function`, and a `train` function that runs the decoder step by step
# and accepts a `use_teacher_forcing` flag as its last argument.

# Probability of using teacher forcing for any given training sequence
teacher_forcing_ratio = 0.5

for input_tensor, target_tensor in dataset:
    # Decide, once per sequence, whether to use teacher forcing
    use_teacher_forcing = random.random() < teacher_forcing_ratio

    # The decoder input has to be chosen inside `train`, one time step at a
    # time: with teacher forcing, step t receives the true target from step
    # t - 1; without it, step t receives the decoder's own prediction from
    # step t - 1, which only exists once that step has been run.
    loss = train(
        input_tensor, target_tensor, encoder, decoder,
        optimizer, loss_function, use_teacher_forcing,
    )
```
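
To show what the two modes do inside the decoder loop, here is a small self-contained sketch. It replaces the neural decoder with a hypothetical lookup table (`TOY_DECODER`) that maps the previous token to a predicted next token and contains one deliberate mistake. The names `SOS`, `TOY_DECODER` and `decode` are assumptions made for this example, not part of any library.

```python
SOS = "<sos>"  # start-of-sequence token, fed to the decoder at the first step

# Stand-in for a trained decoder: previous token -> predicted next token.
# The entry "the" -> "dog" is a deliberate mistake (the target says "cat").
TOY_DECODER = {
    SOS: "the",
    "the": "dog",
    "cat": "sat",
    "dog": "barked",
    "sat": "down",
    "barked": "loudly",
}

def decode(target, use_teacher_forcing):
    """Return the decoder's prediction for each position of `target`."""
    decoder_input = SOS
    predictions = []
    for true_token in target:
        predicted = TOY_DECODER.get(decoder_input, "<unk>")
        predictions.append(predicted)
        # Teacher forcing feeds the ground-truth token to the next step;
        # otherwise the decoder consumes its own (possibly wrong) prediction.
        decoder_input = true_token if use_teacher_forcing else predicted
    return predictions

target = ["the", "cat", "sat", "down"]
forced = decode(target, use_teacher_forcing=True)
free = decode(target, use_teacher_forcing=False)

print(forced)  # ['the', 'dog', 'sat', 'down']
print(free)    # ['the', 'dog', 'barked', 'loudly']
```

With teacher forcing, the single mistake stays isolated because the next step is fed the true token. Without it, the wrong prediction becomes the next input and the error compounds over the rest of the sequence, which is the situation the model faces at inference time.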
