# Sequence Models

## What are sequence data?

Examples could include:
* Audio data
* Text ("the quick brown fox jumped over the lazy dog")
* Music
* DNA sequences (duh?)
* A series of video frames

[DNA sequences](https://en.wikipedia.org/wiki/Sequence_database) actually dominate the Google results if you search for "sequence data".



### Consider an example

Let's try to determine which words in the following sentence are part of a name.

x: `Harry Potter and Hermione Granger invented a new spell.`

x: [ $x^{<1>}\ x^{<2>}\ ...\ x^{<t>}\ ...\ x^{<9>}$ ]

y: [ 1 1 0 1 1 0 0 0 0 ]
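
To make the mapping from `x` to `y` concrete, here is a minimal sketch that builds the label vector for the sentence above. The `known_names` lookup is a hypothetical stand-in used only for illustration; a sequence model would learn to make this decision from context rather than from a fixed list.

```python
sentence = "Harry Potter and Hermione Granger invented a new spell."

# Hypothetical lookup, for illustration only: a trained model would
# predict these labels instead of reading them from a set.
known_names = {"Harry", "Potter", "Hermione", "Granger"}

x = sentence.rstrip(".").split()                      # x^{<1>} ... x^{<9>}
y = [1 if word in known_names else 0 for word in x]   # one label per word

T_x = len(x)  # length of the input sequence
```

Each position in `y` lines up with the word at the same position in `x`, so here the input and output sequences have the same length.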

## What are sequence models?

### General architecture

The distinguishing feature of a basic recurrent neural network is that it feeds the output (the hidden state) of a given step back into itself as an additional input for the next step.

<img src="static/Recurrent_neural_network_unfold.svg.png">
*Source: [Wikipedia](https://en.wikipedia.org/wiki/Recurrent_neural_network#/media/File:Recurrent_neural_network_unfold.svg)*

In essence, this allows it to exhibit "dynamic temporal behaviour along a sequence" - or, in other words, what happened earlier in the sequence can impact the current result.
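
The feedback loop can be sketched in a few lines of NumPy. This is an illustrative forward pass only, assuming a `tanh` activation and the weight names `W_ax`, `W_aa`, `W_ya`, `b_a`, `b_y` (none of which come from the text above); the weights are random rather than trained.

```python
import numpy as np

def rnn_forward(xs, W_ax, W_aa, W_ya, b_a, b_y):
    """Run a basic RNN over a sequence xs of shape (T, n_x)."""
    a = np.zeros(W_aa.shape[0])          # initial hidden state a^{<0>}
    outputs = []
    for x_t in xs:
        # The previous hidden state is fed back in alongside the current input.
        a = np.tanh(W_aa @ a + W_ax @ x_t + b_a)
        outputs.append(W_ya @ a + b_y)   # unnormalised output for this step
    return np.array(outputs), a

# Toy sizes and random weights, chosen only to make the sketch runnable.
rng = np.random.default_rng(0)
n_x, n_a, n_y, T = 4, 3, 2, 5
xs = rng.normal(size=(T, n_x))
ys, a_last = rnn_forward(
    xs,
    W_ax=rng.normal(size=(n_a, n_x)),
    W_aa=rng.normal(size=(n_a, n_a)),
    W_ya=rng.normal(size=(n_y, n_a)),
    b_a=np.zeros(n_a),
    b_y=np.zeros(n_y),
)
```

The same weights are reused at every step; only the hidden state `a` changes as the sequence is consumed.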

### What are some varieties?

A key way to break down RNNs is by their inputs and outputs:
* **One-to-one** - basically a standard neural network, barely an RNN
* **One-to-many** - e.g. generation of a piece of music from a starting point
* **Many-to-many** - e.g. determining whether a word in a sentence is a name
    * Many-to-many can have two forms - either $T_x = T_y$ or $T_x \neq T_y$. The latter might be used for translation, where you don't expect each word to have a one-to-one equivalent.
* **Many-to-one** - e.g. sentiment analysis of a sentence

<img src="static/rnntypes.png">
*Source: Andrew Ng's Sequence Models course*

### How do we train them?

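This can be a bit more complicated than usual, given the recurrent nature. The usual approach is backpropagation through time, which unrolls the network across its time steps and then applies ordinary backpropagation.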

[A gentle introduction is provided here](https://machinelearningmastery.com/gentle-introduction-backpropagation-time/).

#### Vanishing gradient problem

As signals pass along a network in forward and back prop, information is lost in transmission. For example, $\hat{y}^{<3>}$ is strongly influenced by the points closest to it and only weakly by points many steps away. This makes it challenging to capture long-range dependencies. This issue was explored as early as [1991](http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf) (warning - paper in German) and [1994](http://www.comp.hkbu.edu.hk/~markus/teaching/comp7650/tnn-94-gradient.pdf).

In the world of RNNs, vanishing gradients tend to be the more common issue, but exploding gradients can be catastrophic. (These will show up as chains of `NaN`s, for instance, as you get numerical overflow.)
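
A deliberately simplified illustration of both effects: if each step of the recurrence scales the gradient by a constant factor `w`, then after many steps the gradient is scaled by `w` raised to the number of steps. The scalar model and the specific values below are assumptions made for the sake of the sketch, not a description of a real network.

```python
import numpy as np

def gradient_scale(w, steps):
    """Scale factor applied to a gradient after `steps` steps of the
    scalar linear recurrence a_t = w * a_{t-1} (a toy model)."""
    return w ** steps

vanishing = gradient_scale(0.5, 50)   # shrinks towards zero
exploding = gradient_scale(1.5, 50)   # grows very large

# In float32, repeated growth eventually overflows to inf,
# and arithmetic such as inf - inf then produces NaN.
big = np.float32(1.5)
with np.errstate(over="ignore", invalid="ignore"):
    for _ in range(10):
        big = big * big               # repeated squaring overflows quickly
    nan_value = big - big
```

Once a `NaN` appears it propagates through every later computation, which is why the failure shows up as a chain of them.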

##### A solution: the Gated Recurrent Unit.
One approach to addressing this is the Gated Recurrent Unit (GRU), [explained here](https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be).

##### Another (more widely used) solution: the Long Short-Term Memory model.
This model is actually over [20 years old](http://www.bioinf.jku.at/publications/older/2604.pdf).

##### But some people have done comparisons.

[Here](http://arxiv.org/pdf/1503.04069.pdf) and [here](http://jmlr.org/proceedings/papers/v37/jozefowicz15.pdf).

##### I don't get how these models work.

First, work through [this](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) explanation of an LSTM, aiming for an intuitive understanding.

Make sure that you understand the concepts of the gates.

*Then*, try to understand a GRU.
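
As a rough aid, a single GRU step might be sketched as below. The parameter names (`W_z`, `W_r`, `W_h` and their biases) are hypothetical, the weights are random, and the blending convention `(1 - z) * h_prev + z * h_tilde` is one common formulation; some references swap the roles of `z` and `1 - z`.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, p):
    """One GRU step. `p` is a dict of weights with hypothetical names."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(p["W_z"] @ hx + p["b_z"])            # update gate
    r = sigmoid(p["W_r"] @ hx + p["b_r"])            # reset gate
    rhx = np.concatenate([r * h_prev, x_t])
    h_tilde = np.tanh(p["W_h"] @ rhx + p["b_h"])     # candidate state
    # The update gate blends the old state with the candidate state.
    return (1.0 - z) * h_prev + z * h_tilde

# Toy sizes and random weights, only to make the sketch runnable.
rng = np.random.default_rng(0)
n_x, n_h = 4, 3
params = {
    "W_z": rng.normal(size=(n_h, n_h + n_x)), "b_z": np.zeros(n_h),
    "W_r": rng.normal(size=(n_h, n_h + n_x)), "b_r": np.zeros(n_h),
    "W_h": rng.normal(size=(n_h, n_h + n_x)), "b_h": np.zeros(n_h),
}
h = np.zeros(n_h)
for x_t in rng.normal(size=(6, n_x)):
    h = gru_step(x_t, h, params)
```

When `z` is close to 0 the old state passes through almost unchanged, which is what lets information (and gradients) survive across many steps.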

### What distinguishes these networks?

#### In comparison to a standard network

Inputs and outputs can be different lengths in different examples, which a standard network with fixed-size input and output layers cannot handle directly.

A naive standard network will not share features learned across different positions of text.

### What are their shortcomings?

A key shortcoming is that in a typical model, only data up to the current data point is available for prediction.

Therefore, $\hat{y}^{<n>}$ can only draw on $x^{<1>}, x^{<2>}, \dots, x^{<n>}$, and never on later inputs.

This can be addressed with "bidirectional recurrent neural networks", in comparison with the vanilla "unidirectional recurrent neural networks". (These have their own shortcomings - for instance, a BRNN needs the full sequence before it can predict.)
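
One way to picture the bidirectional idea is to run two independent RNNs, one over the sequence and one over its reverse, and join their hidden states at each step. The sketch below assumes a basic `tanh` RNN with random weights and a hypothetical helper named `run_rnn`; it is not a full BRNN implementation.

```python
import numpy as np

def run_rnn(xs, W_ax, W_aa):
    """Return the hidden state at every step of a basic tanh RNN."""
    a = np.zeros(W_aa.shape[0])
    states = []
    for x_t in xs:
        a = np.tanh(W_aa @ a + W_ax @ x_t)
        states.append(a)
    return np.array(states)

rng = np.random.default_rng(0)
T, n_x, n_a = 5, 4, 3
xs = rng.normal(size=(T, n_x))

fwd = run_rnn(xs, rng.normal(size=(n_a, n_x)), rng.normal(size=(n_a, n_a)))
# The backward pass reads the sequence in reverse; flip its states back
# so that row t lines up with time step t.
bwd = run_rnn(xs[::-1], rng.normal(size=(n_a, n_x)), rng.normal(size=(n_a, n_a)))[::-1]

# Each step now has features from both the past (fwd) and the future (bwd).
combined = np.concatenate([fwd, bwd], axis=1)
```

Note that `bwd` cannot be computed until the last input has arrived, which is the shortcoming mentioned above.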

## A special case: Language modelling

A language model estimates the 'probability' of a given sequence of words, i.e. $P(y^{<1>},y^{<2>},...,y^{<T_y>})$.

In a well-functioning model, we would want sensible sentences to be assigned a higher probability - e.g. $P(\text{What is going on here}) = 3.2 \times 10^{-13}$, and $P(\text{Who the hecky's a what}) = 1 \times 10^{-16}$.

### How would you build one?

1. Take a large corpus of English text.
2. Tokenise it - map each word (or word component) to a token in a fixed vocabulary.
   * You might like to include an End of Sentence token, sometimes marked as `<EOS>`.
   * If you are using a limited vocabulary, you might like to replace "unknown words" with an "unknown" token, sometimes marked as `<UNK>`.
3. We then build an RNN.
   * At each step, the model attempts to guess what word comes next, given the activations of the previous step.
   <img src="static/RNN model summary.png">
   * The loss function at each step is defined as: $\mathcal{L}(\hat{y}^{<t>},y^{<t>})=-\sum_{i}y_i^{<t>}\log\hat{y}_i^{<t>}$ - i.e. the cross-entropy (softmax) loss. The overall loss is the sum of these over all time steps.
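
The tokenising and loss steps above can be sketched as follows. The five-word vocabulary, the helper names `tokenise` and `step_loss`, and the uniform prediction are all assumptions made for illustration; because the target $y^{<t>}$ is one-hot, the sum over $i$ collapses to the log-probability of the correct word.

```python
import numpy as np

vocab = ["<EOS>", "<UNK>", "the", "cat", "sat"]   # toy vocabulary (assumed)
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def tokenise(sentence):
    """Map words to ids, replacing out-of-vocabulary words with <UNK>
    and appending <EOS> at the end."""
    unk = token_to_id["<UNK>"]
    ids = [token_to_id.get(w, unk) for w in sentence.lower().split()]
    return ids + [token_to_id["<EOS>"]]

def step_loss(y_hat, target_id):
    """Cross-entropy for one time step with a one-hot target."""
    return -np.log(y_hat[target_id])

ids = tokenise("The cat sat down")               # "down" is not in the vocabulary
y_hat = np.full(len(vocab), 1.0 / len(vocab))    # a uniform (untrained) prediction
loss = step_loss(y_hat, ids[0])
```

A model that has learned nothing spreads its probability evenly, so its per-step loss is the log of the vocabulary size; training should push the loss below that.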
  
When we want to start producing our own outputs, we provide a starting point (i.e. a starting token, or tokens) and then feed each prediction back in as the input to the next time step.
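
A minimal sketch of that generation loop is shown below. The function `toy_next_word_probs` is a hypothetical stand-in for a trained RNN step: it looks up hand-written probabilities based only on the previous word, whereas a real model would also carry a hidden state forward.

```python
import numpy as np

vocab = ["<EOS>", "the", "cat", "sat"]   # toy vocabulary (assumed)

def toy_next_word_probs(prev_id):
    """Stand-in for a trained model: P(next word | previous word)."""
    table = np.array([
        [1.0, 0.0, 0.0, 0.0],   # after <EOS>
        [0.0, 0.0, 0.9, 0.1],   # after "the"
        [0.1, 0.0, 0.0, 0.9],   # after "cat"
        [0.8, 0.2, 0.0, 0.0],   # after "sat"
    ])
    return table[prev_id]

def generate(start_id, max_len=10, seed=0):
    """Sample words one at a time, feeding each sample back in as input."""
    rng = np.random.default_rng(seed)
    ids = [start_id]
    while len(ids) < max_len and ids[-1] != 0:   # stop at <EOS> or max_len
        probs = toy_next_word_probs(ids[-1])
        ids.append(int(rng.choice(len(vocab), p=probs)))
    return [vocab[i] for i in ids]

words = generate(start_id=1)   # start from "the"
```

Sampling from the distribution, rather than always taking the most likely word, is what gives varied outputs from the same starting token.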
  


### Different approaches

You can build the model on character-level or word-level tokens.

A character-level language model results in much longer sequences.
* Advantageously, it doesn't need a limited vocabulary or an `<UNK>` token. You only have 26 letter tokens (plus punctuation).
* Negatively, such models tend not to be as good at capturing long-range dependencies and are more computationally intensive to train.
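
The trade-off is easy to see on the example sentence from earlier in these notes. This sketch simply splits the same text both ways; whitespace splitting is used as a crude stand-in for a real word tokeniser.

```python
sentence = "the quick brown fox jumped over the lazy dog"

word_tokens = sentence.split()   # word-level tokens
char_tokens = list(sentence)     # character-level tokens (spaces included)

word_vocab = set(word_tokens)    # grows with the size of the corpus
char_vocab = set(char_tokens)    # bounded by the alphabet plus a few symbols
```

The character sequence is several times longer than the word sequence, so the model has to carry information across many more steps to relate the same two words.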

## Applications

### Caption Generation

The goal of caption generation is to accept an *image* as input and produce a *text caption* as output.

#### Architecture

The general architecture is as follows:
* Input an image.
* Pass it through a convolutional network (e.g. AlexNet).
* Take the feature vector from the dense layers at the end of that network.
* Pass it into an LSTM (or GRU, etc) and output a series of words based on this input.

Many other options can be found in [A Gentle Introduction to Deep Learning Caption Generation Models](https://machinelearningmastery.com/deep-learning-caption-generation-models/).

#### References

* [Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)](https://arxiv.org/pdf/1412.6632)
* [Show and Tell: A Neural Image Caption Generator](https://arxiv.org/pdf/1411.4555.pdf)
* [Deep Visual-Semantic Alignments for Generating Image Descriptions](https://cs.stanford.edu/people/karpathy/cvpr2015.pdf)

### Machine translation

The goal of machine translation is to accept a *sentence of text in language A* as input and produce *the same sentence of text in language B* as output.

#### Architecture

The general architecture is as follows:
* Pass a sentence through an "encoder" network - one word at a time into an RNN.
* This generates a "sentence encoding".
* Pass this encoding into a "decoder" network.
* Generate a new sentence from this decoder network.

In effect, we are modelling the probability of a given English sentence being the output, given a foreign input sentence, i.e. $P(y^{<1>},...,y^{<T_y>} \mid x^{<1>},...,x^{<T_x>})$.

#### Beam search

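Beam search is a common decoding algorithm in NLP. Rather than greedily picking the single most likely word at each step, it keeps the $B$ most probable partial sentences (where $B$ is the beam width) and extends each of them at the next step.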
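
As a rough sketch of the idea, the function below keeps the `beam_width` highest-scoring partial sequences at each step instead of only the single best one. It makes a strong simplifying assumption: the per-step probabilities are given up front as a fixed array and do not depend on earlier choices. A real decoder would re-run the network for every candidate prefix. The name `beam_search` and the toy probabilities are illustrative only.

```python
import numpy as np

def beam_search(step_probs, beam_width):
    """Beam search over a (T, V) array of per-step word probabilities."""
    beams = [([], 0.0)]                     # (token ids, log-probability)
    for probs in step_probs:
        # Extend every surviving sequence with every possible next token.
        candidates = [
            (seq + [i], score + np.log(p))
            for seq, score in beams
            for i, p in enumerate(probs)
            if p > 0
        ]
        # Keep only the `beam_width` highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

step_probs = np.array([
    [0.1, 0.6, 0.3],
    [0.7, 0.2, 0.1],
    [0.2, 0.3, 0.5],
])
best_seq, best_score = beam_search(step_probs, beam_width=2)[0]
```

Scores are summed log-probabilities rather than multiplied probabilities, which avoids numerical underflow on long sequences. With `beam_width=1` this reduces to greedy decoding.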

#### References
* [How to Implement a Beam Search Decoder for Natural Language Processing](https://machinelearningmastery.com/beam-search-decoder-natural-language-processing/)
* [Beam Search Strategies for Neural Machine Translation](https://arxiv.org/abs/1702.01806)
* [Beam search](https://en.wikipedia.org/wiki/Beam_search)