# L4 Deep Models for Text and Sequences

Example problem: Classifying documents into different categories: Politics, Business, Medicine.

Difficulties of text problems:
- The words that you see rarely often matter most. 'Retinopathy' appears with frequency 0.0001% in English so it may not be in your training set, but if it's found in a document that doc is likely a medical doc.
- Often use different words to mean similar things, e.g. 'cat' and 'kitty'. Would like to share parameters between these words, so we have to learn that they are related.

-> Requires collecting a lot of labelled data. Too much.

**Solution: Unsupervised learning**: Training without any labels. 

**Idea: Similar words tend to occur in similar contexts.**

Try to predict word's context -> Treat 'cat' and 'kitty' similarly and bring them closer together.

Advantage: Don't have to worry about what the words themselves mean, only the context they appear in.

Map words to small vectors called embeddings which are going to be close to each other when words have similar meanings and far apart if they don't have similar meanings.

Have word representation where all cat-like things are all represented by vectors that are very similar.

Model can then generalise from this pattern of cat-like things instead of learning new things for every way there is to talk about a cat.

### Word2vec

A way of learning these embeddings.

Suppose you have a corpus of text with one sentence.

For each word in this sentence, we will map it to an embedding - initially a random one.

Then we will use the embedding to try to predict the context of the word. 

The context here is simply the words around the chosen word. Pick a random word in a window around the original word and that's your target.

Train your model as though it were a supervised model. Use logistic regression. (Not deep)
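The target-picking step above can be sketched as follows. This is a minimal illustration, not the reference word2vec implementation: the function name `skipgram_pairs`, the window size of 2 and the example sentence are all assumptions made for the sketch.

```python
import random

def skipgram_pairs(tokens, window=2, seed=0):
    """For each word, pick one random neighbour within `window` as its target."""
    rng = random.Random(seed)
    pairs = []
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Candidate context positions: everything in the window except the word itself.
        candidates = [j for j in range(lo, hi) if j != i]
        if candidates:
            pairs.append((word, tokens[rng.choice(candidates)]))
    return pairs

# Assumed example corpus: a single sentence.
sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence)
for word, target in pairs:
    print(word, "->", target)
```

Each `(word, target)` pair then acts as one labelled example: the embedding of `word` is the input and `target` is the class to predict.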

### Seeing how embeddings are clustering together

1. Nearest neighbour lookup (of the words that are closest to any given word)
2. Try to reduce dimensionality of embedding space to 2D and plot 2D representation
    * Naive way such as PCA loses lots of information. Need a way of projecting embeddings that preserves information, e.g. **t-SNE**.
    
### Comparing Embeddings

1. It is best to measure closeness using **cosine distance** instead of L2 distance. We may also wish to normalise vectors to have unit norm.
* Because length of embedding vector is not relevant to the classification.

2. **Sampled Softmax**: Take only a random sample of words that are not the target (negative targets) and act as though the other words were not there. Makes things faster (more efficient) with little or no cost to performance in practice. This is important because there may be many words in our vocabulary.
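Going back to the first point, here is a small sketch of why cosine similarity ignores vector length while L2 distance does not, plus a nearest-neighbour lookup built on it. The vectors `a`, `b`, `c` and the helper names are made up for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_neighbours(query, embeddings, k=2):
    """Indices of the k rows of `embeddings` most similar to `query`."""
    # Normalise to unit norm so that a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = unit @ q
    return np.argsort(-sims)[:k]

a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a                      # same direction as a, much longer
c = np.array([3.0, -1.0, 0.5])    # different direction, similar length

print(cosine_similarity(a, b))    # close to 1: length is ignored
print(np.linalg.norm(a - b))      # large L2 distance despite same direction
print(np.linalg.norm(a - c))      # smaller L2 distance despite different direction

idx = nearest_neighbours(a, np.stack([a, b, c]), k=2)
```

Under L2 distance `c` looks closer to `a` than `b` does, even though `b` points in exactly the same direction.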

### Words as Vectors

e.g. PUPPY - DOG + CAT = KITTEN
TALLER - TALL + SHORT = SHORTER

Emergent properties of embedding vectors: let you express semantic and syntactic analogies in terms of mathematical operations.

**Semantic analogy**
$$V_{puppy} - V_{dog} \sim  V_{kitten} - V_{cat}$$


**Syntactic analogy**
$$V_{taller} - V_{tall} \sim  V_{shorter} - V_{short}$$
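A toy illustration of the analogy arithmetic. The embeddings below are hand-built, not learned (the three axes are assumed to mean dog-ness, cat-ness and youth), so this only shows the mechanics of the lookup, not that trained embeddings behave this cleanly.

```python
import numpy as np

# Hand-built toy embeddings (NOT learned): axes are [dog-ness, cat-ness, youth].
vocab = {
    "dog": np.array([1.0, 0.0, 0.0]),
    "puppy": np.array([1.0, 0.0, 1.0]),
    "cat": np.array([0.0, 1.0, 0.0]),
    "kitten": np.array([0.0, 1.0, 1.0]),
    "wolf": np.array([0.9, 0.0, 0.1]),
    "lion": np.array([0.1, 0.9, 0.0]),
}

def analogy(a, b, c, vocab):
    """Return the word closest (by cosine) to V_a - V_b + V_c, excluding the inputs."""
    target = vocab[a] - vocab[b] + vocab[c]
    best_word, best_sim = None, -np.inf
    for word, vec in vocab.items():
        if word in (a, b, c):
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

result = analogy("puppy", "dog", "cat", vocab)
print(result)
```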

## Text as a Sequence of Words

So far our models have only looked at inputs with fixed size, i.e. you can turn it into a vector and fit it into your neural network. When you have **sequences of varying length**, you can no longer do that.

### Recurrent Networks (RNNs)

Sequence of events. At each point in time you want to decide what's happened so far in the sequence. If your sequence is reasonably stationary, you can use the same classifier at each step. This simplifies things a lot already. 

But you may want to take into account the past (since this is a sequence). You can use the state of the previous classifier as a summary of what happened before.
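A minimal sketch of that idea: the same weights are applied at every step, and the previous state is fed in alongside the current input. The tanh update, the layer sizes and the names `W_x`, `W_h`, `b` are assumptions for illustration, not something fixed by these notes.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Apply the same weights at every step, passing the hidden state forward."""
    h = np.zeros(W_h.shape[0])  # initial state: nothing seen yet
    states = []
    for x in inputs:
        # New state summarises the current input and everything before it.
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return states

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

# The same parameters handle sequences of different lengths.
short_seq = [rng.normal(size=input_size) for _ in range(2)]
long_seq = [rng.normal(size=input_size) for _ in range(7)]
states_short = rnn_forward(short_seq, W_x, W_h, b)
states_long = rnn_forward(long_seq, W_x, W_h, b)
```

The final state has a fixed size regardless of how long the input sequence was.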

### Backprop through time: Computing parameter updates of RNNs

Calculate for as many steps as we can afford.

All derivatives apply to same parameters -> Lots of correlated updates at once for the same weights. 
-> Bad for SGD, which prefers to have uncorrelated updates to its parameters to keep training stable.

This makes maths unstable: either the gradients grow exponentially and go to infinity (**exploding gradient**) or go to zero and you don't end up training anything (**vanishing gradient**).

#### Exploding Gradient: Gradient Clipping
Compute the norm of the gradients and shrink the steps when the norm gets too big. (Hacky solution)
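
One common way to do this is to clip by the global norm across all gradients. The sketch below assumes the gradients are a list of NumPy arrays; `clip_by_global_norm` and `max_norm` are illustrative names.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        # Same direction, smaller step.
        grads = [g * scale for g in grads]
    return grads, total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # combined norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

# Gradients already below the threshold are left untouched.
small, _ = clip_by_global_norm([np.array([0.1, 0.2])], max_norm=1.0)
```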

#### Tackling Vanishing Gradients

**Effect: Memory Loss** Vanishing Gradients make your network only remember recent events.

**LSTM (Long short-term memory)**. 
Conceptually an RNN is composed of many units, with each unit being a simple neural net (set of layers). 

With LSTMs, we replace each module with an LSTM 'cell' and leave the general architecture unchanged.

Three ops
* Write data into memory
* Read data from memory
* Erase data from memory

(Diagram)
Binary instruction gates

Connect to NNs: Imagine if you had continuous gates instead of binary ones.
-> If function is continuous and differentiable, can take derivatives -> can backprop.

This is LSTM.

The gating values for each gate are controlled by a tiny logistic regression on the input parameters. Each of them has its own set of shared parameters.

(insert diagram)
tanh to keep output between -1.0 and 1.0.
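
A sketch of one step of a standard LSTM cell, showing the continuous write/erase/read gates as sigmoids. This follows the common textbook formulation, which may differ in detail from the diagrams in the lecture; the stacked weight matrix `W` and the sizes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x, h_prev] to the four stacked gate pre-activations."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:n])          # input gate: how much to write into memory
    f = sigmoid(z[n:2 * n])      # forget gate: how much memory to keep (vs erase)
    o = sigmoid(z[2 * n:3 * n])  # output gate: how much to read from memory
    g = np.tanh(z[3 * n:4 * n])  # candidate values to write, in (-1, 1)
    c = f * c_prev + i * g       # updated memory cell
    h = o * np.tanh(c)           # output, kept between -1 and 1 by tanh
    return h, c

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 5
W = rng.normal(scale=0.1, size=(4 * hidden_size, input_size + hidden_size))
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for _ in range(6):
    h, c = lstm_step(rng.normal(size=input_size), h, c, W, b)
```

Every operation here is continuous and differentiable, which is what lets backprop flow through the gates.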

Why do they work?
* The little gates help the NN keep the memory longer when it needs to and ignore things when it should. -> Optimisation becomes much easier and gradient vanishing vanishes. (Bit of a black box)

#### LSTM Regularisation
* L2 always works.
* Dropout works as long as you use it on the input and output, not on the recurrent connections (connections to past and future).


e.g. you have a model that predicts the next step of a sequence. 

You can use that to generate sequences.

Take sequence at time t, Predict -> Sample from distribution -> feed sample to next step and predict -> Sample -> ...

Alternative: Sample many times at each step. (Just sampling next prediction every time is greedy.)

-> Have multiple sequences (hypotheses) that you can continue predicting from at every step. Can choose best by looking at total probability over multiple timesteps at a time. Doing this can avoid your network accidentally making one bad choice and being stuck with that choice forever.
-> But then number of hypotheses grows exponentially if you do this naively.
-> Instead, use beam search: only keep most likely sequences at every time step and **prune** the rest. Works well in practice.
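
A toy sketch of beam search. The `TRANSITIONS` table is a made-up stand-in for a trained model's next-step distribution, chosen so that the locally best first choice leads to a worse overall sequence.

```python
import math

# Hypothetical toy model: next-token probabilities given the previous token.
TRANSITIONS = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.55, "y": 0.45},
    "B": {"x": 0.9, "y": 0.1},
}

def beam_search(start, steps, beam_width):
    """Return the surviving hypotheses as (sequence, total log-probability)."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, p in TRANSITIONS[seq[-1]].items():
                candidates.append((seq + [token], score + math.log(p)))
        # Prune: keep only the `beam_width` most likely hypotheses.
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# Width 1 always commits to the single most likely next token.
greedy = beam_search("<s>", steps=2, beam_width=1)[0]
# Width 2 keeps "B" alive long enough to discover the better sequence.
best = beam_search("<s>", steps=2, beam_width=2)[0]
print(greedy[0], math.exp(greedy[1]))
print(best[0], math.exp(best[1]))
```

With a beam of width 1 the search commits to "A" and ends with total probability 0.33; with width 2 it finds the sequence through "B" with total probability 0.36.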

### Use cases

Can play legos with deep models: put them together and then use backprop to optimise the combination of models.

Maps variable-length sequences to fixed-length vectors.

(Can also be made to map fixed-length vectors to variable-length sequences.)

Can stitch these together to map sequences of arbitrary length to other sequences of arbitrary length.
-> e.g. machine translation.
-> e.g. speech recognition

Convnet x RNN -> Image captions e.g. http://mscoco.org
