# Self-Attention

Combining neural attention with an RNN-based sequence-to-sequence model gives you the model shown below.

![](../images/RNN%20Seq2seq%20Attention.png)

The goal now is to see whether we can replace any of the building blocks in this architecture.

## Problems with RNN encoders
Reading in sequential order is how we write text, but it’s not how we understand it. You may have the subject of a sentence (“the man…”) and something describing the man, separated by a very long sequence of words. It’s hard to persist the representation across that gap so that the two ends can interact to build the right encoding. RNNs bake this sequential order into their encoding.

Long-distance dependencies are also hard to learn because of gradient problems (vanishing and exploding gradients). RNNs also lack parallelisability.

A future RNN state can’t be computed until every state before it has been computed. You have to do $T$ sequential steps of computation before you can take a gradient step.
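To make the sequential bottleneck concrete, here is a minimal sketch of a vanilla RNN encoder. The tanh cell, the weight names `W_h` and `W_x`, and the sizes are illustrative assumptions, not the exact model in the figure above.

```python
import numpy as np

def rnn_encode(X, W_h, W_x):
    """Encode X of shape (T, d_in) with a vanilla tanh RNN; returns (T, d_h)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in X:
        # h_t needs h_{t-1}, so step t cannot start until step t-1 has finished.
        h = np.tanh(W_h @ h + W_x @ x_t)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, d_in, d_h = 6, 4, 3                      # hypothetical sizes
X = rng.normal(size=(T, d_in))
W_h = 0.5 * rng.normal(size=(d_h, d_h))
W_x = 0.5 * rng.normal(size=(d_h, d_in))
H = rnn_encode(X, W_h, W_x)                 # T sequential steps, one per token
```

The loop body is cheap, but the loop itself cannot be spread across the time axis, which is the cost we would like to remove.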

> Recurrence, however useful for encoding, is the cause of these problems.

### Our goal now is to remove the sequential time dependency of RNNs and their difficulty with long-range dependencies

## Tackling the time dependency

The recurrent nature of RNNs means that a sequence can only be processed one step at a time, so each parameter update requires $O(T)$ sequential operations that cannot be parallelised.
We need to find an alternative building block that can be used to encode and decode our sequences.

One alternative is to use _word windows_: a "window" of fixed size applied at every position in the text.

# TODO diagram Show very useful diagram of how many steps required to get to each layer. 

Windows in every position across text can be computed immediately, in parallel, with no dependence in time. This tackles parallelisation, but not long-range dependencies. 
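A rough sketch of a single word-window layer, assuming an odd window size `k`, zero padding at the edges, and a tanh nonlinearity (all illustrative choices). Every position is handled by the same matrix multiplication, so nothing waits on an earlier time step.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def window_layer(X, W, k):
    """X: (T, d) embeddings, W: (k*d, d_out). Returns (T, d_out)."""
    pad = k // 2
    # Zero-pad both ends so that every position has a full window around it.
    X_padded = np.pad(X, ((pad, pad), (0, 0)))
    windows = sliding_window_view(X_padded, k, axis=0)            # (T, d, k)
    # Flatten each window into the concatenation of its k word vectors.
    windows = windows.transpose(0, 2, 1).reshape(X.shape[0], -1)  # (T, k*d)
    return np.tanh(windows @ W)     # one matmul covers all T positions at once

rng = np.random.default_rng(0)
T, d, d_out, k = 7, 4, 5, 3         # hypothetical sizes
X = rng.normal(size=(T, d))
W = rng.normal(size=(k * d, d_out))
out = window_layer(X, W, k)
```

Position 2, for example, only ever sees words 1, 2 and 3, which is why a single layer like this cannot capture long-range dependencies.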

# TODO diagram Shows useful diagram of which neuron can influence each between layers (pyramid-looking). 

As you can see from this diagram, each neuron is a combination of only a few neurons in the layer below, so distant words can only interact after many stacked layers.
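The pyramid shape can be turned into simple arithmetic. The sketch below assumes stacked window layers with an odd window size `k`, stride 1 and no dilation; the function names are made up for illustration.

```python
import math

def receptive_field(num_layers, k):
    """Number of input words that one neuron can see after num_layers window layers."""
    return 1 + num_layers * (k - 1)

def layers_needed(distance, k):
    """Fewest layers before two words `distance` positions apart feed the same neuron."""
    return math.ceil(distance / (k - 1))

width = receptive_field(num_layers=4, k=3)   # 1 + 4 * 2 = 9 words
depth = layers_needed(distance=100, k=5)     # ceil(100 / 4) = 25 layers
```

The depth needed grows linearly with the distance between the two words, so long-range interactions stay expensive even though each layer is parallel.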

## Building block 2: Attention

Attention, in general, treats each word’s representation as a query to access and incorporate information from a set of values.

In an RNN seq2seq model, the encoder states for the source sentence were the values (and also played the role of the keys), the decoder state was the query, and the dot product of the query with each encoder state gave an attention score.

The original attention mechanism allowed the decoder to focus on different inputs to the encoder.
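As a reminder of that original mechanism, here is a small sketch of one decoder step attending over the encoder states. It assumes plain dot-product scores normalised with a softmax; the names `attend`, `decoder_state` and `encoder_states` are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))    # subtract the max for numerical stability
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """decoder_state: (d,) query. encoder_states: (T, d), used as keys and values."""
    scores = encoder_states @ decoder_state   # (T,) one dot product per source word
    weights = softmax(scores)                 # attention distribution over the source
    context = weights @ encoder_states        # (d,) weighted average of encoder states
    return context, weights

rng = np.random.default_rng(0)
T, d = 5, 8                                   # hypothetical sizes
encoder_states = rng.normal(size=(T, d))
decoder_state = rng.normal(size=d)
context, weights = attend(decoder_state, encoder_states)
```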

Two questions follow:
1. Is there a way to make the attention mechanism of the **encoder** pay attention to the different tokens of **its own inputs**?
1. Even better, could it pay attention to different inputs of each layer throughout the model rather than just paying attention to the raw inputs?

The answer to both is yes, and we call that _self-attention_.

In self-attention, all hidden states can attend to all words even in the first layer after the embedding. 

Recall that attention operates on queries (q), keys (k) and values (v). 
# TODO put q, k, v in attention notebook

In self-attention, $q$, $k$ and $v$ come from the same source (like the same sentence). 
Where do these come from? 

> Regardless of what form of attention you use, and what $q$, $k$, and $v$ are, you’re doing the same thing: dot product of queries and keys to get the “affinities” (alignment), then creating an affinity-weighted combination of the input values.

How is this different from a fully connected layer, now that you’re connecting everything to everything? 
1. Dynamic connectivity: The connection weights vary as a function of the input, because they are computed from the affinity between the keys and queries. In a fully connected layer, the connection weights are the same for every input. Transformers learn the alignment function which determines the connections between layers for each example.
2. The parameterisation is very different. “It has this inductive bias that’s not just everything to everything feedforward”. 

You get a key, query and value for each word embedding. 
You can stack the self-attention layers and have $k$, $q$, $v$ at each layer. 
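A minimal single-head sketch of this, assuming the queries, keys and values are produced by learned linear projections of the same input (the matrices `W_q`, `W_k`, `W_v` are an assumption, since the notes do not say how $q$, $k$ and $v$ are obtained). Scaling, multiple heads and residual connections are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (T, d). Queries, keys and values all come from the same X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T, axis=-1)   # (T, T) affinity of every word with every word
    return weights @ V                    # (T, d) affinity-weighted combination of values

def init_layer(rng, d):
    # Three random projection matrices stand in for learned parameters.
    return [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]

rng = np.random.default_rng(0)
T, d = 5, 8                               # hypothetical sizes
X = rng.normal(size=(T, d))
H1 = self_attention(X, *init_layer(rng, d))    # first layer, straight after the embeddings
H2 = self_attention(H1, *init_layer(rng, d))   # stacked layer with its own q, k, v
```

Even in the first layer, every output row depends on every input word, which is the contrast with the window approach.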

## Self-attention as described so far cannot yet be used as a building block

There are several problems which we need to address:

### Problem 1 with self-attention
The order of words obviously matters, but self-attention as described so far contains no information about where each word appears. So we need to encode this. So far, it is an operation on sets rather than an operation on an ordered sequence.
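The "operation on sets" claim can be checked numerically. The sketch below uses the simplest possible self-attention, with $q = k = v$ equal to the input (an assumption made to keep it short), and shows that shuffling the words only shuffles the outputs in the same way.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def plain_self_attention(X):
    """Self-attention with q = k = v = X and no positional information."""
    return softmax(X @ X.T, axis=-1) @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))               # 6 "words", 4 dimensions (hypothetical)
perm = rng.permutation(6)                 # a random reordering of the words

out = plain_self_attention(X)
out_shuffled = plain_self_attention(X[perm])

# Each word gets exactly the same representation wherever it sits in the sentence.
order_is_ignored = np.allclose(out[perm], out_shuffled)
```

Here `order_is_ignored` comes out `True`: the representation of a word does not change when the sentence is reordered, so position has to be injected some other way.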

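Let the maximum sentence length be $T$. 
For each $i \in \{1, \dots, T\}$, get a positional encoding $p_i$. 
Then add $p_i$ to each of the self-attention block inputs ($q$, $k$, $v$) at position $i$. 
A simple way to do this is $q_i = \tilde{q}_i + p_i$, and likewise for $k_i$ and $v_i$, where the tilde marks the vector before position information is added. You could concatenate them instead, but it is simple and common to just add.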

You can use sinusoidal functions of the position to get the positional encodings.

Pros:
- Periodicity suggests that absolute position is not as important.
- It may be possible to extrapolate to longer sequences.

Cons:
- It's not learnable; perhaps a better positional encoding could be learnt?
- In practice, extrapolation doesn’t really work.
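For reference, a sketch of one sinusoidal scheme, assuming the formulation from the original Transformer paper (sine on even dimensions, cosine on odd dimensions, wavelengths growing geometrically with base 10000). The notes do not spell out the formula, so treat these details as an assumption.

```python
import numpy as np

def sinusoidal_positions(T, d, base=10000.0):
    """Return a (T, d) matrix whose row i is the encoding p_i. Assumes d is even."""
    positions = np.arange(T)[:, None]              # (T, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)      # (d/2,) one frequency per dimension pair
    angles = positions * freqs                     # (T, d/2)
    P = np.zeros((T, d))
    P[:, 0::2] = np.sin(angles)                    # even dimensions
    P[:, 1::2] = np.cos(angles)                    # odd dimensions
    return P

P = sinusoidal_positions(T=50, d=16)
X = np.random.default_rng(0).normal(size=(50, 16))   # stand-in word embeddings
X_with_position = X + P                               # add, rather than concatenate

# The function is defined for any length, so longer inputs reuse the same rows.
P_longer = sinusoidal_positions(T=200, d=16)
```

Nothing here is trained, and encodings exist for any position, which is where the hope of extrapolation comes from.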

A more common approach nowadays is to learn the $p_i$. 
Let $P$ be a learnable $d \times T$ matrix (embedding size by sequence length) whose $i$-th column is $p_i$. 

Pros:
- Flexible: each position gets to be learned to fit the data.

Cons:
- You can’t extrapolate to sequences longer than $T$ because you haven’t learnt how to represent them. 
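A small sketch of the learned variant. The matrix `P` is randomly initialised here as a stand-in for parameters that would be trained with the rest of the model; the function name and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T_max = 8, 16                                  # hypothetical sizes
P = rng.normal(scale=0.02, size=(d, T_max))       # column i plays the role of p_i

def add_learned_positions(X, P):
    """X: (T, d) word embeddings. Adds p_i to the i-th word."""
    T = X.shape[0]
    if T > P.shape[1]:
        # There is simply no column for positions beyond T_max.
        raise ValueError(f"sequence length {T} exceeds the {P.shape[1]} learned positions")
    return X + P[:, :T].T

X = rng.normal(size=(10, d))
X_with_position = add_learned_positions(X, P)     # fine, since 10 <= 16

try:
    add_learned_positions(rng.normal(size=(T_max + 1, d)), P)
    too_long_rejected = False
except ValueError:
    too_long_rejected = True
```

The failure on the longer input is the extrapolation limit in code form: a position the model never had a parameter for cannot be represented.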

Other ways to build position representations include encoding the relative position between words, or using position representations that depend on syntax. 

# TODO diagram of positional encoding

### Problem 2 with self-attention
There are no nonlinearities, so stacking more self-attention layers just re-averages the value vectors (averaging averages) rather than building representations hierarchically. 

Solution: add a feedforward layer between self-attention blocks. 

The intuition is that the feedforward layers “process the result of the attention”. 
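A sketch of such a feedforward layer, assuming the common two-layer form with a ReLU in between and a wider hidden size (both are assumptions; the notes only say "a feedforward layer"). It is applied to each position independently, so it adds a nonlinearity without mixing words.

```python
import numpy as np

def feed_forward(H, W1, b1, W2, b2):
    """H: (T, d) self-attention outputs. The same two-layer MLP is applied to every row."""
    hidden = np.maximum(0.0, H @ W1 + b1)   # ReLU nonlinearity
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
T, d, d_hidden = 5, 8, 32                   # hypothetical sizes
H = rng.normal(size=(T, d))                 # stand-in for a self-attention output
W1, b1 = rng.normal(size=(d, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d)), np.zeros(d)

out = feed_forward(H, W1, b1, W2, b2)
# Processing one position on its own gives the same result as processing it in the batch.
single = feed_forward(H[2:3], W1, b1, W2, b2)[0]
```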

# TODO improve this section

### Problem 3 with self-attention
Self-attention looks at the whole sequence at once, which is cheating for language modelling! It’s ok for that to happen in an encoder, but not in a decoder, which has to predict each word without seeing the words after it. So we mask the future in decoder self-attention. 

One solution would be to change the set of keys and values at each timestep, but that would be inefficient. Instead, just set the attention affinities for future positions to $-\infty$, which makes their attention weights 0 after the softmax.
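A minimal sketch of the masking trick, again with $q = k = v$ equal to the input to keep it short (an assumption). The mask is an upper-triangular matrix marking every key that lies after the query.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X):
    """Self-attention where position i may only attend to positions j <= i."""
    T = X.shape[0]
    scores = X @ X.T
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True where key index > query index
    scores = np.where(future, -np.inf, scores)           # affinity of -inf for the future
    weights = softmax(scores, axis=-1)                   # exp(-inf) = 0, so weight 0
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                 # hypothetical sizes
out, weights = causal_self_attention(X)

# Changing the last word must not change the outputs for any earlier position.
X_changed = X.copy()
X_changed[-1] += 10.0
out_changed, _ = causal_self_attention(X_changed)
earlier_unaffected = np.allclose(out[:-1], out_changed[:-1])
```

All positions are still computed in one pass, which is what makes this cheaper than rebuilding the keys and values at every timestep.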
# TODO improve

# TODO diagram

### Having addressed these problems, self-attention can now be used as a building block in the seq2seq model

As a recap:
- We removed recurrence by replacing it with self-attention, which processes all positions in parallel and lets every word attend to every other word
- We then introduced a positional encoding to the inputs to tell the model the position of each word
- We added nonlinearities (feedforward layers) between the self-attention layers to allow the model to build hierarchical representations
- We apply masking in decoder self-attention so that the model can't "cheat" by seeing future tokens, which would not be visible during evaluation/inference