# The working details of transformers

Transformers have proven to be a remarkable architecture for sequence-to-sequence
problems. As of the time of writing this book, almost all NLP tasks have
state-of-the-art implementations that come from transformers. This class of networks
builds self-attention from only linear layers and softmax (self-attention is explained
in detail in the next sub-section). Self-attention helps identify the interdependencies
among words in the input text. The input sequence typically does not exceed 2,048 items,
which is large enough for most text applications. However, if images are to be used with
transformers, they have to be flattened, which creates a sequence on the order of
tens or hundreds of thousands of items (a 300 x 300 x 3 image contains 90K pixels, or
270K values), which is not feasible. Facebook Research came up with a novel way to bypass
this restriction by giving the feature map (which is much smaller than the input image)
as input to the transformer. Let's understand the basics of transformers in this section
and look at the relevant code blocks later.
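
To make the size argument concrete, here is a small arithmetic sketch. The downsampling factor of 32 is an assumption chosen for illustration (it is not stated in this section); the point is only that a feature map yields a far shorter sequence than the raw image.

```python
# Compare the sequence length of a flattened image with that of a feature map.
height, width, channels = 300, 300, 3

num_values = height * width * channels  # every channel value as its own item
num_pixels = height * width             # one item per pixel

# Hypothetical backbone that shrinks each spatial side by 32 (an assumption).
stride = 32
fmap_h = -(-height // stride)  # ceiling division
fmap_w = -(-width // stride)
num_fmap_positions = fmap_h * fmap_w

print(num_values)          # 270000
print(num_pixels)          # 90000
print(num_fmap_positions)  # 100

# Self-attention compares every item with every other item,
# so the cost grows with the square of the sequence length.
print(num_pixels ** 2)          # 8100000000
print(num_fmap_positions ** 2)  # 10000
```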

## Basics of transformers

At the heart of a transformer is the self-attention module. It takes three
two-dimensional matrices (called <b> query (Q) </b>, <b> key (K) </b>, and <b> value (V) </b> matrices) as input.
These matrices can be large (each contains text size x embedding size values), so
they are split up into smaller components first (step 1 in the following diagram)
before running through the scaled dot-product attention (step 2 in the following
diagram).
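
For reference, scaled dot-product attention is usually written as follows. This assumes each row of $Q$, $K$, and $V$ corresponds to one sequence position and that $d_k$ (a symbol introduced here, not used elsewhere in this section) is the size of each key vector within one component:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$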

Let's understand how self-attention works. In a hypothetical scenario where the
sequence length is 3, we have three word embeddings ($W_1$, $W_2$, and $W_3$) as input. Say
each embedding is of size 512. Each of these embeddings is individually converted
into three additional vectors, which are the query, key, and value vectors
corresponding to each input:

![trans](../imgs/trans0.png)
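
The conversion above can be sketched with NumPy as three matrix multiplications. This is a minimal illustration, not the book's implementation: the names `proj_q`, `proj_k`, and `proj_v` are hypothetical, the weights are random placeholders for what would be learned linear layers, and biases are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_dim = 3, 512

# Stand-ins for the three word embeddings (random here, learned in practice).
embeddings = rng.normal(size=(seq_len, embed_dim))

# One projection matrix per role; placeholders for learned linear layers.
proj_q = rng.normal(size=(embed_dim, embed_dim)) / np.sqrt(embed_dim)
proj_k = rng.normal(size=(embed_dim, embed_dim)) / np.sqrt(embed_dim)
proj_v = rng.normal(size=(embed_dim, embed_dim)) / np.sqrt(embed_dim)

# Each row of Q, K, and V is the query, key, or value vector of one word.
Q = embeddings @ proj_q
K = embeddings @ proj_k
V = embeddings @ proj_v

print(Q.shape, K.shape, V.shape)  # (3, 512) (3, 512) (3, 512)
```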

Since each vector is 512 in size, working with the full vectors in a single matrix
multiplication is computationally expensive. So, we split each of these vectors into
eight parts, giving eight sets of (64 x 3) tensors for each of the key, query, and value,
where 64 is obtained from 512 (embedding size) / 8 (number of heads) and 3 is the
sequence length:

![trans](../imgs/trans1.png)

Note that there will be eight sets of tensors $K_{w11}$, $K_{w12}$, and so on, because there are eight heads.
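
The split can be sketched as a reshape. This is an illustrative sketch with a random placeholder key matrix; it stores each head as (3 x 64), that is, sequence length first, which is the transpose of the (64 x 3) layout described above.

```python
import numpy as np

seq_len, embed_dim, num_heads = 3, 512, 8
head_dim = embed_dim // num_heads  # 512 / 8 = 64

rng = np.random.default_rng(0)
K = rng.normal(size=(seq_len, embed_dim))  # placeholder key matrix

# (seq_len, embed_dim) -> (seq_len, num_heads, head_dim)
#                      -> (num_heads, seq_len, head_dim)
K_heads = K.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)
print(K_heads.shape)  # (8, 3, 64)

# Head 0 holds the first 64 features of every word, head 1 the next 64, and so on.
print(np.array_equal(K_heads[0], K[:, :head_dim]))  # True
```

The same reshape would be applied to the query and value matrices.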

In each part, we first perform matrix multiplication between the query and key
matrices, which gives a 3 x 3 matrix. We divide it by the square root of the per-head
size ($\sqrt{64} = 8$; this is the "scaled" in scaled dot-product attention) and pass
it through softmax activation. Now, we have a matrix showing how important each word
is in relation to every other word:

![trans](../imgs/trans2.png)
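
A minimal sketch of this step for a single head follows, using random placeholder queries and keys. The division by the square root of the head size is the scaling that gives scaled dot-product attention its name; the `softmax` helper is written out here rather than taken from a library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, head_dim = 3, 64
Q_head = rng.normal(size=(seq_len, head_dim))  # placeholder queries, one row per word
K_head = rng.normal(size=(seq_len, head_dim))  # placeholder keys, one row per word

# Entry (i, j) scores how strongly word i attends to word j.
scores = Q_head @ K_head.T / np.sqrt(head_dim)
weights = softmax(scores, axis=-1)

print(weights.shape)         # (3, 3)
print(weights.sum(axis=-1))  # each row sums to 1, up to float rounding
```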

Finally, we perform matrix multiplication of the preceding tensor output with the
value tensor to get the output of our self-attention operation:

![trans](../imgs/trans3.png)
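
To see what this multiplication does, here is a toy sketch with hand-picked attention weights and value vectors of size 2 (instead of 64) so the numbers are easy to follow. All values are made up for illustration.

```python
import numpy as np

# Hand-picked attention weights for a 3-word sequence (each row sums to 1).
weights = np.array([
    [1.0, 0.0, 0.0],  # word 1 attends only to itself
    [0.5, 0.5, 0.0],  # word 2 attends equally to words 1 and 2
    [0.0, 0.0, 1.0],  # word 3 attends only to itself
])

# Toy value vectors, one row per word.
V_head = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
])

# Each output row is a weighted average of the value rows.
output = weights @ V_head
print(output[0])  # [1. 0.]
print(output[1])  # [0.5 0.5]
print(output[2])  # [2. 2.]
```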

We then combine the eight outputs of this step using a concat layer (step 3 in
the following diagram) and end up with a single tensor of size 512 x 3. Because of
the splitting of the Q, K, and V matrices, the layer is also called <b> multi-head self-attention </b>:

![trans](../imgs/trans4.png)
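
Putting the steps together, the whole layer can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the function name and the `proj_*` weights are hypothetical, the weights are random rather than learned, and the final output projection that multi-head attention layers normally apply after the concat is omitted. Tensors are stored with sequence length first, so the result is (3 x 512) rather than (512 x 3).

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, proj_q, proj_k, proj_v, num_heads):
    seq_len, embed_dim = x.shape
    head_dim = embed_dim // num_heads

    def split(m):
        # Step 1: (seq_len, embed_dim) -> (num_heads, seq_len, head_dim)
        return m.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    Q, K, V = split(x @ proj_q), split(x @ proj_k), split(x @ proj_v)

    # Step 2: scaled dot-product attention, computed for all heads at once.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    per_head = weights @ V                                 # (heads, seq, head_dim)

    # Step 3: concatenate the heads back into one tensor per word.
    return per_head.transpose(1, 0, 2).reshape(seq_len, embed_dim)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 512))
proj_q, proj_k, proj_v = (
    rng.normal(size=(512, 512)) / np.sqrt(512) for _ in range(3)
)

out = multi_head_self_attention(x, proj_q, proj_k, proj_v, num_heads=8)
print(out.shape)  # (3, 512)
```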

The idea behind such a complex-looking network is as follows:

- <b> Values (Vs) </b> are the processed embeddings that need to be learned for a
given input, in the context of its key and query matrices.

- <b> Queries (Qs) </b> and <b> Keys (Ks) </b> act in such a way that their combination will
create the right mask so that only the important parts of the value matrix
are fed to the next layer.

For our example in computer vision, when searching for an object such as a horse, the
query should contain information to search for an object that is large in dimension
and is brown, black, or white in general. The softmax output of scaled dot-product
attention will reflect those parts of the key matrix that contain this color (brown,
black, white, and so on) in the image. Thus, the values output from the self-attention
layer will have those parts of the image that are roughly of the desired color and are
present in the values matrix.


We use the self-attention block several times in the network, as illustrated in the
following diagram. The transformer network contains an encoding network (the left
part of the diagram) whose input is the source sequence. The output of the encoding
half supplies the key and value inputs to the decoding half, while the query input
comes from the decoding half itself and is learned independently of the encoding half:

![trans](../imgs/trans5.png)

Finally, even though this is a sequence of inputs, there's no sign of which token
(word) is first and which is next (since a linear layer has no positional indication).
Positional encodings are learnable embeddings (and sometimes hardcoded vectors)
that we add to each input as a function of its position in the sequence. This is done so
that the network understands which word embedding is first in the sequence, which
is second, and so on.
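
As an example of the hardcoded variety, here is a sketch of the fixed sinusoidal encoding used in the original transformer paper. It is shown as one possible choice, not necessarily the one used later in this book; the function name is hypothetical and the code assumes an even embedding size.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, embed_dim):
    # One row per position, one column per embedding feature.
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    even_dims = np.arange(0, embed_dim, 2)[None, :]  # (1, embed_dim / 2)
    angles = positions / np.power(10000.0, even_dims / embed_dim)

    encoding = np.zeros((seq_len, embed_dim))
    encoding[:, 0::2] = np.sin(angles)  # even feature indices
    encoding[:, 1::2] = np.cos(angles)  # odd feature indices
    return encoding

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 512))  # placeholder word embeddings

pos_enc = sinusoidal_positional_encoding(seq_len=3, embed_dim=512)
inputs_with_position = embeddings + pos_enc  # added element-wise to each input

print(pos_enc.shape)  # (3, 512)
```

Because every position receives a different vector, two identical words at different positions no longer look the same to the network.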