# Chapter 1 - Transformers
## Deep Learning Curriculum - Jacob Hilton

### Theory

* Here are some first principle questions to answer:
1. What is different architecturally from the Transformer, vs a normal RNN, like an LSTM? (Specifically, how are recurrence and time managed?)

    In an RNN, input data is processed sequentially. Given an initial hidden state and the first input token, the RNN is unrolled over the sequence: each subsequent copy of the network receives the hidden state of the previous copy (which is supposed to contain the memory of the network and the context prior to the current position) together with the current token. Thus, the RNN deals with the sequential nature of the input data by literally processing the input sequentially. Apart from this memory of previous computations, kept in the form of hidden states, their architecture is very similar to an MLP, in that they have input, hidden and output layers. There is no need to positionally encode the tokens, since they are processed one at a time, in the order in which they appear.

    In a Transformer network, input sequences are not processed sequentially. Given an input sequence of tokens, the attention mechanism operates over the whole sequence at once. Query, key and value representations of the embedded tokens are generated using learned linear projection matrices. The attention pattern is calculated from the dot product between queries and keys. Softmax is then applied to the attention pattern row-wise (over the keys, for each query), and the elements of each row are used as the scalars in a linear combination of the values, which updates the current embeddings. The updated embeddings are then passed through an MLP. Because the whole sequence is processed at once, in parallel, the tokens must be positionally encoded in order to capture the sequential structure of the data.
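The contrast between the two answers above can be sketched in a few lines. This is only an illustration under assumed toy sizes (`n_tokens = 6`, `d_model = 16`, a single head, no masking); the variable names are made up for the example. The last part checks that unmasked self-attention without positional information is permutation-equivariant, which is why positional encodings are needed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_tokens, d_model = 6, 16          # toy sizes, chosen only for illustration
x = torch.randn(n_tokens, d_model)  # one sequence of token embeddings

# RNN: one step per token; each step needs the previous hidden state.
cell = nn.RNNCell(d_model, d_model)
h = torch.zeros(1, d_model)
rnn_states = []
for t in range(n_tokens):
    h = cell(x[t].unsqueeze(0), h)
    rnn_states.append(h)
rnn_out = torch.cat(rnn_states, dim=0)  # (n_tokens, d_model)

# Self-attention: all token pairs are handled by a single matrix product.
Wq, Wk, Wv = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))

def self_attention(seq):
    Q, K, V = Wq(seq), Wk(seq), Wv(seq)
    weights = torch.softmax(Q @ K.T / d_model ** 0.5, dim=-1)  # row-wise softmax
    return weights @ V, weights

attn_out, weights = self_attention(x)

# Shuffling the tokens just shuffles the outputs in the same way: without
# positional encodings, attention cannot tell the token order apart.
perm = torch.randperm(n_tokens)
attn_out_perm, _ = self_attention(x[perm])
equivariant = torch.allclose(attn_out[perm], attn_out_perm, atol=1e-5)
print(equivariant)
```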

2. Attention is defined as $\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$. What are the dimensions for Q, K, and V? Why do we use this setup? What other combinations could we do with (Q,K) that also output weights?

    $dim(Q) = n_{tokens} \times d_k$

    $dim(K) = n_{tokens} \times d_k$

    $dim(V) = n_{tokens} \times d_v$

    Specifically, in the paper $d_k = d_v = 64$. In terms of possible combinations of $Q$ and $K$, the only necessary condition is that there is one query and one key vector for each token in the sequence and that these vectors have the same size, so that their dot product is defined. Since the outputs of all heads are concatenated at the end of a multi-head attention block, it is useful to have $d_k = d_{model}/h$, where $h$ is the number of heads. Thus, as stated in the paper, with $d_{model} = 512$ you can use values of $d_k$ such as 16, 32, 64, 128 and 512 (for 32, 16, 8, 4 and 1 heads, respectively). With this convention $d_k$ never exceeds $d_{model}$; there is also little to gain from a larger $d_k$, because it would project the embeddings into a space of higher dimension than their own.
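The shapes listed above can be checked directly. The sketch below assumes a single head with the paper's base sizes ($d_{model} = 512$, $h = 8$) and an arbitrary sequence length of 10; the names `W_q`, `W_k`, `W_v` are illustrative.

```python
import torch
import torch.nn as nn

d_model, h = 512, 8
d_k = d_v = d_model // h   # 64 with the paper's base sizes
n_tokens = 10              # arbitrary sequence length for the example

x = torch.randn(n_tokens, d_model)

# One head: learned projections from d_model down to d_k / d_v.
W_q = nn.Linear(d_model, d_k, bias=False)
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_v, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)          # (n_tokens, d_k), (n_tokens, d_k), (n_tokens, d_v)
scores = Q @ K.T / d_k ** 0.5             # (n_tokens, n_tokens)
weights = torch.softmax(scores, dim=-1)   # each row sums to 1
head_out = weights @ V                    # (n_tokens, d_v)

print(Q.shape, K.shape, V.shape, weights.shape, head_out.shape)
```

Concatenating the `h` head outputs along the last dimension gives a tensor of shape `(n_tokens, h * d_v)`, which equals `(n_tokens, d_model)` when $d_v = d_{model}/h$.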

3. Are the dense layers different at each multi-head attention block? Why or why not?

    All the dense layers between the multi-head attention blocks share the same architecture, but each block has its own learned parameters. They all act on the same vector space and capture non-linear relationships that the attention sub-layer, which only forms weighted averages of value vectors, would have difficulty capturing. They are applied position-wise, i.e. the same network is applied to each token independently. Because the head outputs are concatenated and projected back to $d_{model}$ before reaching the dense layers, both the input and output layers have size $d_{model}$ (512 in the paper). For simplicity's sake the hidden-layer size is kept the same across blocks (2048 in the paper).
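A small check of the two claims above, as a sketch only: the helper `make_ffn` and the sequence length of 7 are assumptions made for the example, and the sizes are those of the paper's base model.

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048  # sizes used in the paper's base model

def make_ffn():
    # Same architecture on every call, freshly initialised parameters each time.
    return nn.Sequential(
        nn.Linear(d_model, d_ff),
        nn.ReLU(),
        nn.Linear(d_ff, d_model),
    )

ffn_block1, ffn_block2 = make_ffn(), make_ffn()

x = torch.randn(7, d_model)  # 7 token positions

# Position-wise: running the FFN on the whole sequence matches running it
# on each position separately.
out_all = ffn_block1(x)
out_each = torch.stack([ffn_block1(x[i]) for i in range(x.shape[0])])
position_wise = torch.allclose(out_all, out_each, atol=1e-4)

# Two blocks share the architecture but not the weights.
same_weights = torch.equal(ffn_block1[0].weight, ffn_block2[0].weight)
n_params = sum(p.numel() for p in ffn_block1.parameters())
print(position_wise, same_weights, n_params)
```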

4. Why do we have so many skip connections, especially connecting the input of an attention function to the output? Intuitively, what if we didn't?

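    We have these skip connections for two reasons. First, they express the update of the current embeddings by the linearly combined value vectors: each sub-layer adds its output to its input rather than replacing it. Second, they reduce the problem of vanishing gradients by giving gradients a direct path through the network. Without them, information in the residual stream, such as the original token and positional embeddings, could be lost as it passes through successive sub-layers.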
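To make the "add rather than replace" point concrete, here is a minimal sketch of a residual wrapper in the post-norm form $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$. The class name `ResidualSublayer` is hypothetical, and the zeroed linear layer is an artificial extreme case standing in for a sub-layer that contributes nothing.

```python
import torch
import torch.nn as nn

class ResidualSublayer(nn.Module):
    """Wraps a sub-layer as LayerNorm(x + sublayer(x))."""

    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

d_model = 16
# Extreme case: a sub-layer that outputs zeros for every input.
dead_layer = nn.Linear(d_model, d_model)
nn.init.zeros_(dead_layer.weight)
nn.init.zeros_(dead_layer.bias)

x = torch.randn(4, d_model)
block = ResidualSublayer(d_model, dead_layer)

without_skip = dead_layer(x)  # all zeros: the input is lost
with_skip = block(x)          # the (normalised) input survives via the skip path
print(without_skip.abs().sum().item(), torch.allclose(with_skip, block.norm(x)))
```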

### Code

* Now we'll actually implement the code. Make sure each of these is completely correct - it's very easy to get the small details wrong.
  * Implement the positional embedding function first.
  * Then implement the function which calculates attention, given (Q,K,V) as arguments.
  * Now implement the masking function.
  * Put it all together to form an entire attention block.
  * Finish the whole architecture.
* If you get stuck, The Annotated Transformer may help, but don't just copy-paste the code.
* To check you have the attention mask set up correctly, train your model on a toy task, such as reversing a random sequence of tokens. The model should be able to predict the second half of the sequence, but not the first.
* Finally, train your model on the complete works of William Shakespeare.
  * Tokenize the corpus by splitting at word boundaries (re.split(r"\b", ...)).
  * Make sure you don't use overlapping sequences as this can lead to overfitting.
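The last two sub-steps can be sketched with the standard library alone. The helper names `tokenize` and `non_overlapping_chunks` are made up for this example, and splitting on an empty match such as `\b` assumes Python 3.7 or later.

```python
import re

def tokenize(text):
    # re.split(r"\b", ...) cuts at every word boundary, so words and the
    # punctuation/whitespace between them both become tokens. The empty
    # strings produced at the edges are dropped.
    return [tok for tok in re.split(r"\b", text) if tok]

def non_overlapping_chunks(tokens, seq_len):
    # Consecutive windows with stride == seq_len, so no token appears in
    # two training sequences. A trailing partial window is discarded.
    n_full = len(tokens) // seq_len
    return [tokens[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

sample = "To be, or not to be"
tokens = tokenize(sample)
chunks = non_overlapping_chunks(tokens, seq_len=4)
print(tokens)
print(chunks)
```

Joining the tokens back together reproduces the original text, which is a convenient sanity check for the tokenizer.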

#### Packages

In [3]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

In [None]:
class MultiHeadAttention(nn.Module):

  def __init__(self):
    super(MultiHeadAttention, self).__init__()

  def forward(self):
    pass

In [None]:
class MLP(nn.Module):

  def __init__(self):
    super(MLP, self).__init__()

  def forward(self):
    pass

In [5]:
class Transformer(nn.Module):

  def __init__(self, batch, vocab_size, d_model=512, n_heads=8, hidden_size=2048):
    super(Transformer, self).__init__()

    # initial hyperparams
    self.d_model = d_model
    self.batch = batch
    self.n_heads = n_heads
    self.d_k = d_model // n_heads  # integer division: nn.Linear needs an int size
    self.hidden_size = hidden_size

    # embedding layer
    self.embedding = nn.Embedding(vocab_size, self.d_model)

    # multi-head attention blocks
    self.Wq, self.Wk, self.Wv, self.W0 = self.multiHeadAttention()

    # layernorm
    self.layernorm1 = nn.LayerNorm(self.d_model)

    # position-wise feed forward
    self.ff = self.mlp()



  def __op_pos_enc(self, dim, pos):

    return pos/(1e4**((2*dim)/self.d_model))

  def _pos_encoding_vec(self, pos):

    # for a given position, returns the vector of encodings
    pos_vec = np.zeros(self.d_model)
    for i in range(int(self.d_model/2)):
      pos_vec[2*i] = np.sin(self.__op_pos_enc(i, pos))
      pos_vec[2*i + 1] = np.cos(self.__op_pos_enc(i, pos))

    return pos_vec

  def getPositionalEncoding(self):

    pos_encodings = np.zeros((self.batch, self.d_model))

    # gets the encoding vector for each position
    for pos in range(pos_encodings.shape[0]):
      pos_encodings[pos] = self._pos_encoding_vec(pos)

    # cast to float32 so the encodings match the dtype of nn.Embedding outputs
    return torch.from_numpy(pos_encodings).float()

  def attention(self, Q, K, V, decoder=True):

    den = np.sqrt(self.d_k)

    # attention function
    dot = torch.matmul(Q, K.t()) / den

    # masking if it's a decoder network
    if decoder:
      dot = self._masking(dot)
    att = F.softmax(dot, dim = 1)
    att = torch.matmul(att, V)

    return att

  def _masking(self, att_patt):

    rows, cols = att_patt.shape
    temp = torch.ones(rows, cols)
    temp = torch.tril(temp)
    mask = temp == 0

    # masking upper triangle with -inf
    mask_mat = torch.masked_fill(att_patt, mask, -np.inf)

    return mask_mat

  def mlp(self):

    # dense layer
    mlp = nn.Sequential(
        nn.Linear(self.d_model, self.hidden_size),
        nn.ReLU(),
        nn.Linear(self.hidden_size, self.d_model)
    )

    return mlp

  def multiHeadAttention(self):

    # queries
    Wq = nn.ModuleList([nn.Linear(self.d_model, self.d_k) for _ in range(self.n_heads)])

    # keys
    Wk = nn.ModuleList([nn.Linear(self.d_model, self.d_k) for _ in range(self.n_heads)])

    # values
    Wv = nn.ModuleList([nn.Linear(self.d_model, self.d_k) for _ in range(self.n_heads)])

    # last linear layer: maps the concatenated heads (n_heads * d_k = d_model) back to d_model
    W0 = nn.Linear(self.d_model, self.d_model)

    return Wq, Wk, Wv, W0

  def forward(self):
    pass
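One possible way to combine the pieces into a forward pass is sketched below. It is not the notebook's implementation: the class `DecoderBlockSketch` is hypothetical, it handles a single unbatched sequence, it fuses all heads into one projection matrix per role instead of using a `ModuleList`, and it omits dropout, embeddings and positional encodings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlockSketch(nn.Module):
    def __init__(self, d_model=512, n_heads=8, hidden_size=2048):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        # Fused projections: each matrix holds the weights of all heads.
        self.Wq = nn.Linear(d_model, d_model)
        self.Wk = nn.Linear(d_model, d_model)
        self.Wv = nn.Linear(d_model, d_model)
        self.W0 = nn.Linear(d_model, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, d_model),
        )

    def _split_heads(self, t):
        # (n_tokens, d_model) -> (n_heads, n_tokens, d_k)
        return t.view(t.shape[0], self.n_heads, self.d_k).transpose(0, 1)

    def forward(self, x):
        # x: (n_tokens, d_model), a single sequence without a batch dimension
        n = x.shape[0]
        Q = self._split_heads(self.Wq(x))
        K = self._split_heads(self.Wk(x))
        V = self._split_heads(self.Wv(x))
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5  # (n_heads, n, n)
        # Causal mask: position i may not attend to positions j > i.
        mask = torch.triu(torch.ones(n, n, device=x.device), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
        heads = F.softmax(scores, dim=-1) @ V                # (n_heads, n, d_k)
        concat = heads.transpose(0, 1).reshape(n, -1)        # (n, d_model)
        x = self.norm1(x + self.W0(concat))                  # residual + layernorm
        return self.norm2(x + self.ff(x))                    # residual + layernorm

torch.manual_seed(0)
block = DecoderBlockSketch(d_model=64, n_heads=4, hidden_size=128)
x = torch.randn(5, 64)
out = block(x)

# Changing the last token must leave all earlier outputs untouched.
x_changed = x.clone()
x_changed[-1] += 1.0
out_changed = block(x_changed)
print(out.shape, torch.allclose(out[:-1], out_changed[:-1], atol=1e-5))
```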


In [6]:
# vocab_size=1000 is an arbitrary placeholder; use the size of your tokenizer's vocabulary
model = Transformer(100, vocab_size=1000)

In [None]:
a = model.getPositionalEncoding()

In [None]:
import matplotlib.pyplot as plt

In [None]:
cax = plt.matshow(a)
plt.gcf().colorbar(cax)
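For comparison with the loop-based version, the same sinusoidal formula can be written without Python loops. This is a sketch under the assumption that `d_model` is even; the function name `sinusoidal_positions` is made up for the example.

```python
import torch

def sinusoidal_positions(n_positions, d_model):
    # pos: column vector of positions, shape (n_positions, 1)
    pos = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)
    # two_i: the even indices 0, 2, 4, ..., i.e. 2*i in the paper's notation
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angles = pos / (10000 ** (two_i / d_model))  # (n_positions, d_model / 2)

    pe = torch.zeros(n_positions, d_model)
    pe[:, 0::2] = torch.sin(angles)  # even dimensions
    pe[:, 1::2] = torch.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positions(100, 512)
print(pe.shape, pe.dtype)
```

Position 0 gives angle 0 everywhere, so its row alternates between 0 (sine) and 1 (cosine), which is an easy check when plotting the matrix.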

In [None]:
Q = torch.randn((5, 64))
K = torch.randn((5, 64))
V = torch.randn((5, 64))

In [None]:
a = model.attention(Q, K, V)
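A quick way to test that the mask is set up correctly, before training on the toy task, is to perturb a later token and confirm that earlier outputs do not change. The standalone function `causal_attention` below is a sketch written for this check, not the class method above.

```python
import torch
import torch.nn.functional as F

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / d_k ** 0.5
    # True above the diagonal: query i must not see keys j > i.
    mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # softmax over keys, per query
    return weights @ V, weights

torch.manual_seed(0)
Q, K, V = torch.randn(5, 64), torch.randn(5, 64), torch.randn(5, 64)
out, weights = causal_attention(Q, K, V)

# Perturb the key and value of the last token only.
K2, V2 = K.clone(), V.clone()
K2[-1] += 1.0
V2[-1] += 1.0
out2, _ = causal_attention(Q, K2, V2)

future_weight = torch.triu(weights, diagonal=1).abs().sum().item()
earlier_unchanged = torch.allclose(out[:-1], out2[:-1])
print(future_weight, earlier_unchanged)
```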

In [14]:
class SmallNet(nn.Module):

  def __init__(self):
    super(SmallNet, self).__init__()

    self.layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(5)])

  def forward(self, X):

    out_list = []
    for i, l in enumerate(self.layers):
      out = l(X)
      out_list.append(out)
    return out_list

In [15]:
test_net = SmallNet()

In [16]:
X = torch.ones(10)

In [17]:
test_net(X)

[tensor([ 0.3752, -0.0617,  0.0061, -0.3177,  0.3214, -0.8134,  0.1087, -0.3988,
         -0.5155, -0.0663], grad_fn=<ViewBackward0>),
 tensor([-0.7972,  0.9918, -0.5188,  0.4179,  1.1942,  0.0958,  0.6256, -0.3470,
         -0.1079,  0.6537], grad_fn=<ViewBackward0>),
 tensor([-0.6397,  0.6054, -0.0701, -0.2502, -0.3925,  0.4335, -0.5914,  0.4660,
         -0.2802,  0.4107], grad_fn=<ViewBackward0>),
 tensor([ 1.0603,  0.2111, -0.2385,  0.2580,  0.0450, -0.4529, -0.6450,  0.2023,
         -0.0367, -1.1947], grad_fn=<ViewBackward0>),
 tensor([ 0.0514,  0.8784, -0.3908, -1.0010, -0.4124, -0.4319, -0.8391,  0.5553,
         -0.2332,  0.5662], grad_fn=<ViewBackward0>)]