# Welcome to assignment #4!

Please submit your solution to this notebook in Whiteboard under the corresponding assignment entry. Upload both the .ipynb file and the exported .pdf of this notebook.

If you have any questions, ask them either in the tutorials or in the "Mattermost" channel. The channel is called SSL_WS_2324; you can join the server using this [Link](https://mattermost.imp.fu-berlin.de/signup_user_complete/?id=h5ssupqokprtpyf4dr7xabiwpc&md=link&sbr=su) and then search for the public channel.


This week is all about attention.

# Slide Review

[Google Form](https://forms.gle/u2BeWjDEQ5ZW1jZk8) for the slide review. Please take a minute to scroll through the slides again and help improve the lecture.

Please make sure to only choose your top 5 slides per lecture!

# PapagAI

From the second week onwards we started the reflective study.
Register on the [PapagAI website](https://www.papag.ai) and write your first reflection about your impressions and challenges in the context of the lectures and tutorials you had this week and the previous week. The reflection should be longer than 100 words. You can check out this [YouTube video](https://www.youtube.com/watch?v=QdmZHocZQBk&ab_channel=FernandoRamosL%C3%B3pez) with instructions on how to register, create a reflection, and get AI feedback.

Please note that this task is obligatory for this course. Make sure each of you writes a reflection, not only one person per group.

#### Please state both names of your group members here:
Authors: Omar Ahmed and Can Aydin

# Assignment 4: Transformers

## Ex. 4.1 Attention

Build the self-attention layer yourself as introduced in [Attention is all you need](https://arxiv.org/abs/1706.03762). Make sure to combine Query, Key and Value in the right way and to apply the softmax function to the attention scores. **(RESULT)**
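
For reference, the scaled dot-product attention from the paper can be written as follows, where $d_k$ is the dimension of the key vectors and the softmax is applied row-wise, i.e. over the key positions:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

The multi-head variant runs $h$ such attention functions in parallel on learned projections of the input and combines the concatenated results with an output projection $W^O$:

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O,
\qquad
\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)
$$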

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

In [None]:
class multiheadSelfAttention(nn.Module):
    def __init__(self, heads, embedding_dimension, attention_vectors_dimension):
        super().__init__()
        self.heads = heads
        self.embedding_dimension = embedding_dimension
        self.attention_vector_dimension = attention_vectors_dimension

        # assert that the attention vector dimension is divisible by the number of heads
        assert self.attention_vector_dimension % self.heads == 0

        # create one weight matrix each for the queries, keys and values; each matrix holds the weights of all heads.
        # If the dimension of the attention vectors is 64 and the number of attention heads is 8 => 64 * 8 = 512,
        # then the weight matrices have dimensions 512 x 512, if the embedding dimension is 512.
        # The projections are split into the separate heads to perform the dot product and are concatenated again afterwards
        self.W_q = nn.Linear(self.embedding_dimension, self.heads * self.attention_vector_dimension, bias=False)
        self.W_k = nn.Linear(self.embedding_dimension, self.heads * self.attention_vector_dimension, bias=False)
        self.W_v = nn.Linear(self.embedding_dimension, self.heads * self.attention_vector_dimension, bias=False)

        # create weight matrix that turns all concatenated attention heads into one output
        # example: 64 * 8 = 512 => 512 x 512
        self.W_o = nn.Linear(self.heads * self.attention_vector_dimension, self.embedding_dimension, bias=False)

    def attention(self, Q, K, V, mask=None):
        # transpose K at the second-to-last and the last index position:
        # (32 x 8 x seq_len x 64) @ (32 x 8 x 64 x seq_len) = (32 x 8 x seq_len x seq_len)
        Q_K = torch.matmul(Q, K.transpose(-2, -1))
        Q_K /= math.sqrt(self.attention_vector_dimension) # (32 x 8 x seq_len x seq_len)

        if mask is not None:
            Q_K = Q_K.masked_fill(mask==0,-1e9)

        softmax_Q_K = torch.softmax(Q_K, dim=-1) # (32 x 8 x seq_len x seq_len)
        output = torch.matmul(softmax_Q_K, V) # (32 x 8 x seq_len x seq_len) @ (32 x 8 x seq_len x 64) = (32 x 8 x seq_len x 64)
        return output

    def forward(self, X_Q, X_K, X_V, mask=None):
        batch_size = X_Q.size(0)
        # after the matmul of the embedded input and the weight matrix, the result has dimensions (32 x seq_len x 512),
        # where seq_len is the number of tokens in the sequence. To calculate the attention scores, reshape this
        # projection to (32 x seq_len x 8 x 64), which simply splits the concatenated projections of all attention heads.
        # Finally, transpose dimensions 1 and 2 (seq_len and heads), since the attention scores are calculated on
        # tensors of dimensions (32 x 8 x seq_len x 64), see https://jalammar.github.io/illustrated-transformer/.
        Q = self.W_q(X_Q).view(batch_size, -1, self.heads, self.attention_vector_dimension).transpose(1, 2)
        K = self.W_k(X_K).view(batch_size, -1, self.heads, self.attention_vector_dimension).transpose(1, 2)
        V = self.W_v(X_V).view(batch_size, -1, self.heads, self.attention_vector_dimension).transpose(1, 2)

        attention_heads_scores = self.attention(Q, K, V, mask) # (32 x 8 x seq_len x 64)

        # concatenate the separate attention heads:
        # transpose seq_len and 8 (#heads) back, so you get (32 x seq_len x 8 x 64).
        # After that, revert the reshape and you get (32 x seq_len x 512). A tensor of these dimensions
        # can be passed through the weight matrix W_o, which combines all heads into a single output: (32 x seq_len x 512).
        # contiguous() is needed because transpose only changes the strides and not the memory layout,
        # while view requires the merged dimensions to be adjacent in memory
        concatenated_heads = attention_heads_scores.transpose(1, 2).contiguous().view(batch_size, -1, self.heads * self.attention_vector_dimension)
        return self.W_o(concatenated_heads)
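
A quick way to sanity-check an attention implementation is to look at the attention weights directly. The sketch below is a standalone, single-function version of scaled dot-product attention (the function name `scaled_dot_product_attention` and all sizes are illustrative assumptions, not part of the solution above). It checks that every row of weights sums to 1 and that a causal mask removes all attention to future positions.

```python
import math
import torch


def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (batch x heads x seq_len x d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(Q.size(-1))
    if mask is not None:
        # positions where mask == 0 end up with (practically) zero attention weight
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights


torch.manual_seed(0)
batch, n_heads, seq_len, d_k = 2, 8, 5, 64  # illustrative sizes
Q = torch.randn(batch, n_heads, seq_len, d_k)
K = torch.randn(batch, n_heads, seq_len, d_k)
V = torch.randn(batch, n_heads, seq_len, d_k)

# lower-triangular (causal) mask, broadcast over batch and heads
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
output, weights = scaled_dot_product_attention(Q, K, V, causal_mask)

print(output.shape)                      # torch.Size([2, 8, 5, 64])
print(weights.sum(dim=-1))               # every row sums to 1
print(weights[0, 0])                     # upper triangle is zero
```

The same two checks can be applied to the output of any attention layer that exposes its weights.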

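On the question of why `contiguous()` is needed before `view`: `transpose` does not move any data, it only swaps the strides of the tensor. After the transpose, the head and feature dimensions are no longer adjacent in memory, so `view` cannot merge them. A small sketch with illustrative sizes:

```python
import torch

batch, n_heads, seq_len, d_k = 2, 8, 5, 64  # illustrative sizes
x = torch.randn(batch, n_heads, seq_len, d_k)

# transpose swaps the strides only; no data is moved
t = x.transpose(1, 2)  # (batch x seq_len x heads x d_k)
print(x.stride(), t.stride())
print(t.is_contiguous())  # False

try:
    t.view(batch, seq_len, n_heads * d_k)
    view_failed = False
except RuntimeError:
    # view cannot merge the last two dimensions, since they are not adjacent in memory
    view_failed = True

# contiguous() copies the data into the new layout, after which view works
merged = t.contiguous().view(batch, seq_len, n_heads * d_k)

# reshape() gives the same result: it copies only when it has to
same = torch.equal(merged, t.reshape(batch, seq_len, n_heads * d_k))
print(view_failed, same)
```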

## Ex. 4.2 Shakespeare

For this exercise, we want you to generate Shakespeare-like sentences using the transformer approach with the attention mechanism as its core.

To solve this, we want you to implement the following pipeline:

**Input -> Embedding -> N x (Transformer-Block) -> FC-Layer -> Softmax -> Output**

The "Transformer-Block" should look like this:

**Self-Attention -> LayerNorm -> FeedForward -> LayerNorm**
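
Because the pipeline ends in a softmax, the network returns probabilities rather than raw logits. This matters for the training loss: `F.cross_entropy` applies a log-softmax itself and expects logits, so feeding it probabilities would apply the softmax twice. One option, sketched below with random stand-in tensors (all sizes are illustrative assumptions), is to take the logarithm of the probabilities and use `F.nll_loss`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, vocab_size = 4, 6, 50  # illustrative sizes
logits = torch.randn(batch, seq_len, vocab_size)          # stand-in for the FC-layer output
targets = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in for the next-token ids

# the pipeline ends in a softmax, so the network returns probabilities
probs = F.softmax(logits, dim=-1)

# nll_loss expects log-probabilities of shape (N x vocab_size) and targets of shape (N);
# the small constant guards against log(0)
loss_from_probs = F.nll_loss(torch.log(probs + 1e-9).view(-1, vocab_size), targets.view(-1))

# cross_entropy applies log_softmax itself and therefore expects the raw logits
loss_from_logits = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

print(loss_from_probs.item(), loss_from_logits.item())  # nearly identical
```

Returning the logits during training and applying the softmax only for generation is a common alternative that avoids the extra logarithm.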

The tinyShakespeare dataset is available [Here](https://www.tensorflow.org/datasets/catalog/tiny_shakespeare).

You are free to use any tokenizer for the encoding, and you may use existing implementations/libraries for it. We recommend Byte Pair Encoding (BPE), see this [Link](https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt). You are not allowed to use a ready-made transformer implementation to solve the whole task. We want you to build this pipeline around your own attention function. For the feed-forward part and the fully connected layer, you may use the PyTorch implementation. **(RESULT)**
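
To illustrate the idea behind BPE, here is a minimal sketch in plain Python: starting from single characters, the most frequent adjacent symbol pair is merged repeatedly. The function names `learn_bpe_merges` and `bpe_encode` are hypothetical, and the sketch leaves out details that real tokenizers handle (pre-tokenization, end-of-word markers, byte-level fallback, mapping tokens to ids). It is not the Hugging Face implementation.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    # each word starts as a tuple of single characters, weighted by its frequency
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # count how often each adjacent symbol pair occurs in the corpus
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # replace every occurrence of the best pair by the merged symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


def bpe_encode(word, merges):
    symbols = list(word)
    for left, right in merges:  # apply the merges in the order they were learned
        i = 0
        while i + 1 < len(symbols):
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols


words = ["hug", "hug", "hug", "pug", "hugs"]  # toy corpus
merges = learn_bpe_merges(words, num_merges=2)
print(merges)                      # [('u', 'g'), ('h', 'ug')]
print(bpe_encode("hugs", merges))  # ['hug', 's']
print(bpe_encode("bug", merges))   # ['b', 'ug']
```

The last line shows why subword tokenizers are useful: a word that never appeared in the corpus is still split into known pieces.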

How did you pick **N** for your best result? **(RESULT)**

In [None]:
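import copy

embedding_dim = 512
heads = 8
attention_vector_dim = 64


class FeedForward(nn.Module):
    def __init__(self, embedding_dimension, feedForwardDimension):
        super().__init__()
        self.input = nn.Linear(embedding_dimension, feedForwardDimension)
        self.output = nn.Linear(feedForwardDimension, embedding_dimension)

    def forward(self, attention_scores):
        # attention_scores has dimension: (32 x seq_len x 512)
        x = F.relu(self.input(attention_scores))
        output = self.output(x)
        return output


class TransformerBlock(nn.Module):
    # Self-Attention -> LayerNorm -> FeedForward -> LayerNorm
    # nn.Sequential cannot be used here, because the attention layer takes several inputs (X_Q, X_K, X_V, mask)
    def __init__(self, heads, embedding_dim, attention_vector_dim, feedForwardDimension=4096):
        super().__init__()
        self.attention = multiheadSelfAttention(heads=heads, embedding_dimension=embedding_dim, attention_vectors_dimension=attention_vector_dim)
        self.norm1 = nn.LayerNorm(embedding_dim)
        self.feedForward = FeedForward(embedding_dimension=embedding_dim, feedForwardDimension=feedForwardDimension)
        self.norm2 = nn.LayerNorm(embedding_dim)

    def forward(self, x, mask=None):
        # self-attention: queries, keys and values all come from the same input
        x = self.norm1(self.attention(x, x, x, mask))
        x = self.norm2(self.feedForward(x))
        return x


class Net(nn.Module):
    def __init__(self, heads, embedding_dim, attention_vector_dim, target_dim, N, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.heads = heads
        self.embedding_dim = embedding_dim
        self.attention_vector_dim = attention_vector_dim
        self.N = N
        self.target_dim = target_dim # the dimension of the target vocab

        transformerBlock = TransformerBlock(heads, embedding_dim, attention_vector_dim)
        self.transformerLayers = nn.ModuleList([copy.deepcopy(transformerBlock) for _ in range(N)])
        self.fc = nn.Linear(self.embedding_dim, self.target_dim)

    def forward(self, token_ids):
        # token_ids: LongTensor of token indices with dimension (batch x seq_len);
        # the text has to be tokenized and mapped to indices before the model is called
        seq_len = token_ids.size(1)
        # causal mask: position i may only attend to positions <= i, so the model cannot look at future tokens
        mask = torch.tril(torch.ones(seq_len, seq_len, device=token_ids.device))

        x = self.embedding(token_ids)
        for layer in self.transformerLayers: # NxTransformerBlock
            x = layer(x, mask)

        x = self.fc(x)
        x = F.softmax(x, dim=-1)

        return x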
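
Generating text from such a network means sampling one token at a time and feeding the result back in. The sketch below shows this loop. `DummyLM` is a hypothetical, untrained stand-in with the same interface as the network above (token ids in, next-token probabilities out), and `generate`, `context_size` and all sizes are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DummyLM(nn.Module):
    # hypothetical stand-in for the trained network: maps (batch x seq_len) token ids
    # to (batch x seq_len x vocab_size) next-token probabilities
    def __init__(self, vocab_size, embedding_dim=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc = nn.Linear(embedding_dim, vocab_size)

    def forward(self, token_ids):
        return F.softmax(self.fc(self.embedding(token_ids)), dim=-1)


@torch.no_grad()
def generate(model, token_ids, max_new_tokens, context_size):
    model.eval()
    for _ in range(max_new_tokens):
        context = token_ids[:, -context_size:]             # crop to the context window
        probs = model(context)[:, -1, :]                   # distribution over the next token
        next_id = torch.multinomial(probs, num_samples=1)  # sample instead of taking the argmax
        token_ids = torch.cat([token_ids, next_id], dim=1)
    return token_ids


torch.manual_seed(0)
vocab_size = 50
model = DummyLM(vocab_size)
prompt = torch.tensor([[1, 2, 3]])  # illustrative token ids
generated = generate(model, prompt, max_new_tokens=10, context_size=8)
print(generated.shape)  # torch.Size([1, 13])
```

Sampling with `torch.multinomial` tends to give more varied text than always picking the most likely token; the generated ids still have to be decoded back to text with the tokenizer.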


