#### This submission is for... (*put up to three people*)
- Erika Mustermann (87654321)
- Max Mustermann (12345678)
- Maxine Musterfrau (87651234)

# Exercise 2A - Transformers

In this exercise, you'll implement a basic encoder-only Transformer architecture with PyTorch. We will start with the basic building blocks and then integrate them into a fully-fledged Transformer model. We then train the model to solve a POS-tagging problem (more on that later). In the previous exercise, you implemented your work in NumPy. Now we switch to PyTorch, which tracks the gradients for us and lets us focus on the network itself.

You can receive up to three points for your implementation of Exercise 2A. Together with Exercise 1, you can get up to six bonus points for the exam.

**Important Notice**: Throughout the notebook, basic structures are provided, such as functions and classes with missing or partial bodies, and variables that you need to assign to. **Don't change the names of functions, variables, and classes - and make sure that you are using them!** You're allowed to introduce helper variables and functions. Occasionally, we use **type annotations** that you should follow. They are not enforced by Python. Whenever you see an ellipsis `...`, you're supposed to insert code.

In [None]:
!pip install torchtext torchdata torchmetrics

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchmetrics
  Downloading torchmetrics-0.11.4-py3-none-any.whl (519 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 519.2/519.2 kB 9.1 MB/s eta 0:00:00
Installing collected packages: torchmetrics
Successfully installed torchmetrics-0.11.4


In [2]:
import torch
import math
import torch.nn.functional as F
from torch import nn, Tensor
from torch.nn.modules.dropout import Dropout

if torch.cuda.is_available():
    device = "cuda"
    print("CUDA is available")
else:
    device = "cpu"
    print("CUDA is not available. Execution time on the cpu is slow.")

  

CUDA is not available. Execution time on the cpu is slow.


Let's start with two basic functions that we will need throughout the exercise, namely **Softmax** and **ReLU**.

$\text{Softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$

$\text{ReLU}(x) = \max(0, x)$

In [None]:
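def softmax(input: Tensor) -> Tensor:
    # Subtract the row-wise maximum for numerical stability.
    offset, _ = input.max(dim=-1, keepdim=True)
    p = torch.exp(input - offset)
    # keepdim=True so that each row is divided by its own sum.
    p = p / torch.sum(p, dim=-1, keepdim=True)
    return p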


def relu(input: Tensor) -> Tensor:
    
    return torch.maximum(input,torch.tensor(0))


# Test

# Softmax
input = torch.linspace(0,16,steps = 16).reshape(4,4)
print(input)

print(softmax(input))

print(softmax(input).sum(axis=1))

# Relu
input[:,1]= -1
print(input)
print(relu(input))


tensor([[ 0.0000,  1.0667,  2.1333,  3.2000],
        [ 4.2667,  5.3333,  6.4000,  7.4667],
        [ 8.5333,  9.6000, 10.6667, 11.7333],
        [12.8000, 13.8667, 14.9333, 16.0000]])
tensor([[0.0271, 0.0788, 0.2289, 0.6652],
        [0.0271, 0.0788, 0.2289, 0.6652],
        [0.0271, 0.0788, 0.2289, 0.6652],
        [0.0271, 0.0788, 0.2289, 0.6652]])
tensor([1.0000, 1.0000, 1.0000, 1.0000])
tensor([[ 0.0000, -1.0000,  2.1333,  3.2000],
        [ 4.2667, -1.0000,  6.4000,  7.4667],
        [ 8.5333, -1.0000, 10.6667, 11.7333],
        [12.8000, -1.0000, 14.9333, 16.0000]])
tensor([[ 0.0000,  0.0000,  2.1333,  3.2000],
        [ 4.2667,  0.0000,  6.4000,  7.4667],
        [ 8.5333,  0.0000, 10.6667, 11.7333],
        [12.8000,  0.0000, 14.9333, 16.0000]])
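
The `linspace` test above uses rows that differ only by a constant offset, so all four rows produce identical probabilities and a mix-up between rows and columns in the normalization would go unnoticed. A stricter check, sketched here under the assumption that normalization runs over the last dimension, compares a hand-written version against `torch.softmax` on random rows. The name `softmax_lastdim` is hypothetical and not part of the exercise template.

```python
import torch

def softmax_lastdim(x: torch.Tensor) -> torch.Tensor:
    # Subtract the row-wise maximum for numerical stability.
    shifted = x - x.max(dim=-1, keepdim=True).values
    exp = torch.exp(shifted)
    # keepdim=True keeps the sum broadcastable against each row.
    return exp / exp.sum(dim=-1, keepdim=True)

torch.manual_seed(0)
x = torch.randn(4, 4)                 # rows with genuinely different values
probs = softmax_lastdim(x)
reference = torch.softmax(x, dim=-1)  # PyTorch's built-in implementation

print(torch.allclose(probs, reference, atol=1e-6))
print(probs.sum(dim=-1))              # each row should sum to 1
```

Random inputs make each row's normalizer different, which is what exposes a missing `keepdim=True`.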


## Transformer Block

A typical transformer block consists of the following components:
- Multi-Head Attention
- Layer Normalization
- Linear Layer
- Residual Connections

<center><img src="https://i.imgur.com/ZKgcoe4.png" alt="transformer block visualization" width="200">

In the next few subsections, we will build these basic building blocks.

### Multi-Head Attention

Multi-Head Attention concatenates the outputs of several so-called **attention heads**.

$\textrm{MHA}(Q,K,V) = \textrm{Concat}(H_1,...,H_h)$

<center><img src="https://www.tensorflow.org/images/tutorials/transformer/multi_head_attention.png" width=300>

One attention head consists of linear projections for each of $Q, K$ and $V$ and an attention mechanism called **Scaled Dot-Product Attention**. The attention mechanism scales down the dot products by $\sqrt{d_k}$.

$\textrm{Attention}(Q,K,V)=\textrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V$



If we assume that $q$ and $k$ are $d_k$-dimensional vectors whose components are independent random variables with mean $0$ and variance $1$, then their dot product $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ has mean $0$ and variance $d_k$. A variance of $1$ is preferable because large scores push the softmax into regions with very small gradients, which is why we scale the dot products down by $\sqrt{d_k}$.

The dot product $q \cdot k$ acts as a measure of similarity between a query and a key.
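
The scaling argument can be checked empirically. The following sketch assumes components drawn from a standard normal distribution (mean 0, variance 1) and an illustrative head dimension of 64; neither value is prescribed by the exercise.

```python
import torch

torch.manual_seed(0)
d_k = 64            # illustrative head dimension (assumption)
n_samples = 20_000  # number of sampled (q, k) pairs

q = torch.randn(n_samples, d_k)  # components with mean 0 and variance 1
k = torch.randn(n_samples, d_k)

dots = (q * k).sum(dim=-1)   # one dot product per sampled pair
scaled = dots / d_k ** 0.5   # scaled as in the attention formula

print(f"variance of q.k:           {dots.var().item():.1f}")
print(f"variance of q.k/sqrt(d_k): {scaled.var().item():.2f}")
```

The first value should land close to `d_k` and the second close to 1, up to sampling noise.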


<center><img src="https://www.tensorflow.org/images/tutorials/transformer/scaled_attention.png" width="350">

Let's start implementing these components. Note that our classes inherit from PyTorch's `nn.Module`. These modules hold our parameters and let us easily move them to the GPU (with `.to(...)`). They also let us define the computation that is performed at every call in the `forward()` method. For example, once we have initialized an `Attention` module with `attention = Attention(...)`, we can call it with `attention(Q, K, V)`, which runs the `forward` method along with any registered hooks.
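
As a minimal sketch of the `nn.Module` mechanics described above, the following toy module registers a layer, moves its parameters to a device, and is invoked by calling the instance. The class `TinyProjection` and its sizes are hypothetical and unrelated to the classes you are asked to implement.

```python
import torch
from torch import nn

class TinyProjection(nn.Module):  # hypothetical example module
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Layers assigned as attributes are registered automatically,
        # so their parameters show up in named_parameters().
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(x))

device = "cuda" if torch.cuda.is_available() else "cpu"
module = TinyProjection(8, 4).to(device)  # moves weight and bias
x = torch.randn(2, 8, device=device)

y = module(x)  # calling the instance dispatches to forward()
names = [name for name, _ in module.named_parameters()]
print(y.shape)
print(names)
```

Calling `module(x)` rather than `module.forward(x)` is the usual convention, because the call operator also triggers hooks.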

In [None]:
class Attention(nn.Module):
    def __init__(self, hidden_n:int):
        super().__init__()
        self.hidden_n = hidden_n
        ...

    def forward(self, Q, K, V, mask=None):
        ...

In [50]:
class MultiHeadAttention(nn.Module):
    """
    hidden_n: hidden dimension
    h: number of heads

    Usage:
      attn = MultiHeadAttention(hidden_n, h=2)
      # self-attention
      data = torch.randn(batch_size, sequence_length, hidden_n)
      self_attn_output = attn(query=data, key=data, value=data)
      # attention using two inputs
      other_data = torch.randn(batch_size, sequence_length, hidden_n)
      attn_output = attn(query=data, key=other_data, value=other_data)
    """

    def __init__(self, hidden_n:int, h:int = 2, dropout=0.1):
        """
        Construct a new MultiHeadAttention layer.
        Inputs:
         - hidden_n: Dimension of the token embedding
         - h: Number of attention heads
         - dropout: Dropout probability
        """
        super().__init__()
        assert hidden_n % h == 0

        self.key = nn.Linear(hidden_n, hidden_n)
        self.query = nn.Linear(hidden_n, hidden_n)
        self.value = nn.Linear(hidden_n, hidden_n)
        self.proj = nn.Linear(hidden_n, hidden_n)

        self.h = h
        self.dropout = nn.Dropout(p=dropout)
        self.scale = math.sqrt(hidden_n / h)


    def forward(self, query, key, value, attn_mask=None):
        """
        Calculate the masked attention output for the provided data, computing
        all attention heads in parallel.
        In the shape definitions below, N is the batch size, S is the query
        sequence length, T is the key/value sequence length, and E is the
        embedding (hidden) dimension.
        Inputs:
        - query: Input data to be used as the query, of shape (N, S, E)
        - key: Input data to be used as the key, of shape (N, T, E)
        - value: Input data to be used as the value, of shape (N, T, E)
        - attn_mask: Tensor broadcastable to (N, H, S, T), e.g. of shape (S, T),
          where attn_mask[i, j] == 0 means that query position i must not
          attend to key position j.
        Returns:
        - output: Tensor of shape (N, S, E) giving the weighted combination of
          data in value according to the attention weights calculated using key
          and query.
        """
        N, S, D = query.shape
        N, T, D = value.shape
        

        # Get num of heads
        H = self.h

        # Compute key, query and value matrices from sequences
        K = self.key(key).view(N, T, H, D//H).moveaxis(1, 2)
        Q = self.query(query).view(N, S, H, D//H).moveaxis(1, 2)
        V = self.value(value).view(N, T, H, D//H).moveaxis(1, 2)

        # (N,H,S,D/H) @ (N,H,D/H,T) -> (N,H,S,T)
        Y = Q @ K.transpose(2, 3) / self.scale

        if attn_mask is not None:
            # Ensure small probabilities in softmax
            Y = Y.masked_fill(attn_mask==0, float("-inf"))
        
        # NOTE: Assignment says apply dropout after attention output. That does
        # not work so dropout is applied right after softmax.

        # (N,H,S,T) @ (N,H,T,D/H) -> (N,H,S,D/H)
        Y = self.dropout(F.softmax(Y, dim=-1)) @ V
        output = self.proj(Y.moveaxis(1, 2).reshape(N, S, D))

        return output
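
The effect of `masked_fill` with `-inf` before the softmax can be seen on a small example. This sketch assumes a lower-triangular (causal) mask purely for illustration; for POS tagging, the mask would more likely mark padding positions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
S = 4
scores = torch.randn(S, S)           # raw attention scores (S queries, S keys)
mask = torch.tril(torch.ones(S, S))  # 1 = may attend, 0 = blocked

# Blocked positions get -inf, so exp(-inf) = 0 after the softmax.
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)

print(weights)
print(weights.sum(dim=-1))  # every row still sums to 1
```

A row in which every entry is masked would yield NaN, since the softmax of a vector containing only `-inf` is undefined, so each query needs at least one visible key.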


### Layer Normalization

From the lecture, recall layer normalization, where the values are normalized across the feature dimension, independently for each sample in the batch. First calculate the mean $\mu$ and variance $\sigma^2$ across the feature dimension, then normalize the features so that their mean is 0 and their standard deviation is 1. Introduce **two sets of learnable parameters** that are applied to the normalized features: one for shifting the mean (addition) and one for scaling the variance (multiplication), i.e., two parameters for each feature. Tip: Use `nn.Parameter` for that.

$y_{\textrm{norm}}=\frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}$

$y=y_{\textrm{norm}}\cdot\beta+\alpha$
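
The normalization step can be sketched directly from the formula and compared with PyTorch's functional `layer_norm`. The tensor shape below is an arbitrary choice for illustration, and the learnable scale and shift are left out so that only the normalization itself is checked.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 3, 8)  # (batch, sequence, features), illustrative sizes
eps = 1e-5

# Statistics over the feature (last) dimension, separately per token.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)  # population variance
y_norm = (x - mean) / torch.sqrt(var + eps)

reference = F.layer_norm(x, normalized_shape=(8,), eps=eps)
print(torch.allclose(y_norm, reference, atol=1e-5))
```

Note the use of the population variance (`unbiased=False`); the unbiased estimator would give slightly different values.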

<center>
<img src="https://i.stack.imgur.com/E3104.png" alt="visualization of layer norm vs. batch norm" width="420">

In [54]:
class LayerNorm(nn.Module):
    def __init__(self, norm_shape):
        """
        norm_shape: Shape of the learnable scale and shift parameters
        (e.g. the feature dimension). Statistics are computed over the
        last dimension only.
        """
        super().__init__()
        self.epsilon = 1e-5
        # Named as in the formula above: beta scales, alpha shifts.
        self.beta = nn.Parameter(torch.ones(norm_shape))
        self.alpha = nn.Parameter(torch.zeros(norm_shape))

    def forward(self, x: torch.Tensor):
        # Mean and (population) variance over the feature dimension.
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        std = torch.sqrt(var + self.epsilon)

        x_norm = (x - mean) / std
        return self.beta * x_norm + self.alpha
        
def _test():


  x = torch.zeros([16, 10, 512])
  print(x.shape)
  ln = LayerNorm(x.shape[1:])

  x = ln(x)
  print(f"{x.shape}output shape")
  print(f"{ln.alpha.shape}www")
  print(ln.beta.shape)

_test()

x= torch.zeros([16, 10, 512])
print(x.shape)
mean = x.mean(dim=(0,1), keepdim=True)
mean.shape

x= torch.linspace(1, 24, 24).reshape(2,3,4)

print(x)

mean = x.mean(dim=(-1), keepdim=True)
print(mean)
print(x.shape[1:])
y = torch.tensor([1,1,2,1]) 

z = mean * y
z

torch.Size([16, 10, 512])
torch.Size([16, 10, 512])output shape
torch.Size([10, 512])www
torch.Size([10, 512])
torch.Size([16, 10, 512])
tensor([[[ 1.,  2.,  3.,  4.],
         [ 5.,  6.,  7.,  8.],
         [ 9., 10., 11., 12.]],

        [[13., 14., 15., 16.],
         [17., 18., 19., 20.],
         [21., 22., 23., 24.]]])
tensor([[[ 2.5000],
         [ 6.5000],
         [10.5000]],

        [[14.5000],
         [18.5000],
         [22.5000]]])
torch.Size([3, 4])


tensor([[[ 2.5000,  2.5000,  5.0000,  2.5000],
         [ 6.5000,  6.5000, 13.0000,  6.5000],
         [10.5000, 10.5000, 21.0000, 10.5000]],

        [[14.5000, 14.5000, 29.0000, 14.5000],
         [18.5000, 18.5000, 37.0000, 18.5000],
         [22.5000, 22.5000, 45.0000, 22.5000]]])

### Transformer Block

Here, we bring all ingredients together into a single module. Don't forget to add the residual connections.

In [51]:
class TransformerBlock(nn.Module):
    def __init__(self, hidden_n: int, h: int = 2, expansion_ratio=4, dropout=0.1, max_len=20):
        """
        hidden_n: hidden dimension
        h: number of heads
        expansion_ratio: width of the feed-forward layer relative to hidden_n
        dropout: dropout probability
        max_len: currently unused
        """
        super().__init__()
        self.attention = MultiHeadAttention(hidden_n, h, dropout)

        self.norm1 = LayerNorm(hidden_n)
        self.norm2 = LayerNorm(hidden_n)

        self.feedforward = nn.Sequential(
            nn.Linear(hidden_n, expansion_ratio * hidden_n),
            nn.ReLU(),
            nn.Linear(expansion_ratio * hidden_n, hidden_n),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask=None):
        # In an encoder block, value, key and query are the same tensor; they
        # are kept separate because they differ in a decoder block.
        out_attention = self.attention(query, key, value, mask)
        # First residual connection, followed by normalization.
        out_firstnorm = self.norm1(query + out_attention)
        out_firstnorm = self.dropout(out_firstnorm)

        out_feedforward = self.feedforward(out_firstnorm)
        # Second residual connection, followed by normalization.
        out = self.norm2(out_feedforward + out_firstnorm)
        out = self.dropout(out)
        return out
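
For comparison, the same wiring (attention, residual connection, normalization, feed-forward, residual connection, normalization) can be sketched with PyTorch's built-in modules. This is only an illustration of where the residual connections sit; the class name `PostNormBlockSketch` and the sizes are assumptions, and it does not replace the hand-written classes the exercise asks for.

```python
import torch
from torch import nn

class PostNormBlockSketch(nn.Module):  # hypothetical name
    def __init__(self, hidden_n: int = 16, h: int = 2, expansion_ratio: int = 4):
        super().__init__()
        # batch_first=True makes the module accept (batch, sequence, features).
        self.attention = nn.MultiheadAttention(hidden_n, h, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden_n)
        self.norm2 = nn.LayerNorm(hidden_n)
        self.feedforward = nn.Sequential(
            nn.Linear(hidden_n, expansion_ratio * hidden_n),
            nn.ReLU(),
            nn.Linear(expansion_ratio * hidden_n, hidden_n),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attention(x, x, x)    # self-attention
        x = self.norm1(x + attn_out)             # first residual connection
        x = self.norm2(x + self.feedforward(x))  # second residual connection
        return x

block = PostNormBlockSketch()
x = torch.randn(2, 5, 16)
out = block(x)
print(out.shape)  # the block preserves the input shape
```

Because every sub-layer maps `hidden_n` back to `hidden_n`, blocks of this kind can be stacked without any reshaping in between.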

## A Simple Transformer Architecture

Let's stack our transformer blocks and add an embedding layer for a simple transformer architecture. You are allowed to use `nn.Embedding` here.

In [52]:
class Transformer(nn.Module):
    def __init__(self, embed_n, hidden_n, n, h, device, dropout, num_targ_vocab):
        """
        embed_n: number of token embeddings (vocabulary size)
        hidden_n: hidden dimension
        n: number of layers
        h: number of heads per layer
        device: device the inputs are expected on
        dropout: dropout probability
        num_targ_vocab: number of output classes
        """
        super().__init__()
        self.device = device
        self.embed1 = nn.Embedding(embed_n, hidden_n)
        # Projection to the output classes; not applied in forward() yet.
        self.proj = nn.Linear(hidden_n, num_targ_vocab)

        self.layers = nn.ModuleList(
            [TransformerBlock(hidden_n, h, dropout=dropout) for _ in range(n)]
        )

    def forward(self, input, mask=None):
        # input: (N, S) token ids -> (N, S, hidden_n) embeddings.
        # No positional encoding is added here.
        out = self.embed1(input)

        for layer in self.layers:
            out = layer(out, out, out, mask)
        return out

In [55]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

x = torch.tensor([[1, 5, 6, 4, 3, 9, 5, 2, 0], [1, 8, 7, 3, 4, 5, 6, 7, 2]]).to(
    device
)
trg = torch.tensor([[1, 7, 4, 3, 5, 9, 2, 0], [1, 5, 6, 2, 4, 7, 6, 2]]).to(device)




model = Transformer(
    embed_n=80,         # number of embeddings (vocabulary size)
    hidden_n=256,       # hidden dimension
    n=3,                # number of layers
    h=2,                # number of heads
    device=device,
    dropout=0.1,
    num_targ_vocab=64,  # number of output classes
).to(device)
x1 = ((1/3) *torch.linspace(1,2*78,2*78)).reshape(2,-1).to(torch.int32).to(
    device
)
x1

out = model(x)
print(out.shape)

out = model(x1)

cpu
torch.Size([2, 9, 256])


## POS-Tagging

Part-Of-Speech-Tagging (**POS-Tagging**) is a **sequence labeling problem** where we categorize words in a text in correspondence with a particular part of speech (e.g., "noun" or "adjective"). A few examples and classes are shown in the following table:

|  POS Tag  |  Description  |  Examples  |
|-----------|------------|------------|
|  NN | Noun (singular, common) | mass, wind, ...  |
|  NNP | Noun (singular, proper) | Obama, Liverpool, ...  |
| CD  | Numeral (cardinal)  | 1890, 0.5, ...  |
|  DT | Determiner  | all, any, ... |
| JJ | Adjective (ordinal) | oiled, third, ... |
... many more

### CoNLL2000 Dataset

Let's load our dataset which is the **CoNLL2000 dataset** and look at an example.

In [3]:
%pip install portalocker
!python --version

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting portalocker
  Downloading portalocker-2.7.0-py2.py3-none-any.whl (15 kB)
Installing collected packages: portalocker
Successfully installed portalocker-2.7.0
Python 3.9.16


In [4]:

from torch.utils.data import Dataset, DataLoader
from torchtext.datasets import CoNLL2000Chunking
import pandas as pd

# Load both splits once instead of constructing the dataset twice.
train_iter, test_iter = CoNLL2000Chunking()
train_df = pd.DataFrame(train_iter, columns=['words', 'pos_tags', 'chunk'])
test_df = pd.DataFrame(test_iter, columns=['words', 'pos_tags', 'chunk'])

train_src, train_tgt = train_df['words'].tolist(), train_df['pos_tags'].tolist()
test_src, test_tgt = test_df['words'].tolist(), test_df['pos_tags'].tolist()

print(train_src[2])
print(train_tgt[2])

['But', 'analysts', 'reckon', 'underlying', 'support', 'for', 'sterling', 'has', 'been', 'eroded', 'by', 'the', 'chancellor', "'s", 'failure', 'to', 'announce', 'any', 'new', 'policy', 'measures', 'in', 'his', 'Mansion', 'House', 'speech', 'last', 'Thursday', '.']
['CC', 'NNS', 'VBP', 'VBG', 'NN', 'IN', 'NN', 'VBZ', 'VBN', 'VBN', 'IN', 'DT', 'NN', 'POS', 'NN', 'TO', 'VB', 'DT', 'JJ', 'NN', 'NNS', 'IN', 'PRP$', 'NNP', 'NNP', 'NN', 'JJ', 'NNP', '.']


First, we need to create a vocabulary. Our dataset is already tokenized. However, we need to assign ids to the tokens in order to feed them to the embedding layer. We also need the number of embeddings (`num_embeddings`) for the size of the lookup table of `nn.Embedding`.

Thus, we will iterate over all sentences, build the vocabulary, and later replace each token with its id. It'll be handy to have two mappings: from id to token and from token to id. Note that we add a special token `<unk>` with id `0` for unknown words (words that are not in the training dataset but could appear in the test dataset).
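
The two mappings can be sketched on a toy corpus before working with the real data. Everything below (the sentences, the cut-off `max_vocab`, and the helper `encode`) is a made-up illustration, not the notebook's actual vocabulary code.

```python
from collections import Counter

# Toy corpus standing in for train_src (assumption for illustration).
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat"]]
max_vocab = 3  # keep only the most frequent tokens (illustrative value)

counts = Counter(tok for sent in sentences for tok in sent)
kept = [tok for tok, _ in counts.most_common(max_vocab)]

# Id 0 is reserved for <unk>; the kept tokens follow in frequency order.
id2token = dict(enumerate(["<unk>"] + kept))
token2id = {tok: i for i, tok in id2token.items()}

def encode(sentence):
    # Unknown tokens fall back to id 0 (<unk>).
    return [token2id.get(tok, 0) for tok in sentence]

print(id2token)
print(encode(["the", "bird", "sat"]))
```

Using `dict.get` with a default of `0` keeps the lookup to a single expression and handles unseen test words without a separate membership check.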

In [110]:
from collections import Counter
from collections import OrderedDict

# Improvement: there are too many distinct words, so sort them by frequency
# and discard the less frequent ones.

vocabulary_id2token : dict = {0: '<unk>'}
vocabulary_token2id : dict = {'<unk>': 0}

# get all words
word_list = [word for sentence in train_src for word in sentence]

# sort by frequency
count = Counter(word_list)
result = sorted(word_list, key=lambda x: count[x], reverse=True)

# remove duplicates and keep the 8000 most frequent words
word_list_no_redundant = list(OrderedDict.fromkeys(result))[:8000]

# show
for i in range(300):
    print(word_list_no_redundant[i])

print(f'length no redundant:{len(word_list_no_redundant)}')
# add <unk> at id 0
word_list_no_redundant.insert(0,'<unk>')

# create index list
idx = list(range(len(word_list_no_redundant)))

# create token-to-id and id-to-token dicts
vocabulary_token2id = dict(zip(word_list_no_redundant,idx))
vocabulary_id2token = {v: k for k, v in vocabulary_token2id.items()}  # swap keys and values

print(f'length id list:{len(vocabulary_id2token)}')

# test
print(vocabulary_token2id['if'])
print(vocabulary_id2token[107])
vocabulary_id2token[vocabulary_token2id['if']]


107
if


'if'

In [None]:
import random

rand_idx = [random.randint(0, 10) for _ in range(100)]

print(result)

my_list = ['JACK', 'JACK', 'china', 'africa', 'upstairs', 'upstars', '!', '!']
result = list(set(my_list))

for item in my_list:
    if item not in result:
        result.append(item)

print(result)



Let's do the same for our classes:

In [112]:
classes_id2name : dict = {}
classes_name2id : dict = {}

token_list = [token for sentence in train_tgt + test_tgt for token in sentence]

my_list = token_list
count = Counter(my_list)
result = sorted(my_list, key=lambda x: count[x],reverse=True)
token_list_no_redun =  list(set(result))

print(f'lenth token list:{len(token_list_no_redun)}')
idx = [ i for i in range(len(token_list_no_redun))]

classes_name2id = dict(zip(token_list_no_redun,idx))
classes_id2name =  {v: k for k, v in classes_name2id.items()}

# test 
print(classes_name2id['NN'])
print(classes_id2name[32])

print(classes_id2name[19])
classes_name2id['VBZ']

lenth token list:44
32
NN
VBZ


19

In [70]:
# from collections import Counter

# my_list = [1, 2, 3, 1, 2, 3, 3, 4]
# count = Counter(my_list)
# result = sorted(my_list, key=lambda x: count[x],reverse=True)


# print(result)
# result_no_redun = list(dict.fromkeys(result))
# print(result_no_redun)


[3, 3, 3, 1, 2, 1, 2, 4]
[3, 1, 2, 4]


Now, let's use PyTorch's `Dataset` and `DataLoader` to help us batch our data. Let's also replace tokens and classes with our ids. For that, complete `get_token_ids` and `get_class_ids`.

In [113]:
def get_token_ids(src):
    """
    Map a list of tokens to their ids; unknown tokens map to 0 (<unk>).

    Usage: get_token_ids(['word1', 'word2'])
    """
    return [vocabulary_token2id.get(elm, 0) for elm in src]


# TEST

print(get_token_ids(['<unk>','if']))
print(get_token_ids(['have']))

def get_class_ids(tgt):
    """Map a list of POS tags to their class ids."""
    return [classes_name2id[elm] for elm in tgt]

# TEST
get_class_ids(['NN','VBZ'])


# Define class ConllDataset

class ConllDataset(Dataset):
  def __init__(self, src, tgt):
        self.src = src
        self.tgt = tgt

  def __len__(self):
        return len(self.src)

  def __getitem__(self, index):
        src = self.src[index]
        tgt = self.tgt[index]
        
        return {
            'src': get_token_ids(src),
            'tgt': get_class_ids(tgt),
        }

train_dataset = ConllDataset(train_src, train_tgt)
test_dataset = ConllDataset(test_src, test_tgt)



what = train_dataset[0]

# print(what)
t =what['src']
# print(t)
# w =what['tgt']
# print(w)

# helper functions

def idtowords(id):
  result = [vocabulary_id2token[i] for i in id]
  return result
def id2name(id):
  result = [classes_id2name[i] for i in id]
  return result

# test 
sr = idtowords(what['src'])
tg = id2name(what['tgt'])
print(sr , len(sr))
print(tg,len(tg))
print(get_token_ids(idtowords(what['src'])) == what['src'])
print(get_class_ids(id2name(what['tgt'])) == what['tgt'])

print(get_token_ids(['if']))
print(get_token_ids(['<unk>','if']))
print(get_token_ids(['qwewqeqw','if','unknown_word']))

[0, 107]
[34]
['<unk>', 'in', 'the', 'pound', 'is', 'widely', 'expected', 'to', 'take', 'another', 'sharp', 'dive', 'if', 'trade', 'figures', 'for', 'September', ',', 'due', 'for', 'release', 'tomorrow', ',', 'fail', 'to', 'show', 'a', 'substantial', 'improvement', 'from', 'July', 'and', 'August', "'s", 'near-record', '<unk>', '.'] 37
['NN', 'IN', 'DT', 'NN', 'VBZ', 'RB', 'VBN', 'TO', 'VB', 'DT', 'JJ', 'NN', 'IN', 'NN', 'NNS', 'IN', 'NNP', ',', 'JJ', 'IN', 'NN', 'NN', ',', 'VB', 'TO', 'VB', 'DT', 'JJ', 'NN', 'IN', 'NNP', 'CC', 'NNP', 'POS', 'JJ', 'NNS', '.'] 37
True
True
[107]
[0, 107]
[0, 107, 0]


We will use a **batch size of 32**.

In [114]:
BATCH_SIZE = 32

# trainloader = DataLoader(train_dataset,batch_size=BATCH_SIZE,shuffle=True,drop_last=True)  # raises an error because the items in a batch do not share the same length
# testloader = DataLoader(test_dataset,batch_size=BATCH_SIZE,shuffle=False,drop_last=True)
# dataiter = iter(trainloader)

However, since our examples are of different length, we need to pad shorter examples to the length of the example with the maximum length in our batch. So, let's define a special **padding token** in our vocabulary:

In [136]:
padding_token = '<blank>'  # id 1
import random
from collections import Counter
from collections import OrderedDict

# Improvement: there are too many distinct words, so sort them by frequency
# and discard the less frequent ones.

vocabulary_id2token : dict = {0: '<unk>'}
vocabulary_token2id : dict = {'<unk>': 0}

# get all words
word_list = [word for sentence in train_src for word in sentence]

# sort by frequency
count = Counter(word_list)
result = sorted(word_list, key=lambda x: count[x], reverse=True)

# remove duplicates and keep the 8000 most frequent words
word_list_no_redundant = list(OrderedDict.fromkeys(result))[:8000]

# show
for i in range(300):
    print(word_list_no_redundant[i])

print(f'length no redundant:{len(word_list_no_redundant)}')
# add <unk> at id 0 and the padding token at id 1
word_list_no_redundant.insert(0,'<unk>')
word_list_no_redundant.insert(1,padding_token)

# create index list
idx = list(range(len(word_list_no_redundant)))

# create token-to-id and id-to-token dicts
vocabulary_token2id = dict(zip(word_list_no_redundant,idx))
vocabulary_id2token = {v: k for k, v in vocabulary_token2id.items()}  # swap keys and values

print(f'length id list:{len(vocabulary_id2token)}')


# test
print(vocabulary_id2token[0])
print(vocabulary_token2id[vocabulary_id2token[0]])
print(vocabulary_id2token[1])
print(vocabulary_token2id[vocabulary_id2token[1]])
for n in [random.randint(0,500) for _ in range(10)]:
  
  print(vocabulary_id2token[n])
  print(vocabulary_token2id[vocabulary_id2token[n]])
  print(vocabulary_id2token[n])
  print(vocabulary_token2id[vocabulary_id2token[n]])
  print(vocabulary_id2token[vocabulary_token2id[vocabulary_id2token[n]]])
  vocabulary_id2token[n]
  print('~'*10)
  #vocabulary_id2token[n]


,
the
.
of
to
a
and
in
's
for
that
$
``
The
''
is
said
%
on
from
million
at
it
by
as
was
be
with
are
Mr.
n't
its
an
have
has
or
will
he
company
year
were
says
they
would
which
about
--
their
more
In
share
this
up
But
market
but
billion
also
than
had
who
been
his
other
I
:
some
new
one
U.S.
out
Corp.
not
New
years
all
;
-RRB-
could
Inc.
-LRB-
into
stock
It
because
can
last
after
when
only
two
shares
cents
over
do
&
rose
York
business
sales
price
quarter
trading
companies
such
may
if
A
Co.
earnings
most
any
investors
people
first
He
government
time
investment
there
net
many
interest
we
week
president
much
'
prices
now
yesterday
down
you
them
months
stocks
say
1
income
earlier
group
bonds
no
make
what
We
so
state
through
money
San
does
just
did
while
rate
And
like
major
American
three
For
earthquake
federal
still
next
10
month
even
chief
rates
made
officials
bank
tax
expected
products
off
That
get
fell
higher
chairman
sell
California
back
plan
operations
unit
industry
before
those
financi

In [144]:
classes_id2name : dict = {}
classes_name2id : dict = {}

token_list = [token for sentence in train_tgt + test_tgt for token in sentence]

my_list = token_list
count = Counter(my_list)
result = sorted(my_list, key=lambda x: count[x],reverse=True)
token_list_no_redun =  list(set(result))

token_list_no_redun.insert(0,'<blank>')

print(f'lenth token list:{len(token_list_no_redun)}')
idx = [ i for i in range(len(token_list_no_redun))]

classes_name2id = dict(zip(token_list_no_redun,idx))
classes_id2name =  {v: k for k, v in classes_name2id.items()}

# test 
print(classes_name2id['<blank>'])
print(classes_id2name[0])
print(classes_name2id['<blank>'])
print(classes_name2id['NN'])
print(classes_id2name[33])
print(classes_id2name[10])
classes_name2id['WDT']

0
<blank>
0
33
NN
WDT


10

The `collate_fn` is the function that actually receives a batch and needs to add the padding tokens, then returns `src` and `tgt` as `Tensor`s of size `[B, S]` where `B` is our batch size and `S` our maximum sequence length. This function should additionally return a `mask`, a `Tensor` with binary values to indicate whether the specific element is a padding token or not (0 if it's a padding token, 1 if not), such that we can ignore padding tokens in our attention mechanism and loss calculation. 
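
One way to pad to the longest example in each batch, as described above, is `torch.nn.utils.rnn.pad_sequence`. The sketch below is an illustration under stated assumptions: the padding ids (`PAD_TOKEN_ID = 1`, `PAD_CLASS_ID = 0`) and the function name `collate_dynamic` are chosen for the example and are not part of the exercise template.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_TOKEN_ID = 1  # assumed id of the padding token in the vocabulary
PAD_CLASS_ID = 0  # assumed id of the padding class

def collate_dynamic(batch):
    # batch: list of dicts with integer lists under 'src' and 'tgt'.
    src_seqs = [torch.tensor(item["src"], dtype=torch.long) for item in batch]
    tgt_seqs = [torch.tensor(item["tgt"], dtype=torch.long) for item in batch]

    # Pad every sequence to the longest one in this batch -> [B, S].
    src = pad_sequence(src_seqs, batch_first=True, padding_value=PAD_TOKEN_ID)
    tgt = pad_sequence(tgt_seqs, batch_first=True, padding_value=PAD_CLASS_ID)

    # Build the mask from the true lengths rather than from the ids, so a
    # real token that shares the padding id is never masked by accident.
    lengths = torch.tensor([len(seq) for seq in src_seqs])
    positions = torch.arange(src.size(1))
    mask = (positions[None, :] < lengths[:, None]).long()
    return {"src": src, "tgt": tgt, "mask": mask}

toy_batch = [
    {"src": [5, 6], "tgt": [2, 3]},
    {"src": [7, 8, 9, 10], "tgt": [4, 5, 6, 7]},
]
out = collate_dynamic(toy_batch)
print(out["src"])
print(out["mask"])
```

A function with this signature can be passed to `DataLoader(..., collate_fn=collate_dynamic)`; padding per batch keeps tensors as short as the data allows, at the cost of a sequence length that varies between batches.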

In [154]:
# collate_fn receives a batch: a list of dictionaries with the keys 'src' and
# 'tgt' (token ids and class ids, as defined in ConllDataset).
# It returns tensors of shape [B, S] = [BATCH_SIZE, MAX_SEQ_LENGTH].

USE_GPU = torch.cuda.is_available()

def create_tensor(tensor):
    # Cast to an integer tensor and move it to the selected device.
    return tensor.int().to(device)

# test
tes = create_tensor(torch.tensor([1,2,3]))
tes.is_cuda

# Define collate_fn
def collate_fn(batch):   # list[dict] -> dict[str, Tensor]
    """
    batch: list of dictionaries with keys src and tgt (as defined in ConllDataset)

    Every sequence is truncated or padded to a fixed length of S = 50.
    """
    S = max_seqlen = 50
    seq_lengths = [min(S, len(item['src'])) for item in batch]

    B = len(batch)
    # Placeholders: src is filled with the padding token id 1 ('<blank>'),
    # tgt with the padding class id 0, and mask with 0 (= padding).
    src = torch.ones(B, S)
    mask = torch.zeros(B, S)
    tgt = torch.zeros(B, S)

    # Copy each (possibly truncated) sequence into its row.
    for idx in range(B):
        n = seq_lengths[idx]
        src[idx, :n] = torch.LongTensor(batch[idx]['src'][:n])
        mask[idx, :n] = 1
        tgt[idx, :n] = torch.LongTensor(batch[idx]['tgt'][:n])

    # Cast to integer tensors on the target device.
    src = create_tensor(src)
    tgt = create_tensor(tgt)
    mask = create_tensor(mask)
    dic_r = {
        'src': src,
        'tgt': tgt,
        'mask': mask,   # src mask
    }

    return dic_r




# test 

list1 = {
            'src': [0, 14],
            'tgt': [0, 14],
        }

list2 = {
            'src': [1,13,14,15],
            'tgt': [1,13,14,15],
        }

list3 = {
    'src': [i for i in range(60)],
    'tgt': [i for i in range(60)],
}
batch = [list1,list2 ,list3] 

print(f"batch as input:{batch}")
# batch[0]['src']   
# # seq_lenths = [len(batch[i]['src']) for i in len(batch)]

# seq_lenths = [len(batch[i]['src']) for i in range(len(batch))]
# S = max(seq_lenths)
# B = len(batch)
# print(seq_lenths,S,B)
# src = torch.ones(B,S)  # A place holder for padding
# for idx in range(B):        # 2. paste each line of name_sequences onto seq_tensor
#   src[ idx,: seq_lenths[idx] ] = torch.LongTensor(batch[idx]['src'])
# src
# srf = create_tensor(src)
# srf
# # seq_lenths

a = collate_fn(batch)
for i in range (len(batch)):
    print(a['src'][i])
    print([  vocabulary_id2token[wid.item()] for wid in a['src'][i]])
    print(a['tgt'][i])
    print(a['mask'][i])

classes_id2name[torch.tensor(13).item()]

batch as input:[{'src': [0, 14], 'tgt': [0, 14]}, {'src': [1, 13, 14, 15], 'tgt': [1, 13, 14, 15]}, {'src': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59], 'tgt': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]}]
tensor([ 0, 14,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       dtype=torch.int32)
['<unk>', '``', '<blank>', '<blank>', '<blank>', '<blank>', '<blank>', '<blank>', '<blank>', '<blank>', '<blank>', '<blank>', '<blank>', '<blank>', '<blank>', '<blank>',

'RBR'

In [187]:
# train_src_test = [['Chancellor', 'of', 'the', 'Exchequer', 'Nigel', 'Lawson', "'s"],
#                   ['Confidence', 'in', 'the', 'pound'],
#                   ['But', 'analysts', 'reckon', 'underlying', 'support']
#                   ]

# train_tgt_test = [['NNP', 'IN', 'DT', 'NNP', 'NNP', 'NNP', 'POS'],
#                   ['NN', 'IN', 'DT', 'NN'],
#                   ['CC', 'NNS', 'VBP', 'VBG', 'NN']
#                   ]
#

train_src_test = train_src[:3]

train_tgt_test = train_tgt[:3]
train_dataset_test = ConllDataset(train_src_test, train_tgt_test)
train_data_loader_test = DataLoader(train_dataset_test, collate_fn=collate_fn, batch_size= 2, shuffle=False)

dataiter_test = iter(train_data_loader_test)

dic_r_test = next(dataiter_test)

for i in range(2):
  print(dic_r_test['src'][i])
  print(dic_r_test['tgt'][i])
  print(dic_r_test['mask'][i])
  print(len(dic_r_test['src'][i]))
  print(len(dic_r_test['tgt'][i]))
  print(len(dic_r_test['mask'][i]))
  print(idtowords(dic_r_test['src'][i].tolist()))
  print(id2name(dic_r_test['tgt'][i].tolist()))

dic_r_test['src'].shape

torch.Size([2, 50])

With that, we can use PyTorch's `DataLoader` which will shuffle and batch our data automatically.

In [188]:
"""
create train_data_loader  and test_data_loader...
"""

train_data_loader = DataLoader(train_dataset, collate_fn=collate_fn, batch_size=32, shuffle=True)#,drop_last=True
test_data_loader = DataLoader(test_dataset, collate_fn=collate_fn, batch_size=32, shuffle=True)#,drop_last=True


# test


# dataiter = iter(train_data_loader)

# dic_r= next(dataiter)

# for i in range(2):
#   print(dic_r['src'][i])
#   print(dic_r['tgt'][i])
#   print(dic_r['mask'][i])
#   print(len(dic_r['src'][i]))
#   print(len(dic_r['tgt'][i]))
#   print(len(dic_r['mask'][i]))
#   print(idtowords(dic_r['src'][i].tolist()))
#   print(id2name(dic_r['tgt'][i].tolist()))

"""  test of test_data_loader """

dataiter = iter(test_data_loader)

dic_r= next(dataiter)

for i in range(2):
  print(dic_r['src'][i])
  print(dic_r['tgt'][i])
  print(dic_r['mask'][i])
  print(len(dic_r['src'][i]))
  print(len(dic_r['tgt'][i]))
  print(len(dic_r['mask'][i]))
  print(idtowords(dic_r['src'][i].tolist()))
  print(id2name(dic_r['tgt'][i].tolist()))


torch.Size([32, 50])

### Architecture

Let's build a transformer model with three layers, three attention heads and an embedding dimension of 128. Also, let's not forget to add a classification head to our model.

In [191]:

class ClassificationHead(nn.Module):
    def __init__(self, input_dim, trg_vocab_size, max_seqlen=50):
        super().__init__()
        # The head flattens the whole sequence, so it only accepts inputs
        # that are padded to exactly `max_seqlen` tokens.
        self.hidden_layer = nn.Linear(input_dim * max_seqlen, 2 * input_dim * max_seqlen)
        self.activation_1 = nn.ReLU()
        self.output_layer = nn.Linear(2 * input_dim * max_seqlen, max_seqlen * trg_vocab_size)

    def forward(self, x):
        N, L, _ = x.shape             # (batch, sequence length, hidden dim)
        x = x.reshape(N, -1)          # (N, L * D)
        x = self.hidden_layer(x)
        x = self.activation_1(x)
        x = self.output_layer(x)      # (N, L * C)
        return x.reshape(N, L, -1)    # (N, L, C): one logit vector per token
      

class CoNLL2000Transformer(nn.Module):
    def __init__(self, transformer, classifier):
        super().__init__()
        self.transformer = transformer
        self.classification_layer = classifier

    def forward(self,input):
        out = self.transformer(input)
        out = self.classification_layer(out)
        return out








transformermodel = Transformer(
            embed_n=80001,   # number of token embeddings (vocabulary size)
            hidden_n=256,    # hidden dimension
            n=3,             # number of layers
            h=2,             # number of heads
            device=device,   # keep in sync with the `.to(device)` call below
            dropout=0.1,
            num_targ_vocab=64).to(device)  # output vocabulary size

max_seqlen = 50        # every batch is padded to 50 tokens
trg_vocab_size = 45    # number of classes produced by the classification head
classificationhead = ClassificationHead(input_dim=256, trg_vocab_size=trg_vocab_size, max_seqlen=max_seqlen).to(device)

# Sanity checks on the intermediate shapes
transformermodel(x).shape
classificationhead(transformermodel(x)).shape

model = CoNLL2000Transformer(transformermodel, classificationhead).to(device)

print(x.shape, model(x).shape)




torch.Size([32, 50])
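
Because POS tagging assigns one label to every token, a classification head can also be applied position-wise instead of flattening the whole sequence as `ClassificationHead` does. The following is a minimal sketch of that alternative, not the reference solution of the exercise; the class name `TokenClassificationHead` and its argument names are hypothetical.

```python
import torch
from torch import nn


class TokenClassificationHead(nn.Module):
    """Hypothetical per-token head: maps (N, L, D) features to (N, L, C) logits."""

    def __init__(self, input_dim: int, num_classes: int):
        super().__init__()
        # nn.Linear acts on the last dimension only, so the same weights are
        # shared across all positions and any sequence length is accepted.
        self.hidden_layer = nn.Linear(input_dim, 2 * input_dim)
        self.activation = nn.ReLU()
        self.output_layer = nn.Linear(2 * input_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.output_layer(self.activation(self.hidden_layer(x)))


head = TokenClassificationHead(input_dim=256, num_classes=45)
features = torch.randn(32, 50, 256)   # stand-in for the transformer output
logits = head(features)
print(logits.shape)
```

With `input_dim=256` and `max_seqlen=50`, the first layer of the flattened head alone holds 12800 × 25600 (about 328 million) weights, whereas the first layer of the per-token variant needs 256 × 512 = 131072.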


### Training

Initialize the **AdamW** optimizer from the `torch.optim` module and choose the most appropriate loss function for our task.

In [None]:
from torch.optim import AdamW

Learning_rate = 0.005
optimizer = AdamW(model.parameters(), lr=Learning_rate)
# Cross-entropy is the standard loss for multi-class token classification
criterion = torch.nn.CrossEntropyLoss()
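
`nn.CrossEntropyLoss` expects the class dimension in second position, so logits of shape `(N, L, C)` have to be flattened or permuted before they are passed in, and the targets must be `int64`. The sketch below shows both options on toy data and uses `ignore_index` to exclude padded positions from the loss. `PAD_TAG_ID = 0` is an assumption made for illustration; use whatever id your `collate_fn` pads the tags with.

```python
import torch
from torch import nn

PAD_TAG_ID = 0   # assumption: the id used to pad the tag sequences
num_classes = 5  # toy value

criterion = nn.CrossEntropyLoss(ignore_index=PAD_TAG_ID)

logits = torch.randn(2, 4, num_classes)      # (N, L, C), as a tagging model returns them
targets = torch.tensor([[1, 2, 0, 0],
                        [3, 4, 1, 0]])       # (N, L), int64, 0 marks padding

# Option 1: merge batch and sequence dimensions -> (N*L, C) and (N*L,)
loss_flat = criterion(logits.reshape(-1, num_classes), targets.reshape(-1))

# Option 2: move the class dimension to position 1 -> (N, C, L)
loss_perm = criterion(logits.permute(0, 2, 1), targets)

print(loss_flat.item(), loss_perm.item())
```

Both calls average over the same non-padded tokens and therefore return the same value.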

Build a basic training loop and train the network for three epochs.
- Use everything we've built so far, including `train_data_loader`, `model`, `optimizer` and `criterion`.
- At every 50th step, print the average loss of the last 50 steps.
- We suggest getting a basic training procedure working on the CPU first. Once it runs successfully on the CPU, you can switch to the GPU (click on change runtime and add a hardware accelerator if you use Colab) and run for the whole three epochs. Note: For this to work, you need to transfer the `model` and the input tensors to GPU memory. This works by calling `.to(device)` on the model and tensors, where `device` can be either `cpu` or `cuda` (for the GPU).

In [None]:
"""
try to overfit one single batch first

"""

print(dic_r['src'].shape)

print(dic_r['src'].shape, model(dic_r['src'].to(device)).shape)

print(dic_r['tgt'].shape)
print(dic_r['mask'].shape)

of_input = dic_r['src']
of_target = dic_r['tgt']   # the targets are the tags, not the input tokens
of_mask = dic_r['mask']

""" End of this part"""


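DEVICE = device  # "cpu" or "cuda", chosen at the top of the notebook
EPOCHS = 3
val_step = 50    # report every 50 steps
model = model.to(DEVICE)


for epoch in range(EPOCHS):
    model.train()
    cur_loss = []
    cur_acc = []

    for batch_idx, dic_r in enumerate(train_data_loader):
        # Move the whole batch (not a single example) to the device
        batch = dic_r['src'].to(DEVICE)
        labels = dic_r['tgt'].to(DEVICE).long()  # CrossEntropyLoss expects int64 targets
        mask = dic_r['mask'].to(DEVICE)          # not used yet: the model's forward takes no mask

        # Reset the gradients
        optimizer.zero_grad()
        prediction = model(batch)                # (N, L, C) logits

        # Token-level accuracy; note that padded positions are still counted here
        labels_hat = prediction.argmax(dim=-1)
        cur_acc.append((labels_hat == labels).float().mean().item())

        # CrossEntropyLoss expects (N*L, C) logits and (N*L,) targets
        loss = criterion(prediction.reshape(-1, prediction.shape[-1]), labels.reshape(-1))
        cur_loss.append(loss.item())

        # Backpropagate the error
        loss.backward()
        # Update the weights based on the error
        optimizer.step()

        # Report running averages every `val_step` steps
        if (batch_idx + 1) % val_step == 0:
            print(f"Epoch {epoch + 1}, step {batch_idx + 1}: "
                  f"average loss over last {val_step} steps: {sum(cur_loss[-val_step:]) / val_step:.3f}, "
                  f"average accuracy: {sum(cur_acc[-val_step:]) / val_step:.3f}")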
          


In [None]:
cur_acc = [2,2,3,4,5]
sum(cur_acc[:-3-1:-1])/3

[5, 4, 3]

### Evaluation

Let's see what the accuracy of our model is. Since we already implemented accuracy in the previous exercise, we'll now let you use the torchmetrics package.

In [None]:
a = [torch.FloatTensor([1]).view(1, -1), torch.FloatTensor([2]).view(1, -1)]
torch.stack(a)




NameError: ignored

In [None]:
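from torchmetrics import Accuracy
help(Accuracy)
# POS tagging assigns exactly one class per token, so this is a multiclass task
accuracy = Accuracy(task="multiclass", num_classes=trg_vocab_size, average='micro')

preds_list = []
target_list = []

model.eval()
with torch.no_grad():
    # Iterate over the loader directly instead of calling next() a fixed number of times
    for dic_r_test in train_data_loader_test:
        logits = model(dic_r_test['src'].to(device))      # (N, L, C)
        preds_list.append(logits.argmax(dim=-1).cpu())    # (N, L) predicted class ids
        target_list.append(dic_r_test['tgt'].long())

# Batches can differ in size, so concatenate along the batch dimension
preds = torch.cat(preds_list)
target = torch.cat(target_list)

# Padded positions are still counted in this score
accuracy(preds, target)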


Help on class Accuracy in module torchmetrics.classification.accuracy:

class Accuracy(builtins.object)
 |  Accuracy(task: Literal['binary', 'multiclass', 'multilabel'], threshold: float = 0.5, num_classes: Optional[int] = None, num_labels: Optional[int] = None, average: Optional[Literal['micro', 'macro', 'weighted', 'none']] = 'micro', multidim_average: Literal['global', 'samplewise'] = 'global', top_k: Optional[int] = 1, ignore_index: Optional[int] = None, validate_args: bool = True, **kwargs: Any) -> torchmetrics.metric.Metric
 |  
 |  Computes `Accuracy`_
 |  
 |  .. math::
 |      \text{Accuracy} = \frac{1}{N}\sum_i^N 1(y_i = \hat{y}_i)
 |  
 |  Where :math:`y` is a tensor of target values, and :math:`\hat{y}` is a tensor of predictions.
 |  
 |  This module is a simple wrapper to get the task specific versions of this metric, which is done by setting the
 |  ``task`` argument to either ``'binary'``, ``'multiclass'`` or ``multilabel``. See the documentation of
 |  :mod:`BinaryAccu

Calculate the average accuracy of all examples in the test dataset.

In [None]:
...

Let's also look at the accuracy **for each class separately**:

In [None]:
...
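
As a cross-check for the per-class numbers, the accuracy of each class can also be computed directly in PyTorch with `torch.bincount`. This is a sketch under assumptions: the helper `per_class_accuracy` and the padding id are hypothetical, and torchmetrics' `Accuracy` with `average='none'` is another route to the same numbers.

```python
import torch


def per_class_accuracy(preds, target, num_classes, pad_id=None):
    """Fraction of correctly predicted tokens for each gold class."""
    preds = preds.reshape(-1)
    target = target.reshape(-1)
    if pad_id is not None:
        keep = target != pad_id          # drop padded positions
        preds, target = preds[keep], target[keep]
    # Count correct predictions and total occurrences per gold class
    correct = torch.bincount(target[preds == target], minlength=num_classes)
    total = torch.bincount(target, minlength=num_classes)
    # Classes that never occur yield NaN (0 / 0) rather than a misleading 0
    return correct.float() / total.float()


preds = torch.tensor([[1, 2, 0, 0], [3, 3, 1, 0]])
target = torch.tensor([[1, 2, 0, 0], [3, 1, 1, 0]])
acc = per_class_accuracy(preds, target, num_classes=4, pad_id=0)
print(acc)
```

Here class 1 occurs three times and is predicted correctly twice, while classes 2 and 3 are always correct. Class 0 is the padding id and is excluded, so its entry is NaN.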

## Positional Embeddings

The attention mechanism does not consider the position of the tokens, which hurts its performance on many problems. We can solve this issue in several ways: we can either add a fixed positional encoding (via trigonometric functions) or learn positional embeddings along the way, similar to what BERT does. Here, we will add learnable positional embeddings to our existing model with another embedding layer.
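
For comparison, the trigonometric variant can be sketched as follows. It builds the fixed sine/cosine table from the original Transformer paper, where even feature indices use a sine and odd indices a cosine of the position scaled by a geometric progression of wavelengths. The function name `sinusoidal_encoding` is hypothetical, and this variant is not the one used in the rest of this exercise.

```python
import math
import torch


def sinusoidal_encoding(max_len: int, dim: int) -> torch.Tensor:
    """Return a fixed (max_len, dim) positional table; dim is assumed to be even."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    # 1 / 10000^(2i / dim) for i = 0 .. dim/2 - 1, computed in log space
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )                                                                    # (dim / 2,)
    table = torch.zeros(max_len, dim)
    table[:, 0::2] = torch.sin(position * div_term)   # even feature indices
    table[:, 1::2] = torch.cos(position * div_term)   # odd feature indices
    return table


pe = sinusoidal_encoding(max_len=78, dim=128)
print(pe.shape)
```

The table has no trainable parameters; it would simply be added to the token embeddings of the first `L` positions.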

The longest sequence in our dataset has 78 tokens (you can trust us on that). So, let's set the number of embeddings for our positional embedding layer to that number. Again, you should use `nn.Embedding`.

Copy the inner parts of your `Transformer` class and add positional embeddings to it.
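
The core of the learnable variant is small: a second `nn.Embedding` is indexed with the position ids `0 .. L-1` and added to the token embeddings. The sketch below isolates that step. `TokenAndPositionEmbedding` and its argument names are hypothetical and only illustrate the idea; the position ids are derived from the actual input length, so sequences shorter than the maximum also work.

```python
import torch
from torch import nn


class TokenAndPositionEmbedding(nn.Module):
    """Hypothetical helper: token embeddings plus learned position embeddings."""

    def __init__(self, vocab_size: int, max_len: int, hidden_n: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_n)
        self.pos_emb = nn.Embedding(max_len, hidden_n)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        _, seq_len = token_ids.shape
        # Position ids 0 .. L-1, created on the same device as the input
        positions = torch.arange(seq_len, device=token_ids.device)   # (L,)
        # (N, L, D) + (L, D) broadcasts over the batch dimension
        return self.token_emb(token_ids) + self.pos_emb(positions)


emb = TokenAndPositionEmbedding(vocab_size=100, max_len=78, hidden_n=128)
ids = torch.randint(0, 100, (2, 50))   # sequences shorter than max_len are fine
out = emb(ids)
print(out.shape)
```

Because the position table is indexed rather than computed, sequences longer than `max_len` would raise an index error, which is why the layer is sized to the longest sequence in the dataset.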

In [None]:

class TransformerPos(nn.Module):
    """
    embed_n: number of token embeddings
    pos_emb_n: number of position embeddings (maximum sequence length)
    hidden_n: hidden dimension
    n: number of layers
    h: number of heads per layer
    """
    def __init__(self,
                 embed_n,
                 pos_emb_n,
                 hidden_n,
                 n,
                 h,
                 device,
                 dropout,
                 num_targ_vocab
                 ):
        super().__init__()
        self.pos_emb_n = pos_emb_n
        self.device = device
        self.embed1 = torch.nn.Embedding(embed_n, hidden_n)      # token embeddings
        self.embed2 = torch.nn.Embedding(pos_emb_n, hidden_n)    # position embeddings; sequence length must not exceed pos_emb_n
        self.proj = torch.nn.Linear(hidden_n, num_targ_vocab)

        self.layers = torch.nn.ModuleList([TransformerBlock(hidden_n, h) for _ in range(n)])

    def forward(self, input, mask=None):    # input: (N, L) token ids
        N, sequence_length = input.shape
        # Position ids 0..L-1, built from the actual sequence length so that
        # inputs shorter than pos_emb_n also work
        positions = torch.arange(sequence_length, device=input.device).expand(N, sequence_length)
        out = self.embed1(input) + self.embed2(positions)    # (N, L, D)

        for layer in self.layers:
            out = layer(out, out, out, mask)
        # out = self.proj(out).reshape(N,sequence_length,-1)
        return out

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

x = torch.tensor([[1, 5, 6, 4, 3, 9, 5, 2, 0], [1, 8, 7, 3, 4, 5, 6, 7, 2]]).to(
    device
)

# Two sequences of length 78 with small token ids (all below embed_n)
x1 = ((1/3) *torch.linspace(1,2*78,2*78)).reshape(2,-1).to(torch.int32).to(
    device
)

modelPos = TransformerPos(
            embed_n=80,      # number of token embeddings (vocabulary size)
            hidden_n=256,    # hidden dimension
            pos_emb_n=78,    # number of position embeddings
            n=3,             # number of layers
            h=2,             # number of heads
            device=device,
            dropout=0.1,
            num_targ_vocab=64).to(device)  # output vocabulary size

out = modelPos(x1)
print(out.shape)

cpu
torch.Size([2, 78, 256])


In [None]:
model_pos = CoNLL2000Transformer(TransformerPos(...), ...)

### Training

Same procedure as before. Let's reinitialize our optimizer and our loss function and run the same training loop with our new model `model_pos`.

In [None]:
optimizer = ...
criterion = ...

In [None]:
...

### Evaluation

Now, let's check whether the accuracy has improved.

In [None]:
...

Again, let's also check each class. Which classes improved the most by adding positional embeddings?

In [None]:
...

The last question in this assignment doesn't require you to code anything. Instead, you're asked to point out possible issues with our current approach and name potential improvements. 
* ...
* ...
* ...
