<a href="https://colab.research.google.com/github/mbilgrami/Multimodal-Toolkit/blob/master/Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Transformers**

Links:
* https://www.youtube.com/watch?v=ISNdQcPhsts&ab_channel=UmarJamil
* https://www.jeremyjordan.me/transformer-architecture/


In [1]:
!pip install torch



In [2]:
import torch
import math
import numpy as np
import torch.nn as nn

import os

# **Transformer Architecture**

![](https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1-727x1024.png)


In [3]:
# Images hosted in a public GitHub repository can be embedded in a markdown cell with this syntax:
"""
![](https://raw.githubusercontent.com/ivonnics/Machine-Learning/master/Tabla%20Confussion%20Matrix.png)
"""

'\n![](https://raw.githubusercontent.com/ivonnics/Machine-Learning/master/Tabla%20Confussion%20Matrix.png)\n'

# **Embeddings**

### Embeddings explained

The embedding layer holds a matrix of shape (vocabulary_size, d_model). It works as a lookup table indexed by token id, which is why the number of rows equals vocabulary_size.

d_model is the number of dimensions used to represent each word. So d_model = 3 means that each word is represented by a vector of 3 numbers.

The embedding matrix is initialised randomly, and its weights are updated during training. Looking up a row gives the same result as passing a one-hot vector through an nn.Linear that has no bias.

&nbsp;
#### Example
Consider an embedding matrix of shape (10, 3). This means there are 10 rows (the vocabulary has 10 entries) and 3 dimensions for each word, so we get 10 row vectors of size 1x3. During training, let's say we feed in a batch of two three-word sentences, with each word represented by its index in the vocabulary: [1,2,3] and [1,1,1]. The input tensor is then [[1,2,3], [1,1,1]], with shape 2x3.

The output tensor will be a 2x3x3 tensor: for each of the two sentences, there is one 1x3 vector for each of its 3 words.

Remember how the matrix was a lookup? The sentence [1,2,3] picks out the rows at index 1, index 2 and index 3 of the embedding matrix, i.e. the second, third and fourth rows (indexing starts at 0). Stacking these three 1x3 rows for each of the two sentences gives the 2x3x3 tensor.

d_model does not limit how many words can relate to one another. It is the number of features stored for each word; relationships between words are modelled later by the attention layers.



In [4]:
#Implementing above example
test_embedding = nn.Embedding(10,3)
print(test_embedding.weight)
print(test_embedding.weight.shape)

test_input = torch.tensor([[1,2,3],[1,1,1]])
print(test_embedding(test_input))
print(test_embedding(test_input).shape)


Parameter containing:
tensor([[ 0.4508,  0.6154,  0.2267],
        [-0.5689, -0.3536, -0.5198],
        [ 0.2794, -0.3265,  0.3787],
        [-0.9912, -0.2752,  2.4077],
        [ 1.1389,  0.7706, -1.3733],
        [ 0.9573,  1.8728,  0.2682],
        [-0.9222,  0.2009,  1.5517],
        [-0.6816,  1.1102, -1.9568],
        [-0.7798,  0.0606,  0.6875],
        [ 0.5475, -1.3501, -1.2198]], requires_grad=True)
torch.Size([10, 3])
tensor([[[-0.5689, -0.3536, -0.5198],
         [ 0.2794, -0.3265,  0.3787],
         [-0.9912, -0.2752,  2.4077]],

        [[-0.5689, -0.3536, -0.5198],
         [-0.5689, -0.3536, -0.5198],
         [-0.5689, -0.3536, -0.5198]]], grad_fn=<EmbeddingBackward0>)
torch.Size([2, 3, 3])
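As a quick check of the lookup idea, the sketch below compares three ways of getting the same result: calling the embedding, indexing its weight matrix directly, and multiplying one-hot vectors by the weight matrix. The variable names and the fixed seed are illustrative choices, not part of the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
emb = nn.Embedding(10, 3)                      # (vocab_size, d_model)
tokens = torch.tensor([[1, 2, 3], [1, 1, 1]])  # two sentences of three token ids

looked_up = emb(tokens)                        # the usual call
indexed = emb.weight[tokens]                   # plain row indexing into the weight matrix
one_hot = F.one_hot(tokens, num_classes=10).float()
via_matmul = one_hot @ emb.weight              # one-hot rows select the same weights

print(looked_up.shape)
print(torch.equal(looked_up, indexed))
print(torch.allclose(looked_up, via_matmul))
```

All three produce a (2, 3, 3) tensor holding the same rows of the weight matrix.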


In [5]:
class InputEmbeddings(nn.Module):
  """ This module creates an embedding matrix of dimensions (vocab_size, d_model) """

  #IMP - swapping d_model and vocab_size input the other way round compared to video
  def __init__(self,  vocab_size: int, d_model: int):
    #super().__init__() runs the initialiser of the parent class (nn.Module)
    super().__init__()
    #d_model is the dimension of the model
    self.d_model = d_model
    self.vocab_size = vocab_size
    #this step creates an embedding matrix of dimensions (vocab_size, d_model)
    self.embedding = nn.Embedding(vocab_size, d_model)
    #you can check out the embedding matrix by running ""nn.Embedding(4,3).weight""


  def forward(self, x):
    """ multiplying embedding by square root of the model dimension, per paper"""
    #x is the input tokens
    return self.embedding(x) * math.sqrt(self.d_model)


# **Positional Encoding**

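The purpose of positional encoding is to add position information to the input embeddings, so that the model can tell the order of the words in a sentence (attention on its own does not depend on word order). This is done by adding values computed with the sin and cos functions.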

![](https://raw.githubusercontent.com/mbilgrami/images/main/transformer_images/PositionalEncoding.png)
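For reference, the sinusoidal encoding from the paper can be written as follows, where $pos$ is the position of the word in the sentence and $i$ indexes pairs of embedding dimensions. The second line is the identity that lets the frequency term be computed with an exponential:

$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

$$
\frac{1}{10000^{2i/d_{model}}} = \exp\left(2i \cdot \frac{-\ln(10000)}{d_{model}}\right)
$$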

In [6]:
class PositionalEncoding(nn.Module):
  """ This class creates a positional encoding matrix of dimensions (seq_len, d_model) """
  #IMP - this only works if d_model is even

  #having to change shape around to make consistent with previous part
  # d_model is the dimension of the model, same as in the embeddings.
  #dropout = Dropout layer is added to prevent the model from overfitting
  #seq_len = maximum length of the sentence
  def __init__(self, seq_len: int, d_model: int, dropout: float) -> None:
    super().__init__()
    self.d_model = d_model
    self.seq_len = seq_len
    self.dropout = nn.Dropout(dropout)

    #building matrix of shape (seq_len, d_model).
    #each row has to be of d_model size (to match a word embedding), and we need seq_len rows, one for each position in the longest sentence.
    #created a tensor of zeros with seq_len rows, each of the form [0,0,...,0] with d_model entries.
    pe = torch.zeros(seq_len, d_model)

    #creating a vector of shape (seq_len, 1)
    #this looks like [[0],[1],[2]...[seq_len - 1]].
    position = torch.arange(0, seq_len, dtype = torch.float).unsqueeze(1)

    #div_term is a vector of shape (d_model / 2,)
    #torch.arange(0, d_model, 2) steps from 0 up to (but not including) d_model in increments of 2,
    #which gives [0,2,4,... d_model - 2]
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000) / d_model))

    #note that (position * div_term) broadcasts to a tensor of shape (seq_len, d_model / 2).
    #With odd d_model, div_term has one more element than pe[:, 1::2] has columns, so the cos assignment further down fails.
    #apply sin to even position
    pe[:, 0::2] = torch.sin(position * div_term) #this replaces every 2nd column of the pe with this sin values starting from 0th column

    #apply cos to odd position
    pe[:, 1::2] = torch.cos(position * div_term) #this replaces every 2nd column of the pe with this cos values starting from 1st column

    #accounting for the batch dimensions, as we will get a batch of sentences as input, not just one.
    pe = pe.unsqueeze(0) #tensor of shape (1, seq_len, d_model)
    #want this tensor to be saved when we save the file of the model
    self.register_buffer("pe", pe)

  def forward(self, x):
    """ adding positional encoding to every word inside a batch of sentences.
    The batch comes in the shape (batch_size, num_words, d_model)
     """

    #remember that pe has shape (1, seq_len, d_model). We add it to the embeddings so each position gets a distinct signal.
    #pe is ordered as [1 = batch_size, seq_len, d_model] and broadcasts across the batch.
    #As seq_len is the max words in a sentence, x.shape[1] (the number of words in the sentence) must not exceed it.
    #the slice self.pe[:, :x.shape[1], :] keeps only the encodings for the positions actually present in x.
    x = x + (self.pe[:, :x.shape[1], :]).requires_grad_(False) #pe is a registered buffer, not a parameter, so training never updates it; requires_grad_(False) just makes that explicit
    return self.dropout(x) #during training, dropout sets some of the numbers to 0 at random and rescales the rest - helps reduce overfitting
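To make the sin/cos pattern concrete, here is a minimal pure-Python sketch of the same table, without the batch dimension or dropout. The function name `sinusoidal_table` is made up for this illustration, and it assumes an even `d_model`.

```python
import math

def sinusoidal_table(seq_len: int, d_model: int) -> list:
    """Return a seq_len x d_model table of sinusoidal positional encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    table = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # columns i and i + 1 share one frequency: sin goes in i, cos in i + 1
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)
            table[pos][i + 1] = math.cos(angle)
    return table

table = sinusoidal_table(seq_len=4, d_model=4)
for row in table:
    print([round(v, 4) for v in row])
```

Position 0 is always [0, 1, 0, 1, ...] because sin(0) = 0 and cos(0) = 1; later positions move along each sinusoid at a frequency that drops as the column index grows.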

# **Layer Normalisation**

Let's say we have a batch of 3 sentences, and each sentence is made up of its own features (words, etc).

Layer normalisation means that for each item, we calculate the mean and the variance independently of the other items in the batch. In the implementation below the statistics are taken over the last dimension (d_model), so every word vector is normalised on its own.

$$
\hat X_j = \frac{X_j - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}
$$

where
* $\hat X_j$ = normalised value of item
* $\mu_j$ = mean of item
* $\sigma_j^2$ = variance of item
* $\epsilon$ : small constant that keeps the denominator away from zero, so $\hat X_j$ doesn't get very big when $\sigma_j^2$ is close to 0. Very large values are undesirable because they can exceed the range of numbers the CPU/GPU can represent.

&nbsp;

Each item will get its own $\hat X_j$

We also add two parameters, usually called gamma (multiplicative) and beta (additive), that introduce some fluctuations in the data, because forcing all values to have mean 0 and variance 1 may be too restrictive for the network. The network will learn to tune these two parameters and introduce fluctuations when necessary.

In [6]:
class LayerNormalisation(nn.Module):
  """ This class creates a layer normalisation layer. Helps keep network in check"""

  def __init__(self, eps: float = 10**-6) -> None:
    super().__init__()
    self.eps = eps
    self.alpha = nn.Parameter(torch.ones(1)) #creates a tensor [1] - multiplied
    self.bias = nn.Parameter(torch.zeros(1)) #creates a tensor [0] - added

  def forward(self, x):
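    mean = x.mean(dim = -1, keepdim = True)
    std = x.std(dim = -1, keepdim = True)
    #note: this divides by (std + eps) rather than sqrt(variance + eps) as in the formula above, and x.std uses the unbiased estimate by default. The result is close but not identical.
    return self.alpha * (x - mean) / (std + self.eps) + self.bias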
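As a cross-check of the formula, the sketch below normalises the last dimension by hand using the biased variance and sqrt(var + eps), then compares the result with PyTorch's built-in `torch.nn.functional.layer_norm`. It follows the formula as written rather than the class above, which divides by (std + eps) using the default unbiased estimate of `x.std`, so the two will differ slightly. The tensor sizes and seed are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 3, 8)   # (batch_size, seq_len, d_model)
eps = 1e-6

# statistics over the last dimension, one mean and variance per word vector
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
x_hat = (x - mean) / torch.sqrt(var + eps)

# built-in reference without learnable gamma/beta
reference = F.layer_norm(x, normalized_shape=(8,), eps=eps)

print(torch.allclose(x_hat, reference, atol=1e-5))
print(x_hat.mean(dim=-1))  # approximately 0 for every word vector
```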


# **Feedforward**

Fully connected layer used in the encoder and the decoder. It's covered in the paper in the following section:

&nbsp;
""
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully
connected feed-forward network, which is applied to each position separately and identically. This
consists of two linear transformations with a ReLU activation in between.

> FFN(x) = max(0, *x*$W_1$ + $b_1$)$W_2$ + $b_2$

While the linear transformations are the same across different positions, they use different parameters
from layer to layer. Another way of describing this is as two convolutions with kernel size 1.
The dimensionality of input and output is $d_{model}$ = 512, and the inner-layer has dimensionality
$d_{ff}$ = 2048
""
&nbsp;
&nbsp;

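These are basically two weight matrices, $W_1$ and $W_2$, with biases $b_1$ and $b_2$. The input x (the output of the preceding sub-layer, one vector per position) is multiplied by $W_1$ and shifted by $b_1$, passed through a ReLU, then multiplied by $W_2$ and shifted by $b_2$.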

In [7]:
class FeedForwardBlock(nn.Module):

  def __init__(self, d_model: int, d_ff: int, dropout:float) -> None:
    super().__init__()
    self.linear_1 = nn.Linear(d_model, d_ff) #creating W1 and b1 (Linear automatically defaults to bias = True)
    self.dropout = nn.Dropout(dropout) #creating dropout layer
    self.linear_2 = nn.Linear(d_ff, d_model) #creating W2 and b2 (Linear automatically defaults to bias = True)


  def forward(self, x):

    """takes input x which is a batch of sentences with Tensor (batch_size, seq_len, d_model)
    - linear1 will convert input batch to (batch_size, seq_len, d_ff)
    - linear2 will convert back to (batch_size, seq_len, d_model) ... note that the order of d_ff and d_model is reversed between linear1 and linear2 so that the output has the same shape as the input
    """

    hidden = self.linear_1(x) #multiplying input by linear 1 to give (batch_size, seq_len, d_ff)
    hidden = torch.relu(hidden) #applying relu function
    hidden = self.dropout(hidden) #applying dropout
    output = self.linear_2(hidden) #multiplying hidden by linear 2 to give (batch_size, seq_len, d_model)

    return output
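The phrase "applied to each position separately and identically" from the paper can be checked with a small sketch. It builds the same two-layer network with `nn.Sequential` (dropout left out so the output is deterministic) and uses small illustrative sizes instead of the paper's 512 and 2048.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32  # small illustrative sizes

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),   # x W1 + b1
    nn.ReLU(),                  # max(0, .)
    nn.Linear(d_ff, d_model),   # (.) W2 + b2
)

x = torch.randn(2, 5, d_model)  # (batch_size, seq_len, d_model)
out = ffn(x)
print(out.shape)                # same shape as the input

# feeding one position on its own gives the same vector as in the full batch
single = ffn(x[:, 2, :])
print(torch.allclose(out[:, 2, :], single, atol=1e-6))
```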





# **Attention**

Attention takes the input of the encoder and uses it three times, as Query, Key and Value... kind of like a duplication of the input three times.

We take an input of (sequence length, d_model) and use it as three matrices: Q, K, V. In the encoder's self-attention (left box in initial architecture) all three are the same input matrix, so they have the same dimensions as the input. Each is then multiplied by its own learned weight matrix ($W^Q$, $W^K$, $W^V$) to give Q', K', V'.

We split these Q', K', V' matrices into h smaller matrices, and we do this split across the d_model dimension... meaning each head has access to the full sentence, but a different part of the embedding of each word. 'h' is the number of heads in the multi-headed attention.

We apply attention to each of the h smaller matrices using the formula

$$
Attention(Q,K,V) = softmax(\frac{Q{K^T}}{\sqrt{d_k}})V
$$

where each head works on its own projection of the inputs:


> $head_i = Attention(Q{W_i^Q}, K{W_i^K}, V{W_i^V})$

&nbsp;

Then we combine the h head outputs back into one larger matrix:

> $Multihead(Q,K,V) = Concat(head_1,...,head_h){W^O}$


where:
* H = $Concat(head_1,...,head_h)$ is the concatenation of the h heads, with dimensions (seq_len, h * $d_v$)
* ${W^O}$ has dimensions (h * $d_v$, $d_{model}$)
* Multihead(Q,K,V) has dimensions (seq_len, $d_{model}$), which is the product of H and ${W^O}$
* $d_k$ denotes $\frac{d_{model}}{h}$ where h is the number of heads $d_{model}$ will be divided into.
* Note that although $d_v$ refers to the value matrix V, here it is exactly the same size as the per-head dimension of the key matrix K and the query matrix Q (i.e. $d_k$).

The representation is for one sentence, but we should also remember that there is an additional dimension for the batch size.


![](https://raw.githubusercontent.com/mbilgrami/images/main/transformer_images/AttentionLayout%20-%20Umar%20Jamil.png)
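The attention formula can be traced step by step for a single head and a single sentence. This is a minimal sketch with random inputs; it also includes the optional mask that the implementation below applies before the softmax. The lower-triangular mask used here is only an example (1 = may attend, 0 = hidden).

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_k = 4, 8
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_k)

# Q K^T / sqrt(d_k): one score per pair of words
scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_k)   # (seq_len, seq_len)

# example mask: each word may only look at itself and earlier words
mask = torch.tril(torch.ones(seq_len, seq_len))
masked = scores.masked_fill(mask == 0, -1e9)          # returns a new tensor

weights = masked.softmax(dim=-1)                      # each row sums to 1
out = weights @ v                                     # (seq_len, d_k)

print(weights)
print(out.shape)
```

Each row of `weights` sums to 1 and the hidden positions end up with weight 0. `masked_fill` is the out-of-place variant and leaves `scores` untouched; `masked_fill_` (with a trailing underscore) modifies the tensor in place.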




In [13]:
class MultiHeadAttentionBlock(nn.Module):

  def __init__(self, d_model: int, h:int, dropout:float) -> None:
    super().__init__()
    self.d_model = d_model
    self.h = h

    #we need to divide d_model into equal heads, so d_model needs to be divisible by h
    assert d_model % h == 0, "d_model is not divisible by h"

    #getting value d_k
    self.d_k = d_model // h #note that // is integer (floor) division; the assert above guarantees there is no remainder

    #creating w matrix to apply a linear transformation, so that Q/K/V {dim (seq_len, d_model)} x W {dim (d_model, d_model)} = Q'/K'/V' {dim (seq_len, d_model)}
    self.w_q = nn.Linear(d_model, d_model) #creating wq
    self.w_k = nn.Linear(d_model, d_model) #creating wk
    self.w_v = nn.Linear(d_model, d_model) #creating wv

    #creating w_o matrix, which is multiplied to the concats of the h matrices.
    #Note that d_v * h = d_model, so we've input the dimensions as (d_model, d_model)
    self.w_o = nn.Linear(d_model, d_model) #creating wo

    self.dropout = nn.Dropout(dropout) #creating dropout layer


  #A note on mask: Q.K_transposed / sqrt(d_k) is a (seq_len, seq_len) matrix of scores, one for each pair of words.
  #Remember that seq_len is the maximum length of the sentence
  #if we don't want certain words in the sentence to interact with other words, we mask them by replacing their scores with a very large negative value.
  #The softmax then makes the attention between those two words 0.

  @staticmethod #creates a function without needing an instance of the class
  def attention(query, key, value, mask, dropout: nn.Dropout):
    d_k = query.shape[-1] #d_k is the last dimension of the query matrix Qi (split into different heads) implemented in the forward function below

    #calculating attention scores. See Attention (Q,K,V) formula in image above.
    #the @ sign represents matrix multiplication in pytorch.
    #the transpose transposes the key matrix's last 2 dimensions to change from (seq_len, d_k) to (d_k, seq_len)
    #multiplication changes (batch, h, seq_len, d_k) --> (batch, h, seq_len, seq_len)
    attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k) #will output (batch, h, seq_len, seq_len)

    #we want to apply mask before softmax to hide some words
    if mask is not None:
      attention_scores.masked_fill_(mask == 0, -1e9) #in-place version (trailing underscore): replace all values where mask = 0 with a very large negative number. This will become 0 post softmax
    attention_scores = attention_scores.softmax(dim = -1) #(batch, h, seq_len, seq_len)

    #applying dropout
    if dropout is not None:
      attention_scores = dropout(attention_scores)

    #returning multiplication of above and value matrix (per formula) as a tuple for the model
    #also returning the attention weights (after softmax and dropout) for visualising the score the model gives to each interaction
    return (attention_scores @ value), attention_scores



  def forward(self, q, k, v, mask):
    query = self.w_q(q) #this gives Q' from the image. Goes from (batch, seq_len, d_model) ---> (batch, seq_len, d_model), as (batch, seq_len, d_model) x (d_model, d_model) = (batch, seq_len, d_model)
    key = self.w_k(k) #this gives K' from the image. Goes from (batch, seq_len, d_model) ---> (batch, seq_len, d_model)
    value = self.w_v(v) #this gives V' from the image. Goes from (batch, seq_len, d_model) ---> (batch, seq_len, d_model)

    #we want to divide the Query, Key and Value matrices in smaller matrices using h heads
    #the view method of pytorch reshapes a tensor into different dimensions without copying its data.
    #we want to split the Q,K,V matrices into the following dimensions: (batch {query.shape[0]}, seq_len {query.shape[1]}, h {self.h}, d_k {self.d_k}), where d_k = d_model / h
    #we transpose the matrix to get it in the order (batch, h, seq_len, d_k).
    #We transpose because we want each head to watch the seq_len and d_k. This means it will see the full sentence, but a smaller part of the embedding

    query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1,2)
    key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1,2)
    value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1,2)

    x, self.attention_scores = MultiHeadAttentionBlock.attention(query, key, value, mask, self.dropout)

    #CONCATENATING INDIVIDUAL HEADS TOGETHER
    #Takes in Attention(Q,K,V) of dim(batch, h, seq_len, d_k)
    #(batch, h, seq_len, d_k) --> batch(batch, seq_len, h, d_k) --> (Batch, seq_len, d_model)
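    #the transpose applies the first transformation. The view applies the second one; contiguous() is needed in between because view requires contiguous memory, which transpose does not preserve.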
    x = x.transpose(1,2).contiguous().view(x.shape[0], -1, self.h * self.d_k) #where self.h * self.d_k = d_model

    #(batch, seq_len, d_model) --> (batch, seq_len, d_model)
    return self.w_o(x)
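To see where d_model actually gets split, the sketch below runs only the `view`/`transpose` steps on a random tensor and checks which features each head receives, then merges the heads back. The sizes are arbitrary small values chosen for illustration.

```python
import torch

torch.manual_seed(0)
batch, seq_len, d_model, h = 2, 5, 8, 2
d_k = d_model // h

x = torch.randn(batch, seq_len, d_model)

# split: (batch, seq_len, d_model) -> (batch, seq_len, h, d_k) -> (batch, h, seq_len, d_k)
heads = x.view(batch, seq_len, h, d_k).transpose(1, 2)
print(heads.shape)

# head 0 sees the first d_k features of every word, head 1 the next d_k
print(torch.equal(heads[:, 0], x[:, :, :d_k]))
print(torch.equal(heads[:, 1], x[:, :, d_k:]))

# merge: transpose back, make memory contiguous, then flatten the last two dims
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, h * d_k)
print(torch.equal(merged, x))
```

So the split happens along the last (feature) dimension: every head keeps all seq_len words but only a d_k-wide slice of each word's vector, and the merge is an exact inverse of the split.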


# Residual Connection

This is shown as **Add & Norm** on the diagram. We add the input of a sub-layer (for the first one, the embeddings after PE) to the output of that sub-layer, e.g. the multi-head attention block.

In [11]:
class ResidualConnection(nn.Module):
  def __init__(self, dropout: float) -> None:
    super().__init__()
    self.dropout = nn.Dropout(dropout)
    self.norm = LayerNormalisation()    #previously defined

  def forward(self, x, sublayer):
    """ x: input to the sub-layer (for the first one, the embeddings after PE)
        sublayer: callable applied to the normalised input, e.g. the multi-head attention or feed-forward block"""
    y = self.norm(x) #applying normalisation to the input x. Still has (batch_size, seq_len, d_model)
    y = sublayer(y) #applying the sublayer to the normalised input x. Will also end up with (batch_size, seq_len, d_model)... see MultiHeadAttention

    return x + self.dropout(y) #applying dropout to the sublayer output, and adding it to the input: (batch_size, seq_len, d_model)
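The forward pass above normalises before the sublayer (often called "pre-norm"), whereas the paper describes LayerNorm(x + Sublayer(x)). The sketch below puts the two orderings side by side. It is only an illustration: `nn.LayerNorm` and `nn.Linear` are stand-ins for the custom normalisation layer and for the attention or feed-forward block, dropout is left out, and the function names are made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8
norm = nn.LayerNorm(d_model)              # stand-in for the custom normalisation layer
sublayer = nn.Linear(d_model, d_model)    # stand-in for attention / feed-forward

def pre_norm(x):
    # normalise first, apply the sublayer, then add the input back
    return x + sublayer(norm(x))

def post_norm(x):
    # ordering described in the paper: LayerNorm(x + Sublayer(x))
    return norm(x + sublayer(x))

x = torch.randn(2, 5, d_model)
print(pre_norm(x).shape, post_norm(x).shape)

# with a sublayer that outputs zeros, the pre-norm version passes x through unchanged
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)
print(torch.equal(pre_norm(x), x))
```

Both orderings keep the (batch_size, seq_len, d_model) shape; in the pre-norm form the input always has an unnormalised path straight to the output.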


# Encoder Block

We put all of the above parts together in the encoder block. Refer to the encoder block in the chart above.

In [16]:
class EncoderBlock(nn.Module):
  def __init__(self, self_attention_block: MultiHeadAttentionBlock, feed_forward_block: FeedForwardBlock, dropout: float) -> None:
    super().__init__()
    self.self_attention_block = self_attention_block
    self.feed_forward_block = feed_forward_block

    #adding a couple of residual connectors in, per encoder architecture.
    #the nn.ModuleList is a way to organise modules in pytorch. We will be using 2 residual connection modules
    # the '_' in 'for _ in range(2)' is used instead of i when we don't really care about the value of i
    self.residual_connection = nn.ModuleList([ResidualConnection(dropout) for _ in range(2)])

  def forward(self, x, src_mask):
    """ x: input (after PE)
        src_mask: mask for the input of the encoder to hide the interaction of the padding words with other words """
    #residual_connection[0] calls the first residual connection, while residual_connection[1] calls the second one in the block
    #remember that the ResidualConnection layer takes in the input x and the sublayer.
    #the sublayer is the multiheadattention block (see function), with the query key and value matrices as copies of the input matrix x, along with a mask.
    x = self.residual_connection[0](x, lambda x: self.self_attention_block(x, x, x, src_mask))
    x = self.residual_connection[1](x, self.feed_forward_block) #this layer takes in the output of the add&norm, and adds it to the output of the feedforward
    return x

#The paper stacks N identical encoder blocks, with N = 6. Stacking encoder blocks together captures more depth.

class Encoder(nn.Module):
  def __init__(self, layers: nn.ModuleList) -> None: #many layers, applied one after another iteratively. So the output of one layer goes into another
    super().__init__()
    self.layers = layers
    self.norm = LayerNormalisation()

  def forward(self, x, mask):
    for layer in self.layers:
      x = layer(x, mask)
    return self.norm(x)
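The `src_mask` passed through the encoder has to broadcast against attention scores of shape (batch, h, seq_len, seq_len). One common way to build it from token ids is sketched below; the padding id `PAD_ID = 0` and the example tokens are assumptions made for this illustration, not values from the notebook.

```python
import torch

PAD_ID = 0  # hypothetical padding token id
tokens = torch.tensor([[5, 7, 2, PAD_ID, PAD_ID],
                       [3, 9, 4, 6, 1]])            # (batch, seq_len)

# 1 where there is a real token, 0 where there is padding
src_mask = (tokens != PAD_ID).unsqueeze(1).unsqueeze(2).int()   # (batch, 1, 1, seq_len)
print(src_mask.shape)

# the two singleton dims broadcast over heads and over query positions
h, seq_len = 2, tokens.shape[1]
scores = torch.zeros(tokens.shape[0], h, seq_len, seq_len)      # dummy scores
weights = scores.masked_fill(src_mask == 0, -1e9).softmax(dim=-1)
print(weights[0, 0, 0])   # attention is spread over the 3 real tokens only
```

With all-zero dummy scores, the first sentence gives each of its three real tokens a weight of 1/3 and the two padding positions a weight of 0.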


# Scratchpad

## Positional Encoding Implementation

In [None]:
#creating a batch of 2 sentences - tensor dims are (sentences in batch, words in sentences, dim_model)
x = torch.tensor([[[1,2,3,4],[4,5,6,7]],
 [[3,2,3,4],[4,5,6,7]]])
print(x)
print(x.shape)
print(x.shape[1])

tensor([[[1, 2, 3, 4],
         [4, 5, 6, 7]],

        [[3, 2, 3, 4],
         [4, 5, 6, 7]]])
torch.Size([2, 2, 4])
2


In [None]:
test = PositionalEncoding(4,4,0.1).forward(x)
test

tensor([[[1.1111, 3.3333, 3.3333, 5.5556],
         [5.3794, 6.1559, 6.6778, 8.8888]],

        [[3.3333, 3.3333, 3.3333, 5.5556],
         [5.3794, 6.1559, 6.6778, 8.8888]]])



In [None]:
seq_len = 4
d_model = 4

pe = torch.zeros(seq_len, d_model)
print(pe)

position = torch.arange(0, seq_len, dtype = torch.float).unsqueeze(1)
print(position)

div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000) / d_model))
print("div_term is:")
print(div_term)

print("sin of position * div_term is:")
print(torch.sin(position * div_term))

pe[:, 0::2] = torch.sin(position * div_term) #0::2 means start from 0 and go forward by 2 (every alternate)

print("updated pe with sin is")
print(pe)

print("cos of position * div_term is:")
print(torch.cos(position * div_term))

print("base pe is:")
print(pe)
print("extracted value of pe to be replaced is:")
print(pe[:,1::2])

print("cos of pe is")
pe[:, 1::2] = torch.cos(position * div_term) #1::2 means start from 1 and go forward by 2 (every alternate)
print(pe[:, 1::2])

print("new pe is:")
print(pe)

print("unsqueezed pe is:")
pe = pe.unsqueeze(0) #tensor of shape (1, seq_len, d_model)
print(pe)

print("x input is")
print(x)

print("pe slice is")
print(pe[:, :x.shape[1], :]) #x.shape[1] is the number of words in the sentence

print("final sentence is")
print(x + pe[:, :x.shape[1], :])

tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])
tensor([[0.],
        [1.],
        [2.],
        [3.]])
div_term is:
tensor([1.0000, 0.0100])
sin of position * div_term is:
tensor([[0.0000, 0.0000],
        [0.8415, 0.0100],
        [0.9093, 0.0200],
        [0.1411, 0.0300]])
updated pe with sin is
tensor([[0.0000, 0.0000, 0.0000, 0.0000],
        [0.8415, 0.0000, 0.0100, 0.0000],
        [0.9093, 0.0000, 0.0200, 0.0000],
        [0.1411, 0.0000, 0.0300, 0.0000]])
cos of position * div_term is:
tensor([[ 1.0000,  1.0000],
        [ 0.5403,  0.9999],
        [-0.4161,  0.9998],
        [-0.9900,  0.9996]])
base pe is:
tensor([[0.0000, 0.0000, 0.0000, 0.0000],
        [0.8415, 0.0000, 0.0100, 0.0000],
        [0.9093, 0.0000, 0.0200, 0.0000],
        [0.1411, 0.0000, 0.0300, 0.0000]])
extracted value of pe to be replaced is:
tensor([[0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.]])
cos of pe is
tensor([[ 1.0000,  1

In [9]:
#where does the d_model get split?

4