#### Self attention

In [1]:
import numpy as np

np.random.seed(42)

Let's implement a simple self-attention for a tiny input.
- Define Queries $(Q)$, Keys $(K)$, and Values $(V)$ by projecting the input vectors.
- Compute the attention scores: $Q K^T$.
- Scale the scores and apply softmax to get the attention weights.
- Compute the weighted sum of $V$ using the attention weights.

In [2]:
x = np.random.rand(2, 5, 8)  # (batch, seq_len, d_model)
d_model = x.shape[2]

wq = np.random.rand(d_model, d_model)
wk = np.random.rand(d_model, d_model)
wv = np.random.rand(d_model, d_model)

Q = x @ wq
K = x @ wk
V = x @ wv

print(f"Q:{Q.shape}, K:{K.shape}, V:{V.shape}\n")
print(Q)

Q:(2, 5, 8), K:(2, 5, 8), V:(2, 5, 8)

[[[2.81348661 2.04906146 0.98956203 2.24095702 2.00129809 1.50077431
   2.16213921 1.67483965]
  [3.079837   2.04442056 1.41953078 2.24569558 1.86390987 1.03422676
   2.45132443 1.64541432]
  [2.09183202 1.72092826 0.98099999 1.62972733 1.48519581 1.05538966
   1.96457732 1.37443833]
  [2.34229407 1.81915244 1.16653628 1.82111387 1.58675011 1.13408538
   2.37428008 1.47010381]
  [2.73161844 2.09559238 1.16096713 2.2518049  1.73014859 1.67088853
   2.53262628 1.73339155]]

 [[2.43541131 1.87891425 1.16746657 1.92474185 1.52069436 0.97693226
   2.07054    1.45346065]
  [3.9245769  3.55037243 2.01019131 2.80272973 2.43640851 1.94970088
   3.663497   2.8054415 ]
  [1.3404496  1.47947343 0.99550988 1.29542652 1.06938915 0.75063337
   1.85184806 1.0470242 ]
  [2.5002975  2.17031702 1.24631062 1.56444293 1.25689034 1.2935675
   2.60090763 1.84712708]
  [2.59881267 1.82663492 1.12032449 2.08153862 1.60233091 1.29118413
   2.21074913 1.51364117]]]


Let's compute the attention scores, scale them, and apply softmax.

In [3]:
attention_score = np.matmul(Q, K.transpose(0, 2, 1))
print(f"Attention score shape: {attention_score.shape}\n{attention_score}")

Attention score shape: (2, 5, 5)
[[[25.60465863 25.10269067 21.50456811 25.11184948 31.32003528]
  [26.35741741 26.42247042 22.35663787 26.06745047 32.09337878]
  [20.56056319 20.62310971 17.40520468 20.40647158 25.22323214]
  [22.92875359 23.19098238 19.48989398 22.87787737 28.12536617]
  [26.49358999 26.30407363 22.38241025 26.21700367 32.52453393]]

 [[20.22747003 36.84227124 17.95817877 24.6079159  22.88059561]
  [35.12814843 63.68983939 31.15911807 42.91357857 40.09510641]
  [15.06468876 27.22022164 13.20414129 18.3583466  17.22272494]
  [21.9314704  40.03070232 19.63388768 26.87695165 25.3191665 ]
  [21.23287801 38.8330103  18.98995259 25.92628467 24.12113112]]]


In [4]:
np.var(attention_score)

71.04815372272715

When the embedding dimension $d_{model}$ is large, the dot products of the Q and K vectors can grow large in magnitude. Dividing by $\sqrt{d_{model}}$ keeps the scores in a range where softmax produces a smoother, more stable distribution.

In [5]:
scaled_with_d = attention_score / d_model
scaled_with_sqrtd = attention_score / np.sqrt(d_model)
print(f"Scores scaled by d:\n{scaled_with_d}\n")
print(f"Scores scaled by sqrt(d):\n{scaled_with_sqrtd}")

Scores scaled by d:
[[[3.20058233 3.13783633 2.68807101 3.13898118 3.91500441]
  [3.29467718 3.3028088  2.79457973 3.25843131 4.01167235]
  [2.5700704  2.57788871 2.17565058 2.55080895 3.15290402]
  [2.8660942  2.8988728  2.43623675 2.85973467 3.51567077]
  [3.31169875 3.2880092  2.79780128 3.27712546 4.06556674]]

 [[2.52843375 4.60528391 2.24477235 3.07598949 2.86007445]
  [4.39101855 7.96122992 3.89488976 5.36419732 5.0118883 ]
  [1.8830861  3.4025277  1.65051766 2.29479332 2.15284062]
  [2.7414338  5.00383779 2.45423596 3.35961896 3.16489581]
  [2.65410975 4.85412629 2.37374407 3.24078558 3.01514139]]]

Scores scaled by sqrt(d):
[[[ 9.05261387  8.8751414   7.60301297  8.87837953 11.07330467]
  [ 9.31875429  9.34175401  7.90426512  9.2162355  11.34672288]
  [ 7.26925683  7.29137036  6.15366913  7.21477722  8.91775925]
  [ 8.10653858  8.19925045  6.8907181   8.08855111  9.94381857]
  [ 9.36689857  9.29989442  7.91337703  9.26911054 11.49915925]]

 [[ 7.151

**Why use $\sqrt{d}$ and not $d$?**
- If the query and key components are independent with zero mean and unit variance, their dot product has a variance of about $d$. Dividing by $\sqrt{d}$ brings the variance back to roughly 1, independent of $d$.
- Dividing by $d$ would shrink the scores too much, making the softmax output nearly uniform (values too close together). This leads to less confident attention distributions and can harm learning.
- Dividing by $\sqrt{d}$ strikes a balance: it prevents the scores from getting too large (which would saturate softmax and make gradients vanish) without shrinking them too much.

| Scale factor         | Effect on scores  | Resulting softmax distribution                     |
| -------------------- | ----------------- | -------------------------------------------------- |
| No scaling           | Large scores      | Saturated softmax, vanishing gradients             |
| Divide by $d$        | Scores too small  | Nearly uniform softmax, weak attention             |
| Divide by $\sqrt{d}$ | Scores normalized | Stable gradients, effective attention              |
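
The effect of each scale factor can be seen on a toy example. The sketch below is only an illustration: the dimension `d = 512`, the six random key vectors, and the dictionary name `max_weight` are assumptions, not part of the notebook above. It reports the largest attention weight under each scaling; a value near 1 means softmax is saturated, and a value near $1/6$ means it is almost uniform.

```python
import numpy as np

def softmax(x):
    # Subtract the row maximum for numerical stability before exponentiating.
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 512                                # hypothetical embedding size
q = rng.standard_normal(d)             # one query vector
keys = rng.standard_normal((6, d))     # six hypothetical key vectors

scores = keys @ q                      # raw dot products, one per key

scales = {
    "no scaling": 1.0,
    "divide by d": float(d),
    "divide by sqrt(d)": float(np.sqrt(d)),
}
max_weight = {}
for name, scale in scales.items():
    weights = softmax(scores / scale)
    max_weight[name] = float(weights.max())
    print(f"{name:>18}: largest weight = {max_weight[name]:.3f}")
```

Because a smaller divisor always sharpens the softmax, the largest weight is highest without scaling, lowest when dividing by $d$, and in between when dividing by $\sqrt{d}$.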


In [6]:
np.var(scaled_with_sqrtd)

8.881019215340894

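The key point is that, after dividing by $\sqrt{d}$, the variance of the scores does not grow in proportion to $d$. Here the variance drops from about 71.0 to about 8.9, a factor of $d = 8$, because dividing the scores by $\sqrt{d}$ divides their variance by $d$.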
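
To see the dependence on $d$ more directly, the following sketch estimates the variance of the dot product for several dimensions. It assumes query and key components drawn from a standard normal distribution (zero mean, unit variance), unlike the `np.random.rand` inputs used above, and the names `n_pairs`, `raw_var`, and `scaled_var` are chosen only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 5_000        # number of random (q, k) pairs per dimension

raw_var = {}
scaled_var = {}
for d in (8, 64, 512):
    q = rng.standard_normal((n_pairs, d))
    k = rng.standard_normal((n_pairs, d))
    dots = np.sum(q * k, axis=-1)                  # one dot product per pair
    raw_var[d] = float(dots.var())                 # grows roughly like d
    scaled_var[d] = float((dots / np.sqrt(d)).var())  # stays close to 1
    print(f"d={d:4d}  var(q.k)={raw_var[d]:8.2f}  var(q.k / sqrt(d))={scaled_var[d]:.2f}")
```

Under these assumptions the unscaled variance tracks $d$, while the scaled variance stays near 1 for every dimension.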

In [7]:
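def softmax(x):
    # Subtract the row-wise max before exponentiating for numerical stability;
    # this shift does not change the softmax result.
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)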

In [8]:
attention_weight = softmax(scaled_with_sqrtd)
print(f"attention: {attention_weight.shape}\n{attention_weight}")

attention: (2, 5, 5)
[[[9.56417750e-02 8.00888970e-02 2.24436741e-02 8.03486552e-02
   7.21476999e-01]
  [9.28719660e-02 9.50327481e-02 2.25725071e-02 8.38226318e-02
   7.05700147e-01]
  [1.17699012e-01 1.20330744e-01 3.85726284e-02 1.11458353e-01
   6.11939263e-01]
  [1.03570913e-01 1.13632368e-01 3.07053237e-02 1.01724590e-01
   6.50366805e-01]
  [8.68833149e-02 8.12525219e-02 2.03086006e-02 7.87893612e-02
   7.32766201e-01]]

 [[2.74381933e-03 9.76105006e-01 1.23002592e-03 1.29109057e-02
   7.01024332e-03]
  [4.11199442e-05 9.99065824e-01 1.01070088e-05 6.44876282e-04
   2.38073179e-04]
  [1.24390117e-02 9.14582341e-01 6.44326097e-03 3.98579362e-02
   2.66774502e-02]
  [1.63459379e-03 9.82832922e-01 7.25478506e-04 9.39226042e-03
   5.41474489e-03]
  [1.94745757e-03 9.81528606e-01 8.81201312e-04 1.02358324e-02
   5.40690258e-03]]]


In [9]:
output = np.matmul(attention_weight, V)
print(f"Output: {output.shape}\n{output}")

Output: (2, 5, 8)
[[[2.31118748 2.43526663 2.60021598 2.29518082 2.23936964 2.13753245
   1.94199825 1.97020266]
  [2.30583572 2.43189857 2.58808358 2.29171588 2.22366791 2.12819057
   1.93458963 1.96467516]
  [2.26086411 2.39727201 2.53287438 2.26875285 2.15696284 2.07288757
   1.90142164 1.93494217]
  [2.28007454 2.41227043 2.55392498 2.27735632 2.18197051 2.09611181
   1.91318812 1.94656405]
  [2.31723583 2.43992623 2.60447032 2.29587276 2.24382477 2.14487682
   1.94302629 1.97285133]]

 [[3.84598117 3.57303707 3.57988203 2.85057197 3.01064801 2.84959987
   3.02204099 3.24750381]
  [3.88028566 3.60169024 3.61961934 2.87354785 3.03492777 2.87576409
   3.05992551 3.28398129]
  [3.75145444 3.49405481 3.47338966 2.78740291 2.94104723 2.77874597
   2.92033759 3.14940679]
  [3.85601952 3.58152674 3.59187022 2.85743392 3.01790511 2.85756595
   3.03327132 3.25832444]
  [3.85416633 3.579883   3.58931042 2.85604928 3.0165523  2.85585402
   3.0309997  3.25614541]]]


Let's build the same self-attention mechanism in PyTorch.

In [10]:
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)

<torch._C.Generator at 0x7485a9012330>

In [11]:
class SelfAttention(nn.Module):
    def __init__(self, d_in, d_out, bias=False):
        super().__init__()
        self.d_in = d_in
        self.d_out = d_out
        self.wq = nn.Linear(d_in, d_out, bias=bias)
        self.wk = nn.Linear(d_in, d_out, bias=bias)
        self.wv = nn.Linear(d_in, d_out, bias=bias)

    def forward(self, x):
        Q = self.wq(x)  # (batch, seq_len, d_out)
        K = self.wk(x)
        V = self.wv(x)

        # Scale by the square root of the query/key dimension (d_out).
        attention_score = Q @ K.transpose(1, 2) / self.d_out**0.5
        attention_weight = F.softmax(attention_score, dim=-1)

        output = attention_weight @ V
        return output


In [12]:
X = torch.randn(2,5,8)
d_in, d_out = X.shape[-1], X.shape[-1]
self_attn = SelfAttention(d_in, d_out)

In [13]:
output = self_attn(X)
print(f"Output shape: {output.shape}\n{output}")

Output shape: torch.Size([2, 5, 8])
tensor([[[ 0.0207, -0.0072,  0.0424,  0.6898,  0.1978,  0.4199,  0.0109,
           0.6590],
         [ 0.2304, -0.0192,  0.0728,  0.1547,  0.0503, -0.0697, -0.1201,
           0.1993],
         [ 0.3303,  0.0403, -0.0099, -0.0942,  0.0513, -0.3193, -0.2431,
           0.0046],
         [ 0.3393,  0.0349, -0.0301, -0.1476,  0.0267, -0.3847, -0.2255,
          -0.0411],
         [ 0.1075, -0.2178,  0.2254,  0.2262, -0.1503,  0.0407,  0.2039,
           0.2271]],

        [[ 0.2868,  0.4197,  0.5635,  0.1785, -0.1162, -0.1099, -0.7239,
          -0.5140],
         [ 0.2571,  0.3941,  0.5608,  0.2562, -0.0864, -0.0646, -0.6615,
          -0.4587],
         [ 0.1722,  0.4810,  0.4679, -0.0181, -0.2029, -0.0472, -0.9151,
          -0.6055],
         [ 0.2445,  0.4674,  0.5278,  0.0388, -0.1765, -0.0824, -0.8920,
          -0.5864],
         [ 0.2123,  0.4595,  0.5051,  0.1072, -0.1573, -0.0396, -0.8268,
          -0.5240]]], grad_fn=<UnsafeViewBackward0>)

![image.png](media/attn.png)

#### Hiding the future words with causal attention 
In causal attention, the attention weights above the diagonal are masked. This means the LLM cannot use future tokens when computing the context vector for a given position.

![image.png](media/cmask.png)

![image](media/cmaskprocess.png)
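
One way to check what "hiding the future" means in practice is to change the last token and confirm that the outputs at earlier positions stay the same. The sketch below is a minimal NumPy illustration under assumed names (`causal_attention`, random weights `wq`, `wk`, `wv`); it is not the PyTorch implementation developed in this notebook.

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

def causal_attention(x, wq, wk, wv):
    """Single-head causal self-attention for one sequence of shape (seq_len, d)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # True above the diagonal marks future positions that must be hidden.
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
seq_len, d = 5, 8
x = rng.standard_normal((seq_len, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))

out = causal_attention(x, wq, wk, wv)

# Replace the last token with a different random vector.
x_changed = x.copy()
x_changed[-1] = rng.standard_normal(d)
out_changed = causal_attention(x_changed, wq, wk, wv)

earlier_unchanged = np.allclose(out[:-1], out_changed[:-1])
last_unchanged = np.allclose(out[-1], out_changed[-1])
print(f"earlier positions unchanged: {earlier_unchanged}")
print(f"last position unchanged:     {last_unchanged}")
```

Positions 0 to 3 never attend to position 4, so their context vectors do not depend on it; only the last position's output reflects the change.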

In [14]:
# Let's reuse the previous example.

queries = self_attn.wq(X)
keys = self_attn.wk(X)

causal_attention_score = queries @ keys.transpose(1,2) 
print(f"causal attention score shape: {causal_attention_score.shape}\n {causal_attention_score}")

causal attention score shape: torch.Size([2, 5, 5])
 tensor([[[ 2.9553,  0.9264,  0.0417,  0.6856, -2.0050],
         [-0.4110,  0.0301,  1.2666, -0.1143, -0.8055],
         [-0.9198,  1.1590,  3.5975,  1.1701, -2.0234],
         [-1.6577,  0.7734,  3.6305,  1.1457, -1.4466],
         [-1.0227, -1.0987, -2.8207, -1.0246,  2.7426]],

        [[-0.2186, -0.7892, -0.1712,  0.4683, -1.5515],
         [-0.3396, -0.2679, -0.3899,  0.5105, -0.3881],
         [ 0.3817,  0.1084,  0.5549, -1.2537, -1.8937],
         [ 0.5263, -0.9256,  0.4854, -0.1689, -2.1769],
         [ 0.7852,  0.1437,  0.1935,  0.1885, -1.2148]]],
       grad_fn=<UnsafeViewBackward0>)


In [15]:
# Let's create a lower-triangular mask.
mask = torch.tril(torch.ones(causal_attention_score.shape[1], causal_attention_score.shape[2]))
print(f"mask shape: {mask.shape}\n{mask}")

mask shape: torch.Size([5, 5])
tensor([[1., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1.]])


We created a mask of shape (seq_len, seq_len); it is broadcast across the batch dimension.
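
As a quick illustration of that broadcasting, the sketch below uses small made-up tensors (`scores`, `mask`, `explicit` are names chosen for this example) and compares the broadcast product with one where the batch dimension is written out explicitly.

```python
import torch

# Toy scores of shape (batch=2, seq=3, seq=3) and a (seq, seq) mask.
scores = torch.arange(2 * 3 * 3, dtype=torch.float32).reshape(2, 3, 3)
mask = torch.tril(torch.ones(3, 3))

# The (3, 3) mask is treated as (1, 3, 3) and reused for every batch element.
masked = scores * mask

# The same operation with the batch dimension made explicit.
explicit = scores * mask.unsqueeze(0).expand(2, -1, -1)

same = torch.equal(masked, explicit)
print(masked.shape, same)
```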

In [16]:
masked_attention_score = causal_attention_score * mask
print(f"masked attention score shape: {masked_attention_score.shape}\n{masked_attention_score}")

masked attention score shape: torch.Size([2, 5, 5])
tensor([[[ 2.9553,  0.0000,  0.0000,  0.0000, -0.0000],
         [-0.4110,  0.0301,  0.0000, -0.0000, -0.0000],
         [-0.9198,  1.1590,  3.5975,  0.0000, -0.0000],
         [-1.6577,  0.7734,  3.6305,  1.1457, -0.0000],
         [-1.0227, -1.0987, -2.8207, -1.0246,  2.7426]],

        [[-0.2186, -0.0000, -0.0000,  0.0000, -0.0000],
         [-0.3396, -0.2679, -0.0000,  0.0000, -0.0000],
         [ 0.3817,  0.1084,  0.5549, -0.0000, -0.0000],
         [ 0.5263, -0.9256,  0.4854, -0.1689, -0.0000],
         [ 0.7852,  0.1437,  0.1935,  0.1885, -1.2148]]],
       grad_fn=<MulBackward0>)


In [17]:
# Let's naively normalize the masked attention scores by their row sums.
row_sum = masked_attention_score.sum(dim=-1, keepdim=True)
norm_attn_score = masked_attention_score / row_sum
print(norm_attn_score)


tensor([[[  1.0000,   0.0000,   0.0000,   0.0000,  -0.0000],
         [  1.0790,  -0.0790,  -0.0000,   0.0000,   0.0000],
         [ -0.2397,   0.3021,   0.9377,   0.0000,  -0.0000],
         [ -0.4259,   0.1987,   0.9328,   0.2944,  -0.0000],
         [  0.3172,   0.3408,   0.8749,   0.3178,  -0.8507]],

        [[  1.0000,   0.0000,   0.0000,  -0.0000,   0.0000],
         [  0.5590,   0.4410,   0.0000,  -0.0000,   0.0000],
         [  0.3653,   0.1037,   0.5310,  -0.0000,  -0.0000],
         [ -6.3572,  11.1813,  -5.8639,   2.0398,   0.0000],
         [  8.1838,   1.4973,   2.0165,   1.9646, -12.6622]]],
       grad_fn=<DivBackward0>)


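Zeroing the raw scores and dividing by the row sum, as above, does not give valid attention weights: the scores can be negative, so the result contains negative values and values larger than 1. Masking after softmax would also be incorrect on its own, because it changes the distribution: softmax ensures that the probabilities in each row sum to 1, and zeroing some of them breaks that. A more efficient approach is to set the masked positions to negative infinity and then apply softmax, which turns them into 0 while the remaining weights still sum to 1.

![image](media/cmsoftmax.png)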
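
The point about row sums can be checked numerically. The sketch below uses hypothetical random scores for four tokens (the names `weights_a`, `weights_b`, and `zeroed` are illustrative). It shows that zeroing entries after softmax leaves rows that sum to less than 1, so an extra renormalization step is needed, whereas filling with negative infinity before softmax gives valid weights in a single pass.

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 4))               # hypothetical scores for 4 tokens
future = np.triu(np.ones((4, 4), dtype=bool), k=1)  # True above the diagonal

# Option A: hide future positions with -inf, then apply softmax once.
weights_a = softmax(np.where(future, -np.inf, scores))

# Option B: softmax first, zero the future positions, then renormalize each row.
zeroed = np.where(future, 0.0, softmax(scores))
weights_b = zeroed / zeroed.sum(axis=-1, keepdims=True)

print(zeroed.sum(axis=-1))                 # rows with hidden positions sum to less than 1
print(np.allclose(weights_a, weights_b))   # after renormalizing, both options agree
```

Both routes end at the same weights, but option A needs no second normalization, which is why the negative-infinity fill is the usual choice.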

In [18]:
mask = torch.triu(torch.ones(causal_attention_score.shape[1], causal_attention_score.shape[1]), diagonal=1)
print(mask)
mask_causal_attention_score = causal_attention_score.masked_fill(mask.bool(), -torch.inf)
print(mask_causal_attention_score)

tensor([[0., 1., 1., 1., 1.],
        [0., 0., 1., 1., 1.],
        [0., 0., 0., 1., 1.],
        [0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0.]])
tensor([[[ 2.9553,    -inf,    -inf,    -inf,    -inf],
         [-0.4110,  0.0301,    -inf,    -inf,    -inf],
         [-0.9198,  1.1590,  3.5975,    -inf,    -inf],
         [-1.6577,  0.7734,  3.6305,  1.1457,    -inf],
         [-1.0227, -1.0987, -2.8207, -1.0246,  2.7426]],

        [[-0.2186,    -inf,    -inf,    -inf,    -inf],
         [-0.3396, -0.2679,    -inf,    -inf,    -inf],
         [ 0.3817,  0.1084,  0.5549,    -inf,    -inf],
         [ 0.5263, -0.9256,  0.4854, -0.1689,    -inf],
         [ 0.7852,  0.1437,  0.1935,  0.1885, -1.2148]]],
       grad_fn=<MaskedFillBackward0>)


In [19]:
causal_attention_weight = F.softmax(mask_causal_attention_score / np.sqrt(self_attn.d_in), dim=-1)
print(f"Shape:{causal_attention_weight.shape}\n{causal_attention_weight}")

Shape:torch.Size([2, 5, 5])
tensor([[[1.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.4611, 0.5389, 0.0000, 0.0000, 0.0000],
         [0.1246, 0.2599, 0.6155, 0.0000, 0.0000],
         [0.0797, 0.1883, 0.5171, 0.2148, 0.0000],
         [0.1372, 0.1336, 0.0727, 0.1371, 0.5194]],

        [[1.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.4937, 0.5063, 0.0000, 0.0000, 0.0000],
         [0.3366, 0.3056, 0.3578, 0.0000, 0.0000],
         [0.2971, 0.1778, 0.2928, 0.2323, 0.0000],
         [0.2557, 0.2038, 0.2074, 0.2070, 0.1261]]],
       grad_fn=<SoftmaxBackward0>)


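In addition, we can add dropout to reduce overfitting. Dropout can be applied at several points; in transformers it is mostly applied after multi-head attention. In the example that follows it is applied to the attention weights. Note that the surviving entries are scaled by $1/(1-p)$, which is why values of 2.0 appear with $p = 0.5$.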
![image](media/dropout.png)
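
A minimal sketch of how `nn.Dropout` behaves, using a made-up all-ones tensor instead of real attention weights: in training mode each entry is either zeroed or scaled by $1/(1-p)$, and in evaluation mode the layer passes its input through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.5
dropout = nn.Dropout(p)
weights = torch.ones(4, 4)           # stand-in for attention weights

# A freshly created module is in training mode, so dropout is active.
dropped = dropout(weights)
print(dropped)                       # every entry is 0 or 1 / (1 - p)

# In evaluation mode dropout is disabled and the input is returned as is.
dropout.eval()
eval_out = dropout(weights)
print(torch.equal(eval_out, weights))
```

This is why a model that contains dropout should be switched to evaluation mode with `.eval()` when deterministic outputs are needed.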

In [20]:
# example 
dropout = nn.Dropout(0.5)
print(dropout(causal_attention_weight))

tensor([[[2.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.9222, 1.0778, 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 1.2310, 0.0000, 0.0000],
         [0.0000, 0.0000, 1.0343, 0.0000, 0.0000],
         [0.2744, 0.2671, 0.1453, 0.2742, 0.0000]],

        [[0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.9873, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.5941, 0.3556, 0.5856, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.4148, 0.0000, 0.0000]]], grad_fn=<MulBackward0>)


Let's create a simple attention class that combines all of the components above.

In [21]:
class SAV2(nn.Module):
    def __init__(self, d_model:int, bias=False):
        super().__init__()
        
        self.d_model = d_model

        self.wq = nn.Linear(d_model, d_model, bias=bias)
        self.wk = nn.Linear(d_model, d_model, bias=bias)
        self.wv = nn.Linear(d_model, d_model, bias=bias)

        self.dropout = nn.Dropout(0.3)
    
    def forward(self, x):
        Q = self.wq(x)
        K = self.wk(x)
        V = self.wv(x)

        attention_score = Q @ K.transpose(1, 2)  # (b, seq, seq)
        batch_size, seq_len, _ = attention_score.size()
        # Ones above the diagonal mark future positions; keep the mask on x's device.
        mask = torch.triu(torch.ones(seq_len, seq_len, device=x.device), diagonal=1)
        # Broadcasting would handle the batch dimension automatically, but let's be explicit.
        mask = mask.unsqueeze(0).expand(batch_size, -1, -1)  # (b, seq, seq)
        masked_attention_score = attention_score.masked_fill(mask.bool(), -torch.inf)

        attention_weight = F.softmax(masked_attention_score / self.d_model**0.5, dim=-1)
        attention_weight = self.dropout(attention_weight)
        output = attention_weight @ V
        return output



In [22]:
inputs = torch.rand(2, 3, 4)  # renamed to avoid shadowing the built-in input()
sv2 = SAV2(d_model=4)
output = sv2(inputs)
print(f"Shape:{output.size()}\n{output}")

Shape:torch.Size([2, 3, 4])
tensor([[[ 0.3118, -0.3660, -0.0574, -0.7948],
         [ 0.1792, -0.2334, -0.0644, -0.6985],
         [ 0.1169, -0.1521, -0.0418, -0.4542]],

        [[-0.4388, -0.7512, -0.0193, -1.1770],
         [ 0.0764, -0.1267, -0.0145, -0.2312],
         [ 0.0494, -0.0818, -0.0094, -0.1493]]], grad_fn=<UnsafeViewBackward0>)


This summarizes the whole process above. 
![image](media/sha.png)

Next, we will develop the multi-head attention mechanism. The pictures are taken from [rasbt](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb).