# Key Insights

[01 Self Attention in Transformer Neural Networks (with Code!)](https://www.youtube.com/watch?v=QCJQG4DuHT0)


🔄 Bidirectional recurrent neural networks read left-to-right and right-to-left context separately, so they can miss information that depends on both directions at once.

💡 Attention helps determine which parts of the input sentence each word needs to focus on, allowing for more effective incorporation of context and improving the overall feature vector representation.

💡 Transformers can be used for various sequence to sequence tasks, such as translating from English to other languages like Kannada.

🧠 The attention mechanism in the Transformer architecture produces higher-quality, more context-aware vectors and avoids the slow training of recurrent neural networks.

🧠 Applying a mask in the attention function restricts each word to the positions it is allowed to see, so it ignores context after a certain point.

In [23]:
import numpy as np
import math

L, d_k, d_v = 4, 8, 8  # seq_len, d_k, d_v
q = np.random.randn(L, d_k)
k = np.random.randn(L, d_k)
v = np.random.randn(L, d_v)


In [24]:
print("Q\n",q)
print("K\n",k)
print("V\n",v)

Q
 [[ 0.93005057 -0.79701316 -0.22866933  1.06596114 -1.42342832 -0.8891726
  -0.03123426 -1.18051679]
 [ 0.08437427 -0.70255813  0.05857633  0.92003131  2.00003795  0.82948422
  -0.52968141 -0.85584601]
 [-0.80326185  0.46506673 -0.45196857 -1.57426684  0.27630791  0.12351596
   0.11217196 -1.68406874]
 [-0.89171967 -0.28107851  0.02281117  1.15788562 -0.29592827 -1.64883181
  -0.35067397 -0.41625857]]
K
 [[-3.31827659e-01 -1.44672117e-01 -2.22624694e-01  1.21953293e-01
   1.39899895e+00  4.26960040e-01 -4.79485000e-01 -1.65315233e-01]
 [-1.67427059e-01  1.72024470e-03  8.68835595e-01  1.88825191e+00
   7.11572054e-01  6.07053026e-03 -3.98780916e-01 -1.05044594e+00]
 [-8.37955031e-01  1.54644995e-01  8.73803326e-02  1.63268236e+00
   1.24830978e+00 -2.41390942e-01  7.14414816e-01  8.34320116e-01]
 [-8.37804868e-01 -2.72736424e-01 -2.42600315e-01 -1.40852362e+00
   1.99673570e+00 -2.61595430e-01  9.29500353e-01  2.03795632e-02]]
V
 [[-0.49978462 -0.85906258  1.09441924 -0.22010661  1.0

# Self-Attention

$$
\text{self-attention}=\operatorname{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}+M\right) V
$$
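
For reference, the shapes and the mask used in this formula can be written out explicitly, following the code in this notebook. The index notation $M_{ij}$ is introduced here for illustration only; $L$ is the sequence length.

$$
Q, K \in \mathbb{R}^{L \times d_k}, \quad V \in \mathbb{R}^{L \times d_v}, \quad
M_{ij}= \begin{cases}0 & j \le i \\ -\infty & j>i\end{cases}
$$

When no mask is applied (as in an encoder), $M$ is the zero matrix and the formula reduces to plain scaled dot-product attention.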

In [25]:
np.matmul(q, k.T)

array([[-2.17328813,  1.89129554, -1.75168432, -4.67047989],
       [ 3.72046944,  4.3112623 ,  2.53186353,  2.07760462],
       [ 0.77180294, -1.30835704, -2.8745662 ,  3.46252174],
       [-0.40834074,  2.71152901,  2.0269896 , -1.30669423]])

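- The largest value in row i identifies the word most strongly related to word i.
- Dividing by $\sqrt{d_k}$ reduces the variance of the attention scores, which makes training more stable.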


In [26]:
q.var(), k.var(), np.matmul(q, k.T).var()

(0.7571056926026465, 0.6735425966378498, 6.697288483667932)

In [27]:
scaled = np.matmul(q, k.T) / math.sqrt(d_k)
q.var(), k.var(), scaled.var()  # lower variance after scaling

(0.7571056926026465, 0.6735425966378498, 0.8371610604584914)
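
The drop in variance seen here follows a general pattern: for independent, zero-mean, unit-variance components, the dot product of two `d_k`-dimensional vectors has variance `d_k`, so dividing by `sqrt(d_k)` brings it back to roughly 1. The sketch below checks this empirically. The helper name `score_variances`, the seed, and the sample size are arbitrary choices for illustration and are not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
n_pairs = 20_000                # number of random (q, k) pairs per setting

def score_variances(d_k, n_pairs=n_pairs):
    """Return (raw, scaled) variance of q.k for random unit-variance vectors."""
    q = rng.standard_normal((n_pairs, d_k))
    k = rng.standard_normal((n_pairs, d_k))
    scores = np.sum(q * k, axis=-1)  # one dot product per pair
    return scores.var(), (scores / np.sqrt(d_k)).var()

for d_k in (8, 64, 512):
    raw, scaled = score_variances(d_k)
    print(f"d_k={d_k:4d}  raw variance={raw:8.2f}  scaled variance={scaled:.2f}")
```

The raw variance should grow roughly in proportion to `d_k`, while the scaled variance should stay near 1 regardless of dimension.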

# Masking

- Masking ensures that a word cannot draw context from words generated after it.
- Not required in the encoder, but required in the decoder.

In [28]:
mask = np.tril(np.ones((L,L)))
mask  # each position can see only itself and earlier positions

array([[1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 1., 0.],
       [1., 1., 1., 1.]])

In [29]:
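mask[mask == 0] = -np.inf  # blocked (future) positions; np.infty was removed in NumPy 2.0
mask[mask == 1] = 0        # allowed positions add nothing to the score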

mask

array([[  0., -inf, -inf, -inf],
       [  0.,   0., -inf, -inf],
       [  0.,   0.,   0., -inf],
       [  0.,   0.,   0.,   0.]])
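
The same additive mask can also be built in one step, without first creating a 0/1 matrix and overwriting it in place. This is a minimal sketch of one alternative; the names `future` and `causal_mask` are illustrative and do not appear in the original notebook.

```python
import numpy as np

L = 4  # sequence length, matching the example above

# True strictly above the diagonal, i.e. at "future" positions j > i.
future = np.triu(np.ones((L, L), dtype=bool), k=1)

# Additive mask: -inf where attention is forbidden, 0 elsewhere.
causal_mask = np.where(future, -np.inf, 0.0)
print(causal_mask)
```

Building the mask from a boolean matrix avoids the order-dependent in-place assignments, where replacing 0 and 1 in the wrong order would corrupt the mask.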

In [30]:
scaled + mask

array([[-0.76837339,        -inf,        -inf,        -inf],
       [ 1.31538459,  1.5242614 ,        -inf,        -inf],
       [ 0.27287355, -0.46257407, -1.01631263,        -inf],
       [-0.14437025,  0.95867027,  0.71664904, -0.46198618]])

# Softmax

$$
\operatorname{softmax}(x)_i=\frac{e^{x_i}}{\sum_j e^{x_j}}
$$

In [31]:
def softmax(x):
    return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T
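
The `softmax` above works for the small scores in this notebook, but `np.exp` overflows for large inputs. A common remedy is to subtract each row's maximum before exponentiating, which leaves the result mathematically unchanged. The sketch below is one way to write it. `stable_softmax` is a hypothetical name, and it assumes every row has at least one finite entry (a fully masked row would still produce `nan`).

```python
import numpy as np

def stable_softmax(x):
    """Row-wise softmax that subtracts each row's maximum before exponentiating."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

scores = np.array([[1000.0, 1001.0, 1002.0],  # large values overflow a naive exp
                   [0.5, -np.inf, -np.inf]])  # masked row with one visible position
probs = stable_softmax(scores)
print(probs)
```

Using `keepdims=True` also removes the need for the double transpose, since the row sums broadcast directly against the rows.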

In [32]:
attention = softmax(scaled + mask)
attention  # each row sums to 1, i.e. a probability distribution

array([[1.        , 0.        , 0.        , 0.        ],
       [0.44796983, 0.55203017, 0.        , 0.        ],
       [0.56987013, 0.27313355, 0.15699631, 0.        ],
       [0.14071096, 0.42400632, 0.33286171, 0.10242101]])

In [33]:
attention = softmax(scaled) # no mask 
attention  # each row sums to 1, i.e. a probability distribution

array([[0.14743653, 0.62045006, 0.17113577, 0.06097764],
       [0.28997163, 0.35733007, 0.19048005, 0.16221825],
       [0.2302083 , 0.11033674, 0.06342121, 0.59603375],
       [0.14071096, 0.42400632, 0.33286171, 0.10242101]])

In [34]:
new_v = np.matmul(attention, v)
new_v

array([[ 0.07751186, -0.75277534,  0.5592909 ,  0.51682059, -1.44796057,
        -0.0511884 , -0.07613615, -1.059968  ],
       [-0.04362156, -0.33433688,  0.643523  , -0.28127416, -0.93331464,
         0.12468582,  0.29482048, -0.26826846],
       [-0.44120537,  0.54520831,  0.59443298, -2.47760462, -0.78479624,
         0.1243503 ,  0.90484772,  0.98299058],
       [ 0.22424808, -0.23491487,  0.54935916,  0.11877295, -1.40746208,
         0.04153057,  0.02867885, -0.55457889]])

In [35]:
v # before attention

array([[-0.49978462, -0.85906258,  1.09441924, -0.22010661,  1.04392234,
         0.653697  ,  0.94631293,  0.44715966],
       [ 0.01309478, -1.43901571,  0.47461646,  1.26289208, -1.9705294 ,
        -0.30194412, -0.4281863 , -2.04666523],
       [ 1.07252581,  1.05962982,  0.44808896,  0.17161306, -1.80158358,
         0.23710998, -0.15248516,  0.18719277],
       [-0.66374852,  1.40016388,  0.43907347, -4.32385169, -1.16341638,
        -0.01318457,  1.24810734,  1.83546834]])

# Self-Attention Calculation (Optional Mask)
**`scaled_dot_product_attention` applied to `q`, `k` and `v`**

In [36]:
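def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k: (L, d_k); v: (L, d_v); mask: additive (L, L) matrix of 0 / -inf, or None
    d_k = q.shape[-1]
    scaled = np.matmul(q, k.T) / math.sqrt(d_k)  # scaled similarity scores
    if mask is not None:
        scaled = scaled + mask  # -inf entries become 0 after softmax

    attention = softmax(scaled)  # each row is a distribution over positions
    out = np.matmul(attention, v)  # weighted sum of value vectors
    return out, attention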
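
A quick way to see what the mask buys is to check that, with a causal mask, the outputs at earlier positions do not change when a later position's key and value are altered. The sketch below restates the two functions so that it runs on its own. The random inputs, the seed, and the names `causal_mask`, `k2`, `v2`, and `out2` are illustrative assumptions, not part of the original notebook.

```python
import math
import numpy as np

def softmax(x):
    # Row-wise softmax, equivalent to the notebook's version but using keepdims.
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.shape[-1]
    scaled = np.matmul(q, k.T) / math.sqrt(d_k)
    if mask is not None:
        scaled = scaled + mask
    attention = softmax(scaled)
    return np.matmul(attention, v), attention

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
L, d_k, d_v = 4, 8, 8
q = rng.standard_normal((L, d_k))
k = rng.standard_normal((L, d_k))
v = rng.standard_normal((L, d_v))
causal_mask = np.where(np.triu(np.ones((L, L), dtype=bool), k=1), -np.inf, 0.0)

out, attn = scaled_dot_product_attention(q, k, v, mask=causal_mask)

# Perturb only the last position's key and value.
k2, v2 = k.copy(), v.copy()
k2[-1] += 10.0
v2[-1] -= 10.0
out2, _ = scaled_dot_product_attention(q, k2, v2, mask=causal_mask)

print(np.allclose(attn.sum(axis=-1), 1.0))   # rows are probability distributions
print(np.allclose(np.triu(attn, k=1), 0.0))  # no weight on future positions
print(np.allclose(out[0], v[0]))             # first position can only see itself
print(np.allclose(out[:-1], out2[:-1]))      # earlier outputs ignore the last position
```

Without the mask, the last check would generally fail, because every position would attend to the perturbed final key and value.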

In [37]:
# self-attention without a mask
values, attention = scaled_dot_product_attention(q, k, v, mask = None)
print("Q\n",q)
print("K\n",k)
print("V\n",v)
print("New V\n",values)
print("Attention\n",attention)

Q
 [[ 0.93005057 -0.79701316 -0.22866933  1.06596114 -1.42342832 -0.8891726
  -0.03123426 -1.18051679]
 [ 0.08437427 -0.70255813  0.05857633  0.92003131  2.00003795  0.82948422
  -0.52968141 -0.85584601]
 [-0.80326185  0.46506673 -0.45196857 -1.57426684  0.27630791  0.12351596
   0.11217196 -1.68406874]
 [-0.89171967 -0.28107851  0.02281117  1.15788562 -0.29592827 -1.64883181
  -0.35067397 -0.41625857]]
K
 [[-3.31827659e-01 -1.44672117e-01 -2.22624694e-01  1.21953293e-01
   1.39899895e+00  4.26960040e-01 -4.79485000e-01 -1.65315233e-01]
 [-1.67427059e-01  1.72024470e-03  8.68835595e-01  1.88825191e+00
   7.11572054e-01  6.07053026e-03 -3.98780916e-01 -1.05044594e+00]
 [-8.37955031e-01  1.54644995e-01  8.73803326e-02  1.63268236e+00
   1.24830978e+00 -2.41390942e-01  7.14414816e-01  8.34320116e-01]
 [-8.37804868e-01 -2.72736424e-01 -2.42600315e-01 -1.40852362e+00
   1.99673570e+00 -2.61595430e-01  9.29500353e-01  2.03795632e-02]]
V
 [[-0.49978462 -0.85906258  1.09441924 -0.22010661  1.0

In [38]:
# masked self-attention
values, attention = scaled_dot_product_attention(q, k, v, mask = mask)
print("Q\n",q)
print("K\n",k)
print("V\n",v)
print("New V\n",values)
print("Attention\n",attention)

Q
 [[ 0.93005057 -0.79701316 -0.22866933  1.06596114 -1.42342832 -0.8891726
  -0.03123426 -1.18051679]
 [ 0.08437427 -0.70255813  0.05857633  0.92003131  2.00003795  0.82948422
  -0.52968141 -0.85584601]
 [-0.80326185  0.46506673 -0.45196857 -1.57426684  0.27630791  0.12351596
   0.11217196 -1.68406874]
 [-0.89171967 -0.28107851  0.02281117  1.15788562 -0.29592827 -1.64883181
  -0.35067397 -0.41625857]]
K
 [[-3.31827659e-01 -1.44672117e-01 -2.22624694e-01  1.21953293e-01
   1.39899895e+00  4.26960040e-01 -4.79485000e-01 -1.65315233e-01]
 [-1.67427059e-01  1.72024470e-03  8.68835595e-01  1.88825191e+00
   7.11572054e-01  6.07053026e-03 -3.98780916e-01 -1.05044594e+00]
 [-8.37955031e-01  1.54644995e-01  8.73803326e-02  1.63268236e+00
   1.24830978e+00 -2.41390942e-01  7.14414816e-01  8.34320116e-01]
 [-8.37804868e-01 -2.72736424e-01 -2.42600315e-01 -1.40852362e+00
   1.99673570e+00 -2.61595430e-01  9.29500353e-01  2.03795632e-02]]
V
 [[-0.49978462 -0.85906258  1.09441924 -0.22010661  1.0