<a href="https://colab.research.google.com/github/psygrammer/psypy_lm/blob/main/notebooks/ch01/ch01_start_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Getting Started with the Model Architecture of the Transformer

* 싸이그래머 / 싸이파이 [1]
* 김무성

----------------

# Contents
* The background of the Transformer
* The rise of the Transformer : Attention is All You Need
  - The encoder stack
    - Input embedding
    - Positional encoding
    - Sub-layer 1 : Multi-head attention
    - Sub-layer 2 : Feedforward network
  - The decoder stack
    - Output embedding and position encoding
    - The attention layers
    - The FFN sub-layer, the Post-LN, and the linear layer
* Training and performance
  - Before we end the chapter
* Summary

----------------

# The background of the Transformer

----------------

# The rise of the Transformer : Attention is All You Need
* The encoder stack
* The decoder stack


--------------------

## The encoder stack
* Input embedding
* Positional encoding
* Sub-layer 1 : Multi-head attention
* Sub-layer 2 : Feedforward network


In [1]:
!git clone https://github.com/psygrammer/psypy_lm.git

Cloning into 'psypy_lm'...
remote: Enumerating objects: 63, done.
remote: Counting objects: 100% (63/63), done.
remote: Compressing objects: 100% (53/53), done.
remote: Total 63 (delta 21), reused 13 (delta 2), pack-reused 0
Unpacking objects: 100% (63/63), done.


In [2]:
!ls

psypy_lm  sample_data


In [3]:
# change work_dir
%cd /content/psypy_lm/notebooks/ch01/

/content/psypy_lm/notebooks/ch01


In [4]:
!ls

ch01_start_transformer.ipynb  text.txt


In [5]:
#!pip install --upgrade gensim
import torch
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
import math
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize 
import gensim 
from gensim.models import Word2Vec 
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import warnings 
warnings.filterwarnings(action = 'ignore') 

### Input embedding


In [7]:
input_str = "the Transformer is an innovative NLP model!"

In [8]:
tokens = word_tokenize(input_str)
tokens

['the', 'Transformer', 'is', 'an', 'innovative', 'NLP', 'model', '!']

In [9]:
# Train a model so that we can use word2vec.
# The corpus file (text.txt) is already in place, as listed below.
!ls

ch01_start_transformer.ipynb  text.txt


In [10]:
with open('text.txt', 'r') as sample:
  s = sample.read()

In [11]:
s

'The black cat sat on the couch and the brown dog slept on the rug.The cat did not cross the street because it was too wet.The dog sat on the couch near the rug.The black cat sat on the couch and the brown dog slept on the rug.The cat did not cross the street because it was too wet.The dog sat on the couch near the rug.\nThe black cat sat on the couch and the brown dog slept on the rug.The cat did not cross the street because it was too wet.The dog sat on the couch near the rug.\nThe black cat sat on the couch and the brown dog slept on the rug.The cat did not cross the street because it was too wet.The dog sat on the couch near the rug.\nThe black cat sat on the couch and the brown dog slept on the rug.The cat did not cross the street because it was too wet.The dog sat on the couch near the rug.\nThe black cat sat on the couch and the brown dog slept on the rug.The cat did not cross the street because it was too wet.The dog sat on the couch near the rug.\nThe black cat sat on the couc

In [12]:
# replace newline characters with spaces
f = s.replace("\n", " ")

In [13]:
data = []

# sentence parsing 
for i in sent_tokenize(f): 
	temp = [] 
	# tokenize the sentence into words 
	for j in word_tokenize(i): 
		temp.append(j.lower()) 
	data.append(temp) 

In [14]:
data[:2]

[['the',
  'black',
  'cat',
  'sat',
  'on',
  'the',
  'couch',
  'and',
  'the',
  'brown',
  'dog',
  'slept',
  'on',
  'the',
  'rug.the',
  'cat',
  'did',
  'not',
  'cross',
  'the',
  'street',
  'because',
  'it',
  'was',
  'too',
  'wet.the',
  'dog',
  'sat',
  'on',
  'the',
  'couch',
  'near',
  'the',
  'rug.the',
  'black',
  'cat',
  'sat',
  'on',
  'the',
  'couch',
  'and',
  'the',
  'brown',
  'dog',
  'slept',
  'on',
  'the',
  'rug.the',
  'cat',
  'did',
  'not',
  'cross',
  'the',
  'street',
  'because',
  'it',
  'was',
  'too',
  'wet.the',
  'dog',
  'sat',
  'on',
  'the',
  'couch',
  'near',
  'the',
  'rug',
  '.'],
 ['the',
  'black',
  'cat',
  'sat',
  'on',
  'the',
  'couch',
  'and',
  'the',
  'brown',
  'dog',
  'slept',
  'on',
  'the',
  'rug.the',
  'cat',
  'did',
  'not',
  'cross',
  'the',
  'street',
  'because',
  'it',
  'was',
  'too',
  'wet.the',
  'dog',
  'sat',
  'on',
  'the',
  'couch',
  'near',
  'the',
  'rug',
  '.']]
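The `'rug.the'` and `'wet.the'` tokens in the output above appear because `text.txt` has no space after most sentence-ending periods, so the sentence and word tokenizers do not see a boundary there. One possible preprocessing step is sketched below. It assumes that any period followed directly by a letter marks a sentence boundary, which would wrongly split abbreviations such as "e.g.". The helper name `split_glued_sentences` is hypothetical, and a plain `str.split` stands in for `word_tokenize` so the snippet runs without NLTK.

```python
import re

def split_glued_sentences(text):
    """Insert a space after a period that is immediately followed by a letter."""
    return re.sub(r"\.(?=[A-Za-z])", ". ", text)

sample = "The dog slept on the rug.The cat did not cross the street."
fixed = split_glued_sentences(sample)
print(fixed)

# Whitespace split, used here only to show that the glued token is gone.
tokens = fixed.lower().split()
print("rug.the" in tokens)
```

Applying such a step to `f` before `sent_tokenize` would change the vocabulary, so the vectors and similarity values shown in the following cells would no longer be reproduced exactly.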

In [15]:
# Creating Skip Gram model 
# gensim >= 4.0 names the dimensionality parameter vector_size (older releases used size)
model2 = gensim.models.Word2Vec(data, min_count = 1, vector_size = 512, window = 5, sg = 1)

In [16]:
# 1-The 2-black 3-cat 4-sat 5-on 6-the 7-couch 8-and 9-the 10-brown 11-dog 12-slept 13-on 14-the 15-rug.
word1='black'
word2='brown'
pos1=2
pos2=10

In [17]:
a=model2.wv[word1]  # look up word vectors through the .wv attribute
a

array([ 0.05094784, -0.1194302 , -0.01305547,  0.04961686,  0.0010808 ,
        0.03239628,  0.03873453,  0.03194608,  0.00582984, -0.06031943,
       -0.01661889, -0.05935328,  0.02509349, -0.0413798 , -0.02748247,
        0.04381291,  0.05036264, -0.09866952,  0.02364658,  0.03341468,
       -0.0993137 , -0.05040878, -0.04516822,  0.02754081, -0.07625183,
        0.11011225, -0.00776107,  0.04898146, -0.00616252,  0.01729531,
        0.01331398, -0.04029948,  0.09127349,  0.04529446, -0.03115183,
       -0.02364399, -0.04764993, -0.10479593,  0.03877261, -0.06110352,
       -0.01739862, -0.02384081, -0.00138791, -0.04493561,  0.0623336 ,
        0.00536566,  0.05302285,  0.00665313,  0.1257884 , -0.00639011,
       -0.08840448, -0.01504067, -0.08169718,  0.05542751,  0.02467211,
        0.00269305,  0.04367587,  0.04461711,  0.05469487,  0.03485423,
        0.12378426,  0.13468914, -0.05267326,  0.00861393,  0.00605997,
       -0.01199449,  0.02924055, -0.074156  ,  0.12336732, -0.02

In [18]:
b=model2.wv[word2]  # look up word vectors through the .wv attribute
b

array([ 0.0509727 , -0.11916501, -0.0135867 ,  0.05009186,  0.00088694,
        0.03348342,  0.03898058,  0.03166577,  0.0067538 , -0.0604352 ,
       -0.01778465, -0.0591553 ,  0.02561564, -0.04118758, -0.02612484,
        0.04252522,  0.05060094, -0.0998862 ,  0.02287019,  0.03346855,
       -0.09849545, -0.05042516, -0.04646672,  0.02835454, -0.07562537,
        0.1091126 , -0.00833838,  0.04912784, -0.00696082,  0.01576121,
        0.01306915, -0.04080422,  0.09144068,  0.04443427, -0.03103787,
       -0.0219572 , -0.04711007, -0.10483744,  0.03959152, -0.06153628,
       -0.01602719, -0.02380065, -0.00152218, -0.04545695,  0.0633172 ,
        0.0042417 ,  0.05314684,  0.00689458,  0.12504321, -0.00660356,
       -0.08878986, -0.01425187, -0.08189036,  0.05547153,  0.02559732,
        0.00352398,  0.04273253,  0.04436868,  0.05621254,  0.03374501,
        0.12457498,  0.13435008, -0.05270749,  0.00940662,  0.0051468 ,
       -0.01135553,  0.02909545, -0.07415076,  0.12196563, -0.02

In [19]:
# compute cosine similarity
dot = np.dot(a, b)
norma = np.linalg.norm(a)
normb = np.linalg.norm(b)
cos = dot / (norma * normb)
cos

0.99988705

In [20]:
a.shape

(512,)

In [21]:
a

array([ 0.05094784, -0.1194302 , -0.01305547,  0.04961686,  0.0010808 ,
        0.03239628,  0.03873453,  0.03194608,  0.00582984, -0.06031943,
       -0.01661889, -0.05935328,  0.02509349, -0.0413798 , -0.02748247,
        0.04381291,  0.05036264, -0.09866952,  0.02364658,  0.03341468,
       -0.0993137 , -0.05040878, -0.04516822,  0.02754081, -0.07625183,
        0.11011225, -0.00776107,  0.04898146, -0.00616252,  0.01729531,
        0.01331398, -0.04029948,  0.09127349,  0.04529446, -0.03115183,
       -0.02364399, -0.04764993, -0.10479593,  0.03877261, -0.06110352,
       -0.01739862, -0.02384081, -0.00138791, -0.04493561,  0.0623336 ,
        0.00536566,  0.05302285,  0.00665313,  0.1257884 , -0.00639011,
       -0.08840448, -0.01504067, -0.08169718,  0.05542751,  0.02467211,
        0.00269305,  0.04367587,  0.04461711,  0.05469487,  0.03485423,
        0.12378426,  0.13468914, -0.05267326,  0.00861393,  0.00605997,
       -0.01199449,  0.02924055, -0.074156  ,  0.12336732, -0.02

In [22]:
aa = a.reshape(1,512)
aa 

array([[ 0.05094784, -0.1194302 , -0.01305547,  0.04961686,  0.0010808 ,
         0.03239628,  0.03873453,  0.03194608,  0.00582984, -0.06031943,
        -0.01661889, -0.05935328,  0.02509349, -0.0413798 , -0.02748247,
         0.04381291,  0.05036264, -0.09866952,  0.02364658,  0.03341468,
        -0.0993137 , -0.05040878, -0.04516822,  0.02754081, -0.07625183,
         0.11011225, -0.00776107,  0.04898146, -0.00616252,  0.01729531,
         0.01331398, -0.04029948,  0.09127349,  0.04529446, -0.03115183,
        -0.02364399, -0.04764993, -0.10479593,  0.03877261, -0.06110352,
        -0.01739862, -0.02384081, -0.00138791, -0.04493561,  0.0623336 ,
         0.00536566,  0.05302285,  0.00665313,  0.1257884 , -0.00639011,
        -0.08840448, -0.01504067, -0.08169718,  0.05542751,  0.02467211,
         0.00269305,  0.04367587,  0.04461711,  0.05469487,  0.03485423,
         0.12378426,  0.13468914, -0.05267326,  0.00861393,  0.00605997,
        -0.01199449,  0.02924055, -0.074156  ,  0.1

In [23]:
ba = b.reshape(1,512)

In [24]:
cos_lib = cosine_similarity(aa, ba)
cos_lib

array([[0.9998871]], dtype=float32)

### Positional encoding


In [32]:
pe1=aa.copy()
pe2=aa.copy()
pe3=aa.copy()
paa=aa.copy()
pba=ba.copy()
d_model=512
max_print=d_model
max_length=20

In [33]:
pe1[0]

array([ 0.05094784, -0.1194302 , -0.01305547,  0.04961686,  0.0010808 ,
        0.03239628,  0.03873453,  0.03194608,  0.00582984, -0.06031943,
       -0.01661889, -0.05935328,  0.02509349, -0.0413798 , -0.02748247,
        0.04381291,  0.05036264, -0.09866952,  0.02364658,  0.03341468,
       -0.0993137 , -0.05040878, -0.04516822,  0.02754081, -0.07625183,
        0.11011225, -0.00776107,  0.04898146, -0.00616252,  0.01729531,
        0.01331398, -0.04029948,  0.09127349,  0.04529446, -0.03115183,
       -0.02364399, -0.04764993, -0.10479593,  0.03877261, -0.06110352,
       -0.01739862, -0.02384081, -0.00138791, -0.04493561,  0.0623336 ,
        0.00536566,  0.05302285,  0.00665313,  0.1257884 , -0.00639011,
       -0.08840448, -0.01504067, -0.08169718,  0.05542751,  0.02467211,
        0.00269305,  0.04367587,  0.04461711,  0.05469487,  0.03485423,
        0.12378426,  0.13468914, -0.05267326,  0.00861393,  0.00605997,
       -0.01199449,  0.02924055, -0.074156  ,  0.12336732, -0.02

In [34]:
# Note: i already steps over the even indices 2k, so (2 * i)/d_model is twice the
# paper's exponent 2k/d_model. It is left unchanged so the output below still matches.
for i in range(0, max_print,2):
  pe1[0][i] = math.sin(pos1 / (10000 ** ((2 * i)/d_model)))
  paa[0][i] = (paa[0][i]*math.sqrt(d_model))+ pe1[0][i]
  pe1[0][i+1] = math.cos(pos1 / (10000 ** ((2 * i)/d_model)))
  paa[0][i+1] = (paa[0][i+1]*math.sqrt(d_model))+pe1[0][i+1]
  print(i,pe1[0][i],i+1,pe1[0][i+1])
  print(i,paa[0][i],i+1,paa[0][i+1])
  print("\n")

0 0.9092974 1 -0.41614684
0 2.0621154 1 -3.1185439


2 0.95814437 3 -0.28628543
2 0.6627328 3 0.836416


4 0.98704624 5 -0.16043596
4 1.0115019 5 0.57260823


6 0.9991642 7 -0.040876657
6 1.8756264 7 0.68198067


8 0.99748 9 0.07094825
8 1.1293943 9 -1.2939247


10 0.984703 11 0.17424123
10 0.60866046 11 -1.1687702


12 0.9632266 13 0.2686903
12 1.5310274 13 -0.66762775


14 0.9351183 15 0.35433567
14 0.313261 15 1.3457086


16 0.9021307 17 0.43146282
16 2.041707 17 -1.8011736


18 0.8657256 19 0.5005189
18 1.4007865 19 1.2566069


20 0.8271038 21 0.5620492
20 -1.4201087 21 -0.57857126


22 0.7872378 23 0.6166495
22 -0.23480235 23 1.2398268


24 0.74690354 25 0.6649324
24 -0.9784785 25 3.1564882


26 0.7067105 27 0.7075028
26 0.53109753 27 1.8158268


28 0.6671291 29 0.7449421
28 0.5276871 29 1.1362903


30 0.62851435 31 0.777798
30 0.92977536 31 -0.13407512


32 0.5911271 33 0.8065784
32 2.6564105 33 1.8314749


34 0.55515176 35 0.8317491
34 -0.14973363 35 0.29674655


36 0.5207113 37

In [35]:
#print(pe1)
# A method in PyTorch using torch.exp and math.log:
max_len=max_length                
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
#print(pe[:, 0::2])

In [36]:
for i in range(0, max_print,2):
  pe2[0][i] = math.sin(pos2 / (10000 ** ((2 * i)/d_model)))
  pba[0][i] = (pba[0][i]*math.sqrt(d_model))+ pe2[0][i]
            
  pe2[0][i+1] = math.cos(pos2 / (10000 ** ((2 * i)/d_model)))
  pba[0][i+1] = (pba[0][i+1]*math.sqrt(d_model))+ pe2[0][i+1]
               
  #print(i,pe2[0][i],i+1,pe2[0][i+1])
  #print(i,pba[0][i],i+1,pba[0][i+1])
  #print("\n")

In [37]:
print(word1,word2)
cos_lib = cosine_similarity(aa, ba)
print(cos_lib,"word similarity")
cos_lib = cosine_similarity(pe1, pe2)
print(cos_lib,"positional similarity")
cos_lib = cosine_similarity(paa, pba)
print(cos_lib,"positional encoding similarity")

black brown
[[0.9998871]] word similarity
[[0.8600013]] positional similarity
[[0.96447504]] positional encoding similarity
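In the original paper the encoding is PE(pos, 2k) = sin(pos / 10000^(2k/d_model)) and PE(pos, 2k+1) = cos(pos / 10000^(2k/d_model)). The two loops above step `i` over the even indices and then use `(2 * i)/d_model`, which doubles that exponent and does not agree with `div_term` in the PyTorch cell. The following is a minimal dependency-free sketch of the paper's formula. The helper names `sinusoidal_pe` and `cosine` are hypothetical, and the positions 2 and 10 are taken from `pos1` and `pos2`.

```python
import math

def sinusoidal_pe(pos, d_model=512, base=10000.0):
    """Return the positional-encoding vector for one position (paper formula)."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):           # i is the even dimension index 2k
        angle = pos / (base ** (i / d_model))
        pe[i] = math.sin(angle)               # even dimensions use sine
        pe[i + 1] = math.cos(angle)           # odd dimensions use cosine
    return pe

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

pe_black = sinusoidal_pe(2)    # pos1 in the notebook
pe_brown = sinusoidal_pe(10)   # pos2 in the notebook
print(pe_black[:4])
print(cosine(pe_black, pe_brown))
```

Because the exponent differs, the positional similarity computed this way need not match the value printed above.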


In [38]:
print(word1)
print("embedding")
print(aa)
print("positional encoding")
print(pe1)
print("encoded embedding")
print(paa)

print("========================")

print(word2)
print("embedding")
print(ba)
print("positional encoding")
print(pe2)
print("encoded embedding")
print(pba)

black
embedding
[[ 0.05094784 -0.1194302  -0.01305547  0.04961686  0.0010808   0.03239628
   0.03873453  0.03194608  0.00582984 -0.06031943 -0.01661889 -0.05935328
   0.02509349 -0.0413798  -0.02748247  0.04381291  0.05036264 -0.09866952
   0.02364658  0.03341468 -0.0993137  -0.05040878 -0.04516822  0.02754081
  -0.07625183  0.11011225 -0.00776107  0.04898146 -0.00616252  0.01729531
   0.01331398 -0.04029948  0.09127349  0.04529446 -0.03115183 -0.02364399
  -0.04764993 -0.10479593  0.03877261 -0.06110352 -0.01739862 -0.02384081
  -0.00138791 -0.04493561  0.0623336   0.00536566  0.05302285  0.00665313
   0.1257884  -0.00639011 -0.08840448 -0.01504067 -0.08169718  0.05542751
   0.02467211  0.00269305  0.04367587  0.04461711  0.05469487  0.03485423
   0.12378426  0.13468914 -0.05267326  0.00861393  0.00605997 -0.01199449
   0.02924055 -0.074156    0.12336732 -0.02915745 -0.02962208  0.08576041
   0.06050992 -0.04313795 -0.03156101 -0.04592148  0.04085413  0.01844241
  -0.02559156  0.02241

### Sub-layer 1 : Multi-head attention


##### Step 1: Represent the input

In [None]:
import numpy as np
from scipy.special import softmax

In [None]:
print("Step 1: Input : 3 inputs, d_model=4")
x =np.array([[1.0, 0.0, 1.0, 0.0],   # Input 1
             [0.0, 2.0, 0.0, 2.0],   # Input 2
             [1.0, 1.0, 1.0, 1.0]])  # Input 3
print(x)

Step 1: Input : 3 inputs, d_model=4
[[1. 0. 1. 0.]
 [0. 2. 0. 2.]
 [1. 1. 1. 1.]]


##### Step 2: Initializing the weight matrices

In [None]:
print("Step 2: weights 3 dimensions x d_model=4")
print("w_query")
w_query =np.array([[1, 0, 1],
                   [1, 0, 0],
                   [0, 0, 1],
                   [0, 1, 1]])
print(w_query)

Step 2: weights 3 dimensions x d_model=4
w_query
[[1 0 1]
 [1 0 0]
 [0 0 1]
 [0 1 1]]


In [None]:
print("w_key")
w_key =np.array([[0, 0, 1],
                 [1, 1, 0],
                 [0, 1, 0],
                 [1, 1, 0]])
print(w_key)

w_key
[[0 0 1]
 [1 1 0]
 [0 1 0]
 [1 1 0]]


In [None]:
print("w_value")
w_value = np.array([[0, 2, 0],
                    [0, 3, 0],
                    [1, 0, 3],
                    [1, 1, 0]])
print(w_value)

w_value
[[0 2 0]
 [0 3 0]
 [1 0 3]
 [1 1 0]]


##### Step 3: Matrix multiplication to obtain Q, K, V

In [None]:
print("Step 3: Matrix multiplication to obtain Q,K,V")
print("Query: x * w_query")
Q=np.matmul(x,w_query)
print(Q)

Step 3: Matrix multiplication to obtain Q,K,V
Query: x * w_query
[[1. 0. 2.]
 [2. 2. 2.]
 [2. 1. 3.]]


In [None]:
print("Key: x * w_key")
K=np.matmul(x,w_key)
print(K)

Key: x * w_key
[[0. 1. 1.]
 [4. 4. 0.]
 [2. 3. 1.]]


In [None]:
print("Value: x * w_value")
V=np.matmul(x,w_value)
print(V)

Value: x * w_value
[[1. 2. 3.]
 [2. 8. 0.]
 [2. 6. 3.]]


##### Step 4: Scaled attention scores

In [None]:
print("Step 4: Scaled Attention Scores")
k_d = 1  # sqrt(d_k) with d_k = 3 is about 1.73; rounded down to 1 for this example
attention_scores = (Q @ K.transpose())/k_d 
print(attention_scores)

Step 4: Scaled Attention Scores
[[ 2.  4.  4.]
 [ 4. 16. 12.]
 [ 4. 12. 10.]]
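The cell above divides by 1 to keep the numbers round. As a hedged sketch of what the paper's scaling by the square root of d_k would give for the same toy matrices, the snippet below recomputes the scores in plain Python. `Q` and `K` are copied from the Step 3 outputs, and the variable names `raw_scores` and `scaled_scores` are illustrative only.

```python
import math

# Q and K as printed in Step 3 (3 inputs, 3 dimensions each).
Q = [[1.0, 0.0, 2.0], [2.0, 2.0, 2.0], [2.0, 1.0, 3.0]]
K = [[0.0, 1.0, 1.0], [4.0, 4.0, 0.0], [2.0, 3.0, 1.0]]

d_k = len(K[0])              # key dimension, 3 in this toy example
scale = math.sqrt(d_k)

# Q @ K^T: dot product of every query row with every key row.
raw_scores = [[sum(q * k for q, k in zip(q_row, k_row)) for k_row in K]
              for q_row in Q]
scaled_scores = [[s / scale for s in row] for row in raw_scores]

for row in scaled_scores:
    print(row)
```

With the true scale the scores shrink by a factor of about 1.73, which makes the softmax in the next step less sharply peaked than in the notebook's output.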


##### Step 5: Scaled softmax attention scores for each vector

In [None]:
print("Step 5: Scaled softmax attention_scores for each vector")
attention_scores[0]=softmax(attention_scores[0])
attention_scores[1]=softmax(attention_scores[1])
attention_scores[2]=softmax(attention_scores[2])
print(attention_scores[0])
print(attention_scores[1])
print(attention_scores[2])

Step 5: Scaled softmax attention_scores for each vector
[0.06337894 0.46831053 0.46831053]
[6.03366485e-06 9.82007865e-01 1.79861014e-02]
[2.95387223e-04 8.80536902e-01 1.19167711e-01]


##### Step 6: The final attention representations

In [None]:
print("Step 6: attention value obtained by score1/k_d * V")
print(V[0])
print(V[1])
print(V[2])

Step 6: attention value obtained by score1/k_d * V
[1. 2. 3.]
[2. 8. 0.]
[2. 6. 3.]


In [None]:
print("Attention 1")
attention1=attention_scores[0][0]*V[0]
print(attention1)

Attention 1
[0.06337894 0.12675788 0.19013681]


In [None]:
print("Attention 2")
attention2=attention_scores[0][1]*V[1]
print(attention2)

Attention 2
[0.93662106 3.74648425 0.        ]


In [None]:
print("Attention 3")
attention3=attention_scores[0][2]*V[2]
print(attention3)

Attention 3
[0.93662106 2.80986319 1.40493159]


##### Step 7: Summing up the results

In [None]:
print("Step7: summed the results to create the first line of the output matrix")
attention_input1=attention1+attention2+attention3
print(attention_input1)

Step7: summed the results to create the first line of the output matrix
[1.93662106 6.68310531 1.59506841]


##### Step 8: Steps 1 to 7 for all the inputs

In [None]:
print("Step 8: Step 1 to 7 for inputs 1 to 3")
# Steps 1 to 7 are not repeated here: random values stand in for 3 results obtained with learned weights (nothing is trained in this example).
# Following the original Transformer paper, we assume 3 results of 64 dimensions each.
attention_head1=np.random.random((3, 64))
print(attention_head1)

Step 8: Step 1 to 7 for inputs 1 to 3
[[0.98550421 0.69239609 0.41630611 0.77804287 0.24200857 0.67972555
  0.53339776 0.89381814 0.68144074 0.32319275 0.05732555 0.53927535
  0.00385359 0.78189817 0.29055389 0.0443541  0.85531549 0.91210496
  0.83753919 0.80081719 0.33810596 0.83194161 0.49140682 0.42388299
  0.72766921 0.77616781 0.47770755 0.28895931 0.36764413 0.64254159
  0.33070623 0.39721833 0.79803024 0.72290943 0.1831662  0.30067515
  0.46350314 0.28110154 0.49200543 0.28043443 0.63799939 0.97766918
  0.79009447 0.80628712 0.21188533 0.04959939 0.7192935  0.72066895
  0.16076112 0.40831617 0.07500193 0.88284874 0.42545103 0.84235532
  0.72652677 0.63126199 0.29851319 0.56029458 0.30546748 0.14275568
  0.4424857  0.67414772 0.8411103  0.73755977]
 [0.06845164 0.7830146  0.30075872 0.92254343 0.84102332 0.53608184
  0.48808664 0.31126187 0.41049226 0.88877125 0.55029036 0.61472428
  0.44580105 0.5125898  0.87953088 0.79155344 0.81651916 0.22353947
  0.34829946 0.40479455 0.33525
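The cell above substitutes random 64-dimensional values for the remaining rows. For the 3-dimensional toy example, Steps 5 to 7 can instead be applied to every input at once. The sketch below is written in plain Python, with the scores taken from the Step 4 output (where `k_d = 1`) and `V` from Step 3. The `softmax` helper is a local stand-in for `scipy.special.softmax`.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Scores from Step 4 (k_d = 1) and values from Step 3.
scores = [[2.0, 4.0, 4.0], [4.0, 16.0, 12.0], [4.0, 12.0, 10.0]]
V = [[1.0, 2.0, 3.0], [2.0, 8.0, 0.0], [2.0, 6.0, 3.0]]

# Step 5 for every row, then Steps 6 and 7 as a weighted sum of the rows of V.
weights = [softmax(row) for row in scores]
attention = [[sum(w * v_row[j] for w, v_row in zip(w_row, V))
              for j in range(len(V[0]))]
             for w_row in weights]

for row in attention:
    print(row)
```

The first printed row corresponds to `attention_input1` from Step 7. The other two rows are the attention outputs for inputs 2 and 3.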

##### Step 9: The output of the heads of the attention sub-layer

In [None]:
print("Step 9: We assume we have trained the 8 heads of the attention sub-layer")
z0h1=np.random.random((3, 64))
z1h2=np.random.random((3, 64))
z2h3=np.random.random((3, 64))
z3h4=np.random.random((3, 64))
z4h5=np.random.random((3, 64))
z5h6=np.random.random((3, 64))
z6h7=np.random.random((3, 64))
z7h8=np.random.random((3, 64))
print("shape of one head",z0h1.shape,"dimension of 8 heads",64*8)

Step 9: We assume we have trained the 8 heads of the attention sub-layer
shape of one head (3, 64) dimension of 8 heads 512


##### Step 10: Concatenation of the output of the heads

In [None]:
print("Step 10: Concatenation of heads 1 to 8 to obtain the original 8x64=512 output dimension of the model")
output_attention=np.hstack((z0h1,z1h2,z2h3,z3h4,z4h5,z5h6,z6h7,z7h8))
print(output_attention)

Step 10: Concatenation of heads 1 to 8 to obtain the original 8x64=512 output dimension of the model
[[0.41873839 0.18751283 0.40838379 ... 0.16434145 0.60697705 0.52614693]
 [0.59202435 0.44543335 0.32357525 ... 0.80058391 0.58525933 0.78519285]
 [0.48441961 0.3221019  0.49370963 ... 0.00242846 0.94489259 0.22310239]]
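In the original paper the concatenated heads are then multiplied by an output projection matrix, usually written W^O, which this notebook does not show. The sketch below illustrates concatenation followed by that projection at a reduced size: two hypothetical heads of dimension 2 instead of eight heads of dimension 64. The identity matrix is only a placeholder for W^O and does not represent learned weights.

```python
# Two hypothetical heads: 3 tokens each, head dimension 2.
head1 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
head2 = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]

# Concatenate per token, the role np.hstack plays in the notebook.
concat = [h1 + h2 for h1, h2 in zip(head1, head2)]

d_model = len(concat[0])   # 2 heads x 2 dimensions = 4

# Placeholder output projection W^O: an identity matrix (assumption, not learned).
w_o = [[1.0 if r == c else 0.0 for c in range(d_model)] for r in range(d_model)]

# output = concat @ W^O, one row per token.
output = [[sum(row[k] * w_o[k][j] for k in range(d_model))
           for j in range(d_model)]
          for row in concat]

for row in output:
    print(row)
```

With an identity projection the output equals the concatenation. A trained W^O would mix information across heads while keeping the d_model width.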


### Sub-layer 2 : Feedforward network


## The decoder stack
* Output embedding and position encoding
* The attention layers
* The FFN sub-layer, the Post-LN, and the linear layer


### Output embedding and position encoding


### The attention layers


### The FFN sub-layer, the Post-LN, and the linear layer


# Training and performance
* Before we end the chapter


# Summary