# The transformer model for sequence prediction

Deep learning is about *learning* useful *functions* from large *datasets*. These functions are called neural networks; they are composed of smaller functions whose parameters are determined through optimization. In contrast to conventional programming, where we tell the computer what to do, a neural network learns from observational data and finds its own solution to the given problem. Here we will implement the transformer model, one of the main components of large language models like *ChatGPT*.

## **1.0** Structure of the datasets and the transformer model

**(1)** Let $a = 15$, $b = 7$, $c = 47$, $d = 152$ (note that $a \cdot b + c = d$).

Then we have $x = [1, 5, 7, 4, 7, 1, 5]$, $y = [1, 5, 2]$.
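The construction of the two sequences can be reproduced in a few lines of Python. This is a sketch under the assumption that the input consists of the digits of $a$, $b$ and $c$ followed by all but the last digit of $d$, and that $y$ holds the digits of $d$. The helper `digits` is hypothetical and not part of the project code.

```python
def digits(n):
    """Return the decimal digits of a non-negative integer as a list."""
    return [int(ch) for ch in str(n)]


a, b, c, d = 15, 7, 47, 152

# Digits of the three input numbers, written one after another.
inputs = digits(a) + digits(b) + digits(c)

# The target sequence is the digits of d.
y = digits(d)

# Assumed training input: the input digits followed by all but the last target digit.
x = inputs + y[:-1]

print(x)
print(y)
```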


**(2)** Let   

$x^{(0)} = [1, 5, 7, 4, 7]$

$x^{(1)} = [1, 5, 7, 4, 7, \hat{z_4}]$

$x^{(2)} = [1, 5, 7, 4, 7, \hat{z_4}, \hat{z_5}]$

$x^{(3)} = [1, 5, 7, 4, 7, \hat{z_4}, \hat{z_5}, \hat{z_6}]$

$f_{\theta}(x^{(0)}) = [\hat{z_0^{(0)}}, \hat{z_1^{(0)}}, \hat{z_2^{(0)}}, \hat{z_3^{(0)}}, \hat{z_4^{(0)}}]$

$f_{\theta}(x^{(1)}) = [\hat{z_0^{(1)}}, \hat{z_1^{(1)}}, \hat{z_2^{(1)}}, \hat{z_3^{(1)}}, \hat{z_4^{(1)}}, \hat{z_5^{(1)}}]$

$f_{\theta}(x^{(2)}) = [\hat{z_0^{(2)}}, \hat{z_1^{(2)}}, \hat{z_2^{(2)}}, \hat{z_3^{(2)}}, \hat{z_4^{(2)}}, \hat{z_5^{(2)}}, \hat{z_6^{(2)}}]$

If the optimization succeeds, the result should be:

$\hat{z_4^{(0)}} = 1, \hat{z_5^{(1)}} = 5$ and $\hat{z_6^{(2)}} = 2$
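The step-by-step extension of the input can be written as a short loop. The sketch below assumes that the model returns one prediction per input position and that only the last prediction is appended in each step. `generate` and `toy_model` are hypothetical names; `toy_model` is a stand-in that simply returns the expected digits, not a trained network.

```python
def generate(f_theta, x0, n_out):
    """Extend x0 by n_out symbols, appending the model's last prediction each step."""
    x = list(x0)
    for _ in range(n_out):
        z_hat = f_theta(x)      # one prediction per input position
        x.append(z_hat[-1])     # only the last prediction is new information
    return x


def toy_model(x, n_in=5, target=(1, 5, 2)):
    """Stand-in for a perfectly trained model (assumption for illustration only)."""
    z_hat = [0] * len(x)                 # earlier positions are irrelevant here
    z_hat[-1] = target[len(x) - n_in]    # next digit of the expected answer
    return z_hat


x3 = generate(toy_model, [1, 5, 7, 4, 7], n_out=3)
print(x3)
```

The last three entries of `x3` play the role of $\hat{z_4}$, $\hat{z_5}$ and $\hat{z_6}$.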

**(3)**

For the objective function to satisfy $\mathcal{L}(\theta, \mathcal{D}) = 0$, the predicted probability distribution must be given by:

$\hat{Y} = onehot(y) = \begin{bmatrix}
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
\end{bmatrix}$

In this case $\hat{y}$ will be given by:

$\hat{y} := argmax(\hat{Y}) = y$

Then $\mathcal{L} = 0$ is fulfilled.
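A small numerical check illustrates this argument. The sketch assumes that the cross-entropy is the mean negative log-probability assigned to the correct symbol at each position, and it reads $y = [4, 3, 2, 1]$ with $m = 5$ off the matrix above. The function names are hypothetical and do not refer to the project's `CrossEntropy` class.

```python
import numpy as np


def onehot(y, m):
    """Return an (m, n) matrix whose column i has a single 1 in row y[i]."""
    return np.eye(m)[:, y]


def cross_entropy(Y_hat, y):
    """Mean negative log-probability of the correct symbol (assumed definition)."""
    n = len(y)
    p_correct = Y_hat[y, np.arange(n)]   # probability given to the true symbol
    return -np.mean(np.log(p_correct))


y = np.array([4, 3, 2, 1])
m = 5

Y_hat = onehot(y, m)                 # a "perfect" prediction
loss = cross_entropy(Y_hat, y)       # log(1) = 0 at every position
y_hat = np.argmax(Y_hat, axis=0)     # recovers y

print(Y_hat)
print(loss, y_hat)
```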

**(4)**

The number of parameters is given by:

$d(2m + n_{max} + L(4k + 2p))$
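The formula can be evaluated term by term. The breakdown below is an assumption about which matrices the terms correspond to (embedding and positional matrices, an un-embedding matrix, four attention matrices and two feed-forward matrices per block). The function `count_parameters` is hypothetical, and the values are those of the first sorting experiment in this notebook.

```python
def count_parameters(d, m, n_max, L, k, p):
    """Evaluate d * (2m + n_max + L * (4k + 2p)) as a sum of its terms."""
    embedding = d * m + d * n_max        # assumed: token and positional embeddings
    unembedding = d * m                  # assumed: final linear layer
    per_block = 4 * k * d + 2 * p * d    # assumed: attention + feed-forward matrices
    return embedding + unembedding + L * per_block


n_params = count_parameters(d=10, m=2, n_max=9, L=2, k=5, p=15)
print(n_params)
```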

**(5)**

$X = onehot(x) = \begin{bmatrix}
0 \\
1
\end{bmatrix}, z_0 = W_E X + [W_P]_{0:n} = \begin{bmatrix}
1 & 0 \\
0 & \alpha
\end{bmatrix} \begin{bmatrix}
0 \\
1
\end{bmatrix} + \begin{bmatrix}
1 \\
0
\end{bmatrix} = \begin{bmatrix}
0 \\
\alpha
\end{bmatrix} + \begin{bmatrix}
1 \\
0
\end{bmatrix} = \begin{bmatrix}
1 \\
\alpha
\end{bmatrix}$

$Z = softmax(\begin{bmatrix}
1 \\
\alpha
\end{bmatrix}) = \begin{bmatrix}
\frac{e^1}{e^1+e^{\alpha}} \\
\frac{e^{\alpha}}{e^1+e^{\alpha}}
\end{bmatrix}$

$\hat{z} = 1 \Rightarrow \alpha > 1$ (when $\alpha = 1$ the two probabilities are equal, so the argmax is not uniquely defined)
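The condition on $\alpha$ can be checked numerically. This is a minimal sketch that uses the matrices from the calculation above; the function names are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax of a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)


def predicted_symbol(alpha):
    """Return the argmax of softmax(W_E X + W_P) for the example above."""
    W_E = np.array([[1.0, 0.0], [0.0, alpha]])
    X = np.array([0.0, 1.0])       # one-hot encoding of x = 1
    W_P = np.array([1.0, 0.0])     # positional term for the single position
    z0 = W_E @ X + W_P             # equals [1, alpha]
    return int(np.argmax(softmax(z0)))


print(predicted_symbol(2.0))   # alpha > 1
print(predicted_symbol(0.5))   # alpha < 1
```

For $\alpha = 1$ the two probabilities are both $0.5$, so any result returned by `np.argmax` in that case reflects a tie-breaking rule rather than a property of the model.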


## **2.0** Implementing the transformer model

**(1)** 

1) If the layer is a `LinearLayer` or an `Attention` layer, `NeuralNetwork` uses the `step_gd` inherited from the `Layer` class.

2) If the layer is an `EmbedPosition` layer, `NeuralNetwork` uses the `step_gd` defined in the `EmbedPosition` class.

3) If the layer is a `FeedForward` layer, `NeuralNetwork` uses the `step_gd` defined in the `FeedForward` class.
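The three cases follow from ordinary method resolution in Python. The standalone sketch below uses empty stand-in classes with the same names to show which `step_gd` is picked for each layer type. The class bodies are hypothetical and say nothing about how the project's classes are actually implemented.

```python
class Layer:
    def step_gd(self, step_size):
        return "Layer.step_gd"


class LinearLayer(Layer):
    pass                                  # no override: falls back to Layer


class Attention(Layer):
    pass                                  # no override: falls back to Layer


class EmbedPosition(Layer):
    def step_gd(self, step_size):         # override
        return "EmbedPosition.step_gd"


class FeedForward(Layer):
    def step_gd(self, step_size):         # override
        return "FeedForward.step_gd"


layers = [EmbedPosition(), Attention(), FeedForward(), LinearLayer()]

# A network that loops over its layers gets the most specific step_gd for each.
used = [layer.step_gd(0.01) for layer in layers]
print(used)
```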


In [1]:
from neural_network import *
from layers import *
from training import trainModel
import numpy as np
from data_generators import get_train_test_addition, get_train_test_sorting
from training import *
import pickle


In [2]:
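r = 5              # length of each sequence to sort
m = 2              # number of distinct symbols
batchSize = 250    # samples per batch
batches = 10       # number of batches
d = 10             # embedding dimension
k = 5              # attention dimension
p = 15             # hidden dimension of the feed-forward layers
L = 2              # number of attention + feed-forward blocks
n_max = 2*r-1      # maximum sequence length passed through the model
sigma = Relu

data = get_train_test_sorting(r, m, batchSize, batches)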


In [3]:
embed = EmbedPosition(n_max,m,d)
un_embed = LinearLayer(d,m)
softmax = Softmax()
loss = CrossEntropy()

att_ffd_list = []
for layer in range(L):
    att = Attention(d,k)
    ff = FeedForward(d,p)
    att_ffd_list.append(att)
    att_ffd_list.append(ff)

layers = [embed] + att_ffd_list + [un_embed] + [softmax]
nn = NeuralNetwork(layers)

In [4]:
losses = trainModel(nn,data,300,loss, m)

Iterasjon  0  L =  0.6046926985748596 
Iterasjon  1  L =  0.49801911377953906 
Iterasjon  2  L =  0.43948746565609637 
Iterasjon  3  L =  0.40947410475133345 
Iterasjon  4  L =  0.38332806004625847 
Iterasjon  5  L =  0.3515574405690137 
Iterasjon  6  L =  0.3193117206256091 
Iterasjon  7  L =  0.31121071436912584 
Iterasjon  8  L =  0.3101651232492624 
Iterasjon  9  L =  0.30950412245753367 
Iterasjon  10  L =  0.3093691368499022 
Iterasjon  11  L =  0.30889027358656695 
Iterasjon  12  L =  0.30845554358385374 
Iterasjon  13  L =  0.3081874923249515 
Iterasjon  14  L =  0.3080559940166335 
Iterasjon  15  L =  0.30797661109480085 
Iterasjon  16  L =  0.30791050773803547 
Iterasjon  17  L =  0.30788247376032657 
Iterasjon  18  L =  0.3078496047361677 
Iterasjon  19  L =  0.30783263600469113 
Iterasjon  20  L =  0.30781044867617663 
Iterasjon  21  L =  0.3078028521151273 
Iterasjon  22  L =  0.30778956012321446 
Iterasjon  23  L =  0.30777026286586906 
Iterasjon  24  L =  0.3077499889576

In [5]:

# Run only after training a new model: this overwrites the saved file.
# with open("sortingTrained_v1", 'wb') as f:
#     pickle.dump(nn, f)

In [6]:
with open("savedObject", 'rb') as f:
    nn2 = pickle.load(f)

type(nn2)

neural_network.NeuralNetwork
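The save and load cells follow the usual `pickle` round trip. The sketch below shows the same pattern on a plain dictionary in a temporary directory, so that it can be run without overwriting any trained model. The object and file name are hypothetical.

```python
import os
import pickle
import tempfile

# A stand-in for a trained model (assumption for illustration only).
toy_model = {"weights": [0.1, 0.2, 0.3], "layers": 2}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")

    # Save: binary write mode is required by pickle.
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)

    # Load: binary read mode.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_model)
```

When the pickled object is an instance of a custom class, the class definition must be importable when loading, and pickle files should only be loaded from trusted sources.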

In [7]:
with open("sortingTrained_v1", "rb") as f:
    nn = pickle.load(f)

y_pred = predict(nn, data['x_test'], r, m)

0
(250, 5)
(250, 5)
1
(250, 6)
(250, 6)
2
(250, 7)
(250, 7)
3
(250, 8)
(250, 8)
4
(250, 9)
(250, 9)


In [8]:
print(y_pred)
print()
print(data['y_test'])
# Number of test sequences in which every position is predicted correctly.
# The value is stored here rather than displayed.
n_correct = np.count_nonzero(np.count_nonzero(y_pred == data['y_test'], axis=2) == y_pred.shape[-1])

[[[1. 1. 0. 1. 0.]
  [0. 1. 0. 1. 0.]
  [1. 0. 1. 1. 1.]
  ...
  [1. 0. 1. 0. 0.]
  [0. 0. 1. 0. 1.]
  [0. 0. 0. 1. 1.]]]

[[[0. 0. 1. 1. 1.]
  [0. 0. 0. 1. 1.]
  [0. 1. 1. 1. 1.]
  ...
  [0. 0. 0. 1. 1.]
  [0. 0. 0. 1. 1.]
  [0. 0. 0. 1. 1.]]]
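The nested `np.count_nonzero` expression in the cell above counts the sequences in which every position matches the target. The sketch below applies the same idea to two small hand-made arrays, assuming the shape `(batches, samples, sequence length)`. The function and array names are hypothetical.

```python
import numpy as np


def count_fully_correct(pred, true):
    """Count sequences whose entries all match along the last axis."""
    matches_per_sequence = np.count_nonzero(pred == true, axis=-1)
    return int(np.count_nonzero(matches_per_sequence == pred.shape[-1]))


# One batch with two sequences of length three.
toy_true = np.array([[[0, 0, 1], [0, 1, 1]]])
toy_pred = np.array([[[0, 0, 1], [1, 0, 1]]])

toy_correct = count_fully_correct(toy_pred, toy_true)
toy_fraction = toy_correct / (toy_true.shape[0] * toy_true.shape[1])

print(toy_correct, toy_fraction)
```

Dividing by the number of sequences turns the count into a fraction, which is easier to compare across test sets of different sizes.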

In [9]:
r = 7
m = 5
batchSize = 250
batches = 10
iterations = 300
d = 20
k = 10
p = 25
L = 2
n_max = 2*r-1
sigma = Relu

In [10]:
data = get_train_test_sorting(r,m,batchSize, batches)

In [11]:
embed = EmbedPosition(n_max,m,d)
att1 = Attention(d,k)
ff1 = FeedForward(d,p)
un_embed = LinearLayer(d,m)
softmax = Softmax()
loss = CrossEntropy()

nn = NeuralNetwork([embed,att1,ff1,un_embed,softmax])

In [12]:
losses = trainModel(nn,data, iterations, loss, m)

Iterasjon  0  L =  1.4944273295782693 
Iterasjon  1  L =  1.258161980523269 
Iterasjon  2  L =  1.2075041194257472 
Iterasjon  3  L =  1.1789771313743267 
Iterasjon  4  L =  1.1609216421805795 
Iterasjon  5  L =  1.1410543004512221 
Iterasjon  6  L =  1.1132539553655407 
Iterasjon  7  L =  1.0747589545173892 
Iterasjon  8  L =  1.0360798532922657 
Iterasjon  9  L =  1.0004070711238895 
Iterasjon  10  L =  0.9673399660441205 
Iterasjon  11  L =  0.9349262570498654 
Iterasjon  12  L =  0.9084424151223031 
Iterasjon  13  L =  0.8875280956722753 
Iterasjon  14  L =  0.8711623277260498 
Iterasjon  15  L =  0.8582133654631935 
Iterasjon  16  L =  0.8498469733492792 
Iterasjon  17  L =  0.843530321822405 
Iterasjon  18  L =  0.8359587790824647 
Iterasjon  19  L =  0.8282925983602558 
Iterasjon  20  L =  0.8230280246956824 
Iterasjon  21  L =  0.8094255094587242 
Iterasjon  22  L =  0.8012635962284435 
Iterasjon  23  L =  0.7884111113685399 
Iterasjon  24  L =  0.7756826969101736 
Iterasjon  2

In [13]:
# Run only after training a new model: this overwrites the saved file.
with open("sortingTrained_v2", 'wb') as f:
    pickle.dump(nn, f)