# The transformer model for sequence prediction

Deep learning is about *learning* useful *functions* from large *datasets*. These functions are called neural networks, and they are composed of smaller functions whose parameters are determined through optimization. In contrast to conventional programming, where we tell the computer what to do, a neural network learns from observational data and finds its own solution to the given problem. Here we implement the transformer model, one of the main components of large language models such as *ChatGPT*.

## **1.0** Structure of the datasets and the transformer model

**(1)** Let $a = 15$, $b = 7$, $c = 47$, $d = 152$

then we have $x = [1, 5, 7, 4, 7, 1, 5]$, $y = [1, 5, 2]$
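The numbers above are consistent with $d = a \cdot b + c$, since $15 \cdot 7 + 47 = 152$. The following is a minimal sketch of how such a pair could be built, assuming that $x$ holds the digits of $a$, $b$ and $c$ followed by all but the last digit of $d$, and that $y$ holds the digits of $d$. The helper names `digits` and `encode_example` are hypothetical and not part of the project code.

```python
def digits(n):
    """Return the decimal digits of a non-negative integer as a list."""
    return [int(ch) for ch in str(n)]


def encode_example(a, b, c):
    """Build (x, y) for d = a*b + c (assumed task, inferred from the example)."""
    d = a * b + c
    y = digits(d)
    # Training input: digits of a, b and c, then every digit of d except the last
    x = digits(a) + digits(b) + digits(c) + y[:-1]
    return x, y


x, y = encode_example(15, 7, 47)
print(x)  # [1, 5, 7, 4, 7, 1, 5]
print(y)  # [1, 5, 2]
```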


**(2)** Let the input be

$x^{(0)} = [1, 5, 7, 4, 7]$

Each step appends the last element of the previous model output to the input:

$x^{(1)} = [1, 5, 7, 4, 7, \hat{z_4}]$

$x^{(2)} = [1, 5, 7, 4, 7, \hat{z_4}, \hat{z_5}]$

$x^{(3)} = [1, 5, 7, 4, 7, \hat{z_4}, \hat{z_5}, \hat{z_6}]$

where $\hat{z_4} = \hat{z_4^{(0)}}$, $\hat{z_5} = \hat{z_5^{(1)}}$ and $\hat{z_6} = \hat{z_6^{(2)}}$ are taken from the model outputs

$f_{\theta}(x^{(0)}) = [\hat{z_0^{(0)}}, \hat{z_1^{(0)}}, \hat{z_2^{(0)}}, \hat{z_3^{(0)}}, \hat{z_4^{(0)}}]$

$f_{\theta}(x^{(1)}) = [\hat{z_0^{(1)}}, \hat{z_1^{(1)}}, \hat{z_2^{(1)}}, \hat{z_3^{(1)}}, \hat{z_4^{(1)}}, \hat{z_5^{(1)}}]$

$f_{\theta}(x^{(2)}) = [\hat{z_0^{(2)}}, \hat{z_1^{(2)}}, \hat{z_2^{(2)}}, \hat{z_3^{(2)}}, \hat{z_4^{(2)}}, \hat{z_5^{(2)}}, \hat{z_6^{(2)}}]$

If the optimization is successful, the result should be:

$\hat{z_4^{(0)}} = 1, \hat{z_5^{(1)}} = 5$ and $\hat{z_6^{(2)}} = 2$
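The iteration above can be sketched as a short loop. This is only an illustration: `toy_model` is a hypothetical stand-in for a perfectly trained $f_{\theta}$ that already knows the answer, and `generate` is an assumed helper name, not the project's `predict` function.

```python
def generate(model, x0, n_new):
    """Append the last model output to the input n_new times."""
    x = list(x0)
    for _ in range(n_new):
        z_hat = model(x)      # one prediction per input position
        x.append(z_hat[-1])   # only the last prediction is new
    return x[len(x0):]


def toy_model(x, answer=(1, 5, 2), n_input=5):
    """Hypothetical stand-in for f_theta: a perfectly trained model."""
    # Position i predicts element i + 1 of the full sequence (input + answer)
    full = list(x[:n_input]) + list(answer)
    return [full[i + 1] for i in range(len(x))]


y_hat = generate(toy_model, [1, 5, 7, 4, 7], n_new=3)
print(y_hat)  # [1, 5, 2]
```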

**(3)**

For the objective function to be zero, $\mathcal{L}(\theta, \mathcal{D}) = 0$, the probability distribution must be given by:

$\hat{Y} = onehot(y) = \begin{bmatrix}
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
\end{bmatrix}$

In this case $\hat{y}$ will be given by:

$\hat{y} := argmax(\hat{Y}) = y$

Then every correct class is assigned probability 1, so $\mathcal{L} = 0$ is fulfilled.
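The matrix above corresponds to $y = [4, 3, 2, 1]$ with $m = 5$ classes. As a minimal check, the sketch below builds the one-hot matrix with NumPy and evaluates a cross-entropy loss on it. The loss is assumed to be the mean negative log-probability of the correct class, and the function names are illustrative rather than taken from the project's `CrossEntropy` class.

```python
import numpy as np


def onehot(y, m):
    """Return an (m, n) matrix whose column j is the one-hot vector of y[j]."""
    Y = np.zeros((m, len(y)))
    Y[y, np.arange(len(y))] = 1.0
    return Y


def cross_entropy(Y_hat, y, m):
    """Mean negative log-probability of the correct classes (assumed form)."""
    p_correct = np.sum(onehot(y, m) * Y_hat, axis=0)
    return -np.mean(np.log(p_correct))


y = np.array([4, 3, 2, 1])
Y_hat = onehot(y, m=5)             # the matrix shown above
y_hat = np.argmax(Y_hat, axis=0)   # column-wise argmax recovers y
loss = cross_entropy(Y_hat, y, m=5)
print(y_hat, loss)
```

Because every column places all of its probability on the correct class, each logarithm is $\log 1 = 0$ and the loss vanishes.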

**(4)**

The number of parameters is given by:

$d(2m + n_{max} + L(4k + 2p))$
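The formula can be evaluated directly. The sketch below plugs in the two hyperparameter sets declared in section 2. The split into embedding and per-layer terms in the comments is an assumption about where each term comes from, and `count_parameters` is a hypothetical helper.

```python
def count_parameters(d, m, n_max, L, k, p):
    """Evaluate d(2m + n_max + L(4k + 2p))."""
    # Assumed origin: embedding and un-embedding (2m) plus position embedding (n_max)
    embedding = d * (2 * m + n_max)
    # Assumed origin: four attention matrices (4k) and two feed-forward matrices (2p)
    per_layer = d * (4 * k + 2 * p)
    return embedding + L * per_layer


print(count_parameters(d=10, m=2, n_max=9, L=2, k=5, p=15))    # 1130
print(count_parameters(d=20, m=5, n_max=13, L=2, k=10, p=25))  # 4060
```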

**(5)**

$X = onehot(x) = \begin{bmatrix}
0 \\
1
\end{bmatrix}, z_0 = W_E X + [W_P]_{0:n} = \begin{bmatrix}
1 & 0 \\
0 & \alpha
\end{bmatrix} \begin{bmatrix}
0 \\
1
\end{bmatrix} + \begin{bmatrix}
1 \\
0
\end{bmatrix} = \begin{bmatrix}
0 \\
\alpha
\end{bmatrix} + \begin{bmatrix}
1 \\
0
\end{bmatrix} = \begin{bmatrix}
1 \\
\alpha
\end{bmatrix}$

$Z = softmax(\begin{bmatrix}
1 \\
\alpha
\end{bmatrix}) = \begin{bmatrix}
\frac{e^1}{e^1+e^{\alpha}} \\
\frac{e^{\alpha}}{e^1+e^{\alpha}}
\end{bmatrix}$

$\hat{z} = argmax(Z) = 1 \Rightarrow \alpha > 1$. When $\alpha = 1$ both entries of $Z$ equal $\frac{1}{2}$, so the argmax is not uniquely defined.
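The condition on $\alpha$ can be checked numerically. This is a small sketch with NumPy that uses the matrices from the example; the `softmax` helper is written here for illustration and is not the project's `Softmax` layer.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax of a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)


X = np.array([0.0, 1.0])      # onehot(x)
W_P = np.array([1.0, 0.0])    # position embedding column

results = {}
for alpha in (0.5, 1.0, 2.0):
    W_E = np.array([[1.0, 0.0], [0.0, alpha]])
    Z = softmax(W_E @ X + W_P)
    results[alpha] = int(np.argmax(Z))
    print(alpha, Z, results[alpha])
```

Note that `np.argmax` returns the first maximal index on a tie, so it reports 0 for $\alpha = 1$ even though the maximum is not unique there.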


## **2.0** Implementing the transformer model

**(1)** 

1) If the layer is a `LinearLayer` or an `Attention` layer, `NeuralNetwork` uses the `step_gd` method that the layer inherits from the `Layer` class.

2) If the layer is an `EmbedPosition` layer, `NeuralNetwork` uses `step_gd` from the `EmbedPosition` class.

3) If the layer is a `FeedForward` layer, `NeuralNetwork` uses `step_gd` from the `FeedForward` class.
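The three cases above follow from ordinary method lookup in Python: a subclass that does not define `step_gd` falls back to the base class, while a subclass that defines it overrides the base version. The toy classes below only illustrate this mechanism. They reuse the class names for readability, but their bodies are hypothetical and much simpler than the project's classes.

```python
class Layer:
    """Hypothetical base class with a default gradient-descent step."""

    def __init__(self):
        self.log = []

    def step_gd(self, alpha):
        self.log.append("Layer.step_gd")


class LinearLayer(Layer):
    pass  # no override: falls back to Layer.step_gd


class FeedForward(Layer):
    def step_gd(self, alpha):
        # Override, for example to update the sub-layers it contains
        self.log.append("FeedForward.step_gd")


class NeuralNetwork:
    def __init__(self, layers):
        self.layers = layers

    def step_gd(self, alpha):
        for layer in self.layers:
            layer.step_gd(alpha)  # resolved from the layer's own class


net = NeuralNetwork([LinearLayer(), FeedForward()])
net.step_gd(alpha=0.01)
print([layer.log for layer in net.layers])
# [['Layer.step_gd'], ['FeedForward.step_gd']]
```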


In [None]:
import pickle

import numpy as np

from neural_network import *
from layers import *
from data_generators import get_train_test_addition, get_train_test_sorting
from training import trainModel
from training import *


In [None]:
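r = 5             # length of each sequence to sort
m = 2             # number of distinct symbols (vocabulary size)
batchSize = 250   # samples per batch
batches = 10      # number of batches
d = 10            # embedding dimension
k = 5             # attention dimension
p = 15            # hidden dimension of the feed-forward layer
L = 2             # number of transformer layers (not passed to the layers below)
n_max = 2*r - 1   # maximum input length
sigma = Relu      # activation function (not passed to the layers below)

data = get_train_test_sorting(r, m, batchSize, batches)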


In [None]:
embed = EmbedPosition(n_max, m, d)
att1 = Attention(d, k)
ff1 = FeedForward(d, p)
un_embed = LinearLayer(d, m)
softmax = Softmax()
loss = CrossEntropy()

# Layers are applied in the order listed
nn = NeuralNetwork([embed, att1, ff1, un_embed, softmax])

In [None]:
losses = trainModel(nn, data, 100, loss, m)

In [None]:

# Run only after training a new model: this overwrites the saved file
# with open("sortingTrained_v1", 'wb') as f:
#     pickle.dump(nn, f)

In [None]:
with open("savedObject", 'rb') as f:
    nn2 = pickle.load(f)

type(nn2)

In [None]:
with open("sortingTrained_v1", "rb") as f:
    nn = pickle.load(f)

y_pred = predict(nn, data['x_test'], r, m)

In [None]:
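print(y_pred)
print()
print(data['y_test'])

# Number of test sequences in which every predicted element matches the target
matches_per_sequence = np.count_nonzero(y_pred == data['y_test'], axis=2)
n_correct = np.count_nonzero(matches_per_sequence == y_pred.shape[-1])
print(n_correct)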
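The count above can also be expressed as a fraction of fully correct sequences. The sketch below uses small hand-made arrays and assumes the shape `(batches, batch size, sequence length)`, which is inferred from the `axis=2` argument above. `sequence_accuracy` is a hypothetical helper.

```python
import numpy as np


def sequence_accuracy(y_pred, y_true):
    """Fraction of sequences in which every element matches."""
    correct = np.all(y_pred == y_true, axis=-1)  # one boolean per sequence
    return correct.mean()


# Toy data: one batch with two sequences of length three (assumed layout)
y_true = np.array([[[0, 1, 1], [0, 0, 1]]])
y_pred = np.array([[[0, 1, 1], [1, 0, 1]]])
print(sequence_accuracy(y_pred, y_true))  # 0.5
```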

In [None]:
r = 7
m = 5
batchSize = 250
batches = 10
iterations = 300
d = 20
k = 10
p = 25
L = 2
n_max = 2*r-1
sigma = Relu

In [None]:
data = get_train_test_sorting(r, m, batchSize, batches)

In [None]:
embed = EmbedPosition(n_max, m, d)
att1 = Attention(d, k)
ff1 = FeedForward(d, p)
un_embed = LinearLayer(d, m)
softmax = Softmax()
loss = CrossEntropy()

nn = NeuralNetwork([embed, att1, ff1, un_embed, softmax])

In [None]:
losses = trainModel(nn, data, iterations, loss, m)

In [None]:
# Run only after training a new model: this overwrites the saved file
with open("sortingTrained_v2", 'wb') as f:
    pickle.dump(nn, f)
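One way to guard against overwriting a trained model by accident is to open the file in exclusive-creation mode, which fails if the file already exists. The sketch below demonstrates this on a plain dictionary in a temporary directory. `save_new` and the file name are hypothetical, and the notebook itself uses the ordinary `'wb'` mode.

```python
import os
import pickle
import tempfile


def save_new(obj, path):
    """Pickle obj to path, raising FileExistsError if path already exists."""
    with open(path, "xb") as f:  # "x" means exclusive creation
        pickle.dump(obj, f)


path = os.path.join(tempfile.mkdtemp(), "model_demo.pkl")
save_new({"weights": [1, 2, 3]}, path)

try:
    save_new({"weights": [0, 0, 0]}, path)  # second save is refused
    overwritten = True
except FileExistsError:
    overwritten = False

with open(path, "rb") as f:
    restored = pickle.load(f)
print(overwritten, restored)  # False {'weights': [1, 2, 3]}
```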