# Let's build a dataset for GPT

In [81]:
data_path = "../../data/input.txt"

with open(data_path, "r", encoding="utf-8") as file:
    text = file.read()

Let's examine the data

In [82]:
print(f"Length of text: {len(text)}")

Length of text: 1115394


In [83]:
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [4]:
# Examine the unique characters in the data
chars = sorted(set(text))
vocab_size = len(chars)
print(f"Vocab size: {vocab_size}")
print(f"Unique characters: {''.join(chars)}")


Vocab size: 65
Unique characters: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


## Tokenize the data

Convert the text to tokens, i.e. an integer representation of the text. For now we will use a simple character-level mapping, where each unique character is assigned one integer.

In [5]:
str2int = {ch: i for i, ch in enumerate(chars)}
int2str = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    # Encode a string into a list of integers
    return [str2int[c] for c in s]

def decode(ids):
    # Decode a list of integers back into a string
    return "".join(int2str[i] for i in ids)

print(encode("Hello, world!"))
print(decode(encode("Hello, world!")))

[20, 43, 50, 50, 53, 6, 1, 61, 53, 56, 50, 42, 2]
Hello, world!


Now let's run this over the entire dataset

In [None]:
import torch


In [7]:
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

## Train / Val split

We need to split the data into training and validation sets. We will use 90% of the data for training and 10% for validation. Measuring the loss on the held-out 10% shows how well the model generalises, rather than how well it has memorised the training text.

In [8]:
n = int(0.9 * len(data)) # 90% of the data for training
train_data = data[:n]
val_data = data[n:]

print(len(train_data), len(val_data))

1003854 111540


## Data loaders: Batching and chunking

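We never feed the whole text to the model at once. Instead we train on short chunks of at most `block_size` characters, sampled from random positions in the training data. A chunk of `block_size + 1` characters contains `block_size` training examples, one for each context length from 1 to `block_size`.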

In [9]:
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [10]:
# The target for each position is the next character in the sequence
x = train_data[:block_size]
y = train_data[1:block_size+1]

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"When input is {context} the target is {target}")

When input is tensor([18]) the target is 47
When input is tensor([18, 47]) the target is 56
When input is tensor([18, 47, 56]) the target is 57
When input is tensor([18, 47, 56, 57]) the target is 58
When input is tensor([18, 47, 56, 57, 58]) the target is 1
When input is tensor([18, 47, 56, 57, 58,  1]) the target is 15
When input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is 47
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is 58


In [11]:
# Set seed for reproducibility
torch.manual_seed(1337)
# How many independent sequences will we process in parallel
batch_size = 4
# The max sequence length
block_size = 8

def get_batch(split):
    data = train_data if split == "train" else val_data
    # Get random indices for the start of each sequence
    random_indices = torch.randint(len(data) - block_size, (batch_size,))
    # Stack the 1D chunks into a 2D (batch_size x block_size) input tensor
    x = torch.stack([data[i:i+block_size] for i in random_indices])
    # Targets are the same chunks shifted one position to the right
    y = torch.stack([data[i+1:i+block_size+1] for i in random_indices])
    return x, y

xb, yb = get_batch("train")
print(xb.shape)
print(yb.shape)
print("-----")

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"When input is {context.tolist()} the target is {target}")


torch.Size([4, 8])
torch.Size([4, 8])
-----
When input is [24] the target is 43
When input is [24, 43] the target is 58
When input is [24, 43, 58] the target is 5
When input is [24, 43, 58, 5] the target is 57
When input is [24, 43, 58, 5, 57] the target is 1
When input is [24, 43, 58, 5, 57, 1] the target is 46
When input is [24, 43, 58, 5, 57, 1, 46] the target is 43
When input is [24, 43, 58, 5, 57, 1, 46, 43] the target is 39
When input is [44] the target is 53
When input is [44, 53] the target is 56
When input is [44, 53, 56] the target is 1
When input is [44, 53, 56, 1] the target is 58
When input is [44, 53, 56, 1, 58] the target is 46
When input is [44, 53, 56, 1, 58, 46] the target is 39
When input is [44, 53, 56, 1, 58, 46, 39] the target is 58
When input is [44, 53, 56, 1, 58, 46, 39, 58] the target is 1
When input is [52] the target is 58
When input is [52, 58] the target is 1
When input is [52, 58, 1] the target is 58
When input is [52, 58, 1, 58] the target is 46
When inp

# Baseline Model: BiGram Model

We will start with a simple bigram model. This model predicts the next character in the sequence based only on the single character that precedes it.


In [12]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [21]:
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
    
    def __init__(self, vocab_size: int):
        super().__init__()
        # Each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx: torch.Tensor, targets: torch.Tensor | None = None) -> tuple[torch.Tensor, torch.Tensor | None]:
        # idx and targets are both (B, T) tensor of integers
        logits = self.token_embedding_table(idx)    # (B, T, C)
        
        if targets is None:
            loss = None
        else:
            # Compute the loss, cross entropy loss expects tensor of shape (B*T, C)
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets) 
        
        return logits, loss
    
    def generate(self, idx: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # Get the predictions
            logits, loss = self(idx)
            # Focus only on the last time step
            logits = logits[:, -1, :] # (B, C)
            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)
            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)
            # Append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=-1)
        return idx


In [22]:
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)
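
As a sanity check on this initial loss, a model that spread its probability uniformly over all 65 characters would score ln(65), roughly 4.17. The value above is higher because the randomly initialised embedding table produces uneven logits. A minimal sketch of that calculation follows; the all-zero `logits` tensor is an illustrative stand-in for a perfectly uniform model and is not part of the notebook.

```python
import math

import torch
import torch.nn.functional as F

vocab_size = 65  # matches the vocabulary size found earlier

# Cross entropy of a uniform distribution over the vocabulary: -ln(1 / vocab_size)
uniform_loss = math.log(vocab_size)
print(f"Uniform baseline loss: {uniform_loss:.4f}")

# Same number via PyTorch: all-zero logits give a uniform softmax
logits = torch.zeros(32, vocab_size)
targets = torch.randint(vocab_size, (32,))
torch_loss = F.cross_entropy(logits, targets).item()
print(f"cross_entropy on zero logits: {torch_loss:.4f}")
```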


In [23]:
idx = torch.zeros((1, 1), dtype=torch.long)
print(decode(m.generate(idx, max_new_tokens=100)[0].tolist()))


Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3


## Training the model

In [24]:
# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=3e-4)

In [29]:
batch_size = 32

for step in range(20000):
    # Get a batch of data
    xb, yb = get_batch("train")
    # Compute the loss
    logits, loss = m(xb, yb)
    # Backpropagate the loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(f"Loss: {loss.item()}")

Loss: 2.443176031112671
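
The loss printed above comes from the final training batch only, so it is noisy and says nothing about the held-out data. One way to get a steadier number for both splits is to average the loss over several batches with gradients disabled. The sketch below is self-contained, so it uses random stand-in data and a bare embedding table in place of `train_data`, `val_data` and `m`; `estimate_loss`, `eval_iters` and `splits` are hypothetical names introduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
vocab_size, block_size, batch_size = 65, 8, 32

# Stand-in data: random token ids instead of the encoded Shakespeare text
data = torch.randint(vocab_size, (10_000,))
n = int(0.9 * len(data))
splits = {"train": data[:n], "val": data[n:]}

# Stand-in model: a bigram-style lookup table from token to next-token logits
model = nn.Embedding(vocab_size, vocab_size)

def get_batch(split):
    d = splits[split]
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
    return x, y

@torch.no_grad()
def estimate_loss(eval_iters=50):
    """Average the loss over eval_iters random batches for each split."""
    model.eval()
    out = {}
    for split in splits:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            logits = model(xb)  # (B, T, vocab_size)
            losses[k] = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
        out[split] = losses.mean().item()
    model.train()
    return out

losses = estimate_loss()
print(f"train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
```

In the notebook, a helper like this could be called every few hundred steps inside the training loop to track both losses over time.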


Let's try the model again post training

In [31]:
idx = torch.zeros((1, 1), dtype=torch.long)
print(decode(m.generate(idx, max_new_tokens=300)[0].tolist()))


Tisthakes:
Th.
vethigrtiow?
N h ad h. deveare mat spulkert mal:

tiche manggre t MVAT:
Whoe,

HE:
Tet VORI haf d'd blll e ck.
WAUEGoshearo or t inter sheee!
AMand, f ts thanir INIsh s wis fowon aw; f t,
Wo ncrs, wal I:
Bucoite, muchusEThit inqzeleeY lalit de,
D wousikitefansthegod viverm te 'd IEThe


# Self Attention

## Mathematical trick in self attention

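We need to ensure that information only flows in one direction, from the past to the future: the token at position t may look at positions up to and including t, but never at later positions.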

In [33]:
# Start with a simple example
torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)
x.shape

torch.Size([4, 8, 2])

In [52]:
# We want x_bow[b, t] = mean_{i<=t} x[b, i]
# This is a bag-of-words representation (an average over all previous tokens), but the nested loops are very inefficient
x_bow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        x_prev = x[b, :t+1]
        x_bow[b, t] = torch.mean(x_prev, dim=0)


In [46]:
# The trick - this gives us a lower triangular matrix. Looks exactly like the mask for self attention
torch.tril(torch.ones(T, T))
# We can use this with matrix multiplication to get the bag of words representation
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
# We can get the average of the past tokens by normalizing the sum of the row to 1
a = a / torch.sum(a, dim=1, keepdim=True)
b = torch.randint(0, 10, (3, 2)).float()
c = a @ b
print('a = ', a)
print('b = ', b)
print('c = ', c)

a =  tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
b =  tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
c =  tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


We can now use batch matrix multiplication to get the bag of words representation for the entire batch efficiently

In [54]:
# wei (weights)
torch.manual_seed(1337)
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)
x_bow2 = wei @ x     # (T, T) broadcast to (B, T, T) @ (B, T, C) -> (B, T, C)

torch.allclose(x_bow, x_bow2, atol=1e-4)

True

The final trick is using the softmax function to get the weights for the self attention mechanism.

`tril` is a lower triangular matrix of ones. `wei` starts as a zero matrix, and `masked_fill` sets every entry above the diagonal (where `tril == 0`) to -inf.
We then apply softmax along each row. Since exp(-inf) = 0, the masked positions get zero weight, and the remaining entries are normalized to sum to 1.

The weights are essentially the attention scores for each token in the sequence. Setting the upper triangular part to -inf ensures that a token only aggregates information from itself and earlier tokens, never from the future.

Later, `wei` will no longer start as zeros: the scores will be computed from the tokens themselves, using projections that are learned during training.

In [56]:
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(x_bow, xbow3, atol=1e-4)


True
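
To see why the softmax form is the useful one, the same masking can be applied to scores that are not all zero. In this sketch the random `scores` tensor is an assumption standing in for learned affinities. Each row still sums to 1 and future positions still receive zero weight, but the past positions are no longer weighted equally.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1337)
T = 4

# Hypothetical data-dependent affinities between positions
scores = torch.randn(T, T)

# Mask out the future (upper triangle) before normalising
tril = torch.tril(torch.ones(T, T))
masked = scores.masked_fill(tril == 0, float("-inf"))
weights = F.softmax(masked, dim=-1)

print(weights)
print(weights.sum(dim=-1))  # every row sums to 1
```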

## Self Attention in the Transformer

Every token in the sequence emits a query, a key and a value. The attention score between two tokens is the dot product of one token's query with the other token's key. After masking and softmax, these scores become the weights used to aggregate the values.

In [65]:
torch.manual_seed(1337)
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)

# Initialise the head to perform the self attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

# Key: Information about what each token has
k = key(x)  # (B, T, head_size)
# Query: Information about what each token is looking for
q = query(x)  # (B, T, head_size)
# Value: Information about the token I will pass on if you are looking for it
v = value(x)  # (B, T, head_size)

# At this stage no communication between the tokens has occurred

# Compute the attention scores
wei = q @ k.transpose(-2, -1)  # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)

# Apply the mask to the attention scores
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

# Output is dot product of the attention scores and the value
out = wei @ v  # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)

In [60]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

In [63]:
out[0]

tensor([[-0.1571,  0.8801,  0.1615, -0.7824, -0.1429,  0.7468,  0.1007, -0.5239,
         -0.8873,  0.1907,  0.1762, -0.5943, -0.4812, -0.4860,  0.2862,  0.5710],
        [ 0.6764, -0.5477, -0.2478,  0.3143, -0.1280, -0.2952, -0.4296, -0.1089,
         -0.0493,  0.7268,  0.7130, -0.1164,  0.3266,  0.3431, -0.0710,  1.2716],
        [ 0.4823, -0.1069, -0.4055,  0.1770,  0.1581, -0.1697,  0.0162,  0.0215,
         -0.2490, -0.3773,  0.2787,  0.1629, -0.2895, -0.0676, -0.1416,  1.2194],
        [ 0.1971,  0.2856, -0.1303, -0.2655,  0.0668,  0.1954,  0.0281, -0.2451,
         -0.4647,  0.0693,  0.1528, -0.2032, -0.2479, -0.1621,  0.1947,  0.7678],
        [ 0.2510,  0.7346,  0.5939,  0.2516,  0.2606,  0.7582,  0.5595,  0.3539,
         -0.5934, -1.0807, -0.3111, -0.2781, -0.9054,  0.1318, -0.1382,  0.6371],
        [ 0.3428,  0.4960,  0.4725,  0.3028,  0.1844,  0.5814,  0.3824,  0.2952,
         -0.4897, -0.7705, -0.1172, -0.2541, -0.6892,  0.1979, -0.1513,  0.7666],
        [ 0.1866, -0.0

Notes:
- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across the batch dimension is processed completely independently; examples never "talk" to each other.
- In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module).
- "Scaled" attention additionally multiplies `wei` by 1/sqrt(head_size). This makes it so that when the inputs Q, K have unit variance, `wei` has unit variance too, and softmax stays diffuse rather than saturating. Illustration below.

In [68]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size ** -0.5

We can see from this scaled approach that the variance of the attention scores is ~1, which is what we want.

In [70]:
print(k.var())
print(q.var())
print(wei.var())

tensor(1.0966)
tensor(0.9416)
tensor(1.0065)


This is important because it ensures that the attention scores are not too large or too small, and that the softmax function does not saturate. This may happen if the attention scores are highly negative or positive, leading to a very peaked distribution (close to 1-hot vectors).

In [71]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [74]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]) * 8, dim=-1)

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])
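
For comparison, the same experiment can be run without the scaling factor. The sketch below assumes unit-variance random `q` and `k` as in the cell above (with a larger batch so the variance estimates are steadier). The unscaled scores have a variance of roughly `head_size`, and the softmax over an unscaled row is at least as peaked as the softmax over the scaled row.

```python
import torch

torch.manual_seed(1337)
B, T, head_size = 64, 32, 16

k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)

unscaled = q @ k.transpose(-2, -1)     # (B, T, T)
scaled = unscaled * head_size ** -0.5  # (B, T, T)

unscaled_var = unscaled.var().item()
scaled_var = scaled.var().item()
print(f"unscaled variance: {unscaled_var:.2f}")  # roughly head_size
print(f"scaled variance:   {scaled_var:.2f}")    # roughly 1

# Softmax over one row of scores: larger scores give a more peaked distribution
row_unscaled = torch.softmax(unscaled[0, 0], dim=-1)
row_scaled = torch.softmax(scaled[0, 0], dim=-1)
print(f"max weight, unscaled: {row_unscaled.max().item():.3f}")
print(f"max weight, scaled:   {row_scaled.max().item():.3f}")
```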

## Multi-Head Attention

We can use multiple attention heads to process the data in parallel. This allows the model to learn different representations of the data.

We can do this by concatenating the outputs of the attention heads.



In [75]:
n_embed = 32

class Head(nn.Module):
    """One head of self-attention"""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)  # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # Compute attention scores (affinities), scaled by 1/sqrt(head_size)
        head_size = k.shape[-1]
        wei = q @ k.transpose(-2, -1) * (head_size**-0.5)  # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        # Mask out future positions
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # (B, T, T)
        # Normalise each row into attention weights
        wei = F.softmax(wei, dim=-1)  # (B, T, T)

        # Perform the weighted aggregation of the values
        v = self.value(x)  # (B, T, head_size)
        out = wei @ v  # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out


class MultiHeadAttention(nn.Module):
    """Multiple heads of self-attention in parallel"""

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)

## Blocks

We can now create a block of the transformer model. This consists of a self attention layer followed by a feedforward layer.

In [76]:
class FeedForward(nn.Module):
    """A simple linear layer followed by a non-linearity"""

    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, n_embed),  # linear layer, width unchanged
            nn.ReLU(),  # ReLU activation function
        )

    def forward(self, x):
        return self.net(x)


class Block(nn.Module):
    """Transformer block: communication followed by computation"""

    def __init__(self, n_embed, n_heads):
        super().__init__()
        head_size = n_embed // n_heads
        self.sa_heads = MultiHeadAttention(num_heads=n_heads, head_size=head_size)
        self.ffwd = FeedForward(n_embed)

    def forward(self, x):
        x = self.sa_heads(x)  # (B,T,C)
        x = self.ffwd(x)  # (B,T,C)
        return x

## Residual connections

We can now add residual (skip) connections to the model. Each sub-layer then only has to learn the residual, i.e. the change to apply to its input, rather than a whole new representation.

A residual connection adds a direct path from the input of a layer to its output, bypassing the layer's weights, normalization and activation function. Gradients can flow along this path unimpeded, which makes deep networks much easier to optimize.

Early in training the attention and feedforward branches typically contribute little, so the signal mostly travels along the skip path; as training progresses the branches come online and contribute more.

This can be implemented simply by adding the input to the output of the layer.

We also add a projection layer at the end of the multi-head attention layer and of the feedforward layer. It maps the sub-layer output back into the residual stream, so that it has the same shape as the input and the two can be added.

In [77]:
class MultiHeadAttention(nn.Module):
    """Multiple heads of self-attention in parallel"""

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embed)  # project concatenated heads back to n_embed

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        return out

class FeedForward(nn.Module):
    """Two linear layers with a non-linearity in between"""

    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),  # expand to 4x the size
            nn.ReLU(),  # ReLU activation function
            nn.Linear(4 * n_embed, n_embed),  # projection layer back to n_embed
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Transformer block: communication followed by computation"""

    def __init__(self, n_embed, n_heads):
        super().__init__()
        head_size = n_embed // n_heads
        self.sa_heads = MultiHeadAttention(num_heads=n_heads, head_size=head_size)
        self.ffwd = FeedForward(n_embed)

    def forward(self, x):
        x = x + self.sa_heads(x)  # (B,T,C)
        x = x + self.ffwd(x)  # (B,T,C)
        return x

## Layer Normalization

We can now add layer normalization to the model. Layer normalization rescales each token's feature vector to zero mean and unit variance, independently of the other tokens and of the other examples in the batch.

This is done by subtracting the mean of the features and dividing by their standard deviation. A learnable scale (`gamma`) and shift (`beta`) are then applied.

This is important because it keeps the scale of the activations stable from layer to layer, which helps when optimizing deep networks.
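
As a quick numeric illustration of the normalization described above, the sketch below normalizes each row of a small random tensor by hand and compares the result with PyTorch's built-in `F.layer_norm`. The tensor sizes and the offset and scale applied to `x` are arbitrary choices for the example. Note that the built-in uses the biased variance, so the hand-written version passes `unbiased=False` to match it.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1337)
x = torch.randn(4, 8) * 3 + 5  # 4 rows with mean ~5 and std ~3

eps = 1e-5
mean = x.mean(dim=-1, keepdim=True)                # per-row mean
var = x.var(dim=-1, keepdim=True, unbiased=False)  # per-row (biased) variance
x_hat = (x - mean) / torch.sqrt(var + eps)

print(x_hat.mean(dim=-1))                 # close to 0 for every row
print(x_hat.std(dim=-1, unbiased=False))  # close to 1 for every row

# PyTorch's built-in layer norm, without a learnable scale or shift
reference = F.layer_norm(x, normalized_shape=(8,), eps=eps)
print(torch.allclose(x_hat, reference, atol=1e-5))
```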

In [79]:
class LayerNorm1d:
  """Normalize each row of the input to zero mean and unit variance"""

  def __init__(self, dim, eps=1e-5):
    self.eps = eps
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)

  def __call__(self, x):
    # calculate the forward pass
    xmean = x.mean(1, keepdim=True) # mean of each row (over the features)
    xvar = x.var(1, keepdim=True) # variance of each row (over the features)
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
    self.out = self.gamma * xhat + self.beta
    return self.out

  def parameters(self):
    return [self.gamma, self.beta]

In [80]:
torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size 32 of 100-dimensional vectors
x = module(x)
x.shape

torch.Size([32, 100])
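
The pieces above can be assembled into a small decoder-only model. The sketch below is one possible arrangement, not the notebook's own code. It uses PyTorch's `nn.LayerNorm` rather than `LayerNorm1d`, applies the normalization before each sub-layer (a common "pre-norm" choice, assumed here), and adds a learned position embedding because attention has no notion of position by itself. `TinyGPT`, `token_emb`, `pos_emb`, `ln_f`, `lm_head` and the default sizes are hypothetical names and values introduced for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    def __init__(self, n_embed, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        T = x.shape[1]
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        return F.softmax(wei, dim=-1) @ v  # (B, T, head_size)

class Block(nn.Module):
    def __init__(self, n_embed, n_heads, block_size):
        super().__init__()
        head_size = n_embed // n_heads
        self.heads = nn.ModuleList([Head(n_embed, head_size, block_size) for _ in range(n_heads)])
        self.proj = nn.Linear(n_heads * head_size, n_embed)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed), nn.ReLU(), nn.Linear(4 * n_embed, n_embed)
        )
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)

    def forward(self, x):
        # Pre-norm: normalise before each sub-layer, then add the residual
        h = self.ln1(x)
        x = x + self.proj(torch.cat([head(h) for head in self.heads], dim=-1))
        x = x + self.ffwd(self.ln2(x))
        return x

class TinyGPT(nn.Module):
    def __init__(self, vocab_size, n_embed=32, n_heads=4, n_layers=2, block_size=8):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, n_embed)
        self.pos_emb = nn.Embedding(block_size, n_embed)  # learned position embedding
        self.blocks = nn.Sequential(*[Block(n_embed, n_heads, block_size) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.token_emb(idx) + self.pos_emb(pos)  # (B, T, n_embed)
        x = self.ln_f(self.blocks(x))
        return self.lm_head(x)  # (B, T, vocab_size)

torch.manual_seed(1337)
model = TinyGPT(vocab_size=65)
idx = torch.randint(65, (4, 8))
logits = model(idx)
print(logits.shape)
```

Because of the triangular mask, changing the last token of `idx` leaves the logits at all earlier positions unchanged, which is a convenient way to check that no information leaks from the future.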

## Above is the code for a decoder-only model

To create an encoder model, we would remove the masking from the self attention layer, so that each token can communicate with all other tokens.

To create an encoder-decoder model, we would add a cross attention layer that connects the decoder to the encoder. This allows the decoder to attend to the input tokens when generating the output tokens.

An encoder-decoder model can be used for tasks such as translation, where the input is a full sentence and the output is a translation of that sentence. It may also be used for something like a classification task, where the input is a sentence and the output is a class label.

A translation example is shown below:

```
<--------------encoder--------------><----------------decoder---------------->
les reseaux de neurones sont geniaux <START>the neural networks are great<END>
```

The encoder only sees the input tokens and produces a representation of the input, which is then passed to the decoder. The decoder then generates the output tokens one at a time, using the output of the encoder as context along with the output tokens it has generated so far.
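
A minimal sketch of a single cross-attention head follows, reusing the query/key/value pattern from the self-attention cell. The random `enc_out` and `dec_x` tensors are stand-ins for a real encoder output and real decoder hidden states, and the sizes are arbitrary. The queries come from the decoder while the keys and values come from the encoder, and no causal mask is applied because the whole input sentence is available.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
B, T_enc, T_dec, C, head_size = 2, 6, 4, 32, 16

enc_out = torch.randn(B, T_enc, C)  # stand-in for the encoder's output
dec_x = torch.randn(B, T_dec, C)    # stand-in for the decoder's hidden states

query = nn.Linear(C, head_size, bias=False)
key = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

q = query(dec_x)    # (B, T_dec, head_size): queries come from the decoder
k = key(enc_out)    # (B, T_enc, head_size): keys come from the encoder
v = value(enc_out)  # (B, T_enc, head_size): values come from the encoder

# No causal mask: every decoder position may look at every encoder position
wei = q @ k.transpose(-2, -1) * head_size ** -0.5  # (B, T_dec, T_enc)
wei = F.softmax(wei, dim=-1)
out = wei @ v  # (B, T_dec, head_size)

print(wei.shape, out.shape)
```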

## ChatGPT

ChatGPT is a decoder-only transformer.

The input is a sequence of tokens, and the output is a sequence of tokens. The model is first pre-trained on a large dataset of text to predict the next token in the sequence.

The model is then fine-tuned on a smaller dataset of conversational text.

The main differences between the model we created and ChatGPT are the size of the model and the fine-tuning. GPT-3, which the models behind ChatGPT build on, has 175B parameters in its largest version.

The model is fine-tuned on prompt and response pairs, so that it learns to generate the response given the prompt. The fine-tuned model is then used to generate several responses to each prompt, and a human labeler ranks them. A reward model is trained to reproduce these rankings. Finally, the reward model is used to further train the language model so that it generates responses the reward model scores highly. This process is called Reinforcement Learning from Human Feedback (RLHF).
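
The ranking step is often trained with a pairwise loss: for two responses to the same prompt, the reward model should score the one the labeler preferred higher. The sketch below shows such a loss on made-up scalar rewards. The tensors are illustrative numbers rather than outputs of a real reward model, and whether ChatGPT uses exactly this objective is an assumption here.

```python
import torch
import torch.nn.functional as F

# Hypothetical reward-model scores for three (preferred, rejected) response pairs
reward_preferred = torch.tensor([1.2, 0.3, -0.5])
reward_rejected = torch.tensor([0.4, 0.9, -1.5])

# Pairwise ranking loss: -log(sigmoid(r_preferred - r_rejected)), averaged over pairs
loss = -F.logsigmoid(reward_preferred - reward_rejected).mean()
print(f"ranking loss: {loss.item():.4f}")

# The loss shrinks as the preferred responses are scored further above the rejected ones
better = -F.logsigmoid((reward_preferred + 2.0) - reward_rejected).mean()
print(f"after raising preferred scores: {better.item():.4f}")
```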
