# **Implementing a GPT-Like Language Model from Scratch in PyTorch**

Author: Alejandro Meza Tudela

We will implement a GPT-style architecture, following this excellent resource:

* **[Let's build GPT: from scratch, in code, spelled out](https://www.youtube.com/watch?v=kCc8FmEb1nY)** by Andrej Karpathy

Specifically, we will implement the decoder of the transformer architecture.

## 1. Preparations for the implementation

In [None]:
import torch

In [None]:
!wget https://raw.githubusercontent.com/karpathy/ng-video-lecture/refs/heads/master/input.txt

--2025-02-09 07:44:14--  https://raw.githubusercontent.com/karpathy/ng-video-lecture/refs/heads/master/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2025-02-09 07:44:15 (33.0 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [None]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [None]:
print(f'length of the text: {len(text)}')

length of the text: 1115394


In [None]:
#The first step is to get the unique characters that are in the dataset
chars = sorted(list(set(text)))
vocab_size = len(chars) #possible elements of our sequence
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


## 2. Simple definition of encoder/decoder and train/val split

In [None]:
#define a way to map string to integers
string_to_integer = {ch:i for i,ch in enumerate(chars)}
integer_to_string = {i:ch for i,ch in enumerate(chars)}

#define functions to encode and decode our strings
encode = lambda s: [string_to_integer[c] for c in s]
decode = lambda l: ''.join([integer_to_string[i] for i in l])

print(encode('this is a just an example!'))
print(decode(encode('this is a just an example!')))

[58, 46, 47, 57, 1, 47, 57, 1, 39, 1, 48, 59, 57, 58, 1, 39, 52, 1, 43, 62, 39, 51, 54, 50, 43, 2]
this is a just an example!


Now let's encode the whole text that we read previously and store it as a tensor.


In [None]:
#time to try to encode/decode the text that we read previously
data = torch.tensor(encode(text),dtype=torch.long)
print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


Let's split the data into a training set (the first 90%) and a validation set (the remaining 10%).

In [None]:
threshold = int(0.9*len(data))
train_data = data[:threshold]
validation_data = data[threshold:]

In [None]:
# Define the number of consecutive data elements to be processed at once
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

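## 3. Next-token prediction logic and implementation of a function to obtain a batch

In an autoregressive (decoder-style) language model, the tokens up to position t are used to predict the token at position t+1. A single block of `block_size + 1` characters therefore contains `block_size` training examples. Let's see what this means in practice.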

In [None]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f'when input is {context} the target: {target}')

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


Now, it's time to define the way to obtain a batch to train our transformer model.

In [None]:
torch.manual_seed(1337) #for reproducibility
batch_size = 4 #number of sequences processed in parallel
block_size = 8 #context window length

def get_batch(split,batch_size):
  data = train_data if split=='train' else validation_data
  #sample random starting positions for the context windows
  ix = torch.randint(len(data) - block_size, (batch_size,))
  #stack the inputs: one context window per starting position
  x = torch.stack([data[i:i+block_size] for i in ix])
  #stack the targets: the same windows shifted one position to the right
  y = torch.stack([data[i+1:i+block_size+1] for i in ix])
  return x,y

xb,yb = get_batch('train',batch_size)
print(f'input shape: {xb.shape} ')
print(xb)
print(f'target shape {yb.shape}')
print(yb)
print('--------')

input shape: torch.Size([4, 8]) 
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
target shape torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
--------


## 4. Define a baseline model to test our idea

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

cuda


### 4.1 Bigram model implementation

To try out the get_batch() function and start obtaining some meaningful results, let's implement a first, simple iteration of the model.

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
  #define the constructor
  def __init__(self,vocabulary_size,n_embedding_dimension):
    super().__init__()
    #embedding table of shape (vocabulary_size, n_embedding_dimension) --> maps each token id to its embedding vector
    self.token_embedding_table = nn.Embedding(vocabulary_size,
                                              n_embedding_dimension)
    #encoding of positions of the sequence (block size->max length of sequence)
    self.position_embedding_table = nn.Embedding(block_size,
                                                 n_embedding_dimension)
    #add the final linear layer
    self.lm_head = nn.Linear(n_embedding_dimension,vocabulary_size)

  def forward(self,idx,targets=None):
    idx = idx.to(device)
    B,T = idx.shape
    #obtain the embeddings representation of the table in the position of interest
    token_embeddings = self.token_embedding_table(idx) #size -> (B,T,C)
    indices = torch.arange(T,device=device)
    position_embeddings = self.position_embedding_table(indices % self.position_embedding_table.weight.size(0)) #size -> (T,C)
    x = token_embeddings + position_embeddings #add the positional information to the embeddings
    logits = self.lm_head(x) #size -> (batch_size,context_window,vocab_size)

    if targets is None:
      loss = None
    else:
      #define a loss function
      B,T,C = logits.shape
      #simplify the dimension of the loss
      logits = logits.view(B*T,C)
      target = targets.view(B*T)
      #define a loss function
      loss = F.cross_entropy(logits,target)
    return logits,loss

  #generate tokens
  def generate(self,idx,max_new_tokens):
    """
    Generates a sequence of tokens autoregressively based on the given input context.

    Args:
        idx (Tensor): The current token sequence (shape: [B, T]), where B is the batch size and T is the sequence length.
        max_new_tokens (int): The number of new tokens to generate.

    Returns:
        Tensor: The generated sequence with new tokens appended (shape: [B, T + max_new_tokens]).
    """
    #idx --> current context (B,T)
    for _ in range(max_new_tokens):
      logits,_ = self(idx)
      # Extract the logits for the last predicted token in the sequence
      logits = logits[:,-1,:]
      probs = F.softmax(logits,dim=-1)
      #return 1 element for each batch -> (B,1)
      # Sample the next token from the probability distribution
      idx_next = torch.multinomial(probs,num_samples=1)
      #append element to the current sequence: (B,T+1)
      idx = torch.cat((idx,idx_next),dim=1)
    return idx

n_embedding_dimension = 32
block_size = 8
baseline_model = BigramLanguageModel(vocab_size,n_embedding_dimension).to(device)
xb,yb = xb.to(device),yb.to(device)
logits,loss = baseline_model(xb,yb)
print(logits.shape)
print(loss)

#try to generate some data
print(decode(baseline_model.generate(idx=torch.zeros((1,1),dtype=torch.long,device=device),
                                     max_new_tokens=100)[0].tolist()))


torch.Size([8192, 65])
tensor(4.4775, device='cuda:0', grad_fn=<NllLossBackward0>)

?YCnx.DkRZkNdc'wf,ZT,OLlT-ebtK
b:xPT&kMBbUAUG:.XSKgO-33mMGd?KL3auhX:YVXhthXNNuyq&BMWG.tbfF dXENDZaAe
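
A quick sanity check on the loss printed above: a model that knows nothing should spread its probability evenly over the 65 characters, which gives a cross-entropy of -ln(1/65). The small sketch below computes that reference value in plain Python; it only assumes the vocabulary size found earlier.

```python
import math

vocab_size = 65  # number of unique characters found in the dataset above

# Cross-entropy of a perfectly uniform prediction over the vocabulary: -ln(1/vocab_size)
uniform_loss = -math.log(1.0 / vocab_size)
print(f"uniform-guess loss: {uniform_loss:.4f}")  # -> uniform-guess loss: 4.1744
```

The untrained model's 4.4775 sits slightly above this reference, which is plausible for randomly initialised weights: its predictions are not perfectly uniform, so some probability mass is wasted on wrong characters.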


In [None]:
#define some hyper-parameters
optimizer = torch.optim.AdamW(baseline_model.parameters(),lr=1e-3)
batch_size  = 32
epochs = 1000 #number of training steps (one random batch per step), not full passes over the data

In [None]:
#define some training loop to test our ideas
for epoch in range(epochs):
  #obtain a batch from our data
  xb,yb = get_batch('train',batch_size) #at this point get_batch reads block_size from the global scope
  xb,yb = xb.to(device),yb.to(device)
  #evaluate the loss
  logits, loss =  baseline_model(xb,yb)
  optimizer.zero_grad(set_to_none=True)
  loss.backward()
  optimizer.step()
print(loss.item())

2.581425189971924


In [None]:
#after a fast training, let's try again the text generator.
print(decode(baseline_model.generate(idx=torch.zeros((1,1),dtype=torch.long,device=device),
                                     max_new_tokens=100)[0].tolist()))


Wawice my.
Wh'starom orour
Yowhs, tof is h be t il ndilin,

W iree sengcin lat Het drov te, and t l 


## 5. Self-attention

Self-attention is a key mechanism in deep learning models like Transformers, allowing them to focus on different parts of an input sequence when making predictions. It helps the model capture relationships between words/tokens, regardless of their distance in the sequence.

Why does it matter?

- Helps models capture long-range dependencies in text.
- Enables parallel processing (unlike RNNs).
- Forms the foundation of Transformer models (e.g., GPT, BERT).

### Mathematical trick in self-attention

In [None]:
#simple example
torch.manual_seed(1337) #for reproducibility
B,T,C = 4,8,2 #batch size, context window, number of channels (features per token)
x = torch.rand(B,T,C)
x[0],x.shape

(tensor([[0.0783, 0.4956],
         [0.6231, 0.4224],
         [0.2004, 0.0287],
         [0.5851, 0.6967],
         [0.1761, 0.2595],
         [0.7086, 0.5809],
         [0.0574, 0.7669],
         [0.8778, 0.2434]]),
 torch.Size([4, 8, 2]))

In [None]:
'''
  We want each token to carry information about the tokens that came before it.
  The simplest way to express that is to average the current token with all the previous ones.
'''
#APPROACH 1

xbow = torch.zeros((B,T,C)) #C--> number of channels/features
for b in range(B):
  for t in range(T):
    xprev = x[b,:t+1]  # (t+1,C)
    xbow[b,t] = torch.mean(xprev,0) #average the current and previous elements

Now, let's use a mathematical trick to compute the same operation with a single matrix multiplication.

In [None]:
torch.manual_seed(17)
a = torch.tril(torch.ones(3,3))
a = a/torch.sum(a,1,keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a@b
print('a=')
print(a)
print('---')
print('b=')
print(b)
print('---')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
---
b=
tensor([[9., 5.],
        [1., 2.],
        [0., 9.]])
---
c=
tensor([[9.0000, 5.0000],
        [5.0000, 3.5000],
        [3.3333, 5.3333]])


We can see that, with this trick, each row of `c` is the average of the rows of `b` up to and including that position: the first row is just the first row of `b`, the second is the average of the first two rows, and the third is the average of all three. Let's apply this idea to the previous code.

In [None]:
#APPROACH 2
weights = torch.tril(torch.ones(T,T))
weights = weights/weights.sum(1,keepdim=True)
print(weights)
xbow2 = weights@x #(T,T) @ (B,T,C) --> (B,T,C)
torch.allclose(xbow,xbow2)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])


True

Now let's obtain the same weights using **softmax**, which is how self-attention normalizes its scores. Positions that must not be seen are set to -inf, so they receive a weight of exactly zero.
Softmax formula:


![softmax](https://miro.medium.com/v2/resize:fit:600/0*fbg5QEc2Lv8IIKcq.png)
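
The formula in the image above can also be written out. For a vector of scores $z$ with $n$ entries, softmax turns each entry into a positive weight, and the weights sum to 1:

$$
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}
$$

Because $e^{-\infty} = 0$, any score set to $-\infty$ receives a weight of exactly 0, and a row whose remaining $k$ scores are all 0 becomes $k$ equal weights of $1/k$, which is the running average computed in the previous approaches.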

In [None]:
#APPROACH 3
tril = torch.tril(torch.ones(T,T))
print(tril)
weights = torch.zeros((T,T))
weights = weights.masked_fill(tril==0,float('-inf'))
print(weights)
weights = F.softmax(weights,dim=-1)
print(weights)
xbow3 = weights@x
torch.allclose(xbow,xbow3)

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])
tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0.]])
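tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])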

True

In [None]:
#APPROACH 4
torch.manual_seed(1337)
B,T,C = 4,8,32 #4 batches, 8 tokens per context window, 32 elements to represent the embedding
x = torch.rand(B,T,C)

tril = torch.tril((torch.ones(T,T)))
weights = torch.zeros((T,T))
weights = weights.masked_fill(tril==0,
                              float('-inf'))
weights = F.softmax(weights,dim=-1)
out = weights@x
out.shape

torch.Size([4, 8, 32])

In the self-attention mechanism, for each word vector in the input sequence, we create three distinct vectors:

- Query vector (Q): This represents the word we are currently focusing on. It is used to compare against all the other word vectors in the sequence.

- Key vector (K): This describes what each word in the sequence has to offer ("what do I contain?"). Each key is compared with the query to determine how much attention the querying word should pay to that word.

- Value vector (V): This contains the actual information that will be used in the final output. The value vector is weighted based on the similarity between the query and the key vectors.


**Overall Process:**

- For each word in the sequence, compute its query (Q), key (K), and value (V) using learned linear transformations.
- Compute the attention scores by taking the dot product of the query vector with the key vectors of all words and scaling the result by 1/sqrt(head_size). In a decoder, the scores of future positions are then masked out, and a softmax normalizes the remaining scores.
- Use the attention scores to compute a weighted sum of the value vectors.

The resulting weighted sum represents how much information from the other words (and itself) contributes to the representation of the current word.

This mechanism allows the model to dynamically focus on the most relevant parts of the sequence for each word, capturing dependencies and context efficiently.
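
As a compact summary, the three steps above correspond to the scaled dot-product attention formula from 'Attention Is All You Need', written here with an additive causal mask. The symbols $d_k$ (the size of the key vectors, called `head_size` in this notebook) and $M$ (the mask) are notation introduced for this sketch:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M\right) V,
\qquad
M_{ij} =
\begin{cases}
0 & \text{if } j \le i \\
-\infty & \text{if } j > i
\end{cases}
$$

The softmax is applied row by row, so row $i$ of the result is a weighted sum of the value vectors of positions $1, \dots, i$. In an encoder block the mask $M$ is simply dropped.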


**The Role of the Head in Multi-Head Attention**

Instead of computing self-attention once, transformers use multiple attention heads in parallel to capture different types of relationships between words. Each head learns a different attention pattern, allowing the model to focus on various parts of the sequence simultaneously.

Each head independently computes its own Q, K, and V matrices using different learned weight matrices.

These separate attention mechanisms allow the model to capture different semantic relationships (e.g., long-range dependencies, syntactic structure, coreference resolution).

After computing the attention outputs from multiple heads, they are concatenated and passed through a linear transformation to merge the information.
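
The dimension bookkeeping behind that concatenation can be checked with a few lines of plain Python. This is only a sketch: the values 384 and 6 are the embedding size and number of heads used later in this notebook, and the lists of zeros are hypothetical stand-ins for the per-head outputs of a single token.

```python
n_embedding_dimension = 384  # embedding size used later in the notebook
number_of_heads = 6          # number of attention heads used later in the notebook

# The embedding size must split evenly across the heads
assert n_embedding_dimension % number_of_heads == 0
head_size = n_embedding_dimension // number_of_heads

# Stand-in for the output of each head for one token: head_size features per head
per_head_outputs = [[0.0] * head_size for _ in range(number_of_heads)]

# Concatenating along the feature dimension restores the embedding size,
# which is what the final linear projection expects as input
concatenated = [value for head in per_head_outputs for value in head]
print(head_size, len(concatenated))  # -> 64 384
```

Choosing `head_size = n_embedding_dimension // number_of_heads` keeps the total amount of computation close to that of one large head while letting each head learn its own attention pattern.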

So, let's use these concepts to define the self-attention mechanism properly.

In [None]:
#APPROACH 5: a single self-attention head with data-dependent weights
torch.manual_seed(1337)
B,T,C = 4,8,32 #4 batches, 8 tokens per context window, 32 elements to represent the embedding
x = torch.rand(B,T,C)

#Implementation of a single Head that performs self-attention
head_size = 16

# Create a linear transformation for the key vectors
# The input dimension is C (size of word embeddings), and the output dimension is head_size (size of key vector for each attention head)
# No bias term is added to this transformation
key = nn.Linear(C, head_size, bias=False)

# Create a linear transformation for the query vectors
# The input dimension is C (size of word embeddings), and the output dimension is head_size (size of query vector for each attention head)
# No bias term is added to this transformation
query = nn.Linear(C, head_size, bias=False)

#definition of value vector
value = nn.Linear(C, head_size, bias=False)

k = key(x) # (B,T,16)
q = query(x) # (B,T,16)
v = value(x)

#we want to obtain the relationship between the key vectors and query vectors for every word in the context window
scores = q @ k.transpose(-2,-1)*(head_size**-0.5) # (B,T,16) @ (B,16,T) --> (B,T,T)

tril = torch.tril((torch.ones(T,T)))
scores = scores.masked_fill(tril==0,
                              float('-inf'))
scores = F.softmax(scores,dim=-1)
out = scores@v
out.shape

torch.Size([4, 8, 16])

In [None]:
weights[0]

tensor([1., 0., 0., 0., 0., 0., 0., 0.])

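Let's look at the attention weights and analyse the result. The cell above prints the first row of the uniform `weights` matrix from the previous approach (the first token can only attend to itself); the data-dependent weights of this head are stored in `scores` and have the same structure: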
- Each row corresponds to a token in the sequence.
- Each column represents how much attention is paid to other tokens (or itself) by that specific token.
- The values in the tensor are normalized (typically using softmax), indicating the strength of attention, ranging between 0 and 1.

Additional notes:

- Attention is a communication mechanism.
- There is no built-in notion of position: attention simply acts over a set of vectors, which is why positional embeddings are added to the tokens.
- Each sample across the batch dimension is processed independently and never 'communicates' with any other sample.
- In an **encoder** attention block, the **tril masking** is not necessary because we want all tokens to communicate with each other. In a **decoder** block it is necessary: the model is trained to predict the next token, so a position must not see the tokens that come after it.
- In **self-attention**, Q, K and V are all computed from the same source. In **cross-attention**, the queries come from that source while the keys and values come from an external one.
- Scaled attention multiplies the raw scores by 1/sqrt(head_size) before the softmax. When the query (Q) and key (K) vectors have unit variance, this keeps the scores at roughly unit variance as well. Consequently, the softmax produces more balanced and diffuse probabilities instead of saturating towards a one-hot distribution.
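
The effect of the scaling factor can be checked numerically. The sketch below uses plain Python with random unit-variance vectors as hypothetical stand-ins for queries and keys (the names `dot`, `num_pairs`, `raw_scores` and `scaled_scores` are illustrative only) and compares the variance of the dot products with and without the 1/sqrt(head_size) factor.

```python
import random
import statistics

random.seed(1337)
head_size = 16    # same head size as in the single-head example above
num_pairs = 5000  # number of random (query, key) pairs to sample

def dot(a, b):
    """Dot product of two equally long lists of numbers."""
    return sum(x * y for x, y in zip(a, b))

raw_scores, scaled_scores = [], []
for _ in range(num_pairs):
    # Unit-variance stand-ins for one query vector and one key vector
    q = [random.gauss(0, 1) for _ in range(head_size)]
    k = [random.gauss(0, 1) for _ in range(head_size)]
    score = dot(q, k)
    raw_scores.append(score)
    scaled_scores.append(score * head_size ** -0.5)

raw_variance = statistics.pvariance(raw_scores)
scaled_variance = statistics.pvariance(scaled_scores)
print(f"variance without scaling: {raw_variance:.2f}")    # close to head_size
print(f"variance with scaling:    {scaled_variance:.2f}")  # close to 1
```

Without the factor, the spread of the scores grows with `head_size`, and large scores push the softmax towards a one-hot output; with it, the scores stay in a range where the softmax remains diffuse.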



## 6. Create the self-attention head in the previously defined model and finish the decoder implementation

![Image](https://i.sstatic.net/nV7Ee.jpg)

In [None]:
'''
  Demo of the data that we are going to get to try the model
'''

torch.manual_seed(1337) #for reproducibility
batch_size = 4 #number of sequences processed in parallel
block_size = 16 #context window length

def get_batch(split,batch_size,block_size):
  data = train_data if split=='train' else validation_data
  #sample random starting positions for the context windows
  ix = torch.randint(len(data) - block_size, (batch_size,))
  #stack the inputs: one context window per starting position
  x = torch.stack([data[i:i+block_size] for i in ix])
  #stack the targets: the same windows shifted one position to the right
  y = torch.stack([data[i+1:i+block_size+1] for i in ix])
  return x,y

xb,yb = get_batch('train',batch_size,block_size)
print(f'input shape: {xb.shape} ')
print(xb)
print(f'target shape {yb.shape}')
print(yb)
print('--------')

input shape: torch.Size([4, 16]) 
tensor([[21, 27, 24, 13, 26, 33, 31, 10,  0, 32, 59, 57, 46,  6,  1, 58],
        [53, 59, 57,  1, 51, 43, 52,  0, 13, 56, 43,  1, 39, 58,  1, 58],
        [50,  6,  1, 57, 47, 56,  8,  1, 18, 39, 56, 43,  1, 63, 53, 59],
        [58, 46,  1, 57, 53,  6,  1, 46, 53, 50, 63,  1, 57, 47, 56, 11]])
target shape torch.Size([4, 16])
tensor([[27, 24, 13, 26, 33, 31, 10,  0, 32, 59, 57, 46,  6,  1, 58, 59],
        [59, 57,  1, 51, 43, 52,  0, 13, 56, 43,  1, 39, 58,  1, 58, 46],
        [ 6,  1, 57, 47, 56,  8,  1, 18, 39, 56, 43,  1, 63, 53, 59,  1],
        [46,  1, 57, 53,  6,  1, 46, 53, 50, 63,  1, 57, 47, 56, 11,  1]])
--------


In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class Head(nn.Module):  # Represents a single attention head in a multi-head attention mechanism.
    def __init__(self, head_size):
        super().__init__()
        # Linear transformations for Key (K), Query (Q), and Value (V) vectors.
        self.key = nn.Linear(n_embedding_dimension, head_size, bias=False)
        self.query = nn.Linear(n_embedding_dimension, head_size, bias=False)
        self.value = nn.Linear(n_embedding_dimension, head_size, bias=False)
        # Register a lower triangular matrix to apply masking for causal attention.
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        # B: batch size, T: sequence length, C: embedding dimension
        B, T, C = x.shape

        # Compute the key (K), query (Q), and value (V) vectors.
        k = self.key(x)  # Shape: (B, T, head_size)
        q = self.query(x)  # Shape: (B, T, head_size)
        v = self.value(x)  # Shape: (B, T, head_size)

        # Calculate the attention scores by taking the dot product of Q and K, scaled by 1/sqrt(head_size).
        attention_scores = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5  # Shape: (B, T, T)

        # Apply causal masking to prevent attention to future tokens.
        attention_scores = attention_scores.masked_fill(self.tril[:T, :T] == 0, float('-inf'))

        # Normalize the scores using softmax to get attention weights.
        attention_scores = F.softmax(attention_scores, dim=-1)  # Shape: (B, T, T)
        attention_scores = self.dropout(attention_scores)

        # Compute the output by applying attention weights to the value (V) vectors.
        out = attention_scores @ v  # Shape: (B, T, head_size)

        return out

class MultiHeadAttention(nn.Module):
  '''
  Class to define the multihead implementation
  '''
  def __init__(self,num_heads,head_size):
    super().__init__()
    self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
    self.proj = nn.Linear(n_embedding_dimension,n_embedding_dimension)
    self.dropout = nn.Dropout(0.5)

  def forward(self,x):
    #concatenate the outputs of all heads, project them back and apply dropout
    out = torch.cat([h(x) for h in self.heads],dim=-1)
    return self.dropout(self.proj(out))

class FeedForward(nn.Module):
  '''
  Class to define the feedforward implementation
  '''
  def __init__(self,n_embedding_dimension):
    '''
      Paper 'Attention is all you need' reference.
      The dimensionality of input and output is d_model=512,
      and the inner-layer has dimensionality dff=2048. So we have 4 as a factor
    '''
    super().__init__()
    self.net = nn.Sequential(
        nn.Linear(n_embedding_dimension,n_embedding_dimension*4),
        nn.ReLU(),
        nn.Linear(4*n_embedding_dimension,n_embedding_dimension), #reproject in the same embedding space
        nn.Dropout(0.5)
    )
  def forward(self,x):
    return self.net(x)

# Layer Normalization (LayerNorm) is used in Transformers to stabilize training,
# normalize activations, and improve gradient flow. Unlike BatchNorm, it normalizes
# across features (not batch) and helps prevent internal covariate shift. It is
# usually applied before or after residual connections to improve convergence.

#time to replicate the mechanism N times in a block form
class Block(nn.Module):
  '''
  Class to define the block implementation --> communication followed by computation
  '''
  def __init__(self,n_embedding_dimension,number_of_heads):
    super().__init__()
    head_size = n_embedding_dimension//number_of_heads
    self.sa = MultiHeadAttention(number_of_heads,head_size)
    self.ffwd = FeedForward(n_embedding_dimension)
    self.ln1 = nn.LayerNorm(n_embedding_dimension)
    self.ln2 = nn.LayerNorm(n_embedding_dimension)

  def forward(self,x):
    x = x + self.sa(self.ln1(x)) #add residual connections
    x = x + self.ffwd(self.ln2(x)) #add residual connections
    return x

class BigramLanguageModel(nn.Module):
  #define the constructor
  def __init__(self,vocabulary_size,n_embedding_dimension,number_of_heads):
    super().__init__()
    #embedding table of shape (vocabulary_size, n_embedding_dimension) --> maps each token id to its embedding vector
    self.token_embedding_table = nn.Embedding(vocabulary_size,
                                              n_embedding_dimension)
    #encoding of positions of the sequence (block size->max length of sequence)
    self.position_embedding_table = nn.Embedding(block_size,
                                                 n_embedding_dimension)
    self.blocks = nn.Sequential(*[Block(n_embedding_dimension,number_of_heads) for _ in range(number_layers)])
    self.ln_f = nn.LayerNorm(n_embedding_dimension)
    #add the final linear layer
    self.lm_head = nn.Linear(n_embedding_dimension,vocabulary_size)

  def forward(self,idx,targets=None):
    idx = idx.to(device)
    B,T = idx.shape
    #obtain the embedding representation of each token from the table
    token_embeddings = self.token_embedding_table(idx) #size -> (B,T,C)
    indices = torch.arange(T,device=device)
    #obtain the position information of the tokens
    position_embeddings = self.position_embedding_table(indices) #size -> (T,C)
    x = token_embeddings + position_embeddings #add the positional information to the embeddings
    x = self.blocks(x) #apply the attention in blocks
    x = self.ln_f(x) #apply layer norm 
    logits = self.lm_head(x) #size -> (batch_size,context_window,vocab_size)

    if targets is None:
      loss = None
    else:
      #define a loss function
      B,T,C = logits.shape
      #simplify the dimension of the loss
      logits = logits.view(B*T,C)
      target = targets.view(B*T)
      #define a loss function
      loss = F.cross_entropy(logits,target)
    return logits,loss

  #generate tokens
  def generate(self,idx,max_new_tokens):
    """
    Generates a sequence of tokens autoregressively based on the given input context.

    Args:
        idx (Tensor): The current token sequence (shape: [B, T]), where B is the batch size and T is the sequence length.
        max_new_tokens (int): The number of new tokens to generate.

    Returns:
        Tensor: The generated sequence with new tokens appended (shape: [B, T + max_new_tokens]).
    """
    #idx --> current context (B,T)
    for _ in range(max_new_tokens):
      #crop idx to the last block_size tokens in order to fit the position embedding table
      idx_cropped = idx[:,-block_size:]
      #get the predictions
      logits,loss = self(idx_cropped)
      # Extract the logits for the last predicted token in the sequence
      logits = logits[:,-1,:]
      probs = F.softmax(logits,dim=-1)
      #return 1 element for each batch -> (B,1)
      # Sample the next token from the probability distribution
      idx_next = torch.multinomial(probs,num_samples=1)
      if idx_next.max() >= vocab_size:
            print(f"Warning: idx_next contains invalid index {idx_next.max()}")

      #append element to the current sequence: (B,T+1)
      idx = torch.cat((idx,idx_next),dim=1)
    return idx

'''
  Depending on your resources, you can change the parameters
'''

'''
#SMALL RESOURCES CONFIGURATION
n_embedding_dimension = 32
number_of_heads = 4
block_size = 256
number_layers = 4
learning_rate = 1e-3
'''
n_embedding_dimension = 384
number_of_heads = 6  
block_size = 256  # context window length
number_layers = 4  
learning_rate = 1e-3
vocab_size = 65  # Adjust based on your dataset

#define the model
xb, yb = xb.to(device), yb.to(device)
baseline_model = BigramLanguageModel(vocab_size,n_embedding_dimension,number_of_heads).to(device)
logits,loss = baseline_model(xb,yb)
print(logits.shape)
print(loss)

#try to generate some data
print(decode(baseline_model.generate(idx=torch.zeros((1,1),dtype=torch.long, device=device),
                                     max_new_tokens=100)[0].tolist()))

torch.Size([256, 65])
tensor(4.4642, device='cuda:0', grad_fn=<NllLossBackward0>)

M.Ebimuvlig&$!?vX;p&wCu!gylgHt$r:ycH.WVDRCtwAtu;mpgET
KocTCtZYBubphHym
mHcPbQans'ftrc& D;qEQgWgnDbER


In [None]:
xb.shape

In [None]:
#define some hyper-parameters
optimizer = torch.optim.AdamW(baseline_model.parameters(),learning_rate)
batch_size  = 64 #can try 64/128
epochs = 5000 #number of training steps (one random batch per step), not full passes over the data

In [None]:
#define some training loop to test our ideas
for epoch in range(epochs):
  #obtain a batch from our data
  xb,yb = get_batch('train',batch_size,block_size)
  xb,yb = xb.to(device),yb.to(device)
  #evaluate the loss
  logits, loss =  baseline_model(xb,yb)
  optimizer.zero_grad(set_to_none=True)
  loss.backward()
  optimizer.step()
print(loss.item())

1.060608983039856
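
The value printed above is the loss of the last training batch only, so it is noisy and says nothing about the held-out data. A common complement is to average the loss over several batches of both splits, with dropout switched off. The sketch below is a minimal, self-contained version of that idea: `toy_get_batch`, `ToyModel`, `estimate_loss` and `eval_iters` are hypothetical names, and the random data stands in for `train_data` and `validation_data`.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)

# Hypothetical stand-ins so the sketch runs on its own; in the notebook you would
# use train_data, validation_data, get_batch and baseline_model instead.
vocab_size, block_size, batch_size = 65, 8, 16
toy_data = torch.randint(vocab_size, (2000,))
splits = {'train': toy_data[:1800], 'val': toy_data[1800:]}

def toy_get_batch(split):
    data = splits[split]
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

class ToyModel(nn.Module):
    """Tiny lookup-table model with dropout, used only to exercise estimate_loss."""
    def __init__(self):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)
        self.dropout = nn.Dropout(0.2)

    def forward(self, idx, targets):
        logits = self.dropout(self.table(idx))  # (B, T, vocab_size)
        loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
        return logits, loss

@torch.no_grad()  # no gradients are needed for evaluation
def estimate_loss(model, eval_iters=20):
    model.eval()  # switch dropout off while measuring
    averages = {}
    for split in ('train', 'val'):
        losses = torch.zeros(eval_iters)
        for i in range(eval_iters):
            xb, yb = toy_get_batch(split)
            _, loss = model(xb, yb)
            losses[i] = loss.item()
        averages[split] = losses.mean().item()
    model.train()  # restore training mode before returning
    return averages

toy_model = ToyModel()
print(estimate_loss(toy_model))
```

Calling such a function every few hundred steps inside the training loop makes it possible to see whether the validation loss stops improving while the training loss keeps falling, which would indicate overfitting.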


In [None]:
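#after a fast training, let's try the text generator feature
baseline_model.eval() #disable dropout while generating
with torch.no_grad(): #no gradients are needed for sampling
  print(decode(baseline_model.generate(idx=torch.zeros((1,1),dtype=torch.long,device=device),
                                       max_new_tokens=500)[0].tolist()))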


Dost thou port it please and all greet,
on tide I denied thee for twenty times drawn.

LUCIO:
But, fair dreadful; may give I may prick haof.

CAMILLO:
I do protector must be you, fellow a rest--cord savant
Mib trust as you for the grace!

CLEOMENES:
Pread not Claudio well changed it the
Which breath not rung his work and his head,
Upon his knowledge him what is his land.

Second Musicing learned doth burn him;
And gracious fanad did that my colour look
Which grief you hither: nay, come, my son,



## 7. Save the weights of the model

In [None]:
# Save model weights
model_save_path = "/kaggle/working/baseline_model.pth"

# Save only the state_dict (recommended)
torch.save(baseline_model.state_dict(), model_save_path)

print(f"Model weights saved at: {model_save_path}")

Model weights saved at: /kaggle/working/baseline_model.pth
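
To use the saved weights later, the model has to be rebuilt with the same hyper-parameters and the state_dict loaded into it. The sketch below shows the round trip on a small stand-in module: `build_model` is a hypothetical helper and a temporary file replaces the Kaggle path. With the notebook's model, you would construct `BigramLanguageModel(...)` with the same arguments instead.

```python
import os
import tempfile

import torch
import torch.nn as nn

def build_model():
    # Hypothetical stand-in for the notebook's model; the architecture must
    # match the one that produced the saved weights.
    return nn.Linear(4, 2)

trained = build_model()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pth")
    # Save only the state_dict, as in the cell above
    torch.save(trained.state_dict(), path)

    # Rebuild the architecture, then load the weights into it
    restored = build_model()
    state_dict = torch.load(path, map_location="cpu")
    restored.load_state_dict(state_dict)
    restored.eval()  # disable dropout (if any) before generating text

# Every parameter tensor should be identical after the round trip
same = all(torch.equal(a, b) for a, b in zip(trained.state_dict().values(),
                                             restored.state_dict().values()))
print(same)  # -> True
```

Passing `map_location` lets weights saved on a GPU be loaded on a machine that only has a CPU.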


## 8. Conclusions

In this notebook, we implemented a decoder-only Transformer that generates text after being trained on a given .txt file. The key components of our implementation were token embeddings, learned positional embeddings, multi-head self-attention, and feedforward layers, which together form the foundation of transformer-based text generation.

We also explored the self-attention mechanism, which allows the model to dynamically weigh different parts of the input sequence to capture long-range dependencies. Unlike traditional sequence models such as RNNs, self-attention enables efficient parallelization and better context modeling, leading to more coherent text generation.

Future Improvements:
- Fine-tuning on domain-specific text for specialized text generation.
- Exploring larger context windows to enhance coherence in long-form text.
- Now that we have learned the basics of the transformer architecture, we can experiment with variations of it, or apply it to other domains, for example with a Vision Transformer.