<a href="https://colab.research.google.com/github/janithsjay/transformer-experiments/blob/main/tiny_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [212]:
# mini_transformer.py
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

In [213]:
# ----------------------
# Step 1: Tiny Dataset
# ----------------------
sentences = [
    "I like pizza",
    "I like cats",
    "I took the dog for a walk",
    "The sun is bright",
    "I went to the park",
    "The dog likes to play",
    "I love dogs",
    "dogs are lovely"
]

In [214]:
# Build vocabulary
tokens = set()
for sentence in sentences:
    tokens.update(sentence.lower().split())
print(tokens)

{'cats', 'took', 'sun', 'the', 'like', 'is', 'love', 'a', 'likes', 'park', 'went', 'lovely', 'for', 'are', 'dog', 'dogs', 'walk', 'to', 'bright', 'i', 'pizza', 'play'}


In [215]:
token2id = {tok: idx for idx, tok in enumerate(sorted(tokens))}
id2token = {idx: tok for tok, idx in token2id.items()}
vocab_size = len(token2id)
print(token2id)

{'a': 0, 'are': 1, 'bright': 2, 'cats': 3, 'dog': 4, 'dogs': 5, 'for': 6, 'i': 7, 'is': 8, 'like': 9, 'likes': 10, 'love': 11, 'lovely': 12, 'park': 13, 'pizza': 14, 'play': 15, 'sun': 16, 'the': 17, 'to': 18, 'took': 19, 'walk': 20, 'went': 21}


In [216]:
# Convert sentences to sequences of IDs
sequences = [[token2id[word] for word in sentence.lower().split()] for sentence in sentences]
print(sequences)

[[7, 9, 14], [7, 9, 3], [7, 19, 17, 4, 6, 0, 20], [17, 16, 8, 2], [7, 21, 18, 17, 13], [17, 4, 10, 18, 15], [7, 11, 5], [5, 1, 12]]


In [217]:
# Create input/output pairs for next-word prediction
X, Y = [], []
for seq in sequences:
    for i in range(1, len(seq)):
        X.append(seq[:i])
        Y.append(seq[i])
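
To make the prefix/target construction concrete, the sketch below rebuilds the pairs for a single sentence and decodes them back to words. It is self-contained, so it defines its own one-sentence vocabulary instead of reusing the notebook's variables; the names `make_pairs`, `tok2id`, and `id2tok` are illustrative and not part of the original code.

```python
def make_pairs(seq):
    """Return (prefix, next_token) pairs for one token-ID sequence."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

# One-sentence toy vocabulary (assumption: built the same way as token2id above)
words = "i went to the park".split()
tok2id = {tok: idx for idx, tok in enumerate(sorted(set(words)))}
id2tok = {idx: tok for tok, idx in tok2id.items()}

seq = [tok2id[w] for w in words]
pairs = make_pairs(seq)

# Decode each pair back to words so the training signal is readable
for prefix, target in pairs:
    print([id2tok[t] for t in prefix], "->", id2tok[target])
```

A sentence of n tokens yields n - 1 pairs, so the eight sentences above produce 25 training examples in total.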


Step 2: Positional Encoding

Unlike an RNN, a transformer has no built-in notion of token order: self-attention treats its input as an unordered set. To give the model a sense of position, we add positional encodings to the input embeddings. These encodings let the model distinguish the first word from the second, and so on.

Here's what the code does:

Parameters:

d_model = 16 → The dimensionality of the embeddings (number of features per token).

max_len → The maximum sequence length we want to encode positions for. The code below derives it from the longest sentence in the dataset (7 tokens) instead of hard-coding 10.

Function get_positional_encoding(seq_len, d_model):

Creates a zero tensor of shape (seq_len, d_model) to store the positional encodings.

Loops over each position in the sequence (pos) and each even embedding index (i = 0, 2, 4, ...).

Even indices (i): Use the sine function

$$PE[pos, i] = \sin\left(\frac{pos}{10000^{i/d_{\text{model}}}}\right)$$

Odd indices (i+1): Use the cosine function

$$PE[pos, i+1] = \cos\left(\frac{pos}{10000^{i/d_{\text{model}}}}\right)$$

Why sine and cosine?

They create a unique pattern for each position across all embedding dimensions.

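These patterns vary smoothly with position, and the encoding at pos + k is a fixed linear function of the encoding at pos, which makes relative distances easy for the model to infer.

Dividing by 10000^(i/d_model) gives each sine/cosine pair of dimensions its own wavelength, from short (low i) to very long (high i).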
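
The relative-distance claim can be checked with the angle-addition identities. Writing $\omega_i = 1/10000^{i/d_{\text{model}}}$ for the frequency shared by dimensions $i$ and $i+1$ (the symbol $\omega_i$ is introduced here only for illustration), the encoding at an offset of $k$ positions is:

```latex
\begin{aligned}
PE[pos+k,\, i]   &= \sin\big(\omega_i (pos+k)\big)
                  = PE[pos,\, i]\,\cos(\omega_i k) + PE[pos,\, i+1]\,\sin(\omega_i k) \\
PE[pos+k,\, i+1] &= \cos\big(\omega_i (pos+k)\big)
                  = PE[pos,\, i+1]\,\cos(\omega_i k) - PE[pos,\, i]\,\sin(\omega_i k)
\end{aligned}
```

The coefficients depend only on $k$, not on $pos$, so shifting by $k$ positions applies the same rotation to each sine/cosine pair wherever it occurs in the sequence.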

Output:

pos_encoding is a tensor of shape (max_len, d_model).

This tensor is later added to the input embeddings to inject positional information.

💡 Intuition:
Think of positional encoding as giving each token a "location tag" in the sequence. Sine and cosine allow the model to figure out relative positions without hard-coding numbers.
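
As a rough illustration of the same formula, here is a vectorized sketch that avoids the nested Python loops. The function name `positional_encoding_vectorized` is an assumption made for this example; it is meant to produce the same values as the loop-based `get_positional_encoding`, not to replace it.

```python
import math
import torch

def positional_encoding_vectorized(seq_len, d_model):
    """Sinusoidal encoding: sin on even indices, cos on odd indices."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)           # even indices 0, 2, 4, ...
    angles = pos / (10000 ** (i / d_model))                        # one column per sin/cos pair
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles[:, : d_model // 2])             # slice handles odd d_model
    return pe

pe = positional_encoding_vectorized(7, 16)
print(pe.shape)
print(pe[0, :4])  # position 0: sin(0) = 0 in even slots, cos(0) = 1 in odd slots
```

Each row of `pe` is the "location tag" for one position; slicing `pe[:seq_len]` gives the rows needed for a shorter sequence.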

In [218]:
# ----------------------
# Step 2: Positional Encoding
# ----------------------
d_model = 16
# max_len = 10 # Removed static max_len

# Calculate max_len dynamically based on the longest sentence
max_len = max(len(s.split()) for s in sentences)
print(f"Dynamic max_len: {max_len}")

def get_positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            pe[pos, i] = math.sin(pos / (10000 ** (i / d_model)))
            if i + 1 < d_model:
                pe[pos, i+1] = math.cos(pos / (10000 ** (i / d_model)))
    return pe

# pos_encoding = get_positional_encoding(max_len, d_model)

Dynamic max_len: 7


Step 3: Transformer Encoder Layer

The transformer encoder layer is the core building block of a Transformer. It takes in a sequence of embeddings and outputs a transformed sequence that captures contextual relationships between tokens.

Class Initialization (__init__)

Linear projections for Q, K, V:

self.W_q, self.W_k, self.W_v are linear layers that map the input embeddings into Query, Key, and Value vectors.

Each has a weight matrix of shape (d_model, d_model), plus a bias vector.

These are the vectors used in self-attention to compute relationships between tokens.

Feed-Forward Network (ffn):

A small MLP applied independently to each position.

Two linear layers: first expands the dimension by 4× (d_model -> 4*d_model), then reduces it back (4*d_model -> d_model).

Uses ReLU for non-linearity.

Layer Normalization (ln1, ln2):

Helps stabilize training by normalizing the inputs at each step.

Applied after each residual connection (post-norm): once after self-attention and once after the feed-forward network.
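
The pieces described above can be traced step by step on a tiny random input. This is a minimal sketch of single-head scaled dot-product attention followed by a residual connection and layer normalization; the random `x` stands in for real embeddings, and the standalone layer objects are created only for this example.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)                     # reproducible random input
seq_len, d_model = 3, 16
x = torch.randn(seq_len, d_model)        # stand-in for embeddings + positional encoding

W_q = nn.Linear(d_model, d_model)
W_k = nn.Linear(d_model, d_model)
W_v = nn.Linear(d_model, d_model)
ln1 = nn.LayerNorm(d_model)

Q, K, V = W_q(x), W_k(x), W_v(x)             # each (seq_len, d_model)
scores = Q @ K.T / math.sqrt(d_model)        # (seq_len, seq_len) pairwise similarities
attn_weights = F.softmax(scores, dim=-1)     # each row is a distribution over tokens
attn_out = attn_weights @ V                  # weighted mix of value vectors
out = ln1(x + attn_out)                      # residual connection, then LayerNorm

print(attn_weights.shape, attn_weights.sum(dim=-1))
print(out.shape)
```

Row j of `attn_weights` says how much token j draws from every token in the sequence, and the output keeps the input shape, which is what allows layers to be stacked.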

In [219]:
# ----------------------
# Step 3: Transformer Encoder Layer
# ----------------------
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model*4),
            nn.ReLU(),
            nn.Linear(d_model*4, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)
        # Scale by the feature dimension of the input rather than the global d_model
        scores = Q @ K.T / math.sqrt(x.size(-1))
        attn_weights = F.softmax(scores, dim=-1)
        attn_out = attn_weights @ V

        x = self.ln1(x + attn_out)
        x = self.ln2(x + self.ffn(x))
        return x

In [220]:
# ----------------------
# Step 4: Mini Transformer
# ----------------------
class MiniTransformer(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
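        # Register as a buffer so the encoding follows the model across devices
        # and is saved in state_dict without being treated as a trainable parameter
        self.register_buffer("pos_encoding", get_positional_encoding(max_len, d_model))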
        self.layers = nn.ModuleList([TransformerEncoderLayer(d_model) for _ in range(num_layers)])
        self.output_layer = nn.Linear(d_model, vocab_size)

    def forward(self, seq_ids):
        seq_len = seq_ids.size(0)
        x = self.embedding(seq_ids) + self.pos_encoding[:seq_len]

        for layer in self.layers:
            x = layer(x)
        logits = self.output_layer(x)
        return logits

In [221]:
# ----------------------
# Step 5: Training Loop
# ----------------------
num_layers = 2
learning_rate = 0.01
model = MiniTransformer(vocab_size, d_model, num_layers)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Train on tiny dataset
for epoch in range(50):
    total_loss = 0
    for seq, target in zip(X, Y):
        seq_ids = torch.tensor(seq)
        target_id = torch.tensor([target])

        optimizer.zero_grad()
        logits = model(seq_ids)
        pred = logits[-1].unsqueeze(0)
        loss = criterion(pred, target_id)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    if (epoch+1) % 10 == 0:
        print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")

Epoch 10, Loss: 22.2382
Epoch 20, Loss: 12.0126
Epoch 30, Loss: 11.4462
Epoch 40, Loss: 11.1547
Epoch 50, Loss: 10.9612


In [222]:
# ----------------------
# Step 6: Test Prediction
# ----------------------
test_seq = torch.tensor([token2id[w] for w in ["i", "like"]])
logits = model(test_seq)
pred_id = logits[-1].argmax().item()
print("Input: 'i like'")
print("Predicted next word:", id2token[pred_id])

Input: 'i like'
Predicted next word: cats
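
Both "I like pizza" and "I like cats" appear in the training data, so the argmax alone may hide how the model splits probability between the two continuations. The helper below is a small sketch for inspecting the top candidates; `top_k_words` and the demo logits are hypothetical and are not taken from the trained model.

```python
import torch
import torch.nn.functional as F

def top_k_words(last_logits, id2token, k=3):
    """Return the k most probable next words with their softmax probabilities."""
    probs = F.softmax(last_logits, dim=-1)
    top_probs, top_ids = probs.topk(k)
    return [(id2token[i.item()], p.item()) for i, p in zip(top_ids, top_probs)]

# Hypothetical logits over a 4-word vocabulary (illustration only)
demo_id2token = {0: "cats", 1: "pizza", 2: "dogs", 3: "walk"}
demo_logits = torch.tensor([2.0, 1.9, 0.1, -1.0])

ranked = top_k_words(demo_logits, demo_id2token, k=2)
for word, prob in ranked:
    print(f"{word}: {prob:.2f}")
```

In the notebook this could be called as `top_k_words(logits[-1], id2token)` right after the forward pass.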


In [223]:
# ----------------------
# Step 6 (continued): Test Prediction on an unseen prefix
# ----------------------
test_seq = torch.tensor([token2id[w] for w in ["i", "took", "the", "cats", "for", "a"]])
logits = model(test_seq)
pred_id = logits[-1].argmax().item()
print("Predicted next word:", id2token[pred_id])

Predicted next word: walk


In [230]:
import time

# ----------------------
# Step 7: Generate Sentence with "pausing"
# ----------------------
test_seq = ["i"]
max_gen_len = 5  # maximum tokens to generate

generated = test_seq.copy()

# Print the starting sequence
for word in test_seq:
    print(word, end=" ", flush=True)
    time.sleep(0.3)  # small pause for "thinking"

for _ in range(max_gen_len):
    seq_ids = torch.tensor([token2id[w] for w in generated])
    logits = model(seq_ids)
    pred_id = logits[-1].argmax().item()
    next_word = id2token[pred_id]

    # Print next word immediately
    print(next_word, end=" ", flush=True)
    time.sleep(0.3)  # pause between words

    generated.append(next_word)

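    # Note: "<EOS>" is not in the vocabulary built above, so this check never
    # fires and generation always runs for max_gen_len steps.
    if next_word == "<EOS>":
        break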

print()  # final newline

i like cats cats to the 
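
The output above runs on past a natural stopping point because the vocabulary contains no end-of-sentence token, so the `"<EOS>"` check in the loop cannot trigger. One possible way to give the model a stop signal is sketched below, assuming an `"<EOS>"` marker is appended to every sentence before the vocabulary is built. This is not what the notebook currently does, and the two-sentence dataset is a shortened copy used only for illustration.

```python
EOS = "<EOS>"  # assumed end-of-sentence marker; not in the original vocabulary

sentences = ["I like pizza", "I love dogs"]  # shortened copy of the dataset
tokenized = [s.lower().split() + [EOS] for s in sentences]

# Build the vocabulary exactly as before, now including the marker
tokens = {tok for sent in tokenized for tok in sent}
token2id = {tok: idx for idx, tok in enumerate(sorted(tokens))}
sequences = [[token2id[tok] for tok in sent] for sent in tokenized]

print(token2id)
print(sequences)
```

With this change every sequence ends in the marker's ID, so the last training pair of each sentence teaches the model to predict `"<EOS>"`. Note that `max_len` would also need to count the extra token.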
