# Making model

<img src="./diagram_1.PNG" width=600>

Yup... that's the entire diagram for the model <br/>
(without checkpointing. I'll cover that in this notebook as well)

# Helper

<img src="./GELU.png" width="400"/>

In [10]:
import math, torch
import torch.nn as nn

# note: i tried GELU, ReLU, Swish, etc.
# but GELU gave me the fastest runtime (idk the reason...)
# so i'm using GELU for this project, and it's the only name this helper accepts
def build_activation(name: str):
    name = (name or "gelu").lower() 
    if name == "gelu":
        return nn.GELU()
    raise ValueError(f"Unsupported activation: {name}")
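
To make the GELU picture above concrete, here's a small sketch that compares `nn.GELU()` against the closed form `GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2)))`. The helper name `gelu_exact` is made up for this example, and it assumes the default (non-approximate) mode of `nn.GELU`.

```python
import math
import torch
import torch.nn as nn

def gelu_exact(x: torch.Tensor) -> torch.Tensor:
    # hypothetical helper: GELU(x) = x * Phi(x), where Phi is the standard normal CDF
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

x = torch.linspace(-3.0, 3.0, steps=7)
reference = nn.GELU()(x)   # PyTorch module with default settings
manual = gelu_exact(x)     # closed form written by hand

max_abs_diff = (reference - manual).abs().max().item()
print("max |nn.GELU - manual|:", max_abs_diff)
```

If the two agree up to float rounding, the module is doing exactly what the formula says.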

# MultiHeadAttention

In [None]:
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads, qkv_bias, attn_drop, resid_drop, max_ctx):
        super().__init__()
        
        # in multi-head attention, we split the full embedding vector of size d_model into n_heads smaller chunks (one chunk per head)
        # so, d_model should be divisible by n_heads
        # if not, we'll end up with fractional head dimensions, which is impossible for tensor reshaping
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"

        self.n_heads = n_heads
        self.head_dim = d_model // n_heads # dimension per attention head

        # linear layers to project input [B, T, d_model] into queries, keys, values
        self.q_proj = nn.Linear(d_model, d_model, bias=qkv_bias) # Q projection
        self.k_proj = nn.Linear(d_model, d_model, bias=qkv_bias) # K projection
        self.v_proj = nn.Linear(d_model, d_model, bias=qkv_bias) # V projection
        # think of
        # Q: asks "what am i looking for?"
        # K: says "what do i have to offer?"
        # V: the actual content we'll aggregate once we decide "who to listen to"

        # output projection: combines all head outputs back into a single vector per token
        self.out_proj = nn.Linear(d_model, d_model, bias=True)

        # dropout layers (attention weights and residual output)
        self.attn_drop = nn.Dropout(attn_drop)
        self.resid_drop = nn.Dropout(resid_drop)

        # masking (see Figure 2_1)
        # precompute an upper-triangular causal mask of shape [max_ctx, max_ctx]
        # True entries mark the positions that must be masked out, i.e. future tokens
        # (only the manual fallback path in forward() reads this buffer)
        mask = torch.triu(torch.ones(max_ctx, max_ctx, dtype=torch.bool), diagonal=1)
        self.register_buffer("causal_mask", mask, persistent=False)

    def forward(self, x):
        B, T, C = x.shape

        # project input to Q, K, V, then reshape into [B, n_heads, T, head_dim]
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)

        # compute attention output
        if hasattr(F, "scaled_dot_product_attention"):
            # PyTorch >= 2.0 ships scaled_dot_product_attention (SDPA); use it when available
            y = F.scaled_dot_product_attention(
                q, k, v, attn_mask=None, is_causal=True,
                dropout_p=self.attn_drop.p if self.training else 0.0,
            )
        else:
            # manual attention computation
            # scale it first
            att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)

            # apply causal mask
            cm = self.causal_mask[:T, :T].to(att.device)
            att = att.masked_fill(cm, torch.finfo(att.dtype).min)

            # softmax
            att = F.softmax(att, dim=-1)

            # dropout attention weights
            att = self.attn_drop(att)

            # weighted sum of values
            y = att @ v # [B, h, T, head_dim]

        # merge heads
        # [B, h, T, head_dim] → [B, T, C]
        y = y.transpose(1, 2).contiguous().view(B, T, C)

        # final output projection + dropout
        y = self.resid_drop(self.out_proj(y))
        return y
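
The two branches in `forward` are meant to compute the same thing. Here's a minimal check of that, assuming PyTorch >= 2.0 (so SDPA exists) and no dropout. The sizes `B, H, T, D` are arbitrary toy values, not the model's real config.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, T, D = 2, 4, 5, 8  # toy sizes: batch, heads, tokens, head_dim
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))

# manual path: scale, mask future positions, softmax, weighted sum of values
att = (q @ k.transpose(-2, -1)) / math.sqrt(D)
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
att = att.masked_fill(mask, torch.finfo(att.dtype).min)
y_manual = F.softmax(att, dim=-1) @ v

# fused path: same computation via SDPA with the built-in causal mask
y_sdpa = F.scaled_dot_product_attention(q, k, v, is_causal=True)

paths_match = torch.allclose(y_manual, y_sdpa, atol=1e-5)
print("manual == SDPA:", paths_match)
```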


### Figure 2_1
<img src="./causal_mask_example.png" width="300"/>
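
Here's a tiny version of the mask in the figure, as a sketch with `T = 4` and all attention scores set to zero (a made-up input so the result is easy to read). Each row can only spread its weight over itself and earlier positions.

```python
import torch
import torch.nn.functional as F

T = 4
# True above the diagonal = "this is a future token, hide it"
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
print(mask)

scores = torch.zeros(T, T)  # toy scores: every token looks equally relevant
masked = scores.masked_fill(mask, torch.finfo(scores.dtype).min)
weights = F.softmax(masked, dim=-1)
print(weights)  # row i is uniform over positions 0..i and zero afterwards
```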

# MLP (Feed-Forward)

In [12]:
# the feed-forward network used inside each transformer block

class MLP(nn.Module):
    def __init__(self, d_model, d_ff, drop, activation="gelu"):
        super().__init__()

        # first linear layer: expand embedding dimension to a larger feed-forward size
        # this gives the network more capacity to learn richer transformations per token
        self.fc = nn.Linear(d_model, d_ff)

        # non-linear activation (we're using GELU)
        self.act = build_activation(activation)

        # second linear layer: project back down to original embedding size
        self.proj = nn.Linear(d_ff, d_model)

        # dropout: randomly zero out some elements during training to prevent overfitting
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        # flow:
        # 1. project up: [B, T, d_model] -> [B, T, d_ff]
        # 2. apply non-linear activation
        # 3. project down: [B, T, d_ff] -> [B, T, d_model]
        # 4. apply dropout to output
        return self.drop(self.proj(self.act(self.fc(x))))
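
To see the expand-then-project flow with real numbers, here's a rough sketch using `d_model = 64` and `d_ff = 256` (the smoke-test sizes, picked here just for illustration). It skips dropout and also counts the parameters by hand.

```python
import torch
import torch.nn as nn

d_model, d_ff = 64, 256
fc = nn.Linear(d_model, d_ff)    # project up
proj = nn.Linear(d_ff, d_model)  # project down

x = torch.randn(2, 16, d_model)  # [B, T, d_model]
h = nn.GELU()(fc(x))             # [B, T, d_ff]
y = proj(h)                      # back to [B, T, d_model]
print(x.shape, "->", h.shape, "->", y.shape)

# two weight matrices plus two bias vectors
n_params = sum(p.numel() for layer in (fc, proj) for p in layer.parameters())
expected = 2 * d_model * d_ff + d_ff + d_model
print("MLP params:", n_params, "expected:", expected)
```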

# TransformerBlock

In [20]:
# we'll have multiples of this later
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()

        # dimensions
        d_model = cfg["emb_dim"] # embedding dimension
        n_heads = cfg["n_heads"] # number of attention heads

        # feed-forward dimension
        d_ff = cfg.get("intermediate_size", 4 * d_model) # defaults to 4 * the embedding size

        # dropout 
        drop = cfg.get("drop_rate", 0.1) # residual & MLP dropout
        attn_drop = cfg.get("attention_probs_dropout_prob", drop) # attention weight dropout

        # attention config
        qkv_bias = cfg.get("qkv_bias", False) # whether Q/K/V projections have bias (defaults to False here; model_config_124M.json sets it to True)
        max_ctx = cfg["context_length"] # maximum sequence length

        # activation for MLP
        activation = cfg.get("activation", "gelu") # we're using GELU overall

        # layernorm epsilon (small constant for numerical stability)
        eps = cfg.get("layer_norm_eps", 1e-5)

        # submodules
        self.ln1 = nn.LayerNorm(d_model, eps=eps) # pre-attention layernorm
        self.attn = MultiHeadAttention(d_model, n_heads, qkv_bias, attn_drop, drop, max_ctx) # multi-head self-attention block
        self.ln2 = nn.LayerNorm(d_model, eps=eps) # pre-MLP layer norm
        self.mlp = MLP(d_model, d_ff, drop, activation) # feed-forward network

        # scaling factor (to stabilize training for deep networks)
        # 1 / sqrt(2 * num_layers)
        self.res_scale = 1.0 / math.sqrt(2 * cfg["n_layers"])


    def forward(self, x):
        # multi-head attention sublayer
        # pre-norm: normalize inputs before attention
        # residual: add attention output back to original x
        x = x + self.attn(self.ln1(x)) * self.res_scale

        # MLP sublayer
        # pre-norm: normalize inputs before MLP
        # residual: add MLP output back to updated x
        x = x + self.mlp(self.ln2(x)) * self.res_scale
        return x
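
One way to get a feel for the `1 / sqrt(2 * n_layers)` factor: the model has `2 * n_layers` residual branches (attention + MLP per block). The sketch below is a toy, assuming every branch output is independent with unit variance (real activations aren't). `res_scale` here is a standalone helper written for this example.

```python
import math
import torch

def res_scale(n_layers: int) -> float:
    # same formula as TransformerBlock.res_scale
    return 1.0 / math.sqrt(2 * n_layers)

torch.manual_seed(0)
n_layers = 12
# toy stand-in for the 2 * n_layers branch outputs that get added to the residual stream
branches = torch.randn(2 * n_layers, 100_000)

unscaled_var = branches.sum(dim=0).var().item()
scaled_var = (branches * res_scale(n_layers)).sum(dim=0).var().item()
print("variance without scaling:", unscaled_var)  # grows with depth
print("variance with scaling:   ", scaled_var)    # stays close to 1
```

Under that toy assumption, the scaling keeps the summed contribution at roughly unit variance no matter how many blocks are stacked.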


# Main Block

In [25]:
# weight initialization (this is used in DummyModel class)
def _init_weights(module, std):
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=std)
    elif isinstance(module, nn.LayerNorm):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)

In [26]:
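# main model
from torch import autocast # used by the gradient checkpointing path
from torch.utils.checkpoint import checkpoint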

class DummyModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()

        # grab configurations
        # in this case, from "model_config_124M.json"
        self.vocab_size = cfg["vocab_size"]
        self.context_length = cfg["context_length"]
        d_model = cfg["emb_dim"]
        n_layers = cfg["n_layers"]
        drop = cfg.get("drop_rate", 0.1)
        eps = cfg.get("layer_norm_eps", 1e-5)
        init_std = cfg.get("initializer_range", 0.02)

        # embeddings
        self.tok_emb = nn.Embedding(self.vocab_size, d_model) # token embeddings
        self.pos_emb = nn.Embedding(self.context_length, d_model) # positional embeddings
        self.drop = nn.Dropout(drop) # dropout after embeddings

        # transformer layers
        # remember i told you we'd have multiple transformer blocks?
        # this is where we stack n_layers of them
        self.blocks = nn.ModuleList([
            TransformerBlock(cfg) for _ in range(n_layers) # stack of blocks
        ])

        # final norm & output projection
        self.ln_f = nn.LayerNorm(d_model, eps=eps) # normalize before output
        self.lm_head = nn.Linear(d_model, self.vocab_size, bias=False) # projection to vocab

        # weight tying
        # share token embedding weights with output projection weights
        # this reduces parameters
        self.lm_head.weight = self.tok_emb.weight

        # gradient checkpointing flag (off by default)
        self.grad_ckpt = bool(cfg.get("grad_ckpt", False))

        # apply _init_weights to every submodule
        self.apply(lambda m: _init_weights(m, init_std))

    def forward(self, idx, labels=None):
        assert idx.dtype == torch.long
        B, T = idx.shape

        # trim the sequence if it exceeds context_length (keep the most recent tokens)
        # some tokens get dropped during training this way,
        # but hey, real training runs see literally billions of tokens,
        # so that isn't a big issue
        if T > self.context_length:
            idx = idx[:, -self.context_length:]
            T = idx.shape[1]

        # embedding lookup
        T = min(T, self.context_length) # for safety
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx[:, :T]) + self.pos_emb(pos)[None, :, :]
        x = self.drop(x)

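        # gradient checkpointing (training only)
        use_ckpt = self.grad_ckpt and self.training

        # with checkpointing on, each block's activations are recomputed during
        # the backward pass instead of being stored, which saves memory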
        if use_ckpt:
            try:
                amp_dtype = torch.get_autocast_gpu_dtype()
            except AttributeError:
                amp_dtype = torch.bfloat16

            def run_block(block, t):
                with autocast(device_type=t.device.type, dtype=amp_dtype):
                    return block(t)

            def _block_forward(bl, u):
                return run_block(bl, u)
            
            for b in self.blocks:
                x = checkpoint(_block_forward, b, x, use_reentrant=False)
        else:
            # transformer stack
            for b in self.blocks:
                x = b(x)

        # final norm + projection
        x = self.ln_f(x)
        logits = self.lm_head(x)

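        # loss (optional; used for logging and plotting)
        # pass labels unshifted (same layout as idx); the shift by one happens below
        loss = None
        if labels is not None:
            labels = labels[:, -T:] # keep the same trailing window as idx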
            assert labels.dtype == torch.long
            logits_flat = logits[:, :-1, :].contiguous().view(-1, self.vocab_size)
            labels_flat = labels[:, 1:].contiguous().view(-1)
            loss = F.cross_entropy(logits_flat, labels_flat, ignore_index=-100)

        return logits, loss

    def num_params(self, trainable_only=True):
        # count total parameters in the module
        ps = (p for p in self.parameters() if (p.requires_grad or not trainable_only))
        return sum(p.numel() for p in ps)
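
The loss part of `forward` shifts by one: the logits at position `t` are scored against the token at `t + 1`, and `-100` labels are skipped. Here's a small sketch of that with random toy tensors (sizes and the "padding" positions are made up), checked against a by-hand version.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, V = 2, 5, 11                      # toy batch, sequence length, vocab size
logits = torch.randn(B, T, V)           # stand-in for model output
labels = torch.randint(0, V, (B, T))    # unshifted labels, same layout as idx
labels[0, -2:] = -100                   # pretend the last two tokens of sample 0 are padding

# position t predicts token t + 1
shift_logits = logits[:, :-1, :].reshape(-1, V)
shift_labels = labels[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)

# by hand: mean negative log-likelihood over the non-ignored targets
keep = shift_labels != -100
log_probs = F.log_softmax(shift_logits[keep], dim=-1)
manual = -log_probs.gather(1, shift_labels[keep].unsqueeze(1)).mean()
n_targets = int(keep.sum())
print("targets used:", n_targets, "loss:", float(loss), "manual:", float(manual))
```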


# Tiny Smoke Test
(just to verify the code above runs without errors)

In [27]:

cfg = dict(
    vocab_size=1000, context_length=32, emb_dim=64,
    n_heads=8, n_layers=2, drop_rate=0.1, qkv_bias=True, activation="gelu"
)
m = DummyModel(cfg)
print("Params:", m.num_params())
B, T = 4, 32
idx = torch.randint(0, cfg["vocab_size"], (B, T))
labels = torch.randint(0, cfg["vocab_size"], (B, T))
with torch.no_grad():
    logits, loss = m(idx, labels)
print("logits:", logits.shape, "loss:", float(loss))


Params: 166144
logits: torch.Size([4, 32, 1000]) loss: 6.944150447845459
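
The `Params: 166144` figure can be reproduced by hand from the smoke-test config. The helper `linear_params` is just for this sketch. `lm_head` adds nothing because its weight is tied to `tok_emb`, and `parameters()` counts a shared tensor once.

```python
vocab_size, context_length, d_model, n_layers = 1000, 32, 64, 2
d_ff = 4 * d_model  # default intermediate_size

def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    # weight matrix plus optional bias vector
    return n_in * n_out + (n_out if bias else 0)

layer_norm = 2 * d_model                        # weight + bias
attn = 4 * linear_params(d_model, d_model)      # q, k, v (qkv_bias=True) + out_proj
mlp = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
block = 2 * layer_norm + attn + mlp             # ln1, ln2, attention, MLP

embeddings = vocab_size * d_model + context_length * d_model
total = embeddings + n_layers * block + layer_norm  # + final ln_f, tied lm_head adds 0
print("per block:", block, "total:", total)
```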


In [28]:

B, T, d = 2, 16, cfg["emb_dim"]
x = torch.randn(B, T, d)
attn = MultiHeadAttention(d, cfg["n_heads"], True, 0.0, 0.0, cfg["context_length"])
with torch.no_grad():
    y = attn(x)
print("MHA input:", x.shape, "→ output:", y.shape)


MHA input: torch.Size([2, 16, 64]) → output: torch.Size([2, 16, 64])


if the input and output `torch.Size` match <br/>
you're good to go!

# END

in the python_files folder, there's Dummy_model.py <br/>
it's basically the same as this notebook (except I tried ReLU there as a test, but I'm sticking with GELU) <br/>
I named it Dummy_model.py <br/>
(you can use a different name. Just make sure you select the correct model file later on)

# UPDATES

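- **Fused QKV**: replaced three Linear projections (Q,K,V) with a single `nn.Linear(d_model, 3*d_model)` (the `MultiHeadAttention` cell in this notebook still shows the separate `q_proj`, `k_proj`, `v_proj` layers).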
- **SDPA path**: use `torch.nn.functional.scaled_dot_product_attention` when available; fallback to masked softmax.
- **Causal masking**: register a boolean upper-tri mask as a buffer; SDPA path uses `is_causal=True`.
- **Residual scaling**: scale both residual branches by `1/sqrt(2*n_layers)` for depth stability.
- **Pre-Norm**: LayerNorm before attention and MLP (`ln1`, `ln2`) with configurable `eps`.
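- **Gradient checkpointing**: optional per-block checkpointing (`use_reentrant=False`, `preserve_rng_state=False`) and rely on outer AMP autocast (the cell in this notebook passes only `use_reentrant=False` and wraps each block in its own `autocast`).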
- **AMP support**: compatible with `torch.amp.autocast` in the training loop.
- **Weight tying**: `lm_head.weight = tok_emb.weight` to tie input/output embeddings.
- **Learned positional embeddings**: keep learned `pos_emb` (no RoPE/ALiBi).
- **Label shift & ignore_index**: next-token CE loss with `ignore_index=-100`.
- **Config knobs**: `qkv_bias`, `attention_probs_dropout_prob` separate from `drop_rate`, `layer_norm_eps`, `initializer_range`.
- **Context-length safety**: trim long sequences and clamp `T` to `context_length`.
- **Shape guards**: assert `d_model % n_heads == 0`.
- **Param count helper**: `num_params(trainable_only=True)`.
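
As a quick illustration of the gradient checkpointing item: checkpointing changes when activations are computed, not the gradients themselves. The sketch below uses a small stand-in block (not the real `TransformerBlock`) with no dropout, so both runs are deterministic.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
# toy stand-in for one transformer block
block = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
x = torch.randn(4, 16, requires_grad=True)

# regular forward/backward: intermediate activations are kept in memory
block(x).sum().backward()
plain_grad = x.grad.clone()
x.grad = None
block.zero_grad()

# checkpointed forward/backward: activations are recomputed during backward
checkpoint(block, x, use_reentrant=False).sum().backward()
ckpt_grad = x.grad.clone()

grads_match = torch.allclose(plain_grad, ckpt_grad)
print("same gradients with checkpointing:", grads_match)
```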

# DONE!