In [4]:
import sys
import os

# Add project root to sys.path so 'src' is importable
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.insert(0, project_root)

print("Added project root to sys.path:", project_root)
print("Current sys.path:", sys.path)

Added project root to sys.path: /Users/imperfect_abhi/Desktop/Learning/github-repo/slm-personalized-ecommerce-recommendations
Current sys.path: ['/Users/imperfect_abhi/Desktop/Learning/github-repo/slm-personalized-ecommerce-recommendations', '/Users/imperfect_abhi/Desktop/Learning/github-repo/slm-personalized-ecommerce-recommendations', '/Library/Frameworks/Python.framework/Versions/3.12/lib/python312.zip', '/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12', '/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/lib-dynload', '', '/Users/imperfect_abhi/Desktop/Learning/VENVS/slm-env/lib/python3.12/site-packages']


# 03 - SLM Architecture From Scratch

In this notebook we:
- Port the GPT model from the original Vizuara notebook into a modular class
- Explain every component theoretically
- Discuss pros/cons of design choices
- Test a tiny forward pass on dummy input

## 3.1 Why a Decoder-Only Transformer?

- Causal self-attention masks future tokens, so each position is predicted only from the tokens before it, which is what autoregressive generation requires
- Ideal for next-token prediction → text generation, recommendation narratives
- Pros: Simple, scalable, proven (GPT family)
- Cons: Attention cost grows quadratically with sequence length (manageable on a laptop at block_size=128)
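
To make the "no peeking" point concrete, here is a minimal pure-Python sketch of a causal mask applied before the softmax. It is an illustration only, not the code in `src/model/gpt.py`; the helper names `causal_mask` and `masked_softmax` are hypothetical, and the scores are dummy values.

```python
import math

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which masked-out entries receive zero weight."""
    weights = []
    for row, allowed in zip(scores, mask):
        # Masked positions are set to -inf, so exp() maps them to exactly 0.
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        peak = max(masked)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

seq_len = 4
scores = [[0.0] * seq_len for _ in range(seq_len)]  # dummy, uniform scores
weights = masked_softmax(scores, causal_mask(seq_len))
for row in weights:
    print([round(w, 2) for w in row])
```

Each row sums to 1, and every weight above the diagonal is exactly zero, so token `i` never draws information from tokens after it. The score matrix has `seq_len * seq_len` entries per head, which is where the quadratic cost comes from: at block_size=128 that is 128 × 128 = 16,384 entries.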

In [8]:
try:
    from src.model.gpt import GPT, GPTConfig
    print("Import successful!")
    config = GPTConfig()
    model = GPT(config)
    print(f"Model loaded: {model.get_num_params() / 1e6:.2f}M parameters")
    print(model)
except ImportError as e:
    print("Import failed:", e)

Import successful!
number of parameters: 10.70M
Model loaded: 10.70M parameters
GPT(
  (transformer): ModuleDict(
    (wte): Embedding(50257, 384)
    (wpe): Embedding(128, 384)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x Block(
        (ln_1): LayerNorm()
        (attn): CausalSelfAttention(
          (c_attn): Linear(in_features=384, out_features=1152, bias=True)
          (c_proj): Linear(in_features=384, out_features=384, bias=True)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm()
        (mlp): MLP(
          (c_fc): Linear(in_features=384, out_features=1536, bias=True)
          (gelu): GELU(approximate='none')
          (c_proj): Linear(in_features=1536, out_features=384, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm()
  )
  (lm_head): Linear(in_features=384, out_features=50257, bia

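Expected output: ~58.5M parameters (close to the 60M target). The run above, which uses the default `GPTConfig()`, reports 10.70M, so the defaults do not yet match that target.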
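
The 10.70M figure in the printed output can be cross-checked by hand from the layer shapes in the module tree. The sketch below is plain arithmetic, with two assumptions that the printout does not confirm: each `LayerNorm()` holds a weight and a bias of size 384, and `lm_head` shares its weights with `wte` (the `lm_head` line is truncated in the output). The helper `linear` is a hypothetical name used only here.

```python
# Shapes copied from the printed module tree.
vocab_size, block_size, n_embd, n_layer = 50257, 128, 384, 6

def linear(n_in, n_out):
    """Parameter count of a Linear layer with bias=True."""
    return n_in * n_out + n_out

# ASSUMPTION: each LayerNorm has a weight and a bias vector of size n_embd.
layer_norm = 2 * n_embd

block = (
    layer_norm
    + linear(n_embd, 3 * n_embd)   # attn.c_attn: 384 -> 1152
    + linear(n_embd, n_embd)       # attn.c_proj: 384 -> 384
    + layer_norm
    + linear(n_embd, 4 * n_embd)   # mlp.c_fc:    384 -> 1536
    + linear(4 * n_embd, n_embd)   # mlp.c_proj:  1536 -> 384
)

wte = vocab_size * n_embd          # token embedding
wpe = block_size * n_embd          # position embedding
body = n_layer * block + layer_norm  # 6 blocks plus the final ln_f

# ASSUMPTION: lm_head is tied to wte, so it adds no parameters of its own.
print(f"per block:            {block:,}")
print(f"blocks + ln_f + wpe:  {(body + wpe) / 1e6:.2f}M")
print(f"including wte:        {(body + wpe + wte) / 1e6:.2f}M")
```

Under these assumptions, the blocks, `ln_f`, and `wpe` sum to 10,696,704 parameters, which rounds to the reported 10.70M. That suggests the reported count leaves out the 19.3M-parameter token embedding, although the source does not show how `get_num_params` is implemented.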

Continued in the next notebooks: LoRA integration, the fine-tuning loop, and more.