

# Day 8 – Debugging & Problem Solving (2 hrs)

**Goal:** Be able to fix errors quickly.

---

## 📖 Reading (~30 min)

### 1. Common PyTorch Errors
- **Device mismatch (`RuntimeError: Expected all tensors to be on the same device`)**
  - Happens if model is on GPU but data is still on CPU (or vice versa).
  - Fix: `.to(device)` on both model and tensors.

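- **GradScaler warnings (AMP training)**
  - AMP = Automatic Mixed Precision.
  - A warning that `GradScaler` is being disabled usually means CUDA is not available, so the scaler turns itself off. Separately, when gradients contain inf/NaN (often because the loss overflowed), `GradScaler` skips that optimizer step and lowers its scale factor.
  - Fix: reduce learning rate, check for exploding gradients, clip gradients (after unscaling them).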

- **Shape mismatch**
  - Example: logits shape `[batch_size, vocab_size]` vs labels shape `[batch_size, seq_len]`.
  - Fix: make sure labels align with model output (e.g. shift labels in language modeling).
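
For the language-modeling case mentioned above, here is a minimal sketch of what "shifting labels" looks like. It assumes random logits in place of a real model, and the sizes are purely illustrative. Hugging Face causal LM heads generally do this shift internally when you pass `labels`, so this mainly matters when you compute the loss yourself.

```python
import torch
import torch.nn as nn

batch_size, seq_len, vocab_size = 2, 6, 11   # illustrative sizes, not from a real model

logits = torch.randn(batch_size, seq_len, vocab_size)            # stand-in for model output
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))  # stand-in for tokenized text

# Position t predicts token t+1, so drop the last logit and the first label.
shift_logits = logits[:, :-1, :]   # (batch, seq_len - 1, vocab)
shift_labels = input_ids[:, 1:]    # (batch, seq_len - 1)

# CrossEntropyLoss wants (N, classes) and (N,), so flatten batch and time together.
loss = nn.CrossEntropyLoss()(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
)
print(shift_logits.shape, shift_labels.shape, loss.item())
```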

---

### 2. Hugging Face Specific Errors
- **`KeyError: 'input_ids'`**
  - Hugging Face models expect inputs in dicts with keys: `input_ids`, `attention_mask`, etc.
  - Fix: make sure your tokenizer output is passed as `**batch`.

- **`ValueError: Asking to pad but the tokenizer does not have a padding token`**
  - Needed for batching sequences of different lengths.
  - Fix:
    ```python
    tokenizer.pad_token = tokenizer.eos_token
    ```

- **`RuntimeError: CUDA out of memory`**
  - Fixes: reduce batch size, use gradient accumulation, switch to a smaller model, or lower the precision (mixed precision for training; `.half()` is mainly suited to inference).
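
Gradient accumulation is the least obvious of these fixes, so here is a rough sketch of the pattern. The toy `nn.Linear` model, the random data, and the values of `micro_batch_size` and `accum_steps` are assumptions for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)                       # toy model standing in for a large one
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

micro_batch_size = 4    # what fits in memory (assumed)
accum_steps = 8         # effective batch size = 4 * 8 = 32
optimizer_steps = 0

optimizer.zero_grad()
for step in range(32):
    x = torch.randn(micro_batch_size, 10)
    y = torch.randint(0, 2, (micro_batch_size,))
    # Divide so the accumulated gradient matches the mean over the large batch.
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()                            # gradients add up in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # one update per accum_steps micro-batches
        optimizer.zero_grad()
        optimizer_steps += 1

print(optimizer_steps)  # 4
```

Memory use is set by the micro-batch, while the update behaves roughly like one computed on the larger effective batch.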

---

## 🛠 Hands-on (~1.5 hr)

We’ll **intentionally introduce errors** and practice debugging.

### 1. Device mismatch




```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(10, 2).to(device)
x = torch.randn(4, 10)   # Oops! Still on CPU

# ❌ Will error when device is "cuda" (on a CPU-only machine both live on CPU and it runs):
out = model(x)
```

Fix:

```python
x = x.to(device)   # move the input to the same device as the model
out = model(x)
```
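
Real batches are often dicts rather than single tensors. A small helper like the one below can move everything at once and check devices before the forward pass. The name `to_device` and the batch layout are hypothetical, chosen only for this sketch.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

def to_device(batch, device):
    """Move every tensor in a dict to `device`; leave other values untouched."""
    return {k: v.to(device) if torch.is_tensor(v) else v for k, v in batch.items()}

model = nn.Linear(10, 2).to(device)
batch = {"features": torch.randn(4, 10), "ids": [0, 1, 2, 3]}   # hypothetical batch layout

batch = to_device(batch, device)

# Compare devices explicitly: a failed assert here is easier to read than the RuntimeError.
param_device = next(model.parameters()).device
assert batch["features"].device == param_device, (batch["features"].device, param_device)

out = model(batch["features"])
print(out.shape)
```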

### 2. KeyError: 'input_ids'

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

text = ["hello world"]
batch = tokenizer(text, padding=True, return_tensors="pt")

# ❌ Mistakenly try to access the wrong key
print(batch["ids"])
```

Fix:

```python
print(batch["input_ids"])
# tensor([[ 101, 7592, 2088,  102]])
```
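
To see why the key names matter, the sketch below uses a toy stand-in whose `forward` takes named tensors, the way Hugging Face models do. `ToyEncoder` is hypothetical and is not a real Hugging Face class. The batch is a plain dict with the same keys a tokenizer would typically return.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Hypothetical stand-in for a Hugging Face model: forward takes named tensors."""

    def __init__(self, vocab_size=30522, hidden=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.embed(input_ids)                       # (batch, seq, hidden)
        if attention_mask is not None:
            hidden = hidden * attention_mask.unsqueeze(-1)   # zero out padded positions
        return hidden

batch = {
    "input_ids": torch.tensor([[101, 7592, 2088, 102]]),
    "attention_mask": torch.tensor([[1, 1, 1, 1]]),
}
print(list(batch.keys()))   # inspect the keys before indexing into the batch

model = ToyEncoder()
out = model(**batch)        # same as model(input_ids=..., attention_mask=...)
print(out.shape)
```

Printing the keys first is a cheap habit: a misspelled key fails loudly here, and `**batch` only works when the dict keys match the parameter names of `forward`.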


### 3. Tokenizer pad_token issue

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = None   # GPT-2 ships without a pad token; set explicitly so the cell is repeatable

# ❌ This will error because GPT-2 has no pad_token
batch = tok(["hi", "there"], padding=True, return_tensors="pt")
```

Fix:

```python
tok.pad_token = tok.eos_token   # reuse the end-of-sequence token for padding
batch = tok(["hi", "there"], padding=True, return_tensors="pt")
```
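
To make it concrete why padding needs a token id at all, here is a plain-Python sketch of what padding does conceptually. `pad_batch` is a hypothetical helper, not the tokenizer's actual implementation, and the token ids are made up except for 50256, which is GPT-2's EOS id.

```python
def pad_batch(sequences, pad_id):
    """Right-pad lists of token ids to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)              # fill the gap with pad_id
        attention_mask.append([1] * len(seq) + [0] * n_pad)   # 0 marks padding to ignore
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token ids are made up for illustration; 50256 is GPT-2's EOS id reused as padding.
batch = pad_batch([[5, 6, 7], [8]], pad_id=50256)
print(batch["input_ids"])        # [[5, 6, 7], [8, 50256, 50256]]
print(batch["attention_mask"])   # [[1, 1, 1], [1, 0, 0]]
```

Without a `pad_id` there is nothing to fill the shorter rows with, which is the situation the tokenizer error is reporting.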

### 4. Shape mismatch

```python
logits = torch.randn(4, 5)             # (batch, classes)
labels = torch.randint(0, 5, (4, 3))   # (batch, seq_len) ❌ mismatch

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, labels)
```

Fix:

```python
labels = torch.randint(0, 5, (4,))   # match (batch,)
loss = loss_fn(logits, labels)
```
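
When the labels really are per-token, the fix is to reshape rather than drop the sequence dimension. A short sketch with random tensors and illustrative sizes: `nn.CrossEntropyLoss` expects the class dimension at index 1, so either permute the logits or flatten batch and sequence together.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, num_classes = 4, 3, 5   # illustrative sizes

logits = torch.randn(batch_size, seq_len, num_classes)          # (batch, seq_len, classes)
labels = torch.randint(0, num_classes, (batch_size, seq_len))   # (batch, seq_len)

loss_fn = nn.CrossEntropyLoss()

# Option 1: move the class dimension to index 1 -> (batch, classes, seq_len).
loss_permute = loss_fn(logits.permute(0, 2, 1), labels)

# Option 2: flatten batch and sequence -> (batch * seq_len, classes) and (batch * seq_len,).
loss_flat = loss_fn(logits.reshape(-1, num_classes), labels.reshape(-1))

print(torch.allclose(loss_permute, loss_flat))  # True
```

Printing `logits.shape` and `labels.shape` right before the loss call is usually the quickest way to see which of the two cases you are in.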

### 5. AMP / GradScaler

```python
# Requires a CUDA GPU; on CPU the scaler and autocast disable themselves with a warning.
scaler = torch.cuda.amp.GradScaler()
optim = torch.optim.Adam(model.parameters(), lr=1e-2)

for i in range(100):
    optim.zero_grad()
    with torch.cuda.amp.autocast():
        out = model(torch.randn(32, 10).to(device))
        loss = out.mean() * 1e6  # ❌ artificial overflow
    scaler.scale(loss).backward()
    scaler.step(optim)
    scaler.update()
```

Fixes:

- Lower the learning rate.
- Clip gradients.
- Avoid extreme scaling of the loss.

```python
scaler = torch.cuda.amp.GradScaler()
optim = torch.optim.Adam(model.parameters(), lr=1e-5)   # lower learning rate

for i in range(100):
    optim.zero_grad()                   # clear gradients from the previous iteration
    with torch.cuda.amp.autocast():
        out = model(torch.randn(32, 10).to(device))
        loss = out.mean()               # removed the large coefficient that caused the overflow
    scaler.scale(loss).backward()
    scaler.unscale_(optim)              # unscale first so clipping sees the true gradient norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # clip right before the step
    scaler.step(optim)
    scaler.update()
```
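
When hunting for the source of NaN or inf values, it can help to check the loss and gradients explicitly rather than waiting for a warning. The sketch below runs on CPU without AMP. `grads_are_finite` is a hypothetical helper, and the blow-up is simulated by multiplying the loss by infinity.

```python
import torch
import torch.nn as nn

def grads_are_finite(model):
    """Return True if every existing gradient contains only finite values."""
    return all(
        torch.isfinite(p.grad).all().item()
        for p in model.parameters()
        if p.grad is not None
    )

model = nn.Linear(10, 2)
x = torch.randn(32, 10)

# Healthy step: finite loss, finite gradients.
loss = model(x).mean()
loss.backward()
ok_before = torch.isfinite(loss).item() and grads_are_finite(model)

# Simulated blow-up: multiplying by inf yields a non-finite loss.
model.zero_grad()
bad_loss = model(x).mean() * float("inf")
bad_loss.backward()
ok_after = torch.isfinite(bad_loss).item() and grads_are_finite(model)

print(ok_before, ok_after)  # True False
```

A check like this inside the training loop shows the first iteration where values stop being finite, which narrows down whether the learning rate, the data, or the loss scaling is responsible.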
