[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gouthamgo/FineTuning/blob/main/lessons/module2_first_training/03_debugging_training.ipynb)

# 🔧 Debugging Like a Pro (When Things Go Wrong)

**Duration:** 1 hour  
**Level:** Intermediate  
**Prerequisites:** Module 2, Lessons 1-2

---

## Hey Friend, Let's Fix Some Bugs! 🐛

Okay, real talk time:

**Your model WILL fail. Multiple times. And that's totally normal!**

I've been doing this for years and I STILL run into errors every single day. The difference between a beginner and a pro isn't that pros don't get errors - it's that **pros know how to fix them fast**.

Think of this lesson as your **"Oh crap, it's broken!"** survival guide.

We're going to cover:
1. The most common errors (and how to fix them)
2. How to debug training issues
3. When your model "works" but gives bad results
4. Memory errors (the #1 frustration!)

Let's become debugging ninjas! 🥷

## 🚨 The Top 5 Errors (And How to Fix Them)

Let me save you HOURS of frustration. Here are the errors everyone hits:

### 1. **"CUDA Out of Memory" 💥**

**What it looks like:**
```
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
```

**What it means:**
Your GPU ran out of memory (VRAM, not your computer's regular RAM). It's like trying to fit 10 pounds of potatoes in a 5-pound bag.

**How to fix:**
```python
# Option 1: Smaller batch size
per_device_train_batch_size=8  # Instead of 16 or 32

# Option 2: Shorter sequences
max_length=128  # Instead of 512

# Option 3: Gradient accumulation (simulate larger batches)
gradient_accumulation_steps=4
per_device_train_batch_size=4  # 4 * 4 = effective batch of 16

# Option 4: Enable mixed precision
fp16=True  # Uses less memory

# Option 5: Smaller model
# Use 'distilbert' instead of 'bert-base'
# Use 'bert-base' instead of 'bert-large'
```

**Pro tip:** Start small (batch_size=4), then gradually increase until you hit the limit!
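
The "start small, then increase" tip can be automated. Here is a minimal sketch, not part of the original lesson: `largest_working_batch_size` and `try_batch` are hypothetical names, and `try_batch` is assumed to be a function you write that runs one training step at a given batch size. It relies on PyTorch reporting CUDA OOM as a `RuntimeError` whose message contains "out of memory".

```python
def largest_working_batch_size(try_batch, start=4, limit=256):
    """Double the batch size until try_batch fails with an out-of-memory error.

    try_batch is a hypothetical callable: it should run one training step
    at the given batch size and raise RuntimeError on CUDA OOM.
    Returns the largest size that worked, or None if even `start` failed.
    """
    best = None
    size = start
    while size <= limit:
        try:
            try_batch(size)
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # A different bug: don't hide it
            break
        best = size
        size *= 2
    return best


# Stand-in for a real training step: pretend anything above 16 overflows the GPU
def fake_step(batch_size):
    if batch_size > 16:
        raise RuntimeError("CUDA out of memory. Tried to allocate 2.00 GiB")


print(largest_working_batch_size(fake_step))  # 16
```

In a real run you would probably leave some headroom below the size this finds, since memory use can vary from batch to batch.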

---

### 2. **"Size Mismatch" Error ⚠️**

**What it looks like:**
```
RuntimeError: The size of tensor a (128) must match the size of tensor b (512)
```

**What it means:**
Your input is the wrong size for the model.

**How to fix:**
```python
# Make sure you're using the SAME tokenizer as your model
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Ensure padding and truncation
tokenizer(text, padding='max_length', truncation=True, max_length=512)
```
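
A quick length check before training can catch this early. This is only a sketch under assumptions: `find_length_problems` is a hypothetical helper, `encoded` is assumed to be a dict with an `input_ids` list of token-id lists (the shape a tokenizer returns for a batch), and 512 is used as the position limit because that is what `bert-base-uncased` supports.

```python
def find_length_problems(encoded, max_positions=512):
    """Return (index, length) pairs for sequences longer than the model allows,
    plus the set of distinct lengths (more than one means the batch is ragged)."""
    lengths = [len(ids) for ids in encoded["input_ids"]]
    too_long = [(i, n) for i, n in enumerate(lengths) if n > max_positions]
    return too_long, set(lengths)


# Toy batch: the second sequence was never truncated
encoded = {"input_ids": [[101] * 128, [101] * 600, [101] * 128]}
too_long, distinct = find_length_problems(encoded)
print(too_long)          # [(1, 600)]
print(sorted(distinct))  # [128, 600]
```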

---

### 3. **"Token Type IDs Error" 🎭**

**What it looks like:**
```
TypeError: forward() got an unexpected keyword argument 'token_type_ids'
```

**What it means:**
Some models (like DistilBERT) don't accept `token_type_ids` in `forward()`, but a mismatched tokenizer (for example, BERT's) still creates them.

**How to fix:**
```python
# Remove them from your tokenized data
tokenized = tokenizer(text, truncation=True, padding=True)
tokenized.pop('token_type_ids', None)  # Remove if present

# Or use this in your dataset map function
def tokenize_function(examples):
    result = tokenizer(examples['text'], truncation=True, padding='max_length')
    result.pop('token_type_ids', None)
    return result
```
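
A more general variant is to keep only the inputs the model's `forward()` actually names. The sketch below uses the standard-library `inspect.signature`; `keep_accepted_inputs` and `fake_forward` are hypothetical names for illustration, and the approach assumes `forward()` lists its parameters explicitly rather than taking `**kwargs`.

```python
import inspect


def keep_accepted_inputs(forward_fn, encoded):
    """Keep only the keys that forward_fn names as parameters."""
    accepted = inspect.signature(forward_fn).parameters
    return {key: value for key, value in encoded.items() if key in accepted}


# Stand-in for a model's forward method that has no token_type_ids parameter
def fake_forward(input_ids=None, attention_mask=None, labels=None):
    return None


encoded = {
    "input_ids": [[101, 102]],
    "attention_mask": [[1, 1]],
    "token_type_ids": [[0, 0]],
}
print(sorted(keep_accepted_inputs(fake_forward, encoded)))
# ['attention_mask', 'input_ids']
```

With a real model you would pass `model.forward` instead of `fake_forward`.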

---

### 4. **"Loss is NaN" 📈**

**What it looks like:**
```
loss: nan
```

**What it means:**
Your training exploded! The numbers got too big and became "Not a Number".

**How to fix:**
```python
# Lower your learning rate
learning_rate=1e-5  # Instead of 1e-4 or higher

# Add gradient clipping
max_grad_norm=1.0  # Prevents gradient explosion

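# If you train with fp16, try turning it off (or use bf16 on GPUs that support it):
# fp16 has a narrow numeric range and can overflow
fp16=False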
```
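
It helps to know *when* the loss went bad, because the steps just before usually show it climbing. A minimal sketch, assuming you have collected the logged loss values into a plain list (`first_bad_step` is a hypothetical helper, and the numbers are made up):

```python
import math


def first_bad_step(losses):
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Made-up history: the loss climbs, overflows to inf, then turns into NaN
losses = [0.69, 0.71, 1.9, 48.0, float("inf"), float("nan")]
print(first_bad_step(losses))  # 4
```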

---

### 5. **"No module named 'transformers'" 📦**

**What it means:**
You forgot to install the library!

**How to fix:**
```python
!pip install transformers datasets torch accelerate
```

**In Colab?** Run this at the TOP of your notebook, BEFORE importing anything.
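
If you're not sure what's missing, you can check before importing. A small sketch using only the standard library (`missing_packages` is a hypothetical helper name; the package list is the one from the install command above):

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["transformers", "datasets", "torch", "accelerate"])
if missing:
    print("Install these first:", " ".join(missing))
else:
    print("All packages found")
```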

## 🔍 Debugging Training Issues

Your code runs... but something's wrong. Let's diagnose:

### Issue: "Loss isn't decreasing" 📉

Let's create a diagnostic script:

In [None]:
!pip install -q datasets transformers torch accelerate evaluate matplotlib

In [None]:
import matplotlib.pyplot as plt
from transformers import TrainingArguments, Trainer, AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset
import numpy as np
import evaluate

# Load small dataset for testing
dataset = load_dataset('imdb')
small_train = dataset['train'].shuffle(seed=42).select(range(100))
small_test = dataset['test'].shuffle(seed=42).select(range(50))

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

def tokenize(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=128)

train_data = small_train.map(tokenize, batched=True)
test_data = small_test.map(tokenize, batched=True)

metric = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

print("✅ Data ready for debugging!")

### Diagnostic Test #1: Is Your Model Learning Anything?

In [None]:
print("🧪 DIAGNOSTIC TEST: Can the model overfit on a tiny dataset?\n")
print("If a model can't overfit on 10 examples, something is VERY wrong.\n")

# Take ONLY 10 examples
tiny_data = small_train.select(range(10)).map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

args = TrainingArguments(
    output_dir='./test_overfit',
    num_train_epochs=20,  # Many epochs on tiny data
    per_device_train_batch_size=2,
    learning_rate=5e-5,
    logging_steps=10,
    evaluation_strategy='epoch',  # Called eval_strategy in newer transformers releases
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tiny_data,
    eval_dataset=tiny_data,  # Eval on same data!
    compute_metrics=compute_metrics,
)

result = trainer.train()

print("\n" + "="*60)
print("📊 RESULTS:\n")
print(f"Final loss: {result.training_loss:.4f}")

eval_result = trainer.evaluate()
print(f"Accuracy on training data: {eval_result['eval_accuracy']:.3f}")

if eval_result['eval_accuracy'] > 0.95:
    print("\n✅ GOOD! Model CAN learn (it overfitted on tiny data)")
    print("   → Your model setup is working")
    print("   → Problem is probably with hyperparameters or data")
else:
    print("\n❌ BAD! Model CANNOT learn even on 10 examples!")
    print("   → Check your model architecture")
    print("   → Check your data labels")
    print("   → Check your loss function")

### Diagnostic Test #2: Visualize Training Progress

In [None]:
# Let's plot training history
def plot_training_history(trainer):
    """Plot loss over time"""
    logs = trainer.state.log_history
    
    train_loss = [log['loss'] for log in logs if 'loss' in log]
    eval_loss = [log['eval_loss'] for log in logs if 'eval_loss' in log]
    
    plt.figure(figsize=(12, 5))
    
    # Loss plot
    plt.subplot(1, 2, 1)
    plt.plot(train_loss, label='Training Loss', marker='o')
    if eval_loss:
        plt.plot(eval_loss, label='Validation Loss', marker='s')
    plt.xlabel('Step', fontsize=12)
    plt.ylabel('Loss', fontsize=12)
    plt.title('📉 Training Progress', fontsize=14, fontweight='bold')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Accuracy plot (if available)
    plt.subplot(1, 2, 2)
    eval_acc = [log['eval_accuracy'] for log in logs if 'eval_accuracy' in log]
    if eval_acc:
        plt.plot(eval_acc, label='Accuracy', marker='o', color='green')
        plt.xlabel('Epoch', fontsize=12)
        plt.ylabel('Accuracy', fontsize=12)
        plt.title('🎯 Accuracy Progress', fontsize=14, fontweight='bold')
        plt.ylim([0, 1])
        plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Diagnosis
    print("\n🔍 WHAT DO THESE CHARTS TELL US?\n")
    
    if len(train_loss) > 1:
        if train_loss[-1] < train_loss[0] * 0.5:
            print("✅ Training loss is decreasing nicely!")
        elif train_loss[-1] > train_loss[0]:
            print("❌ Training loss is INCREASING! That's bad!")
            print("   → Try lower learning rate")
            print("   → Check your data for errors")
        else:
            print("⚠️ Training loss barely changing")
            print("   → Try higher learning rate")
            print("   → Train more epochs")
    
    if len(eval_loss) > 1 and len(train_loss) > 1:
        if eval_loss[-1] > train_loss[-1] * 1.5:
            print("\n⚠️ Eval loss >> Train loss = OVERFITTING!")
            print("   → Use more training data")
            print("   → Add regularization (weight_decay)")
            print("   → Train fewer epochs")

plot_training_history(trainer)
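
One caveat about the plot above: it draws each curve against its list position, but training loss is logged every `logging_steps` while validation loss is logged once per epoch, so the two x-axes don't line up. Each entry in `trainer.state.log_history` also carries a `step` value, so a possible fix is to pair every loss with its step. The sketch below uses a hand-written sample in the same shape (the numbers are made up, and `split_logs` is a hypothetical helper):

```python
def split_logs(log_history):
    """Split Trainer-style log entries into (step, value) pairs per curve."""
    train = [(e["step"], e["loss"]) for e in log_history if "loss" in e]
    evals = [(e["step"], e["eval_loss"]) for e in log_history if "eval_loss" in e]
    return train, evals


# Hand-written sample in the shape of trainer.state.log_history
sample_logs = [
    {"loss": 0.70, "step": 10},
    {"loss": 0.52, "step": 20},
    {"eval_loss": 0.61, "eval_accuracy": 0.70, "step": 25},
    {"loss": 0.31, "step": 30},
    {"eval_loss": 0.48, "eval_accuracy": 0.80, "step": 50},
]
train, evals = split_logs(sample_logs)
print(train)  # [(10, 0.7), (20, 0.52), (30, 0.31)]
print(evals)  # [(25, 0.61), (50, 0.48)]
```

You could then call `plt.plot(*zip(*train))` and `plt.plot(*zip(*evals))` so both curves share the same step axis.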

## 🎯 Issue: "Model Always Predicts The Same Thing"

Your model just predicts "positive" for EVERYTHING. What's wrong?

In [None]:
def diagnose_predictions(trainer, dataset, num_examples=20):
    """Check if model is actually making different predictions"""
    
    predictions = trainer.predict(dataset.select(range(num_examples)))
    pred_labels = np.argmax(predictions.predictions, axis=-1)
    true_labels = predictions.label_ids
    
    print("🔍 PREDICTION ANALYSIS\n" + "="*60)
    print(f"\nChecking {num_examples} examples...\n")
    
    # Count predictions
    unique, counts = np.unique(pred_labels, return_counts=True)
    
    print("Predicted labels:")
    for label, count in zip(unique, counts):
        print(f"  Class {label}: {count}/{num_examples} ({count/num_examples*100:.1f}%)")
    
    # Check if model is stuck
    if len(unique) == 1:
        print("\n❌ PROBLEM: Model predicts ONLY class", unique[0])
        print("\nPossible causes:")
        print("  1. Class imbalance in training data")
        print("  2. Learning rate too high (model collapsed)")
        print("  3. Wrong loss function")
        print("  4. Model not trained enough")
        print("\nTry:")
        print("  → Check data balance (should be ~50/50)")
        print("  → Lower learning rate to 1e-5")
        print("  → Train more epochs")
    else:
        print("\n✅ Model makes different predictions (that's good!)")
    
    # Show some examples
    print("\n📋 Sample predictions:\n")
    for i in range(min(5, num_examples)):
        correct = "✅" if pred_labels[i] == true_labels[i] else "❌"
        print(f"{correct} True: {true_labels[i]}, Predicted: {pred_labels[i]}")

diagnose_predictions(trainer, test_data)
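
Since class imbalance is the first suspect, it's worth counting the training labels and working out what accuracy you'd get by always guessing the most common class. If your model's accuracy equals that number, it has probably learned nothing beyond the majority class. A minimal sketch with toy labels (`majority_baseline` is a hypothetical helper):

```python
from collections import Counter


def majority_baseline(labels):
    """Return (class counts, accuracy of always predicting the most common class)."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return dict(counts), most_common_count / len(labels)


labels = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]  # Toy, imbalanced labels
counts, baseline = majority_baseline(labels)
print(counts)    # {1: 8, 0: 2}
print(baseline)  # 0.8
```

For the data in this notebook you would call it as `majority_baseline(small_train['label'])`.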

## 💾 Memory Management Tips

**The #1 beginner frustration: Running out of memory**

Here's how to handle it:

In [None]:
# MEMORY-EFFICIENT TRAINING CONFIG

from transformers import TrainingArguments

memory_efficient_args = TrainingArguments(
    output_dir='./results',
    
    # 1. Smallest batch size that still works
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    
    # 2. Simulate larger batch with gradient accumulation
    gradient_accumulation_steps=4,  # Effective batch = 4 * 4 = 16
    
    # 3. Enable mixed precision (uses less memory)
    fp16=True,  # For NVIDIA GPUs (set to False when training on CPU)
    
    # 4. Gradient checkpointing: recompute activations during backprop
    gradient_checkpointing=True,  # Saves memory, but each step runs slower
    
    # 5. Don't keep all checkpoints
    save_total_limit=2,  # Keep only the 2 most recent checkpoints
    
    # Standard settings
    learning_rate=3e-5,
    num_train_epochs=3,
    evaluation_strategy='epoch',  # Called eval_strategy in newer transformers releases
)

print("💡 These settings should work on most free-tier GPUs!")
print("\nIf you STILL run out of memory:")
print("  1. Reduce max_length when tokenizing")
print("  2. Use a smaller model (distilbert instead of bert)")
print("  3. Reduce batch_size to 2 or even 1")
print("  4. Train on CPU (slow but works!)")
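
To get a feel for why a smaller model helps, here is a back-of-the-envelope estimate. It's a rough sketch built on assumptions: plain fp32 training with AdamW needs about 16 bytes per parameter (4 for the weight, 4 for its gradient, 8 for the two optimizer moments), activations are ignored even though they often dominate, and the parameter counts are approximate. `estimate_training_memory_gib` is a hypothetical helper.

```python
def estimate_training_memory_gib(num_params, bytes_per_param=16):
    """Rough floor for training memory in GiB, excluding activations.

    16 bytes/param assumes fp32 weights (4) + gradients (4) + two AdamW moments (8).
    """
    return num_params * bytes_per_param / 1024**3


# Approximate parameter counts; check the model cards for exact numbers
for name, params in [("distilbert-base (~66M)", 66e6), ("bert-base (~110M)", 110e6)]:
    gib = estimate_training_memory_gib(params)
    print(f"{name}: ~{gib:.2f} GiB before activations")
```

Treat the result as a lower bound: batch size and `max_length` decide how much activation memory comes on top.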

## 🛠️ Your Debugging Checklist

When something goes wrong, go through this list:

### ✅ Before Training:
- [ ] Data loaded correctly?
- [ ] Labels are correct (0, 1, 2... not text)?
- [ ] Tokenizer matches the model?
- [ ] Data is balanced (or you're aware it's not)?
- [ ] Train/test split done properly?
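
The label check from this list is easy to script. A minimal sketch, assuming your labels are available as a plain Python list (`check_labels` is a hypothetical helper, not something the libraries provide):

```python
def check_labels(labels, num_labels):
    """Return a list of problems found in the labels (empty list means OK)."""
    problems = []
    for i, label in enumerate(labels):
        if isinstance(label, bool) or not isinstance(label, int):
            problems.append(f"row {i}: {label!r} is not an integer")
        elif not 0 <= label < num_labels:
            problems.append(f"row {i}: {label} is outside 0..{num_labels - 1}")
    return problems


print(check_labels([0, 1, 1, 0], num_labels=2))  # []
print(check_labels([0, "pos", 2], num_labels=2))
# ["row 1: 'pos' is not an integer", 'row 2: 2 is outside 0..1']
```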

### ✅ During Training:
- [ ] Loss is decreasing?
- [ ] No NaN in loss?
- [ ] Training speed reasonable?
- [ ] Memory usage okay?
- [ ] Validation metrics improving?

### ✅ After Training:
- [ ] Train accuracy > random guessing?
- [ ] Test accuracy reasonable?
- [ ] Model makes varied predictions?
- [ ] No severe overfitting?
- [ ] Results make sense for your task?

## 🚑 Emergency Quick Fixes

**Training exploded (NaN loss)?**
```python
learning_rate=1e-5  # Lower!
max_grad_norm=1.0   # Clip gradients
```

**Out of memory?**
```python
per_device_train_batch_size=4  # Smaller!
fp16=True  # Mixed precision
max_length=128  # Shorter sequences
```

**Not learning?**
```python
learning_rate=5e-5  # Higher!
num_train_epochs=5  # More epochs
# Check your data!
```

**Overfitting?**
```python
weight_decay=0.01  # Add regularization
num_train_epochs=2  # Fewer epochs
# Get more data if possible
```

**Too slow?**
```python
per_device_train_batch_size=32  # Larger!
fp16=True  # Faster
# Use smaller dataset for testing
```

## 🎓 Final Wisdom

Here's what I wish someone told me when I started:

1. **Errors are NORMAL** - You'll see hundreds. Don't panic!

2. **Start small** - Get it working on 100 examples before trying 100,000

3. **One change at a time** - Change learning rate OR batch size, not both!

4. **Google is your friend** - 99% of errors, someone else hit first

5. **Check your data FIRST** - Most "model problems" are actually data problems

6. **Visualize everything** - Plot loss, plot predictions, plot distributions

7. **Save your work** - Nothing worse than losing a good model

8. **Ask for help** - Community is super friendly!

---

## 🎉 You're Now a Debugging Ninja!

You learned:
- ✅ Common errors and fixes
- ✅ How to diagnose training issues
- ✅ Memory management
- ✅ Quick debugging techniques

**Remember:** Every expert was once a beginner who didn't give up!

You got this! 💪

---

**Next up:** Module 3 - Advanced Techniques! 🚀

## 📚 Resources

**When you're stuck:**
- [HuggingFace Forums](https://discuss.huggingface.co/) - Super active community
- [Stack Overflow](https://stackoverflow.com/questions/tagged/huggingface-transformers) - For specific errors
- [HuggingFace Docs](https://huggingface.co/docs/transformers) - Official documentation

**For debugging:**
- Google the EXACT error message
- Check GitHub Issues for the library
- Ask in our community Discord!

Don't struggle alone - we're here to help! 🤝