# Lab 1: Introduction to Experiment Tracking with TensorBoard

Experiment tracking is essential for reproducible ML research. Without systematic tracking, you lose the ability to reproduce successful models, compare approaches, and understand which hyperparameters led to specific results.

## High-Level Workflow

![Experiment Tracking Workflow](./images/image.png)

The workflow consists of **three phases**:

| Phase | What Happens |
|-------|--------------|
| **Phase 1: Setup** | Initialize `SummaryWriter` and configure where logs are saved |
| **Phase 2: Training** | Run training loop, collect metrics each epoch, save to event files |
| **Phase 3: Analysis** | Launch TensorBoard, compare experiments, select best configuration |

---

In this notebook, we'll learn to:
- Integrate TensorBoard with PyTorch using `SummaryWriter`
- Log metrics, hyperparameters, and model graphs
- Organize and compare multiple experiments
- Use the TensorBoard UI for analysis


---
## Part 1: Setup and Installation


In [3]:
# Install required packages
!pip install torch torchvision tensorboard tqdm requests -q




In [2]:
# Imports
import torch
from torch import nn
import torchvision
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from tqdm.auto import tqdm
from pathlib import Path
from datetime import datetime
import requests
import zipfile
import os

# Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")


Using device: cpu




---
## Part 2: First Experiment (Single Run Tracking)

In this section, we'll run one complete experiment with TensorBoard logging. This covers **Phase 1 (Setup)** and **Phase 2 (Training)** from our workflow diagram.

**What we'll do:**
1. Download a food image dataset (pizza, steak, sushi)
2. Set up a pre-trained neural network (EfficientNet-B0)
3. Train it while logging metrics to TensorBoard
4. Visualize the results

### Step 1: Download and Prepare Data

We'll use a small food classification dataset. The images are organized in folders by class name, which PyTorch's `ImageFolder` can automatically load.
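To make the folder-per-class convention concrete, here is a minimal sketch that builds an empty directory tree and derives the class-to-index mapping from the sorted folder names, which is the convention `ImageFolder` follows. The helper name `find_classes_sketch` and the temporary directory are illustrative assumptions; the lab itself relies on `ImageFolder` directly.

```python
import tempfile
from pathlib import Path


def find_classes_sketch(root: Path) -> dict:
    """Map each class sub-folder to an integer index, in sorted name order."""
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    return {name: idx for idx, name in enumerate(class_names)}


with tempfile.TemporaryDirectory() as tmp:
    train_root = Path(tmp) / "train"
    # Create the folders in arbitrary order; the mapping depends only on the names
    for class_name in ["sushi", "pizza", "steak"]:
        (train_root / class_name).mkdir(parents=True)
    class_to_idx = find_classes_sketch(train_root)

print(class_to_idx)  # {'pizza': 0, 'steak': 1, 'sushi': 2}
```

Because the indices come from the folder names, the train and test directories need the same set of class folders for the labels to line up.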

![Download Dataset](./images/image7.png)


In [4]:
# Download dataset
data_path = Path("data")
image_path = data_path / "pizza_steak_sushi"

if not image_path.exists():
    print("Downloading dataset...")
    image_path.mkdir(parents=True, exist_ok=True)
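    # Fetch the archive first and fail loudly on HTTP errors before writing anything
    response = requests.get("https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip", timeout=60)
    response.raise_for_status()
    with open(data_path / "pizza_steak_sushi.zip", "wb") as f:
        f.write(response.content)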
    with zipfile.ZipFile(data_path / "pizza_steak_sushi.zip", "r") as zip_ref:
        zip_ref.extractall(image_path)
    print("Data downloaded!")
else:
    print("Data already exists.")

# Setup paths
train_dir = image_path / "train"
test_dir = image_path / "test"

print(f"Train directory: {train_dir}")
print(f"Test directory: {test_dir}")


Downloading dataset...
Data downloaded!
Train directory: data/pizza_steak_sushi/train
Test directory: data/pizza_steak_sushi/test


In [5]:
# Get transforms from pretrained model
weights = torchvision.models.EfficientNet_B0_Weights.DEFAULT
auto_transforms = weights.transforms()

# Create datasets
train_data = datasets.ImageFolder(train_dir, transform=auto_transforms)
test_data = datasets.ImageFolder(test_dir, transform=auto_transforms)

print(f"Classes: {train_data.classes}")
print(f"Train samples: {len(train_data)}")
print(f"Test samples: {len(test_data)}")


Classes: ['pizza', 'steak', 'sushi']
Train samples: 225
Test samples: 75


### Step 2: Define Helper Functions

Before training, we need two helper functions:
- **`train_step()`**: Runs one epoch of training (forward pass → loss → backward pass → update weights)
- **`test_step()`**: Evaluates the model on test data (no weight updates)

These functions will be called repeatedly during training, and we'll log their outputs to TensorBoard.


In [6]:
def train_step(model, dataloader, loss_fn, optimizer, device):
    """Single training epoch."""
    model.train()
    train_loss, train_acc = 0, 0
    
    for X, y in dataloader:
        X, y = X.to(device), y.to(device)
        y_pred = model(X)
        loss = loss_fn(y_pred, y)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item()
        train_acc += (y_pred.argmax(1) == y).sum().item() / len(y)
    
    return train_loss / len(dataloader), train_acc / len(dataloader)


def test_step(model, dataloader, loss_fn, device):
    """Single evaluation epoch."""
    model.eval()
    test_loss, test_acc = 0, 0
    
    with torch.inference_mode():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            y_pred = model(X)
            loss = loss_fn(y_pred, y)
            
            test_loss += loss.item()
            test_acc += (y_pred.argmax(1) == y).sum().item() / len(y)
    
    return test_loss / len(dataloader), test_acc / len(dataloader)

print("Helper functions defined.")


Helper functions defined.
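One detail worth knowing about `train_step()` and `test_step()`: they average the per-batch accuracies, so a small final batch counts as much as a full one. With 225 training images and a batch size of 32, the last batch holds a single image. The sketch below contrasts that average with a sample-weighted accuracy; the function names and the per-batch correct counts are hypothetical and chosen only for illustration.

```python
def batch_mean_accuracy(correct_per_batch, batch_sizes):
    """Mean of per-batch accuracies (the scheme the helper functions use)."""
    per_batch = [c / n for c, n in zip(correct_per_batch, batch_sizes)]
    return sum(per_batch) / len(per_batch)


def sample_weighted_accuracy(correct_per_batch, batch_sizes):
    """Total correct predictions divided by total samples."""
    return sum(correct_per_batch) / sum(batch_sizes)


# 225 images at batch size 32: seven full batches plus one batch of a single image
batch_sizes = [32] * 7 + [1]
# Hypothetical counts: 24 correct in every full batch, the lone last image is wrong
correct = [24] * 7 + [0]

batch_mean = batch_mean_accuracy(correct, batch_sizes)
weighted = sample_weighted_accuracy(correct, batch_sizes)
print(f"batch-mean accuracy:      {batch_mean:.4f}")
print(f"sample-weighted accuracy: {weighted:.4f}")
```

The two numbers differ because one misclassified image in the final batch pulls the batch-mean down by a full eighth. For quick comparisons between runs this is usually acceptable, but it helps explain why the logged accuracies can look jumpy on a dataset this small.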


### Step 3: Initialize SummaryWriter

This is **Phase 1 (Setup)** from our workflow. `SummaryWriter` is the logging class that PyTorch provides for TensorBoard (`torch.utils.tensorboard.SummaryWriter`): it creates event files that TensorBoard reads to display your metrics.

**What happens when you create a writer:**
- A `runs/` directory is created (if it doesn't exist)
- A subdirectory named after the current date, time, and hostname is created for this experiment
- All logged data will be saved there as event files


In [7]:
# Create writer; logs are saved under the runs/ directory
writer = SummaryWriter()
print(f"TensorBoard logs will be saved to: {writer.log_dir}")


TensorBoard logs will be saved to: runs/Jan12_10-04-28_509b23b82278476a


### Step 4: Define Training Function with TensorBoard Logging

This is where **Phase 2 (Training)** happens. The training loop calls our helper functions and logs metrics to TensorBoard at each epoch.

**Key TensorBoard methods we'll use:**

| Method | Purpose | When to Use |
|--------|---------|-------------|
| `add_scalars(tag, dict, step)` | Log multiple values on one chart | Train vs test metrics |
| `add_graph(model, input)` | Log model architecture | Once after training |
| `close()` | Flush data to disk | Always at the end |

**Why `add_scalars` instead of `add_scalar`?** Using `add_scalars` with a dictionary plots both training and test metrics on the same chart, making it easy to spot overfitting.
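A side effect worth seeing once: `add_scalars` writes each dictionary entry to its own sub-run directory under the writer's log directory, which is why the curves appear as separate runs in the TensorBoard sidebar. The sketch below logs two epochs of placeholder loss values (not results from this lab) into a temporary directory and lists what was created; the tag `Loss/train_single` is a hypothetical name used to show `add_scalar` alongside.

```python
import os
import tempfile

from torch.utils.tensorboard import SummaryWriter

# Temporary directory so the demo does not touch the lab's runs/ folder
log_dir = tempfile.mkdtemp(prefix="tb_scalars_demo_")
writer = SummaryWriter(log_dir=log_dir)

# Placeholder (train_loss, test_loss) pairs for two epochs
for epoch, (train_loss, test_loss) in enumerate([(1.0, 0.9), (0.8, 0.85)]):
    # One tag, one curve, stored in the writer's own event file
    writer.add_scalar("Loss/train_single", train_loss, epoch)
    # One chart with two curves, each stored in its own sub-run directory
    writer.add_scalars("Loss", {"train": train_loss, "test": test_loss}, epoch)

writer.close()

sub_runs = sorted(
    d for d in os.listdir(log_dir) if os.path.isdir(os.path.join(log_dir, d))
)
print(sub_runs)  # ['Loss_test', 'Loss_train']
```

With several experiments, these extra sub-runs multiply the entries in the run selector, so descriptive parent directory names become more important.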


In [8]:
def train(model, train_dataloader, test_dataloader, optimizer, loss_fn, epochs, device):
    """Training loop with TensorBoard logging."""
    
    results = {"train_loss": [], "train_acc": [], "test_loss": [], "test_acc": []}

    for epoch in tqdm(range(epochs)):
        train_loss, train_acc = train_step(model, train_dataloader, loss_fn, optimizer, device)
        test_loss, test_acc = test_step(model, test_dataloader, loss_fn, device)

        print(f"Epoch: {epoch+1} | "
              f"train_loss: {train_loss:.4f} | train_acc: {train_acc:.4f} | "
              f"test_loss: {test_loss:.4f} | test_acc: {test_acc:.4f}")

        results["train_loss"].append(train_loss)
        results["train_acc"].append(train_acc)
        results["test_loss"].append(test_loss)
        results["test_acc"].append(test_acc)

        # === TensorBoard Logging (uses the global `writer` created in Step 3) ===
        writer.add_scalars("Loss", {"train": train_loss, "test": test_loss}, epoch)
        writer.add_scalars("Accuracy", {"train": train_acc, "test": test_acc}, epoch)

    # Log model graph
    writer.add_graph(model, torch.randn(32, 3, 224, 224).to(device))
    writer.close()
    
    return results

print("Training function defined.")


Training function defined.


### Step 5: Create Model and DataLoaders

We'll use **transfer learning** with EfficientNet-B0, a compact convolutional image classifier pre-trained on ImageNet (over a million labeled images).

**Why transfer learning?**
- Training from scratch typically requires very large datasets and long training runs
- Pre-trained models already encode general image features (edges, textures, shapes)
- We only need to train the final layer to recognize our specific classes
- With a small dataset like ours, this is usually much faster and gives better results than training from scratch
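The freezing idea can be tried without downloading any weights. The sketch below uses a small stand-in network (`TinyBackboneModel`, a hypothetical class that only mimics the `features`/`classifier` split of EfficientNet-B0) to show that frozen parameters receive no gradients and that a 1280-to-3 linear head has 1280 × 3 + 3 = 3,843 trainable parameters.

```python
import torch
from torch import nn


class TinyBackboneModel(nn.Module):
    """Stand-in with the same features/classifier split as the real model."""

    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 1280, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.classifier = nn.Linear(1280, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x))


model = TinyBackboneModel()

# Freeze the backbone so only the classifier head is updated
for param in model.features.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")  # Trainable parameters: 3,843

# A backward pass populates gradients for the head only
model(torch.randn(2, 3, 8, 8)).sum().backward()
print(model.features[0].weight.grad is None)       # True
print(model.classifier.weight.grad is not None)    # True
```

The trainable count matches what the lab reports for the real model, since both use the same 1280-to-3 head.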


In [9]:
# Create model (using pretrained EfficientNet-B0)
weights = torchvision.models.EfficientNet_B0_Weights.DEFAULT
model = torchvision.models.efficientnet_b0(weights=weights)

# Freeze base layers, modify classifier
for param in model.features.parameters():
    param.requires_grad = False

model.classifier = nn.Sequential(
    nn.Dropout(p=0.2),
    nn.Linear(1280, len(train_data.classes))
)
model = model.to(device)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")


Downloading: "https://download.pytorch.org/models/efficientnet_b0_rwightman-7f5810bc.pth" to /home/poridhian/.cache/torch/hub/checkpoints/efficientnet_b0_rwightman-7f5810bc.pth


100%|██████████| 20.5M/20.5M [00:00<00:00, 73.1MB/s]


Total parameters: 4,011,391
Trainable parameters: 3,843


In [10]:
# Create DataLoaders
BATCH_SIZE = 32

train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=False)

print(f"Train batches: {len(train_dataloader)}")
print(f"Test batches: {len(test_dataloader)}")


Train batches: 8
Test batches: 3


### Step 6: Run the Experiment

![Training Output](./images/image6.png)


In [11]:
# Setup optimizer and loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

# Train
NUM_EPOCHS = 5

results = train(
    model=model,
    train_dataloader=train_dataloader,
    test_dataloader=test_dataloader,
    optimizer=optimizer,
    loss_fn=loss_fn,
    epochs=NUM_EPOCHS,
    device=device
)

print(f"\nFinal test accuracy: {results['test_acc'][-1]:.4f}")


 20%|██        | 1/5 [00:25<01:41, 25.32s/it]

Epoch: 1 | train_loss: 1.0434 | train_acc: 0.5156 | test_loss: 0.9459 | test_acc: 0.5388

 40%|████      | 2/5 [00:48<01:12, 24.12s/it]

Epoch: 2 | train_loss: 0.9208 | train_acc: 0.5859 | test_loss: 0.7898 | test_acc: 0.6913

 60%|██████    | 3/5 [01:14<00:49, 24.88s/it]

Epoch: 3 | train_loss: 0.7465 | train_acc: 0.7578 | test_loss: 0.7067 | test_acc: 0.8248

 80%|████████  | 4/5 [01:38<00:24, 24.52s/it]

Epoch: 4 | train_loss: 0.6671 | train_acc: 0.7773 | test_loss: 0.6605 | test_acc: 0.8040

100%|██████████| 5/5 [02:03<00:00, 24.75s/it]

Epoch: 5 | train_loss: 0.6133 | train_acc: 0.8008 | test_loss: 0.6125 | test_acc: 0.7936

Final test accuracy: 0.7936


### Step 7: View Results in TensorBoard


In [None]:
# Launch TensorBoard
%load_ext tensorboard
%tensorboard --logdir runs


**What you'll see in TensorBoard:**
- **Scalars tab**: Loss and accuracy curves over epochs (look for decreasing loss, increasing accuracy)
- **Graphs tab**: Interactive model architecture visualization

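Open the dashboard at `http://<PUBLIC_IP>:6006` (find your public IP by running `curl -s ifconfig.me` in a terminal). If the page is unreachable from another machine, note that TensorBoard listens on localhost by default; adding `--bind_all` to the launch command makes it listen on all network interfaces.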

![TensorBoard UI](./images/image4.png)

**Scalars Tab:**

![Scalars](./images/image8.png)

**Graphs Tab:**

![Graphs](./images/image9.png)

---
## Part 3: Second Experiment (Multiple Hyperparameter Configurations)

Now we enter **Phase 3 (Analysis)**, but first we need more experiments to compare!

Training a single model is rarely enough. In practice, you'll want to try different hyperparameters. This section shows how to:
1. Run multiple experiments systematically
2. Organize logs so they're easy to compare
3. Use TensorBoard to find the best configuration

**The experiments we'll run:**

| Experiment | Learning Rate | Batch Size |
|------------|---------------|------------|
| 1 | 0.001 | 32 |
| 2 | 0.001 | 64 |
| 3 | 0.01 | 32 |
| 4 | 0.01 | 64 |

### Step 1: Create Custom Writer Function

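By default, `SummaryWriter()` names each run directory after the current time and hostname (for example `runs/Jan12_10-04-28_<hostname>`), which says nothing about the hyperparameters used. We need a helper function that organizes experiments into a structured, descriptive directory layout.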


In [13]:
def create_writer(experiment_name: str, model_name: str, extra: str = None):
    """Creates SummaryWriter with organized directory structure."""
    timestamp = datetime.now().strftime("%Y-%m-%d")
    
    log_dir = os.path.join("runs", model_name, timestamp, experiment_name)
    if extra:
        log_dir = os.path.join(log_dir, extra)
    
    print(f"[INFO] Logging to: {log_dir}")
    return SummaryWriter(log_dir=log_dir)

print("Custom writer function defined.")


Custom writer function defined.
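Note that `create_writer` stamps the directory with the date only, so re-running the same experiment on the same day writes new event files into the existing directory, and TensorBoard then shows old and new points as one run. One possible variation, sketched below under the assumption that a single start time is shared by the whole sweep, adds the time of day to the path. The helper name `build_log_dir` and its `sweep_started` and `root` parameters are hypothetical and not part of the lab code.

```python
import os
from datetime import datetime


def build_log_dir(experiment_name, model_name, sweep_started, root="runs"):
    """Return runs/<model>/<date_time>/<experiment> for one sweep start time."""
    timestamp = sweep_started.strftime("%Y-%m-%d_%H-%M-%S")
    return os.path.join(root, model_name, timestamp, experiment_name)


# Fix the start time once so every experiment in the sweep shares a parent folder
sweep_started = datetime(2026, 1, 12, 10, 4, 28)

log_dirs = [
    build_log_dir(name, "EfficientNetB0", sweep_started)
    for name in ["lr0.001_bs32", "lr0.001_bs64"]
]
for path in log_dirs:
    print(path)
```

The returned path can be passed as `SummaryWriter(log_dir=...)`; in a real run, `sweep_started` would be `datetime.now()` captured once before the loop.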


### Step 2: Create Model Factory Function

Each experiment needs a **fresh model** to ensure fair comparison. If we reused a trained model, subsequent experiments would start with learned weights.
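A small sketch of what "fresh" means in practice, using only a 1280-to-3 linear head rather than the full network. It also shows one optional extra that the lab does not use: fixing the random seed so every experiment starts from identical head weights. The helper `create_head` and the seed value are assumptions made for illustration.

```python
import torch
from torch import nn


def create_head(seed: int = 42) -> nn.Linear:
    """Build a new classifier head with reproducible initial weights."""
    torch.manual_seed(seed)
    return nn.Linear(1280, 3)


head_a = create_head()
head_b = create_head()

# Two freshly created heads with the same seed start out identical
same_at_init = torch.equal(head_a.weight, head_b.weight)
print(same_at_init)  # True

# One optimizer step on head_a changes its weights
optimizer = torch.optim.SGD(head_a.parameters(), lr=0.1)
head_a(torch.ones(1, 1280)).sum().backward()
optimizer.step()

# Reusing head_a for the next experiment would start from these updated weights
same_after_step = torch.equal(head_a.weight, head_b.weight)
print(same_after_step)  # False
```

Seeding removes one source of run-to-run variation, which makes differences between hyperparameter settings easier to attribute; data shuffling order is a separate source of randomness that this sketch does not control.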


In [14]:
def create_model(num_classes=3):
    """Creates a fresh EfficientNet-B0 model for each experiment."""
    weights = torchvision.models.EfficientNet_B0_Weights.DEFAULT
    model = torchvision.models.efficientnet_b0(weights=weights)
    
    for param in model.features.parameters():
        param.requires_grad = False
    
    model.classifier = nn.Sequential(
        nn.Dropout(p=0.2),
        nn.Linear(1280, num_classes)
    )
    return model

print("Model factory function defined.")


Model factory function defined.


### Step 3: Modified Training Function

This version accepts a writer as a parameter instead of relying on the global one, and it leaves closing the writer to the caller.


In [15]:
def train_with_writer(model, train_dataloader, test_dataloader, optimizer, loss_fn, epochs, device, writer):
    """Training loop with TensorBoard logging using provided writer."""
    
    results = {"train_loss": [], "train_acc": [], "test_loss": [], "test_acc": []}

    for epoch in tqdm(range(epochs)):
        train_loss, train_acc = train_step(model, train_dataloader, loss_fn, optimizer, device)
        test_loss, test_acc = test_step(model, test_dataloader, loss_fn, device)

        print(f"Epoch: {epoch+1} | "
              f"train_loss: {train_loss:.4f} | train_acc: {train_acc:.4f} | "
              f"test_loss: {test_loss:.4f} | test_acc: {test_acc:.4f}")

        results["train_loss"].append(train_loss)
        results["train_acc"].append(train_acc)
        results["test_loss"].append(test_loss)
        results["test_acc"].append(test_acc)

        # TensorBoard Logging
        writer.add_scalars("Loss", {"train": train_loss, "test": test_loss}, epoch)
        writer.add_scalars("Accuracy", {"train": train_acc, "test": test_acc}, epoch)

    # Log model graph
    writer.add_graph(model, torch.randn(32, 3, 224, 224).to(device))
    
    return results

print("Modified training function defined.")


Modified training function defined.


### Step 4: Define Hyperparameter Combinations


In [16]:
# Hyperparameter combinations to test
learning_rates = [0.001, 0.01]
batch_sizes = [32, 64]

total_experiments = len(learning_rates) * len(batch_sizes)
print(f"Total experiments to run: {total_experiments}")


Total experiments to run: 4
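The same grid can be enumerated up front with `itertools.product`, which keeps the experiment names and settings together in one list. This is an alternative sketch; the lab's own loop uses nested `for` statements, and the `configs` structure here is an assumption.

```python
from itertools import product

learning_rates = [0.001, 0.01]
batch_sizes = [32, 64]

# One dictionary per experiment, in the same order as the nested loops
configs = [
    {"name": f"lr{lr}_bs{bs}", "lr": lr, "batch_size": bs}
    for lr, bs in product(learning_rates, batch_sizes)
]

for number, config in enumerate(configs, start=1):
    print(f"Experiment {number}/{len(configs)}: {config['name']}")
```

Adding a third hyperparameter then only means passing one more list to `product`, at the cost of multiplying the number of runs.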


### Step 5: Run All Experiments

**Expected output for each experiment:**

![Experiment 1](./images/exp-1.png)
![Experiment 2](./images/exp-2.png)
![Experiment 3](./images/exp-3.png)
![Experiment 4](./images/exp-4.png)


In [17]:
# Store all results
all_results = {}
experiment_num = 0

for lr in learning_rates:
    for bs in batch_sizes:
        experiment_num += 1
        print(f"\n{'='*60}")
        print(f"Experiment {experiment_num}/{total_experiments}: LR={lr}, Batch={bs}")
        print(f"{'='*60}")
        
        # Create experiment-specific writer
        experiment_name = f"lr{lr}_bs{bs}"
        writer = create_writer(experiment_name, "EfficientNetB0")
        
        # Create DataLoaders with current batch size
        train_loader = DataLoader(train_data, batch_size=bs, shuffle=True)
        test_loader = DataLoader(test_data, batch_size=bs, shuffle=False)
        
        # Fresh model for each experiment
        model = create_model(num_classes=len(train_data.classes)).to(device)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        
        # Train with logging
        results = train_with_writer(
            model=model,
            train_dataloader=train_loader,
            test_dataloader=test_loader,
            optimizer=optimizer,
            loss_fn=loss_fn,
            epochs=5,
            device=device,
            writer=writer
        )
        
        # Store results
        all_results[experiment_name] = results
        
        # Close writer before next experiment
        writer.close()
        
        print(f"Final test accuracy: {results['test_acc'][-1]:.4f}")

print(f"\n{'='*60}")
print("All experiments completed!")
print(f"{'='*60}")



Experiment 1/4: LR=0.001, Batch=32
[INFO] Logging to: runs/EfficientNetB0/2026-01-12/lr0.001_bs32


 20%|██        | 1/5 [00:23<01:35, 23.76s/it]

Epoch: 1 | train_loss: 1.0285 | train_acc: 0.4453 | test_loss: 0.8359 | test_acc: 0.7936

 40%|████      | 2/5 [00:46<01:10, 23.42s/it]

Epoch: 2 | train_loss: 0.8661 | train_acc: 0.6094 | test_loss: 0.6937 | test_acc: 0.8456

 60%|██████    | 3/5 [01:09<00:45, 22.93s/it]

Epoch: 3 | train_loss: 0.7452 | train_acc: 0.7383 | test_loss: 0.7060 | test_acc: 0.7746

 80%|████████  | 4/5 [01:34<00:23, 23.79s/it]

Epoch: 4 | train_loss: 0.6953 | train_acc: 0.7578 | test_loss: 0.6170 | test_acc: 0.8258

100%|██████████| 5/5 [01:57<00:00, 23.49s/it]

Epoch: 5 | train_loss: 0.6021 | train_acc: 0.9062 | test_loss: 0.6263 | test_acc: 0.7538

Final test accuracy: 0.7538

Experiment 2/4: LR=0.001, Batch=64
[INFO] Logging to: runs/EfficientNetB0/2026-01-12/lr0.001_bs64


 20%|██        | 1/5 [00:27<01:50, 27.54s/it]

Epoch: 1 | train_loss: 1.0623 | train_acc: 0.4061 | test_loss: 0.9568 | test_acc: 0.6229

 40%|████      | 2/5 [00:55<01:24, 28.01s/it]

Epoch: 2 | train_loss: 0.9078 | train_acc: 0.6657 | test_loss: 0.8299 | test_acc: 0.7244

 60%|██████    | 3/5 [01:23<00:55, 27.81s/it]

Epoch: 3 | train_loss: 0.7972 | train_acc: 0.8146 | test_loss: 0.7180 | test_acc: 0.8466

 80%|████████  | 4/5 [01:49<00:27, 27.24s/it]

Epoch: 4 | train_loss: 0.7028 | train_acc: 0.8224 | test_loss: 0.6395 | test_acc: 0.9233

100%|██████████| 5/5 [02:17<00:00, 27.46s/it]

Epoch: 5 | train_loss: 0.6081 | train_acc: 0.8794 | test_loss: 0.5760 | test_acc: 0.9155

Final test accuracy: 0.9155

Experiment 3/4: LR=0.01, Batch=32
[INFO] Logging to: runs/EfficientNetB0/2026-01-12/lr0.01_bs32


 20%|██        | 1/5 [00:23<01:35, 23.78s/it]

Epoch: 1 | train_loss: 1.1156 | train_acc: 0.4844 | test_loss: 0.4820 | test_acc: 0.7850

 40%|████      | 2/5 [00:47<01:10, 23.48s/it]

Epoch: 2 | train_loss: 0.4143 | train_acc: 0.8477 | test_loss: 0.3036 | test_acc: 0.8759

 60%|██████    | 3/5 [01:10<00:47, 23.60s/it]

Epoch: 3 | train_loss: 0.4251 | train_acc: 0.7891 | test_loss: 0.4415 | test_acc: 0.8049

 80%|████████  | 4/5 [01:32<00:23, 23.04s/it]

Epoch: 4 | train_loss: 0.5830 | train_acc: 0.7617 | test_loss: 0.3939 | test_acc: 0.8352

100%|██████████| 5/5 [01:56<00:00, 23.22s/it]

Epoch: 5 | train_loss: 0.4588 | train_acc: 0.7500 | test_loss: 0.4727 | test_acc: 0.8144

Final test accuracy: 0.8144

Experiment 4/4: LR=0.01, Batch=64
[INFO] Logging to: runs/EfficientNetB0/2026-01-12/lr0.01_bs64


 20%|██        | 1/5 [00:28<01:55, 28.85s/it]

Epoch: 1 | train_loss: 0.8854 | train_acc: 0.5756 | test_loss: 0.5529 | test_acc: 0.7166

 40%|████      | 2/5 [00:57<01:25, 28.47s/it]

Epoch: 2 | train_loss: 0.4177 | train_acc: 0.8562 | test_loss: 0.3174 | test_acc: 0.9077

 60%|██████    | 3/5 [01:25<00:56, 28.31s/it]

Epoch: 3 | train_loss: 0.2065 | train_acc: 0.9111 | test_loss: 0.3415 | test_acc: 0.9311

 80%|████████  | 4/5 [01:52<00:28, 28.12s/it]

Epoch: 4 | train_loss: 0.1797 | train_acc: 0.9180 | test_loss: 0.3506 | test_acc: 0.9077

100%|██████████| 5/5 [02:21<00:00, 28.28s/it]

Epoch: 5 | train_loss: 0.0971 | train_acc: 0.9846 | test_loss: 0.3413 | test_acc: 0.8388

Final test accuracy: 0.8388

All experiments completed!


### Step 6: Compare Results


In [18]:
# Print comparison table
print("\n" + "="*60)
print("EXPERIMENT COMPARISON")
print("="*60)
print(f"{'Experiment':<20} {'Final Test Acc':<15} {'Best Test Acc':<15}")
print("-"*60)

best_exp = None
best_acc = 0

for exp_name, results in all_results.items():
    final_acc = results['test_acc'][-1]
    best_acc_exp = max(results['test_acc'])
    print(f"{exp_name:<20} {final_acc:<15.4f} {best_acc_exp:<15.4f}")
    
    if final_acc > best_acc:
        best_acc = final_acc
        best_exp = exp_name

print("="*60)
print(f"\n🏆 Best experiment: {best_exp} with {best_acc:.4f} accuracy")



EXPERIMENT COMPARISON
Experiment           Final Test Acc  Best Test Acc  
------------------------------------------------------------
lr0.001_bs32         0.7538          0.8456         
lr0.001_bs64         0.9155          0.9233         
lr0.01_bs32          0.8144          0.8759         
lr0.01_bs64          0.8388          0.9311         

🏆 Best experiment: lr0.001_bs64 with 0.9155 accuracy
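The comparison cell ranks experiments by final-epoch accuracy, but the table also lists each run's best epoch, and the two criteria do not agree here. The sketch below applies both to the accuracies reported in the table above; the `reported` dictionary simply restates those numbers, and the variable names are assumptions.

```python
# Final and best test accuracies copied from the comparison table
reported = {
    "lr0.001_bs32": {"final": 0.7538, "best": 0.8456},
    "lr0.001_bs64": {"final": 0.9155, "best": 0.9233},
    "lr0.01_bs32": {"final": 0.8144, "best": 0.8759},
    "lr0.01_bs64": {"final": 0.8388, "best": 0.9311},
}

best_by_final = max(reported, key=lambda name: reported[name]["final"])
best_by_peak = max(reported, key=lambda name: reported[name]["best"])

print(f"Best by final-epoch accuracy: {best_by_final}")  # lr0.001_bs64
print(f"Best by peak accuracy:        {best_by_peak}")   # lr0.01_bs64
```

Peak accuracy only helps if the model from that epoch was saved, which this lab does not do. With 75 test images and five epochs, differences of a few points are also within what a different random seed could produce, so treat the ranking as indicative rather than conclusive.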


### Step 7: View All Experiments in TensorBoard


In [None]:
# Reload and launch TensorBoard
%reload_ext tensorboard
%tensorboard --logdir runs


![Final Comparison](./images/final.png)

**TensorBoard comparison features:**

| Feature | How to Use | Purpose |
|---------|------------|---------|
| **Run Selector** | Left sidebar checkboxes | Show/hide specific experiments |
| **Regex Filter** | Type `lr0.001` in search | Filter to matching experiments |
| **Smoothing Slider** | Adjust to reduce noise | See cleaner trends |

**What to look for in comparisons:**
- Which learning rate converges faster?
- Which batch size gives better final accuracy?
- Are any configurations unstable (high variance)?
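For side-by-side comparison of settings, `SummaryWriter.add_hparams` can record each run's hyperparameters together with a summary metric, which populates TensorBoard's HParams tab. The cells above do not call it, so the sketch below is an optional addition: it logs the settings and final test accuracy of the `lr0.001_bs64` run into a temporary directory, and the metric tag `hparam/final_test_acc` is a hypothetical name.

```python
import os
import tempfile

from torch.utils.tensorboard import SummaryWriter

# Temporary directory so the demo does not touch the lab's runs/ folder
log_dir = tempfile.mkdtemp(prefix="tb_hparams_demo_")
writer = SummaryWriter(log_dir=log_dir)

# Hyperparameters and final accuracy of one experiment from the comparison table
writer.add_hparams(
    hparam_dict={"lr": 0.001, "batch_size": 64},
    metric_dict={"hparam/final_test_acc": 0.9155},
    run_name="lr0.001_bs64",
)
writer.close()

# add_hparams stores its data in a sub-directory named after run_name
hparams_dir = os.path.join(log_dir, "lr0.001_bs64")
print(os.path.isdir(hparams_dir))
```

In the experiment loop, one such call per configuration after training would give a sortable table of learning rate, batch size, and accuracy.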

---
## Part 4: Summary

### Key TensorBoard Methods

| Method | Purpose | Example |
|--------|---------|---------|
| `add_scalar(tag, value, step)` | Log single metric | Learning rate |
| `add_scalars(tag, dict, step)` | Log multiple metrics on one chart | Train vs test loss |
| `add_graph(model, input)` | Log model architecture | Network visualization |
| `add_histogram(tag, values, step)` | Log weight distributions | Debugging gradients |
| `add_image(tag, img, step)` | Log images | Sample predictions |
| `close()` | Flush and close writer | End of training |

### Best Practices

1. **Naming**: Use descriptive experiment names with hyperparameters (`lr0.001_bs32`)
2. **Organization**: Structure directories by model/date/experiment
3. **Consistency**: Log the same metrics across all experiments for comparison
4. **Close writers**: Always call `writer.close()` to flush data to disk
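One way to make closing automatic is to use the writer as a context manager, which `SummaryWriter` supports: the writer is closed when the `with` block exits, even if training raises an exception. A minimal sketch with placeholder values and a hypothetical tag name:

```python
import os
import tempfile

from torch.utils.tensorboard import SummaryWriter

# Temporary directory so the demo does not touch the lab's runs/ folder
log_dir = tempfile.mkdtemp(prefix="tb_context_demo_")

# The writer is flushed and closed automatically at the end of the block
with SummaryWriter(log_dir=log_dir) as writer:
    for step in range(3):
        writer.add_scalar("demo/value", step * 0.5, step)

event_files = [f for f in os.listdir(log_dir) if f.startswith("events.out.tfevents")]
print(len(event_files))  # 1
```

In the experiment loop, this would replace the explicit `writer.close()` call after each run.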

### What We Covered

Looking back at our workflow diagram:
- ✅ **Phase 1 (Setup)**: Initialized `SummaryWriter`, configured log directories
- ✅ **Phase 2 (Training)**: Logged metrics during training, saved event files
- ✅ **Phase 3 (Analysis)**: Launched TensorBoard, compared experiments, identified best config

---
## Next Steps

In **Lab 2**, we'll explore data scaling: training the same model on different dataset sizes to understand the data-performance relationship.
