# Lab 1: Data Setup & Model Builder Scripts

In this lab, you'll learn how to **turn notebook code into reusable Python scripts** - a key skill for moving from experimentation to production.

## Learning Objectives

By the end of this lab, you will be able to:

- Understand the benefits of modular code organization
- Create `data_setup.py` for DataLoader creation
- Create `model_builder.py` with the TinyVGG architecture
- Import and use modular scripts in notebooks


## Why Go Modular?
| Notebooks | Python Scripts |
|-----------|----------------|
| **Quick experimentation** - Test ideas rapidly | **Reusable code** - Write once, import anywhere (no copy-pasting!) |
| **Visualization** - See plots inline | **Version control** - Track changes with git |
| **Sharing ideas** - Show results easily | **Cloud & servers** - Run training jobs remotely |
| **Learning & prototyping** - Ideal for iteration | **Production deployments** - Integrate into apps & pipelines |

**Key insight:** Notebooks are great for exploration, but as your code matures, scripts make it easier to maintain, test, and scale your ML workflows.

**The common pattern:** Start in notebooks, move to scripts when you have working code.

## Step 0: Install Dependencies

In [1]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
!pip install matplotlib requests

Looking in indexes: https://download.pytorch.org/whl/cpu


## Step 1: Import Libraries

In [2]:
import os
import requests
import zipfile
from pathlib import Path

import torch
from torch import nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

print(f"PyTorch version: {torch.__version__}")

PyTorch version: 2.9.0+cpu


## Step 2: Download the Dataset

We'll use the **pizza_steak_sushi** dataset - a small image classification dataset with 3 classes.

**Dataset Structure (ImageFolder format):**
```
data/
└── pizza_steak_sushi/
    ├── train/
    │   ├── pizza/
    │   ├── steak/
    │   └── sushi/
    └── test/
        ├── pizza/
        ├── steak/
        └── sushi/
```

In [3]:
data_path = Path("data/")
image_path = data_path / "pizza_steak_sushi"

if image_path.is_dir():
    print(f"{image_path} directory exists.")
else:
    print(f"Did not find {image_path} directory, creating one...")
    image_path.mkdir(parents=True, exist_ok=True)
    
    # Download before opening the file, and stop on an HTTP error so a failed
    # response is never written to disk as a zip
    print("Downloading pizza, steak, sushi data...")
    request = requests.get("https://raw.githubusercontent.com/poridhioss/Introduction-to-Deep-Learning-with-Pytorch-Resources/main/Going-module/pizza_steak_sushi.zip")
    request.raise_for_status()
    with open(data_path / "pizza_steak_sushi.zip", "wb") as f:
        f.write(request.content)

    with zipfile.ZipFile(data_path / "pizza_steak_sushi.zip", "r") as zip_ref:
        print("Unzipping pizza, steak, sushi data...")
        zip_ref.extractall(image_path)

    os.remove(data_path / "pizza_steak_sushi.zip")
    print("Download complete!")

Did not find data/pizza_steak_sushi directory, creating one...
Downloading pizza, steak, sushi data...
Unzipping pizza, steak, sushi data...
Download complete!


In [4]:
train_dir = image_path / "train"
test_dir = image_path / "test"

print(f"Train directory: {train_dir}")
print(f"Test directory: {test_dir}")

Train directory: data/pizza_steak_sushi/train
Test directory: data/pizza_steak_sushi/test
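
Before building datasets, it can help to check how many files each class folder holds. The sketch below is one way to do that for any ImageFolder-style directory. The helper name `count_files_per_class` is hypothetical (it is not part of the lab's scripts), and the demo runs on a throwaway directory rather than the real dataset so that it is self-contained.

```python
import tempfile
from pathlib import Path


def count_files_per_class(split_dir):
    """Return {class_name: number_of_files} for an ImageFolder-style split directory."""
    split_dir = Path(split_dir)
    return {
        class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
        for class_dir in sorted(split_dir.iterdir())
        if class_dir.is_dir()
    }


# Demo on a temporary directory that mimics data/pizza_steak_sushi/train
with tempfile.TemporaryDirectory() as tmp:
    for class_name, n_files in [("pizza", 3), ("steak", 2), ("sushi", 1)]:
        class_dir = Path(tmp) / "train" / class_name
        class_dir.mkdir(parents=True)
        for i in range(n_files):
            (class_dir / f"{i}.jpg").touch()  # empty placeholder files
    demo_counts = count_files_per_class(Path(tmp) / "train")

print(demo_counts)
```

In the notebook, calling `count_files_per_class(train_dir)` and `count_files_per_class(test_dir)` would report the per-class counts for the downloaded data.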


## Step 3: Create DataLoaders (Traditional Way)

Before creating our modular script, let's see the standard notebook approach:

1. Define transforms
2. Create datasets using `ImageFolder`
3. Wrap datasets in `DataLoader`

In [5]:
data_transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor()
])

train_data = datasets.ImageFolder(train_dir, transform=data_transform)
test_data = datasets.ImageFolder(test_dir, transform=data_transform)

print(f"Train dataset size: {len(train_data)}")
print(f"Test dataset size: {len(test_data)}")
print(f"Classes: {train_data.classes}")

Train dataset size: 225
Test dataset size: 75
Classes: ['pizza', 'steak', 'sushi']


In [6]:
BATCH_SIZE = 32

train_dataloader = DataLoader(
    train_data,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=2,
    pin_memory=True
)

test_dataloader = DataLoader(
    test_data,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=0,
    pin_memory=True
)

print(f"Number of training batches: {len(train_dataloader)}")
print(f"Number of test batches: {len(test_dataloader)}")

Number of training batches: 8
Number of test batches: 3
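
The batch counts above follow from the dataset sizes: a `DataLoader` with the default `drop_last=False` yields `ceil(num_samples / batch_size)` batches, and the final batch holds whatever is left over. A small sketch of that arithmetic (the helper name `batch_stats` is only illustrative):

```python
import math


def batch_stats(num_samples, batch_size):
    """Return (number_of_batches, size_of_last_batch) when no samples are dropped."""
    num_batches = math.ceil(num_samples / batch_size)
    last_batch_size = num_samples - (num_batches - 1) * batch_size
    return num_batches, last_batch_size


train_batches, train_last = batch_stats(225, 32)
test_batches, test_last = batch_stats(75, 32)

print(train_batches, train_last)  # 8 1
print(test_batches, test_last)    # 3 11
```

Note that the last training batch contains a single image here. Passing `drop_last=True` to `DataLoader` would discard such an incomplete batch.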


In [7]:
# Check a batch
images, labels = next(iter(train_dataloader))
print(f"Image batch shape: {images.shape}")
print(f"Label batch shape: {labels.shape}")



Image batch shape: torch.Size([32, 3, 64, 64])
Label batch shape: torch.Size([32])


## Step 4: Create `data_setup.py`

Now let's **convert this code into a reusable function** and save it as `data_setup.py`.

**Function:** `create_dataloaders()`

| Parameter | Description |
|-----------|-------------|
| `train_dir` | Path to training directory |
| `test_dir` | Path to testing directory |
| `transform` | Torchvision transforms to apply |
| `batch_size` | Samples per batch |
| `num_workers` | Workers for data loading |

**Returns:** `(train_dataloader, test_dataloader, class_names)`

In [8]:
# Create going_modular directory
going_modular_path = Path("going_modular")
going_modular_path.mkdir(parents=True, exist_ok=True)
print(f"Created directory: {going_modular_path}")

Created directory: going_modular


In [9]:
%%writefile going_modular/data_setup.py
import os

from torchvision import datasets, transforms
from torch.utils.data import DataLoader

NUM_WORKERS = os.cpu_count() or 0  # cpu_count() can return None; 0 loads data in the main process

def create_dataloaders(
    train_dir: str, 
    test_dir: str, 
    transform: transforms.Compose, 
    batch_size: int, 
    num_workers: int = NUM_WORKERS
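):
    """Create training and testing DataLoaders from ImageFolder-style directories.

    Returns a tuple of (train_dataloader, test_dataloader, class_names).
    """
    train_data = datasets.ImageFolder(train_dir, transform=transform)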
    test_data = datasets.ImageFolder(test_dir, transform=transform)

    class_names = train_data.classes

    train_dataloader = DataLoader(
        train_data,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=True,
    )
    test_dataloader = DataLoader(
        test_data,
        batch_size=batch_size,
        shuffle=False,
        num_workers=num_workers,
        pin_memory=True,
    )

    return train_dataloader, test_dataloader, class_names

Writing going_modular/data_setup.py


### Test `data_setup.py`

Let's verify our module works by importing and using it:

In [10]:
from going_modular import data_setup

train_dataloader, test_dataloader, class_names = data_setup.create_dataloaders(
    train_dir=train_dir,
    test_dir=test_dir,
    transform=data_transform,
    batch_size=32,
    num_workers=2
)

print(f"Class names: {class_names}")
print(f"Train batches: {len(train_dataloader)}")
print(f"Test batches: {len(test_dataloader)}")

Class names: ['pizza', 'steak', 'sushi']
Train batches: 8
Test batches: 3
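
One caveat when iterating on a script from a notebook: Python caches imported modules, so re-running an `import` after editing the file does not pick up the changes. The standard library's `importlib.reload()` re-executes the module instead. The sketch below demonstrates this with a throwaway module; the name `demo_settings` is hypothetical and stands in for something like `data_setup`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

sys.dont_write_bytecode = True  # keep the demo free of cached .pyc files

with tempfile.TemporaryDirectory() as tmp:
    module_file = Path(tmp) / "demo_settings.py"  # hypothetical module
    module_file.write_text("BATCH_SIZE = 32\n")

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    import demo_settings
    first = demo_settings.BATCH_SIZE

    # Edit the file on disk, as re-running a %%writefile cell would
    module_file.write_text("BATCH_SIZE = 64  # edited\n")

    import demo_settings  # no effect: the module is already in sys.modules
    cached = demo_settings.BATCH_SIZE

    importlib.reload(demo_settings)  # re-executes the edited source
    reloaded = demo_settings.BATCH_SIZE

    sys.path.remove(tmp)

print(first, cached, reloaded)  # 32 32 64
```

In this lab the equivalent call would be `importlib.reload(data_setup)` after rewriting `going_modular/data_setup.py`. Restarting the kernel has the same effect.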


## Step 5: Build TinyVGG Model (Traditional Way)

**TinyVGG** is a simplified VGG architecture:

| Component | Description |
|-----------|-------------|
| Conv Block 1 | 2 x (Conv2d + ReLU), then MaxPool2d |
| Conv Block 2 | 2 x (Conv2d + ReLU), then MaxPool2d |
| Classifier | Flatten + Linear |

This architecture is small enough to train quickly, even on a CPU, while still having enough capacity for a small image classification task such as this three-class dataset.

In [11]:
class TinyVGG(nn.Module):
    def __init__(self, input_shape: int, hidden_units: int, output_shape: int) -> None:
        super().__init__()
        
        self.conv_block_1 = nn.Sequential(
            nn.Conv2d(in_channels=input_shape, 
                      out_channels=hidden_units, 
                      kernel_size=3, 
                      stride=1, 
                      padding=0),  
            nn.ReLU(),
            nn.Conv2d(in_channels=hidden_units, 
                      out_channels=hidden_units,
                      kernel_size=3,
                      stride=1,
                      padding=0),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        
        self.conv_block_2 = nn.Sequential(
            nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=0),
            nn.ReLU(),
            nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=0),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        
        self.classifier = nn.Sequential(
            nn.Flatten(),
            # Note: in_features depends on the input image size.
            # For 64x64 images: 64 -> 62 -> 60 -> 30 (block 1), 30 -> 28 -> 26 -> 13 (block 2),
            # so the flattened size is hidden_units * 13 * 13
            nn.Linear(in_features=hidden_units*13*13, out_features=output_shape)
        )
    
    def forward(self, x: torch.Tensor):
        x = self.conv_block_1(x)
        x = self.conv_block_2(x)
        x = self.classifier(x)
        return x

In [12]:
torch.manual_seed(42)
model = TinyVGG(
    input_shape=3,  # RGB images
    hidden_units=10,
    output_shape=len(class_names)  # 3 classes
)

print(model)

TinyVGG(
  (conv_block_1): Sequential(
    (0): Conv2d(3, 10, kernel_size=(3, 3), stride=(1, 1))
    (1): ReLU()
    (2): Conv2d(10, 10, kernel_size=(3, 3), stride=(1, 1))
    (3): ReLU()
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (conv_block_2): Sequential(
    (0): Conv2d(10, 10, kernel_size=(3, 3), stride=(1, 1))
    (1): ReLU()
    (2): Conv2d(10, 10, kernel_size=(3, 3), stride=(1, 1))
    (3): ReLU()
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (classifier): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=1690, out_features=3, bias=True)
  )
)


In [13]:
# Test with a sample batch
images, labels = next(iter(train_dataloader))
print(f"Input shape: {images.shape}")

# Forward pass
output = model(images)
print(f"Output shape: {output.shape}")

Input shape: torch.Size([32, 3, 64, 64])
Output shape: torch.Size([32, 3])


## Step 6: Create `model_builder.py`

Save our TinyVGG model to a reusable Python script:

In [14]:
%%writefile going_modular/model_builder.py

import torch
from torch import nn

class TinyVGG(nn.Module):
    def __init__(self, input_shape: int, hidden_units: int, output_shape: int) -> None:
        super().__init__()
        self.conv_block_1 = nn.Sequential(
            nn.Conv2d(in_channels=input_shape, 
                      out_channels=hidden_units, 
                      kernel_size=3, 
                      stride=1, 
                      padding=0),  
            nn.ReLU(),
            nn.Conv2d(in_channels=hidden_units, 
                      out_channels=hidden_units,
                      kernel_size=3,
                      stride=1,
                      padding=0),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv_block_2 = nn.Sequential(
            nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=0),
            nn.ReLU(),
            nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=0),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            # Where did this in_features shape come from?
            # Each unpadded 3x3 conv trims 2 pixels from height and width, and each max pool halves them,
            # so a 64x64 input reaches the classifier as 13x13 feature maps.
            nn.Linear(in_features=hidden_units*13*13, out_features=output_shape)
        )
    
    def forward(self, x: torch.Tensor):
        x = self.conv_block_1(x)
        x = self.conv_block_2(x)
        x = self.classifier(x)
        return x
        # Alternatively: return self.classifier(self.conv_block_2(self.conv_block_1(x)))

Writing going_modular/model_builder.py


### Test `model_builder.py`

Import and verify the model works:

In [15]:
from going_modular import model_builder

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

torch.manual_seed(42)
model = model_builder.TinyVGG(
    input_shape=3,
    hidden_units=10,
    output_shape=len(class_names)
).to(device)

print(model)

Using device: cpu
TinyVGG(
  (conv_block_1): Sequential(
    (0): Conv2d(3, 10, kernel_size=(3, 3), stride=(1, 1))
    (1): ReLU()
    (2): Conv2d(10, 10, kernel_size=(3, 3), stride=(1, 1))
    (3): ReLU()
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (conv_block_2): Sequential(
    (0): Conv2d(10, 10, kernel_size=(3, 3), stride=(1, 1))
    (1): ReLU()
    (2): Conv2d(10, 10, kernel_size=(3, 3), stride=(1, 1))
    (3): ReLU()
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (classifier): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=1690, out_features=3, bias=True)
  )
)


In [16]:
images, labels = next(iter(train_dataloader))
images = images.to(device)

with torch.inference_mode():
    output = model(images)
    
print(f"Input shape: {images.shape}")
print(f"Output shape: {output.shape}")
print(f"Output (first 3 samples):\n{output[:3]}")

Input shape: torch.Size([32, 3, 64, 64])
Output shape: torch.Size([32, 3])
Output (first 3 samples):
tensor([[ 0.0208, -0.0020,  0.0095],
        [ 0.0184,  0.0025,  0.0067],
        [ 0.0177,  0.0010,  0.0095]])
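
The values printed above are raw logits, not probabilities. Applying softmax turns them into class probabilities, and the index of the largest one is the predicted class. The sketch below does this in plain Python for the first row of the output shown above. In PyTorch the same step would typically be `torch.softmax(output, dim=1)` followed by `argmax(dim=1)`.

```python
import math

class_names = ["pizza", "steak", "sushi"]
logits = [0.0208, -0.0020, 0.0095]  # first row of the output printed above

# Softmax: exponentiate each logit, then normalise so the values sum to 1
exps = [math.exp(value) for value in logits]
total = sum(exps)
probs = [value / total for value in exps]

# The predicted class is the one with the highest probability
predicted_index = max(range(len(probs)), key=probs.__getitem__)
predicted_class = class_names[predicted_index]

print([round(p, 3) for p in probs])
print(predicted_class)
```

Because the model has not been trained yet, the three probabilities are all close to one third and the "prediction" carries no meaning. It only confirms that the output has one score per class.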


## Step 7: Verify Directory Structure

Confirm both scripts were created successfully:

In [17]:
import os

print("Files in going_modular/:")
for file in os.listdir("going_modular"):
    print(f"  - {file}")

Files in going_modular/:
  - data_setup.py
  - __pycache__
  - model_builder.py


## Summary

In this lab, you learned how to go modular with PyTorch:

| Step | What You Did | Result |
|------|--------------|--------|
| 2 | Downloaded the data | `pizza_steak_sushi` dataset in `data/` |
| 4 | Created `data_setup.py` | `create_dataloaders()` function |
| 6 | Created `model_builder.py` | `TinyVGG` class |

**Final Directory Structure:**
```
going_modular/
├── data_setup.py      # DataLoader creation
└── model_builder.py   # TinyVGG model
```

**Next:** In Lab 2, you'll create `engine.py` with training and testing functions!

## Exercises

1. **Modify `data_setup.py`** to accept separate transforms for training and testing data (training might use data augmentation while testing should not).

2. **Add a new model** to `model_builder.py` called `TinyVGG_v2` that uses `BatchNorm2d` after each convolutional layer.

3. **Create a `get_data.py` script** that downloads the data if it doesn't exist. This could be imported in `train.py` later.