# 📚 Custom Dataset Building in PyTorch

## 🎯 What You'll Learn
This notebook teaches you how to:
1. **Build Custom Datasets** for images with labels (Cats & Dogs)
2. **Handle Text Datasets** for image captioning (Flickr8k)
3. **Create DataLoaders** for efficient batch processing
4. **Build Vocabularies** for text processing

---

## 🤔 Why Do We Need Custom Datasets?

**The Restaurant Analogy:**
- **PyTorch's Built-in Datasets** = Fast food menu (limited options: MNIST, CIFAR10)
- **Custom Datasets** = Your own restaurant menu (any dish you want!)

**Real Problem:**
Most real-world projects need custom data (medical images, company photos, specific text), not just standard datasets.

---

## 📋 Two Examples Covered

### Example 1: Cats & Dogs Classification 🐱🐶
**Goal:** Given an image → predict "cat" or "dog"

**Files needed:**
- `cats_dogs/` folder with images (`cat_001.jpg`, `dog_001.jpg`, etc.)
- `cats_dogs.csv` file with labels (filename, class)

### Example 2: Image Captioning 📷💬
**Goal:** Given an image → generate a text description

**Files needed:**
- `flickr8k_images/` folder with photos
- `captions.txt` file with image-caption pairs

Let's build both!



1. Top panel — dataset assets

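* A folder named **cats_dogs_resized** → this is the directory that holds the prepared (resized) images. The code below uses `root_dir='cats_dogs'`, so point `root_dir` at whatever your folder is actually called.
* A file named **cats_dogs** (displayed with an Excel icon) → the label file (`cats_dogs.csv` in the code) that maps filename → class. The size shows ~394 KB.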

<figure>
  <img src="asset/cat_dog_directory.png" alt="File Directory containing cat dog image and csv file with their identity" width="800">
</figure>

2. Middle panel — the images themselves

* Inside **cats_dogs_resized**, Windows Explorer displays a grid of **dog photos** (filenames like `dog_3286.jpg`, `dog_3297.jpg`, … `dog_3312.jpg`).
* These are varied pictures (different breeds/poses/backgrounds) but all **uniformly resized**, suitable for a ML dataset.
<figure>
  <img src="asset/cat_dog_image_directory.png" alt="inside of folder" width="800">
</figure>


3. Bottom panel — the label spreadsheet

* An Excel sheet with **filenames in Column A** (e.g., `dog_2511.jpg`, `dog_2512.jpg`, …).
* **Column B contains numeric class labels** (shown as `1` for these rows). Given the filenames are “dog_…”, this implies a mapping such as **1 = dog** (and likely **0 = cat** elsewhere in the sheet).
* Columns C/D are empty in the visible portion (reserved for other info if needed).

<figure>
  <img src="asset/cat_dog_csv_structure.png" alt="inside of the csv" width="400">
</figure>

In short: these screenshots show a typical classification dataset setup: (1) a folder of resized images, (2) a preview of those images, and (3) a spreadsheet mapping each filename to a class label.

---

# 🐱 Part 1: Cats & Dogs Classification Dataset

## 📂 Understanding the Data Structure

As the screenshots above show, Part 1 relies on two assets: a folder of images and a label file that maps each filename to a numeric class.

---

## 📦 Step 1: Import Required Libraries

### 🍕 Simple Analogy: Kitchen Tools
Before cooking, you gather your tools:
- **Knife** = `torch` (main PyTorch library)
- **Cutting board** = `Dataset` (structure for organizing data)
- **Serving tray** = `DataLoader` (delivers batches of data)
- **Recipe book** = `pandas` (reads CSV files)
- **Camera** = `io` (loads images from disk)

### 🔧 What Each Library Does


```python
import torch                        # Main PyTorch library (tensors, neural networks)
from torch.utils.data import DataLoader, Dataset  # Tools for data handling
from skimage import io             # For reading images
import pandas as pd                # For reading CSV files
import os                          # For file path operations
```

---

## 🖥️ Step 2: Set Device (CPU or GPU)

### 🚗 Simple Analogy: Choosing Your Vehicle
- **GPU (cuda)** = Race car 🏎️ (fast, for heavy workloads)
- **CPU** = Regular car 🚗 (slower, but always available)

### 🔧 Technical Explanation
The line below checks whether a CUDA-capable GPU is available. If yes, PyTorch uses it; otherwise it falls back to the CPU.


```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```

**What happens:**
- If GPU exists → `device = "cuda"` ⚡
- If no GPU → `device = "cpu"` 🐢
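
Once `device` is set, tensors (and later the model) are moved onto it with `.to(device)`. Here is a minimal sketch, assuming a made-up batch of four 128×128 RGB images; the `batch` tensor is purely illustrative and not part of the notebook:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for a batch of 4 RGB images, 128x128 each
batch = torch.zeros(4, 3, 128, 128)

# .to(device) returns the tensor on the target device
# (the same tensor is returned if it already lives there)
batch = batch.to(device)

print(batch.device.type)  # "cuda" on a GPU machine, otherwise "cpu"
```

The same `.to(device)` call is what you would typically apply to each batch inside a training loop.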

---

## 🏗️ Step 3: Build the Custom Dataset Class

### 🍕 Simple Analogy: Pizza Delivery Service

Imagine you run a pizza delivery service:
1. **Menu (CSV file)**: List of orders → `["Margherita.jpg", "Pepperoni"]`
2. **Kitchen (root_dir)**: Where pizzas are stored → `pizzas/` folder
3. **Transform**: Heat up the pizza before delivery (resize, normalize images)

When customer orders pizza #5:
- Look up order #5 in menu → get filename and type
- Go to kitchen, grab that pizza
- Heat it up (transform)
- Deliver (return image + label)

### 🔧 Technical Explanation: The Dataset Class

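A map-style **Dataset** in PyTorch typically defines 3 methods (`__len__` and `__getitem__` are what the `DataLoader` relies on; `__init__` is where you do the setup):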
1. **`__init__`**: Setup (read CSV, store paths)
2. **`__len__`**: Return total number of items
3. **`__getitem__`**: Get one item (image + label) by index

Let's break down each part:


```python
class CatsandDogsDataset(Dataset):
    def __init__(self, csv_file, root_dir, transform=None):
        """
        Initialize the dataset
        
        Args:
            csv_file: Path to CSV with [filename, label] columns
            root_dir: Directory with all the images
            transform: Optional transforms to apply to images
        """
        self.annotations = pd.read_csv(csv_file)  # Read CSV into DataFrame
        self.root_dir = root_dir                   # Store image folder path
        self.transform = transform                 # Store transforms
        
    def __len__(self):
        """Return the total number of samples"""
        return len(self.annotations)  # Number of rows in CSV
    
    def __getitem__(self, index):
        """
        Get one sample (image + label) by index
        
        Args:
            index: Integer index (0 to len-1)
            
        Returns:
            image: Transformed image tensor
            y_label: Class label (0 or 1)
        """
        # 1. Get image path from CSV (row=index, column=0)
        img_path = os.path.join(self.root_dir, self.annotations.iloc[index, 0])
        
        # 2. Load image from disk
        image = io.imread(img_path)  # Returns numpy array (H, W, C)
        
        # 3. Get label from CSV (row=index, column=1)
        y_label = torch.tensor(int(self.annotations.iloc[index, 1]))
        
        # 4. Apply transforms if provided
        if self.transform:
            image = self.transform(image)
            
        return image, y_label
```

### 📊 Example Walkthrough

**CSV File (`cats_dogs.csv`):**
```
filename,label
cat_001.jpg,0
dog_001.jpg,1
cat_002.jpg,0
```

**Code Flow:**
```python
dataset[1]  # Request item at index 1
```

**Step-by-step:**
1. `index = 1`
2. `img_path = "cats_dogs/dog_001.jpg"` (row 1, column 0)
3. `image = io.imread("cats_dogs/dog_001.jpg")` → loads image as array
4. `y_label = tensor(1)` (row 1, column 1)
5. `image = transform(image)` → resize to (128, 128) and convert to tensor
6. `return (image_tensor, tensor(1))`
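
If you don't have the image folder at hand, the same three-method pattern can be tried on fake data. The sketch below is only an illustration: `ToyLabeledDataset` is a hypothetical class, the rows are hard-coded instead of read from a CSV, and random tensors stand in for images loaded from disk.

```python
import torch
from torch.utils.data import Dataset


class ToyLabeledDataset(Dataset):
    """Hypothetical stand-in: random 'images' plus labels from a list of rows."""

    def __init__(self, rows, transform=None):
        self.rows = rows              # list of (filename, label) pairs, like CSV rows
        self.transform = transform

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        filename, label = self.rows[index]
        # A real dataset would read `filename` from disk here; we fake the pixels
        image = torch.rand(3, 128, 128)
        if self.transform:
            image = self.transform(image)
        return image, torch.tensor(int(label))


rows = [("cat_001.jpg", 0), ("dog_001.jpg", 1), ("cat_002.jpg", 0)]
toy = ToyLabeledDataset(rows)

image, y_label = toy[1]       # indexing calls __getitem__(1)
print(len(toy), image.shape, y_label)
```

Indexing with `toy[1]` and calling `len(toy)` are exactly the two operations a `DataLoader` performs behind the scenes.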

---

## 🎬 Step 4: Create Dataset Instance & Apply Transforms

### 🍕 Simple Analogy: Setting Up Your Order System

You're now opening your pizza shop:
1. **Menu** = `cats_dogs.csv` (what pizzas you have)
2. **Kitchen location** = `cats_dogs/` folder
3. **Heating instructions** = transforms (resize, convert to tensor)

### 🔧 What Each Transform Does

```python
import torchvision.transforms as transforms

dataset = CatsandDogsDataset(
    csv_file='cats_dogs.csv',      # Path to labels
    root_dir='cats_dogs',          # Path to images folder
    transform=transforms.Compose([  # Chain of transformations
        transforms.ToPILImage(),    # 1. Convert numpy array to PIL Image
        transforms.Resize((128, 128)),  # 2. Resize to 128x128 pixels
        transforms.ToTensor()       # 3. Convert to tensor & scale values to [0, 1]
    ])
)
```

### 📊 Transform Pipeline Explanation

**Input:** Numpy array from `io.imread()` → shape `(H, W, 3)`, values `[0-255]`

**Step 1: `ToPILImage()`**
- Converts numpy array to PIL Image
- Why? `io.imread()` returns a NumPy array, and `Resize()` works on PIL Images (or tensors), not on NumPy arrays

**Step 2: `Resize((128, 128))`**
- Resizes image to 128×128 pixels
- Why? Neural networks need fixed-size inputs

**Step 3: `ToTensor()`**
- Converts PIL Image to PyTorch tensor
- Changes shape from `(H, W, C)` to `(C, H, W)`
- Scales values from `[0, 255]` to `[0.0, 1.0]` (this is rescaling only; no mean/std normalization is applied)

**Output:** Tensor of shape `(3, 128, 128)` with values in `[0.0, 1.0]`
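
To make the `ToTensor()` step concrete, its two effects (axis reordering and scaling) can be reproduced by hand. This is a rough sketch, assuming a random `uint8` array stands in for a loaded image; it mimics the behaviour described above rather than calling torchvision itself, and the variable names are illustrative:

```python
import numpy as np
import torch

# Hypothetical stand-in for io.imread(...): an (H, W, C) uint8 array in [0, 255]
rng = np.random.default_rng(0)
image_hwc = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)

# (H, W, C) -> (C, H, W), then uint8 [0, 255] -> float32 [0.0, 1.0]
tensor_chw = torch.from_numpy(image_hwc).permute(2, 0, 1).float() / 255.0

print(tensor_chw.shape)   # torch.Size([3, 128, 128])
print(tensor_chw.dtype)   # torch.float32
```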

---

## 🔀 Step 5: Split into Train/Test Sets

### 🍕 Simple Analogy: Practice vs Competition

You have 1000 pizzas:
- **800 pizzas** = Practice cooking (training set) 👨‍🍳
- **200 pizzas** = Competition cooking (test set) 🏆

You practice on 800, then test your skills on unseen 200 pizzas.
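
The practice/competition split can be tried on stand-in data before touching real images. Below is a small sketch, assuming a `TensorDataset` of 1000 blank images; the seed value `42`, the tiny 8×8 image size, and all variable names are illustrative choices:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Hypothetical stand-in for the real dataset: 1000 blank images with random labels
images = torch.zeros(1000, 3, 8, 8)
labels = torch.randint(0, 2, (1000,))
toy_dataset = TensorDataset(images, labels)

# A seeded generator makes the random split repeatable between runs
generator = torch.Generator().manual_seed(42)
train_set, test_set = random_split(toy_dataset, [800, 200], generator=generator)

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
test_loader = DataLoader(test_set, batch_size=32, shuffle=False)

print(len(train_set), len(test_set))        # 800 200
print(len(train_loader), len(test_loader))  # 25 7
```

Note that 200 is not a multiple of 32, so the last test batch holds only 8 samples.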

### 🔧 Technical Explanation

```python
train_set, test_set = torch.utils.data.random_split(dataset, [800, 200])

train_loader = DataLoader(dataset=train_set, batch_size=32, shuffle=True)
test_loader = DataLoader(dataset=test_set, batch_size=32, shuffle=True)
```

### 📊 What Happens Here

**1. `random_split(dataset, [800, 200])`**
- Randomly splits the 1000 images into subsets of 800 and 200
- Returns two `Subset` objects
- The lengths must add up to `len(dataset)`, so adjust them if your dataset does not contain exactly 1000 images

**2. `DataLoader` for Training:**
- `batch_size=32`: Delivers 32 images at a time (not one by one)
- `shuffle=True`: Randomizes the order each epoch, so the model never sees the samples in a fixed sequence

**3. `DataLoader` for Testing:**
- Same batch size
- The code shuffles here too, but evaluation results do not depend on sample order, so `shuffle=False` works just as well and keeps the order reproducible

### 🎯 Why Batching?

**Without batching (batch_size=1):**
```python
for image, label in train_loader:  # One image at a time
    # image.shape = (1, 3, 128, 128)
    # Train on 1 image → update weights → repeat 800 times
    # Super slow! 🐢
```

**With batching (batch_size=32):**
```python
for images, labels in train_loader:  # 32 images at once
    # images.shape = (32, 3, 128, 128)
    # Train on 32 images → update weights → repeat 25 times
    # Much faster! ⚡
```

---

# 📷 Part 2: Image Captioning Dataset

## 📂 Understanding the Data Structure

These screenshots show the structure of the **Flickr8k Image Captioning dataset**, where each photo (in the `images` folder) is linked to multiple natural-language descriptions (in the `captions` file). The task is to build a model that can look at a new image and describe it in words.

### Folder structure
<figure>
  <img src="asset/image_caption_directory.png" alt="File Directory containing  image and text file with their caption" width="800">
</figure>

* **`images`** → A folder containing all the image files.
* **`captions`** → A text file (`.txt`) containing all the image-caption pairs.
  This setup is part of the **Flickr8k dataset**, which is widely used for *image captioning* tasks in deep learning.

---

### Folder content preview
<figure>
  <img src="asset/image_caption_image_directory.png" alt="inside of folder" width="800">
</figure>

* The path is `Desktop > customdata > flickr8k > images`.
* It shows thumbnails of several **JPEG images**, each named with an ID like `667626_18933d13e`, `10815824_2997e03d76`, etc.
* The images show various human and animal activities (children playing, dogs, people near water, etc.).
  These are the **input images** used for training and evaluation in the captioning model.

---

### Captions file opened in Notepad
<figure>
  <img src="asset/image_caption_caption_file.png" alt="inside of the txt" width="800">
</figure>

* The text file has **two columns**:

  * **Column 1:** Image filename (e.g., `1000268201_693b08cb0e.jpg`)
  * **Column 2:** The **caption** (natural language description).
* Example lines:

  * `1000268201_693b08cb0e.jpg, A child in a pink dress is climbing up a set of stairs in an entry way.`
  * `1000268201_693b08cb0e.jpg, A girl going into a wooden building.`
  * `1000268201_693b08cb0e.jpg, A little girl climbing into a wooden playhouse.`
* Note that the same image appears multiple times, each with a **different caption** → this is a common feature in Flickr8k: *five human-written captions per image*.

**In short:**

| Component       | Description                               | Purpose                              |
| --------------- | ----------------------------------------- | ------------------------------------ |
| `images` folder | Set of photos (8,000 total)               | Visual input                         |
| `captions.txt`  | Image filename + 5 human captions         | Ground-truth text labels             |
| Combined usage  | Each image paired with multiple sentences | For training Image Captioning models |

---

## 🤔 What's Different Here?

**Cats & Dogs:**
- Input: Image → Output: Single number (0 or 1)

**Image Captioning:**
- Input: Image → Output: **Sentence** ("A dog playing in the park")

**New Challenge:** How do we handle text?
- Need to convert words to numbers (vocabulary)
- Need to handle variable-length captions
- Need to pad sequences to same length

Let's build it step by step!

---

## 📖 Step 6: Build a Vocabulary Class

### 🍕 Simple Analogy: Restaurant Menu with Item Numbers

Imagine a restaurant where:
- Each dish has a **number** (easier for kitchen)
- **Menu**: "Pizza" = 5, "Burger" = 10, "Salad" = 15

**For neural networks:**
- Can't process words directly ("cat", "dog", "running")
- Need to convert words to numbers: {"cat": 4, "dog": 5, "running": 6}

**Special tokens:**
- `<PAD>` = 0: Padding (to make all sentences same length)
- `<SOS>` = 1: Start of sentence
- `<EOS>` = 2: End of sentence
- `<UNK>` = 3: Unknown word (not in vocabulary)

### 🔧 Technical Explanation: Vocabulary Class

This class converts text to numbers and vice versa:

```python
import spacy
spacy_eng = spacy.load("en_core_web_sm")  # English language model

class Vocabulary:
    def __init__(self, freq_threshold):
        """
        freq_threshold: Minimum word frequency to include in vocabulary
                       (e.g., 5 means word must appear 5+ times)
        """
        self.freq_threshold = freq_threshold
        
        # Initialize with special tokens
        self.itos = {0: "<PAD>", 1: "<SOS>", 2: "<EOS>", 3: "<UNK>"}  # index to string
        self.stoi = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3}  # string to index
        
    def __len__(self):
        """Return vocabulary size"""
        return len(self.itos)
    
    @staticmethod
    def tokenizer(text):
        """
        Split sentence into words (tokens)
        Example: "I love dogs" → ['i', 'love', 'dogs']
        """
        return [tok.text.lower() for tok in spacy_eng.tokenizer(text)]
    
    def build_vocabulary(self, sentence_list):
        """
        Build vocabulary from list of sentences
        Only include words that appear >= freq_threshold times
        """
        frequencies = {}  # Count word occurrences
        idx = 4  # Start from 4 (0-3 reserved for special tokens)
        
        for sentence in sentence_list:
            for word in self.tokenizer(sentence):
                # Count word frequency
                if word not in frequencies:
                    frequencies[word] = 1
                else:
                    frequencies[word] += 1
                
                # Add to vocabulary when threshold reached
                if frequencies[word] == self.freq_threshold:
                    self.stoi[word] = idx
                    self.itos[idx] = word
                    idx += 1
    
    def numericalize(self, text):
        """
        Convert text to list of numbers
        Example: "I love dogs" → [45, 67, 89]
        Unknown words → 3 (<UNK>)
        """
        tokenized_text = self.tokenizer(text)
        return [
            self.stoi.get(token, self.stoi["<UNK>"])  # Use <UNK> if word not in vocab
            for token in tokenized_text
        ]
```

### 📊 Example Walkthrough

**Input sentences:**
```python
sentences = [
    "A dog playing in park",
    "A cat sleeping on sofa",
    "A dog running in park",
    "A bird flying in sky",
    "A dog jumping in park"
]
```

**Step 1: Count frequencies**
```python
frequencies = {
    'a': 5, 'dog': 3, 'playing': 1, 'in': 4, 'park': 3,
    'cat': 1, 'sleeping': 1, 'on': 1, 'sofa': 1, ...
}
```

**Step 2: Build vocabulary (freq_threshold=3)**
```python
# Only words appearing 3+ times
vocab.stoi = {
    '<PAD>': 0, '<SOS>': 1, '<EOS>': 2, '<UNK>': 3,
    'a': 4, 'in': 5, 'dog': 6, 'park': 7
}
```

**Step 3: Numericalize**
```python
vocab.numericalize("A dog playing in park")
# Output: [4, 6, 3, 5, 7]
#         'a' 'dog' '<UNK>' 'in' 'park'
#         (playing → <UNK> because frequency < 3)
```
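
The walkthrough above can be reproduced without installing spaCy. In this sketch, plain whitespace splitting stands in for the spaCy tokenizer (an assumption that only holds for simple sentences without punctuation), and `simple_tokenizer`, `build_stoi`, and `numericalize` are hypothetical helper names that mirror the class methods:

```python
def simple_tokenizer(text):
    # Assumption: whitespace splitting stands in for the spaCy tokenizer
    return text.lower().split()


def build_stoi(sentence_list, freq_threshold):
    stoi = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3}
    frequencies = {}
    idx = 4  # 0-3 are reserved for the special tokens
    for sentence in sentence_list:
        for word in simple_tokenizer(sentence):
            frequencies[word] = frequencies.get(word, 0) + 1
            # A word gets its index the moment it reaches the threshold
            if frequencies[word] == freq_threshold:
                stoi[word] = idx
                idx += 1
    return stoi


def numericalize(text, stoi):
    return [stoi.get(token, stoi["<UNK>"]) for token in simple_tokenizer(text)]


sentences = [
    "A dog playing in park",
    "A cat sleeping on sofa",
    "A dog running in park",
    "A bird flying in sky",
    "A dog jumping in park",
]

stoi = build_stoi(sentences, freq_threshold=3)
encoded = numericalize("A dog playing in park", stoi)

print(stoi)
print(encoded)  # [4, 6, 3, 5, 7]
```

Indices are handed out in the order words cross the threshold, which is why `'in'` (third occurrence in sentence 4) gets index 5 before `'dog'` (third occurrence in sentence 5) gets index 6.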

---

## 📦 Step 7: Import Additional Libraries for Image Captioning

```python
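import os                           # For building image paths
import pandas as pd                 # For reading the captions file
import spacy                        # For text tokenization
import torch
from torch.nn.utils.rnn import pad_sequence  # For padding variable-length sequences
from torch.utils.data import DataLoader, Dataset
from PIL import Image              # Loads images as PIL objects, which torchvision transforms accept directly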
```

**New libraries:**
- **`pad_sequence`**: Makes all captions the same length (we'll see this later)
- **`PIL`**: Loads images as PIL objects, so no `ToPILImage()` step is needed before the torchvision transforms

---

## 🏗️ Step 8: Build Flickr Dataset Class

### 🍕 Simple Analogy: Photo Album with Descriptions

You have a photo album where:
- **Photos** = Images in `flickr8k_images/` folder
- **Descriptions** = Captions in `captions.txt` file
- Each photo has 5 different descriptions written by different people

When you request photo #10:
1. Look up description #10 in the text file
2. Find corresponding image
3. Convert description to numbers using vocabulary
4. Return (image_tensor, caption_numbers)

### 🔧 Technical Explanation

```python
class FlickerDataset(Dataset):
    def __init__(self, root_dir, captions_file, transform=None, freq_threshold=5):
        """
        Initialize Flickr8k dataset
        
        Args:
            root_dir: Folder containing images
            captions_file: CSV file with [image, caption] columns
            transform: Image transformations
            freq_threshold: Min frequency for words in vocabulary
        """
        self.root_dir = root_dir
        self.df = pd.read_csv(captions_file)  # Read captions CSV
        self.transform = transform
        
        # Extract columns
        self.imgs = self.df['image']      # Image filenames
        self.captions = self.df['caption']  # Caption texts
        
        # Build vocabulary from all captions
        self.vocab = Vocabulary(freq_threshold)
        self.vocab.build_vocabulary(self.captions.tolist())
        
    def __len__(self):
        """Return total number of image-caption pairs"""
        return len(self.df)
    
    def __getitem__(self, index):
        """
        Get one image-caption pair
        
        Returns:
            image: Transformed image tensor
            numericalized_caption: 1-D tensor of token indices, wrapped in <SOS> and <EOS>
        """
        # 1. Get caption text
        caption = self.captions[index]
        
        # 2. Get image filename and load image
        img_id = self.imgs[index]
        img_path = os.path.join(self.root_dir, img_id)
        image = Image.open(img_path).convert("RGB")
        
        # 3. Apply transforms to image
        if self.transform is not None:
            image = self.transform(image)
        
        # 4. Convert caption to numbers with special tokens
        numericalized_caption = [self.vocab.stoi["<SOS>"]]          # Start token
        numericalized_caption += self.vocab.numericalize(caption)   # Caption words
        numericalized_caption.append(self.vocab.stoi["<EOS>"])      # End token
        
        return image, torch.tensor(numericalized_caption)
```

### 📊 Example Walkthrough

**CSV file (`captions.txt`):**
```
image,caption
dog_001.jpg,A brown dog running in the park
cat_002.jpg,A white cat sleeping on a sofa
```

**Request item at index 0:**
```python
image, caption = dataset[0]
```

**Step-by-step:**
1. `caption = "A brown dog running in the park"`
2. `img_path = "flickr8k_images/dog_001.jpg"`
3. `image = Image.open(...)` → load and transform to tensor
4. Convert caption to numbers:
   ```python
   # Tokenize: ['a', 'brown', 'dog', 'running', 'in', 'the', 'park']
   # Numericalize: [1, 4, 45, 6, 89, 5, 12, 7, 2]
   #                <SOS> a brown dog running in the park <EOS>
   ```
5. Return: `(image_tensor, tensor([1, 4, 45, 6, 89, 5, 12, 7, 2]))`
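
The caption-to-numbers step, and the reverse lookup that is handy for inspecting a sample, can be sketched in plain Python. The small `stoi` table, the whitespace tokenizer, and the `encode_caption` / `decode_caption` helpers below are all assumptions made for illustration; a real vocabulary would be built from the captions file:

```python
# Hypothetical, tiny vocabulary (a real one is built from all captions)
stoi = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3,
        "a": 4, "in": 5, "dog": 6, "park": 7}
itos = {index: token for token, index in stoi.items()}


def encode_caption(caption):
    # Assumption: whitespace splitting stands in for the spaCy tokenizer
    tokens = caption.lower().split()
    body = [stoi.get(token, stoi["<UNK>"]) for token in tokens]
    return [stoi["<SOS>"]] + body + [stoi["<EOS>"]]


def decode_caption(indices):
    return [itos[index] for index in indices]


encoded = encode_caption("A brown dog running in the park")
decoded = decode_caption(encoded)

print(encoded)  # [1, 4, 3, 6, 3, 5, 3, 7, 2]
print(decoded)  # ['<SOS>', 'a', '<UNK>', 'dog', '<UNK>', 'in', '<UNK>', 'park', '<EOS>']
```

With such a small vocabulary, `brown`, `running`, and `the` all collapse to `<UNK>`, which shows why the frequency threshold matters.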

---

## 🔄 Step 9: Custom Collate Function (Padding)

### 🍕 Simple Analogy: Boxing Pizzas of Different Sizes

You have pizzas of different sizes:
- Small pizza: 6 inches
- Medium pizza: 10 inches
- Large pizza: 14 inches

But your delivery boxes are all the same size (14 inches). So you:
- Put small pizza in box + add padding (6 → 14)
- Put medium pizza in box + add padding (10 → 14)
- Put large pizza in box (already fits!)

**For captions:**
- Caption 1: 5 words
- Caption 2: 12 words
- Caption 3: 8 words

Need to pad all to length 12 (longest) using `<PAD>` token (value = 0).

### 🔧 Technical Explanation

```python
class MyCollate:
    def __init__(self, pad_idx):
        """
        pad_idx: Index of <PAD> token (usually 0)
        """
        self.pad_idx = pad_idx
        
    def __call__(self, batch):
        """
        Collate function to create batches with padded captions
        
        Args:
            batch: List of tuples [(image1, caption1), (image2, caption2), ...]
            
        Returns:
            images: Stacked tensor of images (batch_size, C, H, W)
            captions: Padded captions (max_length, batch_size)
        """
        # 1. Extract images and stack them
        images = [item[0].unsqueeze(0) for item in batch]  # Add batch dimension
        images = torch.cat(images, dim=0)  # Stack: (batch_size, C, H, W)
        
        # 2. Extract captions (variable lengths)
        captions = [item[1] for item in batch]
        
        # 3. Pad captions to same length
        captions = pad_sequence(captions, batch_first=False, padding_value=self.pad_idx)
        # Output shape: (max_length, batch_size)
        
        return images, captions
```

### 📊 Example: How Padding Works

**Batch of 3 samples:**
```python
# Token ids used here: a=4, cat=5, dog=6, on=7, sofa=8, in=9, park=10
batch = [
    (image1, tensor([1, 4, 6, 2])),          # Length 4: <SOS> a dog <EOS>
    (image2, tensor([1, 4, 5, 7, 8, 2])),    # Length 6: <SOS> a cat on sofa <EOS>
    (image3, tensor([1, 4, 6, 9, 10, 2]))    # Length 6: <SOS> a dog in park <EOS>
]
```

**After `pad_sequence`:**
```python
captions = tensor([
    [1, 1, 1],      # <SOS> for all 3
    [4, 4, 4],      # 'a' for all 3
    [6, 5, 6],      # 'dog', 'cat', 'dog'
    [2, 7, 9],      # <EOS>, 'on', 'in'
    [0, 8, 10],     # <PAD>, 'sofa', 'park'
    [0, 2, 2]       # <PAD>, <EOS>, <EOS>
])
# Shape: (6, 3) = (max_length, batch_size)
```

**Why `batch_first=False`?**
- By default (`batch_first=False`), PyTorch's `nn.RNN`/`nn.LSTM` expect input as `(sequence_length, batch_size, embedding_dim)`
- If `batch_first=True` → shape would be `(batch_size, sequence_length)`
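
The two layouts are easy to compare directly. This is a minimal sketch, assuming three made-up caption tensors (the token ids are arbitrary) and `0` as the padding index:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical numericalized captions of lengths 4, 6, and 6
captions = [
    torch.tensor([1, 4, 6, 2]),
    torch.tensor([1, 4, 5, 7, 8, 2]),
    torch.tensor([1, 4, 6, 9, 10, 2]),
]

time_major = pad_sequence(captions, batch_first=False, padding_value=0)
batch_major = pad_sequence(captions, batch_first=True, padding_value=0)

print(time_major.shape)           # torch.Size([6, 3])  -> (max_length, batch_size)
print(batch_major.shape)          # torch.Size([3, 6])  -> (batch_size, max_length)
print(time_major[:, 0].tolist())  # [1, 4, 6, 2, 0, 0]  -> first caption, padded
```

The two results hold the same values; one is simply the transpose of the other.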

---

## 🎬 Step 10: Create DataLoader with Custom Collate

### 🍕 Simple Analogy: Automated Pizza Delivery Service

You set up an automated system that:
1. Takes orders (reads images + captions)
2. Boxes pizzas (pads captions)
3. Delivers in batches of 32 orders at a time

### 🔧 Technical Explanation

```python
def get_loader(
    root_folder,
    annotation_file,
    transform,
    batch_size=32,
    num_workers=2,
    shuffle=True,
    pin_memory=True,
):
    """
    Create a DataLoader for Flickr8k dataset
    
    Args:
        root_folder: Path to images folder
        annotation_file: Path to captions CSV
        transform: Image transformations
        batch_size: Number of samples per batch
        num_workers: Number of worker subprocesses for data loading
        shuffle: Randomize order each epoch
        pin_memory: Speed up CPU → GPU transfer
        
    Returns:
        loader: PyTorch DataLoader object
    """
    # 1. Create dataset instance
    dataset = FlickerDataset(
        root_dir=root_folder,
        captions_file=annotation_file,
        transform=transform,
    )
    
    # 2. Get PAD token index
    pad_idx = dataset.vocab.stoi["<PAD>"]
    
    # 3. Create DataLoader with custom collate function
    loader = DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        shuffle=shuffle,
        pin_memory=pin_memory,
        collate_fn=MyCollate(pad_idx=pad_idx),  # Custom padding function
    )
    
    return loader
```

### 📊 What Each Parameter Does

**`batch_size=32`**
- Load 32 image-caption pairs at once
- Larger batch = faster training but more memory

**`num_workers=2`**
- Use 2 worker processes to load data in parallel
- While GPU processes batch 1, CPU loads batch 2 (faster!)

**`shuffle=True`**
- Randomize order every epoch
- Prevents model from memorizing order

**`pin_memory=True`**
- Allocates data in pinned memory (faster CPU → GPU transfer)
- Only use if you have a GPU

**`collate_fn=MyCollate(pad_idx=pad_idx)`**
- Use our custom function to pad captions
- Default collate doesn't know how to handle variable-length sequences
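
To see `collate_fn` at work without the Flickr8k files, the pieces can be wired together on fake data. Everything in this sketch is a stand-in: `ToyCaptionDataset` and `PadCollate` are hypothetical names, the images are blank tensors, the captions are hard-coded index lists, and `num_workers=0` keeps loading in the main process so it runs anywhere:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class ToyCaptionDataset(Dataset):
    """Hypothetical dataset: blank images with captions of different lengths."""

    def __init__(self):
        self.captions = [[1, 4, 6, 2], [1, 4, 5, 7, 8, 2], [1, 4, 2], [1, 5, 6, 7, 2]]

    def __len__(self):
        return len(self.captions)

    def __getitem__(self, index):
        image = torch.zeros(3, 32, 32)
        return image, torch.tensor(self.captions[index])


class PadCollate:
    def __init__(self, pad_idx):
        self.pad_idx = pad_idx

    def __call__(self, batch):
        # Images share one shape, so they can be stacked directly
        images = torch.stack([item[0] for item in batch], dim=0)
        # Captions differ in length, so they are padded to the longest in the batch
        captions = pad_sequence(
            [item[1] for item in batch], batch_first=False, padding_value=self.pad_idx
        )
        return images, captions


loader = DataLoader(
    ToyCaptionDataset(),
    batch_size=2,
    shuffle=False,
    num_workers=0,
    collate_fn=PadCollate(pad_idx=0),
)

shapes = [(images.shape, captions.shape) for images, captions in loader]
print(shapes)
```

Each batch is padded only to its own longest caption, so the first batch has captions of shape `(6, 2)` and the second `(5, 2)`.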

---

## 🎯 Step 11: Use the DataLoader

### 🍕 Simple Analogy: Finally Opening Your Restaurant!

You've set up everything:
- ✅ Menu (CSV)
- ✅ Kitchen (image folder)
- ✅ Cooking instructions (transforms)
- ✅ Delivery system (DataLoader)

Now customers can order, and you deliver in batches!

### 🔧 How to Use It

```python
import torchvision.transforms as transforms

dataloader = get_loader(
    root_folder="flickr8k_images",     # Where images are stored
    annotation_file="captions.txt",     # Where captions are stored
    transform=transforms.Compose([
        transforms.Resize((224, 224)),  # Resize to 224×224
        transforms.ToTensor(),          # Convert to tensor [0, 1]
    ]),
    batch_size=32,
)
```

### 🎯 How to Loop Through Data

```python
for images, captions in dataloader:
    # images.shape = (32, 3, 224, 224)
    #   - 32 images
    #   - 3 color channels (RGB)
    #   - 224×224 pixels
    
    # captions.shape = (max_length, 32)
    #   - max_length = longest caption in this batch
    #   - 32 captions (one per image)
    
    # Train your model here!
    pass
```

### ⚠️ Note About Running

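**If you don't have the dataset**, running this cell will cause an error. The captions file is read first, so the message will look something like:
```
FileNotFoundError: [Errno 2] No such file or directory: 'captions.txt'
```

**This is expected!** The code structure is correct. To actually run it, you'd need to download the Flickr8k dataset and install the spaCy model that `spacy.load("en_core_web_sm")` expects (`python -m spacy download en_core_web_sm`).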

---

## 🎊 Summary: What We Built

### 🐱 Part 1: Cats & Dogs Classification
1. ✅ Built `CatsandDogsDataset` class
2. ✅ Loaded images and labels from CSV
3. ✅ Applied transforms (resize, convert to tensor)
4. ✅ Created train/test split
5. ✅ Created DataLoaders for batching

### 📷 Part 2: Image Captioning
1. ✅ Built `Vocabulary` class (text → numbers)
2. ✅ Built `FlickerDataset` class (images + captions)
3. ✅ Added special tokens (`<SOS>`, `<EOS>`, `<PAD>`, `<UNK>`)
4. ✅ Built custom collate function for padding
5. ✅ Created DataLoader with variable-length sequences

---

## 🎓 Key Takeaways

### 1️⃣ **Custom Datasets Need 3 Methods**
```python
class MyDataset(Dataset):
    def __init__(self): ...            # Setup
    def __len__(self): ...             # Return total count
    def __getitem__(self, idx): ...    # Return one item
```

### 2️⃣ **Text Must Be Converted to Numbers**
- Build vocabulary from training data
- Use special tokens for padding and sequence markers
- Handle unknown words with `<UNK>`

### 3️⃣ **Variable-Length Sequences Need Padding**
- Use `pad_sequence` to make all captions same length
- Custom collate function handles this automatically

### 4️⃣ **DataLoader Makes Life Easy**
- Batching: Load multiple samples at once
- Shuffling: Randomize order for better training
- Parallel loading: Worker processes prepare the next batches while the GPU trains

---

## 🚀 Next Steps

Now that you can load custom data, you can:
1. Build a CNN classifier for cats & dogs
2. Build an image captioning model (CNN encoder + RNN decoder)
3. Train on your own custom datasets!

**Happy coding! 🎉**