# 📷 Image Captioning with CNN + RNN

## 🎯 What is Image Captioning?

**Simple Definition:** Given an image, generate a natural language description.

**Example:**
```
Input:  🖼️ [Photo of a dog playing with a ball in a park]
Output: "A brown dog is playing with a red ball in the park"
```

---

## 🧠 The Big Idea: Two-Part System

### 🍕 Simple Analogy: Restaurant Review System

Imagine you're writing a restaurant review:

**Part 1: LOOK at the food** 👀
- Take a photo of the dish
- Your eyes analyze: colors, shapes, ingredients
- **This is the CNN Encoder** → extracts visual features

**Part 2: WRITE the review** ✍️
- Your brain converts visual information into words
- Generates sentence word by word: "This → delicious → pizza → has → fresh → toppings"
- **This is the RNN Decoder** → generates text sequence

---

## 🏗️ Architecture Overview

```
Image → [CNN Encoder] → Feature Vector → [RNN Decoder] → Caption
  🖼️         🔍              📊              ✍️            📝
```

**Components:**
1. **EncoderCNN**: Pre-trained Inception v3 model (extracts image features)
2. **DecoderRNN**: LSTM network (generates caption word by word)
3. **CNNtoRNN**: Combines both parts

Let's build each component!

<figure>
  <img src="asset/image_caption_overview.png" alt="Architecture of Image Caption" width="800">
</figure>

## 🏛️ Image Captioning Architecture

This diagram shows the complete pipeline:
1. **Input Image** → processed by CNN encoder
2. **Feature Vector** → extracted visual features
3. **LSTM Decoder** → generates words one by one
4. **Output Caption** → final sentence

```python
import torch
import torch.nn as nn
import torchvision.models as models
```

---

## 📦 Step 1: Import Core Libraries

### 🔧 What Each Library Does

```python
class EncoderCNN(nn.Module):
    def __init__(self, embed_size, train_CNN=False):
        super(EncoderCNN, self).__init__()

        self.train_CNN = train_CNN
        self.inception = models.inception_v3(pretrained=True, aux_logits=False)
        self.inception.fc = nn.Linear(self.inception.fc.in_features, embed_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.5)

    def forward(self, images):
        features = self.inception(images)

        for name, param in self.inception.named_parameters():
            if "fc.weight" in name or "fc.bias" in name:
                param.requires_grad = True
            else:
                param.requires_grad = self.train_CNN
        return self.dropout(self.relu(features))
```

```python
import torch                    # Main PyTorch library
import torch.nn as nn          # Neural network modules
import torchvision.models as models  # Pre-trained models (Inception)
```

---

## 🔍 Step 2: Build CNN Encoder (Vision Part)

### 🍕 Simple Analogy: Professional Food Photographer

You hire a professional photographer to analyze dishes:
- **Pre-trained eyes** (Inception model trained on ImageNet)
- Converts photo into a **feature summary** (embed_size numbers)
- Example: [0.5, 0.8, 0.2, ...] → represents "brown dog, green grass, blue sky"

### 🔧 Technical Explanation: EncoderCNN Class

```python
class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers):
        super(DecoderRNN, self).__init__()

        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)
        self.dropout = nn.Dropout(0.5)

    def forward(self, features, captions):
        embeddings = self.dropout(self.embed(captions))
        embeddings = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        hiddens, _ = self.lstm(embeddings)
        outputs = self.linear(hiddens)
        return outputs
```

---

## 🔗 Deep Dive: How Image Features Become "Words" for LSTM

### 🍕 Simple Analogy: Restaurant Critic's First Impression

Imagine a food critic:
1. **Photographer shows photo** → critic forms first impression ("Ah, this looks like Italian food")
2. **Critic writes review** → "This" → "delicious" → "pizza" → "has" → "fresh" → "toppings"

The photo analysis becomes the **starting point** for the review!

### 🔧 Technical Explanation: Image Features as First "Word"

In the decoder's forward pass:

```python
# 1. Image features from encoder
# features.shape == (batch_size, embed_size)
# Example: (8, 256) - 8 images, each is 256 numbers

# 2. Caption words converted to embeddings
embeddings = self.embed(captions)
# Example: (8, 10, 256) - 8 captions, 10 words each, 256-dim vectors

# 3. THE MAGIC: Add sequence dimension to features
features_with_seq = features.unsqueeze(1)
# Shape changes: (8, 256) → (8, 1, 256)
# Now it looks like "one word" per image!

# 4. Concatenate: Image feature becomes the first "word"
embeddings = torch.cat((features_with_seq, embeddings), dim=1)
# Result: (8, 11, 256) = 8 captions, 11 words (1 image + 10 real words), 256-dim
```

### 📊 Visual Breakdown

**Before concatenation:**
```python
Image features:  [0.5, 0.8, 0.2, ..., 0.7]  # 256 numbers (batch_size, 256)
Caption embeddings:
  Word 1 "A":    [0.1, 0.3, 0.5, ..., 0.2]  # 256 numbers
  Word 2 "dog":  [0.7, 0.2, 0.1, ..., 0.9]  # 256 numbers
  Word 3 "is":   [0.3, 0.6, 0.4, ..., 0.1]  # 256 numbers
  ...
```

**After concatenation (what LSTM sees):**
```python
Sequence fed to LSTM:
  Position 0 (image):  [0.5, 0.8, 0.2, ..., 0.7]  ← Image features
  Position 1 "A":      [0.1, 0.3, 0.5, ..., 0.2]  ← Word embedding
  Position 2 "dog":    [0.7, 0.2, 0.1, ..., 0.9]  ← Word embedding
  Position 3 "is":     [0.3, 0.6, 0.4, ..., 0.1]  ← Word embedding
  ...
```

**Why this works:**
- Image features and word embeddings are **both 256-dimensional vectors**
- LSTM doesn't care if position 0 is an "image" or a "word"
- It treats the image feature vector as the **context** for generating the caption
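
The shape bookkeeping above can be checked without PyTorch. The following is a minimal sketch that uses nested Python lists in place of tensors; `prepend_image_feature` and the toy sizes are hypothetical names and values chosen for illustration, and the function only mirrors what `torch.cat((features.unsqueeze(1), embeddings), dim=1)` does to the shapes.

```python
def prepend_image_feature(features, embeddings):
    """Put each image's feature vector in front of its caption embeddings.

    features:   list of B vectors, each of length E          -> (B, E)
    embeddings: list of B sequences of T vectors, length E   -> (B, T, E)
    returns:    list of B sequences of T + 1 vectors         -> (B, T + 1, E)
    """
    return [[feat] + list(seq) for feat, seq in zip(features, embeddings)]


# Toy sizes (assumed): 2 images, 3 caption tokens each, 4-dimensional vectors.
B, T, E = 2, 3, 4
features = [[0.5] * E for _ in range(B)]
embeddings = [[[0.1 * t] * E for t in range(T)] for _ in range(B)]

sequence = prepend_image_feature(features, embeddings)
shape = (len(sequence), len(sequence[0]), len(sequence[0][0]))
print(shape)  # (2, 4, 4): one extra position per caption
```

The concatenation only works because both inputs share the last dimension `E`, which is why the encoder's output size and the word-embedding size are both set to `embed_size`.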

### 🎯 Complete Example

**Input:**
```python
# Single image of a dog
features = encoder(dog_image)  # Output: (1, 256)

# Caption: "A brown dog" (the final <EOS> is a target only, it is not fed to the decoder)
caption_indices = torch.tensor([[1, 4, 45, 6]])  # (1, 4): <SOS>=1, "a"=4, "brown"=45, "dog"=6
```

**Processing:**
```python
# 1. Embed caption words
embeddings = self.embed(caption_indices)
# Shape: (1, 4, 256) - 1 caption, 4 words, 256-dim each

# 2. Add image as first position
features_unsqueezed = features.unsqueeze(1)  # (1, 256) → (1, 1, 256)
embeddings = torch.cat((features_unsqueezed, embeddings), dim=1)
# Shape: (1, 5, 256) - 1 caption, 5 positions (1 image + 4 words)

# 3. LSTM sees this sequence:
#    [Image_vector, <SOS>, "a", "brown", "dog"]
#    
# 4. LSTM generates predictions:
#    Position 0 (after seeing image) → predict <SOS>
#    Position 1 (after seeing image + <SOS>) → predict "a"
#    Position 2 (after seeing image + <SOS> + "a") → predict "brown"
#    Position 3 (after seeing image + <SOS> + "a" + "brown") → predict "dog"
#    Position 4 (after seeing full context) → predict <EOS>
```
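
The alignment between what the decoder sees and what it must predict can be written out explicitly. This is a small sketch under the assumption that a caption is a list of string tokens starting with `<SOS>` and ending with `<EOS>`; the `<IMAGE>` label and the helper name `teacher_forcing_pairs` are placeholders invented for this illustration.

```python
SOS, EOS = "<SOS>", "<EOS>"


def teacher_forcing_pairs(caption_tokens):
    """Return (decoder_input, target) pairs for one caption.

    The decoder input is the image slot followed by every token except
    <EOS>; the target at each position is the full caption.
    """
    inputs = ["<IMAGE>"] + caption_tokens[:-1]   # "<IMAGE>" stands for the feature vector
    targets = caption_tokens
    return list(zip(inputs, targets))


pairs = teacher_forcing_pairs([SOS, "a", "brown", "dog", EOS])
for seen, target in pairs:
    print(f"{seen:>8} -> {target}")
```

Each printed row corresponds to one position in the list above: the left column is the newest input at that position, the right column is the token the model is trained to predict there.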

### 💡 Key Insight

The image feature vector acts as **visual context** for the whole caption: it is fed in once, at position 0, and reaches every later word through the LSTM's hidden state:
- **Without image context**: the LSTM can only produce generic captions that ignore the picture
- **With image context**: the LSTM can generate captions that describe what is actually in the image

Think of it like this:
```
Image features = "There's a brown furry animal in grass"
LSTM uses this context to generate: "A brown dog is playing in the park"
```

---

```python
class EncoderCNN(nn.Module):
    def __init__(self, embed_size, train_CNN=False):
        """
        CNN Encoder using pre-trained Inception v3

        Args:
            embed_size: Size of feature vector (e.g., 256)
            train_CNN: Whether to fine-tune CNN weights (usually False)
        """
        super(EncoderCNN, self).__init__()

        self.train_CNN = train_CNN

        # Load pre-trained Inception v3
        self.inception = models.inception_v3(pretrained=True, aux_logits=False)

        # Replace final layer to output embed_size features
        self.inception.fc = nn.Linear(self.inception.fc.in_features, embed_size)

        # Activation and regularization
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.5)

    def forward(self, images):
        """
        Convert images to feature vectors

        Args:
            images: Batch of images (batch_size, 3, 299, 299)

        Returns:
            features: Feature vectors (batch_size, embed_size)
        """
        features = self.inception(images)

        # Freeze/unfreeze CNN weights
        for name, param in self.inception.named_parameters():
            if "fc.weight" in name or "fc.bias" in name:
                param.requires_grad = True  # Always train final layer
            else:
                param.requires_grad = self.train_CNN  # Freeze other layers

        return self.dropout(self.relu(features))
```


### 📊 What Happens Here

**1. Load Pre-trained Inception**
```python
self.inception = models.inception_v3(pretrained=True, aux_logits=False)
```
- **Pre-trained**: Already knows how to recognize objects (trained on ImageNet)
- **aux_logits=False**: Disables auxiliary classifier (we don't need it)

**2. Replace Final Layer**
```python
self.inception.fc = nn.Linear(in_features, embed_size)
```
- Original Inception outputs 1000 classes (ImageNet)
- We replace it to output `embed_size` features (e.g., 256)
- These features represent the image content

**3. Freeze Layers**
```python
param.requires_grad = self.train_CNN
```
- **If `train_CNN=False`**: Freeze every layer except the new `fc` (their weights are not updated)
  - Why? Pre-trained features are already good!
  - Faster training, less memory
- **If `train_CNN=True`**: Fine-tune the entire network
  - Slower but potentially better accuracy
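
The freezing rule can be looked at in isolation. The sketch below applies the same name test as the encoder's loop to a list of strings instead of real parameters; `trainable_flags` is a hypothetical helper, and the parameter names are illustrative examples in the style of `named_parameters()`, not an exhaustive list.

```python
def trainable_flags(param_names, train_cnn=False):
    """Map each parameter name to the requires_grad value the encoder would assign."""
    flags = {}
    for name in param_names:
        is_new_head = "fc.weight" in name or "fc.bias" in name
        flags[name] = True if is_new_head else train_cnn
    return flags


# Illustrative parameter names (assumed for this example).
names = ["Conv2d_1a_3x3.conv.weight", "Mixed_7c.branch1x1.bn.bias", "fc.weight", "fc.bias"]

frozen = trainable_flags(names, train_cnn=False)
print([n for n, flag in frozen.items() if flag])  # ['fc.weight', 'fc.bias']
```

With `train_cnn=True` every entry becomes `True`, which corresponds to fine-tuning the whole network.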

**4. Output**
```python
return self.dropout(self.relu(features))
```
- Apply ReLU activation
- Apply dropout (during training, each activation is zeroed with probability 0.5, which helps reduce overfitting)

### 🎯 Example Flow

```python
# Input: batch of 8 RGB images, 299×299 pixels
images = torch.randn(8, 3, 299, 299)  # random stand-in for real image data

# Pass through encoder
features = encoder(images)

# Output: feature vectors
print(features.shape)  # torch.Size([8, 256]) when embed_size=256
```

---

## ✍️ Step 3: Build RNN Decoder (Language Part)

### 🍕 Simple Analogy: Food Critic Writing Reviews

The food critic has two jobs:
1. **Understand the photo analysis** (feature vector from CNN)
2. **Write review word by word**: "This" → "delicious" → "pizza" → "has" → "fresh" → "toppings"

### 🔧 Technical Explanation: DecoderRNN Class

```python
class CNNtoRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers, train_CNN=False):
        super(CNNtoRNN, self).__init__()
        self.encoder = EncoderCNN(embed_size, train_CNN)
        self.decoder = DecoderRNN(embed_size, hidden_size, vocab_size, num_layers)

    def forward(self, images, captions):
        features = self.encoder(images)
        outputs = self.decoder(features, captions)
        return outputs

    def caption(self, image, vocabulary, max_length=50):
        result_caption = []

        with torch.no_grad():
            x = self.encoder(image).unsqueeze(0)
            states = None

            for _ in range(max_length):
                hiddens, states = self.decoder.lstm(x, states)
                output = self.decoder.linear(hiddens.squeeze(1))
                predicted = output.argmax(1)

                result_caption.append(predicted.item())

                x = self.decoder.embed(predicted).unsqueeze(1)

                if vocabulary.itos[predicted.item()] == "<EOS>":
                    break

        return [vocabulary.itos[idx] for idx in result_caption]
```

```python
class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers):
        """
        RNN Decoder using LSTM
        
        Args:
            embed_size: Size of word embeddings (matches CNN output)
            hidden_size: Size of LSTM hidden state
            vocab_size: Number of words in vocabulary
            num_layers: Number of LSTM layers
        """
        super(DecoderRNN, self).__init__()
        
        # Word embedding layer (converts word indices to vectors)
        self.embed = nn.Embedding(vocab_size, embed_size)
        
        # LSTM layer (generates sequences)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        
        # Output layer (converts LSTM output to vocabulary probabilities)
        self.linear = nn.Linear(hidden_size, vocab_size)
        
        # Dropout for regularization
        self.dropout = nn.Dropout(0.5)
        
    def forward(self, features, captions):
        """
        Generate caption predictions during training
        
        Args:
            features: Image features from encoder (batch_size, embed_size)
            captions: Ground truth captions (batch_size, caption_length)
            
        Returns:
            outputs: Unnormalized word scores (batch_size, caption_length + 1, vocab_size)
        """
        # 1. Convert caption words to embeddings
        embeddings = self.dropout(self.embed(captions))
        # Shape: (batch_size, caption_length, embed_size)
        
        # 2. Concatenate image features with caption embeddings
        embeddings = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        # Shape: (batch_size, caption_length+1, embed_size)
        # Image features act as the first "word"
        
        # 3. Pass through LSTM
        hiddens, _ = self.lstm(embeddings)
        # Shape: (batch_size, caption_length+1, hidden_size)
        
        # 4. Convert to vocabulary predictions
        outputs = self.linear(hiddens)
        # Shape: (batch_size, caption_length+1, vocab_size)
        
        return outputs
```

### 📊 Breaking Down Each Component

**1. Word Embedding Layer**
```python
self.embed = nn.Embedding(vocab_size, embed_size)
```
- **Purpose**: Convert word indices to dense vectors
- **Example**:
  ```python
  vocab_size = 10000  # 10,000 words in vocabulary
  embed_size = 256
  
  # Word "dog" has index 452
  word_vector = embed(torch.tensor(452))  # Shape: (256,)
  # Output: [0.2, 0.8, -0.5, ..., 0.3]  # 256 numbers representing "dog"
  ```

**2. LSTM Layer**
```python
self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
```
- **Purpose**: Generate sequences by remembering context
- **batch_first=True**: Input shape is `(batch_size, sequence_length, embed_size)`

**3. Linear Layer**
```python
self.linear = nn.Linear(hidden_size, vocab_size)
```
- **Purpose**: Convert LSTM output to word predictions
- **Output**: One raw score (logit) per vocabulary word; `nn.CrossEntropyLoss` applies the softmax internally
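
The linear layer produces raw scores, and a softmax turns them into probabilities. The following is a plain-Python sketch of that conversion, with made-up scores over a four-word vocabulary; in the actual model this step happens inside the loss function during training, so the code is only meant to show the arithmetic.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy scores over a 4-word vocabulary (values are made up).
logits = [2.0, 0.5, -1.0, 0.0]
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(best)  # 0: the largest score is also the most probable word
```

Because softmax preserves ordering, taking `argmax` over the raw scores (as the `caption` method does) selects the same word as taking it over the probabilities.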

### 🎯 Example Flow (Training Time)

**Input:**
```python
# features.shape == (8, 256)   8 images, 256 features each
# captions.shape == (8, 10)    8 captions, 10 words each (indices)
```

**Step-by-step:**
```python
# 1. Embed captions
embeddings = self.embed(captions)  # (8, 10, 256)

# 2. Add image features as first "word"
embeddings = torch.cat((features.unsqueeze(1), embeddings), dim=1)
# Shape: (8, 11, 256)  # 11 = 1 image + 10 words

# 3. Pass through LSTM
hiddens, _ = self.lstm(embeddings)  # (8, 11, hidden_size)

# 4. Convert to predictions
outputs = self.linear(hiddens)  # (8, 11, 10000)
# For each of 8 captions, for each of 11 positions, predict probability of each of 10000 words
```

---

## 🔗 Step 4: Combine CNN + RNN (Complete Model)

### 🍕 Simple Analogy: Restaurant Review System

Now we connect both parts:
1. **Photographer** analyzes the dish → gives summary to critic
2. **Critic** uses summary → writes review word by word

### 🔧 Technical Explanation: CNNtoRNN Class

```python
import torchvision.transforms as transforms
from torch.utils.tensorboard import SummaryWriter

import spacy
spacy_eng = spacy.load("en_core_web_sm")


class Vocabulary:
    def __init__(self, freq_threshold):
        self.freq_threshold = freq_threshold
        self.itos = {0: "<PAD>", 1: "<SOS>", 2: "<EOS>", 3: "<UNK>"}  # itos: index to string
        self.stoi = {v: k for k, v in self.itos.items()}  # stoi: string to index

    def __len__(self):
        return len(self.itos)

    @staticmethod
    def tokenizer(text):
        # "I love dogs" -> ['i', 'love', 'dogs']
        return [tok.text.lower() for tok in spacy_eng.tokenizer(text)]

    def build_vocabulary(self, sentence_list):
        frequencies = {}
        idx = 4

        for sentence in sentence_list:
            for word in self.tokenizer(sentence):
                if word not in frequencies:
                    frequencies[word] = 1
                else:
                    frequencies[word] += 1

                # Add the word the moment it reaches the threshold
                if frequencies[word] == self.freq_threshold:
                    self.stoi[word] = idx
                    self.itos[idx] = word
                    idx += 1

    def numericalize(self, text):
        tokenized_text = self.tokenizer(text)
        return [
            self.stoi.get(token, self.stoi["<UNK>"])
            for token in tokenized_text
        ]
```

```python
class CNNtoRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers, train_CNN=False):
        """
        Complete Image Captioning Model
        
        Args:
            embed_size: Feature vector size (CNN output = RNN input)
            hidden_size: LSTM hidden state size
            vocab_size: Number of words in vocabulary
            num_layers: Number of LSTM layers
            train_CNN: Whether to fine-tune CNN
        """
        super(CNNtoRNN, self).__init__()
        self.encoder = EncoderCNN(embed_size, train_CNN)
        self.decoder = DecoderRNN(embed_size, hidden_size, vocab_size, num_layers)
        
    def forward(self, images, captions):
        """
        Training forward pass
        
        Args:
            images: Batch of images
            captions: Ground truth captions
            
        Returns:
            outputs: Predicted word probabilities
        """
        features = self.encoder(images)      # Image → features
        outputs = self.decoder(features, captions)  # Features → caption
        return outputs
    
    def caption(self, image, vocabulary, max_length=50):
        """
        Generate caption for a single image (inference time)
        
        Args:
            image: Single image tensor (1, 3, 299, 299)
            vocabulary: Vocabulary object (for word lookups)
            max_length: Maximum caption length
            
        Returns:
            result_caption: List of words
        """
        result_caption = []
        
        with torch.no_grad():  # No gradient computation (inference only)
            # 1. Get image features
            x = self.encoder(image).unsqueeze(0)  # (1, 1, embed_size)
            states = None  # Initial LSTM state
            
            # 2. Generate words one by one
            for _ in range(max_length):
                # Pass through LSTM
                hiddens, states = self.decoder.lstm(x, states)
                
                # Predict next word
                output = self.decoder.linear(hiddens.squeeze(1))
                predicted = output.argmax(1)  # Get word with highest probability
                
                # Add to caption
                result_caption.append(predicted.item())
                
                # Use predicted word as input for next step
                x = self.decoder.embed(predicted).unsqueeze(1)
                
                # Stop if we predict <EOS> (end of sentence)
                if vocabulary.itos[predicted.item()] == "<EOS>":
                    break
                    
        # Convert indices to words
        return [vocabulary.itos[idx] for idx in result_caption]
```

### 📊 Two Modes: Training vs Inference

**TRAINING MODE** (`forward` method):
```python
# We have ground truth captions
images = [img1, img2, ..., img8]  # Batch of 8 images
captions = ["A dog ...", "A cat ...", ...]  # Known captions

outputs = model(images, captions)
# Model predicts all words at once (teacher forcing)
# Compare predictions with ground truth → calculate loss → update weights
```

**INFERENCE MODE** (`caption` method):
```python
# We DON'T know the caption
image = single_image  # One image

caption = model.caption(image, vocabulary)
# Model generates token by token:
# Step 1: [Image] → "<SOS>"  (training targets start with <SOS>)
# Step 2: [Image, "<SOS>"] → "a"
# Step 3: [Image, "<SOS>", "a"] → "dog"
# ...until <EOS> or max_length reached
```
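
The word-by-word loop is a greedy decoder: at each step it keeps the single most likely token and feeds it back in. Below is a framework-free sketch of that control flow. The `next_token` callback is a stand-in for one LSTM step plus `argmax`, and the lookup table acting as the "model" is entirely made up for illustration.

```python
def greedy_decode(next_token, max_length=50, eos="<EOS>"):
    """Repeatedly pick the next token until <EOS> or max_length is reached.

    next_token(history) receives the tokens generated so far and returns
    the next one; it stands in for an LSTM step followed by argmax.
    """
    history = []
    for _ in range(max_length):
        token = next_token(history)
        history.append(token)
        if token == eos:
            break
    return history


# Hypothetical "model": a lookup table keyed by the previous token.
table = {None: "<SOS>", "<SOS>": "a", "a": "dog", "dog": "<EOS>"}


def toy_step(history):
    return table[history[-1] if history else None]


full = greedy_decode(toy_step)
short = greedy_decode(toy_step, max_length=2)
print(full)   # ['<SOS>', 'a', 'dog', '<EOS>']
print(short)  # ['<SOS>', 'a']
```

Greedy decoding is the simplest choice; it never revisits an earlier token, so one poor early pick affects the rest of the caption.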

### 🎯 Inference Example

```python
# Given image of dog in park
image = load_image("dog_park.jpg")  # Shape: (1, 3, 299, 299)

# Generate caption
caption = model.caption(image, vocab, max_length=20)

# Output: ['<SOS>', 'a', 'brown', 'dog', 'is', 'playing', 'in', 'the', 'park', '<EOS>']
# As sentence (special tokens dropped): "A brown dog is playing in the park"
```

---

## 📖 Step 5: Vocabulary Class (Same as Custom Dataset)

### 🍕 Simple Analogy: Menu with Item Numbers

(See explanation in Custom_Dataset_Build.ipynb)

**Quick Summary:**
- Converts words ↔ numbers
- Special tokens: `<PAD>`, `<SOS>`, `<EOS>`, `<UNK>`
- Only includes words appearing at least `freq_threshold` times
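
The three points above fit in a few lines. This is a simplified sketch, not the notebook's `Vocabulary` class: it assumes whitespace tokenization instead of spaCy, and `build_vocab` / `numericalize` are hypothetical free functions. Index order may also differ from the class, which assigns an index at the moment a word reaches the threshold.

```python
from collections import Counter

SPECIALS = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]


def build_vocab(sentences, freq_threshold):
    """Return (stoi, itos), keeping words seen at least freq_threshold times."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    itos = dict(enumerate(SPECIALS))
    for word, count in counts.items():      # iteration follows first appearance
        if count >= freq_threshold:
            itos[len(itos)] = word
    stoi = {word: idx for idx, word in itos.items()}
    return stoi, itos


def numericalize(text, stoi):
    """Map each token to its index, falling back to <UNK>."""
    return [stoi.get(tok, stoi["<UNK>"]) for tok in text.lower().split()]


stoi, itos = build_vocab(["a dog runs", "a dog sleeps", "a cat"], freq_threshold=2)
encoded = numericalize("a dog jumps", stoi)
print(encoded)  # [4, 5, 3]: "jumps" is unknown, so it maps to <UNK>
```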

```python
import spacy
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset
from PIL import Image
import pandas as pd
import os
```

---

## 📦 Step 6: Import Dataset Libraries

```python
class FlickerDataset(Dataset):
    def __init__(self, root_dir, captions_file, transform=None, freq_threshold=5):
        self.root_dir = root_dir
        self.df = pd.read_csv(captions_file)
        self.transform = transform

        self.imgs = self.df['image']
        self.captions = self.df['caption']

        self.vocab = Vocabulary(freq_threshold)
        self.vocab.build_vocabulary(self.captions.tolist())

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        caption = self.captions[index]
        img_id = self.imgs[index]
        img_path = os.path.join(self.root_dir, img_id)
        image = Image.open(img_path).convert("RGB")

        if self.transform is not None:
            image = self.transform(image)

        numericalized_caption = [self.vocab.stoi["<SOS>"]]
        numericalized_caption += self.vocab.numericalize(caption)
        numericalized_caption.append(self.vocab.stoi["<EOS>"])

        return image, torch.tensor(numericalized_caption)
```

```python
import spacy                        # Text processing
import torch
from torch.nn.utils.rnn import pad_sequence  # Padding variable-length sequences
from torch.utils.data import DataLoader, Dataset
from PIL import Image              # Image loading
import pandas as pd                # CSV reading
import os                          # File paths
```

---

## 🗂️ Step 7: Flickr Dataset Class

(See detailed explanation in Custom_Dataset_Build.ipynb)

**Quick Summary:**
- Loads image + caption pairs
- Builds vocabulary from all captions
- Returns (image_tensor, numericalized_caption)

```python
class MyCollate:
    def __init__(self, pad_idx):
        self.pad_idx = pad_idx

    def __call__(self, batch):
        images = [item[0].unsqueeze(0) for item in batch]
        images = torch.cat(images, dim=0)
        captions = [item[1] for item in batch]
        # batch_first=True gives (batch_size, max_len), matching the decoder's batch_first LSTM
        captions = pad_sequence(captions, batch_first=True, padding_value=self.pad_idx)

        return images, captions
```

---

## 🔄 Step 8: Custom Collate Function

(See detailed explanation in Custom_Dataset_Build.ipynb)

**Quick Summary:**
- Pads captions to same length
- Uses `<PAD>` token (index 0)
- Returns batched images + padded captions
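
Padding itself is simple to picture. The sketch below pads lists of token indices in plain Python and lays them out batch-first, i.e. `(batch_size, max_len)`, which is the layout the decoder's shape comments assume. `pad_batch` is a hypothetical helper and the index values are made up.

```python
def pad_batch(captions, pad_idx=0):
    """Right-pad every caption to the longest length in the batch.

    Returns a list of equal-length rows, i.e. shape (batch_size, max_len).
    """
    max_len = max(len(c) for c in captions)
    return [c + [pad_idx] * (max_len - len(c)) for c in captions]


# Three captions of different lengths (1 = <SOS>, 2 = <EOS>, 0 = <PAD>).
batch = [[1, 4, 5, 2], [1, 7, 2], [1, 4, 8, 9, 6, 2]]
padded = pad_batch(batch)
for row in padded:
    print(row)
# [1, 4, 5, 2, 0, 0]
# [1, 7, 2, 0, 0, 0]
# [1, 4, 8, 9, 6, 2]
```

The padded positions carry no information, which is why the loss function later needs to skip index 0.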

```python
def get_loader(
    root_folder,
    annotation_file,
    transform,
    batch_size=32,
    num_workers=2,
    shuffle=True,
    pin_memory=True,
):
    dataset = FlickerDataset(
        root_dir=root_folder,
        captions_file=annotation_file,
        transform=transform,
    )

    pad_idx = dataset.vocab.stoi["<PAD>"]

    loader = DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        shuffle=shuffle,
        pin_memory=pin_memory,
        collate_fn=MyCollate(pad_idx=pad_idx),
    )

    return loader, dataset
```

---

## 🎬 Step 9: DataLoader Function

**Modified to return both loader AND dataset:**

```python
def get_loader(...):
    dataset = FlickerDataset(...)
    loader = DataLoader(...)
    return loader, dataset  # ✅ Returns BOTH (needed for vocab_size)
```

---

## 🚀 Step 10: Complete Training Loop

### 🍕 Simple Analogy: Learning to Describe Food

You're training someone to describe dishes:
1. **Show photo** (image)
2. **Show correct description** (caption)
3. **They try to describe** (model prediction)
4. **You correct them** (calculate loss)
5. **They improve** (update weights)
6. **Repeat** for thousands of photos

### 🔧 Technical Explanation: Training Function

```python
import torchvision.transforms as transforms
import os


def save_checkpoint(state, filename="my_checkpoint.pth.tar"):
    print("=> Saving checkpoint")
    torch.save(state, filename)


def load_checkpoint(checkpoint, model, optimizer):
    print("=> Loading checkpoint")
    model.load_state_dict(checkpoint["state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    step = checkpoint["step"]
    return step


def train():
    transform = transforms.Compose(
        [
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
        ]
    )
    train_loader, dataset = get_loader(
        root_folder="data/flickr8k/images",
        annotation_file="data/flickr8k/captions.txt",
        transform=transform,
        batch_size=32,
    )
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    embed_size = 256
    hidden_size = 256
    vocab_size = len(dataset.vocab)
    num_layers = 1
    learning_rate = 3e-4
    num_epochs = 10
    load_model = False
    save_model = True

    writer = SummaryWriter("runs/flickr8k_experiment_1")
    step = 0

    model = CNNtoRNN(embed_size, hidden_size, vocab_size, num_layers).to(device)
    criterion = nn.CrossEntropyLoss(ignore_index=dataset.vocab.stoi["<PAD>"])
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    if load_model:
        step = load_checkpoint(torch.load("my_checkpoint.pth.tar"), model, optimizer)

    model.train()

    for epoch in range(num_epochs):
        if save_model:
            checkpoint = {
                "state_dict": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step,
            }
            save_checkpoint(checkpoint, "my_checkpoint.pth.tar")

        for idx, (imgs, captions) in enumerate(train_loader):
            imgs = imgs.to(device)
            captions = captions.to(device)

            # Captions are (batch_size, seq_len): feed every token except the last.
            # The prepended image feature restores the length, so outputs align with captions.
            outputs = model(imgs, captions[:, :-1])

            loss = criterion(
                outputs.reshape(-1, outputs.shape[2]),
                captions.reshape(-1),
            )

            writer.add_scalar("Training loss", loss.item(), global_step=step)
            step += 1

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if idx % 100 == 0:
                print(f"Epoch [{epoch}/{num_epochs}] Batch {idx}/{len(train_loader)} Loss: {loss.item():.4f}")
```

---

## 🚀 Step 10: Training Function - Complete Breakdown

The training function is large, so let's break it into **digestible chunks** with explanations!

---

### 📦 Part 1: Setup Transforms

```python
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # Resize to 224×224
    transforms.ToTensor(),          # Convert to tensor [0, 1]
    transforms.Normalize(           # Normalize for Inception
        (0.485, 0.456, 0.406),     # ImageNet mean
        (0.229, 0.224, 0.225)      # ImageNet std
    ),
])
```

**🍕 Analogy:** Preparing ingredients before cooking
- **Resize**: Cut all vegetables to same size
- **ToTensor**: Put ingredients in containers (PIL → Tensor)
- **Normalize**: Season consistently (match what Inception was trained on)

**🔧 Technical:**
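- torchvision's pre-trained Inception v3 expects inputs normalized with the ImageNet mean and std
- Using the same normalization as pre-training keeps the extracted features meaningful
- Mean/std values are ImageNet dataset statistics
- torchvision documents Inception v3 as expecting 299×299 inputs, which is also the shape used in the encoder examples, so `Resize((299, 299))` would be the consistent choice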

---

### 📂 Part 2: Load Dataset

```python
train_loader, dataset = get_loader(
    root_folder="data/flickr8k/images",
    annotation_file="data/flickr8k/captions.txt",
    transform=transform,
    batch_size=32,
)
```

**🍕 Analogy:** Opening restaurant and bringing in supplies
- **root_folder**: Kitchen pantry (where images are stored)
- **annotation_file**: Menu with descriptions
- **batch_size=32**: Serve 32 customers at once (not one by one)

**🔧 Technical:**
- Creates DataLoader that yields batches of (images, captions)
- Returns dataset too (we need its vocabulary)
- Handles padding automatically via MyCollate

---

### 🖥️ Part 3: Setup Device

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```

**🍕 Analogy:** Choose your vehicle
- GPU = Sports car 🏎️ (fast, for heavy work)
- CPU = Bicycle 🚲 (slow, but always available)

**🔧 Technical:**
- Checks if NVIDIA GPU is available
- All tensors and models will move to this device
- Training on a GPU is typically much faster than on a CPU for a model of this size

---

### ⚙️ Part 4: Hyperparameters

```python
embed_size = 256        # Size of feature vectors
hidden_size = 256       # Size of LSTM hidden state
vocab_size = len(dataset.vocab)  # Number of words in vocabulary
num_layers = 1          # Number of LSTM layers
learning_rate = 3e-4    # Learning rate for optimizer
num_epochs = 10         # Number of complete passes through dataset
load_model = False      # Whether to load saved checkpoint
save_model = True       # Whether to save checkpoints
```

**🍕 Analogy:** Recipe settings
- **embed_size**: Size of flavor profile (how many taste dimensions)
- **hidden_size**: Chef's memory capacity (how much context to remember)
- **vocab_size**: Number of dishes on menu
- **learning_rate**: How fast chef learns (big steps vs small adjustments)
- **num_epochs**: How many times to practice entire recipe book

**🔧 Technical:**
- **embed_size=256**: Matches CNN output and word embeddings
- **hidden_size=256**: Larger = more memory but slower
- **learning_rate=3e-4** (0.0003): Adam optimizer works well with this
- **num_epochs=10**: A starting point; how many epochs are needed depends on the dataset and is best judged from the loss curve

---

### 📊 Part 5: Setup TensorBoard

```python
writer = SummaryWriter("runs/flickr8k_experiment_1")
step = 0
```

**🍕 Analogy:** Restaurant quality tracker
- Records how good each dish tastes over time
- Can review later to see improvement

**🔧 Technical:**
- TensorBoard logs training metrics
- View with: `tensorboard --logdir=runs`
- `step` counts total batches processed

---

### 🏗️ Part 6: Create Model

```python
model = CNNtoRNN(embed_size, hidden_size, vocab_size, num_layers).to(device)
```

**🍕 Analogy:** Hire photographer + critic
- Photographer (CNN) analyzes images
- Critic (RNN) writes descriptions

**🔧 Technical:**
- Creates complete model (Encoder + Decoder)
- `.to(device)` moves model to GPU/CPU
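- Inception v3 has roughly 27M parameters in torchvision's standard form (fewer here, since the auxiliary head is dropped and `fc` is smaller); with `embed_size = hidden_size = 256` the LSTM has about 0.5M, and the embedding and output layers grow with `vocab_size`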
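
The decoder's share of the parameters can be worked out by hand. The sketch below counts them from the layer sizes alone, assuming the usual LSTM layout (four gates, each with input weights, recurrent weights and two bias vectors). The `vocab_size=3000` value is a placeholder; the real number is `len(dataset.vocab)`.

```python
def lstm_params(input_size, hidden_size, num_layers=1):
    """Parameter count of a unidirectional LSTM stack with biases."""
    total = 0
    for layer in range(num_layers):
        in_size = input_size if layer == 0 else hidden_size
        # 4 gates, each with input weights, recurrent weights and two bias vectors
        total += 4 * (hidden_size * in_size + hidden_size * hidden_size + 2 * hidden_size)
    return total


def decoder_params(embed_size, hidden_size, vocab_size, num_layers=1):
    """Per-layer parameter counts for an embedding + LSTM + linear decoder."""
    return {
        "embedding": vocab_size * embed_size,
        "lstm": lstm_params(embed_size, hidden_size, num_layers),
        "linear": hidden_size * vocab_size + vocab_size,   # weights + bias
    }


# vocab_size=3000 is a placeholder value for illustration.
counts = decoder_params(embed_size=256, hidden_size=256, vocab_size=3000)
print(counts["lstm"])        # 526336
print(sum(counts.values()))  # 2065336
```

With these sizes the vocabulary-dependent layers (embedding and output) outweigh the LSTM itself, so the decoder's size is driven mostly by `vocab_size`.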

---

### 💥 Part 7: Loss Function

```python
criterion = nn.CrossEntropyLoss(ignore_index=dataset.vocab.stoi["<PAD>"])
```

**🍕 Analogy:** Quality inspector
- Compares chef's dish with expected dish
- Ignores placeholder dishes (padding)

**🔧 Technical:**
- **CrossEntropyLoss**: Penalizes the model when it assigns low probability to the correct word (it applies log-softmax to the raw scores internally)
- **ignore_index**: Don't penalize model for `<PAD>` tokens
- Lower loss = better predictions
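
What `ignore_index` does can be shown with a hand-written version of the loss. This is a plain-Python sketch of mean cross-entropy that skips padded targets; the function name and toy numbers are made up, and it assumes at least one non-padded position.

```python
import math


def cross_entropy(logits, targets, ignore_index=0):
    """Mean negative log-likelihood over positions whose target != ignore_index."""
    losses = []
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue                                  # padded position: no contribution
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        losses.append(log_norm - scores[target])      # -log softmax(scores)[target]
    return sum(losses) / len(losses)


# Three positions over a 3-word vocabulary; index 0 plays the role of <PAD>.
logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [5.0, 0.0, 0.0]]
targets = [1, 2, 0]                                   # last position is padding

loss = cross_entropy(logits, targets)
print(round(loss, 4))
```

Changing the scores at the padded position leaves the result unchanged, and the average is taken over the two real tokens only, not over all three positions.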

---

### 🔄 Part 8: Optimizer

```python
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
```

**🍕 Analogy:** Cooking coach
- Watches chef's mistakes
- Suggests improvements

**🔧 Technical:**
- **Adam**: Adaptive learning rate optimizer (often converges faster than plain SGD with little tuning)
- Updates model weights to minimize loss
- `lr=3e-4`: Step size for weight updates

---

### 💾 Part 9: Load Checkpoint (Optional)

```python
if load_model:
    step = load_checkpoint(torch.load("my_checkpoint.pth.tar"), model, optimizer)
```

**🍕 Analogy:** Resume from saved progress
- Like loading a video game save file
- Continue training from where you left off

**🔧 Technical:**
- Only runs if `load_model=True`
- Restores model weights, optimizer state, and step counter
- Useful if training was interrupted

---

### 🎯 Part 10: Training Mode

```python
model.train()
```

**🍕 Analogy:** Open restaurant for practice
- Turn on all training features (dropout, batch norm)
- vs `model.eval()` which turns them off for testing

**🔧 Technical:**
- Enables dropout (randomly zeros neurons)
- Enables batch normalization training mode
- Required before training loop
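
The difference between the two modes is easiest to see on a single dropout layer. This is a small illustration on a stand-alone `nn.Dropout`, not on the captioning model itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()         # model.train() sets this mode on every submodule
train_out = dropout(x)  # some values zeroed, survivors scaled by 1 / (1 - p) = 2.0

dropout.eval()          # model.eval() sets this mode on every submodule
eval_out = dropout(x)   # identity: the input passes through unchanged

print("zeros in train mode:", int((train_out == 0).sum()))
print("zeros in eval mode: ", int((eval_out == 0).sum()))
```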

---

### 🔁 Part 11: Epoch Loop

```python
for epoch in range(num_epochs):
    # Save checkpoint at start of each epoch
    if save_model:
        checkpoint = {
            "state_dict": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": step,
        }
        save_checkpoint(checkpoint, "my_checkpoint.pth.tar")
```

**🍕 Analogy:** Practice recipe book 10 times
- Each epoch = one complete pass through all recipes
- Save progress at the start of each pass

**🔧 Technical:**
- **Epoch**: One complete pass through entire dataset
- Saves checkpoint at start (can resume if crash)
- `state_dict()` contains all model weights
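
`save_checkpoint` and `load_checkpoint` are helper functions that are not shown in this part. The sketch below is one way they might look, assuming the checkpoint dictionary layout used above (`state_dict`, `optimizer`, `step`). It runs a round trip on a tiny stand-in model and a temporary file.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint(state, filename="my_checkpoint.pth.tar"):
    """Write the checkpoint dictionary to disk."""
    torch.save(state, filename)


def load_checkpoint(checkpoint, model, optimizer):
    """Restore weights and optimizer state; return the saved step counter."""
    model.load_state_dict(checkpoint["state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    return checkpoint["step"]


# Round trip with a tiny stand-in model (not the captioning model).
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
path = os.path.join(tempfile.mkdtemp(), "demo_checkpoint.pth.tar")
save_checkpoint(
    {"state_dict": model.state_dict(), "optimizer": optimizer.state_dict(), "step": 42},
    path,
)

restored = nn.Linear(4, 2)
restored_optimizer = torch.optim.Adam(restored.parameters(), lr=3e-4)
step = load_checkpoint(torch.load(path), restored, restored_optimizer)
print("resumed at step", step)
```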

---

### 🎲 Part 12: Batch Loop

```python
for idx, (imgs, captions) in enumerate(train_loader):
    imgs = imgs.to(device)
    captions = captions.to(device)
```

**🍕 Analogy:** Process one table of customers at a time
- 32 customers (batch_size=32)
- Bring their orders to the kitchen (GPU/CPU)

**🔧 Technical:**
- `enumerate(train_loader)`: Loop through batches
- `imgs.shape = (32, 3, 224, 224)`: 32 images
- `captions.shape = (max_len, 32)`: 32 captions (padded)
- `.to(device)`: Move data to GPU/CPU
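
The sequence-first caption shape `(max_len, batch_size)` comes from how captions are padded into a batch. A minimal sketch using `pad_sequence`, assuming `<PAD>` has index 0 and using made-up token ids; the tutorial's own collate function may differ in detail.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0  # assumed index of <PAD>

# Three numericalized captions of different lengths (token ids are made up).
captions = [
    torch.tensor([1, 5, 6, 7, 2]),
    torch.tensor([1, 8, 2]),
    torch.tensor([1, 9, 4, 2]),
]

# batch_first=False gives (max_len, batch_size), the layout used in this loop.
batch = pad_sequence(captions, batch_first=False, padding_value=PAD_IDX)
print(batch.shape)  # torch.Size([5, 3])
print(batch[:, 1])  # tensor([1, 8, 2, 0, 0])
```

Each column is one caption, so `captions[:-1]` drops the last time step of every caption rather than the last caption in the batch.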

---

### 🎯 Part 13: Forward Pass

```python
outputs = model(imgs, captions[:-1])
```

**🍕 Analogy:** Chef cooks the dish
- Look at ingredients (images)
- Follow recipe (captions[:-1])
- Produce dish (outputs)

**🔧 Technical:**
- **Input**: Images + captions (without last word)
- **Output**: Predicted words for each position
- **Shape**: `(max_len, 32, vocab_size)` = one raw score (logit) per vocabulary word at each position
- **Teacher forcing**: Use ground truth words during training

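**Why `captions[:-1]`?** The loss in Part 14 compares `outputs` with the full `captions` tensor, so the decoder must return one prediction per caption token. That works when the decoder feeds the image features as an extra first input step (assumed here, since the shapes require it): `max_len - 1` tokens plus the image give `max_len` predictions.
```
Original caption: [<SOS>, "A", "dog", "running", <EOS>]
Feed to model:    [image, <SOS>, "A", "dog", "running"]    # <EOS> removed, image prepended
Model predicts:   [<SOS>, "A",   "dog", "running", <EOS>]  # One prediction per input step
```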
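
The shape bookkeeping can be checked with a small stand-in decoder. `TinyDecoder` is hypothetical and only mirrors the assumed design in which the image features are prepended as the first input step; the tutorial's real `DecoderRNN` may differ.

```python
import torch
import torch.nn as nn


class TinyDecoder(nn.Module):
    """Hypothetical stand-in for the tutorial's decoder, for shape checking only."""

    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        embeddings = self.embed(captions)  # (seq, batch, embed)
        # Assumption: the image features act as the first input step.
        inputs = torch.cat((features.unsqueeze(0), embeddings), dim=0)
        hiddens, _ = self.lstm(inputs)     # (seq + 1, batch, hidden)
        return self.linear(hiddens)        # (seq + 1, batch, vocab)


max_len, batch_size, vocab_size = 5, 2, 20
decoder = TinyDecoder(embed_size=8, hidden_size=16, vocab_size=vocab_size)
features = torch.randn(batch_size, 8)  # stand-in for the CNN output
captions = torch.randint(0, vocab_size, (max_len, batch_size))

outputs = decoder(features, captions[:-1])  # feed max_len - 1 tokens
print(outputs.shape)                        # torch.Size([5, 2, 20])
```

Feeding four tokens plus the image yields five predictions, which lines up with the five tokens of the full caption.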

---

### 💥 Part 14: Calculate Loss

```python
loss = criterion(
    outputs.reshape(-1, outputs.shape[2]),  # Flatten to (seq*batch, vocab)
    captions.reshape(-1),                    # Flatten to (seq*batch,)
)
```

**🍕 Analogy:** Judge tastes dish and scores it
- Compare chef's output with expected recipe
- Lower score = better match

**🔧 Technical:**
- **Reshape outputs**: `(15, 32, 10000)` → `(480, 10000)`
  - 15 positions × 32 captions = 480 predictions
  - Each prediction has 10000 scores, one per vocabulary word (vocab_size)
- **Reshape captions**: `(15, 32)` → `(480,)`
  - 480 ground truth word indices
- **CrossEntropyLoss**: Applies log-softmax to the raw scores and compares the result with the true word indices
- **Result**: Single number (e.g., 2.5) - lower is better
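
The flattening only works because both tensors are reshaped in the same order, so row `i` of the scores still belongs to target `i`. A small sketch with random data and a reduced vocabulary (100 instead of 10000, to keep it light):

```python
import torch
import torch.nn as nn

seq_len, batch_size, vocab_size = 15, 32, 100  # small vocab to keep the demo light
torch.manual_seed(0)
outputs = torch.randn(seq_len, batch_size, vocab_size)
captions = torch.randint(0, vocab_size, (seq_len, batch_size))

flat_outputs = outputs.reshape(-1, outputs.shape[2])  # (480, 100)
flat_targets = captions.reshape(-1)                   # (480,)

# Row t * batch_size + b holds position t of caption b in both tensors.
t, b = 3, 7
row = t * batch_size + b
same_scores = torch.equal(flat_outputs[row], outputs[t, b])
same_target = flat_targets[row].item() == captions[t, b].item()

loss = nn.CrossEntropyLoss()(flat_outputs, flat_targets)
print(flat_outputs.shape, flat_targets.shape, same_scores, same_target, loss.item())
```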

---

### 📈 Part 15: Log to TensorBoard

```python
writer.add_scalar("Training loss", loss.item(), global_step=step)
step += 1
```

**🍕 Analogy:** Write in quality log book
- Record: "Batch #1234, Loss = 2.5"
- Track improvement over time

**🔧 Technical:**
- Logs loss value for visualization
- `step` counts total batches (not just per epoch)
- View graph: `tensorboard --logdir=runs`

---

### 🔄 Part 16: Backward Pass (Learning!)

```python
optimizer.zero_grad()  # Reset gradients
loss.backward()        # Calculate gradients
optimizer.step()       # Update weights
```

**🍕 Analogy:** Coach gives feedback and chef improves
1. **zero_grad()**: Clear old feedback notes
2. **loss.backward()**: Calculate: "What went wrong and by how much?"
3. **optimizer.step()**: Chef adjusts technique

**🔧 Technical:**
1. **zero_grad()**: Clear gradient buffers (they accumulate by default)
2. **loss.backward()**: Backpropagation - calculates gradient for each parameter
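3. **optimizer.step()**: Updates weights using gradients. The simplest form (plain SGD) is shown here; Adam follows the same idea but rescales each step using running averages of the gradient and of its square:
   ```python
   weight_new = weight_old - learning_rate * gradient
   ```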
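
To see how Adam's update differs from the plain rule, the sketch below takes a single Adam step on three weights whose gradients differ by several orders of magnitude. On the very first step Adam's bias-corrected averages reduce to the gradient and its square, so every weight moves by roughly `lr`, whatever the gradient size. The weights and gradients here are made-up values.

```python
import torch

lr = 3e-4
w = torch.nn.Parameter(torch.tensor([1.0, -2.0, 0.5]))
optimizer = torch.optim.Adam([w], lr=lr)

# A loss whose gradients have very different sizes: [100, -0.01, 3].
loss = (w * torch.tensor([100.0, -0.01, 3.0])).sum()
optimizer.zero_grad()
loss.backward()
grad = w.grad.clone()
w_before = w.detach().clone()
optimizer.step()

sgd_step = -lr * grad              # plain SGD: proportional to the gradient
adam_step = w.detach() - w_before  # Adam's first step: about lr per weight
print("SGD would move by:", sgd_step)
print("Adam moved by:    ", adam_step)
```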

---

### 🖨️ Part 17: Print Progress

```python
if idx % 100 == 0:
    print(f"Epoch [{epoch}/{num_epochs}] Batch {idx}/{len(train_loader)} Loss: {loss.item():.4f}")
```

**🍕 Analogy:** Check progress every 100 dishes
- Don't print after every dish (too much!)
- Check occasionally to ensure quality improving

**🔧 Technical:**
- Prints every 100 batches
- Shows: which epoch, which batch, current loss
- Example output:
  ```
  Epoch [0/10] Batch 0/250 Loss: 9.2134
  Epoch [0/10] Batch 100/250 Loss: 4.5678
  Epoch [0/10] Batch 200/250 Loss: 3.1234
  ```

---

## 🎊 Complete Training Flow Summary

```
┌─────────────────────────────────────────────────────┐
│ FOR EACH EPOCH (10 times)                           │
│  ├─ Save checkpoint                                 │
│  │                                                   │
│  └─ FOR EACH BATCH (250 batches)                   │
│      ├─ 1. Get 32 images + captions                │
│      ├─ 2. Move to GPU                             │
│      ├─ 3. Forward pass (predict captions)         │
│      ├─ 4. Calculate loss (how wrong?)             │
│      ├─ 5. Log to TensorBoard                      │
│      ├─ 6. Backward pass (calculate gradients)     │
│      ├─ 7. Update weights (improve model)          │
│      └─ 8. Print progress (every 100 batches)      │
└─────────────────────────────────────────────────────┘
```

---

## ⏱️ Training Timeline Example

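**Dataset:** 8,000 image-caption pairs, batch_size=32 (one caption per image; Flickr8k ships five captions per image, so a loader that uses all of them sees about 40,000 pairs and five times as many batches)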

**One Epoch:**
- Number of batches = 8,000 / 32 = 250 batches
- Time per batch ≈ 0.5 seconds (on GPU)
- **Total time per epoch** ≈ 125 seconds ≈ **2 minutes**

**Complete Training (10 epochs):**
- **Total time** ≈ 10 × 2 min = **20 minutes** (on GPU)
- On CPU: 10-20x slower ≈ **3-7 hours**
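
The estimate above is simple arithmetic and can be redone for other dataset sizes or hardware. The 0.5 seconds per batch and the 10-20x CPU slowdown are the rough assumptions stated above, not measurements.

```python
import math

num_pairs = 8_000
batch_size = 32
seconds_per_batch = 0.5  # assumed GPU speed from the estimate above
num_epochs = 10

batches_per_epoch = math.ceil(num_pairs / batch_size)
epoch_seconds = batches_per_epoch * seconds_per_batch
total_minutes = num_epochs * epoch_seconds / 60
cpu_hours = [total_minutes * slowdown / 60 for slowdown in (10, 20)]

print(batches_per_epoch)                 # 250
print(epoch_seconds)                     # 125.0
print(round(total_minutes, 1))           # 20.8
print([round(h, 1) for h in cpu_hours])  # [3.5, 6.9]
```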

**Expected Loss Progress (illustrative values):**
```
Epoch 0: Loss starts at ~9.0
Epoch 1: Loss drops to ~5.0
Epoch 3: Loss drops to ~3.0
Epoch 5: Loss drops to ~2.0
Epoch 9: Loss around 1.5 (good!)
```
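
A starting loss near 9 is what an untrained model should produce: if it spreads probability evenly over the vocabulary, the cross-entropy equals the natural log of the vocabulary size. Using the example vocabulary of 10000 words from Part 14:

```python
import math

vocab_size = 10_000                  # example vocabulary size used in Part 14
initial_loss = math.log(vocab_size)  # cross-entropy of a uniform guess
print(round(initial_loss, 2))        # 9.21
```

A first-batch loss far above this value usually points to a setup problem rather than a slow start.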

---

## ⚠️ Important Notes

### 💾 Checkpoint Saving
- Saves at the start of each epoch (every ~2 minutes)
- Can resume if training crashes
- File: `my_checkpoint.pth.tar` (~200MB)

### 🎯 Teacher Forcing
- During training: Use ground truth words (faster learning)
- During inference: Use model's own predictions (more realistic)
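
At inference time there are no ground truth words to feed in, so each predicted word becomes the next input. Below is a minimal greedy-decoding sketch with untrained stand-in layers. The layer sizes, the `<EOS>` index, and the `greedy_decode` name are all assumptions for illustration; the tutorial's own captioning method may differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embed_size, hidden_size = 12, 8, 16
EOS = 2  # assumed index of <EOS>

embed = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size)
linear = nn.Linear(hidden_size, vocab_size)


def greedy_decode(features, max_len=10):
    """Generate token ids one at a time, feeding each prediction back in."""
    tokens = []
    x = features.unsqueeze(0)  # (1, 1, embed_size): image features as first input
    states = None
    with torch.no_grad():
        for _ in range(max_len):
            hiddens, states = lstm(x, states)    # (1, 1, hidden_size)
            logits = linear(hiddens.squeeze(0))  # (1, vocab_size)
            predicted = logits.argmax(dim=1)     # most likely next token
            tokens.append(predicted.item())
            if predicted.item() == EOS:
                break
            x = embed(predicted).unsqueeze(0)    # prediction becomes next input
    return tokens


features = torch.randn(1, embed_size)  # stand-in for one encoded image
tokens = greedy_decode(features)
print(tokens)
```

Because the layers are untrained, the token ids are meaningless; the point is the feedback loop, which replaces teacher forcing.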

### 📊 TensorBoard Monitoring
```bash
# In terminal:
tensorboard --logdir=runs

# Open browser:
http://localhost:6006
```

### 🚨 Common Issues
1. **Out of memory**: Reduce batch_size (32 → 16)
2. **Loss not decreasing**: Check learning rate (try 1e-4)
3. **NaN loss**: Gradient explosion (add gradient clipping)
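
For the third issue, gradient clipping goes between `loss.backward()` and `optimizer.step()`. A minimal sketch on a stand-in linear model with a deliberately inflated loss; `max_norm=1.0` is an arbitrary example value, not a recommendation from this tutorial.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 4)  # stand-in for the captioning model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

inputs = torch.randn(16, 8)
loss = (model(inputs) ** 2).sum() * 1000  # inflated loss to force large gradients

optimizer.zero_grad()
loss.backward()
# Rescale all gradients so their combined L2 norm is at most max_norm.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
optimizer.step()

print(f"gradient norm before clipping: {norm_before.item():.2f}")
print(f"gradient norm after clipping:  {norm_after.item():.2f}")
```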

---

## 🎓 Key Takeaways

### ✅ Training is a Loop
```python
for epoch in range(num_epochs):
    for imgs, captions in train_loader:
        # predict → calculate loss → backpropagate → update weights
        ...
```

### ✅ Three Key Steps
1. **Forward pass**: Model makes predictions
2. **Loss calculation**: Measure how wrong
3. **Backward pass**: Learn from mistakes

### ✅ Monitoring
- Loss should decrease over time
- Save checkpoints regularly
- Use TensorBoard for visualization

---

## 🚀 After Training

Once training completes:

```python
# Load the most recent checkpoint (the training loop keeps only one file)
model.load_state_dict(torch.load("my_checkpoint.pth.tar", map_location=device)["state_dict"])
model.eval()

# Generate caption for new image
image = load_and_transform_image("dog.jpg")
caption = model.caption(image, vocabulary)
print(" ".join(caption))
# Example output: "a brown dog is playing with a ball in the park"
```

**Congratulations! You've trained an image captioning model! 🎉**