# Image Captioning with TensorFlow/Keras

## What is Image Captioning?

**Goal:** Given an image, automatically generate a natural language description.

**Example:**
```
Input:  [Photo of a dog playing with a ball]
Output: "a brown dog is playing with a red ball in the park"
```

---

## Architecture Overview

This tutorial uses a **CNN-RNN encoder-decoder architecture**:

```
Image → [Xception CNN] → Feature Vector (2048) → [LSTM RNN] → Caption
  📷          🔍              📊                    ✍️           📝
```

**Components:**
1. **Xception CNN (Encoder)**: Extracts visual features from images (pre-trained on ImageNet)
2. **LSTM RNN (Decoder)**: Generates caption word-by-word from image features
3. **Embedding Layer**: Converts words to dense vectors
4. **Merge Layer**: Adds the projected image features to the LSTM's encoding of the partial caption

---

## Dataset: Flickr8k

**Structure:**
- **~8,000 images** (8,091 in the copy used here) with **5 captions each** ≈ 40,000 image-caption pairs
- **Caption format:** Each image has multiple captions in a text file

**Example from `Flickr8k.token.txt`:**
```
1351764581_4d4fb1b40f.jpg#0    a fireman sprays water into the hood of a small white car
1351764581_4d4fb1b40f.jpg#1    A fireman sprays inside the open hood of a small white car
1351764581_4d4fb1b40f.jpg#2    A fireman uses a firehose on a car engine
...
```

**Key difference from PyTorch version:**
- PyTorch version: Simple CSV format (`image.jpg,caption text`)
- TensorFlow version: Token format with `#0, #1, #2...` suffixes for multiple captions per image

---

## Complete Workflow

```
PHASE 1: DATA PREPARATION
  ├─ Load captions from Flickr8k.token.txt
  ├─ Clean text (lowercase, remove punctuation)
  ├─ Build vocabulary
  └─ Save processed descriptions

PHASE 2: FEATURE EXTRACTION
  ├─ Load Xception model (pre-trained)
  ├─ Extract 2048-dim features from all images
  └─ Save features to pickle file

PHASE 3: TOKENIZATION
  ├─ Create tokenizer (word → number mapping)
  ├─ Calculate max caption length
  └─ Save tokenizer

PHASE 4: MODEL BUILDING
  ├─ Define CNN-RNN architecture
  ├─ Image features → Dense(256)
  ├─ Word sequences → Embedding(256) → LSTM(256)
  ├─ Merge → Dense → Softmax
  └─ Compile with categorical_crossentropy

PHASE 5: TRAINING
  ├─ Create data generator (yields batches)
  ├─ Train for 10 epochs
  └─ Save model checkpoints

PHASE 6: INFERENCE
  ├─ Load trained model
  ├─ Extract features from new image
  ├─ Generate caption word-by-word
  └─ Display result
```

Let's build each phase step by step!

---

## Imports and Setup

### Basic Libraries

```python
import string          # For text cleaning (remove punctuation)
import numpy as np     # Numerical operations
import os              # File system operations
from pickle import dump, load  # Save/load Python objects
import tensorflow as tf         # Deep learning framework
import matplotlib.pyplot as plt # Visualization
```

**Purpose:** Essential utilities for data processing, model building, and visualization.

---

## PHASE 1: DATA PREPARATION

This phase handles loading and preprocessing the Flickr8k dataset. The dataset has a unique token format where each image has 5 captions stored like this:

```
image.jpg#0    A dog runs across the grass
image.jpg#1    A brown dog running in a field
image.jpg#2    A dog playing outside
...
```

**Workflow Steps:**
1. Load token file → Extract image names and captions
2. Clean text → Remove punctuation, lowercase, remove single chars
3. Build vocabulary → Collect the set of unique words
4. Save to file → Store processed captions for later use

---

### TensorFlow & Keras Modules

```python
from PIL import Image                                      # Image loading
from tensorflow.keras.applications.xception import Xception  # Pre-trained CNN
from tensorflow.keras.applications.xception import preprocess_input  # Image preprocessing
from tensorflow.keras.preprocessing.sequence import pad_sequences    # Pad sequences to same length
from tensorflow.keras.utils import to_categorical, plot_model        # One-hot encoding, model visualization
from tensorflow.keras.preprocessing.text import Tokenizer            # Text to sequences
from tensorflow.keras.layers import (
    Input, Dense, LSTM, Embedding, Dropout, add  # Model layers
)
from tensorflow.keras.models import Model  # Functional API
```

**Purpose:**
- **Xception:** Pre-trained CNN for feature extraction (2048-dimensional vectors)
- **Tokenizer:** Converts text captions to sequences of integers
- **LSTM:** Recurrent layer for generating captions word-by-word
- **Embedding:** Converts word indices to dense vectors

**Why Xception?**
- Efficient feature extraction (pre-trained on ImageNet)
- Produces rich 2048-dimensional feature vectors
- Faster than training CNN from scratch

### Function 1: `load_doc(filename)` - Load Text File

```python
def load_doc(filename):
    # Read the whole file; the context manager closes it even on error
    with open(filename, 'r') as file:
        return file.read()
```

**What it does:** Reads the entire content of a text file

**Input:** `"Flickr8k.token.txt"` (contains all captions)

**Output:** One large string with all lines

**Example:**
```
Input file content:
    1000268201_693b08cb0e.jpg#0    A child in a pink dress is climbing up a set of stairs
    1000268201_693b08cb0e.jpg#1    A girl going into a wooden building
    
Output: "1000268201_693b08cb0e.jpg#0\tA child in a pink dress...\n1000268201_693b08cb0e.jpg#1\tA girl going..."
```

**Role in Workflow:** First step - loads raw caption data from disk

### Function 2: `all_img_captions(filename)` - Parse Token Format

```python
def all_img_captions(filename):
    file = load_doc(filename)
    captions = file.split('\n')
    descriptions = {}
    
    for caption in captions[:-1]:
        img, caption_text = caption.split('\t')
        
        if img[:-2] not in descriptions:
            descriptions[img[:-2]] = [caption_text]
        else:
            descriptions[img[:-2]].append(caption_text)
    
    return descriptions
```

**What it does:** Converts token format to dictionary mapping

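**Key Logic:** `img[:-2]` removes the two-character `#0`, `#1`, `#2` suffix to group captions by image (this assumes a single-digit caption index, which holds for Flickr8k's 5 captions per image)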

**Data Transformation:**
```
INPUT (token format):
    image.jpg#0    caption one
    image.jpg#1    caption two
    image.jpg#2    caption three

OUTPUT (dictionary):
    {
        "image.jpg": [
            "caption one",
            "caption two", 
            "caption three"
        ]
    }
```

**Why This Matters:** Groups all 5 captions per image into a single entry

**Role in Workflow:** Second step - structures raw text into usable dictionary format
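
The `img[:-2]` slice depends on every suffix being exactly two characters long. A slightly more defensive variant is sketched below, assuming the file is tab-separated and each image id ends in `#<index>`. The name `parse_token_lines` is hypothetical and not part of the tutorial's code; it splits on the last `#` and skips blank lines instead of relying on a trailing newline.

```python
from collections import defaultdict

def parse_token_lines(text):
    """Group captions by image name from Flickr8k-style token text."""
    descriptions = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # Ignore blank lines anywhere in the file
        image_id, caption = line.split('\t', 1)
        image_name = image_id.rsplit('#', 1)[0]  # Drop the "#<index>" suffix
        descriptions[image_name].append(caption.strip())
    return dict(descriptions)

# Tiny in-memory sample in the same format as Flickr8k.token.txt
sample = (
    "image.jpg#0\tcaption one\n"
    "image.jpg#1\tcaption two\n"
    "other.jpg#0\tcaption three\n"
)
parsed = parse_token_lines(sample)
print(parsed)
```

Because it splits on `#` rather than slicing a fixed width, this version would also cope with caption indices of 10 or more.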

### Function 3: `cleaning_text(descriptions)` - Text Preprocessing

```python
def cleaning_text(descriptions):
    table = str.maketrans('', '', string.punctuation)  # Translation table for punctuation removal
    
    for key, desc_list in descriptions.items():
        for i in range(len(desc_list)):
            desc = desc_list[i]
            desc = desc.split()                           # Split into words
            desc = [word.lower() for word in desc]        # Lowercase
            desc = [word.translate(table) for word in desc]  # Remove punctuation
            desc = [word for word in desc if len(word) > 1]  # Remove single-character words
            desc_list[i] = ' '.join(desc)
    
    return descriptions
```

**What it does:** Cleans and normalizes text for better model training

**Cleaning Steps:**
1. **Lowercase:** "The Dog" → "the dog"
2. **Remove punctuation:** "dog!" → "dog"
3. **Filter short words:** "a", "I" → removed
4. **Rejoin words:** ["the", "dog", "runs"] → "the dog runs"

**Example Transformation:**
```
BEFORE: "A child in a pink dress is climbing up a set of stairs!"
AFTER:  "child in pink dress is climbing up set of stairs"

BEFORE: "The dog's running fast."
AFTER:  "the dogs running fast"
```

**Why Clean Text?**
- Reduces vocabulary size (fewer unique words)
- Removes noise (punctuation doesn't help captioning)
- Standardizes format (all lowercase)

**Role in Workflow:** Third step - prepares text for tokenization
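
To check the cleaning rules on individual captions, the same steps can be applied to a single string. This is a minimal sketch; `clean_caption` is a hypothetical helper name, and it follows the same order as `cleaning_text` (lowercase, strip punctuation, drop one-character words).

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_caption(caption):
    """Apply the tutorial's cleaning steps to one caption string."""
    words = caption.lower().split()
    words = [word.translate(_PUNCT_TABLE) for word in words]
    words = [word for word in words if len(word) > 1]
    return ' '.join(words)

print(clean_caption("A child in a pink dress is climbing up a set of stairs!"))
print(clean_caption("The dog's running fast."))
```

Note that the apostrophe in "dog's" is deleted rather than replaced by a space, which is why the result is "dogs" and not "dog".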

### Function 4: `text_vocabulary(descriptions)` - Build Vocabulary Set

```python
def text_vocabulary(descriptions):
    all_desc = set()
    for key in descriptions.keys():
        for d in descriptions[key]:
            all_desc.update(d.split())  # Add every word of this caption
    return all_desc
```

**What it does:** Extracts all unique words from all captions

**Data Flow:**
```
INPUT:
    {
        "img1.jpg": ["dog runs fast", "brown dog running"],
        "img2.jpg": ["cat sits quietly"]
    }

PROCESSING:
    img1.jpg captions → ["dog", "runs", "fast", "brown", "dog", "running"]
    img2.jpg captions → ["cat", "sits", "quietly"]
    
OUTPUT (set):
    {"dog", "runs", "fast", "brown", "running", "cat", "sits", "quietly"}
```

**Why Use a Set?**
- Automatically removes duplicates
- Fast lookup for checking if word exists
- Gives vocabulary size: `len(vocabulary)`

**Typical Vocabulary Size:** ~8,000-9,000 unique words in Flickr8k after cleaning

**Role in Workflow:** Fourth step - identifies all unique words needed for tokenization
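
A set records which words exist but not how often they appear. If frequencies are also of interest (for example, to inspect rare words), a `Counter` can be built the same way. This is an optional sketch, not part of the tutorial's pipeline, and `word_frequencies` is a hypothetical name.

```python
from collections import Counter

def word_frequencies(descriptions):
    """Count how often each word occurs across all captions."""
    counts = Counter()
    for captions in descriptions.values():
        for caption in captions:
            counts.update(caption.split())
    return counts

sample = {
    "img1.jpg": ["dog runs fast", "brown dog running"],
    "img2.jpg": ["cat sits quietly"],
}
counts = word_frequencies(sample)
vocabulary = set(counts)  # Same unique words that text_vocabulary would collect

print(counts.most_common(3))
print(len(vocabulary))
```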

### Function 5: `save_descriptions(descriptions, filename)` - Save Processed Data

```python
def save_descriptions(descriptions, filename):
    lines = []
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(f"{key}\t{desc}")
    
    data = '\n'.join(lines)
    with open(filename, 'w') as file:  # Closes the file even if writing fails
        file.write(data)
```

**What it does:** Saves cleaned captions to a file for later use

**Output Format:** Simple tab-separated format (no more #0, #1, #2)

**Example Output File (`descriptions.txt`):**
```
image1.jpg    child in pink dress is climbing up set of stairs
image1.jpg    girl going into wooden building
image1.jpg    little girl climbing into wooden playhouse
image2.jpg    dog runs across the grass
image2.jpg    brown dog running in field
```

**Why Save to File?**
- Don't need to reprocess text every time
- Can share cleaned data
- Faster loading for training

**Role in Workflow:** Final step of Phase 1 - persists cleaned data to disk
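
A quick way to gain confidence in the saved format is a round trip on a small sample: write it, read it back, and compare. The sketch below writes to a temporary directory so it does not touch the real `descriptions.txt`. `read_descriptions` is a hypothetical helper that splits each line on whitespace, with the first token taken as the image name.

```python
import os
import tempfile

def save_descriptions(descriptions, filename):
    """Write one 'image<TAB>caption' line per caption."""
    lines = [f"{key}\t{desc}" for key, descs in descriptions.items() for desc in descs]
    with open(filename, 'w') as f:
        f.write('\n'.join(lines))

def read_descriptions(filename):
    """Read the file back into {image: [captions]} (hypothetical helper)."""
    result = {}
    with open(filename) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            result.setdefault(parts[0], []).append(' '.join(parts[1:]))
    return result

sample = {
    "image1.jpg": ["child in pink dress", "girl going into wooden building"],
    "image2.jpg": ["dog runs across the grass"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "descriptions.txt")
    save_descriptions(sample, path)
    restored = read_descriptions(path)

print(restored == sample)
```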

### Phase 1 Execution - Data Preparation Pipeline

**This cell runs all Phase 1 functions in sequence:**

```python
# 1. Load raw token file
descriptions = all_img_captions("Flickr8k.token.txt")  
# Output: {img: [cap1, cap2, cap3, cap4, cap5]} - 8,000 images

# 2. Clean text
clean_descriptions = cleaning_text(descriptions)
# Output: Lowercase, no punctuation, no single chars

# 3. Build vocabulary
vocabulary = text_vocabulary(clean_descriptions)
# Output: ~8,000 unique words

# 4. Save to disk
save_descriptions(clean_descriptions, "descriptions.txt")
# Output: descriptions.txt file created
```

**Expected Output:**
```
Length of descriptions = 8091
Length of vocabulary = 8763
```

**What You Get:**
- `descriptions.txt` - cleaned captions ready for training
- `vocabulary` - set of all unique words
- Ready for Phase 2 (Feature Extraction)

---

## PHASE 2: FEATURE EXTRACTION

Now that text is prepared, we need to extract visual features from images using a pre-trained CNN (Xception).

**Why Feature Extraction?**
- Training a CNN from scratch is slow and requires huge data
- Xception is pre-trained on ImageNet (1.4M images, 1000 classes)
- We "borrow" its learned features (edges, textures, objects)

**What Happens:**
```
Image (299x299x3) → Xception CNN → Feature Vector (2048)
```

Each image becomes a 2048-dimensional vector that captures its visual content.

**Workflow:**
1. Load Xception model (without top classification layer)
2. Preprocess images to 299x299
3. Extract features for all images
4. Save features to pickle file

---

### Phase 2 Execution - Feature Extraction with Xception

**This cell downloads Xception weights and extracts features:**

```python
# 1. Download Xception weights (if not cached)
weights_url = "https://storage.googleapis.com/.../xception_weights_tf_dim_ordering_tf_kernels_notop.h5"
weights_path = download_with_retry(weights_url, 'xception_weights.h5')

# 2. Load Xception model
model = Xception(include_top=False,    # Remove classification layer
                 pooling='avg',         # Global average pooling
                 weights=weights_path)  # Use downloaded weights

# 3. Extract features from all images
features = extract_features(dataset_images)
# Output: {img_name: 2048-dim vector}

# 4. Save features
with open("features.p", "wb") as f:  # Close the file once the pickle is written
    dump(features, f)
```

**Key Parameters:**
- `include_top=False`: Removes final classification layer (we don't need 1000 ImageNet classes)
- `pooling='avg'`: Adds global average pooling → output shape (2048,)
- `weights`: Pre-trained weights from ImageNet

**What You Get:**
- `features.p` - pickle file with all image features
- Dictionary: `{image_name: numpy_array(2048)}`
- Processing time: ~20-30 minutes for 8,000 images

**Memory Note:** The features file is much smaller than the raw images: about 66 MB of raw float32 data (8,091 × 2,048 × 4 bytes), plus pickle overhead

### Function 6: `extract_features(directory)` - CNN Feature Extraction

```python
from tqdm import tqdm  # Progress bar used in the loop below

def extract_features(directory):
    features = {}
    valid_images = ['.jpg', '.jpeg', '.png']
    
    for img in tqdm(os.listdir(directory)):
        # Skip non-image files
        ext = os.path.splitext(img)[1].lower()
        if ext not in valid_images:
            continue
        
        # Load and preprocess image
        filename = os.path.join(directory, img)
        image = Image.open(filename).convert('RGB')  # Force 3 channels (grayscale/RGBA would break the model input)
        image = image.resize((299, 299))          # Xception input size
        image = np.expand_dims(image, axis=0)     # Add batch dimension: (299,299,3) → (1,299,299,3)
        image = image / 127.5                      # Scale to [0, 2]
        image = image - 1.0                        # Scale to [-1, 1]
        
        # Extract features using pre-trained model
        feature = model.predict(image)             # Output: (1, 2048)
        features[img] = feature
    
    return features
```

**What it does:** Converts images to feature vectors using Xception CNN

**Image Preprocessing Steps:**
1. **Resize:** Any size → 299x299 (Xception requirement)
2. **Add batch dimension:** (299,299,3) → (1,299,299,3)
3. **Normalize:** Pixel values [0,255] → [-1,1]

**Example:**
```
INPUT: dog.jpg (640x480x3)
↓ Resize
(299x299x3)
↓ Add batch dimension + normalize
(1x299x299x3) with values in [-1, 1]
↓ Xception CNN
(1, 2048) feature array

OUTPUT: features["dog.jpg"] = array([[0.234, -0.567, 0.891, ...]])  # Shape: (1, 2048)
```

**Why [-1, 1] normalization?**
- Xception was trained with this normalization
- Matches training distribution = better features

**Role in Workflow:** Converts raw images to dense feature vectors for the LSTM decoder
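
The two-step scaling in `extract_features` (`/ 127.5`, then `- 1.0`) is a linear map from [0, 255] to [-1, 1]. The sketch below checks the endpoints on plain numbers; `scale_pixel` is a hypothetical name used only for illustration. The `preprocess_input` function imported earlier from `tensorflow.keras.applications.xception` is documented as scaling pixel values to between -1 and 1, so it should be usable in place of the manual arithmetic.

```python
def scale_pixel(value):
    """Map an 8-bit pixel value in [0, 255] to [-1, 1]."""
    return value / 127.5 - 1.0

# Endpoints and midpoint of the 8-bit range
for pixel in (0, 127.5, 255):
    print(pixel, scale_pixel(pixel))
```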

---

## PHASE 3: TOKENIZATION & DATA LOADING

Now we need to prepare the training data by:
1. Loading train/test splits
2. Adding `<start>` and `<end>` tokens to captions
3. Creating a tokenizer (word → integer mapping)
4. Finding maximum caption length

**Why Tokenization?**
Neural networks work with numbers, not words. We need to convert:
```
"dog runs fast" → [34, 156, 892]
```

**Special Tokens:**
- `<start>`: Signals beginning of caption
- `<end>`: Signals end of caption

---

### Function 7: `load_photos(filename)` - Load Train/Test Split

```python
def load_photos(filename):
    file = load_doc(filename)
    photos = file.split("\n")[:-1]
    photos_present = [photo for photo in photos if os.path.exists(os.path.join(dataset_images, photo))]
    return photos_present
```

**What it does:** Loads list of image filenames for train/test split

**Input:** `Flickr_8k.trainImages.txt` containing:
```
1000268201_693b08cb0e.jpg
1001773457_577c3a7d70.jpg
1002674143_1b742ab4b8.jpg
...
```

**Output:** List of image filenames that exist in the dataset folder

**Example:**
```python
train_imgs = load_photos("Flickr_8k.trainImages.txt")
# Output: ['1000268201_693b08cb0e.jpg', '1001773457_577c3a7d70.jpg', ...]
# Length: ~6,000 images for training
```

**Role in Workflow:** Separates training and testing images

### Function 8: `load_clean_descriptions(filename, photos)` - Add Start/End Tokens

```python
def load_clean_descriptions(filename, photos): 
    file = load_doc(filename)
    descriptions = {}
    
    for line in file.split("\n"):
        words = line.split()
        if len(words) < 1:
            continue
        
        image, image_caption = words[0], words[1:]
        
        if image in photos:
            if image not in descriptions:
                descriptions[image] = []
            
            # Add special tokens
            desc = '<start> ' + " ".join(image_caption) + ' <end>'
            descriptions[image].append(desc)
    
    return descriptions
```

**What it does:** Loads captions for specific images and adds start/end tokens

**Caption Transformation:**
```
BEFORE: "dog runs across the grass"
AFTER:  "<start> dog runs across the grass <end>"
```

**Why Add Tokens?**
- `<start>`: Tells model when to begin generating
- `<end>`: Tells model when to stop generating

**Example:**
```python
train_descriptions = load_clean_descriptions("descriptions.txt", train_imgs)
# Output:
{
    "dog.jpg": [
        "<start> dog runs across the grass <end>",
        "<start> brown dog running in field <end>",
        ...
    ]
}
```

**Role in Workflow:** Prepares captions for sequence-to-sequence training

### Function 9: `load_features(photos)` - Load Pre-extracted Features

```python
def load_features(photos):
    # Load all features (the context manager closes the file afterwards)
    with open("features.p", "rb") as f:
        all_features = load(f)
    # Select only needed features
    features = {k: all_features[k] for k in photos}
    return features
```

**What it does:** Loads only the features for training/test images

**Why Filter?**
- `features.p` contains features for ALL 8,000 images
- We only need features for the training set (~6,000) or the test set (~1,000)
- Saves memory

**Example:**
```python
# All features: 8,091 images
all_features = load(open("features.p", "rb"))

# Filter for training set
train_features = load_features(train_imgs)  # Only ~6,000 features
```

**Data Structure:**
```python
{
    "dog.jpg": array([0.234, -0.567, ...]),  # Shape: (1, 2048)
    "cat.jpg": array([0.891, 0.123, ...]),
    ...
}
```

**Role in Workflow:** Loads pre-computed CNN features for efficient training

### Loading Training Data

**This cell loads training split:**

```python
# 1. Load training image filenames
train_imgs = load_photos("Flickr_8k.trainImages.txt")
# Output: List of ~6,000 image filenames

# 2. Load captions with start/end tokens
train_descriptions = load_clean_descriptions("descriptions.txt", train_imgs)
# Output: {img: ["<start> caption <end>", ...]}

# 3. Load pre-extracted CNN features
train_features = load_features(train_imgs)
# Output: {img: feature_vector(2048)}
```

**What You Get:**
- `train_imgs`: List of training image names
- `train_descriptions`: Dictionary of captions with special tokens
- `train_features`: Dictionary of 2048-dim feature vectors

**Ready for:** Tokenization and model building

### Function 10: `dict_to_list(descriptions)` - Flatten Descriptions

```python
def dict_to_list(descriptions):
    all_desc = []
    for key in descriptions.keys():
        all_desc.extend(descriptions[key])  # Append every caption for this image
    return all_desc
```

**What it does:** Converts dictionary of captions to a flat list

**Transformation:**
```
INPUT (dictionary):
{
    "img1.jpg": ["<start> dog runs <end>", "<start> brown dog <end>"],
    "img2.jpg": ["<start> cat sits <end>"]
}

OUTPUT (list):
[
    "<start> dog runs <end>",
    "<start> brown dog <end>",
    "<start> cat sits <end>"
]
```

**Why Flatten?**
- `Tokenizer.fit_on_texts()` expects a list of strings
- We need all captions together to build the vocabulary

**Example:**
```python
all_train_captions = dict_to_list(train_descriptions)
# Output: List of ~30,000 captions (6,000 images × 5 captions each)
```

**Role in Workflow:** Prepares data format for Keras Tokenizer

### Creating Tokenizer - Word to Integer Mapping

**This cell creates the tokenizer that converts words to numbers:**

```python
# 1. Flatten all captions to a list
all_train_captions = dict_to_list(train_descriptions)
# Output: ["<start> dog runs <end>", "<start> cat sits <end>", ...]

# 2. Create tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_train_captions)
# Learns vocabulary from all training captions

# 3. Get vocabulary size
vocab_size = len(tokenizer.word_index) + 1
```

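**What the Tokenizer Does:**

```python
# Creates a word_index dictionary, ordered by word frequency.
# Note: the default Tokenizer filters strip '<' and '>', so the
# "<start>"/"<end>" markers are stored as plain 'start' and 'end'.
tokenizer.word_index = {
    'start': 1,
    'end': 2,
    'dog': 3,
    'runs': 4,
    'cat': 5,
    'sits': 6,
    ...
}

# Can convert text to sequences:
tokenizer.texts_to_sequences(["dog runs"])
# Output: [[3, 4]]
```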

**Why +1 for vocab_size?**
- Index 0 is reserved for padding
- Real vocabulary: indices 1 to vocab_size-1

**Role in Workflow:** Creates word-to-integer mapping for model input
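
To make the mapping concrete without loading TensorFlow, the sketch below imitates the main steps the Keras `Tokenizer` performs with its default settings: lowercase the text, replace filtered punctuation with spaces, split on whitespace, and number words by descending frequency starting at 1. It is a simplified stand-in and not the Keras implementation; `build_word_index` is a hypothetical name. One visible consequence is that `<start>` and `<end>` lose their angle brackets, because `<` and `>` are in the default filter string.

```python
from collections import Counter

# Default `filters` value of the Keras Tokenizer
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def build_word_index(texts, filters=KERAS_DEFAULT_FILTERS):
    """Simplified imitation of Tokenizer.fit_on_texts (illustration only)."""
    table = str.maketrans({ch: ' ' for ch in filters})
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(table).split())
    # Most frequent word gets index 1; index 0 stays free for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

captions = ["<start> dog runs <end>", "<start> cat sits <end>"]
word_index = build_word_index(captions)
vocab_size = len(word_index) + 1  # +1 for the padding index 0

print(word_index)
print(vocab_size)
```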

### Finding Maximum Caption Length

**This cell calculates the longest caption length:**

```python
max_length = max(len(caption.split()) for caption in all_train_captions)
```

**Why Do We Need This?**
- All sequences must be the same length for batch processing
- Shorter captions will be padded to max_length
- No training caption is longer than max_length, so nothing is truncated

**Example:**
```python
captions = [
    "<start> dog runs <end>",           # Length: 4
    "<start> brown dog running in field <end>",  # Length: 7
    "<start> cat <end>"                 # Length: 3
]

max_length = 7  # Longest caption

# Padded sequences (pad_sequences pads on the left by default):
[0, 0, 0, 1, 3, 4, 2]    # "dog runs" padded with 0s
[1, 8, 3, 9, 10, 11, 2]  # "brown dog running in field"
[0, 0, 0, 0, 1, 5, 2]    # "cat" padded with 0s
```

**Typical max_length:** ~34 words for Flickr8k

**Role in Workflow:** Determines sequence padding length for model input
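
Keras `pad_sequences` defaults to `padding='pre'` and `truncating='pre'`, so zeros are added at the front and over-long sequences lose their earliest items. The pure-Python sketch below mimics that default behaviour for a single sequence; `pad_pre` is a hypothetical name, and it assumes `maxlen` is at least 1.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad (or left-truncate) one sequence to exactly maxlen items."""
    sequence = list(sequence)[-maxlen:]  # Keep only the last maxlen items
    return [value] * (maxlen - len(sequence)) + sequence

print(pad_pre([1, 3, 4, 2], maxlen=7))
print(pad_pre([1, 5, 2], maxlen=7))
print(pad_pre([1, 2, 3, 4, 5], maxlen=3))
```

Left-padding keeps the most recent words at the end of the input, right next to the position where the LSTM produces its final state.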

---

## PHASE 4: MODEL ARCHITECTURE

Time to build the image captioning model! We'll use an **Encoder-Decoder architecture**:

**Architecture Overview:**

```
IMAGE FEATURES (2048)             CAPTION SEQUENCE (max_length)
       ↓                                  ↓
   Dropout(0.5)                  Embedding(vocab_size, 256)
       ↓                                  ↓
   Dense(256)                        Dropout(0.5)
       ↓                                  ↓
       ↓                              LSTM(256)
       ↓                                  ↓
       └──────────→ MERGE (add) ←─────────┘
                         ↓
                    Dense(256)
                         ↓
                Dense(vocab_size, softmax)
                         ↓
                    NEXT WORD
```

**Two Input Branches:**
1. **Image Encoder:** Dropout + Dense(256) to compress the CNN features
2. **Text Decoder:** Embedding + Dropout + LSTM(256) to encode the partial caption

**They merge:** The two 256-dim vectors are added element-wise, then passed through Dense layers for next-word prediction

---

### Function 11: `create_tokenizer(descriptions)` - Build Tokenizer

```python
def create_tokenizer(descriptions):
    desc_list = dict_to_list(descriptions)  # Flatten to list
    tokenizer = Tokenizer()                  # Create tokenizer
    tokenizer.fit_on_texts(desc_list)       # Learn vocabulary
    return tokenizer
```

**What it does:** Wrapper function that creates and trains the tokenizer

**Workflow:**
```
1. Flatten descriptions → ["<start> dog <end>", "<start> cat <end>", ...]
2. Create Tokenizer() object
3. fit_on_texts() → builds word_index dictionary
4. Return trained tokenizer
```

**Example Usage:**
```python
tokenizer = create_tokenizer(train_descriptions)
dump(tokenizer, open('tokenizer.p', 'wb'))  # Save for later

vocab_size = len(tokenizer.word_index) + 1  # +1 for padding index 0
```

**Role in Workflow:** Creates reusable tokenizer object for converting text to sequences

### Function 12: `max_length(descriptions)` - Find Longest Caption

```python
def max_length(descriptions):
    desc_list = dict_to_list(descriptions)
    return max(len(d.split()) for d in desc_list)
```

**What it does:** Calculates the maximum caption length in the dataset

**Example:**
```python
descriptions = {
    "img1.jpg": ["<start> dog runs <end>", "<start> brown dog <end>"],
    "img2.jpg": ["<start> cat <end>"]
}

max_length(descriptions)
# Output: 4  (from "<start> dog runs <end>")
```

**Role in Workflow:** Determines padding length for variable-length sequences

### Function 13: `create_sequences()` - Create Training Sequences

This is the **MOST IMPORTANT** function - it creates input-output pairs for training!

```python
def create_sequences(tokenizer, max_length, desc_list, feature):
    X1, X2, y = list(), list(), list()
    
    for desc in desc_list:
        # Encode caption to integers
        seq = tokenizer.texts_to_sequences([desc])[0]
        
        # Create multiple training pairs from one caption
        for i in range(1, len(seq)):
            in_seq, out_seq = seq[:i], seq[i]  # Partial caption, next word
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            
            X1.append(feature)    # Image features
            X2.append(in_seq)     # Partial caption
            y.append(out_seq)     # Next word (one-hot)
    
    return np.array(X1), np.array(X2), np.array(y)
```

**How It Works - Example:**

Given caption: `"<start> dog runs <end>"`

**Step 1: Tokenize**
```
seq = [1, 23, 45, 2]  # <start>=1, dog=23, runs=45, <end>=2
```

**Step 2: Create Training Pairs**
```
i=1: in_seq=[1]         → out_seq=23  (Given "<start>", predict "dog")
i=2: in_seq=[1,23]      → out_seq=45  (Given "<start> dog", predict "runs")
i=3: in_seq=[1,23,45]   → out_seq=2   (Given "<start> dog runs", predict "<end>")
```

**Step 3: Pad Sequences**
```
If max_length=10 (pad_sequences pads on the left by default):
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]         → 23
[0, 0, 0, 0, 0, 0, 0, 0, 1, 23]        → 45
[0, 0, 0, 0, 0, 0, 0, 1, 23, 45]       → 2
```

**Final Output:**
- `X1`: Image features (same for all 3 pairs)
- `X2`: Padded caption sequences
- `y`: Next word (one-hot encoded, size=vocab_size)

**Why This Works:**
- Model learns to predict next word given image + partial caption
- One caption creates multiple training examples
- At inference, we predict word-by-word using this same pattern

**Role in Workflow:** Converts (image, caption) pairs into (image, partial_caption) → next_word training data
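
The prefix/next-word expansion can be tried without Keras on the example sequence above. This is a minimal sketch: `next_word_pairs` and `one_hot` are hypothetical helpers, and the vocabulary size of 50 is an arbitrary value chosen only so the example indices fit.

```python
def next_word_pairs(seq):
    """Split one encoded caption into (prefix, next_word) training pairs."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

def one_hot(index, num_classes):
    """Plain-Python equivalent of a one-hot target vector."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector

seq = [1, 23, 45, 2]  # <start>=1, dog=23, runs=45, <end>=2
pairs = next_word_pairs(seq)
targets = [one_hot(word, num_classes=50) for _, word in pairs]

for prefix, word in pairs:
    print(prefix, "->", word)
```

A caption of n tokens always produces n - 1 pairs, which is why the number of training samples is much larger than the number of captions.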

### Function 14: `data_generator()` - Batch Data Generator

```python
def data_generator(descriptions, features, tokenizer, max_length):
    def generator():
        while True:  # Infinite loop for training
            for key, description_list in descriptions.items():
                feature = features[key][0]  # Get image features
                
                # Create sequences for this image
                input_image, input_sequence, output_word = create_sequences(
                    tokenizer, max_length, description_list, feature
                )
                
                # Yield one sample at a time
                for i in range(len(input_image)):
                    yield {
                        'input_1': input_image[i],      # Image features (2048)
                        'input_2': input_sequence[i]    # Caption sequence (max_length)
                    }, output_word[i]                    # Next word (vocab_size)
    
    # Define output shapes for TensorFlow
    output_signature = (
        {
            'input_1': tf.TensorSpec(shape=(2048,), dtype=tf.float32),
            'input_2': tf.TensorSpec(shape=(max_length,), dtype=tf.int32)
        },
        tf.TensorSpec(shape=(vocab_size,), dtype=tf.float32)
    )
    
    # Create TensorFlow dataset
    dataset = tf.data.Dataset.from_generator(generator, output_signature=output_signature)
    
    return dataset.batch(32)  # Batch size 32
```

**What it does:** Creates a memory-efficient data pipeline for training

**Why Use a Generator?**
- The expanded dataset is too large to hold in memory: 6,000 images × 5 captions × up to 34 words gives up to ~1M samples, each with a `vocab_size`-long one-hot target
- Generates batches on-the-fly during training
- Infinite loop ensures training never runs out of data

**Output Format:**
```python
# Each batch contains:
inputs = {
    'input_1': (32, 2048),      # 32 image features
    'input_2': (32, max_length) # 32 caption sequences
}
outputs = (32, vocab_size)      # 32 next-word predictions
```

**Role in Workflow:** Provides batched training data to model.fit()

### Function 15: `define_model()` - Build Caption Model Architecture

```python
def define_model(vocab_size, max_length):
    # IMAGE ENCODER BRANCH
    inputs1 = Input(shape=(2048,), name='input_1')  # Image features from Xception
    fe1 = Dropout(0.5)(inputs1)                      # Dropout for regularization
    fe2 = Dense(256, activation='relu')(fe1)         # Compress 2048 → 256
    
    # TEXT DECODER BRANCH
    inputs2 = Input(shape=(max_length,), name='input_2')  # Caption sequence
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)  # Word embeddings
    se2 = Dropout(0.5)(se1)                                     # Dropout
    se3 = LSTM(256)(se2)                                        # LSTM processes sequence → 256
    
    # MERGE BRANCHES
    decoder1 = add([fe2, se3])                       # Element-wise addition
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)  # Probability over words
    
    # BUILD MODEL
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    return model
```

**Architecture Breakdown:**

**1. Image Encoder (fe = feature encoder):**
```
(2048) → Dropout(0.5) → Dense(256) → [256-dim vector]
```

**2. Text Decoder (se = sequence encoder):**
```
(max_length) → Embedding(vocab_size, 256) → Dropout(0.5) → LSTM(256) → [256-dim vector]
```

**3. Merge & Prediction:**
```
[Image 256] + [Text 256] → Dense(256) → Dense(vocab_size, softmax)
```

**Key Components:**

- **Embedding Layer:** Converts word indices to 256-dim vectors
  - `mask_zero=True`: Ignores padding (index 0)
  
- **LSTM:** Processes sequence of word embeddings
  - Captures context: "dog" means different things after "brown" vs "hot"
  
- **add([fe2, se3]):** Combines image and text features
  - Both are 256-dim, so we can add them element-wise
  
- **Softmax Output:** Probability distribution over all words

**Example Forward Pass:**
```
Image features: [0.23, -0.45, ..., 0.89]  (2048)
Caption: "<start> dog"                     (tokenized, padded)

↓
Image branch: [0.12, 0.89, ..., -0.34]    (256)
Text branch:  [0.56, -0.23, ..., 0.78]    (256)

↓ Merge (add)
Combined: [0.68, 0.66, ..., 0.44]         (256)

↓ Output
Probabilities: [0.001, 0.003, ..., 0.24, ...]  (vocab_size)
                                ↑
                            "runs" (highest probability)
```

**Role in Workflow:** Defines the neural network architecture for image captioning
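
As a sanity check on the architecture, the trainable parameter count can be worked out by hand and compared with what `model.summary()` reports. The sketch below uses the standard formulas for Dense, Embedding, and LSTM layers with the sizes from `define_model`. The function name and the example `vocab_size` of 1,000 are illustrative assumptions, not values from the tutorial.

```python
def caption_model_param_count(vocab_size, feature_dim=2048, embed_dim=256, units=256):
    """Count trainable parameters layer by layer (Dropout and add have none)."""
    counts = {
        # Dense(256) on the image features: weights + biases
        "image_dense": feature_dim * units + units,
        # Embedding(vocab_size, 256): one vector per word index
        "embedding": vocab_size * embed_dim,
        # LSTM(256): 4 gates, each with input weights, recurrent weights, biases
        "lstm": 4 * (units * (embed_dim + units) + units),
        # Dense(256) after the merge
        "merge_dense": units * units + units,
        # Dense(vocab_size, softmax) output layer
        "output_dense": units * vocab_size + vocab_size,
    }
    counts["total"] = sum(counts.values())
    return counts

counts = caption_model_param_count(vocab_size=1000)
for name, value in counts.items():
    print(f"{name:>13}: {value:,}")
```

The embedding and output layers both grow linearly with `vocab_size`, so for a real Flickr8k vocabulary they account for most of the parameters.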

---

## PHASE 5: TRAINING

Now we train the model! This phase runs the training loop with our data generator.

**Training Setup:**
```python
model = define_model(vocab_size, max_length)
epochs = 10
# `steps` must be defined beforehand: the number of batches in one pass over the data

dataset = data_generator(train_descriptions, train_features, tokenizer, max_length)
model.fit(dataset, epochs=epochs, steps_per_epoch=steps, verbose=1)
```

**What Happens During Training:**

Each step:
1. Generator produces batch of (image_features, caption_sequence) → next_word
2. Model predicts next word probabilities
3. Loss calculated: predicted vs actual next word
4. Backpropagation updates weights
5. Repeat for all batches

**Training Progress:**
- `steps_per_epoch`: Number of batches per epoch
- Typical: ~2000-3000 steps/epoch for Flickr8k
- Each epoch processes all training data once
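
The step count can be reproduced by hand on a small input. The sketch below mirrors the counting logic of the notebook's `get_steps_per_epoch()` in plain Python; `toy_descriptions` is a hypothetical three-caption dataset used only for illustration.

```python
def count_training_pairs(descriptions):
    """A caption of n tokens yields n - 1 (prefix, next word) pairs."""
    return sum(len(caption.split()) - 1
               for captions in descriptions.values()
               for caption in captions)

def steps_per_epoch(descriptions, batch_size=32):
    """Whole batches per epoch, with a floor of one step."""
    return max(1, count_training_pairs(descriptions) // batch_size)

# Hypothetical miniature dataset (not real Flickr8k captions)
toy_descriptions = {
    "img1.jpg": ["<start> dog runs <end>",
                 "<start> brown dog plays with ball <end>"],
    "img2.jpg": ["<start> child climbs stairs <end>"],
}

pairs = count_training_pairs(toy_descriptions)
print(pairs, steps_per_epoch(toy_descriptions, batch_size=4))
```

Because of the integer division, up to `batch_size - 1` pairs at the end of the pass are not counted in an epoch.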

**Saving Models:**
```python
model.save("models2/model_0.h5")  # After epoch 0
model.save("models2/model_1.h5")  # After epoch 1
...
```

**Expected Training Time:** ~30-60 minutes per epoch on GPU

---

## PHASE 6: INFERENCE (GENERATING CAPTIONS)

After training, we can generate captions for new images! This phase uses the trained model to predict captions word-by-word.

**Inference Workflow:**
```
1. Load new image
2. Extract features using Xception
3. Start with "<start>" token
4. Predict next word
5. Append word to sequence
6. Repeat until "<end>" or max_length
```

**Key Difference from Training:**
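- Training: Given the full caption, predict each next word from the ground-truth prefix
- Inference: Start with the start token and feed the model's own predictions back in, word by word
- Note: the Keras `Tokenizer` default filters strip `<` and `>`, so `<start>` and `<end>` are stored as `start` and `end`; this is why the inference code below uses the bare words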
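
To make the training side concrete, here is a small sketch of how one encoded caption is split into (prefix, next word) pairs, in the spirit of `create_sequences()`. It is plain Python rather than the notebook code: the token ids are hypothetical, the caption is assumed to be no longer than `max_length`, and the prefix is padded on the left because that is the default behaviour of `pad_sequences`.

```python
def make_training_pairs(token_ids, max_length):
    """Split one encoded caption into (left-padded prefix, next word) pairs."""
    pairs = []
    for i in range(1, len(token_ids)):
        prefix, next_word = token_ids[:i], token_ids[i]
        # 0 is the padding index; zeros go in front of the real tokens
        padded = [0] * (max_length - len(prefix)) + prefix
        pairs.append((padded, next_word))
    return pairs

# Hypothetical encoding of "start dog runs end"
caption_ids = [1, 23, 45, 2]
for padded, next_word in make_training_pairs(caption_ids, max_length=5):
    print(padded, "->", next_word)
```

During training every prefix comes from the reference caption, so one early mistake does not propagate; at inference time the prefix is whatever the model has generated so far.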

---

### Function 16: `word_for_id()` - Convert Index to Word

```python
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
```

**What it does:** Reverse lookup - converts word index back to word string

**Example:**
```python
# Tokenizer word_index (default filters strip "<" and ">"):
{
    'start': 1,
    'end': 2,
    'dog': 3,
    'runs': 4
}

word_for_id(3, tokenizer)
# Output: 'dog'

word_for_id(999, tokenizer)
# Output: None  (not in vocabulary)
```

**Role in Workflow:** Converts model's integer predictions back to readable words
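
`word_for_id()` scans the whole vocabulary on every call, and `generate_desc()` calls it once per generated word. One possible alternative, sketched below under the assumption that only the `word_index` mapping is needed, is to invert the dictionary once and reuse it. `toy_tokenizer` is a hypothetical stand-in for a fitted tokenizer; recent versions of the Keras `Tokenizer` also keep a ready-made `index_word` dictionary.

```python
from types import SimpleNamespace

def build_index_to_word(tokenizer):
    """Invert word_index once so each lookup is a single dict access."""
    return {index: word for word, index in tokenizer.word_index.items()}

# Hypothetical stand-in for a fitted tokenizer (only word_index is used)
toy_tokenizer = SimpleNamespace(
    word_index={"start": 1, "end": 2, "dog": 3, "runs": 4}
)

index_to_word = build_index_to_word(toy_tokenizer)
print(index_to_word.get(3))    # 'dog'
print(index_to_word.get(999))  # None, matching word_for_id for unknown ids
```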

### Function 17: `generate_desc()` - Generate Caption for Image

This is the **INFERENCE FUNCTION** - generates captions word-by-word!

```python
def generate_desc(model, tokenizer, photo, max_length):
    in_text = 'start'  # Start with "start" token
    
    for i in range(max_length):
        # 1. Tokenize current caption
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        
        # 2. Pad to max_length
        sequence = pad_sequences([sequence], maxlen=max_length)
        
        # 3. Predict next word
        pred = model.predict([photo, sequence], verbose=0)
        pred = np.argmax(pred)  # Get highest probability word index
        
        # 4. Convert index to word
        word = word_for_id(pred, tokenizer)
        if word is None:
            break
        
        # 5. Append word to caption
        in_text += ' ' + word
        
        # 6. Stop if we predict "end"
        if word == 'end':
            break
    
    return in_text
```

**How It Works - Step by Step:**

**Given:** Image of a dog running

**Iteration 1:**
```
in_text = "start"
sequence = [0, ..., 0, 1]  (left-padded to max_length)
Prediction: [0.001, 0.002, ..., 0.245, ...]  → argmax = 23 → "dog"
in_text = "start dog"
```

**Iteration 2:**
```
in_text = "start dog"
sequence = [0, ..., 0, 1, 23]
Prediction: [0.003, 0.001, ..., 0.189, ...]  → argmax = 45 → "runs"
in_text = "start dog runs"
```

**Iteration 3:**
```
in_text = "start dog runs"
sequence = [0, ..., 0, 1, 23, 45]
Prediction: [0.001, 0.002, 0.678, ...]  → argmax = 2 → "end"
in_text = "start dog runs end"
STOP (word == "end")
```

**Final Output:** `"start dog runs end"`

**Key Points:**
- Autoregressive: Each prediction depends on previous predictions
- Greedy search: Always picks highest probability word (not beam search)
- Stops at "end" token or max_length
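
For contrast with the greedy strategy used here, the sketch below shows a small beam search over a toy next-word table. It is not part of the notebook: `TOY_NEXT` is a hypothetical lookup that stands in for `model.predict`, and the probabilities are invented so that the greedy choice (beam width 1) and the wider beam disagree.

```python
import math

# Hypothetical next-word probabilities keyed by the previous word
TOY_NEXT = {
    "start": {"dog": 0.6, "child": 0.4},
    "dog": {"runs": 0.55, "sits": 0.45},
    "child": {"plays": 0.9, "sits": 0.1},
    "runs": {"end": 1.0},
    "sits": {"end": 1.0},
    "plays": {"end": 1.0},
}

def beam_search(next_probs, beam_width, max_length):
    """Keep the beam_width best partial captions by total log-probability."""
    beams = [(0.0, ["start"])]
    for _ in range(max_length):
        candidates = []
        for score, words in beams:
            if words[-1] == "end":
                candidates.append((score, words))  # finished caption, keep as is
                continue
            for word, prob in next_probs[words[-1]].items():
                candidates.append((score + math.log(prob), words + [word]))
        # Highest log-probability first, then truncate to the beam width
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(words[-1] == "end" for _, words in beams):
            break
    return " ".join(beams[0][1])

print(beam_search(TOY_NEXT, beam_width=1, max_length=5))  # greedy behaviour
print(beam_search(TOY_NEXT, beam_width=2, max_length=5))
```

With a width of 1 the search commits to "dog" because it is the best first word; with a width of 2 it keeps "child" alive long enough to find a caption with a higher overall probability. The cost is roughly `beam_width` times more model calls per step.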

**Example Usage:**
```python
# Load trained model
model = load_model('models2/model_9.h5')

# Extract features from new image
photo = extract_features('dog.jpg', xception_model)

# Generate caption
caption = generate_desc(model, tokenizer, photo, max_length)
print(caption)
# Output: "start dog runs across the grass end"
```

**Role in Workflow:** Core inference function that generates captions for new images

### Complete Inference Pipeline

**This cell shows the full inference workflow:**

```python
# 1. Load pre-trained Xception for feature extraction
xception_model = Xception(include_top=False, pooling='avg')

# 2. Load trained caption model
model = load_model('models2/model_9.h5')

# 3. Load tokenizer
tokenizer = load(open('tokenizer.p', 'rb'))

# 4. Load new image and extract features
img_path = 'test_image.jpg'
photo = extract_features(img_path, xception_model)

# 5. Generate caption
caption = generate_desc(model, tokenizer, photo, max_length)

# 6. Clean up: drop the 'start'/'end' marker tokens
# (str.replace would also alter words that contain them, e.g. "friend")
caption = ' '.join(w for w in caption.split() if w not in ('start', 'end'))
print("Caption:", caption)
```

**Example Outputs:**
```
Image: dog_running.jpg
Caption: "dog runs across the grass"

Image: child_playing.jpg
Caption: "child in pink dress is climbing up set of stairs"

Image: beach_sunset.jpg
Caption: "person standing on beach at sunset"
```

**Tips for Better Captions:**
- Later checkpoints usually caption better than early ones (e.g. `model_9.h5` vs `model_0.h5`), provided the model has not started to overfit
- Images similar to training data work best
- Model may struggle with unusual objects not in Flickr8k

---

## SUMMARY: Complete Workflow

**PHASE 1: Data Preparation**
- `load_doc()` → Load token file
- `all_img_captions()` → Parse token format, group by image
- `cleaning_text()` → Remove punctuation, lowercase
- `text_vocabulary()` → Build word set
- `save_descriptions()` → Save cleaned captions

**PHASE 2: Feature Extraction**
- `Xception` → Load pre-trained CNN
- `extract_features()` → Convert images to 2048-dim vectors
- Save to `features.p`

**PHASE 3: Tokenization**
- `load_photos()` → Load train/test split
- `load_clean_descriptions()` → Add `<start>` and `<end>` tokens
- `load_features()` → Filter features for train/test
- `dict_to_list()` → Flatten captions
- `create_tokenizer()` → Build word-to-index mapping
- `max_length()` → Find longest caption

**PHASE 4: Model Building**
- `define_model()` → Build encoder-decoder architecture
  - Image Encoder: Dense(256)
  - Text Decoder: Embedding + LSTM(256)
  - Merge: add() + softmax

**PHASE 5: Training**
- `create_sequences()` → Create (image, partial_caption) → next_word pairs
- `data_generator()` → Batch data generator
- `model.fit()` → Train the model

**PHASE 6: Inference**
- `extract_features()` → Get features for new image
- `word_for_id()` → Convert index to word
- `generate_desc()` → Generate caption word-by-word

**Key Differences from PyTorch:**
- Token format: `image.jpg#0` (TensorFlow) vs `image.jpg,caption` (PyTorch)
- Data loading: `tf.data.Dataset` generator vs PyTorch DataLoader
- Model: Functional API vs Sequential/nn.Module
- Training: `model.fit()` vs manual training loop

**Dataset:** Flickr8k - 8,000 images, 5 captions each, ~8,700 unique words

---

In [None]:
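# Standard library and third-party modules used by the cells below
import os
import string
from pickle import dump, load

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from PIL import Image
from keras.applications.xception import Xception, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.models import Model,load_model
from tensorflow.keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical, get_file
from keras.layers import Input, Dense, LSTM, Embedding, Dropout, add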

In [6]:
def load_doc(filename):
    # Read the whole file into a single string; the file is closed automatically
    with open(filename, 'r') as file:
        return file.read()

In [7]:
def all_img_captions(filename):
    file =load_doc(filename)
    captions=file.split('\n')
    descriptions={}
    for caption in captions[:-1]:
        img, caption_text=caption.split('\t')
        if img[:-2] not in descriptions:
            descriptions[img[:-2]]=[caption_text]
        else:
            descriptions[img[:-2]].append(caption_text)
    return descriptions

In [8]:
def cleaning_text(descriptions):
    table=str.maketrans('','',string.punctuation)
    for key, desc_list in descriptions.items():
        for i in range(len(desc_list)):
            desc=desc_list[i]
            desc=desc.split()
            desc=[word.lower() for word in desc]
            desc=[word.translate(table) for word in desc]
            desc=[word for word in desc if len(word)>1]
            desc_list[i]=' '.join(desc)
    return descriptions

In [11]:
def text_vocabulary(descriptions):
    all_desc=set()
    for key in descriptions.keys():
        [all_desc.update(d.split()) for d in descriptions[key]]
    return all_desc

In [9]:
def save_descriptions(descriptions, filename):
    lines=[]
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(f"{key}\t{desc}")
    data='\n'.join(lines)
    file=open(filename, 'w')
    file.write(data)
    file.close()

In [None]:
import time

# Set these paths according to the project folder on your system
dataset_text = "/Users/sreemanti/Documents/youtube/youtube-teach/image caption generator/Flickr8k_text"
dataset_images = "/Users/sreemanti/Documents/youtube/youtube-teach/image caption generator/Flicker8k_Dataset"

#we prepare our text data
filename = dataset_text + "/" + "Flickr8k.token.txt"
#loading the file that contains all data
#mapping them into descriptions dictionary img to 5 captions
descriptions = all_img_captions(filename)
print("Length of descriptions =" ,len(descriptions))

#cleaning the descriptions
clean_descriptions = cleaning_text(descriptions)

#building vocabulary 
vocabulary = text_vocabulary(clean_descriptions)
print("Length of vocabulary = ", len(vocabulary))

#saving each description to file 
save_descriptions(clean_descriptions, "descriptions.txt")

def download_with_retry(url, filename, max_retries=3):
    for attempt in range(max_retries):
        try:
            return get_file(filename, url)
        except Exception as e:
            if attempt == max_retries - 1:
                raise e
            print(f"Download attempt {attempt + 1} failed. Retrying in 5 seconds...")
            time.sleep(5)


In [None]:
# Load Xception without its classification head, using weights downloaded with retries
from tqdm import tqdm


weights_url = "https://storage.googleapis.com/tensorflow/keras-applications/xception/xception_weights_tf_dim_ordering_tf_kernels_notop.h5"
weights_path = download_with_retry(weights_url, 'xception_weights.h5')
model = Xception(include_top=False, pooling='avg', weights=weights_path)

def extract_features(directory):
    features = {}
    valid_images = ['.jpg', '.jpeg', '.png']  # Add other formats if needed
    
    for img in tqdm(os.listdir(directory)):
        # Skip files that don't end with valid image extensions
        ext = os.path.splitext(img)[1].lower()
        if ext not in valid_images:
            continue
            
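        filename = os.path.join(directory, img)
        # Force 3 channels so grayscale or RGBA files match Xception's input shape
        image = Image.open(filename).convert('RGB')
        image = image.resize((299,299))
        image = np.expand_dims(np.array(image), axis=0)
        # Scale pixels from [0, 255] to [-1, 1]
        image = image/127.5
        image = image - 1.0

        feature = model.predict(image, verbose=0)
        features[img] = feature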
    return features

# 2048 feature vector
features = extract_features(dataset_images)
dump(features, open("features.p","wb"))


In [None]:

features = load(open("features.p","rb"))

#load the data 
def load_photos(filename):
    file = load_doc(filename)
    photos = file.split("\n")[:-1]
    photos_present = [photo for photo in photos if os.path.exists(os.path.join(dataset_images, photo))]
    return photos_present


In [None]:

def load_clean_descriptions(filename, photos): 
    #loading clean_descriptions
    file = load_doc(filename)
    descriptions = {}
    for line in file.split("\n"):

        words = line.split()
        if len(words)<1 :
            continue

        image, image_caption = words[0], words[1:]

        if image in photos:
            if image not in descriptions:
                descriptions[image] = []
            desc = '<start> ' + " ".join(image_caption) + ' <end>'
            descriptions[image].append(desc)

    return descriptions

In [None]:


def load_features(photos):
    #loading all features
    all_features = load(open("features.p","rb"))
    #selecting only needed features
    features = {k:all_features[k] for k in photos}
    return features

In [None]:
filename = dataset_text + "/" + "Flickr_8k.trainImages.txt"

#train = loading_data(filename)
train_imgs = load_photos(filename)
train_descriptions = load_clean_descriptions("descriptions.txt", train_imgs)
train_features = load_features(train_imgs)


In [None]:

#converting dictionary to clean list of descriptions
def dict_to_list(descriptions):
    all_desc = []
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc

In [None]:

#creating tokenizer class 
#this will vectorise text corpus
#each integer will represent token in dictionary

def create_tokenizer(descriptions):
    desc_list = dict_to_list(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(desc_list)
    return tokenizer

In [None]:


# give each word an index, and store that into tokenizer.p pickle file
tokenizer = create_tokenizer(train_descriptions)
dump(tokenizer, open('tokenizer.p', 'wb'))
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)


In [None]:


#calculate maximum length of descriptions
def max_length(descriptions):
    desc_list = dict_to_list(descriptions)
    return max(len(d.split()) for d in desc_list)
    
# NOTE: from here on, max_length is an integer and no longer the function above
max_length = max_length(train_descriptions)
print(max_length)

In [None]:

#create input-output sequence pairs from the image description.

#data generator, used by model.fit()
def data_generator(descriptions, features, tokenizer, max_length):
    def generator():
        while True:
            for key, description_list in descriptions.items():
                feature = features[key][0]
                input_image, input_sequence, output_word = create_sequences(tokenizer, max_length, description_list, feature)
                for i in range(len(input_image)):
                    yield {'input_1': input_image[i], 'input_2': input_sequence[i]}, output_word[i]
    
    # Define the output signature for the generator
    output_signature = (
        {
            'input_1': tf.TensorSpec(shape=(2048,), dtype=tf.float32),
            'input_2': tf.TensorSpec(shape=(max_length,), dtype=tf.int32)
        },
        tf.TensorSpec(shape=(vocab_size,), dtype=tf.float32)
    )
    
    # Create the dataset
    dataset = tf.data.Dataset.from_generator(
        generator,
        output_signature=output_signature
    )
    
    return dataset.batch(32)

In [None]:

def create_sequences(tokenizer, max_length, desc_list, feature):
    X1, X2, y = list(), list(), list()
    # walk through each description for the image
    for desc in desc_list:
        # encode the sequence
        seq = tokenizer.texts_to_sequences([desc])[0]
        # split one sequence into multiple X,y pairs
        for i in range(1, len(seq)):
            # split into input and output pair
            in_seq, out_seq = seq[:i], seq[i]
            # pad input sequence
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            # encode output sequence
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # store
            X1.append(feature)
            X2.append(in_seq)
            y.append(out_seq)
    return np.array(X1), np.array(X2), np.array(y)

In [None]:

# Check the shapes of one batch of model inputs and outputs
dataset = data_generator(train_descriptions, train_features, tokenizer, max_length)
for (a, b) in dataset.take(1):
    print(a['input_1'].shape, a['input_2'].shape, b.shape)

In [None]:

from keras.utils import plot_model

# define the captioning model
def define_model(vocab_size, max_length):

    # features from the CNN model squeezed from 2048 to 256 nodes
    inputs1 = Input(shape=(2048,), name='input_1')
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)

    # LSTM sequence model
    inputs2 = Input(shape=(max_length,), name='input_2')
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)

    # Merging both models
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)

    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')

    # summarize model (summary() prints directly and returns None)
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)

    return model

In [None]:

# train our model
print('Dataset: ', len(train_imgs))
print('Descriptions: train=', len(train_descriptions))
print('Photos: train=', len(train_features))
print('Vocabulary Size:', vocab_size)
print('Description Length: ', max_length)

In [None]:

model = define_model(vocab_size, max_length)
epochs = 10

def get_steps_per_epoch(train_descriptions):
    total_sequences = 0
    for img_captions in train_descriptions.values():
        for caption in img_captions:
            words = caption.split()
            total_sequences += len(words) - 1
    # Ensure at least 1 step, even if sequences < batch_size
    return max(1, total_sequences // 32)

# Update training loop
steps = get_steps_per_epoch(train_descriptions)

# create the models2 directory (if it does not exist yet) to save one checkpoint per epoch
os.makedirs("models2", exist_ok=True)
for i in range(epochs):
    dataset = data_generator(train_descriptions, train_features, tokenizer, max_length)
    # epochs=1: the outer loop already iterates over the epochs
    model.fit(dataset, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save("models2/model_" + str(i) + ".h5")

In [None]:
import argparse

ap = argparse.ArgumentParser()
ap.add_argument('-i', '--image', required=True, help="Image Path")
args = vars(ap.parse_args())
img_path = args['image']

def extract_features(filename, model):
    try:
        image = Image.open(filename)
    except OSError:
        # Re-raise: continuing without an image would only fail later
        print("ERROR: Couldn't open image! Make sure the image path and extension are correct")
        raise
    # convert to 3-channel RGB (handles grayscale and 4-channel images)
    image = image.convert('RGB').resize((299,299))
    image = np.array(image)
    image = np.expand_dims(image, axis=0)
    # scale pixels from [0, 255] to [-1, 1]
    image = image/127.5
    image = image - 1.0
    feature = model.predict(image)
    return feature


In [None]:

def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None


In [None]:


def generate_desc(model, tokenizer, photo, max_length):
    in_text = 'start'
    for i in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        pred = model.predict([photo,sequence], verbose=0)
        pred = np.argmax(pred)
        word = word_for_id(pred, tokenizer)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'end':
            break
    return in_text


In [None]:

#path = 'Flicker8k_Dataset/111537222_07e56d5a30.jpg'
max_length = 32  # must match the max_length value printed during training
tokenizer = load(open("tokenizer.p","rb"))
vocab_size = len(tokenizer.word_index) + 1

# First define the model architecture
model = define_model(vocab_size, max_length)
# Then load the weights saved by the training loop (directory "models2")
model.load_weights('models2/model_9.h5')
xception_model = Xception(include_top=False, pooling="avg")

photo = extract_features(img_path, xception_model)
img = Image.open(img_path)

description = generate_desc(model, tokenizer, photo, max_length)
print("\n\n")
print(description)
plt.imshow(img)
