# Image-to-Image Translation via Pix2Pix

> **Reference Article:** [Image-to-Image Translation with Conditional Adversarial Networks by Isola et al. (2017)](https://arxiv.org/abs/1611.07004)

This notebook demonstrates how to run the Pix2Pix implementation for **semantic label-to-image translation** using the **Cityscapes dataset**. All the core logic (models, dataset, training loop) is defined in the `src/` directory. This notebook serves as an execution and visualization front-end.

## 1. **Introduction**

**Image-to-image translation** is a powerful tool for converting an image from one domain to another while preserving structural characteristics. Tasks like converting **semantic label maps** to **realistic images**, **day-to-night transformation**, or **edge-to-photo synthesis** all fall under this category.

One of the most influential works in this field is **Pix2Pix**, which is based on **conditional Generative Adversarial Networks (cGANs)**. The model learns a mapping from input images to target images from paired training examples, with both the generator and the discriminator conditioned on the input image. It combines the **adversarial loss** from GANs with an **L1 reconstruction loss** so that outputs are both realistic and faithful to the paired target.

## 2. **Model Architecture Deep-Dive**

The Pix2Pix model consists of two main components: a **Generator** and a **Discriminator**, trained adversarially.

### **The Generator: U-Net**

The generator uses a **U-Net architecture**. This is an encoder-decoder network with a key feature: **skip connections**.

1.  **Encoder:** Progressively downsamples the input image (e.g., 256x256 -> 1x1) to extract features, learning the high-level *content*.
2.  **Decoder:** Progressively upsamples the feature maps back to the original image size.
3.  **Skip Connections:** These connections link each encoder layer directly to the decoder layer at the same resolution. This is crucial because it lets low-level information (like edges and object boundaries) from the input map bypass the bottleneck and reach the decoder layers that produce the output, so the output structure matches the input structure.

#### **Architecture**

The generator consists of:

- **DownSampling Layers:** Successive convolutional layers that downsample the input image while applying leaky ReLU activations and (optionally) batch normalization.
- **UpSampling Layers:** Transposed convolutional layers that upsample the feature maps. These layers also include skip connections, where the output of each upsampling layer is concatenated with the corresponding encoder output.
- **Final Layer:** A convolution maps the features to the number of output channels (3 in general, except in colorization, where it is 2), then a Tanh activation bounds the output pixel values to the range [-1, 1].
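Because each skip connection concatenates along the channel axis, every decoder layer after the first receives more input channels than the previous layer produced. The sketch below works through that bookkeeping for a small, hypothetical four-level U-Net; the helper name `decoder_channels` and the channel widths are illustrative assumptions, not values read from `src/`.

```python
def decoder_channels(encoder_out, decoder_out):
    """Return (in_channels, out_channels) for each decoder layer,
    plus the number of channels entering the final output layer.

    encoder_out: output channels of each encoder layer, shallowest first;
                 the last entry is the bottleneck.
    decoder_out: output channels of each decoder layer, deepest first.
    """
    # Every encoder output except the bottleneck is reused as a skip, deepest first.
    skips = encoder_out[:-1][::-1]
    assert len(skips) == len(decoder_out), "one skip per decoder layer"

    layers = []
    in_ch = encoder_out[-1]          # the first decoder layer sees only the bottleneck
    for out_ch, skip_ch in zip(decoder_out, skips):
        layers.append((in_ch, out_ch))
        in_ch = out_ch + skip_ch     # concatenation adds the skip's channels
    return layers, in_ch


# Hypothetical 4-level example (not the configuration used in src/).
layers, final_in = decoder_channels([64, 128, 256, 512], [256, 128, 64])
print(layers)    # [(512, 256), (512, 128), (256, 64)]
print(final_in)  # 128
```

The practical consequence is that the `in_channels` argument of a decoder layer has to account for the concatenated skip, which is a common source of shape errors when writing a U-Net by hand.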

##### **Encoder Layers**

| Layer             | Output Shape                    | Kernel Size | Stride | Padding |
|-------------------|---------------------------------|-------------|--------|---------|
| Conv2D            | (batch_size, 64, H/2, W/2)      | 4x4         | 2      | 1       |
| LeakyReLU (0.2)   | (batch_size, 64, H/2, W/2)      | -           | -      | -       |
| Conv2D            | (batch_size, 128, H/4, W/4)     | 4x4         | 2      | 1       |
| BatchNorm2D       | (batch_size, 128, H/4, W/4)     | -           | -      | -       |
| LeakyReLU (0.2)   | (batch_size, 128, H/4, W/4)     | -           | -      | -       |
| Conv2D            | (batch_size, 256, H/8, W/8)     | 4x4         | 2      | 1       |
| BatchNorm2D       | (batch_size, 256, H/8, W/8)     | -           | -      | -       |
| LeakyReLU (0.2)   | (batch_size, 256, H/8, W/8)     | -           | -      | -       |
| Conv2D            | (batch_size, 512, H/16, W/16)   | 4x4         | 2      | 1       |
| BatchNorm2D       | (batch_size, 512, H/16, W/16)   | -           | -      | -       |
| LeakyReLU (0.2)   | (batch_size, 512, H/16, W/16)   | -           | -      | -       |
| Conv2D            | (batch_size, 512, H/32, W/32)   | 4x4         | 2      | 1       |
| BatchNorm2D       | (batch_size, 512, H/32, W/32)   | -           | -      | -       |
| LeakyReLU (0.2)   | (batch_size, 512, H/32, W/32)   | -           | -      | -       |
| Conv2D            | (batch_size, 512, H/64, W/64)   | 4x4         | 2      | 1       |
| BatchNorm2D       | (batch_size, 512, H/64, W/64)   | -           | -      | -       |
| LeakyReLU (0.2)   | (batch_size, 512, H/64, W/64)   | -           | -      | -       |
| Conv2D            | (batch_size, 512, H/128, W/128) | 4x4         | 2      | 1       |
| BatchNorm2D       | (batch_size, 512, H/128, W/128) | -           | -      | -       |
| LeakyReLU (0.2)   | (batch_size, 512, H/128, W/128) | -           | -      | -       |
| Conv2D            | (batch_size, 512, H/256, W/256) | 4x4         | 2      | 1       |
| LeakyReLU (0.2)   | (batch_size, 512, H/256, W/256) | -           | -      | -       |

##### **Decoder Layers**

| Layer               | Output Shape                    | Kernel Size | Stride | Padding |
|---------------------|---------------------------------|-------------|--------|---------|
| ConvTranspose2D     | (batch_size, 512, H/128, W/128) | 4x4         | 2      | 1       |
| BatchNorm2D         | (batch_size, 512, H/128, W/128) | -           | -      | -       |
| ReLU                | (batch_size, 512, H/128, W/128) | -           | -      | -       |
| Skip Connection     | Concatenate with Encoder        | -           | -      | -       |
| ConvTranspose2D     | (batch_size, 512, H/64, W/64)   | 4x4         | 2      | 1       |
| BatchNorm2D         | (batch_size, 512, H/64, W/64)   | -           | -      | -       |
| ReLU                | (batch_size, 512, H/64, W/64)   | -           | -      | -       |
| Skip Connection     | Concatenate with Encoder        | -           | -      | -       |
| ConvTranspose2D     | (batch_size, 512, H/32, W/32)   | 4x4         | 2      | 1       |
| BatchNorm2D         | (batch_size, 512, H/32, W/32)   | -           | -      | -       |
| ReLU                | (batch_size, 512, H/32, W/32)   | -           | -      | -       |
| Skip Connection     | Concatenate with Encoder        | -           | -      | -       |
| ConvTranspose2D     | (batch_size, 512, H/16, W/16)   | 4x4         | 2      | 1       |
| BatchNorm2D         | (batch_size, 512, H/16, W/16)   | -           | -      | -       |
| ReLU                | (batch_size, 512, H/16, W/16)   | -           | -      | -       |
| Skip Connection     | Concatenate with Encoder        | -           | -      | -       |
| ConvTranspose2D     | (batch_size, 256, H/8, W/8)     | 4x4         | 2      | 1       |
| BatchNorm2D         | (batch_size, 256, H/8, W/8)     | -           | -      | -       |
| ReLU                | (batch_size, 256, H/8, W/8)     | -           | -      | -       |
| Skip Connection     | Concatenate with Encoder        | -           | -      | -       |
| ConvTranspose2D     | (batch_size, 128, H/4, W/4)     | 4x4         | 2      | 1       |
| BatchNorm2D         | (batch_size, 128, H/4, W/4)     | -           | -      | -       |
| ReLU                | (batch_size, 128, H/4, W/4)     | -           | -      | -       |
| Skip Connection     | Concatenate with Encoder        | -           | -      | -       |
| ConvTranspose2D     | (batch_size, 64, H/2, W/2)      | 4x4         | 2      | 1       |
| BatchNorm2D         | (batch_size, 64, H/2, W/2)      | -           | -      | -       |
| ReLU                | (batch_size, 64, H/2, W/2)      | -           | -      | -       |
| Skip Connection     | Concatenate with Encoder        | -           | -      | -       |
| ConvTranspose2D     | (batch_size, 3, H, W)           | 4x4         | 2      | 1       |
| Tanh                | (batch_size, 3, H, W)           | -           | -      | -       |
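The spatial sizes in these tables follow from the standard output-size formulas for a convolution and a transposed convolution (dilation 1, no output padding). The helpers below are a minimal sketch of that arithmetic; the function names are illustrative and not taken from `src/`.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Spatial size after a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel=4, stride=2, padding=1):
    """Spatial size after a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


# A 4x4 / stride-2 / padding-1 convolution halves the size at every step.
sizes = [256]
while sizes[-1] > 1:
    sizes.append(conv_out(sizes[-1]))
print(sizes)  # [256, 128, 64, 32, 16, 8, 4, 2, 1]

# The matching transposed convolution doubles it again.
print(conv_transpose_out(128))  # 256
```

With these settings, reducing a 256x256 input all the way to 1x1 takes eight stride-2 convolutions, and every stride-2 transposed convolution in the decoder undoes exactly one of them.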


<br/>

### **The Discriminator: PatchGAN**

The discriminator is a **PatchGAN**. Instead of classifying the *entire* image as "real" or "fake" (which can be easy for the generator to fool), the PatchGAN outputs a grid (e.g., $30 \times 30$).

Each cell in this grid represents the discriminator's verdict for a specific **patch** (e.g., $70 \times 70$) of the input image. By averaging the results of all patches, the discriminator evaluates the realism of high-frequency details across the entire image. This forces the generator to produce sharp, realistic textures rather than just a blurry-but-plausible image.

#### **Architecture**

The layers are as follows:

1. **Convolutional Layer:** Extracts features from the input image with 4x4 filters and stride 2.
2. **Leaky ReLU Activation:** Applies non-linearity with a slope of 0.2 for negative values.
3. **Batch Normalization:** Normalizes the activations to stabilize training.
4. **Final Convolution:** Produces a matrix of patch-level predictions.

<br/>

| Layer             | Output Shape                  | Kernel Size | Stride | Padding |
|--------------------|-------------------------------|-------------|--------|---------|
| Conv2D            | (batch_size, 64, H/2, W/2)    | 4x4         | 2      | 1       |
| LeakyReLU (0.2)   | (batch_size, 64, H/2, W/2)    | -           | -      | -       |
| Conv2D            | (batch_size, 128, H/4, W/4)   | 4x4         | 2      | 1       |
| BatchNorm2D       | (batch_size, 128, H/4, W/4)   | -           | -      | -       |
| LeakyReLU (0.2)   | (batch_size, 128, H/4, W/4)   | -           | -      | -       |
| Conv2D            | (batch_size, 256, H/8, W/8)   | 4x4         | 2      | 1       |
| BatchNorm2D       | (batch_size, 256, H/8, W/8)   | -           | -      | -       |
| LeakyReLU (0.2)   | (batch_size, 256, H/8, W/8)   | -           | -      | -       |
| Conv2D            | (batch_size, 512, H/8 - 1, W/8 - 1) | 4x4     | 1      | 1       |
| BatchNorm2D       | (batch_size, 512, H/8 - 1, W/8 - 1) | -       | -      | -       |
| LeakyReLU (0.2)   | (batch_size, 512, H/8 - 1, W/8 - 1) | -       | -      | -       |
| Conv2D (Final)    | (batch_size, 1, H/8 - 2, W/8 - 2)   | 4x4     | 1      | 1       |
| Sigmoid           | (batch_size, 1, H/8 - 2, W/8 - 2)   | -       | -      | -       |
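The $30 \times 30$ grid and the $70 \times 70$ patch size quoted above can both be derived from the kernel, stride, and padding values in the table. The following sketch does that arithmetic in plain Python; `PATCHGAN_LAYERS` simply transcribes the five convolutions listed, and the helper names are illustrative.

```python
# (kernel, stride, padding) for the five convolutions in the table above.
PATCHGAN_LAYERS = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 1), (4, 1, 1)]


def patch_grid_size(size, layers):
    """Side length of the discriminator's output grid for a square input."""
    for kernel, stride, padding in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size


def receptive_field(layers):
    """Side length of the input patch seen by one output cell."""
    field, jump = 1, 1  # jump = distance in input pixels between adjacent cells
    for kernel, stride, _ in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


print(patch_grid_size(256, PATCHGAN_LAYERS))  # 30
print(receptive_field(PATCHGAN_LAYERS))       # 70
```

The two stride-1 layers each shrink the grid by one cell (32 to 31 to 30 for a 256x256 input) while still enlarging the receptive field, which is why the output is not simply H/8.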


## 3. **Loss Functions**

The generator's goal is twofold, balanced by its loss function:

1.  **Adversarial Loss ($\mathcal{L}_{GAN}$):** Encourages the generator to produce images that are indistinguishable from real images (i.e., to "fool" the discriminator). This is a standard cGAN loss:
    $$
    \mathcal{L}_{GAN}(G, D) = \mathbb{E}_{x, y}[\log D(x, y)] + \mathbb{E}_{x}[\log (1 - D(x, G(x)))]
    $$
    Where:
    - $G$ is the generator.
    - $D$ is the discriminator.
    - $x$ is the input (segmented) image.
    - $y$ is the real (target) image.

2.  **L1 Loss ($\mathcal{L}_{L1}$):** To ensure that the generated images are not only realistic but also structurally similar to the target images, an **L1 loss** is applied:
    $$
    \mathcal{L}_{L1}(G) = \mathbb{E}_{x, y} \left[ \| y - G(x) \|_1 \right]
    $$
    The L1 loss penalizes the absolute pixel-level difference between the generated image $G(x)$ and the real image $y$.

**Total Generator Loss:**
The final objective combines these two losses:
$$
\mathcal{L}_{Generator} = \mathcal{L}_{GAN} + \lambda \cdot \mathcal{L}_{L1}
$$
Here, $\lambda$ is a scaling factor that balances the adversarial loss and the L1 loss. It is typically set to $\lambda = 100$, which weights the L1 term much more heavily.
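To make the weighting concrete, here is a small worked example in plain Python. It is a sketch under stated assumptions rather than the code in `src/`: the adversarial term uses the non-saturating form $-\log D(x, G(x))$ (binary cross-entropy against a "real" label) that the paper uses in practice, and the discriminator outputs and pixel values are made-up numbers.

```python
import math


def bce(prob, target, eps=1e-12):
    """Binary cross-entropy for one probability and a 0/1 target."""
    prob = min(max(prob, eps), 1.0 - eps)  # guard against log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def generator_loss(d_fake_probs, fake_pixels, real_pixels, lam=100.0):
    """Adversarial term (fool D) plus lambda-weighted L1 term."""
    adversarial = sum(bce(p, 1.0) for p in d_fake_probs) / len(d_fake_probs)
    l1 = sum(abs(f - r) for f, r in zip(fake_pixels, real_pixels)) / len(real_pixels)
    return adversarial + lam * l1, adversarial, l1


# Illustrative values: D is undecided on two patches, two pixels are slightly off.
total, adversarial, l1 = generator_loss(
    d_fake_probs=[0.5, 0.5],
    fake_pixels=[0.0, 0.2],
    real_pixels=[0.1, 0.0],
)
print(round(adversarial, 4))  # 0.6931
print(round(l1, 4))           # 0.15
print(round(total, 4))        # 15.6931
```

Even a modest mean pixel error of 0.15 contributes 15 to the total once multiplied by $\lambda = 100$, so the reconstruction term dwarfs the adversarial term in this example.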

## 4. **Setup and Training**

First, let's set up the environment by installing dependencies and downloading the dataset. 

**Note:** You must have a `kaggle.json` API key file in the root of this repository for the setup script to work.
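A quick check before running the setup cell can save a failed download. This is a minimal sketch; the helper name `has_kaggle_key` is hypothetical, and it assumes the notebook is run from the repository root.

```python
from pathlib import Path


def has_kaggle_key(repo_root="."):
    """Return True if a kaggle.json file exists directly under repo_root."""
    return (Path(repo_root) / "kaggle.json").is_file()


if not has_kaggle_key():
    print("kaggle.json not found in the repository root; the dataset setup will likely fail.")
```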

In [None]:
# Install dependencies
!pip install -r requirements.txt

In [None]:
# Run the dataset setup script
# This will download and unzip the Cityscapes dataset
!bash setup_dataset.sh

Now, we can run the main training script. All logging will appear in the console and be saved to `logs/pix2pix.log`.

We will train for 50 epochs with a batch size of 16. This may take a significant amount of time (1-2 hours) depending on your GPU.

In [None]:
!python train.py --epochs 50 --batch-size 16

## 5. **Results and Analysis**

After training, all outputs (sample images, loss plots, and model checkpoints) are saved in the `outputs/` directory. Let's load and display them.

### Cost Function Analysis

We plot the Generator and Discriminator loss curves for both training and validation.

In [None]:
from IPython.display import Image, display
import os

train_loss_img = "outputs/plots/train_loss.png"
val_loss_img = "outputs/plots/val_loss.png"

if os.path.exists(train_loss_img):
    print("--- Training Loss ---")
    display(Image(filename=train_loss_img))
else:
    print(f"Could not find {train_loss_img}")

if os.path.exists(val_loss_img):
    print("--- Validation Loss ---")
    display(Image(filename=val_loss_img))
else:
    print(f"Could not find {val_loss_img}")

#### **Analysis**

* **Generator Loss (Train & Val):** The generator loss shows a clear **downward trend**, indicating that the generator is improving both at fooling the discriminator and at matching the target image (L1 loss). Because the L1 term is weighted by $\lambda$, it likely accounts for most of this curve.
* **Discriminator Loss (Train & Val):** The discriminator loss decreases initially but then **plateaus around 0.5-0.7**. This is consistent with a stable GAN equilibrium: the discriminator is neither overpowering the generator (loss near 0) nor failing to learn (loss > 1). It keeps learning to distinguish real from fake while the generator keeps pace.
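A useful reference point for reading the discriminator curve is the loss of a discriminator that cannot tell real from fake at all. The sketch below assumes the logged value is the binary cross-entropy averaged over the real and fake halves (the paper halves the discriminator objective in this way); the exact logging convention in `train.py` is not shown here, so treat this as an assumption.

```python
import math


def bce(prob, target):
    """Binary cross-entropy for one probability in (0, 1) and a 0/1 target."""
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def discriminator_loss(d_real, d_fake):
    """Average of the 'real' and 'fake' cross-entropy terms."""
    real_term = sum(bce(p, 1.0) for p in d_real) / len(d_real)
    fake_term = sum(bce(p, 0.0) for p in d_fake) / len(d_fake)
    return 0.5 * (real_term + fake_term)


# A discriminator at chance outputs 0.5 everywhere: the loss is ln 2.
print(round(discriminator_loss([0.5, 0.5], [0.5, 0.5]), 4))  # 0.6931

# A discriminator that is confidently right drives the loss toward 0.
print(round(discriminator_loss([0.99], [0.01]), 4))          # 0.0101
```

Under that assumption, a plateau between 0.5 and 0.7 sits just below the chance level of about 0.69, meaning the discriminator retains a small edge without dominating.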

### Generated Image Samples

Let's look at the sample images saved during training. We'll display the final set from epoch 50.

In [None]:
final_samples_img = "outputs/samples/epoch_050.png"

if os.path.exists(final_samples_img):
    print("--- Generated Samples (Epoch 50) ---")
    display(Image(filename=final_samples_img))
else:
    print(f"Could not find {final_samples_img}. Checking for other epochs...")
    # Fallback to display any available sample
    sample_dir = "outputs/samples/"
    if os.path.exists(sample_dir):
        all_samples = sorted([f for f in os.listdir(sample_dir) if f.endswith(".png")])
        if all_samples:
            print(f"Displaying last available sample: {all_samples[-1]}")
            display(Image(filename=os.path.join(sample_dir, all_samples[-1])))
        else:
            print("No sample images found.")
    else:
        print(f"Sample directory {sample_dir} does not exist.")

#### **Analysis**

The generated images successfully preserve the overall structure of both the segmented input and the real target image. The model is faithful to the general color scheme and density of features (e.g., green for trees, grey for roads).

However, the finer details appear **blurry** compared to the target images. This is a common characteristic of models that rely heavily on L1 loss: when several outputs are plausible for the same input, the L1 term is minimized by the per-pixel *median* of those outputs, which is a "safe" but blurry solution.
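The "safe" solution can be seen on a single pixel. In the sketch below, the three values in `plausible` are made-up intensities that one pixel might take across equally valid outputs; a brute-force search then finds the constant prediction that minimizes each loss.

```python
from statistics import mean, median

# Hypothetical intensities one pixel could take across equally plausible outputs.
plausible = [0.1, 0.2, 0.9]
candidates = [i / 100 for i in range(101)]  # constant predictions 0.00 .. 1.00


def l1_cost(c):
    return mean(abs(v - c) for v in plausible)


def l2_cost(c):
    return mean((v - c) ** 2 for v in plausible)


best_l1 = min(candidates, key=l1_cost)
best_l2 = min(candidates, key=l2_cost)

print(best_l1, median(plausible))  # 0.2 0.2
print(best_l2)                     # 0.4
```

The L1 optimum lands on the median and the L2 optimum on the mean. In both cases the prediction is one fixed compromise per pixel, so variation between plausible outputs (the texture) is lost unless the adversarial term pushes the generator to commit to a specific sharp output.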

The original Pix2Pix paper trained its Cityscapes model for 200 epochs, which gives the adversarial loss more time to push the generator beyond this blurry solution and produce sharper, more realistic high-frequency details. Given the stable loss curves, training beyond 50 epochs may yield noticeably sharper results.