# Neural Style Transfer - Deep Conceptual Understanding

## What is Neural Style Transfer?

**The Concept**: Neural Style Transfer is an optimization technique that uses deep learning to compose one image in the artistic style of another image. Imagine taking a photograph and recreating it as if it were painted by Van Gogh or Picasso.

**The Problem We're Solving**: 
- Traditional image filters can only apply simple transformations (blur, sharpen, color shifts)
- We want to capture the complex artistic "essence" of a painting and apply it to any photo
- This requires understanding both **what** is in an image (content) and **how** it looks (style)

**The Breakthrough Insight**: 
Convolutional Neural Networks (CNNs) naturally separate content from style at different layers:
- **Early layers** (near input) capture low-level features: edges, colors, simple textures
- **Deep layers** (near output) capture high-level features: object parts, shapes, semantic content

**Why This Works**:
1. **Content representation**: Deep layers know WHAT objects are in the image
2. **Style representation**: The correlations between features (Gram matrix) capture HOW the image looks (textures, patterns, colors) regardless of WHAT is shown
3. **Optimization**: We can adjust pixel values to match both content and style simultaneously

## What We Need:

1. **Pre-trained CNN**: VGG19 trained on millions of images - it already knows how to extract meaningful features
2. **Content Image**: The photo whose structure we want to preserve
3. **Style Image**: The artwork whose artistic style we want to copy
4. **Generated Image**: Starting from content, we'll optimize its pixels to match style
5. **Loss Functions**: Mathematical ways to measure content similarity and style similarity
6. **Optimizer**: Algorithm to adjust pixels to minimize the total loss

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models, transforms
from PIL import Image
import matplotlib.pyplot as plt
from tqdm import tqdm


## Step 1: Load Pre-trained VGG19 Model

### What We Did:
```python
vgg_19 = models.vgg19(pretrained=True).features.eval()
```

### Why VGG19?
1. **Deep architecture**: 19 layers means it learns hierarchical features
2. **Well-studied**: Its layers have been widely visualized and analyzed, so there is good intuition for what each depth responds to
3. **Pre-trained**: Already trained on ImageNet (1.2M images, 1000 categories)
4. **Good feature extractor**: Captures both low-level textures and high-level content

### Line-by-Line Explanation:

**`models.vgg19`**: Loads the VGG19 architecture
- VGG19 has 16 convolutional layers + 3 fully connected layers
- Named after Visual Geometry Group at Oxford

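**`pretrained=True`**: Downloads and loads weights trained on ImageNet
- **Why?** We need a model that already understands images
- Training from scratch would require millions of images and weeks of computation
- Pre-trained weights encode knowledge about edges, textures, object parts
- **Version note**: torchvision 0.13 and later deprecate this argument in favor of `weights=models.VGG19_Weights.IMAGENET1K_V1`; `pretrained=True` may print a warning there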

**`.features`**: Extracts only the convolutional feature extraction part
- **Why?** We don't need the classification layers (fully connected)
- We only want the feature maps, not predictions of "cat" or "dog"
- This gives us a `Sequential` of 37 modules (indices 0-36): 16 conv layers, their 16 ReLUs, and 5 max-pooling layers

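**`.eval()`**: Sets the model to evaluation mode
- **Why?** It switches off the training-time behavior of layers such as dropout and batch normalization
- The `.features` stack printed below contains only conv, ReLU, and max-pool layers, so here the call is a safeguard and a statement of intent: VGG19 is a fixed feature extractor, not something we train
- `.eval()` does not freeze weights or disable gradient tracking; the weights stay unchanged only because they are never handed to the optimizer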

### What This Achieves:
We now have a powerful "feature detector" that can look at any image and tell us:
- What low-level patterns it contains (early layers)
- What high-level objects it contains (deep layers)
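If you want the "fixed feature extractor" idea enforced by autograd as well, one option is to switch off `requires_grad` on every weight. The sketch below is a minimal illustration under stated assumptions: `tiny_features` is a hypothetical stand-in network (not VGG19) so the snippet runs without downloading weights; with the real model the same loop would be applied to `vgg_19.parameters()`.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for vgg_19 so the sketch runs without a download.
tiny_features = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 8, kernel_size=3, padding=1),
).eval()

# Freeze every weight: autograd no longer computes gradients for them.
for param in tiny_features.parameters():
    param.requires_grad_(False)

# The input image is the only tensor that still asks for gradients.
image = torch.rand(1, 3, 16, 16, requires_grad=True)
loss = tiny_features(image).pow(2).mean()
loss.backward()

weights_have_grad = any(p.grad is not None for p in tiny_features.parameters())
image_has_grad = image.grad is not None
print(weights_have_grad, image_has_grad)
```

Gradients still reach the input image, which is all style transfer needs; skipping the weight gradients mainly saves memory and time.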

In [12]:
vgg_19 = models.vgg19(pretrained=True).features.eval()
vgg_19



Sequential(
  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU(inplace=True)
  (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (3): ReLU(inplace=True)
  (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (6): ReLU(inplace=True)
  (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (8): ReLU(inplace=True)
  (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (11): ReLU(inplace=True)
  (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (13): ReLU(inplace=True)
  (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (15): ReLU(inplace=True)
  (16): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (17): ReLU(inplace=True)
  (18): MaxPoo

## Step 2: Create Modified VGG Model for Feature Extraction

### What We Did:
Created a custom wrapper that extracts features from specific intermediate layers instead of just the final output.

### Why We Need This:
- Default VGG19 only returns the final output
- We need features from MULTIPLE layers (early, middle, deep)
- Different layers capture different aspects of the image

### The Code Explained:

```python
class ModifiedVGG(nn.Module):
```
**Why inherit from `nn.Module`?** Makes our class a proper PyTorch model with forward propagation capabilities.

```python
def __init__(self):
    super(ModifiedVGG, self).__init__()
```
**Why `super()`?** Initializes the parent `nn.Module` class, giving us access to PyTorch's model functionality.

```python
self.chosen_features = ['0', '5', '10', '19', '28']
```
**These specific layers were chosen because:**
- **Layer 0**: `conv1_1` - First convolutional layer
  - Captures basic edges, colors (very low-level)
  - **Why?** Style includes color schemes and basic brush strokes
  
- **Layer 5**: `conv2_1` - After first max pooling
  - Captures simple textures (2x downsampled)
  - **Why?** Artistic styles have characteristic texture patterns
  
- **Layer 10**: `conv3_1` - After second max pooling
  - Captures more complex textures (4x downsampled)
  - **Why?** Combines simple textures into complex patterns
  
- **Layer 19**: `conv4_1` - After third max pooling
  - Captures object parts and structural elements (8x downsampled)
  - **Why?** Balances style and content information
  
- **Layer 28**: `conv5_1` - After fourth max pooling
  - Captures high-level content (16x downsampled)
  - **Why?** Represents WHAT is in the image (the content we want to preserve)

```python
self.model = vgg_19[:29]
```
**Why `[:29]`?** Takes layers 0 through 28 (up to conv5_1), discarding the rest.

```python
def forward(self, x):
    features = []
```
**The forward pass** - what happens when we pass an image through the model.

```python
for layer_number, layer in enumerate(self.model):
    x = layer(x)
```
**Step-by-step processing:**
- `enumerate()` gives us both the layer number and the layer itself
- `x = layer(x)` passes the image through one layer at a time
- `x` gets updated with the output of each layer

```python
if str(layer_number) in self.chosen_features:
    features.append(x)
```
**Selective feature extraction:**
- When we hit one of our chosen layers (0, 5, 10, 19, 28)
- Save that layer's output (feature map) to our list
- Convert `layer_number` to string to match our chosen_features list
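A side effect worth knowing: every `ReLU` in the printed model uses `inplace=True`, so the tensor appended at a conv layer is overwritten by the ReLU that follows it. In this setup the features saved at layers 0, 5, 10, and 19 therefore end up as post-ReLU activations, while layer 28 (the last layer in `vgg_19[:29]`) stays pre-ReLU. The sketch below reproduces the effect on a hypothetical two-layer stand-in (`tiny` is not part of the notebook):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in: one conv followed by an in-place ReLU, as in VGG19.
tiny = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
).eval()

x = torch.randn(1, 3, 8, 8)
with torch.no_grad():
    saved = tiny[0](x)                       # what features.append(x) would store
    had_negatives = bool((saved < 0).any())  # raw conv output has negative values
    tiny[1](saved)                           # in-place ReLU modifies `saved` itself
    still_negative = bool((saved < 0).any())

print(had_negatives, still_negative)
```

If pre-ReLU features are wanted, appending `x.clone()` instead of `x` would be one way to keep them, at the cost of extra memory.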

```python
return features
```
**Returns a list of 5 feature maps** - one from each chosen layer.

### What This Achieves:
We can now pass any image and get 5 feature maps representing:
1. Basic visual elements (edges, colors)
2. Simple textures
3. More complex textures and patterns
4. Object parts and structural elements
5. High-level semantic content (what objects are present)
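To make the five outputs concrete, here is a small arithmetic sketch of the shapes to expect for a 224×224 input. It assumes the standard VGG19 channel widths (64, 128, 256, 512, 512) and that each max-pool halves the spatial size; `chosen_layers` and `expected_shapes` are illustrative names, not part of the notebook.

```python
# (layer index, output channels, number of 2x2 max-pools before that layer)
chosen_layers = [(0, 64, 0), (5, 128, 1), (10, 256, 2), (19, 512, 3), (28, 512, 4)]

def expected_shapes(input_size=224):
    """Return (layer index, feature-map shape) for each chosen layer."""
    shapes = []
    for index, channels, pools_before in chosen_layers:
        side = input_size // (2 ** pools_before)  # each pool halves H and W
        shapes.append((index, (1, channels, side, side)))
    return shapes

for index, shape in expected_shapes():
    print(index, shape)
```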

In [13]:
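class ModifiedVGG(nn.Module):
    def __init__(self):
        super(ModifiedVGG, self).__init__()

        # Indices of conv1_1, conv2_1, conv3_1, conv4_1, conv5_1 in vgg_19
        self.chosen_features = ['0', '5', '10', '19', '28']
        # Keep layers 0..28; everything after conv5_1 is unused
        self.model = vgg_19[:29]

    def forward(self, x):
        features = []
        for layer_number, layer in enumerate(self.model):
            x = layer(x)
            # Collect the output of each chosen layer
            if str(layer_number) in self.chosen_features:
                features.append(x)
        return features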

In [14]:
model = ModifiedVGG()

## Step 3: Image Loading and Preprocessing

### What We Did:
Created a pipeline to load images and prepare them for the neural network.

### Why Preprocessing is Critical:
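Neural networks are picky - they need:
- Specific input dimensions
- Pixel values in a known range (this pipeline scales them to [0, 1] but does not apply the ImageNet mean/std normalization VGG19 was trained with)
- Proper tensor format
- Correct batch structure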

### The Code Explained:

```python
loader = transforms.Compose([...])
```
**`transforms.Compose`**: Chains multiple transformations together
- **Why?** Apply multiple preprocessing steps in sequence
- Like a pipeline: image → resize → tensorize → output

```python
transforms.Resize((224, 224))
```
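**Resizes image to 224×224 pixels**
- **Why 224×224?** It is the resolution VGG19 was trained on, and it keeps memory use and runtime modest
- The convolutional `.features` stack accepts other sizes too, but content and style images must share one size here because the loss loop reshapes both with the same `height` and `width`
- **What happens?** Image is stretched/squeezed to fit
- **Trade-off**: Passing a `(height, width)` tuple may distort the aspect ratio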

```python
transforms.ToTensor()
```
**Converts PIL Image to PyTorch Tensor**
- **Before**: PIL Image with pixels as integers [0, 255]
- **After**: Tensor with values as floats [0.0, 1.0]
- **Shape change**: (H, W, C) → (C, H, W)
  - Moves channels first: Height, Width, Channels → Channels, Height, Width
- **Why?** PyTorch expects channel-first format for convolutions

```python
def load_image(image_path):
```
**Function to load and preprocess a single image**

```python
image = Image.open(image_path).convert('RGB')
```
**Opens image file and ensures RGB format**
- **Why `convert('RGB')`?** Some images are RGBA (with alpha channel) or grayscale
- RGB ensures 3 channels: Red, Green, Blue
- VGG19 expects exactly 3 input channels

```python
image = loader(image).unsqueeze(0)
```
**Two operations here:**

1. **`loader(image)`**: Applies our preprocessing pipeline
   - After this: shape is [3, 224, 224] (channels, height, width)

2. **`.unsqueeze(0)`**: Adds a batch dimension at position 0
   - Before: [3, 224, 224]
   - After: [1, 3, 224, 224] (batch_size=1, channels, height, width)
   - **Why?** PyTorch models expect batches of images, even if batch size is 1
   - The model processes in format: [N, C, H, W] where N = number of images

```python
return image
```
**Returns a tensor ready to feed into VGG19**

### What This Achieves:
- Loads any image file
- Standardizes to VGG19's expected format
- Returns a tensor with shape [1, 3, 224, 224]
- Ready for feature extraction!

In [15]:

loader = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])

def load_image(image_path):
    image = Image.open(image_path).convert('RGB')
    image = loader(image).unsqueeze(0)
    return image


In [16]:
original_image = load_image('images/test/anna_hathaway.jpg')
style_image = load_image('images/styles/acrylic_style.jpg')

## Step 4: Load Content and Style Images

### What We Did:
```python
original_image = load_image('images/test/anna_hathaway.jpg')
style_image = load_image('images/styles/acrylic_style.jpg')
```

### Why These Two Images:
- **Content Image (original_image)**: The photo we want to preserve the structure of
  - Contains the "what" - the objects, composition, layout
  - Example: A photo of a person, landscape, building
  
- **Style Image**: The artwork whose style we want to mimic
  - Contains the "how" - the textures, colors, brush strokes, artistic patterns
  - Example: A Van Gogh painting, impressionist artwork

### What Happens:
Both images are now tensors of shape [1, 3, 224, 224], ready to be analyzed by VGG19.

---

## Step 5: Initialize the Generated Image

### What We Did:
```python
generated = original_image.clone().requires_grad_(True)
```

### The Revolutionary Concept Here:

In normal deep learning:
- **Input**: Fixed (image)
- **Model weights**: Optimized (trained)
- **Output**: Predicted labels

In style transfer:
- **Input**: Optimized (the generated image) ← We change THIS
- **Model weights**: Fixed (pre-trained VGG19) ← This stays constant
- **Output**: Used to compute loss

### Line-by-Line Breakdown:

```python
original_image.clone()
```
**Creates a copy of the content image**
- **Why start with content image?** Gives us a head start
- We already have the right content, just need to adjust the style
- **Alternative**: Could start with random noise, but convergence is slower

```python
.requires_grad_(True)
```
**This is the KEY line - enables pixel optimization**
- **What it does**: Tells PyTorch to track all operations on this tensor
- **Why?** So we can compute gradients with respect to pixel values
- **Effect**: We can do backpropagation to find how to change each pixel

**Normally**: Gradients flow back to update model weights  
**Here**: Gradients flow back to update pixel values

### The Optimization Strategy:
1. Start with content image pixels
2. Calculate how different it is from our targets (content + style)
3. Compute gradient: "How should we change each pixel to reduce the loss?"
4. Update pixels using these gradients
5. Repeat until we match both content and style

### What This Achieves:
We have a mutable image that we can optimize, pixel by pixel, to achieve our artistic goal.
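A tiny sketch of what "gradients with respect to pixels" means in practice. The loss here is a toy one (mean squared distance to an all-zero target, chosen only because its gradient is easy to check by hand), and the 4×4 image is a stand-in for the real 224×224 tensor.

```python
import torch

content = torch.rand(1, 3, 4, 4)                  # stand-in for original_image
generated = content.clone().requires_grad_(True)  # independent, trainable copy

# Toy loss: mean of squared pixel values (target is all zeros).
target = torch.zeros_like(content)
loss = torch.mean((generated - target) ** 2)
loss.backward()

# For mean(x^2) the gradient is 2x / N, one value per pixel.
expected = 2 * generated.detach() / generated.numel()
grad_matches = torch.allclose(generated.grad, expected)

print(generated.grad.shape, grad_matches, content.requires_grad)
```

The gradient has the same shape as the image, and the original `content` tensor is untouched because `clone()` made a separate copy.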

In [17]:
generated = original_image.clone().requires_grad_(True)

## Step 6: Set Hyperparameters

### What We Did:
```python
total_steps = 2000
learning_rate = 0.0003
alpha = 1
beta = 0.01
```

### Why Each Parameter Matters:

```python
total_steps = 2000
```
**Number of optimization iterations**
- **Typical range**: 1000-6000 steps depending on complexity

```python
learning_rate = 0.0003
```
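**How much we change pixels in each step**
- **What it means**: With Adam, each pixel moves by roughly the learning rate per step, so about 0.0003 on the [0, 1] scale (0.03% of the range)
- **This value (0.0003)**: A deliberately small step that keeps the optimization stable; treat it as a starting point to tune rather than a universal constant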

```python
alpha = 1
```
**Weight for content loss**
- **What it controls**: How much we care about preserving the original content
- **Value of 1**: Content loss contributes fully to total loss
- **Higher alpha**: Generated image looks more like original photo
- **Lower alpha**: Generated image can deviate more from original structure

```python
beta = 0.01
```
**Weight for style loss**
- **What it controls**: How much we care about matching the artistic style
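- **Value of 0.01**: The style loss is multiplied by 0.01 before it is added to the total. The raw style loss (built from unnormalized Gram matrices) is usually far larger than the raw content loss, so this weight alone does not make style 1/100th as influential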
- **Higher beta**: Stronger style transfer, but may lose content structure
- **Lower beta**: Weaker style transfer, keeps more photorealistic look

### The Critical Balance (alpha/beta ratio):

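**Current ratio**: alpha/beta = 1/0.01 = 100
- This means: the content loss is weighted 100× more heavily than the style loss
- The two raw losses live on different scales, so the ratio is best read as a relative knob, not a literal measure of importance
- **Result**: You'll clearly see the original subject, painted in the style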

**Different ratios produce different effects:**
- **alpha=1, beta=0.001** (ratio 1000): Subtle style hints, mostly original photo
- **alpha=1, beta=0.1** (ratio 10): Strong stylization, may lose some content clarity
- **alpha=1, beta=1** (ratio 1): Heavy stylization, abstract result

### Why This Specific Combination:
- **2000 steps**: Long enough for visible stylization at this learning rate without an excessive runtime
- **lr=0.0003**: Small, stable steps for pixel optimization
- **alpha=1, beta=0.01**: Chosen so the content image stays clearly recognizable while the artistic style is applied

In [18]:
total_steps = 2000
learning_rate = 0.0003
alpha = 1
beta = 0.01

## Step 7: The Training Loop - Where the Magic Happens

### Overview of What We're Doing:
This is an **iterative optimization process** where we repeatedly:
1. Extract features from our current generated image
2. Compare them to content and style targets
3. Calculate how "wrong" we are (loss)
4. Update pixels to be more "right"
5. Repeat until convergence

---

## Part 1: Setup

```python
from torchvision.utils import save_image
```
**Utility to save tensor as image file** - we'll use this to save progress

```python
optimizer = optim.Adam([generated], lr=learning_rate)
```
**Creates an Adam optimizer to update pixel values**

**What is Adam?**
- Adaptive Moment Estimation - a gradient descent variant that keeps running averages of the gradient and of its square
- Automatically scales the step for each parameter (here, each pixel value)
- Usually needs less learning-rate tuning than basic SGD for this task

**`[generated]`**: List of tensors to optimize
- **Critical**: We're optimizing the IMAGE, not model weights!
- This tensor's values (pixels) will be updated each iteration

**`lr=learning_rate`**: How big our update steps are (0.0003)

---

## Part 2: Main Loop

```python
for step in tqdm(range(total_steps), desc="Training"):
```
**Iterates 2000 times** with a progress bar (tqdm shows status)

---

## Part 3: Feature Extraction

```python
generated_features = model(generated)
original_image_features = model(original_image)
style_image_features = model(style_image)
```

**What happens here:**
Each line passes an image through our ModifiedVGG and gets 5 feature maps (from layers 0, 5, 10, 19, 28).

**`model(generated)`**: 
- Input: Current generated image [1, 3, 224, 224]
- Output: List of 5 tensors, each a feature map
- **Why?** Need to see what the current generated image looks like to VGG

**`model(original_image)`**: 
- Content target - these features should match generated's features
- **Why?** To preserve the content (objects, structure)

**`model(style_image)`**: 
- Style target - the Gram matrices should match
- **Why?** To transfer the artistic style

**Key insight**: We extract features from ALL THREE images to compare them.
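The content and style features never change between iterations, so one possible refinement is to compute them once before the loop under `torch.no_grad()` and only re-extract the generated image's features each step. The sketch below shows the pattern on a hypothetical stand-in (`TinyExtractor` plays the role of `ModifiedVGG`; it is not part of the notebook).

```python
import torch
import torch.nn as nn

class TinyExtractor(nn.Module):
    """Hypothetical stand-in for ModifiedVGG: returns each conv layer's output."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 4, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(4, 8, kernel_size=3, padding=1),
        )

    def forward(self, x):
        features = []
        for layer in self.layers:
            x = layer(x)
            if isinstance(layer, nn.Conv2d):
                features.append(x)
        return features

extractor = TinyExtractor().eval()
content = torch.rand(1, 3, 16, 16)
style = torch.rand(1, 3, 16, 16)
generated = content.clone().requires_grad_(True)

# Targets are constants: compute once, without building an autograd graph.
with torch.no_grad():
    content_targets = extractor(content)
    style_targets = extractor(style)

# Inside the loop only this call would be repeated.
generated_features = extractor(generated)
print(len(content_targets), generated_features[0].requires_grad)
```

This cuts the forward passes per step from three to one and leaves the loss values unchanged.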

---

## Part 4: Initialize Losses

```python
style_loss = 0
original_loss = 0
```
**Accumulators for loss values across all layers**
- Start at 0, add contribution from each layer
- By the end, they'll represent total content and style differences

---

## Part 5: Layer-by-Layer Loss Computation

```python
for gen_feature, orig_feature, style_feature in zip(generated_features, original_image_features, style_image_features):
```

**Iterates through corresponding feature maps from all three images**
- **Iteration 1**: Layer 0 features from all three images
- **Iteration 2**: Layer 5 features from all three images
- ... (5 iterations total)

**`zip()`**: Groups corresponding items from three lists together

---

## Part 6: Content Loss

```python
batch_size, channel, height, width = gen_feature.shape
```
**Unpacks the dimensions of the feature map**
- `batch_size = 1` (we're processing one image)
- `channel = 64` (layer 0) up to `512` (layer 28) - number of filters
- `height, width = 224, 224` at layer 0, then 112, 56, 28, and 14 at the deeper chosen layers because each max-pool halves the resolution

**Why extract these?** We'll need them for reshaping in Gram matrix calculation.

```python
original_loss += torch.mean((gen_feature - orig_feature)**2)
```

**This is the content loss calculation - let's break it down:**

**Mathematical formula**: 
$$L_{content} = \frac{1}{N} \sum_{i=1}^{N} (F_{gen}^i - F_{orig}^i)^2$$

**`(gen_feature - orig_feature)`**: 
- Element-wise subtraction of feature maps
- If generated matches original perfectly, this is 0
- Larger values = bigger difference

**`**2`**: 
- Squares the differences
- Makes all values positive
- Penalizes large differences more heavily (quadratic penalty)

**`torch.mean()`**: 
- Averages over all positions and channels
- Gives us one number representing "how different" the features are

**`+=`**: 
- Adds this layer's contribution to total content loss
- We sum across all 5 layers for complete content representation

**Why this works**: 
- If generated has same features as content image → loss is low
- Features in deep layers represent semantic content (what objects exist)
- Matching features = matching content
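A hand-checkable instance of the content loss, using made-up 2×2 "feature maps". It also shows that the expression is the same quantity as PyTorch's built-in mean squared error.

```python
import torch
import torch.nn.functional as F

# Made-up feature maps of shape [1, 1, 2, 2] so the numbers are easy to follow.
gen_feature = torch.tensor([[[[1.0, 2.0], [3.0, 4.0]]]])
orig_feature = torch.tensor([[[[1.0, 0.0], [3.0, 0.0]]]])

# Differences are 0, 2, 0, 4 -> squares 0, 4, 0, 16 -> mean 20 / 4 = 5
manual = torch.mean((gen_feature - orig_feature) ** 2)
builtin = F.mse_loss(gen_feature, orig_feature)

print(manual.item(), builtin.item())
```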

---

## Part 7: Style Loss (The Gram Matrix)

### What is a Gram Matrix and Why Do We Need It?

**The Problem**: 
- Feature maps contain spatial information (where things are)
- Style shouldn't care about WHERE features occur, only HOW they co-occur
- A painting's style is the same whether the tree is on the left or right

**The Solution - Gram Matrix**:
- Measures correlations between different feature channels
- "Do vertical edges occur together with red color?"
- "Do curved lines appear with blue tones?"
- **Removes spatial information**, keeps only feature relationships

### The Mathematics:

Given feature map $F$ with shape $[C, H, W]$ (channels, height, width):

**Gram matrix formula**:
$$G_{ij} = \sum_{k=1}^{H \times W} F_{ik} \cdot F_{jk}$$

Where:
- $G_{ij}$ = correlation between channel $i$ and channel $j$
- $F_{ik}$ = activation of channel $i$ at spatial position $k$
- Sum over all spatial positions (H×W locations)

**What this means**:
- If channels $i$ and $j$ are both highly activated together → large $G_{ij}$
- If they never activate together → small $G_{ij}$
- **Result**: A $C \times C$ matrix encoding style without spatial info
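A small worked example with made-up numbers: take $C = 2$ channels and $H \times W = 3$ spatial positions, so each row of $F$ is one flattened channel.

$$F = \begin{pmatrix} 1 & 2 & 0 \\ 0 & 1 & 1 \end{pmatrix}, \qquad G = F F^{T} = \begin{pmatrix} 5 & 2 \\ 2 & 2 \end{pmatrix}$$

Entry by entry: $G_{11} = 1^2 + 2^2 + 0^2 = 5$, $G_{12} = G_{21} = 1 \cdot 0 + 2 \cdot 1 + 0 \cdot 1 = 2$, and $G_{22} = 0^2 + 1^2 + 1^2 = 2$. Reordering the three columns of $F$ (moving features to different positions) leaves $G$ unchanged.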

---

### The Code Implementation:

```python
G = gen_feature.view(channel, height*width).mm(
    gen_feature.view(channel, height*width).t()
)
```

**Let's break this down step by step:**

**Step 1: `gen_feature.view(channel, height*width)`**
- **Before**: shape [1, C, H, W] e.g., [1, 64, 224, 224]
- **After**: shape [C, H×W] e.g., [64, 50176]
- **What it does**: Flattens spatial dimensions into one dimension
- **Why?** To treat each spatial position as a data point
- **Caveat**: This reshape drops the batch dimension, so it only works when `batch_size = 1`

**Step 2: `.mm()`** 
- Matrix multiplication
- **PyTorch function**: `torch.mm(A, B)` computes A @ B

**Step 3: `.t()`**
- Transpose operation
- **Effect**: Swaps rows and columns
- If input is [C, H×W], output is [H×W, C]

**The complete operation**:
```
[C, H×W] @ [H×W, C] = [C, C]
```

**Example with actual numbers** (layer 0: 64 channels, 224×224 spatial):
```
[64, 50176] @ [50176, 64] = [64, 64]
```

**What each element represents**:
- G[i, j] = dot product of channel i with channel j across all spatial positions
- High value → channels i and j activate together (style correlation)
- This is EXACTLY the Gram matrix formula!

```python
A = style_feature.view(channel, height*width).mm(
    style_feature.view(channel, height*width).t()
)
```
**Same calculation for the style image**
- **G**: Gram matrix of current generated image
- **A**: Gram matrix of target style image
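The two identical expressions can be wrapped in a helper, and the claim that the Gram matrix ignores spatial arrangement can be checked directly. This is a sketch under stated assumptions: `gram_matrix` is a hypothetical helper name, it assumes a single image per batch as in this notebook, and it uses `reshape` instead of `view` so it also accepts non-contiguous tensors.

```python
import torch

def gram_matrix(feature):
    """Gram matrix for a single-image feature map of shape [1, C, H, W]."""
    _, channels, height, width = feature.shape
    flat = feature.reshape(channels, height * width)  # [C, H*W]
    return flat.mm(flat.t())                          # [C, C]

torch.manual_seed(0)
feature = torch.rand(1, 4, 6, 6)

# Shuffle the 36 spatial positions the same way in every channel.
perm = torch.randperm(6 * 6)
shuffled = feature.reshape(1, 4, -1)[:, :, perm].reshape(1, 4, 6, 6)

gram_shape = tuple(gram_matrix(feature).shape)
same = torch.allclose(gram_matrix(feature), gram_matrix(shuffled), atol=1e-5)
print(gram_shape, same)
```

The shuffled feature map looks completely different spatially, yet its Gram matrix matches, which is why the style loss transfers textures without copying layout.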

---

### Computing Style Loss:

```python
style_loss += torch.mean((G - A)**2)
```

**Mathematical formula**:
$$L_{style} = \frac{1}{C^2} \sum_{i=1}^{C} \sum_{j=1}^{C} (G_{ij} - A_{ij})^2$$

**What happens**:
- **`(G - A)`**: Difference between generated and style Gram matrices
  - If style matches perfectly → difference is 0
- **`**2`**: Square the differences (MSE loss)
- **`torch.mean()`**: Average over all $C \times C$ elements
- **`+=`**: Add contribution from this layer to total style loss

**Why this works**:
- Gram matrix captures style (texture correlations)
- Matching Gram matrices = matching style
- Independent of spatial arrangement = style transfer without copying positions

---

## Part 8: Total Loss and Optimization

```python
total_loss = alpha*original_loss + beta*style_loss
```

**Combines both losses with our chosen weights**

**Mathematical formula**:
$$L_{total} = \alpha \cdot L_{content} + \beta \cdot L_{style}$$

With $\alpha = 1$ and $\beta = 0.01$:
$$L_{total} = 1 \cdot L_{content} + 0.01 \cdot L_{style}$$

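**What this means**:
- The content loss is weighted 100× more heavily than the style loss
- Illustrative numbers: if content loss = 10 and style loss = 100
- Total loss = 1×10 + 0.01×100 = 10 + 1 = 11
- In that case the total is mostly driven by content preservation; with real images the raw style loss is often much larger, so the balance depends on both the weights and the loss magnitudes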
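The arithmetic above can be turned into a small helper that also reports each term's share of the total. The first call uses the numbers from the text; the second uses a much larger raw style loss, which is a hypothetical value chosen only to show how the balance shifts when the two losses sit on different scales. `weighted_total` is an illustrative name.

```python
def weighted_total(content_loss, style_loss, alpha=1.0, beta=0.01):
    """Return the total loss and the fraction contributed by each term."""
    content_term = alpha * content_loss
    style_term = beta * style_loss
    total = content_term + style_term
    return total, content_term / total, style_term / total

# Numbers from the text: content loss 10, style loss 100.
total, content_share, style_share = weighted_total(10.0, 100.0)
print(total, round(content_share, 3), round(style_share, 3))

# Hypothetical case: same weights, but a raw style loss of one million.
total_big, content_share_big, style_share_big = weighted_total(10.0, 1e6)
print(total_big, round(style_share_big, 3))
```

With the same `alpha` and `beta`, the style term goes from about 9% of the total to almost all of it, so the weights only make sense relative to the raw loss magnitudes you actually observe.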

---

```python
optimizer.zero_grad()
```
**Clears previous gradients**
- **Why?** PyTorch accumulates gradients by default
- Without this, gradients from previous iterations would add up
- We want fresh gradients for each iteration

---

```python
total_loss.backward()
```
**THE MOST IMPORTANT LINE - Computes gradients**

**What happens internally**:
1. PyTorch traces back through all operations that created `total_loss`
2. Computes $\frac{\partial L_{total}}{\partial pixel_{i,j}}$ for EVERY pixel
3. These gradients tell us: "Increase or decrease this pixel to reduce loss?"
4. Stored in `generated.grad`

**The chain rule in action**:
$$\frac{\partial L_{total}}{\partial pixel} = \frac{\partial L_{total}}{\partial features} \cdot \frac{\partial features}{\partial pixel}$$

**Key insight**:
- Gradients flow backwards through VGG19, whose weights are never updated because only `generated` was given to the optimizer
- End up in the input pixels (which are trainable)
- We now know how to adjust each pixel to minimize loss!

---

```python
optimizer.step()
```
**Updates the pixel values using gradients**

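**Adam update rule** (simplified, for gradient $g_t$ at step $t$):
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$
$$pixel_{new} = pixel_{old} - learning\_rate \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Here $\hat{m}_t = m_t / (1-\beta_1^t)$ and $\hat{v}_t = v_t / (1-\beta_2^t)$ are bias-corrected averages, and $\beta_1$, $\beta_2$ are Adam's decay rates (unrelated to the style weight `beta`).

**What happens**:
- For each pixel, Adam computes an update from the running averages of its gradient
- Dividing by $\sqrt{\hat{v}_t}$ normalizes the step, so its size stays close to the learning rate (0.0003) regardless of the raw gradient scale
- Adjusts pixels in the direction that reduces loss
- `generated` tensor now has updated pixel values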

**Result**: Our image is now slightly more stylized and still maintains content!
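One consequence of Adam's per-parameter scaling is easy to check: on the first step, the size of the update is close to the learning rate no matter how large or small the gradient is. The sketch below uses two single-value "pixels" with gradients a million times apart; the toy loss and variable names are made up for illustration.

```python
import torch
import torch.optim as optim

lr = 0.0003
small = torch.tensor([0.5], requires_grad=True)  # will receive a tiny gradient
large = torch.tensor([0.5], requires_grad=True)  # will receive a huge gradient
optimizer = optim.Adam([small, large], lr=lr)

# Toy loss: gradients are 0.001 and 1000, i.e. a factor of 1,000,000 apart.
loss = 0.001 * small.sum() + 1000.0 * large.sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

step_small = 0.5 - small.item()
step_large = 0.5 - large.item()
print(step_small, step_large)
```

Both values move by approximately `lr`. Plain gradient descent would instead have moved them by `lr * gradient`, which differs by the same factor of a million.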

---

## Part 9: Logging Progress

```python
if step % 200 == 0:
    tqdm.write("Total loss at step {}: {}".format(step, total_loss.item()))
    save_image(generated, "results/generated.png")
```

**Every 200 iterations**:
- **`step % 200 == 0`**: True when step is 0, 200, 400, ..., 1800
- **`tqdm.write()`**: Prints without breaking progress bar
- **`total_loss.item()`**: Converts tensor to Python number
- **`save_image()`**: Saves current generated image to file, overwriting `results/generated.png` each time (the `results/` folder must already exist)

**Why log every 200 steps?**
- Can monitor if loss is decreasing (optimization working)
- Can see intermediate results
- Not every step (too much output), not too rare (want feedback)
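If you would rather keep every checkpoint and not depend on the output folder existing, a small helper along these lines could build the path. It is a sketch: `checkpoint_path` is a hypothetical name, and the demo writes into a temporary directory so it does not touch a real `results/` folder.

```python
import os
import tempfile

def checkpoint_path(out_dir, step):
    """Create out_dir if needed and return a per-step file name."""
    os.makedirs(out_dir, exist_ok=True)
    return os.path.join(out_dir, "generated_{:04d}.png".format(step))

# Demo in a temporary location.
demo_dir = os.path.join(tempfile.mkdtemp(), "results")
path = checkpoint_path(demo_dir, 200)
print(os.path.basename(path), os.path.isdir(demo_dir))
```

In the loop, the returned path would be passed as the second argument to `save_image` in place of the fixed `"results/generated.png"`.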

---

## The Complete Flow:

1. **Extract features** from generated, content, and style images
2. **For each layer**:
   - Compare generated vs content features → **content loss**
   - Compute Gram matrices for generated and style → **style loss**
3. **Combine losses** with weights α and β → **total loss**
4. **Compute gradients**: How should each pixel change?
5. **Update pixels**: Apply gradients with Adam optimizer
6. **Repeat** for 2000 iterations
7. **Result**: An image that looks like the content image painted in the style image's artistic style!
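The whole flow can be exercised in a few seconds on a toy problem. The sketch below is not the notebook's setup: `TinyExtractor` is a hypothetical, randomly initialized stand-in for VGG19, the images are 16×16 synthetic tensors, and the learning rate and step count are arbitrary. It only demonstrates that optimizing pixels against the combined loss makes that loss go down.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

class TinyExtractor(nn.Module):
    """Hypothetical stand-in for ModifiedVGG: returns each conv layer's output."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
        )
        for param in self.parameters():
            param.requires_grad_(False)  # fixed feature extractor

    def forward(self, x):
        features = []
        for layer in self.layers:
            x = layer(x)
            if isinstance(layer, nn.Conv2d):
                features.append(x)
        return features

def gram(feature):
    """Gram matrix of a [1, C, H, W] feature map."""
    _, c, h, w = feature.shape
    flat = feature.reshape(c, h * w)
    return flat.mm(flat.t())

extractor = TinyExtractor().eval()
content = torch.rand(1, 3, 16, 16)
style = torch.zeros(1, 3, 16, 16)
style[..., ::2] = 1.0  # vertical stripes as a toy "style"

generated = content.clone().requires_grad_(True)
optimizer = optim.Adam([generated], lr=0.003)
alpha, beta = 1.0, 0.01

# Constant targets, computed once.
with torch.no_grad():
    content_targets = extractor(content)
    style_grams = [gram(f) for f in extractor(style)]

losses = []
for step in range(50):
    content_loss = 0.0
    style_loss = 0.0
    for gen_f, con_f, sty_g in zip(extractor(generated), content_targets, style_grams):
        content_loss = content_loss + torch.mean((gen_f - con_f) ** 2)
        style_loss = style_loss + torch.mean((gram(gen_f) - sty_g) ** 2)
    total_loss = alpha * content_loss + beta * style_loss

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    losses.append(total_loss.item())

print(losses[0], losses[-1])
```

The content image, the weights of the extractor, and the targets all stay fixed; only `generated` changes from step to step.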

---

## The Mathematical Beauty:

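By optimizing:
$$\min_{generated} \left[ \alpha \sum_{l} \frac{1}{N_l} \|F^l_{gen} - F^l_{orig}\|^2 + \beta \sum_{l} \frac{1}{C_l^2} \|G^l - A^l\|^2 \right]$$

where $N_l$ is the number of feature values at layer $l$ and $C_l$ is its channel count, matching the per-layer means used in the code.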

We find the image that simultaneously:
- Has the same **semantic content** (objects, structure) as the original photo
- Has the same **artistic style** (textures, patterns, colors) as the style painting

This is neural style transfer!

In [None]:
from torchvision.utils import save_image

optimizer = optim.Adam([generated], lr=learning_rate)

for step in tqdm(range(total_steps), desc="Training"):
    # Features of all three images at the five chosen layers
    generated_features = model(generated)
    original_image_features = model(original_image)
    style_image_features = model(style_image)

    style_loss = 0
    original_loss = 0

    for gen_feature, orig_feature, style_feature in zip(
        generated_features, original_image_features, style_image_features
    ):
        batch_size, channel, height, width = gen_feature.shape

        # Content loss: mean squared difference of the feature maps
        original_loss += torch.mean((gen_feature - orig_feature) ** 2)

        # Gram matrices: [C, H*W] @ [H*W, C] -> [C, C] (mm = matrix multiplication)
        G = gen_feature.view(channel, height * width).mm(
            gen_feature.view(channel, height * width).t()
        )
        A = style_feature.view(channel, height * width).mm(
            style_feature.view(channel, height * width).t()
        )

        # Style loss: mean squared difference of the Gram matrices
        style_loss += torch.mean((G - A) ** 2)

    total_loss = alpha * original_loss + beta * style_loss
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

    if step % 200 == 0:
        tqdm.write("Total loss at step {}: {}".format(step, total_loss.item()))
        save_image(generated, "results/generated.png")
