

In this lab, you will learn how to build **very deep convolutional neural networks** using **Residual Networks (ResNets)**. While deeper networks can theoretically learn highly complex functions, they are often difficult to train in practice due to challenges such as the **vanishing gradient problem**.  

**Residual Networks**, introduced by [He et al. (2015)](https://arxiv.org/pdf/1512.03385.pdf), solved this issue and made it possible to train extremely deep models efficiently.

ResNet was the breakthrough architecture that won the **2015 ImageNet Large Scale Visual Recognition Challenge (ILSVRC)** and has influenced almost every deep learning model since then. Residual connections are now used in:

- Recurrent models ([Kim et al., 2017](https://arxiv.org/abs/1701.03360), [Prakash et al., 2016](https://arxiv.org/abs/1610.03098))
- Transformers ([Vaswani et al., 2017](https://arxiv.org/abs/1706.03762))
- Graph neural networks ([Kipf & Welling, 2016](https://arxiv.org/abs/1609.02907))

Today, residual ideas form the backbone of modern deep learning architectures.


## Learning Objectives

**In this notebook, you will:**

- Understand and implement the core building blocks of **ResNets**
- Combine these blocks to construct a **50-layer network**
- Train this deep network for **image classification**
- Load and fine-tune a **pre-trained ResNet-50** model in Keras
- Run training on a **GPU** for the first time (deep network ≈ 50 layers!)

We will use **Keras with PyTorch as the backend** for this lab.



###  Before You Begin

Run the cell below to import the required packages.

If your laptop has a GPU (e.g., Apple Silicon MacBook or a gaming laptop), the code will automatically use it. Otherwise, it will run on CPU.  

> 💡 **Tip:** Try running this notebook on both CPU and GPU (e.g., using Google Colab) to experience how significantly a GPU accelerates model training.

Let's get started 👇


In [None]:
import os
os.environ['KERAS_BACKEND']='torch'

In [None]:
import keras
import torch

import keras.backend as K
K.set_image_data_format('channels_last')

In [None]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    device_name = torch.cuda.get_device_name(0)
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    device_name = "Apple Silicon MPS (Metal Performance Shaders)"
else:
    device = torch.device("cpu")
    device_name = "CPU"

print(f"Training on: {device_name}")

In [None]:

device = torch.device("cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu"))
print(f"Using device: {device}")
# print keras backend

print(f"Keras backend: {keras.backend.backend()}")
print(f"Torch version: {torch.__version__}")
print(f"Keras version: {keras.__version__}")

## 1 - The Problem of Very Deep Neural Networks

Modern neural networks have grown dramatically in depth, from a few layers (e.g., AlexNet) to hundreds in today's architectures. Deeper models can represent more complex functions and learn hierarchical features, from simple edges in early layers to abstract concepts in deeper layers.

However, simply stacking more layers does **not** always improve performance. A major challenge is the **vanishing gradient problem**.


###  Vanishing Gradients

During backpropagation, gradients flow from the final layer back to earlier layers. At each step, gradients are multiplied by weight matrices (and activation derivatives). In very deep networks, this can cause gradients to:

-  **Shrink exponentially** → approach zero (**vanishing gradients**)  
-  **Grow exponentially** → explode to large values (**exploding gradients**, less common)

Even with:

- ✅ Proper weight initialization  
- ✅ ReLU activations  
- ✅ Batch normalization  

**Training very deep networks is still harder** because gradients struggle to flow backward through many layers.  
Result: **Early layers learn slowly or stop learning** → inefficient training and poor performance.
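
To get a feel for how quickly repeated multiplication changes a gradient's magnitude, the sketch below treats every layer as contributing the same scalar factor. This is a deliberate simplification (real networks multiply by Jacobian matrices that differ from layer to layer), and the helper name `gradient_scale` and the factors 0.9 and 1.1 are illustrative assumptions rather than measurements.

```python
def gradient_scale(per_layer_factor, depth):
    """Size of a unit gradient after passing backward through `depth` layers,
    assuming (hypothetically) that every layer scales it by the same factor."""
    return per_layer_factor ** depth


for depth in (10, 50, 100):
    shrinking = gradient_scale(0.9, depth)  # factor < 1: gradient vanishes
    growing = gradient_scale(1.1, depth)    # factor > 1: gradient explodes
    print(f"depth={depth:3d}  0.9^depth={shrinking:.2e}  1.1^depth={growing:.2e}")
```

Even a per-layer factor close to 1 compounds into a very small or very large number once the depth reaches the range of a 50-layer network.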


###  What You Observe During Training

Gradient norms for earlier layers typically decrease rapidly:

> **Earlier layers receive a very weak learning signal → slow or stalled learning**

*(See the figure below: gradients in earlier layers decay faster.)*

<img src="images/vanishing_grad_kiank.png" style="width:450px;height:220px;">
<caption><center> <u> <font color='purple'> <b>Figure 1</b> </u><font color='purple'>  : <b>Vanishing gradient</b> <br> The speed of learning decreases very rapidly for the early layers as the network trains </center></caption>

**Key takeaway:**  
> Gradient flow weakens as networks deepen, making optimization difficult without special architectural mechanisms.



###  What's Next?

To address this, modern architectures introduce **skip (residual) connections**, which allow gradients to bypass layers and flow more easily.

You will now build a **Residual Network (ResNet)** to overcome the vanishing gradient problem.


## 2 - Building a Residual Network

### 2.0 - The Core Idea: Skip Connections

In ResNets, a **"shortcut"** or **"skip connection"** lets activations bypass intermediate transformations on the forward pass and lets gradients flow directly back to earlier layers on the backward pass:  

<img src="images/skip_connection_kiank.png" style="width:650px;height:200px;">
<caption><center> <u> <font color='purple'> <b>Figure 2</b> </u><font color='purple'>  : A ResNet block showing a <b>skip-connection</b> <br> </center></caption>

**Left**: Traditional "main path", where the input passes through multiple transformations sequentially.  
**Right**: ResNet block, where the shortcut connection enables the network to learn **residual mappings**.



### 2.1 Key Innovation: Learning Residuals Instead of Direct Mappings

**Traditional approach**: Learn a direct mapping $H(x)$ from input $x$.

**ResNet approach**: Learn a residual function $F(x)$ such that:

$$H(x) = F(x) + x$$

where:

- $x$ is the input (identity/shortcut)
- $F(x)$ is the residual learned by the stacked layers
- $H(x)$ is the final output

**Why is this easier to optimize?**

1. **If optimal mapping is close to identity** ($H(x) \approx x$), it's easier to learn $F(x) \approx 0$ than to learn $H(x) = x$ directly
2. **Gradient flow**: During backpropagation, gradients can flow through both:
   - The main path: $\frac{\partial F(x)}{\partial x}$
   - The shortcut: $\frac{\partial x}{\partial x} = 1$ (direct path with gradient = 1)

This ensures at least some gradient always reaches earlier layers!
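
The effect of the extra `+ 1` term can be illustrated with a scalar toy model. The sketch below assumes every block's main path has the same small local derivative $\partial F/\partial x$ (the value 0.05 and the helper name `backprop_gain` are hypothetical choices). A plain stack multiplies these derivatives, whereas a residual stack multiplies $(\partial F/\partial x + 1)$.

```python
def backprop_gain(local_derivatives, residual):
    """Product of per-block derivatives along the backward pass.

    Plain block:    dH/dx = dF/dx
    Residual block: dH/dx = dF/dx + 1   (the shortcut contributes the 1)
    """
    gain = 1.0
    for dF_dx in local_derivatives:
        gain *= (dF_dx + 1.0) if residual else dF_dx
    return gain


derivs = [0.05] * 20  # 20 blocks, each with a weak main-path derivative (assumed)
print("plain stack:   ", backprop_gain(derivs, residual=False))
print("residual stack:", backprop_gain(derivs, residual=True))
```

In a real network the derivatives are matrices and can have either sign, so this is an intuition for why the shortcut helps rather than a formal guarantee.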

**Benefits of Residual Learning**

| Traditional Deep Networks | Residual Networks (ResNets) |
|---------------------------|----------------------------|
| Gradients vanish exponentially | Skip connections provide gradient highway |
| Harder to optimize as depth increases | Easy to optimize even with 100+ layers |
| Adding layers can hurt performance | Adding layers rarely hurts (can learn identity) |
| Requires careful initialization | More robust to initialization |
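
The "can learn identity" entry can be made concrete with a toy block. The sketch below is a strong simplification: the main path is a single elementwise scaling by a hypothetical weight `w`, not a real convolution. With `w = 0`, the residual block passes a non-negative input through unchanged, while the plain block outputs zeros.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def plain_block(x, w):
    """Toy plain layer: H(x) = relu(w * x)."""
    return relu([w * v for v in x])


def residual_block(x, w):
    """Toy residual layer: H(x) = relu(F(x) + x) with F(x) = w * x."""
    return relu([w * v + v for v in x])


x = [1.0, 2.0, 0.5]  # non-negative, as activations are after a previous ReLU
print(plain_block(x, w=0.0))     # main path switched off: the input is lost
print(residual_block(x, w=0.0))  # main path switched off: the input passes through
```

This is why an extra residual block can fall back to doing nothing: driving the main-path weights toward zero is enough.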

### 2.2 Two Types of ResNet Blocks

By stacking these ResNet blocks on top of each other, you can form very deep networks. The architecture uses two main block types based on whether input/output dimensions match:

1. **Identity Block**: when input and output dimensions are the **same**
2. **Convolutional Block**: when dimensions **differ** (e.g., spatial downsampling or channel expansion)


You will now implement both types of ResNet blocks and build a complete ResNet-50 architecture!

#### 2.2.1 - The Identity Block

The **identity block** is the standard building block used in ResNets. It's used when the **input and output dimensions are the same** (i.e., $a^{[l]}$ and $a^{[l+2]}$ have identical shapes).

##### Basic Structure (2-layer skip)

<img src="images/idblock2_kiank.png" style="width:650px;height:150px;">
<caption><center> <u> <font color='purple'> <b>Figure 3</b> </u><font color='purple'>  : <b>Identity block.</b> Skip connection "skips over" 2 layers. </center></caption>

**Key components:**

- **Upper path (shortcut)**: Direct connection ‚Äî input passes through unchanged
- **Lower path (main path)**: Sequence of transformations (CONV2D → BatchNorm → ReLU)
- **Final step**: Add both paths and apply ReLU activation


##### Enhanced 3-Layer Identity Block (What You'll Implement)

For better feature extraction, we'll implement a more powerful version where the skip connection jumps over **3 layers**:

<img src="images/idblock3_kiank.png" style="width:650px;height:150px;">
<caption><center> <u> <font color='purple'> <b>Figure 4</b> </u><font color='purple'>  : <b>Identity block.</b> Skip connection "skips over" 3 layers.</center></caption>


##### Architecture Details

The main path consists of three convolutional components with a **bottleneck design**:

**First Component** (Dimensionality Reduction):

- **CONV2D**: $F_1$ filters, kernel size (1,1), stride (1,1), padding='valid'
  - *Purpose*: Reduce channel dimensions (bottleneck)
  - *Name*: `conv_name_base + '2a'`
- **BatchNorm**: Normalize channels axis → `bn_name_base + '2a'`
- **Activation**: ReLU

**Second Component** (Feature Extraction):

- **CONV2D**: $F_2$ filters, kernel size $(f,f)$, stride (1,1), padding='same'
  - *Purpose*: Extract spatial features at reduced dimensionality
  - *Name*: `conv_name_base + '2b'`
- **BatchNorm**: Normalize channels axis → `bn_name_base + '2b'`
- **Activation**: ReLU

**Third Component** (Dimensionality Expansion):

- **CONV2D**: $F_3$ filters, kernel size (1,1), stride (1,1), padding='valid'
  - *Purpose*: Restore original channel dimensions
  - *Name*: `conv_name_base + '2c'`
- **BatchNorm**: Normalize channels axis → `bn_name_base + '2c'`
- **No ReLU** (activation applied after adding shortcut)

**Final Step**:

- **Add**: Shortcut + Main path output
- **Activation**: ReLU on combined result


##### Why This Bottleneck Design?

The **1×1 → 3×3 → 1×1** pattern is called a **bottleneck architecture**:

```
Input channels: 256
    ↓
1×1 conv: 256 → 64   (reduce dimensions - fewer parameters!)
    ↓
3×3 conv: 64 → 64    (extract features at lower cost)
    ↓
1×1 conv: 64 → 256   (restore dimensions)
    ↓
Add shortcut: 256 + 256 → 256
```

**Benefits**:

- ✅ **Fewer parameters**: 3×3 conv operates on fewer channels (64 vs 256)
- ✅ **Faster computation**: Reduces computational cost significantly
- ✅ **More depth**: Can stack more layers with same parameter budget
- ✅ **Better features**: Multiple non-linearities capture complex patterns
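
The parameter savings can be checked with a little arithmetic. The sketch below counts the convolution weights and biases for the 256 → 64 → 64 → 256 bottleneck shown above and compares them with a hypothetical alternative chosen for illustration: two 3×3 convolutions that stay at 256 channels. BatchNorm parameters are ignored, and `conv_params` is a helper written for this example, not a Keras function.

```python
def conv_params(kernel, c_in, c_out, bias=True):
    """Trainable parameters in a 2D conv layer with a square kernel."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


# Bottleneck: 1x1 (256 -> 64), 3x3 (64 -> 64), 1x1 (64 -> 256)
bottleneck = (
    conv_params(1, 256, 64)
    + conv_params(3, 64, 64)
    + conv_params(1, 64, 256)
)

# Assumed comparison baseline: two 3x3 convs kept at 256 channels
plain = 2 * conv_params(3, 256, 256)

print(f"bottleneck block: {bottleneck:,} parameters")
print(f"two 3x3 convs:    {plain:,} parameters")
print(f"ratio:            {plain / bottleneck:.1f}x")
```

Under these assumptions the bottleneck needs roughly one seventeenth of the parameters, which is what makes stacking many such blocks affordable.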


### Task 1: Implement the ResNet Identity Block

We've implemented the first component for you. 

**Your task**: Complete the second component, third component, and final addition step.

**Implementation hints:**

- Conv2D: [Documentation](https://keras.io/api/layers/convolution_layers/convolution2d/)
- BatchNorm: [Documentation](https://faroit.github.io/keras-docs/1.2.2/layers/normalization/)
  - Set `axis=3` to normalize the channels axis
- Activation: Use `Activation('relu')(X)`
- Add layers: [Documentation](https://keras.io/api/layers/merging_layers/add/)

In [None]:
from keras.layers import Conv2D, BatchNormalization, Activation, Add

# FUNCTION: identity_block

def identity_block(X, f, filters, stage, block):
    """
    Implementation of the identity block as defined in Figure 4

    Arguments:
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    stage -- integer, used to name the layers, depending on their position in the network
    block -- string/character, used to name the layers, depending on their position in the network

    Returns:
    X -- output of the identity block, tensor of shape (m, n_H, n_W, n_C)
    """

    # defining name basis
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'

    # Retrieve Filters
    F1, F2, F3 = filters

    # Save the input value. You'll need this later to add back to the main path.
    X_shortcut = X

    # First component of main path
    X = Conv2D(filters = F1, kernel_size = (1, 1), strides = (1,1), padding = 'valid', name = conv_name_base + '2a' )(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2a')(X)
    X = Activation('relu')(X)

    ### START CODE HERE ###

    # Second component of main path (≈3 lines)


    # Third component of main path (≈2 lines)


    # Final step: Add shortcut value to main path, and pass it through a RELU activation (≈2 lines)
 

    ### END CODE HERE ###

    return X

In [None]:

import numpy as np
np.random.seed(1)
X = np.random.randn(3, 4, 4, 6)
output = identity_block(X, f = 2, filters = [2, 4, 6], stage = 1, block = 'a')

print(output.cpu().detach().numpy().sum())


**Expected Output**:

<table>
    <tr>
        <td>
           A value between <b>100</b> and <b>140</b>
        </td>
    </tr>
</table>

#### 2.2.2 - The Convolutional Block

You've mastered the identity block! Now let's tackle the **convolutional block**, the second essential building block in ResNet.

##### When to Use Convolutional Blocks

Use this block when **input and output dimensions DON'T match**, such as:

-  **Spatial downsampling**: Reducing height/width (e.g., 32×32 → 16×16)
-  **Channel expansion**: Increasing depth (e.g., 128 channels → 256 channels)
-  **Transitioning between stages** in the network


##### Key Difference: Shortcut Path Has a Convolution

<img src="images/convblock_kiank.png" style="width:650px;height:150px;">
<caption><center> <u> <font color='purple'> <b>Figure 5</b> </u><font color='purple'>  : <b>Convolutional block</b> </center></caption>

**Unlike the identity block**, the shortcut path contains:

- **CONV2D layer**: Transforms input to match output dimensions
- **BatchNorm**: Normalizes the transformed shortcut
- **No activation**: Shortcut applies only a linear transformation

**Why is this needed?**

For the final addition to work, both paths must have **matching dimensions**:
```python
# This MUST work:
output = main_path_output + shortcut_output  # Shapes must match!
```

If input is `(32, 32, 128)` and output should be `(16, 16, 256)`:
- Main path reduces spatial dimensions and expands channels
- Shortcut path ALSO must transform `(32, 32, 128)` → `(16, 16, 256)`



#####  Architecture Details

**Main Path** (same as identity block, but with stride):

**First Component**:

- **CONV2D**: $F_1$ filters, (1,1), **stride (s,s)**, padding='valid'
  -  **Stride s**: This is where spatial downsampling happens!
  - *Name*: `conv_name_base + '2a'`
- **BatchNorm**: `bn_name_base + '2a'`
- **Activation**: ReLU

**Second Component**:

- **CONV2D**: $F_2$ filters, (f,f), stride (1,1), padding='same'
  - *Name*: `conv_name_base + '2b'`
- **BatchNorm**: `bn_name_base + '2b'`
- **Activation**: ReLU

**Third Component**:

- **CONV2D**: $F_3$ filters, (1,1), stride (1,1), padding='valid'
  - *Name*: `conv_name_base + '2c'`
- **BatchNorm**: `bn_name_base + '2c'`
- **No activation** here



**Shortcut Path** (NEW! This is what makes it different):

- **CONV2D**: $F_3$ filters, (1,1), **stride (s,s)**, padding='valid'
  -  **Same stride as first component** → matches spatial dimensions
  -  **Same number of filters as third component** → matches channels
  - *Name*: `conv_name_base + '1'`
- **BatchNorm**: `bn_name_base + '1'`
- **No activation** (keeps it as a linear projection)



**Final Step**:

- **Add**: Main path + Shortcut path
- **Activation**: ReLU on sum



#####  Dimension Transformation Example

Let's trace dimensions through a convolutional block:

**Input**: `(H, W, C_in)` = `(32, 32, 128)`  
**Parameters**: `s=2`, `filters=[64, 64, 256]`

**Main Path**:
```
Input:          (32, 32, 128)
↓ Conv 1×1, s=2, 64 filters
                (16, 16, 64)   ← stride 2 reduces spatial dims
↓ Conv 3×3, s=1, 64 filters
                (16, 16, 64)   ← same padding preserves dims
↓ Conv 1×1, s=1, 256 filters
Main output:    (16, 16, 256)  ← expand to final channels
```

**Shortcut Path**:
```
Input:          (32, 32, 128)
↓ Conv 1×1, s=2, 256 filters
Shortcut:       (16, 16, 256)  ← matches main path!
```

**Addition**: `(16, 16, 256)` + `(16, 16, 256)` = `(16, 16, 256)` ✅
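
The same trace can be computed rather than worked out by hand. The sketch below uses the standard output-size rules for `'valid'` and `'same'` padding; the helper names `conv_output_shape` and `trace_convolutional_block` are written for this example and are not part of Keras.

```python
import math


def conv_output_shape(shape, filters, kernel, stride, padding):
    """Output (H, W, C) of a 2D convolution with a square kernel and stride.

    'valid': floor((n - kernel) / stride) + 1
    'same' : ceil(n / stride)
    """
    h, w, _ = shape
    if padding == "valid":
        size = lambda n: (n - kernel) // stride + 1
    elif padding == "same":
        size = lambda n: math.ceil(n / stride)
    else:
        raise ValueError(f"unknown padding: {padding}")
    return (size(h), size(w), filters)


def trace_convolutional_block(shape, f, filters, s):
    """Return (main_path_shape, shortcut_shape) for the block described above."""
    f1, f2, f3 = filters
    main = conv_output_shape(shape, f1, kernel=1, stride=s, padding="valid")
    main = conv_output_shape(main, f2, kernel=f, stride=1, padding="same")
    main = conv_output_shape(main, f3, kernel=1, stride=1, padding="valid")
    shortcut = conv_output_shape(shape, f3, kernel=1, stride=s, padding="valid")
    return main, shortcut


main, shortcut = trace_convolutional_block((32, 32, 128), f=3, filters=[64, 64, 256], s=2)
print("main path:", main)
print("shortcut: ", shortcut)
```

If the two shapes ever differ for a given `s` and `filters`, the `Add` layer in the block will fail, so a check like this is a cheap way to validate a configuration before building the model.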



#####  Why No Activation on Shortcut?

The shortcut represents the **identity (or linear projection) of the input**. Adding a non-linearity would:

- ❌ Break the gradient highway property
- ❌ Make it harder to learn identity mappings
- ❌ Reduce the effectiveness of skip connections

The shortcut should be a **pure information pathway**, while the main path learns non-linear transformations.



### Task 2: Implement the Convolutional Block

We've implemented the first component of the main path. 

**Your task**: Complete the rest.

**What to implement**:

1. Second and third components of main path
2. **Entire shortcut path** (this is the new part!)
3. Final addition and activation

**Key reminders**:

- Use stride `(s,s)` in the FIRST main path conv AND the shortcut conv
- Shortcut has $F_3$ filters (same as third component output)
- No ReLU on shortcut path or after third component (only after final addition)

**References**:

- [Conv2D](https://keras.io/layers/convolutional/#conv2d)
- [BatchNorm](https://keras.io/layers/normalization/#batchnormalization) (use `axis=3`)
- [Activation](https://keras.io/layers/core/#activation): `Activation('relu')(X)`
- [Add](https://keras.io/layers/merge/#add)

In [None]:
# FUNCTION: convolutional_block

def convolutional_block(X, f, filters, stage, block, s = 2):
    """
    Implementation of the convolutional block as defined in Figure 5

    Arguments:
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    stage -- integer, used to name the layers, depending on their position in the network
    block -- string/character, used to name the layers, depending on their position in the network
    s -- Integer, specifying the stride to be used

    Returns:
    X -- output of the convolutional block, tensor of shape (m, n_H, n_W, n_C)
    """

    # defining name basis
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'

    # Retrieve Filters
    F1, F2, F3 = filters

    # Save the input value
    X_shortcut = X


    ##### MAIN PATH #####
    # First component of main path
    X = Conv2D(F1, (1, 1), strides = (s,s), name = conv_name_base + '2a' )(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2a')(X)
    X = Activation('relu')(X)

    ### START CODE HERE ###

    # Second component of main path (≈3 lines)

    # Third component of main path (≈2 lines)


    ##### SHORTCUT PATH #### (≈2 lines)


    # Final step: Add shortcut value to main path, and pass it through a RELU activation (≈2 lines)


    ### END CODE HERE ###

    return X

In [None]:
import numpy as np
np.random.seed(1)
X = np.random.randn(3, 4, 4, 6)
output = convolutional_block(X, f = 2, filters = [2, 4, 6], stage = 1, block = 'a')

print(output.detach().cpu().numpy().sum())


**Expected Output**:

<table>
    <tr>
        <td>
           20~30
        </td>
    </tr>

</table>



#### Identity Block vs Convolutional Block: Quick Reference

| Feature | Identity Block | Convolutional Block |
|---------|----------------|---------------------|
| **Use Case** | Same input/output dimensions | Different input/output dimensions |
| **Shortcut Path** | Direct connection (no layers) | Conv2D + BatchNorm |
| **Main Path Stride** | Always (1,1) | First conv uses stride (s,s) |
| **When Used** | Within a stage | Between stages (transitions) |
| **Purpose** | Deepen network without changing dims | Spatial downsampling or channel expansion |
| **Example** | Stage 2, blocks 'b' and 'c' | Stage 2, block 'a' (entry to stage) |
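
The "When Used" and "Example" rows follow a simple pattern that can be written down as code. The sketch below lists the blocks of one stage together with the name prefix used by `conv_name_base` in this lab; `stage_plan` is a hypothetical helper for illustration only.

```python
import string


def stage_plan(stage, num_blocks):
    """List (layer-name prefix, block type) for one ResNet stage.

    The first block ('a') is convolutional; the remaining ones are identity blocks.
    """
    plan = []
    for i in range(num_blocks):
        letter = string.ascii_lowercase[i]
        kind = "convolutional" if i == 0 else "identity"
        plan.append((f"res{stage}{letter}", kind))
    return plan


for name, kind in stage_plan(stage=2, num_blocks=3):
    print(f"{name}: {kind} block")
```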


### Summary: Building Blocks Mastered!

You've now implemented the two fundamental ResNet building blocks:

```
┌─────────────────────────────────────────────────────────────┐
│                    IDENTITY BLOCK                           │
│  Input (H, W, C) ───────────────────────────┐               │
│       │                                     │               │
│       ├─→ Conv 1×1 (reduce) ─→ BN ─→ ReLU   │               │
│       ├─→ Conv 3×3 (extract) ─→ BN ─→ ReLU  │               │
│       └─→ Conv 1×1 (expand) ─→ BN ──────────┤               │
│                                             ↓               │
│                                         ADD + ReLU          │
│                                             ↓               │
│                                    Output (H, W, C)         │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                 CONVOLUTIONAL BLOCK                         │
│  Input (H, W, C₁) ──────────────────────────┐               │
│       │                                     │               │
│       ├─→ Conv 1×1 (s=2, reduce) ─→ BN ─→ ReLU              │
│       ├─→ Conv 3×3 (s=1, extract) ─→ BN ─→ ReLU             │
│       └─→ Conv 1×1 (s=1, expand) ─→ BN ─────┤               │
│                                             │               │
│                                             │               │
│                          Conv 1×1 (s=2) ─→ BN               │
│                                             ↓               │
│                                         ADD + ReLU          │
│                                             ↓               │
│                                  Output (H/2, W/2, C₂)      │
└─────────────────────────────────────────────────────────────┘
```

**Next step**: Stack these blocks to build a complete ResNet-50!

###  Optional: Visualize Block Dimensions

Run the cell below to see how dimensions flow through each block type:

In [None]:
"""
Optional: Visualize how dimensions transform through ResNet blocks
This helps understand the architecture better!
"""
def visualize_block_dimensions():
    print("=" * 70)
    print("IDENTITY BLOCK - Dimension Flow")
    print("=" * 70)

    # Test identity block
    test_input = np.random.randn(1, 16, 16, 256).astype("float32")
    print(f"Input shape:        {test_input.shape}")

    output = identity_block(test_input, f=3, filters=[64, 64, 256], stage=1, block='test')

    if isinstance(output, torch.Tensor):
        output_shape = tuple(output.shape)
    else:
        try:
            from keras import ops as Kops
            output_shape = tuple(Kops.shape(output))
        except Exception:
            output_shape = output.shape

    print(f"Output shape:       {output_shape}")
    print("✓ Dimensions preserved (identity block characteristic)")

    print("\n" + "=" * 70)
    print("CONVOLUTIONAL BLOCK - Dimension Flow")
    print("=" * 70)

    # Test convolutional block
    test_input2 = np.random.randn(1, 32, 32, 128).astype("float32")
    print(f"Input shape:        {test_input2.shape}")

    output2 = convolutional_block(test_input2, f=3, filters=[64, 64, 256], stage=2, block='test', s=2)

    if isinstance(output2, torch.Tensor):
        output2_shape = tuple(output2.shape)
    else:
        try:
            from keras import ops as Kops
            output2_shape = tuple(Kops.shape(output2))
        except Exception:
            output2_shape = output2.shape

    print(f"Output shape:       {output2_shape}")
    print(f"✓ Spatial dims reduced by stride 2: {test_input2.shape[1]} → {output2_shape[1]}")
    print(f"✓ Channels expanded: {test_input2.shape[3]} → {output2_shape[3]}")
    print("=" * 70)

#  run visualization
visualize_block_dimensions()

## 3 - Building your first ResNet model (50 layers)

You now have the necessary blocks to build a very deep ResNet. The following figure describes in detail the architecture of this neural network. "ID BLOCK" in the diagram stands for "Identity block," and "ID BLOCK x3" means you should stack 3 identity blocks together.

<img src="images/resnet_kiank.png" style="width:850px;height:150px;">
<caption><center> <u> <font color='purple'> <b>Figure 6 </b></u><font color='purple'>  : <b>ResNet-50 model</b> </center></caption>

The details of this ResNet-50 model are:

- Zero-padding pads the input with a pad of (3,3)
- Stage 1:
    - The 2D Convolution has 64 filters of shape (7,7) and uses a stride of (2,2). Its name is "conv1".
    - BatchNorm is applied to the channels axis of the input.
    - MaxPooling uses a (3,3) window and a (2,2) stride.
- Stage 2:
    - The convolutional block uses three sets of filters of size [64,64,256], "f" is 3, "s" is 1 and the block is "a".
    - The 2 identity blocks use three sets of filters of size [64,64,256], "f" is 3 and the blocks are "b" and "c".
- Stage 3:
    - The convolutional block uses three sets of filters of size [128,128,512], "f" is 3, "s" is 2 and the block is "a".
    - The 3 identity blocks use three sets of filters of size [128,128,512], "f" is 3 and the blocks are "b", "c" and "d".
- Stage 4:
    - The convolutional block uses three sets of filters of size [256, 256, 1024], "f" is 3, "s" is 2 and the block is "a".
    - The 5 identity blocks use three sets of filters of size [256, 256, 1024], "f" is 3 and the blocks are "b", "c", "d", "e" and "f".
- Stage 5:
    - The convolutional block uses three sets of filters of size [512, 512, 2048], "f" is 3, "s" is 2 and the block is "a".
    - The 2 identity blocks use three sets of filters of size [512, 512, 2048], "f" is 3 and the blocks are "b" and "c".
- The 2D Average Pooling uses a window of shape (2,2) and its name is "avg_pool".
- The flatten doesn't have any hyperparameters or name.
- The Fully connected layer reduces its input to the number of classes using a softmax activation. Its name should be `'fc' + str(classes)`.
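
To see why a (2,2) average-pooling window fits at the end, it helps to trace the tensor shape through the whole network for the default `(64, 64, 3)` input. The sketch below is pure arithmetic with hypothetical helper names (`valid_out`, `trace_resnet50`). It assumes `'valid'` padding for the Stage 1 convolution and both pooling layers, and an average-pooling stride equal to its window size, which are the Keras defaults as far as this lab uses them.

```python
def valid_out(n, window, stride):
    """Output size of a 'valid' (unpadded) convolution or pooling window."""
    return (n - window) // stride + 1


def trace_resnet50(height=64, width=64):
    """Return [(layer description, (H, W, C)), ...] for the architecture listed above."""
    trace = []
    h, w = height + 2 * 3, width + 2 * 3                    # ZeroPadding2D((3, 3))
    trace.append(("zero-pad", (h, w, 3)))
    h, w = valid_out(h, 7, 2), valid_out(w, 7, 2)           # conv1: 7x7, stride 2
    trace.append(("stage 1 conv", (h, w, 64)))
    h, w = valid_out(h, 3, 2), valid_out(w, 3, 2)           # max pool: 3x3, stride 2
    trace.append(("stage 1 max pool", (h, w, 64)))
    # (stage number, stride s of the convolutional block, output channels F3)
    for stage, s, channels in [(2, 1, 256), (3, 2, 512), (4, 2, 1024), (5, 2, 2048)]:
        h, w = valid_out(h, 1, s), valid_out(w, 1, s)       # 1x1 conv with stride s
        trace.append((f"stage {stage}", (h, w, channels)))  # identity blocks keep the shape
    h, w = valid_out(h, 2, 2), valid_out(w, 2, 2)           # avg pool: 2x2 window
    trace.append(("avg pool", (h, w, 2048)))
    return trace


for name, shape in trace_resnet50():
    print(f"{name:18s} {shape}")
```

The "50" in the name comes from the usual way of counting weight layers: the Stage 1 convolution, three convolutions in each of the 3 + 4 + 6 + 3 = 16 blocks, and the final fully connected layer (1 + 48 + 1). The projection convolutions on the shortcut paths are not included in that count.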

### Task 3: Implement the ResNet with 50 layers described in the figure above.
We have implemented Stages 1 and 2. Please implement the rest. (The syntax for implementing Stages 3-5 should be quite similar to that of Stage 2.) Make sure you follow the naming convention in the text above.

You'll need to use this function:

- Average pooling [see reference](https://keras.io/layers/pooling/#averagepooling2d)

Here are some other functions we used in the code below:

- Conv2D: [See reference](https://keras.io/layers/convolutional/#conv2d)
- BatchNorm: [See reference](https://keras.io/layers/normalization/#batchnormalization) (axis: Integer, the axis that should be normalized (typically the features axis))
- Zero padding: [See reference](https://keras.io/layers/convolutional/#zeropadding2d)
- Max pooling: [See reference](https://keras.io/layers/pooling/#maxpooling2d)
- Fully connected layer: [See reference](https://keras.io/layers/core/#dense)
- Addition: [See reference](https://keras.io/layers/merge/#add)

In [None]:
# FUNCTION: ResNet50

from keras.layers import Input, ZeroPadding2D, AveragePooling2D, Flatten, Dense, Conv2D, MaxPooling2D
from keras.models import Model

def ResNet50(input_shape = (64, 64, 3), classes = 6):
    """
    Implementation of the popular ResNet50 with the following architecture:
    CONV2D -> BATCHNORM -> RELU -> MAXPOOL -> CONVBLOCK -> IDBLOCK*2 -> CONVBLOCK -> IDBLOCK*3
    -> CONVBLOCK -> IDBLOCK*5 -> CONVBLOCK -> IDBLOCK*2 -> AVGPOOL -> TOPLAYER

    Arguments:
    input_shape -- shape of the images of the dataset
    classes -- integer, number of classes

    Returns:
    model -- a Model() instance in Keras
    """

    # Define the input as a tensor with shape input_shape
    X_input = Input(input_shape)


    # Zero-Padding
    X = ZeroPadding2D((3, 3))(X_input)

    # Stage 1
    X = Conv2D(64, (7, 7), strides = (2, 2), name = 'conv1')(X)
    X = BatchNormalization(axis = 3, name = 'bn_conv1')(X)
    X = Activation('relu')(X)
    X = MaxPooling2D((3, 3), strides=(2, 2))(X)

    # Stage 2
    X = convolutional_block(X, f = 3, filters = [64, 64, 256], stage = 2, block='a', s = 1)
    X = identity_block(X, 3, [64, 64, 256], stage=2, block='b')
    X = identity_block(X, 3, [64, 64, 256], stage=2, block='c')

    ### START CODE HERE ###

    # Stage 3 (≈4 lines)


    # Stage 4 (≈6 lines)


    # Stage 5 (≈3 lines)


    # AVGPOOL (≈1 line). Use "X = AveragePooling2D(...)(X)"


    ### END CODE HERE ###

    # output layer
    X = Flatten()(X)
    X = Dense(classes, activation='softmax', name='fc' + str(classes))(X)


    # Create model
    model = Model(inputs = X_input, outputs = X, name='ResNet50')

    return model

Run the following code to build the model's graph. 

In [None]:
model = ResNet50(input_shape = (64, 64, 3), classes = 6)

In [None]:
# print model summary
model.summary()

### Compile the Model 

Run the following code to compile your model with:

   - `adam` as the optimizer
   - `categorical_crossentropy` as the loss
   - `accuracy` as the evaluation metric

In [None]:

print("\nCompiling model...")

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

print("\nFinished Compiling model...")

The model is now ready to be trained. The only thing you need is a dataset.

## 4 - Applying the ResNet model (50 layers) to the SIGNS dataset

Let's load the SIGNS Dataset.

<img src="images/signs_data_kiank.png" style="width:450px;height:250px;">
<caption><center> <u> <font color='purple'> <b>Figure 7</b> </u><font color='purple'>  : <b>SIGNS dataset</b> </center></caption>

> **Note:**  
> In this dataset, the labels are **one-hot encoded**, so we compile the model with:
>
> ```python
> loss='categorical_crossentropy'
> ```
>
> If your labels are instead **integer class indices** (e.g., `0, 1, 2, ...`), then you should use:
>
> ```python
> loss='sparse_categorical_crossentropy'
> ```
>
> Both losses compute the same objective; they simply expect labels in different formats.
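
The equivalence is easy to verify for a single example. The two functions below are hand-written re-implementations for illustration only, not the Keras losses themselves (which also handle batching and numerical clipping), and the probability vector is a made-up softmax output.

```python
import math


def categorical_crossentropy(one_hot_label, probs):
    """Loss for a one-hot label such as [0, 1, 0]."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))


def sparse_categorical_crossentropy(class_index, probs):
    """Loss for an integer label such as 1."""
    return -math.log(probs[class_index])


probs = [0.1, 0.7, 0.2]  # hypothetical softmax output for one example
loss_one_hot = categorical_crossentropy([0, 1, 0], probs)
loss_sparse = sparse_categorical_crossentropy(1, probs)
print(loss_one_hot, loss_sparse)
```

In the one-hot form every term except the true class is multiplied by zero, which leaves exactly the single term that the sparse form computes directly.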


In [None]:
import h5py
import math

def load_dataset():
    # Loading the training and test datasets, you may need to change the path to where you have saved these files
    train_dataset = h5py.File('train_signs.h5', "r")
    train_set_x_orig = np.array(train_dataset["train_set_x"][:]) # your train set features
    train_set_y_orig = np.array(train_dataset["train_set_y"][:]) # your train set labels

    test_dataset = h5py.File('test_signs.h5', "r")
    test_set_x_orig = np.array(test_dataset["test_set_x"][:]) # your test set features
    test_set_y_orig = np.array(test_dataset["test_set_y"][:]) # your test set labels

    classes = np.array(test_dataset["list_classes"][:]) # the list of classes

    train_set_y_orig = train_set_y_orig.reshape((1, train_set_y_orig.shape[0]))
    test_set_y_orig = test_set_y_orig.reshape((1, test_set_y_orig.shape[0]))

    return train_set_x_orig, train_set_y_orig, test_set_x_orig, test_set_y_orig, classes



def convert_to_one_hot(Y, C):
    Y = np.eye(C)[Y.reshape(-1)].T
    return Y
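
The one-liner in `convert_to_one_hot` relies on NumPy row indexing: row `k` of an identity matrix is the one-hot vector for class `k`. The sketch below walks through it on a toy label array. The labels and `C = 3` are made up for illustration, whereas the lab itself uses 6 classes.

```python
import numpy as np

Y = np.array([[1, 0, 2, 1]])  # integer labels stored as a (1, m) row, as load_dataset returns them
C = 3                         # hypothetical number of classes for this toy example

identity = np.eye(C)                 # row k is the one-hot vector for class k
one_hot = identity[Y.reshape(-1)].T  # pick one row per label, then transpose to (C, m)

print(one_hot)
print(one_hot.T.shape)  # (m, C): the layout Keras expects, hence the extra .T in the next cell
```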

In [None]:
X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()
print(X_train_orig.shape, Y_train_orig.shape, X_test_orig.shape, Y_test_orig.shape, classes.shape)
# shuffle the training dataset
np.random.seed(1)
m = X_train_orig.shape[0]
permutation = list(np.random.permutation(m))
X_train_orig = X_train_orig[permutation]
Y_train_orig = Y_train_orig[:, permutation]

# Normalize image vectors
X_train = X_train_orig/255.
X_test = X_test_orig/255.

# Convert training and test labels to one hot matrices
Y_train = convert_to_one_hot(Y_train_orig, 6).T
Y_test = convert_to_one_hot(Y_test_orig, 6).T

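# Hold out the last 20% of the shuffled training set for validation
# (val_size is the index at which the validation portion starts)
val_size = int(len(X_train)*0.8)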
X_val = X_train[val_size:]
Y_val = Y_train[val_size:]
X_train = X_train[:val_size]
Y_train = Y_train[:val_size]


print ("number of training examples = " + str(X_train.shape[0]))
print ("number of test examples = " + str(X_test.shape[0]))
print ("X_train shape: " + str(X_train.shape))
print ("Y_train shape: " + str(Y_train.shape))
print ("X_val shape: " + str(X_val.shape))
print ("Y_val shape: " + str(Y_val.shape))
print ("X_test shape: " + str(X_test.shape))
print ("Y_test shape: " + str(Y_test.shape))

### Train Your Model for 2 Epochs

To begin, train the model for just **2 epochs**.  
This short run helps ensure that everything is working correctly and that your device (CPU or GPU) can train the model without any issues.

In [None]:
# Train for 2 epochs
model.fit(X_train, Y_train, epochs = 2, batch_size = 32)

**Expected Output**:

<table>
    <tr>
        <td>
            <b> Epoch 1/2 </b>
        </td>
        <td>
           loss: between 1 and 5, acc: between 0.2 and 0.5, although your results can be different from ours.
        </td>
    </tr>
    <tr>
        <td>
            <b> Epoch 2/2 </b>
        </td>
        <td>
           loss: between 1 and 5, acc: between 0.3 and 0.7, you should see your loss decreasing and the accuracy increasing.
        </td>
    </tr>

</table>

### Task 5: Output the test accuracy of this model (trained for only two epochs)


In [None]:
preds = model.evaluate(X_test, Y_test)
print ("Loss = " + str(preds[0]))
print ("Test Accuracy = " + str(preds[1]))

**Expected Output**:

<table>
    <tr>
        <td>
            <b>Test Accuracy</b>
        </td>
        <td>
           between 0.16 and 0.25
        </td>
    </tr>

</table>

After only two epochs, the model performs poorly. Next, let's train it for up to 50 epochs and see whether the performance improves substantially.

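To limit overfitting, we use the `EarlyStopping` callback. Training stops once the validation loss has not improved for `patience` consecutive epochs (10 in the cell below), and the weights from the best epoch are restored. The first 10 epochs are treated as a warm-up and are not monitored (`start_from_epoch=10`).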
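The stopping rule described above can be sketched in a few lines of plain Python. This is a simplified illustration and not the Keras source: the function `early_stopping_epoch` and the loss values are hypothetical, and options such as `min_delta` and `baseline` are ignored.

```python
def early_stopping_epoch(val_losses, patience=10, start_from_epoch=10):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses.

    Epochs are 0-indexed. Epochs before start_from_epoch are a warm-up
    and are not monitored.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if epoch < start_from_epoch:
            continue  # warm-up phase: do not track improvements yet
        if loss < best_loss:
            # Improvement: remember this epoch and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop; weights of best_epoch would be restored
    return len(val_losses) - 1, best_epoch  # ran through all epochs without stopping

# Hypothetical validation losses, with a short patience to keep the example small
val_losses = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7]
result = early_stopping_epoch(val_losses, patience=2, start_from_epoch=1)
print(result)  # (5, 3): stopped at epoch 5, best validation loss at epoch 3
```

With `restore_best_weights=True`, the model ends up with the weights from the best epoch rather than those from the epoch at which training stopped.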

### Train the model for 50 epochs and output the test accuracy

In [None]:
# Adding a Callback for Early Stopping to prevent overfitting
from keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=10, start_from_epoch=10,
                           restore_best_weights=True)

In [None]:
# Continue training the model for up to 50 epochs, monitoring the held-out validation set
import time
start_time = time.time()
history_model = model.fit(X_train, Y_train, epochs=50, batch_size=32,
                          validation_data=(X_val, Y_val), callbacks=[early_stop])
end_time = time.time()

# Calculate elapsed time
elapsed = end_time - start_time
mins = int(elapsed // 60)
secs = elapsed % 60

print(f"Total training time: {mins} min {secs:.1f} sec")


### Task 4: Draw the training curve and find the best epoch (lowest validation loss)
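As a starting hint, `history_model.history` is a dictionary that maps each metric name to a list with one value per epoch. The sketch below finds the epoch with the lowest validation loss; the dictionary `history` and its values are hypothetical stand-ins for `history_model.history`.

```python
# Hypothetical values standing in for history_model.history
history = {
    "loss": [1.9, 1.2, 0.8, 0.5, 0.3],
    "val_loss": [1.8, 1.3, 0.9, 1.0, 1.1],
}

val_loss = history["val_loss"]

# Index (0-based) of the epoch with the lowest validation loss
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
print(f"Best epoch (1-indexed): {best_epoch + 1}, val_loss = {val_loss[best_epoch]}")

# To draw the curves, assuming matplotlib is installed:
# import matplotlib.pyplot as plt
# plt.plot(history["loss"], label="training loss")
# plt.plot(history["val_loss"], label="validation loss")
# plt.xlabel("epoch"); plt.ylabel("loss"); plt.legend(); plt.show()
```

In this made-up example the training loss keeps falling while the validation loss rises after the third epoch, which is the typical signature of overfitting.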

### Task 5: Output the test accuracy

## 5 - `ResNet50` in Keras
ResNet50 is a widely used convolutional neural network architecture for image classification. Keras provides a [ResNet50 model](https://keras.io/api/applications/) that can optionally be loaded with weights pre-trained on the ImageNet dataset, which speeds up development on many image recognition tasks.

In the next step, we will use the Keras ResNet50 and compare its performance with our custom implementation. The comparison shows how the two approaches differ in accuracy and training time.


In [None]:
from keras.applications import ResNet50

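# Load the ResNet50 architecture without the top classification layer.
# weights=None initializes the network randomly; weights='imagenet' would load pre-trained weights.
base_model = ResNet50(weights=None, include_top=False, input_shape=(64, 64, 3))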

X = base_model.output
X = AveragePooling2D(pool_size=(2, 2), name = 'avg_pool')(X)
X = Flatten()(X)

# Add a softmax layer for the number of classes (6 in this case)
predictions = Dense(6, activation='softmax')(X)

# Create the final model
ResNet50_keras_model = Model(inputs=base_model.input, outputs=predictions)


### Task 6: Using the same model configuration you implemented above, train the model and report the test accuracy below.

In [None]:
# Draw the training curve



**What you should remember:**

- Very deep "plain" networks don't work in practice because they are hard to train due to vanishing gradients.  
- The skip connections help to address the vanishing gradient problem. They also make it easy for a ResNet block to learn an identity function.
- There are two main types of blocks: the identity block and the convolutional block.
- Very deep Residual Networks are built by stacking these blocks together.
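The identity-function point can be illustrated with a toy example. The sketch below is plain Python under strong simplifying assumptions: the residual branch is a single linear layer instead of convolutions with batch normalization, and all function names are hypothetical. When the branch weights are zero, the residual block passes its (non-negative) input through unchanged, while a plain block outputs zeros.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def linear(weights, v):
    """Multiply a square weight matrix (list of rows) by the vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def residual_block(v, weights):
    """y = ReLU(F(v) + v), where F is a single linear layer for simplicity."""
    f = linear(weights, v)
    return relu([a + b for a, b in zip(f, v)])

def plain_block(v, weights):
    """y = ReLU(F(v)), the same layer without the skip connection."""
    return relu(linear(weights, v))

x = [0.5, 1.0, 2.0]                            # non-negative, as after a previous ReLU
zero_weights = [[0.0] * 3 for _ in range(3)]   # the residual branch outputs zero

print(residual_block(x, zero_weights))  # [0.5, 1.0, 2.0]
print(plain_block(x, zero_weights))     # [0.0, 0.0, 0.0]
```

Driving the branch weights toward zero is therefore enough for a residual block to behave like the identity, whereas a plain layer would have to learn an explicit identity mapping.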

##  Congratulations

ResNet50 is a powerful model for image classification when it is trained for an adequate number of iterations. We hope you can apply what you've learnt to your own classification problem and achieve state-of-the-art accuracy.

Congratulations on finishing this lab! You've now implemented a state-of-the-art image classification system!

## References

This notebook introduces the ResNet architecture originally proposed by **He et al. (2015)**.  
The implementation draws inspiration from official resources and community examples, including work by **François Chollet** and the **DeepLearning.AI** team.

**Primary Source**
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun.  
  *Deep Residual Learning for Image Recognition*, 2015.  
  <https://arxiv.org/abs/1512.03385>


