### TP2 Advanced Deep Learning

**Authors:**
- Carlos Cuevas Villarmin
- Oliver Jack
- Javier Alejandro Lopetegui Gonzalez

*MVA, ENS Paris-Saclay*

# **Visualization of CNNs: Grad-CAM**
* **Objective**: Convolutional Neural Networks are widely used in computer vision. They are powerful for processing grid-like data. However, we hardly know how and why they work, because they cannot be decomposed into individually intuitive components. In this assignment, we use Grad-CAM, which highlights the regions of the input image that were important for the neural network's prediction.


* NB: if `PIL` is not installed, try `conda install pillow`.
* Computations are light enough to be done on CPU.

In [27]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple
from torchvision import models, datasets, transforms
import matplotlib.pyplot as plt
import pickle
import urllib.request

import numpy as np
from PIL import Image

%matplotlib inline

## Download the Model
We provide you with a model `DenseNet-121`, already pretrained on the `ImageNet` classification dataset.
* **ImageNet**: A large dataset of photographs with 1 000 classes.
* **DenseNet-121**: A deep architecture for image classification (https://arxiv.org/abs/1608.06993)

In [None]:
densenet121 = models.densenet121(pretrained=True)
densenet121.eval() # set the model to evaluation mode
pass

In [None]:
classes = pickle.load(urllib.request.urlopen('https://gist.githubusercontent.com/yrevar/6135f1bd8dcf2e0cc683/raw/d133d61a09d7e5a3b36b8c111a8dd5c4b5d560ee/imagenet1000_clsid_to_human.pkl'))

# classes is a dictionary mapping each class index to its human-readable name
print(classes)

## Input Images
We provide you with 20 images from ImageNet (download link on the webpage of the course, or download them directly using the command line given below).<br>
In order to use the pretrained DenseNet-121 model, the input image should be normalized using `mean = [0.485, 0.456, 0.406]` and `std = [0.229, 0.224, 0.225]`, and resized to `(224, 224)`.

In [4]:
def preprocess_image(dir_path):
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    # Note: to invert the normalisation, compute x * std + mean channel-wise

    dataset = datasets.ImageFolder(dir_path, transforms.Compose([
            transforms.Resize(256), # resize the shorter side to 256 pixels
            transforms.CenterCrop(224), # crop the central 224x224 region
            transforms.ToTensor(), # convert the PIL image to a tensor with values in [0, 1]
            normalize])) #normalize the tensor

    return (dataset)
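
The normalisation and its inverse can be sketched as follows. This is a minimal illustration, not part of the assignment code: the helper names `normalize` and `denormalize` and the random stand-in image are assumptions introduced here.

```python
import torch

# ImageNet statistics used by the preprocessing above, shaped for (3, H, W) tensors.
MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)


def normalize(img: torch.Tensor) -> torch.Tensor:
    """Channel-wise (x - mean) / std for a (3, H, W) tensor in [0, 1]."""
    return (img - MEAN) / STD


def denormalize(img: torch.Tensor) -> torch.Tensor:
    """Inverse of `normalize`: x * std + mean, clamped back to [0, 1]."""
    return (img * STD + MEAN).clamp(0.0, 1.0)


# Round trip on a random stand-in image (hypothetical data, not from the dataset).
fake_image = torch.rand(3, 224, 224)
restored = denormalize(normalize(fake_image))
print(torch.allclose(restored, fake_image, atol=1e-5))
```

The inverse is the affine map `x * std + mean`, not an element-wise reciprocal of the transform.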

In [None]:
import os
if not os.path.exists("data"):
    os.mkdir("data")
if not os.path.exists("data/TP2_images"):
    os.mkdir("data/TP2_images")
    !cd data/TP2_images && wget "https://www.lri.fr/~gcharpia/deeppractice/2025/TP2/TP2_images.zip" && unzip TP2_images.zip

dir_path = "data/"
dataset = preprocess_image(dir_path)

In [None]:
# show the original image
index = 5
input_image = Image.open(dataset.imgs[index][0]).convert('RGB')
plt.imshow(input_image)

In [None]:
input_image = dataset[index][0].view(1, 3, 224, 224)
output = densenet121(input_image)
values, indices = torch.topk(output, 3)
print("Top 3-classes:", indices[0].numpy(), [classes[x] for x in indices[0].numpy()])
print("Raw class scores:", values[0].detach().numpy())
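
Since the model outputs raw scores, probabilities and the least probable class can be derived from them afterwards. The sketch below uses a made-up 5-class score vector as a stand-in for `output`; the values are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

# Stand-in for `output`: raw scores (logits) for one image over 5 classes.
logits = torch.tensor([[2.0, -1.0, 0.5, 3.0, -4.0]])

probs = F.softmax(logits, dim=1)  # each row sums to 1
top_probs, top_idx = torch.topk(probs, k=3, dim=1)

# Softmax is monotonic, so the argmin of the logits is also the least probable class.
least_idx = torch.argmin(logits, dim=1)

print("Top-3 classes:", top_idx[0].tolist())
print("Top-3 probabilities:", [round(p, 3) for p in top_probs[0].tolist()])
print("Least probable class:", least_idx.item())
```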

# Grad-CAM
* **Overview:** Given an image, and a category (‘tiger cat’) as input, we forward-propagate the image through the model to obtain the `raw class scores` before softmax. The gradients are set to zero for all classes except the desired class (tiger cat), which is set to 1. This signal is then backpropagated to the `rectified convolutional feature map` of interest, where we can compute the coarse Grad-CAM localization (blue heatmap).


* **To Do**: Define your own function Grad_CAM to achieve the visualization of the given images. For each image, choose the top-3 possible labels as the desired classes. Compare the heatmaps of the three classes, and conclude.

More precisely, you should provide a function: `show_grad_cam(image: torch.tensor) -> None` that displays something like this:
![output_example.png](attachment:output_example.png)
where the heatmap will be correct (here it is just an example) and the first 3 classes are the top-3 predicted classes and the last is the least probable class according to the model.

* **Comment your code**: Your code should be easy to read and follow. Please comment your code, try to use the NumPy Style Python docstrings for your functions.

* **To be submitted within 2 weeks**: this notebook, **cleaned** (i.e. without results, for file size reasons: `menu > kernel > restart and clean`), in a state ready to be executed (with or without GPU) (if one just presses 'Enter' till the end, one should obtain all the results for all images) with a few comments at the end. No additional report, just the notebook!


* **Hints**:
 + We need to record the output and grad_output of the feature maps to achieve Grad-CAM. In pytorch, the function `Hook` is defined for this purpose. Read the tutorial of [hook](https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html#forward-and-backward-function-hooks) carefully.
 + More on [autograd](https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html) and [hooks](https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html#forward-and-backward-function-hooks)
 + The pretrained model densenet doesn't have an activation function after its last layer, the output is indeed the `raw class scores`, you can use them directly.
 + Your heatmap will have the same size as the feature map. You need to scale up the heatmap to the resized image (224x224, not the original one, before the normalization) for better observation purposes. The function [`torch.nn.functional.interpolate`](https://pytorch.org/docs/stable/nn.functional.html?highlight=interpolate#torch.nn.functional.interpolate) may help.  
 + Here is the link to the paper: [Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization](https://arxiv.org/pdf/1610.02391.pdf)

Class: ‘pug, pug-dog’ | Class: ‘tabby, tabby cat’
- | -
![alt](https://raw.githubusercontent.com/jacobgil/pytorch-grad-cam/master/examples/dog.jpg)| ![alt](https://raw.githubusercontent.com/jacobgil/pytorch-grad-cam/master/examples/cat.jpg)
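
The hook mechanism mentioned in the hints can be sketched on a toy network. This is only an illustration under assumptions: the two-layer model is hypothetical, and the gradient is captured with a tensor hook (`Tensor.register_hook`) attached inside the forward hook, which is a different mechanism from the module-level backward hook used in the implementation in Part 1.

```python
import torch
import torch.nn as nn

# Toy network (hypothetical): one conv layer followed by a linear classifier.
torch.manual_seed(0)
conv = nn.Conv2d(3, 4, kernel_size=3, padding=1)
head = nn.Linear(4, 2)

captured = {}


def forward_hook(module, inputs, output):
    # Store the feature map and attach a tensor hook that stores its gradient.
    captured["activation"] = output
    output.register_hook(lambda grad: captured.__setitem__("gradient", grad))


handle = conv.register_forward_hook(forward_hook)

x = torch.rand(1, 3, 8, 8)
features = torch.relu(conv(x))            # (1, 4, 8, 8)
scores = head(features.mean(dim=(2, 3)))  # (1, 2) raw class scores
scores[0, 1].backward()                   # backpropagate the score of class 1 only
handle.remove()                           # remove the hook once it is no longer needed

print(captured["activation"].shape, captured["gradient"].shape)
```

The activation and its gradient have the same shape, which is what allows the channel-wise weighting in Grad-CAM.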

## Part 1: Grad-CAM implementation

In [28]:
def grad_cam(model: nn.Module = densenet121, image: torch.Tensor = None, target_layer: nn.Module = densenet121.features.denseblock4) -> Tuple[list,list,list]:
    """
    Generate Grad-CAM heatmaps for a CNN-based model, an image and a target layer, for the three classes with
    the highest scores.

    Args:
        model (nn.Module): The model to be considered.
        image (torch.Tensor): The preprocessed input image, of shape (1, 3, 224, 224).
        target_layer (nn.Module): The target layer for Grad-CAM.

    Returns:
        all_heatmaps (list): A list of heatmaps (224x224 tensors), one per class.
        all_classes (list): A list of class indices.
        all_scores (list): A list of raw scores for each class.
    """
    output = model(image)
    values, indices = torch.topk(output, k=3, dim=1)

    all_heatmaps = []
    all_classes = indices[0].tolist()
    all_scores = values[0].tolist()

    for class_idx in indices[0]:
        model.zero_grad()

        feature_maps = None
        gradients = None

        def forward_hook(module, input, output):
            nonlocal feature_maps
            feature_maps = output

        def backward_hook(module, grad_in, grad_out):
            nonlocal gradients
            gradients = grad_out[0]

        forward_handle = target_layer.register_forward_hook(forward_hook)
        backward_handle = target_layer.register_backward_hook(backward_hook)

        output = model(image)

        one_hot = torch.zeros(output.shape, dtype=output.dtype, device=output.device)
        one_hot[0, class_idx] = 1

        output.backward(gradient=one_hot, retain_graph=True)

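        num_channels = feature_maps.shape[1]

        # Neuron importance weights: average the gradients over the spatial dimensions
        alpha_k = torch.mean(gradients, dim=(2, 3))
        alpha_k = alpha_k.view(1, num_channels, 1, 1)

        # Weighted sum of the feature maps over the channels, followed by a ReLU
        grad_cam = torch.sum(alpha_k * feature_maps, dim=1)
        grad_cam = grad_cam.squeeze(0)
        grad_cam = F.relu(grad_cam)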

        heatmap = F.interpolate(
            grad_cam.unsqueeze(0).unsqueeze(0),
            size=(224, 224),
            mode='bilinear',
            align_corners=False
        ).squeeze()

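        # Min-max normalise to [0, 1]; the small constant avoids a division by zero
        # when the heatmap is entirely zero after the ReLU
        heatmap -= heatmap.min()
        heatmap /= heatmap.max() + 1e-8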

        all_heatmaps.append(heatmap.cpu().detach())

        forward_handle.remove()
        backward_handle.remove()

    return all_heatmaps, all_classes, all_scores
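
For reference, the two core steps of the function above correspond to the following equations from the Grad-CAM paper, where $y^c$ is the raw score of class $c$, $A^k$ is the $k$-th feature map of the target layer and $Z$ is its number of spatial locations:

$$\alpha_k^c = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial y^c}{\partial A^k_{ij}}, \qquad L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\Big(\sum_k \alpha_k^c A^k\Big)$$

In the code, `alpha_k` is the spatial mean of `gradients`, and the heatmap is the ReLU of the channel-wise sum of `alpha_k * feature_maps`.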

In [29]:
def show_grad_cam(model: nn.Module = densenet121, image: torch.Tensor = None, target_layer: nn.Module = densenet121.features.denseblock4) -> None:
    """
    Show the Grad-CAM heatmaps of the top-3 classes, overlaid on the input image, for a given model and target layer.

    Args:
        model (nn.Module): The model to be considered.
        image (torch.Tensor): The preprocessed input image, of shape (1, 3, 224, 224).
        target_layer (nn.Module): The target layer for Grad-CAM.

    Returns:
        None
    """
    all_heatmaps, all_classes, all_scores = grad_cam(model=model, image=image, target_layer=target_layer)

    fig, axes = plt.subplots(1, 4, figsize=(20, 5))

    original_img = image.squeeze(0).permute(1, 2, 0)
    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 1, 3)
    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 1, 3)
    original_np = original_img * std + mean
    original_np = original_np.clamp(0, 1).cpu().numpy()

    axes[0].imshow(original_np)
    axes[0].set_title('Input Image')
    axes[0].set_xticks(np.arange(0, 224, 50))
    axes[0].set_yticks(np.arange(0, 224, 50))

    for i, (heatmap, class_idx, score) in enumerate(zip(all_heatmaps, all_classes, all_scores)):
        heatmap_np = heatmap.numpy()

        colormap = plt.cm.jet(heatmap_np)
        colormap = colormap[..., :3]
        alpha = 0.5
        overlay = (1 - alpha) * original_np + alpha * colormap
        overlay = np.clip(overlay, 0, 1)

        im = axes[i+1].imshow(overlay)
        class_name = classes.get(class_idx, f'Class {class_idx}')
        axes[i+1].set_title(f'{class_name}\nScore: {score:.3f}')
        axes[i+1].set_xticks(np.arange(0, 224, 50))
        axes[i+1].set_yticks(np.arange(0, 224, 50))

        plt.colorbar(im, ax=axes[i+1])

    plt.tight_layout()
    plt.show()
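
The overlay step is an alpha blend of the de-normalised image with a coloured heatmap. The sketch below isolates it with NumPy only; the helper names, the random data and the simple red-to-blue ramp (a stand-in for `plt.cm.jet`) are assumptions made for illustration.

```python
import numpy as np


def normalize_heatmap(heatmap: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Min-max scale to [0, 1]; `eps` avoids 0/0 when the map is constant."""
    shifted = heatmap - heatmap.min()
    return shifted / (shifted.max() + eps)


def blend(image: np.ndarray, heat_rgb: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Convex combination of an RGB image and an RGB heatmap, both in [0, 1]."""
    return np.clip((1 - alpha) * image + alpha * heat_rgb, 0.0, 1.0)


rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))               # stand-in for the de-normalised image
flat = normalize_heatmap(np.zeros((4, 4)))  # an all-zero map stays finite
heat = normalize_heatmap(rng.random((4, 4)))

# Simple colour ramp: high values are red, low values are blue.
heat_rgb = np.stack([heat, np.zeros_like(heat), 1 - heat], axis=-1)
overlay = blend(image, heat_rgb)
print(overlay.shape, np.isfinite(flat).all())
```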

## Part 2: Try it on a few (1 to 3) images and comment

In [None]:
for index in [9, 10, 15]:
    input_image = dataset[index][0].view(1, 3, 224, 224)
    show_grad_cam(image=input_image)

Based on the results obtained, DenseNet-121 classifies these images well in general, and the Grad-CAM heatmaps highlight the core areas related to each target class. The score of the correct class tends to be the highest and is clearly separated from the scores of the other classes. The heatmaps also show that animals with similar colour, texture or structure tend to be the next most likely classes. For example, in the sorrel image, the heatmap for sorrel covers the entire body of the horses, while the heatmap for basenji concentrates on the head and the heatmap for ox concentrates on the fur. Finally, the model still shows weaknesses in some cases: for the sea lion, the third most probable class is balance beam, and the corresponding heatmap concentrates on the lower part of the sea lion's body.

## Part 3: Try GradCAM on other convolutional layers, describe and comment the results

In [None]:
target_layers = [densenet121.features.conv0,
                 densenet121.features.denseblock1,
                 densenet121.features.denseblock2,
                 densenet121.features.denseblock3,
                 densenet121.features.denseblock4]

for layer in target_layers:
    input_image = dataset[5][0].view(1, 3, 224, 224)
    show_grad_cam(image=input_image, target_layer=layer)

By considering different convolutional layers, we can see how well the feature maps at each depth identify the key structures of a specific class. A general observation is that, as we move deeper into the architecture, the heatmap tends to localize the regions relevant to the target class more precisely, while earlier layers tend to highlight the contours present in the image. This progressive improvement occurs because deeper layers learn increasingly abstract and class-specific features, while shallow layers respond to more basic visual elements, such as edges and textures, that may appear throughout the image.

## Part 4: Try GradCAM on `9928031928.png` , describe and comment the results

In [None]:
# elephant image without noise
input_image = dataset[0][0].view(1, 3, 224, 224)
show_grad_cam(image=input_image)

# image 9928031928.png (adversarial example)
input_image = dataset[-1][0].view(1, 3, 224, 224)
show_grad_cam(image=input_image)

When comparing the two images, the human eye can easily identify an elephant in both cases. However, the model fails to predict the correct class in the second case, classifying it most likely as a jeep. Taking a closer look at the image, we can see that its colours are clearly distorted compared to the original (first) image. This suggests that a perturbation was added to the image, making it an adversarial example; such perturbations are typically crafted to change the prediction rather than drawn at random. This shows a key vulnerability of deep neural networks: small perturbations of the input image, not always observable to the naked eye, can cause the model to make completely wrong predictions.
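
How such a perturbation can be built is sketched below with the fast gradient sign method (FGSM), one well-known attack. This is an assumption made for illustration: the notebook does not state which attack produced `9928031928.png`, and the toy linear classifier and random image are hypothetical stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy linear classifier on 3x8x8 inputs (hypothetical stand-in for a CNN).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 5))
model.eval()

image = torch.rand(1, 3, 8, 8, requires_grad=True)
label = model(image).argmax(dim=1)  # treat the current prediction as the true label

# One FGSM step: move every pixel by epsilon in the direction that increases the loss.
loss = F.cross_entropy(model(image), label)
loss.backward()
epsilon = 0.05
adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()

print("max pixel change:", (adversarial - image.detach()).abs().max().item())
print("prediction before/after:", label.item(), model(adversarial).argmax(dim=1).item())
```

The perturbation is bounded by `epsilon` per pixel, which is why it can remain hard to see while still following the gradient of the loss.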

## Part 5: What are the principal contributions of GradCAM (the answer is in the paper) ?

Grad-CAM is introduced as a class-discriminative localization technique. It generates visual explanations for any CNN-based network without requiring architectural changes or re-training. It is a generalization of CAM [Zhou et al. (2016)](https://openaccess.thecvf.com/content_cvpr_2016/papers/Zhou_Learning_Deep_Features_CVPR_2016_paper.pdf), which is restricted to CNNs without fully-connected layers. Grad-CAM removes this restriction and can be used with CNNs with fully-connected layers (such as VGG), CNNs used for structured outputs, or CNNs with multi-modal inputs. Explainability is therefore obtained without changing the model's architecture, and thus without compromising performance.

Grad-CAM is applied to classification, captioning and VQA models. For classification, it helps find reasonable explanations for predictions that initially seemed unreasonable. For captioning and VQA, the visualizations obtained with Grad-CAM show that common CNN + LSTM models are good at localizing discriminative image regions despite not being trained on grounded (image, text) pairs.

The authors also show that analyzing Grad-CAM visualizations helps identify failure cases caused by bias in the dataset. Taking this information into account not only improves explainability but also helps obtain fairer, less biased outcomes, which is very important in fields where more and more decisions are made by algorithms.

Based on the neuron importance weights $\alpha_k^c$, a deeper analysis can be done to tag each neuron with the concepts it looks at in a given image. Higher positive values of the neuron importance indicate that the presence of that concept leads to an increase in the class score, whereas higher negative values indicate that its absence leads to an increase in the score for the class. Concretely, the authors use the top-5 and bottom-5 concepts ranked by their $\alpha_k^c$ values.

To reinforce these conclusions, human studies were conducted to show that the explanations obtained with the technique are class-discriminative. They help humans establish trust in a model, and even untrained users were able to differentiate between a 'stronger' and a 'weaker' network based on the visualizations, even when both networks make the same prediction, i.e., the visualizations obtained with Grad-CAM are understandable and interpretable for humans.

## Bonus 5: What are the main differences between DenseNet and ResNet ?

DenseNet and ResNet are two convolutional neural network architectures commonly used for image classification tasks. The main differences between these architectures can be summarized as follows:

- **Architecture and Connectivity**

ResNet introduces skip connections that allow information to bypass one or more layers. These connections facilitate the flow of gradients through the network, enabling the training of very deep models. In contrast, DenseNet proposes a different *connectivity pattern* in which each layer is directly connected to all preceding layers: the feature maps of all previous layers are concatenated and fed into subsequent layers.

- **Feature Propagation and Reuse**

The skip connections in ResNet primarily address the vanishing gradient problem, allowing for the training of extremely deep networks. DenseNet's dense connectivity pattern promotes feature reuse throughout the network, potentially leading to more efficient use of parameters.

- **Model Complexity and Efficiency**

ResNet typically requires more parameters due to its architecture, but it can achieve great depths, sometimes extending to hundreds or even thousands of layers. DenseNet, on the other hand, tends to be more parameter-efficient, often achieving comparable or superior performance with fewer parameters.

- **Memory Usage and Computational Requirements**

ResNet generally exhibits lower memory usage during training and inference. DenseNet, due to its dense connectivity and feature map concatenation, may require more memory, especially for deeper networks.

- **Performance Characteristics**

Both architectures have demonstrated exceptional performance across various tasks. ResNet excels in training very deep networks, while DenseNet often achieves high accuracy with fewer parameters. The choice between them typically depends on specific task requirements, available computational resources, and the nature of the dataset being used.
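
The two connectivity patterns compared above can be written compactly, following the notation of the DenseNet paper, where $x_\ell$ is the output of layer $\ell$, $H_\ell$ is its non-linear transformation and $[\cdot]$ denotes channel-wise concatenation:

$$x_\ell = H_\ell(x_{\ell-1}) + x_{\ell-1} \quad \text{(ResNet)}, \qquad x_\ell = H_\ell([x_0, x_1, \dots, x_{\ell-1}]) \quad \text{(DenseNet)}$$

Addition keeps the number of channels fixed, whereas concatenation makes the input of each layer grow with depth, which is the source of both the feature reuse and the higher memory usage mentioned above.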


**Grad-CAM analysis for ResNet:**

After comparing the two architectures, let's look at the results obtained by applying the Grad-CAM technique to ResNet-based models.

In [None]:
# Load ResNet pre-trained model
resnet18 = models.resnet18(pretrained=True)
resnet18.eval()

for index in [9, 10, 15]:
    input_image = dataset[index][0].view(1, 3, 224, 224)
    show_grad_cam(resnet18, image=input_image, target_layer=resnet18.layer4[1].bn2)

Overall, Grad-CAM behaves similarly regardless of the model used. The main differences lie in the models' predictions and confidence, which was higher (in the examples considered) for densenet121. Moreover, as with densenet, the sea lion image gives another funny example: in the second heatmap, the label with the second-highest score is *cowboy boot*, and the model is looking at what would be "the tip of the shoe".

**Grad-CAM evolution across layers for Resnet:**

If we repeat the analysis done for densenet on the evolution of Grad-CAM as we go deeper in the architecture, we arrive at the same conclusions. Early layers appear to focus on more local features such as edges or textures, while deeper ones capture more semantically relevant information.

In [None]:
target_layers = [resnet18.conv1,
                 resnet18.layer1,
                 resnet18.layer2,
                 resnet18.layer3,
                 resnet18.layer4]
names = ["conv1", "layer1", "layer2", "layer3", "layer4"]
input_image = dataset[5][0].view(1, 3, 224, 224)
for i,layer in enumerate(target_layers):
    print(f"Applying Grad-CAM to layer: {names[i]}")
    show_grad_cam(model = resnet18, image=input_image, target_layer=layer)

**Adversarial example analysis**:

In the next cell we can see the results of applying Grad-CAM to the elephant images. In this case the adversarial example (index=-1) does not significantly affect the model's prediction: the three labels with the highest scores remain the same, only in a different order. This suggests that the adversarial attack was crafted against the densenet model.

In [None]:
# elephant image without noise
input_image = dataset[0][0].view(1, 3, 224, 224)
show_grad_cam(resnet18, image=input_image, target_layer=resnet18.layer4[1].bn2)

# elephant image with noise
input_image = dataset[-1][0].view(1, 3, 224, 224)
show_grad_cam(resnet18, image=input_image, target_layer=resnet18.layer4[1].bn2)