# **Visualization of CNNs: Grad-CAM**
* **Objective**: Convolutional Neural Networks are widely used in computer vision and are powerful for processing grid-like data. However, we hardly know how and why they work, because they cannot be decomposed into individually intuitive components. In this assignment, we use Grad-CAM, which highlights the regions of the input image that were important for the network's prediction.


* NB: if `PIL` is not installed, try `conda install pillow`.
* Computations are light enough to be done on CPU.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models, datasets, transforms
import matplotlib.pyplot as plt
import pickle
import urllib.request

import numpy as np
from PIL import Image

%matplotlib inline

## Download the Model
We provide you with a model `DenseNet-121`, already pretrained on the `ImageNet` classification dataset.
* **ImageNet**: A large dataset of photographs with 1 000 classes.
* **DenseNet-121**: A deep architecture for image classification (https://arxiv.org/abs/1608.06993)

In [None]:
densenet121 = models.densenet121(pretrained=True)
densenet121.eval() # set the model to evaluation mode
pass

In [None]:
# print(densenet121)

In [None]:
classes = pickle.load(urllib.request.urlopen('https://gist.githubusercontent.com/yrevar/6135f1bd8dcf2e0cc683/raw/d133d61a09d7e5a3b36b8c111a8dd5c4b5d560ee/imagenet1000_clsid_to_human.pkl'))

# classes is a dictionary mapping each class index to its human-readable name
print(classes)

## Input Images
We provide you with 20 images from ImageNet (download link on the course webpage, or download them directly with the `wget` command in a later cell).<br>
In order to use the pretrained DenseNet-121 model, the input image should be normalized using `mean = [0.485, 0.456, 0.406]` and `std = [0.229, 0.224, 0.225]`, and resized to `(224, 224)`.
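
As a minimal sketch of the normalization step described above, the snippet below applies the per-channel mean and standard deviation to a random tensor and then inverts the operation, which is what is needed to display a normalized image. The tensor `image` and the helper names `normalize` and `denormalize` are hypothetical stand-ins, not part of the assignment code.

```python
import torch

# ImageNet channel statistics quoted above, reshaped to broadcast over (C, H, W).
MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def normalize(image):
    """Subtract the channel mean and divide by the channel std."""
    return (image - MEAN) / STD

def denormalize(image):
    """Invert `normalize`: multiply by std, then add mean, per channel."""
    return image * STD + MEAN

# Hypothetical stand-in for a loaded image: random values in [0, 1].
torch.manual_seed(0)
image = torch.rand(3, 224, 224)
restored = denormalize(normalize(image))
print(torch.allclose(restored, image, atol=1e-6))
```

The inverse is a multiplication by `std` followed by an addition of `mean`, not a reciprocal of the normalization object.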

In [None]:
def preprocess_image(dir_path):
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    # Note: to invert the normalization, compute x * std + mean for each channel
    
    dataset = datasets.ImageFolder(dir_path, transforms.Compose([
            transforms.Resize(256),  # resize the shorter side to 256 pixels
            transforms.CenterCrop(224),  # crop the central 224x224 region
            transforms.ToTensor(),  # convert the PIL image to a float tensor in [0, 1]
            normalize]))  # normalize each channel with the ImageNet statistics

    return dataset

In [None]:
import os
if not os.path.exists("data"):
    os.mkdir("data")
if not os.path.exists("data/TP2_images"):
    os.mkdir("data/TP2_images")
    !cd data/TP2_images && wget "https://www.lri.fr/~gcharpia/deeppractice/2025/TP2/TP2_images.zip" && unzip TP2_images.zip

dir_path = "data/" 
dataset = preprocess_image(dir_path)

In [None]:
# show the original image
index = 5
input_image = Image.open(dataset.imgs[index][0]).convert('RGB')
plt.imshow(input_image)

In [None]:
input_image = dataset[index][0].view(1, 3, 224, 224)
output = densenet121(input_image)
values, indices = torch.topk(output, 3)
print("Top 3-classes:", indices[0].numpy(), [classes[x] for x in indices[0].numpy()])
print("Raw class scores:", values[0].detach().numpy())

# Grad-CAM 
* **Overview:** Given an image and a category (‘tiger cat’) as input, we forward-propagate the image through the model to obtain the `raw class scores` before softmax. The gradients are set to zero for all classes except the desired class (tiger cat), which is set to 1. This signal is then backpropagated to the `rectified convolutional feature map` of interest, from which we compute the coarse Grad-CAM localization (blue heatmap).


* **To Do**: Define your own function Grad_CAM to achieve the visualization of the given images. For each image, choose the top-3 possible labels as the desired classes. Compare the heatmaps of the three classes, and conclude. 

More precisely, you should provide a function: `show_grad_cam(image: torch.tensor) -> None` that displays something like this:
![output_example.png](attachment:output_example.png)
where the heatmap will be correct (here it is just an example) and the first 3 classes are the top-3 predicted classes and the last is the least probable class according to the model.

* **Comment your code**: Your code should be easy to read and follow. Please comment your code, try to use the NumPy Style Python docstrings for your functions.

* **To be submitted within 2 weeks**: this notebook, **cleaned** (i.e. without results, for file size reasons: `menu > kernel > restart and clean`), in a state ready to be executed (with or without GPU) (if one just presses 'Enter' till the end, one should obtain all the results for all images) with a few comments at the end. No additional report, just the notebook!


* **Hints**: 
 + We need to record the output and grad_output of the feature maps to compute Grad-CAM. In PyTorch, hooks are provided for this purpose. Read the [hook](https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html#forward-and-backward-function-hooks) tutorial carefully.
 + More on [autograd](https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html) and [hooks](https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html#forward-and-backward-function-hooks)
 + The pretrained DenseNet model has no activation function after its last layer, so its output is the `raw class scores`; you can use them directly.
 + Your heatmap will have the same size as the feature map. You need to upscale it to the size of the resized image (224x224, i.e. the model input before normalization, not the original image) for easier viewing. The function [`torch.nn.functional.interpolate`](https://pytorch.org/docs/stable/nn.functional.html?highlight=interpolate#torch.nn.functional.interpolate) may help.
 + Here is the link to the paper: [Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization](https://arxiv.org/pdf/1610.02391.pdf)
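
To illustrate the first hint, here is a small sketch of how a forward hook and a backward hook can record a layer's output and the gradient with respect to that output. The network `toy_model` is a hypothetical stand-in (not the assignment model), and the sketch assumes a PyTorch version that provides `register_full_backward_hook`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy network, used only to demonstrate hooks.
toy_model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(4, 6, kernel_size=3, padding=1),  # layer we want to inspect
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(6, 5),
)
target_layer = toy_model[2]
captured = {}

def save_activation(module, inputs, output):
    # Called during the forward pass with the layer's output.
    captured["activation"] = output.detach()

def save_gradient(module, grad_input, grad_output):
    # Called during the backward pass; grad_output[0] is d(score)/d(output).
    captured["gradient"] = grad_output[0].detach()

forward_handle = target_layer.register_forward_hook(save_activation)
backward_handle = target_layer.register_full_backward_hook(save_gradient)

x = torch.randn(1, 3, 8, 8)
scores = toy_model(x)
scores[0, 2].backward()  # backpropagate the raw score of one class

# Remove the hooks so they do not fire on later passes.
forward_handle.remove()
backward_handle.remove()

print(captured["activation"].shape, captured["gradient"].shape)
```

The activation and the gradient have the same shape, which is what allows the gradient to be averaged spatially and used to weight the feature maps.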

Class: ‘pug, pug-dog’ | Class: ‘tabby, tabby cat’
- | - 
![alt](https://raw.githubusercontent.com/jacobgil/pytorch-grad-cam/master/examples/dog.jpg)| ![alt](https://raw.githubusercontent.com/jacobgil/pytorch-grad-cam/master/examples/cat.jpg)
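
For reference, the Grad-CAM overview given earlier corresponds to the two equations of the paper. Here $A^k$ is the $k$-th feature map of the chosen convolutional layer, $y^c$ is the raw score of class $c$ before softmax, and $Z$ is the number of spatial positions in the feature map.

```latex
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}},
\qquad
L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^c \, A^k \right)
```

In code terms, $\alpha_k^c$ is the mean of the gradient over the two spatial dimensions, and the heatmap is the ReLU of the channel-wise weighted sum of the feature maps.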

## Part 1: Grad-CAM implementation

In [None]:
def get_gradcam(model, input_image, class_idx):
    """
    Calculate Grad-CAM for a specific class index.
    
    Args:
        model: The neural network model (DenseNet121)
        input_image: Input tensor of shape (1, 3, 224, 224)
        class_idx: Index of the class to analyze
        
    Returns:
        Numpy array of the Grad-CAM heatmap
    """
    # Get the last convolutional layer
    target_layer = model.features.denseblock4.denselayer16.conv2
    
    # Store the feature maps and gradients
    feature_maps = []
    gradients = []
    
    def hook_features(module, input, output):
        feature_maps.append(output)
        
    def hook_grads(module, grad_input, grad_output):
        gradients.append(grad_output[0])
    
    # Register hooks
    handle_features = target_layer.register_forward_hook(hook_features)
    handle_grads = target_layer.register_backward_hook(hook_grads)
    
    # Forward pass
    model.zero_grad()
    output = model(input_image)
    
    # Backward pass for the target class
    one_hot = torch.zeros_like(output)
    one_hot[0][class_idx] = 1
    output.backward(gradient=one_hot)
    
    # Remove hooks
    handle_features.remove()
    handle_grads.remove()
    
    # Calculate Grad-CAM
    feature = feature_maps[0]
    gradient = gradients[0]
    
    weights = torch.mean(gradient, dim=(2, 3), keepdim=True)
    cam = torch.sum(feature * weights, dim=1, keepdim=True)
    cam = F.relu(cam)
    
    # Normalize
    cam = cam - cam.min()
    cam = cam / (cam.max() + 1e-7)
    
    # Resize to input size
    cam = F.interpolate(cam, size=(224, 224), mode='bilinear', align_corners=False)
    
    return cam[0, 0].detach().cpu().numpy()

def show_grad_cam(image):
    """
    Display the original image and Grad-CAM visualizations for top-3 predicted classes and least probable class with axes and ticks.

    Args:
        image: Input tensor of shape (1, 3, 224, 224)
    """
    # Get model predictions
    with torch.no_grad():
        output = densenet121(image)
    
    # Get top-3 and bottom-1 predictions
    values, indices = torch.topk(output, k=3)
    min_value, min_index = torch.min(output, dim=1)
    
    # Create figure with subplots (1 row, 5 columns)
    fig, axs = plt.subplots(1, 5, figsize=(18, 5))
    
    # Original image for overlay
    orig_img = image[0].permute(1, 2, 0)
    orig_img = orig_img * torch.tensor([0.229, 0.224, 0.225]) + torch.tensor([0.485, 0.456, 0.406])
    orig_img = orig_img.numpy()
    orig_img = np.clip(orig_img, 0, 1)
    
    # Display original image in the first subplot
    axs[0].imshow(orig_img)
    axs[0].set_title('Original Image')
    axs[0].axis('on')
    axs[0].set_xticks(np.linspace(0, 224, 5))
    axs[0].set_yticks(np.linspace(0, 224, 5))
    axs[0].tick_params(labelsize=8)
    
    # Process each class
    for i, idx in enumerate(indices[0]):
        cam = get_gradcam(densenet121, image, idx)
        
        axs[i + 1].imshow(orig_img)
        axs[i + 1].imshow(cam, cmap='jet', alpha=0.5)
        axs[i + 1].set_title(f'{classes[idx.item()]}\n{output[0][idx].item():.2f}')
        axs[i + 1].axis('on')
        axs[i + 1].set_xticks(np.linspace(0, 224, 5))
        axs[i + 1].set_yticks(np.linspace(0, 224, 5))
        axs[i + 1].tick_params(labelsize=8)

    # Process least probable class
    cam = get_gradcam(densenet121, image, min_index)
    axs[4].imshow(orig_img)
    axs[4].imshow(cam, cmap='jet', alpha=0.5)
    axs[4].set_title(f'{classes[min_index.item()]}\n{min_value.item():.2f}')
    axs[4].axis('on')
    axs[4].set_xticks(np.linspace(0, 224, 5))
    axs[4].set_yticks(np.linspace(0, 224, 5))
    axs[4].tick_params(labelsize=8)

    plt.tight_layout()
    plt.show()
    


## Part 2: Try it on a few (1 to 3) images and comment

In [None]:
for i in range(3, 6):
    input_image = dataset[i][0].view(1, 3, 224, 224)
    show_grad_cam(input_image)


### Comments on the Grad-CAM Results

For images 3 to 5, Grad-CAM highlights the regions that contribute to the model's decisions:

1. In the first group, the high-intensity areas of the heatmaps for the top-3 labels (elkhound, husky, and malamute) correspond to dog bodies and legs, while the heatmap for the least probable label falls on the unrelated floor.
2. The high-intensity areas in the heatmaps correspond to discriminative regions that are crucial for classification.
3. Different class predictions focus on different regions of the same image, which suggests that the model relies on different features for different classes.
4. The least probable class typically shows scattered or irrelevant attention areas, which is consistent with the model finding little evidence for that class.

This visualization shows which image regions DenseNet-121 relies on for its predictions, providing insight into the model's decision-making process.

![output_example.png](attachment:output_example.png)
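
One way to make the comparison between class heatmaps less subjective is to threshold two heatmaps and measure how much they overlap. The sketch below does this with intersection-over-union on synthetic heatmaps. The function name `heatmap_iou`, the 0.5 threshold, and the toy arrays are illustrative assumptions; real inputs would be the arrays returned by `get_gradcam`.

```python
import numpy as np

def heatmap_iou(cam_a, cam_b, threshold=0.5):
    """Intersection-over-union of the regions where two heatmaps exceed `threshold`.

    Both heatmaps are expected to be 2-D arrays normalized to [0, 1].
    Returns 0.0 when neither heatmap exceeds the threshold anywhere.
    """
    mask_a = cam_a > threshold
    mask_b = cam_b > threshold
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 0.0
    intersection = np.logical_and(mask_a, mask_b).sum()
    return float(intersection / union)

# Synthetic 224x224 heatmaps standing in for Grad-CAM outputs.
cam_left = np.zeros((224, 224))
cam_left[:, :112] = 1.0          # hot region: left half
cam_right = np.zeros((224, 224))
cam_right[:, 112:] = 1.0         # hot region: right half
cam_center = np.zeros((224, 224))
cam_center[:, 56:168] = 1.0      # hot region: central band

print(heatmap_iou(cam_left, cam_left))    # identical regions
print(heatmap_iou(cam_left, cam_right))   # disjoint regions
print(heatmap_iou(cam_left, cam_center))  # partial overlap
```

A low overlap between the heatmaps of two classes would support the claim that the model looks at different regions for different labels, while a high overlap would suggest that the classes share most of their evidence.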

## Part 3: Try GradCAM on other convolutional layers, describe and comment on the results

In [None]:
def get_gradcam(model, input_image, class_idx, target_layer):
    """
    Calculate Grad-CAM for a specific class index at a chosen layer.
    
    Args:
        model: The neural network model (DenseNet121)
        input_image: Input tensor of shape (1, 3, 224, 224)
        class_idx: Index of the class to analyze
        target_layer: Module whose output feature maps are used for Grad-CAM
        
    Returns:
        Numpy array of the Grad-CAM heatmap, of shape (224, 224)
    """
    
    # Store the feature maps and gradients
    feature_maps = []
    gradients = []
    
    def hook_features(module, input, output):
        feature_maps.append(output)
        
    def hook_grads(module, grad_input, grad_output):
        gradients.append(grad_output[0])
    
    # Register hooks
    handle_features = target_layer.register_forward_hook(hook_features)
    handle_grads = target_layer.register_backward_hook(hook_grads)
    
    # Forward pass
    model.zero_grad()
    output = model(input_image)
    
    # Backward pass for the target class
    one_hot = torch.zeros_like(output)
    one_hot[0][class_idx] = 1
    output.backward(gradient=one_hot)
    
    # Remove hooks
    handle_features.remove()
    handle_grads.remove()
    
    # Calculate Grad-CAM
    feature = feature_maps[0]
    gradient = gradients[0]
    
    weights = torch.mean(gradient, dim=(2, 3), keepdim=True)
    cam = torch.sum(feature * weights, dim=1, keepdim=True)
    cam = F.relu(cam)
    
    # Normalize
    cam = cam - cam.min()
    cam = cam / (cam.max() + 1e-7)
    
    # Resize to input size
    cam = F.interpolate(cam, size=(224, 224), mode='bilinear', align_corners=False)
    
    return cam[0, 0].detach().cpu().numpy()

def show_grad_cam_select_layer(image, target_layer):
    """
    Display the original image and the Grad-CAM heatmaps computed at `target_layer` for the top-3 predicted classes and the least probable class, with axes and ticks.

    Args:
        image: Input tensor of shape (1, 3, 224, 224)
        target_layer: Module whose output feature maps are used for Grad-CAM
    """
    # Get model predictions
    with torch.no_grad():
        output = densenet121(image)
    
    # Get top-3 and bottom-1 predictions
    values, indices = torch.topk(output, k=3)
    min_value, min_index = torch.min(output, dim=1)
    
    # Create figure with subplots (1 row, 5 columns)
    fig, axs = plt.subplots(1, 5, figsize=(18, 5))
    
    # Original image for overlay
    orig_img = image[0].permute(1, 2, 0)
    orig_img = orig_img * torch.tensor([0.229, 0.224, 0.225]) + torch.tensor([0.485, 0.456, 0.406])
    orig_img = orig_img.numpy()
    orig_img = np.clip(orig_img, 0, 1)
    
    # Display original image in the first subplot
    axs[0].imshow(orig_img)
    axs[0].set_title('Original Image')
    axs[0].axis('on')
    axs[0].set_xticks(np.linspace(0, 224, 5))
    axs[0].set_yticks(np.linspace(0, 224, 5))
    axs[0].tick_params(labelsize=8)
    
    # Process each class
    for i, idx in enumerate(indices[0]):
        cam = get_gradcam(densenet121, image, idx, target_layer)
        
        axs[i + 1].imshow(orig_img)
        axs[i + 1].imshow(cam, cmap='jet', alpha=0.5)
        axs[i + 1].set_title(f'{classes[idx.item()]}\n{output[0][idx].item():.2f}')
        axs[i + 1].axis('on')
        axs[i + 1].set_xticks(np.linspace(0, 224, 5))
        axs[i + 1].set_yticks(np.linspace(0, 224, 5))
        axs[i + 1].tick_params(labelsize=8)

    # Process least probable class
    cam = get_gradcam(densenet121, image, min_index, target_layer)
    axs[4].imshow(orig_img)
    axs[4].imshow(cam, cmap='jet', alpha=0.5)
    axs[4].set_title(f'{classes[min_index.item()]}\n{min_value.item():.2f}')
    axs[4].axis('on')
    axs[4].set_xticks(np.linspace(0, 224, 5))
    axs[4].set_yticks(np.linspace(0, 224, 5))
    axs[4].tick_params(labelsize=8)

    plt.tight_layout()
    plt.show()

for i in range(5, 6):
    input_image = dataset[i][0].view(1, 3, 224, 224)
    show_grad_cam_select_layer(input_image, densenet121.features.denseblock1.denselayer1.conv2)
    show_grad_cam_select_layer(input_image, densenet121.features.denseblock2.denselayer1.conv2)
    show_grad_cam_select_layer(input_image, densenet121.features.denseblock3.denselayer1.conv2)
    show_grad_cam_select_layer(input_image, densenet121.features.denseblock4.denselayer1.conv2)
    # show_grad_cam_select_layer(input_image, densenet121.features.denseblock4.denselayer4.conv2)
    # show_grad_cam_select_layer(input_image, densenet121.features.denseblock4.denselayer8.conv2)
    # show_grad_cam_select_layer(input_image, densenet121.features.denseblock4.denselayer16.conv2)

    



### Comments on GradCAM Results Across Different Layers

**Layer progression analysis**

1. DenseBlock1 (early layer)
    - Shows broad, less focused activations
    - Responds to basic features like edges and textures
    - Feature maps have the highest spatial resolution, so the heatmap is fine-grained but not class-specific

2. DenseBlock2 and DenseBlock3 (middle layers)
    - More refined feature detection
    - Better localization of relevant objects
    - Shows intermediate-level feature combinations

3. DenseBlock4 (later layers)
    - Most class-discriminative localization, although the feature maps are spatially coarse and the upsampled heatmap is smooth
    - Clear distinction between different class predictions
    - Progressive refinement from denselayer1 to denselayer16

**Key observations**
- Deeper layers show more class-specific activations
- Earlier layers capture more general features
- The quality of object localization improves with depth, at the cost of spatial resolution
- Dense connections concatenate earlier feature maps into later layers, which may help preserve fine-grained detail
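
To relate these observations to the architecture, the sketch below computes the spatial size and channel count of each dense block's output for a 224x224 input. It assumes the standard DenseNet-121 configuration (64 initial feature maps, growth rate 32, blocks of 6, 12, 24, and 16 layers, and transitions that halve both channels and resolution). The function name `densenet121_block_shapes` is hypothetical.

```python
def densenet121_block_shapes(input_size=224, init_features=64, growth_rate=32,
                             block_config=(6, 12, 24, 16)):
    """Return (block name, channels, spatial size) for each dense block output."""
    # Stem: 7x7 convolution with stride 2, then 3x3 max pooling with stride 2.
    size = input_size // 2 // 2
    channels = init_features
    shapes = []
    for block_idx, num_layers in enumerate(block_config, start=1):
        # Each dense layer concatenates `growth_rate` new feature maps.
        channels += num_layers * growth_rate
        shapes.append((f"denseblock{block_idx}", channels, size))
        if block_idx < len(block_config):
            # Transition: 1x1 convolution halves channels, 2x2 average pooling halves size.
            channels //= 2
            size //= 2
    return shapes

for name, channels, size in densenet121_block_shapes():
    print(f"{name}: {channels} channels, {size}x{size}")
```

Under these assumptions, a heatmap taken in `denseblock1` has 56x56 cells while one taken in `denseblock4` has only 7x7 cells before upsampling to 224x224, which is consistent with deep-layer heatmaps looking smooth. Note also that each `denselayerN.conv2` should output only `growth_rate` (32) channels, so hooking it uses the feature maps newly added by that layer rather than the full concatenated block output.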


## Part 4: Try GradCAM on `9928031928.png` , describe and comment the results

In [None]:
# Apply Grad-CAM at several depths to the last image of the dataset (expected to be 9928031928.png)
input_image = dataset[-1][0].view(1, 3, 224, 224)


show_grad_cam_select_layer(input_image, densenet121.features.denseblock1.denselayer1.conv2)
show_grad_cam_select_layer(input_image, densenet121.features.denseblock2.denselayer1.conv2)
show_grad_cam_select_layer(input_image, densenet121.features.denseblock3.denselayer1.conv2)
show_grad_cam_select_layer(input_image, densenet121.features.denseblock4.denselayer1.conv2)
show_grad_cam_select_layer(input_image, densenet121.features.denseblock4.denselayer4.conv2)
show_grad_cam_select_layer(input_image, densenet121.features.denseblock4.denselayer8.conv2)
show_grad_cam_select_layer(input_image, densenet121.features.denseblock4.denselayer16.conv2)


## Analysis of Grad-CAM Results for the Image '9928031928.png'

### Results

1. In all the results, the image is classified as "jeep, landrover". According to the heatmap for the jeep label, the model's attention is on the background around the elephant rather than on the elephant itself. A possible explanation is that the model treats the elephant's body as part of the background.

2. The advantage of Gradient-weighted Class Activation Mapping (Grad-CAM): the heatmap shows that the model's attention is on the background area rather than on the elephant. Grad-CAM can thus reveal that the model relies on the wrong regions for a specific class (here, jeep). This kind of visualization lets us analyze potential flaws in the model's decision-making process.

## Part 5: What are the principal contributions of GradCAM (the answer is in the paper) ?

1. Visual Explanations: Grad-CAM generates heatmaps that highlight the image regions that contribute most to the model's prediction for a given class. This gives insight into what the model relies on, improving its interpretability and transparency.

2. Generality: Unlike previous approaches such as CAM, Grad-CAM is applicable to a wide variety of CNN model-families without architectural changes or re-training. This includes CNNs with fully-connected layers, CNNs used for structured outputs (e.g. image captioning), and CNNs used in tasks with multi-modal inputs (e.g. visual question answering) or reinforcement learning.

## Bonus 5: What are the main differences between DenseNet and ResNet ?

Connection Method:

1. DenseNet uses dense connections: within a dense block, the input to each layer is the concatenation of the feature maps produced by all preceding layers, not only the previous one. This encourages feature reuse and mitigates the vanishing gradient problem.

2. ResNet uses residual connections: the output of each residual block is added element-wise to its input. These skip connections let information and gradients pass directly, which alleviates the vanishing gradient problem in deep networks.
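
The difference between the two connection types can be sketched with two toy convolutions. The layers and sizes below are hypothetical and much simpler than real ResNet or DenseNet blocks, which also include batch normalization and ReLU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy layers, chosen only to show how the shapes evolve.
channels, growth_rate = 8, 4
residual_branch = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
dense_branch = nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1)

x = torch.randn(1, channels, 16, 16)

# ResNet-style: the branch output is added to the input, so the shape is unchanged.
residual_out = x + residual_branch(x)

# DenseNet-style: the branch output is concatenated along the channel axis,
# so the channel count grows by `growth_rate` at every layer.
dense_out = torch.cat([x, dense_branch(x)], dim=1)

print(residual_out.shape)
print(dense_out.shape)
```

With addition, the input is merged into the output and the channel count stays fixed. With concatenation, the input feature maps are kept unchanged alongside the new ones, which is why DenseNet needs transition layers to keep the channel count under control.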

