Authors:

**BENYAMINA Elyas**

**ZEKRI Oussama**

## Visualization of CNN: Grad-CAM
* **Objective**: Convolutional Neural Networks (CNNs) are widely used in computer vision and are powerful for processing grid-like data. However, we hardly know how and why they work, because they cannot be decomposed into individually intuitive components. In this assignment, we use Grad-CAM, which highlights the regions of the input image that were important for the neural network's prediction.


* NB: if `PIL` is not installed, try `conda install pillow`.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models, datasets, transforms
import matplotlib.pyplot as plt
import pickle
import urllib.request
import cv2

import numpy as np
from PIL import Image

%matplotlib inline

### Download the Model
We provide a pretrained `ResNet-34` model for the `ImageNet` classification dataset.
* **ImageNet**: A large dataset of photographs with 1 000 classes.
* **ResNet-34**: A deep architecture for image classification.

In [None]:
resnet34 = models.resnet34(weights='ResNet34_Weights.IMAGENET1K_V1')  # New PyTorch interface for loading weights!
resnet34.eval() # set the model to evaluation mode

![ResNet34](https://miro.medium.com/max/1050/1*Y-u7dH4WC-dXyn9jOG4w0w.png)


Input image must be of size (3x224x224). 

First convolution layer with maxpool. 
Then 4 ResNet blocks. 

Output of the last ResNet block is of size (512x7x7). 

Average pooling is applied to this layer to obtain a 1D array of 512 features, which is fed to a linear layer that outputs 1000 values (one for each class). There is no softmax at the end, so the output is already the raw class scores.

In [None]:
classes = pickle.load(urllib.request.urlopen('https://gist.githubusercontent.com/yrevar/6135f1bd8dcf2e0cc683/raw/d133d61a09d7e5a3b36b8c111a8dd5c4b5d560ee/imagenet1000_clsid_to_human.pkl'))

# classes is a dictionary mapping each class index to its human-readable name
print(classes)

### Input Images
We provide 20 images from ImageNet (download link on the course webpage, or download them directly with the command line in the cell below).<br>
In order to use the pretrained model resnet34, the input image should be resized to `(224, 224)` and normalized using `mean = [0.485, 0.456, 0.406]` and `std = [0.229, 0.224, 0.225]`.

In [None]:
def preprocess_image(dir_path):
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    # Note: to undo the normalisation, compute x * std + mean for each channel
    
    dataset = datasets.ImageFolder(dir_path, transforms.Compose([
            transforms.Resize(256), # resize so that the shorter side is 256 pixels
            transforms.CenterCrop(224), # crop the central 224x224 region
            transforms.ToTensor(), # convert the PIL image to a float tensor in [0, 1]
            normalize])) # normalize each channel with the ImageNet statistics

    return dataset

In [None]:
import os
if not os.path.exists("data"):
    os.mkdir("data")
if not os.path.exists("data/TP2_images"):
    os.mkdir("data/TP2_images")
    !cd data/TP2_images && wget "https://www.lri.fr/~gcharpia/deeppractice/2023/TP2/TP2_images.zip" && unzip TP2_images.zip

dir_path = "data/" 
dataset = preprocess_image(dir_path)

In [None]:
# show the original image
index = 5
input_image = Image.open(dataset.imgs[index][0]).convert('RGB')
plt.imshow(input_image)

In [None]:
output = resnet34(dataset[index][0].view(1, 3, 224, 224))
values, indices = torch.topk(output, 3)
print("Top 3-classes:", indices[0].numpy(), [classes[x] for x in indices[0].numpy()])
print("Raw class scores:", values[0].detach().numpy())

### Grad-CAM 
* **Overview:** Given an image, and a category (‘tiger cat’) as input, we forward-propagate the image through the model to obtain the `raw class scores` before softmax. The gradients are set to zero for all classes except the desired class (tiger cat), which is set to 1. This signal is then backpropagated to the `rectified convolutional feature map` of interest, where we can compute the coarse Grad-CAM localization (blue heatmap).


* **To Do**: Define your own function Grad_CAM to achieve the visualization of the given images. For each image, choose the top-3 possible labels as the desired classes. Compare the heatmaps of the three classes, and conclude. 


* **To be submitted within 2 weeks**: this notebook, **cleaned** (i.e. without results, for file size reasons: `menu > kernel > restart and clean`), in a state ready to be executed (if one just presses 'Enter' till the end, one should obtain all the results for all images) with a few comments at the end. No additional report, just the notebook!


* **Hints**: 
 + We need to record the output and grad_output of the feature maps to achieve Grad-CAM. In PyTorch, hooks are provided for this purpose. Read the tutorial on [hooks](https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html#forward-and-backward-function-hooks) carefully.
 + More on [autograd](https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html) and [hooks](https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html#forward-and-backward-function-hooks)
 + The pretrained model resnet34 doesn't have an activation function after its last layer, so the output is indeed the `raw class scores` and you can use them directly. 
 + The size of the feature maps is 7x7, so your heatmap will have the same size. You need to project the heatmap onto the resized image (224x224, not the original one, before the normalization) for a better visualization. The function [`torch.nn.functional.interpolate`](https://pytorch.org/docs/stable/nn.functional.html?highlight=interpolate#torch.nn.functional.interpolate) may help.  
 + Here is the link of the paper [Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization](https://arxiv.org/pdf/1610.02391.pdf)
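
For reference, the overview above corresponds to the two formulas of the Grad-CAM paper, written here with the paper's notation: $y^c$ is the raw score of class $c$, $A^k$ is the $k$-th feature map of the chosen layer, and $Z$ is the number of spatial positions. For the last block of ResNet-34 this would mean $k = 1, \dots, 512$ and $Z = 7 \times 7 = 49$.

```latex
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}},
\qquad
L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left( \sum_{k} \alpha_k^c \, A^k \right)
```

The weight $\alpha_k^c$ is the spatial average of the gradients of the class score with respect to feature map $k$, and the heatmap is the rectified weighted sum of the feature maps.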

Class: ‘pug, pug-dog’ | Class: ‘tabby, tabby cat’
- | - 
![alt](https://raw.githubusercontent.com/jacobgil/pytorch-grad-cam/master/examples/dog.jpg)| ![alt](https://raw.githubusercontent.com/jacobgil/pytorch-grad-cam/master/examples/cat.jpg)
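
Before working on ResNet-34, the hook mechanics can be tried on a toy model. The sketch below is only an illustration under stated assumptions: `TinyNet`, `store` and `forward_hook` are hypothetical names, the network is untrained, and the gradient is captured with a tensor-level hook (`Tensor.register_hook`) set inside a forward hook rather than with a module backward hook. Because the toy model applies global average pooling and a linear layer directly to the feature maps, each Grad-CAM weight should equal the corresponding linear weight divided by the number of spatial positions, which gives a simple sanity check.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class TinyNet(nn.Module):
    """Hypothetical toy CNN: conv -> global average pooling -> linear."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.fc = nn.Linear(8, 5)

    def forward(self, x):
        feature_maps = self.conv(x)
        return self.fc(feature_maps.mean(dim=(2, 3)))

net = TinyNet().eval()
store = {}

def forward_hook(module, inputs, output):
    # Record the feature maps, and ask autograd to record their gradient later.
    store["activations"] = output.detach()
    output.register_hook(lambda grad: store.update(gradients=grad.detach()))

handle = net.conv.register_forward_hook(forward_hook)

x = torch.randn(1, 3, 16, 16)
scores = net(x)                       # raw class scores, shape (1, 5)
target = int(scores.argmax(dim=1))    # class to explain
net.zero_grad()
scores[0, target].backward()          # backpropagate only the chosen class score
handle.remove()                       # do not leave the hook attached to the model

# Grad-CAM: spatial mean of the gradients, weighted sum of the maps, then ReLU.
alpha = store["gradients"].mean(dim=(2, 3), keepdim=True)    # (1, 8, 1, 1)
cam = F.relu((alpha * store["activations"]).sum(dim=1))      # (1, 16, 16)
cam_up = F.interpolate(cam[:, None], size=(32, 32),
                       mode="bilinear", align_corners=False)[0, 0]
```

Removing the hook handle after the backward pass keeps repeated calls from stacking hooks on the model.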

In [None]:
def grad_cam_heatmap_input(indice_category, indice_image):
    # Lists and hook functions that will track and store outputs and gradients.
    # We track the output of the conv2 layer, and the input and the output of the bn2 layer
    input_list=[]
    output_list=[]
    grad_input_list=[]
    grad_output_list=[]
    grad_output_list_bn2=[]
    output_list_bn2=[]
    
    def fw_hook(self, input, output):
        output_list.append(output)
    def fw_hook2(self, input, output):
        input_list.append(input)
        output_list_bn2.append(output)
    def bw_hook(self, grad_input, grad_output):
        grad_output_list.append(grad_output)
    def bw_hook2(self, grad_input, grad_output):
        grad_input_list.append(grad_input)
        grad_output_list_bn2.append(grad_output)

    # We keep the hook handles so that the hooks can be removed after the backward pass;
    # otherwise every call would leave four more hooks attached to the model
    handles = [
        resnet34.layer4[2].bn2.register_forward_hook(fw_hook2),
        resnet34.layer4[2].bn2.register_backward_hook(bw_hook2),
        resnet34.layer4[2].conv2.register_forward_hook(fw_hook),
        resnet34.layer4[2].conv2.register_backward_hook(bw_hook),
    ]
    
    # Generating the output and performing backpropagation only for the class we are working on
    resnet34.zero_grad()
    input_image=dataset[indice_image][0].view(1, 3, 224, 224)
    output = resnet34(input_image)
    output_category=output[:,indice_category]
    output_category.backward(retain_graph=True) 
    for handle in handles:
        handle.remove()

    # We use the gradients and the values at the output of the conv2 layer (= input of the bn2 layer)
    # The commented lines give variants based on the bn2 layer, which yield different but still sensible results
    #grad_list = grad_input_list[0][0][0]
    grad_list = grad_output_list[0][0][0]
    #grad_list = grad_output_list_bn2[0][0][0]

    #image_list = input_list[0][0][0]
    image_list = output_list[0][0]
    #image_list = output_list_bn2[0][0]

    # For each of the 512 feature maps, we compute the spatial mean of its gradients
    grad_list_mean = torch.mean(grad_list, dim=(1,2))
    
    # We weight each feature map by its mean gradient and sum over the 512 maps
    heatmap = (grad_list_mean[:, None, None] * image_list).sum(dim=0)
    
    # We apply ReLU, upsample the 7x7 heatmap to 224x224 and rescale it to [0, 1]
    heatmap_relu = F.relu(heatmap)
    heatmap_resized = F.interpolate(heatmap_relu.view(1, 1, 7, 7), size=(224, 224), mode='bilinear', align_corners=False)[0,0]
    # The small constant avoids a division by zero when the heatmap is entirely zero after ReLU
    heatmap_resized = heatmap_resized/(torch.max(heatmap_resized) + 1e-8)
    return heatmap_resized.detach().numpy()

In [None]:
def heatmap_to_image(heatmap, img):
    # We build a colour heatmap with OpenCV (applyColorMap returns BGR) and convert it to RGB to match img
    heatmap = cv2.applyColorMap(np.uint8(255*heatmap), cv2.COLORMAP_JET) 
    heatmap = cv2.cvtColor(heatmap, cv2.COLOR_BGR2RGB)
    # We superimpose the heatmap on the RGB image and rescale the result to [0, 1]
    img_heatmap = (np.float32(img)/255 + np.float32(heatmap)/255)
    img_heatmap = (img_heatmap-np.min(img_heatmap))/(np.max(img_heatmap)-np.min(img_heatmap))
    return img_heatmap

### Complementary questions:

##### Try Grad-CAM on other convolutional layers, then describe and comment on the results

In [None]:
def grad_cam_heatmap_input_bn2(indice_category, indice_image):
    # Lists and hook functions that will track and store outputs and gradients.
    # We track the output of the conv2 layer, and the input and the output of the bn2 layer
    input_list=[]
    output_list=[]
    grad_input_list=[]
    grad_output_list=[]
    grad_output_list_bn2=[]
    output_list_bn2=[]
    def fw_hook(self, input, output):
        output_list.append(output)
    def fw_hook2(self, input, output):
        input_list.append(input)
        output_list_bn2.append(output)
    def bw_hook(self, grad_input, grad_output):
        grad_output_list.append(grad_output)
    def bw_hook2(self, grad_input, grad_output):
        grad_input_list.append(grad_input)
        grad_output_list_bn2.append(grad_output)

    # We keep the hook handles so that the hooks can be removed after the backward pass;
    # otherwise every call would leave four more hooks attached to the model
    handles = [
        resnet34.layer4[2].bn2.register_forward_hook(fw_hook2),
        resnet34.layer4[2].bn2.register_backward_hook(bw_hook2),
        resnet34.layer4[2].conv2.register_forward_hook(fw_hook),
        resnet34.layer4[2].conv2.register_backward_hook(bw_hook),
    ]
    
    # Generating the output and performing backpropagation only for the class we are working on
    resnet34.zero_grad()
    input_image=dataset[indice_image][0].view(1, 3, 224, 224)
    output = resnet34(input_image)
    output_category=output[:,indice_category]
    output_category.backward(retain_graph=True) 
    for handle in handles:
        handle.remove()

    # Here we use the gradients and the values at the output of the bn2 layer
    # The commented lines give the variants based on the conv2 output (= input of the bn2 layer)
    #grad_list = grad_input_list[0][0][0]
    #grad_list = grad_output_list[0][0][0]
    grad_list = grad_output_list_bn2[0][0][0]

    #image_list = input_list[0][0][0]
    #image_list = output_list[0][0]
    image_list = output_list_bn2[0][0]

    # For each of the 512 feature maps, we compute the spatial mean of its gradients
    grad_list_mean = torch.mean(grad_list, dim=(1,2))
    
    # We weight each feature map by its mean gradient and sum over the 512 maps
    heatmap = (grad_list_mean[:, None, None] * image_list).sum(dim=0)
    
    # We apply ReLU, upsample the 7x7 heatmap to 224x224 and rescale it to [0, 1]
    heatmap_relu = F.relu(heatmap)
    heatmap_resized = F.interpolate(heatmap_relu.view(1, 1, 7, 7), size=(224, 224), mode='bilinear', align_corners=False)[0,0]
    # The small constant avoids a division by zero when the heatmap is entirely zero after ReLU
    heatmap_resized = heatmap_resized/(torch.max(heatmap_resized) + 1e-8)
    return heatmap_resized.detach().numpy()

In [None]:
for indice_image in range(20):
    input_image=dataset[indice_image][0].view(1, 3, 224, 224)
    output = resnet34(input_image)
    _, indices = torch.topk(output,3)
    
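    # We use convert('RGB') because one of the images is greyscale, and we apply the same
    # resize and crop as in preprocess_image so that the heatmap is aligned with the image
    pil_img = Image.open(dataset.imgs[indice_image][0]).convert('RGB')
    img = np.asarray(transforms.CenterCrop(224)(transforms.Resize(256)(pil_img)))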
    
    f, ax = plt.subplots(1,4,figsize=(20,5))
    ax[0].imshow(img)

    for i in range(3):
        heatmap = grad_cam_heatmap_input(indices[0].numpy()[i], indice_image)
        img_with_heatmap = heatmap_to_image(heatmap, img)
        ax[i+1].imshow(img_with_heatmap)
        ax[i+1].set_title(classes[indices[0].numpy()[i]])
    plt.show()

##### What are the principal contributions of Grad-CAM (the answer is in the paper)?

As mentioned in the paper, Grad-CAM is a localization technique that provides visual explanations for CNN-based networks. It gives each pixel of the input image an importance score with respect to the final output. It can be applied to classification, captioning and VQA models, without changing or retraining them.
We worked on classification models in this lab. 
Graphically, we can see which areas were used to predict a class.
This technique is important because it allows us to track down wrong predictions of a model and to understand why it failed.

Let us comment our results.

We followed the hints and used hooks to track outputs and gradients. We noticed that using the values (computed during the forward pass) at the output of the bn2 layer (which follows the last convolutional layer) gives different results than using the values at the input of this layer (which are the same as the values at the output of the conv layer).
The results seem to be slightly better in this case. 
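
A possible explanation for this difference, offered as a sketch rather than a verified account of our results: in evaluation mode, batch normalization is a fixed per-channel affine map `y = scale * x + shift`. If gradients and values are both taken on the same side of the layer, the two pre-ReLU Grad-CAM maps then differ only by a spatially constant offset, which changes where the ReLU clips and how the map is rescaled. The toy check below uses a random `BatchNorm2d` with 4 channels and a random linear score; all names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm2d(4).eval()
with torch.no_grad():  # give the layer non-trivial parameters and statistics
    bn.weight.uniform_(0.5, 1.5)
    bn.bias.uniform_(-1.0, 1.0)
    bn.running_mean.uniform_(-1.0, 1.0)
    bn.running_var.uniform_(0.5, 2.0)

x = torch.randn(1, 4, 7, 7, requires_grad=True)  # stands for the conv output
y = bn(x)                                        # stands for the bn output
score = (y * torch.randn_like(y)).sum()          # stand-in for a class score
grad_x, grad_y = torch.autograd.grad(score, [x, y])

def pre_relu_cam(grad, act):
    # Spatial mean of the gradients, then weighted sum of the maps (no ReLU yet).
    return (grad.mean(dim=(2, 3), keepdim=True) * act).sum(dim=1).detach()

cam_conv = pre_relu_cam(grad_x, x)  # values and gradients before batch norm
cam_bn = pre_relu_cam(grad_y, y)    # values and gradients after batch norm
offset = cam_bn - cam_conv          # expected to be the same at every position
```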

We notice that in most cases, the three most probable classes are similar (different types of the same animal) and the areas used for the prediction are very similar.

Grad-CAM serves as a valuable tool for gaining insight into the inner workings of the network and its visual focus during prediction. It helps in understanding the variations in predictions and their correlation with specific features observed by the network in the input image. In our examples, Grad-CAM provided clarity on how the network distinguishes between animal species or between species of the same family.