This project aims to study the kind of representations learned by ConvNets at different layers. The implementations in this project are based on the ideas presented in the papers:
- Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps by Simonyan et al.
- Understanding deep image representations by inverting them by Mahendran et al.
This idea was presented in the paper Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps by Simonyan et al. The authors explore the following two ideas:
- What kind of representations are learned by CNNs when we consider a particular classification score in the softmax layer? This can be visualized by learning an image from random noise that maximizes the class-specific score in the CNN.
- What is the contribution of individual pixels of an image to the final score produced by the network? This is explored by extracting visual saliency maps, computed as the derivative of the final score w.r.t. the input image (this can be done in a single backprop step). The authors also use the visual saliency maps to initialize GraphCut segmentation, which can be used for object localization and detection. We do not explore this aspect of the saliency map in this implementation.
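The saliency computation described above amounts to a single backward pass. A minimal sketch using TensorFlow/Keras follows; the tiny two-class model and the 32x32 input here are hypothetical stand-ins for the pretrained network used in the notebook:

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in classifier; in the notebook this is the pretrained MobileNetV2.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2),  # e.g. cat vs dog class scores
])

def saliency_map(model, image, class_idx):
    """Derivative of the class score w.r.t. the input image (one backprop step)."""
    x = tf.convert_to_tensor(image[None, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)  # the input is not a variable, so watch it explicitly
        score = model(x)[0, class_idx]
    grads = tape.gradient(score, x)[0]
    # Collapse the channel axis by taking the max absolute gradient per pixel.
    return tf.reduce_max(tf.abs(grads), axis=-1).numpy()

sal = saliency_map(model, np.random.rand(32, 32, 3).astype("float32"), class_idx=0)
```

The max-over-channels step follows the common convention for turning the 3-channel gradient into a single per-pixel saliency value.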
The code for this visualization can be found in the file Class_score_and_Saliency_Map_Visualization.ipynb. It can be run in a top-down manner like any other IPython notebook.
I pretrain a MobileNetV2 architecture on the cats vs dogs dataset and then learn the visualizations. Would it be better to optimize the learnable image directly against one of the ImageNet class scores rather than pretraining the model?
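Learning the image itself can be sketched as gradient ascent over the pixels, starting from noise. This is a minimal sketch assuming TensorFlow/Keras; the small randomly initialized model stands in for the trained network, and the step count and learning rate are illustrative:

```python
import tensorflow as tf

tf.random.set_seed(0)

# Stand-in classifier; in the notebook this is the pretrained MobileNetV2.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2),
])

target_class = 0
image = tf.Variable(tf.random.uniform((1, 32, 32, 3)))  # start from random noise
opt = tf.keras.optimizers.Adam(learning_rate=0.1)

score_before = float(model(image)[0, target_class])
for _ in range(50):
    with tf.GradientTape() as tape:
        # Negate the score: the optimizer minimizes, so this ascends the score.
        loss = -model(image)[0, target_class]
    opt.apply_gradients([(tape.gradient(loss, image), image)])
score_after = float(model(image)[0, target_class])
```

Only the image is a variable here; the model weights are never updated, which is the key difference from ordinary training.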
[WIP]
This idea was presented in the paper Understanding deep image representations by inverting them by Mahendran et al. The authors explore the following idea:
What kind of representations are learned by CNNs at different convolutional blocks? This is explored by trying to invert the representations of a test image at different layers of the network. This can provide us with alternate forms of the same image that have similar representations for the ConvNet. This alternative formulation helps us visualize the kind of images that are sufficient to form meaningful representations at different layers of the network.
The inversion procedure involves the following steps:
- Randomly initialize an image.
- Compute the reconstruction loss between the representation of the actual image and the random image at a layer l.
- Update the random image by backpropagating through it using any standard optimizer.
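The steps above can be sketched as follows (a minimal sketch using TensorFlow/Keras; the two-layer model and layer names are hypothetical stand-ins for the pretrained VGG16 and a layer such as block3_conv1):

```python
import tensorflow as tf

tf.random.set_seed(0)

# Hypothetical stand-in network; in the notebook this is the pretrained VGG16
# and the chosen layer l is e.g. "block3_conv1".
inp = tf.keras.Input(shape=(32, 32, 3))
x = tf.keras.layers.Conv2D(8, 3, activation="relu", name="conv_a")(inp)
out = tf.keras.layers.Conv2D(8, 3, activation="relu", name="conv_b")(x)
feature_model = tf.keras.Model(inp, out)  # image -> representation at layer l

real_image = tf.random.uniform((1, 32, 32, 3))  # stands in for the test image
target = feature_model(real_image)              # representation to invert

recon = tf.Variable(tf.random.uniform((1, 32, 32, 3)))  # step 1: random init
opt = tf.keras.optimizers.Adam(learning_rate=0.05)

def step():
    with tf.GradientTape() as tape:
        # Step 2: reconstruction loss between the two representations at layer l.
        loss = tf.reduce_mean(tf.square(feature_model(recon) - target))
    # Step 3: update only the random image; the network weights stay fixed.
    opt.apply_gradients([(tape.gradient(loss, recon), recon)])
    return float(loss)

loss_first = step()
for _ in range(100):
    loss_last = step()
```

At convergence, `recon` is an image whose layer-l representation matches that of the original, which is exactly the alternative formulation described above.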
The code for this visualization can be found in the file Inverting_convnet_representations.ipynb. It can be run in a top-down manner like any other IPython notebook.
I use a pretrained VGG16 network for visualization at various layers of the network.
Original Image:
Reconstructed Images:
Block3 Conv1 | Block3 Conv2 | Block3 Conv3 |
Block4 Conv1 | Block4 Conv2 | Block4 Conv3 |
Block5 Conv1 | Block5 Conv2 | Block5 Conv3 |
- Inverting the image representations clearly shows that as we move down a deep hierarchical model like the VGG16 convnet, the representations become more abstract, with the network discarding most of the finer details in the image and retaining only higher-level features like shape, eyes, ears, etc.
- Visualizing the image saliency maps clearly shows the main image pixels responsible for the classification score assigned by the network. This is somewhat obvious and expected.
- Updating the image by maximizing the class score does not give any clear representations in my implementation at convergence, but this might be because I fine-tuned the network rather than optimizing against an ImageNet class score. This is plausible since the convolutional layers are frozen when pretraining the network.