# Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization

Grad-CAM can be regarded as a generalized version of Class Activation Mapping (CAM).

## Why does interpretability matter?

- Deep networks cannot be decomposed into intuitive, understandable components, which makes them hard to interpret
- Transparent models are necessary to build trust in intelligent systems and to integrate them into everyday life
- When AI is weaker than humans: explanations help identify failure modes
- When AI is on par with humans: explanations help establish trust and confidence in users
- When AI is stronger than humans: explanations enable machine teaching, where a machine teaches a human to make better decisions

## Motivation

- CAM: Learning Deep Features for Discriminative Localization

<img src="./Images/1.png" width=600 />

- Class Activation Mapping (CAM) applies only to architectures with a global average pooling (GAP) layer
- Goal: make CAM applicable to a wide variety of CNN models <br />
: CNNs with fully-connected layers (e.g., VGG) <br />
: CNNs for structured outputs (e.g., captioning) <br />
: CNNs used in tasks with multi-modal inputs (e.g., VQA)
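
To make the GAP restriction concrete, here is a minimal NumPy sketch of how a CAM is computed. It assumes a network whose last convolutional feature maps are global-average-pooled and fed to a single linear layer. The names `class_activation_map`, `feature_maps`, and `fc_weights` are illustrative and not taken from the paper's code.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """CAM for one class.

    feature_maps: (K, H, W) activations of the last conv layer (assumed input).
    fc_weights:   (C, K) weights of the linear layer that follows GAP (assumed).
    """
    # Weight each feature map by the class's linear-layer weight, then sum over K.
    return np.tensordot(fc_weights[class_idx], feature_maps, axes=1)  # (H, W)

rng = np.random.default_rng(0)
feature_maps = rng.random((4, 3, 3))
fc_weights = rng.standard_normal((2, 4))

cam = class_activation_map(feature_maps, fc_weights, class_idx=1)
# In a GAP + linear network, the class score equals the spatial mean of the CAM.
score = fc_weights[1] @ feature_maps.mean(axis=(1, 2))
```

The per-class weights come directly from the linear layer after GAP, which is why this construction does not carry over to networks with other heads such as stacked fully-connected layers.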

## Related Work

- Visualizing CNNs <br />
: Highlight important pixels: not class-discriminative <br />
: Synthesize images to maximally activate a network unit or invert a latent representation: does not explain specific input images
- Assessing Model Trust <br />
: Motivated by notions of interpretability <br />
: There are some methods to assess trust in models
- Weakly supervised localization <br />
: Perturbing inputs by occlusion <br />
: Marginal winning probability <br />
: Class Activation Mapping (global average pooling, global max pooling, log-sum-exp pooling) <br />
: This paper introduces a new way of combining feature maps using the gradient signal

## Contributions

- Grad-CAM applies to any CNN-based network without requiring architectural changes or re-training
- Apply Grad-CAM to existing top-performing classification, captioning, and VQA models
- Conduct human studies to test whether Grad-CAM explanations help establish human trust and whether untrained users can discern a stronger network from a weaker one

## Demo video

https://github.com/ramprs/grad-cam/


https://www.youtube.com/watch?v=COjUB9Izk6E

## Approach

<img src="./Images/2.png" width=600 />
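
The core computation can be sketched in a few lines of NumPy: global-average-pool the gradients of the class score with respect to a convolutional layer's feature maps to get one weight per map, take the weighted sum of the maps, and apply ReLU. This sketch assumes the activations and gradients have already been extracted by a deep learning framework. The function name `grad_cam` and the toy arrays are illustrative only.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from precomputed tensors.

    activations: (K, H, W) feature maps A^k of the chosen conv layer.
    gradients:   (K, H, W) gradients of the class score y^c w.r.t. A^k,
                 assumed to come from a framework's backward pass.
    """
    # alpha_k^c: global-average-pool the gradients over the spatial dimensions.
    alphas = gradients.mean(axis=(1, 2))                  # (K,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence only.
    cam = np.tensordot(alphas, activations, axes=1)       # (H, W)
    return np.maximum(cam, 0.0)

# Toy example with K=2 feature maps of size 2x2.
activations = np.array([[[1.0, 2.0], [0.0, 1.0]],
                        [[3.0, 0.0], [1.0, 1.0]]])
gradients = np.array([[[0.5, 0.5], [0.5, 0.5]],
                      [[-1.0, -1.0], [-1.0, -1.0]]])
heatmap = grad_cam(activations, gradients)
```

The resulting heatmap has the spatial resolution of the feature maps, so it would typically be upsampled to the input image size before being overlaid.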

## Grad-CAM as a generalization of CAM

<img src="./Images/3.png" width=600 />
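
The relationship can be written out in a few lines. The sketch below assumes a CAM-style network (GAP followed by one linear layer) and uses $A^k$ for the $k$-th feature map, $Z$ for the number of spatial locations, $w_k^c$ for the CAM weights, $Y^c$ for the class score, and $\alpha_k^c$ for the Grad-CAM weights.

$$
Y^c = \sum_k w_k^c \, F^k, \qquad F^k = \frac{1}{Z} \sum_{i}\sum_{j} A^k_{ij}
$$

$$
\frac{\partial Y^c}{\partial A^k_{ij}} = \frac{\partial Y^c}{\partial F^k} \cdot \frac{\partial F^k}{\partial A^k_{ij}} = \frac{w_k^c}{Z}
$$

Summing both sides over the $Z$ spatial locations gives

$$
w_k^c = \sum_{i}\sum_{j} \frac{\partial Y^c}{\partial A^k_{ij}} = Z \, \alpha_k^c, \qquad \alpha_k^c = \frac{1}{Z} \sum_{i}\sum_{j} \frac{\partial Y^c}{\partial A^k_{ij}}
$$

So for this architecture the Grad-CAM weights equal the CAM weights up to the constant factor $1/Z$, which disappears when the heatmap is normalized for display.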

## Evaluating Localization

- Weakly supervised localization <br />
: Use off-the-shelf VGG-16 from Caffe Model Zoo <br />
: Binarize the Grad-CAM map at 15% of the max intensity <br />
: Draw bounding box around the single largest segment

<img src="./Images/4.png" width=600 />
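
The localization steps above (threshold at 15% of the maximum, then box the largest segment) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `largest_segment_bbox`, the use of 4-connectivity, and the toy heatmap are all assumptions.

```python
import numpy as np
from collections import deque

def largest_segment_bbox(heatmap, threshold_ratio=0.15):
    """Return (row_min, col_min, row_max, col_max) of the largest connected
    region above threshold_ratio * max(heatmap), or None if no pixel passes."""
    mask = heatmap > threshold_ratio * heatmap.max()
    seen = np.zeros_like(mask, dtype=bool)
    rows, cols = mask.shape
    best = []
    for r0, c0 in zip(*np.nonzero(mask)):
        if seen[r0, c0]:
            continue
        # Breadth-first search over 4-connected neighbours (assumed connectivity).
        seen[r0, c0] = True
        queue, segment = deque([(r0, c0)]), []
        while queue:
            r, c = queue.popleft()
            segment.append((r, c))
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if 0 <= nr < rows and 0 <= nc < cols and mask[nr, nc] and not seen[nr, nc]:
                    seen[nr, nc] = True
                    queue.append((nr, nc))
        if len(segment) > len(best):
            best = segment
    if not best:
        return None
    rs, cs = zip(*best)
    return int(min(rs)), int(min(cs)), int(max(rs)), int(max(cs))

heatmap = np.array([[0.0, 0.0, 0.0, 0.0, 0.9],
                    [0.0, 1.0, 0.8, 0.0, 0.0],
                    [0.0, 0.7, 0.1, 0.0, 0.0],
                    [0.0, 0.0, 0.0, 0.0, 0.0]])
bbox = largest_segment_bbox(heatmap)
```

In the toy heatmap, the isolated high value in the top-right corner forms a smaller segment and is ignored in favor of the three-pixel region.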

- Weakly supervised segmentation <br />
: Replace CAM with Grad-CAM in the Seed, Expand, Constrain (SEC) algorithm

- Pointing Game <br />
: Grad-CAM (70.58%) vs. C-MWP (60.30%)
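
For reference, the pointing game counts a hit when the maximum of the heatmap falls on the annotated object and reports hits divided by hits plus misses. The sketch below is a simplified version under stated assumptions: it uses binary ground-truth masks, no pixel tolerance, and no per-class averaging, and the function names are hypothetical.

```python
import numpy as np

def pointing_hit(heatmap, gt_mask):
    """True if the heatmap's maximum lies on the annotated object."""
    r, c = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return bool(gt_mask[r, c])

def pointing_accuracy(heatmaps, gt_masks):
    """Fraction of (heatmap, mask) pairs that are hits."""
    hits = sum(pointing_hit(h, m) for h, m in zip(heatmaps, gt_masks))
    return hits / len(heatmaps)

heatmaps = [np.array([[0.1, 0.9], [0.2, 0.3]]),
            np.array([[0.8, 0.1], [0.2, 0.3]])]
gt_masks = [np.array([[0, 1], [0, 0]]),
            np.array([[0, 0], [0, 1]])]
accuracy = pointing_accuracy(heatmaps, gt_masks)  # one hit, one miss
```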

- Class Discrimination <br />
: 43 AMT workers, 4 visualizations, 90 image category pairs, 9 ratings each <br />
: Accuracy: Deconvolution 53.33%, Guided Backpropagation 44.44%, Guided Grad-CAM 61.23%, Deconvolution Grad-CAM 61.23%

- Trustworthiness <br />
: 54 AMT workers, 2 classifiers (AlexNet, VGG-16), 2 visualizations <br />
: Show only cases where both models make the same prediction with similar output scores <br />
: Humans can identify that VGG-16 is the better model <br />
: Guided Grad-CAM shows a larger difference between the models <br />
: Average score of 1.27 (vs. 1.0 with Guided Backprop)

- Faithfulness vs Interpretability <br />
: Measure the change in CNN scores after occluding image patches <br />
: The change in score correlates highly with Grad-CAM
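
The occlusion check in the last item can be sketched as follows: occlude one patch at a time, record the drop in the class score, and compare the resulting map with the explanation heatmap. Everything here is a toy assumption for illustration: the linear `score_fn` stands in for a CNN, the 2x2 `heatmap` stands in for a Grad-CAM map, and the rank-based correlation (ties not averaged) is one possible choice of measure.

```python
import numpy as np

def occlusion_map(image, score_fn, patch=2, fill=0.0):
    """Drop in score when each non-overlapping patch is replaced by `fill`."""
    base = score_fn(image)
    h, w = image.shape
    drops = np.zeros((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            occluded = image.copy()
            occluded[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = fill
            drops[i, j] = base - score_fn(occluded)
    return drops

def rank_correlation(a, b):
    """Pearson correlation of the ranks of a and b (ties are not averaged)."""
    ra = np.argsort(np.argsort(a.ravel()))
    rb = np.argsort(np.argsort(b.ravel()))
    return float(np.corrcoef(ra, rb)[0, 1])

# Toy stand-in for a CNN class score: a fixed linear function of the pixels.
weights = np.arange(16, dtype=float).reshape(4, 4)
score_fn = lambda img: float((img * weights).sum())

image = np.ones((4, 4))
drops = occlusion_map(image, score_fn)
# Hypothetical explanation heatmap at patch resolution.
heatmap = np.array([[0.1, 0.2], [0.6, 0.9]])
corr = rank_correlation(drops, heatmap)
```

A correlation close to 1 means the heatmap ranks patches in the same order as their measured effect on the score.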

- Class discrimination

<img src="./Images/6.png" width=200 />

- Trustworthiness

<img src="./Images/7.png" width=300 />

## Diagnosing image classification CNNs

- Analyzing failure modes for VGG-16

<img src="./Images/9.png" width=400 />

- Identifying bias in dataset

<img src="./Images/10.png" width=400 />

<img src="./Images/11.png" width=250 />

## Counterfactual explanations

- Negate the gradients of the class score with respect to the feature maps to find regions that decrease the output score

<img src="./Images/12.png" width=600 />
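
A minimal sketch of the counterfactual variant, assuming activations and gradients are already available as NumPy arrays: the only change from standard Grad-CAM is that the gradients are negated before pooling. The function name `counterfactual_grad_cam` and the toy arrays are illustrative.

```python
import numpy as np

def counterfactual_grad_cam(activations, gradients):
    """Highlight regions whose presence lowers the class score.

    activations, gradients: (K, H, W) arrays, assumed precomputed by a framework.
    """
    # Negating the gradients flips the sign of each feature map's importance weight.
    alphas = (-gradients).mean(axis=(1, 2))               # (K,)
    cam = np.tensordot(alphas, activations, axes=1)       # (H, W)
    return np.maximum(cam, 0.0)

activations = np.array([[[1.0, 2.0], [0.0, 1.0]],
                        [[3.0, 0.0], [1.0, 1.0]]])
gradients = np.array([[[0.5, 0.5], [0.5, 0.5]],
                      [[-1.0, -1.0], [-1.0, -1.0]]])
counterfactual = counterfactual_grad_cam(activations, gradients)
```

Here the second feature map has negative gradients, so the locations where it is active dominate the counterfactual heatmap.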

## Image captioning

- Use neuraltalk2: VGG-16 for image features and an LSTM language model (no explicit attention)

- Compare with DenseCap <br />
: Consists of a Fully Convolutional Localization Network and an LSTM

<img src="./Images/13.png" width=400 />

## Visual Q&A

- Compare with human attention map <br />
: Correlation of 0.136 on 1374 validation question-image pairs

<img src="./Images/14.png" width=600 />

## Conclusion

- The paper proposes Gradient-weighted Class Activation Mapping (Grad-CAM) as a generalization of CAM
- Grad-CAM is combined with existing high-resolution visualizations
- Human studies show that the visualizations help assess the trustworthiness of a classifier and identify biases in datasets
- An AI system should not only be intelligent but also be able to reason about its beliefs and actions so that humans can trust it
- Future work includes extending the approach to other settings such as reinforcement learning, natural language processing, and video applications