In [1]:
%run Latex_macros.ipynb

<IPython.core.display.Latex object>

In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

In [3]:
from IPython.display import Image

import cnn_helper
%aimport cnn_helper
cnnh = cnn_helper.CNN_Helper()

# Overview

We have seen that Neural Networks are capable of performing many tasks very well.

One surprising aspect of this
- we have not *directed* the Neural Network on how to achieve the task
- the task is achieved by minimization of a Loss Function

We have seen that it has the *potential* to be a Universal Function Approximator
- implementing the function defined implicitly
- by the empirical distribution of input/output pairs
- represented by the labeled training dataset

But is there any way to gain insight into what is happening within the layers of a Neural Network ?

That is
- given the many synthetic features created by the Neural Network
- can we  discover/interpret what is the meaning of a particular feature ?

That is the topic of this module.

# Interpretation: The first layer

It is relatively easy to understand the features created by the first layer
- they involve the dot product of an input and some weights
- matches inputs $\x$ against weights (pattern) $\w$

So we can understand the feature
- if we understand the pattern



## Inputs with only a feature dimension

For examples that have only feature dimensions
- the pattern is just a vector of feature values
- of length equal to the length of the input example

A Dense Layer has a pattern
- that exactly identifies the "ideal" input (highest dot product)

Recall the $10$ patterns from our simple Logistic Regression Classifier for the $10$ MNIST digits
- these are the "idealized" digits

<br>
<table>
    <th><center><strong>Patterns for each of the 10 MNIST digits</strong></center></th>
    <tr>
        <td><img src="images/mnist_patterns.png" width=80%></td>
    </tr>
</table>   
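
As a minimal sketch of why the pattern acts as the "ideal" input, the snippet below compares a unit-length input aligned with the pattern against many random unit-length inputs. The weight values are made up for illustration; they are not taken from the MNIST classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights ("pattern") for one Dense-layer feature over 4 inputs.
w = np.array([0.5, -1.0, 2.0, 0.0])

def feature(x, w):
    """Pre-activation value of the feature: dot product of input and pattern."""
    return float(np.dot(x, w))

# Among inputs of equal (unit) length, the one pointing in the direction of w
# yields the highest dot product: it is the "ideal" input for this feature.
ideal = w / np.linalg.norm(w)

# Compare against 1000 random unit-length inputs.
candidates = rng.normal(size=(1000, 4))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
best_random = max(feature(x, w) for x in candidates)

print(feature(ideal, w), best_random)
```

The comparison is restricted to inputs of equal length because, without that restriction, scaling any input up would increase its dot product without bound.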

## Inputs with non-feature dimensions as well as a feature dimension

But we also allow examples to have a "shape"
- *non-feature* dimensions

For example
- an image has $2$ non-feature dimensions: row and column
- in addition to a feature dimension: e.g., $3$ features: Red, Green, Blue

Recall our terminology when dealing with examples having $N \ge 1$ non-feature dimensions
- an element is a vector with only a feature dimension
- we can index an element by a vector of length $N$ in
$$
[1:\dim_{1}] \times[1:\dim_{2}] \times \ldots [1:\dim_{N} ]
$$
- an index identifies a specific *location* in the non-feature dimensions

The patterns are $(N +1)$ dimensional
- one feature dimension of length $n$, which is also the number of input features
- $N$ non-feature dimensions
    - each of length $f$
    - which is smaller than the length of the corresponding non-feature dimension of the input example
    
The output of the  match with a single pattern is a *feature map*
- $N$-dimensional: matches the lengths of the input non-feature dimensions
- a measure of the strength of the pattern's match with the sub-region centered at each location    

So the pattern 
- identifies an "ideal" sub-region in the input example
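
The matching described above can be sketched for $N = 2$ non-feature dimensions. This is an illustrative loop-based version with zero padding at the borders (an assumption), not an efficient or library implementation; the toy input and pattern are invented.

```python
import numpy as np

def feature_map(x, kernel):
    """Match `kernel` against the sub-region of `x` centered at each location.

    x:      (rows, cols, n) input: 2 non-feature dimensions, n features
    kernel: (f, f, n) pattern, with f odd
    Returns a (rows, cols) feature map, using zero padding at the borders.
    """
    rows, cols, _ = x.shape
    f = kernel.shape[0]
    pad = f // 2
    # Zero-pad the non-feature dimensions so every location has a full sub-region.
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            region = xp[r:r + f, c:c + f, :]
            out[r, c] = np.sum(region * kernel)  # dot product with the pattern
    return out

# Toy input: a single-feature 5x5 image containing a vertical line in column 2.
x = np.zeros((5, 5, 1))
x[:, 2, 0] = 1.0

# Toy pattern: a 3x3 vertical-line detector.
kernel = np.zeros((3, 3, 1))
kernel[:, 1, 0] = 1.0

fmap = feature_map(x, kernel)
print(fmap.shape)   # same non-feature dimensions as the input
print(fmap[:, 2])   # response at the locations along the line
```

The response is non-zero only at locations on the line: $3$ in its interior and $2$ at its two ends, where the zero padding removes one pixel from the sub-region.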

To illustrate
- we show the patterns of a CNN layer appearing as layer $1$ of a NN
- there are $n_{(1)} = 96$ patterns
- each pattern is $(7 \times 7 \times n_{(0)})$
    - $n_{(0)} = 3$ is the number of input channels

Each square is a kernel.

<table>
    <center><strong>Layer 1 kernels</strong></center>
    <tr>
        <td><img src="images/img_on_page_-004-112.jpg" width=600></td>
    </tr>
</table>

Attribution: https://arxiv.org/pdf/1311.2901.pdf

The "patterns" being recognized by these kernels seem to represent
- Lines, in various orientations
- Colors
- Shading

We interpret Layer $1$ as trying to construct synthetic features representing these simple concepts.



# Beyond the first layer

Examining weights beyond the first layer presents difficulties
- the patterns are matched against outputs of layer $\ll \gt 0$
- we only know what the features are for layer $0$
    - visually recognizable
    
So we can identify a pattern but can't assign a meaning to the inputs that are being matched.

We will have to come up with ways of interpreting synthetic features
- that do not involve interpreting the patterns

## Probing

One way to gain insight is by *probing*
- choose one feature somewhere in the Neural Network
- try to discover Layer $0$ inputs
- that cause this feature to assume high (positive or negative) values

We call the values produced at a feature in response to inputs the feature's *activations*.

To eliminate ambiguity, we will write
$$\y_{\llp,k} |_{  \y_{(0)} = \x^\ip  }$$

to denote the activation when the Layer $0$ input is $\x^\ip$

If  
- we identify a property $\mathcal{P}$ common to all the inputs resulting in High values
- we can interpret the feature as being a detector for $\mathcal{P}$

The common property may not be easy to discern
- semantics: meaning
- rather than surface: appearance

For example
- there may be a neuron in some layer
- that acts as a "smile detector"
    - triggering only on inputs containing humans that are smiling

To be more precise:

Given a multi-layer Sequential Neural Network
- choose one feature at some layer to probe: $\y_{\llp,k}$

We are interested in the output values (called *activations*) of this feature.

<table>
    <th><center><strong>Interpreting neuron for feature k in layer l</strong></center></th>
    <tr>
        <td><img src="images/interp.png" width=80%></td>
    </tr>
</table>   

When the layer $\ll$ output has only feature-dimensions
- the selected feature is a scalar

for instance, a `Dense` layer:

<br>
<table>
    <center><strong>Dense layer: $\y_\llp$: selecting a neuron to probe</strong></center>
    <tr>
        <td><center><strong>Dense layer: $\y_\llp$</strong></center></td>
        <td><center><strong>Dense layer, one neuron selected: $\y_{\llp,j}$</strong></center></td>
    </tr>
    <tr>
        <td><img src="images/layer.png"></td>
        <td><img src="images/layer_select.png"></td>
    </tr>
</table>

But when layer $\ll$ has $N \ge 1$ non-feature dimensions 
- the selected feature is really a *feature map*
- with dimensions matching the non-feature dimensions of the layer input
$$
(\dim_{1} \times \dim_{2} \times \ldots \dim_{N} )
$$

So 
- there are $\prod_{i=1}^N { \dim_{i} }$ values (one per location) in the feature map
- rather than a single scalar value
- as in the case of layer outputs with only a feature dimension

<br>
<table>
    <center><strong>Convolutional layer: $\y_\llp$: selecting a feature map to probe</strong></center>
    <tr>
        <td><center><strong>Layer w/non-feature dimensions: $\y_\llp$</strong></center></td>
        <td><center><strong>Layer w/non-feature dimensions, one element selected: $\y_{\llp,j}$</strong></center></td>
    </tr>
    <tr>
        <td><img src="images/layer_w_2d_elements.png"></td>
        <td><img src="images/layer_w_2d_elements_select.png"></td>
    </tr>
</table>

In such a case
- we reduce each feature map (with non-feature dimensions)
- to a scalar
- using a Pooling operation to eliminate the non-feature dimensions
    - for example: Global Max Pooling
 

<br>
<table>
    <center><strong>Convolutional layer: $\y_\llp$: selecting a feature map to probe<br>Global Pooling</strong></center>
    <tr>
        <td><center><strong>Layer w/non-feature dimensions: $\y_\llp$</strong></center></td>
        <td><center><strong>Layer w/non-feature dimensions, <br>pooled, one element selected: $\y_{\llp,j}$</strong></center></td>
    </tr>
    <tr>
        <td><img src="images/layer_w_2d_elements.png"></td>
        <td><img src="images/layer_w_2d_elements_pool.png"></td>
    </tr>
</table>

Thus, Probing 
- examines the activation of a feature
- where the activation is represented
- by a single scalar value
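
A minimal sketch of the reduction, assuming the layer output is stored with the feature dimension last (an assumption about layout; the array values are random placeholders):

```python
import numpy as np

# Hypothetical layer output: a 4x4 grid of locations, each with 3 features.
rng = np.random.default_rng(0)
y_l = rng.normal(size=(4, 4, 3))

def global_max_pool(y):
    """Collapse all non-feature dimensions, keeping one scalar per feature."""
    non_feature_axes = tuple(range(y.ndim - 1))  # every axis but the last
    return y.max(axis=non_feature_axes)

pooled = global_max_pool(y_l)   # shape (3,): one activation per feature
k = 1
activation_k = pooled[k]        # the scalar we probe for feature k
print(pooled.shape, activation_k)
```

Because the function pools over every axis except the last, the same code applies to any number $N$ of non-feature dimensions.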

## Maximally Activating Examples

This method identifies
- a subset $S$ of the training examples
$$
S \subset \X
$$
- that produces high activations for the selected feature $\y_{\llp,k}$

Hence, this method is called *Maximally Activating Examples*

The method is quite simple
- pass each input example $\x^\ip$ to the network
- measure the resulting activation of the selected feature
$$\y_{\llp,k} |_{  \y_{(0)} = \x^\ip  }$$
- rank the $m$ resulting activations
$$ \{ i_1, \ldots, i_m \}$$

- classify 
    - the $K$ largest positive activations as High
    - the $K$ largest-magnitude negative activations as Low


i | $\y_{\llp,k}$ | class
:--|:--|:--|
1 | 7.1 | 
2 | -100.2 | Low
3 | -6.3 |
$\vdots$
234 | 1000.4 | High
$\vdots$
$m$ | 45.6 |

Then the $K$ Maximally Activating examples for $\y_{\llp,k}$ are defined as
- the $K$ examples with highest rank (classified as High)

$$
\text{MaxAct}_{\llp,k, K} = \{ \x^{(i_1)}, \ldots, \x^{(i_K)} \}
$$

We then try
- via Intuition, Experiment
- to identify the property $\mathcal{P}$
- that is unique among $\X$ to the examples in 
$
\text{MaxAct}_{\llp,k, K} = \{ \x^{(i_1)}, \ldots, \x^{(i_K)} \}
$
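
The ranking step can be sketched as follows. The activations are assumed to have been computed already (one scalar per example); the values reuse the illustrative numbers from the table above, and indices here are 0-based.

```python
import numpy as np

def max_activating(activations, K):
    """Return the indices of the K highest and the K lowest activations.

    activations: length-m array; entry i is the selected feature's
                 activation when the input is example i
    """
    order = np.argsort(activations)   # ascending order of activation
    low = order[:K]                   # most negative first
    high = order[::-1][:K]            # most positive first
    return high, low

# Hypothetical activations of one feature over m = 6 examples.
acts = np.array([7.1, -100.2, -6.3, 1000.4, 45.6, 0.0])
high, low = max_activating(acts, K=2)
print(high)  # indices of the examples classified as High
print(low)   # indices of the examples classified as Low
```

The returned indices would then be used to retrieve the corresponding inputs for visual inspection.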


## Probing the  Classifier Head

Applying the Maximally Activating Examples technique to the head 
layer $L$  is particularly useful

For a Classifier Head:
$$\y_{(L),k} |_{  \y_{(0)} = \x^\ip  }$$

- is the probability (or pre-probability "logit")
- that example $\x^\ip$ is in Class $k$


We can use Maximally Activating Examples on a Head feature
- to identify inputs
- that are most/least confidently classified as being in Class $k$
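
As a rough sketch, the logits of a Classifier Head can be converted to probabilities with a softmax and then ranked. The logit values below are invented, and all four hypothetical examples are assumed to carry the label $k$.

```python
import numpy as np

def softmax(logits):
    """Convert logits (last axis = classes) into probabilities summing to 1."""
    z = logits - logits.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical head logits for 4 examples over 3 classes.
logits = np.array([
    [0.1, 0.2, 4.0],
    [1.5, 1.4, 1.6],
    [0.0, 3.0, 2.0],
    [0.3, 0.1, 2.5],
])

k = 2
probs = softmax(logits)
p_k = probs[:, k]                         # probability of class k, per example
most_confident = int(np.argmax(p_k))      # example most confidently in class k
least_confident = int(np.argmin(p_k))     # example least confidently in class k
print(most_confident, least_confident)
```

Ranking by the probability rather than the raw logit takes the competing classes into account: the third example has a fairly large logit for class $k$, yet ranks last because another class has a larger one.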

Here we apply the technique to a restricted subset $\X' \subset \X$ 
of input images of digits that have label "8"
$$
\X' = \{ \x^\ip \, | \, \y^\ip = 8 \,\, \text{where } 1 \le i \le m \}
$$

<table>
    <tr>
        <center><strong>MNIST CNN maximally activating 8's</strong></center>
    </tr>
    <tr>
        <td><img src="images/best_worst_8.png"></td> 
    </tr>
</table>

Interesting !  Do we have a problem with certain 8's ?

Much lower probability when
- 8 is thin versus thick
- tilted left versus right

So although our goal was interpretation, this technique may be useful for Error Analysis as well.

# Occlusion

Maximally activating inputs are very coarse: they identify concepts at the level of the entire input.
- when the inputs have non-feature dimensions
- Global Pooling compresses *all* the locations to a single scalar
- losing information about the sub-region having the property



There is a simple technique called *Occlusion* 
- that enables us to find a sub-region of a *particular input* $\x^\ip$
- that is responsible for the activation 
$$\y_{(L),k} |_{  \y_{(0)} = \x^\ip  }$$

It is similar in concept to Convolution applied to Layer $0$ (the example)

In Convolution, we take a filter 
- with $N$ non-feature dimensions
- each of length $f$
- and $n_{(0)}$ features

and compute the dot product of the filter with the sub-region of $\x^\ip$ centered at each location.
- resulting in a feature map with identical non-feature dimensions as the input
$$
\dim_{1} \times \dim_{2} \times \ldots \dim_{N} 
$$
- measuring the strength of the match of the filter and sub-region at each location
    - for each filter/kernel in the Convolutional layer

In Occlusion
- the sub-region of $\x^\ip$ centered at each location
- has all its values changed to an extreme value
    - equivalent to "hiding" the sub-region
    

Rather than computing the dot product at each location, Occlusion produces
- a feature map (*Occlusion Sensitivity map*) with identical non-feature dimensions as the input
- measuring *the change in* the probability $\y_{(L),k}$
    - from un-occluded  $\y_{(L),k}|_{  \y_{(0)} = \x^\ip  }$
    - to the value of $\y_{(L),k}$ when the location is the center of the occluded region
    
It is the sensitivity of $\y_{(L),k}|_{  \y_{(0)} = \x^\ip  }$ to being occluded at each location.



For inputs with non-feature dimensions
$$
\dim_{1} \times \dim_{2} \times \ldots \dim_{N} 
$$
the Occlusion sensitivity Map has the same non-feature dimensions
- just like Convolution with a single kernel/filter

Thus the non-feature dimensions of the input and the sensitivity map are identical.
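
A minimal sketch of the procedure for a single-feature input with $2$ non-feature dimensions. The `toy_model` function is a hypothetical stand-in for a trained network's output $\y_{(L),k}$, and clipping the occluded region at the borders is an assumption.

```python
import numpy as np

def occlusion_map(model, x, size, fill):
    """Occlusion Sensitivity map for a scalar-valued `model` on a 2-D input.

    model: function mapping an input array to a scalar
    x:     (rows, cols) input
    size:  odd side length of the square occluded sub-region
    fill:  the extreme value written into the occluded sub-region
    Returns a (rows, cols) map of (occluded output - un-occluded output).
    """
    rows, cols = x.shape
    half = size // 2
    base = model(x)
    sens = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            occluded = x.copy()
            # Hide the sub-region centered at (r, c), clipped at the borders.
            occluded[max(r - half, 0):r + half + 1,
                     max(c - half, 0):c + half + 1] = fill
            sens[r, c] = model(occluded) - base
    return sens

# Hypothetical stand-in for a network output: it responds only to the
# brightness of the central 2x2 block of a 6x6 input.
def toy_model(x):
    return float(x[2:4, 2:4].sum())

x = np.ones((6, 6))
sens = occlusion_map(toy_model, x, size=3, fill=0.0)
print(sens.shape)   # same non-feature dimensions as the input
print(sens.min())   # largest drop: the central block is fully hidden
```

For this toy model the map is zero wherever the occluding square misses the central block, which is the behavior we would hope to see for a feature that depends on one sub-region only.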



Below is an example for an image with label: Afghan Hound.

It would seem that the probed Layer $5$ feature recognizes faces.
- its activation drops (blue = cold) when the faces are occluded


<table>
    <tr><td colspan="2"><center><strong>Occlusion Sensitivity Map</strong></center></td></tr>
    <tr>
        <th><center>Input image</center></th>
        <th><center>Activation of one filter at layer 5</center></th>
    </tr>
    <tr>
        <td><img src="images/img_on_page_-007-139.png" width=300></td>
        <td><img src="images/img_on_page_-007-148.png" width=300></td>
    </tr>
</table>

Attribution: https://arxiv.org/pdf/1311.2901.pdf#page=7

## Occlusion Experiment 1:  Head Layer  logit on MNIST digit Classification
The following figure shows
- some of the occluded locations in the feature map
- of a particular example $\x^\ip$ representing digit "8"
- with the  proportional change in $\y_{(L),8}$ indicated at the top of the occluded input
- for a NN performing MNIST digit classification

<br>
<table>
    <th><center>Occlusion: Relative decrease in probability of being "8"</center></th>
    <tr>
        <td>
            <img src="images/occlude_8_4_1.png" width=600>
        </td>
    </tr>
</table>

Not what we expected !  

The mere presence of the square greatly changes the classification probability
- even when we are not occluding what we believed to be the most important sub-regions of $\x^\ip$
    - the "pinched waist" of the 8

This suggests that the NN performs Classification 
- in a way different from what we would have directed using a Procedural Program
- perhaps extreme locations
    - are used to recognize other digits
    - so the "bright" occlusion mask confuses the Classifier
    
We might want to use Data Augmentation to correct the Classifier
- adding noise to inputs, preserving the label
- to immunize the Classifier from bright spots at extreme locations
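
One possible form of such augmentation is sketched below: each copy of an input receives a bright square, and the label is kept. The function name, the choice of a square patch, and its uniformly random placement are all assumptions made for illustration.

```python
import numpy as np

def augment_with_bright_patch(X, y, size, rng, value=1.0):
    """Return copies of the inputs with a bright square added; labels unchanged.

    X:     (m, rows, cols) array of single-feature images
    y:     (m,) array of labels
    size:  side length of the square patch
    rng:   a numpy random Generator
    value: pixel value written into the patch
    """
    X_aug = X.copy()
    m, rows, cols = X.shape
    for i in range(m):
        # Top-left corner chosen so the patch lies fully inside the image.
        r = rng.integers(0, rows - size + 1)
        c = rng.integers(0, cols - size + 1)
        X_aug[i, r:r + size, c:c + size] = value
    return X_aug, y.copy()

# Toy data: three blank 8x8 "images" with made-up labels.
rng = np.random.default_rng(0)
X = np.zeros((3, 8, 8))
y = np.array([8, 8, 3])
X_aug, y_aug = augment_with_bright_patch(X, y, size=2, rng=rng)
print(X_aug.shape, y_aug)
```

The augmented examples would be added to the training set alongside the originals, so that the Classifier sees bright spots paired with every label.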

## Occlusion Experiment 2: How does an  ImageNet Classifier work

ImageNet was a competition (important historically in the evolution of Neural Networks)
- classification of images
- from among 1000 different classes
    - 200 different types of dogs and cats !

[Zeiler and Fergus](https://arxiv.org/pdf/1311.2901.pdf) have some interesting Occlusion results.

The Occlusion Sensitivity map we used as illustration above comes from this paper
- Interpretation of Layer 5 feature: face detector

<table>
    <tr><td colspan="2"><center><strong>Occlusion Sensitivity Map</strong></center></td></tr>
    <tr>
        <th><center>Input image</center></th>
        <th><center>Activation of one filter at layer 5</center></th>
    </tr>
    <tr>
        <td><img src="images/img_on_page_-007-139.png" width=300></td>
        <td><img src="images/img_on_page_-007-148.png" width=300></td>
    </tr>
</table>

Attribution: https://arxiv.org/pdf/1311.2901.pdf#page=7

The fact that we have discovered a "face detector" is interesting.
- Faces *are not* one of the 1000 possible labels
- Perhaps this non-label feature is necessary
    - to assist in creating features that *do identify* labels

For example
- there is evidence that many Classifiers have features that recognize Letter Characters (e.g., A-Z)
    - not one of the 1000 classes
- which may, in turn
    - help to identify "Book", which is one of the 1000 classes

The results of probing
- the logit of the class "Afghan Hound"
    - the correct label for the input image
- are very interesting


<table>
    <tr><td colspan="2"><center><strong>Occlusion Sensitivity Map</strong></center></td></tr>
    <tr>
        <th><center>Input image</center></th>
        <th><center>Change in logit for "Afghan hound"</center></th>
    </tr>
    <tr>
        <td><img src="images/img_on_page_-007-139.png" width=300></td>
        <td><img src="images/img_on_page_-007-145.png" width=300></td>
    </tr>
</table>

Attribution: https://arxiv.org/pdf/1311.2901.pdf#page=7

Occluding the dog causes a big drop (blue: cold) in probability of correct classification
- as expected

But occluding each face *increases the probability* (red: hot) of correct classification !
- Perhaps the presence of a face is suggestive of an alternative class
    - removing the input signal for the alternative class results in a more confident prediction for the correct class
- Even though "face" is not itself a class


Occlusion has helped us learn something unexpected about the workings of the Neural Network.

# Saliency maps

Each location in the Occlusion Sensitivity map reflects
- a change in $\y_{\llp,k}$
- given a fairly big change
    - occlusion replaces pixels with an extreme value
- in a region of $\y_{(0)}$

We can compute a more traditional sensitivity via the derivative

$$
\frac{\partial \y_{\llp,k}}{\partial \y_{(0)}} \, {\Big |}_{\y_{(0)} = \x^\ip}
$$

Each location in this derivative (same non-feature dimensions as $\y_{(0)}$) reflects
- a change in $\y_{\llp,k}$
- for an infinitesimal change
- in a single location in $\y_{(0)}$

This is called a *Saliency Map*
- when input has non-feature dimensions
- the Saliency Map has the same non-feature dimensions
$$
\dim_{1} \times \dim_{2} \times \ldots \dim_{N} 
$$

Saliency Maps, when applied to a Head Layer logit $k$
- explain the influence of each location in the input $\y_{(0)}$
- on the classification of the input as being in class $k$

Hence, they are useful for explaining the output of a NN.
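
To make the definition concrete, the derivative can be approximated numerically by nudging one input location at a time. This is only a sketch for building intuition: `toy_logit` is a hypothetical stand-in for a network output, and finite differences would be far too slow for a real image.

```python
import numpy as np

def numerical_saliency(model, x, eps=1e-5):
    """Approximate d model / d x at each location by central finite differences."""
    sal = np.zeros_like(x, dtype=float)
    for idx in np.ndindex(x.shape):
        plus, minus = x.copy(), x.copy()
        plus[idx] += eps     # nudge a single location up
        minus[idx] -= eps    # and down
        sal[idx] = (model(plus) - model(minus)) / (2 * eps)
    return sal

# Hypothetical "logit": a weighted sum of the pixels, squashed by tanh.
w = np.array([[0.0, 1.0], [-2.0, 0.5]])

def toy_logit(x):
    return float(np.tanh(np.sum(w * x)))

x = np.array([[0.2, 0.1], [0.0, 0.4]])
sal = numerical_saliency(toy_logit, x)
print(sal.shape)  # same non-feature dimensions as the input
print(sal)
```

For this toy function the exact derivative is $(1 - \tanh^2(s)) \, \w$ with $s = \sum \w \cdot \x$, so a location with zero weight gets zero saliency and the sign of each entry follows the sign of its weight.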

## Understanding a non-head layer via Saliency Maps

Saliency Maps are also useful for explaining features in non-head layers.

Recall that the Saliency Map and Input have the same non-feature dimensions.

### Saliency map for a shallow layer

Below is a collection of Saliency Maps for
some feature in Layer 2 of an ImageNet Classifier.
- maps for $9$ **different input examples**
    - the examples with largest activation in the feature map

All $9$ examples appear to be eyeballs.

It would seem this Layer $2$ feature is recognizing eyeballs.

The diagram can be confusing
- the maps are for $9$ **different input examples**
- the non-feature dimensions seem to be for a sub-region (a *patch*) of the input, rather than the entire input
    - just the eye, not the rest of the image
    
We will explain after presenting the diagram.

As a first pass
- these are the $9$ examples that stimulated the feature most strongly
    - hence, may be useful for interpreting what the feature is
- on the left is a saliency map for a sub-region (*patch*) of the input
- on the right is the corresponding patch


<center><strong>Saliency Maps and Corresponding Patches<br>Single Layer 2 Feature Map<br>On multiple input images</strong></center>
<table>
    <tr>
        <td><img src="images/ZF_p4_115_row10_col3_mag.png"></td>
        <td><img src="images/ZF_p4_115_row10_col3_patch_mag.png"></td>
    </tr>
    <tr>
        <td colspan=2><center>Layer 2 Feature Map (Row 10, col 3).</center></td>
    </tr>
</table>
Attribution: https://arxiv.org/abs/1311.2901#page=4

**Explaining why the diagram has "small" maps and patches**

Why are the Saliency Maps and corresponding patches restricted to sub-regions of the input ?
- i.e., smaller than
$$
\dim_{1} \times \dim_{2} \times \ldots \dim_{N} 
$$

Recall that the multiple locations in the layer are reduced to a single value
- the max, when using Max Pooling for the summarization
- so the Saliency map is the change of a *single location* $\y_{\llp,\text{idx}, k}$ in $\y_{\llp,k}$ 
$$
\frac{\partial \y_{\llp,\text{idx}, k}}{\partial \y_{(0)}} \, {\Big |}_{\y_{(0)} = \x^\ip}
$$
    - where $\text{idx}$ is the location of the *max*


In a NN with multiple CNN layers, 
- the [*receptive field*](CNN_Receptive_Field.ipynb)
- is the input **sub-region** that affects a single location in a layer
- the dimensions of the sub-region grow with the depth (i.e., layer number $\ll$) of the layer

So, in a shallow layer (i.e., Layer $2$ in our diagram)
- the receptive field for *any* location
- is less than the full input
    - very small: only slightly larger than $f$, the size of a side of the filter/kernel

Thus, the non-feature dimensions of the Saliency Map for a shallow layer (e.g., layer 2 in the diagram)
- are much smaller than
$$
\dim_{1} \times \dim_{2} \times \ldots \dim_{N} 
$$
- because the receptive field for $\text{idx}$, the location of the max in layer $\ll$
- is small
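
The growth of the receptive field can be sketched with the usual recurrence for stacked convolutional layers. The layer configurations below (filter size $f$ and stride per layer) are hypothetical and are not those of the network in the diagram.

```python
def receptive_field(layers):
    """Side length of the input sub-region seen by one location after each layer.

    layers: list of (f, stride) pairs, one per convolutional layer,
            listed from the first layer to the deepest
    """
    r, jump = 1, 1   # a Layer 0 location sees itself; adjacent locations are 1 apart
    sizes = []
    for f, stride in layers:
        r = r + (f - 1) * jump   # the layer widens the field by (f - 1) steps
        jump = jump * stride     # spacing, in input pixels, between adjacent locations
        sizes.append(r)
    return sizes

# Five layers of 3x3 filters with stride 1.
print(receptive_field([(3, 1)] * 5))   # [3, 5, 7, 9, 11]

# Strides greater than 1 make the field grow faster.
print(receptive_field([(3, 2), (3, 2)]))  # [3, 7]
```

With stride $1$, a location in Layer $2$ sees only $5$ input pixels per side when $f = 3$, which is why the maps for shallow layers cover small patches.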

### Saliency map for a deeper layer 

As we go deeper into the network
- the size of the receptive field grows in a NN with successive CNN layers
- the representations become more complex
    - perhaps because of the larger receptive field
    - perhaps just because they are combinations of more complex representations
        - their layer inputs



In Layer 5, the feature whose map we show
- may be recognizing "smiling faces"
    - note the high (red) sensitivity
    - to lips and cheeks

<br>
<center><strong>Saliency Maps and Corresponding Patches<br>Single Layer 5 Feature Map<br>On 9 Maximally Activating Input images</strong></center>

<table>
    <tr>
        <td><img src="images/ZF_p4_118_row11_col1_mag.png"></td>
        <td><img src="images/ZF_p4_118_row11_col1_patch_mag.png"></td>
    </tr>
    <tr>
        <td colspan=2><center>Layer 5 ? Feature Map (Row 11, col 1).</center></td>
    </tr>
</table>
Attribution: https://arxiv.org/abs/1311.2901


## Computing the Saliency Map

Computing a Saliency Map is easy
- a simple variant of Back Propagation



Recall the definition of the Loss Gradient in Back Propagation
$$\loss'_\llp = \frac{\partial \loss}{\partial \y_\llp}$$ 

and its recursive update
$$
\begin{array}{lll}
\loss'_{(\ll-1)} & = & \frac{\partial \loss}{\partial \y_{(\ll-1)}} \\
         & = & \frac{\partial \loss}{\partial \y_\llp} \frac{\partial \y_\llp}{\partial \y_{(\ll-1)}} \\
         & = & \loss'_\llp \frac{\partial \y_\llp}{\partial \y_{(\ll-1)}}
\end{array}
$$

To compute Saliency Maps
- replace $\loss$ with $\y_{\llp,k}$
- so the "loss gradient" is now the "saliency gradient"
$$
\loss'_{(\ll ')} = \frac{\partial \y_{\llp,k}}{\partial \y_{(\ll ')}}
$$
    - we use the index $\ll$ to denote the layer of the feature map
    - thus, we are forced to use $\ll '$ in the subscript of $\loss'$ to avoid conflict

Substituting $\ll' = 0$:
$$\loss'_{(0)} = \frac{\partial \y_{\llp,k}}{\partial \y_{(0)}}$$

we get the derivative defining the Saliency Map.
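
The substitution can be sketched by hand for a tiny network with one ReLU layer followed by a linear head. The weights are random placeholders and the architecture is an assumption chosen to keep the backward pass short; in practice a framework's automatic differentiation would do this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-layer network: 4 inputs -> 5 ReLU features -> 3 logits.
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)

def forward(x):
    a1 = W1 @ x + b1            # layer 1 pre-activation
    y1 = np.maximum(a1, 0.0)    # layer 1 output (ReLU)
    y2 = W2 @ y1 + b2           # layer 2 output (logits)
    return a1, y1, y2

def saliency(x, k):
    """Derivative of logit k with respect to the input, by backward recursion."""
    a1, y1, y2 = forward(x)
    g2 = np.zeros_like(y2)
    g2[k] = 1.0                 # d y2[k] / d y2: the starting "saliency gradient"
    g1 = W2.T @ g2              # d y2[k] / d y1
    g_a1 = g1 * (a1 > 0)        # back through the ReLU
    g0 = W1.T @ g_a1            # d y2[k] / d x: the Saliency Map
    return g0

x = np.array([0.3, -0.8, 1.2, 0.5])
sal = saliency(x, k=1)
print(sal)
```

The only difference from ordinary Back Propagation is the starting point: a one-hot vector selecting $\y_{\llp,k}$ takes the place of the derivative of the loss.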

### Guided Back Propagation

Our ultimate purpose is to try to *interpret* the meaning of a synthetic feature.

The "true" mathematical derivative of the Saliency Map
- is sometimes sacrificed
- in order to enhance the interpretability

[Zeiler and Fergus](https://arxiv.org/abs/1311.2901) (and similar related papers) modify Back propagation 
- In an attempt to get better intuition as to which input features most affect $\y_{\llp,k}$
- For example: alter how derivatives flow backwards through a ReLU
    - passing back only the positive derivatives, so that the map highlights inputs that *increase* the feature

One well-known variant of this idea is called *Guided Back propagation*.
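
As one concrete illustration of modifying the backward pass, the sketch below contrasts the ordinary backward rule for a ReLU with the guided rule as described by Springenberg et al. Treating that rule as the variant intended here is an assumption; the arrays are made-up values.

```python
import numpy as np

def relu_backward(grad_out, pre_activation):
    """Ordinary rule: pass the gradient where the forward input was positive."""
    return np.where(pre_activation > 0, grad_out, 0.0)

def guided_relu_backward(grad_out, pre_activation):
    """Guided rule: additionally drop negative gradients flowing backwards."""
    keep = (pre_activation > 0) & (grad_out > 0)
    return np.where(keep, grad_out, 0.0)

pre = np.array([1.0, -2.0, 3.0, 0.5])    # forward inputs to the ReLU
grad = np.array([0.7, 0.4, -1.2, 2.0])   # gradients arriving from the layer above

print(relu_backward(grad, pre))
print(guided_relu_backward(grad, pre))
```

The two rules differ only at the third position, where the forward input was positive but the incoming gradient is negative: the ordinary rule passes it back and the guided rule zeroes it. The result is no longer the true derivative, but the maps tend to be easier to read.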

# Video: interactive interpretation of features

There is a nice video by [Yosinski](https://youtu.be/AgkfIQ4IGaM) which examines the behavior of
a Neural Network's layers on video images rather than stills.
- using several of the techniques we describe

In [4]:
print("Done")

Done
