In [1]:
%run Latex_macros.ipynb

<IPython.core.display.Latex object>

In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

In [3]:
from IPython.display import Image

import cnn_helper
%aimport cnn_helper
cnnh = cnn_helper.CNN_Helper()

# Basic methods for Interpretation

We begin our study of Interpretability by presenting simple techniques.

Our discussion will be specialized to Neural Networks
- Consisting of multiple Convolutional Layers

The reason for this specialization is two-fold
- They are extremely common for tasks involving images (something that humans can easily interpret)
- The ability of a Convolutional Layer to preserve spatial dimensions
- Across Layers
- Means it's easy to relate features at layer $\ll$ back to the same spatial location in the input

Let's do a quick refresher on the important concepts and notation of Convolutional Layers.

# CNN refresher (notation)

(We review concepts from the lecture on Convolutional Neural Networks (CNN).)

A *feature map* for layer $\ll$
- Is the value of a *single* feature at layer $\ll$
- At *each* spatial location

An element of a feature map is the value of the feature at a single spatial location.



Here are the feature maps for two layers
- Layer $(\ll-1)$ has three feature maps 
$$\y_{(\ll-1),\ldots, k} \text{ for features }1 \le k \le 3$$
- Layer $\llp$ has two feature maps
$$\y_{\llp, \ldots, k} \text{ for features }1 \le k \le 2$$


**Aside: Notation reminder**

The feature/channel dimension
- Appears *last* in the subscripted list of indices (Channel Last convention)
- The ellipses ($\ldots$) signify the variable number of *spatial* dimensions
- Thus feature $k$ of layer $\ll$ is denoted $\y_{\llp, \ldots, k}$


<div>
    <center><strong>Feature maps</strong></center>
    <br>
<img src=images/Conv3d_2_feature_maps.png>
    </div>

Each feature map $k$ of layer $\ll$
- Was created by applying a $(f_\llp \times f_\llp \times n_{(\ll-1)})$ convolutional kernel $\kernel_{\llp,k}$
- To layer $(\ll-1)$ output $\y_{(\ll-1)}$

We "slide the kernel" over all spatial locations  of $\y_{(\ll-1)}$ 
- The Convolutional Layer $\ll$
- Preserves the spatial dimension
- But changes the number of features from $n_{(\ll-1)}$ to $n_\llp$
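As a minimal sketch of the shape bookkeeping described above, here is a naive convolutional layer in NumPy. It assumes stride 1, zero padding that keeps the output the same spatial size as the input, an odd kernel size, and no bias or activation. The function name `conv2d_same` and the toy sizes are illustrative only and are not part of `cnn_helper`.

```python
import numpy as np

def conv2d_same(y_prev, kernels):
    """Naive stride-1 convolution with zero padding that preserves spatial size.

    y_prev:  (H, W, n_prev) output of layer l-1 (channel last)
    kernels: (f, f, n_prev, n_out), one (f x f x n_prev) kernel per output feature;
             f is assumed to be odd
    """
    H, W, n_prev = y_prev.shape
    f, _, _, n_out = kernels.shape
    p = f // 2
    # Zero-pad only the two spatial dimensions
    padded = np.pad(y_prev, ((p, p), (p, p), (0, 0)))
    out = np.zeros((H, W, n_out))
    for r in range(H):
        for c in range(W):
            patch = padded[r:r + f, c:c + f, :]  # (f, f, n_prev)
            # Dot product of the patch with each of the n_out kernels
            out[r, c, :] = np.tensordot(patch, kernels, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
y_prev = rng.normal(size=(8, 8, 3))      # layer (l-1): three feature maps
kernels = rng.normal(size=(3, 3, 3, 2))  # layer l: two kernels, each 3 x 3 x 3
y = conv2d_same(y_prev, kernels)
print(y.shape)  # (8, 8, 2)
```

The spatial shape $(8 \times 8)$ is unchanged, while the number of features goes from 3 to 2.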

<div>
    <center><strong>Two layer l feature maps, same spatial location but different output features</strong></center>
    <br>
<img src=images/Conv3d_2.png>
    </div>

Since a Convolutional layer $\ll$
- Preserves the spatial dimension of its input (layer $(\ll-1)$ output)
- Assuming "same" padding (the input is padded so that the output has the same spatial size)
- We can directly relate the spatial location of each feature map
- To a spatial location of layer $0$, the input

The question we seek to answer:
- Can we describe (interpret) the feature being recognized in a single feature map of layer $\ll$ ?

Much of our presentation is based on a very influential paper
by [Zeiler and Fergus](https://arxiv.org/abs/1311.2901)
- NYU PhD candidate and advisor !


# Interpretation: The first layer

It is relatively easy to understand the features created by the first layer

Since feature map $k$ is the result of a dot-product (convolution)
- And the dot product is performing a pattern match
- Of the pattern given by kernel $\kernel_{(1),k}$
- Against a region of the input
- We can interpret layer $1$ as trying to create synthetic features identified by the pattern




So all we have to do is examine each kernel to see the pattern for feature $k$ !

Here is a visualization of the kernels from the Zeiler and Fergus paper
- For 96 individual features
- Being computed by the first layer ($1$)
- Using a $(7 \times 7 \times n_{(0)})$ kernel
    - $n_{(0)}$ is the number of input channels

Each square is a kernel, whose spatial dimensions are $(7 \times 7)$ and depth $n_{(0)} = 3$

<table>
    <center><strong>Layer 1 kernels</strong></center>
    <tr>
        <td><img src="images/img_on_page_-004-112.jpg" width=800></td>
    </tr>
</table>

The "patterns" being recognized by these kernels seem to represent
- Lines, in various orientations
- Colors
- Shading

We interpret Layer $1$ as trying to construct synthetic features representing these simple concepts.
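One way tiles like these might be rendered is to rescale each kernel's weights into the displayable range $[0, 1]$. The sketch below is an assumption about the rendering, not the procedure used in the paper. The helper `kernel_to_image` is hypothetical, and the random `weights` array is only a stand-in for learned layer-1 weights.

```python
import numpy as np

def kernel_to_image(kernel):
    """Min-max rescale one (f, f, 3) kernel to [0, 1] so it can be shown as an RGB tile."""
    lo, hi = kernel.min(), kernel.max()
    if hi == lo:
        # A constant kernel has no pattern: render it as mid-gray
        return np.full_like(kernel, 0.5, dtype=float)
    return (kernel - lo) / (hi - lo)

rng = np.random.default_rng(0)
# Stand-in for learned weights: 96 kernels, each 7 x 7 x 3 (channel last)
weights = rng.normal(size=(7, 7, 3, 96))
tiles = [kernel_to_image(weights[..., k]) for k in range(weights.shape[-1])]
print(len(tiles), tiles[0].shape)  # 96 (7, 7, 3)
```

Each tile could then be displayed with matplotlib's `imshow`. Because every kernel is rescaled separately, colors are comparable within a tile but not across tiles.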



So feature map $k$ of layer $1$ can be interpreted as
- Identifying the presence/absence of pattern $\kernel_{(1),k}$ in input $\x$
- At each spatial location of the input
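To make the "pattern match" reading concrete, here is a toy example with a single input channel and no padding. The $(3 \times 3)$ kernel is a hypothetical hand-written vertical-edge pattern, not one of the learned kernels shown above.

```python
import numpy as np

# Toy 6x6 grayscale input: dark left half, bright right half (a vertical edge)
x = np.zeros((6, 6))
x[:, 3:] = 1.0

# Hypothetical 3x3 kernel whose pattern is a dark-to-bright vertical edge
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

# Slide the kernel over every location where it fits entirely inside the input
f = kernel.shape[0]
H, W = x.shape
feature_map = np.zeros((H - f + 1, W - f + 1))
for r in range(H - f + 1):
    for c in range(W - f + 1):
        # Dot product between the kernel and the input region it covers
        feature_map[r, c] = np.sum(x[r:r + f, c:c + f] * kernel)

print(feature_map[0])  # [0. 3. 3. 0.]
```

The response is large only where the window straddles the edge, and zero over the uniform regions: the feature map marks where the pattern is present.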

**Layer 1 kernel example from Figure 2**

There are kernels looking for "checkered" patterns
- At row 7, columns 1 and 5


Note that examining layer $1$ kernels
- Is *input independent*
- Does not depend on the value of any example $\x^\ip$

# Beyond the first layer: Clustering examples

We could try to interpret the kernels of layer $(\ll \gt 1)$ but this will be difficult
- Layer $\ll$'s inputs ($\y_{(\ll-1)}$) are *synthetic features*, rather than actual inputs
- Unless we understand the synthetic features of the earlier layers
- We won't be able to interpret the pattern that layer $\ll$ is matching

What we can hope to do
- Somehow map the representation created by layer $(\ll >1)$ back to the inputs (layer 0 output)

We will present several *input dependent* methods 
- They depend on a particular input example $\x^\ip$

So the interpretation is only as good as the set of input examples we use.

The methods will  find *clusters* of examples
- That produce a similar feature map
- For map $k$ at layer $\ll$

If we can identify a property that is common to all examples in the cluster
- We can interpret feature map $k$ of layer $\ll$ as implementing the feature
>"Is the property present in the input ?"

## Maximally Activating Examples

Recall that  the *feature map* $k$ of layer $\ll$
- Is matching a pattern (the kernel for $k$)
- At each  index $\idxspatial$ in the spatial dimension
- Thus, $\y^\ip_{(\ll), \idxspatial,k}$ is the intensity of the feature being present at spatial location $\idxspatial$ of input $\x^\ip$

The problem: there are lots of locations in the spatial dimensions.

Rather than examining all locations,
we can *summarize* whether the feature exists *anywhere* in example $i$.
- i.e., we attempt to interpret what feature map $k$ is recognizing in general, rather than at a specific location $\idxspatial$

For example, using "max" for summarization
- We can identify the *value* $\summaxact^\ip_{\llp,k}$ of the strongest activation
- Without identifying its exact location

$$
\summaxact^\ip_{\llp,k} = \max_{\idxspatial} \y^\ip_{(\ll), \idxspatial,k}
$$


By sorting examples on $\summaxact^\ip_{\llp,k}$
- We identify a cluster of examples
- That are most identified with the feature

The examples with the largest $\summaxact^\ip_{\llp,k}$ are the *Maximally Activating Examples* for feature $k$ of layer $\ll$.

If we can identify a common property among the examples with largest $\summaxact^\ip_{\llp,k}$
- We can interpret feature map $k$ of layer $\ll$ as implementing the feature
>"Is the property present in the input ?"

Formally
- Let $\text{MaxAct}_{\llp,k} = [ i_1, \ldots, i_m ]$ be the permutation of the example indices $[ i \, | \, 1 \le i \le m]$
- That sorts $\summaxact^\ip_{\llp,k}$ in descending order
$$
\summaxact^{(i_1)}_{\llp,k} \ge \summaxact^{(i_2)}_{\llp,k} \ge \ldots \ge \summaxact^{(i_m)}_{\llp,k}
$$
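The summary and the sort can be sketched in a few lines of NumPy. This assumes the layer's outputs for all $m$ examples have already been collected into one channel-last array with two spatial dimensions. The function name `maximally_activating` and the random activations are illustrative only.

```python
import numpy as np

def maximally_activating(feature_maps, k, top=3):
    """Rank examples by the strongest activation of feature k.

    feature_maps: (m, H, W, n) array of layer outputs for m examples (channel last)
    Returns the indices of the `top` examples (largest summary first) and all summaries.
    """
    # Summary per example: max over the two spatial dimensions of feature map k
    summary = feature_maps[..., k].max(axis=(1, 2))  # shape (m,)
    # Negate so that argsort yields descending order of the summary
    order = np.argsort(-summary)
    return order[:top], summary

rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 4, 4, 2))  # stand-in for real layer activations
acts[3, 1, 2, 1] = 10.0               # plant a strong response in example 3
acts[0, 0, 0, 1] = 7.0                # and a weaker one in example 0
top_idx, summary = maximally_activating(acts, k=1, top=2)
print(top_idx)  # [3 0]
```

Note that the location of the planted activation plays no role in the ranking; only its value does.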



In this way we can try to interpret the feature map of each layer.

Applying this technique to 
layer $L$ (the "head", a Classifier in the case of MNIST) is particularly useful
- We can identify the examples most/least strongly identified with the concept of each digit
- Because $\y^\ip_{(L),k}$ is the *probability* that example $i$ is digit $k \in  \{ 0, \ldots, 9 \}$

Here are examples of the digit "8" that maximally/minimally activate the classifier's "8" output $\y_{(L),8}$

<table>
    <tr>
        <center><strong>MNIST CNN maximally activating 8's</strong></center>
    </tr>
    <tr>
        <td><img src="images/best_worst_8.png"></td> 
    </tr>
</table>

Interesting !  Do we have a problem with certain 8's ?

Much lower probability when
- 8 is thin versus thick
- tilted left versus right

So although our goal was interpretation, this technique may be useful for Error Analysis as well.

# Occlusion

Maximally activating inputs are very coarse: they identify concepts at the level of the entire input.
    
But, it's reasonable to suspect that some elements of the input are more important to the concept than others.

In particular, a CNN has a "receptive field" which defines the input elements that contribute to the layer output.

Close to the input layer, the receptive field is narrow, so it's clear that the "features" being identified are small in span.
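How quickly the receptive field widens with depth can be worked out with the standard recurrence for a plain stack of convolutions: each layer adds $(f - 1)$ times the product of the strides of the earlier layers. The sketch below assumes no dilation, and the layer specifications are made up for illustration.

```python
def receptive_field_sizes(layers):
    """Receptive field width (along one spatial dimension) after each layer.

    layers: list of (kernel_size, stride) pairs, ordered from the input side.
    """
    size, jump = 1, 1  # one input element sees itself; jump = product of strides so far
    sizes = []
    for f, s in layers:
        size += (f - 1) * jump  # each layer widens the field by (f - 1) input-space steps
        jump *= s
        sizes.append(size)
    return sizes

# Three stacked 3x3 convolutions with stride 1
print(receptive_field_sizes([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]
```

With stride 1 the field grows by only $(f - 1)$ per layer, which is why early-layer features span just a few input elements.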

Occlusion is one way of identifying the elements of the input layer that most affect the latent
representation.  

We will describe this in terms of a 2D input, but we can generalize.

Let $\y_{\llp,j}^\ip$ denote the response of feature $\y_{\llp,j}$ to input $\x^\ip$.

The procedure
- Place an occluding square over some portion of input $\x^\ip$ and measure the change in $\y_{\llp,j}$
- Do this for each location in input $\x^\ip$ and create a "heat map" of changes in response $\y_{\llp,j}$ 
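A minimal sketch of this procedure follows. It assumes the response is available as a callable `response_fn` that maps an input to a single number (for example, a feature map summary or a class probability). That callable, the square `size`, the `stride`, and the `fill` value are all assumptions for illustration and are not taken from the notebook's helper code.

```python
import numpy as np

def occlusion_heatmap(x, response_fn, size=4, stride=4, fill=0.0):
    """Change in response when a (size x size) square of the input is occluded.

    x:           (H, W) or (H, W, C) input
    response_fn: hypothetical callable mapping an input to a scalar response
    Returns a 2D array: response(occluded) - response(original) per square position.
    """
    base = response_fn(x)
    H, W = x.shape[:2]
    rows = range(0, H - size + 1, stride)
    cols = range(0, W - size + 1, stride)
    heat = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            occluded = x.copy()
            occluded[r:r + size, c:c + size] = fill  # place the occluding square
            heat[i, j] = response_fn(occluded) - base
    return heat

# Toy response: total brightness of the top-left 4x4 corner of the input
x = np.ones((8, 8))
response = lambda img: float(img[:4, :4].sum())
heat = occlusion_heatmap(x, response, size=4, stride=4)
print(heat)
```

In this toy case only the square covering the top-left corner changes the response (by $-16$), so the heat map singles out the region the "feature" depends on.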

Let's use occlusion to see how images of the digit "8" are recognized
- Perhaps: by the two "donut" holes and "pinched waist" ?

<table>
    <th><center>Occlusion: Relative decrease in probability of being "8"</center></th>
    <tr>
        <td>
            <img src="images/occlude_8_4_1.png" width=600>
        </td>
    </tr>
</table>

Not what we expected !  

The label above each image is the reduction in probability of correct classification when the
image is occluded, relative to the un-occluded image.

The mere presence of the square changes the classification probability
greatly, even when we are not blocking the "waist" of the 8.

**Curiosity**

Occlusion reduces the probability of correct classification by 81% **for all but one** occluded image
- Third row from bottom, third column
- Is it because
    - the occluding square is attached to the body ?
    - there are no images in the training set with bright pixels in the location of the occluding square ?

Here is the change in response of a single feature map in layer 5 of an image classifier (Zeiler and Fergus).

The chosen feature map is the one with the highest activation level in the layer.

You can see that it is responding to "faces"
- Occluding each of the two faces causes a "drop in temperature" (lower intensity)
- "Hot": red; "Cold": blue

<table>
    <tr>
        <th><center>Input image</center></th>
        <th><center>Activation of one filter at layer 5</center></th>
    </tr>
    <tr>
        <td><img src="images/img_on_page_-007-139.png" width=400></td>
        <td><img src="images/img_on_page_-007-148.png" width=400></td>
    </tr>
</table>

Zeiler and Fergus also measured the change in activation of $\y_{(L),j}^\ip$, the logit corresponding to the correct
class ("Afghan Hound").
- "Hot" colors: increase in intensity of the "Afghan Hound" logit

<table>
      <tr>
        <th><center>Input image</center></th>
        <th><center>Change in logit for "Afghan hound"</center></th>
    </tr>
    <tr>
        <td><img src="images/img_on_page_-007-139.png" width=400></td>
        <td><img src="images/img_on_page_-007-145.png" width=400></td>
    </tr>
</table>

Occluding the dog causes a big drop in probability of correct classification

But occluding each face increases the probability of correct classification !
- Perhaps the presence of a face is suggestive of an alternative class
- Even though "face" is not itself a class

# Conclusion

We began our quest for understanding how Neural Networks work with simple techniques.

The first technique
- Find clusters of examples
- Created by a particular feature map
- Relate a human-observable common property of the cluster
- To the feature that the feature map is attempting to recognize

Whereas clustering identifies groups of examples, the second technique tries to find *sub-regions* of the examples.

Occlusion measures the change in response of a feature map summary (or single neuron)
- When a sub-region of the input is visible
- Versus when it is not visible

The interpretation that arises is that the feature map is attempting to recognize a property in a narrow area.

So, beyond clustering, it is attempting to *localize* the spatial location of the feature.

In [4]:
print("Done")

Done
