In [1]:
%run Latex_macros.ipynb

<IPython.core.display.Latex object>

In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

In [3]:
from IPython.display import Image

import cnn_helper
%aimport cnn_helper
cnnh = cnn_helper.CNN_Helper()

# What excites a neuron ?

[Feature Visualization: How neural networks build up their understanding of images](https://distill.pub/2017/feature-visualization/)

The inversion process that we described by Deconvolution and Saliency Maps
- Is **input dependent** (depends on an example in a dataset)
- The Saliency Map for a single (or summary) location at feature map $k$ of layer $\ll$
- Depends on a particular input $\x^\ip$ being fed to layer $0$

By finding the input examples that "most excite" the feature map, we were indirectly able to guess
at the feature being recognized by the feature map.

We now demonstrate a more direct **input independent** approach
- Determine the input value (not necessarily an example in a dataset)
- That excites (causes large values)
- A single location/neuron (or summary) of feature map $k$ of layer $\ll$

By finding the *single input* that most excites a feature map, we may interpret the feature map
as attempting to recognize similar inputs.

# Gradient Ascent: Inverting a Neural Network

We have already introduced the notion of computing the *sensitivity* of a feature
- At spatial location $\idxspatial$ of feature map $k$ of layer $\ll$
- To a change in the feature at spatial location $\idxspatial'$ feature map $k'$ of layer $0$

$$
\mathcal{s}_{\llp, \idxspatial, k, (0), \idxspatial', k'} =  \frac{\partial \y_{\llp, \idxspatial, k}}{\partial  \y_{(0), \idxspatial', k'}}
$$
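To make the sensitivity concrete, here is a minimal sketch on a toy model. The "network" is a single hypothetical neuron $y = \tanh(w \cdot x + b)$; the weights `W`, bias `B`, and the helper names are illustrative assumptions, not part of `cnn_helper`. It computes the derivative of the neuron with respect to each input feature analytically and checks it against a finite-difference estimate.

```python
import math

# Hypothetical toy "network": a single neuron y = tanh(w . x + b).
# W and B are made-up values standing in for already-trained weights.
W = [0.5, -1.0, 2.0]
B = 0.1

def neuron(x):
    """Forward Pass: activation of the neuron for input x."""
    return math.tanh(sum(w_j * x_j for w_j, x_j in zip(W, x)) + B)

def sensitivity(x):
    """Analytic d y / d x_j for every input feature j (chain rule)."""
    y = neuron(x)
    return [(1.0 - y * y) * w_j for w_j in W]

def finite_difference(x, j, eps=1e-6):
    """Central-difference estimate of d y / d x_j, used as a sanity check."""
    x_plus = list(x)
    x_minus = list(x)
    x_plus[j] += eps
    x_minus[j] -= eps
    return (neuron(x_plus) - neuron(x_minus)) / (2 * eps)

x = [0.2, -0.3, 0.1]          # a particular input, as in a Saliency Map
grads = sensitivity(x)
checks = [finite_difference(x, j) for j in range(len(x))]

for j, (g, c) in enumerate(zip(grads, checks)):
    print(f"feature {j}: analytic={g:.6f}  numeric={c:.6f}")
```

In a real CNN the same quantity would come from automatic differentiation rather than a hand-written derivative; the toy only illustrates what is being computed.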

We used this to define Saliency Maps
- Which indicate how much more "excited" $\y_{\llp, \idxspatial, k}$ becomes
- When we increase the stimulus at layer $0: \y_{(0), \idxspatial', k'}$
- For a particular input $\y_{(0)} = \x^\ip$

We also know that Gradient Descent is used 
- To find the optimal value  $\W^*$ for the weights $\W$ that parameterize the layers of a Neural Network
- By optimizing (finding the minimum of) a Loss Function
- Using derivatives of the Loss with respect to the weights

$$
\W^* = \argmin{\W} L(\hat{\y},\y; \W)
$$
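As a reminder of what this looks like in practice, here is a minimal sketch of Gradient Descent on a single weight. The data, the one-parameter model $\hat{y} = w x$, and the learning rate are all hypothetical choices made for illustration.

```python
# Hypothetical toy data generated by y = 3 * x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def loss(w):
    """Mean squared error of the model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def d_loss_d_w(w):
    """Derivative of the Loss with respect to the weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0                # initial guess for the weight
learning_rate = 0.05   # assumed step size, small enough to converge here

for _ in range(200):
    # Move the weight AGAINST the derivative, because we are minimizing
    w -= learning_rate * d_loss_d_w(w)

print(f"w = {w:.6f}, loss = {loss(w):.3e}")
```

Note that the inputs `xs` stay fixed and only the weight moves.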

What happens if we *combine* these two ideas:
- Find the optimal value for  input $\x^*$  
- By optimizing (maximizing) the value $\y_{\llp, \idxspatial, k}$
- Using derivatives of $\y_{\llp, \idxspatial, k}$ with respect to $\x$ ?

$$
\x^* = \argmax{ \y_{(0)} = \x } \y_{\llp, \idxspatial, k}
$$

(Remember that the value of $\y_{\llp, \idxspatial, k}$ is a function of input $\x$)

That is:
- We can use Gradient Ascent (rather than Descent, as we are maximizing rather than minimizing the objective)
- To find the value $\x^* = \y_{(0)}$
- That, when used as input to the Neural Network
- Maximizes the value of a particular neuron $\y_{\llp, \idxspatial, k}$
- Using derivatives
$$
 \frac{\partial \y_{\llp, \idxspatial, k}}{\partial  \y_{(0), \idxspatial', k'}}
$$

We start off by initializing $\y_{(0)}$ to random noise, then repeat the following steps:
- Compute $\y_{\llp, \idxspatial, k}$ on the Forward Pass
- Compute $\frac{\partial \y_{\llp, \idxspatial, k}}{\partial  \y_{(0), \idxspatial', k'}}$ given the current $\y_{(0)}$, on the Backward Pass
- Move $\y_{(0)}$ a small step in the direction of the derivative

After some number of iterations, we obtain an $\x^* = \y_{(0)}$ that (at least locally) maximizes $\y_{\llp, \idxspatial, k}$.

That is: we find the input $\x^*$ that maximally excites $\y_{\llp, \idxspatial, k}$.
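The loop just described can be sketched on a toy model. This is only an illustration under stated assumptions: the "network" is a single hypothetical neuron $y = \tanh(w \cdot x + b)$ with made-up, frozen weights `W` and bias `B`, and the step size and number of steps are arbitrary choices. The weights never change; only the input is updated.

```python
import math
import random

# Hypothetical frozen weights of an already-trained single-neuron "network"
W = [0.5, -1.0, 2.0]
B = 0.1

def forward(x):
    """Forward Pass: activation of the neuron for input x."""
    return math.tanh(sum(w_j * x_j for w_j, x_j in zip(W, x)) + B)

def backward(x):
    """Backward Pass: derivative of the activation w.r.t. each input feature."""
    y = forward(x)
    return [(1.0 - y * y) * w_j for w_j in W]

def gradient_ascent(x0, learning_rate=0.1, steps=50):
    """Repeatedly move the INPUT in the direction of the derivative."""
    x = list(x0)
    history = [forward(x)]
    for _ in range(steps):
        grad = backward(x)
        x = [x_j + learning_rate * g_j for x_j, g_j in zip(x, grad)]
        history.append(forward(x))
    return x, history

rng = random.Random(0)                        # fixed seed for repeatability
x0 = [rng.gauss(0.0, 0.1) for _ in W]         # initialize the input to random noise
x_star, history = gradient_ascent(x0)

print(f"activation before: {history[0]:.4f}, after: {history[-1]:.4f}")
print("x* =", [round(x_j, 3) for x_j in x_star])
```

For this toy neuron every update is a multiple of `W`, so the input is pushed along the weight vector: the neuron is most excited by inputs that look like its own weights. Because $\tanh$ saturates, the activation approaches but never reaches its upper bound, which is why we stop after a fixed number of steps.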

We can then interpret $\y_{\llp, \idxspatial, k}$ as looking for the feature
>"Is like $\x^*$"


Since we are maximizing a value ($\y_{\llp, \idxspatial, k}$) rather than minimizing one (the Loss)
- This method is called *Gradient Ascent*
- By multiplying the objective $\y_{\llp, \idxspatial, k}$ by $-1$ we can trivially turn this into a minimization problem
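The equivalence between maximizing an objective and minimizing its negation can be checked with a small sketch. The scalar `objective` below is a hypothetical stand-in for the neuron's activation, chosen so that its maximum (at $x = 2$) is known in advance.

```python
def objective(x):
    """Toy objective to MAXIMIZE; its maximum is at x = 2."""
    return 5.0 - (x - 2.0) ** 2

def d_objective(x):
    """Derivative of the objective."""
    return -2.0 * (x - 2.0)

def d_negated_objective(x):
    """Derivative of -objective(x), the equivalent quantity to MINIMIZE."""
    return -d_objective(x)

def gradient_ascent(grad, x, learning_rate=0.1, steps=100):
    for _ in range(steps):
        x = x + learning_rate * grad(x)   # step WITH the derivative
    return x

def gradient_descent(grad, x, learning_rate=0.1, steps=100):
    for _ in range(steps):
        x = x - learning_rate * grad(x)   # step AGAINST the derivative
    return x

x_ascent = gradient_ascent(d_objective, 0.0)
x_descent = gradient_descent(d_negated_objective, 0.0)

print(f"ascent on f: {x_ascent:.6f}, descent on -f: {x_descent:.6f}")
```

This is why an optimizer written for minimization can be reused for Gradient Ascent simply by negating the objective.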

Let's use Gradient Ascent to visualize the inputs that stimulate a particular
layer in an image classifier (implemented as layers of CNNs).

[Visualizing what Convnets learn](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/vision/ipynb/visualizing_what_convnets_learn.ipynb)


# Conclusion

Gradient Ascent is a technique for finding the input $\x^*$
that is the "paradigmatic" value for a feature at layer $\ll$.

It is a simple combination of techniques that we have already learned.

You can do many more interesting things with Gradient Ascent
- What if your initial guess is not random noise ?
- What if you add a constraint on $\x^*$ ?

We will explore these ideas in another lecture.

In [4]:
print("Done")

Done
