<a href="https://colab.research.google.com/github/malcolmlett/ml-learning/blob/main/train_observability_toolkit_theory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training Observability Toolkit

This file explains the theory behind the operation of the training observability toolkit.

## Explaining Near-zero Gradients
The `explain_near_zero_gradients()` function has to reverse engineer some aspects of both the forward and backprop operation of the network. This is made more complex by the fact that technologies like `GradientTape` and `AutoDiff` eliminate the need for layers to expose details about their operation that can be used for mathematical analyses. This section of the docs explains the maths used to perform that reverse-engineering.

To explain the underlying cause of a zero or near-zero gradient, we need to examine the maths behind the gradient computation. The gradient matrix of the weights for a given layer $l$ is defined by:

$$\frac{\partial{J}}{\partial{W_l}} ≈ A_{l-1} \cdot S_l \cdot \frac{\partial{J}}{\partial{Z_{l+1}}}$$


where:
* $J$ = loss
* $A_{l-1}$ = layer $l-1$ output after activation function applied, ie: output from previous layer as input to layer $l$
* $S_l$ = $A_l > 0$ = matrix of 0s and 1s that approximately represents effect of activation function at layer $l$, assuming ReLU activation
* $\frac{\partial{J}}{\partial{Z_{l+1}}}$ = backprop from next layer
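
As a rough illustration of how the three factors combine, the following NumPy sketch computes the weight gradient of a single dense ReLU layer. The shapes, the variable names, the helper `weight_gradient()`, and the element-wise form `A_prev.T @ (S * backprop)` are assumptions made for this example rather than code from the toolkit. It only aims to show that zeros in any one factor force the corresponding gradients to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_in, n_out = 4, 3, 2

# Hypothetical captured values for one dense ReLU layer l.
A_prev = rng.uniform(0.1, 1.0, size=(batch, n_in))  # A_{l-1}: input activations
W = rng.normal(size=(n_in, n_out))                   # W_l
b = np.zeros(n_out)                                  # b_l
A = np.maximum(A_prev @ W + b, 0.0)                  # A_l = ReLU(Z_l)
S = (A > 0).astype(float)                            # S_l: 1 where the unit fired
backprop = rng.normal(size=(batch, n_out))           # stand-in for the signal from later layers


def weight_gradient(A_prev, S, backprop):
    """Element-wise version of the gradient equation for a dense layer."""
    return A_prev.T @ (S * backprop)


grad = weight_gradient(A_prev, S, backprop)

# An input unit that is always zero zeroes one row of the gradient.
A_dead = A_prev.copy()
A_dead[:, 0] = 0.0
grad_dead_input = weight_gradient(A_dead, S, backprop)

# An output unit that never fires zeroes one column of the gradient.
S_dead = S.copy()
S_dead[:, 1] = 0.0
grad_dead_unit = weight_gradient(A_prev, S_dead, backprop)
```

The gradient has the same shape as $W_l$, which is why the later sub-sections report one estimate per weight.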

What we really want to compute can be described via a Bayesian formalism:

$$P(A_{l-1} \text{ causal} | \frac{\partial{J}}{\partial{W_l}} ≈ 0)$$
$$P(S_{l} \text{ causal} | \frac{\partial{J}}{\partial{W_l}} ≈ 0)$$
$$P(\frac{\partial{J}}{\partial{Z_{l+1}}} \text{ causal} | \frac{\partial{J}}{\partial{W_l}} ≈ 0)$$

To do that fully and accurately gets quite involved. So, rather, I merely report on the reverse and allow the reader to draw their own inferences. This is made up of estimates of the likelihood of obtaining zero or near-zero gradients due to each of the three components alone:

$$P(\frac{\partial{J}}{\partial{W_l}} ≈ 0 | A_{l-1})$$

$$P(\frac{\partial{J}}{\partial{W_l}} ≈ 0 | S_l)$$

$$P(\frac{\partial{J}}{\partial{W_l}} ≈ 0 | \frac{\partial{J}}{\partial{Z_{l+1}}})$$

And it includes the two groupings of adjacent pairs of components:

$$P(\frac{\partial{J}}{\partial{W_l}} ≈ 0 | A_{l-1} \cdot S_l)$$

$$P(\frac{\partial{J}}{\partial{W_l}} ≈ 0 | S_l \cdot \frac{\partial{J}}{\partial{Z_{l+1}}})$$

The following sub-sections go into detail for each calculation.

### Differences in layers and notation
How the gradient equation is operationalised depends on the kind of layer. The gradient equation is complete for a simple dense layer, but becomes more complex when considering a convolutional layer.

Firstly, the computation of an output activation at a given spatial position depends only on a subset of input activations. To cope with that I use the following notation to indicate only the relevant subset of the matrix for the given gradient is to be considered:

$$\lfloor A_{l-1} \rfloor_{\text{subset}}$$

Secondly, any given weight contributes to the value of output activations at multiple spatial positions. This means that I cannot simply compute a boolean yes/no that a given gradient will be zero because of a zero value in the input activations. Rather, typically I'll only know that a certain fraction of the applicable values are tending in that direction, but I won't have the information about subsequent processing that might happen to consider only the non-zero values. Thus I'll be computing indicative probabilities based on the fraction of problematic states versus the whole. This will be represented as follows:

$$\text{fraction}(A_{l-1} \approx 0) = \frac{|A_{l-1} \approx 0|_\text{cardinality}}{|A_{l-1}|_\text{cardinality}}$$

This fraction-based reporting will be relevant for all layer types anyway, because multiple samples are grouped together into a batch and a single gradient is computed across the batch.
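
A minimal sketch of the $\text{fraction}(\cdot \approx 0)$ operator, assuming NumPy arrays of captured values. The function name `fraction_near_zero()` and the default `threshold` are hypothetical choices for this example; the text above does not say what cut-off counts as "near zero".

```python
import numpy as np


def fraction_near_zero(values, threshold=1e-6, axis=None):
    """Fraction of entries whose magnitude is at or below `threshold`.

    `threshold` is an assumed cut-off for "near zero"; pick one that suits
    the scale of the values being examined.
    """
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(values) <= threshold, axis=axis)


# Two samples (rows) of three input activations (columns).
A_prev = np.array([[0.0, 0.5, 1e-9],
                   [0.0, 2.0, 0.3]])

overall = fraction_near_zero(A_prev)           # 3 of 6 entries -> 0.5
per_unit = fraction_near_zero(A_prev, axis=0)  # over the batch -> [1.0, 0.0, 0.5]
```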

### Input activations as causes
Any given hidden layer has one or more previous layers as input, with their output activations $A_{l-1}$ holding the values that become inputs.

There are multiple ways in which those inputs can contribute to near-zero gradients in the target layer, but thanks to the magic of the chain rule and the gradient equation, we only need to consider the raw values of $A_{l-1}$.

The following general guide applies for estimating the gradients in the target layer if all input layer activations fall into a particular range:

|$A_{l-1}$|Gradient|Description|
|---|---|---|
|$A_{l-1} \gg 0$|unknown|$A$ does not directly contribute to near-zero gradients.|
|$A_{l-1} \approx 0$|$\approx 0$|Gradient could be kept large by a large backprop gradient, but it probably won't.|
|$A_{l-1} = 0$|$= 0$|All gradients will definitely be zero, regardless of other components|
|$A_{l-1} < 0$|n/a|Negative inputs are not possible if the input layers use ReLU activation function. Otherwise the gradient depends on the combination of $A_{l-1}$ and the sign of $W_{l}$.|

Thus we have the following equation for computing the independent influence of the input activations on near-zero gradients:

$$P(\frac{\partial{J}}{\partial{W_{l,ij}}} \approx 0|A_{l-1}) = \text{fraction}(\lfloor A_{l-1} \rfloor_{\text{subset,ij}} \approx 0)$$

This varies a little by layer type:
* Dense - all $A_{l-1}$ contribute, fraction must be computed over batch dimension only
* Conv - $A_{l-1,ij}$ values within the kernel size
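
For the dense case, the per-weight estimate might be sketched as below. This assumes a `(batch, n_in)` activation array and a `(n_in, n_out)` weight matrix; the function name and `threshold` are hypothetical. Because the gradient of $W_{l,ij}$ only sees input unit $i$, the fraction over the batch for unit $i$ is repeated across every output unit $j$.

```python
import numpy as np


def dense_input_zero_fraction(A_prev, n_out, threshold=1e-6):
    """Estimate P(dJ/dW_ij ~ 0 | A_{l-1}) for a dense layer.

    Returns an array shaped like the weight matrix, (n_in, n_out).
    """
    A_prev = np.asarray(A_prev, dtype=float)
    # Fraction of near-zero values over the batch, one per input unit i.
    per_input = np.mean(np.abs(A_prev) <= threshold, axis=0)
    # Every weight leaving input unit i shares that unit's fraction.
    return np.broadcast_to(per_input[:, None], (A_prev.shape[1], n_out))


# Four samples, three input units: unit 0 is dead, unit 1 is zero half the time.
A_prev = np.array([[0.0, 0.0, 0.7],
                   [0.0, 0.4, 1.2],
                   [0.0, 0.0, 0.3],
                   [0.0, 0.9, 0.5]])

per_weight = dense_input_zero_fraction(A_prev, n_out=2)
```

A convolutional layer would instead need the fraction taken over the input patch that each kernel position sees, which is not attempted here.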

### Layer activation as cause
The gradient equation does not incorporate the weights $W_l$ or the bias $b_l$ of the target layer, nor does it incorporate its pre-activation output $Z_l$ or its raw activation values $A_l$. Rather, it considers only

$$\frac{\partial{A_l}}{\partial{Z_l}} \approx S_l$$

where, for ReLU activation layers:
$$S_l = A_l > 0$$

For ReLU activation layers, $S_l$ is a matrix of 0s and 1s as follows. This can be seen in how $\frac{\partial{A_l}}{\partial{Z_l}}$ varies by value of $Z_l$:

|$Z_l$|$\frac{\partial{A_l}}{\partial{Z_l}}$|Description|
|---|---|---|
|$Z_{l,ij} > 0$|$1.0$ for all values of $Z_{l,ij}$|ReLU activation function simply passes value through unchanged, which is equivalent of $1 \cdot z$, and thus a constant $1.0$ gradient.|
|$Z_{l,ij} < 0$|$0.0$ for all values of $Z_{l,ij}$|ReLU activation function clips input to zero, leading to constant $0.0$ gradient.|
|$Z_{l,ij} = 0$|$0.0$|Edge case that depends on internal implementation, but we treat as for $Z_{l,ij} < 0$ because we cannot distinguish otherwise.|

Even for ReLU activation functions, the form of $S_l$ as used in the gradient equation is just an approximation. This is because the element-wise nature of the activation function doesn't translate well into a simple and easy to read string of matrix multiplications. We get to ignore that complexity at this step and to simply use $S_l$ in its element-wise form, having the same shape as $A_l$.

_TODO: confirm shapes of matrices. I think $S_l$ in the gradient equation is a weird square matrix. But here I'm using it the same shape as $A_l$._

For a more detailed background to the derivation of $S_l$ see:
* https://medium.com/ai-advances/grokking-gradients-in-deep-neural-networks-6849fa42f1fa

Thus we have the following equation for computing the independent influence of the target layer activation on near-zero gradients:

$$P(\frac{\partial{J}}{\partial{W_{l,ij}}} \approx 0|S_l) = \text{fraction}(\lfloor S_l \rfloor_{\text{subset,ij}} = 0)$$
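
A sketch of this estimate for a dense ReLU layer, assuming a `(batch, n_out)` array of captured activations. The function name is hypothetical. $S_l$ is derived directly from $A_l > 0$, and the fraction of zeros over the batch for output unit $j$ is shared by every weight feeding that unit.

```python
import numpy as np


def dense_activation_zero_fraction(A, n_in):
    """Estimate P(dJ/dW_ij ~ 0 | S_l) for a dense ReLU layer.

    Returns an array shaped like the weight matrix, (n_in, n_out).
    """
    A = np.asarray(A, dtype=float)
    S = A > 0                          # S_l: True where the unit fired
    per_output = np.mean(~S, axis=0)   # fraction of the batch with S_l = 0
    # Every weight feeding output unit j shares that unit's fraction.
    return np.broadcast_to(per_output[None, :], (n_in, A.shape[1]))


# Pre-activations for four samples and two output units.
Z = np.array([[1.0, -2.0],
              [3.0, -1.0],
              [-0.5, -4.0],
              [2.0, 0.0]])
A = np.maximum(Z, 0.0)  # ReLU: unit 1 never fires, unit 0 fails to fire once

per_weight = dense_activation_zero_fraction(A, n_in=3)
```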

### Layer activation components as cause
However that's not all for the effect of $S_l$. Its value is derived based on multiple sub-components, and we'll be interested to know which of those sub-components have contributed to the state of $S_l$:

$$S_l = A_l > 0 = A_{l-1} \cdot W_l + b_l > 0$$

In particular, for a given $S_{l,ij} = 0$, we have at least three possible and useful explanations:
* $\lfloor A_{l-1} \rfloor_\text{subset} = 0$, and $W_l$ had no effect
* $\lfloor W_l \rfloor_\text{subset} \le 0$, and $A_{l-1}$ had no effect
  * (remember that $A_{l-1}$ is never negative for ReLU activation, so a negative weight leads to a zero output)
* $\lfloor A_{l-1} \cdot W_l \rfloor_\text{subset} \le -b_l$, so that $A_{l-1} \cdot W_l + b_l \le 0$

In other words, $A_{l-1}$ and $W_l$ have a chance of leading to a zero gradient on their own, but there is also a separate scenario where only their combination leads to a zero gradient. Knowing which scenario is applicable for a given problem is extremely useful.

Now, we have easy direct access to captured values of $A_{l-1}$ and $W_l$. But the exact form of the calculation of $A_{l-1} \cdot W_l$ varies by layer type, so we cannot easily compute that. Thankfully we don't have to, as $A_l \le 0$ is a good enough proxy.

Thus we have the following independent sub-component influences on near-zero gradients in the target layer:

$$P(\frac{\partial{J}}{\partial{W_{l,ij}}} \approx 0|A_{l-1}) = \text{fraction}(\lfloor A_{l-1} \rfloor_{\text{subset,ij}} \approx 0)$$

$$P(\frac{\partial{J}}{\partial{W_{l,ij}}} \approx 0|W_l) = \text{fraction}(\lfloor W_l \rfloor_{\text{subset,ij}} \approx 0 \text{ or } \lfloor W_l \rfloor_{\text{subset,ij}} < 0)$$

But notice that the first is a duplicate of one we've already identified.
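
The weight-based estimate could be sketched as follows. The choice of subset is an assumption for this example: for a dense layer it takes all weights feeding output unit $j$ (column $j$ of a `(n_in, n_out)` matrix), since those are the weights that determine whether unit $j$ fires. The function name and `threshold` are likewise hypothetical.

```python
import numpy as np


def dense_weight_nonpositive_fraction(W, threshold=1e-6):
    """Fraction of weights feeding each output unit that are near zero or negative.

    "W ~ 0 or W < 0" is collapsed into the single test W <= threshold.
    Returns one value per output unit j.
    """
    W = np.asarray(W, dtype=float)
    return np.mean(W <= threshold, axis=0)


# Three input units, two output units.
W = np.array([[0.5, -0.2],
              [0.0, -1.0],
              [1.5, 1e-9]])

per_output = dense_weight_nonpositive_fraction(W)
# Unit 0: one of three weights is ~0. Unit 1: all weights are negative or ~0.
```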

### Backprop values as cause
The final component of the weight gradient equation is the backprop from the later layers. This leads to some interesting questions because we don't collect that information during training, but we do collect some other information that can be used to guess at the backprop values.

We want the following:

$$P(\frac{\partial{J}}{\partial{W_l}} \approx 0| \frac{\partial{J}}{\partial{Z_{l+1}}})$$

But instead what we have available for use is:

$$\frac{\partial{J}}{\partial{W_{l+1}}}$$

Thankfully, that is only one step removed from what we need:

$$\frac{\partial{J}}{\partial{W_{l+1}}} = \frac{\partial{J}}{\partial{Z_{l+1}}} \cdot \frac{\partial{Z_{l+1}}}{\partial{W_{l+1}}} = \frac{\partial{J}}{\partial{Z_{l+1}}} \cdot A_l$$

$$\frac{\partial{J}}{\partial{Z_{l+1}}} = \frac{\partial{J}}{\partial{W_{l+1}}} \cdot A_l^{-1}$$

Now, we almost certainly cannot invert $A_l$ as we're explicitly using all this logic in the case where we're searching for zeros, which makes $A_l$ singular. As a possibly naive first approximation, we'll _hope_ that the following holds (where $A_l^+$ is the Moore-Penrose pseudo-inverse):

$$\frac{\partial{J}}{\partial{Z_{l+1}}} \approx \frac{\partial{J}}{\partial{W_{l+1}}} \cdot A_l^+$$

$$P(\frac{\partial{J}}{\partial{W_l}} \approx 0|\frac{\partial{J}}{\partial{Z_{l+1}}}) \approx P(\frac{\partial{J}}{\partial{W_l}} \approx 0|\frac{\partial{J}}{\partial{W_{l+1}}} \cdot A_l^+)$$

Now, gradients may legitimately be positive or negative, so we are only interested in values near zero. Secondly, we now have another equation with sub-components that we may be interested in. However, we'll ignore $A_l^+$ as an independent value because it's not a directly captured value, and its complex derivation makes any interpretation from it difficult.

Thus we have the following final independent contributions towards near-zero gradients:

$$P(\frac{\partial{J}}{\partial{W_{l,ij}}} \approx 0|\frac{\partial{J}}{\partial{Z_{l+1}}}) = \text{fraction}( \lfloor \frac{\partial{J}}{\partial{W_{l+1}}} \cdot A_l^+ \rfloor_\text{subset,ij} \approx 0 )$$

$$P(\frac{\partial{J}}{\partial{W_{l,ij}}} \approx 0|\frac{\partial{J}}{\partial{Z_{l+1}}}) = \text{fraction}( \lfloor \frac{\partial{J}}{\partial{W_{l+1}}} \rfloor_\text{subset,ij} \approx 0)$$
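
The pseudo-inverse step can be tried out with `np.linalg.pinv`. This sketch assumes a dense next layer, for which $\frac{\partial{J}}{\partial{W_{l+1}}} = A_l^T \cdot \frac{\partial{J}}{\partial{Z_{l+1}}}$ with $A_l$ shaped `(batch, n_units)`. The hand-picked values and the `1e-6` threshold are illustrative only. The recovery is exact here only because this $A_l$ has more units than samples and linearly independent rows; in general the pseudo-inverse gives no more than a least-squares estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# A_l: four samples, eight ReLU units, with plenty of zeros but full row rank.
A = np.array([[1.0, 0.0, 2.0, 0.0, 0.0, 3.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 3.0, 0.0, 0.0, 1.0, 0.0],
              [2.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]])

# The "true" backprop signal, which is not captured during training.
dJ_dZ_next = rng.normal(size=(4, 3))
dJ_dZ_next[:, 2] = 0.0            # one unit of layer l+1 passes no gradient back

# What is captured: the weight gradients of layer l+1, shape (n_units, n_next).
dJ_dW_next = A.T @ dJ_dZ_next

# Estimate the backprop signal via the Moore-Penrose pseudo-inverse.
dJ_dZ_est = np.linalg.pinv(A.T) @ dJ_dW_next

recovered = np.allclose(dJ_dZ_est, dJ_dZ_next)
near_zero_fraction = np.mean(np.abs(dJ_dZ_est) <= 1e-6, axis=0)
```

When the batch is larger than the number of units, or $A_l$ is rank-deficient, `dJ_dZ_est` will generally differ from the true signal, which is one reason to also report the simpler fraction taken directly over $\frac{\partial{J}}{\partial{W_{l+1}}}$.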

### Combined input and target activation as cause
_todo_

### Combined target activation and backprop as cause
_todo_


### Summary
With all that in place, we have identified ways in which the following raw values can contribute to near-zero gradients:
* $A_{l-1}$
* $W_l$
* $A_l$
* $W_{l+1}$
