<a href="https://colab.research.google.com/github/malcolmlett/ml-learning/blob/main/train_observability_toolkit_theory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training Observability Toolkit

This file explains the theory behind the operation of the training observability toolkit.

## Explaining Near-zero Gradients
The `explain_near_zero_gradients()` function has to reverse-engineer some aspects of both the forward and backprop operation of the network. This is made more complex by the fact that technologies like `GradientTape` and `AutoDiff` eliminate the need for layers to expose details about their operation that could be used for mathematical analysis. This section of the docs explains the maths used to perform that reverse-engineering.

To explain the underlying cause of a zero or near-zero gradient, we need to examine the maths behind the gradient computation. The gradient matrix of the weights for a given layer $l$ is approximately given by:

$$\frac{\partial{J}}{\partial{W_l}} \approx A_{l-1} \cdot S_l \cdot \frac{\partial{J}}{\partial{Z_{l+1}}}$$


where:
* $J$ = loss
* $A_{l-1}$ = output of layer $l-1$ after its activation function has been applied, i.e. the input to layer $l$
* $S_l$ = $(A_l > 0)$ = matrix of 0s and 1s that approximately represents the effect of the activation function at layer $l$, assuming ReLU activation
* $\frac{\partial{J}}{\partial{Z_{l+1}}}$ = backprop from next layer
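
To make the roles of the three factors concrete, here is a minimal NumPy sketch of the weight gradient for a single bias-free dense ReLU layer. It is an illustration only, not the toolkit's code: the function name, shapes and variable names are all assumptions, and it uses NumPy rather than TensorFlow. It also writes out the transposes, the elementwise product and the next layer's weights $W_{l+1}$ explicitly, which the shorthand formula above glosses over.

```python
import numpy as np


def dense_relu_weight_gradient(A_prev, W, W_next, dJ_dZ_next):
    """Weight gradient of a bias-free dense ReLU layer, built from its factors.

    Hypothetical shapes, for illustration only:
      A_prev      (batch, n_in)    A_{l-1}, input to layer l
      W           (n_in, n_out)    W_l
      W_next      (n_out, n_next)  W_{l+1}
      dJ_dZ_next  (batch, n_next)  backprop from the next layer
    """
    Z = A_prev @ W                   # pre-activation of layer l
    A = np.maximum(Z, 0.0)           # ReLU output A_l
    S = (A > 0).astype(A.dtype)      # S_l: 1 where the unit is active, else 0
    dJ_dA = dJ_dZ_next @ W_next.T    # backprop signal arriving at A_l
    dJ_dW = A_prev.T @ (S * dJ_dA)   # combine the three factors
    return dJ_dW, S


rng = np.random.default_rng(0)
batch, n_in, n_out, n_next = 8, 4, 3, 2
A_prev = np.maximum(rng.normal(size=(batch, n_in)), 0.0)  # post-ReLU, so >= 0
W = rng.normal(size=(n_in, n_out))
W_next = rng.normal(size=(n_out, n_next))
dJ_dZ_next = rng.normal(size=(batch, n_next))

# Case 1: input feature 0 is always zero -> row 0 of the gradient is zero.
A_zero_input = A_prev.copy()
A_zero_input[:, 0] = 0.0
grad_zero_input, _ = dense_relu_weight_gradient(A_zero_input, W, W_next, dJ_dZ_next)

# Case 2: unit 1 is dead (never active) -> column 1 of the gradient is zero.
W_dead = W.copy()
W_dead[:, 1] = -1.0  # non-negative inputs times negative weights never exceed 0
grad_dead_unit, S_dead = dense_relu_weight_gradient(A_prev, W_dead, W_next, dJ_dZ_next)

# Case 3: nothing is backpropagated -> the whole gradient is zero.
grad_no_backprop, _ = dense_relu_weight_gradient(
    A_prev, W, W_next, np.zeros_like(dJ_dZ_next)
)
```

Each case zeroes out exactly one factor, and each produces zero gradients on its own. That is why a near-zero gradient by itself does not say which factor is responsible.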

What we really want to compute can be described via a Bayesian formalism:

$$P(A_{l-1} \text{ causal} \mid \frac{\partial{J}}{\partial{W_l}} \approx 0)$$
$$P(S_{l} \text{ causal} \mid \frac{\partial{J}}{\partial{W_l}} \approx 0)$$
$$P(\frac{\partial{J}}{\partial{Z_{l+1}}} \text{ causal} \mid \frac{\partial{J}}{\partial{W_l}} \approx 0)$$

Doing that fully and accurately gets quite involved. So instead I merely report the reverse conditionals, and allow the reader to draw their own inferences:

$$P(\frac{\partial{J}}{\partial{W_l}} \approx 0 \mid A_{l-1})$$
$$P(\frac{\partial{J}}{\partial{W_l}} \approx 0 \mid S_l)$$
$$P(\frac{\partial{J}}{\partial{W_l}} \approx 0 \mid \frac{\partial{J}}{\partial{Z_{l+1}}})$$
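
One possible reading of these reverse conditionals is as empirical frequencies over the entries of the weight gradient: among the weights whose corresponding factor is entirely near-zero across the batch, what fraction have a near-zero gradient? The sketch below follows that reading for a dense ReLU layer. It is an assumption made for illustration and not necessarily how `explain_near_zero_gradients()` computes its report. The helper names, the `tol` threshold, the "entirely near-zero across the batch" condition, and the use of a backprop signal `dJ_dA` already mapped onto layer $l$'s outputs are all hypothetical.

```python
import numpy as np


def conditional_zero_rate(grad_is_zero, condition):
    """Empirical P(gradient ~ 0 | condition) over weight entries.

    Returns NaN when the condition never holds, because the conditional
    frequency is then undefined.
    """
    n_condition = np.count_nonzero(condition)
    if n_condition == 0:
        return float("nan")
    return np.count_nonzero(grad_is_zero & condition) / n_condition


def near_zero_gradient_report(A_prev, S, dJ_dA, tol=1e-8):
    """Hypothetical report of P(dJ/dW ~ 0 | factor is near-zero), per factor.

    A_prev (batch, n_in), S (batch, n_out) and dJ_dA (batch, n_out) are
    assumed shapes for a dense layer.
    """
    dJ_dW = A_prev.T @ (S * dJ_dA)
    grad_is_zero = np.abs(dJ_dW) <= tol
    shape = dJ_dW.shape

    # A factor counts as "near-zero" for a weight if it is near-zero for
    # every sample in the batch (an assumption of this sketch).
    input_dead = np.all(np.abs(A_prev) <= tol, axis=0)     # per input feature
    unit_dead = np.all(S == 0, axis=0)                     # per output unit
    backprop_dead = np.all(np.abs(dJ_dA) <= tol, axis=0)   # per output unit

    conditions = {
        "A_prev": np.broadcast_to(input_dead[:, None], shape),
        "S": np.broadcast_to(unit_dead[None, :], shape),
        "dJ_dA": np.broadcast_to(backprop_dead[None, :], shape),
    }
    return {
        name: conditional_zero_rate(grad_is_zero, condition)
        for name, condition in conditions.items()
    }


# Tiny demo: input feature 0 is always zero and output unit 1 is never active.
A_prev = np.array([[0.0, 1.0, 2.0]] * 4)
S = np.array([[1.0, 0.0]] * 4)
dJ_dA = np.ones((4, 2))
report = near_zero_gradient_report(A_prev, S, dJ_dA)
```

Frequencies like these only say how often a near-zero factor coincides with a near-zero gradient. Turning them into the "causal" probabilities described earlier would need further assumptions, which is the step left to the reader.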