## Notes on computations in ForwardHG class

Let $s=(s^i)_{i=1}^C$ be the state vector, partitioned into its components (e.g. the various weights of a neural network, accumulated gradients, ...). Assume that each component $s^i\in\mathbb{R}^{d_i}$ is a (column) vector, and so is $s\in\mathbb{R}^d$, with $d=\sum_i d_i$. We want to compute the update of the total derivative of $s_t$ w.r.t. a single scalar hyperparameter $\lambda$, where $s_t$ is the $t$-th iterate of the mapping $\Phi$:
$$
  s_0 = \Phi_0(\lambda) \qquad
    s_t =  \Phi_t(s_{t-1},\lambda),\quad t \in \{1,\dots,T\}
$$
Let $\Phi^i$ denote the component of the iterative mapping that updates the $i$-th state component $s^i$.

Calling
$$
A_t = \partial_s \Phi_t(s_{t-1},\lambda) \in \mathbb{R}^{d\times d}  \qquad B_t = \partial_{\lambda} \Phi_t(s_{t-1},\lambda)\in \mathbb{R}^d, 
$$
the update of the variable $Z_t=\frac{\mathrm{d} s_t}{\mathrm{d} \lambda}\in\mathbb{R}^d$, starting from $Z_0=\partial_{\lambda}\Phi_0(\lambda)$, is
\begin{equation}
Z_t = A_t Z_{t-1} + B_t.
\end{equation}
Let $Z=(Z^i)_{i=1}^C$ also be partitioned into components, where each $Z^i$ has the same dimensionality as $s^i$ (since the hyperparameter is a scalar). Componentwise, the above update becomes
$$
Z^i_t = \partial_s \Phi^i_{t}(s_{t-1}, \lambda)^T Z_{t-1} + \partial_{\lambda} \Phi_t^i(s_{t-1}, \lambda) = 
\sum_{j=1}^C \partial_{s^j} \Phi^i_t(s_{t-1}, \lambda)^T Z^j_{t-1} + \partial_{\lambda} \Phi_t^i(s_{t-1}, \lambda).
$$
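
As an illustration, the recursion can be sketched in JAX on a hypothetical toy problem: gradient descent with momentum on a quadratic, with the learning rate playing the role of $\lambda$. The dynamics `phi`, the matrix `Q` and all helper names are assumptions made for this example and are not taken from the ForwardHG class. The Jacobians $A_t$ and $B_t$ are formed explicitly here only because the state is tiny.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # double precision for the comparison

# Hypothetical quadratic objective 0.5 * w^T Q w (assumption for this sketch).
Q = jnp.array([[2.0, 0.5], [0.5, 1.0]])

def phi(s, lam):
    """One step of the toy dynamics; s = (w, m) stored as one flat vector."""
    w, m = s[:2], s[2:]
    m_new = 0.9 * m + Q @ w      # accumulated gradient
    w_new = w - lam * m_new      # weight update with learning rate lam
    return jnp.concatenate([w_new, m_new])

def forward_hypergradient(s0, lam, T):
    """Iterate Z_t = A_t Z_{t-1} + B_t alongside the state."""
    s = s0
    Z = jnp.zeros_like(s0)       # Z_0 = 0: s_0 does not depend on lam here
    for _ in range(T):
        A = jax.jacobian(phi, argnums=0)(s, lam)   # shape (d, d), at s_{t-1}
        B = jax.jacobian(phi, argnums=1)(s, lam)   # shape (d,),  at s_{t-1}
        Z = A @ Z + B
        s = phi(s, lam)
    return s, Z

def unrolled(lam, s0, T):
    """Reference: the whole iteration as a function of lam."""
    s = s0
    for _ in range(T):
        s = phi(s, lam)
    return s

s0 = jnp.array([1.0, -1.0, 0.0, 0.0])
s_T, Z_T = forward_hypergradient(s0, 0.1, 5)
Z_ref = jax.jacfwd(unrolled)(0.1, s0, 5)   # d s_T / d lam by direct differentiation
```

In this sketch `Z_T` and `Z_ref` should agree up to floating-point error, since both are the total derivative of $s_T$ w.r.t. $\lambda$.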

### How to compute the $Z$-update using only scalar gradients

Introduce the auxiliary variable $v=(v_i)_{i=1}^C$, partitioned in the same way as the state $s$, so that $v_i\in\mathbb{R}^{d_i}$.
Then, dropping the iteration index for simplicity, $B$ can be computed as
$$
B = \partial_v \left[\sum_{i=1}^C \partial_{\lambda} ( \Phi^i(s, \lambda)^T v_i)  \right]
$$
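since $\Phi^i(s, \lambda)^T v_i$ is a scalar quantity, and so is $\partial_{\lambda} ( \Phi^i(s, \lambda)^T v_i) = \partial_{\lambda} \Phi^i(s, \lambda)^T v_i$. The gradient of the sum w.r.t. $v_i$ then returns $\partial_{\lambda} \Phi^i(s, \lambda)$, which is the $i$-th component of $B$.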
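
The formula for $B$ can be sketched in JAX as two nested calls to `jax.grad`, each applied to a scalar function. The two-component dynamics `phi` and the helper names `inner` and `B_via_scalar_grads` are hypothetical and only serve this example.

```python
import jax
import jax.numpy as jnp

def phi(s, lam):
    """Hypothetical dynamics with C = 2 components: weights w and momentum m."""
    w, m = s
    m_new = 0.9 * m + jnp.tanh(w)
    w_new = w - lam * m_new
    return (w_new, m_new)

def inner(s, lam, v):
    """Scalar sum_i Phi^i(s, lam)^T v_i."""
    return sum(jnp.vdot(p, vi) for p, vi in zip(phi(s, lam), v))

def B_via_scalar_grads(s, lam, v):
    # The derivative of the scalar above w.r.t. lam is again a scalar, linear in v.
    dlam = lambda v_: jax.grad(inner, argnums=1)(s, lam, v_)
    # Its gradient w.r.t. v has components d Phi^i / d lam, i.e. B.
    return jax.grad(dlam)(v)

s = (jnp.array([1.0, -1.0]), jnp.array([0.5, 0.5]))
v = (jnp.zeros(2), jnp.zeros(2))
B = B_via_scalar_grads(s, 0.1, v)

# Reference: direct derivative of phi w.r.t. lam, one array per component.
B_ref = jax.jacfwd(phi, argnums=1)(s, 0.1)
```

Both `B` and `B_ref` are tuples with one array per state component; in this toy example the momentum component of $B$ is zero because `m_new` does not depend on `lam`.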

To compute $A_t Z_{t-1}$, instead, define (again dropping the iteration index)
$$
\Psi_i = \partial_{s^i} \left[\sum_{j=1}^C  \Phi^j(s, \lambda)^T v_j  \right].
$$
Then
$$
\partial_s \Phi^i(s, \lambda)^T Z = \partial_{v_i} \left[\sum_{j=1}^C \Psi_j^TZ^j \right],
$$
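which is the right quantity, as can be seen by exchanging the order of summations and derivatives: $\sum_{j} \Psi_j^T Z^j = \sum_{k} v_k^T \sum_{j} \partial_{s^j} \Phi^k(s, \lambda)^T Z^j$, whose gradient w.r.t. $v_i$ is $\sum_{j} \partial_{s^j} \Phi^i(s, \lambda)^T Z^j$. The computation makes use only of gradients of scalar functions.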

On a practical note, the variables $v_i$ may take any value: both scalar expressions above are linear in $v$, so their gradients w.r.t. $v$ do not depend on it and $v$ does not appear in the final computation.
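
The computation of $A Z$ through $\Psi$, together with the independence from the value of $v$, can be checked with a minimal JAX sketch. As before, `phi`, `inner` and `AZ_via_scalar_grads` are hypothetical names for this example. JAX also provides forward-mode `jax.jvp`, which is used here only as a reference.

```python
import jax
import jax.numpy as jnp

def phi(s, lam):
    """Hypothetical dynamics with C = 2 components: weights w and momentum m."""
    w, m = s
    m_new = 0.9 * m + jnp.tanh(w)
    w_new = w - lam * m_new
    return (w_new, m_new)

def inner(s, lam, v):
    """Scalar sum_k Phi^k(s, lam)^T v_k."""
    return sum(jnp.vdot(p, vk) for p, vk in zip(phi(s, lam), v))

def AZ_via_scalar_grads(s, lam, v, Z):
    def contracted(v_):
        # Psi_j: gradient of the scalar `inner` w.r.t. the component s^j.
        psi = jax.grad(inner, argnums=0)(s, lam, v_)
        # Scalar sum_j Psi_j^T Z^j, which is linear in v.
        return sum(jnp.vdot(p, z) for p, z in zip(psi, Z))
    # The gradient w.r.t. v_i gives the i-th component of A Z.
    return jax.grad(contracted)(v)

s = (jnp.array([1.0, -1.0]), jnp.array([0.5, 0.5]))
Z = (jnp.array([0.3, -0.2]), jnp.array([1.0, 2.0]))

v_zero = (jnp.zeros(2), jnp.zeros(2))
v_other = (jnp.array([3.0, -7.0]), jnp.array([0.1, 42.0]))
AZ_zero = AZ_via_scalar_grads(s, 0.1, v_zero, Z)
AZ_other = AZ_via_scalar_grads(s, 0.1, v_other, Z)

# Reference: Jacobian-vector product of phi w.r.t. s in the direction Z.
_, AZ_ref = jax.jvp(lambda s_: phi(s_, 0.1), (s,), (Z,))
```

The two choices of `v` should give the same result, matching `AZ_ref` up to floating-point error, which is why `v` can simply be initialized to zeros.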