# Fréchet Derivative and Matrix Calculus
This article introduces the Fréchet derivative, a different perspective on derivatives and gradients that enables efficient computation of gradients w.r.t. matrices. This is especially useful for deriving backpropagation rules for neural networks in matrix form. Later parts of this article take batch normalization as an example and compute its gradients in matrix form, followed by a corresponding code implementation.

## From Limits to Fréchet Derivative
For a real-valued function $f: \mathbb{R} \to \mathbb{R}$, the derivative at a point $x$, if it exists, is defined as

$$
\begin{align*}
    f'(x) = \lim_{\epsilon \to 0}\frac{f(x+\epsilon) - f(x)}{\epsilon}.
\end{align*}
$$

While this definition generalizes to vector- and matrix-valued functions and is often associated with the slope of $f$ at a point of interest, the derivative or gradient can also be viewed from another perspective: it approximates the change in the function value with a term that is linear in the change of the variable:

$$
\begin{align*}
    f(x+\epsilon) = f(x) + f'(x)\epsilon + o(\epsilon)
\end{align*}
$$

where $o(\epsilon)$ is a term that shrinks faster than $\epsilon$ as $\epsilon \to 0$. Technically, it refers to any term $g(\epsilon)$ that satisfies

$$
\begin{align*}
    \lim_{\epsilon \to 0}\frac{g(\epsilon)}{\epsilon} = 0.
\end{align*}
$$
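The little-o condition can be checked numerically. The sketch below is only an illustration: the choice of $f(x) = \sin x$, the point $x = 1$, and the helper name `residual_ratio` are assumptions made for this example, not part of the derivation. It computes the residual $g(\epsilon) = f(x+\epsilon) - f(x) - f'(x)\epsilon$ and prints $|g(\epsilon)/\epsilon|$ for shrinking $\epsilon$.

```python
import math


def f(x):
    # Example function, chosen only for illustration.
    return math.sin(x)


def f_prime(x):
    # Known analytic derivative of the example function.
    return math.cos(x)


def residual_ratio(x, eps):
    """Return g(eps) / eps, where g(eps) = f(x + eps) - f(x) - f'(x) * eps."""
    g = f(x + eps) - f(x) - f_prime(x) * eps
    return g / eps


x0 = 1.0
# Evaluate the ratio for eps = 1e-1, 1e-2, ..., 1e-5.
ratios = [abs(residual_ratio(x0, 10.0 ** -k)) for k in range(1, 6)]
for k, r in zip(range(1, 6), ratios):
    print(f"eps = 1e-{k}: |g(eps) / eps| = {r:.3e}")
```

The printed ratios should shrink roughly in proportion to $\epsilon$, which is what one expects when the residual behaves like $\epsilon^2$.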

This linear-approximation view of the derivative is called the Fréchet derivative. Its implication for finding the derivative at a point is that if we manage to separate a linear term from the change in the function value, leaving a residual that is $o(\epsilon)$, then the coefficient of that linear term is exactly the derivative at this point. For neural networks, the function whose gradient we want is often linear w.r.t. the variable (non-linearities are typically applied element-wise, so we can handle them separately using the chain rule). In this case, the $o(\epsilon)$ term vanishes, which further eases the derivation. Now that we've grasped the idea of the Fréchet derivative by looking at real-valued functions of a real variable, we'll move on to its generalization to functions of vector and matrix variables, which is achieved via the inner product.
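As a small worked illustration of separating the linear term from the residual (the functions $x^2$ and $ax + b$ are chosen here only as examples), expand $f(x) = x^2$ around $x$:

$$
\begin{align*}
    f(x+\epsilon) = (x+\epsilon)^2 = \underbrace{x^2}_{f(x)} + \underbrace{2x}_{f'(x)}\,\epsilon + \underbrace{\epsilon^2}_{o(\epsilon)}.
\end{align*}
$$

The residual $\epsilon^2$ satisfies $\epsilon^2 / \epsilon \to 0$ as $\epsilon \to 0$, so the coefficient of the linear term gives $f'(x) = 2x$. For a linear (affine) function $f(x) = ax + b$, the same expansion leaves no residual at all:

$$
\begin{align*}
    f(x+\epsilon) = a(x+\epsilon) + b = f(x) + a\,\epsilon,
\end{align*}
$$

so $f'(x) = a$ can be read off directly.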

For a real-valued function of a vector variable, $f: \mathbb{R}^n \to \mathbb{R}$, the linear term is computed as the vector inner product between the gradient and the change in the variable:

$$
\begin{align*}
    f(\mathbf{x}+\boldsymbol\epsilon)
    &= f(\mathbf{x}) + \langle\nabla_\mathbf{x}f, \boldsymbol\epsilon\rangle + o(\lVert\boldsymbol\epsilon\rVert) \\
    &= f(\mathbf{x}) + (\nabla_\mathbf{x}f)^\top\boldsymbol\epsilon + o(\lVert\boldsymbol\epsilon\rVert).
\end{align*}
$$

Each coordinate contributes its partial derivative multiplied by the change along that coordinate, and these contributions are summed to form the linear approximation.
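The vector form can be checked in the same way as the scalar one. The following is a minimal sketch under stated assumptions: the function $f(\mathbf{x}) = \langle\mathbf{x}, \mathbf{x}\rangle$ with gradient $2\mathbf{x}$, the sample point, the perturbation direction, and the helper names `dot`, `grad_f`, and `linear_approx` are all picked for illustration. It compares $f(\mathbf{x}+\boldsymbol\epsilon)$ against $f(\mathbf{x}) + \langle\nabla_\mathbf{x}f, \boldsymbol\epsilon\rangle$ and reports the residual divided by $\lVert\boldsymbol\epsilon\rVert$.

```python
import math


def dot(u, v):
    # Standard inner product on R^n: sum of coordinate-wise products.
    return sum(a * b for a, b in zip(u, v))


def f(x):
    # Example function f: R^n -> R, chosen for illustration: f(x) = <x, x>.
    return dot(x, x)


def grad_f(x):
    # Analytic gradient of the example function: 2x.
    return [2.0 * xi for xi in x]


def linear_approx(x, eps):
    # f(x) + <grad f(x), eps>: each coordinate contributes
    # (partial derivative) * (change along that coordinate).
    return f(x) + dot(grad_f(x), eps)


x = [1.0, -2.0, 0.5]
direction = [0.3, 0.1, -0.2]

ratios = []
for k in range(1, 6):
    scale = 10.0 ** -k
    eps = [scale * d for d in direction]
    x_shifted = [xi + ei for xi, ei in zip(x, eps)]
    residual = f(x_shifted) - linear_approx(x, eps)
    norm_eps = math.sqrt(dot(eps, eps))
    ratios.append(abs(residual) / norm_eps)
    print(f"||eps|| = {norm_eps:.3e}: |residual| / ||eps|| = {ratios[-1]:.3e}")
```

For this particular $f$ the residual equals $\lVert\boldsymbol\epsilon\rVert^2$, so the printed ratio should track $\lVert\boldsymbol\epsilon\rVert$ itself and shrink toward zero.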