# The Chain Rule

The chain rule is used in calculus to compute the derivatives of composite functions.

This section draws heavily on [AI explained](https://explained.ai/matrix-calculus/#sec:1.3).

## Scalar Calculus

### Function Composition

We start with two expressions, each defining a function:

$$ z = g(x) $$

and:

$$ y = f(g(x)) $$

The second of these is a *composite* function. In other words, it combines two functions by passing the output of $g$ as the input to $f$.

A composite function is sometimes written as:

$$ f(g(x)) = (f \circ g)(x) $$
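
As a rough illustration of composition, here is a minimal Python sketch. The helper `compose` and the particular choices of `f` and `g` are assumptions made for the example; they do not come from the source.

```python
def compose(f, g):
    """Return the composite function f ∘ g, i.e. x -> f(g(x))."""
    return lambda x: f(g(x))


def g(x):
    # Inner function (illustrative choice): z = g(x) = x^2
    return x ** 2


def f(z):
    # Outer function (illustrative choice): y = f(z) = z + 1
    return z + 1


h = compose(f, g)
print(h(3))  # f(g(3)) = 3^2 + 1 = 10
```

Note that order matters: `compose(g, f)(3)` evaluates $g(f(3)) = (3 + 1)^2 = 16$, which is a different function.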

### Scalar Chain Rule

For scalar values and functions, the chain rule can be given as:

**Leibniz Notation**

$$ \frac{dy}{dx} = \frac{dy}{dz} \frac{dz}{dx} $$

**Lagrange Notation**

$$ (f \circ g)'(x) = f'(g(x))g'(x) $$

What these formulae state is that the derivative of the composite function is the product of two derivatives: the derivative of the outer function, evaluated at the output of the inner function, and the derivative of the inner function.
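
One way to convince yourself of this is to compare the chain-rule result with a numerical estimate. The sketch below assumes $g(x) = x^2$ and $f(z) = \sin(z)$ purely for illustration, and uses a central-difference approximation as an independent check.

```python
import math


def g(x):
    # Inner function (illustrative choice): z = x^2
    return x ** 2


def f(z):
    # Outer function (illustrative choice): y = sin(z)
    return math.sin(z)


def dg(x):
    # dz/dx = 2x
    return 2 * x


def df(z):
    # dy/dz = cos(z)
    return math.cos(z)


def chain_rule_derivative(x):
    # dy/dx = f'(g(x)) * g'(x)
    return df(g(x)) * dg(x)


def central_difference(func, x, h=1e-6):
    # Numerical estimate of func'(x)
    return (func(x + h) - func(x - h)) / (2 * h)


x0 = 1.5
analytic = chain_rule_derivative(x0)
numeric = central_difference(lambda x: f(g(x)), x0)
print(analytic, numeric)  # the two values should agree to several decimal places
```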

## Vector Calculus

In statistics and machine learning, however, we are rarely dealing with a single scalar input. We need functions that can pack more punch by taking several variables at once.

### Partial Derivatives

Partial derivatives are derivatives of functions of multiple variables, taken with respect to one variable while the others are held constant.

Consider the function:

$$ f(x,z) = 3x^2z $$

We can take the derivative with respect to $x$, treating $z$ as a constant:

$$ \frac{\partial f(x,z)}{\partial x} = 6xz $$

and with respect to $z$, treating $x$ as a constant:

$$ \frac{\partial f(x,z)}{\partial z} = 3x^2 $$
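
The two results can be sanity-checked numerically. This is a minimal sketch: the helper name `partial` and the evaluation point $(2, 5)$ are assumptions for illustration. It nudges one argument at a time while holding the other fixed, which is exactly what a partial derivative measures.

```python
def f(x, z):
    return 3 * x**2 * z


def df_dx(x, z):
    # Analytic partial derivative with respect to x
    return 6 * x * z


def df_dz(x, z):
    # Analytic partial derivative with respect to z
    return 3 * x**2


def partial(func, args, index, h=1e-6):
    """Central-difference estimate of the partial derivative of func
    with respect to args[index], holding the other arguments fixed."""
    up = list(args)
    up[index] += h
    down = list(args)
    down[index] -= h
    return (func(*up) - func(*down)) / (2 * h)


point = (2.0, 5.0)
print(df_dx(*point), partial(f, point, 0))  # both close to 60
print(df_dz(*point), partial(f, point, 1))  # both close to 12
```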

### Motivating Vectors

We move to the use of vectors once we want to begin organising our derivatives. In the case above, let's say we want to organise the derivatives with respect to $x$ and $z$ into a vector. We call this vector the *gradient* of $f(x,z)$ and denote it as:

$$
\nabla f(x,z) = \begin{bmatrix} \frac{\partial f(x,z)}{\partial x} & \frac{\partial f(x,z)}{\partial z} \end{bmatrix} = \begin{bmatrix} 6xz & 3x^2 \end{bmatrix}
$$
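
In code, the gradient is simply a list holding one partial derivative per variable. The following sketch assumes the evaluation point $(1, 2)$, and the helper `numerical_gradient` is a hypothetical name used only for this example.

```python
def f(x, z):
    return 3 * x**2 * z


def gradient_f(x, z):
    """Analytic gradient of f: [df/dx, df/dz]."""
    return [6 * x * z, 3 * x**2]


def numerical_gradient(func, point, h=1e-6):
    """Central-difference estimate of the gradient of func at point."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        up[i] += h
        down = list(point)
        down[i] -= h
        grad.append((func(*up) - func(*down)) / (2 * h))
    return grad


point = (1.0, 2.0)
analytic = gradient_f(*point)            # [12.0, 3.0]
numeric = numerical_gradient(f, point)   # approximately the same values
print(analytic, numeric)
```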

## Matrix Calculus

With vector calculus, we moved from one variable to many variables. With matrix calculus, we are now introducing the additional complication of more than one function.

In addition to the $f(x,z)$ above, we'll now also introduce:

$$ g(x,z) = 2x + z^8 $$

### Jacobian Matrices

As before, we'll store the partial derivatives of each function in a vector. The difference this time is that these gradient vectors will form the rows of a matrix. This matrix is called a *Jacobian matrix* (the word *Jacobian* on its own can refer either to this matrix or to its determinant, depending on context). We can represent it as below:

$$
\boldsymbol{J} = \begin{bmatrix} \nabla f(x,z) \\ \nabla g(x,z) \end{bmatrix} = \begin{bmatrix} \frac{\partial f(x,z)}{\partial x} & \frac{\partial f(x,z)}{\partial z} \\ \frac{\partial g(x,z)}{\partial x} & \frac{\partial g(x,z)}{\partial z} \end{bmatrix} = \begin{bmatrix} 6xz & 3x^2 \\ 2 & 8z^7 \end{bmatrix}
$$

Note that this matrix is represented in the *numerator* layout (i.e. each row corresponds to one function, the numerator of the derivative, and each column to one variable). However, Jacobian matrices are also often organised in the *denominator* layout, which is the transpose of the numerator layout.
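
To make the two layouts concrete, here is a small sketch that builds the Jacobian as a list of rows and then transposes it. It takes $g(x,z) = 2x + z^8$, which is what the entries $2$ and $8z^7$ imply. The evaluation point $(1, 2)$ and the helper `transpose` are assumptions for illustration.

```python
def jacobian(x, z):
    """Analytic Jacobian in numerator layout: one row per function."""
    return [
        [6 * x * z, 3 * x**2],  # gradient of f(x, z) = 3x^2 z
        [2, 8 * z**7],          # gradient of g(x, z) = 2x + z^8
    ]


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


J = jacobian(1.0, 2.0)
print(J)             # numerator layout: rows are functions
print(transpose(J))  # denominator layout: rows are variables
```

At $(1, 2)$ the numerator layout has rows $[12, 3]$ and $[2, 1024]$, and the denominator layout has rows $[12, 2]$ and $[3, 1024]$.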

### Generalised Jacobian Matrices

Let's say we have the following $n$ variables:

$$ \boldsymbol{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} $$

And let's also say we have $m$ functions, each of which takes the whole vector $\boldsymbol{x}$ as input:

$$ \begin{matrix} y_1 = f_1(\boldsymbol{x}) \\ y_2 = f_2(\boldsymbol{x}) \\ \vdots \\ y_m = f_m(\boldsymbol{x}) \end{matrix} $$

The Jacobian matrix is then the collection of all $m \times n$ partial derivatives, or the stack of all $m$ gradient vectors with respect to $\boldsymbol{x}$:

$$
\boldsymbol{J} = \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = \begin{bmatrix} \nabla f_1(\boldsymbol{x}) \\ \nabla f_2(\boldsymbol{x}) \\ \vdots \\ \nabla f_m(\boldsymbol{x}) \end{bmatrix} = \begin{bmatrix} \frac{\partial}{\partial x_1} f_1(\boldsymbol{x}) & \frac{\partial}{\partial x_2} f_1(\boldsymbol{x}) & \ldots & \frac{\partial}{\partial x_n} f_1(\boldsymbol{x}) \\ \frac{\partial}{\partial x_1} f_2(\boldsymbol{x}) & \frac{\partial}{\partial x_2} f_2(\boldsymbol{x}) & \ldots & \frac{\partial}{\partial x_n} f_2(\boldsymbol{x}) \\ & \vdots & & \\ \frac{\partial}{\partial x_1} f_m(\boldsymbol{x}) & \frac{\partial}{\partial x_2} f_m(\boldsymbol{x}) & \ldots & \frac{\partial}{\partial x_n} f_m(\boldsymbol{x}) \end{bmatrix}
$$
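
The general definition translates fairly directly into a double loop over functions and variables. Below is a minimal sketch that estimates the $m \times n$ Jacobian by central differences. The function name `numerical_jacobian`, the step size `h`, and the three example functions are all assumptions for illustration; this is not an efficient or definitive implementation.

```python
def numerical_jacobian(functions, x, h=1e-6):
    """Estimate the m x n Jacobian (numerator layout) by central differences.

    functions: list of m callables, each taking a list of n floats.
    x: list of n floats at which to evaluate the Jacobian.
    """
    rows = []
    for f_i in functions:          # one row per function
        row = []
        for j in range(len(x)):    # one column per variable
            up = list(x)
            up[j] += h
            down = list(x)
            down[j] -= h
            row.append((f_i(up) - f_i(down)) / (2 * h))
        rows.append(row)
    return rows


# Hypothetical example with m = 3 functions of n = 2 variables
fs = [
    lambda v: 3 * v[0] ** 2 * v[1],  # f_1(x) = 3 x_1^2 x_2
    lambda v: 2 * v[0] + v[1] ** 8,  # f_2(x) = 2 x_1 + x_2^8
    lambda v: v[0] * v[1],           # f_3(x) = x_1 x_2
]

J = numerical_jacobian(fs, [1.0, 2.0])
for row in J:
    print(row)
```

The result has three rows (one per function) and two columns (one per variable), matching the $m \times n$ shape above.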