#### 1st derivative

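Assume the `column vector convention`, which is the most commonly used convention in matrix calculus: vectors are columns, the derivative of a scalar-valued function with respect to $x$ is the column vector of its partial derivatives (the gradient), and the derivative of a vector-valued function is its Jacobian matrix

For `vectors` $a$ and $x$, a general matrix $A$ and a symmetric matrix $S$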

$$\begin{align*}
\frac{\partial a^Tx}{\partial x}&=a \\
& \\
\frac{\partial Ax}{\partial x}&=A \\
& \\
\frac{\partial x^TSx}{\partial x}&=2Sx
\end{align*}$$
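As a sanity check, the two scalar-valued identities above can be compared against central finite differences. This is only a sketch: the helper names (`dot`, `matvec`, `numerical_gradient`) and the test values are assumptions made for illustration, and plain Python lists are used to keep it dependency-free.

```python
def dot(u, v):
    """Inner product u^T v of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(M, v):
    """Matrix-vector product M v, with M stored as a list of rows."""
    return [dot(row, v) for row in M]


def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of a scalar function f at x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad


# Hypothetical test data (S is symmetric by construction).
a = [1.0, -2.0, 0.5]
S = [[2.0, 1.0, 0.0],
     [1.0, 3.0, -1.0],
     [0.0, -1.0, 4.0]]
x = [0.3, -0.7, 1.2]

# d(a^T x)/dx should equal a.
linear_grad = numerical_gradient(lambda v: dot(a, v), x)

# d(x^T S x)/dx should equal 2 S x.
quad_grad = numerical_gradient(lambda v: dot(v, matvec(S, v)), x)
expected_quad_grad = [2.0 * s for s in matvec(S, x)]

# Largest componentwise deviations; both should be tiny.
linear_error = max(abs(g - e) for g, e in zip(linear_grad, a))
quad_error = max(abs(g - e) for g, e in zip(quad_grad, expected_quad_grad))
print(linear_error, quad_error)
```

Both printed deviations should be close to zero, limited only by floating-point rounding in the finite differences.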

#### Jacobian

Suppose $f: \mathbf{R}^n \rightarrow \mathbf{R}^m$ is differentiable. Then for a point $z$ close to $x$, we have the following 1st order approximation

$$f(z)\approx f(x)+Df(x)(z-x)$$

where $Df(x)\in \mathbf{R}^{m \times n}$ is known as the `Jacobian` of $f$ at $x$

$$Df(x)_{ij}=\frac{\partial f_i(x)}{\partial x_j}, \quad i=1, \cdots, m, \quad j=1, \cdots, n$$
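To make the definition concrete, the sketch below builds the Jacobian of a small map $f: \mathbf{R}^2 \rightarrow \mathbf{R}^3$ entry by entry and uses it for the 1st order approximation. The function `f`, its hand-derived `jacobian`, and the points `x` and `z` are hypothetical examples chosen for illustration.

```python
import math


def f(x):
    """Hypothetical example map from R^2 to R^3."""
    return [x[0] * x[1], math.sin(x[0]), x[1] ** 2]


def jacobian(x):
    """Hand-derived Jacobian of f: entry (i, j) is d f_i / d x_j."""
    return [[x[1], x[0]],
            [math.cos(x[0]), 0.0],
            [0.0, 2.0 * x[1]]]


def numerical_jacobian(func, x, eps=1e-6):
    """Central-difference Jacobian, filled one column (one x_j) at a time."""
    m, n = len(func(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = func(x_plus), func(x_minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J


x = [0.5, -1.5]
z = [0.501, -1.499]  # a point close to x

J = jacobian(x)
J_num = numerical_jacobian(f, x)
jacobian_error = max(abs(J[i][j] - J_num[i][j])
                     for i in range(3) for j in range(2))

# 1st order approximation f(x) + Df(x)(z - x).
step = [zj - xj for zj, xj in zip(z, x)]
linearized = [fi + sum(Jij * dj for Jij, dj in zip(row, step))
              for fi, row in zip(f(x), J)]
approx_error = max(abs(fz - lz) for fz, lz in zip(f(z), linearized))
print(jacobian_error, approx_error)
```

The approximation error should be on the order of $\|z-x\|^2$, which is why the approximation is only useful for $z$ close to $x$.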

#### Gradient

When $f$ is `scalar-valued`, that is $f: \mathbf{R}^n \rightarrow \mathbf{R}$, then the Jacobian is a $1 \times n$ matrix (i.e., a `row` vector) and its `transpose` is called the gradient of $f$ at $x$

$$\nabla f(x) = Df(x)^T\in \mathbf{R}^n$$

This is a `column` vector

$$\left(\nabla f(x)\right)_i=\frac{\partial f(x)}{\partial x_i}, i=1, \cdots, n$$

and the 1st order approximation becomes

$$f(z)\approx f(x)+\nabla f(x)^T (z-x)$$

#### Chain rule for 1st derivative

Suppose $f: \mathbf{R}^n \rightarrow \mathbf{R}^m$ and $g: \mathbf{R}^m \rightarrow \mathbf{R}^p$

Define $h: \mathbf{R}^n \rightarrow \mathbf{R}^p$ by $h(x)=g(f(x))$, and assume that $f$ is differentiable at $x$ and $g$ is differentiable at $f(x)$. Then

$$Dh(x)=Dg(f(x))Df(x)$$

and if $f: \mathbf{R}^n \rightarrow \mathbf{R}$ and $g: \mathbf{R} \rightarrow \mathbf{R}$, we have

$$\nabla h(x)=g'(f(x))\nabla f(x)$$

For composition with an `affine` function, let $f: \mathbf{R}^n \rightarrow \mathbf{R}^m$, $A\in \mathbf{R}^{n \times p}, b\in \mathbf{R}^n$

If $g(x)=f(Ax+b)$, then the Jacobian is

$$Dg(x)=Df(Ax+b)D(Ax+b)=Df(Ax+b)A$$

When $f$ is scalar-valued, we have the gradient

$$\nabla g(x)=Dg(x)^T=A^T\nabla f(Ax+b)$$
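The affine composition rule can be checked numerically in the same spirit. In this sketch, the scalar function `f` (a sum of cubes), the matrix `A`, and the vectors `b` and `x` are hypothetical choices; the chain-rule gradient $A^T\nabla f(Ax+b)$ is compared with a finite-difference gradient of $g$.

```python
def dot(u, v):
    """Inner product u^T v."""
    return sum(ui * vi for ui, vi in zip(u, v))


def affine(A, b, x):
    """Compute Ax + b, with A stored as a list of rows."""
    return [dot(row, x) + bi for row, bi in zip(A, b)]


def f(y):
    """Hypothetical scalar function on R^n: the sum of cubes."""
    return sum(yi ** 3 for yi in y)


def grad_f(y):
    """Gradient of f, derived by hand."""
    return [3.0 * yi ** 2 for yi in y]


def grad_g(A, b, x):
    """Chain-rule gradient of g(x) = f(Ax + b), i.e. A^T grad_f(Ax + b)."""
    gf = grad_f(affine(A, b, x))
    return [sum(A[i][j] * gf[i] for i in range(len(A)))
            for j in range(len(x))]


def numerical_gradient(func, x, eps=1e-6):
    """Central-difference estimate of the gradient of a scalar function."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((func(x_plus) - func(x_minus)) / (2 * eps))
    return grad


# A is n x p with n = 3 and p = 2, so x lives in R^2 and Ax + b in R^3.
A = [[1.0, 2.0],
     [0.0, -1.0],
     [3.0, 0.5]]
b = [0.1, -0.2, 0.3]
x = [0.4, -0.6]

analytic = grad_g(A, b, x)
numeric = numerical_gradient(lambda v: f(affine(A, b, v)), x)
chain_rule_error = max(abs(p - q) for p, q in zip(analytic, numeric))
print(analytic, chain_rule_error)
```

Note that the gradient of $g$ has as many entries as $x$ (here 2), not as many as $Ax+b$ (here 3); the multiplication by $A^T$ takes care of this change of dimension.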

##### Example

We want to compute the gradient of the following function, where $a_i\in \mathbf{R}^n$ and $b_i\in \mathbf{R}$

$$f(x)=\log\sum_{i=1}^m \exp(a_i^Tx+b_i)$$

First, for $g(y)=\log \left(\sum_{i=1}^m \exp y_i\right)$, the scalar chain rule with $\log$ as the outer function gives

$$\begin{align*}\nabla g(y)&=\log'\left(\sum_{i=1}^m \exp y_i\right)\nabla \left(\sum_{i=1}^m \exp y_i\right)\\
&=\frac{1}{\sum_{i=1}^m \exp y_i}\begin{bmatrix}\frac{\partial (\sum_{i=1}^m \exp y_i) }{\partial y_1}\\ \vdots \\ \frac{\partial (\sum_{i=1}^m \exp y_i) }{\partial y_m}\end{bmatrix}\\
&=\frac{1}{\sum_{i=1}^m \exp y_i}\begin{bmatrix}\exp y_1 \\ \vdots \\ \exp y_m \end{bmatrix}
\end{align*}$$

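Then, writing $f(x)=g(Ax+b)$ with $A\in \mathbf{R}^{m \times n}$ the matrix whose rows are $a_1^T, \cdots, a_m^T$ and $b=(b_1, \cdots, b_m)$, we use the rule for composition with an affine function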

$$\begin{align*}\nabla f(x) &= A^T\nabla g(Ax+b) \\
&=\frac{1}{\mathbf{1}^Tz}A^Tz
\end{align*}$$

where

$$z=\begin{bmatrix}\exp (a_1^Tx+b_1) \\ \vdots \\ \exp (a_m^Tx+b_m) \end{bmatrix}$$
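The closed form $\nabla f(x)=\frac{1}{\mathbf{1}^Tz}A^Tz$ translates directly into code. The sketch below is illustrative only: the function names and the data `A`, `b`, `x` are assumptions, the rows of `A` play the role of $a_i^T$, and no overflow protection is applied to the exponentials.

```python
import math


def dot(u, v):
    """Inner product u^T v."""
    return sum(ui * vi for ui, vi in zip(u, v))


def log_sum_exp(A, b, x):
    """f(x) = log(sum_i exp(a_i^T x + b_i)), where row i of A is a_i^T."""
    return math.log(sum(math.exp(dot(row, x) + bi) for row, bi in zip(A, b)))


def grad_log_sum_exp(A, b, x):
    """Closed-form gradient A^T z / (1^T z) with z_i = exp(a_i^T x + b_i)."""
    z = [math.exp(dot(row, x) + bi) for row, bi in zip(A, b)]
    total = sum(z)  # 1^T z
    return [sum(A[i][j] * z[i] for i in range(len(A))) / total
            for j in range(len(x))]


def numerical_gradient(func, x, eps=1e-6):
    """Central-difference estimate of the gradient of a scalar function."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((func(x_plus) - func(x_minus)) / (2 * eps))
    return grad


# Hypothetical data: m = 3 terms, n = 2 variables.
A = [[1.0, -1.0],
     [0.5, 2.0],
     [-1.5, 0.3]]
b = [0.2, -0.1, 0.4]
x = [0.7, -0.3]

analytic = grad_log_sum_exp(A, b, x)
numeric = numerical_gradient(lambda v: log_sum_exp(A, b, v), x)
gradient_error = max(abs(p - q) for p, q in zip(analytic, numeric))
print(analytic, gradient_error)
```

Since the weights $z_i/\mathbf{1}^Tz$ are positive and sum to one, each component of the gradient is a weighted average of the corresponding column of $A$.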

#### 2nd derivative

For `scalar`-valued function $f: \mathbf{R}^n\rightarrow \mathbf{R}$, its second derivative or Hessian at $x$ is given by

$$\nabla^2 f(x)_{ij}=\frac{\partial^2 f(x)}{\partial x_i \partial x_j}, i,j=1, \cdots, n$$

The 2nd derivative can be interpreted as the derivative of the 1st derivative

Viewing the gradient as a map $\nabla f: \mathbf{R}^n\rightarrow \mathbf{R}^n$, its Jacobian is the Hessian

$$D\nabla f(x)=\nabla^2 f(x)$$

The 2nd order approximation of $f$ at $z$ close to $x$ is

$$f(z)\approx f(x)+\nabla f(x)^T(z-x)+\frac{1}{2}(z-x)^T\nabla^2 f(x) (z-x)$$

#### Chain rule for 2nd derivative

We will skip the general case $f: \mathbf{R}^n \rightarrow \mathbf{R}^m$ and $g: \mathbf{R}^m \rightarrow \mathbf{R}^p$ and focus only on the following cases, which are more relevant

Suppose $f: \mathbf{R}^n \rightarrow \mathbf{R}$ and $g: \mathbf{R} \rightarrow \mathbf{R}$, and $h(x)=g(f(x))$, then we have

$$\nabla^2 h(x)=g'(f(x))\nabla^2 f(x)+g''(f(x))\nabla f(x) \nabla f(x)^T$$

For composition with an `affine` function, let $f: \mathbf{R}^n \rightarrow \mathbf{R}$, $A\in \mathbf{R}^{n \times m}, b\in \mathbf{R}^n$ and $g(x)=f(Ax+b)$. Then we have

$$\nabla^2 g(x)=A^T\nabla^2 f(Ax+b) A$$

##### Example

We still use the same example

$$f(x)=\log\sum_{i=1}^m \exp(a_i^Tx+b_i)$$

First, for $g(y)=\log \left(\sum_{i=1}^m \exp y_i\right)$, we have

$$\begin{align*}
\nabla^2g(y)&=\log'\left(\sum_{i=1}^m \exp y_i\right)\nabla^2 \left(\sum_{i=1}^m \exp y_i\right)+\log''\left(\sum_{i=1}^m \exp y_i\right)\nabla\left(\sum_{i=1}^m \exp y_i\right)\nabla \left(\sum_{i=1}^m \exp y_i\right)^T \\
&=\frac{1}{\sum_{i=1}^m \exp y_i}\begin{bmatrix}\exp y_1 & 0 &\cdots & 0 \\ 0 & \exp y_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots &  \exp y_m \end{bmatrix} - \frac{1}{\left(\sum_{i=1}^m \exp y_i\right)^2}\begin{bmatrix}\exp y_1 \\ \vdots \\ \exp y_m \end{bmatrix}\begin{bmatrix}\exp y_1 \\ \vdots \\ \exp y_m \end{bmatrix}^T\\
&=\text{diag}\left(\nabla g(y)\right)-\nabla g(y) \nabla g(y)^T
\end{align*}$$

where the last step uses

$$\nabla g(y)=\frac{1}{\sum_{i=1}^m \exp y_i}\begin{bmatrix}\exp y_1 \\ \vdots \\ \exp y_m \end{bmatrix}$$

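Then, since $f(x)=g(Ax+b)$ where the rows of $A$ are $a_1^T, \cdots, a_m^T$ and $b=(b_1, \cdots, b_m)$, we use the rule for composition with an affine function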

$$\begin{align*}\nabla^2 f(x) &= A^T\nabla^2 g(Ax+b)A \\
&=A^T\left(\frac{1}{\mathbf{1}^Tz}\text{diag}(z)-\frac{1}{\left(\mathbf{1}^Tz\right)^2}zz^T\right)A
\end{align*}$$

where

$$z=\begin{bmatrix}\exp (a_1^Tx+b_1) \\ \vdots \\ \exp (a_m^Tx+b_m) \end{bmatrix}$$
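The Hessian formula can be verified by differentiating the gradient numerically, using $D\nabla f(x)=\nabla^2 f(x)$. This is a sketch under assumptions: the function names and the data `A`, `b`, `x` are hypothetical, the rows of `A` play the role of $a_i^T$, and the code works with the normalized vector $z/\mathbf{1}^Tz$ so that the inner matrix becomes $\text{diag}(p)-pp^T$.

```python
import math


def dot(u, v):
    """Inner product u^T v."""
    return sum(ui * vi for ui, vi in zip(u, v))


def normalized_z(A, b, x):
    """Return p = z / (1^T z) with z_i = exp(a_i^T x + b_i)."""
    z = [math.exp(dot(row, x) + bi) for row, bi in zip(A, b)]
    total = sum(z)
    return [zi / total for zi in z]


def gradient(A, b, x):
    """Closed-form gradient A^T z / (1^T z)."""
    p = normalized_z(A, b, x)
    return [sum(A[i][j] * p[i] for i in range(len(A)))
            for j in range(len(x))]


def hessian(A, b, x):
    """Closed-form Hessian A^T (diag(z)/(1^T z) - z z^T/(1^T z)^2) A."""
    p = normalized_z(A, b, x)
    m, n = len(A), len(x)
    # Inner m x m matrix diag(p) - p p^T.
    M = [[(p[i] if i == k else 0.0) - p[i] * p[k] for k in range(m)]
         for i in range(m)]
    return [[sum(A[i][r] * M[i][k] * A[k][c]
                 for i in range(m) for k in range(m))
             for c in range(n)]
            for r in range(n)]


def numerical_hessian(A, b, x, eps=1e-6):
    """Central differences of the gradient, one column (one x_j) at a time."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        g_plus, g_minus = gradient(A, b, x_plus), gradient(A, b, x_minus)
        for i in range(n):
            H[i][j] = (g_plus[i] - g_minus[i]) / (2 * eps)
    return H


# Hypothetical data: m = 3 terms, n = 2 variables.
A = [[1.0, -1.0],
     [0.5, 2.0],
     [-1.5, 0.3]]
b = [0.2, -0.1, 0.4]
x = [0.7, -0.3]

H = hessian(A, b, x)
H_num = numerical_hessian(A, b, x)
hessian_error = max(abs(H[i][j] - H_num[i][j])
                    for i in range(2) for j in range(2))
print(H, hessian_error)
```

The closed-form matrix should also come out symmetric, as expected for a Hessian of a twice continuously differentiable function.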