## Matrix Differentiation (Part 1)
Thanks to [长躯鬼侠](https://zhuanlan.zhihu.com/p/24709748) on Zhihu

**Notation:**
- $\mathbf{x}$: vector, $X$: matrix, $f$: scalar function  

**Definition**
$$\frac{\partial{f}}{\partial X} = \left[\frac{\partial f}{\partial X_{ij}}\right]$$  
so $\frac{\partial f}{\partial X}$ has the same shape as $X$. The inner product of two matrices $A$ and $B$ of the same shape is:
$$tr(A^{T}B) = \sum_{i,j} A_{ij}B_{ij}$$
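
As a quick numerical check of the inner product, here is a minimal sketch assuming NumPy; the random matrices `A` and `B` are illustrative, not from the original notes. It compares the trace form with the sum of elementwise products.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))  # illustrative matrices of the same shape
B = rng.standard_normal((3, 4))

# Matrix inner product written as a trace ...
inner_trace = np.trace(A.T @ B)
# ... and as the sum of elementwise products.
inner_sum = np.sum(A * B)

print(np.isclose(inner_trace, inner_sum))
```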

**Calculation**  
Let $X$ and $Y$ be two matrices of compatible shapes:  
- $d(X \pm Y) = dX \pm dY$
- $d(XY) = (dX)Y + X(dY)$
- $d(X^{T}) = (dX)^T$
- $d(tr(X)) = tr(dX)$
- $d(X^{-1}) = -X^{-1}(dX)X^{-1}$ (1)  

(1) follows from differentiating $XX^{-1} = I$ with the product rule:
$$
\begin{aligned}
XX^{-1} &= I \\
d(XX^{-1}) &= dI \\
(dX)X^{-1} + Xd(X^{-1}) &= 0 \\
d(X^{-1}) &= -X^{-1}(dX)X^{-1}
\end{aligned}
$$
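
Rule (1) can be checked numerically. The sketch below assumes NumPy and uses an illustrative, well-conditioned random `X` with a small perturbation `dX`; it compares the actual change in the inverse with the first-order prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
# Illustrative X close to the identity, so it is safely invertible.
X = np.eye(n) + 0.1 * rng.standard_normal((n, n))
dX = 1e-6 * rng.standard_normal((n, n))  # small perturbation

X_inv = np.linalg.inv(X)
actual = np.linalg.inv(X + dX) - X_inv   # true change in the inverse
predicted = -X_inv @ dX @ X_inv          # first-order prediction from rule (1)

# Relative error; it should be small because the mismatch is second order in dX.
rel_err_inverse = np.linalg.norm(actual - predicted) / np.linalg.norm(predicted)
print(rel_err_inverse)
```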

- $d|X| = tr(X^*dX)$, where $X^*$ is the adjugate matrix of $X$; when $X$ is invertible, $d|X| = |X|tr(X^{-1}dX)$
- $d(X \odot Y) = dX \odot Y + X \odot dY$
- $d\sigma(X) = \sigma^{'}(X) \odot dX$, where $\sigma$ is applied elementwise (2)  

An example of (2):
$$X = 
\left[
\begin{matrix}
x_{11} & x_{12}\\
x_{21} & x_{22}
\end{matrix}
\right]
$$  

$$d\sin(X) = 
\left[
\begin{matrix}
\cos(x_{11})dx_{11} & \cos(x_{12})dx_{12}\\
\cos(x_{21})dx_{21} & \cos(x_{22})dx_{22}
\end{matrix}
\right] = \cos(X) \odot dX
$$  
$$  
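
The determinant rule and the elementwise rule (2) can be checked in the same way. This is a minimal sketch assuming NumPy, with an illustrative invertible `X` and a small `dX`; both errors should be second order in `dX`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
X = np.eye(n) + 0.1 * rng.standard_normal((n, n))  # illustrative invertible X
dX = 1e-6 * rng.standard_normal((n, n))

# Determinant rule: d|X| = |X| tr(X^{-1} dX)
det_actual = np.linalg.det(X + dX) - np.linalg.det(X)
det_predicted = np.linalg.det(X) * np.trace(np.linalg.inv(X) @ dX)
det_err = abs(det_actual - det_predicted)

# Elementwise rule: d sin(X) = cos(X) * dX (elementwise product)
sin_actual = np.sin(X + dX) - np.sin(X)
sin_predicted = np.cos(X) * dX
sin_err = np.max(np.abs(sin_actual - sin_predicted))

print(det_err, sin_err)
```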

**Trace Trick**  
- For a scalar $a$, $tr(a) = a$  
- $tr(A^T) = tr(A)$  
- $tr(A \pm B) = tr(A) \pm tr(B)$
- $tr(AB) = tr(BA)$
- $tr(A^T(B \odot C)) = tr((A \odot B)^TC)$
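
The trace identities above are easy to confirm numerically. The sketch below assumes NumPy; all matrices are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
A, B, C = (rng.standard_normal((3, 4)) for _ in range(3))
P = rng.standard_normal((3, 5))
Q = rng.standard_normal((5, 3))
S = P @ Q  # a square matrix, so its trace is defined

# tr(S^T) = tr(S)
transpose_ok = np.isclose(np.trace(S.T), np.trace(S))
# tr(PQ) = tr(QP), even though PQ is 3x3 and QP is 5x5
cyclic_ok = np.isclose(np.trace(P @ Q), np.trace(Q @ P))
# tr(A^T (B * C)) = tr((A * B)^T C), where * is the elementwise product
hadamard_ok = np.isclose(np.trace(A.T @ (B * C)), np.trace((A * B).T @ C))

print(transpose_ok, cyclic_ok, hadamard_ok)
```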



### How to get the derivative?
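$$df = tr\left(\left(\frac{\partial{f}}{\partial{X}}\right)^TdX\right)$$   
Write $df$ in this form and read off $\frac{\partial f}{\partial X}$ as the matrix whose transpose multiplies $dX$. Some notes use the opposite layout convention and omit the transpose, so their derivative is the transpose of the one used here.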
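
To make the relation between $df$ and the derivative concrete, here is a minimal sketch assuming NumPy. The helper `numerical_gradient` and the test function $f(X) = tr(X^TX)$ (whose gradient is $2X$) are illustrative choices, not part of the original notes. The helper estimates each entry of the derivative by central differences, and the last lines check that $df$ is approximately the trace of the transposed gradient times the perturbation.

```python
import numpy as np

def numerical_gradient(f, X, eps=1e-6):
    """Central-difference estimate of [df/dX_ij]; same shape as X (hypothetical helper)."""
    G = np.zeros_like(X, dtype=float)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X, dtype=float)
        E[idx] = eps  # perturb a single entry
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

def f(M):
    # Illustrative scalar function: f(M) = tr(M^T M), with gradient 2M.
    return np.trace(M.T @ M)

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 2))
dX = 1e-6 * rng.standard_normal((3, 2))

G = numerical_gradient(f, X)

df_actual = f(X + dX) - f(X)       # true change in f
df_predicted = np.trace(G.T @ dX)  # tr(G^T dX)
df_err = abs(df_actual - df_predicted)
print(df_err)
```

A finite-difference helper like this is slow for large matrices, but it is handy for checking a hand-derived gradient.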

### Chain Rule
The chain rule is not as simple as in scalar calculus. Take an example:
let $Y = AXB$ with $A$ and $B$ constant, so $dY = A(dX)B$. We want $\frac{\partial{f}}{\partial{X}}$, where $f$ is a scalar function of $Y$.  
$$
\begin{aligned}
df &= tr(\frac{\partial{f}}{\partial{Y}}^TdY) \\
&= tr(\frac{\partial{f}}{\partial{Y}}^TA(dX)B)\\
&= tr(B\frac{\partial{f}}{\partial{Y}}^TA(dX))\\
&= tr((A^T\frac{\partial{f}}{\partial{Y}}B^T)^T(dX))
\end{aligned}
$$  
and we therefore have:
$$
\frac{\partial{f}}{\partial{X}} = A^T\frac{\partial{f}}{\partial{Y}}B^T
$$
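
The result can be checked numerically for a concrete choice of $f$. The sketch below assumes NumPy and picks $f(Y) = \sum_{ij}\sin(Y_{ij})$ purely for illustration, so that $\frac{\partial f}{\partial Y} = \cos(Y)$; the shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
dX = 1e-6 * rng.standard_normal((3, 4))

def f(Y):
    # Illustrative scalar function of Y; its derivative w.r.t. Y is cos(Y).
    return np.sum(np.sin(Y))

Y = A @ X @ B
grad_Y = np.cos(Y)            # df/dY for this choice of f
grad_X = A.T @ grad_Y @ B.T   # chain-rule result: A^T (df/dY) B^T

df_actual = f(A @ (X + dX) @ B) - f(Y)
df_predicted = np.trace(grad_X.T @ dX)
chain_err = abs(df_actual - df_predicted)
print(grad_X.shape == X.shape, chain_err)
```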

## Examples  
1. $f = a^TXb$, calculate $\frac{\partial{f}}{\partial{X}}$, where the shapes are $a : (m , 1)$, $X : (m , n)$, $b : (n , 1)$ respectively

$$
\begin{aligned}
df &= d(a^TXb)\\
&= a^T(dX)b\\
\end{aligned}
$$
Then use the **trace trick**:
$$
tr(df) = df = tr(a^T(dX)b) = tr((ab^T)^TdX)
$$  
Thus, $\frac{\partial{f}}{\partial{X}} = ab^T$  
<br/>
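
A quick check of this example, as a sketch assuming NumPy with illustrative random `a`, `X`, `b` stored as column vectors and a matrix. Because $f$ is linear in $X$, the perturbation does not need to be small.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 3, 4
a = rng.standard_normal((m, 1))
X = rng.standard_normal((m, n))
b = rng.standard_normal((n, 1))
dX = rng.standard_normal((m, n))  # f is linear in X, so any dX works

def f(M):
    # f = a^T M b, returned as a Python float
    return (a.T @ M @ b).item()

grad = a @ b.T  # claimed derivative a b^T, shape (m, n)

df_actual = f(X + dX) - f(X)
df_predicted = np.trace(grad.T @ dX)
print(np.isclose(df_actual, df_predicted))
```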

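2. $f = a^Te^{Xb}$, calculate $\frac{\partial{f}}{\partial{X}}$, where the exponential is applied elementwise and the shapes are $a : (m , 1)$, $X : (m , n)$, $b : (n , 1)$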

$$
\begin{aligned}
df &= a^Td(e^{Xb})\\
&= a^T(e^{Xb} \odot d(Xb))
\end{aligned}
$$  
Then use the **trace trick**, in particular $tr(A^T(B \odot C)) = tr((A \odot B)^TC)$ with $A = a$:
$$
\begin{aligned}
tr(df) &= tr(a^T(e^{Xb} \odot d(Xb)))\\
&= tr((a \odot e^{Xb})^Td(Xb))\\
&= tr((a \odot e^{Xb})^T(dX)b)\\
&= tr(b(a \odot e^{Xb})^T(dX))\\
&= tr(((a \odot e^{Xb})b^T)^TdX)
\end{aligned}
$$  
Therefore,
$$
\frac{\partial{f}}{\partial{X}} = (a \odot e^{Xb})b^T
$$  
<br/>
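
This example can be checked numerically as well. The sketch below assumes NumPy and treats $a$ and $e^{Xb}$ as column vectors of shape $(m, 1)$, so the elementwise product is between two $(m, 1)$ vectors and the gradient $(a \odot e^{Xb})b^T$ has the shape of $X$. All values are random and illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 3, 4
a = rng.standard_normal((m, 1))
X = 0.1 * rng.standard_normal((m, n))  # scaled down to keep exp() moderate
b = rng.standard_normal((n, 1))
dX = 1e-6 * rng.standard_normal((m, n))

def f(M):
    # f = a^T exp(M b), with exp applied elementwise
    return (a.T @ np.exp(M @ b)).item()

# (m,1) elementwise (m,1) gives (m,1); times b^T of shape (1,n) gives (m,n)
grad = (a * np.exp(X @ b)) @ b.T

df_actual = f(X + dX) - f(X)
df_predicted = np.trace(grad.T @ dX)
exp_err = abs(df_actual - df_predicted)
print(grad.shape == X.shape, exp_err)
```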

3. Linear Regression  
$l = \lVert Xw - y \rVert ^2$, calculate $\frac{\partial{l}}{\partial{w}}$  
$$
\begin{aligned}
l &= (Xw - y)^T(Xw - y)\\
dl &= d(Xw - y)^T(Xw - y) + (Xw - y)^Td(Xw - y)\\
&= d(Xw)^T(Xw - y) + (Xw - y)^Td(Xw)\\
\end{aligned}
$$  
Using the **trace trick**:
$$
\begin{aligned}
tr(dl) &= tr(d(Xw)^T(Xw - y)) + tr((Xw - y)^Td(Xw)) \\
&= tr(((Xw - y)^Td(Xw))^T) + tr((Xw - y)^Td(Xw)) \\
&= 2tr((Xw - y)^Td(Xw))\\
&= 2tr((Xw - y)^TXdw)\\
&= tr((2X^T(Xw - y))^Tdw)
\end{aligned}
$$  
Hence $\frac{\partial{l}}{\partial{w}} = 2X^T(Xw - y)$. Setting $\frac{\partial{l}}{\partial{w}} = 0$ and assuming $X^TX$ is invertible, we have:
$$
\begin{aligned}
2X^T(Xw - y) &= 0\\
X^TXw - X^Ty &= 0\\
w &= (X^TX)^{-1}X^Ty 
\end{aligned}
$$
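
The closed-form solution can be compared with a library least-squares solver. This is a minimal sketch assuming NumPy, random illustrative data, and `y`, `w` stored as 1-D arrays; it solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical choice but not something the derivation requires.

```python
import numpy as np

rng = np.random.default_rng(6)
N, d = 20, 3
X = rng.standard_normal((N, d))  # illustrative design matrix
y = rng.standard_normal(N)       # illustrative targets

def grad_loss(w):
    # Gradient of ||Xw - y||^2 derived above: 2 X^T (Xw - y)
    return 2 * X.T @ (X @ w - y)

# Normal equations: X^T X w = X^T y (assumes X^T X is invertible)
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(w_star, w_lstsq))
print(np.allclose(grad_loss(w_star), 0, atol=1e-8))
```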

For more information, see the post by [长躯鬼侠](https://zhuanlan.zhihu.com/p/24709748) and [The Matrix Cookbook](http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=3274)