In [1]:
%run Latex_macros.ipynb


In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA

import os 

import seaborn as sns; sns.set()

import unsupervised_helper
%aimport unsupervised_helper


# Derivatives, Gradients, Jacobians

From basic calculus we are (hopefully) familiar with the *derivative*
$$
\frac{\partial y}{\partial x}
$$

where
$y = f(x)$ for some univariate function $f$.

But what about
$$
\frac{\partial \y}{\partial \x}
$$
where $\y = f(\x)$ is a multivariate function (on vector $\x$) whose range is *also* a vector?

In general, $\y$ and $\x$ may be vectors and we need to define the *Jacobian* $\frac{\partial \y}{\partial \x}$.


Before giving the general form for the Jacobian, we illustrate it in steps.

## Scalar $y$, vector $\x$

$$
\frac{\partial y}{\partial \x}
$$
- is the vector of length $|\x|$ defined as
$$
\left( \frac{\partial y}{\partial \x} \right)_j = \frac{\partial y}{\partial \x_j}
$$


**Example**

$| \x | = 2$ and $y = \x_1 * \x_2$

$$
\begin{array}{lll}\\
\frac{\partial y}{\partial \x} &  = & 
\begin{pmatrix}
 \frac{\partial y}{\partial \x_1} & \frac{\partial y}{\partial \x_2}
\end{pmatrix}\\
& = & 
\begin{pmatrix}
 \x_2 &  \x_1
\end{pmatrix}\\
\end{array}
$$
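
The example above can be checked numerically. The following is a minimal sketch, not part of the original notebook: `numerical_gradient` is a hypothetical helper that approximates each partial derivative with a central difference, and the point $\x = (3, 5)$ is chosen arbitrarily.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Approximate the gradient of a scalar-valued f at vector x (central differences)."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps                      # perturb only coordinate j
        grad[j] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

f = lambda x: x[0] * x[1]                  # y = x_1 * x_2
x = np.array([3.0, 5.0])                   # arbitrary evaluation point

grad = numerical_gradient(f, x)
analytic = np.array([x[1], x[0]])          # (x_2, x_1), as derived above
print(grad, analytic)
```

The gradient has one entry per element of $\x$, so its length is $|\x| = 2$.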

To be even more concrete: consider a Regression Task using the Mean Squared Error (MSE) loss function.

$$
\loss_\Theta = \text{MSE}(\y, \hat{\y}, \Theta) = { 1\over{m} } \sum_{i=1}^m (  \y^\ip  - \hat{\y}^\ip )^2
$$

Using $\Theta$ to denote the vector of parameters:
- $\Theta_0$ is the intercept
- $\Theta_j$ is the sensitivity of the prediction $\hat{\y}$ to the independent variable (feature) $j$


The derivative (gradient) of the scalar $\loss_\Theta$ with respect to vector $\Theta$ is:
$$
\nabla_\Theta \loss_\Theta =
\begin{pmatrix}
 \frac{\partial}{\partial \Theta_0} \text{MSE}(\y, \hat{\y}, \Theta) \\
 \frac{\partial}{\partial \Theta_1} \text{MSE}(\y, \hat{\y}, \Theta) \\
 \vdots \\
 \frac{\partial}{\partial \Theta_n} \text{MSE}(\y, \hat{\y}, \Theta)
\end{pmatrix}
$$

Here are the details of the derivative of $\loss_\Theta$ with respect to parameter $\Theta_j$
$$
\begin{array}{llll}
 \frac{\partial}{\partial \Theta_j} \text{MSE}(\y, \hat{\y}, \Theta) & = &
{ 1\over{m} } \sum_{i=1}^m  \frac{\partial}{\partial \Theta_j} (  \y^\ip  - \hat{\y}^\ip )^2 & \text{definition}\\
& = & { 1\over{m} } \sum_{i=1}^m  2 (  \y^\ip  - \hat{\y}^\ip ) \frac{\partial}{\partial \Theta_j} (  \y^\ip  - \hat{\y}^\ip ) & \text{chain rule}\\
& = & - { 1\over{m} } \sum_{i=1}^m  2 (  \y^\ip  - \hat{\y}^\ip ) \frac{\partial}{\partial \Theta_j} (\Theta^T \cdot \x^\ip) & \y^\ip \text{ does not depend on } \Theta; \; \hat{\y}^\ip = \Theta^T \cdot \x^\ip\\
& = & - { 1\over{m} } \sum_{i=1}^m  2 (  \y^\ip  - \hat{\y}^\ip ) \x^\ip_j  & \\
& = & { 2 \over{m} } \sum_{i=1}^m  (  \hat{\y}^\ip  - \y^\ip ) \x^\ip_j  & \\
\end{array}
$$
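
As a sanity check on the MSE gradient, here is a minimal sketch that compares an analytic gradient with a finite-difference estimate. Everything in it is an assumption made for illustration: the data are random, the names `mse` and `mse_gradient` are hypothetical, and a column of ones is prepended to the features so that $\Theta_0$ acts as the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3                                   # examples, features (arbitrary sizes)
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # column 0 pairs with Theta_0
theta = rng.normal(size=n + 1)
y = rng.normal(size=m)

def mse(theta):
    """Mean squared error of the linear prediction X @ theta."""
    return np.mean((y - X @ theta) ** 2)

def mse_gradient(theta):
    """Analytic gradient: entry j is (2/m) * sum_i (y_hat_i - y_i) * x_ij."""
    y_hat = X @ theta
    return (2.0 / m) * X.T @ (y_hat - y)

# Central-difference estimate, one parameter at a time
eps = 1e-6
numeric = np.array([
    (mse(theta + eps * e) - mse(theta - eps * e)) / (2 * eps)
    for e in np.eye(n + 1)
])
analytic = mse_gradient(theta)
print(np.max(np.abs(numeric - analytic)))      # should be tiny
```

The gradient has the same length as $\Theta$ (here $n + 1 = 4$), one partial derivative per parameter.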

## Vector $\y$, scalar $x$

$$
\frac{\partial \y}{\partial x}
$$
- is a column vector with $|\y|$ rows
- defined as

$$
\left( \frac{\partial \y}{\partial x} \right)^\ip = \frac{\partial \y^\ip}{\partial x}
$$

Technically (and this will be important when we define higher dimensional gradients recursively)
- is the vector of length $1$
- whose *element* is a vector of length $|\y|$

**Example**
$ \y = ( x^0, x^1, x^2 )$

$$
\begin{array}{lll}\\
\frac{\partial \y}{\partial x} &  = & 
\begin{pmatrix}
 \frac{\partial \y^{(1)}}{\partial x} \\
 \frac{\partial \y^{(2)}}{\partial x} \\
 \frac{\partial \y^{(3)}}{\partial x}
\end{pmatrix}\\
& = & 
\begin{pmatrix}
 0 \\
 1 \\
 2x
\end{pmatrix}\\
\end{array}
$$
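
A minimal numerical sketch of this case, with the evaluation point $x = 3$ chosen arbitrarily: each element of $\y$ is differentiated with respect to the single scalar $x$, so the result has $|\y| = 3$ entries.

```python
import numpy as np

def f(x):
    """y = (x^0, x^1, x^2) for a scalar x."""
    return np.array([x**0, x**1, x**2], dtype=float)

x, eps = 3.0, 1e-6
# Central difference applied to the whole vector at once
dy_dx = (f(x + eps) - f(x - eps)) / (2 * eps)

analytic = np.array([0.0, 1.0, 2 * x])   # derivatives of 1, x, x^2
print(dy_dx, analytic)
```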

## Vector $\y$, vector $\x$

$$
\frac{\partial \y}{\partial \x}
$$
- is the vector of length $| \x |$
- whose *element* is a vector of length $|\y|$
- defined as

$$
\left( \frac{\partial \y}{\partial \x} \right)^\ip_j = \frac{\partial \y^\ip}{\partial \x_j}
$$

**Example**
$ | \x | = 2, \y = ( \x_1 + \x_2, \x_1 * \x_2)$

$$
\begin{array}{lll}\\
\frac{\partial \y}{\partial \x} &  = & 
\begin{pmatrix}
 \frac{\partial \y^{(1)}}{\partial \x_1} & \frac{\partial \y^{(1)}}{\partial \x_2}\\
 \frac{\partial \y^{(2)}}{\partial \x_1} & \frac{\partial \y^{(2)}}{\partial \x_2}
\end{pmatrix}\\
& = & 
\begin{pmatrix}
 1 & 1 \\
 \x_2 & \x_1 
\end{pmatrix}\\
\end{array}
$$
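
The Jacobian in this example can be approximated column by column. This is a minimal sketch under stated assumptions: `numerical_jacobian` is a hypothetical helper (not from this notebook), and it stores $\frac{\partial \y^{(i)}}{\partial \x_j}$ in row $i$, column $j$, matching the matrix layout shown above.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the Jacobian of vector-valued f at vector x.

    Entry (i, j) is the central-difference estimate of d y_i / d x_j.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(f(x), dtype=float)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps                       # perturb only x_j
        J[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return J

f = lambda x: np.array([x[0] + x[1], x[0] * x[1]])   # y = (x_1 + x_2, x_1 * x_2)
x = np.array([3.0, 5.0])                             # arbitrary evaluation point

J = numerical_jacobian(f, x)
expected = np.array([[1.0, 1.0],
                     [x[1], x[0]]])
print(J)
```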

# Tensors and Generalized Jacobians

A *tensor* is a multi-dimensional collection of values.

We are familiar with special cases
- a vector is a tensor with $1$ dimension
- a matrix is a tensor with $2$ dimensions

A $D$-dimensional tensor is a collection of numbers with *shape*
$$
( n_1 \times n_2 \times \ldots \times n_D )
$$


We can define the *Generalized Jacobian* 
$$
\frac{\partial \y}{\partial \x}
$$

analogous to how we defined the Jacobian.

The main difference is that the elements $\y^\ip$ and $\x_j$ selected by the indices $i$ and $j$ are now *tensors* rather than *scalars*.

Let
- the shape of $\x$ be $(n_{x_1} \times n_{x_2} \times \ldots \times n_{x_{D_x}})$
- the shape of $\y$ be $(n_{y_1} \times n_{y_2} \times \ldots \times n_{y_{D_y}})$

$$
\frac{\partial \y}{\partial \x}
$$
- is the tensor with shape $\left( (n_{y_1} \times n_{y_2} \times \ldots \times n_{y_{D_y}}) \times (n_{x_1} \times n_{x_2} \times \ldots \times n_{x_{D_x}}) \right)$
- defined *recursively* as

$$
\left( \frac{\partial \y}{\partial \x} \right)^\ip_j = \frac{\partial \y^\ip}{\partial \x_j}
$$



Note that 
- the number of dimensions of $\y^\ip$ is $D_y - 1$
- the number of dimensions of $\x_j$ is $D_x - 1$

so the recursive call (RHS of the equation) operates on objects of lower dimension and hence will eventually reduce to a base case (derivatives involving only vectors and scalars).
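
To make the shape of the Generalized Jacobian concrete, here is a minimal numerical sketch. It is an assumption-laden illustration: `generalized_jacobian` is a hypothetical helper, and it loops over every multi-index of $\x$ rather than following the recursive formulation, although the resulting tensor is the same. The function being differentiated (multiplication by a fixed matrix `W`) is chosen only because its derivative is easy to verify.

```python
import numpy as np

def generalized_jacobian(f, x, eps=1e-6):
    """Approximate d f(x) / d x for tensor-valued f and tensor x.

    The result has shape y.shape + x.shape: the trailing indices select
    an element of x, the leading indices select an element of y.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(f(x), dtype=float)
    J = np.zeros(y.shape + x.shape)
    for idx in np.ndindex(*x.shape):        # every multi-index of x
        step = np.zeros_like(x)
        step[idx] = eps
        J[(...,) + idx] = (f(x + step) - f(x - step)) / (2 * eps)
    return J

W = np.arange(6.0).reshape(2, 3)            # fixed weights, shape (2, 3)
f = lambda X: W @ X                         # maps shape (3, 4) to shape (2, 4)
X = np.ones((3, 4))

J = generalized_jacobian(f, X)
print(J.shape)                              # (2, 4, 3, 4)
```

Here $\y$ has shape $(2 \times 4)$ and $\x$ has shape $(3 \times 4)$, so the Generalized Jacobian has shape $(2 \times 4) \times (3 \times 4)$.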


# Where do these higher dimensional tensors come from ?

They are omnipresent !
- The mini batch index
- multi-dimensional input data



## Mini batch index

When TensorFlow shows you the shape of an object, it typically has one more dimension
than "natural" and the leading dimension is `None`.

That is because TensorFlow computes *on every element of the mini batch* simultaneously.

So the leading index points to an input example.

Hence the extra dimension.
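
The same idea can be sketched outside TensorFlow. The following NumPy illustration is an assumption made for this example (the names `layer`, `W` and `b` are hypothetical): a linear layer written for one example also accepts a whole mini batch, and the output gains the same leading dimension. Because the code works for any batch size, that leading dimension is not fixed in advance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_units = 4, 3                  # arbitrary sizes
W = rng.normal(size=(n_features, n_units))
b = np.zeros(n_units)

def layer(X):
    """Linear layer; if X has a leading batch axis, it is carried through."""
    return X @ W + b

single = rng.normal(size=n_features)        # "natural" shape: (4,)
batch = rng.normal(size=(32, n_features))   # one extra leading dimension: (32, 4)

print(layer(single).shape)                  # (3,)
print(layer(batch).shape)                   # (32, 3)
```

Row $i$ of the batched output equals the output for example $i$ computed on its own.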

## Multidimensional data

Lots of data is multi-dimensional.

For example, images have a height, width and depth (number of color channels).

Before we introduced Tensors, we "flattened" higher dimensional images into vectors.

We then had to "unflatten" the scalar derivatives in order to rearrange them so as to correspond
to the same index in the input from which they originated.

For the most part, this flatten/unflatten paradigm is not necessary if we operate over Tensors.
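
A minimal sketch of the two approaches, using a made-up loss (the sum of squared pixel values, whose derivative with respect to each pixel is twice that pixel) on a random image of assumed shape $(8 \times 8 \times 3)$:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8, 3))          # height, width, channels

# Flatten / unflatten: derivatives are computed on a vector, then
# rearranged so each one lines up with the pixel it came from.
flat = image.reshape(-1)                    # shape (192,)
grad_flat = 2 * flat                        # d(sum of squares) / d(pixel)
grad_unflat = grad_flat.reshape(image.shape)

# Tensor approach: the derivative already has the shape of the input.
grad_tensor = 2 * image                     # shape (8, 8, 3)

print(grad_flat.shape, grad_tensor.shape)
```

Both routes give the same values; the tensor version simply skips the bookkeeping.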

# Conclusion

The derivatives that are needed for Gradient Descent often involve tensors.

This module formalized what it means to take derivatives of higher dimensional objects.

In [4]:
print("Done")

Done
