## Introduction 

 Real-valued functions of a real variable are written $f: \mathbb{R} \to \mathbb{R}$, that is, they map real numbers to real numbers. Most of the familiar functions follow this pattern, for instance $\sin, \cos, \log, \exp$. The derivatives of these functions are listed in any calculus book. 

$$\begin{array}{|c|c|}  
\hline
 \textbf{$f(x)$} & \textbf{$f'(x)$} \\
 \hline
 \sin(x) & \cos(x)\\  
 \cos(x) & - \sin(x)\\
 \log(x) & \frac{1}{x}\\
  e^x & e^x\\
  \hline
\end{array}$$

In calculus, the derivative of $f$ at a point $x_0$ is defined by the limit 

$$ f'(x_0) = \lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0}$$
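
As a quick numerical illustration of this limit, the sketch below evaluates the difference quotient of $\sin$ at $x_0 = 1$ for points $x$ closer and closer to $x_0$. The helper name `difference_quotient` and the chosen step sizes are assumptions made for this example only.

```python
import math

def difference_quotient(f, x0, x):
    """Slope of the secant line through (x0, f(x0)) and (x, f(x))."""
    return (f(x) - f(x0)) / (x - x0)

x0 = 1.0
for step in (1e-1, 1e-3, 1e-5):
    slope = difference_quotient(math.sin, x0, x0 + step)
    # The quotient should approach cos(x0), the known derivative of sin.
    print(f"x - x0 = {step:.0e}: quotient = {slope:.6f}, cos(x0) = {math.cos(x0):.6f}")
```

As the step shrinks, the secant slope moves toward $\cos(1)$, which is the value listed in the table above.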

Geometrically, the derivative at a point is the slope of the tangent line at that point

<center> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Tangent-calculus.svg/250px-Tangent-calculus.svg.png"/> </center>

The derivative measures how fast the function changes at a point $x$. If the derivative is positive at a point then the function is increasing there, if it is negative then the function is decreasing, and if it is zero then the point is a stationary point, where the function is locally flat. 

This concept is very important in function **optimization**.  *Usually*, we want to find the point where the function achieves its minimum. We can set $f'(x) = 0$ and solve for the roots of the derivative. Checking the sign of $f'$ in the neighborhood of each root tells us whether that root is a local minimum or a local maximum. The smallest of all local minima is called the global minimum. 

<center> <img src="https://media1.shmoop.com/images/calculus/calc_hghderiv_locglob_narr_graphik_1.png"/> </center>
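
The root-checking procedure can be sketched in a few lines. The function $f(x) = x^3 - 3x$, its derivative, and the helper `classify` are illustrative choices for this sketch; its roots of $f'$ are $x = \pm 1$.

```python
def f_prime(x):
    # Derivative of the illustrative function f(x) = x**3 - 3*x
    return 3 * x**2 - 3

def classify(root, eps=1e-3):
    """Classify a root of f' by the sign of f' just left and right of it."""
    left, right = f_prime(root - eps), f_prime(root + eps)
    if left < 0 < right:
        return "local minimum"   # decreasing, then increasing
    if left > 0 > right:
        return "local maximum"   # increasing, then decreasing
    return "neither"

for root in (-1.0, 1.0):
    print(root, classify(root))
```

The sign test only looks at a small neighborhood, so it says nothing about whether a local minimum is also the global one.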


**Problems** 


1.   The derivative can be undefined. For instance, $|x|$ has no derivative at $x = 0$ because the limits from the right and from the left are not equal 
$$f'(0^+) = \lim_{x \to 0 ^+} \frac{x - 0 }{x - 0} = 1 \neq f'(0^-) = \lim_{x \to 0^-} \frac{-x - 0}{x - 0} = -1$$
2.   The point could be a saddle point, where $f'(x) = 0$ but the function is increasing on both sides of the point (or decreasing on both sides), so it is neither a minimum nor a maximum.  For instance, the function $x^3$ has a saddle point at $x = 0$. 
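
Both problems can be seen numerically. The sketch below, with the hypothetical helper `one_sided_slope`, computes the difference quotient of $|x|$ on each side of $0$ and then checks how $x^3$ behaves around its stationary point.

```python
def one_sided_slope(f, x0, h):
    """Difference quotient using the point x0 + h (h may be negative)."""
    return (f(x0 + h) - f(x0)) / h

h = 1e-6
right = one_sided_slope(abs, 0.0, h)    # approach 0 from the right
left = one_sided_slope(abs, 0.0, -h)    # approach 0 from the left
print(right, left)                      # the two slopes disagree

def cube(x):
    return x**3

# x**3 is flat at 0 but still increases through it: below 0 on the left, above 0 on the right.
print(cube(-h) < cube(0.0) < cube(h))
```

The slopes of $|x|$ differ in sign, so no single tangent slope exists at $0$, while $x^3$ passes through its stationary point without turning around.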



## Derivative Evaluation Algorithms
 There are basically three approaches to computing derivatives on a computer 

1.   **Numerical Differentiation**, which uses the finite difference rule for small $h$
$$f'(x) \approx \frac{f(x+h)-f(x)}{h}$$ This formula suffers from numerical instability for very small values of $h$, because the subtraction in the numerator loses precision to floating-point round-off.
2.   **Symbolic Differentiation**, which calculates a symbolic expression for the derivative of the function. This is the approach used in MATLAB and Mathematica.  It can be quite slow because it requires parsing and manipulating symbols. 

3. **Automatic Differentiation**, which is the approach used in most deep learning libraries like TensorFlow and PyTorch. The mathematical expression is divided into primitive blocks and the derivative is evaluated using the chain rule. 
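
To make the instability of the finite difference rule concrete, here is a small sketch that measures the error of the forward difference for $\sin$ at $x = 1$ over a range of step sizes. The helper `forward_difference` and the list of steps are assumptions of this example.

```python
import math

def forward_difference(f, x, h):
    """Finite difference approximation (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

x = 1.0
exact = math.cos(x)   # known derivative of sin at x
errors = {}
for h in (1e-1, 1e-4, 1e-8, 1e-12, 1e-15):
    errors[h] = abs(forward_difference(math.sin, x, h) - exact)
    print(f"h = {h:.0e}  error = {errors[h]:.2e}")
```

The error first shrinks as $h$ decreases, then grows again once $h$ is so small that round-off in $f(x+h) - f(x)$ dominates.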



## Automatic Differentiation

There are two main modes of automatic differentiation  

*   **Forward-mode automatic differentiation**  propagates derivatives forward from the inputs to the outputs, alongside the function evaluation. One pass gives the derivative with respect to one input variable. 

*   **Reverse-mode automatic differentiation**  first evaluates the function, then back-propagates the gradient from the output to the inputs. One pass gives the derivative of one output with respect to all the inputs. 

Most machine learning libraries use reverse mode, because forward mode becomes expensive when the gradient is needed with respect to many inputs. For instance, TensorFlow uses reverse-mode differentiation as explained in this [post](https://www.tensorflow.org/tutorials/eager/automatic_differentiation). 
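
Forward mode is easy to sketch with dual numbers, which carry a value and a derivative through every operation. The `Dual` class and the helper `sin` below are illustrative names written for this example; they are not part of TensorFlow and only cover addition, multiplication and sine.

```python
import math

class Dual:
    """A value together with its derivative with respect to one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

def sin(d):
    # Chain rule for the sine primitive
    return Dual(math.sin(d.value), math.cos(d.value) * d.deriv)

def f(x):
    return x * x + 3 * sin(x)

x = Dual(2.0, 1.0)   # seed: the derivative of x with respect to itself is 1
y = f(x)
print(y.value, y.deriv)   # y.deriv is f'(2) = 2*2 + 3*cos(2)
```

Each input variable needs its own seeded pass, which is why forward mode gets expensive for functions with many inputs.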

### Computation Graph

Machine learning libraries like TensorFlow build computational graphs where each node represents a simple computation function. This makes reverse-mode differentiation very easy to compute, as we back-propagate the gradient from the outputs to the inputs. 

<center> <img src="https://www.easy-tensorflow.com/files/1_1.gif"/> </center>

Basically, TensorFlow defines the primitive functions which are the functions that cannot be reduced further. This includes $x^n, e^x, \sin(x), \log(x), \frac{1}{x}\cdots$. We assume that any other function is a composition of these functions. The composition function is defined as 

$$g(x) = f_1 \circ f_2 \circ \cdots \circ f_n(x) = f_1(f_2(\cdots f_n(x)))$$

Note that the primitives are evaluated from the innermost function to the outermost one.  In addition, the derivative of each of these operations is already defined inside TensorFlow. TensorFlow then uses the chain rule, which states that the derivative of $g(x) = f_1 \circ f_2 \circ \cdots \circ f_n(x)$ is 

$$g'(x_0) = f_n'(x_0) \times f_{n-1}'(f_n(x_0)) \times \cdots \times f_1'(f_2(\cdots f_n(x_0)))$$
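
The chain rule over a composition of primitives can be sketched in plain Python. This is only an illustration of the idea, not how TensorFlow is implemented; the `PRIMITIVES` table and `chain_derivative` are hypothetical names.

```python
import math

# Each primitive is a (function, derivative) pair.
PRIMITIVES = {
    "sin": (math.sin, math.cos),
    "exp": (math.exp, math.exp),
    "square": (lambda x: x * x, lambda x: 2 * x),
}

def chain_derivative(names, x0):
    """Value and derivative of f_1(f_2(...f_n(x))) at x0, with names = [f_1, ..., f_n]."""
    # Forward pass: evaluate from the innermost function and remember each input.
    inputs = []
    value = x0
    for name in reversed(names):
        inputs.append(value)
        value = PRIMITIVES[name][0](value)
    # Backward pass: multiply the local derivatives from the outermost function inward.
    grad = 1.0
    for name, inp in zip(names, reversed(inputs)):
        grad *= PRIMITIVES[name][1](inp)
    return value, grad

# g(x) = exp(sin(x**2)) evaluated at x = 0.5
value, grad = chain_derivative(["exp", "sin", "square"], 0.5)
print(value, grad)
```

The forward pass stores the input of every primitive because each local derivative in the backward pass is evaluated at exactly that input.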


First we import the necessary libraries 

In [0]:
import tensorflow as tf
import tensorflow.contrib.eager as tfe

We will use eager execution to evaluate the operations directly 

In [0]:
tf.enable_eager_execution()

### Real-Valued Functions

Here we look at the functions of the form $f: \mathbb{R} \to \mathbb{R}$. So the input is a real number and the output is a real number.  

**Example**  
Let us evaluate the gradient of a simple function $f(x) = x^2$. In mathematics we evaluate the gradient using the limit of the difference quotient

$$ f'(x) = \lim_{h \to 0} \frac{(x+h)^2 - x^2}{h} = \lim_{h \to 0} \frac{x^2 + 2hx + h^ 2 - x^ 2}{h} = \lim_{h \to 0} \frac{h(2x + h)}{h} = 2x  $$

In [0]:
#define the function to differentiate
def f(x):
  return x**2

#evaluate the gradient as a yet-to-be evaluated function
g = tfe.gradients_function(f)

Evaluate the derivative of the function at $x = 5.0$

In [0]:
x = tf.Variable(5.)

g(x)[0].numpy()

10.0

**Example**

Suppose that we want to evaluate the gradient of the sigmoid function. 

$$ \sigma (x) = \frac{1}{1+e^{-x}} $$

We can easily prove using calculus that 

$$\sigma'(x) = \sigma(x) (1-\sigma(x))$$
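
For reference, the calculus argument is short. Differentiating $(1+e^{-x})^{-1}$ and using $1 - \sigma(x) = \frac{e^{-x}}{1+e^{-x}}$ gives

$$\sigma'(x) = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma(x)(1-\sigma(x))$$

At $x = 1$ this evaluates to $\sigma(1)(1-\sigma(1)) \approx 0.7311 \times 0.2689 \approx 0.1966$.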

But, we will go the long way! We first decompose it into the primitives

$$ \sigma (x) = \frac{1}{1+e^{-x}} = f_1 \circ f_2 \circ f_3\circ f_4(x)$$

where we have $f_1 = 1/ x, f_2 = 1+x , f_3 = e^{x}, f_4 = -1 \cdot x$. 

Then TensorFlow will construct the following graph of these primitives. 

![alt text](https://www.researchgate.net/profile/Igor_Macedo_Quintanilha/publication/325694563/figure/fig3/AS:636339552784386@1528726580172/Computational-graph-of-sigmoid-The-values-in-black-red-on-the-top-bottom-of-the-arrows.png)


Now we can evaluate the forward pass by composition and the backward pass by chain rule according to the following table where we evaluate $\sigma'(1)$
$$\begin{array}{|c|c|}  
\hline
 \textbf{Forward} & \textbf{Backward} \\
 \hline
 f_4(1) = -1 & f'_1(f_2(f_3(f_4(1)))) =-0.534 \\  
 f_3(f_4(1)) = 0.368 & f'_2(f_3(f_4(1))) =  1\\
 f_2(f_3(f_4(1))) = 1.368  & f'_3(f_4(1)) = 0.368\\
 f_1(f_2(f_3(f_4(1)))) = 0.731 & f'_4(1) = -1\\
  \hline
\end{array}$$

From the table we see that 

$$\sigma'(1) \approx (-0.534) \times 1 \times 0.368 \times (-1) \approx 0.1965$$

Now let us evaluate the derivative using TensorFlow

In [0]:
def sigmoid(x):
  return 1/(1+tf.exp(-x))

g = tfe.gradients_function(sigmoid)

In [0]:
g(1.)[0].numpy()

0.19661194

## Shape Input Convention 
Most machine learning libraries force the gradient to have the same shape as the input, so that it can be used directly to update the parameters. This convention can conflict with the usual layout rules of matrix calculus. Formally, given $x \in \mathbb{R}^{n_1 \times \cdots \times n_k}$ and $f(x) = y$ with $y \in \mathbb{R}$, we enforce that the gradient $w = f'(x)$ satisfies $w \in  \mathbb{R}^{n_1 \times \cdots \times n_k}$. 
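
The convention can be illustrated without any library. The sketch below approximates the gradient of a scalar function of a $2 \times 3$ input with central differences and stores it in a container of the same shape. The helpers `numerical_gradient` and `sum_of_squares` are assumptions of this example.

```python
def numerical_gradient(f, x, h=1e-6):
    """Central-difference gradient of a scalar f with respect to a nested list x.

    The result has the same nested shape as x.
    """
    grad = [[0.0 for _ in row] for row in x]
    for i, row in enumerate(x):
        for j, original in enumerate(row):
            row[j] = original + h
            plus = f(x)
            row[j] = original - h
            minus = f(x)
            row[j] = original              # restore the entry
            grad[i][j] = (plus - minus) / (2 * h)
    return grad

def sum_of_squares(x):
    return sum(v * v for row in x for v in row)

x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]     # shape (2, 3)
w = numerical_gradient(sum_of_squares, x)  # also shape (2, 3)
print(w)
```

Because `w` has the same shape as `x`, an update such as `x[i][j] -= rate * w[i][j]` can be applied entry by entry.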



## High Dimension Gradient

### Gradient of vectors 

In mathematics we usually use the term gradient to generalize the derivative to higher dimensions. We define a real-valued function with vector inputs as 

$$f: \mathbb{R}^n \to \mathbb{R} $$

Hence we can write $y = f(x)$ where $x = (x_1, x_2, \cdots, x_n)$ and $y \in \mathbb{R}$. Then we define the gradient as 

$$\nabla f(x) = \left( \frac{ \partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \cdots, \frac{\partial y}{\partial x_n}\right)$$


**Example**

The Euclidean norm operates on vectors 

$$\Vert x \Vert = \sqrt{\sum_{i=1}^n x^2_i}$$

So this function sums the squares of the components and takes the root. What is the gradient of the squared norm $f(x) = \Vert x \Vert ^2$ ? 


$$\frac{\partial f}{\partial x_1} = 2x_1, \frac{\partial f}{\partial x_2} = 2x_2, \cdots , \frac{\partial f}{\partial x_n} = 2x_n$$

In a simpler format we have 

$$\nabla f = (2x_1, \cdots 2 x_n ) = 2 x $$

In [0]:
#create a variable with three components 
x = tf.Variable([1., 2. , 3.])

#define the squared norm 
def squared_norm(x):
  return tf.reduce_sum(tf.square(x))

#evaluate the gradient
g = tfe.gradients_function(squared_norm)

In [0]:
g(x)[0].numpy()

array([2., 4., 6.], dtype=float32)

We can compute the second derivative in a similar way

In [0]:
#create a column vector with three components 
x = tf.Variable([[1.], [2.], [3.]])

#define the operation 
def op(x):
  return tf.square(x)

dx = tfe.gradients_function

#compute the second order derivative
g = dx(dx(op))

In [0]:
g(x)[0].numpy()

array([[2.],
       [2.],
       [2.]], dtype=float32)
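
As a sanity check that does not rely on TensorFlow, the second derivative of $x^2$ can also be approximated with a central second difference. The helper `second_difference` and the step size are assumptions of this sketch.

```python
def second_difference(f, x, h=1e-3):
    """Central approximation of f''(x): (f(x+h) - 2 f(x) + f(x-h)) / h**2."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

def square(t):
    return t * t

# Same three points as the column vector above
values = [second_difference(square, x) for x in (1.0, 2.0, 3.0)]
print(values)   # each entry should be close to 2
```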

We can also compute the gradient of functions with two variables

In [0]:
#create two column vectors with three components each 
x = tf.Variable([[1.], [2.], [3.]])
y = tf.Variable([[2.], [4.], [6.]])

#define the operation 
def op(x, y):
  return x + y 

g = tfe.gradients_function(op)

In [0]:
g(x, y)

[<tf.Tensor: id=190, shape=(3, 1), dtype=float32, numpy=
 array([[1.],
        [1.],
        [1.]], dtype=float32)>,
 <tf.Tensor: id=190, shape=(3, 1), dtype=float32, numpy=
 array([[1.],
        [1.],
        [1.]], dtype=float32)>]

## The gradient of a matrix

 Mainly we define a real valued function with matrix inputs as 

$$f: \mathbb{R}^{n \times m} \to \mathbb{R} $$

where we define $n$ as the number of rows and $m$ as the number of columns. We usually prefer to work with square matrices because of their nice properties. However, this approach should generalize easily to rectangular matrices 

**Example**

Let us define the [Frobenius norm](http://mathworld.wolfram.com/FrobeniusNorm.html) for a given matrix $A$

$$\Vert A \Vert_F = \sqrt{\sum_{i=1}^n \sum_{j=1}^m |a_{ij}|^2} $$

To make things simpler we work with the squared norm, $\Vert A \Vert_F^2  = a_{11}^2 + \dots + a_{nm}^2$ for real entries.  Using this form we can easily deduce that the gradient with respect to the matrix can be evaluated entry by entry as 

$$\frac{\partial \Vert A \Vert_F^2}{\partial a_{ij}} = 2 a_{ij}$$

Or in simpler terms 

$$\frac{\partial \Vert A \Vert_F^2}{\partial A} = 2 A$$





In [0]:
#create a constant 3x3 matrix 
A = tf.constant([[2., 3., 4.], [5., 6., 7.], [8., 9. , 10.]])

#define the squared Frobenius norm 
def squared_frobenius_norm(A):
  return tf.reduce_sum(tf.square(A))

#evaluate the gradient
g = tfe.gradients_function(squared_frobenius_norm)

In [0]:
g(A)[0].numpy()

array([[ 4.,  6.,  8.],
       [10., 12., 14.],
       [16., 18., 20.]], dtype=float32)

## Jacobian Matrix

The Jacobian matrix collects the first-order partial derivatives of a vector-valued function. Vector-valued functions are defined as $f: \mathbb{R}^n \to \mathbb{R}^m$.


Given $x \in \mathbb{R}^n$ and $f_j : \mathbb{R}^n \to \mathbb{R}$ we have

$$f(x) = \begin{bmatrix}
    f_1(x) \\
    f_2(x) \\
    \vdots \\
    f_m(x)
\end{bmatrix}$$
 
 We can then define the Jacobian as 
 
 $$J(x) = \begin{bmatrix}
    \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2}  & \dots  & \frac{\partial f_1}{\partial x_n} \\
    \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2}  & \dots  & \frac{\partial f_2}{\partial x_n}  \\
    \vdots & \vdots & \ddots & \vdots \\
    \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2}  & \dots  & \frac{\partial f_m}{\partial x_n} 
\end{bmatrix} $$
 
Hence each row is the gradient of a real-valued function of a vector input. Note that under the shape convention the result must have the same shape as the input vector, so instead of the full matrix we get the sum of the rows of $J$, which is the gradient of the sum of the outputs. 

**Example**

Define $x \in \mathbb{R}^n$ and $f: \mathbb{R}^n \to \mathbb{R}^2$ where $f(x) = ( x_1 , x_2)$. It is customary to write it in expanded form

$$f \left( \begin{bmatrix}
    x_{1} \\
    x_{2} \\
    \vdots \\
    x_{n}
\end{bmatrix} \right) = \begin{bmatrix}
     x_1 \\
     x_2 \\
\end{bmatrix}$$

The Jacobian can be evaluated as 

$$J \left(x \right) = \begin{bmatrix}
     \frac{\partial x_1}{\partial x} \\
     \frac{\partial x_2}{\partial x} \\
\end{bmatrix} = \begin{bmatrix}
     1 & 0 & 0 & \cdots & 0 \\
    0 & 1 & 0 & \cdots & 0  \\
\end{bmatrix}$$

Since the result must have the same shape as the input, we sum the rows of $J$; the entries for the variables that do not appear in the output are zero

$$J(x)^T \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix}
   1 & 1 & 0 & \cdots & 0 
\end{bmatrix}^ T$$


In [0]:
#create a column vector with three components 
x = tf.Variable([[1.], [2.], [3.]])

#define the operation (named first_two to avoid shadowing the built-in slice)
def first_two(x):
  return x[0:2]

g = tfe.gradients_function(first_two)

In [0]:
g(x)[0].numpy()

array([[1.],
       [1.],
       [0.]], dtype=float32)
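
The relation between the full Jacobian and the summed result can be checked with a small finite-difference sketch in plain Python. The helpers `numerical_jacobian` and `first_two` are assumptions of this example and work on flat lists rather than tensors.

```python
def numerical_jacobian(f, x, h=1e-6):
    """Central-difference Jacobian: rows index outputs, columns index inputs."""
    m = len(f(x))
    jac = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        plus = list(x)
        plus[j] += h
        minus = list(x)
        minus[j] -= h
        f_plus, f_minus = f(plus), f(minus)
        for i in range(m):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return jac

def first_two(x):
    return x[0:2]

x = [1.0, 2.0, 3.0]
J = numerical_jacobian(first_two, x)    # 2 x 3 matrix
# Summing over the rows gives one entry per input, matching the input shape.
column_sums = [sum(row[j] for row in J) for j in range(len(x))]
print(J)
print(column_sums)
```

The column sums are approximately $(1, 1, 0)$, the same values as the output above.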

**Example**

Let $A$ be an $m \times n$ constant matrix and $x$ be an $n\times 1$ vector 

$$f(x) = A x $$

In an expanded form this is like 

$$f(x) = \begin{bmatrix}
    a_{11} & a_{12} & a_{13} & \dots  & a_{1n} \\
    a_{21} & a_{22} & a_{23} & \dots  & a_{2n} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    a_{m1} & a_{m2} & a_{m3} & \dots & a_{mn}
\end{bmatrix} \times \begin{bmatrix}
    x_{1} \\
    x_{2} \\
    \vdots \\
    x_{n}
\end{bmatrix}  = \begin{bmatrix}
    A_1 \cdot x \\
    A_2 \cdot x \\
    \vdots \\
    A_m \cdot x
\end{bmatrix} $$

where $A_j$ is the $j$-th row of $A$ and $\cdot$ is the dot product of vectors. If we fix the other variables and take the derivative with respect to $x_1$, for instance, each output $A_j \cdot x$ contributes $a_{j1}$, so the partial derivatives form the first column of $A$. Since the output has multiple components, we sum their contributions. In general 

$$ \frac{\partial}{\partial x_i} \sum_{j=1}^m f_j(x) = \sum_{j=1}^m a_{ji}$$

We see that the derivative with respect to each variable is just the sum of the corresponding column of the matrix

In [0]:
#create a column vector with three components 
x = tf.Variable([[1.], [2.], [3.]])

#define the multiplication 
def op(y):
  A = tf.constant([[2., 3., 4.], [5., 6., 7.], [8., 9. , 10.]])
  return tf.matmul(A, y)

#evaluate the gradient
g = tfe.gradients_function(op)

In [0]:
g(x)[0].numpy()

array([[15.],
       [18.],
       [21.]], dtype=float32)
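
The same result can be written compactly. Since $\frac{\partial (Ax)_j}{\partial x_i} = a_{ji}$, the Jacobian of $f(x) = Ax$ is $A$ itself, and summing over the outputs gives

$$\sum_{j=1}^m \frac{\partial (Ax)_j}{\partial x_i} = \sum_{j=1}^m a_{ji} = (A^T \mathbf{1})_i$$

where $\mathbf{1}$ is the vector of ones. For the matrix used in the code above this is

$$A^T \mathbf{1} = \begin{bmatrix} 2 + 5 + 8 \\ 3 + 6 + 9 \\ 4 + 7 + 10 \end{bmatrix} = \begin{bmatrix} 15 \\ 18 \\ 21 \end{bmatrix}$$

which matches the output.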

### References
http://www.columbia.edu/~ahd2125/post/2015/12/5/

https://www.easy-tensorflow.com/tf-tutorials/basics/graph-and-session

https://rufflewind.com/2016-12-30/reverse-mode-automatic-differentiation