# Gradients, a work in progress...
---
This notebook goes through the most essential operators in detail, covering the forward and backward pass of each. The operators are grouped by category, with one sub-title per category below. For the sake of simplicity, let's assume that the hypothetical tensors $x$ and $y$ are $n$-dimensional, i.e. $x = \{x_1, ..., x_n\}$, for all defined operations.


---
## Unary ops
The unary operations act on a single tensor $x$, so only one gradient has to be produced. The majority of these operators are either activation functions themselves or building blocks for activation functions. Most of them rely only on exponentiation and logarithms and are rather straightforward to differentiate.

---
### Log
Apply the natural logarithm on $x$,<br><br>
$\large f(x) = \log(x)$<br><br>
and the gradient follows from the derivative rule of the logarithm<br><br>
$\large \frac{\partial f}{\partial x} = \frac{\partial}{\partial x}\log(x) = \frac{1}{x}$
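
A minimal NumPy sketch of this forward/backward pair might look as follows. The function names `log_forward` and `log_backward` and the upstream gradient argument `grad_out` are illustrative assumptions, not part of any particular framework; the finite-difference comparison is only there as a sanity check.

```python
import numpy as np

def log_forward(x):
    # Natural logarithm, applied elementwise (assumes x > 0).
    return np.log(x)

def log_backward(x, grad_out):
    # Chain rule: upstream gradient times d/dx log(x) = 1/x.
    return grad_out / x

x = np.array([0.5, 1.0, 2.0, 4.0])
grad = log_backward(x, np.ones_like(x))

# Central finite differences as an independent check of the analytic gradient.
eps = 1e-6
numeric = (log_forward(x + eps) - log_forward(x - eps)) / (2 * eps)
print(np.allclose(grad, numeric))
```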

---
### Exp
Raise Euler's number $e$ to the power $x$, i.e. apply the exponential function to the tensor, which is a special case of the pow operation,<br><br>
$\large f(x) = e^x$<br><br>
and the gradient follows from the derivative rule of the exponential function<br><br>
$\large \frac{\partial f}{\partial x} = \frac{\partial}{\partial x}e^x = e^x$
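
Since the derivative equals the function itself, the backward pass can reuse the forward output instead of recomputing the exponential. The sketch below assumes that the output `out` is kept around between the two passes; the names `exp_forward`, `exp_backward` and `grad_out` are hypothetical.

```python
import numpy as np

def exp_forward(x):
    # Elementwise e^x.
    return np.exp(x)

def exp_backward(out, grad_out):
    # `out` is e^x saved from the forward pass, which is also d/dx e^x.
    return grad_out * out

x = np.array([-1.0, 0.0, 1.0])
out = exp_forward(x)
grad = exp_backward(out, np.ones_like(x))
```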

---
### ReLU (Rectified Linear Unit)
Apply the maximum operator on $x$, i.e. all values $x_i < 0$ are set to $0$,<br><br>
$\large f(x) = \max(0, x)$<br><br>
and this gradient is a bit less trivial, since it does not follow from a standard derivative rule. The ReLU operator leaves values $x_i > 0$ unchanged, so the gradient is a tensor filled with ones in which the entries at indices $\{i : x_i \leq 0\}$ are set to $0$, so that the gradient only propagates to scalars $x_i > 0$ (the derivative at $x_i = 0$ is undefined and is set to $0$ here by convention)<br><br>
$\large \frac{\partial f}{\partial x} = \frac{\partial}{\partial x}\max(0, x) = \left(\frac{\partial}{\partial x_1}\max(0, x_1), ..., \frac{\partial}{\partial x_n}\max(0, x_n)\right) = 1_{[x_i > 0]}$
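
The indicator $1_{[x_i > 0]}$ maps directly onto a boolean mask. The following is a small sketch under that reading; `relu_forward`, `relu_backward` and `grad_out` are assumed names, and the gradient at exactly $x_i = 0$ is taken to be $0$.

```python
import numpy as np

def relu_forward(x):
    # Elementwise max(0, x).
    return np.maximum(0.0, x)

def relu_backward(x, grad_out):
    # The mask is 1 where x > 0 and 0 elsewhere (including x == 0).
    return grad_out * (x > 0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
out = relu_forward(x)
grad = relu_backward(x, np.ones_like(x))
```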

---
### Sigmoid
Apply the standard logistic function to $x$,<br><br>
$\large \sigma(x) = \frac{1}{1 + e^{-x}}$<br><br>
To find the gradient of the sigmoid function we need the *reciprocal rule*, which can be defined as<br><br>
$\large g'(x) = \frac{\partial}{\partial x} \left(\frac{1}{f(x)}\right) = -\frac{f'(x)}{f(x)^2}$<br><br>
Simply put, it gives the derivative of the reciprocal of a function $f$ in terms of the derivative of $f$. It requires that $f$ is differentiable at a point $x$ and that $f(x) \neq 0$; then $g(x) = \frac{1}{f(x)}$ is also differentiable at $x$. The gradient can then be derived accordingly,<br><br>
$\large \frac{\partial \sigma}{\partial x} = \frac{\partial}{\partial x}\frac{1}{1 + e^{-x}} = \frac{\partial}{\partial x}(1 + e ^{-x})^{-1} = \{\mathrm{reciprocal\;rule}\} = -(1+e^{-x})^{-2}\frac{\partial}{\partial x}(1+e^{-x}) =$<br><br>$=\large-(1+e^{-x})^{-2}\frac{\partial}{\partial x}e^{-x} = (1+e^{-x})^{-2}\cdot e^{-x} = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{(1 + e^{-x})}\cdot \frac{e^{-x} + 1 - 1}{(1 + e^{-x})} = \sigma(x)(1 - \sigma(x))$<br><br>
As we can see, the gradient of the sigmoid function can be expressed in terms of the function itself, great! So when performing the forward pass $\sigma(x)$ we just store the result and reuse it in the backward pass to quickly acquire the gradient.
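
As a rough sketch of that caching idea, the backward function below takes the stored forward output rather than the input. The names `sigmoid_forward`, `sigmoid_backward` and `grad_out` are assumptions made for illustration, and the finite-difference comparison only serves as a check of the derivation.

```python
import numpy as np

def sigmoid_forward(x):
    # Standard logistic function, applied elementwise.
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_backward(out, grad_out):
    # `out` is sigma(x) saved from the forward pass: sigma' = sigma * (1 - sigma).
    return grad_out * out * (1.0 - out)

x = np.array([-2.0, 0.0, 3.0])
out = sigmoid_forward(x)
grad = sigmoid_backward(out, np.ones_like(x))

# Central finite differences as an independent check.
eps = 1e-6
numeric = (sigmoid_forward(x + eps) - sigmoid_forward(x - eps)) / (2 * eps)
print(np.allclose(grad, numeric))
```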

---
## Binary ops
The binary operations apply the specified operator between two tensors $x$ and $y$. The shapes of $x$ and $y$ do not have to be the same, and in practice these operators usually support what is called *broadcasting*. The term basically means that the smaller of $x$ and $y$ is (virtually) stretched to match the shape of the larger, so that the operation can be applied elementwise and the resulting tensor has a shape compatible with both. In practice it has the effect of vectorizing an operation, which leads to simple, robust, and fast code. Because two tensors take part in producing the result, there are two gradients to derive.
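
To make broadcasting concrete, here is a small NumPy illustration with arbitrarily chosen example shapes. It shows that adding a `(3,)` tensor to a `(2, 3)` tensor gives the same result as first expanding the smaller tensor explicitly with `np.broadcast_to`.

```python
import numpy as np

x = np.arange(6.0).reshape(2, 3)   # shape (2, 3)
y = np.array([10.0, 20.0, 30.0])   # shape (3,)

# Implicit broadcasting: y is treated as if it were repeated along axis 0.
out = x + y

# The same computation with the expansion written out explicitly.
y_expanded = np.broadcast_to(y, x.shape)
out_explicit = x + y_expanded

print(out.shape)
```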

---
### Add
Perform broadcastable addition of $x$ and $y$,<br><br>
$\large f(x, y) = x + y$<br><br>
and the gradients can easily be derived as<br><br>
$\large \frac{\partial f}{\partial x} = \frac{\partial}{\partial x}(x + y) = 1$<br><br>
$\large \frac{\partial f}{\partial y} = \frac{\partial}{\partial y}(x + y) = 1$
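
Because both local derivatives are $1$, the upstream gradient passes through unchanged. When one operand was broadcast, its gradient additionally has to be summed over the broadcast axes so that it matches the operand's original shape. The sketch below assumes NumPy broadcasting semantics; `unbroadcast` and `add_backward` are hypothetical helper names.

```python
import numpy as np

def unbroadcast(grad, shape):
    # Hypothetical helper: reduce `grad` back to `shape` by undoing broadcasting.
    # Sum away leading axes that broadcasting prepended.
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # Sum over axes that had size 1 in the original shape but were stretched.
    for axis, size in enumerate(shape):
        if size == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

def add_backward(x, y, grad_out):
    # Local derivatives are both 1, so only the shapes need fixing.
    return unbroadcast(grad_out, x.shape), unbroadcast(grad_out, y.shape)

x = np.zeros((2, 3))
y = np.zeros((3,))
grad_out = np.ones((2, 3))       # upstream gradient, same shape as x + y
gx, gy = add_backward(x, y, grad_out)
print(gx.shape, gy.shape)
```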

---
### Sub
Perform broadcastable subtraction of $y$ from $x$,<br><br>
$\large f(x, y) = x - y$<br><br>
and the gradients are respectively<br><br>
$\large \frac{\partial f}{\partial x} = \frac{\partial}{\partial x}(x - y) = 1$<br><br>
$\large \frac{\partial f}{\partial y} = \frac{\partial}{\partial y}(x - y) = -1$

---
### Mul
Perform broadcastable multiplication of $x$ and $y$,<br><br>
$\large f(x, y) = x \cdot y$<br><br>
and the gradients are easily derived from the product rule as,<br><br>
$\large \frac{\partial f}{\partial x} = \frac{\partial}{\partial x}(x \cdot y) = y$<br><br>
$\large \frac{\partial f}{\partial y} = \frac{\partial}{\partial y}(x \cdot y) = x$
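
In code, each operand's gradient is the upstream gradient scaled by the other operand. This sketch uses same-shaped tensors to keep broadcasting out of the picture; `mul_backward` and `grad_out` are assumed names.

```python
import numpy as np

def mul_backward(x, y, grad_out):
    # d(x*y)/dx = y and d(x*y)/dy = x, each scaled by the upstream gradient.
    return grad_out * y, grad_out * x

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
grad_out = np.array([1.0, 0.5, 2.0])   # arbitrary upstream gradient
gx, gy = mul_backward(x, y, grad_out)
```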

---
### Pow
Perform broadcastable exponentiation of $x$ by $y$,<br><br>
$\large f(x, y) = x ^ y$<br><br>
and the gradients are now a bit more tricky, but by using the standard derivative rules (the power rule for $x$ and the exponential rule for $y$, which requires $x > 0$) we can derive them to be,<br><br>
$\large \frac{\partial f}{\partial x} = \frac{\partial}{\partial x}(x ^ y) = y \cdot x ^{(y - 1)}$<br><br>
$\large \frac{\partial f}{\partial y} = \frac{\partial}{\partial y}(x ^ y) = \log(x) \cdot x^y$
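
Both formulas can be sketched in a few lines of NumPy. The example assumes $x > 0$, since the gradient with respect to $y$ involves the natural logarithm of $x$; `pow_backward` and `grad_out` are illustrative names.

```python
import numpy as np

def pow_backward(x, y, grad_out):
    # Assumes x > 0 so that log(x) is defined.
    gx = grad_out * y * np.power(x, y - 1)          # power rule
    gy = grad_out * np.log(x) * np.power(x, y)      # exponential rule
    return gx, gy

x = np.array([2.0, 3.0])
y = np.array([3.0, 2.0])
gx, gy = pow_backward(x, y, np.ones_like(x))
```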

---
## Reduce ops

---
### Mean
Assuming that we take the mean over the entire ndarray $x$ with $N$ elements,<br>
$f(x) = \frac{1}{N}\sum_i x_i$<br>
$\frac{\partial f}{\partial x_i} = \frac{1}{N}$, i.e. $\frac{\partial f}{\partial x} = \frac{1}{N}(1_1, \ldots, 1_N)$<br>

Now, let's assume that $x$ is n-dimensional, with a varying number of elements in each dimension, and that we want to take the mean over some specific axis. Let's denote the product of the input shape and of the output shape as $\alpha$ and $\beta$ respectively, so that each output element $f_j$ averages the $\frac{\alpha}{\beta}$ input elements with indices $R_j$ along the reduced axis, then<br>
$f_j(x) = \frac{\beta}{\alpha} \sum_{i \in R_j} x_i$<br>
$\frac{\partial f_j}{\partial x_i} = \frac{\beta}{\alpha}$ for $i \in R_j$, and $0$ otherwise
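
A possible NumPy sketch of the backward pass covers both cases: every input element receives the upstream gradient of the output element it contributed to, divided by the number of elements that were averaged together. The name `mean_backward`, its `axis` parameter and `grad_out` are assumptions for illustration.

```python
import numpy as np

def mean_backward(x, grad_out, axis=None):
    # Number of input elements averaged into each output element.
    n_reduced = x.size // np.size(grad_out)
    if axis is not None:
        # Re-insert the reduced axis so the gradient broadcasts against x.
        grad_out = np.expand_dims(grad_out, axis)
    return np.broadcast_to(grad_out / n_reduced, x.shape).copy()

x = np.arange(6.0).reshape(2, 3)

# Mean over the whole array: every element gets 1/N with N = 6.
grad_full = mean_backward(x, 1.0)

# Mean over axis 1: output shape (2,), each output averages 3 elements.
grad_axis = mean_backward(x, np.ones(2), axis=1)
```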

---
### Sum
Once again, let's assume that we sum over the full ndarray $x$,<br>
$f(x) = \sum_i x_i$<br>
$\frac{\partial f}{\partial x_i} = 1$, i.e. $\frac{\partial f}{\partial x} = (1, \ldots, 1)$
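
Since every element contributes with weight $1$, the backward pass simply copies the upstream gradient back to every element that was summed. The sketch below also handles a sum over a single axis, which is an extension beyond the full-array case above; `sum_backward`, its `axis` parameter and `grad_out` are hypothetical names.

```python
import numpy as np

def sum_backward(x, grad_out, axis=None):
    if axis is not None:
        # Re-insert the reduced axis so the gradient broadcasts against x.
        grad_out = np.expand_dims(grad_out, axis)
    # Each input element receives the gradient of the output it was summed into.
    return np.broadcast_to(grad_out, x.shape).copy()

x = np.arange(6.0).reshape(2, 3)

# Sum over the whole array: a tensor of ones times the upstream scalar.
grad_full = sum_backward(x, 1.0)

# Sum over axis 0: output shape (3,), gradient is repeated along axis 0.
grad_axis = sum_backward(x, np.array([1.0, 2.0, 3.0]), axis=0)
```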
