<a href="https://colab.research.google.com/github/jjennings955/CSE5368-Spring-2019/blob/master/Computational_Graphs_and_Backpropagation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Computational Graphs
We can create a computational graph for any mathematical function that is built up from simpler operations.

For example:
$f(x,y,z) = 2 x^2 \cos(yz)$

or 
$f(x; W, b) = \sigma(Wx + b)$

We call $x, y, z$ **variables**, and the functions we apply to those variables **operations** (or functions). Variables and operations together are the **nodes** of the graph.

- **variables** can be scalars, vectors, matrices, tensors, or any other mathematical object (but we'll stick to those four in this course)
- **operations** can be any functions of those mathematical objects (functions mapping scalars -> scalars, vectors -> vectors, or any other combination)

Some libraries add further classifications of **nodes** (such as 'placeholders' in TensorFlow), but those are implementation details: important to TensorFlow, but not to the concept of a computational graph.

An **edge** runs from each node to every operation that takes it as an input. Together these nodes and edges form a **directed acyclic graph** (or **DAG**), and we call this graph a computational graph.

## Forward Propagation
Forward propagation is the process of iterating over the graph: we feed values into the variables, evaluate each operation, collect its result, and feed that result into any operations downstream, until the whole graph has been evaluated and the final output computed. It may also be called **feed forward** or **the forward pass**.

We go through the graph in **topological order**. That is, we start by processing a node that has no predecessor in the graph, mark it as 'complete', and only process a node if all its predecessors have already been processed. A graph may have many valid topological orders.

This is intuitively what we do when we evaluate a mathematical expression without necessarily realizing it.
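As a rough illustration of the forward pass described above, here is a minimal sketch that evaluates $f(x,y,z) = 2 x^2 \cos(yz)$ node by node. The dictionary layout and the intermediate node names (`x_sq`, `yz`, and so on) are assumptions made up for this example, not part of any library; the ordering comes from `graphlib.TopologicalSorter` in the Python standard library (Python 3.9+).

```python
import math
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each operation node maps to (function, names of its input nodes).
# The intermediate names are hypothetical, chosen only for this sketch.
operations = {
    "x_sq":     (lambda x: x ** 2,   ["x"]),
    "two_x_sq": (lambda a: 2 * a,    ["x_sq"]),
    "yz":       (lambda y, z: y * z, ["y", "z"]),
    "cos_yz":   (math.cos,           ["yz"]),
    "f":        (lambda a, b: a * b, ["two_x_sq", "cos_yz"]),
}

def forward(operations, inputs):
    """Evaluate every node in topological order and return all node values."""
    values = dict(inputs)  # variables start with the values we feed in
    predecessors = {name: set(args) for name, (_, args) in operations.items()}
    for name in TopologicalSorter(predecessors).static_order():
        if name in values:  # a variable: nothing to compute
            continue
        fn, args = operations[name]
        values[name] = fn(*(values[a] for a in args))
    return values

values = forward(operations, {"x": 3.0, "y": 0.5, "z": 2.0})
print(values["f"])  # should match 2 * 3.0**2 * cos(0.5 * 2.0)
```

Every node is computed only after all of its predecessors, which is exactly the condition a topological order guarantees.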

# Backpropagation
Backward propagation is a process for computing gradients in computational graphs. Because deep learning relies on **gradient-based optimization**, a method for efficiently and automatically computing the gradient of a complex function is highly desirable.

In other fields, backpropagation may be called **reverse-mode automatic differentiation**.

In backward propagation, we iterate over the graph in **reverse topological order**, compute gradients, and feed those gradients backwards using **the chain rule**. You can think of backpropagation as simply a method for efficiently implementing the chain rule in multiple dimensions, for potentially very deeply nested chains of functions.
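The following is a small sketch of that idea for $f(x,y,z) = 2 x^2 \cos(yz)$, under the assumption that each node stores its inputs, a forward function, and a function returning its local partial derivatives. The table layout and node names are invented for this example and do not correspond to any particular library.

```python
import math

# Nodes listed in a valid topological order.
# Each entry: name -> (inputs, forward function, local partial derivatives).
# Names and layout are hypothetical, chosen only for this sketch.
graph = {
    "x_sq":   (["x"],      lambda x: x ** 2,   lambda x: [2 * x]),
    "yz":     (["y", "z"], lambda y, z: y * z, lambda y, z: [z, y]),
    "cos_yz": (["yz"],     math.cos,           lambda v: [-math.sin(v)]),
    "f":      (["x_sq", "cos_yz"],
               lambda a, c: 2 * a * c,
               lambda a, c: [2 * c, 2 * a]),
}

def backprop(graph, inputs, output="f"):
    # Forward pass: dicts keep insertion order, which is topological here.
    values = dict(inputs)
    for name, (args, fn, _) in graph.items():
        values[name] = fn(*(values[a] for a in args))

    # Backward pass: reverse topological order, seeded with d(output)/d(output) = 1.
    grads = {name: 0.0 for name in values}
    grads[output] = 1.0
    for name in reversed(list(graph)):
        args, _, local_grad = graph[name]
        partials = local_grad(*(values[a] for a in args))
        for arg, partial in zip(args, partials):
            # Chain rule: upstream gradient times local partial, accumulated.
            grads[arg] += grads[name] * partial
    return values[output], grads

value, grads = backprop(graph, {"x": 3.0, "y": 0.5, "z": 2.0})
print(value, grads["x"], grads["y"], grads["z"])
```

For these inputs the gradients can be checked by hand against $\partial f/\partial x = 4x\cos(yz)$, $\partial f/\partial y = -2x^2 z \sin(yz)$, and $\partial f/\partial z = -2x^2 y \sin(yz)$.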

## The Chain Rule
In a single dimension, as in Calculus 101, the chain rule is as follows. Given:

$y = f(x)$ and $x = g(t)$

$$\frac{dy}{dt} = \frac{dy}{dx} \frac{dx}{dt}$$
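To make this concrete, here is a quick numerical check of the single-variable chain rule. The particular choices $f(x) = \sin(x)$ and $g(t) = t^2$ are assumptions picked only for illustration, and the finite-difference helper is just a sanity check, not part of the method.

```python
import math

def g(t):  # x = g(t)
    return t ** 2

def f(x):  # y = f(x)
    return math.sin(x)

def dy_dt(t):
    """Chain rule: dy/dt = dy/dx * dx/dt."""
    x = g(t)
    dy_dx = math.cos(x)  # derivative of sin(x)
    dx_dt = 2 * t        # derivative of t^2
    return dy_dx * dx_dt

def numeric_derivative(fn, t, h=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (fn(t + h) - fn(t - h)) / (2 * h)

t = 1.3
analytic = dy_dt(t)
numeric = numeric_derivative(lambda s: f(g(s)), t)
print(analytic, numeric)  # the two values should agree closely
```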

In multiple dimensions (but still with scalar-valued functions), it gets slightly more complex, because the graph may branch and rejoin instead of forming a single linear chain (as in the examples we've seen).

Given $u = f(x,y)$ where $x = f_1(r,t)$ and $y = f_2(r,t)$

$$\frac{\partial u}{\partial t} = \frac{\partial u}{\partial x} \frac{\partial x}{\partial t} + \frac{\partial u}{\partial y} \frac{\partial y}{\partial t}$$

In English: as we go backwards through a computational graph, we **multiply along linear chains**, and **add where two chains converge**.
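A small worked example may help here. The sketch below assumes $u = xy$, $x = rt$, and $y = r + t$ (functions chosen only for illustration), computes $\partial u / \partial t$ by multiplying along each path and adding the two paths, and compares the result with a finite-difference estimate.

```python
def f1(r, t):  # x = f_1(r, t)
    return r * t

def f2(r, t):  # y = f_2(r, t)
    return r + t

def f(x, y):   # u = f(x, y)
    return x * y

def du_dt(r, t):
    x, y = f1(r, t), f2(r, t)
    du_dx, du_dy = y, x    # partials of u = x * y
    dx_dt, dy_dt = r, 1.0  # partials of x = r * t and y = r + t
    path_through_x = du_dx * dx_dt  # multiply along one chain
    path_through_y = du_dy * dy_dt  # multiply along the other chain
    return path_through_x + path_through_y  # add where the chains converge

r, t, h = 2.0, 3.0, 1e-6
analytic = du_dt(r, t)
numeric = (f(f1(r, t + h), f2(r, t + h))
           - f(f1(r, t - h), f2(r, t - h))) / (2 * h)
print(analytic, numeric)  # the two values should agree closely
```

Expanding by hand gives $u = r^2 t + r t^2$, so $\partial u / \partial t = r^2 + 2rt$, which is what the two paths add up to.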


## Backpropagation Motivation
Rather than writing down a mathematical expression, symbolically deriving a formula for its gradient using calculus, attempting to simplify it into something elegant, and then implementing it in code, we can implement a library of general-purpose operations. For each operation we only need to define its **forward** behavior and its **backward** behavior.

The forward behavior simply computes the output of the function (as we've seen); the backward behavior computes the gradient of the output with respect to its inputs.

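```python
class Operation:
  """Base class sketch: subclasses are assumed to supply f and gradient_f."""
  def forward(self, x):
    self.x = x  # cache the input; backward needs it
    return self.f(x)
  def backward(self):
    # gradient of the output with respect to the cached input
    return self.gradient_f(self.x)
```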
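As one possible concrete instance of this forward/backward interface, the sketch below defines two scalar operations and chains them to differentiate $h(x) = \cos(x^2)$. The class names `Square` and `Cos` are hypothetical, and caching the input on `self` is an assumption about how `backward` gets access to it.

```python
import math

class Square:
    def forward(self, x):
        self.x = x         # remember the input for the backward pass
        return x ** 2
    def backward(self):
        return 2 * self.x  # d(x^2)/dx

class Cos:
    def forward(self, x):
        self.x = x
        return math.cos(x)
    def backward(self):
        return -math.sin(self.x)  # d(cos x)/dx

# Forward pass through the chain h(x) = cos(x^2).
square, cos = Square(), Cos()
x = 1.5
out = cos.forward(square.forward(x))

# Backward pass: visit the operations in reverse order and multiply
# the local gradients together (the chain rule).
grad = 1.0
for op in (cos, square):
    grad *= op.backward()

print(out, grad)  # grad should equal -sin(x^2) * 2x
```

Neither class knows anything about the other; the chain rule is applied entirely by the loop that walks the operations in reverse.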


## Vector Chain Rule
In deep learning, most of our functions will be vector -> vector (or even tensor -> tensor) functions. The rules for these can be derived from the vector -> scalar case (we'll skip the proof). Consider:

$f : \mathbb{R}^n \rightarrow \mathbb{R}^m$

You can think of a vector to vector function as $m$ functions of the vector $x$, stacked into a vector.

$f(\textbf{x}) = \begin{bmatrix} f_1(\textbf{x}) \\ f_2(\textbf{x}) \\ \vdots \\ f_m(\textbf{x})\end{bmatrix}$


The Jacobian of f is defined as:

$\textbf{J}_f = \begin{bmatrix}{\dfrac {\partial f_{1}}{\partial x_{1}}}&\cdots &{\dfrac {\partial f_{1}}{\partial x_{n}}}\\\vdots &\ddots &\vdots \\{\dfrac {\partial f_{m}}{\partial x_{1}}}&\cdots &{\dfrac {\partial f_{m}}{\partial x_{n}}}\end{bmatrix}$

It can be seen that row $i$ of the Jacobian is the gradient (transposed) of $f_i(\textbf{x})$ with respect to $\textbf{x}$.
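To see the row structure numerically, here is a sketch that builds the Jacobian of a small function $f : \mathbb{R}^2 \rightarrow \mathbb{R}^3$ two ways: analytically, one gradient per row, and with central finite differences. The example function is an assumption chosen only for illustration, and plain Python lists are used so the block has no dependencies.

```python
import math

def f(x):
    """Example f: R^2 -> R^3, chosen only for illustration."""
    return [x[0] * x[1], math.sin(x[0]), x[1] ** 2]

def analytic_jacobian(x):
    # Row i holds the partial derivatives of f_i with respect to x_1..x_n.
    return [
        [x[1],           x[0]],
        [math.cos(x[0]), 0.0],
        [0.0,            2 * x[1]],
    ]

def numeric_jacobian(fn, x, h=1e-6):
    """Approximate the m x n Jacobian with central finite differences."""
    m, n = len(fn(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        plus, minus = list(x), list(x)
        plus[j] += h
        minus[j] -= h
        f_plus, f_minus = fn(plus), fn(minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return J

x = [0.7, -1.2]
J_exact = analytic_jacobian(x)
J_approx = numeric_jacobian(f, x)
for row_exact, row_approx in zip(J_exact, J_approx):
    print(row_exact, row_approx)  # each pair of rows should agree closely
```

The result has $m = 3$ rows and $n = 2$ columns, matching the shape in the definition above.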

We introduce the Jacobian because it is used in the most general form of the chain rule (so far). Consider:

$y = f(u) = (f_1(u), \ldots, f_k(u))$

and

$u = g(x) = (g_1(x), \ldots, g_m(x))$
