# Automatic Differentiation Explained 
### using PyTorch and Python with Applications to Linear Regression solved using Gradient Descent

In this notebook, you can learn about:

* what a tensor is in deep learning
* how scalar and array tensors are created and used in PyTorch
* how the PyTorch `autograd` framework can be used for linear regression
* how reverse mode automatic differentiation can be implemented using Python
* the design principles behind the PyTorch `autograd` application programming interface

## Import the __`torch`__ package and __`matplotlib`__ for graphics.

Initialize the pseudo-random number generator and report the installed version of PyTorch.

In [None]:
import torch as pt
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (15, 8)
pt.manual_seed(42)
pt.__version__

## Tensor is the core PyTorch data structure (container)

* in deep learning, a tensor is a multi-dimensional array
* the number of indices needed to access a value in the tensor is equal to the tensor dimension (rank)
* the number of values along every dimension (tensor shape) is constant, in other words, you can't have a matrix with 42 columns in the 1st row and 3 columns in the 2nd row.

<img height="480" src="https://i.imgur.com/sbHDGMs.png"></img>

Let's initialize `X` as a tensor array, with values from an evenly spaced sequence of 50 numbers between -5 and 5, inclusive on both ends.

Assuming that `X` holds the features for the model, let's initialize the targets using the equation:

$ y = 2 * X  + \epsilon $

where $ \epsilon $ is normally distributed noise. The value of `2.` is arbitrary, chosen only to generate some sample data.
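The data described above can be sketched as follows. The choice of standard normal noise (mean 0, standard deviation 1) is an assumption, since the text only says the noise is normal.

```python
import torch as pt

pt.manual_seed(42)

# 50 evenly spaced feature values from -5 to 5, both endpoints included
X = pt.linspace(-5, 5, 50)

# Targets: a line with slope 2 plus one noise draw per feature value
# (assumption: standard normal noise)
y = 2 * X + pt.randn_like(X)

print(X.shape, y.shape)
```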

Now you can plot a scatter graph of `X` and `y` as well as a line plot of `2 * X` without the noise.
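One possible way to draw this plot is sketched below. The labels and the red line color are illustrative choices, and the data is regenerated under the same standard normal noise assumption so that the cell runs on its own.

```python
import torch as pt
import matplotlib.pyplot as plt

pt.manual_seed(42)
X = pt.linspace(-5, 5, 50)
y = 2 * X + pt.randn_like(X)  # assumption: standard normal noise

fig, ax = plt.subplots()
# Noisy targets as a scatter graph
ax.scatter(X.numpy(), y.numpy(), label="y = 2 * X + noise")
# The noise-free line used to generate the targets
ax.plot(X.numpy(), (2 * X).numpy(), color="red", label="2 * X")
ax.set_xlabel("X")
ax.set_ylabel("y")
ax.legend()
```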

## Solve using PyTorch `autograd` and gradient descent

Initialize `w`, the model parameter, as a scalar tensor with `requires_grad=True` (more on this in a moment).

The value of `w` can be changed in place by sampling from a uniform distribution using `pt.nn.init.uniform_`. More complex initializations like Kaiming or Xavier do not apply since `w` is just a single parameter.
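A minimal sketch of this initialization is shown below. The starting value of `0.` is an arbitrary placeholder, and the sampling range is left at the `uniform_` defaults of 0 to 1, which is an assumption about the intended range.

```python
import torch as pt

pt.manual_seed(42)

# Scalar (rank 0) tensor that autograd will track
w = pt.tensor(0., requires_grad=True)

# Overwrite the value in place with a sample from the uniform distribution
pt.nn.init.uniform_(w)

print(w)
```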

Define a `forward` function to return an estimate of `y` from `X` and `w`.

Define an `mse` function to compute the mean squared error loss based on `y` and the estimate of `y`.
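The two functions could look like the sketch below. The argument names and their order are assumptions, since the text only names the functions and what they depend on.

```python
import torch as pt

def forward(X, w):
    # Model estimate of y: a line through the origin with slope w
    return w * X

def mse(y_est, y):
    # Mean of the squared differences between the estimate and the target
    err = y_est - y
    return (err ** 2).mean()

# Small usage example with hand-checkable numbers
X_demo = pt.tensor([1., 2., 3.])
y_demo = pt.tensor([2., 4., 9.])
print(mse(forward(X_demo, pt.tensor(2.)), y_demo))
```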

Use constant values for the gradient descent learning rate and the number of epochs.

In [None]:
EPOCHS = 100
LEARNING_RATE = 0.01

Repeating gradient descent for `EPOCHS` iterations should be enough to obtain a close estimate of the analytical solution to the linear regression problem. 
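A gradient descent loop along these lines might be written as follows. It is a sketch that regenerates the data (assuming standard normal noise) and inlines the forward and loss computations so that it runs on its own.

```python
import torch as pt

pt.manual_seed(42)
EPOCHS = 100
LEARNING_RATE = 0.01

X = pt.linspace(-5, 5, 50)
y = 2 * X + pt.randn_like(X)  # assumption: standard normal noise

w = pt.tensor(0., requires_grad=True)
pt.nn.init.uniform_(w)

for _ in range(EPOCHS):
    # Forward pass and mean squared error loss
    loss = ((w * X - y) ** 2).mean()
    # Backward pass: autograd accumulates d(loss)/dw into w.grad
    loss.backward()
    # Update w without recording the update in the autograd graph
    with pt.no_grad():
        w -= LEARNING_RATE * w.grad
    # Gradients accumulate across backward calls, so reset before the next epoch
    w.grad.zero_()

print(w)
```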

Recall from linear algebra that the ordinary least squares solution is $ (X^TX)^{-1}X^Ty $

In PyTorch `@` is the operator for matrix multiplication.
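The formula can be evaluated with `@` as sketched below. Reshaping `X` into a column matrix is an assumption made so that the transpose and the inverse are well defined. With a single feature, the result reduces to the ratio of two dot products.

```python
import torch as pt

pt.manual_seed(42)
X = pt.linspace(-5, 5, 50)
y = 2 * X + pt.randn_like(X)  # assumption: standard normal noise

# Treat X as a 50 x 1 matrix so the formula applies literally
Xm = X.reshape(-1, 1)
w_ols = pt.inverse(Xm.T @ Xm) @ Xm.T @ y

# With one feature, the same value is a ratio of two dot products
w_ratio = (X @ y) / (X @ X)

print(w_ols, w_ratio)
```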

**Self-check:** Why doesn't the analytical solution return `2.0`?

## Understanding reverse mode accumulating automatic differentiation


Automatic differentiation (autodiff) is different from:

* numeric differentiation, which approximates $ \lim_{\epsilon \to 0} \frac{f(x + \epsilon) - f(x)}{\epsilon} $ using a small finite $ \epsilon $. It can be numerically unstable at extreme values of the differentiated function, and it accumulates the small errors introduced by floating point approximations to real numbers.

* symbolic differentiation, which attempts to derive a general symbolic expression for the derivative that is valid for arbitrary input values, requiring more computation and memory.
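To make the floating point issue with numeric differentiation concrete, here is a small illustrative sketch in plain Python. The function $ x^3 + 2x $ and the step sizes are chosen only for illustration, and `forward_difference` is a hypothetical helper name.

```python
def f(x):
    return x ** 3 + 2 * x

def forward_difference(f, x, eps):
    # Finite difference approximation of the derivative of f at x
    return (f(x + eps) - f(x)) / eps

x = 4.0
exact = 3 * x ** 2 + 2  # analytical derivative of f at x

errors = {}
for eps in (1e-1, 1e-6, 1e-15):
    approx = forward_difference(f, x, eps)
    errors[eps] = abs(approx - exact)
    print(f"eps={eps:g}  approx={approx:.6f}  abs error={errors[eps]:.2e}")
```

A large step suffers from approximation error, while a very small step is dominated by floating point rounding, so shrinking $ \epsilon $ does not keep improving the estimate.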

Autodiff differentiates a function at specific values of the function's input variables, one set of values at a time, with a computational complexity of $ O(n) $, where $ n $ is the number of mathematical operations used by the differentiated function. Notice that for reverse mode the $ O(n) $ cost applies per output: it holds for a function with a single output, such as a loss, which is why reverse mode is preferred when the function has fewer outputs than inputs.


Create a `Scalar` class, a rank 0 tensor, with support for `__add__`, `__mul__`, and `__repr__` methods.
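One possible starting point is sketched below. The attribute names `val` and `grad`, the `repr` format, and the no-op default for `backward` are assumptions; at this stage `__add__` and `__mul__` only compute values and do not yet support the backward pass.

```python
class Scalar:
    """A rank 0 tensor: a single value with room for a gradient."""

    def __init__(self, val):
        self.val = float(val)
        self.grad = 0.0
        # Nothing to propagate yet, so backward does nothing by default
        self.backward = lambda: None

    def __repr__(self):
        return f"Scalar(val={self.val}, grad={self.grad})"

    def __add__(self, other):
        return Scalar(self.val + other.val)

    def __mul__(self, other):
        return Scalar(self.val * other.val)

x = Scalar(2.0)
print(x, x + x, x * x)
```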

## Create a `Scalar` instance for `x = 2.0`

## Define `y = x` 

## Prepare for and call `backward` on `y`
* Use floating point values
* Zero out the accumulating gradients
* Initialize $ \frac{\partial y}{ \partial y} $


* check that $ \frac{\partial y}{ \partial x} = 1.0 $
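The preparation steps listed above might look like this in code. The `Scalar` class here is a trimmed, hypothetical version with only the attributes this example needs.

```python
class Scalar:
    def __init__(self, val):
        self.val = float(val)
        self.grad = 0.0
        self.backward = lambda: None  # no-op by default

    def __repr__(self):
        return f"Scalar(val={self.val}, grad={self.grad})"

x = Scalar(2.0)
y = x

# Zero out the accumulating gradient, then seed the backward pass
x.grad = 0.0
y.grad = 1.0
y.backward()

print(x)
```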

* **Self-check:** Why did the implementation return the correct answer?

## Implement `backward` support in the `__add__` function
* **hint:** given $ y = x + z $, $ \frac{\partial y}{ \partial x} = 1.0 $ and $ \frac{\partial y}{ \partial z} = 1.0 $


* **hint:** don't forget the recursive `backward`

## Define `y = 3 * x` for `x = 3.0`
* **hint:** recall that $ 3 * x = x + x + x $

## Prepare for and run the backward pass

* check that $ \frac{\partial y}{ \partial x} = 3.0 $
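The `__add__` hints and this check can be combined into the sketch below. It is one possible implementation rather than a definitive one. It assumes that each intermediate result feeds only one later operation, while leaf values such as `x` may be reused freely.

```python
class Scalar:
    def __init__(self, val):
        self.val = float(val)
        self.grad = 0.0
        self.backward = lambda: None  # leaves have nothing to propagate

    def __add__(self, other):
        out = Scalar(self.val + other.val)

        def backward():
            # d(out)/d(self) = 1.0 and d(out)/d(other) = 1.0
            self.grad += out.grad
            other.grad += out.grad
            # Recursive step: keep propagating toward the leaves
            self.backward()
            if other is not self:  # avoid propagating twice for x + x
                other.backward()

        out.backward = backward
        return out

x = Scalar(3.0)
y = x + x + x  # 3 * x written with additions only

x.grad = 0.0
y.grad = 1.0
y.backward()

print(y.val, x.grad)
```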

## Implement `backward` support in the `__mul__` function
* **hint:** given $ y = c * x$, $ \frac{\partial y}{ \partial x} = c $

## Use `y = x^3 + 2*x` for `x = 4.0`
* **hint:** recall that $ x^3 = x * x * x $

* given $ y = x^3 + 2x $, the analytical derivative is $ \frac{\partial y}{ \partial x} = 3x^2+2 $
* check that your implementation of `Scalar` returns the correct value of $ \frac{\partial y}{ \partial x} $ when $ x = 4.0 $
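A sketch of this check, with `backward` support in both `__add__` and `__mul__`, is shown below. As with any such naive recursive design, it assumes each intermediate result feeds only one later operation. The constant `2` is wrapped in a `Scalar` because this sketch does not handle plain Python numbers.

```python
class Scalar:
    def __init__(self, val):
        self.val = float(val)
        self.grad = 0.0
        self.backward = lambda: None  # leaves have nothing to propagate

    def __add__(self, other):
        out = Scalar(self.val + other.val)

        def backward():
            self.grad += out.grad
            other.grad += out.grad
            self.backward()
            if other is not self:
                other.backward()

        out.backward = backward
        return out

    def __mul__(self, other):
        out = Scalar(self.val * other.val)

        def backward():
            # d(out)/d(self) = other.val and d(out)/d(other) = self.val
            self.grad += out.grad * other.val
            other.grad += out.grad * self.val
            self.backward()
            if other is not self:
                other.backward()

        out.backward = backward
        return out

x = Scalar(4.0)
y = x * x * x + Scalar(2.0) * x

x.grad = 0.0
y.grad = 1.0
y.backward()

print(y.val, x.grad, 3 * x.val ** 2 + 2)
```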

## Apply `Scalar` to linear regression
* set the random seed to `42`
* randomly initialize the model parameter `w`

## Make linear regression data


## Implement a `forward` function using `w`
* **hint:** the function should return $ w * X $

## Implement the mean squared error calculation
* **hint:** Python `sum` can use a starter value as the 2nd argument

## Confirm that gradient descent recovers the value used to generate `y` values.
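The steps of this section can be assembled into the sketch below. Several details are assumptions: Python's `random` module stands in for the random number generator, the noise is standard normal, subtraction is expressed as adding a negated `Scalar`, and the learning rate and epoch count reuse the earlier values of `0.01` and `100`.

```python
import random

class Scalar:
    def __init__(self, val):
        self.val = float(val)
        self.grad = 0.0
        self.backward = lambda: None  # leaves have nothing to propagate

    def __repr__(self):
        return f"Scalar(val={self.val}, grad={self.grad})"

    def __add__(self, other):
        out = Scalar(self.val + other.val)
        def backward():
            self.grad += out.grad
            other.grad += out.grad
            self.backward()
            if other is not self:
                other.backward()
        out.backward = backward
        return out

    def __mul__(self, other):
        out = Scalar(self.val * other.val)
        def backward():
            self.grad += out.grad * other.val
            other.grad += out.grad * self.val
            self.backward()
            if other is not self:
                other.backward()
        out.backward = backward
        return out

random.seed(42)
w = Scalar(random.uniform(0.0, 1.0))

# 50 evenly spaced features from -5 to 5 and noisy targets around 2 * x
X = [Scalar(-5.0 + 10.0 * i / 49) for i in range(50)]
y = [2.0 * x.val + random.gauss(0.0, 1.0) for x in X]

def forward(w, x):
    return w * x

def mse(w, X, y):
    squared_errors = []
    for x_i, y_i in zip(X, y):
        err = forward(w, x_i) + Scalar(-y_i)  # estimate minus target
        squared_errors.append(err * err)
    # sum takes a starter value as its 2nd argument, so Scalars can be added
    total = sum(squared_errors, Scalar(0.0))
    return total * Scalar(1.0 / len(squared_errors))

EPOCHS = 100
LEARNING_RATE = 0.01

for _ in range(EPOCHS):
    w.grad = 0.0          # zero out the accumulating gradient
    loss = mse(w, X, y)   # builds a fresh graph every epoch
    loss.grad = 1.0       # seed the backward pass
    loss.backward()
    w.val -= LEARNING_RATE * w.grad

print(w)
```

Because of the noise, the estimate should land close to, but not exactly at, the value of `2.0` used to generate the targets.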

Copyright 2020 CounterFactual.AI LLC. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.