# The Math Behind Deep Learning
Notes taken from Siraj's video: https://youtu.be/N4gDikiec8E

The math we mainly need to fully understand Deep Learning comes from three areas: 1) statistics, 2) calculus, and 3) linear algebra.

## Math cheat sheets

**Statistics:**
http://web.mit.edu/~csvoss/Public/usabo/stats_handout.pdf

**Linear Algebra:**
http://www.souravsengupta.com/cds2016/lectures/Savov_Notes.pdf   

**Calculus:**
http://tutorial.math.lamar.edu/pdf/Calculus_Cheat_Sheet_All.pdf

## 4 Step Pipeline

1. Collect data
2. Build Model
3. Train Model
4. Test Model

## Statistics

### Weight initialization

- We usually initialize our weights and biases randomly, commonly by drawing from a Normal distribution (the example later in these notes draws from a uniform distribution instead)
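
As a rough illustration of drawing initial values from a Normal distribution, here is a minimal sketch. The seed, the standard deviation of 0.1, and the layer sizes are assumptions chosen only for this example, not values from the video.

```python
import numpy as np

# Assumption: fixed seed so the sketch is repeatable
rng = np.random.default_rng(seed=0)

# Hypothetical layer sizes: 3 inputs feeding 4 hidden neurons
n_inputs, n_hidden = 3, 4

# Draw weights and biases from a Normal distribution
# with mean 0 and standard deviation 0.1 (an assumed, small spread)
weights = rng.normal(loc=0.0, scale=0.1, size=(n_inputs, n_hidden))
biases = rng.normal(loc=0.0, scale=0.1, size=n_hidden)

print(weights.shape)  # (3, 4)
print(biases.shape)   # (4,)
```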


### Normalization

 - Scaling the features of our data
 
Examples:

- min-max scaling (popular)
    - scale each feature so that its values lie in the range [0,1]
    $$z = \frac{x - \min(x)}{\max(x) - \min(x)}$$
    
- z-scoring
    - scale each feature so that its mean is zero and its standard deviation is 1
    - for each feature, subtract the mean from each data point and divide by the standard deviation

$$z = \frac{x - \bar{x}}{\sigma_x}$$
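
Both scalings can be tried on a small array. This is a sketch on made-up data (the `data` values are hypothetical), with each column treated as one feature.

```python
import numpy as np

# Hypothetical data: 3 samples (rows), 2 features (columns)
data = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [3.0, 600.0]])

# Min-max scaling: each feature ends up in [0, 1]
col_min = data.min(axis=0)
col_max = data.max(axis=0)
min_max_scaled = (data - col_min) / (col_max - col_min)

# Z-scoring: each feature ends up with mean 0 and standard deviation 1
z_scored = (data - data.mean(axis=0)) / data.std(axis=0)

print(min_max_scaled)
print(z_scored)
```

Note that a feature with a constant value would cause a division by zero in both formulas, so real data may need a guard for that case.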

## Linear Algebra
Essential linear algebra terms often used in Deep learning:

1. Scalar
    - single number (e.g. $x$)
2. Vector
    - a 1-dimensional array of numbers
    
    \begin{equation}
    \mathbf{x} =
    \begin{bmatrix}
    x_{1} \\
    x_{2} \\
    \vdots \\
    x_{n}
    \end{bmatrix}
    \end{equation}
  
3. Matrix
    - 2-dimensional array of numbers
    
    \begin{equation}
    \mathbf{X} = 
    \begin{bmatrix}
 x_{11} & x_{12} & \cdots & x_{1n} \\
 x_{21} & x_{22} & \cdots & x_{2n} \\
 \vdots & \vdots & \ddots & \vdots \\
 x_{m1} & x_{m2} & \cdots & x_{mn}
 \end{bmatrix}
   \end{equation}

4. Tensor
    - N-dimensional array of numbers
    - Tensors generalize all of the objects listed above:
        - 0-dimensional $\rightarrow$ scalar
        - 1-dimensional $\rightarrow$ vector
        - 2-dimensional $\rightarrow$ matrix
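
In NumPy, the number of dimensions of an array is available as `ndim`, which gives a quick way to see the scalar/vector/matrix/tensor hierarchy. The array values below are arbitrary placeholders.

```python
import numpy as np

scalar = np.array(5.0)                       # 0-dimensional
vector = np.array([1.0, 2.0, 3.0])           # 1-dimensional
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])  # 2-dimensional
tensor3 = np.zeros((2, 3, 4))                # 3-dimensional

for name, arr in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("tensor3", tensor3)]:
    # Print the name, the number of dimensions, and the shape
    print(name, arr.ndim, arr.shape)
```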
        
Linear algebra is used heavily in both the forward and backward passes of a neural network.

For example:

- Matrix multiplication
- Dot products
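
A small sketch of both operations, using made-up numbers. The names `inputs` and `weights` are illustrative and mimic how a batch of inputs is multiplied by a layer's weight matrix.

```python
import numpy as np

# Dot product of two vectors: 1*4 + 2*5 + 3*6 = 32
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
dot = np.dot(a, b)

# Matrix multiplication: (2 x 3) times (3 x 4) gives (2 x 4)
inputs = np.array([[0.0, 1.0, 1.0],
                   [1.0, 0.0, 1.0]])
weights = np.full((3, 4), 0.5)
hidden = inputs @ weights

print(dot)           # 32.0
print(hidden.shape)  # (2, 4)
```

The inner dimensions must match (here both are 3), which is why the shapes of the weight matrices in a network are tied to the layer sizes.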

## Calculus

- The vast majority of the calculus used in Deep Learning is (multivariable) differential calculus
- This material is typically covered in Calculus 1 through 3 at most post-secondary schools
- We use derivatives to compute the _gradient_ of our error (cost) function with respect to the weights

For example:

- One of the main activation functions we have used so far is the sigmoid function

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

- The derivative of this function, which we require in order to backpropagate the model's calculated error, is

$$ \sigma'(x) = \sigma(x)  (1-\sigma(x))$$
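
One way to gain confidence in this formula is to compare it against a numerical derivative. The sketch below uses a central finite difference; the step size `h` and the sample points are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # Analytic derivative: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.linspace(-5.0, 5.0, 11)
h = 1e-5  # small step for the finite-difference approximation

# Central difference approximation of the derivative
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2 * h)
analytic = sigmoid_prime(xs)

max_diff = np.max(np.abs(numeric - analytic))
print("largest disagreement:", max_diff)
```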

## Example
Let's use a 3-layer feedforward neural net to illustrate what we've talked about so far. 

In [1]:
import numpy as np

In [2]:
#Step 1: Collect Data

#input data
x = np.array([[0,0,1],
              [0,1,1],
              [1,0,1],
              [1,1,1]])

#output data
y = np.array([[0],
             [1],
             [1],
             [0]])

In [3]:
print(x)

[[0 0 1]
 [0 1 1]
 [1 0 1]
 [1 1 1]]


In [4]:
print(y)

[[0]
 [1]
 [1]
 [0]]


## Hyperparameter examples

- When our models get more complicated, we'll have to worry about many hyperparameters and how to tune/optimize them.
- Hyperparameters are like small user-defined tuning knobs on our network
- e.g. how many forward/backward iterations, how many hidden layers, how many neurons per layer, etc.

<img src="./images/week4/hyperparam.png" alt="Hyper parameters" style="width:400px">

### Random Search

- define ranges of values for each hyperparameter
- e.g.
    - 0.001 < Learning Rate < 0.003
    - 3 < # of layers < 120
- then create a search algorithm that randomly picks a value within each range
    - values are drawn uniformly between the lower and upper bounds that we defined
        - all possible values have the same probability of being chosen
        
<img src = "./images/week4/uniform.png" alt="uniform distribution" style="width:250px">        

One common approach is to draw each value from a uniform distribution over a narrow range, so that the sampled values stay close together.
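
The sampling step of random search can be sketched as follows, using the example ranges listed above. The function name `sample_hyperparameters`, the seed, and the number of candidates are hypothetical. In a full search, each candidate would then be used to train and evaluate a model, which is not shown here.

```python
import numpy as np

# Assumption: fixed seed so the sketch is repeatable
rng = np.random.default_rng(seed=42)

# Ranges taken from the example above
lr_low, lr_high = 0.001, 0.003
layers_low, layers_high = 3, 120

def sample_hyperparameters(rng):
    # Continuous value drawn uniformly from [lr_low, lr_high)
    learning_rate = rng.uniform(lr_low, lr_high)
    # Integer drawn uniformly from 4..119 (the upper bound is exclusive),
    # matching 3 < number of layers < 120
    num_layers = int(rng.integers(layers_low + 1, layers_high))
    return {"learning_rate": learning_rate, "num_layers": num_layers}

candidates = [sample_hyperparameters(rng) for _ in range(5)]
for candidate in candidates:
    print(candidate)
```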

In [5]:
#Step 2: build model
num_epochs = 60000

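#initialize weights

# np.random.random samples from [0, 1), so weight values lie in [-1, 1)
# syn0: 3 inputs -> 4 hidden nodes, syn1: 4 hidden nodes -> 1 output
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1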

In [6]:
#Every node in input layer will be connected to 
#every node in the next layer
print(syn0)

[[-0.22174918  0.2370225  -0.71222459 -0.11069419]
 [-0.8950748  -0.89992915 -0.40739545  0.93409891]
 [ 0.73636171 -0.94204698 -0.95417025 -0.38501225]]


In [7]:
#dimension 4x1, size of our output
print(syn1)

[[-0.39602391]
 [-0.84302219]
 [-0.13352481]
 [-0.84793483]]


## Training the model

- Initialize random weights for our neurons

#### Forward propagation
- Compute forward pass 
- Pass inputs to hidden layer weighted by our randomly initialized weights
- Use sigmoid function (defined above) as our activation function in order to squash values into probabilities bounded between 0 and 1
<img src="./images/week4/fwd.png" alt="forward-pass" style="width: 200px">

#### Gradient descent and backpropagation
- Calculate the difference between our model's output and the true $y$ values
    - i.e. calculate the error
    - think of the error as an n-dimensional bowl that we want to get to the bottom of (minimize)
    - the derivative (gradient) of the error at our current position in the bowl tells us which direction leads downhill towards the minimum
    - repeat this process many times until our model converges (the error stops decreasing)
<img src="./images/week4/minimize-error.png" alt="minimize" style="width: 400px">        
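
The bowl picture can be made concrete with a one-dimensional toy problem. The error function, starting point, and learning rate below are assumptions chosen for illustration and are unrelated to the network trained in these notes.

```python
# Toy "bowl": the error is smallest at w = 3
def error(w):
    return (w - 3.0) ** 2

# Derivative of the error with respect to w
def error_gradient(w):
    return 2.0 * (w - 3.0)

w = -4.0             # arbitrary starting position in the bowl
learning_rate = 0.1  # assumed step size

for step in range(100):
    # Step in the direction opposite to the gradient (downhill)
    w -= learning_rate * error_gradient(w)

print(w)         # very close to 3.0
print(error(w))  # very close to 0.0
```

Each step shrinks the distance to the minimum by a constant factor here; real networks have many weights and a far less regular error surface, but the update rule has the same shape.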

In [8]:
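#define sigmoid activation function
#add boolean argument to use its derivative as well
# (useful for backwards pass)
#note: with deriv=True, x must already be a sigmoid output,
# since sigma'(z) = sigma(z)*(1 - sigma(z))
def nonlin(x, deriv=False):
    if deriv:
        return x*(1-x)
    
    return 1/(1+np.exp(-x))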

In [9]:
#Step 3: train model

for jj in range(num_epochs):
    #feed forward through layers 0,1,2
    l0 = x
    #squash into output probabilities using our nonlinearity
    l1 = nonlin(np.dot(l0,syn0))
    l2 = nonlin(np.dot(l1,syn1))
    
    #how much did we miss the target value?
    l2_error = y - l2
    
    if jj % 10000 == 0:
        print("Error:" + str(np.mean(np.abs(l2_error))))
    
    #we want to minimize this error
    #the smaller the error, the better our prediction
    #we can't change our input data but we CAN change our
    #weights to help minimize this error
    
    l2_delta = l2_error * nonlin(l2,deriv=True)
    
    #BACKPROP
    #how much did each l1 value contribute to the l2 error
    l1_error = l2_delta.dot(syn1.T)
    
    l1_delta = l1_error * nonlin(l1,deriv=True)
    
    #update weights (gradient descent step with an implicit learning rate of 1)
    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)    

Error:0.50076597667
Error:0.0101335121582
Error:0.00671074939929
Error:0.00531179978552
Error:0.004511936752
Error:0.00398076795327


The error, printed every 10,000 iterations, keeps decreasing, which means our model is learning!
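
Step 4 of the pipeline (Test Model) is not shown above, so here is a self-contained sketch of what it could look like. It retrains the same small network with fewer epochs and an assumed seed, then runs a forward pass to get predictions. The helper `predict` is hypothetical, and because the dataset has only four rows, the sketch evaluates on the training inputs; a real test would use held-out data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([[0], [1], [1], [0]])

# Assumption: fixed seed and 10000 epochs, chosen for a quick demo
rng = np.random.default_rng(seed=1)
syn0 = 2 * rng.random((3, 4)) - 1
syn1 = 2 * rng.random((4, 1)) - 1

for _ in range(10000):
    # Forward pass
    l1 = sigmoid(x @ syn0)
    l2 = sigmoid(l1 @ syn1)
    # Backward pass, same update rule as the training loop above
    l2_delta = (y - l2) * l2 * (1 - l2)
    l1_delta = (l2_delta @ syn1.T) * l1 * (1 - l1)
    syn1 += l1.T @ l2_delta
    syn0 += x.T @ l1_delta

def predict(inputs):
    # Forward pass only: no weight updates
    return sigmoid(sigmoid(inputs @ syn0) @ syn1)

predictions = predict(x)
print(np.round(predictions, 3))
print("mean absolute error:", np.mean(np.abs(y - predictions)))
```

If training converged, the rounded predictions should be close to the target values in `y`.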

## Conclusion

1. Deep learning uses Linear Algebra, Statistics, and Calculus
2. A neural net performs a series of operations on an input *tensor* to make a prediction
3. We can improve the predictions by backpropagating the error through the network to compute gradients, then using gradient descent to update the weights accordingly

## Supplemental resources:
Math for ML:

- https://people.ucsc.edu/~praman1/static/pub/math-for-ml.pdf
- http://www.vision.jhu.edu/tutorials/ICCV15-Tutorial-Math-Deep-Learning-Intro-Rene-Joan.pdf
- http://datascience.ibm.com/blog/the-mathematics-of-machine-learning/

Part 1 of Goodfellow's graduate-level textbook

- http://www.deeplearningbook.org/