# Theoretical View of Neural Networks
I'm more application-based when it comes to Data Science -- I'm pretty good at applying the mechanics of it, but I must be honest, I'm not the best at memorizing and remembering the theory.

And yet, we'd miss a lot of the magic of why Neural Networks work if we ignore the theory. So let's dive into a super basic premise of what a Neural Network actually is.

I would be a liar if I said I did this entire Notebook by myself. A lot of it -- especially the images -- was taken from Seth Weidman, who presented a tutorial at PyData NYC 2017. I encourage you to follow him at www.sethweidman.com.

----
# Why Neural Networks?
So why a Neural Network? Why not classical (non-deep) Machine Learning?

### Visual Example
Let's say you have the world's most simplistic dataset which looks like this, and you're trying to classify the colored points.
![DataPlot](img/xor.png)

No linear or logistic classification method can really classify these points without being totally absurd or incorrect -- we cannot draw a single straight line that separates the two classes.
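
To make that concrete, here's a small sketch that assumes the plot shows the classic XOR layout (the four corner points and labels below are my assumption based on the image filename, not values read off the figure). It brute-forces a grid of candidate lines `w1*x + w2*y + b = 0` and reports the best score any of them achieves.

```python
from itertools import product

# Assumed XOR layout: opposite corners share a class.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]

def count_correct(w1, w2, b):
    """Number of points that fall on the correct side of w1*x + w2*y + b = 0."""
    predictions = [1 if w1 * x + w2 * y + b > 0 else 0 for x, y in points]
    return sum(p == t for p, t in zip(predictions, labels))

# Try every combination of coefficients from -2.0 to 2.0 in steps of 0.25.
grid = [i / 4 for i in range(-8, 9)]
best = max(count_correct(w1, w2, b) for w1, w2, b in product(grid, repeat=3))
print(best)  # 3 -- no line on the grid gets all four points right
```

The grid is coarse, but making it finer doesn't help: for this layout, at least one point always ends up on the wrong side.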


### Math Example
Alternatively, let's say you have three points of data, each one containing up to three boolean (i.e. True/False) inputs like this:
$$ N(1, 0, 0) = 1 $$
$$ N(0, 1, 0) = 1 $$
$$ N(1, 1, 0) = 1 $$

The third data point messes up our ability to use Logistic Regression. In other words, there are no parameters $b$, $w_1$, $w_2$, or $w_3$ such that:
$$N(x_1, x_2, x_3) = \frac{1}{1 + e^{-(b + w_1 * x_1 + w_2 * x_2 + w_3 * x_3)}}$$


### Feature Engineering
We could use **Feature Engineering** to solve this. To explain Feature Engineering in a few words, it's manually designing what the input should be. You might try to add some discrimination algorithms or emphasize some key features, modifying your inputs so that their key features become more obvious.
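
As a rough sketch of what that can look like, assume the plot above is the classic XOR layout (the four points and labels below are my assumption, not values read off the image). Adding one hand-made feature, the product of the two inputs, is enough for a plain linear rule to separate the classes. The weights in `linear_rule` are picked by hand purely for illustration.

```python
# Assumed XOR-style data: the label is 1 when exactly one input is 1.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]

def engineer(x1, x2):
    """Append a hand-designed feature: the product of the two inputs."""
    return (x1, x2, x1 * x2)

def linear_rule(features):
    """A linear classifier on the engineered inputs (hand-picked weights)."""
    x1, x2, x1x2 = features
    score = 1.0 * x1 + 1.0 * x2 - 2.0 * x1x2 - 0.5
    return 1 if score > 0 else 0

predictions = [linear_rule(engineer(x1, x2)) for x1, x2 in points]
print(predictions == labels)  # True
```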

Deep Learning -- which includes Neural Networks -- is honestly something like **Architecture Engineering** (don't Google that term, I made it up). In Neural Networks, we're effectively playing around either with the parameters or the architecture of our Neural Network so that it better fits our data. In effect, we end up training the computer to do Feature Engineering on its own.

Which solution is better? The new hotness is Deep Learning and letting the computer do the feature engineering rather than us. In practice though, there is a time and place for feature engineering. If you suspect your dataset could be modeled with a linear or logistic regression after fine-tuning some of the inputs, it'll probably be faster from a computation perspective to stick to feature engineering and then slide into machine learning. There's no good 'one rule' here -- whether you go Feature Engineering + ML or go straight to DL is up to your data.

----
# What is a Neural Network: Forward Propagation
Introducing one of the most classic diagrams in visualizing what a Neural Network looks like:
![Neural_Intro](img/neural_net_basic.png)
**Disclaimer: I had a typo on this diagram. The second set of weights on the right should be 'w', not 'v'**

### Weights
The first step most Neural Networks take is to multiply each input by some weight and add the results together to obtain a feature.

$$ a_1 = x_1 * v_{11} + x_2 * v_{21} + x_3 * v_{31} $$
$$ a_2 = x_1 * v_{12} + x_2 * v_{22} + x_3 * v_{32} $$
$$ a_3 = x_1 * v_{13} + x_2 * v_{23} + x_3 * v_{33} $$
$$ a_4 = x_1 * v_{14} + x_2 * v_{24} + x_3 * v_{34} $$

I'll talk about how these weights are refined later.
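
In code, those four equations collapse into a single matrix multiplication. Here's a minimal NumPy sketch with made-up numbers for the inputs and weights (only the shapes come from the equations above):

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
x = np.array([1.0, 0.0, 1.0])                        # inputs x_1, x_2, x_3
V = np.arange(12, dtype=float).reshape(3, 4) / 10    # V[i, j] holds v_{(i+1)(j+1)}

# One matrix product computes all four features a_1 ... a_4 at once.
A = x @ V

# The first feature, written out term by term as in the equation for a_1.
a_1 = x[0] * V[0, 0] + x[1] * V[1, 0] + x[2] * V[2, 0]

print(A.shape)                 # (4,)
print(np.isclose(A[0], a_1))   # True
```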

### Activation Functions
The second step is passing the feature through an **Activation Function**. This is important because if we stuck with the above functions through the entire process, we'd stay linear: stacking linear transformations just creates another linear transformation. Chances are, you're here because you want a non-linear transformation, and this is where the Activation Function comes in.

There are many 'premade' Activation Functions, and many more are being researched. Here is (what I believe is) the most important Activation Function, used at the end of most classification problems. This is the **Sigmoid Function**, which maps any real number to a value between 0 and 1.

$$ B = \sigma(A), \quad \sigma(x) = \frac{1}{1 + e^{-x}}$$
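
A quick sketch of the Sigmoid Function in plain Python, just to see the squashing in action (the sample inputs are arbitrary):

```python
import math

def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Large negative inputs land near 0, large positive inputs near 1.
for value in (-5.0, 0.0, 5.0):
    print(value, round(sigmoid(value), 4))
```

An input of exactly 0 maps to 0.5, which is why 0.5 is the usual cut-off between the two classes.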

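If we're following the above diagram, the Weights & Activation Function steps are then repeated: the features $B$ are multiplied by a second set of weights $W$ to obtain $C$, which is passed through the Activation Function again to produce the prediction $P$.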

### Loss
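The Loss measures the difference between the estimated result $P$ and the actual result $y$. Mathematically, we use the **Mean Squared Error Loss** (shown here for a single observation, with a factor of $\frac{1}{2}$ that makes the derivative cleaner), and the lower it is, the better.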

$$ L = \frac{1}{2}(y - P)^2 $$

### Refining Weights in the next Iteration
Recall that a Neural Network is effectively a chain of functions, where each function takes the output of the previous function as its input.

\begin{align}
A &= a(x, V) \\
B &= b(A) \\
C &= c(B, W) \\
P &= p(C) \\
L &= l(P)
\end{align}

Because these equations are so linked, we can combine them in one line such as below:

$$ L = l(p(c(b(a(x, V)), W))) $$
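
Here's a minimal NumPy sketch of that chain for a single observation. It assumes the Sigmoid Function at both layers and uses random made-up weights; the function name `forward` and the input values are mine, not part of the original tutorial.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, V, W, y):
    """Run one observation through the network and return every intermediate."""
    A = x @ V                  # a(x, V): weighted sums for the hidden layer
    B = sigmoid(A)             # b(A): hidden features after activation
    C = B @ W                  # c(B, W): weighted sum for the output
    P = sigmoid(C)             # p(C): the prediction
    L = 0.5 * (y - P) ** 2     # l(P): the loss
    return A, B, C, P, L

rng = np.random.default_rng(seed=0)   # seeded so the sketch is repeatable
x = np.array([[1.0, 0.0, 0.0]])       # one observation, shape (1, 3)
V = rng.normal(size=(3, 4))           # hypothetical first-layer weights
W = rng.normal(size=(4, 1))           # hypothetical second-layer weights
y = np.array([[1.0]])                 # the actual result

A, B, C, P, L = forward(x, V, W, y)
print(A.shape, B.shape, C.shape, P.shape, L.shape)
```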

Notice that both sets of Weights (W & V) are inputs to the functions which make up the Neural Network. This means we can refine those weights for the next iteration using the partial derivative of the Loss with respect to each set of weights. Or in other words,

$$ W = W - \frac{\partial L}{\partial W}$$

$$ V = V - \frac{\partial L}{\partial V}$$

This works because...
* If $\frac{\partial L}{\partial W}$ is a positive number, then we want to _decrease_ the weight, since increasing the weight would _increase_ our loss. That is exactly what the equation $ W = W - \frac{\partial L}{\partial W}$ does.
* Similarly, if $\frac{\partial L}{\partial W}$ is a negative number, then we want to _increase_ the weight, since increasing the weight would _decrease_ our loss. In both cases, the equation $ W = W - \frac{\partial L}{\partial W}$ works.
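
Here's a tiny sketch of that sign logic, using a made-up one-weight model $P = w * x$ with the same squared error loss (the model and the numbers are assumptions for illustration, not part of the network above). Starting on either side of the best weight, subtracting the derivative moves the weight in the direction that lowers the loss.

```python
def loss(w, x=0.5, y=1.0):
    """Squared error loss for the toy model P = w * x."""
    return 0.5 * (y - w * x) ** 2

def grad(w, x=0.5, y=1.0):
    """dL/dw = -(y - w*x) * x for the toy model."""
    return -(y - w * x) * x

# w = 0.0 is below the best weight (negative derivative),
# w = 4.0 is above it (positive derivative).
for w in (0.0, 4.0):
    g = grad(w)
    w_new = w - g
    print(f"w={w}: derivative={g:+.2f}, new w={w_new}, "
          f"loss {loss(w):.3f} -> {loss(w_new):.3f}")
```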


And we can calculate those partial derivatives via the chain rule:
$$ \frac{\partial L}{\partial W} = \frac{\partial l}{\partial P} * \frac{\partial p}{\partial C} * \frac{\partial c}{\partial W}  $$

$$ \frac{\partial L}{\partial V} = \frac{\partial l}{\partial P} * \frac{\partial p}{\partial C} * \frac{\partial c}{\partial B} * \frac{\partial b}{\partial A} * \frac{\partial a}{\partial V}  $$

----
# What is a Neural Network: Backwards Propagation
We finish this iteration of the neural network by computing backwards, from the loss towards the original input. We can do this because every function is connected to the next, and we do it to train our weights using what we now know about how accurate or inaccurate we were.

The image below is a reminder of where we're going... just reverse the direction of the arrows.
![Neural_Intro](img/neural_net_basic.png)

### Propagating from Loss to Result
Recall that the Loss equation is the Mean Squared Error which was:
$$ L = l(P) = \frac{1}{2}(y - P)^2 $$

Thus to obtain the partial derivative of the loss over the result, the equation becomes:
$$ \frac{\partial l}{\partial P} = -(y - P)$$
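
If you'd like to convince yourself of that derivative, here's a small sanity check that compares it against a numerical slope. The values of `y` and `P` are arbitrary picks of mine.

```python
def loss(y, P):
    return 0.5 * (y - P) ** 2

def dloss_dP(y, P):
    """The derivative from the equation above."""
    return -(y - P)

y, P, eps = 1.0, 0.3, 1e-6

# Slope of the loss measured by nudging P a tiny bit in both directions.
numeric = (loss(y, P + eps) - loss(y, P - eps)) / (2 * eps)

print(dloss_dP(y, P), numeric)  # the two numbers should agree closely
```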

This takes us back to the Result.

### Propagating over the Activation Function
Recall that our Activation Function was the Sigmoid Function, which was:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

That means its derivative is:
$$\sigma'(x) = \sigma(x) * (1 - \sigma(x))$$
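
For anyone wondering where that comes from, here is a short derivation using the chain rule. Nothing in it is specific to our network:

$$
\begin{align}
\sigma'(x) &= \frac{d}{dx}\left(1 + e^{-x}\right)^{-1} \\
&= \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}} \\
&= \frac{1}{1 + e^{-x}} * \frac{e^{-x}}{1 + e^{-x}} \\
&= \sigma(x) * (1 - \sigma(x))
\end{align}
$$

The last step uses $1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}$. Applied to our network, where $P = \sigma(C)$, this gives $\frac{\partial p}{\partial C} = P * (1 - P)$.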

This takes us back to Feature C.

### Propagating over the Weights
We now need to compute how Feature C changes with respect to the weights, which simply means:
$$ \frac{\partial c}{\partial W} $$

...Or, spelled out in full, that means:
$$ \frac{\partial c}{\partial W} = \begin{bmatrix}\frac{\partial c}{\partial w_{11}} \\
                  \frac{\partial c}{\partial w_{21}} \\
                  \frac{\partial c}{\partial w_{31}} \\
                  \frac{\partial c}{\partial w_{41}}
                  \end{bmatrix} $$
                  
But because we got Feature C via:
$$
\begin{align}
C &= \begin{bmatrix} c_1 \end{bmatrix} \\ 
&= c(W) \\
&= w_{11} * b_1 + w_{21} * b_2 + w_{31} * b_3 + w_{41} * b_4
\end{align}
$$

That means that the partial derivatives of c with respect to each weight, taken together, are equivalent to $B^T$:

$$ \frac{\partial c}{\partial W} =
\begin{bmatrix}\frac{\partial c}{\partial w_{11}} \\
                  \frac{\partial c}{\partial w_{21}} \\
                  \frac{\partial c}{\partial w_{31}} \\
                  \frac{\partial c}{\partial w_{41}}
                  \end{bmatrix} = \begin{bmatrix}b_1 \\
                  b_2 \\
                  b_3 \\
                  b_4
                  \end{bmatrix} $$
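
Putting the three backward steps together gives the gradient for $W$, and continuing the same chain rule through $b$ and $a$ gives the gradient for $V$. Below is a minimal NumPy sketch, assuming a single observation, the Sigmoid Function at both layers, and random made-up weights; the function names are mine. The finite-difference lines at the end are a sanity check that nudges one weight and watches the loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(x, V, W, y):
    P = sigmoid(sigmoid(x @ V) @ W)
    return (0.5 * (y - P) ** 2).item()

def gradients(x, V, W, y):
    # Forward pass, keeping the intermediate values we need.
    B = sigmoid(x @ V)               # shape (1, 4)
    P = sigmoid(B @ W)               # shape (1, 1)

    dl_dP = -(y - P)                 # from Loss to Result
    dp_dC = P * (1 - P)              # over the output Activation Function
    dL_dC = dl_dP * dp_dC            # shape (1, 1)

    dL_dW = B.T @ dL_dC              # dc/dW = B^T, shape (4, 1)

    dL_dB = dL_dC @ W.T              # dc/dB = W^T, shape (1, 4)
    dL_dA = dL_dB * B * (1 - B)      # over the hidden Activation Function
    dL_dV = x.T @ dL_dA              # da/dV = x^T, shape (3, 4)
    return dL_dW, dL_dV

rng = np.random.default_rng(seed=0)  # seeded so the sketch is repeatable
x = np.array([[1.0, 1.0, 0.0]])      # one observation
V = rng.normal(size=(3, 4))          # hypothetical weights
W = rng.normal(size=(4, 1))
y = np.array([[1.0]])

dL_dW, dL_dV = gradients(x, V, W, y)

# Sanity check: nudge w_11 slightly and measure how the loss changes.
eps = 1e-6
W_nudged = W.copy()
W_nudged[0, 0] += eps
numeric = (loss(x, V, W_nudged, y) - loss(x, V, W, y)) / eps
print(dL_dW[0, 0], numeric)  # the two numbers should agree closely
```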

----
# Deep Learning Optimizations
Some methods to optimize a Neural Network include:
* Learning rate tuning
  * Learning rate decay
  * Varying learning rates by layer
  * Learning rate momentum
* Dropout
* Dropconnect
* Weight initializations
* Different activation functions

### Learning Rate
The Learning Rate is a number that we multiply the gradient by during each iteration, which controls how big a step each weight update takes.

Recall that earlier, I provided the equation to refine the Weight Vector. If we add a learning rate of $\alpha$, that equation now looks like:
$$ W = W - \alpha * \frac{\partial L}{\partial W}$$
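
To get a feel for what $\alpha$ does, here's a small sketch on a made-up one-weight problem (the toy loss, the starting weight, and the three learning rates are all assumptions for illustration). The best weight here is 2, so we can see how close each learning rate gets after 20 updates.

```python
def grad(w, x=0.5, y=1.0):
    """dL/dw for the toy loss L = 0.5 * (y - w*x)**2, which is lowest at w = 2."""
    return -(y - w * x) * x

def train(alpha, steps=20, w=0.0):
    """Apply the update w = w - alpha * dL/dw a fixed number of times."""
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

for alpha in (0.1, 1.0, 9.0):
    print(alpha, train(alpha))
```

A small $\alpha$ creeps towards the answer, a moderate one gets there quickly, and one that is too large overshoots further on every step.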

Typically speaking, the learning rate should be higher when you're closer to your output and lower when you're closer to your input. This is because the weights tend to be less informative near the beginning of the model (where you provide data you are certain of) and more informative towards the end of the model.

### Dropout
Dropout helps prevent overfitting by disconnecting a randomly chosen portion of the neurons -- setting their values to zero -- on each forward pass during training.

<img src="img/dropout.png">
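
A minimal NumPy sketch of the idea, assuming each neuron is dropped independently with probability `drop_prob` (the function name and the sample activations are mine):

```python
import numpy as np

def dropout(activations, drop_prob, rng):
    """Zero each activation independently with probability drop_prob."""
    keep_mask = rng.random(activations.shape) >= drop_prob
    return activations * keep_mask

rng = np.random.default_rng(seed=0)       # seeded so the sketch is repeatable
B = np.array([[0.2, 0.7, 0.5, 0.9]])      # hypothetical hidden features

# A fresh random mask is drawn on every call, i.e. on every forward pass.
print(dropout(B, drop_prob=0.5, rng=rng))
print(dropout(B, drop_prob=0.5, rng=rng))
```

Many libraries also rescale the surviving values by $1 / (1 - p)$ during training (often called 'inverted dropout') so the expected activation stays the same. I've left that out to match the description above.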

### DropConnect
Similar to Dropout, but whereas Dropout disables neurons, here we disable randomly chosen weights by setting them to zero.
<img src="img/drop_connect.png">
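
The same kind of sketch for DropConnect, again assuming each weight is dropped independently with probability `drop_prob` (the function name and all the numbers are made up for illustration). The only real difference from Dropout is that the mask is applied to the weight matrix instead of the neuron outputs.

```python
import numpy as np

def drop_connect(weights, drop_prob, rng):
    """Zero each individual weight independently with probability drop_prob."""
    keep_mask = rng.random(weights.shape) >= drop_prob
    return weights * keep_mask

rng = np.random.default_rng(seed=0)            # seeded so the sketch is repeatable
B = np.array([[0.2, 0.7, 0.5, 0.9]])           # hypothetical hidden features, shape (1, 4)
W = np.array([[0.4], [-0.3], [0.8], [0.1]])    # hypothetical weights, shape (4, 1)

W_dropped = drop_connect(W, drop_prob=0.5, rng=rng)
C = B @ W_dropped   # the neurons in B are untouched; only connections are cut
print(W_dropped.ravel(), C)
```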