# Theoretical View to Neural Networks

**Note, I'll be starting in the Keras Notebook first**

I'm more application based when it comes to Data Science -- I'm pretty good with applying the mechanics of it, but I must be honest, I'm not the best with memorizing and remembering the theory.

And yet, we'd miss a lot of the magic on why Neural Networks work if we ignore the theory. So let's dive into a super basic premise of what a Neural Network actually is.

----
# Why Neural Networks?
These first two-to-three sections were lovingly stolen from Seth Weidman, who presented a tutorial at PyData NYC 2017. I encourage you to follow him at www.sethweidman.com. I also encourage you to go to PyData at www.pydata.org

So why a Neural Network? Why not Machine Learning?

### Visual Example
Let's say you have the world's most simplistic dataset which looks like this, and you're trying to classify the colored points.
![DataPlot](img/xor.png)

No linear or logistic classification method can really classify these data points without being totally absurd or incorrect -- we cannot draw a single straight line to separate these two classes.
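
To see why, here's a minimal sketch in plain Python. It assumes the plot shows the classic XOR pattern (the four points and labels below are illustrative stand-ins, not values read from the image) and brute-forces a coarse grid of straight lines, counting how many points the best one gets right.

```python
from itertools import product

# Assumed XOR-style dataset: (x1, x2) -> class
points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def count_correct(w1, w2, b):
    """Count points classified correctly by the rule w1*x1 + w2*x2 + b > 0."""
    correct = 0
    for (x1, x2), label in points:
        predicted = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
        correct += int(predicted == label)
    return correct

# Try every combination of weights and bias on a coarse grid from -2 to 2.
grid = [i / 4 for i in range(-8, 9)]
best = max(count_correct(w1, w2, b) for w1, w2, b in product(grid, repeat=3))
print(best)  # prints 3: no line on this grid gets all 4 points right
```

The best any straight line manages here is three out of four, which is the gap a non-linear model is meant to fill.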


### Math Example
Alternatively, let's say you have three points of data, each one containing up to three boolean (i.e. True/False) inputs like this:
$$ N(1, 0, 0) = 1 $$
$$ N(0, 1, 0) = 1 $$
$$ N(1, 1, 0) = 1 $$

The third function messes up our ability to use Logistic Regression. In other words, there is no parameter b, w1, w2, or w3 such that:
$$N(x_1, x_2, x_3) = \frac{1}{1 + e^{-(b + w_1 * x_1 + w_2 * x_2 + w_3 * x_3)}}$$


### Feature Engineering
We could use **Feature Engineering** to solve this. To explain Feature Engineering in a few words, it's manually designing what the input should be. You might try to add some discrimination algorithms or emphasize some key features, to modify your inputs so that its key features become more obvious.

Deep Learning -- which includes Neural Networks -- is honestly something like **Architecture Engineering** (don't Google that term, I made it up). In Neural Networks, we're effectively playing around either with the parameters or the architecture of our Neural Network so that it better fits our data. In effect, we end up training the computer to do Feature Engineering on its own.

Which solution is better? The new hotness is Deep Learning and letting the computer do the feature engineering rather than us. In practice though, there is a time and place for feature engineering. If you suspect your dataset could be described with a linear or logistic regression by fine-tuning some of the inputs, it'll probably be faster from a computation perspective to stick to feature engineering and then slide into machine learning. There's not a good 'one rule' here -- whether you go Feature Engineering + ML or go straight to DL is up to your data.

----
# What is a Neural Network: Forward Propagation
Introducing one of the most classic diagrams in visualizing what a Neural Network looks like:
![Neural_Intro](img/neural_net_basic.png)
**Disclaimer: I had a typo on this diagram. The second set of weights on the right should be 'w', not 'v'**

### Applying Weights
The first step most Neural Networks take is to multiply each input by some weight and add the results together to obtain a feature.

$$ a_1 = x_1 * v_{11} + x_2 * v_{21} + x_3 * v_{31} $$
$$ a_2 = x_1 * v_{12} + x_2 * v_{22} + x_3 * v_{32} $$
$$ a_3 = x_1 * v_{13} + x_2 * v_{23} + x_3 * v_{33} $$
$$ a_4 = x_1 * v_{14} + x_2 * v_{24} + x_3 * v_{34} $$

In the first pass, these weights are probably something absurdly silly (like all 1s, all 0s, etc.). Over time, they get refined into more helpful values, and we'll talk more about this later.
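
The four equations above can be sketched in a few lines of plain Python. The input and weight values are made up for illustration; only the shape (three inputs feeding four features) comes from the diagram.

```python
# Illustrative values only: 3 inputs feeding 4 hidden features.
x = [1.0, 0.0, 1.0]

# V[i][j] holds the weight v_{(i+1)(j+1)} connecting input i+1 to feature j+1.
V = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]

# a_j = x_1 * v_1j + x_2 * v_2j + x_3 * v_3j
a = [sum(x[i] * V[i][j] for i in range(3)) for j in range(4)]
print(a)
```

In practice this is a single matrix multiplication, which is why libraries store the weights as a matrix.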

### Activation Functions
The second step is passing the feature through an **Activation Function**. This is important because if we stuck with the above functions through the entire process, we'd stay Linear and create a Linear Transformation. Chances are, you're here because you want a non-linear transformation and this is where the Activation Function comes in.

Here's a list of common Activation Functions. Later, we'll talk about the **ReLU Function** which is arguably the most popular activation function, although the reason it works so well will probably surprise you.

<img src='img/Activation.tiff'>
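
As a rough sketch, here are three of the common activation functions written out in plain Python. The sample feature values are arbitrary and only there to show how each function reshapes its input.

```python
import math

def sigmoid(z):
    """Squashes any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Passes positive numbers through unchanged and clips negatives to 0."""
    return max(0.0, z)

features = [-2.0, 0.0, 2.0]  # arbitrary example features
print([sigmoid(z) for z in features])
print([math.tanh(z) for z in features])  # tanh squashes into (-1, 1)
print([relu(z) for z in features])
```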

### Loss
The Loss is the difference between the estimated result and actual result. Mathematically, this is often the **Mean Squared Error Loss**. The lower the loss is, the better.

$$ L = \frac{1}{2}(y - P)^2 $$

There are other Loss Functions you can use and we'll talk more about them later.

----
# What is a Neural Network: Forwards to Backwards
We continue these algorithms noted above, repeating them until we reach the end of the diagram I showed you before.

But you might ask how the weights change -- because by default, they start at silly & unhelpful states. These weights get refined in a process we call `Backpropagation` and/or `Backwards Propagation` (depending who you talk to).

### Why does Backpropagation work?
Recall that a Neural Network is effectively a chain of functions, where each function takes the output of the last function and passes its own output to the next function.

\begin{align}
A &= a(x, V) \\
B &= b(A) \\
C &= c(B, W) \\
P &= p(C) \\
L &= l(P)
\end{align}

Because these equations are so linked, we can combine them in one line such as below:

$$ L = l(p(c(b(a(x, V)), W))) $$

Notice that the Weights (W) are inputs to the functions which create the Neural Network. This implies that **the weights for the next iteration (i.e. epoch) can be derived by calculating the partial derivative of the Loss with respect to those weights**. Or in other words,

$$ W = W - \frac{\partial L}{\partial W}$$

This works because...
* If $\frac{\partial L}{\partial W}$ is a positive number, then we want to _decrease_ the weight, to _decrease_ our loss.
* If $\frac{\partial L}{\partial W}$ is a negative number, then we want to _increase_ the weight, to _decrease_ our loss.

And we can calculate those partial derivatives via the chain rule. So for example, if we want to obtain the gradient for the weight 'W' (which is the second set of weights), we'd do:
$$ \frac{\partial L}{\partial W} = \frac{\partial l}{\partial P} * \frac{\partial p}{\partial C} * \frac{\partial c}{\partial W}  $$
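
The update rule can be sketched with a single weight. This is a toy setup with assumed values (`x`, `y`, and the starting weight are illustrative), where the prediction is `w * x` and the loss is the squared error from earlier.

```python
x, y = 0.5, 1.0   # illustrative input and target; the ideal weight is y / x = 2.0
w = 0.0           # a deliberately unhelpful starting weight

def loss(w):
    return 0.5 * (y - w * x) ** 2

def loss_gradient(w):
    # dL/dw via the chain rule: dL/dP * dP/dw, with P = w * x
    return -(y - w * x) * x

history = []
for _ in range(50):
    history.append(loss(w))
    w = w - loss_gradient(w)   # W = W - dL/dW

print(w, history[0], history[-1])
```

The gradient starts out negative, so each update increases the weight, and the loss keeps shrinking as the weight approaches 2.0.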

----
# What is a Neural Network: Backwards Propagation
The image below is a reminder of where we're going... just reverse the direction of the arrows.
![Neural_Intro](img/neural_net_basic.png)

### Derivatives Galore
Most of this section is about derivatives... so there's a high chance I'll fly through this. If I do, I strongly recommend the article at https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ which explains this process far better than I could.

### Propagating from Loss to Result
Recall that the Loss equation is the Mean Squared Error which was:
$$ L = l(P) = \frac{1}{2}(y - P)^2 $$

Thus to obtain the partial derivative of the loss over the result (i.e. the rate of change or the gradient of the loss over the result), the equation becomes:
$$ \frac{\partial l}{\partial P} = -(y - P)$$

This helps us compute the 'accuracy' of our result.

### Propagating over the Activation Function
Recall that our Activation Function was the Sigmoid Function, which was:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

That means its derivative is:
$$\sigma'(x) = \sigma(x) * (1 - \sigma(x))$$

This helps us compute the 'accuracy' of feature 'C' because we now know the rate of how incorrect we were.

### Propagating over the Weights
We now need to compute the relationship between the weight & feature, which simply means:
$$ \frac{\partial c}{\partial W} $$

...Or more complicatedly, that means:
$$ \frac{\partial c}{\partial W} = \begin{bmatrix}\frac{\partial c}{\partial w_{11}} \\
                  \frac{\partial c}{\partial w_{21}} \\
                  \frac{\partial c}{\partial w_{31}} \\
                  \frac{\partial c}{\partial w_{41}}
                  \end{bmatrix} $$
                  
But because we got Feature C via:
$$
\begin{align}
C &= \begin{bmatrix} c_1 \end{bmatrix} \\ 
&= c(W) \\
&= w_{11} * b_1 + w_{21} * b_2 + w_{31} * b_3 + w_{41} * b_4
\end{align}
$$

That means the partial derivatives of c with respect to each weight are simply the entries of $B^T$:

$$ \frac{\partial c}{\partial W} =
\begin{bmatrix}\frac{\partial c}{\partial w_{11}} \\
                  \frac{\partial c}{\partial w_{21}} \\
                  \frac{\partial c}{\partial w_{31}} \\
                  \frac{\partial c}{\partial w_{41}}
                  \end{bmatrix} = \begin{bmatrix}b_1 \\
                  b_2 \\
                  b_3 \\
                  b_4
                  \end{bmatrix} $$

This guides us into creating new weights for the next forward propagation pass.
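
Putting the three derivatives together, here's a minimal sketch of one backward pass over the second set of weights. The values of `B`, `W`, and `y` are invented for illustration, and the step uses the plain update $W = W - \frac{\partial L}{\partial W}$ with no learning rate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(B, W, y):
    """Return (C, P, L) for hidden features B, weights W, and target y."""
    C = sum(w * b for w, b in zip(W, B))   # C = c(B, W)
    P = sigmoid(C)                         # P = p(C)
    L = 0.5 * (y - P) ** 2                 # L = l(P)
    return C, P, L

B = [0.5, 0.1, 0.9, 0.3]    # illustrative hidden features b_1..b_4
W = [0.2, -0.4, 0.6, 0.1]   # illustrative weights w_11..w_41
y = 1.0                     # illustrative target

C, P, L = forward(B, W, y)
dl_dP = -(y - P)            # loss over result
dp_dC = P * (1.0 - P)       # sigmoid derivative
dc_dW = B                   # each dc/dw_i1 is just b_i

# Chain rule: dL/dW = dl/dP * dp/dC * dc/dW
grad_W = [dl_dP * dp_dC * b for b in dc_dW]
new_W = [w - g for w, g in zip(W, grad_W)]

_, _, new_L = forward(B, new_W, y)
print(L, new_L)  # the loss after the update should be smaller
```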

### Interactive Demo
Now that we know these concepts, take a look at http://experiments.mostafa.io/public/ffbpann/ for a quick interactive module on these propagation processes.

----
# Other Layer Types
What we described is your basic neuron layer, referred to in Keras as a 'Dense' Neural Layer. There are other layer types, and I specifically want to call out two layers. 

### Batch Normalization
Batch Normalization normalizes the data at each neuron for each batch of samples that passes through the model during training. It normalizes it by ensuring the mean is close to 0 and the standard deviation is close to 1.

While this additional computation adds run time, this normalization process helps the model converge much quicker, which should decrease the overall run time, and give you an opportunity to increase the epochs you can use. This helps us get a better accuracy.
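
Here's a minimal sketch of the normalization step for the values one neuron sees across a batch. It deliberately leaves out the learnable scale and shift that a real Batch Normalization layer also applies, and the `epsilon` value is an assumed small constant to avoid dividing by zero.

```python
import math

def batch_normalize(values, epsilon=1e-5):
    """Shift and scale a batch so its mean is ~0 and its standard deviation is ~1."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + epsilon) for v in values]

batch = [2.0, 4.0, 6.0, 8.0]   # illustrative activations for one neuron
normalized = batch_normalize(batch)
print(normalized)
```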

### Flatten
Not all data sets and models need this, but with our example, we will. The Flatten Layer flattens data into one dimension. This lets our model work consistently, since our input data has four dimensions while our expected output data has two dimensions.

----
# Practical Example: Go to the Keras Notebook
Start from the beginning as we build the **World's simplest Neural Network** (I mean it too).

----
# Deep Learning Optimizations
Some methods to optimize a Neural Network include:
* Learning rate tuning
  * Learning rate decay
  * Varying learning rates by layer
  * Learning rate momentum
* Loss Calculation
* Preventing Overfitting
  * Regularization
  * Dropout
  * [Not available in Keras yet] Dropconnect
* Weight initializations
* Different activation functions
* Hyperparameter Tuning

### Learning Rate + Optimizers
By default, our neural network uses a learning rate. **The learning rate defines how much the model will refine each individual weight** during backpropagation.

Recall that earlier, I provided the equation to refine the Weight Vector. If we added the learning rate of $\alpha$, that equation would now look like:
$$ W = W - \alpha * \frac{\partial L}{\partial W}$$

* When the weight is closer to the input, we want the learning rate to be lower, so that we do not change the weight as much and so we use the input data instead.
* When the weight is closer to the output, we want the learning rate to be higher, so that we change the weight more to 'amplify' the input data.
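
To get a feel for what $\alpha$ does in the update equation, here's a toy sketch with one weight and an assumed loss of $\frac{1}{2}(y - w)^2$. The target, starting weight, and the three learning rates are all illustrative.

```python
def train(alpha, steps=20):
    """Minimise L = 0.5 * (y - w)^2 with the update w = w - alpha * dL/dw."""
    y, w = 2.0, 0.0          # illustrative target and starting weight
    for _ in range(steps):
        gradient = -(y - w)  # dL/dw
        w = w - alpha * gradient
    return w

for alpha in (0.1, 1.0, 2.5):
    print(alpha, train(alpha))
```

A small rate creeps toward the target, a well-chosen rate gets there quickly, and a rate that is too large overshoots further on every step and never settles.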

In **Keras**, the Learning Rate is defined by an Optimizer. There are several optimizers in Keras and they all either offer options or implement algorithms that control its decay, momentum, and rate per layer. In the neural network we constructed, we used the `adam` optimizer to fine-tune the learning rate. The list of those optimizers is at: https://keras.io/optimizers/

#### Hyper-Quick Summaries of the primary Optimizers
Lovingly stolen and summarized from Keras' Documentation and https://datascience.stackexchange.com/questions/10523/guidelines-for-selecting-an-optimizer-for-training-neural-networks
1. `AdaGrad` is better for sparse data. It penalizes the learning rate harshly for parameters which are frequently updated but it also gives more learning rate to sparse parameters. `AdaDelta` is similar but it doesn't require an initial learning rate to be set.
2. `RMSProp` tends to work well for recurrent neural networks, which are good for data that changes over time.
3. `Adam` is often a solid default choice. It combines the advantages of AdaGrad and RMSProp.
4. `Stochastic Gradient Descent` is very basic and is used less often now because it uses a global learning rate, thus it doesn't work well when the parameters have different scales. It also generally has a hard time escaping saddle points.



### Loss Calculation
There are several ways to calculate the loss. We used the `categorical_crossentropy` method in our last neural network.

In Keras, you can use the Loss Algorithms on https://keras.io/losses/

### Regularization
Regularization prevents overfitting by ensuring **no weight is a central point of failure** for the entire network. We do this by adding a penalty term to the loss that grows as the weights get larger.

`L2 Regularization` (otherwise known as Weight Decay) is the most common type. Here, we augment the error function with the squared magnitude of all weights in the neural network. This in turn ensures that the network uses _all the weights_ rather than homing in on some of the weights.

`L1 Regularization` on the other hand is designed to do the opposite -- so _only the most important weights are used_, which helps the model ignore noise better. This typically performs worse than L2, but this does help at times, especially if you want to know which features are most important. 

In Keras, you can implement these with the functions at https://keras.io/regularizers/
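
As a sketch of the idea (not how Keras implements it internally), here are both penalties added to a loss value. The weights, the data loss, and the `strength` factor are all made-up numbers.

```python
def l2_penalty(weights, strength):
    """Sum of squared weights: large weights are punished heavily."""
    return strength * sum(w ** 2 for w in weights)

def l1_penalty(weights, strength):
    """Sum of absolute weights: pushes unimportant weights toward exactly 0."""
    return strength * sum(abs(w) for w in weights)

weights = [0.5, -1.5, 0.0, 2.0]   # illustrative weights
data_loss = 0.25                  # illustrative loss before regularization
strength = 0.01                   # hypothetical regularization factor

total_l2 = data_loss + l2_penalty(weights, strength)
total_l1 = data_loss + l1_penalty(weights, strength)
print(total_l2, total_l1)
```

Because the penalty is part of the loss, backpropagation will nudge the weights toward smaller values at the same time as it fits the data.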

### Dropout
Dropout prevents overfitting by ensuring **no neuron is a central point of failure** for the entire network. We do this by disconnecting a portion of the neurons (i.e. setting their values to zero) on each forward pass during training.

<img src="img/dropout.png">

Based on what we talked about previously, we want to make sure our model learns as much as it can the closer we are to our original input. As such, generally speaking, it makes sense to introduce Dropout on larger networks when some of your final layers are further away from the beginning data.

Furthermore, the original paper that introduced Dropout (http://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf) actually confirmed that using Dropout with some form of Regularization tends to be ideal.

In Keras, you can implement Dropout by following https://keras.io/layers/core/#dropout
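
A minimal sketch of what a Dropout layer does to one layer's outputs during training. The helper name and the sample activations are illustrative; scaling the surviving values by `1 / (1 - rate)` keeps the expected total roughly unchanged.

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale the survivors up."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                 # seeded so the run is repeatable
activations = [1.0, 2.0, 3.0, 4.0]     # illustrative neuron outputs
print(dropout(activations, 0.5, rng))
```

At prediction time the layer is switched off and all neurons are used.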

### DropConnect
Similar to Dropout, but whereas Dropout disabled neurons, here we disable certain weights by setting them to zero.
<img src="img/drop_connect.png">
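
For comparison with Dropout, here's a rough sketch of DropConnect for a single neuron. The function name is hypothetical; the point is that the random mask is applied to each weight rather than to the neuron's output.

```python
import random

def dropconnect_output(inputs, weights, rate, rng):
    """Weighted sum where each weight is independently zeroed with probability `rate`."""
    total = 0.0
    for x, w in zip(inputs, weights):
        kept_w = w if rng.random() >= rate else 0.0
        total += x * kept_w
    return total

rng = random.Random(0)
inputs = [1.0, 2.0, 3.0]      # illustrative inputs
weights = [0.5, -0.5, 1.0]    # illustrative weights
print(dropconnect_output(inputs, weights, 0.5, rng))
```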

### Weight Initialization
You can define the initial weights your model will use. Keras lets you do this at: https://keras.io/initializers/

### Activation Functions: ReLU & the Vanishing Gradient Problem
In Keras, a list of Activation Functions is on https://keras.io/activations/ and https://keras.io/layers/advanced-activations/. Attached below is a summarized image of some of the most common Activation Functions from our friends at Wikipedia.

<img src='img/Activation.tiff'>

As we noted earlier, the ReLU Activation Function is one of the most popular activation functions. However, recall that the ReLU Activation Function is kind of a linear one. How could it work?

Well first, let's take a look at a non-linear activation function, such as tanh.

<img src='img/grad_descent1.png'> 

Neural Networks tend to be better with many layers. But if we pick an activation function that is bounded in what numbers it can produce (such as tanh), adding more layers wouldn't make it any easier for the model to identify differences.

<img src='img/grad_descent2.png'> 

See what I mean? In this example, we're effectively computing S(S(S(S(S(S(S(S(S(S(S(x))))))))))). When we do backpropagation to refine our weights, the chain rule multiplies together one derivative per layer. Because the function 'squishes' its output into a narrow range, each of those derivatives is small, so their product shrinks toward zero and the earliest layers barely learn. This is the **Vanishing Gradient Problem**.

ReLU works around this issue because its range is effectively 0 to max(x), so it is not bounded above. Also, its derivative is exactly 1 for any positive input, so multiplying many of these derivatives together does not shrink the gradient. Neither its forward nor its backward propagation 'squishes' the data, and this means we get better results.
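
To put rough numbers on this, here's a sketch that multiplies the activation derivatives across a stack of layers. It ignores the weights entirely (equivalent to assuming every weight is 1), and the starting input of 0.5 and depth of 10 are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # never larger than 0.25

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0  # exactly 1 for positive inputs

def chained_gradient(activation, activation_grad, x, layers):
    """Multiply the derivative of each layer's activation, as the chain rule does."""
    grad, value = 1.0, x
    for _ in range(layers):
        grad *= activation_grad(value)
        value = activation(value)
    return grad

print(chained_gradient(sigmoid, sigmoid_grad, 0.5, 10))  # tiny number
print(chained_gradient(relu, relu_grad, 0.5, 10))        # stays at 1.0
```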

<img src='img/Activation.tiff'>

If you want to learn more about this particular problem, I really like the Jupyter Notebook at: https://cs224d.stanford.edu/notebooks/vanishing_grad_example.html

### Interactive Demo
Now that we know these concepts, take a look at http://playground.tensorflow.org/ for a quick interactive module on Neural Networks.

----
# Practical Example: Go to the Keras Notebook
Go to the **Demonstrating some concepts in our basic neural network** section.

----
# Convolutional Neural Networks
**Convolutional Neural Networks are typically used to do image recognition and/or classification tasks**. They work best when they're analyzing data that is structured -- that is, a particular data point has some relation with the other data points surrounding it.

They add two new primary layer types which help extract features from its input data. Those layers are `Convolutional Layers` and `Pooling Layers`.

_(All GIFs in this section are obtained from https://hackernoon.com/visualizing-parts-of-convolutional-neural-networks-using-keras-and-cats-5cc01b214e59)_



### Convolutional Neural Layers
The CNN is primarily driven by the `Convolutional Layer`, which effectively is another way to simplify the data.

![ConvolutionalLayer](https://cdn-images-1.medium.com/max/1600/1*ZCjPUFrB6eHPRi4eyP6aaA.gif)

In the image above, we produce a `convolved feature`/`activation map`/`feature map`. Here are some of the key parts of the image above:
* The sliding yellow window is the _Kernel_/_Filter_. This is the multiplicative product of weights (denoted in the small red text) and whatever value was originally in that square. These weights change to accommodate what the CNN is learning.
* The _Stride_ of the kernel refers to how many 'pixels' it moves in each move

The weights are applied just like a standard neural network. So for the first nine pixels in the upper left, we're computing: `(1*1)+(1*0)+(1*1)+(0*0)+(1*1)+(1*0)+(0*1)+(0*0)+(1*1) = 4`
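
Here's a minimal sketch of that sliding-window computation in plain Python. The first 3x3 window and the kernel match the arithmetic above; the rest of the 5x5 image is an assumed example, and the helper only handles square images and kernels.

```python
def convolve2d(image, kernel, stride=1):
    """Slide a square kernel over a square image and sum the element-wise products."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    output = []
    for r in range(0, out_size * stride, stride):
        row = []
        for c in range(0, out_size * stride, stride):
            total = sum(image[r + i][c + j] * kernel[i][j]
                        for i in range(k) for j in range(k))
            row.append(total)
        output.append(row)
    return output

image = [          # assumed 5x5 input; its upper-left window matches the sum above
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [         # the weights
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

Raising `stride` makes the window jump further each move, which produces a smaller feature map.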

### Pooling Layers
A **Max Pooling** or **Average Pooling** layer creates a kernel on this convolved feature and moves it across separate (non-overlapping) regions, selecting either the single highest or the average value across all the values within that kernel.

![Pooling](https://cdn-images-1.medium.com/max/800/1*Feiexqhmvh9xMGVVJweXhg.gif)
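
A rough sketch of both pooling flavours, assuming a 2x2 window that moves by its own width so the regions never overlap. The 4x4 feature map is invented for illustration.

```python
def pool2d(feature_map, size, reducer):
    """Apply `reducer` (e.g. max) to each non-overlapping size x size region."""
    output = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(reducer(window))
        output.append(row)
    return output

def average(values):
    return sum(values) / len(values)

feature_map = [    # illustrative convolved feature
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]

max_pooled = pool2d(feature_map, 2, max)
avg_pooled = pool2d(feature_map, 2, average)
print(max_pooled)  # [[6, 4], [8, 9]]
print(avg_pooled)  # [[3.75, 2.25], [4.5, 4.0]]
```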

### Lay of the Land
You would typically combine these two layers in the beginning, often with several passes. These layers would then feed data into the `Dense` or `Fully Connected` layers we created in the last basic neural network.

<img src="img/CNNArchitecture.png"/>

### Interactive Demo
Now that we know these concepts, let's take a look at an online interactive visualization.
http://scs.ryerson.ca/~aharley/vis/conv/

----
# CNN Best Practices 
CNNs are a very new field. There aren't many 'best practices' and researchers are trying to discover how to develop convolutional neural networks for common datasets (such as MNIST -- not to be confused with our Fashion MNIST dataset), let alone other datasets.

If you're remotely interested in CNNs, I strongly urge you to read some of the research papers from the most famous CNNs to understand why they exist. The models get better as the papers get newer, but each paper really builds off of the prior model.

Here's a quick overview into all of these models: https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5. All of these models have research papers which I believe are all free & open source.

Keras does have these models already pre-made for use and their documentation is at https://keras.io/applications/. I would urge you to be careful with these though, because these models were designed for the dataset that they were researched with, which probably isn't your dataset. It's better to understand how these models work and see how they might align to your dataset rather than use them outright.

### Second Newest CNN: ResNets (2015)
Its claim to fame is its strange architecture. In a ResNet, data can pass into the next filter directly (which is normal) *and* via shortcuts that skip a few layers along the way (which is not normal). Other notes:

* Each convolutional block uses `ReLU` activation functions and does `Batch Normalization` right afterwards.
* Dashed arrows mean that the dimension of the new output will increase, so we add extra zeros to pad it (This is `Identity Mapping`).
* These shortcuts do not necessarily add extra parameters or abnormal computational complexity. What they do bring to the table though is an ability to learn from the 'history' of this data sample getting analyzed from earlier layers.

<img src="img/ResNet.png"/>

Research Paper: https://arxiv.org/pdf/1512.03385.pdf.

To code this in Keras, you cannot use a Sequential model anymore. You'll have to build your CNN layer-by-layer, so that each layer accepts one or multiple inputs. You'll probably need several functions -- one for the convolutional block and one each for the two types of shortcuts you might encounter.
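
Before reaching for Keras, the shortcut idea itself can be sketched in plain Python. This is a simplification under stated assumptions: it uses two small dense layers in place of convolutions, skips Batch Normalization, and the function names and weights are hypothetical.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def dense(values, weights):
    """One output per row of weights: the weighted sum of the inputs."""
    return [sum(v * w for v, w in zip(values, row)) for row in weights]

def plain_block(x, weights_1, weights_2):
    """Two layers stacked the normal way."""
    return relu(dense(relu(dense(x, weights_1)), weights_2))

def residual_block(x, weights_1, weights_2):
    """Same two layers, but the input is added back in via a shortcut."""
    transformed = dense(relu(dense(x, weights_1)), weights_2)
    return relu([t + s for t, s in zip(transformed, x)])

x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]   # layers that have learned nothing yet

print(plain_block(x, zeros, zeros))     # [0.0, 0.0]: the input is lost
print(residual_block(x, zeros, zeros))  # [1.0, 2.0]: the shortcut carries it through
```

Even when the layers contribute nothing, the shortcut passes the earlier data along, which is the 'history' mentioned above.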

### First Newest CNN: DenseNets (2016)
The problem with ResNets is that the computation can get intense quickly. Researchers realized that many layers were not needed and some of them were redundant.

DenseNets `forward feed` the concatenated outputs of the prior dense/neural layers to the next dense/neural layer. As a result, this:
* Reuses features & reduces parameters (easier to compute).
* Alleviates the Vanishing Gradient Problem.
* Improves accuracy and efficiency, assisted via the shorter connections.
* Gives each layer direct access to one another from the input and output, which creates an implicit deep supervision.
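
The forward-feeding pattern can be sketched like this. The `dense_block` helper and the toy layers (each one just sums its inputs) are hypothetical stand-ins; the part that matters is that every layer receives the concatenation of everything produced before it.

```python
def dense_block(x, layers):
    """Feed each layer the concatenation of the input and all earlier outputs."""
    features = [x]
    for layer in layers:
        concatenated = [v for feature in features for v in feature]
        features.append(layer(concatenated))
    return [v for feature in features for v in feature]

def toy_layer(inputs):
    # Stand-in for a real layer: produces a single feature from all its inputs.
    return [sum(inputs)]

output = dense_block([1.0, 2.0], [toy_layer, toy_layer])
print(output)  # [1.0, 2.0, 3.0, 6.0]
```

The second layer sees the original input and the first layer's output, so earlier features are reused rather than recomputed.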

<img src="img/DenseNets.png"/>

Research Paper: https://arxiv.org/pdf/1608.06993.pdf.

----
# Practical Example: Go to the Keras Notebook
Go to the **Convolutional Neural Networks** section.

----
# Recurrent Neural Networks
**Recurrent Neural Networks are typically used when the sequence of your data is important in how your model performs**. Examples include parsing language (because each word influences the grammatical construct of the next word) or how something might fly in the air over time.

Recall that in the World's Simplest Neural Network & Convolutional Neural Network, it really doesn't matter in what order our images come in. An RNN twists this around by taking the output from each layer of neurons and passing it into the activation functions for the next data sample to come in.

<img src='img/RNN-PreLSTM.png'/>

Image stolen from http://colah.github.io/posts/2015-08-Understanding-LSTMs/

### The Hidden Layers
The big takeaway is that the neuron computing the Activation Function -- known in RNNs as the `Hidden Layer` -- now requires these inputs:
* The current time step's input & weights
* The prior time step's input & weights
  * Which in turn, was influenced by prior time steps.
  
Here's another GIF to help explain this (lovingly stolen from https://deeplearning4j.org/lstm.html). Here, `x` is the input data, `w` is the weight, `a` is the activation function inside the hidden layer (i.e. neuron), and `b` is the output for said input data.

<img src='img/RNN.gif'/>
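
A minimal sketch of that loop with a single hidden neuron. It assumes a tanh activation and uses made-up scalar weights (`w_x` for the current input, `w_h` for the prior step's output); real layers use weight matrices.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, bias):
    """Combine the current input with the prior step's output."""
    return math.tanh(w_x * x_t + w_h * h_prev + bias)

def run_rnn(sequence, w_x=0.5, w_h=0.8, bias=0.0):
    h = 0.0                      # nothing has been seen yet
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, w_x, w_h, bias)
        states.append(h)
    return states

# Same values, different order: the final state differs.
print(run_rnn([1.0, 0.0, 0.0])[-1])
print(run_rnn([0.0, 0.0, 1.0])[-1])
```

Because each state feeds into the next, the order of the inputs changes the result, which is exactly what a plain Dense network ignores.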

### Backpropagation & the Vanishing Gradient Problem
Recall that the Vanishing Gradient Problem, for typical neural networks, appears after we add many Dense Layers. The first few layers might impact the output greatly, but results will soon begin to stagnate with additional layers, and this makes refining our weights harder during backpropagation.

In CNNs, we came across two conclusions:
1. Using `ReLU` activation functions is ideal, because during backpropagation, the derivative of ReLU does not shrink the gradient for positive inputs.
2. There is a point where we can have too many layers, so that our outputs begin to stagnate.

RNNs are similar, but in a slightly different way. Let's say we are parsing a sentence in an RNN. The words used depend on the other words in the sentence, but not necessarily on the words immediately before or after.

As discussed, RNNs pass data from each data sample to the next data sample they predict. This passing of data creates a chain rule product that keeps growing in size.

<img src='img/RNN_Gradient.png'/>

**In Neural Networks + CNNs, our results started to stagnate as we added more layers. In RNNs, our results will start to stagnate as this sentence (or whatever data set) gets longer.** This is because each word adds yet another factor to the chain rule. And our weights adapt less efficiently during backpropagation for activation functions whose derivatives are small, similar to Neural Networks + CNNs.

----
# Two Types of RNNs
To address the Vanishing Gradient Problem, two new types of RNNs exist nowadays. Both treat the Activation Function -- the hidden layer -- like a black box and utilize different computation algorithms.

### Long Short-Term Memory (LSTM) 
Without an LSTM, our Hidden Layer would compute one activation function like this.

LSTMs were introduced in 1997 (Research Paper: http://www.bioinf.jku.at/publications/older/2604.pdf). I have no doubt that http://colah.github.io/posts/2015-08-Understanding-LSTMs/ explains this process far better than I could ever dream of communicating. But here's a quick stab at what's going on.

<img src='img/RNN-LSTM.png'/>

Each LSTM Layer takes in:
1. The input data from the current time step (x)
2. The output from the prior time step (h)
3. The `Cell State` data from the prior time step (C). This helps the model determine if it should rely on the prior time step's data or if it needs to think differently.

Going from left to right:
* The first branch determines **if we should use the data from the last time step**.
  * We compute the sigmoid function to produce a value between 1 (Yay) and 0 (Boo) over the current timestep's & former timestep's data. This is multiplied against the former cell state's output.
  * For example, if we're parsing a sentence and we introduced a new subject midway, it would behoove our model to ignore the gender/traits of the prior subject.
* The second & third branches **carry out the update decided by the first branch**.
  * We compute the sigmoid function (returning a value between 0 and 1) and tanh function (returning a value between -1 and 1) over our current timestep & former timestep.
  * We multiply those values together and add it to the cell state.
  * Using our prior example, we would actually compute the effects of ignoring our former subject and focusing on this new subject here.
* The final branch determines **which part of the cell state should be outputted**.
  * We compute the sigmoid function on our current and former timestep.
    * This acts as a filter on the output.
    * The updated cell state is passed through a tanh activation function and multiplied by this filter, and the result becomes our output.
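
Here's a rough sketch of one LSTM step for a single neuron, following the gate equations in the colah post linked above. The scalar weights in `params` are invented for illustration; a real layer uses weight matrices and learns them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """Return (h_t, c_t) from the current input, prior output, and prior cell state."""
    def gate(name, squash):
        w_x, w_h, bias = params[name]
        return squash(w_x * x_t + w_h * h_prev + bias)

    forget = gate("forget", sigmoid)         # first branch: keep the old cell state?
    write = gate("input", sigmoid)           # second branch: how much to update
    candidate = gate("candidate", math.tanh) # third branch: what to write
    c_t = forget * c_prev + write * candidate
    output = gate("output", sigmoid)         # final branch: what to reveal
    h_t = output * math.tanh(c_t)
    return h_t, c_t

params = {   # illustrative (w_x, w_h, bias) per branch
    "forget": (0.5, 0.5, 0.0),
    "input": (0.5, 0.5, 0.0),
    "candidate": (1.0, 1.0, 0.0),
    "output": (0.5, 0.5, 0.0),
}
print(lstm_step(1.0, 0.2, 0.7, params))
```

Pushing the forget branch's bias strongly negative makes the step ignore the prior cell state, which is the "new subject" case from the example above.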

### Gated Recurrent Unit (GRU)
This is similar to an LSTM, but it's a bit more streamlined: it combines the forget and input branches into a single `Update Gate`, and it merges the cell state with the prior timestep's output so that only one state gets passed along.

<img src='img/RNN-GRU.png'/>

Research Paper: https://arxiv.org/pdf/1406.1078v3.pdf
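
For comparison with the LSTM, here's a minimal sketch of one GRU step for a single neuron. The scalar weights are invented, and conventions differ between write-ups on whether the update gate or its complement multiplies the old state; this sketch assumes the update gate weights the new candidate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x_t, h_prev, params):
    """Return the new state from the current input and the prior state."""
    w_x, w_h, bias = params["update"]
    update = sigmoid(w_x * x_t + w_h * h_prev + bias)   # how much to replace

    w_x, w_h, bias = params["reset"]
    reset = sigmoid(w_x * x_t + w_h * h_prev + bias)    # how much history to consult

    w_x, w_h, bias = params["candidate"]
    candidate = math.tanh(w_x * x_t + w_h * (reset * h_prev) + bias)

    # Blend the old state with the candidate (assumed convention, see above).
    return (1.0 - update) * h_prev + update * candidate

params = {   # illustrative (w_x, w_h, bias) per gate
    "update": (0.5, 0.5, 0.0),
    "reset": (0.5, 0.5, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
print(gru_step(1.0, 0.3, params))
```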