# Neural Networks: Additional Resources

- https://missinglink.ai/guides/neural-network-concepts/neural-network-bias-bias-neuron-overfitting-underfitting/
- https://machinelearningmastery.com/neural-networks-are-function-approximators/
- http://www.wildml.com/deep-learning-glossary/
- https://machinelearningmastery.com/neural-networks-crash-course/
- https://ml-cheatsheet.readthedocs.io/en/latest/backpropagation.html
- https://machinelearningmastery.com/loss-and-loss-functions-for-training-deep-learning-neural-networks/
- https://intellipaat.com/community/253/role-of-bias-in-neural-networks
- https://towardsdatascience.com/implementing-the-xor-gate-using-backpropagation-in-neural-networks-c1f255b4f20d

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

![Perceptron classification](https://images.slideplayer.com/32/9869800/slides/slide_5.jpg)

# Basic formula for a simple perceptron

### Single input formula
`y = wx + bias`

### Multi-input formula
`y = (w1*x1 + w2*x2 + w3*x3 + ... + wn*xn) + bias`

- Where y is the output, each w is the weight applied to its input x, and bias is a constant offset added to the weighted sum.
- With two inputs, this formula describes a straight line (the decision boundary) that linearly separates the classes.
- A single-layer perceptron can only separate data points with a single straight line.
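As a minimal sketch of the multi-input formula, the weighted sum can be computed with a dot product. The input values, weights, and bias below are made-up numbers chosen only for illustration, and `weighted_sum` is a hypothetical helper name.

```python
import numpy as np

def weighted_sum(x, w, bias):
    """Return w1*x1 + w2*x2 + ... + wn*xn + bias for one input vector."""
    return float(np.dot(w, x) + bias)

# Illustrative values only (not taken from any dataset).
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 2.0])
bias = 0.25

y = weighted_sum(x, w, bias)  # 0.5*1 - 1.0*2 + 2.0*3 + 0.25
print(y)
```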

### Single Layer Perceptron without an Activation Function

- In the image below, $a_i$ indicates the inputs, $w_i$ indicates the weight for each input, $b$ is the bias added to the sum of the products, and $n$ is the number of inputs.
- The image below does not show an activation function

![Single layer perceptron equation](https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTd676zk19cW48MOWxUjOSfuwiD-xFrx78T8zF8wgd6U5-UoZdJ)

### Single Layer Perceptron with an Activation Function

- After the inputs are multiplied by their weights, the products are summed and the sum is passed to the activation function.

![Simple Layer Perceptron w/out bias](https://www.allaboutcircuits.com/uploads/articles/how-to-train-a-basic-perceptron-neural-network_rk_aac_image1.jpg)
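A rough sketch of that flow, assuming a step function as the activation (other activations are possible). The names `step` and `perceptron` and the weights used here are hypothetical and chosen only to show one input on each side of the threshold.

```python
import numpy as np

def step(z):
    """Step activation: 1 when the weighted sum reaches 0, otherwise 0."""
    return 1 if z >= 0 else 0

def perceptron(x, w, bias=0.0):
    """Weighted sum of the inputs followed by the step activation."""
    z = np.dot(w, x) + bias
    return step(z)

w = np.array([0.5, 1.0])  # illustrative weights

out_low = perceptron(np.array([2.0, -1.5]), w)   # z = 1.0 - 1.5 = -0.5
out_high = perceptron(np.array([2.0, 1.0]), w)   # z = 1.0 + 1.0 = 2.0
print(out_low, out_high)
```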

### Bias

- The bias neuron is a special neuron added to each layer in the neural network that always outputs the value 1; its weight acts as the bias term. This makes it possible to move or "translate" the activation function left or right on the graph.
- Without a bias neuron, each neuron only multiplies its inputs by weights, with nothing else added to the equation.
- So, for example, an input of 0 always gives a weighted sum of 0, whatever the weight is; it is not possible to input a value of 0 and output 2.
- In many cases, it is necessary to move the entire activation function to the left or right to generate the required output values -- this is made possible by the bias.
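The effect of the bias can be checked with a one-input linear unit. This is only a sketch: `linear_unit` is a hypothetical helper, and the weights and the bias value of 2 are illustrative.

```python
def linear_unit(x, w, bias=0.0):
    """One-input linear unit: y = w*x + bias."""
    return w * x + bias

# Without a bias, an input of 0 gives 0 for every choice of weight.
outputs_without_bias = [linear_unit(0.0, w) for w in (-5.0, 0.5, 100.0)]

# With a bias of 2, the same input of 0 can produce an output of 2.
output_with_bias = linear_unit(0.0, w=0.5, bias=2.0)

print(outputs_without_bias, output_with_bias)
```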

### Image of Single Layer Perceptron with Bias

- Constant is the bias
- Step Function is the activation function

![Single layer perceptron](https://static.javatpoint.com/tutorial/pytorch/images/pytorch-perceptron2.jpg)

In [2]:
data_and = {'X1': [0, 0, 1, 1], 'X2': [0, 1, 0, 1], 'output': [0, 0, 0, 1]}
data_or  = {'X1': [0, 0, 1, 1], 'X2': [0, 1, 0, 1], 'output': [0, 1, 1, 1]}
data_xor = {'X1': [0, 0, 1, 1], 'X2': [0, 1, 0, 1], 'output': [0, 1, 1, 0]}

df_and = pd.DataFrame(data_and)
df_or  = pd.DataFrame(data_or)
df_xor = pd.DataFrame(data_xor)

In [3]:
df_and

|   | X1 | X2 | output |
|---|----|----|--------|
| 0 | 0  | 0  | 0      |
| 1 | 0  | 1  | 0      |
| 2 | 1  | 0  | 0      |
| 3 | 1  | 1  | 1      |


In [4]:
df_or

|   | X1 | X2 | output |
|---|----|----|--------|
| 0 | 0  | 0  | 0      |
| 1 | 0  | 1  | 1      |
| 2 | 1  | 0  | 1      |
| 3 | 1  | 1  | 1      |


In [5]:
df_xor

|   | X1 | X2 | output |
|---|----|----|--------|
| 0 | 0  | 0  | 0      |
| 1 | 0  | 1  | 1      |
| 2 | 1  | 0  | 1      |
| 3 | 1  | 1  | 0      |


![AND, OR, XOR Image](https://miro.medium.com/max/1920/1*CyGlr8VjwtQGeNsuTUq3HA.jpeg)

![linear separability](https://jtsulliv.github.io/images/perceptron/linsep_new.png?raw=True)

## AND, OR, XOR Gates

### AND

- Output is True only if all inputs are True; otherwise it is False.
- The AND gate can be separated by a straight line, since (1,0), (0,0), and (0,1) make up one class and (1,1) makes up the second class.

### OR

- Output is True if any input is True; it is False only if all inputs are False.
- The OR gate can be separated by a straight line, since (1,0), (1,1), and (0,1) make up one class and (0,0) makes up the second class.
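One way to see that AND and OR are linearly separable is to pick weights by hand for a single perceptron with a step activation. The weights and biases below are one possible choice, not the only one, and the helper names are hypothetical.

```python
import numpy as np

def step(z):
    """Step activation: 1 when the weighted sum reaches 0, otherwise 0."""
    return 1 if z >= 0 else 0

def perceptron(x, w, bias):
    """Single perceptron: weighted sum plus bias, then step."""
    return step(np.dot(w, x) + bias)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Hand-picked parameters (assumptions for illustration).
w = np.array([1.0, 1.0])
and_bias = -1.5   # fires only when both inputs are 1
or_bias = -0.5    # fires when at least one input is 1

and_outputs = [perceptron(np.array(x), w, and_bias) for x in inputs]
or_outputs = [perceptron(np.array(x), w, or_bias) for x in inputs]
print(and_outputs, or_outputs)
```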

### XOR

- Output is True only if exactly one input is True; it is False if both inputs are True or both are False.
- XOR is not linearly separable because (0,0) and (1,1) make up one class, and (0,1) and (1,0) make up the second class (seen below).
- No single straight line can separate the two output classes, so the data is not linearly separable.
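A small brute-force search can illustrate this. The sketch below tries every combination of two weights and a bias on a coarse grid (the grid range and spacing are arbitrary assumptions) and counts how many settings reproduce each gate with a single step-activated perceptron. Finding zero settings for XOR on a grid is an illustration rather than a proof, but it agrees with the geometric argument.

```python
import itertools
import numpy as np

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = {
    "AND": [0, 0, 0, 1],
    "OR": [0, 1, 1, 1],
    "XOR": [0, 1, 1, 0],
}

def predict(w1, w2, b, x1, x2):
    """Single perceptron with a step activation."""
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

# Candidate values for w1, w2, and the bias (arbitrary coarse grid).
grid = np.linspace(-2.0, 2.0, 21)

def count_solutions(expected):
    """Count grid settings whose outputs match the expected truth table."""
    count = 0
    for w1, w2, b in itertools.product(grid, repeat=3):
        outputs = [predict(w1, w2, b, x1, x2) for x1, x2 in points]
        if outputs == expected:
            count += 1
    return count

solutions = {name: count_solutions(expected) for name, expected in targets.items()}
print(solutions)
```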

![XOR Example](https://accu.org/content/images/journals/ol109/Lewin/Lewin-7b.png)

### What if the inputs cannot be separated by a straight line?

- This is known as the XOR problem.
- From the chart above, we can see that no straight line separates the two classes of data points.
- A single perceptron is not sufficient to model XOR.
- The solution is to expand beyond the single-layer architecture by adding an additional layer of units known as a hidden layer (processing layer). This kind of architecture is called a multilayer perceptron (MLP).
- An MLP is the basic building block of deep learning.
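As a sketch of why a hidden layer helps, XOR can be built from two hidden units and one output unit with hand-picked weights: one hidden unit computes OR, the other computes NAND, and the output unit computes AND of the two. These weights are chosen by hand for illustration, not learned, and the function names are hypothetical.

```python
import numpy as np

def step(z):
    """Step activation: 1 when the weighted sum reaches 0, otherwise 0."""
    return 1 if z >= 0 else 0

def unit(x, w, bias):
    """One perceptron unit: weighted sum plus bias, then step."""
    return step(np.dot(w, x) + bias)

def xor_mlp(x1, x2):
    x = np.array([x1, x2], dtype=float)
    # Hidden layer (hand-picked weights).
    h_or = unit(x, np.array([1.0, 1.0]), -0.5)      # OR(x1, x2)
    h_nand = unit(x, np.array([-1.0, -1.0]), 1.5)   # NAND(x1, x2)
    # Output layer: AND of the two hidden units.
    hidden = np.array([h_or, h_nand], dtype=float)
    return unit(hidden, np.array([1.0, 1.0]), -1.5)

xor_outputs = [xor_mlp(x1, x2) for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)
```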

### Non Linearly Separable Data

![non linearly separable data](https://image.slidesharecdn.com/winnowvsperceptron-140225083320-phpapp02/95/winnow-vs-perceptron-7-638.jpg?cb=1393317299)

### Multilayer Perceptron with 1 hidden layer

![Multilayer Perceptron](https://www.researchgate.net/profile/Alex_Capatina/publication/313169214/figure/fig3/AS:669059322486797@1536527581843/Figure-no-5-The-architecture-of-neural-network-input-output-indicators.jpg)

### Multilayer perceptron with 2 hidden layers

![MLP with 2 hidden layers](https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTHr6qtOHOKrX4e085ZhM7WuHQUqbHbBf9fFZjlYWwmyWUjzjv1)

# Deep Neural Network

- A neural network with many inputs, multiple hidden layers (3 or more in the example below), each with several units (neurons), and an output layer at the end.
- The number of output units (output neurons) depends on the number of classes we want to predict.
- Example: If we are predicting if an image is a cat/dog/bird/cow, we would have 4 neurons at the output.
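The sketch below shows the shapes involved in a fully connected network with 3 hidden layers and 4 output neurons, one per class. Everything numeric here is an assumption for illustration: the 8 input features, the 16 units per hidden layer, and the ReLU hidden activation. The weights are random and untrained, so the predicted label is meaningless; only the structure matters.

```python
import numpy as np

rng = np.random.default_rng(0)
classes = ["cat", "dog", "bird", "cow"]

# Input size and hidden sizes are hypothetical; the output size is the class count.
layer_sizes = [8, 16, 16, 16, len(classes)]

# Random, untrained parameters for each fully connected layer.
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    """Forward pass: ReLU on hidden layers, raw scores at the output layer."""
    a = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = a @ W + b
        is_output = i == len(weights) - 1
        a = z if is_output else np.maximum(0.0, z)
    return a

x = rng.normal(size=layer_sizes[0])      # one made-up input example
scores = forward(x)                      # one score per class
predicted = classes[int(np.argmax(scores))]
print(scores.shape, predicted)
```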

### Fully Connected Deep Neural Network with 3 Hidden Layers

![fully connected deep neural network](https://miro.medium.com/max/432/1*y0pXhfaTGmvfNwaGoHnW5w.jpeg)

# Activation Functions

- We need an activation function `f(x)` so the model can learn complex patterns and represent non-linear relationships between the inputs and outputs.
- The activation function is a mathematical "gate" in between the input going into the current neuron and its output going to the next layer.
- Read more about activation functions here: https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/

## Common Types of Activation Functions

![common activation functions](https://image.slidesharecdn.com/dlmmdcud1l02deepneuralnetworks-170427160932/95/deep-neural-networks-d1l2-insightdcu-machine-learning-workshop-2017-13-638.jpg?cb=1493309684)

### Sigmoid Activation

- Any input is squashed to a value between 0 and 1

### Disadvantages:

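- The gradient vanishes (approaches zero) when the input is very negative or very positive, because the function saturates near 0 and 1
- The gradient never exceeds 0.25 (its value at an input of 0), so it shrinks further each time it is multiplied through a layer
- Optimization gets harder because the output isn't zero centered (it's centered at 0.5)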
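A quick numerical check of the sigmoid and its derivative, using the standard identity that the derivative equals `sigmoid(z) * (1 - sigmoid(z))`. The sample inputs are arbitrary; the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid with respect to its input."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
outputs = sigmoid(z)       # all values lie between 0 and 1
grads = sigmoid_grad(z)    # largest at z = 0, near zero at both extremes

print(outputs)
print(grads)
```

The gradient peaks at an input of 0 and is close to zero at both ends of the range, which is where the saturation happens.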

![Sigmoid Function](https://image1.slideserve.com/3099751/sigmoid-function-l.jpg)

### Hyperbolic Tangent Activation tanh(x)

- The output ranges from -1 to 1 and is zero centered, so optimization is easier.
- But it still has the vanishing gradient problem
- Common activation function for RNNs
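The same kind of check for tanh, using the identity that its derivative is `1 - tanh(z)**2`. The sample inputs are arbitrary and `tanh_grad` is a hypothetical helper name.

```python
import numpy as np

def tanh_grad(z):
    """Derivative of tanh with respect to its input."""
    return 1.0 - np.tanh(z) ** 2

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
outputs = np.tanh(z)     # values lie between -1 and 1, with tanh(0) = 0
grads = tanh_grad(z)     # 1 at z = 0, near zero for large |z|

print(outputs)
print(grads)
```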

![Hyperbolic Tan Activation](https://miro.medium.com/max/3450/1*aytYJ0uqNC1yhBzzAjns4A.png)

### ReLU (Rectified Linear Unit) Activation Function

- `R(z) = max(0, z)` so ReLU outputs the max value between 0 and z (z being the input value)
- If the input is negative --> Output is Zero
- If the input is positive --> Output equals the input

### Disadvantages

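- For negative inputs, both the output and the gradient are zero. If a neuron's weighted sum is negative for every input, it always outputs zero and its weights stop updating (a "dead" neuron). Therefore, we need to keep the gradients alive.
- The problem comes from the zero output (and zero gradient) on the negative side.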
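A minimal sketch of ReLU, its gradient, and a dead neuron. The weights and bias of the "dead" neuron are hypothetical values chosen so that its weighted sum is negative for all four binary inputs; this sketch takes the gradient at exactly 0 to be 0, which is a common convention.

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))
print(relu_grad(z))

# A neuron whose weighted sum is negative for every input is "dead":
# its output and its gradient are zero everywhere.
inputs = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
w_dead, b_dead = np.array([-1.0, -1.0]), -0.5   # hypothetical parameters
pre_activation = inputs @ w_dead + b_dead
dead_outputs = relu(pre_activation)
dead_grads = relu_grad(pre_activation)
print(dead_outputs, dead_grads)
```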

![relu activation function](https://miro.medium.com/max/357/1*oePAhrm74RNnNEolprmTaQ.png)

### Leaky ReLU

- `R(x) = max(0.1*x, x)`
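- Addresses the dying neuron problem: instead of setting negative inputs to zero, it multiplies them by a small positive value (such as 0.1 or 0.01), so the gradient on the negative side is small but never zero. Taking the max of 0.1*x and x returns x for positive inputs and 0.1*x for negative inputs.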
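A short sketch of Leaky ReLU with the slope 0.1 from the formula above. The parameter name `alpha` is an assumption for illustration; 0.01 is another common choice for its value.

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """Leaky ReLU: max(alpha*x, x), applied element-wise."""
    return np.maximum(alpha * x, x)

def leaky_relu_grad(x, alpha=0.1):
    """Gradient: 1 for positive inputs, alpha otherwise (never zero)."""
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 2.0])
outputs = leaky_relu(x)
grads = leaky_relu_grad(x)
print(outputs)
print(grads)
```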

![leaky relu activation function](https://www.i2tutorials.com/wp-content/uploads/2019/09/Deep-learning-25-i2tutorials.png)

# Which Activation Function should we use?

- *ReLU:* CNNs and hidden layers
- *Leaky ReLU:* when ReLU suffers from dead neurons (output 0)
- *Tanh:* RNNs
- *Sigmoid:* Usually used for gating operations (controlling an output). Also used at the output layer for some tasks such as multi-label classification (classify more than one label at a given time) or binary classification.

# Gradient Descent

![Gradient Descent Map](https://blog.paperspace.com/content/images/2018/05/challenges-1.png) 

![Image of learning rate](https://rasbt.github.io/mlxtend/user_guide/general_concepts/gradient-optimization_files/ball.png)

### Terminology

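- *Gradient:* The slope of the loss curve, i.e. the change in y (where y is the amount of loss shown on the y-axis above) due to a change in x (where x is the weights along the x-axis)
- *Gradient Descent:* An iterative algorithm that repeatedly adjusts the weights in the direction opposite to the gradient, by a step scaled by the learning rate, to reduce the loss
- *Global Minimum (or Global minima):* The lowest point of the error function over all possible weights; once there, modifying the weights no longer reduces loss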
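A minimal sketch of gradient descent on a one-weight toy loss, `(w - 3)^2`, whose global minimum is at `w = 3`. The loss function, starting weight, learning rate, and step count are all assumptions chosen for illustration. The slope is estimated numerically as the change in loss over a small change in the weight.

```python
def loss(w):
    """Toy loss with a single global minimum at w = 3."""
    return (w - 3.0) ** 2

def slope(f, w, h=1e-6):
    """Numerical slope: change in loss divided by change in weight."""
    return (f(w + h) - f(w - h)) / (2 * h)

w = -4.0              # arbitrary starting weight
learning_rate = 0.1   # step size

for _ in range(50):
    # Step in the direction opposite to the slope.
    w -= learning_rate * slope(loss, w)

print(w, loss(w))
```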

# Backpropagation

![Backpropagation image](https://www.researchgate.net/profile/Rozaida_Ghazali/publication/234005707/figure/fig2/AS:667830315917314@1536234563135/The-structure-of-single-hidden-layer-MLP-with-Backpropagation-algorithm.png)

![backpropagation and gradient descent](https://miro.medium.com/max/1836/1*6sDUTAbKX_ICVVAjunCo3g.png)

### Our objective is to minimize the error by changing the weights

- We move in the direction opposite to the derivative (opposite to the slope)
- *Negative Slope:* When we increase w, the loss function is decreasing
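The direction of a single update can be checked on a toy loss, `(w - 3)^2`, which is an assumption made for illustration along with the learning rate and starting weights. Where the slope is negative the update increases `w`, and where the slope is positive the update decreases `w`; in both cases the loss goes down.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def derivative(w):
    """Analytic derivative of the toy loss."""
    return 2.0 * (w - 3.0)

learning_rate = 0.1

# Left of the minimum: the slope is negative, so the update increases w.
w_left = 1.0
w_left_new = w_left - learning_rate * derivative(w_left)

# Right of the minimum: the slope is positive, so the update decreases w.
w_right = 5.0
w_right_new = w_right - learning_rate * derivative(w_right)

print(w_left_new, loss(w_left_new))
print(w_right_new, loss(w_right_new))
```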