# Neural Networks

### Outline: 

#### Theory topics: 
* Perceptron Model to Neural Networks
* Activation Functions
* Cost Functions
* Feed Forward Networks
* Backpropagation

#### Coding topics:
* TensorFlow 2.0 with Keras Syntax
* Neural Networks with Keras 
    * Feature Engineering
    * Classification 
    * Regression
* Exercises for Keras ANN
* TensorBoard Visualizations

### Perceptron Model

`Inputs (x1, x2, ...xn)` --> `Perceptron (f(x))` --> `Output (y)`

For a simple summation function:

`f(x) = x1 + x2 = y`

-------

We need the perceptron to adjust some parameter in order to learn. So we add in weights.

`Inputs (x1, x2, ...xn)` -- parameters `(w1,w2, ...wn)`--> `Perceptron (f(x))` --> `Output (y)`

For a simple summation function:

`f(x) = w1x1 + w2x2 + ... + wnxn = y`

----------

What if an input is zero? Then its weight is multiplied by 0, so changing that weight has no effect on the output and the perceptron cannot learn from it. So we add in a bias term b.

`Inputs (x1, x2, ...xn)` -- parameters`(w1,w2, ...wn)` `(b)`--> `Perceptron (f(x))` --> `Output (y)`

For a simple summation function:

`f(x) = w1x1 + w2x2 + ... + wnxn + b = y`
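
The weighted sum above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming one bias term for the whole perceptron (per-input biases would simply add up to a single constant); the function name `perceptron` and the sample numbers are hypothetical.

```python
def perceptron(inputs, weights, bias):
    """Return the weighted sum of the inputs plus a bias term."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# With all-zero inputs the weights have no effect, but the bias still does.
zero_output = perceptron([0.0, 0.0], [0.5, -1.0], bias=0.25)
weighted_output = perceptron([2.0, 4.0], [0.5, -1.0], bias=0.25)
print(zero_output, weighted_output)
```

Note how the all-zero input still produces a nonzero output because of the bias.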

----------

A single perceptron isn't enough to learn complicated systems. We can expand on the idea of a single perceptron to create a multilayer perceptron (neural network).

### Multilayer Perceptron Model

* To build a neural network, we can <b>connect layers of perceptrons</b> into a multilayer perceptron model.


* The outputs of one perceptron layer are then fed as inputs into the next perceptron layer.


* This allows the network as a whole to learn about interactions and relationships between features. 


* The first layer is the <b>input layer</b>. This layer directly receives the data.


* The last layer is the <b>output layer</b>. This layer outputs the final predictions.


* All the layers in between are called <b>hidden layers</b>. Hidden layers are difficult to interpret due to their high interconnectivity and distance from the known input and output values. 


* A neural network becomes a <b>deep neural network</b> when it contains two or more hidden layers.
    * Network <b>width</b>: How many neurons in a single hidden layer.
    * Network <b>depth</b>: How many layers of neurons.
    
    
* A neural network with enough hidden neurons can approximate any continuous function on a bounded input range (the universal approximation theorem). 
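
To make the layer-to-layer flow concrete, here is a minimal sketch of a forward pass in plain Python. The helper names (`dense_layer`, `forward`), the choice of sigmoid as the activation, and all weight values are assumptions made for illustration, not part of the notes above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases, activation):
    """One layer: each neuron sums its weighted inputs, adds its bias, then activates."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the outputs of each layer in as the inputs of the next."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases, sigmoid)
    return activations


# Hypothetical network: 2 inputs -> hidden layer of 3 neurons -> 1 output neuron.
layers = [
    ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1]),
    ([[0.3, -0.6, 0.9]], [0.05]),
]
hidden = dense_layer([1.0, 2.0], layers[0][0], layers[0][1], sigmoid)
prediction = forward([1.0, 2.0], layers)
print(hidden, prediction)
```

Here the hidden layer has width 3, and adding more `(weights, biases)` pairs to `layers` would increase the depth.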

--------

For real tasks, we'll want to set constraints on our output values to get useful predictions.
   * For classification, it would be very useful to have all the outputs fall between 0 and 1.
   * The output value can then represent the probability assignments of each class.

Let's explore how to use <b>activation functions</b> to set boundaries on the neuron's output values. 
   

### Activation Functions

In general, for each input we have `w*x + b`:
* `w` tells us how much weight to assign to the incoming input. 
* `b` is an offset value, making it so that `w*x` needs to overcome a certain threshold before having an effect.
* We then compute `z = w*x + b` and pass it into an activation function `f(z)`, which bounds or reshapes the neuron's output.

#### Step Function

The simplest networks rely on a basic step function that outputs 0 or 1 based on the value of z. However, this function has a very sharp cutoff between 0 and 1, so small changes in z are not reflected in the output.

#### Sigmoid Function (Logistic Function)

The sigmoid function outputs values between 0 and 1 based on the values of z. 
   * It is more sensitive to small changes and has a smoother transition than the step function. 
   * Its output can also be interpreted as the probability of belonging to the positive class (class 1).

#### Hyperbolic Tangent Function (tanh)

tanh outputs values between -1 and 1 instead of 0 and 1.

#### Rectified Linear Unit Function (ReLU)

ReLU uses the simple function `max(0, z)`: for z < 0 it outputs 0, and for z >= 0 it outputs z.
   * ReLU performs very well in practice and helps mitigate the <b>vanishing gradient</b> problem.
   * ReLU variants (Leaky ReLU, ELU) address other issues, such as neurons that get stuck outputting 0.
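
The four activation functions can be compared side by side with a short sketch in plain Python. The step threshold at `z = 0` is an assumption (the notes do not fix one), and the sample `z` values are arbitrary.

```python
import math


def step(z):
    # Assumed threshold at 0: output jumps straight from 0 to 1.
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


for z in (-2.0, -0.1, 0.0, 0.1, 2.0):
    print(f"z={z:+.1f}  step={step(z):.0f}  sigmoid={sigmoid(z):.3f}  "
          f"tanh={math.tanh(z):+.3f}  relu={relu(z):.1f}")
```

Around `z = 0` the step output flips abruptly, while sigmoid and tanh change gradually.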

### Multiclass Classification Considerations

There are 2 main types of multiclass classification problems:
* <b>Non-exclusive classes:</b> A data point can be in multiple classes. 
* <b>Mutually-exclusive classes:</b> A data point can only be in one class. 

-----

#### Organizing multiple classes: 
* The easiest way to organize multiple classes is to have <b>one output node per class</b>.
* Then use <b>one-hot encoding</b> to transform data to 0 (not in class) or 1 (in class). 
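
As a rough illustration of one-hot encoding, the sketch below maps each label to a row with one position per class. The helper `one_hot` and the class names are hypothetical.

```python
def one_hot(labels, classes):
    """Map each label to a list with a 1 at its class index and 0 elsewhere."""
    index = {name: i for i, name in enumerate(classes)}
    return [
        [1 if index[label] == i else 0 for i in range(len(classes))]
        for label in labels
    ]


classes = ["red", "green", "blue"]  # hypothetical class names
encoded = one_hot(["green", "red", "blue"], classes)
print(encoded)
```

Each row lines up with the output nodes of the network: one position per class.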

Now that we have our data correctly organized all we need to do is choose the correct activation function. 

-----

#### Choosing activation function: 
* Non-exclusive classification --> sigmoid function 
    * Each neuron will output a value between 0 and 1 that will represent the probability of the data point being in that class.
    * Allows each neuron to output independently of the others.


* Mutually-exclusive classification --> softmax function
    * Calculates the probability distribution over all K possible target classes. 
    * The probabilities sum to 1, so the predicted class is the one with the highest probability. 
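
The difference between the two choices can be seen in a small sketch. The raw scores are made-up values, and subtracting the maximum inside `softmax` is a common numerical-stability trick rather than something stated in the notes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]  # hypothetical raw outputs, one per class

independent = [sigmoid(s) for s in scores]  # non-exclusive: each class judged alone
exclusive = softmax(scores)                 # mutually exclusive: classes compete
predicted_class = exclusive.index(max(exclusive))

print(independent, sum(independent))
print(exclusive, sum(exclusive), predicted_class)
```

The sigmoid outputs do not need to sum to 1, whereas the softmax outputs always do.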
    
-------
Recap to this point:
- Neural networks take in inputs, multiply them by weights, add biases, and pass the results through activation functions; after the last layer, this produces an output. 
- This output, `a`, is the model's prediction. Once the model has made this prediction, how do we evaluate it?

### Cost Functions

* During model training, the cost function C takes the predicted outputs and then <b>compares them to the real target values</b>.
    * The cost function must be an average so it can output a single value. 
    * We can keep track of the loss/cost over training to monitor network performance. 
    
    
   
* A common cost function is the <b>quadratic cost function</b>.
    * We take the difference between the real value y and the predicted value ŷ, square it, sum over all n data points, and multiply by 1/(2n) to average: `C = (1/(2n)) * Σ (y - ŷ)^2`.
    * Squaring keeps every term positive and punishes large errors more heavily. 
  
  
* For classification, we often use the <b>cross-entropy cost function</b>. 
    * Cross-entropy assumes that the model outputs a probability distribution over the classes. 


* The network's goal is to figure out which weights w <b>minimize C</b>. We can use an iterative optimization method, gradient descent, to search for the set of w that leads to the minimal C.
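
The two cost functions described above can be sketched in plain Python. The quadratic cost uses the 1/(2n) factor; the cross-entropy shown is the binary (two-class) form, which is an assumption chosen to keep the example short. The target and prediction values are made up.

```python
import math


def quadratic_cost(targets, predictions):
    """C = 1/(2n) * sum of squared differences."""
    n = len(targets)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(targets, predictions)) / (2 * n)


def binary_cross_entropy(targets, predictions, eps=1e-12):
    """Average cross-entropy for 0/1 targets and predicted probabilities."""
    total = 0.0
    for y, p in zip(targets, predictions):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(targets)


targets = [1, 0, 1, 1]
good = [0.9, 0.1, 0.8, 0.7]  # hypothetical predictions close to the targets
bad = [0.4, 0.6, 0.3, 0.2]   # hypothetical predictions far from the targets

print(quadratic_cost(targets, good), quadratic_cost(targets, bad))
print(binary_cross_entropy(targets, good), binary_cross_entropy(targets, bad))
```

Both functions return a single number, and in both cases the predictions closer to the targets give the lower cost.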

### Gradient Descent

* Gradient Descent:
    1. <b>Start at a random point</b> based on the initial weights w. 
    2. Calculate the <b>slope</b> at that point. 
    3. <b>Move in the downward direction</b> of the slope. 
    4. Repeat until you reach a minimum (the slope is close to 0).
    
    
* We can change the <b>step size (learning rate)</b> to find the minimum: 
    * Smaller step sizes take longer to find the minimum.
    * Larger step sizes are faster but risk overshooting the minimum and having trouble converging.
    
* We can use <b>adaptive gradient descent</b> to converge better: 
    * The learning rate doesn't have to be constant.
        * Start with a larger learning rate and decrease it as the slope gets closer to 0.
        * Adam (`'adam'` in Keras) is a widely used adaptive optimizer for stochastic gradient-based training.
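
The steps and the effect of the learning rate can be illustrated with a minimal sketch of plain (non-adaptive) gradient descent. The one-weight cost `C(w) = (w - 3)^2`, the starting point, and the three learning rates are hypothetical choices for demonstration.

```python
def gradient_descent(gradient, start, learning_rate, steps):
    """Repeatedly step against the slope, starting from `start`."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w


# Hypothetical one-weight cost C(w) = (w - 3)^2, whose minimum is at w = 3.
def cost_gradient(w):
    return 2 * (w - 3)


small = gradient_descent(cost_gradient, start=0.0, learning_rate=0.01, steps=50)
medium = gradient_descent(cost_gradient, start=0.0, learning_rate=0.1, steps=50)
too_large = gradient_descent(cost_gradient, start=0.0, learning_rate=1.1, steps=50)

print(small, medium, too_large)
```

After the same number of steps, the small rate is still far from 3, the medium rate is very close to it, and the overly large rate overshoots further on every step and diverges.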
        
------

After evaluating the network with the cost function, how do we update its weights and biases?

### Backpropagation 

The main idea is that we can use the gradient of the cost to work backward through the network, adjusting the weights and biases of each layer to reduce the error measured at the output layer. 


For a network with L layers, where layer L is the output layer:
   * Counting back from the output, the layers are labeled `L-n` <-- ... <-- `L-1` <-- `L`.
   * Focusing on the last layer, we have `z^L = w^L * a^(L-1) + b^L`, `a^L = f(z^L)`, and `C = (y - a^L)^2`.
   * To understand how sensitive the cost function is to changes in w, we take the partial derivative of `C` with respect to `w^L`.
       * Applying the chain rule: `∂C/∂w^L = (∂z^L/∂w^L) * (∂a^L/∂z^L) * (∂C/∂a^L) = a^(L-1) * f'(z^L) * 2(a^L - y)`.
   * To understand how sensitive the cost function is to changes in b, we take the partial derivative of `C` with respect to `b^L`.
       * Applying the chain rule: `∂C/∂b^L = (∂z^L/∂b^L) * (∂a^L/∂z^L) * (∂C/∂a^L) = 1 * f'(z^L) * 2(a^L - y)`.
   * For networks with multiple neurons per layer, we collect the weights into matrices and the activations, biases, and errors into vectors, and use the <b>Hadamard product</b> for elementwise multiplication.
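
The chain-rule step for the last layer can be checked numerically. The sketch below assumes a single neuron with a sigmoid activation for `f` and made-up values for `w`, `b`, `a^(L-1)` (called `a_prev`), and `y`; it compares the analytic gradients with finite-difference estimates.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(w, b, a_prev, y):
    """C = (y - a)^2 with a = sigmoid(w * a_prev + b)."""
    a = sigmoid(w * a_prev + b)
    return (y - a) ** 2


def gradients(w, b, a_prev, y):
    """Chain rule: dC/dw = dz/dw * da/dz * dC/da, and likewise for b."""
    z = w * a_prev + b
    a = sigmoid(z)
    dC_da = 2 * (a - y)   # derivative of (y - a)^2 with respect to a
    da_dz = a * (1 - a)   # derivative of the sigmoid
    dz_dw = a_prev        # derivative of w * a_prev + b with respect to w
    dz_db = 1.0           # derivative of w * a_prev + b with respect to b
    return dz_dw * da_dz * dC_da, dz_db * da_dz * dC_da


# Hypothetical values for one neuron in the last layer.
w, b, a_prev, y = 0.5, -0.2, 0.8, 1.0
dC_dw, dC_db = gradients(w, b, a_prev, y)

# Finite-difference check of the analytic gradients.
h = 1e-6
numeric_dw = (cost(w + h, b, a_prev, y) - cost(w - h, b, a_prev, y)) / (2 * h)
numeric_db = (cost(w, b + h, a_prev, y) - cost(w, b - h, a_prev, y)) / (2 * h)
print(dC_dw, numeric_dw)
print(dC_db, numeric_db)
```

Both gradients come out negative here, meaning that increasing `w` or `b` would lower the cost, so a gradient descent step would raise both.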