# Neural Network Overview
Nema Sobhani / ML Team  
2020.12.04

## Example - Drug Dosage Efficacy

*Working through example from [StatQuest](https://www.youtube.com/watch?v=CqOfi41LfDw) NN video series*

### Setup

***We want to predict whether a drug's dosage will be effective, given the following data:***

<img src="images/capture.png" width="300" height="300">

There is a distinct **non-linearity** to the curve our model should produce, so we want a model that can inject bits of non-linearity that sum to create this final curve.  

### Terms

1. **Input Neurons**
    - Simply the input values
    - This may be as simple as a single scalar (this example), all the way up to a high-rank tensor
    
    
2. **Hidden Layer**
    - Next layer of neurons
    - These are fully connected to the previous and next layer
    - Characterized by an **activation function** which will generate some portion of our final, non-linear curve
    
    
3. **Connections between neurons/nodes (_Weights and Biases_)**
    - **weights**: stretch/compress (and, when negative, invert) data from the input space
    - **biases**: shift/offset the input data (think intercept)
   
   
4. **Output Neurons**
    - Map to our output classes
    - May use an activation function that differs from the hidden-layer activation functions
        - For example, to map output values to probabilities in the range [0, 1]
    - Where the **loss function** is calculated, kicking off **backpropagation**
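
To make the terms above concrete, here is a minimal sketch of a forward pass through a network with one input, two hidden neurons, and one output. The softplus activation, the function names, and every parameter value are assumptions chosen for illustration; they are not read off the figures in this document, and no separate output activation is applied.

```python
import math


def softplus(z):
    """Smooth activation log(1 + e^z), written in a numerically stable form."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Illustrative parameter values (assumed, not taken from this document).
PARAMS = {
    "w1": -34.4, "b1": 2.14,   # input -> top hidden neuron
    "w2": -2.52, "b2": 1.29,   # input -> bottom hidden neuron
    "w3": -1.30, "w4": 2.28,   # hidden neurons -> output
    "b3": -0.58,               # output bias
}


def predict(dose, p=PARAMS):
    """Forward pass: scale/shift the input, activate, then scale, sum, and shift."""
    top = p["w3"] * softplus(p["w1"] * dose + p["b1"])
    bottom = p["w4"] * softplus(p["w2"] * dose + p["b2"])
    return top + bottom + p["b3"]


for dose in (0.0, 0.5, 1.0):
    print(f"dose={dose:.1f} -> prediction={predict(dose):.2f}")
```

With these assumed values the prediction is close to 0 at low and high doses and close to 1 in between, which is the kind of non-linear hump described in the setup.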


<img src="images/capture02.png" width="750">

Assume a fully trained model:
<img src="images/capture01.png" width="750">

<img src="images/capture03.png" width="750">

<img src="images/capture04.png" width="750">

***How do you know what your weights and biases should be?*** 👉 **BACKPROPAGATION**

**QUICK NOTES:**

- Weights are commonly initialized by sampling from a standard normal distribution, and biases are commonly initialized to 0


- The optimal number of neurons per hidden layer can follow a rule of thumb or be derived from experimentation


- Each hidden neuron's activation function, *paired* with the stretching/compressing/shifting applied by the weights and biases on its connections, yields a non-linear fragment. These fragments sum to form your final model in its dimension-space, and the range of possible shapes is effectively unlimited. ([DESMOS](https://www.desmos.com/calculator/cnmsavileq))
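
The initialization note above can be sketched as follows. This is a minimal illustration only: the parameter names (`w1` to `w4`, `b1` to `b3`) mirror the symbols used later in this document, while `init_params` and the fixed seed are hypothetical conveniences.

```python
import random


def init_params(seed=0):
    """Return the 7 parameters: standard-normal weights and zero biases."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    weights = {name: rng.gauss(0.0, 1.0) for name in ("w1", "w2", "w3", "w4")}
    biases = {name: 0.0 for name in ("b1", "b2", "b3")}
    return {**weights, **biases}


params = init_params()
print(params)
```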

### Backpropagation

**High Level:**

- Once we reach the end of our network, we calculate the error with a **loss function**. The resulting error depends on the value of each parameter.
    - *But what are our parameters and how many do we have?*


- Every **weight** and **bias** is a parameter! In the example above, we have 7 parameters (4 weights and 3 biases). Think of this as modeling error in 7 dimensions!


- Tools at our disposal:
    - **Loss functions** (wow cool)
    - **Chain rule of calculus** to calculate the derivative of the error with respect to each parameter
    - **Gradient descent**, which seeks a local minimum of the error by stepping each parameter against the derivative of the error with respect to that parameter
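
As a rough illustration of the gradient descent idea in the list above, here is a sketch on a single parameter. The toy error curve, the function name `gradient_descent`, and the learning rate are all assumptions made for the example; a real network repeats the same update for every weight and bias.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step a single parameter against its error derivative."""
    value = start
    for _ in range(steps):
        value -= learning_rate * grad(value)
    return value


# Toy error curve: error(p) = (p - 3)^2, so d(error)/dp = 2 * (p - 3).
best = gradient_descent(grad=lambda p: 2.0 * (p - 3.0), start=-4.0)
print(best)
```

The parameter ends up very close to 3, the minimum of this toy curve. A learning rate that is too large would overshoot instead of settling.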

**Walkthrough**:

<img src="images/capture05.png" width="750">

1. After feed-forward, we now calculate our error wrt the label


2. Our error is likely to be high, due to random parameter assignment


3. To reduce error, we go backwards and try to minimize error one parameter at a time


- The prediction is a *composition* of all of our parameters, meaning that the error is a *function* of every one of them.
- To calculate the derivative of the error wrt a parameter near the start of the network, we must write out a long chain rule composition

- We recursively traverse backward towards the start and unravel once the partial derivatives are calculated. Let's build up to where we are:  
$f =$ activation function

#### Top Node

$W_1 x$

$W_1 x + b_1$

$f(W_1 x + b_1)$

$W_3 (f(W_1 x + b_1))$

___

#### Bottom Node

$W_2 x$

$W_2 x + b_2$

$f(W_2 x + b_2)$

$W_4 (f(W_2 x + b_2))$

___

#### Full Composition

$(W_3 (f(W_1 x + b_1)) + W_4 (f(W_2 x + b_2))) + \boldsymbol{b_3} \ \ $ 👈 you are here

If you want to use gradient descent to find the **direction of steepest drop in error with respect to each parameter**, you need the derivative of every step that sits between that parameter and the error.

$$\frac{d ERR}{d \boldsymbol{b_3}} = \frac{d ERR}{d \text{Predicted}} \cdot \frac{d \text{Predicted}}{d\boldsymbol{b_3}}$$

$$\frac{d ERR}{d \boldsymbol{W_1}} = \frac{d ERR}{d \text{Predicted}} \cdot \frac{d \text{Predicted}}{d (f(W_1 x + b_1))} \cdot \frac{d (f(W_1 x + b_1))}{d (W_1 x + b_1)} \cdot \frac{d (W_1 x + b_1)}{d \boldsymbol{W_1}}$$
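
The two chain rule expressions above can be sketched in code, with one variable per factor. This assumes a squared-error loss on a single sample and a softplus activation (whose derivative is the sigmoid). The parameter values are made up, and `numeric_grad` is a hypothetical helper used only to sanity-check the result with finite differences.

```python
import math


def softplus(z):
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sigmoid(z):
    """Derivative of softplus."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, p):
    return (p["w3"] * softplus(p["w1"] * x + p["b1"])
            + p["w4"] * softplus(p["w2"] * x + p["b2"])
            + p["b3"])


def error(x, y, p):
    return (y - predict(x, p)) ** 2  # assumed squared-error loss


def chain_rule_grads(x, y, p):
    """dERR/db3 and dERR/dW1, one factor per link in the chain."""
    z1 = p["w1"] * x + p["b1"]
    d_err_d_pred = -2.0 * (y - predict(x, p))
    d_pred_d_b3 = 1.0
    d_pred_d_f1 = p["w3"]
    d_f1_d_z1 = sigmoid(z1)
    d_z1_d_w1 = x
    return {
        "b3": d_err_d_pred * d_pred_d_b3,
        "w1": d_err_d_pred * d_pred_d_f1 * d_f1_d_z1 * d_z1_d_w1,
    }


def numeric_grad(name, x, y, p, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    up = {**p, name: p[name] + eps}
    down = {**p, name: p[name] - eps}
    return (error(x, y, up) - error(x, y, down)) / (2.0 * eps)


# Made-up parameter values and a single made-up sample (x=0.5, y=1.0).
p = {"w1": 0.5, "b1": 0.1, "w2": -0.3, "b2": 0.2, "w3": 1.2, "w4": -0.7, "b3": 0.0}
grads = chain_rule_grads(x=0.5, y=1.0, p=p)
for name, value in grads.items():
    print(name, value, numeric_grad(name, 0.5, 1.0, p))
```

The analytic and finite-difference values should agree to several decimal places, which is a handy check whenever a chain rule derivation gets long.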

**Long story short:**

Every parameter is adjusted in proportion to its effect on the error, and the process repeats until the error stabilizes (ideally, before overfitting).
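
Putting the pieces together, the loop described above might look like the sketch below. Everything specific here is an assumption made for illustration: the three toy data points, the squared-error loss, the softplus activation, the fixed (non-random) starting values, the learning rate, and the step count.

```python
import math

# Assumed toy data: (dose, efficacy) pairs with a low-high-low pattern.
DATA = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.0)]


def softplus(z):
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, p):
    z1 = p["w1"] * x + p["b1"]
    z2 = p["w2"] * x + p["b2"]
    pred = p["w3"] * softplus(z1) + p["w4"] * softplus(z2) + p["b3"]
    return z1, z2, pred


def total_error(p):
    return sum((y - forward(x, p)[2]) ** 2 for x, y in DATA)


def gradients(p):
    """Chain rule derivative of the total error for all 7 parameters."""
    grads = {name: 0.0 for name in p}
    for x, y in DATA:
        z1, z2, pred = forward(x, p)
        d_pred = -2.0 * (y - pred)
        grads["b3"] += d_pred
        grads["w3"] += d_pred * softplus(z1)
        grads["w4"] += d_pred * softplus(z2)
        grads["b1"] += d_pred * p["w3"] * sigmoid(z1)
        grads["w1"] += d_pred * p["w3"] * sigmoid(z1) * x
        grads["b2"] += d_pred * p["w4"] * sigmoid(z2)
        grads["w2"] += d_pred * p["w4"] * sigmoid(z2) * x
    return grads


def train(p, learning_rate=0.01, steps=2000):
    p = dict(p)  # leave the caller's starting values untouched
    for _ in range(steps):
        grads = gradients(p)
        for name in p:
            p[name] -= learning_rate * grads[name]
    return p


# Fixed starting point so the sketch is reproducible (normally random).
start = {"w1": 0.5, "b1": 0.0, "w2": -0.5, "b2": 0.0, "w3": 1.0, "w4": 1.0, "b3": 0.0}
trained = train(start)
print(total_error(start), total_error(trained))
```

The total error after training should be lower than at the start. How close it gets to a good fit depends on the starting point, the learning rate, and the number of steps.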

### Next Steps

- Comfort with gradient descent (learning rate / step size / etc)
- Chain rule of calculus (understanding partial derivatives in contribution to error minimization)
- Extrapolate the math and concepts to vector inputs with matrix operations
- Tensors - https://www.tensorflow.org/guide/tensor
- Batch vs Epoch - https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9
