# Sect 40: Neural Networks - Intro to Deep Learning


- 05/21/20
- online-ds-pt-100719

## Questions/ Comments


- 

## Learning Objectives

- Start By Discussing Biological Neural Networks (powerpoint)
    - `Repo Folder > references >bio_neural_networks.pptx`
- Connect back to introduction from Learn
- Discuss the steps involved in training an ANN
    - Forward propagation
    - Back propagation
- Demonstrate / play with Neural Network with Tensorflow Playground
- Activity: Finding Trump with Keras (if there's time)

# Artificial Neural Networks 

    
- **The purpose of a neural network is to model $\hat y \approx y$ by minimizing loss/cost functions using gradient descent.**

- Neural networks are very good with unstructured data (images, audio).

<img src="https://raw.githubusercontent.com/jirvingphd/dsc-introduction-to-neural-networks-online-ds-ft-100719/master/images/new_first_network_num.png" width=50%>

- **Networks are composed of sequential layers of neurons/nodes.**
    - Each neuron applies a **linear transformation** and an **activation function** and outputs its result to all neurons in the next layer.
    - Training minimizes the loss function by adjusting the parameters (weights and bias) of each connection using gradient descent (forward and back propagation).

- **Activation functions** control the output of a neuron ($\hat y = f_{activation}(x)$).
    - The most basic activation function is the sigmoid function ($\hat y = \sigma(x)$).
    - The choice of activation function controls the size/range of the output.
- **Linear transformations** ($z = w^T x + b$) produce the value that is passed to the activation function, and so control its output.
    - where $w$ is the weight (coefficient) vector, $x$ is the input, and $b$ is a bias.
        - weights: 
        - bias:
        
<img src="https://raw.githubusercontent.com/jirvingphd/dsc-04-40-02-introduction-to-neural-networks-online-ds-ft-021119/master/figures/log_reg.png">
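
To make the linear transformation and activation concrete, here is a minimal sketch of a single neuron. The weights, inputs, and bias are made-up values chosen for illustration, not taken from the lesson.

```python
import numpy as np


def sigmoid(z):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical values for one neuron with three inputs
w = np.array([0.5, -0.25, 0.1])   # weights, one per input
x = np.array([1.0, 2.0, 3.0])     # input features
b = 0.2                           # bias

z = w @ x + b        # linear transformation: z = w^T x + b
y_hat = sigmoid(z)   # activation function applied to z

print(z, y_hat)
```

Changing `w` or `b` shifts `z`, which in turn moves `y_hat` along the sigmoid curve. That is what training adjusts.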



- **Loss functions** ($\mathcal{L}(\hat y, y)$) measure the inconsistency between the predicted value ($\hat y$) and the actual value ($y$).
    - will be optimized using gradient descent
    - defined over 1 training sample
- **Cost functions** take the average loss over all of the samples.
    - $J(w,b) = \displaystyle\frac{1}{l}\displaystyle\sum^l_{i=1}\mathcal{L}(\hat y^{(i)}, y^{(i)})$
    - where $l$ is the number of samples
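
The difference between loss (one sample) and cost (the average over all samples) can be sketched as follows. This assumes binary cross-entropy (log loss) as the loss function, which is the one used in the appendix. The labels and predictions are made-up values.

```python
import numpy as np


def log_loss(y_hat, y):
    """Binary cross-entropy, computed separately for each sample."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


def cost(y_hat, y):
    """Cost J: the average of the per-sample losses."""
    return np.mean(log_loss(y_hat, y))


# Hypothetical labels and predicted probabilities for l = 4 samples
y = np.array([1, 0, 1, 0])
y_hat = np.array([0.9, 0.1, 0.8, 0.3])

losses = log_loss(y_hat, y)   # one loss value per sample
J = cost(y_hat, y)            # a single number for the whole set

print(losses, J)
```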


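- **Forward propagation** passes the inputs through the network to compute $\hat y$, then calculates the loss and cost functions.
- **Back propagation** computes the slope of $J$ with respect to $w$ and $b$, which gradient descent then uses to update their values.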
    - $w := w- \alpha\displaystyle \frac{dJ(w)}{dw}$ <br><br>
    - $b := b- \alpha\displaystyle \frac{dJ(b)}{db}$

        - where $ \displaystyle \frac{dJ(w)}{dw}$ and $\displaystyle \frac{dJ(b)}{db}$ represent the *slope* of the function $J$ with respect to $w$ and $b$ respectively
        - $\alpha$ denotes the *learning rate*. 
        
<img src="https://raw.githubusercontent.com/jirvingphd/fsds_100719_cohort_notes/master/images/neural_network_steps.png">
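
The update rule $w := w - \alpha \frac{dJ(w)}{dw}$ can be tried on a toy problem. The sketch below uses a hypothetical one-parameter cost $J(w) = (w - 3)^2$, which is not a neural network cost. It is chosen only because its slope is easy to write down and its minimum is known to be at $w = 3$.

```python
def J(w):
    """Toy cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2


def dJ_dw(w):
    """Slope of the toy cost with respect to w."""
    return 2.0 * (w - 3.0)


alpha = 0.1   # learning rate
w = 0.0       # initial guess

for step in range(100):
    w = w - alpha * dJ_dw(w)   # step downhill along the slope

print(w, J(w))
```

Each step moves `w` against the slope, so the cost shrinks until `w` settles near the minimum. A larger `alpha` takes bigger steps and can overshoot.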

### A Note On Shapes

- Inputs:
    - $n$: Number of inputs (columns) in the feature vector 
    - $l$: Number of items (rows) in the training set 
    - $m$: Number of items (rows) in the test set
    
- Input X:
    - Will have shape $n$ x $l$ (number of features x number of training data points/rows)
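
A quick shape check, assuming hypothetical sizes of $n = 3$ features and $l = 5$ training rows. Note that pandas and scikit-learn store data as rows x features, so a typical feature matrix would need to be transposed to match the $n$ x $l$ convention used here.

```python
import numpy as np

n, l = 3, 5                       # hypothetical: 3 features, 5 training rows
rng = np.random.default_rng(0)

X = rng.normal(size=(n, l))       # one column per training sample
w = np.zeros((n, 1))              # one weight per feature
b = 0.0

z = w.T @ X + b                   # (1, n) @ (n, l) -> (1, l)

print(X.shape, w.shape, z.shape)
```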

## Activation Functions (will call $f_a$ here)

- **sigmoid:**<br><img src="https://raw.githubusercontent.com/jirvingphd/dsc-04-40-04-deeper-neural-networks-online-ds-ft-021119/master/index_files/index_33_1.png" width=200>
    - $ f_a=\dfrac{1}{1+ \exp(-z)}$
    - outputs 0 to +1
    
- **tanh (hyperbolic tan):**<br><img src="https://raw.githubusercontent.com/jirvingphd/dsc-04-40-04-deeper-neural-networks-online-ds-ft-021119/master/index_files/index_36_1.png" width=200> <br>(Note:title is incorrect)
    - $f_a = \dfrac{\exp(z)- \exp(-z)}{\exp(z)+ \exp(-z)}$
    - outputs -1 to +1
    - Generally works well in intermediate layers
    - one of the most popular functions
    
- **arctan**<br><img src="https://raw.githubusercontent.com/jirvingphd/dsc-04-40-04-deeper-neural-networks-online-ds-ft-021119/master/index_files/index_40_1.png" width=200>
    - similar qualities to tanh, but with a gentler slope
    - outputs ~ -1.6 to +1.6
    
-  **Rectified Linear Unit (relu):**<br><img src="https://raw.githubusercontent.com/jirvingphd/dsc-04-40-04-deeper-neural-networks-online-ds-ft-021119/master/index_files/index_43_1.png" width=200>
    - most popular activation function
    - Activation is exactly 0 when $z < 0$
    - Makes taking derivatives slightly cumbersome (the derivative is undefined at $z = 0$)
    - $f_a=\max(0,z)$
- **leaky_relu:**<br><img src="https://raw.githubusercontent.com/jirvingphd/dsc-04-40-04-deeper-neural-networks-online-ds-ft-021119/master/index_files/index_46_1.png" width=200>
    - altered version of relu where the activation is slightly negative when $z<0$
    - $f_a=\max(0.001*z,z)$


## Activity: Finding Trump with Keras

- `Repo Folder > labs_from_class > sect_40_neural_networks`

# APPENDIX

### Using the chain rule for updating parameters with sigmoid activation function example:
<img src="https://raw.githubusercontent.com/jirvingphd/dsc-04-40-02-introduction-to-neural-networks-online-ds-ft-021119/master/figures/log_reg_deriv.png" >
- $\displaystyle \frac{dJ(w,b)}{dw_j} = \displaystyle\frac{1}{l}\displaystyle\sum^l_{i=1} \frac{d\mathcal{L}(\hat y^{(i)}, y^{(i)})}{dw_j}$, where $j$ indexes the weights and $i$ indexes the training samples
 
 
- For each training sample $1,...,l$ you'll need to compute:

    - $ z^{(i)} = w^T x^ {(i)} +b $

    - $\hat y^{(i)} = \sigma (z^{(i)})$

    - $dz^{(i)} = \hat y^{(i)}- y^{(i)}$

- Then, add each sample's contribution to the running totals (the subscript $+1$ marks an accumulated quantity):

    - $J_{+1} = - [y^{(i)} \log (\hat y^{(i)}) + (1-y^{(i)}) \log(1-\hat y^{(i)})]$ (for the sigmoid function)

    - $dw_{1, +1}^{(i)} = x_1^{(i)} * dz^{(i)}$

    - $dw_{2, +1}^{(i)} = x_2^{(i)} * dz^{(i)}$

    - $db_{+1}^{(i)} =  dz^{(i)}$

    - Once all samples are summed, average over the $l$ training samples: $\dfrac{J}{l}$, $\dfrac{dw_1}{l}$, $\dfrac{dw_2}{l}$, $\dfrac{db}{l}$

- After that, update: 

    $w_1 := w_1 - \alpha dw_1$

    $w_2 := w_2 - \alpha dw_2$

    $b := b - \alpha db$

    repeat until convergence!
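
The steps above can be sketched as a short training loop. This is a vectorized sketch rather than the sample-by-sample loop written out above: NumPy computes every $z^{(i)}$, $\hat y^{(i)}$ and $dz^{(i)}$ at once. The function name `train_logistic`, the toy data, and the default `alpha` and `n_iter` are assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_logistic(X, y, alpha=0.1, n_iter=1000):
    """Gradient descent for a single sigmoid neuron.

    X has shape (n, l): n features by l training samples.
    y has shape (l,). Returns w, b and the cost at each iteration.
    """
    n, l = X.shape
    w = np.zeros(n)
    b = 0.0
    costs = []
    for _ in range(n_iter):
        z = w @ X + b                  # z^(i) = w^T x^(i) + b for every sample
        y_hat = sigmoid(z)             # y_hat^(i) = sigma(z^(i))
        J = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
        dz = y_hat - y                 # dz^(i) = y_hat^(i) - y^(i)
        dw = X @ dz / l                # averaged x_j^(i) * dz^(i) for each weight
        db = dz.mean()                 # averaged dz^(i)
        w = w - alpha * dw             # w_j := w_j - alpha * dw_j
        b = b - alpha * db             # b := b - alpha * db
        costs.append(J)
    return w, b, costs


# Hypothetical toy data: 2 features, 4 samples, class decided by the first feature
X = np.array([[-2.0, -1.0, 1.0, 2.0],
              [0.5, -0.5, 0.5, -0.5]])
y = np.array([0, 0, 1, 1])

w, b, costs = train_logistic(X, y)
preds = (sigmoid(w @ X + b) > 0.5).astype(int)

print(costs[0], costs[-1], preds)
```

With the weights initialized to zero, every prediction starts at 0.5, so the first cost equals $\log 2$. The cost then falls as the loop repeats.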
    
    

```python
import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline  # uncomment inside a Jupyter notebook


def sigmoid(x, derivative=False):
    """Sigmoid activation, or its derivative f * (1 - f)."""
    f = 1 / (1 + np.exp(-x))
    if derivative:
        return f * (1 - f)
    return f


def tanh(x, derivative=False):
    """Hyperbolic tangent, or its derivative 1 - f^2."""
    f = np.tanh(x)
    if derivative:
        return 1 - f ** 2
    return f


def relu(x, derivative=False):
    """ReLU: max(0, x). The derivative at x = 0 is taken to be 0."""
    x = np.asarray(x, dtype=float)
    if derivative:
        return np.where(x > 0, 1.0, 0.0)
    return np.where(x > 0, x, 0.0)


def leaky_relu(x, leakage=0.05, derivative=False):
    """Leaky ReLU: x when x > 0, otherwise leakage * x."""
    x = np.asarray(x, dtype=float)
    if derivative:
        return np.where(x > 0, 1.0, leakage)
    return np.where(x > 0, x, x * leakage)


def arctan(x, derivative=False):
    """Arctangent, or its derivative 1 / (1 + x^2)."""
    if derivative:
        return 1 / (1 + np.square(x))
    return np.arctan(x)


def plot_activation(fn):
    """Plot an activation function and its derivative over [-10, 10)."""
    z = np.arange(-10, 10, 0.2)
    y = fn(z)
    dy = fn(z, derivative=True)
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.set_title(fn.__name__)
    ax.set(xlabel='Input', ylabel='Output')
    ax.axhline(color='gray', linewidth=1)
    ax.axvline(color='gray', linewidth=1)
    ax.plot(z, y, 'r', label='original (y)')
    ax.plot(z, dy, 'b', label='derivative (dy)')
    ax.legend()
    plt.show()


# Plot each activation function alongside its derivative
act_funcs = [sigmoid, tanh, arctan, relu, leaky_relu]
for fn in act_funcs:
    plot_activation(fn)
```
  