Neural Networks

* Objectives:
    * Know the best use cases for neural networks
    * Know the benefits and drawbacks of using a neural network
    * Build a simple neural network for binary classification
    * Train a neural network using backpropagation
    * Understand how neural networks can be used for regression and multi-class classification by using different loss functions and output activations
    * Explain the properties (pros/cons) of different activation functions
    * Explain some methods to avoid overfitting
    * Learn about some more complicated versions of neural networks
    * Use Keras to build neural networks in Python

1) Neural Network Basics
* Background - neural networks were introduced in the 1950's as a model which mimics the brain
    * biological neurons "fire" at a certain voltage threshold
    * an artificial neuron is modeled by an activation function such as **sign**, the **sigmoid function**, or **tanh**
    * beyond that, the analogy is a poor one: neural networks should not be thought of as models of the brain
* Why use Neural Networks?
    * (+) works well with high dimensional data (images, text, and audio)
    * (+) can model *arbitrarily* complicated decision functions (complex decision boundaries)
    * (-) not very interpretable
    * (-) slow to train (very computationally expensive)
    * (-) easy to overfit (a highly flexible model has low bias but high variance)
    * (-) difficult to tune (many parameters/choices when building the neural network architecture)

2) Building Neural Networks Basics

* **Activation functions**
    * Where have we seen an activation function before? (sigmoid function in Logistic Regression!)
        * input: $x$
        * weights: $w$
        * sigmoid function: $\sigma(z)=\frac{1}{1+e^{-z}}$
        * now, classify $x$ as positive if $\sigma(w^Tx)>0.5$
        * think of $\sigma$ as an **activation function** which **activates** if the input is larger than 0
* Creating a two-layer neural network with a sigmoid activation function
![sigmoid_two_layer](sigmoid_two_layer.png)
    * Architecture:
        * **Input layer**: the nodes which hold the inputs: $1,x_1,x_2,\dots,x_n$
        * **Output layer**: the single node that holds the output value
        * **Weights**: the weights $w_0,w_1,\dots,w_n$ on the edges between the two layers
    * with this two-layer architecture, $h(x|w) = \sigma(w^Tx)$ can only model **linear** decision functions
        * the decision boundary is the set of points where $w^Tx=0$, creating a hyperplane
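As a minimal sketch of the two-layer sigmoid network above, the snippet below classifies a point as positive when $\sigma(w^Tx)>0.5$, which happens exactly when $w^Tx>0$. The weights and inputs are hypothetical values chosen for illustration, and `predict` is an illustrative helper name.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Return (probability, label) for inputs x; w[0] is the bias weight w_0."""
    # w^T x, with the bias input x_0 = 1 handled by adding w[0] directly
    z = w[0] + sum(w_i * x_i for w_i, x_i in zip(w[1:], x))
    p = sigmoid(z)
    return p, int(p > 0.5)

w = [-1.0, 2.0, 0.5]                      # hypothetical weights w_0, w_1, w_2
p_pos, y_pos = predict(w, [1.0, 1.0])     # w^T x = 1.5  -> positive side of the hyperplane
p_neg, y_neg = predict(w, [0.0, 1.0])     # w^T x = -0.5 -> negative side of the hyperplane
print(y_pos, y_neg)
```

Because the label depends only on the sign of $w^Tx$, no choice of weights can bend the boundary; that requires hidden layers.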
* Moving on to a Neural Network with Multiple Layers (e.g. 4 layers)
![4_layer_nn_tanh_sigma](4_layer_nn_tanh_sigma.png)
    * Architecture:
        * **Input layer (layer 0)**: contains the input value $x$ and a bias term $1$
        * **Two hidden layers (layers 1 and 2)**
        * **Output layer (layer 3)**: contains the output value (or the probability of positive classification)
    * Compute output for: $x=3$
        * Layer 1:
            * The first non-bias node $\rightarrow tanh((0.1)(1)+(0.3)(3))=0.76$
            * The second non-bias node $\rightarrow tanh((0.2)(1)+(0.4)(3))=0.89$
        * Layer 2:
            * The non-bias node $\rightarrow tanh((0.2)(1)+(1)(0.76)+(-3)(0.89))=-0.94$
        * Output:
            * The value of the output layer $\rightarrow \sigma((1)(1)+(2)(-0.94))=0.29$
        * Finally, $h(3)=0.29$ for $x=3$

3) Neural Network Mathematical Notation
* Simplified Neural Network Architecture (At First)
    * Stick to networks for binary classification (a single output node)
    * Output node will use the sigmoid activation function $\sigma$
    * Hidden layers will use tanh activation function
    * $\theta$ will always represent an activation function (e.g. sign, tanh, sigmoid($\sigma$), rectifier)
* Schematic of Artificial Neuron
![artifical_neuron_schematic](artifical_neuron_schematic.png)
    * Layers are given by indices: $0,1,2,\dots,L$
        * Input layer: $0$
        * Output layer: $L$
    * For each layer $l$:
        * $s^{(l)} \rightarrow d^{(l)}$-dimensional input vector
        * $x^{(l)} \rightarrow (d^{(l)}+1)$-dimensional output vector
        * $W^{(l)} \rightarrow (d^{(l-1)}+1)\times d^{(l)}$ matrix of input weights
            * $W_{ij}^{(l)} \rightarrow$ the weight of the edge from the $i$-th node in layer $l-1$ to the $j$-th node in layer $l$
* Convert Previous 4 Layer Network To Mathematical Notation
![4_layer_nn_tanh_sigma](4_layer_nn_tanh_sigma.png)
    * Steps For **Forward Propagation**:
        1. Multiply the initial input $x^{(0)}$ by the transposed weight matrix $(W^{(1)})^T$ to yield $s^{(1)}$
        2. Apply the activation function (e.g. tanh or sigmoid) to $s^{(1)}$ and prepend the bias node to yield $x^{(1)}$
        3. Apply the next weights $W^{(2)}$ to the previous layer's output $x^{(1)}$, and repeat the cycle until reaching the final output $x^{(L)}$
    * **Layer 0 $\rightarrow$ 1:**
        * $x^{(0)}=\left[\begin{array}{cc}
            1 \\ 
            3 
            \end{array}\right]$
        * $W^{(1)}=\left[\begin{array}{cc}
            0.1 & 0.2 \\ 
            0.3 & 0.4 
            \end{array}\right]$
        * $s^{(1)}=(W^{(1)})^Tx^{(0)}$ is the result of applying the weights on the edges between layers 0 and 1: (Applying Weights)
            * $\left[\begin{array}{cc}
                (0.1)(1)+(0.3)(3) \\ 
                (0.2)(1)+(0.4)(3)
                \end{array}\right] = \left[\begin{array}{cc}
                1 \\ 
                1.4
                \end{array}\right]$ 
        * $x^{(1)}$ is the output of layer 1 after applying tanh and adding a bias node: (Output of Layer 1 After Bias/Tanh Function)
            * $\left[\begin{array}{cc}
                1 \\ 
                tanh(1) \\
                tanh(1.4)
                \end{array}\right] = \left[\begin{array}{cc}
                1 \\ 
                0.76 \\
                0.89
                \end{array}\right]$ 
    * **Layer 1 $\rightarrow$ 2:**
        * $W^{(2)}=\left[\begin{array}{cc}
            0.2 \\ 
            1 \\
            -3
            \end{array}\right]$
        * $s^{(2)}=\left[\begin{array}{cc}
            (0.2)(1)+(1)(0.76)+(-3)(0.89)
            \end{array}\right] = \left[\begin{array}{cc}
                -1.71
                \end{array}\right]$
        * $x^{(2)}=\left[\begin{array}{cc}
                1 \\ 
                tanh(-1.71)
                \end{array}\right] = \left[\begin{array}{cc}
                1 \\ 
                -0.94
                \end{array}\right]$ 
    * **Layer 2 $\rightarrow$ 3:**
        * $W^{(3)}=\left[\begin{array}{cc}
            1 \\ 
            2
            \end{array}\right]$
        * $s^{(3)}=\left[\begin{array}{cc}
            (1)(1)+(2)(-0.94)
            \end{array}\right] = \left[\begin{array}{cc}
                -0.88
                \end{array}\right]$
        * $x^{(3)}=\left[\begin{array}{cc}
                \sigma(-0.88)
                \end{array}\right] = \left[\begin{array}{cc}
                0.29
                \end{array}\right]$ 
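The layer-by-layer computation above can be sketched in NumPy as follows. This is an illustration under the notation of this section, not a reference implementation: `forward` is a hypothetical helper, each $W^{(l)}$ is stored with shape $(d^{(l-1)}+1)\times d^{(l)}$, hidden layers use tanh, and the output layer uses the sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Forward-propagate an input through tanh hidden layers and a sigmoid output."""
    x_l = np.concatenate(([1.0], np.atleast_1d(x)))      # x^(0): bias node + inputs
    for W in weights[:-1]:
        s_l = W.T @ x_l                                   # s^(l) = (W^(l))^T x^(l-1)
        x_l = np.concatenate(([1.0], np.tanh(s_l)))       # x^(l): bias node + tanh(s^(l))
    return sigmoid(weights[-1].T @ x_l)                   # output layer has no bias node

W1 = np.array([[0.1, 0.2], [0.3, 0.4]])
W2 = np.array([[0.2], [1.0], [-3.0]])
W3 = np.array([[1.0], [2.0]])
out = forward(3.0, [W1, W2, W3])
print(out)
```

Without rounding the intermediate values, the output comes to roughly 0.295, which matches the hand-computed 0.29 up to the rounding used in the worked example.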

4) Forward Propagation and Backpropagation
* **Forward Propagation** - computing the output of a neural network with fixed weights (based on studying the above examples)
    * $x^{(l)}=\left[\begin{array}{cc}
        1 \\ 
        \theta(s^{(l)})
        \end{array}\right]$ (output of layer $l$: the activation applied to $s^{(l)}$, plus a bias node)
    * $s^{(l)}=(W^{(l)})^Tx^{(l-1)}$ (weights applied to the previous layer's outputs)
    * Propagation of computations: $x^{(0)}\xrightarrow{W^{(1)}}s^{(1)}\xrightarrow{\theta}x^{(1)}\xrightarrow{W^{(2)}}s^{(2)}\cdots\rightarrow s^{(L)}\xrightarrow{\theta}x^{(L)}=h(x^{(0)})$
    * In terms of the number of nodes $V$ and weights $E$, what is the algorithmic complexity of forward propagation (in Big-O notation)?
* **Backpropagation** - computes the gradient of an error function with respect to the weights, based on the predictions made by the neural network in forward propagation; gradient descent then uses this gradient to modify the weights (thereby improving the model)
    * Training data = $\{(x_i,y_i)\}$
    * Need to minimize some error function $E$ on our training set over the weights: $w = (W^{(1)},\dots,W^{(L)})$
        * Example error function, MSE: $E(w)=\frac{1}{N}\sum_{i=1}^N(h(x_i|w)-y_i)^2$
    * This function can be *extremely* complicated to write algebraically and has no closed form solution for minima
    * Use the gradient descent algorithm to train the neural network (the gradients are computed by backpropagation)
        * Update step in gradient descent: $w(t+1)=w(t)-\eta\nabla E(w(t))$
    * Our total error is the average of the errors, $e_i$, on each input:
        * $E(w)=\frac{1}{N}\sum_{i=1}^N e_i$ where $e_i=(h(x_i|w)-y_i)^2$
        * take derivative with respect to weights: $\frac{\partial E}{\partial W^{(l)}}=\frac{1}{N}\sum_{i=1}^N\frac{\partial e_i}{\partial W^{(l)}}$ (need to review how to take derivatives)
        * can consider one data point at a time and add the results to get the total gradient
    * Backpropagation uses the **chain rule** to compute the partial derivatives of layer $l$ in terms of layer $l+1$
        * the **sensitivity vector** of layer $l$: $\delta^{(l)}=\frac{\partial e}{\partial s^{(l)}}$
        * then, we can compute: $\frac{\partial e}{\partial W^{(l)}} = x^{(l-1)}(\delta^{(l)})^T$
        * for $j$ in $1,\dots,d^{(l)}$: $\delta_j^{(l)}=\theta'(s^{(l)})_j \times [W^{(l+1)}\delta^{(l+1)}]_j$
        * can compute $\delta^{(l)}$ from $\delta^{(l+1)}$
        * must still compute $\delta^{(L)}$ to seed the process
            * depends on the error function and the output activation function
            * in this case: $\delta^{(L)}=2(h(x_i|w)-y_i)h(x_i|w)(1-h(x_i|w))$
        * $W^{(l)}=W^{(l)}-\eta\frac{\partial E}{\partial W^{(l)}}$
* Complete Computation for Forward propagation and Backpropagation example (using the 4 layer NN from above):
    * Forward propagation:
        * Data is $x=2,y=1$
        * $x^{(0)}=\left[\begin{array}{cc}
            1 \\ 
            2 
            \end{array}\right]$; 
            $s^{(1)}=\left[\begin{array}{cc}
            0.1 & 0.3 \\ 
            0.2 & 0.4
            \end{array}\right]
            \left[\begin{array}{cc}
            1 \\ 
            2
            \end{array}\right]=
            \left[\begin{array}{cc}
            0.7 \\ 
            1
            \end{array}\right]$;
            $x^{(1)}=\left[\begin{array}{cc}
            1 \\ 
            0.6 \\
            0.76
            \end{array}\right]$
        * $s^{(2)}=\left[\begin{array}{cc}
            -1.48 
            \end{array}\right]$; 
            $x^{(2)}=\left[\begin{array}{cc}
            1 \\ 
            -0.90 
            \end{array}\right]$
        * $s^{(3)}=\left[\begin{array}{cc}
            -0.8 
            \end{array}\right]$; 
            $x^{(3)}=\left[\begin{array}{cc}
            0.31 
            \end{array}\right]$
    * Backpropagation:
        * $\delta^{(3)}=2(0.31-1)(0.31)(1-0.31)=-0.30$
        * $\delta^{(2)}=(1-0.9^2)(2)(-0.30)=-0.114$
        * $\delta^{(1)}=\left[\begin{array}{cc}
            -0.072 \\
            0.144
            \end{array}\right]$
        * $\frac{\partial e}{\partial W^{(1)}}=x^{(0)}(\delta^{(1)})^T=\left[\begin{array}{cc}
            -0.072 & 0.144 \\
            -0.144 & 0.288
            \end{array}\right]$
        * $\frac{\partial e}{\partial W^{(2)}}=x^{(1)}(\delta^{(2)})^T=\left[\begin{array}{cc}
            -0.114 \\
            -0.068 \\
            -0.087
            \end{array}\right]$
        * $\frac{\partial e}{\partial W^{(3)}}=x^{(2)}(\delta^{(3)})^T=\left[\begin{array}{cc}
            -0.30 \\
            0.27
            \end{array}\right]$
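The forward and backward passes for $x=2,y=1$ can be sketched in NumPy as below. This is an illustration under the assumptions of this section (tanh hidden layers, sigmoid output, squared error); `forward` and `backprop` are hypothetical helper names. Because the code does not round intermediate values, its numbers differ slightly from the hand computation, for example $\delta^{(2)}\approx-0.111$ rather than $-0.114$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws):
    """Return the layer outputs [x^(0), ..., x^(L)] for a scalar input x."""
    xs = [np.array([1.0, x])]
    for l, W in enumerate(Ws, start=1):
        s = W.T @ xs[-1]                                   # s^(l)
        if l == len(Ws):
            xs.append(sigmoid(s))                          # output layer, no bias node
        else:
            xs.append(np.concatenate(([1.0], np.tanh(s)))) # bias node + tanh(s^(l))
    return xs

def backprop(x, y, Ws):
    """Gradients de/dW^(l) of e = (h(x) - y)^2, one array per layer."""
    xs = forward(x, Ws)
    h = xs[-1]
    delta = 2 * (h - y) * h * (1 - h)                      # delta^(L) seeds the recursion
    grads = [None] * len(Ws)
    for l in range(len(Ws), 0, -1):
        grads[l - 1] = np.outer(xs[l - 1], delta)          # x^(l-1) (delta^(l))^T
        if l > 1:
            hidden = xs[l - 1][1:]                         # tanh outputs of layer l-1, bias dropped
            # theta'(s) = 1 - tanh(s)^2; row 0 of W^(l) belongs to the bias node and is skipped
            delta = (1 - hidden ** 2) * (Ws[l - 1] @ delta)[1:]
    return grads

W1 = np.array([[0.1, 0.2], [0.3, 0.4]])
W2 = np.array([[0.2], [1.0], [-3.0]])
W3 = np.array([[1.0], [2.0]])
Ws = [W1, W2, W3]
grads = backprop(2.0, 1.0, Ws)
for l, g in enumerate(grads, start=1):
    print(f"de/dW({l}) =\n{g}")
```

A finite-difference check (perturb one weight, recompute the error, and compare the slope against the corresponding gradient entry) is a common way to confirm that a backpropagation implementation is correct.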
* Another example with Forward and Backpropagation:
![example_nn](https://matthewmazur.files.wordpress.com/2018/03/neural_network-9.png)

5) Stochastic Gradient Descent in Backpropagation
* Backpropagation finds the gradient at each observation, then averages them to find the total gradient:
    * $\nabla E(w)=\frac{1}{N}\sum_{i=1}^N\nabla e_i(w)$
    * $w(t+1)=w(t)-\eta\nabla E(w(t))$
* Instead, **update weights** at **each** observation (or after a small batch of observations):
    * $w(t+1)=w(t)-\eta\nabla e_i(w(t))$
    ![sgd_backprop](sgd_backprop.png)
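The mini-batch update can be sketched as below. To keep the example short it assumes a one-weight linear model $h(x)=wx$ with squared error instead of a full neural network; the synthetic data, learning rate, and batch size are hypothetical choices. With 200 observations and a batch size of 10, each epoch performs 20 weight updates.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)    # synthetic data, true slope 3

def grad(w, Xb, yb):
    """Gradient of the mean squared error of h(x) = w * x over one batch."""
    return 2.0 * np.mean((w * Xb[:, 0] - yb) * Xb[:, 0])

def sgd(w, eta=0.05, epochs=5, batch_size=10):
    for _ in range(epochs):
        order = rng.permutation(len(X))                 # reshuffle the data every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            w = w - eta * grad(w, X[idx], y[idx])       # one update per mini-batch
    return w

w_sgd = sgd(0.0)
print(w_sgd)
```

Setting `batch_size=len(X)` recovers ordinary (full-batch) gradient descent, and `batch_size=1` updates after every observation.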

6) Neural Network Parameter Tuning
* **Learning rate**
* **Number of hidden layers**
* **Number of neurons on hidden layers**
* **Initialization of weights**
    * What happens if you set the weights to 0, or to very large values?
    * don't set weights to 0; instead, sample the initial weights from a normal distribution centered at 0
    * rule of thumb:
        * sample weights from $N(0,\sigma^2_w)$
        * $\sigma^2_w \max_i \Vert x_i\Vert^2 \ll 1$
* **Scaling** - normalize data before fitting data to neural network model (depending on the activation function)
* **Epoch / Batches** - an epoch is a single sweep through all of the data; a batch is the subset of observations used for one gradient descent update
    * example: with 100,000 observations and a batch size of 100, each epoch consists of 1,000 gradient descent update steps
* **Termination** - the error function is generally an **extremely** non-convex function
    * Lots of local minima and flat spots
    * Often best to terminate after a set number of iterations
    * Also, can terminate when the gradient is small and the total error is small
* **Momentum** - helps push gradient descent out of local minima (it modifies the step taken at each update)
    * adds a fraction $m$ of the previous update to the new update step: $w(t+1)=w(t)-\eta\nabla E(w(t))+m(w(t)-w(t-1))$
    * $m=0.9$ is standard
    * Too high an $m$ risks overshooting the minimum
    * Too low an $m$ risks getting stuck in local minima
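The momentum update can be sketched on a toy problem as follows. The quadratic error $E(w)=(w-4)^2$, the learning rate, and the helper name `minimize_with_momentum` are assumptions chosen for illustration; only the update rule itself comes from the formula above.

```python
def minimize_with_momentum(grad, w0, eta=0.1, m=0.9, steps=200):
    """Gradient descent where each step also adds m times the previous step."""
    w_prev, w = w0, w0                                   # no previous step at t = 0
    for _ in range(steps):
        w_next = w - eta * grad(w) + m * (w - w_prev)    # w(t+1) from w(t) and w(t-1)
        w_prev, w = w, w_next
    return w

# Toy error E(w) = (w - 4)^2 with gradient 2 (w - 4); the minimum is at w = 4.
w_min = minimize_with_momentum(lambda w: 2.0 * (w - 4.0), w0=0.0)
print(w_min)
```

On this convex toy problem the iterates oscillate around the minimum before settling, which illustrates the overshooting risk of a large $m$.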
* **Activation functions**
    1. **Sigmoid**
        * equation: $\sigma(z)=\frac{1}{1+e^{-z}}$
    2. **Tanh** - a hyperbolic tangent function
    ![tanh_func](tanh_func.png)
        * equation: $tanh(z)=2\sigma(2z)-1$
        * same shape as the sigmoid function
        * output values centered around 0
        * smooth or differentiable (unlike sign)
            * $tanh(z)=2\sigma(2z)-1 \rightarrow tanh'(z)=1-tanh^2(z)$
        * trains faster than sigmoid in most cases
        * (-) requires normalization of the data
    3. **Softmax**
        * equation: $\frac{e^{s_j^{(L)}}}{\sum_{i=1}^K e^{s_i^{(L)}}}$
        * useful activation function for output layer of a multi-class classification neural net
    4. **Rectifier / Hard Max / Rectified Linear Unit (ReLU)** - the function $x \mapsto \max(0,x)$
        * gradient does not vanish as $x$ gets large
        * 0 if $x<0$, which introduces sparsity into the network
        * has faster training
        * (+) doesn't require normalization of data
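A short NumPy sketch can check the identities quoted above for these activation functions: $tanh(z)=2\sigma(2z)-1$, $tanh'(z)=1-tanh^2(z)$, and the behaviour of the rectifier on either side of 0. The grid of test points `z` is an arbitrary choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectifier: 0 for negative inputs, the input itself otherwise."""
    return np.maximum(0.0, z)

z = np.linspace(-3.0, 3.0, 7)                    # arbitrary test points

tanh_from_sigmoid = 2.0 * sigmoid(2.0 * z) - 1.0 # should equal np.tanh(z)
tanh_grad = 1.0 - np.tanh(z) ** 2                # analytic derivative of tanh
relu_grad = (z > 0).astype(float)                # slope is 1 for z > 0, 0 for z < 0

print(np.allclose(tanh_from_sigmoid, np.tanh(z)))
print(relu(z))
```

The tanh gradient shrinks toward 0 as $|z|$ grows, while the ReLU gradient stays at 1 for every positive input, which is the sense in which the rectifier's gradient does not vanish.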
* **Overfitting in NN**
    * **Regularization** - add an $L_2$-regularization term: $\lambda\sum_{l,i,j}(w_{ij}^{(l)})^2$ to the error function
    * **Dropout** - randomly remove nodes from the network each time you run backpropagation

7) Regression and Multi-Class Classification
* Neural networks can also be used for regression and for data with multiple classes
    * For regression, replace the output activation with the identity ($\theta = \mathrm{id}$) and use MSE
    * For multiple classes:
        * The output layer will have multiple nodes, one for each class $1,2,\dots,K$
        * The output activation function on the $j$-th component of the output layer is the **softmax function**: $\frac{e^{s_j^{(L)}}}{\sum_{i=1}^K e^{s_i^{(L)}}}$
        * Minimize **cross-entropy**: $-\sum_{i=1}^N\sum_{k=1}^K y_{ik}\log h_k(x_i)$, where $y_{ik}=1$ if observation $i$ belongs to class $k$ and $0$ otherwise
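For a single observation, the softmax output and its cross-entropy error can be sketched as below. The scores `s` stand in for a hypothetical output-layer signal $s^{(L)}$ with $K=3$ classes, and subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the result.

```python
import numpy as np

def softmax(s):
    """Turn K real-valued scores into K probabilities that sum to 1."""
    e = np.exp(s - np.max(s))          # shift by the max score to avoid overflow
    return e / np.sum(e)

def cross_entropy(probs, label):
    """Cross-entropy for one observation: -log of the true class's probability."""
    return -np.log(probs[label])

s = np.array([2.0, 1.0, 0.1])          # hypothetical scores s^(L) for K = 3 classes
p = softmax(s)
loss_correct = cross_entropy(p, 0)     # true class is the highest-scoring one
loss_wrong = cross_entropy(p, 2)       # true class is the lowest-scoring one
print(p, loss_correct, loss_wrong)
```

The error is small when the network puts high probability on the true class and grows without bound as that probability approaches 0.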

8) Types of Neural Network Architecture
1. **Fully Connected Neural Networks** - architecture where each layer is *fully connected* to the next and has no missing edges between nodes
![full_nn](full_nn.png)
    * **Deep Learning Neural Network** - a buzzword referring to a neural network with more than 3 layers
2. **Recurrent Neural Networks (RNN)** - a NN where a hidden layer feeds back into itself
    * This allows the NN to exhibit dynamic temporal behavior
    * RNNs provide "internal" memory for sequence processing
    * **Long Short Term Memory (LSTM)** - a special kind of RNN capable of learning **long-term dependencies**
    * They are very applicable to handwriting/speech recognition
        * e.g. Used by Google Translate application
        * e.g. Used to train AI on human negotiations
![rnn](rnn.png)
3. **Convolutional Neural Networks (CNN)** - architecture isn't fully connected and employs convolution layers
![cnn](cnn.png)
    * used mainly for image classification (state of the art)
    * **Convolution Layers** - each node only "sees" a subset of the previous layer's nodes
        * applies convolutions (type of filter) to each sub-image to "look for" certain patterns or shapes (which are learned)

9) Deep Learning Achievements
1. Text
    * Google Translate with RNN
    * Facebook Human Negotiation AI with RNN
2. Voice
    * Google DeepMind autoregressive full-convolution WaveNet (A generative model for raw audio) with PixelRNN/PixelCNN
    * Google DeepMind Lip reading from television dataset using LSTM + CNN
    * WashU Obama Synchronize Lip with RNN
3. Computer Vision
    * Google Brain enhances Google Maps with OCR (Optical Character Recognition) to recognize street signs and store signs using CNN + LSTM
    * Google DeepMind Visual Reasoning on CLEVR dataset with 95.5% accuracy using a pre-trained LSTM
    * Uizard pix2code (GUI interpreted by NN into code) with 77% accuracy
    * Google SketchRNN trained on detailed vector representations of drawings using Sequence-to-Sequence Variational Autoencoder (VAE) RNN
    * **Generative Adversarial Networks (GANs)** - competition of two networks (generator and discriminator)
        * e.g. First network creates a picture, and the second one tries to understand whether the picture is real or generated
        * Orange Labs France Face Aging with Conditional GANs using IMDB dataset
        * Google Improves Professional Photos using GANs with Google Street View dataset
        * MichU Synthesization of an image from a text description using GANs
        * Berkeley AI Research (BAIR) Image-to-Image Translation with Conditional GANs (e.g. creating a map using a satellite image, or realistic texture of the objects using their sketch)
        * Christopher Hesse uses UNet and PatchGAN to make nightmare cat demo
        * Authors of Pix2Pix develop CycleGAN for transfer between different domains of images
        * Using Adversarial Autoencoder (AAE) to find new drugs to fight cancer
        * Improvements in Adversarial Attacks (tricking a NN's recognition by injecting noise into its input) using Fast Gradient Sign Method (FGSM) - important for protecting face recognition and self-driving algorithms from being attacked
4. **Reinforcement Learning (RL)** - learn the successful behavior of the agent in an environment that gives a reward through experience (e.g. people learning throughout their lives) - used actively in games (e.g. AlphaGo), robots, and system management (e.g. traffic)
    * Google DeepMind Deep Q-network (DQN) plays arcade games better than humans (currently being taught to play complex games like Doom)
        * Introduction of additional losses (auxiliary tasks), such as the prediction of a frame change (pixel control) so that the agent better understands the consequences of the actions, significantly speeds up learning
    * OpenAI Learning Robots using RL (one-shot learning) by actively studying an agent's training by humans in a virtual environment
        * A person shows in VR how to perform a certain task, and one demonstration is enough for the algorithm to learn it and then reproduce it in real conditions
    * OpenAI/Google DeepMind Learning on Human Preferences using RL
        * An agent has a task, and the algorithm presents two possible solutions to a human, who indicates which one is better
    * Google Deepmind Movement in Complex Environments
        * Teaching a robot complex behavior (walk, jump, etc.) using agents (body emulators) to perform complex actions by constructing a complex environment with obstacles and with a simple reward for progress in movement
5. Other
    * Google DeepMind cools data centers (reducing energy costs): an NN ensemble predicts Power Usage Effectiveness (PUE) from the readings of thousands of sensors
    * Google Brain One Model For All Tasks (currently trained models are poorly transferred from task to task) (tensor2tensor)
        * Train a model that performs eight tasks from different domains (text, speech, images)
    * Facebook trains ImageNet in one hour (using a cluster of 256 Tesla P100 GPUs) with Gloo and Caffe2 for distributed learning
6. News
    * Self-driving Cars
        * Intel MobilEye
        * Google Waymo
    * Healthcare
        * Google Deepmind in Healthcare for medical diagnosis
    * Investments
        * China invests \$150 Billion in AI
        * Baidu Research employs 1,300 people
        * Alibaba runs 100 billion samples with a trillion parameters with ease