Neural Networks

* Objectives:
    * Know the best use cases for neural networks
    * Know the benefits and drawbacks of using a neural network
    * Build a simple neural network for binary classification
    * Train a neural network using backpropagation
    * Understand how neural networks can be used for regression and multi-class classification by using different loss functions and output activations
    * Explain the properties (pros/cons) of different activation functions
    * Explain some methods to avoid overfitting
    * Learn about some more complicated versions of neural networks
    * Use Keras to build neural networks in Python

1) Neural Network Basics
* Background - neural networks were introduced in the 1950s as a model that mimics the brain
    * biological neurons "fire" at a certain voltage threshold
    * an artificial neuron is modeled by an activation function such as **sign**, **sigmoid**, or **tanh**
    * beyond that, the analogy is weak: neural networks should not be thought of as models of the brain
* Why use Neural Networks?
    * (+) works well with high dimensional data (images, text, and audio)
    * (+) can model *arbitrarily* complicated decision functions (complex decision boundaries)
    * (-) not very interpretable
    * (-) slow to train (very computationally expensive)
    * (-) easy to overfit (high model complexity lowers bias but increases variance)
    * (-) difficult to tune (many parameters/choices when building the neural network architecture)
* What makes neural networks different from other machine learning techniques?
    * (+) Deep learning makes problem-solving much easier because it largely **automates feature engineering** (unlike other machine learning workflows)
        * The model learns its features from the data in one pass, rather than requiring you to engineer them yourself
    * Previous machine learning techniques (shallow learning) only involved transforming the input data into one or two successive representation spaces (e.g. SVM).
    * Can shallow machine learning methods be applied repeatedly to emulate effects of deep learning?
        * (-) In practice, there are fast-diminishing returns to successive applications of shallow-learning methods, because **the optimal first representation layer in a three-layer model isn't the optimal first layer in a one-layer or two-layer model**
        * (+) What is transformative about deep learning is that it allows a model to learn all layers of representation **jointly**, at the same time, rather than in succession (**greedily**).
        * (+) With **joint feature learning**, whenever the model adjusts one of its internal features, all other features that depend on it automatically adapt to the change, without requiring human intervention
* Why is deep learning popular now?
    * Hardware advances
        * The NVIDIA TITAN X cost only $1000 at the end of 2015 and can deliver a peak of 6.6 TFLOPS in single precision (training an ImageNet model takes only a couple of days)
        * In 2016, Google revealed its **tensor processing unit (TPU)**: a new chip design developed from the ground up to run deep neural networks (reportedly 10x faster and more energy efficient than top-of-the-line GPUs)
    * Algorithm advances - key issue is **gradient propagation** through deep stacks of layers where the feedback signal used to train neural networks would fade away as the number of layers increased
        * Better **activation functions** for neural layers
        * Better **weight-initialization schemes**, starting with layer-wise pretraining, which was quickly abandoned
        * Better **optimization schemes**, such as RMSProp and Adam
* Examples For Branches of Machine Learning in Neural Networks
    * Supervised learning
        * Examples: optical character recognition, speech recognition, image classification, and language translation
        * **Sequence generation** - given a picture, predict a caption describing it. Sequence generation can sometimes be reformulated as a series of classification problems (such as repeatedly predicting a word or token in a sequence)
        * **Syntax tree prediction** - given a sentence, predict its decomposition into a syntax tree
        * **Object detection** - given a picture, draw a bounding box around certain objects inside the picture. This can also be expressed as a classification problem (given many candidate bounding boxes, classify the contents of each one) or as a joint classification and regression problem, where the bounding-box coordinates are predicted via vector regression
        * **Image segmentation** - given a picture, draw a pixel-level mask on a specific object
    * Unsupervised learning
        * Purpose: data visualization, data compression, data denoising, or understanding correlations present in data
        * Often a necessary step in better understanding a dataset before attempting to solve a supervised-learning problem (e.g. dimensionality reduction and clustering)
    * **Self-supervised learning** - supervised learning without human-annotated labels or supervised learning without any humans in the loop
        * there are still labels involved, but they're generated from the input data, typically using a heuristic algorithm (e.g. autoencoders)
        * **Autoencoders** - generated targets are the input, unmodified
        * **Temporally Supervised Learning** - supervision comes from the future input data
            * e.g. predicting the next frame in a video, given past frames
            * e.g. predicting the next word in a text, given previous words
        * Self-supervised learning can be reinterpreted as either supervised or unsupervised learning (depending on whether you pay attention to the learning mechanism or to the context of its application)
    * **Reinforcement learning** - an **agent** receives information about its environment and learns to choose actions that will maximize some reward
        * e.g. a neural network that "looks" at a video-game screen and outputs game actions can be trained to maximize its score
        * real-world applications: self-driving cars, robotics, resource management, education, etc.

1.1) Data Representation For Neural Networks
* Tensor Basics:
    * Data stored in multidimensional Numpy arrays called **tensors**
    * All current machine learning systems use **tensors** as their basic data structure
    * At its core, a **tensor** is a container for data (almost always numerical data)
* **Scalars (0D tensors)**
    * A tensor that contains only one number is called a **scalar** (or scalar tensor, or 0-dimensional tensor, or 0D tensor). A scalar tensor has 0 axes. The number of axes of a tensor is also called its **rank** (a tensor of rank 0)
        ```python
        >>> import numpy as np
        >>> x = np.array(12)
        >>> x.ndim
        0
        ```
* **Vectors (1D tensors)**
    * An array of numbers is called a **vector**, or 1D tensor. A 1D tensor is said to have exactly one axis. Numpy vector below:
        ```python
        >>> x = np.array([12, 3, 6, 14, 5])
        >>> x.ndim
        1
        ```
    * This vector has five entries and so is called a **5-dimensional vector**. Don't confuse a 5D vector with a 5D tensor: a 5D vector has a single axis with five entries along it, whereas a 5D tensor has five axes
* **Matrices (2D tensors)**
    * An array of vectors is a **matrix**, or 2D tensor. A matrix has two axes (often referred to as rows and columns). Numpy matrix below:
        ```python
        >>> x = np.array([[5, 78, 2, 34, 0],
                        [6, 79, 3, 35, 1],
                        [7, 80, 4, 36, 2]])
        >>> x.ndim
        2
        ```
* **3D tensors and higher-dimensional tensors**
    * If you pack such matrices in a new array, you obtain a 3D tensor, which you can visually interpret as a cube of numbers. Numpy 3D tensor below:
        ```python
        >>> x = np.array([[[5, 78, 2, 34, 0],
                         [6, 79, 3, 35, 1],
                         [7, 80, 4, 36, 2]],
                        [[5, 78, 2, 34, 0],
                         [6, 79, 3, 35, 1],
                         [7, 80, 4, 36, 2]],
                        [[5, 78, 2, 34, 0],
                         [6, 79, 3, 35, 1],
                         [7, 80, 4, 36, 2]]])
        >>> x.ndim 
        3
        ```
    * By packing 3D tensors in an array, you can create a 4D tensor, and so on. In deep learning, you'll generally manipulate tensors that are 0D to 4D, although you may go up to 5D if you process video data
* Real-world examples of data tensors:
    * **Vector data** - 2D tensors of shape `(samples, features)`
        * e.g. 100,000 people with age, zipcode, and income $\rightarrow$ `(100000, 3)`
        * 2D tensors are often processed by **densely connected** layers (also called **fully connected** or **dense** layers)
    * **Timeseries data or sequence data** - 3D tensors of shape `(samples, timesteps, features)`
        * e.g. stock prices recorded every minute as (current price, highest price in the past minute, lowest price in the past minute): a full day of trading has 390 minutes, so 250 days of data $\rightarrow$ `(250, 390, 3)`
        * 3D tensors are typically processed by **recurrent** layers such as an **LSTM** layer
    ![timeseries_data](timeseries_data.png)
    * **Images** - 4D tensors of shape `(samples, height, width, channels)` or `(samples, channels, height, width)`
        * e.g. batch of 128 color images of 256x256 pixels $\rightarrow$ `(128, 256, 256, 3)`
        * 4D tensors are usually processed by 2D convolution layers (`Conv2D`)
    ![image_data](image_data.png)
        * There are two conventions for the shapes of image tensors:
            * the **channels-last** convention used by Tensorflow
            * the **channels-first** convention used by Theano
    * **Video** - 5D tensors of shape `(samples, frames, height, width, channels)` or `(samples, frames, channels, height, width)`
        * e.g. a 60-second, 144x256 YouTube video clip sampled at 4 frames per second has 240 frames; a batch of 4 such clips $\rightarrow$ `(4, 240, 144, 256, 3)`
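
The shapes listed above can be checked directly in NumPy. The following is a minimal sketch: the dictionary keys and variable names are illustrative, and `np.broadcast_to` is used only so that the large image and video tensors are not actually allocated.

```python
import numpy as np

# Shapes taken from the examples above; the keys are illustrative names.
shapes = {
    "vector data": (100000, 3),        # (samples, features)
    "timeseries": (250, 390, 3),       # (samples, timesteps, features)
    "images": (128, 256, 256, 3),      # (samples, height, width, channels)
    "video": (4, 240, 144, 256, 3),    # (samples, frames, height, width, channels)
}

for name, shape in shapes.items():
    # A read-only view of a single zero: same shape, no large memory allocation.
    x = np.broadcast_to(np.uint8(0), shape)
    print(f"{name}: ndim={x.ndim}, shape={x.shape}, values={x.size}")

# Switching from channels-last to channels-first is an axis permutation.
batch = np.zeros((2, 256, 256, 3), dtype=np.uint8)    # small channels-last batch
channels_first = np.transpose(batch, (0, 3, 1, 2))
print(channels_first.shape)   # (2, 3, 256, 256)
```

The video example alone holds over 100 million values, which is why real video datasets are usually stored in a compressed format rather than as raw tensors.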

1.2) Tensor Operations (Gears of Neural Networks)
* Much as any computer program can ultimately be reduced to a small set of binary operations on **binary inputs** (AND, OR, NOR, and so on), all transformations learned by deep neural networks can be reduced to a handful of **tensor operations** applied to tensors of numeric data (e.g. add tensors, multiply tensors, etc.)
* Example: A layer that takes a 2D tensor as input and returns another 2D tensor (a new representation for the input tensor)
    * `output = relu(dot(W, input) + b)`
        * `W` is a 2D tensor
        * `b` is a vector
    * There are three tensor operations:
        * a dot product (`dot`) between the input tensor and the `W` tensor
        * an addition (`+`) between the resulting 2D tensor and a vector `b`
        * a `relu` operation - `relu(x)` is `max(x, 0)`
            ```python
            def naive_relu(x):
                # x is expected to be a 2D NumPy array
                assert len(x.shape) == 2

                x = x.copy()  # avoid overwriting the input tensor
                for i in range(x.shape[0]):
                    for j in range(x.shape[1]):
                        x[i, j] = max(x[i, j], 0)
                return x
            ```
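
In practice the three operations are not written as Python loops but as vectorized NumPy calls. The sketch below is one possible version of `relu(dot(W, input) + b)`. The function name `dense_relu`, the layer sizes, and the layout of the input (one column per sample) are assumptions made for illustration.

```python
import numpy as np

def dense_relu(W, x, b):
    """Compute relu(dot(W, x) + b) with vectorized NumPy operations."""
    # Dot product, then add b to every column (b is broadcast across samples).
    z = np.dot(W, x) + b[:, np.newaxis]
    # Element-wise relu: same result as the naive double loop.
    return np.maximum(z, 0.0)

# Hypothetical sizes: 4 input features, 3 output units, 5 samples.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=(4, 5))
b = np.zeros(3)

out = dense_relu(W, x, b)
print(out.shape)   # (3, 5)
```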

1.3) Data Preprocessing for Neural Networks
* **Vectorization** - all inputs and targets in a neural network must be tensors of floating-point data
    * **Data vectorization** - convert data that needs to be processed (e.g. sound, images, text) into tensors 
    * e.g. text represented as lists of integers can be one-hot encoded into a tensor of `float32` data
* **Value Normalization** - rescaling feature values so that all features fall in a similar, small range (typically between 0 and 1)
    * In general, it isn't safe to feed a neural network data that takes relatively large values (e.g. multi-digit integers, which are much larger than the initial values taken by the weights of a network) or data that is heterogeneous (e.g. data where one feature is in the range 0-1 and another is in the range 100-200)
        * Doing so can trigger large gradient updates that will prevent the network from converging
    * e.g. in digit classification, the image data starts out as integers in the 0-255 range (grayscale values); cast it to `float32` and divide by 255 to yield values in the 0-1 range
    * Making learning easier for the network:
        * **Take small values** - typically most values should be in the 0-1 range
        * **Be homogeneous** - all features should take values roughly in the same range
    * Additional stricter normalization practices:
        * Normalize each feature independently to have mean of 0
        * Normalize each feature independently to have a standard deviation of 1
* **Handling Missing Values** - labelling missing values as 0
    * In general, it's safe to input missing values as 0, with the condition that 0 isn't already a meaningful value
    * If you are expecting missing values in the test data, but the network was trained on data without any missing values, the network won't have learned to ignore missing values
        * solution: artificially generate training samples with missing entries - copy some training samples several times, and drop some of the features that you expect are likely to be missing from the test data
    * e.g. in the house-price data, the first feature (per capita crime rate) could be missing for some samples in the training or test data
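
The two normalization recipes described in this section (scaling to the 0-1 range, and per-feature standardization) can be sketched in NumPy as follows. The data values are made up for illustration.

```python
import numpy as np

# Scaling pixel-like integers from the 0-255 range down to the 0-1 range.
pixels = np.array([0, 64, 128, 255], dtype=np.uint8)
scaled = pixels.astype("float32") / 255

# Hypothetical data: 5 samples, 2 features on very different scales.
x = np.array([[0.2, 150.0],
              [0.5, 120.0],
              [0.9, 180.0],
              [0.1, 200.0],
              [0.7, 100.0]], dtype=np.float32)

# Feature-wise standardization: statistics are computed over the sample axis (axis 0).
mean = x.mean(axis=0)
std = x.std(axis=0)
x_norm = (x - mean) / std

print(x_norm.mean(axis=0))   # close to 0 for each feature
print(x_norm.std(axis=0))    # close to 1 for each feature
```

A common convention is to compute `mean` and `std` on the training data only and reuse the same values when normalizing the test data.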

2) Building Neural Networks Basics
![neural_network_relationship](neural_network_relationship.png)
* Summary: The neural network, composed of layers that are chained together, maps the input data to predictions. The loss function then compares these predictions to the targets, producing a loss value: a measure of how well the network's predictions match what was expected. The optimizer uses this loss value to update the network's weights.
* Neural Network Training Loop Steps: (**Mini-batch SGD**)
    1. Draw a batch of training samples $x$ and corresponding targets $y$
    2. Run the network on $x$ (**forward pass**) to obtain predictions `y_pred`
    3. Compute the loss of the network on the batch, a measure of the mismatch between `y_pred` and $y$
    4. Compute the gradient of the loss with regard to the network's parameters (**backward pass**)
    5. Move parameters a little in the opposite direction from gradient (e.g. `W -= step * gradient`), thus reducing the loss on the batch a bit
    ![sgd_1para_1samp](sgd_1para_1samp.png)
* **Activation functions** - allows for non-linear transformations
    * Keras example of activation functions in `Dense` layers
        ```python
        from keras import models
        from keras import layers

        model = models.Sequential()
        model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
        model.add(layers.Dense(16, activation='relu'))
        model.add(layers.Dense(1, activation='sigmoid'))
        ```
    * What are activation functions, and why are they necessary?
        * Without an activation function like `relu` (also called a non-linearity), the `Dense` layer would consist of two linear operations (a dot product and an addition):
            * `output = dot(W, input) + b`
        * Without activation functions, the **hypothesis space** of the layer would be only the set of all possible linear transformations of the input data
            * Stacking more layers wouldn't help: a stack of linear layers still implements a single linear operation
            * Adding layers therefore wouldn't extend the hypothesis space
    * Where have we seen an activation function before? (sigmoid function in Logistic Regression!)
        * input: $x$
        * weights: $w$
        * sigmoid function: $\sigma(z)=\frac{1}{1+e^{-z}}$
        * now, classify $x$ as positive if $\sigma(w^Tx)>0.5$
        * think of $\sigma$ as an **activation function** which **activates** if the input is larger than 0
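
The claim that layers without activation functions do not extend the hypothesis space can be checked numerically. The sketch below uses small, made-up weights: two stacked linear layers give the same result as one linear layer with combined weights, while inserting `relu` between them changes the result.

```python
import numpy as np

x = np.array([1.0, -2.0])

# Two hypothetical Dense layers without activation functions.
W1, b1 = np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([0.5, 0.5])
W2, b2 = np.array([[1.0, 1.0]]), np.array([0.0])

two_linear_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping collapsed into a single linear layer.
W, b = W2 @ W1, W2 @ b1 + b2
one_linear_layer = W @ x + b
print(np.allclose(two_linear_layers, one_linear_layer))   # True

# With relu between the layers, the network is no longer that linear map.
with_relu = W2 @ np.maximum(W1 @ x + b1, 0.0) + b2
print(two_linear_layers, with_relu)   # [0.] [1.5]
```
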
* Creating a two-layer neural network with a sigmoid activation function
![sigmoid_two_layer](sigmoid_two_layer.png)
    * Architecture:
        * **Input layer**: the nodes which hold the inputs: $1,x_1,x_2,\dots,x_n$
        * **Output layer**: the single node that holds the output value
        * **Weights**: $w_0,w_1,\dots,w_n$, one per edge connecting the two layers
    * with this two-layer architecture, $h(x|w) = \sigma(w^Tx)$ can only model **linear** decision functions
        * the decision boundary is the set of points where $w^Tx=0$, which is a hyperplane
* Moving onto Neural Network with Multiple Layers (e.g. 4 layers)
![4_layer_nn_tanh_sigma](4_layer_nn_tanh_sigma.png)
    * Architecture:
        * **Input layer (layer 0)**: contains the input value $x$ and a bias term $1$
        * **Two hidden layers (layers 1 and 2)**
        * **Output layer (layer 3)**: contains the output value (or the probability of positive classification)
    * Compute output for: $x=3$
        * Layer 1:
            * The first non-bias node $\rightarrow tanh((0.1)(1)+(0.3)(3))=0.76$
            * The second non-bias node $\rightarrow tanh((0.2)(1)+(0.4)(3))=0.89$
        * Layer 2:
            * The non-bias node $\rightarrow tanh((0.2)(1)+(1)(0.76)+(-3)(0.89))=-0.94$
        * Output:
            * The value of the output layer $\rightarrow \sigma((1)(1)+(2)(-0.94))=0.29$
        * Finally, $h(3)=0.29$ for $x=3$

3) Neural Network Mathematical Notation
* Simplified Neural Network Architecture (At First)
    * Stick to networks for binary classification (a single output node)
    * Output node will use the sigmoid activation function $\sigma$
    * Hidden layers will use tanh activation function
    * $\theta$ will always represent an activation function (e.g. sign, tanh, sigmoid($\sigma$), rectifier)
* Schematic of Artificial Neuron
![artifical_neuron_schematic](artifical_neuron_schematic.png)
    * Layers are given by indices: $0,1,2,\dots,L$
        * Input layer: $0$
        * Output layer: $L$
    * For each layer $l$:
        * $s^{(l)} \rightarrow d^{(l)}$-dimensional input vector
        * $x^{(l)} \rightarrow (d^{(l)}+1)$-dimensional output vector
        * $W^{(l)} \rightarrow (d^{(l-1)}+1)\times d^{(l)}$ matrix of input weights
            * $W_{ij}^{(l)} \rightarrow$ the weight of the edge from the $i$-th node in layer $l-1$ to the $j$-th node in layer $l$
* Convert Previous 4 Layer Network To Mathematical Notation
![4_layer_nn_tanh_sigma](4_layer_nn_tanh_sigma.png)
    * Steps For **Forward Propagation**:
        1. Multiply the initial input $x^{(0)}$ by the weights $W^{(1)}$ between the layers (matrix multiplication) to yield $s^{(1)}$
        2. Apply the activation function (e.g. tanh or sigmoid) to $s^{(1)}$ and prepend the bias node to yield $x^{(1)}$
        3. Apply the next weights $W^{(2)}$ to the previous layer's output $x^{(1)}$, and repeat the cycle until reaching the final output $x^{(L)}$
    * **Layer 0 $\rightarrow$ 1:**
        * $x^{(0)}=\left[\begin{array}{cc}
            1 \\ 
            3 
            \end{array}\right]$
        * $W^{(1)}=\left[\begin{array}{cc}
            0.1 & 0.2 \\ 
            0.3 & 0.4 
            \end{array}\right]$
        * $s^{(1)}$ is the result of applying the weights on the edges between layer 0 and 1: (Applying Weights)
            * $\left[\begin{array}{cc}
                (0.1)(1)+(0.3)(3) \\ 
                (0.2)(1)+(0.4)(3)
                \end{array}\right] = \left[\begin{array}{cc}
                1 \\ 
                1.4
                \end{array}\right]$ 
        * $x^{(1)}$ is the output of layer 1 after applying tanh and adding a bias node: (Output of Layer 1 After Bias/Tanh Function)
            * $\left[\begin{array}{cc}
                1 \\ 
                tanh(1) \\
                tanh(1.4)
                \end{array}\right] = \left[\begin{array}{cc}
                1 \\ 
                0.76 \\
                0.89
                \end{array}\right]$ 
    * **Layer 1 $\rightarrow$ 2:**
        * $W^{(2)}=\left[\begin{array}{cc}
            0.2 \\ 
            1 \\
            -3
            \end{array}\right]$
        * $s^{(2)}=\left[\begin{array}{cc}
            (0.2)(1)+(1)(0.76)+(-3)(0.89)
            \end{array}\right] = \left[\begin{array}{cc}
                -1.71
                \end{array}\right]$
        * $x^{(2)}=\left[\begin{array}{cc}
                1 \\ 
                tanh(-1.71)
                \end{array}\right] = \left[\begin{array}{cc}
                1 \\ 
                -0.94
                \end{array}\right]$ 
    * **Layer 2 $\rightarrow$ 3:**
        * $W^{(3)}=\left[\begin{array}{cc}
            1 \\ 
            2
            \end{array}\right]$
        * $s^{(3)}=\left[\begin{array}{cc}
            (1)(1)+(2)(-0.94)
            \end{array}\right] = \left[\begin{array}{cc}
                -0.88
                \end{array}\right]$
        * $x^{(3)}=\left[\begin{array}{cc}
                \sigma(-0.88)
                \end{array}\right] = \left[\begin{array}{cc}
                0.29
                \end{array}\right]$ 
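
The layer-by-layer computation above can be written as a short loop. This is a sketch under the conventions of this section (tanh hidden layers, sigmoid output, a bias node of 1 prepended to every non-output layer). The function name `forward` and the list `weights` are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weight matrices from the worked example; W^(l) has shape (d^(l-1) + 1, d^(l)).
weights = [
    np.array([[0.1, 0.2],
              [0.3, 0.4]]),              # W^(1)
    np.array([[0.2], [1.0], [-3.0]]),    # W^(2)
    np.array([[1.0], [2.0]]),            # W^(3)
]

def forward(x, weights):
    """Return h(x) for a scalar input x."""
    out = np.array([1.0, x])                              # x^(0): bias node plus input
    for l, W in enumerate(weights, start=1):
        s = W.T @ out                                     # s^(l) = (W^(l))^T x^(l-1)
        if l < len(weights):
            out = np.concatenate(([1.0], np.tanh(s)))     # hidden layer: prepend bias node
        else:
            out = sigmoid(s)                              # output layer: no bias node
    return out

h = forward(3.0, weights)
print(h)   # approximately 0.295
```

The hand calculation rounds every intermediate value to two decimals, which is why it reports 0.29 while the unrounded computation gives a value slightly above that.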

4) Forward Propagation and Backpropagation
* **Forward Propagation** - computing the output of a neural network with fixed weights (based on studying the above examples)
    * $x^{(l)}=\left[\begin{array}{cc}
        1 \\ 
        \theta(s^{(l)})
        \end{array}\right]$ (Inputs that have applied activation functions)
    * $s^{(l)}=(W^{(l)})^Tx^{(l-1)}$ (Weights that are applied to the Inputs)
    * Propagation of computations: $x^{(0)}\xrightarrow{W^{(1)}}s^{(1)}\xrightarrow{\theta}x^{(1)}\xrightarrow{W^{(2)}}s^{(2)}\cdots\rightarrow s^{(L)}\xrightarrow{\theta}x^{(L)}=h(x^{(0)})$
    * In terms of the number of nodes $V$ and weights $E$, what is the algorithmic complexity of forward propagation (in Big-O notation)?
* **Backpropagation (or Reverse-mode Differentiation)** - computes the gradient of an error function with respect to every weight, starting from the predictions made by the Neural Network in Forward Propagation and working backward; gradient descent then uses these gradients to modify the weights (thereby improving the model)
    * Why are we using gradient descent to modify weights?
        * Naive approach would be to change one weight at a time holding all other weights same
            * This methodology would be terribly expensive and inefficient
        * Take advantage of the fact that all operations used in the network are **differentiable** and compute the **gradient** of the loss with regard to the network's coefficients
        ![derivative](derivative.png)
            * If $a$ is negative, it means a small change of $x$ around $p$ will result in a decrease of $f(x)$
            * If $a$ is positive, a small change in $x$ will result in an increase of $f(x)$
            * The absolute value of $a$ (the magnitude of the derivative) tells you how quickly this increase or decrease will happen
            * If you're trying to update $x$ by a small amount `epsilon_x` in order to minimize $f(x)$, and you know the derivative of $f$, then you can reduce the value of $f(x)$ by moving $x$ a little in the opposite direction from the derivative
            * Where the derivative of function $f(x)$ of a single coefficient can be interpreted as the slope of curve, $gradient(f)(W_0)$ can be interpreted as the tensor describing the **curvature** of $f(W)$ around $W_0$
            * With function $f(W)$ of a tensor, reduce $f(W)$ by moving $W$ in the opposite direction from the gradient (going against the curvature means putting you lower in the curve)
    * Training data = $\{(x_i,y_i)\}$
    * Need to minimize some error function $E$ on our training set over the weights: $w = (W^{(1)},\dots,W^{(L)})$
        * Example error function, MSE: $E(w)=\frac{1}{N}\sum_{i=1}^{N}(h(x_i|w)-y_i)^2$
    * This function can be *extremely* complicated to write algebraically and has no closed-form solution for its minima
    * Use the gradient descent algorithm to train the neural network (the gradient computation is called Backpropagation)
        * Update step in gradient descent: $w(t+1)=w(t)-\eta\triangledown E(w(t))$
    * Our total error is the average of the errors, $e_i$, on each input:
        * $E(w)=\frac{1}{N}\sum_{i=1}^{N} e_i$ where $e_i=(h(x_i|w)-y_i)^2$
        * take derivative with respect to weights: $\frac{\partial E}{\partial W^{(l)}}=\frac{1}{N}\sum_{i=1}^{N}{\frac{\partial e_i}{\partial W^{(l)}}}$ (need to review how to take derivatives)
        * can consider one data point at a time and add the results to get the total gradient
    * Backpropagation uses the **chain rule** to compute the partial derivatives of layer $l$ in terms of layer $l+1$
        * the **sensitivity vector** of layer $l$: $\delta^{(l)}=\frac{\partial e}{\partial s^{(l)}}$
        * then, we can compute: $\frac{\partial e}{\partial W^{(l)}} = x^{(l-1)}(\delta^{(l)})^T$
        * for $j$ in $1,\dots,d^{(l)}$: $\delta_j^{(l)}=\theta'(s^{(l)})_j \times [W^{(l+1)}\delta^{(l+1)}]_j$
        * can compute $\delta^{(l)}$ from $\delta^{(l+1)}$
        * must still compute $\delta^{(L)}$ to seed the process
            * depends on the error function and the output activation function
            * in this case: $\delta^{(L)}=2(h(x_i|w)-y_i)h(x_i|w)(1-h(x_i|w))$
        * $W^{(l)}=W^{(l)}-\eta\frac{\partial E}{\partial W^{(l)}}$
* Complete Computation for Forward propagation and Backpropagation example (using the 4 layer NN from above):
    * Forward propagation:
        * Data is $x=2,y=1$
        * $x^{(0)}=\left[\begin{array}{cc}
            1 \\ 
            2 
            \end{array}\right]$; 
            $s^{(1)}=\left[\begin{array}{cc}
            0.1 & 0.3 \\ 
            0.2 & 0.4
            \end{array}\right]
            \left[\begin{array}{cc}
            1 \\ 
            2
            \end{array}\right]=
            \left[\begin{array}{cc}
            0.7 \\ 
            1
            \end{array}\right]$;
            $x^{(1)}=\left[\begin{array}{cc}
            1 \\ 
            0.6 \\
            0.76
            \end{array}\right]$
        * $s^{(2)}=\left[\begin{array}{cc}
            -1.48 
            \end{array}\right]$; 
            $x^{(2)}=\left[\begin{array}{cc}
            1 \\ 
            -0.90 
            \end{array}\right]$
        * $s^{(3)}=\left[\begin{array}{cc}
            -0.8 
            \end{array}\right]$; 
            $x^{(3)}=\left[\begin{array}{cc}
            0.31 
            \end{array}\right]$
    * Backpropagation:
        * $\delta^{(3)}=2(0.31-1)(0.31)(1-0.31)=-0.30$
        * $\delta^{(2)}=(1-0.9^2)(2)(-0.30)=-0.114$
        * $\delta^{(1)}=\left[\begin{array}{cc}
            -0.072 \\
            0.144
            \end{array}\right]$
        * $\frac{\partial e}{\partial W^{(1)}}=x^{(0)}(\delta^{(1)})^T=\left[\begin{array}{cc}
            -0.072 & 0.144 \\
            -0.144 & 0.288
            \end{array}\right]$
        * $\frac{\partial e}{\partial W^{(2)}}=x^{(1)}(\delta^{(2)})^T=\left[\begin{array}{cc}
            -0.114 \\
            -0.068 \\
            -0.087
            \end{array}\right]$
        * $\frac{\partial e}{\partial W^{(3)}}=x^{(2)}(\delta^{(3)})^T=\left[\begin{array}{cc}
            -0.30 \\
            0.27
            \end{array}\right]$
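
The forward and backward passes for this example can be reproduced in NumPy using the formulas $\delta^{(l)}$ and $\frac{\partial e}{\partial W^{(l)}} = x^{(l-1)}(\delta^{(l)})^T$ from this section. This is a sketch with illustrative function names (`error`, `backprop`). Instead of hard-coding expected numbers, it compares one analytic gradient entry against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = np.array([[0.1, 0.2], [0.3, 0.4]])
W2 = np.array([[0.2], [1.0], [-3.0]])
W3 = np.array([[1.0], [2.0]])

def error(W1, W2, W3, x, y):
    """Squared error e = (h(x) - y)^2, plus the layer outputs from the forward pass."""
    x0 = np.array([1.0, x])
    x1 = np.concatenate(([1.0], np.tanh(W1.T @ x0)))
    x2 = np.concatenate(([1.0], np.tanh(W2.T @ x1)))
    h = sigmoid(W3.T @ x2)
    return ((h - y) ** 2).item(), (x0, x1, x2, h)

def backprop(W1, W2, W3, x, y):
    _, (x0, x1, x2, h) = error(W1, W2, W3, x, y)
    d3 = 2 * (h - y) * h * (1 - h)              # seed: squared error with sigmoid output
    d2 = (1 - x2[1:] ** 2) * (W3 @ d3)[1:]      # tanh'(s) = 1 - tanh(s)^2; skip the bias row
    d1 = (1 - x1[1:] ** 2) * (W2 @ d2)[1:]
    # Gradient for each weight matrix: outer product x^(l-1) (delta^(l))^T.
    return np.outer(x0, d1), np.outer(x1, d2), np.outer(x2, d3)

g1, g2, g3 = backprop(W1, W2, W3, x=2.0, y=1.0)

# Finite-difference check on a single weight, W2[1, 0].
eps = 1e-6
W2_plus, W2_minus = W2.copy(), W2.copy()
W2_plus[1, 0] += eps
W2_minus[1, 0] -= eps
numeric = (error(W1, W2_plus, W3, 2.0, 1.0)[0]
           - error(W1, W2_minus, W3, 2.0, 1.0)[0]) / (2 * eps)
print(g2[1, 0], numeric)   # the two values should agree closely
```

Because the code keeps full precision, its values differ slightly from the hand calculation, which rounds each intermediate result.
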
* Another example with Forward and Backpropagation:
![example_nn](https://matthewmazur.files.wordpress.com/2018/03/neural_network-9.png)

5) Stochastic Gradient Descent in Backpropagation
* Backpropagation finds the gradient at each observation and averages them to find the total gradient:
    * $\triangledown E(w)=\frac{1}{N}\sum_{i=1}^{N}\triangledown e_i(w)$
    * $w(t+1)=w(t)-\eta\triangledown E(w(t))$
* Instead, **update weights** at **each** observation (or after a small batch of observations):
    * $w(t+1)=w(t)-\eta\triangledown e_i(w(t))$
    ![sgd_backprop](sgd_backprop.png)
* Chaining derivatives:
    * In practice, a neural network function consists of many tensor operations chained together, each of which has a simple, known derivative
        * e.g. a network $f$ composed of three tensor operations: $a$, $b$, and $c$ with weight matrices $W_1$, $W_2$, and $W_3$ $$f(W_1, W_2, W_3) = a(W_1, b(W_2, c(W_3)))$$
    * Backpropagation starts with the final loss value and works backwards from the top layers to the bottom layers, applying the chain rule to compute the contribution that each parameter had in the loss value
* **Symbolic Differentiation** - how modern frameworks like TensorFlow implement backpropagation
    * Given a chain of operations with known derivatives, they apply the chain rule to compute a gradient **function** for the chain, which maps network parameter values to gradient values
    * With this function, the backward pass is reduced to a call to this gradient function
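
The update rule $w(t+1)=w(t)-\eta\triangledown e_i(w(t))$, applied after each small batch of observations, can be sketched on a simple model. To keep the example short it fits a linear model with MSE rather than a neural network. The synthetic data, learning rate, and batch size are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: y = 3*x + 1 plus a little noise.
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.01 * rng.normal(size=200)

w = np.zeros(2)      # [bias, slope]
eta = 0.1            # learning rate (assumed value)
batch_size = 20

for epoch in range(50):
    order = rng.permutation(len(X))                 # reshuffle the data each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]       # draw a batch
        Xb = np.column_stack([np.ones(len(idx)), X[idx, 0]])   # add the bias column
        residual = Xb @ w - y[idx]                  # forward pass and mismatch
        grad = 2 * Xb.T @ residual / len(idx)       # gradient of the batch MSE
        w = w - eta * grad                          # step against the gradient

print(w)   # should end up near [1, 3]
```

Each update uses only the current batch, so the weights start improving long before a full pass over the data is finished.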

5.5) Feature Engineering / Overfitting & Underfitting
* **Feature Engineering** - the process of using your own knowledge about the data and about the machine-learning algorithm at hand to make the algorithm work better by applying hardcoded (nonlearned) transformations to the data before it goes into the model
    * e.g. develop a model that takes as input an image of a clock and can output the time of day
        * (-) using the raw pixels of the image as input data makes this a difficult ML problem that calls for a convolutional neural network (CNN)
        * instead, if you understand the problem at a high level (how humans read time on a clock face), you can come up with much better input features: follow the black pixels of the clock hands to get their coordinates, then associate those coordinates with the time of day (which makes the problem easy enough that ML is not required)
        * before CNNs were successful on the MNIST dataset, solutions were typically based on hardcoded features such as the number of loops in the digit image, the height of each digit, a histogram of pixel values, etc.
    * Does this mean you don't need to worry about feature engineering because neural nets are capable of automatically extracting useful features from raw data? **No**, for two reasons:
        * Good features still allow you to solve problems more elegantly while using fewer resources (e.g. it would be ridiculous to read a clock face with a CNN)
        * Good features let you solve a problem with far less data. The ability of deep-learning models to learn features on their own relies on having lots of training data available; if you have only a few samples, then the information value in their features becomes critical
* **Overfitting/Underfitting**
    * The fundamental issue in ML is the tension between optimization and generalization
        * **Optimization** - the process of adjusting a model to get the best performance possible on training data (learning in ML)
        * **Generalization** - how well the trained model performs on data it has never seen before
    * (Underfitting) At the beginning of training, optimization and generalization are correlated: the lower the loss on training data, the lower the loss on test data
        * There is still progress to be made
        * The network hasn't yet modeled all relevant patterns in the training data
    * (Overfitting) In the three examples, the performance of the model on the held-out validation data always peaked after a few epochs and then began to degrade which means the model started to **overfit** to the training data
        * After a certain number of iterations, generalization stops improving: validation metrics stall and then begin to degrade
        * Now it's beginning to learn patterns that are specific to the training data, but are misleading or irrelevant when it comes to new data
    * Best solution to prevent a model from learning misleading or irrelevant data? **Get more training data**
    * When that isn't possible, the next-best solution is to modulate the quantity of information that your model is allowed to store or to add constraints on what information it's allowed to store
        * If a network can only afford to memorize a small number of the patterns, the optimization process will force it to focus on the most prominent patterns, which have a better chance of generalizing well
    * Methods to reduce overfitting - **regularization**, **dropout**, and **reducing network's size**
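
Two of the methods named above, dropout and weight regularization, can be sketched in plain NumPy. This is an illustrative version (inverted dropout and an L2 penalty). The function names, the dropout rate, and the penalty factor `lam` are assumptions, not values taken from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate, training=True):
    """Inverted dropout: zero a fraction `rate` of the entries and rescale the rest."""
    if not training or rate == 0.0:
        return activations          # at test time, nothing is dropped
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

def l2_penalty(weights, lam):
    """L2 regularization term added to the loss: lam * sum of squared weights."""
    return lam * sum(np.sum(W ** 2) for W in weights)

layer_output = np.ones((4, 1000))
dropped = dropout(layer_output, rate=0.5)
print((dropped == 0).mean())   # roughly half of the entries are zeroed
print(dropped.mean())          # stays close to 1 because of the rescaling

penalty = l2_penalty([np.array([[1.0, -2.0]]), np.array([[0.5]])], lam=0.01)
print(penalty)                 # 0.01 * (1 + 4 + 0.25)
```

Both techniques limit how much the network can memorize: dropout injects noise into the layer outputs during training, and the L2 penalty makes large weights costly.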

6) Neural Network Parameter Tuning
* **Learning process** - the key choices when configuring how the network learns (loss function, optimizer, momentum)
    * **Loss Function (Objective Function)** - the quantity that will be minimized during training. It represents a measure of success for the task at hand
        * A neural network that has multiple outputs may have multiple loss functions (one per output). But, the gradient descent process must be based on a **single** scalar loss value; so, for multiloss networks, all losses are combined (via averaging) into a single scalar quantity.
        * Choosing the right objective function for the problem is extremely important because your network will take any shortcut to minimize the loss; if the objective doesn't fully correlate with success at the task, your network will end up doing the wrong thing (e.g. maximizing the average well-being of all humans alive $\rightarrow$ AI will kill all humans except a few)
    * Types of **objective functions** by problem type:
        * **Binary Crossentropy** for two-class classification
            * **Crossentropy** - a quantity from the field of Information Theory that **measures the distance** between probability distributions, or in this case, between the **ground-truth distribution and your predictions**
        * **Categorical Crossentropy** for many-class classification
            * It measures the distance between two probability distributions
            * e.g. between the probability distribution output by the network (46 news topics) and the true distribution of the labels
        * **Mean-Square Error (MSE)** for regression problems
        * **Connectionist Temporal Classification (CTC)** for sequence-learning problems
        * Only when you're working on truly new research problems will you have to develop your own objective functions
        * Why aren't we using ROC/AUC metric as loss function?
            * No easy way to turn this metric into a loss function
            * Loss functions need to be computable given only a mini-batch (or a single data point) of data and must be differentiable (otherwise you can't use backpropagation to train the network)
            * We use crossentropy as proxy metric for ROC/AUC where we hope that the **lower the crossentropy** gets, the **higher the ROC/AUC** will be
        * Choosing the right last-layer activation and loss function:
        ![last_layer](last_layer.png)
    * **Optimizers / Optimization Methods (Variants of SGD)** - determines how the network will be updated based on the loss function. It implements a specific variant of SGD.
        * **SGD with Momentum**
            * **Momentum** addresses two issues with SGD: convergence speed and local minima
            * Avoids local minima by updating the parameter $w$ based not only on the current gradient value, but also on the previous parameter update
        * **Adagrad**
        * **RMSProp**
    * **Momentum** - helps push backpropagation out of local minima (step size)
        * adds a fraction of the previous update to the new update step: $w(t+1)=w(t)-\eta\nabla E(w(t))+m(w(t)-w(t-1))$
        * $m=0.9$ is standard
        * Too high a value of $m$ risks overshooting the minimum
        * Too low a value of $m$ risks getting stuck in local minima
    * Custom optimizer, loss function, metrics:
        ```python
        from keras import optimizers
        from keras import losses
        from keras import metrics
        
        # `learning_rate` replaces the older `lr` argument in recent Keras releases
        model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),
                      loss=losses.binary_crossentropy,
                      metrics=[metrics.binary_accuracy])
        ```
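
The momentum update rule $w(t+1)=w(t)-\eta\nabla E(w(t))+m(w(t)-w(t-1))$ can be sketched on a one-dimensional toy problem. The error function $E(w)=w^2$, the learning rate, the step count, and the function name `sgd_momentum` are all assumptions chosen for illustration.

```python
def sgd_momentum(grad, w0, lr=0.1, m=0.9, steps=200):
    """Minimize a 1-D function with w(t+1) = w(t) - lr * grad(w(t)) + m * (w(t) - w(t-1))."""
    w_prev = w0
    w = w0
    for _ in range(steps):
        # Gradient step plus a fraction m of the previous parameter update.
        w_next = w - lr * grad(w) + m * (w - w_prev)
        w_prev, w = w, w_next
    return w


# Toy error function E(w) = w**2, whose gradient is 2*w and whose minimum is at w = 0.
w_final = sgd_momentum(grad=lambda w: 2.0 * w, w0=5.0)
print(w_final)
```

Setting `m=0.0` recovers plain gradient descent, which makes it easy to compare the two on the same function.
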
* **Hidden layers**
    * **Number of hidden layers** - how many layers to use
    * **Number of neurons on hidden layers** - the size of representation space (more or less hidden units)
        * Intuitively understand the dimensionality of your representation space as "how much freedom you're allowing the network to have when learning internal representations"
        * Having **more** hidden units (a higher-dimensional representation space) allows the network to **learn more-complex representations**, but it makes the network **more computationally expensive** and may lead to **learning unwanted patterns** (patterns that will improve performance on the training data, but not on the test data)
    * e.g. two intermediate hidden layers with 16 hidden units each
* **Initialization of weights**
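    * What happens if you set weights to 0 or to very large values?
        * All-zero weights make every unit in a layer compute the same output and receive the same update, so the units never differentiate
        * Very large weights push sigmoid/tanh units into their flat regions, where gradients are close to 0 and learning stalls
    * Don't set weights to 0; instead, sample the initial weights from a normal distribution centered at 0
    * rule of thumb:
        * sample weights from $N(0,\sigma^2_w)$
        * $\sigma^2_w \max_i \Vert x_i\Vert^2 \ll 1$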
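
The initialization rule of thumb can be sketched as follows. The helper name `init_weights`, the toy dataset, and the choice of `0.01` as the "much less than 1" target are assumptions for illustration only.

```python
import random


def init_weights(inputs, n_units, factor=0.01, seed=0):
    """Sample weights from N(0, sigma_w^2) with sigma_w^2 * max_i ||x_i||^2 = factor << 1."""
    max_sq_norm = max(sum(v * v for v in x) for x in inputs)
    sigma_w = (factor / max_sq_norm) ** 0.5
    rng = random.Random(seed)
    n_features = len(inputs[0])
    weights = [[rng.gauss(0.0, sigma_w) for _ in range(n_units)] for _ in range(n_features)]
    return weights, sigma_w


# Tiny made-up dataset: three samples with two features each.
X = [[1.0, 2.0], [3.0, 4.0], [0.5, 0.1]]
W, sigma_w = init_weights(X, n_units=4)

# The largest squared norm here is 3^2 + 4^2 = 25, so this prints roughly `factor`.
print(sigma_w ** 2 * 25.0)
```
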
* **Scaling** - normalize data before fitting data to neural network model (depending on the activation function)
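
One common way to normalize a feature is to standardize it to zero mean and unit standard deviation. The sketch below assumes that form of scaling (the notes do not say which one is meant); `standardize` and the sample column are hypothetical.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a list of numbers to zero mean and unit standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # assumes the column is not constant (sigma > 0)
    return [(v - mu) / sigma for v in column]


# Made-up feature column.
feature = [10.0, 20.0, 30.0, 40.0]
scaled = standardize(feature)
print(scaled)
```

In practice the mean and standard deviation would be computed on the training data only and then reused for validation and test data.
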
* **Epoch / Batches** - an epoch is a single sweep through all of the data; a batch is the subset of samples used for one gradient descent update
    * example: if you have 100,000 observations and a batch size of 100, then each epoch will consist of 1,000 gradient descent update steps
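
The arithmetic behind the epoch example can be checked with a short sketch. `steps_per_epoch` is a hypothetical helper name; rounding up assumes the final, smaller batch is still used for an update.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of gradient updates in one epoch (the last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)


print(steps_per_epoch(100_000, 100))  # 1000
```
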
* **Termination** - the error function is generally an **extremely** non-convex function
    * Lots of local minima and flat spots
    * Often best to terminate after a set number of iterations
    * Also, can terminate when the gradient is small and the total error is small
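
The two stopping rules can be combined in one loop. This is a minimal sketch on an assumed convex toy function $E(w)=(w-3)^2$; it checks only the gradient size and the iteration cap, and the names `gradient_descent`, `max_iters`, and `tol` are illustrative.

```python
def gradient_descent(grad, w0, lr=0.1, max_iters=1000, tol=1e-6):
    """Run 1-D gradient descent; stop at max_iters or when |gradient| < tol."""
    w = w0
    for iteration in range(max_iters):
        g = grad(w)
        if abs(g) < tol:
            # Gradient is small enough: terminate early.
            return w, iteration
        w -= lr * g
    # Iteration budget exhausted: terminate regardless of the gradient.
    return w, max_iters


w, n_iters = gradient_descent(grad=lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w, n_iters)
```

On a real, non-convex error surface a small gradient can also mean a flat spot, which is why the notes pair it with a check that the total error is small.
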
* **Activation functions**
    1. **Sigmoid** - forces arbitrary values into [0,1] interval
    ![sigmoid](sigmoid.png)
        * equation: $\sigma(z)=\frac{1}{1+e^{-z}}$
        * output can be interpreted as a probability
    2. **Tanh** - a hyperbolic tangent function that forces arbitrary values into [-1,1] interval (an activation that was popular in the early days of neural networks)
    ![tanh_func](tanh_func.png)
        * equation: $\tanh(z)=2\sigma(2z)-1$
        * same shape as the sigmoid function
        * output values centered around 0
        * smooth or differentiable (unlike sign)
            * $\tanh(z)=2\sigma(2z)-1 \rightarrow \tanh'(z)=1-\tanh^2(z)$
        * trains faster than sigmoid in most cases
        * (-) requires normalization of the data
    3. **Softmax**
        * equation: $\frac{e^{s_j^{(L)}}}{\sum_{i=1}^K e^{s_i^{(L)}}}$
        * useful activation function for output layer of a multi-class classification neural net
    4. **Rectifier / Hard Max / Rectified Linear Unit (ReLU)** - a function that maps $x$ to $\max(0,x)$
    ![relu](relu.png)
        * gradient does not vanish as $x$ gets large
        * 0 if $x<0$, which introduces sparsity into the network
            * zeros out negative values
        * has faster training
        * (+) doesn't require normalization of data
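
The four activation functions can be written in a few lines of plain Python. This is a sketch for moderate input values (the simple `sigmoid` here overflows for very large negative `z`); the function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    # Written via the sigmoid identity tanh(z) = 2*sigmoid(2z) - 1.
    return 2.0 * sigmoid(2.0 * z) - 1.0


def relu(z):
    # Zeros out negative values, passes positive values through unchanged.
    return max(0.0, z)


def softmax(scores):
    # Subtracting the max score avoids overflow and leaves the result unchanged.
    shifted = [s - max(scores) for s in scores]
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0))
print(tanh(1.0), math.tanh(1.0))
print(relu(-2.0), relu(3.0))
print(softmax([1.0, 2.0, 3.0]))
```
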
* **Reducing Overfitting**
    * **Reducing Network's Size** - reduce size of model
        * Reduce **Memorization Capacity** - reduce the number of learnable parameters in the model (which is determined by the number of layers and the number of units per layer)
        * By limiting memorization resources, it won't be able to learn as easily, thus in order to minimize its loss, it will have to resort to learning compressed representations that have predictive power regarding the targets
        * Smaller model - overfits later, performance degrades slower
        ![smaller_model](smaller_model.png)
        * Larger model - overfits almost immediately, training loss gets to zero very quickly (more capacity allows it to quickly model the training data)
        ![bigger_model_val_loss](bigger_model_val_loss.png)
        ![bigger_model_train_loss](bigger_model_train_loss.png)
    * **Regularization** - adding weight regularization
        * **Occam's razor** - given two explanations for something, the explanation most likely to be correct is the simplest one (fewer assumptions)
            * In the analogy for Neural Nets, the simple model is less likely to overfit than complex ones
            * **Simple model** is a model where the distribution of parameter values has less entropy (or a model with fewer parameters)
        * **Weight Regularization** - mitigate overfitting by putting constraints on the complexity of network by **forcing its weights** to take only **small values**, which makes the distribution of weight values more **regular**
            * Add cost associated with having large weights to the loss function of the network
        * **$L1$ Regularization** - the cost added is proportional to the **absolute value of weight coefficients** (the $L1$ norm of the weights)
        * **$L2$ Regularization / Weight Decay** - the cost added is proportional to the **square of the value of weight coefficients** (the squared $L2$ norm of the weights)
            * Add a $L2$-regularization term: $\lambda\sum_{l,i,j}(w_{ij}^{(l)})^2$ to the error function
            * Example of Adding $L2$ weight regularization to the model:
            ```python
            from keras import layers, models, regularizers
            
            model = models.Sequential()
            # l2(0.001) means every coefficient in the weight matrix of the layer adds 0.001 * weight_coefficient_value ** 2 to the total loss of the network (only added at training time)
            model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001), activation='relu', input_shape=(10000,))) 
            model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001), activation='relu'))
            model.add(layers.Dense(1, activation='sigmoid'))
            ```
            ![l2_regularization](l2_regularization.png)
        * Different weight regularizers available in Keras:
        ```python
        from keras import regularizers
        
        regularizers.l1(0.001) # l1 regularization
        regularizers.l1_l2(l1=0.001, l2=0.001) # l1 & l2 regularization simultaneously
        ```
    * **Dropout** - randomly remove nodes from the network each time you run backpropagation
        * One of the most effective and most commonly used regularization for neural networks (developed by Geoff Hinton)
        * Randomly **dropping out** (setting to zero) a number of output features of the layer during training
        * e.g. layer returns `[0.2, 0.5, 1.3, 0.8, 1.1]` for a given input sample during training. Then after applying dropout, this vector will have a few zero entries distributed at random: `[0, 0.5, 1.3, 0, 1.1]`
        * **Dropout Rate** - the fraction of the features that are zeroed out (usually between 0.2 and 0.5)
        * At test time, no units are dropped out; instead, the layer's output values are scaled down by the fraction of units kept during training (1 - dropout rate, which equals the dropout rate only when the rate is 0.5), to balance for the fact that more units are active than at training time
        * Dropout example:
            * At training time, zero out at random a fraction of values in the matrix:
            ```python
            import numpy as np

            # At training time, drops out 50% of the units in the output
            layer_output *= np.random.randint(0, high=2, size=layer_output.shape)
            ```
            * At test time, scale down the output by the fraction of units kept (here 0.5)
            ```python
            # at test time, scale by 0.5 (since we previously dropped half the units)
            layer_output *= 0.5
            ```
        * In practice, both operations are usually done at training time (drop units, then scale the surviving outputs *up* by dividing by the fraction kept), leaving the output unchanged at test time
        ![dropout](dropout.png)
        * The core idea is that **introducing noise** in the output values of a layer can **break up coincidence patterns that aren't significant**, which the network will start memorizing if no noise is present
        ![dropout_val_loss](dropout_val_loss.png)
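
The weight penalties and the training-time form of dropout can be sketched without any framework. The helper names, the example weights, and the regularization strength `lam` are assumptions for illustration; Keras applies the equivalent steps internally.

```python
import random


def l2_penalty(weights, lam):
    """L2 cost added to the loss: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)


def l1_penalty(weights, lam):
    """L1 cost added to the loss: lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def inverted_dropout(outputs, rate, rng):
    """Zero each output with probability `rate`; scale survivors by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in outputs]


weights = [0.5, -1.0, 2.0]
print(l2_penalty(weights, lam=0.001))
print(l1_penalty(weights, lam=0.001))

layer_output = [0.2, 0.5, 1.3, 0.8, 1.1]
dropped = inverted_dropout(layer_output, rate=0.5, rng=random.Random(0))
print(dropped)
```

Because the surviving outputs are scaled up during training, the layer can be used unchanged at test time.
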

7) Regression and Multi-Class Classification
* Neural Network can also be used for regression and for data with multiple classes
    * For regression, replace the output activation $\theta$ with the identity function (no activation on the output layer) and use MSE as the loss
    * For multiple classes:
        * The output layer will have multiple nodes, one for each class $1,2,\dots,K$
        * The output activation function on the $j$-th component of the output layer is the **softmax function**: $\frac{e^{s_j^{(L)}}}{\sum_{i=1}^K e^{s_i^{(L)}}}$
        * Minimize **cross-entropy**: $-\sum_{i} y_i \log h(x_i)$
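
The multi-class recipe (softmax output plus cross-entropy loss) can be sketched for a single sample. The scores and labels are made-up values, and the function names are illustrative.

```python
import math


def softmax(scores):
    # Subtracting the max score avoids overflow and leaves the result unchanged.
    shifted = [s - max(scores) for s in scores]
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(y_true, y_pred):
    """-sum_k y_k * log(p_k) for a one-hot label y_true and predicted probabilities y_pred."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


scores = [2.0, 1.0, 0.1]  # made-up output-layer scores for K = 3 classes
probs = softmax(scores)

loss_correct = categorical_crossentropy([1, 0, 0], probs)  # true class has the top score
loss_wrong = categorical_crossentropy([0, 0, 1], probs)    # true class has the lowest score
print(probs, loss_correct, loss_wrong)
```

The loss is small when the network puts high probability on the true class and grows as that probability shrinks.
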
* Classification/Regression Examples:
    * Classifying movie reviews as positive or negative (Binary Classification)
    * Classifying newswires by topic (Multi-class Classification)
    * Estimating the price of a house, given real-estate data (Regression)

8) Types of Neural Network Architecture
1. **Fully Connected Neural Networks** - architecture where each layer is *fully connected* to the next and has no missing edges between nodes
![full_nn](full_nn.png)
    * **Deep Learning Neural Network** - a buzzword referring to a neural network with more than 3 layers
2. **Recurrent Neural Networks (RNN)** - a NN where a hidden layer feeds back into itself
    * This allows the NN to exhibit dynamic temporal behavior
    * RNNs provide "internal" memory for sequence processing
    * **Long Short Term Memory (LSTM)** - a special kind of RNN capable of learning **long-term dependencies**
    * They are very applicable to handwriting/speech recognition
        * e.g. Used by Google Translate application
        * e.g. Used to train AI on human negotiations
![rnn](rnn.png)
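
The "hidden layer feeds back into itself" idea can be sketched with a single scalar unit. The weights `w_x`, `w_h`, bias `b`, and the tanh activation are assumptions for illustration; real RNN layers use weight matrices learned during training.

```python
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Scalar recurrent layer: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        # The new state depends on the current input and the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# The same final input (1.0) yields different states depending on what came before.
print(rnn_forward([1.0, 1.0]))
print(rnn_forward([0.0, 1.0]))
```

The differing final states are the "internal memory": the output at each step carries information about earlier elements of the sequence.
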
3. **Convolutional Neural Networks (CNN)** - architecture isn't fully connected and employs convolution layers
![cnn](cnn.png)
    * used mainly for image classification (state of the art)
    * **Convolution Layers** - each node only "sees" a subset of the previous layer's nodes
        * applies convolutions (type of filter) to each sub-image to "look for" certain patterns or shapes (which are learned)
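
A convolution layer's core operation can be sketched in plain Python. The image and the vertical-edge filter below are hand-picked for illustration (in a CNN the filter values are learned), and `conv2d` is a hypothetical helper with no padding or stride options.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2-D cross-correlation, the operation used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Each output value only "sees" a kh x kw patch of the input.
            patch_sum = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            row.append(patch_sum)
        output.append(row)
    return output


# 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked vertical-edge filter.
kernel = [
    [-1, 1],
    [-1, 1],
]
print(conv2d(image, kernel))
```

The output is non-zero only in the middle column, where the filter straddles the edge, which is the sense in which a filter "looks for" a pattern.
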

9) Deep Learning Achievements in 2017
1. Text
    * Google Translate with RNN
    * Facebook Human Negotiation AI with RNN
2. Voice
    * Google DeepMind autoregressive full-convolution WaveNet (A generative model for raw audio) with PixelRNN/PixelCNN
    * Google DeepMind Lip reading from television dataset using LSTM + CNN
    * University of Washington lip-sync synthesis of Obama video with RNN
3. Computer Vision
    * Google Brain enhances Google Maps with OCR (Optical Character Recognition) to recognize street signs and store signs using CNN + LSTM
    * Google DeepMind Visual Reasoning on CLEVR dataset with 95.5% accuracy using pre-trained LSTM
    * Uizard pix2code (GUI interpreted by NN into code) with 77% accuracy
    * Google SketchRNN trained on detailed vector representations of drawings using Sequence-to-Sequence Variational Autoencoder (VAE) RNN
    * **Generative Adversarial Networks (GANs)** - competition of two networks (generator and discriminator)
        * e.g. First network creates a picture, and the second one tries to understand whether the picture is real or generated
        * Orange Labs France Face Aging with Conditional GANs using IMDB dataset
        * Google Improves Professional Photos using GANs with Google Street View dataset
        * MichU Synthesis of an image from a text description using GANs
        * Berkeley AI Research (BAIR) Image-to-Image Translation with Conditional GANs (e.g. creating a map using a satellite image, or realistic texture of the objects using their sketch)
        * Christopher Hesse uses UNet and PatchGAN to make nightmare cat demo
        * Authors of Pix2Pix develop CycleGAN for transfer between different domains of images
        * Using Adversarial Autoencoder (AAE) to find new drugs to fight cancer
        * Improvements in Adversarial Attacks (tricking a NN by injecting noise into the input so that it is misrecognized) using Fast Gradient Sign Method (FGSM) - important for protecting face recognition/self-driving algorithms from being attacked
4. **Reinforcement Learning (RL)** - learn the successful behavior of an agent in an environment that gives a reward through experience (e.g. people learning throughout their lives) - used actively in games (e.g. AlphaGo), robots, and system management (e.g. traffic)
    * Google DeepMind Deep Q-network (DQN) plays arcade games better than humans (currently being taught to play complex games like Doom)
        * Introduction of additional losses (auxiliary tasks), such as the prediction of a frame change (pixel control) so that the agent better understands the consequences of the actions, significantly speeds up learning
    * OpenAI Learning Robots using RL (one-shot learning) by actively studying an agent's training by humans in a virtual environment
        * A person shows in VR how to perform a certain task, and one demonstration is enough for the algorithm to learn it and then reproduce it in real conditions
    * OpenAI/Google DeepMind Learning on Human Preferences using RL
        * An agent has a task, and the algorithm provides two possible solutions to the human, who indicates which one is better
    * Google DeepMind Movement in Complex Environments
        * Teaching a robot complex behavior (walk, jump, etc.) using agents (body emulators) to perform complex actions by constructing a complex environment with obstacles and with a simple reward for progress in movement
5. Other
    * Google DeepMind cools data centers (reducing energy costs): an NN ensemble predicts Power Usage Effectiveness (PUE) from the readings of thousands of sensors
    * Google Brain One Model For All Tasks (currently trained models are poorly transferred from task to task) (tensor2tensor)
        * Train a model that performs eight tasks from different domains (text, speech, images)
    * Facebook Learn ImageNet in one hour (using a cluster of 256 Tesla P100 GPUs) using Gloo and Caffe2 for distributed learning
6. News
    * Self-driving Cars
        * Intel MobilEye
        * Google Waymo
    * Healthcare
        * Google DeepMind in Healthcare for medical diagnosis
    * Investments
        * China invests \$150 Billion in AI
        * Baidu Research employs 1,300 people
        * Alibaba runs 100 billion samples with a trillion parameters with ease
7. Kaggle
    * In 2016 and 2017, Kaggle was dominated by two approaches: **gradient boosting machines** and **deep learning**
        * Specifically, **gradient boosting** is used for problems where structured data is available (e.g. XGBoost), whereas deep learning is used for perceptual problems such as image classification (e.g. Keras)

10) Keras - Deep-learning framework for Python
* Key Features:
    * It allows the same code to run seamlessly on CPU or GPU
    * It has a user-friendly API that makes it easy to quickly prototype deep-learning models
    * It has built-in support for convolutional networks (for computer vision), recurrent networks (for sequence processing), and any combination of both
    * It supports arbitrary network architectures: multi-input or multi-output models, layer sharing, model sharing, and so on. This means Keras is appropriate for building essentially any deep learning model from a generative adversarial network to a neural Turing machine
* Keras Framework:
![keras_framework](keras_framework.png)
    * Keras is a model-level library, providing high-level building blocks for developing deep-learning models
        * It doesn't handle low-level operations such as tensor manipulation and differentiation
    * Keras relies on a specialized, well-optimized tensor library which serves as the **backend engine** of Keras
    * There are three existing backend implementations: **TensorFlow**, **Theano**, and **Microsoft Cognitive Toolkit (CNTK)**
    * TensorFlow on CPU/GPU:
        * When running on **CPU**, TensorFlow is itself wrapping a low-level library for tensor operations called **Eigen**
        * When running on **GPU**, TensorFlow wraps a library of well-optimized deep-learning operations called the NVIDIA CUDA Deep Neural Network library (`cuDNN`)
    * Two ways to define a model:
        * **Sequential class** - only for linear stacks of layers (most common architecture)
        ```python
        from keras import models
        from keras import layers
        
        model = models.Sequential()
        model.add(layers.Dense(32, activation='relu', input_shape=(784,)))
        model.add(layers.Dense(10, activation='softmax'))
        ```
        * **Functional API** - for directed acyclic graphs of layers, which lets you build completely arbitrary architectures
        ```python
        input_tensor = layers.Input(shape=(784,))
        x = layers.Dense(32, activation='relu')(input_tensor)
        output_tensor = layers.Dense(10, activation='softmax')(x)
        model = models.Model(inputs=input_tensor, outputs=output_tensor)
        ```
            * manipulate the data tensors that the model processes and apply layers to the tensor as if they were functions
    * **Validation Set** - set aside some samples from the training data to compute loss and accuracy after every epoch
        * e.g. 25,000 training samples $\rightarrow$ 10,000 validation samples (40% of training samples)
        * Used to tune the configuration of the model (repeated tuning can cause the model to overfit to the validation set)
        * We want to avoid **information leaks** - every time you tune a hyperparameter of your model based on the model's performance on the validation set, some information about the validation data leaks into the model
            * do not give the model access to **any** information on the test set (even indirectly)
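
Setting aside a validation set can be sketched with simple list slicing. The integers below stand in for real samples, `train_val_split` is a hypothetical helper, and the sketch assumes the data has already been shuffled.

```python
def train_val_split(samples, n_val):
    """Hold out the first n_val samples for validation; train on the rest."""
    return samples[n_val:], samples[:n_val]


# Stand-in for 25,000 training samples (integers instead of real data).
samples = list(range(25_000))
partial_train, val = train_val_split(samples, n_val=10_000)
print(len(partial_train), len(val))
```

The test set is kept entirely separate from both slices, so that no information about it reaches the model during tuning.
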