Convolutional Neural Networks (ConvNet/CNN)

* Objectives:
    * Understand the fundamental differences between image data and other kinds of data
    * Be aware of the tools and pipeline for working with images
    * Understand general computer vision techniques for working/transforming images
    * Be able to explain what a convolution is, and how it works
    * Understand the basic structure of a convolutional neural network
    * Comprehend the three basic ideas behind convolutional networks: Local receptive field, shared weights, and pooling
    * Be aware of general strategies for building convolutional neural networks

1) **Image Processing** - remove unnecessary details to allow for better generalization of images to image classes
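* Equation: $$x_i \rightarrow y_i: \arg\min_y \frac{1}{2}\sum_{i=1}^{n}(x_i-y_i)^2+\lambda\sum_{i=1}^{n-1}|y_{i+1}-y_i|$$
    * **Fidelity** (keeps the output $y$ close to the original $x$): $$\frac{1}{2}\sum_{i=1}^{n}(x_i-y_i)^2$$
    * **Variation** (penalizes jumps between neighboring values of $y$): $$\lambda\sum_{i=1}^{n-1}|y_{i+1}-y_i|$$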
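
To make the two terms concrete, here is a minimal sketch (assuming NumPy is available) that evaluates the fidelity and variation terms for a candidate output `y`. The function name `tv_objective` and the sample signals are illustrative and not part of the original notes.

```python
import numpy as np

def tv_objective(x, y, lam):
    """Return (fidelity, variation, total) for a 1D signal x and a candidate y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    fidelity = 0.5 * np.sum((x - y) ** 2)           # how closely y tracks x
    variation = lam * np.sum(np.abs(np.diff(y)))    # how much y jumps between neighbors
    return fidelity, variation, fidelity + variation

noisy = [1.0, 1.2, 0.8, 3.0, 3.1, 2.9]   # hypothetical noisy step signal
flat = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]    # hypothetical piecewise-constant candidate

print(tv_objective(noisy, noisy, lam=1.0))  # zero fidelity cost, larger variation
print(tv_objective(noisy, flat, lam=1.0))   # small fidelity cost, smaller variation
```

With these values the flattened candidate has the lower total, which is the sense in which the objective removes unnecessary detail.
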
* What is the difficulty of image processing?
    * Images come in many different sizes
    * Viewing conditions are infinite
    * Objects are surrounded by other objects
    * Computers have a hard time understanding the context of an image (e.g. Barack Obama secretly stepping on his staff member's scale as a joke)
* Why is it important to understand how to process images? More and more data comes in the form of audio, images, and video, all of which have potential for modeling
* How do we make object recognition possible?
    * Compress the data
    * Keep the search simple
    * Method of segmenting potential objects
* Python libraries for image processing:
    * **Scikit-image (skimage)**
    * **OpenCV (based on C++)**
        * be careful with package dependencies
    * Python Imaging Library
    * Pillow
* Image Pipeline: Read, Resize, and Transformations
    * Reading Images (Image types):
        * Colored images shape: (height, width, 3)
            * **Feature Maps** - convolutions operate over 3D tensors with two spatial axes and one depth axis
                * spatial axes - `(height, width)`
                * depth axis - `channels`
                * combined - `(height, width, channels)`
        * Greyscale images shape: (height, width)
        * Image Tensor example: (RGB)
        ```python
        import numpy as np

        image = np.array([
            #  R   G   B      R    G    B
            [[108, 50, 13], [111, 55, 18]],
            #  R   G   B      R    G    B
            [[115, 61, 23], [130, 129, 127]]
        ])
        ```
        * What is the shape of this array? (2, 2, 3)
        * What if the same array is greyscaled? (2, 2)
    * Resizing Images:
        * Making the image a specified shape without cropping
            * **Downsampling** - reducing the size of the image when image is too large for processing
            ![downsampling](downsampling.png)
            * **Upsampling** - purposely increasing the size of image when image is too small for processing
            ![upsampling](upsampling.png)
            * **Interpolation** - resize or distort your image from one pixel grid to another
            ![interpolation](http://northstar-www.dartmouth.edu/doc/idl/html_6.2/images/Interpolation_Methods-14.jpg)
    * Transforming Images - converting an image from one domain to another
        * **Greyscale** - removing color from image
        * **Denoise** - removing unnecessary details of an image allowing for better generalization of the class of image
            * **Gaussian Kernel** - a smoothing kernel whose weights follow the Gaussian probability density function; its width is set by the standard deviation $s$, and the square of it, $s^2$, is the variance
            ![gaussian_kernel](gaussian_kernel.png)
    * Before convolutional neural nets, image analysis (or object recognition) was focused on examining pixels (or color vectors)
        * **K-means of RGB pixels** - segment colors in an automated fashion using k-means clustering
        ![kmeans_of_pixel_colors](https://www.mathworks.com/matlabcentral/answers/uploaded_files/9604/sample.jpg)
        * **Raw Vector based methods** - ascertain features in images by looking at intensity gradients
        ![raw_vector_method](raw_vector_method.png)
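
The read, resize, and transform steps above can be sketched with plain NumPy. This is only an illustration under stated assumptions: the helper names (`to_greyscale`, `downsample`, `gaussian_kernel`) are hypothetical, the greyscale weights are the common ITU-R BT.601 luma coefficients, and the downsampling is the simplest nearest-neighbour kind rather than what skimage or OpenCV would do by default.

```python
import numpy as np

# The 2x2 RGB tensor from the example above, shape (2, 2, 3).
rgb = np.array([
    [[108, 50, 13], [111, 55, 18]],
    [[115, 61, 23], [130, 129, 127]],
], dtype=float)

def to_greyscale(image):
    """Collapse the channel axis with the ITU-R BT.601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return image @ weights            # two spatial axes remain

def downsample(image, factor):
    """Nearest-neighbour downsampling: keep every `factor`-th row and column."""
    return image[::factor, ::factor]

def gaussian_kernel(size, s):
    """Normalized 2D Gaussian kernel with standard deviation s (variance s**2)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * s ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()      # weights sum to 1 so brightness is preserved

grey = to_greyscale(rgb)
print(rgb.shape, grey.shape)                    # (2, 2, 3) (2, 2)
print(downsample(np.zeros((8, 8)), 2).shape)    # (4, 4)
print(gaussian_kernel(3, 1.0).round(3))
```

Blurring with the Gaussian kernel is one way to denoise: each output pixel becomes a weighted average of its neighbourhood.
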

2) Image Featurization in Convolutional Neural Nets
* **Convolution operation**
    * The fundamental difference between a densely connected layer and a convolution layer:
        * **`Dense` layers** - learn **global** patterns in their input feature space (e.g. patterns involving all pixels in an MNIST digit)
        * **Convolutional layers** - learn **local** patterns (e.g. for images, patterns found in small 2D windows of the inputs)
        ![convolution_operation](convolution_operation.png)
    * This key characteristic gives convnets two interesting properties:
        * **The patterns learned are <u>translation invariant</u>** - after learning a certain pattern in the lower-right corner of a picture, a convnet can recognize it anywhere (e.g. in the upper-left corner)
            * A densely connected network would have to learn the pattern anew if it appeared at a new location
            * This makes convnets data efficient when processing images, because the **visual world is fundamentally translation invariant**
            * They need fewer training samples to learn representations that have generalization power
        * **Can learn <u>spatial hierarchies of patterns</u>** - a first convolution layer will learn small local patterns such as edges $\rightarrow$ a second convolution layer will learn larger patterns made of the features of the first layers, and so on
            * Allows convnets to efficiently learn increasingly complex and abstract visual concepts because **visual world is fundamentally spatially hierarchical**
            ![spatial_hierarchies](spatial_hierarchies.png)
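
A small experiment can illustrate the translation property described above. The sketch below (assuming NumPy) slides one fixed filter over two images that contain the same hypothetical 3x3 "plus" pattern at different positions; the helper names are made up for this example. The filter responds with the same peak value in both cases, and only the location of the peak moves.

```python
import numpy as np

def cross_correlate(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical 3x3 pattern, used both as the object and as the filter.
pattern = np.array([[0, 1, 0],
                    [1, 1, 1],
                    [0, 1, 0]], dtype=float)

def peak_location(top, left, size=8):
    """Place the pattern at (top, left) in a blank image and find the strongest response."""
    image = np.zeros((size, size))
    image[top:top + 3, left:left + 3] = pattern
    response = cross_correlate(image, pattern)
    peak = np.unravel_index(np.argmax(response), response.shape)
    return tuple(int(v) for v in peak), float(response.max())

print(peak_location(0, 0))   # pattern in the upper-left corner
print(peak_location(5, 4))   # same pattern near the lower-right corner
```

A dense layer has a separate weight for every pixel position, so it has no such built-in reuse of a pattern across locations.
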
* **Filters (Kernels)** - encode specific aspects of the input data
    * The depth of the output feature map can be arbitrary because the output depth is a parameter of the layer, and the different channels in that depth axis no longer stand for specific colors (as in RGB input), but rather for **filters**
    * e.g. A single filter could encode the concept "presence of a face in the input"
    * e.g. In MNIST example, the first convolution layer takes a feature map of size `(28,28,1)` and outputs a feature map of size `(26,26,32)`
        * It computes 32 filters over its input
        * Each of these 32 output channels contains a `26 x 26` grid of values, which is a **response map** of the filter over the input, indicating the response of that filter pattern at different locations in the input
        * This is what the term **feature map** means: every dimension in the depth axis is a feature (or filter), and the 2D tensor `output[:,:,n]` is the **2D spatial map** of the response of this filter over the input
        ![response_map](response_map.png)
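
The `(28,28,1)` to `(26,26,32)` example can be reproduced with a short NumPy sketch. The input and the 32 filters below are random stand-ins (a real layer would learn the filters and add a bias and an activation), and `conv_layer` is a hypothetical helper, not a Keras function.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.random((28, 28, 1))       # stand-in for one MNIST digit
kernels = rng.normal(size=(3, 3, 1, 32))    # random stand-ins for 32 learned 3x3 filters

def conv_layer(inputs, kernels):
    """Unpadded, stride-1 convolution of a 3D input with a bank of filters."""
    kh, kw, c_in, c_out = kernels.shape
    out_h = inputs.shape[0] - kh + 1
    out_w = inputs.shape[1] - kw + 1
    output = np.zeros((out_h, out_w, c_out))
    for i in range(out_h):
        for j in range(out_w):
            patch = inputs[i:i + kh, j:j + kw, :]     # 3D patch: (kh, kw, c_in)
            # Contract the patch against every filter at once -> vector of length c_out
            output[i, j, :] = np.tensordot(patch, kernels, axes=3)
    return output

output = conv_layer(feature_map, kernels)
print(output.shape)            # (26, 26, 32)
print(output[:, :, 0].shape)   # (26, 26): the response map of filter 0
```
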
* **Convolutions** - In image processing, a **kernel**, **convolution matrix**, or **mask** is a small matrix useful for blurring, sharpening, embossing, edge-detection, and more. This is accomplished by means of **convolution** between a kernel and an image (a CNN creates kernels automatically by treating the convolution matrix as weights that are passed forward and updated using backprop)
    * Programming language and associated library to develop convolutions using kernels/filters:
        * `python` - `conv-forward`
        * `tensorflow` - `tf.nn.conv2d`
        * `keras` - `Conv2D`
    * Applying a kernel over an image to get a convolved feature:
    ![kernel_apply](kernel_apply.png)
    * Sliding a kernel over an image and taking the dot product of the kernel with the patch of pixels under it at each position
        * dot product: $a \cdot b=\sum_{i=1}^n a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$
        * $\left[\begin{array}{ccc}
            (1)(1) & (1)(0) & (1)(1) \\
            (0)(0) & (1)(1) & (1)(0) \\
            (0)(1) & (0)(0) & (1)(1)
            \end{array}\right] = \left[\begin{array}{ccc}
            1 & 0 & 1 \\
            0 & 1 & 0 \\
            0 & 0 & 1
            \end{array}\right] \Rightarrow
            1+0+1+0+1+0+0+0+1 = 4$
        * Convolved feature: $\left[\begin{array}{ccc}
            4 & - & - \\
            - & - & - \\
            - & - & -
            \end{array}\right]$
        * **Sobel filter** - a pair of kernels whose values approximate the intensity gradient in the horizontal and vertical directions, used to detect vertical and horizontal edges
            * Vertical edge detector example:
                * $\left[\begin{array}{ccc}
                    -1 & 0 & +1 \\
                    -2 & 0 & +2 \\
                    -1 & 0 & +1
                    \end{array}\right]$
            * Horizontal edge detector example:
                * $\left[\begin{array}{ccc}
                    -1 & -2 & -1 \\
                    0 & 0 & 0 \\
                    +1 & +2 & +1
                    \end{array}\right]$
            * Useful for detecting the outline of a door in an image and recognizing the essential features of a door
            ![door_detection_with_edge](door_detection_with_edge.png)
        * **Canny Filter** - an **edge detection operator** that uses a multi-stage algorithm to detect a wide range of edges in images
            ![canny](canny.jpeg)
    * What if we could get a computer to build its own kernels, apply those to images, then interpret those results to perform object recognition? Convolutional Neural Networks (CNNs)
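
The worked dot product and the Sobel kernels above can be checked with a few lines of NumPy. In this sketch the 5x5 binary image is an assumption chosen so that its top-left 3x3 patch matches the worked example, `convolve2d_valid` is a hypothetical helper, and (as in CNN layers) the kernel is not flipped before sliding.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding, no kernel flip)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=int)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Assumed 5x5 image; its top-left 3x3 patch is the one used in the worked example.
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
convolved = convolve2d_valid(image, kernel)
print(convolved[0, 0])   # 4, the first entry of the convolved feature

# Sobel kernels applied to a synthetic image with a single vertical edge.
sobel_vertical = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
sobel_horizontal = sobel_vertical.T
step = np.zeros((5, 5), dtype=int)
step[:, 2:] = 1                                   # dark on the left, bright on the right
vertical_response = convolve2d_valid(step, sobel_vertical)
horizontal_response = convolve2d_valid(step, sobel_horizontal)
print(vertical_response)      # large values where the window straddles the edge
print(horizontal_response)    # all zeros: there is no horizontal edge
```
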
* Key Parameters of Convolutions:
    * **Size of the patches extracted from the inputs** - typically 3x3 or 5x5
        * e.g. 3x3 is a common choice
    * **Depth of the output feature map** - the number of filters computed by the convolution
        * e.g. the MNIST example starts with a depth of 32 and ends with a depth of 64
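
These two parameters, together with the depth of the input, determine how many weights a convolution layer has. A minimal sketch, assuming one bias per filter; `conv2d_params` is a hypothetical helper, and the three layer shapes are the ones from the no-max-pooling MNIST model summarized later in these notes.

```python
def conv2d_params(kernel_height, kernel_width, input_depth, output_depth):
    """Weights per filter (kh * kw * input_depth) plus one bias, times the number of filters."""
    return (kernel_height * kernel_width * input_depth + 1) * output_depth

layers = [(3, 3, 1, 32), (3, 3, 32, 64), (3, 3, 64, 64)]
counts = [conv2d_params(*layer) for layer in layers]
print(counts, sum(counts))   # [320, 18496, 36928] 55744
```

Note that the count does not depend on the width or height of the image, because the same kernel weights are reused at every position.
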

3) Convolutional Neural Network Architecture
![cnn](http://www.mdpi.com/entropy/entropy-19-00242/article_deploy/html/images/entropy-19-00242-g001.png)
* General Structure:
    * There are three key features that make the CNN structure actually work: **local receptive fields**, **shared weights**, and **pooling**
    * Input Layer $\rightarrow$ Convolutional Layers $\rightarrow$ Pooling Layers $\rightarrow$ Fully Connected Layers $\rightarrow$ Output Layer
    * The activation size (derived from the activation shape) tends to decrease gradually through the network until the output
    * Most of the parameters normally come from the fully connected layers towards the end of the convnet
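
To see how activation shapes evolve through such a structure, the sketch below traces a hypothetical MNIST-style stack (three unpadded 3x3 convolutions with 2x2 pooling in between). The layer list and the `trace_shapes` helper are assumptions made for illustration, not a prescribed architecture.

```python
def trace_shapes(input_shape, layers):
    """Return the (height, width, depth) after each layer of a simple stack.

    Each layer is ("conv", kernel_size, filters) for an unpadded stride-1
    convolution, or ("pool", window) for non-overlapping pooling.
    """
    height, width, depth = input_shape
    shapes = [input_shape]
    for layer in layers:
        if layer[0] == "conv":
            _, kernel, filters = layer
            height, width, depth = height - kernel + 1, width - kernel + 1, filters
        elif layer[0] == "pool":
            _, window = layer
            height, width = height // window, width // window
        else:
            raise ValueError(f"unknown layer type: {layer[0]}")
        shapes.append((height, width, depth))
    return shapes

stack = [("conv", 3, 32), ("pool", 2), ("conv", 3, 64), ("pool", 2), ("conv", 3, 64)]
for h, w, d in trace_shapes((28, 28, 1), stack):
    print((h, w, d), "activation size:", h * w * d)
```

The spatial size shrinks at every step, while the depth grows with the number of filters, so the overall activation size trends downward rather than falling strictly at every layer.
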
* Input Layer $\rightarrow$ Convolutional Layers:
    * Using a kernel, the input image is converted to multiple convolved features (learning different features of the image)
    * **Local Receptive Fields** - a small group of neighboring pixels whose size is defined by the kernel (e.g. a 3x3 or 5x5 kernel)
    ![local_receptive_field](local_receptive_field.png)
        * The kernel is slid across the entire image
        * Multiple kernels are applied to the image, so each convolutional layer learns multiple kernels and produces one feature map per kernel
        * The image is transformed into the **set of local receptive fields**
    * **Shared Weights** - each kernel uses the same weights at every position (e.g. the same 3x3 weights are moved around the input image)
        * The weights within a convolution are therefore shared across all locations of the input
    ![convolutional_layers](convolutional_layers.png)
    * In Keras `Conv2D` layers, the output depth and the size of the patches are the first arguments passed to the layer: `Conv2D(output_depth, (window_height, window_width))`
        * **Slide** these windows of 3x3 or 5x5 over the 3D input feature map, stopping at every possible location, and extracting the 3D patch of surrounding features: `(window_height, window_width, input_depth)`
        * Each 3D patch is then transformed, via a **tensor product** with the **same learned weight matrix** called **<u>convolution kernel</u>**, into a 1D vector of shape `(output_depth,)`
        * Every spatial location in the output feature map corresponds to the same location in the input feature map (e.g. lower-right corner of the output contains information about the lower-right corner of the input)
        * e.g. With 3x3 windows, the vector `output[i,j,:]` comes from the 3D patch `input[i-1:i+2, j-1:j+2, :]` (Python slices exclude the end index)
            * The kernel has the same depth as the input (`input_depth`), so it can run on a grayscale image (depth 1) or a colored image (depth 3)
            * The different depths of the output feature map refer to the different filters that the layer is learning
        ![convolution_kernel](convolution_kernel.png)
        * The output width and height may differ from the input width and height for two reasons:
            * **Border effects** - can be countered by padding the input feature map
            * **Strides** - the use of strides
    * Understanding **Border Effects** and **Padding**
        * **Border Effects** - get a **<u>shrunken</u> version** of the feature map (eventually you'll have feature maps that are too small (e.g. 1x1) and you're also not capturing a lot of edge information)
            * e.g. consider a 5x5 feature map (25 tiles), there are only 9 tiles around which you can center a 3x3 window, forming a 3x3 grid (which is the size of the output feature map)
            ![border_effects](border_effects.png)
            * The feature map shrinks by exactly two tiles along each dimension (**border effect**)
                * e.g. 5x5 (25 tiles) $\rightarrow$ 3x3 (**9 patches**)
                * e.g. border effect with 28x28 inputs which becomes 26x26 after the first convolution layer
        * **Padding** - get an output feature map with the **<u>same</u> spatial dimensions**
            * (+) Allows the kernels to capture more overlapping areas to pick up more patterns
            * Padding consists of adding an appropriate number of rows and columns on each side of the input feature map so as to make it possible to fit center convolution windows around every input tile
            * e.g. For a 3x3 window, you add one column on the right, one column on the left, one row at the top, and one row at the bottom (for a 5x5 window, you add two rows and two columns on each side)
            * e.g. 5x5 (25 tiles) $\rightarrow$ 5x5 (**25 patches**)
            ![padding](padding.png)
            * In `Conv2D` layers, padding is configurable via the `padding` argument, which takes two values: `valid` or `same`
                * `valid` - means no padding (**only valid windows** will be used) - allowing for **border effects**
                * `same` - which means "pad in such a way as to have **an output with the same width and height as the input**" - using **padding**
    * Understanding **Convolution Strides**:
        * Strides are the other factor that can influence output size
        * The description of convolution so far has assumed that the center tiles of the convolution windows are all contiguous (sharing a common border)
        * **Stride** - the distance between two successive windows, a parameter of the convolution that defaults to 1
        * **Strided Convolutions** - convolutions with a stride higher than 1
            * e.g. Patches extracted by a 3x3 convolution **with stride 2 (2x2)** over a 5x5 input (without padding) yield 4 patches (a 2x2 output)
            ![strided_convolution](strided_convolution.png)
            * Using **stride 2 (2x2)** means the **width and height of the feature map** are **downsampled by a factor of 2** (in addition to any changes induced by border effects)
        * In practice, strided convolutions are **rarely used**, although they can come in handy for some types of models
        * To downsample feature maps, instead of strides, we tend to use the **max-pooling** operation
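
The effect of border effects, padding, and strides on the output size can be summarized in one small function. This is a sketch following the `valid`/`same` conventions described above; `conv_output_size` is a hypothetical helper, not part of Keras.

```python
import math

def conv_output_size(input_size, window, stride=1, padding="valid"):
    """Output length along one spatial axis for the `valid` and `same` conventions."""
    if padding == "valid":
        # Only windows that fit entirely inside the input are used.
        return (input_size - window) // stride + 1
    if padding == "same":
        # The input is padded so that, with stride 1, the size is unchanged.
        return math.ceil(input_size / stride)
    raise ValueError("padding must be 'valid' or 'same'")

print(conv_output_size(5, 3))                    # 3  (border effect: 5x5 -> 3x3)
print(conv_output_size(28, 3))                   # 26 (first MNIST convolution layer)
print(conv_output_size(5, 3, padding="same"))    # 5  (padding keeps 5x5)
print(conv_output_size(5, 3, stride=2))          # 2  (2x2 output, i.e. 4 patches)
```
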
* Convolutional Layers $\rightarrow$ Pooling Layers:
    * **Pooling Layers** - used immediately after convolutional layers to simplify the information in the output from the convolutional layer (e.g. **Max Pooling**)
        * **reduces the computational complexity** for later layers
        * provides a form of **translational invariance**
    ![max_pool](max_pool.png)
    * **Max-Pooling** Operation:
        * e.g. Convnet example has the size of the feature maps **halved** after every `MaxPooling2D` layer
            * Before the first `MaxPooling2D` layer, the feature map is 26x26 (`Conv2D`), but the max-pooling operation halves it to 13x13
        * **Max Pooling** - used to **aggressively downsample feature maps**, much like strided convolutions
            * Consists of extracting windows from the input feature maps and outputting the **<u>max</u> value of each channel**
            * Conceptually similar to convolution, except that instead of transforming local patches via a learned linear transformation (convolution kernel), they're **transformed via a hardcoded `max` tensor operation**
            * A big difference from convolution is that **max pooling** is usually done with **2x2 windows and stride 2**, in order to downsample the feature maps by a **factor of 2**
            * On the other hand, **convolution** is typically done with **3x3 windows and no stride (stride 1)**
        * Why is it important to downsample feature maps? (Prevents there being too many parameters at the end causing overfitting on noise)
        * Why not remove the max-pooling layers and keep fairly large feature maps all the way up? (Computationally expensive and will result in overfitting)
    * Faults of **<u>no</u> max-pooling model**:
        * Setup of no-max pooling model:
            ```python
            from keras import layers, models

            model_no_max_pool = models.Sequential()
            model_no_max_pool.add(layers.Conv2D(32, (3, 3), activation='relu',
                                  input_shape=(28, 28, 1)))
            model_no_max_pool.add(layers.Conv2D(64, (3, 3), activation='relu'))
            model_no_max_pool.add(layers.Conv2D(64, (3, 3), activation='relu'))
            ```
        * Here's the summary of the no-max-pooling model:
            ```python
            >>> model_no_max_pool.summary()
            Layer (type)                     Output Shape          Param #
            ================================================================
            conv2d_4 (Conv2D)                (None, 26, 26, 32)    320
            ________________________________________________________________
            conv2d_5 (Conv2D)                (None, 24, 24, 64)    18496
            ________________________________________________________________
            conv2d_6 (Conv2D)                (None, 22, 22, 64)    36928
            ================================================================
            Total params: 55,744
            Trainable params: 55,744
            Non-trainable params: 0
            ```
        * It isn't conducive to learning a **spatial hierarchy of features**:
            * e.g. 3x3 windows in the third layer will **only contain information coming from 7x7 windows in the initial input** (the model isn't building up from lower-level patterns such as edges to higher-level patterns)
            * The **high-level patterns learned** by the convnet will **still be very small** with regard to the initial input, which may **not be enough to learn to classify digits** (try recognizing a digit by only looking at it through windows that are 7x7 pixels)
            * We need the features from the **last convolution layer to contain information about the <u>totality</u> of the input**
        * The final feature map would be **too large**, which would result in **intense overfitting**:
            * e.g. final feature map has 22x22x64=30,976 total coefficients per sample
            * If you were to flatten it to stick a `Dense` layer of size 512 on top, that layer would have **15.8 million parameters**
            * This is far **too large** for such a small model and would result in **overfitting**
    * **Average Pooling** - each local input patch is transformed by taking the **<u>average</u> value of each channel over the patch**, rather than the max
        * Max pooling tends to work better than the alternatives (strided convolutions and average pooling)
    * Most reasonable subsampling strategy:
        1. Produce **dense maps of features** (via unstrided convolutions)
        2. Look at the **maximal activation of the features over small patches**, rather than looking at sparser windows of the inputs (via strided convolutions) (or average input patches, which would cause you to miss or dilute feature-presence information)
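
Both pooling variants fit in one short NumPy function. The sketch below is illustrative only: `pool2d` is a hypothetical helper and the 4x4 input values are made up. It pools each channel independently with a 2x2 window and stride 2 by default.

```python
import numpy as np

def pool2d(feature_map, window=2, stride=2, mode="max"):
    """Pool a 3D feature map (two spatial axes, one channel axis) channel by channel."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    reducer = np.max if mode == "max" else np.mean
    height, width, channels = feature_map.shape
    out_h = (height - window) // stride + 1
    out_w = (width - window) // stride + 1
    output = np.zeros((out_h, out_w, channels))
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[i * stride:i * stride + window,
                                j * stride:j * stride + window, :]
            output[i, j, :] = reducer(patch, axis=(0, 1))   # one value per channel
    return output

single = np.array([[1, 3, 2, 1],
                   [4, 2, 0, 1],
                   [5, 1, 9, 8],
                   [0, 2, 7, 6]], dtype=float)[:, :, np.newaxis]   # one channel
print(pool2d(single, mode="max")[:, :, 0])        # strongest activation per window
print(pool2d(single, mode="average")[:, :, 0])    # diluted by the weaker activations
print(pool2d(np.zeros((26, 26, 32))).shape)       # (13, 13, 32)
```
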
* Pooling Layers $\rightarrow$ Fully Connected Layers:
    * **Fully Connected Layers** - used to **aggregate all of the information** that has been learned in the **convolutional and pooling layers**
    * They produce **higher order features** in the standard neural network manner
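
The size of the feature map that gets flattened into the first fully connected layer largely decides how many parameters the model has. A minimal sketch: `dense_params` is a hypothetical helper, the 22x22x64 map is the final feature map of the no-max-pooling model above, and the 3x3x64 map is an assumed final feature map for a comparable model that does use pooling.

```python
def dense_params(inputs, units):
    """One weight per input per unit, plus one bias per unit."""
    return inputs * units + units

flattened = 22 * 22 * 64             # final feature map of the no-max-pooling model
print(flattened)                      # 30976
print(dense_params(flattened, 512))   # 15860224

pooled = 3 * 3 * 64                   # assumed final feature map when pooling is used
print(dense_params(pooled, 512))      # 295424
```
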
* Fully Connected Layers $\rightarrow$ Output Layer:
    * **Output Layer** - produces the **probability that the image is of a certain class**
    * **Softmax** can be applied to the output layer scores $\eta_k$, where $k=1,\dots,K$, to estimate the one-versus-all class probabilities of $K$ classes: $$\frac{e^{\eta_k}}{\sum_{k'=1}^K e^{\eta_{k'}}}$$
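
The softmax formula translates directly into code. This is a minimal pure-Python sketch with made-up scores; subtracting the largest score before exponentiating is a common numerical-stability trick that leaves the result unchanged.

```python
import math

def softmax(scores):
    """Convert raw output-layer scores into probabilities that sum to 1."""
    shift = max(scores)                            # subtracting the max avoids overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])                   # hypothetical scores for K = 3 classes
print([round(p, 3) for p in probs])                # [0.659, 0.242, 0.099]
```
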

4) CNN Intuition
* **Denoising is <u>not</u> commonly used** with CNNs, but it is available
* **New set of image processing techniques** for **getting more images** (e.g. rotate images, flip images, etc.) - making your model **more invariant** to such transformations
* It is **not** too common to use **dropout after convolutional layers** (instead use it after fully connected layers)
* It is common to have **multiple convolutional layers in between pooling layers**
* **ReLU activation** units are incredibly popular with CNNs
* In general, it is best to go off of a **research paper in your domain space that uses CNNs**. Start off trying to get something working that uses the **same structure** they did, and go from there.
* CNNs perform very well at identifying patterns in complex data (e.g. test error rates for recognizing digits from images):
    * **<u>Classifier</u>** - <u>Test Error Rate</u>
    * **Large and Deep Convolutional Network** - 0.33%
    * **SVM with degree 9 polynomial kernel** - 0.66%
    * **Gradient boosted stumps on Haar features** - 0.87%
* State of Computer Vision
![state_of_comp_vision](state_of_comp_vision.png)
    * On a scale of little data vs. lots of data, the tasks fall in ascending order: object recognition (less data), image recognition, and speech recognition (lots of data)
    * When we have little data, more hand-engineering is required, and we can enhance our networks using transfer learning
    * When we have lots of data, there usually is less hand-engineering required
* Tips for Doing Well on **Benchmarks** or **Winning Competitions**:
    * **Ensembling** - train several networks (3-15 networks) independently and average their outputs (not the weights) to gain about 1% or 2% accuracy in your results
        * The issue with ensembling is that you need to keep all of the networks around, which is computationally expensive
    * **Multi-crop at test time** - run classifier on multiple versions of the test images and average results
    ![multi_crop](multi_crop.png)
        * **10-crop** - a cropping technique that takes the center of original and mirrored image and then takes four images from original/mirrored using the top-left, top-right, lower-left, and lower-right aligned images (a total of 10 images)
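
The 10-crop idea, and the averaging step shared with ensembling, can be sketched in NumPy. The helper names, the 32x32 test image, and the per-crop probabilities below are all hypothetical; a real pipeline would feed each crop to the trained classifier.

```python
import numpy as np

def ten_crop(image, crop_height, crop_width):
    """Return 10 crops: center plus four corners of the image and of its mirror."""
    def five_crop(img):
        height, width = img.shape[:2]
        top_c = (height - crop_height) // 2
        left_c = (width - crop_width) // 2
        origins = [
            (top_c, left_c),                               # center
            (0, 0),                                        # top-left
            (0, width - crop_width),                       # top-right
            (height - crop_height, 0),                     # lower-left
            (height - crop_height, width - crop_width),    # lower-right
        ]
        return [img[t:t + crop_height, l:l + crop_width] for t, l in origins]

    mirrored = image[:, ::-1]     # horizontal flip
    return five_crop(image) + five_crop(mirrored)

def average_predictions(predictions):
    """Average class probabilities over crops (or over the networks of an ensemble)."""
    return np.mean(predictions, axis=0)

image = np.arange(32 * 32 * 3).reshape(32, 32, 3)     # hypothetical 32x32 RGB image
crops = ten_crop(image, 24, 24)
print(len(crops), crops[0].shape)                      # 10 (24, 24, 3)
print(average_predictions([[0.6, 0.4], [0.8, 0.2]]))   # made-up outputs for two crops
```
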

5) Training a Convnet on a **<u>Small</u> Dataset** (**Data Augmentation**/**Feature Extraction**/**Fine-Tuning**)
* Having to train an image-classification model using **very little data** is a common situation, which you'll likely encounter in practice if you ever do computer vision in a professional context
    * **"Few"** samples can mean anywhere from a **few hundred** to a **few tens of thousands** of images
    * Example: Classifying images as dogs or cats, in a dataset containing 4000 pictures of cats and dogs (2000 cats and 2000 dogs)
        * 2000 pictures for **training**
        * 1000 pictures for **validation**
        * 1000 pictures for **testing**
    * **Naively** train a small convnet on the 2000 training samples, **<u>without</u> any regularization (baseline)** yielding accuracy of **71%**
* Right now the main issue is overfitting:
    1. Introduce **data augmentation**, a powerful technique for **mitigating overfitting** in computer vision (by data augmentation and generation), to improve the network to reach **82%** accuracy
    2. **Feature extraction with a <u>pretrained network</u>** to reach accuracy of **90-96%**
    3. **<u>Fine-tuning</u> a pretrained network** to reach final accuracy of **97%**
* Is deep learning relevant for small-data problems?
    * Deep learning does **work well with lots of data** since it is able to **find interesting features in the training data on its own**, and this can only be achieved when lots of training examples are available (especially true for problems where the **input samples are very high dimensional** (e.g. images))
        * However, what constitutes lots of samples is relative to the size and depth of the network you train
    * It **isn't possible to train a convnet** to solve a complex problem with just a **few tens of samples**, but a **few hundred** can potentially suffice if the model is **small**, **well regularized**, and the task is **simple**
        * Since convnets learn **local, translation-invariant features**, they're **highly <u>data efficient</u> on <u>perceptual</u> problems**
        * (+) training a convnet from scratch on a very small image dataset will still yield reasonable results despite a relative lack of data, without the need for any custom feature engineering
    * Deep learning models are by nature **highly <u>repurposable</u>** - can **take an image-classification or speech-to-text model** trained on a large-scale dataset and **reuse it on significantly different problem** with only minor changes
        * Many **pretrained models** (usually **trained on ImageNet dataset**) are publicly available for download and can be used to bootstrap powerful vision models out of **very little data**
* **Data augmentation** - takes the approach of generating more training data from existing training samples, by **augmenting the samples** via a number of **random transformations** that yield believable-looking images
    * Types of transformations:
        * Mirroring
        * Random Cropping
        * Color Shifting (knowing that under different lighting of the picture, the object still remains the same)
            * Technique: PCA Color Augmentation (by AlexNet) - It brings colors that are more prominent to the same variance as the colors with less emphasis (e.g. R/B are high and G is low, it will bring down R/B)
    * Overfitting is caused by having too few samples to learn from, which leaves you unable to train a model that generalizes to new data
    * Given infinite data, your model would be exposed to every possible aspect of the data distribution, and you would never overfit
    * The goal is that at training time, your model will **never see the exact same picture twice**. This helps expose the model to more aspects of the data and generalize better
    * In Keras, this can be done by **configuring a number of random transformations** to be performed on the images read by the ImageDataGenerator instance
        ```python
        from keras.preprocessing.image import ImageDataGenerator

        datagen = ImageDataGenerator(
                    rotation_range=40,
                    width_shift_range=0.2,
                    height_shift_range=0.2,
                    shear_range=0.2,
                    zoom_range=0.2,
                    horizontal_flip=True,
                    fill_mode='nearest')
        ```
        * `rotation_range` is a value in degrees (0-180), a range within which to randomly rotate pictures
        * `width_shift_range` and `height_shift_range` are ranges (as a fraction of total width or height) within which to randomly translate pictures horizontally or vertically
        * `shear_range` is for randomly applying shearing transformations
        * `zoom_range` is for randomly zooming inside pictures
        * `horizontal_flip` is for randomly flipping half the images horizontally (relevant when there are no assumptions of horizontal asymmetry e.g. real-world pictures)
        * `fill_mode` is the strategy used for filling in newly created pixels, which can appear after a rotation or a width/height shift
    ![data_augmentation](data_augmentation.png)
    * If you train a network using **data-augmentation configuration**, the network will **never** see the **same input <u>twice</u>**
        * However, the inputs it sees are **still heavily intercorrelated** because they come from a **small number of original images**
        * You can only **remix existing** information, but **not** produce **new** information
        * As such, this may **not be enough** to completely get rid of **overfitting** (add **`Dropout`** layer to model to further fight overfitting)
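
To show what a random transformation does to a single image, here is a NumPy-only sketch of two of the transformations listed above (mirroring and random cropping). It is not how `ImageDataGenerator` is implemented; `random_flip_and_crop`, the 150x150 stand-in image, and the 128-pixel crop size are assumptions for illustration.

```python
import numpy as np

def random_flip_and_crop(image, crop_size, rng):
    """Randomly mirror an image, then cut a random crop_size x crop_size window."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                        # mirror left-right
    height, width = image.shape[:2]
    top = rng.integers(0, height - crop_size + 1)        # upper bound is exclusive
    left = rng.integers(0, width - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size, :]

rng = np.random.default_rng(42)
image = rng.random((150, 150, 3))                        # stand-in for one training picture
batch = np.stack([random_flip_and_crop(image, 128, rng) for _ in range(4)])
print(batch.shape)   # (4, 128, 128, 3)
```

Every sample in the batch is built from pixels of the same original picture, which is why augmented inputs remain heavily intercorrelated.
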
* Utilizing **Pretrained Convnet** / **Transfer Learning** - a pretrained convnet is a saved network that was **previously trained** on a large dataset, typically on a large-scale image-classification task; it is reused via **feature extraction** and/or **fine-tuning**
    * A common and highly effective approach to deep learning on **small image datasets**
    * If this **original dataset** is **large enough** and **general enough**, then the **spatial hierarchy of features learned** by the pretrained network can effectively act as a **generic model** of the visual world, and hence its **features can prove useful** for many different computer-vision problems (even if these new problems may involve different classes than those of the original task)
    * e.g. Train a network on **ImageNet** (where classes are mostly **animals** and **everyday objects**) and then repurpose this trained network for something as remote as **identifying furniture items in images**
    * Such **portability of learned features across different problems** is a key advantage of deep learning compared to many older, shallow-learning approaches
    * e.g. Large convnet trained on the ImageNet dataset (1.4 million labeled images and 1000 different classes). ImageNet contains many animal classes, including different species of cats and dogs (which means it should perform well on the dog-vs-cat classification problem)
    * **VGG16** architecture (developed by Karen Simonyan and Andrew Zisserman in 2014)
        * Simple and widely used convnet architecture for ImageNet
        * Older model, far from the current state of the art and somewhat heavier than many other recent models
    * Other pretrained convnets: **VGG**, **ResNet**, **Inception**, **Inception-ResNet**, **Xception**, etc.
    * Two ways to use a **pretrained network**: **<u>feature extraction</u>** and **<u>fine-tuning</u>**
    * **Feature Extraction** (Using Pretrained Convnet) - consists of using the representations learned by previous network to **extract interesting features from <u>new</u> samples**
        * These features are then run through a new classifier, which is trained from scratch
        * Convnets used for image classification comprise two parts: 
            * First part: **Convolutional base** - a series of **pooling and convolution layers**
            * Second part: End with densely connected classifier
        * Feature extraction consists of **taking the convolutional base** of a previously trained network, running the **new data through** it, and **training a new classifier** on the top of the **output**
        ![feature_extraction_pretrained](feature_extraction_pretrained.png)
        * Why are we only reusing the convolutional base? Could you reuse the densely connected classifier as well?
        * Avoid re-using the densely connected classifier because the **representations learned by the convolution base** are likely to be **more generic** and therefore **more reusable**
        * (Convolution base) The feature maps of a convnet are presence maps of **generic concepts** over a picture, which is likely to be useful regardless of the computer-vision problem at hand
            * Still has information about **object location**
        * (Dense classifier) But, the representations learned by the classifier will necessarily be **specific to the set of classes** on which the model was trained (They will only contain information about the presence probability of this or that class in the entire picture)
            * Representations in densely connected layers no longer have information about **where** objects are located in the input image (**no notion of space**)
            * For problems where **object location matters**, densely connected features are **largely useless**
        * **Level of generality and reusability** of the **representations** extracted by specific convolution layers depends on **<u>depth</u> of layer**
            * **Earlier** layers extract **local, highly generic feature maps** (e.g. visual edges, color, textures)
            * **Higher up** layers extract more **abstract concepts** (e.g. "cat ear" or "dog eye")
        * If the new dataset **differs a lot** from the dataset on which the original model was trained, you may be **better off** using only the **first few layers** for feature extraction rather than the entire convolutional base
        * e.g. Since ImageNet contains multiple dog and cat classes, it would likely be beneficial to reuse the information contained in the densely connected layers as well, but we won't, in order to cover the more general case where the class set of the new problem doesn't overlap the class set of the original model
            1. Use the convolutional base of the VGG16 network (`keras.applications.VGG16`), pretrained on ImageNet
            2. Extract interesting features from cat and dog images
            3. Train a dog-vs-cat classifier on top of these features
            ```python
            from keras.applications import VGG16
            
            conv_base = VGG16(weights='imagenet',
                              include_top=False,
                              input_shape=(150, 150, 3))
            ```
        * Pre-trained model arguments:
            * `weights` - specifies the weight checkpoint from which to initialize the model
            * `include_top` - refers to including (or not) the densely connected classifier on the top of the network. By default, this densely connected classifier corresponds to the 1000 classes from ImageNet.
            * `input_shape` - the shape of the image tensors that you'll feed to the network. This argument is purely optional (if you don't pass it, the network will be able to process inputs of any size)
        * Case study of convnet architectures:
            * **LeNet-5** (Classic) - developed by Yann LeCun and trained to recognize handwritten digits of size 32x32x1
                * Had 60k parameters (small compared to today's standard of 10 to 100 million parameters)
                * Used `sigmoid`/`tanh` activation functions instead of the popular `relu` activation today
                * Used average pooling instead of the popular max pooling today
                * Used a sigmoid non-linearity after the pooling layer
            * **AlexNet** (Classic) - developed by Alex Krizhevsky and trained to classify ImageNet dataset using 227x227x3 images
                * Similar architecture to LeNet, except with many more layers (a much bigger network) and 60 million parameters
                * Used `relu` activation function
                * Trained on GPUs, with the layers split across two GPUs, to make the computation feasible
                * Used **Local Response Normalization** - for each width/height position of a convolution layer's output, it normalizes the values across the channels so that no single activation becomes too large (it isn't widely used today since it doesn't have a big impact on the end result)
            * **VGG / VGG-16** (Classic) - developed by Karen Simonyan and Andrew Zisserman, who took a simplified approach to the network architecture compared to AlexNet; trained on the ImageNet dataset
            ![vgg16](vgg16.png)
                * Results in 138 million parameters (large by modern standards)
                * "16" part refers to the fact that there are 16 layers with weights
                * Starts with 64 filters, then 128 filters, then 256 filters, then finally 512 filters
                * Had convolutional layers with 3x3 kernels, stride = 1, and same padding
                * Had max pooling layers with 2x2 kernels and stride = 2
                * Also, there is a VGG-19 that is even larger than VGG-16, but VGG-16 performs about as well as VGG-19, so people often prefer VGG-16
            * **ResNet** - developed by Kaiming He and Xiangyu Zhang; the Residual Network is a very deep neural network with 152 layers
            ![plain_vs_resnet_images](plain_vs_resnet_images.png)
                * Very deep neural networks are hard to train because of **vanishing and exploding gradient type problems**
                * Uses **Residual Blocks (Skip Connections)** - takes the activation from one layer to a much **deeper** layer allowing you to train very deep neural networks (e.g. >100 layer NN)
                    * Normally, the input goes through a linear transformation with the weights, then an activation function like `relu`, and this keeps propagating forward layer by layer; with a residual block, the activation additionally follows a **"shortcut"**/**"skip connection"** that bypasses this main path and is added back in at a layer deeper in the network, just before that layer's activation function
                * In order to design a Residual Network, you have to add skip connections to a "plain" network every 2 layers
                ![resnet](resnet.png)
                * ResNet will overcome the vanishing/exploding gradient issues that cause your training error to rise with a very deep neural network architecture; however it will still most likely plateau at the end where the training error will not decrease anymore
                ![plain_vs_resnet](plain_vs_resnet.png)
                * Mathematically, a residual block is able to learn the **identity function** relatively easily, so adding more layers with skip connections will not hurt performance compared to a network without residual blocks
                ![resnet_math](resnet_math.png)
                    * Normal plain deep networks have difficulty picking parameters to learn (even the identity functions), which is why more layers in this case is worse
                    * It is important for a Residual Network to use many **"same" convolutions**, so that the input and output of a residual block have the **same dimensions** and can be added together through the skip connection
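
                The identity argument above can be checked numerically. This is a minimal NumPy sketch, not the ResNet implementation: it assumes a residual block built from two fully connected layers instead of convolutions, computing `relu(W2 @ relu(W1 @ a + b1) + b2 + a)`, and the names `residual_block`, `W1`, `b1`, `W2`, `b2` are illustrative.

                ```python
                import numpy as np

                def relu(z):
                    return np.maximum(z, 0.0)

                def residual_block(a, W1, b1, W2, b2):
                    """Two dense layers plus a skip connection from the block input."""
                    a1 = relu(W1 @ a + b1)      # first layer of the block
                    z2 = W2 @ a1 + b2           # second layer, before its activation
                    return relu(z2 + a)         # the skip connection adds the block input

                n = 4
                rng = np.random.default_rng(0)
                a = relu(rng.normal(size=n))    # non-negative, like the output of a ReLU layer

                # With zero weights and biases the block reduces to relu(a) = a.
                zero_W, zero_b = np.zeros((n, n)), np.zeros(n)
                out = residual_block(a, zero_W, zero_b, zero_W, zero_b)
                print(np.allclose(out, a))
                ```

                With all weights at zero the block returns its input unchanged, so an extra residual block can always fall back to doing nothing.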
            * **Networks in Networks / 1x1 Convolutions** - developed by Min Lin and Qiang Chen to utilize 1x1 kernels on convolutions to essentially apply **fully connected layer filters** to the output of the convolution layer
                ![1x1_conv](1x1_conv.png)
                * Uses a 1x1 kernel on image that has multiple channels or has been processed further, then applying a `relu` non-linearity function on it
                * We are essentially applying a **fully connected layer** across the channels at each spatial (width/height) position
                * Not widely used neural network architecture, but highly influential for **Inception Network**
                * 1x1 kernel is useful for **reducing the size of a convolution layer's <u>channel</u>**
                    * e.g. We have a 28x28x**192** convolution layer and we want to reduce it to 28x28x**32**. A standard way to do that is to use 32 1x1 kernels to reduce the channel/depth size (pooling only shrinks the width and height)
                    * But, if you want to keep the same channels, you allow your network to learn **more complex functions** using 192 1x1 kernels
                * 1x1 convolutions also help **reduce computations** on some networks
            * **Inception** - developed by Christian Szegedy and Wei Liu; instead of choosing a specific kernel size or a pooling operation for a given layer, it applies all of the different filter possibilities in one layer and stacks their outputs together, using padding to keep the width and height the same
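
                The channel reduction in the example above can be sketched in NumPy. This is only an illustration under the assumption that a 1x1 convolution is written as a matrix product over the channel axis; `conv1x1` and `kernels` are hypothetical names.

                ```python
                import numpy as np

                def conv1x1(x, kernels):
                    """x has shape (H, W, C_in); kernels has shape (C_in, C_out)."""
                    # Every spatial position sends its channel vector through the same
                    # dense layer, followed by a ReLU non-linearity.
                    return np.maximum(x @ kernels, 0.0)

                rng = np.random.default_rng(0)
                x = rng.normal(size=(28, 28, 192))
                kernels = rng.normal(size=(192, 32))    # 32 kernels, each of shape 1x1x192
                y = conv1x1(x, kernels)
                print(y.shape)                          # (28, 28, 32)
                ```

                The width and height stay at 28x28; only the channel depth changes from 192 to 32.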
            ![inception_net](inception_net.png)
                * In the example above, we are creating an inception layer with 1x1, 3x3, 5x5, and max-pool filters all stacked with the same width and height by using padding (28x28x192 $\rightarrow$ 28x28x256)
                ![inception_motivation](inception_motivation.png)
                * The issue with this inception layer is **computational cost**
                    * For the 5x5 filter in the inception layer, we output 28x28x32 values, each computed with a 5x5x192 filter, which is a total cost of about 120 million multiplications
                    * However, if we use a 1x1 convolution first, we can reduce the computational cost by roughly a factor of 10. We first reduce the 28x28x192 input to 28x28x16 using 16 1x1 filters (also called the **bottleneck layer**), then apply a normal 5x5 filter to get the final result of 28x28x32
                    ![1x1conv_for_5x5](1x1conv_for_5x5.png)
                        * The cost of the bottleneck layer (1st convolution layer) is 28x28x16 times 1x1x192 which is 2.4 million
                        * The cost of the 2nd convolution layer is 28x28x32 times 5x5x16 which is 10 million
                        * The total cost is 12.4 million (which is 10 times lower than without 1x1 filter!)
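
                    The cost figures above can be reproduced with a few lines of arithmetic. The helper name `conv_multiplications` is illustrative; it counts one multiplication per kernel weight per output value.

                    ```python
                    def conv_multiplications(out_h, out_w, n_filters, k_h, k_w, in_channels):
                        """Each output value needs k_h * k_w * in_channels multiplications."""
                        return out_h * out_w * n_filters * k_h * k_w * in_channels

                    direct = conv_multiplications(28, 28, 32, 5, 5, 192)       # 5x5 conv on 28x28x192
                    bottleneck = conv_multiplications(28, 28, 16, 1, 1, 192)   # 1x1 conv down to 16 channels
                    second = conv_multiplications(28, 28, 32, 5, 5, 16)        # 5x5 conv on 28x28x16

                    print(direct)                # 120422400
                    print(bottleneck + second)   # 12443648
                    ```

                    The ratio is about 9.7, which is where the "factor of 10" comes from.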
                * Putting it all together with an entire Inception module:
                ![inception_module](inception_module.png)
                    * Precede the 3x3 and 5x5 convolutions with a 1x1 convolution (the 1x1 branch itself doesn't need an extra conv layer)
                    * Pass a 1x1 kernel after a max pool layer to reduce the number of channels or depth
                    * At the very end, perform **channel concat** to bring together all of the convolution layer outputs into one (e.g. 28x28x256)
                * **Inception side branches** - extra branches that attach a softmax classifier to some of the hidden layers
                ![inception_branches](inception_branches.png)
                    * These are added to check that the hidden layers are already producing reasonable predictions, even before we make the final prediction at the end
        * List of image-classification models (all pretrained on the ImageNet dataset)
            * **Xception**
            * **Inception V3**
            * **ResNet50**
            * **VGG16**
            * **VGG19**
            * **MobileNet**
        * Two ways to extract features using pretrained model:
            * <u>Method 1</u>: **Fast Feature Extraction Without Data Augmentation** - fast and cheap to run since it only requires running the convolutional base **once** for every input image. However, it doesn't allow for data augmentation
                1. Run convolutional base over dataset
                2. Record its output to a NumPy array on disk
                3. Use this data as input to a standalone, densely connected classifier
            * This method yields an accuracy of **90%**, but the loss shows **overfitting** almost immediately despite dropout, since the model **doesn't use data augmentation**
            * <u>Method 2</u>: **Feature Extraction With Data Augmentation** - expensive, but allows for data augmentation since **every input image goes through the convolutional base** every time it's seen by the model
                1. Extend model you have (`conv_base`) by adding `Dense` layers on top
                2. Run the whole model end to end on the input data
            * This technique is so **expensive** that it should only be attempted with **access to a GPU**
            * Before compiling and training the model, it's very important to **<u>freeze</u> the convolutional base**
                * Freezing a layer or set of layers means preventing their weights from being updated during training (e.g. `conv_base.trainable = False`)
                * If you **don't do this**, then the **representations that were previously learned** by the convolutional base will be **modified during training**
                * Because the `Dense` layers on top are randomly initialized, **very large weight updates** would be propagated through the network, **effectively destroying the representations previously learned**
            * With this setup, **only the weights from the layers** that you **added** will be trained (e.g. two `Dense` layers)
            * Note that in order for these changes to take effect, you must **compile** the model after setting `trainable`; if you change trainability after compilation, recompile the model
            * This method will allow for **validation accuracy of 96%** (much better than the small convnet trained from scratch)
    * **Fine Tuning** - consists of **unfreezing a few of the top layers of a frozen model base** used for feature extraction, and **jointly training** both the newly added part of the model (e.g. the fully connected classifier) and these top layers
        * Another widely used technique for model reuse, complementary to feature extraction
        * This is called **fine-tuning** because it **slightly adjusts** the more **abstract** representations of the model being reused, in order to make them **more relevant** for the problem at hand
        ![fine_tuning](fine_tuning.png)
        * Earlier, we stated that it's necessary to freeze the convolutional base of VGG16 in order to train a randomly initialized classifier on top. For the same reason, it's only possible to fine-tune the top layers of the convolutional base once the classifier on top has already been trained
        * If the classifier isn't already trained, then the **error signal** propagating through the network during training will be **too large**, and the **representations previously learned** by the layers being fine-tuned will be **destroyed**
        * Fine-tuning Steps:
            1. Add your custom network on top of an already-trained base network
            2. Freeze the base network
            3. Train the part you added
            4. Unfreeze some layers in the base network
            5. Jointly train both these layers and the part you added
        * e.g. With the convolutional base, we will fine-tune the last three convolutional layers, which means all layers up to `block4_pool` should be frozen, and the layers `block5_conv1`, `block5_conv2`, and `block5_conv3` should be trainable
        * Why not fine-tune more layers? Why not fine-tune the entire convolutional base? Consider these facts when tuning more layers:
            * **Earlier layers** in the convolutional base encode **more-generic, reusable** features, whereas layers higher up encode more-specialized features. 
                * It's more useful to fine-tune the more specialized features, because these are the ones that need to be **repurposed on your new problem**. 
                * There would be fast-decreasing returns in fine-tuning lower layers
            * The **more parameters** you're training, the more you're at risk of **overfitting**
                * e.g. The convolutional base has 15 million parameters, so it would be **risky** to attempt to train it on your **small dataset**
            * The **more data** you have the **more flexible you can be about fine-tuning more layers** of the network (even the entire transferred neural network architecture if you feel you have a sizable dataset)
            ![transfer_learning](transfer_learning.png)
        * Fine-tune the network with the **RMSProp optimizer** using a **<u>very low</u> learning rate** to **limit the magnitude** of the modifications made to the representations of the three fine-tuned layers. Updates that are **too large** may **harm** these representations
            * You may want to **smooth out noisy curves** to see the actual trend of the training and validation accuracy/loss plots
            * Why can the accuracy stay stable or improve if the loss **isn't** decreasing?
                * What you display is an **average of pointwise loss values**, but what matters for accuracy is the **distribution of the loss values**, **not their average**, because accuracy is the result of a binary thresholding of the class probability predicted by the model. The model may **still be improving** even if this isn't reflected in the **average loss**
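
The freezing rule from the fine-tuning example (freeze everything up to `block4_pool`, train `block5_conv1` through `block5_conv3`) can be sketched without loading a model. The `Layer` dataclass below is a hypothetical stand-in for a Keras layer, of which only the `name` and `trainable` attributes are used; with a real `conv_base`, the same loop would presumably run over `conv_base.layers`.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    """Stand-in for a Keras layer: only the two attributes this sketch needs."""
    name: str
    trainable: bool = False

# Layer names of the VGG16 convolutional base, in order (input layer omitted).
names = [f"block{block}_{kind}"
         for block, n_convs in [(1, 2), (2, 2), (3, 3), (4, 3), (5, 3)]
         for kind in [f"conv{i}" for i in range(1, n_convs + 1)] + ["pool"]]
layers = [Layer(name) for name in names]

# Freeze everything before block5_conv1 and unfreeze everything from there on.
set_trainable = False
for layer in layers:
    if layer.name == "block5_conv1":
        set_trainable = True
    layer.trainable = set_trainable

print([layer.name for layer in layers if layer.trainable])
# ['block5_conv1', 'block5_conv2', 'block5_conv3', 'block5_pool']
```

`block5_pool` is also flagged as trainable, which is harmless because a pooling layer has no weights to update.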

6) Visualizing What Convnets Learn
* Often people say deep-learning models are **"black boxes"**: learning representations that are **difficult to extract** and **present in a <u>human-readable</u> form**
    * Although this is partially true for certain types of deep-learning models, it's **definitely not true for convnets**
* The representations learned by convnets are **highly <u>amenable</u> to visualization**, in large part because they're **representations of visual concepts**
* Since 2013, a wide array of techniques have been developed for visualizing and interpreting these representations:
    1. **Visualizing <u>intermediate</u> convnet outputs (Intermediate activations)** - useful for understanding **how successive convnet layers transform their input**, and for getting a first idea of the **meaning of individual convnet <u>filters</u>**
    2. **Visualizing convnets <u>filters</u>** - useful for understanding precisely what **<u>visual pattern</u> or concept each filter** in a convnet is **receptive to**
    3. **Visualizing <u>heatmaps of class activation</u> in an image** - useful for understanding which **parts of an image** were identified as **belonging to a given class**, thus allowing you to **localize objects** in images
* **Visualizing Intermediate Activations** - consists of **displaying the feature maps** that are output by various convolution and pooling layers in a network (a layer's output is the output of its activation function, hence "activation"); these maps show how sparsity and abstraction increase with depth, with less visual detail and more information about the class of the image
    * This gives a view into how an input is decomposed into the different filters learned by the network
    * Visualize feature maps with three dimensions: width, height, depth (channels)
    * Each channel encodes relatively **independent features**, so the proper way to visualize these feature maps is by **independently plotting** the contents of every channel as a 2D image
    * In order to extract the feature maps, create a Keras model that takes batches of images as input, and outputs the **<u>activations</u> of all convolution and pooling layers (`Model`)**
    * What sets the **`Model` class** apart is that it **allows for models with <u>multiple</u> outputs**, unlike `Sequential`
    * When fed an image input, this model returns the values of the layer activations in the original model
        * e.g. One input and eight outputs (one output per layer activation)
    * Take an input image and visualize specific channels of the activation layer
    ![test_cat_input](test_cat_input.png)
    ![test_cat_diagonal_edge](test_cat_diagonal_edge.png)
    ![test_cat_eyes](test_cat_eyes.png)
    * Visualizing every channel in every intermediate activation
    ![visualize_intermediate_activation](visualize_intermediate_activation.png)
        * The **first layer** acts as a collection of various **edge detectors**. At that stage, the **activations retain <u>almost all</u> of the information** present in the initial picture
        * As you go higher, the activations become increasingly abstract and less visually interpretable
            * e.g. encode higher-level concepts such as "cat ear" and "cat eye"
            * **Higher representations** carry increasingly **less information about the visual contents** of the image, and increasingly **more information related to the <u>class</u> of the image**
        * The **<u>sparsity</u> of the activations increases** with the depth of the layer: in the first layer, all filters are activated by the input image. But, in the following layers, more and **more filters are blank**. This means the **pattern encoded by the filter isn't found in the input image**
    * Important universal characteristic of representations learned by deep neural networks: the features extracted by a layer become **increasingly abstract with the depth of layer**
    * The activations of higher layers carry less and less information about the specific input being seen, and more and **more information about the target** (in this case, the class of image: cat or dog)
    * A deep neural network effectively acts as an **information distillation pipeline**, with raw data going in (e.g. RGB pictures) and being repeatedly transformed so that **irrelevant information is filtered out** (e.g. specific visual appearance of the image), and **useful information is magnified and refined** (e.g. class of image)
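
Plotting every channel independently, as described above, usually means tiling the channels into one grid image. The following is a minimal NumPy sketch on a synthetic activation; `channels_to_grid` and `per_row` are illustrative names, and in practice the activation would come from the multi-output Keras model.

```python
import numpy as np

def channels_to_grid(activation, per_row=4):
    """Tile an (H, W, C) activation into one 2D uint8 image, one tile per channel."""
    h, w, c = activation.shape
    rows = int(np.ceil(c / per_row))
    grid = np.zeros((rows * h, per_row * w), dtype=np.uint8)
    for ch in range(c):
        tile = activation[:, :, ch].astype(np.float64)
        # Rescale each channel on its own so every tile uses the full 0-255 range.
        span = tile.max() - tile.min()
        if span > 0:
            tile = (tile - tile.min()) / span * 255.0
        else:
            tile = np.zeros_like(tile)   # blank tile: this filter never fired
        r, col = divmod(ch, per_row)
        grid[r * h:(r + 1) * h, col * w:(col + 1) * w] = tile.astype(np.uint8)
    return grid

rng = np.random.default_rng(0)
activation = np.maximum(rng.normal(size=(8, 8, 6)), 0.0)   # fake post-ReLU feature map
activation[:, :, 5] = 0.0                                  # a channel that never activates
grid = channels_to_grid(activation)
print(grid.shape)   # (16, 32)
```

The resulting array can be shown with any image viewer; blank tiles correspond to the sparse, inactive filters discussed above.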
* **Visualizing Convnet Filters** - easy way to inspect filters learned by convnets is to **display the visual pattern** that each filter is meant to respond to via **gradient ascent in input space** (e.g. from edges to textures to relevant patterns in images)
    * **Gradient Ascent in Input Space** - applying **gradient ascent** (i.e. gradient descent on the negated filter response) to the values of the input image of a convnet so as to **maximize** the response of a specific filter, starting from a blank input image. The **resulting input image** will be one that the **chosen filter is <u>maximally</u> responsive to**
    * Process For **Visualizing Filter**:
        1. Build a loss function that **maximizes** the value of a given filter in a given convolution layer
        2. Use **gradient ascent** to adjust the values of the input image to **maximize the activation value**
            * A non-obvious trick to help the gradient ascent process go smoothly is to **normalize the gradient tensor by dividing it by its L2 norm** (gradient normalization trick). This ensures that the **magnitude of the updates** done to the input image is always **within the same range**
        3. Postprocess this tensor with a utility function to turn it into a displayable image
    * Examples of generated grid of all filter response patterns in a layer
        ![block1_conv1](block1_conv1.png)
        ![block2_conv1](block2_conv1.png)
        ![block3_conv1](block3_conv1.png)
        ![block4_conv1](block4_conv1.png)
    * These **filter visualizations** tell you a lot about **how convnet layers see the world**
        * Each layer in a convnet **learns a collection of filters** such that their inputs can be expressed as a **combination of the filters**
        * Similar to how **Fourier transform** decomposes **signals onto a bank of cosine functions**
    * The filters in these convnet filter banks **get increasingly complex and refined** as you go higher in the model:
        * The filters from the first layer in the model (`block1_conv1`) encode **simple directional edges and colors** (or colored edges, in some cases)
        * The filters from `block2_conv1` encode **simple textures** made from combinations of edges and colors
        * The filters in higher layers begin to **resemble textures found in natural images**: feathers, eyes, leaves, etc.
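
The three-step filter visualization process above can be sketched on a toy problem. This is a minimal NumPy illustration, not the Keras workflow: it assumes a single hand-written 3x3 "filter" followed by a ReLU instead of a trained convnet, with the gradient computed by hand. `response_and_grad` and `deprocess_image` are hypothetical helper names, and the post-processing constants are one common choice.

```python
import numpy as np

def response_and_grad(img, kernel):
    """Mean ReLU response of one filter over the image, and its gradient w.r.t. the image."""
    kh, kw = kernel.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    pre = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            pre += kernel[i, j] * img[i:i + out_h, j:j + out_w]
    active = (pre > 0).astype(float)               # derivative of the ReLU
    grad = np.zeros_like(img)
    for i in range(kh):
        for j in range(kw):
            grad[i:i + out_h, j:j + out_w] += kernel[i, j] * active
    n = out_h * out_w
    return np.maximum(pre, 0.0).mean(), grad / n

def deprocess_image(x):
    """Turn an arbitrary float array into a displayable uint8 image."""
    x = (x - x.mean()) / (x.std() + 1e-5) * 0.1    # center on 0 with std 0.1
    x = np.clip(x + 0.5, 0.0, 1.0)                 # shift to [0, 1]
    return (x * 255).astype(np.uint8)

rng = np.random.default_rng(0)
img = rng.normal(scale=0.1, size=(16, 16))         # near-blank input with a little noise
kernel = np.array([[1.0, 0.0, -1.0],
                   [2.0, 0.0, -2.0],
                   [1.0, 0.0, -1.0]])              # toy vertical-edge filter

start_loss, _ = response_and_grad(img, kernel)
for _ in range(40):
    loss, grad = response_and_grad(img, kernel)
    grad /= np.linalg.norm(grad) + 1e-8            # gradient normalization trick
    img += 1.0 * grad                              # ascent step: move uphill
final_loss, _ = response_and_grad(img, kernel)

pattern = deprocess_image(img)
print(final_loss > start_loss, pattern.shape, pattern.dtype)
```

With a real model, the loss and gradient would come from the framework's automatic differentiation rather than from hand-written loops.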
* **Visualizing Heatmaps of Class Activation** - useful for understanding which **<u>parts</u> of a given image** led a convnet to its **final classification decision**
    * Helpful for **debugging the decision process** of a convnet, particularly in the case of a **classification <u>mistake</u>**
    * Allows you to **locate specific objects** in an image
    * **Class Activation Map (CAM) Visualization** - consists of producing **heatmaps of class activation** over input images
        * A class activation heatmap is a **2D grid of scores** associated with a specific output class, **computed for every location** in any input image, indicating **how important each location is** with respect to the class under consideration
        * e.g. given an image fed into a dogs-vs-cats convnet, CAM visualization allows you to generate a heatmap for the class "cat" indicating how cat-like different parts of the image are, and also a heatmap for the class "dog" indicating how dog-like parts of the image are
        * **Grad-CAM: Visual Explanation from Deep Networks via <u>Gradient-based Localization</u>** - consists of taking the **output feature map of a convolution layer**, given an input image, and **weighting <u>every channel</u> in that feature map** by the **gradient of the class** with respect to the channel
            * Intuitively, one way to understand this trick is that you're **weighting a spatial map** of "how intensely the input image **activates different channels**" by "how important **each channel is with regard to the class**", resulting in a spatial map of "how **intensely the input image activates the class**"
            * e.g. Image of two African elephants
            ![test_pic_elephant](test_pic_elephant.png)
            ![heatmap_act](heatmap_act.png)
            ![heatmap_elephant](heatmap_elephant.png)
        * This visualization technique answers two important questions:
            * Why did the network think this image contained an African elephant?
            * Where is the African elephant located in the picture?
        * In particular, it's interesting to note that the ears of the elephant calf are strongly activated: this is probably how the network can tell the difference between African and Indian elephants
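
The channel-weighting step of Grad-CAM described above can be sketched in NumPy. This is a minimal illustration on synthetic arrays: in practice `feature_map` and `grads` would be obtained from the network for a real image, and `grad_cam_heatmap` is an illustrative name.

```python
import numpy as np

def grad_cam_heatmap(feature_map, grads):
    """feature_map and grads both have shape (H, W, C) for one convolution layer."""
    # Importance of each channel for the class: gradient averaged over positions.
    channel_weights = grads.mean(axis=(0, 1))                  # shape (C,)
    # Weight every channel by its importance, then average over the channels.
    heatmap = (feature_map * channel_weights).mean(axis=-1)    # shape (H, W)
    heatmap = np.maximum(heatmap, 0.0)                         # keep positive evidence only
    peak = heatmap.max()
    return heatmap / peak if peak > 0 else heatmap             # scale to [0, 1]

# Synthetic example: channel 0 fires top-left and supports the class,
# channel 1 fires bottom-right and speaks against it.
feature_map = np.zeros((4, 4, 2))
feature_map[0, 0, 0] = 1.0
feature_map[3, 3, 1] = 1.0
grads = np.zeros((4, 4, 2))
grads[:, :, 0] = 0.5
grads[:, :, 1] = -0.5

heatmap = grad_cam_heatmap(feature_map, grads)
print(np.unravel_index(heatmap.argmax(), heatmap.shape))
```

The heatmap is hot only where a class-supporting channel is active, which is what gets overlaid on the input image.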

7) Object Detection
* **Object Localization** - the task of taking a classified image and drawing a bounding box that identifies the location of the object inside the image
![object_loc](object_loc.png)
    * Breakdown of types of classification/detection tasks (in increasing complexity):
        * **Image Classification** - the task of **labelling** an image, i.e. predicting its class (e.g. is it a car?)
            ![normal_classification](normal_classification.png)
            * e.g. A convolutional neural net that has a softmax of 4 outputs (e.g. ped, car, motorcycle, or background)
        * **Classification with localization** - the task of boxing a **single** classified object in an image (e.g. drawing a box around where the car is inside the picture)
            ![classification_local](classification_local.png)
            * e.g. Take the convnet architecture above and add to the label output a **bounding box** output: the coordinates of the object's center plus the object's height and width (4 values)
        * **Object Detection** - the task of identifying **multiple objects** in an image in different categories (e.g. boxing multiple different cars, pedestrians, trees, landmarks, etc.)
    * Defining target label `y` for Localization:
    ![target_label_local](target_label_local.png)
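        * We want to output `p_c` (the probability that an object is present), the bounding box `b_x`,`b_y`,`b_h`,`b_w`, and the class indicators `c_1`,`c_2`,`c_3` (eight values in total; the fourth class, background, corresponds to `p_c=0`)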
        * If there is a car in an image, then `p_c=1` with `b_x`,`b_y`,`b_h`,`b_w` values and `c_2=1` with the rest 0
        * If there is no car in an image, then `p_c=0`
        * Mathematically, we will have a loss function (e.g. MSE) that, when `p_c=1`, measures the error over all eight values
        * When `p_c=0`, we only care about how accurately we predict the 1st value, `p_c`; the other seven values are ignored
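
The two cases of the loss above can be written out directly. This is a minimal sketch assuming a sum of squared errors and the eight-value layout `[p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]`; the function name and the example numbers are made up for illustration.

```python
import numpy as np

def localization_loss(y_true, y_pred):
    """Squared-error loss for y = [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true[0] == 1:
        # An object is present: every one of the eight values counts.
        return float(np.sum((y_pred - y_true) ** 2))
    # No object: only p_c counts, the other seven values are "don't care".
    return float((y_pred[0] - y_true[0]) ** 2)

car = [1, 0.5, 0.7, 0.3, 0.4, 0, 1, 0]         # p_c=1, box, c_2=1 (car)
background = [0, 0, 0, 0, 0, 0, 0, 0]          # p_c=0, remaining values ignored
pred = [0.9, 0.5, 0.6, 0.3, 0.4, 0.1, 0.8, 0.1]

print(localization_loss(car, pred))
print(localization_loss(background, pred))
```

For the same prediction, the first loss sums the errors of all eight entries, while the second depends only on how far the predicted `p_c` is from 0.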
* **Landmark Detection** - Instead of having a bounding box, we have **1 or more points that are marked** to identify a specific class of image
    ![landmark](landmark.png)
    * e.g. Face detection will have 64 landmark outputs that identify the outline of the eyes, nose, mouth, and jawline resulting in 129 outputs (1 for face classification, 64*2=128 for landmark coordinates)
    * (-) The caveat is that the landmark coordinates for the training data need to be **labelled manually**
    * Practical applications: 
        * **Snapchat filters** (AR - Augmented Reality) (e.g. detects face and allows you to add a "virtual" crown on your video face)
        * **Emotion Detection** - be able to detect emotions on your face by the way the landmarks are outputted from a specific emotion (e.g. anger, sad, happy, etc.)
        * **Human Pose Detection** - using landmarks to identify parts of the body (e.g. shoulder, leg, head, arms, etc.)
        ![pose_detection](pose_detection.png)
* Object Detection using the **Sliding Windows Detection Algorithm** - a bounding box (window) is slid from the top-left corner of an image to the top-right, then across each subsequent row until it reaches the lower-right corner; each of these windows is cropped out and classified as containing an object or not
    ![sliding_windows](sliding_windows.png)
    * There should be a labelled dataset of cropped images (e.g. different labeled cars)
    * After running through the image once with the sliding window, we increase the size of the sliding window and run it again
    * The key is that if a window covers an object we want to detect, the classifier marks it accordingly in the output
    * (-) The **computational cost** of using sliding windows is **huge** since you're running each of the cropped sliding windows independently through the convnet
        * If you **increase the stride**, it will cut down on the computation cost, but might **hurt your performance**, since coarser steps may miss objects or localize them poorly
    * Before the rise of neural networks, these algorithms used simpler classifiers (e.g. a linear classifier with hand-engineered features) on each window, which worked fine because a linear function is cheap to compute. However, with a convnet each classification is much more expensive, so this algorithm would be too slow
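
The trade-off between stride and cost can be made concrete by counting windows. This is a small sketch with an assumed 28x28 image and a 14x14 window; `sliding_windows` is an illustrative helper that only enumerates window positions, and every position would need its own classifier call.

```python
def sliding_windows(img_h, img_w, win, stride):
    """Yield (top, left) corners row by row, from the top-left to the bottom-right."""
    for top in range(0, img_h - win + 1, stride):
        for left in range(0, img_w - win + 1, stride):
            yield top, left

counts = [sum(1 for _ in sliding_windows(28, 28, 14, stride)) for stride in (1, 2, 4)]
print(counts)   # [225, 64, 16]
```

Going from stride 1 to stride 4 cuts the number of classifier calls from 225 to 16, at the price of much coarser window positions.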
* **Convolutional Implementation** of **Sliding Windows** - introduced in **OverFeat**, developed by Pierre Sermanet; the network is **fully convolutional end to end**, replacing the fully connected layers with convolutional layers that produce outputs of the same dimensions. This allows the sliding windows to share computation through the entire network.
    * Turning **Fully Connected Layers** into **Convolutional Layers** - instead of feeding the last convolution layer's output into a fully connected layer, we use kernels with the **same width/height as that output** and **as many kernels as there would be nodes in the fully connected layer**
    ![fc_to_conv](fc_to_conv.png)
        * In the example above, we are using 400 5x5 kernels on the "last" convolution layer of 5x5x16 to output 1x1x400 (similar to the dimensions if we used a fully connected layer of 400 nodes)
        * Then, we apply 400 1x1 kernels and finally 4 1x1 kernels, giving 1x1x400 and 1x1x4 volumes that match the fully connected results of 4 possible outputs in the end
    * **Convolutional Sliding Windows Detection** - we feed a larger (padded) image, with slightly increased width/height, through the same layers of the network; each position in the output volume corresponds to one sliding window's result, with the computation shared between windows
    ![conv_slide](conv_slide.png)
        * Instead of having to run **four crops through the convnet independently**, one for each of the sliding windows, we can run a **single pass that outputs the results of all of the sliding windows**
    ![larger_conv_slide](larger_conv_slide.png)
        * The **max pooling layer dictates the slide value** of your windows (e.g. 28x28x3 image with max pool of 2x2 has stride of 2)
        * Running a 14x14 window on the 28x28x3 image will result in an 8x8x4 output with the same convnet architecture
        * To reiterate, instead of running the convnet sequentially on each sliding window, we are making **all of the predictions at once** using the **convolutional implementation with one forward pass**
    * (-) The **position of the bounding boxes** is still **not** too accurate in capturing the actual object (e.g. bounding box slides where the object is but never fully captures it) since you're **limited to the sliding factor** of the bounding boxes through the image
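
The output sizes quoted above can be traced layer by layer. This sketch assumes the architecture implied by the figures: a 5x5 convolution without padding, a 2x2 max pool with stride 2, a 5x5 convolution with 400 kernels standing in for the first fully connected layer, and two 1x1 convolutions. The first convolution's size is an assumption, chosen so that a 14x14 input yields the 5x5x16 volume mentioned above.

```python
def output_size(size):
    """Spatial size of the final output for a square input of the given size."""
    size = size - 5 + 1    # 5x5 convolution, no padding (16 filters)
    size = size // 2       # 2x2 max pooling with stride 2
    size = size - 5 + 1    # 5x5 convolution with 400 kernels (replaces the first FC layer)
    # The two 1x1 convolutions (400 kernels, then 4) leave the spatial size unchanged.
    return size

sizes = {s: output_size(s) for s in (14, 16, 28)}
for s, out in sizes.items():
    print(f"{s}x{s}x3 -> {out}x{out}x4")
```

A 14x14 input gives a single 1x1x4 prediction, a 16x16 input gives 2x2x4 (four windows), and a 28x28 input gives 8x8x4 (64 windows), all in one forward pass.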
* **Bounding Box Predictions** - improving the bounding box accuracy
    * **You Only Look Once (YOLO) Algorithm** - developed by Joseph Redmon; allows for more accurate bounding boxes by **splitting the image into a grid** (normally 19x19) and, for **each grid cell**, predicting an output vector containing `p_c`, the **bounding box coordinates**, and the class indicators, all in one forward pass using a convolutional implementation (non-sequential propagation)
    ![yolo_alg](yolo_alg.png)
        * In the 3x3 grid in the example above, the mid-left and mid-right cells will have `p_c=1` with bounding box coordinates (center coordinates plus width and height of the box), whereas the rest will have `p_c=0`; each cell holds 8 values (`p_c`, 4 bounding box values, and 3 classification values), which results in a 3x3x8 output
        * Using the grids, we are outputting **precise bounding boxes** (like previously mentioned in localization topic) of where the object is based on its location within the grid
        * (-) There might be a problem if you are detecting multiple objects within a grid cell, which could be circumvented using a much larger grid (e.g. 19x19 grid)
        * (+) The output of this convnet is **relatively fast**, so it works pretty well for **real-time object detection**
    * How do we **specify the bounding boxes**?
        ![specify_boundingb](specify_boundingb.png)
        1. Find the grid cell that contains the midpoint of the object (e.g. car)
        2. Label the top-left corner of that cell as (0,0) and the bottom-right as (1,1), then give the center of the object as `b_x` and `b_y` (must be between 0 and 1)
        3. Give the width and height (`b_w` and `b_h`) of the object as a fraction of the grid cell's width/height (non-negative values that can be greater than 1)
            * They can be greater than 1 because the object may be larger than the grid cell
            * There are other parameterizations that work better for specifying the bounding box
                * Use **sigmoid function** to make sure that `b_x` and `b_y` are between 0 and 1
                * Use **exponential parameterization** to make sure that `b_w` and `b_h` are non-negative values
    * YOLO paper is hard to read (if you're interested in reading it)
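
The labeling steps above can be sketched in a few lines. This is a minimal illustration, assuming a square image, boxes given in pixels as a center plus width and height, and hypothetical helper names (`encode_box`, `decode_raw`). The second function shows the sigmoid/exponential idea for mapping unconstrained network outputs to valid values; it is not the exact parameterization of any particular YOLO version.

```python
import math


def encode_box(cx, cy, w, h, image_size, grid_size):
    """Convert a pixel-space box to (row, col, b_x, b_y, b_w, b_h) relative to its grid cell."""
    cell = image_size / grid_size
    col = int(cx // cell)   # grid cell that contains the midpoint
    row = int(cy // cell)
    b_x = cx / cell - col   # midpoint offset inside the cell, in [0, 1)
    b_y = cy / cell - row
    b_w = w / cell          # can exceed 1 when the object is larger than a cell
    b_h = h / cell
    return row, col, b_x, b_y, b_w, b_h


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_raw(t_x, t_y, t_w, t_h):
    """Map unconstrained outputs to b_x, b_y in (0, 1) and b_w, b_h > 0."""
    return sigmoid(t_x), sigmoid(t_y), math.exp(t_w), math.exp(t_h)


# A 120x60 pixel car centered at (250, 160) in a 300x300 image with a 3x3 grid
print(encode_box(250, 160, 120, 60, image_size=300, grid_size=3))
print(decode_raw(0.0, 0.0, 0.0, 0.0))  # (0.5, 0.5, 1.0, 1.0)
```

In the usage example the car lands in row 1, column 2, and its `b_w` is 1.2, showing how a width can exceed 1 when the object is wider than its grid cell.
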
* **Intersection Over Union (IoU)** - given the predicted bounding box and the ground-truth bounding box, it **divides the area of intersection of the two boxes** by the **area of their union** to determine how well the predicted bounding box localizes the object
    * Problem: How do you tell if your object detection algorithm is doing well?
    ![iou](iou.png)
    * Used to evaluate the object detection algorithm or to add another component to the object detection algorithm
    * IoU is a **measure of the overlap** between two bounding boxes
    * **IoU** Equation: $\frac{\text{size of intersection}}{\text{size of union}}$
        * Where normally an IoU >= 0.5 is "correct" (where IoU = 1.0 is a perfect bounding box)
        * A **higher value than 0.5** will make the **evaluation more stringent**
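
The IoU equation above can be sketched as a small function. This assumes boxes are given as corner coordinates `(x1, y1, x2, y2)` with `x1 < x2` and `y1 < y2`; that box format and the function name are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at 0 for boxes that do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0: a perfect match
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, well below the 0.5 threshold
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0: no overlap
```
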
* **Non-Max Suppression** - makes sure that your algorithm detects each object only once when there are multiple bounding boxes for a single object: it ranks the boxes by probability of detection, keeps the highest-probability bounding box, and suppresses the remaining (non-maximal) boxes that overlap it with an IoU of at least 0.5
    ![nm_suppress_issue](nm_suppress_issue.png)
    * Problem: How do we avoid the multiple detections of the same object?
    ![nm_suppress_multiple](nm_suppress_multiple.png)
    * e.g. If you have a 19x19 grid over an image, **multiple grid cells** may each **think** they contain the object's midpoint (**creating multiple bounding boxes over the same object**)
    ![suppress_box](suppress_box.png)
    * With three overlapping detections with probabilities (0.9, 0.7, 0.6), the algorithm will **select the 0.9 box** and **suppress the 0.7 and 0.6 bounding boxes** (which are non-maximal)
    * Steps of **Non-Max Suppression**:
        1. Get each output prediction from grids
        2. Discard all bounding boxes with `p_c <= 0.6`
        3. While there are any remaining boxes:
            * Pick the box with the **largest** `p_c` and output that as a prediction
            * Discard any remaining box with IoU >= 0.5 with the box output in the previous step
    * If you have a **multi-class** model (e.g. pedestrians, cars, and trees), then you should run **non-max suppression independently** for each of the output classes
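
The non-max suppression steps above can be sketched for a single class as follows. This is a minimal pure-Python illustration, assuming boxes in `(x1, y1, x2, y2)` format; the function names and example values are hypothetical, and the thresholds mirror the 0.6 and 0.5 used in the steps.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest score first."""
    # Step 2: discard low-probability boxes
    candidates = [i for i, s in enumerate(scores) if s > score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    # Step 3: repeatedly output the best box and drop boxes that overlap it
    while candidates:
        best = candidates.pop(0)
        keep.append(best)
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 1, 10, 11), (40, 40, 50, 50)]
scores = [0.9, 0.7, 0.8, 0.65, 0.3]
kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]
```

For a multi-class detector, one would call this once per class with only that class's boxes and scores.
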
* **Anchor Boxes** - a method to detect multiple different objects in specific grids by using different anchor box shapes, which outputs a combined anchor box vector
    ![anchor_box](anchor_box.png)
    * Problem: What if we want the grid cell to detect multiple objects?
    * Instead of only assigning an object to a **grid cell with the object's midpoint**, we now also assign it based on the **anchor box for the grid cell with the <u>highest IoU</u>** compared to the ground truth bounding box
        * e.g. A ground-truth bounding box that is vertical will have a higher IoU with an anchor box that is vertical and not horizontal
        * This results in an output `y` with dimension 3x3x(8x2) or 3x3x16 because it contains (grid cell, anchor box) in the vector instead of only 3x3x8 (without anchor boxes)
    ![anchor_box_example](anchor_box_example.png)
        * In the grid cell with two objects, each object will be classified by their similarities to the anchor box
        * In the grid cell with only the car, the output slot for the anchor box with the higher IoU with the car gets `p_c = 1`, and the slot for the other (pedestrian-shaped) anchor box gets `p_c = 0`
    * (-) When there are more objects in a single grid cell than anchor boxes, the algorithm does not handle it well. In practice you fall back on some default **tiebreaker**
    * (-) When two objects in the same grid cell match the same anchor box shape, the algorithm does not handle it well either. Again, you fall back on a **tiebreaker**
    * e.g. If cars are mostly wide objects and pedestrians are tall, skinny objects, then each anchor box's outputs can specialize in one of these shapes, which helps the algorithm differentiate between them
    * How do you choose the anchor box? 
        * Manually choose anchor boxes of variety of shapes 
        * Use the **K-means clustering** algorithm to **group the object shapes you typically get**, and then use the resulting clusters to select anchor boxes that **most stereotypically represent the majority of objects** you're trying to detect
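
The rule of assigning an object to the anchor box with the highest IoU can be sketched as below. Since the object and the anchors belong to the same grid cell, this sketch compares shapes only, treating both boxes as if they shared a center. That simplification, the helper names, and the two anchor shapes are assumptions for illustration.

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes that share the same center, given as (width, height)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def best_anchor(box_wh, anchors):
    """Index of the anchor whose shape has the highest IoU with the box."""
    return max(range(len(anchors)), key=lambda i: shape_iou(box_wh, anchors[i]))


anchors = [(0.5, 1.5), (1.5, 0.5)]       # hypothetical: tall/narrow and wide/flat
print(best_anchor((0.4, 1.2), anchors))  # 0: a pedestrian-like box
print(best_anchor((1.8, 0.7), anchors))  # 1: a car-like box
```

The returned index decides which slot of the grid cell's output vector (e.g. which 8 of the 16 values) encodes the object.
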
* **The Complete YOLO Algorithm** (**Real-time object detection**)
    * **Training YOLO** - divides the image into grid cells and classifies each of the cells using anchor boxes
    ![training_yolo](training_yolo.png)
        * In practice you might have an output of 19x19x40 (19x19 grid, 5 anchor boxes, and 8 values per anchor box for 3 classes)
    * **Prediction with YOLO** - using the trained model to predict on specific parts of the image with specified bounding box coordinates
    ![predict_yolo](predict_yolo.png)
    * **Outputting Non-Max Suppression Outputs** - for each grid cell, get 2 (in this case) predicted bounding boxes (one per anchor box)
        ![anchor_pred](anchor_pred.png)
        * Get **rid of the low probability predictions**:
        ![anchor_pred_low](anchor_pred_low.png)
        * For each class (e.g. pedestrian, car, motorcycle), use **non-max suppression** to generate final predictions
        ![anchor_pred_final](anchor_pred_final.png)
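
The output size and the step that removes low-probability predictions (before non-max suppression) can be sketched as follows. This is a rough illustration: scoring each box by `p_c` times its best class probability is a common choice, while the 0.6 threshold, the tuple layout, and the function names are assumptions.

```python
def output_depth(num_anchors, num_classes):
    """Values per grid cell: each anchor predicts p_c, 4 box values, and class scores."""
    return num_anchors * (5 + num_classes)


def filter_predictions(predictions, threshold=0.6):
    """Keep (score, class_index, box) for predictions that pass the threshold.

    Each prediction is a tuple (p_c, box, class_probs).
    """
    kept = []
    for p_c, box, class_probs in predictions:
        class_index = max(range(len(class_probs)), key=lambda c: class_probs[c])
        score = p_c * class_probs[class_index]
        if score >= threshold:
            kept.append((score, class_index, box))
    return kept


print(output_depth(num_anchors=2, num_classes=3))  # 16, as in the 3x3x16 example
print(output_depth(num_anchors=5, num_classes=3))  # 40, as in the 19x19x40 example

predictions = [
    (0.9, (0.5, 0.5, 1.2, 0.6), [0.1, 0.8, 0.1]),    # confident detection of class 1
    (0.3, (0.2, 0.7, 0.4, 1.1), [0.9, 0.05, 0.05]),  # low objectness, dropped
]
survivors = filter_predictions(predictions)
print(len(survivors))  # 1
```

The surviving boxes would then go through non-max suppression, run separately for each class.
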
* **Differences in YOLOv1, YOLO9000, YOLOv2, YOLOv3**
    * **YOLOv1**
        * Divides input image into `S`x`S` grid
        * Each grid cell only predicts `one` object, regardless of the number of boxes `B`
            * Predicts one set of conditional class probabilities (`C` values, one per class)
        * Each grid cell makes a fixed number of boundary box predictions (`B`), each with its own box confidence score
            * We normalize the bounding box width `w` and height `h` by the image width and height
            * `x` and `y` are offsets to the corresponding cell. Hence, `x`, `y`, `w` and `h` are all between 0 and 1
            ![yolo_boundary_box_pred](yolo_boundary_box_pred.jpeg)
        * (-) Will miss objects that are too close to each other in one grid cell
        * e.g. To evaluate PASCAL VOC, YOLO uses `7`×`7` grids (`S`×`S`), 2 boundary boxes (`B`) and 20 classes (`C`)
            * YOLO’s prediction has a shape of `(S, S, B×5 + C)` = `(7, 7, 2×5 + 20)` = `(7, 7, 30)`
            * YOLOv1 CovNet:
                ![yolov1_arch](yolov1_arch.jpeg)
            * It uses a CNN to reduce the spatial dimension to `7×7` with 1024 output channels at each location
            * YOLO performs a linear regression using two fully connected layers to make `7×7×2` boundary box predictions
            * To make a final prediction, we keep those with high box confidence scores (greater than 0.25) as our final predictions (the right picture)
                ![yolov1_pred](yolov1_pred.png)
        * **Class confidence score** - is computed for each prediction box
            * `class confidence score = box confidence score * conditional class probability`
            * It measures the confidence on both the classification and the localization (where an object is located)
                ![class_confid_score_math](class_confid_score_math.png)
        * YOLOv1 CNN Architecture:
            ![yolov1_network](yolov1_network.png)
            * YOLO has 24 convolutional layers followed by 2 fully connected layers (FC). 
            * Some convolution layers alternate with `1×1` reduction layers to reduce the depth of the feature maps.
            * The last convolution layer outputs a tensor with shape `(7, 7, 1024)`.
            * The tensor is then flattened. Using 2 fully connected layers as a form of linear regression, it outputs `7×7×30` values, which are then reshaped to `(7, 7, 30)`
                * i.e. 2 boundary box predictions per location.
            * A faster but less accurate version of YOLO, called **Fast YOLO**, uses only 9 convolutional layers with shallower feature maps.
        * **Loss Function** - To compute the loss for a **true positive**, we only want **one** of the predicted bounding boxes to be responsible for the object
            * YOLO predicts multiple bounding boxes per grid cell
            * Select the one with the **highest IoU (intersection over union)** with the ground truth
            * This strategy leads to specialization among the bounding box predictions
            * Each prediction gets better at predicting certain sizes and aspect ratios
            * YOLO uses the **sum of squared errors** (SSE) between the **predictions** and the **ground truth** to calculate loss. The loss function is composed of:
                * The **classification loss** - did it classify the object correctly?
                    * **If an object is detected**, the classification loss at each cell is the squared error of the class conditional probabilities for each class:
                        ![yolov1_classification_loss](yolov1_classification_loss.png)
                * The **localization loss** - errors between the predicted boundary box and the ground truth
                    * We only count the box responsible for detecting the object
                        ![yolov1_localization_loss](yolov1_localization_loss.png)
                    * We **do not** want to weight absolute errors in **large boxes** and **small boxes equally**.
                        * e.g. A 2-pixel error matters far more in a small box than in a large box, so penalizing both by the same amount is bad.
                    * To partially address this, YOLO predicts the **square root of the bounding box width and height** instead of the width and height.
                    * In addition, to put more emphasis on the boundary box accuracy, we **multiply the loss by $λ_{coord}$** (default: 5).
                * The **confidence loss** - the objectness of the box
                    * **If an object is detected in the box**, the confidence loss (measuring the objectness of the box) is:
                        ![yolov1_confidence_loss](yolov1_confidence_loss.png)
                    * If an object is **not detected** in the box, the confidence loss is:
                        ![yolov1_confidence_loss_not](yolov1_confidence_loss_not.png)
                    * (-) Most boxes do not contain any objects. 
                        * This causes a class imbalance problem 
                        * e.g. We train the model to detect background more frequently than detecting objects
                        * To remedy this, we weight this loss down by a factor $λ_{noobj}$ (default: 0.5).
                * **Final loss** - The final loss adds localization, confidence and classification losses together.
                    ![yolov1_final_loss](yolov1_final_loss.png)
        * Inference with **Non-maximal Suppression (NMS)** - YOLO applies non-maximal suppression to remove duplications with lower confidence
            * Problem: YOLO can make duplicate detections for the same object
            * Application Benefits: Non-maximal suppression adds 2-3% in mAP
            * e.g. One possible non-maximal suppression implementation:
                1. Sort the predictions by the confidence scores.
                2. Start from the top scores, ignore any current prediction if we find any previous predictions that have the same class and `IoU > 0.5` with the current prediction.
                3. Repeat step 2 until all predictions are checked.
        * Benefits of YOLOv1:
            * Fast. Good for real-time processing.
            * Predictions (object locations and classes) are made from one single network. Can be trained end-to-end to improve accuracy.
            * YOLO generalizes better. It outperforms other methods when generalizing from natural images to other domains like artwork.
            * Region proposal methods limit the classifier to the specific region. YOLO has access to the whole image when predicting boundaries. With the additional context, YOLO demonstrates **fewer false positives in background areas**.
            * YOLO detects one object per grid cell. It enforces spatial diversity in making predictions.
        * Cons of YOLOv1:
            * SSD is a strong competitor to YOLO and at one point demonstrated higher accuracy for real-time processing.
            * (-) Compared with region-based detectors, **YOLO has higher localization errors**, and its **recall** (a measure of how well it locates all objects) is **lower**
    * **YOLOv2** - is the second version of the YOLO with the objective of **improving the accuracy** significantly while **making it faster**.
        * Accuracy Improvements:
            * **Batch Normalization** - Add batch normalization in convolution layers. This removes the need for dropouts and pushes mAP up 2%.
            * **High-resolution classifier** - Pretrain a classifier CNN (e.g. one like VGG16), then use a convolution layer instead of FC layers at the end
                * The YOLO training consists of 2 phases:
                    * First, we train a classifier network like VGG16. 
                    * Then, we replace the fully connected layers with a convolution layer and retrain it end-to-end for the object detection.
                * Differences in YOLOv1 vs YOLOv2:
                    * YOLOv1 trains the classifier with `224 × 224` pictures followed by `448 × 448` pictures for the object detection. 
                    * YOLOv2 starts with `224 × 224` pictures for the classifier training, but then fine-tunes the classifier again with `448 × 448` pictures for far fewer epochs.
                        * This makes the detector training easier and moves mAP up by 4%.
            * **Convolutional with Anchor Boxes** - Use specific anchor box (or prior) shapes to make initial training more stable
                * In YOLOv1, the early training is susceptible to unstable gradients. 
                    * Initially, YOLO makes arbitrary guesses on the boundary boxes. 
                    * (-) These guesses may work well for some objects, but badly for others resulting in **steep gradient changes**. 
                    * (-) In early training, predictions are fighting with each other on what shapes to specialize on.
                    * e.g. In the real-life domain, the boundary boxes are not arbitrary. 
                        ![yolov1_boundary_box_issue](yolov1_boundary_box_issue.jpeg)
                        * Cars have very similar shapes and pedestrians have an approximate aspect ratio of 0.41.
                * Since we only need one guess to be right, the initial training will be more stable if we start with diverse guesses that are common for real-life objects. (Anchor Boxes)
                    ![yolov2_anchor_boxes](yolov2_anchor_boxes.jpeg)
                    * Instead of predicting 5 arbitrary boundary boxes, we predict **offsets to each of the anchor boxes**. 
                    * If we constrain the offset values, we can **maintain the diversity of the predictions** and have **each prediction focus on a specific shape**, so the initial training will be more stable.
        * Changes to YOLO CNN Architecture:
            ![yolov2_remove_fc_layers](yolov2_remove_fc_layers.jpeg)
            * Remove the fully connected layers responsible for predicting the boundary box.
            * We move the class prediction from the cell level to the **boundary box level**. 
                * Now, each prediction includes **4 parameters for the boundary box**, **1 box confidence score (objectness)** and **20 class probabilities**. 
                    * i.e. 5 boundary boxes with 25 parameters each: 125 parameters per grid cell.
                    * Same as YOLOv1, the objectness prediction still predicts the IOU of the ground truth and the proposed box.
            * To generate predictions with a shape of `7×7×125`, we replace the last convolution layer with **three** `3×3` convolutional layers each outputting 1024 output channels. 
                * Then, we apply a final `1×1` convolutional layer to convert the `7×7×1024` output into `7×7×125`. (See the section on DarkNet for the details.)
                ![yolov2_num_parameters](yolov2_num_parameters.jpeg)
            * Change the input image size from `448×448` to `416×416`. 
                * This creates an **odd number spatial dimension** (`7×7` v.s. `8×8` grid cell). 
                * The center of a picture is often occupied by a large object. 
                * (+) With an odd number grid cell, it is more certain on where the object belongs.
                    ![yolov2_oddvseven](yolov2_oddvseven.jpeg)
            * Remove one pooling layer so that the spatial output of the network becomes `13×13` (instead of `7×7`).
            * Anchor boxes decrease mAP slightly from `69.5` to `69.2` but the recall improves from `81%` to `88%`.
                * i.e. Even though the accuracy decreases slightly, the chance of detecting all the ground truth objects increases.
        * **Dimension Clustering**
    * **YOLO9000**
    * **YOLOv3**
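
The YOLOv1 localization loss described earlier in this section (squared error on the box center, squared error on the square roots of width and height, scaled by $λ_{coord}$) can be sketched for a single responsible box. This is a simplified illustration assuming boxes given as `(x, y, w, h)` with values between 0 and 1; the function name and example numbers are hypothetical.

```python
import math


def localization_loss(pred, truth, lambda_coord=5.0):
    """Localization loss for one responsible box; boxes are (x, y, w, h)."""
    px, py, pw, ph = pred
    tx, ty, tw, th = truth
    center = (px - tx) ** 2 + (py - ty) ** 2
    # Square roots make the same absolute error cost more for small boxes
    size = (math.sqrt(pw) - math.sqrt(tw)) ** 2 + (math.sqrt(ph) - math.sqrt(th)) ** 2
    return lambda_coord * (center + size)


# The same absolute width error (0.05) on a large box and on a small box
large = localization_loss((0.5, 0.5, 0.85, 0.8), (0.5, 0.5, 0.80, 0.8))
small = localization_loss((0.5, 0.5, 0.10, 0.1), (0.5, 0.5, 0.05, 0.1))
print(small > large)  # True
```

With plain squared error on `w` and `h`, both cases would be penalized identically; the square root is what makes the small-box error weigh more.
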
* **Region Proposals using R-CNN** - **Regions with Convolutional Networks (R-CNN)**, developed by Ross Girshick, Jeff Donahue, et al. (2013), is **more selective about the regions** on which your convnet classifier is **run** (excluding areas that do not show any relevant information): it proposes regions using a **segmentation algorithm** and also outputs its own bounding box
    * R-CNN Steps:
        1. Propose regions using **segmentation algorithm**
        2. Classify proposed regions one at a time
        3. Output label and **bounding box** (**allows for a more accurate bounding box instead of choosing the one that is given through segmentation algorithm**)
    ![r_cnn](r_cnn.png)
    * **Segmentation Algorithm** - transforms the image into **specific colored sections** to run classification on
        * e.g. in the example above, there are ~2000 colored sections to run classification on
    * Prevents having to run the algorithm on every single sliding window
    * Allows you to run classification on specific shapes and sizes (e.g. fat cars, tall/skinny pedestrians, etc.)
    * (-) This algorithm is unfortunately **very slow to run**
    * **Improvements to R-CNN** algorithm
        * **Fast R-CNN** - developed by Ross Girshick (2015) as a faster implementation of R-CNN that uses a convolutional implementation to classify all the proposed regions at the same time
            * Steps:
                1. Propose regions using **segmentation algorithm**
                2. Use convolution implementation of sliding windows to classify all the proposed regions
            * (-) The clustering step used to propose regions is still slow to run
        * **Faster R-CNN** - developed by Shaoqing Ren and Kaiming He (2016) to use convolutional network to propose regions instead of the traditional segmentation algorithms 
    * Although **Faster R-CNN** is much faster, it is still not as fast as the **YOLO algorithm**
    * What are the issues still facing **R-CNN**?
        * (-) Training is unwieldy and takes too long
        * (-) Training happens in multiple phases (e.g. training region proposal vs. classifier)
        * (-) Network is too slow at **inference time** (e.g. when dealing with non-training data)
* **Single Shot MultiBox Detector (SSD)** - developed by Wei Liu, Christian Szegedy, et al. (2016) to predict category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps. It uses a single network with no region proposals: it starts from default bounding boxes and adjusts them as part of the prediction
    ![ssd_arch](ssd_arch.png)
    * Breakdown of SSD:
        * **Single Shot** - means that the tasks of object localization and classification are done in a **single forward pass** of the network
        * **MultiBox** - technique for bounding box regression
        * **Detector** - the network is an object detector that also classifies those detected objects
    * **SSD Architecture** - takes a lot of parts from the VGG-16 architecture, but discards the fully connected layers (basically uses the base network of VGG-16)
        * (From Conv1 to Conv5) Uses the VGG-16 base network because of its strong performance on high quality image classification and its popularity in problems where **transfer learning** helps improve results
        * (From Conv6 and onward) Instead of the VGG-16 fully connected layer, a set of **auxiliary convolutional layers** were added, thus allowing the **extraction of features at <u>multiple</u> scales** and progressively **decreasing the size** of the input to each subsequent layer
        * Each of the last few layers progressively shrinks the bounding box predictions (making it more accurate) 
        * In contrast, **YOLO** uses an intermediate fully connected layer (not a convolutional filter)
    * **MultiBox** - the bounding box regression technique of SSD
    ![multibox](multibox.png)
        * A method for fast **class-agnostic** bounding box coordinate proposals
        * **Inception-style** convolutional network is used via 1x1 convolutions which reduce the number of dimensions while allowing width and height to remain the same
        * MultiBox's loss function combines two critical components:
            * **Confidence Loss** - measures how confident the network is of the **objectness** of the computed bounding box using **categorical cross-entropy**
            * **Location Loss** - measures how **far away** the network's predicted bounding boxes are from the ground truth ones from the training set via **L2-Norm**
            * Simplified Mathematical Expression for Loss: `multibox_loss = confidence_loss + alpha * location_loss`
            * The `alpha` term helps balance the contribution of location loss
    * **MultiBox Priors and IoU** - starts with **priors** as predictions and attempts to regress closer to the ground truth bounding boxes
        * **Priors** (or **Anchors** in Faster R-CNN) - pre-computed, fixed-size bounding boxes that closely match the distribution of the original ground truth boxes; they are selected so that their **Intersection over Union ratio (IoU)**, sometimes referred to as the **Jaccard Index**, with the ground truth is greater than 0.5
        ![iou_example](iou_example.png)
        * An IoU of 0.5 is still not good enough, but it is a better starting point for the regression than random coordinates
        * The resulting architecture contains 11 priors per feature map cell (for the 8x8, 6x6, 4x4, 3x3, and 2x2 feature maps) and only one on the 1x1 feature map, for a total of (64 + 36 + 16 + 9 + 4) x 11 + 1 = 1420 priors per image, to detect objects of various sizes
            * This strategy enables **robust coverage of input images at <u>multiple scales</u>** 
        * In contrast, **OverFeat** and **YOLO** use a single-scale feature map
    * **SSD Improvements**
        * **Fixed Priors** - unlike MultiBox, every feature map cell is associated with a set of **default** bounding boxes of different dimensions and aspect ratios
        ![ssd_feature](ssd_feature.png)
            * These priors are **manually, but carefully chosen** (whereas in MultiBox, they were chosen because their IoU score with respect to the ground truth was over 0.5)
            * This should allow SSD to generalize for any type of input, **without requiring a pre-training phase** for **prior generation**
        * **Location Loss** - SSD uses **smooth L1-Norm** to calculate the location loss
            * While not as precise as L2-Norm, it is still highly effective and gives SSD more room for error as it does not try to be 100% zero error in its bounding box prediction (e.g. we would hardly notice a few pixels difference)
        * **Classification** - MultiBox does not perform object classification, whereas **SSD does**
            * Thus, for each predicted bounding box, a set of `c` class predictions are computed, for every possible class in the dataset
    * **SSD** vs. **YOLO**
    ![ssd_vs_yolo](ssd_vs_yolo.png)
        * **SSD** is **faster** and significantly **more accurate** than YOLOv1, and about as accurate as slower techniques like Faster R-CNN
    * Reached new record performance and precision for object detection:
        * Accuracy: 74% mAP (mean average precision)
        * Speed: 59 frames per second
        * Dataset: **Pascal VOC** (PASCAL stands for Pattern Analysis, Statistical Modeling and Computational Learning; VOC stands for Visual Object Classes) and **COCO** (Common Objects in Context)
            * **COCO** dataset has 330K images with more than 200K labeled images
    * **Training & Running SSD** - will need training and test datasets with ground truth bounding boxes and assigned class labels (only one per bounding box)
    ![training_ssd](training_ssd.png)
        * **Pascal VOC** and **COCO** datasets are great starting points
        * **Default Bounding Boxes** - it is recommended to configure a varied set of default bounding boxes, of different scales and aspect ratios to ensure most objects could be captured (e.g. SSD paper has around 6 bounding boxes per feature map cell)
        * **Feature Maps** - the feature maps (results from convolutional blocks) are a representation of the dominant features of the image at different scales. Thus, running MultiBox on **multiple feature maps increases the likelihood** that any object is eventually detected, localized and appropriately classified
        * **Hard Negative Mining** - during training, most of the bounding boxes will have **low IoU** and therefore be interpreted as **negative** training examples, resulting in a disproportionate amount of negative examples in our training set
        ![hard_negative](hard_negative.png)
            * Thus, instead of using all negative predictions, it is advised to **keep a ratio of negative to positive examples of around 3:1**
            * The reason to **keep negative samples** is because the network also needs to **learn and be explicitly told what constitutes an incorrect detection**
        * **Data Augmentation** - it is crucial, like other deep learning applications, to teach the network to become more robust to various object sizes in the input
            * In the end, they generate additional training examples with patches of original image at different IoU ratios (e.g. 0.1, 0.3, 0.5, etc.) and random patches as well
            * Moreover, each image is also randomly horizontally flipped with a probability of 0.5, thereby making sure potential objects appear on left and right with similar likelihood
        * **Non-Maximum Suppression (NMS)** - given the large number of boxes generated during a forward pass of SSD at inference time, it is essential to prune most of the bounding boxes
        ![nms](nms.png)
            * Boxes with a confidence score below a threshold `ct` (e.g. 0.01) are discarded, boxes whose IoU with a higher-scoring box exceeds `lt` (e.g. 0.45) are suppressed, and only the top `N` predictions are kept
            * This ensures only the most likely predictions are retained by the network, while the noisier ones are removed
    * Other Notes on SSD:
        * More default boxes result in more accurate detection, but at a greater cost in speed
        * Having MultiBox on multiple layers results in better detection due to detector running on features at **multiple resolutions**
        * 80% of the time is spent on the VGG-16 base network, which means that **with a base network that runs faster and is equally accurate**, SSD performance could be much better
        * SSD confuses objects with similar categories (e.g. animals) due to locations that are shared for multiple classes
        * **SSD-500** - the highest-resolution variant of SSD (using 512x512 input images) achieves the best mAP on Pascal VOC2007 at 76.8%, but at the expense of speed: its frame rate drops to 22 fps
            * **SSD-300** is thus a much better trade off with 74.3 mAP at 59 fps
        * SSD produces **worse performance on smaller objects**, as they may **not appear across all feature maps**
    * Playing with SSD:
        * Tensorflow implementation by Paul Balanca
        * Original Caffe code from authors
    * Beyond SSD:
        * **YOLO9000: Better, Faster, Stronger**
        * **Mask R-CNN** - very accurate instance segmentation at pixel level
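
To make the SSD loss discussed earlier in this section more concrete, here is a minimal sketch of a smooth L1 location loss combined with a confidence loss via `alpha`. The function names, the use of raw coordinate offsets, and the transition point at 1 are illustrative assumptions; it is not the exact formulation of any particular SSD implementation.

```python
def smooth_l1(x):
    """Smooth L1 penalty: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def location_loss(pred, truth):
    """Sum of smooth L1 penalties over the box coordinates."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, truth))


def multibox_loss(confidence_loss, loc_loss, alpha=1.0):
    """multibox_loss = confidence_loss + alpha * location_loss"""
    return confidence_loss + alpha * loc_loss


print(smooth_l1(0.5))  # 0.125: small errors are penalized quadratically
print(smooth_l1(3.0))  # 2.5: large errors grow linearly (a squared error would give 9.0)

loc = location_loss((0.5, 0.0, 3.0, 0.0), (0.0, 0.0, 0.0, 0.0))
print(multibox_loss(confidence_loss=1.0, loc_loss=loc))  # 3.625
```

The linear growth for large errors is what makes this loss more forgiving of outliers than an L2 penalty.
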

8) Must-know techniques for building state-of-the-art deep learning models
* **Batch Normalization** - introduced by Ioffe and Szegedy (2015); a technique that adaptively normalizes data even as the mean and variance change over time during training
    * **Normalization** - a broad category of methods that seek to make different samples seen by a ML model more similar to each other, which helps the model learn and generalize well to new data
        * The most common form of data normalization: centering the data on 0 by subtracting the mean from the data, and giving the data a unit standard deviation by dividing the data by its standard deviation
        * In effect, this makes the assumption that the data follows a normal (or Gaussian) distribution and makes sure this distribution is centered and scaled to unit variance: `normalized_data = (data - np.mean(data, axis=...)) / np.std(data, axis=...)`
    * Previous examples normalized data before feeding it into models, but data normalization should be a concern **after every transformation operated by the network**: even if the data entering a `Dense` or `Conv2D` layer has a 0 mean and unit variance, there's no reason to expect a priori that this will be the case for **the data coming out**
    * **Batch normalization** is a type of layer (`BatchNormalization` in Keras) that can **adaptively normalize data** even as the mean and variance change over time during training
        * Works by internally maintaining an **exponential moving average of the batch-wise mean and variance** of the data seen during training
        * The main effect of batch normalization is that it **helps with gradient propagation (much like residual connections)** and thus allows for **deeper networks**
        * e.g. `BatchNormalization` is used liberally in many of the advanced convnet architectures that come packaged with Keras (e.g. ResNet50, Inception V3, and Xception)
    * The `BatchNormalization` layer is typically used after a convolution or densely connected layer
        * The `BatchNormalization` layer takes an `axis` argument, which specifies the feature axis that should be normalized (defaults to -1, the last axis in the input tensor)
    * Improvements to Batch Normalization:
        * **Batch Renormalization** - introduced by Ioffe (2017); it offers improvements over regular batch normalization at no apparent cost
        * **Self-Normalizing Neural Network** - introduced by Klambauer et al. (2017) to keep data normalized after going through any `Dense` layer by using a specific activation function (`selu`) and a specific initializer (`lecun_normal`)
            * Unfortunately, this approach is limited to densely connected networks for now
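
The normalization formula and the moving-average idea above can be sketched in plain NumPy. This is only an illustration under stated assumptions: `BatchNormSketch` is a hypothetical helper, not the Keras `BatchNormalization` layer, and it omits the learnable scale and offset that the real layer also trains. The `momentum` and `epsilon` values are arbitrary choices for the example.

```python
import numpy as np


class BatchNormSketch:
    """Toy batch normalization over the last (feature) axis.

    Hypothetical helper for illustration only; not the Keras implementation.
    """

    def __init__(self, num_features, momentum=0.99, epsilon=1e-3):
        self.momentum = momentum
        self.epsilon = epsilon
        # Moving statistics, used instead of batch statistics at inference time.
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)

    def __call__(self, x, training):
        if training:
            # Normalize with the statistics of the current batch.
            axes = tuple(range(x.ndim - 1))
            mean = x.mean(axis=axes)
            var = x.var(axis=axes)
            # Update the exponential moving averages of mean and variance.
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            mean, var = self.moving_mean, self.moving_var
        return (x - mean) / np.sqrt(var + self.epsilon)


rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=3.0, size=(256, 4))

# Plain feature-wise normalization, as in the formula above.
normalized_data = (data - np.mean(data, axis=0)) / np.std(data, axis=0)

# Simulated training: each batch updates the moving statistics.
bn = BatchNormSketch(num_features=4, momentum=0.9)
for _ in range(200):
    batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
    train_out = bn(batch, training=True)

# At inference time the stored moving statistics are used instead.
inference_out = bn(data, training=False)
```

During training each batch is normalized with its own statistics, while the moving averages are what make the layer usable on a single sample at inference time.
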
* **Depthwise Separable Convolution** - a layer that can be used as a drop-in replacement for `Conv2D` and that makes your model lighter (fewer trainable weight parameters), faster (fewer floating-point operations), and often a few percentage points better on its task
    ![depthwise_conv](depthwise_conv.png)
    * The depthwise separable convolution layer (`SeparableConv2D`) performs a **spatial convolution on each channel** of its input, **independently**, before mixing output channels via a pointwise convolution (a 1x1 convolution)
    * This is equivalent to separating the learning of spatial features and the learning of the channel-wise features, which makes a lot of sense if you assume that **spatial locations in the input are highly correlated**, but **different channels are fairly independent**
    * It requires **significantly fewer parameters** and involves **fewer computations**, thus resulting in smaller, speedier models
    * Because it's a more representationally efficient way to perform convolution, it tends to learn better representations using less data, resulting in better-performing models
    * These advantages become especially important when you're training small models **from scratch** on **limited data**
    * e.g. `SeparableConv2D` is the basis of the Xception architecture (a larger-scale model built from depthwise separable convolutions)
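
To make the "significantly fewer parameters" claim concrete, here is a small parameter-count sketch. The helper names `conv2d_params` and `separable_conv2d_params` are hypothetical, and the calculation assumes square kernels, one bias per output channel, and a depth multiplier of 1 (one depthwise filter per input channel).

```python
def conv2d_params(kernel_size, in_channels, out_channels, use_bias=True):
    """Trainable parameters of a regular 2D convolution."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if use_bias else 0)


def separable_conv2d_params(kernel_size, in_channels, out_channels, use_bias=True):
    """Trainable parameters of a depthwise separable 2D convolution.

    Assumes a depth multiplier of 1.
    """
    # One spatial kernel per input channel (depthwise step).
    depthwise = kernel_size * kernel_size * in_channels
    # A 1x1 convolution that mixes the channels (pointwise step).
    pointwise = in_channels * out_channels
    return depthwise + pointwise + (out_channels if use_bias else 0)


# Example: 3x3 kernels mapping 64 input channels to 128 output channels.
regular = conv2d_params(3, 64, 128)
separable = separable_conv2d_params(3, 64, 128)
print(regular, separable)  # 73856 8896
```

For this configuration the separable layer has roughly eight times fewer parameters, and the gap widens as the kernel size or the number of output channels grows.
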
* **Hyperparameter Optimization**
    * Hyperparameter Questions:
        * How many layers should you stack?
        * How many units or filters should go in each layer?
        * Should you use `relu` as activation, or a different function?
        * Should you use `BatchNormalization` after a given layer?
        * How much dropout should you use?
    * These architecture-level parameters are called **hyperparameters** to distinguish them from the parameters of a model, which are trained via backpropagation
    * Although you will build intuition over time, your initial decisions are almost always suboptimal
    * It is better to explore the space of possible decisions automatically, systematically, in a principled way
    * Search the architecture space and find the best-performing configurations empirically
    * The field of automatic hyperparameter optimization is about making this less human, more automated
    * The process of optimizing hyperparameters:
        1. Choose a set of hyperparameters (automatically)
        2. Build the corresponding model
        3. Fit it to your training data, and measure the final performance on the validation data
        4. Choose the next set of hyperparameters to try (automatically)
        5. Repeat
        6. Eventually, measure performance on your test data
    * The key to this process is the **algorithm that uses this history of validation performance**, given various sets of hyperparameters, to **choose the next set of hyperparameters to evaluate**. Many different techniques are possible:
        * Bayesian optimization
        * Genetic algorithms
        * Simple random search
    * Training the weights of a model is relatively easy:
        * Compute a loss function on a mini-batch of data
        * Use the backpropagation algorithm to move the weights in the right direction
    * Updating hyperparameters, on the other hand, is extremely challenging:
        * Computing the **feedback signal** (does this set of hyperparameters lead to a high-performing model on this task?) can be **extremely expensive**: it requires creating and training a new model from scratch on your dataset
        * The hyperparameter space is typically made of **discrete decisions** and thus **isn't continuous or differentiable**. Hence, you typically can't do gradient descent in hyperparameter space. Instead, you must rely on **gradient-free optimization techniques**, which naturally are **far less efficient than gradient descent**
    * Often, it turns out that **random search** (choosing hyperparameters to evaluate at random, repeatedly) is the **best solution, despite being the most naive one**
        * One tool that is reliably better than random search is **Hyperopt** - a Python library for hyperparameter optimization that internally uses trees of Parzen estimators to predict sets of hyperparameters that are likely to work well
        * Another library, **Hyperas**, integrates Hyperopt for use with Keras models
    * One important issue to keep in mind when doing automatic hyperparameter optimization at scale is validation-set overfitting
        * Because you're updating hyperparameters based on a **signal that is computed using your validation data**, you're effectively training them on the validation data, and thus they will **quickly overfit to the validation data** (always keep this in mind)
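
The optimization loop described above can be sketched with simple random search. Everything here is an assumption for illustration: `validation_score` is a hypothetical stand-in for "build the model, fit it, and measure validation performance" (a toy formula, not real training), and the search space in `sample_config` is made up.

```python
import random


def validation_score(config):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    # Toy surface that peaks at 3 layers, 64 units, and a dropout rate of 0.3.
    return (
        1.0
        - 0.05 * abs(config["num_layers"] - 3)
        - 0.002 * abs(config["units"] - 64)
        - 0.5 * abs(config["dropout"] - 0.3)
    )


def sample_config(rng):
    """Steps 1 and 4: choose a set of hyperparameters at random."""
    return {
        "num_layers": rng.randint(1, 6),
        "units": rng.choice([16, 32, 64, 128, 256]),
        "dropout": rng.uniform(0.0, 0.6),
    }


def random_search(num_trials, seed=0):
    """Steps 2, 3, and 5: evaluate each sampled configuration and keep the history."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_trials):
        config = sample_config(rng)
        history.append((validation_score(config), config))
    # Pick the configuration with the best validation score.
    best_score, best_config = max(history, key=lambda item: item[0])
    return best_score, best_config, history


best_score, best_config, history = random_search(num_trials=50)
print(best_score, best_config)
```

In a real setting each call to `validation_score` would train a model from scratch, which is why the feedback signal is so expensive. The winning configuration should still be checked once on held-out test data (step 6), since it was selected using the validation set.
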
* **Model Ensembling** - consists of pooling together the predictions of a set of different models, to produce better predictions
    * If you look at ML competitions on Kaggle, you'll see that the winners use **very large ensembles of models** that inevitably beat any single model, no matter how good
    * Ensembling relies on the assumption that different good models trained independently are likely to be good for **different reasons**: each model looks at slightly different aspects of the data to make its predictions, getting part of the "truth" but not all of it
    * These are essentially ML models trying to understand the training data, each from its own perspective, using its **own assumptions** provided by the **unique architecture** of the model and the **unique random weight initialization**
        * Each of them gets part of the truth of the data, but not the whole truth
        * By pooling their perspectives together, you can get a far more accurate description of the data
    * The easiest way to pool the predictions of a set of classifiers (to **ensemble the classifiers**) is to average their predictions at inference time
        * This will work only if the classifiers are more or less equally good
        * (-) If one of them is significantly worse than the others, the final predictions may not be as good as the best classifier of the group
    * A smarter way to ensemble classifiers is to do a **weighted average**, where the weights are learned on the validation data
        * Typically, the better classifiers are **given a higher weight**
        * Worse classifiers are given a lower weight
        * To **search** for a good set of **ensembling weights**, you can use **random search** or a simple optimization algorithm such as **Nelder-Mead**
        * There are many possible variants:
            * Average of an exponential of the predictions
            * A simple weighted average with weights optimized on the validation data provides a very strong baseline
        * The key to making ensembling work is the **diversity** of the set of classifiers (Diversity is strength)
            * If all of your models are biased in the same way, then your ensemble will retain this same bias
            * If your models are **biased in different ways**, the biases will cancel each other out, and the ensemble will be **more robust and more accurate**
        * Ensemble models that are **as good as possible** while being **as different as possible**
            * Use very different architectures or even different brands of ML approaches
        * One thing that is largely **not** worth doing is ensembling the same network trained several times independently, from different random initializations
            * If the only difference between your models is their random initialization and the order in which they were exposed to the training data, then your ensemble will be **low-diversity** and will provide only a tiny improvement over any single model
        * One combination that works well in practice, although it doesn't generalize to every problem domain, is an ensemble of tree-based methods (e.g. random forests or gradient-boosted trees) and deep neural networks
            * In 2014, an ensemble of various tree models and deep neural networks earned a 4th place finish in the Higgs Boson decay detection challenge on Kaggle
            * Remarkably, one of the models in the ensemble originated from a different method than the others (it was a regularized greedy forest). It was significantly worse than the others, but it helped the overall ensemble by a large factor because it was so different from every other model
            * In recent times, one style of basic ensemble that has been very successful in practice is the **wide and deep** category of models: blending a deep neural network with a large linear model. The joint training of a family of diverse models is yet another option to achieve model ensembling
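
Plain averaging and a weighted average with weights found by random search can be sketched as follows. The data is synthetic and purely hypothetical: three "models" are simulated by adding different amounts of noise to binary validation labels, and `noisy_predictions`, `accuracy`, and `ensemble` are illustrative helpers rather than library functions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical validation labels for a binary classification task.
y_val = rng.integers(0, 2, size=500)


def noisy_predictions(labels, noise, rng):
    """Simulate a classifier's predicted probabilities (more noise = worse model)."""
    return np.clip(labels + rng.normal(0.0, noise, size=labels.shape), 0.0, 1.0)


# Three simulated classifiers of decreasing quality, shape (3, 500).
preds = np.stack([noisy_predictions(y_val, n, rng) for n in (0.4, 0.5, 0.9)])


def accuracy(probabilities, labels):
    return float(np.mean((probabilities > 0.5) == labels))


def ensemble(preds, weights):
    """Weighted average of the per-model predictions (weights are normalized)."""
    weights = np.asarray(weights, dtype=float)
    return np.tensordot(weights / weights.sum(), preds, axes=1)


# Baseline: a plain average gives every classifier the same weight.
uniform_weights = np.ones(len(preds))
uniform_acc = accuracy(ensemble(preds, uniform_weights), y_val)

# Random search over weights, scored on the validation data.
best_weights, best_acc = uniform_weights, uniform_acc
for _ in range(200):
    candidate = rng.dirichlet(np.ones(len(preds)))
    acc = accuracy(ensemble(preds, candidate), y_val)
    if acc > best_acc:
        best_weights, best_acc = candidate, acc

print(uniform_acc, best_acc, best_weights)
```

Because the weights are tuned on the validation data, the resulting ensemble should still be evaluated on separate test data before its accuracy is trusted.
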