Convolutional Neural Networks (ConvNet/CNN)

* Objectives:
    * Understand the fundamental differences between image data and other kinds of data
    * Be aware of the tools and pipeline for working with images
    * Understand general computer vision techniques for working/transforming images
    * Be able to explain what a convolution is, and how it works
    * Understand the basic structure of a convolutional neural network
    * Comprehend the three basic ideas behind convolutional networks: Local receptive field, shared weights, and pooling
    * Be aware of general strategies for building convolutional neural networks

1) **Image Processing** - remove unnecessary details to allow for better generalization of images to image classes
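* equation: $x_i \rightarrow y_i: \arg\min_y \frac{1}{2}\sum_{i=1}^{n}(x_i-y_i)^2+\lambda\sum_{i=1}^{n-1}|y_{i+1}-y_i|$, where $x$ is the original signal, $y$ the smoothed estimate, and $n$ the number of samples
    * **Fidelity** - $\frac{1}{2}\sum_{i=1}^{n}(x_i-y_i)^2$ keeps $y$ close to $x$
    * **Variation** - $\lambda\sum_{i=1}^{n-1}|y_{i+1}-y_i|$ penalizes jumps between neighboring values; $\lambda$ sets the trade-off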
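
To make the two terms concrete, here is a minimal sketch that evaluates them for a 1D signal. The function names, the sample signals, and the choice `lam=1.0` are illustrative assumptions, not part of the original notes; the code only scores candidate reconstructions and does not perform the minimization itself.

```python
def fidelity(x, y):
    """0.5 * sum of squared differences between the original x and the estimate y."""
    return 0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def variation(y, lam):
    """lam * sum of absolute differences between neighbouring values of y."""
    return lam * sum(abs(b - a) for a, b in zip(y, y[1:]))


def objective(x, y, lam):
    """The quantity that is minimized over y."""
    return fidelity(x, y) + variation(y, lam)


# Hypothetical noisy step signal and two candidate reconstructions.
x = [0.0, 0.2, -0.1, 1.1, 0.9, 1.0]
y_copy = list(x)                          # perfect fidelity, keeps every wiggle
y_step = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]   # clean step, slightly less faithful

cost_copy = objective(x, y_copy, lam=1.0)
cost_step = objective(x, y_step, lam=1.0)
```

With these assumed values the clean step scores lower than the exact copy, because the variation penalty saved outweighs the small loss of fidelity.
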
* What is the difficulty of image processing?
    * Images come in many different sizes
    * Viewing conditions (lighting, angle, distance) vary without limit
    * Objects are surrounded by other objects
    * Computers have a hard time understanding the context of an image (e.g. Barack Obama secretly stepping on his staff member's scale as a joke)
* Why is it important to understand how to process images? A growing share of data comes as audio, images, and video, all of which have potential for modeling
* How do we make object recognition possible?
    * Compress the data
    * Keep the search simple
    * Method of segmenting potential objects
* Python libraries for image processing:
    * **Scikit-image (skimage)**
    * **OpenCV (based on C++)**
        * be careful with package dependencies
    * Python Imaging Library
    * Pillow
* Image Pipeline: Read, Resize, and Transformations
    * Reading Images (Image types):
        * Colored images shape: (height, width, 3)
            * **Feature Maps** - convolutions operate over 3D tensors with two spatial axes and one depth axis
                * spatial axes - `(height, width)`
                * depth axis - `channels`
                * combined - `(height, width, channels)`
        * Greyscale images shape: (height, width)
        * Image Tensor example: (RGB)
        ```python
        array([
            #  R  G  B     R  G  B 
            [[108,50,13],[111,55,18]],
            #  R  G  B     R  G  B
            [[115,61,23],[130,129,127]]
        ])
        ```
        * What is the shape of this array? (2, 2, 3)
        * What if the same array is greyscaled? (2, 2)
    * Resizing Images:
        * Making the image a specified shape without cropping
            * **Downsampling** - reducing the size of the image when image is too large for processing
            ![downsampling](downsampling.png)
            * **Upsampling** - purposely increasing the size of image when image is too small for processing
            ![upsampling](upsampling.png)
            * **Interpolation** - resize or distort your image from one pixel grid to another
            ![interpolation](http://northstar-www.dartmouth.edu/doc/idl/html_6.2/images/Interpolation_Methods-14.jpg)
    * Transforming Images - converting an image from one domain to another
        * **Greyscale** - removing color from image
        * **Denoise** - removing unnecessary details of an image allowing for better generalization of the class of image
            * **Gaussian Kernel** - a smoothing kernel whose weights follow a Gaussian probability density function; its width is set by the standard deviation $s$, and the square of it, $s^2$, is the variance
            ![gaussian_kernel](gaussian_kernel.png)
    * Before convolutional neural nets, image analysis (or object recognition) focused on examining pixels (or color vectors)
        * **K-means of RGB pixels** - segment colors in an automated fashion using k-means clustering
        ![kmeans_of_pixel_colors](https://www.mathworks.com/matlabcentral/answers/uploaded_files/9604/sample.jpg)
        * **Raw Vector based methods** - ascertain features in images by looking at intensity gradients
        ![raw_vector_method](raw_vector_method.png)
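
The read, greyscale, and downsample steps of the pipeline can be sketched without any imaging library. The version below stores the 2x2 RGB example as nested lists; the helper names, the plain channel average used for greyscale, and the "keep every n-th pixel" downsampling are simplifying assumptions (real libraries typically use weighted channel sums and interpolation).

```python
# The 2x2 RGB example from above: image[row][col] = [R, G, B].
image = [
    [[108, 50, 13], [111, 55, 18]],
    [[115, 61, 23], [130, 129, 127]],
]


def shape(img):
    """Return (rows, cols) for a greyscale image or (rows, cols, channels) for a colour one."""
    dims = (len(img), len(img[0]))
    if isinstance(img[0][0], list):
        dims += (len(img[0][0]),)
    return dims


def to_greyscale(img):
    """Collapse the channel axis by averaging R, G and B (assumed, unweighted)."""
    return [[sum(pixel) / len(pixel) for pixel in row] for row in img]


def downsample(img, factor):
    """Keep every `factor`-th pixel along both spatial axes (no interpolation)."""
    return [row[::factor] for row in img[::factor]]


grey = to_greyscale(image)    # shape (2, 2): the channel axis is gone
small = downsample(grey, 2)   # shape (1, 1): only the top-left pixel survives
```
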

2) Image Featurization in Convolutional Neural Nets
* **Convolution operation**
    * The fundamental difference between a densely connected layer and a convolution layer:
        * `Dense` layers - learn **global** patterns in their input feature space (e.g. patterns involving all pixels in an MNIST digit)
        * Convolutional layers - learn **local** patterns (e.g. for images, patterns found in small 2D windows of the inputs)
        ![convolution_operation](convolution_operation.png)
    * This key characteristic gives convnets two interesting properties:
        * **The patterns learned are translation invariant** - after learning a certain pattern in lower-right corner of a picture, a convnet can recognize it anywhere (e.g. in the upper-left corner)
            * A densely connected network would have to learn the pattern anew if it appeared at a new location
            * This suits image processing because the **visual world is fundamentally translation invariant**
            * Convnets therefore need fewer training samples to learn representations that have generalization power
        * **Can learn spatial hierarchies of patterns** - a first convolution layer will learn small local patterns such as edges, a second convolution layer will learn larger patterns made of the features of the first layers, and so on
            * Allows convnets to efficiently learn increasingly complex and abstract visual concepts because **visual world is fundamentally spatially hierarchical**
            ![spatial_hierarchies](spatial_hierarchies.png)
* **Filters (Kernels)** - encodes specific aspects of the input data
    * The depth of the output feature map can be arbitrary because the output depth is a parameter of the layer; the different channels in that depth axis no longer stand for specific colors, but rather for **filters**
    * e.g. a single filter could encode the concept "presence of a face in the input"
    * e.g. in mnist example, the first convolution layer takes a feature map of size `(28,28,1)` and outputs a feature map of size `(26,26,32)`
        * It computes 32 filters over its input
        * Each of these 32 output channels contains a `26 x 26` grid of values, which is a **response map** of the filter over the input, indicating the response of that filter pattern at different locations in the input
        * Every dimension in the depth axis is a feature (or filter), called **feature map**, and the 2D tensor `output[:,:,n]` is the 2D spatial **map** of the response of this filter over the input
        ![response_map](response_map.png)
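
The `(28,28,1)` to `(26,26,32)` shape change can be reproduced with a small library-free sketch. The function name `conv2d_valid`, the single-bright-pixel input, and the constant-valued (untrained) kernels are assumptions chosen only to make the shapes and response maps easy to inspect.

```python
def conv2d_valid(image, kernels):
    """Slide each square kernel over a single-channel image (no padding, stride 1).

    Returns output[i][j][n]: the response of kernel n at location (i, j).
    """
    k = len(kernels[0])
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            responses = []
            for kernel in kernels:
                total = 0.0
                for di in range(k):
                    for dj in range(k):
                        total += image[i + di][j + dj] * kernel[di][dj]
                responses.append(total)
            row.append(responses)
        output.append(row)
    return output


# Hypothetical input: a blank 28x28 image with a single bright pixel.
image = [[0.0] * 28 for _ in range(28)]
image[10][10] = 1.0

# 32 arbitrary 3x3 kernels; kernel n is filled with the constant n + 1.
kernels = [[[float(n + 1)] * 3 for _ in range(3)] for n in range(32)]

output = conv2d_valid(image, kernels)
output_shape = (len(output), len(output[0]), len(output[0][0]))

# Response map of filter 0: the 2D slice that the notes write as output[:, :, 0].
response_map_0 = [[cell[0] for cell in row] for row in output]
```

Each response map is non-zero only at the nine window positions that cover the bright pixel, which is what "response of the filter at different locations" means in practice.
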
* **Convolutions** - in image processing, a kernel, convolution matrix, or mask is a small matrix useful for blurring, sharpening, embossing, edge-detection, and more. This is accomplished by means of **convolution** between kernel and an image
    * Applying a kernel over an image to get a convolved feature:
    ![kernel_apply](kernel_apply.png)
    * Move the kernel across the image and, at each position, take the dot product of the kernel with the patch of pixels beneath it
        * dot product: $a \cdot b=\sum_{i=1}^n a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$
        * $\left[\begin{array}{ccc}
            (1)(1) & (1)(0) & (1)(1) \\ 
            (0)(0) & (1)(1) & (1)(0) \\
            (0)(1) & (0)(0) & (1)(1)
            \end{array}\right] = \left[\begin{array}{ccc}
            1 & 0 & 1 \\ 
            0 & 1 & 0 \\
            0 & 0 & 1
            \end{array}\right] \Rightarrow
            1+0+1+0+1+0+0+0+1 = 4$ 
        * Convolved feature: $\left[\begin{array}{ccc}
            4 & - & - \\
            - & - & - \\
            - & - & - 
            \end{array}\right]$
        * **Sobel filter** - a pair of kernels that approximate the horizontal and vertical intensity gradients, so they respond strongly at vertical and horizontal edges
            * Vertical edge detector example:
                * $\left[\begin{array}{ccc}
                    -1 & 0 & +1 \\ 
                    -2 & 0 & +2 \\
                    -1 & 0 & +1
                    \end{array}\right]$
            * Horizontal edge detector example:
                * $\left[\begin{array}{ccc}
                    -1 & -2 & -1 \\ 
                    0 & 0 & 0 \\
                    +1 & +2 & +1
                    \end{array}\right]$
            * Useful for detecting outline of a door image and to recognize the essential features of a door
            ![door_detection_with_edge](door_detection_with_edge.png)
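        * **Canny Filter** - a multi-stage edge detector that smooths the image, computes intensity gradients, then thins and thresholds them to keep only strong, connected edges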
    * What if we could get a computer to build its own kernels, apply those to images, then interpret those results to perform object recognition? That is the idea behind Convolutional Neural Networks (CNNs)
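
The sliding dot product and the Sobel kernels described above can be sketched as follows. The 5x5 binary image is an assumption (only its top-left 3x3 patch appears in the worked example), as is the 4x4 step-edge image. Like most deep-learning libraries, the sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def convolve(image, kernel):
    """Valid (no padding), stride-1 sliding dot product of a 2D image with a 2D kernel."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(k) for dj in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Assumed 5x5 binary image; its top-left 3x3 patch matches the worked example.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
convolved = convolve(image, kernel)   # convolved[0][0] is the 4 computed by hand

# Assumed 4x4 image with a vertical edge: dark left half, bright right half.
edge_image = [[0, 0, 10, 10] for _ in range(4)]
sobel_vertical = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
sobel_horizontal = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

vertical_response = convolve(edge_image, sobel_vertical)      # strong everywhere
horizontal_response = convolve(edge_image, sobel_horizontal)  # all zeros
```

On the step-edge image the vertical detector fires and the horizontal detector stays silent, which is the behavior the two Sobel kernels are designed for.
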
* Key Parameters of Convolutions:
    * **Size of the patches extracted from the inputs** - typically 3x3 or 5x5
        * e.g. 3x3 is a common choice
    * **Depth of the output feature map** - the number of filters computed by the convolution
        * e.g. the MNIST example starts with a depth of 32 and ends with a depth of 64
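
These two parameters also determine how many weights a convolution layer learns. The sketch below counts them; the helper names are hypothetical, and the 32-unit `Dense` layer is an assumed comparison point rather than a layer from the notes.

```python
def conv2d_params(window_height, window_width, input_depth, output_depth):
    """One kernel of shape (window_height, window_width, input_depth) per filter, plus one bias each."""
    return window_height * window_width * input_depth * output_depth + output_depth


def dense_params(input_size, units):
    """One weight per (input, unit) pair, plus one bias per unit."""
    return input_size * units + units


# 3x3 patches over a (28, 28, 1) input, output depth 32.
first_conv = conv2d_params(3, 3, 1, 32)
# 3x3 patches going from depth 32 to depth 64.
second_conv = conv2d_params(3, 3, 32, 64)
# Assumed comparison: a Dense layer with 32 units on the flattened 28x28 image.
dense = dense_params(28 * 28, 32)
```

Because the same kernel is reused at every location, the first convolution layer needs far fewer weights than a dense layer with the same number of outputs per location.
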

3) Convolutional Neural Network Architecture
![cnn](http://www.mdpi.com/entropy/entropy-19-00242/article_deploy/html/images/entropy-19-00242-g001.png)
* General Structure:
    * There are three key features that make the CNN structure actually work: local receptive fields, shared weights, and pooling
    * Input Layer $\rightarrow$ Convolutional Layers $\rightarrow$ Pooling Layers $\rightarrow$ Fully Connected Layers $\rightarrow$ Output Layer
* Input Layer $\rightarrow$ Convolutional Layers:
    * Using a kernel, the input image is converted to multiple convolved features (learning different features of the image)
    * **Local Receptive Fields** - the small group of input pixels that one kernel position covers; its size is defined by the kernel
    ![local_receptive_field](local_receptive_field.png)
        * The kernel is slid across the entire image
        * Multiple kernels are applied to the image, so each convolutional layer learns multiple kernels and yields multiple convolved features
        * The image is transformed into the set of local receptive fields
    * **Shared Weights** - each kernel applies the same learned weights at every location it visits
        * Because the weights within a convolution are shared, a pattern learned in one place is detected everywhere
    ![convolutional_layers](convolutional_layers.png)
    * In Keras `Conv2D` layers, the depth of the output and the size of the patches are the first arguments passed to the layer: `Conv2D(output_depth, (window_height, window_width))`
        * **Slide** these windows of 3x3 or 5x5 over the 3D input feature map, stopping at every possible location, and extracting the 3D patch of surrounding features: `(window_height, window_width, input_depth)`
        * Each 3D patch is then transformed, via a tensor product with the same learned weight matrix called **convolution kernel**, into a 1D vector of shape `(output_depth,)`
        * Every spatial location in the output feature map corresponds to the same location in the input feature map (e.g. lower-right corner of the output contains information about the lower-right corner of the input)
        * e.g. with 3x3 windows, the vector `output[i,j,:]` comes from the 3D patch `input[i-1:i+2, j-1:j+2, :]` (Python slices exclude the end index)
        ![convolution_kernel](convolution_kernel.png)
        * The output width and height may differ from the input width and height for two reasons:
            * **Border effects** - can be countered by padding the input feature map
            * **Strides** - moving the window more than one tile at a time skips positions and shrinks the output
    * Understanding **Border Effects** and **Padding**
        * **Border Effects** - get a shrunken version of the feature map
            * e.g. consider a 5x5 feature map (25 tiles), there are only 9 tiles around which you can center a 3x3 window, forming a 3x3 grid (which is the size of the output feature map)
            ![border_effects](border_effects.png)
            * The feature map shrinks by exactly two tiles along each dimension (border effect)
            * e.g. border effect with 28x28 inputs which becomes 26x26 after the first convolution layer
        * **Padding** - get an output feature map with the same spatial dimensions as the input
            * Padding consists of adding an appropriate number of rows and columns on each side of the input feature map so as to make it possible to fit center convolution windows around every input tile
            * e.g. for a 3x3 window, you add one column on the right, one column on the left, one row at the top, and one row at the bottom (and for a 5x5 window, you add two rows and two columns on each side)
            ![padding](padding.png)
            * In `Conv2D` layers, padding is configurable via the `padding` argument, which takes two values: `valid` or `same`
                * `valid` - means no padding (only valid windows will be used)
                * `same` - which means "pad in such a way as to have an output with the same width and height as the input"
    * Understanding **Convolution Strides**
        * Strides are the other factor that can influence output size
        * The description of convolution so far has assumed that the center tiles of the convolution windows are all contiguous (sharing a common border)
        * **Stride** - the distance between two successive windows; it is a parameter of the convolution (defaults to 1)
        * **Strided Convolutions** - convolutions with a stride higher than 1
        * e.g. patches extracted by a 3x3 convolution with stride 2 over a 5x5 input (without padding)
        ![strided_convolution](strided_convolution.png)
            * using stride 2 means the width and height of the feature map are **downsampled by a factor of 2** (in addition to any changes induced by border effects)
        * In practice, strided convolutions are **rarely used**, although they can come in handy for some types of models
        * To downsample feature maps, instead of strides, we tend to use the **max-pooling** operation
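
The combined effect of border effects, padding, and strides on the output size can be written as a small helper. The function name and signature are hypothetical; the two formulas are the usual ones for `valid` and `same` padding along a single spatial axis.

```python
import math


def conv_output_size(input_size, window_size, stride=1, padding="valid"):
    """Output length along one spatial axis for a given window, stride and padding mode."""
    if padding == "valid":
        # Only windows that fit entirely inside the input are used.
        return (input_size - window_size) // stride + 1
    if padding == "same":
        # The input is padded so that stride 1 preserves the size.
        return math.ceil(input_size / stride)
    raise ValueError("padding must be 'valid' or 'same'")


border_5x5 = conv_output_size(5, 3)                    # 5x5 input, 3x3 window
mnist_first = conv_output_size(28, 3)                  # 28x28 input, 3x3 window
mnist_same = conv_output_size(28, 3, padding="same")   # padded to keep the size
strided_5x5 = conv_output_size(5, 3, stride=2)         # 3x3 window, stride 2, no padding
```

The same `valid` formula also describes 2x2 max pooling with stride 2, for example `conv_output_size(26, 2, stride=2)`.
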
* Convolutional Layers $\rightarrow$ Pooling Layers:
    * **Pooling Layers** - used immediately after convolutional layers to simplify the information in their output (e.g. **Max Pooling**)
        * reduces the computational complexity for later layers
        * provides a form of translational invariance
    ![max_pool](max_pool.png)
    * Max-Pooling Operation
        * e.g. in the convnet example, the size of the feature maps is **halved** after every `MaxPooling2D` layer
            * Before the first `MaxPooling2D` layer, the feature map is 26x26 (`Conv2D`), but the max-pooling operation halves it to 13x13
        * **Max Pooling** - used to aggressively downsample feature maps, much like strided convolutions
            * Consists of extracting windows from the input feature maps and outputting the max value of each channel
            * Conceptually similar to convolution, except that instead of transforming local patches via a learned linear transformation (convolution kernel), they're **transformed via a hardcoded `max` tensor operation**
            * A big difference from convolution is that max pooling is usually done with 2x2 windows and stride 2, in order to downsample the feature maps by a factor of 2
            * On the other hand, convolution is typically done with 3x3 windows and no stride (stride 1)
        * Why is it important to downsample feature maps?
        * Why not remove the max-pooling layers and keep fairly large feature maps all the way up?
    * Faults of no max-pooling model:
        * It isn't conducive to learning a spatial hierarchy of features:
            * e.g. 3x3 windows in the third layer will only contain information coming from 7x7 windows in the initial input
            * the high-level patterns learned by the convnet will still be very small with regard to the initial input, which may not be enough to learn to classify digits (try recognizing a digit by only looking at it through windows that are 7x7 pixels)
            * we need the features from the last convolution layer to contain information about the totality of the input
        * The final feature map would be too large, which would result in intense overfitting:
            * e.g. final feature map has 22x22x64=30,976 total coefficients per sample
            * if you were to flatten it to stick a `Dense` layer of size 512 on top, that layer would have 15.8 million parameters
            * this is far too large for such a small model and would result in overfitting
    * **Average Pooling** - each local input patch is transformed by taking the average value of each channel over the patch, rather than the max
        * Max pooling tends to work better than alternatives such as average pooling or strided convolutions
    * Most reasonable subsampling strategy:
        1. produce dense maps of features (via unstrided convolutions)
        2. look at the maximal activation of the features over small patches, rather than looking at sparser windows of the inputs (via strided convolutions) (or average input patches, which would cause you to miss or dilute feature-presence information)
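
A minimal sketch of max pooling and average pooling with 2x2 windows and stride 2 follows. The `pool2d` helper, its arguments, and the 4x4 response map are illustrative assumptions; the map is built with one strong activation per 2x2 block to show how averaging dilutes feature-presence information.

```python
def pool2d(feature_map, reduce, window=2, stride=2):
    """Apply `reduce` (e.g. max) to each window of a single-channel feature map."""
    out_h = (len(feature_map) - window) // stride + 1
    out_w = (len(feature_map[0]) - window) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            patch = [
                feature_map[i * stride + di][j * stride + dj]
                for di in range(window)
                for dj in range(window)
            ]
            row.append(reduce(patch))
        pooled.append(row)
    return pooled


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Hypothetical 4x4 response map with one strong activation per 2x2 block.
feature_map = [
    [0, 8, 0, 0],
    [0, 0, 0, 4],
    [0, 0, 2, 0],
    [6, 0, 0, 0],
]

max_pooled = pool2d(feature_map, max)    # keeps each block's strongest response
avg_pooled = pool2d(feature_map, mean)   # spreads it over the block
```
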
* Pooling Layers $\rightarrow$ Fully Connected Layers:
    * **Fully Connected Layers** - used to aggregate all of the information that has been learned in the convolutional and pooling layers
    * They produce higher order features in standard NN manner
* Fully Connected Layers $\rightarrow$ Output Layer:
    * **Output Layer** - produces the probability that the image is of a certain class
    * Softmax can be applied to the output layer values $\eta_k$, where $k=1,\dots,K$, to estimate the one-versus-all class probabilities of $K$ classes: $\frac{e^{\eta_k}}{\sum_{k'=1}^K e^{\eta_{k'}}}$
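
The softmax step can be sketched in a few lines. The function name and the example scores are assumptions; subtracting the largest score before exponentiating is a common numerical-stability trick that leaves the probabilities unchanged.

```python
import math


def softmax(eta):
    """Turn raw output-layer scores eta_1..eta_K into probabilities that sum to 1."""
    shift = max(eta)  # subtracting the max avoids overflow without changing the result
    exps = [math.exp(e - shift) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for K = 3 classes.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
```
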

4) CNN Intuition
* Denoising is not commonly used with CNNs, but it is available
* A new set of image processing techniques generates more images (e.g. rotating images, flipping images, etc.), making your model more translation invariant
* It is **not** too common to use dropout after convolutional layers (instead use it after fully connected layers)
* It is common to have multiple convolutional layers in between pooling layers
* ReLU activation units are incredibly popular with CNNs
* In general, it is best to go off of a research paper in your domain space that uses CNNs. Start off trying to get something working that uses the same structure they did, and go from there.
* CNNs are among the best at identifying patterns in complex data (e.g. recognizing digits from images)
    * Test error rate by classifier:
    * Large and Deep Convolutional Network - 0.33%
    * SVM with degree 9 polynomial kernel - 0.66%
    * Gradient boosted stumps on Haar features - 0.87%

5) Training a Convnet on a Small Dataset (Data Augmentation/Feature Extraction/Fine-Tuning)
* Having to train an image-classification model using very little data is a common situation, which you'll likely encounter in practice if you ever do computer vision in a professional context
    * "Few" samples can mean anywhere from a few hundred to a few tens of thousands of images
    * Example: classifying images as dogs or cats, in a dataset containing 4000 pictures of cats and dogs (2000 cats and 2000 dogs)
        * 2000 pictures for training
        * 1000 pictures for validation
        * 1000 pictures for testing
    * Naively training a small convnet on the 2000 training samples, without any regularization, sets a baseline accuracy of 71%
* Right now the main issue is overfitting:
    1. Introduce **data augmentation**, a powerful technique for mitigating overfitting in computer vision, to improve the network to reach 82% accuracy
    2. **Feature extraction with a pretrained network** to reach accuracy of 90-96%
    3. **Fine-tuning a pretrained network** to reach final accuracy of 97%
* Is deep learning relevant for small-data problems?
    * Deep learning does work well with lots of data since it is able to find interesting features in the training data on its own, and this can only be achieved when lots of training examples are available (especially true for problems where the input samples are very high dimensional (e.g. images))
        * However, what constitutes lots of samples is relative to the size and depth of the network you train
    * It isn't possible to train a convnet to solve a complex problem with just a few tens of samples, but a few hundred can potentially suffice if the model is small, well regularized, and the task is simple
        * Because convnets learn local, translation-invariant features, they're **highly data efficient on perceptual problems**
        * (+) training a convnet from scratch on a very small image dataset will still yield reasonable results despite a relative lack of data, without the need for any custom feature engineering
    * Deep learning models are by nature **highly repurposable** - can take an image-classification or speech-to-text model trained on a large-scale dataset and reuse it on significantly different problem with only minor changes
        * Many pretrained models (usually trained on ImageNet dataset) are publicly available for download and can be used to bootstrap powerful vision models out of very little data
* **Data augmentation** - takes the approach of generating more training data from existing training samples, by **augmenting** the samples via a number of random transformations that yield believable-looking images
    * Overfitting is caused by having too few samples to learn from
    * Given infinite data, your model would be exposed to every possible aspect of the data distribution and would never overfit
    * The goal is that at training time, your model will **never see the exact same picture twice**. This helps expose the model to more aspects of the data and generalize better
    * In Keras, this can be done by **configuring a number of random transformations** to be performed on the images read by the ImageDataGenerator instance
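        ```python
        from keras.preprocessing.image import ImageDataGenerator

        # Each argument sets the range of one random transformation
        datagen = ImageDataGenerator(
                    rotation_range=40,
                    width_shift_range=0.2,
                    height_shift_range=0.2,
                    shear_range=0.2,
                    zoom_range=0.2,
                    horizontal_flip=True,
                    fill_mode='nearest')
        ```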
        * `rotation_range` is a value in degrees (0-180), a range within which to randomly rotate pictures
        * `width_shift_range` and `height_shift_range` are ranges (as a fraction of total width or height) within which to randomly translate pictures horizontally or vertically
        * `shear_range` is for randomly applying shearing transformations
        * `zoom_range` is for randomly zooming inside pictures
        * `horizontal_flip` is for randomly flipping half the images horizontally (relevant when there are no assumptions of horizontal asymmetry, e.g. real-world pictures)
        * `fill_mode` is the strategy used for filling in newly created pixels, which can appear after a rotation or a width/height shift
    ![data_augmentation](data_augmentation.png)
    * If you train a network using data-augmentation configuration, the network will never see the same input twice
        * However, the inputs it sees are still heavily intercorrelated because they come from a small number of original images
        * You can only remix existing information, but not produce new information
        * As such, this may not be enough to completely get rid of overfitting (add `Dropout` layer to model to further fight overfitting)
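
To show what such random transformations do to the pixels, here is a library-free sketch of two of them: a horizontal flip and a horizontal shift that fills the gap with the nearest edge value. It is not how Keras implements augmentation; the function names, the tiny image, and the one-pixel shift limit are all assumptions for illustration.

```python
import random


def horizontal_flip(image):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in image]


def shift_right(image, pixels):
    """Shift columns right by `pixels`, filling the gap with the nearest edge value."""
    if pixels == 0:
        return [list(row) for row in image]
    return [[row[0]] * pixels + row[:-pixels] for row in image]


def augment(image, rng, max_shift=1):
    """Return a randomly transformed copy of `image`; the original is left untouched."""
    out = [list(row) for row in image]
    if rng.random() < 0.5:  # flip about half of the images
        out = horizontal_flip(out)
    return shift_right(out, rng.randint(0, max_shift))


# Hypothetical 3x4 greyscale image.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
rng = random.Random(0)  # seeded so the sketch is reproducible
batch = [augment(image, rng) for _ in range(4)]
```

Every image in `batch` is built only from pixels of the original, which is why augmented samples stay heavily correlated with their source.
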
* Utilizing **Pretrained Convnet** - a saved network that was previously trained on a large dataset, typically on a large-scale image-classification task
    * A common and highly effective approach to deep learning on small image datasets
    * If this original dataset is large enough and general enough, then the spatial hierarchy of features learned by the pretrained network can effectively act as a generic model of the visual world, and hence its features can prove useful for many different computer-vision problems (even if these new problems may involve different classes than those of the original task)
    * e.g. train a network on ImageNet (where classes are mostly animals and everyday objects) and then repurpose this trained network for something as remote as identifying furniture items in images
    * Such **portability of learned features across different problems** is a key advantage of deep learning compared to many older, shallow-learning approaches
    * e.g. large convnet trained on the ImageNet dataset (1.4 million labeled images and 1000 different classes). ImageNet contains many animal classes, including different species of cats and dogs (which means it should perform well on the dog-vs-cat classification problem)
    * **VGG16** architecture (developed by Karen Simonyan and Andrew Zisserman in 2014)
        * Simple and widely used convnet architecture for ImageNet
        * Older model, far from the current state of the art and somewhat heavier than many other recent models
    * Other pretrained convnets: **VGG**, **ResNet**, **Inception**, **Inception-ResNet**, **Xception**, etc.
    * Two ways to use a pretrained network: **feature extraction** and **fine-tuning**
    * **Feature Extraction** (Using Pretrained Convnet) - consists of using the representations learned by a previous network to extract interesting features from new samples
        * These features are then run through a new classifier, which is trained from scratch
        * Convnets used for image classification comprise two parts: 
            * First part: **Convolutional base** - a series of pooling and convolution layers
            * Second part: a densely connected classifier at the end
        * Feature extraction consists of taking the convolutional base of a previously trained network, running the new data through it, and training a new classifier on the top of the output
        ![feature_extraction_pretrained](feature_extraction_pretrained.png)
        * Why are we only reusing the convolutional base? Could you reuse the densely connected classifier as well?
        * Avoid re-using the densely connected classifier because the representations learned by the convolution base are likely to be more generic and therefore more reusable
        * (Convolution base) The feature maps of a convnet are presence maps of generic concepts over a picture, which is likely to be useful regardless of the computer-vision problem at hand
            * Still has information about object location
        * (Dense classifier) But, the representations learned by the classifier will necessarily be specific to the set of classes on which the model was trained (They will only contain information about the presence probability of this or that class in the entire picture)
            * Representations in densely connected layers no longer have information about **where** objects are located in the input image (no notion of space)
            * For problems where object location matters, densely connected features are largely useless
        * Level of generality and reusability of the representations extracted by specific convolution layers depends on depth of layer
            * Earlier layers extract local, highly generic feature maps (e.g. visual edges, colors, textures)
            * Higher up layers extract more abstract concepts (e.g. "cat ear" or "dog eye")
        * If the new dataset differs a lot from the dataset the original model was trained on, you may be better off using only the first few layers for feature extraction rather than the entire convolutional base
        * e.g. since ImageNet contains multiple dog and cat classes, it would likely be beneficial to reuse the information contained in the densely connected layers, but we won't, in order to cover the more general case where the class set of the new problem doesn't overlap the class set of the original model
            1. take the convolutional base of the VGG16 network (`keras.applications.VGG16`)
            2. load the weights it learned from training on ImageNet
            3. extract interesting features from cat and dog images
            4. train a dog-vs-cat classifier on top of these features
            ```python
            from keras.applications import VGG16
            
            conv_base = VGG16(weights='imagenet',
                              include_top=False,
                              input_shape=(150, 150, 3))
            ```
        * Pre-trained model arguments:
            * `weights` - specifies the weight checkpoint from which to initialize the model
            * `include_top` - refers to including (or not) the densely connected classifier on the top of the network. By default, this densely connected classifier corresponds to the 1000 classes from ImageNet.
            * `input_shape` - the shape of the image tensors that you'll feed to the network. This argument is purely optional (if you don't pass it, the network will be able to process inputs of any size)
        * List of image-classification models (all pretrained on the ImageNet dataset)
            * **Xception**
            * **Inception V3**
            * **ResNet50**
            * **VGG16**
            * **VGG19**
            * **MobileNet**
        * Two ways to extract features using pretrained model:
            * Method 1: **Fast Feature Extraction Without Data Augmentation** - fast and cheap to run since it only requires running the convolutional base **once** for every input image. However, doesn't allow for data augmentation
                1. Run convolutional base over dataset
                2. Record the output to a NumPy array on disk
                3. Use this data as input to a standalone, densely connected classifier
            * This method yields accuracy of 90%, but loss shows overfitting almost immediately despite dropout since the model doesn't use data augmentation
            * Method 2: **Feature Extraction With Data Augmentation** - expensive, but allows for data augmentation since every input image goes through the convolutional base every time it's seen by the model
                1. Extend model you have (`conv_base`) by adding `Dense` layers on top
                2. Run the whole model end to end on the input data
            * This technique is so expensive that it should only be attempted with access to GPU
            * Before compiling and training the model, it's very important to **freeze** the convolutional base
                * Freezing a layer or set of layers means preventing their weights from being updated during training (e.g. `conv_base.trainable = False`)
                * If you don't do this, then the representations that were previously learned by the convolutional base will be modified during training
                * Because the `Dense` layers on top are randomly initialized, **very large weight updates** would be propagated through the network, **effectively destroying the representations previously learned**
            * With this setup, only the weights from the layers that you added will be trained (e.g. two `Dense` layers)
            * Note that in order for these changes to take effect, you must first compile the model; if you modify weight trainability after compilation, recompile the model
            * This method will allow for validation accuracy of 96% (much better than the small convnet trained from scratch)
    * **Fine Tuning** - consists of unfreezing a few of the top layers of a frozen model base used for feature extraction, and jointly training both the newly added part of the model (e.g. the fully connected classifier) and these top layers
        * Another widely used technique for model reuse, complementary to feature extraction
        * This is called **fine-tuning** because it slightly adjusts the more abstract representations of the model being reused, in order to make them more relevant for the problem at hand
        ![fine_tuning](fine_tuning.png)
        * Earlier, we stated that it's necessary to freeze the convolution base of VGG16 in order to train a randomly initialized classifier on top. For the same reason, it's only possible to fine-tune the top layers of the convolutional base once the classifier on top has already been trained
        * If the classifier isn't already trained, then the **error signal** propagating through the network during training will be **too large**, and the **representations previously learned** by the layers being fine-tuned will be **destroyed**
        * Fine-tuning Steps:
            1. Add your custom network on top of an already-trained base network
            2. Freeze the base network
            3. Train the part you added
            4. Unfreeze some layers in the base network
            5. Jointly train both these layers and the part you added
        * e.g. With the convolutional base, we will fine-tune the last three convolutional layers, which means all layers up to `block4_pool` should be frozen, and the layers `block5_conv1`, `block5_conv2`, and `block5_conv3` should be trainable
        * Why not fine-tune more layers? Why not fine-tune the entire convolutional base? Consider these facts when tuning more layers:
            * Earlier layers in the convolutional base encode more-generic, reusable features, whereas layers higher up encode more-specialized features. 
                * It's more useful to fine-tune the more specialized features, because these are the ones that need to be repurposed on your new problem. 
                * There would be fast-decreasing returns in fine-tuning lower layers
            * The more parameters you're training, the more you're at risk of overfitting
                * e.g. The convolutional base has 15 million parameters, so it would be risky to attempt to train it on your small dataset
        * Fine-tune the network with the RMSProp optimizer using a **very low learning rate**, to limit the magnitude of the modifications made to the representations of the three fine-tuned layers. Updates that are too large may harm these representations
            * You may want to smooth out noisy curves to see the actual trend of the training and validation accuracy/loss plots
            * Why can the accuracy stay stable or improve if the loss isn't decreasing?
                * What you display is an average of pointwise loss values, but what matters for accuracy is the **distribution of the loss values**, **not their average**, because accuracy is the result of a binary thresholding of the class probability predicted by the model. The model may still be improving even if this isn't reflected in the average loss

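To make the earlier point about loss versus accuracy concrete, here is a small NumPy sketch with made-up predicted probabilities for four positive samples (all values are hypothetical). Between the two snapshots, one more sample crosses the 0.5 threshold, so accuracy rises, while a single confidently wrong prediction pushes the average loss up.

```python
import numpy as np

def binary_crossentropy(y_true, p):
    """Pointwise binary cross-entropy for labels in {0, 1}."""
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def accuracy(y_true, p, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    return float(np.mean((p > threshold) == (y_true == 1)))

# Hypothetical predictions for four samples that all belong to class 1.
y_true = np.array([1, 1, 1, 1])
p_early = np.array([0.60, 0.60, 0.40, 0.40])  # two right, two narrowly wrong
p_late = np.array([0.55, 0.55, 0.55, 0.05])   # three right, one badly wrong

loss_early = float(binary_crossentropy(y_true, p_early).mean())
loss_late = float(binary_crossentropy(y_true, p_late).mean())
acc_early = accuracy(y_true, p_early)
acc_late = accuracy(y_true, p_late)

print(f"early: loss={loss_early:.3f}, accuracy={acc_early:.2f}")
print(f"late:  loss={loss_late:.3f}, accuracy={acc_late:.2f}")
```
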
6) Visualizing What Convnets Learn
* Often people say deep-learning models are "black boxes": learning representations that are difficult to extract and present in a human-readable form
    * Although this is partially true for certain types of deep-learning models, it's **definitely not true for convnets**
* The representations learned by convnets are highly amenable to visualization, in large part because they're **representations of visual concepts**
* Since 2013, a wide array of techniques have been developed for visualizing and interpreting these representations:
    1. **Visualizing intermediate convnet outputs (Intermediate activations)** - useful for understanding how successive convnet layers transform their input, and for getting a first idea of the meaning of individual convnet filters
    2. **Visualizing convnet filters** - useful for understanding precisely what visual pattern or concept each filter in a convnet is receptive to
    3. **Visualizing heatmaps of class activation in an image** - useful for understanding which parts of an image were identified as belonging to a given class, thus allowing you to localize objects in images
* **Visualizing Intermediate Activations** - visualizing intermediate activations consists of displaying the feature maps output by the various convolution and pooling layers in a network (a layer's output is often called its *activation*, i.e. the output of its activation function)
    * This gives a view into how an input is decomposed into the different filters learned by the network
    * The feature maps to visualize have three dimensions: width, height, and depth (channels)
    * Each channel encodes relatively independent features, so the proper way to visualize these feature maps is by independently plotting the contents of every channel as a 2D image
    * In order to extract the feature maps, create a Keras model that takes batches of images as input, and outputs the activations of all convolution and pooling layers (`Model`)
    * What sets the `Model` class apart is that it allows for models with multiple outputs, unlike `Sequential`
    * When fed an image input, this model returns the values of the layer activations in the original model
        * e.g. one input and eight outputs (one output per layer activation)
    * Take an input image and visualize specific channels of the activation layer
    ![test_cat_input](test_cat_input.png)
    ![test_cat_diagonal_edge](test_cat_diagonal_edge.png)
    ![test_cat_eyes](test_cat_eyes.png)
    * Visualizing every channel in every intermediate activation
    ![visualize_intermediate_activation](visualize_intermediate_activation.png)
        * The first layer acts as a collection of various edge detectors. At that stage, the **activations retain almost all of the information** present in the initial picture
        * As you go higher, the activations become increasingly abstract and less visually interpretable
            * e.g. encode higher-level concepts such as "cat ear" and "cat eye"
            * Higher representations carry increasingly less information about the visual contents of the image, and increasingly more information related to the class of the image
        * The sparsity of the activations increases with the depth of the layer: in the first layer, all filters are activated by the input image, but in the following layers, more and more filters are blank. A blank filter means the pattern it encodes isn't found in the input image
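
The multi-output `Model` described above can be sketched as follows. This is a minimal illustration under stated assumptions: the small untrained convnet and the random input image are hypothetical stand-ins for the trained network and the cat picture used in these notes, so it has four convolution/pooling outputs rather than eight, and the activations carry no meaningful patterns. `blank_channel_fraction` is a helper introduced here only to show how the sparsity observation could be measured.

```python
import numpy as np
import keras
from keras import layers

# Hypothetical stand-in convnet (untrained); the notes use a trained model.
inputs = keras.Input(shape=(64, 64, 3))
x = layers.Conv2D(8, 3, activation="relu")(inputs)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(16, 3, activation="relu")(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Flatten()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

# One input, one output per convolution/pooling layer.
feature_layers = [
    layer for layer in model.layers
    if isinstance(layer, (layers.Conv2D, layers.MaxPooling2D))
]
activation_model = keras.Model(
    inputs=model.input,
    outputs=[layer.output for layer in feature_layers],
)

# Random data standing in for a preprocessed image batch of size 1.
image_batch = np.random.rand(1, 64, 64, 3).astype("float32")
activations = activation_model.predict(image_batch, verbose=0)

# Each activation has shape (batch, height, width, channels);
# a single channel is a 2D array that can be plotted as an image.
first_layer_activation = activations[0]
channel = first_layer_activation[0, :, :, 3]

def blank_channel_fraction(activation):
    """Fraction of channels that are zero everywhere for this input."""
    return float(np.mean(activation.max(axis=(0, 1, 2)) == 0.0))

for layer, activation in zip(feature_layers, activations):
    print(layer.name, activation.shape, blank_channel_fraction(activation))
```

With a trained model and a real image, a call such as `matplotlib.pyplot.matshow(channel, cmap="viridis")` would display one channel as a 2D image, which is how the per-channel figures above are typically produced.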