Convolutional Neural Networks (ConvNet/CNN)

* Objectives:
    * Understand the fundamental differences between image data and other kinds of data
    * Be aware of the tools and pipeline for working with images
    * Understand general computer vision techniques for working with and transforming images
    * Be able to explain what a convolution is, and how it works
    * Understand the basic structure of a convolutional neural network
    * Comprehend the three basic ideas behind convolutional networks: Local receptive field, shared weights, and pooling
    * Be aware of general strategies for building convolutional neural networks

1) **Image Processing** - remove unnecessary details to allow for better generalization of images to image classes
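* equation (total variation denoising): $x \rightarrow \hat{y}$, where $\hat{y} = \arg\min_y \frac{1}{2}\sum_{i=1}^{n}(x_i-y_i)^2+\lambda\sum_{i=1}^{n-1}|y_{i+1}-y_i|$
    * **Fidelity** - $\frac{1}{2}\sum_{i=1}^{n}(x_i-y_i)^2$, keeps the result $y$ close to the original signal $x$
    * **Variation** - $\lambda\sum_{i=1}^{n-1}|y_{i+1}-y_i|$, penalizes jumps between neighboring values; $\lambda$ sets the trade-off between the two terms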
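
The two terms of the objective above can be evaluated directly. The following is a minimal sketch, assuming a one-dimensional signal; the function name `tv_objective` and the sample values are made up for illustration, and the code only scores candidate signals rather than solving the minimization.

```python
import numpy as np

def tv_objective(x, y, lam):
    """Return (fidelity, variation, total) for a candidate signal y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    fidelity = 0.5 * np.sum((x - y) ** 2)          # stay close to the data
    variation = lam * np.sum(np.abs(np.diff(y)))   # penalize jumps between neighbors
    return fidelity, variation, fidelity + variation

# Hypothetical noisy step signal and two candidate reconstructions.
x = np.array([0.0, 0.2, -0.1, 1.1, 0.9, 1.0])
y_copy = x.copy()                                  # keeps all of the noise
y_step = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # clean step

cost_copy = tv_objective(x, y_copy, lam=1.0)[2]
cost_step = tv_objective(x, y_step, lam=1.0)[2]
print(cost_copy, cost_step)
```

With $\lambda = 1$ the clean step scores lower than the noisy copy: it gives up a little fidelity in exchange for much less variation, which is the behavior the objective is designed to reward.
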
* What is the difficulty of image processing?
    * Images come in many different sizes
    * Viewing conditions are infinite
    * Objects are surrounded by other objects
    * Computers have a hard time understanding the context of an image (e.g. Barack Obama secretly stepping on his staff member's scale as a joke)
* Why is it important to understand how to process images? More and more data comes in the form of audio, images, and video, all of which have potential for modeling
* How do we make object recognition possible?
    * Compress the data
    * Keep the search simple
    * Use a method for segmenting potential objects
* Python libraries for image processing:
    * **Scikit-image (skimage)**
    * **OpenCV (based on C++)**
        * be careful with package dependencies
    * Python Imaging Library (PIL)
    * Pillow (the maintained fork of PIL)
* Image Pipeline: Read, Resize, and Transformations
    * Reading Images (Image types):
        * Colored images shape: (height, width, 3), since arrays index rows first
        * Greyscale images shape: (height, width)
        * Image Tensor example: (RGB)
        ```python
        import numpy as np

        # A 2x2 RGB image stored as rows x columns x channels
        image = np.array([
            #  R    G   B      R    G    B
            [[108, 50, 13], [111, 55, 18]],
            #  R    G   B      R    G    B
            [[115, 61, 23], [130, 129, 127]]
        ])
        ```
        * What is the shape of this array? (2, 2, 3)
        * What if the same array is greyscaled? (2, 2)
    * Resizing Images:
        * Making the image a specified shape without cropping
            * **Downsampling** - reducing the size of the image when image is too large for processing
            ![downsampling](downsampling.png)
            * **Upsampling** - purposely increasing the size of image when image is too small for processing
            ![upsampling](upsampling.png)
            * **Interpolation** - resize or distort your image from one pixel grid to another
            ![interpolation](http://northstar-www.dartmouth.edu/doc/idl/html_6.2/images/Interpolation_Methods-14.jpg)
    * Transforming Images - converting an image from one domain to another
        * **Greyscale** - removing color from image
        * **Denoise** - removing unnecessary details of an image allowing for better generalization of the class of image
            * **Gaussian Kernel** - a smoothing kernel whose weights follow the Gaussian probability density function; its spread is set by $s$ (the standard deviation), and the square of it, $s^2$, is the variance
            ![gaussian_kernel](gaussian_kernel.png)
    * Before convolutional neural nets, image analysis (or object recognition) was focused on examining pixels (or color vectors)
        * **K-means of RGB pixels** - segment colors in an automated fashion using k-means clustering
        ![kmeans_of_pixel_colors](https://www.mathworks.com/matlabcentral/answers/uploaded_files/9604/sample.jpg)
        * **Raw Vector based methods** - ascertain features in images by looking at intensity gradients
        ![raw_vector_method](raw_vector_method.png)
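
The pipeline steps above (read, resize, transform) can be sketched with NumPy alone. This is an illustrative sketch under stated assumptions: it reuses the 2x2 RGB tensor from the example above, greyscales with a plain channel mean (weighted sums are also common), resizes with nearest-neighbor repetition and striding rather than a library interpolation routine, and `gaussian_kernel` is a hypothetical helper name.

```python
import numpy as np

# The 2x2 RGB tensor from the example above: (rows, columns, channels).
image = np.array([
    [[108, 50, 13], [111, 55, 18]],
    [[115, 61, 23], [130, 129, 127]],
], dtype=float)

# Greyscale: collapse the channel axis with a plain mean.
grey = image.mean(axis=2)

# Upsampling by nearest neighbor: repeat every pixel twice along each axis.
up = np.repeat(np.repeat(grey, 2, axis=0), 2, axis=1)

# Downsampling by striding: keep every second pixel along each axis.
down = up[::2, ::2]

def gaussian_kernel(size, s):
    """Normalized size x size Gaussian kernel with standard deviation s."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * s ** 2))
    return kernel / kernel.sum()

kernel = gaussian_kernel(size=5, s=1.0)
print(image.shape, grey.shape, up.shape, down.shape, kernel.shape)
```

The kernel weights sum to 1, so sliding it over an image averages neighboring pixels without changing the overall brightness; a larger `s` spreads the weights out and blurs more.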
            
        

2) Image Featurization in Convolutional Neural Nets
* **Convolutions** - in image processing, a kernel (also called a convolution matrix or mask) is a small matrix useful for blurring, sharpening, embossing, edge detection, and more. This is accomplished by means of a **convolution** between the kernel and an image
    * Applying a kernel over an image to get a convolved feature:
    ![kernel_apply](kernel_apply.png)
    * Slide the kernel across the image and, at each position, take the dot product between the kernel and the patch of pixels beneath it
        * dot product: $a \cdot b=\sum_{i=1}^n a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$
        * $\left[\begin{array}{ccc}
            (1)(1) & (1)(0) & (1)(1) \\
            (0)(0) & (1)(1) & (1)(0) \\
            (0)(1) & (0)(0) & (1)(1)
            \end{array}\right] = \left[\begin{array}{ccc}
            1 & 0 & 1 \\
            0 & 1 & 0 \\
            0 & 0 & 1
            \end{array}\right] \;\Rightarrow\;
            1+0+1+0+1+0+0+0+1 = 4$
        * Convolved feature: $\left[\begin{array}{ccc}
            4 & - & - \\
            - & - & - \\
            - & - & -
            \end{array}\right]$
        * **Sobel filter** - a pair of kernels whose weights sit on either side of a zero column (for vertical edges) or a zero row (for horizontal edges), so the output is large wherever the intensity changes sharply
            * Vertical edge detector example:
                * $\left[\begin{array}{ccc}
                    -1 & 0 & +1 \\
                    -2 & 0 & +2 \\
                    -1 & 0 & +1
                    \end{array}\right]$
            * Horizontal edge detector example:
                * $\left[\begin{array}{ccc}
                    -1 & -2 & -1 \\
                    0 & 0 & 0 \\
                    +1 & +2 & +1
                    \end{array}\right]$
            * Useful for detecting the outline of a door in an image and recognizing the essential features of a door
            ![door_detection_with_edge](door_detection_with_edge.png)
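        * **Canny Filter** - a multi-stage edge detector that smooths the image with a Gaussian, computes intensity gradients, thins the edges, and keeps them using two thresholds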
    * What if we could get a computer to build its own kernels, apply those to images, then interpret those results to perform object recognition? That is the idea behind Convolutional Neural Networks (CNNs)
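
The kernel-sliding procedure in this section can be sketched as follows. The top-left 3x3 patch and the kernel are taken from the worked example above; the rest of the 5x5 image, the step-edge image, and the helper name `convolve2d_valid` are assumptions made for illustration. Note that, as in most CNN code, the kernel is not flipped before it is applied; for this symmetric kernel that makes no difference.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1), summing element-wise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Top-left patch and kernel match the worked example; the other pixels are assumed.
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])
convolved = convolve2d_valid(image, kernel)
print(convolved[0, 0])  # 4.0, the first entry of the convolved feature

# Sobel kernel for vertical edges, applied to a hypothetical image with one vertical edge.
sobel_vertical = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
])
step = np.array([[0, 0, 1, 1, 1]] * 5)
edges = convolve2d_valid(step, sobel_vertical)
print(edges[0])  # large where the window straddles the edge, zero in the flat region
```

A 5x5 image and a 3x3 kernel give a 3x3 convolved feature, because the kernel fits in only three positions along each axis when no padding is used.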

3) Convolutional Neural Network Architecture
![cnn](http://www.mdpi.com/entropy/entropy-19-00242/article_deploy/html/images/entropy-19-00242-g001.png)
* General Structure:
    * There are three key features that make the CNN structure actually work: local receptive fields, shared weights, and pooling
    * Input Layer $\rightarrow$ Convolutional Layers $\rightarrow$ Pooling Layers $\rightarrow$ Fully Connected Layers $\rightarrow$ Output Layer
* Input Layer $\rightarrow$ Convolutional Layers:
    * Using kernels, the input image is converted into multiple convolved features (each one learning a different feature of the image)
    * **Local Receptive Fields** - the small patch of input pixels that feeds a single hidden unit; its size is defined by the kernel
    ![local_receptive_field](local_receptive_field.png)
        * The kernel is slid across the entire image
        * Multiple kernels are applied to the image, which results in multiple learned kernels per hidden layer (yielding multiple feature maps)
        * The image is transformed into the set of local receptive fields
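    * **Shared Weights** - each kernel uses the same weights at every position as it slides across the image, so a feature can be detected wherever it appears
        * Multiple kernels are learned, but the weights within each one are shared across all of its local receptive fields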
    ![convolutional_layers](convolutional_layers.png)
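
Local receptive fields and shared weights can be made concrete with a small sketch. It assumes a 28x28 greyscale input and four 5x5 kernels; these sizes are chosen only for illustration, and the random values stand in for weights a network would learn.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
image = rng.random((28, 28))              # hypothetical greyscale input
kernels = rng.standard_normal((4, 5, 5))  # 4 kernels, random stand-ins for learned weights
biases = np.zeros(4)                      # one bias per kernel

# Every 5x5 local receptive field in the image: shape (24, 24, 5, 5).
fields = sliding_window_view(image, (5, 5))

# Shared weights: the same kernel is applied to every receptive field.
feature_maps = np.einsum("ijab,kab->kij", fields, kernels) + biases[:, None, None]

n_params = kernels.size + biases.size
print(fields.shape, feature_maps.shape, n_params)
```

Each kernel produces one 24x24 feature map, yet the layer has only 4 * 25 + 4 = 104 parameters regardless of the image size, because every position reuses the same weights.
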
* Convolutional Layers $\rightarrow$ Pooling Layers:
    * **Pooling Layers** - used immediately after convolutional layers to simplify the information in the output from the convolutional layer (e.g. **Max Pooling**)
        * reduces the computational complexity for later layers
        * provides a form of translational invariance
    ![max_pool](max_pool.png)
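
Max pooling can be sketched in a few lines. This version assumes a 2x2 window with stride 2 and a feature map whose height and width are even; the function name and the sample values are made up for illustration.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Max pooling with a 2x2 window and stride 2 (height and width must be even)."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
])
pooled = max_pool_2x2(feature_map)
print(pooled)
```

The 4x4 feature map shrinks to 2x2, so later layers see a quarter as many values. Because only the largest value in each window survives, moving a strong activation by one pixel inside its window leaves the output unchanged.
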
* Pooling Layers $\rightarrow$ Fully Connected Layers:
    * **Fully Connected Layers** - used to aggregate all of the information that has been learned in the convolutional and pooling layers
    * They produce higher order features in standard NN manner
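
A fully connected layer treats the pooled feature maps as one long vector. The sketch below assumes four 12x12 pooled maps, ten hidden units, and a ReLU activation; all of the sizes and the random weights are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
pooled_maps = rng.random((4, 12, 12))   # hypothetical output of the pooling layer

flat = pooled_maps.reshape(-1)          # 4 * 12 * 12 = 576 inputs
weights = rng.standard_normal((10, flat.size)) * 0.01  # random stand-ins for learned weights
bias = np.zeros(10)

hidden = np.maximum(0.0, weights @ flat + bias)  # ReLU activation
print(flat.shape, hidden.shape)
```

Unlike a convolutional layer, every hidden unit here is connected to every input value, which is how information from all feature maps and all positions gets combined.
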
* Fully Connected Layers $\rightarrow$ Output Layer:
    * **Output Layer** - produces the probability that the image is of a certain class
    * Softmax can be applied to the output layer scores $\eta_k$, where $k=1,\dots,K$, to estimate the probabilities of the $K$ classes: $\frac{e^{\eta_k}}{\sum_{k'=1}^K e^{\eta_{k'}}}$
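
The softmax step can be sketched as below. The scores in `eta` are made-up values for $K = 3$ classes, and subtracting the maximum is a common numerical-stability measure that does not change the result.

```python
import numpy as np

def softmax(eta):
    """Convert K output scores into K probabilities that sum to 1."""
    shifted = eta - np.max(eta)   # avoids overflow in exp; the ratios are unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()

eta = np.array([2.0, 1.0, 0.1])   # hypothetical output-layer scores for K = 3 classes
probs = softmax(eta)
print(probs, probs.argmax())
```

The class with the largest score receives the largest probability, so the predicted class is simply the index of the maximum.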

4) CNN Intuition
* Denoising is not commonly used with CNNs, but it is available
* Data augmentation - a set of image processing techniques for getting more training images from the ones you have (e.g. rotating images, flipping images, etc.), which makes your model more invariant to those transformations
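
Generating extra images by flipping and rotating can be sketched with NumPy. The `augment` helper and the tiny 3x4 array are hypothetical; a real pipeline would typically also apply random shifts, crops, or small rotations.

```python
import numpy as np

def augment(image):
    """Return simple variants of one image: flips and 90-degree rotations."""
    return [
        image,
        np.fliplr(image),      # mirror left-right
        np.flipud(image),      # mirror top-bottom
        np.rot90(image, k=1),  # rotate 90 degrees counterclockwise
        np.rot90(image, k=2),  # rotate 180 degrees
    ]

image = np.arange(12).reshape(3, 4)   # hypothetical 3x4 greyscale image
variants = augment(image)
print([v.shape for v in variants])
```

Each variant keeps the label of the original image, so this only makes sense for transformations that do not change the class (a flipped digit, for instance, may no longer be a valid example).
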
* It is **not** too common to use dropout after convolutional layers (instead use it after fully connected layers)
* It is common to have multiple convolutional layers in between pooling layers
* ReLU activation units are incredibly popular with CNNs
* In general, it is best to go off of a research paper in your domain space that uses CNNs. Start off trying to get something working that uses the same structure they did, and go from there.
* CNNs are among the strongest methods for identifying patterns in complex data such as images (e.g. test error rates by classifier for recognizing digits from images):
    * Large and Deep Convolutional Network - 0.33%
    * SVM with degree 9 polynomial kernel - 0.66%
    * Gradient boosted stumps on Haar features - 0.87%