# Convolutional Neural Networks
A Convolutional Neural Network (CNN) is a type of deep learning architecture designed primarily for processing grid-like data, such as images or time-series data. CNNs are particularly effective at recognizing patterns and hierarchical features within the data by employing convolutional layers, pooling layers, and fully connected layers.

Distinguishing features of CNNs compared to other network architectures:
- Convolutional layers: Unlike traditional feedforward neural networks, CNNs use convolutional layers to apply a set of filters (also called kernels) to the input data. These filters learn local patterns and features in the data, making CNNs well-suited for tasks like image recognition.
- Pooling layers: CNNs often use pooling layers to reduce the spatial dimensions of the input data, which helps to decrease computational complexity, control overfitting, and increase translation invariance.
- Hierarchical feature learning: CNNs are capable of learning hierarchical features in the data, with lower layers detecting simple patterns and higher layers capturing more complex and abstract representations.

Advantages of CNNs:
- Parameter efficiency: CNNs use shared weights in the convolutional layers, which reduces the number of parameters compared to fully connected networks. This makes CNNs more computationally efficient and less prone to overfitting.
- Translation invariance: Due to the use of convolutional and pooling layers, CNNs are more robust to translations in the input data, making them effective at recognizing patterns regardless of their position in the input.
- Effective for grid-like data: CNNs are well-suited for processing grid-like data structures, such as images or time-series data, due to their ability to capture local and hierarchical features.

Disadvantages of CNNs:
- Limited to grid-like data: CNNs are primarily designed for grid-like data structures and may not be the best choice for other types of data (e.g., graph-structured or relational data).
- Computationally intensive: Although CNNs are more parameter-efficient than fully connected networks, they can still be computationally intensive, especially when dealing with large input data or deep architectures.
- Lack of interpretability: Like other deep learning models, CNNs can be difficult to interpret, making it challenging to understand the underlying decision-making process or identify potential biases in the model.

In summary, Convolutional Neural Networks are specialized deep learning architectures designed for processing grid-like data, with distinguishing features like convolutional layers, pooling layers, and hierarchical feature learning. They offer advantages like parameter efficiency, translation invariance, and effectiveness for grid-like data but may have limitations in terms of applicability to other data types, computational intensity, and interpretability.

## Edge Detection Example for Convolution
You use a filter (or kernel) whose values correspond to the kind of edge you want to detect. For example:
- vert_edge = 
  - [1, 0, -1]
  - [1, 0, -1]
  - [1, 0, -1]
- hor_edge = 
  - [-1, -1, -1]
  - [ 0,  0,  0]
  - [ 1,  1,  1]
- slanted_edge =
  - [ 0,  1,  2]
  - [-1,  0,  1]
  - [-2, -1,  0]


Convolution is the process of applying this filter to successive portions of the original image. Unfortunately, although a dedicated symbol such as (*) would be less ambiguous, the plain "*" symbol is commonly used to denote convolution, so it is overloaded: elsewhere it also means element-wise multiplication, as discussed earlier.

If we apply one of the filters above to a 6x6 matrix, we get a 4x4 matrix as the output. The 3x3 kernel starts in the upper left corner of the input to generate the (0,0) entry of the 4x4 output. The kernel is then shifted one pixel to the right and applied again to produce the (0,1) entry. This continues until the kernel reaches the right edge of the top row. The filter then moves down by one row and back to the left, and the process begins again to generate the second row of outputs: (1,0), (1,1), (1,2), (1,3).

At each position you take the element-wise product of the filter and the patch of input it covers, then add up the nine products to get a single output value for that position. A value with large magnitude (strongly positive or strongly negative) means the target edge is present, while a value near 0 means it is not. The sign indicates the direction of the transition: with vert_edge above, a positive value means bright on the left and dark on the right, and a negative value means the reverse.
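The sliding-window procedure above can be sketched in NumPy. This is a minimal illustration rather than an optimized implementation; the function name `conv2d_valid` and the 6x6 test image (bright left half, dark right half) are assumptions made for the example.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding.

    The kernel is applied as-is (no flipping), as in most deep learning code.
    """
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + f_h, j:j + f_w]
            # Element-wise product, then sum of all f_h * f_w values.
            out[i, j] = np.sum(window * kernel)
    return out

# Hypothetical 6x6 image: bright (10) on the left, dark (0) on the right.
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
vert_edge = np.array([[1, 0, -1]] * 3, dtype=float)

edges = conv2d_valid(image, vert_edge)
print(edges.shape)  # (4, 4)
print(edges)        # every row is 0, 30, 30, 0
```

The two middle columns light up because that is where the filter straddles the bright-to-dark boundary; the outer columns are 0 because the filter sees a uniform region there.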

There is some debate as to what values make the most sense for these kernels:
- Sobel_filter = 
  - [1, 0, -1]
  - [2, 0, -2]
  - [1, 0, -1]
- Sobel_filter gives more weight to the center row of the filter than to the rows above and below it.
- Scharr_filter =
  - [3, 0, -3]
  - [10, 0, -10]
  - [3, 0, -3]

However, instead of hand-picking these values yourself, you can make them parameters for the network to learn. For example:
- learnt_filter =
  - [w1, w2, w3] 
  - [w4, w5, w6]
  - [w7, w8, w9]
- Typically, learnt filters do a better job of detecting features in the data than the hand-picked filters described above.
- This is an extremely powerful idea in computer vision.

# Padding Convolutions
When you apply the convolution operation to a set of data, the output will be smaller than the input matrix.
- If your input data has dimension (n, n) and your filter has dimension (f, f), the output matrix will have dimension (n-f+1, n-f+1), or (4, 4) in the case of a (6, 6) input matrix and a (3, 3) filter.

This shrinking can be an issue, especially with smaller images and when you apply multiple convolution steps in a row, because the output shrinks with each step.

Another problem is that a pixel in a corner is covered by only one position of the filter, while a pixel in the center of the image is covered by 9 positions (assuming our typical (3,3) filter). This effectively throws away information near the edges of the image.

Padding is the solution to this problem.

To make the output of our filter the same size as the input data, we can pad the input image with extra rows and columns of pixels around the edge, enlarging it until the output matches the size of the original input.
- In the case of our running (6,6) and (3,3) example:
- If we add a row of pixels above and below the image, as well as a column of pixels on each side of the image, our input matrix grows to (8,8), resulting in an output of size (8-3+1, 8-3+1) or (6, 6).
- When padding, p=1 means that you are adding a one-pixel border around the image.
- Typically 0 is the value used to pad the images.
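The running (6,6) example can be checked with `np.pad`. This is a small sketch; the image contents are placeholder values chosen only so the array is easy to inspect.

```python
import numpy as np

image = np.arange(36, dtype=float).reshape(6, 6)  # placeholder 6x6 input
p, f = 1, 3

# Add p rows/columns of zeros on every side: (6, 6) -> (8, 8).
padded = np.pad(image, pad_width=p, mode="constant", constant_values=0)

out_size = padded.shape[0] - f + 1  # n + 2p - f + 1
print(padded.shape)  # (8, 8)
print(out_size)      # 6
```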

## How much to pad? Valid vs Same convolutions
- **Valid** is the same as saying there is no padding
- **Same** convolutions pad so that the output size is the same as the input size (the amount of padding depends on the size of your filter)
  - To figure out how much padding:
  - p = (f - 1) / 2 (where f is the length of one side of the kernel, and the stride is 1)
  - (3,3) filter results in a padding of 1
  - (5,5) filter results in a padding of 2
  - filter dimensions are almost always odd, as an odd-sized filter has a central position
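The padding rule above is easy to wrap in a helper. The function name `same_padding` is an assumption for this sketch, and it only covers the stride-1, odd-filter case described in the list.

```python
def same_padding(f):
    """Padding that keeps the output the same size as the input at stride 1."""
    if f % 2 == 0:
        raise ValueError("symmetric 'same' padding needs an odd filter size")
    return (f - 1) // 2

n = 6
for f in (3, 5, 7):
    p = same_padding(f)
    # Output size n + 2p - f + 1 comes back to n for every odd f.
    print(f, p, n + 2 * p - f + 1)
```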

# Striding Convolutions
Striding works on the same basic principles as a non-strided convolution, but instead of moving the filter one pixel at a time, you set a stride value and move by that many pixels at a time. With a stride greater than 1, the output shrinks even further, according to this equation:
- n = size of one dimension of input data
- f = size of one filter dimension
- p = padding amount
- s = stride amount
- os = output size
- os = (floor((n + 2p - f) / s) + 1, floor((n + 2p - f) / s) + 1)
- If the division returns a non-integer, you round down (take the floor), which is why the formula uses floor
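The output-size formula can be written as a one-line helper. The name `conv_output_size` is hypothetical; integer floor division implements the "round down" rule.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output length along one dimension: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))            # 4  (no padding, stride 1)
print(conv_output_size(6, 3, p=1))       # 6  ("same" padding)
print(conv_output_size(7, 3, s=2))       # 3
print(conv_output_size(6, 3, s=2))       # 2  (1.5 rounds down to 1, plus 1)
```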

## Difference between cross-correlation and convolution
Cross-correlation and convolution are similar mathematical operations used in signal processing, image processing, and deep learning, particularly in the context of applying filters or kernels to data. The primary difference between the two lies in the way the filter or kernel is applied to the input data.

Cross-correlation:
In cross-correlation, the filter is applied to the input data as it is, without any modification. The operation involves sliding the filter over the input data and computing the element-wise multiplication and summation between the filter and the portion of the input it covers at each step.

Convolution:
In convolution, the filter is first flipped along both the x and y axes (rotated by 180 degrees) before being applied to the input data. Like cross-correlation, the operation involves sliding the filter over the input data and computing the element-wise multiplication and summation between the filter and the portion of the input it covers at each step.

In the context of deep learning and Convolutional Neural Networks (CNNs), the distinction between cross-correlation and convolution is often blurred. In practice, CNNs typically use the cross-correlation operation, but the term "convolution" is commonly used to describe the process. Since the filters are learned during training, the distinction between the two operations becomes less important, and both can achieve similar performance.
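The relationship between the two operations can be demonstrated in a few lines of NumPy. This is an illustrative sketch with hypothetical function names: convolution is implemented as cross-correlation with the kernel rotated by 180 degrees.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide the kernel over the image as-is (stride 1, no padding)."""
    f_h, f_w = kernel.shape
    out_h = image.shape[0] - f_h + 1
    out_w = image.shape[1] - f_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

def convolve2d(image, kernel):
    """True convolution: rotate the kernel by 180 degrees, then cross-correlate."""
    return cross_correlate2d(image, kernel[::-1, ::-1])

rng = np.random.default_rng(0)
image = rng.random((6, 6))
vert_edge = np.array([[1, 0, -1]] * 3, dtype=float)
box = np.ones((3, 3))

# Rotating vert_edge negates it, so the two operations differ only in sign.
same_up_to_sign = np.allclose(convolve2d(image, vert_edge),
                              -cross_correlate2d(image, vert_edge))
# A kernel that is unchanged by a 180-degree rotation gives identical results.
identical_for_box = np.allclose(convolve2d(image, box),
                                cross_correlate2d(image, box))
print(same_up_to_sign, identical_for_box)  # True True
```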

# Convolution on Volume (3 dimensions)
In convolutional neural networks, when working with 3D matrices (or tensors) for image data, the third dimension typically represents the channels (e.g., RGB for color images) rather than spatial dimensions like X, Y, and Z.

This process is similar to standard convolution, but we are working with 3D tensors instead of 2D matrices. When conducting convolution on a volume:
- Our kernel is also 3D, usually having a size like (3, 3, C), (5, 5, C), or (7, 7, C), where C is the number of channels in the input data (e.g., 3 for RGB images).
- The input data we apply the kernel to is in three dimensions: height, width, and channels.
- The output of a single filter is a 2D tensor: the channel (Z) dimension is collapsed by the filter
- The process for calculating the convolution is similar, except that the kernel is applied across all channels of the input at once. The kernel is moved along the X and Y dimensions, and at each position the element-wise products over all channels are summed into a single output value.
- It's important to note that the kernel's depth (number of channels) should match the input data's depth. In practice, we often have multiple kernels (or filters) applied to the input data, which results in multiple feature maps stacked together to form the output 3D tensor.
- When using multiple 3D filters on the same input (with no padding and a stride of 1) you will get an output with the following dimensions:
  - (n-f+1, n-f+1, num_filters)
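A minimal NumPy sketch of convolution over a volume, assuming stride 1 and no padding. The function name `conv_volume` and the (f, f, channels, num_filters) layout of the filter array are choices made for this example.

```python
import numpy as np

def conv_volume(x, filters):
    """x: (n_h, n_w, n_c); filters: (f, f, n_c, num_filters)."""
    n_h, n_w, n_c = x.shape
    f, _, f_c, num_filters = filters.shape
    if f_c != n_c:
        raise ValueError("filter depth must match the input's channel count")
    out = np.zeros((n_h - f + 1, n_w - f + 1, num_filters))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = x[i:i + f, j:j + f, :]          # shape (f, f, n_c)
            for k in range(num_filters):
                # Sum over height, width AND channels: one number per filter.
                out[i, j, k] = np.sum(window * filters[:, :, :, k])
    return out

x = np.ones((6, 6, 3))            # placeholder RGB-like input
filters = np.ones((3, 3, 3, 2))   # two 3x3x3 filters
out = conv_volume(x, filters)
print(out.shape)     # (4, 4, 2)
print(out[0, 0, 0])  # 27.0, the sum of 3 * 3 * 3 ones
```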

## Implementing a single layer of a CNN
- After you calculate your convolution output, you add a bias constant to each value in the matrix before putting the matrix through whatever activation function you have chosen (e.g. ReLU)
- You will have a different bias value for each filter, that is, one bias per channel of the layer's output
- This can lead to many parameters if you allow the network to determine the weights and biases for each of the filters
- Let's say we have 10 filters that are (3,3,3) in layer n
  - W[n] = 10 * 3 * 3 * 3 = 270 weight values
  - b[n] = 10 bias values
  - Or 280 total parameters for layer n
  - This is independent of the size of the input fed into the layer ((1000, 1000, 3) or (64, 64, 3) images, for instance), which makes convolutional layers less prone to overfitting on large images
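The parameter count and the bias-plus-activation step can be sketched as follows. The helper name `conv_layer_params` is hypothetical, and the random array merely stands in for a real convolution output.

```python
import numpy as np

def conv_layer_params(f, n_c_prev, n_filters):
    """Return (weights, biases, total) for one convolutional layer."""
    weights = f * f * n_c_prev * n_filters
    biases = n_filters  # one bias per filter
    return weights, biases, weights + biases

print(conv_layer_params(f=3, n_c_prev=3, n_filters=10))  # (270, 10, 280)

# Bias and activation applied to a stand-in convolution output of shape (4, 4, 10).
rng = np.random.default_rng(0)
conv_out = rng.standard_normal((4, 4, 10))
b = rng.standard_normal((1, 1, 10))   # broadcasts over height and width
a = np.maximum(0, conv_out + b)       # ReLU
print(a.shape)  # (4, 4, 10)
```

Note that the input height and width never appear in `conv_layer_params`, which is the point made in the last bullet.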

## Notation summary
If layer l is a convolution layer:
- f[l] = filter size
- p[l] = padding amount
- s[l] = stride
- nc[l] = number of filters
- each filter has shape (f[l], f[l], nc[l-1])
- activations = a[l] -> (nh[l], nw[l], nc[l])
- A[l] (activations for a batch of m examples) -> (m, nh[l], nw[l], nc[l])
- Weights = (f[l], f[l], nc[l-1], nc[l])
- bias = nc[l] values -> usually stored as a (1, 1, 1, nc[l]) 4D tensor to make broadcasting and matrix operations easier to execute

Inputs:
- (nh[l-1], nw[l-1], nc[l-1]) matrix (height, width, channels)

Outputs:
- (nh[l], nw[l], nc[l]) matrix
- nh[l] = floor((nh[l-1] + 2p[l] - f[l]) / s[l]) + 1
- nw[l] = floor((nw[l-1] + 2p[l] - f[l]) / s[l]) + 1
- nc[l] = number of filters
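The notation above translates directly into a small shape calculator. The function name and the example layer sizes are illustrative assumptions, not values from a specific network in these notes.

```python
def conv_layer_shapes(input_shape, f, p, s, n_filters):
    """Shapes for layer l given the (nh, nw, nc) shape of layer l-1."""
    nh_prev, nw_prev, nc_prev = input_shape
    nh = (nh_prev + 2 * p - f) // s + 1
    nw = (nw_prev + 2 * p - f) // s + 1
    return {
        "filter": (f, f, nc_prev),
        "weights": (f, f, nc_prev, n_filters),
        "bias": (1, 1, 1, n_filters),
        "activations": (nh, nw, n_filters),
    }

layer1 = conv_layer_shapes((39, 39, 3), f=3, p=0, s=1, n_filters=10)
layer2 = conv_layer_shapes(layer1["activations"], f=5, p=0, s=2, n_filters=20)
print(layer1["activations"])  # (37, 37, 10)
print(layer2["activations"])  # (17, 17, 20)
print(layer2["weights"])      # (5, 5, 10, 20)
```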

## Multiple Layer CNNs
After the last convolution layer you typically flatten the output volume into a single vector containing all of its values, which is then fed to your final output layer (softmax, logistic regression, etc.)

A large part of designing CNNs is choosing the values for the hyper-parameters including the stride, padding, number of filters, etc.

Typically, as you go deeper into a CNN, the nh and nw values trend down while the number of filters increases.

Convolutional Neural Networks typically contain three types of layers:
- Convolution layers (CONV)
- Pooling layers (POOL)
- Fully connected layers (FC)

## What are Pooling Layers?
A pooling layer in a Convolutional Neural Network (CNN) is used to reduce the spatial dimensions of the input data (feature maps) while retaining important features. The primary goals of pooling layers are to decrease computational complexity, control overfitting, and increase translation invariance in the network.

Pooling layers perform a downsampling operation by summarizing local regions in the input data using a specific function. The two most common types of pooling are:

**Max Pooling**: In max pooling, a small window (usually 2x2 or 3x3) is moved across the input data with a certain stride, and the maximum value within the window is retained as the output for that region. Max pooling has the advantage of preserving the most prominent features in the input data.
- Max Pooling has hyper-parameters but no parameters to learn
- f = width and height of the kernel
- s = step size (stride) for the kernel
- common values are f=2 and s=2
  - this halves the height and width
- f=3, s=2 is also a common choice of hyper-parameter values
- **IMPORTANT** Typically when you use max pooling you do not use any padding
- if max pooling is performed on multiple channels, the output has the same number of channels as the input, because the pooling is carried out independently on each channel
- Max Pooling is much more common than Average Pooling, with the exception of the use-case of reducing the size of a matrix without losing "data"

**Average Pooling**: In average pooling, the window is moved across the input data similarly to max pooling, but instead of retaining the maximum value, the average value within the window is computed and used as the output for that region. Average pooling tends to produce smoother feature maps compared to max pooling.

It's important to note that pooling layers are applied independently to each channel of the input data. This means that the depth (number of channels) of the input data remains the same after applying a pooling layer, while the height and width are reduced.

In summary, a pooling layer in a CNN is used to reduce the spatial dimensions of the input data, helping to decrease computational complexity, control overfitting, and increase translation invariance in the network. The most common types of pooling are max pooling and average pooling, which summarize local regions in the input data using the maximum and average values, respectively.
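Both pooling types can be sketched with one NumPy function. The function name `pool2d` and the 4x4 example values are assumptions for illustration; no padding is used, matching the note above.

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """Pool a (n_h, n_w, n_c) array channel by channel, without padding."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    reduce_fn = np.max if mode == "max" else np.mean
    n_h, n_w, n_c = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w, n_c))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * s:i * s + f, j * s:j * s + f, :]
            # Reduce over height and width only, keeping channels separate.
            out[i, j, :] = reduce_fn(window, axis=(0, 1))
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)[:, :, np.newaxis]  # shape (4, 4, 1)

max_out = pool2d(x, mode="max")
avg_out = pool2d(x, mode="average")
print(max_out[:, :, 0])  # rows: 9, 2 and 6, 3
print(avg_out[:, :, 0])  # rows: 3.75, 1.25 and 3.75, 2.0
```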

## What are Fully Connected layers?
A fully connected (FC) layer, also known as a dense layer, is a layer in a Convolutional Neural Network (CNN) where each neuron is connected to every neuron in the previous layer and every neuron in the subsequent layer. Fully connected layers are typically used towards the end of a CNN architecture, after a series of convolutional and pooling layers, to perform high-level reasoning and make final predictions based on the features extracted by the earlier layers.

In a CNN, the fully connected layers act as a bridge between the feature extraction part of the network, consisting of convolutional and pooling layers, and the output layer that produces the final predictions or classification. Before being fed into the fully connected layers, the output of the last convolutional or pooling layer is usually flattened into a one-dimensional vector.

In the fully connected layers, each neuron computes a weighted sum of its inputs followed by the application of an activation function, such as ReLU, sigmoid, or softmax. The weights and biases of the neurons in the fully connected layers are learned during the training process, allowing the network to make predictions based on the patterns and features it has learned from the training data.

The primary role of fully connected layers in a CNN is to:

- Combine the features extracted by the convolutional and pooling layers to perform high-level reasoning.
- Map the high-level features to the final output classes, which can be probabilities for classification tasks or real values for regression tasks.

In summary, a fully connected layer in a CNN is a layer where each neuron is connected to all neurons in the adjacent layers. It plays a crucial role in combining the features extracted by the earlier layers and mapping them to the final output, enabling the network to make predictions or classifications based on the input data.
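The flatten-then-dense step can be illustrated in NumPy. The shapes here (a 5x5x16 feature volume feeding 120 units) and the random weights are placeholder assumptions; in a real network the weights would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

features = rng.standard_normal((5, 5, 16))  # stand-in for the last POOL output
x = features.reshape(-1)                    # flatten to a vector of 400 values

W = rng.standard_normal((120, 400)) * 0.01  # one row of weights per neuron
b = np.zeros(120)

a = np.maximum(0, W @ x + b)                # weighted sum, then ReLU
print(x.shape, a.shape)                     # (400,) (120,)
print(W.size + b.size)                      # 48120 parameters in this one layer
```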

## CNN Example
**PLEASE NOTE**: There are inconsistencies in the literature as to what is called a layer within a CNN. 
- Input -> CONV1 -> POOL1 -> Output
- Some will combine the CONV1 and POOL1 steps and refer to the two as layer 1, while other people will refer to CONV1 as layer 1 and POOL1 as layer 2. The first method is common because, when people report the number of layers in a NN, they often count only the layers which have parameters (weights), and because pooling layers don't have parameters they are often left out of the layer count. We will generally be following the convention of only counting layers that have weights.
- **IMPORTANT** Generally, don't try to pick your own hyper-parameters for CNNs from scratch. It is usually better to look at the literature, choose hyper-parameter values that have worked for other projects, and use those as a base.
- Typical CNN Network structure:
  - X -> CONV1 -> POOL1 -> CONV2 -> POOL2 -> FC1 -> FC2 -> FC3 -> Softmax
  - X -> CONV1 -> CONV2 -> POOL1 -> CONV3 -> FC1 -> FC2 -> Softmax
- Typically the majority of your learnable parameters are contained within the fully connected layers, while the CONV layers contain far fewer parameters
- Furthermore, the activation size (the number of neurons or nodes in that layer) tends to go down from one layer to the next 
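The last two points can be checked by tracing a small network of the first form above. All layer sizes here (32x32x3 input, 5x5 filters, 120/84/10 units) are hypothetical values picked for the sketch, with no padding and stride 1 for the CONV layers.

```python
def conv_out(n, f, p=0, s=1):
    return (n + 2 * p - f) // s + 1

# X -> CONV1 -> POOL1 -> CONV2 -> POOL2 -> FC1 -> FC2 -> Softmax
layers = [
    ("CONV1", "conv", {"f": 5, "filters": 6}),
    ("POOL1", "pool", {"f": 2, "s": 2}),
    ("CONV2", "conv", {"f": 5, "filters": 16}),
    ("POOL2", "pool", {"f": 2, "s": 2}),
    ("FC1", "fc", {"units": 120}),
    ("FC2", "fc", {"units": 84}),
    ("Softmax", "fc", {"units": 10}),
]

h, w, c = 32, 32, 3
report = []  # (layer name, activation size, learnable parameters)
for name, kind, cfg in layers:
    if kind == "conv":
        params = (cfg["f"] * cfg["f"] * c + 1) * cfg["filters"]  # +1 for the bias
        h, w, c = conv_out(h, cfg["f"]), conv_out(w, cfg["f"]), cfg["filters"]
    elif kind == "pool":
        params = 0  # pooling has nothing to learn
        h, w = conv_out(h, cfg["f"], s=cfg["s"]), conv_out(w, cfg["f"], s=cfg["s"])
    else:
        params = (h * w * c + 1) * cfg["units"]
        h, w, c = 1, 1, cfg["units"]
    report.append((name, h * w * c, params))

for row in report:
    print(row)

conv_total = sum(p for name, _, p in report if name.startswith("CONV"))
fc_total = sum(p for name, _, p in report if not name.startswith(("CONV", "POOL")))
print(conv_total, fc_total)  # 2872 59134
```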

## Why use Convolutions?
- **Fewer Parameters** If you were to use fully connected layers for all of your layers while dealing with images or other large input arrays, the number of parameters in the network would be extremely large. Convolutional layers, by contrast, contain far fewer parameters for the network to learn.

### Parameter Sharing
Convolutional neural networks (CNNs) use convolutional layers to reduce the number of parameters required to learn representations from high-dimensional input data such as images. The main mechanism that allows CNNs to reduce the number of parameters is parameter sharing.

Parameter sharing is a technique where the same set of weights (also known as filters or kernels) is applied to multiple locations in the input data. The idea behind parameter sharing is that features that are useful in one part of the input may also be useful in other parts of the input. By sharing the same set of weights across multiple locations, the model can learn these features more efficiently and with fewer parameters.

For example, let's consider a 2D convolutional layer with a kernel size of 3x3 and 32 filters. If the input to this layer is a 100x100x3 image (100 pixels wide, 100 pixels high, and 3 color channels), then each filter has 3x3x3 = 27 weights and the layer has 3x3x3x32 = 864 weights in total (plus 32 biases, one per filter).

During the forward pass, the convolution operation slides each 3x3 kernel over the input image and performs element-wise multiplication and summation to produce a single output value at each position; the values produced by one filter together form a feature map. Because the same kernel is applied at every location in the input image, the same set of 864 weights is used for all of these operations, which keeps the number of parameters needed to learn the features in the input image small.

Through the use of parameter sharing, convolutional layers can effectively learn useful features from high-dimensional input data while reducing the number of parameters required to learn these features. This allows CNNs to learn representations from large datasets efficiently and with fewer computational resources.
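To put a number on the savings, the sketch below compares the convolutional layer from the example (32 filters of size 3x3 on a 100x100x3 input) with a hypothetical fully connected layer that maps the same input to the same number of output values. It assumes no padding and a stride of 1.

```python
n_h, n_w, n_c = 100, 100, 3   # input image
f, n_filters = 3, 32

conv_weights = f * f * n_c * n_filters     # shared across all positions
conv_biases = n_filters

out_h, out_w = n_h - f + 1, n_w - f + 1    # 98 x 98 with no padding
n_inputs = n_h * n_w * n_c                 # 30,000 input values
n_outputs = out_h * out_w * n_filters      # 307,328 output values
dense_weights = n_inputs * n_outputs       # one weight per input/output pair

print(conv_weights, conv_biases)  # 864 32
print(dense_weights)              # 9219840000
```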

### Sparsity of Connections
Each output value in the convolution layer can be thought of as only having connections to the pixels over which the kernel is applied (9 inputs per channel in the case of a (3,3) kernel, or 25 in the case of a (5,5) kernel).

Sparsity of connections is another technique used in convolutional layers to further reduce the number of parameters required for learning. In convolutional layers, sparsity of connections refers to the fact that not all neurons in a layer are connected to all neurons in the previous layer.

In a convolutional layer, each neuron is only connected to a small local region of the previous layer. This local region is determined by the size of the kernel (filter) used in the convolution operation. For example, in a 2D convolutional layer with a kernel size of 3x3, each neuron in the current layer is only connected to a 3x3 region of the previous layer. This means that the number of connections between the current layer and the previous layer is significantly reduced compared to a fully connected layer.

This sparsity of connections has several benefits. First, it reduces the number of parameters required to learn the features of the input. Second, it enables the model to learn translation-invariant features that are useful for tasks such as image classification. Third, it reduces the computational cost of the model, making it more efficient and faster to train.

Moreover, to further increase the sparsity of connections, some convolutional layers use a technique called "dilated convolution" or "atrous convolution". In dilated convolution, the kernel is dilated by inserting zeros between the kernel elements. This increases the receptive field of the neuron without increasing the number of connections. The dilation rate determines the spacing between the kernel elements and hence the degree of sparsity in the connections.

Overall, sparsity of connections in convolutional layers is a powerful technique that reduces the number of parameters and computational cost of the model while enabling it to learn effective features for tasks such as image classification and object detection.
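The effect of dilation on the receptive field follows from a simple formula: a kernel of width f with dilation rate d spans f + (f - 1)(d - 1) input positions. The helper name below is hypothetical.

```python
def effective_kernel_size(f, dilation):
    """Span covered by an f-wide kernel with (dilation - 1) gaps between taps."""
    return f + (f - 1) * (dilation - 1)

for d in (1, 2, 3):
    span = effective_kernel_size(3, d)
    # dilation rate, receptive field width, weights per channel (unchanged)
    print(d, span, 3 * 3)
```

A 3x3 kernel therefore covers a 5x5 region at dilation 2 and a 7x7 region at dilation 3, while still using only 9 weights per channel.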

In [2]:
import math
import numpy as np
import h5py
import matplotlib.pyplot as plt
from matplotlib.pyplot import imread
import scipy
from PIL import Image
import pandas as pd
import tensorflow as tf
import tensorflow.keras.layers as tfl
from tensorflow.python.framework import ops
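# cnn_utils and test_utils are assumed to be local helper modules shipped with this
# notebook; these imports fail with ModuleNotFoundError if those files are missing.
from cnn_utils import *
from test_utils import summary, comparator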

%matplotlib inline
np.random.seed(1)

ModuleNotFoundError: No module named 'cnn_utils'

In [None]:
X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_happy_dataset()

# Normalize image vectors
X_train = X_train_orig/255.
X_test = X_test_orig/255.

# Reshape
Y_train = Y_train_orig.T
Y_test = Y_test_orig.T

print(f"number of training examples = {X_train.shape[0]}")
print(f"number of test examples = {X_test.shape[0]}")
print(f"X_train shape: {X_train.shape}")
print(f"Y_train shape: {Y_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"Y_test shape: {Y_test.shape}")

In [None]:
# GRADED FUNCTION: happyModel

def happyModel():
    """
    Implements the forward propagation for the binary classification model:
    ZEROPAD2D -> CONV2D -> BATCHNORM -> RELU -> MAXPOOL -> FLATTEN -> DENSE
    
    Note that for simplicity and grading purposes, you'll hard-code all the values
    such as the stride and kernel (filter) sizes. 
    Normally, functions should take these values as function parameters.
    
    Arguments:
    None

    Returns:
    model -- TF Keras model (object containing the information for the entire training process) 
    """
    model = tf.keras.Sequential([
           
            # YOUR CODE STARTS HERE
             ## ZeroPadding2D with padding 3, input shape of 64 x 64 x 3
            tfl.ZeroPadding2D(padding=3, input_shape=(64,64,3)),
            ## Conv2D with 32 7x7 filters and stride of 1
            tfl.Conv2D(32, (7,7), strides=1),
            ## BatchNormalization for axis 3
            tfl.BatchNormalization(axis=3),
            ## ReLU
            tfl.ReLU(),
            ## Max Pooling 2D with default parameters
            tfl.MaxPool2D(),
            ## Flatten layer
            tfl.Flatten(),
            ## Dense layer with 1 unit for output & 'sigmoid' activation
            tfl.Dense(1, activation='sigmoid')
            
            # YOUR CODE ENDS HERE
        ])
    
    return model
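As a sanity check on the layer stack above, the expected shapes and parameter counts can be worked out by hand and compared with `model.summary()`. This sketch uses plain Python rather than TensorFlow; the variable names are assumptions, and the batch-norm count assumes four values per channel (scale, offset, moving mean, moving variance).

```python
h, w, c = 64, 64, 3

# ZeroPadding2D(padding=3): 3 pixels added on every side.
h, w = h + 2 * 3, w + 2 * 3                       # (70, 70, 3)

# Conv2D(32, (7, 7), strides=1) with no padding.
f, n_filters = 7, 32
conv_params = f * f * c * n_filters + n_filters   # weights + biases
h, w, c = h - f + 1, w - f + 1, n_filters         # (64, 64, 32)

# BatchNormalization over the channel axis.
bn_params = 4 * c

# MaxPool2D() defaults to a 2x2 window with stride 2.
h, w = h // 2, w // 2                             # (32, 32, 32)

# Flatten, then Dense(1).
flat = h * w * c
dense_params = flat * 1 + 1

total_params = conv_params + bn_params + dense_params
print(conv_params, bn_params, dense_params)  # 4736 128 32769
print(total_params)                          # 37633
```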

In [None]:
# GRADED FUNCTION: convolutional_model

def convolutional_model(input_shape):
    """
    Implements the forward propagation for the model:
    CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> DENSE
    
    Note that for simplicity and grading purposes, you'll hard-code some values
    such as the stride and kernel (filter) sizes. 
    Normally, functions should take these values as function parameters.
    
    Arguments:
    input_shape -- shape of one input image, e.g. (height, width, channels)

    Returns:
    model -- TF Keras model (object containing the information for the entire training process) 
    """

    # YOUR CODE STARTS HERE
    
    input_img = tf.keras.Input(shape=input_shape)
    ## CONV2D: 8 filters 4x4, stride of 1, padding 'SAME'
    Z1 = tf.keras.layers.Conv2D(filters=8, kernel_size=(4, 4), strides=1, padding='same')(input_img)
    ## RELU
    # ReLU is applied element-wise, so A1 has the same shape as Z1
    A1 = tf.keras.layers.ReLU()(Z1)
    ## MAXPOOL: window 8x8, stride 8, padding 'SAME'
    P1 = tf.keras.layers.MaxPool2D(pool_size=(8, 8), strides=(8,8), padding='same')(A1)
    ## CONV2D: 16 filters 2x2, stride 1, padding 'SAME'
    Z2 = tf.keras.layers.Conv2D(filters=16, kernel_size=(2, 2), strides=1, padding='same')(P1)
    ## RELU
    A2 = tf.keras.layers.ReLU()(Z2)
    ## MAXPOOL: window 4x4, stride 4, padding 'SAME'
    P2 = tf.keras.layers.MaxPool2D(pool_size=(4, 4), strides=(4, 4), padding='same')(A2)
    ## FLATTEN: takes the output of the second pooling layer (P2)
    F = tf.keras.layers.Flatten()(P2)
    ## Dense layer
    ## 6 neurons in output layer. Hint: one of the arguments should be "activation='softmax'" 
    outputs = tf.keras.layers.Dense(units=6, activation='softmax')(F)
    # YOUR CODE ENDS HERE
    model = tf.keras.Model(inputs=input_img, outputs=outputs)
    return model
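The shapes flowing through `convolutional_model` can be traced by hand. With `padding='same'`, the output size along each spatial dimension is ceil(n / stride). The (64, 64, 3) input shape below is an assumption for illustration, since the function accepts any `input_shape`.

```python
import math

def same_out(n, s):
    """Output length for padding='same': ceil(n / s)."""
    return math.ceil(n / s)

h, w, c = 64, 64, 3   # assumed input_shape

# CONV2D: 8 filters 4x4, stride 1, 'same'
conv1_params = 4 * 4 * c * 8 + 8
h, w, c = same_out(h, 1), same_out(w, 1), 8       # (64, 64, 8)

# MAXPOOL: window 8x8, stride 8, 'same'
h, w = same_out(h, 8), same_out(w, 8)             # (8, 8, 8)

# CONV2D: 16 filters 2x2, stride 1, 'same'
conv2_params = 2 * 2 * c * 16 + 16
h, w, c = same_out(h, 1), same_out(w, 1), 16      # (8, 8, 16)

# MAXPOOL: window 4x4, stride 4, 'same'
h, w = same_out(h, 4), same_out(w, 4)             # (2, 2, 16)

# FLATTEN -> DENSE with 6 units
flat = h * w * c
dense_params = flat * 6 + 6

print(conv1_params, conv2_params, dense_params)    # 392 528 390
print(conv1_params + conv2_params + dense_params)  # 1310
```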