### Convolutional Neural Networks (CNNs)

CNNs are useful for image recognition. In this section, we'll use the LeNet-5 architecture, designed by Yann LeCun.

### Initialization

For the MNIST dataset, each image has dimensions 28 x 28 (784 pixels per image). So, each input image corresponds to a tensor of 784 normalized floating point values between 0.0 and 1.0. The label for an image is a one-hot tensor with 10 classes (0-9). 

When we use a batch of input data, the shape of `inputs` is (batch_size, self.input_dim ** 2) (because there are batch_size images and each image is a square with side length input_dim, flattened into input_dim ** 2 pixel values), and the shape of the labels is (batch_size, self.output_size).
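To make the shapes concrete, here is a minimal pure-Python sketch of one batch. The helper names (`flatten_image`, `one_hot`) and the blank placeholder image are assumptions for illustration only; they are not part of the notebook's model code.

```python
# Hypothetical helpers, for illustration only.
INPUT_DIM = 28      # MNIST images are 28 x 28
OUTPUT_SIZE = 10    # digit classes 0-9


def flatten_image(image):
    """Flatten a square image (a list of rows) into one list of pixel values."""
    return [pixel for row in image for pixel in row]


def one_hot(digit, num_classes=OUTPUT_SIZE):
    """Return a one-hot list with 1.0 at index `digit` and 0.0 elsewhere."""
    return [1.0 if i == digit else 0.0 for i in range(num_classes)]


# A placeholder image whose pixels are all 0.0 (already in the range 0.0 to 1.0)
blank = [[0.0] * INPUT_DIM for _ in range(INPUT_DIM)]

batch_size = 16
batch_inputs = [flatten_image(blank) for _ in range(batch_size)]
batch_labels = [one_hot(3) for _ in range(batch_size)]

print(len(batch_inputs), len(batch_inputs[0]))  # 16 784
print(len(batch_labels), len(batch_labels[0]))  # 16 10
```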

In [None]:
"""
batch_size = 16
dataset = dataset.batch(batch_size)
it = dataset.make_one_shot_iterator()
inputs, labels = it.get_next()
with tf.Session() as sess:
    # Batch of data size 16
    input_arr, label_arr = sess.run(
        (inputs, labels))
print(repr(input_arr))
"""

"""
import tensorflow as tf

class MNISTModel(object):
    # Model Initialization
    def __init__(self, input_dim, output_size):
        # CODE HERE
        self.input_dim = input_dim
        self.output_size = output_size
"""

### Reshaping

#### NHWC Format
In TensorFlow, image data is in NHWC format (num_samples, height, width, channels). So, we need to transform the 2-D data. For the MNIST dataset, the height and width of each image is self.input_dim, while the number of channels is 1.

#### Reshaping the Data

To reshape, we use tf.reshape. It takes in a tensor and new shape as required arguments. E.g., you could reshape a tensor from (4, 10) to (5, 4, 2) because both shapes contain 40 elements, but you could not reshape (4, 10) to (3, 10, 2). 

We can use -1 for one dimension of the new shape, and the dimension of -1 will take on the value necessary to allow the new shape to contain all the elements of the tensor. 
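The rule for -1 can be sketched without TensorFlow. The function below is a hypothetical helper (not a TensorFlow API) that only works out which shape a reshape would produce, assuming the total element count must stay the same.

```python
def resolve_shape(num_elements, new_shape):
    """Replace a single -1 in new_shape with the size that keeps the element count fixed."""
    if new_shape.count(-1) > 1:
        raise ValueError("at most one dimension may be -1")
    # Product of all the dimensions that were given explicitly
    known = 1
    for dim in new_shape:
        if dim != -1:
            known *= dim
    if -1 in new_shape:
        if known == 0 or num_elements % known != 0:
            raise ValueError("cannot infer the -1 dimension")
        return [num_elements // known if dim == -1 else dim for dim in new_shape]
    if known != num_elements:
        raise ValueError("element counts differ")
    return list(new_shape)


print(resolve_shape(4 * 10, [5, 4, 2]))          # [5, 4, 2]
print(resolve_shape(4 * 10, [-1, 4, 2]))         # [5, 4, 2]
print(resolve_shape(16 * 784, [-1, 28, 28, 1]))  # [16, 28, 28, 1]
```

Calling `resolve_shape(4 * 10, [3, 10, 2])` raises a `ValueError`, because 3 * 10 * 2 = 60 elements cannot hold a 40-element tensor.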

We reshape the data into NHWC format so that we can run it through the CNN.

In [None]:
"""
with tf.Session() as sess:
    input_arr = sess.run(inputs)
    reshaped_arr = sess.run(
        tf.reshape(inputs, [-1, 2, 2, 1])
    )
print(repr(input_arr))
print(repr(reshaped_arr))
"""

"""
class MNISTModel(object):
    # Model Initialization
    def __init__(self, input_dim, output_size):
        self.input_dim = input_dim
        self.output_size = output_size
    
    # CNN Layers
    def model_layers(self, inputs, is_training):
        # reshaping the input in NHWC format
        reshaped_inputs = tf.reshape(inputs, (-1, self.input_dim, self.input_dim, 1))
"""


### Convolution

#### Filters and kernels

We use filters to transform inputs and extract features to recognize certain images. For example, we can have filters that magnify curves, so our model can tell the difference between curved digits (e.g., 8) and non-curved digits (e.g., 1). 

A kernel matrix defines the weights of a filter, and it is usually a square matrix. Like all neural network weights, the filter's weights are trainable. 

(so, in a feedforward neural net, you train the weights to create the linear combination of features that lead to the best prediction. You can apply that principle with CNNs to create the optimal filters to distinguish between digits by extracting the most useful hidden features)

When the input data has multiple channels, a filter will have a separate kernel matrix per channel. 

#### Convolution

The convolution represents how we apply our filter weights to the input data (so, we have the filters, which are like our "weights", and we compute the "linear combination" of the input values under those weights; in this case, that is a matrix dot product).

The main operation used by a convolution is the matrix dot product (not matrix multiplication, but rather the sum over the element-wise product of two matrices).

Ex. [[0.4, 0.1], [0.1, 0.2]] * [[0.0, 1.0], [0.4, 0.2]] = 0.18
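The example above can be checked with a few lines of plain Python. `matrix_dot` is an illustrative helper name, not a library function.

```python
def matrix_dot(a, b):
    """Sum of the element-wise products of two equally shaped matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


kernel = [[0.4, 0.1], [0.1, 0.2]]
patch = [[0.0, 1.0], [0.4, 0.2]]

# 0.4*0.0 + 0.1*1.0 + 0.1*0.4 + 0.2*0.2
result = matrix_dot(kernel, patch)
print(round(result, 2))  # 0.18
```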

In addition to matrix dot products, the convolution includes a trainable bias term that's added to each matrix dot product.

The number of matrix dot products in a convolution depends on the dimensions of the input data, the kernel matrix, and the stride size. The stride size is the vertical/horizontal offset of the kernel matrix as it moves along the input data (so, when you pass the filter over the data, it's how far you move left/right/up/down as you go from one portion of the input to the next).
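As a rough sketch of how stride affects the number of dot products, the following pure-Python function slides a kernel over a single-channel input with no padding. The function name, the 4 x 4 input, and the kernel values are made up for illustration.

```python
def convolve2d(data, kernel, stride=1, bias=0.0):
    """Slide `kernel` over `data` (no padding); each output value is one dot product plus bias."""
    k_h, k_w = len(kernel), len(kernel[0])
    out = []
    for top in range(0, len(data) - k_h + 1, stride):
        row = []
        for left in range(0, len(data[0]) - k_w + 1, stride):
            total = bias
            for i in range(k_h):
                for j in range(k_w):
                    total += kernel[i][j] * data[top + i][left + j]
            row.append(total)
        out.append(row)
    return out


data = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
kernel = [[1, 0], [0, 1]]

stride1 = convolve2d(data, kernel, stride=1)  # 3 x 3 output: 9 dot products
stride2 = convolve2d(data, kernel, stride=2)  # 2 x 2 output: 4 dot products
print(stride1)
print(stride2)
```

With the same input and kernel, doubling the stride cuts the number of dot products from 9 to 4.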

#### Padding

We can change the size of the kernel matrix that we use (e.g., 2 x 2, 3 x 3, etc.). But, our combination of kernel and stride size might not fit nicely with the data. If we want to use all the input data in our convolution, we can pad the input matrix with 0s, adding rows/columns made entirely of 0s to the edges of the input data matrix. Since 0s multiplied by any number equals 0, the padding doesn't affect the matrix products. 

But, we only pad 0s to the right and bottom of the input data matrix, so as to avoid extra dot products and pad only the absolute minimum necessary to use all our input data. 
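A minimal sketch of the right-and-bottom zero padding described above is shown below. The helper name is hypothetical, and libraries may distribute padding differently (for example, splitting it between both sides when more than one row or column is needed), so treat this as an illustration of the idea rather than a reproduction of any library's behavior.

```python
def pad_right_bottom(data, pad_rows, pad_cols):
    """Append `pad_cols` zero columns on the right and `pad_rows` zero rows at the bottom."""
    width = len(data[0]) + pad_cols
    padded = [list(row) + [0] * pad_cols for row in data]
    padded += [[0] * width for _ in range(pad_rows)]
    return padded


data = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]

# A 2 x 2 kernel with stride 2 would never reach the last row and column of a
# 3 x 3 input, so one extra row and column of zeros lets every value be used.
padded = pad_right_bottom(data, pad_rows=1, pad_cols=1)
for row in padded:
    print(row)
```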

#### Convolution Layer

A convolution layer (similar to a hidden layer in a feedforward network) applies multiple filters to the input tensor (and the neural network learns what the weights of these filters should be). While each filter has a separate kernel matrix for each of the input channels, the overall result of a filter's convolution is the sum of the convolutions across all the input channels. 

Adding more filters to a convolution layer allows the layer to better extract hidden features. But, this results in additional training time and computational complexity, since filters add extra weights to the model. 
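To get a feel for how filters add weights, here is a small counting sketch. It assumes one kernel matrix per (input channel, filter) pair plus one bias per filter; the 5 x 5 kernel and the filter counts of 32 and 64 are the values used in this section's code, and the function name is made up.

```python
def conv_layer_weights(kernel_h, kernel_w, in_channels, filters):
    """Count trainable values: kernel weights for every channel/filter pair, plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


# 32 filters over a 1-channel input
print(conv_layer_weights(5, 5, 1, 32))   # 832
# 64 filters over a 32-channel input
print(conv_layer_weights(5, 5, 32, 64))  # 51264
```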




In [None]:
"""
import tensorflow as tf

class MNISTModel(object):
    # Model Initialization
    def __init__(self, input_dim, output_size):
        self.input_dim = input_dim
        self.output_size = output_size
    
    # CNN Layers
    def model_layers(self, inputs, is_training):
        reshaped_inputs = tf.reshape(
            inputs, [-1, self.input_dim, self.input_dim, 1])
        # CODE HERE
        conv1 = tf.layers.conv2d(reshaped_inputs, 
                                 filters = 32, 
                                 kernel_size = [5, 5], 
                                 padding = 'same', 
                                 activation = tf.nn.relu, 
                                 name = 'conv1')
"""

### Max Pooling

#### Purpose of Pooling

The convolution layer extracts important hidden features, but the number of features it extracts can still be pretty large. To get around this, we can use $\textit{pooling}$, which reduces the size of the data, allowing the model to perform fewer computations and train faster. (so, similar to dimensionality reduction in that sense)

Pooling also helps reduce overfitting because it keeps only the most salient features and ignores distortions or uncommon features found only in a few examples.

#### How Pooling Works

We use filter matrices for pooling. But, the pooling filter doesn't have any weights, and it doesn't perform matrix dot products. Instead, it applies a reduction operation to subsections of the input data. 

(NOTE: Pooling can operate after a convolution step! It's used to reduce the size of the dataset, so after you convolve you can still have a pretty large matrix. You can then use pooling to reduce the dimensions, like you do with convolution, but you don't fundamentally create any values that weren't already there, like you do with convolution)

The type of pooling usually used in CNNs is referred to as max pooling. The filters of max pooling use the "max" operation to obtain the maximum number in each submatrix of the input data. Below is an example: (using a 2 x 2 filter and a stride size of 1)

[[1, 2, 5], [4, 0, 3], [0, 1, 9]] ---> [[4, 5], [4, 9]]

Other types of pooling include min pooling and average pooling. 
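The max pooling example above, and the other pooling types, can be sketched with one function that takes the reduction operation as an argument. `pool2d` and `mean` are illustrative names, not library functions.

```python
def pool2d(data, size, stride, reduce_fn=max):
    """Apply `reduce_fn` to each size x size window of `data` (no padding)."""
    out = []
    for top in range(0, len(data) - size + 1, stride):
        row = []
        for left in range(0, len(data[0]) - size + 1, stride):
            window = [data[top + i][left + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def mean(window):
    """Average of the values in a pooling window."""
    return sum(window) / len(window)


data = [[1, 2, 5], [4, 0, 3], [0, 1, 9]]

print(pool2d(data, size=2, stride=1))                  # [[4, 5], [4, 9]]
print(pool2d(data, size=2, stride=1, reduce_fn=min))   # [[0, 0], [0, 0]]
print(pool2d(data, size=2, stride=1, reduce_fn=mean))  # [[1.75, 2.5], [1.25, 3.25]]
```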

For input data with dimensions $H_{in} \times W_{in}$, the output of pooling with filter dimensions $H_F \times W_F$ and stride size $S$ has the following height and width:

$$H_{out} = \left\lceil \frac{H_{in} - H_F + 1}{S} \right\rceil$$
$$W_{out} = \left\lceil \frac{W_{in} - W_F + 1}{S} \right\rceil$$

#### Padding

Similar to convolutions, we may want to pad our input data prior to pooling. We pad our data with a value dependent on the pooling operation (e.g., for max pooling, we pad each matrix with $-\infty$). Since $-\infty$ is smaller than every number, it allows us to resize the input data without adding distortions when pooling.

When padding, the output dimensions don't depend on the filter dimensions anymore. For input data with dimensions $H_{in} \times W_{in}$, the output of padded pooling with a stride size of $S$ has dimensions

$$H_{out} = \left\lceil \frac{H_{in}}{S} \right\rceil$$
$$W_{out} = \left\lceil \frac{W_{in}}{S} \right\rceil$$
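The output-size rules can be sanity-checked with a short helper. This sketch assumes the brackets in the formulas denote rounding up, which agrees with how TensorFlow documents its 'VALID' (unpadded) and 'SAME' (padded) modes; the function name is made up.

```python
import math


def pooled_size(size_in, size_filter, stride, padded=False):
    """Output height (or width) of pooling along one dimension."""
    if padded:
        # With padding, the filter size no longer matters
        return math.ceil(size_in / stride)
    return math.ceil((size_in - size_filter + 1) / stride)


print(pooled_size(3, 2, 1))               # 2  (the 3 x 3 example with a 2 x 2 filter)
print(pooled_size(28, 2, 2))              # 14 (a 28 x 28 image, 2 x 2 filter, stride 2)
print(pooled_size(3, 2, 2))               # 1
print(pooled_size(3, 2, 2, padded=True))  # 2
```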

So, to clarify, the process, thus far, goes like this:

Input --> 

Convolution (perform calculations, matrix dot products, sums, etc) --> 

(Max) Pooling (get subset of convolution values, to get dimensionality reduction)

In [None]:
"""
import tensorflow as tf

class MNISTModel(object):
    # Model Initialization
    def __init__(self, input_dim, output_size):
        self.input_dim = input_dim
        self.output_size = output_size
    
    # CNN Layers
    def model_layers(self, inputs, is_training):
        reshaped_inputs = tf.reshape(
            inputs, [-1, self.input_dim, self.input_dim, 1])
        # Convolutional Layer #1
        conv1 = tf.layers.conv2d(
            inputs=reshaped_inputs,
            filters=32,
            kernel_size=[5, 5],
            padding='same',
            activation=tf.nn.relu,
            name='conv1')
        # CODE HERE
        pool1 = tf.layers.max_pooling2d(inputs = conv1, pool_size = [2, 2], strides = 2, name = 'pool1')

"""

### Multiple Layers

We can increase the size and depth of a CNN.

#### Adding extra layers

CNNs can benefit from additional layers. Additional layers allow a CNN to essentially stack multiple filters together for use on the image data. But, adding additional layers increases computational complexity, training time, and the risk of overfitting. 

For MNIST data, images are pretty simple and only have one channel, so one convolution and max pooling layer works well.

#### Increased filters

We normally increase the number of filters in a convolution layer the deeper the convolutional layer is in our model. This is because the deeper the convolution layer, the more detailed the extracted features become (e.g., the filters in the first layer might extract a feature like lines and edges, but the filters in the second layer take the combination of these features and extract more distinguished features, such as the sharp angle of a 7 or the intersecting curves of an 8). 

In the following example, our first convolution layer has 32 filters, while the second convolution layer has 64 filters.

In [None]:
"""
import tensorflow as tf

class MNISTModel(object):
    # Model Initialization
    def __init__(self, input_dim, output_size):
        self.input_dim = input_dim
        self.output_size = output_size
    
    # CNN Layers
    def model_layers(self, inputs, is_training):
        reshaped_inputs = tf.reshape(
            inputs, [-1, self.input_dim, self.input_dim, 1])
        # Convolutional Layer #1
        conv1 = tf.layers.conv2d(
            inputs=reshaped_inputs,
            filters=32,
            kernel_size=[5, 5],
            padding='same',
            activation=tf.nn.relu,
            name='conv1')
        # Pooling Layer #1
        pool1 = tf.layers.max_pooling2d(
            inputs=conv1,
            pool_size=[2, 2],
            strides=2,
            name='pool1')
        ####################
        # Convolutional Layer #2
        conv2 = tf.layers.conv2d(
            inputs = pool1, 
            filters=64, 
            kernel_size = [5, 5], 
            padding = 'same', 
            activation = tf.nn.relu, 
            name = 'conv2'
        )
        # Pooling Layer #2
        pool2 = tf.layers.max_pooling2d(
            inputs = conv2, 
            pool_size = [2, 2], 
            strides = 2, 
            name = 'pool2'
        )
"""

### Designing Fully-Connected Layers

Fully-connected layers can be used to aggregate and flatten data.

#### Fully-connected layer

We apply a fully-connected layer with 1024 units to the output data of the second pooling layer. The number of units is somewhat arbitrary: enough to be powerful, but not so many as to be resource-intensive.

The purpose of the fully-connected layer is to aggregate the data features before we convert them to logits. This allows the model to make better predictions than if you directly convert the pooling output to logits (since you add an extra level of "gathering" that can combine outputs of different filters). 

#### Flattening

The data, thus far, has been in NHWC format. But, in order to use a fully-connected layer, the data needs to be a matrix, where the # of rows = batch size, and # of columns = data features. 

We need to reshape, using tf.reshape, but in the opposite direction as before, now converting from NHWC to a 2-D matrix. 

Since the first dimension remains the batch size, we can use -1 for this dimension. To calculate the size of the second dimension (i.e., the total # of data features in `pool2`), we use the `shape` property of tensors in TensorFlow. The `shape` property gives us a `tf.TensorShape` object, which can be converted to a list of integers using its `as_list` function.

The flattened data size is just the product of the H, W, and C sizes. 
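As a quick check of the arithmetic, the sketch below computes the flattened size from an NHWC shape list. The shape `[None, 7, 7, 64]` is an assumption: it is what two 2 x 2, stride-2 poolings of 28 x 28 images (28 -> 14 -> 7) with 64 filters in the last convolution would give, with `None` standing in for the unknown batch size.

```python
def flattened_size(shape):
    """Product of every dimension except the first (batch) dimension."""
    size = 1
    for dim in shape[1:]:
        size *= dim
    return size


# Assumed NHWC shape of the second pooling layer's output
pool2_shape = [None, 7, 7, 64]

print(flattened_size(pool2_shape))  # 3136
```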



In [None]:
"""
import tensorflow as tf

class MNISTModel(object):
    # Model Initialization
    def __init__(self, input_dim, output_size):
        self.input_dim = input_dim
        self.output_size = output_size
    
    # Apply fully-connected layer
    def create_fc(self, pool2):
        # CODE HERE
        # get dimensions hwc (not including num_samples)
        hwc = pool2.shape.as_list()[1:]
        # set value of flattened size (product of H, W, and C)
        flattened_size = 1
        for val in hwc:
            flattened_size *= val
        # get flattened layer
        pool2_flat = tf.reshape(pool2, (-1, flattened_size))
        # apply a fully-connected (dense) layer
        dense = tf.layers.dense(pool2_flat, 1024, activation = tf.nn.relu, name = 'dense')
        # return dense layer
        return dense
    
    # CNN Layers
    def model_layers(self, inputs, is_training):
        reshaped_inputs = tf.reshape(
            inputs, [-1, self.input_dim, self.input_dim, 1])
        # Convolutional Layer #1
        conv1 = tf.layers.conv2d(
            inputs=reshaped_inputs,
            filters=32,
            kernel_size=[5, 5],
            padding='same',
            activation=tf.nn.relu,
            name='conv1')
        # Pooling Layer #1
        pool1 = tf.layers.max_pooling2d(
            inputs=conv1,
            pool_size=[2, 2],
            strides=2,
            name='pool1')
        # Convolutional Layer #2
        conv2 = tf.layers.conv2d(
            inputs=pool1,
            filters=64,
            kernel_size=[5, 5],
            padding='same',
            activation=tf.nn.relu,
            name='conv2')
        # Pooling Layer #2
        pool2 = tf.layers.max_pooling2d(
            inputs=conv2,
            pool_size=[2, 2],
            strides=2,
            name='pool2')
        # Dense Layer: flatten pool2 and apply the fully-connected layer
        dense = self.create_fc(pool2)
        return dense
"""

### Dropout

A problem with the CNN (and, frankly, any neural net) is the issue of overfitting. Dropout is one technique for reducing overfitting. 

#### Co-Adaptation

Co-adaptation occurs when multiple neurons in a layer extract the same, or very similar, hidden features from the input data. This can happen when the connection weights for two different neurons are nearly identical. At that point, having the additional neurons creates redundancy in the model.

When a fully-connected layer has a large number of neurons, co-adaptation is more likely to occur. This can be a problem for two reasons. First, it's a waste of computation. Second, if many neurons are extracting the same features, it magnifies the significance of those features. This leads to overfitting if the duplicated features are specific to only the training set.

#### Dropout

To deal with the problem of co-adaptation, we can apply dropout during training. 

When we apply dropout, we randomly shut down some fraction of a layer's neurons at each training step (by zeroing out the neuron values). The fraction of neurons to be zeroed out is known as the dropout rate, $r_d$. The remaining neurons have their values multiplied by $\frac{1}{1- r_d}$, so that the expected sum of the neuron values remains the same.

By randomly dropping a fraction of the neurons, we're choosing a random sample of neurons to use at each training step. So, each individual neuron works with many different subsets of the other neurons rather than all of them at once. This helps each neuron avoid depending too much on other neurons to correct its mistakes (the root cause of co-adaptation, since a neuron is relying on other neurons to capture the same features), while still allowing the neurons to learn different things from one another.
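The zero-and-rescale step can be sketched in plain Python. This illustration drops each value independently with probability equal to the rate, so the dropped fraction is only approximately the rate; the function name, the seed, and the all-ones input are assumptions for the demo.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1 / (1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(0)           # seeded so the run is repeatable
values = [1.0] * 10000           # made-up neuron values
dropped = dropout(values, rate=0.4, rng=rng)

zero_fraction = sum(v == 0.0 for v in dropped) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(zero_fraction)  # close to 0.4
print(mean_after)     # close to 1.0, the mean before dropout
```

Because the survivors are scaled up, the average value stays near its original level even though roughly 40% of the values are zeroed.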

In [None]:
"""
import tensorflow as tf

class MNISTModel(object):
    # Model Initialization
    def __init__(self, input_dim, output_size):
        self.input_dim = input_dim
        self.output_size = output_size
    
    # Apply dropout to final layer
    def apply_dropout(self, dense, is_training):
        # applies dropout to the dense fully-connected layer
        dropout = tf.layers.dropout(dense, rate = 0.4, training = is_training)
        return dropout
        
    
    # CNN Layers
    def model_layers(self, inputs, is_training):
        reshaped_inputs = tf.reshape(
            inputs, [-1, self.input_dim, self.input_dim, 1])
        # Convolutional Layer #1
        conv1 = tf.layers.conv2d(
            inputs=reshaped_inputs,
            filters=32,
            kernel_size=[5, 5],
            padding='same',
            activation=tf.nn.relu,
            name='conv1')
        # Pooling Layer #1
        pool1 = tf.layers.max_pooling2d(
            inputs=conv1,
            pool_size=[2, 2],
            strides=2,
            name='pool1')
        # Convolutional Layer #2
        conv2 = tf.layers.conv2d(
            inputs=pool1,
            filters=64,
            kernel_size=[5, 5],
            padding='same',
            activation=tf.nn.relu,
            name='conv2')
        # Pooling Layer #2
        pool2 = tf.layers.max_pooling2d(
            inputs=conv2,
            pool_size=[2, 2],
            strides=2,
            name='pool2')
        # Dense Layer
        hwc = pool2.shape.as_list()[1:]
        flattened_size = hwc[0] * hwc[1] * hwc[2]
        pool2_flat = tf.reshape(pool2, [-1, flattened_size])
        dense = tf.layers.dense(pool2_flat, 1024,
            activation=tf.nn.relu, name='dense')
        # Apply Dropout
        dropout = self.apply_dropout(dense, is_training)

"""

### Logits

Once we have a fully-connected layer, we can extract multiclass logits from the CNN.

#### Multiclass logits

Since there are 10 possible digits an MNIST image can be, we use a 10-neuron fully-connected layer to obtain the logits for each digit class. 

Then, just like in multiclass classification, we can use a softmax to convert the logits to per-class probabilities. The labels are one-hot vectors, where the 'hot index' corresponds to the digit in the MNIST image. We can take the max probability to classify. 
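The softmax-then-max step can be sketched in plain Python. The logit values below are made up; subtracting the largest logit before exponentiating is a common numerical-stability trick and does not change the resulting probabilities.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for the 10 digit classes
logits = [0.1, 2.0, -1.0, 0.5, 0.0, 0.3, -0.2, 4.0, 1.0, 0.2]
probs = softmax(logits)

# The predicted digit is the index with the highest probability
prediction = probs.index(max(probs))
print(prediction)  # 7
```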

In [1]:
"""
import tensorflow as tf

class MNISTModel(object):
    # Model Initialization
    def __init__(self, input_dim, output_size):
        self.input_dim = input_dim
        self.output_size = output_size
    
    # Get logits from the dropout layer
    def get_logits(self, dropout):
        # CODE HERE
        logits = tf.layers.dense(dropout, self.output_size, name = 'logits')
        return logits
    
    # CNN Layers
    def model_layers(self, inputs, is_training):
        reshaped_inputs = tf.reshape(
            inputs, [-1, self.input_dim, self.input_dim, 1])
        # Convolutional Layer #1
        conv1 = tf.layers.conv2d(
            inputs=reshaped_inputs,
            filters=32,
            kernel_size=[5, 5],
            padding='same',
            activation=tf.nn.relu,
            name='conv1')
        # Pooling Layer #1
        pool1 = tf.layers.max_pooling2d(
            inputs=conv1,
            pool_size=[2, 2],
            strides=2,
            name='pool1')
        # Convolutional Layer #2
        conv2 = tf.layers.conv2d(
            inputs=pool1,
            filters=64,
            kernel_size=[5, 5],
            padding='same',
            activation=tf.nn.relu,
            name='conv2')
        # Pooling Layer #2
        pool2 = tf.layers.max_pooling2d(
            inputs=conv2,
            pool_size=[2, 2],
            strides=2,
            name='pool2')
        # Dense Layer
        hwc = pool2.shape.as_list()[1:]
        flattened_size = hwc[0] * hwc[1] * hwc[2]
        pool2_flat = tf.reshape(pool2, [-1, flattened_size])
        dense = tf.layers.dense(pool2_flat, 1024,
            activation=tf.nn.relu, name='dense')
        # Apply Dropout
        dropout = tf.layers.dropout(dense, rate=0.4,
            training=is_training)
        # Get and Return Logits
        return self.get_logits(dropout)

"""

In [None]:
### full code to get predictions

"""
def run_model_setup(self, inputs, labels, is_training):
    logits = self.model_layers(inputs, is_training)

    # convert logits to probabilities with softmax activation
    self.probs = tf.nn.softmax(logits, name='probs')
    # pick the class with the highest probability
    self.predictions = tf.argmax(
        self.probs, axis=-1, name='predictions')
    class_labels = tf.argmax(labels, axis=-1)
    # find which predictions were correct
    is_correct = tf.equal(
        self.predictions, class_labels)
    is_correct_float = tf.cast(
        is_correct,
        tf.float32)
    # compute the fraction of predictions that were correct
    self.accuracy = tf.reduce_mean(
        is_correct_float)
    # train model (is_training is the argument passed to this method)
    if is_training:
        labels_float = tf.cast(
            labels, tf.float32)
        # compute the loss using cross_entropy
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(
            labels=labels_float,
            logits=logits)
        self.loss = tf.reduce_mean(
            cross_entropy)
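        # use adam to train model
        # (self.global_step is assumed to be created elsewhere,
        # e.g. with tf.train.get_or_create_global_step())
        adam = tf.train.AdamOptimizer()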
        self.train_op = adam.minimize(
            self.loss, global_step=self.global_step)
"""