In [6]:
import tensorflow as tf

# Intro
We can do better if we already know something about our image.  If the data has structure and the network doesn't have to learn that structure from scratch, it will perform better.  An example is classifying letters that come in different colors.  The color doesn't affect the classification of the letter, so a network that only pays attention to the grayscale will perform better.

# Statistical Invariance
We have an image, and we want our network to say that it is an image with a cat in it.  The location of the cat shouldn't matter.  If the network has to learn about kittens in different positions independently, that is a lot of work for it to do.  Instead we explicitly tell it that objects in images are largely the same regardless of position.  This is called **translation invariance**: different positions, same kitten.  The same goes for text and the position of the word kitten in it.  We achieve this using **weight sharing**.  When we know that two inputs can contain the same kind of information, we share their weights and train the weights jointly for those inputs.  **Statistical invariants** are things that don't change on average across time or space.  For images, the idea of weight sharing leads us to convolutional networks.  For text, it leads us to embeddings and recurrent NNs.

# Convolutional Networks (Convnets)
**Neural networks that share their parameters across space**

Imagine an image as a flat pancake with a width, a height, and a depth (R, G, B). This is the input.  Now we take a tiny patch of the image and run a tiny NN on it with **k** outputs.  We slide that NN across the image, horizontally and vertically, without changing the weights.  On the output we have drawn a different image, with a different width, height, and, more importantly, depth: it has **k** color channels instead of 3.  This operation is called a convolution.  If the patch size is the whole image, it is no different from a regular NN layer.  Because the patch is smaller, we have fewer weights, and they are shared across space.  A convnet is basically a deep network where, instead of stacks of matrix multiply layers, we have stacks of convolutions.  It forms a pyramid: the bottom is the original image, which is shallow (RGB).  We apply convolutions that progressively squeeze the spatial dimensions while increasing the depth, which corresponds roughly to the semantic complexity of the representation.  On top we put the classifier.  By then we have a representation where all the spatial information has been squeezed out and only parameters that map to the content of the image remain.
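The sliding-patch idea above can be sketched in plain Python. This is only an illustration under simplifying assumptions (no bias, no activation, no padding); the function name `convolve` and the toy image and filter values are made up for the example and are not part of the original notes.

```python
def convolve(image, filters, stride=1):
    """Slide every filter over `image` without going past the edge.

    image:   [height][width][channels] nested lists
    filters: list of k filters, each shaped [patch_h][patch_w][channels]
    returns: [out_height][out_width][k], one output channel per filter
    """
    in_h, in_w = len(image), len(image[0])
    patch_h, patch_w = len(filters[0]), len(filters[0][0])
    output = []
    for top in range(0, in_h - patch_h + 1, stride):
        out_row = []
        for left in range(0, in_w - patch_w + 1, stride):
            pixel = []
            for f in filters:  # the same weights are reused at every position
                total = 0.0
                for i in range(patch_h):
                    for j in range(patch_w):
                        for c, w in enumerate(f[i][j]):
                            total += w * image[top + i][left + j][c]
                pixel.append(total)
            out_row.append(pixel)
        output.append(out_row)
    return output


# Hypothetical 4x4 image with 3 channels, every value 1.0
image = [[[1.0, 1.0, 1.0] for _ in range(4)] for _ in range(4)]
# k = 2 hypothetical filters over 2x2 patches: all ones, and all 0.5
filters = [
    [[[1.0] * 3 for _ in range(2)] for _ in range(2)],
    [[[0.5] * 3 for _ in range(2)] for _ in range(2)],
]
out = convolve(image, filters)
print(len(out), len(out[0]), len(out[0][0]))  # height, width, depth: 3 3 2
```

The output is spatially smaller than the input (3x3 instead of 4x4) and its depth equals the number of filters, which is the pyramid effect described above.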

Stride = 1 : roughly same size as input
Stride = 2 : about half the size.

Roughly, because it depends on how we handle edges. If we don't go past the edge, that is **'valid' padding**.  If we go off the edge and pad with zeros, that is **'same' padding**, which produces an output that is roughly the same size as the image (for a stride of 1).
![Convnet 1](Lesson8/convnet-1.png)
![Convnet 2](Lesson8/convnet-2.png)

# Intuition
A convolutional NN classifies a golden retriever much the same way we do.  It looks at certain parts of the image, such as the nose, eyes, etc., and then combines these features.  It does this starting at a lower level, though: it first identifies lines, then groups those together, and so on.  The cool thing is that the NN does all this learning by itself.  We don't tell it what to look for; it is able to pick out the traits that matter.

![Hierarchy](Lesson8/heirarchy-diagram.jpg)

# Filters
## Breaking up an Image
First we need to break up an image.  This is done by selecting a width and height that define a filter.  The filter looks at patches of that size.  We slide the filter around the image so it can examine different patches.  **Stride** is the amount by which the filter slides.  The stride is another parameter we can tune.  Increasing the stride reduces the size of the model because it reduces the total number of patches each layer observes.  This usually comes with a reduction in accuracy.
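To make the effect of stride concrete, here is a small sketch that lists where a filter lands along one axis. The helper name `patch_origins` and the sizes (a 10-pixel axis, a 4-pixel filter) are hypothetical, and no padding is assumed.

```python
def patch_origins(size, filter_size, stride):
    """Offsets along one axis where the filter fits without padding."""
    return list(range(0, size - filter_size + 1, stride))


for stride in (1, 2, 3):
    offsets = patch_origins(size=10, filter_size=4, stride=stride)
    # A square image has len(offsets) ** 2 patches in total.
    print(stride, offsets, len(offsets) ** 2)
```

Going from stride 1 to stride 2 drops the patch count on this toy image from 49 to 16, which is why larger strides shrink the next layer so quickly.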

An important thing about this is that we group together adjacent pixels and treat them as a collective.  We ignore this in a non-convolutional NN.  This is because a normal NN connects every pixel in the input image to a neuron in the next layer.  Pixels in images are close together for a reason.  The CNN learns to classify local patterns, like shapes and objects.

## Filter Depth
It's common to have more than one filter.  Different filters pick up different qualities of a patch (e.g. color, shape, etc.).  The number of filters is called the **filter depth**.

We connect each patch of pixels to **k** neurons in the next layer, where **k** is the filter depth.  So the next layer has a depth of **k**.  It is a hyperparameter that we can tune.

We connect a single patch to multiple neurons in the next layer because a patch can contain many interesting characteristics (e.g. tongue, teeth, fur).

**Same** padding equations:
```python
out_height = ceil(float(in_height) / float(strides[1]))
out_width  = ceil(float(in_width) / float(strides[2]))
```

**Valid** padding equations:
```python
out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
out_width  = ceil(float(in_width - filter_width + 1) / float(strides[2]))
```
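The two sets of equations above are written as pseudocode. A runnable version for a single axis might look like the following; the function name `conv_output_size` and the example sizes are illustrative assumptions, not part of the original notes.

```python
from math import ceil


def conv_output_size(in_size, filter_size, stride, padding):
    """Output length along one axis, following the equations above."""
    if padding == 'SAME':
        return ceil(in_size / stride)
    if padding == 'VALID':
        return ceil((in_size - filter_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


# Hypothetical 28-pixel axis with a 5-pixel filter
for stride in (1, 2):
    for padding in ('SAME', 'VALID'):
        print(stride, padding, conv_output_size(28, 5, stride, padding))
```

With 'SAME' padding the filter size does not appear in the result at all; only the stride decides how much the layer shrinks.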

# Convolutions Cont.
![Complete CNN](Lesson8/complete-cnn.png)
![Chain Rule With Sharing](Lesson8/chain-with-sharing.png)

# Parameters
## Parameter Sharing
We don't care where on the image a cat is present; we still want to be able to classify it as a cat.  The classification of a given patch is determined by the weights and biases corresponding to that patch.  If we want a cat that's in the top left patch to be classified the same way as a cat in the bottom right patch, we need the weights and biases to be the same.

That's what we do in a CNN.  The weights and biases we learn for a given output layer are shared across all patches in a given input layer. As we increase the depth of our filter, the number of weights and biases we have to learn still increases.  **Weights are not shared across the output channels.**

Additional benefit: If we did not reuse the same weights across all patches, we would have to learn new parameters for every single patch and hidden layer neuron pair.  That doesn't scale well. So sharing helps with translation invariance, and **gives us a smaller, more scalable model**.

## Padding
If we have a 5x5 grid and a 3x3 patch and do not use padding, our picture shrinks to 3x3 as well. We would like to maintain the same height and width, so we don't have to worry about dimensionality shrinking.  An easy way to achieve this is to add a border of 0's to the original image.  This expands the original image to 7x7, and the next layer image is 5x5 again.
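A small sketch of that zero border, assuming a stride of 1 and a single-channel grid; `zero_pad` is a hypothetical helper name, not a TensorFlow function.

```python
def zero_pad(grid, pad=1):
    """Return a copy of `grid` surrounded by a border of `pad` zeros."""
    width = len(grid[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]           # top border
    for row in grid:
        padded.append([0] * pad + list(row) + [0] * pad)  # left and right borders
    padded.extend([0] * width for _ in range(pad))        # bottom border
    return padded


grid = [[1] * 5 for _ in range(5)]  # 5x5 image
padded = zero_pad(grid)             # 7x7 image
patch = 3

# Number of 3x3 patch positions along one axis, with stride 1
positions_unpadded = len(grid) - patch + 1    # 3
positions_padded = len(padded) - patch + 1    # 5
print(len(padded), positions_unpadded, positions_padded)
```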

## Dimensionality
Given:
* our input layer has a width of **W** and a height of **H**
* our convolution layer has a filter size **F**
* we have a stride of **S**
* a padding of **P**
* and a filter depth of **K**

width of the next layer: **W_out = (W-F+2P)/S + 1**

height of the next layer: **H_out = (H-F+2P)/S + 1**

Depth of the next layer will be equal to the filter depth: **D_out = K**

Output volume: **W_out \* H_out \* D_out**

Knowing the dimensionality of each additional layer helps us understand how large our model is and how our decisions around filter size and stride affect the size of our network.
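As a rough aid, the formulas above can be wrapped in a helper. The name `next_layer_shape` is made up for this sketch, and it assumes that (W - F + 2P) and (H - F + 2P) divide evenly by S; the 28x28 example values are hypothetical.

```python
def next_layer_shape(W, H, F, S, P, K):
    """Return (H_out, W_out, D_out) using the formulas above."""
    if (W - F + 2 * P) % S or (H - F + 2 * P) % S:
        raise ValueError("filter does not tile the padded input evenly")
    w_out = (W - F + 2 * P) // S + 1
    h_out = (H - F + 2 * P) // S + 1
    return h_out, w_out, K


# The 5x5 grid with a 3x3 patch and a one-pixel zero border stays 5x5
# (the filter depth of 8 is an arbitrary choice for the example).
print(next_layer_shape(W=5, H=5, F=3, S=1, P=1, K=8))

# Hypothetical 28x28 input, 5x5 filters, stride 1, padding 2, 32 filters
h_out, w_out, d_out = next_layer_shape(W=28, H=28, F=5, S=1, P=2, K=32)
print(h_out * w_out * d_out)  # output volume
```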

## Quiz
**Setup**
H = height, W = width, D = depth
* We have an input of shape 32x32x3 (HxWxD)
* 20 filters of shape 8x8x3 (HxWxD)
* A stride of 2 for both the height and width (S)
* With padding of size 1 (P)

**Answer**
14x14x20

### Code
```python
input = tf.placeholder(tf.float32, (None, 32, 32, 3))
filter_weights = tf.Variable(tf.truncated_normal((8, 8, 3, 20)))  # (height, width, input_depth, output_depth)
filter_bias = tf.Variable(tf.zeros(20))
strides = [1, 2, 2, 1] # (batch, height, width, depth)
padding = 'SAME'
conv = tf.nn.conv2d(input, filter_weights, strides, padding) + filter_bias
```

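The shape of **conv** will be [1, 16, 16, 20] for a single-image batch.  It is 4D because it accounts for the batch size.  Why not [1, 14, 14, 20]?  TensorFlow uses a different padding algorithm: with **'SAME'** the output size is ceil(32 / 2) = 16.  With **'VALID'** it would produce [1, 13, 13, 20], since ceil((32 - 8 + 1) / 2) = 13.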

**SAME Padding**
```python
out_height = ceil(float(in_height) / float(strides[1]))

out_width = ceil(float(in_width) / float(strides[2]))
```

**VALID Padding**
```python
out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))

out_width = ceil(float(in_width - filter_width + 1) / float(strides[2]))
```
## Quiz: Number of Parameters
### Setup
H = height, W = width, D = depth

* We have an input of shape 32x32x3 (HxWxD)
* 20 filters of shape 8x8x3 (HxWxD)
* A stride of 2 for both the height and width (S)
* Zero padding of size 1 (P)

### Output Layer
* 14x14x20 (HxWxD)

### Hint
Without parameter sharing, each neuron in the output layer must connect to each neuron in the filter. In addition, each neuron in the output layer must also connect to a single bias neuron.

### Solution
(8\*8\*3+1)\*(14\*14\*20) = 756560

8 \* 8 \* 3 is the number of weights, we add 1 for the bias. Remember, each weight is assigned to every single part of the output (14 \* 14 \* 20). So we multiply these two numbers together and we get the final answer.

## Quiz: Parameter Sharing
This is the number of parameters actually used in a convolution layer (**tf.nn.conv2d()**)
### Setup
H = height, W = width, D = depth

* We have an input of shape 32x32x3 (HxWxD)
* 20 filters of shape 8x8x3 (HxWxD)
* A stride of 2 for both the height and width (S)
* Zero padding of size 1 (P)

### Output Layer
* 14x14x20 (HxWxD)

### Hint
With parameter sharing, each neuron in an output channel shares its weights with every other neuron in that channel. So the number of parameters is equal to the number of neurons in the filter, plus a bias neuron, all multiplied by the number of channels in the output layer.

### Solution
(8\*8\*3+1)\*(20) = 3860

That's 3840 weights and 20 biases. This should look similar to the answer from the previous quiz. The difference being it's just 20 instead of (14 \* 14 \* 20). Remember, with weight sharing we use the same filter for an entire depth slice. Because of this we can get rid of 14 \* 14 and be left with only 20.
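The arithmetic from the last two quizzes can be checked side by side. The variable names below are illustrative only; the numbers come from the quiz setup.

```python
filter_h, filter_w, in_depth = 8, 8, 3
out_h, out_w, out_depth = 14, 14, 20

# Each output neuron sees one 8x8x3 patch plus one bias.
params_per_neuron = filter_h * filter_w * in_depth + 1

# Without sharing: every neuron in the 14x14x20 output has its own parameters.
without_sharing = params_per_neuron * (out_h * out_w * out_depth)

# With sharing: one set of parameters per output channel.
with_sharing = params_per_neuron * out_depth

print(without_sharing, with_sharing, without_sharing // with_sharing)
```

The ratio between the two counts is exactly 14 \* 14 = 196, the number of spatial positions that share each filter.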

# Visualizing CNNs
From [Visualizing and Understanding Convolutional Networks](http://www.matthewzeiler.com/pubs/arxive2013/eccv2014.pdf).  We will see what each layer detects and how each layer detects more and more complex ideas.

## Layer 1
![Layer 1](Lesson8/layer-1-grid.png)

Patterns that cause activations in the first layer.  Range from simple diagonal lines to green blobs.  These are the patterns that the first layer recognizes. Top left shows a -45 degree angle, and top middle shows +45.

![Layer 1 activations](Lesson8/grid-layer-1.png)
Above images activate the -45 degree line.

## Layer 2
![Layer 2 grid](Lesson8/layer-2-grid.png)

The second layer captures more complex ideas.  Now we have circles and stripes.  **The CNN learns to do this on its own!**

## Layer 3
![Layer 3 grid](Lesson8/layer-3-grid.png)

## Layer 5
![Layer 5 grid](Lesson8/layer-5-grid.png)

# TensorFlow Convolution Layer
How to implement a CNN in TensorFlow
Uses **tf.nn.conv2d()** and **tf.nn.bias_add()**

In [10]:
# Output depth
k_output = 64

# Image Properties
image_width = 10
image_height = 10
color_channels = 3

# Convolution filter
filter_size_width = 5
filter_size_height = 5

# Input/Image
input = tf.placeholder(
    tf.float32,
    shape=[None, image_height, image_width, color_channels])

# Weight and bias
weight = tf.Variable(tf.truncated_normal(
    [filter_size_height, filter_size_width, color_channels, k_output]))
bias = tf.Variable(tf.zeros(k_output))

# Apply Convolution
conv_layer = tf.nn.conv2d(input, weight, strides=[1, 2, 2, 1], padding='SAME')
# Add bias
conv_layer = tf.nn.bias_add(conv_layer, bias)
# Apply activation function
conv_layer = tf.nn.relu(conv_layer)

TensorFlow takes one stride per **input** dimension: **[batch, input_height, input_width, input_channels]**. We generally set the strides for **batch** and **input_channels** to 1.

**tf.nn.bias_add()** adds a 1-d bias to the last dimension in a matrix.
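To illustrate what "adding a 1-d bias to the last dimension" means, here is a plain-Python sketch of the last two steps of the cell above on a tiny feature map. The functions `bias_add` and `relu` below are hand-written stand-ins for illustration, not the TensorFlow implementations, and the values are made up.

```python
def bias_add(feature_map, bias):
    """Add bias[c] to channel c of every pixel in a [height][width][channels] list."""
    return [[[v + b for v, b in zip(pixel, bias)] for pixel in row]
            for row in feature_map]


def relu(feature_map):
    """Replace every negative value with zero."""
    return [[[max(0.0, v) for v in pixel] for pixel in row]
            for row in feature_map]


feature_map = [[[1.0, -2.0], [0.5, 0.0]]]  # 1x2 feature map with 2 channels
bias = [0.5, 1.0]                          # one bias per channel
activated = relu(bias_add(feature_map, bias))
print(activated)
```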

# Explore the Design Space
## Advanced Convnet-ology
* Pooling
* 1x1 Convolutions
* Inception

## Pooling
Pooling is a better way to reduce the spatial extent of the feature maps in the convolutional pyramid.  We currently use striding, which is a very aggressive way to downsample an image.  It removes a lot of information.

Why use pooling? If we use a stride of 1, the output image never gets smaller; it stays the same size as the previous layer.  This is bad because it makes our model larger.  If we use a stride of 2 or more, we tend to lose a lot of information.  So the solution is to use pooling.  In one layer we use a stride of 1.  Then in the next layer we apply pooling: we look at neighboring pixels and combine them (using max, average, etc.) to produce a smaller image.
![Pooling](Lesson8/pooling.png)

![Max Pooling](Lesson8/max-pooling.png)

Pooling is more expensive to compute, because the convolutions that precede it run at a smaller stride.

### Typical Architecture
![CNN Architecture](Lesson8/cnn-arch.png)

# TensorFlow Max Pooling
![Max Pooling](Lesson8/max-pooling-2.png)
By Aphex34 (Own work) [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons

**Max pooling with a 2x2 filter and stride of 2**

The benefit of the max pooling operation is that it reduces the size of the input and allows the NN to focus on only the most important elements.  It does this by retaining only the max value in each filter area and removing the remaining values.
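A minimal sketch of that operation on a single-channel grid, in plain Python rather than TensorFlow. The function name `max_pool` and the grid values are made up for the example, and no padding is assumed.

```python
def max_pool(grid, size=2, stride=2):
    """Keep only the largest value in each size x size window."""
    pooled = []
    for top in range(0, len(grid) - size + 1, stride):
        row = []
        for left in range(0, len(grid[0]) - size + 1, stride):
            window = [grid[top + i][left + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


grid = [
    [1, 0, 2, 3],
    [4, 6, 6, 8],
    [3, 1, 1, 0],
    [1, 2, 2, 4],
]
pooled = max_pool(grid)
print(pooled)  # [[6, 8], [3, 4]]
```

Each 2x2 window contributes one number, so the 4x4 grid becomes 2x2 and three of every four values are discarded.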

TensorFlow provides **tf.nn.max_pool()**

In [14]:
conv_layer = tf.nn.conv2d(input, weight, strides=[1, 2, 2, 1], padding='SAME')
conv_layer = tf.nn.bias_add(conv_layer, bias)
conv_layer = tf.nn.relu(conv_layer)
# Apply Max Pooling
conv_layer = tf.nn.max_pool(
    conv_layer,
    ksize=[1, 2, 2, 1],
    strides=[1, 2, 2, 1],
    padding='SAME')

**tf.nn.max_pool()** performs max pooling with **ksize** as the size of the filter and **strides** as the length of the stride.  A 2x2 filter with a stride of 2x2 is common in practice.  The params are structured as 4-element lists, with each element corresponding to a dimension of the input tensor ([batch, height, width, channels]).  The batch and channels entries are typically set to 1.

Recently, pooling layers have fallen out of favor because:
* Recent datasets are so big and complex we're more concerned about underfitting.
* Dropout is a much better regularizer.
* Pooling results in a loss of information. Think about the max pooling operation as an example. We only keep the largest of n numbers, thereby disregarding n-1 numbers completely.

## Quiz: Pooling Mechanics
### Setup
H = height, W = width, D = depth

* We have an input of shape 4x4x5 (HxWxD)
* Filter of shape 2x2 (HxW)
* A stride of 2 for both the height and width (S)

Recall the formula for calculating the new height or width:
```python
new_height = (input_height - filter_height)/S + 1
new_width = (input_width - filter_width)/S + 1
```

**What's the shape of the output? Format is HxWxD.**
2x2x5

### Code
Output shape of **pool** will be [1, 2, 2, 5] even if **padding** is changed to **'SAME'**.

In [15]:
input = tf.placeholder(tf.float32, (None, 4, 4, 5))
filter_shape = [1, 2, 2, 1]
strides = [1, 2, 2, 1]
padding = 'VALID'
pool = tf.nn.max_pool(input, filter_shape, strides, padding)
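The claim that the shape does not change with the padding mode can be checked by hand. This sketch reuses the quiz formula and the SAME/VALID equations from earlier in these notes; the variable names are illustrative.

```python
from math import ceil

in_size, filter_size, stride, depth = 4, 2, 2, 5

quiz_formula = (in_size - filter_size) // stride + 1      # formula recalled in the quiz
valid = ceil((in_size - filter_size + 1) / stride)        # 'VALID': ceil(1.5)
same = ceil(in_size / stride)                             # 'SAME': ceil(2.0)

# Pooling works on each channel separately, so the depth is unchanged.
print([1, valid, valid, depth], [1, same, same, depth], quiz_formula)
```

All three give 2 here because the 2x2 filter with stride 2 tiles the 4x4 input exactly, so 'SAME' has nothing to pad.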

# 1x1 Convolution
![1x1 convolutions](Lesson8/1x1-conv.png)
* We have a mini NN over a patch instead of a linear classifier
* Inexpensive way to make the model deeper and give it more parameters without completely changing its structure
* Very cheap: they are just matrix multiplies and have relatively few params
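To see why a 1x1 convolution is "just a matrix multiply", here is a sketch that applies the same small weight matrix to the channel vector of every pixel. The name `conv_1x1` and all values are hypothetical; bias and activation are left out.

```python
def conv_1x1(feature_map, weights):
    """Apply one [in_depth][out_depth] matrix to every pixel's channel vector.

    feature_map: [height][width][in_depth] nested lists
    returns:     [height][width][out_depth] nested lists
    """
    out_depth = len(weights[0])
    return [[[sum(pixel[c] * weights[c][k] for c in range(len(pixel)))
              for k in range(out_depth)]
             for pixel in row]
            for row in feature_map]


feature_map = [[[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]]  # 1x2 map with depth 3
weights = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0]]                               # depth 3 -> depth 2
out = conv_1x1(feature_map, weights)
print(out)  # [[[4.0, 5.0], [0.0, 1.0]]]
```

The layer has only in_depth \* out_depth weights (6 in this toy case), regardless of the width and height of the feature map.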

# Inception Modules
![Inception Model](Lesson8/inception.png)

# Convolution Network in TensorFlow
The structure of this network follows the classic structure of CNNs, which is a mix of convolution layers and max pooling, followed by fully-connected layers.  Thanks to [Aymeric Damien](https://github.com/aymericdamien/TensorFlow-Examples) for providing the original TensorFlow model.

## Dataset
We will again be exploring the MNIST dataset.

In [3]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets(".", one_hot=True, reshape=False)

import tensorflow as tf

# Parameters
learning_rate = 0.00001
epochs = 10
batch_size = 128

# Number of samples to calculate validation and accuracy
# Decrease this if you're running out of memory to calculate accuracy
test_valid_size = 256

# Network Parameters
n_classes = 10 # MNIST total classes (0-9 digits)
dropout = 0.75 # Dropout, probability to keep units

# Store layers weight and bias
weights = {
    'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32])),
    'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64])),
    'wd1': tf.Variable(tf.random_normal([7*7*64, 1024])),
    'out': tf.Variable(tf.random_normal([1024, n_classes]))}

biases = {
    'bc1': tf.Variable(tf.random_normal([32])),
    'bc2': tf.Variable(tf.random_normal([64])),
    'bd1': tf.Variable(tf.random_normal([1024])),
    'out': tf.Variable(tf.random_normal([n_classes]))}

def conv2d(x, W, b, strides=1):
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
    x = tf.nn.bias_add(x, b)
    return tf.nn.relu(x)

def maxpool2d(x, k=2):
    return tf.nn.max_pool(
        x,
        ksize=[1, k, k, 1],
        strides=[1, k, k, 1],
        padding='SAME')

def conv_net(x, weights, biases, dropout):
    # Layer 1 - 28*28*1 to 14*14*32
    conv1 = conv2d(x, weights['wc1'], biases['bc1'])
    conv1 = maxpool2d(conv1, k=2)
    
    # Layer 2 - 14*14*32 to 7*7*64
    conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])
    conv2 = maxpool2d(conv2, k=2)
    
    # Fully connected layer - 7*7*64 to 1024
    fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
    fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])
    fc1 = tf.nn.relu(fc1)
    fc1 = tf.nn.dropout(fc1, dropout)
    
    # Output Layer - class prediction - 1024 to 10
    out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
    return out

# tf Graph input
x = tf.placeholder(tf.float32, [None, 28, 28, 1])
y = tf.placeholder(tf.float32, [None, n_classes])
keep_prob = tf.placeholder(tf.float32)

# Model
logits = conv_net(x, weights, biases, keep_prob)

# Define loss and optimizer
cost = tf.reduce_mean(\
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)\
    .minimize(cost)
    
# Accuracy
correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

# Initialize variables
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)
    
    for epoch in range(epochs):
        for batch in range(mnist.train.num_examples//batch_size):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            sess.run(optimizer, feed_dict={
                x: batch_x,
                y: batch_y,
                keep_prob: dropout})
            
            # Calculate batch loss and accuracy
            loss = sess.run(cost, feed_dict={
                x: batch_x,
                y: batch_y,
                keep_prob: 1.})
            valid_acc = sess.run(accuracy, feed_dict={
                x: mnist.validation.images[:test_valid_size],
                y: mnist.validation.labels[:test_valid_size],
                keep_prob: 1.})
            
            print('Epoch {:>2}, Batch {:>3} -'
                  'Loss: {:>10.4f} Validation Accuracy: {:.6f}'.format(
                  epoch + 1,
                  batch + 1,
                  loss,
                  valid_acc))
            
        # Calculate Test Accuracy
        test_acc = sess.run(accuracy, feed_dict={
            x: mnist.test.images[:test_valid_size],
            y: mnist.test.labels[:test_valid_size],
            keep_prob: 1.})
        print('Testing Accuracy: {}'.format(test_acc))

Extracting .\train-images-idx3-ubyte.gz
Extracting .\train-labels-idx1-ubyte.gz
Extracting .\t10k-images-idx3-ubyte.gz
Extracting .\t10k-labels-idx1-ubyte.gz
Epoch  1, Batch   1 -Loss: 59164.2656 Validation Accuracy: 0.085938
Epoch  1, Batch   2 -Loss: 43841.5273 Validation Accuracy: 0.093750
Epoch  1, Batch   3 -Loss: 42512.7734 Validation Accuracy: 0.074219
Epoch  1, Batch   4 -Loss: 36167.4297 Validation Accuracy: 0.066406
Epoch  1, Batch   5 -Loss: 38407.0078 Validation Accuracy: 0.074219
Epoch  1, Batch   6 -Loss: 29824.0762 Validation Accuracy: 0.082031
Epoch  1, Batch   7 -Loss: 28880.4922 Validation Accuracy: 0.105469
Epoch  1, Batch   8 -Loss: 28233.3965 Validation Accuracy: 0.109375
Epoch  1, Batch   9 -Loss: 27130.5859 Validation Accuracy: 0.136719
Epoch  1, Batch  10 -Loss: 28684.4922 Validation Accuracy: 0.128906
Epoch  1, Batch  11 -Loss: 25472.8418 Validation Accuracy: 0.109375
Epoch  1, Batch  12 -Loss: 25876.5625 Validation Accuracy: 0.125000
Epoch  1, Batch  13 -Loss:
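The shape comments in `conv_net` (28\*28\*1 to 14\*14\*32 to 7\*7\*64) can be traced with the 'SAME' padding equation from earlier in these notes. This is a sketch; `same_size` is a hypothetical helper, and it assumes the convolutions use stride 1 and the pooling uses k=2, as in the code above.

```python
from math import ceil


def same_size(size, stride):
    """Output length along one axis under 'SAME' padding."""
    return ceil(size / stride)


size = 28                      # MNIST images are 28x28x1
size = same_size(size, 1)      # conv1, stride 1: stays 28
size = same_size(size, 2)      # maxpool, k=2: 14
after_layer_1 = size
size = same_size(size, 1)      # conv2, stride 1: stays 14
size = same_size(size, 2)      # maxpool, k=2: 7
after_layer_2 = size

flattened = after_layer_2 * after_layer_2 * 64  # input width of 'wd1'
print(after_layer_1, after_layer_2, flattened)
```

The flattened size matches the first dimension of `weights['wd1']`, which is declared as 7\*7\*64.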

# Quiz: TensorFlow Convolution Layer
## My Solution

In [4]:
"""
Setup the strides, padding and filter weight/bias such that
the output shape is (1, 2, 2, 3).
"""
import tensorflow as tf
import numpy as np

# `tf.nn.conv2d` requires the input be 4D (batch_size, height, width, depth)
# (1, 4, 4, 1)
x = np.array([
    [0, 1, 0.5, 10],
    [2, 2.5, 1, -8],
    [4, 0, 5, 6],
    [15, 1, 2, 3]], dtype=np.float32).reshape((1, 4, 4, 1))
X = tf.constant(x)


def conv2d(input):
    # Filter (weights and bias)
    # The shape of the filter weight is (height, width, input_depth, output_depth)
    # The shape of the filter bias is (output_depth,)
    # TODO: Define the filter weights `F_W` and filter bias `F_b`.
    # NOTE: Remember to wrap them in `tf.Variable`, they are trainable parameters after all.
    F_W = tf.Variable(tf.truncated_normal([4, 4, 1, 3]))
    F_b = tf.Variable(tf.truncated_normal([3]))
    # TODO: Set the stride for each dimension (batch_size, height, width, depth)
    strides = [1, 2, 2, 1]
    # TODO: set the padding, either 'VALID' or 'SAME'.
    padding = 'SAME'
    # https://www.tensorflow.org/versions/r0.11/api_docs/python/nn.html#conv2d
    # `tf.nn.conv2d` does not include the bias computation so we have to add it ourselves after.
    return tf.nn.conv2d(input, F_W, strides, padding) + F_b

out = conv2d(X)

## Their Solution

In [5]:
def conv2d(input):
    # Filter (weights and bias)
    F_W = tf.Variable(tf.truncated_normal((2, 2, 1, 3)))
    F_b = tf.Variable(tf.zeros(3))
    strides = [1, 2, 2, 1]
    padding = 'VALID'
    return tf.nn.conv2d(input, F_W, strides, padding) + F_b

I want to transform the input shape (1, 4, 4, 1) to (1, 2, 2, 3). I choose 'VALID' for the padding algorithm. I find it simpler to understand and it achieves the result I'm looking for.
```python
out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
out_width  = ceil(float(in_width - filter_width + 1) / float(strides[2]))
```

Plugging in the values:

```python
out_height = ceil(float(4 - 2 + 1) / float(2)) = ceil(1.5) = 2
out_width  = ceil(float(4 - 2 + 1) / float(2)) = ceil(1.5) = 2
```
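Both solutions above reach the required 2x2 spatial size, just by different routes. The following sketch checks this with the padding equations from these notes; the variable names are illustrative.

```python
from math import ceil

in_size, stride = 4, 2

# "My Solution": 4x4 filter with 'SAME' padding (filter size does not matter)
mine = ceil(in_size / stride)

# "Their Solution": 2x2 filter with 'VALID' padding
theirs = ceil((in_size - 2 + 1) / stride)

# For comparison: the 4x4 filter with 'VALID' padding would be too small
too_small = ceil((in_size - 4 + 1) / stride)

print(mine, theirs, too_small)
```

In both solutions the output depth of 3 comes from the last dimension of the filter weights.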

In [20]:
import math
math.ceil(float(10 - 2 + 1) / float(2))


5