In [1]:
import numpy as np
import tensorflow as tf
from __future__ import division, print_function, unicode_literals

# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

# Tensorflow and Deep Learning

In this lab assignment, first you will learn how to build and train a neural network that recognises handwritten digits, and then you will build LeNet-5 CNN architecture, which is widely used for handwritten digit recognition. At the end of this lab assignment, you will make AlexNet CNN architecture, which won the 2012 ImageNet ILSVRC challenge.

---
# 1. Dataset
In the first part of the assignment, we use the MNIST dataset, which is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents. There are 70,000 images, and each image has 784 features. This is because each image is 28×28=784 pixels, and each feature simply represents one pixel's intensity, from 0 (white) to 255 (black). The following figure shows a few images from the MNIST dataset to give you a feel for the complexity of the classification task.

<img src="figs/1-mnist.png" style="width: 300px;"/>

To begin the assignment, first, use `mnist_data.read_data_sets` and download images and labels. It return two lists, called `mnist.test` with 10K images+labels, and `mnist.train` with 60K images+labels.

In [2]:
# TODO: Replace <FILL IN> with appropriate code

from tensorflow.examples.tutorials.mnist import input_data as mnist_data

mnist = mnist_data.read_data_sets("MNIST_data/", one_hot = True, reshape=False, validation_size=0)

# (x_train, y_train),(x_test, y_test) = mnist.load_data()

Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
Instructions for updating:
Please write your own downloading logic.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting MNIST_data/train-images-idx3-ubyte.gz
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Instructions for updating:
Please use tf.one_hot on tensors.
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.


---
# 2. A One-Layer Neural Network
<img src="figs/2-comic1.png" style="width: 500px;"/>

Let's start by building a one-layer neural network. Handwritten digits in the MNIST dataset are 28x28 pixel greyscale images. The simplest approach for classifying them is to use the 28x28=784 pixels as inputs for a **one-layer neural network**. Each neuron in the network does a weighted sum of all of its inputs, adds a bias and then feeds the result through some non-linear activation function. Here we design a one-layer neural network with 10 output neurons since we want to classify digits into 10 classes (0 to 9).
<img src="figs/3-one_layer.png" style="width: 400px;"/>


For a classification problem, an *activation function* that works well is **softmax**. Applying softmax on a vector is done by taking the exponential of each element and then normalising the vector.
<img src="figs/4-softmax.png" style="width: 300px;"/>

We can summarise the behaviour of this single layer of neurons into a simple formula using a *matrix multiply*. If we give input data into the network in *mini-batch* of 100 images, it produces 100 predictions as the output. We define the **weights matrix $W$** with 10 columns, in which each column indicates the weight of a one class (a single digit), from 0 to 9. Using the first column of $W$, we can compute the weighted sum of all the pixels of the first image. This sum corresponds to the first neuron that points to the number 0. Using the second column of $W$, we do the same for the second neuron (number 1) and so on until the 10th neuron. We can then repeat the operation for the remaining 99 images in the mini-batch. If we call $X$ the matrix containing our 100 images (each row corresponds to one digit), all the weighted sums for our 10 neurons, computed on 100 images are simply $X.W$. Each neuron must now add its bias. Since we have 10 neurons, we have 10 bias constants. We finally apply the **softmax activation function** and obtain the formula describing a one-layer neural network, applied to 100 images.
<img src="figs/5-xw.png" style="width: 600px;"/>
<img src="figs/6-softmax2.png" style="width: 500px;"/>

Then, we need to use the **cross-entropy** to measure how good the predictions are, i.e., the distance between what the network tells us and what we know to be the truth. The cross-entropy is a function of weights, biases, pixels of the training image and its known label. If we compute the partial derivatives of the cross-entropy relatively to all the weights and all the biases, we obtain a **gradient**, computed for a given image, label and present value of weights and biases. We can update weights and biases by a fraction of the gradient and do the same thing again using the next batch of training images.
<img src="figs/7-cross_entropy.png" style="width: 600px;"/>

### Define Variables and Placeholders
First we define TensorFlow **variables** and **placeholders**. *Variables* are all the parameters that you want the training algorithm to determine for you (e.g., weights and biases). *Placeholders* are parameters that will be filled with actual data during training (e.g., training images). The shape of the tensor holding the training images is [None, 28, 28, 1] which stands for:
  - 28, 28, 1: our images are 28x28 (784) pixels x 1 value per pixel (grayscale). The last number would be 3 for color images and is not really necessary here.
  - None: this dimension will be the number of images in the mini-batch. It will be known at training time.

We also need an additional placeholder for the training labels that will be provided alongside training images.

In [3]:
# TODO: Replace <FILL IN> with appropriate code
batch_size = 100
# neural network with 1 layer of 10 softmax neurons
#
# · · · · · · · · · ·       (input data, flattened pixels)       X [batch, 784] 
# \x/x\x/x\x/x\x/x\x/    -- fully connected layer (softmax)      W [784, 10]     b[10]
#   · · · · · · · ·                                              Y_hat [batch, 10]

# input X: 28x28 grayscale images, the first dimension (None) will index the images in the mini-batch
X = tf.placeholder(tf.float32, shape = (None, 784))

# correct answers will go here
Y = tf.placeholder(tf.float32, shape = (None, 10))

# weights W[784, 10], 784 = 28 * 28
W = tf.Variable(tf.random_uniform(
    [784, 10],
    minval=0,
    maxval=1,
    dtype=tf.float32))

# biases b[10]
b = tf.Variable(tf.random_uniform(
    [10],
    minval=0,
    maxval=1,
    dtype=tf.float32))


### Build The Model
Now, we can make a **model** for a one-layer neural network. The formula is the one we explained before, i.e., $\hat{Y} = softmax(X . W + b)$. You can use the `tf.nn.softmax` and `tf.matmul` to build the model. Here, we need to use the `tf.reshape` to transform our 28x28 images into single vectors of 784 pixels.

In [4]:
# TODO: Replace <FILL IN> with appropriate code

# flatten the images into a single line of pixels
XX = tf.reshape(X, [60000, 784])

# The model
Y_hat = tf.nn.softmax(tf.matmul(X, W) + b)

### Define The Cost Function
Now, we have model predictions $\hat{Y}$ and correct labels $Y$, so for each instance $i$ (image) we can compute the cross-entropy as the **cost function**: $cross\_entropy = -\sum(Y_i * log(\hat{Y}i))$. You can use `reduce_mean` to add all the components in a tensor.

In [5]:
# TODO: Replace <FILL IN> with appropriate code

# cross_entropy = -1 * (tf.to_float(60000 * tf.reduce_mean(Y)) * tf.math.log(tf.to_float(60000 * tf.reduce_mean(Y_hat))))

cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits = Y_hat, labels = Y)
cost = tf.reduce_mean(cross_entropy)

Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See `tf.nn.softmax_cross_entropy_with_logits_v2`.



### Traine the Model
Now, select the gradient descent optimiser `GradientDescentOptimizer` and ask it to minimise the cross-entropy cost. In this step, TensorFlow computes the partial derivatives of the cost function relatively to all the weights and all the biases (the gradient). The gradient is then used to update the weights and biases. Set the learning rate is $0.005$.

In [6]:
# TODO: Replace <FILL IN> with appropriate code

optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.005)
train_step = optimizer.minimize(cross_entropy)

### Execute the Model
It is time to run the training loop. All the TensorFlow instructions up to this point have been preparing a computation graph in memory but nothing has been computed yet. The computation requires actual data to be fed into the placeholders. This is supplied in the form of a Python dictionary, where the keys are the names of the placeholders. During the trainig print out the cost every 200 steps. Moreove, after training the model, print out the accurray of the model by testing it on the test data.

In [7]:
# TODO: Replace <FILL IN> with appropriate code

# init
init = tf.global_variables_initializer()

n_epochs = 5000
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs): 
        batch_x, batch_y = mnist.train.next_batch(batch_size)
        if (epoch % 200 == 0):
            print(cost.eval(session = sess, feed_dict = {X: batch_x, Y: batch_y}))
        
        sess.run(train_step, feed_dict = {X: batch_x, Y: batch_y})
        
    correct_predictions = tf.equal(tf.argmax(Y_hat, 1), tf.argmax(Y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"))
    
    print("Accuracy: {}".format(sess.run(accuracy, feed_dict = {X: mnist.test.images, Y: mnist.test.labels})))

2.28437
1.84385
1.70272
1.65363
1.64154
1.6061
1.59273
1.57056
1.55547
1.6015
1.59846
1.55137
1.5963
1.57656
1.59236
1.56271
1.57591
1.58269
1.60141
1.52589
1.57432
1.56226
1.5606
1.52749
1.52336
Accuracy: 0.9174000024795532


---
# 2. Add More Layers

<img src="figs/8-comic2.png" style="width: 500px;"/>

Now, let's improve the recognition accuracy by adding more layers to the neural network. The neurons in the second layer, instead of computing weighted sums of pixels will compute weighted sums of neuron outputs from the previous layer. We keep the softmax function as the activation function on the last layer, but on intermediate layers we will use the the **sigmoid** activation function. So, let's build a five-layer fully connected neural network with the following structure, and train the model with the trainging data and print out its accuracy on the test data.
<img src="figs/9-five_layer.png" style="width: 500px;"/>

In [161]:
# TODO: Replace <FILL IN> with appropriate code

# neural network with five layers
#
# · · · · · · · · · ·          (input data, flattened pixels)       X [batch, 784]   
# \x/x\x/x\x/x\x/x\x/       -- fully connected layer (sigmoid)      W1 [784, 200]      B1 [200]
#  · · · · · · · · ·                                                Y1_hat [batch, 200]
#   \x/x\x/x\x/x\x/         -- fully connected layer (sigmoid)      W2 [200, 100]      B2 [100]
#    · · · · · · ·                                                  Y2_hat [batch, 100]
#     \x/x\x/x\x/           -- fully connected layer (sigmoid)      W3 [100, 60]       B3 [60]
#      · · · · ·                                                    Y3_hat [batch, 60]
#       \x/x\x/             -- fully connected layer (sigmoid)      W4 [60, 30]        B4 [30]
#        · · ·                                                      Y4_hat [batch, 30]
#         \x/               -- fully connected layer (softmax)      W5 [30, 10]        B5 [10]
#          ·                                                        Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()

########################################
# define variables and placeholders
########################################
X = tf.placeholder(tf.float32, shape=(None, 784))
Y = tf.placeholder(tf.int64, shape=(None))

# five layers and their number of neurons, i.e., 200, 100, 60, 30, and 10
W1 = tf.get_variable("weights1", dtype=tf.float32, 
                    initializer=tf.random_uniform([784, 200]))
B1 = tf.get_variable("bias1", dtype=tf.float32, initializer=tf.random_uniform([200]))

W2 = tf.get_variable("weights2", dtype=tf.float32, 
                    initializer=tf.random_uniform([200, 100]))
B2 = tf.get_variable("bias2", dtype=tf.float32, initializer=tf.random_uniform([100]))

W3 = tf.get_variable("weights3", dtype=tf.float32, 
                    initializer=tf.random_uniform([100, 60]))
B3 = tf.get_variable("bias3", dtype=tf.float32, initializer=tf.random_uniform([60]))

W4 = tf.get_variable("weights4", dtype=tf.float32, 
                    initializer=tf.random_uniform([60, 30]))
B4 = tf.get_variable("bias4", dtype=tf.float32, initializer=tf.random_uniform([30]))

W5 = tf.get_variable("weights5", dtype=tf.float32, 
                    initializer=tf.random_uniform([30, 10]))
B5 = tf.get_variable("bias5", dtype=tf.float32, initializer=tf.random_uniform([10]))

########################################
# build the model
########################################
#XX = tf.reshape([])

Y1_hat = tf.nn.sigmoid(tf.matmul(X, W1) + B1)
Y2_hat = tf.nn.sigmoid(tf.matmul(Y1_hat, W2) + B2)
Y3_hat = tf.nn.sigmoid(tf.matmul(Y2_hat, W3) + B3)
Y4_hat = tf.nn.sigmoid(tf.matmul(Y3_hat, W4) + B4)
Y_hat = tf.nn.softmax(tf.matmul(Y4_hat, W5) + B5)

########################################
# define the cost function
########################################
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = Y_hat, labels = Y))

########################################
# define the optimizer
########################################
optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.005)
train_step = optimizer.minimize(cross_entropy)

########################################
# execute the model
########################################
init = tf.global_variables_initializer()

batch_size = 100
n_epochs = 5000
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs): 
        batch_x, batch_y = mnist.train.next_batch(batch_size)
        if (epoch % 200 == 0):
            print(cross_entropy.eval(session = sess, feed_dict = {X: batch_x, Y: batch_y}))
        else: 
            sess.run(train_step, feed_dict = {X: batch_x, Y: batch_y})
        
    correct_predictions = tf.equal(tf.argmax(Y_hat, 1), tf.argmax(Y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"))
    
    print("Accuracy: {}".format(sess.run(accuracy, feed_dict = {X: mnist.test.images, Y: mnist.test.labels})))

2.30892
2.31485
2.31197
2.31044
2.30698
2.31267


KeyboardInterrupt: 

---
# 3. Special Care for Deep Networks
As layers were added, neural networks tended to converge with more difficulties. For example, the accuracy could stuck at 0.1. Here, we want to apply some updates to the network we built in the previous part to improve its performance. 

### ReLU Activation Function
<img src="figs/10-comic3.png" style="width: 500px;"/>
The sigmoid activation function is actually quite problematic in deep networks. It squashes all values between 0 and 1 and when you do so repeatedly, neuron outputs and their gradients can vanish entirely. An alternative activation function is **ReLU** that shows better performance compare to sigmoid. It looks like as below:
<img src="figs/11-relu.png" style="width: 300px;"/>

### A Better Optimizer
In very high dimensional spaces like here, **saddle points** are frequent. These are points that are not local minima, but where the gradient is nevertheless zero and the gradient descent optimizer stays stuck there. One possible solution to tackle this probelm is to use better optimizers, such as Adam optimizer `tf.train.AdamOptimizer`.

### Random Initialisations
When working with ReLUs, the best practice is to initialise bias values to small positive values, so that neurons operate in the non-zero range of the ReLU initially.

### Learning Rate
<img src="figs/12-comic4.png" style="width: 500px;"/>
With two, three or four intermediate layers, you can now get close to 98% accuracy, if you push the iterations to 5000 or beyond. But, the results are not very consistent, and the curves jump up and down by a whole percent. A good solution is to start fast and decay the learning rate exponentially from $0.005$ to $0.0001$ for example. In order to pass a different learning rate to the `AdamOptimizer` at each iteration, you will need to define a new placeholder and feed it a new value at each iteration through `feed_dict`. Here is the formula for exponential decay: $learning\_rate = lr\_min + (lr\_max - lr\_min) * e^{\frac{-i}{2000}}$, where $i$ is the iteration number.

### NaN?
In the network you built in the last section, you might see accuracy curve crashes and the console outputs NaN for the cross-entropy. It may happen, because you are attempting to compute a $log(0)$, which is indeed Not A Number (NaN). Remember that the cross-entropy involves a log, computed on the output of the softmax layer. Since softmax is essentially an exponential, which is never zero, we should be fine, but with 32 bit precision floating-point operations, exp(-100) is already a genuine zero. TensorFlow has a handy function that computes the softmax and the cross-entropy in a single step, implemented in a numerically stable way. To use it, you will need to separate the weighted sum plus bias on the last layer, before softmax is applied and then give it with the true values to the function `tf.nn.softmax_cross_entropy_with_logits`.

In the code below, apply the following changes and show their impact on the accuracy of the model on training data, as well as the test data:
* Replace the sigmoid activation function with ReLU
* Use the Adam optimizer
* Initialize weights with small random values between -0.2 and +0.2, and make sure biases are initialised with small positive values, for example 0.1
* Update the learning rate in different iterations. Start fast and decay the learning rate exponentially from $0.005$ to $0.0001$, i.e., 
```
max_learning_rate = 0.005
min_learning_rate = 0.0001
decay_speed = 2000.0
```
* Use `tf.nn.softmax_cross_entropy_with_logits` to prevent getting NaN in output.

In [8]:
tf.__version__

'1.12.0'

In [35]:
# TODO: Replace <FILL IN> with appropriate code

# neural network with 5 layers
#
# · · · · · · · · · ·          (input data, flattened pixels)       X [batch, 784]   
# \x/x\x/x\x/x\x/x\x/       -- fully connected layer (sigmoid)      W1 [784, 200]      B1[200]
#  · · · · · · · · ·                                                Y1_hat [batch, 200]
#   \x/x\x/x\x/x\x/         -- fully connected layer (sigmoid)      W2 [200, 100]      B2[100]
#    · · · · · · ·                                                  Y2_hat [batch, 100]
#     \x/x\x/x\x/           -- fully connected layer (sigmoid)      W3 [100, 60]       B3[60]
#      · · · · ·                                                    Y3_hat [batch, 60]
#       \x/x\x/             -- fully connected layer (sigmoid)      W4 [60, 30]        B4[30]
#        · · ·                                                      Y4_hat [batch, 30]
#         \x/               -- fully connected layer (softmax)      W5 [30, 10]        B5[10]
#          ·                                                        Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()

########################################
# define variables and placeholders
########################################
X = tf.placeholder(tf.float32, shape=(None, 784))
Y = tf.placeholder(tf.int64, shape=(None))

# variable learning rate
learning_rate = tf.placeholder(tf.float32, shape=[])

# five layers and their number of neurons, i.e., 200, 100, 60, 30, and 10
# when using RELUs, make sure biases are initialised with small positive values, for example 0.1
W1 = tf.get_variable("weights1", dtype=tf.float32, 
                    initializer=tf.random_uniform([784, 200], minval=-0.5,
    maxval=0.5))
B1 = tf.get_variable("bias1", dtype=tf.float32, initializer=tf.random_uniform([200], minval=0,
    maxval=1))

W2 = tf.get_variable("weights2", dtype=tf.float32, 
                    initializer=tf.random_uniform([200, 100], minval=-0.5,
    maxval=0.5))
B2 = tf.get_variable("bias2", dtype=tf.float32, initializer=tf.random_uniform([100], minval=0,
    maxval=1))

W3 = tf.get_variable("weights3", dtype=tf.float32, 
                    initializer=tf.random_uniform([100, 60], minval=-0.5,
    maxval=0.5))
B3 = tf.get_variable("bias3", dtype=tf.float32, initializer=tf.random_uniform([60], minval=0,
    maxval=1))

W4 = tf.get_variable("weights4", dtype=tf.float32, 
                    initializer=tf.random_uniform([60, 30], minval=-0.5,
    maxval=0.5))
B4 = tf.get_variable("bias4", dtype=tf.float32, initializer=tf.random_uniform([30], minval=0,
    maxval=1))

W5 = tf.get_variable("weights5", dtype=tf.float32, 
                    initializer=tf.random_uniform([30, 10], minval=-0.5,
    maxval=0.5))
B5 = tf.get_variable("bias5", dtype=tf.float32, initializer=tf.random_uniform([10], minval=0,
    maxval=1))

########################################
# build the model
########################################
#XX = <FILL IN>

Y1_hat = tf.nn.relu(tf.matmul(X, W1) + B1)
Y2_hat = tf.nn.relu(tf.matmul(Y1_hat, W2) + B2)
Y3_hat = tf.nn.relu(tf.matmul(Y2_hat, W3) + B3)
Y4_hat = tf.nn.relu(tf.matmul(Y3_hat, W4) + B4)
Y_hat = tf.nn.relu(tf.matmul(Y4_hat, W5) + B5)

########################################
# defining the cost function
########################################
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits = Y_hat, labels = Y)
cross_entropy = tf.reduce_mean(cross_entropy) * 100

########################################
# define the optimizer
########################################
optimizer = tf.train.AdamOptimizer(learning_rate)
train_step = optimizer.minimize(cross_entropy)

########################################
# execute the model
########################################
init = tf.global_variables_initializer()

max_learning_rate = 0.005
min_learning_rate = 0.0001
decay_speed = 2000.0

batch_size = 100
n_epochs = 5000
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs): 
        batch_x, batch_y = mnist.train.next_batch(batch_size)
        new_learning_rate = min_learning_rate + tf.math.multiply(tf.math.subtract(max_learning_rate, min_learning_rate), 
                                tf.math.exp(tf.math.xdivy(tf.math.negative(tf.to_float(epoch)), decay_speed))).eval()

        if (epoch % 100 == 0):
            print(cross_entropy.eval(session = sess, feed_dict = {X: batch_x, Y: batch_y, 
                                                        learning_rate: new_learning_rate}))
        else: 
            sess.run(train_step, feed_dict = {X: batch_x, Y: batch_y, 
                                        learning_rate: new_learning_rate})
                
    correct_predictions = tf.equal(tf.argmax(Y_hat, 1), tf.argmax(Y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"))
    
    print("Accuracy: {}".format(sess.run(accuracy, feed_dict = {X: mnist.test.images, Y: mnist.test.labels})))

ValueError: Cannot feed value of shape (100, 28, 28, 1) for Tensor 'Placeholder:0', which has shape '(?, 784)'

---
# 4. Overfitting and Dropout
<img src="figs/13-comic5.png" style="width: 500px;"/>
You will have noticed that cross-entropy curves for test and training data start disconnecting after a couple thousand iterations. The learning algorithm works on training data only and optimises the training cross-entropy accordingly. It never sees test data so it is not surprising that after a while its work no longer has an effect on the test cross-entropy which stops dropping and sometimes even bounces back up. 
<img src="figs/14-overfit.png" style="width: 500px;"/>
This disconnect is usually labeled **overfitting** and when you see it, you can try to apply a regularisation technique called **dropout**. In dropout, at each training iteration, you drop random neurons from the network. You choose a probability `pkeep` for a neuron to be kept, usually between 50% and 75%, and then at each iteration of the training loop, you randomly remove neurons with all their weights and biases. Different neurons will be dropped at each iteration. When testing the performance of your network of course you put all the neurons back (`pkeep = 1`).
<img src="figs/15-dropout.png" style="width: 500px;"/>
TensorFlow offers a dropout function to be used on the outputs of a layer of neurons. It randomly zeroes-out some of the outputs and boosts the remaining ones by `1 / pkeep`. You can add dropout after each intermediate layer in the network now. 

In the following code, use the dropout between each layer during the training, and set the probability `pkeep` once to $50%$ and another time to $75%$ and compare their results.

In [23]:
import time
time.gmtime().tm_hour

18

In [32]:
# TODO: Replace <FILL IN> with appropriate code

# neural network with 5 layers
#
# · · · · · · · · · ·          (input data, flattened pixels)       X [batch, 784]   
# \x/x\x/x\x/x\x/x\x/       -- fully connected layer (sigmoid)      W1 [784, 200]      B1[200]
#  · · · · · · · · ·                                                Y1_hat [batch, 200]
#   \x/x\x/x\x/x\x/         -- fully connected layer (sigmoid)      W2 [200, 100]      B2[100]
#    · · · · · · ·                                                  Y2_hat [batch, 100]
#     \x/x\x/x\x/           -- fully connected layer (sigmoid)      W3 [100, 60]       B3[60]
#      · · · · ·                                                    Y3_hat [batch, 60]
#       \x/x\x/             -- fully connected layer (sigmoid)      W4 [60, 30]        B4[30]
#        · · ·                                                      Y4_hat [batch, 30]
#         \x/               -- fully connected layer (softmax)      W5 [30, 10]        B5[10]
#          ·                                                        Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()

########################################
# define variables and placeholders
########################################
X = tf.placeholder(tf.float32, shape=(None, 784))
Y = tf.placeholder(tf.int64, shape=(None))

# variable learning rate
learning_rate = tf.placeholder(tf.float32, shape=[])

# probability of keeping a node during dropout = 1.0 at test time (no dropout) and 0.75 at training time
pkeep = tf.placeholder(tf.float32)

# five layers and their number of neurons, i.e., 200, 100, 60, 30, and 10
# when using RELUs, make sure biases are initialised with small positive values, for example 0.1
W1 = tf.get_variable("weights1", dtype=tf.float32, 
                    initializer=tf.random_uniform([784, 200], minval=-0.5,
    maxval=0.5))
B1 = tf.get_variable("bias1", dtype=tf.float32, initializer=tf.random_uniform([200], minval=0,
    maxval=1))

W2 = tf.get_variable("weights2", dtype=tf.float32, 
                    initializer=tf.random_uniform([200, 100], minval=-0.5,
    maxval=0.5))
B2 = tf.get_variable("bias2", dtype=tf.float32, initializer=tf.random_uniform([100], minval=0,
    maxval=1))

W3 = tf.get_variable("weights3", dtype=tf.float32, 
                    initializer=tf.random_uniform([100, 60], minval=-0.5,
    maxval=0.5))
B3 = tf.get_variable("bias3", dtype=tf.float32, initializer=tf.random_uniform([60], minval=0,
    maxval=1))

W4 = tf.get_variable("weights4", dtype=tf.float32, 
                    initializer=tf.random_uniform([60, 30], minval=-0.5,
    maxval=0.5))
B4 = tf.get_variable("bias4", dtype=tf.float32, initializer=tf.random_uniform([30], minval=0,
    maxval=1))

W5 = tf.get_variable("weights5", dtype=tf.float32, 
                    initializer=tf.random_uniform([30, 10], minval=-0.5,
    maxval=0.5))
B5 = tf.get_variable("bias5", dtype=tf.float32, initializer=tf.random_uniform([10], minval=0,
    maxval=1))
########################################
# build the model
########################################
#XX = <FILL IN>

Y1_hat = tf.nn.relu(tf.matmul(X, W1) + B1)
Y1_hat_dropout = tf.nn.dropout(Y1_hat, 0.75)
Y2_hat_dropout = tf.nn.dropout(tf.nn.relu(tf.matmul(Y1_hat_dropout, W2) + B2), 0.75)
Y3_hat_dropout = tf.nn.dropout(tf.nn.relu(tf.matmul(Y2_hat_dropout, W3) + B3), 0.75)
Y4_hat_dropout = tf.nn.dropout(tf.nn.relu(tf.matmul(Y3_hat_dropout, W4) + B4), 0.75)
Y_hat = tf.nn.relu(tf.matmul(Y4_hat_dropout, W5) + B5)

########################################
# define the cost function
########################################
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits = Y_hat, labels = Y)
cross_entropy = tf.reduce_mean(cross_entropy) * 100

########################################
# define the optimizer
########################################
optimizer = tf.train.AdamOptimizer(learning_rate)
train_step = optimizer.minimize(cross_entropy)

########################################
# execute the model
########################################
init = tf.global_variables_initializer()

max_learning_rate = 0.005
min_learning_rate = 0.0001
decay_speed = 2000.0

batch_size = 100
n_epochs = 10000
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs): 
        batch_x, batch_y = mnist.train.next_batch(batch_size)
        new_learning_rate = min_learning_rate + tf.math.multiply(tf.math.subtract(max_learning_rate, min_learning_rate), 
                                tf.math.exp(tf.math.xdivy(tf.math.negative(tf.to_float(epoch)), decay_speed))).eval()

        if (epoch % 100 == 0):
            print("({}:{}) Epoch: {} Cost: {}".format(time.gmtime().tm_hour, time.gmtime().tm_min, epoch, cross_entropy.eval(session = sess, feed_dict = {X: batch_x, Y: batch_y, 
                                                        learning_rate: new_learning_rate})))
        else: 
            sess.run(train_step, feed_dict = {X: batch_x, Y: batch_y, 
                                        learning_rate: new_learning_rate})
                
    correct_predictions = tf.equal(tf.argmax(Y_hat, 1), tf.argmax(Y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"))
    
    print("({}:{})Accuracy: {}".format(time.gmtime().tm_hour, time.gmtime().tm_min, sess.run(accuracy, feed_dict = {X: mnist.test.images, Y: mnist.test.labels})))
    
    
# 50% dropout, Accuracy: 0.10260000079870224
# 75%: Accuracy: 0.1868000030517578

(2:14) Epoch: 0 Cost: 8373.060546875
(2:14) Epoch: 100 Cost: 229.20431518554688
(2:14) Epoch: 200 Cost: 234.15896606445312
(2:14) Epoch: 300 Cost: 230.45201110839844
(2:14) Epoch: 400 Cost: 229.81748962402344
(2:15) Epoch: 500 Cost: 231.63775634765625
(2:15) Epoch: 600 Cost: 231.17236328125
(2:16) Epoch: 700 Cost: 230.04164123535156
(2:16) Epoch: 800 Cost: 230.26670837402344
(2:17) Epoch: 900 Cost: 229.5806884765625
(2:17) Epoch: 1000 Cost: 230.6591796875
(2:18) Epoch: 1100 Cost: 229.52447509765625
(2:18) Epoch: 1200 Cost: 232.05914306640625
(2:19) Epoch: 1300 Cost: 230.8430633544922
(2:20) Epoch: 1400 Cost: 230.5386505126953
(2:21) Epoch: 1500 Cost: 229.9733123779297
(2:21) Epoch: 1600 Cost: 230.1727752685547
(2:22) Epoch: 1700 Cost: 229.9633331298828
(2:23) Epoch: 1800 Cost: 232.23190307617188
(2:24) Epoch: 1900 Cost: 230.60316467285156
(2:25) Epoch: 2000 Cost: 231.01666259765625
(2:26) Epoch: 2100 Cost: 229.46556091308594
(2:28) Epoch: 2200 Cost: 230.01904296875
(2:29) Epoch: 2300 C

## ---
# 6. Convolutional Network
<img src="figs/16-comic6.png" style="width: 500px;"/>
In the previous sections, all pixels of images flattened into a single vector, which was a really bad idea. Handwritten digits are made of shapes and we discarded the shape information when we flattened the pixels. However, we can use **convolutional neural networks (CNN)** to take advantage of shape information. CNNs apply *a series of filters* to the raw pixel data of an image to extract and learn higher-level features, which the model can then use for classification. CNNs contains three components:
  - **Convolutional layers**: apply a specified number of convolution filters to the image. For each subregion, the layer performs a set of mathematical operations to produce a single value in the output feature map. Convolutional layers then typically apply a ReLU activation function to the output to introduce nonlinearities into the model.
  - **Pooling layers**: downsample the image data extracted by the convolutional layers to reduce the dimensionality of the feature map in order to decrease processing time. A commonly used pooling algorithm is max pooling, which extracts subregions of the feature map (e.g., 2x2-pixel tiles), keeps their maximum value, and discards all other values.
  - **Dense (fully connected) layers**: perform classification on the features extracted by the convolutional layers and downsampled by the pooling layers. In a dense layer, every node in the layer is connected to every node in the preceding layer.
  
Typically, a CNN is composed of a *stack of **convolutional modules*** that perform feature extraction. Each *module* consists of a *convolutional layer* followed by a *pooling layer*. The last convolutional module is followed by one or more dense layers that perform classification. The final dense layer in a CNN contains a single neuron for each target class in the model, with a softmax activation function to generate a value between 0-1 for each neuron. We can interpret the softmax values for a given image as relative measurements of how likely it is that the image falls into each target class.

Now, let us build a convolutional network for handwritten digit recognition. In this assignment, we will use the architecture shown in the following figure that has three convolutional layers, one fully-connected layer, and one softmax layer. Notice that the second and third convolutional layers have a stride of two that explains why they bring the number of output values down from 28x28 to 14x14 and then 7x7. A convolutional layer requires a weights tensor like `[4, 4, 3, 2]`, in which the first two numbers define the size of a filter (map), the third number shows the *depth* of the filter that is the number of *input channel*, and the last number shows the number of *output channel*. The output channel defines the number of times that we repeat the same thing with a different set of weights in one layer. In our implementation, we assume the output depth of first three convolutional layers, are 4, 8, 12, and the size of fully connected layer is 200.
<img src="figs/17-arch1.png" style="width: 600px;"/>

Convolutional layers can be implemented in TensorFlow using the `tf.nn.conv2d` function, which performs the scanning of the input image in both directions using the supplied weights. This is only the weighted sum part of the neuron. You still need to add a bias and feed the result through an activation function. The padding strategy that works here is to copy pixels from the sides of the image. All digits are on a uniform background so this just extends the background and should not add any unwanted shapes.

In [28]:
# TODO: Replace <FILL IN> with appropriate code
import time
# · · · · · · · · · ·      (input data, 1-deep)               X [batch, 28, 28, 1]
# @ @ @ @ @ @ @ @ @ @   -- conv. layer 5x5x1=>4 stride 1      W1 [5, 5, 1, 4]        B1 [4]
# ∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶                                         Y1_hat [batch, 28, 28, 4]
#   @ @ @ @ @ @ @ @     -- conv. layer 5x5x4=>8 stride 2      W2 [5, 5, 4, 8]        B2 [8]
#   ∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶                                           Y2_hat [batch, 14, 14, 8]
#     @ @ @ @ @ @       -- conv. layer 4x4x8=>12 stride 2     W3 [4, 4, 8, 12]       B3 [12]
#     ∶∶∶∶∶∶∶∶∶∶∶                                             Y3_hat [batch, 7, 7, 12] => reshaped to YY [batch, 7*7*12]
#      \x/x\x\x/        -- fully connected layer (relu)       W4 [7*7*12, 200]       B4 [200]
#       · · · ·                                               Y4_hat [batch, 200]
#       \x/x\x/         -- fully connected layer (softmax)    W5 [200, 10]           B5 [10]
#        · · ·                                                Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()
batch_size = 100

########################################
# define variables and placeholders
########################################
X = tf.placeholder(tf.float32, [None, 28, 28, 1])
y_true = tf.placeholder(tf.float32, [None, 10])
# learning_rate = tf.placeholder(tf.float32, shape=[])

# three convolutional layers with their channel counts, and a fully connected layer 
# (the last layer has 10 softmax neurons)
# the output depth of first three convolutional layers, are 4, 8, 12, and the size of fully connected
# layer is 200

# shape=(filter_dim_x, filter_dim_y, input_layers, convolutions)
W1 = tf.Variable(name = "weights1", dtype=tf.float32, 
                    initial_value=tf.random_uniform([5,5,1,4], minval=0, maxval=1), trainable = True)

B1 = tf.Variable(name = "bias1", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [4]), trainable = True) # 1 per filter

W2 = tf.Variable(name = "weights2", dtype=tf.float32, 
                    initial_value=tf.random_uniform([5,5,4,8], minval=0, maxval=1), trainable = True)

B2 = tf.Variable(name = "bias2", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [8]), trainable = True) # 1 per filter

W3 = tf.Variable(name = "weights3", dtype=tf.float32, 
                    initial_value=tf.random_uniform([4,4,8,12], minval=0, maxval=1), trainable = True)

B3 = tf.Variable(name = "bias3", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [12]), trainable = True) # 1 per filter

W4 = tf.Variable(name = "weights4", dtype=tf.float32, 
                    initial_value=tf.random_uniform([7 * 7 * 12, 200], minval=0, maxval=1), trainable = True)

B4 = tf.Variable(name = "bias4", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [200]), trainable = True) # 1 per filter

W5 = tf.Variable(name = "weights5", dtype=tf.float32, 
                    initial_value=tf.random_uniform([200, 10], minval=0, maxval=1), trainable = True)

B5 = tf.Variable(name = "bias5", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [10]), trainable = True) # 1 per filter

########################################
# build the model
########################################
stride = 1  # output is 28x28
Y1_hat = tf.nn.relu(tf.nn.conv2d(X, W1, strides=[1, stride, stride, 1], padding = 'SAME') + B1) # use tf.nn.conv2d

stride = 2  # output is 14x14
Y2_hat = tf.nn.relu(tf.nn.conv2d(Y1_hat, W2, strides=[1, stride, stride, 1], padding = 'SAME') + B2) # use tf.nn.conv2d

stride = 2  # output is 7x7
Y3_hat = tf.nn.relu(tf.nn.conv2d(Y2_hat, W3, strides=[1, stride, stride, 1], padding = 'SAME') + B3) # use tf.nn.conv2d

# reshape the output from the third convolution for the fully connected layer
YY_hat = tf.reshape(Y3_hat, [tf.shape(X)[0], 7 * 7 * 12])
Y4_hat = tf.nn.relu(tf.matmul(YY_hat, W4) + B4)
logits = tf.matmul(Y4_hat, W5) + B5
y_hat = tf.nn.softmax(logits)

########################################
# define the cost function
########################################
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_true)
cross_entropy = tf.reduce_mean(cross_entropy) * 100

correct_prediction = tf.equal(tf.argmax(y_hat, 1), tf.argmax(y_true, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

########################################
# define the optmizer
########################################
lr = 0.003
optimizer = tf.train.AdamOptimizer(lr)
train_step = optimizer.minimize(loss = cross_entropy, global_step=tf.train.get_global_step())

########################################
# execute the model
########################################
init = tf.global_variables_initializer()
n_epochs = 5000
with tf.Session() as sess:
    init.run()
    # print(tf.trainable_variables())
    for iteration in range(n_epochs):
        batch_X, batch_y = mnist.train.next_batch(batch_size)
        
        if iteration % 100 == 0:
            a, c = sess.run([accuracy, cross_entropy], feed_dict={X: batch_X, y_true: batch_y})
            print('Epoch {}: Train accuracy: {}, Loss: {}'.format(iteration, a, c))
            out_weights = W5.eval()
            # print(np.sum(out_weights))

        # train
        sess.run(train_step, feed_dict={X: batch_X, y_true: batch_y})
        
    a, c = sess.run([accuracy, cross_entropy], feed_dict={X: mnist.test.images, y_true: mnist.test.labels})
    print('test data: accurecy = {}, loss = {}'.format(a, c))  


Epoch 0: Train accuracy: 0.12999999523162842, Loss: 414694048.0
Epoch 100: Train accuracy: 0.699999988079071, Loss: 687688.0
Epoch 200: Train accuracy: 0.6600000262260437, Loss: 1523296.0
Epoch 300: Train accuracy: 0.800000011920929, Loss: 785600.0
Epoch 400: Train accuracy: 0.7099999785423279, Loss: 889348.0625
Epoch 500: Train accuracy: 0.6600000262260437, Loss: 885440.0625
Epoch 600: Train accuracy: 0.6000000238418579, Loss: 793628.0
Epoch 700: Train accuracy: 0.6399999856948853, Loss: 1055286.0
Epoch 800: Train accuracy: 0.7599999904632568, Loss: 571164.0
Epoch 900: Train accuracy: 0.7099999785423279, Loss: 491962.0
Epoch 1000: Train accuracy: 0.6899999976158142, Loss: 498995.03125
Epoch 1100: Train accuracy: 0.7900000214576721, Loss: 329938.0
Epoch 1200: Train accuracy: 0.8600000143051147, Loss: 114807.9921875
Epoch 1300: Train accuracy: 0.8600000143051147, Loss: 131242.0
Epoch 1400: Train accuracy: 0.7699999809265137, Loss: 183813.0
Epoch 1500: Train accuracy: 0.8899999856948853,

# 7. Improve The Performance
A good approach to sizing your neural networks is to implement a network that is a little too constrained, then give it a bit more degrees of freedom and add dropout to make sure it is not overfitting. This ends up with a fairly optimal network for your problem. In the above model, we set the output channel to 4 in the first convolutional layer, which means that we repeat the same filter shape (but with different weights) four times. If we assume that those filters evolve during training into shape recognisers, you can intuitively see that this might not be enough for our problem. Handwritten digits are made from more than 4 elemental shapes. So let us bump up the filter sizes a little, and also increase the number of filters in our convolutional layers from 4, 8, 12 to 6, 12, 24 and then add dropout on the fully-connected layer. The following figure shows the new architecture you should build. Please complete the following code based on the given architecture and dropout technique.
<img src="figs/18-arch2.png" style="width: 600px;"/>

In [46]:
## Still not training properly... 


import time
# · · · · · · · · · ·      (input data, 1-deep)               X [batch, 28, 28, 1]
# @ @ @ @ @ @ @ @ @ @   -- conv. layer 5x5x1=>4 stride 1      W1 [5, 5, 1, 4]        B1 [4]
# ∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶                                         Y1_hat [batch, 28, 28, 4]
#   @ @ @ @ @ @ @ @     -- conv. layer 5x5x4=>8 stride 2      W2 [5, 5, 4, 8]        B2 [8]
#   ∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶                                           Y2_hat [batch, 14, 14, 8]
#     @ @ @ @ @ @       -- conv. layer 4x4x8=>12 stride 2     W3 [4, 4, 8, 12]       B3 [12]
#     ∶∶∶∶∶∶∶∶∶∶∶                                             Y3_hat [batch, 7, 7, 12] => reshaped to YY [batch, 7*7*12]
#      \x/x\x\x/        -- fully connected layer (relu)       W4 [7*7*12, 200]       B4 [200]
#       · · · ·                                               Y4_hat [batch, 200]
#       \x/x\x/         -- fully connected layer (softmax)    W5 [200, 10]           B5 [10]
#        · · ·                                                Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()
batch_size = 100

########################################
# define variables and placeholders
########################################
X = tf.placeholder(tf.float32, [None, 28, 28, 1])
y_true = tf.placeholder(tf.float32, [None, 10])
pkeep = tf.placeholder(tf.float32)

learning_rate = tf.placeholder(tf.float32, shape=[])

# three convolutional layers with their channel counts, and a fully connected layer 
# (the last layer has 10 softmax neurons)
# the output depth of first three convolutional layers, are 4, 8, 12, and the size of fully connected
# layer is 200

# shape=(filter_dim_x, filter_dim_y, input_layers, convolutions)
W1 = tf.Variable(name = "weights1", dtype=tf.float32, 
                    initial_value=tf.random_uniform([5,5,1,6], minval=0, maxval=1), trainable = True)

B1 = tf.Variable(name = "bias1", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [6]), trainable = True) # 1 per filter

W2 = tf.Variable(name = "weights2", dtype=tf.float32, 
                    initial_value=tf.random_uniform([5,5,6,12], minval=0, maxval=1), trainable = True)

B2 = tf.Variable(name = "bias2", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [12]), trainable = True) # 1 per filter

W3 = tf.Variable(name = "weights3", dtype=tf.float32, 
                    initial_value=tf.random_uniform([4,4,12,24], minval=0, maxval=1), trainable = True)

B3 = tf.Variable(name = "bias3", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [24]), trainable = True) # 1 per filter

W4 = tf.Variable(name = "weights4", dtype=tf.float32, 
                    initial_value=tf.random_uniform([7 * 7 * 24, 200], minval=0, maxval=1), trainable = True)

B4 = tf.Variable(name = "bias4", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [200]), trainable = True) # 1 per filter

W5 = tf.Variable(name = "weights5", dtype=tf.float32, 
                    initial_value=tf.random_uniform([200, 10], minval=0, maxval=1), trainable = True)

B5 = tf.Variable(name = "bias5", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [10]), trainable = True) # 1 per filter

########################################
# build the model
########################################
stride = 1  # output is 28x28
Y1_hat = tf.nn.relu(tf.nn.conv2d(X, W1, strides=[1, stride, stride, 1], padding = 'SAME') + B1) # use tf.nn.conv2d

stride = 2  # output is 14x14
Y2_hat = tf.nn.relu(tf.nn.conv2d(Y1_hat, W2, strides=[1, stride, stride, 1], padding = 'SAME') + B2) # use tf.nn.conv2d

stride = 2  # output is 7x7
Y3_hat = tf.nn.relu(tf.nn.conv2d(Y2_hat, W3, strides=[1, stride, stride, 1], padding = 'SAME') + B3) # use tf.nn.conv2d

# reshape the output from the third convolution for the fully connected layer
YY_hat = tf.reshape(Y3_hat, [tf.shape(X)[0], 7 * 7 * 24])
Y4_hat = tf.nn.relu(tf.matmul(YY_hat, W4) + B4)
# YY4_hat = tf.nn.dropout(Y4_hat, pkeep) # kanske måste slanga på random seed? 
logits = tf.matmul(Y4_hat, W5) + B5
y_hat = tf.nn.softmax(logits)

########################################
# define the cost function
########################################
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_true)
cross_entropy = tf.reduce_mean(cross_entropy) * 100

correct_prediction = tf.equal(tf.argmax(y_hat, 1), tf.argmax(y_true, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

########################################
# define the optmizer
########################################
lr = 0.005
optimizer = tf.train.AdamOptimizer(learning_rate)
train_step = optimizer.minimize(loss = cross_entropy, global_step=tf.train.get_global_step())

########################################
# execute the model
########################################
init = tf.global_variables_initializer()
n_epochs = 5000
max_learning_rate = 0.005
min_learning_rate = 0.0001
with tf.Session() as sess:
    init.run()
    # print(tf.trainable_variables())
    for iteration in range(n_epochs):
        batch_X, batch_y = mnist.train.next_batch(batch_size)
        
        new_learning_rate = min_learning_rate + tf.math.multiply(tf.math.subtract(max_learning_rate, min_learning_rate), 
        tf.math.exp(tf.math.xdivy(tf.math.negative(tf.to_float(epoch)), decay_speed))).eval()
        new_learning_rate = 0.005
        # train 
        a, c = sess.run([accuracy, cross_entropy], feed_dict={X: batch_X, y_true: batch_y, pkeep: 0.75, learning_rate: new_learning_rate})
        if iteration % 100 == 0:
            print('Epoch {}: Train accuracy: {}, Loss: {}'.format(iteration, a, c))

        
    a, c = sess.run([accuracy, cross_entropy], feed_dict={X: mnist.test.images, y_true: mnist.test.labels, pkeep: 1.0})
    print('test data: accuracy = {}, loss = {}'.format(a, c))  
# 10000 epocs lr 0.01 --> clearly overfitting
# 5000 epocs lr 0.005 --> probably overfitting or stuck in local minima

Epoch 0: Train accuracy: 0.12999999523162842, Loss: 4116005888.0
Epoch 100: Train accuracy: 0.1599999964237213, Loss: 3508354304.0
Epoch 200: Train accuracy: 0.12999999523162842, Loss: 4094560512.0
Epoch 300: Train accuracy: 0.11999999731779099, Loss: 4121275904.0
Epoch 400: Train accuracy: 0.15000000596046448, Loss: 4640076800.0
Epoch 500: Train accuracy: 0.14000000059604645, Loss: 4165452288.0
Epoch 600: Train accuracy: 0.09000000357627869, Loss: 4301783040.0
Epoch 700: Train accuracy: 0.12999999523162842, Loss: 4515278336.0
Epoch 800: Train accuracy: 0.14000000059604645, Loss: 4358060032.0
Epoch 900: Train accuracy: 0.07000000029802322, Loss: 4561107456.0
Epoch 1000: Train accuracy: 0.14000000059604645, Loss: 4055534336.0
Epoch 1100: Train accuracy: 0.1899999976158142, Loss: 3846390272.0
Epoch 1200: Train accuracy: 0.11999999731779099, Loss: 3622874880.0
Epoch 1300: Train accuracy: 0.1599999964237213, Loss: 3678336000.0


KeyboardInterrupt: 

---
# 8. Tensorflow Layers Module
The TensorFlow **layers** `tf.layers` module provides a high-level API that makes it easy to construct a neural network. It provides methods that facilitate: (i) the creation of dense (fully connected) layers and convolutional layers, (ii) adding activation functions, and (iii) applying dropout regularization. In this section use the module `tf.layers` to build the network you made in section 7.

In [34]:
# · · · · · · · · · ·    (input data, 1-deep)                 X [batch, 28, 28, 1]
# @ @ @ @ @ @ @ @ @ @ -- conv. layer 6x6x1=>6 stride 1        W1 [5, 5, 1, 6]        B1 [6]
# ∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶                                         Y1_hat [batch, 28, 28, 6]
#   @ @ @ @ @ @ @ @   -- conv. layer 5x5x6=>12 stride 2       W2 [5, 5, 6, 12]        B2 [12]
#   ∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶                                           Y2_hat [batch, 14, 14, 12]
#     @ @ @ @ @ @     -- conv. layer 4x4x12=>24 stride 2      W3 [4, 4, 12, 24]       B3 [24]
#     ∶∶∶∶∶∶∶∶∶∶∶                                             Y3_hat [batch, 7, 7, 24] => reshaped to YY [batch, 7*7*24]
#      \x/x\x\x/ ✞    -- fully connected layer (relu+dropout) W4 [7*7*24, 200]       B4 [200]
#       · · · ·                                               Y4_hat [batch, 200]
#       \x/x\x/       -- fully connected layer (softmax)      W5 [200, 10]           B5 [10]
#        · · ·                                                Y_hat [batch, 10]

reset_graph()

#######################################
# defineplaceholders
########################################
X = tf.placeholder(tf.float32, [None, 28, 28, 1])
y_true = tf.placeholder(tf.float32, [None, 10])
pkeep = tf.placeholder(tf.float32)

########################################
# build the model
########################################
# Convolutional Layer #1
stride = 1
conv1 = tf.layers.conv2d(inputs=X, filters=6, kernel_size=[6, 6], strides = [stride, stride], padding="same", activation=tf.nn.relu)

# Convolutional Layer #2
stride = 2
conv2 = tf.layers.conv2d(inputs=conv1, filters=12, kernel_size=[5, 5], strides = [stride, stride], padding="same", activation=tf.nn.relu)

# Convolutional Layer #3
stride = 2
conv3 = tf.layers.conv2d(inputs=conv2, 
                         filters=24, 
                         kernel_size=[4, 4], 
                         strides = [stride, stride], 
                         padding="SAME", 
                         activation=tf.nn.relu)

# reshape the output from the third convolution for the fully connected layer
conv3_reshape = tf.reshape(conv3, [tf.shape(X)[0], 7 * 7 * 24])


# Dense Layer
# Input Tensor Shape: [batch_size, 200]
# Output Tensor Shape: [batch_size, 200]
dense = tf.layers.dense(inputs=conv3, units=200, activation=tf.nn.relu)

# Add dropout operation
dropout = tf.layers.dropout(
      inputs=dense, rate=pkeep)

# Logits layer
# Input Tensor Shape: [batch_size, 200]
# Output Tensor Shape: [batch_size, 10]
logits = tf.layers.dense(inputs=conv3_reshape, units=10)
y_hat = tf.nn.softmax(logits)

########################################
# define the cost and accuracy functions
########################################
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_true)
cross_entropy = tf.reduce_mean(cross_entropy) * 100

correct_prediction = tf.equal(tf.argmax(y_hat, 1), tf.argmax(y_true, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

########################################
# define the optimizer
########################################
lr = 0.003
optimizer = tf.train.AdamOptimizer(lr)
train_step = optimizer.minimize(cross_entropy)

########################################
# execute the model
########################################
init = tf.global_variables_initializer()
n_epochs = 1000
batch_size = 100
with tf.Session() as sess:
    sess.run(init)

    for i in range(n_epochs):
        # load batch of images and correct answers
        batch_X, batch_y = mnist.train.next_batch(batch_size)
        
        # train
        a, c = sess.run([accuracy, cross_entropy], feed_dict={X: batch_X, y_true: batch_y, pkeep: 0.75})
        
        if i % 100 == 0:
            # a, c = sess.run([accuracy, cross_entropy], feed_dict={X: batch_X, y_true: batch_y, pkeep: 1.0})
            print('epoch {}: accurecy = {}, loss = {}'.format(i, a, c))


    a, c = sess.run([accuracy, cross_entropy], feed_dict={X: mnist.test.images, y_true: mnist.test.labels, pkeep: 1.0})
    print('test data: accurecy = {}, loss = {}'.format(a, c))  
##  without dropout, 1000 epochs: accurecy = 0.9836000204086304, loss = 5.435532569885254

## with dropout: 

epoch 0: accurecy = 0.18000000715255737, loss = 229.36260986328125
epoch 100: accurecy = 0.9300000071525574, loss = 17.235523223876953
epoch 200: accurecy = 0.9700000286102295, loss = 14.939865112304688
epoch 300: accurecy = 0.9599999785423279, loss = 14.98404312133789
epoch 400: accurecy = 0.9599999785423279, loss = 11.164587020874023
epoch 500: accurecy = 0.9700000286102295, loss = 9.224174499511719
epoch 600: accurecy = 1.0, loss = 2.9929733276367188
epoch 700: accurecy = 0.9800000190734863, loss = 11.214426040649414
epoch 800: accurecy = 0.9800000190734863, loss = 3.565812110900879
epoch 900: accurecy = 0.9599999785423279, loss = 30.281909942626953
test data: accurecy = 0.9794999957084656, loss = 6.202642917633057


---
# 9. Keras
Keras is a high-level API to build and train deep learning models. It's used for fast prototyping, advanced research, and production. `tf.keras` is TensorFlow's implementation of the Keras API specification. To work with Keras, you need to import `tf.keras` as part of your TensorFlow program setup.
```
import tensorflow as tf
from tensorflow.keras import layers
```
#### Build a model
In Keras, you assemble **layers** to build a model, i.e., a graph of layers. The most common type of model is a stack of layers: the `tf.keras.Sequential` model. For example, the following code builds a simple, fully-connected network (i.e., multi-layer perceptron):
```
model = tf.keras.Sequential()
# adds a densely-connected layer with 64 units to the model:
model.add(layers.Dense(64, activation='relu'))
# add another
model.add(layers.Dense(64, activation='relu'))
# add a softmax layer with 10 output units:
model.add(layers.Dense(10, activation='softmax'))
```
There are many `tf.keras.layers` available with some common constructor parameters:
* `activation`: set the activation function for the layer, which is specified by the name of a built-in function or as a callable object.
* `kernel_initializer` and `bias_initializer`: the initialization schemes that create the layer's weights (weight and bias).
* `kernel_regularizer` and `bias_regularizer`: the regularization schemes that apply the layer's weights (weight and bias), such as L1 or L2 regularization.

#### Train and evaluate
After you construct a model, you can configure its learning process by calling the `compile` method:
```
model.compile(optimizer=tf.train.AdamOptimizer(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```
The method `tf.keras.Model.compile` takes three important arguments:
* `optimizer`: it specifies the training procedure, e.g., `tf.train.AdamOptimizer` and `tf.train.GradientDescentOptimizer`.
* `loss`: the cost function to minimize during optimization, e.g., mean square error (mse), categorical_crossentropy, and binary_crossentropy.
* `metrics`: used to monitor training, e.g., `accuracy`.

The next step after confiuring the model is to train it by calling the `model.fit` method and giving it training data as its input. After training the model you can call `tf.keras.Model.evaluate` and `tf.keras.Model.predict` methods to evaluate the inference-mode loss and metrics for the data provided or predict the output of the last layer in inference for the data provided, respectively.

You can read more about Keras [here](https://www.tensorflow.org/guide/keras).


In this task, please use Keras to rebuild the network you made in section 7.

In [10]:
## Keras implementation

from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.losses import categorical_crossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical


# to reset the Tensorflow default graph
reset_graph()

model = Sequential()
model.add(Conv2D(6, kernel_size=[6, 6], activation='relu', input_shape=[28, 28, 1], strides = 2))
model.add(Conv2D(12, kernel_size=[5, 5], activation='relu', strides = 2))
model.add(Conv2D(24, kernel_size=[4, 4], activation='relu', strides = 2))
model.add(Dense(200, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(10, activation='softmax'))

model.compile(loss=categorical_crossentropy, optimizer=Adam(),
              metrics=['accuracy'])

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# convert class vectors to binary class matrices
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)


x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
y_train = y_train.reshape(y_train.shape[0], 1, 1, 10)
y_test = y_test.reshape(y_test.shape[0], 1, 1, 10)

model.fit(x_train, y_train, 
          batch_size=100, 
          epochs=10, 
          verbose=1, 
          validation_data=(x_test, y_test))
c, a = model.evaluate(x_test, y_test, verbose=0)
print('test data: accurecy = {}, loss = {}'.format(a, c))  

Train on 60000 samples, validate on 10000 samples
Epoch 1/10


InvalidArgumentError: Tensor dropout/keras_learning_phase:0, specified in either feed_devices or fetch_devices was not found in the Graph

---
# 10. Implement LeNet-5
In this section, you should implement **LeNet-5** either using Tensorflow or Keras. Please take a look at its [paper](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf) before starting to implement it.
The LeNet-5 architecture is perhaps the most widely known CNN architecture. It was created by Yann LeCun in 1998 and widely used for handwritten digit recognition (MNIST). It is composed of the layers shown in the following table.
<img src="figs/19-letnet5.png" style="width: 600px;"/>
There are a few extra details to be noted:
* MNIST images are 28×28 pixels, but they are zero-padded to 32×32 pixels and normalized before being fed to the network. The rest of the network does not use any padding, which is why the size keeps shrinking as the image progresses through the network.
* The average pooling layers are slightly more complex than usual: each neuron computes the mean of its inputs, then multiplies the result by a learnable coefficient and adds a learnable bias term, then finally applies the activation function.
* Most neurons in layer C3 maps are connected to neurons in only three or four S2 maps (instead of all six S2 maps). See table 1 in the [paper](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf) for details.
* The output layer is a bit special: instead of computing the dot product of the inputs and the weight vector, each neuron outputs the square of the Euclidian distance between its input vector and its weight vector. Each output measures how much the image belongs to a particular digit class. The cross-entropy cost function is now preferred, as it penalizes bad predictions much more, producing larger gradients and thus converging faster.

In [5]:
# TODO: Build the LetNet-5 model, and test it on MNIST
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, AveragePooling2D, Activation, Flatten
from tensorflow.keras.losses import categorical_crossentropy
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.utils import to_categorical

# to reset the Tensorflow default graph
reset_graph()

## Data
(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(x_train.shape)
# convert class vectors to binary class matrices
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)

x_train = np.pad(x_train, pad_width = ([0,0], [2,2], [2,2], [0,0]), mode = 'constant', constant_values = ([0,0], [0,0], [0,0], [0,0])).astype('float32')
x_test = np.pad(x_test, pad_width = ([0,0], [2,2], [2,2], [0,0]), mode = 'constant', constant_values = ([0,0], [0,0], [0,0], [0,0])).astype('float32')

# reshape to 4 dimensions 
y_train = y_train.reshape(y_train.shape[0], 1, 1, 10)
y_test = y_test.reshape(y_test.shape[0], 1, 1, 10)

## Model 

model = Sequential()
model.add(Conv2D(6, kernel_size=[5, 5], activation='tanh', input_shape=[32, 32, 1], strides = 1)) # , data_format = 'channels_last'
model.add(AveragePooling2D(2, strides=2))
model.add(Activation('tanh'))
model.add(Conv2D(16, kernel_size=[5, 5], activation='tanh', strides = 1))
model.add(AveragePooling2D(pool_size=[2, 2], strides=2))
model.add(Activation('tanh'))
model.add(Conv2D(120, kernel_size=[5, 5], activation='tanh', strides = 1))
model.add(Dense(84, activation='tanh'))
model.add(Dense(10, activation='softmax'))

model.compile(loss=categorical_crossentropy, optimizer= SGD(0.01),# Adam(lr = 0.005, decay = 0.5),
              metrics=['accuracy'])

model.fit(x_train, y_train, 
          batch_size=100, 
          epochs=10, 
          verbose=1, 
          validation_data=(x_test, y_test))
c, a = model.evaluate(x_test, y_test, verbose=0)
print('test data: accurecy = {}, loss = {}'.format(a, c))  


(60000, 28, 28)
Train on 60000 samples, validate on 10000 samples
Epoch 1/10
 1200/60000 [..............................] - ETA: 1:31 - loss: 487.7701 - acc: 0.1925

KeyboardInterrupt: 

---
# 11. Implement AlexNet
In the last section, you should implement **AlexNet** either using Tensorflow or Keras. Again, please take a look at its [paper](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf) before start to implement it.
The AlexNet CNN architecture won the [ImageNet ILSVRC challenge](http://www.image-net.org/challenges/LSVRC/2012/) in 2012 by a large margin. It was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. It is quite similar to LeNet-5, only much larger and deeper, and it was the first to stack convolutional layers directly on top of each other, instead of stacking a pooling layer on top of each convolutional layer. The following table presents this architecture.
<img src="figs/20-alexnet.png" style="width: 600px;"/>
To train the model, we need a big dataset, however, in this assignment you are going to to assign the pretrained weights to your model, using `tf.Variable.assign`. You can download the pretrained weights from [bvlc_alexnet.npy](https://www.cs.toronto.edu/~guerzhoy/tf_alexnet/bvlc_alexnet.npy). This file is a NumPy array file created by the python. After you read this file, you will receive a python dictionary with a <key, value> pair for each layer. Each key is one of the layers names, e.g., `conv1`, and each value is a list of two values: (1) weights, and (2) biases of that layer. Part of the function to load the weights and biases to your model is given, and you need to complete it.

Here is what you see if you read and print the shape of each layer from the file:
```
weight_dic = np.load("bvlc_alexnet.npy", encoding="bytes").item()
for layer in weights_dic:
    print("-" * 20)
    print(layer)
    for wb in weights_dic[layer]:
        print(wb.shape)

#--------------------
# fc8
# (4096, 1000) # weights
# (1000,) # bias
#--------------------
# fc7
# (4096, 4096) # weights
# (4096,) # bias
#--------------------
# fc6
# (9216, 4096) # weights
# (4096,) # bias
#--------------------
# conv5
# (3, 3, 192, 256) # weights
# (256,) # bias
#--------------------
# conv4
# (3, 3, 192, 384) # weights
# (384,) # bias
#--------------------
# conv3
# (3, 3, 256, 384) # weights
# (384,) # bias
#--------------------
# conv2
# (5, 5, 48, 256) # weights
# (256,) # bias
#--------------------
# conv1
# (11, 11, 3, 96) # weights
# (96,) # bias
```


In [16]:
def maxPoolLayer(x, kHeight, kWidth, strideX, strideY, name, padding = "SAME"):
    return tf.nn.max_pool(x, ksize = [1, kHeight, kWidth, 1],
                          strides = [1, strideX, strideY, 1], padding = padding, name = name)
 
def dropout(x, keepPro, name = None):
    return tf.nn.dropout(x, keepPro, name)
 
def LRN(x, R, alpha, beta, name = None, bias = 1.0):
    return tf.nn.local_response_normalization(x, depth_radius = R, alpha = alpha,
                                              beta = beta, bias = bias, name = name)
 
def fcLayer(x, inputD, outputD, reluFlag, name):
    """fully-connect"""
    with tf.variable_scope(name) as scope:
        w = tf.get_variable("w", shape = [inputD, outputD], dtype = "float")
        b = tf.get_variable("b", [outputD], dtype = "float")
        out = tf.nn.xw_plus_b(x, w, b, name = scope.name)
        if reluFlag:
            return tf.nn.relu(out)
        else:
            return out
 
def convLayer(x, kHeight, kWidth, strideX, strideY,
              featureNum, name, padding = "SAME", groups = 1): #group=2 means the second part of AlexNet
    """convlutional"""
    channel = int(x.get_shape()[-1]) #get channel
    conv = lambda a, b: tf.nn.conv2d(a, b, strides = [1, strideY, strideX, 1], padding = padding)
    with tf.variable_scope(name) as scope:
        w = tf.get_variable("w", shape = [kHeight, kWidth, channel/groups, featureNum])
        b = tf.get_variable("b", shape = [featureNum])
 
        xNew = tf.split(value = x, num_or_size_splits = groups, axis = 3)#input and weights after split
        wNew = tf.split(value = w, num_or_size_splits = groups, axis = 3)
 
        featureMap = [conv(t1, t2) for t1, t2 in zip(xNew, wNew)] #retriving the feature map separately
        mergeFeatureMap = tf.concat(axis = 3, values = featureMap) #concatnating feature map 
        # print mergeFeatureMap.shape
        out = tf.nn.bias_add(mergeFeatureMap, b)
        return tf.nn.relu(tf.reshape(out, mergeFeatureMap.get_shape().as_list()), name = scope.name)

In [17]:
class alexNet(object):
    """alexNet model"""
    def __init__(self, x, keepPro, classNum, skip, modelPath = "bvlc_alexnet.npy"):
        self.X = x
        self.KEEPPRO = keepPro
        self.CLASSNUM = classNum
        self.SKIP = skip
        self.MODELPATH = modelPath
        #build CNN
        self.buildCNN()
 
    def buildCNN(self):
        """build model"""
        conv1 = convLayer(self.X, 11, 11, 4, 4, 96, "conv1", "VALID")
        lrn1 = LRN(conv1, 2, 2e-05, 0.75, "norm1")
        pool1 = maxPoolLayer(lrn1, 3, 3, 2, 2, "pool1", "VALID")
 
        conv2 = convLayer(pool1, 5, 5, 1, 1, 256, "conv2", groups = 2)
        lrn2 = LRN(conv2, 2, 2e-05, 0.75, "lrn2")
        pool2 = maxPoolLayer(lrn2, 3, 3, 2, 2, "pool2", "VALID")
 
        conv3 = convLayer(pool2, 3, 3, 1, 1, 384, "conv3")
 
        conv4 = convLayer(conv3, 3, 3, 1, 1, 384, "conv4", groups = 2)
 
        conv5 = convLayer(conv4, 3, 3, 1, 1, 256, "conv5", groups = 2)
        pool5 = maxPoolLayer(conv5, 3, 3, 2, 2, "pool5", "VALID")
 
        fcIn = tf.reshape(pool5, [-1, 256 * 6 * 6])
        fc1 = fcLayer(fcIn, 256 * 6 * 6, 4096, True, "fc6")
        dropout1 = dropout(fc1, self.KEEPPRO)
 
        fc2 = fcLayer(dropout1, 4096, 4096, True, "fc7")
        dropout2 = dropout(fc2, self.KEEPPRO)
 
        self.fc3 = fcLayer(dropout2, 4096, self.CLASSNUM, True, "fc8")
 
    def loadModel(self, sess):
        """load model"""
        wDict = np.load(self.MODELPATH, encoding = "bytes").item()
        #for layers in model
        for name in wDict:
            if name not in self.SKIP:
                with tf.variable_scope(name, reuse = True):
                    for p in wDict[name]:
                        if len(p.shape) == 1:
                            #bias
                            sess.run(tf.get_variable('b', trainable = False).assign(p))
                        else:
                            #weights
                            sess.run(tf.get_variable('w', trainable = False).assign(p))

In [25]:
reset_graph()

import os
import urllib.request
import argparse
import sys
import cv2
import tensorflow as tf
import numpy as np
import caffe_classes
import glob

dropoutPro = 1
classNum = 1000
skip = []
#get testImage
testPath = "test_images"
testImg = []

def listdir_nohidden(path):
    return glob.glob(os.path.join(path, '*')) # so there is no problem with hidden files

for f in listdir_nohidden(testPath):
    #print(f)
    testImg.append(cv2.imread(f))
 
imgMean = np.array([104, 117, 124], np.float)
x = tf.placeholder("float", [1, 227, 227, 3])
 
model = alexNet(x, dropoutPro, classNum, skip)
score = model.fc3
softmax = tf.nn.softmax(score)
 
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    model.loadModel(sess) #Load the model
 
    for i, img in enumerate(testImg):
        #img preprocess
        test = cv2.resize(img.astype(float), (227, 227)) #resize
        test -= imgMean #subtract image mean
        test = test.reshape((1, 227, 227, 3)) #reshape into tensor shape
        maxx = np.argmax(sess.run(softmax, feed_dict = {x: test}))
        res = caffe_classes.class_names[maxx] #find the max probility
        #print(res)
        font = cv2.FONT_HERSHEY_SIMPLEX
        cv2.putText(img, res, (int(img.shape[0]/3), int(img.shape[1]/3)), font, 1, (0, 0, 255), 2) #putting on the labels
        cv2.imshow("demo", img) 
        cv2.waitKey(5000)

In [7]:
## New version with Keras

# TODO: Replace <FILL IN> with appropriate code

# to reset the Tensorflow default graph
reset_graph()

# build the AlexNet model

model = Sequential()
model.add(Conv2D(96, name = 'conv1', kernel_size=[11, 11], activation='relu', padding = 'same', input_shape=[224, 224, 3], strides = 4)) # , data_format = 'channels_last'
model.add(MaxPooling2D(strides=3, padding = 'valid'))
model.add(Conv2D(256, name = 'conv2', kernel_size=[5, 5], padding = 'same', activation='relu', strides = 1))
model.add(MaxPooling2D(pool_size=[3, 3], strides=2, padding = 'valid'))
model.add(Conv2D(384, name = 'conv3', kernel_size=[3, 3], padding = 'same', activation='relu', strides = 1))
model.add(Conv2D(384, name = 'conv4', kernel_size=[3, 3], padding = 'same', activation='relu', strides = 1))
model.add(Conv2D(256, name = 'conv5', kernel_size=[3, 3], padding = 'same', activation='relu', strides = 1))
model.add(Dense(4096, name = 'fc6', activation='relu'))
model.add(Dense(4096, name = 'fc7', activation='relu'))
model.add(Dense(1000, name = 'fc8', activation='softmax'))

# load inital weights and biases to the model
def load_initial_weights(self, session):
    # load the weights into memory
    weights_dict = np.load('bvlc_alexnet.npy', encoding='bytes').item()
    
    all_weights = [weights_dict['conv1'][0], weights_dict['conv1'][1],
    weights_dict['conv2'][0], weights_dict['conv2'][1],
    weights_dict['conv3'][0], weights_dict['conv3'][1],
    weights_dict['conv4'][0], weights_dict['conv4'][1],
    weights_dict['conv5'][0], weights_dict['conv5'][1],
    weights_dict['fc6'][0], weights_dict['fc6'][1],
    weights_dict['fc7'][0], weights_dict['fc7'][1],
    weights_dict['fc8'][0], weights_dict['fc8'][1]]
    
    for w in all_weights:
        print(w.shape)
        
    
    model.set_weights(all_weights)
    # loop over all layer names stored in the weights dict
    #for layer_name in weights_dict:
    #    with tf.variable_scope(layer_name, reuse=True):
    #        # loop over list of weights/biases and assign them to their corresponding tf variable
    #        for wb in weights_dict[layer_name]:
    #            
    #            #layer_weights, layer_biases = wb[0], wb[1]
    #            #layer = model.get_layer(layer_name)
    #            #layer.set_weights([layer_weights, layer_biases])
    #            #print("Done layer ", layer_name)
    #            
    #            ## biases
    #            #if len(wb.shape) == 1:
    #            #    bias = tf.get_variable(<FILL IN>)
    #            #    session.run(bias.assign(wb))
    #            ## weights
    #            #else:
    #            #    weight = tf.get_variable(<FILL IN>)
    #            #    session.run(weight.assign(wb))
##
load_initial_weights(None, None)



(11, 11, 3, 96)
(96,)
(5, 5, 48, 256)
(256,)
(3, 3, 256, 384)
(384,)
(3, 3, 192, 384)
(384,)
(3, 3, 192, 256)
(256,)
(9216, 4096)
(4096,)
(4096, 4096)
(4096,)
(4096, 1000)
(1000,)


ValueError: Shapes (5, 5, 96, 256) and (5, 5, 48, 256) are incompatible

In [3]:
# TODO: Replace <FILL IN> with appropriate code

# to reset the Tensorflow default graph
#reset_graph()

# build the AlexNet model
#<FILL IN> :)

# load inital weights and biases to the model
#def load_initial_weights(self, session):
    # load the weights into memory
#    weights_dic = np.load('bvlc_alexnet.npy', encoding='bytes').item()

    # loop over all layer names stored in the weights dict
#    for layer in weights_dict:
#        with tf.variable_scope(layer, reuse=True):
#            # loop over list of weights/biases and assign them to their corresponding tf variable
#            for wb in weights_dict[layer]:
#                # biases
#                if len(wb.shape) == 1:
#                    bias = tf.get_variable(<FILL IN>)
#                    session.run(bias.assign(wb))
#                # weights
#                else:
#                    weight = tf.get_variable(<FILL IN>)
#                    session.run(weight.assign(wb))
 
    
reset_graph()

X = tf.placeholder(tf.float32, [None, 224, 224, 3])
y_true = tf.placeholder(tf.float32, [None, 10])

#
# C1 
#
W1 = tf.Variable(name = "w_conv1", dtype=tf.float32, 
                    initial_value=tf.random_uniform([11,11,3,96], minval=0, maxval=1), trainable = True)

B1 = tf.Variable(name = "b_conv1", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [96]), trainable = True) # 1 per filter

C1 = tf.nn.relu(tf.nn.conv2d(X, W1, strides=[1, 4, 4, 1], padding="SAME") + B1)

#
# S2 
#
S2 = tf.nn.max_pool(C1, ksize = [1, 3, 3, 1], strides=[1, 2, 2, 1], padding="VALID")


#
# C3 
#
W3 = tf.Variable(name = "w_conv2", dtype=tf.float32, 
                    initial_value=tf.random_uniform([5,5,96,256], minval=0, maxval=1), trainable = True)

B3 = tf.Variable(name = "b_conv2", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [256]), trainable = True) # 1 per filter

C3 = tf.nn.relu(tf.nn.conv2d(S2, W3, strides=[1, 1, 1, 1], padding="SAME") + B3)

#
# S4 
#
S4 = tf.nn.max_pool(C3, ksize = [1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")

#
# C5
#
W5 = tf.Variable(name = "w_conv3", dtype=tf.float32, 
                    initial_value=tf.random_uniform([3,3,256,384], minval=0, maxval=1), trainable = True)

B5 = tf.Variable(name = "b_conv3", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [384]), trainable = True) # 1 per filter

C5 = tf.nn.relu(tf.nn.conv2d(S4, W5, strides=[1, 1, 1, 1], padding="SAME") + B5)

#
# C6
#
W6 = tf.Variable(name = "w_conv4", dtype=tf.float32, 
                    initial_value=tf.random_uniform([3,3,384,384], minval=0, maxval=1), trainable = True)

B6 = tf.Variable(name = "b_conv4", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [384]), trainable = True) # 1 per filter

C6 = tf.nn.relu(tf.nn.conv2d(C5, W6, strides=[1, 1, 1, 1], padding="SAME") + B6)

#
# C7
#
W7 = tf.Variable(name = "w_conv5", dtype=tf.float32, 
                    initial_value=tf.random_uniform([3,3,384,256], minval=0, maxval=1), trainable = True)

B7 = tf.Variable(name = "b_conv5", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [256]), trainable = True) # 1 per filter

C7 = tf.nn.relu(tf.nn.conv2d(C6, W7, strides=[1, 1, 1, 1], padding="SAME") + B7)



# Flat C7 for dense layers
C7_flat = tf.reshape(C7, [tf.shape(X)[0], 13 * 13 * 256])

#
# F8
#
W8 = tf.Variable(name = "w_fc6", dtype=tf.float32, 
                    initial_value=tf.random_uniform([256 * 13 * 13, 4096], minval=0, maxval=1), trainable = True)

B8 = tf.Variable(name = "b_fc6", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [4096]), trainable = True) # 1 per filter

F8 = tf.nn.relu(tf.matmul(C7_flat, W8) + B8)

#
# F9
#
W9 = tf.Variable(name = "w_fc7", dtype=tf.float32, 
                    initial_value=tf.random_uniform([4096, 4096], minval=0, maxval=1), trainable = True)

B9 = tf.Variable(name = "b_fc7", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [4096]), trainable = True) # 1 per filter

F9 = tf.nn.relu(tf.matmul(F8, W9) + B9)
#
# O10.- Output/Logits layer
#
W10 = tf.Variable(name = "w_fc9", dtype=tf.float32, 
                    initial_value=tf.random_uniform([4096, 1000], minval=0, maxval=1), trainable = True)

B10 = tf.Variable(name = "b_fc9", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [1000]), trainable = True) # 1 per filter


## STILL GETTING DIMENSION ERROR HERE 
O10 = tf.matmul(F9, W10) + B10

y_hat = tf.nn.softmax(O10)

# define the cost and accuracy functions
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=y_hat, labels=y_true)
cross_entropy = tf.reduce_mean(cross_entropy) * 100

correct_prediction = tf.equal(tf.argmax(y_hat, 1), tf.argmax(y_true, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# define the optimizer
lr = 0.003
optimizer = tf.train.AdamOptimizer(lr)
train_step = optimizer.minimize(cross_entropy)

## load inital weights and biases to the model
def load_initial_weights(session):
#    # load the weights into memory
    weights_dict = np.load('bvlc_alexnet.npy', encoding='bytes').item()
    for name in weights_dict:
        with tf.variable_scope(name, reuse = True):
            for p in weights_dict[name]:
                if len(p.shape) == 1:
                    #bias
                    sess.run(tf.get_variable('b', trainable = False).assign(p))
                else:
                    #weights
                    sess.run(tf.get_variable('w', trainable = False).assign(p))
    # loop over all layer names stored in the weights dict
#    for layer in weights_dict:
 #       with tf.variable_scope(layer, reuse=True):
  #          # loop over list of weights/biases and assign them to their corresponding tf variable
   #         for wb in weights_dict[layer]:
    #            # biases
     #           if len(wb.shape) == 1:
      #             print('b_' + layer)
       #             #bias = tf.get_variable(layer, shape = wb.shape, trainable = False)
       #             #session.run(bias.assign(wb))
       #             session.run(tf.get_variable('b', trainable = False).assign(wb))
       #             
        #        # weights
        #        else:
         #           print('w_' + layer)
          #          #weight = tf.get_variable(layer, shape = wb.shape, trainable = False)
           #         #session.run(weight.assign(wb))
            #        session.run(tf.get_variable('w', trainable = False).assign(wb))



Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See `tf.nn.softmax_cross_entropy_with_logits_v2`.



#### Test the model
After building the AlexNet model, you can test it on different images and present the accuracy of the model. To do so, first you need to use **OpenCV** library to make the images ready to give as input to the model. OpenCV is a library used for image processing. Below you can see how to read an image file and pre-process it using OpenCV to give it to the model. However, you need to complete the code and test the accuracy of your model. The teset images (shown below) are available in the `test_images` folder.
<table width="100%">
<tr>
<td><img src="test_images/test_image1.jpg" style="width:200px;"></td>
<td><p align="center"><img src="test_images/test_image2.jpg" style="width:200px;"></td>
<td align="right"><img src="test_images/test_image3.jpg" style="width:200px;"></td>
</tr>

In [None]:
# TODO: Replace <FILL IN> with appropriate code
# test the AlexNet model on the given images

import cv2

#get list of all images
current_dir = os.getcwd()
image_path = os.path.join(current_dir, 'test_images')
img_files = [os.path.join(image_path, f) for f in os.listdir(image_path) if f.endswith('.jpg')]

#load all images
imgs = []
for f in img_files:
    imgs.append(cv2.imread(f))

with tf.Session() as sess:
    <FILL IN>
    
    # loop over all images
    for i, image in enumerate(imgs):
        # convert image to float32 and resize to (227x227)
        img = cv2.resize(image.astype(np.float32), (227, 227))
        
        # subtract the ImageNet mean
        # Mean subtraction per channel was used to center the data around zero mean for each channel (R, G, B).
        # This typically helps the network to learn faster since gradients act uniformly for each channel.
        imagenet_mean = np.array([104., 117., 124.], dtype=np.float32)
        img -= imagenet_mean
        
        # reshape as needed to feed into model
        img = img.reshape((1, 227, 227, 3))
        
        <FILL IN>

In [14]:
## Old version from scratch 

# TODO: Replace <FILL IN> with appropriate code

# to reset the Tensorflow default graph
reset_graph()

# build the AlexNet model

# placeholders
X = tf.placeholder(tf.float32, [None, 224, 224, 3])
y_true = tf.placeholder(tf.float32, [None, 10])

#
# C1 
#
W1 = tf.Variable(name = "w_conv1", dtype=tf.float32, 
                    initial_value=tf.random_uniform([11,11,3,96], minval=0, maxval=1), trainable = True)

B1 = tf.Variable(name = "b_conv1", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [96]), trainable = True) # 1 per filter

C1 = tf.nn.relu(tf.nn.conv2d(X, W1, strides=[1, 4, 4, 1], padding="SAME") + B1)

#
# S2 
#
S2 = tf.nn.max_pool(C1, ksize = [1, 3, 3, 1], strides=[1, 2, 2, 1], padding="VALID")


#
# C3 
#
W3 = tf.Variable(name = "w_conv2", dtype=tf.float32, 
                    initial_value=tf.random_uniform([5,5,96,256], minval=0, maxval=1), trainable = True)

B3 = tf.Variable(name = "b_conv2", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [256]), trainable = True) # 1 per filter

C3 = tf.nn.relu(tf.nn.conv2d(S2, W3, strides=[1, 1, 1, 1], padding="SAME") + B3)

#
# S4 
#
S4 = tf.nn.max_pool(C3, ksize = [1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")

#
# C5
#
W5 = tf.Variable(name = "w_conv3", dtype=tf.float32, 
                    initial_value=tf.random_uniform([3,3,256,384], minval=0, maxval=1), trainable = True)

B5 = tf.Variable(name = "b_conv3", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [384]), trainable = True) # 1 per filter

C5 = tf.nn.relu(tf.nn.conv2d(S4, W5, strides=[1, 1, 1, 1], padding="SAME") + B5)

#
# C6
#
W6 = tf.Variable(name = "w_conv4", dtype=tf.float32, 
                    initial_value=tf.random_uniform([3,3,384,384], minval=0, maxval=1), trainable = True)

B6 = tf.Variable(name = "b_conv4", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [384]), trainable = True) # 1 per filter

C6 = tf.nn.relu(tf.nn.conv2d(C5, W6, strides=[1, 1, 1, 1], padding="SAME") + B6)

#
# C7
#
W7 = tf.Variable(name = "w_conv5", dtype=tf.float32, 
                    initial_value=tf.random_uniform([3,3,384,256], minval=0, maxval=1), trainable = True)

B7 = tf.Variable(name = "b_conv5", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [256]), trainable = True) # 1 per filter

C7 = tf.nn.relu(tf.nn.conv2d(C6, W7, strides=[1, 1, 1, 1], padding="SAME") + B7)



# Flat C7 for dense layers
C7_flat = tf.reshape(C7, [tf.shape(X)[0], 13 * 13 * 256])

#
# F8
#
W8 = tf.Variable(name = "w_fc6", dtype=tf.float32, 
                    initial_value=tf.random_uniform([256 * 13 * 13, 4096], minval=0, maxval=1), trainable = True)

B8 = tf.Variable(name = "b_fc6", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [4096]), trainable = True) # 1 per filter

F8 = tf.nn.relu(tf.matmul(C7_flat, W8) + B8)

#
# F9
#
W9 = tf.Variable(name = "w_fc7", dtype=tf.float32, 
                    initial_value=tf.random_uniform([4096, 4096], minval=0, maxval=1), trainable = True)

B9 = tf.Variable(name = "b_fc7", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [4096]), trainable = True) # 1 per filter

F9 = tf.nn.relu(tf.matmul(F8, W9) + B9)
#
# O10.- Output/Logits layer
#
W10 = tf.Variable(name = "w_fc9", dtype=tf.float32, 
                    initial_value=tf.random_uniform([4096, 1000], minval=0, maxval=1), trainable = True)

B10 = tf.Variable(name = "b_fc9", dtype=tf.float32, initial_value=tf.constant(0.1, tf.float32, [1000]), trainable = True) # 1 per filter


## STILL GETTING DIMENSION ERROR HERE 
O10 = tf.matmul(F9, W10) + B10

y_hat = tf.nn.softmax(O10)

# define the cost and accuracy functions
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=y_hat, labels=y_true)
cross_entropy = tf.reduce_mean(cross_entropy) * 100

correct_prediction = tf.equal(tf.argmax(y_hat, 1), tf.argmax(y_true, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# define the optimizer
lr = 0.003
optimizer = tf.train.AdamOptimizer(lr)
train_step = optimizer.minimize(cross_entropy)

## load inital weights and biases to the model
def load_initial_weights(session):
#    # load the weights into memory
    weights_dict = np.load('bvlc_alexnet.npy', encoding='bytes').item()
    for name in weights_dict:
        with tf.variable_scope(name, reuse = True):
            for p in weights_dict[name]:
                if len(p.shape) == 1:
                    #bias
                    sess.run(tf.get_variable('b', trainable = False).assign(p))
                else:
                    #weights
                    sess.run(tf.get_variable('w', trainable = False).assign(p))
    # loop over all layer names stored in the weights dict
#    for layer in weights_dict:
 #       with tf.variable_scope(layer, reuse=True):
  #          # loop over list of weights/biases and assign them to their corresponding tf variable
   #         for wb in weights_dict[layer]:
    #            # biases
     #           if len(wb.shape) == 1:
      #             print('b_' + layer)
       #             #bias = tf.get_variable(layer, shape = wb.shape, trainable = False)
       #             #session.run(bias.assign(wb))
       #             session.run(tf.get_variable('b', trainable = False).assign(wb))
       #             
        #        # weights
        #        else:
         #           print('w_' + layer)
          #          #weight = tf.get_variable(layer, shape = wb.shape, trainable = False)
           #         #session.run(weight.assign(wb))
            #        session.run(tf.get_variable('w', trainable = False).assign(wb))



#### Test the model
After building the AlexNet model, you can test it on different images and present the accuracy of the model. To do so, first you need to use **OpenCV** library to make the images ready to give as input to the model. OpenCV is a library used for image processing. Below you can see how to read an image file and pre-process it using OpenCV to give it to the model. However, you need to complete the code and test the accuracy of your model. The teset images (shown below) are available in the `test_images` folder.
<table width="100%">
<tr>
<td><img src="test_images/test_image1.jpg" style="width:200px;"></td>
<td><p align="center"><img src="test_images/test_image2.jpg" style="width:200px;"></td>
<td align="right"><img src="test_images/test_image3.jpg" style="width:200px;"></td>
</tr>

In [27]:
# TODO: Replace <FILL IN> with appropriate code
# test the AlexNet model on the given images

import cv2
import os

#get list of all images
current_dir = os.getcwd()
image_path = os.path.join(current_dir, 'test_images')
img_files = [os.path.join(image_path, f) for f in os.listdir(image_path) if f.endswith('.jpg')]

#load all images
imgs = []
for f in img_files:
    imgs.append(cv2.imread(f))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    load_initial_weights(sess)
    
    # loop over all images
    for i, image in enumerate(imgs):
        # convert image to float32 and resize to (227x227)
        img = cv2.resize(image.astype(np.float32), (227, 227))
        
        # subtract the ImageNet mean
        # Mean subtraction per channel was used to center the data around zero mean for each channel (R, G, B).
        # This typically helps the network to learn faster since gradients act uniformly for each channel.
        imagenet_mean = np.array([104., 117., 124.], dtype=np.float32)
        img -= imagenet_mean
        
        # reshape as needed to feed into model
        img = img.reshape((1, 227, 227, 3))
        
        maxx = np.argmax(sess.run(softmax, feed_dict = {x: test}))
        res = caffe_classes.class_names[maxx] #find the max probility
        #print(res)
        font = cv2.FONT_HERSHEY_SIMPLEX
        cv2.putText(img, res, (int(img.shape[0]/3), int(img.shape[1]/3)), font, 1, (0, 0, 255), 2) #putting on the labels
        cv2.imshow("demo", img) 
        cv2.waitKey(5000)

ValueError: Variable fc6/w does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=tf.AUTO_REUSE in VarScope?