Recently, machine learning has achieved remarkable results in systems that classify objects in images. The biggest recent advancement came from the AlexNet architecture (Krizhevsky, Sutskever, & Hinton, 2012), a large convolutional neural network of the kind first designed by LeCun in 1989, which won the 2012 ImageNet competition by making 40% fewer errors than the next best competitor. In the following years, almost all participants implemented AlexNet-inspired deep neural networks, and current models are comparable to human-level performance (Russakovsky et al., 2014).


I will try to implement AlexNet using the TensorFlow framework. Although the original model was trained on 1.2M images, I will use the MNIST data for simplicity.

## ARCHITECTURE

![AlexNet](http://sinb.github.io/images/imagenet-cnn/cnn3.png)




| Layer Name | Layer Size | # of Neurons | Weight Shape       | # of Parameters |
|------------|------------|--------------|--------------------|-----------------|
| INPUT      | 224x224x3  |              |                    |                 |
| CONV1      | 55x55x96   | 290,400      | 11 x 11 x 3 x 96   | 34,848          |
| CONV2      | 27x27x256  | 186,624      | 5 x 5 x 96 x 256   | 614,400         |
| CONV3      | 13x13x384  | 64,896       | 3 x 3 x 256 x 384  | 884,736         |
| CONV4      | 13x13x384  | 64,896       | 3 x 3 x 384 x 384  | 1,327,104       |
| CONV5      | 13x13x256  | 43,264       | 3 x 3 x 384 x 256  | 884,736         |
| FC         | 4,096      | 4,096        | 6 x 6 x 256 x 4096 | 37,748,736      |
| FC         | 4,096      | 4,096        | 4096 x 4096        | 16,777,216      |
| SOFTMAX    | 1,000      | 1,000        | 4096 x 1000        | 4,096,000       |
|            |            |              |                    |                 |
| Total      |            | 659,272      |                    | 62,367,776      |

Table : https://docs.google.com/spreadsheets/d/1uq4lmpws9pPRWFyIAfOr5-6I-xtFWwPsuf82e9nHMqM/


* The AlexNet architecture has 8 layers in total: 5 convolutional and 3 fully connected. The output of the last fully connected layer is fed to a 1000-way softmax, which produces class probabilities.
* The first, second and fifth convolutional layers are followed by max pooling.
* The patch (filter) sizes are 11x11, 5x5, 3x3, 3x3, 3x3.
* Note that the input images are 224 by 224, whereas the original images are 256 by 256. Deep neural networks overfit easily, so the authors trained on 224x224 patches extracted randomly from the 256x256 images, together with their horizontal reflections.
* Finally, the network has about 650K neurons and 60M parameters in total. That is huge, which is why using GPUs was so important back then and still is now.
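The totals in the table can be reproduced with a few lines of arithmetic. The sketch below is only a sanity check: the helper name `product` and the two lists are introduced here for illustration, and biases are ignored, as they are in the table.

```python
from functools import reduce
from operator import mul


def product(dims):
    """Multiply all entries of a shape tuple together."""
    return reduce(mul, dims, 1)


# Output volume of each layer (height, width, depth), taken from the table.
layer_sizes = [
    (55, 55, 96), (27, 27, 256), (13, 13, 384), (13, 13, 384), (13, 13, 256),
    (4096,), (4096,), (1000,),
]

# Weight shape of each layer, taken from the table (biases are not counted).
weight_shapes = [
    (11, 11, 3, 96), (5, 5, 96, 256), (3, 3, 256, 384), (3, 3, 384, 384),
    (3, 3, 384, 256), (6, 6, 256, 4096), (4096, 4096), (4096, 1000),
]

total_neurons = sum(product(s) for s in layer_sizes)
total_params = sum(product(s) for s in weight_shapes)

print(total_neurons)  # 659272
print(total_params)   # 62367776
```

Most of the parameters sit in the first fully connected layer, even though most of the neurons sit in the convolutional layers.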



## Download Data
Now that the architecture is defined, let's start by downloading the data.

In [308]:
import tensorflow as tf

# Import MNIST data
import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


## Weight Initialization
To create this model, we're going to need to create weights and biases. One should generally initialize weights with a small amount of noise for symmetry breaking, and to prevent 0 gradients. Since we're using ReLU neurons, it is also good practice to initialize them with a slightly positive initial bias to avoid "dead neurons." 
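As a rough illustration of what "small noise plus a slightly positive bias" can look like, here is a framework-free sketch. The helper names `small_noise_weights` and `positive_biases`, the standard deviation of 0.1 and the bias value of 0.1 are assumptions chosen for the example. They are not the initialization used in the cells of this post, which draw every weight and bias from `tf.random_normal` with its default standard deviation of 1.0.

```python
import random


def small_noise_weights(n, stddev=0.1, seed=0):
    """Draw n weights from a normal distribution, resampling any value that
    falls more than two standard deviations from zero (a truncated normal)."""
    rng = random.Random(seed)
    weights = []
    while len(weights) < n:
        w = rng.gauss(0.0, stddev)
        if abs(w) <= 2.0 * stddev:
            weights.append(w)
    return weights


def positive_biases(n, value=0.1):
    """Start every bias at the same small positive constant."""
    return [value] * n


w = small_noise_weights(3 * 3 * 1 * 64)  # as many values as a 3x3x1x64 filter bank
b = positive_biases(64)
print(max(abs(v) for v in w) <= 0.2)  # True: all weights stay close to zero
```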

In the AlexNet paper, the input images were 256x256x3. However, since we are using the MNIST dataset, our images are 28x28x1, so we are going to change the weight shapes from the original paper.

Original paper weights:

AlexNet Parameters
CONV1 11 x 11 x 3 x 96 (Stride 4) (MAX POOL)  
CONV2 5 x 5 x 96 x 256 (MAX POOL)  
CONV3 3 x 3 x 256 x 384  
CONV4 3 x 3 x 384 x 384  
CONV5 3 x 3 x 384 x 256 (MAX POOL)  
FC    6 x 6 x 256 x 4096  
FC    4096 x 4096  
SOFTMAX 4096 x 1000  
  
Our Parameters  
CONV1 3 x 3 x 1 x 64 (MAX POOL)  
CONV2 3 x 3 x 64 x 128 (MAX POOL)  
CONV3 3 x 3 x 128 x 256 (MAX POOL)  
FC    L x L x 256 x 1024  
FC    1024 x 1024  
SOFTMAX 1024 x 10  
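
The `L` in the first fully connected layer is the side length of the feature map after the third pooling step. Assuming TensorFlow's `'SAME'` padding rule, where the output side is the input side divided by the stride and rounded up, it can be worked out as follows (the helper name `same_padding_output` is made up for this sketch):

```python
import math


def same_padding_output(size, stride):
    """Output side length of a conv/pool op with 'SAME' padding."""
    return int(math.ceil(size / float(stride)))


side = 28  # MNIST images are 28x28
for layer in range(3):
    side = same_padding_output(side, stride=1)  # convolution keeps the size
    side = same_padding_output(side, stride=2)  # pooling halves it, rounding up

L = side
print(L)            # 4
print(L * L * 256)  # 4096 inputs to the first fully connected layer
```

The side length goes 28, 14, 7, 4, which is why the first dense weight matrix has `4*4*256` rows.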
  



In [309]:
# Store layers weight & bias
weights = {
    'wc1': tf.Variable(tf.random_normal([3, 3, 1, 64])),
    'wc2': tf.Variable(tf.random_normal([3, 3, 64, 128])),
    'wc3': tf.Variable(tf.random_normal([3, 3, 128, 256])),
    'wd1': tf.Variable(tf.random_normal([4*4*256, 1024])),
    'wd2': tf.Variable(tf.random_normal([1024, 1024])),
    'out': tf.Variable(tf.random_normal([1024, 10]))
}
biases = {
    'bc1': tf.Variable(tf.random_normal([64])),
    'bc2': tf.Variable(tf.random_normal([128])),
    'bc3': tf.Variable(tf.random_normal([256])),
    'bd1': tf.Variable(tf.random_normal([1024])),
    'bd2': tf.Variable(tf.random_normal([1024])),
    'out': tf.Variable(tf.random_normal([10]))
}

In [310]:
# Parameters
learning_rate = 0.001
training_iters = 300000
batch_size = 64
display_step = 100  # report progress every 100 steps (6,400 images), as in the training log

# tf Graph input
x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])


### 1. Use of ReLU Nonlinearity
One of the most important, if not the most important, novelties in this paper was the use of ReLUs. The standard way of adding nonlinearity to a network was to use tanh or sigmoid units. However, because these functions saturate, networks that use them take a long time to train. The nonlinearity provided by ReLUs was good enough and, most importantly, networks with ReLUs trained much faster than those with tanh or sigmoid units. This made it feasible to train such a large (at that time) neural network.
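
To make "saturating" concrete, the sketch below compares the derivatives of the three activations at a few inputs. It is a plain-Python illustration rather than anything from the paper, and the function names are chosen for this example.

```python
import math


def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which vanishes for large |x|."""
    return 1.0 - math.tanh(x) ** 2


def sigmoid_grad(x):
    """Derivative of the logistic sigmoid: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


# The ReLU gradient stays at 1 while the other two shrink as x grows.
for x in (0.5, 2.0, 5.0):
    print(x, relu_grad(x), tanh_grad(x), sigmoid_grad(x))
```

For positive inputs the ReLU gradient does not shrink, so the error signal is not damped by the activation as it flows backwards through the network.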

The convolutional layer uses a stride of 1 and 'SAME' padding, so its output has the same spatial size as its input. I am going to define this as a function so that I can reuse it whenever I want.

In [311]:
def conv2d(x, w, b):
    return tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME'),b))

### 2. Overlapping Pooling
It is common to periodically insert a pooling layer right after a convolution layer. This pooling layer downsamples the image using the MAX operation. The benefits are a smaller spatial size, less computation, fewer parameters, and some control of overfitting.

Pooling layers in convolutional neural networks traditionally don't overlap. If each 2x2 window is reduced to 1x1, the stride is 2 and the number of pixels shrinks by a factor of 4. So, a 256x256 image becomes 128x128. However, in this paper the windows were 3x3 and the stride was 2, which creates overlapping pooling.

This scheme reduced the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, as compared with the non-overlapping scheme 2x2 with stride 2, which produces output of equivalent dimensions. They also observed that models with overlapping pooling are slightly more difficult to overfit.
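
The difference between the two schemes is easy to see on a tiny image. The following is a plain-Python sketch without padding (the function name `max_pool_2d` is introduced here and is not the TensorFlow op); both settings produce a 2x2 output from a 5x5 input, but with a 3x3 window the neighbouring windows share pixels.

```python
def max_pool_2d(image, k, s):
    """Max pooling over a 2-D list of numbers with a k x k window and stride s.
    No padding is applied, so only windows that fit entirely are used."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows - k + 1, s):
        out_row = []
        for c in range(0, cols - k + 1, s):
            window = [image[r + i][c + j] for i in range(k) for j in range(k)]
            out_row.append(max(window))
        out.append(out_row)
    return out


# A 5x5 test image whose entries count up row by row: 0..24.
image = [[5 * r + c for c in range(5)] for r in range(5)]

non_overlapping = max_pool_2d(image, k=2, s=2)  # windows do not share pixels
overlapping = max_pool_2d(image, k=3, s=2)      # neighbouring windows share a row/column

print(non_overlapping)  # [[6, 8], [16, 18]]
print(overlapping)      # [[12, 14], [22, 24]]
```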

In [312]:
def max_pool(x, k, s):
    return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, s, s, 1], padding='SAME')

### 3. Local Response Normalization
Local Response Normalization is another novel part of this paper, which reduced top-1 and top-5 error rates by 1.4% and 1.2%, respectively, according to the paper. I am not going into the details of the formula; however, we may think of it as a way of normalizing each activation over several adjacent kernel maps (the ordering of the kernel maps is arbitrary) to implement a form of lateral inhibition. The paper uses k = 2, n = 5, α = 10⁻⁴ and β = 0.75; the function in the next cell uses different values (a depth radius of 4, a bias of 1.0 and α = 0.001 / 9.0) with the same β = 0.75.



In [313]:
def norm(x):
    return tf.nn.lrn(x, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75)
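
For readers who want the formula spelled out: each activation is divided by `(bias + alpha * sum of squared activations in neighbouring channels) ** beta`. The sketch below applies this to the channel values of a single pixel in plain Python. The function name `local_response_norm` is mine, and the mapping of `depth_radius` to half the window width reflects my reading of `tf.nn.lrn`, so treat it as an approximation of that op rather than its implementation.

```python
def local_response_norm(a, depth_radius, bias, alpha, beta):
    """Normalize each channel activation in `a` (one pixel, all channels)
    by the summed squares of its neighbouring channels."""
    n = len(a)
    out = []
    for i in range(n):
        lo = max(0, i - depth_radius)
        hi = min(n - 1, i + depth_radius)
        sqr_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (bias + alpha * sqr_sum) ** beta)
    return out


# Toy values so the arithmetic is easy to follow by hand:
# middle channel: 1 / (1 + 3) = 0.25, edge channels: 1 / (1 + 2) = 0.333...
print(local_response_norm([1.0, 1.0, 1.0], depth_radius=1, bias=1.0, alpha=1.0, beta=1.0))
```

A channel surrounded by strongly active neighbours is scaled down more than one at the edge of the window, which is the "lateral inhibition" effect.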


### 4. Training on Multiple GPUs
Another trick in the paper was the use of multiple GPUs. They put half of the kernels on each GPU and GPUs communicated only in certain layers. 

This scheme reduced top-1 and top-5 error rates by 1.7% and 1.2%, respectively, as compared with a net with half as many kernels in each convolutional layer trained on one GPU. The two-GPU net takes slightly less time to train than the one-GPU net.


### 5. Reducing Overfitting and Dropouts
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. AlexNet has 60M parameters as explained above and could suffer from the same problem. The paper explains 2 primary ways of combatting this problem.

#### Data Augmentation
The first form of data augmentation consists of image translations and horizontal reflections: random 224x224 patches, and their mirror images, are extracted from the 256x256 images. The second form alters the intensities of the RGB channels in the training images. We are not going to use either of them in this example. Please refer to the paper for extra details.
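
Although this post does not use augmentation, the crop-and-flip idea is small enough to sketch in plain Python. The function names, the 8x8 stand-in image and the 6x6 crop size are all illustrative assumptions; the paper works with 224x224 patches from 256x256 images.

```python
import random


def random_crop(image, size, rng):
    """Cut a size x size patch from a random position of a 2-D list."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def augment(image, size, rng):
    """Random crop followed by a horizontal flip half of the time."""
    patch = random_crop(image, size, rng)
    if rng.random() < 0.5:
        patch = horizontal_flip(patch)
    return patch


rng = random.Random(0)
image = [[8 * r + c for c in range(8)] for r in range(8)]  # stand-in for a 256x256 image
patch = augment(image, size=6, rng=rng)                    # stand-in for a 224x224 crop
print(len(patch), len(patch[0]))  # 6 6
```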

#### Dropout
Combining the predictions of many different models is usually a very successful way to reduce test errors, but since deep networks already take several days to train, this method is infeasible.

The "dropout" method sets the output of each hidden neuron to zero with a probability of 0.5. The dropped-out neurons don't participate in the forward or backward passes. So every time a new input comes in, the network samples a different architecture. These different architectures share the same weights, though. This technique prevents units from co-adapting too much, since a neuron cannot rely on the presence of particular other neurons. Therefore, the neurons are forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.

At test time, we use all the neurons but multiply their outputs by 0.5, which is a reasonable approximation to taking the geometric mean of the predictive distributions produced by the exponentially many dropout networks.
The paper uses dropout in the first two fully connected layers. Without dropout, the network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required to converge. Note that the implementation in this post applies dropout after each convolution block rather than in the fully connected layers.

In [314]:
keep_prob = tf.placeholder(tf.float32) # dropout (keep probability)
# Despite its name, dropout_rate is fed to keep_prob, so it is the probability of KEEPING a unit.
# The original paper dropped units with probability 0.5; we keep more (0.8) because our network and dataset are much smaller.
dropout_rate = 0.8

def dropout(x):
    return tf.nn.dropout(x, keep_prob)
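
One detail worth spelling out: the paper scales outputs by 0.5 at test time, whereas `tf.nn.dropout` scales the kept units up by `1 / keep_prob` during training, so nothing needs rescaling at test time and `keep_prob` is simply fed as 1. A minimal framework-free sketch of that behaviour (the function name `inverted_dropout` is mine, and this is not TensorFlow's implementation) could look like this:

```python
import random


def inverted_dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob and scale it by 1/keep_prob;
    dropped values become 0. With keep_prob == 1 the input passes through."""
    out = []
    for v in values:
        if rng.random() < keep_prob:
            out.append(v / keep_prob)
        else:
            out.append(0.0)
    return out


rng = random.Random(0)
activations = [1.0] * 10
print(inverted_dropout(activations, keep_prob=0.8, rng=rng))  # entries are 0.0 or 1.25
print(inverted_dropout(activations, keep_prob=1.0, rng=rng))  # unchanged
```

Scaling during training keeps the expected value of each unit the same in both modes, which is what the test-time multiplication by 0.5 achieves in the paper.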

## Defining ALEXNET

In [315]:
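def alex_net(_X, _weights, _biases, _dropout):
    # Note: _dropout is not used directly; the dropout() helper reads the global keep_prob placeholder
    # Reshape the flat 784-pixel input into 28x28x1 images in order to convolve
    _X = tf.reshape(_X, shape=[-1, 28, 28, 1])  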

    
    # First Convolution Layer, followed by pooling and normalization
    conv1 = conv2d(_X, _weights['wc1'], _biases['bc1']) # Shape: 28*28*64 
    pool1 = max_pool(conv1, k=3, s=2) # Shape: 14*14*64
    norm1 = norm(pool1) 
    norm1 = dropout(norm1) 
    
    # Second Convolution Layer, followed by pooling and normalization
    conv2 = conv2d(norm1, _weights['wc2'], _biases['bc2']) # Shape: 14*14*128
    pool2 = max_pool(conv2, k=3, s=2) # Shape: 7*7*128
    norm2 = norm(pool2)
    norm2 = dropout(norm2)
    
    # Third Convolution Layer, followed by pooling and normalization
    conv3 = conv2d( norm2, _weights['wc3'], _biases['bc3']) # Shape: 7*7*256
    pool3 = max_pool( conv3, k=3, s=2) # Shape: 4*4*256
    norm3 = norm(pool3)
    norm3 = dropout(norm3)
    
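    # Print the flattened input size of the first dense layer (4*4*256 = 4096)
    print(_weights['wd1'].get_shape().as_list()[0])
    # Fully connected layer
    # Reshape the output of the third block (norm3) to fit the dense layer input
    dense1 = tf.reshape(norm3, [-1, _weights['wd1'].get_shape().as_list()[0]]) 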
    # Relu activation
    dense1 = tf.nn.relu(tf.matmul(dense1, _weights['wd1']) + _biases['bd1'], name='fc1')
    
    # Relu activation
    dense2 = tf.nn.relu(tf.matmul(dense1, _weights['wd2']) + _biases['bd2'], name='fc2') 

    # Output, class prediction
    out = tf.matmul(dense2, _weights['out']) + _biases['out']
    return out

In [316]:
# Construct model
pred = alex_net(x, weights, biases, keep_prob)

4096


### Details of Learning
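The original paper used stochastic gradient descent with a batch size of 128 examples, momentum of 0.9 and a weight decay of 0.0005. For simplicity, this implementation uses the Adam optimizer with a learning rate of 0.001 and a batch size of 64 instead.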
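
The update rule the paper describes, with momentum 0.9 and weight decay 0.0005, can be sketched for a single scalar weight as follows. The function name `momentum_step` and the example values for the gradient and learning rate are mine; this only illustrates the paper's rule and is not what the optimizer in this post does.

```python
def momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One update of a single weight w with velocity v:
    v <- momentum * v - weight_decay * lr * w - lr * grad
    w <- w + v"""
    v_next = momentum * v - weight_decay * lr * w - lr * grad
    w_next = w + v_next
    return w_next, v_next


w, v = 1.0, 0.0
w, v = momentum_step(w, v, grad=0.5, lr=0.01)
print(w, v)  # w moves slightly below 1.0, v is slightly negative
```

The weight decay term pulls the weight towards zero on every step, independently of the gradient.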

In [317]:
# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
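
To show what the cost measures, here is a plain-Python sketch of softmax cross-entropy for a single example with a one-hot label. The function name `softmax_cross_entropy` is introduced for illustration and is not TensorFlow's implementation; the cost in this post is the mean of this quantity over the batch.

```python
import math


def softmax_cross_entropy(logits, one_hot_label):
    """Cross-entropy between softmax(logits) and a one-hot label,
    computed with the log-sum-exp trick for numerical stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_sum_exp for z in logits]
    return -sum(t * lp for t, lp in zip(one_hot_label, log_probs))


print(softmax_cross_entropy([0.0, 0.0], [1.0, 0.0]))     # ln(2), about 0.6931
print(softmax_cross_entropy([1000.0, 0.0], [0.0, 1.0]))  # about 1000: confident and wrong
```

The second call shows that very large logits on the wrong class produce very large losses, so the loss is not bounded the way accuracy is.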

In [318]:
# Evaluate model
correct_pred = tf.equal(tf.argmax(pred,1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

In [319]:
# Initializing the variables
init = tf.initialize_all_variables()

In [None]:
# Launch the graph
with tf.Session() as sess:
    sess.run(init)
    step = 1
    # Keep training until reach max iterations
    while step * batch_size < training_iters:
        batch_xs, batch_ys = mnist.train.next_batch(batch_size)
        # Fit training using batch data
        sess.run(optimizer, feed_dict={x: batch_xs, y: batch_ys, keep_prob: dropout_rate})
        if step % display_step == 0:
            # Calculate batch accuracy
            acc = sess.run(accuracy, feed_dict={x: batch_xs, y: batch_ys, keep_prob: 1.})
            # Calculate batch loss
            loss = sess.run(cost, feed_dict={x: batch_xs, y: batch_ys, keep_prob: 1.})
            print("Iter " + str(step*batch_size) + ", Minibatch Loss= " \
                  + "{:.6f}".format(loss) + ", Training Accuracy= " + "{:.5f}".format(acc))
        step += 1
    print("Optimization Finished!")
    # Calculate accuracy for 256 mnist test images
    test_acc = sess.run(accuracy, feed_dict={x: mnist.test.images[:256],
                                             y: mnist.test.labels[:256],
                                             keep_prob: 1.})
    print("Testing Accuracy: " + str(test_acc))

Iter 6400, Minibatch Loss= 103547.351562, Training Accuracy= 0.15625
Iter 12800, Minibatch Loss= 43817.253906, Training Accuracy= 0.12500
Iter 19200, Minibatch Loss= 21805.498047, Training Accuracy= 0.25000
Iter 25600, Minibatch Loss= 28639.246094, Training Accuracy= 0.25000
Iter 32000, Minibatch Loss= 19440.542969, Training Accuracy= 0.20312
Iter 38400, Minibatch Loss= 23771.638672, Training Accuracy= 0.09375
Iter 44800, Minibatch Loss= 15034.099609, Training Accuracy= 0.17188
Iter 51200, Minibatch Loss= 9669.911133, Training Accuracy= 0.21875