# Homework 2.2: The Quest For A Better Network

In this assignment you will build a monster network to solve CIFAR10 image classification.

This notebook is intended as a sequel to seminar 3, please give it a try if you haven't done so yet.

(please read it at least diagonally)

* The ultimate quest is to create a network with as high an __accuracy__ as you can push it to.
* There is a __mini-report__ at the end that you will have to fill in. We recommend reading it first and filling it in as you iterate.
 
## Grading
* starting at zero points
* +20% for describing your iteration path in a report below.
* +20% for building a network that gets above 20% accuracy
* +10% for beating each of these milestones on __TEST__ dataset:
    * 50% (50% points)
    * 60% (60% points)
    * 65% (70% points)
    * 70% (80% points)
    * 75% (90% points)
    * 80% (full points)
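
The scheme above reads as a running total. Here is a minimal sketch of that arithmetic, assuming that "beating" a milestone means strictly exceeding it and that the percentages in parentheses are cumulative totals that already include the report; the function name `grade_percent` is made up for illustration:

```python
def grade_percent(test_accuracy, has_report=True):
    """Return the grade (in % of full points) for a test accuracy given in percent."""
    milestones = [50, 60, 65, 70, 75, 80]
    points = 0                      # starting at zero points
    if has_report:
        points += 20                # +20% for the report
    if test_accuracy > 20:
        points += 20                # +20% for getting above 20% accuracy
    # +10% for each milestone beaten (assumed here: strictly greater than)
    points += 10 * sum(test_accuracy > m for m in milestones)
    return points

print(grade_percent(55.0))    # report + 20% threshold + one milestone
print(grade_percent(80.5))    # report + 20% threshold + all six milestones
```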
    
## Restrictions
* Please do NOT use pre-trained networks for this assignment until you reach 80%.
 * In other words, base milestones must be beaten without pre-trained nets (and such net must be present in the e-mail). After that, you can use whatever you want.
* You __can__ use validation data for training, but you __can't__ do anything with test data apart from running the evaluation procedure.

## Tips on what can be done:


 * __Network size__
   * MOAR neurons, 
   * MOAR layers, ([torch.nn docs](http://pytorch.org/docs/master/nn.html))

   * Nonlinearities in the hidden layers
     * tanh, relu, leaky relu, etc
   * Larger networks may take more epochs to train, so don't discard your net just because it didn't beat the baseline in 5 epochs.

   * Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn!


### The main rule of prototyping: one change at a time
   * By now you probably have several ideas on what to change. By all means, try them out! But there's a catch: __never test several new things at once__.


### Optimization
   * Training for 100 epochs regardless of anything is probably a bad idea.
   * Some networks converge over 5 epochs, others - over 500.
   * Way to go: stop when validation score is 10 iterations past maximum
   * You should certainly use adaptive optimizers
     * rmsprop, nesterov_momentum, adam, adagrad and so on.
     * Converge faster and sometimes reach better optima
     * It might make sense to tweak learning rate/momentum, other learning parameters, batch size and number of epochs
   * __BatchNormalization__ (nn.BatchNorm2d) for the win!
     * Sometimes more batch normalization is better.
   * __Regularize__ to prevent overfitting
     * Add some L2 weight norm to the loss function, PyTorch will do the rest
       * Can be done manually or like [this](https://discuss.pytorch.org/t/simple-l2-regularization/139/2).
     * Dropout (`nn.Dropout`) - to prevent overfitting
       * Don't overdo it. Check if it actually makes your network better
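
The stopping rule mentioned above ("stop when validation score is 10 iterations past maximum") can be sketched without any framework. This is a minimal illustration, assuming one validation score is recorded per epoch; `should_stop` and `patience` are names chosen here, not part of PyTorch:

```python
def should_stop(val_scores, patience=10):
    """True once the best validation score is more than `patience` epochs old."""
    if not val_scores:
        return False
    # index of the first occurrence of the best score seen so far
    best_epoch = max(range(len(val_scores)), key=val_scores.__getitem__)
    return len(val_scores) - 1 - best_epoch > patience

# Toy history: the score peaks at epoch 2 and never improves again.
history = [0.60, 0.70, 0.75] + [0.74] * 11
print(should_stop(history[:5]))   # only 2 epochs past the peak
print(should_stop(history))       # 11 epochs past the peak
```

In a training loop you would append the validation score after every epoch, call the check, and reload the best saved weights before breaking out.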
   
### Convolution architectures
   * This task __can__ be solved by a sequence of convolutions and poolings with batch_norm and ReLU seasoning, but you shouldn't necessarily stop there.
   * [Inception family](https://hacktilldawn.com/2016/09/25/inception-modules-explained-and-implemented/), [ResNet family](https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035?gi=9018057983ca), [Densely-connected convolutions (exotic)](https://arxiv.org/abs/1608.06993), [Capsule networks (exotic)](https://arxiv.org/abs/1710.09829)
   * Please do try a few simple architectures before you go for resnet-152.
   * Warning! Training convolutional networks can take a long time without a GPU. That's okay.
     * If you are CPU-only, we still recommend that you try a simple convolutional architecture
     * a perfect option is to set it up to run overnight and check on it in the morning.
     * Make reasonable layer size estimates. A 128-neuron first convolution is likely overkill.
     * __To reduce computation__ time by some factor in exchange for some drop in accuracy, try using the __stride__ parameter. A stride=2 convolution should take roughly 1/4 of the time of the default (stride=1) one.
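
The 1/4 figure follows from the output size: a convolution is evaluated once per output position, and stride 2 halves both spatial dimensions. A quick check using the standard output-size formula (dilation 1 assumed; `conv_out_size` is a helper written for this example, not a PyTorch function):

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension (dilation = 1)."""
    return (size + 2 * padding - kernel) // stride + 1

for stride in (1, 2):
    side = conv_out_size(32, kernel=3, stride=stride, padding=1)
    print("stride={}: {}x{} = {} output positions".format(
        stride, side, side, side * side))
```

For a 32x32 input with a 3x3 kernel and `padding=1`, stride 1 keeps the 32x32 grid while stride 2 yields 16x16, a quarter as many positions to compute.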
 
   
### Data augmentation
   * Getting a 5x larger dataset for free is a great deal
     * Zoom-in+slice = move
     * Rotate+zoom(to remove black stripes)
     * Add noise (gaussian or bernoulli)
   * Simple way to do that (if you have PIL/Image): 
     * ```from scipy.misc import imrotate,imresize```
     * and a bit of slicing
     * Other cool libraries: cv2, skimage, PIL/Pillow
   * A more advanced way is to use torchvision transforms:
    ```
    transform_train = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])
    trainset = torchvision.datasets.CIFAR10(root=path_to_cifar_like_in_seminar, train=True, download=True, transform=transform_train)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)

    ```
   * Or use this tool from Keras (requires theano/tensorflow): [tutorial](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html), [docs](https://keras.io/preprocessing/image/)
   * Stay realistic. There's usually no point in flipping dogs upside down as that is not the way you usually see them.
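
If the torchvision pipeline is awkward to plug into a NumPy-based batch loop, the same two augmentations (a random shift via pad-and-crop, and a horizontal flip) can be applied to a batch array directly. A rough sketch, assuming batches shaped `(N, 3, 32, 32)`; `augment_batch` is a hypothetical helper, not a library function:

```python
import numpy as np

def augment_batch(X, pad=4, rng=np.random):
    """Randomly shift (zero-pad + crop) and horizontally flip each image in X."""
    n, c, h, w = X.shape
    padded = np.pad(X, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="constant")
    out = np.empty_like(X)
    for i in range(n):
        top = rng.randint(0, 2 * pad + 1)    # crop offset in [0, 2*pad]
        left = rng.randint(0, 2 * pad + 1)
        crop = padded[i, :, top:top + h, left:left + w]
        if rng.rand() < 0.5:
            crop = crop[:, :, ::-1]          # flip along the width axis
        out[i] = crop
    return out

X_demo = np.random.rand(8, 3, 32, 32).astype("float32")
X_aug = augment_batch(X_demo)
print(X_aug.shape, X_aug.dtype)
```

Such a function would be applied to training batches only; validation and test batches should stay untouched.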
   


   
There is a template for your solution below that you can opt to use or throw away and write it your way.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from cifar import load_cifar10
X_train,y_train,X_val,y_val,X_test,y_test = load_cifar10("cifar_data")
class_names = np.array(['airplane','automobile','bird','cat','deer','dog','frog','horse','ship','truck'])

print(X_train.shape,y_train.shape)

(40000, 3, 32, 32) (40000,)


In [3]:
import torch, torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
class Flatten(nn.Module):
    def forward(self, input):
        return input.view(input.size(0), -1)

In [5]:
model = nn.Sequential()

model.add_module('conv_1', nn.Conv2d(in_channels=3, out_channels=10, kernel_size=(3,3), padding=1))
model.add_module('conv1_relu',nn.ReLU())

model.add_module('conv_2', nn.Conv2d(in_channels=10, out_channels=16, kernel_size=(3,3), padding=1))
model.add_module('conv_2_bn', nn.BatchNorm2d(16))
model.add_module('pool_1', nn.MaxPool2d((2,2)))
model.add_module('conv2_relu',nn.ReLU())

model.add_module('conv_3', nn.Conv2d(in_channels=16, out_channels=32, kernel_size=(3,3), padding=1))
model.add_module('conv_3_bn', nn.BatchNorm2d(32))
model.add_module('conv3_relu',nn.ReLU())

model.add_module('conv_4', nn.Conv2d(in_channels=32, out_channels=64, kernel_size=(3,3), padding=1))
model.add_module('conv_4_bn', nn.BatchNorm2d(64))
model.add_module('pool_2', nn.MaxPool2d((2,2)))
model.add_module('conv4_relu',nn.ReLU())

##############
model.add_module('conv_5', nn.Conv2d(in_channels=64, out_channels=128, kernel_size=(3,3), padding=1))
model.add_module('conv_5_bn', nn.BatchNorm2d(128))
model.add_module('conv5_relu',nn.ReLU())

model.add_module('conv_6', nn.Conv2d(in_channels=128, out_channels=256, kernel_size=(3,3), padding=1))
model.add_module('conv_6_bn', nn.BatchNorm2d(256))
model.add_module('pool_3', nn.MaxPool2d((2,2)))
model.add_module('conv6_relu',nn.ReLU())

###########

model.add_module('flatten', Flatten())

model.add_module('dense_1', nn.Linear(4096, 1024))
model.add_module('dense_1_bn', nn.BatchNorm1d(1024))

model.add_module('dense1_relu', nn.ReLU())

model.add_module('dense_1.5', nn.Linear(1024, 256))
model.add_module('dense_1.5_bn', nn.BatchNorm1d(256))

model.add_module('dense1.5_relu', nn.ReLU())

model.add_module('dense_2', nn.Linear(256, 100))
model.add_module('dense_2_bn', nn.BatchNorm1d(100))

model.add_module('dense2_relu', nn.ReLU())

model.add_module('dense3_logits', nn.Linear(100, 10))
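
The `4096` inputs of `dense_1` come from the shape that reaches `Flatten`: every convolution uses a 3x3 kernel with `padding=1`, so only the three 2x2 max-pools change the spatial size. A small sanity check of that arithmetic (pure Python, no PyTorch needed):

```python
side = 32                 # CIFAR10 images are 32x32
for _ in range(3):        # pool_1, pool_2, pool_3
    side //= 2            # each 2x2 max-pool halves height and width
channels = 256            # out_channels of conv_6
flat_features = channels * side * side
print(side, flat_features)
```

If you change the number of pooling layers or the width of the last convolution, the first `nn.Linear` has to be resized accordingly.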

In [7]:
def compute_loss_cpu(X_batch, y_batch):
    X_batch = Variable(torch.FloatTensor(X_batch))
    y_batch = Variable(torch.LongTensor(y_batch))
    logits = model(X_batch)
    return F.cross_entropy(logits, y_batch).mean()

In [8]:
compute_loss_cpu(X_train[:5], y_train[:5])

Variable containing:
 2.4060
[torch.FloatTensor of size 1]

In [9]:
model.cuda()

Sequential(
  (conv_1): Conv2d (3, 10, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv1_relu): ReLU()
  (conv_2): Conv2d (10, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv_2_bn): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True)
  (pool_1): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
  (conv2_relu): ReLU()
  (conv_3): Conv2d (16, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv_3_bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True)
  (conv3_relu): ReLU()
  (conv_4): Conv2d (32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv_4_bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
  (pool_2): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
  (conv4_relu): ReLU()
  (conv_5): Conv2d (64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv_5_bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
  (conv5_relu): ReLU()
  (conv_6): Conv2d (128, 256, kernel_size=(3, 3), str

In [10]:
def compute_loss(X_batch, y_batch):
    X_batch = Variable(torch.cuda.FloatTensor(X_batch))
    y_batch = Variable(torch.cuda.LongTensor(y_batch))
    logits = model(X_batch)
    return F.cross_entropy(logits, y_batch).mean()

__Training__

In [11]:
def iterate_minibatches(X, y, batchsize):
    indices = np.random.permutation(np.arange(len(X)))
    for start in range(0, len(indices), batchsize):
        ix = indices[start: start + batchsize]
        yield X[ix], y[ix]
        
opt = torch.optim.Adam(model.parameters())

train_loss = []
val_accuracy = []

In [12]:
# with open(save_path, 'wb') as f:
#     torch.save(model.state_dict(), f)

In [13]:
# # with open('best_model.torch', 'rb') as f:
# model.load_state_dict(torch.load(save_path))

In [14]:
import time
num_epochs = 100 # total amount of full passes over training data
batch_size = 50  # number of samples processed in one SGD iteration
max_val_acc = None
max_val_acc_epoch = None
save_path = 'best_model2.torch'

for epoch in range(num_epochs):
    # In each epoch, we do a full pass over the training data:
    start_time = time.time()
    model.train(True) # enable dropout / batch_norm training behavior
    for X_batch, y_batch in iterate_minibatches(X_train, y_train, batch_size):
        # train on batch
        loss = compute_loss(X_batch, y_batch)
        loss.backward()
        opt.step()
        opt.zero_grad()
        train_loss.append(loss.data.cpu().numpy()[0])
        
    # And a full pass over the validation data:
    model.train(False) # disable dropout / use averages for batch_norm
    for X_batch, y_batch in iterate_minibatches(X_val, y_val, batch_size):
        logits = model(Variable(torch.cuda.FloatTensor(X_batch)))
        y_pred = logits.max(1)[1].data.cpu().numpy()
        val_accuracy.append(np.mean(y_batch == y_pred))
    val_acc = np.mean(val_accuracy[-len(X_val) // batch_size :]) * 100
    
    
    # Then we print the results for this epoch:
    print("Epoch {} of {} took {:.3f}s".format(
        epoch + 1, num_epochs, time.time() - start_time))
    print("  training loss (in-iteration): \t{:.6f}".format(
        np.mean(train_loss[-len(X_train) // batch_size :])))
    print("  validation accuracy: \t\t\t{:.2f} %".format(
        val_acc))
    
    if max_val_acc is None or val_acc > max_val_acc:
        max_val_acc = val_acc
        max_val_acc_epoch = epoch
        print('this model is best so far, saving it to disk')
        with open(save_path, 'wb') as f:
            torch.save(model.state_dict(), f)
    elif epoch - max_val_acc_epoch > 10:
        print('early stopping, loading best model')
        model.load_state_dict(torch.load(save_path))
        break

Epoch 1 of 100 took 43.039s
  training loss (in-iteration): 	1.221769
  validation accuracy: 			66.10 %
this model is best so far, saving it to disk
Epoch 2 of 100 took 42.555s
  training loss (in-iteration): 	0.824221
  validation accuracy: 			71.13 %
this model is best so far, saving it to disk
Epoch 3 of 100 took 42.563s
  training loss (in-iteration): 	0.651109
  validation accuracy: 			71.31 %
this model is best so far, saving it to disk
Epoch 4 of 100 took 42.542s
  training loss (in-iteration): 	0.514053
  validation accuracy: 			77.49 %
this model is best so far, saving it to disk
Epoch 5 of 100 took 42.548s
  training loss (in-iteration): 	0.399793
  validation accuracy: 			76.67 %
Epoch 6 of 100 took 42.566s
  training loss (in-iteration): 	0.295286
  validation accuracy: 			77.13 %
Epoch 7 of 100 took 42.558s
  training loss (in-iteration): 	0.213828
  validation accuracy: 			77.75 %
this model is best so far, saving it to disk
Epoch 8 of 100 took 42.569s
  training loss (in

Let's load the best model

In [25]:
model = nn.Sequential()

model.add_module('conv_1', nn.Conv2d(in_channels=3, out_channels=10, kernel_size=(3,3), padding=1))
model.add_module('conv1_relu',nn.ReLU())

model.add_module('conv_2', nn.Conv2d(in_channels=10, out_channels=16, kernel_size=(3,3), padding=1))
model.add_module('conv_2_bn', nn.BatchNorm2d(16))
model.add_module('conv2_relu',nn.ReLU())

model.add_module('pool_1', nn.MaxPool2d((2,2)))

model.add_module('conv_3', nn.Conv2d(in_channels=16, out_channels=32, kernel_size=(3,3), padding=1))
model.add_module('conv_3_bn', nn.BatchNorm2d(32))
model.add_module('conv3_relu',nn.ReLU())

model.add_module('conv_4', nn.Conv2d(in_channels=32, out_channels=64, kernel_size=(3,3), padding=1))
model.add_module('conv_4_bn', nn.BatchNorm2d(64))
model.add_module('conv4_relu',nn.ReLU())

model.add_module('pool_2', nn.MaxPool2d((2,2)))
##############
model.add_module('conv_5', nn.Conv2d(in_channels=64, out_channels=128, kernel_size=(3,3), padding=1))
model.add_module('conv_5_bn', nn.BatchNorm2d(128))
model.add_module('conv5_relu',nn.ReLU())

model.add_module('conv_6', nn.Conv2d(in_channels=128, out_channels=256, kernel_size=(3,3), padding=1))
model.add_module('conv_6_bn', nn.BatchNorm2d(256))
model.add_module('conv6_relu',nn.ReLU())

model.add_module('pool_3', nn.MaxPool2d((2,2)))
###########

model.add_module('flatten', Flatten())

model.add_module('dense_1', nn.Linear(4096, 256))
model.add_module('dense_1_bn', nn.BatchNorm1d(256))

model.add_module('dense1_relu', nn.ReLU())

model.add_module('dense_2', nn.Linear(256, 100))
model.add_module('dense_2_bn', nn.BatchNorm1d(100))

model.add_module('dense2_relu', nn.ReLU())

model.add_module('dense3_logits', nn.Linear(100, 10))

In [26]:
model.load_state_dict(torch.load('best_model.torch'))
model.cuda()

Sequential(
  (conv_1): Conv2d (3, 10, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv1_relu): ReLU()
  (conv_2): Conv2d (10, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv_2_bn): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True)
  (conv2_relu): ReLU()
  (pool_1): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
  (conv_3): Conv2d (16, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv_3_bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True)
  (conv3_relu): ReLU()
  (conv_4): Conv2d (32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv_4_bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
  (conv4_relu): ReLU()
  (pool_2): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
  (conv_5): Conv2d (64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv_5_bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
  (conv5_relu): ReLU()
  (conv_6): Conv2d (128, 256, kernel_size=(3, 3), str

In [33]:
# sanity check on the TRAINING set (this cell measures train accuracy, not test accuracy)
model.train(False) # disable dropout / use averages for batch_norm
train_batch_acc = []
for X_batch, y_batch in iterate_minibatches(X_train, y_train, 500):
    logits = model(Variable(torch.cuda.FloatTensor(X_batch)))
    y_pred = logits.max(1)[1].cpu().data.numpy()
    train_batch_acc.append(np.mean(y_batch == y_pred))

train_accuracy = np.mean(train_batch_acc)
    
print("Final results:")
print("  train accuracy:\t\t{:.5f} %".format(
    train_accuracy * 100))

Final results:
  train accuracy:		99.61500 %


In [31]:
model.train(False) # disable dropout / use averages for batch_norm
test_batch_acc = []
for X_batch, y_batch in iterate_minibatches(X_test, y_test, 500):
    logits = model(Variable(torch.cuda.FloatTensor(X_batch)))
    y_pred = logits.max(1)[1].cpu().data.numpy()
    test_batch_acc.append(np.mean(y_batch == y_pred))

test_accuracy = np.mean(test_batch_acc)
    
print("Final results:")
print("  test accuracy:\t\t{:.5f} %".format(
    test_accuracy * 100))

if test_accuracy * 100 > 95:
    print("Double-check, then consider applying for NIPS'17. SRSly.")
elif test_accuracy * 100 > 90:
    print("U'r freakin' amazin'!")
elif test_accuracy * 100 > 80:
    print("Achievement unlocked: 110lvl Warlock!")
elif test_accuracy * 100 > 70:
    print("Achievement unlocked: 80lvl Warlock!")
elif test_accuracy * 100 > 60:
    print("Achievement unlocked: 70lvl Warlock!")
elif test_accuracy * 100 > 50:
    print("Achievement unlocked: 60lvl Warlock!")
else:
    print("We need more magic! Follow instructions below")

Final results:
  test accuracy:		80.04306 %
Achievement unlocked: 110lvl Warlock!




# Report

All creative approaches are highly welcome, but at the very least it would be great to mention
* the idea;
* brief history of tweaks and improvements;
* what is the final architecture and why?
* what is the training method and, again, why?
* Any regularizations and other techniques applied and their effects;


There is no need to write strict mathematical proofs (unless you want to).
 * "I tried this, this and this, and the second one turned out to be better. And I just didn't like the name of that one" - OK, but can be better
 * "I have analyzed these and these articles|sources|blog posts, tried that and that to adapt them to my problem and the conclusions are such and such" - the ideal one
 * "I took the code from that demo without understanding it, but I'll never confess that and instead I'll make up some pseudoscientific explanation" - __not_ok__

### Hi, my name is `Ivan Golovanov`, and here's my story

A long time ago in a galaxy far far away, when it was still more than an hour before the deadline, I got an idea:

##### I'm gonna build a neural network that
* will have 95% accuracy and train on CPU in a matter of minutes
* will not have way too many layers, will include batchnorm + image rotation + early stopping and thus will train very easily
* because that is the way we were taught during the CNN seminar

How could I be so naive?!

##### One day, with no signs of warning,
This thing has finally converged and
* Some explanation of what the results were,
* what worked and what didn't
* most importantly - what next steps were taken, if any
* and what were their respective outcomes

Starting point:
NN from the seminar + early stopping + padding 2 = 63.35 % test accuracy

* Adding one more repetition of CONV -> BATCHNORM -> POOL -> RELU == 67.60 % test accuracy

* Batch Size 64 == 68.30 % test accuracy

* Switch the layers before Flatten to (Conv2d -> Conv2d -> Batchnorm -> ReLU -> POOL) ** 2 == 74.30 % test accuracy

* Stack more layers!

* (Conv2d -> Conv2d -> Batchnorm -> ReLU -> POOL) ** 3 == 76.70 % test accuracy 

* Increase output of Convolutional layers (used to be 10-15-20.., now 10-16-32-64..256) == 80.00 % test accuracy (yeah, I know)

* Attempt to replace ReLU with LeakyReLU, test accuracy down to 79.47 %. Back to ReLU.

* Add one more FC layer == 79.25 %

##### Finally, after 6 iterations, 5 mugs of coffee
* Final Architecture:
(Conv2d -> Conv2d -> Batchnorm -> ReLU -> POOL) ** 3

Flatten()

(dense_1): Linear(in_features=4096, out_features=1024)
(dense_1_bn): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True)
(dense1_relu): ReLU()
(dense_1.5): Linear(in_features=1024, out_features=256)
(dense_1.5_bn): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True)
(dense1.5_relu): ReLU()
(dense_2): Linear(in_features=256, out_features=100)
(dense_2_bn): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True)
(dense2_relu): ReLU()
(dense3_logits): Linear(in_features=100, out_features=10)

(full architecture printed below).

* as well as training method and tricks
Training on GPU really does the trick.
I really wanted to try Data Augmentation, but failed to launch it out of the box and decided to go without it.
Though, I will master it this weekend.

That, having wasted 4 evenings of my life training, got

* accuracy on training: 99.615 %
* accuracy on validation: 80.46 %
* accuracy on test: 80.04306 % (50 batch size)


I guess I should have added Dropout to prevent such overfitting..
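
A possible shape for that fix, as a sketch only (it was not run for this report): `nn.Dropout` between the dense layers, plus L2 regularization through the optimizer's `weight_decay` argument. The dropout rate `0.5` and `weight_decay=5e-4` are placeholder values, not tuned ones, and only the dense head is shown:

```python
import torch
import torch.nn as nn

# Dense head only; it expects the 4096 flattened features from the conv stack.
head = nn.Sequential(
    nn.Linear(4096, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Dropout(p=0.5),          # active in train mode, identity in eval mode
    nn.Linear(256, 100),
    nn.BatchNorm1d(100),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(100, 10),
)

# weight_decay applies an L2 penalty on the parameters inside the optimizer step
opt = torch.optim.Adam(head.parameters(), weight_decay=5e-4)

head.eval()                     # disable dropout for evaluation
with torch.no_grad():
    logits = head(torch.zeros(2, 4096))
print(logits.shape)
```

Whether this actually narrows the gap between 99.6 % training accuracy and 80 % test accuracy would have to be checked one change at a time.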

Architecture:

In [35]:
print(model)

Sequential(
  (conv_1): Conv2d (3, 10, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv1_relu): ReLU()
  (conv_2): Conv2d (10, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv_2_bn): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True)
  (conv2_relu): ReLU()
  (pool_1): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
  (conv_3): Conv2d (16, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv_3_bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True)
  (conv3_relu): ReLU()
  (conv_4): Conv2d (32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv_4_bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
  (conv4_relu): ReLU()
  (pool_2): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
  (conv_5): Conv2d (64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv_5_bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
  (conv5_relu): ReLU()
  (conv_6): Conv2d (128, 256, kernel_size=(3, 3), str

That's all.