# Homework 2.2: The Quest For A Better Network

In this assignment you will build a monster network to solve Tiny ImageNet image classification.

This notebook is intended as a sequel to seminar 3; please give it a try if you haven't done so yet.

(please read it at least diagonally)

* The ultimate quest is to create a network that has as high __accuracy__ as you can push it.
* There is a __mini-report__ at the end that you will have to fill in. We recommend reading it first and filling it while you iterate.
 
## Grading
* starting at zero points
* +20% for describing your iteration path in a report below.
* +20% for building a network that gets above 20% accuracy
* +10% for beating each of these milestones on __TEST__ dataset:
    * 25% (50% points)
    * 30% (60% points)
    * 32.5% (70% points)
    * 35% (80% points)
    * 37.5% (90% points)
    * 40% (full points)
    
## Restrictions
* Please do NOT use pre-trained networks for this assignment until you reach 40%.
 * In other words, base milestones must be beaten without pre-trained nets (and such a net must be present in the anytask attachments). After that, you can use whatever you want.
* You __can't__ do anything with validation data apart from running the evaluation procedure. Please split the train images into train and validation parts.

## Tips on what can be done:


 * __Network size__
   * MOAR neurons, 
   * MOAR layers, ([torch.nn docs](http://pytorch.org/docs/master/nn.html))

   * Nonlinearities in the hidden layers
     * tanh, relu, leaky relu, etc
   * Larger networks may take more epochs to train, so don't discard your net just because it didn't beat the baseline in 5 epochs.

   * Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn!


### The main rule of prototyping: one change at a time
   * By now you probably have several ideas on what to change. By all means, try them out! But there's a catch: __never test several new things at once__.


### Optimization
   * Training for 100 epochs regardless of anything is probably a bad idea.
   * Some networks converge over 5 epochs, others - over 500.
   * Way to go: stop when validation score is 10 iterations past maximum
   * You should certainly use adaptive optimizers
     * rmsprop, nesterov_momentum, adam, adagrad and so on.
     * Converge faster and sometimes reach better optima
     * It might make sense to tweak learning rate/momentum, other learning parameters, batch size and number of epochs
   * __BatchNormalization__ (nn.BatchNorm2d) for the win!
     * Sometimes more batch normalization is better.
   * __Regularize__ to prevent overfitting
     * Add some L2 weight norm to the loss function, PyTorch will do the rest
       * Can be done manually or like [this](https://discuss.pytorch.org/t/simple-l2-regularization/139/2).
     * Dropout (`nn.Dropout`) - to prevent overfitting
       * Don't overdo it. Check if it actually makes your network better
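
The "stop when the validation score is N checks past its maximum" rule from the list above can be sketched in a few lines. This is only an illustration under assumptions: the `EarlyStopping` class, the `patience` parameter name and the toy scores are invented for the example and are not part of the assignment or of PyTorch.

```python
class EarlyStopping:
    """Signal a stop once the score has not improved for `patience` checks."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_score = float("-inf")
        self.checks_since_best = 0

    def step(self, score):
        """Record a new validation score; return True when training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.checks_since_best = 0
        else:
            self.checks_since_best += 1
        return self.checks_since_best >= self.patience


# Toy validation curve (made-up numbers, purely illustrative).
scores = [0.10, 0.20, 0.25, 0.24, 0.26, 0.25, 0.25, 0.24]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, score in enumerate(scores):
    if stopper.step(score):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_score)
```

In a real training loop one would call `stopper.step(validation_accuracy)` once per epoch and keep a copy of the weights from the best epoch.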
   
### Convolution architectures
   * This task __can__ be solved by a sequence of convolutions and poolings with batch_norm and ReLU seasoning, but you shouldn't necessarily stop there.
   * [Inception family](https://hacktilldawn.com/2016/09/25/inception-modules-explained-and-implemented/), [ResNet family](https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035?gi=9018057983ca), [Densely-connected convolutions (exotic)](https://arxiv.org/abs/1608.06993), [Capsule networks (exotic)](https://arxiv.org/abs/1710.09829)
   * Please do try a few simple architectures before you go for resnet-152.
   * Warning! Training convolutional networks can take long without GPU. That's okay.
     * If you are CPU-only, we still recommend that you try a simple convolutional architecture
     * a perfect option is if you can set it up to run at nighttime and check on it in the morning.
     * Make reasonable layer size estimates. A 128-neuron first convolution is likely an overkill.
     * __To reduce computation__ time by a factor in exchange for some accuracy drop, try using __stride__ parameter. A stride=2 convolution should take roughly 1/4 of the default (stride=1) one.
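
The "roughly 1/4" figure for stride follows from the output size formula of a convolution: doubling the stride halves each spatial dimension, so the layer evaluates about a quarter as many output positions. A small sketch of that arithmetic is below; the helper names and the layer settings (3 input channels, 32 output channels, `padding=1`) are assumptions chosen for illustration.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution (dilation 1), as in nn.Conv2d."""
    return (size + 2 * padding - kernel_size) // stride + 1


def conv_macs(in_size, in_channels, out_channels, kernel_size, stride=1, padding=0):
    """Multiply-accumulate count of one square convolution layer."""
    out = conv_output_size(in_size, kernel_size, stride, padding)
    return out * out * out_channels * in_channels * kernel_size * kernel_size


# 64x64 input, 3x3 kernel: compare stride=1 with stride=2.
macs_s1 = conv_macs(64, 3, 32, 3, stride=1, padding=1)
macs_s2 = conv_macs(64, 3, 32, 3, stride=2, padding=1)
ratio = macs_s2 / macs_s1
print(ratio)  # 0.25
```

Every later layer also sees a 4x smaller feature map, so the saving carries through the rest of the network.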
 
   
### Data augmentation
   * getting 5x as large dataset for free is a great deal
     * Zoom-in+slice = move
     * Rotate+zoom(to remove black stripes)
     * Add Noise (gaussian or bernoulli)
   * Simple way to do that (if you have PIL/Image): 
     * ```from scipy.misc import imrotate,imresize``` (these helpers were removed from recent SciPy releases; there, `PIL.Image` or `skimage.transform` offer rotate/resize equivalents)
     * and a bit of slicing
     * Other cool libraries: cv2, skimage, PIL/Pillow
   * A more advanced way is to use torchvision transforms:
    ```
    transform_train = transforms.Compose([
        transforms.RandomCrop(64, padding=4),  # Tiny ImageNet images are 64x64
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])
    # ImageFolder has no train/download arguments; point root at the train folder instead
    trainset = torchvision.datasets.ImageFolder(root=path_to_tiny_imagenet, transform=transform_train)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
    ```
   * Or use this tool from Keras (requires theano/tensorflow): [tutorial](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html), [docs](https://keras.io/preprocessing/image/)
   * Stay realistic. There's usually no point in flipping dogs upside down as that is not the way you usually see them.
   


In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import torch, torch.nn as nn
import torch.nn.functional as F
import torchvision
from torchvision import transforms
import skimage
from skimage import io
from PIL import Image
from torch.utils.data import SubsetRandomSampler
from torch.utils.data import Dataset, DataLoader
import time

%matplotlib inline

In [2]:
# from tiny_img import download_tinyImg200
# data_path = '.'
# download_tinyImg200(data_path)
transform_train = transforms.Compose([
    transforms.RandomChoice(
        [transforms.RandomRotation(5),
         transforms.RandomHorizontalFlip(0.3),
         transforms.RandomPerspective(distortion_scale=0.5, p=0.5),
         transforms.RandomAffine(2, scale=(1,1.3))]),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

dataset = torchvision.datasets.ImageFolder('tiny-imagenet-200/train', transform=transform_train)
train_dataset, val_dataset = torch.utils.data.random_split(dataset, [80000, 20000])

In [3]:
# we'll need this later, when testing the model at the end
class_to_idx = dataset.class_to_idx

In [4]:
batch_size = 64

# random_split already returns Subset objects, so no extra samplers are needed
train_loader = torch.utils.data.DataLoader(train_dataset, 
                            batch_size=batch_size,
                            shuffle=True,
                            num_workers=4)

# evaluation does not depend on the order of samples
val_loader = torch.utils.data.DataLoader(val_dataset, 
                            batch_size=batch_size,
                            shuffle=False,
                            num_workers=4)

In [5]:
# fall back to CPU when no GPU is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [6]:
model = nn.Sequential()

model.add_module('conv1', nn.Conv2d(3, 200, kernel_size=(3,3), stride=1))
model.add_module('bn1_1', nn.BatchNorm2d(200))
model.add_module('relu1_1', nn.ReLU())
model.add_module('conv1_2', nn.Conv2d(200, 200, kernel_size=(3,3), stride=1))
model.add_module('bn1_2', nn.BatchNorm2d(200))
model.add_module('relu1_2', nn.ReLU())
model.add_module('maxpool1', nn.MaxPool2d(3))

model.add_module('conv2_1', nn.Conv2d(200, 400, kernel_size=(3,3), stride=1))
model.add_module('bn2_1', nn.BatchNorm2d(400))
model.add_module('relu2_1', nn.ReLU())
model.add_module('conv2_2', nn.Conv2d(400, 400, kernel_size=(3,3), stride=1))
model.add_module('bn2_2', nn.BatchNorm2d(400))
model.add_module('relu2_2', nn.ReLU())
model.add_module('maxpool2', nn.MaxPool2d(3))

model.add_module('flatten', nn.Flatten())
model.add_module('fc1', nn.Linear(10000, 1000))
model.add_module('dp1', nn.Dropout(0.5))
model.add_module('fc2', nn.Linear(1000, 200))
model = model.to(device)
loss_fn = nn.CrossEntropyLoss()

#L2 regularization is added through weight_decay
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, weight_decay=0.00001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

In [7]:
def train_model(model, train_loader, val_loader, optimizer, num_epochs, scheduler = None):
    for epoch in range(num_epochs):
        
        model.train(True)
        start_time = time.time()
        for x_batch, y_batch in train_loader:
            
            # I. Training
            #1 Train on GPU
            x_batch = x_batch.to(device)
            y_batch = y_batch.to(device)
            
            #2 Clear the gradients
            optimizer.zero_grad()
            
            #3 Forward
            predictions = model.forward(x_batch)
            
            #4 Calculating loss
            loss = loss_fn(predictions, y_batch)
            
            #5 Calculating gradients
            loss.backward()

            #6 Optimizer step
            optimizer.step()
            
            # II. Tracking the training
            train_loss.append(loss.item())
        
        if scheduler:
            scheduler.step()
        # III. Validation
        model.train(False) # disable dropout / use averages for batch_norm
        with torch.no_grad(): # no gradients are needed for evaluation
            for x_batch, y_batch in val_loader:
                x_batch = x_batch.to(device)
                y_batch = y_batch.to(device)
                predictions = model(x_batch)
                y_pred = predictions.max(1)[1]
                val_accuracy.append(np.mean( (y_batch.cpu() == y_pred.cpu()).numpy() ))
        
        # -len(val_dataset) // batch_size floors towards -inf, so the slice covers every batch of this epoch
        validation_accuracy = np.mean(val_accuracy[-len(val_dataset) // batch_size :]) * 100
        # IV. Reporting
        # Then we print the results for this epoch:
        print("Epoch {} of {} took {:.3f}s".format(
            epoch + 1, num_epochs, time.time() - start_time))
        print("  training loss (in-iteration): \t{:.6f}".format(
            np.mean(train_loss[-len(train_dataset) // batch_size :])))
        print("  validation accuracy: \t\t\t{:.2f} %".format(
            validation_accuracy))
        
        if validation_accuracy > 40:
            print(f'Fitted the model to exceed 40% on the validation set. Exiting loop on epoch {epoch + 1}.')
            break

In [8]:
train_loss = []
val_accuracy = []
train_model(model, train_loader, val_loader, optimizer, 100, scheduler = scheduler)

Epoch 1 of 100 took 109.162s
  training loss (in-iteration): 	4.654612
  validation accuracy: 			14.59 %
Epoch 2 of 100 took 109.329s
  training loss (in-iteration): 	3.920082
  validation accuracy: 			20.14 %
Epoch 3 of 100 took 109.387s
  training loss (in-iteration): 	3.606608
  validation accuracy: 			23.89 %
Epoch 4 of 100 took 109.428s
  training loss (in-iteration): 	3.388545
  validation accuracy: 			27.09 %
Epoch 5 of 100 took 109.625s
  training loss (in-iteration): 	3.224967
  validation accuracy: 			28.96 %
Epoch 6 of 100 took 109.388s
  training loss (in-iteration): 	3.086675
  validation accuracy: 			30.04 %
Epoch 7 of 100 took 109.595s
  training loss (in-iteration): 	2.965666
  validation accuracy: 			31.13 %
Epoch 8 of 100 took 109.481s
  training loss (in-iteration): 	2.854407
  validation accuracy: 			33.02 %
Epoch 9 of 100 took 109.745s
  training loss (in-iteration): 	2.753513
  validation accuracy: 			32.06 %
Epoch 10 of 100 took 109.453s
  training loss (in-itera

In [None]:
# torch.save(model, 'cumbersome_model.pth')

In [9]:
# feel free to copypaste code from seminar03 as a basic template for training

When everything is done, please calculate accuracy on `tiny-imagenet-200/val`

In [10]:
labels = pd.read_csv('tiny-imagenet-200/val/val_annotations.txt', sep='\t', header=None)
labels.columns = ['imname', 'id', 'bb1', 'bb2', 'bb3', 'bb4']

In [11]:
class TestDataset(Dataset):
    def __init__(self, root_folder, labels_frame, class_to_idx, transform=None):

        self.transform = transform
        self.root_folder = root_folder
        self.labels_frame = labels_frame
        self.class_to_idx = class_to_idx

    def __len__(self):
        return len(self.labels_frame)  # every row is a sample (the file has no header row)
    
    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        img_name = os.path.join(self.root_folder,
                                self.labels_frame.loc[idx, 'imname'])
        image = io.imread(img_name)
        
        # Treating greyscale images
        if len(image.shape) < 3:
            image = skimage.color.gray2rgb(image)  # grey2rgb was removed from newer scikit-image

        if self.transform:
            image = self.transform(image)
        category = self.class_to_idx[self.labels_frame.loc[idx, 'id']]
        return image, category

In [12]:
test_dataset = TestDataset(root_folder='tiny-imagenet-200/val/images/',
                           labels_frame=labels,
                           class_to_idx=class_to_idx,
                           transform = transforms.Compose([
                               transforms.ToTensor(),
                               transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
                           ]))

In [13]:
test_loader = torch.utils.data.DataLoader(test_dataset, 
                            batch_size=10,
                            shuffle=True,
                            num_workers=1)

In [14]:
model.train(False)
correct_samples = 0
total_samples = 0
with torch.no_grad():  # no gradients are needed for evaluation
    for x_batch, y_batch in test_loader:
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)
        predictions = model(x_batch)
        y_pred = predictions.max(1)[1]
        correct_samples += torch.sum(y_pred == y_batch).item()
        total_samples += y_batch.shape[0]

In [15]:
test_accuracy = float(correct_samples) / total_samples

In [16]:
print("Final results:")
print("  test accuracy:\t\t{:.2f} %".format(
    test_accuracy * 100))

if test_accuracy * 100 > 40:
    print("Achievement unlocked: 110lvl Warlock!")
elif test_accuracy * 100 > 35:
    print("Achievement unlocked: 80lvl Warlock!")
elif test_accuracy * 100 > 30:
    print("Achievement unlocked: 70lvl Warlock!")
elif test_accuracy * 100 > 25:
    print("Achievement unlocked: 60lvl Warlock!")
else:
    print("We need more magic! Follow instructions below")

Final results:
  test accuracy:		42.07 %
Achievement unlocked: 110lvl Warlock!



# Report

All creative approaches are highly welcome, but at the very least it would be great to mention
* the idea;
* brief history of tweaks and improvements;
* what is the final architecture and why?
* what is the training method and, again, why?
* Any regularizations and other techniques applied and their effects;


There is no need to write strict mathematical proofs (unless you want to).
 * "I tried this, this and this, and the second one turned out to be better. And I just didn't like the name of that one" - OK, but can be better
 * "I have analyzed these and these articles|sources|blog posts, tried that and that to adapt them to my problem and the conclusions are such and such" - the ideal one
 * "I took that code from that demo without understanding it, but I'll never confess that and instead I'll make up some pseudoscientific explanation" - __not_ok__

### Hi, my name is `Oleg Polivin`, and here's my story

A long time ago in a galaxy far far away, when it was still more than an hour before the deadline, I got an idea:

##### I'm gonna build a neural network that
includes several convolutional layers followed by non-linearities, and then some fully-connected layers, so that at the end softmax can be applied over 200 classes. However, I was a bit confused about the order in which batchnorm, non-linearity, maxpool and dropout layers should be applied. Watching seminar 3 helped a lot: there Vika Checkalina explained clearly that BN is normally followed by ReLU and not vice versa. This helped to establish the structure of the network.

Actually, I was not naive at all. My final architecture is not that different from where I started, but what is interesting is that about three days separate my first architecture from my last one. I ran many experiments only to come back to where I started (all of them gave worse results).

##### One day, with no signs of warning,
This thing has finally converged and
* Some explanation of what the results were,

Right from the beginning, I was getting accuracy close to 34%. Here is my first architecture:
```
model = nn.Sequential()
model.add_module('conv1', nn.Conv2d(3, 200, kernel_size=(3,3), stride=1))
model.add_module('bn1', nn.BatchNorm2d(200))
model.add_module('relu1', nn.ReLU())
model.add_module('maxpool1', nn.MaxPool2d(3))
model.add_module('conv2', nn.Conv2d(200, 400, kernel_size=(3,3), stride=1))
model.add_module('bn2', nn.BatchNorm2d(400))
model.add_module('relu2', nn.ReLU())
model.add_module('maxpool2', nn.MaxPool2d(3))
model.add_module('flatten', nn.Flatten())
model.add_module('fc1', nn.Linear(14400, 1000))
model.add_module('fc2', nn.Linear(1000, 200))
```

* what worked and what didn't

 I was quite happy, and thought that it was an easy task. However, I could not increase the accuracy further. And then I told myself: look, it is stated in the homework that we should `Make reasonable layer size estimates. A 128-neuron first convolution is likely an overkill.` So I started trying networks that were narrower but had more layers: I added Leaky ReLUs and Dropout between Linear layers, and tried to make a network similar to VGG16, but it didn't help!!! Actually, I still don't fully understand why. My idea was to start with few `out_channels` in the first convolution layer, something like `16` or `32`, and then increase from there. But to no avail: the networks got stuck at 5%, 10%, 15%, 25% accuracy and didn't improve further. I introduced a scheduler with cosine annealing, with a decrease every N epochs, but that didn't help either. I was really confused. Why didn't it work, even when it resembled the VGG16 network so closely? I started adding augmentations; that helped, but I still couldn't get above 32% accuracy.

Increasing the default Adam LR didn't help. 
Decreasing it made learning too slow.

* most importantly - what next steps were taken, if any
* and what were their respective outcomes

By the third day I remembered that my first network, where I didn't try to reproduce VGG, actually worked better than all my experiments so far. The key was that it had quite a large number of out channels in the first convolutional layer. I thought it contradicted the advice given at the beginning of the homework (about the overkill), but it gave the best results, so I decided to pursue that way. In the end, what worked:

[-] having 200 out channels in the first convolution layer.

[-] introducing data augmentations. Even a small amount (I have 4 to choose from, but apply only 1 to a given image) gave an increase in accuracy. Say, from 33-34% I went to 35-36%.

[-] But how to increase more???

[-] What helped me a lot at this stage was the image of the VGG16 architecture that I was looking at. You see, I was too used to architectures where the number of neurons/out channels either always increases or always decreases. But VGG repeats some convolutions, in the sense that the number of in channels = the number of out channels. The same goes for the fully-connected layers: 4096 -> 4096 -> number of classes. I decided to do it for my convolution layers, and it gave a great boost! 

These are the layers:

[-] `model.add_module('conv1_2', nn.Conv2d(200, 200, kernel_size=(3,3), stride=1))`

[-] `model.add_module('conv2_2', nn.Conv2d(400, 400, kernel_size=(3,3), stride=1))`

It increased accuracy on the validation set to 39%. But I was stuck there! I wanted the validation accuracy to go over 40%! So the final part of the quest was introducing a scheduler: a simple StepLR one that divides the learning rate by ten every 15 epochs. 

With a very small boost of 1%, I started to get over 40% each time I trained the network.
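
To make the schedule concrete, here is a small sketch of the learning rate that `StepLR(optimizer, step_size=15, gamma=0.1)` produces when `scheduler.step()` is called once per epoch, starting from the `lr=0.0001` used in the notebook. The `step_lr` helper is a hypothetical stand-in written for this illustration; it mirrors the closed-form rule rather than calling PyTorch.

```python
def step_lr(base_lr, epoch, step_size=15, gamma=0.1):
    """Learning rate in effect during `epoch` (0-based) under a step schedule."""
    return base_lr * gamma ** (epoch // step_size)


# Epochs just before and just after each drop.
for epoch in (0, 14, 15, 29, 30):
    print(epoch, step_lr(1e-4, epoch))
```

So epochs 0-14 train at 1e-4, epochs 15-29 at roughly 1e-5, and so on.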

##### Finally, after ~1000  iterations, 15 mugs of tea and 15 mugs of coffee
* what was the final architecture
* as well as training method and tricks

Here is the architecture:
```
model.add_module('conv1', nn.Conv2d(3, 200, kernel_size=(3,3), stride=1))
model.add_module('bn1_1', nn.BatchNorm2d(200))
model.add_module('relu1_1', nn.ReLU())
model.add_module('conv1_2', nn.Conv2d(200, 200, kernel_size=(3,3), stride=1))
model.add_module('bn1_2', nn.BatchNorm2d(200))
model.add_module('relu1_2', nn.ReLU())
model.add_module('maxpool1', nn.MaxPool2d(3))

model.add_module('conv2_1', nn.Conv2d(200, 400, kernel_size=(3,3), stride=1))
model.add_module('bn2_1', nn.BatchNorm2d(400))
model.add_module('relu2_1', nn.ReLU())
model.add_module('conv2_2', nn.Conv2d(400, 400, kernel_size=(3,3), stride=1))
model.add_module('bn2_2', nn.BatchNorm2d(400))
model.add_module('relu2_2', nn.ReLU())
model.add_module('maxpool2', nn.MaxPool2d(3))

model.add_module('flatten', nn.Flatten())
model.add_module('fc1', nn.Linear(10000, 1000))
model.add_module('dp1', nn.Dropout(0.5))
model.add_module('fc2', nn.Linear(1000, 200))
```
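
The `in_features` of `fc1` (14400 in the first architecture, 10000 in the final one) follows from how each layer shrinks the 64x64 input. The sketch below redoes that arithmetic in plain Python; the helper names are invented for the illustration, and it assumes 3x3 convolutions without padding and `nn.MaxPool2d(3)` with its default stride, as in the models above.

```python
def conv_out(size, kernel_size=3, stride=1, padding=0):
    """Spatial output size of nn.Conv2d (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def pool_out(size, kernel_size=3):
    """Spatial output size of nn.MaxPool2d with its default stride (= kernel_size)."""
    return (size - kernel_size) // kernel_size + 1


def flat_features(layers, channels, size=64):
    """Number of values reaching nn.Flatten for a stack of 'conv'/'pool' layers."""
    for layer in layers:
        size = conv_out(size) if layer == "conv" else pool_out(size)
    return channels * size * size


first = flat_features(["conv", "pool", "conv", "pool"], channels=400)
final = flat_features(["conv", "conv", "pool", "conv", "conv", "pool"], channels=400)
print(first, final)  # 14400 10000
```

The extra convolutions in the final model each trim two pixels per side pair, which is why the flattened size drops from 400 x 6 x 6 to 400 x 5 x 5.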

So it heavily relies on convolution and maxpooling.
Training methods and tricks were described above. Most important are using Adam as the gradient descent method + using data augmentations + repeating some convolutions in the sense that I added convolution filters that have in_channels = out_channels.

That, having wasted 10 hours of my life training, got

* accuracy on training: I didn't check it.
* accuracy on validation: 40.09%
* accuracy on test: 42.07%


[an optional afterword and mortal curses on assignment authors]

Thanks a lot for the task!!!