# Homework 2.2: The Quest For A Better Network

In this assignment you will build a monster network to solve Tiny ImageNet image classification.

This notebook is intended as a sequel to seminar 3; please give that one a try if you haven't done so yet.

(please read it at least diagonally)

* The ultimate quest is to create a network with as high an __accuracy__ as you can push it to.
* There is a __mini-report__ at the end that you will have to fill in. We recommend reading it first and filling it in while you iterate.
 
## Grading
* starting at zero points
* +20% for describing your iteration path in the report below.
* +20% for building a network that gets above 20% accuracy
* +10% for beating each of these milestones on the __TEST__ dataset:
    * 25% (50% points)
    * 30% (60% points)
    * 32.5% (70% points)
    * 35% (80% points)
    * 37.5% (90% points)
    * 40% (full points)
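
The scheme above can be sketched as a small function. This is only an illustration: the name `grade_percent` is hypothetical, "beating" a milestone is assumed to mean strictly exceeding it, and the 20% threshold is assumed to be measured on the same accuracy as the milestones.

```python
def grade_percent(test_accuracy, has_report=True):
    """Return the grade (in % of full points) for a test accuracy given in percent."""
    points = 0
    if has_report:
        points += 20  # iteration path described in the report
    if test_accuracy > 20.0:
        points += 20  # network above 20% accuracy
    # Each beaten milestone adds another 10%.
    for milestone in (25.0, 30.0, 32.5, 35.0, 37.5, 40.0):
        if test_accuracy > milestone:
            points += 10
    return points


print(grade_percent(31.0))   # 60, matching the "30% (60% points)" row
print(grade_percent(40.18))  # 100
```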
    
## Restrictions
* Please do NOT use pre-trained networks for this assignment until you reach 40%.
 * In other words, base milestones must be beaten without pre-trained nets (and such a net must be present in the anytask attachments). After that, you can use whatever you want.
* You __can't__ do anything with validation data apart from running the evaluation procedure. Please split the train images into train and validation parts.

## Tips on what can be done:


 * __Network size__
   * MOAR neurons, 
   * MOAR layers, ([torch.nn docs](http://pytorch.org/docs/master/nn.html))

   * Nonlinearities in the hidden layers
     * tanh, relu, leaky relu, etc
   * Larger networks may take more epochs to train, so don't discard your net just because it didn't beat the baseline in 5 epochs.

   * Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn!


### The main rule of prototyping: one change at a time
   * By now you probably have several ideas on what to change. By all means, try them out! But there's a catch: __never test several new things at once__.


### Optimization
   * Training for 100 epochs regardless of anything is probably a bad idea.
   * Some networks converge over 5 epochs, others over 500.
   * Way to go: stop when the validation score has gone 10 evaluations (e.g. epochs) without beating its maximum
   * You should certainly use adaptive optimizers
     * rmsprop, nesterov_momentum, adam, adagrad and so on.
     * They converge faster and sometimes reach better optima
     * It might make sense to tweak the learning rate/momentum, other learning parameters, batch size and number of epochs
   * __BatchNormalization__ (nn.BatchNorm2d) for the win!
     * Sometimes more batch normalization is better.
   * __Regularize__ to prevent overfitting
     * Add some L2 weight norm to the loss function, PyTorch will do the rest
       * Can be done manually or like [this](https://discuss.pytorch.org/t/simple-l2-regularization/139/2).
     * Dropout (`nn.Dropout`) - to prevent overfitting
       * Don't overdo it. Check if it actually makes your network better
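
The stopping rule from the tips above (stop once the validation score is well past its maximum) can be sketched in plain Python. The class name `EarlyStopping` and its methods are hypothetical helpers, not part of PyTorch, and "iterations" is read here as "validation evaluations", which is an assumption.

```python
class EarlyStopping:
    """Track the best validation score and signal when to stop.

    `patience` is the number of consecutive evaluations without improvement
    that we tolerate (10 in the tip above).
    """

    def __init__(self, patience=10):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_step = -1
        self.step = -1

    def update(self, val_score):
        """Record a new validation score; return True if training should stop."""
        self.step += 1
        if val_score > self.best_score:
            self.best_score = val_score
            self.best_step = self.step
        return self.step - self.best_step >= self.patience


# Toy usage with made-up validation accuracies (not real results).
stopper = EarlyStopping(patience=3)
fake_val_scores = [0.10, 0.20, 0.25, 0.24, 0.25, 0.23, 0.22, 0.30]
stopped_at = None
for epoch, score in enumerate(fake_val_scores):
    if stopper.update(score):
        stopped_at = epoch
        break
```

In a real loop you would typically also save the model weights whenever `best_score` improves, so that the best checkpoint (not the last one) is what gets evaluated.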
   
### Convolution architectures
   * This task __can__ be solved by a sequence of convolutions and poolings with batch_norm and ReLU seasoning, but you shouldn't necessarily stop there.
   * [Inception family](https://hacktilldawn.com/2016/09/25/inception-modules-explained-and-implemented/), [ResNet family](https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035?gi=9018057983ca), [Densely-connected convolutions (exotic)](https://arxiv.org/abs/1608.06993), [Capsule networks (exotic)](https://arxiv.org/abs/1710.09829)
   * Please do try a few simple architectures before you go for resnet-152.
   * Warning! Training convolutional networks can take a long time without a GPU. That's okay.
     * If you are CPU-only, we still recommend that you try a simple convolutional architecture
     * A perfect option is to set it up to run overnight and check on it in the morning.
     * Make reasonable layer size estimates. A 128-neuron first convolution is likely overkill.
     * __To reduce computation__ time in exchange for some accuracy drop, try using the __stride__ parameter. A stride=2 convolution should take roughly 1/4 of the time of the default (stride=1) one.
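
The "roughly 1/4" figure for stride=2 follows from the output size: halving both spatial dimensions leaves about a quarter of the output positions to compute. A rough sketch in plain Python, where the helper names and the layer sizes are illustrative assumptions and the cost model simply counts multiply-adds:

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution without dilation."""
    return (size + 2 * padding - kernel_size) // stride + 1


def conv_multiply_adds(size, in_channels, out_channels, kernel_size, stride=1, padding=0):
    """Approximate multiply-add count for one square input of side `size`."""
    out = conv_output_size(size, kernel_size, stride, padding)
    return out * out * out_channels * in_channels * kernel_size * kernel_size


# Hypothetical first layer: 64x64 RGB input, 32 filters, 3x3 kernel, padding=1.
full = conv_multiply_adds(64, 3, 32, kernel_size=3, stride=1, padding=1)
strided = conv_multiply_adds(64, 3, 32, kernel_size=3, stride=2, padding=1)
print(full, strided, strided / full)
```

Actual wall-clock savings depend on the hardware and the library kernels, so treat the ratio as an estimate rather than a guarantee.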
 
   
### Data augmentation
   * Getting a 5x larger dataset for free is great
     * Zoom-in + slice = move
     * Rotate + zoom (to remove black stripes)
     * Add noise (Gaussian or Bernoulli)
   * Simple way to do that (if you have PIL/Image): 
     * ```from scipy.misc import imrotate,imresize``` (these helpers were removed from recent SciPy releases; Pillow's `Image.rotate` and `Image.resize` cover the same ground)
     * and a bit of slicing
     * Other cool libraries: cv2, skimage, PIL/Pillow
   * A more advanced way is to use torchvision transforms:
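    ```
    transform_train = transforms.Compose([
        transforms.RandomCrop(64, padding=4),  # Tiny ImageNet images are 64x64
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        # NOTE: these mean/std values are the commonly quoted CIFAR-10 statistics;
        # recompute them on your own training split, and apply the same Normalize
        # to validation and test data.
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])
    # ImageFolder has no train/download arguments; root must point at a folder
    # that contains one subfolder per class (e.g. the train directory).
    trainset = torchvision.datasets.ImageFolder(root=path_to_tiny_imagenet, transform=transform_train)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
    ```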
   * Or use this tool from Keras (requires theano/tensorflow): [tutorial](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html), [docs](https://keras.io/preprocessing/image/)
   * Stay realistic. There's usually no point in flipping dogs upside down as that is not the way you usually see them.
   


In [0]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import torchvision
import torch
from torchvision import transforms
from torch import nn

import torch.nn.functional as F
from torch.autograd import Variable

from tqdm.auto import tqdm

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
from tiny_img import download_tinyImg200
data_path = './drive/My Drive/data/'
download_tinyImg200(data_path)

./drive/My Drive/data/tiny-imagenet-200.zip


In [0]:
! rm -r ./tiny-imagenet-200/test

In [0]:
import io
import glob
import os
from shutil import move
from os.path import join
from os import listdir, rmdir

target_folder = './tiny-imagenet-200/val/'
test_folder   = './tiny-imagenet-200/test/'

os.mkdir(test_folder)

# Map every validation file name to its class id
# (the first two tab-separated columns of val_annotations.txt)
val_dict = {}
with open('./tiny-imagenet-200/val/val_annotations.txt', 'r') as f:
    for line in f:
        split_line = line.split('\t')
        val_dict[split_line[0]] = split_line[1]

# Create <class>/images subfolders in both val/ and test/
paths = glob.glob('./tiny-imagenet-200/val/images/*')
for path in paths:
    file = os.path.basename(path)
    folder = val_dict[file]
    for root in (target_folder, test_folder):
        os.makedirs(root + folder + '/images', exist_ok=True)

# The first 25 images of each class stay in val/, the rest go to test/
for path in paths:
    file = os.path.basename(path)
    folder = val_dict[file]
    if len(glob.glob(target_folder + folder + '/images/*')) < 25:
        dest = target_folder + folder + '/images/' + file
    else:
        dest = test_folder + folder + '/images/' + file
    move(path, dest)

rmdir('./tiny-imagenet-200/val/images')
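
The cell above counts files on disk for every image it moves. The same per-class split can be sketched with an in-memory counter. This is a hypothetical alternative, not the code that produced the results in this notebook; `split_val_test` is an illustrative name, and the sorted order is an added assumption to make the split reproducible. With the standard 50 validation images per class, `per_class_val=25` gives a 25/25 split.

```python
from collections import defaultdict


def split_val_test(annotations, per_class_val=25):
    """Assign each file to 'val' or 'test', keeping `per_class_val` files per class in val.

    `annotations` maps file name -> class id (what `val_dict` holds above).
    """
    seen_per_class = defaultdict(int)
    assignment = {}
    for file_name in sorted(annotations):  # sorted for a reproducible split
        class_id = annotations[file_name]
        if seen_per_class[class_id] < per_class_val:
            assignment[file_name] = "val"
        else:
            assignment[file_name] = "test"
        seen_per_class[class_id] += 1
    return assignment


# Toy data: 2 hypothetical classes with 4 files each, 2 kept for validation.
toy = {f"val_{i}.JPEG": ("n01" if i % 2 == 0 else "n02") for i in range(8)}
result = split_val_test(toy, per_class_val=2)
print(result)
```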

In [0]:
transform_train = transforms.Compose([
    # RandomChoice applies exactly one of the listed transforms to each image
    transforms.RandomChoice([
        transforms.RandomRotation(25),
        transforms.RandomResizedCrop(64, scale=(0.6, 1)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ColorJitter(brightness=0.15, contrast=0.15, saturation=0.15),
    ]),
    transforms.ToTensor(),
])


In [0]:
train_dataset = torchvision.datasets.ImageFolder('tiny-imagenet-200/train', transform=transform_train)
test_dataset = torchvision.datasets.ImageFolder('tiny-imagenet-200/test', transform=transforms.ToTensor())
val_dataset = torchvision.datasets.ImageFolder('tiny-imagenet-200/val', transform=transforms.ToTensor())

--------------

In [0]:
class Flatten(nn.Module):
    def forward(self, input):
        return input.view(input.size(0), -1)

In [0]:
num_classes = 200

model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=3, stride=2),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, padding=3, stride=2),
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.Conv2d(128, 256, kernel_size=3, padding=3, stride=2),
    nn.BatchNorm2d(256),
    nn.ReLU(),
    nn.Flatten(),
    nn.Dropout(p=0.3),
    nn.Linear(36864, 4096),
    nn.BatchNorm1d(4096),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(4096, 1024),
    nn.BatchNorm1d(1024),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(1024, num_classes)
)
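
The `36864` input size of the first `nn.Linear` follows from the convolution output-size formula applied three times to a 64x64 image. A small check in plain Python (the helper `conv_output_size` is our own name, not a PyTorch function):

```python
def conv_output_size(size, kernel_size, stride, padding):
    """Spatial output size of a convolution without dilation."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 64  # Tiny ImageNet images are 64x64
sizes = []
for _ in range(3):  # three Conv2d layers with kernel_size=3, stride=2, padding=3
    size = conv_output_size(size, kernel_size=3, stride=2, padding=3)
    sizes.append(size)

flat_features = 256 * size * size  # 256 channels after the last convolution
print(sizes, flat_features)  # [34, 19, 12] 36864
```

Note that `padding=3` with a 3x3 kernel pads more than is needed to preserve the border. With `padding=1` the same formula would give 32, 16 and 8, i.e. `256 * 8 * 8 = 16384` flattened features and a much smaller first linear layer; whether that helps accuracy was not tested here.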

In [0]:
from torchsummary import summary

summary(model.cuda(), (3, 64, 64))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1           [-1, 64, 34, 34]           1,792
       BatchNorm2d-2           [-1, 64, 34, 34]             128
              ReLU-3           [-1, 64, 34, 34]               0
            Conv2d-4          [-1, 128, 19, 19]          73,856
       BatchNorm2d-5          [-1, 128, 19, 19]             256
              ReLU-6          [-1, 128, 19, 19]               0
            Conv2d-7          [-1, 256, 12, 12]         295,168
       BatchNorm2d-8          [-1, 256, 12, 12]             512
              ReLU-9          [-1, 256, 12, 12]               0
          Flatten-10                [-1, 36864]               0
          Dropout-11                [-1, 36864]               0
           Linear-12                 [-1, 4096]     150,999,040
      BatchNorm1d-13                 [-1, 4096]           8,192
             ReLU-14                 [-

-------

In [0]:
# init parameters

# learning rate
learning_rate = 1e-3

# optimizer
opt = torch.optim.Adam(model.parameters(), lr=learning_rate)

# loss function
loss = torch.nn.modules.loss.CrossEntropyLoss()

# number of epochs
num_epochs = 30

# batch size 
batch_size = 512

# device
device = torch.device("cuda")

In [0]:
train_batch_gen = torch.utils.data.DataLoader(train_dataset, 
                                              batch_size=batch_size,
                                              shuffle=True,
                                              num_workers=7)

In [0]:
val_batch_gen = torch.utils.data.DataLoader(val_dataset, 
                                            batch_size=batch_size,
                                            shuffle=False,  # order does not matter for evaluation
                                            num_workers=7)

In [0]:
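def compute_loss(X_batch, y_batch):
    # The DataLoader already yields float image tensors and long label tensors
    X_batch = X_batch.cuda()
    y_batch = y_batch.cuda()
    # model.cuda() is a no-op once the model is already on the GPU
    logits = model.cuda()(X_batch)
    # F.cross_entropy averages over the batch by default
    return F.cross_entropy(logits, y_batch)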

In [None]:
scheduler = torch.optim.lr_scheduler.StepLR(opt, step_size=7, gamma=0.1)

train_loss = []
train_accuracy = []
val_accuracy = []
import time

for epoch in range(num_epochs):
    start_time = time.time()
    model.train(True) # enable dropout / batch_norm training behavior
    for i_batch, (X_batch, y_batch) in enumerate(tqdm(train_batch_gen)):
        # train on batch
        loss = compute_loss(X_batch, y_batch)
        loss.backward()
        opt.step()
        opt.zero_grad()
        train_loss.append(loss.item())

        # Training accuracy on the same batch (extra forward pass, no gradients needed)
        with torch.no_grad():
            logits = model(X_batch.cuda())
        y_pred = logits.max(1)[1]
        train_accuracy.append(np.mean( (y_batch.cpu() == y_pred.cpu()).numpy() ))

    
    model.train(False) # disable dropout / use averages for batch_norm
    with torch.no_grad():  # no gradients needed for evaluation
        for X_batch, y_batch in tqdm(val_batch_gen):
            logits = model(X_batch.cuda())
            y_pred = logits.max(1)[1]
            val_accuracy.append(np.mean( (y_batch.cpu() == y_pred.cpu()).numpy() ))
    
    scheduler.step()
    # Then we print the results for this epoch:
    print("Epoch {} of {} took {:.3f}s".format(
        epoch + 1, num_epochs, time.time() - start_time))
    print("  training loss (in-iteration): \t{:.6f}".format(
        np.mean(train_loss[-len(train_dataset) // batch_size :])))
    print("  train accuracy: \t\t\t{:.2f} %".format(
        np.mean(train_accuracy[-len(train_dataset) // batch_size :]) * 100))
    print("  validation accuracy: \t\t\t{:.2f} %".format(
        np.mean(val_accuracy[-len(val_dataset) // batch_size :]) * 100))
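
A note on the slices used in the printout above: `-len(dataset) // batch_size` looks like it might drop the last partial batch, but unary minus is applied before `//`, and `//` rounds toward minus infinity, so the expression equals minus the number of batches per epoch. A quick check in plain Python, assuming the standard 100,000 Tiny ImageNet training images and the 5,000-image validation split created earlier (the helper name `batches_per_epoch` is ours):

```python
import math


def batches_per_epoch(num_samples, batch_size):
    """Number of batches a DataLoader yields with the default drop_last=False."""
    return math.ceil(num_samples / batch_size)


num_train, num_val, batch_size = 100_000, 5_000, 512  # assumed split sizes

# -n // b is (-n) // b, i.e. floor(-n / b) == -ceil(n / b)
train_slice_start = -num_train // batch_size
val_slice_start = -num_val // batch_size

history = list(range(1000))            # stand-in for a list of per-batch metrics
last_epoch = history[train_slice_start:]
print(train_slice_start, val_slice_start, len(last_epoch))
```

So each slice covers exactly the batches of the most recent epoch, including the final smaller batch.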

In [0]:
torch.save(model.state_dict(), './drive/My Drive/trained_models/seq_final_acc_40.data')

When everything is done, please calculate accuracy on the held-out test split (`tiny-imagenet-200/test`, carved out of the original `tiny-imagenet-200/val` above).

In [0]:
test_batch_gen = torch.utils.data.DataLoader(test_dataset, 
                                              batch_size=500,
                                              shuffle=False,
                                              num_workers=1)

In [0]:
model.train(False) # disable dropout / use averages for batch_norm
test_batch_acc = []
with torch.no_grad():  # no gradients needed for evaluation
    for (X_batch, y_batch) in test_batch_gen:
        logits = model(X_batch.cuda())
        y_pred = logits.max(1)[1]
        test_batch_acc.append(np.mean( (y_batch.cpu() == y_pred.cpu()).numpy() ))


test_accuracy = np.mean(test_batch_acc)

In [0]:
print("Final results:")
print("  test accuracy:\t\t{:.2f} %".format(
    test_accuracy * 100))

Final results:
  test accuracy:		40.18 %




# Report

All creative approaches are highly welcome, but at the very least it would be great to mention
* the idea;
* brief history of tweaks and improvements;
* what is the final architecture and why?
* what is the training method and, again, why?
* Any regularizations and other techniques applied and their effects;


There is no need to write strict mathematical proofs (unless you want to).
 * "I tried this, this and this, and the second one turned out to be better. And I just didn't like the name of that one" - OK, but can be better
 * "I have analyzed these and these articles|sources|blog posts, tried that and that to adapt them to my problem and the conclusions are such and such" - the ideal one
 * "I took that code from that demo without understanding it, but I'll never confess that and instead I'll make up some pseudoscientific explanation" - __not_ok__

### Hi, my name is `Daria Sharypina`, and here's my story

A long time ago in a galaxy far far away, when it was still more than an hour before the deadline, I got an idea:

##### I'm gonna build a neural network, that

I decided to start with `AlexNet5`. I tried training it with different optimizers (`SGD-momentum`, `Adam`, `RMSprop`), tried changing the learning rate, tried different activation functions, and attempted to use augmentations. I also added a scheduler for the learning rate.

How could I be so naive?!

##### One day, with no signs of warning,
This thing has finally converged and

I implemented `AlexNet5` via `nn.Sequential`.

The best result I got was $\approx$ 20%.

The best results came with the `Adam` and `RMSprop` optimizers, an initial `lr = 1e-3`, and `lr_scheduler.StepLR` with `gamma = 0.1` and `step_size = 7`.
Experiments with activation functions gave nothing; I did not see any real difference.
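
For reference, the step schedule mentioned above can be written as a closed-form rule. This is a plain-Python sketch of step decay (the function `step_lr` is a hypothetical helper, not the PyTorch scheduler itself), assuming the scheduler is stepped once at the end of every epoch as in the training loop of this notebook:

```python
def step_lr(initial_lr, epoch, step_size=7, gamma=0.1):
    """Learning rate in effect during `epoch` (0-based) under step decay."""
    return initial_lr * gamma ** (epoch // step_size)


# With 30 epochs the rate drops by 10x at epochs 7, 14, 21 and 28.
for epoch in (0, 6, 7, 14, 21, 28):
    print(epoch, step_lr(1e-3, epoch))
```

By the last few epochs the rate is around `1e-7`, so most of the learning happens in the first two or three stages of the schedule.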

Then I added `BatchNorm` and gained roughly +5% accuracy.
I never managed to get augmentations working here. Every time I added a `transform` to the training dataset, the score for some reason only reached 3% and stalled. I checked what happened to the images after augmentation and they looked fine; the images were not corrupted.

Then I decided to move on to `resnet18`, then to `resnet34` and `resnet50`. I still had the same problems with augmentations here. With `Adam` and `lr = 1e-3` I managed to reach an accuracy of about 32%.

Then I realized that I had never tried really simple architectures, so I went back to `nn.Sequential` with a couple of convolutional layers. I managed to reach 30% with three convolutional layers, BatchNorm, Dropout, and a linear layer at the end. 
Augmentations still did not work, until it occurred to me to remove Normalize from the augmentations. And everything was fine! Augmentations started working and added roughly another 5%.

##### Finally, after 20 iterations, 2 mugs of coffee

In the end I kept 3 convolutional layers followed by 2 linear layers with 2 Dropouts + BatchNorm. I used `Adam` with `lr = 1e-3`, added a scheduler with `step_size=7` and `gamma=0.1`, and used augmentations (RandomRotation, RandomResizedCrop, RandomHorizontalFlip, ColorJitter).

That, having wasted days of my life training, got

* accuracy on training: 50%
* accuracy on validation: 38%
* accuracy on test: 40%


I should have started with a simple architecture right away and changed it gradually. And I had been told so...
> Please do try a few simple architectures before you go for resnet-152.
