# Decision Process of Network Hyper-Parameters
## Prepared by Furkan Küçük for DataBoss Analytics Job Application

The decision process in a machine learning pipeline can get far from trivial. In this notebook, I will try to explain my general approach to hyper-parameter tuning.

First of all, the hyper-parameter decision process can take a long time. This may be due to a shortage of resources (such as low-spec hardware) or of time. Computer vision tasks in particular tend to be slow, since their datasets can be challenging: cluttered, hard to label, and so on. Besides, since computer vision tasks are relatively complicated, they may need more complex deep learning architectures.

Important note: This task is being done on relatively low-spec hardware (Google Colab). Hence, one may need to tune hyper-parameters with pre-acquired insights. The experiments will be run with a light network architecture, under the assumption that the parameters found will apply in all conditions. However, this assumption is not accurate, since different hyper-parameters may perform better under different conditions.

In [2]:
# imports
import math
import os
from torch import nn
from torch.optim.lr_scheduler import _LRScheduler
from torch.utils.data import DataLoader
from torchvision import transforms, datasets
from dataset_utils import alternativeSeperation

Before beginning any machine learning task, one has to prepare the project-specific data. The Tiny ImageNet dataset comes with 3 folders: one for training, one for validation and one for testing. The training folder is already structured for PyTorch's prebuilt ImageFolder dataset handler. However, the validation folder holds all images in a single folder, with their labels in a text file. I implemented 2 alternatives for dealing with this.
- (Alternative) The first is an extension of PyTorch's "torch.utils.data.Dataset" class that fetches images according to the text file. This alternative introduces some overhead, since it labels images as training goes. Further explanation can be found in the class docstring.
- (Preferred) The second is a function that separates the images into class folders created for that purpose. With this function, holding label information in RAM becomes unnecessary. It also enables the user to use PyTorch's "ImageFolder" implementation.
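
The preferred alternative could look roughly like the sketch below. It is not the `alternativeSeperation` implementation from `dataset_utils`. The function name `separate_val_images` and the assumed layout (an `images` folder next to a `val_annotations.txt` file whose first two tab-separated columns are the image file name and the class id) are assumptions made for illustration.

```python
import os
import shutil


def separate_val_images(val_dir, annotations_name="val_annotations.txt"):
    """Move validation images into one sub-folder per class (hypothetical helper).

    Assumes val_dir contains an `images` folder and a tab-separated
    annotation file whose lines start with the file name and the class id.
    Returns the number of images moved.
    """
    images_dir = os.path.join(val_dir, "images")
    moved = 0
    with open(os.path.join(val_dir, annotations_name)) as annotations:
        for line in annotations:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            file_name, class_id = fields[0], fields[1]
            source = os.path.join(images_dir, file_name)
            if not os.path.isfile(source):
                continue  # already moved or missing
            class_dir = os.path.join(val_dir, class_id)
            os.makedirs(class_dir, exist_ok=True)
            shutil.move(source, os.path.join(class_dir, file_name))
            moved += 1
    # Drop the emptied folder so it is not mistaken for a class folder.
    if os.path.isdir(images_dir) and not os.listdir(images_dir):
        os.rmdir(images_dir)
    return moved
```

Running it a second time moves nothing, so the call is safe to leave in a notebook cell that gets re-executed.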

In [None]:
#alternativeSeperation()

Definition: A loss landscape is a representation of the loss function's value over the multi-dimensional space of model parameters. For more information, please take a look at [1].

In computer vision tasks, most loss landscapes are not convex, which makes it hard to find the global optimum. There are 2 main elements that affect the loss landscape severely: the loss function and the data distribution. (Other hyper-parameters also affect loss landscapes.) In computer vision, the data distribution can be adjusted slightly to lead the machine learning pipeline toward the global optimum (or at least a good local optimum).

- The first adjustment should be channel normalization. Normalization is an important step toward an easier loss landscape. In machine learning projects, each feature may have its own distribution. Numerically, however, a distribution that differs significantly from the others (e.g. one with much larger values) may distort the loss landscape severely, shrinking gradients along one dimension while inflating them along another. (In other words, the optimizer may treat one feature as more important than another.) Channel normalization helps prevent that issue. It can be done by statistically analyzing the dataset's channel distribution.

- Data augmentation may be a nice-to-have tool for exploring the loss landscape. A dataset can only represent part of the real distribution. With data augmentation, one can slightly increase the coverage of that representation and in some ways enhance it. Some of the benefits of data augmentation are:
    - It may prevent overfitting (in some cases) by varying the translation, rotation, angle etc. of images within a class.
    - It may balance an unbalanced dataset, so that the bias of the model decreases.
    - The increased coverage may lead to better generalization.

However, data augmentation just replicates an image with some distortion and noise. Overdoing it may harm generalization.
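
Going back to the first adjustment, the statistical analysis of the channel distribution can be sketched in plain Python. This is only an illustration: the names `channel_stats` and `normalize` are hypothetical, and the layout (each image a sequence of channels, each channel a flat sequence of pixel values) is an assumption chosen to keep the arithmetic visible. With tensors, the same running sums would be computed per batch.

```python
import math


def channel_stats(images):
    """Return per-channel (means, stds) over a collection of images.

    Each image is a sequence of channels and each channel is a flat
    sequence of pixel values (an assumed layout, chosen for illustration).
    """
    count = total = total_sq = None
    for image in images:
        if count is None:
            n_channels = len(image)
            count = [0] * n_channels
            total = [0.0] * n_channels
            total_sq = [0.0] * n_channels
        for c, channel in enumerate(image):
            count[c] += len(channel)
            total[c] += sum(channel)
            total_sq[c] += sum(v * v for v in channel)
    if count is None:
        raise ValueError("no images given")
    means = [t / n for t, n in zip(total, count)]
    # Var[x] = E[x^2] - E[x]^2; clamp at 0 to absorb rounding error.
    stds = [math.sqrt(max(sq / n - m * m, 0.0))
            for sq, n, m in zip(total_sq, count, means)]
    return means, stds


def normalize(image, means, stds):
    """Shift and scale every channel to zero mean and unit variance.

    Assumes every std is non-zero.
    """
    return [[(v - m) / s for v in channel]
            for channel, m, s in zip(image, means, stds)]
```

The resulting means and standard deviations are the kind of values that are passed to `transforms.Normalize` as its two arguments.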

In this part, we will find out which augmentation performs better for our case.

[1] https://arxiv.org/pdf/1712.09913.pdf

In [None]:
transformations = {
    'train': transforms.Compose([
        transforms.RandomHorizontalFlip(),
        # transforms.RandomCrop(32, padding=4),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ]),
    'val': transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ]),
    'test': transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ])
}

data_dir = "data"

image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                          transformations[x])
                  for x in ['train', 'val']}
dataloaders = {x: DataLoader(image_datasets[x], batch_size=32,
                             shuffle=True)
               for x in ['train', 'val']}



In [None]:
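class SDGR(_LRScheduler):
    """Cosine annealing with warm restarts; each cycle is t_mult times longer than the last."""

    def __init__(self, optimizer, t_max, eta_min=1e-6, last_epoch=-1, t_mult=2):
        # Set the state first: the base-class constructor performs an
        # initial step, which already calls get_lr().
        self.t_max = t_max
        self.t_mult = t_mult
        self.restart_point = t_max  # length of the current cycle, in epochs
        self.eta_min = eta_min
        self.restarted_at = 0  # epoch at which the current cycle began
        super().__init__(optimizer, last_epoch)

    def restart(self):
        # Begin a new, longer cycle at the current epoch.
        self.restart_point *= self.t_mult
        self.restarted_at = self.last_epoch

    def cosine(self, base_lr):
        # Anneal from base_lr down towards eta_min over the current cycle.
        return self.eta_min + (base_lr - self.eta_min) * (1 + math.cos(math.pi * self.step_n / self.restart_point)) / 2

    @property
    def step_n(self):
        # Number of epochs elapsed since the last restart.
        return self.last_epoch - self.restarted_at

    def get_lr(self):
        if self.step_n >= self.restart_point:
            self.restart()
        return [self.cosine(base_lr) for base_lr in self.base_lrs]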
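
To see what kind of schedule the class above is meant to produce, the same cosine-with-restarts rule can be traced in plain Python, without an optimizer. This is a sketch under the assumption that the intended rule is "restart once the current cycle is used up, then multiply the cycle length by `t_mult`". The function `sgdr_schedule` and its local names are hypothetical.

```python
import math


def sgdr_schedule(base_lr, t_max, n_epochs, eta_min=1e-6, t_mult=2):
    """Return the learning rate used at each of n_epochs epochs."""
    cycle_length = t_max  # plays the role of restart_point
    cycle_start = 0       # plays the role of restarted_at
    rates = []
    for epoch in range(n_epochs):
        step_n = epoch - cycle_start
        if step_n >= cycle_length:
            # Warm restart: jump back to base_lr and lengthen the cycle.
            cycle_length *= t_mult
            cycle_start = epoch
            step_n = 0
        cosine = math.cos(math.pi * step_n / cycle_length)
        rates.append(eta_min + (base_lr - eta_min) * (1 + cosine) / 2)
    return rates


if __name__ == "__main__":
    for epoch, rate in enumerate(sgdr_schedule(0.1, t_max=2, n_epochs=7)):
        print(epoch, round(rate, 5))
```

With `t_max=2` and `t_mult=2`, the rate returns to `base_lr` at epochs 2 and 6, so the cycles last 2, 4 and then 8 epochs.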
        

In [None]:
model1 = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=1),
    nn.ReLU(),
    nn.BatchNorm2d(8),
)

In [4]:
alternativeSeperation()

In [None]:
transformations = {
    'train': transforms.Compose([
        transforms.RandomHorizontalFlip(),
        # transforms.RandomCrop(32, padding=4),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ]),
    'val': transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ]),
    'test': transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
}

data_dir = "data"

image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                          transformations[x])
                  for x in ['train', 'val', 'test']}
dataloaders = {x: DataLoader(image_datasets[x], batch_size=32,
                             shuffle=True)
               for x in ['train', 'val', 'test']}



In [1]:
model1 = nn.Sequential(
    
)
