# Homework 2. Training networks in PyTorch

This homework is about practicing writing and training neural networks. Your task is to implement the training of a neural network and to complete the network analysis tasks at the end of the notebook. Good luck!

<font color='red'> **Deadline: October 4, 23:59 (hard)**  </font>

### Data loading in pytorch

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import torch.utils.data

In [4]:
from matplotlib import pyplot as plt
%matplotlib inline

You will work with the MNIST dataset. It contains grayscale images of handwritten digits of size 28 x 28. The training set has 60000 objects.


PyTorch has a special module for downloading MNIST, but here it is more convenient to load the data ourselves.

In [5]:
from util import load_mnist

In [6]:
X_train, y_train, X_test, y_test = load_mnist()

The code below prepares a reduced dataset (train and val) for seminar purposes: use it to train a model quickly on CPU and to tune the hyperparameters. It also prepares the full dataset (train_full and test) for training the final model.

In [7]:
# shuffle data
X_train, y_train, X_test, y_test = load_mnist()
np.random.seed(0)
idxs = np.random.permutation(np.arange(X_train.shape[0]))
X_train, y_train = X_train[idxs], y_train[idxs]

X_train.shape

(60000, 1, 28, 28)

PyTorch offers a convenient class, DataLoader, for mini-batch generation. You should pass it an instance of TensorDataset.

In [8]:
def get_loader(X, y, batch_size=64):
    train = torch.utils.data.TensorDataset(torch.from_numpy(X).float(),
                                       torch.from_numpy(y).long())
    train_loader = torch.utils.data.DataLoader(train,
                                               batch_size=batch_size)
    return train_loader

X_train, y_train, X_test, y_test = load_mnist()
np.random.seed(0)
idxs = np.random.permutation(np.arange(X_train.shape[0]))
X_train, y_train = X_train[idxs], y_train[idxs]
# for final model:
train_loader_full = get_loader(X_train, y_train)
test_loader = get_loader(X_test, y_test)
# for validation purposes:
train_loader = get_loader(X_train[:15000], y_train[:15000])
val_loader = get_loader(X_train[15000:30000], y_train[15000:30000])



In [9]:
# check number of objects
val_loader.dataset.tensors[0].shape

torch.Size([15000, 1, 28, 28])

### Building LeNet-5

Convolutional layer (from Anton Osokin's presentation):

![slide](https://github.com/nadiinchi/dl_labs/raw/master/convolution.png)

You need to implement LeNet-5:

![LeNet-5 architecture](https://www.researchgate.net/profile/Vladimir_Golovko3/publication/313808170/figure/fig3/AS:552880910618630@1508828489678/Architecture-of-LeNet-5.png)

Construct a network according to the image and code examples given above. Use ReLU nonlinearity (after all linear and convolutional layers). The network must support multiplying the number of convolutions in each convolutional layer by k.

Please note that in the diagram the image size is 32 x 32, but in our code it is 28 x 28.

Do not apply softmax at the end of the forward pass!
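The effect of the 28 x 28 input on the layer sizes can be checked with a little arithmetic before writing any layers. The sketch below is plain Python, not PyTorch, and it assumes the layer hyperparameters used in this notebook's implementation (5 x 5 convolutions, padding 2 in the first one, 3 x 3 average pooling with stride 2 and padding 1); the helper names `out_size` and `lenet_flat_features` are made up for illustration.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial size after a conv or pooling layer (floor rounding)."""
    return (n + 2 * padding - kernel) // stride + 1


def lenet_flat_features(image_size=28, k=1):
    """Trace the spatial size through two conv + pool stages
    and return the length of the flattened feature vector."""
    n = out_size(image_size, kernel=5, padding=2)   # conv1: 28 -> 28
    n = out_size(n, kernel=3, stride=2, padding=1)  # pool1: 28 -> 14
    n = out_size(n, kernel=5)                       # conv2: 14 -> 10
    n = out_size(n, kernel=3, stride=2, padding=1)  # pool2: 10 -> 5
    channels = 16 * k                               # second conv layer has 16 * k filters
    return channels * n * n


print(lenet_flat_features(k=1))  # 400
print(lenet_flat_features(k=4))  # 1600
```

These two numbers are the `in_features` that the first fully connected layer needs for k = 1 and k = 4.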

### <font color='red'>[TODO] Implementing the LeNet-5 architecture </font>

In this part you need to implement the LeNet-5 architecture; keep in mind that the input images are 28x28.

To write the architecture, use [nn.Conv2d](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html), [nn.AvgPool2d](https://pytorch.org/docs/stable/generated/torch.nn.AvgPool2d.html), [nn.ReLU](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html). Follow the diagram above in your implementation.

In [10]:
import torchvision
import torch.nn as nn

In [34]:
class CNN(nn.Module):
    def __init__(self, k=1):
        super(CNN, self).__init__()
        ### your code here: define layers
        # k multiplies the number of filters in both convolutional layers
        self.conv1   = nn.Conv2d(in_channels=1, out_channels=6 * k, kernel_size=5, stride=1, padding=2, bias=True)
        self.pool1   = nn.AvgPool2d(kernel_size=3, stride=2, padding=1)
        self.conv2   = nn.Conv2d(in_channels=6 * k, out_channels=16 * k, kernel_size=5, stride=1, bias=True)
        self.pool2   = nn.AvgPool2d(kernel_size=3, stride=2, padding=1)
        self.flat    = nn.Flatten()
        # after pool2 the feature map has shape (16 * k) x 5 x 5
        self.linear1 = nn.Linear(in_features=16 * k * 5 * 5, out_features=120, bias=True)
        self.linear2 = nn.Linear(in_features=120, out_features=84, bias=True)
        self.linear3 = nn.Linear(in_features=84, out_features=10, bias=True)
        self.relu    = nn.ReLU()

    def forward(self, x):
        ### your code here: transform x using layers
        x = self.relu(self.conv1(x))
        x = self.pool1(x)
        x = self.relu(self.conv2(x))
        x = self.pool2(x)
        x = self.flat(x)
        x = self.relu(self.linear1(x))
        x = self.relu(self.linear2(x))
        x = self.linear3(x)

        return x

class CNN_v2(nn.Module):
    def __init__(self, k=1):
        #super(CNN, self).__init__(num_channels = 1)
        super(CNN_v2, self).__init__()
        ### your code here: define layers
        #self.padding = torchvision.transforms.Pad(padding=2, fill=0,padding_mode='constant')
        self.conv1   = nn.Conv2d(in_channels=1, out_channels=24, kernel_size=5, stride=1, padding=2, bias=True)
        self.pool1   = nn.AvgPool2d(kernel_size=3, stride=2, padding=1)
        self.conv2   = nn.Conv2d(in_channels=24, out_channels=64, kernel_size=5, stride=1, bias=True)
        self.pool2   = nn.AvgPool2d(kernel_size=3, stride=2, padding=1)
        self.flat    = nn.Flatten()
        self.linear1 = nn.Linear(in_features=1600, out_features=120, bias=True)
        self.linear2 = nn.Linear(in_features=120, out_features=84, bias=True)
        self.linear3 = nn.Linear(in_features=84, out_features=10, bias=True)
        self.relu    = nn.ReLU()

    def forward(self, x):
        ### your code here: transform x using layers
        x = self.relu(self.conv1(x))
        x = self.pool1(x)
        x = self.relu(self.conv2(x))
        x = self.pool2(x)
        x = self.flat(x)
        x = self.relu(self.linear1(x))
        x = self.relu(self.linear2(x))
        x = self.linear3(x)

        return x

class CNN_v3(nn.Module):
    def __init__(self, k=1):
        #super(CNN, self).__init__(num_channels = 1)
        super(CNN_v3, self).__init__()
        ### your code here: define layers
        #self.padding = torchvision.transforms.Pad(padding=2, fill=0,padding_mode='constant')
        self.conv1   = nn.Conv2d(in_channels=1, out_channels=24, kernel_size=5, stride=1, padding=2, bias=True)
        self.pool1   = nn.AvgPool2d(kernel_size=3, stride=2, padding=1)
        self.conv2   = nn.Conv2d(in_channels=24, out_channels=64, kernel_size=5, stride=1, bias=True)
        self.pool2   = nn.AvgPool2d(kernel_size=3, stride=2, padding=1)
        self.flat    = nn.Flatten()
        self.linear1 = nn.Linear(in_features=1600, out_features=120, bias=True)
        self.linear2 = nn.Linear(in_features=120, out_features=10, bias=True)
        self.relu    = nn.ReLU()

    def forward(self, x):
        ### your code here: transform x using layers
        x = self.relu(self.conv1(x))
        x = self.pool1(x)
        x = self.relu(self.conv2(x))
        x = self.pool2(x)
        x = self.flat(x)
        x = self.relu(self.linear1(x))
        x = self.linear2(x)
        return x

Let's count the number of parameters in the network:

In [None]:
from torchinfo import summary
cnn = CNN()
summary(cnn, input_size=(32768, 1, 28, 28))

In [33]:
cnn_v2 = CNN_v2()
summary(cnn_v2, input_size=(32768, 1, 28, 28))

Layer (type:depth-idx)                   Output Shape              Param #
CNN_v2                                   [32768, 10]               --
├─Conv2d: 1-1                            [32768, 24, 28, 28]       624
├─ReLU: 1-2                              [32768, 24, 28, 28]       --
├─AvgPool2d: 1-3                         [32768, 24, 14, 14]       --
├─Conv2d: 1-4                            [32768, 64, 10, 10]       38,464
├─ReLU: 1-5                              [32768, 64, 10, 10]       --
├─AvgPool2d: 1-6                         [32768, 64, 5, 5]         --
├─Flatten: 1-7                           [32768, 1600]             --
├─Linear: 1-8                            [32768, 120]              192,120
├─ReLU: 1-9                              [32768, 120]              --
├─Linear: 1-10                           [32768, 84]               10,164
├─ReLU: 1-11                             [32768, 84]               --
├─Linear: 1-12                           [32768, 10]               850


In [27]:
cnn_v3 = CNN_v3()
summary(cnn_v3, input_size=(1024, 1, 28, 28))

Layer (type:depth-idx)                   Output Shape              Param #
CNN_v3                                   [1024, 10]                --
├─Conv2d: 1-1                            [1024, 24, 28, 28]        624
├─ReLU: 1-2                              [1024, 24, 28, 28]        --
├─AvgPool2d: 1-3                         [1024, 24, 14, 14]        --
├─Conv2d: 1-4                            [1024, 64, 10, 10]        38,464
├─ReLU: 1-5                              [1024, 64, 10, 10]        --
├─AvgPool2d: 1-6                         [1024, 64, 5, 5]          --
├─Flatten: 1-7                           [1024, 1600]              --
├─Linear: 1-8                            [1024, 120]               192,120
├─ReLU: 1-9                              [1024, 120]               --
├─Linear: 1-10                           [1024, 10]                1,210
Total params: 232,418
Trainable params: 232,418
Non-trainable params: 0
Total mult-adds (G): 4.64
Input size (MB): 3.21
Forward/backward pass 
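The parameter counts reported by `summary` can be reproduced by hand, which is a useful sanity check on the layer definitions. The sketch below is plain Python and only assumes the standard counting rule (weights plus one bias per output channel or feature); `conv_params` and `linear_params` are illustrative helper names, and the layer sizes are those of CNN_v3.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weights of a square conv kernel plus one bias per output channel."""
    return in_ch * out_ch * kernel * kernel + out_ch


def linear_params(in_f, out_f):
    """Weight matrix plus one bias per output feature."""
    return in_f * out_f + out_f


cnn_v3_layers = {
    "conv1": conv_params(1, 24, 5),
    "conv2": conv_params(24, 64, 5),
    "linear1": linear_params(1600, 120),
    "linear2": linear_params(120, 10),
}
total = sum(cnn_v3_layers.values())

for name, count in cnn_v3_layers.items():
    print(f"{name:8s} {count:>8,}")
print(f"{'total':8s} {total:>8,}")  # 232,418
```

Almost all of the parameters sit in the first fully connected layer, so widening the convolutions mostly affects the model size through the flattened feature length.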

### Training

Let's define the loss function:

In [12]:
criterion = nn.CrossEntropyLoss() # loss includes softmax
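The comment "loss includes softmax" is the reason the network must return raw logits. As a rough illustration in plain Python (not the PyTorch implementation; the helper names and the example logits are made up), cross-entropy on logits equals the negative log of the softmax probability of the target class, and applying softmax a second time changes the value:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """-log(softmax(logits)[target]) computed via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


logits = [2.0, 0.5, -1.0]  # hypothetical network outputs for one image
target = 0

loss_from_logits = cross_entropy(logits, target)
# Mistake: the network already applied softmax, and the loss applies it again.
loss_double_softmax = cross_entropy(softmax(logits), target)

print(loss_from_logits, loss_double_softmax)
```

With a second softmax the inputs to the loss are squeezed into [0, 1], so even a confident correct prediction keeps a large loss and the gradients become weaker.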

Also, define a device where to store the data and the model (cpu or gpu):

In [13]:
device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU
cnn = cnn.to(device)

During training, we will monitor quality on the training and validation sets. Doing this inline would duplicate code, so we define a function evaluate_loss_acc that evaluates the model on a given dataset. In the same manner, we define a function train_epoch that performs one training epoch on the training data. Please note that we compute the training loss _after_ each epoch (rather than averaging it during the epoch).

In the prototypes, the train and eval modes are noted. In our case we don't need them (because we use neither dropout nor batch normalization). However, we will switch modes anyway so that you can reuse this code in the future.
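One detail worth getting right in a function like evaluate_loss_acc is how per-batch values are combined into a dataset-level number. When the last batch is shorter than the rest, averaging per-batch accuracies is not the same as dividing the total number of correct predictions by the number of objects. The sketch below uses plain Python with made-up counts to show the difference; all helper names are illustrative.

```python
def batch_sizes(n_objects, batch_size):
    """Sizes of consecutive mini-batches when the last one may be shorter."""
    full, rest = divmod(n_objects, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def mean_of_batch_means(correct, sizes):
    """Average of per-batch accuracies: every batch gets equal weight."""
    return sum(c / s for c, s in zip(correct, sizes)) / len(sizes)


def sample_weighted_accuracy(correct, sizes):
    """Total correct over total objects: every object gets equal weight."""
    return sum(correct) / sum(sizes)


sizes = batch_sizes(10, 4)  # [4, 4, 2]
correct = [4, 4, 0]         # hypothetical: the short last batch is entirely wrong

print(mean_of_batch_means(correct, sizes))       # 2/3: the short batch is over-weighted
print(sample_weighted_accuracy(correct, sizes))  # 0.8: the true dataset accuracy
```

The same reasoning applies to the loss: summing per-object losses and dividing by the dataset size gives the exact dataset average.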

### <font color='red'>[TODO] Implement the model training functions </font>

In this part you need to write the training loops for the models; you can use the seminar notebook as a guide.

In [58]:
import wandb
import torch.optim as optim
from tqdm import tqdm
import gc, os
class Trainer:
    def set_wandb(self):
        self.config = {
                'architecture'    : 'Lenet-5_v3',
                'optimizer'       : 'AdamW',
                'learning_rate'   : 3e-3,
                'weight_decay'    : 1e-6,
                'optimizer_kwargs': {'betas': (0.9, 0.999), 'eps': 1e-7},
                'scheduler_name'  : 'StepLR',
                'scheduler_kwargs': {'eta_min': 2e-4, 'T_max': 300, 'gamma': 0.85,'step_size': 7, },
                'epochs'          : 300,
                'batch_size'      : 2048
            }
        # Do not hardcode the API key: wandb.login() reads it from the WANDB_API_KEY environment variable
        wandb.login()
        self.run = wandb.init(project='dl_homework', config=self.config)
        self.model_name = 'run_' + self.run.name + '_model'
        self.start_epoch = 0
        self.num_epochs  = self.config['epochs']

    def __init__(self):
        self.set_wandb()
        self.set_data()
        self.set_net()
        self.set_opt_sched()
        self.entropy = nn.CrossEntropyLoss()
        self.softmax = nn.Softmax(dim=1)
        self.best_acc = 0

    def criterion(self, outputs, targets, config):
        loss_entropy = self.entropy(outputs, targets)
        config['entropy_loss'].append(loss_entropy.item())
        config['loss'].append(loss_entropy.item())
        return loss_entropy

    def metrics(self, outputs, targets, config):
        # argmax over logits selects the same class as argmax over softmax probabilities
        predict = torch.argmax(outputs, dim=1)
        config['Accuracy'].append(predict.eq(targets).sum().item()/targets.size(0))

    def validate(self, i_epoch):
        # Test
        self.net.eval()
        gc.collect()
        torch.cuda.empty_cache()
        config = self.get_config()

        with torch.no_grad():
            loop = tqdm(enumerate(self.test_loader), total=len(self.test_loader), leave=False)
            for batch_idx, (inputs, targets) in loop:
                inputs, targets = inputs.to(self.device), targets.to(self.device)
                outputs = self.net(inputs)
                loss = self.criterion(outputs, targets, config)
                # Calculate and summary metrics
                self.metrics(outputs, targets, config)
                # Update the progress bar
                loop.set_description(f"Epoch (Test)[{i_epoch}/{self.num_epochs}]")
                loop.set_postfix(Accuracy=np.mean(config['Accuracy']), loss=np.mean(config['loss']))
                gc.collect()
                torch.cuda.empty_cache()

            config['i_epoch'] = i_epoch
            self.wandb_log(config, name='Test')

        # Save checkpoint.
        acc = 100.*np.mean(config['Accuracy'])
        if acc > self.best_acc:
            self.best_acc = acc
            self.save(i_epoch, self.model_name + f'_best')

    def train(self, i_epoch):
        # Train
        config = self.get_config()
        self.net.train()
        loop = tqdm(enumerate(self.train_loader), total=len(self.train_loader), leave=False)
        for batch_idx, (inputs, targets) in loop:
            inputs, targets = inputs.to(self.device), targets.to(self.device)
            outputs = self.net(inputs)
            loss = self.criterion(outputs, targets, config)
            gc.collect()
            torch.cuda.empty_cache()
            # Make backward step
            loss.backward()
            self.optimizer.step()
            self.optimizer.zero_grad()
            # Calculate and summary metrics
            self.metrics(outputs, targets, config)
            # Update the progress bar
            loop.set_description(f"Epoch (Train)[{i_epoch}/{self.num_epochs}]")
            loop.set_postfix(Accuracy=np.mean(config['Accuracy']), loss=np.mean(config['loss']))
            gc.collect()
            torch.cuda.empty_cache()
        config['i_epoch'] = i_epoch
        self.wandb_log(config, name='Train')

    def fit(self):
        for i_epoch in range(self.num_epochs):
            self.train(i_epoch)
            self.validate(i_epoch)
            self.scheduler.step()
        self.run.finish()

    def set_opt_sched(self):
        self.optimizer = optim.AdamW(
            params       = self.net.parameters(),
            lr           = self.config['learning_rate'],
            betas        = self.config['optimizer_kwargs']['betas'],
            eps          = self.config['optimizer_kwargs']['eps'],
            weight_decay = self.config['weight_decay']
            )
       
        self.scheduler = torch.optim.lr_scheduler.StepLR(
            self.optimizer,
            step_size = self.config['scheduler_kwargs']['step_size'],
            gamma     = self.config['scheduler_kwargs']['gamma']
            )

    def wandb_log(self, config, name):
        wandb.log({f'{name} entropy_loss': np.mean(config['entropy_loss']),
                   f'{name} Accuracy'    : np.mean(config['Accuracy']),
                   'Epoch'               : config['i_epoch']})

    def load(self, filename):
        checkpoint = torch.load(f'./checkpoint/{filename}.pth')
        self.set_net()
        self.net.load_state_dict(checkpoint['net'])
        self.set_opt_sched()
        self.net         = self.net.to(self.device)
        self.best_acc    = checkpoint['acc']
        self.start_epoch = checkpoint['epoch']
        wandb.watch(self.net, log_freq=100)

    def save(self, i_epoch, name):
        state = {
            'net'      : self.net.state_dict(),
            'acc'      : self.best_acc,
            'epoch'    : i_epoch,
            }
        if not os.path.isdir('checkpoint'):
            os.mkdir('checkpoint')
        torch.save(state, f'./checkpoint/{name}.pth')
        print(f'Saved... Epoch[{i_epoch}]')

    def set_net(self):
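        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        # Select the second GPU only if it exists; an unconditional set_device(1) fails on other machines
        if torch.cuda.is_available() and torch.cuda.device_count() > 1:
            torch.cuda.set_device(1)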
        #self.net = CNN()
        #self.net = CNN_v2()
        self.net = CNN_v3()

        self.net = self.net.to(self.device)
        wandb.watch(self.net, log_freq=100)

    
    def set_data(self):
        X_train, y_train, X_test, y_test = load_mnist()
        X_train, y_train = torch.from_numpy(X_train).float(), torch.from_numpy(y_train).long()
        X_test, y_test   = torch.from_numpy(X_test).float(),torch.from_numpy(y_test).long()

        dataset_train = torch.utils.data.TensorDataset(X_train, y_train)
        self.train_loader = torch.utils.data.DataLoader(dataset_train, 
                                             batch_size=self.config['batch_size'], 
                                             num_workers=10,
                                             pin_memory=True,
                                             shuffle=True,
                                             drop_last=False)

        dataset_test = torch.utils.data.TensorDataset(X_test, y_test)
        self.test_loader = torch.utils.data.DataLoader(dataset_test, 
                                             batch_size=self.config['batch_size'], 
                                             num_workers=10,
                                             pin_memory=True,
                                             shuffle=False,
                                             drop_last=False)

    def get_config(self):
        return {
            'loss'        : [],
            'entropy_loss': [],
            'i_epoch'     : 0,
            'Accuracy'    : [],
            }

### <font color='red'>[TODO] Training the model </font>

Train the neural network using the functions defined above. Use Adam as the optimizer, learning_rate=0.001, number of epochs = 20. For the hold-out set, use val_loader, not test_loader.

In [59]:
model = Trainer()
#model.load('run_fresh-sunset-13_model_best')
model.fit()




Run history (wandb sparkline plots): Epoch, Test Accuracy, Test entropy_loss, Train Accuracy, Train entropy_loss.

Run summary:

| Metric | Value |
|---|---|
| Epoch | 56.0 |
| Test Accuracy | 0.3241 |
| Test entropy_loss | 2.27962 |
| Train Accuracy | 0.32027 |
| Train entropy_loss | 2.27974 |


  0%|          | 0/30 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 239, in _feed
    reader_close()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 182, in close
    self._close()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 366, in _close
    _close(self._handle)
OSError: [Errno 9] Bad file descriptor
                                                                                                 

Saved... Epoch[0]
Saved... Epoch[1]
Saved... Epoch[2]
Saved... Epoch[3]
Saved... Epoch[4]
Saved... Epoch[5]
Saved... Epoch[6]
Saved... Epoch[7]
Saved... Epoch[8]
Saved... Epoch[9]
Saved... Epoch[11]
Saved... Epoch[12]
Saved... Epoch[13]
Saved... Epoch[14]
Saved... Epoch[17]
Saved... Epoch[19]
Saved... Epoch[20]
Saved... Epoch[23]
Saved... Epoch[29]
Saved... Epoch[37]

KeyboardInterrupt: 

### <font color='red'>[TODO] Run experiments with the model </font>


### Choosing learning_rate and batch_size

Plot accuracy on the training and test sets vs. training epoch for different training parameters: learning rate $\in \{0.0001, 0.001, 0.01\}$, batch size $\in \{64, 256\}$.

The best option is to plot training curves on the left graph and validation curves on the right graph with the shared y axis (use plt.ylim).

How do learning rate and batch size affect the final quality of the model?
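The six combinations requested above can be enumerated up front so that every run gets a consistent label in the plots. This is only a sketch of the bookkeeping in plain Python; `run_name` and the dictionary keys are hypothetical, and the actual training call is left out.

```python
from itertools import product

learning_rates = [0.0001, 0.001, 0.01]
batch_sizes = [64, 256]


def run_name(lr, batch_size):
    """Short label for plot legends and experiment tracking."""
    return f"lr={lr:g}_bs={batch_size}"


# One configuration per (learning rate, batch size) pair.
grid = [
    {"learning_rate": lr, "batch_size": bs, "name": run_name(lr, bs)}
    for lr, bs in product(learning_rates, batch_sizes)
]

for cfg in grid:
    print(cfg["name"])
```

Each configuration would then be passed to the training code, with the per-epoch training and validation accuracy stored under its name.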

### Answer:
I don't know how to properly attach the experiment plots from wandb to the notebook, but I can send screenshots or give access to the project in some way.
It is worth taking a larger batch size: as the batch grows, the stochasticity of the gradient decreases, which helps convergence speed. This needs care, though: a smaller batch converged faster per epoch than a batch of 256, but that is offset by how quickly a single epoch runs with a batch of 1024.

AdamW was chosen as the optimizer (it is the most hyped one): many experts recommend it for optimizing neural networks, which matches my own interest in it. The standard betas $=(0.9, 0.999)$ were used, with weight_decay $= 1e-6$ and learning rate $= 3e-3$, which is 10 times larger than Andrej Karpathy's recommendation. The scheduler is StepLR with step_size $=7$ and gamma $=0.85$.
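To see what this schedule does to the learning rate over a run, the step decay can be written in closed form. The sketch below is plain Python and assumes the scheduler is stepped once per epoch, as in the `fit` loop; `step_lr` is an illustrative helper, not a PyTorch function.

```python
def step_lr(base_lr, epoch, step_size, gamma):
    """Learning rate under step decay: multiply by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


base_lr, step_size, gamma = 3e-3, 7, 0.85

for epoch in (0, 6, 7, 14, 37):
    print(epoch, step_lr(base_lr, epoch, step_size, gamma))
```

With these settings the learning rate stays at 3e-3 for the first seven epochs, drops to 2.55e-3 at epoch 7, and by epoch 37 is about 44% of its initial value.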

As another change, the final softmax layer was removed from the network to keep the outputs flexible for later use (although I did not make use of that here).

A batch_size $= 32768$ was also tried, to remove almost all stochasticity from the problem so that each step uses a nearly full gradient over the dataset. Training became several times faster, since pin_memory in the loader avoids spending time on data loading, and epochs pass almost instantly. One would expect to need a larger learning rate at this batch size, but in practice the network gave its best result with a smaller learning rate and no scheduler, which broadly agrees with optimization theory. Restarting the network from a pretrained checkpoint was also applied...
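The trade-off between batch size and the number of optimizer steps can be made concrete with a short calculation. The sketch below is plain Python and assumes the full training set of 60000 objects with the last incomplete batch kept (`drop_last=False`); `updates_per_epoch` is an illustrative helper.

```python
import math

N_TRAIN = 60000  # size of the MNIST training set


def updates_per_epoch(n_objects, batch_size):
    """Number of optimizer steps in one epoch when the last short batch is kept."""
    return math.ceil(n_objects / batch_size)


for bs in (64, 256, 1024, 2048, 32768):
    print(bs, updates_per_epoch(N_TRAIN, bs))
```

A batch of 64 gives 938 updates per epoch, while a batch of 32768 gives only 2, so large-batch runs need many more epochs to perform the same number of parameter updates. Note also that 32768 is a little more than half of the training set, so each step is close to, but not exactly, a full-gradient step.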

SGD with batch_size $=32768$ was also tried, but it gave no result either with a fairly large learning_rate $=2e-2$ or with a small one $=1e-5$.

### Changing the architecture

Try to modify the architecture: increase the number of filters and reduce the number of fully connected layers.

Insert numbers in the brackets:
* LeNet-5 classic (6 and 16 convolutions):  training acc: (0.9986)  validation acc: (0.9913)
* Number of convolutions x 4 (24 and 64 convolutions):  training acc: (0.9978)  validation acc: (0.9928)
* Removing fully connected layer: the previous network with 1 FC layer: training acc: (0.9973)  validation acc: (0.9929)

P.S. There is also training acc: (0.9999), validation acc: (0.9913)
    
    

Choose the learning rate, batch size and the architecture based on your experiments. Train a network on the full dataset and print accuracy on the full test set.

Link to the plots and experiment descriptions:
https://api.wandb.ai/links/kreininmv/vz4rl7eu