<center><img src='https://drive.google.com/uc?id=1_utx_ZGclmCwNttSe40kYA6VHzNocdET' height="60"></center>

AI TECH - Academy of Innovative Applications of Digital Technologies (Akademia Innowacyjnych Zastosowań Technologii Cyfrowych). Operational Programme Digital Poland 2014-2020
<hr>

<center><img src='https://drive.google.com/uc?id=1BXZ0u3562N_MqCLcekI-Ens77Kk4LpPm'></center>

<center>
Project co-financed by the European Union from the European Regional Development Fund
under the Operational Programme Digital Poland 2014-2020,
Priority Axis 3 "Digital competences of society", Measure 3.2 "Innovative solutions for digital activation"
Project title: "Akademia Innowacyjnych Zastosowań Technologii Cyfrowych (AI Tech)"
    </center>

Code based on https://github.com/pytorch/examples/blob/master/mnist/main.py

This exercise covers two aspects:
* In tasks 1-6 you will implement mechanisms that allow training deeper models (better initialization, dropout, batch normalization). Note that you are expected to implement dropout and batch norm yourself, without relying on ready-made components from PyTorch.
* In task 7 you will implement a convnet using [conv2d](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html).


Tasks:
1. Check that the given implementation reaches 95% test accuracy for
   architecture input-64-64-10 in a few thousand batches.
2. Improve initialization and check that the network learns much faster
   and reaches over 97% test accuracy. A good basic initialization scheme is the so-called Glorot initialization. For a set of weights going from a layer with $n_{in}$ neurons to a layer with $n_{out}$ neurons, it samples each weight from a normal distribution with mean $0$ and standard deviation $\sqrt{\frac{2}{n_{in}+n_{out}}}$.
3. Check that with proper initialization we can train architecture
   input-64-64-64-64-64-10, while with bad initialization it does
   not even get off the ground.
4. Add dropout, implemented by hand with PyTorch tensor operations (without the ready-made `nn.Dropout`).
5. Check that with 10 hidden layers (64 units each) the network has a hard
    time starting to learn, even with proper initialization.
6. Implement batch normalization (use train mode also for testing - it should perform well enough):
    * compute batch mean and variance
    * add new variables beta and gamma
    * check that the network learns much faster for 5 layers
    * check that the network learns even for 10 hidden layers.
7. So far we worked with a fully connected network. Design and implement in PyTorch (using PyTorch functions) a simple convolutional network and achieve 99% test accuracy. The architecture is up to you, but even a few convolutional layers should be enough.
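
As a quick sanity check of the Glorot formula from task 2, the sketch below evaluates the standard deviation for the layer sizes of the input-64-64-10 architecture. The helper name `glorot_std` is made up for this illustration and is not part of the exercise code.

```python
from math import sqrt

def glorot_std(n_in, n_out):
    """Standard deviation prescribed by Glorot (normal) initialization."""
    return sqrt(2.0 / (n_in + n_out))

# Layer sizes of the input-64-64-10 architecture from task 1.
for n_in, n_out in [(784, 64), (64, 64), (64, 10)]:
    print(f"{n_in:>3} -> {n_out:>2}: std = {glorot_std(n_in, n_out):.4f}")
```

All three values are noticeably smaller than the fixed `std=0.25` used by the baseline `Linear` layer.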

In [183]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.nn.parameter import Parameter
from torch.nn import init
import torchvision
import torchvision.transforms as transforms
import ipykernel
import sys
print(ipykernel.__version__)
print(sys.version)

6.19.2
3.10.8 (main, Nov  1 2022, 17:01:49) [GCC 12.2.0]


In [184]:
class Linear(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super(Linear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
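        # Allocate uninitialized storage; reset_parameters() fills it in.
        self.weight = Parameter(torch.empty(out_features, in_features))
        self.bias = Parameter(torch.empty(out_features))
        self.reset_parameters()

    def reset_parameters(self):
        # Baseline ("bad") initialization: fixed std regardless of layer size.
        self.weight.data.normal_(mean=0,std=0.25)
        init.zeros_(self.bias)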

    def forward(self, x):
        r = x.matmul(self.weight.t())
        r += self.bias
        return r


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = Linear(784, 64)
        self.fc2 = Linear(64, 64)
        self.fc3 = Linear(64, 10)

    def forward(self, x):
        x = x.view(-1, 28 * 28)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


In [185]:
class MnistTrainer(object):
    def __init__(self, batch_size, net, epochs=1):
        transform = transforms.Compose(
                [transforms.ToTensor()])
        self.trainset = torchvision.datasets.MNIST(
            root='./data',
            download=True,
            train=True,
            transform=transform)
        self.trainloader = torch.utils.data.DataLoader(
            self.trainset, batch_size=batch_size, shuffle=True, num_workers=2)

        self.testset = torchvision.datasets.MNIST(
            root='./data',
            train=False,
            download=True, transform=transform)
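        # Evaluate in larger batches: it is faster, and layers that normalize
        # with batch statistics (LinearBN) need more than one sample per batch.
        self.testloader = torch.utils.data.DataLoader(
            self.testset, batch_size=1000, shuffle=False, num_workers=2)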
        self.net = net
        self.epochs = epochs

    def train(self):
        net = self.net

        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(net.parameters(), lr=0.05, momentum=0.9)

        for epoch in range(self.epochs):
            running_loss = 0.0
            net.train()
            for i, data in enumerate(self.trainloader, 0):
                inputs, labels = data
                optimizer.zero_grad()
                
                outputs = net(inputs)
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()

                running_loss += loss.item()
                if i % 100 == 99:
                    print('[%d, %5d] loss: %.3f' %
                          (epoch + 1, i + 1, running_loss / 100))
                    running_loss = 0.0
            correct = 0
            total = 0
            net.eval()
            with torch.no_grad():
                for data in self.testloader:
                    images, labels = data
                    outputs = net(images)
                    _, predicted = torch.max(outputs.data, 1)
                    total += labels.size(0)
                    correct += (predicted == labels).sum().item()

            print('Accuracy of the network on the {} test images: {} %'.format(
                total, 100 * correct / total))

In [186]:
from math import sqrt
import numpy as np

class LinearGlorot(Linear):
    def reset_parameters(self):
        std = sqrt(2.0 / (self.in_features + self.out_features))
        self.weight.data.normal_(mean=0,std=std)
        init.zeros_(self.bias)

class MyNet(nn.Module):
    def __init__(self, sizes, linearClass, p=0.0):
        super(MyNet, self).__init__()
        # After flattening an image of size 28x28 we have 784 inputs
        self.p = p
        self.fcs = nn.ModuleList([linearClass(a, b) for a, b in zip([784]+sizes, sizes+[10])])

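    def dropout(self, x):
        # Inverted dropout: active only in training mode.
        if not self.training or self.p == 0.0:
            return x
        # Keep each activation with probability 1-p and rescale the kept ones
        # by 1/(1-p) so that the expected value is unchanged.
        keep = (torch.rand_like(x) >= self.p).to(x.dtype)
        return x * keep / (1.0 - self.p)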

    def forward(self, x):
        x = torch.flatten(x, 1)
        for fc in self.fcs[:-1]:
            x = fc(x)
            x = self.dropout(x)
            x = F.relu(x)
        x = self.fcs[-1](x)
        return x
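
The `dropout` method above rescales the surviving activations by 1/(1-p), which keeps their expected value unchanged between training and evaluation. The standalone sketch below checks this numerically; the function name `inverted_dropout` and the tensor sizes are chosen only for this illustration.

```python
import torch

def inverted_dropout(x, p):
    """Zero each element with probability p and rescale the rest by 1/(1-p)."""
    if p == 0.0:
        return x
    keep = (torch.rand_like(x) >= p).to(x.dtype)
    return x * keep / (1.0 - p)

torch.manual_seed(0)
x = torch.ones(1000, 1000)
y = inverted_dropout(x, p=0.3)

# The mean stays close to 1 and roughly 30% of the entries are zero.
print(y.mean().item(), (y == 0).float().mean().item())
```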


In [187]:
def better_initialization():
    print("Bad initialization, sizes=[784,64,64,10]")
    MnistTrainer(batch_size=128, net=MyNet([64,64], Linear)).train()
    print("Glorot initialization, sizes=[784,64,64,10]")
    MnistTrainer(batch_size=128, net=MyNet([64,64], LinearGlorot)).train()
    print("Bad initialization, sizes=[784,64,64,64,64,64,10]")
    MnistTrainer(batch_size=128, net=MyNet([64,64,64,64,64], Linear)).train()
    print("Glorot initialization, sizes=[784,64,64,64,64,64,10]")
    MnistTrainer(batch_size=128, net=MyNet([64,64,64,64,64], LinearGlorot)).train()
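
One way to see why the fixed `std=0.25` initialization struggles in deeper networks is to push a random batch through an untrained stack of ReLU layers and look at the scale of the last hidden activation. The sketch below is only an approximation of the setup above: it assumes standard-normal inputs instead of MNIST images, omits biases, and uses a hypothetical helper `activation_std`.

```python
from math import sqrt
import torch

def activation_std(sizes, std_fn, batch=256):
    """Std of the last hidden activation for random normal input (no training)."""
    x = torch.randn(batch, sizes[0])
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        w = torch.randn(n_out, n_in) * std_fn(n_in, n_out)
        x = torch.relu(x @ w.t())
    return x.std().item()

torch.manual_seed(0)
sizes = [784] + [64] * 5
fixed = activation_std(sizes, lambda n_in, n_out: 0.25)
glorot = activation_std(sizes, lambda n_in, n_out: sqrt(2.0 / (n_in + n_out)))
print(f"fixed std=0.25: {fixed:.3f}, Glorot: {glorot:.3f}")
```

Under these assumptions the activations grow from layer to layer with the fixed standard deviation, so the initial logits and loss are large. With Glorot initialization and ReLU the scale instead shrinks by roughly a factor of $\sqrt{2}$ per 64-to-64 layer, which is harmless for 5 hidden layers but is consistent with much deeper networks being slow to start learning.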

In [188]:
def longer_networks_should_not_learn():
    print("Checking whether network with 10 hidden layers will have trouble starting learning")
    MnistTrainer(batch_size=128, net=MyNet([64]*10, LinearGlorot)).train()
    print("It does not seem to have problems, let's check 20 hidden layers")
    MnistTrainer(batch_size=128, net=MyNet([64]*20, LinearGlorot)).train()
    print("Now it does not get off the ground")

In [189]:
class LinearBN(torch.nn.Module):
    def __init__(self, in_features, out_features, eps=1e-5):
        super(LinearBN, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.eps = eps
        self.weight = Parameter(torch.empty(out_features, in_features))
        self.bias = Parameter(torch.empty(out_features))
        # Learnable per-feature scale (gamma) and shift (beta).
        self.gamma = Parameter(torch.empty(out_features))
        self.beta = Parameter(torch.empty(out_features))
        self.reset_parameters()

    def reset_parameters(self):
        std = sqrt(2.0 / (self.in_features + self.out_features))
        self.weight.data.normal_(mean=0,std=std)
        init.zeros_(self.bias)
        init.ones_(self.gamma)
        init.zeros_(self.beta)

    def forward(self, x):
        r = x.matmul(self.weight.t())
        r += self.bias
        # Normalize every feature with the statistics of the current batch.
        mean = torch.mean(r, dim=0)
        var = torch.var(r, dim=0, unbiased=False)
        r = (r - mean) / torch.sqrt(var + self.eps)
        return self.gamma * r + self.beta

In [190]:
def compare_batch_normalization(l):
    print("length: ", l)
    print("With batch normalization")
    MnistTrainer(batch_size=128, net=MyNet([64]*l, LinearBN)).train()
    print("Without batch normalization")
    MnistTrainer(batch_size=128, net=MyNet([64]*l, LinearGlorot)).train()
    # TODO: evaluate the batch-normalized network properly; at test time the normalization layers should use running statistics
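
The TODO above mentions running statistics. A minimal sketch of how a hand-written normalization layer could track them is shown below. It assumes an exponential moving average with `momentum=0.1` and `eps=1e-5`; the class name `BatchNormSketch` is hypothetical, and details such as the variance estimator may differ from `nn.BatchNorm1d`.

```python
import torch
import torch.nn as nn

class BatchNormSketch(nn.Module):
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        # Buffers are saved with the module but not touched by the optimizer.
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):
        if self.training:
            # Training: normalize with batch statistics and update the averages.
            mean = x.mean(dim=0)
            var = x.var(dim=0, unbiased=False)
            with torch.no_grad():
                m = self.momentum
                self.running_mean.mul_(1 - m).add_(m * mean)
                self.running_var.mul_(1 - m).add_(m * var)
        else:
            # Evaluation: use the accumulated statistics instead of the batch.
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

torch.manual_seed(0)
bn = BatchNormSketch(4)
x = torch.randn(128, 4) * 3.0 + 5.0

bn.train()
y = bn(x)              # normalized with the statistics of this batch
for _ in range(200):   # let the running statistics converge
    bn(x)

bn.eval()
single = bn(x[:1])     # a batch containing one sample now works
```

With such a layer, `net.eval()` in `MnistTrainer` would switch the normalization to the stored statistics, so the test result no longer depends on the test batch size.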

In [191]:
class MyCNN(nn.Module):
    def __init__(self):
        super(MyCNN, self).__init__()
        # Pooling a ReLU output keeps it non-negative, so no ReLU follows the pooling layers.
        self.seq = nn.Sequential(      # 1x28x28
            nn.Conv2d(1,4,3),          # 4x26x26
            nn.BatchNorm2d(4),
            nn.ReLU(),

            nn.AvgPool2d(2,2),         # 4x13x13

            nn.Conv2d(4,8,3),          # 8x11x11
            nn.BatchNorm2d(8),
            nn.ReLU(),

            nn.MaxPool2d(2,1),         # 8x10x10

            nn.Conv2d(8,16,3),         # 16x8x8
            nn.BatchNorm2d(16),
            nn.ReLU(),

            nn.MaxPool2d(2,2),         # 16x4x4

            nn.Flatten(),              # 16*4*4

            nn.Linear(16*4*4, 128),    # 128
            nn.BatchNorm1d(128),
            nn.ReLU(),

            nn.Linear(128, 64),        # 64
            nn.BatchNorm1d(64),
            nn.ReLU(),

            nn.Linear(64, 32),         # 32
            nn.BatchNorm1d(32),
            nn.ReLU(),

            nn.Linear(32, 10),         # 10
        )

    def forward(self, x):
        return self.seq(x)
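
The shape comments in `MyCNN` can be double-checked by pushing a dummy image through the convolution and pooling layers one at a time. The sketch below rebuilds only those layers (batch normalization and ReLU do not change tensor shapes), so it mirrors the feature extractor rather than the full model.

```python
import torch
import torch.nn as nn

# Convolution and pooling layers in the same order as in MyCNN.
layers = [
    nn.Conv2d(1, 4, 3),
    nn.AvgPool2d(2, 2),
    nn.Conv2d(4, 8, 3),
    nn.MaxPool2d(2, 1),
    nn.Conv2d(8, 16, 3),
    nn.MaxPool2d(2, 2),
]

x = torch.zeros(1, 1, 28, 28)  # one dummy MNIST-sized image
shapes = []
with torch.no_grad():
    for layer in layers:
        x = layer(x)
        shapes.append(tuple(x.shape[1:]))  # drop the batch dimension
        print(f"{layer.__class__.__name__:<10} -> {shapes[-1]}")
```

The last shape, 16x4x4, gives the 256 input features expected by the first `nn.Linear` layer.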

In [192]:
def train_conv():
    MnistTrainer(batch_size=128, net=MyCNN(), epochs=3).train()

In [193]:
#better_initialization()
#longer_networks_should_not_learn()
#compare_batch_normalization(10)
#compare_batch_normalization(20)
#compare_batch_normalization(100) # TODO: why is the result NaN here?
train_conv()

[1,   100] loss: 0.379
[1,   200] loss: 0.095
[1,   300] loss: 0.077
[1,   400] loss: 0.063
Accuracy of the network on the 10000 test images: 98.48 %
[2,   100] loss: 0.047
[2,   200] loss: 0.047
[2,   300] loss: 0.048
[2,   400] loss: 0.042
Accuracy of the network on the 10000 test images: 99.04 %
[3,   100] loss: 0.032
[3,   200] loss: 0.031
[3,   300] loss: 0.035
[3,   400] loss: 0.036
Accuracy of the network on the 10000 test images: 98.71 %
