<a href="https://colab.research.google.com/github/lagom-QB/M11/blob/master/Practice_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Keywords: modules, optimizers, dense layer


# High level concepts

## Modules

Modules help organize and compose functions together with their inputs (weights).

In [0]:
from torch import nn
from torch.nn import init
from torch.nn.modules import loss
import torch

Some examples:

In [0]:
linear = nn.Linear(10, 10)
linear

In [0]:
linear(torch.tensor([1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,0.0]))

In [0]:
relu = nn.ReLU()
relu


In [0]:
x = torch.tensor([-1.0])
relu(x)

In [0]:
tanh = nn.Tanh()
tanh

In [0]:
dropout = nn.Dropout(0.45, inplace=True)
dropout

In [0]:
sequential = nn.Sequential(nn.Linear(10, 100), nn.Tanh(), nn.Linear(100,100), nn.Dropout(0.4, inplace = True), nn.Linear(100,10))
sequential

In [0]:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.lin1 = nn.Linear(10,100)
        self.act1 = nn.Tanh()
        self.lin2 = nn.Linear(100,100)
        self.lin3 = nn.Linear(100,100)
        self.lin4 = nn.Linear(100,10)
        
    def forward(self, x):
        x = self.lin1(x)
        x = self.act1(x)
        x = self.lin2(x)
        x = self.act1(x)
        x = self.lin3(x)
        x = self.act1(x)
        x = self.lin4(x)
        return x
net = Net()
net


In [0]:
cross_entropy = loss.CrossEntropyLoss()
cross_entropy


In [0]:
from torch.nn import Module

In [0]:
from torch.nn import Parameter

In [0]:
class Power(Module):

    __constants__ = ['exponent']

    def __init__(self, exponent=3):
        super().__init__()
        self.exponent = exponent

    def forward(self, input):
        return torch.pow(input, self.exponent)

    def extra_repr(self):
        return f'exponent={self.exponent}'

In [0]:
Power(exponent = 4)



In [0]:
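import math

class WPower(Module):
    def __init__(self):
        super().__init__()
        # Learnable exponent, registered as a parameter of the module
        self.exponent = Parameter(torch.empty(1))
        self.reset_parameters()

    def reset_parameters(self):
        # uniform_(tensor, a, b) needs a <= b; the range [0, sqrt(5)] is an assumed choice
        init.uniform_(self.exponent, a=0.0, b=math.sqrt(5))

    def forward(self, input):
        return torch.pow(input, self.exponent)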


## Parameters

Some modules are not just functions: they also have internal parameters (weights, i.e. trainable graph inputs).

In [0]:
list(linear.parameters())


In [0]:
linear.weight


In [0]:
linear.bias


In [0]:
list(tanh.parameters())


In [0]:
list(dropout.parameters())


In [0]:
dropout.p 

In [107]:
list(cross_entropy.parameters())


[]

In [0]:
list(map(lambda x: x.shape, list(sequential.parameters())))


In [0]:
list(map(lambda x: x.shape, list(net.parameters())))


In [0]:
list(map(lambda x: x.requires_grad, list(net.parameters())))
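
Note that `parameters()` yields tensors, so `len(list(module.parameters()))` counts weight and bias tensors, not individual weights. The following is a minimal sketch of counting scalar parameters instead; the helper name `count_parameters` is an assumption made for illustration and is not part of the original notebook.

```python
import torch
from torch import nn


def count_parameters(module: nn.Module) -> int:
    """Hypothetical helper: total number of scalar entries over all parameter tensors."""
    return sum(p.numel() for p in module.parameters())


layer = nn.Linear(10, 10)
n_tensors = len(list(layer.parameters()))  # one weight tensor and one bias tensor
n_scalars = count_parameters(layer)        # 10 * 10 weights + 10 biases
print(n_tensors, n_scalars)
```

For a `10 -> 10` linear layer this distinguishes 2 parameter tensors from 110 trainable numbers.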


## Eval

Each module can be in either `eval` or `train` state.

In [0]:
dropout.train()


In [0]:
dropout(torch.ones(10))


In [0]:
dropout.eval()


In [0]:
newseq = nn.Sequential(nn.Dropout(), nn.Dropout())
newseq(torch.ones(10))


In [0]:
newseq.eval()
newseq(torch.ones(10))

**Important!** Train/eval mode has nothing to do with whether the weights are being trained. It only changes the behaviour of some modules (e.g. `dropout`, `batchnorm`). For composite modules, `.eval()`/`.train()` sets the corresponding mode on each of their components.
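
As a small illustration of the mode switch, here is a sketch assuming `p=0.5` and an input of ones (both chosen only for this example). In train mode dropout zeroes entries and rescales the survivors by `1 / (1 - p)`; in eval mode it passes the input through unchanged, and calling `.eval()` on a container flips the `training` flag of every child.

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)   # entries are either 0 or 1 / (1 - p) = 2
drop.eval()
eval_out = drop(x)    # dropout is the identity in eval mode

seq = nn.Sequential(nn.Linear(4, 4), nn.Dropout())
seq.eval()
flags = [m.training for m in seq.modules()]  # the container and its two children

print(train_out.unique(), torch.equal(eval_out, x), flags)
```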

## Initialization

Most modules have a default way of initializing their parameters, but sometimes we might want to initialize them explicitly.

In [0]:
linear.weight

In [0]:
init.xavier_uniform_(linear.weight)


In [0]:
init.constant_(linear.weight, 1.0)


In [0]:
list(linear.parameters())


In [0]:
for param in linear.parameters():
    init.uniform_(param, -12, 12)
list(linear.parameters())

You can find more initialization functions here: https://pytorch.org/docs/master/nn.html#torch-nn-init.
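
Looping over `parameters()` treats weights and biases the same way. When they should be initialized differently, one option is `Module.apply`, which visits every submodule. This is only a sketch: the helper `init_linear` and the choice of zero biases are assumptions for illustration, not something the notebook prescribes.

```python
import torch
from torch import nn
from torch.nn import init


def init_linear(module: nn.Module) -> None:
    """Hypothetical helper: re-initialize only nn.Linear layers."""
    if isinstance(module, nn.Linear):
        init.uniform_(module.weight, -0.1, 0.1)  # weights in [-0.1, 0.1]
        init.zeros_(module.bias)                 # biases set to zero (assumed choice)


mlp = nn.Sequential(nn.Linear(10, 100), nn.Tanh(), nn.Linear(100, 10))
mlp.apply(init_linear)  # apply() calls the function on every submodule recursively

first = mlp[0]
print(first.weight.abs().max().item() <= 0.1, first.bias.abs().sum().item())
```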

## Optimizers

Torch has a rich collection of built-in optimizers.

In [0]:
import torch.optim as optim

In [0]:
x = torch.tensor([1.0], requires_grad = True)

In [0]:
sgd = optim.SGD([x], lr=0.1)

In [0]:
y = x * 2


In [0]:
y.backward()

In [0]:
x.grad

In [0]:
sgd.step()

In [0]:
x

In [0]:
x.grad


In [0]:
sgd.zero_grad()


In [0]:
x.grad
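
The cells above can be condensed into one worked step. With `y = 2x` the gradient is 2, so plain SGD with `lr=0.1` moves `x` from 1.0 to `1.0 - 0.1 * 2 = 0.8`. The second half is a sketch of why `zero_grad()` is needed: gradients from repeated `backward()` calls accumulate. The variable `w` is introduced only for this illustration.

```python
import torch
import torch.optim as optim

x = torch.tensor([1.0], requires_grad=True)
sgd = optim.SGD([x], lr=0.1)

y = x * 2             # dy/dx = 2
y.backward()
grad = x.grad.clone()
sgd.step()            # x <- x - lr * grad
print(grad.item(), x.item())

w = torch.tensor([1.0], requires_grad=True)
(w * 2).backward()
(w * 2).backward()    # the second call adds to w.grad instead of replacing it
accumulated = w.grad.item()
print(accumulated)
```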


# First Training Loop

In [0]:
from torchvision import datasets, transforms

Let's download MNIST, a dataset of handwritten digits.

In [0]:
train_dataset = datasets.MNIST('../data', train=True, download=True,
                                transform=transforms.Compose([
                                    transforms.ToTensor(),
                                    transforms.Normalize((0.1307,), (0.3081,))
                                ]))


In [0]:
test_dataset = datasets.MNIST('../data', train=False, download=True,
                                transform=transforms.Compose([
                                    transforms.ToTensor(),
                                    transforms.Normalize((0.1307,), (0.3081,))
                                ]))

Dataloaders are responsible for data loading. They split the dataset into batches and shuffle it (without shuffling, batches follow the stored order, so for a dataset sorted by label each batch would contain only variants of a single digit). We will look inside them later.

In [0]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=64, shuffle=True)
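
To see what batching and shuffling do without downloading anything, here is a sketch on a toy `TensorDataset`. The 10-sample dataset and the batch size of 4 are made up for this example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 10 samples with 3 features each, labels 0..9.
features = torch.arange(30, dtype=torch.float32).view(10, 3)
labels = torch.arange(10)
toy_dataset = TensorDataset(features, labels)

ordered = DataLoader(toy_dataset, batch_size=4, shuffle=False)
shuffled = DataLoader(toy_dataset, batch_size=4, shuffle=True)

batch_sizes = [len(y) for _, y in ordered]             # the last batch is smaller
ordered_labels = torch.cat([y for _, y in ordered])    # stored order
shuffled_labels = torch.cat([y for _, y in shuffled])  # a permutation of the labels

print(batch_sizes, ordered_labels.tolist(), shuffled_labels.tolist())
```

Each pass over a shuffled loader draws a new permutation, which is why the notebook cells that call the loader repeatedly show different digits.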

In [0]:
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
toPIL = transforms.ToPILImage()

In [0]:
def example(i):
    print(train_dataset[i][1])
    return toPIL(train_dataset[i][0]).resize((256, 256))

In [0]:
example(9)

In [0]:
example(10)


In [0]:
next(iter(train_loader))[1]


In [0]:
next(iter(train_loader))[0].shape


In [0]:
toPIL(next(iter(train_loader))[0][0]).resize((256,256))

Let's write a simple helper module.

In [0]:
class Flatten(torch.nn.Module):
    def forward(self, x):
        batch_size = x.shape[0]
        return x.view(batch_size, -1)


In [0]:
model = nn.Sequential(Flatten(), 
                      nn.Linear(784, 512), 
                      nn.Tanh(),
                      nn.Linear(512, 64), 
                      nn.Tanh(),
                      nn.Linear(64, 10))
for param in model.parameters():
    init.uniform_(param, -0.1, 0.1)

Why do we need the `Flatten` module?
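
One way to answer this is to look at shapes. A batch of MNIST images has shape `[batch, 1, 28, 28]`, while `nn.Linear(784, 512)` expects the last dimension to be 784. The sketch below uses a zero tensor as a stand-in for a real batch and also shows the built-in `nn.Flatten`, which has the same effect in this setting.

```python
import torch
from torch import nn

batch = torch.zeros(64, 1, 28, 28)      # stand-in for one MNIST batch
flat = batch.view(batch.shape[0], -1)   # what Flatten.forward does
print(flat.shape)

first_layer = nn.Linear(784, 512)
out = first_layer(flat)                 # works: the last dimension is 1 * 28 * 28 = 784
builtin = nn.Flatten()(batch)           # built-in module, flattens everything after dim 0
print(out.shape, builtin.shape)
```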

Set up an optimizer:


In [0]:
optimizer = optim.SGD(model.parameters(), lr=0.1)

Choose a loss function:

In [0]:
loss_function = loss.CrossEntropyLoss()
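
A short sketch of what this loss computes, under made-up inputs: `CrossEntropyLoss` takes raw logits and integer class indices, applies `log_softmax` followed by the negative log-likelihood, and by default averages over the batch. With all-zero logits over 10 classes the loss is `log(10)`.

```python
import math

import torch
import torch.nn.functional as F
from torch import nn

criterion = nn.CrossEntropyLoss()        # default reduction is the batch mean

logits = torch.zeros(4, 10)              # 4 samples, 10 classes, no preference
target = torch.tensor([0, 3, 7, 9])      # class indices, not one-hot vectors
uniform_loss = criterion(logits, target).item()   # -log(1/10) = log(10)

random_logits = torch.randn(4, 10)
manual = F.nll_loss(F.log_softmax(random_logits, dim=1), target)
same = torch.allclose(criterion(random_logits, target), manual)

print(uniform_loss, math.log(10), same)
```

Because the value is already a batch mean, summing it over batches and dividing by the number of samples, as the test helpers in this notebook do, gives roughly the per-sample loss divided by the batch size. For instance, a per-sample loss of `log(10) ≈ 2.30` appears as about `0.036` in that printout.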


And start training:

In [0]:
def train(model, train_loader, optimizer, loss_function, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = loss_function(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % 200 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

In [0]:
def test(model, test_loader, loss_function):
    model.eval()
    test_loss = 0

    y_prediction = []
    
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            output = model(data)
            test_loss += loss_function(output, target).sum().item()
            pred = output.argmax(dim=1, keepdim=True)
            y_prediction.append(output)
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    # print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
    #     test_loss, correct, len(test_loader.dataset),
    #     100. * correct / len(test_loader.dataset)))

    flat_list = [item for sublist in y_prediction for item in sublist]

    print(len(flat_list))
    return flat_list

In [0]:
%%time

#  for epoch in range(1, 10):
#         train(model, train_loader, optimizer, loss_function, epoch)
#         test(model, test_loader, loss_function)

# Assignment

## Due by 10 AM, 20.05.2020

## 1. MNIST playground [10]

In [0]:
# -------------------- 1. ----------------------------------- 
# On the test set, the accuracy sits at about 98% from the get-go,
#     lying between Accuracy: 9819/10000 and Accuracy: 9821/10000,
#     so I'll say the test accuracy is roughly constant

# -------------------- 2. ----------------------------------- 
# example(4321) #The 0 looks like a 6 to me
# example(600) #The nine looks like a 1 to me
# example(4542) #Is that a 2?
# example(56742) #This 9 is a joke

In [0]:
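def ass_test(model, test_loader, loss_function):
    """Evaluate the model, print loss and accuracy, and return the predicted classes."""
    y_predictions = []

    model.eval()
    test_loss = 0
    correct = 0

    with torch.no_grad():
        for data, target in test_loader:
            output = model(data)
            # CrossEntropyLoss already returns the batch mean, so test_loss is a sum of
            # batch means; divided by the dataset size below, it is roughly the
            # per-sample loss divided by the batch size.
            test_loss += loss_function(output, target).sum().item()
            # Predicted class = index of the largest logit
            pred = output.argmax(dim=1, keepdim=True)
            y_predictions.append(pred)
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

    # One prediction per test sample, in the order the loader produced them
    return [item for sublist in y_predictions for item in sublist]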

In [96]:
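#---------------------------------------------3 ----------------------------------------------
from sklearn.metrics import confusion_matrix

# Predictions must be in the same order as test_dataset.targets, so use an
# unshuffled loader here (test_loader shuffles and would misalign them).
# NOTE: the output below is from the original run, which used the shuffled test_loader.
ordered_test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=64, shuffle=False)

labels = test_dataset.targets
y_pred = [int(p) for p in ass_test(model, ordered_test_loader, loss_function)]

confusion_matrix(labels, y_pred)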

array([[164, 114,  10, 352,  15,  66, 149,  36,  17,  57],
       [177, 124,  17, 411,  19,  67, 193,  37,  21,  69],
       [186, 122,  13, 379,  23,  69, 144,  23,  17,  56],
       [177, 127,   7, 375,  19,  59, 145,  31,  22,  48],
       [167,  98,  10, 363,  15,  87, 123,  35,  21,  63],
       [155,  86,  12, 321,  16,  50, 143,  30,  10,  69],
       [163,  97,  14, 366,   9,  76, 144,  22,  18,  49],
       [190, 118,  10, 364,  21,  67, 151,  29,  10,  68],
       [172, 103,  10, 351,  12,  73, 155,  30,  12,  56],
       [183, 103,  14, 353,  15,  76, 167,  26,  23,  49]])
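
For reference, here is a sketch of how such a matrix is built, where cell `(i, j)` counts samples of class `i` predicted as class `j`. The helper `confusion` and the 3-class toy data are made up for illustration. Labels and predictions have to be paired in the same order; for an accurate model most of the counts end up on the diagonal.

```python
import torch


def confusion(labels: torch.Tensor, preds: torch.Tensor, n_classes: int) -> torch.Tensor:
    """Hypothetical helper: entry (i, j) counts samples of class i predicted as j."""
    matrix = torch.zeros(n_classes, n_classes, dtype=torch.long)
    for true_class, predicted_class in zip(labels.tolist(), preds.tolist()):
        matrix[true_class, predicted_class] += 1
    return matrix


# Made-up labels and predictions for a 3-class problem.
toy_labels = torch.tensor([0, 0, 1, 1, 2, 2])
toy_preds = torch.tensor([0, 1, 1, 1, 2, 0])
toy_matrix = confusion(toy_labels, toy_preds, n_classes=3)
n_correct = int(toy_matrix.diagonal().sum())  # correct predictions sit on the diagonal
print(toy_matrix, n_correct)
```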

In [112]:
%%time
#  -----------------------------------------4 -----------------------------------------------
model1 = nn.Sequential(Flatten(), 
                      nn.Linear(784, 512), 
                      nn.Tanh(),
                      nn.Linear(512, 64), 
                      nn.Tanh(),
                      nn.Linear(64, 10))
optimizer1 = optim.SGD(model1.parameters(), lr=0.1)
for param in model1.parameters():
    init.uniform_(param, -0.1, 0.1)
for epoch in range(0, 5):
      train(model1, train_loader, optimizer1, loss_function, epoch)
      ass_test(model1, test_loader, loss_function)

# This takes about 2 minutes (probably because I ran just 5 epochs) and reaches a test accuracy of 95%-97%


Test set: Average loss: 0.0024, Accuracy: 9542/10000 (95%)


Test set: Average loss: 0.0018, Accuracy: 9656/10000 (97%)


Test set: Average loss: 0.0015, Accuracy: 9721/10000 (97%)


Test set: Average loss: 0.0013, Accuracy: 9728/10000 (97%)

CPU times: user 1min 12s, sys: 260 ms, total: 1min 12s
Wall time: 1min 13s


In [110]:
%%time
model2 = nn.Sequential(Flatten(), 
                      nn.Linear(784, 512), 
                      nn.Tanh(),
                      nn.Linear(512, 64), 
                      nn.Tanh(),
                      nn.Linear(64, 10))
optimizer2 = optim.SGD(model2.parameters(), lr=0.1)
for param in model2.parameters():
    init.uniform_(param, -1, 1)
for epoch in range(0, 5):
      train(model2, train_loader, optimizer2, loss_function, epoch)
      ass_test(model2, test_loader, loss_function)

#  --> Gives a test accuracy of 83%-89%.
#  --> Takes approximately 2 minutes, almost the same time as the previous one


Test set: Average loss: 0.0086, Accuracy: 8255/10000 (83%)


Test set: Average loss: 0.0068, Accuracy: 8676/10000 (87%)


Test set: Average loss: 0.0062, Accuracy: 8780/10000 (88%)


Test set: Average loss: 0.0058, Accuracy: 8851/10000 (89%)

CPU times: user 1min 13s, sys: 262 ms, total: 1min 14s
Wall time: 1min 14s


In [111]:
%%time
model3 = nn.Sequential(Flatten(), 
                      nn.Linear(784, 512), 
                      nn.Tanh(),
                      nn.Linear(512, 64), 
                      nn.Tanh(),
                      nn.Linear(64, 10))
optimizer3 = optim.SGD(model3.parameters(), lr=0.1)
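for param in model3.parameters():
    # NOTE: uniform_(param, 0) sets a=0 and keeps b=1, i.e. it samples from uniform(0, 1);
    #       init.constant_(param, 0) would give the constant(0) initialization from the task
    init.uniform_(param, 0)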
for epoch in range(1, 5):
      train(model3, train_loader, optimizer3, loss_function, epoch)
      ass_test(model3, test_loader, loss_function)

#  --> Poorest model with a Test Accuracy of 11%
#  --> Takes the same amount of time as the previous 2


Test set: Average loss: 0.0361, Accuracy: 1135/10000 (11%)


Test set: Average loss: 0.0361, Accuracy: 1135/10000 (11%)


Test set: Average loss: 0.0361, Accuracy: 1135/10000 (11%)


Test set: Average loss: 0.0361, Accuracy: 1135/10000 (11%)

CPU times: user 1min 12s, sys: 265 ms, total: 1min 13s
Wall time: 1min 13s


In [115]:
%%time
#  ----------------------------------------------- 5 --------------------------------------------------
model4 = nn.Sequential(Flatten(), 
                      nn.Linear(784, 512), 
                      nn.Sigmoid(),
                      nn.Linear(512, 64), 
                      nn.Sigmoid(),
                      nn.Linear(64, 10))
optimizer4 = optim.SGD(model4.parameters(), lr=0.1)
for epoch in range(1, 5):
      train(model4, train_loader, optimizer4, loss_function, epoch)
      ass_test(model4, test_loader, loss_function)

#  --> This runs for about 2 minutes and has a test accuracy in the range 88% - 94% 


Test set: Average loss: 0.0065, Accuracy: 8849/10000 (88%)


Test set: Average loss: 0.0047, Accuracy: 9139/10000 (91%)


Test set: Average loss: 0.0039, Accuracy: 9274/10000 (93%)


Test set: Average loss: 0.0035, Accuracy: 9354/10000 (94%)

CPU times: user 1min 14s, sys: 288 ms, total: 1min 14s
Wall time: 1min 14s


In [122]:
%%time
# ------------------------------------------------ 6 -------------------------------------------------------------

model5 = nn.Sequential(Flatten(), 
                      nn.Linear(784, 256), 
                      nn.Tanh(),
                      nn.Linear(256, 1024), 
                      nn.Tanh(),
                      nn.Linear(1024, 10))
optimizer5 = optim.SGD(model5.parameters(), lr=0.1)

print(len(list(model5.parameters())))

for epoch in range(0, 5):
      train(model5, train_loader, optimizer5, loss_function, epoch)
      ass_test(model5, test_loader, loss_function)

# --> This cell takes about 2 minutes and reaches a test accuracy of 96%-98%
# --> Here we have 6 parameter tensors (3 weight matrices and 3 bias vectors)
# --> Because we are building 1024 outputs from 256 inputs, we need more weights
#       to enable the transformation from few inputs to so many

6

Test set: Average loss: 0.0023, Accuracy: 9551/10000 (96%)


Test set: Average loss: 0.0018, Accuracy: 9618/10000 (96%)


Test set: Average loss: 0.0014, Accuracy: 9698/10000 (97%)


Test set: Average loss: 0.0012, Accuracy: 9756/10000 (98%)


Test set: Average loss: 0.0013, Accuracy: 9748/10000 (97%)

CPU times: user 1min 39s, sys: 392 ms, total: 1min 39s
Wall time: 1min 39s


In [126]:
%%time
# ------------------------------------------------ 7 -------------------------------------------------------------
model6 = nn.Sequential(Flatten(), 
                      nn.Linear(784, 512), 
                      nn.Tanh(),
                      nn.Linear(512, 256), 
                      nn.Tanh(),
                      nn.Linear(256, 64), 
                      nn.Tanh(),
                      nn.Linear(64, 10))
optimizer6 = optim.SGD(model6.parameters(), lr=0.1)
for epoch in range(0, 5):
      train(model6, train_loader, optimizer6, loss_function, epoch)
      # ass_test(model6, test_loader, loss_function)

print(len(list(model6.parameters())))

# --> Test accuracy lies between 95% and 97%.
# --> Because there is one more layer than in the previous cells, it should take more time to complete and have more parameters
# --> This model has 8 parameter tensors (4 weight matrices and 4 bias vectors)

8
CPU times: user 1min 26s, sys: 347 ms, total: 1min 26s
Wall time: 1min 27s


In [133]:
%%time
model7 = nn.Sequential(Flatten(), 
                      nn.Linear(784, 512), 
                      nn.Tanh(),
                      nn.Linear(512, 5), 
                      nn.Tanh(),
                      nn.Linear(5, 64), 
                      nn.Tanh(),
                      nn.Linear(64, 10))
optimizer7 = optim.SGD(model7.parameters(), lr=0.1)
for epoch in range(0, 5):
      train(model7, train_loader, optimizer7, loss_function, epoch)
      ass_test(model7, test_loader, loss_function)

print(len(list(model7.parameters())))

# --> It takes a few seconds longer to complete than q4 and q5 because of the extra layer
# --> It should have 8 parameter tensors, like the other cell
# --> The accuracy should plummet because we are extracting 64 features from 5
#     features after losing most of our 512 features


Test set: Average loss: 0.0041, Accuracy: 9314/10000 (93%)


Test set: Average loss: 0.0071, Accuracy: 8659/10000 (87%)


Test set: Average loss: 0.0025, Accuracy: 9577/10000 (96%)


Test set: Average loss: 0.0028, Accuracy: 9516/10000 (95%)


Test set: Average loss: 0.0031, Accuracy: 9460/10000 (95%)

8
CPU times: user 1min 34s, sys: 370 ms, total: 1min 34s
Wall time: 1min 35s


In [134]:
%%time
# ------------------------------------------- 8 -----------------------------------------------------
model8 = nn.Sequential(Flatten(), 
                      nn.Linear(784, 512), 
                      nn.Tanh(),
                      nn.Linear(512, 5),
                      nn.Dropout(0.35, inplace=True),
                      nn.Linear(5, 64), 
                      nn.Tanh(),
                      nn.Linear(64, 10))
print(len(list(model8.parameters())))

optimizer8 = optim.SGD(model8.parameters(), lr=0.1)
for epoch in range(0, 5):
      train(model8, train_loader, optimizer8, loss_function, epoch)
      ass_test(model8, test_loader, loss_function)

# --> Training with dropout ensures that some features aren't considered more important than others by randomly deselecting
#     some nodes during the training process. I think a model that uses dropout is barely biased towards the training data.

8

Test set: Average loss: 0.0053, Accuracy: 9033/10000 (90%)


Test set: Average loss: 0.0034, Accuracy: 9403/10000 (94%)


Test set: Average loss: 0.0054, Accuracy: 8923/10000 (89%)


Test set: Average loss: 0.0035, Accuracy: 9372/10000 (94%)


Test set: Average loss: 0.0021, Accuracy: 9655/10000 (97%)

CPU times: user 1min 33s, sys: 337 ms, total: 1min 33s
Wall time: 1min 34s


In [141]:
%%time
# ----------------------------------------------- 9 ---------------------------------------------------------------------

ass_train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=False)

model9 = nn.Sequential(Flatten(), 
                      nn.Linear(784, 512), 
                      nn.Tanh(),
                      nn.Linear(512, 5),
                      nn.ReLU(),
                      nn.Linear(5, 64), 
                      nn.Tanh(),
                      nn.Linear(64, 10))

optimizer9 = optim.SGD(model9.parameters(), lr=0.1)

for param in model9.parameters():
    init.uniform_(param, -0.1, 0.1)
    
print(len(list(model9.parameters())))

for epoch in range(0, 5):
      train(model9, ass_train_loader, optimizer9, loss_function, epoch)
      ass_test(model9, test_loader, loss_function)

# --> During training, the loss decreases faster than in the shuffled model, probably because the data is correlated

8

Test set: Average loss: 0.0034, Accuracy: 9364/10000 (94%)


Test set: Average loss: 0.0024, Accuracy: 9536/10000 (95%)


Test set: Average loss: 0.0021, Accuracy: 9611/10000 (96%)


Test set: Average loss: 0.0019, Accuracy: 9666/10000 (97%)


Test set: Average loss: 0.0019, Accuracy: 9682/10000 (97%)

CPU times: user 1min 31s, sys: 353 ms, total: 1min 32s
Wall time: 1min 32s


In [147]:
%%time
# ---------------------------------------------- 10 ----------------------------------------------------
#  50

b_size = len(train_dataset)//2
ass_2_train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=b_size, shuffle=False)

model10 = nn.Sequential(Flatten(), 
                      nn.Linear(784, 512), 
                      nn.Tanh(),
                      nn.Linear(512, 5),
                      nn.ReLU(),
                      nn.Linear(5, 10))

optimizer10 = optim.SGD(model10.parameters(), lr=0.1)

for param in model10.parameters():
    init.uniform_(param, -0.1, 0.1)

print(len(list(model10.parameters())))

for epoch in range(0, 5):
      train(model10, ass_2_train_loader, optimizer10, loss_function, epoch)
      ass_test(model10, test_loader, loss_function)

# --> The test accuracy stays low (11%-26%)
# --> batch_size=b_size does not shrink the dataset: the loader still covers all 60000 samples,
#     in 2 huge batches, so the model gets only 2 SGD steps per epoch
# --> This model barely learns anything

6

Test set: Average loss: 0.0360, Accuracy: 1076/10000 (11%)


Test set: Average loss: 0.0357, Accuracy: 1140/10000 (11%)


Test set: Average loss: 0.0353, Accuracy: 1460/10000 (15%)


Test set: Average loss: 0.0349, Accuracy: 2097/10000 (21%)


Test set: Average loss: 0.0342, Accuracy: 2617/10000 (26%)

CPU times: user 1min 25s, sys: 236 ms, total: 1min 25s
Wall time: 1min 25s


In [148]:
%%time
# 30%

b_size = int(round((3 * len(train_dataset))//10))

ass_3_train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=b_size, shuffle=False)

model11 = nn.Sequential(Flatten(), 
                      nn.Linear(784, 512), 
                      nn.Tanh(),
                      nn.Linear(512, 5),
                      nn.ReLU(),
                      nn.Linear(5, 64))  # NOTE: 64 output logits, although MNIST has only 10 classes

optimizer11 = optim.SGD(model11.parameters(), lr=0.1)

for param in model11.parameters():
    init.uniform_(param, -0.1, 0.1)

print(len(list(model11.parameters())))

for epoch in range(0, 5):
      train(model11, ass_3_train_loader, optimizer11, loss_function, epoch)
      ass_test(model11, test_loader, loss_function)

# --> The test accuracy stays low (13%-45%)
# --> batch_size=b_size does not shrink the dataset: the loader still covers all 60000 samples,
#     in 4 large batches, so the model gets only 4 SGD steps per epoch
# --> This model barely learns anything

6

Test set: Average loss: 0.0615, Accuracy: 1385/10000 (14%)


Test set: Average loss: 0.0462, Accuracy: 1268/10000 (13%)


Test set: Average loss: 0.0338, Accuracy: 3198/10000 (32%)


Test set: Average loss: 0.0294, Accuracy: 4029/10000 (40%)


Test set: Average loss: 0.0263, Accuracy: 4493/10000 (45%)

CPU times: user 1min 25s, sys: 174 ms, total: 1min 25s
Wall time: 1min 26s


In [0]:
%%time
# 10%

b_size = int(round((1 * len(train_dataset))//10))

ass_4_train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=b_size, shuffle=False)

model12 = nn.Sequential(Flatten(), 
                      nn.Linear(784, 512), 
                      nn.Tanh(),
                      nn.Linear(512, 5),
                      nn.ReLU(),
                      nn.Linear(5, 64))  # NOTE: 64 output logits, although MNIST has only 10 classes

optimizer12 = optim.SGD(model12.parameters(), lr=0.1)

for param in model12.parameters():
    init.uniform_(param, -0.1, 0.1)

print(len(list(model12.parameters())))

for epoch in range(0, 5):
      train(model12, ass_4_train_loader, optimizer12, loss_function, epoch)
      ass_test(model12, test_loader, loss_function)

# --> The test accuracy stays low
# --> batch_size=b_size does not shrink the dataset: the loader still covers all 60000 samples,
#     in 10 large batches, so the model gets only 10 SGD steps per epoch
# --> This model is a joke 

6

Test set: Average loss: 0.0501, Accuracy: 2666/10000 (27%)
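
The three cells above pass the fraction of the dataset as `batch_size`, which keeps every sample and only reduces the number of optimizer steps per epoch. A possible alternative for training on part of the data is `torch.utils.data.Subset`. The sketch below uses a made-up 100-sample dataset in place of `train_dataset`; taking the first half of the indices is an assumption, and a random selection would work as well.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Toy stand-in for train_dataset: 100 samples with random features and labels.
full = TensorDataset(torch.randn(100, 3), torch.randint(0, 10, (100,)))

half = Subset(full, range(len(full) // 2))   # first 50% of the samples
half_loader = DataLoader(half, batch_size=8, shuffle=True)

# For comparison: a huge batch size keeps every sample and only cuts the step count.
big_batch_loader = DataLoader(full, batch_size=len(full) // 2, shuffle=False)

print(len(half_loader.dataset), len(half_loader))            # samples, batches per epoch
print(len(big_batch_loader.dataset), len(big_batch_loader))
```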



**Important!** This task is not too hard, but it is pretty time-consuming. Total computation time is about 4 hours.

1. Find out how many epochs are needed for our network to stop improving on the test dataset (let's stop after 5 epochs without accuracy improvement on the test set). How long does it take? [1]
2. Find some problematic examples and show them with the `example()` function we defined in class. [1]
3. Draw a confusion matrix for your model on the test dataset. It is a 10x10 matrix, and cell `(i,j)` holds the number of digits `i` classified as digit `j`. [1]
4. By default, the weight of a linear layer is initialized with the `kaiming_uniform` function and the bias is initialized with the `uniform` function (see the `reset_parameters` method of the Linear class: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/linear.py). Initialize all weights as `uniform(-0.1,0.1)` and test. How does this modification affect the training process? Is it faster/slower? Is the end result better/worse? Same question for `uniform(-1, 1)`. Same question for `constant(0)` initialization. Don't forget to recreate the optimizer for your new model (otherwise you'll optimize the parameters of the old model using values from the new one, which does not work). [1]
5. Try replacing the `Tanh` activation with `Sigmoid` and test. How does this modification affect the training process? This and the following questions assume that you are changing the initial model (i.e. all modifications from previous steps are undone). [1]
6. Try changing the output dimension of the first linear layer (and the input of the second) to `256`, then to `1024`. How does this modification affect the training process? How does the number of model parameters change? [1]
7. Our model has 2 hidden layers of sizes `512` and `64`. Let's use 3 hidden layers of sizes `512`, `256` and `64`. How does this modification affect the training process? How does the number of model parameters change? Same question for 3 layers of sizes `512`, `5` and `64` (don't forget to add an activation function between linear layers). [1]
8. Try adding dropout after first/second layer. How does this modification affect training process? [1]
9. Try disabling shuffle in the train dataloader (leave it unchanged in the test dataloader, otherwise testing will not be fair). How does this modification affect training process? Do not forget to reset training weights of the model. [1]
10. Try training, using half of the training dataset. 30%. 10%. How does this affect training process? Do not forget to reset training weights of the model. [1] 
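
The stopping rule from task 1 can be sketched as a small function over a list of per-epoch test accuracies. The helper `stop_epoch` and the accuracy history are made up for illustration. In a real loop the test function would have to return the accuracy, which the helpers in this notebook do not do as written.

```python
from typing import List, Tuple


def stop_epoch(accuracies: List[float], patience: int = 5) -> Tuple[int, float]:
    """Hypothetical helper: return (epoch at which to stop, best accuracy).

    Stops once `patience` consecutive epochs bring no new best accuracy.
    Epochs are numbered from 1.
    """
    best, since_best = float("-inf"), 0
    for epoch, accuracy in enumerate(accuracies, start=1):
        if accuracy > best:
            best, since_best = accuracy, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch, best
    return len(accuracies), best


# Made-up accuracy history, not taken from any run in this notebook.
history = [0.95, 0.96, 0.97, 0.97, 0.96, 0.97, 0.96, 0.97, 0.98]
result = stop_epoch(history, patience=5)
print(result)
```

With this history the best value 0.97 is first reached in epoch 3, the next five epochs do not beat it, and training would stop after epoch 8 without ever seeing the later 0.98.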

