
# **Solving a linear perceptron with SGD**

Previously, we solved the linear perceptron on MNIST with the square loss using QR decomposition.  In this exercise, we instead train a linear perceptron with stochastic gradient descent (SGD), using three different losses in turn: the square (L2) loss, the L1 loss, and the cross-entropy loss.

First, import the necessary packages and set up the datasets and loaders.  We will be using SGD with a batch size of 32.

In [7]:
import time

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms
import torch.optim as optim

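# use the GPU if one is available, otherwise fall back to the CPU
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'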

transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (1.0,))])

# training dataset and training loader
trainset = datasets.MNIST(root='../data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=32,
                                          shuffle=True, num_workers=2)

# testing dataset and testing loader
testset = datasets.MNIST(root='../data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=32,
                                         shuffle=False, num_workers=2)
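
The arithmetic behind this setup can be checked without downloading anything. The sketch below assumes the standard MNIST split sizes (60,000 training and 10,000 test images) and uses a random tensor, under the hypothetical name `fake_image`, in place of a real image. It writes out the `(x - mean) / std` formula that `transforms.Normalize` applies rather than calling torchvision.

```python
import math

import torch

# Assumption: standard MNIST split sizes.
n_train, n_test, batch_size = 60_000, 10_000, 32

# A DataLoader keeps the last, possibly smaller, batch by default (drop_last=False).
train_batches = math.ceil(n_train / batch_size)
test_batches = math.ceil(n_test / batch_size)
print(train_batches, test_batches)  # 1875 313

# ToTensor yields pixel values in [0, 1]. Normalize((0.5,), (1.0,)) computes
# (x - 0.5) / 1.0, so the normalised values lie in [-0.5, 0.5].
fake_image = torch.rand(1, 28, 28)  # hypothetical stand-in for one converted image
normalised = (fake_image - 0.5) / 1.0
print(normalised.min().item() >= -0.5, normalised.max().item() <= 0.5)  # True True
```

With 1875 mini-batches per epoch and a progress line every 100 mini-batches, the last progress line in each epoch is printed at mini-batch 1800.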


Now, we set up our linear perceptron neural net, our optimiser, and the three cost functions we will use.
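
As a quick sanity check on what a linear perceptron computes, the following sketch flattens a batch of images and applies a single affine map, then compares the result with the explicit formula `x @ W.T + b`. The input is a random stand-in batch rather than MNIST data, and the names `flat`, `manual` and `n_params` are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

fc = nn.Linear(784, 10)             # 784 = 28 * 28 input pixels, 10 digit classes
x = torch.randn(32, 1, 28, 28)      # stand-in batch: 32 single-channel 28x28 images

flat = x.flatten(start_dim=1)       # shape (32, 784): one row vector per image
out = fc(flat)                      # shape (32, 10): one score per class

# nn.Linear stores the weight with shape (out_features, in_features)
# and computes flat @ W^T + b.
manual = flat @ fc.weight.T + fc.bias
same = torch.allclose(out, manual, atol=1e-5)

# Number of trainable parameters: 784 * 10 weights plus 10 biases.
n_params = sum(p.numel() for p in fc.parameters())
print(out.shape, same, n_params)
```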

In [24]:
class LinearPerceptron(nn.Module):
  def __init__(self):
    super(LinearPerceptron, self).__init__()
    self.fc = nn.Linear(784, 10)
  
  def forward(self, x):
    # flatten the batch of image tensors into a batch of vectors
    x = x.flatten(start_dim = 1)
    # compute the forward pass
    return self.fc(x)

net = LinearPerceptron().to(device)

# save the initialisation parameters, which we will reuse for each training run
path = 'init.pt'
torch.save(net.state_dict(), path)

# our optimiser: SGD with momentum
optimiser = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# the cost functions we will be using
l2 = nn.MSELoss()
l1 = nn.L1Loss()
ce = nn.CrossEntropyLoss()
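
To make the three cost functions concrete, here is a minimal sketch that evaluates each of them on a made-up batch of two samples and three classes and compares the result with the formula written out by hand. The numbers are invented for illustration. Note that `nn.MSELoss` and `nn.L1Loss` compare the outputs with one-hot targets, while `nn.CrossEntropyLoss` takes the raw output scores together with integer class labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

outputs = torch.tensor([[2.0, 0.0, -1.0],
                        [0.5, 1.5, 0.0]])    # made-up scores: 2 samples, 3 classes
labels = torch.tensor([0, 1])                 # integer class labels
targets = F.one_hot(labels, num_classes=3).float()

mse = nn.MSELoss()(outputs, targets)
mae = nn.L1Loss()(outputs, targets)
xent = nn.CrossEntropyLoss()(outputs, labels)

# The same quantities by hand (the default reduction is 'mean').
mse_manual = ((outputs - targets) ** 2).mean()
mae_manual = (outputs - targets).abs().mean()
# Cross-entropy: minus the log-softmax score of the correct class, averaged over the batch.
xent_manual = -F.log_softmax(outputs, dim=1)[torch.arange(2), labels].mean()

print(mse.item(), mae.item(), xent.item())
```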

Finally, let's train our network.  The code is essentially the same as we used for the MLP.  First, with the L2 (mean square) loss.
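
Each call to `optimiser.step()` in the training loop applies one SGD-with-momentum update. The sketch below reproduces that update by hand on a toy two-parameter problem and checks it against `optim.SGD`, using the same `lr=0.001` and `momentum=0.9` as above. The toy objective and the names `w_manual` and `buf` are assumptions made for illustration. It follows PyTorch's formulation, in which the momentum buffer accumulates raw gradients and the learning rate is applied when the parameter is updated.

```python
import torch
import torch.optim as optim

lr, momentum = 0.001, 0.9

w = torch.tensor([1.0, -2.0], requires_grad=True)
opt = optim.SGD([w], lr=lr, momentum=momentum)

# Hand-maintained copy of the parameter and its momentum buffer.
w_manual = w.detach().clone()
buf = torch.zeros_like(w_manual)

for _ in range(3):
    # PyTorch update
    opt.zero_grad()
    loss = (w ** 2).sum()          # toy objective whose gradient is 2 * w
    loss.backward()
    opt.step()

    # Manual update: buf <- momentum * buf + grad, then w <- w - lr * buf
    grad = 2 * w_manual
    buf = momentum * buf + grad
    w_manual = w_manual - lr * buf

print(torch.allclose(w.detach(), w_manual, atol=1e-7))
```

Because the momentum buffer lives inside the optimiser, it is not reset by reloading the model's `state_dict`. A new optimiser is needed to start each run from a clean state.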

In [27]:
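# load the initialisation parameters into the model
net.load_state_dict(torch.load(path))
# re-create the optimiser so that momentum buffers from an earlier run are discarded
optimiser = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)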

for epoch in range(10):  # loop over the dataset multiple times

    running_loss = 0.0
    
    # Simply for time keeping
    start_time = time.time()
    # Loop over all training data
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        # turn the labels into a one-hot vector for comparison with the model outputs
        labels = F.one_hot(labels, num_classes=10).float()

        # zero the parameter gradients
        optimiser.zero_grad()

        outputs = net(inputs.to(device))
        loss = l2(outputs, labels.to(device))

        # backpropagate to compute the gradients
        loss.backward()
        # take an SGD step to update the parameters
        optimiser.step()

        # print statistics
        running_loss += loss.item()
        if i % 100 == 99:    # print every 100 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 100))
            running_loss = 0.0
        # endif
    # end of the loop over mini-batches; the epoch is finished
    end_time = time.time()

    # test the network on the test set after every epoch
    correct = 0
    total = 0

    # Test after the epoch finishes (no gradient computation needed)
    with torch.no_grad():
        for data in testloader:
            # load images and labels
            images, labels = data

            outputs = net(images.to(device))
            # the predicted class is the index of the largest output score
            predicted = torch.argmax(outputs.cpu(), 1)

            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        # end for
    # end with
    print('Epoch', epoch+1, 'took', end_time-start_time, 'seconds')
    print('Accuracy of the network with L2 loss after', epoch+1, 'epochs is' , 100*correct/total)

print('Finished Training')

[1,   100] loss: 0.098
[1,   200] loss: 0.071
[1,   300] loss: 0.062
[1,   400] loss: 0.058
[1,   500] loss: 0.055
[1,   600] loss: 0.053
[1,   700] loss: 0.052
[1,   800] loss: 0.051
[1,   900] loss: 0.050
[1,  1000] loss: 0.050
[1,  1100] loss: 0.049
[1,  1200] loss: 0.047
[1,  1300] loss: 0.047
[1,  1400] loss: 0.047
[1,  1500] loss: 0.047
[1,  1600] loss: 0.047
[1,  1700] loss: 0.045
[1,  1800] loss: 0.046
Epoch 1 took 15.733246564865112 seconds
Accuracy of the network with L2 loss after 1 epochs is 84.08
[2,   100] loss: 0.045
[2,   200] loss: 0.045
[2,   300] loss: 0.045
[2,   400] loss: 0.045
[2,   500] loss: 0.045
[2,   600] loss: 0.044
[2,   700] loss: 0.045
[2,   800] loss: 0.044
[2,   900] loss: 0.044
[2,  1000] loss: 0.044
[2,  1100] loss: 0.044
[2,  1200] loss: 0.045
[2,  1300] loss: 0.044
[2,  1400] loss: 0.044
[2,  1500] loss: 0.043
[2,  1600] loss: 0.044
[2,  1700] loss: 0.045
[2,  1800] loss: 0.043
Epoch 2 took 15.56887674331665 seconds
Accuracy of the network with L2 

Now the L1 loss.
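
One point worth keeping in mind when comparing this run with the previous one is how the two losses differ in the gradient they send back to the outputs. The sketch below uses a made-up single sample to check, via autograd, that the gradient of `nn.MSELoss` is proportional to the residual, while the gradient of `nn.L1Loss` depends only on the residual's sign. It is an illustration of the two loss functions, not an analysis of the training runs.

```python
import torch
import torch.nn as nn

targets = torch.tensor([[1.0, 0.0, 0.0]])                        # one-hot target
outputs = torch.tensor([[0.7, 0.2, -0.4]], requires_grad=True)   # made-up scores

nn.MSELoss()(outputs, targets).backward()
grad_l2 = outputs.grad.clone()
outputs.grad = None                # clear the gradient before the second backward pass

nn.L1Loss()(outputs, targets).backward()
grad_l1 = outputs.grad.clone()

residual = (outputs - targets).detach()
n = residual.numel()               # the 'mean' reduction divides by the number of elements

# MSE: d/d(output) of mean(residual^2) is 2 * residual / n
# L1:  d/d(output) of mean(|residual|) is sign(residual) / n
print(torch.allclose(grad_l2, 2 * residual / n))
print(torch.allclose(grad_l1, torch.sign(residual) / n))
```

With the L1 loss, every output entry receives a gradient of the same magnitude, however close or far it is from its target.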

In [25]:
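# load the initialisation parameters into the model
net.load_state_dict(torch.load(path))
# re-create the optimiser so that momentum buffers from an earlier run are discarded
optimiser = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)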

for epoch in range(10):  # loop over the dataset multiple times

    running_loss = 0.0
    
    # Simply for time keeping
    start_time = time.time()
    # Loop over all training data
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        # turn the labels into a one-hot vector for comparison with model outputs
        labels = F.one_hot(labels, num_classes=10).float()

        # zero the parameter gradients
        optimiser.zero_grad()

        outputs = net(inputs.to(device))
        loss = l1(outputs, labels.to(device))

        # backpropagate to compute the gradients
        loss.backward()
        # take an SGD step to update the parameters
        optimiser.step()

        # print statistics
        running_loss += loss.item()
        if i % 100 == 99:    # print every 100 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 100))
            running_loss = 0.0
        # endif
    # end of the loop over mini-batches; the epoch is finished
    end_time = time.time()

    # test the network on the test set after every epoch
    correct = 0
    total = 0

    # Test after the epoch finishes (no gradient computation needed)
    with torch.no_grad():
        for data in testloader:
            # load images and labels
            images, labels = data

            outputs = net(images.to(device))
            # the predicted class is the index of the largest output score
            predicted = torch.argmax(outputs.cpu(), 1)

            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        # end for
    # end with
    print('Epoch', epoch+1, 'took', end_time-start_time, 'seconds')
    print('Accuracy of the network with L1 loss after', epoch+1, 'epochs is' , 100*correct/total)

print('Finished Training')

[1,   100] loss: 0.188
[1,   200] loss: 0.150
[1,   300] loss: 0.141
[1,   400] loss: 0.136
[1,   500] loss: 0.132
[1,   600] loss: 0.130
[1,   700] loss: 0.128
[1,   800] loss: 0.126
[1,   900] loss: 0.123
[1,  1000] loss: 0.122
[1,  1100] loss: 0.121
[1,  1200] loss: 0.120
[1,  1300] loss: 0.120
[1,  1400] loss: 0.120
[1,  1500] loss: 0.118
[1,  1600] loss: 0.117
[1,  1700] loss: 0.116
[1,  1800] loss: 0.116
Epoch 1 took 15.605018854141235 seconds
Accuracy of the network with L1 loss after 1 epochs is 64.62
[2,   100] loss: 0.114
[2,   200] loss: 0.115
[2,   300] loss: 0.114
[2,   400] loss: 0.114
[2,   500] loss: 0.113
[2,   600] loss: 0.112
[2,   700] loss: 0.113
[2,   800] loss: 0.112
[2,   900] loss: 0.112
[2,  1000] loss: 0.112
[2,  1100] loss: 0.111
[2,  1200] loss: 0.111
[2,  1300] loss: 0.110
[2,  1400] loss: 0.110
[2,  1500] loss: 0.111
[2,  1600] loss: 0.110
[2,  1700] loss: 0.109
[2,  1800] loss: 0.109
Epoch 2 took 15.598732948303223 seconds
Accuracy of the network with L1

And finally the cross-entropy loss.
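
Unlike the two previous runs, this one passes the integer labels straight to the loss, because `nn.CrossEntropyLoss` applies the softmax internally and expects class indices. The sketch below, on random stand-in scores, checks two facts with autograd: the gradient of the loss with respect to the scores is `(softmax(outputs) - one_hot(labels)) / batch_size`, and taking the argmax of the raw scores gives the same prediction as taking the argmax of the softmax probabilities. The variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

outputs = torch.randn(4, 10, requires_grad=True)   # stand-in scores for 4 samples
labels = torch.tensor([3, 0, 7, 7])                # integer class labels, no one-hot needed

loss = nn.CrossEntropyLoss()(outputs, labels)
loss.backward()

probs = F.softmax(outputs.detach(), dim=1)
one_hot = F.one_hot(labels, num_classes=10).float()
expected_grad = (probs - one_hot) / labels.numel()
grads_match = torch.allclose(outputs.grad, expected_grad, atol=1e-6)

# Softmax preserves the ordering within each row, so the predicted class is unchanged.
same_prediction = torch.equal(outputs.detach().argmax(dim=1), probs.argmax(dim=1))
print(grads_match, same_prediction)
```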

In [26]:
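# load the initialisation parameters into the model
net.load_state_dict(torch.load(path))
# re-create the optimiser so that momentum buffers from an earlier run are discarded
optimiser = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)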

for epoch in range(10):  # loop over the dataset multiple times

    running_loss = 0.0
    
    # Simply for time keeping
    start_time = time.time()
    # Loop over all training data
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimiser.zero_grad()

        outputs = net(inputs.to(device))
        loss = ce(outputs, labels.to(device))

        # backpropagate to compute the gradients
        loss.backward()
        # take an SGD step to update the parameters
        optimiser.step()

        # print statistics
        running_loss += loss.item()
        if i % 100 == 99:    # print every 100 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 100))
            running_loss = 0.0
        # endif
    # end of the loop over mini-batches; the epoch is finished
    end_time = time.time()

    # test the network on the test set after every epoch
    correct = 0
    total = 0

    # Test after the epoch finishes (no gradient computation needed)
    with torch.no_grad():
        for data in testloader:
            # load images and labels
            images, labels = data

            outputs = net(images.to(device))
            # the predicted class is the index of the largest output score
            predicted = torch.argmax(outputs.cpu(), 1)

            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        # end for
    # end with
    print('Epoch', epoch+1, 'took', end_time-start_time, 'seconds')
    print('Accuracy of the network with cross-entropy loss after', epoch+1, 'epochs is' , 100*correct/total)

print('Finished Training')

[1,   100] loss: 1.937
[1,   200] loss: 1.358
[1,   300] loss: 1.073
[1,   400] loss: 0.920
[1,   500] loss: 0.827
[1,   600] loss: 0.767
[1,   700] loss: 0.698
[1,   800] loss: 0.659
[1,   900] loss: 0.637
[1,  1000] loss: 0.600
[1,  1100] loss: 0.625
[1,  1200] loss: 0.573
[1,  1300] loss: 0.551
[1,  1400] loss: 0.521
[1,  1500] loss: 0.516
[1,  1600] loss: 0.501
[1,  1700] loss: 0.507
[1,  1800] loss: 0.520
Epoch 1 took 15.783464193344116 seconds
Accuracy of the network with cross-entropy loss after 1 epochs is 88.29
[2,   100] loss: 0.484
[2,   200] loss: 0.478
[2,   300] loss: 0.497
[2,   400] loss: 0.470
[2,   500] loss: 0.462
[2,   600] loss: 0.459
[2,   700] loss: 0.423
[2,   800] loss: 0.422
[2,   900] loss: 0.467
[2,  1000] loss: 0.441
[2,  1100] loss: 0.458
[2,  1200] loss: 0.426
[2,  1300] loss: 0.440
[2,  1400] loss: 0.425
[2,  1500] loss: 0.420
[2,  1600] loss: 0.413
[2,  1700] loss: 0.419
[2,  1800] loss: 0.429
Epoch 2 took 15.69955325126648 seconds
Accuracy of the netwo