# Neural Networks for Fashion MNIST in PyTorch
We extend our previous MLP-from-scratch example by re-implementing the same content in PyTorch. This may look like a lot of extra work, but it shows how much of the complicated underlying implementation modern deep learning frameworks abstract away from the user.

### Dataset class extended to use directly in PyTorch
We can take our existing dataset loader and use it almost as is.
The one modification we have to make is converting the NumPy arrays to torch tensors.
The function "torch.from_numpy()" can be used for this purpose.

Two additional features we can add are PyTorch's dataset and dataloader structures, which are convenient to use and efficient.
These are called "torch.utils.data.TensorDataset" and "torch.utils.data.DataLoader"; the latter can prepare batches in parallel using multiple worker processes.
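
As a minimal sketch of how these pieces fit together, the following wraps a small random array in a "TensorDataset" and iterates over it with a "DataLoader". The array contents and the names `images_np`, `labels_np`, `toy_set` and `toy_loader` are stand-ins chosen for illustration and are not part of the actual dataset class.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data: 10 flattened 28x28 "images" and integer labels
images_np = np.random.randint(0, 256, size=(10, 784), dtype=np.uint8)
labels_np = np.arange(10, dtype=np.uint8)

# from_numpy shares memory with the array; .float() copies to float32 before scaling
images = torch.from_numpy(images_np).float() / 255
labels = torch.from_numpy(labels_np).long()

toy_set = TensorDataset(images, labels)
# num_workers=0 loads in the main process; larger values start worker processes
toy_loader = DataLoader(toy_set, batch_size=4, shuffle=True, num_workers=0)

batch_shapes = [tuple(x.shape) for x, _ in toy_loader]
print(batch_shapes)
```

With 10 samples and a batch size of 4, the last batch is smaller than the others unless `drop_last=True` is passed to the loader.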

In [12]:
import torch
import torch.utils.data
import torchvision.datasets as datasets
import os
import struct
import gzip
import errno
import numpy as np

class FashionMNIST:
    """
    Fashion MNIST dataset featuring gray-scale 28x28 images of
    fashion items belonging to ten different classes.
    Dataloader adapted from MNIST.
    We do not define __getitem__ and __len__ in this class
    as we are using torch.utils.data.TensorDataset which
    already implements these methods.

    Parameters:
        is_gpu (bool): True if CUDA is enabled.
            Sets value of pin_memory in DataLoader.
        batch_size (int): Mini-batch size used by both loaders.
        workers (int): Number of worker processes used by both loaders.

    Attributes:
        trainset (torch.utils.data.TensorDataset): Training set wrapper.
        valset (torch.utils.data.TensorDataset): Validation set wrapper.
        train_loader (torch.utils.data.DataLoader): Training set loader with shuffling.
        val_loader (torch.utils.data.DataLoader): Validation set loader.
    """

    def __init__(self, is_gpu, batch_size, workers):
        self.path = os.path.expanduser('datasets/FashionMNIST')
        self.__download()

        self.trainset, self.valset = self.get_dataset()

        self.train_loader, self.val_loader = self.get_dataset_loader(batch_size, workers, is_gpu)

        self.val_loader.dataset.class_to_idx = {'T-shirt/top': 0,
                                                'Trouser': 1,
                                                'Pullover': 2,
                                                'Dress': 3,
                                                'Coat': 4,
                                                'Sandal': 5,
                                                'Shirt': 6,
                                                'Sneaker': 7,
                                                'Bag': 8,
                                                'Ankle boot': 9}

    def __check_exists(self):
        """
        Checks if dataset has already been downloaded

        Returns:
             bool: True if downloaded dataset has been found
        """

        return os.path.exists(os.path.join(self.path, 'train-images-idx3-ubyte.gz')) and \
               os.path.exists(os.path.join(self.path, 'train-labels-idx1-ubyte.gz')) and \
               os.path.exists(os.path.join(self.path, 't10k-images-idx3-ubyte.gz')) and \
               os.path.exists(os.path.join(self.path, 't10k-labels-idx1-ubyte.gz'))

    def __download(self):
        """
        Downloads the Fashion-MNIST dataset from the web if dataset
        hasn't already been downloaded.
        """

        import urllib.request

        if self.__check_exists():
            return

        print("Downloading FashionMNIST dataset")
        urls = [
            'https://cdn.rawgit.com/zalandoresearch/fashion-mnist/ed8e4f3b/data/fashion/train-images-idx3-ubyte.gz',
            'https://cdn.rawgit.com/zalandoresearch/fashion-mnist/ed8e4f3b/data/fashion/train-labels-idx1-ubyte.gz',
            'https://cdn.rawgit.com/zalandoresearch/fashion-mnist/ed8e4f3b/data/fashion/t10k-images-idx3-ubyte.gz',
            'https://cdn.rawgit.com/zalandoresearch/fashion-mnist/ed8e4f3b/data/fashion/t10k-labels-idx1-ubyte.gz',
        ]

        # create the target directory (no error if it already exists)
        os.makedirs(self.path, exist_ok=True)

        for url in urls:
            print('Downloading ' + url)
            data = urllib.request.urlopen(url)
            filename = url.rpartition('/')[2]
            file_path = os.path.join(self.path, filename)
            with open(file_path, 'wb') as f:
                f.write(data.read())

        print('Done!')

    def __get_fashion_mnist(self, path, kind='train'):
        """
        Load Fashion-MNIST data

        Parameters:
            path (str): Base directory path containing .gz files for
                the Fashion-MNIST dataset
            kind (str): Accepted types are 'train' and 't10k' for
                training and validation set stored in .gz files

        Returns:
            numpy.array: images, labels
        """

        labels_path = os.path.join(path,
                                   '%s-labels-idx1-ubyte.gz'
                                   % kind)
        images_path = os.path.join(path,
                                   '%s-images-idx3-ubyte.gz'
                                   % kind)

        with gzip.open(labels_path, 'rb') as lbpath:
            struct.unpack('>II', lbpath.read(8))
            labels = np.frombuffer(lbpath.read(), dtype=np.uint8)

        with gzip.open(images_path, 'rb') as imgpath:
            struct.unpack(">IIII", imgpath.read(16))
            images = np.frombuffer(imgpath.read(), dtype=np.uint8).reshape(len(labels), 784)

        return images, labels

    def get_dataset(self):
        """
        Loads and wraps training and validation datasets

        Returns:
             torch.utils.data.TensorDataset: trainset, valset
        """

        x_train, y_train = self.__get_fashion_mnist(self.path, kind='train')
        x_val, y_val = self.__get_fashion_mnist(self.path, kind='t10k')

        # convert to torch tensors in range [0, 1]
        x_train = torch.from_numpy(x_train).float() / 255
        y_train = torch.from_numpy(y_train).int()
        x_val = torch.from_numpy(x_val).float() / 255
        y_val = torch.from_numpy(y_val).int()

        # reshape flattened images to (N, 1, 28, 28) for input to a CNN
        x_train = x_train.view(-1, 1, 28, 28)
        x_val = x_val.view(-1, 1, 28, 28)

        # TensorDataset wrapper
        trainset = torch.utils.data.TensorDataset(x_train, y_train)
        valset = torch.utils.data.TensorDataset(x_val, y_val)

        return trainset, valset

    def get_dataset_loader(self, batch_size, workers, is_gpu):
        """
        Defines the dataset loader for wrapped dataset

        Parameters:
            batch_size (int): Defines the batch size in data loader
            workers (int): Number of worker processes to be used by data loader
            is_gpu (bool): True if CUDA is enabled so pin_memory is set to True

        Returns:
             torch.utils.data.DataLoader: train_loader, val_loader
        """

        train_loader = torch.utils.data.DataLoader(self.trainset, batch_size=batch_size, shuffle=True,
                                                   num_workers=workers, pin_memory=is_gpu, sampler=None)
        # the validation set does not need to be shuffled
        test_loader = torch.utils.data.DataLoader(self.valset, batch_size=batch_size, shuffle=False,
                                                  num_workers=workers, pin_memory=is_gpu, sampler=None)

        return train_loader, test_loader


In [13]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
is_gpu = torch.cuda.is_available()
batch_size = 128
workers = 4
    
dataset = FashionMNIST(is_gpu, batch_size, workers)

### The MLP model in PyTorch
We now show how to implement an MLP with two hidden layers in PyTorch.
Suitable hidden-layer sizes for this task could be 300 and 100.
Whether the network needs a final Softmax depends on the optimization criterion: the "nn.CrossEntropyLoss" used below applies a log-softmax internally, so the model returns raw, unnormalized scores (logits).

In [14]:
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, img_size, num_classes):
        super(MLP, self).__init__()
        
        self.img_size = img_size
        
        self.fc1 = nn.Linear(img_size, 300)
        self.act1 = nn.ReLU()

        self.fc2 = nn.Linear(300, 100)
        self.act2 = nn.ReLU()
        
        self.fc3 = nn.Linear(100, num_classes)

    def forward(self, x):
        # The view flattens the data to a vector (the representation needed by the MLP)
        x = x.view(-1, self.img_size)
        x = self.act1(self.fc1(x))
        x = self.act2(self.fc2(x))
        x = self.fc3(x)
        return x
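
As a rough size check for the layer sizes suggested above (784 → 300 → 100 → 10), the number of trainable parameters can be counted by hand. The helper `linear_params` is a hypothetical name introduced only for this sketch.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out


# input size, two hidden layers, output classes
layer_sizes = [28 * 28, 300, 100, 10]
mlp_params = sum(linear_params(a, b)
                 for a, b in zip(layer_sizes, layer_sizes[1:]))
print(mlp_params)  # 266610
```

Almost all of these parameters sit in the first layer, which connects every pixel to every hidden unit.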

### Defining optimization criterion and optimizer
A good baseline is a cross-entropy loss (log-softmax followed by negative log-likelihood) and stochastic gradient descent (SGD) with a baseline learning rate of 0.01. If we want to, we can additionally use momentum or regularization terms (such as L2 or Tikhonov regularization, commonly referred to as weight decay in ML).
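
The decomposition of the cross-entropy loss mentioned above can be checked numerically. This is a small sketch on random logits; the tensors `logits` and `targets` are made up for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)           # raw network outputs for 4 samples
targets = torch.tensor([0, 3, 9, 1])  # class indices (int64)

# cross entropy computed in one call ...
ce = F.cross_entropy(logits, targets)
# ... and as log-softmax followed by negative log-likelihood
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(ce, manual))
```

This is why the model itself should return logits when this criterion is used: applying an extra Softmax in the network would normalize the scores twice.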

In [15]:
# Define optimizer and loss function (criterion)
img_size = 28*28
num_classes = 10

model = MLP(img_size, num_classes).to(device)

criterion = nn.CrossEntropyLoss().to(device)

# we can use advanced stochastic gradient descent algorithms 
# with regularization (weight-decay) or momentum
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9,
                            weight_decay=5e-4)
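
To make the roles of momentum and weight decay concrete, the sketch below repeats a few optimizer steps by hand on a toy parameter and compares the result with "torch.optim.SGD". It assumes the default settings of that optimizer (no dampening, no Nesterov momentum); the toy loss and the names `w_manual` and `velocity` are illustrative.

```python
import torch

lr, momentum, weight_decay = 0.01, 0.9, 5e-4

w = torch.tensor([1.0, -2.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=lr, momentum=momentum,
                      weight_decay=weight_decay)

w_manual = w.detach().clone()
velocity = torch.zeros_like(w_manual)

for _ in range(3):
    # toy loss sum(w^2), whose gradient is 2 * w
    opt.zero_grad()
    (w ** 2).sum().backward()
    opt.step()

    # the same update written out by hand
    grad = 2 * w_manual + weight_decay * w_manual  # L2 term added to the gradient
    velocity = momentum * velocity + grad          # momentum buffer
    w_manual = w_manual - lr * velocity

print(torch.allclose(w.detach(), w_manual))
```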

### Monitoring and calculating accuracy
We add a convenience class to track and average quantities such as processing or data loading speeds, losses and accuracies. We also need a function that computes the accuracy, which could be the plain (top-1) accuracy. Other metrics are often employed in machine learning. For example, in the ImageNet ILSVRC challenge, a classification problem with 1000 classes, it is common to report the top-5 accuracy: a prediction is counted as correct if the true class lies among the five most likely output classes.

In [16]:
class AverageMeter(object):
    """
    Computes and stores the average and current value
    """
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


def accuracy(output, target, topk=(1,)):
    """
    Evaluates a model's top k accuracy

    Parameters:
        output (torch.Tensor): model output of shape (batch_size, num_classes)
        target (torch.Tensor): ground-truths/labels
        topk (tuple): integers specifying the top-k precisions
            to be computed

    Returns:
        list: one tensor per entry of topk, holding the percentage of correct predictions
    """

    maxk = max(topk)
    batch_size = target.size(0)

    _, pred = output.topk(maxk, 1, True, True)
    pred = pred.t()
    correct = pred.eq(target.view(1, -1).expand_as(pred))

    res = []
    for k in topk:
        # reshape instead of view: the slice can be non-contiguous after the transpose
        correct_k = correct[:k].reshape(-1).float().sum(0)
        res.append(correct_k.mul_(100.0 / batch_size))
    return res
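
The difference between top-1 and top-k accuracy is easiest to see on a tiny example. The scores and labels below are invented for illustration, and the computation is a simplified variant of the function above rather than a replacement for it.

```python
import torch

# Scores for 3 samples over 4 classes (hypothetical values)
scores = torch.tensor([[0.1, 0.7, 0.2, 0.0],
                       [0.5, 0.1, 0.3, 0.1],
                       [0.3, 0.1, 0.1, 0.5]])
labels = torch.tensor([1, 2, 1])

_, top2 = scores.topk(2, dim=1)        # indices of the 2 best classes per sample
hits = top2.eq(labels.view(-1, 1))     # (3, 2) boolean matrix of matches

# percentage of samples whose label is the best / among the two best classes
top1_acc = hits[:, :1].any(dim=1).float().mean().item() * 100
top2_acc = hits.any(dim=1).float().mean().item() * 100
print(top1_acc, top2_acc)
```

Only the first sample is correct under top-1, while the second sample additionally counts under top-2 because its label has the second-highest score.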

### Training function (sometimes referred to as "hook")
The training function needs to loop through the entire dataset in steps of mini-batches (for SGD). For each mini-batch, the output of the model and the loss are calculated, and a "backward" pass is done in order to update the model's weights. When the entire dataset has been processed once, one epoch of training has been completed. It is common to reshuffle the dataset for each epoch. In this implementation this is handled by the dataset loader, which draws the samples in a new random order every epoch because it was created with shuffle=True.

In [17]:
def train(train_loader, model, criterion, optimizer, device):
    """
    Trains/updates the model for one epoch on the training dataset.

    Parameters:
        train_loader (torch.utils.data.DataLoader): The trainset dataloader
        model (torch.nn.Module): Model to be trained
        criterion (torch.nn.Module): Loss function
        optimizer (torch.optim.Optimizer): optimizer instance like SGD or Adam
        device (torch.device): cuda or cpu
    """

    losses = AverageMeter()
    top1 = AverageMeter()

    # switch to train mode
    model.train()

    for i, (inp, target) in enumerate(train_loader):
        inp = inp.to(device)
        target = target.to(device).long() # is expected to be int64

        # compute output
        output = model(inp)
        loss = criterion(output, target)

        # measure accuracy and record loss
        prec1, _ = accuracy(output, target, topk=(1, 5))
        losses.update(loss.item(), inp.size(0))
        top1.update(prec1.item(), inp.size(0))

        # compute gradient and do SGD step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


        if i % 100 == 0:
            print('Loss {loss.val:.4f} ({loss.avg:.4f})\t'
                  'Prec@1 {top1.val:.3f} ({top1.avg:.3f})'.format(
                   loss=losses, top1=top1))
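
A common sanity check for a training loop like the one above is to run the same three steps (zero the gradients, backward pass, optimizer step) on a tiny synthetic batch and confirm that the loss goes down. The model, data and names below (`toy_model`, `toy_x`, `toy_y`) are hypothetical and unrelated to Fashion-MNIST.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_model = nn.Linear(20, 3)
toy_x = torch.randn(16, 20)             # 16 random samples with 20 features
toy_y = torch.randint(0, 3, (16,))      # random labels from 3 classes
toy_criterion = nn.CrossEntropyLoss()
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

history = []
for step in range(50):
    toy_optimizer.zero_grad()                        # clear old gradients
    loss = toy_criterion(toy_model(toy_x), toy_y)    # forward pass
    loss.backward()                                  # backward pass
    toy_optimizer.step()                             # weight update
    history.append(loss.item())

print(history[0], history[-1])
```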

### Validation function
Validation is similar to the training loop, but it runs on a separate dataset and performs no update to the weights. This way we can monitor the generalization ability of our model and check whether it is overfitting (memorizing) the training dataset.
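
Two PyTorch mechanisms are typically involved when no weight update is wanted: "model.eval()" switches layers such as dropout to their inference behavior, and "torch.no_grad()" disables gradient tracking. The following sketch illustrates both on a small hypothetical network `net`.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5), nn.Linear(8, 2))
x = torch.randn(4, 8)

net.eval()                  # dropout becomes a no-op in eval mode
with torch.no_grad():       # no autograd graph is recorded
    out_a = net(x)
    out_b = net(x)

print(out_a.requires_grad)         # False
print(torch.equal(out_a, out_b))   # True
```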

In [18]:
from torchnet import meter

def validate(val_loader, model, criterion, device):
    """
    Evaluates/validates the model

    Parameters:
        val_loader (torch.utils.data.DataLoader): The validation or testset dataloader
        model (torch.nn.Module): Model to be evaluated/validated
        criterion (torch.nn.Module): Loss function
        device (torch.device): cuda or cpu
    """

    losses = AverageMeter()
    top1 = AverageMeter()

    confusion = meter.ConfusionMeter(len(val_loader.dataset.class_to_idx))

    # switch to evaluate mode
    model.eval()

    # gradients are not needed for evaluation, so we disable their tracking
    with torch.no_grad():
        for i, (inp, target) in enumerate(val_loader):
            inp = inp.to(device)
            target = target.to(device).long() # is expected to be int64

            # compute output
            output = model(inp)

            # compute loss
            loss = criterion(output, target)

            # measure accuracy and record loss
            prec1, _ = accuracy(output, target, topk=(1, 5))
            losses.update(loss.item(), inp.size(0))
            top1.update(prec1.item(), inp.size(0))

            # add to confusion matrix
            confusion.add(output.data, target)

    print(' * Validation accuracy: Prec@1 {top1.avg:.3f} '.format(top1=top1))
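
The validation function above accumulates a confusion matrix with torchnet but does not display it. As an illustration of what such a matrix contains, here is a sketch that builds one with plain PyTorch; the function `confusion_matrix` and the toy predictions are assumptions for this example, with rows indexing the true class and columns the predicted class.

```python
import torch


def confusion_matrix(preds, targets, num_classes):
    """Rows index the true class, columns the predicted class."""
    # encode each (true, predicted) pair as a single index and count them
    flat = targets * num_classes + preds
    counts = torch.bincount(flat, minlength=num_classes ** 2)
    return counts.view(num_classes, num_classes)


preds = torch.tensor([0, 1, 1, 2, 2, 2])
targets = torch.tensor([0, 1, 2, 2, 2, 0])
cm = confusion_matrix(preds, targets, num_classes=3)
print(cm)
```

The diagonal holds the correctly classified samples, so off-diagonal entries show which classes are mistaken for which.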

### Running the training of the model

In [19]:
total_epochs = 10
for epoch in range(total_epochs):
    print("EPOCH:", epoch + 1)
    print("TRAIN")
    train(dataset.train_loader, model, criterion, optimizer, device)
    print("VALIDATION")
    validate(dataset.val_loader, model, criterion, device)

EPOCH: 1
TRAIN
Loss 2.3082 (2.3082)	Prec@1 11.719 (11.719)
Loss 0.9513 (1.5651)	Prec@1 66.406 (50.178)
Loss 0.7446 (1.1670)	Prec@1 71.875 (60.281)
Loss 0.6037 (0.9909)	Prec@1 78.125 (65.903)
Loss 0.5902 (0.8900)	Prec@1 81.250 (69.278)
VALIDATION
 * Validation accuracy: Prec@1 82.640 
EPOCH: 2
TRAIN
Loss 0.4923 (0.4923)	Prec@1 80.469 (80.469)
Loss 0.5031 (0.5134)	Prec@1 82.031 (81.327)
Loss 0.5391 (0.5043)	Prec@1 82.031 (81.915)
Loss 0.5327 (0.4964)	Prec@1 79.688 (82.291)
Loss 0.3788 (0.4890)	Prec@1 84.375 (82.688)
VALIDATION
 * Validation accuracy: Prec@1 83.580 
EPOCH: 3
TRAIN
Loss 0.4262 (0.4262)	Prec@1 85.938 (85.938)
Loss 0.3988 (0.4469)	Prec@1 84.375 (84.011)
Loss 0.3667 (0.4426)	Prec@1 86.719 (84.336)
Loss 0.3316 (0.4359)	Prec@1 89.844 (84.632)
Loss 0.3408 (0.4383)	Prec@1 83.594 (84.513)
VALIDATION
 * Validation accuracy: Prec@1 85.930 
EPOCH: 4
TRAIN
Loss 0.4299 (0.4299)	Prec@1 85.156 (85.156)
Loss 0.3902 (0.4108)	Prec@1 85.938 (85.094)
Loss 0.4699 (0.4165)	Prec@1 80.469 (84.911

### Moving from MLP to CNN
Now that we have seen how our MLP with two hidden layers performs, let's move on to a convolutional neural network (CNN). The advantage of a CNN is that we no longer have all-to-all connectivity between layers, but instead look at local (2-D or even 3-D) neighborhoods. Such a spatial (or even temporal) filter is convolved over the whole input (here an image), sharing the same weights at every position. The outcome is typically referred to as a feature map, and in order to detect multiple features we apply a set of such filters in parallel.
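
The effect of weight sharing on model size can be estimated with simple arithmetic. The sketch below compares a convolution with 32 filters of size 5x5 on a 28x28 gray-scale image against a hypothetical fully-connected layer that would produce an output of the same size; the dense layer is only a thought experiment.

```python
in_ch, out_ch, k = 1, 32, 5
height = width = 28

# convolution: one k x k filter per (input channel, output channel) pair, plus biases
conv_params = out_ch * in_ch * k * k + out_ch

# output resolution of the convolution without padding and with stride 1
out_h, out_w = height - k + 1, width - k + 1

# all-to-all layer mapping every input pixel to every output value, plus biases
n_in = in_ch * height * width
n_out = out_ch * out_h * out_w
dense_params = n_in * n_out + n_out

print(conv_params)   # 832
print(dense_params)  # 14469120
```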

Let us see how to build a CNN with two convolutional layers and a fully-connected classifier on top. You will notice that we have included pooling layers. These layers subsample their input and introduce translation invariance (to an extent).
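
The limited translation invariance of pooling can be illustrated with a single bright pixel: moving it within one 2x2 pooling window leaves the pooled output unchanged, while moving it into another window does not. The helper `single_pixel` is a hypothetical name used only for this sketch.

```python
import torch
import torch.nn.functional as F


def single_pixel(row, col):
    """4x4 image that is zero everywhere except at (row, col)."""
    img = torch.zeros(1, 1, 4, 4)
    img[0, 0, row, col] = 1.0
    return img


base = F.max_pool2d(single_pixel(0, 0), kernel_size=2, stride=2)
small_shift = F.max_pool2d(single_pixel(1, 1), kernel_size=2, stride=2)
large_shift = F.max_pool2d(single_pixel(2, 2), kernel_size=2, stride=2)

same_window = torch.equal(base, small_shift)    # True: shift stays inside one window
other_window = torch.equal(base, large_shift)   # False: pixel lands in another window
print(same_window, other_window)
```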

In [20]:
class CNN(nn.Module):
    def __init__(self, num_classes):
        super(CNN, self).__init__()
        
        self.conv1 = nn.Conv2d(1, 32, 5) # input features, output features, kernel size
        self.act1 = nn.ReLU()
        self.mp1 = nn.MaxPool2d(2, 2) # kernel size, stride
        
        self.conv2 = nn.Conv2d(32, 64, 5) # input features, output features, kernel size
        self.act2 = nn.ReLU()
        self.mp2 = nn.MaxPool2d(2, 2) # kernel size, stride
        
        self.fc = nn.Linear(64*4*4, num_classes) # 4x4 is the remaining spatial resolution here

    def forward(self, x):
        x = self.mp1(self.act1(self.conv1(x)))
        x = self.mp2(self.act2(self.conv2(x)))
        # The view flattens the output to a vector (the representation needed by the classifier)
        x = x.view(-1, 64*4*4)
        x = self.fc(x)
        return x
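
The 4x4 spatial resolution used for the classifier input follows from the layer settings above (5x5 convolutions without padding, 2x2 pooling with stride 2). A short sketch of this arithmetic, using a hypothetical helper `out_size`:

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28
size = out_size(size, kernel=5)             # conv1: 28 -> 24
size = out_size(size, kernel=2, stride=2)   # mp1:   24 -> 12
size = out_size(size, kernel=5)             # conv2: 12 -> 8
size = out_size(size, kernel=2, stride=2)   # mp2:   8  -> 4

flat_features = 64 * size * size
print(size, flat_features)  # 4 1024
```

If the kernel sizes, padding or input resolution change, the input size of the final linear layer has to be recomputed in the same way.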

### Constructing and running the CNN

In [21]:
model = CNN(num_classes).to(device)

criterion = nn.CrossEntropyLoss().to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9,
                            weight_decay=5e-4)

total_epochs = 10
for epoch in range(total_epochs):
    print("EPOCH:", epoch + 1)
    print("TRAIN")
    train(dataset.train_loader, model, criterion, optimizer, device)
    print("VALIDATION")
    validate(dataset.val_loader, model, criterion, device)

EPOCH: 1
TRAIN
Loss 2.3053 (2.3053)	Prec@1 10.938 (10.938)
Loss 0.7837 (1.2844)	Prec@1 72.656 (55.097)
Loss 0.6906 (0.9735)	Prec@1 73.438 (65.403)
Loss 0.4244 (0.8376)	Prec@1 85.156 (69.996)
Loss 0.5215 (0.7568)	Prec@1 82.031 (72.874)
VALIDATION
 * Validation accuracy: Prec@1 80.360 
EPOCH: 2
TRAIN
Loss 0.6053 (0.6053)	Prec@1 78.125 (78.125)
Loss 0.4939 (0.4694)	Prec@1 81.250 (83.308)
Loss 0.4421 (0.4580)	Prec@1 84.375 (83.773)
Loss 0.4357 (0.4508)	Prec@1 85.938 (84.043)
Loss 0.4957 (0.4406)	Prec@1 85.156 (84.352)
VALIDATION
 * Validation accuracy: Prec@1 86.730 
EPOCH: 3
TRAIN
Loss 0.3465 (0.3465)	Prec@1 92.188 (92.188)
Loss 0.3168 (0.3922)	Prec@1 91.406 (86.402)
Loss 0.5141 (0.3897)	Prec@1 81.250 (86.408)
Loss 0.4393 (0.3870)	Prec@1 82.031 (86.374)
Loss 0.3463 (0.3842)	Prec@1 87.500 (86.415)
VALIDATION
 * Validation accuracy: Prec@1 87.210 
EPOCH: 4
TRAIN
Loss 0.3173 (0.3173)	Prec@1 89.062 (89.062)
Loss 0.3939 (0.3592)	Prec@1 85.156 (86.966)
Loss 0.2729 (0.3546)	Prec@1 89.844 (87.271

We can see that by changing to a CNN for images we have already gained around 2 percentage points of accuracy. If you want to play around with this example, you can gain even more by modifying the network to include regularization methods such as dropout, augmenting or preprocessing your data, constructing larger and deeper models, and finding better hyperparameters such as learning rates or mini-batch sizes.
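
As one possible starting point for such experiments, the sketch below adds a dropout layer in front of the classifier. The class name `CNNWithDropout`, the dropout rate of 0.5 and the placement of the layer are assumptions for illustration; whether this improves the accuracy has to be checked by training it.

```python
import torch
import torch.nn as nn


class CNNWithDropout(nn.Module):
    """Hypothetical variant of the CNN above with dropout before the classifier."""

    def __init__(self, num_classes, p_drop=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 5), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(32, 64, 5), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        self.drop = nn.Dropout(p=p_drop)
        self.fc = nn.Linear(64 * 4 * 4, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = self.drop(x.flatten(1))   # flatten to (batch, 1024), then drop units
        return self.fc(x)


demo_out = CNNWithDropout(num_classes=10)(torch.randn(2, 1, 28, 28))
print(demo_out.shape)  # torch.Size([2, 10])
```

Dropout is only active in training mode, so the calls to `model.train()` and `model.eval()` in the training and validation functions matter for a model like this.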

### How well did the model do?
In machine learning research it is crucial to compare and contrast a model with other researchers' implementations. Many current machine learning datasets are posed as benchmarks, where results are rigorously tracked in order to examine the efficiency and efficacy of a proposed model or algorithm.

For the fashion MNIST dataset you can check how well both of your models (from scratch and in PyTorch) perform here:
http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/#

Do keep in mind that in order to analyze the usefulness of a method, one should always compare and contrast on a variety of different datasets with varying tasks and complexity.