# MNIST: Handwritten digit recognition with SuperDuperDB

The [MNIST dataset](https://en.wikipedia.org/wiki/MNIST_database) is the classic hello-world for machine learning and AI.

In this tutorial we implement MNIST classification with the classic LeNet architecture, working directly from the raw, un-preprocessed images.

First we import the SuperDuperDB client and create a fresh collection.

In [1]:
from pinnacledb.datalayer.mongodb.client import SuperDuperClient

import io
import numpy
import PIL.Image
import PIL.JpegImagePlugin
import PIL.PngImagePlugin
import torch
from torchvision import transforms
import torchvision

the_client = SuperDuperClient()
docs = the_client.mnist.digits

INFO:root:These are the RAY_MODELS: [('mnist', 'lenet')]


In [5]:
docs.find_one()

DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 41 203


{'_id': ObjectId('645e1873ea68526c890a55f3'),
 'img': <PIL.PngImagePlugin.PngImageFile image mode=L size=28x28>,
 'class': 4,
 '_fold': 'train'}

The data is available as `PIL.Image` images with labels in the `torchvision` package:

In [None]:
import random

mnist_data = list(torchvision.datasets.MNIST(root='./data', download=True))
random.shuffle(mnist_data)

SuperDuperDB is based on MongoDB, which does not support images, tensors and other special data types out of the box. To remedy this, we create custom SuperDuperDB **types**.
These types handle:

- how to store images as bytes in SuperDuperDB and how to reconstruct the images from those bytes
- how to store tensors in SuperDuperDB and how to retrieve them from the database again

We do this by creating Python classes with `.encode` and `.decode` methods:

In [None]:
class Image:
    types = (PIL.JpegImagePlugin.JpegImageFile, PIL.Image.Image,
             PIL.PngImagePlugin.PngImageFile)

    @staticmethod
    def encode(x):
        buffer = io.BytesIO()
        x.save(buffer, format='png')
        return buffer.getvalue()

    @staticmethod
    def decode(bytes_):
        return PIL.Image.open(io.BytesIO(bytes_))


class FloatTensor:
    types = (torch.FloatTensor, torch.Tensor)

    @staticmethod
    def encode(x):
        x = x.numpy()
        assert x.dtype == numpy.float32
        return memoryview(x).tobytes()

    @staticmethod
    def decode(bytes_):
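        # frombuffer returns a read-only, flat (1-D) view of the bytes; copy it
        # so that torch receives a writable array. The shape is not restored.
        array = numpy.frombuffer(bytes_, dtype=numpy.float32).copy()
        return torch.from_numpy(array).type(torch.float)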
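
Note that the raw bytes produced by `FloatTensor.encode` carry no shape information, so `decode` returns a flat, one-dimensional tensor. The following is a minimal, standard-library-only sketch of one way a shape header could be stored alongside the values. `encode_with_shape` and `decode_with_shape` are hypothetical helpers for illustration and are not part of SuperDuperDB:

```python
import struct


def encode_with_shape(values, shape):
    """Pack float32 values behind a small header that records their shape.

    Hypothetical layout: [ndim: uint32][dim_0 .. dim_n: uint32][float32 data].
    """
    header = struct.pack(f'<I{len(shape)}I', len(shape), *shape)
    body = struct.pack(f'<{len(values)}f', *values)
    return header + body


def decode_with_shape(bytes_):
    """Inverse of encode_with_shape: returns (flat list of floats, shape)."""
    (ndim,) = struct.unpack_from('<I', bytes_, 0)
    shape = struct.unpack_from(f'<{ndim}I', bytes_, 4)
    n_values = 1
    for dim in shape:
        n_values *= dim
    values = struct.unpack_from(f'<{n_values}f', bytes_, 4 + 4 * ndim)
    return list(values), tuple(shape)


payload = encode_with_shape([0.0, 0.5, 1.0, 1.5, 2.0, 2.5], (2, 3))
values, shape = decode_with_shape(payload)
```

With the shape recovered, a NumPy array decoded from the data portion could be restored with `array.reshape(shape)`.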

Once these classes are ready, we can register instances of them with SuperDuperDB, giving each type a suitable name:

In [None]:
docs.create_type('image', Image(), serializer='dill')
docs.create_type('float_tensor', FloatTensor(), serializer='dill')

Now that we have these custom types, we can insert the data into SuperDuperDB:

In [None]:
_, jobs = docs.insert_many([{'img': x[0], 'class': x[1]} for x in mnist_data[:-1000]])

In [None]:
docs.count_documents({})

In [None]:
jobs

When data is inserted into SuperDuperDB, certain jobs are triggered. These include downloading content into the database from provided URIs and, if any model **watchers** have been added, running them over the new data.

The second output of the `insert_many` statement is a dictionary of the ids of the asynchronous jobs that were created.
These can also be seen in the job list:

In [None]:
docs.database.list_jobs()

In this case, only one job was created: to download content. We can watch the stdout/stderr of this job like this:

In [None]:
docs.watch_job(jobs['_download_content'][0])
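
The asynchronous pattern described above (the insert returns immediately with job ids, and the work continues in the background) can be sketched with the standard library. This is only an illustration of the idea: `submit_job`, `watch_job`, `download_content` and `job_registry` are hypothetical names, and this toy `watch_job` simply blocks for the result rather than streaming logs as the real command does:

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
job_registry = {}  # hypothetical: job id -> Future


def submit_job(fn, *args):
    """Start fn in the background and return a job id immediately."""
    job_id = str(uuid.uuid4())
    job_registry[job_id] = executor.submit(fn, *args)
    return job_id


def watch_job(job_id):
    """Block until the job has finished and return its result."""
    return job_registry[job_id].result()


def download_content(n_docs):
    # Stand-in for real work such as fetching bytes from URIs.
    return f'downloaded content for {n_docs} documents'


# Mirrors the shape of the dictionary returned alongside the insert result.
jobs = {'_download_content': [submit_job(download_content, 3)]}
message = watch_job(jobs['_download_content'][0])
executor.shutdown()
```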

You can see that all of the data apart from the final 1000 items has been added:

In [None]:
docs.count_documents({})
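
The slice `mnist_data[:-1000]` used in the insert keeps everything except the last 1000 items. Assuming the default `torchvision` training split of 60,000 images, the count above should therefore be 59,000. A small sketch of the slicing, with integers standing in for the `(image, label)` pairs:

```python
# MNIST's training split contains 60,000 labelled images; integers stand in
# for the (image, label) pairs here.
mnist_data = list(range(60_000))

initial = mnist_data[:-1000]    # everything except the last 1000 items
held_back = mnist_data[-1000:]  # the last 1000 items, inserted later

print(len(initial), len(held_back))  # 59000 1000
```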

You'll see that when we fetch data from the database, it's in exactly the form that we want. For example, the images have been saved and recalled as `PIL.Image` objects.

In [None]:
docs.find_one()['img']

Now that we've added the data to SuperDuperDB, we're ready to create a model. This is a simple PyTorch model implementing the iconic [LeNet architecture](https://en.wikipedia.org/wiki/LeNet). In addition to the standard PyTorch `.forward` method, SuperDuperDB allows users to specify `.preprocess` and `.postprocess` methods which define respectively:

- how data is converted from its in-database form to a tensor
- how the model output is converted back into the form to be stored in the database

In [6]:
def label(x):
    return torch.tensor(x)


class LeNet5(torch.nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.layer1 = torch.nn.Sequential(
            torch.nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0),
            torch.nn.BatchNorm2d(6),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = torch.nn.Sequential(
            torch.nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),
            torch.nn.BatchNorm2d(16),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = torch.nn.Linear(400, 120)
        self.relu = torch.nn.ReLU()
        self.fc1 = torch.nn.Linear(120, 84)
        self.relu1 = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(84, num_classes)

    def preprocess(self, x):
        return transforms.Compose([
            transforms.Resize((32, 32)),
            transforms.ToTensor(),
            transforms.Normalize(mean=(0.1307,), std=(0.3081,))]
        )(x)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        out = self.relu(out)
        out = self.fc1(out)
        out = self.relu1(out)
        out = self.fc2(out)
        return out

    def postprocess(self, x):
        return int(x.topk(1)[1].item())
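
The input width of 400 in `torch.nn.Linear(400, 120)` follows from the layer sizes above. As a quick check, the standard output-size formula for convolution and pooling layers (with dilation 1) can be applied step by step. `conv_out` is a helper written for this illustration:

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 32                                       # after transforms.Resize((32, 32))
size = conv_out(size, kernel_size=5)            # layer1 convolution -> 28
size = conv_out(size, kernel_size=2, stride=2)  # layer1 max-pool    -> 14
size = conv_out(size, kernel_size=5)            # layer2 convolution -> 10
size = conv_out(size, kernel_size=2, stride=2)  # layer2 max-pool    -> 5

flattened = 16 * size * size  # 16 output channels from layer2
print(flattened)              # 400, the input width of self.fc
```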

Let's test this model on a single image. The call below applies the `preprocess`, `forward` and `postprocess` methods in sequence, wrapping the input in a singleton batch before `forward` and unpacking the batch afterwards. The same logic is used wherever a model is applied to data.

In [7]:
docs.predict_one(LeNet5(10), docs.find()[23]['img'])

DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 41 185
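
The preprocess, batch, forward, unbatch, postprocess sequence described above can be sketched in plain Python. `ToyModel` and this standalone `predict_one` are hypothetical stand-ins for illustration, not the library's implementation:

```python
class ToyModel:
    """Hypothetical stand-in exposing the same three hooks as LeNet5."""

    def preprocess(self, x):
        # In-database form -> model input (here: scale a list of pixel values).
        return [v / 255 for v in x]

    def forward(self, batch):
        # Operates on a batch: returns one list of class scores per input.
        return [[sum(item), 1.0] for item in batch]

    def postprocess(self, scores):
        # Model output -> storable form (index of the highest score).
        return max(range(len(scores)), key=scores.__getitem__)


def predict_one(model, x):
    prepared = model.preprocess(x)
    batch = [prepared]                    # wrap in a singleton batch
    outputs = model.forward(batch)
    return model.postprocess(outputs[0])  # unpack the batch


bright = predict_one(ToyModel(), [255, 255, 0])  # scores [2.0, 1.0] -> class 0
dark = predict_one(ToyModel(), [0, 0, 0])        # scores [0.0, 1.0] -> class 1
```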


Now that we have our model, we need a target for learning; for this we use `label`:

In [8]:
docs.create_model('lenet', LeNet5(10), serializer='dill')
docs.create_model('label', label, serializer='dill')

In [11]:
layer = docs.models['lenet']

In [12]:
layer.predict_one(docs.find_one()['img'])

DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 41 203


7

We also need a metric to measure performance. Metrics are applied to individual data points one at a time, and the results are averaged:

In [None]:
def accuracy(x, y):
    return x == y

docs.create_metric('accuracy', accuracy, serializer='dill')
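
As a minimal sketch of the averaging just described, the per-point metric can be applied to each prediction and target pair and the mean taken. `mean_metric` is a hypothetical helper written for this illustration:

```python
def accuracy(x, y):
    return x == y


def mean_metric(metric, predictions, targets):
    """Apply metric to each (prediction, target) pair and average the results."""
    values = [metric(p, t) for p, t in zip(predictions, targets)]
    return sum(values) / len(values)


score = mean_metric(accuracy, [7, 2, 1, 0], [7, 2, 1, 4])
print(score)  # 0.75
```

Because Python treats `True` as 1 and `False` as 0, averaging the boolean results gives the fraction of correct predictions.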

When measuring performance, SuperDuperDB requires users to create a separate validation set, which is stored separately for reproducibility: edits and deletes on the main collection don't affect it.

In [None]:
docs.create_validation_set('classification', sample_size=250)
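
To illustrate why a separately stored validation set is unaffected by later changes, here is a sketch that assumes the set is a copied random sample of the collection. The real implementation may differ, and every name below is hypothetical:

```python
import copy
import random

collection = [{'_id': i, 'class': i % 10} for i in range(1000)]

rng = random.Random(0)  # fixed seed so the sample is repeatable
validation_set = copy.deepcopy(rng.sample(collection, 250))
snapshot = copy.deepcopy(validation_set)

# Later edits and deletes on the main collection...
for doc in collection:
    doc['class'] = -1
del collection[1:500]

# ...leave the stored validation set untouched.
unchanged = validation_set == snapshot
```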

Now we're ready to train the model using `Collection.create_learning_task`. This trains a model to predict one part of the data from another. The first argument lists the models (`lenet` for the input, `label` for the target) and the second lists the document keys they are applied to (`img` and `class`).

In [None]:
from pinnacledb.training.torch.trainer import TorchTrainerConfiguration

jobs = docs.create_learning_task(
    ['lenet', 'label'],
    ['img', 'class'],
    identifier='predictor',
    configuration=TorchTrainerConfiguration(
        objective=torch.nn.CrossEntropyLoss,
        n_iterations=1000,
        validation_interval=50,
        loader_kwargs={'batch_size': 25, 'num_workers': 0},
    ),
    metrics=['accuracy'],
    validation_sets=('classification',),
)
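
For a single example, the `torch.nn.CrossEntropyLoss` objective used above is the negative log-probability of the target class under a softmax of the model's outputs; with the default settings it is then averaged over the batch. A plain-Python sketch of the per-example value, where `cross_entropy` is a helper written for this illustration:

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]


# A model that scores all 10 digits equally: loss is ln(10), about 2.30.
untrained = cross_entropy([0.0] * 10, target=4)

# A model that strongly favours the correct digit: loss close to 0.
confident = cross_entropy([0.0] * 4 + [10.0] + [0.0] * 5, target=4)
```

A loss starting near 2.30 and falling towards 0 is therefore what one would expect to see in the training logs for a 10-class problem.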

This command creates two jobs: one to train the model, and another to apply the model to the data after training. As before, the jobs are spawned asynchronously. We can watch the output of the jobs, and in the meantime we can do other things in our environment and with the database. Stopping the `watch_job` command only breaks the connection to the logs; it doesn't stop the job.

In [None]:
docs.watch_job(jobs[0])

We can continue watching the job by executing the command again:

In [None]:
docs.watch_job(jobs[0])

Now that training has finished, we can watch the computation of model outputs:

In [None]:
docs.watch_job(jobs[1])

In [None]:
info = docs.database.get_object_info('predictor', 'learning_task')
info

In [None]:
from matplotlib import pyplot as plt

plt.plot(info['metric_values']['classification']['accuracy'])

Accuracy is good, and we can see the outputs have been added to the documents (`_outputs.img.lenet`):

In [None]:
docs.find_one()

After training, you'll see that a model **watcher** has been created, which keeps the model outputs for the `img` key up to date:

In [None]:
docs.list_watchers()

When new data is added, the trained model kicks into action and its outputs are added to the newly inserted documents:

In [None]:
_, jobs = docs.insert_many([{'img': x[0], 'class': x[1], 'update': True} for x in mnist_data[-1000:]])
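
The watcher idea can be sketched with a toy in-memory collection that runs a model over each inserted document and stores the result under `_outputs.<key>.<model>`. This is a synchronous illustration under assumed names (`ToyCollection`, `add_watcher`); the real system runs the computation as an asynchronous job:

```python
class ToyCollection:
    """Hypothetical in-memory stand-in for a collection with watchers."""

    def __init__(self):
        self.documents = []
        self.watchers = []  # (key, model_name, predict) triples

    def add_watcher(self, key, model_name, predict):
        self.watchers.append((key, model_name, predict))

    def insert_many(self, documents):
        for doc in documents:
            # Enrich each new document with the output of every watcher.
            for key, model_name, predict in self.watchers:
                outputs = doc.setdefault('_outputs', {})
                outputs.setdefault(key, {})[model_name] = predict(doc[key])
            self.documents.append(doc)


toy_docs = ToyCollection()
# A fake "model" that maps a string to a digit, standing in for LeNet.
toy_docs.add_watcher('img', 'lenet', lambda img: len(img) % 10)
toy_docs.insert_many([{'img': 'xxxxxxx', 'class': 7}])
stored = toy_docs.documents[0]
```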

We can watch the progress of adding this new data as before:

In [None]:
docs.watch_job(jobs['watcher', 'predictor'][0])

After inserting the data and training the model, the model is automatically served on the SuperDuperDB model server. If your deployment is exposed to the internet, these predictions are available anywhere:

In [None]:
im = docs.find_one({'_fold': 'valid'})['img']
im

In [None]:
docs.apply_model('lenet', im)