# Trainable MNIST

Ray Tune has a great collection of [examples](https://github.com/ray-project/ray/tree/master/python/ray/tune/examples), and I have picked a simple one to implement in a notebook. It is a good one to play with if you are without a GPU or want to try a *big* parameter search, as the model is relatively quick and easy to train.


## What is MNIST?

In case you are unfamiliar, MNIST is a standard handwritten digit dataset used widely in machine learning examples. The dataset consists of 70,000 greyscale 28x28 images (60,000 for training and 10,000 for testing).


Here are some examples from the dataset ([source: Wikipedia](https://en.wikipedia.org/wiki/MNIST_database#/media/File:MnistExamples.png)):

![MNIST Examples](MnistExamples.png)



## Load Dependencies

We load the usual dependencies and also [PyTorch](https://pytorch.org/docs/stable/index.html), so that we can define a small neural network to train as a classifier.

In [1]:
%load_ext autoreload
%autoreload 2

from dependencies import *

Loading dependencies we have already seen...
Importing ray...
Done...


In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

## Dataloaders

We'll reuse some standard dataloaders from the Ray examples. The `get_data_loaders` function returns two PyTorch dataloaders, one for the training set and one for the test set.

In [3]:
from mnist_pytorch import get_data_loaders

## Network

We define a very small ConvNet with a single `Conv2d` layer and a single `Linear` layer.

Things to Try:
 - Try building up larger networks to see if performance improves
 - Try adding options to the `__init__()` function to vary layer properties or the number of layers and tune these new parameters

In [4]:
class ConvNet(nn.Module):
    def __init__(self, width=3):
        super(ConvNet, self).__init__()       
        self.width=width
        
        self.conv1 = nn.Conv2d(1, width, kernel_size=3)
        self.fc = nn.Linear(width*8*8, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 3))
        x = x.view(-1, self.width*8*8)
        x = self.fc(x)
        return F.log_softmax(x, dim=1)
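
The `8*8` in `nn.Linear(width*8*8, 10)` falls out of the layer shapes: a 28x28 image through an unpadded 3x3 convolution gives 26x26, and max pooling with a kernel of 3 (the stride defaults to the kernel size) floors that to 8x8. Here is a minimal sketch of that arithmetic. The helper names `conv_out`, `pool_out` and `flat_features` are my own, not part of PyTorch, and the formulas assume no dilation and no pooling padding.

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


def pool_out(size, kernel_size, stride=None):
    """Output size of max pooling along one dimension (no padding, floor mode)."""
    stride = stride or kernel_size  # max_pool2d uses the kernel size as default stride
    return (size - kernel_size) // stride + 1


def flat_features(width, image_size=28, kernel_size=3, pool_size=3):
    """Number of inputs the final Linear layer needs for a given channel count."""
    side = pool_out(conv_out(image_size, kernel_size), pool_size)
    return width * side * side


print(conv_out(28, 3))   # 26
print(pool_out(26, 3))   # 8
print(flat_features(3))  # 192
```

If you change the kernel or pooling size while experimenting, recomputing the flattened size this way avoids a shape mismatch in `x.view(...)`.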

## Train & Test Functions

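We define some functions for our training and testing loops.

During training, we iterate over batches, calculating the loss, backpropagating and optimising the network. Each call to `train()` returns once more than `EPOCH_SIZE` examples have been seen, so one "epoch" here is a small slice of the training set rather than a full pass.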

In [5]:
EPOCH_SIZE = 512

def train(model, optimizer, train_loader, device=None):
    device = device or torch.device("cpu")
    
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        if batch_idx * len(data) > EPOCH_SIZE:
            return
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

At test time we iterate over the test set, stopping once more than `TEST_SIZE` examples have been seen, and calculate the accuracy.

In [6]:
TEST_SIZE=256

def test(model, data_loader, device=None):
    device = device or torch.device("cpu")
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(data_loader):
            if batch_idx * len(data) > TEST_SIZE:
                break
            data, target = data.to(device), target.to(device)
            outputs = model(data)
            _, predicted = torch.max(outputs, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()

    return correct / total
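
Because both loops compare `batch_idx * len(data)` against a limit using `>`, they process slightly more than `EPOCH_SIZE` and `TEST_SIZE` examples. The sketch below counts the batches that actually get through. It assumes a batch size of 64, which I believe is what the Ray example loaders use but is not stated in this notebook, and `batches_seen` is a hypothetical helper written just for this check.

```python
def batches_seen(limit, batch_size, n_batches_available):
    """Count batches processed before `batch_idx * batch_size > limit` is true."""
    seen = 0
    for batch_idx in range(n_batches_available):
        if batch_idx * batch_size > limit:
            break
        seen += 1
    return seen


BATCH_SIZE = 64  # assumption: default batch size of the example dataloaders

train_batches = batches_seen(512, BATCH_SIZE, n_batches_available=60000 // BATCH_SIZE)
test_batches = batches_seen(256, BATCH_SIZE, n_batches_available=10000 // BATCH_SIZE)

print(train_batches, train_batches * BATCH_SIZE)  # 9 576
print(test_batches, test_batches * BATCH_SIZE)    # 5 320
```

Under that assumption every reported accuracy is a multiple of 1/320, which is consistent with values such as 0.1625 (52/320) in the trial output further down. It also means small differences between trials are only a handful of images.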

## Check for CUDA

In [7]:
print('CUDA Available :D' if torch.cuda.is_available() else 'CPU Only :O')

CUDA Available :D


## Start Ray

In [None]:
ray.shutdown()
ray.init(num_cpus=10, num_gpus=1)

In [8]:
from os import path

class TrainMNIST(tune.Trainable):
    
    def _setup(self, config):
        # this example runs on the CPU; to use a GPU, select "cuda" here when it is
        # available and request GPUs in resources_per_trial so that Ray assigns them
        self.device = torch.device("cpu")
        
        #
        # In this case this will also fetch the MNIST data on first run
        #
        self.train_loader, self.test_loader = get_data_loaders()
        
        # create the network
        self.model = ConvNet(
            # pass in parameters here if we want to tune network internals
        ).to(self.device)
        
        # set up the optimiser (try Adam instead and change the parameters we are tuning)
        self.optimizer = optim.SGD(
            self.model.parameters(),
            lr=config.get("lr", 0.01),
            momentum=config.get("momentum", 0.9))
                

    def _train(self):
        
        train(self.model, self.optimizer, self.train_loader, device=self.device)
        
        acc = test(self.model, self.test_loader, self.device)
        
        return {"mean_accuracy": acc}
    
    
    def _save(self, checkpoint_dir):
        checkpoint_path = path.join(checkpoint_dir, "model.pth")
        torch.save(self.model.state_dict(), checkpoint_path)
        return checkpoint_path
    
    
    def _restore(self, checkpoint_path):
        self.model.load_state_dict(torch.load(checkpoint_path))

In [9]:
config = {
        "lr": tune.uniform(0.001, 0.1),
        "momentum": tune.uniform(0.1, 0.9),
    }


analysis = tune.run(
    TrainMNIST,
    config=config,
    local_dir="~/ray_results/torch_mnist",
    mode='max',
    resources_per_trial={
        "cpu": 2,
        "gpu": 0
    },
    num_samples=3,
    checkpoint_at_end=True,
    checkpoint_freq=10,
#     keep_checkpoints_num=3, # only keep n best checkpoints
    stop={
        "mean_accuracy": 0.99,
        "training_iteration": 100,
    })



2020-11-09 18:04:22,815	INFO services.py:1164 -- View the Ray dashboard at http://127.0.0.1:8265


Trial name,status,loc,lr,momentum
TrainMNIST_9e8f1_00000,RUNNING,,0.0469921,0.490219
TrainMNIST_9e8f1_00001,PENDING,,0.0920939,0.800545
TrainMNIST_9e8f1_00002,PENDING,,0.0311037,0.12874


Result for TrainMNIST_9e8f1_00002:
  date: 2020-11-09_18-04-26
  done: false
  experiment_id: 937abac09fa74c4988e7d49efbcd75fc
  experiment_tag: 2_lr=0.031104,momentum=0.12874
  hostname: Schlepptop
  iterations_since_restore: 1
  mean_accuracy: 0.1625
  node_ip: 192.168.123.68
  pid: 8803
  time_since_restore: 0.3307211399078369
  time_this_iter_s: 0.3307211399078369
  time_total_s: 0.3307211399078369
  timestamp: 1604941466
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 9e8f1_00002
  
Result for TrainMNIST_9e8f1_00000:
  date: 2020-11-09_18-04-26
  done: false
  experiment_id: ee158f9e055d4b18939eaac9458ffbf9
  experiment_tag: 0_lr=0.046992,momentum=0.49022
  hostname: Schlepptop
  iterations_since_restore: 1
  mean_accuracy: 0.4875
  node_ip: 192.168.123.68
  pid: 8801
  time_since_restore: 0.3922460079193115
  time_this_iter_s: 0.3922460079193115
  time_total_s: 0.3922460079193115
  timestamp: 1604941466
  timesteps_since_restore: 0
  training_iteration: 1
  trial

Trial name,status,loc,lr,momentum,acc,iter,total time (s)
TrainMNIST_9e8f1_00000,RUNNING,192.168.123.68:8801,0.0469921,0.490219,0.875,11,3.77106
TrainMNIST_9e8f1_00001,RUNNING,192.168.123.68:8802,0.0920939,0.800545,0.8625,11,3.70851
TrainMNIST_9e8f1_00002,RUNNING,192.168.123.68:8803,0.0311037,0.12874,0.85625,12,4.00706


Trial name,status,loc,lr,momentum,acc,iter,total time (s)
TrainMNIST_9e8f1_00000,RUNNING,192.168.123.68:8801,0.0469921,0.490219,0.9125,27,8.81664
TrainMNIST_9e8f1_00001,RUNNING,192.168.123.68:8802,0.0920939,0.800545,0.915625,27,8.79603
TrainMNIST_9e8f1_00002,RUNNING,192.168.123.68:8803,0.0311037,0.12874,0.93125,28,9.01744


Trial name,status,loc,lr,momentum,acc,iter,total time (s)
TrainMNIST_9e8f1_00000,RUNNING,192.168.123.68:8801,0.0469921,0.490219,0.915625,43,14.0079
TrainMNIST_9e8f1_00001,RUNNING,192.168.123.68:8802,0.0920939,0.800545,0.934375,44,14.2103
TrainMNIST_9e8f1_00002,RUNNING,192.168.123.68:8803,0.0311037,0.12874,0.89375,44,13.922


Trial name,status,loc,lr,momentum,acc,iter,total time (s)
TrainMNIST_9e8f1_00000,RUNNING,192.168.123.68:8801,0.0469921,0.490219,0.946875,58,19.0229
TrainMNIST_9e8f1_00001,RUNNING,192.168.123.68:8802,0.0920939,0.800545,0.946875,59,19.1478
TrainMNIST_9e8f1_00002,RUNNING,192.168.123.68:8803,0.0311037,0.12874,0.925,60,19.2009


Trial name,status,loc,lr,momentum,acc,iter,total time (s)
TrainMNIST_9e8f1_00000,RUNNING,192.168.123.68:8801,0.0469921,0.490219,0.909375,73,24.2677
TrainMNIST_9e8f1_00001,RUNNING,192.168.123.68:8802,0.0920939,0.800545,0.940625,73,23.9264
TrainMNIST_9e8f1_00002,RUNNING,192.168.123.68:8803,0.0311037,0.12874,0.921875,75,24.0689


Trial name,status,loc,lr,momentum,acc,iter,total time (s)
TrainMNIST_9e8f1_00000,RUNNING,192.168.123.68:8801,0.0469921,0.490219,0.93125,88,29.1328
TrainMNIST_9e8f1_00001,RUNNING,192.168.123.68:8802,0.0920939,0.800545,0.915625,90,29.2737
TrainMNIST_9e8f1_00002,RUNNING,192.168.123.68:8803,0.0311037,0.12874,0.94375,90,29.0021


Trial name,status,loc,lr,momentum,acc,iter,total time (s)
TrainMNIST_9e8f1_00000,TERMINATED,,0.0469921,0.490219,0.971875,100,33.016
TrainMNIST_9e8f1_00001,TERMINATED,,0.0920939,0.800545,0.921875,100,32.4403
TrainMNIST_9e8f1_00002,TERMINATED,,0.0311037,0.12874,0.953125,100,32.0869
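
All three trials end at iteration 100 without reaching 0.99 accuracy. As I understand the dictionary form of `stop`, a trial is stopped as soon as any one of the listed metrics reaches its threshold. The sketch below is a simplified stand-in for that rule, not Tune's actual implementation, and `should_stop` is a hypothetical name. The result values are copied from the status tables above.

```python
def should_stop(result, stop):
    """Return True once any metric in `stop` has reached its threshold."""
    return any(result[name] >= threshold for name, threshold in stop.items())


stop = {"mean_accuracy": 0.99, "training_iteration": 100}

# Trial TrainMNIST_9e8f1_00000, part-way through and at the end of the run.
mid_run = {"mean_accuracy": 0.946875, "training_iteration": 58}
final = {"mean_accuracy": 0.971875, "training_iteration": 100}

print(should_stop(mid_run, stop))  # False
print(should_stop(final, stop))    # True
```

Here the iteration cap is what ends each trial, so raising `training_iteration` (or lowering the accuracy target) changes how long the search runs.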


In [10]:
from pprint import pprint
print("Best config is:")
pprint(analysis.get_best_trial(metric="mean_accuracy"))
pprint(analysis.get_best_config(metric="mean_accuracy"))

Best config is:
TrainMNIST_9e8f1_00000
{'lr': 0.046992089578200634, 'momentum': 0.49021888849247197}
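
As a sanity check, the same answer can be reproduced by hand from the final status table. This sketch just takes the maximum over the last reported `mean_accuracy` of each trial; it does not call Ray, and the `final_results` dictionary is typed in from the output above.

```python
# Final row of each trial, copied from the last status table.
final_results = {
    "TrainMNIST_9e8f1_00000": {"lr": 0.0469921, "momentum": 0.490219, "mean_accuracy": 0.971875},
    "TrainMNIST_9e8f1_00001": {"lr": 0.0920939, "momentum": 0.800545, "mean_accuracy": 0.921875},
    "TrainMNIST_9e8f1_00002": {"lr": 0.0311037, "momentum": 0.12874, "mean_accuracy": 0.953125},
}

# mode='max': the best trial is the one with the highest final accuracy.
best_name = max(final_results, key=lambda name: final_results[name]["mean_accuracy"])
best = final_results[best_name]

print(best_name)                      # TrainMNIST_9e8f1_00000
print(best["lr"], best["momentum"])   # 0.0469921 0.490219
```

With only three samples and a small test subset, the ranking is noisy, so treat the "best" config as a starting point rather than a firm result.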


In [11]:
%load_ext tensorboard
from tensorboard import notebook
%tensorboard --logdir "~/ray_results/torch_mnist"

Launching TensorBoard...

## Load a model and check results

In [None]:
train_loader, test_loader = get_data_loaders()

In [None]:
X, y = list(test_loader)[np.random.randint(0, len(test_loader))]

#### Find the model checkpoint you want to load

In [None]:
checkpoint_path = path.join(path.expanduser('~'),'ray_results','torch_mnist','TrainMNIST',
                          'TrainMNIST_1_lr=0.02837,momentum=0.40653_2020-06-12_11-27-558a5m3jyk',
                         'checkpoint_100', 'model.pth')
print(path.exists(checkpoint_path))

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

with torch.set_grad_enabled(False):
    model = ConvNet()
    model.load_state_dict(torch.load(checkpoint_path))
    
    model.to(device)
    
    y_ = model(X.to(device)).cpu()
    
    _, predicted = torch.max(y_, 1)
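
The labels `y` are loaded with the batch but not used yet. A quick way to put a number on the loaded model is to compare them with the predictions. The sketch below uses small synthetic tensors in place of `y_` and `y` so that it runs on its own, and `batch_accuracy` is a hypothetical helper name.

```python
import torch


def batch_accuracy(log_probs, targets):
    """Fraction of rows whose highest-scoring class matches the target label."""
    predicted = log_probs.argmax(dim=1)
    return (predicted == targets).float().mean().item()


# Synthetic stand-ins for the model output `y_` and the labels `y`.
scores = torch.tensor([[2.0, 0.1, 0.1],
                       [0.1, 3.0, 0.1],
                       [0.1, 0.2, 5.0],
                       [4.0, 0.1, 0.1]])
log_probs = torch.log_softmax(scores, dim=1)
targets = torch.tensor([0, 1, 2, 1])

print(batch_accuracy(log_probs, targets))  # 0.75
```

In the notebook itself this would be called as `batch_accuracy(y_, y)` after the cell above.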

In [None]:
fig, axs = plt.subplots(8,8, figsize=(20,20))
axsf = [item for s in axs for item in s]

for n,ax in enumerate(axsf):
    ax.imshow(X[n].squeeze().numpy())
    ax.axis('off')
    ax.set_title(predicted[n].item())
    
plt.show()

In [None]:
ray.shutdown()


In [None]:
# https://github.com/ray-project/ray/issues/4569

In [None]:
# Exercises
# - change out the optimiser for adam
# - add network hyperparameters
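
One possible starting point for both exercises is sketched below: the network width is read from the config and the optimiser is swapped for Adam. `TunableConvNet` and `build` are hypothetical names, the `"width"` config key is my own addition, and the default values are arbitrary choices rather than tuned settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class TunableConvNet(nn.Module):
    """Same layout as ConvNet above, with the channel count exposed for tuning."""

    def __init__(self, width=3):
        super().__init__()
        self.width = width
        self.conv1 = nn.Conv2d(1, width, kernel_size=3)
        self.fc = nn.Linear(width * 8 * 8, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 3))
        x = x.view(-1, self.width * 8 * 8)
        return F.log_softmax(self.fc(x), dim=1)


def build(config):
    """Create a model and an Adam optimiser from a Tune-style config dict."""
    model = TunableConvNet(width=config.get("width", 3))
    optimizer = optim.Adam(model.parameters(), lr=config.get("lr", 1e-3))
    return model, optimizer


model, optimizer = build({"width": 8, "lr": 3e-4})
out = model(torch.zeros(2, 1, 28, 28))
print(out.shape)  # torch.Size([2, 10])
```

Inside the trainable, the two returned objects would replace the `ConvNet()` and `optim.SGD(...)` calls. The search space could then sample something like `tune.choice([2, 4, 8, 16])` for `"width"` and `tune.loguniform(1e-4, 1e-2)` for `"lr"`. Adam has no `momentum` argument, so that entry would be dropped from the config.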