# PyTorch with Kubeflow Fairing 

In this notebook we walk through training a character recognition model on the MNIST dataset using PyTorch.
We then show how to use Kubeflow Fairing to run the same training job on both Kubeflow and CMLE.

In [1]:
# You can skip this step if you have already installed the necessary dependencies
!pip install -U -r requirements.txt

Requirement already up-to-date: torch in /usr/local/lib/python3.6/dist-packages (from -r requirements.txt (line 1)) (1.6.0)
Requirement already up-to-date: torchvision in /usr/local/lib/python3.6/dist-packages (from -r requirements.txt (line 2)) (0.7.0)
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.


In [9]:
%%writefile train_model.py
import argparse
import os
import subprocess
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets
from torchvision import transforms
# macOS users may hit this bug: https://github.com/pytorch/pytorch/issues/20030
# A temporary workaround is "brew install libomp"

Writing train_model.py


## PyTorch Model Definition

Set up a convolutional neural network using PyTorch.

In [10]:
%%writefile train_model.py -a
class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

Appending to train_model.py
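
The `320` in `nn.Linear(320, 50)` and `x.view(-1, 320)` follows from the layer sizes above. As a minimal sketch, assuming 28x28 MNIST inputs with no padding and stride 1 for the convolutions (the `nn.Conv2d` defaults), the arithmetic can be checked in plain Python. The helper `conv_out` is a hypothetical name introduced only for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

side = 28                                   # MNIST images are 28x28
side = conv_out(side, kernel=5)             # conv1: 28 -> 24
side = conv_out(side, kernel=2, stride=2)   # max_pool2d(2): 24 -> 12
side = conv_out(side, kernel=5)             # conv2: 12 -> 8
side = conv_out(side, kernel=2, stride=2)   # max_pool2d(2): 8 -> 4

channels = 20                               # output channels of conv2
flat_features = channels * side * side
print(flat_features)                        # 320
```

If the kernel sizes or the input resolution change, the input size of `fc1` has to change with them.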


## PyTorch Training and Test Functions
A simple training function that iterates over the dataset in batches.

In [11]:
%%writefile train_model.py -a
def train(model, device, train_loader, optimizer, epoch, log_interval):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0 and batch_idx > 0:
            print('Train Epoch: {}\t[{}/{}\t({:.0f}%)]\tLoss: {:.6f}'.format(
              epoch, batch_idx * len(data), len(train_loader.dataset),
              100. * batch_idx / len(train_loader), loss.item()))

Appending to train_model.py
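
The model ends in `F.log_softmax` and the training loop uses `F.nll_loss`; together these amount to the usual cross-entropy loss. The sketch below illustrates that for a single example in plain Python. It is not PyTorch's implementation: the functions are simplified stand-ins that take one example at a time, and the logits are made-up values.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class for one example."""
    return -log_probs[target]

logits = [2.0, 0.5, -1.0]   # hypothetical raw scores for three classes
target = 0                  # index of the true class

loss = nll_loss(log_softmax(logits), target)

# Cross-entropy computed directly from the softmax probabilities.
total = sum(math.exp(v) for v in logits)
probs = [math.exp(v) / total for v in logits]
cross_entropy = -math.log(probs[target])

print(abs(loss - cross_entropy) < 1e-12)
```

By default `F.nll_loss` averages this quantity over the batch, which is the value logged as `Loss` during training.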


In [12]:
%%writefile train_model.py -a
def test(model, device, test_loader, epoch):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(
              output, target, reduction='sum').item()  # sum up batch loss
            pred = output.max(
              1, keepdim=True)[1]  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

        test_loss /= len(test_loader.dataset)
        print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
          test_loss, correct, len(test_loader.dataset),
          100. * correct / len(test_loader.dataset)))

Appending to train_model.py
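
The test function sums the loss within each batch and divides once by the dataset size, rather than averaging the per-batch means. A small sketch with made-up per-example losses shows why the two differ when the last batch is shorter than the others. The numbers and the `mean` helper are illustrative only.

```python
def mean(values):
    return sum(values) / len(values)

# Hypothetical per-example losses: one full batch and one short final batch.
batches = [[1.0, 2.0, 3.0, 4.0], [10.0]]
n_examples = sum(len(b) for b in batches)

# Approach used in test(): sum per batch, divide once by the dataset size.
per_example_average = sum(sum(b) for b in batches) / n_examples

# Averaging the batch means instead over-weights the short batch.
mean_of_batch_means = mean([mean(b) for b in batches])

print(per_example_average)   # 4.0
print(mean_of_batch_means)   # 6.25
```

With the default `test_batch_size=1000` and the 10,000 MNIST test images every batch is full, so both approaches would agree here; summing stays correct for any batch size.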


In [13]:
%%writefile train_model.py -a
def nkTrain(batch_size=64, epochs=1, log_interval=100, lr=0.01, model_dir=None, momentum=0.5, 
                       no_cuda=False, seed=1, test_batch_size=1000):

    use_cuda = not no_cuda and torch.cuda.is_available()
    torch.manual_seed(seed)
    device = torch.device('cuda' if use_cuda else 'cpu')
    print("Using {} for training.".format(device))

    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
    train_loader = torch.utils.data.DataLoader(
      datasets.MNIST(
          'data',
          train=True,
          download=True,
          transform=transforms.Compose([
              transforms.ToTensor(),
              # Normalize a tensor image with mean and standard deviation
              transforms.Normalize(mean=(0.1307,), std=(0.3081,))
          ])),
      batch_size=batch_size,
      shuffle=True,
      **kwargs)
    test_loader = torch.utils.data.DataLoader(
      datasets.MNIST(
          'data',
          train=False,
          transform=transforms.Compose([
              transforms.ToTensor(),
              # Normalize a tensor image with mean and standard deviation              
              transforms.Normalize(mean=(0.1307,), std=(0.3081,))
          ])),
      batch_size=test_batch_size,
      shuffle=True,
      **kwargs)

    model = Net().to(device)
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)

    for epoch in range(1, epochs + 1):
        start_time = time.time()
        train(model, device, train_loader, optimizer, epoch, log_interval)
        print("Time taken for epoch #{}: {:.2f}s".format(epoch, time.time()-start_time))
        test(model, device, test_loader, epoch)

    if model_dir:
        model_file_name = 'torch.model'
        tmp_model_file = os.path.join('/tmp', model_file_name)
        torch.save(model.state_dict(), tmp_model_file)
        subprocess.check_call([
            'gsutil', 'cp', tmp_model_file,
            os.path.join(model_dir, model_file_name)])

Appending to train_model.py
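
`train_model.py` imports `argparse` but does not use it. One possible way to expose the `nkTrain` parameters on the command line is sketched below. The flag names and the helpers `build_parser` and `parse_train_kwargs` are assumptions made for this illustration, not part of the original script; the defaults mirror the `nkTrain` signature.

```python
import argparse

def build_parser():
    """Hypothetical CLI whose options map one-to-one onto nkTrain's keywords."""
    parser = argparse.ArgumentParser(description='PyTorch MNIST training')
    parser.add_argument('--batch-size', type=int, default=64)
    parser.add_argument('--test-batch-size', type=int, default=1000)
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--log-interval', type=int, default=100)
    parser.add_argument('--lr', type=float, default=0.01)
    parser.add_argument('--momentum', type=float, default=0.5)
    parser.add_argument('--seed', type=int, default=1)
    parser.add_argument('--no-cuda', action='store_true')
    parser.add_argument('--model-dir', default=None)
    return parser

def parse_train_kwargs(argv=None):
    """Return the parsed options as a dict keyed by nkTrain's parameter names."""
    return vars(build_parser().parse_args(argv))

kwargs = parse_train_kwargs(['--epochs', '2', '--lr', '0.05'])
print(kwargs['epochs'], kwargs['lr'], kwargs['no_cuda'])
```

Because argparse turns `--batch-size` into the key `batch_size`, the resulting dict could be passed straight through as `nkTrain(**kwargs)` inside an `if __name__ == '__main__':` block.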


## Training locally

In [7]:
nkTrain()

Using cpu for training.
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting data/MNIST/raw/train-images-idx3-ubyte.gz to data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to data/MNIST/raw/train-labels-idx1-ubyte.gz
Extracting data/MNIST/raw/train-labels-idx1-ubyte.gz to data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to data/MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting data/MNIST/raw/t10k-images-idx3-ubyte.gz to data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to data/MNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting data/MNIST/raw/t10k-labels-idx1-ubyte.gz to data/MNIST/raw
Processing...
Done!
Time taken for epoch #1: 19.26s

Test set: Average loss: 0.2063, Accuracy: 9387/10000 (94%)


## Build Docker images and train using the container

In this block we set some Docker config. Fairing will use this information to package up the `train_and_test` function i

In [14]:
!sh ~/nkodedemo01/nkode/remote_train.sh

About to train job setup...
[W 201020 02:37:39 function:49] The FunctionPreProcessor is optimized for using in a notebook or IPython environment. For it to work, the python version should be same for both local python and the python in the docker. Please look at alternatives like BasePreprocessor or FullNotebookPreprocessor.
[W 201020 02:37:39 tasks:62] Using builder: <class 'kubeflow.fairing.builders.cluster.cluster.ClusterBuilder'>
about to submit job
[I 201020 02:37:39 tasks:66] Building the docker image.
[I 201020 02:37:39 cluster:46] Building image using cluster builder.
[W 201020 02:37:39 base:94] /usr/local/lib/python3.6/dist-packages/kubeflow/fairing/__init__.py already exists in Fairing context, skipping...
[I 201020 02:37:39 base:107] Creating docker context: /tmp/fairing_context_7606uv9l
[W 201020 02:37:39 base:94] /usr/local/lib/python3.6/dist-packages/kubeflow/fairing/__init__.py already exists in Fairing context, ski