# Accelerate Deep Learning Model training with Watson Machine Learning Accelerator


### Notebook created by Kelvin Lui,  Xue Yin Zhuang in Jan 2021

### In this notebook, you will learn how to use the Watson Machine Learning Accelerator (WML-A) API to accelerate deep learning model training on GPUs.

This notebook uses the PyTorch ResNet18 model in a basic computer vision image classification example. The model is trained on both CPU and GPU to demonstrate that training on GPU hardware delivers results faster.


This notebook covers the following sections:

1. [Setting up required packages](#setup)<br>

2. [Configuring your environment and project details](#configure)<br>

3. [Training the model on CPU](#cpu)<br>

4. [Training the model on GPU with Watson Machine Learning Accelerator](#gpu)<br>

<a id = "setup"></a>
## Step 1: Setting up required packages


#### First, install torchvision, which is required to train the PyTorch ResNet18 model on CPU.
Note: You will need to create a custom environment with 16 vCPU and 32 GB of memory.

In [1]:
! pip install torchvision



In [2]:
import torchvision

#### Next, define helper methods:

In [3]:
import base64
import json
import os
import pprint
import tarfile
import tempfile
import time
import urllib

import pandas as pd
import requests
from IPython.display import display, FileLink, clear_output
from matplotlib import pyplot as plt
from requests.packages.urllib3.exceptions import InsecureRequestWarning

# Requests in this notebook are sent with verify=False; silence the resulting warnings
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

%pylab inline


def query_job_status(job_id, refresh_rate=3):
    """Poll the job until it leaves the active states, then return the last response.

    job_id is the JSON dictionary returned when the job was submitted.
    Relies on the globals dl_rest_url, req, and commonHeaders that are set in Step 2.
    """
    execURL = dl_rest_url + '/execs/' + job_id['id']
    pp = pprint.PrettyPrinter(indent=2)
    pd.set_option('display.max_colwidth', 120)

    while True:
        res = req.get(execURL, headers=commonHeaders, verify=False)
        monitoring = pd.DataFrame(res.json(), index=[0])
        clear_output()
        print("Refreshing every {} seconds".format(refresh_rate))
        display(monitoring)
        pp.pprint(res.json())
        # Stop polling once the job is no longer pending, submitted, or running
        if res.json()['state'] not in ['PENDING_CRD_SCHEDULER', 'SUBMITTED', 'RUNNING']:
            break
        time.sleep(refresh_rate)
    return res

def query_executor_stdout_log(job_id):
    """Print the last 1000 lines of stdout from executor 1 of the job."""
    # Driver logs follow a similar pattern, for example:
    # 'https://{}/platform/rest/deeplearning/v1/scheduler/applications/wmla-267/driver/logs/stderr?lastlines=10'.format(hostname)
    execURL = dl_rest_url + '/scheduler/applications/' + job_id['id'] + '/executor/1/logs/stdout?lastlines=1000'
    commonHeaders2 = {'accept': 'text/plain', 'X-Auth-Token': access_token}
    print(execURL)
    res = req.get(execURL, headers=commonHeaders2, verify=False)
    print(res.text)


def query_train_metric(job_id):
    """Print the training log of the job, which contains the per-batch loss and accuracy."""
    execURL = dl_rest_url + '/execs/' + job_id['id'] + '/log'
    commonHeaders2 = {'accept': 'text/plain', 'X-Auth-Token': access_token}
    print(execURL)
    res = req.get(execURL, headers=commonHeaders2, verify=False)
    print(res.text)


def download_trained_model(job_id):
    """Download the result archive of the job into the project data assets."""
    commonHeaders3 = {'accept': 'application/octet-stream', 'X-Auth-Token': access_token}
    # Use the job_id argument rather than the global response object r
    execURL = dl_rest_url + '/execs/' + job_id['id'] + '/result'
    res = req.get(execURL, headers=commonHeaders3, verify=False, stream=True)
    print(execURL)

    # save result file
    tmpfile = '/project_data/data_asset/' + job_id['id'] + '.zip'
    print('Save model: ', tmpfile)
    with open(tmpfile, 'wb') as f:
        f.write(res.content)

def make_tarfile(output_filename, source_dir):
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))

Populating the interactive namespace from numpy and matplotlib
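
The polling logic inside `query_job_status` can be tried without a cluster. The following is a minimal sketch, assuming a hypothetical `fetch_state` callable that stands in for the REST call; the set of active states is taken from the helper above, while `FINISHED` is only an illustrative terminal state.

```python
import time

# States in which query_job_status keeps polling (copied from the helper above)
ACTIVE_STATES = {'PENDING_CRD_SCHEDULER', 'SUBMITTED', 'RUNNING'}


def poll_until_done(fetch_state, refresh_rate=3, sleep=time.sleep):
    """Call fetch_state() until it returns a state outside ACTIVE_STATES.

    fetch_state and sleep are hypothetical injection points for this sketch;
    in the notebook the state comes from the /execs/<id> REST endpoint.
    """
    while True:
        state = fetch_state()
        if state not in ACTIVE_STATES:
            return state
        sleep(refresh_rate)


# Replay a canned sequence of states instead of querying a live cluster
canned_states = iter(['SUBMITTED', 'RUNNING', 'RUNNING', 'FINISHED'])
final_state = poll_until_done(lambda: next(canned_states), sleep=lambda seconds: None)
print(final_state)
```

Passing the sleep function as a parameter keeps the sketch fast to run; the notebook helper sleeps for `refresh_rate` seconds between requests.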


<a id = "configure"></a>
## Step 2: Configuring your environment and project details

To set up your project details, provide your credentials in this cell. You must include your cluster URL, username, and password.

In [6]:
hostname='wmla-console-wmla-ns.apps.cpolab.ibm.com'  # enter your Watson Machine Learning Accelerator host name
login='wendy:Passw0rd2021'  # enter your credentials in the form username:password
es = base64.b64encode(login.encode('utf-8')).decode("utf-8")
print(es)
commonHeaders={'Authorization': 'Basic '+es}
req = requests.Session()
auth_url = 'https://{}/auth/v1/logon'.format(hostname)
print(auth_url)
a=requests.get(auth_url,headers=commonHeaders, verify=False)
access_token=a.json()['accessToken']
print(access_token)

d2VuZHk6UGFzc3cwcmQyMDIx
https://wmla-console-wmla-ns.apps.cpolab.ibm.com/auth/v1/logon
eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IkViSEpBdDhiVTNIUW1HeGw5ZmVJMXRuMWhkQTBSekV6TUNIZ3lsOTdfMVUifQ.eyJ1c2VybmFtZSI6IndlbmR5Iiwic3ViIjoid2VuZHkiLCJpc3MiOiJLTk9YU1NPIiwiYXVkIjoiRFNYIiwicm9sZSI6IkFkbWluIiwicGVybWlzc2lvbnMiOlsiYWRtaW5pc3RyYXRvciIsImNhbl9wcm92aXNpb24iLCJtYW5hZ2VfY2F0YWxvZyIsImNyZWF0ZV9wcm9qZWN0IiwiY3JlYXRlX3NwYWNlIiwibWFuYWdlX3F1YWxpdHkiLCJtYW5hZ2VfaW5mb3JtYXRpb25fYXNzZXRzIiwibWFuYWdlX2Rpc2NvdmVyeSIsIm1hbmFnZV9tZXRhZGF0YV9pbXBvcnQiLCJhdXRob3JfZ292ZXJuYW5jZV9hcnRpZmFjdHMiLCJtYW5hZ2VfZ292ZXJuYW5jZV93b3JrZmxvdyIsInZpZXdfZ292ZXJuYW5jZV9hcnRpZmFjdHMiLCJtYW5hZ2VfY2F0ZWdvcmllcyIsImFjY2Vzc19jYXRhbG9nIiwidmlld19xdWFsaXR5Il0sImdyb3VwcyI6WzEwMDAwXSwidWlkIjoiMTAwMDMzMTAwMiIsImF1dGhlbnRpY2F0b3IiOiJkZWZhdWx0IiwiZGlzcGxheV9uYW1lIjoiV2VuZHkgV2FuZyIsImNhbl9yZWZyZXNoX3VudGlsIjoxNjM4NDI5NzM1MjM4LCJjc3JmX3Rva2VuIjoiNWVmZThkYzcwNTYyMmU1ZDM5NTg2ZWU5MTE2MTlhOGYiLCJzZXNzaW9uX2lkIjoiZDVhYWM4ZTQtYjhjNC00Z

In [7]:
dl_rest_url = 'https://{}/platform/rest/deeplearning/v1'.format(hostname)
commonHeaders={'accept': 'application/json', 'X-Auth-Token': access_token}
req = requests.Session()
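
The logon request uses HTTP Basic authentication: the `username:password` string is Base64-encoded and sent in the `Authorization` header. As a small sketch, the header can be built by a helper (the name `basic_auth_header` is hypothetical and not part of the WML-A API):

```python
import base64


def basic_auth_header(username, password):
    """Return the Authorization header for HTTP Basic authentication."""
    credentials = "{}:{}".format(username, password)
    token = base64.b64encode(credentials.encode("utf-8")).decode("utf-8")
    return {"Authorization": "Basic " + token}


# Same sample credentials as in the login cell
header = basic_auth_header("wendy", "Passw0rd2021")
print(header["Authorization"])
```

Base64 is an encoding, not encryption, so the printed value should be treated like the password itself. In practice you might read the password with `getpass.getpass()` instead of hard-coding it.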

<a id = "cpu"></a>
## Step 3: Training the model on CPU

#### Prepare the model files for running on CPU:

In [8]:
import os
DATA_DIR='/project_data/data_asset/pytorch-resnet/data'
RESULT_DIR='/project_data/data_asset/pytorch-resnet/result'
model_dir = '/project_data/data_asset/pytorch-resnet/resnet'
model_main = 'main.py'
model_resnet = 'resnet.py'

os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(RESULT_DIR, exist_ok=True)
os.makedirs(model_dir, exist_ok=True)

In [9]:
%%writefile {model_dir}/{model_main}

#!/usr/bin/env python
# coding: utf-8

# # Image Classification Using PyTorch Resnet with Watson Machine Learning Accelerator Notebook
# This asset walks through a basic computer vision image classification example using the notebook functionality within Watson Machine Learning Accelerator. You will learn how to accelerate training of the PyTorch ResNet model on the CIFAR-10 dataset.
#
# Please refer to [Resnet Introduction](https://arxiv.org/abs/1512.03385) for more details.



from __future__ import print_function
import argparse
import glob
import os
import sys
import time

import numpy
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms
import torchvision.models as models

log_interval = 10

seed = 1
use_cuda = False
completed_batch =0
completed_test_batch =0
criterion = nn.CrossEntropyLoss()


parser = argparse.ArgumentParser(description='PyTorch ResNet18 CIFAR-10 Example')
parser.add_argument('--batch-size', type=int, default=32, metavar='N',
                    help='input batch size for training (default: 32)')
parser.add_argument('--epochs', type=int, default=5, metavar='N',
                    help='number of epochs to train (default: 5)')
parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                    help='learning rate (default: 0.01)')
parser.add_argument('--cuda', action='store_true', default=False,
                    help='enables CUDA training')
args = parser.parse_args()
print(args)


# ## Report whether CUDA is used
print("Use cuda: ", use_cuda)

# ## Download the Cifar10 dataset
# If you set download=True, the [CIFAR-10 python version](https://www.cs.toronto.edu/~kriz/cifar.html) dataset is automatically downloaded and used by the Notebook.
# If you want to use a different dataset or have previously downloaded a dataset,
# set download=False and specify the directory that contains the dataset

# An example of downloading the CIFAR-10 dataset manually:
# > mkdir ${DATA_DIR}/cifar10
# > cd ${DATA_DIR}/cifar10
# > wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
# > tar -zxf cifar-10-python.tar.gz


DATA_DIR='/project_data/data_asset/pytorch-resnet/data'
RESULT_DIR='/project_data/data_asset/pytorch-resnet/result'
model_dir = f'/project_data/data_asset/pytorch-resnet/resnet' 

def getDatasets():
    train_data_dir = DATA_DIR + '/cifar10'
    test_data_dir = DATA_DIR + '/cifar10'

    transform_train = transforms.Compose([
        transforms.Resize(224),
        #transforms.RandomCrop(self.resolution, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])

    transform_test = transforms.Compose([
        transforms.Resize(224),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])

    return (torchvision.datasets.CIFAR10(root=train_data_dir, train=True, download=True, transform = transform_train),
            torchvision.datasets.CIFAR10(root=test_data_dir, train=False, download=True, transform = transform_test)
            )

torch.manual_seed(seed)
device = torch.device("cuda" if use_cuda else "cpu")
print ('device:', device)

kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}

train_dataset, test_dataset = getDatasets()

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True, **kwargs)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=args.batch_size, shuffle=True, **kwargs)


# ## Implement the customized train and test loop


def train(model, device, train_loader, optimizer, epoch):
    global completed_batch
    train_loss = 0
    correct = 0
    total = 0
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        _, predicted = output.max(1)
        total += target.size(0)
        correct += predicted.eq(target).sum().item()

        completed_batch += 1

        print ('Train - batches : {}, average loss: {:.4f}, accuracy: {}/{} ({:.0f}%)'.format(
           completed_batch, train_loss/(batch_idx+1), correct, total, 100.*correct/total))


def test(model, device, test_loader, epoch):
    global completed_test_batch
    global completed_batch
    model.eval()
    test_loss = 0
    correct = 0
    total = 0
    completed_test_batch = completed_batch -  len(test_loader)
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(test_loader):
            data, target = data.to(device), target.to(device)
            output = model(data)

            loss = criterion(output, target)

            test_loss += loss.item() # sum up batch loss
            _, pred = output.max(1) # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()
            total += target.size(0)

            completed_test_batch += 1

    # criterion returns the mean loss of each batch, so average over the number of batches
    test_loss /= len(test_loader)
    test_acc = 100. * correct / len(test_loader.dataset)
    # Output test info for per epoch
    print('Test - batches: {}, average loss: {:.4f}, accuracy: {}/{} ({:.0f}%)\n'.format(
        completed_batch, test_loss, correct, len(test_loader.dataset), test_acc))


# ## Create the Resnet18 model
print("Use cuda: ", use_cuda)


model_type = "resnet18"
print("=> using pytorch build-in model '{}'".format(model_type))

# models.resnet18() builds the architecture with randomly initialized weights
model = models.resnet18()
#model = models.resnet50()


# The PyTorch built-in resnet18 model is designed for the ImageNet dataset, which has 1000 classes.
# To use it with the cifar10 dataset, replace the last fully-connected layer so that its output size is 10

for param in model.parameters():
    param.requires_grad = True  # set False if you only want to train the last layer using pretrained model

# Replace the last fully-connected layer (once, outside the loop)
# Parameters of newly constructed modules have requires_grad=True by default
model.fc = nn.Linear(512, 10)


# (Optional) To use the WML-A pretrained resnet18 model for cifar10, load the model weight file. The pretrained model weight file can be downloaded [here](https://?).

weightfile = DATA_DIR + "/checkpoint/model_epoch_final.pth"
if os.path.exists(weightfile):
    print ("Initial weight file is " + weightfile)
    model.load_state_dict(torch.load(weightfile, map_location=lambda storage, loc: storage))


# ## Run the model trainings
#print(model)
model.to(device)
optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=0, dampening=0, weight_decay=0, nesterov=False)
epochs = args.epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, 30, 0.1, last_epoch=-1)

# Output total iterations info for deep learning insights
print("Total iterations: %s" % (len(train_loader) * epochs))

#print("RESULT_DIR: " + os.getenv("RESULT_DIR"))
#RESULT_DIR = os.getenv("RESULT_DIR")
os.makedirs(RESULT_DIR, exist_ok=True)

for epoch in range(1, epochs+1):
    print("\nRunning epoch %s ... It might take several minutes for each epoch to run." % epoch)
    train(model, device, train_loader, optimizer, epoch)
    test(model, device, test_loader, epoch)
    scheduler.step()

    torch.save(model.state_dict(),  RESULT_DIR + "/model_epoch_%d.pth"%(epoch))

torch.save(model.state_dict(), RESULT_DIR + "/model_epoch_final.pth")


Writing /project_data/data_asset/pytorch-resnet/resnet/main.py
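
The script reads its hyperparameters from the command line. As a rough sketch of how those options behave, the parser below mirrors the CPU script's options and defaults (the function name `build_parser` is introduced here only for illustration). Note that argparse turns the dash in `--batch-size` into an underscore, so the value is read as `args.batch_size`.

```python
import argparse


def build_parser():
    """Recreate the command-line options of the CPU training script."""
    parser = argparse.ArgumentParser(description='CPU training options (sketch)')
    parser.add_argument('--batch-size', type=int, default=32)
    parser.add_argument('--epochs', type=int, default=5)
    parser.add_argument('--lr', type=float, default=0.01)
    # store_true flags are False unless the flag is present on the command line
    parser.add_argument('--cuda', action='store_true', default=False)
    return parser


# Equivalent to running: python main.py --epochs 1
args = build_parser().parse_args(['--epochs', '1'])
print(args.batch_size, args.epochs, args.lr, args.cuda)
```

Options that are not passed keep their defaults, which is why the CPU run in the next cell trains with a batch size of 32.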


## Training results on CPU

#### Training was run from a Cloud Pak for Data Notebook utilizing a CPU kernel. 


In the custom environment that was created with **16 vCPU** and **32 GB**, it took **1560 seconds** (or approximately **26 minutes**) to complete one epoch of training.


In [10]:
import datetime
starttime = datetime.datetime.now()

! python /project_data/data_asset/pytorch-resnet/resnet/main.py --epochs 1 

endtime = datetime.datetime.now()
print("Training cost: ", (endtime - starttime).seconds, " seconds.")

Namespace(batch_size=32, cuda=False, epochs=1, lr=0.01)
Use cuda:  False
device: cpu
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /project_data/data_asset/pytorch-resnet/data/cifar10/cifar-10-python.tar.gz
100%|███████████████████████▉| 170237952/170498071 [00:45<00:00, 2229642.38it/s]Extracting /project_data/data_asset/pytorch-resnet/data/cifar10/cifar-10-python.tar.gz to /project_data/data_asset/pytorch-resnet/data/cifar10
Files already downloaded and verified
Use cuda:  False
=> using pytorch build-in model 'resnet18'
Total iterations: 1563

Running epoch 1 ... It might take several minutes for each epoch to run.
Train - batches : 1, average loss: 2.4301, accuracy: 2/32 (6%)
Train - batches : 2, average loss: 2.4637, accuracy: 6/64 (9%)
170500096it [01:00, 2229642.38it/s]                                             Train - batches : 3, average loss: 2.4328, accuracy: 9/96 (9%)
Train - batches : 4, average loss: 2.4104, accuracy: 12/128 (9%)
Train - batches 
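
The `Total iterations` value in the log is the number of batches per epoch multiplied by the number of epochs. A quick check, assuming the standard CIFAR-10 training split of 50,000 images and a data loader that keeps the last partial batch (the PyTorch default):

```python
import math


def iterations_per_epoch(num_samples, batch_size):
    """Number of batches a DataLoader yields per epoch when drop_last is False."""
    return math.ceil(num_samples / batch_size)


CIFAR10_TRAIN_SAMPLES = 50000  # size of the CIFAR-10 training split

cpu_iterations = iterations_per_epoch(CIFAR10_TRAIN_SAMPLES, 32)
gpu_iterations = iterations_per_epoch(CIFAR10_TRAIN_SAMPLES, 128)
print(cpu_iterations, gpu_iterations)
```

With a batch size of 32 this gives the 1563 iterations reported for the CPU run; a batch size of 128 gives 391.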

<a id = "gpu"></a>
## Step 4: Training the model on GPU with Watson Machine Learning Accelerator

#### Prepare the model files for running on GPU:

In [11]:
import os
model_dir = f'/project_data/data_asset/pytorch-resnet/resnet-wmla' 
model_main = f'main.py'

os.makedirs(model_dir, exist_ok=True)

In [12]:
%%writefile {model_dir}/{model_main}
#!/usr/bin/env python
# coding: utf-8

# # Image Classification Using PyTorch Resnet with Watson Machine Learning Accelerator Notebook
# This asset walks through a basic computer vision image classification example using the notebook functionality within Watson Machine Learning Accelerator. You will learn how to accelerate training of the PyTorch ResNet model on the CIFAR-10 dataset.
#
# Please refer to [Resnet Introduction](https://arxiv.org/abs/1512.03385) for more details.



from __future__ import print_function
import argparse
import glob
import os
import sys
import time

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms
import torchvision.models as models

log_interval = 10

seed = 1
use_cuda = False
completed_batch =0
completed_test_batch =0
criterion = nn.CrossEntropyLoss()


parser = argparse.ArgumentParser(description='PyTorch ResNet18 CIFAR-10 Example')
parser.add_argument('--batch-size', type=int, default=128, metavar='N',
                    help='input batch size for training (default: 128)')
parser.add_argument('--epochs', type=int, default=1, metavar='N',
                    help='number of epochs to train (default: 1)')
parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                    help='learning rate (default: 0.01)')
# store_true is meant for optional flags, so the name needs the leading dashes.
# The default is True, so CUDA training is always enabled in this script.
parser.add_argument('--cuda', action='store_true', default=True,
                    help='enables CUDA training (enabled by default)')
args = parser.parse_args()
print(args)


# ## Enable CUDA according to the command-line arguments
use_cuda = args.cuda
print("Use cuda: ", use_cuda)

# ## Download the Cifar10 dataset
# If you set download=True, the [CIFAR-10 python version](https://www.cs.toronto.edu/~kriz/cifar.html) dataset is automatically downloaded and used by the Notebook.
# If you want to use a different dataset or have previously downloaded a dataset,
# set download=False and specify the directory that contains the dataset

# An example of downloading the CIFAR-10 dataset manually:
# > mkdir ${DATA_DIR}/cifar10
# > cd ${DATA_DIR}/cifar10
# > wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
# > tar -zxf cifar-10-python.tar.gz

print("DATA_DIR: " + os.getenv("DATA_DIR"))
DATA_DIR = os.getenv("DATA_DIR")

def getDatasets():
    train_data_dir = DATA_DIR + "/cifar10"
    test_data_dir = DATA_DIR + "/cifar10"

    transform_train = transforms.Compose([
        transforms.Resize(224),
        #transforms.RandomCrop(self.resolution, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])

    transform_test = transforms.Compose([
        transforms.Resize(224),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])

    return (torchvision.datasets.CIFAR10(root=train_data_dir, train=True, download=True, transform = transform_train),
            torchvision.datasets.CIFAR10(root=test_data_dir, train=False, download=True, transform = transform_test)
            )

torch.manual_seed(seed)
device = torch.device("cuda" if use_cuda else "cpu")

kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}

train_dataset, test_dataset = getDatasets()

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True, **kwargs)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=args.batch_size, shuffle=True, **kwargs)


# ## Implement the customized train and test loop


def train(model, device, train_loader, optimizer, epoch):
    global completed_batch
    train_loss = 0
    correct = 0
    total = 0
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        _, predicted = output.max(1)
        total += target.size(0)
        correct += predicted.eq(target).sum().item()

        completed_batch += 1

        print ('Train - batches : {}, average loss: {:.4f}, accuracy: {}/{} ({:.0f}%)'.format(
           completed_batch, train_loss/(batch_idx+1), correct, total, 100.*correct/total))


def test(model, device, test_loader, epoch):
    global completed_test_batch
    global completed_batch
    model.eval()
    test_loss = 0
    correct = 0
    total = 0
    completed_test_batch = completed_batch -  len(test_loader)
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(test_loader):
            data, target = data.to(device), target.to(device)
            output = model(data)

            loss = criterion(output, target)

            test_loss += loss.item() # sum up batch loss
            _, pred = output.max(1) # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()
            total += target.size(0)

            completed_test_batch += 1

    # criterion returns the mean loss of each batch, so average over the number of batches
    test_loss /= len(test_loader)
    test_acc = 100. * correct / len(test_loader.dataset)
    # Output test info for per epoch
    print('Test - batches: {}, average loss: {:.4f}, accuracy: {}/{} ({:.0f}%)\n'.format(
        completed_batch, test_loss, correct, len(test_loader.dataset), test_acc))


# ## Create the Resnet18 model

model_type = "resnet18"
#model_type = "resnet50"
print("=> using pytorch build-in model '{}'".format(model_type))

# models.resnet18() builds the architecture with randomly initialized weights
model = models.resnet18()


# The PyTorch built-in resnet18 model is designed for the ImageNet dataset, which has 1000 classes.
# To use it with the cifar10 dataset, replace the last fully-connected layer so that its output size is 10

for param in model.parameters():
    param.requires_grad = True  # set False if you only want to train the last layer using pretrained model

# Replace the last fully-connected layer (once, outside the loop)
# Parameters of newly constructed modules have requires_grad=True by default
model.fc = nn.Linear(512, 10)


# (Optional) To use the WML-A pretrained resnet18 model for cifar10, load the model weight file. The pretrained model weight file can be downloaded [here](https://?).

weightfile = DATA_DIR + "/checkpoint/model_epoch_final.pth"
if os.path.exists(weightfile):
    print ("Initial weight file is " + weightfile)
    model.load_state_dict(torch.load(weightfile, map_location=lambda storage, loc: storage))


# ## Run the model trainings
model.to(device)
optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=0, dampening=0, weight_decay=0, nesterov=False)
epochs = args.epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, 30, 0.1, last_epoch=-1)

# Output total iterations info for deep learning insights
print("Total iterations: %s" % (len(train_loader) * epochs))

print("RESULT_DIR: " + os.getenv("RESULT_DIR"))
RESULT_DIR = os.getenv("RESULT_DIR")
# Checkpoints are written to RESULT_DIR/model, so make sure that subdirectory exists
os.makedirs(RESULT_DIR + "/model", exist_ok=True)

for epoch in range(1, epochs+1):
    print("\nRunning epoch %s ... It might take several minutes for each epoch to run." % epoch)
    train(model, device, train_loader, optimizer, epoch)
    test(model, device, test_loader, epoch)
    scheduler.step()

    torch.save(model.state_dict(),  RESULT_DIR + "/model/model_epoch_%d.pth"%(epoch))

torch.save(model.state_dict(), RESULT_DIR + "/model/model_epoch_final.pth")

Writing /project_data/data_asset/pytorch-resnet/resnet-wmla/main.py
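
The test loop accumulates `loss.item()` once per batch. With its default settings, `nn.CrossEntropyLoss` returns the mean loss over the batch, so the running total is a sum of batch means. The pure-Python sketch below, using made-up per-sample losses, shows which denominator recovers the average loss per sample:

```python
def chunk(values, size):
    """Split a list into consecutive batches of the given size."""
    return [values[i:i + size] for i in range(0, len(values), size)]


# Made-up per-sample losses, for illustration only
per_sample_losses = [2.0, 4.0, 1.0, 3.0, 6.0, 2.0, 5.0, 1.0]
batch_size = 4

# What a mean-reduced criterion reports for each batch
batch_means = [sum(batch) / len(batch) for batch in chunk(per_sample_losses, batch_size)]
running_total = sum(batch_means)

true_mean = sum(per_sample_losses) / len(per_sample_losses)
avg_over_batches = running_total / len(batch_means)
avg_over_samples = running_total / len(per_sample_losses)

print(true_mean, avg_over_batches, avg_over_samples)
```

Dividing by the number of batches reproduces the true mean when all batches have the same size. Dividing the same total by the number of samples shrinks the reported value by a factor of the batch size.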


## Training results on GPU

#### Training was run from a Cloud Pak for Data Notebook utilizing a GPU kernel. 


In the custom environment that was created with **16 vCPU** and **32 GB**, it took **147 seconds** (or approximately **2.5 minutes**) to complete one epoch of training.
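
The two timings reported in this notebook can be turned into a speedup factor with a line of arithmetic. The sketch below only restates the numbers quoted above; keep in mind that the CPU run used a batch size of 32 and the GPU run a batch size of 128, so the comparison is indicative rather than strictly like for like.

```python
cpu_seconds = 1560  # one epoch on CPU, as reported in Step 3
gpu_seconds = 147   # one epoch on GPU, as reported above

speedup = cpu_seconds / gpu_seconds
print("CPU: {:.1f} min, GPU: {:.2f} min, speedup: {:.1f}x".format(
    cpu_seconds / 60, gpu_seconds / 60, speedup))
```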


In [18]:
dl_rest_url

'https://wmla-console-wmla-ns.apps.cpolab.ibm.com/platform/rest/deeplearning/v1'

In [15]:
files = {'file': open('/project_data/data_asset/pytorch-resnet/resnet-wmla/main.py', 'rb')}

# This cluster accepts at most one worker per job. Submitting with --numWorker 2 fails with
# "Error 500: ... Maximum number of workers is 1", so request a single worker.
args = '--exec-start PyTorch --cs-datastore-meta type=fs \
                     --numWorker 1 \
                     --workerDeviceNum 1 \
                     --model-main main.py --epochs 1'


In [16]:
starttime = datetime.datetime.now()

r = requests.post(dl_rest_url+'/execs?args='+args, files=files,
                  headers=commonHeaders, verify=False)
if not r.ok:
    # A failed submission returns an error message instead of JSON, so do not call r.json()
    print('submit job failed: code=%s, %s'%(r.status_code, r.content))
else:
    job_status = query_job_status(r.json(),refresh_rate=5)

endtime = datetime.datetime.now()

print("\nTraining cost: ", (endtime - starttime).seconds, " seconds.")
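
The submission arguments are passed as a single string in the `args` query parameter. A minimal sketch of assembling that string from parameters is shown below; the helper name `build_exec_args` and its keyword names are hypothetical, and only the flags already used in this notebook appear. It defaults to a single worker, because this cluster reports a maximum of one worker per job.

```python
from urllib.parse import quote


def build_exec_args(model_main, num_workers=1, devices_per_worker=1, epochs=1):
    """Assemble the args string for the /execs endpoint from the flags used in this notebook."""
    parts = [
        '--exec-start', 'PyTorch',
        '--cs-datastore-meta', 'type=fs',
        '--numWorker', str(num_workers),
        '--workerDeviceNum', str(devices_per_worker),
        '--model-main', model_main,
        '--epochs', str(epochs),
    ]
    return ' '.join(parts)


exec_args = build_exec_args('main.py')
print(exec_args)
# Percent-encode the string before appending it to the URL by hand
print(quote(exec_args))
```

Joining a list avoids the long runs of spaces that a backslash-continued string literal carries into the URL.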

## Training metrics and logs

#### Retrieve and display the model training metrics:

In [23]:
query_train_metric(r.json())

https://wmla-console-wmla.apps.cpd35-beta.cpolab.ibm.com/platform/rest/deeplearning/v1/execs/wmla-327/log
Namespace(batch_size=128, cuda=True, epochs=1, lr=0.01)
Use cuda:  True
DATA_DIR: /gpfs/mydatafs
Files already downloaded and verified
Files already downloaded and verified
=> using pytorch build-in model 'resnet18'
Total iterations: 391
RESULT_DIR: /gpfs/myresultfs/dse_user/batchworkdir/wmla-327

Running epoch 1 ... It might take several minutes for each epoch to run.
Train - batches : 1, average loss: 2.4147, accuracy: 15/128 (12%)
Train - batches : 2, average loss: 2.3836, accuracy: 27/256 (11%)
Train - batches : 3, average loss: 2.3746, accuracy: 40/384 (10%)
Train - batches : 4, average loss: 2.3545, accuracy: 54/512 (11%)
Train - batches : 5, average loss: 2.3369, accuracy: 71/640 (11%)
Train - batches : 6, average loss: 2.3242, accuracy: 95/768 (12%)
Train - batches : 7, average loss: 2.3176, accuracy: 110/896 (12%)
Train - batches : 8, average loss: 2.3134, accuracy: 127/10
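
Because the training log has a fixed line format, it can be parsed into records for plotting or further analysis. The following is a sketch that assumes the `Train - batches : ...` format printed by `main.py`; the function name `parse_train_metrics` is introduced here for illustration.

```python
import re

# Matches lines such as:
# Train - batches : 1, average loss: 2.4147, accuracy: 15/128 (12%)
TRAIN_LINE = re.compile(
    r"Train - batches : (\d+), average loss: ([\d.]+), accuracy: (\d+)/(\d+) \((\d+)%\)"
)


def parse_train_metrics(log_text):
    """Extract one record per complete training line; incomplete lines are skipped."""
    records = []
    for match in TRAIN_LINE.finditer(log_text):
        batch, loss, correct, total, _percent = match.groups()
        records.append({
            'batch': int(batch),
            'loss': float(loss),
            'correct': int(correct),
            'total': int(total),
        })
    return records


sample_log = """Train - batches : 1, average loss: 2.4147, accuracy: 15/128 (12%)
Train - batches : 2, average loss: 2.3836, accuracy: 27/256 (11%)
Train - batches : 8, average loss: 2.3134, accuracy: 127/10"""

records = parse_train_metrics(sample_log)
print(records)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(records)` to plot the loss curve.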

#### Retrieve and display the model training logs:

In [24]:
query_executor_stdout_log(r.json())

https://wmla-console-wmla.apps.cpd35-beta.cpolab.ibm.com/platform/rest/deeplearning/v1/scheduler/applications/wmla-327/executor/1/logs/stdout?lastlines=1000
*Task <1> SubProcess*: 2021-02-10 16:39:33.321944 39 INFO Create log direcotry /wmla-logging/dli/wmla-327/dli/./app.wmla-327-task12n-jbbhg
*Task <1> SubProcess*: 2021-02-10 16:39:33.333660 39 INFO Running on kubernetes.
*Task <1> SubProcess*: 2021-02-10 16:39:33.346049 39 INFO List GPUs
*Task <1> SubProcess*: Wed Feb 10 16:39:33 2021       
*Task <1> SubProcess*: +-----------------------------------------------------------------------------+
*Task <1> SubProcess*: | NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
*Task <1> SubProcess*: |-------------------------------+----------------------+----------------------+
*Task <1> SubProcess*: | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
*Task <1> SubProcess*: | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util

## Download trained model from Watson Machine Learning Accelerator 

In [34]:
download_trained_model(r.json())

https://wmla-console-wmla.apps.cpd35-beta.cpolab.ibm.com/platform/rest/deeplearning/v1/execs/wmla-329/result
Save model:  /project_data/data_asset/wmla-329.zip
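
After the download, it can be useful to confirm that the file is a readable archive and to see what it contains. The sketch below assumes the result is a valid zip archive, as the `.zip` extension chosen by `download_trained_model` suggests; the demo builds a throwaway archive with a placeholder entry so that it runs without a cluster.

```python
import os
import tempfile
import zipfile


def list_archive(path):
    """Return the names of the files stored in a zip archive."""
    if not zipfile.is_zipfile(path):
        raise ValueError("{} is not a valid zip archive".format(path))
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


# Demo with a throwaway archive; the entry name is a placeholder, not real job output
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.zip')
    with zipfile.ZipFile(demo_path, 'w') as archive:
        archive.writestr('model/model_epoch_final.pth', b'placeholder')
    names = list_archive(demo_path)

print(names)
```

For a real job, call `list_archive` with the path printed by `download_trained_model`, and use `zipfile.ZipFile(path).extractall(target_dir)` to unpack it.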
