## Training SageMaker Models Using DGL with the PyTorch Backend
The **SageMaker Python SDK** makes it easy to train DGL models. In this example, we train a simple graph neural network using the [DMLC DGL API](https://github.com/dmlc/dgl.git) and the [Cora dataset](https://relational.fit.cvut.cz/dataset/CORA). The Cora dataset describes a citation network: it consists of 2708 scientific publications, each classified into one of seven classes, connected by 5429 citation links. The task is to train a node classification model on the Cora dataset.

### Setup
We first define a few variables that are used later in the example.

In [1]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.session import Session

# Setup session
sess = sagemaker.Session()

# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here if you wish.
bucket = sess.default_bucket()

# Location to put your custom code.
custom_code_upload_location = 'customcode'

# IAM execution role that gives SageMaker access to resources in your AWS account.
# We can use the SageMaker Python SDK to get the role from our notebook environment. 
role = get_execution_role()
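
The `custom_code_upload_location` prefix defined above is not used elsewhere in this example. As a minimal sketch, assuming you want the training script uploaded under that prefix in the session bucket, the prefix can be combined with the bucket name into an S3 URI. The helper name `build_s3_uri` and the bucket name `example-bucket` are hypothetical; the resulting URI could then be passed to the estimator's `code_location` argument.

```python
def build_s3_uri(bucket, prefix):
    """Join a bucket name and a key prefix into an s3:// URI."""
    # Strip stray slashes so the URI never contains '//' after the bucket.
    return 's3://{}/{}'.format(bucket.strip('/'), prefix.strip('/'))


# 'example-bucket' is a placeholder; in the notebook you would pass `bucket`
# and `custom_code_upload_location` instead.
code_location = build_s3_uri('example-bucket', 'customcode')
print(code_location)
```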

### The training script
The pytorch_gcn.py script provides all the code we need for training a SageMaker model. 

In [2]:
!cat pytorch_gcn.py

import torch
import torch.nn as nn
from dgl.nn.pytorch import GraphConv

import os
import time
import json
import argparse
import numpy as np
import torch.nn.functional as F
from dgl import DGLGraph
from dgl.data import register_data_args, load_data

# define GCN layer
class GCN(nn.Module):
    def __init__(self,
                 g,
                 in_feats,
                 n_hidden,
                 n_classes,
                 n_layers,
                 activation,
                 dropout):
        super(GCN, self).__init__()
        self.g = g
        self.layers = nn.ModuleList()
        # input layer
        self.layers.append(GraphConv(in_feats, n_hidden, activation=activation))
        # hidden layers
        for i in range(n_layers - 1):
            self.layers.append(GraphConv(n_hidden, n_hidden, activation=activation))
        # output layer
        self.layers.append(GraphConv(n_hidden, n_classes))
        self.dropout = nn.Dropout(p=dropout)
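
The listing above is cut off before the forward pass. To illustrate what a graph convolution computes, here is a minimal pure-Python sketch of one propagation step. It assumes the symmetric normalization from the GCN paper, where each neighbor's features are scaled by 1/sqrt(deg(i) * deg(j)), and it leaves out the learned weight matrix, bias, and activation. This is not DGL's implementation; the function name `gcn_propagate` and the toy graph are hypothetical.

```python
import math


def gcn_propagate(adj, feats):
    """One normalized neighbor-aggregation step: D^-1/2 * A * D^-1/2 * X.

    adj   -- dict mapping each node to the list of its neighbors
             (include the node itself to model a self-loop)
    feats -- dict mapping each node to its feature vector (list of floats)
    """
    deg = {node: len(neighbors) for node, neighbors in adj.items()}
    out = {}
    for node, neighbors in adj.items():
        agg = [0.0] * len(feats[node])
        for nb in neighbors:
            # Symmetric normalization by the degrees of both endpoints.
            scale = 1.0 / math.sqrt(deg[node] * deg[nb])
            for k, value in enumerate(feats[nb]):
                agg[k] += scale * value
        out[node] = agg
    return out


# Toy graph: two nodes joined by one edge, each with a self-loop.
adj = {0: [0, 1], 1: [0, 1]}
feats = {0: [1.0, 0.0], 1: [0.0, 1.0]}
print(gcn_propagate(adj, feats))
```

In a full layer, the aggregated features would then be multiplied by a weight matrix and passed through the activation, which is the part that `GraphConv` learns during training.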

### SageMaker's estimator class
The SageMaker Estimator allows us to run single-machine training in SageMaker, using CPU- or GPU-based instances.

When we create the estimator, we pass in the filename of our training script and the name of our IAM execution role. We also provide a few other parameters: train_instance_count and train_instance_type determine the number and type of SageMaker instances used for the training job, and hyperparameters is a dict of values that are passed to the training script. You can see how to access these values in the pytorch_gcn.py script above.
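
As a hedged sketch of how a training script can read these values: in script mode, SageMaker passes each entry of the hyperparameters dict to the entry point as a command-line argument (for example `--dataset cora`), so the script can parse them with `argparse`. The function name `parse_hyperparameters` and the `--n-epochs` option with its default are hypothetical and only illustrate a second hyperparameter; the options in the actual script may differ.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse hyperparameters passed on the command line."""
    parser = argparse.ArgumentParser()
    # Matches params['dataset'] = 'cora' in the estimator's hyperparameters.
    parser.add_argument('--dataset', type=str, default='cora')
    # Hypothetical extra hyperparameter, shown only for illustration.
    parser.add_argument('--n-epochs', type=int, default=200)
    # parse_known_args ignores any extra arguments the script does not define.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_hyperparameters(['--dataset', 'cora', '--n-epochs', '50'])
print(args.dataset, args.n_epochs)
```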

Here we can use the official Docker image for this example. See https://github.com/aws/sagemaker-pytorch-container for more information.


In [5]:
from sagemaker.pytorch import PyTorch

CODE_PATH = 'pytorch_gcn.py'
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name

docker_name = 'beta-pytorch-training'
docker_tag = '1.3.0-py3-cpu-build'
image = '{}.dkr.ecr.{}.amazonaws.com/{}:{}'.format(account, region, docker_name, docker_tag)
print(image)

params = {}
params['dataset'] = 'cora'
estimator = PyTorch(entry_point=CODE_PATH,
                    role=role,
                    train_instance_count=1,
                    train_instance_type='ml.c4.xlarge',
                    image_name=image,
                    hyperparameters=params,
                    sagemaker_session=sess)

No framework_version specified, defaulting to version 0.4.


397262719838.dkr.ecr.us-east-2.amazonaws.com/beta-pytorch-training:1.3.0-py3-cpu-build
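
The cell above uses the argument names from version 1 of the SageMaker Python SDK. In version 2 of the SDK, `train_instance_count`, `train_instance_type`, and `image_name` were renamed to `instance_count`, `instance_type`, and `image_uri`. The following sketch, with a hypothetical helper name `to_v2_kwargs`, shows one way to translate a keyword dictionary; treat it as an illustration rather than an official migration tool.

```python
# Renames between SageMaker Python SDK v1 and v2 estimator arguments.
V1_TO_V2 = {
    'train_instance_count': 'instance_count',
    'train_instance_type': 'instance_type',
    'image_name': 'image_uri',
}


def to_v2_kwargs(v1_kwargs):
    """Return a copy of v1_kwargs with the renamed keys translated."""
    return {V1_TO_V2.get(key, key): value for key, value in v1_kwargs.items()}


v1_kwargs = {
    'entry_point': 'pytorch_gcn.py',
    'train_instance_count': 1,
    'train_instance_type': 'ml.c4.xlarge',
    'hyperparameters': {'dataset': 'cora'},
}
v2_kwargs = to_v2_kwargs(v1_kwargs)
print(sorted(v2_kwargs))
```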


### Running the Training Job
After we've constructed our Estimator object, we can fit it using SageMaker (the dataset is downloaded automatically). Below we run SageMaker training on one channel: training-code, the code to run.

In [6]:
estimator.fit()

2019-11-24 14:06:21 Starting - Starting the training job...
2019-11-24 14:06:22 Starting - Launching requested ML instances...
2019-11-24 14:07:16 Starting - Preparing the instances for training......
2019-11-24 14:08:06 Downloading - Downloading input data
2019-11-24 14:08:06 Training - Downloading the training image.........
2019-11-24 14:09:42 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-11-24 14:09:43,350 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2019-11-24 14:09:43,353 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-11-24 14:09:43,364 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2019-11-24 14:09:46,382 sagemaker_pytorch_container.training INFO     Invoking user training script

[2019-11-24 14:10:05.240 algo-1:41 INFO json_config.py:86] Loaded Hook configuration from /opt/ml/input/config/debughookconfig.json
[2019-11-24 14:10:05.240 algo-1:41 INFO singleton_utils.py:28] smdebug is disabled, since hook not created in code and no json config file.


2019-11-24 14:10:18 Uploading - Uploading generated training model


2019-11-24 14:10:24 Completed - Training job completed
Training seconds: 146
Billable seconds: 146


### Output
You can get the model training output from the SageMaker console by searching for the training job named pytorch-gcn and looking up the address under 'S3 model artifact'.
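
The artifact location is also available in the notebook: after `fit()` returns, the estimator's `model_data` attribute holds the S3 URI of the model archive. As a small sketch, the URI can be split into bucket and key for use with other tools. The helper name `split_s3_uri` and the example URI are hypothetical placeholders.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError('not an S3 URI: {!r}'.format(uri))
    return parsed.netloc, parsed.path.lstrip('/')


# In the notebook you would pass estimator.model_data instead of this
# placeholder URI.
bucket_name, key = split_s3_uri('s3://example-bucket/example-job/output/model.tar.gz')
print(bucket_name, key)
```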