# Training SageMaker Models for Molecular Property Prediction Using DGL with PyTorch Backend

The **SageMaker Python SDK** makes it easy to train DGL models. In this example, we train a simple graph neural network for molecular toxicity prediction using [DGL](https://github.com/dmlc/dgl) and the Tox21 dataset.

The dataset contains qualitative toxicity measurements for 8014 compounds on 12 different targets, including nuclear receptors and stress response pathways. Each target yields a binary classification problem. Because each compound is a molecule that can be represented as a graph, we can model the task as a graph classification problem with one binary label per target.
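To make the graph view concrete, here is a minimal sketch in plain Python of one molecule as a graph, with atoms as nodes and bonds as edges, plus a label vector with one entry per target. The `MoleculeGraph` class and the all-zero labels are hypothetical placeholders for illustration; they are not the DGL data structures or real Tox21 values.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

NUM_TASKS = 12  # one binary toxicity target per task, as described above


@dataclass
class MoleculeGraph:
    """Hypothetical container: a molecule as an undirected graph."""
    atoms: List[str]              # node labels (element symbols)
    bonds: List[Tuple[int, int]]  # undirected edges as pairs of atom indices
    # Placeholder labels (all zeros), not real Tox21 measurements.
    labels: List[int] = field(default_factory=lambda: [0] * NUM_TASKS)

    def neighbors(self, atom: int) -> List[int]:
        """Return the indices of the atoms bonded to the given atom."""
        result = []
        for u, v in self.bonds:
            if u == atom:
                result.append(v)
            elif v == atom:
                result.append(u)
        return result


# Ethanol (SMILES "CCO") with hydrogens omitted: C-C-O
ethanol = MoleculeGraph(atoms=["C", "C", "O"], bonds=[(0, 1), (1, 2)])
print(ethanol.neighbors(1))  # [0, 2]
print(len(ethanol.labels))   # 12
```

A graph classifier maps the whole graph to the 12-entry label vector, rather than labeling individual atoms.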

## Setup

We first define a few variables that are used later in the example.

In [1]:
import sagemaker
from sagemaker import get_execution_role

# Set up the SageMaker session.
sess = sagemaker.Session()

# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here if you wish.
bucket = sess.default_bucket()

# Location to put your custom code.
custom_code_upload_location = 'customcode'

# IAM execution role that gives SageMaker access to resources in your AWS account.
# We can use the SageMaker Python SDK to get the role from our notebook environment. 
role = get_execution_role()

## Training Script

`main.py` provides all the code we need for training a SageMaker model.

In [None]:
!cat main.py

## Get DGL Docker Image (Optional)

We provide a DGL 0.4 GPU Docker image on Docker Hub under the `dgllib` registry. The following script pulls that image, builds `sagemaker-dgl-pytorch-gcn-tox21` from `gcn_tox21.Dockerfile`, and pushes the result to your Amazon ECR registry. You can skip this step if you already have a DGL Docker image in your ECR.

In [3]:
%%sh
default_docker_name="dgllib/dgl-sagemaker-gpu:dgl_0.4_pytorch_1.2.0_rdkit"
docker pull $default_docker_name

docker_name=sagemaker-dgl-pytorch-gcn-tox21

docker build -t $docker_name -f gcn_tox21.Dockerfile .

account=$(aws sts get-caller-identity --query Account --output text)
echo $account
region=$(aws configure get region)
region=${region:-us-east-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${docker_name}:latest"
# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${docker_name}" > /dev/null 2>&1
if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${docker_name}" > /dev/null
fi

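# Get the login command from ECR and execute it directly.
# Note: `aws ecr get-login` exists only in AWS CLI v1. With AWS CLI v2, use instead:
#   aws ecr get-login-password --region ${region} | \
#     docker login --username AWS --password-stdin "${account}.dkr.ecr.${region}.amazonaws.com"
$(aws ecr get-login --region ${region} --no-include-email)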

docker tag ${docker_name} ${fullname}

docker push ${fullname}

dgl_0.4_pytorch_1.2.0_rdkit: Pulling from dgllib/dgl-sagemaker-gpu

## SageMaker's Estimator Class

The SageMaker Estimator lets us run a training job on a single machine in SageMaker, using either CPU or GPU instances.

When we create the estimator, we pass in the URI of the Docker image and our IAM execution role. We also provide a few other parameters. `train_instance_count` and `train_instance_type` determine the number and type of SageMaker instances used for the training job (version 2 of the SageMaker Python SDK renames them to `instance_count` and `instance_type`). Hyperparameters are passed to the training container as a dict of values; here the dict carries the filename of our training script under the `entrypoint` key. See `main.py` for how hyperparameters are handled.

The entrypoint of the SageMaker Docker image (e.g., `dgllib/dgl-sagemaker-gpu:dgl_0.4_pytorch_1.2.0_rdkit`) is a `train` script under `/usr/bin/`. The `train` script inside the DGL Docker image provided above reads the real entrypoint from the hyperparameters and runs it from the `training-code` data channel (`/opt/ml/input/data/training-code/`).
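The lookup described above can be sketched roughly as follows. This is not the actual `train` script shipped in the image; `resolve_entrypoint` is a hypothetical helper. It assumes that the hyperparameters arrive as a JSON file of string values (SageMaker writes them to `/opt/ml/input/config/hyperparameters.json` inside the container) and that the script lives in the channel directory.

```python
import json
import os
import tempfile


def resolve_entrypoint(hyperparameters_path, channel_dir):
    """Return the full path of the script named by the 'entrypoint' hyperparameter."""
    with open(hyperparameters_path) as f:
        hyperparameters = json.load(f)
    # Hyperparameter values are strings; strip any extra quoting around the name.
    entrypoint = str(hyperparameters["entrypoint"]).strip('"')
    script = os.path.join(channel_dir, entrypoint)
    if not os.path.isfile(script):
        raise FileNotFoundError("entrypoint not found: " + script)
    return script


# Demonstration in a temporary directory that mimics the container layout.
with tempfile.TemporaryDirectory() as root:
    channel = os.path.join(root, "training-code")
    os.makedirs(channel)
    with open(os.path.join(channel, "main.py"), "w") as f:
        f.write("print('training')\n")
    config = os.path.join(root, "hyperparameters.json")
    with open(config, "w") as f:
        json.dump({"entrypoint": "main.py"}, f)
    script = resolve_entrypoint(config, channel)
    print(os.path.basename(script))  # main.py
```

Inside a training container, the two arguments would be `/opt/ml/input/config/hyperparameters.json` and `/opt/ml/input/data/training-code`.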

For this example, we will choose one ml.p3.2xlarge instance.

In [6]:
# Name of the ECR repository that holds the DGL Docker image
docker_name='sagemaker-dgl-pytorch-gcn-tox21'

CODE_PATH = 'main.py'
code_location = sess.upload_data(CODE_PATH, bucket=bucket, key_prefix=custom_code_upload_location)

account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, docker_name)
print(image)

estimator = sagemaker.estimator.Estimator(
    image,
    role,
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
    # The container's train script reads 'entrypoint' to find the script to run.
    hyperparameters={'entrypoint': CODE_PATH},
    sagemaker_session=sess,
)

397262719838.dkr.ecr.us-east-2.amazonaws.com/sagemaker-dgl-pytorch-gcn-tox21:latest
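The image URI printed above has the form `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. As a quick sanity check before launching a job, a small sketch like the following can split a URI back into its parts. `parse_ecr_image_uri` is a hypothetical helper, and the account ID in the example is a placeholder.

```python
import re

# AWS account IDs are 12 digits; repository and tag are separated by a colon.
ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_image_uri(uri):
    """Split an ECR image URI into account, region, repository, and tag."""
    match = ECR_URI.match(uri)
    if match is None:
        raise ValueError("not an ECR image URI: " + uri)
    return match.groupdict()


# Placeholder account ID, for illustration only.
parts = parse_ecr_image_uri(
    "123456789012.dkr.ecr.us-east-2.amazonaws.com/sagemaker-dgl-pytorch-gcn-tox21:latest"
)
print(parts["region"], parts["repository"], parts["tag"])
```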


## Running the Training Job

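After constructing the Estimator object, we start the training job by calling `fit`. The dictionary key `training-code` names the input data channel, so the script uploaded to S3 is made available under `/opt/ml/input/data/training-code/` inside the container.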

In [7]:
estimator.fit({'training-code': code_location})

2019-11-21 13:58:58 Starting - Starting the training job...
2019-11-21 13:58:59 Starting - Launching requested ML instances...
2019-11-21 13:59:57 Starting - Preparing the instances for training......
2019-11-21 14:00:47 Downloading - Downloading input data...
2019-11-21 14:01:01 Training - Downloading the training image............
2019-11-21 14:03:25 Training - Training image download completed. Training in progress..
/opt/ml/input/data/training-code /
Downloading /root/.dgl/tox21.csv.gz from https://s3.us-east-2.amazonaws.com/dgl.ai/dataset/tox21.csv.gz...
epoch 1/100, batch 1/49, loss 1.0049
epoch 1/100, batch 2/49, loss 1.1119
epoch 1/100, batch 3/49, loss 0.8967
epoch 1/100, batch 4/49, loss 1.0147
epoch 1/100, batch 5/49, loss 1.0258
epoch 1/100, batch 6/49, loss 0.9258
epoch 1/100, batch 7/49, loss 0.8880
epoch 1/100, batch 8/49, loss 1.0852
epoch 1/100, batch 9/49, loss 1.0548
...
epoch 9/100, batch 15/49, loss 0.7249
epoch 9/100, batch 16/49, loss 0.6044
epoch 9/100, batch 17/49, loss 0.6095
epoch 9/100, batch 18/49, loss 0.7581
epoch 9/100, batch 19/49, loss 0.5651
epoch 9/100, batch 20/49, loss 0.6014
epoch 9/100, batch 21/49, loss 0.5908
epoch 9/100, batch 22/49, loss 0.5971
epoch 9/100, batch 23/49, loss 0.5548
epoch 9/100, batch 24/49, loss 0.7711
epoch 9/100, batch 25/49, loss 0.5820
epoch 9/100, batch 26/49, loss 0.5924
epoch 9/100, batch 27/49, loss 0.6475
epoch 9/100, batch 28/49, loss 0.5992
epoch 9/100, batch 29/49, loss 0.6495
epoch 9/100, batch 30/49, loss 0.6322
epoch 9/100, batch 31/49, loss 0.6523
epoch 9/100, batch 32/49, loss 0.5702
epoch 9/100, batch 33/49, loss 0.6297
epoch 9/100, batch 34/49, loss 0.6151
epoch 9/100, batch 35/49, loss 0.5301
...


2019-11-21 14:04:16 Uploading - Uploading generated training model
2019-11-21 14:04:16 Completed - Training job completed
Training seconds: 209
Billable seconds: 209
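
Per-batch losses are noisy, so it can help to summarize the log by epoch. The following is a minimal sketch that assumes the log lines contain text of the form `epoch 1/100, batch 1/49, loss 1.0049`, as in the output above. `mean_loss_per_epoch` is a hypothetical helper, and the sample lines are copied from that output.

```python
import re
from collections import defaultdict

# Matches the per-batch progress text printed by the training script.
LOG_LINE = re.compile(r"epoch (\d+)/\d+, batch \d+/\d+, loss ([0-9.]+)")


def mean_loss_per_epoch(lines):
    """Average the logged batch losses for each epoch found in the lines."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip status lines and anything else that is not a batch line
        epoch = int(match.group(1))
        totals[epoch] += float(match.group(2))
        counts[epoch] += 1
    return {epoch: totals[epoch] / counts[epoch] for epoch in sorted(totals)}


sample = [
    "2019-11-21 14:00:47 Downloading - Downloading input data...",
    "epoch 1/100, batch 1/49, loss 1.0049",
    "epoch 1/100, batch 2/49, loss 1.1119",
    "epoch 9/100, batch 15/49, loss 0.7249",
    "epoch 9/100, batch 16/49, loss 0.6044",
]
for epoch, loss in mean_loss_per_epoch(sample).items():
    print("epoch {}: mean loss {:.4f}".format(epoch, loss))
```

Because the helper uses `search` rather than `match`, it also works on lines that carry color codes or timestamps around the progress text.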
