# Training Amazon SageMaker models for molecular property prediction by using DGL with a PyTorch backend

The **Amazon SageMaker Python SDK** makes it easy to train Deep Graph Library (DGL) models. In this example, you train a simple graph neural network for molecular toxicity prediction by using [DGL](https://github.com/dmlc/dgl) and the Tox21 dataset.

The dataset contains qualitative toxicity measurements for 8,014 compounds on 12 different targets, including nuclear receptors and stress-response pathways. Each target yields a binary classification problem. Because each compound is a molecule that can be represented as a graph, you can model the whole task as a graph classification problem.
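
Not every compound in Tox21 has a measurement for every target, so a multi-label setup typically pairs each label vector with a mask that marks which targets were actually measured. The sketch below, written in plain Python for illustration, computes a masked binary cross-entropy from raw scores (logits). The function name `masked_bce_with_logits` and the toy values are hypothetical, and this is not the loss code from `main.py`.

```python
import math


def masked_bce_with_logits(logits, labels, masks):
    """Mean binary cross-entropy over measured (mask == 1) entries only.

    logits, labels, and masks are equally shaped lists of rows,
    one row per molecule and one column per toxicity target.
    """
    total, count = 0.0, 0
    for row_logits, row_labels, row_masks in zip(logits, labels, masks):
        for z, y, m in zip(row_logits, row_labels, row_masks):
            if not m:
                continue  # the label is missing for this target; skip it
            # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)], s = sigmoid(z)
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count if count else 0.0


# Two molecules, three targets; the second molecule has no label for target 2.
logits = [[0.0, 2.0, -1.0], [1.5, 0.0, 0.3]]
labels = [[0, 1, 0], [1, 0, 1]]
masks = [[1, 1, 1], [1, 0, 1]]
loss = masked_bce_with_logits(logits, labels, masks)
```

Because the masked entry is skipped, changing its logit or label leaves `loss` unchanged.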

## Setup

Define a few variables that you need later in the example.

In [9]:
import boto3  # used only if you create the session for a specific region
import sagemaker
from sagemaker import get_execution_role

# Set up the session.
sess = sagemaker.Session()
# To use a different region than the one in your AWS CLI config:
# sess = sagemaker.Session(boto3.session.Session(region_name='eu-west-1'))

# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here if you wish.
bucket = sess.default_bucket()

# Location to put your custom code.
custom_code_upload_location = "customcode"

# IAM execution role that gives Amazon SageMaker access to resources in your AWS account.
# In a SageMaker notebook environment, you can get the role from the SDK:
# role = get_execution_role()
# Otherwise, specify the role ARN explicitly (replace this one with your own).
role = 'arn:aws:iam::819888036505:role/service-role/AmazonSageMaker-ExecutionRole-20211101T165507'

## Training Script

`main.py` provides all the code you need for training a molecular property prediction model by using Amazon SageMaker.

In [14]:
!cat main.py

import argparse
import json
import os
import random
from datetime import datetime

import dgl
import numpy as np
import torch
from dgl import model_zoo
from dgl.data.chem import Tox21
from dgl.data.utils import split_dataset
from sklearn.metrics import roc_auc_score
from torch.nn import BCEWithLogitsLoss
from torch.optim import Adam
from torch.utils.data import DataLoader


def setup(args, seed=0):
    args["device"] = "cuda" if torch.cuda.is_available() else "cpu"

    # Set random seed
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
    return args


def collate_molgraphs(data):
    """Batching a list of datapoints for dataloader."""
    smiles, graphs, labels, masks = map(list, zip(*data))

    bg = dgl.batch(graphs)
    bg.set_n_initializer(dgl.init.zero_initializer)
    bg.set_e_initializer(dgl.init.zero_initializer)
    labels = torch.stack(labels, dim=0)
    masks = torch.stack(mask

## Bring Your Own Image for Amazon SageMaker

In this example, you need the RDKit library to process the Tox21 dataset. DGL publishes CPU and GPU Docker images with RDKit preinstalled on Docker Hub under the dgllib registry: `dgllib/dgl-sagemaker-cpu:dgl_0.4_pytorch_1.2.0_rdkit` for CPU and `dgllib/dgl-sagemaker-gpu:dgl_0.4_pytorch_1.2.0_rdkit` for GPU. Pull the image that matches your instance type and push it to Amazon Elastic Container Registry (Amazon ECR). The following script does this for you. You can skip this step if you have already prepared your DGL Docker image in Amazon ECR.

In [11]:
%%sh
# GPU image. For CPU, use "dgllib/dgl-sagemaker-cpu:dgl_0.4_pytorch_1.2.0_rdkit" instead.
default_docker_name="dgllib/dgl-sagemaker-gpu:dgl_0.4_pytorch_1.2.0_rdkit"
docker pull $default_docker_name

docker_name=sagemaker-dgl-pytorch-gcn-tox21

# GPU build. For CPU, use gcn_tox21_cpu.Dockerfile instead.
docker build -t $docker_name -f gcn_tox21_gpu.Dockerfile .

account=$(aws sts get-caller-identity --query Account --output text)
echo $account
region=$(aws configure get region)

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${docker_name}:latest"
# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${docker_name}" > /dev/null 2>&1
if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${docker_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly.
# Note: `aws ecr get-login` is available only in AWS CLI version 1.
$(aws ecr get-login --region ${region} --no-include-email)

docker tag ${docker_name} ${fullname}
docker push ${fullname}

dgl_0.4_pytorch_1.2.0_rdkit: Pulling from dgllib/dgl-sagemaker-gpu
Digest: sha256:5e99e0336dbd4ffab576d11f5dfbbcf020dffb7d22bd8204f9a56a0ec9e9a103
Status: Image is up to date for dgllib/dgl-sagemaker-gpu:dgl_0.4_pytorch_1.2.0_rdkit
docker.io/dgllib/dgl-sagemaker-gpu:dgl_0.4_pytorch_1.2.0_rdkit
819888036505
Login Succeeded
The push refers to repository [819888036505.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-dgl-pytorch-gcn-tox21]
503ea5ce1b0d: Preparing
5f5519b1e2be: Preparing
68cde6786b69: Preparing
bd4c9ad79e39: Preparing
e918970a6388: Preparing
3b183c3d9548: Preparing
9e174541fd90: Preparing
e18671bb6f71: Preparing
25f6fb6fff6f: Preparing
5bde1457d341: Preparing
dfe12520986d: Preparing
ff4d40f8732b: Preparing
41ceebac7737: Preparing
f398437e4634: Preparing
b42b4fab3e2e: Preparing
8464e4a1821e: Preparing
ed88571bd95c: Preparing
69d90c18d3c5: Preparing
76993a8d1a18: Preparing
2eafd5e86d56: Preparing
1673fa18caaf: Preparing
7545d8b4edec: Preparing
718bbdc0b45f: Preparing
4a78de7ea906: P

#1 [internal] load build definition from gcn_tox21_gpu.Dockerfile
#1 sha256:b8233447d862e61f63780f1cc64a43508dc1eb33736bb6bfa806a2e612930dfb
#1 transferring dockerfile: 483B 0.0s done
#1 DONE 0.0s

#2 [internal] load .dockerignore
#2 sha256:a97985ac6fd5e9b7592f3ddd56e1b222e171081cb0e0a20eb69a263158502b7d
#2 transferring context: 2B done
#2 DONE 0.0s

#3 [internal] load metadata for docker.io/dgllib/dgl-sagemaker-gpu:dgl_0.4_pytorch_1.2.0_rdkit
#3 sha256:8e0eded9f7afcc575efc5a4ec5255bb3fceae061ef4ee18bdda86de169fbd0d3
#3 DONE 0.0s

#4 [1/3] FROM docker.io/dgllib/dgl-sagemaker-gpu:dgl_0.4_pytorch_1.2.0_rdkit
#4 sha256:27174b8b886c9788772c2d407aa529eec3fc19c5ac2439d1f783d3aaa8dd7cea
#4 DONE 0.0s

#5 [2/3] RUN pip install -U scikit-learn
#5 sha256:96b7ffb4b58b2cc2cec97abe59888849936269185d968c1b93c5e00fff4c92f4
#5 CACHED

#6 [3/3] RUN pip install pandas
#6 sha256:41c01a89c71eeb3a83a8bf104cc131f44fc0bb7a7e2c0744a5dc525de9fb50bf
#6 CACHED

#7 exporting to image
#7 sha256:e8c613e07b0b7ff33893
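
`aws ecr get-login`, which the script above uses, exists only in AWS CLI version 1. With AWS CLI version 2, the usual replacement is to pipe `aws ecr get-login-password` into `docker login --password-stdin`. The Python sketch below only builds the image name and the two commands so that you can inspect them before running anything. The helper names and the sample account ID are hypothetical.

```python
def ecr_registry(account, region):
    """Return the ECR registry host for an account and region."""
    return "{}.dkr.ecr.{}.amazonaws.com".format(account, region)


def ecr_image_uri(account, region, repository, tag="latest"):
    """Return the full image name, like the `fullname` variable in the script."""
    return "{}/{}:{}".format(ecr_registry(account, region), repository, tag)


def ecr_login_commands(account, region):
    """Build the AWS CLI v2 login pipeline as two argument lists.

    The first command prints a temporary password; the second reads it
    from standard input. Nothing is executed here.
    """
    get_password = ["aws", "ecr", "get-login-password", "--region", region]
    docker_login = [
        "docker", "login",
        "--username", "AWS",
        "--password-stdin", ecr_registry(account, region),
    ]
    return get_password, docker_login


# Hypothetical account ID and region, for illustration only.
get_password, docker_login = ecr_login_commands("123456789012", "eu-west-1")
pipeline = "{} | {}".format(" ".join(get_password), " ".join(docker_login))
image_uri = ecr_image_uri("123456789012", "eu-west-1", "sagemaker-dgl-pytorch-gcn-tox21")
print(pipeline)
print(image_uri)
```

Keeping the commands as argument lists makes it straightforward to hand them to `subprocess` later without shell quoting issues.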

## The Amazon SageMaker Estimator class

The Amazon SageMaker Estimator lets you run a training job on a single machine in Amazon SageMaker, using either a CPU-based or a GPU-based instance.

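When you create the estimator, pass in the URI of the Docker image and the IAM execution role, and name the training script through the `entrypoint` hyperparameter. Also provide a few other parameters. `train_instance_count` and `train_instance_type` determine the number and type of SageMaker instances used for the training job. In version 2 of the SageMaker Python SDK these two parameters are named `instance_count` and `instance_type`; the old names still work but produce the rename warning shown in the cell output. Hyperparameters are passed to the training script as a dict of values. See `main.py` for how they are handled.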

The entry point of the Amazon SageMaker Docker image (for example, `dgllib/dgl-sagemaker-gpu:dgl_0.4_pytorch_1.2.0_rdkit`) is a `train` script under `/usr/bin/`. The `train` script in the DGL Docker image reads the name of the real entry point from the hyperparameters (key `entrypoint`) and runs that script from the `training-code` data channel (`/opt/ml/input/data/training-code/`).
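
The `train` script itself is not shown in this notebook, so the following is only a sketch of how such a wrapper could resolve the real entry point. It relies on the fact that SageMaker writes the job's hyperparameters to `/opt/ml/input/config/hyperparameters.json` inside the container. The function names `resolve_entrypoint` and `build_command` are hypothetical.

```python
import json
import posixpath
import sys

HYPERPARAMETERS_PATH = "/opt/ml/input/config/hyperparameters.json"
CODE_CHANNEL_DIR = "/opt/ml/input/data/training-code"


def resolve_entrypoint(config_path=HYPERPARAMETERS_PATH, channel_dir=CODE_CHANNEL_DIR):
    """Return the path of the script named by the 'entrypoint' hyperparameter."""
    with open(config_path) as f:
        hyperparameters = json.load(f)
    script = hyperparameters.get("entrypoint")
    if not script:
        raise KeyError("hyperparameter 'entrypoint' is missing")
    # Container paths are POSIX paths, so join them with posixpath.
    return posixpath.join(channel_dir, script)


def build_command(script_path):
    """Build the command a wrapper could run to start training."""
    return [sys.executable, script_path]
```

With `hyperparameters={"entrypoint": "main.py"}` and the default paths, `resolve_entrypoint()` returns `/opt/ml/input/data/training-code/main.py`.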

For this example, choose one ml.p3.2xlarge instance. If you built the CPU image, use a CPU instance such as ml.c4.2xlarge instead. You can also attach a tag (here, key `ML Task` with value `DGL`) to help track the task.

In [12]:
import boto3

# Set target dgl-docker name
docker_name = "sagemaker-dgl-pytorch-gcn-tox21"

CODE_PATH = "main.py"
code_location = sess.upload_data(CODE_PATH, bucket=bucket, key_prefix=custom_code_upload_location)

account = sess.boto_session.client("sts").get_caller_identity()["Account"]
region = sess.boto_session.region_name
image = "{}.dkr.ecr.{}.amazonaws.com/{}:latest".format(account, region, docker_name)
print(image)
task_tags = [{"Key": "ML Task", "Value": "DGL"}]
estimator = sagemaker.estimator.Estimator(
    image,
    role,
    train_instance_count=1,
    train_instance_type="ml.p3.2xlarge",  # use "ml.c4.2xlarge" with the CPU image
    hyperparameters={"entrypoint": CODE_PATH},
    tags=task_tags,
    sagemaker_session=sess,
)

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


819888036505.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-dgl-pytorch-gcn-tox21:latest
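
The warnings in the output above come from version 2 of the SageMaker Python SDK, which renamed `train_instance_count` to `instance_count` and `train_instance_type` to `instance_type`. As a minimal sketch, the hypothetical helper `to_v2_kwargs` below translates the two old names used in this notebook; it covers only these two arguments, not the full list of v2 renames.

```python
# The two Estimator argument renames that this notebook runs into.
V2_RENAMES = {
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
}


def to_v2_kwargs(kwargs):
    """Return a copy of kwargs with v1 argument names replaced by v2 names."""
    return {V2_RENAMES.get(name, name): value for name, value in kwargs.items()}


v1_kwargs = {
    "train_instance_count": 1,
    "train_instance_type": "ml.p3.2xlarge",
    "hyperparameters": {"entrypoint": "main.py"},
}
v2_kwargs = to_v2_kwargs(v1_kwargs)
print(v2_kwargs)
```

The result could then be passed on as `sagemaker.estimator.Estimator(image, role, tags=task_tags, sagemaker_session=sess, **v2_kwargs)`, which should avoid the rename warnings.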


## Running the Training Job

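After you construct an Estimator object, call its `fit` method to start the training job in Amazon SageMaker. The `training-code` channel points to the training script that you uploaded to Amazon S3.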

In [13]:
estimator.fit({"training-code": code_location})

2021-11-02 16:59:18 Starting - Starting the training job...
2021-11-02 16:59:41 Starting - Launching requested ML instancesProfilerReport-1635872358: InProgress
...
2021-11-02 17:00:14 Starting - Preparing the instances for training............
2021-11-02 17:02:22 Downloading - Downloading input data
2021-11-02 17:02:22 Training - Downloading the training image................
/opt/ml/input/data/training-code /
Downloading /root/.dgl/tox21.csv.gz from https://s3.us-east-2.amazonaws.com/dgl.ai/dataset/tox21.csv.gz...
download failed, retrying, 4 attempts left
Downloading /root/.dgl/tox21.csv.gz from https://s3.us-east-2.amazonaws.com/dgl.ai/dataset/tox21.csv.gz...
download failed, retrying, 3 attempts left
Downloading /root/.dgl/tox21.csv.gz from https://s3.us-east-2.amazonaws.com/dgl.ai/dataset/tox21.csv.gz...
download failed, retrying, 2 attempts left
Downloading /root/.dgl/tox21.csv.gz from https://s3.us-east-2.amazon

## Output
You can get the model training output from the Amazon SageMaker console: search for the training job and look for the address listed under 'S3 model artifact'.
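
The same address is also available in the notebook as `estimator.model_data` once `fit` has finished. Tools that download from Amazon S3 often expect the bucket and the object key separately, so a small helper for splitting the URI can be handy. The sketch below uses only the standard library; the function name `split_s3_uri` and the sample URI are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical artifact location, for illustration only.
bucket_name, key = split_s3_uri("s3://my-bucket/my-training-job/output/model.tar.gz")
print(bucket_name)
print(key)
```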