<h1>Script-mode Custom Training Container</h1>

This notebook demonstrates how to build and use a custom Docker container for training with Amazon SageMaker that leverages the <strong>Script Mode</strong> execution implemented by the sagemaker-containers library. Reference documentation is available at https://github.com/aws/sagemaker-containers

We start by defining a few variables: the current execution role, the ECR repository to which we will push the custom Docker container, and the default Amazon S3 bucket used by Amazon SageMaker.

In [1]:
import boto3
import sagemaker
from sagemaker import get_execution_role

ecr_namespace = 'gianpo-ecr/'
prefix = 'script-mode-container'

ecr_repository_name = ecr_namespace + prefix
role = get_execution_role()
account_id = role.split(':')[4]
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()

print(account_id)
print(region)
print(role)
print(bucket)

825935527263
eu-west-1
arn:aws:iam::825935527263:role/service-role/AmazonSageMaker-ExecutionRole-endtoendml
sagemaker-eu-west-1-825935527263
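
The account ID above is obtained by splitting the role ARN on colons and taking the fifth field. As a minimal sketch of that step, the hypothetical helper parse_arn below (an illustration, not part of boto3 or the SageMaker SDK) names each field, assuming the standard arn:partition:service:region:account-id:resource layout:

```python
from collections import namedtuple

# Field names follow the generic ARN layout; the helper itself is hypothetical.
Arn = namedtuple("Arn", ["partition", "service", "region", "account_id", "resource"])


def parse_arn(arn):
    # maxsplit=5 keeps any colons inside the resource part intact.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError("not a valid ARN: {!r}".format(arn))
    return Arn(*parts[1:])


role_arn = "arn:aws:iam::825935527263:role/service-role/AmazonSageMaker-ExecutionRole-endtoendml"
parsed = parse_arn(role_arn)
print(parsed.account_id)
print(parsed.resource)
```

IAM is a global service, so the region field of a role ARN is empty; this is why the notebook reads the region from the boto3 session rather than from the ARN.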


Let's take a look at the Dockerfile which defines the statements for building our script-mode custom training container:

In [2]:
! pygmentize ../docker/Dockerfile

FROM ubuntu:16.04

LABEL maintainer="Amazon AI"

# Defining some variables used at build time to install Python3
ARG PYTHON=python3
ARG PYTHON_PIP=python3-pip
ARG PIP=pip3
ARG PYTHON_VERSION=3.6.6

# Install some handful libraries like curl, wget, git, build-essential, zlib
RUN apt-get update && apt-get install -y --no-install-recommends software-properties-common && \
    add-apt-repository ppa:deadsnakes/ppa -y && \
    apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        ca-certificates \
        curl \
        wget \
        git \
        libopencv-dev \
        openssh-client \
        openss

At a high level, the Dockerfile specifies the following operations for building this container:
<ul>
    <li>Start from Ubuntu 16.04</li>
    <li>Define some variables to be used at build time to install Python 3</li>
    <li>Install some handy libraries with apt-get</li>
    <li>Install Python 3 and create a symbolic link</li>
    <li>Install some Python libraries like numpy, pandas, scikit-learn, etc.</li>
    <li>Set a few environment variables, including PYTHONUNBUFFERED, which is used to avoid buffering Python standard output (useful for logging)</li>
    <li>Install the <strong>sagemaker-containers</strong> library</li>
    <li>Finally, copy all contents of <strong>code/</strong> (which is where our training code is) to <strong>/opt/ml/code/</strong>, which is the path where sagemaker-containers expects to find training code</li>
</ul>

<h3>Build and push the container</h3>
We are now ready to build this container and push it to Amazon ECR. This task is executed using a shell script stored in the ../scripts/ folder. Let's take a look at this script and then execute it.

In [3]:
! pygmentize ../scripts/build_and_push.sh

ACCOUNT_ID=$1
REGION=$2
REPO_NAME=$3

docker build -f ../docker/Dockerfile -t $REPO_NAME ../docker

docker tag $REPO_NAME $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest

$(aws ecr get-login --no-include-email --registry-ids $ACCOUNT_ID)

aws ecr describe-repositories --repository-names $REPO_NAME || aws ecr create-repository --repository-name $REPO_NAME

docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest



The script builds the Docker image, tags it, logs in to Amazon ECR, creates the repository if it does not exist, and finally pushes the image to the ECR repository. The first build takes a few minutes; after that, Docker caches the build layers and reuses them in subsequent builds.
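
The "describe, otherwise create" step of the script can also be expressed in Python. The sketch below is a rough equivalent, assuming a boto3 ECR client is passed in; the function name ensure_repository is hypothetical, and the client is injected so that the logic can be exercised without AWS credentials:

```python
def ensure_repository(ecr_client, repository_name):
    """Create the ECR repository only if it does not exist yet.

    Returns True when a repository was created, False when it already existed.
    ecr_client is assumed to behave like boto3.client('ecr').
    """
    try:
        ecr_client.describe_repositories(repositoryNames=[repository_name])
        return False
    except ecr_client.exceptions.RepositoryNotFoundException:
        # Mirrors the `describe-repositories || create-repository` shell idiom.
        ecr_client.create_repository(repositoryName=repository_name)
        return True
```

In this notebook it could be called as ensure_repository(boto3.client('ecr'), ecr_repository_name) before pushing the image. Unlike the shell version, it creates the repository only when the lookup fails because the repository is missing, not on any other error.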

In [4]:
! ../scripts/build_and_push.sh $account_id $region $ecr_repository_name

Sending build context to Docker daemon  15.87kB
Step 1/16 : FROM ubuntu:16.04
 ---> b9409899fe86
Step 2/16 : LABEL maintainer="Amazon AI"
 ---> Using cache
 ---> bab228941513
Step 3/16 : ARG PYTHON=python3
 ---> Using cache
 ---> 753bc9f6b601
Step 4/16 : ARG PYTHON_PIP=python3-pip
 ---> Using cache
 ---> 1d2afc099c45
Step 5/16 : ARG PIP=pip3
 ---> Using cache
 ---> 4637544f83e5
Step 6/16 : ARG PYTHON_VERSION=3.6.6
 ---> Using cache
 ---> f16297f44d34
Step 7/16 : RUN apt-get update && apt-get install -y --no-install-recommends software-properties-common &&     add-apt-repository ppa:deadsnakes/ppa -y &&     apt-get update && apt-get install -y --no-install-recommends         build-essential         ca-certificates         curl         wget         git         libopencv-dev         openssh-client         openssh-server         vim         zlib1g-dev &&     rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> a2784e936cf9
Step 8/16 : RUN wget https://www.python.org/ftp/python/$PYTHON_VERSI

<h3>Training with Amazon SageMaker</h3>

Once the container has been pushed to Amazon ECR, we are ready to start training with Amazon SageMaker, which requires the ECR path of the training image as a parameter when starting a training job.

In [5]:
container_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com/{2}:latest'.format(account_id, region, ecr_repository_name)
print(container_image_uri)

825935527263.dkr.ecr.eu-west-1.amazonaws.com/gianpo-ecr/script-mode-container:latest


Since the purpose of this example is to explain how to build custom script-mode containers, we are not going to train a real model. The script that will be executed does not define any specific training logic; it just prints the configuration injected by SageMaker and runs a dummy training loop. The training data is also dummy. Let's analyze the script first:

In [6]:
! pygmentize ../docker/code/train.py

from __future__ import absolute_import

import sys
import time
import os
import argparse

from utils import save_model_artifacts, print_files_in_path

def train(hp1, hp2, hp3, train_channel, validation_channel):

    print('\nList of files in train channel: ')
    print_files_in_path(os.environ['SM_CHANNEL_TRAIN'])
    
    print('\nList of files in validation channel: ')
    print_files_in_path(os.environ['SM_CHANNEL_VALIDATION'])
    
    # Dummy 

As you can see, the training code is implemented as a standard Python script, which the sagemaker-containers library invokes by passing hyperparameters as command-line arguments. This way of invoking the training script is what Amazon SageMaker containers call <strong>Script Mode</strong>.
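
Because the listing above is truncated, the sketch below shows one plausible way such a script could read its inputs; it is an assumption, not the actual content of train.py. Hyperparameters are parsed from the command line with argparse, and the channel paths default to the SM_CHANNEL_TRAIN and SM_CHANNEL_VALIDATION environment variables used in the listing. The argument types (string, integer, float) are guessed from the values used later in this notebook:

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments,
    # for example: --hp1 value1 --hp2 300 --hp3 0.001
    parser.add_argument("--hp1", type=str)
    parser.add_argument("--hp2", type=int)
    parser.add_argument("--hp3", type=float)
    # Channel locations are read from environment variables
    # (None when the script runs outside a training container).
    parser.add_argument("--train", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--validation", type=str,
                        default=os.environ.get("SM_CHANNEL_VALIDATION"))
    return parser.parse_args(argv)


# Simulate the invocation with an explicit argument list.
args = parse_args(["--hp1", "value1", "--hp2", "300", "--hp3", "0.001"])
print(args.hp1, args.hp2, args.hp3)
```

Passing an explicit argument list, as done here, also makes the script easy to test locally before building the container.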

Now, we upload some dummy data to Amazon S3, in order to define our S3-based training channels.

In [7]:
! echo "val1, val2, val3" > dummy.csv
print(sagemaker_session.upload_data('dummy.csv', bucket, prefix + '/train'))
print(sagemaker_session.upload_data('dummy.csv', bucket, prefix + '/val'))
! rm dummy.csv

's3://sagemaker-eu-west-1-825935527263/script-mode-container/val/dummy.csv'
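
The URI printed above follows the pattern s3://bucket/key-prefix/file-name. A minimal sketch of that composition, using a hypothetical helper uploaded_s3_uri (an illustration of the naming pattern, not the SDK's implementation):

```python
import os
import posixpath


def uploaded_s3_uri(bucket, key_prefix, local_path):
    # S3 keys always use forward slashes, so the key is joined with posixpath.
    key = posixpath.join(key_prefix, os.path.basename(local_path))
    return "s3://{0}/{1}".format(bucket, key)


print(uploaded_s3_uri("sagemaker-eu-west-1-825935527263",
                      "script-mode-container/val", "dummy.csv"))
```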

Finally, we can execute the training job by calling the fit() method of the generic Estimator object defined in the Amazon SageMaker Python SDK (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/estimator.py). This corresponds to calling the CreateTrainingJob() API (https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html).
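
To make that correspondence concrete, the sketch below assembles, without sending it, a request dictionary of the shape that CreateTrainingJob accepts for the job configured in this notebook. The helper name, the job name, the output path, the volume size and the maximum runtime are assumptions for illustration; the SDK fills in its own values for these:

```python
def build_training_request(job_name, image_uri, role_arn, channels,
                           hyperparameters, output_path,
                           instance_type="ml.m5.xlarge", instance_count=1,
                           volume_size_gb=30, max_runtime_s=86400):
    """Assemble (but do not send) a CreateTrainingJob-style request."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": name,
                "ContentType": "text/csv",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
            for name, uri in sorted(channels.items())
        ],
        "OutputDataConfig": {"S3OutputPath": output_path},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": volume_size_gb,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_s},
        # The API models hyperparameters as a string-to-string map.
        "HyperParameters": {str(k): str(v) for k, v in hyperparameters.items()},
    }


request = build_training_request(
    job_name="script-mode-container-example",  # hypothetical job name
    image_uri="825935527263.dkr.ecr.eu-west-1.amazonaws.com/gianpo-ecr/script-mode-container:latest",
    role_arn="arn:aws:iam::825935527263:role/service-role/AmazonSageMaker-ExecutionRole-endtoendml",
    channels={
        "train": "s3://sagemaker-eu-west-1-825935527263/script-mode-container/train/",
        "validation": "s3://sagemaker-eu-west-1-825935527263/script-mode-container/val/",
    },
    hyperparameters={"hp1": "value1", "hp2": 300, "hp3": 0.001},
    output_path="s3://sagemaker-eu-west-1-825935527263/",  # assumed output location
)
print(request["HyperParameters"])
```

Note that every hyperparameter value becomes a string in the request, whatever its Python type was, so the training container has to convert the values back to their intended types.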

In [8]:
import sagemaker

est = sagemaker.estimator.Estimator(container_image_uri,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m5.xlarge',
                                    base_job_name=prefix)

est.set_hyperparameters(hp1="value1",
                        hp2=300,
                        hp3=0.001)

train_config = sagemaker.session.s3_input('s3://{0}/{1}/train/'.format(bucket, prefix), content_type='text/csv')
val_config = sagemaker.session.s3_input('s3://{0}/{1}/val/'.format(bucket, prefix), content_type='text/csv')

est.fit({'train': train_config, 'validation': val_config })

2019-10-24 16:27:47 Starting - Starting the training job...
2019-10-24 16:27:48 Starting - Launching requested ML instances...
2019-10-24 16:28:44 Starting - Preparing the instances for training...
2019-10-24 16:29:17 Downloading - Downloading input data
2019-10-24 16:29:17 Training - Downloading the training image.....
2019-10-24 16:29:55,274 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-10-24 16:29:55,274 sagemaker-containers INFO     Failed to parse hyperparameter hp1 value value1 to Json.
Returning the value itself

2019-10-24 16:29:55 Training - Training image download completed. Training in progress.
2019-10-24 16:30:01,520 sagemaker-containers INFO     Failed to parse hyperparameter hp1 value value1 to Json.
Returning the value itself
2019-10-24 16:30:01,523 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-10-24 16:30:01,536 sagemaker-containers INFO     