<h1>Custom Framework Container</h1>

This notebook demonstrates how to build and use a simple custom Docker container for training with Amazon SageMaker. The container relies on the sagemaker-containers library to define a framework container: a container that can load user code dynamically, either from Amazon S3 or from a GitHub repository. Reference documentation is available at https://github.com/aws/sagemaker-containers

We start by defining some variables like the current execution role, the ECR repository that we are going to use for pushing the custom Docker container and a default Amazon S3 bucket to be used by Amazon SageMaker.

In [14]:
import boto3
import sagemaker
from sagemaker import get_execution_role

ecr_namespace = 'gianpo-ecr/'
prefix = 'framework-container'

ecr_repository_name = ecr_namespace + prefix
role = get_execution_role()
account_id = role.split(':')[4]
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()

print(account_id)
print(region)
print(role)
print(bucket)

825935527263
eu-west-1
arn:aws:iam::825935527263:role/service-role/AmazonSageMaker-ExecutionRole-endtoendml
sagemaker-eu-west-1-825935527263
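
The account ID above is obtained by splitting the role ARN on colons and taking the fifth field. As a minimal sketch of what that positional lookup relies on, the following names each ARN field and rebuilds the default bucket name seen in the output. The helpers `parse_role_arn` and `default_bucket_name` are hypothetical and not part of the SageMaker SDK; the bucket naming pattern is inferred from the printed values.

```python
def parse_role_arn(role_arn):
    """Split an IAM role ARN into its named fields.

    Assumes the layout arn:partition:service:region:account-id:resource.
    """
    parts = role_arn.split(':', 5)
    if len(parts) != 6 or parts[0] != 'arn':
        raise ValueError('Not a valid ARN: {}'.format(role_arn))
    keys = ('prefix', 'partition', 'service', 'region', 'account_id', 'resource')
    return dict(zip(keys, parts))


def default_bucket_name(region, account_id):
    # Pattern inferred from the output above: sagemaker-<region>-<account id>.
    return 'sagemaker-{}-{}'.format(region, account_id)


arn = 'arn:aws:iam::825935527263:role/service-role/AmazonSageMaker-ExecutionRole-endtoendml'
fields = parse_role_arn(arn)
print(fields['account_id'])
print(default_bucket_name('eu-west-1', fields['account_id']))
```

Note that the region field of an IAM ARN is empty, which is why the notebook reads the region from the boto3 session instead.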


Let's take a look at the Dockerfile which defines the statements for building our custom framework container:

In [15]:
! pygmentize ../docker/Dockerfile

FROM ubuntu:16.04

LABEL maintainer="Amazon AI"

# Defining some variables used at build time to install Python3
ARG PYTHON=python3
ARG PYTHON_PIP=python3-pip
ARG PIP=pip3
ARG PYTHON_VERSION=3.6.6

# Install some handful libraries like curl, wget, git, build-essential, zlib
RUN apt-get update && apt-get install -y --no-install-recommends software-properties-common && \
    add-apt-repository ppa:deadsnakes/ppa -y && \
    apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        ca-certificates \
        curl \
        wget \
        git \
        libopencv-dev \
        openssh-client \
        openss

At a high level, the Dockerfile specifies the following operations for building this container:
<ul>
    <li>Start from Ubuntu 16.04</li>
    <li>Define some variables to be used at build time to install Python 3</li>
    <li>Install some useful libraries with apt-get</li>
    <li>Install Python 3 and create a symbolic link</li>
    <li>Copy a .tar.gz package named <strong>custom_framework_training-1.0.0.tar.gz</strong> into the WORKDIR</li>
    <li>Install some Python libraries such as numpy, pandas, and scikit-learn, <strong>and the package copied in the previous step</strong></li>
    <li>Set a few environment variables, including PYTHONUNBUFFERED, which disables buffering of Python standard output (useful for logging)</li>
    <li>Finally, set the environment variable <strong>SAGEMAKER_TRAINING_MODULE</strong> to the training package we installed</li>
</ul>

<h2>Training module</h2>

When looking at the Dockerfile above, you might be asking yourself what the <strong>custom_framework_training-1.0.0.tar.gz</strong> package is.
When building a framework container, sagemaker-containers expects the container to include a training module that is responsible for invoking a user-provided module or script.
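
To make the role of the training module more concrete, here is a simplified sketch of how a container could turn a `module:function` string into a callable and invoke it. This is an illustration only, not the sagemaker-containers implementation: the helper `resolve_entry_point` is hypothetical, and the `module:function` format is an assumption. The example resolves `json:dumps` as a stand-in, because `custom_framework_training` is only installed inside the container.

```python
import importlib


def resolve_entry_point(spec):
    """Resolve a 'package.module:function' string into a callable."""
    module_name, sep, function_name = spec.partition(':')
    if not sep or not module_name or not function_name:
        raise ValueError("Expected 'module:function', got {!r}".format(spec))
    # Import the module by name, then look up the function on it.
    module = importlib.import_module(module_name)
    return getattr(module, function_name)


entry_point = resolve_entry_point('json:dumps')
print(entry_point({'status': 'ok'}))
```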

Our training module is part of a Python package - that you can find in the folder ../package/ - distributed as a .tar.gz by the Python setuptools library (https://setuptools.readthedocs.io/en/latest/).

Setuptools uses a setup.py file to build the package. Following is the content of this file:

In [16]:
!pygmentize ../package/setup.py

from __future__ import absolute_import

from glob import glob
import os
from os.path import basename
from os.path import splitext

from setuptools import find_packages, setup

setup(
    name='custom_framework_training',
    version='1.0.0',
    description='Custom framework container training package.',
    keywords="custom framework contaier training package SageMaker",

    packages=find_packages(where='src'),
    package_dir={''

This build script looks at the packages under the local src/ path and specifies the dependency on sagemaker-containers. The training module contains the following code:

In [17]:
!pygmentize ../package/src/custom_framework_training/training.py

from __future__ import absolute_import

import logging

import sagemaker_containers.beta.framework as framework

logger = logging.getLogger(__name__)


def train(training_environment):
    logger.info('Invoking user training script.')

    # Execute user script as module.
    framework.modules.run_module(training_environment.module_dir, training_environment.to_cmd_args(),
                                 training_environment.to_env_vars(), training_environment.module_name)


def main():
    train(framework.training_env())


The idea here is that we will use the <strong>run_module()</strong> function of the sagemaker-containers library to execute the user-provided training script.
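
As a rough mental model of this step, the sketch below turns a hyperparameter dictionary into command-line arguments and builds a `python -m <module>` command. This is not the library's code: `to_cmd_args` and `build_module_command` are hypothetical helpers, and the exact argument format produced by sagemaker-containers is an assumption.

```python
import sys


def to_cmd_args(hyperparameters):
    """Flatten {'hp1': 'value1'} into ['--hp1', 'value1'], sorted by key."""
    args = []
    for key in sorted(hyperparameters):
        args.extend(['--{}'.format(key), str(hyperparameters[key])])
    return args


def build_module_command(module_name, hyperparameters):
    # Use the interpreter that is running this code to launch the module.
    return [sys.executable, '-m', module_name] + to_cmd_args(hyperparameters)


command = build_module_command('train', {'hp1': 'value1', 'hp2': 300, 'hp3': 0.001})
print(command[1:])
```

A command list like this could then be handed to `subprocess.run(command, check=True)` together with the environment variables describing the training job.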

<h3>Build and push the container</h3>
We are now ready to build this container and push it to Amazon ECR. This task is executed using a shell script stored in the ../scripts/ folder. Let's take a look at this script and then execute it.

In [18]:
! pygmentize ../scripts/build_and_push.sh

ACCOUNT_ID=$1
REGION=$2
REPO_NAME=$3

cd ../package/ && python setup.py sdist && cp dist/custom_framework_training-1.0.0.tar.gz ../docker/code/

docker build -f ../docker/Dockerfile -t $REPO_NAME ../docker

docker tag $REPO_NAME $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest

$(aws ecr get-login --no-include-email --registry-ids $ACCOUNT_ID)

aws ecr describe-repositories --repository-names $REPO_NAME || aws ecr create-repository --repository-name $REPO_NAME

docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest


First, the script runs <strong>setup.py</strong> to create the training package, which is copied to <strong>../docker/code/</strong>.

Then it builds and tags the Docker image, logs in to Amazon ECR, creates the repository if it does not exist, and finally pushes the image to the ECR repository. The first build takes a few minutes; subsequent builds are faster because Docker caches the intermediate build layers.
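
The "create the repository if it does not exist" step can also be expressed in Python. The sketch below is a minimal, hedged equivalent of the `describe-repositories || create-repository` line of the script, written against the boto3 ECR client; the helper name `ensure_repository` is hypothetical, and the function takes the client as a parameter so that it can be exercised without AWS credentials.

```python
def ensure_repository(ecr_client, repository_name):
    """Create the ECR repository only if it is missing.

    Returns True if the repository had to be created, False if it existed.
    """
    try:
        ecr_client.describe_repositories(repositoryNames=[repository_name])
        return False
    except ecr_client.exceptions.RepositoryNotFoundException:
        ecr_client.create_repository(repositoryName=repository_name)
        return True


# Usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# created = ensure_repository(boto3.client('ecr', region_name='eu-west-1'),
#                             'gianpo-ecr/framework-container')
```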

In [19]:
! ../scripts/build_and_push.sh $account_id $region $ecr_repository_name

running sdist
running egg_info
writing src/custom_framework_training.egg-info/PKG-INFO
writing dependency_links to src/custom_framework_training.egg-info/dependency_links.txt
writing requirements to src/custom_framework_training.egg-info/requires.txt
writing top-level names to src/custom_framework_training.egg-info/top_level.txt
reading manifest file 'src/custom_framework_training.egg-info/SOURCES.txt'
writing manifest file 'src/custom_framework_training.egg-info/SOURCES.txt'

running check

creating custom_framework_training-1.0.0
creating custom_framework_training-1.0.0/src
creating custom_framework_training-1.0.0/src/custom_framework_training
creating custom_framework_training-1.0.0/src/custom_framework_training.egg-info
copying files to custom_framework_training-1.0.0...
copying setup.py -> custom_framework_training-1.0.0
copying src/custom_framework_training/__init__.py -> custom_framework_training-1.0.0/src/custom_framework_training
copying src/custom_framework_training/training.

<h3>Training with Amazon SageMaker</h3>

Once the container has been pushed to Amazon ECR, we are ready to start training with Amazon SageMaker. Starting a training job requires, as a parameter, the ECR path of the Docker container used for training.

In [20]:
container_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com/{2}:latest'.format(account_id, region, ecr_repository_name)
print(container_image_uri)

825935527263.dkr.ecr.eu-west-1.amazonaws.com/gianpo-ecr/framework-container:latest


Since the purpose of this example is to explain how to build custom framework containers, we are not going to train a real model. The script that will be executed does not define any specific training logic; it just prints the configuration injected by SageMaker and implements a dummy training loop. The training data is also dummy data. Let's analyze the script first:

In [21]:
! pygmentize source_dir/train.py

from __future__ import absolute_import

import sys
import time
import os
import argparse

from utils import save_model_artifacts, print_files_in_path

def train(hp1, hp2, hp3, train_channel, validation_channel):

    print('\nList of files in train channel: ')
    print_files_in_path(os.environ['SM_CHANNEL_TRAIN'])
    
    print('\nList of files in validation channel: ')
    print_files_in_path(os.environ['SM_CHANNEL_VALIDATION'])
    
    # Dummy 

Note that the training code is implemented as a standard Python script, which the framework container code invokes as a module, passing the hyperparameters as arguments.
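
Because the full script is not reproduced here, the following is only a hypothetical sketch of how such a script could read its inputs: hyperparameters arrive as `--name value` arguments, and channel paths are read from the `SM_CHANNEL_TRAIN` and `SM_CHANNEL_VALIDATION` environment variables used in the script. The argument types and the `--train` and `--validation` options are assumptions made for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters; the types are assumptions for illustration.
    parser.add_argument('--hp1', type=str)
    parser.add_argument('--hp2', type=int)
    parser.add_argument('--hp3', type=float)
    # Channel paths default to the environment variables set in the container.
    parser.add_argument('--train', default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--validation', default=os.environ.get('SM_CHANNEL_VALIDATION'))
    # Ignore any extra arguments rather than failing on them.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(['--hp1', 'value1', '--hp2', '300', '--hp3', '0.001'])
print(args.hp1, args.hp2, args.hp3)
```

Using `parse_known_args` is a design choice for this sketch: it keeps the script from exiting if the container passes arguments the script does not declare.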

Now, we upload some dummy data to Amazon S3, in order to define our S3-based training channels.

In [22]:
! echo "val1, val2, val3" > dummy.csv
print(sagemaker_session.upload_data('dummy.csv', bucket, prefix + '/train'))
print(sagemaker_session.upload_data('dummy.csv', bucket, prefix + '/val'))
! rm dummy.csv

s3://sagemaker-eu-west-1-825935527263/framework-container/train/dummy.csv
s3://sagemaker-eu-west-1-825935527263/framework-container/val/dummy.csv


As mentioned earlier, framework containers can run user-provided code dynamically by loading it either from Amazon S3 or from a GitHub repository. In this case we are going to use Amazon S3, so we need to:
<ul>
    <li>Package the <strong>source_dir</strong> folder in a tar.gz archive</li>
    <li>Upload the archive to Amazon S3</li>
    <li>Specify the path to the archive in Amazon S3 as one of the parameters of the training job</li>
</ul>

<strong>Note:</strong> these steps are executed automatically by the Amazon SageMaker Python SDK when using framework estimators for MXNet, TensorFlow, etc.

In [51]:
import os
import tarfile
import tempfile

def create_tar_file(source_files, target=None):
    if target:
        filename = target
    else:
        _, filename = tempfile.mkstemp()

    with tarfile.open(filename, mode="w:gz") as t:
        for sf in source_files:
            # Add all files from the directory into the root of the directory structure of the tar
            t.add(sf, arcname=os.path.basename(sf))
    return filename

create_tar_file(["source_dir/train.py", "source_dir/utils.py"], "sourcedir.tar.gz")

'sourcedir.tar.gz'
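
To check that `arcname=os.path.basename(...)` really places the files at the root of the archive, one option is to list the archive members after packing. The sketch below does this with throwaway files in a temporary directory; the helper `list_archive_members` and the placeholder file contents are hypothetical and only serve the demonstration.

```python
import os
import tarfile
import tempfile


def list_archive_members(source_files, archive_path):
    """Pack files at the archive root and return the member names."""
    with tarfile.open(archive_path, mode='w:gz') as archive:
        for path in source_files:
            archive.add(path, arcname=os.path.basename(path))
    with tarfile.open(archive_path, mode='r:gz') as archive:
        return archive.getnames()


# Demonstrate with throwaway files nested one directory deep.
with tempfile.TemporaryDirectory() as workdir:
    nested = os.path.join(workdir, 'source_dir')
    os.makedirs(nested)
    sample_files = []
    for name in ('train.py', 'utils.py'):
        path = os.path.join(nested, name)
        with open(path, 'w') as handle:
            handle.write('# placeholder\n')
        sample_files.append(path)
    members = list_archive_members(sample_files, os.path.join(workdir, 'sourcedir.tar.gz'))

print(members)
```

The member names carry no `source_dir/` prefix, so the module name passed to the container can refer to `train` directly.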

In [52]:
sources = sagemaker_session.upload_data('sourcedir.tar.gz', bucket, prefix + '/code')
print(sources)
! rm sourcedir.tar.gz

s3://sagemaker-eu-west-1-825935527263/framework-container/code/sourcedir.tar.gz


When starting the training job, we need to tell the sagemaker-containers library where the sources are stored in Amazon S3 and which module to invoke. These parameters are passed through the following reserved hyperparameters, which are injected automatically when using the framework estimators of the Amazon SageMaker Python SDK:
<ul>
    <li>sagemaker_program: the module to be invoked</li>
    <li>sagemaker_submit_directory: the Amazon S3 path of the source archive</li>
</ul>

Finally, we can execute the training job by calling the fit() method of the generic Estimator object defined in the Amazon SageMaker Python SDK (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/estimator.py). This corresponds to calling the CreateTrainingJob() API (https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html).
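
For readers curious about what the low-level call looks like, the sketch below assembles a request dictionary in the shape expected by CreateTrainingJob. It is an approximation of what the estimator sends, not a copy of it: the helper `build_training_job_request`, the job name, the output path, the volume size and the maximum runtime are all assumptions chosen for illustration.

```python
import json


def build_training_job_request(job_name, image_uri, role_arn, bucket, prefix,
                               hyperparameters):
    """Assemble keyword arguments for the low-level CreateTrainingJob call."""
    def channel(name, folder):
        return {
            'ChannelName': name,
            'ContentType': 'text/csv',
            'DataSource': {
                'S3DataSource': {
                    'S3DataType': 'S3Prefix',
                    'S3Uri': 's3://{}/{}/{}/'.format(bucket, prefix, folder),
                    'S3DataDistributionType': 'FullyReplicated',
                }
            },
        }

    return {
        'TrainingJobName': job_name,
        'AlgorithmSpecification': {'TrainingImage': image_uri,
                                   'TrainingInputMode': 'File'},
        'RoleArn': role_arn,
        # The API only accepts string values, hence the JSON encoding.
        'HyperParameters': {str(k): json.dumps(v) for k, v in hyperparameters.items()},
        'InputDataConfig': [channel('train', 'train'), channel('validation', 'val')],
        # Output path, volume size and runtime limit are assumed values.
        'OutputDataConfig': {'S3OutputPath': 's3://{}/{}/output'.format(bucket, prefix)},
        'ResourceConfig': {'InstanceType': 'ml.m5.xlarge', 'InstanceCount': 1,
                           'VolumeSizeInGB': 30},
        'StoppingCondition': {'MaxRuntimeInSeconds': 86400},
    }


request = build_training_job_request(
    job_name='framework-container-demo',  # hypothetical job name
    image_uri='825935527263.dkr.ecr.eu-west-1.amazonaws.com/gianpo-ecr/framework-container:latest',
    role_arn='arn:aws:iam::825935527263:role/service-role/AmazonSageMaker-ExecutionRole-endtoendml',
    bucket='sagemaker-eu-west-1-825935527263',
    prefix='framework-container',
    hyperparameters={'sagemaker_program': 'train', 'hp1': 'value1', 'hp2': 300},
)
print(sorted(request['HyperParameters'].items()))
```

With AWS credentials available, such a dictionary could be submitted with `boto3.client('sagemaker').create_training_job(**request)`; the generic Estimator performs the equivalent work for you.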

In [50]:
import json

import sagemaker

# JSON encode hyperparameters to avoid showing some info messages raised by the sagemaker-containers library.
def json_encode_hyperparameters(hyperparameters):
    return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}

hyperparameters = json_encode_hyperparameters({
    "sagemaker_program": "train",
    "sagemaker_submit_directory": sources,
    "hp1": "value1",
    "hp2": 300,
    "hp3": 0.001})

est = sagemaker.estimator.Estimator(container_image_uri,
                                    role,
                                    train_instance_count=1, 
                                    train_instance_type='ml.m5.xlarge',
                                    base_job_name=prefix,
                                    hyperparameters=hyperparameters)

train_config = sagemaker.session.s3_input('s3://{0}/{1}/train/'.format(bucket, prefix), content_type='text/csv')
val_config = sagemaker.session.s3_input('s3://{0}/{1}/val/'.format(bucket, prefix), content_type='text/csv')

est.fit({'train': train_config, 'validation': val_config })

2019-10-25 11:50:39 Starting - Starting the training job...
2019-10-25 11:50:41 Starting - Launching requested ML instances......
2019-10-25 11:52:03 Starting - Preparing the instances for training......
2019-10-25 11:53:00 Downloading - Downloading input data
2019-10-25 11:53:00 Training - Downloading the training image....
2019-10-25 11:53:42,081 sagemaker-containers INFO     Imported framework custom_framework_training.training
2019-10-25 11:53:42,084 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-10-25 11:53:42,096 custom_framework_training.training INFO     Invoking user training script.
2019-10-25 11:53:42,368 sagemaker-containers INFO     Module train does not provide a setup.py. 
Generating setup.py
2019-10-25 11:53:42,368 sagemaker-containers INFO     Generating setup.cfg
2019-10-25 11:53:42,368 sagemaker-containers INFO     Generating MANIFEST.in
2019-10-25 11:53:42,368 sagem

<h3>Training with a custom SDK framework estimator</h3>

As you have seen, in the previous steps we had to upload our code to Amazon S3 and then inject reserved hyperparameters to execute training. To simplify this, you can define a custom framework estimator with the Amazon SageMaker Python SDK and run training with that class, which takes care of these tasks.

Moreover, this approach lets you use local mode training (https://sagemaker.readthedocs.io/en/stable/overview.html#id6).

In [40]:
from sagemaker.estimator import Framework

class CustomFramework(Framework):
    def __init__(
        self,
        entry_point,
        source_dir=None,
        hyperparameters=None,
        py_version="py2",
        framework_version=None,
        image_name=None,
        distributions=None,
        **kwargs
    ):
        super(CustomFramework, self).__init__(
            entry_point, source_dir, hyperparameters, image_name=image_name, **kwargs
        )
    
    def _configure_distribution(self, distributions):
        return
    
    def create_model(
        self,
        model_server_workers=None,
        role=None,
        vpc_config_override=None,
        entry_point=None,
        source_dir=None,
        dependencies=None,
        image_name=None,
        **kwargs
    ):
        return None
        
import sagemaker

est = CustomFramework(image_name=container_image_uri,
                      role=role,
                      entry_point='train.py',
                      source_dir='source_dir/',
                      train_instance_count=1, 
                      train_instance_type='local',
                      base_job_name=prefix,
                      hyperparameters={
                          "hp1": "value1",
                          "hp2": "300",
                          "hp3": "0.001"
                      })

train_config = sagemaker.session.s3_input('s3://{0}/{1}/train/'.format(bucket, prefix), content_type='text/csv')
val_config = sagemaker.session.s3_input('s3://{0}/{1}/val/'.format(bucket, prefix), content_type='text/csv')

est.fit({'train': train_config, 'validation': val_config })

Creating tmpp6sp8hdd_algo-1-fptd8_1 ... done
Attaching to tmpp6sp8hdd_algo-1-fptd8_1
algo-1-fptd8_1  | 2019-10-25 10:29:09,048 sagemaker-containers INFO     Imported framework custom_framework_training.training
algo-1-fptd8_1  | 2019-10-25 10:29:09,051 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-fptd8_1  | 2019-10-25 10:29:09,063 custom_framework_training.training INFO     Invoking user training script.
algo-1-fptd8_1  | 2019-10-25 10:29:09,170 sagemaker-containers INFO     Module train does not provide a setup.py. 
algo-1-fptd8_1  | Generating setup.py
algo-1-fptd8_1  | 2019-10-25 10:29:09,170 sagemaker-containers INFO     Generating setup.cfg
algo-1-fptd8_1  | 2019-10-25 10:29:09,170 sagemaker-containers INFO     Generating MANIFEST.in
algo-1-fptd8_1  | 2019-10-25 10:29:09,170 sagemaker-containers INFO     Installing module with the following command:
algo-1-