<h1>Script-mode Custom Training Container (2)</h1>

This notebook demonstrates how to build and use a custom Docker container for training with Amazon SageMaker that relies on the <strong>Script Mode</strong> execution implemented by the sagemaker-training-toolkit library. Reference documentation is available at https://github.com/aws/sagemaker-training-toolkit.

Unlike the first example, we do not copy the training code into the image during the Docker build. Instead, the code is loaded dynamically from Amazon S3 at training time (a feature implemented by the sagemaker-training-toolkit).

We start by defining some variables: the current execution role, the ECR repository to which we will push the custom Docker container, and a default Amazon S3 bucket to be used by Amazon SageMaker.

In [1]:
import boto3
import sagemaker
from sagemaker import get_execution_role

ecr_namespace = 'sagemaker-training-containers/'
prefix = 'tf-script-mode-container-2'

ecr_repository_name = ecr_namespace + prefix
role = get_execution_role()
account_id = role.split(':')[4]
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()

print(account_id)
print(region)
print(role)
print(bucket)

057716757052
ap-northeast-2
arn:aws:iam::057716757052:role/service-role/AmazonSageMaker-ExecutionRole-20210120T193680
sagemaker-ap-northeast-2-057716757052
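
The account ID above is read from the fifth colon-separated field of the execution role ARN. The following is a minimal sketch of that parsing, assuming the usual `arn:partition:service:region:account-id:resource` layout. The helper name `account_id_from_arn` is hypothetical and not part of any SDK.

```python
def account_id_from_arn(arn):
    """Return the account ID field of an ARN (hypothetical helper, for illustration)."""
    # Split at most 5 times: the trailing resource part may itself contain colons
    parts = arn.split(":", 5)
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError("not a valid ARN: {!r}".format(arn))
    return parts[4]


role_arn = "arn:aws:iam::057716757052:role/service-role/AmazonSageMaker-ExecutionRole-20210120T193680"
print(account_id_from_arn(role_arn))
```

Validating the prefix and the field count makes a malformed role string fail early rather than producing a wrong ECR address later.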


Let's take a look at the Dockerfile which defines the statements for building our script-mode custom training container:

In [2]:
! pygmentize ../docker/Dockerfile

FROM tensorflow/tensorflow:2.2.0rc2-gpu-py3-jupyter

# Install sagemaker-training toolkit to enable SageMaker Python SDK
RUN pip3 install sagemaker-training


At a high level, the Dockerfile specifies the following operations for building this container:
<ul>
    <li>Start from the <strong>tensorflow/tensorflow:2.2.0rc2-gpu-py3-jupyter</strong> base image, which already provides Python 3 and TensorFlow</li>
    <li>Install the <strong>sagemaker-training</strong> package (the sagemaker-training-toolkit library) with pip3</li>
    <li>No training code is copied into the image; it is loaded from Amazon S3 at training time</li>
</ul>

<h3>Build and push the container</h3>
We are now ready to build this container and push it to Amazon ECR. This task is executed using a shell script stored in the ../scripts/ folder. Let's take a look at this script and then execute it.

In [3]:
! pygmentize ../scripts/build_and_push.sh

ACCOUNT_ID=$1
REGION=$2
REPO_NAME=$3

docker build -f ../docker/Dockerfile -t $REPO_NAME ../docker

docker tag $REPO_NAME $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest

$(aws ecr get-login --no-include-email --registry-ids $ACCOUNT_ID)

aws ecr describe-repositories --repository-names $REPO_NAME || aws ecr create-repository --repository-name $REPO_NAME

docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest



The script builds the Docker image, creates the ECR repository if it does not exist, and finally pushes the image to that repository. The first build takes a few minutes; Docker then caches the build outputs and reuses them in subsequent builds.

In [4]:
%%capture
! ../scripts/build_and_push.sh $account_id $region $ecr_repository_name

<h3>Training with Amazon SageMaker</h3>

Once we have pushed our container to Amazon ECR, we are ready to start training with Amazon SageMaker, which requires the ECR path of the training container as a parameter when starting a training job.

In [5]:
container_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com/{2}:latest'.format(account_id, region, ecr_repository_name)
print(container_image_uri)

057716757052.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-training-containers/tf-script-mode-container-2:latest


Since the purpose of this example is to explain how to build custom script-mode containers, we are not going to train a real model. The script that will be executed does not implement real training logic; it just prints the configuration injected by SageMaker and runs a dummy training loop. The training data is dummy data as well. Let's analyze the script first:

In [6]:
! pygmentize source_dir/train.py

from __future__ import absolute_import

import sys
import time
import os
import argparse

from utils import save_model_artifacts, print_files_in_path

def train(hp1, hp2, hp3, train_channel, validation_channel):

    print('\nList of files in train channel: ')
    print_files_in_path(os.environ['SM_CHANNEL_TRAIN'])
    
    print('\nList of files in validation channel: ')
    print_files_in_path(os.environ['SM_CHANNEL_VALIDATION'])
    
    # Dummy 

Note that the training code is implemented as a standard Python script, which the sagemaker-training-toolkit library invokes with the hyperparameters passed as command-line arguments. This way of invoking the training script is called <strong>Script Mode</strong> for Amazon SageMaker containers.
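
The listing of train.py above is cut off before its argument handling, so the following is only a sketch of how a script-mode entry point might read its inputs, not the actual script. It assumes hyperparameters arrive as `--name value` arguments and that channel directories are exposed through `SM_CHANNEL_<NAME>` environment variables. The function name `parse_args` and the default values are illustrative.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments, e.g. --hp1 value1 --hp2 300
    parser.add_argument("--hp1", type=str, default="")
    parser.add_argument("--hp2", type=int, default=0)
    parser.add_argument("--hp3", type=float, default=0.0)
    # Channel directories default to the SM_CHANNEL_<NAME> environment variables
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--validation", type=str, default=os.environ.get("SM_CHANNEL_VALIDATION"))
    # Tolerate arguments that this sketch does not declare
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--hp1", "value1", "--hp2", "300", "--hp3", "0.001"])
print(args.hp1, args.hp2, args.hp3)
```

Declaring a type for each argument converts the string values received on the command line back into numbers before the training loop uses them.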

Now, we upload some dummy data to Amazon S3, in order to define our S3-based training channels.

In [2]:
container_image_uri = '057716757052.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-training-containers/tf-script-mode-container-2:latest'
%store container_image_uri

Stored 'container_image_uri' (str)


In [7]:
! echo "val1, val2, val3" > dummy.csv
print(sagemaker_session.upload_data('dummy.csv', bucket, prefix + '/train'))
print(sagemaker_session.upload_data('dummy.csv', bucket, prefix + '/val'))
! rm dummy.csv

s3://sagemaker-ap-northeast-2-057716757052/tf-script-mode-container-2/train/dummy.csv
s3://sagemaker-ap-northeast-2-057716757052/tf-script-mode-container-2/val/dummy.csv


We want to run user-provided code that is loaded dynamically from Amazon S3, so we need to:
<ul>
    <li>Package the <strong>source_dir</strong> folder in a tar.gz archive</li>
    <li>Upload the archive to Amazon S3</li>
    <li>Specify the path to the archive in Amazon S3 as one of the parameters of the training job</li>
</ul>

<strong>Note:</strong> these steps are executed automatically by the Amazon SageMaker Python SDK when using framework estimators for MXNet, TensorFlow, etc.

In [8]:
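import os
import tarfile
import tempfile

def create_tar_file(source_files, target=None):
    """Pack source_files into a gzip-compressed tar archive and return its path."""
    if target:
        filename = target
    else:
        # mkstemp returns an open file descriptor; close it, since tarfile reopens the path
        fd, filename = tempfile.mkstemp()
        os.close(fd)

    with tarfile.open(filename, mode="w:gz") as t:
        for sf in source_files:
            # Add each file at the root of the archive, dropping its directory prefix
            t.add(sf, arcname=os.path.basename(sf))
    return filename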

create_tar_file(["source_dir/train.py", "source_dir/utils.py"], "sourcedir.tar.gz")

'sourcedir.tar.gz'
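
Because the files are added with `arcname=os.path.basename(...)`, they end up at the root of the archive, where the entry point is expected to be found. The sketch below checks that layout on placeholder files created in a temporary directory. The file contents and the temporary paths are assumptions for illustration, not the notebook's real sources.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for source_dir/ with two placeholder scripts
    src = os.path.join(tmp, "source_dir")
    os.makedirs(src)
    for name in ("train.py", "utils.py"):
        with open(os.path.join(src, name), "w") as f:
            f.write("# placeholder\n")

    # Pack the files at the archive root, as create_tar_file does
    archive = os.path.join(tmp, "sourcedir.tar.gz")
    with tarfile.open(archive, mode="w:gz") as t:
        for name in ("train.py", "utils.py"):
            t.add(os.path.join(src, name), arcname=name)

    # Read the archive back and list its member names
    with tarfile.open(archive, mode="r:gz") as t:
        members = sorted(t.getnames())

print(members)
```

If the member names carried a `source_dir/` prefix, the entry point would sit one level below the extraction root.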

In [9]:
sources = sagemaker_session.upload_data('sourcedir.tar.gz', bucket, prefix + '/code')
print(sources)
! rm sourcedir.tar.gz

s3://sagemaker-ap-northeast-2-057716757052/tf-script-mode-container-2/code/sourcedir.tar.gz


When starting the training job, we need to tell the sagemaker-training-toolkit library where the sources are stored in Amazon S3 and which module to invoke. These settings are passed through the following reserved hyperparameters, which are injected automatically when using the framework estimators of the Amazon SageMaker Python SDK:
<ul>
    <li>sagemaker_program</li>
    <li>sagemaker_submit_directory</li>
</ul>
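
Hyperparameters are submitted to the training job as a string-to-string map, and the training cell in this notebook JSON-encodes each value before passing it on. The sketch below shows what the encoded values look like. The S3 URI is a placeholder, and the assumption that the toolkit decodes each value from JSON on the container side should be checked against the toolkit documentation.

```python
import json


def json_encode_hyperparameters(hyperparameters):
    # Every key becomes a string and every value a JSON-encoded string
    return {str(k): json.dumps(v) for k, v in hyperparameters.items()}


encoded = json_encode_hyperparameters({
    "sagemaker_program": "train.py",
    # Placeholder S3 URI, for illustration only
    "sagemaker_submit_directory": "s3://example-bucket/example-prefix/code/sourcedir.tar.gz",
    "hp2": 300,
    "hp3": 0.001,
})
for key, value in encoded.items():
    print(key, "->", value)
```

String values gain an extra pair of double quotes, while numbers become their plain textual form, so the receiving side can recover the original types.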

Finally, we can execute the training job by calling the fit() method of the generic Estimator object defined in the Amazon SageMaker Python SDK (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/estimator.py). This corresponds to calling the CreateTrainingJob() API (https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html).
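
To make the correspondence with CreateTrainingJob more concrete, the sketch below assembles a request body of the kind that API accepts, without sending it anywhere. The helper name `build_training_job_request`, the job name, the role ARN, the bucket, and the volume size and runtime limit are all placeholder assumptions. The SDK fills in its own defaults, and in local mode no API call is made at all.

```python
import json


def build_training_job_request(job_name, image_uri, role_arn, hyperparameters,
                               channels, output_path, instance_type="ml.m5.xlarge"):
    """Assemble a CreateTrainingJob-style request body (illustrative, never sent)."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {"TrainingImage": image_uri, "TrainingInputMode": "File"},
        "RoleArn": role_arn,
        "HyperParameters": hyperparameters,  # string-to-string map
        "InputDataConfig": [
            {
                "ChannelName": name,
                "ContentType": "text/csv",
                "DataSource": {"S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": uri,
                    "S3DataDistributionType": "FullyReplicated",
                }},
            }
            for name, uri in channels.items()
        ],
        "OutputDataConfig": {"S3OutputPath": output_path},
        "ResourceConfig": {"InstanceType": instance_type, "InstanceCount": 1, "VolumeSizeInGB": 30},
        "StoppingCondition": {"MaxRuntimeInSeconds": 86400},
    }


request = build_training_job_request(
    job_name="tf-script-mode-container-2-example",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-repo:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleRole",
    hyperparameters={"sagemaker_program": '"train.py"', "hp2": "300"},
    channels={
        "train": "s3://example-bucket/example-prefix/train/",
        "validation": "s3://example-bucket/example-prefix/val/",
    },
    output_path="s3://example-bucket/example-prefix/output/",
)
print(json.dumps(request, indent=2))
```

Each key of the dictionary passed to fit() becomes one entry in `InputDataConfig`, which is where the channel names `train` and `validation` come from.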

In [10]:
import sagemaker
import json

# JSON encode hyperparameters.
def json_encode_hyperparameters(hyperparameters):
    return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}

hyperparameters = json_encode_hyperparameters({
    "sagemaker_program": "train.py",
    "sagemaker_submit_directory": sources,
    "hp1": "value1",
    "hp2": 300,
    "hp3": 0.001})

est = sagemaker.estimator.Estimator(container_image_uri,
                                    role,
                                    train_instance_count=1, 
                                    train_instance_type='local',
                                    base_job_name=prefix,
                                    hyperparameters=hyperparameters)

train_config = sagemaker.session.s3_input('s3://{0}/{1}/train/'.format(bucket, prefix), content_type='text/csv')
val_config = sagemaker.session.s3_input('s3://{0}/{1}/val/'.format(bucket, prefix), content_type='text/csv')

est.fit({'train': train_config, 'validation': val_config })

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
The class sagemaker.session.s3_input has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
The class sagemaker.session.s3_input has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
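
The warnings above indicate that this notebook was written against version 1 of the SageMaker Python SDK. As a hedged aid for porting, the sketch below renames the affected keyword arguments to their version 2 names: `instance_count`, `instance_type`, and `image_uri` (`image_name` is used by later cells of this notebook). The helper `to_v2_kwargs` is hypothetical. In version 2 the `s3_input` class is replaced by `sagemaker.inputs.TrainingInput`, which this helper does not cover.

```python
# Keyword renames between SDK v1 and v2 (see the migration guide linked in the warnings)
V1_TO_V2 = {
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
    "image_name": "image_uri",
}


def to_v2_kwargs(kwargs):
    """Return a copy of kwargs with v1 estimator argument names mapped to v2 names."""
    return {V1_TO_V2.get(key, key): value for key, value in kwargs.items()}


v1_kwargs = {
    "train_instance_count": 1,
    "train_instance_type": "local",
    "base_job_name": "tf-script-mode-container-2",
}
print(to_v2_kwargs(v1_kwargs))
```

Arguments that were not renamed, such as `base_job_name`, pass through unchanged.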


Building with native build. Learn about native build in Compose here: https://docs.docker.com/go/compose-native-build/
Creating p5o7iezr1v-algo-1-m9vgq ... 
Creating p5o7iezr1v-algo-1-m9vgq ... done
Attaching to p5o7iezr1v-algo-1-m9vgq
p5o7iezr1v-algo-1-m9vgq | 2021-02-25 09:39:40,135 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
p5o7iezr1v-algo-1-m9vgq | 2021-02-25 09:39:40,144 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
p5o7iezr1v-algo-1-m9vgq | 2021-02-25 09:39:40,153 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
p5o7iezr1v-algo-1-m9vgq | 2021-02-25 09:39:40,162 sagemaker-training-toolkit INFO     Invoking user script
p5o7iezr1v-algo-1-m9vgq | 
p5o7iezr1v-algo-1-m9vgq | Training Env:
p5o7iezr1v-algo-1-m9vgq | 
p5o7iezr1v-algo-1-m9vgq | {
p5o7iezr1v-algo-1-m9vgq |     "additional_framework_par

<h3>Training with a custom SDK framework estimator</h3>

As we have seen, the previous steps required us to upload our code to Amazon S3 and then inject reserved hyperparameters to run training. To simplify this, you can define a custom framework estimator with the Amazon SageMaker Python SDK and run training with that class, which takes care of these tasks.

In [11]:
from sagemaker.estimator import Framework

class CustomFramework(Framework):
    def __init__(
        self,
        entry_point,
        source_dir=None,
        hyperparameters=None,
        py_version="py3",
        framework_version=None,
        image_name=None,
        distributions=None,
        **kwargs
    ):
        super(CustomFramework, self).__init__(
            entry_point, source_dir, hyperparameters, image_name=image_name, **kwargs
        )
    
    def _configure_distribution(self, distributions):
        return
    
    def create_model(
        self,
        model_server_workers=None,
        role=None,
        vpc_config_override=None,
        entry_point=None,
        source_dir=None,
        dependencies=None,
        image_name=None,
        **kwargs
    ):
        return None
        
import sagemaker

est = CustomFramework(image_name=container_image_uri,
                      role=role,
                      entry_point='train.py',
                      source_dir='source_dir/',
                      train_instance_count=1, 
                      train_instance_type='local', # we use local mode
                      #train_instance_type='ml.m5.xlarge',
                      base_job_name=prefix,
                      hyperparameters={
                          "hp1": "value1",
                          "hp2": "300",
                          "hp3": "0.001"
                      })

train_config = sagemaker.session.s3_input('s3://{0}/{1}/train/'.format(bucket, prefix), content_type='text/csv')
val_config = sagemaker.session.s3_input('s3://{0}/{1}/val/'.format(bucket, prefix), content_type='text/csv')

est.fit({'train': train_config, 'validation': val_config })

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
image_name has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
The class sagemaker.session.s3_input has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
The class sagemaker.session.s3_input has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


Building with native build. Learn about native build in Compose here: https://docs.docker.com/go/compose-native-build/
Creating ukrq6r9sfa-algo-1-hag6w ... 
Creating ukrq6r9sfa-algo-1-hag6w ... done
Attaching to ukrq6r9sfa-algo-1-hag6w
ukrq6r9sfa-algo-1-hag6w | 2021-02-25 09:42:13,025 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
ukrq6r9sfa-algo-1-hag6w | 2021-02-25 09:42:13,034 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
ukrq6r9sfa-algo-1-hag6w | 2021-02-25 09:42:13,043 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
ukrq6r9sfa-algo-1-hag6w | 2021-02-25 09:42:13,053 sagemaker-training-toolkit INFO     Invoking user script
ukrq6r9sfa-algo-1-hag6w | 
ukrq6r9sfa-algo-1-hag6w | Training Env:
ukrq6r9sfa-algo-1-hag6w | 
ukrq6r9sfa-algo-1-hag6w | {
ukrq6r9sfa-algo-1-hag6w |     "additional_framework_par

<h3>Test in the Cloud</h3>

In [12]:
from sagemaker.estimator import Framework

class CustomFramework(Framework):
    def __init__(
        self,
        entry_point,
        source_dir=None,
        hyperparameters=None,
        py_version="py3",
        framework_version=None,
        image_name=None,
        distributions=None,
        **kwargs
    ):
        super(CustomFramework, self).__init__(
            entry_point, source_dir, hyperparameters, image_name=image_name, **kwargs
        )
    
    def _configure_distribution(self, distributions):
        return
    
    def create_model(
        self,
        model_server_workers=None,
        role=None,
        vpc_config_override=None,
        entry_point=None,
        source_dir=None,
        dependencies=None,
        image_name=None,
        **kwargs
    ):
        return None
        
import sagemaker

est = CustomFramework(image_name=container_image_uri,
                      role=role,
                      entry_point='train.py',
                      source_dir='source_dir/',
                      train_instance_count=1, 
                      # train_instance_type='local', # we use local mode
                      train_instance_type='ml.m5.xlarge',
                      base_job_name=prefix,
                      hyperparameters={
                          "hp1": "value1",
                          "hp2": "300",
                          "hp3": "0.001"
                      })

train_config = sagemaker.session.s3_input('s3://{0}/{1}/train/'.format(bucket, prefix), content_type='text/csv')
val_config = sagemaker.session.s3_input('s3://{0}/{1}/val/'.format(bucket, prefix), content_type='text/csv')

est.fit({'train': train_config, 'validation': val_config })

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
image_name has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
The class sagemaker.session.s3_input has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
The class sagemaker.session.s3_input has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


2021-02-25 09:50:23 Starting - Starting the training job...
2021-02-25 09:50:49 Starting - Launching requested ML instancesProfilerReport-1614246623: InProgress
......
2021-02-25 09:51:50 Starting - Preparing the instances for training...
2021-02-25 09:52:19 Downloading - Downloading input data...
2021-02-25 09:52:51 Training - Downloading the training image........
2021-02-25 09:54:03,194 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/usr/bin/python3 -m pip install -r requirements.txt
  from cryptography.utils import int_from_bytes
  from cryptography.utils import int_from_bytes
Collecting pandas<2
  Downloading pandas-1.1.5-cp36-cp36m-manylinux1_x86_64.whl (9.5 MB)
Collecting pytz>=2017.2
  Downloading pytz-2021.1-py2.py3-none-any.whl (510 kB)
Installing collected packages: pytz, pandas
Successfully installed pandas-1.1.5 pytz-2021.1
You should consider upgrading via the '/usr/bin/p