<h1>Script-mode Custom Training Container (2)</h1>

In [7]:
import boto3
import sagemaker
from sagemaker import get_execution_role

ecr_namespace = 'sagemaker-training-containers/'
prefix = 'tf-script-mode-container-2'

ecr_repository_name = ecr_namespace + prefix
role = get_execution_role()
account_id = role.split(':')[4]
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()

print(account_id)
print(region)
print(role)
print(bucket)

057716757052
ap-northeast-2
arn:aws:iam::057716757052:role/service-role/AmazonSageMaker-ExecutionRole-20210120T193680
sagemaker-ap-northeast-2-057716757052
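
The account ID above is read from the fifth colon-separated field of the execution role ARN. As a minimal sketch of that parsing step, the helper below validates the ARN shape before extracting the field. The name `account_id_from_arn` is hypothetical (it is not part of the SageMaker SDK), and the code assumes the standard `arn:partition:service:region:account-id:resource` layout.

```python
def account_id_from_arn(arn: str) -> str:
    """Return the 12-digit account ID embedded in an ARN.

    Hypothetical helper for illustration; assumes the layout
    arn:partition:service:region:account-id:resource.
    """
    # Split into at most 6 parts so colons inside the resource part are preserved.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not a valid ARN: {arn!r}")
    account_id = parts[4]
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError(f"ARN has no 12-digit account ID: {arn!r}")
    return account_id


role_arn = "arn:aws:iam::057716757052:role/service-role/AmazonSageMaker-ExecutionRole-20210120T193680"
print(account_id_from_arn(role_arn))
```

Compared with a bare `role.split(':')[4]`, this fails loudly if the string is not an ARN, which can help when the notebook runs outside SageMaker.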


Let's take a look at the Dockerfile, which contains the instructions for building our script-mode custom training container:

In [8]:
! pygmentize ../docker/Dockerfile

FROM tensorflow/tensorflow:2.2.0rc2-gpu-py3-jupyter

# Install sagemaker-training toolkit to enable SageMaker Python SDK
RUN pip3 install sagemaker-training


At a high level, the Dockerfile specifies the following operations for building this container:
<ul>
    <li>Start from the official <strong>tensorflow/tensorflow:2.2.0rc2-gpu-py3-jupyter</strong> image, which already provides Python 3 and TensorFlow with GPU support</li>
    <li>Install the <strong>sagemaker-training</strong> toolkit with pip3, which enables the container to work with the SageMaker Python SDK</li>
</ul>
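
To illustrate what the sagemaker-training toolkit provides to a script-mode training script, here is a minimal sketch of an argument parser for such a script. It assumes the toolkit exports `SM_MODEL_DIR` and, when a channel named `train` is configured, `SM_CHANNEL_TRAIN`. The `--epochs` hyperparameter and the function name `parse_args` are hypothetical and only serve the example.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse training arguments, defaulting to SageMaker-provided locations."""
    parser = argparse.ArgumentParser()
    # Hyperparameters reach the script as command-line arguments.
    parser.add_argument("--epochs", type=int, default=1)  # hypothetical hyperparameter
    # Inside the container the toolkit sets SM_* environment variables;
    # the fallbacks are the conventional /opt/ml paths.
    parser.add_argument(
        "--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    )
    parser.add_argument(
        "--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
    )
    return parser.parse_args(argv)


# Example invocation with an explicit argument list (assumed values).
args = parse_args(["--epochs", "3"])
print(f"epochs={args.epochs} train={args.train} model_dir={args.model_dir}")
```

Reading the paths from environment variables with fallbacks lets the same script run both inside the container and locally.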

<h3>Build and push the container</h3>
We are now ready to build this container and push it to Amazon ECR. This task is executed using a shell script stored in the ../scripts/ folder. Let's take a look at this script and then execute it.

In [9]:
! pygmentize ../scripts/build_and_push.sh

ACCOUNT_ID=$1
REGION=$2
REPO_NAME=$3

docker build -f ../docker/Dockerfile -t $REPO_NAME ../docker

docker tag $REPO_NAME $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest

$(aws ecr get-login --no-include-email --registry-ids $ACCOUNT_ID)

aws ecr describe-repositories --repository-names $REPO_NAME || aws ecr create-repository --repository-name $REPO_NAME

docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest


<hr>

The script builds the Docker image and tags it with the ECR repository URI, logs in to Amazon ECR, creates the repository if it does not exist, and finally pushes the image. The first build takes a few minutes; afterwards, Docker caches the build layers and reuses them for subsequent builds.
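
The create-if-missing step of the script can also be expressed in Python. The sketch below is an illustration, not the script's actual implementation: `ensure_repository` is a hypothetical helper, and it assumes that `ecr_client` behaves like the object returned by `boto3.client("ecr")`.

```python
def ensure_repository(ecr_client, repository_name: str) -> str:
    """Return the repository URI, creating the repository if it is missing.

    Hypothetical helper; `ecr_client` is assumed to behave like
    boto3.client("ecr").
    """
    try:
        # Succeeds only if the repository already exists.
        response = ecr_client.describe_repositories(repositoryNames=[repository_name])
        return response["repositories"][0]["repositoryUri"]
    except ecr_client.exceptions.RepositoryNotFoundException:
        # Mirrors the `describe-repositories || create-repository` fallback.
        response = ecr_client.create_repository(repositoryName=repository_name)
        return response["repository"]["repositoryUri"]
```

In the notebook this could be called as `ensure_repository(boto3.client("ecr", region_name=region), ecr_repository_name)`. Passing the client in as a parameter keeps the function easy to exercise with a stub.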

In [10]:
# %%capture
! ../scripts/build_and_push.sh $account_id $region $ecr_repository_name

Sending build context to Docker daemon  3.584kB
Step 1/2 : FROM tensorflow/tensorflow:2.2.0rc2-gpu-py3-jupyter
 ---> 7fdb30eac076
Step 2/2 : RUN pip3 install sagemaker-training
 ---> Using cache
 ---> dfc24fef0115
Successfully built dfc24fef0115
Successfully tagged sagemaker-training-containers/tf-script-mode-container-2:latest
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
{
    "repositories": [
        {
            "repositoryArn": "arn:aws:ecr:ap-northeast-2:057716757052:repository/sagemaker-training-containers/tf-script-mode-container-2",
            "registryId": "057716757052",
            "repositoryName": "sagemaker-training-containers/tf-script-mode-container-2",
            "repositoryUri": "057716757052.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-training-containers/tf-script-mode-container-2",
            "createdAt": 1614241631.0,
            "imageTagMutability": "MUTABLE",
            "imageScanningConfiguration": {
  

<h3>Training with Amazon SageMaker</h3>

Once the container image has been pushed to Amazon ECR, we are ready to start training with Amazon SageMaker. Starting a training job requires the ECR URI of the Docker image used for training as a parameter.

In [11]:
container_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com/{2}:latest'.format(account_id, region, ecr_repository_name)
print(container_image_uri)

057716757052.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-training-containers/tf-script-mode-container-2:latest
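
To show where the image URI ends up, here is a minimal sketch that assembles a request for the low-level SageMaker `CreateTrainingJob` API. The helper name `build_training_job_request`, the job name, the output prefix, and the default instance settings are assumptions made for illustration; only the request field names come from the API.

```python
def build_training_job_request(job_name, image_uri, role_arn, output_s3_uri,
                               instance_type="ml.m5.xlarge", instance_count=1,
                               volume_size_gb=30, max_runtime_seconds=3600):
    """Assemble keyword arguments for the SageMaker CreateTrainingJob API.

    Hypothetical helper; the default resource values are illustrative.
    """
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            # The custom container pushed to ECR is referenced here.
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": volume_size_gb,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }


request = build_training_job_request(
    job_name="tf-script-mode-container-2-demo",  # assumed job name
    image_uri="057716757052.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-training-containers/tf-script-mode-container-2:latest",
    role_arn="arn:aws:iam::057716757052:role/service-role/AmazonSageMaker-ExecutionRole-20210120T193680",
    output_s3_uri="s3://sagemaker-ap-northeast-2-057716757052/tf-script-mode-container-2/output",  # assumed prefix
)
print(request["AlgorithmSpecification"]["TrainingImage"])
```

The resulting dictionary could then be submitted with `boto3.client("sagemaker").create_training_job(**request)`. The SageMaker Python SDK's `Estimator` wraps the same information behind its `image_uri` argument.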


Before uploading some dummy data to Amazon S3 to define our S3-based training channels, we persist the image URI with the IPython %store magic so that other notebooks can load it with %store -r.

In [12]:
# container_image_uri = '057716757052.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-training-containers/tf-script-mode-container-2:latest'
%store container_image_uri

Stored 'container_image_uri' (str)
