<h1>Custom Framework Container</h1>

This notebook demonstrates how to build and use a simple custom Docker container for training with Amazon SageMaker that uses the sagemaker-training-toolkit library to define a framework container.
A framework container is similar to a script-mode container, but in addition it loads a Python framework module that is used to configure the framework and then run the user-provided module.

Reference documentation is available at https://github.com/aws/sagemaker-training-toolkit

Before creating our LightGBM container, let's create a simple model and test it locally:

In [2]:
!pip install lightgbm==2.3.1



In [3]:
import os

import lightgbm as lgb

from sklearn import datasets
import pandas as pd
import numpy as np
import joblib

In [4]:
iris = datasets.load_iris()

X=iris.data
y=iris.target

dataset = np.insert(iris.data, 0, iris.target,axis=1)

df = pd.DataFrame(data=dataset, columns=['iris_id'] + iris.feature_names)
## We'll also save the dataset with a header, given that we'll need it to create a baseline for the monitoring
df['species'] = df['iris_id'].map(lambda x: 'setosa' if x == 0 else 'versicolor' if x == 1 else 'virginica')

df.head()

Unnamed: 0,iris_id,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species
0,0.0,5.1,3.5,1.4,0.2,setosa
1,0.0,4.9,3.0,1.4,0.2,setosa
2,0.0,4.7,3.2,1.3,0.2,setosa
3,0.0,4.6,3.1,1.5,0.2,setosa
4,0.0,5.0,3.6,1.4,0.2,setosa


In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)

In [6]:
gbm = lgb.LGBMClassifier(objective='multiclass',
                        num_class=len(np.unique(y)),
                        random_state=10)

In [7]:
gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_names='[validation_softmax]',
        eval_metric='softmax',
        early_stopping_rounds=5,
        verbose=5)

Training until validation scores don't improve for 5 rounds
[5]	[validation_softmax]'s multi_logloss: 0.66223
[10]	[validation_softmax]'s multi_logloss: 0.449162
[15]	[validation_softmax]'s multi_logloss: 0.328279
[20]	[validation_softmax]'s multi_logloss: 0.256717
[25]	[validation_softmax]'s multi_logloss: 0.212008
[30]	[validation_softmax]'s multi_logloss: 0.184434
[35]	[validation_softmax]'s multi_logloss: 0.168658
[40]	[validation_softmax]'s multi_logloss: 0.158086
[45]	[validation_softmax]'s multi_logloss: 0.152053
[50]	[validation_softmax]'s multi_logloss: 0.147186
[55]	[validation_softmax]'s multi_logloss: 0.143228
[60]	[validation_softmax]'s multi_logloss: 0.140089
[65]	[validation_softmax]'s multi_logloss: 0.139879
Early stopping, best iteration is:
[63]	[validation_softmax]'s multi_logloss: 0.139111


LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_class=3, num_leaves=31,
               objective='multiclass', random_state=10, reg_alpha=0.0,
               reg_lambda=0.0, silent=True, subsample=1.0,
               subsample_for_bin=200000, subsample_freq=0)

In [8]:
from sklearn.metrics import f1_score

In [9]:
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)

In [10]:
y_pred

array([2, 1, 0, 1, 2, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 2, 1, 1, 2, 1, 2, 1,
       0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 0, 1, 0, 0, 2, 1, 2, 1, 1, 1, 0, 0,
       2, 1, 2, 1, 1, 2])

In [11]:
score = f1_score(y_test,y_pred,labels=[0.0,1.0,2.0],average='micro')

In [12]:
score

0.94
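
For single-label multiclass problems like this one, micro-averaged F1 pools true positives, false positives and false negatives over all classes, which makes it equal to plain accuracy. The following is a minimal pure-Python sketch of that arithmetic on toy labels; the function name `micro_f1` and the toy data are illustrative and are not part of the notebook or of scikit-learn.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 from counts pooled over all classes."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1  # correct prediction of this class
            elif pred == label:
                fp += 1  # predicted this class, truth was another
            elif truth == label:
                fn += 1  # missed an instance of this class
    return 2 * tp / (2 * tp + fp + fn)


# Toy labels (illustrative only, not the Iris predictions above).
y_true_toy = [0, 1, 2, 2]
y_pred_toy = [0, 1, 1, 2]
accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print(micro_f1(y_true_toy, y_pred_toy), accuracy)  # 0.75 0.75
```

Each misclassification adds exactly one false positive and one false negative, so the two numbers always agree. On the 50 test rows above, a score of 0.94 therefore corresponds to 47 correct predictions.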

Let's save the train and test datasets to a local folder and turn the code into a script that will be executed by SageMaker:

In [13]:
# Create directory and write csv
os.makedirs('./data', exist_ok=True)
# np.savetxt('./data/mimic.csv', data, delimiter=',', fmt='%1.1f, %1.3f, %1.3f, %1.3f, %1.3f')

In [14]:
np_data_raw = np.concatenate((X, np.expand_dims(y, axis=1)), axis=1)
np.savetxt('./data/iris.csv', np_data_raw, delimiter=',', fmt='%1.1f, %1.1f, %1.1f, %1.1f, %1.0f')

np_data_train = np.concatenate((X_train, np.expand_dims(y_train, axis=1)), axis=1)
np.savetxt('./data/iris_train.csv', np_data_train, delimiter=',', fmt='%1.1f, %1.1f, %1.1f, %1.1f, %1.0f')

np_data_test = np.concatenate((X_test, np.expand_dims(y_test, axis=1)), axis=1)
np.savetxt('./data/iris_test.csv', np_data_test, delimiter=',', fmt='%1.1f, %1.1f, %1.1f, %1.1f, %1.0f')
np.savetxt('./data/iris_test_no_label.csv', X_test, delimiter=',', fmt='%1.1f, %1.1f, %1.1f, %1.1f')
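
The files written above have no header row and store the label in the last column, so any script that consumes them has to split each row accordingly. Here is a minimal sketch of that split using only the standard library; the helper name `load_features_and_labels` and the two sample rows are assumptions made for illustration, and a real script might well use pandas or NumPy instead.

```python
import csv
import io


def load_features_and_labels(handle):
    """Split header-less rows into feature vectors and integer labels (last column)."""
    features, labels = [], []
    for row in csv.reader(handle, skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        *values, label = row
        features.append([float(v) for v in values])
        labels.append(int(float(label)))
    return features, labels


# Two rows in the same layout as iris_train.csv (values are illustrative).
sample = io.StringIO("5.1, 3.5, 1.4, 0.2, 0\n6.3, 3.3, 6.0, 2.5, 2\n")
features, labels = load_features_and_labels(sample)
print(features[0], labels)
```

Passing an open file handle instead of the `StringIO` object reads a file from disk in the same way.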

In [14]:
# %%writefile

import os
import sys

import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import logging
import joblib

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)

gbm = lgb.LGBMClassifier(objective='multiclass',
                        num_class=len(np.unique(y)))

gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_names='[validation_softmax]',
        eval_metric='softmax',
        early_stopping_rounds=5,
        verbose=5)

from sklearn.metrics import f1_score

y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)

score = f1_score(y_test,y_pred,labels=[0.0,1.0,2.0],average='micro')

logger.info(f'[F1 score] {score}')

logger.info('Saving model...')
joblib.dump(gbm, 'model.joblib')

Training until validation scores don't improve for 5 rounds
[5]	[validation_softmax]'s multi_logloss: 0.66223
[10]	[validation_softmax]'s multi_logloss: 0.449162
[15]	[validation_softmax]'s multi_logloss: 0.328279
[20]	[validation_softmax]'s multi_logloss: 0.256717
[25]	[validation_softmax]'s multi_logloss: 0.212008
[30]	[validation_softmax]'s multi_logloss: 0.184434
[35]	[validation_softmax]'s multi_logloss: 0.168658
[40]	[validation_softmax]'s multi_logloss: 0.158086
[45]	[validation_softmax]'s multi_logloss: 0.152053
[50]	[validation_softmax]'s multi_logloss: 0.147186
[55]	[validation_softmax]'s multi_logloss: 0.143228
[60]	[validation_softmax]'s multi_logloss: 0.140089
[65]	[validation_softmax]'s multi_logloss: 0.139879
Early stopping, best iteration is:
[63]	[validation_softmax]'s multi_logloss: 0.139111
[F1 score] 0.94
Saving model...


['model.joblib']

In [15]:
gbm_loaded = joblib.load('model.joblib')
gbm_loaded

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_class=3, num_leaves=31,
               objective='multiclass', random_state=None, reg_alpha=0.0,
               reg_lambda=0.0, silent=True, subsample=1.0,
               subsample_for_bin=200000, subsample_freq=0)

---

We start by defining some variables like the current execution role, the ECR repository that we are going to use for pushing the custom Docker container and a default Amazon S3 bucket to be used by Amazon SageMaker.

In [15]:
import boto3
import sagemaker
from sagemaker import get_execution_role

ecr_namespace = 'sagemaker-training-containers/'
prefix = 'framework-container'

ecr_repository_name = ecr_namespace + prefix
role = get_execution_role()
account_id = role.split(':')[4]
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()

print(account_id)
print(region)
print(role)
print(bucket)

725879053979
us-east-1
arn:aws:iam::725879053979:role/MLOps
sagemaker-us-east-1-725879053979


Let's take a look at the Dockerfile which defines the statements for building our custom framework container:

In [16]:
!pygmentize ../docker/Dockerfile

# Part of the implementation of this container is based on the Amazon SageMaker Apache MXNet container.
# https://github.com/aws/sagemaker-mxnet-container

FROM ubuntu:16.04

LABEL maintainer="Amazon AI"

# Defining some variables used at build time to install Python3
ARG PYTHON=python3
ARG PYTHON_PIP=python3-pip
ARG PIP=pip3
ARG PYTHON_VERSION=3.6.6

# Install some handful libraries like curl, wget, git, build-essential, zlib
RUN apt-get update && apt-get install -y --no-install-recommends software-properties-common && \
    add-apt-repository ppa:deadsnakes/ppa -y && \
    apt-get update && apt-get install -y --no-install-recommends \
   

At a high level, the Dockerfile specifies the following operations for building this container:
<ul>
    <li>Start from Ubuntu 16.04</li>
    <li>Define some variables to be used at build time to install Python 3</li>
    <li>Install some handy libraries with apt-get</li>
    <li>Install Python 3 and create a symbolic link</li>
    <li>Copy a .tar.gz package named <strong>custom_lightgbm_framework-1.0.0.tar.gz</strong> into the WORKDIR</li>
    <li>Install some Python libraries like numpy, pandas, scikit-learn <strong>and the package copied at the previous step</strong></li>
    <li>Set a few environment variables, including PYTHONUNBUFFERED, which is used to avoid buffering Python standard output (useful for logging)</li>
    <li>Finally, set the value of the environment variable <strong>SAGEMAKER_TRAINING_MODULE</strong> to a Python module in the training package we installed</li>
</ul>
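
The value of SAGEMAKER_TRAINING_MODULE follows the `package.module:function` convention; in this notebook it is `custom_lightgbm_framework.training:main`, as the `framework_module` entry of SM_TRAINING_ENV shows further down. The sketch below illustrates how such a string can be resolved to a callable with `importlib`. The helper `resolve_entry` is hypothetical, and the toolkit's own loading code may differ in its details.

```python
import importlib


def resolve_entry(spec):
    """Resolve a 'package.module:function' string to the callable it names."""
    module_name, _, function_name = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, function_name)


# A standard-library target is used here so the sketch runs anywhere;
# in the container the value would be custom_lightgbm_framework.training:main.
dumps = resolve_entry("json:dumps")
print(dumps({"ok": True}))
```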

<h2>Training module</h2>

When looking at the Dockerfile above, you might be asking yourself what the <strong>custom_lightgbm_framework-1.0.0.tar.gz</strong> package is.
When building a framework container, sagemaker-training-toolkit allows you to specify a framework module that will be run first, and then invoke a user-provided module.

The advantage of using this approach is that you can use the framework module to configure the framework of choice or apply any settings related to the libraries installed in the environment, and then run the user module (we will see shortly how).

Our framework module is part of a Python package - that you can find in the folder ../package/ - distributed as a .tar.gz by the Python setuptools library (https://setuptools.readthedocs.io/en/latest/).

Setuptools uses a setup.py file to build the package. Following is the content of this file:

In [17]:
!pygmentize ../package/setup.py

[34mfrom[39;49;00m [04m[36m__future__[39;49;00m [34mimport[39;49;00m absolute_import

[34mfrom[39;49;00m [04m[36mglob[39;49;00m [34mimport[39;49;00m glob
[34mimport[39;49;00m [04m[36mos[39;49;00m
[34mfrom[39;49;00m [04m[36mos[39;49;00m[04m[36m.[39;49;00m[04m[36mpath[39;49;00m [34mimport[39;49;00m basename
[34mfrom[39;49;00m [04m[36mos[39;49;00m[04m[36m.[39;49;00m[04m[36mpath[39;49;00m [34mimport[39;49;00m splitext

[34mfrom[39;49;00m [04m[36msetuptools[39;49;00m [34mimport[39;49;00m find_packages, setup

setup(
    name=[33m'[39;49;00m[33mcustom_lightgbm_framework[39;49;00m[33m'[39;49;00m,
    version=[33m'[39;49;00m[33m1.0.0[39;49;00m[33m'[39;49;00m,
    description=[33m'[39;49;00m[33mCustom framework container training package for LightGBM.[39;49;00m[33m'[39;49;00m,
    keywords=[33m"[39;49;00m[33mcustom framework container training package SageMaker LightGBM[39;49;00m[33m"[39;49;00m,

    packages=find_packa

This build script looks at the packages under the local src/ path and specifies the dependency on sagemaker-training. The training module contains the following code:

In [18]:
!pygmentize ../package/src/custom_lightgbm_framework/training.py

[34mfrom[39;49;00m [04m[36m__future__[39;49;00m [34mimport[39;49;00m absolute_import

[34mimport[39;49;00m [04m[36mlogging[39;49;00m
[34mfrom[39;49;00m [04m[36msagemaker_training[39;49;00m [34mimport[39;49;00m entry_point, environment

logger = logging.getLogger([31m__name__[39;49;00m)

[34mdef[39;49;00m [32mtrain[39;49;00m(training_env):
    logger.info([33m'[39;49;00m[33mInvoking user training script.[39;49;00m[33m'[39;49;00m)

    entry_point.run(
        training_env.module_dir,
        training_env.user_entry_point,
        training_env.to_cmd_args(),
        training_env.to_env_vars()
    )

[34mdef[39;49;00m [32mmain[39;49;00m():
    training_env = environment.Environment()
    train(training_env)


The idea here is that we will use the <strong>entry_point.run()</strong> function of the sagemaker-training-toolkit library to execute the user-provided module.
You might want to set additional framework-level configurations (e.g. parameter servers) before calling the user module.
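
As a rough illustration of such a framework-level configuration, the sketch below adds a default thread-count variable to the environment that is handed to the user module. The helper `framework_env_vars` and the choice of `OMP_NUM_THREADS` are assumptions made for this example, not something the package above does; it also assumes that the toolkit's `Environment` object exposes a `num_cpus` attribute matching the `num_cpus` entry of SM_TRAINING_ENV.

```python
def framework_env_vars(num_cpus, base_env=None):
    """Return the environment for the user module, with framework defaults added.

    OMP_NUM_THREADS is a hypothetical example of a framework-level setting.
    """
    env = dict(base_env or {})
    # Respect a value that is already set; otherwise default to all CPUs.
    env.setdefault("OMP_NUM_THREADS", str(num_cpus))
    return env


# In train(), the result could be passed as the last argument of
# entry_point.run() in place of training_env.to_env_vars(), for example:
#   framework_env_vars(training_env.num_cpus, training_env.to_env_vars())
print(framework_env_vars(4, {"SM_MODEL_DIR": "/opt/ml/model"}))
```

Using `setdefault` keeps the framework setting overridable, so a user who exports their own value is not silently overruled.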

<h3>Build and push the container</h3>
We are now ready to build this container and push it to Amazon ECR. This task is executed using a shell script stored in the ../scripts/ folder. Let's take a look at this script and then execute it.

In [19]:
! pygmentize ../scripts/build_and_push.sh

ACCOUNT_ID=$1
REGION=$2
REPO_NAME=$3

cd ../package/ && python setup.py sdist && cp dist/custom_lightgbm_framework-1.0.0.tar.gz ../docker/code/

docker build -f ../docker/Dockerfile -t $REPO_NAME ../docker

docker tag $REPO_NAME $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest

$(aws ecr get-login --no-include-email --registry-ids $ACCOUNT_ID)

aws ecr describe-repositories --repository-names $REPO_NAME || aws ecr create-repository --repository-name $REPO_NAME

docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest


---
First, the script runs the <strong>setup.py</strong> to create the training package, which is copied under <strong>../docker/code/</strong>.

Then it builds the Docker container, creates the repository if it does not exist, and finally pushes the container to the ECR repository. The first build takes a few minutes; Docker then caches the build outputs and reuses them in subsequent builds.

In [20]:
! ../scripts/build_and_push.sh $account_id $region $ecr_repository_name

----
**TODO Add local test**

```
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"train":"/opt/ml/input/data/train","validation":"/opt/ml/input/data/validation"},"current_host":"algo-1-5onks","framework_module":"custom_lightgbm_framework.training:main","hosts":["algo-1-5onks"],"hyperparameters":{},"input_config_dir":"/opt/ml/input/config","input_data_config":{"train":{"ContentType":"text/csv","TrainingInputMode":"File"},"validation":{"ContentType":"text/csv","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"sagemaker-custom-2020-08-02-03-57-03-742","log_level":20,"master_hostname":"algo-1-5onks","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-1-725879053979/sagemaker-custom/code/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1-5onks","hosts":["algo-1-5onks"]},"user_entry_point":"train.py"}
```
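
SM_TRAINING_ENV is a single JSON document, so code running in the container can decode it with the standard `json` module. Below is a minimal sketch of that; the helper `read_training_env` is hypothetical, and the sample value is a trimmed-down version of the one shown above.

```python
import json
import os


def read_training_env(environ=os.environ):
    """Decode SM_TRAINING_ENV, returning an empty dict when it is not set."""
    return json.loads(environ.get("SM_TRAINING_ENV", "{}"))


# Trimmed-down sample with a few of the keys shown above.
sample_environ = {
    "SM_TRAINING_ENV": json.dumps({
        "channel_input_dirs": {"train": "/opt/ml/input/data/train"},
        "user_entry_point": "train.py",
        "num_cpus": 4,
    })
}
env = read_training_env(sample_environ)
print(env["channel_input_dirs"]["train"], env["user_entry_point"])
```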

See environment.py in sagemaker-training-toolkit:

https://github.com/aws/sagemaker-training-toolkit/blob/74722fab9c9a9138b350df2cf54a204e2ad790c4/src/sagemaker_training/environment.py#L311

```
!sudo rm -rf train_tests && mkdir -p train_tests
with open("train_tests/vars.env", "w") as f:
    f.write("AWS_ACCOUNT_ID=%s\n" % account_id)
    f.write("IMAGE_TAG=%s\n" % image_tag)
    f.write("AWS_DEFAULT_REGION=%s\n" % region)

    (...)

!cat train_tests/vars.env

```

Pass env vars to docker:

https://docs.docker.com/engine/reference/commandline/run/#set-environment-variables--e---env---env-file

!docker run --env-file vars.env \<IMG> train
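
An env file for `docker run --env-file` holds one `KEY=VALUE` pair per line, and Docker takes the value literally, without interpreting quotes. The sketch below writes such a file from a dictionary. The helper `write_env_file` and all the values are hypothetical; a real local test would use settings that match your own account and job.

```python
import json
import os
import tempfile


def write_env_file(path, variables):
    """Write one KEY=VALUE pair per line, the layout `docker run --env-file` reads."""
    with open(path, "w") as f:
        for key, value in variables.items():
            f.write("%s=%s\n" % (key, value))


# Hypothetical values; replace them with the ones that match your setup.
variables = {
    "AWS_DEFAULT_REGION": "us-east-1",
    "SM_TRAINING_ENV": json.dumps({"user_entry_point": "train.py"}),
}
env_path = os.path.join(tempfile.mkdtemp(), "vars.env")
write_env_file(env_path, variables)

with open(env_path) as f:
    print(f.read())
```

The path given to `--env-file` is resolved relative to the directory where `docker run` is executed, so it has to point at the file that was actually written.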

In [21]:
!docker run --env-file vars.env sagemaker-training-containers/framework-container:latest train 

docker: open vars.env: no such file or directory.
See 'docker run --help'.


<h3>Training with Amazon SageMaker</h3>

Once the container has been pushed to Amazon ECR, we are ready to start training with Amazon SageMaker, which requires the ECR URI of the training image as a parameter when starting a training job.

In [22]:
container_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com/{2}:latest'.format(account_id, region, ecr_repository_name)
print(container_image_uri)

725879053979.dkr.ecr.us-east-1.amazonaws.com/sagemaker-training-containers/framework-container:latest


Given the purpose of this example is explaining how to build custom framework containers, we are not going to train a real model. The script that will be executed does not define a specific training logic; it just outputs the configurations injected by SageMaker and implements a dummy training loop. Training data is also dummy. Let's analyze the script first:

In [23]:
! pygmentize source_dir/train.py

[34mimport[39;49;00m [04m[36margparse[39;49;00m
[34mimport[39;49;00m [04m[36mos[39;49;00m
[34mimport[39;49;00m [04m[36mpandas[39;49;00m [34mas[39;49;00m [04m[36mpd[39;49;00m
[34mimport[39;49;00m [04m[36mnumpy[39;49;00m [34mas[39;49;00m [04m[36mnp[39;49;00m
[34mimport[39;49;00m [04m[36mlogging[39;49;00m
[34mimport[39;49;00m [04m[36msys[39;49;00m

[34mimport[39;49;00m [04m[36mjoblib[39;49;00m

[34mimport[39;49;00m [04m[36mlightgbm[39;49;00m [34mas[39;49;00m [04m[36mlgb[39;49;00m
[34mimport[39;49;00m [04m[36mpandas[39;49;00m [34mas[39;49;00m [04m[36mpd[39;49;00m
[34mimport[39;49;00m [04m[36mnumpy[39;49;00m [34mas[39;49;00m [04m[36mnp[39;49;00m
[34mfrom[39;49;00m [04m[36msklearn[39;49;00m[04m[36m.[39;49;00m[04m[36mmetrics[39;49;00m [34mimport[39;49;00m f1_score
[34mfrom[39;49;00m [04m[36msklearn[39;49;00m[04m[36m.[39;49;00m[04m[36mmodel_selection[39;49;00m [34mimport[39;49;00m train_test_sp

Note that the training code is implemented as a standard Python script, which the framework container code invokes as a module, passing hyperparameters as arguments.
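
Because hyperparameters arrive as command-line arguments, a script of this kind typically starts by parsing them with `argparse`. The sketch below shows one way to do that. The hyperparameter names `num_leaves` and `learning_rate` are hypothetical, and the sketch assumes that the toolkit sets the `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` environment variables, falling back to the paths listed in SM_TRAINING_ENV otherwise; it is not the content of the truncated script above.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided locations."""
    parser = argparse.ArgumentParser()
    # Hypothetical hyperparameters, passed to the script as `--name value`.
    parser.add_argument("--num_leaves", type=int, default=31)
    parser.add_argument("--learning_rate", type=float, default=0.1)
    # Locations default to environment variables, then to fixed paths.
    parser.add_argument(
        "--model_dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument(
        "--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    # Ignore any extra arguments rather than failing on them.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--num_leaves", "15"])
print(args.num_leaves, args.learning_rate)  # 15 0.1
```

Using `parse_known_args` keeps the script tolerant of arguments it does not declare.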

Now, we upload the Iris train and test files to Amazon S3, in order to define our S3-based training channels.

In [24]:
# Save data in S3 for training with SageMaker
prefix = 'sagemaker-custom'
data_dir = 'data'
input_train = sagemaker_session.upload_data('data/iris_train.csv', key_prefix="{}/{}".format(prefix, data_dir) )
input_test = sagemaker_session.upload_data('data/iris_test.csv', key_prefix="{}/{}".format(prefix, data_dir) )

In [25]:
sagemaker_session.list_s3_files(sagemaker_session.default_bucket(), prefix+'/'+data_dir)

['sagemaker-custom/data/iris_test.csv', 'sagemaker-custom/data/iris_train.csv']

In [26]:
input_train, input_test

('s3://sagemaker-us-east-1-725879053979/sagemaker-custom/data/iris_train.csv',
 's3://sagemaker-us-east-1-725879053979/sagemaker-custom/data/iris_test.csv')

Framework containers can run user-provided code dynamically by loading it from Amazon S3, so we need to:
<ul>
    <li>Package the <strong>source_dir</strong> folder in a tar.gz archive</li>
    <li>Upload the archive to Amazon S3</li>
    <li>Specify the path to the archive in Amazon S3 as one of the parameters of the training job</li>
</ul>

<strong>Note:</strong> these steps are executed automatically by the Amazon SageMaker Python SDK when using framework estimators for MXNet, TensorFlow, etc.

In [27]:
import tarfile
import tempfile  # used below when no target file name is given
import os

def create_tar_file(source_files, target=None):
    if target:
        filename = target
    else:
        _, filename = tempfile.mkstemp()

    with tarfile.open(filename, mode="w:gz") as t:
        for sf in source_files:
            # Add all files from the directory into the root of the directory structure of the tar
            t.add(sf, arcname=os.path.basename(sf))
    return filename

create_tar_file(["source_dir/train.py"], "sourcedir.tar.gz")

'sourcedir.tar.gz'

In [28]:
sources = sagemaker_session.upload_data('sourcedir.tar.gz', bucket, prefix + '/code')
print(sources)
! rm sourcedir.tar.gz

s3://sagemaker-us-east-1-725879053979/sagemaker-custom/code/sourcedir.tar.gz


When starting the training job, we need to tell the sagemaker-training-toolkit library where the sources are stored in Amazon S3 and which module to invoke. These parameters are specified through the following reserved hyperparameters (which are injected automatically when using the framework estimators of the Amazon SageMaker Python SDK):
<ul>
    <li>sagemaker_program</li>
    <li>sagemaker_submit_directory</li>
</ul>
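
The CreateTrainingJob API accepts hyperparameter values only as strings, so each value is JSON-encoded before the job is submitted and can be decoded to its original type afterwards. A small sketch of that round trip follows; the helper name `encode_hyperparameters` and the keys `hp2` and `hp3` are purely illustrative.

```python
import json


def encode_hyperparameters(hyperparameters):
    """JSON-encode every value so that the whole mapping is string-to-string."""
    return {str(k): json.dumps(v) for k, v in hyperparameters.items()}


encoded = encode_hyperparameters({
    "sagemaker_program": "train.py",
    "hp2": 300,
    "hp3": 0.001,
})
print(encoded)

# Decoding restores the original Python types.
decoded = {k: json.loads(v) for k, v in encoded.items()}
print(decoded)
```

Note that an encoded string keeps its surrounding double quotes, which is how a string such as `"300"` stays distinguishable from the number `300`.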

Finally, we can execute the training job by calling the fit() method of the generic Estimator object defined in the Amazon SageMaker Python SDK (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/estimator.py). This corresponds to calling the CreateTrainingJob() API (https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html).

In [29]:
container_image_uri

'725879053979.dkr.ecr.us-east-1.amazonaws.com/sagemaker-training-containers/framework-container:latest'

In [30]:
print(input_train, 
      '\n',
      input_test)

train_config = sagemaker.session.s3_input(input_train, content_type='text/csv')
test_config = sagemaker.session.s3_input(input_test, content_type='text/csv')

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


s3://sagemaker-us-east-1-725879053979/sagemaker-custom/data/iris_train.csv 
 s3://sagemaker-us-east-1-725879053979/sagemaker-custom/data/iris_test.csv


In [42]:
sources

's3://sagemaker-us-east-1-725879053979/sagemaker-custom/code/sourcedir.tar.gz'

In [31]:
import sagemaker
import json

# JSON encode hyperparameters.
def json_encode_hyperparameters(hyperparameters):
    return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}

hyperparameters = json_encode_hyperparameters({
    "sagemaker_program": "train.py",
    "sagemaker_submit_directory": sources})
#     "hp1": "value1",
#     "hp2": 300,
#     "hp3": 0.001}
# )

estimator = sagemaker.estimator.Estimator(container_image_uri,
                                    role,
                                    train_instance_count=1, 
                                    train_instance_type='local',
                                    base_job_name=prefix,
                                    hyperparameters=hyperparameters)



In [32]:
estimator.fit({'train': train_config, 'validation': test_config })

Creating tmp4i4vxpis_algo-1-h14y5_1 ... done
Attaching to tmp4i4vxpis_algo-1-h14y5_1
algo-1-h14y5_1  | 2020-08-02 22:53:16,664 sagemaker-training-toolkit INFO     Imported framework custom_lightgbm_framework.training
algo-1-h14y5_1  | 2020-08-02 22:53:16,666 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
algo-1-h14y5_1  | 2020-08-02 22:53:16,679 custom_lightgbm_framework.training INFO     Invoking user training script.
algo-1-h14y5_1  | 2020-08-02 22:53:16,816 sagemaker-training-toolkit INFO     Module train.py does not provide a setup.py. 
algo-1-h14y5_1  | Generating setup.py
algo-1-h14y5_1  | 2020-08-02 22:53:16,816 sagemaker-training-toolkit INFO     Generating setup.cfg
algo-1-h14y5_1  | 2020-08-02 22:53:16,816 sagemaker-training-toolkit INFO     Generating MANIFEST.in
algo-1-h14y5_1  | 2020-08-02 22:53:16,816 sagemaker-training-toolkit INFO     Installing module w

<h3>Training with a custom SDK framework estimator</h3>

As you have seen, in the previous steps we had to upload our code to Amazon S3 and then inject reserved hyperparameters to execute training. In order to facilitate this task, you can also try defining a custom framework estimator using the Amazon SageMaker Python SDK and run training with that class, which will take care of managing these tasks.

In [39]:
%%writefile custom_framework.py
from sagemaker.estimator import Framework

class MyLightGBMFramework(Framework):
    def __init__(
        self,
        entry_point,
        source_dir=None,
        hyperparameters=None,
        py_version="py3",
        framework_version=None,
        image_name=None,
        distributions=None,
        **kwargs
    ):
        super(MyLightGBMFramework, self).__init__(
            entry_point, source_dir, hyperparameters, image_name=image_name, **kwargs
        )
    
    def _configure_distribution(self, distributions):
        return
    
    def create_model(
        self,
        model_server_workers=None,
        role=None,
        vpc_config_override=None,
        entry_point=None,
        source_dir=None,
        dependencies=None,
        image_name=None,
        **kwargs
    ):
        return None

Writing custom_framework.py


In [40]:
import sagemaker
from custom_framework import MyLightGBMFramework

framework = MyLightGBMFramework(image_name=container_image_uri,
                          role=role,
                          entry_point='train.py',
                          source_dir='source_dir/',
                          train_instance_count=1, 
                          train_instance_type='local', # we use local mode
                          #train_instance_type='ml.m5.xlarge',
                          base_job_name=prefix,
                          hyperparameters={})

framework.fit({'train': train_config, 'validation': test_config })



Creating tmponpwe5bu_algo-1-z5s7p_1 ... done
Attaching to tmponpwe5bu_algo-1-z5s7p_1
algo-1-z5s7p_1  | 2020-08-02 23:22:49,253 sagemaker-training-toolkit INFO     Imported framework custom_lightgbm_framework.training
algo-1-z5s7p_1  | 2020-08-02 23:22:49,255 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
algo-1-z5s7p_1  | 2020-08-02 23:22:49,269 custom_lightgbm_framework.training INFO     Invoking user training script.
algo-1-z5s7p_1  | 2020-08-02 23:22:49,443 sagemaker-training-toolkit INFO     Module train.py does not provide a setup.py. 
algo-1-z5s7p_1  | Generating setup.py
algo-1-z5s7p_1  | 2020-08-02 23:22:49,444 sagemaker-training-toolkit INFO     Generating setup.cfg
algo-1-z5s7p_1  | 2020-08-02 23:22:49,444 sagemaker-training-toolkit INFO     Generating MANIFEST.in
algo-1-z5s7p_1  | 2020-08-02 23:22:49,444 sagemaker-training-toolkit INFO     Installing module w

---
TODO - Deploys - dev

In [41]:
estimator.deploy(initial_instance_count=1,
                 instance_type='local',
                )

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/local/image.py", line 618, in run
    _stream_output(self.process)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/local/image.py", line 677, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/local/image.py", line 623, in run
    raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmpktskvj7l/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1



RuntimeError: Giving up, endpoint didn't launch correctly

> /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/local/entities.py(510)_wait_for_serving_container()
    508         i += 5
    509         if i >= HEALTH_CHECK_TIMEOUT_LIMIT:
--> 510             raise RuntimeError("Giving up, endpoint didn't launch correctly")
    511 
    512         logger.info("Checking if serving container is up, attempt: %s", i)

--KeyboardInterrupt--


ipdb>  q


In [37]:
%pdb
framework.deploy(initial_instance_count=1,
                 instance_type='local')

Automatic pdb calling has been turned ON


AttributeError: 'NoneType' object has no attribute 'name'

> /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py(709)deploy()
    707             model = self.create_model(**kwargs)
    708 
--> 709         model.name = model_name
    710 
    711         return model.deploy(



ipdb>  help



Documented commands (type help <topic>):
EOF    cl         disable  interact  next    psource  rv         unt   
a      clear      display  j         p       q        s          until 
alias  commands   down     jump      pdef    quit     source     up    
args   condition  enable   l         pdoc    r        step       w     
b      cont       exit     list      pfile   restart  tbreak     whatis
break  continue   h        ll        pinfo   return   u          where 
bt     d          help     longlist  pinfo2  retval   unalias  
c      debug      ignore   n         pp      run      undisplay

Miscellaneous help topics:
exec  pdb



ipdb>  l


    704             model = self._compiled_models[family]
    705         else:
    706             kwargs["model_kms_key"] = self.output_kms_key
    707             model = self.create_model(**kwargs)
    708 
--> 709         model.name = model_name
    710 
    711         return model.deploy(
    712             instance_type=instance_type,
    713

ipdb>  ll


    625     def deploy(
    626         self,
    627         initial_instance_count,
    628         instance_type,
    629         accelerator_type=None,
    630         endpoint_name=None,
    631         use_compiled_model=False,
    632         update_endpoint=False,
    633         wait=True,
    634         model_name=None,
    635 

ipdb>  model
ipdb>  p model


None


ipdb>  p kwargs["model_kms_key"]


None


ipdb>  endpoint_name


'sagemaker-custom-2020-08-02-22-53-21-422'


ipdb>  model_name


'sagemaker-custom-2020-08-02-22-53-21-422'


ipdb>  use_compiled_model


False


ipdb>  self


<__main__.MyLightGBMFramework object at 0x7f2ca82ec128>


ipdb>  q
