In [None]:
# Use the previous version of the SageMaker SDK
import sys
!{sys.executable} -m pip install sagemaker==1.72.0 -U

# Creating our custom LightGBM training container for SageMaker

Now that we have successfully developed the LightGBM model, we can proceed to create our custom Docker container. We'll use the [sagemaker-training-toolkit library](https://github.com/aws/sagemaker-training-toolkit) to build a container that supports both script mode and framework mode (our custom LightGBM container).

This part of the workshop has 4 steps:

1. <a href="#custom_training_container">Create our <strong>custom Docker container that SageMaker will run for training</strong> in Script Mode and Framework mode</a>
2. <a href="#training_toolkit">Create a Python package to configure <strong>SageMaker Training toolkit</strong></a>
3. <a href="#container_training_build"><strong>Building</strong> our Docker image and <strong>pushing</strong> it to Amazon Elastic Container Registry</a>
4. <a href="#testing_training"><strong>Testing the training locally</strong> with our container using the SageMaker Python SDK</a>

### First of all, what are the types of SageMaker containers? What is a SageMaker Framework container?

SageMaker provides 3 types of containers (and APIs) that you can interact with:

#### a) Basic Training Container

The bare minimum required to build a custom Docker container that runs training in Amazon SageMaker. The training logic is predefined inside the container, and the container expects data in a specific shape (e.g. the target variable in the first column).
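As an illustration of what "data in a specific shape" can mean, the sketch below assumes a container that, like SageMaker's built-in XGBoost algorithm with CSV input, expects the target in the first column and no header row. The helper `target_first_csv` and the sample data are hypothetical and use only the standard library:

```python
import csv
import io


def target_first_csv(csv_text, target):
    """Return CSV text with `target` moved to the first column and the header dropped.

    Hypothetical helper for illustration; assumes `csv_text` starts with a header row.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    other_cols = [c for c in reader.fieldnames if c != target]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Target first, then the remaining features in their original order
        writer.writerow([row[target]] + [row[c] for c in other_cols])
    return out.getvalue()


# Hypothetical sample in the spirit of the iris data used later
sample = "sepal_length,sepal_width,species\n5.1,3.5,0\n6.2,2.9,1\n"
print(target_first_csv(sample, "species"))
```

A container with predefined training logic cannot be told where the label lives, so this kind of reshaping has to happen before the data is uploaded.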

<img src="./media/basic_training_container.jpg" class="center">

When working with the basic training container, we use the Estimator API with the class ['sagemaker.estimator.Estimator(...)'](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator) and pass only configuration such as the container image URI, hyperparameters, etc. **We cannot pass custom code into SageMaker, as the training logic is already specified inside the Docker container.**

#### b) Script Mode Container

A custom container where **we can pass a user-defined script that SageMaker will execute at training time** (the sagemaker-training-toolkit loads the specified script and executes it as the entry point).

This training script can be packaged into the container as a Python module (Script Mode Container, on the left below) or stored in Amazon S3 and passed to the container, which downloads it and runs it as the entry point (Script Mode Container 2, on the right below).

<table><tr>
<td> <img src="./media/script_mode_container.jpg" style="height: 100%"> </td>
<td> <img src="./media/script_mode_container_2.jpg" style="height: 100%"> </td>
</tr></table>

We can use the Estimator API with the class ['sagemaker.estimator.Estimator(...)'](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator) and pass configuration including `sagemaker_program` for the entry-point Python module and `sagemaker_submit_directory` for the URI of the tarball in S3 containing the training code (this way, at training time, SageMaker knows where to download the code from and how to run it).
    

#### c) Framework Container

A framework container is similar to a script-mode container, but it additionally loads a Python framework module that configures the framework and then runs the user-provided module. **With Framework mode, the SageMaker Python SDK creates a tarball of our local files and uploads the code to S3. Then, at training time, SageMaker loads the script just as in Script Mode Container 2 above and runs it.**

<img src="./media/framework_container.jpg" class="center">

We can use the Estimator API with the class ['sagemaker.estimator.Framework(...)'](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Framework) (in practice, a subclass of it, as we do below) and pass configuration including `entry_point` for the entry-point Python module and `source_dir` for the local directory containing the code.


[More details about the containers and examples here.](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/custom-training-containers)

> **Quick Recap**: [In the previous exercise](https://github.com/marcelokscunha/amazon-sagemaker-mlops-workshop/blob/master/lab/00_Warmup_Studio/xgboost_customer_churn_studio.ipynb) we used XGBoost in both "basic training container" and "framework" modes.

<div id="custom_training_container">
<h2> 1. Creating the training container</h2>
</div>

We start by defining some variables like the current execution role, the ECR repository that we are going to use for pushing the custom Docker container and the default Amazon S3 bucket to be used by Amazon SageMaker:

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role

ecr_repository_name = 'sagemaker-custom-lightgbm'
role = get_execution_role()
account_id = role.split(':')[4]
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()

print(account_id)
print(region)
print(role)
print(bucket)
print(ecr_repository_name)
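The cell above extracts the account ID with `role.split(':')[4]`. This works because IAM role ARNs follow the pattern `arn:partition:iam::account-id:role/...` (the region field is empty for IAM), so the account ID is the fifth colon-separated field. A small sketch with a hypothetical ARN and a hypothetical helper name:

```python
def account_id_from_role_arn(role_arn):
    """Extract the AWS account ID from an IAM role ARN.

    IAM ARNs look like arn:partition:iam::account-id:role/role-name,
    so the account ID is the fifth colon-separated field (index 4).
    """
    parts = role_arn.split(":")
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError("not an ARN: %r" % role_arn)
    return parts[4]


# Hypothetical role ARN, for illustration only
example_arn = "arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole"
print(account_id_from_role_arn(example_arn))  # 123456789012
```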

Let's take a look at the Dockerfile which defines the statements for building our custom script/framework container:

In [None]:
!pygmentize ../docker/Dockerfile

---

At a high level, the Dockerfile specifies the following operations for building this container:
1. Start from Ubuntu 16.04
2. Define some variables to be used at build time to install Python 3
3. Install a handful of useful libraries with apt-get
4. We then install Python 3 and create a symbolic link
5. We copy a .tar.gz package named **custom_framework_training-1.0.0.tar.gz** into the WORKDIR
6. We then install some Python libraries such as NumPy, Pandas, Scikit-Learn, **LightGBM and the package we copied in the previous step**
7. We set a few environment variables, including PYTHONUNBUFFERED, which disables buffering of Python's standard output (useful for logging)
8. Finally, we set the environment variable `SAGEMAKER_TRAINING_MODULE` to a Python module in the training package we installed.

> **Under the hood**: 
>
>- After installing the sagemaker-training-toolkit, we can run it as a command-line script simply by executing `train` in the terminal. 
>
>[Take a look at the setup.py of sagemaker-training-toolkit here.](https://github.com/aws/sagemaker-training-toolkit/blob/master/setup.py) 
>
>(see the last lines of setup.py &rarr; `entry_points={"console_scripts": ["train=(...)"]}`)
>
>- When `train` is executed, the toolkit looks up the `SAGEMAKER_TRAINING_MODULE` environment variable set above and runs the function it points to. 
>
>In other words:
>
>- We define `ENV SAGEMAKER_TRAINING_MODULE custom_lightgbm_framework.training:main`. When we run `train` in bash, the SageMaker Training Toolkit executes our `main()` function defined in [custom_lightgbm_framework.training](../package/src/custom_lightgbm_framework/training.py). 
>
>SageMaker will run our Docker image for training with `docker run <YOUR-IMAGE> train` ([more details here](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html)).
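
To make the `module:function` notation concrete, here is a simplified sketch (not the toolkit's own code) of how a string such as the value of `SAGEMAKER_TRAINING_MODULE` can be resolved to a callable with `importlib`. It is the same `module:attribute` notation setuptools uses for `console_scripts`. The helper `resolve_entry_point` is hypothetical, and a standard-library function stands in for `custom_lightgbm_framework.training:main`, which only exists inside our container:

```python
import importlib


def resolve_entry_point(spec):
    """Resolve a "package.module:function" string to a callable (simplified sketch)."""
    module_name, sep, func_name = spec.partition(":")
    if not sep or not module_name or not func_name:
        raise ValueError("expected 'package.module:function', got %r" % spec)
    # Import the module, then look up the function by name
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


# In our container the spec would be "custom_lightgbm_framework.training:main";
# a standard-library function stands in for it here.
func = resolve_entry_point("json:dumps")
print(func({"num_leaves": 40}))
```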

<div id="training_toolkit">
    <h2>2. What is our custom_framework_training-1.0.0.tar.gz package?</h2>
</div>

Looking at our Dockerfile above, we see that we neither installed sagemaker-training (the toolkit) with `pip install sagemaker-training` nor created a custom console script for the `train` command. All of that is configured within our package.

The advantage here is that **we can reuse the package and change just the libraries installed in the Docker container**. For example, if we wanted to use another framework (e.g. [CatBoost](https://catboost.ai/)), we would only change step 6 in the Dockerfile (e.g. `RUN pip install catboost`).

Let's see the configurations of our custom Python package:

In [None]:
!pygmentize ../package/setup.py

This build script looks at the packages under the local src/ path and specifies the dependency on sagemaker-training. As previously explained, the training module containing the `main()` function that will be executed is the following:

In [None]:
!pygmentize ../package/src/custom_lightgbm_framework/training.py

The idea here is that we will use the <strong>entry_point.run()</strong> function of the sagemaker-training-toolkit library to execute the user-provided module.
You might want to set additional framework-level configurations (e.g. parameter servers) before calling the user module.
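As a rough mental model, `entry_point.run()` ends up executing the user script in a new process, with the user hyperparameters passed as `--name value` command-line arguments; reserved `sagemaker_*` keys configure the toolkit rather than the script. The hypothetical helpers `to_cmd_args` and `run_user_script` below are a simplified emulation with the standard library, not the toolkit's implementation:

```python
import os
import subprocess
import sys
import tempfile


def to_cmd_args(hyperparameters):
    """Turn {"num_leaves": 40} into ["--num_leaves", "40"], skipping reserved sagemaker_* keys."""
    args = []
    for key, value in sorted(hyperparameters.items()):
        if key.startswith("sagemaker_"):
            continue  # reserved keys configure the toolkit, not the user script
        args.extend(["--" + key, str(value)])
    return args


def run_user_script(script_path, hyperparameters):
    """Run a user training script in a fresh Python process and return its stdout."""
    cmd = [sys.executable, script_path] + to_cmd_args(hyperparameters)
    result = subprocess.run(cmd, check=True, stdout=subprocess.PIPE, universal_newlines=True)
    return result.stdout


# Usage: a throwaway "training script" that just echoes the arguments it receives
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "train.py")
    with open(script, "w") as f:
        f.write("import sys\nprint(sys.argv[1:])\n")
    output = run_user_script(
        script, {"num_leaves": 40, "learning_rate": 0.11, "sagemaker_program": "train.py"}
    )

print(output)
```

This is why the training script from the previous notebook can read its hyperparameters with `argparse`.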

<div id="container_training_build">
<h2>3. Build and push the container</h2>
</div>

We are now ready to build this container and push it to Amazon ECR. This task is executed using a shell script stored in the ../scripts/ folder. Let's take a look at this script and then execute it.

In [None]:
! pygmentize ../scripts/build_and_push.sh

---
This script does the following:

1. runs **setup.py** to create the training package (a tarball in this case) and copies it to **../docker/code/**
2. builds the Docker image and tags it
3. logs in to Amazon Elastic Container Registry and creates the repository if it doesn't exist
4. pushes the image to our ECR repository
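The tag-and-push part of these steps can be sketched in Python as a dry run. The helper `build_and_push_commands` is hypothetical: it only assembles the Docker command lines and runs nothing, it omits the login and repository-creation steps, and the `../docker` build context and `latest` tag are assumptions. The resulting image URI has the same format used later in this notebook:

```python
def build_and_push_commands(account_id, region, repository_name, tag="latest"):
    """Return the docker commands for building, tagging and pushing an image to ECR.

    Hypothetical dry-run helper: it only builds the command lists and executes nothing.
    """
    registry = "{}.dkr.ecr.{}.amazonaws.com".format(account_id, region)
    image_uri = "{}/{}:{}".format(registry, repository_name, tag)
    local_tag = "{}:{}".format(repository_name, tag)
    return [
        ["docker", "build", "-t", local_tag, "../docker"],  # assumed build context
        ["docker", "tag", local_tag, image_uri],             # add the ECR name to the image
        ["docker", "push", image_uri],                       # upload to the ECR repository
    ]


# Hypothetical account ID and region
for cmd in build_and_push_commands("123456789012", "us-east-1", "sagemaker-custom-lightgbm"):
    print(" ".join(cmd))
```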

The first build takes a few minutes; after that, Docker caches the build layers and reuses them in subsequent builds.

**Let's run this script now:**

In [None]:
print('Configurations for our bash script:')
print('account ID:', account_id)
print('region:', region)
print('ECR repository name:', ecr_repository_name)

In [None]:
! ../scripts/build_and_push.sh $account_id $region $ecr_repository_name

[Go to ECR in the AWS console](https://console.aws.amazon.com/ecr/home?region=us-east-1) and check if our new repository called `sagemaker-custom-lightgbm` was created and the image was pushed to it.

![ecr-repo](./media/ecr-repo.png)

<div id="testing_training">
<h2>4. Training with Amazon SageMaker</h2>
</div>

Once we have correctly pushed our container to Amazon ECR, we are ready to start training with Amazon SageMaker. As previously explained, we have 2 options to pass this training script to SageMaker: via **Script Mode** or **Framework mode**.

*We have to:*

a) **upload the data to S3** so that SageMaker knows where to download it from for training (defining our S3-based training channels).

b) make our training script available to SageMaker:
- b.1) **if in Script mode** &rarr; upload the training script to S3
- b.2) **if in Framework mode** &rarr; extend the sagemaker.estimator.Framework class (the SageMaker Python SDK will do the uploading for us)
    

#### a) Uploading the data to S3
Let's **upload our data** from the directory `data` (created in the first notebook) to Amazon S3 and configure SageMaker input:

In [None]:
# Save data in S3 for training with SageMaker
prefix = 'sagemaker-custom'
data_dir = 'data'
input_train = sagemaker_session.upload_data('data/train/iris_train.csv', key_prefix="{}/{}".format(prefix, data_dir) )
input_test = sagemaker_session.upload_data('data/test/iris_test.csv', key_prefix="{}/{}".format(prefix, data_dir) )
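For a single file, `upload_data` returns the full S3 URI of the uploaded object, built from the bucket, the `key_prefix` and the file's base name; that is the value stored in `input_train` and `input_test`. The hypothetical helper below reproduces that naming so you can predict the URIs. The bucket name is made up, following the SDK's `sagemaker-<region>-<account-id>` default-bucket pattern:

```python
import os


def expected_s3_uri(bucket, key_prefix, local_path):
    """Sketch of the S3 URI for a single uploaded file: s3://<bucket>/<key_prefix>/<file name>."""
    return "s3://{}/{}/{}".format(bucket, key_prefix, os.path.basename(local_path))


# Hypothetical default bucket name
print(expected_s3_uri("sagemaker-us-east-1-123456789012",
                      "sagemaker-custom/data",
                      "data/train/iris_train.csv"))
```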

In [None]:
sagemaker_session.list_s3_files(sagemaker_session.default_bucket(), prefix+'/'+data_dir)

In [None]:
input_train, input_test

In [None]:
train_config = sagemaker.session.s3_input(input_train, content_type='text/csv')
test_config = sagemaker.session.s3_input(input_test, content_type='text/csv')

[Go to the S3 bucket to check how the data was stored.](https://s3.console.aws.amazon.com/s3/home)

![iris-data](./media/s3-iris-data.png)

### b.1) Now, for the Script mode container:

As discussed, in script mode, SageMaker will load the training script from S3. 

Let's create a tarball with the training script and upload it to S3:

In [None]:
# helper function to create a tarball
import os
import tarfile
import tempfile

def create_tar_file(source_files, target=None):
    # Write to the given target path, or to a temporary file if none is given
    if target:
        filename = target
    else:
        _, filename = tempfile.mkstemp()

    with tarfile.open(filename, mode="w:gz") as t:
        for sf in source_files:
            # Add each file at the root of the archive (dropping its local directory path)
            t.add(sf, arcname=os.path.basename(sf))
    return filename

In [None]:
# Create a tarball with our training script
create_tar_file(["source_dir/train.py"], "sourcedir.tar.gz")
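Because `sagemaker_program` is given as a bare file name (`train.py`), and assuming the toolkit resolves it relative to the extracted archive, the script should sit at the root of the tarball. A quick sketch for checking an archive's layout with `tarfile`; the helper `tar_members` is hypothetical, and the example builds a throwaway archive the same way `create_tar_file` does. In the notebook you could call `tar_members("sourcedir.tar.gz")` before the file is deleted:

```python
import os
import tarfile
import tempfile


def tar_members(path):
    """Return the member names of a gzip-compressed tarball."""
    with tarfile.open(path, mode="r:gz") as t:
        return t.getnames()


# Build a throwaway archive the same way create_tar_file does (arcname = base name)
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "train.py")
    with open(script, "w") as f:
        f.write("print('training')\n")
    archive = os.path.join(tmp, "sourcedir.tar.gz")
    with tarfile.open(archive, mode="w:gz") as t:
        t.add(script, arcname=os.path.basename(script))
    members = tar_members(archive)

print(members)  # ['train.py']
```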

In [None]:
# Upload the tarball to S3
sources = sagemaker_session.upload_data('sourcedir.tar.gz', bucket, prefix + '/code')
print('Uploaded tarball to:', sources)
! rm sourcedir.tar.gz

Let's configure SageMaker to use our Docker container:

In [None]:
container_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com/{2}:latest'.format(account_id, region, ecr_repository_name)
print(container_image_uri)

To review the training script we created earlier ([in the previous notebook](./0_local_development.ipynb)), [click here to view the script.](./source_dir/train.py)

Train in Script mode with SageMaker (use `sagemaker.estimator.Estimator`):

In [None]:
import sagemaker
import json

# JSON encode hyperparameters.
def json_encode_hyperparameters(hyperparameters):
    return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}

hyperparameters = json_encode_hyperparameters({
    "sagemaker_program": "train.py",
    "sagemaker_submit_directory": sources,
    "num_leaves": 40,
    "max_depth": 10,
    "learning_rate": 0.11,
    "random_state": 42})

estimator = sagemaker.estimator.Estimator(container_image_uri,
                                    role,
                                    train_instance_count=1, 
                                    train_instance_type='local',
                                    #train_instance_type='ml.m5.xlarge',
                                    base_job_name=prefix,
                                    hyperparameters=hyperparameters)

Note that the *sagemaker-training-toolkit library* uses the following reserved hyperparameters to find our entry-point script and the location of our sources in Amazon S3:
- `sagemaker_program`
- `sagemaker_submit_directory`
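SageMaker passes every hyperparameter to the container as a string, which is why the cell above JSON-encodes them: strings end up wrapped in quotes (`"train.py"` becomes `'"train.py"'`) while numbers stay bare (`40` becomes `'40'`). Below is a self-contained sketch of the same encoding, plus a hypothetical decoder showing how the values can be recovered on the container side (the toolkit's own parsing may differ in details):

```python
import json


def json_encode_hyperparameters(hyperparameters):
    """Same encoding as in the notebook: every value becomes a JSON string."""
    return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}


def json_decode_hyperparameters(encoded):
    """Hypothetical reverse step: parse each value as JSON, falling back to the raw string."""
    decoded = {}
    for k, v in encoded.items():
        try:
            decoded[k] = json.loads(v)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            decoded[k] = v
    return decoded


encoded = json_encode_hyperparameters(
    {"sagemaker_program": "train.py", "num_leaves": 40, "learning_rate": 0.11}
)
print(encoded)
print(json_decode_hyperparameters(encoded))
```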

In [None]:
estimator.fit({'train': train_config, 'validation': test_config })

>Again, as expected, with the same hyperparameters the training finished with a final validation loss of 0.138846 and an F1 score of 0.94 ([see the previous notebook again if you want to compare](./0_local_development.ipynb)).

### b.2) Now, for the Framework mode container:

As you have seen, in the previous steps we had to upload our code to Amazon S3 and then inject reserved hyperparameters to execute training. However, **with the Framework class in the SageMaker Python SDK, the upload is automated for us.**

Let's extend the `sagemaker.estimator.Framework` class from the SageMaker Python SDK. We'll create a class called `MyLightGBMFramework` that inherits from the Framework class:

In [None]:
%%writefile custom_framework.py
from sagemaker.estimator import Framework

class MyLightGBMFramework(Framework):
    def __init__(
        self,
        entry_point,
        source_dir=None,
        hyperparameters=None,
        py_version="py3",
        framework_version=None,
        image_name=None,
        distributions=None,
        **kwargs
    ):
        # py_version, framework_version and distributions are accepted but not used:
        # our image is fixed and passed explicitly via image_name
        super(MyLightGBMFramework, self).__init__(
            entry_point, source_dir, hyperparameters, image_name=image_name, **kwargs
        )
    
    def _configure_distribution(self, distributions):
        # No distributed-training configuration for this custom framework
        return
    
    def create_model(
        self,
        model_server_workers=None,
        role=None,
        vpc_config_override=None,
        entry_point=None,
        source_dir=None,
        dependencies=None,
        image_name=None,
        **kwargs
    ):
        # Serving is not implemented yet, so no Model is created (deploy() will fail)
        return None

We can now use our `MyLightGBMFramework` class for training with SageMaker and pass the local script and directory (`entry_point` and `source_dir`):

In [None]:
import sagemaker
from custom_framework import MyLightGBMFramework

framework = MyLightGBMFramework(image_name=container_image_uri,
                                role=role,
                                entry_point='train.py',
                                source_dir='source_dir/',
                                train_instance_count=1, 
                                train_instance_type='local', # we use local mode
                                #train_instance_type='ml.m5.xlarge',
                                base_job_name=prefix,
                                hyperparameters={"num_leaves": 40,
                                                 "max_depth": 10,
                                                 "learning_rate": 0.11,
                                                 "random_state": 42}
                                )

In [None]:
framework.fit({'train': train_config, 'validation': test_config })

>Again, as expected, with the same hyperparameters the training finished with a final validation loss of 0.138846 and an F1 score of 0.94 ([see the previous notebook again if you want to compare](./0_local_development.ipynb)).

> If desired, we could use a Python script stored in a Git repository instead of a local file by passing [git_config](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Framework) (this can be helpful when we automate training and build an ML pipeline).

## That's the end of creating the training container!

### What's next? How can we deploy the custom model?

If we try to deploy it, we'll receive an error (as expected, since we haven't implemented the logic for serving the trained model and running inference).

Click the **STOP** button (the square button at the top) to stop this test:

Since no web server or inference logic has been implemented, deploying the model in script mode (the `estimator` object above) produces the following error: 
```
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/local/image.py", line 618, in run
    _stream_output(self.process)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/local/image.py", line 677, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 1

(...)

RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmpjswa7hh1/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

```

In [None]:
estimator.deploy(initial_instance_count=1,
                 instance_type='local',
                )

For the `framework` object above (Framework mode), there will again be an error, since we didn't implement the [Framework Model class](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.FrameworkModel): when executing `framework.deploy(...)`, the SageMaker Python SDK first creates a Model object and then uses it to deploy ([more details in the source code](https://github.com/aws/sagemaker-python-sdk/blob/4100bfa8aba871c3b947f88891a7719139b6f394/src/sagemaker/estimator.py#L1377)). Since our `create_model()` returns `None`, we get:

```
AttributeError: 'NoneType' object has no attribute 'name'
```
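The error comes from reading an attribute of the `None` returned by our `create_model()`. A minimal emulation with a hypothetical class (the SDK's real `deploy()` does much more than this) reproduces the same message:

```python
class SketchFramework:
    """Hypothetical stand-in for our Framework subclass (not SageMaker SDK code)."""

    def create_model(self):
        # Like MyLightGBMFramework.create_model, no model is created yet
        return None

    def deploy(self):
        # Simplified: deploy builds a Model first and then reads its attributes
        model = self.create_model()
        return model.name


try:
    SketchFramework().deploy()
except AttributeError as err:
    message = str(err)

print(message)  # 'NoneType' object has no attribute 'name'
```

Implementing a real `create_model()` that returns a `FrameworkModel` is what the next notebook is about.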

In [None]:
framework.deploy(initial_instance_count=1,
                 instance_type='local')

## Let's implement the final logic for serving and inference!

## &rarr; [CLICK HERE TO MOVE ON](../../1_custom_inference/lab/2_inference-container.ipynb)