# Fine-tuning and deploying a BERTopic model on SageMaker with your own scripts and dataset, by extending existing PyTorch containers

### INFO: this notebook follows and extends the structure of the [Extending our PyTorch containers example](https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/pytorch_extending_our_containers/pytorch_extending_our_containers.ipynb)

With Amazon SageMaker, you can package your own algorithms that can then be trained and deployed in the SageMaker environment. This notebook guides you through an example of how to extend one of our existing, predefined SageMaker deep learning framework containers.

By packaging an algorithm in a container, you can bring almost any code to the Amazon SageMaker environment, regardless of programming language, environment, framework, or dependencies. 

1. [Extending our PyTorch containers](#Extending-our-pytorch-containers)
  1. [When should I extend a SageMaker container?](#When-should-I-extend-a-SageMaker-container?)
  1. [Permissions](#Permissions)
  1. [The example](#The-example)
  1. [The presentation](#The-presentation)
1. [Part 1: Docker containers and their use in Amazon SageMaker](#Part-1:-Docker-containers-and-their-use-in-Amazon-SageMaker)
    1. [An overview of Docker](#An-overview-of-Docker)
    1. [How Amazon SageMaker runs your Docker container](#How-Amazon-SageMaker-runs-your-Docker-container)
      1. [Running your container during training](#Running-your-container-during-training)
        1. [The input](#The-input)
        1. [The output](#The-output)
      1. [Running your container during hosting](#Running-your-container-during-hosting)
    1. [The parts of the sample container](#The-parts-of-the-sample-container)
    1. [The Dockerfile](#The-Dockerfile)
    1. [Building and registering the container](#Building-and-registering-the-container)
  1. [Testing your algorithm on your local machine](#Testing-your-algorithm-on-your-local-machine)
  1. [Download the 20newsgroups dataset](#Download-the-20newsgroups-dataset)
  1. [SageMaker Python SDK Local Training](#SageMaker-Python-SDK-Local-Training)
  1. [Fit, Deploy, Predict](#Fit,-Deploy,-Predict)
  1. [Making predictions using Python SDK](#Making-predictions-using-Python-SDK)
1. [Part 2: Training and Hosting your Algorithm in Amazon SageMaker](#Part-2:-Training-and-Hosting-your-Algorithm-in-Amazon-SageMaker)
  1. [Set up the environment](#Set-up-the-environment)
  1. [Create the session](#Create-the-session)
  1. [Upload the data for training](#Upload-the-data-for-training)
  1. [Training On SageMaker](#Training-on-SageMaker)
  1. [Optional cleanup](#Optional-cleanup)  
1. [Reference](#Reference)

_or_ I'm impatient, just [let me see the code](#The-Dockerfile)!

# Extending our pytorch containers

## When should I extend a SageMaker container?

You may not need to create a container to bring your own code to Amazon SageMaker. When you are using a framework such as [TensorFlow](https://github.com/aws/sagemaker-tensorflow-container), [MXNet](https://github.com/aws/sagemaker-mxnet-container), [PyTorch](https://github.com/aws/sagemaker-pytorch-container) or [Chainer](https://github.com/aws/sagemaker-chainer-container) that has direct support in SageMaker, you can simply supply the Python code that implements your algorithm using the SDK entry points for that framework.

Even if the SDK directly supports your environment or framework, you may still want to add functionality or configure the container environment differently while keeping our container as the base for use on SageMaker.

**Some of the reasons to extend a SageMaker deep learning framework container are:**
1. Install additional dependencies. (E.g. I want to install a specific Python library, that the current SageMaker containers don't install.)
2. Configure your environment. (E.g. I want to add an environment variable to my container.)

**Although it is possible to extend any of our framework containers as a parent image, the example this notebook covers is currently only intended to work with our PyTorch container.**

This walkthrough shows that it is quite straightforward to extend one of our containers to build your own custom container for PyTorch.

## Permissions

Running this notebook requires permissions in addition to the normal `SageMakerFullAccess` permissions. This is because it creates new repositories in Amazon ECR. The easiest way to add these permissions is to add the managed policy `AmazonEC2ContainerRegistryFullAccess` to the role that you used to start your notebook instance. There's no need to restart your notebook instance when you do this; the new permissions are available immediately.

## The example

In this example we show how to package a PyTorch container, extending the SageMaker PyTorch container, with a Python example that works with the BERTopic model and the 20newsgroups dataset. BERTopic is now available as a standalone library and benefits from GPU support, so the most effective way to use it is to extend the SageMaker PyTorch container and to utilize the existing training and hosting solution made to work on SageMaker. By comparison, if one were to build their own custom framework container from scratch, they would need to implement a training and hosting solution in order to use SageMaker. Here is an example showing [how to create a SageMaker TensorFlow container from scratch](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/tensorflow_bring_your_own/tensorflow_bring_your_own.ipynb).

In this example, we need to use separate base images to support training and hosting, as they are provided separately for [PyTorch](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only). Sometimes you may want to use the same image for training and hosting because they have the same requirements. This simplifies the procedure because you only need to manage one image for both tasks. In that case, merge the parts discussed below into a single Dockerfile and build one image. Choosing whether to use a single image or two images is a matter of what is most convenient for you to develop and manage.

If you're only using Amazon SageMaker for training or hosting, but not both, only the functionality used needs to be built into your container.
Finally, when using different containers, make sure that the libraries and their versions match; otherwise a model trained in one container may fail to load in the other. In fact, some library versions differ between the PyTorch training and inference containers.

## The notebook structure

This notebook is divided into three parts: 
1. Docker containers and their use in Amazon SageMaker
2. _building_ and _using_ the training container 
3. _building_ and _using_ the inference container 

# Part 1: Docker containers and their use in Amazon SageMaker

### An overview of Docker

If you're familiar with Docker already, you can skip ahead to the next section.

For many data scientists, Docker containers are a new technology. But they are not difficult and can significantly simplify the deployment of your software packages. 

Docker provides a simple way to package arbitrary code into an _image_ that is totally self-contained. Once you have an image, you can use Docker to run a _container_ based on that image. Running a container is just like running a program on the machine except that the container creates a fully self-contained environment for the program to run. Containers are isolated from each other and from the host environment, so the way your program is set up is the way it runs, no matter where you run it.

Docker is more powerful than environment managers like conda or virtualenv because (a) it is completely language independent and (b) it comprises your whole operating environment, including startup commands and environment variables.

A Docker container is like a virtual machine, but it is much lighter weight. For example, a program running in a container can start in less than a second and many containers can run simultaneously on the same physical or virtual machine instance.

Docker uses a simple file called a `Dockerfile` to specify how the image is assembled. An example is provided below. You can build your Docker images based on Docker images built by yourself or by others, which can simplify things quite a bit.

Docker has become very popular in programming and devops communities due to its flexibility and its well-defined specification of how code can be run in its containers. It is the underpinning of many services built in the past few years, such as [Amazon ECS].

Amazon SageMaker uses Docker to allow users to train and deploy arbitrary algorithms.

In Amazon SageMaker, Docker containers are invoked in one way for training and in another, slightly different, way for hosting. The following sections outline how to build containers for the SageMaker environment.

Some helpful links:

* [Docker home page](http://www.docker.com)
* [Getting started with Docker](https://docs.docker.com/get-started/)
* [Dockerfile reference](https://docs.docker.com/engine/reference/builder/)
* [`docker run` reference](https://docs.docker.com/engine/reference/run/)

[Amazon ECS]: https://aws.amazon.com/ecs/

### How Amazon SageMaker runs your Docker container

* Typically you specify a program (e.g. a script) as an `ENTRYPOINT` in the Dockerfile; that program runs at startup and decides what to do. The original `ENTRYPOINT` specified within the SageMaker PyTorch image is [here](https://github.com/aws/deep-learning-containers/blob/master/pytorch/training/docker/1.5.1/py3/Dockerfile.cpu#L142).

#### Running your container during training

Currently, our SageMaker PyTorch container utilizes [console_scripts](http://python-packaging.readthedocs.io/en/latest/command-line-scripts.html#the-console-scripts-entry-point) to make use of the `train` command issued at training time. The line that gets invoked during `train` is defined within the setup.py file inside [SageMaker Containers](https://github.com/aws/sagemaker-containers/blob/master/setup.py#L48), our common SageMaker deep learning container framework. When this command is run, it will invoke the [trainer class](https://github.com/aws/sagemaker-containers/blob/master/src/sagemaker_containers/cli/train.py) to run, which will finally invoke our [PyTorch container code](https://github.com/aws/sagemaker-pytorch-container/blob/master/src/sagemaker_pytorch_container/training.py) to run your Python file.

A number of files are laid out for your use, under the `/opt/ml` directory:

    /opt/ml
    |-- input
    |   |-- config
    |   |   |-- hyperparameters.json
    |   |   `-- resourceConfig.json
    |   `-- data
    |       `-- <channel_name>
    |           `-- <input data>
    |-- model
    |   `-- <model files>
    `-- output
        `-- failure

##### The input

* `/opt/ml/input/config` contains information to control how your program runs. `hyperparameters.json` is a JSON-formatted dictionary of hyperparameter names to values. These values are always strings, so you may need to convert them. `resourceConfig.json` is a JSON-formatted file that describes the network layout used for distributed training.
* `/opt/ml/input/data/<channel_name>/` (for File mode) contains the input data for that channel. The channels are created based on the call to CreateTrainingJob but it's generally important that channels match algorithm expectations. The files for each channel are copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure. 
* `/opt/ml/input/data/<channel_name>_<epoch_number>` (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.
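Because every value in `hyperparameters.json` arrives as a string, a script that reads the file directly has to convert the values itself. The sketch below shows one possible way to do that. The helper name `load_hyperparameters` and the "decode as JSON, otherwise keep the string" strategy are assumptions made for illustration; they are not part of this example's training script.

```python
import json
from pathlib import Path

# Location of the file inside a SageMaker training container (see the tree above).
HYPERPARAMETERS_PATH = Path("/opt/ml/input/config/hyperparameters.json")


def load_hyperparameters(path=HYPERPARAMETERS_PATH):
    """Read hyperparameters.json and decode each string value where possible.

    Hypothetical helper: values that are valid JSON (numbers, booleans,
    lists) are decoded, anything else is returned as the original string.
    """
    raw = json.loads(Path(path).read_text())
    parsed = {}
    for name, value in raw.items():
        try:
            parsed[name] = json.loads(value)
        except (TypeError, ValueError):
            # Not valid JSON (e.g. the plain word "english"): keep it as is.
            parsed[name] = value
    return parsed
```

With this approach a value such as `"10"` becomes the integer `10`, while a plain word such as `"english"` stays a string.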

##### The output

* `/opt/ml/model/` is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker packages any files in this directory into a compressed tar archive file. This file is made available at the S3 location returned in the `DescribeTrainingJob` result.
* `/opt/ml/output` is a directory where the algorithm can write a file `failure` that describes why the job failed. The contents of this file are returned in the `FailureReason` field of the `DescribeTrainingJob` result. For jobs that succeed, there is no reason to write this file as it is ignored.
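The failure-file convention can be sketched as a thin wrapper around the training function. The framework container used in this notebook normally takes care of this for you, so the code below is only an illustration of the mechanism; the function name `run_and_report` and its return-code convention are assumptions.

```python
import traceback
from pathlib import Path

# Directory where SageMaker looks for the failure file.
OUTPUT_DIR = Path("/opt/ml/output")


def run_and_report(train_fn, output_dir=OUTPUT_DIR):
    """Run train_fn; on error, write the traceback to <output_dir>/failure.

    Hypothetical wrapper for illustration. Returns 0 on success and 1 on
    failure so that the caller can pass the value to sys.exit().
    """
    try:
        train_fn()
        return 0
    except Exception:
        output_dir = Path(output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)
        # The file contents end up in the FailureReason field.
        (output_dir / "failure").write_text(traceback.format_exc())
        return 1
```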

#### Running your container during hosting

Hosting has a very different model than training because hosting is responding to inference requests that come in via HTTP. Currently, the SageMaker PyTorch containers [use](https://github.com/aws/sagemaker-pytorch-container/blob/master/src/sagemaker_pytorch_container/serving.py#L103) our [recommended Python serving stack](https://github.com/aws/sagemaker-containers/blob/master/src/sagemaker_containers/_server.py#L44) to provide robust and scalable serving of inference requests:

![Request serving stack](stack.png)

Amazon SageMaker uses two URLs in the container:

* `/ping` receives `GET` requests from the infrastructure. Your program returns 200 if the container is up and accepting requests.
* `/invocations` is the endpoint that receives client inference `POST` requests. The format of the request and the response is up to the algorithm. If the client supplied `ContentType` and `Accept` headers, these are passed in as well. 
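To make the two routes concrete, here is a deliberately tiny stand-in built on Python's standard `http.server`. It is not the serving stack used by the SageMaker PyTorch container, and the "model" is a placeholder that only reports the size of the payload; the class and function names are assumptions made for this sketch.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    """Toy handler exposing the two routes SageMaker calls."""

    def do_GET(self):
        # Health check: answer 200 when the container can accept requests.
        self._send(200 if self.path == "/ping" else 404, b"")

    def do_POST(self):
        if self.path != "/invocations":
            self._send(404, b"")
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length)
        # Placeholder for a real prediction: echo the payload size.
        body = json.dumps({"received_bytes": len(payload)}).encode()
        self._send(200, body, content_type="application/json")

    def _send(self, status, body, content_type="text/plain"):
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(host="0.0.0.0", port=8080):
    """Create the server; SageMaker sends requests to port 8080."""
    return HTTPServer((host, port), InferenceHandler)
```

Calling `make_server().serve_forever()` would start the toy server. In this notebook the equivalent work is done by the base image, which is one of the main reasons to extend it rather than start from scratch.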

The container has the model files in the same place that they were written to during training:

    /opt/ml
    `-- model
        `-- <model files>



## Custom files available to build the container used in this example

The `container` directory has all the components you need to extend the SageMaker PyTorch container to use as a sample algorithm:

    .
    |-- Dockerfile
    |-- Dockerfile-inference
    |-- build_and_push.sh
    `-- bert-topic
        |-- bert-topic.py
        `-- bert-topic-inference.py

Let's discuss each of these in turn:

* __`Dockerfile`__ describes how to build your Docker container image for *training*. More details are provided below.
* __`Dockerfile-inference`__ describes how to build your Docker container image for *inference*. More details are provided below.
* __`build_and_push.sh`__ is a script that uses the Dockerfile to build your container image and then pushes it to ECR. We invoke the script directly later in this notebook, but you can copy and run it for your own algorithms.
* __`bert-topic`__ is the directory which contains our user code to be invoked.

In this application, we install and/or update a few libraries and copy one script into the container, which is used as the `ENTRYPOINT`. That may be all you need, but if you have many supporting routines, you may wish to install more libraries and copy more files.

The files that we put in the container are:

* __`bert-topic.py`__ is the program that implements our training algorithm (used only for training container)
* __`bert-topic-inference.py`__ is the program that handles loading our model for inferences (used only for inference container)

## Part 2: Packaging and Uploading your Algorithm for training
### The Dockerfile

The Dockerfile describes the image that we want to build. You can think of it as describing the complete operating system installation of the system that you want to run. A Docker container running is quite a bit lighter than a full operating system, however, because it takes advantage of Linux on the host machine for the basic operations. 

We start from the SageMaker PyTorch image as the base. The base image is an ECR image, so it will have the following pattern.
* {account}.dkr.ecr.{region}.amazonaws.com/sagemaker-{framework}:{framework_version}-{processor_type}-{python_version}

Here is an explanation of each field.
1. account - AWS account ID the ECR image belongs to. Our public deep learning framework images are all under the 763104351884 account.
2. region - The region the ECR image belongs to. [Available regions](https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/).
3. framework - The deep learning framework.
4. framework_version - The version of the deep learning framework.
5. processor_type - CPU or GPU.
6. python_version - The supported version of Python.

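The SageMaker PyTorch training image used in this example is the following. Note that its repository is named `pytorch-training` and that the tag carries additional CUDA and operating system fields: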
* 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker
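As a small illustration, the URI above can be assembled from its fields so that only the region changes between accounts. The function name and its default values are assumptions chosen to reproduce the exact image used in this notebook; the SageMaker Python SDK also offers `sagemaker.image_uris.retrieve` for looking up framework images.

```python
def pytorch_training_image_uri(
    region,
    framework_version="1.12.1",
    processor_type="gpu",
    python_version="py38",
    suffix="cu113-ubuntu20.04-sagemaker",
    account="763104351884",
):
    """Compose the ECR URI of the PyTorch training image (illustrative helper)."""
    tag = f"{framework_version}-{processor_type}-{python_version}-{suffix}"
    return f"{account}.dkr.ecr.{region}.amazonaws.com/pytorch-training:{tag}"


print(pytorch_training_image_uri("us-east-1"))
```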

Information on supported frameworks and versions can be found in this [README](https://github.com/aws/sagemaker-python-sdk).

Next, we install the required additional libraries like BERTopic and add the code that implements our specific algorithm to the container, and set up the right environment for it to run under.

Finally, we need to specify two environment variables.
1. SAGEMAKER_SUBMIT_DIRECTORY - the directory within the container containing our Python script for training and inference.
2. SAGEMAKER_PROGRAM - the Python script that should be invoked for training and inference.

Let's look at the Dockerfile for this example.

In [12]:
!cat container/Dockerfile

# Copyright 2017-2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
#     http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.

# For more information on creating a Dockerfile
# https://docs.docker.com/compose/gettingstarted/#step-2-create-a-dockerfile
# https://github.com/awslabs/amazon-sagemaker-examples/master/advanced_functionality/pytorch_extending_our_containers/pytorch_extending_our_containers.ipynb

ARG REGION=us-east-1

# SageMaker PyTorch image for TRAINING
FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu11

### Writing the training script (bert-topic.py)

To fine-tune a BERTopic model with a custom dataset on Amazon SageMaker, we will write a training script to be used by the Amazon SageMaker Training Job (or locally).

The training script will need to do the following steps:
- Load a pretrained Model
- Load the dataset
- Define the Training Arguments
- Define a Trainer
- Train the model and save the checkpoint on the validation set

These steps will be done in a `_train()` function.

The script also uses some hyperparameters, which can be extended depending on what is needed for each model. Here we used `language` to choose the training language.
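The outline below sketches how such a script could be organised; it is not the actual `bert-topic.py` from this example. It assumes that the training toolkit passes hyperparameters as command-line arguments (for example `--language english`) and exposes the `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` environment variables. The argument names `--model-dir` and `--train-dir` and the way the input files are read are assumptions made for this sketch.

```python
import argparse
import os


def parse_args(argv=None):
    """Collect hyperparameters and SageMaker-provided paths."""
    parser = argparse.ArgumentParser()
    # Hyperparameter forwarded by SageMaker as "--language <value>".
    parser.add_argument("--language", type=str, default="english")
    # Paths set by the training toolkit, with the /opt/ml defaults as fallback.
    parser.add_argument(
        "--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    )
    parser.add_argument(
        "--train-dir",
        default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"),
    )
    return parser.parse_args(argv)


def _train(args):
    """Fit a BERTopic model on the documents found in the training channel."""
    # Imported here so that argument parsing works without BERTopic installed.
    from bertopic import BERTopic

    docs = []
    for name in sorted(os.listdir(args.train_dir)):
        with open(os.path.join(args.train_dir, name), encoding="utf-8") as f:
            # Assumption: one document per line.
            docs.extend(line.rstrip("\n") for line in f)

    topic_model = BERTopic(language=args.language)
    topic_model.fit_transform(docs)
    topic_model.save(os.path.join(args.model_dir, "my_model"))
```

A real script would end by calling `_train(parse_args())` when executed as the entry point.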

In [13]:
!pygmentize container/bert-topic/bert-topic.py

# Copyright 2017-2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
#     http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.
import argparse

### Building and registering the training container

The following shell code shows how to build the container image using `docker build` and push the container image to ECR using `docker push`. 

This code looks for an ECR repository in the account you're using and the current default region (if you're using a SageMaker notebook instance, this is the region where the notebook instance was created). If the repository doesn't exist, the script will create it. In addition, since we are using the SageMaker PyTorch image as the base, we will need to retrieve ECR credentials to pull this public image.

Note that your notebook role needs to have the permission to push images to the ECR registry.

In [14]:
# The name of our algorithm -- i.e. the name of the produced training container
training_algorithm_name = "bert-topic-training-example"

In [15]:
# building and pushing the container
! cd container && sh build_and_push.sh {training_algorithm_name}

ECR image fullname: 600743882593.dkr.ecr.us-east-1.amazonaws.com/bert-topic-training-example:latest
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Sending build context to Docker daemon  28.67kB
Step 1/8 : ARG REGION=us-east-1
Step 2/8 : FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker
 ---> d0d970e24a16
Step 3/8 : ENV PATH="/opt/ml/code:${PATH}"
 ---> Using cache
 ---> 85c0f8a933bc
Step 4/8 : COPY /bert-topic /opt/ml/code
 ---> Using cache
 ---> b50007176c88
Step 5/8 : ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
 ---> Using cache
 ---> 7a90277fca35
Step 6/8 : ENV SAGEMAKER_PROGRAM bert-topic.py
 ---> Using cache
 ---> b7f14932f09c
Step 7/8 : RUN pip install --no-cache-dir --upgrade pip &&     pip install --no-cache-dir bertopic==0.12.0
 ---> Using cache
 ---> 8d31db912163
Step 8/8 : RU

In [16]:
# An alternative and simplified command which outsources the Docker build process to AWS CodeBuild.
# This is especially useful when using SageMaker Studio notebooks, where Docker is not running.
# Details in https://aws.amazon.com/blogs/machine-learning/using-the-amazon-sagemaker-studio-image-build-cli-to-build-container-images-from-your-studio-notebooks/
#!cd container && sm-docker build {training_algorithm_name}

## Testing your algorithm locally

When you're packaging your first algorithm to use with Amazon SageMaker, you probably want to test it yourself to make sure it's working correctly. We use the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to test both locally and on SageMaker. For more examples with the SageMaker Python SDK, see [Amazon SageMaker Examples](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk). In order to test our algorithm, we need our dataset.

## Download the 20newsgroups training dataset
We will use the 20newsgroups dataset loader provided by scikit-learn (`fetch_20newsgroups`) to download and load our data for training.

In [17]:
import os

training_file_name = "training_file.txt"
working_dir = os.getcwd()
training_file_path = os.path.join(working_dir, training_file_name)

print(f"Working Dir: {working_dir}")
print(f"Training File: {training_file_path}")

Working Dir: /home/ec2-user/SageMaker/for_deployment/amazon-sagemaker-examples/advanced_functionality/pytorch_extend_container_train_deploy
Training File: /home/ec2-user/SageMaker/for_deployment/amazon-sagemaker-examples/advanced_functionality/pytorch_extend_container_train_deploy/training_file.txt


In [18]:
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# only use 100 documents out of 18846 for faster iteration
docs = docs[:100]
len(docs)

100

In [19]:
print("The first document contains the following text:\n",docs[0])

The first document contains the following text:
 

I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!




In [20]:
# write the list to a file to be easily passed to the training job

with open(training_file_path, "w") as f:
    for line in docs:
        f.write(
            line.replace("\n", "\\n") + "\n"
        )  # preserve newlines symbols (\n) and write one document per line
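The one-document-per-line encoding above is only useful if the reader applies the inverse transformation. The sketch below pairs the writer with a matching reader; the function names `write_docs` and `read_docs` are assumptions, and the scheme cannot distinguish a real newline from a document that already contains a literal backslash followed by `n`.

```python
def write_docs(docs, path):
    """Write one document per line, encoding newlines as the two characters \\n."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(doc.replace("\n", "\\n") + "\n")


def read_docs(path):
    """Inverse of write_docs: restore the newlines inside each document."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n").replace("\\n", "\n") for line in f]
```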

## SageMaker Python SDK Local Training
To represent our training, we use the Estimator class, which we configure with five arguments:
1. role - our AWS execution (IAM) role.
2. instance_count - number of instances to use for training (named `train_instance_count` before version 2 of the SageMaker Python SDK).
3. instance_type - type of instance to use for training (named `train_instance_type` before version 2 of the SDK). For training locally, we specify `local` or `local_gpu`.
4. image_uri - our custom PyTorch Docker image we created.
5. hyperparameters - hyperparameters we want to pass.

Let's start by setting up our IAM role. We make use of a helper function within the Python SDK. This function throws an exception if run outside of a SageMaker notebook instance, as it gets metadata from the notebook instance. If running outside, you must provide an IAM role with the access described above in [Permissions](#Permissions).

In [21]:
from sagemaker import get_execution_role

role = get_execution_role()

### Check the local SageMaker setup

In [22]:
# Let's set up our SageMaker notebook instance for local mode.
!/bin/bash ./utils/setup.sh

The user has root access.
nvidia-docker2 already installed. We are good to go!
SageMaker instance route table setup is ok. We are good to go.
SageMaker instance routing for Docker is ok. We are good to go!


In [23]:
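# CHECK if GPUs are available and set the corresponding "instance_type"
import shutil
import subprocess

instance_type_local = "local"

# nvidia-smi is missing on CPU-only hosts; look it up first so that
# subprocess.call does not raise FileNotFoundError
if shutil.which("nvidia-smi") and subprocess.call("nvidia-smi") == 0:
    # Set type to GPU if one is present
    instance_type_local = "local_gpu"

print("Instance type = " + instance_type_local)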

Thu Mar  2 10:45:17 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   31C    P8    13W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Fit (+ Deploy, Predict)

Now that the rest of our estimator is configured, we can call `fit()` with the path to our local 20newsgroups dataset prefixed with `file://`. This invokes our PyTorch container with `train` and passes our hyperparameters and other metadata, as JSON files in `/opt/ml/input/config` within the container, to the program entry point defined in the Dockerfile.

After our training has succeeded, our training algorithm outputs our trained model within the /opt/ml/model directory, which is used to handle predictions.

If the container used for training is also used for inference (NOT IN THIS CASE!), we can then conveniently call `deploy()` with an instance_count and instance_type, which are 1 and `local`. This invokes our PyTorch container with `serve`, which sets up our container to handle prediction requests as defined [here](https://github.com/aws/sagemaker-pytorch-container/blob/master/src/sagemaker_pytorch_container/serving.py#L103). What is returned is a predictor, which is used to make inferences against our trained model.

After our prediction, we can delete our endpoint.

We recommend always testing and training your training algorithm locally first, as it provides quicker iterations and better debuggability.

### Local training: Fit

In [24]:
from sagemaker.estimator import Estimator

# define hyperparameters
hyperparameters = {"language": "english"}

# prepare training job
estimator = Estimator(
    role=role,
    instance_count=1,  # named train_instance_count before sagemaker 2.x
    instance_type=instance_type_local,  # named train_instance_type before sagemaker 2.x
    image_uri=training_algorithm_name + ":latest",
    hyperparameters=hyperparameters,
)

# launch training job
print(f"file://{training_file_path}")
estimator.fit(f"file://{training_file_path}")


file:///home/ec2-user/SageMaker/for_deployment/amazon-sagemaker-examples/advanced_functionality/pytorch_extend_container_train_deploy/training_file.txt


INFO:sagemaker:Creating training-job with name: bert-topic-training-example-2023-03-02-10-45-19-428
INFO:sagemaker.local.local_session:Starting training job
INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker.local.image:No AWS credentials found in session but credentials from EC2 Metadata Service are available.
INFO:sagemaker.local.image:docker compose file: 
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-gciqu:
    command: train
    container_name: 4mpo1cakoh-algo-1-gciqu
    deploy:
      resources:
        reservations:
          devices:
          - capabilities:
            - gpu
    environment:
    - '[Masked]'
    - '[Masked]'
    image: bert-topic-training-example:latest
    networks:
      sagemaker-local:
        aliases:
        - algo-1-gciqu
    stdin_open: true
    tty: true
    volumes:
    - /tmp/tmpgagv6ca0/algo-1-gciqu/input:/opt/ml/input
    - /tmp/tmpgagv6ca0/algo-1-gciqu/output/

Creating 4mpo1cakoh-algo-1-gciqu ... 
Creating 4mpo1cakoh-algo-1-gciqu ... done
Attaching to 4mpo1cakoh-algo-1-gciqu
4mpo1cakoh-algo-1-gciqu | 2023-03-02 10:45:21,362 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
4mpo1cakoh-algo-1-gciqu | 2023-03-02 10:45:21,384 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
4mpo1cakoh-algo-1-gciqu | 2023-03-02 10:45:21,384 sagemaker-training-toolkit INFO     Failed to parse hyperparameter language value english to Json.
4mpo1cakoh-algo-1-gciqu | Returning the value itself
4mpo1cakoh-algo-1-gciqu | 2023-03-02 10:45:21,391 sagemaker-training-toolkit INFO     instance_groups entry not present in resource_config
4mpo1cakoh-algo-1-gciqu | 2023-03-02 10:45:21,394 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
4mpo1cakoh-algo-1-gciqu | 2023-03-02 10:45:21,397 sagemaker_

INFO:root:creating /tmp/tmpgagv6ca0/artifacts/output/data
INFO:root:copying /tmp/tmpgagv6ca0/algo-1-gciqu/output/success -> /tmp/tmpgagv6ca0/artifacts/output
INFO:root:copying /tmp/tmpgagv6ca0/model/my_model -> /tmp/tmpgagv6ca0/artifacts/model


4mpo1cakoh-algo-1-gciqu exited with code 0
Aborting on container exit...
===== Job Complete =====


In [25]:
# check where the fitted model has been stored after fit
estimator.model_data

's3://sagemaker-us-east-1-600743882593/bert-topic-training-example-2023-03-02-10-45-19-428/model.tar.gz'

### A sleek alternative to "Bring your own container": bring your own requirements.txt

Instead of extending the PyTorch container through a Dockerfile, in many cases we can simply install additional libraries to make our model work. 
In these cases, placing a `requirements.txt` file in the same folder as the code installs the required dependencies at runtime. 
The source directory must be specified in the `source_dir` argument and the entry point in the `entry_point` argument when creating the estimator. Further documentation is available [here](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#using-third-party-libraries).
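
As a rough illustration of the layout described above, the following sketch checks that a source directory contains both the entry point script and a `requirements.txt` before a job is launched. The helper name `missing_source_files` is hypothetical and not part of the SageMaker SDK.

```python
from pathlib import Path


def missing_source_files(source_dir, entry_point):
    """Return the names of the expected files that are absent from source_dir."""
    root = Path(source_dir)
    # Both the entry point script and requirements.txt are expected side by side.
    expected = (entry_point, "requirements.txt")
    return [name for name in expected if not (root / name).is_file()]
```

An empty list means that both files were found. A non-empty list names the files to add before calling `fit()`.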

In [28]:
import boto3
from sagemaker.estimator import Estimator

my_session = boto3.session.Session()
my_region = my_session.region_name
my_image_uri = (
    f"763104351884.dkr.ecr.{my_region}.amazonaws.com/"
    "pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker"
)
print("my_image_uri:", my_image_uri)

# define hyperparameters
hyperparameters = {"language": "english"}

# prepare training job
estimator = Estimator(
    role=role,
    instance_count=1,  # named train_instance_count before SDK v2
    instance_type=instance_type_local,  # named train_instance_type before SDK v2
    image_uri=my_image_uri,
    hyperparameters=hyperparameters,
    source_dir="container/bert-topic-byoreq",  # this directory contains the entrypoint code and the requirements.txt
    entry_point="bert-topic.py",  # this argument is used to override internal container entrypoint, if needed!
)

# launch training job
print(f"file://{training_file_path}")
estimator.fit(f"file://{training_file_path}")

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


my_image_uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker
file:///home/ec2-user/SageMaker/for_deployment/amazon-sagemaker-examples/advanced_functionality/pytorch_extend_container_train_deploy/training_file.txt


INFO:sagemaker:Creating training-job with name: pytorch-training-2023-03-02-10-48-21-013
INFO:sagemaker.local.local_session:Starting training job
INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker.local.image:No AWS credentials found in session but credentials from EC2 Metadata Service are available.
INFO:sagemaker.local.image:docker compose file: 
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-epzte:
    command: train
    container_name: 8id8rkwnfq-algo-1-epzte
    deploy:
      resources:
        reservations:
          devices:
          - capabilities:
            - gpu
    environment:
    - '[Masked]'
    - '[Masked]'
    image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker
    networks:
      sagemaker-local:
        aliases:
        - algo-1-epzte
    stdin_open: true
    tty: true
    volumes:
    - /tmp/tmpq7ab7y2k/algo-1-epzte/inp

Creating 8id8rkwnfq-algo-1-epzte ... 
Creating 8id8rkwnfq-algo-1-epzte ... done
Attaching to 8id8rkwnfq-algo-1-epzte
8id8rkwnfq-algo-1-epzte | 2023-03-02 10:48:23,128 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
8id8rkwnfq-algo-1-epzte | 2023-03-02 10:48:23,150 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
8id8rkwnfq-algo-1-epzte | 2023-03-02 10:48:23,158 sagemaker-training-toolkit INFO     instance_groups entry not present in resource_config
8id8rkwnfq-algo-1-epzte | 2023-03-02 10:48:23,161 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
8id8rkwnfq-algo-1-epzte | 2023-03-02 10:48:23,164 sagemaker_pytorch_container.training INFO     Invoking user training script.
8id8rkwnfq-algo-1-epzte | 2023-03-02 10:48:23,200 botocore.credentials INFO     Found credentials from IAM Role: BaseNotebookInstanceEc2Instance

INFO:root:creating /tmp/tmpq7ab7y2k/artifacts/output/data
INFO:root:copying /tmp/tmpq7ab7y2k/algo-1-epzte/output/success -> /tmp/tmpq7ab7y2k/artifacts/output
INFO:root:copying /tmp/tmpq7ab7y2k/model/my_model -> /tmp/tmpq7ab7y2k/artifacts/model


8id8rkwnfq-algo-1-epzte exited with code 0
Aborting on container exit...
===== Job Complete =====


In [29]:
estimator.model_data

's3://sagemaker-us-east-1-600743882593/pytorch-training-2023-03-02-10-48-21-013/model.tar.gz'

## Training your Algorithm in Amazon SageMaker in batch mode
Once you have your container packaged and/or validated, you can use it to train models in batch mode. Let's do that with the algorithm we made above.

## Set up the environment
Here we specify the S3 prefix under which the data will be stored in the default bucket (the role has been defined earlier).

In [30]:
# S3 prefix
prefix = "DEMO-pytorch-bert-topic"

## Create the session

The session remembers our connection parameters to SageMaker. We use it to perform all of our SageMaker operations.

In [31]:
import sagemaker as sage

sess = sage.Session()

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


## Upload the data for training

We will use the tools provided by the SageMaker Python SDK to upload the data to a default bucket.

In [32]:
data_location = sess.upload_data(training_file_path, key_prefix=prefix)
data_location

's3://sagemaker-us-east-1-600743882593/DEMO-pytorch-bert-topic/training_file.txt'

## Training on SageMaker in batch mode
Training a model on SageMaker with the Python SDK works much like the local training above. The main change is the instance type, which goes from `local` to one of our [supported EC2 instance types](https://aws.amazon.com/sagemaker/pricing/instance-types/).

In addition, we must now specify the ECR image URL, which we just pushed above.

Finally, our local training dataset has to be in Amazon S3 and the S3 URL to our dataset is passed into the `fit()` call.

Let's first fetch our ECR image url that corresponds to the image we just built and pushed.

Also in this case, whenever useful, one can leverage a pre-existing base image with the "bring your own requirements.txt" option (not shown below).

In [33]:
import boto3

client = boto3.client("sts")
account = client.get_caller_identity()["Account"]

my_session = boto3.session.Session()
region = my_session.region_name

ecr_image = f"{account}.dkr.ecr.{region}.amazonaws.com/{training_algorithm_name}:latest"

print("using ecr_image:", ecr_image)

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


using ecr_image: 600743882593.dkr.ecr.us-east-1.amazonaws.com/bert-topic-training-example:latest


After training the model, the output is packaged into a tar archive, `model.tar.gz`, and copied to the S3 bucket used for the training job.

In [34]:
from sagemaker.estimator import Estimator

hyperparameters = {"language": "english"}

# instance_type = "ml.g4dn.xlarge"
instance_type = "ml.p2.xlarge"

estimator = Estimator(
    role=role,
    instance_count=1,  # named train_instance_count before SDK v2
    instance_type=instance_type,  # named train_instance_type before SDK v2
    image_uri=ecr_image,
    hyperparameters=hyperparameters,
)

estimator.fit(data_location)

INFO:sagemaker:Creating training-job with name: bert-topic-training-example-2023-03-02-12-19-47-456


2023-03-02 12:19:47 Starting - Starting the training job...
2023-03-02 12:20:15 Starting - Preparing the instances for training.........
2023-03-02 12:21:28 Downloading - Downloading input data...
2023-03-02 12:21:53 Training - Downloading the training image...........................
2023-03-02 12:26:30 Training - Training image download completed. Training in progress...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-03-02 12:26:56,714 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-03-02 12:26:56,728 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-03-02 12:26:56,728 sagemaker-training-toolkit INFO     Failed to parse hyperparameter language value english to Json.
Returning the value itself
2023-03-02 12:26:56,741 sagemaker_pytorch_container.training INFO     Block

In [35]:
# check where the fitted model has been stored after fit
estimator.model_data

's3://sagemaker-us-east-1-600743882593/bert-topic-training-example-2023-03-02-12-19-47-456/output/model.tar.gz'
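
To see what the training job actually packaged, one option is to download `model.tar.gz` and list its contents. The sketch below assumes the archive has already been copied to a local path (for example with the AWS CLI). The helper name `list_model_archive` is hypothetical.

```python
import tarfile


def list_model_archive(archive_path):
    """Return the sorted names of the regular files inside a .tar.gz archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        # Directories and links are skipped, only regular files are reported.
        return sorted(member.name for member in archive.getmembers() if member.isfile())
```

In the local runs above, the logs show a file named `my_model` being copied into the model artifacts, so a file with that name would be expected in the listing. This is an inference from the logs, not something verified here.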

## Part 3: Packaging and Uploading your Algorithm for inference

### The Dockerfile

Also in this case the Dockerfile describes the image that we want to build. 

We start from the SageMaker PyTorch image as the base, but the *inference* one. 

So the SageMaker PyTorch ECR image in this case would be:
* FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker

Next, we install the required additional libraries (we make sure to install numba==0.53.1 and nvgpu together with bertopic, otherwise we would get an error!), add the code that implements our specific algorithm to the container, and set up the right environment for it to run under.

Finally, we need to specify two environment variables.
1. SAGEMAKER_SUBMIT_DIRECTORY - the directory within the container containing our Python script for training and inference.
2. SAGEMAKER_PROGRAM - the Python script that should be invoked for training and inference.
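
As a small sanity check, the two variables can be read back from inside a running container to see which script would be invoked. This is a minimal sketch: the helper name `resolve_entry_point` is hypothetical, and it assumes `SAGEMAKER_SUBMIT_DIRECTORY` holds a path inside the container, as it does when set in a Dockerfile.

```python
import os
import posixpath


def resolve_entry_point(environ=None):
    """Return the script path implied by the two variables, or None if either is unset."""
    env = os.environ if environ is None else environ
    submit_dir = env.get("SAGEMAKER_SUBMIT_DIRECTORY")
    program = env.get("SAGEMAKER_PROGRAM")
    if not submit_dir or not program:
        return None
    # Container paths are POSIX paths, whatever the host operating system is.
    return posixpath.join(submit_dir, program)
```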

Let's look at the Dockerfile for this example.

In [36]:
!cat container/Dockerfile-inference

# Copyright 2017-2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
#     http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.

# For more information on creating a Dockerfile
# https://docs.docker.com/compose/gettingstarted/#step-2-create-a-dockerfile
# https://github.com/awslabs/amazon-sagemaker-examples/master/advanced_functionality/pytorch_extending_our_containers/pytorch_extending_our_containers.ipynb

ARG REGION=us-east-1

# SageMaker PyTorch image for INFERENCE
FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/pytorch-inference:1.12.1-gpu-py38-cu

### Writing your own inference script (bert-topic-inference.py)

Given the use of a pre-packaged SageMaker PyTorch container, the only requirement for the inference script is that it defines the following template functions:
- `model_fn()` reads the content of the model directory, extracted from the `model.tar.gz` archive saved in S3. We use it to load the trained model.
- `input_fn()` is used here simply to format the data received from a request made to the endpoint.
- `predict_fn()` uses the model returned by `model_fn()` to run inference on the output of `input_fn()`.

Optionally, an `output_fn()` can be created to format the inference result, using the output of `predict_fn()`. Here it is especially useful since the BERTopic inference output is not really standard: it is a tuple containing 2 lists.
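
The four handlers can be sketched as follows. This is an illustrative outline, not the content of `bert-topic-inference.py`. It assumes that the model was saved under the file name `my_model` (as the training logs suggest), that requests arrive as a JSON list of strings, and that the model returns one probability per document.

```python
import json
import os


def model_fn(model_dir):
    """Load the trained model from the directory where model.tar.gz was extracted."""
    # Imported lazily so the other handlers can be exercised without bertopic installed.
    from bertopic import BERTopic

    # Assumption: the training script saved the model under the name "my_model".
    return BERTopic.load(os.path.join(model_dir, "my_model"))


def input_fn(request_body, request_content_type):
    """Decode a JSON request body into a list of documents."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return json.loads(request_body)


def predict_fn(input_data, model):
    """Run inference; transform() returns a (topics, probabilities) pair."""
    topics, probabilities = model.transform(input_data)
    return topics, probabilities


def output_fn(prediction, accept):
    """Serialize the two result sequences as a JSON array of two lists."""
    topics, probabilities = prediction
    # Cast to built-in types so that NumPy scalars do not break json.dumps.
    return json.dumps([[int(t) for t in topics], [float(p) for p in probabilities]])
```

Casting to `int` and `float` in `output_fn()` is a design choice made for this sketch: it keeps the response a plain JSON array of two lists, whatever numeric types the model returns.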


In [37]:
!pygmentize container/bert-topic/bert-topic-inference.py

# Copyright 2017-2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
#     http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.
import argparse

In [38]:
# The name of our algorithm -- i.e. the name of the inference container
inference_algorithm_name = "pytorch-inference-bert-topic-example"

In [39]:
# building and pushing the container
! cd container && sh build_and_push.sh {inference_algorithm_name} Dockerfile-inference

ECR image fullname: 600743882593.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-bert-topic-example:latest
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Sending build context to Docker daemon  28.67kB
Step 1/8 : ARG REGION=us-east-1
Step 2/8 : FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/pytorch-inference:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker
1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker: Pulling from pytorch-inference

0b181fff: Already exists 
dd88827c: Already exists 
1d7060d0: Already exists 
76058f67: Already exists 
4c52c661: Already exists 
0896c4cf: Pulling fs layer 
4ebf7138: Pulling fs layer 
c96776f5: Pulling fs layer 
c6997db2: Pulling fs layer 
48b1b583: Pulling fs layer 
2e3703a5: Pulling fs layer 
8f8f8237: Pulling fs layer 
04e63ad6: Pulling fs layer 
f70c0888: Pul

## Inference from Containerized SageMaker Model

In [43]:
import pprint
import boto3

pp = pprint.PrettyPrinter(indent=1)

sm_boto3 = boto3.client("sagemaker")

region = boto3.Session().region_name
account_id = boto3.client("sts").get_caller_identity()["Account"]

image_uri_inference = (
    f"{account_id}.dkr.ecr.{region}.amazonaws.com/{inference_algorithm_name}:latest"
)
image_uri_inference

'600743882593.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-bert-topic-example:latest'

### Prepare session to run container locally

In [44]:
import sagemaker
from sagemaker.local import LocalSession

session_local = LocalSession()
session_local.config = {"local": {"local_code": True}}  # the SDK reads this setting under the "local" key
print(type(session_local))

from sagemaker import get_execution_role

role = get_execution_role()

<class 'sagemaker.local.local_session.LocalSession'>


### Check for locally running Docker containers (and stop them if needed)

In [45]:
! docker ps

CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES


In [None]:
# if model doesn't deploy locally, kill other running containers
# docker stop $(docker ps -a -q)
# docker rm $(docker ps -a -q)

### Deploy container locally to create a local endpoint 
#### (WARNING: while the local container is running, cell stdout will be interleaved with container log output)

Note: if the inference container does not already include the inference code, it is possible to provide local files through the `source_dir` and `entry_point` arguments.

In [46]:
from sagemaker import Model

model_data = estimator.model_data

estimator = Model(
    image_uri=image_uri_inference,
    model_data=model_data,
    role=role,
    source_dir="container/bert-topic",  # local directory containing the inference code
    entry_point="bert-topic-inference.py",  # this argument is used to override internal container entrypoint, if needed!
    sagemaker_session=session_local,  # local session
    #                   predictor_cls=None,
    #                   env=None,
    #                   name=None,
    #                   vpc_config=None,
    #                   enable_network_isolation=False,
    #                   model_kms_key=None,
    #                   image_config=None,
    #                   code_location=None,
    #                   container_log_level=20,
    #                   dependencies=None,
    #                   git_config=None
)

predictor = estimator.deploy(1, instance_type_local)

### Invoke locally deployed endpoint

In [30]:
import json

sagemaker_session = LocalSession()
sagemaker_session.config = {"local": {"local_code": True}}  # the SDK reads this setting under the "local" key

sm_client = sagemaker_session.sagemaker_runtime_client
response = sm_client.invoke_endpoint(
    EndpointName="local-endpoint",
    ContentType="application/json",
    Body=json.dumps(["some random text", "free speech is important for democracy"]),
)

r = response["Body"]
print("RESULT r.read().decode():", r.read().decode())

5hpwya6dgy-algo-1-asc7w | 2022-12-21T09:15:12,763 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - inside model_fn, model_dir= /opt/ml/model
5hpwya6dgy-algo-1-asc7w | 2022-12-21T09:15:12,786 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Device Type: cuda
5hpwya6dgy-algo-1-asc7w | 2022-12-21T09:15:12,901 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - [2022-12-21 09:15:12.901 0c3994fb7e23:53 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
5hpwya6dgy-algo-1-asc7w | 2022-12-21T09:15:12,937 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - [2022-12-21 09:15:12.937 0c3994fb7e23:53 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
5hpwya6dgy-algo-1-asc7w | 2022-12-21T09:15:13,314 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 12716
5hpwya6dgy-algo-1-asc7w | 2022-12-21T09:15:13,315 [INFO ] W-9000-model_1.0 TS_METRICS - W-9000-model_1.0.ms:

In [31]:
%%sh
# check if docker containers are running and kill them if needed
docker ps
docker stop $(docker ps -a -q)
docker rm $(docker ps -a -q)


Exception in thread Thread-8:
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/sagemaker/local/image.py", line 854, in run
    _stream_output(self.process)
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/sagemaker/local/image.py", line 916, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 137

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/sagemaker/local/image.py", line 859, in run
    raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmpi22602l0/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited w

5hpwya6dgy-algo-1-asc7w exited with code 137
Aborting on container exit...
CONTAINER ID   IMAGE                                                                                      COMMAND                  CREATED          STATUS          PORTS                                                 NAMES
0c3994fb7e23   497050307998.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-bert-topic-example:latest   "python /usr/local/b…"   27 seconds ago   Up 26 seconds   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp, 8081/tcp   5hpwya6dgy-algo-1-asc7w
0c3994fb7e23
3f377ab14ae8
0c3994fb7e23
3f377ab14ae8


### A sleek alternative to "Bring your own container": bring your own requirements.txt

Also in this case, instead of extending the PyTorch container through a Dockerfile, we can simply install additional libraries to make our model work, if needed. 
The process is the same: placing a `requirements.txt` file in the same folder as the code installs the required dependencies at runtime. 
The source directory must be specified in the `source_dir` argument and the entry point in the `entry_point` argument when creating the model. Further documentation is available [here](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#using-third-party-libraries).

*CAVEAT*: Installing libraries on the fly through requirements.txt implies additional overhead. This is negligible in case of "always on" single model endpoint deployment, but could have an impact in case of serverless or multi-model deployment. 


In [49]:
import boto3
from sagemaker import Model

my_session = boto3.session.Session()
my_region = my_session.region_name
my_image_uri_inference = (
    f"763104351884.dkr.ecr.{my_region}.amazonaws.com/"
    "pytorch-inference:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker"
)
print("my_image_uri_inference:", my_image_uri_inference)

estimator = Model(
    image_uri=my_image_uri_inference,
    model_data=model_data,
    role=role,
    source_dir="container/bert-topic-byoreq",  # this directory contains the entrypoint code and the requirements.txt
    entry_point="bert-topic-inference.py",  # this argument is used to override internal container entrypoint, if needed!
    sagemaker_session=session_local,  # local session
    #                   predictor_cls=None,
    #                   env=None,
    #                   name=None,
    #                   vpc_config=None,
    #                   enable_network_isolation=False,
    #                   model_kms_key=None,
    #                   image_config=None,
    #                   code_location=None,
    #                   container_log_level=20,
    #                   dependencies=None,
    #                   git_config=None
)

predictor = estimator.deploy(1, instance_type_local)


my_image_uri_inference: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker


INFO:sagemaker:Creating model with name: pytorch-inference-2023-03-02-15-57-01-476
INFO:sagemaker:Creating endpoint-config with name pytorch-inference-2023-03-02-15-57-01-477
INFO:sagemaker:Creating endpoint with name pytorch-inference-2023-03-02-15-57-01-477
INFO:sagemaker.local.image:serving
INFO:sagemaker.local.image:creating hosting dir in /tmp/tmpcez5lrtf
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

INFO:sagemaker.local.image:docker command: docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker


Login Succeeded


failed to register layer: Error processing tar file(exit status 1): write /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so: no space left on device


CalledProcessError: Command '['docker', 'pull', '763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker']' returned non-zero exit status 1.

### Invoke locally deployed endpoint

In [30]:
import json

sagemaker_session = LocalSession()
sagemaker_session.config = {"local": {"local_code": True}}  # the SDK reads this setting under the "local" key

sm_client = sagemaker_session.sagemaker_runtime_client
response = sm_client.invoke_endpoint(
    EndpointName="local-endpoint",
    ContentType="application/json",
    Body=json.dumps(["some random text", "free speech is important for democracy"]),
)

r = response["Body"]
print("RESULT r.read().decode():", r.read().decode())

5hpwya6dgy-algo-1-asc7w | 2022-12-21T09:15:12,763 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - inside model_fn, model_dir= /opt/ml/model
5hpwya6dgy-algo-1-asc7w | 2022-12-21T09:15:12,786 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Device Type: cuda
5hpwya6dgy-algo-1-asc7w | 2022-12-21T09:15:12,901 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - [2022-12-21 09:15:12.901 0c3994fb7e23:53 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
5hpwya6dgy-algo-1-asc7w | 2022-12-21T09:15:12,937 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - [2022-12-21 09:15:12.937 0c3994fb7e23:53 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
5hpwya6dgy-algo-1-asc7w | 2022-12-21T09:15:13,314 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 12716
5hpwya6dgy-algo-1-asc7w | 2022-12-21T09:15:13,315 [INFO ] W-9000-model_1.0 TS_METRICS - W-9000-model_1.0.ms:

### Deploy container remotely to create a managed Amazon SageMaker endpoint

In [46]:
from sagemaker import Model

import sagemaker as sage

sess = sage.Session()

# instance_type = "ml.m5.xlarge" # no GPU, will trigger an error
# instance_type = "ml.g4dn.xlarge"
instance_type = "ml.p2.xlarge"
model_data = estimator.model_data

estimator = Model(
    image_uri=image_uri_inference,
    model_data=model_data,
    role=role,
    source_dir="container/bert-topic",
    entry_point="bert-topic-inference.py",
    sagemaker_session=sess,  # not local session anymore
    #                   predictor_cls=None,
    #                   env=None,
    name="my-BERTopic-test-endpoint",
    #                   vpc_config=None,
    #                   enable_network_isolation=False,
    #                   model_kms_key=None,
    #                   image_config=None,
    #                   code_location=None,
    #                   container_log_level=20,
    #                   dependencies=None,
    #                   git_config=None
)

# deploy the model
predictor = estimator.deploy(1, instance_type)

---------------!

Manually check the name of the deployed endpoint. It will be similar to:
- pytorch-inference-bert-topic-example-2022-12-07-16-23-40-112
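
Instead of looking the name up by hand, the most recently created matching endpoint can be retrieved programmatically. The following is a minimal sketch using the boto3 SageMaker client's `list_endpoints` call. The helper name `find_latest_endpoint` is hypothetical, and the client is passed in as an argument so the function can be tried against a stub.

```python
def find_latest_endpoint(sm_client, name_contains):
    """Return the name of the newest endpoint whose name contains the given text."""
    response = sm_client.list_endpoints(
        NameContains=name_contains,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    endpoints = response.get("Endpoints", [])
    # None signals that no endpoint matched the filter.
    return endpoints[0]["EndpointName"] if endpoints else None
```

With the objects defined in this notebook, a call could look like `find_latest_endpoint(boto3.client("sagemaker"), inference_algorithm_name)`.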

### Invoke remotely deployed endpoint

In [52]:
sm_client = sess.sagemaker_runtime_client
endpoint_name = "pytorch-inference-bert-topic-example-2022-12-21-10-12-52-914"
response = sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(["some random text", "free speech is important for democracy"]),
)

r = response["Body"]
print("RESULT r.read().decode():", r.read().decode())

RESULT r.read().decode(): [[0, 0], [0.772705707141109, 0.8011079049820548]]
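
The response body printed above is a JSON array of two lists: the predicted topic per document, followed by the associated probability. A small sketch of how the two lists could be paired back with the input texts is shown below. The helper name `pair_predictions` is hypothetical.

```python
import json


def pair_predictions(texts, response_body):
    """Combine input texts with the [topics, probabilities] lists from the endpoint."""
    topics, probabilities = json.loads(response_body)
    return [
        {"text": text, "topic": int(topic), "probability": float(probability)}
        for text, topic, probability in zip(texts, topics, probabilities)
    ]
```

Note that `response["Body"]` is a stream that can only be read once, so the decoded string should be stored in a variable before it is printed or parsed.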


### Optional cleanup of the created endpoint
The created endpoint can be deleted manually.
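
The deletion can also be scripted. Below is a minimal sketch based on the boto3 SageMaker client, which removes the endpoint together with its endpoint configuration. The helper name `delete_endpoint_and_config` is hypothetical, and the model resource is left in place in this sketch.

```python
def delete_endpoint_and_config(sm_client, endpoint_name):
    """Delete an endpoint and its endpoint configuration; return the configuration name."""
    # The configuration name must be looked up before the endpoint is deleted.
    description = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = description["EndpointConfigName"]
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    return config_name
```

With the variables used in this notebook, a call could look like `delete_endpoint_and_config(boto3.client("sagemaker"), endpoint_name)`.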

This is the end of the notebook.

## Reference
- [How Amazon SageMaker interacts with your Docker container for training](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html)
- [How Amazon SageMaker interacts with your Docker container for inference](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html)
- [BERTopic model](https://maartengr.github.io/BERTopic/index.html)
- [20newsgroups dataset](http://qwone.com/~jason/20Newsgroups/)
- [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
- [Dockerfile](https://docs.docker.com/engine/reference/builder/)
- [scikit-bring-your-own](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb)
- [SageMaker PyTorch container](https://github.com/aws/sagemaker-pytorch-container)

