In [None]:
import boto3
import sagemaker

In [12]:
session = sagemaker.Session()

In [13]:
!pip freeze | grep sagemaker

sagemaker==2.68.0
sagemaker-pyspark==1.4.2


### Creating a bucket using Boto3

In [14]:
bucket_name = "maq01-first-bucket"
s3 = boto3.resource("s3")
region = session.boto_region_name

print(region)

bucket_config = {"LocationConstraint": region}
s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration=bucket_config)

eu-central-1


s3.Bucket(name='maq01-first-bucket')
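One caveat with the cell above: S3 rejects `us-east-1` as a `LocationConstraint`, so in that region `create_bucket` has to be called without `CreateBucketConfiguration`. A minimal sketch of a region-aware call follows; `create_bucket_kwargs` is a hypothetical helper, not part of Boto3.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for s3.create_bucket (hypothetical helper)."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default S3 region and must not be sent as a LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


# Usage with the objects from the cell above (requires AWS credentials):
# s3.create_bucket(**create_bucket_kwargs(bucket_name, region))
print(create_bucket_kwargs("maq01-first-bucket", "eu-central-1"))
```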

### Uploading the data to S3
---

SageMaker expects our training, validation, and test data to be in an `S3` bucket, so we have to upload our data to the bucket we just created.

There are two ways to do this.

**Using SageMaker's session object**

https://sagemaker.readthedocs.io/en/stable/api/utility/session.html#sagemaker.session.Session.upload_data

In [15]:
'''
    We can also choose the target folder for the upload with the `key_prefix` parameter of the `upload_data` method.
    Here we do not pass it, so the default folder "data" is used.
'''

inputs = session.upload_data(path="Language Detection.csv", bucket=bucket_name)

print("input spec (in this case, just an S3 path): {}".format(inputs))

input spec (in this case, just an S3 path): s3://maq01-first-bucket/data/Language Detection.csv


**Using SageMaker's S3 utilities**


https://sagemaker.readthedocs.io/en/stable/api/utility/s3.html

In [7]:
s3_uploader = sagemaker.s3.S3Uploader()


data_s3_uri = s3_uploader.upload(
        local_path="Language Detection.csv", desired_s3_uri=f"s3://{bucket_name}/data"
    )

## Training with SageMaker using PyTorch
---

Amazon SageMaker provides many features for training machine learning and deep learning models. It also provides built-in algorithms, so we don't even need to write our own (see [link](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html)). However, this notebook deals with how to train a custom model using PyTorch and SageMaker.

To use PyTorch with SageMaker, we use the `PyTorch` class provided by the SageMaker SDK. Let's import it.

In [16]:
from sagemaker.pytorch import PyTorch

The `PyTorch` class creates an estimator object, which handles end-to-end training and deployment of custom PyTorch code. The estimator pulls a Docker container from Amazon ECR (Elastic Container Registry) that has `Python` and `PyTorch` pre-installed and trains our model inside it. Once training is finished, it uploads the artifacts, including our model, to the `S3 bucket`. We are therefore charged only for the duration of the training job. After training, we can also deploy our model: SageMaker can create and deploy a REST API endpoint with a single line of code. We will look at this in more detail later.

First, let's define our hyperparameters.

In [17]:
hyperparameters = {
    "epochs": 10, 
    "batch-size": 64, 
    "embedding-dim": 125, 
    "hidden-dim": 2
}

As mentioned above, we need to provide a configuration to create an estimator. It requires the PyTorch and Python versions, because the estimator fetches the Docker container from ECR that has these dependencies pre-installed.

For more info about estimator config: https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase 

In [18]:
estimator_config = {
    "source_dir": "scripts", # we provide source_dir in order to install torchtext!!
    "entry_point": "train_script.py",
    "framework_version": "1.9", # PyTorch version
    "py_version": "py38", # Python version
    "instance_type": "ml.m5.xlarge", # type of ML compute instance used for training, I am using ml.m5.xlarge
    "instance_count": 1,
    "role": sagemaker.get_execution_role(),
    "output_path": f"s3://{bucket_name}", # if this is not specified then SM will create a new (default) bucket to store artifacts
    "hyperparameters": hyperparameters,
    "code_location": f"s3://{bucket_name}" # S3 prefix where our code is uploaded; also used when we deploy our endpoint. If not specified, another S3 bucket is used!
}

Everything is pretty self-explanatory, but `source_dir` and `entry_point` are important to understand.

As you may have noticed, we have a folder `scripts`, and inside that folder is our `entry_point` script. We don't strictly need a folder for the `entry_point` script, but if you want to install custom dependencies using `requirements.txt`, you should create a folder and put both `requirements.txt` and the `entry_point` script inside it. This is exactly what we did here, because we want to install `torchtext`.
SageMaker automatically looks for `requirements.txt` in `source_dir` and installs the packages listed in it.
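Before launching a training job, it can help to confirm that the folder has the expected layout. The following is a minimal sketch; `check_source_dir` is a hypothetical helper written for this notebook, not a SageMaker function.

```python
from pathlib import Path


def check_source_dir(source_dir, entry_point):
    """Report whether source_dir has the entry point and a requirements.txt."""
    root = Path(source_dir)
    return {
        "entry_point": (root / entry_point).is_file(),
        "requirements": (root / "requirements.txt").is_file(),
    }


# Expected layout for this notebook:
# scripts/
#   train_script.py
#   requirements.txt   (lists torchtext)
print(check_source_dir("scripts", "train_script.py"))
```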

### Investigating train_script.py
---

**`main` block**:

Most of the code in `train_script.py` is simply how you would write your model using `PyTorch`. However, I would like to explain a few things here. Let's start with the `main` block.
```Python
if __name__ == "__main__":
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
    parser = argparse.ArgumentParser()
    
    # These arguments are populated with the estimator's hyperparameters
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--embedding-dim", type=int, default=125)
    parser.add_argument("--hidden-dim", type=int, default=2)
    
    args, _ = parser.parse_known_args()
```
We passed our `hyperparameters` in the estimator config, and here you can see that `train_script.py` receives them as command-line arguments.
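To see why the dictionary keys and the `argparse` option names must match, the sketch below approximates the hand-off locally. `hyperparameters_to_argv` is a hypothetical helper that only imitates how the container turns hyperparameters into command-line arguments; it is not SageMaker code.

```python
import argparse


def hyperparameters_to_argv(hyperparameters):
    """Flatten a hyperparameter dict into ["--name", "value", ...] (illustrative)."""
    argv = []
    for name, value in hyperparameters.items():
        argv += [f"--{name}", str(value)]
    return argv


parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=10)
parser.add_argument("--batch-size", type=int, default=32)
parser.add_argument("--embedding-dim", type=int, default=125)
parser.add_argument("--hidden-dim", type=int, default=2)

argv = hyperparameters_to_argv(
    {"epochs": 10, "batch-size": 64, "embedding-dim": 125, "hidden-dim": 2}
)
args, _ = parser.parse_known_args(argv)

# argparse converts dashes to underscores in attribute names.
print(args.epochs, args.batch_size, args.embedding_dim, args.hidden_dim)
```

Note that `batch-size` is read back as `args.batch_size`, and that the value passed by the estimator (64) overrides the script's default (32).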

Next  
```Python
    # `SM_OUTPUT_DATA_DIR` is a local directory in the training container;
    # SageMaker uploads its contents as output.tar.gz under the estimator's "output_path"
    model_storage = os.environ['SM_OUTPUT_DATA_DIR']
    
    # https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#prepare-a-pytorch-training-script
    # estimator.fit populates the `SM_CHANNEL_TRAIN` environment variable
    corpus_dir = os.environ['SM_CHANNEL_TRAIN']
```
SageMaker sets environment variables inside the training container, which we can read in `train_script.py`. `SM_OUTPUT_DATA_DIR` points to a local directory in the container; when the job ends, SageMaker archives its contents as `output.tar.gz` and uploads the archive under the `output_path` from the estimator config. `SM_CHANNEL_TRAIN` points to the local directory where the `train` channel passed to `pytorch_estimator.fit` is downloaded (see the next cells for more details).
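Outside SageMaker these variables are not set, so `os.environ[...]` raises a `KeyError` when the script is run locally. A minimal sketch of reading them with fallbacks follows; `resolve_dirs` and the local fallback paths are assumptions made for illustration, while `SM_MODEL_DIR` is another variable SageMaker sets in the container.

```python
import os


def resolve_dirs(env=None):
    """Read SageMaker's directory variables, with local fallbacks (assumed paths)."""
    env = os.environ if env is None else env
    return {
        "model_dir": env.get("SM_MODEL_DIR", "./model"),
        "output_data_dir": env.get("SM_OUTPUT_DATA_DIR", "./output"),
        "train_dir": env.get("SM_CHANNEL_TRAIN", "./data"),
    }


# Simulate the variable that fit({"train": ...}) sets inside the container.
print(resolve_dirs({"SM_CHANNEL_TRAIN": "/opt/ml/input/data/train"}))
```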

At the end of the `main` block, notice that I am not storing just the `state_dict` of our model: we also need the vocabulary and its size to instantiate the model and run inference.

**`model_fn`, `input_fn`, `predict_fn`  functions**:

As mentioned earlier, SageMaker can also create and deploy an endpoint for our model, which we can use for inference. We will deploy an endpoint in this notebook as well. For deployment, we provide the `model_fn`, `input_fn`, and `predict_fn` functions in our training script. Let's examine them.

**`model_fn`**:
This function should have the following signature:

```Python
def model_fn(model_dir)
```
SageMaker injects `model_dir`, which is where our saved model is located. This function should load the model from `model_dir` and return it. You can return any kind of object, as long as it contains the model. The output of this function is passed as an input to `predict_fn`.

**`input_fn`**:
This function should have the following signature:

```Python
def input_fn(request_body, request_content_type)
```
Here we receive the body of the request sent to our endpoint, together with its content type. We can process and validate the input here and then return the data our model will use for prediction. The output of this function is passed as an input to `predict_fn`.
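As an illustration, an `input_fn` for this use case might look like the sketch below. It is not the implementation from `train_script.py`; it assumes the client sends a single JSON-encoded string with the content type `application/json`.

```python
import json


def input_fn(request_body, request_content_type):
    """Decode a JSON request into the text the model should classify."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    text = json.loads(request_body)
    # Reject anything that is not a non-empty string before it reaches the model.
    if not isinstance(text, str) or not text.strip():
        raise ValueError("Expected a non-empty JSON string")
    return text.strip()


print(input_fn('"how are you"', "application/json"))
```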


**`predict_fn`**:
This function should have the following signature:

```Python
def predict_fn(input_fn_out, model_fn_out)
```

The first parameter is the output of `input_fn`, and the second is the output of `model_fn`.
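The way these three functions fit together can be sketched with toy stand-ins. Everything below is illustrative: the "model" is a plain keyword lookup instead of a PyTorch network, and `serve` is a hypothetical function that only imitates the order in which the serving container calls our code.

```python
import json


def model_fn(model_dir):
    # Stand-in for loading a real model from model_dir.
    return {"model": lambda text: "English" if "you" in text.split() else "Unknown"}


def input_fn(request_body, request_content_type):
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return json.loads(request_body)


def predict_fn(input_fn_out, model_fn_out):
    # input_fn_out comes from input_fn, model_fn_out comes from model_fn.
    return model_fn_out["model"](input_fn_out)


def serve(body, content_type, loaded_model):
    """Imitate one request: decode the body, then predict with the loaded model."""
    data = input_fn(body, content_type)
    return predict_fn(data, loaded_model)


loaded_model = model_fn("/opt/ml/model")  # loaded once, reused for every request
print(serve('"how are you"', "application/json", loaded_model))
```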

That's it; we are ready for training and deployment.

For further reading, see the official SageMaker docs: https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html

In [19]:
pytorch_estimator = PyTorch(**estimator_config)

In [20]:
# https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html
data_channels = {"train": f"s3://{bucket_name}/data/"} # this will populate the 'SM_CHANNEL_TRAIN' environment variable

pytorch_estimator.fit(data_channels)

2021-12-05 21:08:38 Starting - Starting the training job...
2021-12-05 21:09:01 Starting - Launching requested ML instancesProfilerReport-1638738518: InProgress
...
2021-12-05 21:09:29 Starting - Preparing the instances for training......
2021-12-05 21:10:38 Downloading - Downloading input data...
2021-12-05 21:11:01 Training - Downloading the training image...
2021-12-05 21:11:33 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-12-05 21:11:34,462 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-12-05 21:11:34,463 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-12-05 21:11:34,471 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-12-05 21:11:40,706 sagemaker_pytorch_container

In [22]:
pytorch_estimator.model_data

's3://maq01-first-bucket/pytorch-training-2021-12-05-21-08-38-151/output/model.tar.gz'

### The bug
---

At the time of creating this notebook (Dec 2021), there is a bug in the SageMaker SDK (2.68.0): `model_data` returns an incorrect path ending in `model.tar.gz` (check the output of the last cell), for example:

```
s3://maq01-first-bucket/pytorch-training-2021-12-04-22-04-33-125/output/model.tar.gz
```

However, the model is saved as `output.tar.gz`. I have filed the bug in the SageMaker GitHub repository.

https://github.com/aws/sagemaker-python-sdk/issues/2762

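This bug breaks the deployment of the endpoint (see next cell). However, I found a workaround: renaming `output.tar.gz` to `model.tar.gz` in the `S3 bucket` fixes the problem. Note that SageMaker packages the contents of `SM_OUTPUT_DATA_DIR` as `output.tar.gz` and the contents of `SM_MODEL_DIR` as `model.tar.gz`, so saving the model to `SM_MODEL_DIR` in the training script should also avoid the mismatch.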

So if you are running this notebook in SageMaker, at this point your model has been uploaded to your specified bucket. Go to your `bucket`, open the `output` folder of the training job, and rename `output.tar.gz` to `model.tar.gz`.
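The rename can also be scripted instead of done in the console. S3 has no rename operation, so the sketch below copies the object to the new key and then deletes the old one. `rename_s3_object` and `sibling_key` are hypothetical helpers; the first argument is expected to be a Boto3 S3 resource such as the `s3` object created earlier.

```python
def rename_s3_object(s3, bucket, old_key, new_key):
    """Copy old_key to new_key, then delete old_key (S3 has no rename operation).

    `s3` is expected to be a boto3 S3 resource, e.g. boto3.resource("s3").
    """
    s3.Object(bucket, new_key).copy_from(CopySource={"Bucket": bucket, "Key": old_key})
    s3.Object(bucket, old_key).delete()


def sibling_key(key, new_name):
    """Return `key` with its final path component replaced by new_name."""
    prefix, _, _ = key.rpartition("/")
    return f"{prefix}/{new_name}" if prefix else new_name


# In the notebook (requires AWS credentials):
# model_key = pytorch_estimator.model_data.split("/", 3)[3]  # key ending in output/model.tar.gz
# rename_s3_object(s3, bucket_name, sibling_key(model_key, "output.tar.gz"), model_key)
```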

In [23]:
predictor = pytorch_estimator.deploy(instance_type='ml.m5.xlarge',
                                     initial_instance_count=1)

-------!

In [24]:
predictor.endpoint_name 

'pytorch-training-2021-12-05-21-28-43-864'

In [25]:
predictor.serializer # default is numpy

<sagemaker.serializers.NumpySerializer at 0x7fa898287a58>

In [26]:
# Changing the default serializer and deserializer
predictor.serializer = sagemaker.serializers.JSONSerializer()
predictor.deserializer = sagemaker.deserializers.JSONDeserializer()

In [34]:
predictor.predict("how are you")

'French'
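The endpoint can also be called without the `predictor` object, for example from another application, through the `sagemaker-runtime` client in Boto3. The sketch below assumes the endpoint accepts and returns JSON, as configured above; `invoke` is a hypothetical helper.

```python
import json


def invoke(runtime_client, endpoint_name, text):
    """Send one JSON-encoded string to the endpoint and decode the JSON reply.

    `runtime_client` is expected to be boto3.client("sagemaker-runtime").
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(text),
    )
    # The response body is a stream; read it fully before decoding.
    return json.loads(response["Body"].read())


# In the notebook (requires AWS credentials and a running endpoint):
# import boto3
# invoke(boto3.client("sagemaker-runtime"), predictor.endpoint_name, "how are you")
```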

In [None]:
predictor.delete_endpoint()

References and courtesy: 

https://github.com/debnsuma/Intro-Transformer-BERT/blob/main/BERT-Disaster-Tweets-Prediction.ipynb