# Hugging Face SageMaker example using the `Trainer` class

Each folder starting with `0X_..` contains a specific SageMaker example. Each example contains a Jupyter notebook, `sagemaker-example.ipynb`, and a `src/` folder. The notebook is used to train models with `transformers` and `datasets` on AWS SageMaker. The `src/` folder contains `train.py`, our training script, and `requirements.txt` for additional dependencies.


## Initializing Sagemaker Session with local AWS Profile

In [36]:
local_profile_name='hf-sm'

In [37]:
import sagemaker
import boto3

# creates a boto3 session using the local profile we defined
bt3 = boto3.session.Session(profile_name=local_profile_name)


sess = sagemaker.Session(boto_session=bt3)

# since we are using the sagemaker-sdk locally we cannot `get_execution_role` 
# role = sagemaker.get_execution_role()

Outside of SageMaker notebooks, `get_execution_role()` raises an exception because it cannot determine which role SageMaker should use.

To solve this issue, pass the IAM role name instead of using `get_execution_role()`.

In [3]:
role_name = "SageMakerRole"

_WARNING: This policy gives full S3 access to the container that is running in SageMaker. You can change this policy to a more restrictive one, or create your own policy._
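As an illustration of what a more restrictive policy could look like, the sketch below builds a policy document scoped to a single bucket. The bucket name `my-sagemaker-bucket` and the helper `build_bucket_policy` are hypothetical, and the list of actions is an assumption; a real training job may need more (or fewer) permissions.

```python
import json


def build_bucket_policy(bucket_name: str) -> dict:
    """Return an IAM policy document limited to a single S3 bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Reading and writing apply to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


# "my-sagemaker-bucket" is a hypothetical bucket name used for illustration.
policy_document = json.dumps(build_bucket_policy("my-sagemaker-bucket"), indent=2)
print(policy_document)
```

The resulting JSON could be saved to a file and attached to the role as a customer-managed policy in place of `AmazonS3FullAccess`.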

In [4]:
%%bash  -s "$local_profile_name" "$role_name" 
# This script creates a role named SageMakerRole
# that can be used by SageMaker and has Full access to S3.

ROLE_NAME=$2

# WARNING: this policy gives full S3 access to container that
# is running in SageMaker. You can change this policy to a more
# restrictive one, or create your own policy.
POLICY_S3=arn:aws:iam::aws:policy/AmazonS3FullAccess

# Creates the trust policy document that allows the SageMaker
# service to assume the role
cat <<EOF > /tmp/assume-role-policy-document.json
{
	"Version": "2012-10-17",
	"Statement": [{
		"Effect": "Allow",
		"Principal": {
			"Service": "sagemaker.amazonaws.com"
		},
		"Action": "sts:AssumeRole"
	}]
}
EOF

# Creates the role
aws iam create-role --profile $1  --role-name ${ROLE_NAME} --assume-role-policy-document file:///tmp/assume-role-policy-document.json

# attaches the S3 full access policy to the role
aws iam attach-role-policy --profile $1 --policy-arn ${POLICY_S3}  --role-name ${ROLE_NAME}


An error occurred (EntityAlreadyExists) when calling the CreateRole operation: Role with name SageMakerRole already exists.
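The `EntityAlreadyExists` error above is harmless when the role was created by an earlier run. A minimal sketch of an idempotent alternative in Python follows; `ensure_role` and `SAGEMAKER_TRUST_POLICY` are hypothetical names, and the function assumes it receives a boto3 IAM client. It reads the AWS error code from the exception's `response` attribute, which is how botocore client errors expose it. Attaching the S3 policy is left out.

```python
import json

# Same trust policy as in the bash cell: lets the SageMaker service assume the role.
SAGEMAKER_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def ensure_role(iam_client, role_name: str, trust_policy: dict) -> str:
    """Create the role if it is missing, otherwise look it up; return its ARN."""
    try:
        response = iam_client.create_role(
            RoleName=role_name,
            AssumeRolePolicyDocument=json.dumps(trust_policy),
        )
    except Exception as err:
        # botocore client errors carry the AWS error code in err.response.
        code = getattr(err, "response", {}).get("Error", {}).get("Code")
        if code != "EntityAlreadyExists":
            raise
        response = iam_client.get_role(RoleName=role_name)
    return response["Role"]["Arn"]


# Usage with the session from this notebook (needs valid AWS credentials):
# role = ensure_role(bt3.client("iam"), role_name, SAGEMAKER_TRUST_POLICY)
```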


In [11]:
# get the ARN of the created role
iam = bt3.client('iam')
role = iam.get_role(RoleName=role_name)['Role']['Arn']

## Create a local estimator for testing

You run PyTorch training scripts on SageMaker by creating PyTorch Estimators. SageMaker starts training your script when you call `fit` on a PyTorch Estimator. The following code sample shows how to train a custom PyTorch script `train.py`, passing in two hyperparameters (`epochs` and `train_batch_size`). We do not pass any data into the SageMaker training job; instead, `train.py` downloads it.

In SageMaker you can test your training in "local mode" by setting `instance_type` to `'local'`.


In [44]:
from sagemaker.pytorch import PyTorch

pytorch_estimator = PyTorch(entry_point='train.py',
                            source_dir='src',
                            base_job_name='huggingface',
                            instance_type='local',
                            instance_count=1,
                            role=role,
                            framework_version='1.5.0',
                            py_version='py3',
                            hyperparameters = {'epochs': 1,
                                               'train_batch_size': 32})
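SageMaker passes the entries of the `hyperparameters` dictionary to the entry point as command-line arguments. The actual `train.py` is not shown here, so the following is only a sketch of how such a script might read them; the `--model_dir` argument and its `./model` fallback are assumptions for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the hyperparameters that the estimator forwards to the script."""
    parser = argparse.ArgumentParser()
    # Argument names mirror the keys of the `hyperparameters` dict.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    # SM_MODEL_DIR is set inside the training container; the fallback
    # path is an assumption for running the script elsewhere.
    parser.add_argument(
        "--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR", "./model")
    )
    # parse_known_args tolerates any extra arguments the script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulates the arguments produced by the estimator defined above.
args = parse_args(["--epochs", "1", "--train_batch_size", "32"])
print(args.epochs, args.train_batch_size)
```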

In [45]:
pytorch_estimator.fit()

Creating tmp4r1by5t0_algo-1-2slct_1 ... done
Attaching to tmp4r1by5t0_algo-1-2slct_1
algo-1-2slct_1  | 2020-12-22 10:39:19,097 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
algo-1-2slct_1  | 2020-12-22 10:39:19,106 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-2slct_1  | 2020-12-22 10:39:19,120 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
algo-1-2slct_1  | 2020-12-22 10:39:19,124 sagemaker_pytorch_container.training INFO     Invoking user training script.
algo-1-2slct_1  | 2020-12-22 10:39:19,490 sagemaker-containers INFO     Module default_user_module_name does not provide a setup.py. 
algo-1-2slct_1  | Generating setup.py
algo-1-2slct_1  | 2020-12-22 10:39:19,490 sagemaker-containers INFO     Generating setup.cfg
algo-1-2slct_1  | 2020-12-22 10:39:19,491 sagemaker-containers INFO     

algo-1-2slct_1  | Collecting distro<2,>=1.5.0
algo-1-2slct_1  |   Downloading distro-1.5.0-py2.py3-none-any.whl (18 kB)
algo-1-2slct_1  | Collecting docker[ssh]<5,>=4.3.1
algo-1-2slct_1  |   Downloading docker-4.4.0-py2.py3-none-any.whl (146 kB)
     |████████████████████████████████| 146 kB 3.2 MB/s eta 0:00:01
algo-1-2slct_1  | Collecting texttable<2,>=0.9.0
algo-1-2slct_1  |   Downloading texttable-1.6.3-py2.py3-none-any.whl (10 kB)
algo-1-2slct_1  | Collecting dockerpty<1,>=0.4.1
algo-1-2slct_1  |   Downloading dockerpty-0.4.1.tar.gz (13 kB)
algo-1-2slct_1  | Collecting cached-property<2,>=1.2.0
algo-1-2slct_1  |   Downloading cached_property-1.5.2-py2.py3-none-any.whl (7.6 kB)
algo-1-2slct_1  | Collecting jsonschema<4,>=2.5.1
algo-1-2slct_1  |   Downloading jsonschema-3.2.0-py2.py3-none-any.whl (56 kB)
     |████████████████████████████████| 56 kB 2.8 MB/s eta 0:

algo-1-2slct_1  | 2020-12-22 10:39:57,256 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-2slct_1  | 2020-12-22 10:39:57,290 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-2slct_1  | 2020-12-22 10:39:57,318 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-2slct_1  | 2020-12-22 10:39:57,339 sagemaker-containers INFO     Invoking user script
algo-1-2slct_1  | 
algo-1-2slct_1  | Training Env:
algo-1-2slct_1  | 
algo-1-2slct_1  | {
algo-1-2slct_1  |     "additional_framework_parameters": {},
algo-1-2slct_1  |     "channel_input_dirs": {},
algo-1-2slct_1  |     "current_host": "algo-1-2slct",
algo-1-2slct_1  |     "framework_module": "sagemaker_pytorch_container.training:main",
algo-1-2slct_1  |     "hosts": [
algo-1-2slct_1  |         "algo-1-2slct"
algo-

tmp4r1by5t0_algo-1-2slct_1 exited with code 1
Aborting on container exit...


RuntimeError: Failed to run: ['docker-compose', '-f', '/private/var/folders/jj/dzns9hc55db1vmfsjvrh9n8m0000gp/T/tmp4r1by5t0/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
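Local mode runs the training container through Docker on the local machine; the command in the error above shows that this SDK version calls `docker-compose`. The sketch below is a small pre-flight check that only selects a local instance type when both tools are on the `PATH`. The helper name `pick_instance_type` and the fallback to a remote instance type are assumptions for illustration, and the check does not diagnose why the container above exited with code 1.

```python
import shutil


def pick_instance_type(
    remote_type: str = "ml.p3.2xlarge",
    local_gpu: bool = False,
    which=shutil.which,
) -> str:
    """Return a local-mode instance type if Docker tooling is found, else the remote type."""
    # Both executables must be found on PATH for local mode to start a container.
    have_docker = which("docker") is not None and which("docker-compose") is not None
    if not have_docker:
        return remote_type
    # 'local_gpu' is the local-mode value for machines with a GPU.
    return "local_gpu" if local_gpu else "local"


print(pick_instance_type())
```

The `which` parameter exists only so the lookup can be replaced in tests; by default it uses `shutil.which`.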

## Create an Estimator

You run PyTorch training scripts on SageMaker by creating PyTorch Estimators. SageMaker starts training your script when you call `fit` on a PyTorch Estimator. The following code sample shows how to train a custom PyTorch script `train.py`, passing in two hyperparameters (`epochs` and `train_batch_size`). We do not pass any data into the SageMaker training job; instead, `train.py` downloads it. This time the estimator runs on a remote `ml.p3.2xlarge` instance instead of in local mode.


In [48]:
from sagemaker.pytorch import PyTorch

pytorch_estimator = PyTorch(entry_point='train.py',
                            source_dir='src',
                            sagemaker_session=sess,
                            # use_spot_instances=True,
                            # max_wait=7200,  # timeout in seconds for the whole spot job (waiting plus training); must be >= max_run
                            base_job_name='huggingface',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            role=role,
                            framework_version='1.6.0',
                            py_version='py3',
                            hyperparameters = {'epochs': 1,
                                               'train_batch_size': 32})
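When the commented-out spot options are enabled, `max_wait` has to be at least as large as `max_run`, the maximum training time in seconds. The sketch below keeps the two values consistent; the helper `spot_training_kwargs` and the one-hour defaults are hypothetical choices for illustration.

```python
def spot_training_kwargs(max_run_seconds: int, extra_wait_seconds: int = 3600) -> dict:
    """Build estimator keyword arguments for managed spot training."""
    if max_run_seconds <= 0 or extra_wait_seconds < 0:
        raise ValueError("max_run_seconds must be positive and extra_wait_seconds non-negative")
    return {
        "use_spot_instances": True,
        "max_run": max_run_seconds,
        # max_wait covers waiting for capacity plus the training time itself.
        "max_wait": max_run_seconds + extra_wait_seconds,
    }


spot_kwargs = spot_training_kwargs(max_run_seconds=3600)
print(spot_kwargs)
# These could be unpacked into the estimator call, e.g. PyTorch(..., **spot_kwargs).
```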

In [49]:
pytorch_estimator.fit()

2020-12-22 11:08:27 Starting - Starting the training job...
2020-12-22 11:08:50 Starting - Launching requested ML instances......
ProfilerReport-1608635306: InProgress
2020-12-22 11:09:51 Starting - Preparing the instances for training......
2020-12-22 11:10:58 Downloading - Downloading input data
2020-12-22 11:10:58 Training - Downloading the training image...
2020-12-22 11:15:15 Uploading - Uploading generated training model
2020-12-22 11:15:15 Completed - Training job completed
ProfilerReport-1608635306: NoIssuesFound
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-12-22 11:12:44,121 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2020-12-22 11:12:44,144 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-12-22 11:12:47,170 sagemaker_pytorch_container.training INFO     Invoking user training

Training seconds: 257
Billable seconds: 257
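The two values are equal here because the job ran on an on-demand instance (the spot options are commented out). With managed spot training, the savings are derived from these two numbers as `(1 - billable / training) * 100`. A minimal sketch, with `spot_savings_percent` as a hypothetical helper name:

```python
def spot_savings_percent(training_seconds: int, billable_seconds: int) -> float:
    """Return the percentage saved relative to paying for every training second."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


# Values from the job above: nothing saved on an on-demand instance.
print(spot_savings_percent(257, 257))  # 0.0
```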
