# Hugging Face SageMaker example using the `Trainer` class

Each folder starting with `0X_..` contains a specific SageMaker example. Each example consists of a Jupyter notebook, `sagemaker-example.ipynb`, and a `src/` folder. The notebook is used to train models with `transformers` and `datasets` on AWS SageMaker. The `src/` folder contains our training script, `train.py`, and a `requirements.txt` for additional dependencies.


## Initializing Sagemaker Session with local AWS Profile

In [1]:
local_profile_name='hf-sm'

In [2]:
import sagemaker
import boto3

# creates a boto3 session using the local profile we defined
bt3 = boto3.session.Session(profile_name=local_profile_name)


sess = sagemaker.Session(boto_session=bt3)

# since we are using the sagemaker-sdk locally we cannot `get_execution_role` 
# role = sagemaker.get_execution_role()

Outside of SageMaker notebooks, `get_execution_role()` raises an exception because it cannot determine which IAM role SageMaker should use.

To solve this issue, look up the IAM role by name and pass its ARN instead of calling `get_execution_role()`.

In [3]:
role_name = "SageMakerRole"

_WARNING: This policy gives full S3 access to the container that is running in SageMaker. You can change this policy to a more restrictive one, or create your own policy._
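As one way to act on the warning above, the sketch below builds a policy document scoped to a single bucket instead of relying on `AmazonS3FullAccess`. The bucket name `my-sagemaker-bucket` and the list of allowed actions are assumptions for illustration and are not defined anywhere in this notebook; check which actions your training job actually needs before using it.

```python
import json


def build_bucket_policy(bucket_name):
    """Return an IAM policy document limited to one S3 bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Reads and writes are granted on the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


# "my-sagemaker-bucket" is a hypothetical bucket name.
policy = build_bucket_policy("my-sagemaker-bucket")
print(json.dumps(policy, indent=2))
```

The printed JSON could be saved to a file and attached to the role as an inline policy, for example with `aws iam put-role-policy`, in place of the managed full-access policy.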

In [4]:
%%bash  -s "$local_profile_name" "$role_name" 
# This script creates a role named SageMakerRole
# that can be used by SageMaker and has Full access to S3.

ROLE_NAME=$2

# WARNING: this policy gives full S3 access to container that
# is running in SageMaker. You can change this policy to a more
# restrictive one, or create your own policy.
POLICY_S3=arn:aws:iam::aws:policy/AmazonS3FullAccess

# Writes the trust policy document that allows the SageMaker
# service to assume the role
cat <<EOF > /tmp/assume-role-policy-document.json
{
	"Version": "2012-10-17",
	"Statement": [{
		"Effect": "Allow",
		"Principal": {
			"Service": "sagemaker.amazonaws.com"
		},
		"Action": "sts:AssumeRole"
	}]
}
EOF

# Creates the role
aws iam create-role --profile $1  --role-name ${ROLE_NAME} --assume-role-policy-document file:///tmp/assume-role-policy-document.json

# attaches the S3 full access policy to the role
aws iam attach-role-policy --profile $1 --policy-arn ${POLICY_S3}  --role-name ${ROLE_NAME}


An error occurred (EntityAlreadyExists) when calling the CreateRole operation: Role with name SageMakerRole already exists.


In [4]:
# get the ARN of the role created above
iam = bt3.client('iam')
role = iam.get_role(RoleName=role_name)['Role']['Arn']
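The `EntityAlreadyExists` error shown above appears whenever the bash cell is re-run after the role has been created. A minimal sketch of an idempotent alternative in Python follows: it looks the role up first and only creates it when it is missing. The helper name `ensure_role_arn` is hypothetical, and the function expects a boto3 IAM client such as `bt3.client('iam')`; it does not attach the S3 policy, which would still need a separate step.

```python
import json

# Same trust policy as in the bash cell: lets SageMaker assume the role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}


def ensure_role_arn(iam_client, role_name, trust_policy=TRUST_POLICY):
    """Return the ARN of `role_name`, creating the role only if it is missing."""
    try:
        return iam_client.get_role(RoleName=role_name)["Role"]["Arn"]
    except iam_client.exceptions.NoSuchEntityException:
        created = iam_client.create_role(
            RoleName=role_name,
            AssumeRolePolicyDocument=json.dumps(trust_policy),
        )
        return created["Role"]["Arn"]
```

With the session from the earlier cell, this would be called as `role = ensure_role_arn(bt3.client('iam'), role_name)`.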

## Create a local estimator for testing

You run PyTorch training scripts on SageMaker by creating PyTorch Estimators. SageMaker starts training your script when you call `fit` on a PyTorch Estimator. The following code sample shows how to train a custom PyTorch script, `train.py`, passing in four hyperparameters (`epochs`, `train_batch_size`, `model_name`, and `tokenizer`). We do not pass any data into the SageMaker training job; instead, the data is downloaded inside `train.py`.

In SageMaker you can test your training in "local mode" by setting `instance_type` to `'local'`.
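Because the local and remote estimators in this notebook differ mainly in `instance_type`, one option is to build their keyword arguments from a single helper so the two configurations do not drift apart. This is only a sketch: `estimator_kwargs` is a hypothetical helper, the role ARN below is a placeholder, and the default `framework_version` is an assumption you may want to change.

```python
def estimator_kwargs(local, role_arn, framework_version="1.6.0"):
    """Build PyTorch estimator arguments for local mode or a remote instance."""
    return {
        "entry_point": "train.py",
        "source_dir": "src",
        "base_job_name": "huggingface",
        # 'local' runs the container on this machine via Docker;
        # otherwise use the GPU instance type from this notebook.
        "instance_type": "local" if local else "ml.p3.2xlarge",
        "instance_count": 1,
        "role": role_arn,
        "framework_version": framework_version,
        "py_version": "py3",
        "hyperparameters": {
            "epochs": 1,
            "train_batch_size": 32,
            "model_name": "distilbert-base-uncased",
            "tokenizer": "distilbert-base-uncased",
        },
    }


# Placeholder ARN for illustration only.
local_kwargs = estimator_kwargs(True, "arn:aws:iam::123456789012:role/SageMakerRole")
remote_kwargs = estimator_kwargs(False, "arn:aws:iam::123456789012:role/SageMakerRole")
print(local_kwargs["instance_type"], remote_kwargs["instance_type"])
```

The resulting dictionary can be unpacked into the estimator, for example `PyTorch(**estimator_kwargs(True, role))`.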


In [5]:
from sagemaker.pytorch import PyTorch

pytorch_estimator = PyTorch(entry_point='train.py',
                            source_dir='src',
                            base_job_name='huggingface',
                            instance_type='local',
                            instance_count=1,
                            role=role,
                            framework_version='1.5.0',
                            py_version='py3',
                            hyperparameters = {'epochs': 1,
                                               'train_batch_size': 32,
                                               'model_name':'distilbert-base-uncased',
                                               'tokenizer':'distilbert-base-uncased'})

In [6]:
pytorch_estimator.fit()

Creating tmpjxtziolz_algo-1-idsdf_1 ... done
Attaching to tmpjxtziolz_algo-1-idsdf_1
algo-1-idsdf_1  | 2020-12-22 14:34:39,208 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
algo-1-idsdf_1  | 2020-12-22 14:34:39,220 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-idsdf_1  | 2020-12-22 14:34:39,237 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
algo-1-idsdf_1  | 2020-12-22 14:34:39,241 sagemaker_pytorch_container.training INFO     Invoking user training script.
algo-1-idsdf_1  | 2020-12-22 14:34:39,622 sagemaker-containers INFO     Module default_user_module_name does not provide a setup.py. 
algo-1-idsdf_1  | Generating setup.py
algo-1-idsdf_1  | 2020-12-22 14:34:39,623 sagemaker-containers INFO     Generating setup.cfg
algo-1-idsdf_1  | 2020-12-22 14:34:39,623 sagemaker-containers INFO     

algo-1-idsdf_1  | Collecting docker[ssh]<5,>=4.3.1
algo-1-idsdf_1  |   Downloading docker-4.4.0-py2.py3-none-any.whl (146 kB)
algo-1-idsdf_1  | Collecting websocket-client<1,>=0.32.0
algo-1-idsdf_1  |   Downloading websocket_client-0.57.0-py2.py3-none-any.whl (200 kB)
algo-1-idsdf_1  | Collecting dockerpty<1,>=0.4.1
algo-1-idsdf_1  |   Downloading dockerpty-0.4.1.tar.gz (13 kB)
algo-1-idsdf_1  | Collecting jsonschema<4,>=2.5.1
algo-1-idsdf_1  |   Downloading jsonschema-3.2.0-py2.py3-none-any.whl (56 kB)
algo-1-idsdf_1  | Collecting texttable<2,>=0.9.0
algo-1-idsdf_1  |   Downloading texttable-1.6.3-py2.py3-none-any.whl (10 kB)
algo-1-idsdf_1  | Collecting cached-

tmpjxtziolz_algo-1-idsdf_1 exited with code 1
Aborting on container exit...


RuntimeError: Failed to run: ['docker-compose', '-f', '/private/var/folders/jj/dzns9hc55db1vmfsjvrh9n8m0000gp/T/tmpjxtziolz/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
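Local mode shells out to `docker-compose`, as the failing command above shows, so it can help to confirm that the required tools are on the `PATH` before calling `fit()`. The sketch below is a small pre-flight check using only the standard library; `missing_tools` is a hypothetical helper, and passing this check does not explain or fix the container exit code shown above, which comes from inside the container.

```python
import shutil


def missing_tools(tools=("docker", "docker-compose")):
    """Return the names of required command-line tools not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Local mode is likely to fail; missing tools:", ", ".join(missing))
else:
    print("docker and docker-compose were found on PATH.")
```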

## Create an Estimator

As in local mode, you run the training script by creating a PyTorch Estimator and calling `fit` on it. This time the job runs on a managed `ml.p3.2xlarge` instance and uses the `sess` session created above. The following code sample trains the custom PyTorch script `train.py`, passing in four hyperparameters (`epochs`, `train_batch_size`, `model_name`, and `tokenizer`). We do not pass any data into the SageMaker training job; instead, the data is downloaded inside `train.py`.
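SageMaker passes the estimator's `hyperparameters` to the entry point as command-line arguments (for example `--epochs 1`), so the script has to parse them itself. The actual `train.py` is not shown in this notebook; the following is a hedged sketch of how such parsing might look with `argparse`. The `--model_dir` argument and its default are assumptions for illustration, based on the `SM_MODEL_DIR` environment variable that SageMaker sets inside the training container.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the hyperparameters that the estimator forwards to the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")
    parser.add_argument("--tokenizer", type=str, default="distilbert-base-uncased")
    # Assumed output location: SM_MODEL_DIR inside the container,
    # with a fallback so the script also runs outside SageMaker.
    parser.add_argument(
        "--model_dir",
        type=str,
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    # parse_known_args ignores any extra arguments added by the platform.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulates the arguments a training job would receive.
args = parse_args(["--epochs", "1", "--train_batch_size", "32"])
print(args.epochs, args.train_batch_size, args.model_name)
```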


In [58]:
from sagemaker.pytorch import PyTorch

pytorch_estimator = PyTorch(entry_point='train.py',
                            source_dir='src',
                            sagemaker_session=sess,
#                            use_spot_instances=True,
#                            max_wait=7200, # Seconds to wait for spot instances to become available
                            base_job_name='huggingface',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            role=role,
                            framework_version='1.6.0',
                            py_version='py3',
                            hyperparameters = {'epochs': 1,
                                               'train_batch_size': 32,
                                               'model_name':'distilbert-base-uncased',
                                               'tokenizer':'distilbert-base-uncased'
                                                })

In [59]:
pytorch_estimator.fit()

2020-12-22 12:44:19 Starting - Starting the training job...
2020-12-22 12:44:43 Starting - Launching requested ML instances
ProfilerReport-1608641058: InProgress
......
2020-12-22 12:45:44 Starting - Preparing the instances for training......
2020-12-22 12:46:46 Downloading - Downloading input data
2020-12-22 12:46:46 Training - Downloading the training image........
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-12-22 12:48:12,773 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2020-12-22 12:48:12,796 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.

2020-12-22 12:48:26 Training - Training image download completed. Training in progress.
2020-12-22 12:48:19,030 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-12-22 12:48:19,341 sagemaker-training-toolkit IN

  Building wheel for pyrsistent (setup.py): finished with status 'done'
  Created wheel for pyrsistent: filename=pyrsistent-0.17.3-cp36-cp36m-linux_x86_64.whl size=112543 sha256=0aa01dea2ee568c49bcb1626bec8bdcb4f247eba3e15c405888ee69e6f2e5be8
  Stored in directory: /root/.cache/pip/wheels/34/13/19/294da8e11bce7e563afee51251b9fa878185e14f4b5caf00cb
Successfully built sklearn sacremoses docopt dockerpty pyrsistent
Installing collected packages: regex, sacremoses, filelock, tokenizers, transformers, xxhash, pyarrow, dill, multiprocess, datasets, sklearn, attrs, pyrsistent, jsonschema, texttable, python-dotenv, docopt, dockerpty, websocket-client, docker, distro, cached-property, docker-compose
Successfully installed attrs-20.3.0 cached-property-1.5.2 datasets-1.1.3 dill-0.3.3 distro-1.5.0 docker-4.4.0 docker-compose-1.27.4 dockerpty-0.4.1 docopt-0.6.2 filelock-3.0.12 jsonschema-3.2.0 multiprocess-0.70.11.1 pyarrow-2.0.0 pyrsistent-0.17.3 python-dotenv-0.15.


2020-12-22 12:49:59 Uploading - Uploading generated training model
{'eval_loss': 0.6840440630912781, 'eval_accuracy': 0.72, 'eval_f1': 0.5625000000000001, 'eval_precision': 0.75, 'eval_recall': 0.45, 'epoch': 1.0}
{'epoch': 1.0}
***** Eval results *****
2020-12-22 12:49:57,627 sagemaker-training-toolkit INFO     Reporting training SUCCESS

2020-12-22 12:50:48 Completed - Training job completed
Training seconds: 256
Billable seconds: 256
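The evaluation metrics are printed to the job log as a Python dictionary. As a quick sanity check, the sketch below parses that line and recomputes the F1 score from the reported precision and recall, using F1 = 2PR / (P + R). The log line is copied from the output above; `f1_score` is a small helper written for this example, not a function from the training script.

```python
import ast

# Metrics line copied from the training log above.
log_line = (
    "{'eval_loss': 0.6840440630912781, 'eval_accuracy': 0.72, "
    "'eval_f1': 0.5625000000000001, 'eval_precision': 0.75, "
    "'eval_recall': 0.45, 'epoch': 1.0}"
)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# literal_eval safely turns the printed dict back into a Python object.
metrics = ast.literal_eval(log_line)
recomputed = f1_score(metrics["eval_precision"], metrics["eval_recall"])
print("reported:", metrics["eval_f1"], "recomputed:", recomputed)
```

With precision 0.75 and recall 0.45, the formula gives 2 × 0.75 × 0.45 / 1.2 = 0.5625, which agrees with the reported `eval_f1` up to floating-point rounding.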
