# Hugging Face SageMaker example using the `Trainer` class

Each folder starting with `0X_..` contains a specific SageMaker example. Each example contains a Jupyter notebook, `sagemaker-example.ipynb`, and a `src/` folder. The notebook is used to train transformers in combination with datasets on AWS SageMaker. The `src/` folder contains `train.py`, our training script, and `requirements.txt` for additional dependencies.


## Initializing Sagemaker Session with local AWS Profile

Outside of a SageMaker notebook, `get_execution_role()` raises an exception because it cannot determine the role that SageMaker requires.

To solve this, pass the IAM role name yourself instead of relying on `get_execution_role()`.

You therefore have to create an IAM role with the permissions SageMaker needs to start training jobs and download files from S3. Note that you need S3 permissions at the bucket level, `"arn:aws:s3:::sagemaker-*"`, and at the object level, `"arn:aws:s3:::sagemaker-*/*"`.

See the [SageMaker roles documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for how to create a role with the right permissions.
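As a rough illustration of the two permission levels described above, the following sketch builds a minimal S3 policy document as a Python dictionary. The helper name `build_s3_policy` is hypothetical and the list of actions is an assumption; check the linked documentation for the permissions your training jobs actually need.

```python
import json

# Hypothetical helper: builds a minimal S3 policy document for buckets
# matching "sagemaker-*". The action lists are assumptions, not a
# complete SageMaker execution policy.
def build_s3_policy(bucket_pattern="sagemaker-*"):
    bucket_arn = f"arn:aws:s3:::{bucket_pattern}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # bucket-level permission (listing the bucket)
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # object-level permissions (reading and writing files)
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }

policy = build_s3_policy()
print(json.dumps(policy, indent=2))
```

The printed JSON can be used as a starting point for a policy attached to the role, alongside the SageMaker permissions themselves.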

In [30]:
# local aws profile configured in ~/.aws/credentials
local_profile_name='hf-sm' # optional if you only have default configured

# role name for sagemaker -> needs the described permissions from above
role_name = "SageMakerRole"

In [33]:
import sagemaker

try:
    sess = sagemaker.Session()
    role = sagemaker.get_execution_role()
except ValueError:
    import boto3
    # creates a boto3 session using the local profile we defined
    if local_profile_name:
        bt3 = boto3.session.Session(profile_name=local_profile_name)
        iam = bt3.client('iam')
        # create sagemaker session with boto3 session
        sess = sagemaker.Session(boto_session=bt3)
    else:
        iam = boto3.client('iam')
        sess = sagemaker.Session()
    # get role arn
    role = iam.get_role(RoleName=role_name)['Role']['Arn']


print(role)

Couldn't call 'get_role' to get Role ARN from role name philipp to get Role path.


arn:aws:iam::558105141721:role/SageMakerRole


## Preprocessing the data

In [7]:
from datasets import load_dataset
from transformers import AutoTokenizer
import torch

In [8]:
# load dataset
dataset = load_dataset('imdb')

# download tokenizer
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# helper function that tokenizes a batch of examples
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

# load the train and test splits
train_dataset, test_dataset = load_dataset('imdb', split=['train', 'test'])
test_dataset = test_dataset.shuffle().select(range(10000)) # reduce the test dataset to 10k samples

# sample a small subset so the example trains quickly
train_dataset = train_dataset.shuffle().select(range(1000)) # reduce the training dataset to 1k samples
test_dataset = test_dataset.shuffle().select(range(50)) # further reduce the test dataset to 50 samples


# tokenize dataset
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))

# rename the label column and set the format for PyTorch
train_dataset.rename_column_("label", "labels")
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset.rename_column_("label", "labels")
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])


# cache the datasets so we can load them directly for training
torch.save(train_dataset, 'train_dataset.pt')
torch.save(test_dataset, 'test_dataset.pt')

Reusing dataset imdb (/Users/philippschmid/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3)
Reusing dataset imdb (/Users/philippschmid/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3)
Loading cached shuffled indices for dataset at /Users/philippschmid/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3/cache-80ccdcca361db0f6.arrow
Loading cached shuffled indices for dataset at /Users/philippschmid/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3/cache-40d4f3183282e2a2.arrow
Loading cached shuffled indices for dataset at /Users/philippschmid/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3/cache-82f8ca6d57daeab5.arrow
Loading cached processed dataset at /Users/phi
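The `tokenize` helper pads and truncates every review to a fixed length. The toy sketch below shows what that means for `input_ids` and `attention_mask`. It does not call the real tokenizer: `pad_or_truncate`, the `max_length` of 8, the `pad_id` of 0, and the token ids are all made up for illustration, and a real tokenizer also takes care of special tokens when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    ids = token_ids[:max_length]           # truncation: drop tokens past max_length
    mask = [1] * len(ids)                  # 1 marks a real token
    padding = max_length - len(ids)        # number of pad positions to append
    return ids + [pad_id] * padding, mask + [0] * padding

# a short sequence gets padded, a long one gets truncated
short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1, 13)), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example ends up with the same length, the columns can be converted to fixed-size tensors by `set_format('torch', ...)`.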

## Upload data to the SageMaker S3 bucket

In [18]:
prefix = 'imdb/small'

training_input_path  = sess.upload_data('train_dataset.pt', key_prefix=prefix+'/training')
test_input_path      = sess.upload_data('test_dataset.pt', key_prefix=prefix+'/test')

print(training_input_path)
print(test_input_path)

s3://sagemaker-eu-central-1-558105141721/imdb/small/training/train_dataset.pt
s3://sagemaker-eu-central-1-558105141721/imdb/small/test/test_dataset.pt
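The printed URIs follow the pattern `s3://<bucket>/<key_prefix>/<file name>`, where the bucket is the session's default bucket. The sketch below reproduces that composition so the paths are easier to predict. `expected_s3_uri` is a hypothetical helper, not part of the SageMaker SDK, and the bucket name uses a placeholder account id.

```python
import os

def expected_s3_uri(bucket, key_prefix, local_path):
    """Compose the URI of a single uploaded file (illustrative only)."""
    filename = os.path.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{filename}"

# placeholder bucket name; the real one depends on your region and account
uri = expected_s3_uri(
    "sagemaker-eu-central-1-123456789012",
    "imdb/small/training",
    "train_dataset.pt",
)
print(uri)
```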


## Create a local estimator for testing

You run PyTorch training scripts on SageMaker by creating PyTorch Estimators. SageMaker starts training your script when you call `fit` on a PyTorch Estimator. The following code sample shows how to train a custom PyTorch script, `train.py`, passing in three hyperparameters (`epochs`, `train_batch_size`, and `model_name`). The datasets uploaded to S3 above are passed to the training job as the `train` and `test` input channels in the `fit` call.

In SageMaker you can test your training in "local mode" by setting `instance_type` to `'local'`.


In [22]:
from sagemaker.pytorch import PyTorch

pytorch_estimator = PyTorch(entry_point='train.py',
                            source_dir='src',
                            base_job_name='huggingface',
                            instance_type='local',
                            instance_count=1,
                            role=role,
                            framework_version='1.5.0',
                            py_version='py3',
                            hyperparameters = {'epochs': 1,
                                               'train_batch_size': 32,
                                               'model_name':'distilbert-base-uncased'
                                                })

In [27]:
pytorch_estimator.fit({'train': training_input_path, 'test': test_input_path})

2020-12-22 16:25:03 Starting - Starting the training job...
2020-12-22 16:25:28 Starting - Launching requested ML instancesProfilerReport-1608654302: InProgress
......
2020-12-22 16:26:29 Starting - Preparing the instances for training......
2020-12-22 16:27:29 Downloading - Downloading input data...
2020-12-22 16:27:50 Training - Downloading the training image......
2020-12-22 16:29:11 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-12-22 16:29:05,811 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2020-12-22 16:29:05,835 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-12-22 16:29:12,080 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-12-22 16:29:12,377 sagemaker-training-toolkit 


2020-12-22 16:29:34 Uploading - Uploading generated training model
2020-12-22 16:29:34 Failed - Training job failed
2020-12-22 16:29:30,755 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
Command "/opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32"
Traceback (most recent call last):
  File "train.py", line 31, in <module>
    train_dataset  = torch.load(os.path.join(args.training_dir, 'train_dataset.pt'))
  File "/opt/conda/lib/python3.6/site-packages/torch/serialization.py", line 585, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/lib/python3.6/site-packages/torch/serialization.py", line 765, in _legacy_load
    result = unpickler.load()
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 414, in __setstate__
    pa_table = reader._read_files([data_file])
  File "/opt/conda/lib/python3.6/site-packages/datase

UnexpectedStatusException: Error for Training job huggingface-2020-12-22-16-25-02-393: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32"
Traceback (most recent call last):
  File "train.py", line 31, in <module>
    train_dataset  = torch.load(os.path.join(args.training_dir, 'train_dataset.pt'))
  File "/opt/conda/lib/python3.6/site-packages/torch/serialization.py", line 585, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/lib/python3.6/site-packages/torch/serialization.py", line 765, in _legacy_load
    result = unpickler.load()
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 414, in __setstate__
    pa_table = reader._read_files([data_file])
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_reader.py", line 167, in _read_files
    pa_table: pa.Table = self._get_dataset_from_filename(f_dict)
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_reader.py", line 292, in _get_datas
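The failing command in the log shows that the `hyperparameters` dictionary reaches `train.py` as command-line flags. The sketch below mimics that mapping. `to_cli_args` is a hypothetical helper and not the SageMaker toolkit's implementation; the alphabetical ordering of the flags is inferred from the log line only.

```python
def to_cli_args(hyperparameters):
    """Turn a hyperparameter dict into a flat list of --name value flags."""
    args = []
    for name in sorted(hyperparameters):   # ordering inferred from the log
        args += [f"--{name}", str(hyperparameters[name])]
    return args

cli = to_cli_args({
    'epochs': 1,
    'train_batch_size': 32,
    'model_name': 'distilbert-base-uncased',
})
command = " ".join(["train.py"] + cli)
print(command)
```

This is why the training script needs an argument parser that accepts a flag for each key in `hyperparameters`.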

## Create an Estimator

To run the same training on a managed instance, create a PyTorch Estimator with a SageMaker instance type such as `ml.p3.2xlarge` and call `fit` on it. As in the local example, `train.py` receives three hyperparameters (`epochs`, `train_batch_size`, and `model_name`), and the datasets uploaded to S3 are passed as the `train` and `test` input channels.


In [24]:
from sagemaker.pytorch import PyTorch

pytorch_estimator = PyTorch(entry_point='train.py',
                            source_dir='src',
                            sagemaker_session=sess,
#                            use_spot_instances=True,
#                            max_wait=7200, # Seconds to wait for spot instances to become available
                            base_job_name='huggingface',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            role=role,
                            framework_version='1.6.0',
                            py_version='py3',
                            hyperparameters = {'epochs': 1,
                                               'train_batch_size': 32,
                                               'model_name':'distilbert-base-uncased'
                                                })

In [25]:
pytorch_estimator.fit({'train': training_input_path, 'test': test_input_path})

2020-12-22 16:13:10 Starting - Starting the training job...
2020-12-22 16:13:35 Starting - Launching requested ML instancesProfilerReport-1608653589: InProgress
......
2020-12-22 16:14:36 Starting - Preparing the instances for training.........
2020-12-22 16:16:13 Downloading - Downloading input data...
2020-12-22 16:16:37 Training - Downloading the training image......
2020-12-22 16:17:38 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-12-22 16:17:38,237 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2020-12-22 16:17:38,260 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-12-22 16:17:41,283 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-12-22 16:17:41,577 sagemaker-training-toolki

2020-12-22 16:18:01,995 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
Command "/opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32"
Traceback (most recent call last):
  File "train.py", line 24, in <module>
    parser.add_argument('--training_dir', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
  File "/opt/conda/lib/python3.6/os.py", line 669, in __getitem__
    raise KeyError(key) from None
KeyError: 'SM_CHANNEL_TRAIN'

2020-12-22 16:18:18 Uploading - Uploading generated training model
2020-12-22 16:18:18 Failed - Training job failed


UnexpectedStatusException: Error for Training job huggingface-2020-12-22-16-13-09-857: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32"
Traceback (most recent call last):
  File "train.py", line 24, in <module>
    parser.add_argument('--training_dir', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
  File "/opt/conda/lib/python3.6/os.py", line 669, in __getitem__
    raise KeyError(key) from None
KeyError: 'SM_CHANNEL_TRAIN'
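The `KeyError` above is raised because `os.environ['SM_CHANNEL_TRAIN']` is evaluated while the variable is not set. One defensive pattern, sketched here on the assumption that `train.py` uses `argparse` as the traceback suggests, is to read the variable with `os.environ.get` and fail with a clearer message. The `--test_dir` flag, the `SM_CHANNEL_TEST` default, and the default hyperparameter values are assumptions and are not taken from the original script.

```python
import argparse
import os

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # hyperparameters sent by the estimator (default values are assumptions)
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--train_batch_size', type=int, default=32)
    parser.add_argument('--model_name', type=str, default='distilbert-base-uncased')
    # data directories: fall back to None instead of raising KeyError
    parser.add_argument('--training_dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test_dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TEST'))
    args = parser.parse_args(argv)
    if args.training_dir is None:
        # exits with a readable usage error rather than a bare KeyError
        parser.error('no training data: set SM_CHANNEL_TRAIN or pass --training_dir')
    return args

# example: running the script by hand, outside of a SageMaker container
args = parse_args(['--epochs', '1', '--training_dir', '/tmp/train'])
print(args.training_dir)
```

With this pattern the script can also be run locally by passing `--training_dir` explicitly, which makes it easier to debug before starting a training job.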