# Hugging Face SageMaker example using the `Trainer` class

Each folder starting with `0X_..` contains a specific SageMaker example. Each example contains a Jupyter notebook, `sagemaker-example.ipynb`, and a `src/` folder. The notebook is used to run training with `transformers` and `datasets` on AWS SageMaker. The `src/` folder contains `train.py`, our training script, and `requirements.txt` for additional dependencies.


## Initializing Sagemaker Session with local AWS Profile

In [4]:
import sagemaker


sess = sagemaker.Session()

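try:
    # Inside SageMaker (for example a notebook instance) the execution role can be read directly.
    role = sagemaker.get_execution_role()
except ValueError:
    # Outside SageMaker, look up the role ARN by its name through IAM.
    import boto3
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='SageMakerRole')['Role']['Arn']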

Couldn't call 'get_role' to get Role ARN from role name SageMakerRole to get Role path.


## Create a local estimator for testing

You run PyTorch training scripts on SageMaker by creating PyTorch Estimators. SageMaker starts training your script when you call `fit` on a PyTorch Estimator. The following code sample shows how to train a custom PyTorch script, `train.py`, passing in two hyperparameters (`epochs` and `train-batch-size`). We do not pass any data into the SageMaker training job; instead, the data is downloaded in `train.py`.

In SageMaker, you can test your training in "local mode" by setting `instance_type` to `'local'`. Local mode runs the training container on your own machine, so Docker must be installed.


In [11]:
from sagemaker.pytorch import PyTorch

pytorch_estimator = PyTorch(entry_point='train.py',
                            source_dir='src',
                            base_job_name='huggingface',
                            instance_type='local',
                            instance_count=1,
                            role=role,
                            framework_version='1.6.0',
                            py_version='py3',
                            hyperparameters = {'epochs': 1,
                                               'train-batch-size': 32})

In [14]:
pytorch_estimator.fit()

Using the short-lived AWS credentials found in session. They might expire while running.


FileNotFoundError: [Errno 2] No such file or directory: 'docker': 'docker'
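
The `FileNotFoundError` above suggests that no `docker` executable was found on the `PATH`; local mode starts the training container through Docker. As a minimal sketch, a check like the following could run before `fit()` to fail early with a clearer message. The helper name `local_mode_available` and its `which` parameter are hypothetical and not part of the SageMaker SDK.

```python
import shutil


def local_mode_available(which=shutil.which) -> bool:
    """Return True if a `docker` executable can be found on the PATH.

    `which` is injectable only to make the check easy to test
    (hypothetical helper, not part of the SageMaker SDK).
    """
    return which("docker") is not None


if not local_mode_available():
    print("Docker not found: install Docker before using instance_type='local'.")
```

If the check fails, either install Docker or skip local mode and use a managed instance type as in the next section.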

## Create an Estimator

You run PyTorch training scripts on SageMaker by creating PyTorch Estimators, and training starts when you call `fit` on the estimator. The following code sample trains the custom script `train.py` on an `ml.p3.2xlarge` instance, passing in three hyperparameters (`epochs`, `batch-size`, and `learning-rate`). As before, we do not pass any data into the SageMaker training job; instead, the data is downloaded in `train.py`.
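
SageMaker forwards the `hyperparameters` dictionary to the entry point as command-line arguments, as in the command `train.py --batch-size 64 --epochs 20 --learning-rate 0.1` that appears in the job log. The actual `train.py` is not shown here, so the following is only a sketch of how such a script might read them with `argparse`; the function name `parse_args` and the default values are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse hyperparameters passed by SageMaker as CLI flags.

    Flag names mirror the keys of the `hyperparameters` dict;
    the defaults are illustrative assumptions.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--learning-rate", type=float, default=5e-5)
    return parser.parse_args(argv)


# Same arguments as the command shown in the job log.
args = parse_args(["--batch-size", "64", "--epochs", "20", "--learning-rate", "0.1"])
print(args.epochs, args.batch_size, args.learning_rate)
```

With a parser like this, every key in `hyperparameters` has to match a declared flag. A key such as `train-batch-size` would need its own `--train-batch-size` argument, otherwise `argparse` exits with an error.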


In [15]:
from sagemaker.pytorch import PyTorch

pytorch_estimator = PyTorch(entry_point='train.py',
                            source_dir='src',
                            sagemaker_session=sess,
                            base_job_name='huggingface',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            role=role,
                            framework_version='1.5.0',
                            py_version='py3',
                            hyperparameters = {'epochs': 20, 'batch-size': 64, 'learning-rate': 0.1})

In [16]:
pytorch_estimator.fit()

2020-12-22 10:19:19 Starting - Starting the training job...
2020-12-22 10:19:20 Starting - Launching requested ML instancesProfilerReport-1608632358: InProgress
......
2020-12-22 10:20:44 Starting - Preparing the instances for training......
2020-12-22 10:21:45 Downloading - Downloading input data
2020-12-22 10:21:45 Training - Downloading the training image.........
2020-12-22 10:23:13 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-12-22 10:23:15,105 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-12-22 10:23:15,134 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-12-22 10:23:15,138 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-12-22 10:23:15,442 sagemaker-containers INFO     Mod

UnexpectedStatusException: Error for Training job huggingface-2020-12-22-10-19-18-621: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python train.py --batch-size 64 --epochs 20 --learning-rate 0.1"
Traceback (most recent call last):
  File "train.py", line 1, in <module>
    from datasets import load_dataset
  File "/opt/conda/lib/python3.6/site-packages/datasets/__init__.py", line 26, in <module>
    from .arrow_dataset import Dataset, concatenate_datasets
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 178, in <module>
    class Dataset(DatasetInfoMixin, IndexableMixin):
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1892, in Dataset
    new_fingerprint: Optional[str] = None,
AttributeError: module 'numpy.random' has no attribute 'Generator'
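
The traceback indicates that the NumPy version in the training container does not provide `numpy.random.Generator`, which was introduced in NumPy 1.17. One possible fix, stated here as an assumption rather than a verified solution, is to add a line such as `numpy>=1.17` to `src/requirements.txt`. The sketch below shows a version check that a training script could use to fail with a clearer message; the helper name `has_random_generator` is hypothetical.

```python
def has_random_generator(numpy_version: str) -> bool:
    """Return True if this NumPy version includes numpy.random.Generator.

    Only the major and minor components are compared, against 1.17,
    the release that introduced the Generator API.
    """
    major, minor = (int(part) for part in numpy_version.split(".")[:2])
    return (major, minor) >= (1, 17)


# Example version strings, chosen for illustration.
for version in ("1.16.4", "1.17.0", "1.19.5"):
    print(version, has_random_generator(version))
```

In a training script, the function could be called with `numpy.__version__` before importing `datasets`.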