# Hugging Face SageMaker SDK extension example using the `Trainer` class

## Install requirements (if you haven't already) and set up ipywidgets for datasets in SageMaker Studio

In [10]:
%%capture
!pip install -r ../requirements.txt --upgrade

In [8]:
%%capture
import os 
import IPython
if 'SAGEMAKER_TRAINING_MODULE' in os.environ:
    !conda install -c conda-forge ipywidgets -y
    IPython.Application.instance().kernel.do_shutdown(True) # has to restart kernel so changes are used

## Initializing a SageMaker Session with a local AWS profile

Outside SageMaker notebooks, `get_execution_role()` raises an exception because it cannot determine which IAM role SageMaker should use.

To work around this, pass the IAM role name explicitly instead of calling `get_execution_role()`.

You therefore need to create an IAM role with the permissions SageMaker requires to start training jobs and download files from S3. Note that you need S3 permissions at both the bucket level (`"arn:aws:s3:::sagemaker-*"`) and the object level (`"arn:aws:s3:::sagemaker-*/*"`).

See the [SageMaker roles documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for how to create a role with the right permissions.
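
To illustrate the two permission levels, the following sketch builds an S3 policy document as a Python dictionary. The two `Resource` ARNs are the ones given above. The specific actions (`s3:ListBucket`, `s3:GetObject`, `s3:PutObject`) and the `Sid` values are assumptions chosen for illustration, so adjust them to what your training jobs actually need.

```python
import json

# Bucket-level and object-level ARNs as given in the text.
BUCKET_ARN = "arn:aws:s3:::sagemaker-*"
OBJECT_ARN = "arn:aws:s3:::sagemaker-*/*"

# Assumption: the actions and Sid values are illustrative, not a vetted policy.
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BucketLevel",
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [BUCKET_ARN],
        },
        {
            "Sid": "ObjectLevel",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": [OBJECT_ARN],
        },
    ],
}

# Serialize to the JSON form that IAM policy editors accept.
policy_json = json.dumps(s3_policy, indent=2)
print(policy_json)
```

Listing a bucket is a bucket-level operation, while reading and writing objects are object-level operations, which is why the two statements target different ARNs.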

In [1]:
# local aws profile configured in ~/.aws/credentials
local_profile_name='hf-sm' # optional if you only have default configured

# role name for sagemaker -> needs the described permissions from above
role_name = "SageMakerRole"

In [2]:
import sagemaker
import os
try:
    sess = sagemaker.Session()
    role = sagemaker.get_execution_role()
except Exception:
    import boto3
    # creates a boto3 session using the local profile we defined
    if local_profile_name:
        os.environ['AWS_PROFILE'] = local_profile_name # setting env var bc local-mode cannot use boto3 session
        #bt3 = boto3.session.Session(profile_name=local_profile_name)
        #iam = bt3.client('iam')
        # create sagemaker session with boto3 session
        #sess = sagemaker.Session(boto_session=bt3)
    iam = boto3.client('iam')
    sess = sagemaker.Session()
    # get role arn
    role = iam.get_role(RoleName=role_name)['Role']['Arn']
    


print(role)


Couldn't call 'get_role' to get Role ARN from role name philipp to get Role path.


arn:aws:iam::558105141721:role/SageMakerRole


### SageMaker Session prints

In [11]:
print(sess.list_s3_files(sess.default_bucket(), 'datasets/')) # list objects in s3 under datasets/
print(sess.default_bucket()) # s3 bucketname
print(sess.boto_region_name) # aws region of sagemaker session

['datasets/imdb/small/test/dataset.arrow', 'datasets/imdb/small/test/dataset_info.json', 'datasets/imdb/small/test/state.json', 'datasets/imdb/small/test/test_dataset.pt', 'datasets/imdb/small/train/dataset.arrow', 'datasets/imdb/small/train/dataset_info.json', 'datasets/imdb/small/train/state.json', 'datasets/imdb/small/training/train_dataset.pt', 'datasets/imdb/test/dataset.arrow', 'datasets/imdb/test/dataset_info.json', 'datasets/imdb/test/state.json', 'datasets/imdb/train/dataset.arrow', 'datasets/imdb/train/dataset_info.json', 'datasets/imdb/train/state.json']
sagemaker-eu-central-1-558105141721
eu-central-1


# Imports

Since we import the `.py` module directly from `huggingface/`, we have to adjust `sys.path` to be able to import our estimator.

In [3]:
import sys, os

module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)


In [4]:
%load_ext autoreload
%autoreload 2

In [5]:
from huggingface.estimator import HuggingFace

# Create an Estimator with an Experiment

[Metric Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html)

To find a metric, SageMaker searches the logs that your algorithm emits and finds logs that match the regular expression that you specify for that metric. 

## Defining Training Metrics (SageMaker Python SDK)

Define the metrics that you want to send to CloudWatch by specifying a list of metric names and regular expressions as the `metric_definitions` argument when you initialize an `Estimator` object. For example, if you want to monitor both the `train:error` and `validation:error` metrics in CloudWatch, your `Estimator` initialization would look like the following:
```python
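estimator = Estimator(
    ...,
    sagemaker_session=sm_sess,
    # each metric is a name plus a regex applied to the training logs
    # (regexes follow the metric documentation linked above)
    metric_definitions=[
        {'Name': 'train:error', 'Regex': 'Train_error=(.*?);'},
        {'Name': 'validation:error', 'Regex': 'Valid_error=(.*?);'},
    ],
    tags=[{'Key': 'my-experiments', 'Value': 'demo2'}])

# attach the training job to an experiment trial
estimator.fit(
    ...,
    experiment_config={
        # "ExperimentName"
        "TrialName": demo_trial.trial_name,
        "TrialComponentDisplayName": "TrainingJob",
    })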
```

In a regex such as `Train_error=(.*?);` for the `train:error` metric, the first part matches the exact text `Train_error=`, and the expression `(.*?);` captures zero or more of any character up to the first `;` character.
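
The matching behaviour can be checked locally with Python's `re` module. This is a minimal sketch that emulates the lookup rather than SageMaker's actual implementation. The log line is a hypothetical example, and `extract_metrics` is a helper name introduced here for illustration.

```python
import re

# Metric definitions: a metric name and a regex with one capture group.
metric_definitions = [
    {"Name": "train:error", "Regex": "Train_error=(.*?);"},
    {"Name": "validation:error", "Regex": "Valid_error=(.*?);"},
]

# Hypothetical log line; a real training script decides its own log format.
log_line = "epoch=3 Train_error=0.138; Valid_error=0.324;"


def extract_metrics(line, definitions):
    """Return {metric name: captured string} for every regex that matches the line."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            # group(1) is the lazy capture that stops at the first ';'
            found[definition["Name"]] = match.group(1)
    return found


metrics = extract_metrics(log_line, metric_definitions)
print(metrics)
```

Because `(.*?)` is lazy, each capture stops at the first semicolon after the prefix, so the two metrics on the same line do not bleed into each other.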

For more information about training by using Amazon SageMaker Python SDK estimators, see https://github.com/aws/sagemaker-python-sdk#sagemaker-python-sdk-overview.

# Scripts
https://github.com/huggingface/transformers/tree/master/examples/text-classification


## GLUE

    export TASK_NAME=MRPC

    python run_glue.py \
      --model_name_or_path bert-base-cased \
      --task_name $TASK_NAME \
      --do_train \
      --do_eval \
      --max_seq_length 128 \
      --per_device_train_batch_size 32 \
      --learning_rate 2e-5 \
      --num_train_epochs 3.0 \
      --output_dir /tmp/$TASK_NAME/
      
## XNLI

    export XNLI_DIR=/path/to/XNLI

    python run_xnli.py \
      --model_name_or_path bert-base-multilingual-cased \
      --language de \
      --train_language en \
      --do_train \
      --do_eval \
      --data_dir $XNLI_DIR \
      --per_device_train_batch_size 32 \
      --learning_rate 5e-5 \
      --num_train_epochs 2.0 \
      --max_seq_length 128 \
      --output_dir /tmp/debug_xnli/ \
      --save_steps -1


## Local Estimator

In [23]:
! ls 

README.md                     run_tf_glue.py
requirements.txt              run_tf_text_classification.py
run_glue.py                   run_xnli.py


In [16]:
from huggingface.estimator import HuggingFace


local_estimator = HuggingFace(entry_point='run_glue.py',
                            source_dir='../../transformers/examples/text-classification',
                            base_job_name='huggingface-sdk-extension',
                            instance_type='local',
                            instance_count=1,
                            role=role,
                            framework_version={'transformers':'4.1.1','datasets':'1.1.3'},
                            py_version='py3',
                            hyperparameters = {
                                'model_name_or_path': 'distilbert-base-cased',
                                'task_name':'MRPC',
                                'do_train': True,
                                'do_eval': True,
                                'max_seq_length':'128',
                                'per_device_train_batch_size':32,
                                'learning_rate':2e-5,
                                'num_train_epochs': 3.0,
                                'output_dir':'Not defined sagemaker'
                            })

In [17]:
local_estimator.fit()

Creating rp6bue5co8-algo-1-g4cu2 ... done
Attaching to rp6bue5co8-algo-1-g4cu2
rp6bue5co8-algo-1-g4cu2 | 2020-12-30 15:10:28,511 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
rp6bue5co8-algo-1-g4cu2 | 2020-12-30 15:10:28,513 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
rp6bue5co8-algo-1-g4cu2 | 2020-12-30 15:10:28,529 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
rp6bue5co8-algo-1-g4cu2 | 2020-12-30 15:10:28,532 sagemaker_pytorch_container.training INFO     Invoking user training script.
rp6bue5co8-algo-1-g4cu2 | 2020-12-30 15:10:28,571 botocore.credentials INFO     Found credentials in environment variables.
rp6bue5co8-algo-1-g4cu2 | 2020-12-30 15:10:28,816 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
rp6bue5co8-algo-1-g4cu2 | 2020-12-30 15:10:28,836 s

rp6bue5co8-algo-1-g4cu2 | Traceback (most recent call last):
rp6bue5co8-algo-1-g4cu2 |   File "run_glue.py", line 467, in <module>
rp6bue5co8-algo-1-g4cu2 |     main()
rp6bue5co8-algo-1-g4cu2 |   File "run_glue.py", line 146, in main
rp6bue5co8-algo-1-g4cu2 |     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
rp6bue5co8-algo-1-g4cu2 |   File "/opt/conda/lib/python3.6/site-packages/transformers/hf_argparser.py", line 158, in parse_args_into_dataclasses
rp6bue5co8-algo-1-g4cu2 |     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
rp6bue5co8-algo-1-g4cu2 | ValueError: Some specified arguments are not used by the HfArgumentParser: ['True', 'True']
rp6bue5co8-algo-1-g4cu2 |
rp6bue5co8-algo-1-g4cu2 | 2020-12-30 15:10:30,482 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
rp6bue5co8-algo-1-g4cu2

RuntimeError: Failed to run: ['docker-compose', '-f', '/private/var/folders/jj/dzns9hc55db1vmfsjvrh9n8m0000gp/T/tmpusce6rmq/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
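
The `ValueError` in the log is consistent with how the hyperparameters reach the script: each hyperparameter is passed as a `--name value` pair, so `'do_train': True` arrives as `--do_train True`. The sketch below reproduces the effect with plain `argparse`. It assumes, as the leftover `['True', 'True']` suggests, that the boolean options are registered as `store_true` flags, which take no value. The parser here is a stand-in written for illustration, not the actual `HfArgumentParser`.

```python
import argparse

# Stand-in parser (assumption): boolean options are flags that accept no value.
parser = argparse.ArgumentParser()
parser.add_argument("--task_name", type=str)
parser.add_argument("--do_train", action="store_true")
parser.add_argument("--do_eval", action="store_true")

# Arguments in the "--name value" form produced from the hyperparameters dict.
cli_args = ["--task_name", "MRPC", "--do_train", "True", "--do_eval", "True"]

# parse_known_args returns unconsumed tokens instead of exiting with an error.
namespace, remaining_args = parser.parse_known_args(cli_args)

print(namespace.do_train, namespace.do_eval)  # both flags are set
print(remaining_args)  # the stray "True" values are left over
```

A value-less flag cannot be expressed as a `--name value` pair, so the mismatch has to be resolved either in how the arguments are generated or in how the script parses them.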

## SageMaker Estimator

In [32]:
from huggingface.estimator import HuggingFace


huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='../scripts',
                            sagemaker_session=sess,
                            use_spot_instances=True,
                            max_wait=4600, # must be equal to or greater than max_run (both in seconds)
                            max_run=3600,
                            base_job_name='huggingface-sdk-extension',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            role=role,
                            framework_version={'transformers':'4.1.1','datasets':'1.1.3'},
                            py_version='py3',
                            hyperparameters = {'epochs': 3,
                                               'train_batch_size': 16,
                                               'model_name':'distilbert-base-uncased'})
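
The constraint noted on `max_wait` can be checked before a job is submitted. The helper below is a hypothetical convenience function, not part of the SageMaker SDK. It only encodes the rule from the comment above, that `max_wait` must be at least `max_run`.

```python
def check_spot_timeouts(max_run, max_wait):
    """Validate spot-training timeouts (both in seconds).

    Raises ValueError if max_wait is smaller than max_run; otherwise returns
    the headroom, i.e. how many seconds max_wait exceeds max_run.
    """
    if max_wait < max_run:
        raise ValueError(
            f"max_wait ({max_wait}s) must be >= max_run ({max_run}s)"
        )
    return max_wait - max_run


# Values taken from the estimator definition above.
headroom = check_spot_timeouts(max_run=3600, max_wait=4600)
print(headroom)
```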