# Huggingface Sagemaker-sdk extension example using `Trainer` class

## Initializing Sagemaker Session with local AWS Profile

Outside of SageMaker notebooks, `get_execution_role()` raises an exception because it cannot determine which role SageMaker should use.

To solve this issue, pass the IAM role name instead of using `get_execution_role()`.

You therefore have to create an IAM role with the correct permissions for SageMaker to start training jobs and download files from S3. Note that you need S3 permissions at bucket level (`"arn:aws:s3:::sagemaker-*"`) and at object level (`"arn:aws:s3:::sagemaker-*/*"`).

You can read [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) how to create a role with the right permissions.
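As an illustration of the two resource levels mentioned above, here is a minimal sketch of an S3 policy document built as a Python dict. The two ARNs come from the text; the action lists (`s3:ListBucket`, `s3:GetObject`, `s3:PutObject`) are an assumption for illustration and are not a complete SageMaker execution policy.

```python
import json

# The two resource ARNs named in the text above.
BUCKET_ARN = "arn:aws:s3:::sagemaker-*"      # bucket level
OBJECT_ARN = "arn:aws:s3:::sagemaker-*/*"    # object level

# Assumption: the actions below are a minimal illustrative set,
# not the full list of permissions a SageMaker role may need.
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # listing objects is a bucket-level operation
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [BUCKET_ARN],
        },
        {
            # reading and writing objects are object-level operations
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": [OBJECT_ARN],
        },
    ],
}

print(json.dumps(s3_policy, indent=2))
```

The split matters because an ARN without `/*` only addresses the bucket itself, while the `/*` form addresses the objects inside it.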

In [1]:
# local aws profile configured in ~/.aws/credentials
local_profile_name='default' # optional if you only have default configured

# role name for sagemaker -> needs the described permissions from above
role_name = "AmazonSageMaker-ExecutionRole-20201222T210251"

In [2]:
import sagemaker
import os
try:
    sess = sagemaker.Session()
    role = sagemaker.get_execution_role()
except Exception:
    import boto3
    # creates a boto3 session using the local profile we defined
    if local_profile_name:
        os.environ['AWS_PROFILE'] = local_profile_name # setting env var bc local-mode cannot use boto3 session
        #bt3 = boto3.session.Session(profile_name=local_profile_name)
        #iam = bt3.client('iam')
        # create sagemaker session with boto3 session
        #sess = sagemaker.Session(boto_session=bt3)
    iam = boto3.client('iam')
    sess = sagemaker.Session()
    # get role arn
    role = iam.get_role(RoleName=role_name)['Role']['Arn']
    


print(role)


Couldn't call 'get_role' to get Role ARN from role name lagunas to get Role path.


arn:aws:iam::854676674973:role/service-role/AmazonSageMaker-ExecutionRole-20201222T210251


### Sagemaker Session prints

In [3]:
print(sess.list_s3_files(sess.default_bucket(),'datasets/')) # list objects in s3 under datasets/
print(sess.default_bucket()) # s3 bucketname
print(sess.boto_region_name) # aws region of sagemaker session

['datasets/imdb/test/dataset.arrow', 'datasets/imdb/test/dataset_info.json', 'datasets/imdb/test/state.json', 'datasets/imdb/train/dataset.arrow', 'datasets/imdb/train/dataset_info.json', 'datasets/imdb/train/state.json']
sagemaker-eu-west-1-854676674973
eu-west-1


# Imports

Since we are using the `.py` module directly from `huggingface/`, we have to adjust `sys.path` to be able to import our estimator.

In [4]:
import sys, os

module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)


# Preprocessing the data

## Upload data to sagemaker S3

## Create a local estimator for testing

You run PyTorch training scripts on SageMaker by creating PyTorch Estimators. SageMaker runs your training script when you call `fit` on a PyTorch Estimator. The following code sample shows how to train a custom PyTorch script `train.py`, passing in hyperparameters such as `epochs`. We do not pass any data into the SageMaker training job; instead, the data is downloaded in `train.py`.

In SageMaker you can test your training in "local mode" by setting `instance_type` to `'local'`.


## Importing custom sdk-extension for HuggingFace

In [5]:
from huggingface.estimator import HuggingFace

## Create a local Estimator

The following code sample shows how to train a custom HuggingFace script `train.py`, passing in three hyperparameters (`epochs`, `train_batch_size`, `model_name`). We do not pass any data into the SageMaker training job; instead, the data is downloaded in `train.py`.

In [8]:
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

#boto3
import boto3
import os

VERSION="v11"

if local_profile_name:
    os.environ['AWS_PROFILE'] = local_profile_name # setting env var bc local-mode cannot use boto3 session
    
ex_sm_sess = boto3.client('sagemaker')
sess_experiment = Experiment(sagemaker_boto_client=ex_sm_sess)

print(dir(sess_experiment))
experiment = sess_experiment.create(experiment_name='nn-pruning-'+VERSION) # ^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,119}
#experiment = sess_experiment.load(experiment_name='nn-pruning-'+VERSION) # ^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,119}

print(experiment)

print()

for trial in experiment.list_trials():
    print("Trial", trial)

['MAX_DELETE_ALL_ATTEMPTS', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_boto_create_method', '_boto_delete_members', '_boto_delete_method', '_boto_ignore', '_boto_list_method', '_boto_load_method', '_boto_update_members', '_boto_update_method', '_construct', '_custom_boto_names', '_custom_boto_types', '_invoke_api', '_list', '_search', 'create', 'create_trial', 'delete', 'delete_all', 'description', 'experiment_name', 'from_boto', 'list', 'list_trials', 'load', 'sagemaker_boto_client', 'save', 'search', 'tags', 'to_boto', 'with_boto']
2021-01-24 15:44:17,544 - botocore.credentials - INFO - Found credentials in shared credentials file: ~/.aws/credentials
Experiment(sagemaker_boto_client=<bo

In [9]:

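local = False
if local:
    # local mode: run the training container on this machine with a tiny workload
    instance_type = "local"
    sess = None
    batch_size = 1
    num_train_epochs = 0.0005
    logging_steps = 20
else:
    # remote mode: keep the SageMaker session `sess` created above
    instance_type = "ml.p3.2xlarge"
    batch_size = 16
    num_train_epochs = 20
    logging_steps = 250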
    
    
def build_metric_definitions():
    """Build SageMaker metric definitions (name + regex) for train and eval metrics."""
    train_metrics = [
        'loss',
        'learning_rate',
        'threshold',
        'ampere_temperature',
        'regu_lambda',
        'ce_loss',
        'distil_loss',
        'nnz_perc_attention',
        'regu_loss_attention',
        'nnz_perc_dense',
        'regu_loss_dense',
        'regu_loss',
        'nnz_perc',
        'epoch',
    ]
    eval_metrics = ["f1", "exact_match"]

    # metric group -> (prefix of the key in the training logs, metric names)
    metric_types = {"train": ("", train_metrics), "validation": ("eval_", eval_metrics)}
    ret = []
    for k, (prefix, metrics) in metric_types.items():
        for m in metrics:
            # capture the value after "'<prefix><metric>': " up to the next ',' or '}'
            ret.append({'Name': f"{k}:{m}", 'Regex': f"'{prefix}{m}': (.*?)[,}}]"})
    return ret
        
    
metric_definitions = build_metric_definitions()

from nn_pruning.examples.question_answering.qa_sparse_xp import SparseQAShortNamer

def now_aws_string():
    from datetime import datetime
    return str(datetime.now()).replace(" ", "--").replace(":", "-").split(".")[0]

def estimator_run(attention:int, regu_lambda:float, dense_lambda:float, wait=True, use_spot=False):
    # build the trial name from the given inputs
    trial_name = f"nn-pruning-{VERSION}-a{attention}-l{regu_lambda}-dl{dense_lambda}--{now_aws_string()}"
    trial_name = trial_name.replace(".0", "").replace(".", "-")
    
    # creates trial for our experiment
    trial = experiment.create_trial(trial_name=trial_name)
        
    dense = attention
    hyperparameters = {"attention_block_rows":attention,
                       "attention_block_cols":attention,
                       "dense_block_rows":dense,
                       "dense_block_cols":dense,
                       "regularization_final_lambda": regu_lambda,
                       "num_train_epochs":num_train_epochs,
                       "logging_steps":logging_steps,
                       "per_device_train_batch_size":batch_size,
                       "dense_lambda":dense_lambda,
                       "dense_pruning_method":"sigmoied_threshold"}
    
    hyperparameters = {k.replace("_", "-"):v for k,v in hyperparameters.items()}

    def get_hp_name(hyper_parameters):
        p = {k.replace("-", "_"):v for k,v in hyper_parameters.items()}    

        sn = SparseQAShortNamer()

        ret = sn.shortname(p)
        return ret

    base_job_name = "nn-pruning-" + VERSION #+ get_hp_name(hyperparameters)[3:].replace(".", "-")
    print(base_job_name)
    
    
    if use_spot:
        checkpoint_s3_uri = f"s3://sagemaker-eu-west-1-854676674973/{trial_name}/checkpoint/"
        additional = dict(use_spot_instances=True,
                          max_wait=3600, 
                          checkpoint_local_path="/opt/local/model",
                          checkpoint_s3_uri = checkpoint_s3_uri
                         )
    else:
        additional = {}

    estimator = HuggingFace(entry_point='nn_pruning_train.py',
                            source_dir='../scripts',
                            sagemaker_session=sess,
                            tags = [{'Key': 'name', 'Value': "nn-pruning"},
                                    {"Key":"version", "Value": VERSION}],
                            base_job_name=base_job_name,
                            volume_size=300,
                            instance_type=instance_type,
                            instance_count=1,
                            role=role,
                            framework_version={'transformers':'4.1.1','datasets':'1.1.3'},
                            py_version='py3',
                            metric_definitions = metric_definitions,
                            hyperparameters = hyperparameters,
                            **additional)
    
    estimator.fit(job_name=trial_name,
        experiment_config={
            "TrialName": trial.trial_name,
            "TrialComponentDisplayName": "Training",
        },
        wait = wait
    )
        
    
    return estimator

estimators = []

attentions = [16, 8, 4]
regu_lambdas = [10, 20, 40]
dense_lambdas = [1.0]
use_spot = False
# launch one training job per grid point; wait=False returns without blocking on the job
for attention in attentions:
    for regu_lambda in regu_lambdas:
        for dense_lambda in dense_lambdas:
            estimator = estimator_run(attention, regu_lambda, dense_lambda, wait = False, use_spot=use_spot)
            estimators.append(estimator)
#        print(estimator)
#        estimator.fit(wait = True)
        
        


nn-pruning-v11
IMAGE_URI 854676674973.dkr.ecr.eu-west-1.amazonaws.com/huggingface-nn-pruning-training:0.0.1-gpu-transformers4.1.1-datasets1.1.3-cu110
2021-01-24 15:45:00,914 - sagemaker.image_uris - INFO - Defaulting to the only supported framework/algorithm version: latest.
2021-01-24 15:45:00,918 - sagemaker.image_uris - INFO - Ignoring unnecessary instance type: None.
2021-01-24 15:45:26,919 - sagemaker - INFO - Creating training-job with name: nn-pruning-v11-a16-l10-dl1--2021-01-24--15-45-00
nn-pruning-v11
IMAGE_URI 854676674973.dkr.ecr.eu-west-1.amazonaws.com/huggingface-nn-pruning-training:0.0.1-gpu-transformers4.1.1-datasets1.1.3-cu110
2021-01-24 15:45:27,481 - sagemaker.image_uris - INFO - Defaulting to the only supported framework/algorithm version: latest.
2021-01-24 15:45:27,492 - sagemaker.image_uris - INFO - Ignoring unnecessary instance type: None.
2021-01-24 15:45:52,942 - sagemaker - INFO - Creating training-job with name: nn-pruning-v11-a16-l20-dl1--2021-01-24--15-45-2
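The regexes produced by `build_metric_definitions()` can be checked offline before launching a job. The sketch below rebuilds the same pattern and applies it to two sample log lines; the log lines and the helper names `metric_regex` and `extract_metric` are hypothetical and only illustrate the expected dict-like log format.

```python
import re

def metric_regex(prefix, name):
    # Same pattern as in build_metric_definitions():
    # "'<prefix><name>': " followed by a lazy capture up to ',' or '}'.
    return f"'{prefix}{name}': (.*?)[,}}]"

def extract_metric(line, prefix, name):
    # Return the captured value as a float, or None if the metric is absent.
    match = re.search(metric_regex(prefix, name), line)
    return float(match.group(1)) if match else None

# Hypothetical log lines in the dict-like format the regexes expect.
train_line = "{'loss': 2.5, 'ce_loss': 1.25, 'epoch': 0.5}"
eval_line = "{'eval_f1': 88.0, 'eval_exact_match': 80.5}"

print(extract_metric(train_line, "", "loss"))
print(extract_metric(train_line, "", "ce_loss"))
print(extract_metric(eval_line, "eval_", "f1"))
```

Because the pattern starts with the opening quote, `'loss'` does not accidentally match inside `'ce_loss'`.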

In [39]:
#for exp in Experiment.list():
#    print(exp)
for trial in experiment.list_trials():
    print(trial)
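Each trial corresponds to one point of the hyperparameter grid launched earlier. As a minimal sketch, the same grid can be enumerated with `itertools.product` instead of nested loops; the list values are copied from the notebook, and the printed label format is only illustrative.

```python
from itertools import product

# Grid values copied from the launch cell above.
attentions = [16, 8, 4]
regu_lambdas = [10, 20, 40]
dense_lambdas = [1.0]

# product() iterates the rightmost list fastest, matching the nested-loop order.
grid = list(product(attentions, regu_lambdas, dense_lambdas))

for attention, regu_lambda, dense_lambda in grid:
    print(f"a{attention}-l{regu_lambda}-dl{dense_lambda}")

print(len(grid), "training jobs in total")
```

Flattening the grid this way makes the number of jobs (and therefore the cost) visible before anything is submitted.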

In [51]:
delete = False
if delete:
    experiment.delete_all(action="--force")


In [None]:
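# `estimator` is the last estimator returned by `estimator_run` in the loop above
huggingface_estimator = estimator
huggingface_estimator.model_data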

### Get latest training job name

In [None]:
huggingface_estimator.latest_training_job.name

### Attach to old estimator 

e.g. to get model data

In [None]:
old_job_name='huggingface-sdk-extension-2020-12-27-15-25-50-506'

In [None]:
from sagemaker.estimator import Estimator

In [None]:
huggingface_estimator_loaded = Estimator.attach(old_job_name)

In [None]:
huggingface_estimator_loaded.model_data

### Download model from s3

**using huggingface utils**

In [None]:
from huggingface.utils import download_model

download_model(model_data=huggingface_estimator_loaded.model_data,
               unzip=True,
               model_dir=huggingface_estimator_loaded.latest_training_job.name)
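`model_data` is an S3 URI pointing at the model archive. The sketch below shows how such a URI splits into bucket and key, which is what any downloader needs. It does not claim to mirror how `download_model` is implemented; `split_s3_uri` and the example URI are hypothetical.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    # Split "s3://bucket/some/key" into ("bucket", "some/key").
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")

# Hypothetical URI; a real one comes from `estimator.model_data`.
example_uri = "s3://example-bucket/example-job/output/model.tar.gz"
bucket, key = split_s3_uri(example_uri)
print(bucket)
print(key)
```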

**using class built-in method**

In [None]:
huggingface_estimator.download_model(unzip=False)

### Access logs

Until this [PR](https://github.com/aws/sagemaker-python-sdk/pull/2059) is merged, read the logs through the SageMaker session:

In [None]:
huggingface_estimator.sagemaker_session.logs_for_job(huggingface_estimator.latest_training_job.name, wait=True)

**after the PR is merged**

In [None]:
huggingface_estimator.logs()

In [None]:
hyperparameters = {"attention_block_rows":attention,
                   "attention_block_cols":attention,
                   "regularization_final_lambda": regu_lambda,
                   "num_train_epochs":num_train_epochs,
                   "logging_steps":logging_steps,
                   "per_device_train_batch_size":batch_size}
hp2 = {k.replace("-", "_"):v for k,v in hyperparameters.items()}

In [None]:
hp2
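The key conversion is a simple round trip: `estimator_run` turns underscores into dashes before handing the hyperparameters to the estimator, and the cell above turns them back. In script mode, SageMaker typically passes hyperparameters to the entry point as `--key value` command-line arguments. The sketch below assumes the entry point parses them with `argparse`, which may not be what `nn_pruning_train.py` actually does; the values are illustrative.

```python
import argparse

# Illustrative subset of the hyperparameters used in this notebook.
hyperparameters = {
    "attention_block_rows": 16,
    "regularization_final_lambda": 10,
    "num_train_epochs": 20,
}

# Underscore -> dash (as in estimator_run) and back again.
dashed = {k.replace("_", "-"): v for k, v in hyperparameters.items()}
restored = {k.replace("-", "_"): v for k, v in dashed.items()}

# Build the command line the way script mode would: "--key value" pairs.
argv = []
for key, value in dashed.items():
    argv += [f"--{key}", str(value)]

# Assumption: the training script uses argparse, which maps
# "--attention-block-rows" back to the attribute "attention_block_rows".
parser = argparse.ArgumentParser()
parser.add_argument("--attention-block-rows", type=int)
parser.add_argument("--regularization-final-lambda", type=float)
parser.add_argument("--num-train-epochs", type=float)
args = parser.parse_args(argv)

print(argv)
print(args.attention_block_rows, args.regularization_final_lambda, args.num_train_epochs)
```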

In [25]:
from datetime import datetime

str(datetime.now()).replace(" ", "--").replace(":", "-").split(".")[0]

'2021-01-15--14-57-42'
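This timestamp format feeds into the trial names built in `estimator_run`. The sketch below reproduces that naming logic with a fixed timestamp and checks the result against the name pattern quoted in the comment on the `create` call. `aws_timestamp` and `make_trial_name` are hypothetical helper names; the string operations are copied from the notebook.

```python
import re
from datetime import datetime

# Name pattern quoted in the notebook comment next to the experiment creation.
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,119}$")

def aws_timestamp(moment):
    # "2021-01-24 15:45:00.123" -> "2021-01-24--15-45-00"
    return str(moment).replace(" ", "--").replace(":", "-").split(".")[0]

def make_trial_name(version, attention, regu_lambda, dense_lambda, moment):
    name = (f"nn-pruning-{version}-a{attention}-l{regu_lambda}"
            f"-dl{dense_lambda}--{aws_timestamp(moment)}")
    # Dots are not allowed by the pattern: drop ".0" and turn other dots into dashes.
    return name.replace(".0", "").replace(".", "-")

name = make_trial_name("v11", 16, 10, 1.0, datetime(2021, 1, 24, 15, 45, 0))
print(name)
print(bool(NAME_PATTERN.match(name)))
```

Without the two `replace` calls, a value such as `1.0` would leave a dot in the name and the pattern would reject it.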