# Huggingface Sagemaker-sdk - Getting Started Demo
### Binary Classification with `Trainer` and `imdb` dataset

[Introduction](#Introduction)  
[Development Environment and Permissions](#Development-Environment-and-Permissions)


# Introduction

Welcome to our end-to-end binary text-classification example. In this demo, we will use Hugging Face's `transformers` and `datasets` libraries together with a custom Amazon sagemaker-sdk extension to fine-tune a pre-trained transformer for binary text classification. In particular, the pre-trained model will be fine-tuned on the `imdb` dataset. To get started, we need to set up the environment with a few prerequisite steps for permissions, configuration, and so on.

_**NOTE: You can run this demo in SageMaker Studio or on your local machine.**_

# Development Environment and Permissions 

## Development environment 

In [2]:
import os 

runtime='local'

if 'SAGEMAKER_TRAINING_MODULE' in os.environ:
    runtime='sagemaker'
    
runtime

'local'

If you run this demo within SageMaker Studio, you have to update `ipywidgets` for the `datasets` library.

In [5]:
%%capture
import os 
import IPython
if runtime =='sagemaker':
    !conda install -c conda-forge ipywidgets -y
    IPython.Application.instance().kernel.do_shutdown(True) # has to restart kernel so changes are used

In [13]:
!pip install git+https://github.com/philschmid/sagemaker-sdk-huggingface.git

Collecting git+https://github.com/philschmid/sagemaker-sdk-huggingface.git
  Cloning https://github.com/philschmid/sagemaker-sdk-huggingface.git to /private/var/folders/jj/dzns9hc55db1vmfsjvrh9n8m0000gp/T/pip-req-build-0mgcmq3f


Building wheels for collected packages: sagemaker-huggingface
  Building wheel for sagemaker-huggingface (setup.py) ... done
  Created wheel for sagemaker-huggingface: filename=sagemaker_huggingface-0.0.1-py3-none-any.whl size=16169 sha256=8dd3920ba03e343a431929394b9a92403199471fef51b06c4b492b9944d64ec1
  Stored in directory: /private/var/folders/jj/dzns9hc55db1vmfsjvrh9n8m0000gp/T/pip-ephem-wheel-cache-ol73rt6e/wheels/ea/f4/50/478c4ff02760c480780476a589fe276064c446bc710087db4c
Successfully built sagemaker-huggingface
Installing collected packages: sagemaker-huggingface
Successfully installed sagemaker-huggingface-0.0.1


In [16]:
!pip show sagemaker-huggingface

Name: sagemaker-huggingface
Version: 0.0.1
Summary: Custom Sdk implementation for the huggingface libraries
Home-page: https://github.com/philschmid/sagemaker-sdk-huggingface/tree/SagemakerTrainer
Author: Philipp
Author-email: schmidphilipp1995@gmail.com
License: Apache 2.0
Location: /Users/philippschmid/.anaconda3/envs/hf/lib/python3.8/site-packages
Requires: sagemaker, sagemaker-experiments, datasets, sagemaker, torch, sklearn, boto3, numpy, transformers, matplotlib
Required-by: 


In [10]:
%%capture
!pip install -r ../requirements.txt --upgrade

## Permissions

## Initializing Sagemaker Session with local AWS Profile

Outside of SageMaker notebooks, `get_execution_role()` raises an exception because it cannot determine which role SageMaker should use.

To work around this, pass the name of an IAM role explicitly instead of relying on `get_execution_role()`.

This means you have to create an IAM role with the permissions SageMaker needs to start training jobs and download files from S3. Note that you need S3 permissions on bucket level (`"arn:aws:s3:::sagemaker-*"`) and on object level (`"arn:aws:s3:::sagemaker-*/*"`).

You can read [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) how to create a role with the right permissions.
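The bucket-level and object-level resources mentioned above can be written down as an IAM policy document. The following is only a minimal sketch: the function name `sagemaker_s3_policy` is made up for illustration, and the chosen actions (`s3:ListBucket`, `s3:GetObject`, `s3:PutObject`) are an assumption about what a typical training job needs, so check them against the linked AWS documentation before using them.

```python
import json


def sagemaker_s3_policy(bucket_pattern: str = "sagemaker-*") -> dict:
    """Build an illustrative IAM policy document with bucket- and object-level S3 access."""
    bucket_arn = f"arn:aws:s3:::{bucket_pattern}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # bucket-level permission: listing the objects of a bucket
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # object-level permissions: reading and writing objects (assumed action set)
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


# print the policy as JSON, e.g. to paste it into the IAM console
print(json.dumps(sagemaker_s3_policy(), indent=2))
```

The two statements are kept separate because bucket-level actions only match the bucket ARN, while object-level actions only match ARNs ending in `/*`.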

In [1]:
# local aws profile configured in ~/.aws/credentials
local_profile_name='hf-sm' # optional if you only have default configured

# role name for sagemaker -> needs the described permissions from above
role_name = "SageMakerRole"

In [2]:
import sagemaker
import os
try:
    sess = sagemaker.Session()
    role = sagemaker.get_execution_role()
except Exception:
    import boto3
    # creates a boto3 session using the local profile we defined
    if local_profile_name:
        os.environ['AWS_PROFILE'] = local_profile_name # setting env var bc local-mode cannot use boto3 session
        #bt3 = boto3.session.Session(profile_name=local_profile_name)
        #iam = bt3.client('iam')
        # create sagemaker session with boto3 session
        #sess = sagemaker.Session(boto_session=bt3)
    iam = boto3.client('iam')
    sess = sagemaker.Session()
    # get role arn
    role = iam.get_role(RoleName=role_name)['Role']['Arn']
    


print(role)


Couldn't call 'get_role' to get Role ARN from role name philipp to get Role path.


arn:aws:iam::558105141721:role/SageMakerRole


### Sagemaker Session prints

In [4]:
print(sess.list_s3_files(sess.default_bucket(),'datasets/')) # list objects in s3 under datasets/
print(sess.default_bucket()) # s3 bucketname
print(sess.boto_region_name) # aws region of sagemaker session

['datasets/imdb/small/test/dataset.arrow', 'datasets/imdb/small/test/dataset_info.json', 'datasets/imdb/small/test/state.json', 'datasets/imdb/small/test/test_dataset.pt', 'datasets/imdb/small/train/dataset.arrow', 'datasets/imdb/small/train/dataset_info.json', 'datasets/imdb/small/train/state.json', 'datasets/imdb/small/training/train_dataset.pt']
sagemaker-eu-central-1-558105141721
eu-central-1


# Imports

Since we are using the `.py` module directly from `huggingface/` we have to adjust our `sys.path` to be able to import our estimator

In [3]:
import sys, os

module_path = os.path.abspath(os.path.join('../../src'))
if module_path not in sys.path:
    sys.path.append(module_path)


# Preprocessing the data

In [121]:
from datasets import load_dataset
from transformers import AutoTokenizer

In [122]:
# load dataset
dataset = load_dataset('imdb')

# download tokenizer
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

#helper tokenizer function
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

# load the train and test splits
train_dataset, test_dataset = load_dataset('imdb', split=['train', 'test'])
test_dataset = test_dataset.shuffle().select(range(10000)) # reduce the test dataset to 10k samples

# optional: sample a smaller dataset for quick experiments
#train_dataset = train_dataset.shuffle().select(range(2000)) # reduce the training dataset to 2k samples
#test_dataset = test_dataset.shuffle().select(range(150)) # reduce the test dataset to 150 samples


# tokenize dataset
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))

# set format for pytorch
train_dataset.rename_column_("label", "labels")
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset.rename_column_("label", "labels")
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

Reusing dataset imdb (/Users/philippschmid/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3)
Reusing dataset imdb (/Users/philippschmid/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3)
Loading cached shuffled indices for dataset at /Users/philippschmid/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3/cache-f7ed38da5ada7a37.arrow
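To make the effect of `padding='max_length'` and `truncation=True` in the `tokenize` helper more concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and a real tokenizer additionally keeps special tokens such as the final separator when truncating.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Return fixed-length input_ids plus an attention mask (1 = real token, 0 = padding)."""
    kept = token_ids[:max_length]        # truncation: drop everything beyond max_length
    n_pad = max_length - len(kept)       # padding: number of pad tokens needed to fill up
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# a sequence shorter than max_length gets padded
short = pad_or_truncate([101, 2023, 102], max_length=5)
# a sequence longer than max_length gets cut
long = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short)
print(long)
```

Because every example ends up with the same length, the tokenized columns can later be converted to tensors with `set_format('torch', ...)` without further collation logic.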



## Upload data to sagemaker S3

In [123]:
import glob
def upload_data_to_s3(dataset=None,prefix='datasets',split_type='train'):
    """Helper function which saves the dataset locally using dataset.save_to_disk() and then uploads it to S3."""
    
    temp_prefix =f"{prefix}/{split_type}"
    # saves datasets in local directory
    dataset.save_to_disk(f"./{temp_prefix}")
    
    # loops over saved files and uploads them to s3 
    for file in glob.glob(f"./{temp_prefix}/*"):
        sess.upload_data(file, key_prefix=temp_prefix)

    # return s3 url to files for estimator.fit()
    return f"s3://{sess.default_bucket()}/{temp_prefix}"

In [124]:
prefix = 'datasets/imdb'

training_input_path  = upload_data_to_s3(dataset=train_dataset,prefix=prefix,split_type='train')
test_input_path      = upload_data_to_s3(dataset=test_dataset,prefix=prefix,split_type='test')

print(training_input_path)
print(test_input_path)


s3://sagemaker-eu-central-1-558105141721/datasets/imdb/train
s3://sagemaker-eu-central-1-558105141721/datasets/imdb/test
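Since `upload_data_to_s3` uploads each saved file under `key_prefix`, the resulting object locations can be predicted from the local file names. The sketch below only does the string handling and never talks to S3; `expected_s3_uris` and the bucket name `my-sagemaker-bucket` are hypothetical names used for illustration.

```python
import posixpath
from typing import Iterable, List


def expected_s3_uris(bucket: str, prefix: str, split_type: str, filenames: Iterable[str]) -> List[str]:
    """Predict the S3 URIs of files uploaded under '<prefix>/<split_type>/'."""
    key_prefix = posixpath.join(prefix, split_type)
    return [
        # each file keeps its base name and is placed directly under the key prefix
        f"s3://{bucket}/{posixpath.join(key_prefix, posixpath.basename(name))}"
        for name in filenames
    ]


uris = expected_s3_uris(
    bucket="my-sagemaker-bucket",  # placeholder, use sess.default_bucket() in the notebook
    prefix="datasets/imdb",
    split_type="train",
    filenames=["./datasets/imdb/train/dataset.arrow", "./datasets/imdb/train/state.json"],
)
print(uris)
```

Comparing such a list with the output of `sess.list_s3_files(...)` is a quick way to check that an upload is complete.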


## Create a local estimator for testing

You run PyTorch training scripts on SageMaker by creating PyTorch Estimators. SageMaker starts training your script when you call `fit` on the estimator. The following code sample shows how you train a custom script `train.py` and pass hyperparameters such as `epochs` to it. The datasets uploaded to S3 above are passed to the training job as the `train` and `test` input channels when `fit` is called.

In SageMaker you can test your training in "local mode" by setting `instance_type` to `'local'`.
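Local mode runs the training container on your own machine, so it depends on Docker tooling being installed. The following sketch checks for the two executables before a local job is started. `local_mode_prerequisites` is a hypothetical helper, and the assumption that both `docker` and `docker-compose` must be on the `PATH` may not hold for every SDK version.

```python
import shutil
from typing import Dict


def local_mode_prerequisites() -> Dict[str, bool]:
    """Report whether the executables assumed to be needed for local mode are on the PATH."""
    return {tool: shutil.which(tool) is not None for tool in ("docker", "docker-compose")}


# warn early instead of waiting for the training job to fail
missing = [tool for tool, found in local_mode_prerequisites().items() if not found]
if missing:
    print(f"Local mode will likely fail, not found on PATH: {', '.join(missing)}")
else:
    print("docker and docker-compose found")
```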


In [7]:
!pygmentize src/train.py

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import random
import logging
import sys
import argparse
import os
import torch

if __name__ =='__main__':

    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--epochs', type=int, defa
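The listing of `train.py` above is cut off, so the following is not the original script but a minimal sketch of how such an entry point commonly reads its inputs. The hyperparameter names are taken from the estimator cells in this notebook. The default values, the extra arguments (`--model_dir`, `--training_dir`, `--test_dir`) and the local fallback paths are assumptions for illustration. `SM_MODEL_DIR`, `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` are environment variables set by the SageMaker training toolkit inside the container.

```python
import argparse
import os
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse hyperparameters and data locations for a training script."""
    parser = argparse.ArgumentParser()

    # hyperparameters are passed by SageMaker as command-line arguments (defaults are placeholders)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")

    # data and model directories are provided via environment variables inside the container
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--training_dir", type=str, default=os.environ.get("SM_CHANNEL_TRAIN", "./data/train"))
    parser.add_argument("--test_dir", type=str, default=os.environ.get("SM_CHANNEL_TEST", "./data/test"))

    # ignore arguments this sketch does not know about
    args, _ = parser.parse_known_args(argv)
    return args


# example: simulate the arguments a job with epochs=3 would receive
args = parse_args(["--epochs", "3"])
print(args.epochs, args.train_batch_size, args.model_name)
```

Falling back to local paths when the environment variables are missing keeps the same script runnable outside of SageMaker.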

## Importing custom sdk-extension for HuggingFace

In [87]:
from huggingface.estimator import HuggingFace

## Create a local Estimator

The following code sample shows how you train a custom HuggingFace script `train.py`, passing in three hyperparameters (`epochs`, `train_batch_size`, `model_name`). The datasets uploaded to S3 are passed to the training job as the `train` and `test` input channels when `fit` is called.

In [88]:
huggingface_estimator = HuggingFace(entry_point='train.py',
                            base_job_name='huggingface',
                            instance_type='local',
                            instance_count=1,
                            role=role,
                            framework_version={'transformers':'4.1.1','datasets':'1.1.3'},
                            py_version='py3',
                            hyperparameters = {'epochs': 1,
                                               'train_batch_size': 32,
                                               'model_name':'distilbert-base-uncased'
                                                })

In [89]:
huggingface_estimator.image_uri

'558105141721.dkr.ecr.eu-central-1.amazonaws.com/huggingface-training:0.0.1-cpu-transformers4.1.1-datasets1.1.3'
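The image URI above follows the general Amazon ECR pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. As a small sketch, the parts can be pulled apart with a regular expression, for example to check which region or tag an estimator resolved to. `parse_ecr_image_uri` is a hypothetical helper, not part of the SDK extension.

```python
import re
from typing import Dict


def parse_ecr_image_uri(image_uri: str) -> Dict[str, str]:
    """Split an ECR image URI into account, region, repository and tag."""
    match = re.fullmatch(
        r"(?P<account>\d+)\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
        r"(?P<repository>[^:]+):(?P<tag>.+)",
        image_uri,
    )
    if match is None:
        raise ValueError(f"not an ECR image URI: {image_uri}")
    return match.groupdict()


parts = parse_ecr_image_uri(
    "558105141721.dkr.ecr.eu-central-1.amazonaws.com/huggingface-training:0.0.1-cpu-transformers4.1.1-datasets1.1.3"
)
print(parts)
```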

In [90]:
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

Creating vjrxuebgqp-algo-1-7vnys ... done
Attaching to vjrxuebgqp-algo-1-7vnys
vjrxuebgqp-algo-1-7vnys | 2020-12-27 17:08:29,768 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
vjrxuebgqp-algo-1-7vnys | 2020-12-27 17:08:29,771 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
vjrxuebgqp-algo-1-7vnys | 2020-12-27 17:08:29,789 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
vjrxuebgqp-algo-1-7vnys | 2020-12-27 17:08:29,803 sagemaker_pytorch_container.training INFO     Invoking user training script.
vjrxuebgqp-algo-1-7vnys | 2020-12-27 17:08:29,966 botocore.credentials INFO     Found credentials in environment variables.
vjrxuebgqp-algo-1-7vnys | 2020-12-27 17:08:30,238 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
vjrxuebgqp-algo-1-7vnys | 2020-12-27 17:08:30,259 s

vjrxuebgqp-algo-1-7vnys | 2020-12-27 17:08:33,574 - __main__ - INFO -  loaded train_dataset length is: 2000
vjrxuebgqp-algo-1-7vnys | 2020-12-27 17:08:33,574 - __main__ - INFO -  loaded test_dataset length is: 150
vjrxuebgqp-algo-1-7vnys | 2020-12-27 17:08:33,938 - filelock - INFO - Lock 140437597566176 acquired on /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.d423bdf2f58dc8b77d5f5d18028d7ae4a72dcfd8f468e81fe979ada957a8c361.lock
vjrxuebgqp-algo-1-7vnys | 2020-12-27 17:08:34,280 - filelock - INFO - Lock 140437597566176 released on /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.d423bdf2f58dc8b77d5f5d18028d7ae4a72dcfd8f468e81fe979ada957a8c361.lock
vjrxuebgqp-algo-1-7vnys | 2020-12-27 17:08:34,615 - filelock - INFO - Lock 140437597568248 acquired on /root/.cache/huggingface/transformers/9c169103d7e5a73936dd2b627e42851bec0831212b677c63

RuntimeError: Failed to run: ['docker-compose', '-f', '/private/var/folders/jj/dzns9hc55db1vmfsjvrh9n8m0000gp/T/tmp12pz5tan/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

## Create an Estimator

The following code sample shows how you train a custom HuggingFace script `train.py` on a SageMaker GPU instance, passing in three hyperparameters (`epochs`, `train_batch_size`, `model_name`). The datasets uploaded to S3 are passed to the training job as the `train` and `test` input channels when `fit` is called.


In [23]:
from huggingface.estimator import HuggingFace


huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='../scripts',
                            sagemaker_session=sess,
                            base_job_name='huggingface-sdk-extension',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            role=role,
                            framework_version={'transformers':'4.1.1','datasets':'1.1.3'},
                            py_version='py3',
                            hyperparameters = {'epochs': 1,
                                               'train_batch_size': 32,
                                               'model_name':'distilbert-base-uncased'
                                                })

In [None]:
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

# Estimator Parameters

### Get S3 url for model data

In [74]:
huggingface_estimator.model_data

's3://sagemaker-eu-central-1-558105141721/huggingface-sdk-extension-2020-12-27-15-25-50-506/output/model.tar.gz'

### Get latest training job name

In [75]:
huggingface_estimator.latest_training_job.name

'huggingface-sdk-extension-2020-12-27-15-25-50-506'
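In the outputs above, the first path segment of `model_data` is the training job name. A small sketch of splitting such an S3 URI into bucket and key, and reading the job name from the key, could look like this. `split_model_data_uri` is a hypothetical helper, and the assumption that the key starts with the job name only holds when the default output location is used.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_model_data_uri(model_data: str) -> Tuple[str, str]:
    """Split an 's3://bucket/key' URI into (bucket, key)."""
    parsed = urlparse(model_data)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {model_data}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_model_data_uri(
    "s3://sagemaker-eu-central-1-558105141721/huggingface-sdk-extension-2020-12-27-15-25-50-506/output/model.tar.gz"
)
# with the default output path, the first key segment is the training job name
job_name = key.split("/")[0]
print(bucket, job_name)
```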

### Attach to old estimator 

e.g. to get model data

In [36]:
old_job_name='huggingface-sdk-extension-2020-12-27-15-25-50-506'

In [41]:
from sagemaker.estimator import Estimator

In [58]:
huggingface_estimator_loaded = Estimator.attach(old_job_name)


2020-12-27 15:34:03 Starting - Preparing the instances for training
2020-12-27 15:34:03 Downloading - Downloading input data
2020-12-27 15:34:03 Training - Training image download completed. Training in progress.
2020-12-27 15:34:03 Uploading - Uploading generated training model
2020-12-27 15:34:03 Completed - Training job completed


In [78]:
huggingface_estimator_loaded.model_data

's3://sagemaker-eu-central-1-558105141721/huggingface-sdk-extension-2020-12-27-15-25-50-506/output/model.tar.gz'

### Download model from s3

**using huggingface utils**

In [70]:
from huggingface.utils import download_model

download_model(model_data=huggingface_estimator_loaded.model_data,
               unzip=True,
               model_dir=huggingface_estimator_loaded.latest_training_job.name)

**using class built-in method**

In [None]:
huggingface_estimator.download_model(unzip=False)
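For readers without the SDK extension installed, roughly the same result can be achieved with `boto3` and the standard library. This is a sketch under assumptions and not the implementation of `download_model`: the helper names `download_model_archive` and `extract_model_archive` are made up, and the archive should only be extracted when it comes from a trusted training job.

```python
import os
import tarfile
from typing import List
from urllib.parse import urlparse


def download_model_archive(model_data: str, local_path: str = "model.tar.gz") -> str:
    """Download the model archive referenced by an 's3://bucket/key' URI."""
    import boto3  # imported lazily so the extraction helper works without AWS dependencies

    parsed = urlparse(model_data)
    boto3.client("s3").download_file(parsed.netloc, parsed.path.lstrip("/"), local_path)
    return local_path


def extract_model_archive(archive_path: str, target_dir: str) -> List[str]:
    """Unpack a model.tar.gz into target_dir and return the names of its members."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target_dir)
        return archive.getnames()
```

A typical call would first run `download_model_archive(huggingface_estimator.model_data)` and then pass the returned path to `extract_model_archive` together with a directory named after the training job.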

### Access logs

Until this [PR](https://github.com/aws/sagemaker-python-sdk/pull/2059) is merged, the logs can be accessed through the SageMaker session:

In [84]:
huggingface_estimator.sagemaker_session.logs_for_job(huggingface_estimator.latest_training_job.name, wait=True)

2020-12-27 15:34:03 Starting - Preparing the instances for training
2020-12-27 15:34:03 Downloading - Downloading input data
2020-12-27 15:34:03 Training - Training image download completed. Training in progress.
2020-12-27 15:34:03 Uploading - Uploading generated training model
2020-12-27 15:34:03 Completed - Training job completed
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-12-27 15:32:15,198 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2020-12-27 15:32:15,221 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-12-27 15:32:18,249 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-12-27 15:32:18,802 sagemaker-training-toolkit INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_in

**after the PR is merged**

In [None]:
huggingface_estimator.logs()