# Fine-tune classifier on AWS



# Introduction

Welcome to my end-to-end `non-distributed` and `distributed` multilabel text-classifier example. In this demo, we will use the Hugging Face `transformers` and `datasets` libraries together with a custom Amazon SageMaker SDK extension to fine-tune a pre-trained transformer for multilabel text classification on a single GPU or on multiple GPUs. In particular, the pre-trained model will be fine-tuned on the `tweet_eval: emotion` dataset. The distributed variant uses the `smdistributed` library to run training on multiple GPUs.

_**NOTE: You are encouraged to run this demo in SageMaker Notebook Instances**_

# Development Environment and Permissions 

## Installation

_*Note:* we only install the required libraries from Hugging Face and AWS. You also need PyTorch or TensorFlow if you haven't installed it already._

In [1]:
!pip install 'botocore==1.27.75' "sagemaker>=2.48.0" "transformers==4.12.3" "datasets[s3]==1.18.3" --upgrade

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/pytorch_p38/bin/python -m pip install --upgrade pip' command.

## Development environment 

In [2]:
import sagemaker.huggingface

## Permissions

In [3]:
import sagemaker

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::957370261234:role/service-role/AmazonSageMakerServiceCatalogProductsUseRole
sagemaker bucket: sagemaker-us-west-2-957370261234
sagemaker session region: us-west-2
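
The bucket name printed above follows the default naming pattern `sagemaker-{region}-{account_id}` that the SageMaker SDK uses when no bucket name is given. As a rough illustration, the account ID can be read from the role ARN and combined with the region. The helper names `account_id_from_arn` and `default_bucket_name` are hypothetical and not part of the SDK.

```python
def account_id_from_arn(arn: str) -> str:
    """Return the account ID, which is the fifth colon-separated ARN field."""
    parts = arn.split(":")
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return parts[4]


def default_bucket_name(region: str, account_id: str) -> str:
    """Mirror the 'sagemaker-{region}-{account_id}' default naming pattern."""
    return f"sagemaker-{region}-{account_id}"


# Role ARN and region copied from the cell output above
role_arn = "arn:aws:iam::957370261234:role/service-role/AmazonSageMakerServiceCatalogProductsUseRole"
bucket = default_bucket_name("us-west-2", account_id_from_arn(role_arn))
print(bucket)
```

This only reconstructs the name; the bucket itself is created by `sess.default_bucket()`.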


# Preprocessing (Skip this section if your dataset is already saved in S3)

We use the `datasets` library to download and preprocess the [tweet_eval: emotion](https://huggingface.co/datasets/tweet_eval) dataset. After preprocessing, the dataset is uploaded to our `sagemaker_session_bucket` so that the training job can use it. Each example in the [tweet_eval: emotion](https://huggingface.co/datasets/tweet_eval) dataset consists of the following fields:

- `text`: a string feature containing the tweet.
- `label`: an int classification label with the following mapping:
  - 0: anger
  - 1: joy
  - 2: optimism
  - 3: sadness
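
The mapping above can be kept in a pair of small dictionaries so that integer labels can be turned back into readable names. This is a minimal sketch; `decode_labels` is a hypothetical helper, and whether `train.py` stores such a mapping in the model config is not shown in this demo.

```python
# Label mapping of tweet_eval: emotion, as listed above
id2label = {0: "anger", 1: "joy", 2: "optimism", 3: "sadness"}
label2id = {name: idx for idx, name in id2label.items()}


def decode_labels(label_ids):
    """Translate a sequence of integer labels into emotion names."""
    return [id2label[int(i)] for i in label_ids]


example = decode_labels([2, 0, 3])
print(example)
```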

## Tokenization 

In [5]:
from datasets import load_dataset
from transformers import AutoTokenizer

# tokenizer used in preprocessing
tokenizer_name = 'distilbert-base-uncased'

# dataset used
dataset_name = 'tweet_eval'
dataset_arg = 'emotion'

# s3 key prefix for the data
s3_prefix = 'samples/datasets/'+dataset_name+'_'+dataset_arg

In [8]:
# load dataset
#dataset = load_dataset(dataset_name)

# download tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# tokenizer helper function
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

# load dataset
train_dataset, test_dataset = load_dataset(dataset_name, dataset_arg, split=['train', 'test'])
#test_dataset = test_dataset.shuffle().select(range(10000)) # optionally reduce the test dataset to 10k examples


# tokenize dataset
train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# set format for pytorch
train_dataset =  train_dataset.rename_column("label", "labels")
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset = test_dataset.rename_column("label", "labels")
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

Reusing dataset tweet_eval (/home/ec2-user/.cache/huggingface/datasets/tweet_eval/emotion/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)


  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/4 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]
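
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: `pad_or_truncate`, the pad ID of 0, and the maximum length of 8 are assumptions chosen for illustration. The real maximum length comes from the tokenizer, and a real tokenizer also keeps its special end token when truncating.

```python
def pad_or_truncate(token_ids, max_length=8, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]      # truncation: drop tokens past max_length
    mask = [1] * len(ids)                   # 1 marks real tokens
    padding = max_length - len(ids)         # number of pad positions to append
    return ids + [pad_id] * padding, mask + [0] * padding


# A short sequence is padded, a long one is cut
short_ids, short_mask = pad_or_truncate([101, 7592, 2088, 102])
long_ids, long_mask = pad_or_truncate(range(1, 13))
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example ends up with the same length, the `input_ids` and `attention_mask` columns can be stacked into fixed-size tensors.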

## Uploading data to `sagemaker_session_bucket`

After processing the datasets, we use the `FileSystem` [integration](https://huggingface.co/docs/datasets/filesystems.html) of the `datasets` library to upload them to S3.

In [9]:
import botocore
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()  

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
train_dataset.save_to_disk(training_input_path,fs=s3)

# save test_dataset to s3
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
test_dataset.save_to_disk(test_input_path,fs=s3)


# You can skip to this point if you already have your data saved in your S3 bucket
Simply paste your test and training input paths in the following cell.

In [6]:
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
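
Because the channel names and the split folders are easy to mix up, a small check can catch a swapped path before a training job starts. This is only a sketch: `check_channels` is a hypothetical helper, and it assumes each split is stored in a folder named after its channel, as in this demo.

```python
def check_channels(channels):
    """Raise if a channel's S3 URI does not end with its own channel name."""
    for name, uri in channels.items():
        last_segment = uri.rstrip("/").rsplit("/", 1)[-1]
        if last_segment != name:
            raise ValueError(f"channel {name!r} points at {uri!r}")
    return True


prefix = "s3://sagemaker-us-west-2-957370261234/samples/datasets/tweet_eval_emotion"

# Matching channels pass the check
ok = check_channels({"train": f"{prefix}/train", "test": f"{prefix}/test"})

# Swapped channels are rejected
try:
    check_channels({"train": f"{prefix}/test", "test": f"{prefix}/train"})
    swapped_detected = False
except ValueError:
    swapped_detected = True

print(ok, swapped_detected)
```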

### If you would like to start from a previously trained SageMaker model
Replace `model_s3_path` with the path to your desired `model.tar.gz`.

In [7]:
model_s3_path = f's3://{sess.default_bucket()}/huggingface-pytorch-training-2022-09-26-04-03-51-741/output/model.tar.gz'
model_path = './trained_model'


In [None]:
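import tarfile
from sagemaker.s3 import S3Downloader

# tarfile cannot open s3:// URIs, so download the archive to the working directory first
S3Downloader.download(model_s3_path, '.', sagemaker_session=sess)

# extract the downloaded archive into model_path
with tarfile.open('model.tar.gz') as file:
    file.extractall(model_path)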
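
`tarfile` works on local files, so the archive has to be on local disk before it is opened. The sketch below shows only the extraction step, using a throwaway archive built in a temporary directory as a stand-in for a downloaded `model.tar.gz`. The file name `config.json` and its contents are placeholders, not the contents of a real model archive.

```python
import json
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Build a stand-in archive containing one placeholder file
    src = os.path.join(workdir, "config.json")
    with open(src, "w") as f:
        json.dump({"num_labels": 4}, f)
    archive = os.path.join(workdir, "model.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="config.json")

    # Extract it the same way a downloaded model.tar.gz would be extracted
    model_dir = os.path.join(workdir, "trained_model")
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(model_dir)

    # Inspect the result while the temporary directory still exists
    extracted = sorted(os.listdir(model_dir))
    with open(os.path.join(model_dir, "config.json")) as f:
        num_labels = json.load(f)["num_labels"]

print(extracted, num_labels)
```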


In [9]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('./trained_model')

### Let's take a look at our fine-tuning script

In [1]:
!pygmentize ./train.py

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments, AutoTokenizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from datasets import load_from_disk
import random
import logging
import sys
import argparse
import os
import torch

if __name__ == "__main__":

    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument(
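
The listing above stops at the first `parser.add_argument(` call. As a hedged sketch of how such a script typically receives its inputs (not the actual remainder of `train.py`), the argument names below reuse the hyperparameter keys from this demo. The defaults read the `SM_MODEL_DIR`, `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` environment variables that SageMaker sets inside a training container. `parse_training_args` and the local fallback paths are assumptions made for illustration.

```python
import argparse
import os


def parse_training_args(argv=None):
    """Parse hyperparameters and data locations for a training script."""
    parser = argparse.ArgumentParser()

    # hyperparameters arrive as command-line arguments
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name_or_path", type=str, default="distilbert-base-uncased")

    # data and output locations come from environment variables set by SageMaker;
    # the fallback paths are placeholders for running outside a training container
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--training_dir", type=str, default=os.environ.get("SM_CHANNEL_TRAIN", "./train"))
    parser.add_argument("--test_dir", type=str, default=os.environ.get("SM_CHANNEL_TEST", "./test"))

    # parse_known_args ignores any extra arguments that may be appended
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_training_args(["--epochs", "3", "--train_batch_size", "16"])
print(args.epochs, args.train_batch_size, args.model_name_or_path)
```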

## Creating an Estimator and starting a training job

### For Non-distributed Training


In [None]:
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 1,
                 'train_batch_size': 32,
                 # Replace the model name with the path to the model you would like to fine tune.
                 'model_name_or_path':'distilbert-base-uncased'
                 }
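
In script mode, SageMaker hands each entry of the `hyperparameters` dictionary to the entry point as a `--key value` command-line argument. The following sketch imitates that conversion to show roughly what `train.py` would receive. `to_cli_args` is a hypothetical helper, not an SDK function, and the exact quoting SageMaker applies may differ.

```python
def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into a list of '--key', 'value' strings."""
    args = []
    for key, value in hyperparameters.items():
        args.extend([f"--{key}", str(value)])
    return args


hyperparameters = {
    "epochs": 1,
    "train_batch_size": 32,
    "model_name_or_path": "distilbert-base-uncased",
}
cli_args = to_cli_args(hyperparameters)
print(" ".join(cli_args))
```

The keys therefore have to match the argument names that the training script defines.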

In [None]:
huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            role=role,
                            transformers_version='4.12',
                            pytorch_version='1.9',
                            py_version='py38',
                            hyperparameters = hyperparameters)

In [None]:
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

### For Distributed Training


In [28]:
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters={
    # Replace the model name with the path to the model you would like to fine tune.
    'model_name_or_path': 'distilbert-base-uncased',
    'per_device_train_batch_size': 8,
    'per_device_eval_batch_size': 8,
    'num_train_epochs': 1
}

# configuration for running training on smdistributed Data Parallel
distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}

# instance configurations
instance_type='ml.p3.16xlarge'
instance_count=2
volume_size=200
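
With data parallelism, every GPU processes its own mini-batch, so the effective batch size per optimizer step is the per-device batch size times the total number of GPUs. The arithmetic for the configuration above is sketched below, assuming 8 GPUs per `ml.p3.16xlarge` instance and no gradient accumulation.

```python
per_device_train_batch_size = 8
gpus_per_instance = 8   # assumption: an ml.p3.16xlarge provides 8 GPUs
instance_count = 2

# total number of GPU workers across all instances
world_size = gpus_per_instance * instance_count

# number of examples consumed per optimizer step across all workers
global_batch_size = per_device_train_batch_size * world_size
print(world_size, global_batch_size)
```

If the learning rate was tuned for a single GPU, it may need to be revisited for this larger effective batch size.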

In [29]:
# estimator
huggingface_estimator = HuggingFace(entry_point='train.py',
                                    source_dir='./',
                                    instance_type=instance_type,
                                    instance_count=instance_count,
                                    volume_size=volume_size,
                                    role=role,
                                    transformers_version='4.12',
                                    pytorch_version='1.9',
                                    py_version='py38',
                                    distribution= distribution,
                                    hyperparameters = hyperparameters)

In [30]:
!sudo chmod 777 lost+found

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [31]:
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 2 Instances. Please contact AWS support to request an increase for this limit.



## Deploying the endpoint

To deploy our endpoint, we call `deploy()` on our HuggingFace estimator object, passing in our desired number of instances and instance type.

In [None]:
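# huggingface_estimator_loaded is created by the attach cell in the Extras section;
# use huggingface_estimator instead to deploy the model trained in this session
predictor = huggingface_estimator_loaded.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")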

-------------------------------

Then, we use the returned predictor object to call the endpoint.

Let's create an example input for each label (0: anger, 1: joy, 2: optimism, 3: sadness).

In [None]:
sentiment_input_anger = {"inputs": "I get mad when using the new Inference DLC."}
sentiment_input_joy = {"inputs": "I am happy when using the new Inference DLC."}
sentiment_input_optimism = {"inputs": "I am excited to use the new Inference DLC."}
sentiment_input_sadness = {"inputs": "I am disappointed in the new Inference DLC."}

print(predictor.predict(sentiment_input_anger))
print(predictor.predict(sentiment_input_joy))
print(predictor.predict(sentiment_input_optimism))
print(predictor.predict(sentiment_input_sadness))
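
The response of a text-classification endpoint is typically a list of dictionaries, each with a `label` and a `score`. If the model config does not define label names, the labels usually come back in the generic form `LABEL_0`, `LABEL_1`, and so on. The sketch below maps such labels to the emotion names. The `sample_response` is a made-up illustration, not output from this endpoint, and `readable_prediction` is a hypothetical helper.

```python
id2label = {0: "anger", 1: "joy", 2: "optimism", 3: "sadness"}


def readable_prediction(response):
    """Return (emotion name, score) for the highest-scoring entry."""
    top = max(response, key=lambda item: item["score"])
    label = top["label"]
    if label.startswith("LABEL_"):
        # generic label: translate the trailing index into an emotion name
        label = id2label[int(label.split("_", 1)[1])]
    return label, top["score"]


sample_response = [{"label": "LABEL_1", "score": 0.97}]   # illustrative values only
name, score = readable_prediction(sample_response)
print(name, score)
```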

Finally, we delete the endpoint again.

In [12]:
predictor.delete_endpoint()

# Extras

### Estimator Parameters

In [15]:
# container image used for training job
print(f"container image used for training job: \n{huggingface_estimator.image_uri}\n")

# s3 uri where the trained model is located
print(f"s3 uri where the trained model is located: \n{huggingface_estimator.model_data}\n")

# latest training job name for this estimator
print(f"latest training job name for this estimator: \n{huggingface_estimator.latest_training_job.name}\n")



container image used for training job: 
None

s3 uri where the trained model is located: 
s3://sagemaker-us-west-2-957370261234/huggingface-pytorch-training-2022-09-26-04-03-51-741/output/model.tar.gz

latest training job name for this estimator: 
huggingface-pytorch-training-2022-09-26-04-03-51-741
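
The `model_data` URI printed above encodes the bucket, the training job name and the artifact path. A small sketch using only the standard library splits it into those parts. `split_model_uri` is a hypothetical helper, and the layout `<job name>/output/model.tar.gz` is assumed from the output above.

```python
from urllib.parse import urlparse


def split_model_uri(model_data):
    """Split an s3:// model URI into (bucket, key, training job name)."""
    parsed = urlparse(model_data)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URI, got {model_data!r}")
    key = parsed.path.lstrip("/")
    job_name = key.split("/", 1)[0]   # first key segment is the job name in this layout
    return parsed.netloc, key, job_name


bucket, key, job_name = split_model_uri(
    "s3://sagemaker-us-west-2-957370261234/"
    "huggingface-pytorch-training-2022-09-26-04-03-51-741/output/model.tar.gz"
)
print(bucket)
print(key)
print(job_name)
```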



In [None]:
# access the logs of the training job
huggingface_estimator.sagemaker_session.logs_for_job(huggingface_estimator.latest_training_job.name)

### Attach an old training job to an estimator

In SageMaker, you can attach an old training job to an estimator to continue training, retrieve results, and so on.

In [19]:
model_path

's3://sagemaker-us-west-2-957370261234/huggingface-pytorch-training-2022-09-26-04-03-51-741/output/model.tar.gz'

In [10]:
from sagemaker.estimator import Estimator

# job which is going to be attached to the estimator
old_training_job_name='huggingface-pytorch-training-2022-09-26-04-03-51-741'


In [11]:
# attach old training job
huggingface_estimator_loaded = Estimator.attach(old_training_job_name)

# get model output s3 from training job
huggingface_estimator_loaded.model_data


2022-09-26 04:22:59 Starting - Preparing the instances for training
2022-09-26 04:22:59 Downloading - Downloading input data
2022-09-26 04:22:59 Training - Training image download completed. Training in progress.
2022-09-26 04:22:59 Uploading - Uploading generated training model
2022-09-26 04:22:59 Completed - Training job completed


's3://sagemaker-us-west-2-957370261234/huggingface-pytorch-training-2022-09-26-04-03-51-741/output/model.tar.gz'