# Huggingface Sagemaker-sdk - Distributed Training 



We will use the new [Hugging Face DLCs](https://github.com/aws/deep-learning-containers/tree/master/huggingface) and [Amazon SageMaker extension](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#huggingface-estimator) to train a distributed Seq2Seq-transformer model on `summarization` using the `transformers` and `datasets` libraries and upload it afterwards to [huggingface.co](http://huggingface.co) and test it.

As [distributed training strategy](https://huggingface.co/transformers/sagemaker.html#distributed-training-data-parallel) we are going to use [SageMaker Data Parallelism](https://aws.amazon.com/blogs/aws/managed-data-parallelism-in-amazon-sagemaker-simplifies-training-on-large-datasets/), which has been built into the [Trainer](https://huggingface.co/transformers/main_classes/trainer.html) API. To use data-parallelism we only have to define the `distribution` parameter in our `HuggingFace` estimator.

```python
# configuration for running training on smdistributed Data Parallel
distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}
```

In this tutorial, we will use an Amazon SageMaker Notebook Instance to run our training job.

**What are we going to do:**

- Set up a development environment and install sagemaker
- Choose 🤗 Transformers `examples/` script
- Configure distributed training and hyperparameters
- Create a `HuggingFace` estimator and start training
- Upload the fine-tuned model to [huggingface.co](http://huggingface.co)
- Test inference

### Model and Dataset

We are going to fine-tune [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn), the checkpoint set in the hyperparameters below, on the [Arxiv](https://www.kaggle.com/datasets/Cornell-University/arxiv) dataset. *"BART is sequence-to-sequence model trained with denoising as pretraining objective."* [[REF](https://github.com/pytorch/fairseq/blob/master/examples/bart/README.md)]

The `Arxiv` dataset contains about 1.7M research papers with summaries. 

_**NOTE: You can run this demo in SageMaker Studio, on your local machine, or on SageMaker Notebook Instances.**_

# Set up a development environment and install sagemaker

## Installation

_**Note:** The use of Jupyter is optional: We could also launch SageMaker Training jobs from anywhere we have an SDK installed, connectivity to the cloud and appropriate permissions, such as a Laptop, another IDE or a task scheduler like Airflow or AWS Step Functions._

In [2]:
!pip install "sagemaker>=2.48.0"  --upgrade
#!apt install git-lfs

Collecting sagemaker>=2.48.0
  Downloading sagemaker-2.199.0.tar.gz (1.0 MB)
  Preparing metadata (setup.py) ... done
Collecting boto3<2.0,>=1.33.3 (from sagemaker>=2.48.0)
  Downloading boto3-1.33.6-py3-none-any.whl.metadata (6.7 kB)
Collecting uvicorn==0.22.0 (from sagemaker>=2.48.0)
  Downloading uvicorn-0.22.0-py3-none-any.whl (58 kB)
Collecting fastapi==0.95.2 (from sagemaker>=2.48.0)
  Downloading fastapi-0.95.2-py3-none-any.whl.metadata (24 kB)
Collecting docker (from sagemaker>=2.48.0)
  Downloading docker-6.1.3-py3-none-any.whl.metadata (3.5 kB)
Collecting pydantic!=1.7,!=1.7.1,!=1.7.2,!=1.7.3,!=1.8,!=1.8.1,<2.0.0,>=1.6.2 (from fastapi==0.95.2->sagemaker>=2.48.0)
  Downloading pydantic-1.10.13-cp3

In [3]:
# Debian/Ubuntu images (the environment used for this run was detected as ubuntu/22): install git-lfs with apt
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
!sudo apt-get install git-lfs -y
# On yum-based images such as Amazon Linux, use script.rpm.sh and `sudo yum install git-lfs -y` instead;
# the rpm/yum variant fails on Ubuntu with "yum: command not found"
!git lfs install


## Development environment 

In [4]:
import sagemaker.huggingface

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


## Permissions

_If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html)._

In [5]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
sagemaker role arn: arn:aws:iam::113723224739:role/service-role/AmazonSageMaker-ExecutionRole-20230522T112578
sagemaker bucket: sagemaker-us-east-1-113723224739
sagemaker session region: us-east-1


## Choose 🤗 Transformers `examples/` script

The [🤗 Transformers repository](https://github.com/huggingface/transformers/tree/master/examples) contains several `examples/` scripts for fine-tuning models on tasks from `language-modeling` to `token-classification`. In our case, we are using `run_summarization.py` from the `pytorch/summarization/` examples.

_**Note**: you can follow this tutorial unchanged to train your model with a different example script._

Since the `HuggingFace` Estimator has git support built-in, we can specify a [training script that is stored in a GitHub repository](https://sagemaker.readthedocs.io/en/stable/overview.html#use-scripts-stored-in-a-git-repository) as `entry_point` and `source_dir`.

We are going to use the `transformers 4.26.0` DLC, which means we need to configure `v4.26.0` as the branch to pull the compatible example scripts.

In [6]:
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.26.0'} # v4.26.0 must match the `transformers_version` you use in the estimator.
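
Because the example scripts change between releases, the branch and the `transformers_version` of the estimator should move together. A minimal sketch, assuming a hypothetical helper `make_git_config` (it is not part of the SageMaker SDK), derives the tag from the version string so the two values cannot drift apart:

```python
# Hypothetical helper (not part of the SageMaker SDK): derive the git tag
# from the transformers version so the example script matches the DLC.
TRANSFORMERS_REPO = "https://github.com/huggingface/transformers.git"


def make_git_config(transformers_version: str) -> dict:
    """Return a git_config dict pointing at the release tag `v<version>`."""
    return {"repo": TRANSFORMERS_REPO, "branch": f"v{transformers_version}"}


transformers_version = "4.26.0"
git_config = make_git_config(transformers_version)
print(git_config)
```

The same `transformers_version` variable could then be reused when creating the estimator.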

## Configure distributed training and hyperparameters

Next, we will define our `hyperparameters` and configure our distributed training strategy. As hyperparameters, we can pass any [Seq2SeqTrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#seq2seqtrainingarguments) as well as the arguments defined in [run_summarization.py](https://github.com/huggingface/transformers/tree/master/examples/seq2seq#sequence-to-sequence-training-and-evaluation).

In [7]:
# hyperparameters, which are passed into the training job
hyperparameters={'per_device_train_batch_size': 4,
                 'per_device_eval_batch_size': 4,
                 'model_name_or_path': 'facebook/bart-large-cnn',
                 'dataset_name': 'kdawoud91/Arxiv_train_test',
                 'do_train': True,
                 'do_eval': False,
                 'do_predict': False,
                 'predict_with_generate': True,
                 'output_dir': './data',
                 'num_train_epochs': 3,
                 'learning_rate': 5e-5,
                 'seed': 7,
                 'fp16': True,
                 }

# configuration for running training on smdistributed Data Parallel
#distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}
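
SageMaker hands each hyperparameter to the entry point as a command-line argument, which `run_summarization.py` then parses into its training arguments. The sketch below only illustrates that mapping: `to_cli_args` is a hypothetical helper, and the exact serialization SageMaker applies to booleans and other values may differ.

```python
def to_cli_args(hyperparameters: dict) -> list:
    """Flatten a hyperparameter dict into `--key value` pairs (illustrative only)."""
    args = []
    for key, value in hyperparameters.items():
        args.extend([f"--{key}", str(value)])
    return args


# A small subset of the hyperparameters defined above.
example = {"per_device_train_batch_size": 4, "num_train_epochs": 3, "fp16": True}
print(to_cli_args(example))
```

Seen this way, every key in `hyperparameters` has to be an argument name the script actually accepts, otherwise argument parsing in the training container is likely to fail.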

## Create a `HuggingFace` estimator and start training

In [8]:
from sagemaker.huggingface import HuggingFace

# create the Estimator
huggingface_estimator = HuggingFace(
      entry_point='run_summarization.py', # script
      source_dir='./examples/pytorch/summarization', # relative path to example
      git_config=git_config,
      instance_type='ml.g5.xlarge',
      instance_count=2,
      transformers_version='4.26.0',
      pytorch_version='1.13.1',
      py_version='py39',
      role=role,
      hyperparameters = hyperparameters,
      #distribution = distribution
)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
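
In the cell above the `distribution` argument is commented out. A hedged sketch of how the choice could be made explicit follows. The set of instance types is an assumption (a non-exhaustive list taken from memory of the AWS documentation, to be checked against the current docs), and both helper names are hypothetical. The second helper shows the batch-size arithmetic for data parallelism, assuming no gradient accumulation.

```python
# Assumption: non-exhaustive set of instance types that support the SageMaker
# data parallel library; verify against the current AWS documentation.
ASSUMED_SMDDP_INSTANCE_TYPES = {"ml.p3.16xlarge", "ml.p3dn.24xlarge", "ml.p4d.24xlarge"}


def build_distribution(instance_type: str):
    """Return the smdistributed config, or None if the type is not in the assumed set."""
    if instance_type in ASSUMED_SMDDP_INSTANCE_TYPES:
        return {"smdistributed": {"dataparallel": {"enabled": True}}}
    return None


def global_batch_size(per_device_batch_size: int, gpus_per_instance: int, instance_count: int) -> int:
    """Samples per optimizer step when every GPU holds one data-parallel replica."""
    return per_device_batch_size * gpus_per_instance * instance_count


print(build_distribution("ml.g5.xlarge"))    # None
print(build_distribution("ml.p3.16xlarge"))  # smdistributed config
print(global_batch_size(4, 8, 2))            # 64
```

The result of `build_distribution(instance_type)` could be passed as `distribution=` to the `HuggingFace` estimator. With a per-device batch size of 4 on two 8-GPU instances, each optimizer step would consume 64 samples.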


In [9]:
# starting the train job
huggingface_estimator.fit()

Cloning into '/tmp/tmpmx22715e'...
Note: switching to 'v4.26.0'.

HEAD is now at 820c46a70 Hotifx remove tuple for git config image processor. (#21278)


Using provided s3_resource


INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2023-11-09-17-42-12-238


2023-11-09 17:42:24 Starting - Starting the training job...
2023-11-09 17:42:39 Starting - Preparing the instances for training.........
2023-11-09 17:44:12 Downloading - Downloading input data
2023-11-09 17:44:12 Stopping - Stopping the training job
2023-11-09 17:44:12 Stopped - Training job stopped
..



Training seconds: 2
Billable seconds: 2


## Deploying the endpoint

To deploy our endpoint, we call `deploy()` on our HuggingFace estimator object, passing in our desired number of instances and instance type.

In [13]:
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.g5.xlarge")

INFO:sagemaker:Creating model with name: huggingface-pytorch-training-2023-10-31-13-48-29-501


ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Could not find model data at s3://sagemaker-us-east-1-113723224739/huggingface-pytorch-training-2023-10-31-13-27-47-542/output/model.tar.gz.

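A `ValidationException` like the one above means that no `model.tar.gz` exists at the expected S3 location, which happens when the training job was stopped or failed before saving a model. Run `fit()` to completion before deploying.

Then, we use the returned predictor object to call the endpoint.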

In [None]:
conversation = '''The campus bookstore, a seeming anachronism in the digital age,
will soon become history at the University of Massachusetts. Starting next fall, 
students at the flagship Amherst campus will buy almost all textbooks from Amazon.com.
The online retail giant has struck a deal with UMass to replace an on-campus “textbook annex” 
run by Follett Corp. with a smaller Amazon distribution center. UMass officials hope the 
arrangement will save students money. “We really recognize that textbooks and course materials
are a major expense for students, and those have continued to go up over time,” said
Ed Blaguszewski, UMass spokesman.
'''

data = {"inputs": conversation}

predictor.predict(data)
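
The request body can also carry generation settings. The sketch below assumes the endpoint follows the Hugging Face inference toolkit convention of an optional `parameters` field next to `inputs`. `build_payload` is a hypothetical helper, and the parameter names and values shown are illustrative assumptions, not tuned settings.

```python
def build_payload(text: str, **generation_kwargs) -> dict:
    """Build a request body; generation settings go under "parameters"."""
    # Collapse newlines and repeated spaces left over from the triple-quoted string.
    payload = {"inputs": " ".join(text.split())}
    if generation_kwargs:
        payload["parameters"] = generation_kwargs
    return payload


article = """The campus bookstore
will soon become history."""
payload = build_payload(article, max_length=60, min_length=10)
print(payload)
```

The resulting dictionary could be passed to `predictor.predict(...)` in place of `data`.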

Finally, we delete the endpoint again.

In [12]:
predictor.delete_endpoint()

## Upload the fine-tuned model to [huggingface.co](http://huggingface.co)

We can download our model from Amazon S3 and unzip it using the following snippet.



In [None]:
import os
import tarfile
from sagemaker.s3 import S3Downloader

local_path = 'my_bart_model'

os.makedirs(local_path, exist_ok = True)

# download model from S3
S3Downloader.download(
    s3_uri=huggingface_estimator.model_data, # s3 uri where the trained model is located
    local_path=local_path, # local path where *.tar.gz is saved
    sagemaker_session=sess # sagemaker session used for training the model
)

# unzip model; the context manager closes the archive even if extraction fails
with tarfile.open(f"{local_path}/model.tar.gz", "r:gz") as tar:
    tar.extractall(path=local_path)
os.remove(f"{local_path}/model.tar.gz")
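
Since `extractall` writes whatever paths the archive contains, a basic path check before extraction can be useful. The following is a minimal sketch, not a complete hardening: `extract_model_archive` is a hypothetical helper that only rejects member names resolving outside the target directory, and the demo uses a throwaway archive in place of the real `model.tar.gz`.

```python
import os
import tarfile
import tempfile


def extract_model_archive(archive_path: str, dest_dir: str) -> list:
    """Extract a .tar.gz archive, refusing members that would land outside dest_dir."""
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(path=dest_root)
        return tar.getnames()


# Self-contained demo with a throwaway archive standing in for model.tar.gz.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "config.json")
    with open(src, "w") as f:
        f.write("{}")
    archive = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="config.json")
    out_dir = os.path.join(tmp, "my_bart_model")
    os.makedirs(out_dir, exist_ok=True)
    extracted = extract_model_archive(archive, out_dir)
    extracted_ok = os.path.isfile(os.path.join(out_dir, "config.json"))
    print(extracted, extracted_ok)
```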

Before we upload our model to [huggingface.co](http://huggingface.co), we need to create a `model_card`. The `model_card` describes the model and includes the hyperparameters, the results, and the dataset used for training. To create a `model_card`, we create a `README.md` in our `local_path`.

In [None]:
import json

# read eval and test results
# NOTE: these files are only written when `do_eval` / `do_predict` are set to True in the hyperparameters.
# Depending on the transformers version, prediction metrics may instead be saved as
# `predict_results.json` with a `predict_` prefix; adjust the file name and keys if so.
rouge_keys = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

with open(f"{local_path}/eval_results.json") as f:
    eval_results_raw = json.load(f)
eval_results = {f"eval_{key}": eval_results_raw[f"eval_{key}"] for key in rouge_keys}

with open(f"{local_path}/test_results.json") as f:
    test_results_raw = json.load(f)
test_results = {f"test_{key}": test_results_raw[f"test_{key}"] for key in rouge_keys}

After we extract all the metrics we want to include, we are going to create our `README.md`. In addition to the automatically generated results table, we add the metrics manually to the `metadata` of our model card under `model-index`.

In [None]:
print(eval_results)
print(test_results)
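
The notebook does not show the `README.md` itself, so here is a minimal sketch of what generating it could look like. `build_model_card` is a hypothetical helper, the metadata fields follow the model card `model-index` layout as far as I know it, and the metric values in the demo are placeholders, not results.

```python
def build_model_card(model_name: str, dataset_name: str, eval_results: dict) -> str:
    """Render README.md text: YAML metadata with a model-index, then a results table."""
    metric_lines = "\n".join(
        f"    - type: rouge\n      name: {key}\n      value: {value}"
        for key, value in eval_results.items()
    )
    table_rows = "\n".join(f"| {key} | {value} |" for key, value in eval_results.items())
    return (
        "---\n"
        "tags:\n- summarization\n"
        f"datasets:\n- {dataset_name}\n"
        "model-index:\n"
        f"- name: {model_name}\n"
        "  results:\n"
        "  - task:\n"
        "      type: summarization\n"
        "    dataset:\n"
        f"      name: {dataset_name}\n"
        f"      type: {dataset_name}\n"
        "    metrics:\n"
        f"{metric_lines}\n"
        "---\n\n"
        f"# {model_name}\n\n"
        "| metric | value |\n|---|---|\n"
        f"{table_rows}\n"
    )


# Placeholder values only; replace with the metrics read from eval_results.json.
placeholder_results = {"eval_rouge1": 0.0, "eval_rouge2": 0.0}
card = build_model_card("my-bart-model", "my-dataset", placeholder_results)
print(card)
```

Writing `card` to `f"{local_path}/README.md"` would place the model card next to the model files so that it is uploaded together with them.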

Once our unzipped model and model card are located in `my_bart_model`, we can either use the `huggingface_hub` SDK to create a repository and upload the files to [huggingface.co](http://huggingface.co), or go to https://huggingface.co/new, create a new repository, and upload them there.

In [None]:
from getpass import getpass
from huggingface_hub import HfApi

hf_username = "philschmid" # your username on huggingface.co

# repository name on huggingface.co; keep only the part after "/" because repository names cannot contain slashes
model_part = hyperparameters['model_name_or_path'].split('/')[-1]
dataset_part = hyperparameters['dataset_name'].split('/')[-1]
repository_name = f"{model_part}-{dataset_part}"
repo_id = f"{hf_username}/{repository_name}"

# recent huggingface_hub releases no longer support username/password login;
# use a user access token with write permission instead
token = getpass("Enter your Hugging Face access token:") # creates a prompt for entering the token

api = HfApi(token=token)

# create repository
api.create_repo(repo_id=repo_id, exist_ok=True)

# push model and model card to the hub
api.upload_folder(folder_path=local_path, repo_id=repo_id)

print(f"https://huggingface.co/{repo_id}")