# How to use DeepSpeed with Habana Gaudi for large-scale efficient training

In this blog, you will learn how to fine-tune [T5-3B](https://huggingface.co/t5-3b) for abstractive summarization with [DeepSpeed](https://www.deepspeed.ai/) using a Habana Gaudi-based [DL1 instance](https://aws.amazon.com/ec2/instance-types/dl1/) on AWS to take advantage of the cost-performance benefits of Gaudi. We will use the Hugging Face Transformers, Optimum Habana and Datasets libraries as well as the Habana fork of [DeepSpeed](https://github.com/HabanaAI/DeepSpeed). We are going to fine-tune [T5-3B](https://huggingface.co/t5-3b) on the [Trade the Event](https://paperswithcode.com/paper/trade-the-event-corporate-events-detection) dataset for abstractive text summarization. The benchmark dataset contains 303,893 news articles published between 2020/03/01 and 2021/05/06. The articles were downloaded from [PRNewswire](https://www.prnewswire.com/) and [Businesswire](https://www.businesswire.com/).


You will learn how to:

1. [Prepare Dataset & Environment](#1-prepare-dataset--environment)
2. [Configure DeepSpeed](#2-configure-deepspeed)
3. [Run T5-3B on Habana Gaudi](#3-run-t3-3b-on-habana-gaudi)
4. [Cost performance benefits of Habana Gaudi on AWS](#4-cost-performance-benefits-of-habana-gaudi-on-aws)

![remote-runner](../assets/remote-runner.png)

**Requirements**

Before we can start, make sure you have met the following requirements:

* AWS Account with quota for [DL1 instance type](https://aws.amazon.com/ec2/instance-types/dl1/)
* [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) installed
* AWS IAM user [configured in CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) with permission to create and manage EC2 instances

**Helpful Resources**

* [Setup Deep Learning environment for Hugging Face Transformers with Habana Gaudi on AWS](https://www.philschmid.de/getting-started-habana-gaudi)
* [Deep Learning setup made easy with EC2 Remote Runner and Habana Gaudi](https://www.philschmid.de/habana-gaudi-ec2-runner)
* [Optimum Habana Documentation](https://huggingface.co/docs/optimum/main/en/habana_index)
* [Pre-Training BERT with Hugging Face Transformers and Habana Gaudi](https://www.philschmid.de/pre-training-bert-habana)
* [Habana DeepSpeed User Guide](https://docs.habana.ai/en/latest/PyTorch/DeepSpeed/DeepSpeed_User_Guide.html)


## What is DeepSpeed? 

[DeepSpeed](https://www.deepspeed.ai/training/) is an easy-to-use deep learning optimization software suite that enables unprecedented scale and speed for Deep Learning Training. [DeepSpeed](https://www.deepspeed.ai/) enables you to fit and train larger models on HPUs thanks to various optimizations. In particular, you can use the two following ZeRO configurations that have been validated to be fully functioning with Gaudi:

* ZeRO-1, which partitions the optimizer states across processes.
* ZeRO-2, which additionally partitions the gradients across processes so that each process retains only the gradients corresponding to its portion of the optimizer states.

These configurations are fully compatible with Habana Mixed Precision and can thus be used to train your model in bf16 precision.
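
To get a feeling for why this partitioning matters, the following back-of-the-envelope sketch estimates the memory needed for model states on each device. It assumes the accounting commonly used for mixed-precision Adam in the ZeRO paper (2 bytes per parameter for weights, 2 for gradients, 12 for optimizer states), a rounded parameter count of 3 billion for T5-3B, and 8 devices. Activations, temporary buffers and optimizer offloading are ignored, so the numbers are rough illustrations and not measurements on Gaudi. The function name is hypothetical.

```python
def model_state_bytes_per_device(num_params, num_devices, zero_stage=0):
    """Rough per-device memory for weights, gradients and optimizer states.

    Assumes mixed-precision Adam: 2 bytes/param for 16-bit weights,
    2 bytes/param for 16-bit gradients, 12 bytes/param for optimizer states.
    """
    weight_bytes = 2 * num_params
    grad_bytes = 2 * num_params
    optim_bytes = 12 * num_params

    if zero_stage >= 1:
        # ZeRO-1: optimizer states are partitioned across processes
        optim_bytes = optim_bytes / num_devices
    if zero_stage >= 2:
        # ZeRO-2: gradients are partitioned as well
        grad_bytes = grad_bytes / num_devices

    return weight_bytes + grad_bytes + optim_bytes


# assumed values: ~3 billion parameters, 8 devices
for stage in (0, 1, 2):
    gigabytes = model_state_bytes_per_device(3_000_000_000, 8, stage) / 1e9
    print(f"ZeRO stage {stage}: ~{gigabytes:.2f} GB of model states per device")
```

Under these assumptions the estimate drops from about 48 GB without ZeRO to about 16.5 GB with ZeRO-1 and about 11.25 GB with ZeRO-2.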

You can find more information about DeepSpeed Gaudi integration [here](https://huggingface.co/docs/optimum/habana_deepspeed).



## 1. Prepare Dataset & Environment

In this example, we are going to use Habana Gaudi on AWS, running the training on a DL1 instance. We will use the [Remote Runner](https://github.com/philschmid/deep-learning-remote-runner) toolkit to easily launch our training on a remote DL1 instance from our local setup. You can check out [Deep Learning setup made easy with EC2 Remote Runner and Habana Gaudi](https://www.philschmid.de/habana-gaudi-ec2-runner) if you want to know more about how this works.

In [None]:
!pip install rm-runner

To make using DeepSpeed on Habana Gaudi as easy as possible, we have created a Docker image that contains all the necessary dependencies. You can find the Docker image on Docker Hub at [huggingface/optimum-habana:4.23.1-pt1.12.0-synapse1.6.0-deepspeed](https://hub.docker.com/r/huggingface/optimum-habana/tags). The Docker image is based on the [Optimum Habana DeepSpeed User Guide](https://huggingface.co/docs/optimum/habana_deepspeed) and contains the following dependencies:

* `transformers==4.23.1`
* `datasets==2.11.0`
* `optimum==1.4.0`
* `optimum-habana==1.2.3`
* `HabanaAI/DeepSpeed==1.6.1`
* `synapseAI==1.6.0`
* `torch==1.12.0`
* `tensorboard`
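
If you want to double-check which versions are actually present, for example inside the running container, a small sketch like the one below can help. It only relies on the standard library `importlib.metadata`; the helper name `report_versions` is hypothetical, and the package names are assumed to match the distribution names on PyPI. Packages that are not installed are reported as `None`.

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages):
    """Return a dict mapping each package name to its installed version (or None)."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # package is not installed in the current environment
            found[name] = None
    return found


packages = ["transformers", "datasets", "optimum", "optimum-habana", "torch", "tensorboard"]
for name, installed in report_versions(packages).items():
    print(f"{name}: {installed or 'not installed'}")
```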



In [1]:
# container image used for running the training
image_id="huggingface/optimum-habana:4.23.1-pt1.12.0-synapse1.6.0-deepspeed"

This example will use the [Hugging Face Hub](https://huggingface.co/models) as a remote model versioning service. To be able to push our model to the Hub, you need to register on [Hugging Face](https://huggingface.co/join). 
If you already have an account you can skip this step. 
After you have an account, we will use the `notebook_login` util from the `huggingface_hub` package to log into our account and store our token (access key) on the disk. 

In [None]:
from huggingface_hub import notebook_login

notebook_login()


## 2. Configure DeepSpeed

The [GaudiTrainer](https://huggingface.co/docs/optimum/main/en/habana_trainer) allows us to use DeepSpeed as easily as the [Transformers Trainer](https://huggingface.co/docs/transformers/main_classes/trainer). It will take care of adding all of the DeepSpeed specific methods and checks. To add DeepSpeed to our training we have to: 

1. Create a DeepSpeed configuration
2. Specify the DeepSpeed configuration in the `GaudiTrainer`
3. Use `deepspeed` to launch our training
   
These steps are detailed below. A comprehensive guide about how to use DeepSpeed with the Transformers Trainer is also available [here](https://huggingface.co/docs/transformers/main_classes/deepspeed).


### 1. Create DeepSpeed configuration

The DeepSpeed configuration is passed as a JSON file and enables you to choose the optimizations to apply. Below you will find the configuration we are going to use for fine-tuning `T5-3B` using ZeRO-2 optimizations and `bf16` precision. 

The [Transformers documentation](https://huggingface.co/docs/transformers/main_classes/deepspeed#configuration) explains very well how to write a configuration from scratch. A more complete description of all configuration options can be found in the [DeepSpeed documentation](https://www.deepspeed.ai/docs/config-json/).

We create a `ds_config.json` with the below configuration in the `scripts/` directory to be later used for training.

```json
{
    "steps_per_print": 64,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": 1.0,
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": false,
        "reduce_scatter": false,
        "contiguous_gradients": false
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    }
}
```
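
If you prefer to generate the file from Python, for example from a notebook cell, the following sketch writes the same configuration to disk. The helper names `render_ds_config` and `write_ds_config` are hypothetical, and the default target directory `scripts` is taken from the text above.

```python
import json
from pathlib import Path

# same settings as the JSON above, expressed as a Python dict
DS_CONFIG = {
    "steps_per_print": 64,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": 1.0,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": False,
        "reduce_scatter": False,
        "contiguous_gradients": False,
    },
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto"},
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
        },
    },
}


def render_ds_config():
    """Serialize the configuration as an indented JSON string."""
    return json.dumps(DS_CONFIG, indent=4)


def write_ds_config(target_dir="scripts"):
    """Write ds_config.json into target_dir and return the file path."""
    path = Path(target_dir) / "ds_config.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_ds_config())
    return path


if __name__ == "__main__":
    print(f"wrote {write_ds_config()}")
```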



> The special value `"auto"` lets the trainer fill in the correct or most efficient value automatically. You can also specify the values yourself, but if you do so, be careful not to have conflicting values with your training arguments. It is strongly advised to read [this section](https://huggingface.co/docs/transformers/main_classes/deepspeed#shared-configuration) in the Transformers documentation to completely understand how this works.
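
To make the note about conflicting values more concrete, here is a small sketch of the relationship DeepSpeed expects between the three batch-related settings: the global batch size equals the micro batch size per device times the gradient accumulation steps times the number of devices. The helper below is hypothetical and only illustrates this arithmetic; it is not how the Transformers or Optimum Habana integration is implemented. The example values assume a per-device batch size of 4, no gradient accumulation and 8 devices.

```python
def resolve_batch_settings(per_device_batch_size, gradient_accumulation_steps, world_size):
    """Compute the batch-related DeepSpeed values that "auto" would have to match."""
    return {
        "train_micro_batch_size_per_gpu": per_device_batch_size,
        "gradient_accumulation_steps": gradient_accumulation_steps,
        # global batch size = micro batch size * accumulation steps * number of devices
        "train_batch_size": per_device_batch_size * gradient_accumulation_steps * world_size,
    }


settings = resolve_batch_settings(
    per_device_batch_size=4,
    gradient_accumulation_steps=1,
    world_size=8,
)
print(settings)
```

If you hard-code one of these values in the JSON file and it does not satisfy this relationship with your training arguments, you end up with exactly the kind of conflict the note warns about.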




## 3. Run T3-3B on Habana Gaudi

When using GPUs you would use the [Trainer](https://huggingface.co/docs/transformers/v4.19.4/en/main_classes/trainer#transformers.Trainer) and [TrainingArguments](https://huggingface.co/docs/transformers/v4.19.4/en/main_classes/trainer#transformers.TrainingArguments). Since we are going to run our training on Habana Gaudi, we leverage the `optimum-habana` library and use the [GaudiTrainer](https://huggingface.co/docs/optimum/main/en/habana_trainer) and `GaudiTrainingArguments` instead. The `GaudiTrainer` is a wrapper around the [Trainer](https://huggingface.co/docs/transformers/v4.19.4/en/main_classes/trainer#transformers.Trainer) that allows you to pre-train or fine-tune a transformer model on Habana Gaudi instances.

```diff
-from transformers import Trainer, TrainingArguments 
+from optimum.habana import GaudiTrainer, GaudiTrainingArguments

# define the training arguments
-training_args = TrainingArguments(
+training_args = GaudiTrainingArguments(
+  use_habana=True,
+  use_lazy_mode=True,
+  gaudi_config_name=path_to_gaudi_config,
  deepspeed=path_to_my_deepspeed_config,
  ...
)

# Initialize our Trainer
-trainer = Trainer(
+trainer = GaudiTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    ... # other arguments
)
```

The `DL1` instance we use has 8 available HPU-cores meaning we can leverage distributed data-parallel training for our model. 
To run our training with `deepspeed` we need to create a training script ([scripts/run_summarization.py](https://github.com/philschmid/deep-learning-habana-huggingface/blob/master/pre-training/scripts/run_mlm.py)) that implements our fine-tuning using the `GaudiSeq2SeqTrainer`. To execute our distributed training we use the `DistributedRunner` from `optimum-habana` and pass our arguments. Alternatively, you could check out the [gaudi_spawn.py](https://github.com/huggingface/optimum-habana/blob/main/examples/gaudi_spawn.py) in the [optimum-habana](https://github.com/huggingface/optimum-habana) repository.


Before we can start our training we need to define the `hyperparameters` we want to use. We are leveraging the [Hugging Face Hub](https://huggingface.co/models) integration of the `GaudiSeq2SeqTrainer` to automatically push our checkpoints, logs and metrics to a repository during training. 

In [2]:
from huggingface_hub import HfFolder

# hyperparameters
hyperparameters = {
    "model_id": "t5-3b",
    "dataset_id": "nickmuchi/trade-the-event-finance",
    "gaudi_config_id": "philschmid/bert-base-uncased-2022-habana",
    "repository_id": "Habana/t5",
    "hf_hub_token": HfFolder.get_token(),  # requires a prior login, e.g. via `notebook_login()` or `huggingface-cli login`
    "num_epochs": 3,
    "per_device_train_batch_size": 4,
    "learning_rate": 5e-5,
}
hyperparameters_string = " ".join(f"--{key} {value}" for key, value in hyperparameters.items())
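
Joining keys and values with plain string formatting works for the values used here. Values that contain spaces or shell metacharacters would need quoting before they are placed into a command string. A minimal sketch of a more defensive variant is shown below. It uses `shlex.quote` from the standard library and assumes the command is later interpreted by a POSIX shell; the helper name `to_cli_flags` is hypothetical and not part of `rm-runner`.

```python
import shlex


def to_cli_flags(params):
    """Turn a dict into a `--key value` string, quoting values for a POSIX shell."""
    return " ".join(f"--{key} {shlex.quote(str(value))}" for key, value in params.items())


example = {"model_id": "t5-3b", "num_epochs": 3, "learning_rate": 5e-5}
print(to_cli_flags(example))
# --model_id t5-3b --num_epochs 3 --learning_rate 5e-05
```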


We can start our training by creating an `EC2RemoteRunner` and then calling `launch` on it. This starts our AWS EC2 DL1 instance and runs our `run_summarization.py` script on it using the container image defined in `image_id`.

In [3]:
from rm_runner import EC2RemoteRunner
# create ec2 remote runner
runner = EC2RemoteRunner(
  instance_type="dl1.24xlarge",
  profile="hf-sm",  # adjust to your profile
  region="us-east-1",
  container=image_id # defined in the first step
  )

# launch my script with gaudi_spawn for the deepspeed training
runner.launch(
    command=f"python3 gaudi_spawn.py --use_deepspeed --world_size=8 run_summarization.py {hyperparameters_string}  --deepspeed ds_config.json",
    source_dir="scripts",
)


2022-10-18 11:10:02,947 | INFO | Found credentials in shared credentials file: ~/.aws/credentials
2022-10-18 11:10:04,320 | INFO | Created key pair: rm-runner-fggm
2022-10-18 11:10:05,782 | INFO | Created security group: rm-runner-fggm
2022-10-18 11:10:07,813 | INFO | Launched instance: i-0c170f7335e0ae0ef
2022-10-18 11:10:07,817 | INFO | Waiting for instance to be ready...
2022-10-18 11:10:23,935 | INFO | Instance is ready. Public DNS: ec2-54-89-187-241.compute-1.amazonaws.com
2022-10-18 11:10:23,962 | INFO | Setting up ssh connection...
2022-10-18 11:11:43,995 | INFO | Setting up ssh connection...
