# Fine-tune Llama3 and open LLMs on AWS Trainium 


Open LLMs like Meta [Llama 3](https://huggingface.co/meta-llama/Meta-Llama-3-70b), Mistral AI [Mistral](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) & [Mixtral](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) models or AI21 [Jamba](https://huggingface.co/ai21labs/Jamba-v0.1) now compete with OpenAI's models. However, most of the time you need to fine-tune the model on your own data to unlock its full potential. Fine-tuning smaller LLMs like Mistral has become very accessible, but it still requires a lot of computational resources. That's where AWS Trainium comes into play.


This blog post walks you through how to fine-tune Llama 3.1 8B using [Hugging Face Optimum Neuron](https://huggingface.co/docs/optimum-neuron/index) on AWS Trainium. We will use the `NeuronSFTTrainer` to fine-tune the model on a custom dataset. The `NeuronSFTTrainer` is a high-level trainer, similar to the `SFTTrainer` from `trl`, for supervised fine-tuning of LLMs on AWS Trainium. Afterwards, we will test the fine-tuned model with Hugging Face TGI. This post should provide you with a good starting point for fine-tuning and testing LLMs on AWS accelerators. You will learn how to:

1. [Setup AWS environment](#1-setup-aws-environment)
2. [Create and prepare the dataset for fine-tuning](#2-create-and-prepare-the-dataset-for-fine-tuning)
3. [Fine-tune Llama 3.1 using LoRA on AWS Trainium using the NeuronSFTTrainer](#3-fine-tune-llama-31-using-lora-on-aws-trainium-using-the-neurontrainer)
4. [Run Inference with Hugging Face TGI](#4-run-inference-with-hugging-face-tgi)

## Quick intro: AWS Trainium

[AWS Trainium (Trn1)](https://aws.amazon.com/de/ec2/instance-types/trn1/) is a purpose-built EC2 instance type for deep learning (DL) training workloads. Trainium is the successor of [AWS Inferentia](https://aws.amazon.com/ec2/instance-types/inf1/?nc1=h_ls), focused on high-performance training workloads. It has been optimized for training natural language processing, computer vision, and recommender models. The accelerator supports a wide range of data types, including FP32, TF32, BF16, FP16, UINT8, and configurable FP8.

The biggest Trainium instance, the `trn1.32xlarge`, comes with over 500GB of accelerator memory, making it easy to fine-tune ~10B parameter models on a single instance. Below you will find an overview of the available instance types. More details [here](https://aws.amazon.com/de/ec2/instance-types/trn1/#Product_details):

| instance size | accelerators | accelerator memory (GB) | vCPUs | CPU memory (GB) | price per hour |
| --- | --- | --- | --- | --- | --- |
| trn1.2xlarge | 1 | 32 | 8 | 32 | \$1.34 |
| trn1.32xlarge | 16 | 512 | 128 | 512 | \$21.50 |
| trn1n.32xlarge (2x bandwidth) | 16 | 512 | 128 | 512 | \$24.78 |
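
To put the memory figures in the table into perspective, here is a back-of-the-envelope estimate of how much room the weights of an ~8B parameter model take in bf16. The parameter count is a rounded assumption, and the estimate covers the weights only; gradients, optimizer state and activations come on top, so treat it as a lower bound rather than a sizing guide.

```python
# Rough estimate of the bf16 weight footprint of an ~8B parameter model.
NUM_PARAMS = 8e9             # assumption: rounded parameter count
BYTES_PER_PARAM_BF16 = 2     # bfloat16 stores each value in 2 bytes

ACCELERATOR_MEMORY_GB = 512  # trn1.32xlarge, from the table
NUM_ACCELERATORS = 16        # trn1.32xlarge, from the table

weights_gb = NUM_PARAMS * BYTES_PER_PARAM_BF16 / 1e9
memory_per_accelerator_gb = ACCELERATOR_MEMORY_GB / NUM_ACCELERATORS

print(f"bf16 weights: ~{weights_gb:.0f} GB")
print(f"memory per accelerator: {memory_per_accelerator_gb:.0f} GB")
print(f"share of total accelerator memory: {weights_gb / ACCELERATOR_MEMORY_GB:.1%}")
```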

---

*Note: This tutorial was created on a trn1.32xlarge AWS EC2 Instance.* 


## 1. Setup AWS environment

In this example, we will use the `trn1.32xlarge` instance on AWS with 16 accelerators (32 NeuronCores in total) and the [Hugging Face Neuron Deep Learning AMI](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2). The Hugging Face AMI comes with all important libraries, like Transformers, Datasets, Optimum and the Neuron packages, pre-installed. This makes it super easy to get started, since there is no need for environment management.

This blog post doesn’t cover how to create the instance in detail. You can check out my previous blog about [“Setting up AWS Trainium for Hugging Face Transformers”](https://www.philschmid.de/setup-aws-trainium), which includes a step-by-step guide on setting up the environment. 

Once the instance is up and running, we can ssh into it. But instead of developing inside a terminal we want to use a `Jupyter` environment, which we can use for preparing our dataset and launching the training. For this, we need to add a port for forwarding in the `ssh` command, which will tunnel our localhost traffic to the Trainium instance.

```bash
PUBLIC_DNS="" # public DNS name of the instance, e.g. ec2-3-80-....
KEY_PATH="" # local path to key, e.g. ssh/trn.pem

# forward local port 8080 to port 8080 on the Trainium instance
ssh -L 8080:localhost:8080 -i "${KEY_PATH}" ubuntu@"${PUBLIC_DNS}"
```

Next we need to clone the repository and change into the directory. Then we can launch the jupyter environment.

```bash
# clone repository
git clone https://github.com/philschmid/llama3-aws-trainium-sample.git
# change directory
cd llama3-aws-trainium-sample/notebooks
# launch jupyter
python -m notebook --allow-root --port=8080
```

You should see a familiar **`jupyter`** output with a URL to the notebook.

**`http://localhost:8080/?token=8c1739aff1755bd7958c4cfccc8d08cb5da5234f61f129a9`**

We can click on it, and a **`jupyter`** environment opens in our local browser. Open the notebook **`llama3-8b-fine-tuning.ipynb`** and let's get started.

_Note: We are going to use the Jupyter environment only for preparing the dataset and then `torchrun` for launching our training script for distributed training._

Before we can start fine-tuning, we need to log in to a Hugging Face account that has access to the model, so that its token can be used to access the gated repository. We can do this by running the following command:

_Note: We also provide an ungated checkpoint for Llama 3.1 8B._

In [None]:
!huggingface-cli login --token TOKEN

We also want to make sure we have the versions of the `optimum-neuron`, `trl` and `peft` packages installed that this notebook was tested with.

In [None]:
# versions used to test this notebook
%pip install --upgrade "optimum-neuron==0.0.26" "trl==0.11.4" "peft==0.13.2"
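
Since the rest of the notebook depends on these exact versions, it can be worth checking what actually ended up in the environment. The following is a minimal sketch using only the standard library; `find_mismatches` is a hypothetical helper name, not something the AMI or Optimum Neuron provides.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install cell of this notebook.
PINNED = {"optimum-neuron": "0.0.26", "trl": "0.11.4", "peft": "0.13.2"}

def find_mismatches(pinned):
    """Return {package: installed version or None} for every package that deviates."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = installed
    return mismatches

for name, installed in find_mismatches(PINNED).items():
    print(f"{name}: expected {PINNED[name]}, found {installed}")
```

If nothing is printed, the installed versions match the pinned ones.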

## 2. Create and prepare the dataset for fine-tuning

After our environment is set up, we can start creating and preparing our dataset. A fine-tuning dataset should have a diverse set of demonstrations of the task you want to solve. If you want to learn more about how to create a dataset, take a look at the [How to Fine-Tune LLMs in 2024 with Hugging Face](https://www.philschmid.de/fine-tune-llms-in-2024-with-trl#3-create-and-prepare-the-dataset).

We will use the [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots) dataset, a high-quality dataset of 10,000 instructions and demonstrations created by skilled human annotators. This data can be used for supervised fine-tuning (SFT) to make language models follow instructions better. No Robots was modelled after the instruction dataset described in OpenAI's [InstructGPT paper](https://huggingface.co/papers/2203.02155), and consists mostly of single-turn instructions.

```json
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```

The [no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots) dataset has 10,000 samples, split into 9,500 training and 500 test examples. Some samples do not include a `system` message. We will load the dataset with the `datasets` library, add the missing `system` messages and save the splits to separate json files.

In [None]:
from datasets import load_dataset

# System message that is added to conversations which do not start with one
system_message = """You are Llama, an AI assistant created by Philipp to be helpful and honest. Your knowledge spans a wide range of topics, allowing you to engage in substantive conversations and provide analysis on complex subjects."""

def create_conversation(sample):
    # keep samples that already start with a system message unchanged
    if sample["messages"][0]["role"] == "system":
        return sample
    # otherwise prepend the default system message
    sample["messages"] = [{"role": "system", "content": system_message}] + sample["messages"]
    return sample

# Load dataset from the hub
dataset = load_dataset("HuggingFaceH4/no_robots")

# Add system message to each conversation
columns_to_remove = list(dataset["train"].features)
columns_to_remove.remove("messages")
dataset = dataset.map(create_conversation, remove_columns=columns_to_remove, batched=False)

# Filter out corrupted conversations: keep only those with an even number of turns after the system message
dataset["train"] = dataset["train"].filter(lambda x: len(x["messages"][1:]) % 2 == 0)
dataset["test"] = dataset["test"].filter(lambda x: len(x["messages"][1:]) % 2 == 0)

# save datasets to disk 
dataset["train"].to_json("train_dataset.json", orient="records", force_ascii=False)
dataset["test"].to_json("test_dataset.json", orient="records", force_ascii=False)
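
To make the effect of the mapping and the filter easier to see, here is a toy version of the same logic on plain Python dictionaries, without the `datasets` library. The short system message and the three samples are made up for illustration; the function names are hypothetical stand-ins for `create_conversation` and the filter lambda.

```python
# Hypothetical placeholder for the longer system prompt used in the notebook.
SYSTEM_MESSAGE = "You are a helpful assistant."

def add_system_message(sample):
    """Prepend a system message unless the conversation already starts with one."""
    if sample["messages"][0]["role"] == "system":
        return sample
    return {"messages": [{"role": "system", "content": SYSTEM_MESSAGE}] + sample["messages"]}

def has_complete_turns(sample):
    """True if everything after the system message forms an even number of turns."""
    return len(sample["messages"][1:]) % 2 == 0

toy_samples = [
    # no system message, one user/assistant pair: gets a system message, kept
    {"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]},
    # already has a system message: unchanged, kept
    {"messages": [{"role": "system", "content": "Be brief."},
                  {"role": "user", "content": "Hi"},
                  {"role": "assistant", "content": "Hello!"}]},
    # dangling user turn without an answer: filtered out
    {"messages": [{"role": "user", "content": "Hi"}]},
]

prepared = [add_system_message(sample) for sample in toy_samples]
kept = [sample for sample in prepared if has_complete_turns(sample)]
print(f"kept {len(kept)} of {len(toy_samples)} samples")
```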

## 3. Fine-tune Llama 3.1 using LoRA on AWS Trainium using the NeuronSFTTrainer

We are now ready to fine-tune our LLM with the [NeuronSFTTrainer](https://huggingface.co/docs/optimum-neuron/package_reference/trainer), a 1-to-1 replacement for the Hugging Face `SFTTrainer` for AWS Trainium instances.

Every AWS Trainium instance comes with more than one NeuronCore (each accelerator has two), which means we will always use distributed training. The `NeuronSFTTrainer` comes with different distributed training strategies, including:
* [ZeRO-1](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/zero1_gpt2.html): shards the optimizer state over multiple devices.
* [Tensor Parallelism](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tensor_parallelism_overview.html): shards the model parameters along a given dimension on multiple devices, defined with `tensor_parallel_size`.
* [Sequence Parallelism](https://arxiv.org/pdf/2205.05198.pdf): shards the activations on the sequence axis outside of the tensor parallel regions, which saves memory.
* [Pipeline Parallelism](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/pipeline_parallelism_overview.html): _coming soon_

It also supports `SFTTrainer` features like:
* Dataset formatting, including conversational and instruction format
* Training on completions only, ignoring prompts
* Packing datasets for more efficient training
* PEFT (parameter-efficient fine-tuning) support including LoRA
* Preparing the model and tokenizer for conversational fine-tuning (e.g. adding special tokens)

We prepared a script, [run_sft.py](../scripts/run_sft.py), which loads the dataset from disk (json), prepares the model and tokenizer, and creates a LoRA configuration, making it straightforward to run supervised fine-tuning of open LLMs from YAML files.

When training models on AWS Accelerators like AWS Trainium, we must first compile our model to enable execution on the specialized hardware. During compilation, the model's computational graph is optimized and translated into instructions specifically tailored for Trainium's NeuronCores, ensuring efficient utilization of the accelerator's capabilities.

Model compilation is done by running `neuron_parallel_compile` with the training script, model and hyperparameters you plan to use during training, except that it only needs to run for a few steps, e.g. `10`. The `run_sft.py` script supports loading config files, which makes it easy to overwrite the default parameters.

First, let's create our config file `llama_3_8b.yaml`:

In [None]:
%%writefile llama_3_8b.yaml
# script parameters
model_id: "meta-llama/Llama-3.1-8b"    # Hugging Face model id
dataset_path: "train_dataset.json"     # path to dataset
max_seq_length: 1024                   # max sequence length for model and packing of the dataset
packing: true                          # group multiple samples into one sample to accelerate training
# training parameters
output_dir: "./llama3_trn"             # output directory for model checkpoints
report_to: "tensorboard"               # report metrics to tensorboard
learning_rate: 2.0e-4                  # learning rate; LoRA typically uses a ~10x higher rate than full fine-tuning
lr_scheduler_type: "constant"          # learning rate scheduler
num_train_epochs: 3                    # number of training epochs
per_device_train_batch_size: 1         # batch size per device during training
per_device_eval_batch_size: 1          # batch size for evaluation
gradient_accumulation_steps: 4         # number of steps to accumulate gradients before an optimizer update
optim: adamw_torch                     # use torch adamw optimizer
logging_steps: 10                      # log every 10 steps
save_strategy: epoch                   # save checkpoint every epoch
bf16: true                             # use bfloat16 precision
gradient_checkpointing: true           # use gradient checkpointing to save memory
# distributed parameters
tensor_parallel_size: 8                # tensor parallel degree: number of NeuronCores each model replica is sharded across
zero_1: false                          # use zero stage 1, not needed for Llama 3.1 8B
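
As a rough sanity check of the distributed settings in this config, the sketch below derives how many model replicas train in parallel and how many packed sequences contribute to one optimizer update. It assumes that the 32 worker processes are split into tensor-parallel groups of `tensor_parallel_size` and that the remaining factor is used for data parallelism. That is the usual convention, but it is an assumption here, and the helper and variable names are made up for illustration.

```python
# Values taken from this post; the helper itself is only an illustration.
NUM_WORKERS = 32                  # NeuronCores on a trn1.32xlarge
TENSOR_PARALLEL_SIZE = 8          # tensor_parallel_size
PER_DEVICE_TRAIN_BATCH_SIZE = 1   # per_device_train_batch_size
GRADIENT_ACCUMULATION_STEPS = 4   # gradient_accumulation_steps
MAX_SEQ_LENGTH = 1024             # max_seq_length

def data_parallel_size(num_workers, tensor_parallel_size):
    """Number of model replicas, assuming workers = replicas x tensor parallel size."""
    if num_workers % tensor_parallel_size != 0:
        raise ValueError("num_workers must be divisible by tensor_parallel_size")
    return num_workers // tensor_parallel_size

replicas = data_parallel_size(NUM_WORKERS, TENSOR_PARALLEL_SIZE)
sequences_per_update = replicas * PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
tokens_per_update = sequences_per_update * MAX_SEQ_LENGTH

print(f"model replicas: {replicas}")
print(f"sequences per optimizer update: {sequences_per_update}")
print(f"tokens per optimizer update (at most): {tokens_per_update}")
```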

After we created our config file, we can pre-compile our model using the `neuron_parallel_compile` command. We will use the `run_sft.py` script to compile the model. 

In [None]:
%%bash
# Set environment variables for memory and precision
export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
export MALLOC_ARENA_MAX=64
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/"
export XLA_USE_BF16=1 
export XLA_DOWNCAST_BF16=1

# Run the parallel compilation
neuron_parallel_compile torchrun \
    --nproc_per_node=32 \
    ../scripts/run_sft.py \
    --config llama_3_8b.yaml \
    --max_steps 10
    
# remove dummy artifacts which are created by the precompilation command
rm -rf "llama3_trn"

_Note: Compiling without a cache can take ~20-40 minutes. It will also create dummy files in the `llama3_trn` directory during compilation, which we have to remove afterwards. We also need to set `MALLOC_ARENA_MAX=64` to limit CPU memory allocation and avoid potential crashes; don't remove it for now._

After the compilation is done, we can start our training with a similar command. We just need to remove `neuron_parallel_compile` and `--max_steps`, and can then launch the training with the following command.

In [None]:
%%bash
# Set environment variables for memory and precision
export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
export MALLOC_ARENA_MAX=64
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/"
export XLA_USE_BF16=1 
export XLA_DOWNCAST_BF16=1

# Run the training
torchrun \
    --nproc_per_node=32 \
    ../scripts/run_sft.py \
    --config llama_3_8b.yaml
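
The pre-compilation and training cells differ only in the `neuron_parallel_compile` prefix and the `--max_steps` flag. The sketch below makes that relationship explicit by building both command lines from one place. `build_command` is a hypothetical helper, not part of Optimum Neuron, and it only assembles the argument list; the environment variables from the cells still need to be set before running anything.

```python
import shlex

def build_command(config_path, precompile, nproc_per_node=32, compile_steps=10):
    """Assemble the torchrun command, optionally wrapped for pre-compilation."""
    cmd = [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        "../scripts/run_sft.py",
        "--config", config_path,
    ]
    if precompile:
        # compilation only needs a few steps to trace all graphs
        cmd = ["neuron_parallel_compile"] + cmd + ["--max_steps", str(compile_steps)]
    return cmd

print(shlex.join(build_command("llama_3_8b.yaml", precompile=True)))
print(shlex.join(build_command("llama_3_8b.yaml", precompile=False)))
```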

The training for 3 epochs on `no_robots` (9.5k samples) took 13 minutes. At \$21.50 per hour, this leads to a cost of ~\$4.66 for the end-to-end training on the `trn1.32xlarge` instance. Not bad!
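
The cost figure follows directly from the hourly price in the instance table and the measured training time. A minimal sketch of the arithmetic, covering the 13 minutes of training only (the one-time compilation is not included):

```python
PRICE_PER_HOUR_USD = 21.50   # trn1.32xlarge price from the instance table
TRAINING_MINUTES = 13        # training time reported in this post

def training_cost(price_per_hour, minutes):
    """Cost of running an instance for the given number of minutes."""
    return price_per_hour * minutes / 60

cost = training_cost(PRICE_PER_HOUR_USD, TRAINING_MINUTES)
print(f"training cost: ~${cost:.2f}")
```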

But before we can share and test our model, we need to consolidate the model weights. Since we used Tensor Parallelism during training, the model weights are sharded across different workers and only sharded checkpoints are saved.
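
To illustrate what sharding and consolidating means, here is a toy example in plain Python that splits a small weight matrix column-wise across two workers and then stitches it back together. It is only a conceptual sketch: the real checkpoints use tensors, and which dimension a layer is sharded along depends on the layer, so this is not how Optimum Neuron implements consolidation.

```python
def shard_columns(matrix, num_shards):
    """Split a matrix (list of rows) column-wise into num_shards equal parts."""
    num_cols = len(matrix[0])
    if num_cols % num_shards != 0:
        raise ValueError("number of columns must be divisible by num_shards")
    width = num_cols // num_shards
    return [[row[i * width:(i + 1) * width] for row in matrix] for i in range(num_shards)]

def consolidate_columns(shards):
    """Concatenate column shards back into the full matrix."""
    num_rows = len(shards[0])
    return [[value for shard in shards for value in shard[r]] for r in range(num_rows)]

weight = [[1, 2, 3, 4],
          [5, 6, 7, 8]]

shards = shard_columns(weight, num_shards=2)   # one shard per worker
restored = consolidate_columns(shards)

print("worker 0 holds:", shards[0])
print("worker 1 holds:", shards[1])
print("consolidated matches original:", restored == weight)
```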

The Optimum CLI makes this easy via the `optimum-cli neuron consolidate` command. Since we used LoRA, we also need to merge the LoRA weights with the base model afterwards. We added a helper script, [merge_adapter_weights.py](../scripts/merge_adapter_weights.py), to do this.

In [None]:
# consolidate tp adapter shards
# !optimum-cli neuron consolidate llama3_trn/adapter_shards/ llama3_trn
!optimum-cli neuron consolidate llama3_trn/shards/ llama3_trn
# merge adapter weights
!python ../scripts/merge_adapter_weights.py --peft_model_id llama3_trn --output_dir merged_llama3_8b
# clear old repository
# !rm -rf llama3_trn
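
For intuition on the merge step: LoRA keeps the base weight `W` frozen and learns two small matrices `A` and `B`, and merging folds them back in as `W + (lora_alpha / r) * B @ A`. The sketch below shows this on a tiny example in plain Python. The matrices, rank and `lora_alpha` are made-up values, and it is an assumption that the helper script performs this standard merge internally (its code is not shown in this post).

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return weight + (lora_alpha / r) * lora_b @ lora_a."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

weight = [[1.0, 0.0],
          [0.0, 1.0]]      # base weight, shape (d_out=2, d_in=2)
lora_a = [[1.0, 2.0]]      # shape (r=1, d_in=2)
lora_b = [[0.5],
          [1.0]]           # shape (d_out=2, r=1)

merged = merge_lora(weight, lora_a, lora_b, lora_alpha=2, r=1)
print(merged)
```

After merging, the adapter is no longer needed at inference time, because the update is part of the regular weight matrix.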


## 4. Run Inference with Hugging Face TGI

Similar to training, we need to compile our model before we can run inference on AWS Trainium or AWS Inferentia2. We will use our Trainium instance for the inference test, but we recommend that customers switch to Inferentia2 for inference.

Optimum Neuron implements classes similar to the Transformers AutoModel classes for easy inference, such as `NeuronModelForCausalLM`, which can load a vanilla transformers checkpoint and convert it to Neuron. Here we run the equivalent export with `optimum-cli export neuron` inside the Neuron TGI container.

In [None]:
%%bash
docker run --entrypoint optimum-cli \
  -v $(pwd)/llama3_trn:/model \
  -v $(pwd)/compiled_llama3_8b:/compiled_llama3_8b \
  --privileged \
  ghcr.io/huggingface/neuronx-tgi:latest \
  export neuron --model /model --batch_size 4 --sequence_length 4096 --auto_cast_type bf16 --num_cores 2 --task text-generation /compiled_llama3_8b

In [None]:
# from optimum.neuron import NeuronModelForCausalLM
# from transformers import AutoTokenizer

# compiler_args = {"num_cores": 2, "auto_cast_type": 'bf16'}
# input_shapes = {"batch_size": 4, "sequence_length": 4096}

# tokenizer = AutoTokenizer.from_pretrained("llama3_trn")
# model = NeuronModelForCausalLM.from_pretrained(
#         "llama3_trn",
#         export=True,
#         **compiler_args,
#         **input_shapes)

# # save the compiled model so we can load it with Hugging Face TGI 
# model.save_pretrained("compiled_llama3_8b")
# tokenizer.save_pretrained("compiled_llama3_8b")

_Note: Inference compilation can take ~25 minutes. Luckily, you only need to run this once, since you can save the model afterwards. If you are going to run on Inferentia2, you need to compile again, because the compilation is parameter- and hardware-specific._

After we compiled the model, we can deploy it to production. For deploying open LLMs into production we recommend using Text Generation Inference (TGI). TGI is a purpose-built solution for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation using Tensor Parallelism and continuous batching for the most popular open LLMs, including Llama, Mistral, Mixtral, StarCoder, T5 and more. Text Generation Inference is used by companies such as IBM, Grammarly, Uber, Deutsche Telekom, and many more. There are existing examples on how to deploy open LLMs on AWS Inferentia2 and AWS Trainium:
* [Deploy Mixtral 8x7B on AWS Inferentia2 with Hugging Face Optimum](https://www.philschmid.de/inferentia2-mixtral-8x7b)
* [Deploy Llama 3 70B on AWS Inferentia2 with Hugging Face Optimum](https://www.philschmid.de/inferentia2-llama3-70b)

To test the model, we will use the Neuron-specific TGI docker container.

_Note: We have to make sure to launch the container with the same parameters as we used during compilation._


In [None]:
%%bash
docker run -p 8080:80 -d --name tgi \
       -v $(pwd)/compiled_llama3_8b:/llama \
       --privileged \
       -e HF_TOKEN=${HF_TOKEN} \
       -e HF_AUTO_CAST_TYPE="bf16" \
       -e HF_NUM_CORES=2 \
       ghcr.io/huggingface/neuronx-tgi:latest \
       --model-id llama \
       --max-batch-size 4 \
       --max-input-tokens 4000 \
       --max-total-tokens 4096
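
Because the launch parameters have to line up with the ones used during compilation, a small check can catch mismatches before the container is started. The sketch below compares the values from the export and launch cells. The dictionaries and `find_config_problems` are hypothetical, and the rules only encode the "same parameters" note; the exact constraints the container enforces are an assumption.

```python
# Values copied from the export and launch cells in this post.
compiled = {"batch_size": 4, "sequence_length": 4096, "auto_cast_type": "bf16", "num_cores": 2}
launch = {
    "max_batch_size": 4,
    "max_input_tokens": 4000,
    "max_total_tokens": 4096,
    "auto_cast_type": "bf16",   # HF_AUTO_CAST_TYPE
    "num_cores": 2,             # HF_NUM_CORES
}

def find_config_problems(compiled, launch):
    """Return a list of human-readable mismatches (empty if none are found)."""
    problems = []
    if launch["max_batch_size"] != compiled["batch_size"]:
        problems.append("max_batch_size differs from the compiled batch_size")
    if launch["max_total_tokens"] != compiled["sequence_length"]:
        problems.append("max_total_tokens differs from the compiled sequence_length")
    if launch["max_input_tokens"] >= launch["max_total_tokens"]:
        problems.append("max_input_tokens leaves no room for generated tokens")
    for key in ("auto_cast_type", "num_cores"):
        if launch[key] != compiled[key]:
            problems.append(f"{key} differs from the compiled value")
    return problems

print(find_config_problems(compiled, launch) or "launch parameters match the compiled model")
```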

Once your container is running you can send requests using the `openai` or `huggingface_hub` python libraries. 
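
Both libraries talk to the OpenAI-compatible chat completions endpoint that TGI exposes under `/v1/chat/completions`. To show what such a request looks like on the wire, here is a minimal sketch using only the Python standard library. `build_chat_request` is a hypothetical helper, and the call that actually sends the request is commented out because it needs the container to be running.

```python
import json
import urllib.request

def build_chat_request(messages, max_tokens=500, base_url="http://localhost:8080/v1"):
    """Build a non-streaming chat completion request for the local TGI container."""
    payload = {
        "model": "compiled_llama3_8b",
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_chat_request([{"role": "user", "content": "What is the capital of France?"}])
print(request.get_method(), request.full_url)

# Uncomment while the TGI container is running:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
#     print(body["choices"][0]["message"]["content"])
```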

In [None]:
from huggingface_hub import InferenceClient

# InferenceClient takes the endpoint via `base_url`
client = InferenceClient(base_url="http://localhost:8080/v1", api_key="notNeeded")

messages = [
    {
        "role": "user",
        "content": "What is the capital of France?"
    }
]

stream = client.chat.completions.create(
    model="compiled_llama3_8b",
    messages=messages,
    max_tokens=500,
    stream=True
)

for chunk in stream:
    # a chunk can carry no content, so fall back to an empty string
    print(chunk.choices[0].delta.content or "", end="")

> AWS stands for Amazon Web Services. AWS is a suite of remote computing services offered by Amazon. The most widely used of these include Amazon Elastic Compute Cloud (Amazon EC2), which provides resizable compute capacity in the cloud; Amazon Simple Storage Service (Amazon S3), which is an object storage service; and Amazon Elastic Block Store (Amazon EBS), which is designed to provide high performance, durable block storage volumes for use with AWS instances. AWS also provides other services, such as AWS Identity and Access Management (IAM), a service that enables organizations to control access to their AWS resources, and AWS Key Management Service (AWS KMS), which helps customers create and control the use of encryption keys.</s>

Awesome! Don't forget to stop your container once you are done.

In [None]:
!docker stop tgi && docker rm tgi