# T5 LLM Fine-Tuning with DeepSpeed and Kubeflow Trainer


This notebook fine-tunes the Text-to-Text Transfer Transformer (T5) on the Wikihow dataset for text summarization, using a Kubeflow TrainJob and DeepSpeed.

Pretrained T5 model: https://huggingface.co/google-t5/t5-base

Wikihow dataset: https://huggingface.co/datasets/sentence-transformers/wikihow

This notebook uses **4 x NVIDIA A100 GPUs** to fine-tune the T5 model on 2 nodes (each node has 2 GPUs).

## Install the Kubeflow SDK

You need to install the Kubeflow SDK to interact with Kubeflow Trainer APIs:

In [None]:
# !pip install git+https://github.com/kubeflow/sdk.git@main

## Create Script to Fine-Tune T5 with DeepSpeed

We need to wrap the fine-tuning script in a function to create a Kubeflow TrainJob.

In [None]:
def deepspeed_train_t5(args):
    import os
    import time
    import boto3
    import torch
    import torch.distributed as dist
    from torch.utils.data.distributed import DistributedSampler
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    from datasets import load_dataset
    import deepspeed
    import numpy as np

    # Initialize distributed environment.
    deepspeed.init_distributed(dist_backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])

    # Define the Wikihow dataset class
    class wikihow(torch.utils.data.Dataset):
        def __init__(self, tokenizer, num_samples):
            self.dataset = load_dataset(
                "sentence-transformers/wikihow", split=f"train[:{num_samples}]"
            )
            self.tokenizer = tokenizer

        def __len__(self):
            return len(self.dataset)

        def clean_text(self, text):
            if text is None:
                return ""

            return text.replace("\n", " ").replace("``", "").replace('"', "").strip()

        def convert_to_features(self, example_batch):
            input_ = self.clean_text(example_batch["text"])
            target_ = self.clean_text(example_batch["summary"])

            source = self.tokenizer(
                input_,
                max_length=512,
                padding="max_length",
                truncation=True,
                return_tensors="pt",
            )
            targets = self.tokenizer(
                target_,
                max_length=150,
                padding="max_length",
                truncation=True,
                return_tensors="pt",
            )

            return source, targets

        def __getitem__(self, index):
            source, targets = self.convert_to_features(self.dataset[index])
            return {
                "source_ids": source["input_ids"].squeeze(),
                "source_mask": source["attention_mask"].squeeze(),
                "target_ids": targets["input_ids"].squeeze(),
                "target_mask": targets["attention_mask"].squeeze(),
            }

    # Download model and tokenizer.
    if dist.get_rank() == 0:
        print("-" * 100)
        print("Downloading T5 Model")
        print("-" * 100)

    model = T5ForConditionalGeneration.from_pretrained(args["MODEL_NAME"])
    tokenizer = T5Tokenizer.from_pretrained(args["MODEL_NAME"])

    # Download dataset.
    dataset = wikihow(tokenizer, num_samples=int(args["NUM_SAMPLES"]))
    # The DataLoader batch size matches `train_micro_batch_size_per_gpu` in the DeepSpeed config.
    train_loader = torch.utils.data.DataLoader(
        dataset, batch_size=2, sampler=DistributedSampler(dataset)
    )

    # Define DeepSpeed configuration.
    # Train batch size = micro batch size * gradient accumulation steps * GPUs (e.g. 2 x 1 x 4 = 8).
    ds_config = {
        "train_micro_batch_size_per_gpu": 2,
        "gradient_accumulation_steps": 1,
        # "fp16": {"enabled": True}, # If your GPU (e.g. V100) doesn't support bf16, use fp16.
        "bf16": {"enabled": True},  # Enable mixed precision.
        "optimizer": {
            "type": "AdamW",
            "params": {"lr": 0.001},  # Matches warmup_max_lr in the scheduler.
        },
        "scheduler": {
            "type": "WarmupLR",
            "params": {
                "warmup_min_lr": 0,
                "warmup_max_lr": 0.001,
                "warmup_num_steps": 1000,
            },
        },
    }

    # Initialize model with DeepSpeed.
    model, _, _, _ = deepspeed.initialize(
        config=ds_config,
        model=model,
        model_parameters=model.parameters(),
    )

    # Start training process.
    if dist.get_rank() == 0:
        print("-" * 100)
        print("Starting DeepSpeed distributed training...")
        print("-" * 100)

    t0 = time.time()
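    for epoch in range(1, 3):
        # Reshuffle the sampler so each epoch sees a different sample order.
        train_loader.sampler.set_epoch(epoch)
        losses = []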
        for batch_idx, batch in enumerate(train_loader):
            for key in batch.keys():
                batch[key] = batch[key].to(local_rank)
            # Forward pass.
            output = model(
                input_ids=batch["source_ids"],
                attention_mask=batch["source_mask"],
                labels=batch["target_ids"],
            )
            loss = output.loss

            # Run backpropagation.
            model.backward(loss)
            # Weight updates.
            model.step()
            losses.append(loss.item())
            if batch_idx % 10 == 0 and dist.get_rank() == 0:
                print(
                    "Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}".format(
                        epoch,
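                        # len(batch) would count dict keys, so measure the tensor instead.
                        batch_idx * len(batch["source_ids"]),
                        # Number of samples assigned to this rank.
                        len(train_loader.sampler),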
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )

        if dist.get_rank() == 0:
            print("-" * 100)
            print("Average Train Loss: {0:.4f}".format(np.mean(losses)))
            print("-" * 100)

    if dist.get_rank() == 0:
        print("-" * 100)
        print(f"DeepSpeed training time: {int(time.time() - t0)} seconds")
        print("-" * 100)

        print("Exporting HuggingFace model to S3")
        MODEL_PATH = os.path.join("/home/mpiuser", args["MODEL_NAME"])
        model.module.save_pretrained(MODEL_PATH)
        tokenizer.save_pretrained(MODEL_PATH)

        bucket = boto3.resource("s3").Bucket(args["BUCKET"])
        for file in os.listdir(MODEL_PATH):
            print(f"Uploading file {os.path.join(MODEL_PATH, file)}")
            bucket.upload_file(
                os.path.join(MODEL_PATH, file), os.path.join(args["MODEL_NAME"], file)
            )
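
One detail of the loop above is worth checking: `target_ids` is padded to 150 tokens and passed directly as `labels`, so padded positions contribute to the loss. A common convention with HuggingFace seq2seq models is to replace padding IDs in the labels with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below shows the idea on plain Python lists. The helper name `mask_padding` is hypothetical, and it assumes the T5 tokenizer's pad token ID of 0. On tensors, the equivalent would be `labels.masked_fill(labels == pad_token_id, -100)`.

```python
IGNORE_INDEX = -100  # Label value skipped by PyTorch's cross-entropy loss by default.
T5_PAD_TOKEN_ID = 0  # Assumption: pad token ID of the T5 tokenizer.


def mask_padding(label_ids, pad_token_id=T5_PAD_TOKEN_ID):
    """Return a copy of label_ids with padding replaced by IGNORE_INDEX."""
    return [IGNORE_INDEX if token == pad_token_id else token for token in label_ids]


# Illustrative token IDs followed by two padding positions.
print(mask_padding([37, 1782, 1, 0, 0]))
```

Whether this changes summarization quality for this notebook has not been measured here; it is a possible refinement rather than a required fix.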

## List Available Kubeflow Trainer Runtimes


Get available Kubeflow Trainer Runtimes with the `list_runtimes()` API.

You can inspect Runtime details, including the name, framework, and available devices on a single node.

- Runtimes with **CustomTrainer**: You must write the training script within the function.

- Runtimes with **BuiltinTrainer**: You can configure settings (e.g., LoRA Config) for an LLM fine-tuning job.
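
When selecting a runtime by name, it can help to fail with a clear message if the expected runtime is not installed on the cluster. The sketch below is one way to do that. The helper `find_runtime` is hypothetical, and the `SimpleNamespace` objects are stand-ins for whatever `list_runtimes()` returns, assuming only that each item has a `name` attribute.

```python
from types import SimpleNamespace


def find_runtime(runtimes, name):
    """Return the runtime called `name`, or raise a descriptive error."""
    for runtime in runtimes:
        if runtime.name == name:
            return runtime
    available = ", ".join(sorted(r.name for r in runtimes))
    raise LookupError(f"Runtime {name!r} not found; available: {available}")


# Hypothetical stand-ins for the objects returned by list_runtimes().
runtimes = [
    SimpleNamespace(name="torch-distributed"),
    SimpleNamespace(name="deepspeed-distributed"),
]
selected = find_runtime(runtimes, "deepspeed-distributed")
print(selected.name)
```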


In [None]:
from kubeflow.trainer import TrainerClient, CustomTrainer

for r in TrainerClient().list_runtimes():
    if r.name == "deepspeed-distributed":
        print(f"Name: {r.name}, Framework: {r.trainer.framework}, Trainer Type: {r.trainer.trainer_type.value}\n")
        print(f"Runtime devices: {r.trainer.device} x {r.trainer.device_count}")
        deepspeed_runtime = r

## Create TrainJob for Distributed Training

Use the `train()` API to scale the training code across 2 nodes and 4 GPUs (2 GPUs per node).

Don't forget to update **the S3 bucket** name.
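
Scaling across nodes also changes the effective batch size seen by the optimizer. DeepSpeed defines the train batch size as the micro batch size per GPU, times the gradient accumulation steps, times the number of GPUs. A minimal sketch of that arithmetic, using the values from the DeepSpeed config in the training function (the helper name is hypothetical):

```python
def effective_batch_size(micro_batch_per_gpu, grad_accum_steps, num_nodes, gpus_per_node):
    """Samples consumed per optimizer step across all GPUs."""
    world_size = num_nodes * gpus_per_node
    return micro_batch_per_gpu * grad_accum_steps * world_size


# 2 samples per GPU, no accumulation, 2 nodes with 2 GPUs each.
print(effective_batch_size(2, 1, num_nodes=2, gpus_per_node=2))  # 8
```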

In [None]:
MODEL_NAME = "t5-base"
# BUCKET_NAME = "TODO: add your bucket here"

In [None]:
args = {
    "NUM_SAMPLES": "2000",
    "MODEL_NAME": MODEL_NAME,
    "BUCKET": BUCKET_NAME,
}

job_id = TrainerClient().train(
    trainer=CustomTrainer(
        func=deepspeed_train_t5,
        func_args=args,
        packages_to_install=["boto3"], # Custom packages to install at runtime.
        num_nodes=2,
    ),
    runtime=deepspeed_runtime,
)

In [None]:
# Train API generates a random TrainJob id.
job_id

## Check the TrainJob Info

Use the `list_jobs()` and `get_job()` APIs to get information about the created TrainJob and its steps.

In [None]:
for job in TrainerClient().list_jobs():
    print(f"TrainJob: {job.name}, Status: {job.status}, Created at: {job.creation_timestamp}")

In [None]:
# The mpirun command runs on node-0, which acts as the MPI launcher node.
for c in TrainerClient().get_job(name=job_id).steps:
    print(f"Step: {c.name}, Status: {c.status}, Devices: {c.device} x {c.device_count}\n")

## Get the TrainJob Logs

Use the `get_job_logs()` API to retrieve the TrainJob logs.

Since we distribute the dataset across 4 GPUs (2 nodes x 2 GPUs), each rank processes `round(2000 / 4) = 500` samples.
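
The per-rank count follows from how PyTorch's `DistributedSampler` splits a dataset. By default it rounds up and pads the shortfall with repeated samples, and with `drop_last=True` it drops the tail instead. A small sketch of that arithmetic (the helper name is hypothetical):

```python
import math


def samples_per_rank(dataset_size, world_size, drop_last=False):
    """Number of samples each rank iterates over in one epoch."""
    if drop_last:
        # The tail that does not divide evenly is dropped.
        return dataset_size // world_size
    # Otherwise the sampler pads with repeated samples so every rank gets the same count.
    return math.ceil(dataset_size / world_size)


print(samples_per_rank(2000, 4))  # 500
```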

In [None]:
_ = TrainerClient().get_job_logs(name=job_id, follow=True)

## Download the Trained Model

Finally, download the fine-tuned model from S3 for evaluation.
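
A single `list_objects_v2` call returns at most 1,000 keys and omits the `Contents` field when nothing matches the prefix. For a handful of model files that is rarely a problem, but a paginator makes the listing robust in both cases. The sketch below assumes a boto3 S3 client is passed in, and the helper name `list_keys` is hypothetical.

```python
def list_keys(s3_client, bucket, prefix):
    """Collect every object key under `prefix`, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no matching objects.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Example usage (requires AWS credentials and an existing bucket):
# import boto3
# keys = list_keys(boto3.client("s3"), BUCKET_NAME, "t5-base/")
```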

In [None]:
LOCAL_DIR = "./t5-base"

In [None]:
import boto3
import os

os.makedirs(LOCAL_DIR, exist_ok=True)
s3 = boto3.client("s3")
for obj in s3.list_objects_v2(Bucket=BUCKET_NAME, Prefix="t5-base")["Contents"]:
    file = obj["Key"]

    print(f"Downloading file: {file}")
    s3.download_file(BUCKET_NAME, file, os.path.join(LOCAL_DIR, os.path.basename(file)))

## Evaluate Fine-Tuned T5 Model

After the model is downloaded, you can load it into a HuggingFace pipeline.

The T5 model performs well for NLP tasks such as summarization, translation, and text classification.

In the example below, we'll demonstrate how to use a fine-tuned version of the T5 model to summarize documentation related to the Kubeflow Trainer project.

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# Load the fine-tuned T5 model.
model = AutoModelForSeq2SeqLM.from_pretrained(LOCAL_DIR)
tokenizer = AutoTokenizer.from_pretrained(LOCAL_DIR)

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="pt")

text = """
summarize: In Kubeflow Trainer you can integrate other ML libraries such as HuggingFace,
DeepSpeed, or Megatron-LM with Kubeflow Trainer to orchestrate their ML training on Kubernetes.
Kubeflow Trainer allows you to effortlessly develop your LLMs with the Kubeflow Python SDK
and build Kubernetes-native Training Runtimes with Kubernetes Custom Resources APIs.
Kubeflow Trainer is a Kubernetes-native project designed for large language models (LLMs)
fine-tuning and enabling scalable, distributed training of machine learning (ML)
models across various frameworks, including PyTorch, JAX, TensorFlow, and XGBoost.
"""

summarizer(text, min_length=5, max_length=100)