# Fine-tune Llama-3.2-1B-Instruct with Alpaca Dataset

This example demonstrates how to fine-tune the Llama-3.2-1B-Instruct model with the Alpaca Dataset using the TorchTune `BuiltinTrainer` from the Kubeflow Trainer SDK.

This notebook walks you through the prerequisites for using the TorchTune `BuiltinTrainer` from the Kubeflow Trainer SDK and shows how to submit a TrainJob to bootstrap the fine-tuning workflow.

Llama-3.2-1B-Instruct: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

Alpaca Dataset: https://huggingface.co/datasets/tatsu-lab/alpaca

## Install the Kubeflow SDK

You need to install the Kubeflow SDK to interact with Kubeflow Trainer APIs:

In [None]:
!pip install git+https://github.com/jaiakash/sdk.git@backend-public

## Prerequisites

### Install Official Training Runtimes

Make sure you have installed the Kubeflow Trainer Controller Manager and the Kubeflow Training Runtimes as described in the [installation guide](https://www.kubeflow.org/docs/components/trainer/operator-guides/installation/).

In [None]:
# List all available Kubeflow Training Runtimes.
from kubeflow.trainer import *
from kubernetes import client as k8s_client
import os

client = TrainerClient()
for runtime in client.list_runtimes():
    print(runtime)

### Create PVCs for Models and Datasets

Currently, Kubeflow Trainer does not automatically orchestrate volume claims for a (Cluster)TrainingRuntime.

So, we need to manually create a PVC for each model we want to fine-tune. Please note that **the PVC name must be equal to the ClusterTrainingRuntime name**. In this example, it's `torchtune-llama3.2-1b`.

REF: https://github.com/kubeflow/trainer/issues/2630

In [None]:
# Create a PersistentVolumeClaim for the TorchTune Llama 3.2 1B model.
client.backend.core_api.create_namespaced_persistent_volume_claim(
  namespace="default",
  body=k8s_client.V1PersistentVolumeClaim(
    api_version="v1",
    kind="PersistentVolumeClaim",
    metadata=k8s_client.V1ObjectMeta(name="torchtune-llama3.2-1b"),
    spec=k8s_client.V1PersistentVolumeClaimSpec(
      access_modes=["ReadWriteOnce"],
      resources=k8s_client.V1ResourceRequirements(
        requests={"storage": "200Gi"}
      ),
    ),
  ),
)
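For reference, the claim created above can also be written as a plain manifest. The following is a minimal sketch that builds it as a dictionary and enforces the naming rule; `build_pvc_manifest` is a hypothetical helper, not part of the Kubeflow SDK, and the default storage size only mirrors the value used in this example.

```python
import re

# Kubernetes object names such as PVC names must be DNS subdomain names:
# lowercase alphanumerics, '-' and '.', with each dot-separated label
# starting and ending with an alphanumeric character.
_LABEL = r"[a-z0-9]([-a-z0-9]*[a-z0-9])?"
_DNS_SUBDOMAIN = re.compile(rf"^{_LABEL}(\.{_LABEL})*$")


def build_pvc_manifest(runtime_name: str, storage: str = "200Gi") -> dict:
    """Return a PVC manifest named after the runtime (hypothetical helper)."""
    if len(runtime_name) > 253 or not _DNS_SUBDOMAIN.match(runtime_name):
        raise ValueError(f"{runtime_name!r} is not a valid Kubernetes object name")
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        # The PVC name must equal the ClusterTrainingRuntime name.
        "metadata": {"name": runtime_name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": storage}},
        },
    }


manifest = build_pvc_manifest("torchtune-llama3.2-1b")
print(manifest["metadata"]["name"])
```

Deriving the PVC name from the runtime name in one place keeps the two from drifting apart when you switch to another model.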

## Bootstrap LLM Fine-tuning Workflow

The Kubeflow TrainJob fine-tunes the model using the referenced (Cluster)TrainingRuntime.

In [None]:
job_name = client.train(
    runtime=client.get_runtime(name="torchtune-llama3.2-1b"),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="hf://tatsu-lab/alpaca/data"
        ),
        model=HuggingFaceModelInitializer(
            storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct",
            access_token=os.environ["HF_TOKEN"],  # Set HF_TOKEN to your Hugging Face token.
        )
    ),
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dataset_preprocess_config=TorchTuneInstructDataset(
                source=DataFormat.PARQUET, split="train[:1000]"
            ),
            resources_per_node={
                "memory": "200G",
                "gpu": 1,
            },
        )
    )
)
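The call above reads the Hugging Face token from the `HF_TOKEN` environment variable, and `os.environ["HF_TOKEN"]` raises a bare `KeyError` when the variable is missing. As a minimal sketch, a check like the one below could run before submitting the job to produce a clearer message; `require_env` is a hypothetical helper and not part of the Kubeflow SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a descriptive error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before submitting the TrainJob."
        )
    return value


# Report a missing token instead of failing in the middle of the train() call.
try:
    hf_token = require_env("HF_TOKEN")
except RuntimeError as err:
    print(err)
```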

## Watch the TrainJob Logs

We can use the `get_job_logs()` API to get the TrainJob logs.

### Dataset Initializer

In [None]:
from kubeflow.trainer.constants import constants

log_dict = client.get_job_logs(job_name, follow=False, step=constants.DATASET_INITIALIZER)
print(log_dict[constants.DATASET_INITIALIZER])

### Model Initializer

In [None]:
log_dict = client.get_job_logs(job_name, follow=False, step=constants.MODEL_INITIALIZER)
print(log_dict[constants.MODEL_INITIALIZER])

### Trainer Node

In [None]:
log_dict = client.get_job_logs(job_name, follow=False)
print(log_dict[f"{constants.NODE}-0"])
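The cells above index the returned dictionary directly, which fails with a `KeyError` if a step has not produced logs yet. The sketch below shows one way to handle that, assuming only that the result behaves like a mapping from step name to log text, as the cells above suggest. `format_step_logs` is a hypothetical helper, and the sample keys are stand-ins rather than real step names.

```python
from typing import Mapping


def format_step_logs(log_dict: Mapping[str, str], step_key: str) -> str:
    """Return the logs for one step, or a short notice if none are available yet."""
    logs = log_dict.get(step_key)
    if not logs:
        return f"[{step_key}] no logs available yet"
    return f"[{step_key}]\n{logs}"


# Stand-in data for illustration only; real keys and contents come from get_job_logs().
sample = {"example-step": "line 1\nline 2"}
print(format_step_logs(sample, "example-step"))
print(format_step_logs(sample, "missing-step"))
```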

## Get the Fine-tuned Model

After the Trainer node completes the fine-tuning task, the fine-tuned model is stored in the `/workspace/output` directory, which can be shared across Pods by mounting the PVC. If you mount the PVC under `/<mountDir>` in another Pod, you can find the model in that Pod's `/<mountDir>/output` directory.
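The path mapping can be sketched as follows, assuming the PVC is mounted at `/workspace` in the Trainer Pod, which the paths above imply. `path_in_other_pod` and the `/mnt/models` mount directory are hypothetical names used only for illustration.

```python
from pathlib import PurePosixPath

# Assumed mount point of the PVC inside the Trainer Pod.
TRAINER_WORKSPACE = PurePosixPath("/workspace")


def path_in_other_pod(trainer_path: str, mount_dir: str) -> str:
    """Translate a path under /workspace to the same file under another mount point."""
    relative = PurePosixPath(trainer_path).relative_to(TRAINER_WORKSPACE)
    return str(PurePosixPath(mount_dir) / relative)


print(path_in_other_pod("/workspace/output", "/mnt/models"))  # prints /mnt/models/output
```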