# Fine-tune Llama-3.2-1B-Instruct with the Alpaca Dataset

This example demonstrates how to fine-tune the Llama-3.2-1B-Instruct model on the Alpaca dataset using the TorchTune `BuiltinTrainer` from the Kubeflow Trainer SDK.

This notebook walks you through the prerequisites for using the TorchTune `BuiltinTrainer` from the Kubeflow Trainer SDK and shows how to submit a TrainJob to bootstrap the fine-tuning workflow.

Llama-3.2-1B-Instruct: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

Alpaca Dataset: https://huggingface.co/datasets/tatsu-lab/alpaca

## Install the Kubeflow SDK

You need to install the Kubeflow SDK to interact with Kubeflow Trainer APIs:

In [None]:
# !pip install git+https://github.com/kubeflow/sdk.git@main#subdirectory=python

## Prerequisites

### Install Official Training Runtimes

Make sure that the Kubeflow Trainer Controller Manager and the Kubeflow Training Runtimes are installed, as described in the [installation guide](https://www.kubeflow.org/docs/components/trainer/operator-guides/installation/).

In [20]:
# List all available Kubeflow Training Runtimes.
from kubeflow.trainer import TrainerClient

client = TrainerClient(namespace="kubeflow")
for runtime in client.list_runtimes():
    print(runtime)

Runtime(name='deepspeed-distributed', trainer=Trainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework=<Framework.DEEPSPEED: 'deepspeed'>, entrypoint=['mpirun', '--hostfile', '/etc/mpi/hostfile', 'bash', '-c'], accelerator='Unknown', accelerator_count=4), pretrained_model=None)
Runtime(name='mlx-distributed', trainer=Trainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework=<Framework.MLX: 'mlx'>, entrypoint=['mpirun', '--hostfile', '/etc/mpi/hostfile', 'bash', '-c'], accelerator='Unknown', accelerator_count=1), pretrained_model=None)
Runtime(name='mpi-distributed', trainer=Trainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework=<Framework.TORCH: 'torch'>, entrypoint=['torchrun'], accelerator='Unknown', accelerator_count=1), pretrained_model=None)
Runtime(name='torch-distributed', trainer=Trainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework=<Framework.TORCH: 'torch'>, entrypoint=['torchrun'], accele

### Create PVCs for Models and Datasets

Currently, we do not support automatically orchestrating the volume claim in a (Cluster)TrainingRuntime.

So, we need to manually create a PVC for each model we want to fine-tune. In this example, we'll create a PVC named `torchtune-llama3.2-1b`, which has the same name as the ClusterTrainingRuntime we want to use.

REF: https://github.com/kubeflow/trainer/issues/2630

In [3]:
%%bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: torchtune-llama3.2-1b
  namespace: kubeflow
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
EOF

persistentvolumeclaim/torchtune-llama3.2-1b created
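
If several models need their own PVCs, the manifest above can be generated rather than hand-written. The following is a minimal sketch, assuming the PVC name must equal the runtime name as described above; `build_pvc_manifest` is a hypothetical helper introduced here for illustration and is not part of the Kubeflow SDK.

```python
import json


def build_pvc_manifest(runtime_name: str, namespace: str = "kubeflow",
                       storage: str = "20Gi") -> dict:
    """Return a PersistentVolumeClaim manifest named after the runtime.

    Hypothetical helper: it mirrors the YAML manifest shown above field by field.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": runtime_name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": storage}},
        },
    }


pvc = build_pvc_manifest("torchtune-llama3.2-1b")
# kubectl accepts JSON as well as YAML, so this output can be piped
# to `kubectl apply -f -`.
print(json.dumps(pvc, indent=2))
```

Keeping the runtime name as the single input avoids the PVC and the ClusterTrainingRuntime drifting apart when more models are added.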


## Bootstrap LLM Fine-tuning Workflow

The TrainJob fine-tunes the model using the (Cluster)TrainingRuntime referenced in `runtimeRef`.

In [None]:
%%bash
export HF_TOKEN="<YOUR_HF_TOKEN_HERE>"

kubectl apply -f - <<EOF
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: torchtune-llama3-2-1b
  namespace: kubeflow
spec:
  runtimeRef:
    name: torchtune-llama3.2-1b
  trainer:
    resourcesPerNode:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
    numProcPerNode: 1
  initializer:
    model:
      env:
        - name: ACCESS_TOKEN
          value: "${HF_TOKEN}"
EOF

trainjob.trainer.kubeflow.org/torchtune-llama3-2-1b created
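
The same TrainJob can also be assembled in Python, which keeps the Hugging Face token out of the notebook source by reading it from the environment. This is a sketch under stated assumptions: `build_trainjob_manifest` is a hypothetical helper, not an SDK function, and its fields simply mirror the YAML above.

```python
import json
import os


def build_trainjob_manifest(job_name: str, runtime_name: str, hf_token: str,
                            namespace: str = "kubeflow", gpus: int = 1,
                            num_proc_per_node: int = 1) -> dict:
    """Return a TrainJob manifest equivalent to the YAML shown above.

    Hypothetical helper for illustration only.
    """
    gpu_resources = {"nvidia.com/gpu": gpus}
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "TrainJob",
        "metadata": {"name": job_name, "namespace": namespace},
        "spec": {
            "runtimeRef": {"name": runtime_name},
            "trainer": {
                "resourcesPerNode": {
                    "requests": dict(gpu_resources),
                    "limits": dict(gpu_resources),
                },
                "numProcPerNode": num_proc_per_node,
            },
            "initializer": {
                # The model initializer receives the token as ACCESS_TOKEN.
                "model": {"env": [{"name": "ACCESS_TOKEN", "value": hf_token}]},
            },
        },
    }


trainjob = build_trainjob_manifest(
    job_name="torchtune-llama3-2-1b",
    runtime_name="torchtune-llama3.2-1b",
    hf_token=os.environ.get("HF_TOKEN", "<YOUR_HF_TOKEN_HERE>"),
)
# Print only the non-secret part of the manifest.
print(json.dumps(trainjob["spec"]["trainer"], indent=2))
# To submit, pipe the JSON to kubectl, for example:
# subprocess.run(["kubectl", "apply", "-f", "-"],
#                input=json.dumps(trainjob), text=True, check=True)
```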


## Watch the TrainJob logs

We can use the `get_job_logs()` API to get the TrainJob logs.

### Dataset Initializer

In [38]:
from kubeflow.trainer.constants import constants

log_dict = client.get_job_logs("torchtune-llama3-2-1b", follow=False, step=constants.DATASET_INITIALIZER)
print(log_dict[constants.DATASET_INITIALIZER])

2025-06-15T07:29:54Z INFO     [__main__.py:16] Starting dataset initialization
2025-06-15T07:29:54Z INFO     [huggingface.py:28] Downloading dataset: tatsu-lab/alpaca
2025-06-15T07:29:54Z INFO     [huggingface.py:29] ----------------------------------------
Fetching 3 files: 100%|██████████| 3/3 [00:00<00:00, 926.78it/s]
2025-06-15T07:29:55Z INFO     [huggingface.py:40] Dataset has been downloaded



### Model Initializer

In [39]:
log_dict = client.get_job_logs("torchtune-llama3-2-1b", follow=False, step=constants.MODEL_INITIALIZER)
print(log_dict[constants.MODEL_INITIALIZER])

2025-06-15T07:30:06Z INFO     [__main__.py:16] Starting pre-trained model initialization
2025-06-15T07:30:06Z INFO     [huggingface.py:26] Downloading model: meta-llama/Llama-3.2-1B-Instruct
2025-06-15T07:30:06Z INFO     [huggingface.py:27] ----------------------------------------
Fetching 8 files: 100%|██████████| 8/8 [00:33<00:00,  4.25s/it]
2025-06-15T07:30:41Z INFO     [huggingface.py:43] Model has been downloaded
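
The timestamps in these logs make it easy to see how long an initializer took. The snippet below is a minimal sketch, assuming each log line starts with a UTC timestamp in the format shown above; `elapsed_seconds` is a hypothetical helper, and the sample text is an abridged copy of the model initializer output.

```python
from datetime import datetime


def elapsed_seconds(log_text: str) -> float:
    """Seconds between the first and last timestamped lines in log_text."""
    stamps = []
    for line in log_text.splitlines():
        token = line.split(" ", 1)[0]
        try:
            stamps.append(datetime.strptime(token, "%Y-%m-%dT%H:%M:%SZ"))
        except ValueError:
            # Progress bars and blank lines carry no timestamp; skip them.
            continue
    if len(stamps) < 2:
        raise ValueError("need at least two timestamped lines")
    return (stamps[-1] - stamps[0]).total_seconds()


# Abridged copy of the model initializer log above.
sample = """2025-06-15T07:30:06Z INFO     [__main__.py:16] Starting pre-trained model initialization
Fetching 8 files: 100%| 8/8 [00:33<00:00,  4.25s/it]
2025-06-15T07:30:41Z INFO     [huggingface.py:43] Model has been downloaded"""

print(elapsed_seconds(sample))  # 35.0
```

In the notebook, the same helper could be applied directly to `log_dict[constants.MODEL_INITIALIZER]`.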



### Trainer Node

In [41]:
log_dict = client.get_job_logs("torchtune-llama3-2-1b", follow=False)
print(log_dict[f"{constants.NODE}-0"])

INFO:torchtune.utils._logging:Running FullFinetuneRecipeDistributed with resolved config:

batch_size: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /workspace/model
  checkpoint_files:
  - model.safetensors
  model_type: LLAMA3_2
  output_dir: /workspace/output/model
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
  _component_: torchtune.datasets.instruct_dataset
  data_dir: /workspace/dataset/data
  packed: false
  source: parquet
device: cuda
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 8
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /workspace/output/model/logs
model:
  _component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
  _component_: torch.op