## Phase 2: Distributed Training using Kubeflow Training Operator and SDK

- **kubeflow-training SDK**: PyTorchJob creation and management
- **TRL + PEFT**: Modern fine-tuning with LoRA adapters
- **Distributed Training**: Multi-node GPU coordination 

### Training Configuration using kubeflow-training SDK

In [None]:
%pip install kubernetes yamlmagic

In [1]:
%load_ext yamlmagic

In [None]:
%%yaml training_parameters

# Model configuration
model_name_or_path: ibm-granite/granite-3.1-2b-instruct
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2
use_liger: false

# PEFT / LoRA configuration
use_peft: true
lora_r: 16
lora_alpha: 16  # Changed from 8 to 16 for better scaling
lora_dropout: 0.05
lora_target_modules: ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
lora_modules_to_save: []

# QLoRA (BitsAndBytes)
load_in_4bit: false
load_in_8bit: false

# Dataset configuration (synthetic data from Ray preprocessing)
dataset_path: synthetic_gsm8k
dataset_config: main
dataset_train_split: train
dataset_test_split: test
dataset_text_field: text
dataset_kwargs:
  add_special_tokens: false
  append_concat_token: false

# SFT configuration
max_seq_length: 1024
dataset_batch_size: 1000
packing: false

# Training hyperparameters
num_train_epochs: 3
per_device_train_batch_size: 8
per_device_eval_batch_size: 8
auto_find_batch_size: false
eval_strategy: epoch

# Precision and optimization
bf16: true
tf32: false
learning_rate: 1.0e-4  # Reduced from 2.0e-4 for more stable LoRA training
warmup_steps: 100      # Increased from 10 for better stability
lr_scheduler_type: inverse_sqrt
optim: adamw_torch_fused
max_grad_norm: 1.0
seed: 42

# Gradient settings
gradient_accumulation_steps: 1
gradient_checkpointing: false
gradient_checkpointing_kwargs:
  use_reentrant: false

# FSDP for distributed training
fsdp: "full_shard auto_wrap"
fsdp_config:
  activation_checkpointing: true
  cpu_ram_efficient_loading: false
  sync_module_states: true
  use_orig_params: true
  limit_all_gathers: false

# Checkpointing and logging
save_strategy: epoch
save_total_limit: 1
resume_from_checkpoint: false
log_level: warning
logging_strategy: steps
logging_steps: 10      # Reduced frequency from 1 to 10
report_to:
- tensorboard

output_dir: /shared/models/granite-3.1-2b-instruct-synthetic2
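
A few derived quantities follow from these parameters combined with the job topology used later in this notebook (`num_workers=2`, `num_procs_per_worker=2`). The sketch below is only an illustration and is not part of the training script. The helper name `derived_settings` is hypothetical, and it assumes one process per GPU, that `num_workers` counts every pod in the job, and the standard LoRA scaling factor `lora_alpha / lora_r`.

```python
def derived_settings(params, num_workers, num_procs_per_worker):
    """Compute world size, global batch size, and LoRA scaling (hypothetical helper)."""
    # Assumption: one training process per GPU, num_workers counts all pods.
    world_size = num_workers * num_procs_per_worker
    # Sequences consumed per optimizer step across all processes.
    global_batch_size = (
        params["per_device_train_batch_size"]
        * params["gradient_accumulation_steps"]
        * world_size
    )
    # Standard LoRA scaling applied to the adapter update.
    lora_scaling = params["lora_alpha"] / params["lora_r"]
    return {
        "world_size": world_size,
        "global_batch_size": global_batch_size,
        "lora_scaling": lora_scaling,
    }


# Values copied from the YAML parameters above.
params = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 1,
    "lora_alpha": 16,
    "lora_r": 16,
}
settings = derived_settings(params, num_workers=2, num_procs_per_worker=2)
print(settings)
# → {'world_size': 4, 'global_batch_size': 32, 'lora_scaling': 1.0}
```

Under these assumptions each optimizer step consumes 32 sequences, which is worth keeping in mind when adjusting `learning_rate` or `warmup_steps`.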

### Configure kubeflow-training Client

Set up the kubeflow-training SDK client following the sft.ipynb pattern:


In [3]:
# Configure kubeflow-training client (following sft.ipynb pattern)
from kubernetes import client
from kubeflow.training import TrainingClient
from kubeflow.training.models import V1Volume, V1VolumeMount, V1PersistentVolumeClaimVolumeSource

token = "<auth_token>"
api_server = "<api_server_url>"

configuration = client.Configuration()
configuration.host = api_server
configuration.api_key = {"authorization": f"Bearer {token}"}
# TLS verification is disabled here. Remove the next line unless your cluster
# API server uses a self-signed certificate or an untrusted CA.
configuration.verify_ssl = False

api_client = client.ApiClient(configuration)
training_client = TrainingClient(client_configuration=api_client.configuration)

print("kubeflow-training client configured")

kubeflow-training client configured
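
Hard-coding the token in a notebook makes it easy to leak. As a minimal sketch, the `token` and `api_server` placeholders could instead be read from environment variables. The variable names `K8S_API_SERVER` and `K8S_AUTH_TOKEN` and the helper `load_cluster_credentials` are hypothetical, chosen only for illustration; they are not conventions of the kubeflow-training SDK.

```python
import os


def load_cluster_credentials(env=None):
    """Return (api_server, api_key) read from environment variables.

    K8S_API_SERVER and K8S_AUTH_TOKEN are hypothetical variable names.
    """
    env = os.environ if env is None else env
    missing = [k for k in ("K8S_API_SERVER", "K8S_AUTH_TOKEN") if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Same header shape as the configuration cell above.
    api_key = {"authorization": f"Bearer {env['K8S_AUTH_TOKEN']}"}
    return env["K8S_API_SERVER"], api_key


# Example with an explicit mapping; in the notebook, call it without arguments
# so the values come from os.environ.
api_server, api_key = load_cluster_credentials(
    {"K8S_API_SERVER": "https://api.example.com:6443", "K8S_AUTH_TOKEN": "example-token"}
)
print(api_server)
```

The returned values can then be assigned to `configuration.host` and `configuration.api_key` in the same way as the placeholders.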


### Create Training Job using kubeflow-training SDK

Create and submit the distributed training job, following the sft.ipynb pattern:


In [4]:
from scripts.kft_granite_training import training_func

job = training_client.create_job(
    job_kind="PyTorchJob",
    name="test1-training",
    # Training function imported from scripts/kft_granite_training.py
    train_func=training_func,
    # Pass the parsed YAML parameters to the training function
    parameters=training_parameters,
    # Distributed training configuration
    num_workers=2,
    num_procs_per_worker=2,
    resources_per_worker={
        "nvidia.com/gpu": 2,  # One GPU per process (matches num_procs_per_worker)
        "memory": "24Gi",
        "cpu": 4,
    },
    base_image="quay.io/modh/training:py311-cuda124-torch251",
    # Environment variables for training
    env_vars={
        # HuggingFace configuration - use shared storage
        "HF_HOME": "/shared/huggingface_cache",
        "HF_DATASETS_CACHE": "/shared/huggingface_cache/datasets",
        "TOKENIZERS_PARALLELISM": "false",
        # Training configuration
        "PYTHONUNBUFFERED": "1",
        "NCCL_DEBUG": "INFO",
    },
    # Package dependencies
    packages_to_install=[
        "transformers>=4.36.0",
        "trl>=0.7.0",
        "datasets>=2.14.0",
        "peft>=0.6.0",
        "accelerate>=0.24.0",
        "torch>=2.0.0",
    ],
    # Mount the shared PVC that holds the dataset, HF cache, and model output
    volumes=[
        V1Volume(
            name="shared",
            persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(claim_name="shared")
        ),
    ],
    volume_mounts=[
        V1VolumeMount(name="shared", mount_path="/shared"),
    ],
)

print("PyTorchJob submitted successfully")


PyTorchJob submitted successfully


### Monitor Training Job

Follow the training progress and logs:


In [None]:
# Monitor training job logs (following sft.ipynb pattern)
training_client.get_job_logs(
    name="test1-training",
    job_kind="PyTorchJob",
    follow=True,
)


In [6]:
# Delete the training job
training_client.delete_job("test1-training")
print("PyTorchJob deleted!")

PyTorchJob deleted!
