# Qwen3-0.6B LoRA Fine-tuning with MLflow Tracking

This notebook demonstrates how to fine-tune Qwen3-0.6B with LoRA (Low-Rank Adaptation) on Amazon SageMaker and record the run in a SageMaker managed MLflow tracking server.

## MLflow Integration

MLflow tracking allows you to:
- Track training parameters and hyperparameters
- Log training metrics (loss, evaluation metrics) in real time
- Store model artifacts and configurations
- Compare different training runs
- Visualize training progress through the MLflow UI

## 1. Setup and Import Libraries

In [None]:
import os
import boto3
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput
from datetime import datetime

In [None]:
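# Optional: uncomment to upgrade the SageMaker SDK (needed for the PyTorch 2.7.1 training image)
# and install the MLflow client with the sagemaker-mlflow plugin. Restart the kernel afterwards.
# !pip install --upgrade sagemaker mlflow sagemaker-mlflow --quiet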

## 2. Configure SageMaker Session and Role

In [None]:
# SageMaker session
sagemaker_session = sagemaker.Session()

# IAM role: get_execution_role() works inside SageMaker; elsewhere, set your role ARN explicitly
role = sagemaker.get_execution_role()
print(f"Using SageMaker execution role: {role}")

# S3 Bucket (default bucket)
bucket = sagemaker_session.default_bucket()
prefix = "qwen3-0-6-lora-samples-mlflow"

print(f"Using bucket: {bucket}")
print(f"Using prefix: {prefix}")

## 3. Configure MLflow Tracking Server

Configure the MLflow tracking server to record experiment data. The tracking server stores:
- Training parameters and hyperparameters
- Metrics logged during training
- Model artifacts and configurations

In [None]:
# MLflow tracking server configuration (replace the ARN with that of your own tracking server)
mlflow_tracking_server_arn = "arn:aws:sagemaker:us-east-1:418272795925:mlflow-tracking-server/mlflow-test"
mlflow_experiment_name = "qwen3-lora-training"

print(f"MLflow Tracking Server ARN: {mlflow_tracking_server_arn}")
print(f"MLflow Experiment Name: {mlflow_experiment_name}")
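A typo in the tracking server ARN usually only surfaces once the training job is already running, so it can be worth checking the string up front. The following is a minimal sketch, assuming the ARN layout seen in the cell above (`arn:aws:sagemaker:<region>:<account>:mlflow-tracking-server/<name>`); the helper `parse_tracking_server_arn`, the regular expression, and the placeholder account ID are illustrative and not part of any SDK.

```python
import re
from typing import NamedTuple

# Pattern inferred from the ARN used in this notebook; treat it as an assumption.
_ARN_PATTERN = re.compile(
    r"^arn:(?P<partition>aws[a-z-]*):sagemaker:(?P<region>[a-z0-9-]+):"
    r"(?P<account>\d{12}):mlflow-tracking-server/(?P<name>[A-Za-z0-9-]+)$"
)


class TrackingServerArn(NamedTuple):
    partition: str
    region: str
    account: str
    name: str


def parse_tracking_server_arn(arn: str) -> TrackingServerArn:
    """Split a tracking server ARN into its parts, or raise ValueError."""
    match = _ARN_PATTERN.match(arn)
    if match is None:
        raise ValueError(f"Not a SageMaker MLflow tracking server ARN: {arn!r}")
    return TrackingServerArn(**match.groupdict())


# Placeholder account ID and server name, for illustration only.
parsed = parse_tracking_server_arn(
    "arn:aws:sagemaker:us-east-1:123456789012:mlflow-tracking-server/my-server"
)
print(parsed.region, parsed.name)
```

Comparing `parsed.region` with the region of the SageMaker session is a cheap way to catch a server that lives in a different region than the training job.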

## 4. Upload Local Sample Data to S3

In [None]:
# Upload local train.jsonl data to S3
print("Uploading train.jsonl data to S3...")
train_s3_uri = sagemaker_session.upload_data(
    path='samples/train.jsonl',
    bucket=bucket,
    key_prefix=f'{prefix}/data/train'
)
print(f"Training data uploaded to: {train_s3_uri}")
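Before uploading, it can help to confirm that the file really is JSON Lines, since a malformed line would otherwise only fail inside the training container. This is a rough sketch: `validate_jsonl` is a hypothetical helper, the `text` field in the demo is an assumed schema (the notebook does not show the real one), and the check only verifies that every non-blank line is a JSON object.

```python
import json
import tempfile
from pathlib import Path


def validate_jsonl(path: str) -> int:
    """Return the number of records; raise ValueError on the first bad line."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({err.msg})"
                ) from err
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_number}: expected a JSON object")
            count += 1
    return count


# Demo on a throwaway file; the "text" field is an assumed schema.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "train.jsonl"
    sample.write_text('{"text": "hello"}\n{"text": "world"}\n', encoding="utf-8")
    record_count = validate_jsonl(str(sample))
print(record_count)
```

In the notebook itself, the call would be `validate_jsonl('samples/train.jsonl')` just before `upload_data`.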

## 5. Configure Training Parameters

In [None]:
# Training configuration
exp_name = 'qwen3-0-6b-lora-fine-tuning-mlflow'
instance_type = 'ml.g5.2xlarge'

# Output paths
output_path = f"s3://{bucket}/{prefix}/output"
checkpoint_s3_uri = f"s3://{bucket}/{prefix}/checkpoints"

# Job name based on timestamp
timestamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
job_name = f"{exp_name}-{timestamp}"

print(f"Job name: {job_name}")
print(f"Output path: {output_path}")
print(f"Checkpoint path: {checkpoint_s3_uri}")
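SageMaker training job names are limited to 63 characters and may contain only alphanumerics and hyphens, so a longer `exp_name` would make the submission fail. The sketch below shows one way to guard against that; `make_job_name` is a hypothetical helper, and the trimming strategy (shorten the base, keep the timestamp) is just one reasonable choice.

```python
import re
from datetime import datetime

# 1-63 characters, alphanumerics and hyphens, starting and ending alphanumeric.
_JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def make_job_name(base: str, now: datetime, max_length: int = 63) -> str:
    """Build '<base>-<timestamp>', trimming the base to respect max_length."""
    suffix = now.strftime("%Y-%m-%d-%H-%M-%S")
    # Trim the base rather than the timestamp so names stay unique.
    trimmed = base[: max_length - len(suffix) - 1].rstrip("-")
    name = f"{trimmed}-{suffix}"
    if len(name) > max_length or not _JOB_NAME_PATTERN.match(name):
        raise ValueError(f"Invalid training job name: {name!r}")
    return name


example_name = make_job_name(
    "qwen3-0-6b-lora-fine-tuning-mlflow", datetime(2025, 1, 2, 3, 4, 5)
)
print(example_name, len(example_name))
```

Underscores and dots are rejected by the pattern, which is why the experiment name above spells the model size as `0-6b`.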

## 6. Define Hyperparameters

These hyperparameters control:
- **Model**: Base model and configuration
- **Training**: Learning rate, batch size, epochs, etc.
- **LoRA**: Low-Rank Adaptation parameters for efficient fine-tuning
- **MLflow**: Tracking server and experiment configuration

In [None]:
# Set hyperparameters
hyperparameters = {
    # Model
    "model_name_or_path": "Qwen/Qwen3-0.6B",
    
    # Training (HuggingFace TrainingArguments)
    "output_dir": "/opt/ml/model",
    "num_train_epochs": 3,
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 1,
    "gradient_accumulation_steps": 64,
    "learning_rate": 2e-4,
    "weight_decay": 0.01,
    "warmup_ratio": 0.03,
    "lr_scheduler_type": "cosine",
    "logging_steps": 1,
    "save_steps": 1,
    "save_strategy": "steps",
    "save_total_limit": 3,
    "do_eval": True,
    "eval_strategy": "steps",
    "eval_steps": 50,
    "metric_for_best_model": "eval_loss",
    "greater_is_better": False,
    "load_best_model_at_end": False,
    "report_to": "none",  # We're using MLflow instead
    "bf16": True,
    "gradient_checkpointing": True,
    
    # LoRA
    "lora_r": 4, 
    "lora_alpha": 32,
    "lora_dropout": 0.1,
    "lora_target_modules": "q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj",
    
    # Dataset
    "train_file": "/opt/ml/input/data/train/train.jsonl",
    "validation_split_percentage": 20,
    "block_size": 256,
    
    # MLflow Configuration
    "mlflow_tracking_server_arn": mlflow_tracking_server_arn,
    "mlflow_experiment_name": mlflow_experiment_name,
}

print("Hyperparameters configured successfully")
print(f"\nMLflow tracking enabled: {hyperparameters.get('mlflow_tracking_server_arn') is not None}")
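Several of these values only make sense in combination, so a quick back-of-the-envelope calculation can be useful. The sketch below derives the effective batch size, the LoRA scaling factor (`lora_alpha / lora_r`, the standard LoRA scaling), and an approximate step count. The figure of 800 training examples is purely hypothetical, `num_devices=1` reflects the single GPU on `ml.g5.2xlarge`, and the Trainer's exact rounding of steps can differ between library versions.

```python
import math


def effective_batch_size(
    per_device_batch_size: int, gradient_accumulation_steps: int, num_devices: int = 1
) -> int:
    """Number of examples that contribute to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def lora_scaling(lora_alpha: float, lora_r: int) -> float:
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / lora_r


def approx_schedule(
    num_train_examples: int, batch_size: int, num_epochs: int, warmup_ratio: float
) -> tuple[int, int]:
    """Approximate (total optimizer steps, warmup steps)."""
    steps_per_epoch = math.ceil(num_train_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return total_steps, warmup_steps


batch = effective_batch_size(1, 64)  # values from the hyperparameters above
scale = lora_scaling(32, 4)
# 800 training examples is a made-up dataset size for illustration.
total_steps, warmup_steps = approx_schedule(800, batch, 3, 0.03)
print(batch, scale, total_steps, warmup_steps)
```

With a dataset this small, the total step count would be lower than `eval_steps`, so it is worth running the same arithmetic on the real dataset size before relying on step-based evaluation.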

## 7. Create PyTorch Estimator

**Important**: This estimator uses `train_with_mlflow.py` (in the `src` directory), which includes MLflow integration for tracking experiments.

In [None]:
# PyTorch Estimator with MLflow-enabled training script
estimator = PyTorch(
    entry_point="train_with_mlflow.py",  # MLflow-enabled training script
    source_dir="src",
    role=role,
    instance_type=instance_type,
    instance_count=1,
    framework_version="2.7.1",  
    py_version="py312",  
    hyperparameters=hyperparameters,
    output_path=output_path,
    checkpoint_s3_uri=checkpoint_s3_uri,
    use_spot_instances=False,  
    max_run=24*60*60,  # Maximum 24 hours
    keep_alive_period_in_seconds=1800,
    volume_size=450,
    environment={
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
    },
)
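In script mode, SageMaker passes each entry of `hyperparameters` to the entry point as a `--key value` command-line argument, with every value arriving as a string. The real `train_with_mlflow.py` is not shown in this notebook, so the following is only a sketch of how such a script might parse a few of these values with `argparse`; the selection of arguments and the `str_to_bool` helper are assumptions.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert strings such as 'True' or 'false' to a bool."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"Expected a boolean, got {value!r}")


def split_modules(value: str) -> list[str]:
    """Turn 'q_proj,v_proj' into ['q_proj', 'v_proj']."""
    return [name.strip() for name in value.split(",") if name.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name_or_path", type=str, required=True)
    parser.add_argument("--learning_rate", type=float, default=2e-4)
    parser.add_argument("--lora_r", type=int, default=4)
    parser.add_argument("--bf16", type=str_to_bool, default=False)
    parser.add_argument("--lora_target_modules", type=split_modules, default=[])
    return parser


# Simulated command line, shaped like the arguments the container would receive.
args, unknown = build_parser().parse_known_args([
    "--model_name_or_path", "Qwen/Qwen3-0.6B",
    "--learning_rate", "0.0002",
    "--lora_r", "4",
    "--bf16", "True",
    "--lora_target_modules", "q_proj,v_proj",
    "--block_size", "256",  # not declared above, so it lands in `unknown`
])
print(args.bf16, args.lora_target_modules, unknown)
```

Using `parse_known_args` keeps the sketch tolerant of hyperparameters it does not declare; a script built on Hugging Face's `HfArgumentParser` would handle the `TrainingArguments` fields differently.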

print("PyTorch estimator created successfully")

## 8. Prepare Training Inputs

In [None]:
# Training data input - using train.jsonl
train_input = TrainingInput(
    s3_data=train_s3_uri,
    content_type="application/jsonl",
    s3_data_type="S3Prefix",
    distribution="FullyReplicated"
)

print(f"Training input configured with data from: {train_s3_uri}")
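The channel name used in `fit` determines where the data appears inside the container: a channel called `train` is mounted under `/opt/ml/input/data/train`, which is why the `train_file` hyperparameter points there. The helper below (hypothetical, for illustration) derives that path from a channel name and an S3 URI that refers to a single object; the bucket name is a placeholder.

```python
from pathlib import PurePosixPath

# Root directory where SageMaker places input channels in the container.
INPUT_ROOT = PurePosixPath("/opt/ml/input/data")


def channel_file_path(channel: str, s3_uri: str) -> str:
    """Container path of a single uploaded object within a channel."""
    filename = s3_uri.rstrip("/").rsplit("/", 1)[-1]
    return str(INPUT_ROOT / channel / filename)


container_path = channel_file_path(
    "train",
    "s3://my-bucket/qwen3-0-6-lora-samples-mlflow/data/train/train.jsonl",
)
print(container_path)
```

If the channel is renamed in `fit`, the `train_file` hyperparameter has to change with it.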

## 9. Start Training Job

This will submit the training job to SageMaker. The training script will:
1. Connect to the MLflow tracking server
2. Create a new experiment run
3. Log all hyperparameters
4. Log training and evaluation metrics in real time
5. Save model artifacts to both S3 and MLflow
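The metric-logging step above can be sketched without the training script itself, which this notebook does not show. The Hugging Face Trainer emits log dictionaries that mix numeric values with other entries, while MLflow metrics must be numeric, so a filter is typically needed. In the sketch, `log_numeric_metrics` is a hypothetical helper and `log_fn` stands in for `mlflow.log_metrics`, so the example runs without a tracking server.

```python
from typing import Callable, Mapping


def log_numeric_metrics(
    logs: Mapping[str, object],
    step: int,
    log_fn: Callable[..., None],
) -> dict[str, float]:
    """Forward only numeric entries of a Trainer log dict to `log_fn`."""
    numeric = {
        key: float(value)
        for key, value in logs.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }
    if numeric:
        log_fn(numeric, step=step)
    return numeric


# Stand-in recorder so the sketch runs without MLflow installed.
recorded = []


def fake_log_metrics(metrics, step=None):
    recorded.append((metrics, step))


log_numeric_metrics(
    {"loss": 1.5, "learning_rate": 0.0002, "note": "skipped", "flag": True},
    step=10,
    log_fn=fake_log_metrics,
)
print(recorded)
```

In a real script, this would plausibly be called from a `TrainerCallback.on_log` hook with `log_fn=mlflow.log_metrics` and `step=state.global_step`, inside an active MLflow run.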

In [None]:
# Start training job
print(f"Starting training job: {job_name}")
print(f"Training data: {train_s3_uri}")
print(f"Note: The training script will automatically split train.jsonl - {100-hyperparameters['validation_split_percentage']}% for training, {hyperparameters['validation_split_percentage']}% for validation")
print(f"\nMLflow Experiment: {mlflow_experiment_name}")
print(f"MLflow Tracking Server: {mlflow_tracking_server_arn}")

estimator.fit(
    inputs={
        "train": train_input
    },
    job_name=job_name,
    wait=False  # Asynchronous start
)

print(f"\nTraining job '{job_name}' has been submitted!")
print("You can monitor the job in the SageMaker console")
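The note printed above mentions that the training script splits `train.jsonl` by percentage. How the script does this is not shown here; the following is only a sketch of one deterministic way to perform such a split, with `split_records` and the fixed seed as assumptions.

```python
import random
from typing import Sequence


def split_records(
    records: Sequence, validation_percentage: int, seed: int = 42
) -> tuple[list, list]:
    """Shuffle deterministically and return (train, validation) lists."""
    if not 0 < validation_percentage < 100:
        raise ValueError("validation_percentage must be between 1 and 99")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = len(shuffled) * validation_percentage // 100
    return shuffled[n_validation:], shuffled[:n_validation]


train_records, validation_records = split_records(range(10), 20)
print(len(train_records), len(validation_records))
```

Seeding the shuffle keeps the validation set identical across runs, which makes evaluation losses comparable between experiments.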

## 10. Monitor Training Progress

You can monitor the training job in multiple ways:

### SageMaker Console
View the training job logs and status in the [SageMaker Console](https://console.aws.amazon.com/sagemaker/home#/jobs)

### MLflow UI
Access the MLflow tracking UI through SageMaker Studio to:
- View real-time training metrics
- Compare different runs
- Inspect logged parameters and artifacts

To access MLflow UI:
1. Go to SageMaker Studio
2. Navigate to the MLflow tracking server
3. Open the MLflow UI
4. Find your experiment: `qwen3-lora-training`
5. Click on the run to view details

In [None]:
# Optional: Wait for the training job to complete
# Uncomment the line below if you want to wait for the job to finish
# estimator.latest_training_job.wait()
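As an alternative to blocking on `wait()`, the job status can be polled. The sketch below takes the describe call as a parameter so that it runs without AWS credentials; in practice, `describe` would be `boto3.client("sagemaker").describe_training_job`, whose response carries a `TrainingJobStatus` field. The helper name, the polling interval, and the timeout are assumptions.

```python
import time
from typing import Callable, Mapping

# Statuses after which the job will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(
    describe: Callable[..., Mapping[str, object]],
    job_name: str,
    poll_seconds: float = 60.0,
    timeout_seconds: float = 24 * 60 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status and return that status."""
    waited = 0.0
    while True:
        status = str(describe(TrainingJobName=job_name)["TrainingJobStatus"])
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"{job_name} still {status} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Fake describe call that finishes on the third poll, for demonstration.
_responses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(TrainingJobName):
    return {"TrainingJobStatus": next(_responses)}


final_status = wait_for_training_job(
    fake_describe, "demo-job", poll_seconds=1, sleep=lambda _seconds: None
)
print(final_status)
```

Returning the status instead of raising on `Failed` leaves it to the caller to decide whether to inspect the failure reason or retry.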

## 11. Retrieve Training Job Information

In [None]:
# Get training job name
training_job_name = estimator.latest_training_job.name
print(f"Training Job Name: {training_job_name}")

# Get model artifact location
# Note: This will only work after the training job completes
try:
    model_data = estimator.model_data
    print(f"Model artifacts location: {model_data}")
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    print("Model artifacts will be available after training completes")

## 12. View MLflow Experiment Results

Once the training job starts, its run appears in the MLflow experiment and fills in as training progresses.

### Logged Information

The training script logs the following to MLflow:

**Parameters:**
- Model name and configuration
- LoRA hyperparameters (r, alpha, dropout, target modules)
- Training hyperparameters (learning rate, batch size, epochs, etc.)
- Dataset configuration (block size, split percentage, dataset sizes)

**Metrics:**
- Training loss (per step)
- Evaluation loss (per eval step)
- Learning rate (per step)

**Artifacts:**
- LoRA configuration file
- Model checkpoints
- Training logs

### Querying Runs with the MLflow Client

In [None]:
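import mlflow

# Set tracking URI (resolving an ARN requires the sagemaker-mlflow plugin)
mlflow.set_tracking_uri(mlflow_tracking_server_arn)

# Get experiment; this returns None if the experiment does not exist yet
experiment = mlflow.get_experiment_by_name(mlflow_experiment_name)
if experiment is None:
    print(f"Experiment '{mlflow_experiment_name}' not found yet")
else:
    print(f"Experiment ID: {experiment.experiment_id}")

    # List recent runs (an empty result may lack the tag columns selected below)
    runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id], order_by=["start_time DESC"], max_results=5)
    if runs.empty:
        print("No runs logged yet")
    else:
        print(runs[['run_id', 'start_time', 'status', 'tags.mlflow.runName']])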

## Summary

This notebook demonstrated how to:

1. **Configure MLflow tracking** with a SageMaker managed MLflow tracking server
2. **Launch a training job** that automatically logs experiments to MLflow
3. **Track training progress** through both SageMaker and MLflow interfaces
4. **Log comprehensive metadata** including parameters, metrics, and artifacts

### Key Benefits of MLflow Integration

- **Experiment Tracking**: Automatically track all training runs with parameters and metrics
- **Reproducibility**: All hyperparameters and configurations are logged for easy reproduction
- **Comparison**: Compare different training runs side-by-side in the MLflow UI
- **Collaboration**: Share experiment results with team members through a centralized tracking server
- **Model Registry**: Seamlessly transition to model registration and deployment

### Next Steps

After training completes:
1. Review training metrics in the MLflow UI
2. Compare with other training runs to identify best hyperparameters
3. Register the best model to MLflow Model Registry
4. Deploy the model for inference using SageMaker endpoints

For more information:
- [SageMaker MLflow Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html)
- [MLflow Documentation](https://mlflow.org/docs/latest/index.html)
- [SageMaker Training Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/train-model.html)