## Fine-Tuning and Evaluating LLMs with SageMaker Pipelines and MLflow

### 1. Setup and Dependencies

In [1]:
!pip install sagemaker==2.225.0  datasets==2.18.0 transformers==4.40.0 mlflow==2.13.2 sagemaker-mlflow==0.1.0 --quiet

In [2]:
%load_ext autoreload
%autoreload 2

**Importing Libraries and Setting Up Environment**

This cell imports the required Python modules: the SageMaker classes used to build and run the pipeline, and the user-defined step functions `preprocess_job_data`, `launch_hf_training_job`, and `evaluate_model` from the local `steps` package.

In [3]:
import sagemaker
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.workflow.function_step import step

# from steps.finetune_llama8b_hf import finetune_llama8b
# from steps.preprocess_llama3 import preprocess
# from steps.evaluation_mlflow import evaluation
from steps.finetune_llama3_classifier import launch_hf_training_job
from steps.evaluation_classifier import evaluate_model
from steps.preprocess_job_descriptions import preprocess_job_data

from steps.utils import create_training_job_name
import os

# os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = os.getcwd()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


2025-06-05 20:58:55.503015: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:datasets:PyTorch version 2.7.0+cu118 available.
INFO:datasets:TensorFlow version 2.12.1 available.
INFO:datasets:JAX version 0.4.20 available.


### 2. SageMaker Session and IAM Role

`get_execution_role()`: Retrieves the IAM role that SageMaker will use to access AWS resources. This role needs appropriate permissions for tasks like accessing S3 buckets and creating SageMaker resources.

In [4]:
import boto3

try:
    role = sagemaker.get_execution_role()
    print(role)
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session()

arn:aws:iam::174671970284:role/service-role/AmazonSageMaker-ExecutionRole-20240216T153805


### 3. Configuration

**Training Configuration**

The `train_config` dictionary includes:

- Experiment naming for tracking purposes
- Model specifications (ID, version, name)
- Infrastructure details (instance types and counts for fine-tuning and deployment)
- Training hyperparameters (epochs, batch size)

This configuration allows for easy adjustment of the training process without changing the core pipeline code.

In [5]:
train_config = {
    "experiment_name": "all_target_modules_1K",
    "model_id": "meta-llama/Meta-Llama-3-8B",
    "model_version": "3.0.2",
    "model_name": "llama-3-8b",
    "endpoint_name": "llama-3-8b",
    "finetune_instance_type": "ml.g5.12xlarge",
    "finetune_num_instances": 1,
    "instance_type": "ml.g5.12xlarge",
    "num_instances": 1,
    "epoch": 1,
    "per_device_train_batch_size": 4,
}
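Because the whole training setup lives in one dictionary, a typo in a key or a wrong value type only surfaces once a remote job is already running. A minimal sketch of a local sanity check follows; `validate_train_config` and the `REQUIRED_KEYS` table are hypothetical helpers, not part of the `steps` package.

```python
# Hypothetical helper: checks a train_config-style dict before the pipeline is built.
REQUIRED_KEYS = {
    "experiment_name": str,
    "model_id": str,
    "finetune_instance_type": str,
    "finetune_num_instances": int,
    "epoch": int,
    "per_device_train_batch_size": int,
}


def validate_train_config(config):
    """Return a list of problems; an empty list means the config passed these checks."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected_type):
            problems.append(f"{key} should be of type {expected_type.__name__}")
    # SageMaker instance type names start with "ml." (for example "ml.g5.12xlarge").
    for key in ("finetune_instance_type", "instance_type"):
        value = config.get(key)
        if isinstance(value, str) and not value.startswith("ml."):
            problems.append(f"{key} does not look like a SageMaker instance type: {value}")
    return problems


example_config = {
    "experiment_name": "all_target_modules_1K",
    "model_id": "meta-llama/Meta-Llama-3-8B",
    "finetune_instance_type": "ml.g5.12xlarge",
    "finetune_num_instances": 1,
    "instance_type": "ml.g5.12xlarge",
    "epoch": 1,
    "per_device_train_batch_size": 4,
}
print(validate_train_config(example_config))
```

Running the check in the notebook before `upsert` keeps configuration mistakes out of the remote jobs.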

**LoRA Parameters**

Low-Rank Adaptation (LoRA) fine-tunes a large language model by training small low-rank update matrices while the base weights stay frozen. `lora_r` sets the rank of those matrices, `lora_alpha` scales their contribution, and `lora_dropout` applies dropout to the LoRA layers; together they control the trade-off between model quality and compute cost.

In [6]:
lora_params = {"lora_r": 8, "lora_alpha": 16, "lora_dropout": 0.05}
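To make the trade-off concrete: for a weight matrix of shape `d_out x d_in`, LoRA trains two factors of shapes `d_out x r` and `r x d_in`, so it adds `r * (d_in + d_out)` trainable parameters, and the update is scaled by `lora_alpha / lora_r`. The sketch below works this out for a single 4096 x 4096 projection; the matrix size is an illustrative assumption, and the helper names are hypothetical.

```python
def lora_trainable_params(d_out, d_in, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, lora_r):
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / lora_r


d_out = d_in = 4096  # illustrative projection size, assumed for this example
full = d_out * d_in  # 16,777,216 weights in the frozen matrix

for r, alpha in [(8, 16), (32, 64)]:
    added = lora_trainable_params(d_out, d_in, r)
    print(
        f"r={r}: {added} trainable params "
        f"({100 * added / full:.2f}% of the matrix), scaling={lora_scaling(alpha, r)}"
    )
```

With these numbers, rank 8 adds 65,536 parameters per matrix and rank 32 adds 262,144, while the scaling factor is 2.0 in both cases.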

### 4. MLflow Setup

MLflow handles experiment tracking: each pipeline step logs its parameters, metrics, and artifacts to a central tracking server.

`mlflow_arn`: The ARN of the MLflow tracking server. You can copy it from the SageMaker Studio UI.

`experiment_name`: The MLflow experiment under which the runs are grouped. Choose a descriptive name.

In [7]:
# mlflow_arn = "<MLflow_tracking_server_ARN>"  # fill MLflow tracking server ARN
# experiment_name = "sm-pipelines-finetuning-eval"
mlflow_arn = "arn:aws:sagemaker:us-east-1:174671970284:mlflow-tracking-server/mlflow-d-8mkvrvo3fobb-27-10-47-37" # <--- REPLACE THIS
experiment_name = "JobDescriptionClassification-Llama3-FineTuning"
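A malformed or half-pasted ARN is an easy mistake when replacing the placeholder. The following sketch checks the shape seen above (`arn:<partition>:sagemaker:<region>:<account>:mlflow-tracking-server/<name>`) and splits it into its parts. `parse_tracking_server_arn` is a hypothetical helper, and the ARN in the example is made up.

```python
def parse_tracking_server_arn(arn):
    """Split a SageMaker MLflow tracking server ARN into its components."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sagemaker":
        raise ValueError(f"not a SageMaker ARN: {arn!r}")
    resource_type, _, server_name = parts[5].partition("/")
    if resource_type != "mlflow-tracking-server" or not server_name:
        raise ValueError(f"not an MLflow tracking server ARN: {arn!r}")
    return {
        "partition": parts[1],
        "region": parts[3],
        "account_id": parts[4],
        "server_name": server_name,
    }


# Made-up ARN used only for illustration.
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:mlflow-tracking-server/my-server"
print(parse_tracking_server_arn(example_arn))
```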

### 5. Dataset Configuration

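This section sets the pipeline name, the S3 prefixes, and the location of the raw job-description data used for fine-tuning and evaluation. The `HuggingFaceH4/no_robots` dataset appears only in the commented-out legacy cells further down.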

In [8]:
pipeline_name = "JobDescClassification-Llama3-Pipeline-V5" 
base_job_prefix = "job-desc-classify"

processed_data_s3_prefix = f"{base_job_prefix}/processed_data/v3"

default_raw_data_s3_uri = "s3://sagemaker-us-east-1-174671970284/raw_job_data/poc_multilingual_set_20250604_214156/raw_jds_translated_v2.jsonl"
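Steps that read the raw data usually need the bucket and the object key separately. A minimal sketch using only the standard library is shown below; `split_s3_uri` is a hypothetical helper and the URI in the example is a placeholder.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Return (bucket, key) for an s3://bucket/key URI."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder URI used only for illustration.
bucket, key = split_s3_uri("s3://my-bucket/raw_job_data/sample/raw_jds.jsonl")
print(bucket)
print(key)
```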

### 6. Pipeline Steps

This section defines the core components of the SageMaker pipeline.

In [9]:
from sagemaker.workflow.parameters import ParameterString
import json

In [10]:
lora_config = ParameterString(name="lora_config", default_value=json.dumps(lora_params))
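A `ParameterString` can only carry a string, which is why the LoRA dictionary is serialized with `json.dumps`. The step that receives it has to decode it again. The sketch below shows one way that decoding might look; `decode_lora_config` is a hypothetical helper, since the body of the fine-tuning function is not shown in this notebook.

```python
import json

EXPECTED_LORA_KEYS = {"lora_r", "lora_alpha", "lora_dropout"}


def decode_lora_config(raw):
    """Turn the JSON string received by a step back into typed LoRA settings."""
    params = json.loads(raw)
    missing = EXPECTED_LORA_KEYS - params.keys()
    if missing:
        raise KeyError(f"lora_config is missing: {sorted(missing)}")
    return {
        "lora_r": int(params["lora_r"]),
        "lora_alpha": int(params["lora_alpha"]),
        "lora_dropout": float(params["lora_dropout"]),
    }


encoded = json.dumps({"lora_r": 8, "lora_alpha": 16, "lora_dropout": 0.05})
print(decode_lora_config(encoded))
```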

**Preprocessing Step**

This step prepares the training and evaluation data and logs the resulting datasets to MLflow.
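The body of `preprocess_job_data` lives in the `steps` package and is not shown here. As a rough illustration of the kind of work such a step does, the sketch below filters JSONL records, caps the number of samples, and produces a reproducible train/validation split. The function name, the 20% validation fraction, and the output field names are assumptions made for this example.

```python
import json
import random


def split_records(
    jsonl_lines,
    text_key="description",
    label_key="job_category",
    validation_fraction=0.2,
    max_samples_per_split=1000,
    seed=42,
):
    """Parse JSONL lines, drop incomplete rows, and return (train, validation)."""
    records = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        # Keep only rows that have both a description and a category.
        if row.get(text_key) and row.get(label_key):
            records.append({"text": row[text_key], "label": row[label_key]})

    # A seeded shuffle keeps the split reproducible across runs.
    random.Random(seed).shuffle(records)
    n_validation = int(len(records) * validation_fraction)
    validation = records[:n_validation][:max_samples_per_split]
    train = records[n_validation:][:max_samples_per_split]
    return train, validation


sample_lines = [
    json.dumps({"description": f"Job text {i}", "job_category": "engineering"})
    for i in range(10)
]
train, validation = split_records(sample_lines)
print(len(train), len(validation))
```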

In [11]:
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.workflow.function_step import step
from sagemaker.workflow.parameters import ParameterString # If needed for other params

# Assuming preprocess_job_descriptions.py is in a 'steps' directory
from steps.preprocess_job_descriptions import preprocess_job_data

# Define parameters for the preprocess step
s3_bucket_name = sagemaker.Session().default_bucket() # or your specific bucket

output_s3_prefix_jobs = "processed_data/my_job_class_experiment"

# Create the preprocessing step using the imported function
preprocess_jobs_step = step(
    preprocess_job_data,
    # instance_type="ml.g5.12xlarge",
    instance_type="ml.m5.large",
    name="PreprocessJobDescriptions" # SageMaker step name
)(
    raw_dataset_identifier=default_raw_data_s3_uri,
    s3_output_bucket=s3_bucket_name,
    s3_output_prefix=output_s3_prefix_jobs,
    job_desc_column="description", # Example: if your column is named 'description'
    category_column="job_category", # Example: if your column is named 'job_category'
    max_samples_per_split=1000, # Optional: for faster testing
    mlflow_arn=mlflow_arn,       # Your MLflow tracking server ARN
    experiment_name=experiment_name, # Your MLflow experiment name for preprocessing
    run_name=ExecutionVariables.PIPELINE_EXECUTION_ID, # Links MLflow run to pipeline execution
)

# Example: Define a pipeline with just this step
pipeline_name_jobs = "JobDescPreprocessPipeline"
job_pipeline = Pipeline(
    name=pipeline_name_jobs,
    steps=[preprocess_jobs_step],
    # parameters=[...] # if you have pipeline-level parameters
    sagemaker_session=sess,
    # instance_type="ml.g5.12xlarge",
    # volume_size=100,
)

# Create or update the pipeline definition; execution is started in a later cell
job_pipeline.upsert(role_arn=role)


2025-06-05 20:59:01,376 sagemaker.remote_function INFO     Uploading serialized function code to s3://sagemaker-us-east-1-174671970284/JobDescPreprocessPipeline/PreprocessJobDescriptions/2025-06-05-20-59-01-160/function
2025-06-05 20:59:01,426 sagemaker.remote_function INFO     Uploading serialized function arguments to s3://sagemaker-us-east-1-174671970284/JobDescPreprocessPipeline/PreprocessJobDescriptions/2025-06-05-20-59-01-160/arguments
2025-06-05 20:59:01,853 sagemaker.remote_function INFO     Uploading serialized function code to s3://sagemaker-us-east-1-174671970284/JobDescPreprocessPipeline/PreprocessJobDescriptions/2025-06-05-20-59-01-853/function
2025-06-05 20:59:01,913 sagemaker.remote_function INFO     Uploading serialized function arguments to s3://sagemaker-us-east-1-174671970284/JobDescPreprocessPipeline/PreprocessJobDescriptions/2025-06-05-20-59-01-853/arguments


{'PipelineArn': 'arn:aws:sagemaker:us-east-1:174671970284:pipeline/JobDescPreprocessPipeline',
 'ResponseMetadata': {'RequestId': 'c1f02539-6866-4ee3-bae6-46cfa82b304c',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'c1f02539-6866-4ee3-bae6-46cfa82b304c',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '93',
   'date': 'Thu, 05 Jun 2025 20:59:02 GMT'},
  'RetryAttempts': 0}}

In [12]:
sagemaker.image_uris.get_base_python_image_uri('us-east-1', py_version='310')

'081325390199.dkr.ecr.us-east-1.amazonaws.com/sagemaker-base-python-310:1.0'

In [13]:
job_execution = job_pipeline.start()

In [14]:
# A @step call returns a DelayedReturn, which has no `.properties`; index into it instead.
# Assumes preprocess_job_data returns a dict with these keys.
training_data_path = preprocess_jobs_step["train_data_s3_path"]
validation_data_path = preprocess_jobs_step["validation_data_s3_path"]
mlflow_run_id_preprocess = preprocess_jobs_step["mlflow_run_id"]

In [None]:
# pipeline_name = "fmops-training-evaulation-pipeline-mlflow"

# default_bucket = sagemaker.Session().default_bucket()
# main_data_path = f"s3://{default_bucket}"
# evaluation_data_path = (
#     main_data_path
#     + "/datasets/hf_no_robots/evaluation/automatic_small/dataset_evaluation_small.jsonl"
# )
# output_data_path = main_data_path + "/datasets/hf_no_robots/output_" + pipeline_name

# # You can add your own evaluation dataset code into this step
# preprocess_step_ret = step(preprocess, name="preprocess")(
#     default_bucket,
#     dataset_name,
#     train_sample=100,
#     eval_sample=100,
#     mlflow_arn=mlflow_arn,
#     experiment_name=experiment_name,
#     run_name=ExecutionVariables.PIPELINE_EXECUTION_ID,
# )

# print("The pipeline name is " + pipeline_name)
# # Mark the name of this bucket for reviewing the artifacts generated by this pipeline at the end of the execution
# print("Output S3 bucket: " + output_data_path)

**Fine-tuning Step**

This is where the actual model adaptation occurs. The step takes the preprocessed data and applies it to fine-tune the base LLM (in this case, a Llama model). It incorporates the LoRA technique for efficient adaptation.

In [None]:
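# Uses the imported launch_hf_training_job and the preprocessing output defined above.
# The positional argument order is carried over from the earlier version of this call
# and is an assumption; adjust it to match launch_hf_training_job's signature.
finetune_ret = step(launch_hf_training_job, name="finetune_llama8b_instruction")(
    preprocess_jobs_step,
    train_config,
    lora_config,
    role,
    mlflow_arn,
    experiment_name,
    ExecutionVariables.PIPELINE_EXECUTION_ID,
)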

**Evaluation Step**

After fine-tuning, this step assesses the model's performance. It uses MLflow's built-in evaluation functions to compute metrics such as toxicity and exact_match, then logs the results to MLflow.

In [None]:
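# Uses the imported evaluate_model and the preprocessing output defined above.
# The argument order is carried over from the earlier version of this call and is
# an assumption; adjust it to match evaluate_model's signature.
evaluate_finetuned_llama7b_instruction_mlflow = step(
    evaluate_model,
    name="evaluate_finetuned_llama8b_instr",
    # keep_alive_period_in_seconds=1200,
    instance_type="ml.g5.12xlarge",
    volume_size=100,
)(train_config, preprocess_jobs_step, finetune_ret, mlflow_arn, experiment_name, "")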
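For a classifier, an exact-match score is the fraction of predictions that equal their target label, which is the same as accuracy. The plain-Python sketch below shows what that score and macro-averaged F1 measure. It is only an illustration with made-up labels, not the code inside the evaluation step, and macro F1 is an extra metric added here for comparison.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the target label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[t] += 1  # true class gets a false negative
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        denominator = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for illustration.
y_true = ["engineering", "engineering", "sales", "hr"]
y_pred = ["engineering", "sales", "sales", "hr"]
print(accuracy(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 4))
```

Macro F1 weights every category equally, so it exposes weak performance on rare job categories that accuracy alone can hide.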

### 7. Pipeline Creation and Execution

This final section brings all the components together into an executable pipeline.

**Creating the Pipeline**

The pipeline object is created with all defined steps. The lora_config is passed as a parameter, allowing for easy modification of LoRA settings between runs.

In [None]:
from sagemaker import get_execution_role

pipeline = Pipeline(
    name=pipeline_name,
    steps=[evaluate_finetuned_llama7b_instruction_mlflow],
    parameters=[lora_config],
)
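Only the final evaluation step is listed in `steps`. For steps built with `step(...)`, the SDK can discover upstream steps from the outputs passed between them, so the preprocessing and fine-tuning steps are pulled in through their data dependencies. The sketch below mimics that resolution with the standard library. The step names and the dependency table are a simplified model of this notebook's pipeline, not SDK internals.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it consumes.
dependencies = {
    "preprocess": set(),
    "finetune": {"preprocess"},
    "evaluate": {"preprocess", "finetune"},
}


def execution_order(graph, leaf):
    """Collect everything the leaf step depends on and order it for execution."""
    needed = set()
    stack = [leaf]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(graph[node])
    subgraph = {node: graph[node] for node in needed}
    return list(TopologicalSorter(subgraph).static_order())


print(execution_order(dependencies, "evaluate"))
```

Listing only the leaf step keeps the pipeline definition short, and the dependency graph still guarantees that preprocessing runs before fine-tuning and fine-tuning before evaluation.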

**Upserting the Pipeline**

This step either creates a new pipeline in SageMaker or updates an existing one with the same name. It's a key part of the MLOps process, allowing for iterative refinement of the pipeline.

In [None]:
pipeline.upsert(role)

**Starting the Pipeline Execution**

This command kicks off the actual execution of the pipeline in SageMaker. From this point, SageMaker will orchestrate the execution of each step, managing resources and data flow between steps.

In [None]:
execution1 = pipeline.start()

Now let's run another experiment with a different LoRA configuration.

In [None]:
lora_params_2 = {"lora_r": 32, "lora_alpha": 64, "lora_dropout": 0.05}

In [None]:
execution2 = pipeline.start(
    parameters={
        "lora_config": json.dumps(lora_params_2),
    }
)
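Both configurations used in this notebook set `lora_alpha` to twice `lora_r`. If more ranks are to be compared, the JSON payloads can be generated instead of written by hand. The sketch below does that; `lora_sweep` and its default `alpha_multiplier` are hypothetical and simply follow the pattern of the two configurations above.

```python
import json


def lora_sweep(ranks, alpha_multiplier=2, dropout=0.05):
    """Build one JSON-encoded LoRA configuration per rank."""
    return [
        json.dumps(
            {
                "lora_r": r,
                "lora_alpha": r * alpha_multiplier,
                "lora_dropout": dropout,
            }
        )
        for r in ranks
    ]


for payload in lora_sweep([8, 16, 32]):
    print(payload)
```

Each string has the same form as the value passed for `"lora_config"` in the `pipeline.start(parameters=...)` call above, so one execution per payload could be started in a loop.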

### 8. Clean Up

In [None]:
sagemaker_client = boto3.client("sagemaker")
response = sagemaker_client.delete_pipeline(
    PipelineName=pipeline_name,
)