## (Nick) Fine-Tuning and Evaluating LLMs with SageMaker Pipelines and MLflow

### 1. Setup and Dependencies

In [1]:
!pip install sagemaker==2.225.0  datasets==2.18.0 transformers==4.40.0 mlflow==2.13.2 sagemaker-mlflow==0.1.0 protobuf==3.20.3 --quiet

In [2]:
%load_ext autoreload
%autoreload 2

**Importing Libraries and Setting Up Environment**

This part imports the required Python modules: the SageMaker classes used to build and run the pipeline, plus the user-defined step functions `finetune_classifier_hf` and `preprocess_job_data` from the local `steps` package.

In [3]:
import sys
print(f"Python version: {sys.version}")
print(f"Python version info: {sys.version_info}")

Python version: 3.12.9 | packaged by conda-forge | (main, Feb 14 2025, 08:00:06) [GCC 13.3.0]
Python version info: sys.version_info(major=3, minor=12, micro=9, releaselevel='final', serial=0)


In [4]:
import sagemaker
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.workflow.function_step import step

# from steps.finetune_llama8b_hf import finetune_llama8b
# from steps.preprocess_llama3 import preprocess
# from steps.evaluation_mlflow import evaluation
from steps.finetune_llama3_classifier import finetune_classifier_hf
# from steps.evaluation_classifier import evaluate_model
from steps.preprocess_job_descriptions import preprocess_job_data

from steps.utils import create_training_job_name
import os

os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = os.getcwd()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


INFO:datasets:PyTorch version 2.5.1 available.
INFO:datasets:TensorFlow version 2.18.0 available.


In [5]:
sagemaker.image_uris.get_base_python_image_uri('us-east-1', py_version='312')

'081325390199.dkr.ecr.us-east-1.amazonaws.com/sagemaker-base-python-312:1.0'

### 2. SageMaker Session and IAM Role

`get_execution_role()`: Retrieves the IAM role that SageMaker will use to access AWS resources. This role needs appropriate permissions for tasks like accessing S3 buckets and creating SageMaker resources.

In [6]:
import boto3

try:
    role = sagemaker.get_execution_role()
    print(role)
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session()

sagemaker.config INFO - Fetched defaults config from location: /home/sagemaker-user/job-classification-sagemaker-mlflow
arn:aws:iam::174671970284:role/service-role/AmazonSageMaker-ExecutionRole-20240216T153805


### 3. Configuration

**Training Configuration**

The `train_config` dictionary covers:

- Experiment naming for tracking purposes
- Model specifications (ID, version, name)
- Infrastructure details (instance types and counts for fine-tuning and deployment)
- Training hyperparameters (learning rate, epochs, batch size, gradient accumulation, sequence length)
- Progressive logging settings (sample limits, logging and evaluation intervals)

Keeping these values in one place lets you adjust the training process without changing the core pipeline code.

In [72]:
train_config = {
    "experiment_name": "progressive_metrics_test",
    "model_id": "meta-llama/Meta-Llama-3-8B",
    "model_version": "3.0.2",
    "model_name": "llama-3-8b",
    "endpoint_name": "llama-3-8b",
    "finetune_instance_type": "ml.g5.24xlarge",
    "source_directory": "scripts/python",  # Make sure this contains your utils folder
    "entry_point_script": "finetune_entrypoint.py",
    "merge_weights": True,
    "finetune_num_instances": 1,
    "instance_type": "ml.g5.12xlarge",
    "num_instances": 1,
    "learning_rate": 2e-4,
    "epoch": 2,  # 2 epochs for better curves
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "max_seq_length": 512,
    
    # Progressive logging settings
    "limit_train_samples": 500,   # More samples for longer training
    "limit_eval_samples": 50,     # More eval samples
    "logging_steps": 3,           # Log every 3 steps
    "eval_steps": 6,              # Evaluate every 6 steps
    
    "bf16": True,
    "gradient_checkpointing": True,
    "hf_token": ""
}
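
To see what the progressive logging settings imply, the following sketch estimates how many optimizer steps, log events, and evaluations a run with these values would produce. It assumes a single data-parallel process (`num_devices=1` is a hypothetical knob; the fine-tuning script is not shown here) and floor rounding of steps per epoch, so treat the numbers as approximate rather than as what the trainer will report.

```python
def estimate_schedule(num_samples, per_device_batch, grad_accum, epochs,
                      logging_steps, eval_steps, num_devices=1):
    """Rough estimate of optimizer steps and logging/eval events.

    num_devices is a hypothetical parameter for data-parallel training, and
    the floor rounding is a simplification that may differ from the trainer's.
    """
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = max(num_samples // effective_batch, 1)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch_size": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "log_events": total_steps // logging_steps,
        "eval_events": total_steps // eval_steps,
    }


# Values copied from train_config above.
schedule = estimate_schedule(
    num_samples=500, per_device_batch=1, grad_accum=8, epochs=2,
    logging_steps=3, eval_steps=6,
)
print(schedule)
```

With only a few dozen evaluation points per run, raising `limit_train_samples` or lowering `eval_steps` is the simplest way to get smoother curves.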

**LoRA Parameters**

Low-Rank Adaptation (LoRA) is an efficient fine-tuning technique for large language models: it freezes the base weights and trains small low-rank update matrices instead. `lora_r` sets the rank of those matrices, `lora_alpha` scales the update, and `lora_dropout` applies dropout on the LoRA path. Together they govern the trade-off between model performance and computational efficiency.

In [54]:
lora_params = {"lora_r": 8, "lora_alpha": 16, "lora_dropout": 0.05, "merge_weights": True}
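
As a rough illustration of what these values mean, the sketch below computes the standard LoRA scaling factor (`alpha / r`) and the number of trainable parameters LoRA adds to a single linear layer. The 4096 x 4096 layer size is an illustrative assumption, and `lora_stats` is a hypothetical helper, not part of this notebook's `steps` package.

```python
def lora_stats(d_in, d_out, r, alpha):
    """Return LoRA scaling and parameter counts for one linear layer.

    LoRA adds two low-rank matrices, A (r x d_in) and B (d_out x r), and
    scales their product by alpha / r.
    """
    full_params = d_in * d_out
    lora_trainable = r * (d_in + d_out)
    return {
        "scaling": alpha / r,
        "full_params": full_params,
        "lora_params": lora_trainable,
        "fraction": lora_trainable / full_params,
    }


# 4096 x 4096 is an illustrative projection size (an assumption).
stats = lora_stats(d_in=4096, d_out=4096, r=8, alpha=16)
print(stats)
```

Doubling `lora_r` doubles the trainable parameters per adapted layer; keeping `lora_alpha` at twice the rank keeps the scaling factor unchanged.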

### 4. MLflow Setup

MLflow integration is crucial for experiment tracking and management.

`mlflow_arn`: The ARN of the MLflow tracking server, which you can copy from the SageMaker Studio UI. The pipeline logs metrics, parameters, and artifacts to this server.

`experiment_name`: The MLflow experiment under which runs are grouped. Choose a descriptive name.

In [55]:
# mlflow_arn = "<MLflow_tracking_server_ARN>"  # fill MLflow tracking server ARN
# experiment_name = "sm-pipelines-finetuning-eval"
mlflow_arn = "arn:aws:sagemaker:us-east-1:174671970284:mlflow-tracking-server/mlflow-d-8mkvrvo3fobb-27-10-47-37" # <--- REPLACE THIS
experiment_name = "JobDescriptionClassification-Llama3-FineTuning"
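
Because a leftover placeholder ARN only fails once a remote step tries to log, it can help to check the value up front. The sketch below is a hypothetical helper whose pattern is inferred from the ARN shape shown above (standard `aws` partition only); it is not an official specification, and the example ARN uses a made-up account ID and server name.

```python
import re

# Pattern inferred from the ARN format shown above (an assumption).
_ARN_RE = re.compile(
    r"^arn:aws:sagemaker:(?P<region>[a-z0-9-]+):(?P<account>\d{12}):"
    r"mlflow-tracking-server/(?P<name>[A-Za-z0-9-]+)$"
)


def parse_tracking_server_arn(arn):
    """Split a tracking server ARN into region, account, and server name."""
    match = _ARN_RE.match(arn)
    if match is None:
        raise ValueError(f"Not an MLflow tracking server ARN: {arn!r}")
    return match.groupdict()


# Made-up account ID and server name, for illustration only.
example = parse_tracking_server_arn(
    "arn:aws:sagemaker:us-east-1:123456789012:mlflow-tracking-server/my-server"
)
print(example)
```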

### 5. Dataset Configuration

For fine-tuning and evaluation, the cell below points to a JSONL file of translated job descriptions in S3 (`default_raw_data_s3_uri`) rather than the `HuggingFaceH4/no_robots` dataset.

In [56]:
pipeline_name = "JobDescClassification-Llama3-Pipeline-V5" 
base_job_prefix = "job-desc-classify"

processed_data_s3_prefix = f"{base_job_prefix}/processed_data/v3"

# default_raw_data_s3_uri = "s3://sagemaker-us-east-1-174671970284/raw_job_data/poc_multilingual_set_20250604_214156/raw_jds_translated_v2.jsonl"
default_raw_data_s3_uri = "s3://sagemaker-us-east-1-174671970284/raw_job_description_data/v2_translated/raw_jds_translated_v2.jsonl"


In [None]:
# STANDALONE EVALUATION - Use this to evaluate an existing fine-tuned model

# Step 1: Define your model and data paths
# Replace these with your actual S3 paths
model_s3_path = "s3://sagemaker-us-east-1-174671970284/huggingface-qlora-1-8-2025-06-09-13-13--2025-06-09-13-13-39-015/output/"  # From your fine-tuning job
test_data_s3_path = "s3://sagemaker-us-east-1-174671970284/processed_data/my_job_class_experiment/test/test_dataset.jsonl"  # Your test data
poc_categories_s3_path = "s3://sagemaker-us-east-1-174671970284/processed_data/my_job_class_experiment/poc_categories.json"  # Your categories

# Step 2: Import and create evaluation step
from steps.evaluation_classifier import evaluate_model
from sagemaker.workflow.function_step import step

# Create standalone evaluation step
standalone_evaluate_step = step(
    evaluate_model,
    instance_type="ml.g5.24xlarge",  # You can use smaller instance
    name="StandaloneEvaluation"
)(
    model_s3_path_or_mlflow_uri=model_s3_path,
    test_data_s3_path=test_data_s3_path,
    poc_categories_s3_path=poc_categories_s3_path,
    batch_size=4,  # Smaller batch for memory efficiency
    mlflow_arn=mlflow_arn,
    experiment_name=experiment_name,
    run_id=None  # Will create new MLflow run
)

# Step 3: Create and run standalone pipeline
standalone_pipeline_name = "StandaloneEvaluationPipeline"
standalone_pipeline = Pipeline(
    name=standalone_pipeline_name,
    steps=[standalone_evaluate_step],
    sagemaker_session=sess,
)

standalone_pipeline.upsert(role_arn=role)
standalone_execution = standalone_pipeline.start()

print(f"Standalone evaluation started: {standalone_execution.describe()}")

# Wait for completion and get results
# standalone_execution.wait()
# print("Standalone evaluation completed!")

# Alternative: Run evaluation locally (if you have the model downloaded)
# This would run evaluation on your notebook instance instead of SageMaker
"""
# LOCAL EVALUATION EXAMPLE (uncomment to use)
import sys
sys.path.append('steps')  # Add steps to path
from evaluate_classifier import evaluate_model

# Run evaluation locally
local_results = evaluate_model(
    model_s3_path_or_mlflow_uri="s3://your-model-path/",
    test_data_s3_path="s3://your-test-data-path/test_dataset.jsonl",
    poc_categories_s3_path="s3://your-categories-path/poc_categories.json",
    batch_size=2,  # Smaller batch for local execution
    mlflow_arn=mlflow_arn,
    experiment_name=experiment_name,
    run_id=None
)

print(f"Local evaluation results: {local_results}")
"""
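
The implementation of `evaluate_model` is not shown in this notebook. As a point of reference, the sketch below computes two metrics commonly reported for a classifier, accuracy and macro-averaged F1, in plain Python. The function name and the category labels are hypothetical, and the real step may report different metrics.

```python
from collections import Counter


def accuracy_and_macro_f1(y_true, y_pred):
    """Compute accuracy and macro-averaged F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, sum(f1_scores) / len(labels)


# Made-up labels, for illustration only.
y_true = ["eng", "eng", "sales", "sales", "hr"]
y_pred = ["eng", "sales", "sales", "sales", "hr"]
acc, macro_f1 = accuracy_and_macro_f1(y_true, y_pred)
print(acc, macro_f1)
```

Macro F1 weights every category equally, which matters when some job categories have far fewer test samples than others.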

### 6. Pipeline Steps

This section defines the core components of the SageMaker pipeline.

In [57]:
from sagemaker.workflow.parameters import ParameterString
import json

In [63]:
lora_config = ParameterString(name="lora_config", default_value=json.dumps(lora_params))

**Preprocessing Step**

This step prepares the data for training and evaluation and logs the resulting datasets to MLflow.

In [68]:
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.workflow.function_step import step
from sagemaker.workflow.parameters import ParameterString # If needed for other params

# Assuming preprocess_job_descriptions.py is in a 'steps' directory
from steps.preprocess_job_descriptions import preprocess_job_data

# Define parameters for the preprocess step
s3_bucket_name = sagemaker.Session().default_bucket() # or your specific bucket

output_s3_prefix_jobs = "processed_data/my_job_class_experiment"

# Create the preprocessing step using the imported function
preprocess_jobs_step = step(
    preprocess_job_data,
    # instance_type="ml.g5.12xlarge",
    instance_type="ml.m5.large",
    name="PreprocessJobDescriptions" # SageMaker step name
)(
    raw_dataset_identifier=default_raw_data_s3_uri,
    s3_output_bucket=s3_bucket_name,
    s3_output_prefix=output_s3_prefix_jobs,
    job_desc_column="job_description_text", # Column in the raw data that holds the job description text
    category_column="category_label", # Column in the raw data that holds the category label
    max_samples_per_split=1000, # Optional: for faster testing
    mlflow_arn=mlflow_arn,       # Your MLflow tracking server ARN
    experiment_name=experiment_name, # Your MLflow experiment name for preprocessing
    run_name=ExecutionVariables.PIPELINE_EXECUTION_ID, # Links MLflow run to pipeline execution
)

print("The pipeline name is " + pipeline_name)
# Note the bucket in s3_bucket_name: the artifacts generated by this pipeline can be reviewed there after the execution


The pipeline name is JobDescClassification-Llama3-Pipeline-V5
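
The body of `preprocess_job_data` lives in the `steps` package and is not shown here. The sketch below is a hypothetical approximation of its first stage, under the assumption that it reads JSONL records, keeps the two columns named in the step arguments, caps the sample count, and derives the category list. The sample rows are made up.

```python
import json


def load_and_cap(jsonl_lines, text_col, label_col, max_samples=None):
    """Parse JSONL rows, drop rows missing either column, and cap the count."""
    records = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        if row.get(text_col) and row.get(label_col):
            records.append({"text": row[text_col], "label": row[label_col]})
    if max_samples is not None:
        records = records[:max_samples]
    categories = sorted({record["label"] for record in records})
    return records, categories


# Made-up rows using the column names passed to the preprocessing step.
raw_lines = [
    '{"job_description_text": "Build data pipelines", "category_label": "Engineering"}',
    '{"job_description_text": "Manage client accounts", "category_label": "Sales"}',
    '{"job_description_text": "", "category_label": "Sales"}',
    "",
]
records, categories = load_and_cap(
    raw_lines, "job_description_text", "category_label", max_samples=1000
)
print(len(records), categories)
```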


In [69]:
from steps.finetune_llama3_classifier import finetune_classifier_hf

# Create the fine-tuning step
finetune_job_classifier_step = step(
    finetune_classifier_hf,
    name="FineTuneJobClassifier"
)(
    preprocess_step_output=preprocess_jobs_step, # Pass the entire output dict
    train_config=train_config,
    lora_config=lora_config,
    role=role,
    mlflow_arn=mlflow_arn,
    experiment_name=experiment_name
)

**Preprocess and Fine-Tune Pipeline**

In [70]:
# Define a pipeline with the preprocessing and fine-tuning steps
pipeline_name_jobs = "JobDescPreprocessFineTunePipeline"

job_pipeline = Pipeline(
    name=pipeline_name_jobs,
    steps=[preprocess_jobs_step, finetune_job_classifier_step],
    parameters=[lora_config], # if you have pipeline-level parameters
    sagemaker_session=sess,
)

# Upsert and run the pipeline
job_pipeline.upsert(role_arn=role)

sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.Dependencies
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.IncludeLocalWorkDir
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.CustomFileFilter.IgnoreNamePatterns


2025-06-09 17:56:41,108 sagemaker.remote_function INFO     Uploading serialized function code to s3://sagemaker-us-east-1-174671970284/JobDescPreprocessFineTunePipeline/PreprocessJobDescriptions/2025-06-09-17-56-39-566/function
2025-06-09 17:56:41,199 sagemaker.remote_function INFO     Uploading serialized function arguments to s3://sagemaker-us-east-1-174671970284/JobDescPreprocessFineTunePipeline/PreprocessJobDescriptions/2025-06-09-17-56-39-566/arguments
2025-06-09 17:56:41,425 sagemaker.remote_function INFO     Copied dependencies file at './requirements.txt' to '/tmp/tmp742kbk89/requirements.txt'
2025-06-09 17:56:41,454 sagemaker.remote_function INFO     Successfully uploaded dependencies and pre execution scripts to 's3://sagemaker-us-east-1-174671970284/JobDescPreprocessFineTunePipeline/PreprocessJobDescriptions/2025-06-09-17-56-39-566/pre_exec_script_and_dependencies'
2025-06-09 17:56:41,464 sagemaker.remote_function INFO     Copied user workspace to '/tmp/tmpi8mz1uog/temp_work

sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.Dependencies
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.IncludeLocalWorkDir
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.CustomFileFilter.IgnoreNamePatterns
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.InstanceType


2025-06-09 17:56:43,171 sagemaker.remote_function INFO     Uploading serialized function code to s3://sagemaker-us-east-1-174671970284/JobDescPreprocessFineTunePipeline/FineTuneJobClassifier/2025-06-09-17-56-39-566/function
2025-06-09 17:56:43,223 sagemaker.remote_function INFO     Uploading serialized function arguments to s3://sagemaker-us-east-1-174671970284/JobDescPreprocessFineTunePipeline/FineTuneJobClassifier/2025-06-09-17-56-39-566/arguments
2025-06-09 17:56:43,296 sagemaker.remote_function INFO     Copied dependencies file at './requirements.txt' to '/tmp/tmpuwk4yui2/requirements.txt'
2025-06-09 17:56:43,326 sagemaker.remote_function INFO     Successfully uploaded dependencies and pre execution scripts to 's3://sagemaker-us-east-1-174671970284/JobDescPreprocessFineTunePipeline/FineTuneJobClassifier/2025-06-09-17-56-39-566/pre_exec_script_and_dependencies'
2025-06-09 17:56:43,771 sagemaker.remote_function INFO     Uploading serialized function code to s3://sagemaker-us-east-1-1

{'PipelineArn': 'arn:aws:sagemaker:us-east-1:174671970284:pipeline/JobDescPreprocessFineTunePipeline',
 'ResponseMetadata': {'RequestId': 'c2d03cf7-e263-4fda-821a-06be3994c6f7',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'c2d03cf7-e263-4fda-821a-06be3994c6f7',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '101',
   'date': 'Mon, 09 Jun 2025 17:56:44 GMT'},
  'RetryAttempts': 0}}

In [71]:
execution = job_pipeline.start()
print(f"Pipeline execution started: {execution.describe()}")

Pipeline execution started: {'PipelineArn': 'arn:aws:sagemaker:us-east-1:174671970284:pipeline/JobDescPreprocessFineTunePipeline', 'PipelineExecutionArn': 'arn:aws:sagemaker:us-east-1:174671970284:pipeline/JobDescPreprocessFineTunePipeline/execution/wei8jkm553r4', 'PipelineExecutionDisplayName': 'execution-1749491806927', 'PipelineExecutionStatus': 'Executing', 'CreationTime': datetime.datetime(2025, 6, 9, 17, 56, 46, 829000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2025, 6, 9, 17, 56, 46, 829000, tzinfo=tzlocal()), 'CreatedBy': {'UserProfileArn': 'arn:aws:sagemaker:us-east-1:174671970284:user-profile/d-8mkvrvo3fobb/default-20240216t153804', 'UserProfileName': 'default-20240216t153804', 'DomainId': 'd-8mkvrvo3fobb', 'IamIdentity': {'Arn': 'arn:aws:sts::174671970284:assumed-role/AmazonSageMaker-ExecutionRole-20240216T153805/SageMaker', 'PrincipalId': 'AROASRK2CX7WPM2ML6UZA:SageMaker'}}, 'LastModifiedBy': {'UserProfileArn': 'arn:aws:sagemaker:us-east-1:174671970284:user-p

**Full Pipeline (Preprocess, Fine-Tune, Evaluate)**

In [None]:
from steps.evaluate_classifier import evaluate_model

# Create evaluation step that takes fine-tuning output
evaluate_classifier_step = step(
    evaluate_model,
    instance_type="ml.g5.xlarge",  # Smaller instance for evaluation
    name="EvaluateJobClassifier"
)(
    # Use output from fine-tuning step
    model_s3_path_or_mlflow_uri=finetune_job_classifier_step["s3_model_artifacts"],
    # Use test data from preprocessing step  
    test_data_s3_path=preprocess_jobs_step["test_data_s3_path"],
    # Use categories from preprocessing step
    poc_categories_s3_path=preprocess_jobs_step["poc_categories_s3_path"],
    batch_size=4,
    mlflow_arn=mlflow_arn,
    experiment_name=experiment_name,
    run_id=finetune_job_classifier_step["mlflow_run_id"]
)

In [None]:

pipeline_name_jobs = "JobDescPreprocessFineTuneEvalPipeline"
job_pipeline = Pipeline(
    name=pipeline_name_jobs,
    steps=[
        preprocess_jobs_step, 
        finetune_job_classifier_step,
        evaluate_classifier_step  # NEW: Added evaluation step
    ],
    parameters=[lora_config],
    sagemaker_session=sess,
)

In [None]:
# Upsert and run the pipeline
job_pipeline.upsert(role_arn=role)

print(f"Pipeline '{pipeline_name_jobs}' created with preprocessing, fine-tuning, and evaluation steps!")

FINISH

In [None]:
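# Cell 6: Monitor Pipeline (Optional)
# Start the three-step pipeline upserted above. The earlier `execution` belongs to
# the two-step pipeline, which has no EvaluateJobClassifier step.
execution = job_pipeline.start()
execution.wait()
print("Pipeline execution completed!")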

# Get evaluation results
steps_result = execution.list_steps()
for step_info in steps_result:
    if step_info['StepName'] == 'EvaluateJobClassifier':
        print(f"Evaluation step status: {step_info['StepStatus']}")
        break
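
When a step fails, it is useful to see every step's status and the failure reason in one place. The sketch below assumes each entry returned by `execution.list_steps()` is a dict with `StepName`, `StepStatus`, and an optional `FailureReason`. `summarize_steps` is a hypothetical helper, and the sample data is hand-written rather than taken from a real run.

```python
def summarize_steps(steps):
    """Map each step name to its status and collect failure reasons."""
    statuses = {step["StepName"]: step["StepStatus"] for step in steps}
    failures = {
        step["StepName"]: step.get("FailureReason", "no reason reported")
        for step in steps
        if step["StepStatus"] == "Failed"
    }
    return statuses, failures


# Hand-written sample in the assumed shape of list_steps() output.
sample_steps = [
    {"StepName": "EvaluateJobClassifier", "StepStatus": "Failed",
     "FailureReason": "example failure message"},
    {"StepName": "FineTuneJobClassifier", "StepStatus": "Succeeded"},
    {"StepName": "PreprocessJobDescriptions", "StepStatus": "Succeeded"},
]
statuses, failures = summarize_steps(sample_steps)
print(statuses)
print(failures)
```

In the notebook, the same helper would be called as `summarize_steps(execution.list_steps())`.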

In [None]:
preprocess_jobs_step["train_data_s3_path"]

In [None]:
# Outputs of a step created with step(...) are referenced by indexing its return value,
# as in the evaluation step above:
training_data_path = preprocess_jobs_step["train_data_s3_path"]
validation_data_path = preprocess_jobs_step["validation_data_s3_path"]
mlflow_run_id_preprocess = preprocess_jobs_step["mlflow_run_id"]

### 7. Pipeline Creation and Execution

This final section brings all the components together into an executable pipeline.

**Creating the Pipeline**

The pipeline object is created with all defined steps. The lora_config is passed as a parameter, allowing for easy modification of LoRA settings between runs.

In [None]:
pipeline = Pipeline(
    name=pipeline_name,
    # Preprocessing, fine-tuning, and evaluation steps defined in section 6
    steps=[preprocess_jobs_step, finetune_job_classifier_step, evaluate_classifier_step],
    parameters=[lora_config],
    sagemaker_session=sess,
)

**Upserting the Pipeline**

This step either creates a new pipeline in SageMaker or updates an existing one with the same name. It's a key part of the MLOps process, allowing for iterative refinement of the pipeline.

In [None]:
pipeline.upsert(role)

**Starting the Pipeline Execution**

This command kicks off the actual execution of the pipeline in SageMaker. From this point, SageMaker will orchestrate the execution of each step, managing resources and data flow between steps.

In [None]:
execution1 = pipeline.start()

Now let's run another experiment with a different LoRA configuration.
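
Since `lora_config` is a pipeline `ParameterString` holding a JSON string, a second experiment can override it when the execution starts instead of redefining the pipeline. The sketch below builds such an override. `lora_override` is a hypothetical helper, and the new rank and alpha values are illustrative, not a recommendation.

```python
import json


def lora_override(base_params, **changes):
    """Return pipeline start parameters with an updated lora_config JSON string."""
    updated = {**base_params, **changes}
    return {"lora_config": json.dumps(updated)}


# Same defaults as lora_params above.
base = {"lora_r": 8, "lora_alpha": 16, "lora_dropout": 0.05, "merge_weights": True}

# Illustrative values for the second experiment.
start_parameters = lora_override(base, lora_r=16, lora_alpha=32)
print(start_parameters)

# With a live pipeline object, the override would be passed at start time:
# execution2 = pipeline.start(parameters=start_parameters)
```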

### 8. Clean Up

In [None]:
sagemaker_client = boto3.client("sagemaker")
response = sagemaker_client.delete_pipeline(
    PipelineName=pipeline_name,
)

In [17]:
# METHOD 1: Get model path from completed fine-tuning job
import boto3

def get_model_artifacts_from_training_job(training_job_name):
    """Get S3 model artifacts path from a training job name"""
    sm_client = boto3.client('sagemaker')
    
    response = sm_client.describe_training_job(TrainingJobName=training_job_name)
    model_artifacts_s3_path = response['ModelArtifacts']['S3ModelArtifacts']
    
    print(f"Model artifacts S3 path: {model_artifacts_s3_path}")
    return model_artifacts_s3_path

# Use this if you know your training job name
training_job_name = "huggingface-qlora-1-8-2025-06-09-13-13--2025-06-09-13-13-39-015"  # Replace with actual name
model_s3_path = get_model_artifacts_from_training_job(training_job_name)



Model artifacts S3 path: s3://sagemaker-us-east-1-174671970284/huggingface-qlora-1-8-2025-06-09-13-13--2025-06-09-13-13-39-015/output/model.tar.gz


In [24]:
# Add this to your notebook to check the S3 path first
import boto3

def check_s3_path(s3_path):
    """Check what's actually in the S3 path"""
    s3 = boto3.client('s3')
    
    # Parse S3 path
    bucket_name = s3_path.replace('s3://', '').split('/')[0]
    prefix = '/'.join(s3_path.replace('s3://', '').split('/')[1:])
    
    print(f"Bucket: {bucket_name}")
    print(f"Prefix: {prefix}")
    
    try:
        paginator = s3.get_paginator('list_objects_v2')
        objects = []
        
        for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
            if 'Contents' in page:
                for obj in page['Contents']:
                    objects.append(obj['Key'])
        
        print(f"\nFound {len(objects)} objects:")
        for obj in objects[:10]:  # Show first 10
            print(f"  {obj}")
        
        if len(objects) > 10:
            print(f"  ... and {len(objects) - 10} more")
        
        # Check for required model files
        required_files = ['config.json', 'pytorch_model.bin', 'tokenizer.json']
        missing_files = []
        
        for req_file in required_files:
            if not any(req_file in obj for obj in objects):
                missing_files.append(req_file)
        
        if missing_files:
            print(f"\nWARNING: Missing required files: {missing_files}")
        else:
            print(f"\n✓ All required model files found!")
            
        return len(objects) > 0
        
    except Exception as e:
        print(f"Error checking S3 path: {e}")
        return False

# Replace with your actual model S3 path
model_s3_path = "s3://sagemaker-us-east-1-174671970284/huggingface-qlora-1-8-2025-06-09-13-13--2025-06-09-13-13-39-015/output/"  # UPDATE THIS
check_s3_path(model_s3_path)

Bucket: sagemaker-us-east-1-174671970284
Prefix: huggingface-qlora-1-8-2025-06-09-13-13--2025-06-09-13-13-39-015/output/

Found 1 objects:
  huggingface-qlora-1-8-2025-06-09-13-13--2025-06-09-13-13-39-015/output/model.tar.gz



True
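
The listing above shows a single `model.tar.gz` object, so any model files are inside the archive and cannot be found by matching S3 key names. The sketch below lists the members of a gzip-compressed tar archive and reports which expected files are missing. It builds a tiny in-memory archive as a stand-in for the downloaded file, and the expected file names are assumptions about what a Hugging Face model directory might contain.

```python
import io
import tarfile


def list_tar_members(fileobj):
    """Return the file names stored in a gzip-compressed tar archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


def missing_files(members, required):
    """Return the required file names that no archive member ends with."""
    return [name for name in required
            if not any(member.endswith(name) for member in members)]


# Build a tiny in-memory archive to stand in for a downloaded model.tar.gz.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    for name in ("config.json", "tokenizer.json"):
        payload = b"{}"
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

members = list_tar_members(buffer)
missing = missing_files(members, ["config.json", "model.safetensors", "tokenizer.json"])
print(members)
print(missing)
```

For a real artifact, download `model.tar.gz` first (for example with boto3's `download_file`) and pass the opened file to `list_tar_members`.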