# ML Engineer Core Workflow with MLFlow and Model Registry

This notebook lets ML Engineers execute and manage YOLOv11 training pipelines on SageMaker, with MLFlow experiment tracking and SageMaker Model Registry integration.

## Workflow Overview

1. **Pipeline Configuration**: Set up YOLOv11 training pipeline parameters
2. **Pipeline Execution**: Execute the training pipeline with MLFlow tracking
3. **Pipeline Monitoring**: Monitor training progress and results
4. **Model Registration**: Register trained models in SageMaker Model Registry
5. **Model Management**: Manage model versions and approval workflows

## Prerequisites

- AWS account with appropriate permissions
- AWS CLI configured with "ab" profile
- SageMaker Studio access with ML Engineer role
- Access to the drone imagery dataset in S3 bucket: `lucaskle-ab3-project-pv`
- Labeled data in YOLOv11 format
- SageMaker managed MLFlow tracking server

Let's start by importing the necessary libraries and setting up our environment.
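
Before installing anything, it can help to see which of the required packages are already present in the kernel. The snippet below is a minimal sketch that uses only the standard library; the `REQUIRED` list simply mirrors the packages named in the install cell, and `installed_versions` is an illustrative helper name, not part of any library.

```python
from importlib import metadata

# Distribution names taken from the install cell in this notebook
REQUIRED = [
    "mlflow", "requests-auth-aws-sigv4", "boto3", "sagemaker",
    "pandas", "matplotlib", "numpy", "PyYAML",
]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'not installed'}")
```

This only reports what is installed; it does not compare against the minimum versions, which `pip` resolves itself.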

In [None]:
# Install required packages
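# Version specifiers are quoted so the shell does not treat ">=" as output redirection
!pip install --quiet "mlflow>=3.0.0" "requests-auth-aws-sigv4>=0.7" "boto3>=1.28.0" "sagemaker>=2.190.0" "pandas>=2.0.0" "matplotlib>=3.7.0" "numpy>=1.24.0" "PyYAML>=6.0"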

print("✅ Required packages installed successfully!")

In [None]:
import os
import boto3
import sagemaker
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import json
import time
from IPython.display import display, HTML
import mlflow
import mlflow.sagemaker
# Model Registry calls below go through the boto3 SageMaker client (sagemaker_client)

# Set up AWS session with "ab" profile
session = boto3.Session(profile_name='ab')
sagemaker_session = sagemaker.Session(boto_session=session)
sagemaker_client = session.client('sagemaker')
region = session.region_name
account_id = session.client('sts').get_caller_identity()['Account']

# Set up MLFlow tracking with SageMaker managed server
# Download the SageMaker MLFlow helper
s3_client = session.client('s3')  # S3 client used to download the helper
try:
    s3_client.download_file(
        'lucaskle-ab3-project-pv', 
        'mlflow-sagemaker/utils/sagemaker_mlflow_helper.py', 
        'sagemaker_mlflow_helper.py'
    )
    
    # Import the helper
    from sagemaker_mlflow_helper import get_sagemaker_mlflow_helper
    
    # Initialize SageMaker MLflow helper
    mlflow_helper = get_sagemaker_mlflow_helper(aws_profile='ab')
    
    # Get server info
    server_info = mlflow_helper.get_tracking_server_info()
    mlflow_tracking_uri = server_info.get('url', 'https://t-2vktx6phiclp.us-east-1.experiments.sagemaker.aws')
    
    print(f"✅ Connected to SageMaker managed MLflow server")
    print(f"Server Status: {server_info.get('status', 'Unknown')}")
    print(f"MLflow Version: {server_info.get('mlflow_version', 'Unknown')}")
    
    # Create experiment using helper
    experiment_name = "yolov11-drone-detection"
    mlflow_helper.create_experiment(experiment_name)
    
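except Exception as e:
    print(f"⚠️  Could not connect to SageMaker managed MLflow: {e}")
    print("Using basic MLflow setup as fallback")
    mlflow_helper = None  # lets the logging helpers fall back to plain MLflow
    experiment_name = "yolov11-drone-detection"
    # Point MLflow at the local store before selecting the experiment
    mlflow_tracking_uri = "file:///tmp/mlruns"
    mlflow.set_tracking_uri(mlflow_tracking_uri)
    mlflow.set_experiment(experiment_name)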

# Set up visualization
plt.rcParams["figure.figsize"] = (12, 6)

# Define bucket name and role
BUCKET_NAME = 'lucaskle-ab3-project-pv'
ROLE_ARN = sagemaker_session.get_caller_identity_arn()

# Model Registry configuration
MODEL_PACKAGE_GROUP_NAME = "yolov11-drone-detection-models"

print(f"Data Bucket: {BUCKET_NAME}")
print(f"Region: {region}")
print(f"Account ID: {account_id}")
print(f"Role ARN: {ROLE_ARN}")
print(f"MLFlow Experiment: {experiment_name}")
print(f"MLFlow Tracking URI: {mlflow_tracking_uri}")
print(f"Model Package Group: {MODEL_PACKAGE_GROUP_NAME}")

# Helper functions for MLflow logging (works with both managed and direct MLflow)
def _get_mlflow_helper():
    """Return the SageMaker MLflow helper if one was initialised, else None"""
    # Inside a function, locals() only contains that function's own variables,
    # so the notebook-level helper has to be looked up in globals()
    return globals().get('mlflow_helper')

def log_params(params_dict):
    """Log parameters using available MLflow method"""
    helper = _get_mlflow_helper()
    if helper:
        helper.log_params(params_dict)
    else:
        mlflow.log_params(params_dict)

def log_metrics(metrics_dict, step=None):
    """Log metrics using available MLflow method"""
    helper = _get_mlflow_helper()
    if helper:
        helper.log_metrics(metrics_dict, step=step)
    else:
        for key, value in metrics_dict.items():
            mlflow.log_metric(key, value, step=step)

def log_artifact(local_path, artifact_path=None):
    """Log artifact using available MLflow method"""
    helper = _get_mlflow_helper()
    if helper:
        helper.log_artifact(local_path, artifact_path)
    else:
        mlflow.log_artifact(local_path, artifact_path)

def start_run(run_name=None, experiment_name=None, tags=None):
    """Start MLflow run using available method"""
    helper = _get_mlflow_helper()
    if helper:
        return helper.start_run(run_name=run_name, experiment_name=experiment_name, tags=tags)
    else:
        return mlflow.start_run(run_name=run_name, tags=tags)

print("✅ MLflow helper functions loaded")

## 1. Setup Model Registry

First, let's create the Model Package Group in SageMaker Model Registry if it doesn't exist.

In [None]:
# Function to create model package group
def create_model_package_group(group_name, description="YOLOv11 drone detection models"):
    """Create a model package group in SageMaker Model Registry"""
    try:
        # Check if group already exists
        response = sagemaker_client.describe_model_package_group(
            ModelPackageGroupName=group_name
        )
        print(f"Model package group '{group_name}' already exists.")
        print(f"Status: {response['ModelPackageGroupStatus']}")
        return response
    except sagemaker_client.exceptions.ClientError as e:
        if e.response['Error']['Code'] == 'ValidationException':
            # Group doesn't exist, create it
            print(f"Creating model package group: {group_name}")
            response = sagemaker_client.create_model_package_group(
                ModelPackageGroupName=group_name,
                ModelPackageGroupDescription=description
            )
            print(f"Created model package group: {response['ModelPackageGroupArn']}")
            return response
        else:
            # Re-raise any other error unchanged
            raise

# Create model package group
model_package_group = create_model_package_group(MODEL_PACKAGE_GROUP_NAME)

## 2. Pipeline Configuration

Let's configure our YOLOv11 training pipeline parameters with MLFlow tracking.
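
A quick sanity check on the parameter dictionary can catch mistakes before a job is submitted. The sketch below assumes the same keys as the `training_params` dictionary defined in this section; `validate_training_params` is a hypothetical helper, and the rules are a small illustrative subset. The spot rule reflects SageMaker's requirement that the maximum wait time is at least the maximum run time when managed spot training is used.

```python
def validate_training_params(params):
    """Return a list of problems found; an empty list means the params look usable."""
    problems = []

    # Counts and sizes must be positive integers
    for key in ("batch_size", "epochs", "image_size", "instance_count"):
        if int(params[key]) < 1:
            problems.append(f"{key} must be a positive integer, got {params[key]}")

    if float(params["learning_rate"]) <= 0:
        problems.append("learning_rate must be greater than 0")

    # With managed spot training, max_wait covers both waiting and running
    if params.get("use_spot") and params["max_wait"] < params["max_run"]:
        problems.append("max_wait must be >= max_run when use_spot is True")

    if not str(params["output_path"]).startswith("s3://"):
        problems.append("output_path must be an s3:// URI")

    return problems


example = {
    "batch_size": 16, "epochs": 50, "image_size": 640, "instance_count": 1,
    "learning_rate": 0.001, "use_spot": True, "max_wait": 36000, "max_run": 3600,
    "output_path": "s3://example-bucket/model-artifacts/",
}
print(validate_training_params(example) or "No problems found")
```

Returning a list rather than raising on the first problem lets the notebook show every issue in one pass.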

In [None]:
# Function to list available datasets
def list_datasets(bucket, prefix="datasets/"):
    """List available datasets in S3"""
    s3_client = session.client('s3')
    response = s3_client.list_objects_v2(
        Bucket=bucket,
        Prefix=prefix,
        Delimiter='/'
    )
    
    datasets = []
    if 'CommonPrefixes' in response:
        for obj in response['CommonPrefixes']:
            dataset_prefix = obj['Prefix']
            dataset_name = dataset_prefix.split('/')[-2]
            datasets.append({
                'name': dataset_name,
                'prefix': dataset_prefix
            })
    
    return datasets

# List available datasets
datasets = list_datasets(BUCKET_NAME)

print(f"Found {len(datasets)} datasets:")
for i, dataset in enumerate(datasets):
    print(f"  {i+1}. {dataset['name']} - s3://{BUCKET_NAME}/{dataset['prefix']}")

# If no datasets found, provide instructions
if not datasets:
    print("\nNo datasets found. Please prepare a dataset using the Data Scientist notebook first.")
    print("The dataset should be organized in the following structure:")
    print("s3://lucaskle-ab3-project-pv/datasets/your_dataset_name/")
    print("├── train/")
    print("│   ├── images/")
    print("│   └── labels/")
    print("└── val/")
    print("    ├── images/")
    print("    └── labels/")

In [None]:
# Define training parameters with MLFlow tracking
training_params = {
    # Dataset parameters
    'dataset_name': datasets[0]['name'] if datasets else 'your_dataset_name',
    'dataset_prefix': datasets[0]['prefix'] if datasets else 'datasets/your_dataset_name/',
    
    # Model parameters
    'model_variant': 'yolov11n',  # Options: yolov11n, yolov11s, yolov11m, yolov11l, yolov11x
    'image_size': 640,  # Input image size (px)
    
    # Training parameters
    'batch_size': 16,
    'epochs': 50,
    'learning_rate': 0.001,
    
    # Infrastructure parameters
    'instance_type': 'ml.g4dn.xlarge',
    'instance_count': 1,
    'use_spot': True,
    'max_wait': 36000,  # Max wait time for spot instances (seconds)
    'max_run': 3600,    # Max run time (seconds)
    
    # Output parameters
    'output_path': f"s3://{BUCKET_NAME}/model-artifacts/",
    'job_name': f"yolov11-training-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}",
    
    # MLFlow parameters
    'experiment_name': experiment_name,
    'run_name': f"yolov11-run-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
}

# Display training parameters
print("YOLOv11 Training Parameters:")
for key, value in training_params.items():
    print(f"  {key}: {value}")

## 3. Pipeline Execution with MLFlow Tracking

Now let's execute the YOLOv11 training pipeline with comprehensive MLFlow tracking.
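
Two details of job submission are easy to get wrong: SageMaker training job names may only contain alphanumerics and hyphens (at most 63 characters), and hyperparameter values are passed to the container as strings. The following is a minimal sketch of both steps; `make_job_name` and `to_hyperparameters` are illustrative names, not SageMaker SDK functions.

```python
import re
from datetime import datetime, timezone

# Name rules for SageMaker training jobs: alphanumerics and hyphens, max 63 chars
_JOB_NAME_RE = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")
_MAX_JOB_NAME_LENGTH = 63

def make_job_name(prefix, now=None):
    """Build a timestamped job name that satisfies the naming rules."""
    now = now or datetime.now(timezone.utc)
    suffix = now.strftime("%Y-%m-%d-%H-%M-%S")
    # Replace any run of disallowed characters with a single hyphen
    cleaned = re.sub(r"[^a-zA-Z0-9-]+", "-", prefix).strip("-")
    # Leave room for the hyphen and the timestamp suffix
    cleaned = cleaned[: _MAX_JOB_NAME_LENGTH - len(suffix) - 1].rstrip("-")
    name = f"{cleaned}-{suffix}"
    if len(name) > _MAX_JOB_NAME_LENGTH or not _JOB_NAME_RE.match(name):
        raise ValueError(f"Cannot build a valid job name from prefix {prefix!r}")
    return name

def to_hyperparameters(params, keys):
    """Stringify the selected parameters for use as SageMaker hyperparameters."""
    return {key: str(params[key]) for key in keys}


print(make_job_name("yolov11_training", datetime(2024, 1, 2, 3, 4, 5)))
print(to_hyperparameters({"epochs": 50, "learning_rate": 0.001}, ["epochs", "learning_rate"]))
```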

In [None]:
# Function to create and execute training job with MLFlow tracking
def execute_training_job_with_mlflow(params):
    """Create and execute SageMaker training job for YOLOv11 with MLFlow tracking"""
    
    # Start MLFlow run
    with mlflow.start_run(run_name=params['run_name']) as run:
        # Log parameters to MLFlow
        mlflow.log_param("model_variant", params['model_variant'])
        mlflow.log_param("image_size", params['image_size'])
        mlflow.log_param("batch_size", params['batch_size'])
        mlflow.log_param("epochs", params['epochs'])
        mlflow.log_param("learning_rate", params['learning_rate'])
        mlflow.log_param("instance_type", params['instance_type'])
        mlflow.log_param("instance_count", params['instance_count'])
        mlflow.log_param("use_spot", params['use_spot'])
        mlflow.log_param("dataset_name", params['dataset_name'])
        mlflow.log_param("dataset_prefix", params['dataset_prefix'])
        
        # Define hyperparameters for SageMaker
        hyperparameters = {
            "model_variant": params['model_variant'],
            "image_size": str(params['image_size']),
            "batch_size": str(params['batch_size']),
            "epochs": str(params['epochs']),
            "learning_rate": str(params['learning_rate']),
            "mlflow_run_id": run.info.run_id,
            "mlflow_experiment_id": run.info.experiment_id
        }
        
        # Define input data channels
        input_data = {
            'training': f"s3://{BUCKET_NAME}/{params['dataset_prefix']}"
        }
        
        # Create SageMaker estimator
        estimator = sagemaker.estimator.Estimator(
            image_uri=f"{account_id}.dkr.ecr.{region}.amazonaws.com/yolov11-training:latest",
            role=ROLE_ARN,
            instance_count=params['instance_count'],
            instance_type=params['instance_type'],
            hyperparameters=hyperparameters,
            output_path=params['output_path'],
            sagemaker_session=sagemaker_session,
            use_spot_instances=params['use_spot'],
            max_wait=params['max_wait'] if params['use_spot'] else None,
            max_run=params['max_run']
        )
        
        # Start training job
        print(f"Starting training job: {params['job_name']}")
        print(f"MLFlow Run ID: {run.info.run_id}")
        
        # Log training job details to MLFlow
        mlflow.log_param("sagemaker_job_name", params['job_name'])
        mlflow.log_param("output_path", params['output_path'])
        
        # Start the training job
        estimator.fit(input_data, job_name=params['job_name'], wait=False)
        
        # Log additional metadata
        mlflow.set_tag("sagemaker_job_name", params['job_name'])
        mlflow.set_tag("model_type", "YOLOv11")
        mlflow.set_tag("task_type", "object_detection")
        mlflow.set_tag("dataset", params['dataset_name'])
        
        return params['job_name'], run.info.run_id

# Execute training job with MLFlow tracking
try:
    job_name, mlflow_run_id = execute_training_job_with_mlflow(training_params)
    print(f"\nTraining job started: {job_name}")
    print(f"MLFlow Run ID: {mlflow_run_id}")
    print(f"You can monitor the job in the SageMaker console or using the cell below.")
except Exception as e:
    print(f"Error starting training job: {str(e)}")
    print("\nPossible causes:")
    print("1. The dataset doesn't exist or has incorrect structure")
    print("2. The YOLOv11 training container doesn't exist in ECR")
    print("3. Insufficient permissions to start training job")
    print("\nPlease check the error message and try again.")

## 4. Pipeline Monitoring with Enhanced Metrics

Let's monitor the progress of our training job and update MLFlow with metrics.
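
If you would rather block until the job finishes than re-run a status cell by hand, a small polling loop is one option. This is a sketch under assumptions: `wait_for_training_job` is a hypothetical helper, and it takes the describe call as an argument so the loop can be exercised without AWS access. In the notebook you would pass `sagemaker_client.describe_training_job`.

```python
import time

# Statuses after which a training job will not change again
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}

def wait_for_training_job(describe, job_name, poll_seconds=30,
                          timeout_seconds=3600, sleep=time.sleep):
    """Poll describe(TrainingJobName=...) until the job reaches a terminal status."""
    waited = 0
    while True:
        status = describe(TrainingJobName=job_name)["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"{job_name} still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Demonstration with a stand-in for the real describe call
_fake_statuses = iter(["InProgress", "InProgress", "Completed"])

def _fake_describe(TrainingJobName):
    return {"TrainingJobStatus": next(_fake_statuses)}

print(wait_for_training_job(_fake_describe, "demo-job", poll_seconds=1, sleep=lambda s: None))
```

boto3 also ships a built-in waiter for this, `sagemaker_client.get_waiter('training_job_completed_or_stopped')`, which may be simpler when no custom timeout handling is needed.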

In [None]:
# Function to monitor training job and update MLFlow
def monitor_training_job_with_mlflow(job_name, mlflow_run_id):
    """Monitor SageMaker training job status and update MLFlow"""
    # Get job description
    response = sagemaker_client.describe_training_job(
        TrainingJobName=job_name
    )
    
    # Extract job status
    status = response['TrainingJobStatus']
    creation_time = response['CreationTime']
    last_modified_time = response.get('LastModifiedTime', creation_time)
    
    # Calculate duration
    duration = last_modified_time - creation_time
    duration_minutes = duration.total_seconds() / 60
    
    # Display job information
    print(f"Job Name: {job_name}")
    print(f"Status: {status}")
    print(f"Creation Time: {creation_time.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Last Modified: {last_modified_time.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Duration: {duration_minutes:.2f} minutes")
    print(f"MLFlow Run ID: {mlflow_run_id}")
    
    # Update MLFlow with job status
    with mlflow.start_run(run_id=mlflow_run_id):
        mlflow.log_metric("training_duration_minutes", duration_minutes)
        mlflow.set_tag("job_status", status)
        mlflow.set_tag("last_updated", last_modified_time.isoformat())
    
    # Display additional information based on status
    if status == 'InProgress':
        print("\nJob is still running. Check back later for results.")
    elif status == 'Completed':
        print("\nJob completed successfully!")
        model_artifacts = response['ModelArtifacts']['S3ModelArtifacts']
        print(f"Model artifacts: {model_artifacts}")
        
        # Update MLFlow with completion details
        with mlflow.start_run(run_id=mlflow_run_id):
            mlflow.log_param("model_artifacts_path", model_artifacts)
            mlflow.set_tag("training_completed", "true")
            
    elif status == 'Failed':
        print("\nJob failed!")
        failure_reason = response.get('FailureReason', 'Unknown')
        print(f"Failure reason: {failure_reason}")
        
        # Update MLFlow with failure details
        with mlflow.start_run(run_id=mlflow_run_id):
            mlflow.set_tag("failure_reason", failure_reason)
            mlflow.set_tag("training_failed", "true")
            
    elif status == 'Stopped':
        print("\nJob was stopped.")
        
        # Update MLFlow with stopped status
        with mlflow.start_run(run_id=mlflow_run_id):
            mlflow.set_tag("training_stopped", "true")
    
    return response

# Monitor the training job
try:
    if 'job_name' in locals() and 'mlflow_run_id' in locals():
        job_response = monitor_training_job_with_mlflow(job_name, mlflow_run_id)
    else:
        print("No active training job to monitor.")
        print("Please execute a training job first.")
except Exception as e:
    print(f"Error monitoring training job: {str(e)}")

In [None]:
# Refresh job status (run this cell to update status)
try:
    if 'job_name' in locals() and 'mlflow_run_id' in locals():
        job_response = monitor_training_job_with_mlflow(job_name, mlflow_run_id)
    else:
        print("No active training job to monitor.")
        print("Please execute a training job first.")
except Exception as e:
    print(f"Error monitoring training job: {str(e)}")

## 5. Model Registration in SageMaker Model Registry

Once the training job is complete, let's register the model in SageMaker Model Registry.
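
The training job reports its model artifacts as an S3 URI, while S3 client calls and `mlflow.log_artifact` work with a bucket/key pair and a local file path respectively. A minimal sketch of splitting the URI is shown below; `split_s3_uri` is an illustrative helper name, and the bucket and key in the example are made up.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/path/to/object' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://example-bucket/model-artifacts/job-1/output/model.tar.gz")
print(bucket)
print(key)
```

With the bucket and key in hand, the archive could be fetched with `s3_client.download_file(bucket, key, local_path)` and the local copy then passed to `mlflow.log_artifact`, if storing a copy in MLFlow is actually wanted.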

In [None]:
# Function to register model in Model Registry
def register_model_in_registry(job_name, mlflow_run_id, model_package_group_name):
    """Register trained model in SageMaker Model Registry"""
    
    # Get training job details
    response = sagemaker_client.describe_training_job(
        TrainingJobName=job_name
    )
    
    # Check if job is completed
    if response['TrainingJobStatus'] != 'Completed':
        print(f"Training job is not completed yet. Status: {response['TrainingJobStatus']}")
        return None
    
    # Get model artifacts
    model_artifacts = response['ModelArtifacts']['S3ModelArtifacts']
    
    # Versioned packages in a group are identified by ARN and version number,
    # so no separate model package name is set here
    
    # Define inference specification
    inference_specification = {
        'Containers': [
            {
                'Image': f"{account_id}.dkr.ecr.{region}.amazonaws.com/yolov11-inference:latest",
                'ModelDataUrl': model_artifacts,
                'Environment': {
                    'SAGEMAKER_PROGRAM': 'inference.py',
                    'SAGEMAKER_SUBMIT_DIRECTORY': '/opt/ml/code'
                }
            }
        ],
        'SupportedContentTypes': ['application/json', 'image/jpeg', 'image/png'],
        'SupportedResponseMIMETypes': ['application/json']
    }
    
    # Create model package
    try:
        create_response = sagemaker_client.create_model_package(
            ModelPackageGroupName=model_package_group_name,
            ModelPackageDescription=f"YOLOv11 drone detection model trained from job {job_name}",
            InferenceSpecification=inference_specification,
            ModelApprovalStatus='PendingManualApproval',
            MetadataProperties={
                'GeneratedBy': f'sagemaker-training-job-{job_name}',
                'ProjectId': 'yolov11-drone-detection',
                'Repository': 'mlops-sagemaker-demo'
            },
            Tags=[
                {'Key': 'Project', 'Value': 'MLOps-SageMaker-Demo'},
                {'Key': 'Model', 'Value': 'YOLOv11'},
                {'Key': 'Task', 'Value': 'ObjectDetection'},
                {'Key': 'TrainingJob', 'Value': job_name},
                {'Key': 'MLFlowRunId', 'Value': mlflow_run_id}
            ]
        )
        
        model_package_arn = create_response['ModelPackageArn']
        print(f"Model registered successfully!")
        print(f"Model Package ARN: {model_package_arn}")
        
        # Update MLFlow with model registration details
        with mlflow.start_run(run_id=mlflow_run_id):
            mlflow.log_param("model_package_arn", model_package_arn)
            mlflow.log_param("model_package_group", model_package_group_name)
            mlflow.set_tag("model_registered", "true")
            mlflow.set_tag("model_approval_status", "PendingManualApproval")
            
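            # The artifacts live in S3 and mlflow.log_artifact only accepts a local path,
            # so record the S3 URI on the run instead of uploading a copy
            mlflow.set_tag("model_artifacts_s3_uri", model_artifacts)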
        
        return model_package_arn
        
    except Exception as e:
        print(f"Error registering model: {str(e)}")
        return None

# Register the model
try:
    if 'job_name' in locals() and 'mlflow_run_id' in locals():
        model_package_arn = register_model_in_registry(job_name, mlflow_run_id, MODEL_PACKAGE_GROUP_NAME)
        if model_package_arn:
            print(f"\nModel registration completed!")
            print(f"You can view the model in the SageMaker Model Registry console.")
    else:
        print("No completed training job to register.")
        print("Please complete a training job first.")
except Exception as e:
    print(f"Error during model registration: {str(e)}")

## 6. Model Management and Approval Workflow

Let's manage the registered models and handle approval workflows.
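
A common follow-up to approval is picking the newest approved version for deployment. The sketch below works on a list of dictionaries shaped like the `ModelPackageSummaryList` entries returned by `list_model_packages`; `latest_approved` is a hypothetical helper and the ARNs in the example are placeholders.

```python
from datetime import datetime

def latest_approved(summaries):
    """Return the most recently created summary with status 'Approved', or None."""
    approved = [s for s in summaries if s.get("ModelApprovalStatus") == "Approved"]
    if not approved:
        return None
    return max(approved, key=lambda s: s["CreationTime"])


example_summaries = [
    {"ModelPackageArn": "arn:example:1", "ModelApprovalStatus": "Approved",
     "CreationTime": datetime(2024, 1, 1)},
    {"ModelPackageArn": "arn:example:2", "ModelApprovalStatus": "PendingManualApproval",
     "CreationTime": datetime(2024, 3, 1)},
    {"ModelPackageArn": "arn:example:3", "ModelApprovalStatus": "Approved",
     "CreationTime": datetime(2024, 2, 1)},
]
print(latest_approved(example_summaries)["ModelPackageArn"])
```

Selecting by `CreationTime` rather than list position keeps the result correct regardless of the sort order used when listing.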

In [None]:
# Function to list models in the registry
def list_models_in_registry(model_package_group_name):
    """List all models in the Model Registry"""
    try:
        response = sagemaker_client.list_model_packages(
            ModelPackageGroupName=model_package_group_name,
            SortBy='CreationTime',
            SortOrder='Descending'
        )
        
        models = response.get('ModelPackageSummaryList', [])
        
        if not models:
            print(f"No models found in group: {model_package_group_name}")
            return []
        
        print(f"Found {len(models)} models in group: {model_package_group_name}")
        print("\nModel List:")
        print("-" * 80)
        
        for i, model in enumerate(models):
            print(f"{i+1}. Model Package ARN: {model['ModelPackageArn']}")
            print(f"   Status: {model['ModelPackageStatus']}")
            print(f"   Approval Status: {model['ModelApprovalStatus']}")
            print(f"   Creation Time: {model['CreationTime'].strftime('%Y-%m-%d %H:%M:%S')}")
            if 'ModelPackageDescription' in model:
                print(f"   Description: {model['ModelPackageDescription']}")
            print("-" * 80)
        
        return models
        
    except Exception as e:
        print(f"Error listing models: {str(e)}")
        return []

# List models in registry
models = list_models_in_registry(MODEL_PACKAGE_GROUP_NAME)

In [None]:
# Function to approve a model
def approve_model(model_package_arn, approval_description="Model approved for deployment"):
    """Approve a model in the Model Registry"""
    try:
        response = sagemaker_client.update_model_package(
            ModelPackageArn=model_package_arn,
            ModelApprovalStatus='Approved',
            ApprovalDescription=approval_description
        )
        
        print(f"Model approved successfully!")
        print(f"Model Package ARN: {model_package_arn}")
        print(f"Approval Description: {approval_description}")
        
        return True
        
    except Exception as e:
        print(f"Error approving model: {str(e)}")
        return False

# Example: Approve the latest model (uncomment to use)
# if models:
#     latest_model_arn = models[0]['ModelPackageArn']
#     approve_model(latest_model_arn, "Model approved after validation")

In [None]:
# Function to get model details
def get_model_details(model_package_arn):
    """Get detailed information about a model"""
    try:
        response = sagemaker_client.describe_model_package(
            ModelPackageName=model_package_arn
        )
        
        print(f"Model Package Details:")
        print(f"ARN: {response['ModelPackageArn']}")
        print(f"Status: {response['ModelPackageStatus']}")
        print(f"Approval Status: {response['ModelApprovalStatus']}")
        print(f"Creation Time: {response['CreationTime'].strftime('%Y-%m-%d %H:%M:%S')}")
        
        if 'ModelPackageDescription' in response:
            print(f"Description: {response['ModelPackageDescription']}")
        
        if 'InferenceSpecification' in response:
            containers = response['InferenceSpecification']['Containers']
            print(f"\nInference Specification:")
            for i, container in enumerate(containers):
                print(f"  Container {i+1}:")
                print(f"    Image: {container['Image']}")
                print(f"    Model Data: {container['ModelDataUrl']}")
        
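        # describe_model_package does not return tags, so fetch them separately
        tags = sagemaker_client.list_tags(ResourceArn=response['ModelPackageArn']).get('Tags', [])
        if tags:
            print(f"\nTags:")
            for tag in tags:
                print(f"  {tag['Key']}: {tag['Value']}")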
        
        return response
        
    except Exception as e:
        print(f"Error getting model details: {str(e)}")
        return None

# Example: Get details of the latest model (uncomment to use)
# if models:
#     latest_model_arn = models[0]['ModelPackageArn']
#     model_details = get_model_details(latest_model_arn)

## 7. MLFlow Experiment Management

Let's explore and compare experiments in MLFlow.
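
When comparing runs, the usual question is which run scored best on a given metric. The sketch below assumes run records are plain dictionaries with `metrics.<name>` keys, the same column naming that `mlflow.search_runs` uses in its DataFrame, so `runs_df.to_dict("records")` could feed it. `best_run` is an illustrative helper, and the metric name `val_mAP50` is only an example.

```python
import math

def best_run(records, metric, higher_is_better=True):
    """Return the record with the best value for metrics.<metric>, or None."""
    key = f"metrics.{metric}"
    # Skip runs that never logged the metric (None or NaN)
    scored = [
        r for r in records
        if r.get(key) is not None and not math.isnan(r[key])
    ]
    if not scored:
        return None
    chooser = max if higher_is_better else min
    return chooser(scored, key=lambda r: r[key])


example_runs = [
    {"run_id": "a", "metrics.val_mAP50": 0.61},
    {"run_id": "b", "metrics.val_mAP50": float("nan")},
    {"run_id": "c", "metrics.val_mAP50": 0.74},
    {"run_id": "d"},
]
print(best_run(example_runs, "val_mAP50")["run_id"])
```

For loss-style metrics, pass `higher_is_better=False`.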

In [None]:
# Function to list MLFlow experiments
def list_mlflow_experiments():
    """List all MLFlow experiments"""
    try:
        experiments = mlflow.search_experiments()
        
        print(f"Found {len(experiments)} experiments:")
        print("-" * 80)
        
        for exp in experiments:
            print(f"Experiment ID: {exp.experiment_id}")
            print(f"Name: {exp.name}")
            print(f"Lifecycle Stage: {exp.lifecycle_stage}")
            if exp.tags:
                print(f"Tags: {exp.tags}")
            print("-" * 80)
        
        return experiments
        
    except Exception as e:
        print(f"Error listing experiments: {str(e)}")
        return []

# List experiments
experiments = list_mlflow_experiments()

In [None]:
# Function to search MLFlow runs
def search_mlflow_runs(experiment_name, max_results=10):
    """Search MLFlow runs in an experiment"""
    try:
        # Get experiment by name
        experiment = mlflow.get_experiment_by_name(experiment_name)
        if not experiment:
            print(f"Experiment '{experiment_name}' not found")
            return pd.DataFrame()  # empty DataFrame so callers can use .empty
        
        # Search runs
        runs = mlflow.search_runs(
            experiment_ids=[experiment.experiment_id],
            max_results=max_results,
            order_by=["start_time DESC"]
        )
        
        if runs.empty:
            print(f"No runs found in experiment '{experiment_name}'")
            return runs  # already an empty DataFrame
        
        print(f"Found {len(runs)} runs in experiment '{experiment_name}':")
        print("-" * 100)
        
        # Display run information
        for idx, run in runs.iterrows():
            print(f"Run ID: {run['run_id']}")
            print(f"Status: {run['status']}")
            print(f"Start Time: {run['start_time']}")
            
            # Display parameters
            param_cols = [col for col in runs.columns if col.startswith('params.')]
            if param_cols:
                print("Parameters:")
                for param_col in param_cols:
                    param_name = param_col.replace('params.', '')
                    param_value = run[param_col]
                    if pd.notna(param_value):
                        print(f"  {param_name}: {param_value}")
            
            # Display metrics
            metric_cols = [col for col in runs.columns if col.startswith('metrics.')]
            if metric_cols:
                print("Metrics:")
                for metric_col in metric_cols:
                    metric_name = metric_col.replace('metrics.', '')
                    metric_value = run[metric_col]
                    if pd.notna(metric_value):
                        print(f"  {metric_name}: {metric_value}")
            
            # Display tags
            tag_cols = [col for col in runs.columns if col.startswith('tags.')]
            if tag_cols:
                print("Tags:")
                for tag_col in tag_cols:
                    tag_name = tag_col.replace('tags.', '')
                    tag_value = run[tag_col]
                    if pd.notna(tag_value):
                        print(f"  {tag_name}: {tag_value}")
            
            print("-" * 100)
        
        return runs
        
    except Exception as e:
        print(f"Error searching runs: {str(e)}")
        return pd.DataFrame()  # empty DataFrame so callers can use .empty

# Search runs in the current experiment
runs_df = search_mlflow_runs(experiment_name)

In [None]:
# Function to compare runs
def compare_runs(runs_df, metrics_to_compare=('training_duration_minutes',)):
    """Compare MLFlow runs"""
    if runs_df.empty:
        print("No runs to compare")
        return
    
    print("Run Comparison:")
    print("=" * 120)
    
    # Select relevant columns for comparison
    comparison_cols = ['run_id', 'status', 'start_time']
    
    # Add parameter columns
    param_cols = [col for col in runs_df.columns if col.startswith('params.')]
    comparison_cols.extend(param_cols)
    
    # Add metric columns
    for metric in metrics_to_compare:
        metric_col = f'metrics.{metric}'
        if metric_col in runs_df.columns:
            comparison_cols.append(metric_col)
    
    # Display comparison table
    comparison_df = runs_df[comparison_cols].copy()
    
    # Rename columns for better display
    column_mapping = {}
    for col in comparison_df.columns:
        if col.startswith('params.'):
            column_mapping[col] = col.replace('params.', 'param_')
        elif col.startswith('metrics.'):
            column_mapping[col] = col.replace('metrics.', 'metric_')
    
    comparison_df = comparison_df.rename(columns=column_mapping)
    
    # Display the comparison
    display(comparison_df)
    
    return comparison_df

# Compare runs if available
if not runs_df.empty:
    comparison_df = compare_runs(runs_df)
else:
    print("No runs available for comparison")

## 8. Training Metrics Visualization

Let's visualize training metrics from completed jobs.
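
CloudWatch's `get_metric_statistics` returns at most 1,440 datapoints per call, and for standard-resolution metrics the period must be a multiple of 60 seconds. A fixed 60-second period therefore stops working once a job runs longer than 24 hours. The sketch below picks a period that fits the job's time window; `choose_period` is a hypothetical helper.

```python
import math
from datetime import datetime, timedelta

# Upper bound on datapoints returned by a single get_metric_statistics call
MAX_DATAPOINTS = 1440

def choose_period(start, end, base_period=60):
    """Smallest multiple of 60 seconds that keeps the window within the datapoint limit."""
    span_seconds = max((end - start).total_seconds(), 0)
    needed = math.ceil(span_seconds / MAX_DATAPOINTS)
    period = max(base_period, needed)
    # Round up to the next multiple of 60 seconds
    return int(math.ceil(period / 60) * 60)


start = datetime(2024, 1, 1)
print(choose_period(start, start + timedelta(hours=1)))
print(choose_period(start, start + timedelta(hours=25)))
```

The result could be passed as the `Period` argument in place of a hard-coded value.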

In [None]:
# Function to get training metrics from CloudWatch
def get_training_metrics_with_mlflow(job_name, mlflow_run_id):
    """Get training metrics from CloudWatch and log to MLFlow"""
    # Get job description
    response = sagemaker_client.describe_training_job(
        TrainingJobName=job_name
    )
    
    # Check if job is complete
    if response['TrainingJobStatus'] != 'Completed':
        print(f"Job is not yet complete. Current status: {response['TrainingJobStatus']}")
        return {}, {}  # keep the (metrics_data, final_metrics) shape the caller unpacks
    
    # Get CloudWatch metrics
    cloudwatch = session.client('cloudwatch')
    
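    # Define metrics to retrieve. These only exist in CloudWatch if the estimator
    # was created with metric_definitions that emit the same names.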
    metrics = [
        'train:loss',
        'val:loss',
        'val:mAP50',
        'val:mAP50-95'
    ]
    
    # Get metrics data
    metrics_data = {}
    final_metrics = {}
    
    for metric_name in metrics:
        try:
            cw_response = cloudwatch.get_metric_statistics(
                # Training job metrics are published under this namespace
                Namespace='/aws/sagemaker/TrainingJobs',
                MetricName=metric_name,
                Dimensions=[
                    {
                        'Name': 'TrainingJobName',
                        'Value': job_name
                    }
                ],
                StartTime=response['CreationTime'],
                EndTime=response['LastModifiedTime'],
                Period=60,  # 1-minute periods
                Statistics=['Average']
            )
            
            # Extract datapoints
            datapoints = cw_response.get('Datapoints', [])
            if datapoints:
                # Sort by timestamp
                datapoints.sort(key=lambda x: x['Timestamp'])
                
                # Extract values
                timestamps = [dp['Timestamp'] for dp in datapoints]
                values = [dp['Average'] for dp in datapoints]
                
                metrics_data[metric_name] = {
                    'timestamps': timestamps,
                    'values': values
                }
                
                # Store final metric value
                if values:
                    final_metrics[metric_name] = values[-1]
                    
        except Exception as e:
            print(f"Error retrieving metric {metric_name}: {str(e)}")
    
    # Log final metrics to MLFlow
    if final_metrics:
        with mlflow.start_run(run_id=mlflow_run_id):
            for metric_name, value in final_metrics.items():
                # Clean metric name for MLFlow
                clean_name = metric_name.replace(':', '_')
                mlflow.log_metric(clean_name, value)
    
    return metrics_data, final_metrics

# Get and visualize training metrics
try:
    if 'job_name' in locals() and 'mlflow_run_id' in locals():
        metrics_data, final_metrics = get_training_metrics_with_mlflow(job_name, mlflow_run_id)
        
        if metrics_data:
            # Plot metrics
            fig, axes = plt.subplots(2, 1, figsize=(12, 10))
            
            # Plot loss
            if 'train:loss' in metrics_data:
                axes[0].plot(
                    metrics_data['train:loss']['timestamps'],
                    metrics_data['train:loss']['values'],
                    label='Train Loss',
                    marker='o'
                )
            
            if 'val:loss' in metrics_data:
                axes[0].plot(
                    metrics_data['val:loss']['timestamps'],
                    metrics_data['val:loss']['values'],
                    label='Validation Loss',
                    marker='s'
                )
            
            axes[0].set_title('Training and Validation Loss')
            axes[0].set_xlabel('Time')
            axes[0].set_ylabel('Loss')
            axes[0].legend()
            axes[0].grid(True, alpha=0.3)
            
            # Plot mAP
            if 'val:mAP50' in metrics_data:
                axes[1].plot(
                    metrics_data['val:mAP50']['timestamps'],
                    metrics_data['val:mAP50']['values'],
                    label='mAP@0.5',
                    marker='o'
                )
            
            if 'val:mAP50-95' in metrics_data:
                axes[1].plot(
                    metrics_data['val:mAP50-95']['timestamps'],
                    metrics_data['val:mAP50-95']['values'],
                    label='mAP@0.5:0.95',
                    marker='s'
                )
            
            axes[1].set_title('Validation mAP')
            axes[1].set_xlabel('Time')
            axes[1].set_ylabel('mAP')
            axes[1].legend()
            axes[1].grid(True, alpha=0.3)
            
            plt.tight_layout()
            plt.show()
            
            # Display final metrics
            if final_metrics:
                print("\nFinal Training Metrics:")
                print("=" * 40)
                for metric_name, value in final_metrics.items():
                    print(f"{metric_name}: {value:.4f}")
        else:
            print("No metrics available yet. The job may still be running or may have failed.")
    else:
        print("No active training job to monitor.")
        print("Please execute a training job first.")
except Exception as e:
    print(f"Error retrieving training metrics: {str(e)}")
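
The cell above records only the last datapoint of each series as its final value. For validation metrics the best epoch is often not the last one, so a small helper that reports both can be useful. The sketch below is a hypothetical addition (`summarize_metric_series` is not part of the notebook) and assumes `metrics_data` has the shape returned by `get_training_metrics_with_mlflow`: a dict mapping metric names to parallel `timestamps` and `values` lists. It treats any metric whose name contains `loss` as lower-is-better, an assumption that fits the four metrics queried here but may not fit others.

```python
from datetime import datetime, timedelta


def summarize_metric_series(metrics_data):
    """Return the final and best value (with timestamp) for each metric series."""
    summary = {}
    for name, series in metrics_data.items():
        values = series.get('values', [])
        timestamps = series.get('timestamps', [])
        if not values:
            continue  # nothing recorded for this metric
        # Assumption: loss-like metrics are minimized, everything else maximized
        pick = min if 'loss' in name.lower() else max
        best_index = pick(range(len(values)), key=values.__getitem__)
        summary[name] = {
            'final': values[-1],
            'best': values[best_index],
            'best_timestamp': timestamps[best_index] if timestamps else None,
        }
    return summary


# Synthetic data shaped like metrics_data (not real training results)
start = datetime(2024, 1, 1)
example = {
    'val:loss': {
        'timestamps': [start + timedelta(minutes=i) for i in range(3)],
        'values': [0.9, 0.4, 0.5],
    },
    'val:mAP50': {
        'timestamps': [start + timedelta(minutes=i) for i in range(3)],
        'values': [0.2, 0.6, 0.55],
    },
}

for name, stats in summarize_metric_series(example).items():
    print(f"{name}: final={stats['final']:.4f}, best={stats['best']:.4f}")
```

In the notebook itself, the call would be `summarize_metric_series(metrics_data)` after the plotting cell has run.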

## 9. Summary and Next Steps

In this notebook, we executed and monitored a YOLOv11 training pipeline with MLFlow experiment tracking and SageMaker Model Registry integration. Here's a summary of what we've accomplished:

### Completed Tasks:

1. **Model Registry Setup**:
   - Created Model Package Group for organizing YOLOv11 models
   - Configured model registration workflow

2. **Pipeline Configuration with MLFlow**:
   - Listed available datasets
   - Configured training parameters with MLFlow experiment tracking

3. **Enhanced Pipeline Execution**:
   - Created and executed SageMaker training job with MLFlow integration
   - Logged all parameters, metrics, and metadata to MLFlow

4. **Comprehensive Monitoring**:
   - Monitored training job status with real-time updates to MLFlow
   - Tracked training duration and job status

5. **Model Registration**:
   - Registered trained models in SageMaker Model Registry
   - Configured approval workflows for production deployment
   - Linked MLFlow runs with registered models

6. **Model Management**:
   - Listed and managed models in the registry
   - Implemented model approval workflows
   - Retrieved detailed model information

7. **Experiment Management**:
   - Listed and compared MLFlow experiments
   - Searched and analyzed training runs
   - Compared model performance across runs

8. **Metrics Visualization**:
   - Retrieved training metrics from CloudWatch
   - Logged final metrics to MLFlow
   - Visualized training progress and model performance

### Key Features:

- **Complete MLFlow Integration**: All training parameters, metrics, and artifacts are tracked
- **Model Registry Integration**: Trained models are automatically registered with proper metadata
- **Approval Workflows**: Models require manual approval before deployment
- **Experiment Comparison**: Easy comparison of different training runs
- **Comprehensive Monitoring**: Real-time tracking of training progress
- **Visualization**: Training metrics are visualized for better understanding

### Next Steps:

1. **Model Deployment**: Deploy approved models to SageMaker endpoints
2. **A/B Testing**: Set up A/B testing for model comparison
3. **Model Monitoring**: Implement data drift and model performance monitoring
4. **Automated Retraining**: Set up automated retraining based on performance degradation
5. **CI/CD Integration**: Integrate with CI/CD pipelines for automated model deployment
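
As a starting point for the first step, deployment usually begins by choosing which registered version to deploy. The sketch below is a hypothetical helper (`latest_approved_package` is not defined elsewhere in this notebook) that picks the newest approved entry from a list shaped like the `ModelPackageSummaryList` returned by the boto3 `list_model_packages` call. It runs on hand-written sample data rather than a live registry, and the ARNs are placeholders.

```python
from datetime import datetime


def latest_approved_package(summaries):
    """Return the newest summary whose ModelApprovalStatus is 'Approved', or None."""
    approved = [
        s for s in summaries
        if s.get('ModelApprovalStatus') == 'Approved'
    ]
    if not approved:
        return None
    return max(approved, key=lambda s: s['CreationTime'])


# Hand-written sample data. Against a live registry the list would come from
# sagemaker_client.list_model_packages(ModelPackageGroupName=...)['ModelPackageSummaryList']
sample_summaries = [
    {'ModelPackageArn': 'arn:example:model-package/demo/1',
     'ModelPackageVersion': 1,
     'ModelApprovalStatus': 'Approved',
     'CreationTime': datetime(2024, 1, 1)},
    {'ModelPackageArn': 'arn:example:model-package/demo/2',
     'ModelPackageVersion': 2,
     'ModelApprovalStatus': 'Approved',
     'CreationTime': datetime(2024, 2, 1)},
    {'ModelPackageArn': 'arn:example:model-package/demo/3',
     'ModelPackageVersion': 3,
     'ModelApprovalStatus': 'PendingManualApproval',
     'CreationTime': datetime(2024, 3, 1)},
]

selected = latest_approved_package(sample_summaries)
if selected is not None:
    print(f"Deploy candidate: version {selected['ModelPackageVersion']} "
          f"({selected['ModelPackageArn']})")
```

The selected `ModelPackageArn` is what a deployment step would take as input. For groups with many versions the listing is paginated, so a fuller implementation would need to follow `NextToken` before selecting.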

### Best Practices Implemented:

- **Experiment Tracking**: All experiments are tracked with MLFlow
- **Model Versioning**: Models are versioned in the Model Registry
- **Approval Workflows**: Manual approval required for production deployment
- **Metadata Management**: Comprehensive metadata tracking for reproducibility
- **Cost Optimization**: Use of spot instances for training
- **Error Handling**: Robust error handling throughout the workflow

This workflow provides a foundation for production YOLOv11 model development and deployment, with governance, experiment tracking, and model management in place.