# Deploy Best MLflow Model to SageMaker Endpoint

This notebook demonstrates how to:
1. Query the MLflow tracking server to find the best-performing model
2. Deploy that model to a SageMaker real-time endpoint
3. Test the deployed endpoint

## 1. Setup and Import Libraries

In [None]:
import os
import boto3
import sagemaker
import mlflow
import pandas as pd
from datetime import datetime
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

In [None]:
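# Upgrade packages if needed. The sagemaker-mlflow plugin is what lets MLflow
# accept a SageMaker tracking server ARN as the tracking URI.
# !pip install --upgrade sagemaker mlflow sagemaker-mlflow --quiet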

## 2. Configure SageMaker Session and MLflow

In [None]:
# SageMaker session
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
region = sagemaker_session.boto_region_name

# Clients
sm_client = boto3.client('sagemaker', region_name=region)

# S3 Bucket
bucket = sagemaker_session.default_bucket()
prefix = "qwen3-0-6-lora-samples-mlflow"

print(f"Using bucket: {bucket}")
print(f"Using region: {region}")
print(f"Using role: {role}")

In [None]:
# MLflow tracking server configuration
# Update this ARN to match your MLflow tracking server
mlflow_tracking_server_arn = "YOUR_ARN_HERE"
mlflow_experiment_name = "qwen3-lora-training"

print(f"MLflow Tracking Server ARN: {mlflow_tracking_server_arn}")
print(f"MLflow Experiment Name: {mlflow_experiment_name}")
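
Before connecting, it can help to confirm that the placeholder ARN was actually replaced. The sketch below is a minimal format check, not an official validator: the regular expression encodes an assumed ARN layout (`arn:<partition>:sagemaker:<region>:<account>:mlflow-tracking-server/<name>`), and `looks_like_tracking_server_arn` and `example_arn` are hypothetical names introduced here for illustration.

```python
import re

# Assumed layout of a SageMaker managed MLflow tracking server ARN:
#   arn:<partition>:sagemaker:<region>:<12-digit account>:mlflow-tracking-server/<name>
TRACKING_SERVER_ARN = re.compile(
    r"^arn:aws[a-z\-]*:sagemaker:[a-z0-9\-]+:\d{12}:mlflow-tracking-server/[A-Za-z0-9\-]+$"
)


def looks_like_tracking_server_arn(value: str) -> bool:
    """Return True if value has the general shape of a tracking server ARN."""
    return bool(TRACKING_SERVER_ARN.match(value))


# Hypothetical ARN used only to exercise the check
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:mlflow-tracking-server/my-server"
print(looks_like_tracking_server_arn(example_arn))
print(looks_like_tracking_server_arn("YOUR_ARN_HERE"))
```

A check like this only catches obvious mistakes such as a leftover placeholder; it does not confirm that the server exists or that your role can reach it.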

## 3. Connect to MLflow Tracking Server

In [None]:
# Set tracking URI
mlflow.set_tracking_uri(mlflow_tracking_server_arn)
print(f"MLflow tracking URI configured: {mlflow_tracking_server_arn}")

# Get experiment
experiment = mlflow.get_experiment_by_name(mlflow_experiment_name)
if experiment is None:
    raise ValueError(f"Experiment '{mlflow_experiment_name}' not found. Please check the experiment name.")

print(f"\nExperiment ID: {experiment.experiment_id}")
print(f"Experiment Name: {experiment.name}")

## 4. Search MLflow Runs to Find Best Model

We'll use `mlflow.search_runs()` to find the model with the lowest evaluation loss.

In [None]:
# Search for all completed runs in the experiment
runs_df = mlflow.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string="status = 'FINISHED'",
    order_by=["metrics.eval_loss ASC"],  # Sort by eval_loss in ascending order (lower is better)
    max_results=10
)

print(f"Found {len(runs_df)} completed runs")

# Display relevant columns
if len(runs_df) > 0:
    columns_to_display = [
        'run_id', 
        'start_time',
        'metrics.eval_loss',
        'metrics.train_loss',
        'params.learning_rate',
        'params.num_train_epochs',
        'tags.mlflow.runName'
    ]
    # Filter columns that exist
    existing_columns = [col for col in columns_to_display if col in runs_df.columns]
    display(runs_df[existing_columns].head(10))
else:
    print("No completed runs found in the experiment.")
    print("Please ensure that at least one training job has completed successfully.")

## 5. Select Best Model

Select the model with the lowest evaluation loss.

In [None]:
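if len(runs_df) == 0:
    raise ValueError("No completed runs found. Cannot proceed with deployment.")

# Rank only runs that logged eval_loss; runs without it appear as NaN and cannot be compared
if 'metrics.eval_loss' not in runs_df.columns or runs_df['metrics.eval_loss'].isna().all():
    raise ValueError("No run logged 'eval_loss'. Cannot select a best model.")
ranked_runs = runs_df.dropna(subset=['metrics.eval_loss']).sort_values('metrics.eval_loss')

# Get the best run (lowest eval_loss)
best_run = ranked_runs.iloc[0]
best_run_id = best_run['run_id']
best_eval_loss = best_run['metrics.eval_loss']
best_train_loss = best_run.get('metrics.train_loss', 'N/A')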

print("=== Best Model Selected ===")
print(f"Run ID: {best_run_id}")
print(f"Run Name: {best_run.get('tags.mlflow.runName', 'N/A')}")
print(f"Eval Loss: {best_eval_loss}")
print(f"Train Loss: {best_train_loss}")
print(f"Learning Rate: {best_run.get('params.learning_rate', 'N/A')}")
print(f"Epochs: {best_run.get('params.num_train_epochs', 'N/A')}")
print(f"Started: {best_run['start_time']}")

## 6. Get Model Artifacts from MLflow

Retrieve the model artifacts path from the MLflow run.

In [None]:
# Get run details
run = mlflow.get_run(best_run_id)

print("=== Run Details ===")
print(f"Run ID: {run.info.run_id}")
print(f"Status: {run.info.status}")
print(f"\nLogged Artifacts:")

# List artifacts in the run
client = mlflow.tracking.MlflowClient()
artifacts = client.list_artifacts(best_run_id)

for artifact in artifacts:
    print(f"  - {artifact.path}")

# Note: In this workflow, the model artifacts are stored in S3 by the SageMaker training job
# We need to get the S3 model data path from the training job metadata
# The training job name is typically stored in the run tags
training_job_name = run.data.tags.get('mlflow.runName', '')
print(f"\nTraining Job Name (from MLflow run name): {training_job_name}")

## 7. Get Model Data S3 Path from SageMaker Training Job

Since the model artifacts are stored in S3 by SageMaker, we need to retrieve the S3 path from the training job.
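
If the MLflow run name really is the SageMaker training job name, as this workflow assumes, you can look the job up by name instead of guessing. The sketch below is one possible approach. `find_completed_training_job` is a hypothetical helper, and the name pattern (alphanumerics and hyphens, at most 63 characters) is an assumption about what `list_training_jobs` accepts for `NameContains`.

```python
import re
from typing import Any, Optional

# Assumed constraint on training job names: alphanumerics and hyphens, up to 63 characters
JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9\-]{1,63}")


def find_completed_training_job(sm_client: Any, candidate_name: str) -> Optional[str]:
    """Return candidate_name if a completed training job has exactly that name, else None."""
    if not candidate_name or not JOB_NAME_PATTERN.fullmatch(candidate_name):
        return None  # The run name cannot be a training job name
    response = sm_client.list_training_jobs(
        NameContains=candidate_name,
        StatusEquals="Completed",
        MaxResults=100,
    )
    for summary in response.get("TrainingJobSummaries", []):
        # NameContains is a substring filter, so require an exact match here
        if summary["TrainingJobName"] == candidate_name:
            return candidate_name
    return None


# Usage with the notebook's variables (requires AWS credentials):
# matched_job_name = find_completed_training_job(sm_client, training_job_name)
```

A non-`None` result could replace the most-recent-job fallback used in the next cell, so the deployed artifacts belong to the best MLflow run.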

In [None]:
# The SageMaker training job writes the model archive to S3, and that archive is what we deploy.
# Ideally the training job name comes from the MLflow run (tags or parameters).
# For this example, we list recent completed training jobs and pick one from the list;
# adjust this to your own naming convention.

# Expected S3 output prefix for this workflow (for reference only; not used below)
output_path = f"s3://{bucket}/{prefix}/output"
response = sm_client.list_training_jobs(
    SortBy='CreationTime',
    SortOrder='Descending',
    MaxResults=20,
    StatusEquals='Completed'
)

print("=== Recent Completed Training Jobs ===")
for idx, job in enumerate(response['TrainingJobSummaries'][:10]):
    print(f"{idx}: {job['TrainingJobName']} (Created: {job['CreationTime']})")

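# For demonstration, use the most recent completed training job.
# In production, match the training job name stored in MLflow tags instead.
if not response['TrainingJobSummaries']:
    raise ValueError("No completed training jobs found in this account and region.")
latest_training_job_name = response['TrainingJobSummaries'][0]['TrainingJobName']
print(f"\nUsing training job: {latest_training_job_name}")
if training_job_name != latest_training_job_name:
    print("Warning: this job name differs from the MLflow run name, so it may not be the best run's model.")
print("\nNote: In production, you should store the SageMaker training job name in MLflow tags")
print("to ensure you're deploying the correct model.")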

In [None]:
# Get training job details to retrieve model data S3 path
training_job_details = sm_client.describe_training_job(
    TrainingJobName=latest_training_job_name
)

model_data_s3_uri = training_job_details['ModelArtifacts']['S3ModelArtifacts']

print("=== Model Artifacts ===")
print(f"Training Job: {latest_training_job_name}")
print(f"Model Data S3 URI: {model_data_s3_uri}")
print(f"Training Job Status: {training_job_details['TrainingJobStatus']}")
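
Since `ModelDataUrl` is expected to point at a single gzip-compressed tar archive, a quick sanity check on the URI can catch a wrong path before the model is created. The sketch below is a minimal, offline check. `split_s3_uri`, `is_model_archive`, and `example_uri` are hypothetical names, and the check looks only at the URI string, not at the object in S3.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def is_model_archive(s3_uri: str) -> bool:
    """Return True if the URI's key ends with the .tar.gz suffix."""
    _, key = split_s3_uri(s3_uri)
    return key.endswith(".tar.gz")


# Hypothetical URI used only to exercise the helpers
example_uri = "s3://example-bucket/qwen3-0-6-lora-samples-mlflow/output/job-1/output/model.tar.gz"
print(split_s3_uri(example_uri))
print(is_model_archive(example_uri))
```

In the notebook, the same check could be applied to `model_data_s3_uri` before calling `create_model`.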

## 8. Create SageMaker Model from Best MLflow Run

In [None]:
# Get PyTorch inference image
from sagemaker import image_uris

# Use PyTorch 2.7.1 inference image
pytorch_inference_image_uri = image_uris.retrieve(
    framework="pytorch",
    region=region,
    version="2.7.1",
    py_version="py312",
    instance_type="ml.g5.2xlarge",
    image_scope="inference"
)

print(f"PyTorch Inference Image: {pytorch_inference_image_uri}")

In [None]:
# Create model name with timestamp
timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
model_name = f"qwen3-mlflow-best-model-{timestamp}"

# Create model
create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        'Image': pytorch_inference_image_uri,
        'ModelDataUrl': model_data_s3_uri,
        'Environment': {
            'SAGEMAKER_PROGRAM': 'inference.py',
            'SAGEMAKER_SUBMIT_DIRECTORY': model_data_s3_uri,
            'SAGEMAKER_CONTAINER_LOG_LEVEL': '20',
            'SAGEMAKER_REGION': region,
        }
    },
    Tags=[
        {'Key': 'Project', 'Value': 'MLOps-Workshop'},
        {'Key': 'Model', 'Value': 'QWEN3-0.6B'},
        {'Key': 'MLflowRunId', 'Value': best_run_id},
        {'Key': 'EvalLoss', 'Value': str(best_eval_loss)}
    ]
)

print("\n=== Model Created ===")
print(f"Model Name: {model_name}")
print(f"Model ARN: {create_model_response['ModelArn']}")
print(f"Model Data: {model_data_s3_uri}")
print(f"MLflow Run ID: {best_run_id}")
print(f"Best Eval Loss: {best_eval_loss}")
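
The model, endpoint configuration, and endpoint names in this notebook are all built from a base string plus a timestamp. If you change the base strings, a small helper can keep the result within SageMaker's naming rules. The sketch below assumes those rules are at most 63 characters of alphanumerics and hyphens, starting and ending with an alphanumeric; `build_resource_name` is a hypothetical helper, not part of the SageMaker SDK.

```python
import re
from datetime import datetime

# Assumed SageMaker naming constraints for models, endpoint configs, and endpoints
MAX_NAME_LENGTH = 63
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9]([\-a-zA-Z0-9]*[a-zA-Z0-9])?$")


def build_resource_name(base: str, timestamp: str) -> str:
    """Join base and timestamp with a hyphen, trimming base so the result fits."""
    suffix = f"-{timestamp}"
    trimmed = base[: MAX_NAME_LENGTH - len(suffix)].rstrip("-")
    name = f"{trimmed}{suffix}"
    if not NAME_PATTERN.match(name):
        raise ValueError(f"Invalid resource name: {name!r}")
    return name


timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
print(build_resource_name("qwen3-mlflow-best-model", timestamp))
```

Trimming the base rather than the timestamp keeps names unique across repeated runs of the notebook.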

## 9. Create Endpoint Configuration

In [None]:
# Endpoint configuration name
endpoint_config_name = f"qwen3-mlflow-config-{timestamp}"

# Create endpoint configuration
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            'VariantName': 'AllTraffic',
            'ModelName': model_name,
            'InstanceType': 'ml.g5.2xlarge',
            'InitialInstanceCount': 1,
            'InitialVariantWeight': 1,
            'ManagedInstanceScaling': {
                'Status': 'ENABLED',
                'MinInstanceCount': 1,
                'MaxInstanceCount': 1
            },
            'RoutingConfig': {
                'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'
            }
        }
    ],
    Tags=[
        {'Key': 'Project', 'Value': 'MLOps-Workshop'},
        {'Key': 'Model', 'Value': 'QWEN3-0.6B'},
        {'Key': 'Source', 'Value': 'MLflow'},
        {'Key': 'MLflowRunId', 'Value': best_run_id}
    ]
)

print("\n=== Endpoint Configuration Created ===")
print(f"Config Name: {endpoint_config_name}")
print(f"Config ARN: {endpoint_config_response['EndpointConfigArn']}")
print(f"Instance Type: ml.g5.2xlarge")
print(f"Initial Instance Count: 1")

## 10. Create and Deploy SageMaker Endpoint

In [None]:
# Endpoint name
endpoint_name = f"qwen3-mlflow-endpoint-{timestamp}"

# Create endpoint
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
    Tags=[
        {'Key': 'Project', 'Value': 'MLOps-Workshop'},
        {'Key': 'Model', 'Value': 'QWEN3-0.6B'},
        {'Key': 'DeploymentType', 'Value': 'RealTime'},
        {'Key': 'Source', 'Value': 'MLflow'},
        {'Key': 'MLflowRunId', 'Value': best_run_id}
    ]
)

print("\n=== Creating Endpoint ===")
print(f"Endpoint Name: {endpoint_name}")
print(f"Endpoint ARN: {create_endpoint_response['EndpointArn']}")
print(f"Status: Creating...")
print(f"\nThis process will take 5-10 minutes. You can monitor progress in the SageMaker console.")

## 11. Wait for Endpoint to be Ready

In [None]:
# Wait for endpoint to be in service
print("Waiting for endpoint to be in service...")

waiter = sm_client.get_waiter('endpoint_in_service')

try:
    waiter.wait(
        EndpointName=endpoint_name,
        WaiterConfig={
            'Delay': 30,  # Check every 30 seconds
            'MaxAttempts': 40  # Maximum 20 minutes
        }
    )
    
    # Get endpoint status
    endpoint_response = sm_client.describe_endpoint(EndpointName=endpoint_name)
    
    print("\n=== Endpoint Deployment Successful ===")
    print(f"Endpoint Name: {endpoint_name}")
    print(f"Status: {endpoint_response['EndpointStatus']}")
    print(f"Created: {endpoint_response['CreationTime']}")
    print(f"\nDeployed Model Details:")
    print(f"  - MLflow Run ID: {best_run_id}")
    print(f"  - Eval Loss: {best_eval_loss}")
    print(f"  - Model S3 URI: {model_data_s3_uri}")
    
except Exception as e:
    print(f"Error waiting for endpoint: {e}")
    print("Please check the endpoint status in the SageMaker console.")
    raise  # Stop here so later cells do not call an endpoint that is not in service
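
When the waiter fails, its message does not always make the cause obvious. As an alternative, you can poll `describe_endpoint` yourself and surface the `FailureReason` field. The sketch below is one way to do that under the same timing as the waiter above (30 seconds × 40 attempts). `wait_for_endpoint` is a hypothetical helper, and the `sleep` parameter exists only so the loop can be exercised without real delays.

```python
import time
from typing import Any, Callable


def wait_for_endpoint(
    sm_client: Any,
    endpoint_name: str,
    delay_seconds: int = 30,
    max_attempts: int = 40,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll describe_endpoint until the endpoint is InService, fails, or attempts run out."""
    for _ in range(max_attempts):
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            # FailureReason is reported by SageMaker when endpoint creation fails
            reason = description.get("FailureReason", "no reason reported")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        sleep(delay_seconds)
    raise TimeoutError(
        f"Endpoint {endpoint_name} was not in service after {max_attempts} checks"
    )


# Usage with the notebook's variables (requires AWS credentials):
# wait_for_endpoint(sm_client, endpoint_name)
```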

## 12. Test the Endpoint

In [None]:
# Create a predictor for the endpoint
predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

print("=== Testing Endpoint ===")
print("\nNote: If you get a timeout error, the model might still be loading.")
print("Wait a few minutes and try again.\n")

In [None]:
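# Test 1: Simple greeting
# Note: with "do_sample": False, Hugging Face-style generation decodes greedily and
# typically ignores "temperature". The exact behavior depends on your inference.py handler.
test_input_1 = {
    "inputs": "Hi!",
    "parameters": {
        "max_new_tokens": 20,
        "temperature": 0.1,
        "do_sample": False
    }
}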

print("Test 1: Simple Greeting")
print(f"Input: {test_input_1['inputs']}")
response_1 = predictor.predict(data=test_input_1)
print(f"Output: {response_1}\n")
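
Applications that do not use the SageMaker Python SDK can call the same endpoint through the `sagemaker-runtime` client's `invoke_endpoint` operation. The sketch below assumes the endpoint accepts and returns JSON, as the `JSONSerializer` and `JSONDeserializer` above imply; `invoke_json` is a hypothetical helper introduced for illustration.

```python
import json
from typing import Any


def invoke_json(runtime_client: Any, endpoint_name: str, payload: dict) -> Any:
    """Send a JSON payload to the endpoint and decode the JSON response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream; read it fully before decoding
    return json.loads(response["Body"].read())


# Usage with the notebook's variables (requires AWS credentials):
# runtime = boto3.client("sagemaker-runtime", region_name=region)
# print(invoke_json(runtime, endpoint_name, test_input_1))
```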

## Summary

This notebook walked through the following workflow:

1. **MLflow Integration**:
   - Connected to SageMaker MLflow tracking server
   - Queried experiment runs using `mlflow.search_runs()`
   - Selected the best model based on evaluation loss

2. **Model Deployment**:
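   - Located the model artifacts in S3 through the SageMaker training job (the demo picks the most recent completed job; in production, match it to the best MLflow run)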
   - Created a SageMaker Model with proper tagging
   - Deployed to a real-time SageMaker endpoint

3. **Testing and Validation**:
   - Tested the endpoint with sample inputs
   - Confirmed that the endpoint returns a response

4. **Traceability**:
   - Tagged the model, endpoint configuration, and endpoint with the MLflow run ID
   - Kept the link from MLflow run to deployed endpoint, which holds only when the training job is matched to that run

### Key Benefits

- **Automated Model Selection**: Uses MLflow to automatically find the best model
- **Experiment Tracking**: Full visibility into model performance and hyperparameters
- **Reproducibility**: Can always trace back to the exact training run
- **Governance**: Clear audit trail from training to deployment

### Next Steps

1. **A/B Testing**: Deploy multiple model versions and compare performance
2. **Auto-Scaling**: Configure auto-scaling policies based on traffic
3. **Model Monitoring**: Set up CloudWatch alarms for latency and errors
4. **CI/CD Pipeline**: Automate the deployment process with SageMaker Pipelines
5. **Model Registry**: Register the best model in MLflow Model Registry for production

For more information:
- [SageMaker MLflow Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html)
- [MLflow Model Registry](https://mlflow.org/docs/latest/model-registry.html)
- [SageMaker Endpoint Deployment](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html)