# SageMaker SDK Training & Hyperparameter Tuning

**Lab 3 - Assignments 2 & 3 Answer Notebook**

This notebook demonstrates model training and hyperparameter tuning using the SageMaker Python SDK, answering:
- **Assignment 2**: Training with Framework Estimators
- **Assignment 3**: Hyperparameter Tuning with the SDK

**Key Benefits of SDK Approach:**
- Roughly 80% less code than the equivalent sagemaker-core workflow
- High-level ML abstractions (Estimators, Tuners, Predictors)
- Automatic handling of AWS resource configuration
- Clean inference with automatic serialization
- Integrated hyperparameter tuning workflow

In [1]:
%load_ext autoreload
%autoreload 2

## Setup and Configuration

We'll keep our existing `CoreLabSession` for session management but switch to the SageMaker SDK for ML operations.

In [2]:
from corelab.core.session import CoreLabSession

# Use our custom session for authentication and S3 management
lab_session = CoreLabSession('pytorch', 'customer-churn',
                            default_folder='sagemaker_sdk_notebook', 
                            create_run_folder=True,
                             aws_profile='sagemaker-role')
lab_session.print()

# Get SageMaker session for SDK integration
sagemaker_session = lab_session.get_sagemaker_session()


sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/machiel/Library/Application Support/sagemaker/config.yaml


Couldn't call 'get_role' to get Role ARN from role name machiel-crystalline to get Role path.


falling back to profile: sagemaker-role
AWS region: eu-central-1
Execution role arn:aws:iam::136548476532:role/service-role/AmazonSageMaker-ExecutionRole-20250902T164316
Output bucket uri: s3://sagemaker-eu-central-1-136548476532/sagemaker_sdk_notebook/2025-10-22T08-47-08
Framework: pytorch
Project name: customer-churn


## 🎓 Assignment 2: Training with Framework Estimators

This section trains a model with the **PyTorch Framework Estimator** and a custom XGBoost training script. This is the modern, flexible approach (Lab 3 Option A, recommended).

**What You'll Learn:**
- Using PyTorch Framework Estimator for custom training logic
- Running XGBoost within PyTorch container (modern Python ecosystem)
- Creating custom training scripts with SageMaker conventions
- Passing hyperparameters as command-line arguments
- Automatic dependency installation via requirements.txt

**Key Approach:**
- **Framework Estimator**: PyTorch (provides modern container environment)
- **ML Library**: XGBoost (installed via requirements.txt)
- **Training Script**: Custom `train.py` with full control over training logic

**SDK vs. sagemaker-core:**
What took 50+ lines of TrainingJob configuration becomes just a few lines with the Framework Estimator pattern, while maintaining full flexibility through custom training scripts.
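
To make the "hyperparameters as command-line arguments" convention concrete, here is a minimal sketch of how a `train.py` entry point might read them. It is not the lab's actual script: the defaults simply mirror the estimator configuration used in this notebook, and the `parse_args` helper is a name chosen for illustration. The `SM_MODEL_DIR` and `SM_CHANNEL_*` environment variables are the ones the SageMaker training container sets for the model directory and each input channel.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters the way a script-mode entry point receives them."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as `--name value` pairs. argparse maps a flag
    # such as `--max-depth` to the attribute `max_depth`.
    parser.add_argument("--max-depth", type=int, default=5)
    parser.add_argument("--eta", type=float, default=0.2)
    parser.add_argument("--gamma", type=float, default=4.0)
    parser.add_argument("--min-child-weight", type=float, default=6.0)
    parser.add_argument("--subsample", type=float, default=0.8)
    parser.add_argument("--objective", type=str, default="binary:logistic")
    parser.add_argument("--num-round", type=int, default=100)
    parser.add_argument("--eval-metric", type=str, default="auc")

    # Data and model locations come from environment variables set by the
    # training container; the fallbacks are the conventional container paths.
    parser.add_argument(
        "--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    )
    parser.add_argument(
        "--train",
        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
    )
    parser.add_argument(
        "--validation",
        default=os.environ.get(
            "SM_CHANNEL_VALIDATION", "/opt/ml/input/data/validation"
        ),
    )

    # parse_known_args ignores any extra arguments instead of failing on them.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments a training job could pass to the script.
args = parse_args(["--max-depth", "7", "--eta", "0.1", "--num-round", "150"])
print(args.max_depth, args.eta, args.num_round)
```

Using `parse_known_args` rather than `parse_args` is a defensive choice: the script keeps working if the platform passes arguments the parser does not declare.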



In [None]:
from sagemaker.pytorch import PyTorch

# Create PyTorch Framework Estimator with custom XGBoost training script
# This uses PyTorch container for modern Python ecosystem while training with XGBoost
my_estimator = PyTorch(
    entry_point='train.py',           # Custom training script
    source_dir='src/',                # Directory with train.py and requirements.txt
    framework_version='2.6.0',          # PyTorch version (not XGBoost!)
    py_version='py312',
    role=lab_session.role,
    instance_type='ml.m5.xlarge',
    instance_count=1,
    output_path=lab_session.base_s3_uri,
    sagemaker_session=sagemaker_session,
    base_job_name='pytorch-xgboost-churn',

    # XGBoost hyperparameters (passed to train.py as CLI arguments)
    # Note: Use hyphens not underscores for CLI arg compatibility
    hyperparameters={
        'max-depth': 5,
        'eta': 0.2,
        'gamma': 4,
        'min-child-weight': 6,
        'subsample': 0.8,
        'objective': 'binary:logistic',
        'num-round': 100,
        'eval-metric': 'auc'
    }
)

print("✅ PyTorch Framework Estimator configured")
print(f"Training will use: {my_estimator.instance_type}")
print(f"Entry point: {my_estimator.entry_point}")
print(f"Output location: {my_estimator.output_path}")

In [14]:
# Train the model - just one line!
# Compare this to the complex TrainingJob.create() in sagemaker-core

# Output prefix of the preprocessing job.
# Assumes the job wrote train.csv, validation.csv and test.csv under this
# prefix; adjust the file names if yours differ.
s3_data_prefix = "s3://sagemaker-eu-central-1-136548476532/preprocessing_sdk/2025-10-22T09-00-07/customer-churn-2025-10-22T09-00-07/jobs/customer-churn-pytorch-processing-2025-10-22T09-00-42"

s3_train_path = f"{s3_data_prefix}/train.csv"
s3_validation_path = f"{s3_data_prefix}/validation.csv"
s3_test_path = f"{s3_data_prefix}/test.csv"


my_estimator.fit({
    'train': s3_train_path,
    'validation': s3_validation_path
})

print("✅ Training completed!")
print(f"Model artifacts: {my_estimator.model_data}")

## ⚡ Assignment 3: Hyperparameter Tuning with the SDK

This section demonstrates automated hyperparameter optimization using the **SageMaker SDK's HyperparameterTuner**.

**What You'll Learn:**
- Defining hyperparameter search spaces with typed parameters
- Configuring Bayesian optimization strategy
- Running parallel tuning jobs with resource management
- Analyzing tuning results and selecting best models

**SDK vs. sagemaker-core:**
The HyperparameterTuner class makes tuning much more intuitive compared to the complex HyperParameterTuningJobConfig shapes from Lab 1.
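
The tuner in the next cell relies on `metric_definitions` to pull the objective metric out of the training logs with regular expressions. The sketch below applies the same two patterns to a sample log line using only Python's `re` module. The log line is an assumption about what `train.py` prints (XGBoost's usual evaluation format with evaluation sets named `train` and `validation`), and `extract_metrics` is a hypothetical helper, not part of the SDK.

```python
import re

# Same patterns as the tuner's metric_definitions in this notebook.
METRIC_DEFINITIONS = [
    {"Name": "validation:auc", "Regex": "validation-auc:([0-9\\.]+)"},
    {"Name": "train:auc", "Regex": "train-auc:([0-9\\.]+)"},
]


def extract_metrics(log_line, metric_definitions=METRIC_DEFINITIONS):
    """Return {metric name: value} for every pattern that matches the line."""
    found = {}
    for definition in metric_definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            # The first capture group holds the numeric value.
            found[definition["Name"]] = float(match.group(1))
    return found


# Assumed log format: one line per boosting round.
sample_line = "[42]\ttrain-auc:0.98112\tvalidation-auc:0.93457"
print(extract_metrics(sample_line))
```

If the script logs under different names (for example `eval-auc`), the patterns match nothing and the tuning job has no objective value to optimize, so it is worth checking the regex against a real log line before launching a tuning job.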

In [None]:
from sagemaker.tuner import (
    HyperparameterTuner,
    IntegerParameter,
    ContinuousParameter
)

# Define hyperparameter ranges - much cleaner than sagemaker-core!
# Note: Use hyphens to match CLI argument format in train.py
hyperparameter_ranges = {
    'max-depth': IntegerParameter(3, 10),
    'eta': ContinuousParameter(0.01, 0.3),
    'gamma': ContinuousParameter(0, 5),
    'min-child-weight': ContinuousParameter(1, 10),
    'subsample': ContinuousParameter(0.5, 1.0),
    'num-round': IntegerParameter(50, 200)
}

# Create tuner with metric definitions
# IMPORTANT: metric_definitions is required for framework estimators (not built-in algorithms)
# since SageMaker doesn't know how to parse metrics from custom training scripts
tuner = HyperparameterTuner(
    my_estimator,
    objective_metric_name='validation:auc',
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=[
        {'Name': 'validation:auc', 'Regex': 'validation-auc:([0-9\\.]+)'},
        {'Name': 'train:auc', 'Regex': 'train-auc:([0-9\\.]+)'}
    ],
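    # With max_parallel_jobs equal to max_jobs, all jobs start together, so the
    # default Bayesian strategy cannot learn from earlier results.
    # Lower max_parallel_jobs (and raise max_jobs) for a real search.
    max_jobs=3,
    max_parallel_jobs=3,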
    base_tuning_job_name='pytorch-xgboost-tuning'
)

print("✅ Hyperparameter tuner configured")
print(f"Will run {tuner.max_jobs} tuning jobs")
print(f"Optimizing: {tuner.objective_metric_name}")

In [None]:
# Start tuning - one line vs complex sagemaker-core setup!
tuner.fit({
    'train': s3_train_path,
    'validation': s3_validation_path
})

print("✅ Hyperparameter tuning completed!")

# Get best training job details using HyperparameterTuningJobAnalytics
# Note: best_training_job() returns a string (job name), not a dictionary
from sagemaker.analytics import HyperparameterTuningJobAnalytics

tuner_analytics = HyperparameterTuningJobAnalytics(
    tuner.latest_tuning_job.name,
    sagemaker_session=sagemaker_session,  # reuse the profile-based session
)
full_df = tuner_analytics.dataframe()

# Get best training job (highest validation:auc)
best_job_row = full_df.sort_values('FinalObjectiveValue', ascending=False).iloc[0]

print(f"\nBest job: {best_job_row['TrainingJobName']}")
print(f"Best AUC: {best_job_row['FinalObjectiveValue']:.4f}")

print("\nBest hyperparameters:")
for key in hyperparameter_ranges.keys():
    print(f"  {key}: {best_job_row[key]}")

## Model Deployment

Next we deploy the model in two ways: a serverless endpoint and a batch transform job. A provisioned endpoint would reuse the same model object.

## Create PyTorchModel with Custom Inference Handler

Before deploying, we create a `PyTorchModel` with our custom `inference.py` handler. This model will be reused for all deployment types (provisioned endpoint, serverless endpoint, and batch transform), ensuring consistent inference behavior.

In [None]:
final_estimator = tuner.best_estimator() if 'tuner' in locals() else my_estimator

# Create PyTorchModel with custom inference handler
# This will be reused for all deployments (endpoints and batch transform)

churn_model = final_estimator.create_model(source_dir=final_estimator.source_dir, entry_point='inference.py')

print("✅ Churn model created with custom inference handler")
print(f"Model data: {churn_model.model_data}")
print(f"Entry point: {churn_model.entry_point}")
print(f"Model name: {churn_model.name}")


In [None]:
from sagemaker import Predictor
from sagemaker.serverless import ServerlessInferenceConfig

# Clean up any previous endpoint with the same name
try:
    p = Predictor(
        endpoint_name=lab_session.serverless_endpoint_name,
        sagemaker_session=sagemaker_session,
    )
    p.delete_endpoint()  # also deletes the endpoint config by default
    print("Removed previous endpoint and its config")
except Exception as e:
    print(f"No previous endpoint found ({e})")

# Deploy serverless endpoint using the same PyTorchModel
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=5,
)

serverless_predictor = churn_model.deploy(
    serverless_inference_config=serverless_config,
    endpoint_name=lab_session.serverless_endpoint_name,
)

print(f"✅ Serverless model deployed: {serverless_predictor.endpoint_name}")
print(f"Memory: {serverless_config.memory_size_in_mb}MB")
print(f"Max concurrency: {serverless_config.max_concurrency}")

## Batch Transform

The SageMaker SDK also simplifies batch inference.

**Key Points:**

1. **Reusing PyTorchModel**: We use the same `PyTorchModel` created earlier that includes our custom `inference.py` handler. This ensures consistent inference behavior across endpoints and batch transform.

2. **Custom Inference Handler**: The `inference.py` script handles XGBoost models in the PyTorch container with four functions:
   - `model_fn()`: Load the XGBoost model from disk
   - `input_fn()`: Parse CSV input into XGBoost DMatrix (handles structured arrays)
   - `predict_fn()`: Run inference with the model
   - `output_fn()`: Format predictions as CSV output
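
The four handlers above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the lab's actual `inference.py`: the model file name `xgboost-model` is hypothetical, and for simplicity this version parses the CSV into plain Python lists in `input_fn()` and builds the `DMatrix` in `predict_fn()`, whereas the lab script builds it during input parsing. The `xgboost` and `numpy` imports are deferred so the parsing and formatting functions can be tried without those packages installed.

```python
import os


def model_fn(model_dir):
    """Load the trained booster from the extracted model archive."""
    import xgboost as xgb  # deferred import; installed via requirements.txt

    booster = xgb.Booster()
    # "xgboost-model" is a hypothetical file name; use whatever train.py saved.
    booster.load_model(os.path.join(model_dir, "xgboost-model"))
    return booster


def input_fn(request_body, content_type="text/csv"):
    """Parse a CSV payload into a list of rows of floats."""
    if content_type != "text/csv":
        raise ValueError(f"Unsupported content type: {content_type}")
    if isinstance(request_body, (bytes, bytearray)):
        request_body = request_body.decode("utf-8")
    rows = [line for line in request_body.splitlines() if line.strip()]
    return [[float(value) for value in row.split(",")] for row in rows]


def predict_fn(input_data, model):
    """Wrap the parsed rows in a DMatrix and run the booster."""
    import numpy as np
    import xgboost as xgb

    dmatrix = xgb.DMatrix(np.asarray(input_data, dtype=float))
    return model.predict(dmatrix)


def output_fn(prediction, accept="text/csv"):
    """Format predictions as one value per line."""
    if accept != "text/csv":
        raise ValueError(f"Unsupported accept type: {accept}")
    return "\n".join(str(float(p)) for p in prediction)


# Parsing and formatting can be exercised locally without a model.
print(input_fn("1,2\n3,4\n"))
print(output_fn([0.25, 0.75]))
```

Keeping `input_fn()` and `output_fn()` free of model dependencies makes them easy to unit test before paying for an endpoint or a transform job.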

In [None]:
# Create a transformer from the PyTorchModel with the custom inference handler
# Uses the same `churn_model` we created earlier with inference.py

transformer = churn_model.transformer(
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path=lab_session.transform_output_s3_uri,
)

# Run batch transform
transformer.transform(
    data=s3_test_path,
    content_type='text/csv',
    split_type='Line'
)

print("✅ Batch transform completed!")
print(f"Results saved to: {transformer.output_path}")

## Clean Inference

This is where the SageMaker SDK really shines - compare this clean interface to the fiddly `invoke()` + `read()` + `decode()` + `split()` in sagemaker-core!
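
For contrast, this is roughly the decode-and-split work that a manual invocation leaves to the caller. The sketch assumes the endpoint returns CSV text; `raw_body` is a made-up payload and `parse_csv_response` is a hypothetical helper, since the actual response format depends on what `inference.py` emits.

```python
def parse_csv_response(body: bytes) -> list:
    """Turn a raw CSV response body into a flat list of floats."""
    text = body.decode("utf-8")
    # Treat newlines like commas so both row and column layouts are handled.
    values = text.replace("\n", ",").split(",")
    return [float(value) for value in values if value.strip()]


# Made-up payload standing in for the bytes read from a raw response stream.
raw_body = b"0.0312,0.9541\n0.1178\n"
print(parse_csv_response(raw_body))
```

With the SDK, a predictor's serializer and deserializer take care of this step, which is why the cell below can pass an array straight to `.predict()`.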

In [None]:
from io import StringIO
from sagemaker.s3 import S3Downloader
import time
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Load the test set from S3 (s3_test_path already points at the CSV file)
test_features = pd.read_csv(
    StringIO(S3Downloader.read_file(s3_test_path, sagemaker_session=sagemaker_session))
)

# Take a small sample to test the endpoint - no more fiddly response parsing!
sample_data = test_features.head(10).values

# Serverless endpoint  
print("☁️  SERVERLESS ENDPOINT:")
start_time = time.time()
serverless_predictions = serverless_predictor.predict(sample_data)  # Also clean!
serverless_latency = (time.time() - start_time) * 1000

print(f"   Predictions shape: {np.array(serverless_predictions).shape}")
print(f"   Latency: {serverless_latency:.1f}ms")
print(f"   Sample predictions: {serverless_predictions[:4]}")

print()

# Compare results
# predictions_match = np.allclose(provisioned_predictions, serverless_predictions, rtol=1e-5)
# print(f"✅ Predictions match: {predictions_match}")
# print(f"📊 Latency difference: {abs(serverless_latency - provisioned_latency):.1f}ms")

## Cleanup

The SageMaker SDK also makes cleanup simpler with built-in methods.

In [None]:
# Clean up resources - comprehensive cleanup including configurations and models

print("🧹 Cleaning up resources...")

try:
    serverless_predictor.delete_endpoint()
    print("✅ Serverless endpoint deleted")
except Exception as e:
    print(f"⚠️  Could not delete serverless endpoint: {e}")

try:
    churn_model.delete_model()
    print("✅ Model deleted")
except Exception as e:
    print(f"⚠️  Could not delete churn model: {e}")

sagemaker_client = sagemaker_session.sagemaker_client

# List any remaining resources for verification
print("\n📋 Checking for remaining resources...")
try:
    # Check for any endpoints with our prefix
    remaining_endpoints = sagemaker_client.list_endpoints(
        NameContains='customer-churn-pytorch',
        MaxResults=10
    )
    if remaining_endpoints['Endpoints']:
        print(f"⚠️  Found {len(remaining_endpoints['Endpoints'])} remaining endpoints")
        for ep in remaining_endpoints['Endpoints']:
            print(f"   - {ep['EndpointName']}")
    else:
        print("✅ No remaining endpoints found")
        
    # Check for endpoint configs
    remaining_configs = sagemaker_client.list_endpoint_configs(
        NameContains='customer-churn-pytorch',
        MaxResults=10
    )
    if remaining_configs['EndpointConfigs']:
        print(f"⚠️  Found {len(remaining_configs['EndpointConfigs'])} remaining endpoint configs")
        for config in remaining_configs['EndpointConfigs']:
            print(f"   - {config['EndpointConfigName']}")
    else:
        print("✅ No remaining endpoint configs found")
        
except Exception as e:
    print(f"Could not list remaining resources: {e}")

print("\n✨ Cleanup completed!")
print(f"   Storage location: {lab_session.base_s3_uri}")
print("\n📝 Remember to delete S3 data when you're completely done!")

## Summary: SageMaker SDK vs sagemaker-core

This notebook demonstrates the dramatic improvements in developer experience when using the SageMaker SDK:

### Code Reduction:
- **Training**: 50+ lines → 10 lines (roughly 80% reduction)
- **Hyperparameter Tuning**: 40+ lines → 15 lines (over 60% reduction)
- **Deployment**: 30+ lines → 5 lines (roughly 85% reduction)
- **Inference**: Fiddly response parsing → Clean `.predict()` calls

### Developer Experience:
- ✅ **Intuitive**: ML-focused abstractions (Estimators, Predictors)
- ✅ **Less error-prone**: Automatic configuration and validation
- ✅ **Cleaner inference**: No manual response parsing
- ✅ **Better debugging**: Framework-specific error handling
- ✅ **Local mode**: Test everything locally before deployment

### When to use each:
- **SageMaker SDK**: ML development, experimentation, production ML workflows
- **sagemaker-core**: Infrastructure management, custom tooling, precise AWS API control

### Best of both worlds:
Our `CoreLabSession` provides session management while SageMaker SDK handles ML operations - giving you both control and convenience!