# SageMaker SDK Training & Hyperparameter Tuning

**Lab 3 - Assignments 2 & 3 Answer Notebook**

This notebook demonstrates model training and hyperparameter tuning using the SageMaker Python SDK, answering:
- **Assignment 2**: Training with Framework Estimators
- **Assignment 3**: Hyperparameter Tuning with the SDK

**Key Benefits of SDK Approach:**
- Substantially less code than sagemaker-core (roughly 80% less for the training step)
- High-level ML abstractions (Estimators, Tuners, Predictors)
- Automatic handling of AWS resource configuration
- Clean inference with automatic serialization
- Integrated hyperparameter tuning workflow

In [None]:
%load_ext autoreload
%autoreload 2

## Setup and Configuration

We'll keep using our existing `CoreLabSession` for session management, but switch to the SageMaker SDK for ML operations.

In [None]:
from corelab.core.session import CoreLabSession

# Use our custom session for authentication and S3 management
lab_session = CoreLabSession(
    'pytorch',
    'customer-churn',
    default_folder='sagemaker_sdk_notebook',
    create_run_folder=True,
    aws_profile='sagemaker-role',
)
lab_session.print()

# Get SageMaker session for SDK integration
sagemaker_session = lab_session.get_sagemaker_session()

## Data Preparation

Same data preparation as the core notebook - this part doesn't change.

In [None]:
from io import StringIO
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
data = lab_session.core_session.read_s3_file(
    f"sagemaker-example-files-prod-{lab_session.region}", 
    "datasets/tabular/synthetic/churn.txt"
)
df = pd.read_csv(StringIO(data))

print(f"Dataset shape: {df.shape}")
print(f"Churn rate: {df['Churn?'].value_counts(normalize=True)['True.']:.1%}")

In [None]:
# Data preprocessing
df = df.drop("Phone", axis=1)  # Remove unique identifier
df["Area Code"] = df["Area Code"].astype(object)  # Treat as categorical

# Remove highly correlated features (charges are derived from minutes)
df = df.drop(["Day Charge", "Eve Charge", "Night Charge", "Intl Charge"], axis=1)

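# One-hot encode categorical features
# dtype=int keeps the CSV numeric (recent pandas versions default to True/False columns)
model_data = pd.get_dummies(df, dtype=int)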

# Move target to first column (XGBoost convention)
model_data = pd.concat([
    model_data["Churn?_True."],
    model_data.drop(["Churn?_False.", "Churn?_True."], axis=1),
], axis=1)

print(f"Processed data shape: {model_data.shape}")
print(f"Features: {model_data.shape[1] - 1}")

In [None]:
# Split into train/validation/test
train_data, temp_data = train_test_split(model_data, test_size=0.33, random_state=42)
validation_data, test_data = train_test_split(temp_data, test_size=0.33, random_state=42)

print(f"Train: {train_data.shape[0]} samples")
print(f"Validation: {validation_data.shape[0]} samples") 
print(f"Test: {test_data.shape[0]} samples")

# Save and upload datasets
train_data.to_csv("train.csv", header=False, index=False)
validation_data.to_csv("validation.csv", header=False, index=False)

# Store test target separately for evaluation
test_target = test_data.iloc[:, 0]  # First column is target
test_features = test_data.iloc[:, 1:]  # Rest are features
test_features.to_csv("test.csv", header=False, index=False)

# Upload to S3
s3_train_path = lab_session.core_session.upload_data("train.csv")
s3_validation_path = lab_session.core_session.upload_data("validation.csv")
s3_test_path = lab_session.core_session.upload_data("test.csv")

print("\nData uploaded to S3:")
print(f"Train: {s3_train_path}")
print(f"Validation: {s3_validation_path}")
print(f"Test: {s3_test_path}")
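
Because the second `train_test_split` call is applied to the 33% holdout from the first, the final proportions are roughly 67% train, 22% validation, and 11% test. The arithmetic can be sketched as follows, assuming scikit-learn rounds a fractional `test_size` up and gives the remainder to the other side. The helper name `chained_split_sizes` and the row count are hypothetical and used for illustration only.

```python
import math


def chained_split_sizes(n_rows, first_test_size=0.33, second_test_size=0.33):
    """Return (train, validation, test) row counts for two chained splits.

    Assumes the rounding rule for a float test_size is ceil(test_size * n)
    for the test side, with the remaining rows going to the other side.
    """
    n_temp = math.ceil(first_test_size * n_rows)   # holdout from the first split
    n_train = n_rows - n_temp
    n_test = math.ceil(second_test_size * n_temp)  # test side of the second split
    n_validation = n_temp - n_test
    return n_train, n_validation, n_test


if __name__ == "__main__":
    n = 1000  # hypothetical row count, not the size of the churn dataset
    train, validation, test = chained_split_sizes(n)
    print(f"train={train}, validation={validation}, test={test}")
```

If a larger test set is needed, raising the second `test_size` towards 0.5 splits the holdout evenly between validation and test.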

## 🎓 Assignment 2: Training with Framework Estimators

This section demonstrates training with the **PyTorch Framework Estimator** with a custom XGBoost training script - the modern, flexible approach (Lab 3 Option A - Recommended).

**What You'll Learn:**
- Using PyTorch Framework Estimator for custom training logic
- Running XGBoost within PyTorch container (modern Python ecosystem)
- Creating custom training scripts with SageMaker conventions
- Passing hyperparameters as command-line arguments
- Automatic dependency installation via requirements.txt

**Key Approach:**
- **Framework Estimator**: PyTorch (provides modern container environment)
- **ML Library**: XGBoost (installed via requirements.txt)
- **Training Script**: Custom `train.py` with full control over training logic

**SDK vs. sagemaker-core:**
What took 50+ lines of TrainingJob configuration becomes just a few lines with the Framework Estimator pattern, while maintaining full flexibility through custom training scripts.
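
SageMaker passes each entry of the estimator's `hyperparameters` dict to the entry point as a `--name value` command-line pair, and exposes the data channels and model directory through `SM_*` environment variables. The lab's `train.py` is not shown in this notebook, so the following is only a minimal sketch of how such a script might read its inputs. The function name `parse_args`, the default values, and the `--model-dir`, `--train`, and `--validation` flags are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker paths (illustrative sketch)."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as CLI flags; argparse exposes the hyphenated
    # flag --max-depth as the attribute args.max_depth.
    parser.add_argument("--max-depth", type=int, default=5)
    parser.add_argument("--eta", type=float, default=0.2)
    parser.add_argument("--gamma", type=float, default=4.0)
    parser.add_argument("--min-child-weight", type=float, default=6.0)
    parser.add_argument("--subsample", type=float, default=0.8)
    parser.add_argument("--objective", type=str, default="binary:logistic")
    parser.add_argument("--num-round", type=int, default=100)
    parser.add_argument("--eval-metric", type=str, default="auc")

    # Directories come from environment variables set inside the training
    # container; the fallbacks are the usual container paths.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    parser.add_argument("--validation", default=os.environ.get("SM_CHANNEL_VALIDATION", "/opt/ml/input/data/validation"))

    # parse_known_args ignores any extra flags instead of failing on them.
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    args = parse_args(["--max-depth", "7", "--eta", "0.1", "--num-round", "150"])
    print(args.max_depth, args.eta, args.num_round, args.objective)
```

This is why the hyperparameter keys in the estimator use hyphens: they must match the flag names the script declares.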

In [None]:
from sagemaker.pytorch import PyTorch

# Create PyTorch Framework Estimator with custom XGBoost training script
# This uses PyTorch container for modern Python ecosystem while training with XGBoost
xgb_estimator = PyTorch(
    entry_point='train.py',           # Custom training script
    source_dir='src/',                # Directory with train.py and requirements.txt
    framework_version='2.6.0',          # PyTorch version (not XGBoost!)
    py_version='py312',
    role=lab_session.role,
    instance_type='ml.m5.xlarge',
    instance_count=1,
    output_path=lab_session.base_s3_uri,
    sagemaker_session=sagemaker_session,
    base_job_name='pytorch-xgboost-churn',

    # XGBoost hyperparameters (passed to train.py as CLI arguments)
    # Note: Use hyphens not underscores for CLI arg compatibility
    hyperparameters={
        'max-depth': 5,
        'eta': 0.2,
        'gamma': 4,
        'min-child-weight': 6,
        'subsample': 0.8,
        'objective': 'binary:logistic',
        'num-round': 100,
        'eval-metric': 'auc'
    }
)

print("✅ PyTorch Framework Estimator configured")
print(f"Training will use: {xgb_estimator.instance_type}")
print(f"Entry point: {xgb_estimator.entry_point}")
print(f"Output location: {xgb_estimator.output_path}")

In [None]:
# Train the model - just one line!
# Compare this to the complex TrainingJob.create() in sagemaker-core

xgb_estimator.fit({
    'train': s3_train_path,
    'validation': s3_validation_path
})

print("✅ Training completed!")
print(f"Model artifacts: {xgb_estimator.model_data}")

## ⚡ Assignment 3: Hyperparameter Tuning with the SDK

This section demonstrates automated hyperparameter optimization using the **SageMaker SDK's HyperparameterTuner**.

**What You'll Learn:**
- Defining hyperparameter search spaces with typed parameters
- Configuring Bayesian optimization strategy
- Running parallel tuning jobs with resource management
- Analyzing tuning results and selecting best models

**SDK vs. sagemaker-core:**
The HyperparameterTuner class makes tuning much more intuitive compared to the complex HyperParameterTuningJobConfig shapes from Lab 1.

In [None]:
from sagemaker.tuner import (
    HyperparameterTuner,
    IntegerParameter,
    ContinuousParameter
)

# Define hyperparameter ranges - much cleaner than sagemaker-core!
# Note: Use hyphens to match CLI argument format in train.py
hyperparameter_ranges = {
    'max-depth': IntegerParameter(3, 10),
    'eta': ContinuousParameter(0.01, 0.3),
    'gamma': ContinuousParameter(0, 5),
    'min-child-weight': ContinuousParameter(1, 10),
    'subsample': ContinuousParameter(0.5, 1.0),
    'num-round': IntegerParameter(50, 200)
}

# Create tuner with metric definitions
# IMPORTANT: metric_definitions is required for framework estimators (not built-in algorithms)
# since SageMaker doesn't know how to parse metrics from custom training scripts
tuner = HyperparameterTuner(
    xgb_estimator,
    objective_metric_name='validation:auc',
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=[
        {'Name': 'validation:auc', 'Regex': 'validation-auc:([0-9\\.]+)'},
        {'Name': 'train:auc', 'Regex': 'train-auc:([0-9\\.]+)'}
    ],
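    max_jobs=3,
    # With max_parallel_jobs == max_jobs, all jobs start together, so the
    # Bayesian strategy cannot learn from earlier results; lower it for real tuning
    max_parallel_jobs=3,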
    base_tuning_job_name='pytorch-xgboost-tuning'
)

print("✅ Hyperparameter tuner configured")
print(f"Will run {tuner.max_jobs} tuning jobs")
print(f"Optimizing: {tuner.objective_metric_name}")
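
The regular expressions in `metric_definitions` are matched against the training job's log output, so they only work if `train.py` prints metrics in a matching format. A quick local check can catch a mismatch before any tuning job is launched. The sketch below assumes the script emits XGBoost-style evaluation lines; the sample log line and the helper name `extract_metrics` are hypothetical.

```python
import re

# Same patterns as in the tuner's metric_definitions.
METRIC_PATTERNS = {
    "validation:auc": r"validation-auc:([0-9\.]+)",
    "train:auc": r"train-auc:([0-9\.]+)",
}


def extract_metrics(log_line):
    """Return {metric name: value} for every pattern found in a log line."""
    found = {}
    for name, pattern in METRIC_PATTERNS.items():
        match = re.search(pattern, log_line)
        if match:
            found[name] = float(match.group(1))
    return found


if __name__ == "__main__":
    # Hypothetical log line in XGBoost's evaluation-log style.
    sample = "[42]\ttrain-auc:0.98123\tvalidation-auc:0.93456"
    print(extract_metrics(sample))
```

If the function returns an empty dict for a real log line from the training job, the tuner will not find its objective metric either.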

In [None]:
# Start tuning - one line vs complex sagemaker-core setup!
tuner.fit({
    'train': s3_train_path,
    'validation': s3_validation_path
})

print("✅ Hyperparameter tuning completed!")

# Get best training job details using HyperparameterTuningJobAnalytics
# Note: best_training_job() returns a string (job name), not a dictionary
from sagemaker.analytics import HyperparameterTuningJobAnalytics

tuner_analytics = HyperparameterTuningJobAnalytics(tuner.latest_tuning_job.name)
full_df = tuner_analytics.dataframe()

# Get best training job (highest validation:auc)
best_job_row = full_df.sort_values('FinalObjectiveValue', ascending=False).iloc[0]

print(f"\nBest job: {best_job_row['TrainingJobName']}")
print(f"Best AUC: {best_job_row['FinalObjectiveValue']:.4f}")

print("\nBest hyperparameters:")
for key in hyperparameter_ranges.keys():
    print(f"  {key}: {best_job_row[key]}")

## Model Deployment

Now we'll deploy the model in two ways: a serverless endpoint and batch transform. The same model object could also back a provisioned endpoint, which this notebook does not deploy.

## Create PyTorchModel with Custom Inference Handler

Before deploying, we create a `PyTorchModel` with our custom `inference.py` handler. This model is reused for every deployment in this notebook (serverless endpoint and batch transform), ensuring consistent inference behavior.

In [None]:
# Prefer the best model from tuning; fall back to the single training job
final_estimator = tuner.best_estimator() if 'tuner' in locals() else xgb_estimator

# Create PyTorchModel with custom inference handler
# This will be reused for all deployments (endpoints and batch transform)
pytorch_model = final_estimator.create_model(
    source_dir=xgb_estimator.source_dir,
    entry_point='inference.py',
)

print("✅ PyTorchModel created with custom inference handler")
print(f"Model data: {pytorch_model.model_data}")
print(f"Entry point: {pytorch_model.entry_point}")
print(f"Model name: {pytorch_model.name}")


In [None]:
from sagemaker.serverless import ServerlessInferenceConfig

# Deploy serverless endpoint using the same PyTorchModel
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=10,
)

serverless_predictor = pytorch_model.deploy(
    serverless_inference_config=serverless_config,
    endpoint_name=lab_session.serverless_endpoint_name
)

print(f"✅ Serverless model deployed: {serverless_predictor.endpoint_name}")
print(f"Memory: {serverless_config.memory_size_in_mb}MB")
print(f"Max concurrency: {serverless_config.max_concurrency}")

## Batch Transform

The SageMaker SDK also simplifies batch inference.

**Key Points:**

1. **Reusing PyTorchModel**: We use the same `PyTorchModel` created earlier that includes our custom `inference.py` handler. This ensures consistent inference behavior across endpoints and batch transform.

2. **Custom Inference Handler**: The `inference.py` script handles XGBoost models in the PyTorch container with four functions:
   - `model_fn()`: Load the XGBoost model from disk
   - `input_fn()`: Parse CSV input into XGBoost DMatrix (handles structured arrays)
   - `predict_fn()`: Run inference with the model
   - `output_fn()`: Format predictions as CSV output
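
The lab's `inference.py` is not reproduced in this notebook. As a rough illustration of the four-function contract described above, a handler for the CSV path might look like the sketch below. The model file name `xgboost-model.json` and the helper `parse_csv` are assumptions, and the real handler may accept additional content types.

```python
import os


def parse_csv(text):
    """Turn CSV text into a list of float rows, skipping blank lines."""
    return [
        [float(value) for value in line.split(",")]
        for line in text.strip().splitlines()
        if line.strip()
    ]


def model_fn(model_dir):
    """Load the booster saved by train.py (the file name is an assumption)."""
    import xgboost as xgb  # installed in the container via requirements.txt

    booster = xgb.Booster()
    booster.load_model(os.path.join(model_dir, "xgboost-model.json"))
    return booster


def input_fn(request_body, content_type="text/csv"):
    """Parse a CSV request body into an XGBoost DMatrix."""
    import numpy as np
    import xgboost as xgb

    if content_type != "text/csv":
        raise ValueError(f"Unsupported content type: {content_type}")
    if isinstance(request_body, (bytes, bytearray)):
        request_body = request_body.decode("utf-8")
    return xgb.DMatrix(np.asarray(parse_csv(request_body), dtype=float))


def predict_fn(data, model):
    """Score the DMatrix; binary:logistic yields one probability per row."""
    return model.predict(data)


def output_fn(predictions, accept="text/csv"):
    """Format predictions as CSV, one value per line."""
    return "\n".join(str(float(p)) for p in predictions)
```

Keeping the CSV parsing in a small pure-Python helper makes it easy to unit test the handler locally without a model or a container.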

In [None]:
# Create transformer from the PyTorchModel with custom inference handler
# Uses the same pytorch_model we created earlier with inference.py

transformer = pytorch_model.transformer(
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path=lab_session.transform_output_s3_uri,
)

# Run batch transform
transformer.transform(
    data=s3_test_path,
    content_type='text/csv',
    split_type='Line'
)

print("✅ Batch transform completed!")
print(f"Results saved to: {transformer.output_path}")

## Clean Inference

This is where the SageMaker SDK really shines - compare this clean interface to the fiddly `invoke()` + `read()` + `decode()` + `split()` in sagemaker-core!
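
For contrast, the manual path looks roughly like the sketch below. It uses the low-level `boto3` runtime client rather than sagemaker-core itself, and it assumes the endpoint accepts and returns CSV. The helper names and the `endpoint_name` argument are placeholders.

```python
def rows_to_csv(rows):
    """Serialize feature rows to a CSV request body."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def parse_csv_scores(body_bytes):
    """Decode a CSV response body into a flat list of floats."""
    text = body_bytes.decode("utf-8")
    tokens = text.replace("\n", ",").split(",")
    return [float(token) for token in tokens if token.strip()]


def invoke_manually(endpoint_name, rows, region_name=None):
    """Call the endpoint without a Predictor (requires AWS credentials)."""
    import boto3  # imported lazily so the helpers above work offline

    runtime = boto3.client("sagemaker-runtime", region_name=region_name)
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Accept="text/csv",
        Body=rows_to_csv(rows),
    )
    # The body is a stream: read it, decode it, then split it into numbers.
    return parse_csv_scores(response["Body"].read())
```

With a `Predictor`, serialization and deserialization are configured once and `.predict()` handles the rest.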

In [None]:
import time
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Test the serverless endpoint with the clean interface - no more fiddly response parsing!
sample_data = test_features.head(10).values

# Serverless endpoint  
print("☁️  SERVERLESS ENDPOINT:")
start_time = time.time()
serverless_predictions = serverless_predictor.predict(sample_data)
serverless_latency = (time.time() - start_time) * 1000

print(f"   Predictions shape: {np.array(serverless_predictions).shape}")
print(f"   Latency: {serverless_latency:.1f}ms")
print(f"   Sample predictions: {serverless_predictions[:4]}")

print()

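# Optional comparison against a provisioned endpoint (not deployed in this notebook).
# Uncomment only after defining provisioned_predictions and provisioned_latency:
# predictions_match = np.allclose(provisioned_predictions, serverless_predictions, rtol=1e-5)
# print(f"✅ Predictions match: {predictions_match}")
# print(f"📊 Latency difference: {abs(serverless_latency - provisioned_latency):.1f}ms")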

In [None]:
# Evaluate on full test set
print("=== MODEL PERFORMANCE ===")

# Get predictions for full test set
test_predictions = serverless_predictor.predict(test_features.values)
test_probabilities = np.array(test_predictions)
test_binary = (test_probabilities >= 0.5).astype(int)

# Calculate metrics
accuracy = accuracy_score(test_target, test_binary)
precision = precision_score(test_target, test_binary)
recall = recall_score(test_target, test_binary)
auc = roc_auc_score(test_target, test_probabilities)

print("Test Set Performance:")
print(f"  Accuracy:  {accuracy:.4f}")
print(f"  Precision: {precision:.4f}")
print(f"  Recall:    {recall:.4f}")
print(f"  ROC AUC:   {auc:.4f}")

print(f"\n📊 Tested on {len(test_target)} samples")
print(f"🎯 Churn rate in test set: {test_target.mean():.1%}")

## Cleanup

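The SageMaker SDK's predictors expose `delete_endpoint()` for the endpoint itself. The remaining resources (endpoint config and model) are cleaned up explicitly, followed by a check for leftovers.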

In [None]:
# Clean up resources - comprehensive cleanup including configurations and models

print("🧹 Cleaning up resources...")

# Import boto3 for comprehensive cleanup
import boto3
sagemaker_client = boto3.client('sagemaker', region_name=lab_session.region)

try:
    serverless_predictor.delete_endpoint(delete_endpoint_config=False)  # Don't auto-delete config  
    print("✅ Serverless endpoint deleted")
except Exception as e:
    print(f"⚠️  Could not delete serverless endpoint: {e}")


try:
    serverless_config_name = serverless_predictor.endpoint_name
    sagemaker_client.describe_endpoint_config(EndpointConfigName=serverless_config_name)
    sagemaker_client.delete_endpoint_config(EndpointConfigName=serverless_config_name)
    print(f"✅ Deleted endpoint config: {serverless_config_name}")
except sagemaker_client.exceptions.ClientError as e:
    if e.response['Error']['Code'] == 'ValidationException':
        print(f"ℹ️  Endpoint config {serverless_config_name} not found or already deleted")
    else:
        print(f"⚠️  Could not delete serverless endpoint config: {e}")

# Delete models
print("\n🗑️  Deleting models...")
try:
    # The endpoint and its config were deleted above, so the model name can no
    # longer be looked up through them; delete via the PyTorchModel instead
    model_name = pytorch_model.name
    pytorch_model.delete_model()
    print(f"✅ Deleted model: {model_name}")
except Exception as e:
    print(f"ℹ️  Could not delete model (it might already be deleted): {e}")

# List any remaining resources for verification
print("\n📋 Checking for remaining resources...")
try:
    # Check for any endpoints with our prefix
    remaining_endpoints = sagemaker_client.list_endpoints(
        NameContains='customer-churn-pytorch',
        MaxResults=10
    )
    if remaining_endpoints['Endpoints']:
        print(f"⚠️  Found {len(remaining_endpoints['Endpoints'])} remaining endpoints")
        for ep in remaining_endpoints['Endpoints']:
            print(f"   - {ep['EndpointName']}")
    else:
        print("✅ No remaining endpoints found")
        
    # Check for endpoint configs
    remaining_configs = sagemaker_client.list_endpoint_configs(
        NameContains='customer-churn-pytorch',
        MaxResults=10
    )
    if remaining_configs['EndpointConfigs']:
        print(f"⚠️  Found {len(remaining_configs['EndpointConfigs'])} remaining endpoint configs")
        for config in remaining_configs['EndpointConfigs']:
            print(f"   - {config['EndpointConfigName']}")
    else:
        print("✅ No remaining endpoint configs found")
        
except Exception as e:
    print(f"Could not list remaining resources: {e}")

print("\n✨ Cleanup completed!")
print(f"   Storage location: {lab_session.base_s3_uri}")
print("\n📝 Remember to delete S3 data when you're completely done!")

## Summary: SageMaker SDK vs sagemaker-core

This notebook demonstrates the dramatic improvements in developer experience when using the SageMaker SDK:

### Code Reduction:
- **Training**: 50+ lines → 10 lines (80% reduction)
- **Hyperparameter Tuning**: 40+ lines → 15 lines (over 60% reduction)
- **Deployment**: 30+ lines → 5 lines (over 80% reduction)
- **Inference**: Fiddly response parsing → Clean `.predict()` calls

### Developer Experience:
- ✅ **Intuitive**: ML-focused abstractions (Estimators, Predictors)
- ✅ **Less error-prone**: Automatic configuration and validation
- ✅ **Cleaner inference**: No manual response parsing
- ✅ **Better debugging**: Framework-specific error handling
- ✅ **Local mode**: Test everything locally before deployment

### When to use each:
- **SageMaker SDK**: ML development, experimentation, production ML workflows
- **sagemaker-core**: Infrastructure management, custom tooling, precise AWS API control

### Best of both worlds:
Our `CoreLabSession` provides session management while SageMaker SDK handles ML operations - giving you both control and convenience!