# SageMaker SDK 2025 Approach

This notebook demonstrates the same customer churn prediction workflow as `core.ipynb` but using the modern SageMaker Python SDK for a much cleaner developer experience.

**Key Benefits:**
- Substantially less code for the same functionality (roughly 80% less for the training step)
- High-level ML abstractions (Estimators, Predictors)
- Automatic handling of AWS resource configuration
- Clean inference with automatic serialization
- Better error handling and debugging

In [None]:
%load_ext autoreload
%autoreload 2

## Setup and Configuration

We'll use our existing `CoreLabSession` for session management but switch to SageMaker SDK for ML operations.

In [None]:
from corelab.core.session import CoreLabSession

# Use our custom session for authentication and S3 management
lab_session = CoreLabSession('xgboost', 'customer-churn', 
                            default_folder='sagemaker_sdk_notebook', 
                            create_run_folder=True)
lab_session.print()

# Get SageMaker session for SDK integration
sagemaker_session = lab_session.get_sagemaker_session()

## Data Preparation

Same data preparation as the core notebook - this part doesn't change.

In [None]:
from io import StringIO
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
data = lab_session.core_session.read_s3_file(
    f"sagemaker-example-files-prod-{lab_session.region}", 
    "datasets/tabular/synthetic/churn.txt"
)
df = pd.read_csv(StringIO(data))

print(f"Dataset shape: {df.shape}")
print(f"Churn rate: {df['Churn?'].value_counts(normalize=True)['True.']:.1%}")

In [None]:
# Data preprocessing
df = df.drop("Phone", axis=1)  # Remove unique identifier
df["Area Code"] = df["Area Code"].astype(object)  # Treat as categorical

# Remove highly correlated features (charges are derived from minutes)
df = df.drop(["Day Charge", "Eve Charge", "Night Charge", "Intl Charge"], axis=1)

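# One-hot encode categorical features.
# dtype=int keeps the encoded columns numeric (0/1); newer pandas versions
# default to booleans, which the CSV export would write as True/False.
model_data = pd.get_dummies(df, dtype=int)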

# Move target to first column (the built-in XGBoost algorithm expects the label first in CSV input)
model_data = pd.concat([
    model_data["Churn?_True."],
    model_data.drop(["Churn?_False.", "Churn?_True."], axis=1),
], axis=1)

print(f"Processed data shape: {model_data.shape}")
print(f"Features: {model_data.shape[1] - 1}")
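For reference, the headerless, label-first layout described above can be sketched with the standard library alone. The two records and the helper name `to_label_first_rows` are made up for illustration; they are not rows from the churn dataset or part of any SDK.

```python
import csv
import io

# Hypothetical records for illustration only (not taken from churn.txt).
records = [
    {"Day Mins": 120.5, "CustServ Calls": 1, "Churn?_True.": 0},
    {"Day Mins": 310.2, "CustServ Calls": 5, "Churn?_True.": 1},
]
target = "Churn?_True."


def to_label_first_rows(rows, target_column):
    """Return each record as a list with the label first, then the features."""
    feature_columns = [c for c in rows[0] if c != target_column]
    return [[r[target_column]] + [r[c] for c in feature_columns] for r in rows]


# Write the rows without a header line, as the training CSV files are written.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(to_label_first_rows(records, target))
csv_text = buffer.getvalue()
print(csv_text)
```

Each output line starts with the 0/1 label followed by the feature values, which is the shape `train.csv` and `validation.csv` take once the notebook writes them with `header=False, index=False`.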

In [None]:
# Split into train/validation/test (roughly 67% / 22% / 11% of the rows)
train_data, temp_data = train_test_split(model_data, test_size=0.33, random_state=42)
validation_data, test_data = train_test_split(temp_data, test_size=0.33, random_state=42)

print(f"Train: {train_data.shape[0]} samples")
print(f"Validation: {validation_data.shape[0]} samples") 
print(f"Test: {test_data.shape[0]} samples")

# Save and upload datasets
train_data.to_csv("train.csv", header=False, index=False)
validation_data.to_csv("validation.csv", header=False, index=False)

# Store test target separately for evaluation
test_target = test_data.iloc[:, 0]  # First column is target
test_features = test_data.iloc[:, 1:]  # Rest are features
test_features.to_csv("test.csv", header=False, index=False)

# Upload to S3
s3_train_path = lab_session.core_session.upload_data("train.csv")
s3_validation_path = lab_session.core_session.upload_data("validation.csv")
s3_test_path = lab_session.core_session.upload_data("test.csv")

print(f"\nData uploaded to S3:")
print(f"Train: {s3_train_path}")
print(f"Validation: {s3_validation_path}")
print(f"Test: {s3_test_path}")
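Applying `test_size=0.33` twice does not give three equal parts. The sketch below works out the resulting sizes, assuming a 5,000-row dataset (an illustrative figure) and assuming the test count is rounded up with the remainder going to training, which is how scikit-learn resolves a fractional `test_size`. The helper `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the rest goes to training.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_rows = 5000  # assumed dataset size, for illustration only

# First split: training rows vs. a holdout set.
n_train, n_holdout = split_sizes(n_rows, 0.33)
# Second split: the holdout set is divided into validation and test.
n_validation, n_test = split_sizes(n_holdout, 0.33)

for name, count in [("train", n_train), ("validation", n_validation), ("test", n_test)]:
    print(f"{name}: {count} rows ({count / n_rows:.1%})")
```

The test set ends up at about 11% of the data, which is worth keeping in mind when reading the evaluation metrics later in the notebook.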

## Training with XGBoost Estimator

This is where the SageMaker SDK shines - what took 50+ lines in sagemaker-core becomes just a few lines with the Estimator pattern.

In [1]:
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute, InputData, Channel, S3DataSource, DataSource

# Alternative (not used below): the newer ModelTrainer interface describes
# input channels with InputData. A plain S3 URI string is allowed as the
# data source. The rest of this notebook uses the Estimator API instead.
train_input = InputData(channel_name="train", data_source=s3_train_path)
val_input   = InputData(channel_name="validation", data_source=s3_validation_path)


In [None]:
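from sagemaker import image_uris
from sagemaker.estimator import Estimator

# Use the built-in XGBoost algorithm container. The framework class
# sagemaker.xgboost.XGBoost targets script mode and requires an entry_point
# script, so the generic Estimator is used here instead.
xgb_image_uri = image_uris.retrieve("xgboost", region=lab_session.region, version="1.7-1")

# Create XGBoost estimator - much cleaner than sagemaker-core!
xgb_estimator = Estimator(
    image_uri=xgb_image_uri,
    role=lab_session.role,
    instance_type='ml.m5.xlarge',
    instance_count=1,
    volume_size=30,
    output_path=lab_session.base_s3_uri,
    sagemaker_session=sagemaker_session,
    base_job_name='xgboost-churn',

    # XGBoost hyperparameters
    hyperparameters={
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'max_depth': 5,
        'eta': 0.2,
        'gamma': 4,
        'min_child_weight': 6,
        'subsample': 0.8,
        'num_round': 100,
        'verbosity': 0
    }
)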

print("✅ XGBoost Estimator configured")
print(f"Training will use: {xgb_estimator.instance_type}")
print(f"Output location: {xgb_estimator.output_path}")

In [None]:
# Train the model - just one line!
# Compare this to the complex TrainingJob.create() in sagemaker-core

xgb_estimator.fit({
    'train': s3_train_path,
    'validation': s3_validation_path
})

print(f"✅ Training completed!")
print(f"Model artifacts: {xgb_estimator.model_data}")

## Hyperparameter Tuning

The SageMaker SDK makes hyperparameter tuning much more intuitive with the `HyperparameterTuner` class.

In [None]:
from sagemaker.tuner import (
    HyperparameterTuner,
    IntegerParameter,
    ContinuousParameter
)

# Define hyperparameter ranges - much cleaner than sagemaker-core!
hyperparameter_ranges = {
    'max_depth': IntegerParameter(3, 10),
    'eta': ContinuousParameter(0.01, 0.3),
    'gamma': ContinuousParameter(0, 5),
    'min_child_weight': ContinuousParameter(1, 10),
    'subsample': ContinuousParameter(0.5, 1.0),
    'num_round': IntegerParameter(50, 200)
}

# Create tuner
tuner = HyperparameterTuner(
    xgb_estimator,
    objective_metric_name='validation:auc',
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=10,
    max_parallel_jobs=3,
    base_tuning_job_name='xgboost-tuning'
)

print("✅ Hyperparameter tuner configured")
print(f"Will run {tuner.max_jobs} tuning jobs")
print(f"Optimizing: {tuner.objective_metric_name}")
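To get a feel for the search space defined by these ranges, the sketch below draws random candidate configurations from the same bounds using only the standard library. This is plain random sampling for illustration; it is not how the tuner picks candidates (its default strategy is Bayesian), and the names `search_space` and `sample_configuration` are hypothetical.

```python
import random

# Same bounds as the tuner ranges above: (kind, low, high).
search_space = {
    "max_depth": ("int", 3, 10),
    "eta": ("float", 0.01, 0.3),
    "gamma": ("float", 0, 5),
    "min_child_weight": ("float", 1, 10),
    "subsample": ("float", 0.5, 1.0),
    "num_round": ("int", 50, 200),
}


def sample_configuration(space, rng):
    """Draw one hyperparameter set uniformly from the given bounds."""
    config = {}
    for name, (kind, low, high) in space.items():
        if kind == "int":
            config[name] = rng.randint(low, high)  # inclusive on both ends
        else:
            config[name] = rng.uniform(low, high)
    return config


rng = random.Random(42)  # fixed seed so the draws are repeatable
candidates = [sample_configuration(search_space, rng) for _ in range(10)]

for candidate in candidates[:3]:
    print(candidate)
```

Ten candidates in a six-dimensional space is a sparse sample, which is one reason a model-guided strategy can be preferable to pure random draws when `max_jobs` is small.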

In [None]:
# Start tuning - one line vs complex sagemaker-core setup!
tuner.fit({
    'train': s3_train_path,
    'validation': s3_validation_path
})

print("✅ Hyperparameter tuning completed!")

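# Get best training job details.
# tuner.best_training_job() returns only the job name, so read the full
# summary from the tuning job description instead.
best_job = tuner.describe()['BestTrainingJob']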
print(f"Best job: {best_job['TrainingJobName']}")
print(f"Best AUC: {best_job['FinalHyperParameterTuningJobObjectiveMetric']['Value']:.4f}")

# Print best hyperparameters
print("\nBest hyperparameters:")
for key, value in best_job['TunedHyperParameters'].items():
    print(f"  {key}: {value}")

## Model Deployment

The SageMaker SDK makes deployment incredibly simple - just call `deploy()` on the tuner or estimator!

In [None]:
from sagemaker.deserializers import CSVDeserializer
from sagemaker.serializers import CSVSerializer

# Deploy the best model from tuning - this is so much cleaner!
# Compare to the manual Model.create() + EndpointConfig.create() + Endpoint.create() in sagemaker-core

predictor = tuner.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',
    endpoint_name=lab_session.endpoint_name,
    serializer=CSVSerializer(),      # send feature rows as text/csv
    deserializer=CSVDeserializer()   # parse the text/csv response
)

print(f"✅ Model deployed to endpoint: {predictor.endpoint_name}")

## Serverless Deployment

The SageMaker SDK also makes serverless deployment much simpler.

In [None]:
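from sagemaker.deserializers import CSVDeserializer
from sagemaker.serializers import CSVSerializer
from sagemaker.serverless import ServerlessInferenceConfig

# Deploy serverless endpoint - much cleaner than sagemaker-core!
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=10,
    provisioned_concurrency=1
)

# tuner.deploy() requires an instance count and type, so deploy the best
# estimator directly when using a serverless configuration.
serverless_predictor = tuner.best_estimator().deploy(
    serverless_inference_config=serverless_config,
    endpoint_name=lab_session.serverless_endpoint_name,
    serializer=CSVSerializer(),
    deserializer=CSVDeserializer()
)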

print(f"✅ Serverless model deployed: {serverless_predictor.endpoint_name}")
print(f"Memory: {serverless_config.memory_size_in_mb}MB")
print(f"Max concurrency: {serverless_config.max_concurrency}")
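Serverless settings are constrained: memory is chosen in 1 GB steps from 1024 MB to 6144 MB, and provisioned concurrency cannot exceed the maximum concurrency. A small pre-flight check along those lines might look like the sketch below. The helper `check_serverless_settings` is hypothetical and not part of the SageMaker SDK, and account-level quotas are not checked here.

```python
# Memory sizes accepted for serverless endpoints, in MB.
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def check_serverless_settings(memory_size_in_mb, max_concurrency, provisioned_concurrency=None):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    if memory_size_in_mb not in ALLOWED_MEMORY_MB:
        problems.append(f"memory_size_in_mb must be one of {ALLOWED_MEMORY_MB}")
    if max_concurrency < 1:
        problems.append("max_concurrency must be at least 1")
    if provisioned_concurrency is not None and provisioned_concurrency > max_concurrency:
        problems.append("provisioned_concurrency cannot exceed max_concurrency")
    return problems


# The settings used in this notebook pass the check.
print(check_serverless_settings(2048, 10, 1))
# An odd memory size and an oversized provisioned concurrency are both reported.
print(check_serverless_settings(1500, 10, 20))
```

Running a check like this before `deploy()` surfaces a bad value immediately instead of after an endpoint creation attempt.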

## Batch Transform

The SageMaker SDK also simplifies batch inference.

In [None]:
# Create transformer - much simpler than sagemaker-core!
# The transformer is built from the best estimator found by the tuner.
transformer = tuner.best_estimator().transformer(
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path=lab_session.transform_output_s3_uri
)

# Run batch transform
transformer.transform(
    data=s3_test_path,
    content_type='text/csv',
    split_type='Line'
)

print(f"✅ Batch transform completed!")
print(f"Results saved to: {transformer.output_path}")

## Clean Inference

This is where the SageMaker SDK really shines - compare this clean interface to the fiddly `invoke()` + `read()` + `decode()` + `split()` in sagemaker-core!

In [None]:
import time
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Test both endpoints with clean interface - no more fiddly response parsing!
sample_data = test_features.head(10).values

print("=== ENDPOINT COMPARISON ===")
print(f"Testing with {len(sample_data)} samples\n")

# Provisioned endpoint
print("🖥️  PROVISIONED ENDPOINT:")
start_time = time.time()
# Convert the response to a flat float array (a CSV deserializer returns strings)
provisioned_predictions = np.array(predictor.predict(sample_data), dtype=float).ravel()
provisioned_latency = (time.time() - start_time) * 1000

print(f"   Predictions shape: {provisioned_predictions.shape}")
print(f"   Latency: {provisioned_latency:.1f}ms")
print(f"   Sample predictions: {provisioned_predictions[:3]}")

print()

# Serverless endpoint
print("☁️  SERVERLESS ENDPOINT:")
start_time = time.time()
serverless_predictions = np.array(serverless_predictor.predict(sample_data), dtype=float).ravel()
serverless_latency = (time.time() - start_time) * 1000

print(f"   Predictions shape: {np.array(serverless_predictions).shape}")
print(f"   Latency: {serverless_latency:.1f}ms")
print(f"   Sample predictions: {serverless_predictions[:3]}")

print()

# Compare results
predictions_match = np.allclose(provisioned_predictions, serverless_predictions, rtol=1e-5)
print(f"✅ Predictions match: {predictions_match}")
print(f"📊 Latency difference: {abs(serverless_latency - provisioned_latency):.1f}ms")
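For contrast, the manual parsing that `.predict()` replaces looks roughly like the sketch below. It assumes the endpoint returns `text/csv` scores separated by commas or newlines. The response bytes are made up for illustration, and `parse_csv_predictions` is a hypothetical helper.

```python
def parse_csv_predictions(body):
    """Decode a text/csv response body (bytes) into a flat list of floats.

    Accepts scores separated by commas, newlines, or a mix of both.
    """
    text = body.decode("utf-8")
    tokens = text.replace("\n", ",").split(",")
    # Skip empty tokens, e.g. the one produced by a trailing newline.
    return [float(token) for token in tokens if token.strip()]


raw_body = b"0.0132,0.9471\n0.2210\n"  # made-up response bytes
scores = parse_csv_predictions(raw_body)
print(scores)
```

A serializer/deserializer pair on the predictor does this work in both directions, which is what keeps the inference cells in this notebook short.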

In [None]:
# Evaluate on full test set
print("=== MODEL PERFORMANCE ===")

# Get predictions for full test set
test_predictions = predictor.predict(test_features.values)
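# Flatten to a 1-D float array (a CSV deserializer returns nested lists of strings)
test_probabilities = np.array(test_predictions, dtype=float).ravel()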
test_binary = (test_probabilities >= 0.5).astype(int)

# Calculate metrics
accuracy = accuracy_score(test_target, test_binary)
precision = precision_score(test_target, test_binary)
recall = recall_score(test_target, test_binary)
auc = roc_auc_score(test_target, test_probabilities)

print(f"Test Set Performance:")
print(f"  Accuracy:  {accuracy:.4f}")
print(f"  Precision: {precision:.4f}")
print(f"  Recall:    {recall:.4f}")
print(f"  ROC AUC:   {auc:.4f}")

print(f"\n📊 Tested on {len(test_target)} samples")
print(f"🎯 Churn rate in test set: {test_target.mean():.1%}")
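The thresholded metrics above reduce to counts of true/false positives and negatives. As a sanity check on what `accuracy_score`, `precision_score`, and `recall_score` report, here is a minimal standard-library sketch. The labels and probabilities are made up, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision, and recall after thresholding probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when nothing is predicted/labelled positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


labels = [1, 0, 1, 0, 0, 1]                    # made-up ground truth
probabilities = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8]  # made-up model scores
metrics = binary_metrics(labels, probabilities)
print(metrics)
```

Because churners are the minority class, precision and recall at the chosen threshold are usually more informative than accuracy alone; lowering the threshold trades precision for recall.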

## Cleanup

The SageMaker SDK also makes cleanup simpler with built-in methods.

In [None]:
# Clean up resources - much simpler than sagemaker-core!

print("🧹 Cleaning up resources...")

# Delete each model first (delete_model() looks it up through the endpoint
# config), then the endpoint. delete_endpoint() also removes the endpoint
# config by default, but it does not delete the model.
print("\n🗑️  Deleting models and endpoints...")
try:
    predictor.delete_model()
    predictor.delete_endpoint()
    print("✅ Provisioned endpoint and model deleted")
except Exception as e:
    print(f"❌ Error deleting provisioned endpoint: {e}")

try:
    serverless_predictor.delete_model()
    serverless_predictor.delete_endpoint()
    print("✅ Serverless endpoint and model deleted")
except Exception as e:
    print(f"❌ Error deleting serverless endpoint: {e}")

print("\n✨ Cleanup completed!")
print("\n💰 Runtime Summary (main cost drivers):")
print("   Training time: ~2-3 minutes")
print("   Tuning time: ~10-15 minutes")
print("   Inference time: ~5 minutes")
print(f"   Storage location: {lab_session.base_s3_uri}")
print("\n📝 Remember to delete S3 data when you're completely done!")

## Summary: SageMaker SDK vs sagemaker-core

This notebook demonstrates the dramatic improvements in developer experience when using the SageMaker SDK:

### Code Reduction:
- **Training**: 50+ lines → 10 lines (80% reduction)
- **Hyperparameter Tuning**: 40+ lines → 15 lines (over 60% reduction)
- **Deployment**: 30+ lines → 5 lines (85% reduction)
- **Inference**: Fiddly response parsing → Clean `.predict()` calls

### Developer Experience:
- ✅ **Intuitive**: ML-focused abstractions (Estimators, Predictors)
- ✅ **Less error-prone**: Automatic configuration and validation
- ✅ **Cleaner inference**: No manual response parsing
- ✅ **Better debugging**: Framework-specific error handling
- ✅ **Local mode**: Test everything locally before deployment

### When to use each:
- **SageMaker SDK**: ML development, experimentation, production ML workflows
- **sagemaker-core**: Infrastructure management, custom tooling, precise AWS API control

### Best of both worlds:
Our `CoreLabSession` provides session management while SageMaker SDK handles ML operations - giving you both control and convenience!