# SageMaker Core Pipeline - Data Prep, Training, and Model Creation

This notebook demonstrates how to create a complete ML pipeline using SageMaker Core that includes:
1. Data Processing - Prepare and split the customer churn dataset
2. Model Training - Train an XGBoost model on processed data
3. Model Evaluation - Evaluate the trained model on the validation split
4. Model Creation - Create a deployable SageMaker model from training artifacts

In [17]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Initialize CoreLab Session

In [2]:
from corelab.core.session import CoreLabSession

lab_session = CoreLabSession(
    'xgboost',
    'customer-churn-pipeline',
    default_folder='pipeline_notebook',
    create_run_folder=True,
    aws_profile='sagemaker-role'
)
lab_session.print()
core_session = lab_session.core_session

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/machiel/Library/Application Support/sagemaker/config.yaml


Couldn't call 'get_role' to get Role ARN from role name machiel-crystalline to get Role path.


falling back to profile: sagemaker-role
AWS region: eu-central-1
Execution role arn:aws:iam::136548476532:role/service-role/AmazonSageMaker-ExecutionRole-20250902T164316
Output bucket uri: s3://sagemaker-eu-central-1-136548476532/pipeline_notebook/2025-10-22T13-54-31
Framework: xgboost
Project name: customer-churn-pipeline


## Import SageMaker Pipeline Components

Note: SageMaker Pipelines SDK (not sagemaker-core) is used for pipeline orchestration.

In [3]:
# Pipeline-specific imports from SageMaker SDK
from sagemaker.workflow.pipeline_context import LocalPipelineSession
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep, CacheConfig
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.parameters import (
    ParameterFloat,
    ParameterString
)
from sagemaker.workflow.properties import PropertyFile

# Processing imports (the processing steps below use PyTorchProcessor, imported in a later cell)
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

# Training imports  
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# Model imports
from sagemaker.model import Model

print("All pipeline modules imported successfully")

All pipeline modules imported successfully


In [4]:
# Create a LocalPipelineSession so the pipeline can be executed locally in Docker containers
pipeline_session = LocalPipelineSession(
    boto_session=lab_session.core_session.boto_session,
    default_bucket=lab_session.core_session.default_bucket(),
    default_bucket_prefix=lab_session.core_session.default_bucket_prefix
)

print(f"📦 Default bucket: {pipeline_session.default_bucket()}")
print(f"📁 Bucket prefix: {pipeline_session.default_bucket_prefix}")

📦 Default bucket: sagemaker-eu-central-1-136548476532
📁 Bucket prefix: pipeline_notebook/2025-10-22T13-54-31


## Define input and output locations

In [5]:
# Define data locations
data_s3_uri = f"s3://sagemaker-example-files-prod-{lab_session.region}/datasets/tabular/synthetic/churn.txt"
pipeline_output_s3_uri = lab_session.pipeline_output_s3_uri

print(f"📁 Data S3 URI: {data_s3_uri}")
print(f"📤 Pipeline Output S3 URI: {pipeline_output_s3_uri}")

📁 Data S3 URI: s3://sagemaker-example-files-prod-eu-central-1/datasets/tabular/synthetic/churn.txt
📤 Pipeline Output S3 URI: s3://sagemaker-eu-central-1-136548476532/pipeline_notebook/2025-10-22T13-54-31/pipeline_output


## Define Pipeline Parameters

Pipeline parameters allow us to customize pipeline executions without modifying the code.

In [6]:
# Define pipeline parameters for flexibility

# Processing parameters
processing_instance_type = ParameterString(
    name="ProcessingInstanceType",
    default_value="ml.m5.large"
)

train_test_split = ParameterFloat(
    name="TrainTestSplit",
    default_value=0.33
)

# Training parameters
training_instance_type = ParameterString(
    name="TrainingInstanceType",
    default_value="ml.m5.large"
)

max_depth = ParameterString(
    name="MaxDepth",
    default_value="5"
)

num_round = ParameterString(
    name="NumRound",
    default_value="100"
)

print("Pipeline parameters defined")

Pipeline parameters defined


In [7]:
# Configure step caching for faster pipeline iterations
# Cache expires after 1 hour - adjust based on your needs
cache_config = CacheConfig(
    enable_caching=True,
    expire_after="PT1H"  # ISO 8601 duration format: PT1H = 1 hour (P7D would be 7 days)
)

print("✅ Cache configuration created (1-hour TTL)")
print("   Steps will reuse cached results when inputs haven't changed")

✅ Cache configuration created (1-hour TTL)
   Steps will reuse cached results when inputs haven't changed


## Configure Step Caching

Step caching allows SageMaker Pipelines to reuse results from previous executions when inputs haven't changed, significantly speeding up development iterations and reducing costs.

**How it works:**
- Cache key includes: step inputs, code hash, container image, instance type, and hyperparameters
- **Cache hit**: If all match a previous execution within TTL → Skip step, reuse outputs
- **Cache miss**: If anything differs → Re-run step

**Benefits:**
- ⚡ Faster pipeline iterations during development
- 💰 Cost savings by skipping expensive training/processing
- 🐛 Useful for debugging downstream steps without re-running upstream
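
To make the TTL notation concrete, here is a minimal sketch that converts ISO 8601 duration strings such as `PT1H` or `P7D` into `timedelta` values and applies the hit/miss rule described above. The helper names (`parse_iso8601_duration`, `is_cache_hit`) and the string cache keys are hypothetical illustrations. This is not how SageMaker implements caching internally, and the parser only handles the day/hour/minute/second subset of the format.

```python
import re
from datetime import timedelta

# Supports only the PnDTnHnMnS subset of ISO 8601 durations (no weeks, months, or years).
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_iso8601_duration(value: str) -> timedelta:
    """Convert a duration such as 'PT1H' or 'P7D' into a timedelta."""
    match = _DURATION_RE.match(value)
    parts = {k: int(v) for k, v in match.groupdict().items() if v is not None} if match else {}
    if not parts:
        raise ValueError(f"Unsupported duration: {value!r}")
    return timedelta(**parts)


def is_cache_hit(previous_key: str, current_key: str, age: timedelta, ttl: timedelta) -> bool:
    """Hypothetical rule: reuse a result only if the key matches and the TTL has not passed."""
    return previous_key == current_key and age <= ttl


ttl = parse_iso8601_duration("PT1H")
print(ttl)                                                      # 1:00:00
print(parse_iso8601_duration("P7D"))                            # 7 days, 0:00:00
print(is_cache_hit("abc", "abc", timedelta(minutes=30), ttl))   # True
print(is_cache_hit("abc", "abc", timedelta(hours=2), ttl))      # False
```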

## Step 1: Define Processing Step

This step processes raw data and splits it into train, validation, and test sets.

In [8]:
from sagemaker.pytorch import PyTorchProcessor
from pathlib import Path
import os
# Create PyTorchProcessor - this gives us the best maintained image even though we don't need pytorch itself
xgb_processor = PyTorchProcessor(
    framework_version='2.6.0',
    py_version='py312',
    instance_type=processing_instance_type,
    instance_count=1,
    role=lab_session.role,
    sagemaker_session=pipeline_session,
    volume_size_in_gb=30,
    max_runtime_in_seconds=3600,
    env={"PYTHONUNBUFFERED": "1"},
    base_job_name='churn-preprocessing'
)

src_dir = Path(os.getcwd(), 'src').resolve()

# Use step_args pattern for proper pipeline integration
processor_args = xgb_processor.run(
    code="preprocessing.py",
    source_dir=str(src_dir) + '/',  # Directory with code and requirements.txt
    inputs=[
        ProcessingInput(
            source=data_s3_uri,
            destination="/opt/ml/processing/input/data"
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="train",
            source="/opt/ml/processing/output/train",
            destination=f"{pipeline_output_s3_uri}/data/train"
        ),
        ProcessingOutput(
            output_name="validation", 
            source="/opt/ml/processing/output/validation",
            destination=f"{pipeline_output_s3_uri}/data/validation"
        ),
        ProcessingOutput(
            output_name="test",
            source="/opt/ml/processing/output/test",
            destination=f"{pipeline_output_s3_uri}/data/test"
        )
    ],
    arguments=["--train-test-split", train_test_split.to_string()]
)

step_process = ProcessingStep(
    name="PreprocessCustomerChurnData",
    step_args=processor_args,
    cache_config=cache_config
)

print("✅ Processing step defined with caching enabled")

INFO:botocore.credentials:Credentials found in config file: ~/.aws/config
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.


✅ Processing step defined with caching enabled
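
The `preprocessing.py` script is not shown in this notebook. As a rough sketch of what it might do with the `--train-test-split` argument, the example below parses the flag and produces train, validation, and test subsets. The function names, the seed, and the choice to divide the held-out fraction evenly between validation and test are all assumptions made for illustration; the real script may split differently.

```python
import argparse
import random


def parse_args(argv=None):
    """Parse the flag that the processing step passes via `arguments=[...]`."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-test-split", type=float, default=0.33)
    return parser.parse_args(argv)


def three_way_split(rows, holdout_fraction, seed=42):
    """Hypothetical split: hold out a fraction, then halve it into validation and test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    n_train = len(shuffled) - n_holdout
    n_validation = n_holdout // 2
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


args = parse_args(["--train-test-split", "0.33"])
train, validation, test = three_way_split(range(100), args.train_test_split)
print(len(train), len(validation), len(test))  # 67 16 17
```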




## Step 2: Define Training Step

This step trains an XGBoost model using the processed data from Step 1.

In [9]:
from sagemaker.pytorch import PyTorch

# Create estimator using the PyTorch image and our own entry point
my_estimator = PyTorch(
    framework_version='2.6.0',
    py_version='py312',
    entry_point='train.py',
    source_dir=str(src_dir) + '/',
    instance_type=training_instance_type,
    instance_count=1,
    role=lab_session.role,
    output_path=f"{pipeline_output_s3_uri}/models",
    sagemaker_session=pipeline_session,
    hyperparameters={
        "max_depth": max_depth,
        "eta": "0.2",
        "gamma": "4",
        "min_child_weight": "6",
        "subsample": "0.8",
        "verbosity": "0",
        "objective": "binary:logistic",
        "num_round": num_round
    }
)

# Use step_args pattern for training step
training_args = my_estimator.fit(
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
            content_type="text/csv"
        ),
        "validation": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["validation"].S3Output.S3Uri,
            content_type="text/csv"
        )
    }
)

step_train = TrainingStep(
    name="TrainXGBoostModel",
    step_args=training_args,
    cache_config=cache_config
)

print("✅ Training step defined with caching enabled")

INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.


✅ Training step defined with caching enabled
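
The `train.py` entry point is not included here. In script mode, the training container passes hyperparameters to the entry point as command-line arguments and exposes the channel and model directories through environment variables such as `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR`. The sketch below shows one way a script could read them; the function name `parse_training_args` and the particular defaults are assumptions, not the notebook author's code.

```python
import argparse
import os


def parse_training_args(argv=None):
    """Read hyperparameters and SageMaker paths the way a script-mode entry point might."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as `--name value` pairs
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--eta", type=float, default=0.2)
    parser.add_argument("--num_round", type=int, default=100)
    parser.add_argument("--objective", type=str, default="binary:logistic")
    # Paths come from environment variables set inside the training container
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    parser.add_argument(
        "--validation",
        default=os.environ.get("SM_CHANNEL_VALIDATION", "/opt/ml/input/data/validation"),
    )
    # Ignore hyperparameters this sketch does not declare (for example gamma or subsample)
    args, _unknown = parser.parse_known_args(argv)
    return args


example_args = parse_training_args(["--max_depth", "7", "--num_round", "50", "--gamma", "4"])
print(example_args.max_depth, example_args.num_round, example_args.eta)  # 7 50 0.2
```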


## Step 3: Define Evaluation Step

This step evaluates the trained model against the validation dataset created during preprocessing.


In [10]:
# Create a PyTorchProcessor for evaluation to run custom metrics
evaluation_src_dir = Path(os.getcwd(), 'src').resolve()
script_processor = PyTorchProcessor(
    framework_version='2.6.0',
    py_version='py312',
    instance_type=processing_instance_type,
    instance_count=1,
    role=lab_session.role,
    sagemaker_session=pipeline_session,
    base_job_name='churn-evaluation',
    volume_size_in_gb=30,
    max_runtime_in_seconds=3600,
    env={'PYTHONUNBUFFERED': '1'}
)

evaluation_args = script_processor.run(
    code='evaluate.py',
    source_dir=str(evaluation_src_dir) + '/',
    inputs=[
        ProcessingInput(
            source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
            destination='/opt/ml/processing/model'
        ),
        ProcessingInput(
            source=step_process.properties.ProcessingOutputConfig.Outputs['validation'].S3Output.S3Uri,
            destination='/opt/ml/processing/evaluation'
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name='evaluation',
            source='/opt/ml/processing/output'
        )
    ]
)

evaluation_report = PropertyFile(
    name='EvaluationReport',
    output_name='evaluation',
    path='evaluation.json'
)

step_evaluate = ProcessingStep(
    name='EvaluateModel',
    step_args=evaluation_args,
    property_files=[evaluation_report],
    cache_config=cache_config
)

print('✅ Evaluation step defined with caching enabled')


INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.


✅ Evaluation step defined with caching enabled
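
The `evaluate.py` script is not shown, but the conditional pipeline later in this notebook reads `binary_classification_metrics.accuracy.value` from `evaluation.json`, so the report needs at least that structure. The following sketch builds such a report from labels and predicted probabilities. The function name, the 0.5 decision threshold, and the sample data are assumptions for illustration only.

```python
import json


def build_evaluation_report(labels, probabilities, threshold=0.5):
    """Compute accuracy and wrap it in the nested layout the JSON path expects."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    accuracy = correct / len(labels)
    return {"binary_classification_metrics": {"accuracy": {"value": accuracy}}}


# Illustrative data, not results from this pipeline
report = build_evaluation_report(labels=[1, 0, 1, 0], probabilities=[0.9, 0.2, 0.4, 0.1])
print(json.dumps(report, indent=2))
```

A real script would write this dictionary to `/opt/ml/processing/output/evaluation.json` so that the `evaluation` output and the `EvaluationReport` property file can pick it up.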


## Step 4: Define Model Creation Step

This step creates a SageMaker Model from the trained model artifacts.

In [11]:
from sagemaker.pytorch import PyTorchModel

# Create a Model object using pipeline session for consistency
model = PyTorchModel(
    framework_version='2.6.0',
    py_version='py312',
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    role=lab_session.role,
    sagemaker_session=pipeline_session
)

# Use step_args pattern for model creation
model_create_args = model.create(instance_type='ml.m5.large')

step_create_model = ModelStep(
    name="CreateXGBoostModel",
    step_args=model_create_args
)

print("✅ Model creation step defined")

✅ Model creation step defined


In [12]:
from sagemaker.model_metrics import MetricsSource, ModelMetrics
from sagemaker.workflow.functions import Join

model_metrics = ModelMetrics(
  model_statistics=MetricsSource(
      s3_uri=Join(on="/", values=[
          step_evaluate.properties.ProcessingOutputConfig.Outputs["evaluation"].S3Output.S3Uri,
          "evaluation.json"
      ]),
      content_type="application/json"
  )
)

## Optional: Model Registry Step

Register the model in SageMaker Model Registry for versioning and deployment management.

In [None]:
register_args = model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.large", "ml.m5.xlarge"],
    transform_instances=["ml.m5.large"],
    model_package_group_name="customer-churn-models",
    approval_status="Approved",
    model_metrics=model_metrics,
    description="XGBoost model for customer churn prediction"
)
step_register_model = ModelStep(
    name="RegisterXGBoostModel",
    step_args=register_args
)

print("✅ Model registration step defined")

## Create the Pipeline

In [13]:
# Create the pipeline with fixed name for versioning
# SageMaker Pipelines now support versioning - use fixed names instead of timestamps
pipeline_name = "customer-churn-pipeline"

pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        processing_instance_type,
        train_test_split,
        training_instance_type,
        max_depth,
        num_round
    ],
    steps=[
        step_process,
        step_train,
        step_evaluate,
        step_create_model,
    ],
    sagemaker_session=pipeline_session
)

print(f"🚀 Pipeline Name: {pipeline_name}")
print(f"📊 Pipeline Steps: {len(pipeline.steps)}")
print("ℹ️  Using fixed name - SageMaker will create versions automatically")


🚀 Pipeline Name: customer-churn-pipeline
📊 Pipeline Steps: 4
ℹ️  Using fixed name - SageMaker will create versions automatically


## Validate Pipeline Definition

In [14]:
import json
import os
print('cwd', os.getcwd())
# Validate the pipeline definition
pipeline_definition = json.loads(pipeline.definition())
print("Pipeline definition validated successfully!")
print(f"\nPipeline has {len(pipeline_definition['Steps'])} steps:")
for step in pipeline_definition['Steps']:
    print(f"  - {step['Name']}: {step['Type']}")

INFO:sagemaker.processing:Uploaded /Users/machiel/Development/crystalline/sagemaker/corelab/answers/lab4-pipeline/src/ to s3://sagemaker-eu-central-1-136548476532/pipeline_notebook/2025-10-22T13-54-31/customer-churn-pipeline/code/a5f4c96a25fe258f6d745647f29a69916a8ab2e3212c2fc025eda04817f59559/sourcedir.tar.gz


cwd /Users/machiel/Development/crystalline/sagemaker/corelab/answers/lab4-pipeline


INFO:sagemaker.processing:runproc.sh uploaded to s3://sagemaker-eu-central-1-136548476532/pipeline_notebook/2025-10-22T13-54-31/customer-churn-pipeline/code/688eae6b48c95bd4e87bf60c57d4a119c4b00fe04a0b3e00a5c0202befc915c6/runproc.sh
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker.processing:Uploaded /Users/machiel/Development/crystalline/sagemaker/corelab/answers/lab4-pipeline/src/ to s3://sagemaker-eu-central-1-136548476532/pipeline_notebook/2025-10-22T13-54-31/customer-churn-pipeline/code/0672d4f0c8734492b27878be09b21837c05e8d9972240bec747260336b870f29/sourcedir.tar.gz
INFO:sagemaker.processing:runproc.sh uploaded to s3://sagemaker-eu-central-1-136548476532/pipeline_notebook/2025-10-22T13-54-31/customer-churn-pipeline/code/1de680f13a58371bcae623a94882c9316259bd6263be2afff8e006c05ffceeed/runpr

Pipeline definition validated successfully!

Pipeline has 4 steps:
  - PreprocessCustomerChurnData: Processing
  - TrainXGBoostModel: Training
  - EvaluateModel: Processing
  - CreateXGBoostModel-CreateModel: Model
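
Beyond listing step names, the parsed definition can be used to see which steps depend on which. The sketch below walks a definition dictionary and collects step names from property references, assuming they are serialized as `{"Get": "Steps.<StepName>. ..."}` objects. The `sample_definition` is a hand-written, heavily abbreviated stand-in rather than the output of `pipeline.definition()`, and `find_step_references` is a hypothetical helper.

```python
import re


def find_step_references(node):
    """Recursively collect step names referenced through {"Get": "Steps.<name>. ..."}."""
    refs = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Get" and isinstance(value, str):
                match = re.match(r"Steps\.([^.]+)\.", value)
                if match:
                    refs.add(match.group(1))
            else:
                refs |= find_step_references(value)
    elif isinstance(node, list):
        for item in node:
            refs |= find_step_references(item)
    return refs


# Abbreviated, illustrative definition (assumed layout)
sample_definition = {
    "Steps": [
        {"Name": "PreprocessCustomerChurnData", "Type": "Processing", "Arguments": {}},
        {
            "Name": "TrainXGBoostModel",
            "Type": "Training",
            "Arguments": {
                "InputDataConfig": [
                    {"DataSource": {"S3DataSource": {"S3Uri": {
                        "Get": "Steps.PreprocessCustomerChurnData.ProcessingOutputConfig.Outputs['train'].S3Output.S3Uri"
                    }}}}
                ]
            },
        },
        {
            "Name": "CreateXGBoostModel-CreateModel",
            "Type": "Model",
            "Arguments": {
                "PrimaryContainer": {
                    "ModelDataUrl": {"Get": "Steps.TrainXGBoostModel.ModelArtifacts.S3ModelArtifacts"}
                }
            },
        },
    ]
}

dependencies = {
    step["Name"]: sorted(find_step_references(step["Arguments"]))
    for step in sample_definition["Steps"]
}
for name, upstream in dependencies.items():
    print(f"{name} <- {upstream}")
```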


## Create/Update Pipeline in AWS

In [15]:
# Create/update the pipeline (creates new version if pipeline exists)
response = pipeline.upsert(role_arn=lab_session.role)
print(f"✅ Pipeline '{pipeline_name}' created/updated successfully")

# Check if this created a new version
try:
    versions = pipeline.list_pipeline_versions()
    version_count = len(versions)
    latest_version = versions[0]['PipelineVersion'] if versions else 1
    print(f"📋 Pipeline now has {version_count} version(s), latest: v{latest_version}")
except Exception:
    print("📋 Version information not available")




✅ Pipeline 'customer-churn-pipeline' created/updated successfully
📋 Version information not available


In [None]:
# Start pipeline execution
execution = pipeline.start(
    parameters={
        "ProcessingInstanceType": "ml.m5.large",
        "TrainingInstanceType": "ml.m5.large",
        "TrainTestSplit": 0.33,
        "MaxDepth": "5",
        "NumRound": "100"
    }
)

print("🚀 Pipeline execution started")
print(f"📝 Execution ARN: {execution.arn}")
print(f"📊 Status: {execution.describe()['PipelineExecutionStatus']}")

## Run the Pipeline Locally

In [16]:
execution = pipeline.start()

INFO:sagemaker.processing:Uploaded /Users/machiel/Development/crystalline/sagemaker/corelab/answers/lab4-pipeline/src/ to s3://sagemaker-eu-central-1-136548476532/pipeline_notebook/2025-10-22T13-54-31/customer-churn-pipeline/code/a5f4c96a25fe258f6d745647f29a69916a8ab2e3212c2fc025eda04817f59559/sourcedir.tar.gz
INFO:sagemaker.processing:runproc.sh uploaded to s3://sagemaker-eu-central-1-136548476532/pipeline_notebook/2025-10-22T13-54-31/customer-churn-pipeline/code/688eae6b48c95bd4e87bf60c57d4a119c4b00fe04a0b3e00a5c0202befc915c6/runproc.sh
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker.processing:Uploaded /Users/machiel/Development/crystalline/sagemaker/corelab/answers/lab4-pipeline/src/ to s3://sagemaker-eu-central-1-136548476532/pipeline_notebook/2025-10-22T13-54-31/customer-churn-pipeline/code/0672d4f0c8734492b27878be09b21837c05e8d9972240bec747260336b870f29/sourcedir.tar.gz
INFO:sagemaker.processing:run

 Container lgp7es6hyh-algo-1-c4gc6  Creating
 Container lgp7es6hyh-algo-1-c4gc6  Created
Attaching to lgp7es6hyh-algo-1-c4gc6
lgp7es6hyh-algo-1-c4gc6  | CodeArtifact repository not specified. Skipping login.
lgp7es6hyh-algo-1-c4gc6  | Found existing installation: typing 3.7.4.3
lgp7es6hyh-algo-1-c4gc6  | Uninstalling typing-3.7.4.3:
lgp7es6hyh-algo-1-c4gc6  |   Successfully uninstalled typing-3.7.4.3
lgp7es6hyh-algo-1-c4gc6  | Collecting xgboost==3.0.5 (from -r requirements.txt (line 1))
lgp7es6hyh-algo-1-c4gc6  |   Downloading xgboost-3.0.5-py3-none-manylinux_2_28_x86_64.whl.metadata (2.1 kB)
lgp7es6hyh-algo-1-c4gc6  | Collecting sagemaker-training>=5.1.1 (from -r requirements.txt (line 3))
lgp7es6hyh-algo-1-c4gc6  |   Downloading sagemaker_training-5.1.1.tar.gz (59 kB)
lgp7es6hyh-algo-1-c4gc6  |   Installing build dependencies ... done
lgp7es6hyh-algo-1-c4gc6  |   Getting requirements to build wheel ... done
lgp7es6hyh-algo-1-c4gc6  |   Preparing

INFO:sagemaker.local.image:===== Job Complete =====
INFO:sagemaker.local.entities:Pipeline step 'PreprocessCustomerChurnData' SUCCEEDED.
INFO:sagemaker.local.entities:Starting pipeline step: 'TrainXGBoostModel'
INFO:sagemaker.local.image:'Docker Compose' found using Docker CLI.
INFO:sagemaker.local.local_session:Starting training job
INFO:sagemaker.local.image:Using the long-lived AWS credentials found in session
INFO:sagemaker.local.image:docker compose file: 
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-le5pg:
    command: train
    co

 Container 012omd7011-algo-1-le5pg  Creating
 Container 012omd7011-algo-1-le5pg  Created
Attaching to 012omd7011-algo-1-le5pg
012omd7011-algo-1-le5pg  | 2025-10-22 13:55:43,643 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
012omd7011-algo-1-le5pg  | 2025-10-22 13:55:43,644 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
012omd7011-algo-1-le5pg  | 2025-10-22 13:55:43,645 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
012omd7011-algo-1-le5pg  | 2025-10-22 13:55:43,653 sagemaker-training-toolkit INFO     instance_groups entry not present in resource_config
012omd7011-algo-1-le5pg  | 2025-10-22 13:55:43,657 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
012omd7011-algo-1-le5pg  | 2025-10-22 13:55:43,659 sagemaker_pytorch_container.training INFO     Invoking user training script.
012omd7011-algo-1-le5pg  | 2025-10-22 13:55:

INFO:sagemaker.local.image:===== Job Complete =====
INFO:sagemaker.local.entities:Pipeline step 'TrainXGBoostModel' SUCCEEDED.
INFO:sagemaker.local.entities:Starting pipeline step: 'EvaluateModel'
INFO:sagemaker.local.image:'Docker Compose' found using Docker CLI.
INFO:sagemaker.local.local_session:Starting processing job
INFO:sagemaker.local.image:Using the long-lived AWS credentials found in session
INFO:sagemaker.local.image:docker compose file: 
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-s5pkl:
    container_name: yvwecp2qsh-algo-1

 Container yvwecp2qsh-algo-1-s5pkl  Creating
 Container yvwecp2qsh-algo-1-s5pkl  Created
Attaching to yvwecp2qsh-algo-1-s5pkl
yvwecp2qsh-algo-1-s5pkl  | CodeArtifact repository not specified. Skipping login.
yvwecp2qsh-algo-1-s5pkl  | Found existing installation: typing 3.7.4.3
yvwecp2qsh-algo-1-s5pkl  | Uninstalling typing-3.7.4.3:
yvwecp2qsh-algo-1-s5pkl  |   Successfully uninstalled typing-3.7.4.3
yvwecp2qsh-algo-1-s5pkl  | Collecting xgboost==3.0.5 (from -r requirements.txt (line 1))
yvwecp2qsh-algo-1-s5pkl  |   Downloading xgboost-3.0.5-py3-none-manylinux_2_28_x86_64.whl.metadata (2.1 kB)
yvwecp2qsh-algo-1-s5pkl  | Collecting sagemaker-training>=5.1.1 (from -r requirements.txt (line 3))
yvwecp2qsh-algo-1-s5pkl  |   Downloading sagemaker_training-5.1.1.tar.gz (59 kB)
yvwecp2qsh-algo-1-s5pkl  |   Installing build dependencies ... done
yvwecp2qsh-algo-1-s5pkl  |   Getting requirements to build wheel ... done
yvwecp2qsh-algo-1-s5pkl  |   Preparing

INFO:sagemaker.local.image:===== Job Complete =====


yvwecp2qsh-algo-1-s5pkl exited with code 0
Aborting on container exit...
 Container yvwecp2qsh-algo-1-s5pkl  Stopping
 Container yvwecp2qsh-algo-1-s5pkl  Stopped


INFO:sagemaker.local.entities:Pipeline step 'EvaluateModel' SUCCEEDED.
INFO:sagemaker.local.entities:Starting pipeline step: 'CreateXGBoostModel-CreateModel'
INFO:sagemaker.local.entities:Pipeline step 'CreateXGBoostModel-CreateModel' SUCCEEDED.
INFO:sagemaker.local.entities:Pipeline execution f1309111-5e55-4c3a-9a91-f3501c4c03b1 SUCCEEDED


## Monitor Pipeline Execution

In [None]:
# Monitor execution status
execution.wait()
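
When more control over polling is needed than a single blocking call provides, the status can be checked in a loop. The sketch below takes any callable that returns a dictionary with a `PipelineExecutionStatus` key, which is the field `execution.describe()` is read for earlier in this notebook. The function name `wait_for_completion`, the polling interval, and the stand-in responses are assumptions for illustration.

```python
import time

# Statuses after which a pipeline execution no longer changes
TERMINAL_STATUSES = {"Succeeded", "Failed", "Stopped"}


def wait_for_completion(describe, poll_seconds=30.0, max_polls=120):
    """Poll `describe()` until the execution reaches a terminal status."""
    for _ in range(max_polls):
        status = describe()["PipelineExecutionStatus"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Execution still running after {max_polls} polls")


# Stand-in for `execution.describe`, so the sketch runs without AWS access
fake_responses = iter([
    {"PipelineExecutionStatus": "Executing"},
    {"PipelineExecutionStatus": "Executing"},
    {"PipelineExecutionStatus": "Succeeded"},
])
final_status = wait_for_completion(lambda: next(fake_responses), poll_seconds=0)
print(final_status)  # Succeeded
```

With a real execution object, the call would look like `wait_for_completion(execution.describe)`.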

## Retrieve Pipeline Outputs

In [None]:
# Get execution steps details
execution_steps = execution.list_steps()

for step in execution_steps:
    print(f"\nStep: {step['StepName']}")
    print(f"  Status: {step['StepStatus']}")

    if step['StepName'] == 'TrainXGBoostModel' and step['StepStatus'] == 'Succeeded':
        # Get training job details
        training_job_arn = step['Metadata']['TrainingJob']['Arn']
        print(f"  Training Job ARN: {training_job_arn}")

    elif step['StepName'] == 'CreateXGBoostModel-CreateModel' and step['StepStatus'] == 'Succeeded':
        # Get model details
        model_arn = step['Metadata']['Model']['Arn']
        print(f"  Model ARN: {model_arn}")

    elif step['StepName'] == 'EvaluateModel' and step['StepStatus'] == 'Succeeded':
        # Output details may be absent from the step metadata, so read them defensively
        processing_metadata = step['Metadata'].get('ProcessingJob', {})
        outputs = processing_metadata.get('ProcessingOutputConfig', {}).get('Outputs', [])
        eval_uri = next((o['S3Output']['S3Uri'] for o in outputs if o['OutputName'] == 'evaluation'), None)
        if eval_uri:
            print(f"  Evaluation report: {eval_uri}/evaluation.json")
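
To download the evaluation report, the S3 URI first has to be separated into a bucket and an object key. A minimal sketch using only the standard library follows; the helper name `split_s3_uri` and the example URI are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI for illustration
bucket, key = split_s3_uri("s3://example-bucket/pipeline_output/evaluation/evaluation.json")
print(bucket)  # example-bucket
print(key)     # pipeline_output/evaluation/evaluation.json
```

The resulting pair can then be passed to an S3 client, for example boto3's `download_file(bucket, key, "evaluation.json")`.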


## View Pipeline Execution in SageMaker Studio

You can also view and manage your pipeline execution in SageMaker Studio:
1. Open SageMaker Studio
2. Navigate to the Pipelines section
3. Select your pipeline to view execution details, logs, and metrics

## Pipeline Version Management (Optional)

With SageMaker Pipeline versioning, you can manage different versions of your pipeline:

In [None]:
# List all versions of the pipeline
try:
    versions = pipeline.list_pipeline_versions()
    print(f"📋 Pipeline '{pipeline_name}' versions:")
    for version in versions[:5]:  # Show last 5 versions
        print(f"  - Version {version['PipelineVersion']}: Created {version['CreationTime']}")
        
    if len(versions) > 5:
        print(f"  ... and {len(versions) - 5} more versions")
        
    # Show how to execute a specific version
    print(f"\n💡 To execute a specific version:")
    print(f"   execution = pipeline.start(pipeline_version=1, parameters={{...}})")
    
except Exception as e:
    print(f"Could not retrieve version information: {e}")
    print("This may be normal for newly created pipelines")

## Clean Up Resources (Optional)

In [None]:
# # Delete the pipeline (uncomment to execute)
# try:
#     pipeline.delete()
#     print(f"✅ Pipeline '{pipeline_name}' deleted")
# except Exception as e:
#     print(f"Error deleting pipeline: {e}")

## Alternative Pipeline with Conditional Model Approval

This section demonstrates a second pipeline that conditionally approves models based on evaluation metrics.

**Approval Logic:**
- **Accuracy ≥ 0.80**: Model automatically approved (`approval_status="Approved"`)
- **Accuracy < 0.80**: Model requires manual approval (`approval_status="PendingManualApproval"`)

This approach uses `ConditionStep` to route to different `ModelStep` instances based on the evaluation results.
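
Expressed in plain Python, the routing rule amounts to the sketch below. It only illustrates the decision and is not what runs inside the pipeline, where the condition is evaluated by the service. The helper names are hypothetical, and the report layout follows the `binary_classification_metrics.accuracy.value` path used by the condition.

```python
def get_json_path(document, dotted_path):
    """Follow a dotted path such as 'a.b.c' through nested dictionaries."""
    node = document
    for key in dotted_path.split("."):
        node = node[key]
    return node


def choose_approval_status(report, threshold=0.80):
    """Mirror the approval logic: at or above the threshold is auto-approved."""
    accuracy = get_json_path(report, "binary_classification_metrics.accuracy.value")
    return "Approved" if accuracy >= threshold else "PendingManualApproval"


# Illustrative reports, not results from this pipeline
high = {"binary_classification_metrics": {"accuracy": {"value": 0.85}}}
low = {"binary_classification_metrics": {"accuracy": {"value": 0.75}}}
print(choose_approval_status(high))  # Approved
print(choose_approval_status(low))   # PendingManualApproval
```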

In [None]:
# Create two separate registration steps with different approval statuses

# Step 1: Auto-approved registration (for high-quality models)
register_approved_args = model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.large", "ml.m5.xlarge"],
    transform_instances=["ml.m5.large"],
    model_package_group_name="customer-churn-models-conditional",
    approval_status="Approved",  # Automatically approved
    model_metrics=model_metrics,
    description="High-quality XGBoost model (accuracy ≥ 0.80) for customer churn prediction"
)

step_register_approved = ModelStep(
    name="RegisterApprovedModel",
    step_args=register_approved_args
)

# Step 2: Manual approval required (for lower-quality models)
register_pending_args = model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.large", "ml.m5.xlarge"],
    transform_instances=["ml.m5.large"],
    model_package_group_name="customer-churn-models-conditional",
    approval_status="PendingManualApproval",  # Requires manual approval
    model_metrics=model_metrics,
    description="XGBoost model (accuracy < 0.80) requiring manual review for customer churn prediction"
)

step_register_pending = ModelStep(
    name="RegisterPendingModel",
    step_args=register_pending_args
)

print("✅ Two registration steps created:")
print("   - RegisterApprovedModel: For accuracy ≥ 0.80")
print("   - RegisterPendingModel: For accuracy < 0.80")

In [None]:
# Import condition components
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet

# Define the accuracy condition
# Extract accuracy from evaluation.json: binary_classification_metrics.accuracy.value
accuracy_condition = ConditionGreaterThanOrEqualTo(
    left=JsonGet(
        step_name=step_evaluate.name,
        property_file=evaluation_report,
        json_path="binary_classification_metrics.accuracy.value"
    ),
    right=0.80  # Threshold for automatic approval
)

# Create conditional step that routes to appropriate registration
step_condition = ConditionStep(
    name="CheckModelAccuracy",
    conditions=[accuracy_condition],
    if_steps=[step_register_approved],   # Execute if accuracy ≥ 0.80
    else_steps=[step_register_pending]   # Execute if accuracy < 0.80
)

print("✅ Conditional step created with accuracy threshold: 0.80")
print("   - If accuracy ≥ 0.80 → Auto-approve")
print("   - If accuracy < 0.80 → Manual approval required")

In [None]:
# Create the conditional pipeline with a different name
pipeline_conditional_name = "customer-churn-pipeline-conditional"

pipeline_conditional = Pipeline(
    name=pipeline_conditional_name,
    parameters=[
        processing_instance_type,
        train_test_split,
        training_instance_type,
        max_depth,
        num_round
    ],
    steps=[
        step_process,          # Reuse: Data preprocessing
        step_train,            # Reuse: Model training
        step_evaluate,         # Reuse: Model evaluation
        step_condition         # NEW: Conditional registration
    ],
    sagemaker_session=pipeline_session
)

print(f"🚀 Conditional Pipeline Name: {pipeline_conditional_name}")
print(f"📊 Pipeline Steps: {len(pipeline_conditional.steps)}")
print("✨ Key difference: Conditional model approval based on accuracy")

In [None]:
# Validate pipeline definition
pipeline_conditional_definition = json.loads(pipeline_conditional.definition())
print("✅ Conditional pipeline definition validated successfully!")
print(f"\nPipeline has {len(pipeline_conditional_definition['Steps'])} steps:")
for step in pipeline_conditional_definition['Steps']:
    step_type = step['Type']
    step_name = step['Name']
    print(f"  - {step_name}: {step_type}")
    
    # Show condition details for the conditional step
    if step_type == 'Condition':
        print(f"    → Condition: Accuracy ≥ 0.80")
        print(f"    → If True: {step['Arguments']['IfSteps'][0]['Name']}")
        print(f"    → If False: {step['Arguments']['ElseSteps'][0]['Name']}")

# Create/update the conditional pipeline
response_conditional = pipeline_conditional.upsert(role_arn=lab_session.role)
print(f"\n✅ Pipeline '{pipeline_conditional_name}' created/updated successfully")

# Start execution (optional - comment out to skip)
execution_conditional = pipeline_conditional.start(
    parameters={
        "ProcessingInstanceType": "ml.m5.large",
        "TrainingInstanceType": "ml.m5.large",
        "TrainTestSplit": 0.33,
        "MaxDepth": "5",
        "NumRound": "100"
    }
)
print("🚀 Conditional pipeline execution started")
print(f"📝 Execution ARN: {execution_conditional.arn}")

In [None]:
execution_conditional.wait()