# Adapting Preprocessing Script for SageMaker

To build a pipeline, we need to start with individual steps. Our first step will be data preprocessing — the same preprocessing you've done locally in previous courses, but now adapted to run in SageMaker's managed environment.

We'll create a separate file called `data_processing.py` that contains our preprocessing logic. The logic itself is identical to what you've done before: we're still capping outliers, creating new features, and splitting the data. The key change is in how we handle file paths. The script reads its input from `/opt/ml/processing/input` and writes its results to `/opt/ml/processing/train` and `/opt/ml/processing/test`.

These specific paths (`/opt/ml/processing/`) are SageMaker conventions that allow the service to automatically handle data movement between S3 and your processing containers. When SageMaker runs this script, it will automatically copy your S3 data into the input directory and upload the results from the output directories back to S3.
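
As a rough sketch of the path handling only, the script might look something like the following. The function name `split_dataset`, the `base_dir` argument, the output file names, and the 80/20 split are assumptions made for illustration, not the course's actual script. The standard-library `csv` module stands in for pandas so the example is self-contained, and the outlier capping and feature engineering are left out.

```python
import csv
import os
import random

# Base directory that SageMaker Processing uses inside the container.
PROCESSING_DIR = "/opt/ml/processing"


def split_dataset(base_dir=PROCESSING_DIR, test_fraction=0.2, seed=42):
    """Read the raw CSV from <base_dir>/input and write train/test CSVs.

    Hypothetical helper: the file names and split ratio are assumptions.
    """
    input_path = os.path.join(base_dir, "input", "california_housing.csv")
    train_dir = os.path.join(base_dir, "train")
    test_dir = os.path.join(base_dir, "test")
    os.makedirs(train_dir, exist_ok=True)
    os.makedirs(test_dir, exist_ok=True)

    with open(input_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    # (Outlier capping and feature engineering would happen here.)

    # Shuffle deterministically, then slice off the test portion.
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    test_rows, train_rows = rows[:n_test], rows[n_test:]

    for directory, name, subset in (
        (train_dir, "train.csv", train_rows),
        (test_dir, "test.csv", test_rows),
    ):
        with open(os.path.join(directory, name), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(subset)

    return len(train_rows), len(test_rows)
```

Inside the processing container the script would simply call `split_dataset()` with the default base directory. Passing a different `base_dir` makes it easy to try the same code locally.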



# Understanding SageMaker Sessions

Before building your pipeline, you need to understand that SageMaker Pipelines require two different types of sessions that serve distinct purposes:



In [None]:
import sagemaker
from sagemaker.workflow.pipeline_context import PipelineSession

# Create a SageMaker session for immediate operations
sagemaker_session = sagemaker.Session()

# Create a pipeline session for pipeline components
pipeline_session = PipelineSession()

The two sessions differ in execution context, that is, in whether an operation runs immediately or is captured for the pipeline to run later:

- `sagemaker.Session()`: This is your direct connection to AWS services. When you use this session, you're telling SageMaker "execute this operation using real AWS resources right now." It handles immediate operations like uploading data to S3, registering pipeline definitions, starting executions, and checking status.

- `PipelineSession()`: This is a special "recording" session that creates placeholder operations instead of real ones. When you use this session with processors or estimators, no SageMaker job is created immediately. Instead, the job configuration is captured so it can become a step in your pipeline definition. These placeholders get converted into real operations only when the pipeline executes.

Without `PipelineSession()`, running a processor or estimator that uses a regular session would make SageMaker immediately spin up compute instances and start processing your data before you've even finished defining your pipeline! `PipelineSession()` lets you define all your pipeline components as a complete workflow first, then execute everything in the proper order when you're ready.

Simple Rule:

- Use `PipelineSession()` for any processor, estimator, or transformer that should become a pipeline step
- Use `sagemaker.Session()` for immediate actions like managing the pipeline itself
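
To make the difference concrete, here is a toy model in plain Python. `ImmediateSession`, `RecordingSession`, `submit`, and `run_all` are hypothetical names invented for this illustration. They are not part of the SageMaker SDK and only mimic the idea of running a request now versus recording it for later.

```python
class ImmediateSession:
    """Toy stand-in for a regular session: requests run as soon as they arrive."""

    def __init__(self):
        self.executed = []

    def submit(self, job_name):
        self.executed.append(job_name)  # the "job" runs right away
        return f"started {job_name}"


class RecordingSession:
    """Toy stand-in for a pipeline session: requests are captured, not run."""

    def __init__(self):
        self.recorded = []

    def submit(self, job_name):
        self.recorded.append(job_name)  # remember the request for later
        return {"deferred": job_name}

    def run_all(self, target):
        """Replay every recorded request against a real session, in order."""
        return [target.submit(name) for name in self.recorded]


regular = ImmediateSession()
recording = RecordingSession()

recording.submit("ProcessData")
recording.submit("TrainModel")
print(regular.executed)      # [] because nothing has run yet

recording.run_all(regular)   # the toy equivalent of a pipeline execution
print(regular.executed)      # ['ProcessData', 'TrainModel']
```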

In [None]:
# Get the default SageMaker bucket name
default_bucket = sagemaker_session.default_bucket()

# Local file path
DATA_PATH = "data/california_housing.csv"

# S3 prefix (folder path within the bucket)
DATA_PREFIX = "datasets"

try:
    # Upload the dataset using the upload_data() method
    s3_uri = sagemaker_session.upload_data(
        path=DATA_PATH,
        bucket=default_bucket,
        key_prefix=DATA_PREFIX
    )
    
    print(f"Data uploaded successfully to: {s3_uri}")
    
except Exception as e:
    print(f"Error: {e}")

# Setting Up AWS Resources

Now that we understand sessions, we need to set up the AWS resources and permissions that our pipeline will use:



In [None]:
# Retrieve your AWS account ID (used for constructing resource ARNs)
account_id = sagemaker_session.boto_session.client('sts').get_caller_identity()['Account']

# Get the default S3 bucket for your SageMaker resources
default_bucket = sagemaker_session.default_bucket()

# Define the SageMaker execution role ARN
SAGEMAKER_ROLE = f"arn:aws:iam::{account_id}:role/SageMakerDefaultExecution"

# Creating the Processing Environment

To run our preprocessing script in SageMaker, we need to define the computing environment where our code will execute. We use an `SKLearnProcessor` because our preprocessing script uses scikit-learn libraries, and crucially, we use the `PipelineSession` to ensure the processor becomes a pipeline step rather than executing immediately:



In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor

# Create a processor that will run our data preprocessing script
processor = SKLearnProcessor(
    framework_version="1.2-1",    # Specify scikit-learn version
    role=SAGEMAKER_ROLE,          # IAM role with necessary permissions
    instance_type="ml.m5.large",  # Compute instance type for processing
    instance_count=1,             # Number of instances to use
    sagemaker_session=pipeline_session  # Use pipeline session for deferred execution
)

# Building Our First Pipeline Step

With our processor configured, we can now create the actual processing step using `ProcessingStep`. This step defines what data goes in, what comes out, and what code runs in between.

Before proceeding, note that we assume you have already uploaded your raw dataset (`california_housing.csv`) to your S3 default bucket at the path `/datasets/california_housing.csv`. This is necessary because the pipeline will read the input data directly from S3.

In [None]:
from sagemaker.workflow.steps import ProcessingStep

# Define the processing step with inputs, outputs, and the script to run
processing_step = ProcessingStep(
    name="ProcessData",   # Unique name for this step in the pipeline
    processor=processor,  # The processor we defined above
    inputs=[
        # Define where the raw data comes from (S3 location)
        sagemaker.processing.ProcessingInput(
            source=f"s3://{default_bucket}/datasets/california_housing.csv",
            destination="/opt/ml/processing/input"  # Where data will be mounted in container
        )
    ],
    outputs=[
        # Define where processed training data will be saved
        sagemaker.processing.ProcessingOutput(
            output_name="train_data",               # Reference name for this output
            source="/opt/ml/processing/train"       # Container path where script saves data
        ),
        # Define where processed test data will be saved
        sagemaker.processing.ProcessingOutput(
            output_name="test_data",                # Reference name for this output
            source="/opt/ml/processing/test"        # Container path where script saves data
        )
    ],
    code="data_processing.py"  # The Python script that performs the processing
)

Notice how the paths in our `ProcessingInput` and `ProcessingOutput` definitions perfectly match the paths our script expects. The `inputs` parameter specifies where our raw data comes from (an S3 location) and where it will be mounted inside the processing container (`/opt/ml/processing/input`). The `outputs` parameter defines where our processed data will be saved, with separate outputs for training and test data. Each output has a name that we can reference later and a source path where our processing script will write the data. The `code` parameter points to the Python script that contains our preprocessing logic.

This connection between the step definition and the script paths is crucial — it's what allows SageMaker to automatically handle all the data movement for you.

# Creating the Pipeline

With our processing step defined, we can now create our first pipeline. A pipeline is a collection of steps that SageMaker runs according to their dependencies, and right now we have just one step. Note that we use the regular `sagemaker_session` for pipeline management:



In [None]:
from sagemaker.workflow.pipeline import Pipeline

# Set a name for the SageMaker Pipeline
PIPELINE_NAME = "california-housing-preprocessing-pipeline"

# Create pipeline with our processing step
pipeline = Pipeline(
    name=PIPELINE_NAME,
    steps=[processing_step],  # For now, just one step
    sagemaker_session=sagemaker_session  # Use regular session for pipeline management
)

This creates a pipeline definition with our single processing step. In future lessons, we'll add more steps to this list to create more complex workflows with training, evaluation, and model registration.

At this point, we've only created the pipeline definition in memory — it doesn't exist in AWS yet. To make it available in SageMaker, we need to register it with the service using the `upsert` method:

In [None]:
# Create or update pipeline (upsert = update if exists, create if not)
pipeline.upsert(role_arn=SAGEMAKER_ROLE)

The `upsert` method is particularly useful because it handles both creation and updates intelligently. If this is the first time you're running this code, it will create a new pipeline in SageMaker. If you run the same code again after making changes to your pipeline definition, it will update the existing pipeline rather than throwing an error. This makes iterative development much smoother — you can modify your pipeline code, run it again, and SageMaker will automatically apply your changes.

Think of the pipeline definition as a blueprint or recipe. Once you've registered this blueprint with SageMaker using `upsert`, you can execute it multiple times. Each execution is a separate run of the same blueprint, potentially with different data or parameters.

# Executing the Pipeline

Now that our pipeline is registered with SageMaker, we can start an execution. This is where the actual work begins:

In [None]:
# Start pipeline execution and get execution object for monitoring
execution = pipeline.start()

# Get the execution ARN for tracking
print(f"Pipeline execution ARN: {execution.arn}")

When you call `pipeline.start()`, SageMaker immediately begins executing your pipeline in the background. This means your local Python script doesn't need to wait for the processing to complete — the heavy computational work is happening on AWS infrastructure while your script continues running or even after it finishes.

The execution object provides valuable information about the running pipeline. You'll see output similar to:

```
Pipeline execution ARN: arn:aws:sagemaker:us-east-1:123456789012:pipeline/california-housing-preprocessing-pipeline/execution/x1gc33lgj8v5
```

# Monitoring Your Pipeline


In [None]:
# Check the current status
execution_details = execution.describe()

# Display status
print(f"Status: {execution_details['PipelineExecutionStatus']}")
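
A single `describe()` call only gives a snapshot. If you want to block until the run finishes, one option is a small polling loop like the sketch below. The helper name `wait_for_pipeline` and its parameters are assumptions made for this example. It accepts any zero-argument callable that returns a dictionary with a `PipelineExecutionStatus` key, so it can be tried without AWS access.

```python
import time

# Statuses after which a pipeline execution will not change any more.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Stopped"}


def wait_for_pipeline(describe, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll `describe()` until the execution reaches a terminal status.

    `describe` is any zero-argument callable returning a dict that contains
    'PipelineExecutionStatus', for example `execution.describe`.
    """
    for _ in range(max_polls):
        status = describe()["PipelineExecutionStatus"]
        print(f"Status: {status}")
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("Pipeline execution did not finish within the polling budget")
```

With the execution object from above, this would be called as `wait_for_pipeline(execution.describe)`. The SDK's execution object also provides its own `wait()` method, which is usually the simpler choice.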


# Integrating Model Training Steps

## Our Training Script

Before we add the training step to our pipeline, let's understand what our training script does. Our `train.py` script contains the actual machine learning code that will execute during the training step. This script follows SageMaker's training script conventions, which means it knows how to read input data from specific locations and save the trained model to the correct output directory.


## Configuring a SKLearn Estimator

Now let's add the training components to our existing pipeline. First, we need to create an estimator that defines how our model training will be executed. Our `SKLearn` estimator will specify the training environment, including our entry point script, computational resources, and framework version:



In [None]:
from sagemaker.sklearn.estimator import SKLearn

# Create an estimator that defines how our model will be trained
estimator = SKLearn(
    entry_point="train.py",             # Our training script
    role=SAGEMAKER_ROLE,                # IAM role with necessary permissions
    instance_type="ml.m5.large",        # Compute instance type for training
    instance_count=1,                   # Number of instances to use
    framework_version="1.2-1",          # Specify scikit-learn version
    py_version="py3",                   # Python version
    sagemaker_session=pipeline_session  # Use pipeline session for deferred execution
)

## Creating the TrainingStep

With our estimator configured, we can now create the `TrainingStep` that orchestrates model training within our pipeline. A `TrainingStep` is a pipeline component that combines an estimator (which defines the training environment) with input data sources. <u>The key to building effective pipelines lies in properly connecting outputs from one step to inputs of another using SageMaker's step properties system.</u>



In [None]:
from sagemaker.workflow.steps import TrainingStep

# Define our training step that uses output from our processing step
training_step = TrainingStep(
    name="TrainModel",    # Unique name for this step in our pipeline
    estimator=estimator,  # Our estimator we defined above
    inputs={
        # Use the training data output from our processing step as input
        "train": sagemaker.inputs.TrainingInput(
            # Reference the S3 URI where our processing step saved the training data
            s3_data=processing_step.properties.ProcessingOutputConfig.Outputs["train_data"].S3Output.S3Uri
        )
    }
)

The critical connection happens in the `inputs` parameter. The `"train"` key in our inputs dictionary corresponds to the `SM_CHANNEL_TRAIN` environment variable that our training script uses to locate input data. Remember in our training script where we had `args.train = os.environ.get('SM_CHANNEL_TRAIN')`?

That environment variable gets populated with the local path where SageMaker downloads our training data. When the training job runs, SageMaker will automatically download the data from the S3 location we specify and make it available to our script at the path stored in `SM_CHANNEL_TRAIN`.
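
As a reminder of how that looks in code, here is a minimal sketch of the argument-parsing part of a training script. The function name `parse_args` and the `--model-dir` flag are assumptions for illustration. `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR` are the environment variables SageMaker sets inside the training container.

```python
import argparse
import os


def parse_args(argv=None):
    """Read data and model locations from SageMaker's environment variables."""
    parser = argparse.ArgumentParser()
    # Local directory where SageMaker places the data for the "train" channel.
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    # Local directory whose contents SageMaker packages as the model artifact.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    return parser.parse_args(argv)
```

Inside a training container these defaults typically resolve to `/opt/ml/input/data/train` and `/opt/ml/model`. Passing the flags explicitly makes it possible to run the same script on a local machine.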

Now let's break down how we specify which S3 location to use for the `s3_data` parameter:

- `processing_step` - References our preprocessing step that we defined earlier
- `.properties` - Accesses the runtime properties of our preprocessing step
- `.ProcessingOutputConfig` - References the output configuration of the processing job
- `.Outputs` - Accesses the collection of all outputs from the processing step
- `["train_data"]` - Selects the specific output we named "train_data" when we defined our processing step's outputs
- `.S3Output` - References the S3 output configuration for this specific output
- `.S3Uri` - Gets the actual S3 URI where this output was saved

This property chain essentially says: "Use the S3 location where the processing step saved its 'train_data' output as the input for this training step." The beauty of this approach is that SageMaker resolves these property references at runtime, so you don't need to know the exact S3 paths beforehand.
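
To picture what the reference resolves to, the sketch below walks the same chain over a plain dictionary loosely modelled on the shape of a processing job description. SageMaker performs this resolution itself at runtime. The helper `resolve_output_uri`, the bucket name, and the S3 prefixes are made up for the example.

```python
def resolve_output_uri(job_description, output_name):
    """Return the S3 URI of the named output in a processing job description."""
    outputs = job_description["ProcessingOutputConfig"]["Outputs"]
    for output in outputs:
        if output["OutputName"] == output_name:
            return output["S3Output"]["S3Uri"]
    raise KeyError(f"No processing output named {output_name!r}")


# Made-up job description; the bucket and prefixes are placeholders.
example_description = {
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                "OutputName": "train_data",
                "S3Output": {"S3Uri": "s3://example-bucket/ProcessData/output/train_data"},
            },
            {
                "OutputName": "test_data",
                "S3Output": {"S3Uri": "s3://example-bucket/ProcessData/output/test_data"},
            },
        ]
    }
}

print(resolve_output_uri(example_description, "train_data"))
```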

## Let's Update Our Pipeline

Finally, let's update our pipeline creation to include both our preprocessing and training steps:



In [None]:
# Create our pipeline by combining all steps
pipeline = Pipeline(
    name=PIPELINE_NAME,
    steps=[processing_step, training_step],  # Include both our steps
    sagemaker_session=sagemaker_session
)

try:
    # Create or update our pipeline (upsert = update if exists, create if not)
    pipeline.upsert(role_arn=SAGEMAKER_ROLE)
    
    # Start our pipeline execution and get execution object for monitoring
    execution = pipeline.start()

    # Print the unique ARN identifier for this execution
    print(f"Pipeline execution ARN: {execution.arn}")

    # Get detailed information about our execution
    execution_details = execution.describe()

    # Display the current status of our pipeline execution
    print(f"Status: {execution_details['PipelineExecutionStatus']}")
    
except Exception as e:
    print(f"Error: {e}")

## Understanding Pipeline Execution Order

Now that we have a multi-step pipeline, it's important to understand how SageMaker determines execution order. <u>Pipeline execution order is determined by dependencies, not by the order of steps in your steps list</u>.

When you create a pipeline with `steps=[processing_step, training_step]`, SageMaker doesn't execute steps based on list order. Instead, it analyzes dependencies between steps. In our pipeline, we created a dependency when we configured our training step to use the processing step's output:



In [None]:
# This creates a dependency: training_step depends on processing_step
s3_data=processing_step.properties.ProcessingOutputConfig.Outputs["train_data"].S3Output.S3Uri

This property reference tells SageMaker that the training step cannot begin until the processing step completes successfully. <u>Dependencies are explicit and created through step property references</u> — simply placing steps in a certain order in your list doesn't create dependencies.

<u>If steps have no dependencies between them, SageMaker executes them in parallel</u> to minimize total execution time. This explicit dependency model makes pipelines robust and predictable, as execution flow is clearly defined by data dependencies rather than list ordering.
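
The scheduling idea can be illustrated with Python's standard `graphlib` module (Python 3.9 or later). This is only an analogy for how a dependency graph yields an execution order, not SageMaker's actual scheduler, and `IndependentStep` is a hypothetical step added to show parallelism.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it references.
dependencies = {
    "ProcessData": set(),
    "TrainModel": {"ProcessData"},
    "IndependentStep": set(),  # hypothetical step with no property references
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # steps whose dependencies are all done
    waves.append(ready)
    sorter.done(*ready)

print(waves)  # [['IndependentStep', 'ProcessData'], ['TrainModel']]
```

Each inner list is a group of steps that could run at the same time. `TrainModel` only appears once `ProcessData` has finished, regardless of where either step sits in the dictionary.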




# Evaluating Models in Pipelines


## Exploring the Evaluation Script


Let's examine the evaluation script we'll call `evaluation.py` that will perform the same model assessment you've done locally in previous exercises, but now adapted to work within our SageMaker Pipeline. This script follows SageMaker's processing script conventions, reading the model and data from other pipeline steps and saving results for future pipeline components.


## Configuring a SKLearnProcessor for Evaluation

Now, let's integrate the evaluation functionality into our existing pipeline. While we could technically reuse the same processor instance from our preprocessing step, creating separate processors for each step is a best practice that improves pipeline clarity and maintainability - each step has its own dedicated processor that can be independently configured, scaled, or modified without affecting other pipeline components.





In [None]:
# Create a processor for running the evaluation script
evaluation_processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=SAGEMAKER_ROLE,
    instance_type="ml.m5.large",
    instance_count=1,
    sagemaker_session=pipeline_session
)

This separation allows you to optimize each step independently. For example, you might later decide that evaluation requires more memory or different compute resources than preprocessing, or you might want to experiment with different framework versions for specific steps. Having dedicated processors makes these modifications straightforward without risking unintended effects on other pipeline components.


## Understanding Property Files for Pipeline Integration

The evaluation step introduces a powerful new concept called `Property Files`, which allow pipeline steps to expose structured data that subsequent steps can reference and use for decision-making.


In [None]:
from sagemaker.workflow.properties import PropertyFile

# Define property file for evaluation metrics
evaluation_report_property = PropertyFile(
    name="EvaluationReport",     # Unique identifier for this property file within the pipeline
    output_name="evaluation",    # Must match the output_name in the evaluation step's ProcessingOutput
    path="evaluation.json"       # Path to the JSON file within the evaluation output directory
)

Property files enable sophisticated pipeline orchestration by making the contents of output files accessible to other pipeline components. In our case, the evaluation metrics (MSE, RMSE, MAE, R²) stored in the JSON file become available for future pipeline steps to reference. This capability becomes crucial when building more advanced pipelines that might include conditional deployment based on performance thresholds, automated model comparison logic, or routing decisions based on evaluation results.

<u>The `output_name` parameter must exactly match the output name we'll specify in our evaluation step's `ProcessingOutput`, while the `path` parameter tells SageMaker where to find the JSON file within the output directory</u>. This connection ensures that SageMaker can locate and parse the evaluation metrics for use by downstream pipeline components.
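
For orientation, the sketch below shows how an evaluation script might produce such a file. The helper names and the `mse`, `rmse`, and `mae` keys are assumptions for illustration. The `regression_metrics` and `r2_score` keys follow the JSON path used by the quality condition later in this pipeline. The metrics are computed in plain Python to keep the example self-contained, and it assumes the true values are not all identical.

```python
import json
import math
import os


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and R-squared with plain Python."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # assumed non-zero
    r2 = 1.0 - (mse * n) / ss_tot
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2_score": r2}


def write_evaluation_report(metrics, output_dir="/opt/ml/processing/evaluation"):
    """Write metrics to <output_dir>/evaluation.json under 'regression_metrics'."""
    os.makedirs(output_dir, exist_ok=True)
    report_path = os.path.join(output_dir, "evaluation.json")
    with open(report_path, "w") as f:
        json.dump({"regression_metrics": metrics}, f, indent=2)
    return report_path
```

The directory passed as `output_dir` is the one the step's `ProcessingOutput` uploads, and `evaluation.json` is the file name the `PropertyFile` points at.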




## Defining the Evaluation ProcessingStep

With our processor and property file configured, we can now define the evaluation step that orchestrates the entire evaluation process:



In [None]:
# Define the evaluation step that takes the trained model and test data as input
evaluation_step = ProcessingStep(
    name="EvaluateModel",            # Unique name for this step in the pipeline
    processor=evaluation_processor,  # The processor we defined above
    inputs=[
        # Model artifact from the training step
        sagemaker.processing.ProcessingInput(
            source=training_step.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model"
        ),
        # Test data from the processing step
        sagemaker.processing.ProcessingInput(
            source=processing_step.properties.ProcessingOutputConfig.Outputs["test_data"].S3Output.S3Uri,
            destination="/opt/ml/processing/test"
        )
    ],
    outputs=[
        # Where the evaluation report will be saved
        sagemaker.processing.ProcessingOutput(
            output_name="evaluation",               # Reference name for this output
            source="/opt/ml/processing/evaluation"  # Container path where script saves report
        )
    ],
    code="evaluation.py",               # The Python script that performs the evaluation
    property_files=[evaluation_report_property]  # Enable access to evaluation metrics in subsequent pipeline steps
)

The evaluation step demonstrates sophisticated pipeline orchestration by connecting to outputs from two different previous steps. The first input references `training_step.properties.ModelArtifacts.S3ModelArtifacts`, which provides the S3 location where our training step saved the trained model artifact. The second input references the test data output from our preprocessing step using the same property system you learned in the previous lesson.

This dual-input configuration creates dependencies on both earlier steps through their property references, ensuring the evaluation step won't execute until both the training and preprocessing steps complete successfully. SageMaker automatically manages these dependencies, guaranteeing that our evaluation script will have access to both the trained model and the test data when it runs.

The `property_files` parameter enables downstream pipeline components to access the evaluation metrics programmatically. This capability becomes crucial when building more sophisticated pipelines that might include conditional deployment based on model performance thresholds or automated model comparison logic.

## Integrating the Evaluation Step into the Pipeline



In [None]:
# Create pipeline by combining all steps
pipeline = Pipeline(
    name=PIPELINE_NAME,
    steps=[processing_step, training_step, evaluation_step],
    sagemaker_session=sagemaker_session
)

# Conditionally Registering Models for Deployment

## Understanding Conditional Model Approval
Conditional steps represent one of the most powerful features of SageMaker Pipelines, allowing you to create intelligent workflows that make decisions based on data, metrics, or other criteria. Instead of executing every step in a linear fashion, conditional steps enable your pipeline to branch and execute different logic paths depending on the conditions you define.

In the context of model deployment, conditional approval serves as a quality gate that ensures only models meeting your performance standards reach production systems. <u>`Model registration` refers to the process of storing a trained model in SageMaker's model registry — a centralized catalog that maintains different versions of your models along with their metadata, approval status, and deployment artifacts</u>. Rather than manually reviewing evaluation reports and deciding which models to register, you can encode your quality criteria directly into the pipeline logic.

For our California housing price prediction pipeline, we'll implement a conditional step that checks whether the trained model's R-squared score meets a minimum threshold of 0.6. The conditional logic will evaluate this score from our evaluation step and only proceed with model registration if the threshold is met. Models that fall below this threshold will be skipped entirely — they won't be registered in the model registry, preventing them from being deployed while still preserving their training artifacts and evaluation results for analysis.

This automated quality control ensures that your production systems only receive models that have demonstrated acceptable performance on your test data, eliminating human bottlenecks and reducing the risk of deploying poor-performing models.


## Creating a Lightweight Inference Model
When preparing models for deployment, it's important to create optimized inference artifacts that contain only the components necessary for making predictions. The entry point script serves as the interface between SageMaker's inference infrastructure and your trained model, defining how the model should be loaded when the inference container starts.

We'll create a separate file called `entry_point.py` that contains only the essential model loading logic.

This `entry_point.py` script contains only the `model_fn` function, which is called by SageMaker when the inference container starts. It loads your trained Linear Regression model from the `model.joblib` file and returns the model object for use in prediction.

By creating this dedicated `entry_point.py` script with minimal dependencies, we ensure faster startup times and more efficient resource utilization during deployment.
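
A minimal sketch of such a script is shown below. It assumes the training script saved the model as `model.joblib` at the top level of the model directory. The `model_fn(model_dir)` signature is the loading hook the SageMaker scikit-learn serving container looks for.

```python
import os


def model_fn(model_dir):
    """Load the trained model when the inference container starts.

    SageMaker passes the directory where it extracted the model artifact.
    """
    # Imported inside the function so this sketch has no import-time
    # dependency; a real script would usually import joblib at the top.
    import joblib

    return joblib.load(os.path.join(model_dir, "model.joblib"))
```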

## Implementing the Conditional Registration Step
Now that we have our `entry_point.py` script ready, let's go back to work on our pipeline code to implement the conditional logic that will automatically register high-performing models for deployment. We'll accomplish this by building four key components:

- <u>Creating a serving model object</u> that packages our trained model with the inference script
- <u>Configuring a model registration step</u> that stores approved models in SageMaker's model registry
- <u>Defining a quality condition</u> that evaluates the R-squared score against our threshold
- <u>Creating a conditional step</u> that orchestrates the decision-making logic to only register models that meet our quality standards


Let's start with the first component by creating the serving model object.

## Creating the Serving Model Object

First, we'll create the serving model object that defines how the trained model should be packaged for inference:



In [None]:
from sagemaker.sklearn.model import SKLearnModel

# Create a serving model object with a minimal inference script
inference_model = SKLearnModel(
    role=SAGEMAKER_ROLE,                                                  # IAM role for model execution
    model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,  # Reference to trained model
    entry_point="entry_point.py",                                         # Our minimal inference script
    framework_version="1.2-1",                                            # Scikit-learn framework version
    py_version="py3",                                                     # Python version for inference
    sagemaker_session=pipeline_session                                    # Pipeline session for execution
)

The `SKLearnModel` class serves as a bridge between your trained model artifacts and SageMaker's deployment infrastructure, providing essential metadata and configuration that the model registry and inference endpoints require.

When you register a model using `SKLearnModel`, you're not just storing the model weights — you're creating a complete deployment package that includes:

- `Framework specification`: Tells SageMaker which scikit-learn version and Python version to use
- `Entry point definition`: Specifies which script contains the inference logic
- `Execution role`: Defines the IAM permissions needed for model deployment
- `Container configuration`: Sets up the proper runtime environment for your model

The model registry needs this comprehensive information to ensure that anyone deploying the model later will have all the necessary components and configuration. Without `SKLearnModel`, the registry would only have the raw model file without the context needed for successful deployment.

This approach also enables deployment portability — once a model is registered with all its deployment metadata, it can be deployed by different teams, in different environments, or at different times, all while maintaining consistent behavior and requirements.

## Configuring Model Registration
Before configuring the model registration step, let's define the model package group name that will organize our registered models in SageMaker's model registry:



In [None]:
# Define the model package group name for organizing registered models
MODEL_PACKAGE_GROUP_NAME = "california-housing-pipeline-models"

The model package group is like a folder in SageMaker's model registry that keeps all versions of your California housing models organized in one place. Each time your pipeline runs and approves a model, it will create a new version within this group, allowing you to track how your models improve over time.

Next, we'll configure the model registration step that will store approved models in SageMaker's model registry:

In [None]:
from sagemaker.workflow.step_collections import RegisterModel

# Configure the model registration step
register_step = RegisterModel(
    name="RegisterModel",                               # Step name in the pipeline
    model=inference_model,                              # The serving model we just created
    content_types=["text/csv"],                         # Input format the model accepts
    response_types=["text/csv"],                        # Output format the model produces
    inference_instances=["ml.m5.large"],                # Approved instance types for real-time inference
    transform_instances=["ml.m5.large"],                # Approved instance types for batch transform
    model_package_group_name=MODEL_PACKAGE_GROUP_NAME,  # Registry group name
    approval_status="Approved"                          # Automatically approve registered models
)

The `RegisterModel` step takes your trained model and creates a permanent record of it in SageMaker's model registry. Remember that the `model` parameter receives the `inference_model` object we created earlier, an `SKLearnModel` that packages your trained Linear Regression model with the `entry_point.py` script needed for making predictions.

Let's break down the key parameters:

- `content_types` and `response_types`: These specify the data formats your model can handle. By setting both to `["text/csv"]`, you're telling SageMaker that this model accepts CSV input data (like housing features) and returns CSV output (predicted prices). This matches the format of your California housing dataset and ensures proper data handling during inference.

- `inference_instances` and `transform_instances`: These define the EC2 instance types that are approved for deploying this model. `inference_instances` specifies what resources can be used for real-time endpoints (where you send individual prediction requests), while `transform_instances` specifies resources for batch transform jobs (where you process large datasets at once). By setting both to `["ml.m5.large"]`, you're pre-approving this cost-effective instance type for both deployment scenarios.

- `approval_status="Approved"`: This is a crucial parameter that automatically marks registered models as approved for deployment. Without this setting, models would be registered with a "PendingManualApproval" status, requiring someone to manually review and approve each model in the SageMaker console before it can be deployed. By setting this to "Approved", your pipeline can fully automate the deployment workflow — since your conditional logic already ensures only high-quality models reach this registration step, there's no need for additional manual approval gates.

This automated approval approach is essential for achieving true end-to-end automation. Your pipeline has already validated model quality through the conditional step, so manual approval would only add a redundant gate.
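To make the `text/csv` contract from the parameter list more concrete, here is a minimal sketch of what a request body and a response body could look like. The helper names (`to_csv_payload`, `parse_csv_predictions`) and the feature values are hypothetical, chosen only for illustration; the real number and order of columns depend on what your preprocessing script produces.

```python
import csv
import io


def to_csv_payload(rows):
    """Serialize a list of feature rows into a text/csv request body."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def parse_csv_predictions(body):
    """Parse a text/csv response body into a flat list of floats."""
    predictions = []
    for line in body.strip().splitlines():
        predictions.extend(float(value) for value in line.split(",") if value)
    return predictions


# Hypothetical feature row (values are made up for illustration)
features = [[8.3252, 41.0, 6.98, 1.02, 322.0, 2.56, 37.88, -122.23]]
payload = to_csv_payload(features)
print(payload)
```

Because both `content_types` and `response_types` are `["text/csv"]`, the same plain-text format is used in both directions, which keeps client code simple.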

## Defining the Quality Condition
Now we'll create the condition that evaluates model performance based on our quality threshold:



In [None]:
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.functions import JsonGet

# Define the quality threshold condition
condition_r2_threshold = ConditionGreaterThanOrEqualTo(
    left=JsonGet(
        step_name=evaluation_step.name,            # Reference to the evaluation step
        property_file=evaluation_report_property,  # The PropertyFile containing metrics
        json_path="regression_metrics.r2_score"    # Path to the R-squared score in JSON
    ),
    right=0.6                                      # Minimum acceptable R-squared threshold
)

The `ConditionGreaterThanOrEqualTo` condition compares the R-squared score from your evaluation step against a threshold of `0.6`. The `JsonGet` function extracts the specific metric from the evaluation report using the JSON path `regression_metrics.r2_score`, which corresponds to the structure of the `evaluation.json` file created by your evaluation script.
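The lookup that `JsonGet` describes can be pictured with a plain-Python sketch. This only illustrates how the path `regression_metrics.r2_score` maps onto the nested structure of `evaluation.json`; it is not how SageMaker implements the lookup. The helper `resolve_json_path` and the score `0.71` are hypothetical.

```python
import json

# Hypothetical report contents; the real score comes from your evaluation script
evaluation_json = '{"regression_metrics": {"r2_score": 0.71}}'


def resolve_json_path(document, json_path):
    """Walk a dot-separated path through nested dictionaries."""
    value = document
    for key in json_path.split("."):
        value = value[key]
    return value


report = json.loads(evaluation_json)
r2_score = resolve_json_path(report, "regression_metrics.r2_score")

# Same comparison the condition expresses: score >= threshold
meets_threshold = r2_score >= 0.6
print(f"r2_score={r2_score}, meets threshold: {meets_threshold}")
```

If the key names in your evaluation script differ from the path given to `JsonGet`, the lookup fails at pipeline run time, so the two need to stay in sync.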



## Creating the Conditional Step
Finally, we'll create the conditional step that orchestrates the decision-making logic:



In [None]:
from sagemaker.workflow.condition_step import ConditionStep

# Create a conditional step that only registers the model if it meets quality standards
condition_step = ConditionStep(
    name="CheckModelQuality",             # Step name in the pipeline
    conditions=[condition_r2_threshold],  # List of conditions to evaluate
    if_steps=[register_step],             # Steps to execute when conditions are met
    else_steps=[]                         # Steps to execute when conditions are not met (none in this case)
)

The `ConditionStep` evaluates the list of conditions and executes different step collections based on the results. When the condition evaluates to `True`, the registration step is executed. If the condition evaluates to `False`, no steps are executed (empty `else_steps` list), preventing low-quality models from entering your deployment pipeline.
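The branching behavior can be summarized in a few lines of plain Python. This is a simplified mental model rather than SageMaker's implementation; `select_branch` is a hypothetical helper, and the step lists are represented by simple strings.

```python
def select_branch(condition_results, if_steps, else_steps):
    """Return if_steps only when every condition is True, else else_steps."""
    return if_steps if all(condition_results) else else_steps


# Model meets the threshold: the registration branch runs
passing = select_branch([True], ["register_step"], [])

# Model misses the threshold: the empty else branch means nothing runs
failing = select_branch([False], ["register_step"], [])

print(passing)
print(failing)
```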



## Integrating the Conditional Step into the Pipeline

With all the conditional logic components defined, we can now integrate the new conditional step into our existing pipeline. The updated pipeline maintains the same three-step foundation you've built while adding intelligent decision-making capabilities:



In [None]:
# Create pipeline by combining all steps
pipeline = Pipeline(
    name=PIPELINE_NAME,
    steps=[processing_step, training_step, evaluation_step, condition_step],
    sagemaker_session=sagemaker_session
)

## Querying the Model Registry for Approved Models

After your pipeline has successfully executed and registered an approved model, you can deploy it independently from a different environment or at a later time. This separation between training and deployment workflows is a best practice in MLOps, allowing you to deploy models on different schedules and test deployment configurations without re-running the entire training pipeline.

Since this is your first time working with SageMaker's model registry, let's understand what happened when your pipeline registered the model. The model registry acts as a centralized catalog that stores different versions of your models along with their metadata, approval status, and deployment artifacts. Each registered model receives a unique identifier called an ARN (Amazon Resource Name) that you can use to reference and deploy that specific model version.

To deploy an approved model, you'll first need to query the model registry to find the latest approved model from your pipeline. This works similarly to how you've searched for training jobs before, but instead of looking for training job names, you're searching for model packages within a specific group:

In [None]:
import sagemaker

# Create a SageMaker session
sagemaker_session = sagemaker.Session()

# Get the SageMaker client from the session
sagemaker_client = sagemaker_session.sagemaker_client

# List approved model packages from the group, sorted by creation time
response = sagemaker_client.list_model_packages(
    ModelPackageGroupName="california-housing-pipeline-models",
    ModelApprovalStatus='Approved',
    SortBy='CreationTime',
    SortOrder='Descending'
)

## Extracting the Model Package ARN


In [None]:
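# Extract the model package list from the response
model_packages = response.get('ModelPackageSummaryList', [])

# Stop early with a clear message if the group has no approved packages yet
if not model_packages:
    raise RuntimeError("No approved model packages found in the group")

# Get the ARN of the latest approved model package
model_package_arn = model_packages[0]['ModelPackageArn']

# Display model package ARN
print(f"Found approved model: {model_package_arn}")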

The response returns a dictionary with a `ModelPackageSummaryList` containing model package information. Since we sorted by newest first, `model_packages[0]` gives us the most recent approved model. We then extract the `ModelPackageArn` from this model package summary to get the unique identifier needed for deployment.

When you run this code, you'll see output similar to:

```
Found approved model: arn:aws:sagemaker:us-east-1:123456789012:model-package/california-housing-pipeline-models/1
```

This ARN is the unique identifier for your approved model package. It points to the registry entry that holds everything SageMaker needs to deploy your model, including the trained model artifacts, the inference script, and the deployment configuration that was specified during registration. The number at the end (in this case `/1`) is the version number of the model package within the group: the first model registered in the group is version 1, the second is version 2, and so on. We'll use this ARN in the next step to create a deployable model object.
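If you want to log or compare versions, the ARN can be split into its parts. The sketch below assumes the versioned model package layout shown in the sample output above; `parse_model_package_arn` is a hypothetical helper, not part of the SageMaker SDK.

```python
def parse_model_package_arn(arn):
    """Split a versioned model package ARN into region, account, group, and version."""
    # General ARN layout: arn:partition:service:region:account-id:resource
    _, _, _, region, account_id, resource = arn.split(":", 5)
    # Resource layout assumed here: model-package/<group-name>/<version>
    _, group_name, version = resource.split("/")
    return {
        "region": region,
        "account_id": account_id,
        "group_name": group_name,
        "version": int(version),
    }


sample_arn = (
    "arn:aws:sagemaker:us-east-1:123456789012:"
    "model-package/california-housing-pipeline-models/1"
)
parsed = parse_model_package_arn(sample_arn)
print(parsed["group_name"], parsed["version"])
```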



## Deploying the Approved Model

Now that you have the ARN of your approved model, you can deploy it using the same serverless deployment approach you've used before. However, instead of deploying a model object created from training artifacts, you'll deploy a `ModelPackage` object created from the registry entry:

In [None]:
from sagemaker.model import ModelPackage
from sagemaker.serverless import ServerlessInferenceConfig

# Create a ModelPackage object from the ARN
model = ModelPackage(
    role=SAGEMAKER_ROLE,
    model_package_arn=model_package_arn,
    sagemaker_session=sagemaker_session
)

# Configure serverless inference with memory and concurrency limits
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=10
)

# Deploy the model as a serverless endpoint
predictor = model.deploy(
    serverless_inference_config=serverless_config,
    endpoint_name="california-housing-estimator",
    wait=False
)

The key difference here is that you're creating a `ModelPackage` object instead of an `SKLearnModel` object. The `ModelPackage` automatically includes all the deployment configuration (entry point script, framework version, etc.) that was specified when the model was registered by your pipeline. This approach demonstrates the power of the model registry — all the deployment details are packaged with the model, making deployment consistent and repeatable.
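Because the deployment above uses `wait=False`, the call returns before the endpoint is ready. One way to check readiness is to poll the endpoint status through the SageMaker client's `describe_endpoint` call. The sketch below is one possible approach under that assumption; `wait_for_endpoint`, its default polling interval, and its attempt limit are hypothetical choices.

```python
import time


def wait_for_endpoint(client, endpoint_name, poll_seconds=30, max_attempts=40):
    """Poll the endpoint status until it is InService, fails, or attempts run out."""
    for _ in range(max_attempts):
        response = client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} not ready after {max_attempts} checks")


# Example usage with the client created earlier (requires AWS credentials):
# wait_for_endpoint(sagemaker_client, "california-housing-estimator")
```

Taking the client as a parameter keeps the function easy to exercise with a stand-in object before pointing it at a real endpoint.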

This deployment workflow represents the final piece of your end-to-end MLOps system. Your pipeline trains and evaluates models automatically, registers only those that meet quality standards, and now you can deploy approved models independently whenever needed. This separation allows you to maintain different deployment schedules, test various deployment configurations, and roll back to previous model versions if needed.