# Monitoring Pipeline Executions

Effective monitoring involves several key aspects:

- **<u>Track overall execution status</u>**: Know whether your pipeline is still running, has completed successfully, or has encountered an error.
- **<u>Understand execution timing</u>**: Identify performance bottlenecks and optimize your workflows by analyzing how long each execution and step takes.
- **<u>Examine individual step details</u>**: See what each component of your pipeline accomplished, how long it took, and whether it succeeded or failed.
- **<u>Diagnose and resolve issues</u>**: Use detailed execution information to quickly identify and fix problems when they occur.
- **<u>Optimize costs and resources</u>**: Monitor execution times to spot steps that might be over-provisioned or underperforming.

These capabilities matter most in production, where pipelines often run on schedules or are triggered by events and must complete successfully without manual intervention. When a run fails or slows down, detailed execution information lets you pinpoint the responsible step instead of rerunning the whole pipeline to find it.

## Setting Up the Monitoring Environment


In [None]:
import sagemaker

# Define the pipeline name to monitor
PIPELINE_NAME = "california-housing-preprocessing-pipeline"

# Create a SageMaker session
sagemaker_session = sagemaker.Session()

# Get the low-level SageMaker client from the session
# This client lets us call the SageMaker API directly
sagemaker_client = sagemaker_session.sagemaker_client

## Fetching Recent Pipeline Executions

SageMaker's `list_pipeline_executions` API allows us to query for execution records and sort them to find the most recent runs.

In [None]:
# List pipeline executions sorted by creation time (newest first)
response = sagemaker_client.list_pipeline_executions(
    PipelineName=PIPELINE_NAME,
    SortBy='CreationTime',
    SortOrder='Descending'
)

# Extract the execution summaries from the response
executions = response['PipelineExecutionSummaries']

if not executions:
    # Stop here: the cells below need at least one execution to inspect
    raise RuntimeError("No pipeline executions found")

print(f"Retrieved {len(executions)} pipeline executions")
# Get the latest execution (first item in the list)
latest_execution = executions[0]
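
A single `list_pipeline_executions` call returns only one page of results; when more are available, the response includes a `NextToken`. The following is a minimal sketch of following that token, assuming `client` is the `sagemaker_client` created earlier. The helper name `iter_pipeline_executions` and the `page_size` default are illustrative, not part of the SageMaker SDK.

```python
def iter_pipeline_executions(client, pipeline_name, page_size=50):
    """Yield execution summaries newest first, following NextToken across pages.

    Hypothetical helper: `client` is assumed to be a SageMaker client such as
    sagemaker_session.sagemaker_client.
    """
    kwargs = {
        "PipelineName": pipeline_name,
        "SortBy": "CreationTime",
        "SortOrder": "Descending",
        "MaxResults": page_size,
    }
    while True:
        response = client.list_pipeline_executions(**kwargs)
        yield from response.get("PipelineExecutionSummaries", [])
        next_token = response.get("NextToken")
        if not next_token:
            break
        # Ask for the next page on the following call
        kwargs["NextToken"] = next_token
```

For example, `next(iter_pipeline_executions(sagemaker_client, PIPELINE_NAME), None)` would give the most recent summary, or `None` if the pipeline has never run.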

## Extracting Basic Execution Information


In [None]:
# Extract basic execution information directly from the summary
execution_arn = latest_execution['PipelineExecutionArn']
status = latest_execution['PipelineExecutionStatus']
start_time = latest_execution['StartTime']

# Display basic execution information
print(f"Latest Execution ARN: {execution_arn}")
print(f"Status: {status}")
print(f"Start Time: {start_time}")

```
Latest Execution ARN: arn:aws:sagemaker:us-east-1:123456789012:pipeline/california-housing-preprocessing-pipeline/execution/abcd1234-5678-90ef-ghij-klmnopqrstuv

Status: Succeeded

Start Time: 2025-08-01 15:37:56.328000+00:00
```

## Retrieving Detailed Execution Information

The `describe_pipeline_execution` API provides additional timing information and metadata that isn't available in the summary.

In [None]:
# Get detailed execution information using describe_pipeline_execution
execution_details = sagemaker_client.describe_pipeline_execution(
    PipelineExecutionArn=execution_arn
)

This detailed response includes additional information such as:

- <u>LastModifiedTime</u>: When the execution was last modified; for finished executions this approximates the completion time
- <u>PipelineExecutionDisplayName</u>: A human-readable name for the execution
- <u>PipelineExperimentConfig</u>: Experiment tracking configuration details
- <u>CreatedBy/LastModifiedBy</u>: Information about who created and modified the execution
- <u>PipelineVersionId</u>: The specific version of the pipeline that was executed

The response contains other fields as well. The execution ARN passed in the request is the unique identifier that SageMaker uses to locate and return information about this exact execution.
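
Because `describe_pipeline_execution` reports the current status, it can also be polled to wait for a run to finish. The sketch below assumes the terminal statuses are `Succeeded`, `Failed`, and `Stopped`; the function name `wait_for_execution`, the polling interval, and the timeout are illustrative choices rather than SDK features.

```python
import time

# Statuses after which the execution will not change any further (assumed set)
TERMINAL_STATUSES = {"Succeeded", "Failed", "Stopped"}


def wait_for_execution(client, execution_arn, poll_seconds=30,
                       timeout_seconds=3600, sleep=time.sleep):
    """Poll describe_pipeline_execution until a terminal status or a timeout.

    Hypothetical helper; `sleep` is injectable so the loop can be tested
    without actually waiting.
    """
    waited = 0
    while True:
        details = client.describe_pipeline_execution(
            PipelineExecutionArn=execution_arn
        )
        status = details["PipelineExecutionStatus"]
        if status in TERMINAL_STATUSES:
            return details
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"Execution still {status} after {waited} seconds"
            )
        sleep(poll_seconds)
        waited += poll_seconds
```

A call such as `wait_for_execution(sagemaker_client, execution_arn)` would return the final describe response, which can then be inspected like `execution_details`.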

## Calculating Execution Duration


In [None]:
# LastModifiedTime is present even while an execution is still running,
# so check the status to confirm the execution has reached a terminal state
if execution_details['PipelineExecutionStatus'] in ('Succeeded', 'Failed', 'Stopped'):
    # For finished executions, the last modified time serves as the end time
    end_time = execution_details['LastModifiedTime']
    # Calculate how long the pipeline took to run
    duration = end_time - start_time
    print(f"End Time: {end_time}")
    print(f"Duration: {duration}")

```
End Time: 2025-08-01 15:40:30.174000+00:00
Duration: 0:02:33.846000
```
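
The same calculation can be wrapped in a small function that also handles executions that are still in progress. This is a sketch under two assumptions: that `LastModifiedTime` approximates the end time once the status is terminal, and that the timestamps are timezone-aware, as boto3 returns them. The name `execution_duration` is hypothetical.

```python
from datetime import datetime, timezone

# Statuses treated as finished (assumed set)
FINISHED_STATUSES = {"Succeeded", "Failed", "Stopped"}


def execution_duration(start_time, details, now=None):
    """Return (duration, is_final) for a describe_pipeline_execution response.

    For finished executions the duration runs to LastModifiedTime; for running
    ones it is the elapsed time so far and is_final is False.
    """
    if details["PipelineExecutionStatus"] in FINISHED_STATUSES:
        return details["LastModifiedTime"] - start_time, True
    # Still running: measure against the current time instead
    current = now if now is not None else datetime.now(timezone.utc)
    return current - start_time, False
```

With the values from the run above, `execution_duration(start_time, execution_details)` would return the same 2 minute 33.846 second duration together with `True`.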

## Retrieving Step-Level Information


While overall execution status provides a high-level view, examining individual steps gives you detailed insights into your pipeline's internal behavior. SageMaker's `list_pipeline_execution_steps` API allows you to retrieve comprehensive information about every step within a specific execution.



In [None]:
# Retrieve all steps for this specific execution
steps_response = sagemaker_client.list_pipeline_execution_steps(
    PipelineExecutionArn=execution_arn
)

# Extract the steps from the response
steps = steps_response['PipelineExecutionSteps']

The response contains information about each step, including its name, status, start and end times, any failure reason, and metadata identifying the underlying resource (for example, the ARN of a processing job). This step-level granularity is essential for understanding pipeline behavior, identifying bottlenecks, and diagnosing issues within specific components of your workflow. Each entry corresponds to one step of your pipeline definition, but do not rely on the list matching the order in which you defined the steps: results are sorted by creation time, so sort by `StartTime` if you need chronological order.
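
If the order in which steps ran matters for your analysis, one option is to sort the list yourself. The sketch below orders steps by `StartTime` and places steps that have not started at the end; the helper name `steps_in_start_order` is hypothetical.

```python
def steps_in_start_order(steps):
    """Return steps sorted by StartTime, with not-yet-started steps last."""
    started = [step for step in steps if "StartTime" in step]
    pending = [step for step in steps if "StartTime" not in step]
    return sorted(started, key=lambda step: step["StartTime"]) + pending
```

Calling `steps_in_start_order(steps)` returns a new list and leaves the original response untouched.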



### Examining Individual Step Details


In [None]:
# Iterate through each step and display its details
for step in steps:
    # Extract step information
    step_name = step['StepName']
    step_status = step['StepStatus']
    step_start = step.get('StartTime')  # None if the step has not started
    
    # Display step name and status
    print(f"\n{step_name}: {step_status}")
    
    # Show start time if the step has started
    if step_start is not None:
        print(f"Started: {step_start}")

### Calculating Step Duration and Performance Analysis

For completed steps, we can calculate execution duration and analyze performance characteristics. This timing information is particularly valuable for identifying bottlenecks and understanding resource utilization patterns across your pipeline.



In [None]:
# Iterate through each step and display its details
for step in steps:
    # ... previous step analysis code ...
    
    # If step has completed, show end time and duration
    if 'EndTime' in step:
        # Extract the step's end time
        step_end = step['EndTime']
        # Calculate step execution time
        step_duration = step_end - step['StartTime']
        print(f"Ended: {step_end}")
        print(f"Duration: {step_duration}")

```
ProcessData: Succeeded
Started: 2025-08-01 15:37:56.996000+00:00
Ended: 2025-08-01 15:40:29.623000+00:00
Duration: 0:02:32.627000
```
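
To turn these per-step durations into a bottleneck report, the finished steps can be ranked by how long they took. This is a minimal sketch; `slowest_steps` and its `top_n` parameter are illustrative names, and it only considers steps that have both a `StartTime` and an `EndTime`.

```python
def slowest_steps(steps, top_n=3):
    """Return (step name, duration) pairs for finished steps, longest first."""
    timed = [
        (step["StepName"], step["EndTime"] - step["StartTime"])
        for step in steps
        if "StartTime" in step and "EndTime" in step
    ]
    return sorted(timed, key=lambda pair: pair[1], reverse=True)[:top_n]
```

For a multi-step pipeline, `slowest_steps(steps)` gives a short list of candidates to examine first when optimizing run time or instance sizing.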

### Handling Failed Steps


In [None]:
# Iterate through each step and display its details
for step in steps:
    # ... previous step analysis code ...
    
    # For failed steps, show the failure reason to help with debugging
    if step_status == 'Failed' and 'FailureReason' in step:
        print(f"Failure Reason: {step['FailureReason']}")

Common failure scenarios include insufficient IAM permissions for accessing S3 buckets, memory or disk space limitations during processing, code errors in your processing scripts, or missing input data files.
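
When several steps fail, it can help to gather the failures into one structure along with a pointer to the underlying job, whose logs usually hold the root cause. The sketch below assumes each step's `Metadata` maps a step type (such as `ProcessingJob`) to a dictionary containing an `Arn`; the helper name `summarize_failures` and the output keys are hypothetical.

```python
def summarize_failures(steps):
    """Collect name, reason, and underlying job ARN (if any) for failed steps."""
    failures = []
    for step in steps:
        if step.get("StepStatus") != "Failed":
            continue
        # Assumed shape: {"ProcessingJob": {"Arn": "..."}} or similar
        job_arns = [
            info["Arn"]
            for info in step.get("Metadata", {}).values()
            if isinstance(info, dict) and "Arn" in info
        ]
        failures.append({
            "step": step["StepName"],
            "reason": step.get("FailureReason", "No failure reason reported"),
            "job_arn": job_arns[0] if job_arns else None,
        })
    return failures
```

An empty result from `summarize_failures(steps)` means no step in the execution reported a `Failed` status.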