# Comprehensive ML System Monitoring
## AAI-540 Final Project - Module 5: Monitoring & Observability

**Project Team:** [Your Team Name]

**Authors:** [Team Member Names]

---

## Overview

This notebook implements monitoring for our ML system in five areas:
1. **Model Monitoring** - Track prediction quality and bias
2. **Data Monitoring** - Detect data quality issues and distribution drift
3. **Infrastructure Monitoring** - Monitor endpoint performance and resource utilization
4. **CloudWatch Dashboard** - Centralized visualization of all metrics
5. **Automated Reporting** - Generate model and data quality reports

---

## Table of Contents
1. [Setup & Configuration](#setup)
2. [Model Quality Monitoring](#model-quality)
3. [Data Quality Monitoring](#data-quality)
4. [Model Bias Monitoring](#model-bias)
5. [Infrastructure Monitoring](#infrastructure)
6. [CloudWatch Dashboard](#dashboard)
7. [Generate Reports](#reports)
8. [Cleanup](#cleanup)

## 1. Setup & Configuration <a id='setup'></a>

### 1.1 Install and Import Required Libraries

In [1]:
# Verify Python version
!python --version

Python 3.12.9


In [2]:
%%time

import json
import os
from datetime import datetime, timedelta, timezone
from time import sleep

import pandas as pd
import numpy as np

# AWS SDK
import boto3
from botocore import UNSIGNED
from botocore.client import Config
from botocore.exceptions import ClientError

# SageMaker
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer, JSONSerializer
from sagemaker.deserializers import CSVDeserializer, JSONDeserializer
from sagemaker import get_execution_role, Session, image_uris
from sagemaker.s3 import S3Downloader, S3Uploader
from sagemaker.model import Model
from sagemaker.predictor import Predictor


# Model Monitor imports
from sagemaker.model_monitor import (
    DataCaptureConfig,
    DefaultModelMonitor,
    ModelQualityMonitor,
    DataQualityDistributionConstraints,
    CronExpressionGenerator
)

from sagemaker.model_monitor.dataset_format import DatasetFormat

# Clarify imports for bias monitoring
from sagemaker.clarify import (
    BiasConfig,
    DataConfig,
    ModelConfig,
    ModelPredictedLabelConfig,
    SageMakerClarifyProcessor
)
from sagemaker.model_monitor import BiasAnalysisConfig, ModelBiasMonitor

print("✓ Libraries imported successfully")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
✓ Libraries imported successfully
CPU times: user 1.63 s, sys: 290 ms, total: 1.92 s
Wall time: 1.92 s


### 1.2 Retrieve AWS Account ID

In [3]:
try:
    # Get AWS Account ID
    account_id = boto3.client("sts").get_caller_identity()["Account"]
    print(f"Successfully retrieved AWS Account ID: {account_id}")
except Exception as e:
    print(f"Cannot retrieve account information: {e}")
    raise

Successfully retrieved AWS Account ID: 356396930368


### 1.3 Initialize Session and Clients

In [4]:
# AWS Region
REGION = "us-east-1"

# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
role = get_execution_role()

# Initialize AWS clients
s3_client = boto3.client("s3", region_name=REGION)
s3_resource = boto3.resource("s3", region_name=REGION)
athena_client = boto3.client("athena", region_name=REGION)
sagemaker_client = boto3.client("sagemaker", region_name=REGION)
sm_client = sagemaker_client  # short alias used in later cells
cloudwatch_client = boto3.client("cloudwatch", region_name=REGION)
cw_client = cloudwatch_client  # short alias used in later cells
logs_client = boto3.client("logs", region_name=REGION)

print(f"AWS Region: {REGION}")
print(f"SageMaker Execution Role: {role}")
print("AWS clients initialized successfully")

AWS Region: us-east-1
SageMaker Execution Role: arn:aws:iam::356396930368:role/LabRole
AWS clients initialized successfully


### 1.4 Configure S3 Paths and Prefixes

In [5]:
# Base bucket name (shared across team)
BASE_BUCKET_NAME = "yelp-aai540-group6"

# Individual buckets with Account ID for each team member
DATA_BUCKET = f"{BASE_BUCKET_NAME}-{account_id}"  # Raw data storage
ATHENA_BUCKET = f"{BASE_BUCKET_NAME}-athena-{account_id}"  # Athena queries and results
FEATURE_BUCKET = f"{BASE_BUCKET_NAME}-features-{account_id}"  # Feature store offline
MODEL_BUCKET = f"{BASE_BUCKET_NAME}-models-{account_id}"  # Model artifacts
MONITORING_BUCKET = f"{BASE_BUCKET_NAME}-monitoring-{account_id}"  # Monitoring data

# S3 Prefixes (paths within buckets)
RAW_DATA_PREFIX = "yelp-dataset/json/"
PARQUET_PREFIX = "yelp-dataset/parquet/"
ATHENA_RESULTS_PREFIX = "athena-results/"
FEATURE_PREFIX = "feature-store/"
MODEL_PREFIX = "models/"
MONITORING_PREFIX = "monitoring/"

# Full S3 paths
ATHENA_RESULTS_S3 = f"s3://{ATHENA_BUCKET}/{ATHENA_RESULTS_PREFIX}"

# Athena Database
ATHENA_DB = "yelp"

# Store configuration
%store REGION
%store account_id
%store DATA_BUCKET
%store ATHENA_BUCKET
%store FEATURE_BUCKET
%store MODEL_BUCKET
%store MONITORING_BUCKET
%store ATHENA_RESULTS_S3
%store ATHENA_DB

# Display configuration
print("="*80)
print("S3 BUCKET CONFIGURATION (Account-Specific)")
print("="*80)
print(f"AWS Account ID:     {account_id}")
print(f"AWS Region:         {REGION}")
print()
print("S3 Buckets:")
print(f"  Data Bucket:      {DATA_BUCKET}")
print(f"  Athena Bucket:    {ATHENA_BUCKET}")
print(f"  Feature Bucket:   {FEATURE_BUCKET}")
print(f"  Model Bucket:     {MODEL_BUCKET}")
print(f"  Monitoring:       {MONITORING_BUCKET}")
print()
print("Athena Configuration:")
print(f"  Database:         {ATHENA_DB}")
print(f"  Results Location: {ATHENA_RESULTS_S3}")
print("="*80)

Stored 'REGION' (str)
Stored 'account_id' (str)
Stored 'DATA_BUCKET' (str)
Stored 'ATHENA_BUCKET' (str)
Stored 'FEATURE_BUCKET' (str)
Stored 'MODEL_BUCKET' (str)
Stored 'MONITORING_BUCKET' (str)
Stored 'ATHENA_RESULTS_S3' (str)
Stored 'ATHENA_DB' (str)
S3 BUCKET CONFIGURATION (Account-Specific)
AWS Account ID:     356396930368
AWS Region:         us-east-1

S3 Buckets:
  Data Bucket:      yelp-aai540-group6-356396930368
  Athena Bucket:    yelp-aai540-group6-athena-356396930368
  Feature Bucket:   yelp-aai540-group6-features-356396930368
  Model Bucket:     yelp-aai540-group6-models-356396930368
  Monitoring:       yelp-aai540-group6-monitoring-356396930368

Athena Configuration:
  Database:         yelp
  Results Location: s3://yelp-aai540-group6-athena-356396930368/athena-results/


In [6]:
def create_bucket_if_not_exists(bucket_name, region=REGION):
    """
    Create an S3 bucket if it doesn't already exist.
    
    Args:
        bucket_name: Name of the bucket to create
        region: AWS region for the bucket
    
    Returns:
        True if bucket was created or already exists, False otherwise
    """
    try:
        # Check if bucket exists
        s3_client.head_bucket(Bucket=bucket_name)
        print(f"  Bucket already exists: {bucket_name}")
        return True
    except ClientError as e:
        error_code = e.response['Error']['Code']
        if error_code == '404':
            # Bucket doesn't exist, create it
            try:
                if region == 'us-east-1':
                    s3_client.create_bucket(Bucket=bucket_name)
                else:
                    s3_client.create_bucket(
                        Bucket=bucket_name,
                        CreateBucketConfiguration={'LocationConstraint': region}
                    )
                print(f"  Created bucket: {bucket_name}")
                return True
            except ClientError as create_error:
                print(f"  Error creating bucket {bucket_name}: {create_error}")
                return False
        else:
            print(f"  Error checking bucket {bucket_name}: {e}")
            return False

# Create all required buckets
print("Creating S3 buckets...")
buckets = [
    DATA_BUCKET,
    ATHENA_BUCKET,
    FEATURE_BUCKET,
    MODEL_BUCKET,
    MONITORING_BUCKET
]

all_success = True
for bucket in buckets:
    if not create_bucket_if_not_exists(bucket):
        all_success = False

if all_success:
    print("\n All S3 buckets are ready!")
else:
    print("\n Some buckets could not be created. Please check errors above.")

Creating S3 buckets...
  Bucket already exists: yelp-aai540-group6-356396930368
  Bucket already exists: yelp-aai540-group6-athena-356396930368
  Bucket already exists: yelp-aai540-group6-features-356396930368
  Bucket already exists: yelp-aai540-group6-models-356396930368
  Bucket already exists: yelp-aai540-group6-monitoring-356396930368

 All S3 buckets are ready!


In [7]:
# Monitoring configuration using Account ID
monitoring_schedule_name = f"venuesignal-monitor-{account_id}"
baseline_job_name = f"venuesignal-baseline-{account_id}"

# S3 paths for monitoring data
monitoring_output_path = f"s3://{MONITORING_BUCKET}/monitoring-output"
baseline_results_path = f"s3://{MONITORING_BUCKET}/baseline-results"
monitoring_reports_path = f"s3://{MONITORING_BUCKET}/reports"

print("Monitoring Configuration:")
print(f"  Schedule: {monitoring_schedule_name}")
print(f"  Output: {monitoring_output_path}")
print(f"  Reports: {monitoring_reports_path}")

%store monitoring_schedule_name
%store monitoring_output_path
%store monitoring_reports_path

Monitoring Configuration:
  Schedule: venuesignal-monitor-356396930368
  Output: s3://yelp-aai540-group6-monitoring-356396930368/monitoring-output
  Reports: s3://yelp-aai540-group6-monitoring-356396930368/reports
Stored 'monitoring_schedule_name' (str)
Stored 'monitoring_output_path' (str)
Stored 'monitoring_reports_path' (str)


In [8]:
# Use the monitoring bucket for all monitoring artifacts
bucket = MONITORING_BUCKET

# TODO: Update the project name to match your project
project_name = "VenueSignal - Yelp Business Rating Prediction"
prefix = MONITORING_BUCKET  # same value as before, without hard-coding the account ID

# Define S3 paths for monitoring
data_capture_prefix = "datacapture"
data_capture_uri = f"s3://{MONITORING_BUCKET}/{data_capture_prefix}"

baseline_prefix = "baseline"
baseline_data_uri = f"s3://{MONITORING_BUCKET}/{baseline_prefix}/data"
baseline_results_uri = f"s3://{MONITORING_BUCKET}/{baseline_prefix}/results"

model_quality_prefix = "model-quality"
model_quality_baseline_uri = f"s3://{MONITORING_BUCKET}/{model_quality_prefix}/baseline"
model_quality_results_uri = f"s3://{MONITORING_BUCKET}/{model_quality_prefix}/results"

data_quality_prefix = "data-quality"
data_quality_baseline_uri = f"s3://{MONITORING_BUCKET}/{data_quality_prefix}/baseline"
data_quality_results_uri = f"s3://{MONITORING_BUCKET}/{data_quality_prefix}/results"

bias_prefix = "bias"
bias_baseline_uri = f"s3://{MONITORING_BUCKET}/{bias_prefix}/baseline"
bias_results_uri = f"s3://{MONITORING_BUCKET}/{bias_prefix}/results"

reports_uri = f"s3://{MONITORING_BUCKET}/reports"

print("\n=== S3 Configuration ===")
print(f"Bucket: {bucket}")
print(f"Data Capture: {data_capture_uri}")
print(f"Baseline: {baseline_results_uri}")
print(f"Model Quality: {model_quality_results_uri}")
print(f"Data Quality: {data_quality_results_uri}")
print(f"Bias: {bias_results_uri}")
print(f"Reports: {reports_uri}")


=== S3 Configuration ===
Bucket: yelp-aai540-group6-monitoring-356396930368
Data Capture: s3://yelp-aai540-group6-monitoring-356396930368/datacapture
Baseline: s3://yelp-aai540-group6-monitoring-356396930368/baseline/results
Model Quality: s3://yelp-aai540-group6-monitoring-356396930368/model-quality/results
Data Quality: s3://yelp-aai540-group6-monitoring-356396930368/data-quality/results
Bias: s3://yelp-aai540-group6-monitoring-356396930368/bias/results
Reports: s3://yelp-aai540-group6-monitoring-356396930368/reports


### 1.5 Specify Your Endpoint Name

**Important:** Replace with your actual deployed endpoint name from Module 4.

In [9]:
# TODO: Update with your actual endpoint name
endpoint_name = "venuesignal-endpoint-2026-02-07-11-28-53" 

# Verify endpoint exists
try:
    response = sm_client.describe_endpoint(EndpointName=endpoint_name)
    print(f" Endpoint '{endpoint_name}' found")
    print(f"  Status: {response['EndpointStatus']}")
    variant = response['ProductionVariants'][0]
    print(f"  Instance Type: {variant.get('CurrentInstanceType', variant.get('InstanceType', 'Unknown'))}")
except Exception as e:
    print(f" Error: Endpoint '{endpoint_name}' not found")
    print(f"  {str(e)}")
    print("\nPlease update the endpoint_name variable with your deployed endpoint.")

 Endpoint 'venuesignal-endpoint-2026-02-07-11-28-53' found
  Status: InService
  Instance Type: Unknown


### 1.6 Load Baseline Training Data

**Important:** Update the path to your training data used in Module 3-4.

In [10]:
# TODO: Update with your actual training data path
training_data_uri = f"s3://{FEATURE_BUCKET}/training-data/train.csv" 

# TODO: Update with your actual validation/test data path for ground truth
validation_data_uri = f"s3://{FEATURE_BUCKET}/training-data/validation.csv"  

print(f"Training data: {training_data_uri}")
print(f"Validation data: {validation_data_uri}")

# Load a sample to verify
try:
    # Download a sample
    local_training_file = S3Downloader.download(training_data_uri, ".")
    df_train_sample = pd.read_csv(local_training_file[0], nrows=5)
    print(f"\n✓ Training data loaded successfully")
    print(f"  Shape: {df_train_sample.shape}")
    print(f"  Columns: {list(df_train_sample.columns)}")
    display(df_train_sample.head())
except Exception as e:
    print(f"✗ Error loading training data: {e}")
    print("Please update the training_data_uri variable.")

Training data: s3://yelp-aai540-group6-features-356396930368/training-data/train.csv
Validation data: s3://yelp-aai540-group6-features-356396930368/training-data/validation.csv

✓ Training data loaded successfully
  Shape: (5, 34)
  Columns: ['review_id', 'business_id', 'user_id', 'mentions_parking', 'parking_positive', 'parking_negative', 'parking_type_lot', 'parking_type_street', 'parking_type_garage', 'parking_type_valet', 'parking_free', 'parking_paid', 'enhanced_parking_score', 'business_stars', 'business_review_count', 'avg_review_stars', 'std_review_stars', 'total_reviews', 'avg_engagement', 'pct_highly_rated', 'has_parking_data', 'parking_sentiment', 'review_stars', 'useful', 'funny', 'cool', 'engagement_score', 'is_engaged', 'review_year', 'review_month', 'review_quarter', 'is_restaurant', 'price_range_numeric', 'is_highly_rated']


Unnamed: 0,review_id,business_id,user_id,mentions_parking,parking_positive,parking_negative,parking_type_lot,parking_type_street,parking_type_garage,parking_type_valet,...,funny,cool,engagement_score,is_engaged,review_year,review_month,review_quarter,is_restaurant,price_range_numeric,is_highly_rated
0,ILUkOqiOI4OAP66sFIK2wg,vl40Oa75v42jvJsHwpCGKA,N0zPkywxRWcdwIdDydRjsg,0,0,0,0,0,0,0,...,0,1,3,1,2018,12,4,1,2,1
1,xZsG5iSqG09rMWAG5xbAuA,vl40Oa75v42jvJsHwpCGKA,awc2ZDTlv_UwVj-O0PDVLQ,0,0,0,0,0,0,0,...,0,0,0,0,2019,8,3,1,2,1
2,Sh_k6lShYktzcDQldXEkDA,vl40Oa75v42jvJsHwpCGKA,EUL1aKj4hhBqfPhSmJ7tbQ,0,0,0,0,0,0,0,...,0,0,0,0,2019,3,1,1,2,1
3,N1qv_f2jQ3L1KQswNCjlmQ,vl40Oa75v42jvJsHwpCGKA,Sx0atJz9G3Q84-dUlfnVTg,0,0,0,0,0,0,0,...,0,1,2,1,2019,9,3,1,2,1
4,5iMBUIkxxzHJzTp1lvUOug,vl40Oa75v42jvJsHwpCGKA,6l8Pr2n0Sq2HHsHB2XyoNw,0,0,0,0,0,0,0,...,0,1,2,1,2018,9,3,1,2,1


## 2. Model Quality Monitoring <a id='model-quality'></a>

Model Quality Monitoring tracks the prediction accuracy of your model over time by comparing predictions against ground truth labels.

### 2.1 Create Model Quality Baseline

First, we need to establish a baseline for model quality metrics using our validation dataset.

In [11]:
# TODO: Specify your problem type
problem_type = "BinaryClassification"  # Options: "BinaryClassification", "MulticlassClassification", "Regression"

# TODO: Specify inference and ground truth column names
inference_attribute = "prediction"  # Column name for predictions in captured data
probability_attribute = "probability"  # Column name for probability scores (if applicable)
ground_truth_attribute = "label"  # Column name for actual labels in ground truth data

print(f"Problem Type: {problem_type}")
print(f"Inference Attribute: {inference_attribute}")
print(f"Ground Truth Attribute: {ground_truth_attribute}")

Problem Type: BinaryClassification
Inference Attribute: prediction
Ground Truth Attribute: label
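
The baseline job looks up the columns named by `inference_attribute` and `ground_truth_attribute` in the baseline dataset, so that dataset needs a `prediction` and a `label` column. The sketch below shows one way such a file could be assembled. It assumes `is_highly_rated` is the target (a guess based on the training columns) and uses made-up scores in place of real endpoint predictions; `build_model_quality_baseline` is a hypothetical helper, not part of the SageMaker SDK.

```python
import pandas as pd

# Hypothetical validation rows: the true target plus one model score per row.
# In the notebook, the scores would come from invoking the deployed endpoint.
validation = pd.DataFrame({
    "is_highly_rated": [1, 0, 1, 1, 0],        # assumed target column
    "score": [0.91, 0.20, 0.65, 0.40, 0.55],   # made-up probabilities
})

def build_model_quality_baseline(df, target_col, score_col, threshold=0.5):
    """Return a frame with prediction, probability, and label columns."""
    return pd.DataFrame({
        "prediction": (df[score_col] >= threshold).astype(int),
        "probability": df[score_col],
        "label": df[target_col].astype(int),
    })

baseline_df = build_model_quality_baseline(validation, "is_highly_rated", "score")
print(baseline_df)
```

Writing a frame like this to CSV with a header row and passing its S3 URI as `baseline_dataset` should give the job the columns it is configured to read.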


In [12]:
%%time
# Initialize Model Quality Monitor
model_quality_monitor = ModelQualityMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",  # Resource-efficient instance
    volume_size_in_gb=20,
    max_runtime_in_seconds=1800,
    sagemaker_session=sagemaker_session,
)

print("✓ Model Quality Monitor initialized")

✓ Model Quality Monitor initialized
CPU times: user 21.7 ms, sys: 253 μs, total: 22 ms
Wall time: 21.6 ms


In [13]:
%%time
# Suggest baseline for model quality
try:
    model_quality_monitor.suggest_baseline(
        baseline_dataset=validation_data_uri,
        dataset_format=DatasetFormat.csv(header=True),
        output_s3_uri=model_quality_baseline_uri,
        problem_type=problem_type,
        inference_attribute=inference_attribute,
        ground_truth_attribute=ground_truth_attribute,
        wait=True,
        logs=False
    )
    print(" Model quality baseline created successfully")
except Exception as e:
    print(f" Error creating baseline: {e}")
    print("This may happen if your validation data format doesn't match expectations.")
    print("Verify that your validation data has both predictions and ground truth labels.")

INFO:sagemaker:Creating processing-job with name baseline-suggestion-job-2026-02-08-10-35-23-576


...........................................................* Error creating baseline: Error for Processing job baseline-suggestion-job-2026-02-08-10-35-23-576: Failed. Reason: AlgorithmError: Error: Errors occurred when analyzing your data. Please check CloudWatch logs for more details., exit code: 255. Check troubleshooting guide for common errors: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-python-sdk-troubleshooting.html
This may happen if your validation data format doesn't match expectations.
Verify that your validation data has both predictions and ground truth labels.
CPU times: user 216 ms, sys: 18.7 ms, total: 235 ms
Wall time: 5min 2s


### 2.2 Review Model Quality Baseline Report

In [14]:
# Download and display baseline statistics
baseline_job = model_quality_monitor.latest_baselining_job
baseline_stats_uri = baseline_job.baseline_statistics().file_s3_uri

print(f"Baseline Statistics URI: {baseline_stats_uri}")

# Download the baseline statistics
local_stats_file = S3Downloader.download(baseline_stats_uri, ".")
with open(local_stats_file[0], 'r') as f:
    baseline_stats = json.load(f)

print("\n=== Model Quality Baseline Metrics ===")
# The statistics file nests metrics under a key that depends on the problem type
metrics = (
    baseline_stats.get('binary_classification_metrics')
    or baseline_stats.get('regression_metrics')
)
if metrics:
    for metric, value in metrics.items():
        # Scalar metrics carry a numeric 'value'; structured entries such as
        # the confusion matrix do not, so they are skipped here
        if isinstance(value, dict) and isinstance(value.get('value'), (int, float)):
            print(f"{metric}: {value['value']:.4f}")
else:
    print(json.dumps(baseline_stats, indent=2))

### 2.3 Create Model Quality Monitoring Schedule

In [None]:
%%time
from sagemaker.model_monitor import CronExpressionGenerator

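# Schedule names allow only letters, digits, and hyphens (up to 63 characters),
# so use a short slug instead of project_name, which contains spaces
model_quality_schedule_name = f"venuesignal-model-quality-{datetime.now(timezone.utc):%Y%m%d-%H%M}"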

try:
    model_quality_monitor.create_monitoring_schedule(
        monitor_schedule_name=model_quality_schedule_name,
        endpoint_input=endpoint_name,
        output_s3_uri=model_quality_results_uri,
        problem_type=problem_type,
        ground_truth_input=validation_data_uri,  # You'll need to update this with actual ground truth source
        constraints=model_quality_monitor.suggested_constraints(),
        schedule_cron_expression=CronExpressionGenerator.daily(),  # Run daily to save costs
        enable_cloudwatch_metrics=True,
    )
    print(f"✓ Model quality monitoring schedule created: {model_quality_schedule_name}")
except Exception as e:
    print(f"✗ Error creating schedule: {e}")

## 3. Data Quality Monitoring <a id='data-quality'></a>

Data Quality Monitoring detects anomalies and data drift in the input features.
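
To make "distribution drift" concrete, the sketch below computes a population stability index (PSI) between a baseline sample and a newer sample of one numeric feature. PSI is swapped in here as a familiar stand-in; it is not claimed to be the statistic Model Monitor computes, and the synthetic samples and the `population_stability_index` helper are illustrative assumptions.

```python
import numpy as np

def population_stability_index(baseline, current, bins=10):
    """PSI between two 1-D samples, bucketed on the baseline's quantiles."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior quantile cut points taken from the baseline sample
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))[1:-1]
    # searchsorted maps every value to one of `bins` buckets (0 .. bins-1)
    base_counts = np.bincount(np.searchsorted(edges, baseline), minlength=bins)
    curr_counts = np.bincount(np.searchsorted(edges, current), minlength=bins)
    eps = 1e-6  # avoids log(0) when a bucket is empty
    base_pct = np.clip(base_counts / base_counts.sum(), eps, None)
    curr_pct = np.clip(curr_counts / curr_counts.sum(), eps, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

rng = np.random.default_rng(0)
baseline_sample = rng.normal(3.5, 1.0, 5000)   # synthetic "training" feature
same_sample = rng.normal(3.5, 1.0, 5000)       # same distribution
shifted_sample = rng.normal(4.5, 1.0, 5000)    # mean shifted by one std dev

psi_same = population_stability_index(baseline_sample, same_sample)
psi_shift = population_stability_index(baseline_sample, shifted_sample)
print(f"PSI (no drift): {psi_same:.4f}")
print(f"PSI (shifted):  {psi_shift:.4f}")
```

A value near zero suggests the two samples are distributed alike, while larger values point to a shift worth investigating.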

### 3.1 Enable Data Capture on Endpoint

**Note:** If you already have data capture enabled from Module 4, you can skip this step.

In [None]:
# Check if data capture is already enabled
try:
    endpoint_config = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint_config['EndpointConfigName']
    config_details = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    
    if 'DataCaptureConfig' in config_details:
        print("✓ Data capture is already enabled on this endpoint")
        capture_enabled = True
    else:
        print("✗ Data capture is not enabled")
        print("You need to update your endpoint config to enable data capture.")
        print("See the code below for reference.")
        capture_enabled = False
except Exception as e:
    print(f"Error checking endpoint: {e}")
    capture_enabled = False

In [None]:
# If data capture is not enabled, use this code to update your endpoint
# Uncomment and run if needed

# from sagemaker.model_monitor import DataCaptureConfig
# 
# data_capture_config = DataCaptureConfig(
#     enable_capture=True,
#     sampling_percentage=100,  # Capture 100% of requests (reduce for high-traffic endpoints)
#     destination_s3_uri=data_capture_uri,
#     capture_options=["INPUT", "OUTPUT"]  # Capture both input and output
# )
# 
# # You'll need to update your endpoint configuration
# # This requires redeploying the endpoint with the new configuration
# # See Module 4 deployment code for reference
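
Once capture is enabled, the endpoint writes JSON Lines files under the capture destination, one record per request. The sketch below parses a single record into features and a prediction. It assumes both payloads are CSV-encoded (captures can also be Base64-encoded) and that the model returns one numeric score; the sample record and the `parse_capture_line` helper are illustrative, not taken from this endpoint.

```python
import json

# Illustrative capture record; the values are made up
sample_line = json.dumps({
    "captureData": {
        "endpointInput": {
            "observedContentType": "text/csv",
            "mode": "INPUT",
            "data": "0,1,0,4.5,120",
            "encoding": "CSV",
        },
        "endpointOutput": {
            "observedContentType": "text/csv",
            "mode": "OUTPUT",
            "data": "0.87",
            "encoding": "CSV",
        },
    },
    "eventMetadata": {
        "eventId": "example-event-id",
        "inferenceTime": "2026-02-08T10:00:00Z",
    },
    "eventVersion": "0",
})

def parse_capture_line(line):
    """Extract the event ID, input features, and prediction from one record."""
    record = json.loads(line)
    capture = record["captureData"]
    features = [float(x) for x in capture["endpointInput"]["data"].split(",")]
    prediction = float(capture["endpointOutput"]["data"].strip())
    return {
        "event_id": record["eventMetadata"]["eventId"],
        "features": features,
        "prediction": prediction,
    }

parsed = parse_capture_line(sample_line)
print(parsed)
```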

### 3.2 Create Data Quality Baseline

In [None]:
%%time
# Initialize Data Quality Monitor
data_quality_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=1800,
    sagemaker_session=sagemaker_session,
)

print("✓ Data Quality Monitor initialized")

In [None]:
%%time
# Create baseline for data quality
try:
    data_quality_monitor.suggest_baseline(
        baseline_dataset=training_data_uri,
        dataset_format=DatasetFormat.csv(header=True),
        output_s3_uri=data_quality_baseline_uri,
        wait=True,
        logs=False
    )
    print("✓ Data quality baseline created successfully")
except Exception as e:
    print(f"✗ Error creating baseline: {e}")

### 3.3 Review Data Quality Baseline

In [None]:
# Download and display baseline statistics
baseline_job = data_quality_monitor.latest_baselining_job
statistics_uri = baseline_job.baseline_statistics().file_s3_uri
constraints_uri = baseline_job.suggested_constraints().file_s3_uri

print(f"Statistics URI: {statistics_uri}")
print(f"Constraints URI: {constraints_uri}")

# Download statistics
local_stats_file = S3Downloader.download(statistics_uri, ".")
with open(local_stats_file[0], 'r') as f:
    data_stats = json.load(f)

print("\n=== Data Quality Statistics (Sample) ===")
if 'features' in data_stats:
    for feature in data_stats['features'][:5]:  # Show first 5 features
        print(f"\nFeature: {feature['name']}")
        print(f"  Type: {feature.get('inferred_type', 'unknown')}")
        if 'numerical_statistics' in feature:
            stats = feature['numerical_statistics']
            print(f"  Min: {stats.get('min', 'N/A')}")
            print(f"  Max: {stats.get('max', 'N/A')}")
            print(f"  Mean: {stats.get('mean', 'N/A')}")

### 3.4 Create Data Quality Monitoring Schedule

In [None]:
%%time
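# Schedule names allow only letters, digits, and hyphens (up to 63 characters),
# so use a short slug instead of project_name, which contains spaces
data_quality_schedule_name = f"venuesignal-data-quality-{datetime.now(timezone.utc):%Y%m%d-%H%M}"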

try:
    data_quality_monitor.create_monitoring_schedule(
        monitor_schedule_name=data_quality_schedule_name,
        endpoint_input=endpoint_name,
        output_s3_uri=data_quality_results_uri,
        statistics=data_quality_monitor.baseline_statistics(),
        constraints=data_quality_monitor.suggested_constraints(),
        schedule_cron_expression=CronExpressionGenerator.daily(),
        enable_cloudwatch_metrics=True,
    )
    print(f"✓ Data quality monitoring schedule created: {data_quality_schedule_name}")
except Exception as e:
    print(f"✗ Error creating schedule: {e}")

## 4. Model Bias Monitoring <a id='model-bias'></a>

**Note:** Bias monitoring is optional but recommended if your model uses sensitive attributes. Skip this section if not applicable to your project.

### 4.1 Configure Bias Monitoring (Optional)

Uncomment and configure if your dataset has sensitive attributes to monitor for bias.

In [None]:
# # TODO: Configure if applicable to your project
# 
# # Specify the sensitive attribute (facet) to monitor
# facet_name = "age_group"  # UPDATE THIS - e.g., "gender", "age", "ethnicity"
# facet_values_or_threshold = [1]  # UPDATE THIS - sensitive group(s) to monitor
# 
# # Label column name
# label_name = "label"  # UPDATE THIS
# 
# # Configure bias
# bias_config = BiasConfig(
#     label_values_or_threshold=[1],  # Positive class
#     facet_name=facet_name,
#     facet_values_or_threshold=facet_values_or_threshold,
# )
# 
# print(f"Bias monitoring configured for facet: {facet_name}")
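
As a rough illustration of what a post-training bias metric measures, the sketch below computes the difference in positive proportions in predicted labels (DPPL) between two facet groups with pandas. The facet column `is_restaurant`, the toy predictions, and the `dppl` helper are assumptions chosen only for the example; this is not the Clarify implementation, and it does not suggest that `is_restaurant` is a sensitive attribute.

```python
import pandas as pd

def dppl(df, facet_col, pred_col, sensitive_value=1):
    """DPPL = q_a - q_d: positive-prediction rate of the other group minus
    the positive-prediction rate of the monitored (sensitive) group."""
    in_monitored = df[facet_col] == sensitive_value
    q_d = df.loc[in_monitored, pred_col].mean()
    q_a = df.loc[~in_monitored, pred_col].mean()
    return float(q_a - q_d)

# Toy predictions: 1 of 4 positives in the monitored group, 3 of 4 in the other
preds = pd.DataFrame({
    "is_restaurant": [1, 1, 1, 1, 0, 0, 0, 0],
    "prediction":    [1, 0, 0, 0, 1, 1, 1, 0],
})

dppl_value = dppl(preds, "is_restaurant", "prediction")
print(f"DPPL: {dppl_value:.2f}")
```

A value of zero would mean both groups receive positive predictions at the same rate; the sign shows which group is favored.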

## 5. Infrastructure Monitoring <a id='infrastructure'></a>

Monitor endpoint performance, latency, and resource utilization using CloudWatch metrics.

### 5.1 Create CloudWatch Alarms for Endpoint Metrics

In [None]:
# Define alarm thresholds
LATENCY_THRESHOLD_MS = 1000  # Alert if average model latency > 1 second
ERROR_RATE_THRESHOLD = 5  # Alert if error rate > 5%
CPU_THRESHOLD = 80  # Alert if CPU > 80%
MEMORY_THRESHOLD = 80  # Alert if memory > 80%

print("Alarm Thresholds:")
print(f"  Latency: {LATENCY_THRESHOLD_MS}ms")
print(f"  Error Rate: {ERROR_RATE_THRESHOLD}%")
print(f"  CPU: {CPU_THRESHOLD}%")
print(f"  Memory: {MEMORY_THRESHOLD}%")

In [None]:
# Get endpoint variant name
endpoint_desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
variant_name = endpoint_desc['ProductionVariants'][0]['VariantName']

print(f"Endpoint: {endpoint_name}")
print(f"Variant: {variant_name}")

In [None]:
# Create Model Latency Alarm
try:
    cw_client.put_metric_alarm(
        AlarmName=f"{endpoint_name}-High-Latency",
        ComparisonOperator='GreaterThanThreshold',
        EvaluationPeriods=2,
        MetricName='ModelLatency',
        Namespace='AWS/SageMaker',
        Period=300,  # 5 minutes
        Statistic='Average',
        Threshold=LATENCY_THRESHOLD_MS * 1000,  # Convert to microseconds
        ActionsEnabled=False,
        AlarmDescription='Alert when model latency is too high',
        Dimensions=[
            {'Name': 'EndpointName', 'Value': endpoint_name},
            {'Name': 'VariantName', 'Value': variant_name}
        ]
    )
    print("✓ Created alarm: High Latency")
except Exception as e:
    print(f"✗ Error creating latency alarm: {e}")

In [None]:
# Create Model Invocation Error Rate Alarm
try:
    cw_client.put_metric_alarm(
        AlarmName=f"{endpoint_name}-High-Error-Rate",
        ComparisonOperator='GreaterThanThreshold',
        EvaluationPeriods=2,
        Metrics=[
            {
                'Id': 'error_rate',
                'Expression': '(m2/m1)*100',
                'Label': 'Error Rate (%)',
                'ReturnData': True,  # this expression is the value the alarm evaluates
            },
            {
                'Id': 'm1',
                'MetricStat': {
                    'Metric': {
                        'Namespace': 'AWS/SageMaker',
                        'MetricName': 'Invocations',
                        'Dimensions': [
                            {'Name': 'EndpointName', 'Value': endpoint_name},
                            {'Name': 'VariantName', 'Value': variant_name}
                        ]
                    },
                    'Period': 300,
                    'Stat': 'Sum',
                },
                'ReturnData': False,
            },
            {
                'Id': 'm2',
                'MetricStat': {
                    'Metric': {
                        'Namespace': 'AWS/SageMaker',
                        'MetricName': 'InvocationModelErrors',
                        'Dimensions': [
                            {'Name': 'EndpointName', 'Value': endpoint_name},
                            {'Name': 'VariantName', 'Value': variant_name}
                        ]
                    },
                    'Period': 300,
                    'Stat': 'Sum',
                },
                'ReturnData': False,
            },
        ],
        Threshold=ERROR_RATE_THRESHOLD,
        ActionsEnabled=False,
        AlarmDescription='Alert when error rate exceeds threshold'
    )
    print("✓ Created alarm: High Error Rate")
except Exception as e:
    print(f"✗ Error creating error rate alarm: {e}")

In [None]:
# Create CPU Utilization Alarm
try:
    cw_client.put_metric_alarm(
        AlarmName=f"{endpoint_name}-High-CPU",
        ComparisonOperator='GreaterThanThreshold',
        EvaluationPeriods=2,
        MetricName='CPUUtilization',
        Namespace='/aws/sagemaker/Endpoints',
        Period=300,
        Statistic='Average',
        Threshold=CPU_THRESHOLD,
        ActionsEnabled=False,
        AlarmDescription='Alert when CPU utilization is high',
        Dimensions=[
            {'Name': 'EndpointName', 'Value': endpoint_name},
            {'Name': 'VariantName', 'Value': variant_name}
        ]
    )
    print("✓ Created alarm: High CPU")
except Exception as e:
    print(f"✗ Error creating CPU alarm: {e}")

In [None]:
# Create Memory Utilization Alarm
try:
    cw_client.put_metric_alarm(
        AlarmName=f"{endpoint_name}-High-Memory",
        ComparisonOperator='GreaterThanThreshold',
        EvaluationPeriods=2,
        MetricName='MemoryUtilization',
        Namespace='/aws/sagemaker/Endpoints',
        Period=300,
        Statistic='Average',
        Threshold=MEMORY_THRESHOLD,
        ActionsEnabled=False,
        AlarmDescription='Alert when memory utilization is high',
        Dimensions=[
            {'Name': 'EndpointName', 'Value': endpoint_name},
            {'Name': 'VariantName', 'Value': variant_name}
        ]
    )
    print("✓ Created alarm: High Memory")
except Exception as e:
    print(f"✗ Error creating memory alarm: {e}")
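
The single-metric alarm cells above differ only in name, metric, namespace, threshold, and description. One possible way to cut the repetition is a small helper that builds the keyword arguments for `put_metric_alarm`. The sketch below is an assumption about how that could look: `build_endpoint_alarm` is a hypothetical helper, and the endpoint and variant names in the usage line are placeholders.

```python
def build_endpoint_alarm(endpoint_name, variant_name, suffix, metric_name,
                         namespace, threshold, description,
                         period=300, evaluation_periods=2):
    """Build keyword arguments for a single-metric CloudWatch alarm."""
    return {
        "AlarmName": f"{endpoint_name}-{suffix}",
        "ComparisonOperator": "GreaterThanThreshold",
        "EvaluationPeriods": evaluation_periods,
        "MetricName": metric_name,
        "Namespace": namespace,
        "Period": period,
        "Statistic": "Average",
        "Threshold": threshold,
        "ActionsEnabled": False,
        "AlarmDescription": description,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
    }

# Placeholder names; in the notebook these would be endpoint_name and variant_name
cpu_alarm = build_endpoint_alarm(
    "example-endpoint", "AllTraffic", "High-CPU", "CPUUtilization",
    "/aws/sagemaker/Endpoints", 80, "Alert when CPU utilization is high",
)
print(cpu_alarm["AlarmName"])
# With a CloudWatch client available: cw_client.put_metric_alarm(**cpu_alarm)
```

The error-rate alarm uses metric math and a different argument shape, so it would stay as its own call.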

### 5.2 List All CloudWatch Alarms

In [None]:
# List all alarms for this endpoint
alarms = cw_client.describe_alarms(
    AlarmNamePrefix=endpoint_name
)

print("\n=== CloudWatch Alarms ===")
for alarm in alarms['MetricAlarms']:
    print(f"\n{alarm['AlarmName']}")
    print(f"  State: {alarm['StateValue']}")
    print(f"  Metric: {alarm.get('MetricName', 'Expression')}")
    print(f"  Threshold: {alarm['Threshold']}")
    print(f"  Description: {alarm.get('AlarmDescription', 'N/A')}")

## 6. CloudWatch Dashboard <a id='dashboard'></a>

Create a centralized dashboard for visualizing all monitoring metrics.
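
In a CloudWatch dashboard body, each entry of a widget's `metrics` array lists the namespace, the metric name, and then dimension name/value pairs. The sketch below builds one such widget with a hypothetical `metric_widget` helper; the endpoint name, variant name, and region in the usage lines are placeholders.

```python
import json

def metric_widget(title, namespace, metric_names, dimensions, region,
                  x, y, stat="Average", period=300, width=12, height=6):
    """Build one CloudWatch metric widget as a plain dict."""
    # Flatten {"Name": "value"} into ["Name", "value", ...]
    dim_pairs = [item for pair in dimensions.items() for item in pair]
    return {
        "type": "metric",
        "x": x,
        "y": y,
        "width": width,
        "height": height,
        "properties": {
            "metrics": [[namespace, name, *dim_pairs] for name in metric_names],
            "period": period,
            "stat": stat,
            "region": region,
            "title": title,
        },
    }

# Placeholder values for illustration only
dims = {"EndpointName": "example-endpoint", "VariantName": "AllTraffic"}
cpu_widget = metric_widget(
    "CPU Utilization (%)", "/aws/sagemaker/Endpoints", ["CPUUtilization"],
    dims, "us-east-1", x=0, y=6,
)
widget_json = json.dumps({"widgets": [cpu_widget]})
print(widget_json)
```

The resulting JSON string is the kind of value `put_dashboard` takes as its `DashboardBody` argument.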

### 6.1 Create Comprehensive Monitoring Dashboard

In [None]:
# Dashboard names allow only letters, digits, "-" and "_", so avoid project_name (it contains spaces)
dashboard_name = "venuesignal-monitoring-dashboard"

# Define dashboard body
dashboard_body = {
    "widgets": [
        # Row 1: Endpoint Performance
        {
            "type": "metric",
            "properties": {
                "metrics": [
                    # Dimensions go inside each metric entry, after the metric name
                    ["AWS/SageMaker", "ModelLatency", "EndpointName", endpoint_name,
                     "VariantName", variant_name, {"stat": "Average", "label": "Avg Latency"}],
                    ["...", {"stat": "p99", "label": "P99 Latency"}]
                ],
                "period": 300,
                "stat": "Average",
                "region": REGION,
                "title": "Model Latency (microseconds)",
                "yAxis": {"left": {"min": 0}}
            },
            "width": 12,
            "height": 6,
            "x": 0,
            "y": 0
        },
        {
            "type": "metric",
            "properties": {
                # Dimensions belong inside each metric array; there is no widget-level "dimensions" property
                "metrics": [
                    ["AWS/SageMaker", "Invocations", "EndpointName", endpoint_name,
                     "VariantName", variant_name, {"stat": "Sum", "label": "Total Invocations"}],
                    ["AWS/SageMaker", "ModelInvocationErrors", "EndpointName", endpoint_name,
                     "VariantName", variant_name, {"stat": "Sum", "label": "Errors"}]
                ],
                "period": 300,
                "stat": "Sum",
                "region": region,
                "title": "Invocations & Errors",
                "yAxis": {"left": {"min": 0}}
            },
            "width": 12,
            "height": 6,
            "x": 12,
            "y": 0
        },
        
        # Row 2: Resource Utilization
        {
            "type": "metric",
            "properties": {
                # Dimensions belong inside each metric array; there is no widget-level "dimensions" property
                "metrics": [
                    ["/aws/sagemaker/Endpoints", "CPUUtilization", "EndpointName", endpoint_name,
                     "VariantName", variant_name, {"stat": "Average"}]
                ],
                "period": 300,
                "stat": "Average",
                "region": region,
                "title": "CPU Utilization (%)",
                "yAxis": {"left": {"min": 0, "max": 100}}
            },
            "width": 12,
            "height": 6,
            "x": 0,
            "y": 6
        },
        {
            "type": "metric",
            "properties": {
                # Dimensions belong inside each metric array; there is no widget-level "dimensions" property
                "metrics": [
                    ["/aws/sagemaker/Endpoints", "MemoryUtilization", "EndpointName", endpoint_name,
                     "VariantName", variant_name, {"stat": "Average"}]
                ],
                "period": 300,
                "stat": "Average",
                "region": region,
                "title": "Memory Utilization (%)",
                "yAxis": {"left": {"min": 0, "max": 100}}
            },
            "width": 12,
            "height": 6,
            "x": 12,
            "y": 6
        },
        
        # Row 3: Disk Utilization
        {
            "type": "metric",
            "properties": {
                # Dimensions belong inside each metric array; there is no widget-level "dimensions" property
                "metrics": [
                    ["/aws/sagemaker/Endpoints", "DiskUtilization", "EndpointName", endpoint_name,
                     "VariantName", variant_name, {"stat": "Average"}]
                ],
                "period": 300,
                "stat": "Average",
                "region": region,
                "title": "Disk Utilization (%)",
                "yAxis": {"left": {"min": 0, "max": 100}}
            },
            "width": 12,
            "height": 6,
            "x": 0,
            "y": 12
        },
        
        # Row 4: Model Monitor Violations
        {
            "type": "log",
            "properties": {
                "query": f"SOURCE '/aws/sagemaker/Endpoints/{endpoint_name}'\n| fields @timestamp, @message\n| filter @message like /violation/\n| sort @timestamp desc\n| limit 20",
                "region": region,
                "title": "Recent Monitoring Violations",
                "stacked": False
            },
            "width": 24,
            "height": 6,
            "x": 0,
            "y": 18
        }
    ]
}

# Create dashboard
try:
    cw_client.put_dashboard(
        DashboardName=dashboard_name,
        DashboardBody=json.dumps(dashboard_body)
    )
    print(f"✓ CloudWatch dashboard created: {dashboard_name}")
    print(f"\nView dashboard at:")
    print(f"https://console.aws.amazon.com/cloudwatch/home?region={region}#dashboards:name={dashboard_name}")
except Exception as e:
    print(f"✗ Error creating dashboard: {e}")
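
The metric widgets in the dashboard body share most of their structure, so a small builder can cut the repetition. The sketch below assumes the CloudWatch metric-array form `[namespace, metric, dimName, dimValue, ...]`. The function name `metric_widget`, its parameters, the default region, and the demo endpoint values are all hypothetical.

```python
import json


def metric_widget(title, namespace, metric_names, dimensions, stat, x, y,
                  region="us-east-1", width=12, height=6, period=300):
    """Build one CloudWatch metric widget with dimensions inline in each metric."""
    # Flatten {"Name": "Value"} into ["Name", "Value", ...] for the metric array
    dim_parts = [part for name, value in dimensions.items() for part in (name, value)]
    return {
        "type": "metric",
        "x": x,
        "y": y,
        "width": width,
        "height": height,
        "properties": {
            "metrics": [[namespace, metric, *dim_parts] for metric in metric_names],
            "period": period,
            "stat": stat,
            "region": region,
            "title": title,
        },
    }


# Placeholder dimension values for illustration only
demo_dims = {"EndpointName": "demo-endpoint", "VariantName": "AllTraffic"}
demo_widget = metric_widget(
    "CPU Utilization (%)", "/aws/sagemaker/Endpoints", ["CPUUtilization"],
    demo_dims, "Average", x=0, y=6,
)
print(json.dumps(demo_widget, indent=2))
```

A list of such widgets can be placed under the `"widgets"` key and serialized with `json.dumps` before calling `put_dashboard`.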

### 6.2 Query Recent Metrics

In [None]:
# Query recent endpoint metrics
from datetime import datetime, timedelta, timezone

end_time = datetime.now(timezone.utc)
start_time = end_time - timedelta(hours=1)

def get_metric_statistics(metric_name, namespace='AWS/SageMaker', statistic='Average'):
    """Helper function to get metric statistics"""
    try:
        response = cw_client.get_metric_statistics(
            Namespace=namespace,
            MetricName=metric_name,
            Dimensions=[
                {'Name': 'EndpointName', 'Value': endpoint_name},
                {'Name': 'VariantName', 'Value': variant_name}
            ],
            StartTime=start_time,
            EndTime=end_time,
            Period=300,
            Statistics=[statistic]
        )
        
        if response['Datapoints']:
            latest = sorted(response['Datapoints'], key=lambda x: x['Timestamp'])[-1]
            return latest[statistic]
        return None
    except Exception as e:
        # Report the failure instead of silently hiding it
        print(f"  ⚠ Could not fetch {metric_name}: {e}")
        return None

# get_metric_statistics returns only the most recent 5-minute datapoint in the window.
# Compare against None so that legitimate zero values (e.g. 0 errors) are still printed.
print("\n=== Recent Endpoint Metrics (latest 5-minute datapoint, last hour) ===")
print("\nPerformance:")
latency = get_metric_statistics('ModelLatency')
if latency is not None:
    print(f"  Average Latency: {latency/1000:.2f}ms")  # ModelLatency is reported in microseconds

invocations = get_metric_statistics('Invocations', statistic='Sum')
if invocations is not None:
    print(f"  Invocations: {invocations:.0f}")

errors = get_metric_statistics('ModelInvocationErrors', statistic='Sum')
if errors is not None:
    print(f"  Errors: {errors:.0f}")
    if invocations:
        error_rate = (errors / invocations) * 100
        print(f"  Error Rate: {error_rate:.2f}%")

print("\nResource Utilization:")
cpu = get_metric_statistics('CPUUtilization', namespace='/aws/sagemaker/Endpoints')
if cpu is not None:
    print(f"  CPU: {cpu:.2f}%")

memory = get_metric_statistics('MemoryUtilization', namespace='/aws/sagemaker/Endpoints')
if memory is not None:
    print(f"  Memory: {memory:.2f}%")

disk = get_metric_statistics('DiskUtilization', namespace='/aws/sagemaker/Endpoints')
if disk is not None:
    print(f"  Disk: {disk:.2f}%")
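
The `get_metric_statistics` helper above returns only the most recent datapoint. To get totals for the whole one-hour window, the datapoints can be combined instead. The sketch below is one possible approach; `aggregate_datapoints` and `error_rate_percent` are hypothetical helpers, and the sample datapoints are invented for illustration.

```python
def aggregate_datapoints(datapoints, statistic):
    """Combine the Datapoints list of a get_metric_statistics response."""
    values = [dp[statistic] for dp in datapoints if statistic in dp]
    if not values:
        return None
    if statistic == "Sum":
        return sum(values)
    # Unweighted mean of the per-period values; only an approximation
    # of the true average when periods have different sample counts
    return sum(values) / len(values)


def error_rate_percent(invocations, errors):
    """Return errors as a percentage of invocations, or None if undefined."""
    if not invocations:
        return None
    return (errors or 0) / invocations * 100


# Illustrative datapoints shaped like response['Datapoints']
sample_invocations = [{"Sum": 120.0}, {"Sum": 80.0}]
sample_errors = [{"Sum": 3.0}, {"Sum": 1.0}]

total_invocations = aggregate_datapoints(sample_invocations, "Sum")
total_errors = aggregate_datapoints(sample_errors, "Sum")
hourly_error_rate = error_rate_percent(total_invocations, total_errors)
print(f"Invocations: {total_invocations:.0f}, errors: {total_errors:.0f}")
print(f"Error rate: {hourly_error_rate:.2f}%")
```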

## 7. Generate Monitoring Reports <a id='reports'></a>

Generate comprehensive reports for model and data monitoring.

### 7.1 List All Monitoring Schedules

In [None]:
# List all monitoring schedules
schedules = sm_client.list_monitoring_schedules(
    EndpointName=endpoint_name,
    MaxResults=100
)

print("\n=== Active Monitoring Schedules ===")
for schedule in schedules['MonitoringScheduleSummaries']:
    print(f"\nSchedule: {schedule['MonitoringScheduleName']}")
    print(f"  Status: {schedule['MonitoringScheduleStatus']}")
    print(f"  Type: {schedule.get('MonitoringType', 'DataQuality')}")
    print(f"  Created: {schedule['CreationTime']}")
    print(f"  Last Modified: {schedule['LastModifiedTime']}")

### 7.2 Check Latest Monitoring Executions

In [None]:
def get_latest_execution(schedule_name):
    """Get the latest execution for a monitoring schedule"""
    try:
        executions = sm_client.list_monitoring_executions(
            MonitoringScheduleName=schedule_name,
            MaxResults=1,
            SortBy='CreationTime',
            SortOrder='Descending'
        )
        
        if executions['MonitoringExecutionSummaries']:
            return executions['MonitoringExecutionSummaries'][0]
        return None
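    except Exception as e:
        # Report the failure instead of silently hiding it
        print(f"  ⚠ Could not list executions for {schedule_name}: {e}")
        return None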

print("\n=== Latest Monitoring Executions ===")
for schedule in schedules['MonitoringScheduleSummaries']:
    schedule_name = schedule['MonitoringScheduleName']
    execution = get_latest_execution(schedule_name)
    
    print(f"\n{schedule_name}:")
    if execution:
        print(f"  Status: {execution['MonitoringExecutionStatus']}")
        print(f"  Scheduled: {execution.get('ScheduledTime', 'N/A')}")
        if 'ProcessingJobArn' in execution:
            print(f"  Processing Job: {execution['ProcessingJobArn'].split('/')[-1]}")
    else:
        print("  No executions yet (schedule may not have run)")
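
Not every execution status calls for action. The sketch below flags the ones that usually deserve a closer look. The helper name `execution_needs_attention` and the choice of statuses in `ATTENTION_STATUSES` are assumptions made for illustration; the sample summaries are invented.

```python
# Statuses treated as "look at this"; the selection is a judgment call
ATTENTION_STATUSES = {"CompletedWithViolations", "Failed", "Stopped"}


def execution_needs_attention(execution):
    """Return True if a monitoring execution summary should be reviewed."""
    if execution is None:
        return False
    return execution.get("MonitoringExecutionStatus") in ATTENTION_STATUSES


# Illustrative summaries shaped like list_monitoring_executions entries
sample_executions = [
    {"MonitoringExecutionStatus": "Completed"},
    {"MonitoringExecutionStatus": "CompletedWithViolations"},
    {"MonitoringExecutionStatus": "Failed", "FailureReason": "example reason"},
    None,  # schedule that has not run yet
]

flags = [execution_needs_attention(e) for e in sample_executions]
for execution, flagged in zip(sample_executions, flags):
    status = execution["MonitoringExecutionStatus"] if execution else "No executions"
    print(f"{status}: {'review' if flagged else 'ok'}")
```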

### 7.3 Generate Model Quality Report

In [None]:
# Download latest model quality results if available
try:
    # List files in model quality results location
    results_files = S3Downloader.list(model_quality_results_uri)
    
    if results_files:
        print(f"\n=== Model Quality Monitoring Results ===")
        print(f"Found {len(results_files)} result files")
        
        # Download the most recent constraint violations report
        for file in sorted(results_files, reverse=True):
            if 'constraint_violations' in file:
                local_file = S3Downloader.download(file, ".")
                with open(local_file[0], 'r') as f:
                    violations = json.load(f)
                
                if violations.get('violations'):
                    print(f"\n⚠️ Found {len(violations['violations'])} violations:")
                    for violation in violations['violations']:
                        print(f"  - {violation.get('description', 'Unknown violation')}")
                else:
                    print("\n✓ No violations detected")
                break
    else:
        print("\nNo monitoring results available yet.")
        print("Monitoring schedules need to run before results are generated.")
except Exception as e:
    print(f"\nCannot retrieve model quality results yet: {e}")
    print("This is expected if monitoring hasn't run yet.")

### 7.4 Generate Data Quality Report

In [None]:
# Download latest data quality results if available
try:
    results_files = S3Downloader.list(data_quality_results_uri)
    
    if results_files:
        print(f"\n=== Data Quality Monitoring Results ===")
        print(f"Found {len(results_files)} result files")
        
        # Download the most recent constraint violations report
        for file in sorted(results_files, reverse=True):
            if 'constraint_violations' in file:
                local_file = S3Downloader.download(file, ".")
                with open(local_file[0], 'r') as f:
                    violations = json.load(f)
                
                if violations.get('violations'):
                    print(f"\n⚠️ Found {len(violations['violations'])} violations:")
                    for violation in violations['violations']:
                        feature = violation.get('feature_name', 'Unknown')
                        description = violation.get('description', 'Unknown violation')
                        print(f"  - Feature '{feature}': {description}")
                else:
                    print("\n✓ No violations detected")
                break
    else:
        print("\nNo monitoring results available yet.")
        print("Monitoring schedules need to run before results are generated.")
except Exception as e:
    print(f"\nCannot retrieve data quality results yet: {e}")
    print("This is expected if monitoring hasn't run yet.")
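
When a report contains many violations, grouping them by check type shows which kind of problem dominates. The sketch below assumes each entry in the violations file carries a `constraint_check_type` field next to `feature_name` and `description`. `group_violations` is a hypothetical helper and the sample document is invented.

```python
from collections import defaultdict


def group_violations(violations_doc):
    """Map each constraint check type to the list of affected feature names."""
    grouped = defaultdict(list)
    for violation in violations_doc.get("violations", []):
        check_type = violation.get("constraint_check_type", "unknown")
        grouped[check_type].append(violation.get("feature_name", "Unknown"))
    return dict(grouped)


# Illustrative document shaped like a constraint_violations.json file
sample_violations = {
    "violations": [
        {"feature_name": "age", "constraint_check_type": "data_type_check",
         "description": "example description"},
        {"feature_name": "income", "constraint_check_type": "baseline_drift_check",
         "description": "example description"},
        {"feature_name": "tenure", "constraint_check_type": "baseline_drift_check",
         "description": "example description"},
    ]
}

violations_by_type = group_violations(sample_violations)
for check_type, features in sorted(violations_by_type.items()):
    print(f"{check_type}: {', '.join(features)}")
```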

### 7.5 Create Comprehensive Monitoring Summary Report

In [None]:
# Generate comprehensive monitoring report
report = {
    "report_timestamp": datetime.now(timezone.utc).isoformat(),
    "endpoint_name": endpoint_name,
    "monitoring_schedules": [],
    "infrastructure_metrics": {},
    "alarms": [],
    "recent_violations": []
}

# Add monitoring schedules
for schedule in schedules['MonitoringScheduleSummaries']:
    report["monitoring_schedules"].append({
        "name": schedule['MonitoringScheduleName'],
        "status": schedule['MonitoringScheduleStatus'],
        "type": schedule.get('MonitoringType', 'DataQuality')
    })

# Add infrastructure metrics
# Compare against None so that a legitimate 0.0 reading is not stored as null
report["infrastructure_metrics"] = {
    "latency_ms": latency / 1000 if latency is not None else None,
    "invocations": invocations if invocations is not None else 0,
    "errors": errors if errors is not None else 0,
    "cpu_utilization": cpu,
    "memory_utilization": memory,
    "disk_utilization": disk
}

# Add alarms
for alarm in alarms['MetricAlarms']:
    report["alarms"].append({
        "name": alarm['AlarmName'],
        "state": alarm['StateValue'],
        "metric": alarm.get('MetricName', 'Expression'),
        "threshold": alarm['Threshold']
    })

# Write the report to a local file, then upload it to S3
report_filename = f"monitoring_report_{datetime.now(timezone.utc):%Y%m%d_%H%M%S}.json"
with open(report_filename, 'w') as f:
    json.dump(report, f, indent=2)

report_s3_uri = S3Uploader.upload(report_filename, reports_uri)

print(f"\n✓ Monitoring report generated and saved to:")
print(f"  {report_s3_uri}")
print(f"\n=== Report Summary ===")
print(json.dumps(report, indent=2))
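
A single headline status can make the report easier to consume at a glance. The sketch below derives one from a report dictionary with the same keys as above. The function `overall_status`, its labels (`ALARM`, `VIOLATIONS`, `DEGRADED`, `HEALTHY`), and the precedence between them are assumptions for illustration, not part of the report format.

```python
def overall_status(report):
    """Reduce a monitoring report to one label, most severe condition first."""
    if any(alarm.get("state") == "ALARM" for alarm in report.get("alarms", [])):
        return "ALARM"
    if report.get("recent_violations"):
        return "VIOLATIONS"
    schedules = report.get("monitoring_schedules", [])
    if any(schedule.get("status") != "Scheduled" for schedule in schedules):
        return "DEGRADED"
    return "HEALTHY"


# Illustrative report with the same keys as the one generated above
sample_report = {
    "monitoring_schedules": [{"name": "demo-data-quality", "status": "Scheduled"}],
    "alarms": [{"name": "demo-high-cpu", "state": "OK"}],
    "recent_violations": [],
}

sample_status = overall_status(sample_report)
print(f"Overall status: {sample_status}")
```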

## 8. Testing Monitoring with Sample Data <a id='testing'></a>

Send test predictions to generate monitoring data.

In [None]:
# TODO: Uncomment and adapt this to your model's input format

# from sagemaker.predictor import Predictor
# from sagemaker.serializers import CSVSerializer
# from sagemaker.deserializers import JSONDeserializer
# 
# # Create predictor
# predictor = Predictor(
#     endpoint_name=endpoint_name,
#     sagemaker_session=sagemaker_session,
#     serializer=CSVSerializer(),
#     deserializer=JSONDeserializer()
# )
# 
# # Load test data
# test_data = df_train_sample.drop(columns=['label'])  # UPDATE column name
# 
# # Send test predictions
# print("Sending test predictions...")
# for i, row in test_data.iterrows():
#     prediction = predictor.predict(row.values.reshape(1, -1))
#     print(f"  Prediction {i+1}: {prediction}")
#     sleep(1)  # Small delay between requests
# 
# print("\n✓ Test predictions sent successfully")
# print("Data capture should now have recorded these predictions.")
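
Before sending requests, it can help to inspect the exact CSV payloads that would reach the endpoint. The sketch below builds one header-less CSV line per row using only the standard library. `rows_to_csv_payloads` is a hypothetical helper, and the sample rows are placeholders rather than this project's features.

```python
import csv
import io


def rows_to_csv_payloads(rows):
    """Serialize each row to a single CSV line without a header."""
    payloads = []
    for row in rows:
        buffer = io.StringIO()
        csv.writer(buffer).writerow(row)
        # Drop the writer's line terminator so each payload is one bare line
        payloads.append(buffer.getvalue().rstrip("\r\n"))
    return payloads


# Placeholder rows; real rows would come from the test DataFrame
sample_rows = [
    [5.1, 3.5, "a,b"],
    [1, 2, 3],
]

payloads = rows_to_csv_payloads(sample_rows)
for payload in payloads:
    print(payload)
```

Values that contain commas are quoted by the `csv` module, which avoids silently shifting columns in the captured data.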

## 9. Cleanup <a id='cleanup'></a>

**Important:** Run these cells to delete resources and avoid charges.

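**⚠️ Warning:** This will delete all monitoring schedules, alarms, and the dashboard. Only run after you've completed your demonstration and documentation.

**Note:** The cleanup cells reuse `sm_client`, `cw_client`, `schedules`, `alarms`, and `dashboard_name` from earlier sections. Run them in the same kernel session as those sections; in a fresh kernel they fail with `NameError`, as the recorded outputs in 9.1-9.3 show.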

### 9.1 Delete Monitoring Schedules

In [1]:
# Delete all monitoring schedules
print("Deleting monitoring schedules...")

for schedule in schedules['MonitoringScheduleSummaries']:
    schedule_name = schedule['MonitoringScheduleName']
    try:
        sm_client.delete_monitoring_schedule(MonitoringScheduleName=schedule_name)
        print(f"  ✓ Deleted: {schedule_name}")
    except Exception as e:
        print(f"  ✗ Error deleting {schedule_name}: {e}")

print("\nWaiting for deletions to complete...")
sleep(60)

Deleting monitoring schedules...


NameError: name 'schedules' is not defined
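
A fixed `sleep(60)` may wait too long or not long enough. One alternative is to poll until the schedule can no longer be described. The sketch below assumes the describe call raises an error whose `response['Error']['Code']` is `ResourceNotFound` once the schedule is gone. `wait_for_schedule_deletion` is a hypothetical helper, and the fake client is used only so the example runs without AWS access.

```python
import time


def wait_for_schedule_deletion(describe_schedule, schedule_name,
                               timeout_s=300, poll_s=10, sleep_fn=time.sleep):
    """Poll until describing the schedule fails with ResourceNotFound.

    Returns True if the schedule disappeared within timeout_s, else False.
    """
    waited = 0
    while waited <= timeout_s:
        try:
            describe_schedule(MonitoringScheduleName=schedule_name)
        except Exception as exc:
            # Assumed error shape: botocore-style exc.response['Error']['Code']
            code = getattr(exc, "response", {}).get("Error", {}).get("Code")
            if code == "ResourceNotFound":
                return True
            raise
        sleep_fn(poll_s)
        waited += poll_s
    return False


class FakeNotFound(Exception):
    """Stand-in for a client error, used only in this demo."""
    response = {"Error": {"Code": "ResourceNotFound"}}


describe_calls = []


def fake_describe(MonitoringScheduleName):
    """Pretend the schedule still exists for the first two polls."""
    describe_calls.append(MonitoringScheduleName)
    if len(describe_calls) >= 3:
        raise FakeNotFound()
    return {"MonitoringScheduleStatus": "Pending"}


deleted = wait_for_schedule_deletion(
    fake_describe, "demo-schedule", sleep_fn=lambda seconds: None
)
print(f"Deleted: {deleted} after {len(describe_calls)} polls")
```

In the notebook, `sm_client.describe_monitoring_schedule` would take the place of `fake_describe`.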

### 9.2 Delete CloudWatch Alarms

In [2]:
# Delete all CloudWatch alarms for this endpoint
print("Deleting CloudWatch alarms...")

try:
    alarm_names = [alarm['AlarmName'] for alarm in alarms['MetricAlarms']]
    if alarm_names:
        cw_client.delete_alarms(AlarmNames=alarm_names)
        print(f"  ✓ Deleted {len(alarm_names)} alarms")
    else:
        print("  No alarms to delete")
except Exception as e:
    print(f"  ✗ Error deleting alarms: {e}")

Deleting CloudWatch alarms...
  ✗ Error deleting alarms: name 'alarms' is not defined
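
`delete_alarms` accepts only a limited number of alarm names per call. The sketch below splits a list into batches, assuming a limit of 100 names per request; that limit, the helper name `chunked`, and the demo alarm names are assumptions to check against the current CloudWatch documentation.

```python
def chunked(items, size=100):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder alarm names for illustration
demo_alarm_names = [f"demo-alarm-{i}" for i in range(250)]

batches = list(chunked(demo_alarm_names))
for batch in batches:
    # In the notebook: cw_client.delete_alarms(AlarmNames=batch)
    print(f"Would delete {len(batch)} alarms")
```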


### 9.3 Delete CloudWatch Dashboard

In [3]:
# Delete CloudWatch dashboard
try:
    cw_client.delete_dashboards(DashboardNames=[dashboard_name])
    print(f"✓ Deleted dashboard: {dashboard_name}")
except Exception as e:
    print(f"✗ Error deleting dashboard: {e}")

✗ Error deleting dashboard: name 'cw_client' is not defined


### 9.4 Optional: Delete S3 Monitoring Data

In [4]:
# Uncomment to delete S3 monitoring data
# This will delete all captured data, baselines, and reports

# print("Deleting S3 monitoring data...")
# 
# s3_resource = boto3.resource('s3')
# bucket_resource = s3_resource.Bucket(bucket)
# 
# # Delete monitoring prefix
# bucket_resource.objects.filter(Prefix=f"{prefix}/monitoring/").delete()
# 
# print(f"✓ Deleted all S3 objects under s3://{bucket}/{prefix}/monitoring/")

### 9.5 Keep Endpoint Running (Optional)

**Note:** By default, we do NOT delete the endpoint as it may be needed for other project deliverables. 
If you want to delete the endpoint as well, uncomment and run the cell below.

In [5]:
# Uncomment to delete the endpoint
# 
# print("Deleting endpoint...")
# predictor = Predictor(
#     endpoint_name=endpoint_name,
#     sagemaker_session=sagemaker_session
# )
# predictor.delete_endpoint(delete_endpoint_config=True)
# print("✓ Endpoint deleted")

---

## Summary

This notebook has implemented comprehensive monitoring for your ML system:

✅ **Model Quality Monitoring** - Tracks prediction accuracy over time

✅ **Data Quality Monitoring** - Detects data drift and quality issues

✅ **Model Bias Monitoring** - Monitors for fairness issues (if applicable)

✅ **Infrastructure Monitoring** - CloudWatch alarms for latency, errors, CPU, memory

✅ **CloudWatch Dashboard** - Centralized visualization of all metrics

✅ **Automated Reporting** - Comprehensive monitoring reports

### Next Steps for Module 5:

1. **Take Screenshots** - Capture dashboard, alarms, and monitoring schedules for your video demonstration

2. **Update Design Document** - Document your monitoring approach in the ML Design Document

3. **Wait for Executions** - Monitoring schedules run daily, so come back in 24-48 hours to see results

4. **Generate Sample Violations** - (Optional) Intentionally send bad data to trigger violations for demonstration

5. **Prepare for Module 6** - CI/CD will integrate with this monitoring infrastructure

### Module 5 Deliverable Checklist:

- [ ] Model monitors implemented
- [ ] Data monitors implemented
- [ ] Infrastructure monitors implemented
- [ ] CloudWatch dashboard created
- [ ] Model and data reports generated
- [ ] Screenshots captured for documentation
- [ ] Design document updated with monitoring details

---