# Feature Attribution Drift Monitoring with AWS SageMaker Clarify

This Jupyter notebook shows how to perform model explainability monitoring with AWS SageMaker (based on the [docs](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_model_monitor/fairness_and_explainability/SageMaker-Model-Monitor-Fairness-and-Explainability.html)).


Amazon SageMaker Clarify explainability monitoring offers tools to provide global explanations of models and to explain the predictions of a deployed model producing inferences. Such model explanation tools can help ML modelers, developers, and other internal stakeholders understand model characteristics as a whole prior to deployment and debug predictions provided by the model once deployed. The current offering includes a scalable and efficient implementation of [**SHAP**](https://papers.nips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html), based on the concept of the [**Shapley value**](https://en.wikipedia.org/wiki/Shapley_value) from the field of cooperative game theory, which assigns each feature an importance value for a particular prediction.
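
To make the idea of a Shapley value concrete, here is a minimal pure-Python sketch that computes exact Shapley values for a toy model by averaging each feature's marginal contribution over all feature orderings. The `toy_model`, `instance`, and `baseline` are hypothetical and exist only for this illustration. Clarify itself uses the Kernel SHAP approximation rather than this brute-force enumeration, which is only feasible for a handful of features.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)        # start with every feature at its baseline value
        previous = predict(current)
        for j in order:
            current[j] = instance[j]    # switch feature j to its real value
            value = predict(current)
            totals[j] += value - previous
            previous = value
    return [total / len(orderings) for total in totals]


def toy_model(x):
    # Hypothetical model: two linear terms plus one interaction term
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]


instance = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print("Shapley values:", phi)
print("Sum of values:", sum(phi))
print("Prediction minus baseline prediction:", toy_model(instance) - toy_model(baseline))
```

The values sum to the difference between the prediction for the instance and the prediction for the baseline, which is why the choice of baseline matters later in this notebook.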


Notebook structure:

## Table of Contents
1. **[Configuration](#Configuration)**  
1. **[Model and Data Preparation](#Model-and-Data-Preparation)**
1. **[Deploying model for ML Observability](#Deploying-model-for-ML-Observability)**
1. **[Generate traffic](#Generate-traffic)** - Provides artificial traffic for explainability metrics
1. **[Setting up monitoring job](#Setting-up-monitoring-job)** - Creates monitoring tasks by creating a baseline and scheduling regular monitoring
1. **[Cleaning up](#Cleaning-up)** - Removes all the created resources.

Prerequisites:

- Existing Roles with all needed permissions (S3, SageMaker, etc.)
- Configured SageMaker Domain
- SageMaker Studio user
- S3 bucket with pretrained model and data

You can run this notebook in SageMaker Studio or on a SageMaker notebook instance (a Jupyter-like environment).  
To do that, follow these steps:

1. Open SageMaker Studio or create a new SageMaker notebook instance
1. Clone this repository (https://github.com/griddynamics/gd-ml-observability.git)



### Imports

In [6]:
%pip install -q --upgrade boto3

Note: you may need to restart the kernel to use updated packages.


In [7]:
import copy
import time
import pandas as pd
import threading

from datetime import datetime

from sagemaker import get_execution_role, image_uris, Session
from sagemaker.clarify import (
    DataConfig,
    ModelConfig,
    SHAPConfig,
)
from sagemaker.model import Model
from sagemaker.model_monitor import (
    CronExpressionGenerator,
    DataCaptureConfig,
    ExplainabilityAnalysisConfig,
    ModelExplainabilityMonitor,
)
from sagemaker.predictor import Predictor
from sagemaker.s3 import S3Downloader, S3Uploader

## Configuration

In [8]:
role = get_execution_role()
print(f"Execution Role: {role}")

sagemaker_session = Session()
sagemaker_client = sagemaker_session.sagemaker_client
sagemaker_runtime_client = sagemaker_session.sagemaker_runtime_client

region = sagemaker_session.boto_region_name
print(f"AWS region: {region}")

# A different bucket can be used, but make sure the role for this notebook has
# the s3:PutObject permission. This is the bucket into which the data is captured.
bucket = sagemaker_session.default_bucket()
print(f"Demo Bucket: {bucket}")
prefix = "sagemaker/shap-observability-promo-planning"
s3_key = f"s3://{bucket}/{prefix}"
print(f"S3 key: {s3_key}")

s3_capture_upload_path = f"{s3_key}/datacapture"
s3_report_path = f"{s3_key}/reports"

print(f"Capture path: {s3_capture_upload_path}")
print(f"Report path: {s3_report_path}")

baseline_results_uri = f"{s3_key}/baselining"
print(f"Baseline results uri: {baseline_results_uri}")

endpoint_instance_count = 1
endpoint_instance_type = "ml.m5.large"
schedule_expression = CronExpressionGenerator.hourly()

Execution Role: arn:aws:iam::125667932402:role/service-role/AmazonSageMaker-ExecutionRole-20230117T121938
AWS region: us-east-1
Demo Bucket: sagemaker-us-east-1-125667932402
S3 key: s3://sagemaker-us-east-1-125667932402/sagemaker/shap-observability-promo-planning
Capture path: s3://sagemaker-us-east-1-125667932402/sagemaker/shap-observability-promo-planning/datacapture
Report path: s3://sagemaker-us-east-1-125667932402/sagemaker/shap-observability-promo-planning/reports
Baseline results uri: s3://sagemaker-us-east-1-125667932402/sagemaker/shap-observability-promo-planning/baselining


### Test bucket connectivity

In [9]:
# Upload a test file
test_file = 'upload-test-file.txt'
with open(test_file, 'w') as f:
    f.write('Hello world!\n')

S3Uploader.upload(test_file, f"s3://{bucket}/test_upload")
print("Success! We are all set to proceed.")

Success! We are all set to proceed.


## Model and Data Preparation

This explainability demo uses a promotion planning dataset with a pre-trained XGBoost model.  
We need to specify the S3 URIs of the pre-trained model and of the data that will be used to compute the baseline dataset for SHAP.

In [10]:
s3_data_uri = 's3://adp-rnd-ml-datasets/promotion-planning/validation/data.csv'
s3_model_uri = 's3://adp-rnd-ml-models/promotion-planning/model/promotion-planning-train-job-2023-01-31-084806/output/model.tar.gz'

In [11]:
dataset_type = 'text/csv'

model_dir = 'model'
model_file = f'{model_dir}/model.tar.gz'
S3Downloader.download(s3_model_uri, model_dir)


dataset_dir = 'data'
S3Downloader.download(s3_data_uri, dataset_dir)

In [12]:
df = pd.read_csv(f'{dataset_dir}/data.csv')
print('SHAPE:', df.shape)

all_headers = df.columns.tolist()
label_header = all_headers[0]
all_headers[:10]

SHAPE: (57834, 111)


['OUTPUT_LABEL',
 'amt',
 'oft',
 'amount_365_days_lag',
 'off_365_days_lag',
 'black_friday',
 'business_day',
 'cyber_monday',
 'day_of_week',
 'day_of_month']

For the purposes of this demo, we will use just a fraction of all available data.

In [24]:
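import os

fraction = 0.01
test_idx = df.sample(frac=fraction).index
test_data = df[df.index.isin(test_idx)]
print('Test shape:', test_data.shape)
test_dataset_dir = 'test'
os.makedirs(test_dataset_dir, exist_ok=True)  # to_csv does not create missing directories
test_dataset = f'{test_dataset_dir}/test.csv'
# Endpoint payloads: features only, no header row
test_data.drop(label_header, axis=1).to_csv(test_dataset, index=False, header=False)


# Sample validation rows that do not overlap with the test rows
val_data = df[~df.index.isin(test_idx)].sample(frac=fraction)
print('Validation shape:', val_data.shape)
validation_dataset_dir = 'validation'
os.makedirs(validation_dataset_dir, exist_ok=True)
validation_dataset = f'{validation_dataset_dir}/validation.csv'
# Baselining input: keeps the label column and the header row
val_data.to_csv(validation_dataset, index=False, header=True)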

Test shape: (578, 111)
Validation shape: (573, 111)


In [25]:
model_url = S3Uploader.upload(model_file, s3_key)
print(f"Model file has been uploaded to {model_url}")

Model file has been uploaded to s3://sagemaker-us-east-1-125667932402/sagemaker/shap-observability-promo-planning/model.tar.gz


## Deploying model for ML Observability

Setting up the pre-trained model

In [15]:
model_name = f"shap-observability-promo-planning-{datetime.utcnow():%Y-%m-%d-%H%M}"
print("Model name: ", model_name)
endpoint_name = f"shap-observability-promo-planning-{datetime.utcnow():%Y-%m-%d-%H%M}"
print("Endpoint name: ", endpoint_name)

image_uri = image_uris.retrieve("xgboost", region, '0.90-1')
print(f"XGBoost image uri: {image_uri}")
model = Model(
    role=role,
    name=model_name,
    image_uri=image_uri,
    model_data=model_url,
    sagemaker_session=sagemaker_session,
)

Model name:  shap-observability-promo-planning-2023-03-06-1005
Endpoint name:  shap-observability-promo-planning-2023-03-06-1005
XGBoost image uri: 683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:0.90-1-cpu-py3


Setting up the data capture config so that model explainability can be monitored based on the stored data

In [16]:
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri=s3_capture_upload_path,
)
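
With capture enabled, each invocation is stored under `s3_capture_upload_path` as one JSON object per line. The sketch below parses such a line with the standard library. The field names follow the data capture format as we understand it, the sample values are made up for illustration, and `parse_capture_line` is a hypothetical helper, so verify the layout against a real captured file before relying on it.

```python
import json

# Illustrative record only: all values are invented for this example
sample_line = json.dumps({
    "captureData": {
        "endpointInput": {
            "observedContentType": "text/csv",
            "mode": "INPUT",
            "data": "12.5,0,3.1",
            "encoding": "CSV",
        },
        "endpointOutput": {
            "observedContentType": "text/csv; charset=utf-8",
            "mode": "OUTPUT",
            "data": "0.42",
            "encoding": "CSV",
        },
    },
    "eventMetadata": {
        "eventId": "00000000-0000-0000-0000-000000000000",
        "inferenceId": "0",
        "inferenceTime": "2023-03-06T10:30:00Z",
    },
    "eventVersion": "0",
})


def parse_capture_line(line):
    """Return (inference_id, features, prediction) from one captured JSON line."""
    record = json.loads(line)
    capture = record["captureData"]
    features = [float(v) for v in capture["endpointInput"]["data"].split(",")]
    prediction = float(capture["endpointOutput"]["data"])
    inference_id = record["eventMetadata"].get("inferenceId")
    return inference_id, features, prediction


print(parse_capture_line(sample_line))
```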

Deploying the model to the endpoint that will be monitored

In [17]:
print(f"Deploying model {model_name} to endpoint {endpoint_name}")
model.deploy(
    initial_instance_count=endpoint_instance_count,
    instance_type=endpoint_instance_type,
    endpoint_name=endpoint_name,
    data_capture_config=data_capture_config,
)

Deploying model shap-observability-promo-planning-2023-03-06-1005 to endpoint shap-observability-promo-planning-2023-03-06-1005
------!

## Generate traffic

If there is no traffic, the monitoring jobs are marked as Failed since there is no data to process.  
So for this example we generate artificial traffic based on the test dataset.

In [38]:

class WorkerThread(threading.Thread):
    def __init__(self, do_run, *args, **kwargs):
        super(WorkerThread, self).__init__(*args, **kwargs)
        self.__do_run = do_run
        self.__terminate_event = threading.Event()

    def terminate(self):
        self.__terminate_event.set()

    def run(self):
        while not self.__terminate_event.is_set():
            self.__do_run(self.__terminate_event)


def invoke_endpoint(terminate_event):
    with open(test_dataset, "r") as f:
        i = 0
        for row in f:
            payload = row.rstrip("\n")
            response = sagemaker_runtime_client.invoke_endpoint(
                EndpointName=endpoint_name,
                ContentType="text/csv",
                Body=payload,
                InferenceId=str(i),  # unique ID per row
            )
            i += 1
            response["Body"].read()
            time.sleep(1)
            if terminate_event.is_set():
                break


# Keep invoking the endpoint with test data
invoke_endpoint_thread = WorkerThread(do_run=invoke_endpoint)
invoke_endpoint_thread.start()

## Setting up monitoring job

Creating model explainability monitor

In [19]:
model_explainability_monitor = ModelExplainabilityMonitor(
    role=role,
    sagemaker_session=sagemaker_session,
    max_runtime_in_seconds=1800,
)


model_explainability_baselining_job_result_uri = f"{baseline_results_uri}/model_explainability"
print(f'Explainability baseline s3 uri: {model_explainability_baselining_job_result_uri}')

Explainability baseline s3 uri: s3://sagemaker-us-east-1-125667932402/sagemaker/shap-observability-promo-planning/baselining/model_explainability


Creating model and data config

In [26]:
model_config = ModelConfig(
    model_name=model_name,
    instance_count=endpoint_instance_count,
    instance_type=endpoint_instance_type,
    content_type=dataset_type,
    accept_type=dataset_type,
)

model_explainability_data_config = DataConfig(
    s3_data_input_path=validation_dataset,
    s3_output_path=model_explainability_baselining_job_result_uri,
    label=label_header,
    headers=all_headers,
    dataset_type=dataset_type,
)

### Create SHAP baseline

To create an explainability baseline, we need to provide a baseline dataset for the Kernel SHAP algorithm.
The number of samples (`num_samples`) determines the size of the generated synthetic dataset (if not provided, Clarify chooses a value based on the number of features). There are three ways to aggregate SHAP values (`agg_method`): `mean_abs` (mean of absolute SHAP values), `median` (median of SHAP values for all instances) and `mean_sq` (mean of squared SHAP values).
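
The three aggregation methods can be illustrated with a small pure-Python sketch. The `local_shap` matrix (rows are instances, columns are features) and the `aggregate` helper are hypothetical; this only shows the arithmetic behind each option and is not how Clarify computes it internally.

```python
from statistics import mean, median

# Hypothetical local SHAP values: 3 instances x 2 features
local_shap = [
    [0.2, -0.1],
    [-0.4, 0.3],
    [0.6, -0.2],
]


def aggregate(local_values, agg_method):
    """Collapse per-instance SHAP values into one global value per feature."""
    columns = [list(col) for col in zip(*local_values)]
    if agg_method == "mean_abs":
        return [mean([abs(v) for v in col]) for col in columns]
    if agg_method == "median":
        return [median(col) for col in columns]
    if agg_method == "mean_sq":
        return [mean([v * v for v in col]) for col in columns]
    raise ValueError(f"Unknown aggregation method: {agg_method}")


for method in ("mean_abs", "median", "mean_sq"):
    print(method, aggregate(local_shap, method))
```

Note that `median` keeps the sign of the contributions, while `mean_abs` and `mean_sq` measure magnitude only.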

In [27]:
test_dataframe = pd.read_csv(test_dataset, header=None)
shap_baseline = test_dataframe.sample(frac=fraction).values.tolist()


shap_config = SHAPConfig(
    baseline=shap_baseline,
    num_samples=50,
    agg_method="mean_abs",
    save_local_shap_values=False,
)

#### Run explainability baselining job

In [28]:
model_explainability_monitor.suggest_baseline(
    data_config=model_explainability_data_config,
    model_config=model_config,
    explainability_config=shap_config,
)
latest_baselining_job_name = model_explainability_monitor.latest_baselining_job_name
print(f"ModelExplainabilityMonitor baselining job: {latest_baselining_job_name}")


INFO:sagemaker:Creating processing-job with name baseline-suggestion-job-2023-03-06-10-17-13-690


ModelExplainabilityMonitor baselining job: baseline-suggestion-job-2023-03-06-10-17-13-690


Wait for the baselining job and review the constraints

In [29]:
model_explainability_monitor.latest_baselining_job.wait(logs=False)
model_explainability_constraints = model_explainability_monitor.suggested_constraints()
if model_explainability_constraints is not None:
    print(
        "ModelExplainabilityMonitor suggested constraints: "
        f"{model_explainability_constraints.file_s3_uri}"
    )
    print(S3Downloader.read_file(model_explainability_constraints.file_s3_uri))

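# An explicit analysis config is only needed when no baselining job was run;
# otherwise the monitor reuses the configuration of the latest baselining job.
model_explainability_analysis_config = None
if not model_explainability_monitor.latest_baselining_job:
    # Remove label because only features are required for the analysis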
    headers_without_label_header = copy.deepcopy(all_headers)
    headers_without_label_header.remove(label_header)
    model_explainability_analysis_config = ExplainabilityAnalysisConfig(
        explainability_config=shap_config,
        model_config=model_config,
        headers=headers_without_label_header,
    ) 
    

.......................................................................................................!

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


ModelExplainabilityMonitor suggested constraints: s3://sagemaker-us-east-1-125667932402/sagemaker/shap-observability-promo-planning/baselining/model_explainability/analysis.json
{
    "version": "1.0",
    "explanations": {
        "kernel_shap": {
            "label0": {
                "global_shap_values": {
                    "amt": 0.10948012214978607,
                    "oft": 0.10853867704455387,
                    "amount_365_days_lag": 0.10292516679605697,
                    "off_365_days_lag": 0.11545188903392774,
                    "black_friday": 0.09804559208200446,
                    "business_day": 0.11297093042708263,
                    "cyber_monday": 0.11255411381564072,
                    "day_of_week": 0.1071939282195742,
                    "day_of_month": 0.11620003500504866,
                    "month": 0.10880362477807314,
                    "is_holiday": 0.10026392279858574,
                    "list_price": 0.506711290138479,
                    "list

### Create Monitoring schedule

In [30]:
model_explainability_monitor.create_monitoring_schedule(
    output_s3_uri=s3_report_path,
    endpoint_input=endpoint_name,
    schedule_cron_expression=schedule_expression,
    analysis_config=model_explainability_analysis_config
)

INFO:sagemaker.model_monitor.clarify_model_monitoring:Uploading analysis config to {s3_uri}.
INFO:sagemaker.model_monitor.model_monitoring:Creating Monitoring Schedule with name: monitoring-schedule-2023-03-06-10-26-10-817


You can find the created monitoring schedule in SageMaker Studio by navigating to the Deployments -> Endpoints section in the sidebar, choosing your endpoint, and opening the `Model explainability` tab.
![Schedule](images/sm-shap-monitoring-schedule.png)

#### Wait for monitoring job to start first execution

In [31]:
def wait_for_execution_to_start(model_monitor):
    print(
        "A hourly schedule was created above and it will kick off executions ON the hour (plus 0 - 20 min buffer)."
    )

    print("Waiting for the first execution to happen", end="")
    schedule_desc = model_monitor.describe_schedule()
    while "LastMonitoringExecutionSummary" not in schedule_desc:
        schedule_desc = model_monitor.describe_schedule()
        print(".", end="", flush=True)
        time.sleep(60)
    print()
    print("Done! Execution has been created")

    print("Now waiting for execution to start", end="")
    while schedule_desc["LastMonitoringExecutionSummary"]["MonitoringExecutionStatus"] == "Pending":
        schedule_desc = model_monitor.describe_schedule()
        print(".", end="", flush=True)
        time.sleep(10)

    print()
    print("Done! Execution has started")



# Waits for the schedule to have last execution in a terminal status.
def wait_for_execution_to_finish(model_monitor):
    schedule_desc = model_monitor.describe_schedule()
    execution_summary = schedule_desc.get("LastMonitoringExecutionSummary")
    if execution_summary is not None:
        print("Waiting for execution to finish", end="")
        while execution_summary["MonitoringExecutionStatus"] not in [
            "Completed",
            "CompletedWithViolations",
            "Failed",
            "Stopped",
        ]:
            print(".", end="", flush=True)
            time.sleep(60)
            schedule_desc = model_monitor.describe_schedule()
            execution_summary = schedule_desc["LastMonitoringExecutionSummary"]
        print()
        print("Done! Execution has finished")
    else:
        print("Last execution not found")
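
The waiting loops above run until the condition is met, so a schedule that never fires would block the notebook indefinitely. A minimal sketch of a polling helper with a timeout is shown below. The `wait_until` name and its parameters are hypothetical and not part of the SageMaker SDK.

```python
import time


def wait_until(predicate, timeout_seconds, poll_seconds=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True or the timeout expires.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = clock() + timeout_seconds
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)


# Local demonstration with a condition that becomes true on the third check
attempts = []


def ready():
    attempts.append(1)
    return len(attempts) >= 3


print(wait_until(ready, timeout_seconds=5, poll_seconds=0.01))

# With the monitor from this notebook, a call could look like this (not run here):
# wait_until(
#     lambda: "LastMonitoringExecutionSummary" in model_monitor.describe_schedule(),
#     timeout_seconds=90 * 60,
#     poll_seconds=60,
# )
```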


In [36]:
wait_for_execution_to_start(model_explainability_monitor)

A hourly schedule was created above and it will kick off executions ON the hour (plus 0 - 20 min buffer).
Waiting for the first execution to happen
Done! Execution has been created
Now waiting for execution to start
Done! Execution has started


In [43]:
wait_for_execution_to_finish(model_explainability_monitor)

Waiting for execution to finish..
Done! Execution has finished


In [44]:

schedule_desc = model_explainability_monitor.describe_schedule()
execution_summary = schedule_desc.get("LastMonitoringExecutionSummary")
if execution_summary and execution_summary["MonitoringExecutionStatus"] in [
    "Completed",
    "CompletedWithViolations",
]:
    last_model_explainability_monitor_execution = model_explainability_monitor.list_executions()[-1]
    last_model_explainability_monitor_execution_report_uri = (
        last_model_explainability_monitor_execution.output.destination
    )
    print(f"Report URI: {last_model_explainability_monitor_execution_report_uri}")
    last_model_explainability_monitor_execution_report_files = sorted(
        S3Downloader.list(last_model_explainability_monitor_execution_report_uri)
    )
    print("Found Report Files:")
    print("\n ".join(last_model_explainability_monitor_execution_report_files))
else:
    last_model_explainability_monitor_execution = None
    print(
        "====STOP==== \n No completed executions to inspect further. Please wait till an execution completes or investigate previously reported failures."
    )

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


Report URI: s3://sagemaker-us-east-1-125667932402/sagemaker/shap-observability-promo-planning/reports/shap-observability-promo-planning-2023-03-06-1005/monitoring-schedule-2023-03-06-10-26-10-817/2023/03/06/17
Found Report Files:
s3://sagemaker-us-east-1-125667932402/sagemaker/shap-observability-promo-planning/reports/shap-observability-promo-planning-2023-03-06-1005/monitoring-schedule-2023-03-06-10-26-10-817/2023/03/06/17/analysis.json
 s3://sagemaker-us-east-1-125667932402/sagemaker/shap-observability-promo-planning/reports/shap-observability-promo-planning-2023-03-06-1005/monitoring-schedule-2023-03-06-10-26-10-817/2023/03/06/17/report.html
 s3://sagemaker-us-east-1-125667932402/sagemaker/shap-observability-promo-planning/reports/shap-observability-promo-planning-2023-03-06-1005/monitoring-schedule-2023-03-06-10-26-10-817/2023/03/06/17/report.ipynb
 s3://sagemaker-us-east-1-125667932402/sagemaker/shap-observability-promo-planning/reports/shap-observability-promo-planning-2023-03-06

In [46]:
if last_model_explainability_monitor_execution:
    model_explainability_violations = (
        last_model_explainability_monitor_execution.constraint_violations()
    )
    if model_explainability_violations:
        print(model_explainability_violations.body_dict)


Could not retrieve constraints file at location 's3://sagemaker-us-east-1-125667932402/sagemaker/shap-observability-promo-planning/reports/shap-observability-promo-planning-2023-03-06-1005/monitoring-schedule-2023-03-06-10-26-10-817/2023/03/06/17/constraint_violations.json'. To manually retrieve ConstraintViolations object from a given uri, use 'my_model_monitor.constraints(my_s3_uri)' or 'ConstraintViolations.from_s3_uri(my_s3_uri)'
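
The message above indicates that no violations file was produced for this execution. As far as we understand the AWS documentation on feature attribution drift, the monitor compares the ranking of feature attributions on live traffic against the baseline using a Normalized Discounted Cumulative Gain (NDCG) score and raises a violation when the score falls below 0.90. The sketch below is an illustrative pure-Python version of such a score under that assumption, not the Clarify implementation. The `ndcg` function and the `shifted` attributions are made up for the example; the `baseline` values are rounded from the constraints printed earlier.

```python
import math


def ndcg(baseline_attr, live_attr):
    """NDCG of live attributions when features are ranked in baseline order.

    Assumes both dicts contain the same feature names.
    """
    by_baseline = sorted(baseline_attr, key=baseline_attr.get, reverse=True)
    by_live = sorted(live_attr, key=live_attr.get, reverse=True)
    # enumerate starts at 0, so log2(i + 2) equals log2(rank + 1) for rank 1, 2, ...
    dcg = sum(live_attr[f] / math.log2(i + 2) for i, f in enumerate(by_baseline))
    ideal_dcg = sum(live_attr[f] / math.log2(i + 2) for i, f in enumerate(by_live))
    return dcg / ideal_dcg


# Rounded from the baseline constraints shown earlier in this notebook
baseline = {"list_price": 0.5067, "day_of_month": 0.1162, "amt": 0.1095}
# Hypothetical live attributions where the most important feature has changed
shifted = {"list_price": 0.10, "day_of_month": 0.12, "amt": 0.50}

print("Unchanged ranking:", ndcg(baseline, baseline))
print("Shifted ranking:", ndcg(baseline, shifted))
```

An unchanged ranking scores 1.0, while the shifted attributions score well below 0.90 and would count as drift under the assumed threshold.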


Now the feature importance chart is displayed in SageMaker Studio.
![Chart](images/sm-shap-chart.png)

You can also view previous explainability monitoring jobs in the `Monitoring job history` tab.
![Monitoring Jobs History](images/sm-shap-monitoring-job-history.png)

SageMaker also provides reports, which are saved to S3 under the specified `s3_report_path` in multiple formats, including HTML and PDF.
![Report](images/sm-shap-explainability-report.png)

In [47]:
print(f"Report path: {s3_report_path}")

Report path: s3://sagemaker-us-east-1-125667932402/sagemaker/shap-observability-promo-planning/reports


## Cleaning up

If you do not plan to use the endpoint further, delete it to avoid incurring additional charges. Note that deleting the endpoint does not delete the data that was captured during the model invocations.

In [48]:

model_explainability_monitor.stop_monitoring_schedule()
model_explainability_monitor.delete_monitoring_schedule()


Stopping Monitoring Schedule with name: monitoring-schedule-2023-03-06-10-26-10-817

Deleting Monitoring Schedule with name: monitoring-schedule-2023-03-06-10-26-10-817


INFO:sagemaker.model_monitor.clarify_model_monitoring:Deleting Model Explainability Job Definition with name: model-explainability-job-definition-2023-03-06-10-26-10-817


In [50]:
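invoke_endpoint_thread.terminate()
# Wait for the traffic thread to exit so it does not call an endpoint that is being deleted
invoke_endpoint_thread.join()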

predictor = Predictor(endpoint_name, sagemaker_session=sagemaker_session)
predictor.delete_endpoint()
predictor.delete_model()