# SageMaker MLOps Clarify Team Demo - Configuring Features to Explain

## Contents

1. [Overview](#Overview)
1. [Prerequisites and Data](#Prerequisites-and-Data)
    1. [Import libraries](#Import-libraries)
    1. [Set configurations](#Set-configurations)
    1. [Download data](#Download-data)
    1. [Loading the data: Adult Dataset](#Loading-the-data:-Adult-Dataset) 
    1. [Data inspection](#Data-inspection) 
    1. [Encode and Upload the Dataset](#Encode-and-Upload-the-Dataset) 
1. [Train XGBoost Model](#Train-XGBoost-Model)
    1. [Train Model](#Train-Model)
    1. [Create Model](#Create-Model)
1. [Amazon SageMaker Clarify](#Amazon-SageMaker-Clarify)
    1. [Set Configurations](#Set-Configurations)
    1. [Get Started with a SageMaker Clarify Container](#Get-Started-with-a-SageMaker-Clarify-Container)
    1. [Explaining Predictions](#Explaining-Predictions)
        1. [Configure a SageMaker Clarify Processing Job Container's Input and Output Parameters ](#Configure-a-SageMaker-Clarify-Processing-Job-Container's-input-and-output-parameters)
        1. [Configure Analysis Config](#Configure-analysis-config)
        1. [Run SageMaker Clarify Processing Job](#Run-SageMaker-Clarify-Processing-job)
        1. [Viewing the Explainability Report](#Viewing-the-Explainability-Report)
        1. [Analysis of local explanations](#Analysis-of-local-explanations)


## Overview

Amazon SageMaker Clarify can help improve your machine learning models by explaining how these models make predictions. Specifically, Clarify uses the Kernel SHAP algorithm to explain the contribution that each model feature makes to the final prediction. 

By default, Clarify computes feature importance for every feature in the model input. This demo illustrates a new configuration option for tabular SHAP explainability that selects which model feature columns are included in the explainability calculations.

### Customer - SageMaker AutoPilot

* Autopilot has integrated Clarify's explainability analysis and report generation for the models it creates.
* With Autopilot, customers can select dataset features for model training.
* Internally, Autopilot treats unselected features as model input, then applies a column-selection feature transform and trains a downstream model only on the customer-selected features.

### Customer Requirement

* To be able to exclude some features from explainability computations, while still sending them to the model.



## In this demo, we will walk through:

* Key terms and concepts needed to understand SageMaker Clarify 
* Analysis configuration for explainability and how to configure `features_to_explain`
* The corresponding analysis output and report in SageMaker Studio.

In doing so, we first train a [SageMaker XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) model using the UCI Adult dataset, then use the [AWS SDK for Python](https://aws.amazon.com/sdk-for-python/) to launch SageMaker Clarify jobs that analyze an example dataset in CSV format.

## Prerequisites and Data

### Import libraries

In [6]:
import pandas as pd
import numpy as np
import os
import boto3
import time
from datetime import datetime
from sagemaker import get_execution_role, session


### Set configurations

In [7]:
# Initialize sagemaker session
sagemaker_session = session.Session()
region = sagemaker_session.boto_region_name
role = get_execution_role()
bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-sagemaker-clarify-features-to-explain"


### Download data
Data Source: [https://archive.ics.uci.edu/ml/machine-learning-databases/adult/](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/)

$^{[2]}$Dua Dheeru, and Efi Karra Taniskidou. "[UCI Machine Learning Repository](http://archive.ics.uci.edu/ml)". Irvine, CA: University of California, School of Information and Computer Science (2017).

In [8]:
from sagemaker.s3 import S3Downloader

adult_columns = [
    "Age",
    "Workclass",
    "fnlwgt",
    "Education",
    "Education-Num",
    "Marital Status",
    "Occupation",
    "Relationship",
    "Ethnic group",
    "Sex",
    "Capital Gain",
    "Capital Loss",
    "Hours per week",
    "Country",
    "Target",
]

S3Downloader.download(
    s3_uri="s3://{}/{}".format(f"sagemaker-example-files-prod-{region}", "datasets/tabular/uci_adult/adult.data"),
    local_path="./",
    sagemaker_session=sagemaker_session,
)
S3Downloader.download(
    s3_uri="s3://{}/{}".format(f"sagemaker-example-files-prod-{region}", "datasets/tabular/uci_adult/adult.test"),
    local_path="./",
    sagemaker_session=sagemaker_session,
)


### Loading the data: Adult Dataset
This dataset, from the UCI repository of machine learning datasets, contains 14 demographic features for 45,222 rows (32,561 for training and 12,661 for testing). The task is to predict whether a person has a yearly income that is more or less than $50,000.

Here are the features and their possible values:

1. **Age**: continuous.
1. **Workclass**: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
1. **Fnlwgt**: continuous (the number of people the census takers believe that observation represents).
1. **Education**: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
1. **Education-num**: continuous.
1. **Marital-status**: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
1. **Occupation**: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
1. **Relationship**: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
1. **Ethnic group**: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
1. **Sex**: Female, Male.
    * **Note**: this data is extracted from the 1994 Census and enforces a binary option on Sex
1. **Capital-gain**: continuous.
1. **Capital-loss**: continuous.
1. **Hours-per-week**: continuous.
1. **Native-country**: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Next, we specify our binary prediction task: 

15. **Target**: <=50,000, >$50,000.

In [9]:
training_data = pd.read_csv(
    "adult.data", names=adult_columns, sep=r"\s*,\s*", engine="python", na_values="?"
).dropna()

testing_data = pd.read_csv(
    "adult.test", names=adult_columns, sep=r"\s*,\s*", engine="python", na_values="?", skiprows=1
).dropna()

training_data.head()


| | Age | Workclass | fnlwgt | Education | Education-Num | Marital Status | Occupation | Relationship | Ethnic group | Sex | Capital Gain | Capital Loss | Hours per week | Country | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


### Encode and Upload the Dataset
Here we encode the training and test data. Encoding input data is not necessary for SageMaker Clarify, but is necessary for the model.

In [10]:
from sklearn import preprocessing
def number_encode_features(df):
    result = df.copy()
    encoders = {}
    for column in result.columns:
        # `np.object` was deprecated in NumPy 1.20 and later removed; compare against the builtin `object`.
        if result.dtypes[column] == object:
            encoders[column] = preprocessing.LabelEncoder()
            result[column] = encoders[column].fit_transform(result[column].fillna("None"))
    return result, encoders


training_data = pd.concat([training_data["Target"], training_data.drop(["Target"], axis=1)], axis=1)
training_data, _ = number_encode_features(training_data)
training_data.to_csv("train_data.csv", index=False, header=False)

testing_data, _ = number_encode_features(testing_data)
test_features = testing_data.drop(["Target"], axis=1)
test_target = testing_data["Target"]
test_features.to_csv("test_features.csv", index=False, header=False)





A quick note about our encoding: the "Female" Sex value has been encoded as 0 and "Male" as 1.
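That mapping follows from how `LabelEncoder` assigns codes: the distinct values are sorted and numbered from 0. The sketch below mimics that ordering in plain Python so it can be run without scikit-learn; `sorted_label_codes` is a hypothetical helper written for illustration, not a library function.

```python
def sorted_label_codes(values):
    """Map each distinct value to its position in sorted order (hypothetical helper)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


# "Female" sorts before "Male", so it receives the lower code.
sex_codes = sorted_label_codes(["Male", "Female", "Female", "Male"])
print(sex_codes)
```

In the notebook itself, the same information is available from `encoders["Sex"].classes_`, provided the encoders returned by `number_encode_features` are kept instead of being discarded.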

Lastly, let's upload the data to S3.

In [11]:
from sagemaker.s3 import S3Uploader
from sagemaker.inputs import TrainingInput

train_uri = S3Uploader.upload(
    local_path="train_data.csv",
    desired_s3_uri="s3://{}/{}".format(bucket, prefix),
    sagemaker_session=sagemaker_session,
)
train_input = TrainingInput(train_uri, content_type="csv")
test_uri = S3Uploader.upload(
    local_path="test_features.csv",
    desired_s3_uri="s3://{}/{}".format(bucket, prefix),
    sagemaker_session=sagemaker_session,
)


### Train XGBoost Model
#### Train Model
Since our focus is on understanding how to use SageMaker Clarify, we keep the model simple and train a standard XGBoost model with the Amazon SageMaker Python SDK.

In [12]:
from sagemaker.image_uris import retrieve
from sagemaker.estimator import Estimator

# This references the AWS managed XGBoost container
xgboost_image_uri = retrieve(region=region, framework="xgboost", version="1.5-1")

xgb = Estimator(
    xgboost_image_uri,
    role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    disable_profiler=True,
    sagemaker_session=sagemaker_session,
)

xgb.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    objective="binary:logistic",
    num_round=800,
)

xgb.fit({"train": train_input}, logs=False)



2023-06-15 06:51:15 Starting - Starting the training job...
2023-06-15 06:51:32 Starting - Preparing the instances for training.........
2023-06-15 06:52:23 Downloading - Downloading input data....
2023-06-15 06:52:48 Training - Downloading the training image....
2023-06-15 06:53:14 Training - Training image download completed. Training in progress......
2023-06-15 06:53:44 Uploading - Uploading generated training model.
2023-06-15 06:53:56 Completed - Training job completed


#### Create Model
Here we create the SageMaker model.

In [13]:
model_name = "DEMO-clarify-xgboost-model"
model = xgb.create_model(name=model_name)
container_def = model.prepare_container_def()
sagemaker_session.create_model(model_name, role, container_def)


'DEMO-clarify-xgboost-model'

## Amazon SageMaker Clarify
With your model set up, it's time to explore SageMaker Clarify. For a general overview, see [how SageMaker Clarify processing jobs work](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-processing-job-configure-how-it-works.html). This section demonstrates how to use the AWS SDK for Python (Boto3) to launch SageMaker Clarify processing jobs.

### Set Configurations

In [14]:
# Initialise SageMaker boto3 client
sagemaker_client = boto3.Session().client("sagemaker")



### Get Started with a SageMaker Clarify Container
Amazon SageMaker provides prebuilt SageMaker Clarify container images that include the libraries and other dependencies needed to compute bias metrics and feature attributions for explainability. These images are enabled to run SageMaker Clarify processing jobs in your account.

The following code uses the SageMaker Python SDK API to easily retrieve the image URI. If you are unable to use the SageMaker Python SDK, you can find the image URI by referring to [the regional image URI page](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-processing-job-configure-container.html).

In [15]:
clarify_image_uri = retrieve(region=region, framework="clarify", version="1.0")
print(f"Clarify Image URI: {clarify_image_uri}")


Clarify Image URI: 306415355426.dkr.ecr.us-west-2.amazonaws.com/sagemaker-clarify-processing:1.0


In [18]:
def create_processing_job(analysis_config_path, analysis_result_path):
    processing_job_name = "DEMO-clarify-job-{}".format(datetime.now().strftime("%d-%m-%Y-%H-%M-%S"))

    response = sagemaker_client.create_processing_job(
        ProcessingJobName=processing_job_name,
        AppSpecification={"ImageUri": clarify_image_uri},
        ProcessingInputs=[
            {
                "InputName": "analysis_config",
                "S3Input": {
                    "S3DataType": "S3Prefix",
                    "S3InputMode": "File",
                    "S3Uri": analysis_config_path,
                    "LocalPath": "/opt/ml/processing/input/config",
                },
            },
            {
                "InputName": "dataset",
                "S3Input": {
                    "S3DataType": "S3Prefix",
                    "S3InputMode": "File",
                    "S3Uri": train_uri,
                    "LocalPath": "/opt/ml/processing/input/data",
                },
            },
        ],
        ProcessingOutputConfig={
            "Outputs": [
                {
                    "OutputName": "analysis_result",
                    "S3Output": {
                        "S3Uri": analysis_result_path,
                        "LocalPath": "/opt/ml/processing/output",
                        "S3UploadMode": "EndOfJob",
                    },
                }
            ]
        },
        ProcessingResources={
            "ClusterConfig": {
                "InstanceCount": 1,
                "InstanceType": "ml.m5.xlarge",
                "VolumeSizeInGB": 30,
            }
        },
        StoppingCondition={
            "MaxRuntimeInSeconds": 3600,
        },
        RoleArn=role,
    )

    return processing_job_name

# Wait for processing job to complete
def wait_for_job(job_name):
    while (
        sagemaker_client.describe_processing_job(ProcessingJobName=job_name)["ProcessingJobStatus"]
        == "InProgress"
    ):
        print(".", end="")
        time.sleep(60)
    print()
    

Here is a brief explanation of the inputs used above. For detailed documentation, see the [CreateProcessingJob API reference](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateProcessingJob.html):

* `AppSpecification`: the region-specific Clarify image URI retrieved earlier.
* `ProcessingInputs`: a Clarify job requires two `ProcessingInput` parameters.
  * `InputName: analysis_config`: the analysis configuration JSON file for a SageMaker Clarify job must be specified as an Amazon S3 object with the InputName "analysis_config". This demo uses the example analysis config provided with this notebook.
  * `InputName: dataset`: the dataset uploaded earlier, provided here as an Amazon S3 object.
* `ProcessingOutputConfig`: the job also requires an output parameter, the output location as an Amazon S3 prefix with the OutputName "analysis_result". Set `S3UploadMode` to "EndOfJob", because the analysis results are generated at the end of the job. The `analysis_result_path` argument of `create_processing_job` supplies this location.
* `ProcessingResources`: contains the `ClusterConfig`, which specifies the ML compute instance type and the instance count. SHAP analysis is CPU-intensive; to speed it up, use a more powerful instance type or add more instances to enable Spark parallelization. The SageMaker Clarify job does not use GPUs.
* `StoppingCondition`: this example limits the job run to 60 minutes (3600 seconds). You can set the `MaxRuntimeInSeconds` of a SageMaker Clarify job to up to 7 days (604800 seconds). If the job cannot be completed within this time limit, it is force-stopped and no analysis results are provided.
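The `wait_for_job` helper above returns as soon as the status is no longer `InProgress`, so a failed or force-stopped job looks the same as a successful one. The following is a minimal sketch of a stricter wait, not the notebook's own code. It assumes the describe call returns a dict with `ProcessingJobStatus` and, on failure, `FailureReason`; the names `wait_for_terminal_status` and `fake_describe` are hypothetical, and the fake stands in for `sagemaker_client.describe_processing_job` so the example runs without AWS access.

```python
import time


def wait_for_terminal_status(describe_job, job_name, poll_seconds=60):
    """Poll until the job reaches a terminal state; raise unless it completed.

    `describe_job` is any callable shaped like
    `sagemaker_client.describe_processing_job` (assumption: it accepts
    `ProcessingJobName` and returns a dict with "ProcessingJobStatus").
    """
    while True:
        description = describe_job(ProcessingJobName=job_name)
        status = description["ProcessingJobStatus"]
        if status not in ("InProgress", "Stopping"):
            break
        time.sleep(poll_seconds)
    if status != "Completed":
        reason = description.get("FailureReason", "no reason reported")
        raise RuntimeError(f"Processing job {job_name} ended as {status}: {reason}")
    return description


# Stand-in for the real client call (hypothetical, for illustration only).
_fake_statuses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(ProcessingJobName):
    return {
        "ProcessingJobName": ProcessingJobName,
        "ProcessingJobStatus": next(_fake_statuses),
    }


final = wait_for_terminal_status(fake_describe, "DEMO-clarify-job", poll_seconds=0)
print(final["ProcessingJobStatus"])
```

With a real client, the call would be `wait_for_terminal_status(sagemaker_client.describe_processing_job, processing_job_name)`.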

### Explaining Predictions
SageMaker Clarify uses Kernel SHAP to explain the contribution that each input feature makes to the final decision.

#### Configure a SageMaker Clarify Processing Job Container's input and output parameters 

In [20]:
explainability_analysis_config_path = "s3://{}/{}/explainability_analysis_config.json".format(
    bucket, prefix
)
explainability_analysis_result_path = "s3://{}/{}/explainability_analysis_output".format(
    bucket, prefix
)

#### Configure analysis config
For our example use case we will be using the following analysis config. 

If you do not want Kernel SHAP to explain every model feature, set the `features_to_explain` parameter to a list of feature names or indices that identifies the model features to compute explanations for, as seen below:

In [21]:
!echo
!cat explainability_analysis_config.json



{
    "dataset_type": "text/csv",
    "headers": ["Target", "Age", "Workclass", "fnlwgt", "Education", "Education-Num", "Marital Status", "Occupation", "Relationship", "Ethnic group", "Sex", "Capital Gain", "Capital Loss", "Hours per week", "Country"],
    "label": "Target",
    "methods": {
        "shap": {
            "baseline": [
                [38, 2, 189794, 10, 10, 3, 6, 1, 4, 1, 1092, 88, 41, 36]
            ],
            "num_samples": 15,
            "agg_method": "mean_abs",
            "use_logit": false,
            "save_local_shap_values": true,
            "features_to_explain": ["Age", "Education", "Occupation"]
        },
        "report": {
            "name": "report",
            "title": "Analysis Report"
        }
    },
    "predictor": {
        "model_name": "DEMO-clarify-xgboost-model",
        "instance_type": "ml.m5.xlarge",
        "initial_instance_count": 1,
        "accept_type": "text/csv",
        "content_type": "text/csv"
    }
}
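A config like the one above can be checked for internal consistency before it is uploaded. The sketch below is an illustration rather than anything Clarify provides: `check_shap_config` is a hypothetical helper, it only handles an in-place `baseline` list (an S3 URI baseline is skipped), and it assumes `features_to_explain` is given as names rather than indices.

```python
def check_shap_config(config):
    """Return the feature columns after checking the SHAP settings against them."""
    # Feature columns are the headers with the label column removed.
    features = [name for name in config["headers"] if name != config["label"]]
    shap = config["methods"]["shap"]

    # An in-place baseline must hold one value per feature column.
    baseline = shap.get("baseline")
    if isinstance(baseline, list):
        for row in baseline:
            if len(row) != len(features):
                raise ValueError(
                    f"baseline row has {len(row)} values, expected {len(features)}"
                )

    # Every name in features_to_explain must be a feature column.
    unknown = [n for n in shap.get("features_to_explain", []) if n not in features]
    if unknown:
        raise ValueError(f"features_to_explain not in feature columns: {unknown}")
    return features


# Abbreviated copy of the config shown above (only the fields being checked).
config = {
    "headers": ["Target", "Age", "Workclass", "fnlwgt", "Education", "Education-Num",
                "Marital Status", "Occupation", "Relationship", "Ethnic group", "Sex",
                "Capital Gain", "Capital Loss", "Hours per week", "Country"],
    "label": "Target",
    "methods": {
        "shap": {
            "baseline": [[38, 2, 189794, 10, 10, 3, 6, 1, 4, 1, 1092, 88, 41, 36]],
            "features_to_explain": ["Age", "Education", "Occupation"],
        }
    },
}

feature_columns = check_shap_config(config)
print(len(feature_columns))
```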

`explainability_analysis_config.json` contains the configuration values for computing feature attributions with a SageMaker Clarify job:

* `dataset_type` specifies the format of your dataset. This example uses a CSV dataset, so the value is `text/csv`.
* `headers` is the list of column names in the dataset.
* `label` specifies the ground truth label, in this example the "Target" column. The SageMaker Clarify job drops this column and uses the remaining feature columns for explainability analysis.
* `methods` lists the methods and their parameters for the analyses and reports.
  * `shap`: this section holds the parameters for SHAP analysis.
      * `baseline`: the Kernel SHAP algorithm requires a baseline (also known as a background dataset). If none is provided, SageMaker Clarify calculates one automatically using K-means or K-prototypes on the input dataset. The baseline must have the same format as `dataset_type`, and baseline samples must include only features. `baseline` is either an S3 URI of the baseline dataset file or an in-place list of samples. Here we chose the latter and put the mean of the training dataset in the list. For more details on baseline selection, [refer to this documentation](https://docs.aws.amazon.com/en_us/sagemaker/latest/dg/clarify-feature-attribute-shap-baselines.html).
      * `features_to_explain`: a list of names or indices of the model features you would like explained. If not provided, explanations are computed for every feature in the model input.
* `predictor` holds the model configuration. This section is required if the analysis needs predictions from the model.
  * `model_name`: the name of the model to explain, here the XGBoost model trained earlier, `DEMO-clarify-xgboost-model`.
  * `instance_type` and `initial_instance_count` specify the instance type and instance count used to run your model during SageMaker Clarify's processing. The dataset is small, so a single standard instance is enough to run this example.
  * `accept_type` denotes the payload format of the endpoint response, and `content_type` denotes the payload format of the request to the endpoint. For the example model created above, both are `text/csv`.
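The single-sample mean baseline described above can be sketched as a column-wise mean over feature rows. This is a standard-library illustration under stated assumptions: `mean_baseline` is a hypothetical helper, the toy rows are made-up values, and rounding to integers is an assumption made because the baseline in the config holds whole numbers.

```python
from statistics import mean


def mean_baseline(feature_rows):
    """Return a one-sample baseline: the rounded column-wise mean of the rows."""
    columns = zip(*feature_rows)  # transpose rows into columns
    return [[round(mean(column)) for column in columns]]


# Toy rows (hypothetical values; features only, label column already removed).
rows = [
    [39, 5, 13],
    [50, 4, 13],
    [38, 2, 9],
]
baseline = mean_baseline(rows)
print(baseline)
```

With the notebook's encoded DataFrame, a roughly equivalent pandas expression would be `[training_data.drop("Target", axis=1).mean().round().astype(int).tolist()]`.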

In [22]:
# Upload the analysis_config to the concerned S3 path.
S3Uploader.upload(
    "explainability_analysis_config.json", "s3://{}/{}".format(bucket, prefix)
)


's3://sagemaker-us-west-2-678264136642/sagemaker/DEMO-sagemaker-clarify-features-to-explain/explainability_analysis_config.json'

#### Run SageMaker Clarify Processing job

In [26]:
processing_job_name = create_processing_job(
    explainability_analysis_config_path, explainability_analysis_result_path
)
wait_for_job(processing_job_name)


................


#### Viewing the Explainability Report
Let's view the explainability report in Studio under the experiments tab. Note that explanations are only generated for features specified in `features_to_explain` in the analysis config.


<img src="explainability_studio.png">

The Model Insights tab contains direct links to the report and model insights. 

The complete analysis report can also be accessed at the following S3 location.

In [35]:
explainability_analysis_result_path


's3://sagemaker-us-west-2-786499417150/sagemaker/DEMO-sagemaker-clarify-boto3/explainability_analysis_output'
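To browse that location programmatically, the URI first has to be split into a bucket name and a key prefix, which is the form that S3 listing calls such as Boto3's `list_objects_v2` take (`Bucket` and `Prefix`). A small sketch using only the standard library; `split_s3_uri` is a hypothetical helper and the bucket name is a placeholder.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/key/prefix' into (bucket, key_prefix)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder bucket; in the notebook, pass explainability_analysis_result_path.
bucket_name, key_prefix = split_s3_uri(
    "s3://example-bucket/sagemaker/DEMO-sagemaker-clarify-features-to-explain/"
    "explainability_analysis_output"
)
print(bucket_name)
print(key_prefix)
```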