## Bonus Lab - Detect bias using SageMaker Clarify

Amazon SageMaker Clarify provides machine learning developers with greater visibility into their training data and models so they can identify and limit bias and explain predictions. Biases are imbalances in the training data or prediction behavior of the model across different groups, such as age or income bracket. Biases can result from the data or algorithm used to train your model.

In this section, we run a pre-training bias job to examine our training dataset for bias.

Normally you would pick the sensitive groups that might be prone to bias and analyze each of them. In this example, we pick customer gender and analyze how skewed it is.

## Instructions

Start by installing the tools you'll need to detect and address bias.

In [1]:
# cell 1
# install the tools we're going to use to detect and deal with bias

!pip install -U imbalanced-learn awswrangler

Collecting imbalanced-learn==0.7.0
  Downloading imbalanced_learn-0.7.0-py3-none-any.whl.metadata (11 kB)
Collecting awswrangler
  Downloading awswrangler-3.9.1-py3-none-any.whl.metadata (17 kB)
Collecting pyarrow>=8.0.0 (from awswrangler)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Downloading imbalanced_learn-0.7.0-py3-none-any.whl (167 kB)
Downloading awswrangler-3.9.1-py3-none-any.whl (381 kB)
Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)
Installing collected packages: pyarrow, imbalanced-learn, awswrangler
Successfully installed awswrangler-3.9.1 imbalanced-learn-0.7.0 pyarrow-17.0.0


This lab continues from Lab 1, picking up after the training dataset was created and uploaded, so we reuse the same variables that point to the dataset. They are populated with default values. If you didn't change the bucket name or the training dataset name, you don't need to change anything in the cells below.

In [3]:
# cell 2

# pull the prefix and bucket variables from storemagic. running the store -r command will give
# you access to the bucket and prefix variables
%store -r

AttributeError: 'PickleShareDB' object has no attribute 'keys'

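If `%store -r` fails, or you see an error saying it is unable to retrieve the variable `sagemaker_session`, you can ignore it. Cell 5 below creates a new SageMaker session and redefines `bucket`, `prefix`, and `train_data_uri`, so any `NameError` in cells 3 and 4 is resolved once cell 5 has run.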

In [4]:
# cell 3
bucket = sagemaker_session.default_bucket()
prefix = "fraud-detect-demo"

print(f"Bucket is {bucket}")
print(f"Prefix is {prefix}")

NameError: name 'bucket' is not defined

In [8]:
# cell 4

train_data_uri = f"s3://{bucket}/{prefix}/data/train/train.csv"

NameError: name 'bucket' is not defined

In [9]:
# cell 5

import boto3
import sagemaker
import json

# Reference that session
boto_session = boto3.session.Session()
region = boto_session.region_name

# create a sagemaker client
sagemaker_boto_client = boto_session.client("sagemaker")

# then link the two
sagemaker_session = sagemaker.session.Session(
    boto_session=boto_session, sagemaker_client=sagemaker_boto_client
)

bucket = sagemaker_session.default_bucket()
prefix = "fraud-detect-demo"
train_data_uri = f"s3://{bucket}/{prefix}/data/train/train.csv"

# create an s3 client
s3_client = boto3.client("s3", region_name=region)

In [10]:
# cell 6

# replace the role name below if you want to use a different role
sagemaker_execution_role_name = "fraud-detection-workshop-SageMakerExecutionRole-4rmzgJyf2wbK"

# Get the default role that was created for this domain
try:
    sagemaker_role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    sagemaker_role = iam.get_role(RoleName=sagemaker_execution_role_name)["Role"]["Arn"]
    print(f"\n instantiating sagemaker_role with supplied role name : {sagemaker_role}")

# Look up the AWS account ID of the caller
account_id = boto3.client("sts").get_caller_identity()["Account"]

Couldn't call 'get_role' to get Role ARN from role name peteryxu to get Role path.



 instantiating sagemaker_role with supplied role name : arn:aws:iam::975050200450:role/fraud-detection-workshop-SageMakerExecutionRole-4rmzgJyf2wbK


First, pull the training dataset from our S3 bucket using `awswrangler` (AWS SDK for pandas).

In [11]:
# cell 7

import awswrangler as wr

# features selected for training
train_cols = wr.s3.read_csv(train_data_uri).columns.to_list()

# our bias report will be saved at this path
bias_report_1_output_path = f"s3://{bucket}/{prefix}/clarify-output/bias_1"

Create the `SageMakerClarifyProcessor` instance that will run the Clarify job.

In [12]:
# cell 8

clarify_processor = sagemaker.clarify.SageMakerClarifyProcessor(
    role=sagemaker_role,
    instance_count=1,
    instance_type="ml.c4.xlarge",
    sagemaker_session=sagemaker_session,
)

Next, use a `DataConfig` object to configure the input dataset, the location where the output is stored, and the label column to target.

In [13]:
# cell 9

bias_data_config = sagemaker.clarify.DataConfig(
    s3_data_input_path=train_data_uri,
    s3_output_path=bias_report_1_output_path,
    label="fraud",
    headers=train_cols,
    dataset_type="text/csv",
)

Use `BiasConfig` to specify which column contains the facet (the sensitive group, here `customer_gender_female`), which facet values identify the sensitive group (`facet_values_or_threshold`), and which label values count as the desirable outcome (`label_values_or_threshold`).

In [14]:
# cell 10

bias_config = sagemaker.clarify.BiasConfig(
    label_values_or_threshold=[0],
    facet_name="customer_gender_female",
    facet_values_or_threshold=[1],
)
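To make the roles of these settings concrete, here is a minimal sketch on made-up rows (not the lab's data). It assumes `fraud == 0` is the desirable outcome and `customer_gender_female == 1` is the sensitive group, as configured above, and compares the rate of desirable outcomes between the two groups. The difference computed at the end corresponds to what Clarify calls the Difference in Proportions of Labels (DPL). The helper name `positive_rate` is hypothetical.

```python
# Toy rows of (customer_gender_female, fraud); values are invented for illustration.
rows = [
    (0, 0), (0, 0), (0, 0), (0, 1),
    (1, 0), (1, 1),
]


def positive_rate(rows, facet_value, positive_label=0):
    """Share of rows in one facet group whose label is the desirable outcome."""
    group = [label for facet, label in rows if facet == facet_value]
    return sum(1 for label in group if label == positive_label) / len(group)


q_a = positive_rate(rows, facet_value=0)  # group outside the sensitive facet
q_d = positive_rate(rows, facet_value=1)  # sensitive group (facet value 1)
dpl = q_a - q_d

print(q_a, q_d, dpl)  # 0.75 0.5 0.25
```

A value near zero suggests both groups receive the desirable label at similar rates in the data.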

Now run the Clarify job if it hasn't been run already. Once it has run, store the job name so the results can be reused.

In [16]:
# cell 11

if 'clarify_bias_job_1_name' not in locals():

    clarify_processor.run_pre_training_bias(
        data_config=bias_data_config,
        data_bias_config=bias_config)

    clarify_bias_job_1_name = clarify_processor.latest_job.name
    %store clarify_bias_job_1_name

else:
    print(f'Clarify job {clarify_bias_job_1_name} has already run successfully.')


Job Name:  Clarify-Pretraining-Bias-2024-09-08-04-33-38-420
Inputs:  [{'InputName': 'dataset', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-975050200450/fraud-detect-demo/data/train/train.csv', 'LocalPath': '/opt/ml/processing/input/data', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'analysis_config', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-975050200450/fraud-detect-demo/clarify-output/bias_1/analysis_config.json', 'LocalPath': '/opt/ml/processing/input/config', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'analysis_result', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-975050200450/fraud-detect-demo/clarify-output/bias_1', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]
....................



## Result analysis

You can verify the class imbalance that Clarify reports for our training dataset. If you ran the job above, the cell below downloads its output JSON; otherwise it loads a pre-generated analysis file. A dataset with skewed class proportions is said to be imbalanced. Classes that make up a large proportion of the dataset are called majority classes, and those that make up a smaller proportion are minority classes. This is a problem because the model sees mostly majority examples during training and may not learn enough from minority ones.

In this case, `female` is the minority class and is underrepresented in our dataset, which might affect our model's predictions.

In [17]:
# cell 12

if "clarify_bias_job_1_name" in locals():
    s3_client.download_file(
        Bucket=bucket,
        Key=f"{prefix}/clarify-output/bias_1/analysis.json",
        Filename="outputs/bias_1_analysis.json",
    )
    print(f"Downloaded analysis from previous Clarify job: {clarify_bias_job_1_name}")
else:
    print(f"Loading pre-generated analysis file...")

with open("./outputs/bias_1_analysis.json", "r") as f:
    bias_analysis = json.load(f)

results = bias_analysis["pre_training_bias_metrics"]["facets"]["customer_gender_female"][0][
    "metrics"
][1]
print(json.dumps(results, indent=4))

Downloaded analysis from previous Clarify job: Clarify-Pretraining-Bias-2024-09-08-04-33-38-420
{
    "name": "CI",
    "description": "Class Imbalance (CI)",
    "value": 0.408
}
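The cell above picks the CI entry by its position in the `metrics` list. As a more defensive alternative, the sketch below looks a metric up by its `name` field. The helper `find_metric` is hypothetical, and the sample list only mimics the entry structure shown in the output above; the first entry is a placeholder, not real Clarify output.

```python
def find_metric(metrics, name):
    """Return the first metric dict whose "name" matches, or raise KeyError."""
    for metric in metrics:
        if metric.get("name") == name:
            return metric
    raise KeyError(f"metric {name!r} not found")


# Hand-written sample with the same keys as the entries in analysis.json.
sample_metrics = [
    {"name": "EXAMPLE", "description": "Placeholder entry", "value": 0.0},
    {"name": "CI", "description": "Class Imbalance (CI)", "value": 0.408},
]

ci = find_metric(sample_metrics, "CI")
print(ci["value"])  # 0.408
```

In the notebook, the same helper could be applied to `bias_analysis["pre_training_bias_metrics"]["facets"]["customer_gender_female"][0]["metrics"]`.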


## Fix the imbalance/bias
To fix class imbalance, we use a popular technique called SMOTE (Synthetic Minority Oversampling Technique). SMOTE oversamples the minority class by generating synthetic examples from existing minority rows, which balances the skew for customer gender in the training dataset.
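SMOTE does not copy minority rows verbatim. As commonly described, each synthetic row is placed at a random point on the line segment between a minority sample and one of its nearest minority-class neighbors. The sketch below illustrates only that interpolation step on made-up feature values; it is not the `imblearn` implementation, and the function name `interpolate` is hypothetical.

```python
import random


def interpolate(sample, neighbor, rng):
    """Return a synthetic point on the segment between sample and neighbor."""
    lam = rng.random()  # uniform in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]


rng = random.Random(42)
minority_sample = [1.0, 20.0]    # made-up feature values
minority_neighbor = [3.0, 10.0]  # made-up nearest minority-class neighbor

synthetic = interpolate(minority_sample, minority_neighbor, rng)
print(synthetic)
```

Every coordinate of the synthetic point lies between the corresponding coordinates of the two original points.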

In [19]:
# cell 13

import pandas as pd
from imblearn.over_sampling import SMOTE

train = pd.read_csv("./data/train.csv")
gender = train["customer_gender_female"]
gender.value_counts()

customer_gender_female
0    2816
1    1184
Name: count, dtype: int64
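These counts are enough to reproduce the CI value that Clarify reported. A minimal sketch, assuming the standard definition CI = (n_a - n_d) / (n_a + n_d), where n_d is the size of the sensitive group (`customer_gender_female == 1`) and n_a is the size of the other group. The function name `class_imbalance` is hypothetical.

```python
def class_imbalance(n_advantaged: int, n_disadvantaged: int) -> float:
    """CI = (n_a - n_d) / (n_a + n_d); 0 means both groups are the same size."""
    total = n_advantaged + n_disadvantaged
    if total == 0:
        raise ValueError("at least one row is required")
    return (n_advantaged - n_disadvantaged) / total


# Counts from the value_counts() output above.
ci_before = class_imbalance(2816, 1184)
print(round(ci_before, 3))  # 0.408
```

With equal group sizes the numerator is zero, so a perfectly balanced facet gives a CI of 0.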

In [20]:
# cell 14

sm = SMOTE(random_state=42)
train_data_upsampled, gender_res = sm.fit_resample(train, gender)
train_data_upsampled["customer_gender_female"].value_counts()

customer_gender_female
0    2816
1    2816
Name: count, dtype: int64

As you can see, both gender values now have the same number of rows. Next, upload the file to S3 so we can run Clarify again.

In [21]:
# cell 15

train_data_upsampled.to_csv("./data/upsampled_train.csv", index=False)
train_data_upsampled_s3_path = f"s3://{bucket}/{prefix}/data/train/upsampled/train.csv"

s3_client.upload_file(
    Filename="./data/upsampled_train.csv",
    Bucket=bucket,
    Key=f"{prefix}/data/train/upsampled/train.csv",
)
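The notebook writes the same S3 location twice: once as a full `s3://` URI for Clarify and once as separate `Bucket` and `Key` arguments for boto3. As a small sketch of how to derive one form from the other, the hypothetical helper below splits a URI using only the standard library. The bucket name `example-bucket` is a placeholder.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, key = split_s3_uri(
    "s3://example-bucket/fraud-detect-demo/data/train/upsampled/train.csv"
)
print(bucket_name)  # example-bucket
print(key)          # fraud-detect-demo/data/train/upsampled/train.csv
```

Deriving the key from the URI keeps the two forms from drifting apart if the prefix changes.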

## Re-run and see the new bias results

Let's re-run the previous few steps on the upsampled dataset to get the new bias values.

In [22]:
# cell 16

train_cols = wr.s3.read_csv(train_data_upsampled_s3_path).columns.to_list()
bias_report_2_output_path = f"s3://{bucket}/{prefix}/clarify-output/bias_2"

clarify_processor = sagemaker.clarify.SageMakerClarifyProcessor(
    role=sagemaker_role,
    instance_count=1,
    instance_type="ml.c4.xlarge",
    sagemaker_session=sagemaker_session,
)

bias_data_config = sagemaker.clarify.DataConfig(
    s3_data_input_path=train_data_upsampled_s3_path,
    s3_output_path=bias_report_2_output_path,
    label="fraud",
    headers=train_cols,
    dataset_type="text/csv",
)

bias_config = sagemaker.clarify.BiasConfig(
    label_values_or_threshold=[0],
    facet_name="customer_gender_female",
    facet_values_or_threshold=[1],
)

In [23]:
# cell 17

if 'clarify_bias_job_2_name' not in locals():

    clarify_processor.run_pre_training_bias(
        data_config=bias_data_config,
        data_bias_config=bias_config)

    clarify_bias_job_2_name = clarify_processor.latest_job.name
    %store clarify_bias_job_2_name

else:
    print(f'Clarify job {clarify_bias_job_2_name} has already run successfully.')


Job Name:  Clarify-Pretraining-Bias-2024-09-08-04-54-40-853
Inputs:  [{'InputName': 'dataset', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-975050200450/fraud-detect-demo/data/train/upsampled/train.csv', 'LocalPath': '/opt/ml/processing/input/data', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'analysis_config', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-975050200450/fraud-detect-demo/clarify-output/bias_2/analysis_config.json', 'LocalPath': '/opt/ml/processing/input/config', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'analysis_result', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-975050200450/fraud-detect-demo/clarify-output/bias_2', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]
..........



If you run the cell below, you can see that the class imbalance is now reduced to zero. This shows that SMOTE worked and the dataset is balanced with respect to customer gender, at least as measured by CI. In the next section, we will kick off a training job to train an XGBoost model using this training dataset.

In [24]:
# cell 18

if "clarify_bias_job_2_name" in locals():
    s3_client.download_file(
        Bucket=bucket,
        Key=f"{prefix}/clarify-output/bias_2/analysis.json",
        Filename="outputs/bias_2_analysis.json",
    )
    print(f"Downloaded analysis from previous Clarify job: {clarify_bias_job_2_name}")
else:
    print(f"Loading pre-generated analysis file...")

with open("./outputs/bias_2_analysis.json", "r") as f:
    bias_analysis = json.load(f)

results = bias_analysis["pre_training_bias_metrics"]["facets"]["customer_gender_female"][0][
    "metrics"
][1]
print(json.dumps(results, indent=4))

Downloaded analysis from previous Clarify job: Clarify-Pretraining-Bias-2024-09-08-04-54-40-853
{
    "name": "CI",
    "description": "Class Imbalance (CI)",
    "value": 0.0
}


Congratulations! You have finished the bonus lab: Detect bias using SageMaker Clarify.