# Detect data bias with Amazon SageMaker Clarify


## Amazon Science: _[How Clarify helps machine learning developers detect unintended bias](https://www.amazon.science/latest-news/how-clarify-helps-machine-learning-developers-detect-unintended-bias)_ 

[<img src="img/amazon_science_clarify.png"  width="100%" align="left">](https://www.amazon.science/latest-news/how-clarify-helps-machine-learning-developers-detect-unintended-bias)

# Terminology
https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-detect-data-bias.html

* **Bias**: 
An imbalance in the training data or the prediction behavior of the model across different groups, such as age or income bracket. Biases can result from the data or algorithm used to train your model. For instance, if an ML model is trained primarily on data from middle-aged individuals, it may be less accurate when making predictions involving younger and older people.

* **Bias metric**: 
A function that returns numerical values indicating the level of a potential bias.

* **Bias report**:
A collection of bias metrics for a given dataset, or a combination of a dataset and a model.

* **Label**:
The feature that is the target for training a machine learning model. Also referred to as the observed label or observed outcome.

* **Positive label values**:
Label values that are favorable to a demographic group observed in a sample. In other words, they designate a sample as having a positive result.

* **Negative label values**:
Label values that are unfavorable to a demographic group observed in a sample. In other words, they designate a sample as having a negative result.

* **Facet**:
A column or feature that contains the attributes with respect to which bias is measured.

* **Facet value**:
The feature values of attributes that bias might favor or disfavor.

# Pretraining Bias Metrics
https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html

* **Class Imbalance (CI)**:
Measures the imbalance in the number of members between different facet values.

* **Difference in Proportions of Labels (DPL)**:
Measures the imbalance of positive outcomes between different facet values.

* **Kullback-Leibler Divergence (KL)**:
Measures how much the outcome distributions of different facets diverge from each other entropically.

* **Jensen-Shannon Divergence (JS)**:
Measures how much the outcome distributions of different facets diverge from each other entropically. Unlike KL, it is symmetric and bounded.

* **Lp-norm (LP)**:
Measures a p-norm difference between distinct demographic distributions of the outcomes associated with different facets in a dataset.

* **Total Variation Distance (TVD)**:
Measures half of the L1-norm difference between distinct demographic distributions of the outcomes associated with different facets in a dataset.

* **Kolmogorov-Smirnov (KS)**:
Measures maximum divergence between outcomes in distributions for different facets in a dataset.

* **Conditional Demographic Disparity (CDD)**:
Measures the disparity of outcomes between different facets as a whole, but also by subgroups.
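
To make a few of the metrics above concrete, here is a minimal sketch that computes CI, DPL, TVD, KL, and JS by hand for a made-up binary-label dataset with two facet values, `a` and `d`. The toy data and all variable names are assumptions for illustration, and the formulas follow the definitions in the linked documentation as I understand them; SageMaker Clarify's own implementation may differ in details.

```python
import math

# Toy (facet, label) pairs, made up for illustration; label 1 is the positive outcome.
samples = [("a", 1)] * 4 + [("a", 0)] * 2 + [("d", 1)] * 1 + [("d", 0)] * 3

n_a = sum(1 for facet, _ in samples if facet == "a")
n_d = sum(1 for facet, _ in samples if facet == "d")

# Class Imbalance: normalized difference in the number of members per facet value.
ci = (n_a - n_d) / (n_a + n_d)

# Proportion of positive labels within each facet value.
q_a = sum(label for facet, label in samples if facet == "a") / n_a
q_d = sum(label for facet, label in samples if facet == "d") / n_d

# Difference in Proportions of Labels.
dpl = q_a - q_d

# Label distributions per facet value, ordered as [P(label=0), P(label=1)].
p_a = [1 - q_a, q_a]
p_d = [1 - q_d, q_d]

# Total Variation Distance: half of the L1 distance between the two distributions.
tvd = 0.5 * sum(abs(x - y) for x, y in zip(p_a, p_d))


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is non-zero wherever p is non-zero."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


kl = kl_divergence(p_a, p_d)

# Jensen-Shannon: average KL divergence of each distribution from their midpoint.
p_mid = [(x + y) / 2 for x, y in zip(p_a, p_d)]
js = 0.5 * (kl_divergence(p_a, p_mid) + kl_divergence(p_d, p_mid))

print(f"CI={ci:.3f} DPL={dpl:.3f} TVD={tvd:.3f} KL={kl:.3f} JS={js:.3f}")
```

With a binary label, TVD reduces to the absolute value of DPL, which is a quick sanity check on the toy numbers.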

In [None]:
import boto3
import sagemaker
import pandas as pd
import numpy as np

sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name="sagemaker", region_name=region)

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'

# Analyze dataset
Load each product category into a pandas DataFrame, then concatenate them into a single DataFrame.

In [None]:
import csv

df_giftcards = pd.read_csv(
    "./data-clarify/amazon_reviews_us_Gift_Card_v1_00.tsv.gz",
    delimiter="\t",
    quoting=csv.QUOTE_NONE,
    compression="gzip",
)

df_software = pd.read_csv(
    "./data-clarify/amazon_reviews_us_Digital_Software_v1_00.tsv.gz",
    delimiter="\t",
    quoting=csv.QUOTE_NONE,
    compression="gzip",
)

df_videogames = pd.read_csv(
    "./data-clarify/amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz",
    delimiter="\t",
    quoting=csv.QUOTE_NONE,
    compression="gzip",
)

df = pd.concat([df_giftcards, df_software, df_videogames], ignore_index=True, sort=False)
df.head()

In [None]:
import seaborn as sns

sns.countplot(data=df, x="star_rating", hue="product_category")

### Upload the data

In [None]:
!mkdir -p ./transformed/

path = "./amazon_reviews_us_giftcards_software_videogames.csv"
df.to_csv(path, index=False, header=True)

data_s3_uri = sess.upload_data(bucket=bucket, key_prefix="bias/transformed", path=path)
data_s3_uri

# Analyze bias

In [None]:
from sagemaker import clarify

bias_s3_prefix = "bias/generated_bias_report"
bias_report_output_path = "s3://{}/{}/data".format(bucket, bias_s3_prefix)

data_config = clarify.DataConfig(
    s3_data_input_path=data_s3_uri,
    s3_output_path=bias_report_output_path,
    label="star_rating",
    headers=df.columns.to_list(),
    dataset_type="text/csv",
)

### Set up `BiasConfig`
SageMaker Clarify also needs the sensitive columns (`facets`) and the desirable outcomes (`label_values_or_threshold`).

We specify this information in the `BiasConfig` API. Here, the positive outcome is `star_rating==5` or `star_rating==4`, and `product_category` is the facet that we analyze in this run.
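
As a rough illustration of what treating `[5, 4]` as the positive label values means, the sketch below marks each row of a made-up DataFrame as positive or not and compares the positive rate per facet value. The toy rows and the `is_positive` column are assumptions for illustration only; this is not how SageMaker Clarify is implemented internally.

```python
import pandas as pd

# Toy reviews, made up for illustration.
toy = pd.DataFrame(
    {
        "product_category": [
            "Gift Card",
            "Gift Card",
            "Digital_Software",
            "Digital_Software",
            "Digital_Video_Games",
        ],
        "star_rating": [5, 4, 2, 5, 1],
    }
)

# Ratings of 5 or 4 count as the positive (favorable) outcome.
positive_label_values = [5, 4]
toy["is_positive"] = toy["star_rating"].isin(positive_label_values)

# Share of positive outcomes for each facet value.
positive_rate = toy.groupby("product_category")["is_positive"].mean()
print(positive_rate)
```

Large gaps between these per-category rates are the kind of imbalance that metrics such as DPL are meant to surface.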

In [None]:
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[5, 4], facet_name="product_category", group_name="product_category"
)

### Set up SageMaker Clarify Processing Job

In [None]:
processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.m5.xlarge", sagemaker_session=sess
)

### Run Processing Job

In [None]:
processor.run_pre_training_bias(
    data_config=data_config, data_bias_config=bias_config, methods="all", wait=False, logs=False
)

In [None]:
bias_processing_job_name = processor.latest_job.job_name
print(bias_processing_job_name)

In [None]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/{}">Processing Job</a></b>'.format(
            region, bias_processing_job_name
        )
    )
)

In [None]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(
            region, bias_processing_job_name
        )
    )
)

In [None]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}?region={}&prefix={}/">S3 Output Data</a> After The Processing Job Has Completed</b>'.format(
            bucket, region, bias_s3_prefix
        )
    )
)

In [None]:
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(
    processing_job_name=bias_processing_job_name, sagemaker_session=sess
)

### _This cell will take approximately 5-10 minutes to run._

In [None]:
%%time

running_processor.wait(logs=False)

### View bias report
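
Besides `report.html`, the job output is expected to include an `analysis.json` file with the raw metric values. The sketch below flattens such a file into rows that are easy to print or load into a DataFrame. The nested layout used here (`pre_training_bias_metrics` → `facets` → `metrics`) is an assumption based on the Clarify documentation, the helper `flatten_pretraining_metrics` is hypothetical, and the sample values are made up; check the keys against your own downloaded file.

```python
def flatten_pretraining_metrics(analysis):
    """Return (facet, facet_value, metric_name, metric_value) tuples.

    Assumes the layout described above; missing keys yield no rows.
    """
    rows = []
    facets = analysis.get("pre_training_bias_metrics", {}).get("facets", {})
    for facet_name, entries in facets.items():
        for entry in entries:
            for metric in entry.get("metrics", []):
                rows.append(
                    (
                        facet_name,
                        entry.get("value_or_threshold"),
                        metric.get("name"),
                        metric.get("value"),
                    )
                )
    return rows


# Made-up excerpt for illustration. For a real report, load the downloaded
# file instead, e.g. with json.load(open(".../analysis.json")).
sample_analysis = {
    "pre_training_bias_metrics": {
        "label": "star_rating",
        "facets": {
            "product_category": [
                {
                    "value_or_threshold": "Gift Card",
                    "metrics": [
                        {"name": "CI", "value": 0.5},
                        {"name": "DPL", "value": -0.1},
                    ],
                }
            ]
        },
    }
}

rows = flatten_pretraining_metrics(sample_analysis)
for row in rows:
    print(row)
```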

In [None]:
!aws s3 ls $bias_report_output_path/

In [None]:
!aws s3 cp --recursive $bias_report_output_path ./generated_bias_report/data/

In [None]:
from IPython.display import display, HTML

display(
    HTML('<b>Review <a target="_blank" href="./generated_bias_report/data/report.html">Unbalanced Bias Report</a></b>')
)

# Balance the dataset by `product_category` and `star_rating`
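
The idea in this section is to downsample every (`product_category`, `star_rating`) group to the size of the smallest group. As a small, self-contained sketch of that idea, the example below balances a made-up DataFrame. It uses `DataFrameGroupBy.sample` (available in pandas 1.1 and later) rather than the `apply`-based approach in the notebook cell, and the toy data and the fixed `random_state` are assumptions for illustration.

```python
import pandas as pd

# Toy data, made up for illustration: group sizes are 3, 2, 1, and 2.
toy = pd.DataFrame(
    {
        "product_category": ["A"] * 5 + ["B"] * 3,
        "star_rating": [5, 5, 5, 1, 1, 5, 1, 1],
    }
)

groups = toy.groupby(["product_category", "star_rating"])

# Size of the smallest (product_category, star_rating) group.
min_group_size = int(groups.size().min())

# Draw the same number of rows from every group; random_state makes it repeatable.
balanced = groups.sample(n=min_group_size, random_state=0)

balanced_sizes = balanced.groupby(["product_category", "star_rating"]).size()
print(balanced_sizes)
```

After this step every group has the same number of rows, so each product category also ends up with the same total count.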

In [None]:
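df_group_by = df.groupby(["product_category", "star_rating"])
# Downsample every (product_category, star_rating) group to the size of the smallest group.
min_group_size = df_group_by.size().min()
df_balanced_data = df_group_by.apply(lambda x: x.sample(min_group_size).reset_index(drop=True))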

In [None]:
import seaborn as sns

sns.countplot(data=df_balanced_data, x="star_rating", hue="product_category")

# Analyze bias on balanced dataset with SageMaker Clarify

In [None]:
path_balanced = "./amazon_reviews_us_giftcards_software_videogames_balanced.csv"
df_balanced_data.to_csv(path_balanced, index=False, header=True)

balanced_data_s3_uri = sess.upload_data(bucket=bucket, key_prefix="bias/data_balanced", path=path_balanced)
balanced_data_s3_uri

In [None]:
from sagemaker import clarify

bias_s3_prefix = "bias/generated_bias_report"
bias_report_balanced_output_path = "s3://{}/{}/data_balanced".format(bucket, bias_s3_prefix)

balanced_data_config = clarify.DataConfig(
    s3_data_input_path=balanced_data_s3_uri,
    s3_output_path=bias_report_balanced_output_path,
    label="star_rating",
    headers=df_balanced_data.columns.to_list(),
    dataset_type="text/csv",
)

### Set up `BiasConfig`
SageMaker Clarify also needs the sensitive columns (`facets`) and the desirable outcomes (`label_values_or_threshold`).

We specify this information in the `BiasConfig` API. Here, the positive outcome is `star_rating==5` or `star_rating==4`, and `product_category` is the facet that we analyze in this run.

In [None]:
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[5, 4], facet_name="product_category", group_name="product_category"
)

### Set up SageMaker Clarify Processing Job

In [None]:
processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.m5.xlarge", sagemaker_session=sess
)

In [None]:
processor.run_pre_training_bias(
    data_config=balanced_data_config, data_bias_config=bias_config, methods="all", wait=False, logs=False
)

In [None]:
balanced_bias_processing_job_name = processor.latest_job.job_name
print(balanced_bias_processing_job_name)

In [None]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/{}">Processing Job</a></b>'.format(
            region, balanced_bias_processing_job_name
        )
    )
)

In [None]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(
            region, balanced_bias_processing_job_name
        )
    )
)

In [None]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}?region={}&prefix={}/">S3 Output Data</a> After The Processing Job Has Completed</b>'.format(
            bucket, region, bias_s3_prefix
        )
    )
)

In [None]:
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(
    processing_job_name=balanced_bias_processing_job_name, sagemaker_session=sess
)

### _This cell will take approximately 5-10 minutes to run._

In [None]:
%%time

running_processor.wait(logs=False)

### Analyze balanced bias report
Note that the class imbalance metric is equal across all product categories for the target label.

Download the generated bias report from S3.

In [None]:
!aws s3 ls $bias_report_balanced_output_path/

In [None]:
!aws s3 cp --recursive $bias_report_balanced_output_path ./generated_bias_report/data_balanced/

In [None]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="./generated_bias_report/data_balanced/report.html">Balanced Bias Report</a></b>'
    )
)

# Release Resources

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>