# Bias and Explainability with Amazon SageMaker Clarify

## Overview
Biases are imbalances in the training data or in the prediction behavior of the model across different groups. Sometimes these biases can harm demographic subgroups, for example groups defined by age or income bracket. Machine learning provides an opportunity to address biases by detecting and measuring them in your data and model.
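
As a rough illustration of what measuring bias in the data can look like, the sketch below computes two of the pre-training metrics that SageMaker Clarify reports: class imbalance (CI) and difference in proportions of labels (DPL). The toy records, the facet values and the helper names are hypothetical and are not part of the dataset or of the Clarify API.

```python
def class_imbalance(n_a, n_d):
    """CI = (n_a - n_d) / (n_a + n_d); 0 means both facets are equally represented."""
    return (n_a - n_d) / (n_a + n_d)


def difference_in_proportions_of_labels(pos_a, n_a, pos_d, n_d):
    """DPL = q_a - q_d, where q is the share of favourable labels within a facet."""
    return pos_a / n_a - pos_d / n_d


# Hypothetical toy records: (facet value, 1 if the outcome is favourable else 0)
records = [
    ("adult", 1), ("adult", 1), ("adult", 0), ("adult", 1),
    ("minor", 0), ("minor", 1),
]

# Count members and favourable outcomes per facet
n_a = sum(1 for facet, _ in records if facet == "adult")
n_d = sum(1 for facet, _ in records if facet == "minor")
pos_a = sum(label for facet, label in records if facet == "adult")
pos_d = sum(label for facet, label in records if facet == "minor")

ci = class_imbalance(n_a, n_d)
dpl = difference_in_proportions_of_labels(pos_a, n_a, pos_d, n_d)
print(f"CI = {ci:.2f}, DPL = {dpl:.2f}")
```

Values close to 0 suggest the two groups are similarly represented and similarly labelled; values further from 0 point to an imbalance worth investigating.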

Amazon SageMaker Clarify provides machine learning developers with greater visibility into their training data and models so they can identify and limit bias and explain predictions.

In this notebook, we go through each stage of the ML lifecycle and show where you can include Clarify.

## Problem Formulation

In this notebook, we want to predict the final grade for students in a maths class, using the popular [Student Performance dataset](https://archive.ics.uci.edu/ml/datasets/Student+Performance), courtesy of UC Irvine.

For this dataset, final grades range from 0 to 20, where 15 to 20 are the most favourable outcomes. This is a multiclass classification problem: we want to predict which grade, from 0 to 20, a given student will get.
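
Bias metrics generally need to know which label values count as favourable. A minimal sketch of that mapping, assuming the 15 to 20 range stated above, could look like this. The constant names, the `is_favourable` helper and the sample grades are hypothetical.

```python
FAVOURABLE_MIN = 15  # from the text: grades 15 to 20 are the most favourable outcomes
MAX_GRADE = 20


def is_favourable(grade):
    """Return 1 if the grade falls in the favourable range, otherwise 0."""
    if not 0 <= grade <= MAX_GRADE:
        raise ValueError(f"grade must be between 0 and {MAX_GRADE}, got {grade}")
    return int(grade >= FAVOURABLE_MIN)


# Hypothetical sample of final grades
grades = [0, 9, 14, 15, 20]
labels = [is_favourable(g) for g in grades]
print(labels)  # [0, 0, 0, 1, 1]
```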

The benefit of using ML to predict this is that we can provide an accurate grade for a student who is unable to attend the final exam due to circumstances outside their control.

#TODO diagram

The notebook will take 90 minutes to execute and will cost approximately $2. TODO
 - Some estimate of both time and money is recommended.
    - List the instance types and other resources that are created.

## Prerequisites
1. This notebook works in the following environments.
   - Notebook Instances: Jupyter
   - Notebook Instances: JupyterLab
   - Studio
1. Which conda kernel is required? TODO
1. This is a standalone notebook and it does not depend on other notebooks.


## Setup 

### Setup Dependencies

First, we're going to import various Python libraries and set up a SageMaker session for various tasks throughout our notebook. Then, we'll create a prefix within the default SageMaker bucket to store our data and reports.

In [None]:
# imports
import sagemaker
import boto3
import pandas as pd

# initialisation
session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()
prefix = "sagemaker/student-data-xgb"
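
To show how the bucket and prefix might be combined into locations for data and reports, here is a small sketch. The `s3_uri` helper, the example bucket name and the `data` and `clarify-report` sub-paths are assumptions made for illustration; in the notebook, the real bucket name comes from `session.default_bucket()`.

```python
def s3_uri(bucket, *parts):
    """Join a bucket name and key parts into an s3:// URI without doubled slashes."""
    key = "/".join(part.strip("/") for part in parts if part)
    return f"s3://{bucket}/{key}"


# Hypothetical values used only for this illustration
example_bucket = "example-bucket"
example_prefix = "sagemaker/student-data-xgb"

train_uri = s3_uri(example_bucket, example_prefix, "data", "train.csv")
report_uri = s3_uri(example_bucket, example_prefix, "clarify-report")
print(train_uri)
print(report_uri)
```

Keeping everything under one prefix makes it easier to find, and later delete, the artifacts this notebook creates.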

### Setup Python Modules
1. Import modules, set options, and activate extensions.

In [None]:
# imports
import sagemaker
import numpy as np
import pandas as pd

# options
pd.options.display.max_columns = 50
pd.options.display.max_rows = 30

# extensions
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload
    
%autoreload 2

## Parameters
1. Set up user-supplied parameters, such as custom bucket names and roles, in a separate cell and call out what the options are.
1. Use defaults, so the notebook will still run end-to-end without any user modification.

For example, the following description and code block prompt the user to select the preferred dataset.

~~~

To select a particular dataset, assign choosen_data_set below to one of 'diabetes', 'california', or 'boston', where each name corresponds to its respective dataset.

'boston' : boston house data
'california' : california house data
'diabetes' : diabetes data

~~~


In [None]:
data_sets = {
    "diabetes": "load_diabetes()",
    "california": "fetch_california_housing()",
    "boston": "load_boston()",
}

# Change choosen_data_set variable to one of the data sets above.
choosen_data_set = "california"
assert choosen_data_set in data_sets.keys()
print("I selected the '{}' dataset!".format(choosen_data_set))


## Data import
1. Look for data that was stored by a previous notebook run, using `%store -r variableName`
1. If that doesn't exist, look in S3 in the user's default bucket
1. If that doesn't exist, download it from the [SageMaker dataset bucket](https://sagemaker-sample-files.s3.amazonaws.com/) 
1. If that doesn't exist, download it from origin
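
The lookup order above can be sketched as a simple fallback chain that tries each source in turn and stops at the first one that succeeds. The `load_with_fallback` helper and the four loader functions are hypothetical stand-ins; real loaders would wrap `%store -r`, an S3 download, and an HTTP download.

```python
def load_with_fallback(loaders):
    """Try each (name, loader) pair in order and return the first successful result."""
    errors = []
    for name, loader in loaders:
        try:
            return name, loader()
        except Exception as exc:  # broad on purpose: any failure means "try the next source"
            errors.append(f"{name}: {exc}")
    raise RuntimeError("No data source succeeded:\n" + "\n".join(errors))


# Hypothetical loaders that simulate the first two sources being unavailable
def from_store():
    raise NameError("X_train is not in the IPython store")


def from_default_bucket():
    raise FileNotFoundError("object not found in the default bucket")


def from_sample_bucket():
    return [[1, 2], [3, 4]]


def from_origin():
    return [[5, 6]]


source, data = load_with_fallback([
    ("store", from_store),
    ("default_bucket", from_default_bucket),
    ("sample_bucket", from_sample_bucket),
    ("origin", from_origin),
])
print(f"Loaded data from: {source}")
```

Ordering the sources from cheapest to most expensive means a re-run of the notebook rarely has to download the dataset again.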

For example, the following code block will pull training and validation data that was created in a previous notebook. This allows the customer to experiment with features, re-run the notebook, and not have it pull the dataset over and over.

In [None]:
# Load relevant dataframes and variables from preprocessing_tabular_data.ipynb required for this notebook
%store -r X_train
%store -r X_test
%store -r X_val
%store -r Y_train
%store -r Y_test
%store -r Y_val
%store -r choosen_data_set

## Procedure or tutorial
1. Break up processes with Markdown blocks to explain what's going on.
1. Make use of visualizations to better demonstrate each step.

## Cleanup
You can keep your endpoint running to continue capturing data. If you do not plan to collect more data or use this endpoint further, you should delete the endpoint to avoid incurring additional charges. Note that deleting your endpoint does not delete the data that was captured during the model invocations. That data persists in Amazon S3 until you delete it yourself.



In [None]:
# `endpoint_name` and `pipeline_model` must already be defined earlier in the notebook.
session.delete_endpoint(endpoint_name)
session.delete_model(pipeline_model.name)

# Clean up S3 model? TODO

## Next steps

AI services and machine learning are helping organisations build data-driven applications that are innovative and highly attuned to their customers’ needs. However, AI applications require crucial customer data to train machine learning models, and delegating application logic to these models can introduce unfairness and bias into an application. In this notebook we reviewed the machine learning techniques and AWS services you can use to understand and reduce these risks.

#TODO

## References
1. Pauline Kelly - [Building AI applications that avoid bias and maintain privacy and fairness](https://anz-resources.awscloud.com/aws-summit-online-anz-2021-data-scientist/building-ai-applications-that-avoid-bias-and-maintain-privacy-and-fairness-1)