<a href="https://colab.research.google.com/github/owenhuanghao/responsible-data-science/blob/main/HW1_202_template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
!pip install fairlearn

In [None]:
#@markdown Load modules
import numpy as np
from IPython.display import display, Markdown, Latex
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

from fairlearn.postprocessing import ThresholdOptimizer
from fairlearn.preprocessing import CorrelationRemover
from fairlearn.adversarial import AdversarialFairnessClassifier
from fairlearn.metrics import MetricFrame
import fairlearn.datasets as fdata
from fairlearn.metrics import (demographic_parity_difference, demographic_parity_ratio, 
                               selection_rate_difference, false_negative_rate_difference, 
                               false_positive_rate_difference, equalized_odds_ratio,
                               false_negative_rate, false_positive_rate)

# Load and preprocess the data

In [None]:
#@markdown Load and read about the dataset.
# get dataset from fairlearn and show its description
dataset = fdata.fetch_diabetes_hospital()

display(Markdown(dataset.DESCR))

# save dataframe and features
x_raw = dataset.data
# y_raw = np.array(dataset.target)
feature_names = dataset.feature_names

The "Diabetes 130-Hospitals" dataset represents 10 years of clinical care at 130 U.S. hospitals and delivery networks, collected from 1999 to 2008. Each record represents the hospital admission record for a patient diagnosed with diabetes whose stay lasted between one and fourteen days. The features describing each encounter include demographics, diagnoses, diabetic medications, number of visits in the year preceding the encounter, and payer information, as well as whether the patient was readmitted after release, and whether the readmission occurred within 30 days of the release.

The original "Diabetes 130-Hospitals" dataset was collected by Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore in 2014.

This version of the dataset was derived by the Fairlearn team for the SciPy 2021 tutorial "Fairness in AI Systems: From social context to practice using Fairlearn". In this version, the target variable "readmitted" is binarized into whether the patient was re-admitted within thirty days. The full dataset pre-processing script can be found on GitHub: https://github.com/fairlearn/talks/blob/main/2021_scipy_tutorial/preprocess.py

Downloaded from openml.org.

You can read more about the dataset [here](https://fairlearn.org/main/user_guide/datasets/diabetes_hospital_data.html). In this description, we see that two features, `readmitted` and `readmit_binary`, are other representations of the same outcome, so we drop them from the set of predictors.

In [None]:
#@markdown Downsample to 10% of the rows to keep runtimes reasonable
x_raw = x_raw.sample(frac=0.1, random_state=123)

In [None]:
y_raw = x_raw['readmit_binary']
x_raw = x_raw.drop(columns=['readmitted', 'readmit_binary'])
feature_names = feature_names[:-2]

In [None]:
#@markdown Look at the first few rows of the data.
x_raw.head()

Unnamed: 0,race,gender,age,discharge_disposition_id,admission_source_id,time_in_hospital,medical_specialty,num_lab_procedures,num_procedures,num_medications,...,max_glu_serum,A1Cresult,insulin,change,diabetesMed,medicare,medicaid,had_emergency,had_inpatient_days,had_outpatient_days
0,Caucasian,Female,30 years or younger,Other,Referral,1.0,Other,41.0,0.0,1.0,...,,,No,No,No,False,False,False,False,False
1,Caucasian,Female,30 years or younger,Discharged to Home,Emergency,3.0,Missing,59.0,0.0,18.0,...,,,Up,Ch,Yes,False,False,False,False,False
2,AfricanAmerican,Female,30 years or younger,Discharged to Home,Emergency,2.0,Missing,11.0,5.0,13.0,...,,,No,No,Yes,False,False,False,True,True
3,Caucasian,Male,30-60 years,Discharged to Home,Emergency,2.0,Missing,44.0,1.0,16.0,...,,,Up,Ch,Yes,False,False,False,False,False
4,Caucasian,Male,30-60 years,Discharged to Home,Emergency,1.0,Missing,51.0,0.0,8.0,...,,,Steady,Ch,Yes,False,False,False,False,False


## Data inspection

In [None]:
#@markdown drop the rows with 'Unknown/Invalid' values for gender

# drop these 3 rows
print(x_raw.shape)
rows_to_keep = x_raw.gender != 'Unknown/Invalid'
x_raw = x_raw[rows_to_keep]
y_raw = y_raw[rows_to_keep]
print(x_raw.shape)

3

In [None]:
#@markdown
unique_feature_values = x_raw.apply(np.unique, axis=0)
unique_feature_values

race                        [AfricanAmerican, Asian, Caucasian, Hispanic, ...
gender                                                         [Female, Male]
age                         [30 years or younger, 30-60 years, Over 60 years]
discharge_disposition_id                          [Discharged to Home, Other]
admission_source_id                              [Emergency, Other, Referral]
time_in_hospital            [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, ...
medical_specialty           [Cardiology, Emergency/Trauma, Family/GeneralP...
num_lab_procedures          [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, ...
num_procedures                            [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
num_medications             [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, ...
primary_diagnosis           [Diabetes, Genitourinary Issues, Musculoskelet...
number_diagnoses            [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, ...
max_glu_serum                                        [>200, >300

In [None]:
#@markdown
binary_features = unique_feature_values.index[[len(x) == 2 for x in unique_feature_values]].values
print(f'Binary features: {binary_features}')
categorical_features = unique_feature_values.index[[len(x) > 2 and isinstance(x[0], str) for x in unique_feature_values]].values
print(f'Categorical features: {categorical_features}')

Binary features: ['gender' 'discharge_disposition_id' 'change' 'diabetesMed' 'medicare'
 'medicaid' 'had_emergency' 'had_inpatient_days' 'had_outpatient_days']
Categorical features: ['race' 'age' 'admission_source_id' 'medical_specialty'
 'primary_diagnosis' 'max_glu_serum' 'A1Cresult' 'insulin']


In [None]:
#@markdown standardize data types 
for col_name in feature_names:
    if col_name in categorical_features:
        x_raw[col_name] = x_raw[col_name].astype('category')
    elif col_name in binary_features:  # redundant for clarity
        # turn into int column
        integer_col = (x_raw[col_name] == unique_feature_values[col_name][0]).astype(int)
        new_name = f'{col_name}_{unique_feature_values[col_name][0]}'
        x_raw[new_name] = integer_col
        x_raw.drop(columns=[col_name], inplace=True) 

In [None]:
%%capture
#@markdown
x_raw.apply(np.unique, axis=0)

In [None]:
x_raw.dtypes

race                                           category
age                                            category
admission_source_id                            category
time_in_hospital                                float64
medical_specialty                              category
num_lab_procedures                              float64
num_procedures                                  float64
num_medications                                 float64
primary_diagnosis                              category
number_diagnoses                                float64
max_glu_serum                                  category
A1Cresult                                      category
insulin                                        category
gender_Female                                     int64
discharge_disposition_id_Discharged to Home       int64
change_Ch                                         int64
diabetesMed_No                                    int64
medicare_False                                  

In [None]:
#@markdown One-hot encode categorical features
x_numeric = pd.get_dummies(x_raw)
display(x_numeric.head())

# get numeric and one-hot column names
# (count features are float64; binary indicators and one-hot dummies are not)
numeric_cols = x_numeric.columns[x_numeric.dtypes == 'float64'].values
one_hot_cols = x_numeric.columns[x_numeric.dtypes != 'float64'].values

Unnamed: 0,time_in_hospital,num_lab_procedures,num_procedures,num_medications,number_diagnoses,gender_Female,discharge_disposition_id_Discharged to Home,change_Ch,diabetesMed_No,medicare_False,...,max_glu_serum_None,max_glu_serum_Norm,A1Cresult_>7,A1Cresult_>8,A1Cresult_None,A1Cresult_Norm,insulin_Down,insulin_No,insulin_Steady,insulin_Up
0,1.0,41.0,0.0,1.0,1.0,1,0,0,1,1,...,1,0,0,0,1,0,0,1,0,0
1,3.0,59.0,0.0,18.0,9.0,1,1,1,0,1,...,1,0,0,0,1,0,0,0,0,1
2,2.0,11.0,5.0,13.0,6.0,1,1,0,0,1,...,1,0,0,0,1,0,0,1,0,0
3,2.0,44.0,1.0,16.0,7.0,0,1,1,0,1,...,1,0,0,0,1,0,0,0,0,1
4,1.0,51.0,0.0,8.0,5.0,0,1,1,0,1,...,1,0,0,0,1,0,0,0,1,0


# TODO: Split the data into train and test sets (80% train / 20% test)
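
As a reference point for the split, here is a minimal sketch on toy data. The names `X_toy` and `y_toy` are hypothetical stand-ins for the notebook's encoded features and target, and both `random_state=123` and the `stratify` argument are assumptions rather than requirements of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the encoded feature table and the binary target (hypothetical data).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y_toy = pd.Series(rng.integers(0, 2, size=100), name="label")

# test_size=0.2 yields the 80/20 split; stratify keeps the class balance
# similar in both parts (an optional choice, assumed here).
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123, stratify=y_toy
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible across runs, which helps when comparing models later in the assignment.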


In [None]:
# write your code here 

X_train, X_test, y_train, y_test = 

# Problem 2, Part (a) 
### **Train a baseline Random Forest (RF) (sklearn RandomForestClassifier) model and report metrics**

### Train a random forest model - Baseline
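
The fit-and-predict pattern for a baseline can be sketched as follows. This uses synthetic data from `make_classification` rather than the hospital dataset, and the variable names and `random_state=0` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the hospital dataset).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# A single-tree forest, matching the n_estimators = 1 baseline setting.
model = RandomForestClassifier(n_estimators=1, random_state=0)
model.fit(X_tr, y_tr)

# Predict on the held-out split and compute overall metrics.
y_pred = model.predict(X_te)
scores = {
    "accuracy": accuracy_score(y_te, y_pred),
    "precision": precision_score(y_te, y_pred, zero_division=0),
    "recall": recall_score(y_te, y_pred, zero_division=0),
}
print(scores)
```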

In [None]:
# write your code here 
# initialize a model with RandomForestClassifier 
n_estimators = 1

# train the model with the training data split



### Report metrics

Calculate metrics with Fairlearn MetricFrame


In [None]:
# write your code here 

# get model's prediction for the test set
y_pred_baseline = 

# use MetricFrame to get the results 

metric_dict = {}
sample_params = {}

metric_frame = MetricFrame(
    
)


Print Results

# Problem 2, Part (b)

Train another classifier with the following hyperparameter values:

In [None]:
# write your code here 
n_estimators = 1000
max_depth = 10

Calculate metrics with Fairlearn MetricFrame and print the results.

# Problem 2, Part (c) 
### **Fit Fairlearn Adversarial Debiaser**

Experiment with the alpha parameter - which value of alpha produces the fairest and most accurate classifier? Does such a value exist?

In [None]:
#@markdown Fit the AdversarialFairnessClassifier here.
#@markdown Use these hyperparameters, while varying the `alpha` parameter:
#@markdown - backend='tensorflow',
#@markdown - predictor_model=[128,64,32,16,8],
#@markdown - adversary_model=[32,16,8],
#@markdown - learning_rate=0.001,
#@markdown - epochs=3,
#@markdown - batch_size=16,
#@markdown - constraints='demographic_parity',
#@markdown - random_state=seed,
#@markdown - shuffle=True

AdversarialFairnessClassifier(backend='tensorflow',
predictor_model=[128,64,32,16,8],
adversary_model=[32,16,8],
learning_rate=0.001,
epochs=3,
batch_size=16,
constraints='demographic_parity',
random_state=seed,
shuffle=True,
alpha = )
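
One way to organize the experiment is to collect the metrics for each `alpha` in a single table. The sketch below shows only that bookkeeping: `sweep_alpha` and `placeholder_evaluate` are hypothetical helpers, the list of `alpha` values is an assumption, and the placeholder returns NaN so that no results are invented.

```python
import pandas as pd


def sweep_alpha(alphas, evaluate):
    """Call evaluate(alpha) for each alpha and collect the returned metrics.

    `evaluate` is a hypothetical callable returning a dict that maps
    metric names to floats.
    """
    rows = [{"alpha": alpha, **evaluate(alpha)} for alpha in alphas]
    return pd.DataFrame(rows).set_index("alpha")


def placeholder_evaluate(alpha):
    # Placeholder: a real version would fit the classifier with this alpha
    # and score its predictions on the test split.
    return {
        "accuracy": float("nan"),
        "demographic_parity_difference": float("nan"),
    }


results = sweep_alpha([0.0, 0.25, 0.5, 0.75, 1.0], placeholder_evaluate)
print(results)
```

With a real `evaluate` function in place, `results.plot(marker="o")` would draw one line per metric against `alpha`, which makes any accuracy and fairness trade-off visible at a glance.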

In [None]:
#@markdown ### Plot all our metrics as line plots while varying alpha

# Problem 2, Part (d) 
### **Threshold Optimizer Post-processing intervention**



In [None]:
#@markdown Fit the ThresholdOptimizer model here