# Data Science Ethics Checklist - University of Buckingham MSc

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

> The goals of this notebook exercise are (1) to practice integrating the `deon` checklist into your code, and (2) to learn how to use a few basic data science tools in Python.

This notebook is for the Eviction Data Case Study exercise in the "Actionable Ethics for Data Scientists" workshop for the data science MSc at the University of Buckingham.

Instructions:

- We'll walk through the notebook as a group, and break for independent work where there is an <span style="color:green">***italicized, bolded heading for "ACTIVITY" or "DISCUSSION"***</span>.

- If you need help debugging during any of the exercises, post in the chat or send a direct message to one of the DrivenData team members. **We encourage you to work together!**

- There is a more comprehensive version of the case study notebook here: [https://github.com/drivendataorg/msc-buckingham-data-ethics/blob/master/notebooks/eviction-data-case-study-reference.ipynb](https://github.com/drivendataorg/msc-buckingham-data-ethics/blob/master/notebooks/eviction-data-case-study-reference.ipynb). You can refer to this if you are stumped during any of the coding exercises, but we strongly encourage solving problems on your own first!

Notebook outline:

- Background
- Set up Python
- Load & explore the data
- Train a model
- Walk through `deon` checklist
    - Activities & discussion

To easily jump between notebook sections in Google Colab, open the outline sidebar by clicking the three horizontal lines in the side menu.

*** 

# Background

Over the past five decades in the US, [housing costs have risen faster than incomes](http://www.jchs.harvard.edu/state-nations-housing-2018), low-cost housing has been disappearing from the market, and racial disparities in homeownership rates have deepened. This has put many in a perilous situation. As the [Eviction Lab](https://evictionlab.org/why-eviction-matters/#affordable-housing-crisis) explains:

> Today, most poor renting families spend at least half of their income on housing costs, with one in four of those families spending over 70 percent of their income just on rent and utilities. Only one in four families who qualify for affordable housing programs get any kind of help. Under those conditions, it has become harder for low-income families to keep up with rent and utility costs, and a growing number are living one misstep or emergency away from eviction.


#### Objective

A non-profit dedicated to helping people at risk of eviction in California has asked us to build a model that estimates the number of eviction cases by geography, based on socioeconomic data. They would like to use these estimates to help them prioritize where to commit funding and resources.

We will be using a subset of the eviction dataset published by the [Eviction Lab](https://evictionlab.org/) at Princeton University. The subset contains census-tract-level aggregates for tracts in the state of California only.

*FYI:* [Census tracts](https://www.census.gov/programs-surveys/geography/about/glossary.html#:~:text=Census%20tracts%20generally%20have%20a,on%20the%20density%20of%20settlement.) are small, relatively permanent geographic areas used in the U.S. census. A tract generally has a population between 1,200 and 8,000.

*** 
# Set up Python

Install and import the necessary Python packages.

In [None]:
# Run this cell if you are working in Google Colab
!pip install wget

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
import seaborn as sns
import wget

%matplotlib inline
pd.set_option("display.max_columns", 30)

# Load the Data

In [None]:
# load the data from where we have it saved online
DATA_URL = "https://drivendata-public-assets.s3.amazonaws.com/odsc-west-2019/california-tracts.csv"
DATA_PATH = Path("../data/raw/california-tracts.csv")

if not DATA_PATH.exists():
    DATA_PATH.parent.mkdir(exist_ok=True, parents=True)
    # Download data
    wget.download(url=DATA_URL, out=str(DATA_PATH))

#### Exploratory data analysis

Let's do some basic exploration of the data.

In [None]:
# read in the data using pandas
df = pd.read_csv(DATA_PATH)
print("Data shape:", df.shape)
print("\nData types:\n", df.dtypes)
df.head()

In [None]:
# count = non-NaN observations; size = all observations
df.groupby("year").agg(
    count=("eviction-rate", "count"), size=("eviction-rate", "size")
).transpose()
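To see the difference between `count` and `size` on a small scale, here is a minimal sketch using a made-up four-row frame (the values are illustrative and not taken from the eviction data). The gap between the two is the number of missing values in each year.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the eviction data: one missing eviction-rate in 2001
toy = pd.DataFrame(
    {
        "year": [2000, 2000, 2001, 2001],
        "eviction-rate": [1.2, 0.8, np.nan, 2.5],
    }
)

summary = toy.groupby("year").agg(
    count=("eviction-rate", "count"),  # skips NaN values
    size=("eviction-rate", "size"),  # counts every row, NaN or not
)
# Rows that exist but have no recorded eviction rate
summary["n_missing"] = summary["size"] - summary["count"]
print(summary)
```

Applying the same subtraction to the real `df` would show which years have the most missing eviction rates.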

In [None]:
# Data Dictionary
DATA_DICT_URL = "https://drivendata-public-assets.s3.amazonaws.com/odsc-west-2019/DATA_DICTIONARY.txt"
DATA_DICT_PATH = Path("../references/DATA_DICTIONARY.txt")

if not DATA_DICT_PATH.exists():
    DATA_DICT_PATH.parent.mkdir(exist_ok=True, parents=True)
    # Download data dictionary
    wget.download(url=DATA_DICT_URL, out=str(DATA_DICT_PATH))

In [None]:
# what information do we have in the dataset?
!cat $DATA_DICT_PATH

# Train a model

The non-profit wants their decision-making to be race-blind, so they ask for the population race percentage features to not be included in the modeling.

We'll create a very basic model that predicts evictions. We can refer to this model as we consider the ethics checklist items.

In [None]:
# what is the range of eviction values?
df.evictions.describe()

In [None]:
TARGET_VAR = "evictions"
FEATURE_VARS = [
    "year",
    "population",
    "poverty-rate",
    "median-property-value",
    "renter-occupied-households",
    "pct-renter-occupied",
    "median-gross-rent",
    "median-household-income",
    "rent-burden",
    ## Don't include race features
    #'pct-white' , 'pct-af-am', 'pct-hispanic', 'pct-am-ind',
    #'pct-asian', 'pct-nh-pi', 'pct-multiple', 'pct-other'
    ## Also don't include features directly related to the target variable
    # 'eviction-filings', 'eviction-rate', 'eviction-filing-rate'
]
GROUP_VAR = "GEOID"  # Split by tract so the same tract never lands in both train and test (prevents leakage)

In [None]:
print(f"Original Shape: {df.shape}")
# Drop NAs in target variable
df_modeling = df.dropna(subset=[TARGET_VAR]).copy()
# drop=True avoids adding the old index as an extra "index" column
df_modeling.reset_index(drop=True, inplace=True)
print(f"Shape without NAs: {df_modeling.shape}")

In [None]:
from sklearn.model_selection import GroupShuffleSplit, cross_validate
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

In [None]:
# create a train-test split for model evaluation later
split = GroupShuffleSplit(test_size=0.20, n_splits=2, random_state=36).split(
    df_modeling, groups=df_modeling.loc[:, GROUP_VAR]
)

train_inds, test_inds = next(split)

df_train = df_modeling.loc[train_inds, :]
df_test = df_modeling.loc[test_inds, :]

X_train = df_train.loc[:, FEATURE_VARS].values
y_train = df_train.loc[:, TARGET_VAR].values

X_test = df_test.loc[:, FEATURE_VARS].values
y_test = df_test.loc[:, TARGET_VAR].values
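As a quick sanity check on why we split by `GEOID`, the sketch below uses hypothetical toy tract labels (`t1` to `t5`) to confirm that `GroupShuffleSplit` does not place the same group on both sides of the split. It is an illustration on made-up data and not part of the workshop pipeline.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 5 tracts (groups), each observed in 4 years
groups = np.repeat(np.array(["t1", "t2", "t3", "t4", "t5"]), 4)
X = np.arange(len(groups)).reshape(-1, 1)

splitter = GroupShuffleSplit(test_size=0.20, n_splits=1, random_state=36)
train_idx, test_idx = next(splitter.split(X, groups=groups))

train_groups = set(groups[train_idx])
test_groups = set(groups[test_idx])
overlap = train_groups & test_groups
print("Tracts in both train and test:", overlap)
```

The same check could be run on the real split by comparing the sets of `GEOID` values in `df_train` and `df_test`.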

We're going to fit and use a [random forest](https://towardsdatascience.com/understanding-random-forest-58381e0602d2) model to predict evictions.

In [None]:
%%time
# train the model - note that this may take a few moments
model_pipeline = Pipeline([
    ('med_impute', SimpleImputer(strategy='median')),
    ('model', RandomForestRegressor(
        criterion='friedman_mse',
        n_estimators=100, 
        max_depth=10,
        random_state=36
    ))
])
model_pipeline.fit(X_train, y_train)

In [None]:
# generate predictions and look at key performance metrics
y_pred = model_pipeline.predict(X_test)

print("R2", r2_score(y_test, y_pred))
print("MSE", mean_squared_error(y_test, y_pred))
print("MAE", mean_absolute_error(y_test, y_pred))

In [None]:
# basic visualization of actual v predicted
_, ax = plt.subplots(figsize=(7, 7))
ax.plot(y_test, y_pred, ".", markersize=2, alpha=0.2)
plt.xlabel("Actual evictions")
plt.ylabel("Predicted evictions")

# Use the same limits on both axes so the diagonal marks perfect predictions
plt.xlim([0, 200])
plt.ylim([0, 200])
ax.axline([0, 0], slope=1, color="black", linewidth=1, label="Predicted = Actual")
ax.legend()

plt.show()

It is a little hard to see the pattern because of the number of data points we have. For now, we can see that most values are clustered at low numbers of evictions. When there is a high number of evictions (more than ~50), our model has a tendency to underpredict the number of evictions.
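One way to put numbers on this kind of pattern is to bin observations by their actual value and look at the mean signed error in each bin. The sketch below does this on synthetic data that is deliberately constructed so that predictions are too low for large values; `y_true` and `y_hat` are stand-ins, not the workshop's `y_test` and `y_pred`, and the bin edges are an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for actual and predicted values (NOT the workshop results):
# predictions are deliberately shrunk so large values are underpredicted
y_true = rng.gamma(shape=1.5, scale=20.0, size=2000)
y_hat = 5 + 0.6 * y_true + rng.normal(0, 3, size=2000)

residuals = pd.DataFrame({"actual": y_true, "error": y_hat - y_true})
bins = [0, 10, 25, 50, 100, np.inf]  # arbitrary bin edges for illustration
residuals["actual_bin"] = pd.cut(residuals["actual"], bins=bins)

# A negative mean error means the bin is underpredicted on average
by_bin = residuals.groupby("actual_bin", observed=True)["error"].agg(["mean", "size"])
print(by_bin)
```

Swapping in `y_test` and `y_pred` would give a table that complements the scatter plot above.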

***

# Walk through `deon` Checklist

We'll now go through a few of the items in `deon`'s standard ethics checklist. 

**We will not discuss every item on the deon checklist - items have been chosen that illustrate interesting points or coding challenges.** In a real-world setting, you'll want to integrate the full deon checklist into your coding. You can see an [example](https://github.com/drivendataorg/msc-buckingham-data-ethics/blob/master/notebooks/eviction-data-case-study-reference.ipynb) of this in the workshop repository.

In the future, you can create your own ethics checklist walkthrough notebooks easily with `deon --output ethics-checklist.ipynb`. See the `deon` [documentation](https://deon.drivendata.org/#command-line-options) for details.

## C. Analysis

- [ ] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

This section contains example code for considering possible sources of bias in the data. This code may be a helpful reference when you are completing exercises independently later in the notebook.

In [None]:
# Look into the number of missing values for a handful of relevant columns
df[["low-flag", "imputed", "evictions"]].isna().sum()

In [None]:
# How many observations have the low-flag?
# A majority of the evictions are likely too low
print(f"Proportion of observations with low-flag: {df['low-flag'].mean():.2f}")
df["low-flag"].value_counts()

In [None]:
# How many observations have the imputed flag?
# Very few values for eviction were imputed
print(f"Proportion of observations with imputed flag: {df['imputed'].mean():.2f}")
df["imputed"].value_counts()

In [None]:
# What are the general values of the race columns?
race_cols = [
    "pct-white",
    "pct-af-am",
    "pct-hispanic",
    "pct-am-ind",
    "pct-asian",
    "pct-nh-pi",
    "pct-multiple",
    "pct-other",
]
df[race_cols].describe(percentiles=[])

> *One note on the data:* The history around defining racial categories is complex, flawed, and nuanced. For the purposes of this activity, we will accept the race-based categories in the data as is. In a real-world context, it would be worth discussing how to navigate these categories in the most equitable way.

In [None]:
# Calculate pairwise correlation of some columns against race percentage columns
cols_to_correlate = ["evictions", "median-household-income", "imputed", "low-flag"]
correlation_df = (
    df[race_cols + cols_to_correlate].corr().loc[race_cols, cols_to_correlate]
)
correlation_df

In [None]:
# Visualize the above correlations.
plt.figure(figsize=(9, 5))
sns.heatmap(
    correlation_df.sort_values("evictions"),
    annot=True,
    fmt="g",
    cmap="RdBu_r",
    vmin=-0.5,
    vmax=0.5,
)

 - [ ] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

Mostly not applicable. We have no PII, but we do have some tracts with few observations. We need to be mindful of those and maybe exclude them from visualizations or combine them with neighboring tracts.
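A minimal sketch of the "exclude them from visualizations" option, assuming a population threshold below which a tract's value is blanked out before display. The rows, the `GEOID` values, and the `MIN_POPULATION` threshold are all made up for illustration; a real threshold would need to be agreed with the non-profit.

```python
import pandas as pd

# Hypothetical tract-level rows; column names mirror the dataset,
# but every value here is invented for illustration
tracts = pd.DataFrame(
    {
        "GEOID": ["06001", "06002", "06003", "06004"],
        "population": [4200, 150, 3100, 60],
        "evictions": [12, 1, 9, 0],
    }
)

MIN_POPULATION = 500  # assumed disclosure threshold, not an official rule

is_small = tracts["population"] < MIN_POPULATION
displayable = tracts.copy()
displayable["evictions"] = displayable["evictions"].astype("float")
# Blank out the value instead of dropping the row, so the suppression stays visible
displayable.loc[is_small, "evictions"] = float("nan")

print(displayable)
print(f"Suppressed {is_small.sum()} of {len(tracts)} tracts")
```

Keeping the suppressed rows in the table makes it clear to readers that data were withheld rather than absent.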

## D. Modeling

 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

Per the non-profit's request, we did not include any race variables in training our model. We want to figure out whether the model is still making decisions based on race using proxy variables that can indirectly indicate race.

**Any questions before we dive into our first activity?**

***
### <span style="color:green">*D.1 ACTIVITY*</span>

> Work independently for 20-25 minutes. Start here and stop where "end of activity" is indicated. **We encourage you to collaborate with one another!**

**<span style="color:green">To what extent are any of the feature variables in our model acting as proxies for race? Take some time to explore the data.</span>**

<span style="color:green">First, let's look for correlations between our feature variables and our race variables.</span>

In [None]:
corr = df_modeling.loc[:, FEATURE_VARS + race_cols].corr().loc[FEATURE_VARS, race_cols]

plt.figure(figsize=(9, 6))
sns.heatmap(corr, vmin=-1.0, vmax=1.0, cmap="RdBu_r", annot=True)

**<span style="color:green">Takeaways</span>** 

- <span style="color:green">*Example takeaway:* poverty rate has a strong correlation with multiple race variables. It tends to be higher for tracts with a higher percentage of Hispanic residents, and also higher (but slightly less so) for tracts with a higher percentage of African American residents. It tends to be lower in tracts with a higher percentage of white residents.</span>

- <span style="color:green">... your thoughts here ...</span>

<span style="color:green">We can use the correlation function's documentation to help with interpretation (below). You may want to look online for more details about any concepts in the documentation that you aren't familiar with, like Pearson correlation coefficients.</span>

In [None]:
# run this to see documentation of the df.corr function
?df.corr
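For reference, the Pearson coefficient that `df.corr` computes by default is the covariance of two columns divided by the product of their standard deviations, so it always falls between -1 and 1. The sketch below computes it by hand on synthetic data (the `x` and `y` arrays are made up) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(36)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)  # y depends partly on x

# Pearson r = sum of centered products / sqrt(product of centered sums of squares)
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered**2).sum() * (y_centered**2).sum()
)

r_pandas = pd.DataFrame({"x": x, "y": y}).corr().loc["x", "y"]
print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```

Because the coefficient only measures linear association, a value near zero does not rule out a non-linear relationship between a feature and a race variable.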

<span style="color:green">Your turn to code!</span>

<span style="color:green">We have a lot of different race variables, some of which have fairly low rates in many areas. **What happens if we create an aggregated variable for the percent of all non-white residents (`pct-non-white`)? What do the feature variable correlations look like for `pct-non-white` compared to `pct-white`, and do any patterns become clearer?** *Hint:* You can accomplish most of this by reusing code from above.</span>

<span style="color:green">Remember to document any substantive choices you have to make when you define the `pct-non-white` variable, and who is included.</span>

In [None]:
# create pct-non-white variable
df_modeling["pct-non-white"] = ...  ## YOUR CODE HERE

# plot correlations to feature variables

## YOUR CODE HERE

**<span style="color:green">Takeaways</span>** 

- <span style="color:green">... your thoughts here ...</span>

<span style="color:green">Another strategy is to fit a model that predicts the percent of a given race based on feature variables. If that model performs well, we know that our model predicting evictions could also make accurate inferences about racial breakdowns within tracts.</span>

<span style="color:green">**Below, train a model that predicts `pct-white` based on the same `FEATURE_VARS` used to train our eviction model earlier. Then assess how well the model performs, and write up a few takeaways about what that means for race proxy variables in our eviction model.** Remember, you can reuse code from earlier steps.</span>
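To illustrate the idea on a small scale, here is a sketch on synthetic data. Note the swapped-in technique: it uses ordinary least squares via NumPy rather than the random forest used elsewhere in this notebook, and the columns `protected`, `feature_a`, and `feature_b` are hypothetical. A high held-out R2 means the features carry enough information to reconstruct the attribute that was left out.

```python
import numpy as np

rng = np.random.default_rng(36)
n = 1000

# Synthetic data: `protected` is never used as a model feature,
# but feature_a is built to track it closely (a proxy)
protected = rng.uniform(0, 100, size=n)
feature_a = 0.8 * protected + rng.normal(0, 5, size=n)
feature_b = rng.normal(0, 1, size=n)  # unrelated feature

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), feature_a, feature_b])
train = np.arange(n) < 800
test = ~train

# Fit ordinary least squares to predict the protected attribute
coef, *_ = np.linalg.lstsq(X[train], protected[train], rcond=None)
pred = X[test] @ coef

# R2 on the held-out rows
ss_res = ((protected[test] - pred) ** 2).sum()
ss_tot = ((protected[test] - protected[test].mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
print(f"Held-out R2 for predicting the protected attribute: {r2:.2f}")
```

In the exercise itself, the same logic applies with `pct-white` as the target and `FEATURE_VARS` as the inputs.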

In [None]:
# create X_train, y_train, X_test, and y_test
# we can use the same split as before

## YOUR CODE HERE

In [None]:
# train a model to predict pct-white

## YOUR CODE HERE

In [None]:
# calculate simple performance metrics (R2, MSE, MAE)

## YOUR CODE HERE

In [None]:
# code for any other exploration of model performance you'd like to do!

**<span style="color:green">Takeaways</span>** 

<span style="color:green">... add your thoughts here ...</span>


<span style="color:green">**End of activity, wait for group to reconvene and discuss**</span>

***

 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

### <span style="color:green">*D.2 DISCUSSION*</span>

<span style="color:green">What are some approaches we can use to answer this question? Think about things like:</span> 

- <span style="color:green">How does the format of the data about race impact our strategy?</span> 

- <span style="color:green">What model performance metrics do we want to consider?</span> 

- <span style="color:green">What visuals do we want to produce?</span>

***

### <span style="color:green">*D.2 ACTIVITY*</span>

> Work independently for 15-20 minutes. Start here and stop where "end of activity" is indicated.

<span style="color:green">Calculate the correlation between each of the race variables in the data and the model's error and absolute error on the test set. Remember that you can reuse code from previous sections.</span>
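As a reminder of what the two quantities capture, the sketch below builds them on synthetic data: signed error keeps the direction of a miss, while absolute error keeps only its size. The `pct-group` column and the fake predictions are hypothetical and are constructed so that underprediction grows with the percentage.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(36)
n = 500

# Synthetic test set: `pct-group` is a hypothetical tract-level percentage,
# and the fake model underpredicts more as that percentage rises
toy_test = pd.DataFrame({"pct-group": rng.uniform(0, 100, size=n)})
toy_test["actual"] = rng.poisson(20, size=n).astype(float)
toy_test["predicted"] = (
    toy_test["actual"] - 0.1 * toy_test["pct-group"] + rng.normal(0, 2, size=n)
)

# Signed error keeps direction (negative = underprediction);
# absolute error keeps only magnitude
toy_test["error"] = toy_test["predicted"] - toy_test["actual"]
toy_test["abs_error"] = toy_test["error"].abs()

error_corr = (
    toy_test[["pct-group", "error", "abs_error"]]
    .corr()
    .loc[["pct-group"], ["error", "abs_error"]]
)
print(error_corr)
```

A negative correlation with signed error together with a positive correlation with absolute error would suggest the model both misses more and misses low where the percentage is high.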

In [None]:
# add columns to df_test for error and absolute error

## YOUR CODE HERE

# calculate correlation

## YOUR CODE HERE

# plot correlation heatmap

## YOUR CODE HERE

**<span style="color:green">Takeaways</span>** 

- <span style="color:green">... add your thoughts here ...</span>

<span style="color:green">Generate at least one other visual that helps to compare error rates between different racial groups. You could also explore another method of determining whether error is dependent on race percentages, such as fitting another model.</span>
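Because race is recorded as tract-level percentages rather than as a label per person, one possible approach is to bin tracts by quartile of a percentage column and compare error summaries across bins. The sketch below shows the mechanics on synthetic data; the `pct-group` column and the error values are made up, and quartiles are just one reasonable binning choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 800

# Synthetic stand-in: error spread grows with a hypothetical `pct-group` column
toy = pd.DataFrame({"pct-group": rng.uniform(0, 100, size=n)})
toy["error"] = rng.normal(0, 1, size=n) * (1 + toy["pct-group"] / 25)

# Bin tracts into quartiles of the percentage, then summarize error per bin
toy["quartile"] = pd.qcut(toy["pct-group"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
by_quartile = toy.groupby("quartile", observed=True)["error"].agg(
    mean_error="mean",
    mae=lambda e: e.abs().mean(),
)
print(by_quartile)
```

A table like this can be turned into a visual with `by_quartile["mae"].plot.bar()`.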

In [None]:
## YOUR CODE HERE

<span style="color:green">**End of activity, wait for group to reconvene and discuss**</span>

***

 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

We won't go into explainability in detail in this exercise. For quick reference, let's look at the [feature importances](https://www.aporia.com/learn/feature-importance/feature-importance-7-methods-and-a-quick-tutorial/#:~:text=In%20machine%20learning%2C%20feature%20importance,linear%20models%2C%20and%20neural%20networks.) of our model.

In [None]:
# Pair each feature name with its impurity-based importance from the fitted forest
feature_importance = pd.DataFrame.from_dict(
    {
        "features": FEATURE_VARS,
        "importance": model_pipeline.named_steps["model"].feature_importances_,
    }
)
print(
    feature_importance.sort_values("importance", ascending=False).reset_index(drop=True)
)

## E. Deployment

 - [ ] **E.1 Monitoring and evaluation:** How are we planning to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?

### <span style="color:green">*E.1-2 DISCUSSION*</span>


<span style="color:green">*E.1 and E.2:* Think about how you could monitor/evaluate the model moving forward, and put steps for redress in place.</span>

- <span style="color:green">What are some possible real-world consequences of the model performing poorly / making mistakes? What is the potential harm or inequity from incorrect model estimates?</span>

- <span style="color:green">How might you determine what the performance cutoff is for the model being good enough to use in practice?</span>

- <span style="color:green">What metric could you use for the above? What are some of the pros and cons of different metrics? Think about the consequences of false positives vs. false negatives in practice (for this regression model, overestimating vs. underestimating evictions). Is one less desirable than the other, and how can that be reflected in your metric?</span>

- <span style="color:green">If/when the model is deployed in practice, will there be any human review of the model's decisions? In which cases will there be human review, and how will that be integrated?</span>
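As one concrete starting point for the metric question, here is a sketch of an error metric that weights underestimates more heavily than overestimates. The function name `asymmetric_mae` and the `under_weight` parameter are hypothetical, and the assumption that underestimating evictions is the costlier mistake is exactly the kind of choice that should be discussed with the non-profit.

```python
import numpy as np


def asymmetric_mae(y_true, y_pred, under_weight=2.0):
    """Mean absolute error that penalizes underestimates more heavily.

    `under_weight` is a hypothetical knob: 2.0 means an underestimate
    costs twice as much as an overestimate of the same size.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    error = y_pred - y_true
    # Negative error means the prediction was too low
    weights = np.where(error < 0, under_weight, 1.0)
    return float(np.mean(weights * np.abs(error)))


actual = [10, 50, 100]
over = [20, 60, 110]  # every prediction is 10 too high
under = [0, 40, 90]  # every prediction is 10 too low

print(asymmetric_mae(actual, over))   # 10.0
print(asymmetric_mae(actual, under))  # 20.0
```

With `under_weight=1.0` the function reduces to the ordinary mean absolute error, which makes it easy to compare the two side by side.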

*Data Science Ethics Checklist generated with [deon](http://deon.drivendata.org).*