# Explanation / Motivation:
- Study on non-discriminatory supervised learning models
- This data set is a case study on FICO scores and how they determine a 'threshold' cutoff score to either deny or approve a loan application
- Using data from the Federal Reserve, we can see the distribution of scores (FICO score percentile) against four main demographic groups: Asian, Hispanic, Black, and White
- With this data, we can plot the probability of defaulting and/or non-defaulting people from a specified demographpic group getting approved a loan ($\hat Y$ = 1).
- Theoretically, the probability of defaulting and/or non-defaulting people getting ($\hat Y$ = 1) should be equal amongst all demographic groups, but as you can see from this study, that is not the case.
- I dive into what this means in terms of precision/recall, the cost of these discrepencies, why this is happening, as well as methodologies for improvement.
- Data and non-descriminitaory model analysis courtesy of https://arxiv.org/pdf/1610.02413.pdf

In [1]:
from bs4 import BeautifulSoup
import fileinput
import sys
import re
import csv
import requests
import os
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.lines as mlines


## Scraping Data from Federal Reserve:
#### Explanations / Guidelines:
- All files are saved in current directory
- We will be analyzing "Figure_7.A._TransRisk_Score_Cumulative_Percentage_of_Goods_and_Bads,_by_Demographic_Group(Random-Account_Performance)_-_Race_or_ethnicity_(SSA_data).csv" and saving that file as 'random-account-ficoscores.csv' for easier reference
- To analyze the Cumulative Percentage of Goods and Bads for any of the other protected groups (sex, age, marital status, or income ratio) simply plug in this csv when assigning the 'data' variable
- Options for account types: any-account, new-account, existing-account, random-account
- In this study, "good" means non-defaulting for loans (will pay it off). "Bad" means defaulting for loans (will not pay it off).

*** I'm having difficulties making sure I have the correct data, I think I accidentally took the data from Figure 7.D, not 7.A and I'd like to switch that, but it says that my attempts at requesting data from the HTML page have been exhausted *** (keeping these cells as markdown to try to not accidentally make more request attempts

In [2]:

url = "https://www.federalreserve.gov/boarddocs/rptcongress/creditscore/figtables7.htm#d7A"
r  = requests.get("https://" +url)
data = r.text
soup = BeautifulSoup(data, 'lxml')


def cell_text(cell):
    return " ".join(cell.stripped_strings)

for table in soup.find_all('table'):
    title = table.find('span', { 'class' : 'tablehead' }).getText()
    subhead = table.find('span', { 'class' : 'tablesubheadsmall' }).getText()
    fname = (title + ' - '+subhead).replace(' ', '_') + '.csv'
    fname = fname.replace(':', '-')
    with open(fname, 'w') as outfile:
        output = csv.writer(outfile)

        for row in table.find_all('tr'):
            col = map(cell_text, row.find_all(re.compile('t[dh]')))
            output.writerow(col)
            

ConnectionError: HTTPSConnectionPool(host='https', port=443): Max retries exceeded with url: //www.federalreserve.gov/boarddocs/rptcongress/creditscore/figtables7.htm (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x11a40be48>: Failed to establish a new connection: [Errno 61] Connection refused',))

*** Tried it again with the Figure 3A ***


url = "https://www.federalreserve.gov/boarddocs/rptcongress/creditscore/figtables3.htm#d3A"
r  = requests.get("https://" +url)
data = r.text
soup = BeautifulSoup(data, 'lxml')


def cell_text(cell):
    return " ".join(cell.stripped_strings)

for table in soup.find_all('table'):
    title = table.find('span', { 'class' : 'tablehead' }).getText()
    subhead = table.find('span', { 'class' : 'tablesubheadsmall' }).getText()
    fname = (title + ' - '+subhead).replace(' ', '_') + '.csv'
    fname = fname.replace(':', '-')
    with open(fname, 'w') as outfile:
        output = csv.writer(outfile)

        for row in table.find_all('tr'):
            col = map(cell_text, row.find_all(re.compile('t[dh]')))
            output.writerow(col)

In [3]:
#os.rename('Figure_7.D._TransRisk_Score-_Cumulative_Percentage_of_Goods_and_Bads,_by_Demographic_Group(Random-Account_Performance)_-_Race_or_ethnicity_(SSA_data).csv', 'random-account-ficoscores.csv')

## HERE IS WHERE I CALCULATE THE CONDENSED VERSION OF THE DATA!

In [3]:
import numpy as np

In [4]:
def getPD(goodName, badName, data, raceName):
    pd = data['Score'].to_frame(name="Score")
    race = np.full(len(data[badName]), raceName)
    pd["Demographic"] = race
    pd["Good"] = data[goodName].copy()
    pd["Bad"] = data[badName].copy()
    return pd

In [6]:
data = pd.read_csv("Figure7A.csv")
#Necessary to rename this column for clarity of the data it represents
#because of formatting issues when parsing data from the html
data.rename(columns={'Black (Bad).1':'Hispanic (Good)'}, inplace=True)
data.rename(columns={'Non- Hispanic white (Good)':'White (Good)'}, inplace=True)
data.rename(columns={'Non- Hispanic white (Bad)': 'White (Bad)'}, inplace=True)
whites = getPD('White (Good)', 'White (Bad)', data, "white")
blacks = getPD('Black (Good)', 'Black (Bad)', data, "black")
asians = getPD('Asian (Good)', 'Asian (Bad)', data, "asian")
hispanics = getPD('Hispanic (Good)', 'Hispanic (Bad)', data, "hispanic")

In [39]:
def getSeries(data, goodOrBad):
    one = data[data["Score"] == 20.0][goodOrBad].iloc[0]
    two = data[data["Score"] == 40.0][goodOrBad].iloc[0]
    three = data[data["Score"] == 60.0][goodOrBad].iloc[0]
    four = data[data["Score"] == 80.0][goodOrBad].iloc[0]
    five = data[data["Score"] == 100.0][goodOrBad].iloc[0]
#     two = data[data["Score"] > 20.0][data["Score"] <= 40.0][goodOrBad].sum() + one
#     three = data[data["Score"] > 40.0][data["Score"] <= 60.0][goodOrBad].sum() + one + two
#     four = data[data["Score"] > 60.0][data["Score"] <= 80.0][goodOrBad].sum() + one + two + three
#     five = data[data["Score"] > 80.0][data["Score"] <= 100.0][goodOrBad].sum() + one + two + three + four
    return pd.Series([one, two, three, four, five])

In [38]:
whites[whites["Score"] == 20.0]["Good"].iloc[0]

2.79

In [40]:
scores = pd.Series([20, 40, 60, 80, 100])
whitef = pd.DataFrame({ 'Score' : scores,
    'Demographic' : np.full(len(scores), "white"),
    'Good' : getSeries(whites, "Good"),
    'Bad' : getSeries(whites, "Bad") })
asianf = pd.DataFrame({ 'Score' : scores,
    'Demographic' : np.full(len(scores), "asian"),
    'Good' : getSeries(asians, "Good"),
    'Bad' : getSeries(asians, "Bad") })
blackf = pd.DataFrame({ 'Score' : scores,
    'Demographic' : np.full(len(scores), "black"),
    'Good' : getSeries(blacks, "Good"),
    'Bad' : getSeries(blacks, "Bad") })
hispanicf = pd.DataFrame({ 'Score' : scores,
    'Demographic' : np.full(len(scores), "hispanic"),
    'Good' : getSeries(hispanics, "Good"),
    'Bad' : getSeries(hispanics, "Bad") })

In [41]:
frames = [whitef, blackf, asianf, hispanicf]
shortenedData = pd.concat(frames)
shortenedData.rename(columns={'Score' : 'TransRisk Score'}, inplace=True)
shortenedData = shortenedData[["TransRisk Score", "Demographic", "Good", "Bad"]]
shortenedData.set_index("TransRisk Score", inplace=True)
shortenedData.to_csv("ShortenedData.csv")

In [42]:
shortenedData

Unnamed: 0_level_0,Demographic,Good,Bad
TransRisk Score,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
20,white,2.79,59.09
40,white,16.27,87.1
60,white,40.06,95.71
80,white,66.27,98.4
100,white,100.0,100.0
20,black,11.53,75.02
40,black,41.7,95.56
60,black,70.33,98.99
80,black,87.04,99.74
100,black,100.0,100.0


*** END OF CONDENSED DATA CALCULATION ***

## Total Data Calculation

In [43]:
def getPD(goodName, badName, data, raceName):
    pd = data['Score'].to_frame(name="Score")
    race = np.full(len(data), raceName)
    pd["Demographic"] = race
    pd["Good"] = data[goodName].copy()
    pd["Bad"] = data[badName].copy()
    return pd

In [44]:
data = pd.read_csv("ficoscores.csv")
#Necessary to rename this column for clarity of the data it represents
#because of formatting issues when parsing data from the html
data.rename(columns={'Black (Bad).1':'Hispanic (Good)'}, inplace=True)
data.rename(columns={'Non- Hispanic white (Good)':'White (Good)'}, inplace=True)
data.rename(columns={'Non- Hispanic white (Bad)': 'White (Bad)'}, inplace=True)
whites = getPD('White (Good)', 'White (Bad)', data, "white")
blacks = getPD('Black (Good)', 'Black (Bad)', data, "black")
asians = getPD('Asian (Good)', 'Asian (Bad)', data, "asian")
hispanics = getPD('Hispanic (Good)', 'Hispanic (Bad)', data, "hispanic")

In [45]:
frames = [whites, blacks, asians, hispanics]
totalData = pd.concat(frames)
totalData.rename(columns={'Score' : 'TransRisk Score'}, inplace=True)
totalData.set_index("TransRisk Score", inplace=True)
totalData.head()

Unnamed: 0_level_0,Demographic,Good,Bad
TransRisk Score,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0.0,white,0.0,0.17
0.5,white,0.03,1.85
1.0,white,0.22,7.26
1.5,white,0.26,8.85
2.0,white,0.35,10.58


In [46]:
totalData.to_csv("TransRiskScores.csv")

# Workspace for solution and Tutorial below!

In [15]:
white_non_default = data[["Score", "Non- Hispanic white (Good)"]]
white_default = data[["Score","Non- Hispanic white (Bad)"]]
black_non_default = data[["Score","Black (Good)"]]
black_default = data[["Score","Black (Bad)"]]
hispanic_non_default = data[["Score","Hispanic (Good)"]]
hispanic_default = data[["Score","Hispanic (Bad)"]]
asian_non_default = data[["Score","Asian (Good)"]]
asian_default = data[["Score","Asian (Bad)"]]

KeyError: "['Non- Hispanic white (Good)'] not in index"

In [None]:
def getGraph(dataset, metricName, graphType):
    i= 0
    x = []
    y = []
    while(i < 100.5):
        if(i == 72.5 or i == 77.5 or i == 92.5):
            i = (i + 0.5)
        curr_race_non_default = dataset[dataset["Score"] >= i][metricName].sum()
        total_race_non_default = dataset[metricName].sum()
        yVal = curr_race_non_default / total_race_non_default
        x.append(i)
        y.append(yVal)
        i = (i + 0.5)
    plt.plot(x, y, graphType, label=metricName)

## Visualizing This Data:
#### Now we have data for how many people are defaulters and non-defaulters for each score, theoretically the probability of a non-defaulter getting approved a loan ($\hat Y$ = 1) should be the same amongst all four groups. You can see from the graph that this is not the case: a person from the black demographic group is much less likely to be approved than a white or asian non-defaulting person.

In [None]:
getGraph(asian_non_default, "Asian (Good)", 'b-')
getGraph(white_non_default, "Non- Hispanic white (Good)", 'g-')
getGraph(black_non_default, "Black (Good)", 'c-')
getGraph(hispanic_non_default, "Hispanic (Good)", 'm-')
plt.title("Probability of Non-Defaulters Getting $\hat Y$ = 1 (Beneficial Outcome)" )


blue_line = mlines.Line2D([], [], color='blue', marker='.',
                          markersize=15, label='Asian')
green_line = mlines.Line2D([], [], color='green', marker='.',
                          markersize=15, label='White')
cyan_line = mlines.Line2D([], [], color='cyan', marker='.',
                          markersize=15, label='Black')
purple_line = mlines.Line2D([], [], color='purple', marker='.',
                          markersize=15, label='Hispanic')

plt.legend(handles=[blue_line, green_line, cyan_line, purple_line])

#todo: compare the defaulters getting beneficial outcome too
#facetgrid and seaborn - make variable 'default' or 'non-default' or 'race' for 1 panel for each race
# recall - x
# precision - y

In [None]:
asian_non_default.to_csv("asian-non-default.csv")
asian_default.to_csv("asian-default.csv")
white_non_default.to_csv("white-non-default.csv")
white_default.to_csv("white-default.csv")
black_non_default.to_csv("black-non-default.csv")
black_default.to_csv("black-default.csv")
hispanic_non_default.to_csv("hispanic-non-default.csv")
hispanic_default.to_csv("hispanic-default.csv")

In [None]:
import pickle
pickle.dump(asian_non_default.to_csv, open("asian-non-default.pkl", "wb"))
pickle.dump(asian_default, open("asian-default.pkl", "wb"))
pickle.dump(white_non_default, open("white-non-default.pkl", "wb"))
pickle.dump(white_default, open("white-default.pkl", "wb"))
pickle.dump(black_non_default, open("black-non-default.pkl", "wb"))
pickle.dump(black_default, open("black-default.pkl", "wb"))
pickle.dump(hispanic_non_default, open("hispanic-non-default.pkl", "wb"))
pickle.dump(hispanic_default, open("hispanic-default.pkl", "wb"))


## Precision / Recall Analysis

##### Recall: Of all non-defaulters, how many did we correctly identify as non-defaulters (Gave beneficial outcome?)
- true positives / (total false negatives and true positives)
- correctly predicted non-defaulters / all non-defaulters

##### Precision: Of the non-defaulters we predicted (given a beneficial outcome?), how many were actually non-defaulting
- true positives / true positives + false positives
- correctly identified non-defaulters / all predicted non-defaulters



##### Notes:
- true: non-defaulting (they are a good candidate for a loan)
- good precision: good precision would mean out of the people that we predict are going to pay it, a high percentage actually are
- poor precision: out of the people that we predict are going to pay it, a low percentage actually will

*** This is why banks are looking for good precision for lower risk / cost on their part, but the non-discriminatory models attempt to put the burden of this cost on the data scientist to make more accurate models and away from the minority/protected group

- recall: of the people who would pay it back, how many did we correctly identify
- F-1 scores, can weight precision or recall depending on what is important

## Cost Analysis / Problem Space
- Cost that accompanies low precision
- What kinds of companies might risk placing cost on discriminated-against groups (even by accident)
- What data sets and attributes are most commonly 'protected' and what kinds of models need to be re-trained to fit ethical platforms
- How non-discriminatory supervised learning models can come into play on already-trained data

## Possible Solutions
(Non-Discriminatory Supervised Learning Models)
- Explanation of non-discriminatory supervised learning models
- Max Profit Classifier, Race Blind Classifier, Demographic Parity Classifier, Equal Opportunity Classifier, Equal Odds Classifier
- Pros/Cons of using each one
- Examples of when one might be better over another
- Recommendations for this dataset