# This workbook is for the concatenation of the scraped data to be used for the tutorial and solution
### Explanation / Motivation:
- Study on non-discriminatory supervised learning models
- This data set is a case study on FICO scores and how a 'threshold' cutoff score is used to either deny or approve a loan application
- Using data from the Federal Reserve, we can see the distribution of scores (FICO score percentile) across four main demographic groups: Asian, Hispanic, Black, and White
- With this data, we can plot the probability that defaulting and/or non-defaulting people from a specified demographic group are approved for a loan ($\hat Y$ = 1).
- Theoretically, the probability of defaulting and/or non-defaulting people being approved ($\hat Y$ = 1) should be equal across all demographic groups, but as this study shows, that is not the case.
- I dive into what this means in terms of precision/recall, the cost of these discrepancies, why this is happening, and methodologies for improvement.
- Data and non-discriminatory model analysis courtesy of https://arxiv.org/pdf/1610.02413.pdf
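
The equality described in the bullets above can be stated explicitly. As a sketch in the notation of the cited paper, and assuming $A$ denotes the demographic group and $Y$ the true outcome ($Y = 1$ for non-defaulting, $Y = 0$ for defaulting), the equalized odds condition requires, for every pair of groups $a$ and $b$:

$$
P(\hat Y = 1 \mid A = a, Y = y) = P(\hat Y = 1 \mid A = b, Y = y), \quad y \in \{0, 1\}
$$

The weaker equal opportunity condition from the same paper requires this equality only for $y = 1$, that is, only among non-defaulting applicants.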

In [1]:
from bs4 import BeautifulSoup
import fileinput
import sys
import re
import csv
import requests
import os
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.lines as mlines


## Scraping Data from Federal Reserve:
#### Explanations / Guidelines:
- All files are saved in the current directory
- We will be analyzing "Figure_7.A._TransRisk_Score_Cumulative_Percentage_of_Goods_and_Bads,_by_Demographic_Group(Random-Account_Performance)_-_Race_or_ethnicity_(SSA_data).csv" and saving that file as 'random-account-ficoscores.csv' for easier reference
- To analyze the Cumulative Percentage of Goods and Bads for any of the other protected groups (sex, age, marital status, or income ratio), pass the corresponding csv when assigning the 'data' variable
- Options for account types: any-account, new-account, existing-account, random-account
- In this study, "good" means non-defaulting for loans (will pay it off). "Bad" means defaulting for loans (will not pay it off).

## Condensed Data Calculation

In [13]:
import numpy as np

In [14]:
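def getPD(goodName, badName, data, raceName):
    # Build a per-group frame with Score, Demographic, Good, and Bad columns.
    # The local is named `frame` so that it does not shadow the pandas module `pd`.
    frame = data['Score'].to_frame(name="Score")
    race = np.full(len(data[badName]), raceName)
    frame["Demographic"] = race
    frame["Good"] = data[goodName].copy()
    frame["Bad"] = data[badName].copy()
    return frame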

In [15]:
#cell updated on April 15 to reflect the correct probability data
data = pd.read_csv("NonCumulativeProbabilities.csv")
whites = getPD('White (Good)', 'White (Bad)', data, "white")
blacks = getPD('Black (Good)', 'Black (Bad)', data, "black")
asians = getPD('Asian (Good)', 'Asian (Bad)', data, "asian")
hispanics = getPD('Hispanic (Good)', 'Hispanic (Bad)', data, "hispanic")

In [16]:
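def getSeries(data, goodOrBad):
    # Combine both bounds into a single boolean mask per bucket. Chaining two masks
    # (data[a][b]) makes pandas warn that the second key is reindexed.
    score = data["Score"]
    # `one` is the value in the first row with Score <= 20, not a sum over that bucket.
    one = data[score <= 20.0][goodOrBad].iloc[0]
    # Each later bucket adds every earlier total, so earlier totals are counted more than once.
    two = data[(score > 20.0) & (score <= 40.0)][goodOrBad].sum() + one
    three = data[(score > 40.0) & (score <= 60.0)][goodOrBad].sum() + one + two
    four = data[(score > 60.0) & (score <= 80.0)][goodOrBad].sum() + one + two + three
    five = data[(score > 80.0) & (score <= 100.0)][goodOrBad].sum() + one + two + three + four
    return pd.Series([one, two, three, four, five])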
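
For comparison, here is a minimal sketch of a plain running total over the same 20-point buckets, using `pd.cut` and `cumsum`. It is not the method used in `getSeries` above: it sums every row in the first bucket and counts each bucket exactly once, so its numbers will differ from the ones this notebook produces. The helper name `bucket_cumulative` and the toy input are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

def bucket_cumulative(frame, column, edges=(20, 40, 60, 80, 100)):
    """Running total of `column` up to each upper score edge (hypothetical helper)."""
    # Bins are (-inf, 20], (20, 40], ..., (80, 100]; the first bin also catches score 0.
    bins = [-np.inf, *edges]
    buckets = pd.cut(frame["Score"], bins=bins, labels=list(edges))
    # Sum within each bucket, then accumulate across buckets.
    per_bucket = frame.groupby(buckets, observed=False)[column].sum()
    return per_bucket.cumsum().reset_index(drop=True)

# Toy input, invented for illustration only (not Federal Reserve data).
toy = pd.DataFrame({
    "Score": [0.0, 10.0, 30.0, 50.0, 70.0, 90.0],
    "Good":  [0.0, 0.1, 0.3, 0.5, 0.7, 0.9],
})
result = bucket_cumulative(toy, "Good")
print(result.tolist())
```

Whether a plain running total or the original accumulation is the right summary depends on how the condensed table is used downstream, so treat this as an alternative to check against rather than a replacement.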

In [17]:
scores = pd.Series([20, 40, 60, 80, 100])
whitef = pd.DataFrame({ 'Score' : scores,
    'Demographic' : np.full(len(scores), "white"),
    'Good' : getSeries(whites, "Good"),
    'Bad' : getSeries(whites, "Bad") })
asianf = pd.DataFrame({ 'Score' : scores,
    'Demographic' : np.full(len(scores), "asian"),
    'Good' : getSeries(asians, "Good"),
    'Bad' : getSeries(asians, "Bad") })
blackf = pd.DataFrame({ 'Score' : scores,
    'Demographic' : np.full(len(scores), "black"),
    'Good' : getSeries(blacks, "Good"),
    'Bad' : getSeries(blacks, "Bad") })
hispanicf = pd.DataFrame({ 'Score' : scores,
    'Demographic' : np.full(len(scores), "hispanic"),
    'Good' : getSeries(hispanics, "Good"),
    'Bad' : getSeries(hispanics, "Bad") })



In [18]:
frames = [whitef, blackf, asianf, hispanicf]
shortenedData = pd.concat(frames)
shortenedData.rename(columns={'Score' : 'TransRisk Score'}, inplace=True)
shortenedData = shortenedData[["TransRisk Score", "Demographic", "Good", "Bad"]]
shortenedData.set_index("TransRisk Score", inplace=True)
shortenedData.to_csv("ShortenedData.csv")

In [19]:
shortenedData

| TransRisk Score | Demographic | Good | Bad |
|---|---|---|---|
| 20 | white | 0.0 | 1.0 |
| 40 | white | 24.120496 | 16.879504 |
| 60 | white | 60.245004 | 21.754996 |
| 80 | white | 121.266575 | 40.733425 |
| 100 | white | 244.108495 | 80.891505 |
| 20 | black | 0.0 | 1.0 |
| 40 | black | 17.139582 | 23.860418 |
| 60 | black | 49.018407 | 32.981593 |
| 80 | black | 100.597892 | 61.402108 |
| 100 | black | 203.193748 | 121.806252 |


*** END OF CONDENSED DATA CALCULATION ***

## Total Data Calculation

In [9]:
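def getPD(goodName, badName, data, raceName):
    # Build a per-group frame with Score, Demographic, Good, and Bad columns.
    # The local is named `frame` so that it does not shadow the pandas module `pd`.
    frame = data['Score'].to_frame(name="Score")
    race = np.full(len(data), raceName)
    frame["Demographic"] = race
    frame["Good"] = data[goodName].copy()
    frame["Bad"] = data[badName].copy()
    return frame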

In [10]:
#cell added today on April 15
data = pd.read_csv("NonCumulativeProbabilities.csv")
whites = getPD('White (Good)', 'White (Bad)', data, "white")
blacks = getPD('Black (Good)', 'Black (Bad)', data, "black")
asians = getPD('Asian (Good)', 'Asian (Bad)', data, "asian")
hispanics = getPD('Hispanic (Good)', 'Hispanic (Bad)', data, "hispanic")

In [11]:
frames = [whites, blacks, asians, hispanics]
totalData = pd.concat(frames)
totalData.rename(columns={'Score' : 'TransRisk Score'}, inplace=True)
totalData.set_index("TransRisk Score", inplace=True)
totalData.head()

| TransRisk Score | Demographic | Good | Bad |
|---|---|---|---|
| 0.0 | white | 0.0 | 1.0 |
| 0.5 | white | 0.0 | 1.0 |
| 1.0 | white | 0.025629 | 0.974371 |
| 1.5 | white | 0.027318 | 0.972682 |
| 2.0 | white | 0.042152 | 0.957848 |


In [12]:
totalData.to_csv("TransRiskScores.csv")
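
Because `TransRiskScores.csv` stacks the four groups one under another, each score appears once per group. A minimal sketch of reading such a file back and pivoting it so the groups sit side by side is shown below. The inline CSV text is a toy stand-in invented for illustration, not the real file contents; only the column names are taken from the table above.

```python
import io
import pandas as pd

# Toy stand-in for TransRiskScores.csv, invented for illustration only.
csv_text = """TransRisk Score,Demographic,Good,Bad
0.0,white,0.0,1.0
0.5,white,0.1,0.9
0.0,black,0.0,1.0
0.5,black,0.05,0.95
"""

# Restore the score column as the index, as in the saved frame.
scores = pd.read_csv(io.StringIO(csv_text), index_col="TransRisk Score")

# One row per score, one column per demographic group, holding the "Good" values.
wide = scores.pivot(columns="Demographic", values="Good")
print(wide)
```

With the real file, `io.StringIO(csv_text)` would be replaced by the path `"TransRiskScores.csv"`.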