*** OLD STUFF FROM PREVIOUS VERSIONS OF THE TUTORIAL / SOLUTION***

## Scraping Data from Federal Reserve:
#### Explanations / Guidelines:
- All files are saved in current directory
- We will be analyzing "Figure_7.A._TransRisk_Score_Cumulative_Percentage_of_Goods_and_Bads,_by_Demographic_Group(Random-Account_Performance)_-_Race_or_ethnicity_(SSA_data).csv" and saving that file as 'random-account-ficoscores.csv' for easier reference
- To analyze the Cumulative Percentage of Goods and Bads for any of the other protected groups (sex, age, marital status, or income ratio), substitute that group's csv when assigning the 'data' variable
- Options for account types: any-account, new-account, existing-account, random-account
- In this study, "good" means non-defaulting on a loan (the borrower will pay it off). "Bad" means defaulting on a loan (the borrower will not pay it off).

## Next Steps:

Now that you have calculated the precision and recall for a single threshold, let's calculate them for every TransRisk score threshold. Create a pandas DataFrame that includes the TransRisk scores and, for all four groups, the defaulting and non-defaulting columns. Treating each score as the <i>threshold</i> value, determine the F1 score for each group at that threshold. Do you notice any significant discrepancies between groups?

In [None]:
def getPrecisionSeries(goodName, badName, dataArray):
    tp = dataArray[goodName]
    fp = dataArray[badName]
    precision = tp / (tp + fp)
    return precision

In [None]:
def getRecallSeries(goodName, badName, dataArray):
    # Use the dataArray argument rather than the global totalData
    totalGood = dataArray[goodName].sum()
    tp = dataArray[goodName]
    recall = tp / totalGood
    return recall

In [None]:
def getF1Series(goodName, badName, dataArray):
    precision = getPrecisionSeries(goodName, badName, dataArray)
    recall = getRecallSeries(goodName, badName, dataArray)
    denom = ((1/precision) + (1/recall)) / 2
    f1 = 1/denom
    return f1

In [None]:
F1Scores = totalData["Score"].to_frame(name="TransRisk Score")
white_series = getF1Series("Non- Hispanic white (Good)", "Non- Hispanic white (Bad)", totalData)
asian_series = getF1Series("Asian (Good)", "Asian (Bad)", totalData)
black_series = getF1Series("Black (Good)", "Black (Bad)", totalData)
hispanic_series = getF1Series("Hispanic (Good)", "Hispanic (Bad)", totalData)
F1Scores['White F1'] = white_series
F1Scores['Asian F1'] = asian_series
F1Scores['Black F1'] = black_series
F1Scores['Hispanic F1'] = hispanic_series
F1Scores.head()

In [None]:
## Still trying to get the F1 score DataFrame looking good
# TODO: F1Scores.set_index("TransRisk Score"), then (if we choose to do this) melt the
# DataFrame so that the demographic F1s become a single column
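
One way the reshaping step above could look, as a minimal sketch. The toy values are invented for illustration and are not taken from the Federal Reserve data; the names `toy_f1`, `long_f1`, `Group`, and `F1` are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the F1Scores frame built above; the numbers are invented.
toy_f1 = pd.DataFrame({
    "TransRisk Score": [0.0, 0.5, 1.0],
    "White F1": [0.90, 0.91, 0.92],
    "Black F1": [0.70, 0.72, 0.74],
})

# Wide -> long: one row per (score, group) pair.
long_f1 = toy_f1.melt(
    id_vars="TransRisk Score",
    var_name="Group",
    value_name="F1",
)

# Strip the " F1" suffix so the Group column holds just the group name.
long_f1["Group"] = long_f1["Group"].str.replace(" F1", "", regex=False)
```

The long layout is the shape that faceted plotting tools generally expect, which may be useful for the per-group panels mentioned in the todo notes.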

*** OLD STUFF FROM A PREVIOUS VERSION OF THE DATA SCRAPING ***

In [1]:
# FOR TOTAL DATA
import pandas as pd  # this import was missing, which raised "NameError: name 'pd' is not defined"

data = pd.read_csv("ficoscores.csv")
#Necessary to rename this column for clarity of the data it represents
#because of formatting issues when parsing data from the html
data.rename(columns={'Black (Bad).1':'Hispanic (Good)'}, inplace=True)
data.rename(columns={'Non- Hispanic white (Good)':'White (Good)'}, inplace=True)
data.rename(columns={'Non- Hispanic white (Bad)': 'White (Bad)'}, inplace=True)
# NOTE: getPD is not defined in this section; it is assumed to come from an earlier version of the notebook
whites = getPD('White (Good)', 'White (Bad)', data, "white")
blacks = getPD('Black (Good)', 'Black (Bad)', data, "black")
asians = getPD('Asian (Good)', 'Asian (Bad)', data, "asian")
hispanics = getPD('Hispanic (Good)', 'Hispanic (Bad)', data, "hispanic")

In [2]:
# FOR SHORTENED DATA
import pandas as pd  # this import was missing, which raised "NameError: name 'pd' is not defined"

data = pd.read_csv("Figure7A.csv")
#Necessary to rename this column for clarity of the data it represents
#because of formatting issues when parsing data from the html
data.rename(columns={'Black (Bad).1':'Hispanic (Good)'}, inplace=True)
data.rename(columns={'Non- Hispanic white (Good)':'White (Good)'}, inplace=True)
data.rename(columns={'Non- Hispanic white (Bad)': 'White (Bad)'}, inplace=True)
# NOTE: getPD is not defined in this section; it is assumed to come from an earlier version of the notebook
whites = getPD('White (Good)', 'White (Bad)', data, "white")
blacks = getPD('Black (Good)', 'Black (Bad)', data, "black")
asians = getPD('Asian (Good)', 'Asian (Bad)', data, "asian")
hispanics = getPD('Hispanic (Good)', 'Hispanic (Bad)', data, "hispanic")

*** OLD STUFF FROM WHEN I SCRAPED DATA FROM THE WEBSITE ***

In [None]:

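import csv
import re

import requests
from bs4 import BeautifulSoup

# The URL already includes the scheme, so pass it to requests.get unchanged
url = "https://www.federalreserve.gov/boarddocs/rptcongress/creditscore/figtables7.htm#d7A"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml')


def cell_text(cell):
    # Join all text fragments in a table cell into one string
    return " ".join(cell.stripped_strings)

for table in soup.find_all('table'):
    title = table.find('span', { 'class' : 'tablehead' }).getText()
    subhead = table.find('span', { 'class' : 'tablesubheadsmall' }).getText()
    fname = (title + ' - '+subhead).replace(' ', '_') + '.csv'
    fname = fname.replace(':', '-')
    # newline='' keeps the csv module from writing blank rows on Windows
    with open(fname, 'w', newline='') as outfile:
        output = csv.writer(outfile)

        for row in table.find_all('tr'):
            # Collect the text of every <td> and <th> cell in this row
            col = [cell_text(cell) for cell in row.find_all(re.compile('t[dh]'))]
            output.writerow(col)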
            

In [4]:
#os.rename('Figure_7.D._TransRisk_Score-_Cumulative_Percentage_of_Goods_and_Bads,_by_Demographic_Group(Random-Account_Performance)_-_Race_or_ethnicity_(SSA_data).csv', 'random-account-ficoscores.csv')

*** Old version of getting condensed data ***

In [None]:
def getSeries(data, goodOrBad):
    one = data[data["Score"] == 20.0][goodOrBad].iloc[0]
    two = data[data["Score"] == 40.0][goodOrBad].iloc[0]
    three = data[data["Score"] == 60.0][goodOrBad].iloc[0]
    four = data[data["Score"] == 80.0][goodOrBad].iloc[0]
    five = data[data["Score"] == 100.0][goodOrBad].iloc[0]
#     two = data[data["Score"] > 20.0][data["Score"] <= 40.0][goodOrBad].sum() + one
#     three = data[data["Score"] > 40.0][data["Score"] <= 60.0][goodOrBad].sum() + one + two
#     four = data[data["Score"] > 60.0][data["Score"] <= 80.0][goodOrBad].sum() + one + two + three
#     five = data[data["Score"] > 80.0][data["Score"] <= 100.0][goodOrBad].sum() + one + two + three + four
    return pd.Series([one, two, three, four, five])

In [None]:
data = pd.read_csv("ficoscores.csv")
#Necessary to rename this column for clarity of the data it represents
#because of formatting issues when parsing data from the html
data.rename(columns={'Black (Bad).1':'Hispanic (Good)'}, inplace=True)
data.rename(columns={'Non- Hispanic white (Good)':'White (Good)'}, inplace=True)
data.rename(columns={'Non- Hispanic white (Bad)': 'White (Bad)'}, inplace=True)
whites = getPD('White (Good)', 'White (Bad)', data, "white")
blacks = getPD('Black (Good)', 'Black (Bad)', data, "black")
asians = getPD('Asian (Good)', 'Asian (Bad)', data, "asian")
hispanics = getPD('Hispanic (Good)', 'Hispanic (Bad)', data, "hispanic")

# Workspace for solution and Tutorial below!

In [6]:
white_non_default = data[["Score", "Non- Hispanic white (Good)"]]
white_default = data[["Score","Non- Hispanic white (Bad)"]]
black_non_default = data[["Score","Black (Good)"]]
black_default = data[["Score","Black (Bad)"]]
hispanic_non_default = data[["Score","Hispanic (Good)"]]
hispanic_default = data[["Score","Hispanic (Bad)"]]
asian_non_default = data[["Score","Asian (Good)"]]
asian_default = data[["Score","Asian (Bad)"]]

NameError: name 'data' is not defined

In [None]:
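import matplotlib.pyplot as plt

def getGraph(dataset, metricName, graphType):
    i = 0
    x = []
    y = []
    # The total does not depend on the threshold, so compute it once
    total_race_non_default = dataset[metricName].sum()
    while i < 100.5:
        # Skip the scores that are missing from the dataset
        if i == 72.5 or i == 77.5 or i == 92.5:
            i = i + 0.5
        # Share of this group's metricName total at or above threshold i
        curr_race_non_default = dataset[dataset["Score"] >= i][metricName].sum()
        yVal = curr_race_non_default / total_race_non_default
        x.append(i)
        y.append(yVal)
        i = i + 0.5
    plt.plot(x, y, graphType, label=metricName)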

## Visualizing This Data:
#### Now that we have data on how many people default and do not default at each score, we can compare the groups. In theory, the probability of a non-defaulter being approved for a loan ($\hat Y$ = 1) should be the same across all four groups. The graph shows that this is not the case: a non-defaulting person from the Black demographic group is much less likely to be approved than a non-defaulting white or Asian person.

In [None]:
import matplotlib.lines as mlines
import matplotlib.pyplot as plt

getGraph(asian_non_default, "Asian (Good)", 'b-')
getGraph(white_non_default, "Non- Hispanic white (Good)", 'g-')
getGraph(black_non_default, "Black (Good)", 'c-')
getGraph(hispanic_non_default, "Hispanic (Good)", 'm-')
# Raw string so that \h in $\hat Y$ is not treated as an escape sequence
plt.title(r"Probability of Non-Defaulters Getting $\hat Y$ = 1 (Beneficial Outcome)")


blue_line = mlines.Line2D([], [], color='blue', marker='.',
                          markersize=15, label='Asian')
green_line = mlines.Line2D([], [], color='green', marker='.',
                          markersize=15, label='White')
cyan_line = mlines.Line2D([], [], color='cyan', marker='.',
                          markersize=15, label='Black')
# 'm-' above draws the Hispanic curve in magenta, so the legend handle uses magenta too
magenta_line = mlines.Line2D([], [], color='magenta', marker='.',
                          markersize=15, label='Hispanic')

plt.legend(handles=[blue_line, green_line, cyan_line, magenta_line])

#todo: compare the defaulters getting beneficial outcome too
#facetgrid and seaborn - make variable 'default' or 'non-default' or 'race' for 1 panel for each race
# recall - x
# precision - y

In [None]:
asian_non_default.to_csv("asian-non-default.csv")
asian_default.to_csv("asian-default.csv")
white_non_default.to_csv("white-non-default.csv")
white_default.to_csv("white-default.csv")
black_non_default.to_csv("black-non-default.csv")
black_default.to_csv("black-default.csv")
hispanic_non_default.to_csv("hispanic-non-default.csv")
hispanic_default.to_csv("hispanic-default.csv")

In [None]:
import pickle
pickle.dump(asian_non_default, open("asian-non-default.pkl", "wb"))
pickle.dump(asian_default, open("asian-default.pkl", "wb"))
pickle.dump(white_non_default, open("white-non-default.pkl", "wb"))
pickle.dump(white_default, open("white-default.pkl", "wb"))
pickle.dump(black_non_default, open("black-non-default.pkl", "wb"))
pickle.dump(black_default, open("black-default.pkl", "wb"))
pickle.dump(hispanic_non_default, open("hispanic-non-default.pkl", "wb"))
pickle.dump(hispanic_default, open("hispanic-default.pkl", "wb"))


## Precision / Recall Analysis

##### Recall: Of all non-defaulters, how many did we correctly identify as non-defaulters (that is, give the beneficial outcome)?
- true positives / (true positives + false negatives)
- correctly predicted non-defaulters / all non-defaulters

##### Precision: Of the people we predicted to be non-defaulters (that is, gave the beneficial outcome), how many were actually non-defaulters?
- true positives / (true positives + false positives)
- correctly identified non-defaulters / all predicted non-defaulters
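
As a quick worked example of the two ratios above, here is a minimal sketch. The helper name `precision_recall` and the counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from raw confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of approved applicants who repay
    recall = tp / (tp + fn)     # share of repaying applicants who were approved
    return precision, recall


# Hypothetical counts: 80 non-defaulters approved (TP),
# 20 defaulters approved (FP), 40 non-defaulters rejected (FN).
precision, recall = precision_recall(tp=80, fp=20, fn=40)
```

With these counts, precision is 80 / 100 and recall is 80 / 120.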



##### Notes:
- true: non-defaulting (they are a good candidate for a loan)
- good precision: of the people we predict will pay the loan back, a high percentage actually do
- poor precision: of the people we predict will pay the loan back, a low percentage actually do

*** This is why banks look for good precision: it lowers their risk and cost. Non-discriminatory models attempt to shift the burden of this cost away from the minority/protected group and onto the data scientist, who must build more accurate models.

- recall: of the people who would pay the loan back, how many did we correctly identify
- F-1 score: combines precision and recall; weighted variants can favor precision or recall depending on which is more important
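
The weighting mentioned in the last note is usually expressed with the F-beta score, of which F1 is the special case beta = 1. The sketch below uses the standard formula; the function name `f_beta` and the sample precision and recall values are assumptions for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily; beta > 1 weights recall more heavily.
    """
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical values: precision 0.8, recall 2/3.
f1 = f_beta(0.8, 2 / 3)                # balanced
f_half = f_beta(0.8, 2 / 3, beta=0.5)  # favors precision
f_two = f_beta(0.8, 2 / 3, beta=2.0)   # favors recall
```

Because precision is the higher of the two values here, the precision-weighted score comes out above F1 and the recall-weighted score comes out below it.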

## Cost Analysis / Problem Space
- Cost that accompanies low precision
- What kinds of companies might risk placing cost on discriminated-against groups (even by accident)
- What data sets and attributes are most commonly 'protected' and what kinds of models need to be re-trained to fit ethical platforms
- How non-discriminatory supervised learning models can come into play on already-trained data

## Possible Solutions
(Non-Discriminatory Supervised Learning Models)
- Explanation of non-discriminatory supervised learning models
- Max Profit Classifier, Race Blind Classifier, Demographic Parity Classifier, Equal Opportunity Classifier, Equal Odds Classifier
- Pros/Cons of using each one
- Examples of when one might be better over another
- Recommendations for this dataset

## Part 1: Measuring Performance on Binary Classifiers
While there are many ways to calculate the performance of a binary predictor, two methods are particularly useful for fairness models:
<ul>
<li><i>Sensitivity</i>:
<br/> - True Positive Rate
<br/> - Among all of the actual 1's, what percentage did we predict were 1?
</li>
<li><i>Specificity</i>:
<br/> - True Negative Rate
<br/> - Among all of the actual 0's, what percentage did we predict were 0?
</li>
</ul>
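
To make the two definitions concrete, here is a minimal sketch that computes both rates from lists of actual and predicted labels. The function name `sensitivity_specificity` and the label lists are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity


# Hypothetical labels: four actual 1's and four actual 0's.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
```

Here three of the four actual 1's are predicted as 1, and two of the four actual 0's are predicted as 0.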

In [1]:
def getGraphData(dataset, graphType):
    i= 0
    x = []
    y = []
    while(i < 100.5):
        # our dataset doesn't include these scores so this line is necessary
        if(i == 72.5 or i == 77.5 or i == 92.5):
            i = (i + 0.5)
        # create and append the x and y values to the x and y arrays to be returned for the plot here:
        curr_race_non_default = dataset[dataset["TransRisk Score"] >= i]["Good"].sum()
        total_race_non_default = dataset["Good"].sum()
        yVal = curr_race_non_default / total_race_non_default
        x.append(i)
        y.append(yVal)
        i = (i + 0.5)
    plt.plot(x, y, graphType)

## Next Steps:

Ideally we would use these four different values as each group's threshold during the final decision process. As mentioned before, the Equality of Opportunity model requires the same sensitivity of all groups for its fairness requirements. Using these thresholds would satisfy those requirements, and allow us to label this predictor as Fair Under Equalized Opportunity.
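
A minimal sketch of how per-group thresholds might be chosen so that every group reaches at least a common target sensitivity. This is not the notebook's own method: the function name `threshold_for_sensitivity`, the groups `A` and `B`, the counts, and the target of 0.75 are all assumptions for illustration.

```python
import pandas as pd


def threshold_for_sensitivity(scores, good_counts, target):
    """Highest score threshold t such that the share of non-defaulters
    with score >= t is at least `target`."""
    df = pd.DataFrame({"score": scores, "good": good_counts})
    df = df.sort_values("score", ascending=False)
    # Share of all non-defaulters scoring at or above each score.
    sensitivity = df["good"].cumsum() / df["good"].sum()
    # Rows are in descending score order, so the first match is the highest.
    return df.loc[sensitivity >= target, "score"].iloc[0]


# Hypothetical non-defaulter counts per score for two made-up groups.
scores = [10, 20, 30, 40, 50]
good_by_group = {
    "A": [10, 10, 20, 30, 30],
    "B": [30, 30, 20, 10, 10],
}

thresholds = {
    group: threshold_for_sensitivity(scores, counts, target=0.75)
    for group, counts in good_by_group.items()
}
```

With data this coarse, the achieved sensitivities only meet the target rather than match each other exactly (0.8 for group A and 1.0 for group B here); finer score steps would bring them closer together.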

For our case study, the predictor that achieves fairness under Equalized Opportunity would have these thresholds for each demographic group: