# Created in this Notebook:

** P(NonDefault | race) **
- ProbNonDefaultGivenRace

** P(NonDefault | Score=x, race) **
- ProbNonDefaultGivenScoreEqualsXAndRace

** P(Default | Score=x, race) **
- ProbDefOrNotGivenScoreEqualsXAndRace

** P(race) **
- ProbOfBeingRace

** P(Score=x | race) **
- ProbScoreEqualsXGivenRace

** P(Score=x & NonDefault & race) **
- ProbScoreEqualsXAndGoodAndRace

** P(NonDefault & race) **
- ProbNonDefaultAndRace

** P(Score=x | NonDefault, race) **
- ProbScoreEqualsXGivenNonDefaultAndRace

** P(Score>=x | NonDefault, race) **
- ProbScoreGreaterThanXGivenNonDefaultAndRace
- to obtain this, I calculated:
    - <b>P(score=x & NonDefault & race)</b>
    - <b>P(race & score=x) </b>
    - <b>P(NonDefault and race)</b>

** P(NonDefault | Score>=x, race) **
- ProbNonDefaultGivenScoreGreaterXAndRace
- This is supposed to yield the same results as ProbLoanReceiverIsGood.csv
- Small rounding errors cause this dataset to produce different final results; the cause is still unknown

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline

### Figure 3A:
https://www.federalreserve.gov/boarddocs/rptcongress/creditscore/figtables3.htm#d3A
    
### Figure 7A:
https://www.federalreserve.gov/boarddocs/rptcongress/creditscore/figtables7.htm#d7A

In [2]:
# Figure7A-fixed.csv is the same as Figure 7A; only the column names were changed from good/bad to NonDefault/Default
CumulativePercentageByDefaulters = pd.read_csv("Figure7A-fixed.csv")
CumulativePercentageByDemographic = pd.read_csv("Figure3A.csv")

<hr/>

# Calculating P(NonDefault | race) - pi values
- Named ProbNonDefaultGivenRace.csv
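
The pi calculation below appears to rest on the mixture identity P(Score=x | race) = pi * P(Score=x | NonDefault, race) + (1 - pi) * P(Score=x | Default, race), solved for pi. A minimal sketch with made-up numbers (all three input probabilities are hypothetical and are not taken from the Fed tables):

```python
# Hypothetical inputs for one demographic group at one score bin.
p_score_given_nondefault = 0.02   # P(Score=x | NonDefault, race), assumed
p_score_given_default = 0.10      # P(Score=x | Default, race), assumed
true_pi = 0.75                    # P(NonDefault | race), assumed

# Mixture identity: the group's score distribution is a pi-weighted blend.
p_score = true_pi * p_score_given_nondefault + (1 - true_pi) * p_score_given_default

# Solving the identity for pi gives (total - bad) / (good - bad).
recovered_pi = (p_score - p_score_given_default) / (
    p_score_given_nondefault - p_score_given_default
)
print(round(recovered_pi, 6))
```

With exact inputs every score bin recovers the same pi. With rounded published percentages the ratio can vary from bin to bin, and this notebook reads the value at index 50.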

In [3]:
def getPisNonCumulative(dataset, raceSet, goodName, badName):
    # Per-score pi = P(NonDefault | race), solved from the mixture identity
    # total = pi * good + (1 - pi) * bad, using the de-cumulated (diff) columns.
    good = dataset.set_index("Score")[goodName].diff().fillna(value=0)
    bad = dataset.set_index("Score")[badName].diff().fillna(value=0)
    total = raceSet.set_index("Score")["Percentage"].diff().fillna(value=0)
    return ((total - bad) / (good - bad)).fillna(value=0)

In [4]:
# Get the dataframe that holds the cumulative percentage, by demographic group
# This function is just for reorganizing the given data
def getPD(data, col, raceName):
    # Use a local name other than `pd` so the pandas module is not shadowed
    frame = data["Score"].to_frame(name="Score")
    race = np.full(len(data), raceName)
    frame["Demographic"] = race
    frame["Percentage"] = data[col]
    return frame

whites = getPD(CumulativePercentageByDemographic, "White", "white")
blacks = getPD(CumulativePercentageByDemographic, "Black", "black")
asians = getPD(CumulativePercentageByDemographic, "Asian", "asian")
hispanics = getPD(CumulativePercentageByDemographic, "Hispanic", "hispanic")

In [5]:
whitePi = getPisNonCumulative(CumulativePercentageByDefaulters, whites, "White (NonDefault)", "White (Default)")
blackPi = getPisNonCumulative(CumulativePercentageByDefaulters, blacks, "Black (NonDefault)", "Black (Default)")
asianPi = getPisNonCumulative(CumulativePercentageByDefaulters, asians, "Asian (NonDefault)", "Asian (Default)")
hispanicPi = getPisNonCumulative(CumulativePercentageByDefaulters, hispanics, "Hispanic (NonDefault)", "Hispanic (Default)")

In [6]:
whitePi[50], blackPi[50], asianPi[50], hispanicPi[50]

(0.77551020408163152,
 0.29896907216495305,
 0.80645161290323375,
 0.54545454545453709)

In [7]:
# Hardcoded pi values to account for rounding error.
# Currently, the hardcoded rounded values give results closest to the correct solution.

# pis = [0.759185,0.315164,0.550595,0.80066]

# These are the real values
pis = [whitePi[50], blackPi[50], hispanicPi[50], asianPi[50]]
ProbNonDefaultGivenRace = pd.DataFrame(data=[pis], columns=['white', 'black', 'hispanic', 'asian'])
ProbNonDefaultGivenRace.set_index("white").to_csv("ProbNonDefaultGivenRace.csv")

<hr/>

# Calculating P(NonDefault | Score=x, race) and P(Default | Score=x, race)

<h2 align='center'>$\frac{\pi \cdot P(Score=x \mid NonDefault)}{\pi \cdot P(Score=x \mid NonDefault) + (1-\pi) \cdot P(Score=x \mid Default)}$</h2>
- from our discussion / email about Bayes' Rule on April 10th
- CumulativePercentageByDefaulters (from Figure 7A) gives us the cumulative distributions P(Score<=x | NonDefault) and P(Score<=x | Default).
- https://www.federalreserve.gov/boarddocs/rptcongress/creditscore/figtables7.htm#d7A
- Using diff(), we can obtain P(Score=x|NonDefault) and P(Score=x|Default).
- Together with the pi values calculated above, we have everything needed to calculate P(NonDefault | Score=x).
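
As a small illustration of the Bayes' rule expression above, here is a sketch for a single score bin. The function name `posterior_nondefault` and the input numbers are hypothetical; returning 0.0 for an empty bin mirrors the notebook's use of `fillna(value=0)`.

```python
def posterior_nondefault(pi, p_score_given_nondefault, p_score_given_default):
    """P(NonDefault | Score=x) via Bayes' rule; 0.0 when the score bin is empty."""
    numerator = pi * p_score_given_nondefault
    denominator = numerator + (1 - pi) * p_score_given_default
    return numerator / denominator if denominator else 0.0


# Hypothetical inputs: pi = 0.75, P(Score=x | NonDefault) = 0.02, P(Score=x | Default) = 0.10
example = posterior_nondefault(0.75, 0.02, 0.10)
print(example)

# P(Default | Score=x) is the complement of the posterior.
print(1 - example)
```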

In [8]:
def getProbDefOrNotGivenScoreEqualsX(dataset, names, pis):
    probabilities = pd.DataFrame(index=dataset.index)
    for i in range(len(names)):
        nonDefault = dataset[names[i] + ' (NonDefault)'].diff().fillna(value=0)
        default = dataset[names[i] + ' (Default)'].diff().fillna(value=0)
        finalProbability = (
            (pis[i] * nonDefault) / ((pis[i] * nonDefault) + ((1 - pis[i])*(default))))
        probabilities['P(NonDefault|Score=x, ' + names[i] + ')'] = finalProbability.fillna(value=0)
        probabilities['P(Default|Score=x, ' + names[i] + ')'] = (1 - finalProbability.fillna(value=0))
    probabilities['Score'] = dataset.index
    return probabilities.set_index('Score')

In [9]:
ProbDefOrNotGivenScoreEqualsXAndRace = getProbDefOrNotGivenScoreEqualsX(
    CumulativePercentageByDefaulters.set_index('Score'),
    ["White", "Black", "Hispanic", "Asian"],
    pis)

In [10]:
ProbDefOrNotGivenScoreEqualsXAndRace.head()

| Score | P(NonDefault\|Score=x, White) | P(Default\|Score=x, White) | P(NonDefault\|Score=x, Black) | P(Default\|Score=x, Black) | P(NonDefault\|Score=x, Hispanic) | P(Default\|Score=x, Hispanic) | P(NonDefault\|Score=x, Asian) | P(Default\|Score=x, Asian) |
|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 0.5 | 0.0 | 1.0 | 0.004516 | 0.995484 | 0.020619 | 0.979381 | 0.0 | 1.0 |
| 1.0 | 0.025629 | 0.974371 | 0.021857 | 0.978143 | 0.011168 | 0.988832 | 0.06068 | 0.93932 |
| 1.5 | 0.027318 | 0.972682 | 0.018501 | 0.981499 | 0.0 | 1.0 | 0.0 | 1.0 |
| 2.0 | 0.042152 | 0.957848 | 0.020513 | 0.979487 | 0.025424 | 0.974576 | 0.104603 | 0.895397 |


<hr/>

# Creating P(race)
- taken from Table 9. 
- https://www.federalreserve.gov/boarddocs/rptcongress/creditscore/datamodel_tables.htm

In [11]:
sizes = [133165, 18274, 14702, 7906]
total = sum(sizes)
ProbOfBeingRace = pd.DataFrame(
    { 
    'Demographic' : ['white', 'black', 'hispanic', 'asian'],
    'P(race)' : [sizes[0]/total, sizes[1]/total, sizes[2]/total, sizes[3]/total]
    },
    columns=["Demographic", "P(race)"]
)

ProbOfBeingRace.set_index('Demographic').to_csv('ProbOfBeingRace.csv')

<hr/>

# Creating P(Score=x | race)
- taken from Figure 3A.
- https://www.federalreserve.gov/boarddocs/rptcongress/creditscore/figtables3.htm#d3A
- using the diff values
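
To make the "diff values" step concrete, here is a plain-Python sketch of what `diff().fillna(0) / 100` does to one column of cumulative percentages. The helper name `cumulative_to_pmf` is hypothetical, and the cumulative values are illustrative: they were chosen so the differences match the first rows of the White column printed by this notebook, with the starting value of 0.0 being an assumption.

```python
def cumulative_to_pmf(cumulative_percent):
    """Mirror of Series.diff().fillna(0) / 100 for a plain list."""
    pmf = [0.0]  # the first row has no predecessor, so it is filled with 0
    for previous, current in zip(cumulative_percent, cumulative_percent[1:]):
        pmf.append((current - previous) / 100)
    return pmf


# Illustrative cumulative percentages (assumed, not read from Figure3A.csv)
cumulative = [0.0, 0.25, 1.15, 1.42, 1.79]
pmf = cumulative_to_pmf(cumulative)
print([round(p, 4) for p in pmf])
```

Note that this construction assigns zero mass to the first row whatever its cumulative value is, so the probabilities sum to (last - first) / 100 and can fall slightly short of 1.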

In [12]:
# used to be ProbOfBeingScore
# I'll give them this in a csv
ProbScoreEqualsXGivenRace = (
    pd.read_csv("Figure3A.csv")
    .set_index(["Score"]).diff().fillna(0) / 100
)

ProbScoreEqualsXGivenRace.to_csv('ProbScoreEqualsXGivenRace.csv')

In [13]:
ProbScoreEqualsXGivenRace.head()

| Score | White | Black | Hispanic | Asian |
|---|---|---|---|---|
| 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0.5 | 0.0025 | 0.0112 | 0.0046 | 0.0013 |
| 1.0 | 0.009 | 0.0414 | 0.0175 | 0.0075 |
| 1.5 | 0.0027 | 0.0114 | 0.0052 | 0.0019 |
| 2.0 | 0.0037 | 0.0142 | 0.0075 | 0.0025 |


In [14]:
# Cross checking to ensure it matches the old dataset from the calculations done in the winter
ProbFromOther = pd.read_csv('ProbScoreEqualsXGivenRace-old.csv').set_index("TransRisk Score")
ProbFromOther.index.names = ["Score"]

for race in ["White", "Black", "Hispanic", "Asian"]:
    print(race, (ProbFromOther[race.lower()] - ProbScoreEqualsXGivenRace[race]).abs().sum())

White 5.692061405548898e-17
Black 4.7488055154865094e-17
Hispanic 5.800481622797449e-17
Asian 5.117434254131581e-17


<hr/>

# Calculating P(score=x | NonDefault, race)

### P(score=x | NonDefault, race) = P(score=x & NonDefault & race) / P(NonDefault and race)
<hr/>
### Step 1:
### P(score=x & NonDefault & race ) = P(race & score=x) * P(NonDefault | race, score=x)
###  P(score=x & NonDefault & race ) = P(race) * P(score=x | race) * P(NonDefault | race, score=x)
<hr/>
### Step 2:
### P(NonDefault and race) = P(race) * P(NonDefault | race)
<hr/>
### Step 3:
### Step 1 / Step 2
### P(score=x | NonDefault, race) = P(score=x & NonDefault & race) / P(NonDefault and race)
<hr/>
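
The three steps above can be sketched on a toy example with one group and three score bins. Every number here is hypothetical. In this sketch Step 2 is obtained by summing Step 1 over all scores, which equals P(race) * P(NonDefault | race) only when the pi value is consistent with the per-score posteriors.

```python
# Hypothetical inputs for a single group with three score bins.
p_race = 0.2                               # P(race), assumed
p_score_given_race = [0.5, 0.3, 0.2]       # P(Score=x | race), assumed
p_nd_given_score_race = [0.2, 0.6, 0.9]    # P(NonDefault | Score=x, race), assumed

# Step 1: P(Score=x & NonDefault & race)
joint = [
    p_race * p_score * p_nd
    for p_score, p_nd in zip(p_score_given_race, p_nd_given_score_race)
]

# Step 2: P(NonDefault & race), here by marginalizing Step 1 over the scores
p_nd_and_race = sum(joint)

# The implied pi = P(NonDefault | race) that makes Step 2 consistent
implied_pi = p_nd_and_race / p_race

# Step 3: P(Score=x | NonDefault, race) = Step 1 / Step 2
p_score_given_nd_race = [value / p_nd_and_race for value in joint]

print(round(implied_pi, 6), round(sum(p_score_given_nd_race), 6))
```

If a different pi is plugged into Step 2, the Step 3 values no longer sum to 1, which gives a quick consistency check on the result.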

## Step 1: Calculate P(Score=x & NonDefault & race)

In [15]:
ProbOfBeingRace.set_index('Demographic', inplace=True)

In [16]:
ProbScoreEqualsXGivenRace.head()

| Score | White | Black | Hispanic | Asian |
|---|---|---|---|---|
| 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0.5 | 0.0025 | 0.0112 | 0.0046 | 0.0013 |
| 1.0 | 0.009 | 0.0414 | 0.0175 | 0.0075 |
| 1.5 | 0.0027 | 0.0114 | 0.0052 | 0.0019 |
| 2.0 | 0.0037 | 0.0142 | 0.0075 | 0.0025 |


In [17]:
ProbRaceAndScoreEqualsX = pd.DataFrame({
    'white': ProbOfBeingRace.loc['white', 'P(race)'] * ProbScoreEqualsXGivenRace['White'],
    'asian': ProbOfBeingRace.loc['asian', 'P(race)'] * ProbScoreEqualsXGivenRace['Asian'],
    'black': ProbOfBeingRace.loc['black', 'P(race)'] * ProbScoreEqualsXGivenRace['Black'],
    'hispanic': ProbOfBeingRace.loc['hispanic', 'P(race)'] * ProbScoreEqualsXGivenRace['Hispanic'],
})

In [18]:
# Cross checking to ensure it matches the old dataset from the calculations done in the winter
ProbFromOther = pd.read_csv('ProbRaceAndScoreEqualsX.csv').set_index("TransRisk Score")
ProbFromOther.index.names = ["Score"]
for race in ["white", "black", "hispanic", "asian"]:
    print(race, (ProbFromOther[race.lower()] - ProbRaceAndScoreEqualsX[race]).abs().sum())

white 3.5344990823027445e-17
black 5.89534931288993e-18
hispanic 4.929731753020028e-18
asian 2.778268066994105e-18


In [19]:
ProbNonDefaultGivenScoreEqualsXAndRace = (
    ProbDefOrNotGivenScoreEqualsXAndRace[['P(NonDefault|Score=x, White)',
                                          'P(NonDefault|Score=x, Black)',
                                          'P(NonDefault|Score=x, Hispanic)',
                                          'P(NonDefault|Score=x, Asian)']]
    .copy())  # copy so later column renames act on an independent frame

In [20]:
ProbNonDefaultGivenScoreEqualsXAndRace.head()

| Score | P(NonDefault\|Score=x, White) | P(NonDefault\|Score=x, Black) | P(NonDefault\|Score=x, Hispanic) | P(NonDefault\|Score=x, Asian) |
|---|---|---|---|---|
| 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0.5 | 0.0 | 0.004516 | 0.020619 | 0.0 |
| 1.0 | 0.025629 | 0.021857 | 0.011168 | 0.06068 |
| 1.5 | 0.027318 | 0.018501 | 0.0 | 0.0 |
| 2.0 | 0.042152 | 0.020513 | 0.025424 | 0.104603 |


In [21]:
ProbScoreEqualsXAndGoodAndRace = pd.DataFrame(index=ProbRaceAndScoreEqualsX.index)
ProbScoreEqualsXAndGoodAndRace['P(Score=x and NonDefault and White)'] = (
    ProbRaceAndScoreEqualsX['white'] * ProbNonDefaultGivenScoreEqualsXAndRace['P(NonDefault|Score=x, White)'])
ProbScoreEqualsXAndGoodAndRace['P(Score=x and NonDefault and Black)'] = (
    ProbRaceAndScoreEqualsX['black'] * ProbNonDefaultGivenScoreEqualsXAndRace['P(NonDefault|Score=x, Black)'])
ProbScoreEqualsXAndGoodAndRace['P(Score=x and NonDefault and Hispanic)'] = (
    ProbRaceAndScoreEqualsX['hispanic'] * ProbNonDefaultGivenScoreEqualsXAndRace['P(NonDefault|Score=x, Hispanic)'])
ProbScoreEqualsXAndGoodAndRace['P(Score=x and NonDefault and Asian)'] = (
    ProbRaceAndScoreEqualsX['asian'] * ProbNonDefaultGivenScoreEqualsXAndRace['P(NonDefault|Score=x, Asian)'])
# I'll probably give them this in a CSV
ProbScoreEqualsXAndGoodAndRace.head()

| Score | P(Score=x and NonDefault and White) | P(Score=x and NonDefault and Black) | P(Score=x and NonDefault and Hispanic) | P(Score=x and NonDefault and Asian) |
|---|---|---|---|---|
| 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0.5 | 0.0 | 5e-06 | 8e-06 | 0.0 |
| 1.0 | 0.000176 | 9.5e-05 | 1.7e-05 | 2.1e-05 |
| 1.5 | 5.6e-05 | 2.2e-05 | 0.0 | 0.0 |
| 2.0 | 0.000119 | 3.1e-05 | 1.6e-05 | 1.2e-05 |


In [22]:
# This column renaming is here only for comparisons with the old data frames
ProbScoreEqualsXAndGoodAndRace.columns=['white', 'black', 'hispanic', 'asian']
ProbNonDefaultGivenScoreEqualsXAndRace.columns=['white', 'black', 'hispanic', 'asian']

In [23]:
# Cross checking to ensure it matches the old dataset from the calculations done in the winter
# This one is going to differ because the old datasets used the rounded pi values for half of the 
# calculations and the real pi values for the rest of them. Here is where the differences begin.
ProbFromOther = pd.read_csv("ProbGoodGivenRaceAndScoreEqualsX.csv").set_index("TransRisk Score")
ProbFromOther.index.names = ["Score"]
for race in ["white", "black", "hispanic", "asian"]:
    print(race, (ProbFromOther[race] - ProbNonDefaultGivenScoreEqualsXAndRace[race]).abs().sum())

white 6.553785292240377e-15
black 3.6012859361278515e-15
hispanic 4.812122922359663e-15
asian 5.856426454897701e-15


In [24]:
# Cross checking to ensure it matches the old dataset from the calculations done in the winter
# This one is going to differ because the old datasets used the rounded pi values for half of the 
# calculations and the real pi values for the rest of them.
ProbFromOther = pd.read_csv('P(score=xandgoodandrace).csv').set_index("TransRisk Score")
ProbFromOther.index.names = ["Score"]

for race in ["white", "black", "hispanic", "asian"]:
    print(race, (ProbFromOther[race] - ProbScoreEqualsXAndGoodAndRace[race]).abs().sum())

white 4.909402962285925e-17
black 2.6715419156400633e-18
hispanic 3.3068166260807885e-18
asian 3.0819293785847718e-18


In [25]:
# Renaming columns again now that comparison with old datasets is done
ProbScoreEqualsXAndGoodAndRace.columns=['P(Score=x and NonDefault and White)',
                                        'P(Score=x and NonDefault and Black)',
                                        'P(Score=x and NonDefault and Hispanic)',
                                        'P(Score=x and NonDefault and Asian)']

## Step 2: Calculate P(NonDefault and race)

In [26]:
#Error checking:

# At this point, the old tutorial switched to using the rounded values (even though everything
# prior was done with the REAL values)
# If I switch to using the ROUNDED (hard coded) values here, I match the old datasets, BUT I get the wrong solution...
# In conclusion:
# What yields the closest solution currently is choosing the ROUNDED (hard coded) values for all of the
# calculations, though it does make the Asian threshold higher than the White threshold in the final answer
# From here on out, the datasets from the old and new are going to differ because of this.

pd.read_csv("ProbGoodGivenRace.csv").head()

| | white | black | hispanic | asian |
|---|---|---|---|---|
| 0 | 0.759185 | 0.315164 | 0.550595 | 0.80066 |


In [27]:
ProbNonDefaultGivenRace.head()

| | white | black | hispanic | asian |
|---|---|---|---|---|
| 0 | 0.77551 | 0.298969 | 0.545455 | 0.806452 |


## Uncomment the line of code below to see the old and new datasets match after switching to the rounded pi values halfway through:

In [28]:
# ProbNonDefaultGivenRace = pd.read_csv("ProbGoodGivenRace.csv")

# ------------------------------

In [29]:
ProbNonDefaultAndRace = pd.DataFrame({
    'white': ProbOfBeingRace.loc['white', 'P(race)'] * ProbNonDefaultGivenRace['white'],
    'asian': ProbOfBeingRace.loc['asian', 'P(race)'] * ProbNonDefaultGivenRace['asian'],
    'black': ProbOfBeingRace.loc['black', 'P(race)'] * ProbNonDefaultGivenRace['black'],
    'hispanic': ProbOfBeingRace.loc['hispanic', 'P(race)'] * ProbNonDefaultGivenRace['hispanic'],
})

In [30]:
# I'll give them this csv I think
ProbNonDefaultAndRace

| | asian | black | hispanic | white |
|---|---|---|---|---|
| 0 | 0.036633 | 0.03139 | 0.046075 | 0.59335 |


## Step 3: Calculate P(Score=x | NonDefault, race)

In [31]:
ProbScoreEqualsXGivenNonDefaultAndRace = pd.DataFrame({
    'white': ProbScoreEqualsXAndGoodAndRace['P(Score=x and NonDefault and White)'] / ProbNonDefaultAndRace['white'].values[0],
    'asian': ProbScoreEqualsXAndGoodAndRace['P(Score=x and NonDefault and Asian)'] / ProbNonDefaultAndRace['asian'].values[0],
    'black': ProbScoreEqualsXAndGoodAndRace['P(Score=x and NonDefault and Black)'] / ProbNonDefaultAndRace['black'].values[0],
    'hispanic': ProbScoreEqualsXAndGoodAndRace['P(Score=x and NonDefault and Hispanic)'] / ProbNonDefaultAndRace['hispanic'].values[0],
})


ProbScoreEqualsXGivenNonDefaultAndRace.head()

| Score | asian | black | hispanic | white |
|---|---|---|---|---|
| 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0.5 | 0.0 | 0.000169 | 0.000174 | 0.0 |
| 1.0 | 0.000564 | 0.003027 | 0.000358 | 0.000297 |
| 1.5 | 0.0 | 0.000705 | 0.0 | 9.5e-05 |
| 2.0 | 0.000324 | 0.000974 | 0.00035 | 0.000201 |


In [32]:
# Cross checking to ensure it matches the old dataset from the calculations done in the winter
# This one is going to differ because the old datasets used the rounded pi values for half of the 
# calculations and the real pi values for the rest of them.
ProbFromOther = pd.read_csv('ProbScoreEqualsXGivenGoodAndRace.csv').set_index("TransRisk Score")
ProbFromOther.index.names = ["Score"]
for race in ["white", "black", "hispanic", "asian"]:
    print(race, (ProbFromOther[race.lower()] - ProbScoreEqualsXGivenNonDefaultAndRace[race]).abs().sum())

white 0.021050924162967805
black 0.05416915666916594
hispanic 0.009424162798668428
asian 0.007181599105042662


<hr/>

# Calculating P(Score>=x | NonDefault, race)

In [33]:
ProbScoreGreaterThanXGivenNonDefaultAndRace = ProbScoreEqualsXGivenNonDefaultAndRace.iloc[::-1].cumsum()[::-1]
ProbScoreGreaterThanXGivenNonDefaultAndRace.head()

| Score | asian | black | hispanic | white |
|---|---|---|---|---|
| 0.0 | 0.992818 | 1.054168 | 1.009424 | 0.978949 |
| 0.5 | 0.992818 | 1.054168 | 1.009424 | 0.978949 |
| 1.0 | 0.992818 | 1.053998 | 1.00925 | 0.978949 |
| 1.5 | 0.992254 | 1.050972 | 1.008892 | 0.978652 |
| 2.0 | 0.992254 | 1.050266 | 1.008892 | 0.978557 |
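
A reverse cumulative sum of a proper conditional distribution should equal 1 at the lowest score, but the first row above does not. One possible explanation, offered as an observation, not a confirmed diagnosis: the first-row values are numerically close to the ratio of the ProbGoodGivenRace.csv values to the index-50 pi values, both as printed earlier in this notebook. The check below uses only those printed (6-decimal) numbers.

```python
# Values copied from outputs printed earlier in this notebook (6 decimals).
pi_from_old_csv = {"white": 0.759185, "black": 0.315164,
                   "hispanic": 0.550595, "asian": 0.80066}
pi_at_index_50 = {"white": 0.77551, "black": 0.298969,
                  "hispanic": 0.545455, "asian": 0.806452}
first_row = {"white": 0.978949, "black": 1.054168,
             "hispanic": 1.009424, "asian": 0.992818}

# Ratio of the two pi estimates for each group.
ratios = {race: pi_from_old_csv[race] / pi_at_index_50[race] for race in first_row}

for race in first_row:
    # Print the ratio next to the observed first-row value for comparison.
    print(race, round(ratios[race], 6), first_row[race])
```

If this reading is right, the totals differ from 1 because the per-score posteriors and the denominator P(NonDefault and race) imply different pi values. This is an inference from the printed numbers only.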


<hr/>

# Calculating P(NonDefault | Score>=x, race)
### Should match ProbLoanReceiverIsGood.csv

<h2 align='center'>$\frac{P(NonDefault \text{ and } Score \geq x \mid race)}{P(Score \geq x \mid race)}$</h2>

** P(NonDefault | Score=x, race) **
- calculated earlier in this notebook
- ProbNonDefaultGivenScoreEqualsXAndRace

** P(NonDefault and Score=x | race) = P(NonDefault | Score=x, race) * P(Score=x | race) **
- ProbNonDefaultGivenScoreEqualsXAndRace * ProbScoreEqualsXGivenRace (calculated above)
- we can use cumsum() on the reverse of this to obtain P(NonDefault and Score>=x | race)

** P(NonDefault and Score>=x | race) **
- ProbNonDefaultAndScoreGreaterXGivenRace
- see above for how to calculate this

** P(Score>=x | race) **
- We already calculated P(Score=x | race), we can use cumsum() on the reverse of ProbScoreEqualsXGivenRace to obtain this
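
The reverse cumulative sum described above can be sketched in plain Python on a toy distribution. The helper name `reverse_cumsum` and all numbers are hypothetical; the helper is intended to behave like `iloc[::-1].cumsum()[::-1]` on a single column.

```python
from itertools import accumulate


def reverse_cumsum(values):
    """Sum from the highest score downward, then restore the original order."""
    return list(accumulate(reversed(values)))[::-1]


# Hypothetical inputs for one group with three score bins (low to high).
p_score = [0.5, 0.3, 0.2]            # P(Score=x | race), assumed
p_nd_given_score = [0.2, 0.6, 0.9]   # P(NonDefault | Score=x, race), assumed

# P(NonDefault and Score=x | race)
p_nd_and_score = [a * b for a, b in zip(p_nd_given_score, p_score)]

# P(NonDefault and Score>=x | race) and P(Score>=x | race)
p_nd_and_score_ge = reverse_cumsum(p_nd_and_score)
p_score_ge = reverse_cumsum(p_score)

# P(NonDefault | Score>=x, race)
p_nd_given_score_ge = [n / d for n, d in zip(p_nd_and_score_ge, p_score_ge)]
print([round(p, 4) for p in p_nd_given_score_ge])
```

In this toy case the conditional probability rises with the threshold because the higher bins have higher non-default rates.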

In [34]:
ProbNonDefaultGivenScoreEqualsXAndRace.head()

| Score | white | black | hispanic | asian |
|---|---|---|---|---|
| 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0.5 | 0.0 | 0.004516 | 0.020619 | 0.0 |
| 1.0 | 0.025629 | 0.021857 | 0.011168 | 0.06068 |
| 1.5 | 0.027318 | 0.018501 | 0.0 | 0.0 |
| 2.0 | 0.042152 | 0.020513 | 0.025424 | 0.104603 |


### Step 1:

** P(NonDefault and Score=x | race) = P(NonDefault | Score=x, race) * P(Score=x | race) **

In [35]:
ProbScoreEqualsXGivenRace.columns = [x.lower() for x in ProbScoreEqualsXGivenRace.columns]
ProbNonDefaultAndScoreEqualsXGivenRace = (
    ProbScoreEqualsXGivenRace[['asian', 'black', 'hispanic', 'white']] * 
    ProbNonDefaultGivenScoreEqualsXAndRace[['asian', 'black', 'hispanic', 'white']])
ProbNonDefaultAndScoreGreaterXGivenRace = (
    ProbNonDefaultAndScoreEqualsXGivenRace.iloc[::-1].cumsum()[::-1])

### Step 2:

Take the cumsum of the reverse of P(Score=x | race) to obtain:
** P(Score>=x | race) **

In [36]:
ProbScoreGreaterXGivenRace = ProbScoreEqualsXGivenRace.iloc[::-1].cumsum()[::-1]

In [37]:
# Cross checking to ensure it matches the old dataset from the calculations done in the winter
ProbFromOther = pd.read_csv("ProbScoreGreaterThanXGivenRace.csv").set_index("TransRisk Score")
ProbFromOther.index.names = ['Score']

for race in ["white", "black", "hispanic", "asian"]:
    print(race, (ProbFromOther[race] - ProbScoreGreaterXGivenRace[race]).abs().sum())

white 2.5400688496990398e-15
black 9.074772183703672e-16
hispanic 1.916869440954372e-15
asian 4.121919819355391e-15


In [38]:
ProbNonDefaultAndScoreGreaterXGivenRace.head()

| Score | asian | black | hispanic | white |
|---|---|---|---|---|
| 0.0 | 0.80066 | 0.315164 | 0.550595 | 0.759185 |
| 0.5 | 0.80066 | 0.315164 | 0.550595 | 0.759185 |
| 1.0 | 0.80066 | 0.315113 | 0.5505 | 0.759185 |
| 1.5 | 0.800205 | 0.314208 | 0.550304 | 0.758954 |
| 2.0 | 0.800205 | 0.313997 | 0.550304 | 0.758881 |


In [39]:
ProbScoreGreaterXGivenRace.head()

| Score | white | black | hispanic | asian |
|---|---|---|---|---|
| 0.0 | 0.9999 | 0.9993 | 0.9999 | 1.0 |
| 0.5 | 0.9999 | 0.9993 | 0.9999 | 1.0 |
| 1.0 | 0.9974 | 0.9881 | 0.9953 | 0.9987 |
| 1.5 | 0.9884 | 0.9467 | 0.9778 | 0.9912 |
| 2.0 | 0.9857 | 0.9353 | 0.9726 | 0.9893 |


### Step 3:

** P(NonDefault | Score>=x, race) = P(NonDefault and Score>=x | race) / P(Score>=x | race) **

In [40]:
ProbNonDefaultGivenScoreGreaterXAndRace = (
    ProbNonDefaultAndScoreGreaterXGivenRace / ProbScoreGreaterXGivenRace)

In [41]:
ProbNonDefaultGivenScoreGreaterXAndRace.head()

| Score | asian | black | hispanic | white |
|---|---|---|---|---|
| 0.0 | 0.80066 | 0.315384 | 0.55065 | 0.759261 |
| 0.5 | 0.80066 | 0.315384 | 0.55065 | 0.759261 |
| 1.0 | 0.801702 | 0.318908 | 0.553099 | 0.761164 |
| 1.5 | 0.807309 | 0.331898 | 0.562799 | 0.767862 |
| 2.0 | 0.80886 | 0.335718 | 0.565808 | 0.76989 |


In [42]:
# Cross checking to ensure it matches the old dataset from the calculations done in the winter
ProbFromOther = pd.read_csv('ProbLoanReceiverIsGood.csv').set_index("TransRisk Score")
ProbFromOther.index.names = ['Score']

for race in ["white", "black", "hispanic", "asian"]:
    print(race, (ProbFromOther[race.lower()] - ProbNonDefaultGivenScoreGreaterXAndRace[race]).abs().sum())

white 1.1324274851176597e-14
black 7.271960811294775e-15
hispanic 2.4202861936828413e-14
asian 9.880984919163893e-15


In [43]:
# This CSV is the one provided in the tutorial
ProbNonDefaultGivenScoreGreaterXAndRace.to_csv('ProbNonDefaultGivenScoreGreaterXAndRace.csv')