# Loan Discrimination Exploration

By: Kenny Tang & Sara Haptonstall

Recently, our team came across an article published by Reveal News that discusses the presence of discrimination in home mortgage lending in today's society. In their analysis, the authors used binary logistic regression to estimate the likelihood of mortgage denial for different minority groups. Their results showed large, significant disparities across 48 different metropolitan areas.

This research was conducted over 5 years ago using 2015 and 2016 mortgage data provided by HMDA. As the years pass and America moves toward a more multiculturally accepting society, our team is curious to see how discrimination affects the mortgage loan market today. We are conducting our analysis in 2022, and the home mortgage data we analyze covers loans approved and rejected in 2021.

More precisely, we would like to know whether discrimination still has a significant effect on home mortgage loan approvals for different minority groups. In addition, to expand upon the work that inspired our analysis, we will also examine disparities in interest rates across minority groups.

To conduct our research, we will first look at all loans across the United States and perform an exploratory analysis of differences in interest rates and approval rates.

Later, we will delve deeper by looking at mortgage loans from Los Angeles, a county historically known for redlining (please see our other notebook for this part). We will use two datasets: a collection of 2021 home mortgage loans provided by HMDA and a redlining grade dataset provided by HOLC. Without further ado, let's get started.

Let's first download and import any dependencies we will use. In addition, some of our datasets are extremely large; therefore, we will utilize PySpark to process the data more efficiently. 

In [1]:
! pip install causalinference
! pip install statsmodels


In [2]:
import pandas as pd
import numpy as np
import altair as alt
import pyspark
from pyspark.sql import SparkSession 
from io import StringIO
from scipy.stats import ttest_ind
from causalinference import CausalModel
import statsmodels.formula.api as smf
import statsmodels.api as sm
alt.themes.enable("fivethirtyeight")

ThemeRegistry.enable('fivethirtyeight')

In [3]:
#pyspark session

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName('My First Spark application') \
    .getOrCreate()
sc = spark.sparkContext

Our mortgage DataFrame is quite large, so we will load it as a PySpark DataFrame. Our HOLC dataset is much smaller, so let's load it into a pandas DataFrame.

In [4]:
df_hm = spark.read.option("header",True) \
     .csv("/work/2021_public_lar.csv")
df_hm.show(2,truncate=False)


In [5]:
holc=pd.read_csv('/work/HOLC_2020_census_tracts/HOLC_2020_census_tracts.csv')
holc.head()

Unnamed: 0,geoid20,class1,class1_lbl,class2,class2_lbl,class2_red,class3,class3_lbl,area_total,area_rated,area_U,area_A,area_B,area_C,area_D
0,1073000100,D,Mainly D,D-C,"Mainly D, some C","Mainly D, some C",D-C-B,"Mainly D, some C, some B",7549580.5,73.293671,26.706331,0.0,4.261451,26.091967,42.94025
1,1073000300,D,Mainly D,D-C,"Mainly D, some C","Mainly D, some C",D-C,"Mainly D, some C",2078504.4,94.276787,5.723211,0.0,0.0,0.040526,94.236259
2,1073000400,D,Mainly D,D-C,"Mainly D, some C","Mainly D, some C",D-C,"Mainly D, some C",7998765.0,46.557659,53.442341,0.0,0.0,10.347183,36.210476
3,1073000500,D,Mainly D,D,Mainly D,Only D,D,Mainly D,4680667.0,64.39016,35.60984,0.0,0.0,0.0,64.39016
4,1073000700,D,Mainly D,D,Mainly D,Only D,D,Mainly D,3520562.8,41.288933,58.711067,0.0,0.0,0.0,41.288933


There is a lot of cleaning we need to do for our mortgage dataset. There are missing values, erroneous values such as negative income, and extreme outliers. In addition, the dataset documents many types of loans, and we need to control for them. For the purpose of our analysis, we restrict the data to conventional loans on single-family homes that are for personal use, excluding applications that were withdrawn or closed as incomplete, among other conditions. Please see our report for more details.

In [6]:
df_hm.createOrReplaceTempView('N_df_view')

In [7]:
def cut_view_red():
    return spark.sql("""\
        SELECT *
        FROM N_df_view
        WHERE derived_loan_product_type = "Conventional:First Lien" AND
        derived_dwelling_category = 'Single Family (1-4 Units):Site-Built' AND
        conforming_loan_limit = "C" AND
        action_taken != 4 AND
        action_taken != 5 AND
        loan_type = 1 AND
        loan_purpose = 1 AND
        lien_status = 1 AND
        reverse_mortgage = 2 AND
        open_end_line_of_credit = 2 AND
        business_or_commercial_purpose = 2 AND
        negative_amortization = 2 AND
        occupancy_type = 1 AND
        total_units = 1 AND
        balloon_payment = 2
        """)

In [8]:
df_hm_cleaned = cut_view_red()

In [9]:
# Take only features I need
df_hm_cleaned = df_hm_cleaned.select('county_code',
                            'derived_ethnicity', 
                            'derived_race', 
                            'derived_sex', 
                            'action_taken', 
                            'loan_purpose', 
                            'business_or_commercial_purpose',
                            'derived_dwelling_category',
                            'loan_amount',
                            'occupancy_type',
                            'combined_loan_to_value_ratio',
                            'interest_rate', 'property_value',
                            'income',
                            'debt_to_income_ratio',
                            'denial_reason_1',
                            'loan_term',
                            'rate_spread')

Next, we will split our dataset by racial and ethnic group so that we can load each group as a pandas DataFrame. This makes some of our computations more efficient and makes comparing select groups easier. To make our lives even easier, we will also merge the derived ethnicity and derived race columns by moving 'Hispanic or Latino' into race, so that all minority groups are recorded in the same column.

In [10]:
df_hm

DataFrame[activity_year: string, lei: string, derived_msa_md: string, state_code: string, county_code: string, census_tract: string, conforming_loan_limit: string, derived_loan_product_type: string, derived_dwelling_category: string, derived_ethnicity: string, derived_race: string, derived_sex: string, action_taken: string, purchaser_type: string, preapproval: string, loan_type: string, loan_purpose: string, lien_status: string, reverse_mortgage: string, open_end_line_of_credit: string, business_or_commercial_purpose: string, loan_amount: string, combined_loan_to_value_ratio: string, interest_rate: string, rate_spread: string, hoepa_status: string, total_loan_costs: string, total_points_and_fees: string, origination_charges: string, discount_points: string, lender_credits: string, loan_term: string, prepayment_penalty_term: string, intro_rate_period: string, negative_amortization: string, interest_only_payment: string, balloon_payment: string, other_nonamortizing_features: string, prop

In [11]:
# Split DataFrame by race and ethnicity
df_hm_white = df_hm_cleaned.select('*').filter(df_hm_cleaned.derived_race =='White').toPandas()
df_hm_asian = df_hm_cleaned.select('*').filter(df_hm_cleaned.derived_race =='Asian').toPandas()


In [12]:
df_hm_black = df_hm_cleaned.select('*').filter(df_hm_cleaned.derived_race =='Black or African American').toPandas()
df_hm_hispanic = df_hm_cleaned.select('*').filter(df_hm_cleaned.derived_ethnicity =='Hispanic or Latino').toPandas()

In [13]:
# Let's merge ethnicity and race by moving Hispanic or Latino over to race.
df_hm_hispanic.loc[df_hm_hispanic.derived_ethnicity == 'Hispanic or Latino', 'derived_race'] = 'Hispanic or Latino'
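Note that the per-group frames above filter on `derived_race` and `derived_ethnicity` independently, so an applicant recorded as both White and Hispanic or Latino lands in both `df_hm_white` and `df_hm_hispanic`. If mutually exclusive groups are wanted, one option is to assign a single label per applicant first. The sketch below is a dependency-free illustration of that rule on toy records; the helper name `merged_group` is hypothetical and not part of this notebook.

```python
def merged_group(derived_race, derived_ethnicity):
    """Return one mutually exclusive group label per applicant.

    Hypothetical helper: ethnicity takes precedence, so Hispanic or Latino
    applicants are labelled as such regardless of their recorded race.
    """
    if derived_ethnicity == 'Hispanic or Latino':
        return 'Hispanic or Latino'
    return derived_race


# Toy records (made up, not taken from the HMDA file)
applicants = [
    ('White', 'Not Hispanic or Latino'),
    ('White', 'Hispanic or Latino'),
    ('Asian', 'Not Hispanic or Latino'),
]
labels = [merged_group(race, eth) for race, eth in applicants]
print(labels)
```

This mirrors the rule applied inside `log_reg_model` later in the notebook, where Hispanic or Latino is moved into the race column before the dummy variables are built.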


### Let's look at Approval Rates by Race

In [14]:
# Compute approval Rates
white_approval_rate = len(df_hm_white[df_hm_white.action_taken == '1'])/len(df_hm_white)
asian_approval_rate = len(df_hm_asian[df_hm_asian.action_taken == '1'])/len(df_hm_asian)
black_approval_rate = len(df_hm_black[df_hm_black.action_taken == '1'])/len(df_hm_black)
hispanic_approval_rate = len(df_hm_hispanic[df_hm_hispanic.action_taken == '1'])/len(df_hm_hispanic)

In [15]:
print(white_approval_rate, asian_approval_rate, black_approval_rate, hispanic_approval_rate)

0.8881990208875556 0.8924153291209722 0.7790991526683514 0.8197981644085979


In [16]:
# Create approval dataframe to plot
approval_df = pd.DataFrame({'ethnicity':['white','asian','black','hispanic'],\
     'approval_rate':[white_approval_rate, asian_approval_rate, black_approval_rate, hispanic_approval_rate]})
approval_df = approval_df.sort_values(by= 'approval_rate', ascending= False)

In [17]:
# Plot approval rates
approval_bar = alt.Chart(approval_df).mark_bar(size = 50).encode(
    x= alt.X('ethnicity', sort='-y', title="Race"),  # sort bars by descending approval rate
    y= alt.Y('approval_rate', scale= alt.Scale(domain=[0,1]), axis=alt.Axis(format='%'))
).properties(width=300,
    height=300,
    title=alt.TitleParams(
            text='Approval Rate'))

text = approval_bar.mark_text(
    align='left',
    baseline='middle',
    dy=-10  # Nudges text upward so it doesn't overlap the top of the bar
).encode(
    text= alt.Text('approval_rate:Q', format='.0%')
)

bar=approval_bar+text
bar.configure_title(fontSize=14).configure(background='#FFFFFF').configure_axis(
    grid=False)



We can see some differences between groups. The Asian group has the highest approval rate, at around 89%, and the White group follows closely with a similar figure. The Hispanic group has a lower approval rate of about 82%, and the Black group comes last at about 78%, roughly 11 percentage points below the White group. That is quite a large difference; however, we cannot conclude that it is attributable to race, as that would require further analysis. We will come back to approval rates in more depth later, but for the time being, let's continue our exploratory analysis.
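To make the size of these gaps concrete, the short sketch below recomputes them from the approval rates printed above (rounded to four decimals). It reports each gap both in percentage points and relative to the White approval rate; the helper name `gap_vs_white` is our own.

```python
# Approval rates printed by the cell above, rounded to four decimals
rates = {'white': 0.8882, 'asian': 0.8924, 'black': 0.7791, 'hispanic': 0.8198}

def gap_vs_white(group):
    """Return (percentage-point gap, relative gap in %) of a group versus white."""
    diff = rates[group] - rates['white']
    return round(diff * 100, 1), round(diff / rates['white'] * 100, 1)

for group in ('asian', 'hispanic', 'black'):
    pp, rel = gap_vs_white(group)
    print(f'{group}: {pp:+.1f} percentage points, {rel:+.1f}% relative')
```

Keeping the two measures separate avoids ambiguity: a gap of about 11 percentage points is a larger gap when expressed relative to the White rate.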

### Let's look at the distribution of debt to income ratio for all approved loans

As mentioned previously, our dataset contains many errors. To filter out some of them, we restricted income to be over $20,000 and under $500,000 (HMDA records income in thousands of dollars, so the code uses 20 and 500). This removes extreme outliers and provides some control over our income variable. In addition to income, we discovered outliers such as a loan with a 100% interest rate. Whether or not this was due to error, we decided to keep only interest rates above 0% and at most 10%. As with income, this removes extreme outliers and provides some control over our interest rate variable.
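The bounds are easy to get wrong because of the units and the mix of strict and inclusive comparisons, so here is a minimal, dependency-free sketch of the rule as a single predicate. The function name `keep_loan` is hypothetical; the thresholds are the ones used in the filtering cell of this notebook.

```python
def keep_loan(interest_rate, income_thousands):
    """Return True if a loan passes the outlier filter described above.

    interest_rate is in percent; income_thousands is income in thousands
    of dollars (so 20 means $20,000).
    """
    return 0 < interest_rate <= 10 and 20 < income_thousands < 500


print(keep_loan(3.1, 85))    # typical loan
print(keep_loan(100, 85))    # 100% interest rate outlier
print(keep_loan(3.1, -5))    # negative income
```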

In [18]:
# Drop missing values
df_hm_white= df_hm_white.replace({'NA':np.nan, 'Exempt':np.nan}).dropna(
    subset=['interest_rate','debt_to_income_ratio','income','rate_spread','combined_loan_to_value_ratio'])
df_hm_asian = df_hm_asian.replace({'NA':np.nan, 'Exempt':np.nan}).dropna(
    subset=['interest_rate','debt_to_income_ratio','income','rate_spread','combined_loan_to_value_ratio'])
df_hm_black = df_hm_black.replace({'NA':np.nan, 'Exempt':np.nan}).dropna(
    subset=['interest_rate','debt_to_income_ratio','income','rate_spread','combined_loan_to_value_ratio'])
df_hm_hispanic = df_hm_hispanic.replace({'NA':np.nan, 'Exempt':np.nan}).dropna(
    subset=['interest_rate','debt_to_income_ratio','income','rate_spread','combined_loan_to_value_ratio'])

# Make separate Dataframe for each group where it only contains approved loans.
df_white_approve = df_hm_white[df_hm_white.action_taken =='1']
df_asian_approve = df_hm_asian[df_hm_asian.action_taken =='1']
df_black_approve = df_hm_black[df_hm_black.action_taken =='1']
df_hispanic_approve = df_hm_hispanic[df_hm_hispanic.action_taken =='1']

# Convert data type for some of our variables.
df_white_approve = df_white_approve.astype({'interest_rate':'float', 'income':'float', 
    'loan_amount':'float','rate_spread':'float','combined_loan_to_value_ratio':'float'})
df_asian_approve = df_asian_approve.astype({'interest_rate':'float', 'income':'float', 
    'loan_amount':'float','rate_spread':'float','combined_loan_to_value_ratio':'float'})
df_hispanic_approve = df_hispanic_approve.astype({'interest_rate':'float', 'income':'float', 
    'loan_amount':'float','rate_spread':'float','combined_loan_to_value_ratio':'float'})
df_black_approve = df_black_approve.astype({'interest_rate':'float', 'income':'float', 
    'loan_amount':'float','rate_spread':'float','combined_loan_to_value_ratio':'float'})

# Let's filter out extreme outliers.
df_white_approve= df_white_approve[(df_white_approve['interest_rate']>0) & 
    (df_white_approve['interest_rate']<=10) &
    (df_white_approve['income']>20) &
    (df_white_approve['income']<500)]
df_asian_approve= df_asian_approve[(df_asian_approve['interest_rate']>0) & 
    (df_asian_approve['interest_rate']<=10) &
    (df_asian_approve['income']>20) &
    (df_asian_approve['income']<500)]
df_hispanic_approve= df_hispanic_approve[(df_hispanic_approve['interest_rate']>0) & 
    (df_hispanic_approve['interest_rate']<=10) &
    (df_hispanic_approve['income']>20) &
    (df_hispanic_approve['income']<500)]
df_black_approve= df_black_approve[(df_black_approve['interest_rate']>0) & 
    (df_black_approve['interest_rate']<=10) &
    (df_black_approve['income']>20) &
    (df_black_approve['income']<500)]


Debt-to-income ratio can be a strong control variable for our group analysis. The ratio is computed from the applicant's monthly debt payments and monthly income, so it can be an accurate measure of an applicant's financial stability. Unfortunately, because of the way the ratio is recorded (a mix of percentage bins and individual percentages), we found that this important variable would not be usable in any regression we conduct. We can, however, aggregate the values into more uniform bins and observe the distribution of debt-to-income ratios among approved loans for each racial group. To do this, we lumped the recorded values into intervals of 10 percentage points and computed each bin's share of loans.

In [19]:
# Create debt to income bins by intervals of 10%
def bins(x):
    if x in ['30%-<36%','36', '37', '38', '39']:
        return '30%-<40%'
    elif x in ['40', '41', '42', '43', '44', '45', '46', '47', '48', '49']:
        return '40%-<50%'
    elif x in ['50%-60%', '>60%']:
        return '>50%'
    else:
        return x

# Create plots of debt to income ratio distribution for approved loans
def debt_to_income_groupby_df(race = 'white'):
    # The race parameter should be a string of one of the following: 'white', 'asian', 'hispanic', 'black'

    if race == 'black':
        df = df_black_approve
        title = 'Black'
    elif race == 'asian':
        df = df_asian_approve
        title = 'Asian'
    elif race == 'hispanic':
        df = df_hispanic_approve
        title = 'Hispanic'
    else:
        df = df_white_approve
        title = 'White'

    # Create bins for our race group (assign returns a copy, so the source DataFrame is not modified)
    df = df.assign(debt_to_income_ratio=df.debt_to_income_ratio.map(bins))
    bin_ct_df= df.groupby('debt_to_income_ratio').agg({'debt_to_income_ratio':'count'})
    bin_ct_df = bin_ct_df.transform(lambda x: round(x/sum(bin_ct_df.debt_to_income_ratio),4))
    bin_ct_df = bin_ct_df.rename(columns={'debt_to_income_ratio':'percentage'}).reset_index().iloc[:6]
    bin_ct_df['race'] = race
    return bin_ct_df
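As a quick check of the binning rule, the sketch below applies the same mapping as `bins` to a handful of made-up raw values and computes each bin's share with the standard library only. The sample values are illustrative and are not drawn from the HMDA file.

```python
from collections import Counter

def bins(x):
    # Same mapping as the bins() function defined above
    if x in ['30%-<36%', '36', '37', '38', '39']:
        return '30%-<40%'
    elif x in ['40', '41', '42', '43', '44', '45', '46', '47', '48', '49']:
        return '40%-<50%'
    elif x in ['50%-60%', '>60%']:
        return '>50%'
    else:
        return x

# Made-up raw values mixing percentage bins and individual percentages
raw = ['<20%', '20%-<30%', '30%-<36%', '37', '41', '49', '50%-60%', '>60%']

counts = Counter(bins(v) for v in raw)
shares = {label: round(n / len(raw), 4) for label, n in counts.items()}
print(shares)
```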


In [20]:
def d2i_comparison(minority_group, baseline='white'):

    # DataFrame.append is deprecated, so stack the two groups with pd.concat
    df = pd.concat([debt_to_income_groupby_df(race = baseline),
        debt_to_income_groupby_df(race = minority_group)])

    title = minority_group.capitalize()+' vs '+baseline.capitalize()

    chart = alt.Chart(df).mark_bar().encode(
        x= alt.X('race:N', axis=None),
        y= alt.Y('percentage', title='Percentage', axis=alt.Axis(tickCount=5, offset=10),
                 scale=alt.Scale(domain=[0,.5])),
        color= 'race',
        column= alt.Column(
            'debt_to_income_ratio:O', 
            sort=['<20%','20%-<30%','30%-<40%','40%-<50%','>50%'],
            header= alt.Header(title='Distribution of Debt to Income Ratio'))
    ).configure(background='#FFFFFF').configure_axis(
        grid=False
    ).configure_view(
        strokeWidth=0, 
        strokeOpacity=0
    ).configure_header(
        labelOrient='bottom'
    ).properties(height= 200, width=50, title=title
    ).configure_title(fontSize=20, offset=5, orient='top', anchor='middle'
    )

    return chart

In [21]:
d2i_comparison('asian', baseline='white')



In [22]:
d2i_comparison('hispanic', baseline='white')



In [23]:
d2i_comparison('black', baseline='white')



When we compare the three minority groups (Asian, Hispanic, and Black) to the White group, we can see that the minority groups appear more frequently in the 40%-<50% debt-to-income bin than the White group. Debt-to-income ratio is a good measure of financial stability. A high ratio means more debt to be paid or less income, representing higher risk; a low ratio means less debt or more income, representing lower risk. Because approved minority applicants are more concentrated in the higher debt-to-income bins than approved White applicants, our results suggest that lenders are accepting higher risk when approving loans to these groups. If minority applicants in general tend to have higher debt-to-income ratios, this could potentially explain part of the discrepancy in loan approval rates between our groups. Next, we will dive into interest rates.

### Let's run some hypothesis tests to see if we have differences.

First, we will run a simple hypothesis test to see the differences in interest rates for our groups.

In [24]:
white_interest = df_white_approve.interest_rate
asian_interest = df_asian_approve.interest_rate
hispanic_interest = df_hispanic_approve.interest_rate
black_interest = df_black_approve.interest_rate

white_interest.mean(), asian_interest.mean(), hispanic_interest.mean(), black_interest.mean()

(3.070415258496041, 2.983161093868947, 3.1842750920983387, 3.163821641607099)

In [25]:
def t_test(df1, df2, variable= 'interest_rate', tails= 'two-sided'):
    # takes 2 dataframe from different groups.
    # variable of interest as a string, 
    # tails for the test as a string.
    df1 = df1[variable]
    df2 = df2[variable]
    print(df1.mean(), df2.mean())
    return ttest_ind(df1,df2, alternative= tails)
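To build intuition for how sample size drives the test statistic, here is a minimal sketch that computes a two-sample t statistic from summary statistics alone. Note the assumptions: it uses Welch's unequal-variance form, whereas `ttest_ind` as called above uses SciPy's default pooled-variance test, and the means, standard deviations, and sample sizes are hypothetical.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic computed from summary statistics."""
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / standard_error

# Hypothetical numbers: same means and spread, two different sample sizes
t_small = welch_t(3.07, 0.5, 100, 3.16, 0.5, 100)
t_large = welch_t(3.07, 0.5, 100_000, 3.16, 0.5, 100_000)
print(round(t_small, 2), round(t_large, 2))
```

With the same difference in means, the statistic grows with the square root of the sample size, so very large samples can make even small differences statistically significant.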

In [26]:
t_test(df_white_approve, df_black_approve, 'interest_rate')

3.070415258496041 3.163821641607099


Ttest_indResult(statistic=-32.12215333445033, pvalue=4.682479271547002e-226)

In [27]:
t_test(df_white_approve, df_hispanic_approve, 'interest_rate')

3.070415258496041 3.1842750920983387


Ttest_indResult(statistic=-51.10821012284086, pvalue=0.0)

In [28]:
t_test(df_white_approve, df_asian_approve, 'interest_rate')

3.070415258496041 2.983161093868947


Ttest_indResult(statistic=37.39496321476996, pvalue=1.8419087473668434e-305)

In our hypothesis tests, we tested whether the mean interest rate of each minority group was the same as that of the White group. Our results show that the average interest rate for Black applicants was about 0.09 percentage points higher than for White applicants, and the average for Hispanic applicants was about 0.11 percentage points higher. To our surprise, Asian applicants had an average interest rate about 0.09 percentage points lower than White applicants, meaning they received a more favorable rate. Although these differences may not seem like much, when you consider that the average interest rate is around 3 to 3.25% on loans that amount to hundreds of thousands of dollars, 0.1 percentage points can be quite significant. All of our hypothesis tests were significant at the 0.05 level, which comes as no surprise given the massive size of our samples.
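To put a 0.1 percentage point gap into dollar terms, the sketch below uses the standard fixed-rate amortization formula to compare monthly payments at two rates. The $300,000 principal and 30-year term are hypothetical values chosen for illustration; they are not taken from our dataset.

```python
def monthly_payment(principal, annual_rate_pct, years=30):
    """Fixed monthly payment on a fully amortizing loan."""
    r = annual_rate_pct / 100 / 12      # monthly interest rate
    n = years * 12                      # number of payments
    return principal * r / (1 - (1 + r) ** -n)

# Hypothetical $300,000 loan at two rates 0.1 percentage points apart
low = monthly_payment(300_000, 3.07)
high = monthly_payment(300_000, 3.17)
extra_per_month = high - low
extra_over_term = extra_per_month * 30 * 12
print(f'{extra_per_month:.2f} per month, {extra_over_term:.0f} over 30 years')
```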

### Regression

We will take a random sample of 5,000 loans from each group. We will also create a dummy variable for each group, where a value of 1 means the applicant belongs to the group and 0 means they do not. In addition, we will compute the natural log of some of our variables. Lastly, because the debt-to-income ratio recorded by HMDA is unusable, we will compute our own substitute, a loan-to-income ratio, by dividing the loan amount by the applicant's yearly income.

In [29]:
# Take 5000 random samples from each group
df_white_samp = df_white_approve.sample(5000)
df_asian_samp = df_asian_approve.sample(5000)
df_hispanic_samp = df_hispanic_approve.sample(5000)
df_black_samp = df_black_approve.sample(5000)

# Merge the samples into one dataset (pd.concat replaces the deprecated DataFrame.append)
regression_df = pd.concat([df_white_samp, df_asian_samp, df_hispanic_samp, df_black_samp])

# Take the fields we need
regression_df= regression_df[[
                            'derived_ethnicity', 
                            'derived_race',
                            'loan_amount', 
                            'interest_rate', 
                            'income', 
                            'debt_to_income_ratio',
                            'combined_loan_to_value_ratio',
                            'rate_spread']]



In [30]:
# Create dummy variables
regression_df['white'] = np.where(regression_df.derived_race == 'White', 1, 0)
regression_df['asian'] = np.where(regression_df.derived_race == 'Asian', 1, 0)
regression_df['black'] = np.where(regression_df.derived_race == 'Black or African American', 1, 0)
regression_df['hispanic'] = np.where(regression_df.derived_race == 'Hispanic or Latino', 1, 0)

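# Compute our own substitute for debt to income: loan amount divided by yearly income.
# Note: income is recorded in thousands of dollars while loan_amount is in dollars,
# so this value is 1,000 times the true loan-to-income ratio.
regression_df['debt_to_income'] = round(regression_df['loan_amount'] / (regression_df['income']),2)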

#compute natural log of interest_rate, income, and loan amount
regression_df['ln_interest_rate'] = round(np.log(regression_df.interest_rate),3)
regression_df['ln_income'] = round(np.log(regression_df.income),3)
regression_df['ln_loan_amount'] = round(np.log(regression_df.loan_amount),3)
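One thing to keep in mind when reading the regression coefficients: `income` is in thousands of dollars (the logistic regression cell later in this notebook multiplies it by 1,000), while `loan_amount` is in dollars. A unit-consistent version of the ratio could look like the sketch below. The helper name `loan_to_income` is hypothetical, and this is an alternative scaling rather than what the cell above computes.

```python
def loan_to_income(loan_amount_dollars, income_thousands):
    """Loan-to-income ratio with both quantities expressed in dollars."""
    return loan_amount_dollars / (income_thousands * 1000)

# A $300,000 loan for an applicant earning $100,000 a year
ratio = loan_to_income(300_000, 100)
print(ratio)
```

Rescaling a regressor by a constant changes the size of its coefficient by the same factor but does not change the fit of the model.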

In [31]:
def get_ols_regression(minority_group, filter_other_race = False, percent_change= False):
    # minority_group is the treatment variable, or the race of interest. Input as a string
    # minority_group should be one of the following: 'asian', 'hispanic', 'black'

    if percent_change:
        formula = 'ln_interest_rate~'+minority_group+'+debt_to_income'
    else:
        formula = 'interest_rate~'+minority_group+'+debt_to_income'

    if filter_other_race:
        # Compare the minority group against white applicants only
        nontreatment = regression_df[regression_df.white == 1]
        treatment = regression_df[regression_df[minority_group]==1]
        df = pd.concat([nontreatment, treatment])
    else:
        df = regression_df

    reg = smf.ols(formula = formula, data = df).fit()
    reg = reg.get_robustcov_results(cov_type= 'HC1')
    
    return reg.summary()

In [32]:
get_ols_regression(minority_group='black', filter_other_race = True, percent_change = False)



0,1,2,3
Dep. Variable:,interest_rate,R-squared:,0.015
Model:,OLS,Adj. R-squared:,0.015
Method:,Least Squares,F-statistic:,76.43
Date:,"Tue, 08 Nov 2022",Prob (F-statistic):,1.1400000000000001e-33
Time:,01:17:48,Log-Likelihood:,-4954.0
No. Observations:,10000,AIC:,9914.0
Df Residuals:,9997,BIC:,9936.0
Df Model:,2,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,3.0766,0.012,250.783,0.000,3.053,3.101
black,0.0986,0.008,12.363,0.000,0.083,0.114
debt_to_income,-3.792e-06,2.95e-06,-1.286,0.199,-9.57e-06,1.99e-06

0,1,2,3
Omnibus:,5245.677,Durbin-Watson:,1.996
Prob(Omnibus):,0.0,Jarque-Bera (JB):,144683.5
Skew:,1.966,Prob(JB):,0.0
Kurtosis:,21.215,Cond. No.,10500.0
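A dummy coefficient is easiest to read in the simplest case: with a single 0/1 regressor, the OLS slope is exactly the difference between the two group means, and the intercept is the mean of the baseline group. The sketch below checks this on made-up interest rates using only the standard library. Our actual model also includes `debt_to_income`, so its coefficient is only approximately a raw difference in means.

```python
def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Toy interest rates (made up): three baseline loans, three treated loans
dummy = [0, 0, 0, 1, 1, 1]
rate = [3.0, 3.1, 3.2, 3.1, 3.2, 3.3]

slope, intercept = ols_slope_intercept(dummy, rate)
mean_control = sum(r for d, r in zip(dummy, rate) if d == 0) / 3
mean_treated = sum(r for d, r in zip(dummy, rate) if d == 1) / 3
print(slope, mean_treated - mean_control)
```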


Through our OLS regression analysis, we can see in the output above that Black applicants tend to have an interest rate about 0.10 percentage points higher than White applicants (coefficient 0.0986). Alternatively, when we model the regression using Hispanic as our treatment variable, we see a gap of around 0.1 percentage points relative to White applicants. Because the samples are drawn at random without a fixed seed, the exact estimates vary somewhat from run to run. The problem with our model, however, is that our R-squared is extremely low (0.015), which suggests that our independent variables explain very little of the variation in our dependent variable, interest rate. Let's have a look at how our independent variables relate to the dependent variable.

In [33]:
def scatterplot(df):

    chart = alt.Chart(df).mark_point().encode(
        x= alt.X('income'),
        y= alt.Y('ln_interest_rate')
    )
    return chart

In [34]:
scatterplot(regression_df[regression_df.derived_race == 'Hispanic or Latino'])



Unfortunately, we were not able to find any variables that have a strong relationship with interest rate, which makes this method of limited use. We will need to try a different method: the causal model test.

### Average Treatment Effect

The goal of the causal model test is to estimate the effect of our treatment variables on interest rate. We will use the causalinference library for this analysis. Causal modeling controls for other independent variables (loan amount and income) by matching applicants from each group who have similar values of these control variables. In effect, each applicant is compared with applicants from the other group who have a similar loan amount and income. The model then computes the average effect that our treatment variable (membership in a minority group) has on interest rate.
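The matching idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the causalinference implementation: it uses one nearest neighbor with replacement, raw squared Euclidean distance on the covariates, and no bias correction. The function name `matching_ate` and the toy data are hypothetical.

```python
def matching_ate(y, d, x):
    """1-nearest-neighbor matching estimate of the average treatment effect."""
    treated = [i for i, di in enumerate(d) if di == 1]
    control = [i for i, di in enumerate(d) if di == 0]

    def dist(i, j):
        # Squared Euclidean distance between the covariate rows of units i and j
        return sum((a - b) ** 2 for a, b in zip(x[i], x[j]))

    effects = []
    for i in range(len(y)):
        # Match each unit to its closest counterpart in the opposite group
        pool = control if d[i] == 1 else treated
        j = min(pool, key=lambda k: dist(i, k))
        # Treated outcome minus control outcome for this matched pair
        effects.append(y[i] - y[j] if d[i] == 1 else y[j] - y[i])
    return sum(effects) / len(effects)


# Toy data: (loan amount, income) in thousands; the true effect is set to 0.1
x = [(200, 60), (300, 90), (400, 120)] * 2
d = [0, 0, 0, 1, 1, 1]
y = [3.0 + 0.0005 * loan + 0.1 * di for (loan, _), di in zip(x, d)]

ate = matching_ate(y, d, x)
print(round(ate, 4))
```

Because every treated unit in the toy data has a control unit with identical covariates, the matched differences isolate the built-in effect. With real data, matches are only approximate, which is why the choice of distance and any bias adjustment matter.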

In [35]:
def avg_treat_effect(minority_group):
    nontreatment = regression_df[regression_df.white == 1]  
    treatment = regression_df[regression_df[minority_group]==1]
    df = pd.concat([nontreatment, treatment])

    Y= np.array(df.interest_rate)
    D= np.array(df[minority_group])
    X= np.array(df[['loan_amount', 'income']])

    model= CausalModel(Y=Y, D=D, X=X)
    model.est_via_matching()
    return model.estimates['matching']['ate']


In [36]:
avg_treat_effect('black')



0.09310048503968255

In [37]:
avg_treat_effect('hispanic')



0.1032122905952381

In [38]:
avg_treat_effect('asian')



-0.09786195440476189

By controlling for loan amount and income, we are comparing groups that are as similar as they can be, except that one group consists of Black applicants and the other of White applicants. In essence, we are now comparing apples to apples, and our results show that the average effect of being a Black applicant is an increase of about 0.09 percentage points in interest rate. The corresponding estimates are about +0.10 percentage points for Hispanic applicants and about -0.10 percentage points for Asian applicants.

### Violin Plot

Let's see the distribution of interest rates across all our racial groups. We will make a violin plot for easy comparison.

In [39]:
# Take samples of 1200 from each racial group. We do this because we cannot plot datasets with over 5000 rows.
df_white_samp = df_white_approve.sample(1200, random_state = 42)
df_asian_samp = df_asian_approve.sample(1200, random_state = 42)
df_hispanic_samp = df_hispanic_approve.sample(1200, random_state = 42)
df_black_samp = df_black_approve.sample(1200, random_state = 42)

# Let's combine the DataFrames into one so that we can easily plot them from a single DataFrame.
violin_df = pd.concat([df_white_samp, df_asian_samp, df_hispanic_samp, df_black_samp])

# Take the features we need or may need. We may not use all of them.
violin_df= violin_df[['derived_race', 'loan_amount', 'interest_rate', 'income', 'debt_to_income_ratio']]



Let's make our violin chart.

In [40]:
# Construct Violin Chart.
violins = alt.Chart().transform_density(
    'interest_rate',
    as_=['interest_rate','density'],
    extent=[1, 5],
    groupby=['derived_race']
).mark_area(orient='horizontal').encode(
    y=alt.Y('interest_rate:Q', title = '', axis=alt.Axis(tickCount=10, offset=10)),
    color='derived_race:N',
    x=alt.X(
        'density:Q',
        stack='center',
        impute=None,
        title=None,
        axis=None #alt.Axis(labels=False, values=[0],grid=False, ticks=True),
    ),
)

# Let's add box plots to show the median and the 25% and 75% quantiles.
# We will then facet the violin plots for each racial group to plot them next to each other.
violins2=alt.layer(
    violins,
    alt.Chart().mark_boxplot(size=5, extent=0, outliers=False).encode(
        y= alt.Y('interest_rate', title = 'Interest Rate'),
        x= alt.value(46),
        color=alt.value('black')),
).properties(
    width=100,
    height= 500,
).facet(
    data=violin_df,
    column=alt.Column(
        'derived_race:N',
    header=alt.Header(
            title= 'Density Distribution of Interest Rate by Race',
            titleFontSize = 24,
            labels=False
        ),
    )
).configure_facet(
    spacing=0
).configure_view(
    stroke=None
)

# Give white background and minor configuration.
violins2.configure(background='#FFFFFF').configure_axis(
    grid=False).configure_view(
    strokeWidth=0).configure_view(strokeOpacity=0)



### Logistic Regression for Los Angeles County

Let's look at a binary logistic regression that estimates how much more likely Black, Asian, and Hispanic applicants are to be denied a loan than White applicants. We will create and use dummy variables for joint (jointly filed application), White, Asian, Hispanic, Black, and declined. The declined variable is 1 if the application is denied and 0 if approved; it serves as our dependent variable. In addition, we computed the log of income and the log of loan amount to mimic the use of these control variables in the analysis done by Reveal News. We also converted the debt-to-income ratio from bins into a numeric variable by assigning each percentage range a single representative value: roughly the midpoint for closed ranges and the boundary value for the open-ended bins.

This logistic regression provides supplementary analysis for the LA HRS map created in our other notebook. Please see the notebook 'LA HRS map'.

In [41]:
def log_reg_model(county_code):
    # get county data
    df = df_hm_cleaned.select('*')\
        .filter(df_hm_cleaned.county_code == county_code).toPandas()

    # Drop unnecessary columns. We no longer use interest rate because we are now observing declined loans as well.
    df = df.drop(columns=[
        'loan_purpose', 
        'occupancy_type', 
        'business_or_commercial_purpose', 
        'derived_dwelling_category', 
        'interest_rate', 
        'rate_spread'])

    # Clean data. Let's convert debt-to-income bins into numbers by substituting the midpoint of each bin
    # (the open-ended bins '<20%' and '>60%' use their boundary value).
    df = df.replace({
        'NA':np.nan, 
        'Exempt':np.nan, 
        '<20%':20,
        '20%-<30%':25,
        '30%-<36%': 33, 
        '50%-60%': 55, 
        '>60%':60}
        ).dropna().astype({
        'income':'float', 
        'loan_amount':'float',
        'combined_loan_to_value_ratio':'float', 
        'action_taken':'int',
        'debt_to_income_ratio':'int'})

    # Keep only originated (action_taken == 1) and denied (action_taken == 3) applications.
    # Let's move Hispanic into race for simplicity.
    df = df[(df.action_taken==1) | (df.action_taken==3)]
    df.loc[df.derived_ethnicity == 'Hispanic or Latino', 'derived_race'] = 'Hispanic or Latino'

    # Make dummy variables for log regression
    df['joint'] = np.where(df.derived_sex == 'Joint', 1, 0)
    df['white'] = np.where(df.derived_race == 'White', 1, 0)
    df['asian'] = np.where(df.derived_race == 'Asian', 1, 0)
    df['black'] = np.where(df.derived_race == 'Black or African American', 1, 0)
    df['hispanic'] = np.where(df.derived_race == 'Hispanic or Latino', 1, 0)
    df['declined'] = np.where(df.action_taken == 3, 1, 0)

    # Compute log income and log loan amount
    df['ln_income'] = round(np.log(df.income * 1000),3)
    df['ln_loan_amount'] = round(np.log(df.loan_amount),3)

    # The log function above can produce inf and NaN values, so let's drop those.
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    df = df.dropna()

    # Make logistic regression
    log_function = "declined~asian+black+hispanic+joint+ln_income+ln_loan_amount+debt_to_income_ratio"
    log_reg = smf.logit(log_function, data=df).fit()

    return log_reg.summary()

In [42]:
# Code 06037 is Los Angeles County. Let's look at the results.
log_reg_model('06037')

Optimization terminated successfully.
         Current function value: 0.158816
         Iterations 8


| | | | |
|---|---|---|---|
| Dep. Variable: | declined | No. Observations: | 8099 |
| Model: | Logit | Df Residuals: | 8091 |
| Method: | MLE | Df Model: | 7 |
| Date: | Tue, 08 Nov 2022 | Pseudo R-squ.: | 0.1002 |
| Time: | 01:18:07 | Log-Likelihood: | -1286.2 |
| converged: | True | LL-Null: | -1429.5 |
| Covariance Type: | nonrobust | LLR p-value: | 4.788e-58 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -11.5200 | 2.269 | -5.078 | 0.000 | -15.966 | -7.074 |
| asian | -0.2567 | 0.156 | -1.648 | 0.099 | -0.562 | 0.049 |
| black | 0.4107 | 0.266 | 1.543 | 0.123 | -0.111 | 0.932 |
| hispanic | 0.2897 | 0.143 | 2.021 | 0.043 | 0.009 | 0.571 |
| joint | -0.4290 | 0.131 | -3.271 | 0.001 | -0.686 | -0.172 |
| ln_income | 0.5859 | 0.105 | 5.566 | 0.000 | 0.380 | 0.792 |
| ln_loan_amount | -0.2800 | 0.187 | -1.494 | 0.135 | -0.647 | 0.087 |
| debt_to_income_ratio | 0.1291 | 0.009 | 14.054 | 0.000 | 0.111 | 0.147 |


To interpret the results, we take e to the power of each race coefficient, which gives the odds ratio relative to White applicants (the reference group). In our results, the odds of denial for Black applicants are about 1.5 times those of White applicants, holding the other variables constant. For Hispanic applicants the odds of denial are about 1.34 times those of White applicants. For Asian applicants the odds are about 0.77 times those of White applicants; in other words, their odds of denial are about 23% lower.
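
As a quick check of the arithmetic above, the sketch below exponentiates the race coefficients and their 95% confidence bounds. The numbers are copied by hand from the summary table; the names `coefs`, `bounds`, and `odds_ratios` are ours for illustration and are not part of the notebook's pipeline.

```python
import math

# Coefficients (log-odds scale) copied from the regression summary.
coefs = {"asian": -0.2567, "black": 0.4107, "hispanic": 0.2897}

# 95% confidence bounds for the same coefficients, also from the summary.
bounds = {
    "asian": (-0.562, 0.049),
    "black": (-0.111, 0.932),
    "hispanic": (0.009, 0.571),
}

# exp(coefficient) is the odds ratio relative to the reference group (White).
odds_ratios = {name: math.exp(c) for name, c in coefs.items()}

# Exponentiating the bounds gives a confidence interval for the odds ratio.
or_intervals = {
    name: (math.exp(lo), math.exp(hi)) for name, (lo, hi) in bounds.items()
}

for name in coefs:
    lo, hi = or_intervals[name]
    print(f"{name}: odds ratio {odds_ratios[name]:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

An odds-ratio interval that contains 1 corresponds to a coefficient interval that contains 0, which is another way to read the significance of each group's estimate.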

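Unfortunately, our p-values for Asian (0.099) and Black (0.123) are larger than 0.05, so we cannot conclude that those coefficients are statistically different from 0 (no effect). Only the Hispanic coefficient (p = 0.043) is significant at the 5% level.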
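
To see where the z and P>|z| columns come from, here is a minimal sketch that recomputes them from the coefficient and standard error. It assumes a two-sided Wald test with a standard normal reference distribution. The helper `two_sided_p` is hypothetical, and because the inputs are the rounded values from the table, the results can differ slightly in the last digit.

```python
import math

def two_sided_p(coef, std_err):
    """Return (z, p) for a two-sided Wald test under a standard normal."""
    z = coef / std_err
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# (coef, std err) pairs copied from the regression summary.
rows = {
    "asian": (-0.2567, 0.156),
    "black": (0.4107, 0.266),
    "hispanic": (0.2897, 0.143),
}

alpha = 0.05  # significance level used in the text
results = {name: two_sided_p(c, se) for name, (c, se) in rows.items()}
significant = {name for name, (_, p) in results.items() if p < alpha}

for name, (z, p) in results.items():
    print(f"{name}: z = {z:.3f}, p = {p:.3f}")
print("significant at 5%:", sorted(significant))
```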
