As a starting point, we read in every CSV file whose name ends in `acquisitions.csv` and concatenate them into one dataframe.  Each of these CSV files holds data for a different bank.

In [2]:
import pandas as pd
import glob

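# Find every acquisitions file in the working directory (one per bank).
acquisition_files = glob.glob("*acquisitions.csv")

# Read each file, then stack the frames row-wise under a fresh 0..n-1 index.
acquisition_frames = [pd.read_csv(filename) for filename in acquisition_files]
acquisitions_df = pd.concat(acquisition_frames, axis=0, ignore_index=True)
acquisitions_df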

Unnamed: 0,loan_id,orig_channel,seller_name,orig_int_rate,original_upb,original_loan_term,orig_date,first_pymt_date,orig_ltv,orig_cltv,...,number_units,occ_type,prop_state,zip_code,primary_insurance_pct,product_type,coborrower_credit_score,mortgage_insurance_type,relo_mortgage_indicator,current_delq_status
0,1.001134e+11,B,"FLAGSTAR BANK, FSB",6.375,73000.0,360.0,01/2005,03/2005,80.0,80.0,...,1.0,P,MI,492.0,,FRM,,,N,1.0
1,1.001134e+11,B,"FLAGSTAR BANK, FSB",5.750,70000.0,360.0,02/2005,04/2005,72.0,72.0,...,1.0,P,MO,656.0,,FRM,761.0,,N,1.0
2,1.001673e+11,C,"FLAGSTAR BANK, FSB",6.125,180000.0,240.0,01/2005,03/2005,58.0,58.0,...,1.0,P,CA,934.0,,FRM,,,N,1.0
3,1.002136e+11,B,"FLAGSTAR BANK, FSB",5.375,122000.0,180.0,12/2004,02/2005,85.0,85.0,...,1.0,P,MI,481.0,6.0,FRM,675.0,1.0,N,1.0
4,1.002599e+11,C,"FLAGSTAR BANK, FSB",5.375,206000.0,180.0,02/2005,04/2005,53.0,53.0,...,3.0,P,CA,900.0,,FRM,,,N,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
231492,9.998117e+11,C,"CITIMORTGAGE, INC.",4.875,70000.0,180.0,01/2005,03/2005,25.0,25.0,...,1.0,P,PA,156.0,,FRM,789.0,,N,1.0
231493,9.998226e+11,C,"CITIMORTGAGE, INC.",6.000,75000.0,360.0,12/2004,02/2005,75.0,75.0,...,1.0,P,WV,249.0,,FRM,,,N,1.0
231494,9.998283e+11,C,"CITIMORTGAGE, INC.",5.500,204000.0,360.0,01/2005,03/2005,65.0,65.0,...,1.0,P,NJ,88.0,,FRM,786.0,,N,1.0
231495,9.998401e+11,R,"CITIMORTGAGE, INC.",5.875,315000.0,360.0,12/2004,02/2005,62.0,62.0,...,1.0,P,CA,956.0,,FRM,,,N,1.0


Next, we'll read in the data for foreclosed loans and merge it with the acquisitions.  The merge has to be a left join, since not every acquired loan was foreclosed on.

In [3]:
foreclosed_df = pd.read_csv("foreclosed_loans.csv")
all_loans = acquisitions_df.merge(foreclosed_df, how="left")

all_loans

Unnamed: 0,loan_id,orig_channel,seller_name,orig_int_rate,original_upb,original_loan_term,orig_date,first_pymt_date,orig_ltv,orig_cltv,...,prop_state,zip_code,primary_insurance_pct,product_type,coborrower_credit_score,mortgage_insurance_type,relo_mortgage_indicator,current_delq_status,foreclosure_date,foreclosure_flag
0,1.001134e+11,B,"FLAGSTAR BANK, FSB",6.375,73000.0,360.0,01/2005,03/2005,80.0,80.0,...,MI,492.0,,FRM,,,N,1.0,,
1,1.001134e+11,B,"FLAGSTAR BANK, FSB",5.750,70000.0,360.0,02/2005,04/2005,72.0,72.0,...,MO,656.0,,FRM,761.0,,N,1.0,,
2,1.001673e+11,C,"FLAGSTAR BANK, FSB",6.125,180000.0,240.0,01/2005,03/2005,58.0,58.0,...,CA,934.0,,FRM,,,N,1.0,,
3,1.002136e+11,B,"FLAGSTAR BANK, FSB",5.375,122000.0,180.0,12/2004,02/2005,85.0,85.0,...,MI,481.0,6.0,FRM,675.0,1.0,N,1.0,,
4,1.002599e+11,C,"FLAGSTAR BANK, FSB",5.375,206000.0,180.0,02/2005,04/2005,53.0,53.0,...,CA,900.0,,FRM,,,N,1.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
231492,9.998117e+11,C,"CITIMORTGAGE, INC.",4.875,70000.0,180.0,01/2005,03/2005,25.0,25.0,...,PA,156.0,,FRM,789.0,,N,1.0,,
231493,9.998226e+11,C,"CITIMORTGAGE, INC.",6.000,75000.0,360.0,12/2004,02/2005,75.0,75.0,...,WV,249.0,,FRM,,,N,1.0,,
231494,9.998283e+11,C,"CITIMORTGAGE, INC.",5.500,204000.0,360.0,01/2005,03/2005,65.0,65.0,...,NJ,88.0,,FRM,786.0,,N,1.0,,
231495,9.998401e+11,R,"CITIMORTGAGE, INC.",5.875,315000.0,360.0,12/2004,02/2005,62.0,62.0,...,CA,956.0,,FRM,,,N,1.0,,
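
When `merge` is called without `on`, pandas joins on every column name the two frames share, so it can be worth naming the key explicitly and counting how many rows actually matched. The following is a minimal sketch on two tiny made-up frames, not the real data. It assumes `loan_id` is the shared key (the output above suggests this but does not confirm it), and the loan IDs, the date, and the `True` flag value are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for acquisitions_df and foreclosed_df; all values are made up.
acq = pd.DataFrame({"loan_id": [101, 102, 103], "orig_ltv": [80.0, 72.0, 58.0]})
fc = pd.DataFrame({
    "loan_id": [102],
    "foreclosure_date": ["03/2009"],
    "foreclosure_flag": [True],
})

# Left join on an explicit key: every acquisition row is kept, and loans
# without a foreclosure record get NaN in the foreclosure columns.
merged = acq.merge(
    fc,
    on="loan_id",
    how="left",
    validate="one_to_one",  # raises MergeError if loan_id repeats on either side
    indicator=True,         # adds a _merge column: "both" or "left_only"
)

matched = int((merged["_merge"] == "both").sum())
print(merged)
print(f"{matched} of {len(merged)} loans have a foreclosure record")
```

If the matched count on the real data comes out as zero, that usually points to a key problem (for example, an extra shared column or mismatched dtypes) rather than to a portfolio with no foreclosures.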


What we want to explore is which factors contribute the most to foreclosure.  We want to know which patterns increase the likelihood that a loan will be foreclosed on.  In other words, we want to home in on the most important **predictor variables**.

As a starting point, let's look at three factors that banks usually consider when deciding whether a borrower is creditworthy:

### 1. Credit score
From Investopedia:

> A credit score is a number ranging from 300-850 that depicts a consumer's creditworthiness. The higher the credit score, the more attractive the borrower.


### 2. Loan-to-value ratio (LTV)
This represents the mortgage amount divided by the appraised value of the house. For example, if you buy a house for \\$100,000 and put \\$20,000 down, you will need a mortgage of \\$80,000. The LTV of this purchase would be 80\% (\\$80,000 ÷ \\$100,000).

If the LTV is higher, that means that you have less equity in the house. Higher LTVs are
traditionally considered to be riskier.
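
The LTV arithmetic above can be written as a small helper. This is only an illustrative sketch: the function name `loan_to_value` and its parameters are made up here and are not part of the dataset or of any library.

```python
def loan_to_value(loan_amount: float, appraised_value: float) -> float:
    """Return the loan-to-value ratio as a percentage (hypothetical helper)."""
    if appraised_value <= 0:
        raise ValueError("appraised_value must be positive")
    return 100.0 * loan_amount / appraised_value


# The example from the text: a 100,000 house bought with 20,000 down.
house_value = 100_000
down_payment = 20_000
ltv = loan_to_value(house_value - down_payment, house_value)
print(f"LTV: {ltv:.0f}%")  # prints "LTV: 80%"
```

A larger down payment shrinks the loan amount and therefore lowers the LTV, which is the sense in which more equity means less risk.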


### 3. Debt-to-income ratio (DTI)
This represents the total debt payments the borrower needs to
make in a month (including the mortgage payment itself) divided by
their monthly gross income.

For example, if you have a \\$500 car payment, a \\$2,000 mortgage payment, and a gross income of
\\$5,000 a month, your DTI would be 50% (\\$2,500 ÷ \\$5,000). Typically, a higher DTI is riskier.
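
The DTI calculation can be sketched the same way. The helper name `debt_to_income` and its parameters are hypothetical, chosen for this illustration only.

```python
def debt_to_income(monthly_debt_payments, gross_monthly_income: float) -> float:
    """Return the debt-to-income ratio as a percentage (hypothetical helper).

    monthly_debt_payments is an iterable of all recurring monthly debt
    payments, including the mortgage payment itself.
    """
    if gross_monthly_income <= 0:
        raise ValueError("gross_monthly_income must be positive")
    return 100.0 * sum(monthly_debt_payments) / gross_monthly_income


# The example from the text: a 500 car payment plus a 2,000 mortgage
# payment, against a gross income of 5,000 a month.
dti = debt_to_income([500, 2_000], 5_000)
print(f"DTI: {dti:.0f}%")  # prints "DTI: 50%"
```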