# Loan Default Prediction
## Step 2: Data Wrangling

## Table of Contents
1. Introduction
2. Imports
3. Data Loading and Preview
4. YData Profiling Report
5. Train/Test Split
6. Summary

## 2.1 Introduction
Loans are an important source of revenue for retail banks, but they carry the risk that borrowers will default. A default occurs when the borrower stops making the required payments on the loan. The purpose of this data science project is to build a predictive model that estimates the probability of default from customer characteristics.

The dataset we will use to train this model is called Customer Loan Data. It has a column indicating whether the customer has previously defaulted; the other columns contain information on the borrower as well as some other metrics. We are primarily interested in generating a PD (probability of default), a floating-point value between 0 and 1, given the details of the loan as described. We will also note any other patterns that emerge.

## 2.2 Imports

In [1]:
import os
import pandas as pd
from ydata_profiling import ProfileReport
from sklearn.model_selection import train_test_split

## 2.3 Data Loading and Preview

In [2]:
path = '/Users/hao/loan_default_prediction_repo/data/raw/'
filename = 'Customer Loan Data.csv'
# Join directory and file name portably, then load the raw data.
df = pd.read_csv(os.path.join(path, filename))

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               10000 non-null  int64  
 1   credit_lines_outstanding  10000 non-null  int64  
 2   loan_amt_outstanding      10000 non-null  float64
 3   total_debt_outstanding    10000 non-null  float64
 4   income                    10000 non-null  float64
 5   years_employed            10000 non-null  int64  
 6   fico_score                10000 non-null  int64  
 7   default                   10000 non-null  int64  
dtypes: float64(3), int64(5)
memory usage: 625.1 KB


In [4]:
df.head()

| | customer_id | credit_lines_outstanding | loan_amt_outstanding | total_debt_outstanding | income | years_employed | fico_score | default |
|---|---|---|---|---|---|---|---|---|
| 0 | 8153374 | 0 | 5221.545193 | 3915.471226 | 78039.38546 | 5 | 605 | 0 |
| 1 | 7442532 | 5 | 1958.928726 | 8228.75252 | 26648.43525 | 2 | 572 | 1 |
| 2 | 2256073 | 0 | 3363.009259 | 2027.83085 | 65866.71246 | 4 | 602 | 0 |
| 3 | 4885975 | 0 | 4766.648001 | 2501.730397 | 74356.88347 | 5 | 612 | 0 |
| 4 | 4700614 | 1 | 1345.827718 | 1768.826187 | 23448.32631 | 6 | 631 | 0 |


We can see that there are 10,000 data points and no missing values. There are 8 columns. The first, customer_id, is only an identifier and should have no influence on the output; we will verify this with a correlation matrix later. The last column, default, should take a value of either 0 or 1, indicating the known outcome (the customer either defaulted or did not). We did some research into the remaining 6 columns in order to better understand them in later analysis.

1. credit_lines_outstanding: A line of credit is one way a customer can borrow funds. The customer receives a credit limit, makes regular payments on principal and interest, and has continuous and repeated access to funds.

2. loan_amt_outstanding: A loan is the other way a customer can borrow funds. The customer has access to the funds only once and makes payments on principal and interest until the loan is paid off.

3. total_debt_outstanding: possibly refers only to the debt held through lines of credit.

4. income: self-explanatory

5. years_employed: self-explanatory

6. fico_score: FICO scores take into account data in five areas to determine a borrower's credit worthiness: payment history, the current level of indebtedness, types of credit used, length of credit history, and new credit accounts. Scores range from 300 to 850.
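
The observations above (no missing values, unique identifiers, a binary target, FICO scores within 300 to 850) can be checked programmatically. The following is a minimal sketch rather than part of the original notebook: `basic_checks` is a hypothetical helper, and `sample` simply re-enters the five preview rows shown earlier so that the block runs on its own.

```python
import pandas as pd

# The five preview rows from df.head(), re-entered so the sketch is self-contained.
sample = pd.DataFrame({
    "customer_id": [8153374, 7442532, 2256073, 4885975, 4700614],
    "credit_lines_outstanding": [0, 5, 0, 0, 1],
    "loan_amt_outstanding": [5221.545193, 1958.928726, 3363.009259, 4766.648001, 1345.827718],
    "total_debt_outstanding": [3915.471226, 8228.75252, 2027.83085, 2501.730397, 1768.826187],
    "income": [78039.38546, 26648.43525, 65866.71246, 74356.88347, 23448.32631],
    "years_employed": [5, 2, 4, 5, 6],
    "fico_score": [605, 572, 602, 612, 631],
    "default": [0, 1, 0, 0, 0],
})


def basic_checks(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: return a few sanity checks as plain booleans."""
    return {
        # No cell in the whole frame is NaN.
        "no_missing": not frame.isna().any().any(),
        # customer_id acts as an identifier, so it should not repeat.
        "unique_ids": bool(frame["customer_id"].is_unique),
        # The target should only contain 0 or 1.
        "binary_default": bool(frame["default"].isin([0, 1]).all()),
        # FICO scores range from 300 to 850 (inclusive bounds).
        "fico_in_range": bool(frame["fico_score"].between(300, 850).all()),
    }


checks = basic_checks(sample)
print(checks)
```

Calling `basic_checks(df)` on the full dataset would apply the same checks to all 10,000 rows.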

## 2.4 YData Profiling Report

In [None]:
profile = ProfileReport(df, title="Profiling Report")
profile.to_notebook_iframe()

### Variables
0. customer_id: all unique as expected.

1. credit_lines_outstanding: Integer values, from 0 to 5. Although 41.3% of the values are zero, it is unclear whether any of them stand for unknown values.

2. loan_amt_outstanding: Floating point values, from roughly 47 to 10,751 with a mean of 4,160. Bell-shaped. All unique values.

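3. total_debt_outstanding: Floating point values, from roughly 32 to 43,689 with a mean of 8,719. Right-skewed (positively skewed): the mean sits much closer to the minimum than to the maximum, so the long tail is toward high values. No negative values. All unique values.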

4. income: Floating point values, from roughly 100 to 148,412 with a mean of 70,040. Bell-shaped. Values are mostly distinct except for 6 occurrences of 1,000, which might be a placeholder value and might call for special consideration in later analysis.

5. years_employed: Integer values, from 0 to 10. Bell-shaped, which makes the zeros likely to be meaningful values.

6. fico_score: Integer values, from 408 to 850, within the range previously stated.

7. default: 0 or 1 as expected. Roughly 80:20 non-default to default.
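
Two of the observations above, the class balance of default and the repeated value in income, can also be reproduced directly with pandas. This is a small sketch on made-up data: the `toy` frame and its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data standing in for the real columns; all values are invented.
toy = pd.DataFrame({
    "income": [1000.0, 52000.5, 1000.0, 71000.25, 64000.0,
               1000.0, 83000.75, 45000.0, 91000.0, 58000.0],
    "default": [1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
})

# Share of each class in the target column (fractions that sum to 1).
class_share = toy["default"].value_counts(normalize=True)

# In an otherwise continuous column, values that occur more than once
# are candidates for placeholder values.
counts = toy["income"].value_counts()
repeated = counts[counts > 1]

print(class_share)
print(repeated)
```

On the real data, the same two calls on `df["default"]` and `df["income"]` would be expected to surface the roughly 80:20 balance and the repeated 1,000 noted above.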

### Correlations
1. The variable of interest, default, is positively correlated with credit_lines_outstanding and total_debt_outstanding, and somewhat negatively correlated with fico_score and years_employed.

2. total_debt_outstanding and credit_lines_outstanding are positively correlated.

3. income and loan_amt_outstanding are positively correlated.
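
Correlations with the target can also be read off a plain pandas correlation matrix. The sketch below uses Pearson correlation, which may differ from the measures shown in the profiling report, and a small `toy` frame whose first five rows follow the preview while the last three are invented for illustration.

```python
import pandas as pd

# First five rows follow the preview; the last three are made up.
toy = pd.DataFrame({
    "credit_lines_outstanding": [0, 5, 0, 0, 1, 4, 1, 5],
    "fico_score": [605, 572, 602, 612, 631, 560, 640, 555],
    "default": [0, 1, 0, 0, 0, 1, 0, 1],
})

# Pairwise Pearson correlation between all numeric columns.
corr = toy.corr(method="pearson")

# Correlation of each feature with the target, strongest positive first.
target_corr = corr["default"].drop("default").sort_values(ascending=False)
print(target_corr)
```

Applying the same two lines to `df` would also show whether customer_id is, as assumed earlier, essentially uncorrelated with default.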

## 2.5 Train/Test Split

Given the number of observations in the dataset (n=10,000) and the absence of significant outliers, an 80/20 split is reasonable. We will perform the split, verify the results, and save the data as CSV files for later use.

In [None]:
df_train, df_test = train_test_split(df, train_size=0.8)
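
Because the target is imbalanced (roughly 80:20), one option worth considering is a stratified, seeded split, so that both sets keep the same default rate and the split can be reproduced. This is a sketch of that variant, not the split used in this notebook: `stratify` and `random_state` are existing `train_test_split` parameters, while the seed value 42 and the `toy` data are arbitrary assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, of which every fifth row is a default (20 in total).
toy = pd.DataFrame({
    "fico_score": range(600, 700),
    "default": [1 if i % 5 == 0 else 0 for i in range(100)],
})

# stratify keeps the class proportions equal in both parts;
# random_state (arbitrary value) makes the shuffle reproducible.
toy_train, toy_test = train_test_split(
    toy, train_size=0.8, stratify=toy["default"], random_state=42
)

print(len(toy_train), len(toy_test))
print(toy_train["default"].mean(), toy_test["default"].mean())
```

Without `stratify`, the default rate in the two sets can drift apart by chance, although with 10,000 rows the difference is usually small.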

In [None]:
profile_train = ProfileReport(df_train, title="Profiling Report on Training Set")
profile_train.to_notebook_iframe()

In [None]:
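saving_path = '/Users/hao/loan_default_prediction_repo/data/interim/'
filename1 = 'Training Set.csv'
filename2 = 'Testing Set.csv'

# Create the output directory if it does not exist yet.
os.makedirs(saving_path, exist_ok=True)
# Note: to_csv also writes the DataFrame index (the original row number)
# as the first, unnamed column of each file.
df_train.to_csv(os.path.join(saving_path, filename1))
df_test.to_csv(os.path.join(saving_path, filename2))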
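
After saving, it can be useful to confirm that a file reloads with the same columns and values. The sketch below does a round trip through a temporary directory with a two-row `toy` frame (both are assumptions made for illustration). It passes `index=False`; with the default `index=True`, pandas writes the index as an extra first column, which shows up as `Unnamed: 0` when the file is read back without `index_col`.

```python
import os
import tempfile

import pandas as pd

# Two rows taken from the preview, enough to demonstrate the round trip.
toy = pd.DataFrame({"customer_id": [8153374, 7442532], "default": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy.csv")
    # index=False leaves the row index out of the file.
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# True when columns, dtypes, and values all match.
round_trip_ok = reloaded.equals(toy)
print(round_trip_ok)
```

If the original row numbers are wanted in the saved files, keeping the index and reloading with `index_col=0` is an equally valid choice.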

## 2.6 Summary
In this step of the project we loaded the dataset and inspected it for missing values and outliers; then we performed a train/test split and saved the resulting datasets as CSV files for later use. We also examined the variables individually, as well as some correlations, to get a better sense of the data. Overall the dataset is of high quality and is now ready for exploratory analysis.