# Probability of Loan Default Model
This project uses data from the U.S. Small Business Administration (SBA) to build a random forest with the RandomForestRegressor class from sklearn that predicts the probability of loan default. This project was created for a course in programming for finance.

_Author: Jordan Saethre_ 

_Date: December 2018_

### Set Up and Data Import

In [28]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz
import pydot

Load the CSV file into a pandas DataFrame

In [29]:
features = pd.read_csv('SBA_data.csv')

### Data Cleaning

Drop the columns listed below, keeping State, Zip, ApprovalFY, Bank, BankState, and MIS_Status. Note that Term is among the dropped columns.

In [30]:
features = features.drop(columns=['LoanNr_ChkDgt',
                                  'Name',
                                  'NAICS',
                                  'City',
                                  'Term',
                                  'ApprovalDate',
                                  'NoEmp',
                                  'NewExist',
                                  'CreateJob',
                                  'RetainedJob',
                                  'FranchiseCode',
                                  'RevLineCr',
                                  'LowDoc',
                                  'ChgOffDate',
                                  'DisbursementDate',
                                  'DisbursementGross',
                                  'BalanceGross',
                                  'ChgOffPrinGr',
                                  'GrAppv',
                                  'SBA_Appv'])

One-hot encode the categorical columns of the reduced data set

In [31]:
features = pd.get_dummies(features)
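As a rough illustration of what the one-hot step produces, the sketch below runs `pd.get_dummies` on a tiny hand-made frame. The rows are invented for the example; only the column names are borrowed from the SBA data. It shows where names such as `MIS_Status_CHGOFF` and `MIS_Status_P I F` come from.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the SBA data.
toy = pd.DataFrame({
    "State": ["CA", "NY", "CA"],
    "ApprovalFY": [1997, 2005, 2010],
    "MIS_Status": ["P I F", "CHGOFF", "P I F"],
})

# String columns are expanded into one indicator column per category,
# named <column>_<category>; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy)
encoded_columns = list(encoded.columns)
print(encoded_columns)
```

Each distinct value of a string column becomes its own column, so high-cardinality columns such as Bank or Zip (if stored as text) can make the encoded frame very wide.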

Create an array of labels from the column MIS_Status_CHGOFF. A value of 0 indicates the loan was paid in full (PIF) and 1 indicates it was charged off (CHGOFF).

In [32]:
labels = np.array(features['MIS_Status_CHGOFF'])

Drop the MIS_Status_P I F and MIS_Status_CHGOFF columns from the features, since they encode the variable we are trying to predict

In [33]:
features = features.drop(['MIS_Status_P I F','MIS_Status_CHGOFF'], axis = 1)

Create a list of all the column names in our modified data set

In [34]:
features_list = list(features.columns)

Convert the modified data set from a DataFrame to a NumPy array

In [15]:
features = np.array(features)

### Splitting into training and testing sets

Partition features and labels into training and testing sets: the first 10,000 rows form the training set and the remaining rows form the testing set

In [35]:
train_features = features[0:10000]
test_features = features[10000:]
train_labels = labels[:10000]
test_labels = labels[10000:]
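The slices above split the data by row position, so any ordering in the CSV file (by year or by state, for example) carries over into the two sets. One possible alternative is a shuffled split with `train_test_split` from sklearn. The arrays in the sketch below are synthetic stand-ins for `features` and `labels`, and the 25% test fraction and the seed are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the `features` and `labels` arrays built above.
rng = np.random.default_rng(0)
demo_features = rng.random((200, 4))
demo_labels = rng.integers(0, 2, size=200)

# Shuffle the rows, then hold out 25% of them for testing.
# random_state fixes the shuffle so the split is repeatable.
train_X, test_X, train_y, test_y = train_test_split(
    demo_features, demo_labels, test_size=0.25, random_state=42
)
print(train_X.shape, test_X.shape)
```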

### Build the Model

Instantiate RandomForestRegressor as rf with 1000 trees, a maximum depth of 5, and a minimum of 100 samples required to split a node

In [36]:
rf = RandomForestRegressor(n_estimators = 1000, max_depth = 5, min_samples_split = 100)

Fit rf to the training features in train_features, using the 0/1 values in train_labels as the regression target

In [37]:
rf.fit(train_features, train_labels)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=5,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=100,
           min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
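Because the regressor is trained on 0/1 labels, each tree predicts the average label in a leaf and the forest averages those values, so the output stays between 0 and 1 and can be read as an estimated default probability. The sketch below checks this on synthetic data and, for comparison, shows `RandomForestClassifier.predict_proba`, which is another common way to obtain class probabilities. The data, the 50-tree forest size, and the seeds are assumptions made for the example, not part of the original notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic binary outcome loosely driven by the first feature.
rng = np.random.default_rng(0)
X = rng.random((500, 3))
y = (X[:, 0] + 0.3 * rng.standard_normal(500) > 0.5).astype(int)

# Regressor on 0/1 labels: predictions are averages of 0/1 values.
reg = RandomForestRegressor(
    n_estimators=50, max_depth=5, min_samples_split=100, random_state=0
)
reg.fit(X, y)
reg_scores = reg.predict(X)

# Classifier alternative: column 1 of predict_proba is P(label == 1).
clf = RandomForestClassifier(
    n_estimators=50, max_depth=5, min_samples_split=100, random_state=0
)
clf.fit(X, y)
clf_scores = clf.predict_proba(X)[:, 1]

print(reg_scores.min(), reg_scores.max())
print(clf_scores.min(), clf_scores.max())
```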

### Test the Model

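Score the records in the testing set. Because the model is a regressor trained on 0/1 labels (0 for PIF, 1 for CHGOFF), each prediction is a value between 0 and 1 that can be read as an estimated probability of default rather than a hard class label.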

In [38]:
predictions = rf.predict(test_features)

Print the predictions for each instance in the testing set

In [22]:
print(predictions)

[0.30347579 0.18842729 0.15154328 0.19546049 0.28825888 0.05690712
 0.34248153 0.19522968 0.25287344 0.05690712 0.28232577 0.17581053
 0.28232577 0.25198333 0.05690712 0.18577708 0.15293922 0.05788314
 0.28232577 0.05690712 0.15154328 0.05690712 0.2850343  0.05723675
 0.18549324 0.18557695 0.05690712 0.31664567 0.1537806  0.25809881
 0.15154328 0.05690712 0.28743883 0.05792896 0.32304655 0.18807935
 0.05690712 0.18842729 0.05792896 0.15154328 0.05690712 0.05690712
 0.16081381 0.05690712 0.17591212 0.16081381 0.27699666 0.05690712
 0.32304655 0.15154328 0.19546049 0.05690712 0.18549324 0.05690712
 0.28520423 0.18964812 0.29231764 0.25198333 0.05690712 0.28303311
 0.18842729 0.05690712 0.29876011 0.05690712 0.15154328 0.28234809
 0.28592566 0.34266925 0.05690712 0.18557695 0.05690712 0.20796571
 0.05690712 0.15154328 0.05690712 0.26960747 0.05770629 0.25655413
 0.27249802 0.05832087 0.19546049 0.90978444 0.05832087 0.18574291
 0.25198333 0.28232577 0.18577708 0.28534109 0.28286797 0.1850
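The notebook prints the raw scores but does not compare them with `test_labels`. A minimal sketch of one way to do that follows, using a handful of invented scores and outcomes in place of `predictions` and `test_labels`. The 0.5 cutoff is an assumption; a different threshold may suit a lending decision better.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Invented outcomes and scores, standing in for test_labels and predictions.
demo_labels = np.array([0, 0, 1, 0, 1, 1, 0, 0])
demo_scores = np.array([0.06, 0.15, 0.91, 0.28, 0.34, 0.62, 0.19, 0.55])

# Turn scores into hard 0/1 classes with an assumed cutoff of 0.5.
threshold = 0.5
demo_classes = (demo_scores >= threshold).astype(int)

# Accuracy depends on the cutoff; ROC AUC uses the raw scores directly.
accuracy = accuracy_score(demo_labels, demo_classes)
auc = roc_auc_score(demo_labels, demo_scores)
print(accuracy, auc)
```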

### Display the Decision Tree

Select one tree from the forest (the estimator at index 5) to visualize

In [None]:
tree = rf.estimators_[5]

Export the tree to a DOT file and render it as a PNG image

In [42]:
export_graphviz(tree, out_file = 'tree.dot', feature_names = features_list, rounded = True, precision = 1)
(graph, ) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('tree.png')

![Probability of Default Tree](tree.png)
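Rendering the PNG requires Graphviz to be installed alongside pydot. Where that is not available, `export_text` from `sklearn.tree` can print the same kind of tree as plain text. The sketch below uses a small synthetic forest; the data, the names `feature_a` and `feature_b`, and the forest settings are placeholders rather than the SBA model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Small synthetic problem; the feature names are placeholders.
rng = np.random.default_rng(0)
X = rng.random((300, 2))
y = (X[:, 0] > 0.5).astype(int)
demo_names = ["feature_a", "feature_b"]

forest = RandomForestRegressor(n_estimators=10, max_depth=2, random_state=0)
forest.fit(X, y)

# Print the split rules and leaf values of one tree from the forest.
tree_text = export_text(forest.estimators_[5], feature_names=demo_names)
print(tree_text)
```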