# Using a decision tree to predict financial distress
### Machine Learning for Public Policy - HW #2
### Cecile Murray


Many people rely on their ability to borrow money to help cushion the impact of financial shocks such as a medical emergency or a job loss. Sometimes, however, borrowers cannot pay back what they owe, and can be sucked into a downward spiral of greater debt and declining credit-worthiness. In a policy context, the ability to identify individuals at high risk of serious delinquency could allow for interventions that would help individuals get back on a firm financial footing before their credit suffers serious damage. 

In this assignment, I explore a Kaggle dataset and use a decision tree to predict who will experience a serious delinquency.

### Load data

In [1]:
# import modules
import numpy as np
import pandas as pd 
import seaborn as sns
import plotnine as p9
import matplotlib.pyplot as plt
import sklearn.tree as tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score as accuracy
import graphviz 

# bring in pipeline library
import pipeline as pipe
import utils
import exploration as exp

In [2]:
# read in data
credit_raw = utils.read_data("credit-data", file_type = 'csv')

# get count of null values for all columns
print(credit_raw.isnull().sum())

# replace missing values with the median
credit = utils.replace_missing(credit_raw, 'MonthlyIncome', 'NumberOfDependents', method = 'median')
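
The `utils.replace_missing` helper is not shown in this notebook. As a rough sketch of what median imputation could look like in plain pandas, the function below fills each named column with that column's own median. The name `fill_with_median` and the toy data are hypothetical and are not part of the `utils` module.

```python
import numpy as np
import pandas as pd

def fill_with_median(df, *columns):
    """Return a copy of df with NaNs in each named column replaced
    by that column's median (hypothetical stand-in for utils.replace_missing)."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].median())
    return out

# toy data, made up for illustration only
toy = pd.DataFrame({'MonthlyIncome': [1000.0, np.nan, 3000.0],
                    'NumberOfDependents': [0.0, 2.0, np.nan]})
filled = fill_with_median(toy, 'MonthlyIncome', 'NumberOfDependents')
```

Working on a copy leaves the raw data untouched, so the null counts can still be inspected after imputation.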

In [3]:
# create discrete buckets for variables
credit = utils.bin_continuous(credit, 'age', 'age_bracket',
                         breaks = list(range(0, 100, 20)),
                         labels = ['under20', '20-40', '40-60', '60-80', '80+'])
credit = utils.bin_continuous(credit, 'MonthlyIncome', 'income_cat',
                             breaks = [-1, 1000, 2500, 5000, 10000],
                             labels = ['low', 'modest', 'medium', 'high', 'highest'])
credit = utils.bin_continuous(credit, 'RevolvingUtilizationOfUnsecuredLines', 'utilization',
                             breaks = [0, 0.5, 1, 2], 
                             labels = ['under_half', 'over_half', 'over_one', 'extreme'])
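
Each call above passes as many labels as break points, whereas `pd.cut` expects one more bin edge than labels. One possible reading is that `utils.bin_continuous` adds an open-ended top bin. The sketch below shows that idea under two assumptions that the notebook does not confirm: the top edge is infinity, and bins are closed on the left. The name `bin_with_open_top` is hypothetical.

```python
import numpy as np
import pandas as pd

def bin_with_open_top(series, breaks, labels):
    """Bin a numeric series with pd.cut, appending +inf as the final edge
    so that len(breaks) lower edges pair up with len(labels) labels."""
    edges = list(breaks) + [np.inf]
    # right=False makes each bin [lower, upper), so 20 falls in '20-40'
    return pd.cut(series, bins=edges, labels=labels, right=False)

# toy ages, made up for illustration only
ages = pd.Series([19, 20, 45, 79, 93])
brackets = bin_with_open_top(ages, range(0, 100, 20),
                             ['under20', '20-40', '40-60', '60-80', '80+'])
```

The choice of `right` decides which bin a boundary value lands in, and values below the first edge become missing, so it is worth checking the helper's behavior at the edges.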

In [4]:
# make categorical variables into binaries
credit = utils.make_cat_dummy(credit, ['age_bracket', 'income_cat', 'utilization'])
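
The feature list in the next cell includes a column named `age_bracket_nan`, which is the name `pd.get_dummies` produces when asked to add an indicator for missing values. The sketch below assumes, without confirmation from the notebook, that `utils.make_cat_dummy` wraps `pd.get_dummies` in roughly this way; the toy data is made up.

```python
import pandas as pd

# toy data with one missing bracket, made up for illustration only
toy = pd.DataFrame({'age_bracket': ['20-40', '40-60', None]})

# dummy_na=True adds an extra '<prefix>_nan' indicator column
dummies = pd.get_dummies(toy, columns=['age_bracket'], dummy_na=True)
```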

In [5]:
# create training and testing sets
feature_list = ['NumberOfTime30-59DaysPastDueNotWorse',
                'DebtRatio',
                'age_bracket_under20',
                'age_bracket_20-40',
                'age_bracket_40-60',
                'age_bracket_60-80',
                'age_bracket_80+',
                'age_bracket_nan',
                'income_cat_low',
                'income_cat_modest',
                'income_cat_medium',
                'income_cat_high',
                'income_cat_highest',
                'utilization_under_half',
                'utilization_over_half',
                'utilization_over_one',
                'utilization_extreme',
                'NumberOfOpenCreditLinesAndLoans',
                'NumberRealEstateLoansOrLines',
                'NumberOfDependents']

x_train, x_test, y_train, y_test = pipe.create_train_test_sets(credit,
                                                               'SeriousDlqin2yrs',
                                                               feature_list,
                                                               size = 0.25)
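
`pipe.create_train_test_sets` is defined in the pipeline module rather than in this notebook. A minimal sketch of an equivalent split, assuming it is a thin wrapper around scikit-learn's `train_test_split`, might look like this. The function name `split_features_target`, the `seed` argument, and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_features_target(df, target, features, size=0.25, seed=0):
    """Select feature columns and the target column, then split them
    into training and testing sets (hypothetical stand-in for the pipeline helper)."""
    x = df[features]
    y = df[target]
    # returns x_train, x_test, y_train, y_test in that order
    return train_test_split(x, y, test_size=size, random_state=seed)

# toy data, made up for illustration only
toy = pd.DataFrame({'DebtRatio': [0.1, 0.5, 0.9, 0.3, 0.7, 0.2, 0.4, 0.8],
                    'NumberOfDependents': [0, 1, 2, 0, 3, 1, 0, 2],
                    'SeriousDlqin2yrs': [0, 0, 1, 0, 1, 0, 0, 1]})
x_tr, x_te, y_tr, y_te = split_features_target(
    toy, 'SeriousDlqin2yrs', ['DebtRatio', 'NumberOfDependents'])
```

Fixing `random_state` makes the split reproducible, which helps when comparing classifiers on the same test set.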

In [8]:
# these two classifiers are used below but were not imported in the first cell
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def build_classifier(classifier_type, x_train, y_train, **params):
    ''' Takes the name of a classifier type, a training set, and optional
        keyword arguments that are passed to the classifier's constructor
        Returns the trained classifier object, or None if the type is not supported
    '''

    if classifier_type == 'DecisionTree':
        return DecisionTreeClassifier(**params).fit(x_train, y_train)

    elif classifier_type == "LogisticRegression":
        return LogisticRegression(**params).fit(x_train, y_train)

    elif classifier_type == "KNN":
        return KNeighborsClassifier(**params).fit(x_train, y_train)

    elif classifier_type == "SVM":
        return LinearSVC(**params).fit(x_train, y_train)

    else:
        print("Classifier not supported.")
        return None


In [12]:
# use a name other than `tree` so the sklearn.tree module imported above is not shadowed
base_tree = pipe.build_classifier("DecisionTree", x_train, y_train)
pipe.compute_eval_stats(base_tree, x_test, y_test, 0.3)

[0.8326506729081334,
 0.4740947075208914,
 0.5243376463339495,
 0.49795201872440026,
 0.7074822283343946]
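
The notebook does not label these five numbers. The fourth is consistent with the F1 score of the second and third (2 × 0.474 × 0.524 / (0.474 + 0.524) ≈ 0.498), which suggests the list may be accuracy, precision, recall, F1, and a fifth statistic such as AUC. The sketch below computes statistics in that assumed order by classifying an observation as positive when its predicted score exceeds the threshold. The function `eval_at_threshold` and the toy scores are hypothetical and may not match `pipe.compute_eval_stats`.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

def eval_at_threshold(y_true, scores, threshold):
    """Label an observation positive when its score exceeds the threshold,
    then return [accuracy, precision, recall, F1, AUC] (assumed ordering)."""
    preds = (np.asarray(scores) > threshold).astype(int)
    return [accuracy_score(y_true, preds),
            precision_score(y_true, preds),
            recall_score(y_true, preds),
            f1_score(y_true, preds),
            roc_auc_score(y_true, scores)]  # AUC uses the raw scores

# toy labels and scores, made up for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
stats = eval_at_threshold(y_true, scores, 0.3)
```

Lowering the threshold below 0.5 trades precision for recall, which may be appropriate when missing an at-risk borrower is costlier than a false alarm.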

In [None]:
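# test different tree depths and split criteria
depths = [1, 3, 5, 6, 7, 10, 15, 25]
criteria = ['gini', 'entropy']
params = [(c, d) for c in criteria for d in depths]

# fit one tree per (criterion, depth) pair and report its accuracy on the test set
for c, d in params:
    clf = DecisionTreeClassifier(criterion=c, max_depth=d).fit(x_train, y_train)
    print(c, d, accuracy(y_test, clf.predict(x_test)))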

Of these trees, the one with the greatest accuracy on the test dataset is the entropy tree with a maximum depth of three. I plot the predicted probabilities from this tree below.

In [None]:
dec_tree = DecisionTreeClassifier(max_depth=3, criterion='entropy').fit(x_train, y_train)
pipe.plot_prediction_distribution(dec_tree, x_test)
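
One thing to keep in mind when reading a plot of predicted probabilities from a shallow tree: a tree of depth three has at most eight leaves, and every observation in a leaf receives the same probability, so the distribution can have at most eight distinct values. The sketch below illustrates this on synthetic data from scikit-learn's `make_classification`; it does not use the credit data, and the variable names are made up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic data, for illustration only
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, criterion='entropy',
                             random_state=0).fit(X, y)

# probability of the positive class for each observation
proba = clf.predict_proba(X)[:, 1]

# each leaf yields a single probability, so there are at most 2**3 distinct values
distinct = np.unique(proba)
```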

### Plot the decision tree

By plotting the decision tree, we can understand how it classifies observations. Looking at the plot below, existing measures of credit health - namely utilization of existing revolving credit lines and the number of times an individual had a short-term delinquency - largely drive the classification. This result is not surprising, but it may also offer little new insight to someone who wishes to target individuals at risk of serious delinquency more carefully.

In [None]:
pipe.make_tree_chart(dec_tree, x_train.columns, ['NoDelinquency', 'Delinquency'], out_file = 'tree_entropy_d3.dot')
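
`pipe.make_tree_chart` lives in the pipeline module, and its implementation is not shown here. Since it takes an `out_file` ending in `.dot`, it plausibly wraps scikit-learn's `export_graphviz`; that is an assumption. The sketch below produces DOT source for a small tree fitted on synthetic data, with made-up feature names.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# synthetic data, for illustration only
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, criterion='entropy',
                             random_state=0).fit(X, y)

# with out_file=None, export_graphviz returns the DOT source as a string
dot_source = export_graphviz(clf,
                             out_file=None,
                             feature_names=['f0', 'f1', 'f2', 'f3'],
                             class_names=['NoDelinquency', 'Delinquency'],
                             filled=True)
```

The resulting string can be written to a `.dot` file or rendered with the `graphviz` package, for example via `graphviz.Source(dot_source)`.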