# Decision Trees
This notebook makes an initial exploration of decision trees using the week 3 loan dataset.

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
loans = pd.read_csv("./lending-club-data.csv", dtype={'next_pymnt_d':str, 'desc':str})

Format the output column so that a safe loan is +1 and a risky loan is -1.

In [31]:
loans['output'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
loans = loans.drop('bad_loans', axis=1)
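
The same relabeling can be done without a row-by-row lambda. A minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration, not the real loans data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loans frame; the values are made up.
toy = pd.DataFrame({'bad_loans': [0, 1, 0, 0, 1]})

# Vectorized equivalent of the lambda: 0 -> +1 (safe), anything else -> -1 (risky).
toy['output'] = np.where(toy['bad_loans'] == 0, 1, -1)
toy = toy.drop('bad_loans', axis=1)

print(toy['output'].tolist())
```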

Split the data into training and test sets.

In [32]:
train, test = train_test_split(loans, test_size=0.2, random_state=1)
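
The split above is purely random. Because the classes turn out to be unbalanced, one option (not used in this notebook) is to pass `stratify` so both sets keep the same safe/risky ratio. A sketch on made-up data; `toy`, `toy_train` and `toy_test` are hypothetical names:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 safe (+1) and 20 risky (-1) loans; values are illustrative only.
toy = pd.DataFrame({'output': [1] * 80 + [-1] * 20,
                    'grade': ['A', 'B'] * 50})

# stratify keeps the class proportions the same in both splits.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=1, stratify=toy['output'])

print(toy_train['output'].value_counts(normalize=True))
print(toy_test['output'].value_counts(normalize=True))
```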

Display the percentage of risky and safe loans.

In [38]:
# Count rows with len(); DataFrame.size would count cells (rows x columns).
percent_safe = len(train[train['output'] == 1]) / (1.0 * len(train)) * 100
percent_risky = len(train[train['output'] == -1]) / (1.0 * len(train)) * 100

print('Safe: {0:.2f}%'.format(percent_safe))
print('Risky: {0:.2f}%'.format(percent_risky))

Safe: 81.16%
Risky: 18.84%
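
The class shares can also be read off in one call with `value_counts(normalize=True)`. A small sketch on a made-up label column (`toy_train` is an assumption, not the real training set):

```python
import pandas as pd

# Toy labels: four safe loans and one risky loan.
toy_train = pd.DataFrame({'output': [1, 1, 1, 1, -1]})

# normalize=True returns fractions per label; scale to percentages.
shares = toy_train['output'].value_counts(normalize=True) * 100

print('Safe: {0:.2f}%'.format(shares.loc[1]))
print('Risky: {0:.2f}%'.format(shares.loc[-1]))
```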


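Compute the classification error for a single feature: within each category, predict the majority class and count the minority loans as mistakes.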

In [62]:
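def get_feature_error(data, feat, output, Verbose=False):
    '''
    Classification error from predicting the majority class within each
    category of a single feature.

    Args:
        data (DataFrame): loans data
        feat (str): Categorical variable
        output (str): label column; associated data should be binary (+1 / -1)
        Verbose (bool): print the per-category breakdown
    '''
    num_error = 0
    
    bins = data[feat].unique()
    for value in bins.tolist():
        data_in_bin = data[data[feat] == value]
        # Note: DataFrame.size counts cells (rows x columns), not rows, so the
        # totals below are cell counts. The ratios are unaffected because the
        # column count cancels out.
        num_pos = data_in_bin[data_in_bin[output] == 1].size
        num_neg = data_in_bin[data_in_bin[output] == -1].size
        if Verbose:
            # The printed values are fractions in [0, 1], despite the % sign.
            print('-------------------')
            print('In feature', value)
            print('Safe: {0:.2f}% with {1} total'.format(num_pos/ (1.0 * data_in_bin.size), num_pos))
            print('Risky: {0:.2f}% with {1} total'.format(num_neg / (1.0 * data_in_bin.size), num_neg))
            print('-------------------')
        # Predict the majority class; the minority class counts as errors.
        if num_pos > num_neg:
            num_error += num_neg
        else:
            num_error += num_pos
            
    return num_error / (1.0 * data.size)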
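
For comparison, the same majority-class error can be computed from row counts with a `groupby`. This is a sketch under assumptions: `feature_error_by_rows` is a hypothetical helper, the labels are assumed to be exactly +1 and -1, and the toy data is made up.

```python
import pandas as pd

def feature_error_by_rows(data, feat, output):
    # Rows per (category, label) pair; missing combinations become 0.
    counts = data.groupby([feat, output]).size().unstack(fill_value=0)
    # Make sure both label columns exist even if one label never occurs.
    counts = counts.reindex(columns=[1, -1], fill_value=0)
    # Predicting the majority label in each category misclassifies
    # exactly the minority rows of that category.
    mistakes = counts.min(axis=1).sum()
    return mistakes / float(len(data))

# Toy data: grade A has 2 safe / 1 risky, grade B has 1 safe / 3 risky.
toy = pd.DataFrame({
    'grade':  ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'output': [1, 1, -1, -1, -1, 1, -1],
})

print(feature_error_by_rows(toy, 'grade', 'output'))
```

On the toy data each grade contributes one mistake, so the error is 2 out of 7 rows.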

In [64]:
features = ['grade',              # grade of the loan
            'term',               # the term of the loan
            'home_ownership',     # home_ownership status: own, mortgage or rent
            'emp_length',         # number of years of employment
           ]
for feature in features:
    print(get_feature_error(train, feature, 'output', Verbose=True))

-------------------
('In feature', 'B')
Safe: 0.85% with 1726044 total
Risky: 0.15% with 295528 total
-------------------
-------------------
('In feature', 'C')
Safe: 0.79% with 1296692 total
Risky: 0.21% with 335444 total
-------------------
-------------------
('In feature', 'A')
Safe: 0.93% with 1130908 total
Risky: 0.07% with 85340 total
-------------------
-------------------
('In feature', 'D')
Safe: 0.73% with 761124 total
Risky: 0.27% with 278664 total
-------------------
-------------------
('In feature', 'E')
Safe: 0.68% with 334288 total
Risky: 0.32% with 155856 total
-------------------
-------------------
('In feature', 'F')
Safe: 0.61% with 128928 total
Risky: 0.39% with 82484 total
-------------------
-------------------
('In feature', 'G')
Safe: 0.60% with 35088 total
Risky: 0.40% with 23392 total
-------------------
0.188418208697
-------------------
('In feature', ' 36 months')
Safe: 0.84% with 4469844 total
Risky: 0.16% with 850680 total
-------------------
--------

In [59]:
# Print the size (cell count: rows x columns) of each grade's subset.
for value in train['grade'].unique().tolist():
    print('-------------------')
    print(train[train['grade']==value].size)
    print('-------------------')

-------------------
2021572
-------------------
-------------------
1632136
-------------------
-------------------
1216248
-------------------
-------------------
1039788
-------------------
-------------------
490144
-------------------
-------------------
211412
-------------------
-------------------
58480
-------------------


We have an issue with unbalanced data. A simple approach would be to throw out some of the safe loans to even out the classes. Explore these methods.
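
One way to throw out safe loans is random undersampling: keep every row of the smaller class and sample the same number of rows from the larger one. A minimal sketch, assuming a hypothetical helper `undersample_majority` and made-up data; it would normally be applied to the training set only, so the test set keeps the real class ratio.

```python
import pandas as pd

def undersample_majority(data, output, random_state=1):
    # Size of the smallest class sets the target size for every class.
    counts = data[output].value_counts()
    n_min = counts.min()
    # Draw n_min rows from each class without replacement.
    parts = [data[data[output] == label].sample(n=n_min, random_state=random_state)
             for label in counts.index]
    # Concatenate and shuffle so the classes are not grouped together.
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# Toy data: 8 safe (+1) and 2 risky (-1) loans.
toy = pd.DataFrame({'output': [1] * 8 + [-1] * 2,
                    'grade': list('AABBCCDDAB')})

balanced = undersample_majority(toy, 'output')
print(balanced['output'].value_counts())
```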