## Decision Tree - Example
### Problem: Predicting risky bank loans using decision trees (sklearn's CART-based `DecisionTreeClassifier`, not C5.0)

The `default` column indicates whether the loan applicant failed to meet the agreed payment terms and went into default. 30 percent of the loans in this dataset went into default. The goal is to train a model that predicts such defaulters.

### Data: 
1. checking_balance        - object
2. months_loan_duration     - int64
3. credit_history          - object
4. purpose                 - object
5. amount                   - int64
6. savings_balance         - object
7. employment_length       - object
8. installment_rate         - int64
9. personal_status         - object
10. other_debtors           - object
11. residence_history        - int64
12. property                - object
13. age                      - int64
14. installment_plan        - object
15. housing                 - object
16. existing_credits         - int64
17. job                     - object
18. dependents               - int64
19. telephone               - object
20. foreign_worker          - object
21. default                  - int64 (Target variable/Label)

a) default = 1 --> Normal Customer

b) default = 2 --> Risky Customer/Probable Defaulter

### 1. Load the necessary packages

In [2]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, accuracy_score
pd.set_option('display.max_columns',30)

### 2. Exploring the data

In [3]:
credit = pd.read_csv('credit.csv')

In [5]:
credit.head()

Unnamed: 0,checking_balance,months_loan_duration,credit_history,purpose,amount,savings_balance,employment_length,installment_rate,personal_status,other_debtors,residence_history,property,age,installment_plan,housing,existing_credits,job,dependents,telephone,foreign_worker,default
0,< 0 DM,6,critical,radio/tv,1169,unknown,> 7 yrs,4,single male,none,4,real estate,67,none,own,2,skilled employee,1,yes,yes,1
1,1 - 200 DM,48,repaid,radio/tv,5951,< 100 DM,1 - 4 yrs,2,female,none,2,real estate,22,none,own,1,skilled employee,1,none,yes,2
2,unknown,12,critical,education,2096,< 100 DM,4 - 7 yrs,2,single male,none,3,real estate,49,none,own,1,unskilled resident,2,none,yes,1
3,< 0 DM,42,repaid,furniture,7882,< 100 DM,4 - 7 yrs,2,single male,guarantor,4,building society savings,45,none,for free,1,skilled employee,2,none,yes,1
4,< 0 DM,24,delayed,car (new),4870,< 100 DM,1 - 4 yrs,3,single male,none,4,unknown/none,53,none,for free,2,skilled employee,2,none,yes,2


In [6]:
credit.dtypes # same as str(credit) in R

checking_balance        object
months_loan_duration     int64
credit_history          object
purpose                 object
amount                   int64
savings_balance         object
employment_length       object
installment_rate         int64
personal_status         object
other_debtors           object
residence_history        int64
property                object
age                      int64
installment_plan        object
housing                 object
existing_credits         int64
job                     object
dependents               int64
telephone               object
foreign_worker          object
default                  int64
dtype: object

In [8]:
credit.describe() # same as summary() in R

Unnamed: 0,months_loan_duration,amount,installment_rate,residence_history,age,existing_credits,dependents,default
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,20.903,3271.258,2.973,2.845,35.546,1.407,1.155,1.3
std,12.058814,2822.736876,1.118715,1.103718,11.375469,0.577654,0.362086,0.458487
min,4.0,250.0,1.0,1.0,19.0,1.0,1.0,1.0
25%,12.0,1365.5,2.0,2.0,27.0,1.0,1.0,1.0
50%,18.0,2319.5,3.0,3.0,33.0,1.0,1.0,1.0
75%,24.0,3972.25,4.0,4.0,42.0,2.0,1.0,2.0
max,72.0,18424.0,4.0,4.0,75.0,4.0,2.0,2.0


In [10]:
credit.isnull().sum() #checking NA values

checking_balance        0
months_loan_duration    0
credit_history          0
purpose                 0
amount                  0
savings_balance         0
employment_length       0
installment_rate        0
personal_status         0
other_debtors           0
residence_history       0
property                0
age                     0
installment_plan        0
housing                 0
existing_credits        0
job                     0
dependents              0
telephone               0
foreign_worker          0
default                 0
dtype: int64

In [12]:
credit.checking_balance.value_counts() # same as table(credit$checking_balance) in R

unknown       394
< 0 DM        274
1 - 200 DM    269
> 200 DM       63
Name: checking_balance, dtype: int64

In [13]:
credit['savings_balance'].value_counts()

< 100 DM         603
unknown          183
101 - 500 DM     103
501 - 1000 DM     63
> 1000 DM         48
Name: savings_balance, dtype: int64
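`isnull()` reports no missing values, yet `checking_balance` and `savings_balance` contain 394 and 183 `unknown` entries respectively. A minimal sketch of counting such placeholder strings per column, assuming a small made-up frame (`toy` and its rows are illustrative and not read from `credit.csv`):

```python
import pandas as pd

# Toy frame standing in for a few rows of the credit data (values are made up)
toy = pd.DataFrame({
    'checking_balance': ['< 0 DM', 'unknown', '1 - 200 DM', 'unknown'],
    'savings_balance': ['unknown', '< 100 DM', '< 100 DM', '> 1000 DM'],
    'amount': [1169, 5951, 2096, 7882],
})

# isnull() finds nothing, because 'unknown' is an ordinary string, not NaN
nan_counts = toy.isnull().sum()

# Count the 'unknown' placeholder in the string columns
unknown_counts = toy[['checking_balance', 'savings_balance']].eq('unknown').sum()
print(unknown_counts)
```

Whether `unknown` should stay as its own category or be treated as missing is a modelling choice; the notebook keeps it as a category.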

### 3. Data preparation

#### 3.1 Find the columns that are strings and categorical
- Check the number of unique values in each column to find the categorical columns.
- The data description usually states which columns are categorical and which are continuous.

In [4]:
# Check the number of unique values in each column to find the categorical columns.
# The data description usually states which columns are categorical and which are continuous.
for i in credit.columns:
    print(i,credit[i].nunique())

checking_balance 4
months_loan_duration 33
credit_history 5
purpose 10
amount 921
savings_balance 5
employment_length 5
installment_rate 4
personal_status 4
other_debtors 3
residence_history 4
property 4
age 53
installment_plan 3
housing 3
existing_credits 4
job 4
dependents 2
telephone 2
foreign_worker 2
default 2
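The unique-value counts can be combined with the column dtypes to list the candidates automatically. The following is a rough sketch on a made-up frame; the threshold of 4 distinct values is an arbitrary assumption for illustration, not a rule from the dataset description:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame with one string column and two integer columns (values are made up)
toy = pd.DataFrame({
    'purpose': ['radio/tv', 'education', 'furniture', 'car (new)', 'radio/tv'],
    'installment_rate': [4, 2, 2, 2, 3],
    'amount': [1169, 5951, 2096, 7882, 4870],
})

# Non-numeric columns hold strings and must be encoded before training
string_cols = [c for c in toy.columns if not is_numeric_dtype(toy[c])]

# Numeric columns with very few distinct values may be categorical codes in disguise
low_card_numeric = [
    c for c in toy.columns
    if is_numeric_dtype(toy[c]) and toy[c].nunique() <= 4
]

print(string_cols, low_card_numeric)
```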


#### 3.2 LabelEncoder is used for converting categorical string columns to numeric.
- Estimators in sklearn do not accept string-typed input columns, so such columns (e.g. "checking_balance" or "purpose" in this dataset) must be converted into numbers.

In [6]:
# The following string columns are to be converted into numbers
categorical_cols = ['checking_balance','credit_history','purpose','savings_balance','employment_length','personal_status','other_debtors','property','installment_plan','housing', 'job', 'telephone', 'foreign_worker']

In [7]:
# LabelEncoder is used for converting categorical string columns to numeric.
# Read more about LabelEncoder in sklearn documentation.

le = LabelEncoder()
for col in categorical_cols:
    # Taking a column from dataframe, encoding it and replacing same column in the dataframe.
    credit[col] = le.fit_transform(credit[col])

In [8]:
credit.head()      # now all the string columns are converted into numbers

Unnamed: 0,checking_balance,months_loan_duration,credit_history,purpose,amount,savings_balance,employment_length,installment_rate,personal_status,other_debtors,residence_history,property,age,installment_plan,housing,existing_credits,job,dependents,telephone,foreign_worker,default
0,1,6,0,7,1169,4,3,4,3,2,4,2,67,1,1,2,1,1,1,1,1
1,0,48,4,7,5951,2,1,2,1,2,2,2,22,1,1,1,1,1,0,1,2
2,3,12,0,4,2096,2,2,2,3,2,3,2,49,1,1,1,3,2,0,1,1
3,1,42,4,5,7882,2,2,2,3,1,4,0,45,1,0,1,1,2,0,1,1
4,1,24,1,1,4870,2,1,3,3,2,4,3,53,1,0,2,1,2,0,1,2
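`LabelEncoder` assigns codes in sorted order of the string values, which is why `< 0 DM` became 1 and `1 - 200 DM` became 0 above. Reusing a single encoder in the loop also overwrites its `classes_` on every column, so the mapping cannot be recovered later. A small sketch that keeps one encoder per column, on made-up rows (the `encoders` dict is a hypothetical helper, not part of the original notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two of the categorical columns (rows are made up)
toy = pd.DataFrame({
    'checking_balance': ['< 0 DM', '1 - 200 DM', 'unknown', '> 200 DM'],
    'housing': ['own', 'for free', 'rent', 'own'],
})

# Keep a separate fitted encoder for every column so codes can be decoded later
encoders = {}
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ is sorted; the position of each label is its integer code
mapping = {label: code for code, label in enumerate(encoders['checking_balance'].classes_)}
print(mapping)

# Decode integer codes back to the original strings
decoded = list(encoders['housing'].inverse_transform([1, 0]))
print(decoded)
```

The resulting integers carry no real order, but a tree will still split on them as if they were ordered numbers.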


#### 3.3 Split the data into train and test

In [10]:
# Total customers/samples - 1000
credit.shape # 1000 samples with 21 attributes

(1000, 21)

In [11]:
# Train Data - Selecting 900 rows at random from the dataframe for training
credit_train = credit.sample(900, random_state = 123)

In [12]:
# Test Data - Taking remaining 100 rows for testing by dropping the rows present in train dataframe from original dataframe.
credit_test = credit.drop(credit_train.index)

In [13]:
# Check whether this appears to be a fairly even split or not,
# train should have about 30 percent of defaulted loans 
# and test data also should have similar % of default loans
(credit.default.value_counts()/credit.default.count())*100

1    70.0
2    30.0
Name: default, dtype: float64

In [27]:
# Train data - Ratio of normal and risky customers
(credit_train.default.value_counts()/credit_train.default.count())*100

1    69.777778
2    30.222222
Name: default, dtype: float64

In [14]:
# Test data - Ratio of normal and risky customers
(credit_test.default.value_counts()/credit_test.default.count())*100

1    72.0
2    28.0
Name: default, dtype: float64

In [18]:
# Taking the labels in separate objects
train_labels = credit_train.default
test_labels = credit_test.default
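The random sample above happens to give a 72/28 test split. If the class ratio should be preserved in both parts, `train_test_split` with `stratify` is one option. A minimal sketch on synthetic data with the same 70/30 label ratio (the `toy` frame and the variable names are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 30% labelled 2 (default), as in the credit data
toy = pd.DataFrame({
    'amount': range(1000),
    'default': [1] * 700 + [2] * 300,
})

# Separate features and label before splitting
X = toy.drop(columns='default')
y = toy['default']

# stratify=y keeps the class proportions the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=100, stratify=y, random_state=123)

# value_counts(normalize=True) gives proportions directly
train_pct = y_train.value_counts(normalize=True) * 100
test_pct = y_test.value_counts(normalize=True) * 100
print(train_pct, test_pct, sep='\n')
```

Splitting features and label up front also avoids passing the label column to the model by accident later on.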

### 4. Training the model (Decision Tree)

In [19]:
# Creating object of the DT with required options 
clf = DecisionTreeClassifier(criterion='entropy')
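With `criterion='entropy'`, candidate splits are ranked by how much they reduce the entropy of the class labels. The sketch below works through that arithmetic by hand for a root node with the dataset's 700/300 class counts; the child counts are made up, and the helper functions are illustrative rather than sklearn's internal implementation:

```python
import numpy as np

def entropy(class_counts):
    """Shannon entropy (in bits) of a node with the given class counts."""
    counts = np.asarray(class_counts, dtype=float)
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = sum(sum(child) for child in children)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

root = [700, 300]                       # normal vs. risky in the full data
left, right = [400, 100], [300, 200]    # hypothetical split, counts are made up

root_entropy = entropy(root)
gain = information_gain(root, [left, right])
print(round(root_entropy, 4), round(gain, 4))
```

A pure node has entropy 0, and an evenly mixed two-class node has entropy 1 bit, so the 70/30 root sits at roughly 0.88 bits.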

In [20]:
# Training/Build the model with train data
clf.fit(credit_train.iloc[:,:-1],train_labels)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
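`export_graphviz` is imported at the top but not used afterwards. As a hedged illustration of what it offers, the sketch below fits a shallow tree on synthetic data and exports it as DOT text; the data, the feature names and the class names are all placeholders rather than the credit columns:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic data standing in for the credit features (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0)
feature_names = ['f0', 'f1', 'f2', 'f3']  # placeholder names

# A shallow tree keeps the exported graph readable
small_tree = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=0)
small_tree.fit(X_demo, y_demo)

# out_file=None returns the DOT source as a string instead of writing a file
dot_source = export_graphviz(
    small_tree, out_file=None,
    feature_names=feature_names,
    class_names=['normal', 'risky'],  # placeholder names, in ascending label order
    filled=True)
print(dot_source[:200])
```

The DOT text can then be rendered with Graphviz, for example through the `dot` command-line tool.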

In [21]:
# Make predictions on test data
predictions = clf.predict(credit_test.iloc[:,:-1])

In [22]:
predictions

array([1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1,
       2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,
       1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2,
       1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2,
       1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1])

### 5. Evaluate the model (DT)

#### Confusion Matrix

In [23]:
confusion_matrix(test_labels,predictions)

array([[54, 18],
       [20,  8]])

#### Simple Accuracy

In [24]:
accuracy_score(test_labels,predictions)*100

62.0

#### Precision, Recall and F1-Score

In [25]:
from sklearn.metrics import precision_score, recall_score, f1_score

In [27]:
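# Precision (P)
# Note: pos_label defaults to 1, so the precision, recall and F1 in this section
# describe class 1 (normal customers), not the defaulters (class 2).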
precision_score(test_labels,predictions)

0.7297297297297297

In [28]:
# Recall (R)
recall_score(test_labels,predictions)

0.75

In [30]:
# F1-Score
f1_score(test_labels,predictions)

0.7397260273972601
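With labels 1 and 2, `precision_score`, `recall_score` and `f1_score` treat label 1 as the positive class by default, so the values above describe the normal customers. Since the stated goal is to find defaulters, the scores for class 2 are arguably more relevant. The sketch below rebuilds label vectors that reproduce the confusion matrix `[[54, 18], [20, 8]]` (the vectors are reconstructed for illustration and are not the actual test rows) and passes `pos_label=2`:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Label vectors that reproduce the confusion matrix [[54, 18], [20, 8]]
y_true = [1] * 72 + [2] * 28
y_hat = [1] * 54 + [2] * 18 + [1] * 20 + [2] * 8

# Default pos_label=1: scores for the normal customers
p_normal = precision_score(y_true, y_hat)              # 54 / (54 + 20)

# pos_label=2: scores for the risky customers
p_risky = precision_score(y_true, y_hat, pos_label=2)  # 8 / (8 + 18)
r_risky = recall_score(y_true, y_hat, pos_label=2)     # 8 / (8 + 20)
f_risky = f1_score(y_true, y_hat, pos_label=2)

print(round(p_risky, 3), round(r_risky, 3), round(f_risky, 3))
# prints: 0.308 0.286 0.296
```

On this test set the tree identifies only 8 of the 28 defaulters, which the class 1 scores alone do not reveal.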

In [86]:
# AUC (computed here from hard class predictions, not from predicted probabilities)
from sklearn.metrics import roc_auc_score
roc_auc_score(test_labels,predictions)

0.5178571428571428
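`roc_auc_score` is normally given a continuous score per sample; with hard labels the ROC curve has only one interior point. A sketch of the probability-based variant on synthetic data follows. The data, the `max_depth=4` setting and the variable names are assumptions for illustration, and a fully grown tree would mostly output probabilities of 0 or 1 anyway:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 70/30), standing in for the credit features
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.7, 0.3], flip_y=0.1, random_state=0)
X_tr, X_te, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=0)

# A depth limit leaves impure leaves, so predict_proba returns graded scores
tree = DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=0)
tree.fit(X_tr, y_tr_demo)

# Pick the probability column that belongs to the positive class (label 1 here)
pos_col = list(tree.classes_).index(1)
scores = tree.predict_proba(X_te)[:, pos_col]

auc_from_scores = roc_auc_score(y_te_demo, scores)
auc_from_labels = roc_auc_score(y_te_demo, tree.predict(X_te))
print(auc_from_scores, auc_from_labels)
```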

### 6. Cross Validation

In [76]:
# simplify names
y_tr = train_labels

In [77]:
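# k-fold cross validation
# Caution: credit_train still contains the `default` column, so passing it whole
# leaks the label into the features; the perfect scores below result from that leak.
# Pass credit_train.iloc[:,:-1] (features only) for an honest estimate.
from sklearn.model_selection import cross_val_predict
y_pr = cross_val_predict(clf, credit_train, train_labels, cv=5)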

In [72]:
confusion_matrix(y_tr, y_pr)

array([[628,   0],
       [  0, 272]])

In [73]:
# Precision (P)
precision_score(y_tr, y_pr)

1.0

In [74]:
# Recall (R)
recall_score(y_tr, y_pr)

1.0

In [78]:
# F1-Score
f1_score(y_tr, y_pr)

1.0
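The perfect scores above are a symptom of target leakage: `credit_train` still holds the `default` column, so the tree can read the answer from its input. The sketch below reproduces the effect on synthetic noisy data (the frame, the feature names and the noise level are made up for illustration) and contrasts it with a features-only run:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y randomises part of the labels, so no honest
# model should be able to reach a perfect score
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0)
frame = pd.DataFrame(X_demo, columns=['f0', 'f1', 'f2', 'f3', 'f4'])
frame['default'] = y_demo

tree = DecisionTreeClassifier(criterion='entropy', random_state=0)

# Leaky: the label column is part of the features
leaky_pred = cross_val_predict(tree, frame, frame['default'], cv=5)

# Honest: the label column is dropped from the features
honest_pred = cross_val_predict(
    tree, frame.drop(columns='default'), frame['default'], cv=5)

leaky_acc = accuracy_score(frame['default'], leaky_pred)
honest_acc = accuracy_score(frame['default'], honest_pred)
print(leaky_acc, honest_acc)
```

In the notebook's terms, the features-only call would use `credit_train.iloc[:,:-1]`, as the training cell in section 4 already does.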

### 7. Tuning the model

In [87]:
from sklearn.model_selection import GridSearchCV

params = {'max_leaf_nodes': list(range(2, 10)), 'min_samples_split': [2, 3, 4]}
grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), params, verbose=1, cv=3)

# Caution: credit_train includes the `default` column, so the search sees the label as a feature.
grid_search_cv.fit(credit_train, train_labels)

Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.3s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'max_leaf_nodes': [2, 3, 4, 5, 6, 7, 8, 9], 'min_samples_split': [2, 3, 4]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [88]:
grid_search_cv.best_estimator_

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=2, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=42, splitter='best')

In [89]:
from sklearn.metrics import accuracy_score

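# Caution: credit_test also includes the `default` column, so the accuracy of 1.0 below
# reflects label leakage (a single split on `default`), not real predictive skill.
y_pred = grid_search_cv.predict(credit_test)
accuracy_score(test_labels, y_pred)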

1.0
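A best tree with `max_leaf_nodes=2` and a test accuracy of 1.0 indicates that the search split once on the `default` column, which was part of both `credit_train` and `credit_test`. The sketch below runs the same parameter grid on features only, using synthetic data. The data and variable names are illustrative, and `scoring='recall'` is an assumed choice that favours catching the minority class, not something the original notebook specifies:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data; label 1 plays the role of the risky minority class
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.7, 0.3], flip_y=0.1, random_state=0)
X_tr, X_te, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=0)

# Same grid as in the notebook
params = {'max_leaf_nodes': list(range(2, 10)), 'min_samples_split': [2, 3, 4]}

# Only feature columns are passed to fit; the label is given separately
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), params, cv=3, scoring='recall')
search.fit(X_tr, y_tr_demo)

best = search.best_params_
test_recall = search.score(X_te, y_te_demo)  # uses the same recall scorer
print(best, test_recall)
```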