## Agenda

- Estimate a simple classification tree *without* cross-validation
- Estimate a simple classification tree *with* cross-validation

### Upload the data and create the variables of interest

Import the original default data, which are contained in the "Default.csv" file. The data are imported using the **pandas library** via the usual command *import pandas as pd*.

In [3]:
import pandas as pd

df = pd.read_csv(r'Default.csv', index_col =0)
df.describe()

| | balance | income |
|---|---|---|
| count | 10000.0 | 10000.0 |
| mean | 835.374886 | 33516.981876 |
| std | 483.714985 | 13336.639563 |
| min | 0.0 | 771.967729 |
| 25% | 481.731105 | 21340.462905 |
| 50% | 823.636973 | 34552.6448 |
| 75% | 1166.308387 | 43807.729275 |
| max | 2654.322576 | 73554.2335 |


The variable we want to predict here is **default**. In particular, we want to predict whether an individual defaults on a credit card loan, based on a large set of individuals in the US. The predictors are the account **balance** and the **income**. We first need to transform the default variable from categorical to numerical.

This can be done in a variety of ways. The one we use here simply sets a dummy equal to one when default = Yes and zero otherwise.

In [4]:
import numpy as np

df['default_dummy'] = np.where(df['default'].str.contains('Yes'), 1, 0)

This creates a single dummy variable, "default_dummy", which equals one for individuals who defaulted and zero otherwise. This dummy variable is the target we need to predict.
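To illustrate the point that the dummy can be built in several equivalent ways, here is a minimal sketch on a toy column. The `toy` DataFrame and its values are hypothetical and only stand in for the real `default` column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'default' column (hypothetical values, not the real data)
toy = pd.DataFrame({'default': ['No', 'Yes', 'No', 'No', 'Yes']})

# Option 1: np.where on a string match, as in the cell above
d1 = np.where(toy['default'].str.contains('Yes'), 1, 0)

# Option 2: boolean comparison cast to integer
d2 = (toy['default'] == 'Yes').astype(int)

# Option 3: explicit mapping from label to code
d3 = toy['default'].map({'No': 0, 'Yes': 1})

print(d1.tolist(), d2.tolist(), d3.tolist())
```

All three options give the same 0/1 coding here. The explicit comparison or mapping is arguably safer than a substring match if the labels could ever contain "Yes" as part of a longer string.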

As predictors we use the individual characteristics available in the dataset:

- Balance of account (balance)
- Household income (income)

Since we are working with a classification tree, there is no need to standardize the variables: each split depends only on the ordering of a predictor's values, not on its scale.
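The claim about standardization can be checked with a small experiment. The sketch below uses synthetic data (the feature ranges and the rule generating `y_toy` are assumptions made for illustration, not the course dataset) and compares a tree fitted on raw features with one fitted on standardized features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for two predictors on very different scales
X_toy = np.column_stack([
    rng.integers(0, 2500, size=1000),      # balance-like feature
    rng.integers(1000, 70000, size=1000),  # income-like feature
]).astype(float)
# Hypothetical target: depends on the first feature plus noise
y_toy = (X_toy[:, 0] + rng.normal(0, 300, size=1000) > 1800).astype(int)

# Standardize both features to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_toy)

# Same settings and seed, different feature scales
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)
tree_std = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y_toy)

# Share of observations on which the two trees agree
agreement = np.mean(tree_raw.predict(X_toy) == tree_std.predict(X_scaled))
print(f"Share of identical predictions: {agreement:.3f}")
```

Because standardization is a monotone transformation of each feature, the two trees should partition the observations in the same way, so the agreement is expected to be essentially complete.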

In [5]:
X = df[['balance','income']]
y = df['default_dummy']

Since we wish to predict the default event out of sample, we need to split the observations between those used for training/validation and those used for testing. This is done using the command **train_test_split**. Let us use 50% of the observations for training and 50% for testing.

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42, stratify=y)

Notice that the variable of interest is highly imbalanced: only a small fraction of the observations are defaults. We therefore "stratify" the split, so that the proportion of default vs non-default observations is comparable between the training/validation sample and the testing sample.
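The effect of stratification can be seen by comparing class shares across the two samples. The following is a minimal sketch on a synthetic imbalanced target (the counts and the placeholder feature are assumptions for illustration), with and without `stratify`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 333 positives out of 10,000 (illustrative only)
y_toy = np.zeros(10_000, dtype=int)
y_toy[:333] = 1
X_toy = np.arange(10_000).reshape(-1, 1)  # placeholder feature

# Stratified split: class shares are matched across the two samples
_, _, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=42, stratify=y_toy)

# Plain random split: class shares can drift apart by chance
_, _, y_tr_r, y_te_r = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=42)

print(f"stratified: train {y_tr_s.mean():.4f}, test {y_te_s.mean():.4f}")
print(f"random:     train {y_tr_r.mean():.4f}, test {y_te_r.mean():.4f}")
```

With stratification the two shares differ by at most one observation, whereas a plain random split offers no such guarantee when the positive class is small.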

## Estimate a classification tree 

### Without cross-validation 

We first estimate a classification tree without cross-validating the model hyperparameters. For instance, we fix the maximum depth of the tree (**max_depth**) at 6.

In [9]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth = 6)
dt.fit(X_train, y_train)

y_dt_pred = dt.predict(X_test)

from sklearn import metrics

print(metrics.confusion_matrix(y_test, y_dt_pred))

[[4800   34]
 [ 124   42]]
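As a reading aid for the matrix above, the sketch below unpacks its four cells using the scikit-learn convention (rows are true classes, columns are predicted classes) and computes three common summary ratios. The numbers are copied from the printed output; the choice of ratios is ours, for illustration only.

```python
import numpy as np

# Confusion matrix copied from the output above
# (scikit-learn convention: rows = true class, columns = predicted class)
cm = np.array([[4800, 34],
               [124, 42]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all observations classified correctly
recall = tp / (tp + fn)           # share of actual defaults that were detected
precision = tp / (tp + fp)        # share of predicted defaults that were real

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}, precision={precision:.3f}")
```

Accuracy is about 97%, but only about a quarter of the actual defaults (42 out of 166) are detected, which is typical when the target is highly imbalanced.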


Next week, we will learn how to compare different estimates. For now, we can just focus on the model estimation. 

### With cross-validation 

Now we cross-validate two hyperparameters of the classification tree: the maximum depth (**max_depth**) and the minimum number of samples required at a leaf node (**min_samples_leaf**).

In [10]:
dtree = DecisionTreeClassifier()

# Define the hyperparameter grid. min_samples_leaf must be an integer >= 1
# (or a fraction of the sample below 1), so integer counts are used here.
params_dt = {'max_depth': [2, 3, 4, 6],
             'min_samples_leaf': [1, 2, 3]}

# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate the grid search
dt_cv = GridSearchCV(estimator=dtree,
                     param_grid=params_dt,
                     scoring='accuracy',
                     cv=5,        # number of cross-validation folds
                     verbose=1,
                     n_jobs=-1)

dt_cv.fit(X_train, y_train)

# Predict with the best model, refit on the whole training sample
y_dt_cv_pred = dt_cv.predict(X_test)

print(metrics.confusion_matrix(y_test, y_dt_cv_pred))

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[[4807   27]
 [ 113   53]]
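After a grid search it is often useful to check which combination was selected. The sketch below runs a small search on synthetic data (generated with `make_classification`; the data and the grid values are assumptions, not the course dataset) and reads the fitted attributes `best_params_`, `best_score_` and `best_estimator_`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the training set (illustrative only)
X_toy, y_toy = make_classification(n_samples=1000, n_features=2, n_informative=2,
                                   n_redundant=0, weights=[0.9, 0.1],
                                   random_state=0)

# Hypothetical grid over the same two hyperparameters
grid = GridSearchCV(estimator=DecisionTreeClassifier(random_state=0),
                    param_grid={'max_depth': [2, 3, 4, 6],
                                'min_samples_leaf': [1, 5, 10]},
                    scoring='accuracy',
                    cv=5)
grid.fit(X_toy, y_toy)

print(grid.best_params_)                  # selected hyperparameter combination
print(round(grid.best_score_, 3))         # its mean cross-validated accuracy
print(grid.best_estimator_.get_depth())   # depth of the tree refit on all the data
```

The same attributes are available on the `dt_cv` object fitted above, so `dt_cv.best_params_` would report the depth and leaf size chosen for the default data.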


