### Decision Trees Example 3.1
We split the **heart** data set into a training set of 250 cases and use the remaining cases as a test set (49 cases once rows with missing values are dropped, as the output below shows). We predict the outcome of these test cases using a decision tree grown on the training set and evaluate the predictions by means of a confusion matrix.


In [4]:
import numpy as np
import pandas as pd
from sklearn import tree

# Load data
df = pd.read_csv('./data/Heart.csv')

# Replace Categorical Variable with dummies
df = pd.get_dummies(data=df, columns=['AHD'], drop_first=True)
df['ChestPain'], ChestPain_codes = pd.factorize(df['ChestPain'])
df['Thal'], Thal_codes = pd.factorize(df['Thal'])
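# Drop rows with missing values. Note: pd.factorize above has already
# encoded missing 'Thal' values as -1, so those rows are kept.
df.dropna(inplace=True)
# Renumber the rows after removing NA. Without drop=True the old index is
# kept as a column 'index' and therefore ends up among the features.
df.reset_index(inplace=True)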

# Split in test-train
np.random.seed(0)
i = df.index
# Index of train
i_train = np.random.choice(i, replace=False, size=int(250))

# Save DataFrames
df_train = df.iloc[i_train]
df_test = df.drop(i_train)

# Define x and y
y_train = df_train['AHD_Yes']
y_test = df_test['AHD_Yes']
X_train = df_train.drop(columns=['AHD_Yes'])
X_test = df_test.drop(columns=['AHD_Yes'])

# Create and fit Decision tree classifier
clf = tree.DecisionTreeClassifier(criterion='entropy', 
                                  min_samples_split=2, 
                                  min_samples_leaf=1, 
                                  min_impurity_decrease=0.0001)
clf = clf.fit(X_train, y_train)

# Predictions:
y_train_pred = clf.predict(X_train).astype(int)
y_test_pred = clf.predict(X_test).astype(int)

# Create confusion matrix
def confusion(y_true, y_pred):
    conf = pd.DataFrame({'predicted': y_pred, 'true': y_true})
    conf = pd.crosstab(conf.predicted, conf.true, 
                       margins=True, margins_name="Sum")
    return conf

print('Test data:\n', 
      confusion(y_test.to_numpy(), y_test_pred))
print('\n\nTrain data:\n', 
      confusion(y_train.to_numpy(), y_train_pred))

Test data:
 true        0   1  Sum
predicted             
0          25   6   31
1           5  13   18
Sum        30  19   49


Train data:
 true         0    1  Sum
predicted               
0          131    0  131
1            0  119  119
Sum        131  119  250
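
As a cross-check on the layout of these tables, the following minimal sketch compares the `pd.crosstab` convention used above (rows: predicted, columns: true) with `sklearn.metrics.confusion_matrix`, which puts the true classes in the rows. The arrays `y_true_demo` and `y_pred_demo` are hypothetical labels made up for illustration; they are not taken from the heart data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only (not the heart data)
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 1, 1, 0, 1, 1, 0])

# scikit-learn convention: rows are true classes, columns are predictions
cm_sklearn = confusion_matrix(y_true_demo, y_pred_demo)

# Convention used above: rows are predictions, columns are true classes
cm_crosstab = pd.crosstab(y_pred_demo, y_true_demo).to_numpy()

print(cm_sklearn)
print(cm_crosstab)
```

The two matrices are transposes of each other, so it matters which convention is used when reading off false positives and false negatives.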


In [5]:
# Classification error:
err_test = abs(y_test - y_test_pred).mean()
err_train = abs(y_train - y_train_pred).mean()

print('Classification error on Testdata:\n', np.round(err_test, 3), 
     '\nClassification error on Traindata:\n', np.round(err_train, 3))

Classification error on Testdata:
 0.224 
Classification error on Traindata:
 0.0


The resulting classification error on the test set is $22.4\%$, i.e. $77.6\%$ of the test cases are classified correctly. The confusion matrix shows a similar number of misclassifications in both classes (5 false positives and 6 false negatives) and a slight imbalance between the classes (30 versus 19 test cases).
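
The test error can also be recomputed directly from the confusion matrix. The sketch below copies the four test-set counts from the output above into a hypothetical array `conf_test` and derives the overall error rate together with the error rate within each true class. The per-class rates are an addition for illustration and depend on the class sizes, not only on the number of misclassified cases.

```python
import numpy as np

# Test-set counts copied from the confusion matrix above
# (rows: predicted class, columns: true class)
conf_test = np.array([[25, 6],
                      [5, 13]])

n_total = conf_test.sum()
n_correct = np.trace(conf_test)
error_rate = 1 - n_correct / n_total

# Per-class error: misclassified cases divided by the true class size
class_sizes = conf_test.sum(axis=0)
class_errors = 1 - np.diag(conf_test) / class_sizes

print(round(error_rate, 3))        # 0.224
print(np.round(class_errors, 3))   # [0.167 0.316]
```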

In contrast, the error on the training set is $0\%$. The training error can be made arbitrarily small by splitting the regions of the tree further, until each terminal node contains only a single observation.

Allowing such very small terminal nodes comes at a price: the tree has a very high variance and generalizes poorly, as the gap between the training error of $0\%$ and the test error of $22.4\%$ shows.
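
The effect of the terminal node size can be explored with a small experiment. The following is a minimal sketch on synthetic stand-in data generated with `make_classification`, not on the heart data. The sample sizes, the noise level `flip_y`, and the candidate values for `min_samples_leaf` are assumptions chosen for illustration, and the resulting numbers will differ from those reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the heart data)
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=250,
                                          random_state=0)

# Fit one tree per minimum leaf size and record (train error, test error)
errors = {}
for leaf_size in (1, 5, 20):
    model = DecisionTreeClassifier(criterion='entropy',
                                   min_samples_leaf=leaf_size,
                                   random_state=0)
    model.fit(X_tr, y_tr)
    errors[leaf_size] = (1 - model.score(X_tr, y_tr),
                         1 - model.score(X_te, y_te))

for leaf_size, (err_tr, err_te) in errors.items():
    print(f'min_samples_leaf={leaf_size:2d}: '
          f'train error {err_tr:.3f}, test error {err_te:.3f}')
```

With `min_samples_leaf=1` the tree is grown until every leaf is pure, so the training error is zero. Requiring larger leaves raises the training error, and comparing the test errors across the settings indicates how much of the fit was due to noise.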