<script>
var css = '.container { width: 100% !important; padding-left: 1em; padding-right: 2em; } div.output_stderr { background: #FFA; }',
    head = document.head || document.getElementsByTagName('head')[0],
    style = document.createElement('style');

style.type = 'text/css';
if (style.styleSheet){
  style.styleSheet.cssText = css;
} else {
  style.appendChild(document.createTextNode(css));
}

head.appendChild(style);
</script>

In [1]:
# %load nbinit.py
from IPython.display import HTML
HTML("<style>.container { width: 100% !important; padding-left: 1em; padding-right: 2em; } div.output_stderr { background: #FFA; }</style>")

# Decision Tree
Let's see how well a decision tree can classify the data. Hereby we need to consider
1. the parameters to the classifier, and
2. the features of the data set that will be used.
We may just explore the impact of the maximum depth of the decision tree. Two of the 16 features ('day' and 'month') may not be useful because they reflect a date, and we're not looking for seasonal effects. So, it's fairly safe to take them out.

Once the dataset is loaded we will convert the categorical data into numeric values.

Finding the right parameters and features for the best performing classifier can be a challenge. The number of possible configurations grows quickly, and knowing how they perform requires training and testing with each of them.

We may also run the training and testing on a configuration multiple times with different random splits of the data set. The performance metrics will be avaraged over the iterations.

We use percision, recall, and the F1 score to evaluate each configuration.


In [29]:
### Load Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import sklearn.tree
import pydot_ng as pdot
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
import itertools

## Reading Data

In [2]:
### Read data
DATAFILE = '/home/data/archive.ics.uci.edu/BankMarketing/bank.csv'
df = pd.read_csv(DATAFILE, sep=';')

In [3]:
### use sets and '-' difference operation 'A-B'. Also there is a symmetric different '^'
all_features = set(df.columns)-set(['y'])
num_features = set(df.describe().columns)
cat_features = all_features-num_features
print("All features:         ", ", ".join(all_features), "\nNumerical features:   ", ", ".join(num_features), "\nCategorical features: ", ", ".join(cat_features))

All features:          balance, day, education, previous, loan, contact, pdays, marital, duration, job, campaign, month, poutcome, age, default, housing 
Numerical features:    balance, day, duration, previous, campaign, age, pdays 
Categorical features:  job, education, month, loan, contact, poutcome, default, marital, housing


In [30]:
### convert to categorical variables to numeric ones
level_substitution = {}

def levels2index(levels):
    dct = {}
    for i in range(len(levels)):
        dct[levels[i]] = i
    return dct

df_num = df.copy()

for c in cat_features:
    level_substitution[c] = levels2index(df[c].unique())
    df_num[c].replace(level_substitution[c], inplace=True)

## same for target
df_num.y.replace({'no':0, 'yes':1}, inplace=True)
df_num

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,0,0,0,0,1787,0,0,0,19,0,79,1,-1,0,0,0
1,33,1,0,1,0,4789,1,1,0,11,1,220,1,339,4,1,0
2,35,2,1,2,0,1350,1,0,0,16,2,185,1,330,1,1,0
3,30,2,0,2,0,1476,1,1,1,3,3,199,4,-1,0,0,0
4,59,3,0,1,0,0,1,0,1,5,1,226,1,-1,0,0,0
5,35,2,1,2,0,747,0,0,0,23,4,141,2,176,3,1,0
6,36,4,0,2,0,307,1,0,0,14,1,341,1,330,2,2,0
7,39,5,0,1,0,147,1,0,0,6,1,151,2,-1,0,0,0
8,41,6,0,2,0,221,1,0,1,14,1,57,2,-1,0,0,0
9,43,1,0,0,0,-88,1,1,0,17,2,313,1,147,2,1,0


In [33]:
### create feature matrix and target vector
X = df_num[list(all_features-set(['day', 'month']))].as_matrix()
y = df_num.y.as_matrix()
X, y

(array([[1787,    0,    0, ...,   30,    0,    0],
        [4789,    1,    4, ...,   33,    0,    1],
        [1350,    2,    1, ...,   35,    0,    1],
        ..., 
        [ 295,    1,    0, ...,   57,    0,    0],
        [1137,    1,    3, ...,   28,    0,    0],
        [1136,    2,    7, ...,   44,    0,    1]]),
 array([0, 0, 0, ..., 0, 0, 0]))

## Evaluation
Test how Maximum Depth of tree impacts performance

In [34]:
for d in [3, 5, 7, 11, 13]:
    clf = DecisionTreeClassifier(max_depth=d)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=42)
    clf.fit(X_train, y_train)
    ŷ = clf.predict(X_test)
    print('Depth %d' % d)
    print(classification_report(y_test, ŷ))

Depth 3
             precision    recall  f1-score   support

          0       0.93      0.97      0.95      1620
          1       0.55      0.35      0.43       189

avg / total       0.89      0.90      0.89      1809

Depth 5
             precision    recall  f1-score   support

          0       0.93      0.96      0.94      1620
          1       0.51      0.34      0.41       189

avg / total       0.88      0.90      0.89      1809

Depth 7
             precision    recall  f1-score   support

          0       0.93      0.96      0.94      1620
          1       0.51      0.34      0.41       189

avg / total       0.88      0.90      0.89      1809

Depth 11
             precision    recall  f1-score   support

          0       0.93      0.94      0.93      1620
          1       0.41      0.38      0.39       189

avg / total       0.87      0.88      0.88      1809

Depth 13
             precision    recall  f1-score   support

          0       0.93      0.92      0.92  

Two methods from `sklearn.metrics` can be helpful:
1. `confusion_matrix` produces a confusion matrix
2. `precision_recall_fscore_support` returns a matrix with values for each of them across all target levels.

In [36]:
cm = confusion_matrix(y_test, ŷ)
cm

array([[1484,  136],
       [ 109,   80]])

In [37]:
prf1s = precision_recall_fscore_support(y_test, ŷ)
prf1s

(array([ 0.93157564,  0.37037037]),
 array([ 0.91604938,  0.42328042]),
 array([ 0.92374728,  0.39506173]),
 array([1620,  189]))

In [10]:
perf = None
for i in range(100):
    if type(perf)!=type(None):
        perf = np.vstack((perf, np.array(prf1s).reshape(1,8)))
    else:
        perf = np.array(prf1s).reshape(1,8)
perf_agg = perf.mean(axis=0)
pd.DataFrame(perf_agg.reshape(1,8), columns=[[b for a in ['Precision', 'Recall', 'F1_score', 'Support'] for b in [a, a]], ['no', 'yes']*4])
##pd.DataFrame([5,5, 'a|b|c'] + list(perf.mean(axis=0)), columns=perf_df.columns)

Unnamed: 0_level_0,Precision,Precision,Recall,Recall,F1_score,F1_score,Support,Support
Unnamed: 0_level_1,no,yes,no,yes,no,yes,no,yes
0,0.933374,0.462428,0.942593,0.42328,0.937961,0.441989,1620.0,189.0


In [14]:
performance_df = pd.DataFrame(columns=[
        ['Params']*3 + [b for a in ['Precision', 'Recall', 'F1_score', 'Support'] for b in [a, a]],
        ['MaxDepth', 'Nfeature', 'Features'] + ['no', 'yes']*4
    ])
tempdf = pd.concat([
        pd.DataFrame({'a': [1], 'b': [2], 'c': ['Hello']}),
        pd.DataFrame(np.zeros((1,8)))
    ], axis=1, ignore_index=True)

tempdf.columns=performance_df.columns
#performance_df
tempdf

Unnamed: 0_level_0,Params,Params,Params,Precision,Precision,Recall,Recall,F1_score,F1_score,Support,Support
Unnamed: 0_level_1,MaxDepth,Nfeature,Features,no,yes,no,yes,no,yes,no,yes
0,1,2,Hello,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [274]:
pd.DataFrame(np.zeros(8).reshape(1,8))

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## The Heavy Lifting
Now, let's run the performance evaluation across a number of configurations. We'll collect the results for each configuration into a dataframe.

In [41]:
# creating a template (i.e. empty table)
performance_template_df = pd.DataFrame(columns= [
        ['Params']*3 + [b for a in ['Precision', 'Recall', 'F1_score', 'Support'] for b in [a, a]],
        ['MaxDepth', 'Nfeature', 'Features'] + ['no', 'yes']*4
    ])
performance_template_df

Unnamed: 0_level_0,Params,Params,Params,Precision,Precision,Recall,Recall,F1_score,F1_score,Support,Support
Unnamed: 0_level_1,MaxDepth,Nfeature,Features,no,yes,no,yes,no,yes,no,yes


The following code implements nested loops for MaxDepth, number and permutation of features. In addition, we have an internal loop to
aggregate the performance metrics over a number of different random splits.

The outer two loops, however, only iterate over one value each. The commmented code shows how they should run...

In [42]:
%%time
performance_df = performance_template_df.copy() #-- always start fresh

for MaxDepth in [5]: ###range(5,9):
    for Nftr in [8]: ###[len(all_features) - k for k in range(len(all_features)-2))]:
        for ftrs in itertools.combinations(all_features-set(['day', 'month']), Nftr):
            X = df_num[list(ftrs)].as_matrix()
            clf = DecisionTreeClassifier(max_depth=MaxDepth)

            perf_arr = None    #-- this array will hold results for different random samples
            for i in range(10): ### running train and test on different random samples
                X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=i)
                clf.fit(X_train, y_train)
                ŷ = clf.predict(X_test)
                #Prec, Recall, F1, Supp 
                prf1s = precision_recall_fscore_support(y_test, ŷ)

                ## 
                if type(perf_arr)!=type(None):
                    perf_arr = np.vstack((perf, np.array(prf1s).reshape(1,8)))
                else:
                    perf_arr = np.array(prf1s).reshape(1,8)
            perf_agg = perf_arr.mean(axis=0)  #-- mean over rows, for each column
            perf_df = pd.concat([    #-- creating a 1 row dataframe is a bit tricky because of the different data types
                        pd.DataFrame({'a': [MaxDepth], 'b': [Nftr], 'c': ['|'.join(list(ftrs))]}),
                        pd.DataFrame(perf_agg.reshape(1, 8))
                    ], axis=1, ignore_index=True)
            perf_df.columns=performance_df.columns
            performance_df = performance_df.append(perf_df, ignore_index=True)

  'precision', 'predicted', average, warn_for)


CPU times: user 2min 15s, sys: 13.7 ms, total: 2min 15s
Wall time: 2min 15s


In [43]:
performance_df

Unnamed: 0_level_0,Params,Params,Params,Precision,Precision,Recall,Recall,F1_score,F1_score,Support,Support
Unnamed: 0_level_1,MaxDepth,Nfeature,Features,no,yes,no,yes,no,yes,no,yes
0,5.0,8.0,balance|education|previous|loan|contact|pdays|...,0.933046,0.463507,0.942820,0.422060,0.937899,0.441508,1611.732673,188.316832
1,5.0,8.0,balance|education|previous|loan|contact|pdays|...,0.932785,0.462509,0.943047,0.419750,0.937859,0.438769,1611.732673,188.316832
2,5.0,8.0,balance|education|previous|loan|contact|pdays|...,0.932806,0.463350,0.943060,0.419915,0.937876,0.439048,1611.732673,188.316832
3,5.0,8.0,balance|education|previous|loan|contact|pdays|...,0.932874,0.464323,0.943047,0.420492,0.937909,0.439919,1611.732673,188.316832
4,5.0,8.0,balance|education|previous|loan|contact|pdays|...,0.932799,0.463790,0.943085,0.419832,0.937883,0.438933,1611.732673,188.316832
5,5.0,8.0,balance|education|previous|loan|contact|pdays|...,0.932806,0.463350,0.943060,0.419915,0.937876,0.439048,1611.732673,188.316832
6,5.0,8.0,balance|education|previous|loan|contact|pdays|...,0.932779,0.463181,0.943085,0.419667,0.937872,0.438655,1611.732673,188.316832
7,5.0,8.0,balance|education|previous|loan|contact|pdays|...,0.933073,0.464120,0.942884,0.422225,0.937943,0.441793,1611.732673,188.316832
8,5.0,8.0,balance|education|previous|loan|contact|pdays|...,0.933084,0.464179,0.942884,0.422307,0.937949,0.441880,1611.732673,188.316832
9,5.0,8.0,balance|education|previous|loan|contact|pdays|...,0.933220,0.464450,0.942833,0.423380,0.937998,0.442813,1611.732673,188.316832


That took a while (about 2 minutes). Once computations take that long we should look at a different way to implement them ... ** outside the notebook **.

Let's see what the best performing configuration with respect to the F1-score of 'yes' is:

In [61]:
best = performance_df.F1_score.yes.argmax()
print(performance_df.iloc[best])
print("\nFeatures: ", ', '.join([ '"%s"'%f for f in performance_df.iloc[best].Params.Features.split('|') ], ))

Params     MaxDepth                                                    5
           Nfeature                                                    8
           Features    balance|education|previous|loan|contact|pdays|...
Precision  no                                                    0.93322
           yes                                                   0.46445
Recall     no                                                   0.942833
           yes                                                   0.42338
F1_score   no                                                   0.937998
           yes                                                  0.442813
Support    no                                                    1611.73
           yes                                                   188.317
Name: 9, dtype: object

Features:  "balance", "education", "previous", "loan", "contact", "pdays", "duration", "poutcome"
