# Decision Tree Lab

### Part 1: Load data

Import "bank-data.csv"

In [1]:
import pandas as pd
bankData = pd.read_csv('bank-data.csv', sep = ';')
bankData.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no


### Part 2: Preprocess data

Preprocess the dataset as you have done before

In [2]:
bankData.shape

(4521, 17)

#### 2.1 Binary encoding

Use LabelEncoder to encode the following columns:
- y
- default
- housing
- loan

In [3]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

#example
bankData['y'] = le.fit_transform(bankData['y'])
bankData.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,0
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,0
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,0
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,0
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,0


In [4]:
#Encode the remaining columns
bankData['default'] = le.fit_transform(bankData['default'])
bankData['housing'] = le.fit_transform(bankData['housing'])
bankData['loan'] = le.fit_transform(bankData['loan'])

bankData.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,0,1787,0,0,cellular,19,oct,79,1,-1,0,unknown,0
1,33,services,married,secondary,0,4789,1,1,cellular,11,may,220,1,339,4,failure,0
2,35,management,single,tertiary,0,1350,1,0,cellular,16,apr,185,1,330,1,failure,0
3,30,management,married,tertiary,0,1476,1,1,unknown,3,jun,199,4,-1,0,unknown,0
4,59,blue-collar,married,secondary,0,0,1,0,unknown,5,may,226,1,-1,0,unknown,0
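The four binary columns can also be encoded in one loop. The sketch below is a minimal illustration on a toy frame (the `toy` data and the `encoders` dictionary are assumptions, not part of the lab); it keeps one fitted encoder per column so the mapping can be inspected later. `LabelEncoder` sorts the labels, so `no` becomes 0 and `yes` becomes 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for bankData (illustrative values, not from the CSV)
toy = pd.DataFrame({
    'default': ['no', 'no', 'yes'],
    'housing': ['yes', 'no', 'yes'],
    'loan':    ['no', 'yes', 'no'],
    'y':       ['no', 'yes', 'no'],
})

binary_cols = ['y', 'default', 'housing', 'loan']
encoders = {}  # hypothetical helper: one fitted encoder per column

for col in binary_cols:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ lists the original labels; the position of a label is its code
print(encoders['y'].classes_)
print(toy)
```

Keeping a separate encoder per column makes it possible to call `encoders[col].inverse_transform(...)` later, which a single reused `le` object does not allow once it has been refit.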


#### 2.2 Convert categorical variables into dummy columns

(1) Use pd.get_dummies to convert the following categorical variables into dummy columns:
- job
- marital
- education
- contact
- month
- poutcome

(2) Drop columns that have been converted

In [5]:
#example
bankData = pd.concat([bankData,pd.get_dummies(bankData['job'],prefix='job')],axis=1)
bankData = bankData.drop(columns=['job'])
bankData.head()

Unnamed: 0,age,marital,education,default,balance,housing,loan,contact,day,month,...,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown
0,30,married,primary,0,1787,0,0,cellular,19,oct,...,0,0,0,0,0,0,0,0,1,0
1,33,married,secondary,0,4789,1,1,cellular,11,may,...,0,0,0,0,0,1,0,0,0,0
2,35,single,tertiary,0,1350,1,0,cellular,16,apr,...,0,0,1,0,0,0,0,0,0,0
3,30,married,tertiary,0,1476,1,1,unknown,3,jun,...,0,0,1,0,0,0,0,0,0,0
4,59,married,secondary,0,0,1,0,unknown,5,may,...,0,0,0,0,0,0,0,0,0,0


In [6]:
#convert the remaining categorical variables
bankData = pd.concat([bankData,pd.get_dummies(bankData['marital'],prefix='marital')],axis=1)
bankData = pd.concat([bankData,pd.get_dummies(bankData['education'],prefix='education')],axis=1)
bankData = pd.concat([bankData,pd.get_dummies(bankData['contact'],prefix='contact')],axis=1)
bankData = pd.concat([bankData,pd.get_dummies(bankData['month'],prefix='month')],axis=1)
bankData = pd.concat([bankData,pd.get_dummies(bankData['poutcome'],prefix='poutcome')],axis=1)

bankData = bankData.drop(['marital', 'education', 'contact', 'month', 'poutcome'], axis=1)

bankData.head()

Unnamed: 0,age,default,balance,housing,loan,day,duration,campaign,pdays,previous,...,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown
0,30,0,1787,0,0,19,79,1,-1,0,...,0,0,0,0,1,0,0,0,0,1
1,33,0,4789,1,1,11,220,1,339,4,...,0,0,1,0,0,0,1,0,0,0
2,35,0,1350,1,0,16,185,1,330,1,...,0,0,0,0,0,0,1,0,0,0
3,30,0,1476,1,1,3,199,4,-1,0,...,1,0,0,0,0,0,0,0,0,1
4,59,0,0,1,0,5,226,1,-1,0,...,0,0,1,0,0,0,0,0,0,1
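As an alternative to concatenating and dropping one column at a time, `pd.get_dummies` accepts a `columns` argument that converts several columns in one call and removes the originals. The sketch below uses a small toy frame (an assumption for illustration) rather than the bank data. Note that recent pandas versions return boolean dummy columns instead of `uint8`.

```python
import pandas as pd

# Toy frame standing in for bankData (illustrative values only)
toy = pd.DataFrame({
    'age': [30, 33, 35],
    'marital': ['married', 'married', 'single'],
    'contact': ['cellular', 'unknown', 'cellular'],
})

# Convert both categorical columns at once; the source columns are dropped
# automatically and each dummy is prefixed with its column name.
encoded = pd.get_dummies(toy, columns=['marital', 'contact'])

print(encoded.columns.tolist())
```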


In [7]:
bankData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 49 columns):
age                    4521 non-null int64
default                4521 non-null int64
balance                4521 non-null int64
housing                4521 non-null int64
loan                   4521 non-null int64
day                    4521 non-null int64
duration               4521 non-null int64
campaign               4521 non-null int64
pdays                  4521 non-null int64
previous               4521 non-null int64
y                      4521 non-null int64
job_admin.             4521 non-null uint8
job_blue-collar        4521 non-null uint8
job_entrepreneur       4521 non-null uint8
job_housemaid          4521 non-null uint8
job_management         4521 non-null uint8
job_retired            4521 non-null uint8
job_self-employed      4521 non-null uint8
job_services           4521 non-null uint8
job_student            4521 non-null uint8
job_technician         4521 non-n

#### 2.3 Train/Test separation

Split the data with the hold-out method:
- 60% training set
- 40% testing set

In [8]:
bankData_train = bankData.sample(frac = 0.6)
bankData_test = bankData.drop(bankData_train.index)
print(pd.crosstab(bankData_train['y'],columns = 'count'))
print(pd.crosstab(bankData_test['y'],columns = 'count'))

col_0  count
y           
0       2391
1        322
col_0  count
y           
0       1609
1        199
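Because `sample` draws rows at random without a seed, the class counts above change on every run. A hedged alternative, not the method used in this lab, is scikit-learn's `train_test_split` with `stratify` and `random_state`, which keeps the class ratio equal in both sets and makes the split repeatable. The toy frame below is an assumption for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 10 positives out of 100 rows (illustrative only)
toy = pd.DataFrame({'x': range(100), 'y': [1] * 10 + [0] * 90})

# 60% train / 40% test, preserving the share of y == 1 in both parts
train, test = train_test_split(
    toy, train_size=0.6, stratify=toy['y'], random_state=42
)

print(train['y'].value_counts())
print(test['y'].value_counts())
```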


##### X/y separation

In [9]:
bankData_train_y = bankData_train['y']
bankData_train_X = bankData_train.copy()
del bankData_train_X['y']

bankData_test_y = bankData_test['y']
bankData_test_X = bankData_test.copy()
del bankData_test_X['y']

### Part 3: Train a decision tree model

In [10]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(min_samples_leaf=30, max_depth=5)
clf = clf.fit(bankData_train_X, bankData_train_y)
print(clf)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=30, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')


##### Tree Visualization

You MUST first install Graphviz (both the 'graphviz' Python package and the Graphviz system binaries) in order to run the following code.

In [11]:
import graphviz
dot_data = tree.export_graphviz(clf, out_file=None, 
                              feature_names=bankData_train_X.columns,
                              class_names=['0','1'],
                              filled=True, rounded=True,
                              special_characters=True, rotate=True)
graph = graphviz.Source(dot_data)
graph.render('dtree_render')

graph.view()

'dtree_render.pdf'

##### Variable importance

In [12]:
tree_feature = pd.DataFrame({'feature':bankData_train_X.columns,
                             'Score':clf.feature_importances_})

tree_feature.sort_values(by = 'Score', ascending=False)

Unnamed: 0,feature,Score
6,duration,0.587739
46,poutcome_success,0.279582
31,contact_unknown,0.047861
23,marital_married,0.034632
2,balance,0.019944
3,housing,0.019505
0,age,0.010737
36,month_jan,0.0
28,education_unknown,0.0
29,contact_cellular,0.0


##### Prediction

In [13]:
clf.predict(bankData_test_X)

array([0, 0, 0, ..., 0, 0, 0])

In [14]:
clf.predict_proba(bankData_test_X)

array([[0.98074278, 0.01925722],
       [0.86609687, 0.13390313],
       [0.98074278, 0.01925722],
       ...,
       [0.75      , 0.25      ],
       [0.98074278, 0.01925722],
       [0.98074278, 0.01925722]])

### Part 4: Model Evaluation

Evaluation metrics
- confusion matrix
- accuracy
- precision, recall, f1-score

In [15]:
#confusion matrix
res = clf.predict(bankData_test_X)
pd.crosstab(bankData_test_y, res)

col_0,0,1
y,Unnamed: 1_level_1,Unnamed: 2_level_1
0,1544,65
1,127,72


In [34]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

print("Accuracy:\t %.3f" %accuracy_score(bankData_test_y, res))
print(classification_report(bankData_test_y, res))

Accuracy:	 0.894
              precision    recall  f1-score   support

           0       0.92      0.96      0.94      1609
           1       0.53      0.36      0.43       199

    accuracy                           0.89      1808
   macro avg       0.72      0.66      0.69      1808
weighted avg       0.88      0.89      0.89      1808
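The figures in the report above can be re-derived by hand from the confusion matrix printed in cell In [15] (rows are actual classes, columns are predicted classes). The sketch below does the arithmetic for class 1; the variable names are chosen here for illustration.

```python
# Counts taken from the confusion matrix above
tn, fp = 1544, 65   # actual 0: predicted 0, predicted 1
fn, tp = 127, 72    # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total                        # share of correct predictions
precision = tp / (tp + fp)                          # of predicted 1s, how many are right
recall = tp / (tp + fn)                             # of actual 1s, how many are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("accuracy:  %.3f" % accuracy)
print("precision: %.2f" % precision)
print("recall:    %.2f" % recall)
print("f1-score:  %.2f" % f1)
```

The low recall for class 1 shows why accuracy alone can be misleading on an imbalanced target: most of the 0.894 comes from the majority class.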



### Part 5: Model tuning

#### Note:

After building the decision tree classifier, try answering the following questions.

1. What is the Accuracy Score?
2. If you change your preprocessing method, can you improve the model?
3. If you change your parameter settings, can you improve the model?

##### Pruning Parameters
- max_leaf_nodes
    - Reduce the number of leaf nodes
- min_samples_leaf
    - Set the minimum number of samples required in a leaf
    - The minimum sample size in terminal nodes can be fixed to 30, 100, 300 or 5% of the total
- max_depth
    - Reduce the depth of the tree to build a more general tree
    - Try depths such as 3, 5 or 10, and choose one after verifying on test data
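
The candidate values listed above can be searched systematically. The following is a minimal sketch, assuming synthetic data from `make_classification` in place of the bank data, and using `GridSearchCV` (cross-validation on the training data) rather than repeated checks against the test set. The grid values and the `f1` scoring choice are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data (about 12% positives)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.88, 0.12], random_state=0)

# Candidate pruning parameters, taken from the list above
param_grid = {
    'max_depth': [3, 5, 10],
    'min_samples_leaf': [30, 100, 300],
    'max_leaf_nodes': [None, 20, 70],
}

# 5-fold cross-validation; f1 is used because the classes are imbalanced
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring='f1')
search.fit(X, y)

print(search.best_params_)
print("best mean f1: %.3f" % search.best_score_)
```

Selecting parameters by cross-validation keeps the test set untouched for the final evaluation, so the reported score is not biased by the tuning itself.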

<b>Description:</b> The cells below tune the model parameters and test the effect of the preprocessing method.

In [37]:
# split the data: 70% train, 30% test
train_data = bankData.sample(frac=0.7)
test_data = bankData.drop(train_data.index)

# separate x and y
train_data_y = train_data['y']
train_data_X = train_data.copy()
del train_data_X['y']

test_data_y = test_data['y']
test_data_X = test_data.copy()
del test_data_X['y']

# train model 
# model parameters are same as model clf
clf2 = tree.DecisionTreeClassifier(min_samples_leaf=30, max_depth=5)
clf2.fit(train_data_X, train_data_y)
print(clf2)

# predict
clf2_pred = clf2.predict(test_data_X)

# evaluate model
print("Accuracy:\t %.3f" %accuracy_score(test_data_y, clf2_pred))
print(classification_report(test_data_y, clf2_pred))

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=30, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
Accuracy:	 0.907
              precision    recall  f1-score   support

           0       0.92      0.98      0.95      1215
           1       0.60      0.31      0.41       141

    accuracy                           0.91      1356
   macro avg       0.76      0.64      0.68      1356
weighted avg       0.89      0.91      0.89      1356



In [32]:
# train model with different pruning parameters
clf3 = tree.DecisionTreeClassifier(min_samples_leaf=20, max_depth=10, max_leaf_nodes=70)
clf3 = clf3.fit(bankData_train_X, bankData_train_y)  # fit clf3 itself, not clf
print(clf3)

# predict
clf3_pred = clf3.predict(bankData_test_X)

# evaluate model
print("Accuracy:\t %.3f" %accuracy_score(bankData_test_y, clf3_pred))
print(classification_report(bankData_test_y, clf3_pred))

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
                       max_features=None, max_leaf_nodes=70,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=20, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
Accuracy:	 0.895
              precision    recall  f1-score   support

           0       0.92      0.97      0.94      1609
           1       0.55      0.28      0.37       199

    accuracy                           0.90      1808
   macro avg       0.73      0.62      0.66      1808
weighted avg       0.88      0.90      0.88      1808



<b>Q1:</b> What is the Accuracy Score?  <br>
<b>A1:</b> The accuracy score is the fraction of predictions that are correct out of all predictions. The three models above, clf, clf2 and clf3, reach accuracies of 0.894, 0.907 and 0.895 respectively.

<b>Q2:</b> If you change your preprocessing method, can you improve the model? <br>
<b>A2:</b> Yes. Comparing the evaluation metrics above, model clf2 (70% training data, 30% test data) reaches higher accuracy (0.907) than model clf (60% training data, 40% test data, 0.894). Note that the two models are scored on different random test sets, so part of the difference may come from sampling.
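
One way to check whether a gap between a 60/40 and a 70/30 split is more than sampling noise is to repeat each split many times and compare the spread of the scores. The sketch below is illustrative only: it assumes synthetic data from `make_classification` instead of the bank data, and the name `results` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.88, 0.12], random_state=0)

# Same tree settings as clf and clf2 in this lab
model = DecisionTreeClassifier(min_samples_leaf=30, max_depth=5, random_state=0)

results = {}
for test_size in (0.4, 0.3):
    # 20 independent random hold-out splits of the given proportion
    cv = ShuffleSplit(n_splits=20, test_size=test_size, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    results[test_size] = scores
    print("test_size=%.1f  mean=%.3f  std=%.3f"
          % (test_size, scores.mean(), scores.std()))
```

If the standard deviations are comparable to the gap between the two means, a single pair of splits is not enough to conclude that one proportion is better.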

<b>Q3:</b> If you change your parameter settings, can you improve the model? <br>
<b>A3:</b> Yes. No single set of parameters suits every dataset, so choosing parameters that fit the data at hand can give a more reliable model than relying on one fixed setting. For example, comparing clf and clf3 on the same training set, clf3 (min_samples_leaf=20, max_depth=10, max_leaf_nodes=70) reaches slightly higher accuracy than clf (min_samples_leaf=30, max_depth=5): 0.895 versus 0.894. The gain is small, and recall on class 1 drops from 0.36 to 0.28, so accuracy should not be the only metric considered.