#HW3: Decision Trees

First, we'll import all the packages and data we'll need.

In [1]:
# Import packages and modules
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Read both datasets into pandas DataFrames
df = pd.read_csv('bank.txt', sep=';')
df_full = pd.read_csv('bank-additional-full.txt', sep=';')

###Q1: Use the file <code>bank.csv</code> to explore the dataset.

Observe the features:
<ul>
<li>Are they numbers?</li>
<li>Are they strings?</li>
<li>Are they binary?</li>
<li>Are they continuous?</li>
</ul>

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4521 entries, 0 to 4520
Data columns (total 17 columns):
age          4521 non-null int64
job          4521 non-null object
marital      4521 non-null object
education    4521 non-null object
default      4521 non-null object
balance      4521 non-null int64
housing      4521 non-null object
loan         4521 non-null object
contact      4521 non-null object
day          4521 non-null int64
month        4521 non-null object
duration     4521 non-null int64
campaign     4521 non-null int64
pdays        4521 non-null int64
previous     4521 non-null int64
poutcome     4521 non-null object
y            4521 non-null object
dtypes: int64(7), object(10)
memory usage: 635.8+ KB


In [3]:
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no


The following fields are categorical:
* job (string)
* marital (string)
* education (string)
* default (string, binary)
* housing (string, binary)
* loan (string, binary)
* contact (string)
* month (string)
* campaign (int64)
* previous (int64)
* poutcome (string)
* y (string, binary)

These fields are continuous:
* age (int64)
* balance (int64)
* day (int64)
* duration (int64)
* pdays (int64)
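
The classification above can also be checked programmatically. The sketch below is one way to do it, using a small frame rebuilt from the five rows printed by <code>df.head()</code>; the names <code>sample</code>, <code>string_cols</code>, <code>numeric_cols</code> and <code>binary_cols</code> are illustrative and not part of the original notebook.

```python
import pandas as pd

# The five rows shown by df.head() above, restricted to a few columns.
sample = pd.DataFrame({
    'age': [30, 33, 35, 30, 59],
    'job': ['unemployed', 'services', 'management', 'management', 'blue-collar'],
    'marital': ['married', 'married', 'single', 'married', 'married'],
    'housing': ['no', 'yes', 'yes', 'yes', 'yes'],
    'balance': [1787, 4789, 1350, 1476, 0],
})

# Numeric columns versus everything else (here, the string columns).
numeric_cols = sample.select_dtypes(include=['number']).columns.tolist()
string_cols = sample.select_dtypes(exclude=['number']).columns.tolist()

# Count distinct values per string column; exactly two means binary.
n_unique = sample[string_cols].nunique()
binary_cols = n_unique[n_unique == 2].index.tolist()

print(numeric_cols)
print(string_cols)
print(binary_cols)
```

On only five rows, <code>marital</code> shows just two distinct values, so a check like this is only reliable when run on the full <code>df</code>.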

###Q2: Learn about label encoders at the following link and use what you learn to transform the features to numerical features.

[OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)

First, we'll import two additional modules that we'll use to preprocess the data: <code>OneHotEncoder</code> and <code>LabelEncoder</code>.

In [4]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
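
The two kinds of encoding produce quite different outputs for the same column. As a rough illustration, the sketch below encodes the <code>job</code> values from the five rows shown earlier, first with <code>LabelEncoder</code> and then as one-hot columns. It uses <code>pd.get_dummies</code> for the one-hot step as a stand-in for <code>OneHotEncoder</code>, which is an assumption made to keep the example short.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Job values from the five rows shown by df.head() above.
jobs = pd.Series(
    ['unemployed', 'services', 'management', 'management', 'blue-collar'],
    name='job',
)

# LabelEncoder assigns integer codes in sorted (alphabetical) class order,
# so the codes imply an ordering that has no real meaning for job titles.
le_demo = LabelEncoder()
codes = le_demo.fit_transform(jobs)
print(list(le_demo.classes_))
print(codes.tolist())

# One-hot encoding creates one 0/1 column per category instead.
one_hot = pd.get_dummies(jobs, prefix='job', dtype=int)
print(one_hot)
```

Label encoding keeps the feature matrix narrow, which is convenient for a tree, while one-hot encoding avoids the artificial ordering at the cost of more columns.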

Next, we'll map the <code>y</code> label to binary numerical values (0 for <code>no</code>, 1 for <code>yes</code>) to make modelling easier.

In [13]:
# We'll create a test DataFrame to practice feature engineering
df_test = df.copy(deep=True)

# Change categorical features from string values to numerical values, starting with the label ('y')
df_test['y'] = df_test.y.map({'no':0,'yes':1})

# Choose meaningful string type categorical features to convert to numerical values, then use for-loop combined with
# LabelEncoder to change values
# columns: 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'poutcome'
# TODO(justindelatorre): Leave out 'contact' feature because it doesn't seem 
feature_cols = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'poutcome']

le = LabelEncoder()
for col in feature_cols:
    df_test[col] = le.fit_transform(df_test[col])
    
print(df_test.head(20))

    age  job  marital  education  default  balance  housing  loan  contact  \
0    30   10        1          0        0     1787        0     0        0   
1    33    7        1          1        0     4789        1     1        0   
2    35    4        2          2        0     1350        1     0        0   
3    30    4        1          2        0     1476        1     1        2   
4    59    1        1          1        0        0        1     0        2   
5    35    4        2          2        0      747        0     0        0   
6    36    6        1          2        0      307        1     0        0   
7    39    9        1          1        0      147        1     0        0   
8    41    2        1          2        0      221        1     0        2   
9    43    7        1          0        0      -88        1     1        0   
10   39    7        1          1        0     9374        1     0        2   
11   43    0        1          1        0      264        1     
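
The loop above refits a single <code>LabelEncoder</code> for every column, so once it finishes only the mapping for the last column is still available. A minimal sketch of one alternative, assuming a dictionary named <code>encoders</code> (hypothetical, not in the original notebook) that holds one fitted encoder per column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame using values from the rows shown earlier.
toy = pd.DataFrame({
    'marital': ['married', 'married', 'single', 'married', 'married'],
    'contact': ['cellular', 'cellular', 'cellular', 'unknown', 'unknown'],
})

# Keep one fitted encoder per column so each mapping can be looked up later.
encoders = {}
encoded = toy.copy()
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# Recover the original strings from the integer codes.
decoded_contact = encoders['contact'].inverse_transform(encoded['contact'].to_numpy())

# Explicit label -> code mapping for one column.
contact_mapping = {
    label: code for code, label in enumerate(encoders['contact'].classes_)
}
print(contact_mapping)
```

Keeping the encoders around makes it possible to translate a split such as "contact <= 0.5" back into the category names it refers to.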

###Q3: Build a decision tree model to predict whether a prospect will buy the product.

We'll use the categorical features we just encoded as the predictors.

In [14]:
# TODO(justindelatorre): Add 'previous' as a potentially meaningful feature column
X = df_test[feature_cols]
y = df_test.y

In [16]:
from sklearn.tree import DecisionTreeClassifier
treeclf = DecisionTreeClassifier(max_depth=len(feature_cols), random_state=1)
treeclf.fit(X, y)

DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=8, max_features=None, max_leaf_nodes=None,
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            random_state=1, splitter='best')
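
Once a tree is fitted, its <code>feature_importances_</code> attribute gives a rough view of which columns the splits rely on. The sketch below shows the idea on synthetic data, not the bank data; the feature names and the rule that generates the label are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (not the bank data): three integer-coded features,
# with the label determined entirely by the first one.
rng = np.random.RandomState(1)
X_demo = rng.randint(0, 3, size=(200, 3))
y_demo = (X_demo[:, 0] == 2).astype(int)

demo_tree = DecisionTreeClassifier(max_depth=3, random_state=1)
demo_tree.fit(X_demo, y_demo)

# One importance per feature; they sum to 1 when the tree has at least one split.
names = ['feature_a', 'feature_b', 'feature_c']
for name, importance in zip(names, demo_tree.feature_importances_):
    print('{}: {:.3f}'.format(name, importance))
```

Applied to the notebook's model, the same loop would run over <code>feature_cols</code> and <code>treeclf.feature_importances_</code>.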

###Q4: Evaluate the accuracy of your decision tree model using cross validation.

In [18]:
scores = cross_val_score(treeclf, X, y, cv=10, scoring='accuracy')
print('Decision tree accuracy: {}'.format(np.mean(scores)))

Decision tree accuracy: 0.884758444197
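
An accuracy figure is easier to interpret next to a baseline, because with imbalanced labels a model that always predicts the most frequent class can already score highly. The sketch below shows one way to compute such a baseline. It uses synthetic labels rather than the bank data and assumes a scikit-learn version that provides <code>sklearn.model_selection</code> and <code>DummyClassifier</code>.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced labels (not the bank data): 900 zeros, 100 ones.
y_demo = np.array([0] * 900 + [1] * 100)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# Share of the most frequent class: the accuracy of always predicting it.
majority_share = np.bincount(y_demo).max() / float(len(y_demo))

# The same quantity via cross-validation, for a like-for-like comparison.
baseline = DummyClassifier(strategy='most_frequent')
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=10, scoring='accuracy')

print('Majority share: {:.3f}'.format(majority_share))
print('Baseline CV accuracy: {:.3f}'.format(np.mean(baseline_scores)))
```

Running the same comparison with the notebook's <code>X</code> and <code>y</code> would show how much of the tree's accuracy goes beyond this baseline.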


###Q5: Repeat the analysis and cross validation with the file <code>bank-additional-full.csv</code>. 

How does the performance of the model change (with the additional training examples and additional features)?

In [19]:
# Examine the additional features made available in full dataset
df_full.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no


At first glance, the new numeric columns shown above (<code>emp.var.rate</code>, <code>cons.price.idx</code>, <code>cons.conf.idx</code>, <code>euribor3m</code>, <code>nr.employed</code>) don't look very meaningful for this model, so we'll keep the same feature columns and see whether the additional training examples improve accuracy.

In [21]:
# We'll create a test DataFrame to practice feature engineering
df_full_test = df_full.copy(deep=True)

# Change categorical features from string values to numerical values, starting with the label ('y')
df_full_test['y'] = df_full_test.y.map({'no':0,'yes':1})

# Choose meaningful string type categorical features to convert to numerical values, then use for-loop combined with
# LabelEncoder to change values
# columns: 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'poutcome'
# TODO(justindelatorre): Leave out 'contact' feature because it doesn't seem 
feature_cols = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'poutcome']

le = LabelEncoder()
for col in feature_cols:
    df_full_test[col] = le.fit_transform(df_full_test[col])
    
print(df_full_test.head(20))

    age  job  marital  education  default  housing  loan  contact month  \
0    56    3        1          0        0        0     0        1   may   
1    57    7        1          3        1        0     0        1   may   
2    37    7        1          3        0        2     0        1   may   
3    40    0        1          1        0        0     0        1   may   
4    56    7        1          3        0        0     2        1   may   
5    45    7        1          2        1        0     0        1   may   
6    59    0        1          5        0        0     0        1   may   
7    41    1        1          7        1        0     0        1   may   
8    24    9        2          5        0        2     0        1   may   
9    25    7        2          3        0        2     0        1   may   
10   41    1        1          7        1        0     0        1   may   
11   25    7        2          3        0        2     0        1   may   
12   29    1        2    
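
The preprocessing in this cell repeats the steps used for the smaller file, so the whole encode, fit and cross-validate sequence could be wrapped in a helper. The sketch below is one possible shape for it; <code>evaluate_tree</code> is a hypothetical name, the demo frame is synthetic, and <code>sklearn.model_selection</code> is assumed to be available.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier


def evaluate_tree(frame, feature_cols, label_col='y', cv=5):
    """Label-encode feature_cols, fit a tree, and return mean CV accuracy."""
    data = frame.copy(deep=True)
    data[label_col] = data[label_col].map({'no': 0, 'yes': 1})
    for col in feature_cols:
        data[col] = LabelEncoder().fit_transform(data[col])
    clf = DecisionTreeClassifier(max_depth=len(feature_cols), random_state=1)
    scores = cross_val_score(clf, data[feature_cols], data[label_col],
                             cv=cv, scoring='accuracy')
    return scores.mean()


# Synthetic frame (not the bank data) in which 'housing' decides the label.
demo = pd.DataFrame({
    'housing': ['yes', 'no'] * 20,
    'loan': ['no', 'no', 'yes', 'yes'] * 10,
    'y': ['yes', 'no'] * 20,
})
demo_accuracy = evaluate_tree(demo, ['housing', 'loan'])
print('Demo accuracy: {}'.format(demo_accuracy))
```

With such a helper, both datasets could be scored by calling it once per DataFrame with the same <code>feature_cols</code>.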

In [22]:
X = df_full_test[feature_cols]
y = df_full_test.y

In [23]:
treeclf = DecisionTreeClassifier(max_depth=len(feature_cols), random_state=1)
treeclf.fit(X, y)

DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=8, max_features=None, max_leaf_nodes=None,
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            random_state=1, splitter='best')

In [24]:
scores = cross_val_score(treeclf, X, y, cv=10, scoring='accuracy')
print('Decision tree accuracy: {}'.format(np.mean(scores)))

Decision tree accuracy: 0.895478911089


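Training on the full dataset improves mean cross-validated accuracy by just over one percentage point, from about 0.885 to about 0.895. Because the same eight feature columns were used in both runs, the gain comes from the additional training examples rather than from the additional features.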
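
The size of the change can be worked out directly from the two mean accuracies printed above. The short calculation below separates the absolute change (percentage points) from the relative change (percent of the smaller dataset's accuracy); the variable names are illustrative.

```python
# Mean cross-validated accuracies reported above.
acc_small = 0.884758444197   # bank.txt
acc_full = 0.895478911089    # bank-additional-full.txt

# Absolute change, in percentage points.
diff_points = (acc_full - acc_small) * 100

# Relative change, as a percentage of the smaller dataset's accuracy.
diff_relative = (acc_full - acc_small) / acc_small * 100

print('Absolute: {:.2f} percentage points'.format(diff_points))
print('Relative: {:.2f}%'.format(diff_relative))
```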