# Exercise with bank marketing data

## Introduction

- Data from the UCI Machine Learning Repository: [data](https://github.com/justmarkham/DAT8/blob/master/data/bank-additional.csv), [data dictionary](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing)
- **Goal:** Predict whether a customer will purchase a bank product marketed over the phone
- `bank-additional.csv` is already in our repo, so there is no need to download the data from the UCI website

## Step 1: Read the data into Pandas

In [2]:
import pandas as pd
%matplotlib inline
url = '../data/bank-additional.csv'
bank = pd.read_csv(url, sep=';')
bank.head()

,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,30,blue-collar,married,basic.9y,no,yes,no,cellular,may,fri,...,2,999,0,nonexistent,-1.8,92.893,-46.2,1.313,5099.1,no
1,39,services,single,high.school,no,no,no,telephone,may,fri,...,4,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0,no
2,25,services,married,high.school,no,yes,no,telephone,jun,wed,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.962,5228.1,no
3,38,services,married,basic.9y,no,unknown,unknown,telephone,jun,fri,...,3,999,0,nonexistent,1.4,94.465,-41.8,4.959,5228.1,no
4,47,admin.,married,university.degree,no,yes,no,cellular,nov,mon,...,1,999,0,nonexistent,-0.1,93.2,-42.0,4.191,5195.8,no


## Step 2: Prepare at least three features

- Include both numeric and categorical features
- Choose features that you think might be related to the response (based on intuition or exploration)
- Think about how to handle missing values (encoded as "unknown")
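
One way to approach the last point is to count how often the placeholder appears in each column before deciding whether to keep "unknown" as its own category or merge it with another. The sketch below uses a small made-up DataFrame, `toy`, which is a hypothetical stand-in and not the real `bank` data:

```python
import pandas as pd

# Hypothetical toy rows standing in for a few categorical columns of `bank`
toy = pd.DataFrame({
    'job': ['services', 'unknown', 'admin.', 'services'],
    'housing': ['yes', 'unknown', 'no', 'unknown'],
    'loan': ['no', 'unknown', 'no', 'no'],
})

# Count how many times the placeholder string 'unknown' appears in each column
unknown_counts = (toy == 'unknown').sum()
print(unknown_counts)

# Optional: convert 'unknown' to real missing values so that isna() can see them
toy_nan = toy.replace('unknown', float('nan'))
print(toy_nan.isna().sum())
```

The same comparison against `'unknown'` should work on the categorical columns of `bank`. Whether to keep, merge, or drop those rows remains a modeling choice.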

In [2]:
# list all columns (for reference)
bank.columns

Index([u'age', u'job', u'marital', u'education', u'default', u'housing',
       u'loan', u'contact', u'month', u'day_of_week', u'duration', u'campaign',
       u'pdays', u'previous', u'poutcome', u'emp.var.rate', u'cons.price.idx',
       u'cons.conf.idx', u'euribor3m', u'nr.employed', u'y'],
      dtype='object')

### y (response)

In [3]:
# convert the response to numeric values and store as a new column
bank['outcome'] = bank.y.map({'no':0, 'yes':1})

Take a look at each of the potential predictors. For continuous features, make some visualizations (boxplots, etc.). For categorical features, check how well the categories separate the two response classes, for example with `bank.groupby('job').outcome.mean()`.
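
As a rough illustration of both kinds of check, the sketch below runs them on a small made-up DataFrame. The name `toy` and all of its values are hypothetical; only the column names mirror `bank`.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration
toy = pd.DataFrame({
    'age': [30, 39, 25, 58, 47, 62],
    'job': ['services', 'services', 'student', 'retired', 'student', 'retired'],
    'outcome': [0, 0, 1, 1, 0, 1],
})

# Categorical feature: share of positive outcomes within each category
rate_by_job = toy.groupby('job').outcome.mean()
print(rate_by_job)

# Continuous feature: compare its distribution across the two outcome groups
age_by_outcome = toy.groupby('outcome').age.describe()
print(age_by_outcome)
```

Categories whose rates differ a lot from one another, or numeric summaries that differ clearly between the two outcome groups, hint that a feature may be worth trying in the model.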

### age

In [4]:
# is this feature strongly separated in the two outcome groups?

### job

In [5]:
# how about this one?

In [6]:
# create job_dummies (we will add it to the bank DataFrame later)
# hint: use pd.get_dummies() to build a DataFrame of dummy columns,
# then drop one column to serve as the baseline category
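
A minimal sketch of the dummy-encoding pattern, shown on a made-up column rather than the real data. The names `toy` and `job_dummies_demo` are hypothetical, and dropping the first column is one common choice of baseline, not the only one.

```python
import pandas as pd

# Hypothetical toy column standing in for bank.job
toy = pd.DataFrame({'job': ['admin.', 'services', 'student', 'admin.']})

# One indicator column per category, named job_<category>
job_dummies_demo = pd.get_dummies(toy.job, prefix='job')

# Drop the first column so that category becomes the implicit baseline
job_dummies_demo = job_dummies_demo.iloc[:, 1:]
print(job_dummies_demo.columns.tolist())
```

With k categories this leaves k-1 columns, which is enough to identify every category without redundancy.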


### default

In [7]:
# This one seems fairly useful

In [8]:
# but only one person in the dataset has a status of yes
# use value_counts()
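
For reference, `value_counts()` on a categorical column looks like the sketch below. The `toy` values are hypothetical and do not reflect the real class counts in `bank`.

```python
import pandas as pd

# Hypothetical toy column standing in for bank.default
toy = pd.DataFrame({'default': ['no', 'no', 'unknown', 'no', 'yes', 'unknown']})

# Number of rows in each category, most frequent first
counts = toy.default.value_counts()
print(counts)
```

A category with very few rows gives an unreliable success rate on its own, which is the motivation for merging rare categories.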

In [10]:
# so, let's treat this as a 2-class feature rather than a 3-class feature
bank['default'] = bank.default.map({'no':0, 'unknown':1, 'yes':1})

### contact

In [9]:
# looks like a useful feature


In [12]:
# convert the feature to numeric values
bank['contact'] = bank.contact.map({'cellular':0, 'telephone':1})

### month

In [10]:
# looks like a useful feature at first glance


In [14]:
# but the success rate appears to be inversely related to the number of calls made in each month
# thus, the month feature is unlikely to generalize
# note: DataFrame.sort() was removed from pandas; sort_values() is the replacement
bank.groupby('month').outcome.agg(['count', 'mean']).sort_values('count')

month,count,mean
dec,22,0.545455
mar,48,0.583333
sep,64,0.40625
oct,69,0.362319
apr,215,0.167442
nov,446,0.096413
jun,530,0.128302
aug,636,0.100629
jul,711,0.082982
may,1378,0.065312
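
To put a number on the pattern in the table above, one option is a rank correlation between call volume and success rate. The sketch below copies the counts and means from that table into a hypothetical DataFrame named `by_month`, and computes a Spearman-style correlation by ranking both columns first.

```python
import pandas as pd

# Counts and success rates copied from the table above
by_month = pd.DataFrame({
    'count': [22, 48, 64, 69, 215, 446, 530, 636, 711, 1378],
    'mean': [0.545455, 0.583333, 0.406250, 0.362319, 0.167442,
             0.096413, 0.128302, 0.100629, 0.082982, 0.065312],
}, index=['dec', 'mar', 'sep', 'oct', 'apr', 'nov', 'jun', 'aug', 'jul', 'may'])

# Pearson correlation of the ranks is the Spearman rank correlation
rank_corr = by_month['count'].rank().corr(by_month['mean'].rank())
print(rank_corr)
```

A strongly negative value supports the reading that months with more calls have lower success rates, so `month` may mostly be a proxy for campaign volume.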


### duration

In [11]:
# looks like an excellent feature, but you can't know the duration of a call beforehand, thus it can't be used in your model
# but make the boxplot anyway

### previous

In [12]:
# looks like a useful feature


### poutcome

In [13]:
# looks like a useful feature


In [14]:
# create poutcome_dummies


In [19]:
# concatenate bank DataFrame with job_dummies and poutcome_dummies
# make sure these DataFrame names match actual names you created
bank = pd.concat([bank, job_dummies, poutcome_dummies], axis=1)

### euribor3m

In [15]:
# looks like an excellent feature
# make a boxplot
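
A minimal sketch of a grouped boxplot, using made-up values in a hypothetical DataFrame named `toy`. The `matplotlib.use('Agg')` line is only an assumption for running outside a notebook; inside a notebook with `%matplotlib inline` it can be omitted.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, assumed here for running as a script
import pandas as pd

# Hypothetical toy data; the values are invented for illustration
toy = pd.DataFrame({
    'euribor3m': [1.3, 4.9, 4.9, 4.2, 0.9, 1.0, 4.8, 1.2],
    'outcome': [0, 0, 0, 0, 1, 1, 0, 1],
})

# One box per outcome group; boxes that barely overlap suggest good separation
ax = toy.boxplot(column='euribor3m', by='outcome')
```

The same call pattern should apply to the other numeric columns, such as `duration` or `age`.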

## Step 3: Model building

- Pick a set of features that you expect to be predictive
- Use cross-validation to evaluate the AUC of a logistic regression model with your chosen features
- Try to increase the AUC by selecting different sets of features
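
One way to organize the comparison is to wrap the cross-validation call in a small helper and loop over candidate feature lists. The sketch below runs on synthetic data. The names `toy_X`, `toy_y`, `mean_auc`, and `candidates` are hypothetical, and `max_iter=1000` is an added assumption to avoid convergence warnings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the outcome depends on `signal` but not on `noise`
rng = np.random.RandomState(0)
n = 400
signal = rng.normal(size=n)
noise = rng.normal(size=n)
toy_y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)
toy_X = pd.DataFrame({'signal': signal, 'noise': noise})

def mean_auc(cols):
    """Mean 10-fold cross-validated AUC for a logistic regression on the given columns."""
    model = LogisticRegression(C=1e9, max_iter=1000)
    return cross_val_score(model, toy_X[cols], toy_y, cv=10, scoring='roc_auc').mean()

# Candidate feature sets to compare
candidates = {
    'noise only': ['noise'],
    'signal only': ['signal'],
    'both': ['signal', 'noise'],
}
results = {name: mean_auc(cols) for name, cols in candidates.items()}

# Print feature sets from worst to best AUC
for name, auc in sorted(results.items(), key=lambda kv: kv[1]):
    print(name, round(auc, 3))
```

On the real data, the same loop could be pointed at `X` and `y` with lists of `bank` column names in place of the synthetic ones.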

In [21]:
# new list of columns (including dummy columns)
bank.columns

Index([u'age', u'job', u'marital', u'education', u'default', u'housing',
       u'loan', u'contact', u'month', u'day_of_week', u'duration', u'campaign',
       u'pdays', u'previous', u'poutcome', u'emp.var.rate', u'cons.price.idx',
       u'cons.conf.idx', u'euribor3m', u'nr.employed', u'y', u'outcome',
       u'job_blue-collar', u'job_entrepreneur', u'job_housemaid',
       u'job_management', u'job_retired', u'job_self-employed',
       u'job_services', u'job_student', u'job_technician', u'job_unemployed',
       u'job_unknown', u'poutcome_nonexistent', u'poutcome_success'],
      dtype='object')

In [22]:
# create X (including 13 dummy columns)
feature_cols = ['default', 'contact', 'previous', 'euribor3m'] + list(bank.columns[-13:])
X = bank[feature_cols]

In [23]:
# create y
y = bank.outcome

In [24]:
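# create LogisticRegression model
# calculate cross-validated AUC
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import cross_val_score
logreg = LogisticRegression(C=1e9)  # very large C means almost no regularization
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()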

0.75566564072331199