# Exploring a Direct-Marketing Dataset - Feature Selection
### (Uses code snippets and ideas from Kevin L. Davenport, Sebastian Raschka, and Jason Brownlee)

## Looking at the Dataset as a Whole

In the notebook titled "Exploratory-Data-Analysis" we looked at some methods for making sense of the dataset. We could try out various simple hypotheses and literally see whether the data would bear them out. To make this exploration more precise, we would have to answer the usual pesky questions about estimates, errors, p-values, confidence levels, and so on.

And of course, when the number of features is in the thousands or even in the millions -- quite typical for datasets used in machine learning -- it becomes untenable to do the kind of exploratory data analysis we did in that notebook.

Can we do better? Can we get away from the assumptions we need to get these types of estimates? Can we wrap our heads around the *complete* dataset?

We can, but we'll have to use different techniques to make sense of whether and how the features are related to each other. First, though, we need to clean up the data and convert the categorical variables to numeric scales.

In [61]:
# Import packages
import os
import time
import csv
import pickle
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from functools import wraps
from matplotlib.font_manager import FontProperties
from numpy import interp  # numpy.interp replaces the deprecated scipy.interp alias
from tabulate import tabulate

In [62]:
# Import SciKit Learn packages
from sklearn import model_selection #where the cross_validation and learning_curve modules live
from sklearn import neighbors
from sklearn import preprocessing
from sklearn import tree
from sklearn import svm
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA

In [63]:
%matplotlib inline

In [64]:
%%html
<!-- make markdown table pretty -->
<style>table {float:left}</style>

## Get the Already Pre-Processed Data

In [65]:
# Get the dataframe that was pickled in the Exploratory-Data-Analysis notebook
data = pd.read_pickle(os.getcwd() + '/Data/bank-additional/bank-additional-full_pickled')
data.head()

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |


We know some things about this data already. From the brief analysis in the Exploratory-Data-Analysis notebook we could already make some reasoned decisions about dropping certain features from the dataset. In this notebook we'll go further by looking at a number of feature selection techniques.

## One-Hot Encoding

So now let's handle the categorical attributes. We treat all of them as nominal variables -- i.e., variables with no natural ordering or ranking:
- job
- marital
- education
- default
- housing
- loan
- contact
- month
- day_of_week
- poutcome

When attributes are ordinal, we can use mapping or label encoding to turn the text attribute values into numerical attribute values. But when the attributes are nominal, it's best to use *one-hot encoding*.
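
The difference between the two encodings can be sketched on a toy frame. The column names `size` and `color` and their values are made up for illustration and are not part of the bank dataset; this is a minimal sketch, not the notebook's own preprocessing.

```python
import pandas as pd

# Hypothetical toy frame: "size" is ordinal, "color" is nominal.
toy = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["red", "green", "blue", "green"],
})

# Ordinal attribute: map each level to an integer that preserves the order.
size_order = {"small": 0, "medium": 1, "large": 2}
toy["size"] = toy["size"].map(size_order)

# Nominal attribute: one indicator column per level, with no implied order.
toy_encoded = pd.get_dummies(toy, columns=["color"])

print(toy_encoded)
```

An integer mapping on `color` would suggest that one color is "greater" than another, which is why nominal attributes get indicator columns instead.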

In [66]:
# list(data) returns the column names as a list
# Pandas get_dummies automatically converts every categorical variable to an equivalent one-hot encoding
# We can do this for our dataset without hesitation because all our categorical variables are nominal 
# -- i.e., they don't have any rank ordering.
# Note: this changes the ordering of the attributes in the dataframe
data2 = pd.get_dummies(data[list(data)])
data2.head()

| | age | duration | campaign | pdays | previous | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | ... | day_of_week_fri | day_of_week_mon | day_of_week_thu | day_of_week_tue | day_of_week_wed | poutcome_failure | poutcome_nonexistent | poutcome_success | y_no | y_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | 261 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | 57 | 149 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | 37 | 226 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 3 | 40 | 151 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 56 | 307 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |


In [67]:
#Looks great except that we've one-hot encoded our outcome as well -- the dependent variable.
#Let's fix that.
data2['y'] = data2['y_yes'].map(lambda x: 1 if x > 0 else 0)
del data2['y_no']
del data2['y_yes']
data2.head()

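| | age | duration | campaign | pdays | previous | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | ... | month_sep | day_of_week_fri | day_of_week_mon | day_of_week_thu | day_of_week_tue | day_of_week_wed | poutcome_failure | poutcome_nonexistent | poutcome_success | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | 261 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |
| 1 | 57 | 149 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |
| 2 | 37 | 226 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |
| 3 | 40 | 151 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |
| 4 | 56 | 307 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |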


## An Alternative Way to Convert the Outcome Variable into Numerical Values

In [68]:
# The last column contains info on whether the person bought the product or not
# The outcomes in the raw data are labeled 'yes' and 'no'.
# Change them into numerical outcomes -- 0 for no and 1 for yes
y = pd.Series([0 if val == 'no' else 1 for val in data.iloc[:,-1]])
y.head()

0    0
1    0
2    0
3    0
4    0
dtype: int64

Let's use data2 as our dataset from now on.

## Get the Inputs and the Output

We'll work with NumPy arrays from here on: an input matrix `X` holding the features and an output vector `y` holding the outcome.

In [69]:
# Get the input matrix and the output vector
n_features = data2.shape[1]
X, y = data2.iloc[:,0:n_features-1].values, data2.iloc[:, n_features-1].values

## Rescale the Features

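Our features are on scales that differ by orders of magnitude (compare `nr_employed`, around 5000, with `emp_var_rate`, around 1), so we rescale them. Standardizing each feature to zero mean and unit variance keeps large-valued features from dominating distance-based and gradient-based methods, and it usually helps those methods converge faster.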

In [70]:
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
X = std_scaler.fit_transform(X)
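
What standardization does to each column can be sketched with plain NumPy. The matrix `X_toy` below is a hypothetical four-row example, not the bank data; with its default settings `StandardScaler` applies the same column-wise formula, using the population standard deviation.

```python
import numpy as np

# Hypothetical toy matrix: column 0 is on the order of 1, column 1 on the order of 5000.
X_toy = np.array([[1.1, 5191.0],
                  [1.4, 5228.1],
                  [-0.1, 5195.8],
                  [-1.8, 5099.1]])

# Standardize column by column: subtract the mean, divide by the standard deviation.
# np.std defaults to ddof=0 (population standard deviation).
col_mean = X_toy.mean(axis=0)
col_std = X_toy.std(axis=0)
X_toy_std = (X_toy - col_mean) / col_std

print(X_toy_std.mean(axis=0))  # approximately 0 for every column
print(X_toy_std.std(axis=0))   # approximately 1 for every column

# Undoing the scaling recovers the original values.
X_toy_back = X_toy_std * col_std + col_mean
```

The last line is the manual counterpart of `std_scaler.inverse_transform`.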

## Feature Selection Techniques

### Remove Repetitive Features - Ones that Don't Vary a Lot in the Dataset

As explained in the SciKit Learn documentation, "VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples."

Let's apply it to our dataset to see which features this might hold true of (if any).

In [96]:
# Remove any binary (one-hot) feature that takes the same value in `threshold`
# (here 70%) or more of the samples. A 0/1 feature that equals 1 in a fraction p
# of the samples has variance p * (1 - p), hence the cutoff threshold * (1 - threshold).
# Scaling gets in the way of this (every standardized feature has unit variance),
# so that's why we've used std_scaler.inverse_transform below.
from sklearn.feature_selection import VarianceThreshold
threshold = 0.7 # Set between 0.5 and 1; values below 0.5 mirror those above because p * (1 - p) is symmetric
# In that range, the higher the threshold value, the fewer the number of removed features
# Conversely, the lower the threshold value, the higher the number of removed features
selector = VarianceThreshold(threshold=(threshold * (1 - threshold)))
selector.fit_transform(std_scaler.inverse_transform(X))

array([[  56.,  261.,    1., ...,    0.,    1.,    1.],
       [  57.,  149.,    1., ...,    0.,    1.,    1.],
       [  37.,  226.,    1., ...,    0.,    1.,    1.],
       ..., 
       [  56.,  189.,    2., ...,    1.,    0.,    0.],
       [  44.,  442.,    1., ...,    1.,    0.,    0.],
       [  74.,  239.,    3., ...,    1.,    0.,    0.]])

In [97]:
# Index values of the features that *do* vary and hence are useful to keep around
idx_features_selected = selector.get_support(indices=True)
print(idx_features_selected)

[ 0  1  2  3  4  5  6  7  8  9 23 37 39 43 44 51]


In [93]:
# Get all the column names in our dataset *except* for the outcome variable name
col_names = list(data2.columns.values)[0:-1]

In [98]:
# Selected features by name
names_features_selected = [col_names[i] for i in idx_features_selected]
names_features_selected

['age',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'emp_var_rate',
 'cons_price_idx',
 'cons_conf_idx',
 'euribor3m',
 'nr_employed',
 'marital_married',
 'housing_no',
 'housing_yes',
 'contact_cellular',
 'contact_telephone',
 'month_may']

In [99]:
# Features not selected because they take the same value in 70% or more of the dataset (threshold = 0.7)
names_features_not_selected = list(set(col_names) - set(names_features_selected))
names_features_not_selected

['job_admin.',
 'education_professional.course',
 'job_management',
 'marital_unknown',
 'marital_single',
 'job_student',
 'job_services',
 'education_university.degree',
 'education_basic.9y',
 'education_high.school',
 'marital_divorced',
 'job_unemployed',
 'month_sep',
 'education_basic.4y',
 'poutcome_failure',
 'poutcome_nonexistent',
 'default_no',
 'job_housemaid',
 'loan_unknown',
 'day_of_week_thu',
 'job_entrepreneur',
 'day_of_week_tue',
 'month_mar',
 'poutcome_success',
 'loan_yes',
 'job_unknown',
 'month_nov',
 'month_oct',
 'housing_unknown',
 'job_retired',
 'job_blue-collar',
 'job_self-employed',
 'education_basic.6y',
 'month_jul',
 'day_of_week_fri',
 'month_aug',
 'education_unknown',
 'month_dec',
 'day_of_week_wed',
 'default_yes',
 'education_illiterate',
 'default_unknown',
 'job_technician',
 'day_of_week_mon',
 'month_apr',
 'loan_no',
 'month_jun']

At the threshold value we set, we could do away with these features to make our dataset more manageable.
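
To see why the cutoff is written as `threshold * (1 - threshold)`, here is a minimal sketch on a hypothetical six-row matrix of 0/1 features (`X_toy`, `share`, and `toy_selector` are illustrative names, not part of the notebook's data). It assumes the same 0.7 setting used above.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: three 0/1 features, six samples.
X_toy = np.array([[0, 0, 1],
                  [0, 1, 0],
                  [1, 0, 0],
                  [0, 1, 1],
                  [0, 1, 0],
                  [0, 1, 1]])

# A 0/1 feature that equals 1 in a fraction p of the samples has variance p * (1 - p).
p = X_toy.mean(axis=0)
print(p * (1 - p))  # same numbers as X_toy.var(axis=0)

# Drop features that take the same value in 70% or more of the samples.
share = 0.7
toy_selector = VarianceThreshold(threshold=share * (1 - share))
X_toy_reduced = toy_selector.fit_transform(X_toy)

print(toy_selector.get_support())  # boolean mask of the features that were kept
print(X_toy_reduced.shape)
```

The first column is 0 in five of six rows, so its variance falls below the cutoff and it is dropped; the other two columns are kept.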

### L1 Regularization to Create Sparse Arrays
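
A linear model trained with an L1 penalty tends to drive the coefficients of unhelpful features to exactly zero, so the features with non-zero coefficients form a natural selection. One way this could look is sketched below on synthetic data rather than the bank dataset: `X_demo` and `y_demo` are made up so that only the first feature matters, and the choice of `LinearSVC` with `C=0.01` is an assumption for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Synthetic stand-in data: 5 standardized features, only feature 0 drives the label.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(500, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)

# The L1 penalty pushes the coefficients of unhelpful features towards exactly zero.
# A smaller C means stronger regularization and therefore fewer surviving features.
l1_svc = LinearSVC(C=0.01, penalty="l1", dual=False)
l1_selector = SelectFromModel(l1_svc).fit(X_demo, y_demo)

mask_l1 = l1_selector.get_support()        # True for features whose coefficient is not (near) zero
X_demo_l1 = l1_selector.transform(X_demo)  # keeps only those columns
print(mask_l1)
```

In this notebook the same pattern would be applied to `X` and `y`, with `col_names` used to translate the mask back into feature names.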

### Recursive Feature Elimination
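
Recursive feature elimination (RFE) fits an estimator, removes the least important feature, and repeats until the requested number of features remains. A minimal sketch on synthetic data follows; `X_demo`, `y_demo`, the logistic regression estimator, and the target of two features are all assumptions for illustration, not choices made in the original notebook.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 5 standardized features, only feature 0 drives the label.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(500, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)

# At each round RFE refits the model and drops the feature with the smallest
# coefficient magnitude (step=1 removes one feature per round).
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=2, step=1)
rfe.fit(X_demo, y_demo)

print(rfe.support_)  # boolean mask of the surviving features
print(rfe.ranking_)  # 1 = kept; larger numbers were eliminated earlier
```

Because the model is refit after every removal, RFE is considerably more expensive than a single variance or L1 pass when the number of features is large.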

### Ranking Feature Importance Using Random Forests
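
A fitted random forest exposes impurity-based importances that can be used to rank features. The sketch below uses synthetic data and hypothetical feature names (`demo_names`); the forest size and random seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 5 features, only feature 0 drives the label.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(500, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_names = ["f0", "f1", "f2", "f3", "f4"]  # hypothetical feature names

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# feature_importances_ holds the impurity-based importances, normalized to sum to 1.
importances = forest.feature_importances_
ranked = sorted(zip(demo_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print("%-3s %.3f" % (name, score))
```

In this notebook the same pattern would use `X`, `y`, and `col_names`. Tree-based importances do not depend on feature scaling, so either the scaled or the unscaled inputs could be used.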

Check out the following links for more ideas on feature selection. 
- http://machinelearningmastery.com/feature-selection-machine-learning-python/ (Jason Brownlee)
- http://scikit-learn.org/stable/modules/feature_selection.html (SciKit Learn documentation)
- http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline (SciKit Learn documentation)