# Basic feature engineering and feature selection

This notebook introduces some basic methods for adding new, useful features to a dataset:

- Creating dummy variables
- Binning of numerical features
- Creating interacting features
- Scaling of numerical features

Furthermore, we show how we can automatically select the most useful features in the dataset.

We start by loading a few packages we know we will need. More will be loaded along the way.

In [9]:
%matplotlib inline
import numpy as np
import pandas as pd
import sklearn as sk

import warnings; warnings.simplefilter('ignore')

from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split

# from sklearn.preprocessing import LabelEncoder

## The data

We use a toy dataset, consisting of fruits with different colors and diameters. In the dataset there are approximately:
- 500 grapes with a mean diameter of 1.5cm and a color which is a random assignment of either green or red.
- 400 ripe apples with a mean diameter of 7cm and a color which is a random assignment of green, red or yellow.
- 100 unripe apples with a mean diameter of 3cm, which are all green.

See the notebook "Appendix - generating fruits-data.ipynb" to see how the data is generated.

Below, the dataset is loaded, and values in the column "Diameter" are converted to floats:

In [10]:
data = pd.read_csv('fruits-data.csv')
data['Diameter'] = pd.to_numeric(data['Diameter'], errors='coerce')  # unparseable values become NaN

data[0:10]

,Unnamed: 0,Color,Diameter,Label
0,0,Red,1.883633,Grape
1,1,Green,0.912832,Grape
2,2,Yellow,12.021957,Apple
3,3,Green,6.097648,Apple
4,4,Red,1.786855,Grape
5,5,Yellow,7.593902,Apple
6,6,Red,1.534767,Grape
7,7,Green,2.128611,Grape
8,8,Red,2.526992,Grape
9,9,Red,4.442242,Apple
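To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a made-up Series (the values are illustrative and not taken from fruits-data.csv). Entries that cannot be parsed as numbers become NaN instead of raising an error:

```python
import pandas as pd

# Hypothetical raw column: numbers stored as strings plus one unparseable entry
raw = pd.Series(["1.88", "0.91", "not a number", "7.59"])

# errors="coerce" turns anything that cannot be parsed into NaN
clean = pd.to_numeric(raw, errors="coerce")
n_missing = int(clean.isna().sum())

print(clean)
print("Number of values coerced to NaN:", n_missing)
```

It can be worth checking `isna().sum()` after such a conversion, so that silently coerced values do not go unnoticed.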


## Failed attempt

We want to train a decision tree on this data! We pull out the features and the labels and try training a tree:

In [11]:
# X = data.loc[:, 'Color':'Diameter']
# y = data.loc[:,'Label']
# tree = DecisionTreeClassifier()
# tree.fit(X,y)

The above code fails when uncommented: even though the algorithm behind the decision tree is capable of handling categorical (i.e. non-numeric) features, the specific *implementation* in sklearn cannot do that! So we need to represent the color in a different way - e.g. by using so-called **dummy variables**:
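To see the failure without touching the real data, the following sketch fits a tree on a hypothetical four-row stand-in for the fruits data (the values are made up for illustration) and catches the resulting error:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical miniature stand-in for the fruits data
toy = pd.DataFrame({
    "Color": ["Red", "Green", "Yellow", "Green"],
    "Diameter": [1.9, 0.9, 12.0, 6.1],
    "Label": ["Grape", "Grape", "Apple", "Apple"],
})

try:
    # The string column "Color" cannot be converted to numbers by sklearn
    DecisionTreeClassifier().fit(toy[["Color", "Diameter"]], toy["Label"])
    fit_failed = False
except (ValueError, TypeError) as err:
    fit_failed = True
    print("Fitting failed:", err)
```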

## Dummy variables (one-hot-encoding)

We first pull out the labels and then use the pandas-function "get_dummies" to create a new dataset in which all categorical variables are converted to dummy variables:

In [12]:
y = data.loc[:,'Label']
features = data.loc[:,'Color':'Diameter']
features = pd.get_dummies(features)
print(list(features.columns))
features[0:10]

['Diameter', 'Color_Green', 'Color_Red', 'Color_Yellow']


,Diameter,Color_Green,Color_Red,Color_Yellow
0,1.883633,False,True,False
1,0.912832,True,False,False
2,12.021957,False,False,True
3,6.097648,True,False,False
4,1.786855,False,True,False
5,7.593902,False,False,True
6,1.534767,False,True,False
7,2.128611,True,False,False
8,2.526992,False,True,False
9,4.442242,False,True,False
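As a small self-contained illustration of what "get_dummies" does, the sketch below applies it to a hypothetical four-row frame (made-up values). Only the non-numeric column is expanded, and the optional `dtype=int` argument gives 0/1 columns instead of True/False:

```python
import pandas as pd

# Hypothetical miniature version of the fruit features
toy = pd.DataFrame({
    "Color": ["Red", "Green", "Yellow", "Green"],
    "Diameter": [1.9, 0.9, 12.0, 6.1],
})

# Numeric columns are kept as they are; each color gets its own 0/1 column
toy_dummies = pd.get_dummies(toy, dtype=int)
print(toy_dummies)
```

Each row has exactly one of the color columns set to 1, which is why this encoding is also called one-hot encoding.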


We then create train and test sets for subsequent training:

In [13]:
X = features.values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=69)

Note: **In all of the following examples, we will start from the features and labels defined above** - i.e. "features" always refers to the DataFrame of features with dummy variables created for the colors.
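The sketch below, on hypothetical data of 1000 samples, illustrates two properties of "train_test_split" that matter when comparing models: by default 25% of the samples go to the test set, and fixing "random_state" makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 samples, 4 features, two classes
X_demo = np.arange(4000).reshape(1000, 4)
y_demo = np.array(["Grape"] * 500 + ["Apple"] * 500)

# Default split: 75% training, 25% testing; random_state fixes the shuffle
split_a = train_test_split(X_demo, y_demo, random_state=0)
split_b = train_test_split(X_demo, y_demo, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

same_split = np.array_equal(split_a[0], split_b[0])
print("Same split with the same seed:", same_split)
```

Several later cells in this notebook call "train_test_split" without "random_state", so the accuracies they print will vary somewhat from run to run.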

Finally, we train a decision tree and print the obtained accuracies:

In [14]:
tree = DecisionTreeClassifier() 
tree.fit(X_train,y_train)
print("Accuracy on training data = {}".format(tree.score(X_train, y_train)))
print("Accuracy on testing data = {}\n".format(tree.score(X_test, y_test)))

Accuracy on training data = 1.0
Accuracy on testing data = 0.8947368421052632



## Binning

Some of the overfitting we saw in the previous tree might be due to the fact that the tree is able to perfectly predict each fruit in the training set based on its diameter alone. One way to avoid this is to bin the diameters. In order to do this, we first create the bins we want to use using the linspace-function from numpy:

In [15]:
# We choose the upper limit as some number strictly larger than the largest diameter IN THE TRAIN DATA!
bins = np.linspace(0, 18, 10) 
bins

array([ 0.,  2.,  4.,  6.,  8., 10., 12., 14., 16., 18.])

We then use the "digitize"-function to determine which bin each sample belongs to:

In [16]:
which_bin = np.digitize(features.loc[:,"Diameter"], bins=bins).reshape(-1,1)
which_bin[:10]

array([[1],
       [1],
       [7],
       [4],
       [1],
       [4],
       [1],
       [2],
       [2],
       [3]], dtype=int64)
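To make the bin numbering explicit, this short sketch applies "digitize" to five of the diameters shown earlier and prints the interval each one falls into. With the default settings, the returned index i satisfies bins[i-1] <= x < bins[i]:

```python
import numpy as np

bins = np.linspace(0, 18, 10)  # edges 0, 2, 4, ..., 18 as in the cell above
diameters = np.array([1.883633, 12.021957, 6.097648, 2.128611, 4.442242])

# digitize returns i such that bins[i-1] <= x < bins[i]
idx = np.digitize(diameters, bins=bins)
for d, i in zip(diameters, idx):
    print(f"{d:9.6f} lies in [{bins[i-1]:.0f}, {bins[i]:.0f}) -> bin {i}")
```

Values below the first edge would get index 0, and values at or above the last edge would get index len(bins), which is why the upper limit should exceed the largest diameter.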

We can then drop the "Diameter"-feature and append the corresponding bins instead:

In [17]:
which_bin_df = pd.DataFrame(which_bin, columns=['Bin'])
features_binned = features.drop(['Diameter'], axis=1).join(which_bin_df)
features_binned[0:10]

,Color_Green,Color_Red,Color_Yellow,Bin
0,False,True,False,1
1,True,False,False,1
2,False,False,True,7
3,True,False,False,4
4,False,True,False,1
5,False,False,True,4
6,False,True,False,1
7,True,False,False,2
8,False,True,False,2
9,False,True,False,3


We are now ready to split the data into train and test sets again:

In [18]:
X_binned = features_binned.values
X_train_binned, X_test_binned, y_train, y_test = train_test_split(X_binned, y)

Finally, we train the same model as before:

In [19]:
tree = DecisionTreeClassifier() 
tree.fit(X_train_binned,y_train)
print("Accuracy on training data = {}".format(tree.score(X_train_binned, y_train)))
print("Accuracy on testing data = {}\n".format(tree.score(X_test_binned, y_test)))

Accuracy on training data = 0.9109311740890689
Accuracy on testing data = 0.9109311740890689



We see that in this case, binning reduced the amount of overfitting: the training and testing accuracies are now both about 0.91, compared with 1.0 and 0.89 before!

## Interactions

Could it be that for red or yellow fruits the diameter has a great importance, whereas it means nothing for green fruits? 
If this is the case, we might gain something by adding one or more of the **interaction features**   
- $\text{Color\_Green} \times \text{Diameter}$,
- $\text{Color\_Red} \times \text{Diameter}$ or
- $\text{Color\_Yellow} \times \text{Diameter}$.

This is a way to allow the algorithm (in this case the decision tree) to take into account how these features interact with each other - let's try!

We define each of the interacting features we are interested in and append them to the dataset:

In [20]:
green_times_diameter = pd.DataFrame(features['Color_Green']*features['Diameter'],columns=['Green_Times_Diameter'],dtype=float)
red_times_diameter = pd.DataFrame(features['Color_Red']*features['Diameter'],columns=['Red_Times_Diameter'],dtype=float)
yellow_times_diameter = pd.DataFrame(features['Color_Yellow']*features['Diameter'],columns=['Yellow_Times_Diameter'],dtype=float)

features_interact = features.join(green_times_diameter).join(red_times_diameter).join(yellow_times_diameter)
features_interact[0:10]

,Diameter,Color_Green,Color_Red,Color_Yellow,Green_Times_Diameter,Red_Times_Diameter,Yellow_Times_Diameter
0,1.883633,False,True,False,0.0,1.883633,0.0
1,0.912832,True,False,False,0.912832,0.0,0.0
2,12.021957,False,False,True,0.0,0.0,12.021957
3,6.097648,True,False,False,6.097648,0.0,0.0
4,1.786855,False,True,False,0.0,1.786855,0.0
5,7.593902,False,False,True,0.0,0.0,7.593902
6,1.534767,False,True,False,0.0,1.534767,0.0
7,2.128611,True,False,False,2.128611,0.0,0.0
8,2.526992,False,True,False,0.0,2.526992,0.0
9,4.442242,False,True,False,0.0,4.442242,0.0
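The same interaction columns can also be built in one step. The following is a sketch of an alternative, not the approach used in the cell above: it multiplies all color columns by the diameter with "DataFrame.mul" on a hypothetical three-row frame (made-up values).

```python
import pandas as pd

# Hypothetical miniature version of "features"
toy = pd.DataFrame({
    "Diameter": [1.9, 0.9, 12.0],
    "Color_Green": [False, True, False],
    "Color_Red": [True, False, False],
    "Color_Yellow": [False, False, True],
})
color_cols = ["Color_Green", "Color_Red", "Color_Yellow"]

# Multiply every color column row-wise by the diameter
interactions = toy[color_cols].astype(float).mul(toy["Diameter"], axis=0)
interactions.columns = [c.replace("Color_", "") + "_Times_Diameter" for c in color_cols]

toy_interact = toy.join(interactions)
print(toy_interact)
```

Because exactly one color is set per row, each row has a single non-zero interaction value, equal to its diameter.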


We then pull out the feature values, split them into training and testing sets, fit a decision tree, and compute and compare the accuracies:

In [27]:
X_interact = features_interact.values
X_train_interact, X_test_interact, y_train, y_test = train_test_split(X_interact, y)

tree = DecisionTreeClassifier() 
tree.fit(X_train_interact,y_train)
print("Accuracy on training data = {}".format(tree.score(X_train_interact, y_train)))
print("Accuracy on testing data = {}\n".format(tree.score(X_test_interact, y_test)))

Accuracy on training data = 1.0
Accuracy on testing data = 0.8421052631578947



## Automatically adding interactions: PolynomialFeatures

We can actually ask sklearn to compute **all** interactions up to a specified degree automatically. We do this using the transformer "PolynomialFeatures", which adds all possible products of features up to a certain degree (which is controlled by the "degree"-argument as seen below):

In [28]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
X = features.values
poly.fit(X)
X_poly = poly.transform(X)
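# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print("Polynomial feature names:\n{}".format(poly.get_feature_names_out().tolist()))

Polynomial feature names:
['1', 'x0', 'x1', 'x2', 'x3', 'x0^2', 'x0 x1', 'x0 x2', 'x0 x3', 'x1^2', 'x1 x2', 'x1 x3', 'x2^2', 'x2 x3', 'x3^2']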

In the list of feature names, 'x0' corresponds to the first feature in the dataset (i.e. 'Diameter'), 'x1' to the second feature (Color\_Green) and so on. The feature 'x0 x1' corresponds to the feature $\text{Diameter}\times\text{Color\_Green}$ which we manually added before.


We have already pulled out the feature values above, so all we need to do is to split the data into train and test sets and fit the decision tree:

In [25]:
X_train_poly, X_test_poly, y_train, y_test = train_test_split(X_poly, y)

tree = DecisionTreeClassifier()
tree.fit(X_train_poly, y_train)

print("Accuracy on training data = {}".format(tree.score(X_train_poly, y_train)))
print("Accuracy on testing data = {}\n".format(tree.score(X_test_poly, y_test)))

Accuracy on training data = 1.0
Accuracy on testing data = 0.8785425101214575



Adding polynomial interactions can be extremely useful - especially for algorithms like the decision tree, which asks about one feature at a time and therefore cannot test a combination of features in a single question. However, when we do this, we create a much larger feature space - and this makes it more difficult for the algorithm to identify a good set of questions! Luckily, we can also make sklearn automatically select the most important features for us, and then leave out everything else:

## Feature selection 

The decision tree seems to be overfitting quite a bit after we have added all the interaction features. Adding too many useless features increases the risk of overfitting, and makes it more difficult for the algorithms to identify the relevant parameters.

Luckily, sklearn has methods for identifying the most useful features. They are all part of the module "sklearn.feature_selection":
- SelectPercentile: Scores each feature individually against the target (by default with an ANOVA F-test) and keeps e.g. the 50% of features with the highest scores.
- SelectFromModel: Fits some model, and only keeps the features that this model finds to be the most important.
- RFE ("recursive feature elimination"): fits a model and discards the least useful feature. This is repeated until only the wanted number of features is left.

We will only demonstrate the first of these - you can look up the syntax for the other two in the documentation (or the book).

In [26]:
from sklearn.feature_selection import SelectPercentile

select = SelectPercentile(percentile=50) # 50% of the features will be chosen
select.fit(X_train_poly, y_train)

X_train_selected = select.transform(X_train_poly)
X_test_selected = select.transform(X_test_poly)

print("The shape of X_train with all interaction features: {}".format(X_train_poly.shape))
print("The shape of X_train after selecting 50% of features: {}".format(X_train_selected.shape))

The shape of X_train with all interaction features: (741, 15)
The shape of X_train after selecting 50% of features: (741, 7)
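The drop from 15 to 7 columns can be reproduced on random data. The sketch below uses hypothetical, randomly generated features and labels (not the fruits data): each feature gets its own score, and with percentile=50 only the features scoring above the median are kept, which is 7 out of 15.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile

rng = np.random.default_rng(0)
X_rand = rng.normal(size=(100, 15))    # hypothetical data with 15 features
y_rand = rng.integers(0, 2, size=100)  # hypothetical binary labels

selector = SelectPercentile(percentile=50)  # default score function: f_classif
X_rand_sel = selector.fit_transform(X_rand, y_rand)

print(X_rand_sel.shape)
print(np.round(selector.scores_, 2))  # one univariate score per feature
```

Note that the selector is fitted on the training data only, as in the cell above, so that no information from the test set influences which features are kept.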


We can see which features were selected by using the "get_support"-method:

In [None]:
mask = select.get_support()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = poly.get_feature_names_out()
print("The selected features are:\n")
for name in feature_names[mask]:  # index with the boolean mask itself, not mask.astype(int)
    print(name)

The selected features are:

x0
x3
x0^2
x0 x1
x0 x2
x0 x3
x3^2


Finally, we fit the tree and calculate the accuracies:

In [None]:
tree = DecisionTreeClassifier()
tree.fit(X_train_selected, y_train)
print("Accuracy on training data = {}".format(tree.score(X_train_selected, y_train)))
print("Accuracy on testing data = {}\n".format(tree.score(X_test_selected, y_test)))

Accuracy on training data = 1.0
Accuracy on testing data = 0.8947368421052632
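As a complement to the SelectPercentile walk-through, the following is a minimal sketch of the other two selectors mentioned earlier, SelectFromModel and RFE. It runs on synthetic data from "make_classification" rather than the fruits data, and the choices of a decision tree as the underlying model, the "median" threshold and four features to keep are assumptions made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 10 features, of which 3 are informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

# SelectFromModel: keep features whose importance is at least the median importance
sfm = SelectFromModel(DecisionTreeClassifier(random_state=0), threshold="median")
X_sfm = sfm.fit_transform(X_demo, y_demo)

# RFE: repeatedly refit and drop the least important feature until 4 remain
rfe = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=4)
X_rfe = rfe.fit_transform(X_demo, y_demo)

print("SelectFromModel kept:", X_sfm.shape[1], "features")
print("RFE kept:", X_rfe.shape[1], "features")
```

Both selectors expose "get_support" in the same way as SelectPercentile, so the selected columns can be inspected with the same boolean mask.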

