# Filter Methods

Filter methods select features based on properties of the features themselves, without training a model. They should be the first step in any ML analysis.

Basic filter methods consist of removing __constant__, __quasi-constant__ or __duplicated__ features. Constant and quasi-constant features take the same value in all, or almost all, observations, so they add little to no value to an ML model.

One hot encoding may be a source of duplicated features.
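As a rough illustration of how one-hot encoding can produce duplicated features, the sketch below encodes two hypothetical categorical columns (`city` and `region`, invented for this example) in which each city belongs to exactly one region. The resulting dummy columns carry identical values, and transposing the frame lets `duplicated()` flag them.

```python
import pandas as pd

# Hypothetical toy data: every "city" value maps to exactly one "region" value.
toy = pd.DataFrame({
    "city": ["Leeds", "Porto", "Leeds", "Porto"],
    "region": ["UK", "PT", "UK", "PT"],
})

# One-hot encode both categorical columns.
encoded = pd.get_dummies(toy)

# Transposing turns columns into rows, so duplicated() flags every column
# whose values repeat those of an earlier column.
duplicated_mask = encoded.T.duplicated().to_numpy()
duplicated_features = encoded.columns[duplicated_mask].tolist()

print(duplicated_features)
```

Transposing is convenient on small frames but can be slow on wide datasets, so treat this as a sketch rather than a recommended implementation.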


## Constant Features

Constant features are those that show the same value, just one value, for all the observations of the dataset. These features provide no information that allows an ML model to discriminate or predict a target.

Identifying and removing constant features is an easy first step towards feature selection and more easily interpretable ML models.

Here, we will demonstrate how to identify constant features using the Santander Customer Satisfaction dataset from Kaggle.

To identify constant features, we can use the VarianceThreshold transformer from sklearn, or we can code it ourselves.

In [10]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold

Load the Santander Customer Satisfaction dataset

In [11]:
data = pd.read_csv('../datasets/santander.csv')

data.shape

(76020, 371)

Check for missing data. The snippets below can cope with NaN values, so in principle missing data is not a problem. In any case, there is no missing data in this dataset.

In [12]:
[col for col in data.columns if data[col].isnull().sum() > 0]

[]

### Train/Test Split

In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=101
)

### Using Variance Threshold from sklearn
[VarianceThreshold](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html) from sklearn is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples.

In [14]:
sel = VarianceThreshold(threshold=0)
sel.fit(X_train) # fit finds the features with zero variance

VarianceThreshold(threshold=0)

We use [get_support](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold.get_support) to get a boolean mask of the features that are retained after applying VarianceThreshold. Summing the mask gives the number of retained features.

In [15]:
sum(sel.get_support())

336
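To make the fit and `get_support` steps easier to follow, here is a minimal sketch on a hypothetical three-column frame (the names `toy`, `toy_sel`, `kept` and `dropped` are illustrative and not part of the Santander data). It shows that the mask is `True` for retained features and that inverting it gives the removed ones.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame: "b" never changes, "a" and "c" do.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [7, 7, 7, 7],
    "c": [0, 1, 0, 1],
})

toy_sel = VarianceThreshold(threshold=0)
toy_sel.fit(toy)

# get_support() returns a boolean mask: True marks a retained feature.
mask = toy_sel.get_support()
kept = toy.columns[mask].tolist()
dropped = toy.columns[~mask].tolist()

print(kept, dropped)
```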

In [16]:
constant_features = [x for x in X_train.columns if x not in X_train.columns[sel.get_support()]]

print(constant_features)

['ind_var2_0', 'ind_var2', 'ind_var27_0', 'ind_var28_0', 'ind_var28', 'ind_var27', 'ind_var41', 'ind_var46_0', 'ind_var46', 'num_var27_0', 'num_var28_0', 'num_var28', 'num_var27', 'num_var41', 'num_var46_0', 'num_var46', 'saldo_var28', 'saldo_var27', 'saldo_var41', 'saldo_var46', 'imp_amort_var18_hace3', 'imp_amort_var34_hace3', 'imp_reemb_var13_hace3', 'imp_reemb_var33_hace3', 'imp_trasp_var17_out_hace3', 'imp_trasp_var33_out_hace3', 'num_var2_0_ult1', 'num_var2_ult1', 'num_reemb_var13_hace3', 'num_reemb_var33_hace3', 'num_trasp_var17_out_hace3', 'num_trasp_var33_out_hace3', 'saldo_var2_ult1', 'saldo_medio_var13_medio_hace3']


These are the constant features, which means that 34 variables show the same value for all the observations of the training set.

In [17]:
X_train[constant_features[0]].unique()

array([0])

We then use the transform method to reduce the training and testing sets.

In [18]:
print(X_train.shape, X_test.shape)

X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

print(X_train.shape, X_test.shape)

(53214, 370) (22806, 370)
(53214, 336) (22806, 336)
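By default, `transform` returns a NumPy array, so the column names are lost after this step. If the names are needed later, one option is to rebuild a DataFrame from the support mask. The sketch below does this on hypothetical toy frames (`toy_train` and `toy_test` stand in for the real sets).

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy train/test frames standing in for X_train and X_test.
toy_train = pd.DataFrame({"a": [1, 2, 3], "b": [5, 5, 5]})
toy_test = pd.DataFrame({"a": [4, 5], "b": [5, 6]})

# Fit on the training data only, then reuse the selector on the test data.
toy_sel = VarianceThreshold(threshold=0).fit(toy_train)

# Names of the retained columns, taken from the boolean support mask.
kept_columns = toy_train.columns[toy_sel.get_support()]

# Wrap the transformed arrays to restore column names and the original index.
toy_train_reduced = pd.DataFrame(
    toy_sel.transform(toy_train), columns=kept_columns, index=toy_train.index
)
toy_test_reduced = pd.DataFrame(
    toy_sel.transform(toy_test), columns=kept_columns, index=toy_test.index
)

print(toy_train_reduced.columns.tolist(), toy_test_reduced.shape)
```

Note that column `b` is dropped from the test frame even though it varies there, because the selection is learned from the training set alone.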


### Variance Threshold from Scratch

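In the following, we will code the equivalent of VarianceThreshold from scratch. Note that this split uses a different random_state from the one above, so the number of constant features found in the training set differs (38 rather than 34).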

In [19]:
data = pd.read_csv('../datasets/santander.csv')

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((53214, 370), (22806, 370))

In [20]:
constant_features_coded = [feat for feat in X_train.columns if X_train[feat].std() == 0]

len(constant_features_coded)

38

In [21]:
X_train.drop(labels=constant_features_coded, axis=1, inplace=True)
X_test.drop(labels=constant_features_coded, axis=1, inplace=True)

X_train.shape, X_test.shape

((53214, 332), (22806, 332))

We see how, by removing constant features, we managed to reduce the feature space quite a bit.

Both VarianceThreshold and the code snippet above work well with numerical variables.

To do the same with categorical variables:

### Variance Threshold for Categorical variables

In [25]:
data = pd.read_csv('../datasets/santander.csv')

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((53214, 370), (22806, 370))

Cast the variables to object type, so that they are treated as categories instead of numbers:

In [26]:
X_train = X_train.astype('O') 

To find constant features we need to find those that contain only 1 label:

In [27]:
constant_features_categorical = [
    feat for feat in X_train.columns if len(X_train[feat].unique()) == 1
]

len(constant_features_categorical)

38
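When a dataset mixes numerical and categorical columns, counting distinct values covers both cases in one pass. The following is a small sketch on a hypothetical frame (column names invented for illustration). It assumes that a missing value should count as a value of its own, which is why `dropna=False` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing numerical and categorical columns.
toy = pd.DataFrame({
    "amount": [10.0, 12.5, 9.0, 11.0],    # numerical, varies
    "currency": ["EUR"] * 4,              # categorical, constant
    "flag": [0, 0, 0, 0],                 # numerical, constant
    "segment": ["a", "b", "a", np.nan],   # categorical, varies
})

# dropna=False counts NaN as a distinct value, so a column mixing one label
# with missing entries is not reported as constant.
n_unique = toy.nunique(dropna=False)
constant_features_mixed = n_unique[n_unique == 1].index.tolist()

print(constant_features_mixed)
```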

We can appreciate the usefulness of looking out for constant variables at the beginning of any modelling exercise.

## Quasi Constant Features

Quasi-constant features are those that show the same value for the great majority of the observations of the dataset. In general, these features provide little if any information that allows a machine learning model to discriminate or predict a target. But there can be exceptions, so you should be careful when removing these types of features.

Identifying and removing quasi-constant features is an easy first step towards feature selection and more easily interpretable machine learning models.

To identify quasi-constant features, we use the VarianceThreshold transformer from sklearn.

In [29]:
data = pd.read_csv('../datasets/santander.csv')

data.shape

(76020, 371)

In [41]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((53214, 370), (22806, 370))

### Remove Constant Features
First, we will remove the constant features from the dataset, which will allow for a better visualisation of the quasi-constant features.

In [42]:
constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

# Report the count first, while X_train still contains all of its columns
print('Constant features are {} out of {}'.format(len(constant_features), len(X_train.columns)))

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

print('After removing the constant features, the number of columns in the train set is {}'.format(len(X_train.columns)))

Constant features are 38 out of 370
After removing the constant features, the number of columns in the train set is 332


### Remove Quasi-Constant features

In [43]:
from sklearn.feature_selection import VarianceThreshold

In [44]:
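# For a binary feature the variance is p * (1 - p), so a threshold of 0.1
# corresponds to roughly 89% of observations sharing one value
# (a threshold of 0.01 would correspond to roughly 99%).
sel = VarianceThreshold(threshold=0.1)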

In [45]:
sel.fit(X_train)

VarianceThreshold(threshold=0.1)
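The link between a variance threshold and a share of observations is easiest to see for a 0/1 feature, whose variance is p * (1 - p) when one value covers a share p of the rows. The sketch below works through that arithmetic. The helper names are hypothetical, and the relationship only holds for binary features; for other numerical features the variance also depends on the scale of the values.

```python
import math

import numpy as np


def binary_variance(majority_share: float) -> float:
    """Variance of a 0/1 feature whose most frequent value has this share."""
    return majority_share * (1.0 - majority_share)


def majority_share_for_threshold(threshold: float) -> float:
    """Majority share p that solves p * (1 - p) = threshold (threshold <= 0.25)."""
    return (1.0 + math.sqrt(1.0 - 4.0 * threshold)) / 2.0


# Cross-check the formula against NumPy's population variance (ddof=0).
toy_feature = np.array([0] * 99 + [1])
print(binary_variance(0.99), np.var(toy_feature))

# Majority share implied by two candidate thresholds.
for threshold in (0.1, 0.01):
    share = majority_share_for_threshold(threshold)
    print(f"threshold {threshold} -> majority share {share:.4f}")
```

Under these assumptions, a threshold of 0.01 is the one that targets binary features dominated by a single value in about 99% of rows, while 0.1 also removes features where the dominant value covers about 89% or more.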

In [46]:
num_features_not_quasi_constant = sum(sel.get_support())

num_features_not_quasi_constant

214

In [51]:
quasi_constant_features = [feat for feat in X_train.columns
        if feat not in X_train.columns[sel.get_support()]]

print(len(quasi_constant_features))
print(quasi_constant_features)

118
['ind_var1_0', 'ind_var1', 'ind_var5_0', 'ind_var6_0', 'ind_var6', 'ind_var8_0', 'ind_var8', 'ind_var12_0', 'ind_var12', 'ind_var13_0', 'ind_var13_corto_0', 'ind_var13_corto', 'ind_var13_largo_0', 'ind_var13_largo', 'ind_var13_medio_0', 'ind_var13_medio', 'ind_var13', 'ind_var14_0', 'ind_var14', 'ind_var17_0', 'ind_var17', 'ind_var18_0', 'ind_var18', 'ind_var19', 'ind_var20_0', 'ind_var20', 'ind_var24_0', 'ind_var24', 'ind_var25_cte', 'ind_var26_0', 'ind_var26_cte', 'ind_var26', 'ind_var25_0', 'ind_var25', 'ind_var29_0', 'ind_var29', 'ind_var30_0', 'ind_var31_0', 'ind_var31', 'ind_var32_cte', 'ind_var32_0', 'ind_var32', 'ind_var33_0', 'ind_var33', 'ind_var34_0', 'ind_var34', 'ind_var37_cte', 'ind_var37_0', 'ind_var37', 'ind_var40_0', 'ind_var40', 'ind_var39', 'ind_var44_0', 'ind_var44', 'num_var1_0', 'num_var1', 'num_var6_0', 'num_var6', 'num_var13_medio_0', 'num_var13_medio', 'num_var14', 'num_var17', 'num_var18_0', 'num_var18', 'num_var20_0', 'num_var20', 'num_op_var40_hace3', 'n

Take for instance the first one

In [52]:
X_train[quasi_constant_features[0]].value_counts() / len(X_train)

0    0.989044
1    0.010956
Name: ind_var1_0, dtype: float64

About 99% of observations show the value 0, so this feature can be described as quasi-constant. Let's remove all the features flagged above.

In [53]:
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)
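An alternative that does not depend on the scale of the values, and that also works for categorical variables, is to look directly at the share of the most frequent value. The following is a minimal sketch of that idea, not the method used above. The function name, the `max_share` parameter and the toy frame are all hypothetical, and the function assumes the frame has at least one row.

```python
import pandas as pd


def quasi_constant_columns(df: pd.DataFrame, max_share: float = 0.99) -> list:
    """Return columns whose most frequent value covers at least max_share of rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts by frequency, so the first entry is the dominant value.
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= max_share:
            flagged.append(col)
    return flagged


# Hypothetical toy frame with 1000 rows.
toy = pd.DataFrame({
    "mostly_zero": [0] * 995 + [1] * 5,      # dominant value covers 99.5%
    "balanced": [0, 1] * 500,                # dominant value covers 50%
    "mostly_a": ["a"] * 995 + ["b"] * 5,     # categorical, 99.5%
})

flagged_features = quasi_constant_columns(toy, max_share=0.99)
print(flagged_features)
```

As with the variance-based approach, the list should be computed on the training set and the same columns then dropped from both the training and testing sets.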