# Filter Methods

Filter methods select features based on characteristics of the features themselves, without training a machine learning model. They should be the first step in any ML analysis.

Basic filter methods consist of removing __constant__, __quasi-constant__ or __duplicated__ features. Constant and quasi-constant features take the same value in all, or almost all, observations, so they add little to no value to an ML model.

One-hot encoding may be a source of duplicated features.
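
As a minimal sketch of how one-hot encoding can produce duplicated features, consider a toy dataframe (the column names and values are hypothetical, not taken from the Santander data) in which two categorical variables carry the same information:

```python
import pandas as pd

# Hypothetical toy data: 'city' and 'postcode' always move together,
# so their one-hot columns end up identical.
df = pd.DataFrame({
    "city": ["Leeds", "York", "Leeds", "York"],
    "postcode": ["LS1", "YO1", "LS1", "YO1"],
})

dummies = pd.get_dummies(df, dtype=int)

# Transpose so columns become rows, then flag rows that repeat an earlier one.
duplicated_cols = dummies.columns[dummies.T.duplicated()].tolist()
print(duplicated_cols)  # → ['postcode_LS1', 'postcode_YO1']
```

The two postcode columns repeat the two city columns exactly, so only one pair needs to be kept.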


## Constant Features

Constant features are those that show the same value, just one value, for all the observations of the dataset. These features provide no information that allows an ML model to discriminate or predict a target.

Identifying and removing constant features is an easy first step towards feature selection and more easily interpretable ML models.

Here, we will demonstrate how to identify constant features using the Santander Customer Satisfaction dataset from Kaggle.

To identify constant features, we can use the VarianceThreshold class from sklearn, or we can code it ourselves.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold

Load the Santander Customer Satisfaction dataset.

In [2]:
data = pd.read_csv('../datasets/santander.csv')

data.shape

(76020, 371)

Check for missing data. The snippets below can compare columns that contain NaN values, so in principle, missing data is not a problem. In any case, we see that there is no missing data in this dataset.

In [3]:
[col for col in data.columns if data[col].isnull().sum() > 0]

[]
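
The remark about NaN values can be illustrated with a small sketch on made-up series. It assumes the column comparison is done with Series.equals, as in the loop used later for duplicated features:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, np.nan, 3.0])

# Element-wise comparison: NaN != NaN, so the middle position is False.
elementwise_all_equal = bool((a == b).all())

# Series.equals treats NaNs in the same position as equal.
equals_result = a.equals(b)

print(elementwise_all_equal, equals_result)  # → False True
```

This is why a plain `==` comparison would miss duplicated columns that contain missing values, while Series.equals would not.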

### Train/Test Split

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=101
)

### Using Variance Threshold from sklearn
[VarianceThreshold](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html) from sklearn is a simple baseline approach to feature selection. It removes all features whose variance does not exceed a given threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples.
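
Before applying it to the full dataset, here is a minimal sketch of the default behaviour on a toy dataframe (the column names are hypothetical):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: one constant column and two that vary.
toy = pd.DataFrame({
    "constant": [7, 7, 7, 7],
    "binary": [0, 1, 0, 1],
    "numeric": [1.5, 2.0, 3.5, 4.0],
})

selector = VarianceThreshold()  # default threshold is 0.0
selector.fit(toy)

# get_support() returns a boolean mask: True for features that are kept.
mask = selector.get_support()
kept = toy.columns[mask].tolist()
dropped = toy.columns[~mask].tolist()
print(kept, dropped)  # → ['binary', 'numeric'] ['constant']
```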

In [5]:
sel = VarianceThreshold(threshold=0)
sel.fit(X_train) # fit finds the features with zero variance

VarianceThreshold(threshold=0)

We use [get_support](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold.get_support) to get the features that are retained after applying VarianceThreshold.

In [6]:
sum(sel.get_support())

336

In [7]:
# invert the boolean mask to get the features that were NOT retained
constant_features = X_train.columns[~sel.get_support()].tolist()

print(constant_features)

['ind_var2_0', 'ind_var2', 'ind_var27_0', 'ind_var28_0', 'ind_var28', 'ind_var27', 'ind_var41', 'ind_var46_0', 'ind_var46', 'num_var27_0', 'num_var28_0', 'num_var28', 'num_var27', 'num_var41', 'num_var46_0', 'num_var46', 'saldo_var28', 'saldo_var27', 'saldo_var41', 'saldo_var46', 'imp_amort_var18_hace3', 'imp_amort_var34_hace3', 'imp_reemb_var13_hace3', 'imp_reemb_var33_hace3', 'imp_trasp_var17_out_hace3', 'imp_trasp_var33_out_hace3', 'num_var2_0_ult1', 'num_var2_ult1', 'num_reemb_var13_hace3', 'num_reemb_var33_hace3', 'num_trasp_var17_out_hace3', 'num_trasp_var33_out_hace3', 'saldo_var2_ult1', 'saldo_medio_var13_medio_hace3']


These are the constant features, which means that 34 variables show the same value for all the observations of the training set.

In [8]:
X_train[constant_features[0]].unique()

array([0])

We then use the transform method to reduce the training and testing sets. Note that, by default, transform returns NumPy arrays, so the column names are lost.

In [9]:
print(X_train.shape, X_test.shape)

X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

print(X_train.shape, X_test.shape)

(53214, 370) (22806, 370)
(53214, 336) (22806, 336)
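
If the column names are needed afterwards, one option is to wrap the transformed arrays back into dataframes. The sketch below uses made-up toy frames rather than the Santander data, and the variable names are illustrative only:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: 'a' is constant in the training set only.
X_train_toy = pd.DataFrame({"a": [1, 1, 1], "b": [0, 1, 2], "c": [5, 3, 5]})
X_test_toy = pd.DataFrame({"a": [1, 2], "b": [3, 3], "c": [4, 6]})

# Fit on the training set only, then apply the same selection to both sets.
sel_toy = VarianceThreshold(threshold=0).fit(X_train_toy)
kept_columns = X_train_toy.columns[sel_toy.get_support()]

X_train_reduced = pd.DataFrame(
    sel_toy.transform(X_train_toy), columns=kept_columns, index=X_train_toy.index
)
X_test_reduced = pd.DataFrame(
    sel_toy.transform(X_test_toy), columns=kept_columns, index=X_test_toy.index
)

print(list(X_train_reduced.columns))  # → ['b', 'c']
```

Column 'a' is dropped from the test set as well, even though it varies there, because the selection is learned from the training set alone.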


### Variance Threshold from Scratch

In the following, we will code the equivalent of VarianceThreshold from scratch.

In [10]:
data = pd.read_csv('../datasets/santander.csv')

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((53214, 370), (22806, 370))

In [11]:
constant_features_coded = [feat for feat in X_train.columns if X_train[feat].std() == 0]

len(constant_features_coded)

38

In [12]:
X_train.drop(labels=constant_features_coded, axis=1, inplace=True)
X_test.drop(labels=constant_features_coded, axis=1, inplace=True)

X_train.shape, X_test.shape

((53214, 332), (22806, 332))

By removing constant features, we reduced the feature space from 370 to 332 columns.

Both VarianceThreshold and the snippet above work well with numerical variables.

To do the same with categorical variables:

### Variance Threshold for Categorical variables

In [13]:
data = pd.read_csv('../datasets/santander.csv')

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((53214, 370), (22806, 370))

Cast the columns to object dtype so that they are treated as categories rather than numerical values.

In [14]:
X_train = X_train.astype('O') 

To find constant features we need to find those that contain only 1 label:

In [15]:
constant_features_categorical = [
    feat for feat in X_train.columns if len(X_train[feat].unique()) == 1
]

len(constant_features_categorical)

38
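
A related detail is how missing values are counted as labels. The sketch below, on a hypothetical toy frame, contrasts the two settings of nunique. Note that unique(), used above, includes NaN in its result, so it behaves like nunique(dropna=False):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in 'flag'.
cats = pd.DataFrame({
    "colour": ["red", "red", "red"],
    "size": ["S", "M", "S"],
    "flag": ["yes", np.nan, "yes"],
})

# nunique() ignores NaN by default; dropna=False counts NaN as its own label.
constant_ignoring_nan = [c for c in cats.columns if cats[c].nunique() == 1]
constant_counting_nan = [
    c for c in cats.columns if cats[c].nunique(dropna=False) == 1
]

print(constant_ignoring_nan)  # → ['colour', 'flag']
print(constant_counting_nan)  # → ['colour']
```

Which behaviour is appropriate depends on whether missingness itself is considered informative for the problem at hand.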

We can appreciate the usefulness of looking out for constant variables at the beginning of any modelling exercise.

## Quasi Constant Features

Quasi-constant features are those that show the same value for the great majority of the observations of the dataset. In general, these features provide little, if any, information that allows a machine learning model to discriminate or predict a target. But there can be exceptions, so you should be careful when removing these types of features.

Identifying and removing quasi-constant features is an easy first step towards feature selection and more easily interpretable machine learning models.

To identify quasi-constant features, we use the VarianceThreshold class from sklearn.

In [16]:
data = pd.read_csv('../datasets/santander.csv')

data.shape

(76020, 371)

In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((53214, 370), (22806, 370))

### Remove Constant Features
First, we will remove the constant features from the dataset, which will allow for a better visualisation of the quasi-constant values.

In [18]:
constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

# Note: drop(..., inplace=True) returns None, so its result must not be
# assigned back to X_train or X_test.

print('Constant features are {} out of {}'.format(len(constant_features), len(X_train.columns)))

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

print('After removing the constant features, the number of columns in the train set is {}'.format(len(X_train.columns)))

Constant features are 38 out of 370
After removing the constant features, the number of columns in the train set is 332


### Remove Quasi-Constant features

In [19]:
from sklearn.feature_selection import VarianceThreshold

In [20]:
# For a 0/1 feature, a variance p * (1 - p) below 0.1 means that one value
# covers more than roughly 89% of the observations.
sel = VarianceThreshold(threshold=0.1)

In [21]:
sel.fit(X_train)

VarianceThreshold(threshold=0.1)
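
To relate a variance threshold to a share of observations, recall that a 0/1 feature whose majority value has share p has variance p * (1 - p). The sketch below solves that equation for p. The helper name is hypothetical, and the relation only holds for binary features; for other numerical features the variance also depends on the scale of the values.

```python
import math


def majority_share_for_variance(threshold: float) -> float:
    """Majority share p at which a 0/1 feature has variance p * (1 - p) == threshold."""
    # Solve p**2 - p + threshold = 0 and keep the root with p >= 0.5.
    return (1 + math.sqrt(1 - 4 * threshold)) / 2


for t in (0.1, 0.01):
    print(t, round(majority_share_for_variance(t), 4))
# → 0.1 0.8873
# → 0.01 0.9899
```

So for binary features, a threshold of 0.1 corresponds to a majority share of about 89%, while a threshold of 0.01 corresponds to about 99%.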

In [22]:
num_features_not_quasi_constant = sum(sel.get_support())

num_features_not_quasi_constant

214

In [23]:
# invert the boolean mask to get the features that were NOT retained
quasi_constant_features = X_train.columns[~sel.get_support()].tolist()

print(len(quasi_constant_features))
print(quasi_constant_features)

118
['ind_var1_0', 'ind_var1', 'ind_var5_0', 'ind_var6_0', 'ind_var6', 'ind_var8_0', 'ind_var8', 'ind_var12_0', 'ind_var12', 'ind_var13_0', 'ind_var13_corto_0', 'ind_var13_corto', 'ind_var13_largo_0', 'ind_var13_largo', 'ind_var13_medio_0', 'ind_var13_medio', 'ind_var13', 'ind_var14_0', 'ind_var14', 'ind_var17_0', 'ind_var17', 'ind_var18_0', 'ind_var18', 'ind_var19', 'ind_var20_0', 'ind_var20', 'ind_var24_0', 'ind_var24', 'ind_var25_cte', 'ind_var26_0', 'ind_var26_cte', 'ind_var26', 'ind_var25_0', 'ind_var25', 'ind_var29_0', 'ind_var29', 'ind_var30_0', 'ind_var31_0', 'ind_var31', 'ind_var32_cte', 'ind_var32_0', 'ind_var32', 'ind_var33_0', 'ind_var33', 'ind_var34_0', 'ind_var34', 'ind_var37_cte', 'ind_var37_0', 'ind_var37', 'ind_var40_0', 'ind_var40', 'ind_var39', 'ind_var44_0', 'ind_var44', 'num_var1_0', 'num_var1', 'num_var6_0', 'num_var6', 'num_var13_medio_0', 'num_var13_medio', 'num_var14', 'num_var17', 'num_var18_0', 'num_var18', 'num_var20_0', 'num_var20', 'num_op_var40_hace3', 'n

Take, for instance, the first one:

In [24]:
# np.float was removed in recent NumPy versions; plain division is enough
X_train[quasi_constant_features[0]].value_counts() / len(X_train)

0    0.989044
1    0.010956
Name: ind_var1_0, dtype: float64

99% of observations show value 0. This feature can be described as quasi-constant. Let's remove those.

In [25]:
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

### Removing Quasi-Constant features using bespoke code

In [26]:
data = pd.read_csv('../datasets/santander.csv')

data.shape

(76020, 371)

In [27]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((53214, 370), (22806, 370))

In [28]:
constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

print('Constant features are {} out of {}'.format(len(constant_features), len(X_train.columns)))

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

print('After removing the constant features, the number of columns in the train set is {}'.format(len(X_train.columns)))

Constant features are 38 out of 370
After removing the constant features, the number of columns in the train set is 332


In [29]:
threshold = 0.1
quasi_constant_features = []

for feature in X_train.columns:
    
    # predominant value: share of observations taken by the most frequent value
    # (np.float was removed in recent NumPy versions; plain division is enough)
    predominant_value = (X_train[feature].value_counts() / len(
        X_train)).sort_values(ascending=False).values[0]
    
    # flag the feature if that share exceeds 1 - threshold * 0.2 = 0.98
    if predominant_value > (1 - threshold * 0.2):
        quasi_constant_features.append(feature)

len(quasi_constant_features)

195

This method is more aggressive than VarianceThreshold: it flags 195 features as quasi-constant, compared with 118 above.

In [30]:
sample = quasi_constant_features[0]
sample

'imp_op_var40_comer_ult1'

In [31]:
X_train[sample].value_counts() / len(X_train)

0.00       0.996148
396.00     0.000038
714.09     0.000019
103.80     0.000019
450.00     0.000019
247.56     0.000019
683.43     0.000019
2841.45    0.000019
451.74     0.000019
495.72     0.000019
383.79     0.000019
677.04     0.000019
1987.47    0.000019
2341.08    0.000019
350.46     0.000019
627.78     0.000019
721.50     0.000019
92.88      0.000019
404.34     0.000019
114.00     0.000019
1482.30    0.000019
2570.76    0.000019
499.80     0.000019
1030.77    0.000019
327.00     0.000019
713.91     0.000019
563.97     0.000019
4061.28    0.000019
2822.58    0.000019
3542.40    0.000019
             ...   
184.50     0.000019
2105.43    0.000019
140.55     0.000019
572.91     0.000019
867.87     0.000019
368.94     0.000019
299.82     0.000019
1224.87    0.000019
3639.87    0.000019
1368.00    0.000019
30.51      0.000019
6300.69    0.000019
373.50     0.000019
541.29     0.000019
408.51     0.000019
834.96     0.000019
195.18     0.000019
194.82     0.000019
137.94     0.000019


The feature shows the value 0 for 99.6% of the observations.
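
A feature like this helps explain why the two approaches can disagree. The sketch below uses a hypothetical toy column that is almost always zero but occasionally holds a large amount: its variance is large, so a variance threshold keeps it, while a rule based on the share of the predominant value flags it. The 0.98 cutoff mirrors the 1 - threshold * 0.2 rule used above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy column: 199 zeros and a single large amount.
values = np.zeros(200)
values[0] = 500.0
toy = pd.DataFrame({"mostly_zero_amount": values})

# Share of the most frequent value (value_counts sorts in descending order).
predominant_share = toy["mostly_zero_amount"].value_counts(normalize=True).iloc[0]

# Population variance, which is what VarianceThreshold compares to the threshold.
variance = toy["mostly_zero_amount"].var(ddof=0)

kept_by_variance = bool(VarianceThreshold(threshold=0.1).fit(toy).get_support()[0])
flagged_by_share = bool(predominant_share > 0.98)

print(predominant_share, variance)         # → 0.995 1243.75
print(kept_by_variance, flagged_by_share)  # → True True
```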

## Duplicated Features

Often, datasets contain two or more features that show the same values across all the observations. This means that these features are in essence identical. It is not unusual to introduce duplicated features after performing __one-hot__ encoding of categorical variables, particularly when using several high-cardinality variables.

Identifying and removing duplicated, redundant features is an easy first step towards feature selection and more easily interpretable machine learning models.

There is no dedicated function in Python or pandas to find duplicated columns. Therefore, we will use two code snippets: one suited to small datasets and one suited to large datasets.

__Note__: Finding duplicated features is a computationally costly operation in Python, so you might not always want to perform it. Better still, avoid introducing duplicated features into the dataset in the first place.

In [32]:
data = pd.read_csv('../datasets/santander.csv', nrows=500)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((350, 370), (150, 370))

pandas has a method, duplicated, that flags the duplicated rows of a dataframe. By transposing the dataset first, we can use it to look for duplicated columns.

In [33]:
data_t = X_train.T
data_t.head()

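| | 141 | 383 | 135 | 493 | 122 | 22 | 68 | 20 | 382 | 14 | ... | 211 | 9 | 359 | 195 | 251 | 323 | 192 | 117 | 47 | 172 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | 258.0 | 768.0 | 241.0 | 990.0 | 220.0 | 51.0 | 144.0 | 45.0 | 767.0 | 32.0 | ... | 421.0 | 23.0 | 717.0 | 384.0 | 501.0 | 651.0 | 378.0 | 213.0 | 107.0 | 332.0 |
| var3 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | ... | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| var15 | 36.0 | 39.0 | 45.0 | 48.0 | 42.0 | 35.0 | 23.0 | 23.0 | 41.0 | 33.0 | ... | 23.0 | 25.0 | 24.0 | 32.0 | 24.0 | 37.0 | 27.0 | 55.0 | 42.0 | 38.0 |
| imp_ent_var16_ult1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 600.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| imp_op_var39_comer_ult1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1086.48 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 572.19 | 0.0 | 0.0 | 0.0 | 0.0 |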


Check if there are duplicated rows (the columns of the initial dataset) by applying .duplicated(), and then sum the number of rows being duplicates. This may take a while.

In [34]:
data_t.duplicated().sum()

183

And here are the first few duplicated rows:

In [35]:
data_t[data_t.duplicated()].head()

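| | 141 | 383 | 135 | 493 | 122 | 22 | 68 | 20 | 382 | 14 | ... | 211 | 9 | 359 | 195 | 251 | 323 | 192 | 117 | 47 | 172 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| imp_op_var39_efect_ult1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 360.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| imp_sal_var16_ult1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ind_var2_0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ind_var2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ind_var6_0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |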


In [36]:
duplicated_features = data_t[data_t.duplicated()].index.values
duplicated_features

array(['imp_op_var39_efect_ult1', 'imp_sal_var16_ult1', 'ind_var2_0',
       'ind_var2', 'ind_var6_0', 'ind_var6', 'ind_var13_largo',
       'ind_var13_medio_0', 'ind_var13_medio', 'ind_var14', 'ind_var17',
       'ind_var18_0', 'ind_var18', 'ind_var19', 'ind_var20', 'ind_var24',
       'ind_var26_cte', 'ind_var26', 'ind_var25_0', 'ind_var25',
       'ind_var27_0', 'ind_var28_0', 'ind_var28', 'ind_var27',
       'ind_var29_0', 'ind_var29', 'ind_var31', 'ind_var32_cte',
       'ind_var32_0', 'ind_var32', 'ind_var33_0', 'ind_var33',
       'ind_var34_0', 'ind_var34', 'ind_var37', 'ind_var40_0',
       'ind_var40', 'ind_var41', 'ind_var39', 'ind_var44', 'ind_var46_0',
       'ind_var46', 'num_var6_0', 'num_var6', 'num_var13_medio_0',
       'num_var13_medio', 'num_var14', 'num_var17', 'num_var18_0',
       'num_var18', 'num_var20', 'num_var24', 'num_var26', 'num_var25_0',
       'num_var25', 'num_var27_0', 'num_var28_0', 'num_var28',
       'num_var27', 'num_var29_0', 'num_var29', 'num_va

In [37]:
data_unique = data_t.drop_duplicates(keep='first').T
data_unique.shape

(350, 187)

There are now only 187 features!
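
The same transpose-and-deduplicate steps can be followed more easily on a toy dataframe. The sketch below uses hypothetical column names and keeps the first occurrence of each set of identical columns:

```python
import pandas as pd

# Hypothetical toy data: 'b' and 'd' are copies of 'a'.
toy = pd.DataFrame({
    "a": [1, 0, 1],
    "b": [1, 0, 1],
    "c": [2, 2, 3],
    "d": [1, 0, 1],
})

toy_t = toy.T

# duplicated(keep='first') marks every row that repeats an earlier row.
dup_mask = toy_t.duplicated(keep="first")
duplicated_cols = toy_t.index[dup_mask].tolist()

# Drop the repeated rows and transpose back to the original orientation.
toy_unique = toy_t.drop_duplicates(keep="first").T

print(duplicated_cols)           # → ['b', 'd']
print(list(toy_unique.columns))  # → ['a', 'c']
```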

Transposing a dataframe is costly if the dataframe is big. Therefore, we can use an alternative loop to find duplicated columns in bigger datasets.

In [38]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((350, 370), (150, 370))

In [39]:
duplicated_features = []

for column in range(len(X_train.columns)):
    if column % 10 == 0:
        print(column)
    
    column_1 = X_train.columns[column]
    
    for column_2 in X_train.columns[column + 1:]:
        if X_train[column_2].equals(X_train[column_1]):
            duplicated_features.append(column_2)

0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
310
320
330
340
350
360


In [43]:
print(len(set(duplicated_features))) # use set to drop repeated entries

182


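The transposed approach flagged 183 columns, one more than the 182 found here. Both snippets ran on the same 350-row training set, so the data cannot explain the gap. A likely cause is that Series.equals also requires matching dtypes, whereas transposing casts all values to a common dtype before they are compared.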
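
One thing to keep in mind when comparing the two approaches is that they do not use exactly the same notion of equality. The sketch below, on a hypothetical two-column frame, shows a case where the results differ: the values match, but one column is stored as integers and the other as floats.

```python
import pandas as pd

# Hypothetical toy data: same values, different dtypes.
toy = pd.DataFrame({
    "as_int": [0, 1, 0],
    "as_float": [0.0, 1.0, 0.0],
})

# Series.equals requires the dtypes to match, so this is False.
equal_by_series_equals = toy["as_int"].equals(toy["as_float"])

# Transposing a mixed int/float frame casts everything to float,
# so the second row is flagged as a duplicate of the first.
flagged_by_transpose = bool(toy.T.duplicated().iloc[1])

print(equal_by_series_equals, flagged_by_transpose)  # → False True
```

If both approaches should agree, one option is to cast the columns to a common dtype before running the loop.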

In [44]:
set(duplicated_features)

{'delta_imp_amort_var18_1y3',
 'delta_imp_amort_var34_1y3',
 'delta_imp_aport_var17_1y3',
 'delta_imp_aport_var33_1y3',
 'delta_imp_reemb_var13_1y3',
 'delta_imp_reemb_var17_1y3',
 'delta_imp_reemb_var33_1y3',
 'delta_imp_trasp_var17_in_1y3',
 'delta_imp_trasp_var17_out_1y3',
 'delta_imp_trasp_var33_in_1y3',
 'delta_imp_trasp_var33_out_1y3',
 'delta_num_aport_var13_1y3',
 'delta_num_aport_var17_1y3',
 'delta_num_aport_var33_1y3',
 'delta_num_compra_var44_1y3',
 'delta_num_reemb_var13_1y3',
 'delta_num_reemb_var17_1y3',
 'delta_num_reemb_var33_1y3',
 'delta_num_trasp_var17_in_1y3',
 'delta_num_trasp_var17_out_1y3',
 'delta_num_trasp_var33_in_1y3',
 'delta_num_trasp_var33_out_1y3',
 'delta_num_venta_var44_1y3',
 'imp_amort_var18_hace3',
 'imp_amort_var18_ult1',
 'imp_amort_var34_hace3',
 'imp_amort_var34_ult1',
 'imp_aport_var17_hace3',
 'imp_aport_var17_ult1',
 'imp_aport_var33_hace3',
 'imp_aport_var33_ult1',
 'imp_op_var39_efect_ult1',
 'imp_reemb_var13_hace3',
 'imp_reemb_var13_ult1'