## Constant features

Constant features are those that show the same value, just one value, for all the observations of the dataset. That is, every row of the dataset holds the same value. These features provide no information that allows a machine learning model to discriminate or predict a target.

Identifying and removing constant features is an easy first step towards feature selection and more easily interpretable machine learning models.

Here I will demonstrate how to identify constant features using the Santander Customer Transaction Prediction dataset from Kaggle.

To identify constant features, we can use the VarianceThreshold transformer from sklearn, or we can code it ourselves. I will show both procedures in 2 snippets of code.

In [31]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.feature_selection import VarianceThreshold

## Removing constant features

In [36]:
# load the Santander customer transaction prediction dataset from Kaggle
# I load just a few rows for the demonstration
import os
os.chdir("C:\\Users\\prudi\\Desktop\\Data Sets\\santander-customer-transaction-prediction")
data = pd.read_csv('train.csv', nrows=50000)
data.shape

(50000, 202)

In [37]:
data=data.drop('ID_code',axis=1)

In [38]:
data.head()

Unnamed: 0,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,0,8.9255,-6.7863,11.9081,5.093,11.4607,-9.2834,5.1187,18.6266,-4.92,...,4.4354,3.9642,3.1364,1.691,18.5227,-2.3978,7.8784,8.5635,12.7803,-1.0914
1,0,11.5006,-4.1473,13.8588,5.389,12.3622,7.0433,5.6208,16.5338,3.1468,...,7.6421,7.7214,2.5837,10.9516,15.4305,2.0339,8.1267,8.7889,18.356,1.9518
2,0,8.6093,-2.7457,12.0805,7.8928,10.5825,-9.0837,6.9427,14.6155,-4.9193,...,2.9057,9.7905,1.6704,1.6858,21.6042,3.1417,-6.5213,8.2675,14.7222,0.3965
3,0,11.0604,-2.1518,8.9522,7.1957,12.5846,-1.8361,5.8428,14.925,-5.8609,...,4.4666,4.7433,0.7178,1.4214,23.0347,-1.2706,-2.9275,10.2922,17.9697,-8.9996
4,0,9.8369,-1.4834,12.8746,6.6375,12.2772,2.4486,5.9405,19.2514,6.2654,...,-1.4905,9.5214,-0.1508,9.1942,13.2876,-1.5121,3.9267,9.5031,17.9974,-8.8104


In [39]:
# check the presence of null data.
# The snippets below will be able to compare nan values between 2 columns,
# so in principle missing data are not a problem.
# in any case, we see that there are no missing data in this dataset

[col for col in data.columns if data[col].isnull().sum() > 0]

[]

### Important

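In all feature selection procedures, it is good practice to select the features by examining only the training set. This is to avoid overfitting: features chosen with knowledge of the test set make the test performance look better than it really is.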
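As a minimal sketch of this practice, the cell below fits the selector on the training rows only and then applies the same column choice to the test rows. The toy table and the names `toy`, `target`, `toy_train` and `toy_test` are made up for illustration and are not part of the Santander data.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 'b' is constant, 'a' and 'c' vary.
toy = pd.DataFrame({
    "a": range(10),
    "b": [1] * 10,
    "c": [0, 1] * 5,
})
target = pd.Series([0, 1] * 5)

toy_train, toy_test, t_train, t_test = train_test_split(
    toy, target, test_size=0.3, random_state=0)

# Learn which columns to keep from the training rows only.
selector = VarianceThreshold(threshold=0)
selector.fit(toy_train)

# Apply the decision made on the training set to both sets.
kept = toy_train.columns[selector.get_support()]
toy_train_sel = toy_train[kept]
toy_test_sel = toy_test[kept]

print(list(kept))
print(toy_train_sel.shape, toy_test_sel.shape)
```

The test set never takes part in `fit`, so it has no influence on which columns are dropped.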

In [40]:
# separate dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 200), (15000, 200))

### Using variance threshold from sklearn

VarianceThreshold from sklearn is a simple baseline approach to feature selection. It removes all features whose variance does not exceed a given threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples.
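To make the rule concrete, here is a small sketch on a hand-made array (the values are invented for illustration). It inspects the `variances_` attribute learned by `fit` and the boolean mask returned by `get_support`.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical array: columns 0 and 3 are constant, columns 1 and 2 vary.
X = np.array([
    [0, 2.0, 0, 1],
    [0, 1.0, 0, 1],
    [0, 3.0, 1, 1],
    [0, 2.5, 0, 1],
])

selector = VarianceThreshold(threshold=0)
selector.fit(X)

# Per-column variance computed during fit; constant columns have variance 0.
print(selector.variances_)

# True marks a column that is kept, False a column that is removed.
print(selector.get_support())

# transform drops the columns marked False.
print(selector.transform(X).shape)
```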

In [41]:
sel = VarianceThreshold(threshold=0)
sel.fit(X_train)  # fit finds the features with zero variance

VarianceThreshold(threshold=0)

In [42]:
sel.get_support()

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,

In [43]:
X_train.shape

(35000, 200)

In [44]:
# get_support is a boolean vector that indicates which features are retained
# if we sum over get_support, we get the number of features that are not constant
sum(sel.get_support())

200

In [45]:
# another way of finding non-constant features is like this:
len(X_train.columns[sel.get_support()])

200

In [46]:
# finally we can print the constant features
print(
    len([
        x for x in X_train.columns
        if x not in X_train.columns[sel.get_support()]
    ]))

[x for x in X_train.columns if x not in X_train.columns[sel.get_support()]]

0


[]

We can see that none of the 200 columns / variables is constant in this training set: the selector retains every feature. A constant variable would show the same value, just one value, for all the observations of the training set.

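In [47]:
# let's visualise the values of the constant variables, if there are any
# (here the selector found none, so the dictionary is empty)

constant_cols = X_train.columns[~sel.get_support()]
{col: X_train[col].unique() for col in constant_cols}

{}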

We then use the transform function to reduce the training and testing sets. See below.

In [48]:
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((35000, 200), (15000, 200))
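Note that `transform` returns a NumPy array by default, so the column names are lost. A minimal sketch of one way to get a DataFrame back, using a made-up three-column table (`frame`, `x1`, `x2`, `x3` are hypothetical names):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy table: 'x2' is constant.
frame = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0],
    "x2": [5.0, 5.0, 5.0],
    "x3": [0.1, 0.4, 0.2],
})

selector = VarianceThreshold(threshold=0).fit(frame)

# Names of the columns that survive the selection.
kept_cols = frame.columns[selector.get_support()]

# Wrap the array returned by transform, restoring names and row index.
reduced = pd.DataFrame(
    selector.transform(frame), columns=kept_cols, index=frame.index)

print(reduced.columns.tolist())
```

The same pattern would apply to the test set, reusing `kept_cols` from the selector fitted on the training set.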

### Coding it ourselves

In the following cells, I will show an alternative to the VarianceThreshold function of sklearn.

In [52]:
# load the dataset again
import os
os.chdir("C:\\Users\\prudi\\Desktop\\Data Sets\\santander-customer-transaction-prediction")
data = pd.read_csv('train.csv', nrows=50000)
data.head()

Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,train_0,0,8.9255,-6.7863,11.9081,5.093,11.4607,-9.2834,5.1187,18.6266,...,4.4354,3.9642,3.1364,1.691,18.5227,-2.3978,7.8784,8.5635,12.7803,-1.0914
1,train_1,0,11.5006,-4.1473,13.8588,5.389,12.3622,7.0433,5.6208,16.5338,...,7.6421,7.7214,2.5837,10.9516,15.4305,2.0339,8.1267,8.7889,18.356,1.9518
2,train_2,0,8.6093,-2.7457,12.0805,7.8928,10.5825,-9.0837,6.9427,14.6155,...,2.9057,9.7905,1.6704,1.6858,21.6042,3.1417,-6.5213,8.2675,14.7222,0.3965
3,train_3,0,11.0604,-2.1518,8.9522,7.1957,12.5846,-1.8361,5.8428,14.925,...,4.4666,4.7433,0.7178,1.4214,23.0347,-1.2706,-2.9275,10.2922,17.9697,-8.9996
4,train_4,0,9.8369,-1.4834,12.8746,6.6375,12.2772,2.4486,5.9405,19.2514,...,-1.4905,9.5214,-0.1508,9.1942,13.2876,-1.5121,3.9267,9.5031,17.9974,-8.8104


In [53]:
data=data.drop('ID_code',axis=1)

In [55]:
# separate train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 200), (15000, 200))

In [56]:
# short and easy: find constant features
# in this case, all features are numeric, so this will suffice

constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

len(constant_features)

0

In [57]:
# we can then drop these columns from the train and test sets
# (reassigning instead of inplace=True avoids pandas copy warnings)
X_train = X_train.drop(columns=constant_features)
X_test = X_test.drop(columns=constant_features)

X_train.shape, X_test.shape

((35000, 200), (15000, 200))

Removing constant features can reduce the feature space quite a bit. In this dataset, however, no feature is constant, so both sets keep all 200 columns.

Both VarianceThreshold and the snippet of code I provided work with numerical variables. What can we do to find constant categorical variables?

One alternative is to encode the categories as numbers and then use the code above. But then you will put effort into pre-processing variables that are not informative.
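For reference, a minimal sketch of that encoding route on a made-up table (the columns `colour` and `size` are hypothetical). The integer codes are arbitrary and are used only to let VarianceThreshold run on the data.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical categorical data: 'colour' holds a single label.
cats = pd.DataFrame({
    "colour": ["red", "red", "red", "red"],
    "size": ["S", "M", "S", "L"],
})

# Map each label to an integer code; the order carries no meaning.
encoded = cats.apply(lambda s: s.astype("category").cat.codes)

# A constant categorical column becomes a constant column of codes.
selector = VarianceThreshold(threshold=0).fit(encoded)
constant_cats = cats.columns[~selector.get_support()].tolist()

print(constant_cats)
```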

Alternatively, you can use the code below.

### Removing constant features for categorical variables

In [58]:
# load the dataset again
data = pd.read_csv('train.csv', nrows=50000)

# separate train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 201), (15000, 201))

In [15]:
# for the demonstration, I will cast all the features to object
# to simulate that they are categorical

X_train = X_train.astype('O')
X_train.dtypes

ID                               object
var3                             object
var15                            object
imp_ent_var16_ult1               object
imp_op_var39_comer_ult1          object
imp_op_var39_comer_ult3          object
imp_op_var40_comer_ult1          object
imp_op_var40_comer_ult3          object
imp_op_var40_efect_ult1          object
imp_op_var40_efect_ult3          object
imp_op_var40_ult1                object
imp_op_var41_comer_ult1          object
imp_op_var41_comer_ult3          object
imp_op_var41_efect_ult1          object
imp_op_var41_efect_ult3          object
imp_op_var41_ult1                object
imp_op_var39_efect_ult1          object
imp_op_var39_efect_ult3          object
imp_op_var39_ult1                object
imp_sal_var16_ult1               object
ind_var1_0                       object
ind_var1                         object
ind_var2_0                       object
ind_var2                         object
ind_var5_0                       object


In [16]:
# and now find those columns that contain only 1 label:
constant_features = [
    feat for feat in X_train.columns if len(X_train[feat].unique()) == 1
]

len(constant_features)

58

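This cell reports 58 variables that show only 1 value across all the observations of the training set. Note that the column names in the dtypes listing (var3, ind_var2_0, ...) do not match the var_0 to var_199 columns loaded above, so this count appears to come from a run on a different table and does not agree with the 0 constant features found earlier. Either way, it is worth looking out for constant variables at the beginning of any modeling exercise.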
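The unique-value check can be wrapped in a small helper that works for numerical and categorical columns alike. This is only a sketch: the function name `find_constant_features` and the toy table are assumptions for illustration, and it counts NaN as a value of its own.

```python
import numpy as np
import pandas as pd


def find_constant_features(df):
    """Return the names of columns that hold a single distinct value."""
    # dropna=False makes NaN count as a distinct value.
    n_unique = df.nunique(dropna=False)
    return n_unique[n_unique == 1].index.tolist()


# Hypothetical table mixing numeric and object columns.
mixed = pd.DataFrame({
    "num_const": [7, 7, 7],
    "cat_const": ["a", "a", "a"],
    "cat_varied": ["a", "b", "a"],
    "num_with_nan": [1.0, np.nan, 1.0],
})

print(find_constant_features(mixed))
```

As with the other approaches, the helper would be called on the training set, and the resulting list used to drop columns from both the training and testing sets.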

That is all for this lecture. I hope you enjoyed it, and see you in the next one!