In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold
from feature_selection import FilterMethodConstantFeatures

In [2]:
obj1 = FilterMethodConstantFeatures('../data/dataset_1.csv')

In [3]:
obj1.dataframe_info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Columns: 301 entries, var_1 to target
dtypes: float64(127), int64(174)
memory usage: 114.8 MB


In [4]:
obj1.dataframe_stats()

           count         mean           std  min  25%   50%  75%           max
var_1    50000.0     0.002220      0.108145  0.0  0.0  0.00  0.0  9.000000e+00
var_2    50000.0     0.000060      0.007746  0.0  0.0  0.00  0.0  1.000000e+00
var_3    50000.0    15.593002   1280.571855  0.0  0.0  0.00  0.0  2.079013e+05
var_4    50000.0     3.149633      2.740114  0.0  0.0  2.85  3.0  3.528000e+01
var_5    50000.0   608.681764  10951.361737  0.0  0.0  0.00  0.0  4.455000e+05
...          ...          ...           ...  ...  ...   ...  ...           ...
var_297  50000.0     0.000000      0.000000  0.0  0.0  0.00  0.0  0.000000e+00
var_298  50000.0     0.003060      0.078808  0.0  0.0  0.00  0.0  3.000000e+00
var_299  50000.0    12.462960    832.417622  0.0  0.0  0.00  0.0  1.346667e+05
var_300  50000.0  5683.960293  47364.820421  0.0  0.0  0.00  0.0  2.857673e+06
target   50000.0     0.039820      0.195538  0.0  0.0  0.00  0.0  1.000000e+00

[301 rows x 8 columns]


## Constant features

Constant features show a single value across all the observations (rows) of the dataset. Because they never vary, they provide no information that a machine learning model could use to discriminate between observations or predict a target.

Identifying and removing constant features is an easy first step towards feature selection and more easily interpretable machine learning models.


To identify constant features, we can use the VarianceThreshold from Scikit-learn, or we can code it ourselves. The VarianceThreshold requires all our variables to be numerical. If we write the check ourselves, however, we can make it work for both numerical and categorical variables.


### Using VarianceThreshold from Scikit-learn

The VarianceThreshold from sklearn provides a simple baseline approach to feature selection. It removes all features whose variance doesn't meet a certain threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples.
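The source of the `FilterMethodConstantFeatures` wrapper is not shown in this notebook, so the following is only a minimal sketch of how such a step could look when calling VarianceThreshold directly. The toy DataFrame and its column names are hypothetical, and the 70/30 split is an assumption.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 'b' is constant, 'a' and 'c' vary.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [7, 7, 7, 7, 7, 7],
    "c": [0.1, 0.5, 0.2, 0.9, 0.4, 0.3],
    "target": [0, 1, 0, 1, 0, 1],
})

# Split first, so the selector only learns from the training set.
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="target"), df["target"], test_size=0.3, random_state=0
)

# threshold=0 removes features with zero variance, i.e. constant features.
selector = VarianceThreshold(threshold=0)
selector.fit(X_train)

# get_support() returns a boolean mask of the features that are kept.
kept = X_train.columns[selector.get_support()]
constant = X_train.columns[~selector.get_support()]

# Apply the same column selection to train and test.
X_train_sel = X_train[kept]
X_test_sel = X_test[kept]

print("Constant features:", list(constant))
print("Train shape:", X_train.shape, "->", X_train_sel.shape)
print("Test shape:", X_test.shape, "->", X_test_sel.shape)
```

Fitting on the training set only and then reusing the mask on the test set keeps both sets aligned on the same columns.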

In [5]:
obj1.varinance_threshold(target='target')

----Fit Summary------
---------------------
A total of 266 features are not constant
A total of 34 features are constant
The train shape before fit is (35000, 300)
The train shape after fit is (15000, 266)
The test shape before fit is (15000, 300)
The test shape after fit is (35000, 266)


### Manual code 1: only works with numerical variables

In the following cells, I will show an alternative to the VarianceThreshold transformer of sklearn, where we write the code to find constant variables ourselves, using the standard deviation from pandas.
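As a rough illustration of the idea (not necessarily what the `pandas_std` method does internally, since its source is not shown), a constant numerical column has a standard deviation of zero. The train and test frames below are hypothetical.

```python
import pandas as pd

# Hypothetical train and test sets; 'b' is constant in the training data.
X_train = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [7, 7, 7, 7],
    "c": [0.1, 0.5, 0.2, 0.9],
})
X_test = pd.DataFrame({
    "a": [5, 6],
    "b": [7, 7],
    "c": [0.4, 0.3],
})

# Standard deviation per column, computed on the training set only.
std = X_train.std()

# Columns whose standard deviation is zero are constant.
# For float columns, rounding error may leave a tiny non-zero value,
# so a small tolerance (e.g. std < 1e-12) may be safer than == 0.
constant_features = std[std == 0].index.tolist()

# Drop the same columns from both sets.
X_train_sel = X_train.drop(columns=constant_features)
X_test_sel = X_test.drop(columns=constant_features)

print("Constant features:", constant_features)
print("Train shape:", X_train.shape, "->", X_train_sel.shape)
print("Test shape:", X_test.shape, "->", X_test_sel.shape)
```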

In [6]:
obj1.pandas_std(target='target')

----Fit Summary------
---------------------
A total of 266 features are not constant
A total of 34 features are constant
The train shape before fit is (35000, 300)
The train shape after fit is (15000, 266)
The test shape before fit is (15000, 300)
The test shape after fit is (35000, 266)


By removing the constant features, we reduced the feature space from 300 to 266 variables.

Both the VarianceThreshold and the standard deviation approach work only with numerical variables. What can we do to find constant categorical variables?

One alternative is to encode the categories as numbers and then use the code above. But then you would spend effort pre-processing variables that are not informative.

The code below offers a better solution:

### Manual code 2: also works with categorical variables

In [7]:
obj1.pandas_nunique(target='target')

----Fit Summary------
---------------------
A total of 266 features are not constant
A total of 34 features are constant
The train shape before fit is (35000, 300)
The train shape after fit is (15000, 266)
The test shape before fit is (15000, 300)
The test shape after fit is (35000, 266)
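The internals of `pandas_nunique` are not shown either, so the following is only a sketch of the underlying idea: a column is constant when it holds exactly one distinct value, whatever its dtype. The DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical training data mixing categorical and numerical columns.
X_train = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Rome"],
    "country_code": ["XX", "XX", "XX", "XX"],  # constant categorical
    "amount": [1.0, 2.5, 2.5, 4.0],
    "flag": [0, 0, 0, 0],                      # constant numerical
})

# Number of distinct values per column.
# Note: nunique() ignores NaN by default (dropna=True).
n_unique = X_train.nunique()

# Columns with exactly one distinct value are constant.
constant_features = n_unique[n_unique == 1].index.tolist()

X_train_sel = X_train.drop(columns=constant_features)

print("Constant features:", constant_features)
print("Train shape:", X_train.shape, "->", X_train_sel.shape)
```

Because `nunique` counts distinct values rather than computing a variance, no encoding of the categories is needed beforehand.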
