# Feature Selection - Dropping Constant Features
This notebook focuses on removing constant features, i.e. columns that hold the same value in every row. Such features cannot help a model distinguish between samples, so they can be dropped before training.

In [1]:
# Importing pandas
import pandas as pd

# Creating a DataFrame
dataset = pd.DataFrame({'A': [1,2,3,57,2,23],
                        'B': [4,5,6,7,8,9],
                        'C': [0,0,0,0,0,0],
                        'D': [1,1,1,1,1,1]})

In [2]:
dataset.head()

Unnamed: 0,A,B,C,D
0,1,4,0,1
1,2,5,0,1
2,3,6,0,1
3,57,7,0,1
4,2,8,0,1


## Variance threshold
Feature selector that removes all low-variance features.

This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.

Features whose training-set variance does not exceed the specified threshold are removed.

By default (threshold=0) the selector keeps every feature with non-zero variance, i.e. it removes only the features that have the same value in all samples.
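
To make the rule above concrete, here is a minimal sketch that computes the per-column variance by hand and compares the resulting mask with the selector's. It assumes the selector uses the population variance (ddof=0) and keeps a feature only when its variance is strictly greater than the threshold; the names `df`, `manual_variances` and `manual_mask` are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Same toy data as the DataFrame created at the top of the notebook
df = pd.DataFrame({'A': [1, 2, 3, 57, 2, 23],
                   'B': [4, 5, 6, 7, 8, 9],
                   'C': [0, 0, 0, 0, 0, 0],
                   'D': [1, 1, 1, 1, 1, 1]})

threshold = 0.0  # the VarianceThreshold default

# Population variance (ddof=0) of each column, computed with pandas
manual_variances = df.var(ddof=0)

# Keep a feature only if its variance is strictly greater than the threshold
manual_mask = (manual_variances > threshold).to_numpy()

# Mask produced by the selector for the same data and threshold
selector = VarianceThreshold(threshold=threshold).fit(df)
sklearn_mask = selector.get_support()

print(manual_variances)
print(manual_mask, sklearn_mask)
```

Raising `threshold` above zero would also drop features that vary only slightly; which value is sensible depends on the scale of each feature.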

In [3]:
from sklearn.feature_selection import VarianceThreshold

# Creating VarianceThreshold instance
var_threshold = VarianceThreshold(threshold=0)
var_threshold.fit(dataset)

VarianceThreshold(threshold=0)

In [6]:
var_threshold.get_support()

array([ True,  True, False, False])

In [7]:
dataset.columns[var_threshold.get_support()]

Index(['A', 'B'], dtype='object')

In [8]:
# Columns that the selector did not keep are the constant ones
constant_columns = dataset.columns[~var_threshold.get_support()].tolist()

constant_columns

['C', 'D']

In [9]:
dataset.drop(constant_columns, axis=1)

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6
3,57,7
4,2,8
5,23,9


## Let's practice this on a bigger dataset

In [10]:
url = 'https://www.kaggle.com/c/santander-customer-satisfaction/data?select=train.csv'

## Importing the dataset

In [12]:
!pip install opendatasets --upgrade --quiet
import opendatasets as od

od.download(url)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds

Downloading santander-customer-satisfaction.zip to .\santander-customer-satisfaction



In [13]:
from zipfile import ZipFile

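# Extract next to the archive so that train.csv ends up at the path read below
with ZipFile('./santander-customer-satisfaction/santander-customer-satisfaction.zip') as f:
    f.extractall('./santander-customer-satisfaction')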

In [16]:
from sklearn.feature_selection import VarianceThreshold

# Load only the first 10,000 rows of the training file
dataset = pd.read_csv('./santander-customer-satisfaction/train.csv', nrows=10000)
dataset.head()

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
0,1,2,23,0.0,0.0,0.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39205.17,0
1,3,2,34,0.0,0.0,0.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,49278.03,0
2,4,2,23,0.0,0.0,0.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67333.77,0
3,8,2,37,0.0,195.0,195.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64007.97,0
4,10,2,39,0.0,0.0,0.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016,0


In [17]:
dataset.shape

(10000, 371)

In [19]:
X = dataset.drop(labels=['TARGET'], axis=1)
y = dataset['TARGET']

In [20]:
# Splitting the dataset into training and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train.shape, X_test.shape

((7000, 370), (3000, 370))

### Let's apply the variance threshold

In [21]:
var_thres = VarianceThreshold(threshold=0)
var_thres.fit(X_train)

VarianceThreshold(threshold=0)

In [23]:
# Counting the non-constant features

sum(var_thres.get_support())

284

In [24]:
len(X_train.columns[var_thres.get_support()])

284

In [25]:
# Columns that the selector did not keep are the constant ones
constant_columns = X_train.columns[~var_thres.get_support()].tolist()

len(constant_columns)

86

In [27]:
X_train = X_train.drop(constant_columns, axis=1)
X_train.shape

(7000, 284)
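
The cell above reduces only the training set. For the test set to stay compatible with a model trained on the reduced features, the same columns need to be removed from it as well. The sketch below shows this on a small synthetic stand-in for the Santander data, so it runs without the Kaggle download; the column names `f1`, `f2`, `const_a` and `const_b` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: two varying and two constant columns
rng = np.random.default_rng(0)
X = pd.DataFrame({'f1': rng.normal(size=100),
                  'f2': rng.integers(0, 10, size=100),
                  'const_a': np.zeros(100),
                  'const_b': np.ones(100)})
y = pd.Series(rng.integers(0, 2, size=100), name='TARGET')

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit the selector on the training set only
selector = VarianceThreshold(threshold=0).fit(X_train)
constant_columns = X_train.columns[~selector.get_support()].tolist()

# Drop the same columns from both splits so their features stay aligned
X_train_reduced = X_train.drop(columns=constant_columns)
X_test_reduced = X_test.drop(columns=constant_columns)

print(constant_columns)
print(X_train_reduced.shape, X_test_reduced.shape)
```

Deciding which columns to drop from the training split alone, and then reusing that decision on the test split, keeps information from the test data out of the feature selection step.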