## Feature Selection

#### Constant, Quasi-Constant and Duplicate Feature Removal

Unnecessary and redundant features not only increase an algorithm's training time, they can also hurt its performance.


Feature selection has several advantages:

- Models are easier to explain.
- A model with fewer features is easier to implement.
- Overfitting is reduced.
- Data redundancy is reduced.
- Training time is lower.
- The model is less prone to errors.

## 1. Filter Method

Filtering is usually the first step in feature selection. It acts as a screening method that picks out promising features without training any ML algorithm.

Univariate: Fisher score, mutual information gain, variance, etc.

Multivariate: Pearson's correlation


The drawback of univariate filters is that they score each feature on its own and ignore its relationship with the other features, so they cannot remove redundant features.

Multivariate filters take the mutual relationships between features into account, so they can identify redundant features in the data.
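
To make the contrast concrete, here is a small sketch on synthetic data. The column names `a`, `a_copy`, `b` and the 0.95 correlation cutoff are illustrative assumptions, not part of the Santander dataset. A univariate score such as variance rates each column in isolation, so a near-copy of a column scores about as well as the original, while pairwise Pearson correlation flags it as redundant.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=500),  # nearly identical to "a"
    "b": rng.normal(size=500),                        # independent feature
})

# Univariate view: each column is scored on its own.
print(toy.var())

# Multivariate view: absolute pairwise Pearson correlation.
# Keep only the upper triangle so each pair is inspected once.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(redundant)
```

Only the later column of each highly correlated pair ends up in `redundant`, so dropping those columns keeps one representative per pair.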

In [1]:
# Importing all libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import VarianceThreshold

In [3]:
# Read the data

df=pd.read_csv('santander-train.csv')

df.head()

| | ID | var3 | var15 | imp_ent_var16_ult1 | imp_op_var39_comer_ult1 | imp_op_var39_comer_ult3 | imp_op_var40_comer_ult1 | imp_op_var40_comer_ult3 | imp_op_var40_efect_ult1 | imp_op_var40_efect_ult3 | ... | saldo_medio_var33_hace2 | saldo_medio_var33_hace3 | saldo_medio_var33_ult1 | saldo_medio_var33_ult3 | saldo_medio_var44_hace2 | saldo_medio_var44_hace3 | saldo_medio_var44_ult1 | saldo_medio_var44_ult3 | var38 | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 39205.17 | 0 |
| 1 | 3 | 2 | 34 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 49278.03 | 0 |
| 2 | 4 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 67333.77 | 0 |
| 3 | 8 | 2 | 37 | 0.0 | 195.0 | 195.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64007.97 | 0 |
| 4 | 10 | 2 | 39 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117310.979016 | 0 |


In [4]:
X=df.drop('TARGET', axis=1)

y=df['TARGET']

X.shape, y.shape

((76020, 370), (76020,))

In [5]:
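# Split into train and test sets first, so every filter is fitted on the training data only
# and no information leaks from the test set (which would give over-optimistic results).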
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

## Variance Threshold

Feature selector that removes all low-variance features.

This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
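
As a quick illustration of that point, the sketch below fits `VarianceThreshold` on a made-up 4×3 array (the values are arbitrary, chosen only for the example). Note that `fit` is called without any `y`.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: column 0 is constant, columns 1 and 2 vary.
X_toy = np.array([
    [1.0, 0.0, 10.0],
    [1.0, 0.0, 20.0],
    [1.0, 0.0, 30.0],
    [1.0, 1.0, 40.0],
])

selector = VarianceThreshold(threshold=0.0)
selector.fit(X_toy)                  # no target needed

print(X_toy.var(axis=0))             # per-column variance, for reference
mask = selector.get_support()        # True for columns that are kept
print(mask)
print(selector.transform(X_toy).shape)
```

The constant first column is the only one dropped, which is the same behaviour the notebook relies on for the Santander data.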

## Constant Feature Removal

In [6]:
const_filter=VarianceThreshold(threshold=0)
const_filter.fit(X_train)

VarianceThreshold(threshold=0)

In [7]:
const_filter.get_support().sum()

332

In [8]:
# get list of constant features
const_list=[not temp for temp in const_filter.get_support()]
X.columns[const_list]

Index(['ind_var2_0', 'ind_var2', 'ind_var27_0', 'ind_var28_0', 'ind_var28',
       'ind_var27', 'ind_var41', 'ind_var46_0', 'ind_var46', 'num_var27_0',
       'num_var28_0', 'num_var28', 'num_var27', 'num_var41', 'num_var46_0',
       'num_var46', 'saldo_var28', 'saldo_var27', 'saldo_var41', 'saldo_var46',
       'delta_imp_reemb_var33_1y3', 'delta_num_reemb_var33_1y3',
       'imp_amort_var18_hace3', 'imp_amort_var34_hace3',
       'imp_reemb_var13_hace3', 'imp_reemb_var33_hace3',
       'imp_reemb_var33_ult1', 'imp_trasp_var17_out_hace3',
       'imp_trasp_var33_out_hace3', 'num_var2_0_ult1', 'num_var2_ult1',
       'num_reemb_var13_hace3', 'num_reemb_var33_hace3',
       'num_reemb_var33_ult1', 'num_trasp_var17_out_hace3',
       'num_trasp_var33_out_hace3', 'saldo_var2_ult1',
       'saldo_medio_var13_medio_hace3'],
      dtype='object')

In [9]:
X_train_filter=const_filter.transform(X_train)
X_test_filter=const_filter.transform(X_test)

## The steps above can be wrapped in a function: RemoveConstantFeatures

In [10]:
def RemoveConstantFeatures(X_train, X_test): 
    const_filter=VarianceThreshold(threshold=0)
    const_filter.fit(X_train)
    X_train_filter=const_filter.transform(X_train)
    X_test_filter=const_filter.transform(X_test)

    return X_train_filter, X_test_filter

X_train_filter,X_test_filter=RemoveConstantFeatures(X_train, X_test)

In [11]:
X_train_filter.shape, X_test_filter.shape, X_train.shape

((60816, 332), (15204, 332), (60816, 370))
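
By default `transform` returns a plain NumPy array, so the column names are lost after filtering. One possible way to keep them, sketched here on a toy frame, is to rebuild a DataFrame from the selector's support mask. The helper name `filter_keep_names` and the toy columns are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def filter_keep_names(selector, frame):
    """Apply a fitted selector and return a DataFrame with the surviving column names."""
    kept = frame.columns[selector.get_support()]
    return pd.DataFrame(selector.transform(frame), columns=kept, index=frame.index)

toy_train = pd.DataFrame({"const": [5, 5, 5], "x1": [1, 2, 3], "x2": [0, 1, 0]})
toy_test = pd.DataFrame({"const": [5, 5], "x1": [4, 6], "x2": [1, 1]})

# Fit on the training frame only, then reuse the same selector for the test frame.
selector = VarianceThreshold(threshold=0).fit(toy_train)
toy_train_f = filter_keep_names(selector, toy_train)
toy_test_f = filter_keep_names(selector, toy_test)
print(list(toy_train_f.columns))
```

In the toy test frame `x2` happens to be constant, but it is still kept because the mask comes from the training data.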

## Quasi Constant Filter

Quasi-constant features have nearly the same value in almost every row. Such features contribute little to the prediction, so it is usually better to remove them from the dataset.

Their variance is close to 0. A common choice is a variance threshold of 0.01, which for a binary (0/1) feature corresponds to roughly 99% of the values being the same. For other features the variance depends on the feature's scale, so the threshold does not translate directly into a share of repeated values.
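
For a binary (0/1) feature, the link between the variance threshold and the share of repeated values can be worked out directly: its variance is p(1 − p), where p is the fraction of rows holding the majority value. The sketch below solves p(1 − p) = t for t = 0.01. The helper name `majority_share_for_variance` is hypothetical.

```python
import math

def majority_share_for_variance(threshold):
    """Majority share p of a 0/1 feature whose variance p * (1 - p) equals `threshold`."""
    return (1 + math.sqrt(1 - 4 * threshold)) / 2

p = majority_share_for_variance(0.01)
print(round(p, 4))  # 0.9899
```

So a threshold of 0.01 removes binary features in which about 99% or more of the rows share one value.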

In [12]:
const_filter_quasi=VarianceThreshold(threshold=0.01)
const_filter_quasi.fit(X_train_filter)

VarianceThreshold(threshold=0.01)

In [13]:
const_filter_quasi.get_support().sum()

271

In [14]:
X_train_quasi_filter=const_filter_quasi.transform(X_train_filter)
X_test_quasi_filter=const_filter_quasi.transform(X_test_filter)

## The steps above can be wrapped in a function: RemoveQuasiConstantFeatures

In [15]:
def RemoveQuasiConstantFeatures(X_train, X_test): 
    const_filter=VarianceThreshold(threshold=0.01)
    const_filter.fit(X_train)
    X_train_filter=const_filter.transform(X_train)
    X_test_filter=const_filter.transform(X_test)

    return X_train_filter, X_test_filter

# Calling function : RemoveQuasiConstantFeatures
X_train_quasi_filter,X_test_quasi_filter=RemoveQuasiConstantFeatures(X_train_filter, X_test_filter)

In [16]:
X_train_quasi_filter.shape, X_test_quasi_filter.shape, X_train.shape

((60816, 271), (15204, 271), (60816, 370))

## Remove Duplicate Features

In [17]:
X_train_Dup_T = X_train_quasi_filter.T
X_test_Dup_T = X_test_quasi_filter.T

In [18]:
X_train_Dup_T.shape, X_test_Dup_T.shape

((271, 60816), (271, 15204))

In [19]:
X_train_Dup_T=pd.DataFrame(X_train_Dup_T)
X_test_Dup_T=pd.DataFrame(X_test_Dup_T)

In [20]:
X_train_Dup_T.duplicated().sum()

16

In [21]:
X_test_Dup_T.duplicated().sum()

49

In [22]:
X_train_Dup_T[[not index for index in X_train_Dup_T.duplicated()]].T

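| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 99382.0 | 2.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 149677.350000 |
| 1 | 53154.0 | 2.0 | 32.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 86102.460000 |
| 2 | 82346.0 | 2.0 | 33.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 107643.720000 |
| 3 | 139550.0 | 2.0 | 65.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 270251.340000 |
| 4 | 49164.0 | 2.0 | 23.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117310.979016 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 60811 | 63380.0 | 2.0 | 27.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 77920.290000 |
| 60812 | 125747.0 | 2.0 | 24.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117310.979016 |
| 60813 | 133088.0 | 2.0 | 23.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 163432.470000 |
| 60814 | 36696.0 | 2.0 | 23.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 176638.350000 |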


In [23]:
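def RemoveDuplicateFeatures(X_train, X_test): 
    # Transpose so that each feature becomes a row, then wrap in DataFrames
    X_train_Dup_T = pd.DataFrame(X_train.T)
    X_test_Dup_T = pd.DataFrame(X_test.T)
    # Flag duplicated features using the training data only
    keep = ~X_train_Dup_T.duplicated()
    # Apply the same mask to train and test so both keep the same columns
    return X_train_Dup_T[keep].T, X_test_Dup_T[keep].T

In [24]:
# Duplicates must be detected on the training set only; filtering the test set
# on its own would drop a different set of columns (49 instead of 16 above).
X_train_dup_filter, X_test_dup_filter = RemoveDuplicateFeatures(X_train_quasi_filter, X_test_quasi_filter)

In [25]:
X_train_dup_filter.shape, X_test_dup_filter.shape

((60816, 255), (15204, 255))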
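
The transpose-and-`duplicated()` idea is easier to see on a toy frame (the column names here are invented for illustration). In this sketch the mask is computed on the training frame only and reused for the test frame, which keeps the two column sets aligned.

```python
import pandas as pd

toy_train = pd.DataFrame({
    "f1": [1, 0, 1, 0],
    "f2": [3, 4, 5, 6],
    "f1_dup": [1, 0, 1, 0],   # identical to f1 in the training data
})
toy_test = pd.DataFrame({"f1": [1, 1], "f2": [7, 8], "f1_dup": [1, 0]})

# duplicated() compares rows, so transpose to compare columns instead.
dup_mask = toy_train.T.duplicated()
keep_cols = toy_train.columns[~dup_mask.to_numpy()]

# Select the same columns from both frames.
toy_train_u = toy_train[keep_cols]
toy_test_u = toy_test[keep_cols]
print(list(keep_cols))
```

`duplicated()` marks every repeat after the first occurrence, so the first copy of each duplicated feature is the one that survives.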

## Build ML Model

In [26]:
def randomForestClassifier(X_train, X_test, y_train, y_test): 
    
    clf=RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
    clf.fit(X_train, y_train)
    y_pred=clf.predict(X_test)
    print("Accuracy on test set:", accuracy_score(y_test, y_pred))

In [27]:
%%time
randomForestClassifier(X_train, X_test, y_train, y_test)

Accuracy on test set: 0.9594185740594581
Wall time: 9.4 s


In [28]:
%%time
randomForestClassifier(X_train_quasi_filter, X_test_quasi_filter, y_train, y_test)

Accuracy on test set: 0.9591554853985793
Wall time: 8.06 s
