<a href="https://colab.research.google.com/github/plaban1981/Feature_Selection/blob/master/Feature_Selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Feature Selection

Feature selection is the process of selecting a subset of relevant features (variables / predictors) for use in a machine learning model.

https://www.trainindata.com/post/best-resources-to-learn-python-for-data-science

https://www.trainindata.com/post/best-resources-to-learn-machine-learning

https://www.kaggle.com/c/santander-customer-satisfaction/data

## Feature Selection Methods

* Filter
    - Variance
    - Correlation
    - Univariate Selection
* Wrapper
    - step forward selection 
    - step backward selection
    - Exhaustive Search (less frequently used)
    - feature shuffling
* Embedded
    - LASSO Regression
    - Decision Tree Derived Importance
    - Regression Coefficients

* Hybrid Method
    - Recursive Feature Elimination (wrapper + embedded)

## Why select features?

* Simpler models are easier to interpret
* Shorter training time
* Enhanced generalization by reducing overfitting
* Easier implementation. During deployment, it is easier to write and maintain code for a few variables than for many.
* Reduced Risk of Data Errors during model use
* Variable Redundancy - highly correlated data features
* Bad learning behaviour in high dimensional spaces.

## Feature Selection Procedure
A feature selection procedure combines a search technique that proposes new feature subsets with an evaluation measure that scores each subset.

- It is computationally expensive
- The optimal feature subset differs from one machine learning model to another, so the same subset does not give the best performance for every model.
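
To give a sense of why the search is expensive, here is a small sketch that counts the candidate subsets an exhaustive search would have to score. The feature names are hypothetical and used only for illustration.

```python
from itertools import combinations


def count_feature_subsets(n_features: int) -> int:
    """Number of non-empty feature subsets an exhaustive search would score."""
    return 2 ** n_features - 1


# Hypothetical feature names, for illustration only
features = ["age", "income", "balance", "tenure"]

# Enumerate every non-empty subset explicitly for a tiny feature space
all_subsets = [
    subset
    for size in range(1, len(features) + 1)
    for subset in combinations(features, size)
]

print(len(all_subsets))            # agrees with the closed-form count below
print(count_feature_subsets(4))
print(count_feature_subsets(370))  # a 112-digit number: far too many models to train
```

With only 4 features there are 15 subsets, but the count doubles with every added feature, which is why practical search strategies only explore a small part of the space.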

## 1. FILTER METHOD

* Rely on characteristics of the data
* Do not use any machine learning algorithms
* Model agnostic
* Computationally less expensive
* Usually give lower prediction performance compared to Wrapper Methods
* Well suited for quick screening and removal of irrelevant features

## 2. WRAPPER METHOD

* Use a predictive machine learning model to score the feature subset
* Train a new model on each feature subset.
* Computationally very expensive, as a separate machine learning model is built for each feature subset.
* Usually provide best performing feature subset for a given machine learning algorithm
* They may not produce the best feature combination for a different machine learning model.

## 3. EMBEDDED METHOD

* Perform feature selection as part of the model construction process
* Consider the interaction between features and model
* Less computationally expensive than wrapper methods because they tend to fit the machine learning model only once

#### Note:
Embedded methods are often the preferred choice for feature selection.

## FILTER METHODS

* Select variables independently of the machine learning algorithm.
* Rely on characteristics of the data


#### Note :
Advantages : 

- Model Agnostic - Features selected can be used as a part of any machine learning algorithm.
- Fast computation, inexpensive


A typical FILTER method (usually UNIVARIATE) involves a two-step procedure:

* Rank features according to a certain criterion
  - each feature is ranked independently of the feature space
* Select the highest ranking features


Because features are selected on the basis of their individual ranking, the relationship between features is not considered.
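
The two steps above can be sketched with scikit-learn's `SelectKBest`. This is a minimal illustration on synthetic data, not the Santander dataset; the column names, the ANOVA F-test (`f_classif`) as ranking criterion and `k=1` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (assumption): one informative feature and two pure-noise features
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = pd.DataFrame({
    "signal": y + rng.normal(0, 0.1, size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})

# Step 1: rank each feature independently with a univariate test
selector = SelectKBest(score_func=f_classif, k=1)
selector.fit(X, y)
ranking = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)

# Step 2: keep the k highest ranking features
selected = list(X.columns[selector.get_support()])

print(ranking)
print(selected)
```

Each score is computed from one feature and the target only, so two redundant features would both rank highly here.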

## RANKING CRITERIA

#### Feature scores on various statistical tests

* Chi-Square | Fisher Score
* ANOVA
* Mutual Information
* Variance (eliminate)
  - Constant Features
  - Quasi-Constant Features

## FILTER METHOD (MULTIVARIATE)

* Handle redundant features
* Duplicated features
* Correlated features

Filter methods in general provide a simple yet powerful technique to quickly remove redundant and irrelevant features. In fact, they are usually the first step in the selection procedure. Always evaluate variables individually.
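
As an illustration of the multivariate case, the sketch below flags features that are highly correlated with a feature that is already kept. The helper `correlated_features`, the 0.8 cut-off and the synthetic columns are assumptions made for the example, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def correlated_features(X: pd.DataFrame, threshold: float = 0.8) -> list:
    """Columns whose absolute Pearson correlation with an earlier, kept column exceeds threshold."""
    corr = X.corr().abs()
    to_drop = []
    for i, col in enumerate(corr.columns):
        # compare only against earlier columns that have not been dropped
        for earlier in corr.columns[:i]:
            if earlier not in to_drop and corr.loc[earlier, col] > threshold:
                to_drop.append(col)
                break
    return to_drop


# Synthetic data (assumption): "a_scaled" is almost a rescaled copy of "a"
rng = np.random.default_rng(0)
a = rng.normal(size=500)
X = pd.DataFrame({
    "a": a,
    "a_scaled": 2 * a + rng.normal(0, 0.01, size=500),
    "b": rng.normal(size=500),
})

redundant = correlated_features(X)
print(redundant)
```

Which member of a correlated pair is dropped depends on column order here; in practice that choice may deserve more thought.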

## WRAPPER METHOD

FILTER methods select features independently of any machine learning algorithm.

WRAPPER methods, in contrast, evaluate the features using a specific machine learning algorithm. They search for the optimal feature subset for the desired classifier.

The optimal feature subset depends on the corresponding machine learning algorithm.

They detect interactions between features and model.

## PROCEDURE

* Search for a subset of Features
* Build a machine learning model for the selected feature subset
* Evaluate Model Performance
* Repeat these steps until the model score no longer improves:
   - Search for a feature subset
   - Build a machine learning model
   - Evaluate model performance

## Three Mechanisms to search for features

- Forward Feature Selection
     - Add one feature at a time
- Backward Feature Elimination
     - Removes one feature at a time
- Exhaustive Feature Search
    - Searches across all feature combinations- creates a machine learning model for each feature combination

#### Specific properties of Wrapper Methods
* Greedy algorithms
* Aim to find the best possible feature combination
* Computationally expensive
* Often impracticable (exhaustive search in particular) when the feature space is big
* The biggest challenge is deciding when to stop the search, typically when the performance of the machine learning model does not improve further.
* The performance threshold is somewhat arbitrary and needs to be decided by the user.
* Another stopping criterion is reaching the desired number of features. This also needs to be predefined by the user.
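
The search loop and both stopping criteria can be sketched as a hand-written step forward selection. This is a minimal illustration under stated assumptions: the function `forward_selection`, its parameters `min_gain` and `max_features`, the synthetic data and the choice of `LogisticRegression` are all made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def forward_selection(model, X, y, min_gain=0.001, max_features=None, cv=3):
    """Greedy step forward selection scored by cross-validated accuracy."""
    selected, best_score = [], -np.inf
    remaining = list(X.columns)
    max_features = max_features or len(remaining)
    # stopping criterion 1: the desired number of features is reached
    while remaining and len(selected) < max_features:
        # build and evaluate one model per candidate feature
        scores = {
            feat: cross_val_score(model, X[selected + [feat]], y, cv=cv).mean()
            for feat in remaining
        }
        candidate = max(scores, key=scores.get)
        # stopping criterion 2: the score improves by less than the threshold
        if scores[candidate] - best_score < min_gain:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = scores[candidate]
    return selected, best_score


# Synthetic data (assumption): one informative feature and two noise features
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = pd.DataFrame({
    "signal": y + rng.normal(0, 0.1, size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})

selected, score = forward_selection(LogisticRegression(), X, y)
print(selected, score)
```

Both `min_gain` and `max_features` have to be chosen by the user, which is the arbitrariness noted in the list above.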


#### Wrapper Method Advantages
* Better predictive accuracy compared to Filter Methods.
* Best performing feature subset for a predefined classifier or regressor.

#### Wrapper Method Disadvantages
* Computationally expensive.
* The stopping criterion is relatively arbitrary and needs to be set by the user.

## EMBEDDED METHODS

FILTER methods in general disregard the interaction between the features. They treat each feature independently.

WRAPPER methods, however, use a predefined machine learning model to evaluate feature importance.

#### Embedded Methods perform feature selection during the modeling algorithm's execution.
#### These methods are embedded in the algorithm either as its normal or extended functionality.

They are constrained by the limitations of the algorithm.

####Advantages

* Greedy algorithm
* Faster than wrapper methods
* More accurate than filter methods
* Detect interaction between variables
* Find the feature subset in the algorithm being trained.

##PROCEDURE

* Train a Machine Learning Algorithm
* Derive the feature importance
* Remove non-important features

LASSO Regression is the simplest EMBEDDED feature selection method.
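
The three steps above can be sketched with a LASSO model wrapped in scikit-learn's `SelectFromModel`. The synthetic regression data and `alpha=0.1` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data (assumption): the target depends on "signal" only
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise_a": rng.normal(size=500),
    "noise_b": rng.normal(size=500),
})
y = 3 * X["signal"] + rng.normal(0, 0.1, size=500)

# 1. Train the model: the L1 penalty shrinks uninformative coefficients to zero
selector = SelectFromModel(Lasso(alpha=0.1))
selector.fit(X, y)

# 2. Derive the feature importance (here, the absolute coefficients)
importance = pd.Series(np.abs(selector.estimator_.coef_), index=X.columns)

# 3. Remove the non-important features
selected = list(X.columns[selector.get_support()])

print(importance)
print(selected)
```

The model is fitted once, and the selection falls out of the fitted coefficients, which is why this is cheaper than a wrapper search.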

##BASIC FILTER METHODS

They help to identify and remove :-
- Constant Features
- Quasi Constant Features (single value occupying majority of the population)
- Duplicated Features
    - Duplicate Features may arise out of one hot encoding of categorical values.
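
Duplicated features are not demonstrated in the cells below, so here is a minimal sketch of one way to find them with pandas. The toy columns are hypothetical; on a wide dataset the transpose can be memory hungry.

```python
import pandas as pd

# Toy data (assumption): "is_red" repeats "colour_red", as can happen after one hot encoding
X = pd.DataFrame({
    "colour_red": [1, 0, 0, 1],
    "is_red": [1, 0, 0, 1],
    "colour_blue": [0, 1, 1, 0],
})

# Transpose so that columns become rows, then flag rows identical to an earlier one
duplicated_mask = X.T.duplicated().to_numpy()
duplicated_features = list(X.columns[duplicated_mask])

# Keep only the first occurrence of each duplicated column
X_reduced = X.loc[:, ~duplicated_mask]

print(duplicated_features)
print(X_reduced.shape)
```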

## Constant features

Constant features are those that show the same value, just one value, for all the observations of the dataset. That is, the same value for all the rows of the dataset. These features provide no information that allows a machine learning model to discriminate or predict a target.

Identifying and removing constant features is an easy first step towards feature selection and more easily interpretable machine learning models.

Here I will demonstrate how to identify constant features using the Santander Customer Satisfaction dataset from Kaggle.

To identify constant features, we can use the VarianceThreshold transformer from sklearn, or we can code it ourselves. I will show 2 snippets of code with both procedures.


In [0]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold

In [0]:
# Install Kaggle library
!pip install -q kaggle

In [0]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json



In [0]:
# The Kaggle API client expects this file to be in ~/.kaggle,
# so move it there.
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c santander-customer-satisfaction



In [0]:
! mkdir santander_data
! unzip train.csv.zip -d santander_train
! unzip test.csv.zip -d santander_test

Archive:  train.csv.zip
  inflating: santander_train/train.csv  
Archive:  test.csv.zip
  inflating: santander_test/test.csv  


In [0]:
!kaggle competitions download -c bnp-paribas-cardif-claims-management

train.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
test.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
sample_submission.csv.zip: Skipping, found more recently modified local copy (use --force to force download)


##Import Santander training data

In [0]:
from google.colab import files
files.upload()

Saving test.csv to test.csv


In [0]:
# load the Santander customer satisfaction dataset from Kaggle
# I load just a few rows for the demonstration

data = pd.read_csv('/content/train.csv',nrows=50000)
data.shape

(50000, 371)

##Check for missing values

In [0]:
# check the presence of null data.
# The snippets below will be able to compare nan values between 2 columns,
# so in principle missing data are not a problem.
# in any case, we see that there are no missing data in this dataset
missing_features = data.isnull().sum()[data.isnull().sum() > 0]

In [0]:
missing_features

Series([], dtype: int64)

### Important

In all feature selection procedures, it is good practice to select the features by examining only the training set, in order to avoid overfitting.

In [0]:
# Separate dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(data.drop(labels=['TARGET'], axis=1),data['TARGET'],test_size=0.3,random_state=0)

X_train.shape, X_test.shape

((35000, 370), (15000, 370))

### Using variance threshold from sklearn

**Variance threshold** from sklearn is a simple baseline approach to feature selection. 

It removes all features whose variance doesn’t meet the specified threshold.

By default, **it removes all zero-variance features**, i.e., features that have the same value in all samples.

In [0]:
sel = VarianceThreshold(threshold=0)
sel.fit(X_train,y_train)

VarianceThreshold(threshold=0)

In [0]:
# get_support is a boolean vector that indicates which features are retained
# if we sum over get_support, we get the number of features that are not constant
sum(sel.get_support())

312

In [0]:
# another way of finding non-constant features is like this:
len(X_train.columns[sel.get_support()])

312

In [0]:
# finally we can print the constant features:
# get_support() flags the retained columns, so its negation flags the constant ones
constant_columns = X_train.columns[~sel.get_support()]
print(len(constant_columns))

list(constant_columns)

58


['ind_var2_0',
 'ind_var2',
 'ind_var13_medio_0',
 'ind_var13_medio',
 'ind_var27_0',
 'ind_var28_0',
 'ind_var28',
 'ind_var27',
 'ind_var34_0',
 'ind_var34',
 'ind_var41',
 'ind_var46_0',
 'ind_var46',
 'num_var13_medio_0',
 'num_var13_medio',
 'num_var27_0',
 'num_var28_0',
 'num_var28',
 'num_var27',
 'num_var34_0',
 'num_var34',
 'num_var41',
 'num_var46_0',
 'num_var46',
 'saldo_var13_medio',
 'saldo_var28',
 'saldo_var27',
 'saldo_var34',
 'saldo_var41',
 'saldo_var46',
 'delta_imp_amort_var34_1y3',
 'delta_imp_reemb_var33_1y3',
 'delta_num_reemb_var33_1y3',
 'imp_amort_var18_hace3',
 'imp_amort_var34_hace3',
 'imp_amort_var34_ult1',
 'imp_reemb_var13_hace3',
 'imp_reemb_var17_hace3',
 'imp_reemb_var33_hace3',
 'imp_reemb_var33_ult1',
 'imp_trasp_var17_out_hace3',
 'imp_trasp_var33_out_hace3',
 'imp_venta_var44_hace3',
 'num_var2_0_ult1',
 'num_var2_ult1',
 'num_meses_var13_medio_ult3',
 'num_reemb_var13_hace3',
 'num_reemb_var17_hace3',
 'num_reemb_var33_hace3',
 'num_reemb_var

We can see that 58 columns / variables are constant. This means that 58 variables show the same value, just one value, for all the observations of the training set.

In [0]:
# let's visualise the values of one of the constant variables
# as an example

X_train['ind_var2_0'].unique()

array([0])

#### All values of the "ind_var2_0" feature are 0, so it is a constant feature, which VarianceThreshold can detect

We then use the transform function to reduce the training and testing sets. See below.

In [0]:
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((35000, 312), (15000, 312))

## Coding it ourselves instead of using VarianceThreshold: using the standard deviation of each feature to detect constant features

In [0]:
# load the dataset again
data = pd.read_csv('/content/train.csv', nrows=50000)

# separate train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 370), (15000, 370))

In [0]:
X_train.head()

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,imp_op_var40_ult1,imp_op_var41_comer_ult1,imp_op_var41_comer_ult3,imp_op_var41_efect_ult1,imp_op_var41_efect_ult3,imp_op_var41_ult1,imp_op_var39_efect_ult1,imp_op_var39_efect_ult3,imp_op_var39_ult1,imp_sal_var16_ult1,ind_var1_0,ind_var1,ind_var2_0,ind_var2,ind_var5_0,ind_var5,ind_var6_0,ind_var6,ind_var8_0,ind_var8,ind_var12_0,ind_var12,ind_var13_0,ind_var13_corto_0,ind_var13_corto,ind_var13_largo_0,ind_var13_largo,ind_var13_medio_0,ind_var13_medio,ind_var13,...,saldo_medio_var5_hace3,saldo_medio_var5_ult1,saldo_medio_var5_ult3,saldo_medio_var8_hace2,saldo_medio_var8_hace3,saldo_medio_var8_ult1,saldo_medio_var8_ult3,saldo_medio_var12_hace2,saldo_medio_var12_hace3,saldo_medio_var12_ult1,saldo_medio_var12_ult3,saldo_medio_var13_corto_hace2,saldo_medio_var13_corto_hace3,saldo_medio_var13_corto_ult1,saldo_medio_var13_corto_ult3,saldo_medio_var13_largo_hace2,saldo_medio_var13_largo_hace3,saldo_medio_var13_largo_ult1,saldo_medio_var13_largo_ult3,saldo_medio_var13_medio_hace2,saldo_medio_var13_medio_hace3,saldo_medio_var13_medio_ult1,saldo_medio_var13_medio_ult3,saldo_medio_var17_hace2,saldo_medio_var17_hace3,saldo_medio_var17_ult1,saldo_medio_var17_ult3,saldo_medio_var29_hace2,saldo_medio_var29_hace3,saldo_medio_var29_ult1,saldo_medio_var29_ult3,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38
17967,35993,2,24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,89769.81
32391,64721,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0.87,3.0,2.28,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,126531.21
9341,18794,2,39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,65613.36
7929,15983,2,47,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,1,0,0,0,0,1,0,1,1,1,0,0,0,0,1,...,5323.65,3.0,1776.54,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,165000.0,5322.57,165000.0,111774.18,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,59580.39
46544,93096,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,7.56,15.0,12.51,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,75652.23


In [0]:
# short and easy: find constant features
# in this case, all features are numeric, so this will suffice
constant_features = [feat for feat in X_train.columns if X_train[feat].std() == 0]
#
len(constant_features)

58

#### Here we have also detected 58 features using standard deviation

In [0]:
# we can then drop these columns from the train and test sets
X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 312), (15000, 312))

We see how, by removing constant features, we managed to reduce the feature space quite a bit.

Both VarianceThreshold and the snippet of code I provided work with numerical variables. What can we do to find constant categorical variables?

One alternative is to encode the categories as numbers and then use the code above. But then you will put effort into pre-processing variables that are not informative.

Alternatively, you can use the code below.

### Removing constant features for categorical variables

In [0]:
# load the dataset again
data = pd.read_csv('train.csv', nrows=50000)

# separate train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 370), (15000, 370))

In [0]:
# I will transform all these numeric features into
# categorical features for the demonstration
# to simulate that they are categorical

X_train = X_train.astype('O')
X_train.dtypes

ID                         object
var3                       object
var15                      object
imp_ent_var16_ult1         object
imp_op_var39_comer_ult1    object
                            ...  
saldo_medio_var44_hace2    object
saldo_medio_var44_hace3    object
saldo_medio_var44_ult1     object
saldo_medio_var44_ult3     object
var38                      object
Length: 370, dtype: object

In [0]:
# and now find those columns that contain only 1 label:
constant_features = [
    feat for feat in X_train.columns if len(X_train[feat].unique()) == 1
]

len(constant_features)

58

Same as before, we observe 58 variables that show only 1 value across all the observations of the dataset. We can appreciate the usefulness of looking out for constant variables at the beginning of any modeling exercise.

## Quasi-constant features

Quasi-constant features are those that show the same value for the great majority of the observations of the dataset. In general, these features provide little if any information that allows a machine learning model to discriminate or predict a target. But there can be exceptions, so you should be careful when removing this type of feature.

Identifying and removing quasi-constant features is an easy first step towards feature selection and more easily interpretable machine learning models.


In [2]:
from google.colab import files
files.upload()

Saving Santander_train.csv to Santander_train.csv


In [0]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.feature_selection import VarianceThreshold

In [31]:
# load the Santander customer satisfaction dataset from Kaggle
# I load just a few rows for the demonstration
data = pd.read_csv('Santander_train.csv', nrows=50000)
data.shape

(50000, 371)

## Check for Null Data

In [5]:
data.isnull().sum()[data.isnull().sum() > 0]

Series([], dtype: int64)

#### In all feature selection procedures, it is good practice to select the features by examining only the training set, in order to avoid overfitting.

In [6]:
data.head()

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,imp_op_var40_ult1,imp_op_var41_comer_ult1,imp_op_var41_comer_ult3,imp_op_var41_efect_ult1,imp_op_var41_efect_ult3,imp_op_var41_ult1,imp_op_var39_efect_ult1,imp_op_var39_efect_ult3,imp_op_var39_ult1,imp_sal_var16_ult1,ind_var1_0,ind_var1,ind_var2_0,ind_var2,ind_var5_0,ind_var5,ind_var6_0,ind_var6,ind_var8_0,ind_var8,ind_var12_0,ind_var12,ind_var13_0,ind_var13_corto_0,ind_var13_corto,ind_var13_largo_0,ind_var13_largo,ind_var13_medio_0,ind_var13_medio,ind_var13,...,saldo_medio_var5_ult1,saldo_medio_var5_ult3,saldo_medio_var8_hace2,saldo_medio_var8_hace3,saldo_medio_var8_ult1,saldo_medio_var8_ult3,saldo_medio_var12_hace2,saldo_medio_var12_hace3,saldo_medio_var12_ult1,saldo_medio_var12_ult3,saldo_medio_var13_corto_hace2,saldo_medio_var13_corto_hace3,saldo_medio_var13_corto_ult1,saldo_medio_var13_corto_ult3,saldo_medio_var13_largo_hace2,saldo_medio_var13_largo_hace3,saldo_medio_var13_largo_ult1,saldo_medio_var13_largo_ult3,saldo_medio_var13_medio_hace2,saldo_medio_var13_medio_hace3,saldo_medio_var13_medio_ult1,saldo_medio_var13_medio_ult3,saldo_medio_var17_hace2,saldo_medio_var17_hace3,saldo_medio_var17_ult1,saldo_medio_var17_ult3,saldo_medio_var29_hace2,saldo_medio_var29_hace3,saldo_medio_var29_ult1,saldo_medio_var29_ult3,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
0,1,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39205.17,0
1,3,2,34,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,300.0,122.22,300.0,240.75,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,49278.03,0
2,4,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,3.0,2.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67333.77,0
3,8,2,37,0.0,195.0,195.0,0.0,0.0,0.0,0.0,0.0,195.0,195.0,0.0,0.0,195.0,0.0,0.0,195.0,0.0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,91.56,138.84,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64007.97,0
4,10,2,39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,...,40501.08,13501.47,0.0,0.0,0.0,0.0,0.0,0.0,85501.89,85501.89,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016,0


In [0]:
X = data.drop('TARGET',axis=1)
Y = data['TARGET']

## Train test split data

In [0]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.30,random_state=1)

In [9]:
X_train.shape, X_test.shape

((35000, 370), (15000, 370))

##Remove constant features

In [0]:
constant_features = [col for col in X_train.columns if X_train[col].std() == 0]

In [14]:
len(constant_features)

60

In [12]:
# reassign instead of dropping in place, which avoids pandas' SettingWithCopyWarning
X_train = X_train.drop(labels=constant_features, axis=1)
X_test = X_test.drop(labels=constant_features, axis=1)


In [13]:
X_train.shape, X_test.shape

((35000, 310), (15000, 310))

##Removing quasi-constant features

**Variance threshold from sklearn is a simple baseline approach to feature selection.**

 It removes all features whose variance doesn’t meet some threshold. 
 
 By default, it removes all zero-variance features, i.e., features that have the same value in all samples.

We will change the default threshold to remove almost / quasi-constant features.
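
To see how a variance threshold relates to the share of the predominant value, consider a binary 0/1 feature: if a fraction p of the observations equal 1, its variance is p(1 - p). The sketch below works through that arithmetic; the helper `binary_variance` is hypothetical. Note that this correspondence holds for binary features only, since for other features the variance also depends on the scale of the values.

```python
import numpy as np


def binary_variance(p: float) -> float:
    """Variance of a 0/1 feature in which a fraction p of the observations equal 1."""
    return p * (1 - p)


print(binary_variance(0.01))  # 1% ones, 99% zeros: about 0.0099, just under 0.01
print(binary_variance(0.02))  # 2% ones: about 0.0196, above 0.01

# Same check on simulated data: 10 ones among 1000 observations.
# ndarray.var() computes the population variance (ddof=0).
feature = np.array([1] * 10 + [0] * 990)
print(feature.var())
```

So a threshold of 0.01 removes a binary feature only when roughly 99% or more of its observations share one value.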

In [16]:
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0.01)  # for a binary (0/1) feature, variance < 0.01 means roughly 99% of observations share one value
sel.fit(X_train)  # fit finds the features with low variance

VarianceThreshold(threshold=0.01)

In [18]:
# get_support is a boolean vector that indicates which features are retained. If we sum over get_support, we get the number
# of features that are not quasi-constant

print(sum(sel.get_support()))

257


In [19]:
# another way of doing the above operation:
len(X_train.columns[sel.get_support()])

257

## Print quasi constant features

In [21]:
print(len([x for x in X_train.columns if x not in X_train.columns[sel.get_support()] ]))

53


In [20]:
print([x for x in X_train.columns if x not in X_train.columns[sel.get_support()] ])

['ind_var1', 'ind_var6_0', 'ind_var6', 'ind_var13_largo', 'ind_var14', 'ind_var17_0', 'ind_var17', 'ind_var19', 'ind_var20_0', 'ind_var20', 'ind_var29_0', 'ind_var29', 'ind_var30_0', 'ind_var31_0', 'ind_var31', 'ind_var32_cte', 'ind_var32_0', 'ind_var32', 'ind_var33_0', 'ind_var33', 'ind_var34_0', 'ind_var34', 'ind_var40', 'ind_var39', 'ind_var44_0', 'ind_var44', 'num_var6_0', 'num_var6', 'num_op_var40_hace3', 'num_var29_0', 'num_var29', 'num_var33_0', 'num_var33', 'num_var34_0', 'num_var34', 'delta_imp_aport_var33_1y3', 'delta_num_aport_var33_1y3', 'ind_var7_emit_ult1', 'ind_var7_recib_ult1', 'num_aport_var33_hace3', 'num_aport_var33_ult1', 'num_var7_emit_ult1', 'num_meses_var17_ult3', 'num_meses_var29_ult3', 'num_meses_var33_ult3', 'num_meses_var44_ult3', 'num_reemb_var13_ult1', 'num_trasp_var17_in_hace3', 'num_trasp_var17_in_ult1', 'num_trasp_var17_out_ult1', 'num_trasp_var33_in_hace3', 'num_trasp_var33_in_ult1', 'num_venta_var44_hace3']


#### We can see that 53 columns / variables are almost constant. This means that 53 variables show predominantly one value for ~99% of the observations of the training set.

The statistics below illustrate the reasoning behind quasi-constant features.

In [24]:
# np.float was removed from recent NumPy releases; plain division already returns floats
print(X_train['ind_var1'].value_counts() / len(X_train))

0    0.995857
1    0.004143
Name: ind_var1, dtype: float64


We can see that > 99% of the observations show one value, 0. Therefore, this feature is almost constant.

In [25]:
X_train.shape, X_test.shape

((35000, 310), (15000, 310))

## Removing Quasi constant features from train and test data

In [0]:
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

In [27]:
X_train.shape, X_test.shape

((35000, 257), (15000, 257))

By removing constant and almost constant features, we reduced the feature space from 370 to 257 features.

## Removing Quasi Constant features manually through coding

reload the data again

In [32]:
# load the dataset
data = pd.read_csv('Santander_train.csv', nrows=50000)
X = data.drop('TARGET', axis=1)
Y = data['TARGET']
# separate train and test
X_train, X_test, y_train, y_test = train_test_split( X,Y,
    test_size=0.3,
    random_state=0)

# remove constant features
# using the code from the constant features section above
constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

# reassign instead of dropping in place, which avoids pandas' SettingWithCopyWarning
X_train = X_train.drop(labels=constant_features, axis=1)
X_test = X_test.drop(labels=constant_features, axis=1)

X_train.shape, X_test.shape


((35000, 312), (15000, 312))

## Select quasi-constant features

In [33]:
quasi_constant_feat = []
for feature in X_train.columns:

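    # find the share of observations taken by the predominant value
    # (value_counts sorts in descending order, so the first entry is the largest)
    predominant = (X_train[feature].value_counts() / len(X_train)).values[0]

    # flag the feature if its predominant value covers more than 99.5% of observations
    if predominant > 0.995: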
        quasi_constant_feat.append(feature)

len(quasi_constant_feat)

151

Our method was a bit more aggressive than VarianceThreshold from sklearn with the threshold that we selected above. It found 151 features that show predominantly one value for the majority of the observations. Let's see what some of the quasi-constant features look like.

In [34]:
quasi_constant_feat[0]

'imp_op_var40_comer_ult1'

In [35]:
X_train['imp_op_var40_efect_ult1'].value_counts() / len(X_train)

0.00       0.999400
900.00     0.000086
60.00      0.000057
1800.00    0.000057
600.00     0.000057
930.00     0.000029
420.00     0.000029
74.28      0.000029
270.00     0.000029
1200.00    0.000029
6600.00    0.000029
870.00     0.000029
750.00     0.000029
300.00     0.000029
120.00     0.000029
210.00     0.000029
150.00     0.000029
Name: imp_op_var40_efect_ult1, dtype: float64

In [38]:
# check whether ind_var1 was flagged as quasi-constant
print('ind_var1' in quasi_constant_feat)

True


In [39]:
X_train['ind_var1'].value_counts() / len(X_train)

0    0.995771
1    0.004229
Name: ind_var1, dtype: float64