## Screening features for high-dimensional datasets
## Demonstration

This notebook demonstrates how to use the unsupervised screening of features based on their variance. See the HTML files for a theoretical discussion of alternative implementations of the proposed procedure.

**Summary:**
1. [Libraries](#libraries).
2. [Functions and classes](#functions_classes).
3. [Unsupervised screening of features](#unsupervised_screening).
    * [Importing datasets](#import).
    * [Features selection based on variance](#variance_based_selection).
        * [Default implementation](#default).
        * [Variance thresholding](#var_thres).
        * [Winsorize treatment](#winsorize).
        * [Dropping outliers](#drop_outliers).
        * [Handling collinearity](#collinearity).
    * [Features selection based on correlation](#corr_based_selection).

<a id='libraries'></a>

## Libraries

In [1]:
import pandas as pd
import numpy as np
import json
import os

<a id='functions_classes'></a>

## Functions and classes

In [2]:
import screening_features
from screening_features import VarScreeningNumerical, CorrScreeningNumerical

<a id='unsupervised_screening'></a>

## Unsupervised screening of features

<a id='import'></a>

### Importing datasets

In [3]:
df_train = pd.read_csv('../Datasets/demo_dataset.csv', dtype={'order_id': str})

# Accessory variables:
drop_vars = ['y', 'id']

print('\033[1mShape of df_train:\033[0m ' + str(df_train.shape) + '.')
df_train.head()

Shape of df_train: (1282, 1652).


Unnamed: 0,feat_1,feat_2,feat_3,feat_4,feat_5,feat_6,feat_7,feat_8,feat_9,feat_10,...,feat_1643,feat_1644,feat_1645,feat_1646,feat_1647,feat_1648,feat_1649,feat_1650,y,id
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11912002374
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12001000011
2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12001000012
3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12001000015
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12001000019


#### Scaling data

In [4]:
# Means for each feature:
means_dict = df_train.drop(drop_vars, axis=1).mean().to_dict()

# Divide each feature by its mean (vectorized, aligned on column names):
feature_cols = list(means_dict.keys())
df_train[feature_cols] = df_train[feature_cols] / pd.Series(means_dict)
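A short sketch of why dividing by the mean matters before comparing variances: after this scaling, the variance of a feature equals its squared coefficient of variation, so the unit of measurement no longer affects the ranking. The toy columns below are hypothetical and not part of the demo dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the same quantity recorded in two different units.
toy = pd.DataFrame({
    "metres": [1.0, 2.0, 3.0, 4.0],
    "millimetres": [1000.0, 2000.0, 3000.0, 4000.0],
})

# Raw variances differ by a factor of 1e6 purely because of the unit.
raw_var = toy.var(ddof=0)

# After dividing each column by its mean, the variances coincide.
scaled = toy / toy.mean()
scaled_var = scaled.var(ddof=0)

# The scaled variance is the squared coefficient of variation (std / mean)^2.
cv_squared = (toy.std(ddof=0) / toy.mean()) ** 2

print(raw_var)
print(scaled_var)
```

Note that this scaling is undefined for features whose mean is zero, so such columns would need separate handling.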

<a id='variance_based_selection'></a>

### Features selection based on variance

The screening of features based on variance involves two classes of selection: one for categorical variables and another for numerical (continuous) variables. The screening of categorical features relies on one-hot encoding and a criterion for excluding dummy variables with very low variance (i.e., categories with rare or extremely common occurrence). Since this procedure is rather straightforward, only the screening of numerical features is exemplified here.

The cells below select numerical features by sorting them by variance and, in the last example, by additionally applying the collinearity filter.
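The core idea of sorting by variance and keeping the top `k` features can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of `VarScreeningNumerical`: the helper `top_k_by_variance` and the toy data are hypothetical, and the population variance with missing values ignored is assumed.

```python
import numpy as np
import pandas as pd

def top_k_by_variance(data: pd.DataFrame, k: int) -> list:
    """Return the names of the k columns with the largest variance (NaNs ignored)."""
    variances = pd.Series(
        {col: np.nanvar(data[col].to_numpy(dtype=float)) for col in data.columns}
    )
    # Assumption: features with zero variance are set aside before ranking.
    variances = variances[variances > 0]
    return variances.sort_values(ascending=False).head(k).index.tolist()

# Hypothetical toy data.
toy = pd.DataFrame({
    "a": [0.0, 0.0, 4.0, 0.0],  # high variance
    "b": [1.0, 1.0, 1.0, 1.0],  # zero variance
    "c": [0.9, 1.0, 1.1, 1.0],  # low variance
})

selected = top_k_by_variance(toy, k=2)
print(selected)
```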

In [5]:
# Number of features to be selected:
new_p = 351

# Names of numerical features:
cont_vars = [c for c in df_train.columns if c not in drop_vars]

<a id='default'></a>

#### Default implementation

In [6]:
# Declaring the object for screening features:
screening_cont = VarScreeningNumerical(features = cont_vars, na_features = [], stat='variance',
                                       select_k=True, k=new_p,
                                       thresholding=False, variance_threshold=0,
                                       winsorize=False, winsorize_param=0.025,
                                       drop_outliers=False, drop_outliers_param=0.01,
                                       collinearity=False, collinearity_param=0.9)

# Screening features:
screening_cont.select_feat(df_train.drop(drop_vars, axis=1))

# Dataframe with variance by feature:
variation_num_feat = screening_cont.var_continuous_feat

# List of features with the highest variance:
selected_feat = screening_cont.selected_feat

# List of features with no variance:
zero_variance_feat = screening_cont.no_var_continuous_feat



In [7]:
# List of features with the highest variance:
selected_feat = screening_cont.selected_feat

# List of features with no variance:
zero_variance_feat = screening_cont.no_var_continuous_feat

print('updated_new_p: ' + str(screening_cont.k) + '.')
print('Number of selected features: ' + str(len(selected_feat)) + '.')
print('Number of features with zero variance: ' + str(len(zero_variance_feat)) + '.')

updated_new_p: 351.
Number of selected features: 351.
Number of features with zero variance: 272.


In [8]:
# Variance by numerical variable:
print('\033[1mShape of variation_num_feat:\033[0m ' + str(variation_num_feat.shape) + '.')
variation_num_feat.head(10)

Shape of variation_num_feat: (1378, 3).


variance,feature,variation,missing
636,feat_637,1281.0,0
591,feat_592,1281.0,0
909,feat_910,1281.0,0
983,feat_984,1281.0,0
1123,feat_1124,1281.0,0
1121,feat_1122,1281.0,0
661,feat_662,1281.0,0
1533,feat_1534,1273.454465,0
1067,feat_1068,1027.765432,0
1065,feat_1066,898.180556,0


<a id='var_thres'></a>

#### Variance thresholding
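Variance thresholding replaces the fixed count `k` with a cut-off on the variance itself. The sketch below assumes that a feature is kept when its variance is strictly greater than the threshold; whether `VarScreeningNumerical` uses a strict or non-strict comparison is not shown in this notebook, and the helper name and toy data are hypothetical.

```python
import pandas as pd

def select_above_threshold(data: pd.DataFrame, threshold: float = 0.0) -> list:
    """Keep the columns whose population variance exceeds the threshold."""
    variances = data.var(ddof=0)
    # Assumption: strict inequality, so a threshold of 0 drops constant columns.
    return variances[variances > threshold].index.tolist()

# Hypothetical toy data.
toy = pd.DataFrame({
    "a": [0.0, 0.0, 4.0, 0.0],  # variance 3.0
    "b": [1.0, 1.0, 1.0, 1.0],  # variance 0.0
    "c": [0.9, 1.0, 1.1, 1.0],  # variance 0.005
})

kept_nonconstant = select_above_threshold(toy, threshold=0.0)
kept_strict = select_above_threshold(toy, threshold=0.01)
print(kept_nonconstant, kept_strict)
```

Unlike top-`k` selection, the number of features kept here depends on the data rather than on a preset count.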

In [9]:
# Declaring the object for screening features:
screening_cont = VarScreeningNumerical(features = cont_vars, na_features = [], stat='variance',
                                       select_k=False, k=new_p,
                                       thresholding=True, variance_threshold=0,
                                       winsorize=False, winsorize_param=0.025,
                                       drop_outliers=False, drop_outliers_param=0.01,
                                       collinearity=False, collinearity_param=0.9)

# Screening features:
screening_cont.select_feat(df_train.drop(drop_vars, axis=1))

# Dataframe with variance by feature:
variation_num_feat = screening_cont.var_continuous_feat

# List of features with the highest variance:
selected_feat = screening_cont.selected_feat

# List of features with no variance:
zero_variance_feat = screening_cont.no_var_continuous_feat

In [10]:
# List of features with the highest variance:
selected_feat = screening_cont.selected_feat

# List of features with no variance:
zero_variance_feat = screening_cont.no_var_continuous_feat

print('updated_new_p: ' + str(screening_cont.k) + '.')
print('Number of selected features: ' + str(len(selected_feat)) + '.')
print('Number of features with zero variance: ' + str(len(zero_variance_feat)) + '.')

updated_new_p: 351.
Number of selected features: 1199.
Number of features with zero variance: 272.


In [11]:
# Variance by numerical variable:
print('\033[1mShape of variation_num_feat:\033[0m ' + str(variation_num_feat.shape) + '.')
variation_num_feat.head(10)

Shape of variation_num_feat: (1378, 3).


variance,feature,variation,missing
636,feat_637,1281.0,0
591,feat_592,1281.0,0
909,feat_910,1281.0,0
983,feat_984,1281.0,0
1123,feat_1124,1281.0,0
1121,feat_1122,1281.0,0
661,feat_662,1281.0,0
1533,feat_1534,1273.454465,0
1067,feat_1068,1027.765432,0
1065,feat_1066,898.180556,0


<a id='winsorize'></a>

#### Winsorize treatment
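Winsorizing limits the influence of extreme values on the variance by clipping them to chosen quantiles. The sketch below assumes symmetric clipping at the `param` and `1 - param` quantiles; how `VarScreeningNumerical` interprets `winsorize_param` is not shown in this notebook, so the helper and the toy data are hypothetical.

```python
import numpy as np

def winsorize(values, param: float = 0.025) -> np.ndarray:
    """Clip values to the [param, 1 - param] quantile range (NaNs ignored)."""
    values = np.asarray(values, dtype=float)
    lower, upper = np.nanquantile(values, [param, 1.0 - param])
    return np.clip(values, lower, upper)

# Hypothetical feature: constant except for a single extreme observation.
x = np.array([1.0] * 99 + [1000.0])

raw_variance = np.nanvar(x)
winsorized_variance = np.nanvar(winsorize(x))
print(raw_variance, winsorized_variance)
```

In this toy case all of the variance comes from one observation, so it vanishes after clipping and the feature would fall to the bottom of a variance ranking.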

In [12]:
# Declaring the object for screening features:
screening_cont = VarScreeningNumerical(features = cont_vars, na_features = [], stat='variance',
                                       select_k=True, k=new_p,
                                       thresholding=False, variance_threshold=0,
                                       winsorize=True, winsorize_param=0.025,
                                       drop_outliers=False, drop_outliers_param=0.01,
                                       collinearity=False, collinearity_param=0.9)

# Screening features:
screening_cont.select_feat(df_train.drop(drop_vars, axis=1))

# Dataframe with variance by feature:
variation_num_feat = screening_cont.var_continuous_feat

# List of features with the highest variance:
selected_feat = screening_cont.selected_feat

# List of features with no variance:
zero_variance_feat = screening_cont.no_var_continuous_feat

In [13]:
# List of features with the highest variance:
selected_feat = screening_cont.selected_feat

# List of features with no variance:
zero_variance_feat = screening_cont.no_var_continuous_feat

print('updated_new_p: ' + str(screening_cont.k) + '.')
print('Number of selected features: ' + str(len(selected_feat)) + '.')
print('Number of features with zero variance: ' + str(len(zero_variance_feat)) + '.')

updated_new_p: 351.
Number of selected features: 351.
Number of features with zero variance: 272.


In [14]:
# Variance by numerical variable:
print('\033[1mShape of variation_num_feat:\033[0m ' + str(variation_num_feat.shape) + '.')
variation_num_feat.head(10)

Shape of variation_num_feat: (1378, 3).


variance,feature,variation,missing
636,feat_637,1281.0,0
591,feat_592,1281.0,0
909,feat_910,1281.0,0
983,feat_984,1281.0,0
1123,feat_1124,1281.0,0
1121,feat_1122,1281.0,0
661,feat_662,1281.0,0
1533,feat_1534,1273.454465,0
1067,feat_1068,1027.765432,0
1065,feat_1066,898.180556,0


<a id='drop_outliers'></a>

#### Dropping outliers
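Dropping outliers is an alternative to clipping: observations outside a quantile range are removed before the variance is computed. The sketch below is a minimal illustration assuming symmetric trimming at the `param` and `1 - param` quantiles; the actual meaning of `drop_outliers_param` in `VarScreeningNumerical` is not shown in this notebook, and the helper and data are hypothetical.

```python
import numpy as np

def variance_without_outliers(values, param: float = 0.01) -> float:
    """Variance of the observations lying inside the [param, 1 - param] quantile range."""
    values = np.asarray(values, dtype=float)
    lower, upper = np.nanquantile(values, [param, 1.0 - param])
    # Comparisons with NaN are False, so missing values are dropped as well.
    kept = values[(values >= lower) & (values <= upper)]
    return float(np.nanvar(kept))

# Hypothetical feature: values spread evenly over [0, 1] plus one extreme observation.
x = np.concatenate([np.linspace(0.0, 1.0, 99), [100.0]])

raw_variance = float(np.nanvar(x))
trimmed_variance = variance_without_outliers(x)
print(raw_variance, trimmed_variance)
```

Because observations are removed rather than clipped, the sample used for the variance can differ from feature to feature.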

In [15]:
# Declaring the object for screening features:
screening_cont = VarScreeningNumerical(features = cont_vars, na_features = [], stat='variance',
                                       select_k=True, k=new_p,
                                       thresholding=False, variance_threshold=0,
                                       winsorize=False, winsorize_param=0.025,
                                       drop_outliers=True, drop_outliers_param=0.01,
                                       collinearity=False, collinearity_param=0.9)

# Screening features:
screening_cont.select_feat(df_train.drop(drop_vars, axis=1))

# Dataframe with variance by feature:
variation_num_feat = screening_cont.var_continuous_feat

# List of features with the highest variance:
selected_feat = screening_cont.selected_feat

# List of features with no variance:
zero_variance_feat = screening_cont.no_var_continuous_feat

In [16]:
# List of features with the highest variance:
selected_feat = screening_cont.selected_feat

# List of features with no variance:
zero_variance_feat = screening_cont.no_var_continuous_feat

print('updated_new_p: ' + str(screening_cont.k) + '.')
print('Number of selected features: ' + str(len(selected_feat)) + '.')
print('Number of features with zero variance: ' + str(len(zero_variance_feat)) + '.')

updated_new_p: 351.
Number of selected features: 351.
Number of features with zero variance: 272.


In [17]:
# Variance by numerical variable:
print('\033[1mShape of variation_num_feat:\033[0m ' + str(variation_num_feat.shape) + '.')
variation_num_feat.head(10)

Shape of variation_num_feat: (1378, 3).


variance,feature,variation,missing
636,feat_637,1281.0,0
591,feat_592,1281.0,0
909,feat_910,1281.0,0
983,feat_984,1281.0,0
1123,feat_1124,1281.0,0
1121,feat_1122,1281.0,0
661,feat_662,1281.0,0
1533,feat_1534,1273.454465,0
1067,feat_1068,1027.765432,0
1065,feat_1066,898.180556,0


<a id='collinearity'></a>

#### Handling collinearity
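The log printed by the next cell suggests a greedy procedure: candidates are visited in order of decreasing variance, each one is regressed on the features selected so far, and it is kept only when the resulting R2 stays below `collinearity_param`. The sketch below illustrates that idea under these assumptions. The helpers `r_squared` and `greedy_collinearity_filter`, the strict inequality, and the toy data are hypothetical and are not taken from `screening_features`.

```python
import numpy as np
import pandas as pd

def r_squared(y: np.ndarray, X: np.ndarray) -> float:
    """R2 of an OLS regression (with intercept) of y on the columns of X."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    # Assumes y is not constant (zero-variance features are removed beforehand).
    total = np.sum((y - y.mean()) ** 2)
    return float(1.0 - np.sum(residuals ** 2) / total)

def greedy_collinearity_filter(data, ranked_features, k, max_r2=0.9):
    """Walk through features ranked by variance; keep those not explained by earlier picks."""
    selected = [ranked_features[0]]  # the top-ranked feature is kept unconditionally
    for candidate in ranked_features[1:]:
        if len(selected) >= k:
            break
        r2 = r_squared(data[candidate].to_numpy(), data[selected].to_numpy())
        if r2 < max_r2:  # assumption: strict inequality
            selected.append(candidate)
    return selected

# Hypothetical toy data.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "f1": base,
    "f2": 2.0 * base,            # exact linear copy of f1
    "f3": rng.normal(size=200),  # generated independently of f1
})

kept = greedy_collinearity_filter(toy, ["f1", "f2", "f3"], k=2)
print(kept)
```

Each candidate requires a new regression on a growing set of selected features, which is why this option is the slowest of the variants shown in this notebook.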

In [18]:
# Declaring the object for screening features:
screening_cont = VarScreeningNumerical(features = cont_vars, na_features = [], stat='variance',
                                       select_k=True, k=new_p,
                                       thresholding=False, variance_threshold=0,
                                       winsorize=False, winsorize_param=0.025,
                                       drop_outliers=False, drop_outliers_param=0.01,
                                       collinearity=True, collinearity_param=0.9)

# Screening features:
screening_cont.select_feat(df_train.drop(drop_vars, axis=1))

# Dataframe with variance by feature:
variation_num_feat = screening_cont.var_continuous_feat

# List of features with the highest variance:
selected_feat = screening_cont.selected_feat

# List of features with no variance:
zero_variance_feat = screening_cont.no_var_continuous_feat

-------------------------------------------------------------------------
Number of selected features: 1
Candidate feature: feat_592
R2 from regression of candidate feature against selected features: 0.0
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 2
Candidate feature: feat_910
R2 from regression of candidate feature against selected features: 1.0
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 2
Candidate feature: feat_984
R2 from regression of candidate feature against selected features: 0.0
Candidate feature selected!
-------------------------------------------------------------------------


------------------------------------------------

R2 from regression of candidate feature against selected features: 0.8189
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 20
Candidate feature: feat_1541
R2 from regression of candidate feature against selected features: 0.0612
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 21
Candidate feature: feat_1622
R2 from regression of candidate feature against selected features: 0.0002
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 22
Candidate feature: feat_1619
R2 from regression of candidate featu

R2 from regression of candidate feature against selected features: 0.837
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 40
Candidate feature: feat_176
R2 from regression of candidate feature against selected features: 0.7808
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 41
Candidate feature: feat_1065
R2 from regression of candidate feature against selected features: 0.9316
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 41
Candidate feature: feat_658
R2 from regression of candidate feat

R2 from regression of candidate feature against selected features: 0.9671
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 57
Candidate feature: feat_184
R2 from regression of candidate feature against selected features: 0.9357
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 57
Candidate feature: feat_985
R2 from regression of candidate feature against selected features: 0.8851
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 58
Candidate feature: feat_188
R2 from regression of candidate 

R2 from regression of candidate feature against selected features: 0.9662
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 68
Candidate feature: feat_1069
R2 from regression of candidate feature against selected features: 0.8896
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 69
Candidate feature: feat_792
R2 from regression of candidate feature against selected features: 0.9854
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 69
Candidate feature: feat_800
R2 from regression of candidate

R2 from regression of candidate feature against selected features: 0.9742
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 75
Candidate feature: feat_297
R2 from regression of candidate feature against selected features: 0.9742
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 75
Candidate feature: feat_367
R2 from regression of candidate feature against selected features: 0.9742
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 75
Candidate feature: feat_1375
R2 from regression of candi

R2 from regression of candidate feature against selected features: 0.945
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 82
Candidate feature: feat_1519
R2 from regression of candidate feature against selected features: 0.945
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 82
Candidate feature: feat_1605
R2 from regression of candidate feature against selected features: 0.9959
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 82
Candidate feature: feat_230
R2 from regression of candid

R2 from regression of candidate feature against selected features: 0.9129
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 89
Candidate feature: feat_1520
R2 from regression of candidate feature against selected features: 0.9312
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 89
Candidate feature: feat_653
R2 from regression of candidate feature against selected features: 0.9617
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 89
Candidate feature: feat_1289
R2 from regression of cand

R2 from regression of candidate feature against selected features: 0.9018
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 101
Candidate feature: feat_1535
R2 from regression of candidate feature against selected features: 0.9823
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 101
Candidate feature: feat_850
R2 from regression of candidate feature against selected features: 0.6564
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 102
Candidate feature: feat_63
R2 from regression of candida

R2 from regression of candidate feature against selected features: 0.9987
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 116
Candidate feature: feat_189
R2 from regression of candidate feature against selected features: 0.9957
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 116
Candidate feature: feat_1378
R2 from regression of candidate feature against selected features: 0.9693
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 116
Candidate feature: feat_171
R2 from regression of ca

R2 from regression of candidate feature against selected features: 0.7094
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 129
Candidate feature: feat_661
R2 from regression of candidate feature against selected features: 0.7358
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 130
Candidate feature: feat_887
R2 from regression of candidate feature against selected features: 0.9738
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 130
Candidate feature: feat_904
R2 from regression of candidate f

R2 from regression of candidate feature against selected features: 0.8237
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 143
Candidate feature: feat_902
R2 from regression of candidate feature against selected features: 0.9426
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 143
Candidate feature: feat_973
R2 from regression of candidate feature against selected features: 0.7842
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 144
Candidate feature: feat_871
R2 from regression of candidate f

R2 from regression of candidate feature against selected features: 0.9512
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 151
Candidate feature: feat_815
R2 from regression of candidate feature against selected features: 0.9377
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 151
Candidate feature: feat_1116
R2 from regression of candidate feature against selected features: 0.7888
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 152
Candidate feature: feat_958
R2 from regression of candid

R2 from regression of candidate feature against selected features: 0.9974
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 157
Candidate feature: feat_811
R2 from regression of candidate feature against selected features: 0.9137
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 157
Candidate feature: feat_214
R2 from regression of candidate feature against selected features: 0.9945
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 157
Candidate feature: feat_1114
R2 from regression of ca

R2 from regression of candidate feature against selected features: 0.921
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 165
Candidate feature: feat_934
R2 from regression of candidate feature against selected features: 0.9031
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 165
Candidate feature: feat_1112
R2 from regression of candidate feature against selected features: 0.9484
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 165
Candidate feature: feat_895
R2 from regression of can

R2 from regression of candidate feature against selected features: 0.9721
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 176
Candidate feature: feat_1084
R2 from regression of candidate feature against selected features: 0.989
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 176
Candidate feature: feat_975
R2 from regression of candidate feature against selected features: 0.8788
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 177
Candidate feature: feat_1107
R2 from regression of candid

R2 from regression of candidate feature against selected features: 0.7323
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 185
Candidate feature: feat_1127
R2 from regression of candidate feature against selected features: 0.9983
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 185
Candidate feature: feat_968
R2 from regression of candidate feature against selected features: 0.9399
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 185
Candidate feature: feat_929
R2 from regression of candid

R2 from regression of candidate feature against selected features: 0.949
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 194
Candidate feature: feat_549
R2 from regression of candidate feature against selected features: 0.93
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 194
Candidate feature: feat_1581
R2 from regression of candidate feature against selected features: 0.9042
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 194
Candidate feature: feat_933
R2 from regression of candi

R2 from regression of candidate feature against selected features: 0.9754
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 204
Candidate feature: feat_670
R2 from regression of candidate feature against selected features: 0.8373
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 205
Candidate feature: feat_1108
R2 from regression of candidate feature against selected features: 0.9863
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 205
Candidate feature: feat_1087
R2 from regression of candi

R2 from regression of candidate feature against selected features: 0.9887
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 215
Candidate feature: feat_598
R2 from regression of candidate feature against selected features: 0.9114
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 215
Candidate feature: feat_209
R2 from regression of candidate feature against selected features: 0.8524
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 216
Candidate feature: feat_541
[...]
R2 from regression of candidate feature against selected features: 0.9862
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 225
Candidate feature: feat_203
R2 from regression of candidate feature against selected features: 0.9729
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 225
Candidate feature: feat_1290
R2 from regression of candidate feature against selected features: 0.6455
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 226
Candidate feature: feat_1557
[...]
R2 from regression of candidate feature against selected features: 0.9286
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 238
Candidate feature: feat_1098
R2 from regression of candidate feature against selected features: 0.9247
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 238
Candidate feature: feat_1529
R2 from regression of candidate feature against selected features: 0.9534
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 238
Candidate feature: feat_612
[...]
R2 from regression of candidate feature against selected features: 0.9346
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 246
Candidate feature: feat_624
R2 from regression of candidate feature against selected features: 0.9292
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 246
Candidate feature: feat_1362
R2 from regression of candidate feature against selected features: 0.9714
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 246
Candidate feature: feat_868
[...]
R2 from regression of candidate feature against selected features: 0.9335
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 250
Candidate feature: feat_1499
R2 from regression of candidate feature against selected features: 0.9697
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 250
Candidate feature: feat_275
R2 from regression of candidate feature against selected features: 0.9695
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 250
Candidate feature: feat_345
[...]
R2 from regression of candidate feature against selected features: 0.4147
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 258
Candidate feature: feat_138
R2 from regression of candidate feature against selected features: 0.8092
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 259
Candidate feature: feat_1189
R2 from regression of candidate feature against selected features: 0.9192
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 259
Candidate feature: feat_1187
[...]
R2 from regression of candidate feature against selected features: 0.9687
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 262
Candidate feature: feat_1186
R2 from regression of candidate feature against selected features: 0.9666
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 262
Candidate feature: feat_835
R2 from regression of candidate feature against selected features: 0.8835
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 263
Candidate feature: feat_420
[...]
R2 from regression of candidate feature against selected features: 0.991
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 269
Candidate feature: feat_814
R2 from regression of candidate feature against selected features: 0.9087
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 269
Candidate feature: feat_942
R2 from regression of candidate feature against selected features: 0.951
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 269
Candidate feature: feat_1350
[...]
R2 from regression of candidate feature against selected features: 0.9335
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 276
Candidate feature: feat_1231
R2 from regression of candidate feature against selected features: 0.9335
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 276
Candidate feature: feat_244
R2 from regression of candidate feature against selected features: 0.9516
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 276
Candidate feature: feat_242
[...]
R2 from regression of candidate feature against selected features: 0.6698
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 280
Candidate feature: feat_623
R2 from regression of candidate feature against selected features: 0.8518
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 281
Candidate feature: feat_737
R2 from regression of candidate feature against selected features: 0.9474
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 281
Candidate feature: feat_1334
[...]
R2 from regression of candidate feature against selected features: 0.9537
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 283
Candidate feature: feat_693
R2 from regression of candidate feature against selected features: 0.9568
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 283
Candidate feature: feat_583
R2 from regression of candidate feature against selected features: 0.5824
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 284
Candidate feature: feat_1460
[...]
R2 from regression of candidate feature against selected features: 0.8734
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 292
Candidate feature: feat_326
R2 from regression of candidate feature against selected features: 0.9544
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 292
Candidate feature: feat_205
R2 from regression of candidate feature against selected features: 0.8638
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 293
Candidate feature: feat_1254
[...]
R2 from regression of candidate feature against selected features: 0.9752
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 306
Candidate feature: feat_1096
R2 from regression of candidate feature against selected features: 0.9219
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 306
Candidate feature: feat_830
R2 from regression of candidate feature against selected features: 0.8565
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 307
Candidate feature: feat_1261
[...]
R2 from regression of candidate feature against selected features: 0.9261
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 316
Candidate feature: feat_1485
R2 from regression of candidate feature against selected features: 0.9261
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 316
Candidate feature: feat_1477
R2 from regression of candidate feature against selected features: 0.9534
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 316
Candidate feature: feat_253
[...]
R2 from regression of candidate feature against selected features: 0.9167
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 324
Candidate feature: feat_1330
R2 from regression of candidate feature against selected features: 0.9712
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 324
Candidate feature: feat_893
R2 from regression of candidate feature against selected features: 0.8786
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 325
Candidate feature: feat_1169
[...]
R2 from regression of candidate feature against selected features: 0.9577
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 333
Candidate feature: feat_76
R2 from regression of candidate feature against selected features: 0.9577
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 333
Candidate feature: feat_93
R2 from regression of candidate feature against selected features: 0.9144
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 333
Candidate feature: feat_260
[...]
R2 from regression of candidate feature against selected features: 1.0
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 337
Candidate feature: feat_1203
R2 from regression of candidate feature against selected features: 0.8953
Candidate feature selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 338
Candidate feature: feat_1168
R2 from regression of candidate feature against selected features: 1.0
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 338
Candidate feature: feat_1201
[...]
R2 from regression of candidate feature against selected features: 0.9156
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 344
Candidate feature: feat_1331
R2 from regression of candidate feature against selected features: 0.9279
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 344
Candidate feature: feat_94
R2 from regression of candidate feature against selected features: 0.9064
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 344
Candidate feature: feat_267
[...]
R2 from regression of candidate feature against selected features: 1.0
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 349
Candidate feature: feat_416
R2 from regression of candidate feature against selected features: 1.0
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 349
Candidate feature: feat_437
R2 from regression of candidate feature against selected features: 1.0
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 349
Candidate feature: feat_412
[...]
R2 from regression of candidate feature against selected features: 1.0
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 349
Candidate feature: feat_760
R2 from regression of candidate feature against selected features: 1.0
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 349
Candidate feature: feat_761
R2 from regression of candidate feature against selected features: 1.0
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 349
Candidate feature: feat_786
[...]
R2 from regression of candidate feature against selected features: 1.0
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 349
Candidate feature: feat_400
R2 from regression of candidate feature against selected features: 1.0
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 349
Candidate feature: feat_405
R2 from regression of candidate feature against selected features: 1.0
Candidate feature not selected!
-------------------------------------------------------------------------


-------------------------------------------------------------------------
Number of selected features: 349
Candidate feature: feat_404
[...]
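
For each candidate, the log above reports the R2 from regressing it on the features already selected, followed by the decision. A minimal sketch of that kind of greedy check is given here. It is not the `VarScreeningNumerical` implementation: the helper names `r2_against_selected` and `screen_by_r2`, the `r2_threshold` parameter and its value of 0.9 are assumptions made for illustration (the logged decisions are consistent with a cutoff near 0.9, but the log does not state it). The sketch also assumes the features contain no missing values.

```python
import numpy as np
import pandas as pd


def r2_against_selected(df, candidate, selected):
    """R2 of a least-squares regression (with intercept) of `candidate` on `selected`."""
    y = df[candidate].to_numpy(dtype=float)
    X = np.column_stack([np.ones(len(df)), df[selected].to_numpy(dtype=float)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    total = np.sum((y - y.mean()) ** 2)
    # A constant candidate carries no information, so treat it as fully explained.
    return 1.0 - np.sum(residuals ** 2) / total if total > 0 else 1.0


def screen_by_r2(df, ranked_features, r2_threshold=0.9):
    """Greedy pass over features already ranked (e.g. by variance); hypothetical helper."""
    selected = []
    for feat in ranked_features:
        # The first feature has nothing to be regressed on, so it is always kept.
        if not selected or r2_against_selected(df, feat, selected) < r2_threshold:
            selected.append(feat)
    return selected


rng = np.random.default_rng(0)
toy = pd.DataFrame({'a': rng.normal(size=200), 'b': rng.normal(size=200)})
toy['c'] = 2.0 * toy['a'] - toy['b']  # exact linear combination of a and b
print(screen_by_r2(toy, ['a', 'b', 'c']))
```

In this toy example `c` is an exact linear combination of `a` and `b`, so its R2 is 1 and it is rejected, much like the candidates logged with an R2 of 1.0.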

In [19]:
# List of features with the highest variance:
selected_feat = screening_cont.selected_feat

# List of features with no variance:
zero_variance_feat = screening_cont.no_var_continuous_feat

print('updated_new_p: ' + str(screening_cont.k) + '.')
print('Number of selected features: ' + str(len(selected_feat)) + '.')
print('Number of features with zero variance: ' + str(len(zero_variance_feat)) + '.')

updated_new_p: 351.
Number of selected features: 351.
Number of features with zero variance: 272.


In [20]:
# Variance by numerical variable:
print('\033[1mShape of variation_num_feat:\033[0m ' + str(variation_num_feat.shape) + '.')
variation_num_feat.head(10)

Shape of variation_num_feat: (1378, 3).


variance,feature,variation,missing
636,feat_637,1281.0,0
591,feat_592,1281.0,0
909,feat_910,1281.0,0
983,feat_984,1281.0,0
1123,feat_1124,1281.0,0
1121,feat_1122,1281.0,0
661,feat_662,1281.0,0
1533,feat_1534,1273.454465,0
1067,feat_1068,1027.765432,0
1065,feat_1066,898.180556,0
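
The table above lists one row per feature with its variance and its count of missing values. Its 1378 rows are consistent with the 1650 candidate features minus the 272 reported as having zero variance. A table of this shape could be built along the following lines. This is a sketch under assumptions: `variance_table` is a hypothetical helper, and the use of `np.nanvar` (population variance, ignoring NaNs) is inferred from the notebook output rather than confirmed by the library's documentation.

```python
import numpy as np
import pandas as pd


def variance_table(data, features):
    """Variance (ignoring NaNs) and missing-value count per feature, sorted descending."""
    table = pd.DataFrame({
        'feature': features,
        'variation': [np.nanvar(data[f].to_numpy(dtype=float)) for f in features],
        'missing': [int(data[f].isna().sum()) for f in features],
    })
    return table.sort_values('variation', ascending=False)


toy = pd.DataFrame({'x1': [1.0, 2.0, 3.0, 4.0],
                    'x2': [5.0, 5.0, 5.0, 5.0],
                    'x3': [0.0, 10.0, np.nan, 20.0]})
table = variance_table(toy, list(toy.columns))

# Features with zero variance carry no information and can be set aside.
zero_variance = table.loc[table['variation'] == 0, 'feature'].tolist()
print(table)
print(zero_variance)
```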


<a id='corr_based_selection'></a>

### Features selection based on correlation

In [21]:
cont_vars = [c for c in df_train.columns if c not in drop_vars]

# Declaring object for screening of features:
screening_cont = CorrScreeningNumerical(features=cont_vars, na_features=[], stat='variance',
                                        corr_threshold=0.8,
                                        winsorize=False, winsorize_param=0.025,
                                        drop_outliers=False, drop_outliers_param=0.01)

# Screening features:
screening_cont.select_feat(df_train.drop(drop_vars, axis=1))

# Dataframe with variance by feature:
variation_num_feat = screening_cont.var_continuous_feat

print('Number of selected features: ' + str(len(screening_cont.selected_feat)) + '.')
print('\033[1mShape of variation_num_feat:\033[0m ' + str(variation_num_feat.shape) + '.')
variation_num_feat.head(10)


Number of selected features: 516.
Shape of variation_num_feat: (1378, 3).


variance,feature,variation,missing
636,feat_637,1281.0,0
591,feat_592,1281.0,0
909,feat_910,1281.0,0
983,feat_984,1281.0,0
1123,feat_1124,1281.0,0
1121,feat_1122,1281.0,0
661,feat_662,1281.0,0
1533,feat_1534,1273.454465,0
1067,feat_1068,1027.765432,0
1065,feat_1066,898.180556,0
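
The cell above passes `corr_threshold=0.8` and `stat='variance'` to `CorrScreeningNumerical`. One plausible reading, sketched below, is a greedy pass over features ranked by variance that keeps a feature only if its absolute pairwise correlation with every feature already kept stays below the threshold. This is an assumption about the behavior, not the library's code: `screen_by_correlation` is a hypothetical helper, and the choice of Pearson correlation is also assumed.

```python
import numpy as np
import pandas as pd


def screen_by_correlation(df, ranked_features, corr_threshold=0.8):
    """Keep a feature only if |corr| with every kept feature is below the threshold."""
    corr = df[ranked_features].corr().abs()  # Pearson correlation by default
    selected = []
    for feat in ranked_features:
        if all(corr.loc[feat, kept] < corr_threshold for kept in selected):
            selected.append(feat)
    return selected


rng = np.random.default_rng(1)
base = rng.normal(size=300)
toy = pd.DataFrame({
    'u': 3.0 * base,                           # highest variance
    'v': base + 0.1 * rng.normal(size=300),    # nearly collinear with u
    'w': rng.normal(size=300),                 # independent of u
})

# Rank features by variance, highest first, before screening.
ranked = toy.var().sort_values(ascending=False).index.tolist()
print(screen_by_correlation(toy, ranked))
```

Unlike the regression-based check, this rule only looks at pairs of features, so a candidate that is a combination of several kept features can still pass.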
