# Correlation 

**Correlation** 
* A measure of the statistical relationship between two or more variables
* Correlated predictor variables provide redundant information
* Ideally, features are correlated with the target but not with each other


**Correlation and ML** 

* Correlated features do not necessarily hurt model accuracy per se
* High **dimensionality** does
* If two features are highly correlated, the second adds little information, so removing it reduces dimensionality at little cost
* Correlation affects model **interpretability**: in linear models, the coefficients of correlated features become unstable and hard to attribute to either feature
* Different classifiers show different sensitivity to correlation
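
The interpretability point can be illustrated with a minimal sketch on synthetic data (the data, noise levels and variable names below are assumptions made for illustration, not part of the dataset used later). Two nearly identical features are fed to an ordinary least-squares fit: the individual coefficients become poorly determined, while their sum stays close to the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# x2 is almost a copy of x1, so the two features are highly correlated
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)

# the target depends on x1 only, with a true coefficient of 3
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

# least-squares fit using both correlated features (no intercept, data is centred around 0)
coef_both, *_ = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)

# least-squares fit using x1 alone
coef_single, *_ = np.linalg.lstsq(x1.reshape(-1, 1), y, rcond=None)

print("correlation(x1, x2):", np.corrcoef(x1, x2)[0, 1])
print("coefficients with both features:", coef_both)
print("sum of the two coefficients:", coef_both.sum())
print("coefficient with x1 only:", coef_single)
```

The predictions of both fits are almost the same, which is why accuracy is not necessarily affected, but only the single-feature fit gives a coefficient that is easy to read.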


**Types** 

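* Pearson's correlation coefficient (linear relationship)
* Spearman's rank correlation coefficient (monotonic relationship, computed on ranks)
* Kendall rank correlation coefficient (ordinal association, based on concordant and discordant pairs)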
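
As a rough illustration of how the three coefficients differ, the sketch below compares them on a toy relationship that is monotonic but not linear. The toy data is an assumption for illustration; `DataFrame.corr` accepts `'pearson'`, `'spearman'` and `'kendall'`, and the Kendall option relies on SciPy being installed.

```python
import numpy as np
import pandas as pd

# y grows monotonically with x, but the relationship is far from linear
x = np.linspace(0, 5, 50)
df = pd.DataFrame({"x": x, "y": np.exp(x)})

results = {
    method: df.corr(method=method).loc["x", "y"]
    for method in ("pearson", "spearman", "kendall")
}

for method, value in results.items():
    print(f"{method:>8}: {value:.3f}")
```

The two rank-based coefficients only look at the ordering of the values, so they report a perfect association here, while Pearson's coefficient is lower because the points do not lie on a straight line.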


**Pearson's correlation** 

$$
r_{xy}=\frac{S_{xy}}{S_xS_y}=\frac{\sum_i{(x_i-\bar{x})(y_i-\bar{y})}}{\sqrt{\sum_i{(x_i-\bar{x})^2}\sum_i{(y_i-\bar{y})^2}}}
$$
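
The formula can be checked by hand with a small sketch in plain Python. The function name `pearson_r` and the sample values are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation, computed term by term as in the formula above."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # numerator: sum of the products of the deviations from the means
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # denominator: square root of the product of the sums of squared deviations
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(round(r, 4))  # 0.7746, i.e. 6 / sqrt(10 * 6)
```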
 
 
**Correlated variable removal methods**
* Method 1: Brute-force method
  * Scan the features in the order they appear. If later features are correlated with the current one, remove them and keep the current one
  * Pro: Fast
  * Con: The removed feature may be more important than the one kept, simply because it appears later
* Method 2: Select the best feature from each correlated group
  * Steps
    * Identify groups of correlated features
    * Select the most predictive feature from each group
      * Build a small machine learning model using the features in the group
      * Or use other criteria, e.g. variance or number of missing values
    * Discard the rest
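
The brute-force scan of the first method could be sketched by hand roughly as follows. This is a simplified illustration on synthetic data, not the implementation used by any library; the helper `brute_force_drop` and the toy column names are hypothetical.

```python
import numpy as np
import pandas as pd


def brute_force_drop(df, threshold=0.8):
    """Return the columns to drop, keeping the first feature of each correlated set."""
    corr = df.corr(method="pearson").abs()
    to_drop = []
    for i, col in enumerate(corr.columns):
        if col in to_drop:
            continue  # already removed because it correlates with an earlier feature
        # compare the current feature with every feature that appears after it
        for other in corr.columns[i + 1:]:
            if other not in to_drop and corr.loc[col, other] > threshold:
                to_drop.append(other)
    return to_drop


rng = np.random.default_rng(0)
a = rng.normal(size=1000)
c = rng.normal(size=1000)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.1, size=1000),      # highly correlated with a
    "c": c,
    "c_copy": 2 * c + rng.normal(scale=0.1, size=1000),  # highly correlated with c
})

dropped = brute_force_drop(toy)
print(dropped)  # ['a_copy', 'c_copy']
```

The result depends only on column order: if `a_copy` came before `a` in the dataframe, `a` would be the one dropped, which is the drawback listed above.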

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from feature_engine.selection import DropCorrelatedFeatures, SmartCorrelatedSelection

In [2]:
data = pd.read_csv('../datasets/dataset_2.csv')

In [3]:
data.head()

Unnamed: 0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,var_10,...,var_100,var_101,var_102,var_103,var_104,var_105,var_106,var_107,var_108,var_109
0,4.53271,3.280834,17.982476,4.404259,2.34991,0.603264,2.784655,0.323146,12.009691,0.139346,...,2.079066,6.748819,2.941445,18.360496,17.726613,7.774031,1.473441,1.973832,0.976806,2.541417
1,5.821374,12.098722,13.309151,4.125599,1.045386,1.832035,1.833494,0.70909,8.652883,0.102757,...,2.479789,7.79529,3.55789,17.383378,15.193423,8.263673,1.878108,0.567939,1.018818,1.416433
2,1.938776,7.952752,0.972671,3.459267,1.935782,0.621463,2.338139,0.344948,9.93785,11.691283,...,1.861487,6.130886,3.401064,15.850471,14.620599,6.849776,1.09821,1.959183,1.575493,1.857893
3,6.02069,9.900544,17.869637,4.366715,1.973693,2.026012,2.853025,0.674847,11.816859,0.011151,...,1.340944,7.240058,2.417235,15.194609,13.553772,7.229971,0.835158,2.234482,0.94617,2.700606
4,3.909506,10.576516,0.934191,3.419572,1.871438,3.340811,1.868282,0.439865,13.58562,1.153366,...,2.738095,6.565509,4.341414,15.893832,11.929787,6.954033,1.853364,0.511027,2.599562,0.811364


In [4]:
data.shape

(50000, 109)

In [5]:
X_train, X_test, y_train, y_test = train_test_split(data.drop('target', axis=1),
                                                    data['target'],
                                                    test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((35000, 108), (15000, 108))

# Remove correlated features

## Brute force approach

In [6]:
sel = DropCorrelatedFeatures(
    threshold=0.8,
    method='pearson',
    missing_values='ignore')

sel.fit(X_train)

DropCorrelatedFeatures(variables=['var_1', 'var_2', 'var_3', 'var_4', 'var_5',
                                  'var_6', 'var_7', 'var_8', 'var_9', 'var_10',
                                  'var_11', 'var_12', 'var_13', 'var_14',
                                  'var_15', 'var_16', 'var_17', 'var_18',
                                  'var_19', 'var_20', 'var_21', 'var_22',
                                  'var_23', 'var_24', 'var_25', 'var_26',
                                  'var_27', 'var_28', 'var_29', 'var_30', ...])

In [7]:
sel.correlated_feature_sets_

[{'var_3', 'var_80'},
 {'var_28', 'var_5', 'var_75'},
 {'var_11', 'var_33'},
 {'var_13', 'var_17'},
 {'var_15', 'var_57'},
 {'var_18', 'var_43'},
 {'var_19', 'var_29'},
 {'var_21', 'var_70', 'var_88'},
 {'var_22', 'var_24', 'var_32', 'var_39', 'var_42', 'var_76'},
 {'var_102', 'var_23'},
 {'var_26', 'var_59'},
 {'var_108', 'var_30'},
 {'var_35', 'var_87'},
 {'var_101', 'var_105', 'var_40', 'var_74', 'var_85'},
 {'var_46', 'var_94'},
 {'var_50', 'var_72'},
 {'var_52', 'var_66'},
 {'var_109', 'var_56'},
 {'var_104', 'var_60'},
 {'var_63', 'var_64', 'var_84', 'var_97'},
 {'var_106', 'var_77'},
 {'var_90', 'var_95'},
 {'var_100', 'var_98'}]

In [8]:
# only the first feature of each group (the one that appears first in the dataframe)
# is kept; the rest are dropped
len(sel.features_to_drop_)

34

In [9]:
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

In [10]:
X_train.shape, X_test.shape

((35000, 74), (15000, 74))

In [11]:
X_train.columns

Index(['var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'var_6', 'var_7', 'var_8',
       'var_9', 'var_10', 'var_11', 'var_12', 'var_13', 'var_14', 'var_15',
       'var_16', 'var_18', 'var_19', 'var_20', 'var_21', 'var_22', 'var_23',
       'var_25', 'var_26', 'var_27', 'var_30', 'var_31', 'var_34', 'var_35',
       'var_36', 'var_37', 'var_38', 'var_40', 'var_41', 'var_44', 'var_45',
       'var_46', 'var_47', 'var_48', 'var_49', 'var_50', 'var_51', 'var_52',
       'var_53', 'var_54', 'var_55', 'var_56', 'var_58', 'var_60', 'var_62',
       'var_63', 'var_65', 'var_67', 'var_68', 'var_69', 'var_71', 'var_73',
       'var_77', 'var_78', 'var_79', 'var_81', 'var_82', 'var_83', 'var_86',
       'var_89', 'var_90', 'var_91', 'var_92', 'var_93', 'var_96', 'var_98',
       'var_99', 'var_103', 'var_107'],
      dtype='object')
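
After dropping features, a quick sanity check is to confirm that no remaining pair exceeds the threshold. The sketch below shows one way to do this on a toy dataframe; the helper `max_offdiag_corr` and the toy columns are hypothetical, and the same function could be applied to the transformed `X_train`.

```python
import numpy as np
import pandas as pd


def max_offdiag_corr(df):
    """Largest absolute Pearson correlation between two different columns."""
    corr = df.corr().abs().to_numpy().copy()
    np.fill_diagonal(corr, 0.0)  # ignore the correlation of each feature with itself
    return corr.max()


rng = np.random.default_rng(0)
a = rng.normal(size=1000)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=1000),  # highly correlated with a
    "c": rng.normal(size=1000),
})

before = max_offdiag_corr(toy)
after = max_offdiag_corr(toy.drop(columns=["b"]))
print(f"before: {before:.3f}, after: {after:.3f}")
```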

# SmartCorrelatedSelection

## Model Performance
Select one feature from each correlated group based on the performance of a random forest.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(data.drop('target', axis=1),
                                                    data['target'],
                                                    test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((35000, 108), (15000, 108))

In [13]:
rf = RandomForestClassifier(
    n_estimators=10,
    random_state=20, 
    n_jobs=4
)

# correlation selector 
sel = SmartCorrelatedSelection(
    variables=None, # if none, selector examines all numerical variables
    method='pearson',
    threshold=0.8,
    missing_values='raise',
    selection_method='model_performance',
    estimator=rf,
    scoring='roc_auc',
    cv=3
)

# find the best-performing feature in each correlated group:
# a model is trained on one feature at a time, the best-performing feature
# of the group is kept, and the selector moves on to the next group
sel.fit(X_train, y_train)

SmartCorrelatedSelection(estimator=RandomForestClassifier(n_estimators=10,
                                                          n_jobs=4,
                                                          random_state=20),
                         missing_values='raise',
                         selection_method='model_performance',
                         variables=['var_1', 'var_2', 'var_3', 'var_4', 'var_5',
                                    'var_6', 'var_7', 'var_8', 'var_9',
                                    'var_10', 'var_11', 'var_12', 'var_13',
                                    'var_14', 'var_15', 'var_16', 'var_17',
                                    'var_18', 'var_19', 'var_20', 'var_21',
                                    'var_22', 'var_23', 'var_24', 'var_25',
                                    'var_26', 'var_27', 'var_28', 'var_29',
                                    'var_30', ...])

In [14]:
sel.correlated_feature_sets_

[{'var_3', 'var_80'},
 {'var_28', 'var_5', 'var_75'},
 {'var_11', 'var_33'},
 {'var_13', 'var_17'},
 {'var_15', 'var_57'},
 {'var_18', 'var_43'},
 {'var_19', 'var_29'},
 {'var_21', 'var_70', 'var_88'},
 {'var_22', 'var_24', 'var_32', 'var_39', 'var_42', 'var_76'},
 {'var_102', 'var_23'},
 {'var_26', 'var_59'},
 {'var_108', 'var_30'},
 {'var_35', 'var_87'},
 {'var_101', 'var_105', 'var_40', 'var_74', 'var_85'},
 {'var_46', 'var_94'},
 {'var_50', 'var_72'},
 {'var_52', 'var_66'},
 {'var_109', 'var_56'},
 {'var_104', 'var_60'},
 {'var_63', 'var_64', 'var_84', 'var_97'},
 {'var_106', 'var_77'},
 {'var_90', 'var_95'},
 {'var_100', 'var_98'}]
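
To make the idea of selecting by model performance more concrete, the sketch below scores each feature of one correlated group on its own and keeps the best one. It is an illustration on synthetic data under stated assumptions, not feature_engine's code: the toy features, the group and `max_depth=3` are choices made here to keep the example small and stable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)

# two correlated views of the same signal, one much noisier than the other
X_toy = pd.DataFrame({
    "f_clean": signal + rng.normal(scale=0.1, size=n),
    "f_noisy": signal + rng.normal(scale=0.5, size=n),
})
y_toy = (signal > 0).astype(int)

group = ["f_clean", "f_noisy"]  # assumed to be one correlated group

# train one small model per feature and record its cross-validated ROC-AUC
scores = {}
for feature in group:
    model = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=20)
    scores[feature] = cross_val_score(
        model, X_toy[[feature]], y_toy, scoring="roc_auc", cv=3
    ).mean()

best = max(scores, key=scores.get)
to_drop = [f for f in group if f != best]
print(scores)
print("keep:", best, "drop:", to_drop)
```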

## Variance
Alternatively, we can select the feature with the highest variance from each correlated group instead of fitting a model for each feature.

In [15]:
# correlation selector 
sel = SmartCorrelatedSelection(
    variables=None, # if none, selector examines all numerical variables
    method='pearson',
    threshold=0.8,
    missing_values='raise',
    selection_method='variance',  # keep the feature with the highest variance in each group
    estimator=None,  # no estimator is needed when selecting by variance
    scoring='roc_auc',
    cv=3
)

sel.fit(X_train, y_train)

SmartCorrelatedSelection(missing_values='raise', selection_method='variance',
                         variables=['var_1', 'var_2', 'var_3', 'var_4', 'var_5',
                                    'var_6', 'var_7', 'var_8', 'var_9',
                                    'var_10', 'var_11', 'var_12', 'var_13',
                                    'var_14', 'var_15', 'var_16', 'var_17',
                                    'var_18', 'var_19', 'var_20', 'var_21',
                                    'var_22', 'var_23', 'var_24', 'var_25',
                                    'var_26', 'var_27', 'var_28', 'var_29',
                                    'var_30', ...])

In [16]:
sel.correlated_feature_sets_

[{'var_3', 'var_80'},
 {'var_28', 'var_5', 'var_75'},
 {'var_11', 'var_33'},
 {'var_13', 'var_17'},
 {'var_15', 'var_57'},
 {'var_18', 'var_43'},
 {'var_19', 'var_29'},
 {'var_21', 'var_70', 'var_88'},
 {'var_22', 'var_24', 'var_32', 'var_39', 'var_42', 'var_76'},
 {'var_102', 'var_23'},
 {'var_26', 'var_59'},
 {'var_108', 'var_30'},
 {'var_35', 'var_87'},
 {'var_101', 'var_105', 'var_40', 'var_74', 'var_85'},
 {'var_46', 'var_94'},
 {'var_50', 'var_72'},
 {'var_52', 'var_66'},
 {'var_109', 'var_56'},
 {'var_104', 'var_60'},
 {'var_63', 'var_64', 'var_84', 'var_97'},
 {'var_106', 'var_77'},
 {'var_90', 'var_95'},
 {'var_100', 'var_98'}]

In [17]:
group = sel.correlated_feature_sets_[1]

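# recent pandas versions do not accept a set as an indexer, so convert it to a list
X_train[list(group)].std()  # var_75 has the highest spread, so it is selected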

var_5     0.875302
var_28    1.024728
var_75    3.539938
dtype: float64
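
The variance criterion can also be sketched by hand. The helper below is a simplified, hypothetical illustration on toy data (it is not how the library is implemented): it builds groups of correlated features and keeps the member with the highest variance from each group. Variance depends on the scale of each feature, so this criterion is most meaningful when the features are on comparable scales.

```python
import numpy as np
import pandas as pd


def drop_by_variance(df, threshold=0.8):
    """Return the columns to drop, keeping the highest-variance feature per group."""
    corr = df.corr().abs()
    grouped, to_drop = set(), []
    for col in corr.columns:
        if col in grouped:
            continue
        # the current feature plus every ungrouped feature correlated with it
        group = [col] + [
            other for other in corr.columns
            if other != col and other not in grouped and corr.loc[col, other] > threshold
        ]
        if len(group) > 1:
            grouped.update(group)
            keep = df[group].var().idxmax()
            to_drop.extend(f for f in group if f != keep)
    return to_drop


rng = np.random.default_rng(0)
a = rng.normal(size=1000)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 3 * a + rng.normal(scale=0.1, size=1000),  # correlated, larger variance
    "c": rng.normal(size=1000),
})

to_drop = drop_by_variance(toy)
print(to_drop)  # ['a']
```

Unlike the brute-force scan, the feature that appears first (`a`) is the one dropped here, because the later feature has the higher variance.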