## Feature Engineering

This section covers libraries for feature engineering.

### Drop Correlated Features

In [None]:
!pip install feature_engine 

To remove correlated variables from a DataFrame, use `feature_engine.selection.DropCorrelatedFeatures`.

In [22]:
import pandas as pd
from sklearn.datasets import make_classification
from feature_engine.selection import DropCorrelatedFeatures

# Make a dataset in which some features are redundant (correlated with others)
X, y = make_classification(
    n_samples=1000,
    n_features=6,
    n_redundant=3,
    n_clusters_per_class=1,
    class_sep=2,
    random_state=0,
)

# Transform the feature array into a pandas DataFrame
colnames = ["var_" + str(i) for i in range(6)]
X = pd.DataFrame(X, columns=colnames)

In [23]:
X.columns

Index(['var_0', 'var_1', 'var_2', 'var_3', 'var_4', 'var_5'], dtype='object')

In [28]:
X[["var_0", "var_1", "var_2"]].corr()

          var_0     var_1     var_2
var_0  1.000000  0.938936  0.874845
var_1  0.938936  1.000000  0.654745
var_2  0.874845  0.654745  1.000000
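
With more columns, reading the full correlation matrix by eye gets tedious. A minimal sketch of one way to list only the pairs above a threshold follows. The helper `correlated_pairs` and the toy DataFrame `demo` are hypothetical names made up for illustration, and the sketch assumes you want to compare absolute correlations.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.8):
    """Return the feature pairs whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once, without the diagonal
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Any remaining NaN entries fail this comparison and are filtered out
    return pairs[pairs > threshold]


# Toy data: "b" is a noisy copy of "a", "c" is independent noise
rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame(
    {
        "a": base,
        "b": base + rng.normal(scale=0.1, size=500),
        "c": rng.normal(size=500),
    }
)

pairs = correlated_pairs(demo, threshold=0.8)
print(pairs)
```

The result is a Series indexed by feature pair, so it can be sorted or filtered like any other pandas object.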


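Drop features whose pairwise Pearson correlation exceeds 0.8. The transformer groups correlated features and keeps one feature per group. Here `var_1` and `var_2` land in the same group because each correlates with `var_0` above the threshold, even though their correlation with each other is only about 0.65.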

In [25]:
tr = DropCorrelatedFeatures(variables=None, method="pearson", threshold=0.8)

Xt = tr.fit_transform(X)

tr.correlated_feature_sets_

[{'var_0', 'var_1', 'var_2'}]

In [26]:
Xt.columns

Index(['var_0', 'var_3', 'var_4', 'var_5'], dtype='object')
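
To make the idea concrete, here is a rough pandas-only sketch of a greedy correlation filter. It is not feature_engine's implementation: the helper `drop_correlated` and the toy DataFrame `demo` are hypothetical, and the sketch assumes absolute correlations are compared and that the first column of each correlated group is the one kept.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.8, method="pearson"):
    """Greedy filter: keep the first column seen, drop later ones correlated with it."""
    corr = df.corr(method=method).abs()
    to_drop = set()
    for i, col in enumerate(corr.columns):
        if col in to_drop:
            continue  # a dropped column does not get to eliminate others
        for other in corr.columns[i + 1:]:
            if other not in to_drop and corr.loc[col, other] > threshold:
                to_drop.add(other)
    dropped = sorted(to_drop)
    return df.drop(columns=dropped), dropped


# Toy data: "b" is a noisy copy of "a", "c" is independent noise
rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame(
    {
        "a": base,
        "b": base + rng.normal(scale=0.1, size=500),
        "c": rng.normal(size=500),
    }
)

reduced, dropped = drop_correlated(demo, threshold=0.8)
print(dropped)
print(list(reduced.columns))
```

A filter like this depends on column order, so reordering the columns can change which feature from a correlated group survives.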
