## Duplicated features with Feature-engine

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from feature_engine.selection import DropDuplicateFeatures, DropConstantFeatures

**Load dataset**

In [3]:
data = pd.read_csv('../dataset_1.csv')
data.shape

(50000, 301)

**Separate the dataset into train and test sets**

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), # drop the target
    data['target'], # just the target
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

((35000, 300), (15000, 300))

## Remove constant and quasi-constant features

In [6]:
sel = DropConstantFeatures(tol=0.998, variables=None, missing_values='raise')
sel.fit(X_train)

DropConstantFeatures(tol=0.998)
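
To make the `tol` threshold concrete, here is a minimal pandas-only sketch of the idea on a toy frame: a feature is flagged when its most frequent value covers at least `tol` of the rows. The helper `quasi_constant_columns` and the toy column names are hypothetical, introduced only for illustration; this is not Feature-engine's implementation.

```python
import pandas as pd


def quasi_constant_columns(df: pd.DataFrame, tol: float) -> list:
    """Return the columns whose most frequent value covers at least `tol` of the rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts by frequency, so the first entry is the predominant value's share
        top_share = df[col].value_counts(normalize=True).iloc[0]
        if top_share >= tol:
            flagged.append(col)
    return flagged


# Toy data (hypothetical): one constant, one quasi-constant, one varied feature
toy = pd.DataFrame({
    "constant": [1] * 10,          # a single value in every row
    "quasi": [0] * 9 + [1],        # the same value in 90% of the rows
    "varied": list(range(10)),     # every row different
})

flagged = quasi_constant_columns(toy, tol=0.8)
print(flagged)
```

With `tol=0.998`, as in the cell above, a feature would be flagged only if a single value fills at least 99.8% of the rows.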

In [7]:
X_train = sel.transform(X_train)  # drop the constant and quasi-constant features
X_test = sel.transform(X_test)
X_train.shape, X_test.shape

((35000, 158), (15000, 158))

## Remove duplicated features

**Set up the selector and find the duplicated features**

In [8]:
sel = DropDuplicateFeatures(variables=None, missing_values='raise')
sel.fit(X_train)

DropDuplicateFeatures(missing_values='raise')

**The sets of duplicated features (each set is a pair in this dataset)**

In [9]:
sel.duplicated_feature_sets_

[{'var_148', 'var_37'},
 {'var_199', 'var_84'},
 {'var_143', 'var_296'},
 {'var_177', 'var_250'},
 {'var_226', 'var_232'},
 {'var_229', 'var_269'}]
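
As a cross-check on what "duplicated" means here, the sketch below finds identical columns with plain pandas on a small toy frame. The column names are hypothetical, and transposing the frame is an assumption that only suits small data; it is not how Feature-engine searches for duplicates.

```python
import pandas as pd

# Toy data (hypothetical): "a_copy" repeats "a" value for value
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [0, 1, 0, 1],
    "a_copy": [1, 2, 3, 4],
    "c": [4, 3, 2, 1],
})

# Transpose so that each feature becomes a row, then flag rows that repeat an earlier row
is_duplicate = toy.T.duplicated(keep="first")
duplicate_columns = is_duplicate[is_duplicate].index.tolist()
print(duplicate_columns)

# A single suspected pair can also be compared directly
print(toy["a"].equals(toy["a_copy"]))
```

The same direct comparison could be applied to any pair reported above, for example `X_train['var_37'].equals(X_train['var_148'])`.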

**The features that will be dropped**

In [11]:
sel.features_to_drop_

{'var_148', 'var_199', 'var_232', 'var_250', 'var_269', 'var_296'}
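
The output above is consistent with keeping, from each duplicated set, the feature that appears first in the dataframe and dropping the rest. The sketch below reproduces that rule in plain Python. It assumes the columns are ordered `var_1` to `var_300`, which is an assumption about this dataset and is not shown in the notebook.

```python
# Assumed column order (hypothetical): var_1, var_2, ..., var_300
columns = [f"var_{i}" for i in range(1, 301)]
position = {name: idx for idx, name in enumerate(columns)}

# The duplicated sets reported by the selector above
duplicated_sets = [
    {"var_148", "var_37"},
    {"var_199", "var_84"},
    {"var_143", "var_296"},
    {"var_177", "var_250"},
    {"var_226", "var_232"},
    {"var_229", "var_269"},
]

to_drop = set()
for group in duplicated_sets:
    # keep the feature that comes first in the column order, drop the others
    keep = min(group, key=lambda name: position[name])
    to_drop |= group - {keep}

print(sorted(to_drop))
```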

**Count the features that will be dropped**

In [13]:
len(sel.features_to_drop_)

6

**Remove the duplicated features!**

In [10]:
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)
X_train.shape, X_test.shape

((35000, 152), (15000, 152))

## Stack feature selection steps in a Pipeline

We can perform both steps together by setting up the transformers within a pipeline. We first split the data again, so that the pipeline starts from the original 300 features.

In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), # drop the target
    data['target'], # just the target
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape  # back to the full set of 300 features

((35000, 300), (15000, 300))

In [18]:
pipe = Pipeline([
    ('constant', DropConstantFeatures(tol=0.998)),
    ('duplicated', DropDuplicateFeatures()),
])
pipe.fit(X_train)

Pipeline(steps=[('constant', DropConstantFeatures(tol=0.998)),
                ('duplicated', DropDuplicateFeatures())])

**Remove the constant, quasi-constant and duplicated features**

In [19]:
X_train = pipe.transform(X_train)
X_test = pipe.transform(X_test)
X_train.shape, X_test.shape

((35000, 152), (15000, 152))

**Inspect the fitted transformers inside the pipeline**

In [20]:
len(pipe.named_steps['constant'].features_to_drop_)

142

In [21]:
pipe.named_steps['duplicated'].features_to_drop_

{'var_148', 'var_199', 'var_232', 'var_250', 'var_269', 'var_296'}