## Duplicated features

Datasets often contain two or more features that show the same values across all observations, which means that those features are, in essence, identical. In addition, it is not unusual to introduce duplicated features when **one hot encoding** categorical variables, particularly when several of them have high cardinality.
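To illustrate how one hot encoding can introduce duplicated features, here is a minimal sketch on made-up data. The dataframe `toy` and its columns `city` and `postcode_area` are hypothetical and are not part of the Santander dataset. They are chosen so that the two variables always change together.

```python
import pandas as pd

# Hypothetical toy data: the two categorical variables are perfectly aligned,
# so each city always appears with the same postcode area.
toy = pd.DataFrame({
    "city": ["London", "Paris", "London", "Madrid"],
    "postcode_area": ["LDN", "PAR", "LDN", "MAD"],
})

# one hot encode both variables; each category becomes a 0/1 column
encoded = pd.get_dummies(toy, dtype=int)

# transpose so that columns become rows, then flag rows that repeat
# an earlier row (that is, columns identical to an earlier column)
dup_mask = encoded.T.duplicated()
dup_cols = encoded.columns[dup_mask.to_numpy()].tolist()

print(dup_cols)
```

In this sketch every dummy column derived from `postcode_area` repeats a dummy column derived from `city`, so all three are flagged as duplicated.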

Identifying and removing duplicated, and therefore redundant, features is an easy first step towards feature selection and more easily interpretable machine learning models.

Here I will demonstrate how to identify duplicated features using the Santander Customer Satisfaction dataset from Kaggle. 

pandas has no dedicated function to find duplicated columns, so we need to write the code ourselves.

**Note**
Finding duplicated features is a computationally costly operation in Python, and the computer may run out of memory. Therefore, depending on the size of your dataset, you might not always be able to perform it.

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

## Removing duplicate features

In [None]:
# load the Santander customer satisfaction dataset from Kaggle
# I load just a few rows for the demonstration
data = pd.read_csv('santander.csv', nrows=15000)
data.shape

(15000, 371)

In [None]:
# check the presence of null data.
# The snippets below will be able to compare nan values between 2 columns,
# so in principle missing data are not a problem.
# in any case, we see that there are no missing data in this dataset

[col for col in data.columns if data[col].isnull().sum() > 0]

[]

### Important

In all feature selection procedures, it is good practice to select the features by examining only the training set. This helps avoid overfitting.

In [None]:
# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((10500, 370), (4500, 370))

Pandas has the method 'duplicated', which evaluates whether the dataframe contains duplicated rows. We can use this method to check for duplicated columns if we transpose the dataframe first. By transposing the dataframe, we obtain a new dataframe where the columns are now rows, and with the 'duplicated' method we can go ahead and identify those that are duplicated.

Once we identify them, we can remove the duplicated rows. See below.

### Code Snippet for small datasets

Using pandas transpose is computationally expensive, so the computer may run out of memory. That is why we can only use this code block on small datasets. How small will depend on your computer specifications.

In [None]:
# transpose the dataframe, so that the columns are the rows of the new dataframe
data_t = X_train.T
data_t.head()

Unnamed: 0,10439,9236,818,11504,11722,5276,6863,13463,10228,11462,...,4373,7891,9225,14019,4859,13123,3264,9845,10799,2732
ID,20941.0,18583.0,1623.0,23060.0,23512.0,10564.0,13779.0,26969.0,20502.0,22981.0,...,8783.0,15901.0,18564.0,28142.0,9723.0,26306.0,6557.0,19796.0,21653.0,5441.0
var3,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
var15,23.0,39.0,22.0,23.0,37.0,23.0,27.0,43.0,23.0,27.0,...,23.0,24.0,33.0,45.0,24.0,37.0,24.0,38.0,28.0,23.0
imp_ent_var16_ult1,0.0,0.0,150.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
imp_op_var39_comer_ult1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# check if there are duplicated rows (the columns of the original dataframe)
# this is a computationally expensive operation, so it might take a while
# sum indicates how many rows are duplicated

data_t.duplicated().sum()

105

We can see that 105 columns / variables are duplicated. This means that 105 variables are identical to at least one other variable in the dataset.
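The count alone does not tell us which variable each duplicate is a copy of. A minimal sketch of one way to recover that mapping follows. The helper `duplicate_groups` and the dataframe `toy` are hypothetical names introduced for illustration, and the sketch assumes there are no missing values (as is the case in this dataset), because NaN values would not match each other when used as dictionary keys.

```python
import pandas as pd

def duplicate_groups(df):
    """Map each first-seen column to the later columns identical to it.

    Hypothetical helper; assumes df contains no missing values.
    """
    df_t = df.T
    # keep=False flags every member of a duplicated set, including the first
    members = df_t[df_t.duplicated(keep=False)]
    groups = {}
    for name, row in members.iterrows():
        # use the column's values as a key, so identical columns share a key
        key = tuple(row.tolist())
        groups.setdefault(key, []).append(name)
    # the first column of each group is the one that keep='first' would retain
    return {cols[0]: cols[1:] for cols in groups.values()}

# made-up data: 'b' and 'd' copy 'a', and 'e' copies 'c'
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1, 2, 3],
    "c": [0, 0, 1],
    "d": [1, 2, 3],
    "e": [0, 0, 1],
    "f": [5, 6, 7],
})

groups = duplicate_groups(toy)
print(groups)
```

Like the rest of this section, the sketch transposes the dataframe, so the same memory caveat applies.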

In [None]:
# visualise the duplicated rows (the columns of the original dataframe)
data_t[data_t.duplicated()]

Unnamed: 0,10439,9236,818,11504,11722,5276,6863,13463,10228,11462,...,4373,7891,9225,14019,4859,13123,3264,9845,10799,2732
ind_var2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ind_var13_medio_0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ind_var13_medio,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ind_var18_0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ind_var18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ind_var26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
ind_var25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
ind_var27_0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ind_var28_0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ind_var28,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# we can capture the duplicated features, by capturing the
# index values of the transposed dataframe like this:
duplicated_features = data_t[data_t.duplicated()].index.values
duplicated_features

array(['ind_var2', 'ind_var13_medio_0', 'ind_var13_medio', 'ind_var18_0',
       'ind_var18', 'ind_var26', 'ind_var25', 'ind_var27_0', 'ind_var28_0',
       'ind_var28', 'ind_var27', 'ind_var29_0', 'ind_var29', 'ind_var32',
       'ind_var34_0', 'ind_var34', 'ind_var37', 'ind_var40_0', 'ind_var40',
       'ind_var41', 'ind_var39', 'ind_var44', 'ind_var46_0', 'ind_var46',
       'num_var13_medio_0', 'num_var13_medio', 'num_var18_0', 'num_var18',
       'num_var26', 'num_var25', 'num_var27_0', 'num_var28_0', 'num_var28',
       'num_var27', 'num_var29_0', 'num_var29', 'num_var32', 'num_var34_0',
       'num_var34', 'num_var37', 'num_var40_0', 'num_var40', 'num_var41',
       'num_var39', 'num_var46_0', 'num_var46', 'saldo_var13_medio',
       'saldo_var18', 'saldo_var28', 'saldo_var27', 'saldo_var29',
       'saldo_var34', 'saldo_var40', 'saldo_var41', 'saldo_var46',
       'delta_imp_amort_var18_1y3', 'delta_imp_amort_var34_1y3',
       'delta_imp_reemb_var17_1y3', 'delta_imp_reemb_var3

In [None]:
# alternatively, we can remove the duplicated rows,
# transpose the dataframe back to the variables as columns
# keep first indicates that we keep the first of a set of
# duplicated variables

data_unique = data_t.drop_duplicates(keep='first').T
data_unique.shape

(10500, 265)

We can see immediately how removing duplicated features helps reduce the feature space. We went from 370 to 265 non-duplicated features.
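Because the duplicated features are identified on the training set only, the same columns then need to be removed from both the train and test sets. The sketch below shows this on made-up data. The dataframes `train` and `test` are hypothetical stand-ins for `X_train` and `X_test`, and `to_drop` plays the role of the list of duplicated features.

```python
import pandas as pd

# Hypothetical stand-ins for X_train and X_test
train = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [4, 5, 6]})
test = pd.DataFrame({"a": [7, 8], "b": [7, 9], "c": [1, 2]})

# identify duplicated columns by examining the training set only
train_t = train.T
to_drop = train_t[train_t.duplicated()].index.tolist()

# remove the same columns from both sets, so they keep identical features
train_reduced = train.drop(columns=to_drop)
test_reduced = test.drop(columns=to_drop)

print(to_drop)
print(train_reduced.shape, test_reduced.shape)
```

Note that in this toy example `b` is not identical to `a` in the test set. It is dropped anyway, because the decision is made from the training data alone.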

In [None]:
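# to find those columns in the original dataframe that were removed
# (we compare against X_train, because data still contains TARGET,
# which would otherwise be listed as removed):

duplicated_features = [col for col in X_train.columns if col not in data_unique.columns]
duplicated_features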

['ind_var2',
 'ind_var13_medio_0',
 'ind_var13_medio',
 'ind_var18_0',
 'ind_var18',
 'ind_var26',
 'ind_var25',
 'ind_var27_0',
 'ind_var28_0',
 'ind_var28',
 'ind_var27',
 'ind_var29_0',
 'ind_var29',
 'ind_var32',
 'ind_var34_0',
 'ind_var34',
 'ind_var37',
 'ind_var40_0',
 'ind_var40',
 'ind_var41',
 'ind_var39',
 'ind_var44',
 'ind_var46_0',
 'ind_var46',
 'num_var13_medio_0',
 'num_var13_medio',
 'num_var18_0',
 'num_var18',
 'num_var26',
 'num_var25',
 'num_var27_0',
 'num_var28_0',
 'num_var28',
 'num_var27',
 'num_var29_0',
 'num_var29',
 'num_var32',
 'num_var34_0',
 'num_var34',
 'num_var37',
 'num_var40_0',
 'num_var40',
 'num_var41',
 'num_var39',
 'num_var46_0',
 'num_var46',
 'saldo_var13_medio',
 'saldo_var18',
 'saldo_var28',
 'saldo_var27',
 'saldo_var29',
 'saldo_var34',
 'saldo_var40',
 'saldo_var41',
 'saldo_var46',
 'delta_imp_amort_var18_1y3',
 'delta_imp_amort_var34_1y3',
 'delta_imp_reemb_var17_1y3',
 'delta_imp_reemb_var33_1y3',
 'delta_imp_trasp_var17_out_1y3