#### Duplicated features

Datasets often contain duplicated features, that is, features that are identical despite having different names.

In addition, we may introduce duplicated features when one-hot encoding categorical variables, particularly if our datasets have many and/or high-cardinality categorical variables.
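As a minimal sketch of how one-hot encoding can create duplicates, consider a toy frame (the column names and values are hypothetical, not taken from the course dataset) in which two categorical variables are perfectly aligned. Some of the resulting dummy columns then carry exactly the same values:

```python
import pandas as pd

# Hypothetical toy data: every "Rome" row is also an "Italy" row,
# so the two variables are partly redundant.
df = pd.DataFrame({
    "city": ["Paris", "Lyon", "Rome", "Paris", "Rome"],
    "country": ["France", "France", "Italy", "France", "Italy"],
})

# One-hot encode both variables (dtype=int gives 0/1 instead of booleans)
dummies = pd.get_dummies(df, dtype=int)
print(dummies.columns.tolist())

# Series.equals compares values, index and dtype; it ignores the column name
same = dummies["city_Rome"].equals(dummies["country_Italy"])
print(same)
```

Here `city_Rome` and `country_Italy` are different names for the same column, so one of them can be dropped without losing information.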

Identifying and removing duplicated, and therefore redundant, features is an easy first step towards feature selection and more interpretable machine learning models.

Here I will demonstrate how to identify duplicated features using a dataset that I created for this course.

Pandas has no dedicated function to find duplicated columns, so we need to write a bit of code to do so.
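For small frames, one commonly used shortcut is to transpose the data and reuse `DataFrame.duplicated()`, which flags repeated rows. The sketch below uses a hypothetical toy frame. Keep in mind that transposing copies the data and, with mixed column types, converts everything to `object` dtype, so this is unlikely to scale to large datasets:

```python
import pandas as pd

# Hypothetical toy frame with two numerical and one categorical duplicate
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [0.5, 0.1, 0.5, 0.9],
    "a_copy": [1, 2, 3, 4],
    "c": ["x", "y", "x", "z"],
    "c_copy": ["x", "y", "x", "z"],
})

# After transposing, each original column is a row, so duplicated()
# marks every column that repeats an earlier one (keep="first" by default)
is_dup = df.T.duplicated()

# Names of the columns flagged as duplicates
dup_cols = is_dup[is_dup].index.tolist()
print(dup_cols)
```

Unlike a column-by-column comparison with `Series.equals`, this shortcut compares values after the transpose has changed the dtypes, so an integer column and a float column holding the same numbers may be flagged as duplicates.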

Note: finding duplicated features can be computationally costly in Python, so depending on the size of your dataset, it may not always be feasible.
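To get a rough feel for the cost, we can count how many column pairs a naive pairwise search may have to compare in the worst case (when no duplicates are found, so nothing is skipped). This is only a back-of-the-envelope sketch; the helper name `max_column_comparisons` is made up for illustration, and each comparison may additionally scan every row:

```python
from math import comb


def max_column_comparisons(n_features: int) -> int:
    """Upper bound on the number of column pairs to compare."""
    return comb(n_features, 2)


# 300 features, as in the training set used in this notebook
pairs_300 = max_column_comparisons(300)  # 44850

# 158 features, what remains after dropping quasi-constant features
pairs_158 = max_column_comparisons(158)  # 12403

print(pairs_300, pairs_158)
```

The number of pairs grows quadratically with the number of features, which is one reason to remove constant and quasi-constant features before looking for duplicates.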

The following method for finding duplicated features works for both numerical and categorical variables.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

In [3]:
data = pd.read_csv('../datasets/dataset_1.csv')
print(data.shape)
data.head()

(50000, 301)


Unnamed: 0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,var_10,...,var_292,var_293,var_294,var_295,var_296,var_297,var_298,var_299,var_300,target
0,0,0,0.0,0.0,0.0,0,0,0,0,0,...,0.0,0,0,0,0,0,0,0.0,0.0,0
1,0,0,0.0,3.0,0.0,0,0,0,0,0,...,0.0,0,0,0,0,0,0,0.0,0.0,0
2,0,0,0.0,5.88,0.0,0,0,0,0,0,...,0.0,0,0,3,0,0,0,0.0,67772.7216,0
3,0,0,0.0,14.1,0.0,0,0,0,0,0,...,0.0,0,0,0,0,0,0,0.0,0.0,0
4,0,0,0.0,5.76,0.0,0,0,0,0,0,...,0.0,0,0,0,0,0,0,0.0,0.0,0


In [4]:
# check the presence of missing data.
# (there are no missing data in this dataset)

[col for col in data.columns if data[col].isnull().sum() > 0]

[]

*Note*

In all feature selection procedures, it is good practice to select the features by examining only the training set. This helps avoid overfitting.

In [5]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), # drop the target
    data['target'], # just the target
    test_size=0.3,
    random_state=0
)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

**Remove constant and quasi-constant features**

In [6]:
# remove constant and quasi-constant features first:
# we can remove the 2 types of features together with this code
# (we used it in our previous notebook)

# create an empty list
quasi_constant_feat = []

# iterate over every feature
for feature in X_train.columns:

    # find the predominant value, that is the value that is shared
    # by most observations
    predominant = (X_train[feature].value_counts() / float(
        len(X_train))).sort_values(ascending=False).values[0]

    # evaluate predominant feature: do more than 99.8% of the observations
    # show 1 value?
    if predominant > 0.998:
        quasi_constant_feat.append(feature)

len(quasi_constant_feat)

142

In [7]:
# we can then drop these columns from the train and test sets:

X_train.drop(labels=quasi_constant_feat, axis=1, inplace=True)
X_test.drop(labels=quasi_constant_feat, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 158), (15000, 158))

**Remove duplicated features**

To identify duplicated variables, we need to iterate through all the features of our dataset and, for each feature, try to find others that are identical to it, that is, its duplicates.

We will create a dictionary of {variable: duplicated variables} pairs to identify them more easily throughout the demo. Keep in mind that in a dataset, there could be 2 or more features that are identical to each other.
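To make the target structure concrete, here is a small sketch on a toy frame (hypothetical column names) in which three columns are identical. It follows the same idea described above, in compact form, and is meant only to show what the dictionary looks like:

```python
import pandas as pd

# Hypothetical toy frame: "x", "x_dup1" and "x_dup2" are identical
toy = pd.DataFrame({
    "x": [1, 0, 1],
    "y": [5, 6, 7],
    "x_dup1": [1, 0, 1],
    "x_dup2": [1, 0, 1],
})

groups = {}   # {variable: [its duplicates]}
seen = set()  # features already identified as duplicates
cols = list(toy.columns)

for i, first in enumerate(cols):
    # skip features already known to duplicate an earlier one
    if first in seen:
        continue
    groups[first] = []
    # compare only against the features that come later
    for second in cols[i + 1:]:
        if toy[first].equals(toy[second]):
            groups[first].append(second)
            seen.add(second)

print(groups)
```

The key `x` collects both of its duplicates, while `y` gets an empty list because nothing else matches it. The duplicates themselves never become keys.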

In [8]:
# check for duplicated features in the training set:

# create an empty dictionary, where we will store 
# the groups of duplicates
duplicated_feat_pairs = {}

# create an empty list to collect features
# that were found to be duplicated
_duplicated_feat = []


# iterate over every feature in our dataset:
for i in range(0, len(X_train.columns)):
    
    # this bit helps me understand where the loop is at:
    if i % 10 == 0:  
        print(i)
    
    # choose 1 feature:
    feat_1 = X_train.columns[i]
    
    # check if this feature has already been identified
    # as a duplicate of another one. If it was, it should be stored in
    # our _duplicated_feat list.
    
    # If this feature was already identified as a duplicate, we skip it;
    # otherwise, we proceed:
    if feat_1 not in _duplicated_feat:
    
        # create an empty list as an entry for this feature in the dictionary:
        duplicated_feat_pairs[feat_1] = []

        # now, iterate over the remaining features of the dataset:
        for feat_2 in X_train.columns[i + 1:]:

            # check if this second feature is identical to the first one
            if X_train[feat_1].equals(X_train[feat_2]):

                # if it is identical, append it to the list in the dictionary
                duplicated_feat_pairs[feat_1].append(feat_2)
                
                # and append it to our monitor list for duplicated variables
                _duplicated_feat.append(feat_2)

0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150


In [10]:
_duplicated_feat

['var_148', 'var_199', 'var_296', 'var_250', 'var_232', 'var_269']

In [11]:
# let's explore the number of keys in our dictionary

# we see it is 152, because 6 of the 158 were duplicates,
# so they were not included as keys

print(len(duplicated_feat_pairs.keys()))

152
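Once the duplicates are known, a natural next step is to drop them from both the train and test sets, keeping one feature from each group. The sketch below shows the idea on toy frames; the names `X_train_toy`, `X_test_toy` and `duplicated_toy` are hypothetical stand-ins for `X_train`, `X_test` and `_duplicated_feat`:

```python
import pandas as pd

# Hypothetical stand-ins for the train and test sets
X_train_toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "a_dup": [1, 2, 3]})
X_test_toy = pd.DataFrame({"a": [7, 8], "b": [9, 9], "a_dup": [7, 8]})

# Duplicates identified on the training set only
duplicated_toy = ["a_dup"]

# Drop the same columns from both sets so they stay aligned
X_train_toy = X_train_toy.drop(columns=duplicated_toy)
X_test_toy = X_test_toy.drop(columns=duplicated_toy)

print(X_train_toy.shape, X_test_toy.shape)
```

Applied to the real data, the same pattern would be expected to leave 152 features in each set, matching the number of keys in the dictionary.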
