## Duplicated features

This method works for both **numerical and categorical** variables.
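A minimal sketch of why the same check covers both variable types: `Series.equals` compares two columns value by value, so it applies to string columns as well as numeric ones. The toy frame and its column names below are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data: one duplicated numeric pair, one duplicated categorical pair
toy = pd.DataFrame({
    'num_a': [1, 2, 3],
    'num_b': [1, 2, 3],        # same values as num_a
    'cat_a': ['x', 'y', 'x'],
    'cat_b': ['x', 'y', 'x'],  # same values as cat_a
    'cat_c': ['x', 'x', 'y'],  # same categories as cat_a, different order
})

num_is_dup = toy['num_a'].equals(toy['num_b'])
cat_is_dup = toy['cat_a'].equals(toy['cat_b'])
cat_not_dup = toy['cat_a'].equals(toy['cat_c'])
print(num_is_dup, cat_is_dup, cat_not_dup)
```

Note that `equals` also requires matching dtypes, so an integer column and a float column holding the same numbers would not be reported as duplicates.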

In [23]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

**Load the dataset!**

In [24]:
data = pd.read_csv('../dataset_1.csv')
data.shape

(50000, 301)

**Check for missing data! The empty list below shows there is none.**

In [25]:
[col for col in data.columns if data[col].isnull().sum() > 0]

[]

**Select the features using only the training set, to avoid overfitting!**

In [26]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), # drop the target
    data['target'], # just the target
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

((35000, 300), (15000, 300))

## Remove constant and quasi-constant

In [27]:
quasi_constant_feat = []
for feature in X_train.columns:
    # Share of rows taken by the most frequent value of this feature
    predominant = (X_train[feature].value_counts() / len(X_train)
                   ).sort_values(ascending=False).values[0]
    # Flag the feature if one value covers more than 99.8% of the rows
    if predominant > 0.998:
        quasi_constant_feat.append(feature)
len(quasi_constant_feat)

142
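The same check can be sketched as a small reusable function. This is an illustrative variant, not the code used above: the function name, the toy data and the lower threshold are assumptions. It uses `value_counts(normalize=True)`, which divides by the number of non-missing values, so it matches the loop above only when there is no missing data (as verified earlier for this dataset).

```python
import pandas as pd

def quasi_constant_features(df, threshold=0.998):
    """Return columns whose most frequent value covers more than `threshold` of rows."""
    flagged = []
    for col in df.columns:
        # Shares are sorted from most to least frequent, so the first one is the largest
        top_share = df[col].value_counts(normalize=True).iloc[0]
        if top_share > threshold:
            flagged.append(col)
    return flagged

# Hypothetical toy data with 10 rows
toy = pd.DataFrame({
    'constant': [0] * 10,           # one value in 100% of rows
    'mostly_zero': [0] * 9 + [1],   # one value in 90% of rows
    'varied': list(range(10)),      # every value appears once
})
flagged = quasi_constant_features(toy, threshold=0.85)
print(flagged)
```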

**Drop them from the train and test sets!**

In [28]:
X_train.drop(labels=quasi_constant_feat, axis=1, inplace=True)
X_test.drop(labels=quasi_constant_feat, axis=1, inplace=True)
X_train.shape, X_test.shape

((35000, 158), (15000, 158))

## Remove duplicated features

**Create an empty dictionary to collect the groups of duplicated features, and an empty list to collect the duplicates separately!**

In [29]:
duplicated_feat_pairs = {}
_duplicated_feat = []
for i in range(0, len(X_train.columns)):
    if i % 10 == 0:  # Print progress every 10 features
        print(i)
    feat_1 = X_train.columns[i]
    if feat_1 not in _duplicated_feat:
        duplicated_feat_pairs[feat_1] = []
        for feat_2 in X_train.columns[i + 1:]:
            if X_train[feat_1].equals(X_train[feat_2]):
                duplicated_feat_pairs[feat_1].append(feat_2)
                _duplicated_feat.append(feat_2)

0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
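The pairwise loop compares every feature against all later ones, which can be slow with many columns. A commonly used alternative, sketched here on made-up toy data, is to transpose the frame and call `DataFrame.duplicated`, which flags rows (here, features) that repeat an earlier one. This is a swapped-in technique rather than the approach used in this notebook; transposing a large frame can use a lot of memory, and mixed column types are upcast to `object` in the process.

```python
import pandas as pd

# Hypothetical toy data: 'a_copy' duplicates 'a', 'b_copy' duplicates 'b'
toy = pd.DataFrame({
    'a': [0, 3, 6],
    'b': [1, 1, 2],
    'a_copy': [0, 3, 6],
    'c': [5, 5, 5],
    'b_copy': [1, 1, 2],
})

# Transpose so each feature becomes a row, then flag rows that repeat an earlier one
is_dup = toy.T.duplicated(keep='first')
dup_cols = is_dup[is_dup].index.tolist()

# Keep only the first occurrence of each group of identical features
kept = toy.loc[:, ~is_dup.values]
print(dup_cols)
print(kept.columns.tolist())
```

Unlike the dictionary built by the loop, this only tells us which features are duplicates, not which feature each one duplicates.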


**Let's explore our list of duplicated features!**

In [30]:
len(_duplicated_feat)

6

**These are the duplicated features!**

In [31]:
_duplicated_feat

['var_148', 'var_199', 'var_296', 'var_250', 'var_232', 'var_269']

**Explore the dictionary we created!**

In [32]:
duplicated_feat_pairs

{'var_4': [],
 'var_5': [],
 'var_8': [],
 'var_13': [],
 'var_15': [],
 'var_17': [],
 'var_18': [],
 'var_19': [],
 'var_21': [],
 'var_22': [],
 'var_25': [],
 'var_26': [],
 'var_27': [],
 'var_29': [],
 'var_30': [],
 'var_31': [],
 'var_35': [],
 'var_37': ['var_148'],
 'var_38': [],
 'var_41': [],
 'var_46': [],
 'var_47': [],
 'var_49': [],
 'var_50': [],
 'var_51': [],
 'var_52': [],
 'var_54': [],
 'var_55': [],
 'var_57': [],
 'var_58': [],
 'var_62': [],
 'var_63': [],
 'var_64': [],
 'var_68': [],
 'var_70': [],
 'var_74': [],
 'var_75': [],
 'var_76': [],
 'var_79': [],
 'var_82': [],
 'var_83': [],
 'var_84': ['var_199'],
 'var_85': [],
 'var_86': [],
 'var_88': [],
 'var_91': [],
 'var_93': [],
 'var_94': [],
 'var_96': [],
 'var_100': [],
 'var_101': [],
 'var_103': [],
 'var_105': [],
 'var_107': [],
 'var_108': [],
 'var_109': [],
 'var_110': [],
 'var_114': [],
 'var_117': [],
 'var_118': [],
 'var_119': [],
 'var_121': [],
 'var_123': [],
 'var_128': [],
 'var_131'

Every retained feature is a key in the dictionary: a feature with duplicates maps to the list of those duplicates, and a feature without duplicates maps to an empty list. Let's explore the features with duplicates now:

**Print the number of keys in our dictionary!**

In [33]:
print(len(duplicated_feat_pairs.keys()))

152


**Print each feature together with its duplicates!**

In [34]:
for feat in duplicated_feat_pairs.keys():
    if len(duplicated_feat_pairs[feat]) > 0:
        print(feat, duplicated_feat_pairs[feat])
        print()

var_37 ['var_148']

var_84 ['var_199']

var_143 ['var_296']

var_177 ['var_250']

var_226 ['var_232']

var_229 ['var_269']



**Check var_37 and var_148**

In [35]:
X_train[['var_37', 'var_148']].head(10)

| | var_37 | var_148 |
|---|---|---|
| 17967 | 0 | 0 |
| 32391 | 0 | 0 |
| 9341 | 0 | 0 |
| 7929 | 0 | 0 |
| 46544 | 0 | 0 |
| 4149 | 0 | 0 |
| 33426 | 0 | 0 |
| 3002 | 0 | 0 |
| 6974 | 0 | 0 |
| 16864 | 0 | 0 |


In [36]:
X_train['var_37'].unique()

array([ 0,  3,  6,  9, 12, 21, 33, 15], dtype=int64)

In [37]:
X_train['var_148'].unique()

array([ 0,  3,  6,  9, 12, 21, 33, 15], dtype=int64)

In [38]:
X_train['var_37'].unique() == X_train['var_148'].unique()

array([ True,  True,  True,  True,  True,  True,  True,  True])
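Matching unique values are a necessary condition for two features to be identical, but not a sufficient one: two columns can contain the same set of values and still differ row by row. A minimal sketch with made-up columns `u` and `v` shows the difference and a stricter row-wise check:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: same unique values, but the last row differs
toy = pd.DataFrame({
    'u': [0, 3, 6, 0],
    'v': [0, 3, 6, 3],
})

# Comparing unique values alone suggests the columns match
same_uniques = np.array_equal(toy['u'].unique(), toy['v'].unique())

# Comparing row by row reveals the mismatch
same_rows = bool((toy['u'] == toy['v']).all())
n_mismatch = int((toy['u'] != toy['v']).sum())
print(same_uniques, same_rows, n_mismatch)
```

For `var_37` and `var_148` the row-wise equality was already established by `equals` in the loop, so the comparison of unique values here is only a sanity check.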

**Explore rows of the dataframe where the values in var_37 are different from 0!**

In [39]:
X_train[X_train['var_37'] != 0][['var_37', 'var_148']].head(10)

| | var_37 | var_148 |
|---|---|---|
| 37493 | 3 | 3 |
| 20251 | 6 | 6 |
| 4264 | 6 | 6 |
| 48480 | 3 | 3 |
| 31607 | 3 | 3 |
| 41172 | 3 | 3 |
| 13502 | 3 | 3 |
| 7759 | 3 | 3 |
| 46118 | 3 | 3 |
| 2638 | 3 | 3 |


As we see, these features are indeed identical :)

**Remove the duplicates!**

In [40]:
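# The dictionary keys are the features we keep; convert them to a list for column selection
X_train = X_train[list(duplicated_feat_pairs.keys())]
X_test = X_test[list(duplicated_feat_pairs.keys())]
X_train.shape, X_test.shape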

((35000, 152), (15000, 152))

We removed 6 duplicated features. Of the original **300** features, 142 were quasi-constant and 6 were duplicated, so **152** remained!
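The whole duplicate-removal step can be wrapped in a helper that learns the groups on the training set and then applies the same column selection to the test set. The sketch below is one possible packaging of the loop from this section; the function name `find_duplicated_features` and the toy frames are assumptions made for illustration.

```python
import pandas as pd

def find_duplicated_features(df):
    """Map each retained column to the later columns that are exact copies of it."""
    groups, dropped = {}, set()
    cols = list(df.columns)
    for i, first in enumerate(cols):
        if first in dropped:
            continue  # already recorded as a duplicate of an earlier column
        groups[first] = [c for c in cols[i + 1:]
                         if c not in dropped and df[first].equals(df[c])]
        dropped.update(groups[first])
    return groups

# Hypothetical toy data: 'b' duplicates 'a' in the training set only
train = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3], 'c': [3, 2, 1]})
test = pd.DataFrame({'a': [4, 5], 'b': [4, 6], 'c': [0, 0]})

groups = find_duplicated_features(train)
keep = list(groups.keys())

# The selection is decided on the training set and reused unchanged on the test set
train_reduced, test_reduced = train[keep], test[keep]
print(groups)
print(train_reduced.shape, test_reduced.shape)
```

In the toy test set `a` and `b` are not identical, yet `b` is still dropped, because the decision is made on the training data only.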