## Quasi-constant features

**Quasi-constant features** show the **same value** for the **great majority of the observations**. They usually carry **little information**, but there can be exceptions, so be careful when removing this type of feature. This section covers two ways to find them: **VarianceThreshold** and a **manual method**!

In [26]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold

**Load dataset!**

In [27]:
data = pd.read_csv('dataset_1.csv')
data.shape

(50000, 301)

**Select features using only the training set to avoid overfitting!**

In [28]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), # drop the target
    data['target'], # just the target
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

((35000, 300), (15000, 300))

## Remove constant features

In [29]:
constant_features = [feat for feat in X_train.columns if X_train[feat].std() == 0]
X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)
X_train.shape, X_test.shape

((35000, 266), (15000, 266))
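The `std() == 0` check above relies on a numeric standard deviation, so it is not suited to text columns. As a minimal sketch, assuming a small hypothetical frame `toy` (not part of `dataset_1.csv`), counting distinct values with `nunique()` is one alternative that works for any dtype:

```python
import pandas as pd

# Hypothetical toy data: one constant numeric column,
# one constant text column, and one column that varies.
toy = pd.DataFrame({
    "num_const": [1, 1, 1, 1],
    "cat_const": ["a", "a", "a", "a"],
    "varies": [0, 1, 0, 2],
})

# nunique() counts distinct values and ignores missing values by default,
# so a column holding one value plus NaNs would also be reported here.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)  # ['num_const', 'cat_const']
```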

## Remove quasi-constant features

### **1. VarianceThreshold Method**

In [30]:
sel = VarianceThreshold(threshold=0.01)  
sel.fit(X_train)  # fit finds the features with low variance

VarianceThreshold(threshold=0.01)
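To get a feel for what `threshold=0.01` means, it helps to look at the simplest case: a feature that only takes the values 0 and 1. If a share `p` of the rows holds one value, the variance is `p * (1 - p)`. The sketch below (the helper name `majority_share_at_variance` is made up for illustration) solves that equation for the majority share. It only applies to 0/1 features; for other features the variance also depends on the scale of the values.

```python
import numpy as np

def majority_share_at_variance(threshold):
    """Majority share p (p >= 0.5) at which a 0/1 feature has this variance."""
    # Larger root of p * (1 - p) = threshold.
    return (1 + np.sqrt(1 - 4 * threshold)) / 2

p = majority_share_at_variance(0.01)
print(round(p, 4))  # 0.9899
```

So, for a 0/1 feature, a variance of 0.01 corresponds to roughly 99% of the rows sharing the same value.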

**`get_support()` returns a boolean mask that marks the retained features. Summing it gives the number of features kept!**

In [31]:
sum(sel.get_support())

215

**Print the number of quasi-constant features!**

In [32]:
quasi_constant = X_train.columns[~sel.get_support()]
len(quasi_constant)

51

**Print the variable names!**

In [33]:
quasi_constant

Index(['var_1', 'var_2', 'var_7', 'var_9', 'var_10', 'var_19', 'var_28',
       'var_36', 'var_43', 'var_45', 'var_53', 'var_56', 'var_59', 'var_66',
       'var_67', 'var_69', 'var_71', 'var_104', 'var_106', 'var_116',
       'var_133', 'var_137', 'var_141', 'var_146', 'var_177', 'var_187',
       'var_189', 'var_194', 'var_197', 'var_198', 'var_202', 'var_218',
       'var_219', 'var_223', 'var_233', 'var_234', 'var_235', 'var_245',
       'var_247', 'var_249', 'var_250', 'var_251', 'var_256', 'var_260',
       'var_267', 'var_274', 'var_282', 'var_285', 'var_287', 'var_289',
       'var_298'],
      dtype='object')

**Fraction of observations showing each of the different values of the variable!**

In [34]:
X_train['var_1'].value_counts() / float(len(X_train))

0    0.999629
3    0.000200
6    0.000143
9    0.000029
Name: var_1, dtype: float64

**More than 99.9% of the observations show a single value (0), so the feature is fairly constant.**

In [35]:
X_train['var_2'].value_counts() / float(len(X_train))

0    0.999971
1    0.000029
Name: var_2, dtype: float64

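**You can explore the rest of the quasi-constant variables in the same way. Before transforming, store the names of the retained features, because `transform` returns a NumPy array without column names.**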

In [36]:
feat_names = X_train.columns[sel.get_support()]

**Remove the quasi-constant features**

In [37]:
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)
X_train.shape, X_test.shape

((35000, 215), (15000, 215))

**Transform the array into a dataframe!**

In [38]:
X_train = pd.DataFrame(X_train, columns=feat_names)
X_test = pd.DataFrame(X_test, columns=feat_names)
X_test.head()

Unnamed: 0,var_3,var_4,var_5,var_6,var_8,var_11,var_12,var_13,var_14,var_15,...,var_286,var_288,var_290,var_291,var_292,var_293,var_295,var_296,var_299,var_300
0,0.0,2.79,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,2.94,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0
3,0.0,2.76,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,2.94,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

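Rebuilding the dataframe from the array gives it a fresh 0..n-1 index, so the rows no longer carry the shuffled index that `train_test_split` assigned and that `y_train` / `y_test` still have. One possible alternative, sketched here on hypothetical toy data, is to use the boolean mask from `get_support()` to select the columns directly, which keeps the dataframe, its column names and its index:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data with a non-default index, similar to what a
# shuffled train/test split produces.
toy = pd.DataFrame(
    {"mostly_zero": [0] * 999 + [1], "balanced": [0, 1] * 500},
    index=np.arange(1000)[::-1],
)

toy_sel = VarianceThreshold(threshold=0.01).fit(toy)

# Select the retained columns with the boolean mask instead of transform().
reduced = toy.loc[:, toy_sel.get_support()]

print(list(reduced.columns))  # ['balanced']
print(reduced.index.equals(toy.index))  # True
```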

### **2. Manual Method**

It can be used for both **numerical and categorical** variables.
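As a small illustration of the categorical case, the sketch below wraps the "share of the most frequent value" idea in a helper and applies it to a hypothetical toy frame with text columns. The function name `quasi_constant_columns` and the toy data are assumptions made for this example only.

```python
import pandas as pd

def quasi_constant_columns(df, threshold=0.998):
    """Return the columns whose most frequent value covers more than `threshold` of the rows."""
    flagged = []
    for col in df.columns:
        # value_counts() sorts by frequency (descending), so the first entry
        # is the count of the predominant value.
        top_share = df[col].value_counts().iloc[0] / len(df)
        if top_share > threshold:
            flagged.append(col)
    return flagged

# Hypothetical toy data mixing text and numeric columns.
toy = pd.DataFrame({
    "country": ["UK"] * 999 + ["FR"],   # 99.9% one category
    "colour": ["red", "blue"] * 500,    # evenly split
    "amount": list(range(1000)),        # all values distinct
})

flagged = quasi_constant_columns(toy, threshold=0.99)
print(flagged)  # ['country']
```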

In [39]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)
constant_features = [feat for feat in X_train.columns if X_train[feat].std() == 0]
X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)
X_train.shape, X_test.shape

((35000, 266), (15000, 266))

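**Create an empty list, then iterate over every feature: find the share of its most frequent value, and flag the feature as quasi-constant if that share exceeds 99.8%!**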

In [40]:
quasi_constant_feat = []
for feature in X_train.columns:
    predominant = (X_train[feature].value_counts() / float(
        len(X_train))).sort_values(ascending=False).values[0]
    if predominant > 0.998:
        quasi_constant_feat.append(feature)
len(quasi_constant_feat)

108

**Print the feature names!**

In [41]:
quasi_constant_feat

['var_1',
 'var_2',
 'var_3',
 'var_6',
 'var_7',
 'var_9',
 'var_10',
 'var_11',
 'var_12',
 'var_14',
 'var_16',
 'var_20',
 'var_24',
 'var_28',
 'var_32',
 'var_34',
 'var_36',
 'var_39',
 'var_40',
 'var_42',
 'var_43',
 'var_45',
 'var_48',
 'var_53',
 'var_56',
 'var_59',
 'var_60',
 'var_65',
 'var_66',
 'var_67',
 'var_69',
 'var_71',
 'var_72',
 'var_73',
 'var_77',
 'var_78',
 'var_90',
 'var_95',
 'var_98',
 'var_102',
 'var_104',
 'var_106',
 'var_111',
 'var_115',
 'var_116',
 'var_124',
 'var_125',
 'var_126',
 'var_129',
 'var_130',
 'var_133',
 'var_136',
 'var_138',
 'var_141',
 'var_142',
 'var_146',
 'var_149',
 'var_150',
 'var_151',
 'var_153',
 'var_159',
 'var_183',
 'var_184',
 'var_187',
 'var_189',
 'var_197',
 'var_202',
 'var_204',
 'var_210',
 'var_211',
 'var_216',
 'var_217',
 'var_219',
 'var_221',
 'var_223',
 'var_224',
 'var_228',
 'var_233',
 'var_234',
 'var_235',
 'var_236',
 'var_237',
 'var_239',
 'var_243',
 'var_245',
 'var_246',
 'var_247',
 

**Select one feature from the list**

In [42]:
quasi_constant_feat[2]

'var_3'

In [43]:
X_train['var_3'].value_counts() / float(len(X_train))

0.0000         0.999629
207901.3365    0.000029
15028.0560     0.000029
25905.4866     0.000029
35685.9459     0.000029
3583.3941      0.000029
52105.7901     0.000029
86718.0000     0.000029
861.0900       0.000029
2641.0164      0.000029
5209.9500      0.000029
10281.6000     0.000029
12542.3100     0.000029
27.3000        0.000029
Name: var_3, dtype: float64

**The feature shows 0 for more than 99.9% of the observations**, but it also shows a **few different values for a very tiny proportion of the observations**. Because some of those values are very large, they **increase the feature variance**, which is why this feature is **not captured by the VarianceThreshold** above. Yet we can see that **it is quasi-constant**.
Keep in mind that **the thresholds are arbitrary and decided by the user**.
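The effect can be reproduced on a toy column. This is only a sketch on made-up data (a hypothetical column `skewed` with 10,000 rows, three of which hold large non-zero values), not a rerun of the dataset above:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical column: almost all zeros, plus three large values.
n = 10_000
skewed = np.zeros(n)
skewed[:3] = [27.3, 861.09, 86718.0]
toy = pd.DataFrame({"skewed": skewed})

# The few large values dominate the variance.
variance = toy["skewed"].var(ddof=0)

# VarianceThreshold keeps the column because its variance is far above 0.01 ...
kept_by_variance = bool(VarianceThreshold(threshold=0.01).fit(toy).get_support()[0])

# ... while the share of the most frequent value still exceeds 0.998.
top_share = toy["skewed"].value_counts().iloc[0] / len(toy)

print(kept_by_variance, top_share)  # True 0.9997
```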

**Drop the quasi-constant features!**

In [44]:
X_train.drop(labels=quasi_constant_feat, axis=1, inplace=True)
X_test.drop(labels=quasi_constant_feat, axis=1, inplace=True)
X_train.shape, X_test.shape

((35000, 158), (15000, 158))

**We removed almost half of the original variables! Starting from 300, dropping the 34 constant and 108 quasi-constant features leaves 158.**