# Removing low-variance features

One way to improve feature selection is to remove features that carry little information. Consider a Boolean variable that is almost always 0 or almost always 1. We can detect such features by computing their variance and remove those whose variance falls below a chosen threshold.

In [1]:
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
from sklearn.metrics import f1_score

Load the bank marketing dataset. Because we want to focus on features, we simply drop every column that contains missing values. Note that you should normally do this with more care! We also convert the target `y` to 0/1, convert the categorical variables to dummy variables, and create a training and a validation set.

In [15]:
df = pd.read_csv('/data/datasets/bank-additional-full.csv', delimiter=';')

In [16]:
df = df.drop(columns=df.columns[df.isnull().any()])

In [17]:
df.y = (df.y == 'yes').astype(int)

Although the original dataset does not contain Boolean variables, constructing dummy variables from the categorical columns yields many of them.

In [18]:
df = pd.get_dummies(df, columns=df.select_dtypes(include=['object']).columns, drop_first=True)
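
To see what this step produces, here is a minimal sketch on a made-up frame (the `toy` data is hypothetical and not part of the bank dataset). With `drop_first=True`, a two-category column becomes a single 0/1 column, and its variance follows directly from the fraction of ones.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    'contact': ['cellular', 'telephone', 'cellular', 'cellular'],
    'age': [30, 41, 25, 52],
})

# drop_first=True drops the first category ('cellular'), leaving one dummy column.
# astype(int) makes the dummies 0/1 integers regardless of the pandas version.
toy_dummies = pd.get_dummies(toy, columns=['contact'], drop_first=True).astype(int)
print(toy_dummies)

# Fraction of ones and population variance (ddof=0) of the dummy column.
p = toy_dummies['contact_telephone'].mean()
var = toy_dummies['contact_telephone'].var(ddof=0)
print(p, var, p * (1 - p))
```

Note that pandas computes the sample variance (`ddof=1`) by default, so we pass `ddof=0` here to match the population variance.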

In [19]:
train_X, valid_X, train_y, valid_y = train_test_split(df[df.columns.drop('y')], df.y, test_size=0.2)

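We first compute a baseline: a model that uses all features, which obtains an F1 score of about 0.49 on the validation set. Note that the solver does not converge on the unscaled data, as the warning below shows.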

In [20]:
model = LogisticRegression(penalty='none')
model.fit(train_X, train_y)
pred_y = model.predict(valid_X)
f1_score(valid_y, pred_y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.4877384196185286

# Variance of Boolean variables

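Suppose that we wish to exclude Boolean variables for which at least 90% of the values are 0, or at least 90% are 1. If we assume that such a variable follows a Bernoulli distribution with probability $p$ of being 1, its variance is $\sigma^2 = p \cdot (1-p)$. In other words, when the variance is less than $0.9 \cdot (1 - 0.9) = 0.09$, one of the two values occurs in more than 90% of the records. Keep in mind that `VarianceThreshold` applies the same threshold to every column, including the numeric ones, whose variance depends on their scale.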
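
As a quick check of the formula, the sketch below computes the Bernoulli variance for two candidate cut-offs and compares it with the empirical variance of a made-up 0/1 column. The helper `bernoulli_variance` and the example column are our own illustration, not part of scikit-learn.

```python
import numpy as np

def bernoulli_variance(p):
    """Variance of a 0/1 variable that equals 1 with probability p."""
    return p * (1 - p)

# Variance thresholds for "at least 80%" and "at least 90%" of a single value.
print(bernoulli_variance(0.8))
print(bernoulli_variance(0.9))

# Hypothetical column with 95 zeros and 5 ones.
column = np.array([0] * 95 + [1] * 5)
p_hat = column.mean()

# np.var computes the population variance (ddof=0) by default,
# which for 0/1 data equals p_hat * (1 - p_hat).
print(column.var(), bernoulli_variance(p_hat))

# This column falls below the 90% threshold, so it would be removed.
print(column.var() < bernoulli_variance(0.9))
```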

In [21]:
sel = VarianceThreshold(threshold=(.9 * (1 - .9)))
sel.fit(train_X)
train_X = sel.transform(train_X)
valid_X = sel.transform(valid_X)

These are the remaining columns:

In [22]:
df.drop(columns='y').columns[sel.get_support()]

Index(['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate',
       'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed',
       'job_blue-collar', 'job_technician', 'marital_married',
       'marital_single', 'education_basic.9y', 'education_high.school',
       'education_professional.course', 'education_university.degree',
       'default_unknown', 'housing_yes', 'loan_yes', 'contact_telephone',
       'month_aug', 'month_jul', 'month_jun', 'month_may', 'day_of_week_mon',
       'day_of_week_thu', 'day_of_week_tue', 'day_of_week_wed',
       'poutcome_nonexistent'],
      dtype='object')
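
The mask returned by `get_support()` lines up with the columns of the data the selector was fitted on. A minimal sketch on a made-up frame (the `toy_X` columns are hypothetical) shows which kinds of columns survive the same threshold:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical Boolean features, only for illustration.
toy_X = pd.DataFrame({
    'rare_flag': [0] * 19 + [1],     # 95% zeros, variance 0.0475
    'balanced_flag': [0, 1] * 10,    # 50% ones, variance 0.25
    'constant': [1] * 20,            # never changes, variance 0
})

toy_sel = VarianceThreshold(threshold=0.9 * (1 - 0.9))
toy_sel.fit(toy_X)

# get_support() is a Boolean mask over the fitted columns;
# only features with a variance above the threshold are kept.
kept = toy_X.columns[toy_sel.get_support()]
print(list(kept))
```

Indexing the columns of the frame that was actually passed to `fit` avoids any mismatch in column order.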

In [23]:
scaler = StandardScaler()
train_X = scaler.fit_transform(train_X)
valid_X = scaler.transform(valid_X)

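When we rerun the validation, the F1 score is practically unchanged (0.488, the same as the baseline to three decimals), even though the model now uses only 31 features. Note that this run also standardizes the features, so the comparison does not isolate the effect of the variance filter. Of course, the threshold is arbitrary and should be tuned.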

In [24]:
model.fit(train_X, train_y)
f1_score(valid_y, model.predict(valid_X))

0.4881995954146999
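
One way to tune the threshold is to try a few values and compare validation scores. The sketch below does this on synthetic Boolean data, so the data, the candidate thresholds, and the use of `make_pipeline` are all assumptions for illustration and not results on the bank dataset.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic Boolean features with different fractions of ones (hypothetical data).
probs = np.array([0.5, 0.3, 0.15, 0.05, 0.01])
X = (rng.random((n, len(probs))) < probs).astype(float)

# The target depends on the first two features plus some noise.
logits = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
y = (logits > 0).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)

scores = {}
for p in (0.95, 0.9, 0.8):
    # Filter, scale, and fit in one pipeline so every step is fitted on the training data only.
    pipe = make_pipeline(
        VarianceThreshold(threshold=p * (1 - p)),
        StandardScaler(),
        LogisticRegression(max_iter=1000),
    )
    pipe.fit(X_train, y_train)
    n_kept = int(pipe.named_steps['variancethreshold'].get_support().sum())
    scores[p] = (n_kept, f1_score(y_valid, pipe.predict(X_valid)))

for p, (n_kept, f1) in scores.items():
    print(f'p={p}: kept {n_kept} features, F1={f1:.3f}')
```

A stricter cut-off (smaller `p`) can only remove more features, so comparing the scores shows how much filtering the model tolerates before it starts to lose information.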