Feature selection reduces the dimensionality of the data, which has the following benefits:
- Reduces overfitting by removing noise introduced by some of the features
- Reduces training time, which allows you to experiment more with different models and hyperparameters
- Reduces data acquisition requirements
- Improves the comprehensibility of the model, because a smaller set of features is easier for humans to understand. This lets us focus on the main sources of predictability

Feature selection methods generally fall into two categories: filter methods and wrapper methods.

- Filter Methods: Apply a statistical measure to assign a score to each feature, one at a time. Examples include feature selection based on Pearson's chi-squared (χ²) test and on the ANOVA F-value.

- Wrapper Methods: Evaluate subsets of features. Based on the results of the previous model trained on a subset, features are either added to or removed from that subset, so the problem is essentially reduced to a search problem. Because wrapper methods are usually computationally very expensive, greedy algorithms (https://en.wikipedia.org/wiki/Greedy_algorithm) are the most practical choice in multivariate feature selection scenarios. Greedy algorithms do not necessarily find the optimal subset, which can be an advantage: exploring less of the search space makes them less prone to overfitting. Examples include Forward Selection, Backward Elimination, and Recursive Feature Elimination.
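
To make the distinction concrete, here is a minimal sketch that contrasts the two families on synthetic data. It is an illustration under stated assumptions, not the method used later in this notebook: the helper names (`univariate_scores`, `r_squared`, `forward_selection`) and the toy data are hypothetical, the filter score is the squared Pearson correlation used as a simple stand-in for an F-value, and the wrapper is a plain greedy forward selection around an ordinary least-squares fit.

```python
import numpy as np

def univariate_scores(X, y):
    """Filter-style score: squared Pearson correlation of each feature with y,
    computed one feature at a time (a stand-in for an F-value)."""
    return np.array([np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(X.shape[1])])

def r_squared(X, y):
    """R^2 of a least-squares fit with intercept, scored on the training data."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return 1.0 - (residual @ residual) / np.sum((y - y.mean()) ** 2)

def forward_selection(X, y, k):
    """Wrapper-style search: greedily add the feature that most improves R^2."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < k:
        best = max(remaining, key=lambda j: r_squared(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (assumption): only features 0, 3 and 5 carry signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = (3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + 1.5 * X_demo[:, 5]
          + rng.normal(scale=0.1, size=200))

filter_top = np.argsort(univariate_scores(X_demo, y_demo))[::-1][:3]
wrapper_top = forward_selection(X_demo, y_demo, k=3)
print(sorted(filter_top.tolist()), sorted(wrapper_top))
```

The filter scores each feature once, while the wrapper refits a model for every candidate at every step, which is why wrapper methods are so much more expensive.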

In [12]:
import os, sys
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_selection import RFE
from sklearn.base import clone
import itertools

In [4]:
datasource = "datasets/winequality-red.csv"
print(os.path.exists(datasource))

True


In [6]:
df = pd.read_csv(datasource).sample(frac = 1).reset_index(drop = True)
del df["Unnamed: 0"]
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.3,0.69,0.32,2.2,0.069,35.0,104.0,0.99632,3.33,0.51,9.5,5
1,7.2,0.57,0.05,2.3,0.081,16.0,36.0,0.99564,3.38,0.6,10.3,6
2,13.2,0.38,0.55,2.7,0.081,5.0,16.0,1.0006,2.98,0.54,9.4,5
3,9.0,0.53,0.49,1.9,0.171,6.0,25.0,0.9975,3.27,0.61,9.4,6
4,7.7,0.965,0.1,2.1,0.112,11.0,22.0,0.9963,3.26,0.5,9.5,5


In [7]:
X = np.array(df.iloc[:, :-1])
y = np.array(df["quality"])

## Feature selection solution space

From an algorithm analysis point of view, a solution to a feature selection problem can be represented as a boolean vector, with each component indicating whether the corresponding feature has been selected.

In [10]:
selected = np.array([False, True, True, False, False, True, True, False, False, False, True])
print(selected)

[False  True  True False False  True  True False False False  True]


scikit-learn calls the set of selected feature columns the "support". The corresponding column indices can be obtained from the boolean vector using np.flatnonzero().

From the NumPy documentation (https://docs.scipy.org/doc/numpy/reference/generated/numpy.flatnonzero.html):

"Return indices that are non-zero in the flattened version of a."


In [11]:
support = np.flatnonzero(selected)
print(support)

[ 1  2  5  6 10]
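
The two representations are interchangeable. The following sketch, which uses a hypothetical toy matrix `X_toy` rather than the wine data, shows that the boolean mask and the index array select the same columns, and how the mask can be rebuilt from the indices.

```python
import numpy as np

selected = np.array([False, True, True, False, False, True, True, False, False, False, True])
support = np.flatnonzero(selected)

# Rebuild the boolean mask from the index form.
mask = np.zeros(selected.size, dtype=bool)
mask[support] = True

# Toy matrix with 11 columns (assumption: stands in for the feature matrix X).
X_toy = np.arange(22).reshape(2, 11)
by_mask = X_toy[:, selected]   # boolean indexing
by_index = X_toy[:, support]   # integer (fancy) indexing

print(np.array_equal(mask, selected), np.array_equal(by_mask, by_index))
```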


Thus, a naive approach that exhaustively searches all subsets of p features would have to evaluate 2^p solutions, which is very inefficient. To establish a baseline, however, we will run an exhaustive search restricted to subsets of exactly 5 features. With the 11 features in this dataset, that limits the search to C(11, 5) = 462 subsets.
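
As a quick sanity check on the size of the search space, the counts can be computed directly. This is a small sketch that assumes p = 11, the number of feature columns in this dataset, and uses only the standard library.

```python
from math import comb

p = 11  # number of candidate features in the wine dataset

total_subsets = 2 ** p          # every possible boolean selection vector
subsets_of_five = comb(p, 5)    # selections with exactly 5 features

print(total_subsets, subsets_of_five)

# Summing the fixed-size counts over all sizes recovers 2^p.
assert sum(comb(p, k) for k in range(p + 1)) == total_subsets
```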

In [13]:
def search_combinations(estimator, X, y, k = 5):
    # fit a fresh copy of the estimator on some subset of features and score it
    # on the same data (for LinearRegression, score() returns R^2)
    score = lambda X_features: clone(estimator).fit(X_features, y).score(X_features, y)

    # enumerate all combinations of k features
    for subset in itertools.combinations(range(X.shape[1]), k):
        yield score(X[:, subset]), subset
        
sorted(search_combinations(LinearRegression(), X, y), reverse = True)[:3]

[(0.35149423850184036, (1, 4, 6, 9, 10)),
 (0.34831403567393349, (1, 4, 8, 9, 10)),
 (0.34695788821029638, (1, 6, 8, 9, 10))]
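
The subsets above are reported as column indices. To read them as feature names, the indices can be mapped back to the column headers shown in the `df.head()` output. The sketch below hard-codes those headers in a hypothetical `feature_names` list so that it runs on its own; in the notebook, `df.columns[:-1]` would serve the same purpose.

```python
# Feature columns in the order they appear in the dataset (target "quality" excluded).
feature_names = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]

# Best-scoring subset from the exhaustive search above.
best_subset = (1, 4, 6, 9, 10)
best_names = [feature_names[i] for i in best_subset]
print(best_names)
```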