Feature selection reduces the dimensionality of the data, which brings several benefits:
- Reduces overfitting by removing noise introduced by some of the features
- Reduces training time, which allows you to experiment more with different models and hyperparameters
- Reduces data acquisition requirements
- Improves comprehensibility of the model, because a smaller set of features is easier for humans to understand. This lets us focus on the main sources of predictability

Feature selection methods generally fall into two categories: Filter Methods and Wrapper Methods.

- Filter Methods: Apply a statistical measure to assign a score to each feature, one at a time. Examples include Pearson's X^2 test and ANOVA F-value based feature selection.

- Wrapper Methods: Train a model on a subset of features and, based on the results, add features to or remove features from that subset. The problem is essentially reduced to a search problem. Because wrapper methods are usually computationally very expensive, greedy algorithms (https://en.wikipedia.org/wiki/Greedy_algorithm) are the usual choice in multivariate feature selection scenarios. Greedy algorithms do not necessarily find the optimal subset, but because they evaluate far fewer candidate subsets they can also be less prone to overfitting the selection to the training data. Examples include Forward Selection, Backward Elimination, and Recursive Feature Elimination.

In [2]:
import os, sys
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_selection import RFE
from sklearn.base import clone
import itertools

In [3]:
datasource = "datasets/winequality-red.csv"
print(os.path.exists(datasource))

True


In [4]:
df = pd.read_csv(datasource).sample(frac = 1).reset_index(drop = True)
del df["Unnamed: 0"]
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.6,0.39,0.31,2.3,0.082,23.0,71.0,0.9982,3.52,0.65,9.7,5
1,7.6,0.685,0.23,2.3,0.111,20.0,84.0,0.9964,3.21,0.61,9.3,5
2,8.0,0.52,0.25,2.0,0.078,19.0,59.0,0.99612,3.3,0.48,10.2,5
3,7.6,0.55,0.21,2.2,0.071,7.0,28.0,0.9964,3.28,0.55,9.7,5
4,7.7,0.57,0.21,1.5,0.069,4.0,9.0,0.99458,3.16,0.54,9.8,6


In [5]:
X = np.array(df.iloc[:, :-1])
y = np.array(df["quality"])

## Feature selection solution space

From an algorithm analysis point of view, a solution to a feature selection problem can be represented as a boolean vector, where each component indicates whether the corresponding feature has been selected.

In [6]:
selected = np.array([False, True, True, False, False, True, True, False, False, False, True])
print(selected)

[False  True  True False False  True  True False False False  True]


scikit-learn calls this selection the "support". Its selectors expose the support as a boolean mask, and the corresponding column indices can be obtained using np.flatnonzero().

https://docs.scipy.org/doc/numpy/reference/generated/numpy.flatnonzero.html

Return indices that are non-zero in the flattened version of a.


In [7]:
support = np.flatnonzero(selected)
print(support)

[ 1  2  5  6 10]
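The mask and index forms carry the same information, so it is easy to move between them. The following is a small sketch (the variable names `mask`, `indices`, and `rebuilt` are illustrative only) that converts a mask to indices and back again:

```python
import numpy as np

mask = np.array([False, True, True, False, False, True, True, False, False, False, True])

# boolean mask -> integer column indices
indices = np.flatnonzero(mask)

# integer column indices -> boolean mask of the same length
rebuilt = np.zeros(mask.shape[0], dtype=bool)
rebuilt[indices] = True

print(indices)
print(np.array_equal(mask, rebuilt))
```

Either form can be used to index columns, for example `X[:, mask]` or `X[:, indices]`.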


Thus, a naive approach that exhaustively searches all subsets of p features would have to evaluate 2^p solutions, which is very inefficient. However, to establish a baseline, we will run an exhaustive search over only those solutions that contain exactly 5 features. This bounds the running time.
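To make the sizes concrete, here is a quick count, assuming p = 11 feature columns as in the X built above (the variable names are illustrative):

```python
from math import comb

p = 11  # number of feature columns, assumed from the dataset above
k = 5   # subset size used for the baseline

all_subsets = 2 ** p         # every possible boolean selection vector
size_k_subsets = comb(p, k)  # selections with exactly k features

print(all_subsets, size_k_subsets)
```

For this dataset the gap is modest, but 2^p doubles with every added feature, so the full search quickly becomes impractical.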

In [8]:
def search_combinations(estimator, X, y, k = 5):
    # fit and score model based on some subset of features
    score = lambda X_features: clone(estimator).fit(X_features, y).score(X_features, y)
    
    # enumerate all combinations of k features
    for subset in itertools.combinations(range(X.shape[1]), k):
        yield score(X[:, subset]), subset

In [9]:
scores = search_combinations(LinearRegression(), X, y)

# Feed it a model, X, and y.
# It lazily iterates over every combination of exactly 5 features
# and fits/scores a fresh model on each one.

In [10]:
sorted(scores, reverse = True)[:5]

[(0.35149423850184036, (1, 4, 6, 9, 10)),
 (0.34831403567393349, (1, 4, 8, 9, 10)),
 (0.34695788821029638, (1, 6, 8, 9, 10)),
 (0.34667490118495603, (0, 1, 4, 9, 10)),
 (0.34580097696162626, (0, 1, 6, 9, 10))]

## Wrapper Methods

### Forward Selection

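Forward selection is an iterative method that starts with no features in the model. In each iteration, we add the feature that most improves the model, until k features are selected. The score used here is the estimator's score() on the training data, which for LinearRegression is R^2, so the "accuracy" printed below is R^2 expressed as a percentage.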

In [11]:
def forward_select(estimator, X, y, k = 2):
    # this array holds indicators of whether each feature is currently selected
    selected = np.zeros(X.shape[1]).astype(bool)
    
    # fit and score the model based on some subset of features
    score = lambda X_features: clone(estimator).fit(X_features, y).score(X_features, y)
    
    # find indices to selected columns
    selected_indices = lambda: list(np.flatnonzero(selected))
    
    # repeated til k features are selected
    while np.sum(selected) < k:
        # indices to unselected column
        rest_indices = list(np.flatnonzero(~selected))
        
        # compute model scores with an additional feature
        scores = [score(X[:, selected_indices() + [i]]) for i in rest_indices]
        
        print("\n * accuracy if adding column: \n    ", {i: int(s * 100) for i, s in zip(rest_indices, scores)})
        
        # find index within "rest_indices" that points to the most predictive feature not yet selected
        idx_to_add = rest_indices[np.argmax(scores)]
        print("add column", idx_to_add)
        
        # select this new feature
        selected[idx_to_add] = True
        print("================================")
        
    return selected_indices()

In [12]:
support = sorted(forward_select(LinearRegression(), X, y))


 * accuracy if adding column: 
     {0: 1, 1: 15, 2: 5, 3: 0, 4: 1, 5: 0, 6: 3, 7: 3, 8: 0, 9: 6, 10: 22}
add column 10

 * accuracy if adding column: 
     {0: 25, 1: 31, 2: 25, 3: 22, 4: 22, 5: 22, 6: 23, 7: 23, 8: 25, 9: 26}
add column 1


### Backwards Elimination

Backwards elimination does the opposite: we start with all the features and remove the least significant feature at each iteration, until only k features remain.

In [13]:
def backwards_eliminate(estimator, X, y, k = 5):
    # this array holds indicators of whether each feature is currently selected
    selected = np.ones(X.shape[1]).astype(bool)
    
    # fit and score model based on some subset of features
    score = lambda X_features: clone(estimator).fit(X_features, y).score(X_features, y)
    
    # find indices to selected columns
    selected_indices = lambda: list(np.flatnonzero(selected))
    
    # repeat til k features are selected
    while np.sum(selected) > k:
        # compute model scores with one of the features removed
        scores = [score(X[:, list(set(selected_indices()) - {i})]) for i in selected_indices()]
        print("\n accuracy if removing column: \n", {i: int(s*100) for i, s in zip(selected_indices(), scores)})
        
        # find index that points to the least predictive feature
        idx_to_remove = selected_indices()[np.argmax(scores)]
        print("remove column", idx_to_remove)
        
        # remove this feature
        selected[idx_to_remove] = False
        print("================================")
        
    return selected_indices()

In [14]:
support = sorted(backwards_eliminate(LinearRegression(), X, y))


 accuracy if removing column: 
 {0: 36, 1: 32, 2: 35, 3: 36, 4: 35, 5: 35, 6: 35, 7: 36, 8: 35, 9: 33, 10: 31}
remove column 7

 accuracy if removing column: 
 {0: 36, 1: 32, 2: 35, 3: 36, 4: 35, 5: 35, 6: 35, 8: 35, 9: 33, 10: 24}
remove column 0

 accuracy if removing column: 
 {1: 32, 2: 35, 3: 35, 4: 35, 5: 35, 6: 35, 8: 35, 9: 33, 10: 24}
remove column 3

 accuracy if removing column: 
 {1: 32, 2: 35, 4: 35, 5: 35, 6: 35, 8: 35, 9: 33, 10: 24}
remove column 2

 accuracy if removing column: 
 {1: 31, 4: 34, 5: 35, 6: 34, 8: 35, 9: 33, 10: 24}
remove column 5

 accuracy if removing column: 
 {1: 31, 4: 34, 6: 34, 8: 35, 9: 33, 10: 23}
remove column 8
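
To compare the cost of the two greedy searches above with the exhaustive baseline, we can count model fits. This is a rough sketch that assumes one fit per candidate feature per round, as in the forward_select and backwards_eliminate functions above; the helper names `forward_fits` and `backward_fits` are made up for this illustration.

```python
from math import comb

def forward_fits(p, k):
    # round i (0-based) tries each of the p - i features not yet selected
    return sum(p - i for i in range(k))

def backward_fits(p, k):
    # each round tries dropping each of the m features still selected,
    # for m = p, p - 1, ..., k + 1
    return sum(range(k + 1, p + 1))

p = 11  # number of feature columns, assumed from the dataset above
print("forward, k = 5: ", forward_fits(p, 5))
print("backward, k = 5:", backward_fits(p, 5))
print("exhaustive, k = 5:", comb(p, 5))
```

Both greedy searches grow roughly quadratically in p, whereas the number of size-k subsets grows combinatorially.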


### Recursive Feature Elimination

Recursive feature elimination (RFE) is an even greedier algorithm that finds a well-performing feature subset efficiently. Instead of refitting the model once per candidate feature, it fits the model once per round and drops the least important feature. The importance of each feature is obtained through either a coef_ attribute or a feature_importances_ attribute, so the model must provide one of these for recursive feature elimination to work.

We typically start with a low-complexity model and use it as a benchmark for feature selection.
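
The loop behind this idea can be sketched by hand. The following is a simplified illustration, not scikit-learn's implementation: it assumes ordinary least squares as the model, ranks features by absolute coefficient, and removes one feature per round. The function name `rfe_sketch` and the synthetic data are hypothetical.

```python
import numpy as np

def rfe_sketch(X, y, n_keep):
    """Repeatedly fit least squares and drop the feature with the smallest |coefficient|."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # append an intercept column, then solve ordinary least squares
        A = np.column_stack([X[:, remaining], np.ones(len(y))])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        # ignore the intercept (last entry) when ranking features
        weakest = int(np.argmin(np.abs(coef[:-1])))
        del remaining[weakest]
    return remaining

# synthetic data in which only columns 0 and 2 influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + 0.1 * rng.normal(size=200)

print(rfe_sketch(X_demo, y_demo, n_keep=2))
```

Note that raw coefficient magnitudes depend on the scale of each feature, so a ranking like this is easier to interpret when the features have been standardized first.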

In [15]:
model = LinearRegression()
selector = RFE(model, n_features_to_select=5)
selector.fit(X, y)

RFE(estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
  n_features_to_select=5, step=1, verbose=0)

In [16]:
print("Number of Features:", selector.n_features_)

Number of Features: 5


In [17]:
print("Selected Features:", np.flatnonzero(selector.support_))

Selected Features: [1 4 7 8 9]


Then we transform the dataset to include only these features.

In [18]:
X_new = selector.transform(X)
print(X_new.shape)

(1599, 5)


## Filter Methods

These methods rank the predictiveness of features one by one, as opposed to evaluating subsets. They use statistical measures to score each feature instead of measuring the accuracy of a model trained on the selected features.
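
As a minimal sketch of what "one feature at a time" means, the following scores each column independently by its absolute Pearson correlation with the target. This score is a stand-in chosen for simplicity, not the X^2 or F-value scores used in this notebook, although for a single feature the regression F-statistic is a monotonic function of the squared correlation, so the ranking should agree with an F-value ranking. The function name `correlation_scores` and the synthetic data are assumptions made for the illustration.

```python
import numpy as np

def correlation_scores(X, y):
    """Absolute Pearson correlation between each column of X and y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# synthetic data in which only columns 1 and 3 influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = 2.0 * X_demo[:, 1] + 1.0 * X_demo[:, 3] + rng.normal(size=500)

demo_scores = correlation_scores(X_demo, y_demo)

# keep the 2 highest-scoring columns, reported in ascending index order
top2 = np.sort(np.argsort(demo_scores)[-2:])
print(demo_scores.round(2))
print(top2)
```

No model is trained at any point, which is why filter methods are much cheaper than wrapper methods. The trade-off is that interactions between features are ignored.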

### Pearson's X^2 test based feature selection
The following computes the X^2 statistic of each feature against the label, one feature at a time, to determine which features are more relevant, and then selects the k features with the highest scores. Note that scikit-learn's chi2 expects non-negative feature values, which holds for this dataset.

In [19]:
selector = SelectKBest(chi2, k = 5)

In [20]:
selector.fit(X, y)

SelectKBest(k=5, score_func=<function chi2 at 0x00000181D77B2158>)

In [23]:
print("X^2 statistic: \n", selector.scores_)

X^2 statistic: 
 [  1.12606524e+01   1.55802891e+01   1.30256651e+01   4.12329474e+00
   7.52425579e-01   1.61936036e+02   2.75555798e+03   2.30432045e-04
   1.54654736e-01   4.55848775e+00   4.64298922e+01]


In [25]:
print("Selected indices: \n", selector.get_support(True))

Selected indices: 
 [ 1  2  5  6 10]


We'll then call transform() to select those feature columns from the dataset and store them in a new variable called X_selected.

In [26]:
X_selected = selector.transform(X)
X_selected.shape

(1599, 5)

Alternatively, we can index the selected columns directly, which does the same thing as transform().

In [28]:
np.allclose(X_selected, X[:, [1, 2, 5, 6, 10]])

True

In [33]:
X_selected[:, 0:2]

array([[ 0.39 ,  0.31 ],
       [ 0.685,  0.23 ],
       [ 0.52 ,  0.25 ],
       ..., 
       [ 0.64 ,  0.07 ],
       [ 0.43 ,  0.29 ],
       [ 0.58 ,  0.54 ]])

Let's take a closer look at this procedure so we can better distinguish between different feature selection methods. The X^2 test has many applications. For feature selection, we use the X^2 statistic to test whether the label depends on each feature.

##### Step 1: Encode labels into orthogonal vector space
This is also known as one-hot encoding, and it applies to classification problems. Consider an example with 3 possible outcome categories: A, B, and C. For machine learning algorithms to handle this type of data, we have to convert the categories to numbers. One-hot encoding uses a vector (y1, y2, y3) where yi = 1 if the result falls into the ith category and 0 otherwise.

Therefore, the one-hot encodings of categories A, B, and C are (1, 0, 0), (0, 1, 0), and (0, 0, 1) respectively. This has an advantage over plainly translating A, B, C into 1, 2, 3 (aka sparse encoding): orthogonal vectors do not impose any assumption about the order of, or the distance between, categories the way numbers would.

For example, 3 > 1 is true, but that is not meant to imply that C > A or that C is superior to A in any way. A model fed the integer codes would nevertheless treat that order and spacing as meaningful.

Therefore, one-hot encoding is a widely adopted technique for processing categories. Sparse encoding could be used when persisting a dataset in order to save storage space. One-hot encoding is how the categorical/nominal variables are encoded as independent predictors in the regression formula. 
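
The A/B/C example above can be sketched in a few lines of NumPy. The label values and variable names here are made up for illustration; indexing an identity matrix is just one simple way to build the vectors.

```python
import numpy as np

categories = ["A", "B", "C"]
labels = np.array(["B", "A", "C", "A"])

# integer codes: position of each label in `categories` (A=0, B=1, C=2)
codes = np.array([categories.index(v) for v in labels])

# one-hot: row i of the identity matrix is the vector for category i
one_hot = np.eye(len(categories), dtype=int)[codes]
print(one_hot)

# distinct categories are orthogonal and equally far apart
a, b, c = np.eye(3, dtype=int)
print(a @ c, np.linalg.norm(a - b) == np.linalg.norm(a - c))
```

With integer codes, C would sit twice as far from A as B does. With one-hot vectors, every pair of categories is the same distance apart.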