# Feature Selection

In this case study, my objective is to code basic machine learning feature selection techniques in Python using Scikit-Learn. The usual motivations apply: reducing overfitting, increasing model accuracy, and avoiding the dreaded "curse of dimensionality."

The objectives are to:
1. Demonstrate univariate filtering methods of feature selection such as SelectKBest.
2. Demonstrate wrapper-based feature selection methods such as Recursive Feature Elimination.
3. Demonstrate feature importance estimation, dimensionality reduction, and lasso regularization techniques.

## Defining Terms related to Feature Selection and Dimensionality Reduction

**Feature Selection vs. Dimensionality Reduction (Feature Extraction)**

Feature selection is different from dimensionality reduction. Both seek to reduce the number of features in the dataset, but dimensionality reduction methods do so by creating new combinations of attributes, whereas feature selection methods include or exclude existing features without changing them.
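
To make the distinction concrete, here is a minimal sketch on toy data. The array `X_toy`, the kept column indices, and the projection matrix `W` are all made up for illustration; a real extraction method such as PCA would learn its weights from the data.

```python
import numpy as np

# Toy data: 5 samples, 3 features (values are illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5, 3))

# Feature selection: keep columns 0 and 2 exactly as they are.
keep = [0, 2]
X_selected = X_toy[:, keep]

# Feature extraction: build 2 new features as weighted sums of all 3 columns.
# W is an arbitrary, hypothetical projection matrix chosen by hand.
W = np.array([[0.5, 0.1],
              [0.5, -0.7],
              [0.7, 0.7]])
X_extracted = X_toy @ W

print(X_selected.shape, X_extracted.shape)
# The selected columns are unchanged copies of the originals.
print(np.array_equal(X_selected[:, 0], X_toy[:, 0]))
```

Both results have two columns, but only the selected columns can still be read as original measurements.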

**Feature Selection:**

1. Filter Methods: apply a statistical measure to score each feature, then rank the features by score and keep or remove them from the dataset. These methods are usually univariate: each feature is considered independently, or with regard to the dependent variable. Example: Pearson correlation.
2. Wrapper Methods: treat feature selection as a search problem in which different combinations of features are evaluated and compared on a model metric such as accuracy. Example: backward sequential feature selection.
3. Embedded Methods: learn which features contribute most to a model metric, such as accuracy, while the model is training. Example: regularization (LASSO and Elastic Net; Ridge Regression shrinks coefficients but does not set them exactly to zero, so it does not remove features by itself). Other examples: Random Forest and other decision tree techniques.

**Dimensionality Reduction (Feature Extraction):**

1. Principal Component Analysis
2. Singular Value Decomposition
3. Linear Discriminant Analysis

**Benefits of Feature Selection**

Why perform feature selection?

1. Reduces overfitting, which enhances generalization
2. Speeds up model training
3. Permits better understanding of model dynamics and relationships between datapoints, allowing for better interpretation
4. Improves model accuracy (when done correctly)

**Side Note**

A review of some univariate statistical measures that can be used for filter-based feature selection:

- Numerical input, numerical output
    - Pearson’s correlation coefficient (linear)
    - Spearman’s rank coefficient (nonlinear)
- Numerical input, categorical output
    - ANOVA F-statistic (linear)
    - Kendall’s rank coefficient (nonlinear)
- Categorical input, numerical output
    - rare, but the "numerical input, categorical output" methods can be applied with the roles of input and output reversed
- Categorical input, categorical output
    - Chi-Squared test (contingency tables)
    - Mutual Information
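
As a rough sketch of how these measures can be computed, the block below uses `scipy.stats` and `sklearn.feature_selection` on synthetic data. The variables (`x_num`, `y_num`, `y_cat`, `x_cat`) and the way they are generated are assumptions for illustration, not part of the diabetes dataset.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau
from sklearn.feature_selection import f_classif, chi2, mutual_info_classif

rng = np.random.default_rng(42)
n = 200

# Numerical input, numerical output.
x_num = rng.normal(size=n)
y_num = 2.0 * x_num + rng.normal(scale=0.5, size=n)
r_pearson, _ = pearsonr(x_num, y_num)    # linear association
r_spearman, _ = spearmanr(x_num, y_num)  # rank-based association

# Numerical input, categorical output (labels encoded as 0/1).
y_cat = (y_num > 0).astype(int)
f_scores, f_pvalues = f_classif(x_num.reshape(-1, 1), y_cat)  # ANOVA F-statistic
tau, _ = kendalltau(x_num, y_cat)                             # rank-based

# Categorical input, categorical output.
# chi2 expects non-negative values such as counts or 0/1 codes.
x_cat = (x_num > 0).astype(int).reshape(-1, 1)
chi2_scores, chi2_pvalues = chi2(x_cat, y_cat)
mi_scores = mutual_info_classif(x_cat, y_cat, discrete_features=True, random_state=0)

print(f"Pearson r: {r_pearson:.3f}, Spearman rho: {r_spearman:.3f}, Kendall tau: {tau:.3f}")
print("ANOVA F:", f_scores, "chi2:", chi2_scores, "mutual information:", mi_scores)
```

The scikit-learn scoring functions return one score per feature column, which is why they can be passed directly to selectors such as `SelectKBest`.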

In [2]:
import pandas as pd
import numpy as np

## Introduce Algorithms with Embedded Feature Selection

In [3]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

In [1]:
URL = 'https://pkgstore.datahub.io/machine-learning/diabetes/diabetes_csv/data/e5ef1d87d57240919ec9990c580355c2/diabetes_csv.csv'

In [4]:
data = pd.read_csv(URL)

In [5]:
data.head()

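|   | preg | plas | pres | skin | insu | mass | pedi | age | class |
|---|------|------|------|------|------|------|-------|-----|-----------------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | tested_positive |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | tested_negative |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | tested_positive |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | tested_negative |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | tested_positive |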


In [6]:
X = data.drop('class', axis=1).values
y = data['class'].values
print(X.shape, y.shape)

(768, 8) (768,)


In [None]:
# SelectKBest


In [None]:
# f_classif

## Demonstrate two Univariate Selection Methods: Pearson Correlation Filtering and SelectKBest f_classif
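
A minimal sketch of both filters follows. To keep it self-contained it runs on a synthetic stand-in with the same shape as the diabetes data (768 rows, 8 features, binary target); `X_demo`, `y_demo`, the column names `f0`..`f7`, and the choice of `k=4` are assumptions, not the author's settings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the diabetes data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4, n_redundant=0, random_state=0
)
names = [f"f{i}" for i in range(X_demo.shape[1])]  # hypothetical column names
df = pd.DataFrame(X_demo, columns=names)

# 1) Pearson correlation filtering: rank features by |correlation| with the target.
corr = df.corrwith(pd.Series(y_demo)).abs().sort_values(ascending=False)
top_by_corr = corr.index[:4].tolist()

# 2) SelectKBest with the ANOVA F-test (f_classif): keep the 4 highest-scoring features.
selector = SelectKBest(score_func=f_classif, k=4)
X_kbest = selector.fit_transform(X_demo, y_demo)
top_by_f = [names[i] for i in selector.get_support(indices=True)]

print("Top 4 by |Pearson r|:", sorted(top_by_corr))
print("Top 4 by ANOVA F:    ", sorted(top_by_f))
print(X_kbest.shape)
```

With a binary target the two filters pick the same features, because the two-group ANOVA F-statistic is a monotone function of the squared Pearson correlation between a feature and the 0/1 label. With more than two classes the rankings can differ.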

## Demonstrate two Wrapper Methods: Backward Sequential and RFE
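
One way to sketch both wrappers with scikit-learn is shown below, again on a synthetic stand-in rather than the diabetes data. The logistic regression estimator, the target of 4 features, and 5-fold cross-validation are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the diabetes data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

estimator = LogisticRegression(max_iter=1000)

# Backward sequential selection: start from all features and repeatedly drop
# the one whose removal hurts the cross-validated score the least.
sfs = SequentialFeatureSelector(
    estimator, n_features_to_select=4, direction="backward", cv=5
)
sfs.fit(X_demo, y_demo)

# Recursive Feature Elimination: refit the model and drop the feature with the
# smallest absolute coefficient, one feature per round (step=1).
rfe = RFE(estimator, n_features_to_select=4, step=1)
rfe.fit(X_demo, y_demo)

print("Backward SFS keeps columns:", sfs.get_support(indices=True))
print("RFE keeps columns:         ", rfe.get_support(indices=True))
print("RFE ranking (1 = kept):    ", rfe.ranking_)
```

The two methods differ in what drives each elimination: sequential selection uses a cross-validated score, while RFE relies on the fitted model's own coefficients or importances, which makes it cheaper but dependent on those weights being comparable across features.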

## Demonstrate Feature Importance Estimation using Bagged Decision Trees
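
A possible sketch is given below. It uses `RandomForestClassifier` with `max_features=None`, which makes every tree consider all features at each split, so the forest reduces to plain bagging of decision trees while still exposing `feature_importances_`. The synthetic data and the number of trees are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the diabetes data (assumption, not the real dataset).
# shuffle=False keeps the 4 informative features in the first 4 columns.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4, n_redundant=0,
    shuffle=False, random_state=0
)

# Bagged decision trees: bootstrap samples, all features available at every split.
bagged_trees = RandomForestClassifier(
    n_estimators=100, max_features=None, bootstrap=True, random_state=0
)
bagged_trees.fit(X_demo, y_demo)

# Impurity-based importances, averaged over the trees and normalized to sum to 1.
importances = bagged_trees.feature_importances_
ranking = np.argsort(importances)[::-1]

for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

Impurity-based importances are computed on the training data and can favor features with many distinct values, so they are best read as a rough ranking rather than a precise measurement.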

## Dimensionality Reduction using Principal Component Analysis
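
The following is a minimal sketch of PCA on a synthetic stand-in for the diabetes features. Standardizing first and keeping 3 components are assumptions made for illustration; PCA is sensitive to feature scale, which is why the scaler is included.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

# Standardize, then project onto the 3 directions of largest variance.
pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=3))
X_pca = pca_pipeline.fit_transform(X_demo)

# Fraction of total variance captured by each component, in decreasing order.
explained = pca_pipeline.named_steps["pca"].explained_variance_ratio_

print(X_pca.shape)
print(explained, explained.sum())
```

Unlike the selection methods, each of the three output columns is a linear combination of all eight inputs, and the target `y_demo` plays no role in fitting.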

## Demonstrate Lasso Regularization
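
Lasso is a regression model, so this sketch uses a synthetic regression problem from `make_regression` instead of the diabetes classification target. The dataset, the noise level, and `alpha=1.0` are all assumptions chosen for illustration. For a classification target, an L1-penalized classifier would play the same role.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: 8 features, only 3 of which truly affect the target.
X_reg, y_reg, true_coef = make_regression(
    n_samples=768, n_features=8, n_informative=3, noise=5.0,
    coef=True, random_state=0
)

# Put features on a common scale so the L1 penalty treats them evenly.
X_scaled = StandardScaler().fit_transform(X_reg)

# The L1 penalty drives the coefficients of unhelpful features to exactly zero.
lasso = Lasso(alpha=1.0)
lasso.fit(X_scaled, y_reg)

kept = np.flatnonzero(lasso.coef_)
print("True informative columns:", np.flatnonzero(true_coef))
print("Columns kept by Lasso:   ", kept)
print("Coefficients:", np.round(lasso.coef_, 2))
```

Raising `alpha` removes more features and lowering it removes fewer, so in practice the value is usually tuned by cross-validation.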

## Expanding concepts to hyperparameter optimization and model selection
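
One way to connect feature selection to hyperparameter optimization is to treat the number of selected features as just another parameter to tune. The sketch below does this with a `Pipeline` and `GridSearchCV`; the synthetic data, the step names, and the grid values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

# Selection lives inside the pipeline, so it is refit on each training fold
# and never sees the held-out fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Search jointly over how many features to keep and how strongly to regularize.
param_grid = {
    "select__k": [2, 4, 6, 8],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Keeping the selector inside the pipeline matters: selecting features on the full dataset before cross-validation leaks information from the validation folds and inflates the reported score.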