# Feature Selection



Feature selection is a process where you automatically or manually select those features in your data that contribute most to the prediction variable or output in which you are interested.

There are many ways to perform feature selection. Some methods include:

- Using a correlation matrix to select features that are highly correlated with the output variable
- Using statistical tests (e.g., $\chi^2$) to measure the statistical significance of each feature in relation to the label
- Using a recursive feature elimination algorithm to automatically select features that are most relevant to the output variable
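As a rough illustration of the first approach in the list above, the sketch below ranks features by the absolute Pearson correlation between each feature and the label. It assumes the breast cancer dataset used later in this notebook; the cutoff of 5 features (`top_k`) is an arbitrary choice made for illustration, not a recommendation.

```python
import numpy as np
from sklearn import datasets

# Load the breast cancer dataset together with its feature names
data = datasets.load_breast_cancer()
x, y = data.data, data.target

# Pearson correlation between each feature column and the label
correlations = np.array(
    [np.corrcoef(x[:, i], y)[0, 1] for i in range(x.shape[1])]
)

# Rank features by absolute correlation and keep the strongest ones
top_k = 5  # arbitrary cutoff, assumed for this sketch
top_indices = np.argsort(np.abs(correlations))[::-1][:top_k]

for i in top_indices:
    print(f"{data.feature_names[i]}: {correlations[i]:.3f}")
```

Correlation only captures linear relationships, so a feature with a low score here may still be useful to a non-linear model.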

Feature selection is important in machine learning because it reduces the amount of data you need to work with, which in turn reduces the time and resources required to train and tune your model. Additionally, by selecting only the most relevant features, you can improve the interpretability of your model and make overfitting less likely to occur.

Scikit Learn supports many feature selection techniques. We will go through some of them; to learn more, visit [Scikit Learn's Documentation](https://scikit-learn.org/stable/modules/feature_selection.html).

In [8]:
from sklearn import datasets, model_selection, feature_selection, svm, metrics, pipeline, preprocessing

In [9]:
# Load the breast cancer dataset
x, y = datasets.load_breast_cancer(return_X_y=True)

x.shape

(569, 30)

## Feature Selection Methods

### Univariate Feature Selection

Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator.

We will be using the `SelectKBest` class from Scikit Learn, which returns the best $k$ features according to a scoring function based on a statistical significance test such as $\chi^2$ (chi-square).

Scikit Learn includes additional scoring functions like:

- For classification: `chi2`, `f_classif`, `mutual_info_classif`
- For regression: `f_regression`, `mutual_info_regression`
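To see how the choice of scoring function can change the result, the sketch below runs `SelectKBest` with two of the classification scorers listed above and prints the feature names each one keeps. The value `k=5` and the dictionary name `selected_by_func` are assumptions made for this illustration. Note that `chi2` requires non-negative feature values, which holds for the breast cancer dataset.

```python
from sklearn import datasets, feature_selection

# Load the breast cancer dataset together with its feature names
data = datasets.load_breast_cancer()
x, y = data.data, data.target

score_funcs = {
    "chi2": feature_selection.chi2,            # requires non-negative features
    "f_classif": feature_selection.f_classif,  # ANOVA F-value
}

selected_by_func = {}
for name, score_func in score_funcs.items():
    # Fit a selector that keeps the 5 highest-scoring features
    selector = feature_selection.SelectKBest(score_func, k=5).fit(x, y)
    # get_support() returns a boolean mask over the original columns
    selected_by_func[name] = list(data.feature_names[selector.get_support()])
    print(name, "->", selected_by_func[name])
```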



In [10]:
# Define the feature selector by selecting the scoring function and the number of features to select
feature_selector = feature_selection.SelectKBest(feature_selection.chi2, k=10)

# Use fit transform to train the feature selector and return the best 10 features
x_new = feature_selector.fit_transform(x, y)

# New X will have 10 features instead of 30
x_new.shape

(569, 10)
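The selector above is fit on the full dataset before any train/test split, so the test rows influence which features are chosen. One possible alternative, shown here only as a minimal sketch, is to split first and place the selector inside a `Pipeline` so that it is fit on the training rows alone. The step names (`select`, `scaler`, `model`) and the SVC settings are assumptions chosen for this illustration.

```python
from sklearn import (datasets, feature_selection, model_selection,
                     pipeline, preprocessing, svm)

x, y = datasets.load_breast_cancer(return_X_y=True)

# Split first so the test rows never influence feature selection
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=42
)

clf = pipeline.Pipeline([
    # chi2 needs non-negative inputs, so it runs before standardization
    ("select", feature_selection.SelectKBest(feature_selection.chi2, k=10)),
    ("scaler", preprocessing.StandardScaler()),
    ("model", svm.SVC(kernel="rbf")),
])

# Fitting the pipeline fits the selector on the training data only
clf.fit(x_train, y_train)

print("Selected feature indices:",
      clf.named_steps["select"].get_support(indices=True))
print("Test accuracy:", clf.score(x_test, y_test))
```

With this layout the same pipeline object can also be passed to cross-validation utilities, and the selection step is then refit inside every fold.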

## Training & Evaluating

In [11]:

x_train, x_test, y_train, y_test = model_selection.train_test_split(x_new, y, test_size=0.3, stratify=y, random_state=42)

In [12]:
print(f"Training model on {x_train.shape[1]} features")

# Define a pipeline that scales the features and then fits an RBF-kernel Support Vector Machine (training capped at 10 iterations)
model = pipeline.Pipeline([
                          ('scaler',preprocessing.StandardScaler()),
                          ('model',svm.SVC(kernel='rbf',random_state=10,max_iter=10))
])

# Train model using the training dataset
model.fit(x_train, y_train)

Training model on 10 features




Pipeline(steps=[('scaler', StandardScaler()),
                ('model', SVC(max_iter=10, random_state=10))])

In [13]:
# Use the model to predict the testing set to prepare for calculating the metrics
y_pred = model.predict(x_test)

# Calculate and print relevant scores
accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.7777777777777778
Precision: 0.8876404494382022
Recall: 0.7383177570093458


# Feature Selection Using Select From Model

[SelectFromModel](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel) is a meta-transformer that can be used alongside any estimator that assigns an importance to each feature after fitting, either through a specific attribute (such as `coef_` or `feature_importances_`) or via an `importance_getter` callable.

A feature is considered unimportant and removed if its importance value is below the provided `threshold` parameter.

Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are `"mean"`, `"median"`, and float multiples of these, such as `"0.1*mean"`.

In combination with the threshold criterion, the `max_features` parameter can be used to limit the number of features selected.
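The sketch below illustrates the `threshold` and `max_features` parameters described above. It assumes a `RandomForestClassifier` as the importance-providing estimator, which is a choice made for this example only (the notebook itself uses `LinearSVC`); the cap of 5 features is likewise arbitrary.

```python
from sklearn import datasets, ensemble, feature_selection

x, y = datasets.load_breast_cancer(return_X_y=True)

# Keep features whose importance is at least the median importance
median_selector = feature_selection.SelectFromModel(
    ensemble.RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median",
)
x_median = median_selector.fit_transform(x, y)

# Same threshold, but additionally cap the number of selected features
capped_selector = feature_selection.SelectFromModel(
    ensemble.RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median",
    max_features=5,
)
x_capped = capped_selector.fit_transform(x, y)

print("threshold='median':", x_median.shape)
print("threshold='median', max_features=5:", x_capped.shape)
```

Here the selector is not created with `prefit=True`, so calling `fit_transform` trains the forest before the threshold is applied.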

## Define and Train the Linear Model

In [14]:
lsvc = svm.LinearSVC()
lsvc.fit(x, y)



LinearSVC()

## Use the Trained Linear Model to Get the Best Features

In [16]:
feature_selector = feature_selection.SelectFromModel(lsvc, prefit=True)
# Get the best features using the pretrained model
x_new = feature_selector.transform(x)
x_new.shape

(569, 8)

## Split the Dataset into Train and Test

In [17]:
x_train, x_test, y_train, y_test = model_selection.train_test_split(x_new, y, test_size=0.3, stratify=y, random_state=42)

## Training & Evaluating

In [20]:
print(f"Training model on {x_train.shape[1]} features")

# Define a pipeline that scales the features and then fits an RBF-kernel Support Vector Machine (training capped at 10 iterations)
model = pipeline.Pipeline([
                          ('scaler',preprocessing.StandardScaler()),
                          ('model',svm.SVC(kernel='rbf',random_state=10,max_iter=10))
])

# Train model using the training dataset
model.fit(x_train, y_train)

Training model on 8 features




Pipeline(steps=[('scaler', StandardScaler()),
                ('model', SVC(max_iter=10, random_state=10))])

In [21]:
# Use the model to predict the testing set to prepare for calculating the metrics
y_pred = model.predict(x_test)

# Calculate and print relevant scores
accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.9298245614035088
Precision: 0.9279279279279279
Recall: 0.9626168224299065


# Recursive Feature Elimination

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features, and the importance of each feature is obtained through a specific attribute (such as `coef_` or `feature_importances_`). Then, the least important features are pruned from the current set of features. This procedure is repeated on the pruned set until the desired number of features is reached.
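The procedure described above can be sketched by hand as a simple loop. This is an illustrative approximation, not Scikit Learn's implementation: it assumes a `LogisticRegression` estimator on standardized features, removes one feature per iteration, and uses the absolute coefficient as the importance. The variable names (`remaining`, `weakest`) are hypothetical.

```python
import numpy as np
from sklearn import datasets, linear_model, preprocessing

x, y = datasets.load_breast_cancer(return_X_y=True)

# Standardize so coefficient magnitudes are comparable across features
x_scaled = preprocessing.StandardScaler().fit_transform(x)

n_features_to_select = 12
remaining = list(range(x_scaled.shape[1]))  # column indices still in play

while len(remaining) > n_features_to_select:
    # 1. Train the estimator on the current feature set
    estimator = linear_model.LogisticRegression(max_iter=1000)
    estimator.fit(x_scaled[:, remaining], y)

    # 2. Use the absolute coefficient of each feature as its importance
    importances = np.abs(estimator.coef_).ravel()

    # 3. Prune the least important feature and repeat
    weakest = int(np.argmin(importances))
    del remaining[weakest]

print("Kept feature indices:", remaining)
```

In practice the `RFE` class handles this loop, including removing more than one feature per iteration through its `step` parameter.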



In [26]:
# We use a linear kernel because RFE requires a model that has a coef_ or feature_importances_ attribute
svc = svm.SVC(kernel='linear')

# Define the feature selector by selecting the estimator, the number of features to select, and the number of features to remove in each iteration
rfe = feature_selection.RFE(estimator=svc, n_features_to_select=12, step=1)

# Use fit transform to train the feature selector and return the best 12 features
x_new = rfe.fit_transform(x, y)

# New X will have 12 features instead of 30
x_new.shape

(569, 12)

## Training & Evaluating

In [27]:
use_pruned_features = True #@param {type:"boolean"}
# Choose between the pruned feature set and the full feature set
x_final = x_new if use_pruned_features else x

x_train, x_test, y_train, y_test = model_selection.train_test_split(x_final, y, test_size=0.3, stratify=y, random_state=42)

In [28]:
print(f"Training model on {x_train.shape[1]} features")

# Define a pipeline that scales the features and then fits an RBF-kernel Support Vector Machine (training capped at 10 iterations)
model = pipeline.Pipeline([
                          ('scaler',preprocessing.StandardScaler()),
                          ('model',svm.SVC(kernel='rbf',random_state=10,max_iter=10))
])

# Train model using the training dataset
model.fit(x_train, y_train)

Training model on 12 features




Pipeline(steps=[('scaler', StandardScaler()),
                ('model', SVC(max_iter=10, random_state=10))])

In [29]:
# Use the model to predict the testing set to prepare for calculating the metrics
y_pred = model.predict(x_test)

# Calculate and print relevant scores
accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.935672514619883
Precision: 0.9363636363636364
Recall: 0.9626168224299065


# Recursive Feature Elimination CV

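`RFECV` performs recursive feature elimination inside a cross-validation loop and uses the cross-validated score to choose how many features to keep, so the number of features does not have to be specified in advance.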
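As a small sketch of what the fitted selector exposes, the example below fits `RFECV` and reads back the number of features it chose. The estimator (`LogisticRegression` on standardized features), the 5-fold setting, and the accuracy scoring are assumptions made for this illustration and differ from the cell that follows.

```python
from sklearn import datasets, feature_selection, linear_model, preprocessing

x, y = datasets.load_breast_cancer(return_X_y=True)

# Standardize up front to keep the sketch short; a stricter setup
# would scale inside each cross-validation fold
x_scaled = preprocessing.StandardScaler().fit_transform(x)

rfecv = feature_selection.RFECV(
    estimator=linear_model.LogisticRegression(max_iter=1000),
    step=1,                    # remove one feature per iteration
    cv=5,                      # 5-fold cross-validation
    scoring="accuracy",
    min_features_to_select=1,
)
rfecv.fit(x_scaled, y)

# n_features_ is the feature count with the best cross-validated score
print("Number of features chosen:", rfecv.n_features_)
print("Selected feature indices:", rfecv.get_support(indices=True))
```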



In [35]:
# Reuse the LinearSVC estimator defined earlier; RFECV requires a model that has a coef_ or feature_importances_ attribute
# Define the feature selector by selecting the estimator and the scoring metric used to compare feature subsets during cross-validation
feature_selector = feature_selection.RFECV(lsvc, scoring=metrics.make_scorer(metrics.precision_score))

# Fit the feature selector
feature_selector.fit(x, y)

new_x = feature_selector.transform(x)

# New X will have 9 features instead of 30
new_x.shape


(569, 9)

In [36]:

x_train, x_test, y_train, y_test = model_selection.train_test_split(new_x, y, test_size=0.2, random_state=42, stratify=y)



## Training & Evaluating

In [39]:
# Define a pipeline that scales the features and then fits an RBF-kernel Support Vector Machine (training capped at 10 iterations)
model = pipeline.Pipeline([
                          ('scaler',preprocessing.StandardScaler()),
                          ('model',svm.SVC(kernel='rbf',random_state=10,max_iter=10))
])
# Train model using the training dataset
model.fit(x_train,y_train)



Pipeline(steps=[('scaler', StandardScaler()),
                ('model', SVC(max_iter=10, random_state=10))])

In [40]:
# Use the model to predict the testing set to prepare for calculating the metrics
y_pred = model.predict(x_test)

# Calculate and print relevant scores
accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.9210526315789473
Precision: 0.9565217391304348
Recall: 0.9166666666666666
