# Sequential Feature Selection (SFS) or Sequential Feature Algorithms (SFA)

Sequential feature selection algorithms are wrapper methods: they add or remove features from the dataset one at a time, scoring a model on each candidate subset.

They form a family of greedy search algorithms that reduce an initial d-dimensional feature space to a k-dimensional feature subspace, where k < d.
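The greedy search described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation: `forward_select`, `toy_score`, and the `USEFULNESS` values are hypothetical names and numbers invented for this example. In a real wrapper method the score would be a cross-validated model score rather than a lookup table.

```python
from typing import Callable, FrozenSet, List, Sequence


def forward_select(
    features: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    k: int,
) -> List[str]:
    """Greedily pick k features, adding the best-scoring one at each step."""
    selected: List[str] = []
    remaining = list(features)
    while len(selected) < k and remaining:
        # Score every subset formed by adding one remaining feature.
        best = max(remaining, key=lambda f: score(frozenset(selected + [f])))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical scores for illustration only: each feature has a fixed
# usefulness, and "weight" and "displacement" are treated as redundant.
USEFULNESS = {"weight": 5.0, "displacement": 4.5, "horsepower": 3.0, "acceleration": 1.0}


def toy_score(subset: FrozenSet[str]) -> float:
    total = sum(USEFULNESS[f] for f in subset)
    if {"weight", "displacement"} <= subset:
        total -= 4.0  # the second of two redundant features adds little
    return total


chosen = forward_select(list(USEFULNESS), toy_score, k=2)
print(chosen)  # ['weight', 'horsepower']
```

Note that the second pick is `horsepower`, not `displacement`, even though `displacement` scores higher on its own: each step evaluates the candidate together with what has already been selected.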

The motivation behind feature selection algorithms is to automatically select the subset of features most relevant to the problem.

Goals:

1. To improve computational efficiency.

2. To reduce the model's generalization error by removing irrelevant features or noise.

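There are four different flavors of SFAs. mlxtend's SequentialFeatureSelector implements all four, while scikit-learn's SequentialFeatureSelector (used below) implements only the first two, selected through its `direction` parameter:

1. Sequential Forward Selection (SFS)

2. Sequential Backward Selection (SBS)

3. Sequential Forward Floating Selection (SFFS)

4. Sequential Backward Floating Selection (SBFS)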

Syntax:

sklearn.feature_selection.SequentialFeatureSelector(estimator, *, n_features_to_select='warn', tol=None, direction='forward', scoring=None, cv=5, n_jobs=None)

from sklearn.feature_selection import SequentialFeatureSelector

This Sequential Feature Selector adds (forward selection) or removes (backward selection) features to form a feature subset in a greedy fashion. At each stage, this estimator chooses the best feature to add or remove based on the cross-validation score of an estimator. In the case of unsupervised learning, this Sequential Feature Selector looks only at the features (X), not the desired outputs (y).
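As a small illustration of the behaviour described above, the sketch below runs backward selection on synthetic data. The data, the variable names (`X_demo`, `y_demo`, `selector`), and the choice of two informative columns are assumptions made for this example only.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only columns 0 and 3 drive the target; the other three are pure noise.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="backward",          # start from all features and remove one per step
    scoring="neg_mean_squared_error",
    cv=5,
)
selector.fit(X_demo, y_demo)

mask = selector.get_support()      # boolean mask over the original columns
print(mask)
print(selector.transform(X_demo).shape)
```

With a signal this strong, the selector should keep the two informative columns; on real data the forward and backward directions are not guaranteed to agree.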

Try: https://www.kaggle.com/datasets/mirichoi0218/insurance

In [30]:
# import libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import cufflinks as cf
cf.go_offline()

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

from sklearn.preprocessing import PolynomialFeatures

from sklearn.utils import shuffle

from sklearn.feature_selection import SequentialFeatureSelector

In [2]:
#from github
url='https://raw.githubusercontent.com/Munees11/Auto-MPG-prediction/master/Scripts/auto_mpg_dataset.csv'

In [3]:
#from kaggle
#data=pd.read_csv('auto-mpg.csv',sep=',')
data=pd.read_csv(url)

In [4]:
data.head()

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name,mpg
0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu,18.0
1,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320,15.0
2,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite,18.0
3,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst,16.0
4,8,302.0,140.0,3449.0,10.5,70,1,ford torino,17.0


In [8]:
X=data[['cylinders','displacement','horsepower','weight','acceleration']]
y=data['mpg']

In [15]:
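# expand the 5 input columns into all polynomial terms up to degree 3
poly3 = PolynomialFeatures(degree=3, include_bias=False)
# note: get_feature_names() was removed in scikit-learn 1.2; on newer versions
# use get_feature_names_out(), which names terms after the input columns
all_degree_3_combinations=pd.DataFrame(poly3.fit_transform(X),
                                      columns=poly3.get_feature_names())
all_degree_3_combinations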

Unnamed: 0,x0,x1,x2,x3,x4,x0^2,x0 x1,x0 x2,x0 x3,x0 x4,...,x2^3,x2^2 x3,x2^2 x4,x2 x3^2,x2 x3 x4,x2 x4^2,x3^3,x3^2 x4,x3 x4^2,x4^3
0,8.0,307.0,130.0,3504.0,12.0,64.0,2456.0,1040.0,28032.0,96.0,...,2197000.0,59217600.0,202800.0,1.596142e+09,5466240.0,18720.00,4.302217e+10,147336192.0,504576.00,1728.000
1,8.0,350.0,165.0,3693.0,11.5,64.0,2800.0,1320.0,29544.0,92.0,...,4492125.0,100541925.0,313087.5,2.250311e+09,7007467.5,21821.25,5.036605e+10,156839863.5,488399.25,1520.875
2,8.0,318.0,150.0,3436.0,11.0,64.0,2544.0,1200.0,27488.0,88.0,...,3375000.0,77310000.0,247500.0,1.770914e+09,5669400.0,18150.00,4.056575e+10,129867056.0,415756.00,1331.000
3,8.0,304.0,150.0,3433.0,12.0,64.0,2432.0,1200.0,27464.0,96.0,...,3375000.0,77242500.0,270000.0,1.767823e+09,6179400.0,21600.00,4.045958e+10,141425868.0,494352.00,1728.000
4,8.0,302.0,140.0,3449.0,10.5,64.0,2416.0,1120.0,27592.0,84.0,...,2744000.0,67600400.0,205800.0,1.665384e+09,5070030.0,15435.00,4.102793e+10,124903810.5,380252.25,1157.625
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
393,4.0,140.0,86.0,2790.0,15.6,16.0,560.0,344.0,11160.0,62.4,...,636056.0,20634840.0,115377.6,6.694326e+08,3743064.0,20928.96,2.171764e+10,121431960.0,678974.40,3796.416
394,4.0,97.0,52.0,2130.0,24.6,16.0,388.0,208.0,8520.0,98.4,...,140608.0,5759520.0,66518.4,2.359188e+08,2724696.0,31468.32,9.663597e+09,111607740.0,1288990.80,14886.936
395,4.0,135.0,84.0,2295.0,11.6,16.0,540.0,336.0,9180.0,46.4,...,592704.0,16193520.0,81849.6,4.424301e+08,2236248.0,11303.04,1.208782e+10,61097490.0,308815.20,1560.896
396,4.0,120.0,79.0,2625.0,18.6,16.0,480.0,316.0,10500.0,74.4,...,493039.0,16382625.0,116082.6,5.443594e+08,3857175.0,27330.84,1.808789e+10,128165625.0,908145.00,6434.856


In [16]:
all_degree_3_combinations.shape

(398, 55)

Expanding to all polynomial terms up to degree 3 raised the dimensionality of our data from 5 features to 55.
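The figure of 55 can be checked by counting monomials: with n input variables there are C(n + d - 1, d) distinct terms of exactly degree d. The short check below is only a sanity calculation and uses the standard library; the variable names are chosen for this example.

```python
from math import comb

n_inputs = 5   # cylinders, displacement, horsepower, weight, acceleration
degree = 3

# Number of monomials of exactly degree d in n variables is C(n + d - 1, d).
per_degree = {d: comb(n_inputs + d - 1, d) for d in range(1, degree + 1)}
n_output_features = sum(per_degree.values())

print(per_degree)          # {1: 5, 2: 15, 3: 35}
print(n_output_features)   # 55
```

The constant (degree 0) term is excluded because the transformer was created with `include_bias=False`.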

In [17]:
all_indices=range(0,len(data))
all_indices

range(0, 398)

# Shuffle the data

In [20]:
all_indices=shuffle(all_indices)
all_indices

[117, 65, 130, 221, 75, 228, 224, 205, 155, 285, ...]
 

# Randomly split the data

In [23]:
training_indices,dev_indices=np.split(all_indices,[320])

In [28]:
print(f'training data {training_indices.shape}')
print(f'development data {dev_indices.shape}')

training data (320,)
development data (78,)
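Because the shuffle above is unseeded, the split changes on every run. One possible way to make it reproducible, sketched here with NumPy only, is to draw the permutation from a seeded generator. The seed value 42 and the names `train_idx` and `dev_idx` are arbitrary choices for this example.

```python
import numpy as np

n_rows = 398                      # number of rows in the dataset above
rng = np.random.default_rng(42)   # the seed value 42 is an arbitrary choice

shuffled = rng.permutation(n_rows)               # shuffled copy of 0..397
train_idx, dev_idx = np.split(shuffled, [320])   # first 320 vs. remaining 78

print(train_idx.shape, dev_idx.shape)  # (320,) (78,)
```

Passing `random_state` to `sklearn.utils.shuffle` would achieve the same effect while keeping the notebook's original call.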


# Apply Sequential Feature Selection

In [35]:
feature_selection=SequentialFeatureSelector(estimator=LinearRegression(),
                                           n_features_to_select=4,
                                           scoring='neg_mean_squared_error',
                                           cv=[[training_indices,dev_indices]])
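The scoring string `'neg_mean_squared_error'` used above is the ordinary mean squared error with its sign flipped, so that a larger score always means a better model. The sketch below illustrates this on synthetic data; the data and variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_squared_error

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo[:, 0] + rng.normal(scale=0.5, size=50)

model = LinearRegression().fit(X_demo, y_demo)

# Plain MSE: lower is better.
mse = mean_squared_error(y_demo, model.predict(X_demo))
# The scorer is called as scorer(estimator, X, y) and returns -MSE: higher is better.
neg_mse = get_scorer("neg_mean_squared_error")(model, X_demo, y_demo)

print(mse, neg_mse)  # same magnitude, opposite sign
```

The `cv` argument in the cell above is a list containing a single (train, test) pair of index arrays, so every candidate subset is fitted on the 320 training rows and scored on the 78 development rows.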

In [36]:
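# list the scoring names that can be passed to `scoring`
# note: metrics.SCORERS was removed in scikit-learn 1.3; on newer versions
# call metrics.get_scorer_names() instead
from sklearn import metrics
metrics.SCORERS.keys()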

dict_keys(['explained_variance', 'r2', 'max_error', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'accuracy', 'top_k_accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'neg_brier_score', 'adjusted_rand_score', 'rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples', 'jaccard_wei

In [43]:
# fit the selector and keep only the four selected columns
# (once fitted, feature_selection.get_feature_names_out() returns their names)
best_four=pd.DataFrame(feature_selection.fit_transform(all_degree_3_combinations,y))
best_four

Unnamed: 0,0,1,2,3
0,3504.0,12278016.0,753992.0,4.302217e+10
1,3693.0,13638249.0,980000.0,5.036605e+10
2,3436.0,11806096.0,808992.0,4.056575e+10
3,3433.0,11785489.0,739328.0,4.045958e+10
4,3449.0,11895601.0,729632.0,4.102793e+10
...,...,...,...,...
393,2790.0,7784100.0,78400.0,2.171764e+10
394,2130.0,4536900.0,37636.0,9.663597e+09
395,2295.0,5267025.0,72900.0,1.208782e+10
396,2625.0,6890625.0,57600.0,1.808789e+10
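The output above has anonymous column labels 0 to 3, which hides which polynomial terms were chosen. One way to recover the names, sketched here on a hypothetical toy frame rather than the notebook's data, is to index the original columns with the boolean mask from `get_support()`. The column names `a` to `d` and the target definition are placeholders for this example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# Hypothetical toy frame; the column names are placeholders.
X_toy = pd.DataFrame(rng.normal(size=(150, 4)), columns=["a", "b", "c", "d"])
y_toy = 2.0 * X_toy["b"] + 0.5 * X_toy["d"] + rng.normal(scale=0.05, size=150)

sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=2, cv=3)
sfs.fit(X_toy, y_toy)

# get_support() is a boolean mask aligned with the input columns.
selected_names = X_toy.columns[sfs.get_support()]
labelled = pd.DataFrame(sfs.transform(X_toy), columns=selected_names, index=X_toy.index)

print(list(selected_names))
print(labelled.head())
```

Applied to the notebook, the same pattern would index `all_degree_3_combinations.columns` with `feature_selection.get_support()` after fitting.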
