# SeqFeatSelection Example

In [1]:
import sys
sys.path.append('../../notebooks')

import pandas as pd
import numpy as np
from raimitigations.dataprocessing import SeqFeatSelection

from notebooks.download import download_datasets

## 1 - Dataset with Headers

In [2]:
data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset.drop(columns=['employee_id'], inplace=True)
dataset

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
54803,Technology,region_14,Bachelor's,m,sourcing,1,48,3.0,17,0,0,78,0
54804,Operations,region_27,Master's & above,f,other,1,37,2.0,6,0,0,56,0
54805,Analytics,region_1,Bachelor's,m,other,1,27,5.0,3,1,0,79,0
54806,Sales & Marketing,region_9,,m,sourcing,1,29,1.0,2,0,0,45,0


This class implements sequential feature selection. It wraps the **SequentialFeatureSelector** class from the **mlxtend** library and offers some simplifications and abstractions on top of it.

We can instantiate this class with its default parameters and pass the dataframe only when calling the .fit() method. One option is to pass the whole dataset together with the label column, using the "df=" and "label_col=" parameters.

In [3]:
feat_sel = SeqFeatSelection(n_jobs=4)
feat_sel.fit(df=dataset, label_col='is_promoted')
feat_sel.get_selected_features()

No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']
No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  11 out of  11 | elapsed:    1.5s finished

[2022-06-28 15:59:07] Features: 1/11 -- score: 0.6708481343121734[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:    0.4s finished

[2022-06-28 15:59:08] Features: 2/11 -- score: 0.7217186897769179[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   7 out of   9 | elapsed:    0.3s remaining:    0.1s
[Parallel(n_jobs=4)]: Done   9 out of   9 | elapsed:    0.4s finished

[2022-06-28 15:59:08] Features: 3/11 -- score: 0.7623125973733199[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   6 out of   8 | elapsed:    0.3s remaining:    0.1s
[Parallel(n_jobs=4)]: Done   8 out of   8 | elapsed:    0.4s finished

[2022-06-28 15:59:09] Features: 4/11 -- score: 0.755662689727

['department', 'previous_year_rating', 'avg_training_score']
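
The log above reports one score per step as features are added one at a time. As a rough illustration of that greedy procedure, the sketch below implements forward selection in plain Python. It is not the mlxtend implementation: `forward_selection`, `toy_score`, and the `USEFUL` weights are hypothetical names made up for this example, and the real selector scores each candidate subset by cross-validating an estimator rather than by calling a toy function.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def forward_selection(
    features: Sequence[str],
    score: Callable[[Tuple[str, ...]], float],
) -> Tuple[Tuple[str, ...], Dict[int, Tuple[Tuple[str, ...], float]]]:
    """Greedy forward selection that records the best subset of every size."""
    selected: List[str] = []
    remaining = list(features)
    history: Dict[int, Tuple[Tuple[str, ...], float]] = {}

    while remaining:
        # Try adding each remaining feature and keep the one that scores best.
        candidates = [(score(tuple(selected + [feat])), feat) for feat in remaining]
        best_score, best_feat = max(candidates)
        selected.append(best_feat)
        remaining.remove(best_feat)
        history[len(selected)] = (tuple(selected), best_score)

    # The final answer is the subset size with the highest recorded score.
    best_size = max(history, key=lambda size: history[size][1])
    return history[best_size][0], history


# Hypothetical scoring rule: two useful features, a small penalty per feature.
USEFUL = {"avg_training_score": 0.6, "department": 0.1}


def toy_score(subset: Tuple[str, ...]) -> float:
    return sum(USEFUL.get(feat, 0.0) for feat in subset) - 0.01 * len(subset)


best_subset, history = forward_selection(
    ["department", "region", "gender", "avg_training_score"], toy_score
)
```

With this toy score, `best_subset` is `('avg_training_score', 'department')`: the two useful features are picked first, and every later addition only lowers the score, so the two-feature subset wins.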

After calling the fit() method, we can access the summary generated by the **SequentialFeatureSelector** object (from the mlxtend library) that **SeqFeatSelection** uses internally. The **get_summary()** method returns this summary as a dictionary. Each key identifies a different set of features that was tested, and each value is a secondary dictionary with the relevant data for that run: the features used, the score obtained in each cross-validation fold, and the average of those scores. The scores are computed with the metric specified by the **scoring** parameter.

In [4]:
feat_sel.get_summary()

{1: {'feature_idx': (10,),
  'cv_scores': array([0.67026519, 0.67113818, 0.67114104]),
  'avg_score': 0.6708481343121734,
  'feature_names': ('avg_training_score',)},
 2: {'feature_idx': (0, 10),
  'cv_scores': array([0.72540505, 0.71855693, 0.72119409]),
  'avg_score': 0.7217186897769179,
  'feature_names': ('department', 'avg_training_score')},
 3: {'feature_idx': (0, 7, 10),
  'cv_scores': array([0.7618294 , 0.76159125, 0.76351714]),
  'avg_score': 0.7623125973733199,
  'feature_names': ('department',
   'previous_year_rating',
   'avg_training_score')},
 4: {'feature_idx': (0, 3, 7, 10),
  'cv_scores': array([0.76027635, 0.75871003, 0.74800169]),
  'avg_score': 0.755662689727591,
  'feature_names': ('department',
   'gender',
   'previous_year_rating',
   'avg_training_score')},
 5: {'feature_idx': (0, 3, 7, 9, 10),
  'cv_scores': array([0.74519965, 0.75171545, 0.73644768]),
  'avg_score': 0.7444542599508724,
  'feature_names': ('department',
   'gender',
   'previous_year_rating',
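
To make the structure of the summary concrete, the sketch below copies the `cv_scores` and `avg_score` values of the five entries shown above into a plain dictionary (the other sub-keys are omitted). It then checks that each `avg_score` is the mean of its fold scores and looks up which of these entries scores highest. The variable names `max_gap` and `best_k` are introduced only for this illustration.

```python
from statistics import mean

# Values copied from the get_summary() output above (first five entries only).
summary = {
    1: {"cv_scores": [0.67026519, 0.67113818, 0.67114104], "avg_score": 0.6708481343121734},
    2: {"cv_scores": [0.72540505, 0.71855693, 0.72119409], "avg_score": 0.7217186897769179},
    3: {"cv_scores": [0.7618294, 0.76159125, 0.76351714], "avg_score": 0.7623125973733199},
    4: {"cv_scores": [0.76027635, 0.75871003, 0.74800169], "avg_score": 0.755662689727591},
    5: {"cv_scores": [0.74519965, 0.75171545, 0.73644768], "avg_score": 0.7444542599508724},
}

# Largest difference between a stored average and the mean of its fold scores.
# The fold scores are printed with 8 decimals, so a tiny gap is expected.
max_gap = max(
    abs(mean(entry["cv_scores"]) - entry["avg_score"]) for entry in summary.values()
)

# Key of the entry with the highest average score.
best_k = max(summary, key=lambda k: summary[k]["avg_score"])
```

Among these five entries `best_k` is 3, which agrees with the three features returned by get_selected_features() earlier.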

It is also possible to save this summary automatically after calling the fit() method by using the **save_json** and **json_summary** parameters. By default, **save_json** is set to False, which means that no JSON files are saved. By setting it to True, the summary will be saved in the file specified by **json_summary**.

In [5]:
feat_sel = SeqFeatSelection(n_jobs=4, save_json=True, json_summary="json_files/seq_feat.json")
feat_sel.fit(df=dataset, label_col='is_promoted')

No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']
No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  11 out of  11 | elapsed:    2.3s finished

[2022-06-28 15:59:20] Features: 1/11 -- score: 0.6708481343121734[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:    0.4s finished

[2022-06-28 15:59:20] Features: 2/11 -- score: 0.7214478415293852[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   7 out of   9 | elapsed:    0.5s remaining:    0.1s
[Parallel(n_jobs=4)]: Done   9 out of   9 | elapsed:    0.5s finished

[2022-06-28 15:59:21] Features: 3/11 -- score: 0.7638985560394768[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   6 out of   8 | elapsed:    0.4s remaining:    0.1s
[Parallel(n_jobs=4)]: Done   8 out of   8 | elapsed:    0.4s finished

[2022-06-28 15:59:21] Features: 4/11 -- score: 0.754233734757

<raimitigations.dataprocessing.feat_selection.sequential_select.SeqFeatSelection at 0x7f4004591e50>

The user can then open this JSON file with any JSON viewing tool and inspect the results. The image below shows an example of the generated JSON file, opened with the "JSON Viewer" extension for VSCode.

![seq_feat](./imgs/seq_feat_1.png)

In the previous image, we can see that each run is associated with a different numerical key. The key "6" represents a run that used 5 features, listed in the *feature_names* sub-key. We can also inspect the score obtained by each model during cross-validation (in this case, using 3 folds), as well as the mean score. This JSON file lets users examine the finer details of the feature selection process and decide for themselves whether the selected features are the ones they want or whether they want to change them.
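
The saved file can also be read back programmatically. The sketch below is a minimal example that assumes the file mirrors the dictionary returned by get_summary(); since JSON object keys are always strings, it converts them back to integers. The `load_summary` helper and the abbreviated `sample` content are hypothetical, and a temporary file stands in for the real one so the example runs without the library.

```python
import json
import tempfile
from pathlib import Path


def load_summary(path):
    """Read a saved summary and convert the string keys back to integers."""
    with open(path, encoding="utf-8") as handle:
        raw = json.load(handle)
    return {int(key): value for key, value in raw.items()}


# Stand-in content (assumed layout); the real file would be the one
# passed through the json_summary parameter.
sample = {"1": {"avg_score": 0.67, "feature_names": ["avg_training_score"]}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "seq_feat.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    summary = load_summary(path)
```

After loading, `summary[1]["feature_names"]` can be inspected in the same way as the in-memory dictionary.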

If users disagree (partially or fully) with the selected features, they can choose the features manually with the *set_selected_features* method. This method accepts a list of columns, which become the selected columns (they can differ from the features selected by the fit method). If we then call *get_selected_features*, we can see that the selected features are now the ones defined by the user:

In [6]:
feat_sel.set_selected_features(["department", "previous_year_rating"])
feat_sel.get_selected_features()

['department', 'previous_year_rating']

We can also set the selected features using the column indices instead of the column names:

In [7]:
feat_sel.set_selected_features([1, 2])
feat_sel.get_selected_features()

['region', 'education']
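
The two cells above show that a selection can be given either as names or as positions. The sketch below illustrates one way such a mixed input could be resolved to column names. It is a hypothetical helper (`resolve_columns`) written for this example, not the library's code, and it uses the first four feature columns of the dataset as sample input.

```python
def resolve_columns(all_columns, selection):
    """Map a list of column names or integer positions to column names."""
    resolved = []
    for item in selection:
        if isinstance(item, int):
            # An integer is read as a position in the column list.
            resolved.append(all_columns[item])
        elif item in all_columns:
            # A known name is kept unchanged.
            resolved.append(item)
        else:
            raise KeyError(f"unknown column: {item!r}")
    return resolved


columns = ["department", "region", "education", "gender"]
by_index = resolve_columns(columns, [1, 2])
by_name = resolve_columns(columns, ["department", "gender"])
```

Here `by_index` is `['region', 'education']`, the same result as the cell above, because positions 1 and 2 correspond to the second and third columns.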

If we want to restore the features chosen by the fit method, we only need to call *set_selected_features* again, this time without providing a list of columns. The selected features are then reset to those originally selected by the fit method.

In [8]:
feat_sel.set_selected_features()
feat_sel.get_selected_features()

['department', 'previous_year_rating', 'avg_training_score']

We can also split the dataframe into an X dataframe containing the features and a Y series containing the labels. In this case, we use the "X=" and "y=" parameters instead of the "df=" and "label_col=" parameters. We can also change the scoring function used.

In [9]:
X = dataset.drop(columns=['is_promoted'])
Y = dataset['is_promoted']

feat_sel = SeqFeatSelection(scoring='f1', n_jobs=4)
feat_sel.fit(X=X, y=Y)
feat_sel.get_selected_features()

No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']
No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  11 out of  11 | elapsed:    1.3s finished

[2022-06-28 15:59:30] Features: 1/11 -- score: 0.19351656953830262[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:    0.4s finished

[2022-06-28 15:59:30] Features: 2/11 -- score: 0.4975869106742392[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   7 out of   9 | elapsed:    0.3s remaining:    0.1s
[Parallel(n_jobs=4)]: Done   9 out of   9 | elapsed:    0.3s finished

[2022-06-28 15:59:31] Features: 3/11 -- score: 0.5037381587361772[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   6 out of   8 | elapsed:    0.3s remaining:    0.1s
[Parallel(n_jobs=4)]: Done   8 out of   8 | elapsed:    0.4s finished

[2022-06-28 15:59:31] Features: 4/11 -- score: 0.49968649610

['department', 'awards_won?', 'avg_training_score']

*SeqFeatSelection* implements the Sequential Feature Selection approach using the *mlxtend* library. This method relies on an estimator to measure model performance on different sets of features. The default estimator is a decision tree classifier (DecisionTreeClassifier from sklearn), but the user might want to try other sklearn estimators to see if they achieve better results. For this purpose, the *estimator* parameter accepts a sklearn classifier, or *None* to use the default one. Let's see how to use *SeqFeatSelection* with a different classifier. Note that in the following cell we also (i) specify "label_col" using the index of the label column instead of its name (just to show a different way of setting this attribute), and (ii) provide the dataset when instantiating the class instead of passing it to the fit method.

In [10]:
from sklearn.neighbors import KNeighborsClassifier

estimator = KNeighborsClassifier(n_neighbors=4)
feat_sel = SeqFeatSelection(df=dataset, label_col=11, estimator=estimator, scoring='accuracy', n_jobs=4)
feat_sel.fit()
feat_sel.get_selected_features()

No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']
No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  11 out of  11 | elapsed:   33.9s finished

[2022-06-28 16:00:11] Features: 1/11 -- score: 0.9223835796027996[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:    7.1s finished

[2022-06-28 16:00:18] Features: 2/11 -- score: 0.9404649089117408[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   7 out of   9 | elapsed:    7.8s remaining:    2.2s
[Parallel(n_jobs=4)]: Done   9 out of   9 | elapsed:   10.1s finished

[2022-06-28 16:00:28] Features: 3/11 -- score: 0.940793329119512[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   6 out of   8 | elapsed:    7.4s remaining:    2.5s
[Parallel(n_jobs=4)]: Done   8 out of   8 | elapsed:    7.8s finished

[2022-06-28 16:00:36] Features: 4/11 -- score: 0.9405743809827

['department', 'recruitment_channel', 'avg_training_score']

Finally, to transform a dataset so that it keeps only the chosen features, we call the *transform* method. Following the same pattern as the other subclasses, we must always provide a valid dataset to this method, and this dataset doesn't need to be the same as the one used in the *fit* method.

In [11]:
new_df = feat_sel.transform(dataset)
new_df.head()

Unnamed: 0,department,recruitment_channel,avg_training_score,is_promoted
0,7,2,49.0,0
1,4,0,60.0,0
2,7,2,50.0,0
3,7,0,50.0,0
4,8,0,73.0,0


### Setting a list of transformations before using feature selection

In [12]:
print(dataset.info())
dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54808 entries, 0 to 54807
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   department            54808 non-null  object 
 1   region                54808 non-null  object 
 2   education             52399 non-null  object 
 3   gender                54808 non-null  object 
 4   recruitment_channel   54808 non-null  object 
 5   no_of_trainings       54808 non-null  int64  
 6   age                   54808 non-null  int64  
 7   previous_year_rating  50684 non-null  float64
 8   length_of_service     54808 non-null  int64  
 9   awards_won?           54808 non-null  int64  
 10  avg_training_score    54808 non-null  int64  
 11  is_promoted           54808 non-null  int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 5.0+ MB
None


Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,awards_won?,avg_training_score,is_promoted
0,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,0,49,0
1,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,60,0
2,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,50,0
3,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,50,0
4,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,73,0
...,...,...,...,...,...,...,...,...,...,...,...,...
54803,Technology,region_14,Bachelor's,m,sourcing,1,48,3.0,17,0,78,0
54804,Operations,region_27,Master's & above,f,other,1,37,2.0,6,0,56,0
54805,Analytics,region_1,Bachelor's,m,other,1,27,5.0,3,0,79,0
54806,Sales & Marketing,region_9,,m,sourcing,1,29,1.0,2,0,45,0


In [13]:
dataset['education'].unique()

array(["Master's & above", "Bachelor's", nan, 'Below Secondary'],
      dtype=object)

In [14]:
from raimitigations.dataprocessing import EncoderOHE, EncoderOrdinal
from raimitigations.dataprocessing import BasicImputer

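# Impute missing values; previous_year_rating is filled with the constant -100.
imputer = BasicImputer(
    col_impute=None,
    specific_col={
        'previous_year_rating': {
            'missing_values': np.nan,
            'strategy': 'constant',
            'fill_value': -100,
        }
    },
)
# Encode education as ordered categories, lowest level first.
ordinal = EncoderOrdinal(
    col_encode=['education'],
    categories={'education': ["Below Secondary", "Bachelor's", "Master's & above"]},
)
# One-hot encode the remaining categorical columns.
ohe = EncoderOHE(col_encode=["department", "region", "gender", "recruitment_channel"])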

transform_pipe = [imputer, ordinal, ohe]

feat_sel = SeqFeatSelection(transform_pipe=transform_pipe, n_jobs=4)
feat_sel.fit(df=dataset, label_col='is_promoted')
feat_sel.get_selected_features()

No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:    2.5s
[Parallel(n_jobs=4)]: Done  51 out of  51 | elapsed:    2.9s finished

[2022-06-28 16:01:27] Features: 1/51 -- score: 0.6708481343121734[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  50 out of  50 | elapsed:    1.4s finished

[2022-06-28 16:01:28] Features: 2/51 -- score: 0.7192425089600949[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 out of  49 | elapsed:    1.5s remaining:    0.3s
[Parallel(n_jobs=4)]: Done  49 out of  49 | elapsed:    1.7s finished

[2022-06-28 16:01:30] Features: 3/51 -- score: 0.7509862134959849[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  48 out of  48 | elapsed:    1.7s finished

[2022-06-28 16:01:32] Features: 4/51 -- score: 0.7657707008956866[Parallel(n_job

['previous_year_rating',
 'avg_training_score',
 'department_Finance',
 'department_HR',
 'department_Operations',
 'department_Procurement',
 'department_R&D',
 'department_Sales & Marketing',
 'region_region_18',
 'region_region_33',
 'region_region_9']
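
In the cell above, the ordinal encoder receives the education levels in an explicit order. Assuming **EncoderOrdinal** follows the usual ordinal-encoding convention, where each category is replaced by its position in the supplied list, the effect can be illustrated in plain Python. The `rank` dictionary and the sample `values` list are hypothetical and exist only for this sketch.

```python
# Category order as passed to the encoder, lowest level first.
categories = ["Below Secondary", "Bachelor's", "Master's & above"]

# Position of each category in the list becomes its encoded value.
rank = {name: position for position, name in enumerate(categories)}

values = ["Master's & above", "Bachelor's", "Below Secondary", "Bachelor's"]
encoded = [rank[value] for value in values]
```

The result is `[2, 1, 0, 1]`, so the numeric order matches the order of the education levels. One-hot encoding, by contrast, creates one column per category, which is consistent with the log above counting 51 candidate features instead of 11.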

In [15]:
new_df = feat_sel.transform(dataset)
new_df.head()

Unnamed: 0,previous_year_rating,avg_training_score,department_Finance,department_HR,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,region_region_18,region_region_33,region_region_9,is_promoted
0,5.0,49.0,0,0,0,0,0,1,0,0,0,0
1,5.0,60.0,0,0,1,0,0,0,0,0,0,0
2,3.0,50.0,0,0,0,0,0,1,0,0,0,0
3,1.0,50.0,0,0,0,0,0,1,0,0,0,0
4,3.0,73.0,0,0,0,0,0,0,0,0,0,0


## 2 - DataFrame without column names

In [16]:
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv', header=None, skiprows=1)
dataset.drop(columns=[0], inplace=True)
dataset

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12
0,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,0,49,0
1,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,60,0
2,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,50,0
3,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,50,0
4,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,73,0
...,...,...,...,...,...,...,...,...,...,...,...,...
54803,Technology,region_14,Bachelor's,m,sourcing,1,48,3.0,17,0,78,0
54804,Operations,region_27,Master's & above,f,other,1,37,2.0,6,0,56,0
54805,Analytics,region_1,Bachelor's,m,other,1,27,5.0,3,0,79,0
54806,Sales & Marketing,region_9,,m,sourcing,1,29,1.0,2,0,45,0


In [17]:
feat_sel = SeqFeatSelection(n_jobs=1)
feat_sel.fit(df=dataset, label_col=11)
feat_sel.get_selected_features()

No columns specified for imputation. These columns have been automatically identified:
['2', '7']
No columns specified for encoding. These columns have been automatically identfied as the following:
['0', '1', '2', '3', '4']


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  11 out of  11 | elapsed:    0.4s finished

[2022-06-28 16:02:42] Features: 1/11 -- score: 0.6708481343121734[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.6s finished

[2022-06-28 16:02:43] Features: 2/11 -- score: 0.7215177306908688[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    0.6s finished

[2022-06-28 16:02:43] Features: 3/11 -- score: 0.7659266975856713[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  

['0', '7', '10']

In [18]:
feat_sel.set_selected_features([1,8])
feat_sel.get_selected_features()

['1', '8']

In [19]:
feat_sel.set_selected_features()
feat_sel.get_selected_features()

['0', '7', '10']

In [20]:
new_df = feat_sel.transform(dataset)
new_df.head()

Unnamed: 0,0,7,10,11
0,7,5.0,49.0,0
1,4,5.0,60.0,0
2,7,3.0,50.0,0
3,7,1.0,50.0,0
4,8,3.0,73.0,0


## 3 - Regression Task

So far, we have only shown examples of **SeqFeatSelection** for classification tasks. However, this class also works for regression tasks. First, let's create a dummy regression dataset so we can build a few examples. For this, we'll use the **create_dummy_dataset** function:

In [21]:
from raimitigations.dataprocessing import create_dummy_dataset

df = create_dummy_dataset(
        samples=1000,
        n_features=6,
        n_num_num=2,
        n_cat_num=2,
        n_cat_cat=0,
        num_num_noise=[0.01, 0.02],
        pct_change=[0.03, 0.05],
        regression=True,
    )
df

Unnamed: 0,num_0,num_1,num_2,num_3,num_4,num_5,label,num_c0_num_0,num_c1_num_1,CN_0_num_0,CN_1_num_1
0,1.645050,0.820673,-0.076645,-0.348990,0.632837,-0.875703,4.389251,1.644850,0.833223,val0_1,val1_3
1,0.678568,-0.367539,-0.367792,-1.129640,-0.593254,-1.524611,-185.576109,0.674129,-0.359268,val0_1,val1_1
2,-1.157812,0.367974,-0.415309,-0.444206,0.826171,1.108397,43.396172,-1.149322,0.386040,val0_0,val1_2
3,0.237365,-0.410995,1.173034,0.517627,0.182100,1.084928,124.268866,0.258908,-0.418348,val0_2,val1_0
4,-1.063191,0.692095,-0.132441,-0.442190,-0.233501,2.587980,126.513825,-1.061361,0.696987,val0_0,val1_0
...,...,...,...,...,...,...,...,...,...,...,...
995,-1.184065,0.124871,-0.145648,-0.100195,-3.132149,-0.919149,-175.063192,-1.194115,0.115665,val0_0,val1_1
996,-1.487044,0.073849,-1.272878,-0.082000,-0.457470,1.872967,36.319771,-1.507923,0.075584,val0_0,val1_1
997,-2.162915,1.573873,-0.312675,-0.780355,-0.932768,-0.903910,-126.774701,-2.144132,1.597560,val0_0,val1_2
998,-0.037309,-0.603274,0.499101,0.240289,0.051228,2.086783,136.295389,-0.030960,-0.597075,val0_1,val1_1


The **SeqFeatSelection** class automatically detects whether a problem is a classification or a regression task by looking at the label column: if the label column has a float data type, the task is treated as regression. Otherwise, it is treated as classification. We can also set the task explicitly through the 'regression' parameter when instantiating **SeqFeatSelection**: True forces a regression task and False forces a classification task. The default value is None, in which case the task is inferred from the data type of the label column, as described above. Consequently, if we have a classification problem whose label column holds float values (1.0 for class 1, 2.0 for class 2, and so on), we must set the 'regression' parameter to False, since the automatic check would otherwise treat it as a regression task.

In [22]:
feat_sel = SeqFeatSelection(verbose=False)
feat_sel.fit(df=df, label_col="label")
feat_sel.get_selected_features()

['num_0',
 'num_2',
 'num_3',
 'num_4',
 'num_5',
 'num_c1_num_1',
 'CN_0_num_0',
 'CN_1_num_1']

After fitting, the 'regression' attribute indicates whether the task was treated as a regression or a classification:

In [23]:
feat_sel.regression

True
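
The detection rule described above can be sketched in a few lines. This is a simplified stand-in that works on plain Python lists; the `infer_regression` function is a hypothetical name for this example, and the actual class inspects the data type of the label column rather than individual values.

```python
def infer_regression(labels, regression=None):
    """Return True for a regression task, False for classification."""
    if regression is not None:
        # An explicit setting always wins over the automatic check.
        return regression
    # Automatic check: float labels are treated as a regression target.
    return all(isinstance(value, float) for value in labels)


continuous = infer_regression([4.39, -185.58, 43.40])        # float target
binary = infer_regression([0, 1, 0])                         # integer classes
float_classes_auto = infer_regression([1.0, 2.0, 1.0])       # misread as regression
float_classes_fixed = infer_regression([1.0, 2.0, 1.0], regression=False)
```

The last two calls show why float-encoded class labels need an explicit `regression=False`: the automatic check returns True for them, while the explicit setting returns False.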

We can also specify which regression model we want to use when doing the sequential feature selection procedure. The default regressor used is a Decision Tree Regressor.

In [24]:
from sklearn.linear_model import LinearRegression

feat_sel = SeqFeatSelection(verbose=False, estimator=LinearRegression())
feat_sel.fit(df=df, label_col="label")
feat_sel.get_selected_features()

['num_0', 'num_1', 'num_2', 'num_3', 'num_4', 'num_5', 'num_c1_num_1']