# **Feature Selection**
Here we explore methods for identifying which features contribute the most predictive power to a model. A model trained on a dataset can depend on the original features directly or on features derived from them. How do we tell which features are the most useful? Approaches range from simple univariate analysis to more complex multivariate analysis. In univariate analysis we look at how a single feature contributes to the model. Although useful, this has pitfalls, because some features are only informative in combination. Multivariate analysis can tell us which features perform well and, more importantly, which perform well together. The techniques differ in how they extract information from the data. When the data is labelled, as it is here, we use supervised techniques; unsupervised techniques can be used for unlabelled data.
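
To make the "better together" pitfall concrete, here is a minimal sketch on synthetic data (not the notebook's dataset). The XOR construction and all variable names are illustrative assumptions: each feature alone tells us almost nothing about the label, yet the two together determine it completely.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

# Synthetic, hypothetical data: the label is the XOR of two random binary features
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=2000)
b = rng.integers(0, 2, size=2000)
y_xor = a ^ b
X_xor = np.column_stack([a, b])

# Univariate view: mutual information between each single feature and the label
mi_single = mutual_info_classif(X_xor, y_xor, discrete_features=True, random_state=0)

# Multivariate view: a model that sees both features vs. only the first one
acc_both = DecisionTreeClassifier(random_state=0).fit(X_xor, y_xor).score(X_xor, y_xor)
acc_single = (
    DecisionTreeClassifier(random_state=0)
    .fit(X_xor[:, [0]], y_xor)
    .score(X_xor[:, [0]], y_xor)
)

print("mutual information per feature:", mi_single)
print("training accuracy with both features:", acc_both)
print("training accuracy with one feature:", acc_single)
```

A purely univariate filter would discard both features here, which is the kind of case multivariate methods are meant to catch.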

Collaborative filtering is built on the assumption that a good way to predict the
preference of an active consumer for a target product is to find other consumers
who have similar preferences and use their votes for that product to make a
prediction.
As noted in the [source page](https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/), feature selection techniques can be classified as follows:
- **Filter methods:** Select features based on their properties, as highlighted by univariate analysis.
- **Wrapper methods:** Given a specific learning algorithm, these methods perform a greedy search for the best features by fitting models on candidate subsets of features and assessing each subset by training and evaluating a classifier on it.
- **Embedded methods:** These aim to combine the strengths of filter and wrapper methods while keeping the computational cost reasonable, by performing the selection as part of model training.
- **Hybrid methods:** These select features via a global transformation that reduces the data to a desired number of dimensions. The new features can bear little or no resemblance to the original ones.
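
As a rough illustration of how these families differ in practice, the sketch below applies one representative of each to synthetic data. The choice of representatives is an assumption made for illustration, not something taken from the source page: `SelectKBest` with an ANOVA F-test for filters, `SelectFromModel` around a random forest for embedded methods, and PCA as an example of a global transformation. Wrapper methods are covered in their own section.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif

# Synthetic, hypothetical data: 20 features, of which 5 are informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=2, random_state=0
)

# Filter: score every feature on its own and keep the 5 best scores
filter_selector = SelectKBest(score_func=f_classif, k=5).fit(X_demo, y_demo)
X_filter = filter_selector.transform(X_demo)

# Embedded: selection falls out of training a model with built-in importances
# (features with importance above the mean importance are kept by default)
embedded_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0)
).fit(X_demo, y_demo)
X_embedded = embedded_selector.transform(X_demo)

# Global transformation: the 5 new columns are combinations of all 20 originals
X_pca = PCA(n_components=5, random_state=0).fit_transform(X_demo)

print(X_filter.shape, X_embedded.shape, X_pca.shape)
```

Note that the filter and embedded selectors return a subset of the original columns, whereas the PCA output columns no longer correspond to any single original feature.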

References

Libraries used in the notebook:
* [pandas](https://pandas.pydata.org/docs/)
* [scikit-learn](https://scikit-learn.org/stable/)
* [optbinning](https://github.com/guillermo-navas-palencia/optbinning)
* [sklearn.feature_selection](https://scikit-learn.org/stable/modules/feature_selection.html)
* [category_encoders](https://contrib.scikit-learn.org/category_encoders/)


In [None]:
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import category_encoders as ce

Get some toy data

In [None]:
with open(r'kaggle-data.pkl', 'rb') as f:
    df_application = pickle.load(f)

In [None]:
pd.options.display.max_columns = None
df_application.head(2)

In [None]:
# format columns to lower case (just for a nice look :) )
df_application.columns = [col.lower() for col in df_application.columns]
df_application.head(2)

In [None]:
# identify numerical and categorical columns for encoding
X = df_application.drop('target', axis=1)
Y = df_application.target
numerical_columns = X.select_dtypes(include=np.number).columns.values
categorical_columns = X.select_dtypes(include=['object', 'category']).columns.values

In [None]:
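# Target-encode the categorical columns.
# Note: fitting on the full dataset before the train/test split lets target
# information from the test rows leak into the encoding.
encoder = ce.TargetEncoder()
X_cat = encoder.fit_transform(X[categorical_columns], Y)
X[categorical_columns] = X_cat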
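
For intuition, the core idea behind target encoding can be sketched by hand with pandas: each category is replaced by the mean of the target within that category. This is a simplified illustration on a made-up frame (the `city` column and its values are hypothetical); `ce.TargetEncoder` additionally smooths the per-category means towards the global mean, so its output will generally differ from this plain version.

```python
import pandas as pd

# Hypothetical toy frame with one categorical column and a binary target
toy = pd.DataFrame({
    "city": ["a", "a", "b", "b", "b", "c"],
    "target": [1, 0, 1, 1, 0, 0],
})

# Mean of the target per category
category_means = toy.groupby("city")["target"].mean()

# Replace every category by its mean target value
toy["city_encoded"] = toy["city"].map(category_means)
print(toy)
```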

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42, stratify=Y)

## 3 Wrapper Methods:
Wrappers require some method to search the space of possible feature subsets, assessing the quality of each subset by training and evaluating a classifier on it. The selection process is therefore tied to the specific machine learning algorithm we are trying to fit on the dataset. Most wrappers follow a greedy search, adding or removing features step by step according to an evaluation criterion; only exhaustive selection evaluates every possible combination. Wrapper methods usually result in better predictive accuracy than filter methods, at a higher computational cost.
 - **Forward Feature Selection:** starts with the single best feature and gradually adds the others.
 - **Backward Feature Elimination:** starts with all features and eliminates them one at a time, worst first.
 - **Exhaustive Feature Selection:** a brute-force method that evaluates every possible combination of features.
 - **Recursive Feature Elimination:** repeatedly fits a model and prunes the least important features (see 3.1).

We will address each of these methods, but start with Recursive Feature Elimination, as the others need the mlxtend library.
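
As a stand-in until then, forward and backward selection can be sketched with scikit-learn's `SequentialFeatureSelector`. This is a swapped-in tool, not the mlxtend implementation this notebook refers to, and the synthetic data, estimator settings and number of features below are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic, hypothetical data: 8 features, of which 3 are informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Forward: start from no features and add the one that most improves the CV score
forward = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",
    cv=3,
).fit(X_demo, y_demo)

# Backward: start from all features and drop the one whose removal hurts least
backward = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="backward",
    cv=3,
).fit(X_demo, y_demo)

forward_mask = forward.get_support()
backward_mask = backward.get_support()
print("forward keeps columns: ", forward_mask.nonzero()[0])
print("backward keeps columns:", backward_mask.nonzero()[0])
```

The two directions are not guaranteed to agree, since each makes its greedy choices from a different starting point.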

### 3.1 Recursive Feature Elimination

First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a `coef_` attribute or through a `feature_importances_` attribute.

Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
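
The two steps above can be sketched by hand as a simple loop. This is an illustrative simplification on synthetic data, assuming a logistic regression whose absolute coefficients serve as the importance measure and one feature removed per round; it is not scikit-learn's actual `RFE` source code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic, hypothetical data; scaling makes coefficient magnitudes comparable
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=4, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

remaining = list(range(X_demo.shape[1]))  # column indices still in play
n_features_to_keep = 4

while len(remaining) > n_features_to_keep:
    # Step 1: train on the current feature set and read off the importances
    model = LogisticRegression(max_iter=1000).fit(X_demo[:, remaining], y_demo)
    importances = np.abs(model.coef_).ravel()
    # Step 2: prune the least important feature, then repeat on the smaller set
    weakest = int(np.argmin(importances))
    remaining.pop(weakest)

print("columns kept:", remaining)
```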

In [None]:
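# Recursive Feature Elimination down to 10 features
from sklearn import feature_selection

rfe = feature_selection.RFE(
    LogisticRegression(C=3, max_iter=1000, random_state=42),
    n_features_to_select=10
)
# Impute missing values with the column mean before fitting
rfe.fit(X.fillna(X.mean()), Y)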

In [None]:
rfe.get_feature_names_out()