# **Feature Selection**
Here we explore methods for identifying which features contribute the most predictive power to a model. A model trained on a dataset can depend on the original features directly or on features derived from them. How do we tell which features are the most useful? Several approaches exist, ranging from simple univariate analysis to more complex multivariate analysis. In univariate analysis we look at how a single feature contributes to the model. Although useful, this has pitfalls, because some features are only informative in combination. In multivariate analysis we can tell which features perform well and, more importantly, which perform well together. The techniques differ in how they extract information from the data. When the data contains labels, as is the case here, we use supervised techniques; unsupervised techniques can be used for unlabelled data.

As noted in the [source page](https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/), these techniques can be classified as follows:
- **Filter methods:** Based on properties of the features, highlighted via univariate analysis.
- **Wrapper methods:** Given a specific learning algorithm, these methods perform a greedy search for the best features by fitting models on candidate subsets of features and assessing each subset by training and evaluating a classifier on it.
- **Embedded methods:** These aim to combine the strengths of both filter and wrapper methods while maintaining a reasonable computational cost.
- **Hybrid methods:** These select features via a global transformation that reduces the data to a desired number of dimensions. The new features can bear little or no resemblance to the initial features.
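
To make the first three families concrete, here is a minimal sketch on synthetic data. The dataset, the variable names (`X_demo`, `y_demo`) and the choice of estimators are assumptions made for illustration only; they are not the notebook's data or a recommended configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): 10 features, 3 informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    random_state=0)

# Filter: score each feature on its own with a univariate statistic
filter_mask = SelectKBest(f_classif, k=3).fit(X_demo, y_demo).get_support()

# Wrapper: repeatedly refit a model and drop the weakest feature
wrapper_mask = RFE(
    LogisticRegression(max_iter=1000), n_features_to_select=3
).fit(X_demo, y_demo).get_support()

# Embedded: importances are a by-product of training the model itself;
# features above the mean importance are kept (the default threshold here)
embedded_mask = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0)
).fit(X_demo, y_demo).get_support()

print('filter:  ', filter_mask)
print('wrapper: ', wrapper_mask)
print('embedded:', embedded_mask)
```

Each mask is a boolean array with one entry per feature, so the three selections can be compared column by column.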

Libraries used in the notebook:
* [pandas](https://pandas.pydata.org/docs/)
* [scikit-learn](https://scikit-learn.org/stable/)
* [optbinning](https://github.com/guillermo-navas-palencia/optbinning)
* [sklearn.feature_selection](https://scikit-learn.org/stable/modules/feature_selection.html)
* [category_encoders](https://contrib.scikit-learn.org/category_encoders/index.html)

In [None]:
import pandas as pd
import numpy as np
import pickle
from optbinning import BinningProcess
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

Get some toy data

In [None]:
with open(r'kaggle-data.pkl', 'rb') as f:
    df_application = pickle.load(f)

In [None]:
pd.options.display.max_columns = None
df_application.head(2)

In [None]:
# format columns to lower case (just for a nice look :) )
df_application.columns = [col.lower() for col in df_application.columns]
df_application.head(2)

## 1. Filter methods
Filter methods are based on univariate analysis. The ones covered here can be used with any model, so they are model agnostic. These filter methods include:
 - Information Value
 - Information Gain
 - Chi-square Test
 - Fisher’s Score
 - Correlation Coefficient
 - Variance Threshold
 - AUC

Other methods that can also be used, but that we did not explore, include:
- Mean Absolute Difference (MAD)
- Dispersion ratio
 

### 2.1 Information Value (IV)
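Here we explore the use of `optbinning` to compute the information value. We use the `BinningProcess` class from `optbinning` to get the optimal bins and the corresponding scores.
$$IV = \sum_{i} \left(\%\,\text{nonevent}_i - \%\,\text{event}_i\right) \cdot WOE_i, \text{ with } WOE_i = \ln\left(\frac{\%\,\text{nonevent}_i}{\%\,\text{event}_i}\right)$$
where the sum runs over the bins $i$ of the feature.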

The table below shows the interpretation given in N. Siddiqi's book.

| Information Value | Predictive Power |
| --- | --- |
| <0.02 | Useless for prediction |
| 0.02-0.1 | Weak prediction |
| 0.1-0.3 | Medium prediction |
| 0.3-0.5 | Strong prediction |
| >0.5 | Suspicious (too good to be true) |
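
As a worked illustration of the formula, the sketch below computes WOE and IV by hand for a single feature with three bins. The bin labels and counts are hypothetical numbers chosen for the example, not values from the dataset used here.

```python
import numpy as np
import pandas as pd

# Hypothetical counts per bin of a single feature (illustrative numbers only)
bins = pd.DataFrame({
    'bin': ['low', 'mid', 'high'],
    'nonevent': [400, 300, 100],
    'event': [20, 30, 50],
})

# Share of all non-events / events that fall into each bin
pct_nonevent = bins['nonevent'] / bins['nonevent'].sum()
pct_event = bins['event'] / bins['event'].sum()

# Weight of evidence per bin, then each bin's contribution to the IV
bins['woe'] = np.log(pct_nonevent / pct_event)
bins['iv_contribution'] = (pct_nonevent - pct_event) * bins['woe']

iv_total = bins['iv_contribution'].sum()
print(bins)
print('IV =', round(iv_total, 4))
```

Every bin contributes a non-negative amount, because the difference and the logarithm always share the same sign. A bin with zero events or zero non-events would make the logarithm undefined, which is one reason binning libraries merge or smooth sparse bins.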

In [None]:
from optbinning import OptimalBinning, BinningProcess

Instantiate the `BinningProcess` with the feature names and then fit it.

In [None]:
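# skip the first column and make sure the target is never used as a feature
select_cols = [col for col in df_application.columns[1:] if col != 'target']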
binning_process = BinningProcess(select_cols)
binning_process.fit(df_application[select_cols], df_application.target)
binning_table = binning_process.summary()
binning_table

In [None]:
binning_table[binning_table['name']=='ext_source_3']

Build a function that returns the summary table with the information value, Jensen-Shannon divergence, Gini and quality score of each feature, sorted by information value.

In [None]:
def interpretation(iv):
    # Map an information value to the predictive-power bands in the table above
    if iv < 0.02:
        return 'useless'
    elif iv < 0.1:
        return 'weak'
    elif iv < 0.3:
        return 'medium'
    elif iv < 0.5:
        return 'strong'
    else:
        return 'suspicious'

def get_metrics(x, y):
    # Fit a binning process on every column of x and return its summary,
    # sorted by information value, with an interpretation column added
    select_cols = x.columns.to_list()
    binning_process = BinningProcess(select_cols)
    binning_process.fit(x, y)
    binning_table = binning_process.summary()
    binning_table = binning_table.sort_values(by='iv', ascending=False)
    binning_table['interpretation'] = binning_table['iv'].apply(interpretation)
    return binning_table

In [None]:
binning_table_metrics = get_metrics(df_application[select_cols], df_application.target)

In [None]:
binning_table_metrics.head(10)

In [None]:
binning_table_metrics.tail(10)

In [None]:
binning_table_metrics[binning_table_metrics['interpretation'].isin(['strong', 'medium'])]

### 2.2. Variance threshold
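Here we remove the features with low variance, since a feature that barely changes carries little information. For simplicity, we first split the columns into numerical and categorical ones, and then encode the categorical ones so that every column is numerical.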

In [None]:
from sklearn import feature_selection

In [None]:
# separate the features from the target and split the columns by dtype
X = df_application.drop('target', axis=1)
Y = df_application.target
numerical_columns = X.select_dtypes(include=np.number).columns.values
categorical_columns = X.select_dtypes(include=['object', 'category']).columns.values

Encode the categorical variables. For now, let's use `TargetEncoder()`.

In [None]:
import category_encoders as ce

In [None]:
encoder = ce.TargetEncoder()
X_cat = encoder.fit_transform(X[categorical_columns], Y)

In [None]:
X[categorical_columns] = X_cat

In [None]:
constant_threshold = feature_selection.VarianceThreshold(threshold=0.001)
constant_threshold.fit(X)

In [None]:
# keep only the features whose variance exceeds the threshold
# (transform returns a NumPy array, so the column names are lost)
X_filter = constant_threshold.transform(X)

In [None]:
# recover the names of the columns kept by the variance threshold
cols = constant_threshold.get_support(indices=True)
selected_columns = X.iloc[:, cols].columns.tolist()
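
The behaviour of `VarianceThreshold` is easy to check on a tiny frame. The sketch below uses made-up column names and values (all hypothetical) with the same threshold of 0.001 as above.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame (hypothetical values): one constant, one nearly constant
# and one clearly varying column
toy = pd.DataFrame({
    'constant': [1.0, 1.0, 1.0, 1.0],
    'almost_constant': [0.0, 0.0, 0.0, 0.01],
    'varied': [0.1, 0.9, 0.4, 0.7],
})

vt = VarianceThreshold(threshold=0.001).fit(toy)

# variances_ holds the per-column variance that is compared to the threshold
print(dict(zip(toy.columns, vt.variances_)))

kept = toy.columns[vt.get_support()].tolist()
print(kept)
```

Variance depends on the scale of a column, so a single threshold treats a feature measured in thousands very differently from one measured in fractions. Whether to rescale before thresholding is a judgment call that this sketch does not settle.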

### 2.3 Information gain
Here we use the mutual information between each feature and the target to narrow down the features.

In [None]:
select_features = X.columns[constant_threshold.get_support()]

In [None]:
importance = feature_selection.mutual_info_classif(X[select_features].fillna(X[select_features].mean()), Y)

In [None]:
feature_importances = pd.Series(importance, select_features)

In [None]:
feature_importances.sort_values(ascending=False).head(10).plot(kind='barh', color='teal')
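
One reason to look at mutual information is that it can pick up dependencies that are not linear. The sketch below illustrates this on synthetic data; the variables (`x_signal`, `x_noise`, `y_demo`) are assumptions made up for the example, in which the label depends only on the absolute value of one feature.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): the label depends on |x_signal| only
x_signal = rng.uniform(-1, 1, size=2000)
x_noise = rng.uniform(-1, 1, size=2000)
y_demo = (np.abs(x_signal) > 0.5).astype(int)

X_demo = np.column_stack([x_signal, x_noise])

# Mutual information between each feature and the label
mi = mutual_info_classif(X_demo, y_demo, random_state=0)

# Pearson correlation between each feature and the label, for comparison
corr = [np.corrcoef(X_demo[:, j], y_demo)[0, 1] for j in range(X_demo.shape[1])]

print('mutual information:', mi.round(3))
print('correlation:       ', np.round(corr, 3))
```

In this setup the correlation of `x_signal` with the label is close to zero because the relationship is symmetric around zero, while its mutual information is clearly larger than that of the noise feature.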

Scikit-learn provides functionality to select features automatically when a scoring measure and a selection criterion are provided. In this case, we can use a selector that keeps a given percentile of the features, or the top `k` features, according to the chosen measure. The scikit-learn utilities for this include:
- [`SelectPercentile`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html#sklearn.feature_selection.SelectPercentile)
- [`SelectKBest`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)

These can be used with measures such as mutual information (`mutual_info_classif`), chi-square (`chi2`), Fisher's score, etc. We demonstrate with mutual information. Note that this takes a long time to run on the full dataset, so the cell is left commented out.

In [None]:
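# mutual_info_classif does not accept NaNs, so impute with the column means first
# X_selected = X[select_features].fillna(X[select_features].mean())
# sel = feature_selection.SelectPercentile(
#     feature_selection.mutual_info_classif,
#     percentile=10
# ).fit(X_selected, Y)
# X_selected.columns[sel.get_support()]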
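
Since the cell above is slow on the full dataset, a small runnable sketch of the same idea may help. It uses a synthetic dataset and the hypothetical names `X_demo`, `y_demo` and `score_func`; wrapping `mutual_info_classif` in `functools.partial` is just one way to fix its random seed, not something the notebook does.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

# Small synthetic dataset (hypothetical): 20 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=3, n_redundant=0,
    random_state=0)

# Fix the estimator's random_state so repeated runs give the same scores
score_func = partial(mutual_info_classif, random_state=0)

# Keep the top 10% of the features according to mutual information
selector = SelectPercentile(score_func, percentile=10).fit(X_demo, y_demo)

kept_indices = selector.get_support(indices=True)
print(kept_indices)
print(selector.scores_.round(3))
```

Swapping `SelectPercentile(score_func, percentile=10)` for `SelectKBest(score_func, k=2)` would select by count instead of by share.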

### 2.4 AUC
AUC is a good measure of model performance. Here we use AUC to measure the performance of a model built on a single feature, and then select the features with a high AUC. This is a model-based approach.

In [None]:
from sklearn import metrics 
from sklearn.ensemble import RandomForestClassifier

In [None]:
df_application = X[select_features].fillna(X[select_features].mean())
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42, stratify=Y)

In [None]:
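roc_auc = []
for feature in numerical_columns:
    clf = LogisticRegression(max_iter=100, random_state=42)
    # fit on the training split only, so the test AUC is not inflated by leakage
    clf.fit(X_train[feature].fillna(0).values.reshape(-1, 1), y_train)
    # AUC needs a ranking score, so use the predicted probability of the positive class
    y_score = clf.predict_proba(X_test[feature].fillna(0).values.reshape(-1, 1))[:, 1]
    roc_auc.append(metrics.roc_auc_score(y_test, y_score))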

In [None]:
roc_auc_series = pd.Series(roc_auc, index=numerical_columns).sort_values(ascending=False)
roc_auc_series.head()

A feature with an AUC of 0.5 or less does no better than random guessing, so we keep only the features with an AUC above 0.5.

In [None]:
roc_auc_series[roc_auc_series>0.5]
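
A single-feature AUC can also be computed without fitting a model, by ranking the samples on the raw feature values. The sketch below shows this model-free variant on synthetic data; it is an alternative offered for illustration and not the approach used in the cells above, and all names in it are hypothetical. It also shows that, for raw values, an AUC well below 0.5 indicates a feature that is informative in the opposite direction.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Synthetic binary target and three hypothetical features
y_demo = rng.integers(0, 2, size=1000)
features = {
    'positive_signal': y_demo + rng.normal(0, 1, size=1000),
    'negative_signal': -y_demo + rng.normal(0, 1, size=1000),
    'noise': rng.normal(0, 1, size=1000),
}

# AUC of the raw feature values used directly as a ranking score
raw_auc = {name: roc_auc_score(y_demo, values) for name, values in features.items()}

# Ignore the direction of the relationship: 0.3 is as informative as 0.7
directionless_auc = {name: max(auc, 1 - auc) for name, auc in raw_auc.items()}

print(raw_auc)
print(directionless_auc)
```

With a fitted model, as in the loop above, the model learns the sign of the relationship itself, so this flip is usually unnecessary there.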

We can then use the selected features to build a model.

In [None]:
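def run_logreg(X_train, y_train, X_test, y_test):
    # Fit a logistic regression and report its accuracy on the test split
    clf = LogisticRegression(C=3, max_iter=100, random_state=42)
    clf.fit(X_train.fillna(0), y_train)
    # impute the test split the same way as the training split
    y_pred = clf.predict(X_test.fillna(0))
    print('Accuracy on test set is: ', metrics.accuracy_score(y_test, y_pred))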

In [None]:
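%%time
# train and evaluate using only the features whose single-feature AUC exceeds 0.5
selected = roc_auc_series[roc_auc_series > 0.5].index.to_list()
run_logreg(X_train[selected], y_train, X_test[selected], y_test)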

### 2.5 Correlation coefficients
This can be a quick and easy way to see which features are correlated with the target. Here we compute the Pearson correlation; the logic behind its use for feature selection is that good variables are highly correlated with the target.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# For ease of plotting we will only select 10 features
correlation_matrix = X[numerical_columns[:10]].merge(Y, right_index=True, left_index=True).corr()

In [None]:
#plotting heatmap
plt.figure(figsize=(10,6))
sns.heatmap(correlation_matrix, annot=True)
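
When only the relationship with the target matters, the full matrix is not needed. The sketch below ranks features by the absolute value of their correlation with the target using `DataFrame.corrwith`. The data and column names are synthetic and hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic target and three hypothetical features
target = pd.Series(rng.integers(0, 2, size=500), name='target')
demo = pd.DataFrame({
    'strong': target + rng.normal(0, 0.5, size=500),
    'inverse': -target + rng.normal(0, 1.0, size=500),
    'noise': rng.normal(0, 1.0, size=500),
})

# Pearson correlation of every column with the target
target_corr = demo.corrwith(target)

# Rank by magnitude, since a strong negative correlation is also useful
ranked = target_corr.abs().sort_values(ascending=False)

print(target_corr.round(3))
print(ranked.index.tolist())
```

Pearson correlation only measures linear association, so a feature with a low score here can still be informative in a non-linear way.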

### 2.6 F-Test
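This uses the ANOVA F-statistic (`f_classif`) to score each feature against the target and keeps the `k` highest-scoring features.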

In [None]:
def get_kbest_features(X, y, metric, k=5):
    # Keep the k features that score highest under the given metric
    selector = feature_selection.SelectKBest(metric, k=k)
    selector.fit(X, y)
    cols = selector.get_support(indices=True)
    selected_columns = X.iloc[:, cols].columns.tolist()

    return selected_columns

In [None]:
get_kbest_features(
    X.fillna(0),
    Y,
    metric=feature_selection.f_classif,
    k=5)
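
To see what the F-test rewards, the sketch below scores two synthetic features with `f_classif`: one whose mean differs between the classes and one whose mean does not. The data and names are hypothetical and only meant to illustrate the statistic.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(7)

# Balanced synthetic target: 200 samples per class
y_demo = np.repeat([0, 1], 200)

# 'shifted' has a class-dependent mean, 'unshifted' does not
shifted = rng.normal(loc=y_demo * 1.0, scale=1.0)
unshifted = rng.normal(loc=0.0, scale=1.0, size=400)
X_demo = np.column_stack([shifted, unshifted])

# One F-statistic and one p-value per feature
f_scores, p_values = f_classif(X_demo, y_demo)

print('F-scores:', f_scores.round(2))
print('p-values:', p_values.round(4))
```

The F-statistic compares the spread between the class means with the spread within the classes, so it grows when the classes are well separated along a feature.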

### 2.7 Chi-square Test
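This uses the chi-square test between each feature and the target. Note that `chi2` only accepts non-negative feature values.

In [None]:
# chi2 requires non-negative inputs, so keep only the columns without negative values
X_filled = X.fillna(0)
non_negative_cols = X_filled.columns[(X_filled >= 0).all()]
get_kbest_features(
    X_filled[non_negative_cols],
    Y,
    metric=feature_selection.chi2,
    k=5)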
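
Because `chi2` rejects negative values, features that can be negative need some preparation first. The sketch below rescales synthetic features to the range [0, 1] with `MinMaxScaler` before scoring them. Rescaling is one possible workaround chosen for this example, not a step taken in the notebook, and the data is hypothetical.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(3)

# Balanced synthetic target: 250 samples per class
y_demo = np.repeat([0, 1], 250)

X_demo = np.column_stack([
    rng.normal(loc=y_demo * 2.0 - 1.0, scale=1.0),  # related to y, has negatives
    rng.normal(loc=0.0, scale=1.0, size=500),       # unrelated to y
])

# Map every feature to [0, 1] so that chi2 accepts it
X_scaled = MinMaxScaler().fit_transform(X_demo)

chi2_scores, p_values = chi2(X_scaled, y_demo)

print('chi2 scores:', chi2_scores.round(3))
print('p-values:   ', p_values.round(4))
```

The chi-square statistic was designed for counts and frequencies, so scores on rescaled continuous features are best read as a rough ranking and not as exact test results.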