## Feature analysis

In this notebook, we analyse the features with several methods in order to select
and keep only the most important ones, which reduces the feature space. This may
reduce overfitting and improve model performance.

The selected features will then be filtered in the `prepare` stage. 

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import shap
import xgboost
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

We use only the training data and split it into training and validation sets. The
test set is kept for the final performance evaluation only.

In [None]:
train = pd.read_csv("../data/split/train.csv")

y = train["Class"]
X = train.drop(["Class"], axis=1)

X.shape, y.shape, train.keys()

We must first define the parameters of the model we want to use. These can be found
with `scripts/grid_search.py`.

In [None]:
params = {
    "device": "gpu",
    "eta": 0.3,
    "subsample": 0.5,
    "colsample_bytree": 0.8,
    "eval_metric": "aucpr",
    "objective": "binary:logistic",
}

In [None]:
feature_names = list(X.keys())
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = XGBClassifier(**params).fit(X_train, y_train)
model.score(X_val, y_val)

We can start with the default feature importance from the xgboost library. The three
importance types (`weight`, `gain` and `cover`) do not rank the features consistently,
but keeping the top 3 of each gives the following:
- V4
- V14
- V15
- V7
- V17
- V16
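
The union of the top 3 of each importance type could also be computed rather than read off the plots. The following is a minimal sketch: the helpers `top_k` and `union_of_top_k` are hypothetical, and the toy scores are made up for illustration and are not the results of this notebook.

```python
def top_k(scores, k=3):
    """Return the k feature names with the highest score, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def union_of_top_k(scores_by_type, k=3):
    """Merge the top-k of every importance type, keeping first-seen order."""
    selected = []
    for scores in scores_by_type.values():
        for name in top_k(scores, k):
            if name not in selected:
                selected.append(name)
    return selected


# Toy scores for illustration only (assumption, not the notebook's output).
toy_scores = {
    "weight": {"a": 10.0, "b": 7.0, "c": 5.0, "d": 1.0},
    "gain": {"a": 0.2, "b": 3.0, "c": 0.1, "d": 4.0},
}
selected = union_of_top_k(toy_scores, k=2)
print(selected)
```

In the notebook, each inner dictionary could be obtained with `model.get_booster().get_score(importance_type=...)`, which maps feature names to scores.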

In [None]:
for importance_type in ("weight", "gain", "cover"):
    xgboost.plot_importance(model, importance_type=importance_type)
    plt.show()


Using scikit-learn permutation importance (more information [here](https://scikit-learn.org/stable/modules/permutation_importance.html)),
we can measure how much the model performance degrades when the values of a single feature are randomly shuffled across samples.
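
To make the idea concrete, here is a simplified pure-Python sketch of permutation importance: one column is shuffled across samples at a time, and the resulting drop in accuracy is averaged over several repeats. This is not scikit-learn's implementation; `accuracy`, `permutation_importance_sketch`, `toy_predict` and the toy data are assumptions made for illustration.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction matches the label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def permutation_importance_sketch(predict, rows, labels, n_repeats=10, seed=42):
    """Mean accuracy drop per column when that column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle the values of one column across all samples.
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [
                row[:col] + [value] + row[col + 1:]
                for row, value in zip(rows, shuffled)
            ]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_predict(row):
    """Toy model (assumption): only the first column matters."""
    return int(row[0] > 0.5)


data_rng = random.Random(0)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [toy_predict(row) for row in rows]

importances = permutation_importance_sketch(toy_predict, rows, labels)
print(importances)
```

Shuffling the second column leaves the toy predictions unchanged, so its importance is zero, while shuffling the first column lowers the accuracy.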

We can already see some differences with the basic feature importance from XGBoost. In
the top-5, only two features are in common with those selected previously:
- V14
- V4

And the new ones:
- V10
- V26
- V12

We also observe that the top-5 important features are the same for the training and
validation datasets.
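
The overlap between the two selections can be checked with plain set operations, using the feature names listed above. The variable names are illustrative only.

```python
# Features kept from the XGBoost importance plots (top 3 of each type).
xgboost_top = {"V4", "V14", "V15", "V7", "V17", "V16"}
# Top-5 features according to permutation importance.
permutation_top = {"V14", "V4", "V10", "V26", "V12"}

common = sorted(xgboost_top & permutation_top)
new = sorted(permutation_top - xgboost_top)

print(common)  # ['V14', 'V4']
print(new)  # ['V10', 'V12', 'V26']
```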

In [None]:
r = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=42, n_jobs=4)
sorted_importances_idx = r.importances_mean.argsort()
importances = pd.DataFrame(
    r.importances[sorted_importances_idx].T,
    columns=X.columns[sorted_importances_idx],
)
ax = importances.plot.box(vert=False, whis=10)
ax.set_title("Permutation Importances (validation set)")
ax.axvline(x=0, color="k", linestyle="--")
ax.set_xlabel("Decrease in accuracy score")
ax.figure.tight_layout()


In [None]:
r = permutation_importance(
    model, X_train, y_train, n_repeats=10, random_state=42, n_jobs=1
)
sorted_importances_idx = r.importances_mean.argsort()
importances = pd.DataFrame(
    r.importances[sorted_importances_idx].T,
    columns=X.columns[sorted_importances_idx],
)
ax = importances.plot.box(vert=False, whis=10)
ax.set_title("Permutation Importances (train set)")
ax.axvline(x=0, color="k", linestyle="--")
ax.set_xlabel("Decrease in accuracy score")
ax.figure.tight_layout()

Finally, we can use [SHAP](https://shap.readthedocs.io/en/latest/index.html),
a library dedicated to measuring feature importance. Its documentation also
includes an [example](https://shap.readthedocs.io/en/latest/example_notebooks/tabular_examples/tree_based_models/Census%20income%20classification%20with%20XGBoost.html) that works with XGBoost.

Most of the top-10 features were already mentioned with the other methods; only 3 are
new:
- Amount
- V18
- V19

Overall, these methods are quite consistent with each other.
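
The bar summary plot ranks features by their mean absolute SHAP value. A minimal pure-Python sketch of that ranking is given here, assuming the SHAP values are available as one row per sample and one column per feature. The helper `mean_abs_by_feature` and the toy values are hypothetical and are not taken from this notebook.

```python
def mean_abs_by_feature(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value, largest first."""
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            totals[j] += abs(value)
    means = {name: total / len(shap_rows) for name, total in zip(feature_names, totals)}
    return dict(sorted(means.items(), key=lambda item: item[1], reverse=True))


# Toy SHAP values for illustration only (4 samples, 2 features).
toy_shap = [[0.5, -0.1], [-0.5, 0.1], [0.25, 0.0], [-0.75, 0.2]]
ranking = mean_abs_by_feature(toy_shap, ["f0", "f1"])
print(ranking)
```

With a NumPy array of SHAP values, the same quantity is `np.abs(shap_values).mean(axis=0)`.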

In [None]:
d_train = xgboost.DMatrix(X_train, label=y_train)
d_val = xgboost.DMatrix(X_val, label=y_val)

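# Retrain with the native API, stopping early when the validation metric
# has not improved for 20 rounds.
model = xgboost.train(
    params,
    d_train,
    num_boost_round=5000,
    evals=[(d_val, "validation")],
    verbose_eval=100,
    early_stopping_rounds=20,
)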

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

shap.summary_plot(shap_values, X, plot_type="bar")