# Final feature selection

With the best candidate features selected, we now tune the hyperparameters for this feature set, using `precision` as the scoring metric. We tune both a Random Forest and an XGBoost model. Afterwards we optimize the feature set directly against the task's score formula, which rewards high precision and a small number of features. For this purpose we use the custom scoring function from the `top20_scoring.py` file.
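
The contents of `top20_scoring.py` are not shown in this notebook. Judging from how `top_20_perc_scoring(y_true, y_pred_proba)` is called later, it plausibly returns the fraction of actual positives among the 20% of observations with the highest predicted probability. The sketch below is an assumption along those lines, written under the hypothetical name `top_fraction_precision`; it is not the project's actual implementation.

```python
import numpy as np


def top_fraction_precision(y_true, y_score, fraction=0.2):
    """Hypothetical stand-in: precision among the top `fraction` of scores."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Number of observations to select (at least one).
    k = max(1, int(round(fraction * len(y_score))))
    # Indices of the k highest scores; a stable sort keeps ties deterministic.
    top_idx = np.argsort(-y_score, kind="stable")[:k]
    # Share of the selected observations that are actual positives.
    return float(np.mean(y_true[top_idx] == 1))


# Toy data: the two highest scores belong to one positive and one negative.
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.1, 0.2, 0.6, 0.3, 0.4, 0.5, 0.05])
print(top_fraction_precision(y_true, y_score))
```

Unlike plain `precision`, a score of this kind depends only on how observations are ordered by predicted probability, not on the 0.5 decision threshold.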

## Data loading

In [36]:
import numpy as np
from top20_scoring import top_20_perc_scoring

x_path = "data/x_train.txt"
y_path = "data/y_train.txt"

X = np.loadtxt(x_path)
y = np.loadtxt(y_path)

selected_features = [8, 100, 101, 102, 103, 104, 105, 285, 328, 351, 403]

X = X[:, selected_features]

## XGBoost Hyperparameter tuning

For this model we tune the following hyperparameters (named as in the scikit-learn wrapper):
- `eta` - learning rate
- `max_depth` - maximum depth of the tree
- `subsample` - subsample ratio of the training instances
- `colsample_bytree` - subsample ratio of columns when constructing each tree
- `gamma` - minimum loss reduction required to make a further partition on a leaf node of the tree
- `min_child_weight` - minimum sum of instance weights required in a child node
- `reg_lambda` - L2 regularization term on weights (`lambda` in the native API)
- `reg_alpha` - L1 regularization term on weights (`alpha` in the native API)


In [3]:
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

param_grid = {
    "eta": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [3, 5, 7, 9],
    "subsample": [0.7, 0.9, 1],
    "colsample_bytree": [0.7, 0.9, 1],
    "gamma": [0, 0.1, 1, 10],
    "min_child_weight": [1, 3, 5],
    "reg_alpha": [0, 0.01, 0.1, 1],
    "reg_lambda": [1, 0.1, 0.01],
}

model = xgb.XGBClassifier()
grid_search = GridSearchCV(
    model, param_grid, cv=5, scoring="precision", n_jobs=-1, verbose=1
)
grid_search.fit(X, y)

print(grid_search.best_params_)

Fitting 5 folds for each of 20736 candidates, totalling 103680 fits


{'colsample_bytree': 0.7, 'eta': 0.1, 'gamma': 0, 'max_depth': 3, 'min_child_weight': 5, 'reg_alpha': 0, 'reg_lambda': 0.1, 'subsample': 0.9}
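
The candidate count in the log above follows directly from the grid: it is the product of the number of values listed for each hyperparameter, and the number of fits is that product times the number of folds. A small sketch of the arithmetic (the helper name `count_fits` is ours, for illustration only):

```python
from math import prod

# Number of values per hyperparameter, read off `param_grid` above.
xgb_grid_sizes = {
    "eta": 4,
    "max_depth": 4,
    "subsample": 3,
    "colsample_bytree": 3,
    "gamma": 4,
    "min_child_weight": 3,
    "reg_alpha": 4,
    "reg_lambda": 3,
}


def count_fits(grid_sizes, n_folds):
    """Return (number of candidates, total number of model fits)."""
    n_candidates = prod(grid_sizes.values())
    return n_candidates, n_candidates * n_folds


n_candidates, n_fits = count_fits(xgb_grid_sizes, n_folds=5)
print(n_candidates, n_fits)
```

Because the count grows multiplicatively, adding one more three-valued hyperparameter would triple the search time.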


In [9]:
from sklearn.model_selection import cross_val_score

baseline_model = xgb.XGBClassifier()
tuned_model = xgb.XGBClassifier(**grid_search.best_params_)

scores = cross_val_score(baseline_model, X, y, cv=5, scoring="precision", n_jobs=-1)
print(f"Baseline precision: {scores.mean():.4f} (+/- {scores.std():.4f})")

scores = cross_val_score(tuned_model, X, y, cv=5, scoring="precision", n_jobs=-1)
print(f"Tuned precision: {scores.mean():.4f} (+/- {scores.std():.4f})")

Baseline precision: 0.6555 (+/- 0.0179)
Tuned precision: 0.7126 (+/- 0.0104)


In [10]:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)


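def score_model(model, X, y, skf):
    """Average the custom top-20% score over the folds of `skf`."""
    total = 0.0  # named `total` to avoid shadowing the built-in `sum`

    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        model.fit(X_train, y_train)

        # Probability of the positive class for each held-out observation
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        score = top_20_perc_scoring(y_test, y_pred_proba)
        total += score

    return total / skf.n_splits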


baseline_score = score_model(baseline_model, X, y, skf)
tuned_score = score_model(tuned_model, X, y, skf)

print(f"Baseline score: {baseline_score:.4f}")
print(f"Tuned score: {tuned_score:.4f}")

Baseline score: 0.7270
Tuned score: 0.7650


## Random Forest Hyperparameter tuning

For this model we tune the same hyperparameters as in `rf_tuning.ipynb`.

In [8]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_dist = {
    "n_estimators": [50, 100, 250, 500],
    "max_depth": [None, 5, 10, 15, 20],
    "min_samples_split": [2, 5, 10, 25],
    "min_samples_leaf": [1, 2, 4, 8, 16],
    "max_features": ["log2", "sqrt", None],
    "bootstrap": [True, False],
    "n_jobs": [-1],
}

rf = RandomForestClassifier()
search = GridSearchCV(
    rf,
    param_dist,
    cv=3,
    n_jobs=-1,
    scoring="precision",
    verbose=1,
)
search.fit(X, y)
print(search.best_params_)

Fitting 3 folds for each of 2400 candidates, totalling 7200 fits


{'bootstrap': True, 'max_depth': 10, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 500, 'n_jobs': -1}


In [11]:
baseline_model = RandomForestClassifier()
tuned_model = RandomForestClassifier(**search.best_params_)

scores = cross_val_score(baseline_model, X, y, cv=5, scoring="precision", n_jobs=-1)
print(f"Baseline precision: {scores.mean():.4f} (+/- {scores.std():.4f})")

scores = cross_val_score(tuned_model, X, y, cv=5, scoring="precision", n_jobs=-1)
print(f"Tuned precision: {scores.mean():.4f} (+/- {scores.std():.4f})")

Baseline precision: 0.6759 (+/- 0.0157)
Tuned precision: 0.6932 (+/- 0.0182)


In [14]:
baseline_model = RandomForestClassifier()
tuned_model = RandomForestClassifier(**search.best_params_)

baseline_score = score_model(baseline_model, X, y, skf)
tuned_score = score_model(tuned_model, X, y, skf)

print(f"Baseline score: {baseline_score:.4f}")
print(f"Tuned score: {tuned_score:.4f}")

Baseline score: 0.7360
Tuned score: 0.7480


XGBoost achieves the best result on both metrics: the tuned models reach a precision of 0.7126 (XGBoost) versus 0.6932 (Random Forest), and a custom score of 0.7650 versus 0.7480. Therefore, the rest of this notebook focuses on the XGBoost model only.

## Feature selection methods revisited

We now rerun selected methods, this time only on the preselected features, to check whether the preselection affects the rankings.

### XGBoost feature importance

In [37]:
ranking = {}

ranking["xgboost"] = np.zeros(X.shape[1])

xgboost = xgb.XGBClassifier(**grid_search.best_params_)
xgboost.fit(X, y)

for i, idx in enumerate(np.argsort(xgboost.feature_importances_)[::-1]):
    print(f"Feature {selected_features[idx]}: {xgboost.feature_importances_[idx]:.4f}")
    ranking["xgboost"][idx] = i

Feature 105: 0.1304
Feature 100: 0.1271
Feature 102: 0.1225
Feature 8: 0.1105
Feature 103: 0.1097
Feature 101: 0.1005
Feature 104: 0.0982
Feature 403: 0.0561
Feature 328: 0.0498
Feature 285: 0.0479
Feature 351: 0.0474


### ReliefF

In [38]:
from skrebate import ReliefF

ranking["relief"] = np.zeros(X.shape[1])

relief = ReliefF(n_neighbors=1.0, n_jobs=-1)
relief.fit(X, y)

for i, idx in enumerate(np.argsort(relief.feature_importances_)[::-1]):
    print(f"Feature {selected_features[idx]}: {relief.feature_importances_[idx]:.4f}")
    ranking["relief"][idx] = i

Feature 105: 0.0016
Feature 102: 0.0015
Feature 100: 0.0013
Feature 8: 0.0013
Feature 101: 0.0013
Feature 103: 0.0012
Feature 104: 0.0009
Feature 285: 0.0006
Feature 351: 0.0005
Feature 328: 0.0004
Feature 403: 0.0002


### Recursive Feature Elimination

In [39]:
from sklearn.feature_selection import RFE

ranking["rfe"] = np.zeros(X.shape[1])

rfe = RFE(
    estimator=xgb.XGBClassifier(**grid_search.best_params_),
    step=1,
    verbose=0,
    n_features_to_select=1,
)
rfe.fit(X, y)

for i, idx in enumerate(np.argsort(rfe.ranking_)):
    print(f"Feature {selected_features[idx]}: {rfe.ranking_[idx]}")
    ranking["rfe"][idx] = i

Feature 102: 1
Feature 100: 2
Feature 105: 3
Feature 103: 4
Feature 101: 5
Feature 104: 6
Feature 8: 7
Feature 285: 8
Feature 403: 9
Feature 328: 10
Feature 351: 11


### SHAP

In [40]:
import shap

ranking["shap"] = np.zeros(X.shape[1])

xgboost = xgb.XGBClassifier(**grid_search.best_params_)
xgboost.fit(X, y)

explainer = shap.TreeExplainer(xgboost)
shap_values = explainer.shap_values(X)

shap_results = {}
for i in range(X.shape[1]):
    shap_results[i] = np.abs(shap_values[:, i]).mean(0)

sorted_results = sorted(shap_results.items(), key=lambda x: x[1], reverse=True)
for i in range(X.shape[1]):
    print(
        f"Feature {selected_features[sorted_results[i][0]]}: {sorted_results[i][1]:.4f}"
    )
    ranking["shap"][sorted_results[i][0]] = i

Feature 102: 0.2726
Feature 105: 0.2701
Feature 100: 0.2676
Feature 103: 0.2400
Feature 101: 0.2306
Feature 104: 0.1798
Feature 8: 0.1173
Feature 328: 0.0683
Feature 403: 0.0572
Feature 285: 0.0558
Feature 351: 0.0383


### Maximal Information Coefficient

In [41]:
from minepy import MINE

ranking["mic"] = np.zeros(X.shape[1])

mine = MINE(alpha=1.0, c=15)

mic_results = {}
for i in range(X.shape[1]):
    mine.compute_score(X[:, i], y)
    mic_results[i] = mine.mic()

sorted_mic_results = sorted(mic_results.items(), key=lambda x: x[1], reverse=True)
for i in range(X.shape[1]):
    print(
        f"Feature {selected_features[sorted_mic_results[i][0]]}: {sorted_mic_results[i][1]:.4f}"
    )
    ranking["mic"][sorted_mic_results[i][0]] = i

Feature 8: 1.0000
Feature 100: 1.0000
Feature 101: 1.0000
Feature 102: 1.0000
Feature 103: 1.0000
Feature 104: 1.0000
Feature 105: 1.0000
Feature 328: 1.0000
Feature 403: 1.0000
Feature 351: 1.0000
Feature 285: 0.9951


### Final ranking

In [43]:
final_ranking = np.sum(list(ranking.values()), axis=0)
print(f"Methods: {list(ranking.keys())}")

for i in np.argsort(final_ranking):
    print(f"Feature {selected_features[i]}: {final_ranking[i]}")

Methods: ['xgboost', 'relief', 'rfe', 'shap', 'mic']
Feature 102: 6.0
Feature 100: 7.0
Feature 105: 9.0
Feature 8: 18.0
Feature 101: 19.0
Feature 103: 19.0
Feature 104: 27.0
Feature 328: 40.0
Feature 403: 41.0
Feature 285: 42.0
Feature 351: 47.0


The rankings now differ slightly. Features 102, 100 and 105 rank best (rank sums 6 to 9), features 8, 101, 103 and 104 rank somewhat worse (18 to 27), and the remaining four features sit at the bottom of the ranking (40 to 47).
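
The final ranking is a plain sum of per-method ranks, where 0 is the best rank. As a cross-check, the sketch below recomputes those sums from the orderings printed in the outputs above, without rerunning any model (the helper name `rank_sum` is ours).

```python
features = [8, 100, 101, 102, 103, 104, 105, 285, 328, 351, 403]

# Orderings copied from the printed outputs above, best feature first.
orderings = {
    "xgboost": [105, 100, 102, 8, 103, 101, 104, 403, 328, 285, 351],
    "relief": [105, 102, 100, 8, 101, 103, 104, 285, 351, 328, 403],
    "rfe": [102, 100, 105, 103, 101, 104, 8, 285, 403, 328, 351],
    "shap": [102, 105, 100, 103, 101, 104, 8, 328, 403, 285, 351],
    "mic": [8, 100, 101, 102, 103, 104, 105, 328, 403, 351, 285],
}


def rank_sum(features, orderings):
    """Sum each feature's zero-based position over all methods."""
    totals = {f: 0 for f in features}
    for order in orderings.values():
        for rank, feature in enumerate(order):
            totals[feature] += rank
    return totals


totals = rank_sum(features, orderings)
for feature in sorted(features, key=totals.get):
    print(f"Feature {feature}: {totals[feature]}")
```

Summing ranks weighs every method equally and ignores the size of the gaps between scores. MIC, for example, prints 1.0000 for ten of the eleven features, yet it still contributes ranks 0 to 9 to the sum for them.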

### Net income based selection

Although we now have a new feature importance ranking, we will not reduce the feature set further on this basis. Instead, we select the best combination of features directly, based on the score computed from the task's formula. To make the results reliable, we use cross-validation.
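
The formula used in the next cell charges 200 per feature and earns 10 per true positive among the selected customers. The sketch below isolates that arithmetic. The helper names are ours, and `n_selected=1000` is an assumption (20% of an assumed 5000 training rows); the precision value passed in is purely illustrative.

```python
import math

REWARD_PER_HIT = 10     # from `TP * 10` in the scoring cell
COST_PER_FEATURE = 200  # from `-200 * n_features` in the scoring cell


def net_income(precision_top, n_features, n_selected=1000):
    """Net income for a given top-20% precision and feature count.

    n_selected=1000 is an assumption, not a value stated in the notebook.
    """
    true_positives = precision_top * n_selected
    return REWARD_PER_HIT * true_positives - COST_PER_FEATURE * n_features


def break_even_precision_gain(n_selected=1000):
    """Precision gain an extra feature must bring to pay for itself."""
    return COST_PER_FEATURE / (REWARD_PER_HIT * n_selected)


print(net_income(0.75, 4))          # illustrative precision of 0.75
print(break_even_precision_gain())  # gain needed per additional feature
```

Under these assumptions an additional feature is only worth keeping if it raises the top-20% precision by at least two percentage points, which is why small subsets can outscore larger ones.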

In [78]:
from itertools import combinations
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)

def score_subset(X, y, skf, best_params):
    """Cross-validated net income for the feature columns in X."""
    n_features = X.shape[1]
    n_splits = skf.n_splits

    # Each feature costs 200; charged once per fold so the average below is per fold
    net_income = -200 * n_features * n_splits

    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        model = xgb.XGBClassifier(**best_params)
        model.fit(X_train, y_train)

        y_pred_proba = model.predict_proba(X_test)[:, 1]
        # Scale the top-20% score to a customer count equal to 20% of all rows
        TP = top_20_perc_scoring(y_test, y_pred_proba) * X.shape[0] * 0.2
        net_income += TP * 10  # each true positive earns 10

    return net_income / n_splits

subsets = []

# Exhaustively evaluate every non-empty subset of the preselected features
for num_features in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), num_features):
        subset = list(subset)
        X_subset = X[:, subset]
        score = score_subset(X_subset, y, skf, grid_search.best_params_)
        subsets.append({
            "subset": [selected_features[i] for i in subset],
            "score": score
        })

In [118]:
import pandas as pd

subsets_df = pd.DataFrame(subsets)
subsets_df = subsets_df.sort_values("score", ascending=False)
subsets_df = subsets_df.reset_index(drop=True)

subsets_df


| | subset | score |
|---|---|---|
| 0 | [101, 102, 103, 105] | 6930.0 |
| 1 | [8, 100, 102, 103] | 6840.0 |
| 2 | [102, 103, 105] | 6820.0 |
| 3 | [100, 102, 103] | 6820.0 |
| 4 | [101, 102, 103] | 6810.0 |
| ... | ... | ... |
| 2042 | [285, 351] | 4700.0 |
| 2043 | [285, 351, 403] | 4690.0 |
| 2044 | [285, 328, 351] | 4620.0 |
| 2045 | [285, 328, 351, 403] | 4610.0 |


The subset offering the best compromise between precision and number of features is [101, 102, 103, 105], so we use it for the final model. Interestingly, the features outside the groups 0-9 and 100-105 appear to contribute little to the model: the worst-scoring subsets consist only of features 285, 328, 351 and 403. The result also shows that feature importance methods are imperfect and can sometimes mislead. Feature 100, for example, ranked second in the combined ranking, yet it is absent from the best subset.

In [94]:
best_subset = subsets_df.iloc[0].subset
print(best_subset)

[101, 102, 103, 105]


In [97]:
# Reload the full training matrix and keep only the columns of the best subset
X = np.loadtxt("data/x_train.txt")
X = X[:, best_subset]

param_grid = {
    "eta": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [3, 5, 7, 9],
    "subsample": [0.7, 0.9, 1],
    "colsample_bytree": [0.7, 0.9, 1],
    "gamma": [0, 0.1, 1, 10],
    "min_child_weight": [1, 3, 5],
    "reg_alpha": [0, 0.01, 0.1, 1],
    "reg_lambda": [1, 0.1, 0.01],
}

model = xgb.XGBClassifier()
grid_search = GridSearchCV(
    model, param_grid, cv=5, scoring="precision", n_jobs=-1, verbose=1
)
grid_search.fit(X, y)

print(grid_search.best_params_)

Fitting 5 folds for each of 20736 candidates, totalling 103680 fits
{'colsample_bytree': 0.7, 'eta': 0.01, 'gamma': 0.1, 'max_depth': 5, 'min_child_weight': 5, 'reg_alpha': 1, 'reg_lambda': 0.1, 'subsample': 0.9}
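
To see at a glance what changed relative to the first search, the two printed parameter dictionaries can be compared directly. The sketch below copies both dictionaries from the outputs in this notebook; the helper name `changed_params` is ours.

```python
# Best parameters printed for the 11 preselected features.
params_11_features = {
    "colsample_bytree": 0.7, "eta": 0.1, "gamma": 0, "max_depth": 3,
    "min_child_weight": 5, "reg_alpha": 0, "reg_lambda": 0.1, "subsample": 0.9,
}
# Best parameters printed for the 4-feature subset.
params_4_features = {
    "colsample_bytree": 0.7, "eta": 0.01, "gamma": 0.1, "max_depth": 5,
    "min_child_weight": 5, "reg_alpha": 1, "reg_lambda": 0.1, "subsample": 0.9,
}


def changed_params(before, after):
    """Map each parameter whose value differs to a (before, after) pair."""
    keys = sorted(before.keys() | after.keys())
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }


diff = changed_params(params_11_features, params_4_features)
for name, (old, new) in diff.items():
    print(f"{name}: {old} -> {new}")
```

Four of the eight parameters moved (`eta`, `gamma`, `max_depth` and `reg_alpha`), while both sampling ratios, `min_child_weight` and `reg_lambda` stayed the same.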


Interestingly, after reducing the feature set further (this time with our custom scoring function, which still relies heavily on precision), the best hyperparameters changed. Let's check how this affects the scores.

In [100]:
skf = StratifiedKFold(n_splits=5)

final_model = xgb.XGBClassifier(**grid_search.best_params_)

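# Store the result in `scores` so the print below reports this model
scores = cross_val_score(final_model, X, y, cv=5, scoring="precision", n_jobs=-1)
print(f"Final precision: {scores.mean():.4f} (+/- {scores.std():.4f})")

final_model = xgb.XGBClassifier(**grid_search.best_params_)

score = score_model(final_model, X, y, skf)
print(f"Final score: {score:.4f}")


Final precision: 0.6932 (+/- 0.0182)
Final score: 0.7460


With only 4 features, the custom score (0.7460) is close to the 0.7650 of the tuned model on all 11 preselected features. The precision shown above needs care: when this output was produced, the cell assigned the result to `score` but printed the older `scores` variable, which is why the value is identical to the tuned Random Forest precision. It should be recomputed before being compared with the 0.7126 of the 11-feature model.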

## Challenge submission

With the final model and the selected feature subset, we now generate predictions for the test set.

In [138]:
X_test = np.loadtxt("data/x_test.txt")
X_test = X_test[:, best_subset]

# Best hyperparameters from the last grid search, hard-coded so this cell can run on its own
best_params = {'colsample_bytree': 0.7, 'eta': 0.01, 'gamma': 0.1, 'max_depth': 5, 'min_child_weight': 5, 'reg_alpha': 1, 'reg_lambda': 0.1, 'subsample': 0.9}
final_model = xgb.XGBClassifier(**best_params)
final_model.fit(X, y)

# Keep the 1000 test observations with the highest predicted probability
y_pred_proba = final_model.predict_proba(X_test)[:, 1]
top_1000_idx = np.argsort(y_pred_proba)[::-1][:1000]

In [140]:
vars_path = "data/vars.txt"
obs_path = "data/obs.txt"

np.savetxt(vars_path, [i + 1 for i in best_subset], fmt="%d")
np.savetxt(obs_path, top_1000_idx + 1, fmt="%d")
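
Before submitting, it can be worth checking that the saved indices have the expected shape. The sketch below assumes, based on the `+ 1` and `[:1000]` in the cells above, that the submission expects 1-based, unique row indices; both helpers are hypothetical and run on random toy data rather than the real predictions.

```python
import numpy as np


def top_k_one_based(proba, k):
    """Return 1-based indices of the k highest probabilities, best first."""
    proba = np.asarray(proba)
    # Negating sorts in descending order; a stable sort keeps ties deterministic.
    order = np.argsort(-proba, kind="stable")[:k]
    return order + 1


def check_submission(obs_idx, n_rows, k):
    """Raise AssertionError if the index array looks malformed."""
    obs_idx = np.asarray(obs_idx)
    assert obs_idx.shape == (k,), "wrong number of observations"
    assert len(np.unique(obs_idx)) == k, "duplicate observations"
    assert obs_idx.min() >= 1 and obs_idx.max() <= n_rows, "index out of range"


# Toy stand-in for `y_pred_proba`.
rng = np.random.default_rng(0)
proba = rng.random(50)

obs = top_k_one_based(proba, k=10)
check_submission(obs, n_rows=len(proba), k=10)
print(obs)
```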