## Basic loss approach
In this notebook, we check how an XGBoost model with its default objective performs on our task. Note that the code below uses `XGBClassifier`, whose default objective for binary labels is logistic loss (`binary:logistic`) rather than RMSE.

We try different hyperparameters and feature subsets chosen by different feature selection methods.

In [2]:
import sys
sys.path.append('..')

from metrics import default_competition_metric
from metrics import make_competition_scorer


import numpy as np
import pandas as pd

import xgboost as xgb
from sklearn.preprocessing import StandardScaler

np.random.seed(44)

In [3]:
# device = 'cuda' # modify if needed

In [4]:
X_train = np.load('../../data/x_train.npy')
y_train = np.load('../../data/y_train.npy')
X_val = np.load('../../data/x_val.npy')
y_val = np.load('../../data/y_val.npy')


In [5]:
# basic xgboost model
model = xgb.XGBClassifier(n_estimators=1000, max_depth=5, verbosity=2)
model.fit(X_train, y_train)

In [8]:
y_pred = model.predict(X_val)

default_competition_metric(y_val, y_pred, k=X_train.shape[1])

-97850.0

In [9]:
print(f"Accuracy: {np.mean(y_val == y_pred)}")

Accuracy: 0.649
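
An accuracy figure is easier to judge next to the accuracy of always predicting the most frequent class. The sketch below computes that baseline on made-up labels; the helper name `majority_baseline_accuracy` and the toy array are assumptions for illustration, and in the notebook the function would be applied to `y_val`.

```python
import numpy as np


def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class in y."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.sum()


# Made-up labels, used only to show the call
y_toy = np.array([0, 0, 0, 1, 1])
baseline = majority_baseline_accuracy(y_toy)
print(baseline)
```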


The model clearly performs very poorly, so let us keep only the most important features.

In [10]:
feature_importance = model.feature_importances_

features_to_train = np.where(feature_importance > 0.005)[0]
print(f"Number of features to train: {len(features_to_train)}")

Number of features to train: 9
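
The threshold-based selection above can be sketched on a toy importance vector. The array values and the helper `top_k_features` are illustrative assumptions, not part of the notebook. The second variant keeps a fixed number of features instead of using a cut-off, which may be easier than tuning the threshold by hand.

```python
import numpy as np

# Made-up importances standing in for model.feature_importances_
importance = np.array([0.001, 0.30, 0.004, 0.25, 0.006, 0.439])

# Variant used in the notebook: keep every feature above a fixed cut-off
above_threshold = np.where(importance > 0.005)[0]


def top_k_features(importance, k):
    """Return the indices of the k largest importances, sorted by index."""
    order = np.argsort(importance)[::-1][:k]
    return np.sort(order)


top3 = top_k_features(importance, 3)
print(above_threshold)
print(top3)
```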


In [None]:
X_selected = X_train[:, features_to_train]
X_val_selected = X_val[:, features_to_train]

small_model = xgb.XGBClassifier(n_estimators=1000, max_depth=5, verbosity=2)
small_model.fit(X_selected, y_train)

y_pred = small_model.predict(X_val_selected)
y_pred_proba = small_model.predict_proba(X_val_selected)[:, 1]

print(f"Accuracy: {np.mean(y_val == y_pred)}")
print(f"Competition metric: {default_competition_metric(y_val, y_pred, k=X_selected.shape[1])}")
print(f"Competition metric with proba: {default_competition_metric(y_val, y_pred, k=X_selected.shape[1], y_pred_proba=y_pred_proba)}")

Accuracy: 0.613
Competition metric: -8050.000000000001
Competition metric with proba: -2300.0


The results are much better, but the competition metric is still negative. Let us try with fewer features.

In [None]:
feature_importance = model.feature_importances_

features_to_train = np.where(feature_importance > 0.0055)[0]
print(f"Number of features to train: {len(features_to_train)}")

Number of features to train: 4


In [None]:
X_selected = X_train[:, features_to_train]
X_val_selected = X_val[:, features_to_train]

small_model = xgb.XGBClassifier(n_estimators=1000, max_depth=5, verbosity=2)
small_model.fit(X_selected, y_train)

y_pred = small_model.predict(X_val_selected)
y_pred_proba = small_model.predict_proba(X_val_selected)[:, 1]

print(f"Accuracy: {np.mean(y_val == y_pred)}")
print(f"Competition metric: {default_competition_metric(y_val, y_pred, k=X_selected.shape[1])}")
print(f"Competition metric with proba: {default_competition_metric(y_val, y_pred, k=X_selected.shape[1], y_pred_proba=y_pred_proba)}")

Accuracy: 0.498
Competition metric: -6500.0
Competition metric with proba: 1150.0


In [None]:
competition_scorer_k = make_competition_scorer(k=X_selected.shape[1])
competition_scorer_k(small_model, X_val_selected, y_val)

1150.0
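
The call above uses the `(estimator, X, y)` signature that scikit-learn expects from a custom `scoring` callable. The sketch below shows one way such a factory could be built. It is not the code of `make_competition_scorer`: the names `toy_metric`, `make_scorer_sketch` and `AlwaysOne` are hypothetical, and the metric is a placeholder, since the real competition metric lives in the `metrics` module.

```python
import numpy as np


def toy_metric(y_true, y_pred, k):
    """Placeholder metric (NOT the competition metric): correct predictions minus k."""
    return float(np.sum(y_true == y_pred) - k)


def make_scorer_sketch(k):
    """Build a callable with the (estimator, X, y) signature used for `scoring`."""
    def scorer(estimator, X, y):
        return toy_metric(y, estimator.predict(X), k)
    return scorer


class AlwaysOne:
    """Minimal stand-in estimator that predicts class 1 for every row."""
    def predict(self, X):
        return np.ones(len(X), dtype=int)


X_toy = np.zeros((4, 2))
y_toy = np.array([1, 0, 1, 1])
score = make_scorer_sketch(k=2)(AlwaysOne(), X_toy, y_toy)
print(score)
```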

In [8]:
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline


def perform_grid_search(features_to_train):
    """Run a randomized hyperparameter search on the selected features and
    return the competition metric on the validation set."""
    X_selected = X_train[:, features_to_train]
    X_val_selected = X_val[:, features_to_train]

    # scorer parameterized by the number of features in use
    competition_scorer_k = make_competition_scorer(k=X_selected.shape[1])

    # candidate values; the 'xgb__' prefix routes each one to the pipeline's 'xgb' step
    params = {
        'xgb__learning_rate': [0.01, 0.05, 0.1, 0.15, 0.2],
        'xgb__min_child_weight': [1, 5, 10],
        'xgb__gamma': [0.5, 1, 1.5, 2, 5],
        'xgb__subsample': [0.6, 0.8, 1.0],
        'xgb__colsample_bytree': [0.6, 0.8, 1.0],
        'xgb__max_depth': [3, 4, 5],
    }

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=44)

    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('xgb', xgb.XGBClassifier(n_estimators=1000))
    ])

    # sample 1000 parameter combinations instead of trying the full grid
    random_search = RandomizedSearchCV(
        pipeline,
        param_distributions=params,
        n_iter=1000,
        scoring=competition_scorer_k,
        n_jobs=-1,
        cv=skf,
        verbose=3,
        random_state=44,
    )
    random_search.fit(X_selected, y_train)

    print(random_search.best_params_)

    # best_estimator_ is already refit on the full training data (refit=True by default),
    # so this extra fit is redundant but harmless
    best_model = random_search.best_estimator_
    best_model.fit(X_selected, y_train)

    y_pred = best_model.predict(X_val_selected)
    y_pred_proba = best_model.predict_proba(X_val_selected)[:, 1]
    return default_competition_metric(y_val, y_pred, k=X_selected.shape[1], y_pred_proba=y_pred_proba)
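
Although the function is named `perform_grid_search`, it samples candidates with `RandomizedSearchCV` rather than evaluating the whole grid. The short sketch below counts how many combinations the `params` dictionary defines and what share of them `n_iter=1000` covers. The values are copied from the function; the variable names `n_combinations` and `coverage` are only illustrative.

```python
from math import prod

# Candidate values copied from the `params` dictionary in perform_grid_search
params = {
    'xgb__learning_rate': [0.01, 0.05, 0.1, 0.15, 0.2],
    'xgb__min_child_weight': [1, 5, 10],
    'xgb__gamma': [0.5, 1, 1.5, 2, 5],
    'xgb__subsample': [0.6, 0.8, 1.0],
    'xgb__colsample_bytree': [0.6, 0.8, 1.0],
    'xgb__max_depth': [3, 4, 5],
}

# Size of the full grid: the product of the number of values per parameter
n_combinations = prod(len(values) for values in params.values())
n_iter = 1000
coverage = n_iter / n_combinations
print(n_combinations, round(coverage, 3))
```

With 2025 possible combinations, 1000 iterations cover a little under half of the grid.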
    

## Grid search on the selected features from xgboost importance

We selected the features as below:

In [12]:
features_to_train

array([  5,   8, 100, 102, 105, 302, 321, 367, 438], dtype=int64)

In [13]:
perform_grid_search(features_to_train)

Fitting 5 folds for each of 1000 candidates, totalling 5000 fits
{'xgb__subsample': 0.8, 'xgb__min_child_weight': 10, 'xgb__max_depth': 3, 'xgb__learning_rate': 0.01, 'xgb__gamma': 2, 'xgb__colsample_bytree': 1.0}


5149.999999999999

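### Conclusion of the grid search
The hyperparameter search improved the metric substantially: on the 9 features selected by XGBoost importance, the competition metric with probabilities rose from -2300 to about 5150.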

## Grid search on features from Boruta

In [14]:
features_to_train = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 100, 101, 102, 103, 104, 105])

In [15]:
perform_grid_search(features_to_train)

Fitting 5 folds for each of 1000 candidates, totalling 5000 fits
{'xgb__subsample': 0.6, 'xgb__min_child_weight': 1, 'xgb__max_depth': 3, 'xgb__learning_rate': 0.1, 'xgb__gamma': 5, 'xgb__colsample_bytree': 0.6}


4450.0

## MRMR features

In [14]:
# minimal small set
features_to_train = np.array([100, 102, 105])

In [15]:
perform_grid_search(features_to_train)

Fitting 5 folds for each of 1000 candidates, totalling 5000 fits


[output truncated: a long array of `nan` values was printed here during the search]

{'xgb__subsample': 0.6, 'xgb__min_child_weight': 5, 'xgb__max_depth': 5, 'xgb__learning_rate': 0.01, 'xgb__gamma': 1.5, 'xgb__colsample_bytree': 0.6}


6050.0
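
The `nan` values printed during this search suggest that some candidates received non-finite scores. One way to check how many, and which finite score is best, is to inspect `random_search.cv_results_['mean_test_score']`. The sketch below uses a made-up array in place of that attribute, because `random_search` is local to `perform_grid_search` and is not returned by it.

```python
import numpy as np

# Made-up scores standing in for random_search.cv_results_['mean_test_score']
mean_test_score = np.array([np.nan, 120.0, np.nan, 450.0, 300.0])

# Number of candidates whose mean cross-validation score is not a number
n_failed = int(np.isnan(mean_test_score).sum())

# Index of the best candidate, ignoring the nan entries
best_index = int(np.nanargmax(mean_test_score))
print(n_failed, best_index)
```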

In [10]:
# larger training set
features_to_train = np.array([100, 102, 105, 403, 466])

In [11]:
perform_grid_search(features_to_train)

Fitting 5 folds for each of 1000 candidates, totalling 5000 fits
{'xgb__subsample': 1.0, 'xgb__min_child_weight': 10, 'xgb__max_depth': 3, 'xgb__learning_rate': 0.15, 'xgb__gamma': 2, 'xgb__colsample_bytree': 0.8}


5550.0
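
To compare the feature sets side by side, the validation scores reported above can be collected in one place. The sketch below only restates numbers already printed in this notebook (5149.999999999999 is rounded to 5150); the dictionary keys are descriptive labels chosen here.

```python
# Validation scores reported in this notebook after the hyperparameter search
results = {
    'xgboost importance (9 features)': 5150.0,
    'Boruta (16 features)': 4450.0,
    'MRMR (3 features)': 6050.0,
    'MRMR (5 features)': 5550.0,
}

# Print the feature sets from best to worst score
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.0f}")

best_name = max(results, key=results.get)
```

On these numbers the 3-feature MRMR set scores highest, although that run also printed `nan` values during the search, so the result deserves a second look.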