# Initial model: Random Forest

Since some feature selection methods require a trained model, we will use a Random Forest for this task. To make the model more robust, we first tune its hyperparameters with cross-validation on all the features.

## Data loading



In [1]:
import numpy as np

x_path = "data/x_train.txt"
y_path = "data/y_train.txt"

X = np.loadtxt(x_path)
y = np.loadtxt(y_path)

## Hyperparameter tuning

Several Random Forest hyperparameters are worth tuning. We will use cross-validation to tune the following ones:
- `n_estimators`: the number of trees in the forest
- `max_depth`: the maximum depth of each tree
- `min_samples_split`: the minimum number of samples required to split an internal node
- `min_samples_leaf`: the minimum number of samples required to be at a leaf node
- `max_features`: the number of features to consider when looking for the best split
- `bootstrap`: whether bootstrap samples are used when building trees

In [10]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_dist = {
    "n_estimators": [100, 250, 500],
    "max_depth": [None, 5, 10, 15],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["log2", "sqrt", None],
    "bootstrap": [True, False],
}

rf = RandomForestClassifier(random_state=42)
search = GridSearchCV(
    rf,
    param_dist,
    cv=3,
    n_jobs=-1,
    scoring="precision",
    verbose=1,
)
search.fit(X, y)
print(search.best_params_)

Fitting 3 folds for each of 648 candidates, totalling 1944 fits


648 fits failed out of a total of 1944.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
188 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniconda/base/envs/advanced-ml/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/advanced-ml/lib/python3.10/site-packages/sklearn/base.py", line 1466, in wrapper
    estimator._validate_params()
  File "/opt/homebrew/Caskroom/miniconda/base/envs/advanced-ml/lib/python3.10/site-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/opt/homebrew/Caskroom

{'bootstrap': True, 'max_depth': 15, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 500}

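The counts in the log above follow directly from the grid, and they can be checked by hand before launching a long search. The sketch below uses only the standard library; `param_grid` copies the grid from the search cell, and `cv_folds` is a name introduced here for illustration. It also shows that a single invalid value in a three-value list would account for exactly the 648 failed fits reported in the warning.

```python
from math import prod

# Same grid as in the search cell above.
param_grid = {
    "n_estimators": [100, 250, 500],
    "max_depth": [None, 5, 10, 15],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["log2", "sqrt", None],
    "bootstrap": [True, False],
}
cv_folds = 3  # matches cv=3 in the search

# A grid search tries every combination, so the number of candidates
# is the product of the list lengths.
n_candidates = prod(len(values) for values in param_grid.values())

# Each candidate is fitted once per fold.
n_fits = n_candidates * cv_folds

# Fits that share one fixed value of max_features: one third of the grid.
n_fits_per_max_features_value = (
    n_candidates // len(param_grid["max_features"]) * cv_folds
)

print(n_candidates, n_fits, n_fits_per_max_features_value)
# 648 1944 648
```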

The first search was run with an invalid `max_features` value ("auto" instead of "log2"), which is why a third of the fits failed; the grid shown above already contains the corrected value. Instead of repeating the whole search, which is very time-consuming, we run a second, smaller search that covers "log2". To keep it fast, we also narrowed the other ranges around the best values from the first search. Note that this carries some risk of missing the global optimum.

In [11]:
param_dist = {
    "n_estimators": [250, 500],
    "max_depth": [None, 10, 15],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["log2", "sqrt"],
    "bootstrap": [True],
}

rf = RandomForestClassifier(random_state=42, n_jobs=-1)
search = GridSearchCV(
    rf,
    param_dist,
    cv=3,
    n_jobs=-1,
    scoring="precision",
    verbose=1,
)
search.fit(X, y)
print(search.best_params_)

Fitting 3 folds for each of 48 candidates, totalling 144 fits
{'bootstrap': True, 'max_depth': 15, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 500}

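A failure like the one in the first search can be caught cheaply before committing to a full grid. The sketch below is one possible approach, not part of the original notebook: it tries each value of each hyperparameter once, on its own, and reports the values that raise. `find_failing_values` and `fake_fit` are hypothetical names; `fake_fit` is a stand-in that imitates the rejected "auto" value so the example runs without any data. In the notebook, the callable could instead be something like `lambda params: RandomForestClassifier(**params).fit(X[:200], y[:200])`.

```python
def find_failing_values(param_grid, try_fit):
    """Return {hyperparameter: [values that made try_fit raise]}.

    Each value is tested alone (all other settings left to try_fit's
    defaults), so the cost is the sum of the list lengths, not their product.
    """
    failures = {}
    for name, values in param_grid.items():
        for value in values:
            try:
                try_fit({name: value})
            except Exception:
                failures.setdefault(name, []).append(value)
    return failures


def fake_fit(params):
    # Stand-in for a real fit: rejects "auto", as described in the text.
    if params.get("max_features") == "auto":
        raise ValueError("invalid value for max_features")


grid = {
    "max_depth": [None, 5, 10, 15],
    "max_features": ["auto", "sqrt", None],
    "bootstrap": [True, False],
}

print(find_failing_values(grid, fake_fit))
# {'max_features': ['auto']}
```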

## Baseline model vs tuned model

After obtaining the best hyperparameters, we check that the tuned model is actually better by comparing it, using cross-validation, with a model that uses the default hyperparameters.

In [12]:
from sklearn.model_selection import cross_val_score

baseline_model = RandomForestClassifier(random_state=42)
scores = cross_val_score(baseline_model, X, y, cv=5, scoring="precision", n_jobs=-1)
print(f"Baseline precision: {scores.mean():.4f} (+/- {scores.std():.4f})")

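# Note: unlike the baseline, no random_state is passed here,
# so the tuned scores can vary slightly between runs.
tuned_model = RandomForestClassifier(**search.best_params_)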
scores = cross_val_score(tuned_model, X, y, cv=5, scoring="precision", n_jobs=-1)
print(f"Tuned precision: {scores.mean():.4f} (+/- {scores.std():.4f})")

Baseline precision: 0.6093 (+/- 0.0278)
Tuned precision: 0.6504 (+/- 0.0204)


The tuned model indeed achieved better precision, so we will use it for feature selection. Since the main goal of the task is the best possible precision among the top 20% of predictions, we use a custom scoring function for one more comparison.

In [14]:
from top20_scoring import top_20_perc_scoring
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)

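def score_model(model, X, y, skf):
    """Return the mean custom score of `model` over the folds of `skf`."""
    total = 0.0  # named `total` to avoid shadowing the built-in `sum`
    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Refit the model on the training part of this fold
        model.fit(X_train, y_train)

        # Predicted probability of the second class (column 1),
        # used to rank the test samples
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        total += top_20_perc_scoring(y_test, y_pred_proba)

    return total / skf.n_splits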

baseline_score = score_model(baseline_model, X, y, skf)
tuned_score = score_model(tuned_model, X, y, skf)

print(f"Baseline score: {baseline_score:.4f}")
print(f"Tuned score: {tuned_score:.4f}")



Baseline score: 0.6710
Tuned score: 0.6880

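The `top20_scoring` module imported above is not shown in this notebook. As an illustration only, a metric of the kind described in the text (precision among the 20% of samples with the highest predicted probability) might look like the sketch below. The function name, the rounding rule, and the assumption that the positive class is labelled `1` are all choices made here, not details taken from the original module.

```python
import math


def precision_at_top_fraction(y_true, y_score, fraction=0.2):
    """Precision among the `fraction` of samples with the highest scores.

    Assumes binary labels with 1 as the positive class.
    """
    if len(y_true) != len(y_score):
        raise ValueError("y_true and y_score must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")

    # Number of samples to keep; rounding down is an assumption.
    n_top = max(1, math.floor(fraction * len(y_true)))

    # Indices sorted by descending score.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    top = order[:n_top]

    # Share of true positives among the selected samples.
    return sum(1 for i in top if y_true[i] == 1) / n_top


y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

# The top 20% are the two highest scores, with labels 1 and 0.
print(precision_at_top_fraction(y_true, y_score))  # 0.5
```

Because only the highest-ranked samples count, two models that agree on their most confident predictions receive similar scores even if they differ elsewhere.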

Here the tuned model again outperforms the baseline (0.6880 vs. 0.6710), but the gap is smaller than for plain precision. This is plausible, since the metric only looks at the 20% of test samples with the highest predicted probability.