# Machine Learning. Hyper Parameters Optimization

## Task Description

Data for this task has been collected experimentally and represents a biological process. It contains the following columns:
* Activity: observed biological response, a binary label (0 or 1)
* D1-D1776: molecular descriptors such as size, shape, and elemental composition.

No preprocessing is required: the data is already encoded and normalized.

The F1-score must be used as the evaluation metric for this task.
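
For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from scratch for binary labels. The helper name `f1_from_labels` and the toy labels are illustrative assumptions; the notebook itself relies on `sklearn.metrics.f1_score` and the `'f1'` scoring string.

```python
def f1_from_labels(y_true, y_pred):
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: treat F1 as 0 in this sketch
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (not taken from the dataset).
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
f1_toy = f1_from_labels(y_true, y_pred)
print(f"Toy F1: {f1_toy:.2f}")
```

Because F1 ignores true negatives, it can differ noticeably from accuracy when the classes are imbalanced.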

Two models must be trained: logistic regression and random forest. Hyper parameter optimization must then be performed using both basic methods (GridSearchCV, RandomizedSearchCV) and advanced methods (Hyperopt, Optuna). The maximum number of iterations must not exceed 50.

## Initialization

### Import Necessary Libraries

In [87]:
import hyperopt
import numpy as np
import optuna
import pandas as pd

from hyperopt import hp
from sklearn import ensemble
from sklearn import linear_model
from sklearn import metrics
from sklearn import model_selection

### Define Constants

In [88]:
max_iter = 1000
n_splits = 5
random_state = 42

## Load Data

In [89]:
data = pd.read_csv('../../data/_train_sem09.csv')
data

Unnamed: 0,Activity,D1,D2,D3,D4,D5,D6,D7,D8,D9,...,D1767,D1768,D1769,D1770,D1771,D1772,D1773,D1774,D1775,D1776
0,1,0.000000,0.497009,0.10,0.0,0.132956,0.678031,0.273166,0.585445,0.743663,...,0,0,0,0,0,0,0,0,0,0
1,1,0.366667,0.606291,0.05,0.0,0.111209,0.803455,0.106105,0.411754,0.836582,...,1,1,1,1,0,1,0,0,1,0
2,1,0.033300,0.480124,0.00,0.0,0.209791,0.610350,0.356453,0.517720,0.679051,...,0,0,0,0,0,0,0,0,0,0
3,1,0.000000,0.538825,0.00,0.5,0.196344,0.724230,0.235606,0.288764,0.805110,...,0,0,0,0,0,0,0,0,0,0
4,0,0.100000,0.517794,0.00,0.0,0.494734,0.781422,0.154361,0.303809,0.812646,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3746,1,0.033300,0.506409,0.10,0.0,0.209887,0.633426,0.297659,0.376124,0.727093,...,0,0,0,0,0,0,0,0,0,0
3747,1,0.133333,0.651023,0.15,0.0,0.151154,0.766505,0.170876,0.404546,0.787935,...,0,0,1,0,1,0,1,0,0,0
3748,0,0.200000,0.520564,0.00,0.0,0.179949,0.768785,0.177341,0.471179,0.872241,...,0,0,0,0,0,0,0,0,0,0
3749,1,0.100000,0.765646,0.00,0.0,0.536954,0.634936,0.342713,0.447162,0.672689,...,0,0,0,0,0,0,0,0,0,0


### Split the Data into Train and Test Sets

In [90]:
X = data.drop('Activity', axis=1)
y = data['Activity']
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=random_state
)

## Building Initial Models

### Logistic Regression

In [91]:
lr_model = linear_model.LogisticRegression(random_state=random_state, max_iter=max_iter)

kf = model_selection.KFold(n_splits=n_splits)

%time cv_metrics = model_selection.cross_validate(estimator=lr_model, X=X, y=y, cv=kf, scoring='f1', return_train_score=True)

# Double-quoted f-strings keep the nested single quotes valid on Python versions before 3.12.
print(f"Train mean f1-score: {np.mean(cv_metrics['train_score']):.2f}")
print(f"Test mean f1-score: {np.mean(cv_metrics['test_score']):.2f}")


CPU times: user 33 s, sys: 251 ms, total: 33.3 s
Wall time: 3.82 s
Train mean f1-score: 0.89
Test mean f1-score: 0.78
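
The `KFold` splitter in this cell is created without shuffling, so each test fold is a contiguous block of rows. A rough pure-Python sketch of that partitioning follows. The function name `contiguous_folds` is hypothetical, and the code mirrors the idea rather than reproducing scikit-learn's implementation.

```python
def contiguous_folds(n_samples, n_splits):
    """Yield (train_idx, test_idx) index lists for unshuffled k-fold splitting."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional row.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Small illustrative example: 11 rows, 5 folds.
folds = list(contiguous_folds(n_samples=11, n_splits=5))
test_sizes = [len(test) for _, test in folds]
print(test_sizes)
```

With unshuffled folds, any ordering present in the file carries over into the splits. Note that the search cells later pass `cv=n_splits` as an integer, which for classifiers makes scikit-learn use stratified folds instead, so the baseline and search scores are not computed on identical splits.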


### Random Forest

In [92]:
rf_model = ensemble.RandomForestClassifier(random_state=random_state)

kf = model_selection.KFold(n_splits=n_splits)

%time cv_metrics = model_selection.cross_validate(estimator=rf_model, X=X, y=y, cv=kf, scoring='f1', return_train_score=True)

# Double-quoted f-strings keep the nested single quotes valid on Python versions before 3.12.
print(f"Train mean f1-score: {np.mean(cv_metrics['train_score']):.2f}")
print(f"Test mean f1-score: {np.mean(cv_metrics['test_score']):.2f}")


CPU times: user 8.54 s, sys: 216 ms, total: 8.75 s
Wall time: 7.28 s
Train mean f1-score: 1.00
Test mean f1-score: 0.81


## Optimize Hyper Parameters

### Grid Search

#### Logistic Regression

In [96]:
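# Total combinations: 3 * 4 * 4 = 48 <= 50
# Some penalty/solver pairs are unsupported ('l1' with 'lbfgs' or 'sag', None with 'liblinear');
# those fits fail and are scored as nan.
# No `scoring` is passed below, so candidates are ranked by accuracy (the classifier default), not F1.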
param_grid = {
    'penalty': ['l1', 'l2', None],
    'solver': ['liblinear', 'lbfgs', 'saga', 'sag'],
    'C': list(np.linspace(0.01, 1.0, 4, dtype=float))
}

grid_search_lr = model_selection.GridSearchCV(
    estimator=linear_model.LogisticRegression(random_state=random_state, max_iter=max_iter),
    param_grid=param_grid,
    cv=n_splits,
    n_jobs=-1
)

%time grid_search_lr.fit(X_train, y_train)

y_train_pred = grid_search_lr.predict(X_train)
print(f'Train f1-score: {metrics.f1_score(y_train, y_train_pred):.2f}')

y_test_pred = grid_search_lr.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

print('Best parameters: ', grid_search_lr.best_params_)

60 fits failed out of a total of 240.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py", line 1172, in fit
    solv

CPU times: user 1.01 s, sys: 394 ms, total: 1.4 s
Wall time: 4min 13s
Train f1-score: 0.84
Test f1-score: 0.80
Best parameters:  {'C': 0.34, 'penalty': 'l1', 'solver': 'liblinear'}
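
The warning above reports 60 failed fits out of 240. That count is consistent with three unsupported penalty/solver pairs in the grid, each tried with 4 values of `C` over 5 folds. The sketch below reproduces the arithmetic. The `SUPPORTED` table is written from the scikit-learn `LogisticRegression` documentation as understood here and should be treated as an assumption for other library versions.

```python
from itertools import product

# Penalties each solver accepts (None = no regularization).
# 'saga' also accepts 'elasticnet', omitted because the grid does not use it.
SUPPORTED = {
    'liblinear': {'l1', 'l2'},
    'lbfgs': {'l2', None},
    'saga': {'l1', 'l2', None},
    'sag': {'l2', None},
}

penalties = ['l1', 'l2', None]
solvers = ['liblinear', 'lbfgs', 'saga', 'sag']
n_C_values = 4  # np.linspace(0.01, 1.0, 4)
n_folds = 5     # cv=n_splits

invalid_pairs = [
    (penalty, solver)
    for penalty, solver in product(penalties, solvers)
    if penalty not in SUPPORTED[solver]
]
failed_fits = len(invalid_pairs) * n_C_values * n_folds
total_fits = len(penalties) * len(solvers) * n_C_values * n_folds

print(invalid_pairs)
print(f"{failed_fits} of {total_fits} fits use an unsupported pair")
```

One way to avoid the wasted fits is to pass `param_grid` as a list of dictionaries, one per compatible penalty/solver group, which `GridSearchCV` accepts.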


#### Random Forest

In [97]:
# Total combinations: 4 * 2 * 6 = 48 <= 50
param_grid = {
    'n_estimators': list(range(80, 200, 30)),
    'min_samples_leaf': [5, 10],
    'max_depth': list(np.linspace(20, 40, 6, dtype=int))
}

grid_search_rf = model_selection.GridSearchCV(
    estimator=ensemble.RandomForestClassifier(random_state=random_state),
    param_grid=param_grid,
    cv=n_splits,
    n_jobs=-1
)

%time grid_search_rf.fit(X_train, y_train)

y_train_pred = grid_search_rf.predict(X_train)
print(f'Train f1-score: {metrics.f1_score(y_train, y_train_pred):.2f}')

y_test_pred = grid_search_rf.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

print('Best parameters: ', grid_search_rf.best_params_)

CPU times: user 1.32 s, sys: 203 ms, total: 1.52 s
Wall time: 32.9 s
Train f1-score: 0.94
Test f1-score: 0.83
Best parameters:  {'max_depth': 20, 'min_samples_leaf': 5, 'n_estimators': 80}


### Randomized Search

#### Logistic Regression

In [98]:
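# Total combinations: 3 * 4 * 10 = 120; n_iter=20 of them are sampled.
# No random_state is passed to RandomizedSearchCV, so the sampled candidates vary between runs.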
param_distributions = {
    'penalty': ['l1', 'l2', None],
    'solver': ['liblinear', 'lbfgs', 'saga', 'sag'],
    'C': list(np.linspace(0.01, 1.0, 10, dtype=float))
}

random_search_lr = model_selection.RandomizedSearchCV(
    estimator=linear_model.LogisticRegression(random_state=random_state, max_iter=max_iter),
    param_distributions=param_distributions,
    cv=n_splits,
    n_iter=20,
    n_jobs=-1
)

%time random_search_lr.fit(X_train, y_train)

y_train_pred = random_search_lr.predict(X_train)
print(f'Train f1-score: {metrics.f1_score(y_train, y_train_pred):.2f}')

y_test_pred = random_search_lr.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

print('Best parameters: ', random_search_lr.best_params_)

30 fits failed out of a total of 100.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py", line 1172, in fit
    solve

CPU times: user 46.4 s, sys: 272 ms, total: 46.7 s
Wall time: 3min 40s
Train f1-score: 0.87
Test f1-score: 0.79
Best parameters:  {'solver': 'saga', 'penalty': 'l1', 'C': 0.78}
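
When every entry in `param_distributions` is a list, `RandomizedSearchCV` draws its candidates without replacement from the full grid. The following stdlib-only sketch imitates that draw for the space above. The seed, the variable names, and the use of `random.sample` are illustrative assumptions, not scikit-learn's internals.

```python
import random
from itertools import product

penalties = ['l1', 'l2', None]
solvers = ['liblinear', 'lbfgs', 'saga', 'sag']
# Same 10 points as np.linspace(0.01, 1.0, 10): step 0.11.
C_values = [round(0.01 + i * 0.11, 2) for i in range(10)]

all_candidates = list(product(penalties, solvers, C_values))

rng = random.Random(42)                      # fixed seed for a repeatable draw
sampled = rng.sample(all_candidates, k=20)   # 20 distinct candidates out of 120

# Share of the grid that uses an unsupported penalty/solver pair.
invalid_pairs = {('l1', 'lbfgs'), ('l1', 'sag'), (None, 'liblinear')}
invalid_share = sum((p, s) in invalid_pairs for p, s, _ in all_candidates) / len(all_candidates)

print(len(all_candidates), len(sampled), invalid_share)
```

A quarter of the grid is invalid, so about 5 of 20 sampled candidates would be expected to fail. The run above reports 30 failed fits over 5 folds, that is 6 candidates, which is in line with that estimate.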




#### Random Forest

In [99]:
# Total combinations: 5 * 2 * 10 = 100
param_distributions = {
    'n_estimators': list(range(80, 201, 30)),
    'min_samples_leaf': [5, 10],
    'max_depth': list(np.linspace(10, 50, 10, dtype=int))
}

random_search_rf = model_selection.RandomizedSearchCV(
    estimator=ensemble.RandomForestClassifier(random_state=random_state),
    param_distributions=param_distributions,
    cv=n_splits,
    n_iter=20,
    n_jobs=-1
)

%time random_search_rf.fit(X_train, y_train)

y_train_pred = random_search_rf.predict(X_train)
print(f'Train f1-score: {metrics.f1_score(y_train, y_train_pred):.2f}')

y_test_pred = random_search_rf.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

print('Best parameters: ', random_search_rf.best_params_)

CPU times: user 2.49 s, sys: 132 ms, total: 2.62 s
Wall time: 17.6 s
Train f1-score: 0.94
Test f1-score: 0.83
Best parameters:  {'n_estimators': 200, 'min_samples_leaf': 5, 'max_depth': 50}


### Tree-Structured Parzen Estimators

#### Hyperopt

##### Logistic Regression

In [101]:
space = hp.choice('classifier',[{
    'param':
    {
        'hyper_param_groups': hp.choice('hyper_param_groups', [
        {
            'penalty': hp.choice('penalty_block1', ['l2']),
            'solver': hp.choice('solver_block1', ['newton-cg', 'sag', 'saga', 'lbfgs', 'liblinear'])
        },
        {
            'penalty': hp.choice('penalty_block2', ['l1']),
            'solver': hp.choice('solver_block2', ['saga', 'liblinear'])
        },]),    
        'C': hp.uniform('C', 0.01, 1)
    }
}])


def hyperopt_lr(params, cv=n_splits, X=X_train, y=y_train, random_state=random_state):
    params = {
        'penalty': params['param']['hyper_param_groups']['penalty'],
        'solver': params['param']['hyper_param_groups']['solver'],
        'C': float(params['param']['C'])
    }
    
    model = linear_model.LogisticRegression(**params, random_state=random_state, max_iter=max_iter)
    
    # cross_val_score clones and fits the model on each fold, so no prior fit is needed.
    score = model_selection.cross_val_score(model, X, y, cv=cv, scoring="f1", n_jobs=-1).mean()
    
    return -score


trials = hyperopt.Trials()

best = hyperopt.fmin(
    hyperopt_lr, 
    space=space, 
    algo=hyperopt.tpe.suggest, 
    max_evals=20, 
    trials=trials, 
    rstate=np.random.default_rng(random_state)
)

print('Best values of hyper parameters: ', best)

100%|██████████| 20/20 [10:00<00:00, 30.04s/trial, best loss: -0.7866466124784773]
Best values of hyper parameters:  {'C': 0.05084775379720359, 'classifier': 0, 'hyper_param_groups': 0, 'penalty_block1': 0, 'solver_block1': 0}
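
For `hp.choice` parameters, the dictionary returned by `fmin` contains option indices rather than the option values. The sketch below decodes the result shown above by hand. The `groups` structure is a hypothetical restatement of the search space for illustration; to my knowledge `hyperopt.space_eval(space, best)` performs the same decoding directly on the real space.

```python
# Option lists copied from the search space defined in the cell above.
groups = [
    {
        'penalty_key': 'penalty_block1', 'penalty': ['l2'],
        'solver_key': 'solver_block1',
        'solver': ['newton-cg', 'sag', 'saga', 'lbfgs', 'liblinear'],
    },
    {
        'penalty_key': 'penalty_block2', 'penalty': ['l1'],
        'solver_key': 'solver_block2',
        'solver': ['saga', 'liblinear'],
    },
]

# Result printed by fmin above.
best = {'C': 0.05084775379720359, 'classifier': 0, 'hyper_param_groups': 0,
        'penalty_block1': 0, 'solver_block1': 0}

group = groups[best['hyper_param_groups']]
decoded = {
    'penalty': group['penalty'][best[group['penalty_key']]],
    'solver': group['solver'][best[group['solver_key']]],
    'C': best['C'],
}
print(decoded)
```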


In [102]:
# `best` stores hp.choice indices: hyper_param_groups=0 and solver_block1=0
# correspond to penalty='l2' and solver='newton-cg' in the search space above.
lr_model = linear_model.LogisticRegression(
    penalty='l2',
    solver='newton-cg',
    C=float(best['C']),
    random_state=random_state,
    max_iter=max_iter
)

lr_model.fit(X_train, y_train)

y_train_pred = lr_model.predict(X_train)
print(f'Train f1-score: {metrics.f1_score(y_train, y_train_pred):.2f}')

y_test_pred = lr_model.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

Train f1-score: 0.84
Test f1-score: 0.79


##### Random Forest

In [None]:
space = {
    'n_estimators': hp.quniform('n_estimators', 100, 200, 1),
    'max_depth': hp.quniform('max_depth', 15, 26, 1),
    'min_samples_leaf': hp.quniform('min_samples_leaf', 2, 10, 1)
}


def hyperopt_rf(params, cv=n_splits, X=X_train, y=y_train, random_state=random_state):
    params = {
        'n_estimators': int(params['n_estimators']),
        'max_depth': int(params['max_depth']),
        'min_samples_leaf': int(params['min_samples_leaf'])
    }
    
    model = ensemble.RandomForestClassifier(**params, random_state=random_state)
    
    # cross_val_score clones and fits the model on each fold, so no prior fit is needed.
    score = model_selection.cross_val_score(model, X, y, cv=cv, scoring="f1", n_jobs=-1).mean()
 
    return -score


trials = hyperopt.Trials()

best = hyperopt.fmin(
    hyperopt_rf, 
    space=space, 
    algo=hyperopt.tpe.suggest, 
    max_evals=20, 
    trials=trials, 
    rstate=np.random.default_rng(random_state)
)

print('Best values of hyper parameters: ', best)

rf_model = ensemble.RandomForestClassifier(
    random_state=random_state,
    n_estimators=int(best['n_estimators']),
    max_depth=int(best['max_depth']),
    min_samples_leaf=int(best['min_samples_leaf'])
)

rf_model.fit(X_train, y_train)

y_train_pred = rf_model.predict(X_train)
print(f'Train f1-score: {metrics.f1_score(y_train, y_train_pred):.2f}')

y_test_pred = rf_model.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

100%|██████████| 20/20 [00:54<00:00,  2.75s/trial, best loss: -0.8097293050428462]
Best values of hyper parameters:  {'max_depth': 18.0, 'min_samples_leaf': 2.0, 'n_estimators': 103.0}
Train f1-score: 0.99
Test f1-score: 0.83
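
The best values above are floats (`18.0`, `2.0`, `103.0`) because `hp.quniform` is documented as returning `round(uniform(low, high) / q) * q`, which is why the code casts them with `int()` before building the forest. A minimal stdlib sketch of that rule, with the hypothetical helper `quniform_sketch` standing in for the library sampler:

```python
import random


def quniform_sketch(rng, low, high, q):
    """Draw round(uniform(low, high) / q) * q and return it as a float."""
    return float(round(rng.uniform(low, high) / q) * q)


rng = random.Random(42)  # arbitrary seed for the illustration
samples = [quniform_sketch(rng, 100, 200, 1) for _ in range(5)]
as_ints = [int(value) for value in samples]  # cast needed for n_estimators and similar

print(samples)
print(as_ints)
```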


#### Optuna

##### Logistic Regression

In [None]:
def optuna_lr(trial):
    penalty = trial.suggest_categorical('penalty', ['l2', None])
    solver = trial.suggest_categorical('solver', ['newton-cg', 'sag', 'saga', 'lbfgs'])
    C = trial.suggest_float('C', low=0.01, high=1)
    
    model = linear_model.LogisticRegression(
        penalty=penalty,
        solver=solver,
        C=C,
        random_state=random_state,
        max_iter=max_iter
    )
    
    kf = model_selection.KFold(n_splits=n_splits)
    score = model_selection.cross_val_score(model, X_train, y_train, cv=kf, scoring="f1", n_jobs=-1).mean()
 
    return score

study = optuna.create_study(study_name='LogisticRegression', direction='maximize')

study.optimize(optuna_lr, n_trials=20)

print('Best values of hyper parameters: ', study.best_params)
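# study.best_value is the best mean cross-validated F1 on the training split, not the score of a refit model.
print('Train f1-score: ', round(study.best_value, 2))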

model = linear_model.LogisticRegression(**study.best_params, random_state=random_state, max_iter=max_iter)
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

[I 2024-07-08 07:59:58,943] A new study created in memory with name: LogisticRegression
[I 2024-07-08 08:00:06,218] Trial 0 finished with value: 0.7237274203040306 and parameters: {'penalty': None, 'solver': 'newton-cg', 'C': 0.6477400870623531}. Best is trial 0 with value: 0.7237274203040306.
[I 2024-07-08 08:00:22,500] Trial 1 finished with value: 0.7715428457970555 and parameters: {'penalty': 'l2', 'solver': 'saga', 'C': 0.8809177298148992}. Best is trial 1 with value: 0.7715428457970555.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

Best values of hyper parameters:  {'penalty': 'l2', 'solver': 'lbfgs', 'C': 0.02882514883949641}
Train f1-score:  0.78
Test f1-score: 0.80
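
Hyperopt's `fmin` minimizes its objective, which is why the Hyperopt objectives return `-score`, whereas the Optuna study is created with `direction='maximize'` and returns the score itself. The toy sketch below shows that both conventions pick the same candidate. The function `toy_score` is a made-up stand-in for a cross-validated F1 and has no connection to the dataset.

```python
import random


def toy_score(c):
    """Stand-in for a cross-validated F1; peaks at c = 0.3 (purely illustrative)."""
    return 1.0 - (c - 0.3) ** 2


rng = random.Random(0)
candidates = [rng.uniform(0.01, 1.0) for _ in range(20)]

best_by_max = max(candidates, key=toy_score)                # maximize the score
best_by_min = min(candidates, key=lambda c: -toy_score(c))  # minimize the negated score

print(best_by_max, best_by_min)
```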


##### Random Forest

In [None]:
def optuna_rf(trial):
    n_estimators = trial.suggest_int('n_estimators', low=100, high=300, step=10)
    max_depth = trial.suggest_int('max_depth', low=15, high=40, step=1)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', low=3, high=7, step=1)
    
    model = ensemble.RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_leaf=min_samples_leaf,
        random_state=random_state
    )
    
    kf = model_selection.KFold(n_splits=n_splits)
    score = model_selection.cross_val_score(model, X_train, y_train, cv=kf, scoring="f1", n_jobs=-1).mean()
    
    return score

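# Note: a bare `%time` times an empty statement; `%%time` as the first line of the cell would time the whole cell.
%time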

study = optuna.create_study(study_name='RandomForestClassifier', direction='maximize')

study.optimize(optuna_rf, n_trials=20)

print('Best values of hyper parameters: ', study.best_params)
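# study.best_value is the best mean cross-validated F1 on the training split, not the score of a refit model.
print('Train f1-score: ', round(study.best_value, 2))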

model = ensemble.RandomForestClassifier(**study.best_params, random_state=random_state)
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

[I 2024-07-08 08:04:04,445] A new study created in memory with name: RandomForestClassifier


CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 2.86 µs


[I 2024-07-08 08:04:06,613] Trial 0 finished with value: 0.8058720733687965 and parameters: {'n_estimators': 190, 'max_depth': 25, 'min_samples_leaf': 3}. Best is trial 0 with value: 0.8058720733687965.
[I 2024-07-08 08:04:09,006] Trial 1 finished with value: 0.8020571969909603 and parameters: {'n_estimators': 230, 'max_depth': 23, 'min_samples_leaf': 5}. Best is trial 0 with value: 0.8058720733687965.
[I 2024-07-08 08:04:11,143] Trial 2 finished with value: 0.8015520206159727 and parameters: {'n_estimators': 240, 'max_depth': 37, 'min_samples_leaf': 6}. Best is trial 0 with value: 0.8058720733687965.
[I 2024-07-08 08:04:12,144] Trial 3 finished with value: 0.7957500433135811 and parameters: {'n_estimators': 100, 'max_depth': 27, 'min_samples_leaf': 7}. Best is trial 0 with value: 0.8058720733687965.
[I 2024-07-08 08:04:13,177] Trial 4 finished with value: 0.8030464294503246 and parameters: {'n_estimators': 110, 'max_depth': 22, 'min_samples_leaf': 6}. Best is trial 0 with value: 0.805

Best values of hyper parameters:  {'n_estimators': 300, 'max_depth': 34, 'min_samples_leaf': 3}
Train f1-score:  0.81
Test f1-score: 0.84
