# Machine Learning. Hyper Parameter Optimization

## Task Description

The data for this task was collected experimentally and represents a biological process. It contains the following columns:
* Activity: the actual biological response, a binary label (0 or 1)
* D1-D1776: molecular descriptors such as size, shape and chemical elements.

No data preprocessing is required: the data is already encoded and normalized.

The F1-score must be used as the evaluation metric for this task.
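
As a reminder of what this metric measures, the F1-score is the harmonic mean of precision and recall. The sketch below computes it by hand for a small made-up set of labels; the label values are illustrative assumptions and are not taken from the dataset.

```python
# Illustrative labels only (assumption); they are not taken from the task data.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# Count true positives, false positives and false negatives for class 1.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f'precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}')
```

For the same inputs, `sklearn.metrics.f1_score(y_true, y_pred)` should give the same value.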

Two models must be trained: logistic regression and random forest. Hyper parameter optimization must then be performed using both basic and advanced methods (GridSearchCV, RandomizedSearchCV, Hyperopt, Optuna). The maximum number of iterations must not exceed 50.

## Initialization

### Import Necessary Libraries

In [87]:
import hyperopt
import numpy as np
import optuna
import pandas as pd

from hyperopt import hp
from sklearn import ensemble
from sklearn import linear_model
from sklearn import metrics
from sklearn import model_selection

### Define Constants

In [110]:
max_iter = 10000
n_splits = 5
random_state = 42

## Load Data

In [89]:
data = pd.read_csv('../../data/_train_sem09.csv')
data

Unnamed: 0,Activity,D1,D2,D3,D4,D5,D6,D7,D8,D9,...,D1767,D1768,D1769,D1770,D1771,D1772,D1773,D1774,D1775,D1776
0,1,0.000000,0.497009,0.10,0.0,0.132956,0.678031,0.273166,0.585445,0.743663,...,0,0,0,0,0,0,0,0,0,0
1,1,0.366667,0.606291,0.05,0.0,0.111209,0.803455,0.106105,0.411754,0.836582,...,1,1,1,1,0,1,0,0,1,0
2,1,0.033300,0.480124,0.00,0.0,0.209791,0.610350,0.356453,0.517720,0.679051,...,0,0,0,0,0,0,0,0,0,0
3,1,0.000000,0.538825,0.00,0.5,0.196344,0.724230,0.235606,0.288764,0.805110,...,0,0,0,0,0,0,0,0,0,0
4,0,0.100000,0.517794,0.00,0.0,0.494734,0.781422,0.154361,0.303809,0.812646,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3746,1,0.033300,0.506409,0.10,0.0,0.209887,0.633426,0.297659,0.376124,0.727093,...,0,0,0,0,0,0,0,0,0,0
3747,1,0.133333,0.651023,0.15,0.0,0.151154,0.766505,0.170876,0.404546,0.787935,...,0,0,1,0,1,0,1,0,0,0
3748,0,0.200000,0.520564,0.00,0.0,0.179949,0.768785,0.177341,0.471179,0.872241,...,0,0,0,0,0,0,0,0,0,0
3749,1,0.100000,0.765646,0.00,0.0,0.536954,0.634936,0.342713,0.447162,0.672689,...,0,0,0,0,0,0,0,0,0,0


### Split the data into train and test datasets

In [90]:
X = data.drop('Activity', axis=1)
y = data['Activity']
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=random_state
)

## Building Initial Models

### Logistic Regression

In [91]:
lr_model = linear_model.LogisticRegression(random_state=random_state, max_iter=max_iter)

kf = model_selection.KFold(n_splits=n_splits)

%time cv_metrics = model_selection.cross_validate(estimator=lr_model, X=X, y=y, cv=kf, scoring='f1', return_train_score=True)

# Double quotes outside so the nested single quotes also parse on Python versions before 3.12.
print(f"Train mean f1-score: {np.mean(cv_metrics['train_score']):.2f}")
print(f"Test mean f1-score: {np.mean(cv_metrics['test_score']):.2f}")


CPU times: user 33 s, sys: 251 ms, total: 33.3 s
Wall time: 3.82 s
Train mean f1-score: 0.89
Test mean f1-score: 0.78


### Random Forest

In [92]:
rf_model = ensemble.RandomForestClassifier(random_state=random_state)

kf = model_selection.KFold(n_splits=n_splits)

%time cv_metrics = model_selection.cross_validate(estimator=rf_model, X=X, y=y, cv=kf, scoring='f1', return_train_score=True)

# Double quotes outside so the nested single quotes also parse on Python versions before 3.12.
print(f"Train mean f1-score: {np.mean(cv_metrics['train_score']):.2f}")
print(f"Test mean f1-score: {np.mean(cv_metrics['test_score']):.2f}")


CPU times: user 8.54 s, sys: 216 ms, total: 8.75 s
Wall time: 7.28 s
Train mean f1-score: 1.00
Test mean f1-score: 0.81


## Optimize Hyper Parameters

### Grid Search

#### Logistic Regression

In [111]:
# Total combinations: 1 * 2 * 4 + 1 * 6 * 4 = 32 <= 50
param_grid = [
    {
        'penalty': ['l1'],
        'solver': ['liblinear', 'saga'],
        'C': list(np.linspace(0.01, 1.0, 4, dtype=float))
    },
    {
        'penalty': ['l2'],
        'solver': ['liblinear', 'lbfgs', 'newton-cg', 'newton-cholesky', 'saga', 'sag'],
        'C': list(np.linspace(0.01, 1.0, 4, dtype=float))
    }
]

grid_search_lr = model_selection.GridSearchCV(
    estimator=linear_model.LogisticRegression(random_state=random_state, max_iter=max_iter),
    param_grid=param_grid,
    cv=n_splits,
    n_jobs=-1
)

%time grid_search_lr.fit(X_train, y_train)

y_train_pred = grid_search_lr.predict(X_train)
print(f'Train f1-score: {metrics.f1_score(y_train, y_train_pred):.2f}')

y_test_pred = grid_search_lr.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

print('Best parameters: ', grid_search_lr.best_params_)

CPU times: user 628 ms, sys: 284 ms, total: 912 ms
Wall time: 3min 18s
Train f1-score: 0.84
Test f1-score: 0.80
Best parameters:  {'C': 0.34, 'penalty': 'l1', 'solver': 'liblinear'}
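
Note that the search above is created without a `scoring` argument, so the candidates are ranked by the estimator's default score, which is accuracy for classifiers; F1 is only computed afterwards on the refitted best model. The same applies to the other `GridSearchCV` and `RandomizedSearchCV` cells in this notebook. A minimal sketch of how a search could be pointed at F1 directly is shown below. It uses a small synthetic dataset (`make_classification`) and a two-value grid purely as stand-ins, so its numbers say nothing about the task data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data (assumption); the notebook itself uses X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, weights=[0.7, 0.3], random_state=42
)

# scoring='f1' makes the search rank candidates by cross-validated F1.
demo_search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={'C': [0.1, 1.0]},
    cv=5,
    scoring='f1',
)
demo_search.fit(X_demo, y_demo)

# Recompute the cross-validated F1 for the winning C as a sanity check.
best_c = demo_search.best_params_['C']
check = cross_val_score(
    LogisticRegression(max_iter=1000, C=best_c),
    X_demo, y_demo, cv=5, scoring='f1'
).mean()

print(demo_search.best_params_, round(demo_search.best_score_, 3), round(check, 3))
```

With `scoring='f1'`, `best_score_` is the mean cross-validated F1 of the selected candidate, which is the quantity the task asks to optimize.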


#### Random Forest

In [97]:
# Total combinations: 4 * 2 * 6 = 48 <= 50
param_grid = {
    'n_estimators': list(range(80, 200, 30)),
    'min_samples_leaf': [5, 10],
    'max_depth': list(np.linspace(20, 40, 6, dtype=int))
}

grid_search_rf = model_selection.GridSearchCV(
    estimator=ensemble.RandomForestClassifier(random_state=random_state),
    param_grid=param_grid,
    cv=n_splits,
    n_jobs=-1
)

%time grid_search_rf.fit(X_train, y_train)

y_train_pred = grid_search_rf.predict(X_train)
print(f'Train f1-score: {metrics.f1_score(y_train, y_train_pred):.2f}')

y_test_pred = grid_search_rf.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

print('Best parameters: ', grid_search_rf.best_params_)

CPU times: user 1.32 s, sys: 203 ms, total: 1.52 s
Wall time: 32.9 s
Train f1-score: 0.94
Test f1-score: 0.83
Best parameters:  {'max_depth': 20, 'min_samples_leaf': 5, 'n_estimators': 80}
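
The "total combinations" comments in the two grid search cells can be checked by enumerating the grids. The sketch below does this in plain Python; the `C` and `max_depth` lists are the `np.linspace` results written out by hand, and `count_combinations` is a hypothetical helper introduced only for this illustration.

```python
from itertools import product

# np.linspace(0.01, 1.0, 4) written out.
c_values = [0.01, 0.34, 0.67, 1.0]

lr_grid = [
    {'penalty': ['l1'], 'solver': ['liblinear', 'saga'], 'C': c_values},
    {
        'penalty': ['l2'],
        'solver': ['liblinear', 'lbfgs', 'newton-cg', 'newton-cholesky', 'saga', 'sag'],
        'C': c_values,
    },
]

rf_grid = {
    'n_estimators': list(range(80, 200, 30)),   # 80, 110, 140, 170
    'min_samples_leaf': [5, 10],
    'max_depth': [20, 24, 28, 32, 36, 40],      # np.linspace(20, 40, 6, dtype=int)
}


def count_combinations(grid):
    """Hypothetical helper: number of candidates in a dict or list-of-dicts grid."""
    grids = grid if isinstance(grid, list) else [grid]
    return sum(len(list(product(*g.values()))) for g in grids)


lr_total = count_combinations(lr_grid)
rf_total = count_combinations(rf_grid)
print(lr_total, rf_total)  # 32 48
```

Both totals stay under the limit of 50. Each candidate is fitted once per fold, so the number of model fits is the candidate count multiplied by `n_splits`, plus one final refit on the full training set.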


### Randomized Search

#### Logistic Regression

In [113]:
# Total combinations: 1 * 2 * 10 + 1 * 6 * 10 = 80
param_distributions = [
    {
        'penalty': ['l1'],
        'solver': ['liblinear', 'saga'],
        'C': list(np.linspace(0.01, 1.0, 10, dtype=float))
    },
    {
        'penalty': ['l2'],
        'solver': ['liblinear', 'lbfgs', 'newton-cg', 'newton-cholesky', 'saga', 'sag'],
        'C': list(np.linspace(0.01, 1.0, 10, dtype=float))
    }
]

random_search_lr = model_selection.RandomizedSearchCV(
    estimator=linear_model.LogisticRegression(random_state=random_state, max_iter=max_iter),
    param_distributions=param_distributions,
    cv=n_splits,
    n_iter=20,
    n_jobs=-1
)

%time random_search_lr.fit(X_train, y_train)

y_train_pred = random_search_lr.predict(X_train)
print(f'Train f1-score: {metrics.f1_score(y_train, y_train_pred):.2f}')

y_test_pred = random_search_lr.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

print('Best parameters: ', random_search_lr.best_params_)

CPU times: user 490 ms, sys: 211 ms, total: 701 ms
Wall time: 2min 28s
Train f1-score: 0.84
Test f1-score: 0.80
Best parameters:  {'solver': 'liblinear', 'penalty': 'l1', 'C': 0.34}
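
With `n_iter=20`, the randomized search evaluates only part of the 80 possible candidates. According to the scikit-learn documentation, when every parameter is given as a list the candidates are sampled without replacement, so 20 distinct combinations are tried. The sketch below illustrates the resulting coverage; it uses Python's `random` module as a stand-in for scikit-learn's sampler (an assumption for illustration), so the particular candidates drawn will differ from those of the real search.

```python
import random
from itertools import product

# np.linspace(0.01, 1.0, 10) written out: 0.01, 0.12, ..., 1.0
c_values = [round(0.01 + i * 0.11, 2) for i in range(10)]

blocks = [
    {'penalty': ['l1'], 'solver': ['liblinear', 'saga'], 'C': c_values},
    {
        'penalty': ['l2'],
        'solver': ['liblinear', 'lbfgs', 'newton-cg', 'newton-cholesky', 'saga', 'sag'],
        'C': c_values,
    },
]

# Enumerate every candidate across both blocks.
candidates = [
    dict(zip(block.keys(), values))
    for block in blocks
    for values in product(*block.values())
]

# Draw 20 distinct candidates, mirroring n_iter=20 without replacement.
rng = random.Random(42)
sampled = rng.sample(candidates, 20)

coverage = len(sampled) / len(candidates)
print(len(candidates), len(sampled), coverage)  # 80 20 0.25
```

A quarter of the space is explored, so any single candidate, including the best one, has a 25% chance of being evaluated. For the random forest search below (100 candidates, `n_iter=20`) the same reasoning gives 20%.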


#### Random Forest

In [99]:
# Total combinations: 5 * 2 * 10 = 100
param_distributions = {
    'n_estimators': list(range(80, 201, 30)),
    'min_samples_leaf': [5, 10],
    'max_depth': list(np.linspace(10, 50, 10, dtype=int))
}

random_search_rf = model_selection.RandomizedSearchCV(
    estimator=ensemble.RandomForestClassifier(random_state=random_state),
    param_distributions=param_distributions,
    cv=n_splits,
    n_iter=20,
    n_jobs=-1
)

%time random_search_rf.fit(X_train, y_train)

y_train_pred = random_search_rf.predict(X_train)
print(f'Train f1-score: {metrics.f1_score(y_train, y_train_pred):.2f}')

y_test_pred = random_search_rf.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

print('Best parameters: ', random_search_rf.best_params_)

CPU times: user 2.49 s, sys: 132 ms, total: 2.62 s
Wall time: 17.6 s
Train f1-score: 0.94
Test f1-score: 0.83
Best parameters:  {'n_estimators': 200, 'min_samples_leaf': 5, 'max_depth': 50}


### Tree-Structured Parzen Estimators

#### Hyperopt

##### Logistic Regression

In [114]:
space = hp.choice('classifier',[{
    'param':
    {
        'hyper_param_groups': hp.choice('hyper_param_groups', [
        {
            'penalty': hp.choice('penalty_block1', ['l2']),
            'solver': hp.choice('solver_block1', ['newton-cg', 'sag', 'saga', 'lbfgs', 'liblinear'])
        },
        {
            'penalty': hp.choice('penalty_block2', ['l1']),
            'solver': hp.choice('solver_block2', ['saga', 'liblinear'])
        },]),    
        'C': hp.uniform('C', 0.01, 1)
    }
}])


def hyperopt_lr(params, cv=n_splits, X=X_train, y=y_train, random_state=random_state):
    params = {
        'penalty': params['param']['hyper_param_groups']['penalty'],
        'solver': params['param']['hyper_param_groups']['solver'],
        'C': float(params['param']['C'])
    }
    
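    model = linear_model.LogisticRegression(**params, random_state=random_state, max_iter=max_iter)

    # cross_val_score clones and fits the model on each fold, so no separate fit is needed.
    score = model_selection.cross_val_score(model, X, y, cv=cv, scoring="f1", n_jobs=-1).mean()

    # fmin minimizes the objective, so return the negated F1-score.
    return -score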


trials = hyperopt.Trials()

best = hyperopt.fmin(
    hyperopt_lr, 
    space=space, 
    algo=hyperopt.tpe.suggest, 
    max_evals=20, 
    trials=trials, 
    rstate=np.random.default_rng(random_state)
)

print('Best values of hyper parameters: ', best)

100%|██████████| 20/20 [16:45<00:00, 50.29s/trial, best loss: -0.7866466124784773]
Best values of hyper parameters:  {'C': 0.05084775379720359, 'classifier': 0, 'hyper_param_groups': 0, 'penalty_block1': 0, 'solver_block1': 0}
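
For parameters defined with `hp.choice`, the dictionary returned by `fmin` holds the index of the selected option rather than the option itself. The sketch below decodes the output printed above by hand. The option lists are copied from the search space in the same order, while `best_indices` and `decode_best` are hypothetical names introduced only for this illustration.

```python
# Option lists copied from the search space, in the same order as in hp.choice.
penalty_groups = [
    {'penalty': ['l2'], 'solver': ['newton-cg', 'sag', 'saga', 'lbfgs', 'liblinear']},
    {'penalty': ['l1'], 'solver': ['saga', 'liblinear']},
]

# The dictionary printed by fmin in the cell above.
best_indices = {
    'C': 0.05084775379720359,
    'classifier': 0,
    'hyper_param_groups': 0,
    'penalty_block1': 0,
    'solver_block1': 0,
}


def decode_best(best, groups):
    """Hypothetical helper: map hp.choice indices back to parameter values."""
    group_index = best['hyper_param_groups']
    group = groups[group_index]
    block = group_index + 1  # labels are suffixed block1 / block2
    return {
        'penalty': group['penalty'][best[f'penalty_block{block}']],
        'solver': group['solver'][best[f'solver_block{block}']],
        'C': float(best['C']),
    }


decoded = decode_best(best_indices, penalty_groups)
print(decoded)
```

Hyperopt also provides `hyperopt.space_eval(space, best)`, which performs this mapping against the original search space and avoids copying values by hand.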


In [115]:
# hp.choice results are reported as indices: 'hyper_param_groups': 0 selects the l2 block
# and 'solver_block1': 0 selects 'newton-cg' within that block.
lr_model = linear_model.LogisticRegression(
    penalty='l2',
    solver='newton-cg',
    C=float(best['C']),
    random_state=random_state,
    max_iter=max_iter
)

lr_model.fit(X_train, y_train)

y_train_pred = lr_model.predict(X_train)
print(f'Train f1-score: {metrics.f1_score(y_train, y_train_pred):.2f}')

y_test_pred = lr_model.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

Train f1-score: 0.84
Test f1-score: 0.79


##### Random Forest

In [103]:
space = {
    'n_estimators': hp.quniform('n_estimators', 100, 200, 1),
    'max_depth': hp.quniform('max_depth', 15, 26, 1),
    'min_samples_leaf': hp.quniform('min_samples_leaf', 2, 10, 1)
}


def hyperopt_rf(params, cv=n_splits, X=X_train, y=y_train, random_state=random_state):
    params = {
        'n_estimators': int(params['n_estimators']),
        'max_depth': int(params['max_depth']),
        'min_samples_leaf': int(params['min_samples_leaf'])
    }
    
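    model = ensemble.RandomForestClassifier(**params, random_state=random_state)

    # cross_val_score clones and fits the model on each fold, so no separate fit is needed.
    score = model_selection.cross_val_score(model, X, y, cv=cv, scoring="f1", n_jobs=-1).mean()

    # fmin minimizes the objective, so return the negated F1-score.
    return -score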


trials = hyperopt.Trials()

best = hyperopt.fmin(
    hyperopt_rf, 
    space=space, 
    algo=hyperopt.tpe.suggest, 
    max_evals=20, 
    trials=trials, 
    rstate=np.random.default_rng(random_state)
)

print('Best values of hyper parameters: ', best)

rf_model = ensemble.RandomForestClassifier(
    random_state=random_state,
    n_estimators=int(best['n_estimators']),
    max_depth=int(best['max_depth']),
    min_samples_leaf=int(best['min_samples_leaf'])
)

rf_model.fit(X_train, y_train)

y_train_pred = rf_model.predict(X_train)
print(f'Train f1-score: {metrics.f1_score(y_train, y_train_pred):.2f}')

y_test_pred = rf_model.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

100%|██████████| 20/20 [00:56<00:00,  2.84s/trial, best loss: -0.8097293050428462]
Best values of hyper parameters:  {'max_depth': 18.0, 'min_samples_leaf': 2.0, 'n_estimators': 103.0}
Train f1-score: 0.99
Test f1-score: 0.83
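
The best values are reported as floats (18.0, 2.0, 103.0) because `hp.quniform` returns a value of the form `round(uniform(low, high) / q) * q`, which is a float even when `q` is 1. This is why the objective function and the final model cast them with `int()`. The sketch below mimics that behaviour with Python's `random` module; `quniform_like` is a hypothetical stand-in for illustration, not Hyperopt's own sampler.

```python
import random


def quniform_like(rng, low, high, q):
    """Hypothetical stand-in for hp.quniform: round(uniform(low, high) / q) * q as a float."""
    return float(round(rng.uniform(low, high) / q) * q)


rng = random.Random(42)
sample = quniform_like(rng, 15, 26, 1)

# Values as printed by fmin in the cell above.
best_reported = {'max_depth': 18.0, 'min_samples_leaf': 2.0, 'n_estimators': 103.0}

# RandomForestClassifier expects integers for these parameters.
rf_params = {name: int(value) for name, value in best_reported.items()}

print(sample, rf_params)
```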


#### Optuna

##### Logistic Regression

In [118]:
def optuna_lr(trial):
    penalty = trial.suggest_categorical('penalty', ['l2'])
    solver = trial.suggest_categorical('solver', ['newton-cg', 'sag', 'saga', 'lbfgs', 'liblinear', 'newton-cholesky'])
    C = trial.suggest_float('C', low=0.01, high=1)
    
    model = linear_model.LogisticRegression(
        penalty=penalty,
        solver=solver,
        C=C,
        random_state=random_state,
        max_iter=max_iter
    )
    
    kf = model_selection.KFold(n_splits=n_splits)
    score = model_selection.cross_val_score(model, X_train, y_train, cv=kf, scoring="f1", n_jobs=-1).mean()
 
    return score

study = optuna.create_study(study_name='LogisticRegression', direction='maximize')

study.optimize(optuna_lr, n_trials=20)

print('Best values of hyper parameters: ', study.best_params)
# study.best_value is the mean cross-validation F1-score, not a score on the full training set.
print('Train f1-score: ', round(study.best_value, 2))

model = linear_model.LogisticRegression(**study.best_params, random_state=random_state, max_iter=max_iter)
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

[I 2024-07-08 21:54:29,238] A new study created in memory with name: LogisticRegression
[I 2024-07-08 21:54:51,100] Trial 0 finished with value: 0.7728712460678221 and parameters: {'penalty': 'l2', 'solver': 'sag', 'C': 0.6433202420708939}. Best is trial 0 with value: 0.7728712460678221.
[I 2024-07-08 21:54:52,154] Trial 1 finished with value: 0.7813923174428822 and parameters: {'penalty': 'l2', 'solver': 'liblinear', 'C': 0.15335897571711776}. Best is trial 1 with value: 0.7813923174428822.
[I 2024-07-08 21:54:55,879] Trial 2 finished with value: 0.7731597976583198 and parameters: {'penalty': 'l2', 'solver': 'newton-cholesky', 'C': 0.43122816915861417}. Best is trial 1 with value: 0.7813923174428822.
[I 2024-07-08 21:54:56,957] Trial 3 finished with value: 0.7741190721631985 and parameters: {'penalty': 'l2', 'solver': 'liblinear', 'C': 0.6198621611956622}. Best is trial 1 with value: 0.7813923174428822.
[I 2024-07-08 21:54:58,115] Trial 4 finished with value: 0.7807690405229106 and pa

Best values of hyper parameters:  {'penalty': 'l2', 'solver': 'lbfgs', 'C': 0.016256134639307612}
Train f1-score:  0.78
Test f1-score: 0.81


##### Random Forest

In [105]:
def optuna_rf(trial):
    n_estimators = trial.suggest_int('n_estimators', low=100, high=300, step=10)
    max_depth = trial.suggest_int('max_depth', low=15, high=40, step=1)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', low=3, high=7, step=1)
    
    model = ensemble.RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_leaf=min_samples_leaf,
        random_state=random_state
    )
    
    kf = model_selection.KFold(n_splits=n_splits)
    score = model_selection.cross_val_score(model, X_train, y_train, cv=kf, scoring="f1", n_jobs=-1).mean()
    
    return score

study = optuna.create_study(study_name='RandomForestClassifier', direction='maximize')

study.optimize(optuna_rf, n_trials=20)

print('Best values of hyper parameters: ', study.best_params)
# study.best_value is the mean cross-validation F1-score, not a score on the full training set.
print('Train f1-score: ', round(study.best_value, 2))

model = ensemble.RandomForestClassifier(**study.best_params, random_state=random_state)
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

[I 2024-07-08 20:44:52,313] A new study created in memory with name: RandomForestClassifier

[I 2024-07-08 20:44:55,024] Trial 0 finished with value: 0.8032959630485156 and parameters: {'n_estimators': 260, 'max_depth': 38, 'min_samples_leaf': 4}. Best is trial 0 with value: 0.8032959630485156.
[I 2024-07-08 20:44:57,755] Trial 1 finished with value: 0.8017659784937461 and parameters: {'n_estimators': 270, 'max_depth': 35, 'min_samples_leaf': 4}. Best is trial 0 with value: 0.8032959630485156.
[I 2024-07-08 20:44:58,860] Trial 2 finished with value: 0.7958124144308217 and parameters: {'n_estimators': 110, 'max_depth': 39, 'min_samples_leaf': 7}. Best is trial 0 with value: 0.8032959630485156.
[I 2024-07-08 20:45:00,140] Trial 3 finished with value: 0.7975724733176891 and parameters: {'n_estimators': 130, 'max_depth': 20, 'min_samples_leaf': 7}. Best is trial 0 with value: 0.8032959630485156.
[I 2024-07-08 20:45:02,056] Trial 4 finished with value: 0.7971045600656945 and parameters: {'n_estimators': 210, 'max_depth': 23, 'min_samples_leaf': 7}. Best is trial 0 with value: 0.803

Best values of hyper parameters:  {'n_estimators': 140, 'max_depth': 21, 'min_samples_leaf': 6}
Train f1-score:  0.81
Test f1-score: 0.83
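
In both Optuna cells, the value printed as "Train f1-score" is `study.best_value`, which is the mean cross-validated F1 returned by the objective and therefore a held-out estimate. It is not directly comparable with the train scores of the other methods, which are computed by predicting on the data the model was fitted on. The sketch below shows the difference between the two quantities on a synthetic dataset (`make_classification`, an assumption used only as a stand-in for the task data).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption); the notebook itself uses X_train / y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

demo_model = RandomForestClassifier(n_estimators=50, random_state=42)

# Mean F1 over held-out folds: the kind of value Optuna reports as best_value.
cv_f1 = cross_val_score(
    demo_model, X_demo, y_demo, cv=KFold(n_splits=5), scoring='f1'
).mean()

# F1 on the same data the model was fitted on: a true train score.
demo_model.fit(X_demo, y_demo)
train_f1 = f1_score(y_demo, demo_model.predict(X_demo))

print(f'CV mean f1: {cv_f1:.2f}, train f1: {train_f1:.2f}')
```

To compare Optuna with the other methods on equal terms, the train score would need to be computed in the same way, by predicting on `X_train` with the refitted model.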


## Conclusions

The following results have been observed using the applied hyper parameter optimization methods.

| Model Type | F1-Score on Train Data | F1-Score on Test Data |
| ---------- | ---------------------- | --------------------- |
| No optimization - Logistic Regression | 0.89 | 0.78 |
| No optimization - Random Forest | 1.00 | 0.81 |
| Grid Search - Logistic Regression | 0.84 | 0.80 |
| Grid Search - Random Forest | 0.94 | 0.83 |
| Randomized Search - Logistic Regression | 0.84 | 0.80 |
| Randomized Search - Random Forest | 0.94 | 0.83 |
| Hyperopt - Logistic Regression | 0.84 | 0.79 |
| Hyperopt - Random Forest | 0.99 | 0.83 |
| Optuna - Logistic Regression | 0.78 (CV mean) | 0.81 |
| Optuna - Random Forest | 0.81 (CV mean) | 0.83 |

We can conclude that Random Forest shows better results on average than Logistic Regression, although it tends to overfit.

For Logistic Regression, all optimization methods gave a better test score than the baseline model, with Optuna performing best (0.81).

For Random Forest, all optimization methods gave a better test score than the baseline model (0.83 versus 0.81). The Optuna train values are mean cross-validation scores (`study.best_value`) rather than scores on the training set, so they cannot be compared with the other train scores to judge whether Optuna overfits less.
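
The overfitting remark can be made more concrete by computing the gap between the train and test scores in the table. The sketch below does this for the rows whose train column is a score on the training set. The Optuna rows are left out because the value printed there is `study.best_value`, a mean cross-validation score.

```python
# (train F1, test F1) pairs copied from the results table above.
results = {
    'No optimization - Logistic Regression': (0.89, 0.78),
    'No optimization - Random Forest': (1.00, 0.81),
    'Grid Search - Logistic Regression': (0.84, 0.80),
    'Grid Search - Random Forest': (0.94, 0.83),
    'Randomized Search - Logistic Regression': (0.84, 0.80),
    'Randomized Search - Random Forest': (0.94, 0.83),
    'Hyperopt - Logistic Regression': (0.84, 0.79),
    'Hyperopt - Random Forest': (0.99, 0.83),
}

# A larger train-test gap suggests stronger overfitting.
gaps = {name: round(train - test, 2) for name, (train, test) in results.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}: {gap:.2f}')
```

The gap is largest for the unoptimized Random Forest (0.19) and smallest for Logistic Regression after grid or randomized search (0.04), which is consistent with the observation that Random Forest tends to overfit.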