# Machine Learning. Hyperparameter Optimization

## Task Description

The data for this task was collected experimentally and represents a biological process. It contains the following columns:
* Activity: the observed biological response, encoded as 0 or 1
* D1-D1776: molecular descriptors such as size, shape, and elemental composition.

No preprocessing is required: the data is already encoded and normalized.

The F1-score must be used as the evaluation metric throughout the task.
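For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the function name `f1_from_labels` and the label vectors are made up for illustration and are not taken from the dataset.

```python
def f1_from_labels(y_true, y_pred):
    """Compute the binary F1-score, treating label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: the score is defined as zero here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only (assumption, not real data)
y_true_demo = [1, 1, 1, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 1]
f1_example = f1_from_labels(y_true_demo, y_pred_demo)
print(f'F1-score: {f1_example:.2f}')  # 3 TP, 1 FP, 1 FN -> 0.75
```

In the notebook itself the same quantity comes from `metrics.f1_score` or from `scoring='f1'` in the cross-validation helpers.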

Two models must be trained: logistic regression and random forest. Their hyperparameters must then be optimized with both basic methods (GridSearchCV, RandomizedSearchCV) and advanced methods (Hyperopt, Optuna). Each search is limited to at most 50 iterations.
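For a grid search, the iteration limit translates into a cap on the number of parameter combinations, which can be checked before any model is fitted. The helper `grid_size` and the grid below are hypothetical names used only to sketch that check.

```python
import math


def grid_size(param_grid):
    """Number of combinations in a grid given as {name: list of candidate values}."""
    return math.prod(len(values) for values in param_grid.values())


# Hypothetical grid used only to illustrate the check
example_grid = {
    'n_estimators': [80, 110, 140, 170],
    'min_samples_leaf': [5, 10],
    'max_depth': [20, 24, 28, 32, 36, 40],
}
max_evaluations = 50  # iteration budget stated in the task

n_combinations = grid_size(example_grid)
assert n_combinations <= max_evaluations, 'grid exceeds the iteration budget'
print(n_combinations)  # 4 * 2 * 6 = 48
```

For randomized and TPE-based searches the budget is simply the number of requested trials, so no such count is needed there.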

## Initialization

### Import Necessary Libraries

In [49]:
from hyperopt import hp  # search-space helpers such as hp.choice and hp.quniform
import numpy as np
import optuna
import pandas as pd

from sklearn import ensemble
from sklearn import linear_model
from sklearn import metrics
from sklearn import model_selection

### Define Constants

In [42]:
max_iter = 1000
n_splits = 5
random_state = 42

## Load Data

In [18]:
data = pd.read_csv('../../data/_train_sem09.csv')
data

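| Unnamed: 0 | Activity | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | ... | D1767 | D1768 | D1769 | D1770 | D1771 | D1772 | D1773 | D1774 | D1775 | D1776 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.000000 | 0.497009 | 0.10 | 0.0 | 0.132956 | 0.678031 | 0.273166 | 0.585445 | 0.743663 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0.366667 | 0.606291 | 0.05 | 0.0 | 0.111209 | 0.803455 | 0.106105 | 0.411754 | 0.836582 | ... | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0.033300 | 0.480124 | 0.00 | 0.0 | 0.209791 | 0.610350 | 0.356453 | 0.517720 | 0.679051 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0.000000 | 0.538825 | 0.00 | 0.5 | 0.196344 | 0.724230 | 0.235606 | 0.288764 | 0.805110 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0.100000 | 0.517794 | 0.00 | 0.0 | 0.494734 | 0.781422 | 0.154361 | 0.303809 | 0.812646 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3746 | 1 | 0.033300 | 0.506409 | 0.10 | 0.0 | 0.209887 | 0.633426 | 0.297659 | 0.376124 | 0.727093 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3747 | 1 | 0.133333 | 0.651023 | 0.15 | 0.0 | 0.151154 | 0.766505 | 0.170876 | 0.404546 | 0.787935 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3748 | 0 | 0.200000 | 0.520564 | 0.00 | 0.0 | 0.179949 | 0.768785 | 0.177341 | 0.471179 | 0.872241 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3749 | 1 | 0.100000 | 0.765646 | 0.00 | 0.0 | 0.536954 | 0.634936 | 0.342713 | 0.447162 | 0.672689 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |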


### Split the data into train and test datasets

In [19]:
X = data.drop('Activity', axis=1)
y = data['Activity']
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=random_state
)

## Building Initial Models

### Logistic Regression

In [32]:
lr_model = linear_model.LogisticRegression(random_state=random_state, max_iter=max_iter)

kf = model_selection.KFold(n_splits=n_splits)

%time cv_metrics = model_selection.cross_validate(estimator=lr_model, X=X, y=y, cv=kf, scoring='f1', return_train_score=True)

# Double quotes outside keep the f-strings valid on Python versions before 3.12
print(f"Train mean f1-score: {np.mean(cv_metrics['train_score']):.2f}")
print(f"Test mean f1-score: {np.mean(cv_metrics['test_score']):.2f}")


CPU times: user 23.1 s, sys: 233 ms, total: 23.3 s
Wall time: 2.5 s
Train mean f1-score: 0.89
Test mean f1-score: 0.78


### Random Forest

In [34]:
rf_model = ensemble.RandomForestClassifier(random_state=random_state)

kf = model_selection.KFold(n_splits=n_splits)

%time cv_metrics = model_selection.cross_validate(estimator=rf_model, X=X, y=y, cv=kf, scoring='f1', return_train_score=True)

# Double quotes outside keep the f-strings valid on Python versions before 3.12
print(f"Train mean f1-score: {np.mean(cv_metrics['train_score']):.2f}")
print(f"Test mean f1-score: {np.mean(cv_metrics['test_score']):.2f}")


CPU times: user 6.14 s, sys: 82.1 ms, total: 6.22 s
Wall time: 6.38 s
Train mean f1-score: 1.00
Test mean f1-score: 0.81


## Optimize Hyperparameters

### Grid Search

#### Logistic Regression

In [41]:
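# Total combinations: 3 * 4 * 4 = 48 <= 50
# Not every penalty/solver pair is supported (for example, 'lbfgs' and 'sag'
# do not support 'l1'); fits for unsupported pairs fail and are scored as nan.
# Recent scikit-learn releases also reject the string 'none' (None is expected instead).
param_grid = {
    'penalty': ['l1', 'l2', 'none'],
    'solver': ['liblinear', 'lbfgs', 'saga', 'sag'],
    'C': list(np.linspace(0.01, 1.0, 4, dtype=float))
}

# No `scoring` is passed, so candidates are ranked by accuracy
# (the classifier's default score) rather than by F1.
grid_search_lr = model_selection.GridSearchCV(
    estimator=linear_model.LogisticRegression(random_state=random_state, max_iter=max_iter),
    param_grid=param_grid,
    cv=n_splits,
    n_jobs=-1
)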

%time grid_search_lr.fit(X_train, y_train)

y_train_pred = grid_search_lr.predict(X_train)
print(f'Train f1-score: {metrics.f1_score(y_train, y_train_pred):.2f}')

y_test_pred = grid_search_lr.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

print('Best parameters: ', grid_search_lr.best_params_)

120 fits failed out of a total of 240.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py", line 1172, in fit
    sol

CPU times: user 607 ms, sys: 156 ms, total: 763 ms
Wall time: 2min 59s
Train f1-score: 0.84
Test f1-score: 0.80
Best parameters:  {'C': 0.34, 'penalty': 'l1', 'solver': 'liblinear'}
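Half of the fits above fail because some penalty and solver pairs are not supported, and the candidates are ranked by accuracy because no `scoring` argument is given. One way to avoid both is sketched below, assuming scikit-learn 1.2 or newer (where "no penalty" is written as `None`). It passes a list of sub-grids that contain only supported pairs and sets `scoring='f1'`. The names `valid_param_grid` and `grid_search_f1` are illustrative, and this is not the configuration that produced the results above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid

c_values = [float(c) for c in np.linspace(0.01, 1.0, 4)]

# Each sub-grid lists only the solvers that support the given penalty
valid_param_grid = [
    {'penalty': ['l1'], 'solver': ['liblinear', 'saga'], 'C': c_values},
    {'penalty': ['l2'], 'solver': ['liblinear', 'lbfgs', 'saga', 'sag'], 'C': c_values},
    # Without a penalty, C has no effect, so it is left out of this sub-grid
    {'penalty': [None], 'solver': ['lbfgs', 'saga', 'sag']},
]

# 2 * 4 + 4 * 4 + 3 = 27 candidates, still within the 50-iteration budget
n_candidates = len(ParameterGrid(valid_param_grid))
print(n_candidates)

grid_search_f1 = GridSearchCV(
    estimator=LogisticRegression(random_state=42, max_iter=1000),
    param_grid=valid_param_grid,
    scoring='f1',  # rank candidates by F1 instead of the default accuracy
    cv=5,
    n_jobs=-1,
)
# grid_search_f1.fit(X_train, y_train) would then run 27 * 5 = 135 fits
```

Because the sub-grids exclude unsupported pairs, the whole budget is spent on candidates that can actually be fitted.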


#### Random Forest

In [43]:
# Total combinations: 4 * 2 * 6 = 48 <= 50
param_grid = {
    'n_estimators': list(range(80, 200, 30)),
    'min_samples_leaf': [5, 10],
    'max_depth': list(np.linspace(20, 40, 6, dtype=int))
}

grid_search_rf = model_selection.GridSearchCV(
    estimator=ensemble.RandomForestClassifier(random_state=random_state),
    param_grid=param_grid,
    cv=n_splits,
    n_jobs=-1
)

%time grid_search_rf.fit(X_train, y_train)

y_train_pred = grid_search_rf.predict(X_train)
print(f'Train f1-score: {metrics.f1_score(y_train, y_train_pred):.2f}')

y_test_pred = grid_search_rf.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

print('Best parameters: ', grid_search_rf.best_params_)

CPU times: user 1.15 s, sys: 119 ms, total: 1.27 s
Wall time: 30.2 s
Train f1-score: 0.94
Test f1-score: 0.83
Best parameters:  {'max_depth': 20, 'min_samples_leaf': 5, 'n_estimators': 80}


### Randomized Search

#### Logistic Regression

In [45]:
# Total combinations: 3 * 4 * 10 = 120; n_iter=50 evaluates 50 of them
param_distributions = {
    'penalty': ['l1', 'l2', 'none'],
    'solver': ['liblinear', 'lbfgs', 'saga', 'sag'],
    'C': list(np.linspace(0.01, 1.0, 10, dtype=float))
}

random_search_lr = model_selection.RandomizedSearchCV(
    estimator=linear_model.LogisticRegression(random_state=random_state, max_iter=max_iter),
    param_distributions=param_distributions,
    cv=n_splits,
    n_iter=50,
    n_jobs=-1
)

%time random_search_lr.fit(X_train, y_train)

y_train_pred = random_search_lr.predict(X_train)
print(f'Train f1-score: {metrics.f1_score(y_train, y_train_pred):.2f}')

y_test_pred = random_search_lr.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

print('Best parameters: ', random_search_lr.best_params_)

135 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py", line 1172, in fit
    sol

CPU times: user 690 ms, sys: 207 ms, total: 897 ms
Wall time: 2min 14s
Train f1-score: 0.84
Test f1-score: 0.80
Best parameters:  {'solver': 'liblinear', 'penalty': 'l1', 'C': 0.34}
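Randomized search is not limited to fixed lists of values: scikit-learn also accepts distributions that expose an `rvs` method. The sketch below draws `C` from a log-uniform distribution with `scipy.stats.loguniform` and fixes `random_state` so that the draw is repeatable. It uses `ParameterSampler`, the helper that `RandomizedSearchCV` relies on internally, to inspect the candidates without fitting anything. The dictionary `sketch_distributions` and its restriction to two solvers are assumptions made for this example.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterSampler

# Both solvers support 'l1' and 'l2', so every sampled pair is valid
sketch_distributions = {
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga'],
    'C': loguniform(0.01, 1.0),  # continuous, denser near small values
}

# 50 draws matches the iteration budget of the task
sampled_candidates = list(
    ParameterSampler(sketch_distributions, n_iter=50, random_state=42)
)
print(len(sampled_candidates))
print(sampled_candidates[0])
```

The same dictionary can be passed as `param_distributions` to `RandomizedSearchCV`, together with `random_state` and `scoring='f1'`.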


#### Random Forest

In [48]:
# Total combinations: 5 * 2 * 10 = 100; n_iter=50 evaluates 50 of them
param_distributions = {
    'n_estimators': list(range(80, 201, 30)),
    'min_samples_leaf': [5, 10],
    'max_depth': list(np.linspace(10, 50, 10, dtype=int))
}

random_search_rf = model_selection.RandomizedSearchCV(
    estimator=ensemble.RandomForestClassifier(random_state=random_state),
    param_distributions=param_distributions,
    cv=n_splits,
    n_iter=50,
    n_jobs=-1
)

%time random_search_rf.fit(X_train, y_train)

y_train_pred = random_search_rf.predict(X_train)
print(f'Train f1-score: {metrics.f1_score(y_train, y_train_pred):.2f}')

y_test_pred = random_search_rf.predict(X_test)
print(f'Test f1-score: {metrics.f1_score(y_test, y_test_pred):.2f}')

print('Best parameters: ', random_search_rf.best_params_)

CPU times: user 2.38 s, sys: 170 ms, total: 2.55 s
Wall time: 36 s
Train f1-score: 0.94
Test f1-score: 0.83
Best parameters:  {'n_estimators': 200, 'min_samples_leaf': 5, 'max_depth': 45}


### Tree-Structured Parzen Estimators

#### Hyperopt

##### Logistic Regression

In [None]:
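from hyperopt import hp  # search-space helpers live in the hyperopt.hp submodule

# Candidate values, the same as in the randomized search
param_distributions = {
    'penalty': ['l1', 'l2', 'none'],
    'solver': ['liblinear', 'lbfgs', 'saga', 'sag'],
    'C': list(np.linspace(0.01, 1.0, 10, dtype=float))
}

# Random forest search space; renamed so the logistic regression space
# below does not overwrite it. hp.quniform returns floats, so the values
# must be cast to int before they are passed to the model.
space_rf = {
    'n_estimators': hp.quniform('n_estimators', 100, 200, 1),
    'max_depth': hp.quniform('max_depth', 15, 26, 1),
    'min_samples_leaf': hp.quniform('min_samples_leaf', 2, 10, 1)
}

# Logistic regression search space: each hyperparameter is chosen
# from its list of candidate values
space = {
    'penalty': hp.choice('penalty', param_distributions['penalty']),
    'solver': hp.choice('solver', param_distributions['solver']),
    'C': hp.choice('C', param_distributions['C'])
}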