## Hyperparameter Optimization using `chemml.optimization.GeneticAlgorithm`

We use a sample dataset from the ChemML library that contains the SMILES codes and Dragon molecular descriptors for 500 small organic molecules, along with their densities in $kg/m^3$.

For more information on the genetic algorithm, please refer to our [paper](https://doi.org/10.26434/chemrxiv.9782387.v1).

In [1]:
from chemml.datasets import load_organic_density
_,density,features = load_organic_density()

print(density.shape, features.shape)
density, features = density.values, features.values

from sklearn.preprocessing import StandardScaler
# Use separate scalers so the feature scaler is not refit on the target
scalerx = StandardScaler()
scalery = StandardScaler()
features = scalerx.fit_transform(features)
density = scalery.fit_transform(density)

(500, 1) (500, 200)


### Defining hyperparameter space

Let's consider [kernel ridge regression from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html) for training. The hyperparameters of interest are `alpha`, `kernel`, and `degree`.

The `space` variable is a tuple of dictionaries, one per hyperparameter. Each dictionary has the form:

`{'name' : {'type' : <range>}}`

For the ‘uniform’ hyperparameter type, an additional `mutation` key is also required. Its value is the (mean, standard deviation) of a Gaussian distribution.

In [2]:
from sklearn.kernel_ridge import KernelRidge
space = (
        {'alpha'   :   {'uniform' : (0.1, 10), 'mutation': (0,1)}},
        {'kernels' :   {'choice'  : ['rbf', 'sigmoid', 'polynomial', 'linear']}},
        {'degree'  :   {'int'     : (1,5)}} )
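
To make the three entry types concrete, the sketch below draws one random individual from a space written in this format and applies a Gaussian perturbation to a ‘uniform’ value. The helpers `sample_individual` and `mutate_uniform` are hypothetical names written for illustration. They are not part of ChemML, and ChemML's own sampling and mutation logic may differ (for example, in whether integer bounds are inclusive or how out-of-range values are handled).

```python
import random

space = (
    {'alpha':   {'uniform': (0.1, 10), 'mutation': (0, 1)}},
    {'kernels': {'choice': ['rbf', 'sigmoid', 'polynomial', 'linear']}},
    {'degree':  {'int': (1, 5)}},
)

def sample_individual(space, rng=random):
    """Draw one value per hyperparameter, in the order given by `space`."""
    individual = []
    for entry in space:
        (name, spec), = entry.items()  # each dict holds exactly one hyperparameter
        if 'uniform' in spec:
            low, high = spec['uniform']
            individual.append(rng.uniform(low, high))
        elif 'int' in spec:
            low, high = spec['int']
            individual.append(rng.randint(low, high))  # assumption: both bounds inclusive
        elif 'choice' in spec:
            individual.append(rng.choice(spec['choice']))
        else:
            raise ValueError(f"unknown type for {name!r}")
    return tuple(individual)

def mutate_uniform(value, spec, rng=random):
    """Add Gaussian noise drawn from `mutation`, then clip to the 'uniform' range."""
    mean, std = spec['mutation']
    low, high = spec['uniform']
    return min(high, max(low, value + rng.gauss(mean, std)))

random.seed(0)
individual = sample_individual(space)
mutated_alpha = mutate_uniform(individual[0], space[0]['alpha'])
print(individual, mutated_alpha)
```

The individual is an ordered tuple, so position 0 is `alpha`, position 1 the kernel name, and position 2 the degree, matching the order in which the dictionaries appear in `space`.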

### Defining objective function

The objective function receives one ‘individual’ of the genetic algorithm’s population: an ordered list of hyperparameter values, in the same order as the entries of the `space` variable. Inside the function, the user performs all the required calculations and returns the metric to be optimized (as a tuple). If multiple metrics are returned, each one is optimized according to the `fitness` setting passed when the `GeneticAlgorithm` class is initialized.

In [3]:
from chemml.utils import regression_metrics

def obj(individual):
    # individual follows the order of `space`: (alpha, kernel, degree)
    krr = KernelRidge(alpha=individual[0], kernel=individual[1], degree=individual[2])
    # Train on the first 400 molecules and evaluate on the remaining 100
    krr.fit(features[:400], density[:400])
    pred = krr.predict(features[400:])
    mae = regression_metrics(density[400:], pred)['MAE'].values[0]
    return mae
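
When more than one metric is of interest, the objective can return them together as a tuple. The sketch below is a dependency-free stand-in for the kernel ridge objective above: a one-feature ridge fit without intercept on made-up data, returning `(MAE, maximum absolute error)`. The data, the function name `obj_two_metrics`, and the choice of metrics are illustrative assumptions. For use with the genetic algorithm, the further assumption is that `fitness` then carries one entry per returned metric, in the same order, for example `("min", "min")`.

```python
# Toy data (made up for illustration): y is roughly 2 * x
x_train, y_train = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
x_test, y_test = [5.0, 6.0], [10.1, 11.8]

def obj_two_metrics(individual):
    """Return (MAE, max absolute error) for a 1-D ridge fit with penalty alpha."""
    alpha = individual[0]
    # Closed-form ridge solution for one feature and no intercept
    numerator = sum(x * y for x, y in zip(x_train, y_train))
    denominator = sum(x * x for x in x_train) + alpha
    w = numerator / denominator
    # Absolute errors on the held-out points
    errors = [abs(y - w * x) for x, y in zip(x_test, y_test)]
    mae = sum(errors) / len(errors)
    return mae, max(errors)

metrics = obj_two_metrics((1.0,))
print(metrics)
```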

### Optimize the model

In [4]:
from chemml.optimization import GeneticAlgorithm
import warnings
warnings.filterwarnings('ignore')

ga = GeneticAlgorithm(evaluate=obj, space=space, fitness=("min", ),
                    pop_size = 8, crossover_size=6, mutation_size=2, algorithm=3)
fitness_df, final_best_hyperparameters = ga.search(n_generations=5)

`ga.search` returns:

- a dataframe with the best individual of each generation, its fitness value, and the time taken to evaluate the model

- a dictionary mapping each hyperparameter name to its value in the best individual
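
For intuition about how `pop_size`, `crossover_size`, and `mutation_size` interact, here is a minimal, generic genetic-algorithm loop that minimizes a toy function of one real variable and records the best individual of each generation. It is a sketch under simple assumptions (uniform parent selection, blend crossover, Gaussian mutation, elitist survivor selection) and does not reproduce ChemML's `algorithm=3` or any of its other variants. `toy_search` and `toy_objective` are hypothetical names.

```python
import random

def toy_objective(x):
    return (x - 2.0) ** 2  # minimum at x = 2

def toy_search(n_generations, pop_size=8, crossover_size=6, mutation_size=2,
               bounds=(0.1, 10.0), seed=0):
    rng = random.Random(seed)
    low, high = bounds
    population = [rng.uniform(low, high) for _ in range(pop_size)]
    history = []  # (best individual, best fitness) per generation

    def blend_child():
        # Crossover: a random convex combination of two distinct parents
        a, b = rng.sample(population, 2)
        t = rng.random()
        return t * a + (1 - t) * b

    def mutant():
        # Mutation: Gaussian perturbation of a random parent, clipped to bounds
        return min(high, max(low, rng.choice(population) + rng.gauss(0, 1)))

    for _ in range(n_generations):
        children = [blend_child() for _ in range(crossover_size)]
        mutants = [mutant() for _ in range(mutation_size)]
        # Elitist selection: keep the best pop_size of parents plus offspring
        combined = population + children + mutants
        population = sorted(combined, key=toy_objective)[:pop_size]
        history.append((population[0], toy_objective(population[0])))
    return history

history = toy_search(n_generations=5)
for generation, (best, fitness) in enumerate(history):
    print(generation, round(best, 4), round(fitness, 6))
```

Because the parents compete with their offspring for survival in this sketch, the best fitness can never get worse from one generation to the next, which is the same plateau-then-improve pattern visible in a per-generation table of best individuals.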

In [5]:
fitness_df

| | Best_individual | Fitness_values | Time (hours) |
|---|---|---|---|
| 0 | (2.928571428571429, linear, 2) | 0.102514 | 4.2e-05 |
| 1 | (1.9083405267238138, linear, 3) | 0.099425 | 3.3e-05 |
| 2 | (1.9083405267238138, linear, 3) | 0.099425 | 3.9e-05 |
| 3 | (1.9083405267238138, linear, 3) | 0.099425 | 3.3e-05 |
| 4 | (1.9083405267238138, linear, 3) | 0.099425 | 6e-05 |


In [6]:
print(final_best_hyperparameters)

{'alpha': 1.9083405267238138, 'kernels': 'linear', 'degree': 3}


### Resume optimization 

Calling `ga.search` again on the same `GeneticAlgorithm` object resumes the search for the best combination of hyperparameters from the last checkpoint instead of starting over. This feature is useful when the objective function is computationally expensive.

In [7]:
fitness_df_resume, final_best_hyperparameters_resume = ga.search(n_generations=5)

In [8]:
fitness_df_resume

| | Best_individual | Fitness_values | Time (hours) |
|---|---|---|---|
| 0 | (1.9083405267238138, linear, 3) | 0.099425 | 5.2e-05 |
| 1 | (1.8043386147076927, linear, 5) | 0.098999 | 3.6e-05 |
| 2 | (1.6658103431498519, linear, 1) | 0.098386 | 0.000145 |
| 3 | (1.6658103431498519, linear, 1) | 0.098386 | 3.9e-05 |
| 4 | (1.6658103431498519, linear, 1) | 0.098386 | 7.2e-05 |
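
The idea behind resuming can be sketched independently of ChemML: if the current population is stored between calls, a later call can start from it instead of from a fresh random population. The example below persists a population to a JSON file and reloads it. The file name, the `save_checkpoint` and `load_checkpoint` helpers, the JSON format, and the rounded individuals are all assumptions for illustration, not a description of how ChemML stores its state.

```python
import json
import os
import tempfile

def save_checkpoint(path, population):
    """Write the population (a list of individuals) to disk."""
    with open(path, 'w') as fh:
        json.dump({'population': population}, fh)

def load_checkpoint(path):
    """Return the stored population, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as fh:
        # JSON stores tuples as lists, so convert each individual back
        return [tuple(ind) for ind in json.load(fh)['population']]

checkpoint = os.path.join(tempfile.mkdtemp(), 'ga_checkpoint.json')

first_start = load_checkpoint(checkpoint)  # nothing saved yet
population = [(1.9083, 'linear', 3), (2.9286, 'linear', 2)]  # illustrative individuals
save_checkpoint(checkpoint, population)
resumed = load_checkpoint(checkpoint)      # a later run would start from these
print(first_start, resumed)
```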
