In [27]:
%reload_ext autoreload
%autoreload 2
from edbo.plus.optimizer_botorch import EDBOplus


#### This tutorial covers the basics of running EDBO+, from designing a combinatorial search space to running the Bayesian optimizer.

## 1. Creating a search scope using EDBO+.

##### To run EDBO+ we first need to create a reaction scope (search space) in .csv format containing all the combinations that we want to consider in the optimization.
##### You can create a reaction scope manually using any spreadsheet editor (such as Excel or LibreOffice Calc), but we have also created a tool to help you generate combinatorial search spaces.
##### For instance, let's say that we want to consider 3 solvents ($\bf{THF}$, $\bf{Toluene}$, $\bf{DMSO}$), 4 temperatures ($\bf{-10}$, $\bf{0}$, $\bf{10}$, $\bf{25}$) and 3 concentration levels ($\bf{0.1}$, $\bf{0.2}$, $\bf{1.0}$). We can pass these to the EDBO+ scope generator as a dictionary:

In [28]:
reaction_components = {
    'solvent': ['THF', 'Toluene', 'DMSO'],
    'T': [-10, 0, 10, 25],
    'concentration': [0.1, 0.2, 1.0]
}

##### Now we pass this dictionary to the $\bf{generate\_reaction\_scope}$ method of the EDBOplus class.

In [29]:
EDBOplus().generate_reaction_scope(
    components=reaction_components, 
    filename='my_optimization.csv',
    check_overwrite=False
)

Generating reaction scope...


Unnamed: 0,solvent,T,concentration
0,THF,-10,0.1
1,THF,-10,0.2
2,THF,-10,1.0
3,THF,0,0.1
4,THF,0,0.2
5,THF,0,1.0
6,THF,10,0.1
7,THF,10,0.2
8,THF,10,1.0
9,THF,25,0.1


##### We can always open the generated reaction scope in any spreadsheet editor, but here we will load it with Pandas:

In [30]:
import pandas as pd
df_scope = pd.read_csv('my_optimization.csv')  # Load csv file.

##### Now we can check the number of combinations in the reaction scope:

In [31]:
n_combinations = len(df_scope)
print(f"Your reaction scope has {n_combinations} combinations.")

Your reaction scope has 36 combinations.


##### Of course, this is a very small reaction scope, as it is meant to be a toy model that demonstrates how EDBO+ works.
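
##### The 36 combinations reported above are the Cartesian product of the component lists (3 × 4 × 3 = 36). As a quick sanity check that does not depend on EDBO+, the same number can be reproduced with the standard library. This is only a sketch of the idea and makes no claim about how $\bf{generate\_reaction\_scope}$ is implemented internally:

```python
from itertools import product
from math import prod

reaction_components = {
    'solvent': ['THF', 'Toluene', 'DMSO'],
    'T': [-10, 0, 10, 25],
    'concentration': [0.1, 0.2, 1.0],
}

# Every combination of one value per component (Cartesian product).
combinations = list(product(*reaction_components.values()))

# The expected size is the product of the number of levels per component.
expected_size = prod(len(levels) for levels in reaction_components.values())

print(len(combinations), expected_size)
print(combinations[0])
```

##### Adding a fourth component with, say, 5 levels would multiply the scope size by 5, which is why combinatorial scopes grow quickly.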

## 2. First steps, initializing EDBO+ (in absence of training data).

##### We are going to execute EDBO+ to suggest some initial samples.
##### Since we have not collected any experimental data (observations) yet, EDBO+ will suggest a set of initial experiments based on a feature-space sampling method, in this case centroidal Voronoi tessellation (CVT) sampling (see: http://kmh-lanl.hansonhub.com/uncertainty/meetings/gunz03vgr.pdf).
##### In this example the $\bf{objective}$ is to maximize the reaction $\bf{yield}$ and $\bf{enantioselectivity}$ while minimizing the amount of a given $\bf{side~product}$ in this reaction. We also need to provide the name of the csv file containing our reaction scope (in our case, $\bf{my\_optimization.csv}$). Now we can execute the algorithm using the $\bf{run}$ command in EDBOplus:

In [32]:
EDBOplus().run(
    filename='my_optimization.csv',  # Previously generated scope.
    objectives=['yield', 'ee', 'side_product'],  # Objectives to be optimized.
    objective_mode=['max', 'max', 'min'],  # Maximize yield and ee but minimize side_product.
    batch=3,  # Number of experiments in parallel that we want to perform in this round.
    columns_features='all', # features to be included in the model.
    init_sampling_method='cvtsampling'  # initialization method.
)

The following columns are categorical and will be encoded using One-Hot-Encoding: ['solvent']
Sampling type:  selection 

No column information provided. All except last column will be considered as x variables.

Number of unique samples returned by sampling algorithm: 3
Creating a priority list using random sampling: cvtsampling


Unnamed: 0,solvent,T,concentration,yield,ee,side_product,priority
32,DMSO,10,1.0,PENDING,PENDING,PENDING,1
8,THF,10,1.0,PENDING,PENDING,PENDING,1
19,Toluene,10,0.2,PENDING,PENDING,PENDING,1
0,THF,-10,0.1,PENDING,PENDING,PENDING,0
26,DMSO,-10,1.0,PENDING,PENDING,PENDING,0
21,Toluene,25,0.1,PENDING,PENDING,PENDING,0
22,Toluene,25,0.2,PENDING,PENDING,PENDING,0
23,Toluene,25,1.0,PENDING,PENDING,PENDING,0
24,DMSO,-10,0.1,PENDING,PENDING,PENDING,0
25,DMSO,-10,0.2,PENDING,PENDING,PENDING,0


##### EDBO+ has created a column for each objective and filled all of them with $\bf{PENDING}$ values, so you can track which experiments you have already completed during the optimization campaign.
##### We can also see that EDBO+ has created a new $\bf{priority}$ column. This column is used to distinguish between high and low priority samples. The top entries (with $\bf{priority=1}$) highlight the next suggested samples.
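
##### To pull out only the suggested experiments, the $\bf{priority}$ column can be used as a filter. The sketch below uses a small hand-made DataFrame that mimics the first rows of the output (in practice it would come from `pd.read_csv('my_optimization.csv')`); the variable names are our own:

```python
import pandas as pd

# Toy stand-in for the first rows of my_optimization.csv after the first run.
df = pd.DataFrame({
    'solvent': ['DMSO', 'THF', 'Toluene', 'THF', 'DMSO'],
    'T': [10, 10, 10, -10, -10],
    'concentration': [1.0, 1.0, 0.2, 0.1, 1.0],
    'yield': ['PENDING'] * 5,
    'priority': [1, 1, 1, 0, 0],
})

# Rows with priority == 1 are the experiments suggested for the next round.
next_experiments = df.loc[df['priority'] == 1, ['solvent', 'T', 'concentration']]
print(next_experiments)
```

##### With $\bf{batch=3}$, the filter is expected to return three rows, one per suggested experiment.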

##### We can now check the first 5 experiments in the scope by reading the $\bf{my\_optimization.csv}$ file:

In [33]:
df_edbo = pd.read_csv('my_optimization.csv')
df_edbo.head(5)

Unnamed: 0,solvent,T,concentration,yield,ee,side_product,priority
0,DMSO,10,1.0,PENDING,PENDING,PENDING,1
1,THF,10,1.0,PENDING,PENDING,PENDING,1
2,Toluene,10,0.2,PENDING,PENDING,PENDING,1
3,THF,-10,0.1,PENDING,PENDING,PENDING,0
4,DMSO,-10,1.0,PENDING,PENDING,PENDING,0


## 3. Adding training data in EDBO+.

##### Note: We will use Python and Pandas to add new training data in this example, but you can always edit the '.csv' file and add new data using any spreadsheet editor (such as Excel or LibreOffice Calc) if that is more convenient for you.

##### Let's open the $\bf{my\_optimization.csv}$ file we generated before again:

In [34]:
df_edbo = pd.read_csv('my_optimization.csv')
df_edbo.head(5)

Unnamed: 0,solvent,T,concentration,yield,ee,side_product,priority
0,DMSO,10,1.0,PENDING,PENDING,PENDING,1
1,THF,10,1.0,PENDING,PENDING,PENDING,1
2,Toluene,10,0.2,PENDING,PENDING,PENDING,1
3,THF,-10,0.1,PENDING,PENDING,PENDING,0
4,DMSO,-10,1.0,PENDING,PENDING,PENDING,0


##### We can fill out the first entry in the previous dataframe with the "observed" values using Pandas:

In [35]:
df_edbo.loc[0, 'yield'] = 20.5
df_edbo.loc[0, 'ee'] = 40
df_edbo.loc[0, 'side_product'] = 0.1

##### We can check that we have filled out the first entry with our "observed data":

In [36]:
df_edbo.head(5)

Unnamed: 0,solvent,T,concentration,yield,ee,side_product,priority
0,DMSO,10,1.0,20.5,40,0.1,1
1,THF,10,1.0,PENDING,PENDING,PENDING,1
2,Toluene,10,0.2,PENDING,PENDING,PENDING,1
3,THF,-10,0.1,PENDING,PENDING,PENDING,0
4,DMSO,-10,1.0,PENDING,PENDING,PENDING,0


##### We can also fill out the second entry with its corresponding "observations":

In [37]:
df_edbo.loc[1, 'yield'] = 50.3
df_edbo.loc[1, 'ee'] = 10
df_edbo.loc[1, 'side_product'] = 0.2

In [38]:
df_edbo.head(5)

Unnamed: 0,solvent,T,concentration,yield,ee,side_product,priority
0,DMSO,10,1.0,20.5,40,0.1,1
1,THF,10,1.0,50.3,10,0.2,1
2,Toluene,10,0.2,PENDING,PENDING,PENDING,1
3,THF,-10,0.1,PENDING,PENDING,PENDING,0
4,DMSO,-10,1.0,PENDING,PENDING,PENDING,0


##### Now we can save our dataset as $\bf{my\_optimization\_round0.csv}$:

In [39]:
df_edbo.to_csv('my_optimization_round0.csv', index=False)

## 4. Running EDBO+ with training data.

##### First, let's check our data, which now include some $\bf{yield}$, $\bf{ee}$ and $\bf{side\_product}$ observations that will be used to train the model:

In [40]:
df_edbo_round0 = pd.read_csv('my_optimization_round0.csv')
df_edbo_round0.head(5)

Unnamed: 0,solvent,T,concentration,yield,ee,side_product,priority
0,DMSO,10,1.0,20.5,40,0.1,1
1,THF,10,1.0,50.3,10,0.2,1
2,Toluene,10,0.2,PENDING,PENDING,PENDING,1
3,THF,-10,0.1,PENDING,PENDING,PENDING,0
4,DMSO,-10,1.0,PENDING,PENDING,PENDING,0


##### Now that we have introduced some "observations" in our $\bf{my\_optimization\_round0.csv}$ file, we can execute EDBO+ to suggest samples using these "observations" as training data.
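
##### Before launching the run, it can be worth confirming how many rows actually carry numeric observations, since a typo in the file could otherwise go unnoticed. The following is a minimal sketch on a hand-made DataFrame (in practice, load the file with `pd.read_csv`); it assumes that a row counts as observed only when all three objectives are numeric, which is our own convention for this check:

```python
import pandas as pd

# Toy stand-in for my_optimization_round0.csv; in practice use pd.read_csv(...).
df = pd.DataFrame({
    'yield': [20.5, 50.3, 'PENDING', 'PENDING'],
    'ee': [40, 10, 'PENDING', 'PENDING'],
    'side_product': [0.1, 0.2, 'PENDING', 'PENDING'],
})
objectives = ['yield', 'ee', 'side_product']

# Convert to numbers; anything non-numeric (such as 'PENDING') becomes NaN.
# Numeric strings read from a csv file (e.g. '20.5') are converted as well.
numeric = df[objectives].apply(pd.to_numeric, errors='coerce')

# A row counts as observed only if every objective holds a number.
observed = numeric.notna().all(axis=1)
print(f"{observed.sum()} observed rows, {(~observed).sum()} pending rows")
```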

In [43]:
EDBOplus().run(
    filename='my_optimization_round0.csv',  # Previous scope (including observations).
    objectives=['yield', 'ee', 'side_product'],  # Objectives to be optimized.
    objective_mode=['max', 'max', 'min'],  # Maximize yield and ee but minimize side_product.
    batch=3,  # Number of experiments in parallel that we want to perform in this round.
    columns_features='all', # features to be included in the model.
    init_sampling_method='cvtsampling'  # initialization method.
)

The following columns are categorical and will be encoded using One-Hot-Encoding: ['solvent']
Using EHVI acquisition function.
Using hyperparameters optimized for continuous variables.




Using hyperparameters optimized for continuous variables.
Using hyperparameters optimized for continuous variables.
Number of QMC samples using SobolQMCNormalSampler sampler: 512
Acquisition function optimized.
Predictions obtained and expected improvement obtained.




Unnamed: 0,solvent,T,concentration,yield,ee,side_product,priority
0,Toluene,25,0.1,PENDING,PENDING,PENDING,1.0
1,Toluene,0,0.1,PENDING,PENDING,PENDING,1.0
2,DMSO,-10,1.0,PENDING,PENDING,PENDING,1.0
3,Toluene,25,1.0,PENDING,PENDING,PENDING,0.0
4,Toluene,25,0.2,PENDING,PENDING,PENDING,0.0
5,Toluene,10,1.0,PENDING,PENDING,PENDING,0.0
6,Toluene,10,0.2,PENDING,PENDING,PENDING,0.0
7,Toluene,10,0.1,PENDING,PENDING,PENDING,0.0
8,Toluene,0,1.0,PENDING,PENDING,PENDING,0.0
9,Toluene,0,0.2,PENDING,PENDING,PENDING,0.0


##### Again, the samples suggested by EDBO+ have $\bf{priority = +1}$. In addition, $\bf{priority = -1}$ is assigned to the experiments that have already been run (these are placed at the bottom of the dataset).
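
##### When inspecting the file, the three priority values can be mapped to readable labels. The sketch below uses a small hand-made DataFrame rather than the real output, and the label names are hypothetical, chosen only for readability:

```python
import pandas as pd

# Toy stand-in for the file written by EDBOplus().run(); values are illustrative.
df = pd.DataFrame({
    'solvent': ['Toluene', 'Toluene', 'DMSO', 'THF'],
    'priority': [1.0, 0.0, 1.0, -1.0],
})

# Hypothetical labels for the three priority values described above.
labels = {1.0: 'suggested next', 0.0: 'not yet suggested', -1.0: 'already run'}
df['status'] = df['priority'].map(labels)

print(df['status'].value_counts().to_dict())
```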

## Extra: Accessing the model predictions.

##### Each time that EDBO+ is executed with training data it will generate a .csv file with the predictions for the entire scope (including the 'untested' samples).
##### In the previous example, by running EDBO+ with training data, we generated two files: $\bf{my\_optimization\_round0.csv}$ and a second file with the predictions $\bf{pred\_my\_optimization\_round0.csv}$.
##### Let's take a look at the predictions file using Pandas:

In [45]:
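df_predictions_round0 = pd.read_csv('pred_my_optimization_round0.csv')
# Colour-code the priority column. The Styler is only rendered when it is
# the last expression in the cell, so it must not be followed by another line.
df_predictions_round0.style.background_gradient(subset=['priority'], cmap='plasma')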

Unnamed: 0,solvent,T,concentration,yield,ee,side_product,priority,yield_predicted_mean,yield_predicted_variance,yield_expected_improvement,ee_predicted_mean,ee_predicted_variance,ee_expected_improvement,side_product_predicted_mean,side_product_predicted_variance,side_product_expected_improvement
0,Toluene,25,0.1,PENDING,PENDING,PENDING,1.0,35.389211,106.283951,77.555477,25.010862,106.997267,78.086238,0.149964,0.356658,0.426849
1,Toluene,0,0.1,PENDING,PENDING,PENDING,1.0,35.329193,104.724311,76.285865,25.071282,105.427159,76.86509,0.149762,0.351424,0.422797
2,DMSO,-10,1.0,PENDING,PENDING,PENDING,1.0,35.398475,106.34799,77.61082,25.001535,107.061735,78.133148,0.149995,0.356872,0.42703
3,Toluene,25,1.0,PENDING,PENDING,PENDING,0.0,35.389097,106.282733,77.554453,25.010976,106.99604,78.085316,0.149963,0.356653,0.426846
4,Toluene,25,0.2,PENDING,PENDING,PENDING,0.0,35.389187,106.283698,77.555264,25.010886,106.997011,78.086046,0.149964,0.356657,0.426849
5,Toluene,10,1.0,PENDING,PENDING,PENDING,0.0,34.34167,23.938856,12.172579,26.065433,24.099519,13.05938,0.146449,0.080332,0.250805
6,Toluene,10,0.2,PENDING,PENDING,PENDING,0.0,34.383269,27.152537,14.630186,26.023555,27.334769,15.530646,0.146588,0.091116,0.25399
7,Toluene,10,0.1,PENDING,PENDING,PENDING,0.0,34.393614,27.974466,15.263226,26.01314,28.162214,16.166053,0.146623,0.093874,0.254927
8,Toluene,0,1.0,PENDING,PENDING,PENDING,0.0,35.328145,104.684494,76.253682,25.072337,105.387074,76.833685,0.149759,0.35129,0.422695
9,Toluene,0,0.2,PENDING,PENDING,PENDING,0.0,35.328975,104.716042,76.279182,25.071502,105.418835,76.858569,0.149762,0.351396,0.422776


##### In the previous dataset we can access the model predictions (the $\bf{predicted\_mean}$ and $\bf{predicted\_variance}$ columns) as well as the expected improvement values (the $\bf{expected\_improvement}$ columns) for each objective ($\bf{yield}$, $\bf{ee}$ and $\bf{side\_product}$).
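
##### A variance is expressed in squared units, so it is often easier to read the uncertainty as a standard deviation. The sketch below copies a few rows from the table above into a hand-made DataFrame, assumes (from the column name) that $\bf{yield\_predicted\_variance}$ is the variance of the model prediction, and ranks the rows by uncertainty; the `yield_predicted_std` column is our own addition:

```python
import pandas as pd

# A few rows copied from the predictions table above.
df = pd.DataFrame({
    'solvent': ['Toluene', 'DMSO', 'Toluene'],
    'T': [25, -10, 10],
    'concentration': [0.1, 1.0, 1.0],
    'yield_predicted_mean': [35.389211, 35.398475, 34.341670],
    'yield_predicted_variance': [106.283951, 106.347990, 23.938856],
})

# Standard deviation has the same units as the objective itself.
df['yield_predicted_std'] = df['yield_predicted_variance'] ** 0.5

# Sort so that the most uncertain predictions come first.
ranked = df.sort_values('yield_predicted_std', ascending=False)
columns = ['solvent', 'T', 'concentration', 'yield_predicted_mean', 'yield_predicted_std']
print(ranked[columns].round(2))
```

##### In these rows, the Toluene experiment at 10 shows a much smaller uncertainty than the other two, which is consistent with it sharing its temperature with the two observed experiments.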