Authors:
Alexandre A. Schoepfer, Jan Weinreich, Ruben Laplaza, Jerome Waser, and Clemence Corminboeuf
Inspired by the Star Trek universe, and following the Ferengi's 3rd Rule of Acquisition, "Never spend more for an acquisition than you have to," and the 74th rule, "Knowledge equals profit," we introduce strategies for cost-efficient BO that find a good compromise between cost and yield improvement.
Bayesian optimization (BO) of reactions is becoming increasingly important for advancing chemical discovery. Although effective in guiding experimental design, BO does not account for experimentation costs. For example, it may be more cost-effective to measure a reaction with the same ligand multiple times at different temperatures than to buy a new one. We present Cost-Informed BO (CIBO), a policy tailored for chemical experimentation that prioritizes experiments with lower costs. In contrast to BO, CIBO finds a cost-effective sequence of experiments towards the global optimum, the “mountain peak”. We envision use cases for efficient resource allocation in experimentation planning for traditional or self-driving laboratories.
CIBO vs. BO. BO suggests a direct and steep path that requires expensive climbing equipment and carries a higher risk of costly injuries. CIBO suggests a slightly longer but safer path with lower equipment costs for the ascent.
CIBO adds a crucial dimension to BO: the cost and ease of availability of each compound used at each batch iteration.
Overview of standard BO (blue) vs. cost-informed Bayesian optimization (CIBO, orange) for yield optimization.
(a): BO recommends purchasing more materials. Meanwhile, CIBO balances purchases with their expected improvement of the experiment, at the cost of performing more experiments (here five vs. four).
(b): A closer look at the two acquisition functions of BO and CIBO for the selection of experiment two. In CIBO, the BO acquisition function is modified to account for the cost by subtracting the latter. Following the blue BO curve, the next experiment to perform uses green and red reactants (corresponding to the costly maximum on the right). Subtracting the price of the experiments results in the orange CIBO curve, which instead suggests the more cost-effective experiment on the left (blue and red reactants).
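The subtraction described in panel (b) can be illustrated with a minimal numerical sketch. The acquisition values, costs, and the helper `cost_informed_acquisition` below are hypothetical and chosen only for illustration; they are not taken from the paper or from the package code.

```python
import numpy as np

# Hypothetical values for four candidate experiments (illustrative only).
acquisition = np.array([0.30, 0.55, 0.20, 0.60])  # standard BO acquisition values
cost = np.array([0.00, 0.10, 0.00, 0.40])         # cost, expressed on the same scale


def cost_informed_acquisition(acq, cost, cost_weight=1.0):
    """Subtract the (weighted) cost from the acquisition value."""
    return np.asarray(acq) - cost_weight * np.asarray(cost)


# BO picks the candidate with the highest acquisition value.
bo_choice = int(np.argmax(acquisition))
# CIBO picks the candidate with the highest value after subtracting the cost.
cibo_choice = int(np.argmax(cost_informed_acquisition(acquisition, cost)))

print("BO picks candidate", bo_choice)
print("CIBO picks candidate", cibo_choice)
```

In this toy example the most promising candidate is also the most expensive, so the two policies disagree: the cost-informed curve favors a cheaper candidate with a slightly lower acquisition value.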
Best to create a new environment, for instance with
conda create --name cibo python=3.10
then:
conda activate cibo
pip install .
That's it!
Open the file tutorial.ipynb to learn how to load the datasets shown in the paper and, more importantly, how to perform a cost-informed Bayesian optimization with your own data.
Your data must come in a CSV file. For instance, for the direct arylation dataset we would have:
from cibo.data.datasets import user_data

description = {
    "compounds": {
        "1": {"name": "Ligand_SMILES", "inp_type": "smiles"},
        "2": {"name": "Base_SMILES", "inp_type": "smiles"},
        "3": {"name": "Solvent_SMILES", "inp_type": "smiles"},
    },
    "parameters": {
        "1": {"name": "Concentration", "inp_type": "float"},
        "2": {"name": "Temp_C", "inp_type": "float"},
    },
    "cost": {"name": "Ligand_Cost_fixed", "inp_type": "float"},
    "target": {"name": "Yield", "inp_type": "float"},
}

data = user_data(csv_file=my_data_path, description=description)
X, y = data.X, data.y
Simply specify the location of your file, the compound columns (you can have arbitrarily many) as well as reaction parameters such as the temperature or the concentration. Finally, also specify the costs and which column corresponds to the reaction yield.
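As a rough illustration of the expected layout, the sketch below checks that a CSV header contains every column named in the description. The helper `required_columns` is hypothetical (it is not part of the cibo package), and the single data row contains made-up placeholder values, not data from the paper.

```python
import csv
import io

# Same column names as in the description dictionary above.
description = {
    "compounds": {
        "1": {"name": "Ligand_SMILES", "inp_type": "smiles"},
        "2": {"name": "Base_SMILES", "inp_type": "smiles"},
        "3": {"name": "Solvent_SMILES", "inp_type": "smiles"},
    },
    "parameters": {
        "1": {"name": "Concentration", "inp_type": "float"},
        "2": {"name": "Temp_C", "inp_type": "float"},
    },
    "cost": {"name": "Ligand_Cost_fixed", "inp_type": "float"},
    "target": {"name": "Yield", "inp_type": "float"},
}


def required_columns(description):
    """Hypothetical helper: list every column name the description refers to."""
    names = [entry["name"] for entry in description["compounds"].values()]
    names += [entry["name"] for entry in description["parameters"].values()]
    names += [description["cost"]["name"], description["target"]["name"]]
    return names


# One made-up row; the values are placeholders for illustration only.
csv_text = (
    "Ligand_SMILES,Base_SMILES,Solvent_SMILES,Concentration,Temp_C,Ligand_Cost_fixed,Yield\n"
    "CC(C)(C)P(C(C)(C)C)C(C)(C)C,CC(=O)[O-].[K+],CC(=O)N(C)C,0.1,105,12.5,43.2\n"
)

# Read only the header line and compare it with the description.
header = next(csv.reader(io.StringIO(csv_text)))
missing = [name for name in required_columns(description) if name not in header]
print("missing columns:", missing)
```

Running a check like this before calling the loader can catch misspelled column names early.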
For each experiment, the parameters are set with the config.py file in the same directory.
A subfolder figures will be created after execution, and the results will be plotted and saved as PNG files in this folder. In addition, the results are saved in a pkl file, results_*.pkl.
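The saved results can be inspected afterwards. The sketch below assumes the pkl files are ordinary Python pickles; the helper `load_results` is hypothetical, and the structure of the stored objects is not described here, so the example only reports their types.

```python
import glob
import os
import pickle


def load_results(folder="."):
    """Return {filename: unpickled object} for every results_*.pkl in folder."""
    results = {}
    for path in sorted(glob.glob(os.path.join(folder, "results_*.pkl"))):
        with open(path, "rb") as handle:
            # Only unpickle files that you created yourself or otherwise trust.
            results[os.path.basename(path)] = pickle.load(handle)
    return results


# Report what was found in the current directory (an empty dict if nothing is there).
for name, content in load_results().items():
    print(name, type(content))
```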
Figure 3:
cd /cibo/AcqFuncPrice/CheapInit/DirectAryl
python costs_min.py
Figure 5:
cd /cibo/AcqFuncPrice/CheapInit/Baumgartner
python costs_min.py
The folder cibo has the following subfolders:
Currently supports two different datasets:
Direct arylation (DA) [1] and Cross-coupling (CC) [2], with yields ranging from 0–100%. To add your own dataset, create a preprocessing script similar to data/baumgartner.py or data/directaryl.py and add the option to load your data to the Evaluation_data class in data/datasets.py. The datasets are called "BMS" and "baumgartner", respectively, when used as a keyword in the exp_config.py files.
Regression on both datasets, resulting in a scatter plot with error bars (correlation.png). All regressors are compatible with botorch:
Gaussian Process Regression: GPR.py
Try the effect of different kernels: the Tanimoto kernel performs quite well and is the default choice. Optionally, also try Random Forest regression (RFR.py), interfaced with sklearn.
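For readers unfamiliar with the Tanimoto kernel, the following is a minimal NumPy sketch of the standard Tanimoto similarity between fingerprint vectors, k(x, y) = <x, y> / (|x|^2 + |y|^2 - <x, y>). It is not the implementation in GPR.py; the function name and the toy fingerprints are assumptions made for illustration.

```python
import numpy as np


def tanimoto_kernel(X, Y):
    """Pairwise Tanimoto similarity between the rows of X and the rows of Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    cross = X @ Y.T                               # <x, y> for every pair
    x_sq = np.sum(X ** 2, axis=1)[:, None]        # |x|^2 as a column
    y_sq = np.sum(Y ** 2, axis=1)[None, :]        # |y|^2 as a row
    denom = x_sq + y_sq - cross
    # The denominator is zero only for two all-zero vectors; this sketch
    # defines their similarity as 1 (a convention, chosen here by assumption).
    return np.divide(cross, denom, out=np.ones_like(cross), where=denom > 0)


# Three toy binary fingerprints (made up for illustration).
fingerprints = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
])
K = tanimoto_kernel(fingerprints, fingerprints)
print(K)
```

For binary fingerprints this reduces to the number of shared bits divided by the number of bits set in either vector, so identical fingerprints have similarity 1 and fingerprints with no shared bits have similarity 0.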
To change the dataset ("dataset"), the initialization scheme ("init_strategy"), and the number of training points ("ntrain"), open the exp_configs_1.py file. Other keywords have no effect on these two scripts and are only relevant for the Bayesian optimization runs.
Reproduce figures from the paper for the two different datasets, Baumgartner and DirectAryl.
In both cases the scripts work identically; the main difference is the config.py scripts that control which configurations are tested. Therein a list is defined, and the experiments are performed one after another:
benchmark = [
    {
        "dataset": "BMS",
        "init_strategy": "worst_ligand",
        "cost_aware": True,
        "n_runs": 5,
        "n_iter": 30,
        "batch_size": 5,
        "ntrain": 200,
        "prices": "update_ligand_when_used",
        "surrogate": "GP",
        "acq_func": "NEI",
        "label": "BMS_COST_GP_NEI",
        "cost_mod": "minus",
        "cost_weight": 1.0
    }
    ...
]
dataset: "BMS" for DirectArylation or "baumgartner" for the Baumgartner dataset.
init_strategy: "worst_ligand" when using "BMS" literally means starting with the ligand that has the worst overall yield given all other reaction conditions. "cheapest" when using "baumgartner" means starting with the cheapest commercially available compounds. "random" is also an option but was not used for the BO/CIBO runs in the paper.
cost_aware: True = CIBO, False = BO
n_iter: number of BO/CIBO iterations
batch_size: batch size for each BO/CIBO iteration
ntrain: maximal number of points for initialization.
prices: "update_ligand_when_used" was the option used for the paper. After buying any compound, it is kept in stock, so it is paid for only once.
surrogate: currently Gaussian process (GP) and random forest (RF) are supported
acq_func: type of acquisition function; tested with "Noisy expected improvement" (NEI)
cost_mod: selected modification of the original acquisition function to include the cost. "minus" corresponds to the results in the paper.
label: label used for the output files.
cost_weight: parameter
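The pay-once idea behind the "update_ligand_when_used" option can be sketched as follows. This reflects only the description above (a compound stays in stock after purchase), not the package's actual code; `batch_cost`, the ligand names, and the prices are hypothetical.

```python
def batch_cost(batch_ligands, prices, inventory):
    """Cost of a batch when each ligand is paid for only on first use.

    `inventory` is a set of ligands already in stock; it is updated in place.
    """
    total = 0.0
    for ligand in batch_ligands:
        if ligand not in inventory:
            total += prices[ligand]   # first use: pay the full price
            inventory.add(ligand)     # from now on the ligand is in stock
    return total


# Hypothetical ligand names and prices, for illustration only.
prices = {"L1": 10.0, "L2": 25.0, "L3": 5.0}
inventory = set()

first = batch_cost(["L1", "L1", "L3"], prices, inventory)
second = batch_cost(["L1", "L2"], prices, inventory)
print(first, second)
```

Under this scheme, repeating experiments with a ligand that is already in stock adds no further ligand cost, which is what makes varying only the temperature or concentration cheap in later iterations.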
Generate the result plots for the paper.
BO.py contains all functions and classes needed to fit a surrogate model.
update_model: Update and return a GP model, refitted from scratch with the new training data.
Surrogate_Model: Surrogate model class that supports different types of kernels, surrogate methods, and model fitting.
- Content: Space for experimental or outdated items.
We welcome contributions and suggestions!
This project is licensed under the MIT License
[1] Shields, B. J.; Stevens, J.; Li, J.; Parasram, M.; Damani, F.; Alvarado, J. I. M.; Janey, J. M.; Adams, R. P.; Doyle, A. G. Bayesian reaction optimization as a tool for chemical synthesis. Nature 2021, 590, 89–96.
[2] Baumgartner, L. M.; Dennis, J. M.; White, N. A.; Buchwald, S. L.; Jensen, K. F. Use of a droplet platform to optimize Pd-catalyzed C–N coupling reactions promoted by organic bases. Org. Process Res. Dev. 2019, 23, 1594–1601.
The measure of time in which he lived had no boundary, no intervening periods, and no intervals; it was an immense, uniform duration that ran endlessly across the horizon like a silver wave, and in which hours, days, years, an eternity were absorbed without one becoming aware of their loss.