In [1]:
import pandas as pd
from edbogenerator import EDBOGenerator

## Run Initialization
- Create a run object (example_run) for us to perform actions on
- Initialize required directory and files if missing (components and logging)

In [2]:
example_run = EDBOGenerator(run_dir='example_run')

run directory did not exist, created one
logging file (summary.json) generated in run directory
components.csv file did not exist, created a template. run successfully initialized


The components template file looks like this. The component columns (base, temperature, catalyst, etc.) define the reaction scope, and the min/max columns define the run objectives (maximize desired_product; minimize byproducts 1, 2, etc.). Update this file manually or programmatically to list the reaction components and options you want to include.

In [3]:
dummy_components = pd.read_csv('components.csv')
dummy_components

Unnamed: 0,component_1,component_2,component_3,component_x,min,max
0,a,e,i,m,obj_1,obj_3
1,b,f,j,n,obj_2,
2,,g,k,o,,
3,,h,,p,,
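
To make the template layout concrete, here is a minimal sketch that splits a components table into per-component option lists and min/max objective lists. It parses an inline copy of the template rather than the real file, and the names `options` and `objectives` are illustrative only; this is not how EDBOGenerator reads the file internally.

```python
import io
import pandas as pd

# Inline copy of the template shown above, so the sketch runs without any files.
template_csv = """component_1,component_2,component_3,component_x,min,max
a,e,i,m,obj_1,obj_3
b,f,j,n,obj_2,
,g,k,o,,
,h,,p,,
"""
components = pd.read_csv(io.StringIO(template_csv))

objective_cols = ["min", "max"]

# Every non-objective column is a reaction component; blank cells are just padding.
options = {
    col: components[col].dropna().tolist()
    for col in components.columns
    if col not in objective_cols
}
# Objectives to minimize and to maximize, with blank cells dropped.
objectives = {col: components[col].dropna().tolist() for col in objective_cols}

print(options)
print(objectives)
```

Columns can have different numbers of options because blank cells are dropped before use.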


In [7]:
#Updating programmatically for this example, but probably easier to do in the csv directly
#Let's say we're attempting to optimize a Buchwald
updated_components = pd.DataFrame({
            'base':['K3PO4','Cs2CO3','',''],
            'catalyst':['RuPhos Pd G4','XPhos Pd G3','XantPhos Pd G4','JackiePhos Pd G3'],
            'temperature':['80','100','120',''],
            'solvent':['THF','NMP','dioxane','toluene'],
            'min':['dehalogenation_pdt','hydrolyis_pdt','',''],
            'max':['desired_pdt','','','']
        })
updated_components.to_csv('components.csv', index = False)
updated_components

Unnamed: 0,base,catalyst,temperature,solvent,min,max
0,K3PO4,RuPhos Pd G4,80.0,THF,dehalogenation_pdt,desired_pdt
1,Cs2CO3,XPhos Pd G3,100.0,NMP,hydrolyis_pdt,
2,,XantPhos Pd G4,120.0,dioxane,,
3,,JackiePhos Pd G3,,toluene,,


## Scope Initialization
- Generate a combinatorial scope of the defined reaction components
- Perform one empty round of optimization to generate a 'round_0' file that is ready for data input
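
As a rough picture of what a combinatorial scope means, the sketch below builds the Cartesian product of the Buchwald example options with `itertools.product` and marks every objective as PENDING. It is an illustration under those assumptions, not EDBOGenerator's implementation, and the file the package writes may differ in row order or content.

```python
from itertools import product
import pandas as pd

# Component options from the Buchwald example above (blank cells dropped).
options = {
    "base": ["K3PO4", "Cs2CO3"],
    "catalyst": ["RuPhos Pd G4", "XPhos Pd G3", "XantPhos Pd G4", "JackiePhos Pd G3"],
    "temperature": [80, 100, 120],
    "solvent": ["THF", "NMP", "dioxane", "toluene"],
}
objectives = ["dehalogenation_pdt", "hydrolyis_pdt", "desired_pdt"]

# One row per combination of options (Cartesian product).
scope = pd.DataFrame(list(product(*options.values())), columns=list(options))

# Objectives start out unmeasured.
for obj in objectives:
    scope[obj] = "PENDING"

print(len(scope))  # 2 * 4 * 3 * 4 = 96
```

Comparing this count against the number of rows in your own round_0 file is a quick sanity check on the components file.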

In [6]:
#Run this every time you start a new session (even if you already have a few rounds of prediction)
example_run.initialize_scope(batch_size=5)

Generating reaction scope...
scope successfully initialized
The following columns are categorical and will be encoded using One-Hot-Encoding: ['base', 'catalyst', 'solvent']
Sampling type:  selection 


Number of unique samples returned by sampling algorithm: 5
Creating a priority list using random sampling: cvtsampling
no round_0 file found,a new one was successfully generated. ready for data input!


The round_0 file looks like this: 
- The priority column flags the condition sets (rows) that the underlying EDBO+ algorithm ranks as high priority to explore next

In [8]:
round_0 = pd.read_csv('example_run_round_0.csv')
round_0
#Total number of rows in round_0 file matches the number of combinations of components. Nice!

Unnamed: 0,base,catalyst,temperature,solvent,dehalogenation_pdt,hydrolyis_pdt,desired_pdt,priority
0,K3PO4,RuPhos Pd G4,100.0,THF,PENDING,PENDING,PENDING,1
1,Cs2CO3,RuPhos Pd G4,100.0,dioxane,PENDING,PENDING,PENDING,1
2,Cs2CO3,XPhos Pd G3,100.0,toluene,PENDING,PENDING,PENDING,1
3,Cs2CO3,XPhos Pd G3,100.0,dioxane,PENDING,PENDING,PENDING,1
4,K3PO4,XPhos Pd G3,100.0,dioxane,PENDING,PENDING,PENDING,1
...,...,...,...,...,...,...,...,...
91,K3PO4,XantPhos Pd G4,100.0,toluene,PENDING,PENDING,PENDING,0
92,K3PO4,XantPhos Pd G4,100.0,dioxane,PENDING,PENDING,PENDING,0
93,K3PO4,XantPhos Pd G4,100.0,NMP,PENDING,PENDING,PENDING,0
94,K3PO4,XantPhos Pd G4,100.0,THF,PENDING,PENDING,PENDING,0
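
To pull out just the experiments to run next, you can filter on the priority column. The sketch below uses a small stand-in DataFrame and assumes, as in the table above, that priority 1 marks the suggested batch.

```python
import pandas as pd

# Small stand-in for a round file; a real one would come from pd.read_csv(...).
round_df = pd.DataFrame({
    "base": ["K3PO4", "Cs2CO3", "K3PO4"],
    "solvent": ["THF", "dioxane", "toluene"],
    "desired_pdt": ["PENDING", "PENDING", "PENDING"],
    "priority": [1, 1, 0],
})

# Keep only the rows flagged as high priority.
to_run = round_df[round_df["priority"] == 1]
print(to_run[["base", "solvent"]])
```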


## Optimization Time!
Several options here: 
- We can perform a standard single round of prediction by entering a batch of experimental results into {run_name}_round_x.csv (typical usage)
- We can also perform bulk data input from a training csv, which can be useful if you have existing experimental data you would like to include in the optimization
- Finally, we can run a simulated optimization with a full set of experimental results. This can be useful if you want to validate the predictive ability of the algorithm before running a real optimization

Each of these options has its own method, and all required file generation/logging is handled automatically

In [9]:
#Single round prediction. Begin at round 0 (or the highest round number in the run directory if you began optimization previously)

#For this example, I will programmatically insert dummy data representing percent conversion to desired product/byproducts
# You can also edit the csv directly if you prefer
import numpy as np
round_0 = pd.read_csv('example_run_round_0.csv')
round_0.iloc[0:5,4:7] = np.random.rand(5,3)
round_0.to_csv('example_run_round_0.csv',index = False)
round_0

Unnamed: 0,base,catalyst,temperature,solvent,dehalogenation_pdt,hydrolyis_pdt,desired_pdt,priority
0,K3PO4,RuPhos Pd G4,100.0,THF,0.523781,0.254522,0.595685,1
1,Cs2CO3,RuPhos Pd G4,100.0,dioxane,0.070229,0.055544,0.522639,1
2,Cs2CO3,XPhos Pd G3,100.0,toluene,0.14946,0.097818,0.173855,1
3,Cs2CO3,XPhos Pd G3,100.0,dioxane,0.204286,0.454851,0.763389,1
4,K3PO4,XPhos Pd G3,100.0,dioxane,0.342003,0.030445,0.461793,1
...,...,...,...,...,...,...,...,...
91,K3PO4,XantPhos Pd G4,100.0,toluene,PENDING,PENDING,PENDING,0
92,K3PO4,XantPhos Pd G4,100.0,dioxane,PENDING,PENDING,PENDING,0
93,K3PO4,XantPhos Pd G4,100.0,NMP,PENDING,PENDING,PENDING,0
94,K3PO4,XantPhos Pd G4,100.0,THF,PENDING,PENDING,PENDING,0
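
The positional slice above works because the suggested rows sit at the top and the objectives occupy columns 4 to 6. If you would rather not depend on positions, a label-based variant is sketched below on a small stand-in DataFrame. It assumes priority 1 marks the suggested rows, and the seeded generator is only there to make the dummy data reproducible.

```python
import numpy as np
import pandas as pd

# Stand-in for a round file with two suggested rows and one low-priority row.
round_df = pd.DataFrame({
    "catalyst": ["RuPhos Pd G4", "XPhos Pd G3", "XantPhos Pd G4"],
    "dehalogenation_pdt": ["PENDING"] * 3,
    "hydrolyis_pdt": ["PENDING"] * 3,
    "desired_pdt": ["PENDING"] * 3,
    "priority": [1, 1, 0],
})
objective_cols = ["dehalogenation_pdt", "hydrolyis_pdt", "desired_pdt"]

# Select by priority flag and column name instead of hard-coded positions.
suggested = round_df["priority"] == 1
rng = np.random.default_rng(seed=0)
dummy_results = rng.random((int(suggested.sum()), len(objective_cols)))
round_df.loc[suggested, objective_cols] = dummy_results

print(round_df)
```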


We can then run a round of optimization using these newly entered results:

In [10]:
example_run.single_round_predict(round_num=0,batch_size=5)

The following columns are categorical and will be encoded using One-Hot-Encoding: ['base', 'catalyst', 'solvent']
Using EHVI acquisition function.
Using hyperparameters optimized for continuous variables.
Using hyperparameters optimized for continuous variables.
Using hyperparameters optimized for continuous variables.
Number of QMC samples using SobolQMCNormalSampler sampler: 512
Acquisition function optimized.
Predictions obtained and expected improvement obtained.
optimization successful! next round (round_1) is ready for data input


Success!! Upon completion of a round, a new file is created titled 'pred_{run_name}_round_x.csv' containing prediction data:
- {objective_name}_predicted_mean: the predicted value for this run objective for the condition set in the given row
- {objective_name}_predicted_variance: the predicted variance associated with the above predicted mean, for the condition set in the given row
- {objective_name}_expected_improvement: the expected improvement calculated by the underlying acquisition function of the EDBO+ optimizer, used in selecting the next batch of experiments to perform

In [11]:
round_0_preds = pd.read_csv('pred_example_run_round_0.csv')
round_0_preds

Unnamed: 0,base,catalyst,temperature,solvent,dehalogenation_pdt,hydrolyis_pdt,desired_pdt,priority,dehalogenation_pdt_predicted_mean,dehalogenation_pdt_predicted_variance,dehalogenation_pdt_expected_improvement,hydrolyis_pdt_predicted_mean,hydrolyis_pdt_predicted_variance,hydrolyis_pdt_expected_improvement,desired_pdt_predicted_mean,desired_pdt_predicted_variance,desired_pdt_expected_improvement
0,K3PO4,XantPhos Pd G4,80.0,NMP,PENDING,PENDING,PENDING,1.0,0.258225,1.082743,1.038050,0.178612,1.006348,0.911805,0.503382,1.269464,0.888189
1,K3PO4,RuPhos Pd G4,120.0,THF,PENDING,PENDING,PENDING,1.0,0.258353,1.082743,1.038121,0.178623,1.006348,0.911811,0.503422,1.269464,0.888207
2,K3PO4,RuPhos Pd G4,80.0,dioxane,PENDING,PENDING,PENDING,1.0,0.258187,1.082743,1.038028,0.178601,1.006348,0.911799,0.503450,1.269464,0.888220
3,Cs2CO3,XantPhos Pd G4,120.0,toluene,PENDING,PENDING,PENDING,1.0,0.257951,1.082743,1.037896,0.178609,1.006348,0.911804,0.503236,1.269465,0.888122
4,Cs2CO3,JackiePhos Pd G3,120.0,toluene,PENDING,PENDING,PENDING,1.0,0.257950,1.082743,1.037895,0.178609,1.006348,0.911804,0.503234,1.269464,0.888121
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91,K3PO4,XPhos Pd G3,100.0,dioxane,0.34200316086645866,0.030444653178705927,0.46179339668255515,-1.0,0.332528,0.055586,0.402761,0.176006,0.090483,0.217899,0.528223,0.091664,0.008657
92,K3PO4,RuPhos Pd G4,100.0,THF,0.5237810685604164,0.2545222900760834,0.5956854741775429,-1.0,0.439764,0.064210,0.509994,0.179157,0.120216,0.235023,0.528360,0.123836,0.022720
93,Cs2CO3,XPhos Pd G3,100.0,toluene,0.14945996485504986,0.09781775249704106,0.1738548570443399,-1.0,0.180995,0.053911,0.251586,0.174023,0.107868,0.224302,0.387732,0.125644,0.007450
94,Cs2CO3,XPhos Pd G3,100.0,dioxane,0.20428604232558834,0.45485115629775963,0.7633893597062874,-1.0,0.175167,0.039787,0.245419,0.192475,0.080666,0.229077,0.533039,0.081778,0.005887
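
One way to read the prediction file is to convert the variance into a standard deviation and rank conditions by predicted mean. The sketch below uses the column naming pattern described above on a stand-in DataFrame; the numbers are invented for illustration and the `_predicted_std` column is something this sketch adds, not something the package writes.

```python
import numpy as np
import pandas as pd

# Stand-in rows following the {objective_name}_predicted_* naming pattern.
preds = pd.DataFrame({
    "catalyst": ["RuPhos Pd G4", "XPhos Pd G3", "XantPhos Pd G4"],
    "desired_pdt_predicted_mean": [0.50, 0.53, 0.39],
    "desired_pdt_predicted_variance": [1.27, 0.09, 0.13],
})

# Standard deviation is in the same units as the objective, so it is easier to read.
preds["desired_pdt_predicted_std"] = np.sqrt(preds["desired_pdt_predicted_variance"])

# Rank conditions by predicted mean of the objective being maximized.
ranked = preds.sort_values("desired_pdt_predicted_mean", ascending=False)
print(ranked[["catalyst", "desired_pdt_predicted_mean", "desired_pdt_predicted_std"]])
```

A high predicted mean paired with a large standard deviation is a less certain prediction than the same mean with a small one.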


From here, repeat single rounds of optimization as needed until the desired reaction outcome is observed (or give up and try a different reaction ha ha ha.....)

Or, say you forgot to input your existing experimental results before starting the optimization run.
- Using the bulk input method, you can insert bulk training data mid-run! Simply specify the round you want to continue from (round 1 in our case)
- This method takes a training dataframe in a format similar to your round_0 file, with a column for each reaction component and each run objective listed under min/max (priority column not needed)
- Names of components/run objectives need to match exactly! Take caution.
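
Since a single mismatched name is enough to cause trouble, it can help to compare column names before calling the bulk input method. The helper below, `check_training_columns`, is hypothetical and not part of the package; it simply lists the round-file columns (other than priority) that the training dataframe lacks.

```python
import pandas as pd

def check_training_columns(training_df: pd.DataFrame, round_df: pd.DataFrame) -> list:
    """Return round-file columns (other than 'priority') missing from training_df."""
    required = [c for c in round_df.columns if c != "priority"]
    return [c for c in required if c not in training_df.columns]

# Empty stand-in frames: only the column names matter for this check.
round_df = pd.DataFrame(columns=["base", "catalyst", "desired_pdt", "priority"])
good = pd.DataFrame(columns=["base", "catalyst", "desired_pdt"])
bad = pd.DataFrame(columns=["Base", "catalyst", "desired_product"])

print(check_training_columns(good, round_df))  # []
print(check_training_columns(bad, round_df))   # ['base', 'desired_pdt']
```

The comparison is case-sensitive on purpose, since the names need to match exactly.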

In [23]:
#Let's make a dummy training dataframe
training_df = round_0.copy().drop(columns=['priority']).iloc[71:95,]
training_df.iloc[:,4:7] = np.random.rand(24,3)
training_df.temperature = training_df.temperature.astype(str)
training_df

Unnamed: 0,base,catalyst,temperature,solvent,dehalogenation_pdt,hydrolyis_pdt,desired_pdt
71,K3PO4,RuPhos Pd G4,80.0,dioxane,0.919483,0.714241,0.998847
72,K3PO4,XantPhos Pd G4,80.0,THF,0.149448,0.868126,0.162493
73,K3PO4,XantPhos Pd G4,80.0,NMP,0.61556,0.12382,0.848008
74,K3PO4,XantPhos Pd G4,80.0,dioxane,0.807319,0.569101,0.407183
75,K3PO4,XantPhos Pd G4,80.0,toluene,0.069167,0.697429,0.453543
76,K3PO4,JackiePhos Pd G3,120.0,dioxane,0.722056,0.866382,0.975522
77,K3PO4,JackiePhos Pd G3,120.0,NMP,0.855803,0.011714,0.359978
78,K3PO4,JackiePhos Pd G3,120.0,THF,0.729991,0.17163,0.521037
79,K3PO4,JackiePhos Pd G3,100.0,toluene,0.054338,0.199997,0.018522
80,K3PO4,JackiePhos Pd G3,100.0,dioxane,0.793698,0.223925,0.345352


In [24]:
#Run the bulk data input at the specified round in the run (usually the most recent round). batch_size here just indicates the batch size desired for the next round
example_run.bulk_training(training_df=training_df,round_num=1,batch_size=5)

data from training df successfully filled for query: base == "K3PO4" & catalyst == "XantPhos Pd G4" & temperature == "80.0" & solvent == "NMP"
data from training df successfully filled for query: base == "K3PO4" & catalyst == "RuPhos Pd G4" & temperature == "80.0" & solvent == "dioxane"
no matching row in training df for query: base == "Cs2CO3" & catalyst == "XantPhos Pd G4" & temperature == "120.0" & solvent == "toluene"
no matching row in training df for query: base == "Cs2CO3" & catalyst == "JackiePhos Pd G3" & temperature == "120.0" & solvent == "toluene"
no matching row in training df for query: base == "K3PO4" & catalyst == "RuPhos Pd G4" & temperature == "120.0" & solvent == "THF"
no matching row in training df for query: base == "Cs2CO3" & catalyst == "XPhos Pd G3" & temperature == "120.0" & solvent == "toluene"
no matching row in training df for query: base == "Cs2CO3" & catalyst == "XPhos Pd G3" & temperature == "80.0" & solvent == "THF"
no matching row in training df for que

Success!! Results from training df are filled and priorities are correctly set. We're now ready for more rounds of prediction

In [26]:
pd.read_csv('example_run_round_1.csv').tail(30)

Unnamed: 0,base,catalyst,temperature,solvent,dehalogenation_pdt,hydrolyis_pdt,desired_pdt,priority
46,Cs2CO3,XPhos Pd G3,120.0,dioxane,PENDING,PENDING,PENDING,0.0
47,Cs2CO3,XantPhos Pd G4,80.0,THF,PENDING,PENDING,PENDING,0.0
48,Cs2CO3,XantPhos Pd G4,80.0,NMP,PENDING,PENDING,PENDING,0.0
49,Cs2CO3,RuPhos Pd G4,120.0,toluene,PENDING,PENDING,PENDING,0.0
50,Cs2CO3,XantPhos Pd G4,80.0,dioxane,PENDING,PENDING,PENDING,0.0
51,Cs2CO3,XantPhos Pd G4,80.0,toluene,PENDING,PENDING,PENDING,0.0
52,Cs2CO3,XantPhos Pd G4,100.0,THF,PENDING,PENDING,PENDING,0.0
53,Cs2CO3,JackiePhos Pd G3,100.0,THF,PENDING,PENDING,PENDING,0.0
54,Cs2CO3,XantPhos Pd G4,100.0,dioxane,PENDING,PENDING,PENDING,0.0
55,Cs2CO3,XantPhos Pd G4,100.0,toluene,PENDING,PENDING,PENDING,0.0


## Planned features
- Built-in visualization methods to examine optimization success/when to stop optimizing
- A UI, negating the need to edit csv's directly
- Optimization summary generation
- Better logging

For now, logging is stored as {run_name}_summary.json in the run directory. It is persistent between sessions, so you can pick up where you left off as long as you re-run initialize_scope() (which will see that you have an existing scope and load existing components/objectives accordingly)
- Log is organized as a dict of dicts. Each entry is an action performed during the optimization run
- Each entry contains information regarding what action was taken (timestamp, method name, params, and result)
- A more refined logging approach is planned

In [27]:
#Access logs like this:
example_run.summary

{0: {'timestamp': '2024-08-18 21:10:36',
  'method': 'initialize_scope',
  'params': {'batch_size': 5, 'kwargs': {}},
  'result': None},
 1: {'timestamp': '2024-08-18 21:11:47',
  'method': 'single_round_predict',
  'params': {'round_num': 0, 'batch_size': 5, 'kwargs': {}},
  'result': None},
 2: {'timestamp': '2024-08-18 21:24:41',
  'method': 'bulk_training',
  'params': {'training_df': {'base': {71: 'K3PO4',
     72: 'K3PO4',
     73: 'K3PO4',
     74: 'K3PO4',
     75: 'K3PO4',
     76: 'K3PO4',
     77: 'K3PO4',
     78: 'K3PO4',
     79: 'K3PO4',
     80: 'K3PO4',
     81: 'K3PO4',
     82: 'K3PO4',
     83: 'K3PO4',
     84: 'K3PO4',
     85: 'K3PO4',
     86: 'K3PO4',
     87: 'K3PO4',
     88: 'K3PO4',
     89: 'K3PO4',
     90: 'K3PO4',
     91: 'K3PO4',
     92: 'K3PO4',
     93: 'K3PO4',
     94: 'K3PO4'},
    'catalyst': {71: 'RuPhos Pd G4',
     72: 'XantPhos Pd G4',
     73: 'XantPhos Pd G4',
     74: 'XantPhos Pd G4',
     75: 'XantPhos Pd G4',
     76: 'JackiePhos Pd G
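
Because entries such as bulk_training store the whole training dataframe in their params, the raw summary gets long quickly. A compact view can be sketched as below, using a stand-in dict with the same dict-of-dicts shape as the log above; in a live session you would presumably iterate over `example_run.summary` instead.

```python
# Stand-in with the same shape as the summary shown above.
summary = {
    0: {"timestamp": "2024-08-18 21:10:36", "method": "initialize_scope",
        "params": {"batch_size": 5, "kwargs": {}}, "result": None},
    1: {"timestamp": "2024-08-18 21:11:47", "method": "single_round_predict",
        "params": {"round_num": 0, "batch_size": 5, "kwargs": {}}, "result": None},
}

# One line per logged action, leaving out the bulky params.
lines = [f"{idx}: {entry['timestamp']} {entry['method']}" for idx, entry in summary.items()]
print("\n".join(lines))
```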

## Thanks for sticking around! Always happy to chat, feel free to reach out with any suggestions/questions.