### Introduction

A common workflow in computational chemistry and machine learning is to generate combinatorial search spaces. Here, we provide an example of such a workflow, based exclusively on RDKit functions, taken from a recent paper: https://www.nature.com/articles/s41597-023-01977-8. Other approaches (e.g., generative models) and specialized packages exist as well.


The specific project we focus on here involves the development of a dataset of [3+2] cycloaddition reaction profiles, with the aim of building a predictive model for the screening of potential bioorthogonal click reactions. First, we will generate a list of potential dipolarophiles; next, we will turn to the dipoles.

In [1]:
import pandas as pd
from rdkit import Chem
import itertools
import re
import subprocess
from rdkit.Chem.EnumerateStereoisomers import EnumerateStereoisomers

from rdkit import rdBase
from rdkit import RDLogger

# Suppress RDKit warnings
rdBase.DisableLog('rdApp.*')
RDLogger.DisableLog('rdApp.*')

In [2]:
# substituent list
subs_list_LR = ['C', 'F', 'Cl', 'Br', 'C#N', 'C(=O)OC', 'C(=O)C', 'C(=O)NC', 
                'c1ccccc1', 'OC', 'C(F)(F)F', None]

In [3]:
# auxiliary functions
def generate_dipolarophiles(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol)

def single_edit_mol(mol, label, subs):
    if subs is not None:
        mod_mol = Chem.ReplaceSubstructs(mol, Chem.MolFromSmiles(label), Chem.MolFromSmiles(subs))[0]
    else: 
        mod_mol = Chem.DeleteSubstructs(mol, Chem.MolFromSmiles(label))
    return mod_mol

def modify_mol(dipole, subs_comb_LR, labels):
    mol = Chem.MolFromSmiles(dipole)
    mod_mol = single_edit_mol(mol, labels[0],subs_comb_LR[0])
    for i, subs in enumerate(subs_comb_LR[1:]):
        mod_mol = single_edit_mol(mod_mol, labels[i + 1], subs)
    
    return Chem.MolFromSmiles(Chem.MolToSmiles(mod_mol))
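
The helper functions above rely on `Chem.ReplaceSubstructs` attaching the replacement through its first atom (the default connection point), so the order in which a substituent SMILES is written decides which atom bonds to the scaffold. The sketch below illustrates this on a single-site vinyl scaffold; the function name `attach` and the test scaffold are introduced here for illustration only and are not part of the original workflow.

```python
from rdkit import Chem


def attach(scaffold_smiles, label, substituent):
    """Swap one placeholder atom for a substituent (None simply removes it)."""
    mol = Chem.MolFromSmiles(scaffold_smiles)
    query = Chem.MolFromSmiles(label)
    if substituent is None:
        edited = Chem.DeleteSubstructs(mol, query)
    else:
        # The first atom of the substituent SMILES becomes the attachment atom.
        edited = Chem.ReplaceSubstructs(mol, query, Chem.MolFromSmiles(substituent))[0]
    # Round-trip through SMILES to obtain a sanitized, canonical string.
    return Chem.MolToSmiles(Chem.MolFromSmiles(Chem.MolToSmiles(edited)))


ester = attach('C([Ti])=C', '[Ti]', 'C(=O)OC')  # bonded through the carbonyl carbon
ether = attach('C([Ti])=C', '[Ti]', 'OC')       # bonded through the oxygen
bare = attach('C([Ti])=C', '[Ti]', None)        # placeholder removed, hydrogen left
print(ester, ether, bare)
```

Writing the methoxy group as `'CO'` instead of `'OC'` would attach it through carbon and yield an alcohol, which is why the substituent strings in the list all start with the intended attachment atom.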

In [4]:
# generate all ethylene-based dipolarophiles -- metal centers are used as placeholders that indicate the substitution sites.
dipolarophile = 'C(*)(*)=C(*)(*)'
labels = ['[Ti]', '[Cr]', '[Mn]', '[Fe]']
# in bioorthogonal click reactions, at least one substituent must be extendable (connectable)
connectable_substituents = set(['C', 'C(=O)OC', 'C(=O)C', 'C(=O)NC', 'c1ccccc1', 'OC'])
generated_full_dipolarophiles = []
valency_indices = [valency.start() for valency in re.finditer(r'\(\*\)', dipolarophile)]

for i in range(len(valency_indices)):
    dipolarophile = dipolarophile.replace('*', labels[i], 1)

substituent_combs = itertools.product(subs_list_LR, repeat = len(valency_indices))
for subs_comb in substituent_combs:
    if connectable_substituents.intersection(subs_comb) != set(): # make sure at least one substituent is connectable
        if len(set(subs_comb)) == len(subs_comb) - 2: # make sure there are exactly two different types of substituents
            generated_full_dipolarophiles.append(modify_mol(dipolarophile, subs_comb, labels))
    else:
        continue

In [5]:
full_dipolarophile_set = set(list(map(lambda x: Chem.MolToSmiles(x), generated_full_dipolarophiles)))
dipolarophiles_ethylene = set()

for full_dipolarophile in full_dipolarophile_set:
    isomers = tuple(EnumerateStereoisomers(Chem.MolFromSmiles(full_dipolarophile)))
    for smi in set(list(map(lambda x: Chem.MolToSmiles(x), isomers))):
        dipolarophiles_ethylene.add(smi)

print(len(dipolarophiles_ethylene))

255
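
The stereoisomer enumeration matters for the ethylene scaffold because 1,2-disubstituted double bonds come as E and Z isomers, whereas 1,1-disubstituted ones do not. A minimal sketch of this behaviour on two small test molecules follows; the helper name `stereo_smiles` and the chosen molecules are assumptions made for illustration.

```python
from rdkit import Chem
from rdkit.Chem.EnumerateStereoisomers import EnumerateStereoisomers


def stereo_smiles(smiles):
    """Return the set of canonical SMILES of all enumerated stereoisomers."""
    mol = Chem.MolFromSmiles(smiles)
    return {Chem.MolToSmiles(isomer) for isomer in EnumerateStereoisomers(mol)}


vicinal = stereo_smiles('CC=CC')    # but-2-ene: unspecified double bond, E and Z possible
geminal = stereo_smiles('CC(C)=C')  # isobutene: no E/Z isomerism
print(len(vicinal), len(geminal))
```

Collecting the results in a set, as the cell above does, removes any isomers that collapse onto the same canonical SMILES.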


In [6]:
# generate all acetylene-based dipolarophiles
dipolarophile = 'C(*)#C(*)'
connectable_substituents = set(['C', 'C(=O)OC', 'C(=O)C', 'C(=O)NC', 'c1ccccc1', 'OC'])
generated_full_dipolarophiles = []

valency_indices = [valency.start() for valency in re.finditer(r'\(\*\)', dipolarophile)]
for i in range(len(valency_indices)):
    dipolarophile = dipolarophile.replace('*', labels[i], 1)
substituent_combs = itertools.product(subs_list_LR, repeat = len(valency_indices))
for subs_comb in substituent_combs:
    if connectable_substituents.intersection(subs_comb) != set(): # make sure at least one substituent is connectable
        generated_full_dipolarophiles.append(modify_mol(dipolarophile, subs_comb, labels))
    else:
        continue

In [7]:
full_dipolarophile_set = set(list(map(lambda x: Chem.MolToSmiles(x), generated_full_dipolarophiles)))
dipolarophiles_acetylene = set()

for full_dipolarophile in full_dipolarophile_set:
    isomers = tuple(EnumerateStereoisomers(Chem.MolFromSmiles(full_dipolarophile)))
    for smi in set(list(map(lambda x: Chem.MolToSmiles(x), isomers))):
        dipolarophiles_acetylene.add(smi)

print(len(dipolarophiles_acetylene))

57


In [8]:
# generate all norbornene-based dipolarophiles
dipolarophile = 'C(*)1=C(*)C2CCC1C2'
connectable_substituents = set(['C', 'C(=O)OC', 'C(=O)C', 'C(=O)NC', 'c1ccccc1', 'OC'])
generated_full_dipolarophiles = []

valency_indices = [valency.start() for valency in re.finditer(r'\(\*\)', dipolarophile)]
for i in range(len(valency_indices)):
    dipolarophile = dipolarophile.replace('*', labels[i], 1)
substituent_combs = itertools.product(subs_list_LR, repeat = len(valency_indices))
for subs_comb in substituent_combs:
    generated_full_dipolarophiles.append(modify_mol(dipolarophile, subs_comb, labels))

In [9]:
dipolarophiles_norbornene = set(list(map(lambda x: Chem.MolToSmiles(x), generated_full_dipolarophiles)))
print(len(dipolarophiles_norbornene))

78


In [10]:
# generate all oxo-norbornadiene-based dipolarophiles
dipolarophile = 'C(*)1=C(*)C2C=CC1O2'
generated_full_dipolarophiles = []

valency_indices = [valency.start() for valency in re.finditer(r'\(\*\)', dipolarophile)]
for i in range(len(valency_indices)):
    dipolarophile = dipolarophile.replace('*', labels[i], 1)
substituent_combs = itertools.product(subs_list_LR, repeat = len(valency_indices))

for subs_comb in substituent_combs:
    generated_full_dipolarophiles.append(modify_mol(dipolarophile, subs_comb, labels))

In [11]:
full_dipolarophile_set = set(list(map(lambda x: Chem.MolToSmiles(x), generated_full_dipolarophiles)))
dipolarophiles_oxonorbornadiene = set()

for full_dipolarophile in full_dipolarophile_set:
    dipolarophiles_oxonorbornadiene.add(full_dipolarophile)

print(len(dipolarophiles_oxonorbornadiene))

78


In [12]:
# generate all cyclooctyne-based dipolarophiles
dipolarophile = 'C1CCC(*)(*)C#CC(*)(*)C1'
generated_full_dipolarophiles = []

valency_indices = [valency.start() for valency in re.finditer(r'\(\*\)', dipolarophile)]
for i in range(len(valency_indices)):
    dipolarophile = dipolarophile.replace('*', labels[i], 1)
substituent_combs = itertools.product(subs_list_LR, repeat = len(valency_indices))

for subs_comb in substituent_combs:
    if subs_comb[0] == subs_comb[1] or subs_comb[2] == subs_comb[3]: # at least one substituted carbon carries two identical substituents, so at most one of them can be a stereocenter
    # if len(set(subs_comb)) != len(subs_comb): # make sure there are only two different type of substituents
        generated_full_dipolarophiles.append(modify_mol(dipolarophile, subs_comb, labels))

In [13]:
full_dipolarophile_set = set(list(map(lambda x: Chem.MolToSmiles(x), generated_full_dipolarophiles)))
dipolarophiles_cyclooctyne = set()

for full_dipolarophile in full_dipolarophile_set:
    dipolarophiles_cyclooctyne.add(full_dipolarophile)

print(len(dipolarophiles_cyclooctyne))

870
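
The set sizes printed so far (255, 57, 78, 78 and 870) can be cross-checked without RDKit by counting substituent patterns up to the symmetry of each scaffold. The sketch below is such a sanity check, not part of the original workflow: it assumes that the two substituted positions (or carbons) of each scaffold are equivalent and, like the canonical SMILES above, it ignores unspecified ring stereocenters. The names `unordered`, `n_bicyclic` and the like are introduced here for illustration.

```python
import itertools
from math import comb

# Substituent list and connectable subset copied from the cells above;
# None stands for "no substituent" (hydrogen).
subs = ['C', 'F', 'Cl', 'Br', 'C#N', 'C(=O)OC', 'C(=O)C', 'C(=O)NC',
        'c1ccccc1', 'OC', 'C(F)(F)F', None]
connectable = {'C', 'C(=O)OC', 'C(=O)C', 'C(=O)NC', 'c1ccccc1', 'OC'}


def unordered(pair):
    """Order-independent key for two substituents (str() makes None sortable)."""
    return tuple(sorted(pair, key=str))


# Acetylene: two equivalent sites, at least one connectable substituent.
n_acetylene = len({unordered(c) for c in itertools.product(subs, repeat=2)
                   if connectable.intersection(c)})

# Norbornene and oxo-norbornadiene: two equivalent sites, no filter.
n_bicyclic = len({unordered(c) for c in itertools.product(subs, repeat=2)})

# Cyclooctyne: two equivalent carbons with two sites each; a combination is
# kept if at least one carbon carries two identical substituents.
n_cyclooctyne = len({
    frozenset((unordered(c[:2]), unordered(c[2:])))
    for c in itertools.product(subs, repeat=4)
    if c[0] == c[1] or c[2] == c[3]
})

# Ethylene: every unordered pair of distinct substituents with at least one
# connectable member gives five molecules: two 3+1 patterns, one geminal 2+2
# pattern, and the vicinal 2+2 pattern as E and Z isomers.
n_pairs = comb(len(subs), 2) - comb(len(subs) - len(connectable), 2)
n_ethylene = 5 * n_pairs

print(n_ethylene, n_acetylene, n_bicyclic, n_cyclooctyne)  # prints: 255 57 78 870
```

Agreement with the RDKit-based counts suggests that the canonicalization step removes exactly the symmetry-related duplicates and nothing else.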


In [14]:
# turn lists into dataframes
df_ethylene = pd.DataFrame(list(dipolarophiles_ethylene))
df_acetylene = pd.DataFrame(list(dipolarophiles_acetylene))
df_norbornene = pd.DataFrame(list(dipolarophiles_norbornene))
df_oxonorbornadiene = pd.DataFrame(list(dipolarophiles_oxonorbornadiene))
df_cyclooctyne = pd.DataFrame(list(dipolarophiles_cyclooctyne))

In [15]:
# concatenate and save
df = pd.concat((df_ethylene, df_acetylene, df_norbornene, df_oxonorbornadiene, df_cyclooctyne), ignore_index=True)
df.columns = ['dipolarophile']
df.to_csv('work_dir/dipolarophiles.csv')

Now we turn to the dipoles.

In [16]:
# substituent lists -- separate lists for the substituents on the left and right and for the substituent in the middle
subs_list_LR = ['C', 'C#N', 'C(=O)OC', 'C(=O)C', 'C(=O)NC', 'c1ccccc1', None] 
subs_list_M = ['C', None]

In [17]:
# auxiliary functions
def generate_dipoles(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol)

def single_edit_mol(mol, label, subs):
    if subs is not None:
        mod_mol = Chem.ReplaceSubstructs(mol, Chem.MolFromSmiles(label), Chem.MolFromSmiles(subs))[0]
    else:
        mod_mol = Chem.ReplaceSubstructs(mol, Chem.MolFromSmiles(label), Chem.MolFromSmiles('[H]'))[0]
        mod_mol = Chem.RemoveHs(mod_mol)
    return mod_mol

def modify_mol(dipole, subs_comb_LR, subs_M, labels):
    mol = Chem.MolFromSmiles(dipole)
    if 'Sc' in dipole:
        mod_mol = single_edit_mol(mol, '[Sc]', subs_M)
        mod_mol = single_edit_mol(mod_mol, labels[0], subs_comb_LR[0])
        for i, subs in enumerate(subs_comb_LR[1:]):
            mod_mol = single_edit_mol(mod_mol, labels[i + 1], subs)
    else:
        mod_mol = single_edit_mol(mol, labels[0], subs_comb_LR[0])
        for i, subs in enumerate(subs_comb_LR[1:]):
            mod_mol = single_edit_mol(mod_mol, labels[i + 1], subs)
    
    return Chem.MolFromSmiles(Chem.MolToSmiles(mod_mol))

In [18]:
# construct allyl-type dipoles
dipole_scaffolds = []

for L in ['C(*)(*)', 'N(*)']: # O on left side doesn't make sense because then there can be no connection site 
    for M in ['[O+]', '[N+]([Sc])']:
        for R in ['[O-]', '[C-](*)(*)', '[N-](*)']:
            dipole_scaffolds.append(f'{L}={M}{R}')

            
# remove pseudo-duplicates (resonance structure can be "pushed" to other side)
dipole_scaffolds.remove('N(*)=[N+]([Sc])[C-](*)(*)')
dipole_scaffolds.remove('N(*)=[O+][C-](*)(*)')
print(len(dipole_scaffolds))
print(dipole_scaffolds)

10
['C(*)(*)=[O+][O-]', 'C(*)(*)=[O+][C-](*)(*)', 'C(*)(*)=[O+][N-](*)', 'C(*)(*)=[N+]([Sc])[O-]', 'C(*)(*)=[N+]([Sc])[C-](*)(*)', 'C(*)(*)=[N+]([Sc])[N-](*)', 'N(*)=[O+][O-]', 'N(*)=[O+][N-](*)', 'N(*)=[N+]([Sc])[O-]', 'N(*)=[N+]([Sc])[N-](*)']


In [19]:
labels = ['[Ti]', '[Cr]', '[Mn]', '[Fe]']
connectable_substituents = set(['C', 'C(=O)OC', 'C(=O)C', 'C(=O)NC', 'c1ccccc1'])
generated_full_dipoles = []

for dipole in dipole_scaffolds:
    valency_indices = [valency.start() for valency in re.finditer(r'\(\*\)', dipole)]
    for i in range(len(valency_indices)):
        dipole = dipole.replace('*', labels[i], 1)
    substituent_combs = itertools.product(subs_list_LR, repeat = len(valency_indices))
    for subs_comb in substituent_combs:
        if connectable_substituents.intersection(subs_comb) != set(): # make sure at least one substituent is connectable
            for subs_M in subs_list_M:
                generated_full_dipoles.append(modify_mol(dipole,subs_comb, subs_M, labels))
        else:
            continue

In [20]:
print(len(generated_full_dipoles))
dipoles_double = set(list(map(lambda x: Chem.MolToSmiles(x), generated_full_dipoles)))
print(len(dipoles_double))

11260
3120
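
The first number printed above (the raw count before deduplication) follows directly from the loop structure: a scaffold with n substitution sites contributes 7^n substituent combinations, minus the 2^n combinations built only from the non-connectable entries (`'C#N'` and `None`), times the two choices in `subs_list_M`. The sketch below reproduces that arithmetic with the standard library only; the variable names are illustrative.

```python
import re

# Rebuild the ten allyl-type scaffolds exactly as in the cell above.
scaffolds = [f'{L}={M}{R}'
             for L in ['C(*)(*)', 'N(*)']
             for M in ['[O+]', '[N+]([Sc])']
             for R in ['[O-]', '[C-](*)(*)', '[N-](*)']]
for pseudo_duplicate in ('N(*)=[N+]([Sc])[C-](*)(*)', 'N(*)=[O+][C-](*)(*)'):
    scaffolds.remove(pseudo_duplicate)

n_subs = 7             # len(subs_list_LR), including None
n_not_connectable = 2  # 'C#N' and None
n_middle = 2           # len(subs_list_M)

total = 0
for scaffold in scaffolds:
    n_sites = len(re.findall(r'\(\*\)', scaffold))
    total += n_middle * (n_subs ** n_sites - n_not_connectable ** n_sites)

print(total)  # prints: 11260
```

Part of the drop from 11260 to 3120 unique SMILES is built in: scaffolds without the `[Sc]` placeholder are still generated once per entry of `subs_list_M`, so each of their molecules appears at least twice before deduplication.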


In [21]:
# construct propargyl-type dipoles
dipole_scaffolds2 = []

for L in ['C(*)', 'N']: # O on left side doesn't make sense because then there can no longer be a connection site 
    for M in ['[N+]']:
        for R in ['[O-]', '[C-](*)(*)', '[N-](*)']:
            dipole_scaffolds2.append(f'{L}#{M}{R}')

dipole_scaffolds2.remove('N#[N+][O-]')
print(len(dipole_scaffolds2))

5


In [22]:
generated_full_dipoles2 = []

for dipole in dipole_scaffolds2:
    valency_indices = [valency.start() for valency in re.finditer(r'\(\*\)', dipole)]
    for i in range(len(valency_indices)):
        dipole = dipole.replace('*', labels[i], 1)
    substituent_combs = itertools.product(subs_list_LR, repeat = len(valency_indices))
    for subs_comb in substituent_combs:
        if connectable_substituents.intersection(subs_comb) != set(): # make sure at least one substituent is connectable
            for subs_M in subs_list_M:
                generated_full_dipoles2.append(modify_mol(dipole,subs_comb, subs_M, labels))
        else:
            continue

In [23]:
full_dipole_set2 = set(list(map(lambda x: Chem.MolToSmiles(x), generated_full_dipoles2)))
dipoles_triple = set()

for full_dipole in full_dipole_set2:
    isomers = tuple(EnumerateStereoisomers(Chem.MolFromSmiles(full_dipole)))
    for smi in set(list(map(lambda x: Chem.MolToSmiles(x), isomers))):
        dipoles_triple.add(smi)
        
print(len(dipoles_triple))

270


In [24]:
# construct cyclic dipoles
dipole_scaffolds3 = []

for L in ['C(*)', 'N']: # O on left side doesn't make sense because then there can be no connection site 
    for M in ['[O+]', '[N+]([Sc])']:
        for R in ['[C-](*)', '[N-]']:
            dipole_scaffolds3.append(f'{L}2={M}{R}C(=O)O2')
            
print(len(dipole_scaffolds3))

8


In [25]:
generated_full_dipoles3 = []

for dipole in dipole_scaffolds3:
    valency_indices = [valency.start() for valency in re.finditer(r'\(\*\)', dipole)]
    for i in range(len(valency_indices)):
        dipole = dipole.replace('*', labels[i], 1)
    substituent_combs = itertools.product(subs_list_LR, repeat = len(valency_indices))
    for subs_comb in substituent_combs:
        if connectable_substituents.intersection(subs_comb) != set(): # make sure at least one substituent is connectable
            for subs_M in subs_list_M:
                generated = modify_mol(dipole,subs_comb, subs_M, labels)
                if generated is not None:
                    generated_full_dipoles3.append(generated)
                else:
                    print(dipole, subs_comb)
        else:
            continue

In [26]:
dipoles_ring = set(list(map(lambda x: Chem.MolToSmiles(x), generated_full_dipoles3)))    
print(len(dipoles_ring))   

165


In [27]:
# turn lists into dataframes
df_double = pd.DataFrame(list(dipoles_double))
df_triple = pd.DataFrame(list(dipoles_triple))
df_ring = pd.DataFrame(list(dipoles_ring))

In [28]:
# concatenate
df = pd.concat((df_double, df_triple, df_ring), ignore_index=True)
df.columns = ['dipole']
df.to_csv('work_dir/dipoles.csv')

### Combination into reaction SMILES

The next step is to combine the lists of dipoles and dipolarophiles into reaction SMILES, which can be parsed by autodE (see previous notebook). This is a complex procedure, not least because the stereochemistry needs to be made consistent between reactants and products. Because of time constraints, we will not go into detail on this point; instead, a script containing the necessary functions is called below on a small subset of the generated dipoles and dipolarophiles as an illustration.
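
For orientation, a reaction SMILES has the form `reactants>>products`, with `.` separating the species on each side. The sketch below pairs a few hand-picked reactants and assembles one complete example; it is not the logic of `construct_reaction_smiles.py`, the helper `reaction_smiles` is hypothetical, and the product string was written by hand for this illustration.

```python
import itertools

# Hand-picked examples in the style of the generated lists (illustration only).
dipoles = ['C=[N+](C)[O-]', 'C[N-][N+]#N']
dipolarophiles = ['C=C', 'C#C']


def reaction_smiles(reactants, product):
    """Join reactant SMILES with '.' and append the product after '>>'."""
    return f"{'.'.join(reactants)}>>{product}"


# Every dipole/dipolarophile pair defines the reactant side of one reaction.
reactant_pairs = list(itertools.product(dipoles, dipolarophiles))

# Worked example: N-methyl nitrone + ethylene -> 2-methylisoxazolidine.
# The product SMILES is hand-written; the script derives products itself.
rxn = reaction_smiles(reactant_pairs[0], 'CN1CCCO1')
print(len(reactant_pairs))
print(rxn)
```

The difficult part, which this sketch leaves out entirely, is generating the product side with stereochemistry that is consistent with the reactants.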

In [29]:
command = 'python construct_reaction_smiles.py'
# Merge stderr into stdout so that a single pipe is read; draining two pipes one after the other can block if the unread one fills up
process = subprocess.Popen(command.split(), stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)

# Print the combined output line by line
for line in iter(process.stdout.readline, ''):
    print(line, end='')

process.wait()


  0%|          | 0/30 [00:00<?, ?it/s]
  3%|▎         | 1/30 [00:00<00:05,  5.58it/s]
 13%|█▎        | 4/30 [00:00<00:01, 15.69it/s]
 27%|██▋       | 8/30 [00:00<00:01, 15.88it/s]
 33%|███▎      | 10/30 [00:01<00:02,  7.68it/s]
 43%|████▎     | 13/30 [00:01<00:01, 10.71it/s]
 53%|█████▎    | 16/30 [00:01<00:01, 13.90it/s]
 63%|██████▎   | 19/30 [00:01<00:00, 16.98it/s]
 73%|███████▎  | 22/30 [00:01<00:00, 19.18it/s]
 83%|████████▎ | 25/30 [00:01<00:00, 20.85it/s]
 97%|█████████▋| 29/30 [00:01<00:00, 23.60it/s]
100%|██████████| 30/30 [00:01<00:00, 16.40it/s]


The resulting SMILES strings can be visualized by copying and pasting them into ChemDraw. Take the time to look at a couple of them; for those that contain stereo-elements, you will see that they are conserved on both sides of the reaction, i.e., the reactions make stereochemical sense. Note that this is not automatically the case when applying reaction templates in RDKit!

The generated reaction SMILES can now be passed on to autodE. Note, however, that autodE does not automatically preserve stereochemistry in all species either, even when it is indicated in the reaction SMILES. Post-hoc correction of some reaction profiles/SMILES may therefore be needed, but this falls outside the scope of this tutorial (see, e.g., https://www.nature.com/articles/s41597-023-01977-8).