# Overview of created data

This notebook gives an overview of the data created by the data creation pipeline.

The dataset creation is split into two parts: A) creating training and test pools; and
B) creating the actual benchmark dataset from the test pool.

## A) Creating training and test pools

### Step 0: SDF files

The PubChem SDF files are not provided in this repo but can be downloaded here: https://pubchem.ncbi.nlm.nih.gov/docs/downloads#section=From-the-PubChem-FTP-Site.

In the absence of the SDF files, the data creation pipeline is shown with pseudo data, i.e.,
one (of many) PubChem SDF file chunks (see: ```data/dataset_pools/pseudo_sdf/Compound_014000001_014500000.sdf.gz```).


In [1]:
# Load pseudo SDF
import gzip

file_path = 'data/dataset_pools/pseudo_sdf/Compound_014000001_014500000.sdf.gz'

# Print the first 40 lines (text mode, so no manual decoding is needed)
with gzip.open(file_path, "rt", encoding="utf-8") as sdf:
    for _ in range(40):
        print(sdf.readline().rstrip())

14000009
  -OEChem-05292503352D

 23 23  0     0  0  0  0  0  0999 V2000
    3.7320    0.2500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.4641    0.2500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.8660   -1.2500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.7320   -0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8660   -2.2500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.5981   -1.2500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.0000   -0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.7320   -2.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.5981   -2.2500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.5981    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.5981    1.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.4641    2.2500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.3301    2.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.3291   -2.5

### Step 1: Collect PubChem data

This part of the pipeline loops over the PubChem data chunks and collects SMILES and
IUPAC information.
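
As a rough illustration of what such a collection loop might look like, the sketch below scans SDF text for data items and builds a SMILES list plus a SMILES-to-IUPAC mapping. It is not the pipeline's actual code: the tag names `PUBCHEM_OPENEYE_CAN_SMILES` and `PUBCHEM_IUPAC_NAME`, the helper names, and the hand-written toy records are all assumptions, and only the first line of each data item is kept.

```python
from collections import defaultdict

# Assumed tag names; check them against the data headers in your SDF chunk.
SMILES_TAG = "PUBCHEM_OPENEYE_CAN_SMILES"
IUPAC_TAG = "PUBCHEM_IUPAC_NAME"


def iter_sdf_records(lines):
    """Yield one {tag: value} dict per SDF record (records end with '$$$$')."""
    fields, tag = {}, None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("$$$$"):
            yield fields
            fields, tag = {}, None
        elif line.startswith("> <"):
            # A data header looks like "> <TAG_NAME>"
            tag = line[line.index("<") + 1 : line.rindex(">")]
        elif tag is not None and line:
            fields[tag] = line  # keep only the first value line
            tag = None


def collect_smiles_and_iupacs(lines):
    """Return (list of SMILES, dict mapping SMILES to a list of IUPAC names)."""
    smiles_list, iupac_dict = [], defaultdict(list)
    for fields in iter_sdf_records(lines):
        smiles = fields.get(SMILES_TAG)
        if smiles is None:
            continue
        smiles_list.append(smiles)
        name = fields.get(IUPAC_TAG)
        if name is not None:
            iupac_dict[smiles].append(name)
    return smiles_list, dict(iupac_dict)


# Hand-written toy records (not real PubChem output) to exercise the parser.
demo_lines = """\
1
  -demo-

> <PUBCHEM_IUPAC_NAME>
ethanol

> <PUBCHEM_OPENEYE_CAN_SMILES>
CCO

$$$$
2
  -demo-

> <PUBCHEM_OPENEYE_CAN_SMILES>
C

$$$$
""".splitlines()

demo_smiles, demo_iupacs = collect_smiles_and_iupacs(demo_lines)
print(demo_smiles)
print(demo_iupacs)
```

For a real chunk, the same function could be fed a file handle opened with `gzip.open(path, "rt")`. Molecules without an IUPAC entry end up in the list but not in the dictionary.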

In [2]:
# Load intermediate data after the PubChem data collection step
# The pickle file contains two objects: a list of SMILES strings and a dictionary
# mapping SMILES to IUPAC names
import pickle

file_path = 'data/dataset_pools/intermediate/pubchem_smiles_and_iupacs.pkl'

with open(file_path, 'rb') as f:
    smiles_list, iupac_dict = pickle.load(f)

In [3]:
# SMILES
smiles_list[:10]

['C=C=CC(=O)Oc1ccccc1C',
 'Cc1cccc(N(C(=O)CBr)c2ccccc2)c1',
 'Cc1cc(C)cc(N(C)C(=O)CCl)c1',
 'CN=[S+](F)(F)F',
 'CN=S(F)(F)(F)F',
 'N#S(F)(F)C(F)(C(F)(F)F)C(F)(F)F',
 'N#S(F)(F)C(F)(F)C(F)(F)F',
 'CN(C)S(=O)F',
 'C=C1C(=O)CCC12CCCCC2',
 'CC1(C)CCC(=O)C1(C)C']

In [4]:
# IUPAC
list(iupac_dict.items())[:10]

[('Cc1cccc(N(C(=O)CBr)c2ccccc2)c1',
  ['2-bromo-N-(3-methylphenyl)-N-phenylacetamide']),
 ('Cc1cc(C)cc(N(C)C(=O)CCl)c1',
  ['2-chloro-N-(3,5-dimethylphenyl)-N-methylacetamide']),
 ('CN=S(F)(F)(F)F', ['tetrafluoro(methylimino)-lambda6-sulfane']),
 ('N#S(F)(F)C(F)(C(F)(F)F)C(F)(F)F',
  ['azanylidyne-difluoro-(1,1,1,2,3,3,3-heptafluoropropan-2-yl)-lambda6-sulfane']),
 ('N#S(F)(F)C(F)(F)C(F)(F)F',
  ['azanylidyne-difluoro-(1,1,2,2,2-pentafluoroethyl)-lambda6-sulfane']),
 ('CN(C)S(=O)F', ['[fluorosulfinyl(methyl)amino]methane']),
 ('C=C1C(=O)CCC12CCCCC2', ['4-methylidenespiro[4.5]decan-3-one']),
 ('CC1(C)CCC(=O)C1(C)C', ['2,2,3,3-tetramethylcyclopentan-1-one']),
 ('C=C1C(=O)CCC1(C)c1ccc(C)cc1',
  ['3-methyl-2-methylidene-3-(4-methylphenyl)cyclopentan-1-one']),
 ('C=Cc1sc(-c2ccccc2)nc1C', ['5-ethenyl-4-methyl-2-phenyl-1,3-thiazole'])]

### Step 2: Collect molecules from other benchmarks
This part of the pipeline extracts molecules from other benchmarks and stores them in a
list. 

In [5]:
# Load external test set molecules
file_path = 'data/dataset_pools/external/external_test_set_molecules.pkl'
with open(file_path, 'rb') as f:
    external_smiles = pickle.load(f)
    
# Print first 10 SMILES
external_smiles[:10]

['S=c1nnn(C2CCCN2)[nH]1',
 'CCC#CC1=CC(N2CCCNCC2)=CN=C1Cl',
 'COC1=CC=C(CCNC(=S)NC=CC2=NC3=CC=CC=C3N=C2O)C=C1OC',
 'CCCCCCCCCCCCCCC(=O)OCC(COC(=O)CCCCCCCCCCCCC)OC(=O)CCCCCCCCCCCCCCCCCCCCC(C)CC',
 'C1=CC=C(P(C2=CC=CC=C2)C2=CC=C3C=CC=CC3=C2C2=C3C=CC=CC3=CC=C2P(C2=CC=CC=C2)C2=CC=CC=C2)C=C1.CC(C)(C)OC(=O)N1CCO[C@H](C2=CC=C(Br)C=C2)C1.CC(C)(C)[O-].CC1=CC=CC=C1.N=C(C1=CC=CC=C1)C1=CC=CC=C1.O=C(/C=C/C1=CC=CC=C1)/C=C/C1=CC=CC=C1.[Na+].[Pd]',
 'ClC1=CC=C(C(Cl)(Cl)Cl)C=C1.ClC1=CC=C(C(Cl)(Cl)Cl)C=C1Cl',
 'CC1=CC(Cl)=NC(N2CCCCC2)=N1.COC1=CC(N)=CC=C1N1C=NC(C)=C1',
 'COC1=CC=CC2=C1[OH+][Cu-3]1(O)[O+]=C(C3=CC=NC=C3)[N-][N+]1=C2',
 'CC1=NC=C(C(=O)O)C=C1[O-]',
 'COC1=CC=CC2=C1C(=O)N1CCN(C(=O)OC(C)(C)C)CC21']

### Step 3: Canonicalize PubChem molecules and remove molecules from other benchmarks

To maximize the benefit to the community when integrating our benchmark into the 
existing chemical reasoning benchmark landscape, we excluded molecules that appear in 
the following evaluation sets: LLaSMol, ChemDFM, Ether0, and ChemIQ.

Note that this filtering was performed once, based on the benchmark snapshots available
at the time (which may since have become outdated), so we cannot guarantee complete
disjointness between molecule sets.
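
The filtering idea can be sketched as a set difference over canonical SMILES. This is only a minimal sketch: the function name `remove_external`, the `canonicalize` parameter, and the lookup-table canonicalizer are hypothetical stand-ins. A cheminformatics toolkit would normally supply the canonicalization, and sorting the result is an assumption made here for determinism.

```python
def remove_external(smiles, external, canonicalize):
    """Return sorted, de-duplicated canonical SMILES that are absent from `external`.

    `canonicalize` maps a SMILES string to its canonical form, or to None
    if the string cannot be parsed.
    """
    blocked = {canonicalize(s) for s in external}
    kept = {canonicalize(s) for s in smiles}
    kept.discard(None)  # drop unparsable inputs
    return sorted(kept - blocked)


# Toy canonicalizer for illustration only: a lookup table that maps
# Kekule benzene and a reordered ethanol to one canonical spelling.
_TOY_CANONICAL = {"C1=CC=CC=C1": "c1ccccc1", "OCC": "CCO"}


def toy_canonicalize(smiles):
    return _TOY_CANONICAL.get(smiles, smiles)


pubchem_like = ["c1ccccc1", "OCC", "CCO", "CC(=O)O"]
external_like = ["C1=CC=CC=C1"]  # same molecule as "c1ccccc1", written differently

kept = remove_external(pubchem_like, external_like, toy_canonicalize)
print(kept)
```

Comparing canonical forms rather than raw strings matters because the same molecule can be written in several ways, as with the Kekulé and aromatic spellings of benzene above.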

In [6]:
# Load and inspect processed data
file_path = 'data/dataset_pools/processed/filtered_smiles_and_iupacs.pkl'

with open(file_path, 'rb') as f:
    filtered_smiles, filtered_iupac_dict = pickle.load(f)

In [7]:
# Filtered SMILES
filtered_smiles[:10]

['B#C',
 'B1=CC=C1',
 'BP(c1ccccc1)c1ccccc1',
 'B[C@@H]1C[C@@H]2C[C@H]([C@H]1CC)C2(C)C',
 'Br/C(=C(\\Br)C(Br)Br)C(Br)Br',
 'Br/C1=C/CC/C=C\\CC1',
 'Br/C1=C/CCC2OC2CC1',
 'Br/C1=C/CCCC[C@H]2O[C@@H]12',
 'Br/C=C/CCBr',
 'Br/C=C/CCCBr']

In [8]:
# First entries of filtered IUPAC dictionary
list(filtered_iupac_dict.items())[:10]

[('Cc1cccc(N(C(=O)CBr)c2ccccc2)c1',
  ['2-bromo-N-(3-methylphenyl)-N-phenylacetamide']),
 ('Cc1cc(C)cc(N(C)C(=O)CCl)c1',
  ['2-chloro-N-(3,5-dimethylphenyl)-N-methylacetamide']),
 ('CN=S(F)(F)(F)F', ['tetrafluoro(methylimino)-lambda6-sulfane']),
 ('N#S(F)(F)C(F)(C(F)(F)F)C(F)(F)F',
  ['azanylidyne-difluoro-(1,1,1,2,3,3,3-heptafluoropropan-2-yl)-lambda6-sulfane']),
 ('N#S(F)(F)C(F)(F)C(F)(F)F',
  ['azanylidyne-difluoro-(1,1,2,2,2-pentafluoroethyl)-lambda6-sulfane']),
 ('CN(C)S(=O)F', ['[fluorosulfinyl(methyl)amino]methane']),
 ('C=C1C(=O)CCC12CCCCC2', ['4-methylidenespiro[4.5]decan-3-one']),
 ('CC1(C)CCC(=O)C1(C)C', ['2,2,3,3-tetramethylcyclopentan-1-one']),
 ('C=C1C(=O)CCC1(C)c1ccc(C)cc1',
  ['3-methyl-2-methylidene-3-(4-methylphenyl)cyclopentan-1-one']),
 ('C=Cc1sc(-c2ccccc2)nc1C', ['5-ethenyl-4-methyl-2-phenyl-1,3-thiazole'])]

### Step 4: Perform a cluster-based split to create training and test pools

To allow for both future model training on benchmark tasks and unbiased evaluation,
we create three pools from the filtered PubChem data by performing a cluster-based
split:
1) A training pool: Contains molecules from which training samples can be built.
2) An easy test set: Contains molecules that are drawn from clusters present in the 
   training pool.
3) A hard test set: Contains molecules that are drawn from clusters not present in the
   training pool.

The splitting is performed using MinHash LSH to cluster similar molecules based on
their Morgan fingerprints. The resulting pools are pairwise disjoint sets of molecules.
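
The three-pool logic can be sketched as follows, assuming cluster labels are already available (the MinHash LSH clustering itself is omitted). The function name `cluster_split`, the fraction parameters, and the `mol_i` placeholder identifiers are hypothetical, and the pipeline's actual proportions and sampling rules may differ.

```python
import random
from collections import defaultdict


def cluster_split(cluster_of, hard_frac=0.1, easy_frac=0.1, seed=0):
    """Split molecules into (hard test, easy test, training) pools.

    cluster_of: dict mapping a molecule identifier to its cluster id.
    Hard test molecules come from clusters held out entirely; easy test
    molecules come from clusters that keep at least one training member.
    """
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for mol, cluster_id in cluster_of.items():
        clusters[cluster_id].append(mol)

    cluster_ids = sorted(clusters)
    rng.shuffle(cluster_ids)
    n_hard = max(1, round(hard_frac * len(cluster_ids)))

    # Whole clusters go to the hard test pool.
    hard = [m for cid in cluster_ids[:n_hard] for m in clusters[cid]]

    train, candidates = [], []
    for cid in cluster_ids[n_hard:]:
        members = list(clusters[cid])
        rng.shuffle(members)
        train.append(members[0])        # keep every remaining cluster in training
        candidates.extend(members[1:])  # the rest may become easy test molecules

    rng.shuffle(candidates)
    n_easy = round(easy_frac * len(candidates))
    easy = candidates[:n_easy]
    train.extend(candidates[n_easy:])
    return hard, easy, train


# Toy input: 20 placeholder molecules in 5 clusters of 4.
toy_clusters = {f"mol_{i}": i // 4 for i in range(20)}
hard, easy, train = cluster_split(toy_clusters, hard_frac=0.2, easy_frac=0.25)
print(len(hard), len(easy), len(train))
```

Reserving one member of each remaining cluster for training is what guarantees that every easy test molecule has a cluster neighbour in the training pool.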

In [9]:
# Load created dataset pools

file_path = 'data/dataset_pools/processed/pubchem_train_test_pools.pkl'
with open(file_path, 'rb') as f:
    hard_test_set, easy_test_set, training_set, iupac_dict = pickle.load(f)

In [10]:
# Inspect hard test pool
hard_test_set[:10]

['C=C(C)Cc1cc(C(=O)c2ccccc2)ccc1O',
 'CCCCNC(=O)Cc1sc(NC(C)=O)nc1CC',
 'CC1(C)CC[C@]2(C)CC[C@]3(C)[C@H]4CC[C@H]5C(C)(C)[C@@H]6CC[C@]5(O6)[C@]4(C)CC[C@@]3(C)[C@@H]2C1',
 'CC(=O)Nc1ccc(C23CC2C(=O)NC3=O)cc1',
 'Nc1cc(Br)c2ccccc2n1',
 'CC1C(=O)CC2C(C(=O)O)=COC(OC3OC(CO)C(O)C(O)C3O)C12',
 'CN(C)C(=O)C(CCN1CCCCC1)c1ccccc1',
 'COc1ccc2oc(C=O)c3c2c1C=CC=C3',
 'COCOc1ccc(/C=C/C(=O)c2ccc(OCOC)c(CC=C(C)C)c2O)cc1',
 'Cc1cc2c(cc1C(=O)O)CS(=O)(=O)C2']

In [11]:
# Inspect easy test pool
easy_test_set[:10]

['O=S1(=O)CC(c2ccccc2)CS(=O)(=O)C12CCCCC2',
 'O=c1c2ccccc2nc(COc2ccc(I)cc2)n1CCO',
 'CC(C)(Cl)c1ccc(C#N)cc1',
 'COC(=O)CCC(=O)c1cccc(OCc2ccc3ccccc3n2)c1',
 'CC(C)CCCCCCCCCCOC(=O)c1ccccc1N',
 'Fc1cc(C(F)(F)F)cnc1Oc1cc(F)c(OCc2ccccc2)c(F)c1',
 'CC(C)(C)NC12CC1(C)CCCO2',
 'COCOc1cc(/C=C/C(=O)c2c(OC)cc(OC)c(CC=C(C)C)c2O)ccc1OC',
 'O=C(N[C@H]1C[C@H]2SC[C@H](C(=O)O)N2C1=O)OCc1ccccc1',
 'O=c1cc(-c2ccccc2)oc(-c2ccccc2)c1O']

In [12]:
# Inspect training pool
training_set[:10]

['Br/C=C/CCCBr',
 'BrC12CC3CC(CC(C3)C1I)C2',
 'BrC1CCCC2CCCCC12',
 'BrCCC1OCCc2ccccc21',
 'BrCc1cc(Br)c2cccnc2c1Br',
 'BrCc1cc(Br)c2ncccc2c1Br',
 'BrCc1cccc(CBr)c1Sc1c(CBr)cccc1CBr',
 'Br[C@H]1C2C3CC1[C@@H](Br)C32',
 'Brc1c(Br)c(Br)c2c(c1Br)CCCC2',
 'Brc1cc(I)c2c(c1)CCN2']

### Hard test pool dataframe
This dataframe will be the basis for the benchmark dataset.

In [14]:
# Load dataframe
file_path = 'data/dataset_pools/processed/hard_test_pool_dataframe.pkl'

with open(file_path, 'rb') as fl:
    df_htp = pickle.load(fl)

df_htp.head()

| | smiles | mol | complexity | nbr_heavy_atoms | iupac_name | complexity_bin |
|---|---|---|---|---|---|---|
| 0 | `C=C(C)Cc1cc(C(=O)c2ccccc2)ccc1O` | `<rdkit.Chem.rdchem.Mol object at 0x7fc4f2370040>` | 612.414121 | 19 | [[4-hydroxy-3-(2-methylprop-2-enyl)phenyl]-phe... | 600-700 |
| 1 | `CCCCNC(=O)Cc1sc(NC(C)=O)nc1CC` | `<rdkit.Chem.rdchem.Mol object at 0x7fc4f4c682c0>` | 443.096771 | 19 | [2-(2-acetamido-4-ethyl-1,3-thiazol-5-yl)-N-bu... | 400-500 |
| 2 | `CC1(C)CC[C@]2(C)CC[C@]3(C)[C@H]4CC[C@H]5C(C)(C...` | `<rdkit.Chem.rdchem.Mol object at 0x7fc4f4c68310>` | 796.233392 | 31 | [(1R,2R,5S,6R,11R,14R,15R,18S,20S)-2,5,8,8,11,... | 700-800 |
| 3 | `CC(=O)Nc1ccc(C23CC2C(=O)NC3=O)cc1` | `<rdkit.Chem.rdchem.Mol object at 0x7fc4f4c68360>` | 564.104018 | 18 | [N-[4-(2,4-dioxo-3-azabicyclo[3.1.0]hexan-1-yl... | 500-600 |
| 4 | `Nc1cc(Br)c2ccccc2n1` | `<rdkit.Chem.rdchem.Mol object at 0x7fc4f4c68400>` | 425.798091 | 12 | [4-bromoquinolin-2-amine] | 400-500 |
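
Judging from the rows shown, `complexity_bin` looks like a 100-wide bucket of the `complexity` value. A minimal sketch of such a binning is given below. The function name and the `width` parameter are hypothetical, and the pipeline's actual handling of bin edges may differ.

```python
import math


def complexity_bin(value, width=100):
    """Return a 'lower-upper' label for the bucket containing `value`."""
    lower = math.floor(value / width) * width
    return f"{lower}-{lower + width}"


# Complexity values taken from the dataframe rows shown in this notebook
for value in (612.414121, 443.096771, 796.233392, 564.104018, 425.798091):
    print(value, complexity_bin(value))
```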


## B) Creating the benchmark dataset

The benchmark dataset is meant to cover a variety of chemical features (for details, see
the paper), three reasoning tasks (counting, indexing, generation), and three complexity
axes (molecular complexity, multi-task load, and input representation). To achieve this,
we first precompute a property dataframe with ground-truth values for the chemical
features and then apply a sampling strategy. Questions are then generated from the
sampled datapoints using question templates and an NLP formatter.
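
One simple way to obtain diversity along the molecular complexity axis is to draw the same number of molecules from every complexity bin. The sketch below illustrates that idea on plain dictionaries. It is an assumption about what a sampling strategy could look like, not the strategy used in the pipeline, and `sample_per_bin` and `per_bin` are hypothetical names.

```python
import random
from collections import defaultdict


def sample_per_bin(rows, per_bin, seed=0):
    """Draw up to `per_bin` rows from each complexity bin (stratified sampling)."""
    rng = random.Random(seed)
    by_bin = defaultdict(list)
    for row in rows:
        by_bin[row["complexity_bin"]].append(row)

    sampled = []
    for label in sorted(by_bin):  # fixed bin order keeps the result reproducible
        members = by_bin[label]
        sampled.extend(rng.sample(members, min(per_bin, len(members))))
    return sampled


# Rows mirror entries of the hard test pool dataframe shown earlier.
rows = [
    {"smiles": "C=C(C)Cc1cc(C(=O)c2ccccc2)ccc1O", "complexity_bin": "600-700"},
    {"smiles": "CCCCNC(=O)Cc1sc(NC(C)=O)nc1CC", "complexity_bin": "400-500"},
    {"smiles": "CC(=O)Nc1ccc(C23CC2C(=O)NC3=O)cc1", "complexity_bin": "500-600"},
    {"smiles": "Nc1cc(Br)c2ccccc2n1", "complexity_bin": "400-500"},
]

picked = sample_per_bin(rows, per_bin=1)
print([row["complexity_bin"] for row in picked])
```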

### Step 1: Create property df

In [15]:
# Load property dataframe
file_path = 'data/benchmark/properties.pkl'
with open(file_path, 'rb') as fl:
    properties_df = pickle.load(fl)
    
properties_df.head()


| | smiles | mol | complexity | nbr_heavy_atoms | iupac_name | complexity_bin | original_smiles | ring_count | ring_index | fused_ring_count | ... | template_based_reaction_prediction_epoxide_to_diol_products | template_based_reaction_prediction_epoxide_to_halohydrin_Br_success | template_based_reaction_prediction_epoxide_to_halohydrin_Br_products | template_based_reaction_prediction_epoxide_to_halohydrin_Cl_success | template_based_reaction_prediction_epoxide_to_halohydrin_Cl_products | template_based_reaction_prediction_ozonolysis_terminal_success | template_based_reaction_prediction_ozonolysis_terminal_products | murcko_scaffold_count | murcko_scaffold_index | murcko_scaffold_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `CN(C)[Si](Cl)Cl` | `<rdkit.Chem.rdchem.Mol object at 0x7fc4ef50efc0>` | 32.529325 | 6 | [] | 0-100 | `CN(C)[Si](Cl)Cl` | 0 | [] | 0 | ... | | 0 | | 0 | | 0 | | 0 | [] | |
| 1 | `OCCNc1[nH]c([nH]c(c1N)=O)=O` | `<rdkit.Chem.rdchem.Mol object at 0x7fc4ef50f060>` | 393.601217 | 13 | [5-amino-6-(2-hydroxyethylamino)-1H-pyrimidine... | 300-400 | `Nc1c(NCCO)[nH]c(=O)[nH]c1=O` | 1 | [4, 5, 6, 7, 8, 9] | 0 | ... | | 0 | | 0 | | 0 | | 8 | [4, 5, 6, 7, 8, 9, 11, 12] | `O=c1cc[nH]c(=O)[nH]1` |
| 2 | `CC(C)(CCCc1ccc(F)c(Oc2ccccc2)c1)c1ccc(OC(F)(F)...` | `<rdkit.Chem.rdchem.Mol object at 0x7fc4ef50f100>` | 983.898985 | 31 | [4-[4-[4-[chloro(difluoro)methoxy]phenyl]-4-me... | 900-1000 | `CC(C)(CCCc1ccc(F)c(Oc2ccccc2)c1)c1ccc(OC(F)(F)...` | 3 | [6, 7, 8, 9, 11, 13, 14, 15, 16, 17, 18, 19, 2... | 0 | ... | | 0 | | 0 | | 0 | | 23 | [1, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 1... | `c1ccc(CCCCc2cccc(Oc3ccccc3)c2)cc1` |
| 3 | `C1N(c2ccc(N3C(=O)N(CC)CC3=O)cc2)CCN(c2ccc(OC)c...` | `<rdkit.Chem.rdchem.Mol object at 0x7fc4ef50f150>` | 873.935319 | 29 | [1-ethyl-3-[4-[4-(4-methoxyphenyl)piperazin-1-... | 800-900 | `CCN1CC(=O)N(c2ccc(N3CCN(c4ccc(OC)cc4)CC3)cc2)C1=O` | 4 | [0, 1, 2, 3, 4, 5, 6, 7, 9, 12, 13, 15, 16, 17... | 0 | ... | | 0 | | 0 | | 0 | | 25 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15,... | `O=C1CNC(=O)N1c1ccc(N2CCN(c3ccccc3)CC2)cc1` |
| 4 | `C1C(=O)C(C(=O)OCC)SCCC1` | `<rdkit.Chem.rdchem.Mol object at 0x7fc4ef50f1a0>` | 203.036495 | 13 | [ethyl 3-oxothiepane-2-carboxylate] | 200-300 | `CCOC(=O)C1SCCCCC1=O` | 1 | [0, 1, 3, 9, 10, 11, 12] | 0 | ... | | 0 | | 0 | | 0 | | 8 | [0, 1, 2, 3, 9, 10, 11, 12] | `O=C1CCCCSC1` |


### Step 2: Generate benchmark dataset

In [1]:
# todo