# Design Validation Plates

We want to design four plates for experimental validation: one plate for exp100 and three plates for exp101.

## exp100 / "cherry-picking plate"
One plate containing 320 compounds.
Design Process:
1. Select 15 representative Initiators, 15 representative Monomers and 10 representative Terminators
2. Filter out any known combinations
3. Let the model predict reaction outcome and remove all combinations where the predicted reaction outcome is negative
4. Randomly select 320 of the remaining compounds
5. Design source plate and transfer files for this plate
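
Steps 2 to 4 can be sketched on toy data as follows. This is only an illustration: the building block names, the `known` set and the `predict` rule are hypothetical stand-ins for the database lookup and the trained model used later in this notebook.

```python
import random
from itertools import product

# Toy stand-ins (hypothetical names, far fewer than the building blocks in the plan)
initiators = ["I1", "I2", "I3"]
monomers = ["M1", "M2", "M3"]
terminators = ["T1", "T2"]

# Outcome of step 1: all combinations of the selected building blocks
candidates = [" + ".join(c) for c in product(initiators, monomers, terminators)]

# Step 2: drop combinations that are already known (hypothetical example set)
known = {"I1 + M1 + T1", "I2 + M3 + T2"}
candidates = [c for c in candidates if c not in known]

# Step 3: keep positive predictions only; this dummy rule stands in for the real model
def predict(name):
    return 0 if name.startswith("I3 + M3") else 1

positives = [c for c in candidates if predict(c) == 1]

# Step 4: draw a reproducible random sample (5 here, 320 in the real design)
rng = random.Random(42)
sample = rng.sample(positives, k=5)
print(len(candidates), len(positives), len(sample))  # 16 14 5
```

Filtering before sampling means the random draw only ever sees combinations that are both new and predicted positive.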

## exp101 / "extrapolation plates"
Three plates of 320 compounds each, for a total of 960 compounds (12 I x 10 M x 8 T).
We have identified 12 previously unused Initiators.
Design process:
1. Select 10 representative Monomers and 8 representative Terminators.
    Here we can reuse the first 10 (resp. 8) of those sampled for exp100: sampling is done before prediction and is therefore unbiased.
2. We will use a full factorial design and only need to design the appropriate source plate and transfer files
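
A minimal sketch of the full factorial enumeration, assuming placeholder building block names (the real ones are chosen further down) and assuming the 960 reactions are split into plates as consecutive chunks of 320:

```python
from itertools import product

# Hypothetical placeholder names for illustration only
initiators = [f"I{i:02d}" for i in range(1, 13)]   # 12 initiators
monomers = [f"M{i:02d}" for i in range(1, 11)]     # 10 monomers
terminators = [f"T{i:02d}" for i in range(1, 9)]   # 8 terminators

# Full factorial design: every I x M x T combination exactly once
reactions = [" + ".join(combo) for combo in product(initiators, monomers, terminators)]

# Assumed split: consecutive chunks of 320 reactions per plate
PLATE_SIZE = 320
plates = [reactions[i:i + PLATE_SIZE] for i in range(0, len(reactions), PLATE_SIZE)]

print(len(reactions), [len(p) for p in plates])  # 960 [320, 320, 320]
```

Because the initiator is the outermost factor in `product`, each chunk of 320 holds all 80 M x T combinations for four initiators, so under this split every plate would need only four initiator source wells.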


In [None]:
import csv
import pathlib
import sys
from collections import Counter

import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit import SimDivFilters, DataStructs
from rdkit.Chem import Draw, rdMolDescriptors

sys.path.append(str(pathlib.Path().resolve().parents[1]))
from src.util.db_utils import SynFermDatabaseConnection
from src.util.rdkit_util import desalt_building_block

In [None]:
con = SynFermDatabaseConnection()

In [None]:
res = con.con.execute("SELECT * FROM building_blocks").fetchall()
header = [i[1] for i in con.con.execute("PRAGMA table_info(building_blocks)").fetchall()]
df = pd.DataFrame(res, columns=header)
initiators = df.loc[df["category"] == "I"]
monomers = df.loc[df["category"] == "M"]
terminators = df.loc[df["category"] == "T"]

## Idea 1: RDKit's MaxMinPicker

In [None]:
mols = [Chem.MolFromSmiles(smi) for smi in terminators["SMILES"]]

fps = [rdMolDescriptors.GetMorganFingerprintAsBitVect(m,2) for m in mols]

In [None]:
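# https://rdkit.blogspot.com/2014/08/optimizing-diversity-picking-in-rdkit.html
def dmat_sim(fps, ntopick):
    """Pick `ntopick` diverse fingerprints with MaxMinPicker from a precomputed distance matrix."""
    # build the condensed lower-triangle Tanimoto distance matrix
    ds = []
    for i in range(1, len(fps)):
        ds.extend(DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i], returnDistance=True))
    mmp = SimDivFilters.MaxMinPicker()
    ids = mmp.Pick(np.array(ds), len(fps), ntopick)
    return ids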

dmat_ids=dmat_sim(fps, 10)

Draw.MolsToGridImage([mols[x] for x in dmat_ids],molsPerRow=5, subImgSize=(300,300))

### Idea 1 conclusion
Because of how the MaxMinPicker works, it samples the "weirdest" compounds: those most dissimilar from each other and therefore from the bulk of our data set. This does not make sense when we want representative compounds.
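
The effect can be illustrated with a toy greedy MaxMin selection on 1D points. This is a simplified sketch written for this note, not RDKit's implementation; `maxmin_pick` and the example points are hypothetical.

```python
def maxmin_pick(points, n_pick, first=0):
    """Greedy MaxMin selection on 1D points (toy version)."""
    picked = [first]
    while len(picked) < n_pick:
        candidates = [i for i in range(len(points)) if i not in picked]
        # choose the candidate whose nearest already-picked point is farthest away
        best = max(candidates, key=lambda i: min(abs(points[i] - points[j]) for j in picked))
        picked.append(best)
    return picked

# A dense cluster around 5 plus two outliers at 0 and 10
points = [4.8, 4.9, 5.0, 5.1, 5.2, 0.0, 10.0]
picked = maxmin_pick(points, 3)
print([points[i] for i in picked])  # [4.8, 10.0, 0.0]
```

Although five of the seven points sit in the cluster, both outliers are selected as soon as more than one point is picked, which is the opposite of a representative sample.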

## Idea 2: Random picking


In [None]:
# still need to filter to only have the ones that were not excluded during data analysis
experiments = con.get_experiments_table_as_df()

In [None]:
valid_exp = experiments.loc[(~experiments["valid"].str.contains("ERROR", na=False)) & (experiments["exp_nr"].between(4,29))]
valid_ini = set(valid_exp["initiator_long"].to_numpy().tolist())
valid_mon = set(valid_exp["monomer_long"].to_numpy().tolist())
valid_ter = set(valid_exp["terminator_long"].to_numpy().tolist())

len(valid_ini), len(valid_mon), len(valid_ter)

In [None]:
# draw randomly
select_ini = initiators.loc[initiators["long"].isin(valid_ini)].sample(11, random_state=1)
select_mon = monomers.loc[monomers["long"].isin(valid_mon)].sample(15, random_state=2)
select_ter = terminators.loc[terminators["long"].isin(valid_ter)].sample(10, random_state=3)

selected = pd.concat((select_ini, select_mon, select_ter))

Draw.MolsToGridImage([Chem.MolFromSmiles(smi) for smi in selected["SMILES"]], legends=selected["long"].to_numpy().tolist(), molsPerRow=5)


In [None]:
# do not use 4-Pyrazole002, BiPh009, Mon082. These are the building blocks we had to remove in data curation.

In [None]:
# from these, we remove all combinations where we attempted synthesis
products = [f"{i} + {j} + {k}" for i in select_ini["long"] for j in select_mon["long"] for k in select_ter["long"]]
len(products)

In [None]:
# use a set for fast membership tests
attempted = set(experiments.loc[experiments["exp_nr"].between(4,29), "long_name"])
not_attempted = [p for p in products if p not in attempted]
len(not_attempted)

In [None]:
not_attempted

In [None]:
# for all the not attempted ones, write a file with the following columns:
# idx, vl_id, long_name, reaction_SMILES_atom_mapped
vl_ids, smiles = [], []
for long in not_attempted:
    res = con.con.execute("SELECT id, SMILES FROM virtuallibrary WHERE long_name = ? AND type = 'A';", (long,)).fetchall()
    assert len(res) == 1, f"expected exactly one type A product for {long}, found {len(res)}"
    vl_ids.append(res[0][0])
    smiles.append(res[0][1])

In [None]:
df = pd.DataFrame({"vl_id": vl_ids, "long_name": not_attempted, "product_A_smiles": smiles})
df

In [None]:
df.to_csv("../../data/curated_data/validation-plate_candidates.csv", index=False)

(At this point, predictions were run on a different machine in notebook `inference_validation-plate.ipynb`.)

In [None]:
# load predictions
preds = pd.read_csv("../../data/curated_data/validation-plate_candidates_predictions.csv")
preds

When preparing for the experiment, it turned out that our stock of Spiro003 is depleted. We remove this building block.

In [None]:
# how many possible experiments will be removed?
len(preds.loc[preds["long_name"].str.contains("Spiro003")])

In [None]:
# randomly sample 320 of the positive predictions to synthesize
sample = preds.loc[(preds["pred_A"] == 1) & (~preds["long_name"].str.contains("Spiro003"))].sample(320, random_state=42)
sample

In [None]:
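# split the product names into I / M / T columns and strip surrounding whitespace
# (column-wise str.strip replaces DataFrame.applymap, which is deprecated in recent pandas)
building_blocks = sample["long_name"].str.split("+", expand=True).apply(lambda col: col.str.strip())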
building_blocks[0].drop_duplicates().sort_values()

In [None]:
building_blocks[1].drop_duplicates().sort_values()

In [None]:
building_blocks[2].drop_duplicates().sort_values()

In [None]:
# design the source plate
from labware.plates import Plate384, Plate384Echo

In [None]:
source_plate = Plate384(max_vol=65000, dead_vol=15000)

In [None]:
# add initiators to source plate
for name, count in building_blocks[0].value_counts().items():
    source_plate.fill_well(source_plate.free(), name, count*990+15000)

# add monomers to source plate, starting at row F
for name, count in building_blocks[1].value_counts().items():
    source_plate.fill_well(source_plate.free(from_well="F1"), name, count*990+15000)

# add terminators to source plate, starting at row K
for name, count in building_blocks[2].value_counts().items():
    source_plate.fill_well(source_plate.free(from_well="K1"), name, count*1100+15000)
    
# add oxalic acid. For 320 reactions, we need two source wells, but we just fill up the entire bottom row for redundancy
for _ in range(24):
    source_plate.fill_well(source_plate.free(from_well="P1"), "X", 65000)
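
The volumes used above imply a maximum number of transfers per source well, which is what the checks on the transfer files further down rely on. A minimal sketch of that arithmetic, assuming all volumes are in the same unit as the numbers passed to `Plate384` (presumably nL) and that the dead volume can never be withdrawn:

```python
import math

MAX_VOL = 65_000   # well capacity, as passed to Plate384 above
DEAD_VOL = 15_000  # volume that cannot be withdrawn
usable = MAX_VOL - DEAD_VOL  # 50,000 available for transfers per well

def max_transfers(transfer_vol):
    """Number of transfers of `transfer_vol` that fit into one full source well."""
    return usable // transfer_vol

print(max_transfers(990))    # initiators and monomers: 50
print(max_transfers(1100))   # terminators: 45

# oxalic acid: 320 reactions x 220 each, spread over full wells
oxalic_wells = math.ceil(320 * 220 / usable)
print(oxalic_wells)          # 2
```

This is why a building block may appear in at most 50 (initiators, monomers) or 45 (terminators) reactions when it occupies a single source well, and why two oxalic acid wells suffice for 320 reactions.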

In [None]:
print(source_plate)

In [None]:
source_plate.to_csv("../../data/plates/exp100/source_plate_layout.csv", save_volumes=True)

In [None]:
# now make the target plate
target_plate = Plate384Echo()
# add placeholders
target_plate.fill_span("A1", "P2", "placeholder", target_plate.max_vol)
target_plate.fill_span("A23", "P24", "placeholder", target_plate.max_vol)
# add sampled reactions
for i, row in sample.iterrows():
    well = target_plate.free()
    compounds = row["long_name"].split(" + ")
    target_plate.fill_well(well, compounds[0], 990)
    target_plate.fill_well(well, compounds[1], 990)
    target_plate.fill_well(well, compounds[2], 1100)
    target_plate.fill_well(well, "X", 220)
# remove placeholders
target_plate.empty_span("A1", "P2")
target_plate.empty_span("A23", "P24")

In [None]:
print(target_plate)

In [None]:
target_plate.to_csv("../../data/plates/exp100/plate_layout_plate1.csv", save_volumes=True)

In [None]:
compound_location = {compound[0]: well for well, compound in source_plate.to_dict().items() if len(compound) == 1}

In [None]:
# to prepare the transfer files, just iterate the target plate
header = ['Source Barcode', 'Source Well', 'Destination Barcode', 'Destination Well', 'Volume']
# both transfer lists start with the header row
step1_transfers, step2_transfers = [header], [header]
source_barcode = 'Source1'
destination_barcode = 'Synthesis1'
for dest_well, compounds, volume in target_plate.iterate_wells():
    if volume == 0:
        continue  # skip empty wells
    step1_transfers.append([source_barcode, compound_location[compounds[0]], destination_barcode, dest_well, 990])
    step1_transfers.append([source_barcode, compound_location[compounds[1]], destination_barcode, dest_well, 990])
    step2_transfers.append([source_barcode, compound_location[compounds[2]], destination_barcode, dest_well, 1100])

    # add oxalic acid
    if int(dest_well[1:]) < 13:
        step1_transfers.append([source_barcode, "P1", destination_barcode, dest_well, 220])
    else:
        step1_transfers.append([source_barcode, "P2", destination_barcode, dest_well, 220])

In [None]:
# correct number of transfers?
assert len(step1_transfers) == 320 * 3 + 1
# source wells occur no more than 50 times (volume limit)?
used_wells = [l[1] for l in step1_transfers[1:]]
for k, v in Counter(used_wells).items():
    if k.startswith("A") or k.startswith("F"):
        assert v <= 50
    elif k.startswith("P"):
        assert v == 160
    else:
        raise ValueError(f"unexpected well {k}")
# all transfers are unique
assert len(step1_transfers) == len(set([tuple(line) for line in step1_transfers]))
# all destination wells are used exactly thrice
used_dest_wells = [l[2] + "_" + l[3] for l in step1_transfers[1:]]
for k, v in Counter(used_dest_wells).items():
    assert v == 3

In [None]:
# save to file
with open('validation_exp100_step1.csv', 'w', newline='') as file:  # newline='' as recommended for csv.writer
    writer = csv.writer(file)
    writer.writerows(step1_transfers)

In [None]:
# correct number of transfers?
assert len(step2_transfers) == 320 + 1
# source wells occur no more than 45 times (volume limit)?
used_wells = [l[1] for l in step2_transfers[1:]]
for k, v in Counter(used_wells).items():
    if k.startswith("K"):
        assert v <= 45  # 45 * 1100 + 15000 <= 65000
    else:
        raise ValueError(f"unexpected well {k}")
# all transfers are unique
assert len(step2_transfers) == len(set([tuple(line) for line in step2_transfers]))
# all destination wells are used exactly once
used_dest_wells = [l[2] + "_" + l[3] for l in step2_transfers[1:]]
for k, v in Counter(used_dest_wells).items():
    assert v == 1

In [None]:
# save to file
with open('validation_exp100_step2.csv', 'w', newline='') as file:  # newline='' as recommended for csv.writer
    writer = csv.writer(file)
    writer.writerows(step2_transfers)

## exp101
For exp101, we still need to choose monomers and terminators and to create the plate layout files.

In [None]:
# to choose M/T, we simply use the first 10/8 of the sets randomly selected earlier for exp100 (excluding Spiro003 - not available)
mon101 = select_mon.loc[select_mon["long"] != "Spiro003"].iloc[0:10].sort_values(by="long")
mon101.to_csv("../../data/plates/exp101/monomers.csv", index=False)
mon101

In [None]:
ter101 = select_ter.iloc[0:8].sort_values(by="long")
ter101.to_csv("../../data/plates/exp101/terminators.csv", index=False)
ter101

In [None]:
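from rdkit.Chem import Descriptors  # explicit import; Chem.Descriptors is not loaded by `from rdkit import Chem`


def get_ini_lcms_mass(smi):
    mol = desalt_building_block(smi)
    return Descriptors.ExactMolWt(mol)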

In [None]:
def get_ini_lcms_formula(smi):
    mol = desalt_building_block(smi)
    return Chem.rdMolDescriptors.CalcMolFormula(mol)

In [None]:
initiators = ["Ph037", "Ph038", "Ph039", "Ph004", "BiAl004", "Ph040", "Ph016", "Ph041", "Ph042", "Ph011", "BiAl005", "Ph043"]

ini101 = pd.DataFrame(initiators, columns=["long"])

ini101["SMILES"] = [Chem.MolToSmiles(Chem.MolFromSmiles(smi)) for smi in [
    "O=S(C1=CC=C(C([B-](F)(F)F)=O)C=C1)(C)=O.[K+]",
    "O=C([B-](F)(F)F)C1=CC2=C(C=CC=C2)C=C1.[K+]",
    "O=C([B-](F)(F)F)C1=CC(C#N)=CC=C1.[K+]",
    "C1C(OC)=CC(C(=O)[B-](F)(F)F)=CC=1.[K+]",
    "O=C([B-](F)(F)F)CCCCCCl.[K+]",
    "FC1=C(C(F)(F)F)C=C(C([B-](F)(F)F)=O)C=C1.[K+]",
    "O=C([B-](F)(F)F)C1=CC(NC=C2)=C2C=C1.[K+]",
    "O=C([B-](F)(F)F)C1=CC(C(C)(C)C)=CC=C1.[K+]",
    "O=C([B-](F)(F)F)C1=CC(F)=C(C)C=C1.[K+]",
    "O=C([B-](F)(F)F)C1=CC=C([N+]([O-])=O)C=C1.[K+]",
    "O=C([B-](F)(F)F)CCCCO.[K+]",
    # TODO: identical to the BiAl005 entry above, although Ph043 is classed as KAT_arom below; verify this SMILES
    "O=C([B-](F)(F)F)CCCCO.[K+]",
]]
ini101["category"] = "I"
ini101["boc"] = 0
ini101["cbz"] = 0
ini101["tbu"] = 0
ini101["tms"] = 0
ini101["lcms_mass_1"] = ini101["SMILES"].apply(get_ini_lcms_mass)
ini101["lcms_mass_alt"] = None
ini101["comment"] = None
ini101["lcms_formula_1"] = ini101["SMILES"].apply(get_ini_lcms_formula)
ini101["lcms_formula_alt"] = None
ini101["reactant_class"] = [
    "KAT_arom",
    "KAT_arom",
    "KAT_arom",
    "KAT_arom",
    "KAT_al",
    "KAT_arom",
    "KAT_hetarom",
    "KAT_arom",
    "KAT_arom",
    "KAT_arom",
    "KAT_al",
    "KAT_arom",
]
ini101.to_csv("../../data/plates/exp101/initiators.csv", index=False)
ini101