# Compound prep
First we extract the SMILES codes from the CSV supplied as supplementary information to [Ultra-large library docking for discovering new chemotypes](https://www.nature.com/articles/s41586-019-0917-9). 

The resulting file (`./data/ligands.smi`) can then be processed in two ways:
- Embedded in 3D as-is, using RDKit's ETKDG method
- Pre-treated by enumerating tautomers, charge states at pH 7.4, and enantiomers. For this, we use [Gypsum-DL](https://durrantlab.pitt.edu/gypsum-dl/) from the Durrant lab, with the command `python run_gypsum_dl.py --source ligands.smi --min_ph 7.4 --max_ph 7.4 --pka_precision 0.25 --output_folder ./output --add_html_output`. Gypsum-DL also embeds each structure in 3D (again using RDKit's ETKDG method). Warning: this increases the number of ligands to be docked from 548 to ~1680!
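
The Gypsum-DL invocation can also be assembled from Python, which keeps the pH window and output folder next to the rest of the notebook. The following is a minimal sketch that only builds the argument list. The helper name `build_gypsum_command` is hypothetical, the flags are the ones quoted in the command above, and it assumes `run_gypsum_dl.py` sits in the working directory.

```python
import shlex


def build_gypsum_command(source, output_folder, ph=7.4, pka_precision=0.25):
    """Assemble the Gypsum-DL command line as a list of arguments.

    Hypothetical helper: it uses only the flags quoted in the text and
    sets min_ph == max_ph so that a single pH is enumerated.
    """
    return [
        "python", "run_gypsum_dl.py",
        "--source", str(source),
        "--min_ph", str(ph),
        "--max_ph", str(ph),
        "--pka_precision", str(pka_precision),
        "--output_folder", str(output_folder),
        "--add_html_output",
    ]


cmd = build_gypsum_command("ligands.smi", "./output")
print(shlex.join(cmd))  # shell-ready string for logging or copy-paste
# To actually run it (assumes Gypsum-DL is installed and the script is in
# the current directory): subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems if the source or output path ever contains spaces.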

In [1]:
import pandas as pd
from rdkit import Chem
import matplotlib.pyplot as plt
from rdkit.Chem import AllChem
import tqdm

In [2]:
df = pd.read_csv('./data/41586_2019_917_MOESM4_ESM.csv').iloc[:-4] #remove last four rows
df.head()

Unnamed: 0,ZINC ID,Global Rank∗,Clustered Rank†,Energy,TC to knowns‡,Cosest neighbor among known DRD4 binders,Top-pick or not,Just from energy window,Energy window,Tested or not,Binder or not,D4 Ki(nM),D2 Ki(nM),D3 Ki(nM),cAMP EC50(nM),Inhibition (%) at 10uM,SMILES,Vendor ID,Charge from docked poses
0,ZINC000191583186,1,1,-75.5,0.3,ZINC000028347504,0,1,-75,1,1,1390.0,3860.0,1730.0,NT||,82.48,Cc1ccc(C[C@@H](CO)N[C@@H](C)CCc2ccccc2[N+](=O)...,Z1804039468,1.0
1,ZINC000159533726,2,2,-73.67,0.33,ZINC000103232405,0,1,-75,1,0,,,,NT,2.68,C[C@H](C(=O)Nc1cc([N+](=O)[O-])ccc1Cl)N(C)C[C@...,Z1514931360,1.0
2,ZINC000151228439,3,4,-73.47,0.34,ZINC000053274848,0,1,-75,1,0,,,,NT,17.5,C[C@@H](NC[C@](C)(O)c1ccccc1)c1cn(-c2ccccc2)nn1,Z1419817479,1.0
3,ZINC000291023493,5,5,-72.95,0.31,ZINC000028363497,0,1,-75,1,0,,,,NT,-10.33,C[C@H](Nc1cc(-n2cccn2)nc(N)n1)[C@H](c1ccccc1)N...,Z2179794811,2.0
4,ZINC000593577820,7,7,-72.5,0.35,ZINC000036216606,0,1,-75,1,0,,,,NT,11.48,COC(=O)C[C@H]1CSCCN1Cc1cn(-c2cccc(C)c2)nc1C,Z2480456501,1.0


In [3]:
# Save SMILES, ZINC IDs and inhibition values as a tab-separated file for Gypsum-DL (which runs Dimorphite-DL internally)
df[['SMILES', 'ZINC ID', 'Inhibition (%) at 10uM']].to_csv('./data/ligands.smi',index=False, sep='\t')
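
Because `to_csv` is called with its default `header=True`, `ligands.smi` starts with a column-name line (`SMILES`, `ZINC ID`, `Inhibition (%) at 10uM`) followed by one tab-separated record per ligand. A quick read-back check, sketched below with the standard library only, can confirm that layout before the file is handed to another tool. The function name `read_smi` and the small demo file are illustrative assumptions, not part of the original notebook.

```python
import csv
import os
import tempfile


def read_smi(path, has_header=True):
    """Read a tab-separated SMILES file into (smiles, name) tuples."""
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    if has_header:
        rows = rows[1:]  # drop the column-name line written by pandas
    return [(row[0], row[1]) for row in rows if row]


# Demo on a small stand-in file with the same layout as ./data/ligands.smi
demo_path = os.path.join(tempfile.mkdtemp(), "demo.smi")
with open(demo_path, "w", newline="") as handle:
    handle.write("SMILES\tZINC ID\tInhibition (%) at 10uM\n")
    handle.write("CCO\tDEMO_1\t1.0\n")
    handle.write("c1ccccc1\tDEMO_2\t2.0\n")

records = read_smi(demo_path)
print(len(records), records[0])
```

If a downstream reader treats every line as a molecule, the header line will show up as one extra, unparsable entry, so it is worth knowing whether it is present.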

In [4]:
mols = [Chem.MolFromSmiles(i) for i in df['SMILES']]

In [6]:
from tqdm.notebook import tqdm as notebook_tqdm  # replaces the deprecated tqdm.tqdm_notebook

with Chem.SDWriter('./data/ligands3d.sdf') as writer:
    for m, n in notebook_tqdm(zip(mols, df['ZINC ID']), total=len(mols)):
        mH = Chem.AddHs(m)  # add explicit hydrogens before generating 3D coordinates
        if AllChem.EmbedMolecule(mH) == -1:  # -1 means the ETKDG embedding failed
            print(f'Embedding failed for {n}, skipping')
            continue
        mH.SetProp('_Name', n)  # the ZINC ID becomes the molecule title in the SDF
        writer.write(mH)


  0%|          | 0/549 [00:00<?, ?it/s]
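
`AllChem.EmbedMolecule` returns the ID of the new conformer, or -1 if embedding fails, and without a fixed seed repeated runs can give different coordinates. The sketch below shows one way to make a single embedding reproducible and to detect failures. Ethanol stands in for a library ligand, and the seed value 42 and the helper name `embed_3d` are arbitrary choices for illustration rather than settings taken from the notebook.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def embed_3d(smiles, seed=42):
    """Return a hydrogen-added molecule with one ETKDG conformer, or None."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # SMILES could not be parsed
        return None
    mol_h = Chem.AddHs(mol)
    params = AllChem.ETKDG()
    params.randomSeed = seed  # fixed seed for reproducible coordinates
    conf_id = AllChem.EmbedMolecule(mol_h, params)
    return mol_h if conf_id != -1 else None


mol3d = embed_3d("CCO")  # ethanol as a stand-in ligand
print(mol3d.GetNumConformers(), mol3d.GetConformer().Is3D())
```

Returning `None` for both parse and embedding failures lets the calling loop skip or log problem ligands in one place.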

# ---- end -----