# PEPCONF OptimizationDataset Preparation

From John:

> This dataset contains several categories of interesting short peptides:
> - dipeptide
> - tripeptide
> - disulfide-linked peptides
> - short bioactive peptides
> - cyclic peptides
>
> We should use our standard conformer enumeration schemes to generate conformers for QM.

The original starting molecules were XYZ files.
John has already turned these into canonical isomeric SMILES using `extract-pepconf-smiles.py`.
We will start from these SMILES in our submission generation.

In [1]:
import os

import numpy as np

from qcsubmit.factories import OptimizationDataset, OptimizationDatasetFactory
from openforcefield.topology import Molecule
from qcsubmit import workflow_components



## Preparation steps

In [2]:
# Read `smiles,name` rows from the CSV and index the SMILES by molecule name
moldata = dict()
with open('pepconf.csv', 'r') as f:
    for line in f:
        smiles, name = line.strip().split(',')
        moldata[name] = smiles
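The loop above assumes every line has exactly two comma-separated fields. As a minimal sketch of a slightly more defensive reader, the helper below uses the standard `csv` module, skips blank or malformed rows, and raises on repeated names. The function name `read_smiles_table` and the two sample rows are hypothetical and are not part of the original submission; raising on duplicates is one possible design choice, since a plain dict assignment would silently overwrite the earlier entry.

```python
import csv
import io


def read_smiles_table(handle):
    """Return a {name: smiles} dict from rows of the form `smiles,name`."""
    table = {}
    for row in csv.reader(handle):
        if len(row) != 2:
            # Skip blank or malformed rows instead of failing on unpacking.
            continue
        smiles, name = (field.strip() for field in row)
        if name in table:
            raise ValueError(f"duplicate molecule name: {name}")
        table[name] = smiles
    return table


# Hypothetical two-row sample standing in for pepconf.csv,
# followed by a blank line that the reader should ignore.
sample = io.StringIO("C[C@H](N)C(=O)O,ala\nNCC(=O)O,gly\n\n")
moldata_example = read_smiles_table(sample)
```

With a real file, the same helper could be called as `read_smiles_table(open('pepconf.csv', newline=''))`.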

We won't use the names from `pepconf.csv`, since we'll be generating stereoisomers and conformers for each molecule.

In [3]:
# Allow undefined stereochemistry on input; stereoisomers are enumerated by the workflow below
mols = [Molecule.from_smiles(smiles, allow_undefined_stereo=True) for smiles in moldata.values()]

In [4]:
len(mols)

741

In [6]:
# Generate the workflow to apply to the molecules
qcs_ds = OptimizationDatasetFactory()

component = workflow_components.EnumerateStereoisomers()
component.max_isomers = 100
component.toolkit = "rdkit"
qcs_ds.add_workflow_component(component)

component = workflow_components.StandardConformerGenerator()
component.max_conformers = 100
component.toolkit = "rdkit"
component.rms_cutoff = 3.0
qcs_ds.add_workflow_component(component)
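To illustrate what an RMS cutoff combined with a conformer cap does in principle, here is a toy sketch of greedy RMSD pruning in pure Python. It is not the RDKit or QCSubmit implementation: it skips structural alignment and symmetry handling, assumes coordinates in angstroms, and the names `rmsd`, `prune`, and `toy_conformers` are hypothetical.

```python
import math


def rmsd(a, b):
    """Unaligned RMSD between two conformers given as lists of (x, y, z)."""
    squared = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(squared / len(a))


def prune(conformers, cutoff, max_conformers):
    """Keep a conformer only if it is at least `cutoff` from all kept ones."""
    kept = []
    for conf in conformers:
        if len(kept) >= max_conformers:
            break
        if all(rmsd(conf, other) >= cutoff for other in kept):
            kept.append(conf)
    return kept


# Three toy two-atom "conformers" (hypothetical coordinates).
toy_conformers = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0)],  # near-duplicate of the first
    [(0.0, 0.0, 0.0), (0.0, 5.0, 0.0)],  # clearly different geometry
]
kept = prune(toy_conformers, cutoff=3.0, max_conformers=100)
```

In this toy case the second conformer is discarded because it lies well within the cutoff of the first, while the third is retained. A larger cutoff therefore yields fewer, more dissimilar conformers per molecule.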

In [6]:
desc = "OptimizationDataset of short peptides in various contexts, including disulfide bridges."
name = "OpenFF PEPCONF OptimizationDataset v1.0"

dataset = qcs_ds.create_dataset(
    dataset_name=name,
    molecules=mols,
    description=desc,
    tagline=desc,
)
print("Workflow complete; dataset generated.")

Deduplication                 : 100%|████████| 741/741 [00:02<00:00, 321.14it/s]
EnumerateStereoisomers        : 100%|█████████| 741/741 [01:24<00:00,  8.78it/s]
StandardConformerGenerator    : 100%|███████| 736/736 [1:27:01<00:00,  7.09s/it]
Preparation                   : 100%|█████████| 736/736 [02:17<00:00,  5.35it/s]

Workflow complete; dataset generated.





In [12]:
dataset.metadata.long_description_url = "https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2020-10-26-PEPCONF-Optimization"
dataset.metadata.long_description = "OptimizationDataset of short peptides in various contexts, including disulfide bridges. The source PEPCONF dataset is documented in [Nature Scientific Data](https://www.nature.com/articles/sdata2018310), and available on [GitHub](https://github.com/aoterodelaroza/pepconf). This dataset extracts the molecules that were simulated, but uses the QCSubmit infrastructure to generate a new `OptimizationDataset`, so does not use the original conformers."
dataset.metadata.submitter = 'dotsdl'

confs = np.array([len(mol.conformers) for mol in dataset.molecules])
print("Number of unique molecules       ", dataset.n_molecules)
print("Number of filtered molecules     ", dataset.n_filtered)
print("Number of conformers             ", dataset.n_records)
print("Number of conformers min mean max", 
      confs.min(), "{:6.2f}".format(confs.mean()), confs.max())

dataset.export_dataset("dataset.json.bz2")

Number of unique molecules        736
Number of filtered molecules      5
Number of conformers              7560
Number of conformers min mean max 1  10.27 58
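As a quick consistency check on the summary above, the reported counts can be reproduced by hand. The numbers below are copied from the notebook output; the variable names are only illustrative.

```python
# Values copied from the notebook output above.
n_input = 741        # molecules read from the CSV
n_filtered = 5       # molecules reported as filtered
n_molecules = 736    # unique molecules in the dataset
n_conformers = 7560  # total conformers (records)

# The unique-molecule count should equal the inputs minus the filtered ones.
remaining = n_input - n_filtered

# The mean conformer count is total conformers over unique molecules,
# formatted the same way as in the summary cell.
mean_conformers = n_conformers / n_molecules
formatted_mean = "{:6.2f}".format(mean_conformers)
print(remaining, formatted_mean)
```

Both values agree with the printed summary: 736 molecules remain and the mean is 10.27 conformers per molecule.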


In [13]:
dataset.metadata.elements

{'C', 'H', 'N', 'O', 'S'}

In [12]:
dataset.visualize("molecules.pdf", columns=3, toolkit="openeye")

In [13]:
dataset.provenance

{'qcsubmit': 'v0.1.0', 'openforcefield': '0.8.0', 'rdkit': '2020.09.1'}

In [15]:
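# Manual provenance fix: the OpenEye toolkit was used for visualization above
# but is missing from the recorded provenance, so add its version by hand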
import openeye
dataset.provenance["openeye"] = openeye.__version__
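A more general way to collect versions for a provenance record is to query installed package metadata instead of importing each package. The sketch below uses the standard-library `importlib.metadata` (Python 3.8+); the helper name `installed_versions` is hypothetical, and the distribution names passed to it are assumptions, since a distribution name can differ from the import name.

```python
from importlib import metadata


def installed_versions(distributions):
    """Return {distribution: version} for the distributions that are installed."""
    versions = {}
    for dist in distributions:
        try:
            versions[dist] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            # Leave out anything that is not installed in this environment.
            continue
    return versions


# Assumed distribution names; the second one is deliberately nonexistent.
found = installed_versions(["rdkit", "no-such-distribution-xyz"])
```

The resulting dict could then be merged into a provenance mapping, for example with `dataset.provenance.update(found)`, assuming the provenance attribute behaves like a plain dict as it does in the cell above.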

In [20]:
dataset.metadata.submitter = 'jchodera'

In [21]:
dataset.export_dataset("dataset.json.bz2")
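The `.json.bz2` extension suggests a bz2-compressed JSON document, which can be inspected with the standard library alone. The sketch below round-trips a small stand-in payload through a temporary file to show the pattern; the payload keys are hypothetical and are not the actual QCSubmit schema.

```python
import bz2
import json
import os
import tempfile

# Hypothetical stand-in payload; the real exported file has its own keys.
payload = {"dataset_name": "example", "n_entries": 2}

# Write the payload as bz2-compressed JSON text.
path = os.path.join(tempfile.mkdtemp(), "example.json.bz2")
with bz2.open(path, "wt", encoding="utf-8") as handle:
    json.dump(payload, handle)

# Read it back and list the top-level keys.
with bz2.open(path, "rt", encoding="utf-8") as handle:
    loaded = json.load(handle)

top_level_keys = sorted(loaded)
```

Pointing the second `bz2.open` call at `dataset.json.bz2` would list the top-level keys of the exported dataset, assuming the file is plain JSON under the compression.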

In [18]:
dataset.molecules_to_file('molecules.smi', 'smi')