# Dataset Preparation - Disaccharide Optimization

David Cerutti has prepared two disaccharide molecules as starting points for optimizations with Psi4.
These molecules have been pre-optimized with B3LYP / def2-SVP using his own MDGX program.

The two molecules are currently in separate JSON files.
Our objectives are:
1. Pull these together into a single dataset submission with QCSubmit.
2. Identify any missing fields that we need to populate before submission.
3. Populate those missing fields.

This notebook works toward all three objectives.
The initial JSON files may not contain enough information to complete them on the first pass.
If so, David Cerutti may need to supply revised files before we can finish.
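As a rough illustration of the first objective, the sketch below combines the `dataset` mappings of several already-loaded JSON payloads into one dictionary. It uses plain dictionaries rather than QCSubmit. The helper name `merge_dataset_entries` and the second index `placeholder_index` are hypothetical. It assumes every payload keeps its molecules under a top-level `dataset` key.

```python
def merge_dataset_entries(payloads):
    """Combine the 'dataset' mappings of several payloads into one dict.

    Hypothetical helper: raises ValueError when two payloads share an
    entry index, so that no molecule is silently overwritten.
    """
    merged = {}
    for payload in payloads:
        for index, entry in payload.get("dataset", {}).items():
            if index in merged:
                raise ValueError(f"duplicate dataset index: {index!r}")
            merged[index] = entry
    return merged


# Toy payloads standing in for the two JSON files; only the first index
# comes from the real data, the second is a placeholder.
first = {"dataset": {"OME_2NA_0NA": {"index": "OME_2NA_0NA"}}}
second = {"dataset": {"placeholder_index": {"index": "placeholder_index"}}}

merged = merge_dataset_entries([first, second])
print(sorted(merged))
```

Raising on duplicate indices is a deliberate choice here: two files that use the same index would otherwise overwrite one another without warning.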

In [1]:
import sys
sys.path.append('../../management/')

from qcsubmit.serializers import deserialize
from validation import get_meta_info, validate_dataset, check_metadata, check_scf_props, check_basis_coverage





Some manual fixes were made to the JSON file; they are noted in the [PR](https://github.com/openforcefield/qca-dataset-submission/pull/124).

In [2]:
ds1 = deserialize('DAllpa1-2DAllpa1-OME.json')

In [3]:
ds1

{'dataset_name': 'mdgx-created data set',
 'dataset_tagline': 'A data set created by mdgx',
 'dataset_type': 'OptimizationDataSet',
 'method': 'd3bj b3lyp',
 'basis': 'def2-tzvpp',
 'program': 'psi4',
 'maxiter': 200,
 'driver': 'gradient',
 'scf_properties': ['dipole',
  'quadrupole',
  'wiberg_lowdin_indices',
  'mayer_indices'],
 'spec_name': 'default',
 'spec_description': 'OpenFF quantum chemistry input',
 'priority': 'normal',
 'description': 'mdgx conformational search outputs converted to JSON format',
 'dataset_tags': ['openff'],
 'compute_tag': 'openff',
 'dataset': {'OME_2NA_0NA': {'index': 'OME_2NA_0NA',
   'initial_molecules': [{'schema_name': 'qcschema_molecule',
     'schema_version': 2,
     'validated': True,
     'symbols': ['H',
      'C',
      'H',
      'H',
      'O',
      'C',
      'H',
      'C',
      'H',
      'C',
      'H',
      'C',
      'H',
      'C',
      'H',
      'C',
      'H',
      'H',
      'O',
      'H',
      'O',
      'O',
      'H',


## Validation

In [4]:
import traceback

We run this dataset through all of the validators used in CI:

In [5]:
for validator in (get_meta_info, validate_dataset, check_metadata, check_scf_props, check_basis_coverage):
    print('-------------')
    try:
        print(validator(ds1))
    except Exception:
        # Print the traceback and continue so that every validator gets to run.
        print(traceback.format_exc())

    print('--------------')

-------------
{'**Dataset Name**': 'mdgx-created data set', '**Dataset Type**': 'OptimizationDataSet', '**Method**': 'd3bj b3lyp', '**Basis**': 'def2-tzvpp'}
--------------
-------------
Traceback (most recent call last):
  File "<ipython-input-5-b4ddd9e57b9e>", line 4, in <module>
    print(validator(ds1))
  File "../../management/validation.py", line 64, in validate_dataset
    del data_copy["metadata"]
KeyError: 'metadata'

--------------
-------------
Traceback (most recent call last):
  File "<ipython-input-5-b4ddd9e57b9e>", line 4, in <module>
    print(validator(ds1))
  File "../../management/validation.py", line 97, in check_metadata
    dataset = create_dataset(data_copy)
  File "../../management/validation.py", line 44, in create_dataset
    raise RuntimeError(f"The dataset type {dataset_type} is not supported.")
RuntimeError: The dataset type OptimizationDataSet is not supported.

--------------
-------------
Traceback (most recent call last):
  File "<ipython-input-5-b4ddd9

The `validate_dataset` failure shows that the dataset needs a `metadata` entry.
QCSubmit should generate this when we build the submission with it.
We'll pursue that next.
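A quick way to spot this kind of gap before running the validators is to compare the top-level keys against a list of required ones. The sketch below is only illustrative. The helper `missing_top_level_keys` is hypothetical, and `metadata` is the only required key that the traceback above confirms. Any further required keys would have to be read from the validation code.

```python
def missing_top_level_keys(payload, required=("metadata",)):
    """Return the required top-level keys that are absent from payload.

    Hypothetical helper; the default 'required' tuple holds only the key
    that the traceback shows to be missing.
    """
    return [key for key in required if key not in payload]


# A trimmed-down stand-in for ds1 with two of its real top-level keys.
example = {
    "dataset_name": "mdgx-created data set",
    "dataset_type": "OptimizationDataSet",
}
print(missing_top_level_keys(example))
```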

## Passing through QCSubmit

QCSubmit performs many checks and fills in some fields automatically, so we'll pass our structure through it.

In [11]:
from qcsubmit.factories import OptimizationDatasetFactory
from openforcefield.topology import Molecule as OFFMolecule

In [7]:
factory = OptimizationDatasetFactory()
factory.scf_properties = ds1['scf_properties']
factory

OptimizationDatasetFactory(method='B3LYP-D3BJ', basis='DZVP', program='psi4', maxiter=200, driver=<DriverEnum.gradient: 'gradient'>, scf_properties=['dipole', 'quadrupole', 'wiberg_lowdin_indices', 'mayer_indices'], spec_name='default', spec_description='Standard OpenFF optimization quantum chemistry specification.', priority='normal', dataset_tags=['openff'], compute_tag='openff', workflow={}, optimization_program=GeometricProcedure(program='geometric', coordsys='tric', enforce=0.0, epsilon=1e-05, reset=False, qccnv=False, molcnv=False, check=0, trust=0.1, tmax=0.3, maxiter=300, convergence_set='GAU', constraints={}))

In [8]:
factory.export_settings('optimization_settings.yaml')

In [10]:
dataset = factory.create_dataset(dataset_name="OpenFF Disaccharide Optimization v1.0",
                                 molecules=[], 
                                 description="An optimization dataset of disaccharides",
                                 tagline="Optimizations of a set of disaccharides")

Now, we want to add our molecules:

In [15]:
for idx, (canonical_optimization_index, optimization_data) in enumerate(ds1['dataset'].items()):
    attributes = optimization_data['attributes']
    molecule = OFFMolecule.from_qcschema(optimization_data)
    
    dataset.add_molecule(index=idx, molecule=molecule, attributes=attributes)

KeyError: 'The record must contain the hydrogen mapped smiles to be safely made from the archive.'

The `KeyError` shows that each record must carry its hydrogen-mapped SMILES (CMILES), so we need a source of CMILES before we can go much further.
As I understand it, these can be generated from SDF files with the toolkit. [@jthorton has also asked for the original structure files](https://github.com/openforcefield/qca-dataset-submission/pull/124#issuecomment-669064179), which may also work.
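Until CMILES are available, it may help to list which entries lack them. The sketch below scans each entry's `attributes` for a mapped-SMILES string. The key name `canonical_isomeric_explicit_hydrogen_mapped_smiles` is an assumption about what `from_qcschema` looks for and should be checked against the installed toolkit version. The helper `entries_missing_cmiles` is hypothetical.

```python
# Assumed attribute name for the hydrogen-mapped SMILES; verify against
# the toolkit version in use.
CMILES_KEY = "canonical_isomeric_explicit_hydrogen_mapped_smiles"


def entries_missing_cmiles(dataset_entries, key=CMILES_KEY):
    """Return the indices of entries without a non-empty CMILES attribute.

    Hypothetical helper operating on a mapping like ds1['dataset'].
    """
    missing = []
    for index, entry in dataset_entries.items():
        # Treat a missing or null 'attributes' field as an empty mapping.
        attributes = entry.get("attributes") or {}
        if not attributes.get(key):
            missing.append(index)
    return missing


# Toy entries: the first has no attributes at all, the second has a
# placeholder string where a real mapped SMILES would go.
toy_entries = {
    "OME_2NA_0NA": {"index": "OME_2NA_0NA"},
    "placeholder_index": {"attributes": {CMILES_KEY: "placeholder-smiles"}},
}
print(entries_missing_cmiles(toy_entries))
```

Running this over `ds1['dataset']` would show how many records still need CMILES once the structure files arrive.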