# Retrieving Result Collections

This example shows how QCSubmit can be used to retrieve the results of quantum chemical (QC) calculations from a [QCFractal] instance such as [QCArchive].

In particular, it demonstrates how:

* raw torsion drive, optimised geometry and hessian result records can be retrieved from the public
  [QCArchive] server and stored in a result collection

* the retrieved result records can be filtered and curated using a set of built-in filters

* the result collection can be saved to and loaded from disk

[QCFractal]: http://docs.qcarchive.molssi.org/projects/qcfractal/en/latest/
[QCArchive]: https://qcarchive.molssi.org/

For the sake of clarity all verbose warnings will be disabled in this tutorial:

In [1]:
import warnings

warnings.filterwarnings('ignore')
import logging
logging.getLogger("openff.toolkit").setLevel(logging.ERROR)

## Retrieving result collections

QCSubmit provides a suite of utilities for retrieving and curating collections of QC results directly from a running QCFractal server, or an already computed QCPortal dataset. This functionality is provided through three main classes:

* `BasicResultCollection` - stores references to simple QCPortal result records that may contain energies, gradients, or hessians computed for a molecule in a single conformation.

* `OptimizationResultCollection` - stores references to full optimization result records (i.e. `OptimizationRecord`
  objects), as well as the final minimised conformer produced by the optimization.

* `TorsionDriveResultCollection` - stores references to full torsion drive result records (i.e. `TorsionDriveRecord`
  objects), as well as the minimum energy conformer associated with each torsion angle that was scanned.

Each of these collections can be generated directly from a running `QCFractal` server using the `from_server` class
method.

We begin by creating a QCPortal `PortalClient` instance that will allow us to communicate with the running
server. Here we point the client at the main QCArchive server:

In [2]:
from qcportal import PortalClient

qc_client = PortalClient("https://api.qcarchive.molssi.org:443")

Other servers can be accessed by providing the server's URI.

We can then use this to generate our result collections:

In [3]:
from openff.qcsubmit.results import (
    BasicResultCollection,
    OptimizationResultCollection,
    TorsionDriveResultCollection,
)

# Pull down the energy result records from the 'OpenFF BCC Refit Study COH v1.0' dataset.
energy_result_collection = BasicResultCollection.from_server(
    client=qc_client,
    datasets="OpenFF BCC Refit Study COH v1.0",
    spec_name="spec_2"  # This used to be "resp-2-vacuum", but the spec name was changed in the QCArchive 0.50 migration
)
print(energy_result_collection)

# Pull down the optimization records from both the 'OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy' and
# 'OpenFF Gen 2 Opt Set 4 eMolecules Discrepancy' datasets.
optimization_result_collection = OptimizationResultCollection.from_server(
    client=qc_client,
    datasets=[
        "OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy",
        "OpenFF Gen 2 Opt Set 4 eMolecules Discrepancy",
    ],
    spec_name="default",
)
print(optimization_result_collection)

# Pull down the torsion drive records from the 'OpenFF Rowley Biaryl v1.0' dataset.
torsion_drive_result_collection = TorsionDriveResultCollection.from_server(
    client=qc_client,
    datasets="OpenFF Rowley Biaryl v1.0",
    spec_name="default",
)
print(torsion_drive_result_collection)

entries={'https://api.qcarchive.molssi.org:443/': [BasicResult(type='basic', record_id=32651888, cmiles='[H:9][C:2]([H:10])([H:11])[C:1](=[O:7])[O:8][C:6]([H:19])([H:20])[C:5]([H:17])([H:18])[C:4]([H:15])([H:16])[C:3]([H:12])([H:13])[H:14]', inchi_key='DKPFZGUDAPQIHT-UHFFFAOYNA-N'), BasicResult(type='basic', record_id=32651895, cmiles='[H:8][C:1](=[O:6])[O:7][C:5]([H:16])([H:17])[C:4]([H:14])([H:15])[C:3]([H:12])([H:13])[C:2]([H:9])([H:10])[H:11]', inchi_key='NMJJFJNHVMGPGM-UHFFFAOYNA-N'), BasicResult(type='basic', record_id=32651766, cmiles='[H:7][C:1]([H:8])([C:3]([H:11])([C:2]([H:9])([H:10])[O:5][H:13])[O:6][H:14])[O:4][H:12]', inchi_key='PEDCQBHIVMGVHV-UHFFFAOYNA-N'), BasicResult(type='basic', record_id=32651881, cmiles='[H:8][C:1](=[O:6])[O:7][C:5]([H:16])([H:17])[C:4]([H:14])([H:15])[C:3]([H:12])([H:13])[C:2]([H:9])([H:10])[H:11]', inchi_key='NMJJFJNHVMGPGM-UHFFFAOYNA-N'), BasicResult(type='basic', record_id=32651822, cmiles='[H:8][C:1](=[O:6])[O:7][C:5]([H:16])([H:17])[C:4]([H:1

entries={'https://api.qcarchive.molssi.org:443/': [OptimizationResult(type='optimization', record_id=6091376, cmiles='[H:13][c:1]1[c:2]([c:5]([c:4]([c:6]([c:3]1[H:15])[O:11][H:22])[C:7](=[O:10])[O:12][C:9]([H:19])([H:20])[H:21])[C:8]([H:16])([H:17])[H:18])[H:14]', inchi_key='GPNCYIZKJTXKRO-UHFFFAOYNA-N'), OptimizationResult(type='optimization', record_id=6091377, cmiles='[H:13][c:1]1[c:2]([c:5]([c:4]([c:6]([c:3]1[H:15])[O:11][H:22])[C:7](=[O:10])[O:12][C:9]([H:19])([H:20])[H:21])[C:8]([H:16])([H:17])[H:18])[H:14]', inchi_key='GPNCYIZKJTXKRO-UHFFFAOYNA-N'), OptimizationResult(type='optimization', record_id=6091434, cmiles='[H:11][C:1]1=[N:7][N:8]([C:3]([O:10]1)([C:5]([H:15])([H:16])[H:17])[C:6]([H:18])([H:19])[H:20])[C:2](=[O:9])[C:4]([H:12])([H:13])[H:14]', inchi_key='OJPHECJXUFDEEH-UHFFFAOYNA-N'), OptimizationResult(type='optimization', record_id=6091356, cmiles='[H:18][c:1]1[c:2]([n:15]([c:5]2[c:4]1[c:6]([n:14][c:3]([n:13]2)[H:20])[N:16]3[C:7]([C:9]([N:17]([C:10]([C:8]3([H:23])[H:24]

entries={'https://api.qcarchive.molssi.org:443/': [TorsionDriveResult(type='torsion', record_id=21272379, cmiles='[H:12][c:1]1[c:2]([c:4]([c:6]([c:5]([c:3]1[H:14])[H:16])[C:9]2=[C:7]([C:8](=[N:10][N:11]2[H:19])[H:18])[H:17])[H:15])[H:13]', inchi_key='OEDUIFSDODUDRK-WXRBYKJCNA-N'), TorsionDriveResult(type='torsion', record_id=21272436, cmiles='[H:13][c:1]1[c:2]([c:5]([c:10]([c:6]([c:3]1[H:15])[H:18])[c:11]2[c:7]([c:4]([c:8]([n:12][c:9]2[H:21])[H:20])[H:16])[H:19])[H:17])[H:14]', inchi_key='HJKGBRPNSJADMB-UHFFFAOYNA-N'), TorsionDriveResult(type='torsion', record_id=21272437, cmiles='[H:13][c:1]1[c:2]([c:5]([c:9]([c:6]([c:3]1[H:15])[H:18])[c:10]2[c:7]([c:4]([c:8]([n:11][n:12]2)[H:20])[H:16])[H:19])[H:17])[H:14]', inchi_key='XWSSUYOEOWLFEI-UHFFFAOYNA-N'), TorsionDriveResult(type='torsion', record_id=21272380, cmiles='[H:12][c:1]1[c:2]([c:4]([c:6]([c:5]([c:3]1[H:14])[H:16])[C:7]2=[N:8][N:9]=[N:10][N:11]2[H:17])[H:15])[H:13]', inchi_key='MARUHZGHZWCEQU-FZOZFQFYNA-N'), TorsionDriveResult(type

*Note: currently only complete results are pulled down by the `from_server` method*

There are two main inputs to the `from_server` method, in addition to the fractal client:

* the name(s) of the existing datasets whose results should be retrieved. This can be either the name of a single dataset or a list of dataset names.
* the name of the specification used to compute the records. Each specification corresponds to a particular basis, method, program and additional settings.
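Because the dataset argument accepts either a single name or a list of names, it can help to see how such an argument might be normalised before any query is made. The sketch below is illustrative only: `normalize_dataset_names` is a hypothetical helper and is not part of QCSubmit.

```python
from typing import List, Union


def normalize_dataset_names(datasets: Union[str, List[str]]) -> List[str]:
    """Return the dataset names as a list, wrapping a single name if needed.

    Hypothetical helper for illustration only; not a QCSubmit function.
    """
    if isinstance(datasets, str):
        return [datasets]
    return list(datasets)


single = normalize_dataset_names("OpenFF Rowley Biaryl v1.0")
several = normalize_dataset_names(
    [
        "OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy",
        "OpenFF Gen 2 Opt Set 4 eMolecules Discrepancy",
    ]
)

print(single)
print(len(several))
```

Treating both forms as a list means the rest of a retrieval loop only has to handle one case.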

Let's print out some basic information about each of these result collections:

In [4]:
print("===ENERGY RESULTS===")

print(f"N RESULTS:   {energy_result_collection.n_results}")
print(f"N MOLECULES: {energy_result_collection.n_molecules}")

print("===OPTIMIZATION RESULTS===")

print(f"N RESULTS:   {optimization_result_collection.n_results}")
print(f"N MOLECULES: {optimization_result_collection.n_molecules}")

print("===TORSION DRIVE RESULTS===")

print(f"N RESULTS:   {torsion_drive_result_collection.n_results}")
print(f"N MOLECULES: {torsion_drive_result_collection.n_molecules}")

===ENERGY RESULTS===
N RESULTS:   191
N MOLECULES: 91
===OPTIMIZATION RESULTS===
N RESULTS:   2398
N MOLECULES: 419
===TORSION DRIVE RESULTS===
N RESULTS:   87
N MOLECULES: 87


The collections can easily be saved to, and loaded from, disk:

In [5]:
# save the energy result collection to a JSON file
with open("energy-result-collection.json", "w") as file:
    file.write(energy_result_collection.json())

# re-load the serialized result collection
BasicResultCollection.parse_file("energy-result-collection.json")

BasicResultCollection(entries={'https://api.qcarchive.molssi.org:443/': [BasicResult(type='basic', record_id=32651888, cmiles='[H:9][C:2]([H:10])([H:11])[C:1](=[O:7])[O:8][C:6]([H:19])([H:20])[C:5]([H:17])([H:18])[C:4]([H:15])([H:16])[C:3]([H:12])([H:13])[H:14]', inchi_key='DKPFZGUDAPQIHT-UHFFFAOYNA-N'), BasicResult(type='basic', record_id=32651895, cmiles='[H:8][C:1](=[O:6])[O:7][C:5]([H:16])([H:17])[C:4]([H:14])([H:15])[C:3]([H:12])([H:13])[C:2]([H:9])([H:10])[H:11]', inchi_key='NMJJFJNHVMGPGM-UHFFFAOYNA-N'), BasicResult(type='basic', record_id=32651766, cmiles='[H:7][C:1]([H:8])([C:3]([H:11])([C:2]([H:9])([H:10])[O:5][H:13])[O:6][H:14])[O:4][H:12]', inchi_key='PEDCQBHIVMGVHV-UHFFFAOYNA-N'), BasicResult(type='basic', record_id=32651881, cmiles='[H:8][C:1](=[O:6])[O:7][C:5]([H:16])([H:17])[C:4]([H:14])([H:15])[C:3]([H:12])([H:13])[C:2]([H:9])([H:10])[H:11]', inchi_key='NMJJFJNHVMGPGM-UHFFFAOYNA-N'), BasicResult(type='basic', record_id=32651822, cmiles='[H:8][C:1](=[O:6])[O:7][C:5]([H:

Each of these collections will store the referenced results in their `entries` dictionary. This dictionary uses the
address of the QCFractal server as keys:

In [6]:
torsion_drive_result_collection.entries.keys()

dict_keys(['https://api.qcarchive.molssi.org:443/'])

This allows results generated by multiple different servers (e.g. a local fractal instance and the public QCArchive
server) to be stored in a single result collection object.
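The layout described here can be mimicked with plain Python containers. The following is a minimal sketch rather than QCSubmit's own data model: `ResultReference` is a stand-in class, the local server address, record IDs and InChI keys are made up, and counting unique molecules by InChI key is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ResultReference:
    """Stand-in for a result entry (illustrative only)."""

    record_id: int
    inchi_key: str


# Keys are server addresses; values are the references held for that server.
entries: Dict[str, List[ResultReference]] = defaultdict(list)

# Placeholder IDs and keys, not real QCArchive records.
entries["https://api.qcarchive.molssi.org:443/"].extend(
    [ResultReference(1, "KEY-A"), ResultReference(2, "KEY-A")]
)
# A hypothetical local QCFractal instance.
entries["http://localhost:7777/"].append(ResultReference(1, "KEY-B"))

n_results = sum(len(references) for references in entries.values())
# Assumption: two results belong to the same molecule if their keys match.
n_molecules = len(
    {ref.inchi_key for references in entries.values() for ref in references}
)

print(n_results, n_molecules)
```

Note that record ID `1` appears under both addresses without clashing, since IDs are only unique within a single server.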

The references to the actual data are then stored in corresponding lists:

In [7]:
torsion_drive_result_collection.entries[qc_client.address][:10]

[TorsionDriveResult(type='torsion', record_id=21272379, cmiles='[H:12][c:1]1[c:2]([c:4]([c:6]([c:5]([c:3]1[H:14])[H:16])[C:9]2=[C:7]([C:8](=[N:10][N:11]2[H:19])[H:18])[H:17])[H:15])[H:13]', inchi_key='OEDUIFSDODUDRK-WXRBYKJCNA-N'),
 TorsionDriveResult(type='torsion', record_id=21272436, cmiles='[H:13][c:1]1[c:2]([c:5]([c:10]([c:6]([c:3]1[H:15])[H:18])[c:11]2[c:7]([c:4]([c:8]([n:12][c:9]2[H:21])[H:20])[H:16])[H:19])[H:17])[H:14]', inchi_key='HJKGBRPNSJADMB-UHFFFAOYNA-N'),
 TorsionDriveResult(type='torsion', record_id=21272437, cmiles='[H:13][c:1]1[c:2]([c:5]([c:9]([c:6]([c:3]1[H:15])[H:18])[c:10]2[c:7]([c:4]([c:8]([n:11][n:12]2)[H:20])[H:16])[H:19])[H:17])[H:14]', inchi_key='XWSSUYOEOWLFEI-UHFFFAOYNA-N'),
 TorsionDriveResult(type='torsion', record_id=21272380, cmiles='[H:12][c:1]1[c:2]([c:4]([c:6]([c:5]([c:3]1[H:14])[H:16])[C:7]2=[N:8][N:9]=[N:10][N:11]2[H:17])[H:15])[H:13]', inchi_key='MARUHZGHZWCEQU-FZOZFQFYNA-N'),
 TorsionDriveResult(type='torsion', record_id=21272389, cmiles='[H:12]

After running the above command, notice that the entries stored in the collection are not the actual result
records generated and stored on the server, but rather references to them. In particular, the unique ID of each record is stored along with a SMILES representation of the molecule the result was generated for.

The main reason for doing this is that we often would like to be able to state which data we would like to use in
an application without having to create multiple copies of the data. Not only can this take up large amounts of disk space, it runs the risk of data becoming out of sync with the original if the format the records are stored in changes or the local copy of the data is accidentally mutated. Storing a reference to the original data and then retrieving it when needed is typically a cleaner and safer solution.
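To make the idea concrete, the sketch below round-trips a lightweight reference through JSON using only the standard library. `RecordReference` is a hypothetical stand-in, and the record ID and mapped SMILES (water) are made-up values; the point is only that a reference is a few bytes of text, while the record it points at stays on the server.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RecordReference:
    """Hypothetical reference: an ID plus a mapped SMILES, nothing more."""

    record_id: int
    cmiles: str


# Made-up values for illustration.
reference = RecordReference(record_id=123, cmiles="[H:2][O:1][H:3]")

# Serialise the reference, then rebuild it from the JSON text.
serialized = json.dumps(asdict(reference))
restored = RecordReference(**json.loads(serialized))

print(serialized)
print(restored == reference)
```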

## Retrieving the result records

The raw result record objects can be easily retrieved using the result collection objects:

In [8]:
torsion_drive_records = torsion_drive_result_collection.to_records()
torsion_drive_records[:5]

[(TorsiondriveRecord(id=21272379, record_type='torsiondrive', is_service=True, properties={}, extras={}, status=<RecordStatusEnum.complete: 'complete'>, manager_name=None, created_on=datetime.datetime(2020, 7, 21, 16, 42, 27, 710811), modified_on=datetime.datetime(2020, 7, 21, 16, 42, 27, 710809), owner_user=None, owner_group=None, compute_history_=None, task_=None, service_=None, comments_=None, native_files_=None, specification=TorsiondriveSpecification(program='torsiondrive', optimization_specification=OptimizationSpecification(program='geometric', qc_specification=QCSpecification(program='psi4', driver=<SinglepointDriver.deferred: 'deferred'>, method='b3lyp-d3bj', basis='dzvp', keywords={'maxiter': 200, 'scf_properties': ['dipole', 'quadrupole', 'wiberg_lowdin_indices', 'mayer_indices']}, protocols=AtomicResultProtocols(wavefunction=<WavefunctionProtocolEnum.none: 'none'>, stdout=True, error_correction=ErrorCorrectionProtocol(default_policy=True, policies=None), native_files=<Nativ

QCSubmit takes care of pulling the data from the server efficiently, taking advantage of the pagination that
QCFractal provides. Further, it attempts to cache all calls to the server so that
multiple calls to `to_records` do not need to repeatedly query the server.
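Pagination simply means requesting records in fixed-size batches rather than all at once. A minimal sketch of the batching step is given below; `batched_ids` is a hypothetical helper, and the page size is an arbitrary choice, since the limit a real server imposes is not stated in this tutorial.

```python
from typing import Iterator, List, Sequence


def batched_ids(record_ids: Sequence[int], page_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `page_size` record IDs."""
    if page_size < 1:
        raise ValueError("page_size must be a positive integer")
    for start in range(0, len(record_ids), page_size):
        yield list(record_ids[start : start + page_size])


# Ten placeholder IDs split into pages of at most four.
pages = list(batched_ids(list(range(10)), page_size=4))

for page in pages:
    print(page)
```

Each page would then be sent as a separate request, which keeps individual responses small.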

Notice that not only are the raw result records retrieved, but also an OpenFF `Molecule` object is created for each result record. This molecule has the correct ordering and also stores any conformers associated with the
result collection. For basic collections, the conformer is the one that was used in any calculations; for optimization collections, it is the final conformer yielded by the optimization; and for torsion drives, it is the lowest energy conformer for each sampled torsion angle.

In the case of torsion drive records, we can easily iterate over the grid ID, the associated conformer, and the
associated energy in one go:

In [9]:
torsion_drive_record, molecule = torsion_drive_records[0]
for grid_id, qc_conformer in zip(
    molecule.properties["grid_ids"], molecule.conformers
):
    qc_energy = torsion_drive_record.final_energies[grid_id]

    print(f"{grid_id} E={qc_energy:.4f} Ha")

(-165,) E=-457.3350 Ha
(-150,) E=-457.3353 Ha
(-135,) E=-457.3346 Ha
(-120,) E=-457.3333 Ha
(-105,) E=-457.3318 Ha
(-90,) E=-457.3312 Ha
(-75,) E=-457.3318 Ha
(-60,) E=-457.3332 Ha
(-45,) E=-457.3346 Ha
(-30,) E=-457.3353 Ha
(-15,) E=-457.3350 Ha
(0,) E=-457.3346 Ha
(15,) E=-457.3350 Ha
(30,) E=-457.3353 Ha
(45,) E=-457.3346 Ha
(60,) E=-457.3332 Ha
(75,) E=-457.3318 Ha
(90,) E=-457.3312 Ha
(105,) E=-457.3318 Ha
(120,) E=-457.3333 Ha
(135,) E=-457.3346 Ha
(150,) E=-457.3353 Ha
(165,) E=-457.3350 Ha
(180,) E=-457.3346 Ha
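
Absolute energies in hartree are hard to compare by eye. As a minimal sketch, the snippet below converts a handful of the energies printed above into energies relative to the lowest one, in kcal/mol. The values are copied from the rounded output, so the differences are only approximate, and `sample_energies` is a name introduced here for illustration.

```python
# Standard conversion factor from hartree to kcal/mol.
HARTREE_TO_KCAL_PER_MOL = 627.509474

# A few (grid ID, energy in Ha) pairs copied from the printed output.
sample_energies = {
    (-150,): -457.3353,
    (-90,): -457.3312,
    (0,): -457.3346,
    (90,): -457.3312,
}

minimum = min(sample_energies.values())

# Energy of each grid point relative to the lowest-energy point.
relative = {
    grid_id: (energy - minimum) * HARTREE_TO_KCAL_PER_MOL
    for grid_id, energy in sample_energies.items()
}

for grid_id, delta in sorted(relative.items()):
    print(f"{grid_id} dE={delta:.2f} kcal/mol")
```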


We can also directly visualize the torsion drive with the built-in OpenFF Toolkit utilities by calling
`molecule.visualize("nglview")`.

### Basic results from optimization results

It is common for certain datasets within a QCFractal server to be created using the output of another dataset. This is especially the case for datasets of hessian records that are computed using the conformer produced by an optimization.

The `OptimizationResultCollection` currently provides a `to_basic_result_collection` method to handle such cases. This can take some time to run:

In [10]:
derived_hessian_collection = optimization_result_collection.to_basic_result_collection(
    driver="hessian"
)
derived_hessian_collection.n_results

2378

This is a particularly useful way to access hessian data contained within older datasets. Older datasets do not usually
store SMILES information for their result records, and hence it can be difficult to know exactly which molecule the
hessian was computed for. The `to_basic_result_collection` method takes care of this by propagating SMILES information
from the parent optimization record down to the child hessian result record.
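
The propagation step can be pictured as a simple lookup. The sketch below is not QCSubmit's implementation: it assumes, purely for illustration, that a child hessian record can be matched to its parent through the ID of the parent's final geometry, and every class, field and value in it is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class OptimizationReference:
    record_id: int
    cmiles: str
    final_molecule_id: int  # assumed link between parent and child


@dataclass(frozen=True)
class HessianReference:
    record_id: int
    cmiles: str


def derive_hessian_references(
    parents: List[OptimizationReference],
    hessian_record_by_molecule: Dict[int, int],
) -> List[HessianReference]:
    """Copy each parent's SMILES onto the hessian computed on its final geometry."""
    derived = []
    for parent in parents:
        child_id = hessian_record_by_molecule.get(parent.final_molecule_id)
        if child_id is None:
            continue  # no hessian is available for this geometry
        derived.append(HessianReference(record_id=child_id, cmiles=parent.cmiles))
    return derived


# Made-up parents and a made-up geometry-to-hessian mapping.
parents = [
    OptimizationReference(1, "[H:2][O:1][H:3]", final_molecule_id=10),
    OptimizationReference(2, "[H:2][O:1][H:3]", final_molecule_id=11),
]
children = derive_hessian_references(parents, {10: 100})
print(children)
```

Parents without a matching child are skipped, so a derived collection may hold fewer results than the collection it was built from.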

In addition to retrieving already computed datasets, the optimization result collection provides a utility for
generating a new QC dataset based on the optimized conformers:

In [12]:
from qcportal.singlepoint import SinglepointDriver

hessian_dataset = optimization_result_collection.create_basic_dataset(
    dataset_name="My Dataset",
    description="A dataset created from an optimization result collection.",
    tagline="Contains hessian data.",
    driver=SinglepointDriver.hessian,
)

The resulting dataset can then be submitted to a running QCFractal server.

## Filtering result collections

A powerful feature of the result collections is the ability to easily filter the entries they contain using a diverse
range of filters, for example to remove specific molecules based on SMILES patterns, or to remove records where the
connectivity of the molecule changed during the optimization.

The built-in filters are stored in the `openff.qcsubmit.results.filters` module:

In [13]:
from openff.qcsubmit.results import filters

Let's apply some basic filters to our optimization collection:

In [None]:
from qcportal.record_models import RecordStatusEnum

filtered_collection = optimization_result_collection.filter(
    filters.RecordStatusFilter(status=RecordStatusEnum.complete),
    filters.ConnectivityFilter(tolerance=1.2),
    filters.ElementFilter(
        # The elements supported by OpenFF 1.3.0
        allowed_elements=["H", "C", "N", "O", "S", "P", "F", "Cl", "Br", "I"]
    ),
    filters.ConformerRMSDFilter(max_conformers=10),
)

print("===========")
print(f"N RECORDS INITIAL: {optimization_result_collection.n_results}")
print(f"N RECORDS FINAL:   {filtered_collection.n_results}")

print(f"N MOLECULES INITIAL: {optimization_result_collection.n_molecules}")
print(f"N MOLECULES FINAL:   {filtered_collection.n_molecules}")
print("===========")

Here we have removed:

* any incomplete records using the `RecordStatusFilter`

* records in which the connectivity changed during the computation, e.g. because a proton transfer occurred,
  using the `ConnectivityFilter`

* records that were computed for molecules containing elements that are not supported by the current
  OpenFF force fields, using the `ElementFilter`

and finally, a `ConformerRMSDFilter` was applied. When a collection contains multiple optimized conformers for the
same molecule, the `ConformerRMSDFilter` will only retain up to a maximum number of conformers for that molecule that
are distinct to within a specified RMSD tolerance.
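
One common way to pick distinct conformers is a greedy pass that keeps a conformer only if it differs from everything already kept. The sketch below illustrates that idea under strong simplifications: it compares raw coordinates without aligning the structures or accounting for symmetry, and it is not a description of how `ConformerRMSDFilter` works internally. The function names, parameters and coordinates are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Coordinates = Sequence[Tuple[float, float, float]]


def rmsd(a: Coordinates, b: Coordinates) -> float:
    """Root-mean-square deviation between two equally sized coordinate sets."""
    if len(a) != len(b):
        raise ValueError("conformers must have the same number of atoms")
    squared = sum(
        (p - q) ** 2 for atom_a, atom_b in zip(a, b) for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(a))


def select_distinct(
    conformers: Sequence[Coordinates], max_conformers: int, tolerance: float
) -> List[Coordinates]:
    """Greedily keep conformers that differ from all kept ones by more than `tolerance`."""
    kept: List[Coordinates] = []
    for conformer in conformers:
        if len(kept) >= max_conformers:
            break
        if all(rmsd(conformer, other) > tolerance for other in kept):
            kept.append(conformer)
    return kept


# Three made-up two-atom conformers; the second nearly duplicates the first.
first = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
near_duplicate = [(0.0, 0.0, 0.0), (1.01, 0.0, 0.0)]
rotated = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

kept = select_distinct([first, near_duplicate, rotated], max_conformers=10, tolerance=0.5)
print(len(kept))
```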

We could have also made use of the `LowestEnergyFilter` to only retain the lowest energy conformer associated with each
unique molecule in the collection.
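
The selection such a filter performs amounts to a group-by followed by a minimum. A minimal sketch with made-up data is shown below; grouping by InChI key is an assumption made for the example and the tuples are not real records.

```python
from typing import Dict, Tuple

# (record ID, InChI key, energy in Ha); all values are made up.
results = [
    (1, "KEY-A", -10.0),
    (2, "KEY-A", -10.5),
    (3, "KEY-B", -20.0),
]

# Track the lowest-energy (record ID, energy) pair seen for each molecule.
lowest: Dict[str, Tuple[int, float]] = {}
for record_id, inchi_key, energy in results:
    if inchi_key not in lowest or energy < lowest[inchi_key][1]:
        lowest[inchi_key] = (record_id, energy)

retained_ids = sorted(record_id for record_id, _ in lowest.values())
print(retained_ids)
```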

The filtered result collection will record provenance information about which filters were applied:

In [None]:
filtered_collection.provenance
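
The structure of the provenance object is not shown in this tutorial. As a rough sketch of the general idea, the snippet below applies a sequence of filters in order and logs how many records each one removed. `apply_filters`, the record layout and the provenance format are all hypothetical and do not mirror QCSubmit's own classes.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
NamedFilter = Tuple[str, Callable[[Record], bool]]

ALLOWED_ELEMENTS = {"H", "C", "N", "O", "S", "P", "F", "Cl", "Br", "I"}


def apply_filters(
    records: List[Record], named_filters: List[NamedFilter]
) -> Tuple[List[Record], List[Dict[str, object]]]:
    """Apply each filter in order and log how many records it removed."""
    provenance: List[Dict[str, object]] = []
    for name, keep in named_filters:
        n_before = len(records)
        records = [record for record in records if keep(record)]
        provenance.append({"filter": name, "n_removed": n_before - len(records)})
    return records, provenance


# Made-up records: an ID, a status string and the elements in the molecule.
records = [
    {"id": 1, "status": "complete", "elements": {"C", "H", "O"}},
    {"id": 2, "status": "error", "elements": {"C", "H"}},
    {"id": 3, "status": "complete", "elements": {"C", "H", "Si"}},
]

retained, provenance = apply_filters(
    records,
    [
        ("status", lambda record: record["status"] == "complete"),
        ("elements", lambda record: record["elements"] <= ALLOWED_ELEMENTS),
    ],
)

print([record["id"] for record in retained])
print(provenance)
```

Because the filters run in sequence, the order in which they are listed affects how many records each one is reported to have removed.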

## Additional utilities

In addition to providing an interface for curating collections of QC results, the result collection objects also expose
a number of quality of life utilities for visualizing and analysing the stored results.

A pdf showing the molecules within a result collection can be easily generated:

In [None]:
energy_result_collection.visualize("energy-result-collection.pdf", columns=8)

## Cached queries (currently inoperable, as caching was removed in the 0.50 release)

While results should mostly be retrieved using the result collections, there are times when it is useful to directly
query a QCFractal server for specific records and QC molecules.

To this end the OpenFF QCSubmit framework offers *cached* versions of the `FractalClient.query_procedures` and
`FractalClient.query_molecules` functions. These and others are provided in the `openff.qcsubmit.results.caching`
module:

In [None]:
from openff.qcsubmit.results import caching

Currently, the most useful cached methods are the `cached_query_procedures` function:

In [None]:
caching.cached_query_procedures(
    qc_client.address, record_ids=["21272353", "21272354"]
)

and the `cached_query_molecules` function:

In [None]:
caching.cached_query_molecules(
    qc_client.address, molecule_ids=["21272"]
)

The internal cache used by the framework can be easily cleared if the memory usage becomes too large:

In [None]:
caching.clear_results_caches()
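
The general pattern behind cached queries, memoising a lookup and clearing the cache on demand, can be sketched with the standard library's `functools.lru_cache`. This is an illustration of the technique only, not the implementation of the `caching` module: `cached_lookup` is a hypothetical function that fabricates placeholder strings instead of contacting a server.

```python
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=None)
def cached_lookup(address: str, record_ids: Tuple[str, ...]) -> Tuple[str, ...]:
    """Pretend to query `address`; a real version would contact the server here."""
    return tuple(f"record-{record_id}" for record_id in record_ids)


address = "https://api.qcarchive.molssi.org:443/"
ids = ("21272353", "21272354")  # arguments must be hashable, hence a tuple

first = cached_lookup(address, ids)
second = cached_lookup(address, ids)  # answered from the cache

hits_before_clear = cached_lookup.cache_info().hits

# Drop everything that has been cached so far.
cached_lookup.cache_clear()
size_after_clear = cached_lookup.cache_info().currsize

print(hits_before_clear, size_after_clear)
```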