# Retrieving Result Collections

In this example we will show how `openff-qcsubmit` can be used to retrieve the results of quantum chemical (QC)
calculations from a [QCFractal](http://docs.qcarchive.molssi.org/projects/qcfractal/en/latest/) instance.

In particular, we will demonstrate how:

* raw torsion drive, optimised geometry and hessian result records can be retrieved from the public
  [QCArchive](https://qcarchive.molssi.org/) server and stored in a result collection

* the retrieved result records can be filtered and curated using a set of built-in filters

* the result collection can be saved to and loaded from disk

For the sake of clarity all verbose warnings will be disabled in this tutorial:

In [1]:
import logging
import warnings

# Silence Python warnings and lower the verbosity of the OpenFF toolkit logger.
warnings.filterwarnings("ignore")
logging.getLogger("openff.toolkit").setLevel(logging.ERROR)

## Retrieving result collections

The OpenFF QCSubmit package provides a suite of utilities for retrieving and curating collections of QC results directly
from a running QCFractal server, or an already computed QCPortal dataset. This functionality is provided through three
main classes:

* `BasicResultCollection` - stores references to simple QCPortal result records that may contain energies, gradients, or
  hessians computed for a molecule in a single conformation.

* `OptimizationResultCollection` - stores references to full optimization result records (i.e. `OptimizationRecord`
  objects) as well as the final minimised conformer produced by the optimization.

* `TorsionDriveResultCollection` - stores references to full torsion drive result records (i.e. `TorsionDriveRecord`
  objects) as well as the minimum energy conformer associated with each torsion angle that was scanned.

Each of these collections can be generated directly from a running `QCFractal` server using the `from_server` class
method.

We begin by creating a QCPortal `FractalClient` instance that will allow us to communicate with the running
server:

In [2]:
from qcportal import FractalClient

qc_client = FractalClient()

Here we are connecting directly to the main QCArchive server. You are, however, free to connect to whichever server you
have access to (including ones running locally) by providing the server's address.

We can then use this to generate our result collections:

In [3]:
from openff.qcsubmit.results import (
    BasicResultCollection,
    OptimizationResultCollection,
    TorsionDriveResultCollection,
)

# Pull down the energy result records from the 'OpenFF BCC Refit Study COH v1.0' dataset.
energy_result_collection = BasicResultCollection.from_server(
    client=qc_client,
    datasets="OpenFF BCC Refit Study COH v1.0",
    spec_name="resp-2-vacuum"
)
print(energy_result_collection)

# Pull down the optimization records from both the 'OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy' and
# 'OpenFF Gen 2 Opt Set 4 eMolecules Discrepancy' datasets.
optimization_result_collection = OptimizationResultCollection.from_server(
    client=qc_client,
    datasets=[
        "OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy",
        "OpenFF Gen 2 Opt Set 4 eMolecules Discrepancy",
    ],
    spec_name="default",
)
print(optimization_result_collection)

# Pull down the torsion drive records from the 'OpenFF Rowley Biaryl v1.0' dataset.
torsion_drive_result_collection = TorsionDriveResultCollection.from_server(
    client=qc_client,
    datasets="OpenFF Rowley Biaryl v1.0",
    spec_name="default",
)
print(torsion_drive_result_collection)

entries={'https://api.qcarchive.molssi.org:443/': [BasicResult(type='basic', record_id='32651903', cmiles='[H:7][C:1]([H:8])([C:3]([H:11])([C:2]([H:9])([H:10])[O:5][H:13])[O:6][H:14])[O:4][H:12]', inchi_key='PEDCQBHIVMGVHV-UHFFFAOYSA-N'), BasicResult(type='basic', record_id='32652118', cmiles='[H:7][C:1]([H:8])([C:3]([H:11])([C:2]([H:9])([H:10])[O:5][H:13])[O:6][H:14])[O:4][H:12]', inchi_key='PEDCQBHIVMGVHV-UHFFFAOYSA-N'), BasicResult(type='basic', record_id='32651833', cmiles='[H:7][C:1]([H:8])([C:3]([H:11])([C:2]([H:9])([H:10])[O:5][H:13])[O:6][H:14])[O:4][H:12]', inchi_key='PEDCQBHIVMGVHV-UHFFFAOYSA-N'), BasicResult(type='basic', record_id='32652048', cmiles='[H:7][C:1]([H:8])([C:3]([H:11])([C:2]([H:9])([H:10])[O:5][H:13])[O:6][H:14])[O:4][H:12]', inchi_key='PEDCQBHIVMGVHV-UHFFFAOYSA-N'), BasicResult(type='basic', record_id='32651858', cmiles='[H:7][C:1]([H:8])([C:3]([H:11])([C:2]([H:9])([H:10])[O:5][H:13])[O:6][H:14])[O:4][H:12]', inchi_key='PEDCQBHIVMGVHV-UHFFFAOYSA-N'), BasicResu

*Note: currently only complete results are pulled down by the `from_server` method*

As can be seen there are two main inputs to the `from_server` method in addition to the fractal client:

* the name(s) of the existing datasets to retrieve the results of. This can either be the name of a single dataset or
  a list of dataset names
* the name of the specification used to compute the records. Each specification corresponds to a particular basis,
  method, program and additional settings.

Let's print out some basic information about each of these result collections:

In [4]:
print("===ENERGY RESULTS===")

print(f"N RESULTS:   {energy_result_collection.n_results}")
print(f"N MOLECULES: {energy_result_collection.n_molecules}")

print("===OPTIMIZATION RESULTS===")

print(f"N RESULTS:   {optimization_result_collection.n_results}")
print(f"N MOLECULES: {optimization_result_collection.n_molecules}")

print("===TORSION DRIVE RESULTS===")

print(f"N RESULTS:   {torsion_drive_result_collection.n_results}")
print(f"N MOLECULES: {torsion_drive_result_collection.n_molecules}")

===ENERGY RESULTS===
N RESULTS:   429
N MOLECULES: 94
===OPTIMIZATION RESULTS===
N RESULTS:   2398
N MOLECULES: 419
===TORSION DRIVE RESULTS===
N RESULTS:   87
N MOLECULES: 85


The collections can easily be saved to and loaded from disk:

In [5]:
# save the energy result collection to a JSON file
with open("energy-result-collection.json", "w") as file:
    file.write(energy_result_collection.json())

# re-load the serialized result collection
BasicResultCollection.parse_file("energy-result-collection.json")

BasicResultCollection(entries={'https://api.qcarchive.molssi.org:443/': [BasicResult(type='basic', record_id='32651903', cmiles='[H:7][C:1]([H:8])([C:3]([H:11])([C:2]([H:9])([H:10])[O:5][H:13])[O:6][H:14])[O:4][H:12]', inchi_key='PEDCQBHIVMGVHV-UHFFFAOYSA-N'), BasicResult(type='basic', record_id='32652118', cmiles='[H:7][C:1]([H:8])([C:3]([H:11])([C:2]([H:9])([H:10])[O:5][H:13])[O:6][H:14])[O:4][H:12]', inchi_key='PEDCQBHIVMGVHV-UHFFFAOYSA-N'), BasicResult(type='basic', record_id='32651833', cmiles='[H:7][C:1]([H:8])([C:3]([H:11])([C:2]([H:9])([H:10])[O:5][H:13])[O:6][H:14])[O:4][H:12]', inchi_key='PEDCQBHIVMGVHV-UHFFFAOYSA-N'), BasicResult(type='basic', record_id='32652048', cmiles='[H:7][C:1]([H:8])([C:3]([H:11])([C:2]([H:9])([H:10])[O:5][H:13])[O:6][H:14])[O:4][H:12]', inchi_key='PEDCQBHIVMGVHV-UHFFFAOYSA-N'), BasicResult(type='basic', record_id='32651858', cmiles='[H:7][C:1]([H:8])([C:3]([H:11])([C:2]([H:9])([H:10])[O:5][H:13])[O:6][H:14])[O:4][H:12]', inchi_key='PEDCQBHIVMGVHV-UHF

Each of these collections stores its referenced results in an `entries` dictionary. This dictionary uses the
address of the QCFractal server as its keys:

In [6]:
torsion_drive_result_collection.entries.keys()

dict_keys(['https://api.qcarchive.molssi.org:443/'])

This allows results generated by multiple different servers (e.g. a local fractal instance and the public QCArchive
server) to be stored in a single result collection object.

The references to the actual data are then stored in corresponding lists:

In [7]:
torsion_drive_result_collection.entries[qc_client.address][:10]

[TorsionDriveResult(type='torsion', record_id='21272352', cmiles='[H:13][c:1]1[c:2]([c:7]([n:12][c:10]([c:5]1[H:17])[c:9]2[c:4]([c:3]([c:6]([n:11][c:8]2[H:20])[H:18])[H:15])[H:16])[H:19])[H:14]', inchi_key='VEKIYFGCEAJDDT-UHFFFAOYSA-N'),
 TorsionDriveResult(type='torsion', record_id='21272353', cmiles='[H:12][c:1]1[c:2]([c:4]([c:6]([c:5]([c:3]1[H:14])[H:16])[N:11]2[C:9](=[C:7]([C:8](=[C:10]2[H:20])[H:18])[H:17])[H:19])[H:15])[H:13]', inchi_key='GEZGAZKEOUKLBR-UHFFFAOYSA-N'),
 TorsionDriveResult(type='torsion', record_id='21272354', cmiles='[H:12][c:1]1[c:2]([c:4]([c:6]([c:5]([c:3]1[H:14])[H:16])[N:11]2[C:8](=[C:7]([C:9](=[N:10]2)[H:19])[H:17])[H:18])[H:15])[H:13]', inchi_key='WITMXBRCQWOZPX-UHFFFAOYSA-N'),
 TorsionDriveResult(type='torsion', record_id='21272355', cmiles='[H:13][c:1]1[c:2]([c:4]([c:8]([c:5]([c:3]1[H:15])[H:17])[c:9]2[n:11][c:6]([n:10][c:7]([n:12]2)[H:19])[H:18])[H:16])[H:14]', inchi_key='RXELBMYKBFKHSM-UHFFFAOYSA-N'),
 TorsionDriveResult(type='torsion', record_id='21272

After running the above command you'll notice that the entries stored in the collection are not the actual result
records generated and stored on the server, but rather references to them. In particular, the unique id of each record
is stored along with a SMILES representation of the molecule the result was generated for.

The main reason for doing this is that we often want to state which data should be used in a certain application
(e.g. as part of a training set) without having to create multiple copies of that data. Copies can take up large
amounts of disk space, and they also run the risk of falling out of sync with the original (e.g. the format the records
are stored in changes and the local copy can no longer be loaded) or of being accidentally mutated. Storing only a
reference to the original data and retrieving it when needed is therefore often a much cleaner (and usually safer)
option.
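To make the reference-only idea concrete, the following is a minimal standard-library sketch, not the library's actual implementation. The `RecordReference` class and the `to_json` / `from_json` helpers are hypothetical names introduced for illustration, and the record ids and the local server address are placeholders. Only the field names (`record_id`, `cmiles`, `inchi_key`) and the use of server addresses as keys mirror the output shown above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RecordReference:
    """Hypothetical stand-in for an entry: a pointer to a record, not the record itself."""

    record_id: str
    cmiles: str
    inchi_key: str


def to_json(entries):
    """Serialize a {server address: [RecordReference, ...]} mapping to JSON."""
    return json.dumps(
        {address: [asdict(ref) for ref in refs] for address, refs in entries.items()}
    )


def from_json(text):
    """Rebuild the {server address: [RecordReference, ...]} mapping from JSON."""
    return {
        address: [RecordReference(**ref) for ref in refs]
        for address, refs in json.loads(text).items()
    }


# Placeholder ids and a made-up local address; these are not real QCArchive records.
entries = {
    "https://api.qcarchive.molssi.org:443/": [
        RecordReference(
            "1", "[H:3][C:1]([H:4])([H:5])[O:2][H:6]", "OKKJLVBELUTLKV-UHFFFAOYSA-N"
        ),
    ],
    "http://localhost:7777/": [
        RecordReference("2", "[H:2][O:1][H:3]", "XLYOFNOQVPJJNP-UHFFFAOYSA-N"),
    ],
}

# Only the lightweight references are written out and read back.
round_tripped = from_json(to_json(entries))
print(round_tripped == entries)
```

Because only ids and molecule identifiers are serialized, the file stays small and the full records are always fetched fresh from the server that owns them.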

## Retrieving the result records

The raw result record objects can be easily retrieved using the result collection objects:

In [8]:
torsion_drive_records = torsion_drive_result_collection.to_records()
torsion_drive_records[:5]

[(TorsionDriveRecord(id='21272352', status='COMPLETE'),
  Molecule with name '' and SMILES '[H]c1c(c(nc(c1[H])c2c(c(c(nc2[H])[H])[H])[H])[H])[H]'),
 (TorsionDriveRecord(id='21272353', status='COMPLETE'),
  Molecule with name '' and SMILES '[H]c1c(c(c(c(c1[H])[H])N2C(=C(C(=C2[H])[H])[H])[H])[H])[H]'),
 (TorsionDriveRecord(id='21272354', status='COMPLETE'),
  Molecule with name '' and SMILES '[H]c1c(c(c(c(c1[H])[H])N2C(=C(C(=N2)[H])[H])[H])[H])[H]'),
 (TorsionDriveRecord(id='21272355', status='COMPLETE'),
  Molecule with name '' and SMILES '[H]c1c(c(c(c(c1[H])[H])c2nc(nc(n2)[H])[H])[H])[H]'),
 (TorsionDriveRecord(id='21272356', status='COMPLETE'),
  Molecule with name '' and SMILES '[H]c1c(nc(nc1[H])C2=C(OC(=C2[H])[H])[H])[H]')]

OpenFF QCSubmit takes care of pulling the data from the server efficiently, making sure to take advantage of the
pagination that QCFractal provides. Further, it attempts to cache all calls to the server so that multiple calls to
`to_records` do not need to query the server repeatedly.
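The effect of such caching can be illustrated with a small standard-library sketch. This tutorial does not describe how the library implements its cache, so the example below simply uses `functools.lru_cache` as a stand-in; `fake_server_query` and `cached_query` are hypothetical names, and no network request is made.

```python
from functools import lru_cache

# Counts how many times the (fake) server is actually contacted.
server_calls = 0


def fake_server_query(address, record_id):
    """Hypothetical stand-in for a network request to a QCFractal server."""
    global server_calls
    server_calls += 1
    return {"id": record_id, "status": "COMPLETE", "server": address}


@lru_cache(maxsize=None)
def cached_query(address, record_id):
    """Return the record for (address, record_id), querying the server only once."""
    return fake_server_query(address, record_id)


address = "https://api.qcarchive.molssi.org:443/"

first = cached_query(address, "21272353")
second = cached_query(address, "21272353")  # served from the cache
print(server_calls)

cached_query.cache_clear()  # discard everything held in the cache
third = cached_query(address, "21272353")  # the fake server is queried again
print(server_calls)
```

Keying the cache on both the server address and the record id matters here, since the same id may refer to different records on different servers.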

As can be seen from the output of the above command, not only are the raw result records retrieved, but an OpenFF
molecule is also created for each result record. This molecule has the correct atom ordering and stores any conformers
associated with the record. For basic collections the conformer is the one that was used in the calculation, for
optimization collections it is the final conformer yielded by the optimization, and for torsion drives it is the lowest
energy conformer for each sampled torsion angle.

In the case of torsion drive records, we can easily iterate over the grid id, the associated conformer, and the
associated energy in one go:

In [9]:
torsion_drive_record, molecule = torsion_drive_records[0]

for grid_id, qc_conformer in zip(
    molecule.properties["grid_ids"], molecule.conformers
):

    qc_energy = torsion_drive_record.final_energy_dict[grid_id]

    print(f"{grid_id} E={qc_energy:.4f} Ha")

[-120] E=-495.4583 Ha
[60] E=-495.4589 Ha
[-135] E=-495.4603 Ha
[-105] E=-495.4563 Ha
[45] E=-495.4611 Ha
[75] E=-495.4566 Ha
[-150] E=-495.4614 Ha
[-90] E=-495.4555 Ha
[30] E=-495.4624 Ha
[90] E=-495.4555 Ha
[-165] E=-495.4613 Ha
[-75] E=-495.4566 Ha
[15] E=-495.4626 Ha
[105] E=-495.4563 Ha
[-60] E=-495.4589 Ha
[0] E=-495.4624 Ha
[120] E=-495.4583 Ha
[180] E=-495.4611 Ha
[-45] E=-495.4611 Ha
[-15] E=-495.4626 Ha
[135] E=-495.4603 Ha
[165] E=-495.4613 Ha
[-30] E=-495.4624 Ha
[150] E=-495.4614 Ha
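
Absolute energies in Hartree are hard to compare by eye, so torsion profiles are commonly reported relative to the lowest energy point. The sketch below does this for a handful of the values printed above, using the standard conversion factor of roughly 627.509 kcal/mol per Hartree. The printed energies are rounded to four decimal places, so the relative values are approximate; the variable names are chosen for illustration only.

```python
HARTREE_TO_KCAL_PER_MOL = 627.509474

# A few (grid angle in degrees, energy in Hartree) pairs copied from the output above.
energies_hartree = {
    -90: -495.4555,
    -15: -495.4626,
    0: -495.4624,
    15: -495.4626,
    90: -495.4555,
    180: -495.4611,
}

# Shift the profile so that the lowest energy point sits at zero.
minimum = min(energies_hartree.values())

relative_kcal = {
    angle: (energy - minimum) * HARTREE_TO_KCAL_PER_MOL
    for angle, energy in energies_hartree.items()
}

for angle in sorted(relative_kcal):
    print(f"{angle:>5d} deg  dE = {relative_kcal[angle]:.2f} kcal/mol")
```

With the full scan, the same dictionary could be built inside the loop above from `grid_id` and `qc_energy`.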


We can also visualize the torsion drive directly with the built-in OpenFF toolkit utilities by calling
`molecule.visualize("nglview")`.

### Basic results from optimization results

It is common for certain datasets within a QCFractal server to be created using the output of another dataset. This is
especially the case for datasets of hessian records that are computed using the conformer produced by an optimization.

The `OptimizationResultCollection` currently provides a `to_basic_result_collection` method to handle such cases (this
may take some time to run):

In [10]:
derived_hessian_collection = optimization_result_collection.to_basic_result_collection(
    driver="hessian"
)
derived_hessian_collection.n_results

2301

This is a particularly useful way to access hessian data contained within older datasets. Older datasets do not usually
store SMILES information for their result records, and hence it can be difficult to know exactly which molecule the
hessian was computed for. The `to_basic_result_collection` method takes care of this by propagating SMILES information
from the parent optimization record down to the child hessian result record.

In addition to retrieving already computed datasets, the optimization result collection provides a utility for
generating a new QC dataset based on the optimized conformers:

In [11]:
from qcportal.models.common_models import DriverEnum

hessian_dataset = optimization_result_collection.create_basic_dataset(
    dataset_name="My Dataset",
    description="A dataset created from an optimization result collection.",
    tagline="Contains hessian data.",
    driver=DriverEnum.hessian,
)

The created dataset can then easily be submitted to a running QCFractal server.

## Filtering result collections

A powerful feature of the result collections is the ability to easily filter the entries they contain using a diverse
range of filters. Examples include filtering out specific molecules based on SMILES patterns and removing records
where the connectivity of the molecule changed during the optimization.

The built-in filters are stored in the `openff.qcsubmit.results.filters` module:

In [12]:
from openff.qcsubmit.results import filters

Let's apply some basic filters to our optimization collection:

In [13]:
from qcportal.models.records import RecordStatusEnum

filtered_collection = optimization_result_collection.filter(
    filters.RecordStatusFilter(status=RecordStatusEnum.complete),
    filters.ConnectivityFilter(tolerance=1.2),
    filters.ElementFilter(
        # The elements supported by OpenFF 1.3.0
        allowed_elements=["H", "C", "N", "O", "S", "P", "F", "Cl", "Br", "I"]
    ),
    filters.ConformerRMSDFilter(max_conformers=10),
)

print("===========")
print(f"N RECORDS INITIAL: {optimization_result_collection.n_results}")
print(f"N RECORDS FINAL:   {filtered_collection.n_results}")

print(f"N MOLECULES INITIAL: {optimization_result_collection.n_molecules}")
print(f"N MOLECULES FINAL:   {filtered_collection.n_molecules}")
print("===========")

N RECORDS INITIAL: 2398
N RECORDS FINAL:   1587
N MOLECULES INITIAL: 419
N MOLECULES FINAL:   419


Here we have removed:

* any incomplete records using the `RecordStatusFilter`

* records in which the connectivity of the molecule changed during the computation (e.g. a proton transfer occurred)
  using the `ConnectivityFilter`

* records that were computed for molecules composed of elements that are not supported by the current
  OpenFF force fields using the `ElementFilter`

and finally, a `ConformerRMSDFilter` was applied. When a collection contains multiple optimized conformers for the
same molecule, the `ConformerRMSDFilter` will only retain up to a maximum number of conformers for that molecule that
are distinct to within a specified RMSD tolerance.
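
The idea behind this kind of pruning can be sketched as a greedy selection. This is a simplified illustration rather than the filter's actual algorithm: the `rmsd` and `select_distinct` functions are hypothetical, the RMSD is computed on raw coordinates with no alignment, heavy-atom selection, or symmetry handling, and the order in which conformers are considered is an assumption. The default values mirror the `rmsd_tolerance` and `max_conformers` settings reported in this tutorial's provenance output.

```python
import math


def rmsd(conformer_a, conformer_b):
    """Plain coordinate RMSD between two equally sized lists of (x, y, z) tuples.

    Simplification: no alignment and no symmetry handling is performed.
    """
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(conformer_a, conformer_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(conformer_a))


def select_distinct(conformers, rmsd_tolerance=0.5, max_conformers=10):
    """Greedily keep conformers that differ from every kept one by more than the tolerance."""
    kept = []
    for conformer in conformers:
        if len(kept) >= max_conformers:
            break
        if all(rmsd(conformer, other) > rmsd_tolerance for other in kept):
            kept.append(conformer)
    return kept


# Toy two-atom "conformers"; the second is nearly identical to the first.
conformers = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0)],
]

distinct = select_distinct(conformers)
print(len(distinct))
```

Here the second conformer is discarded because it lies within the tolerance of the first, while the third is retained.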

We could have also made use of the `LowestEnergyFilter` to only retain the lowest energy conformer associated with each
unique molecule in the collection.
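
Conceptually, keeping the lowest energy conformer per molecule is a group-by-and-minimum operation. The sketch below shows that logic on made-up data; the `Result` tuple, the `keep_lowest_energy` function, and all ids, keys, and energies are hypothetical and are not part of the library.

```python
from collections import namedtuple

# Hypothetical record type: an id, a molecule identifier, and a final energy in Hartree.
Result = namedtuple("Result", ["record_id", "inchi_key", "energy"])


def keep_lowest_energy(results):
    """Return one result per inchi_key: the one with the lowest energy."""
    lowest = {}
    for result in results:
        current = lowest.get(result.inchi_key)
        if current is None or result.energy < current.energy:
            lowest[result.inchi_key] = result
    return list(lowest.values())


# Made-up ids, keys, and energies for illustration only.
results = [
    Result("a1", "MOLECULE-A", -10.2),
    Result("a2", "MOLECULE-A", -10.5),
    Result("b1", "MOLECULE-B", -7.1),
]

lowest = keep_lowest_energy(results)
print([result.record_id for result in lowest])
```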

The filtered result collection will record provenance information about which filters were applied:

In [14]:
filtered_collection.provenance

{'applied-filters': {'RecordStatusFilter-0': {'status': <RecordStatusEnum.complete: 'COMPLETE'>},
  'ConnectivityFilter-1': {'tolerance': 1.2},
  'ElementFilter-2': {'allowed_elements': ['H',
    'C',
    'N',
    'O',
    'S',
    'P',
    'F',
    'Cl',
    'Br',
    'I']},
  'ConformerRMSDFilter-3': {'max_conformers': 10,
   'rmsd_tolerance': 0.5,
   'heavy_atoms_only': True,
   'check_automorphs': True}}}

## Additional utilities

In addition to providing an interface for curating collections of QC results, the result collection objects also expose
a number of quality of life utilities for visualizing and analysing the stored results.

A PDF showing the molecules within a result collection can be easily generated:

In [15]:
energy_result_collection.visualize("energy-result-collection.pdf", columns=8)

## (Optional) Cached queries

While results should mostly be retrieved using the result collections, there are times when it is useful to directly
query a QCFractal server for specific records and QC molecules.

To this end the OpenFF QCSubmit framework offers *cached* versions of the `FractalClient.query_procedures` and
`FractalClient.query_molecules` (and more!) functions. These are provided in the `openff.qcsubmit.results.caching`
module:

In [16]:
from openff.qcsubmit.results import caching

Currently, the most useful of these are the `cached_query_procedures` function:

In [17]:
caching.cached_query_procedures(
    qc_client.address, record_ids=["21272353", "21272354"]
)

[TorsionDriveRecord(id='21272353', status='COMPLETE'),
 TorsionDriveRecord(id='21272354', status='COMPLETE')]

and the `cached_query_molecules` function:

In [18]:
caching.cached_query_molecules(
    qc_client.address, molecule_ids=["21272"]
)

[Molecule(name='C6H8', formula='C6H8', hash='3821db0')]

The internal cache used by the framework can be easily cleared if its memory usage becomes too large:

In [19]:
caching.clear_results_caches()