# Dataset Basics

In this example, we will work with a singlepoint dataset. However, the concepts apply to all other dataset types.

In [1]:
from qcportal import PortalClient
from qcportal.records.singlepoint import QCSpecification
from qcportal.datasets.singlepoint import SinglepointDatasetNewEntry
from qcportal.molecules import Molecule
from qcfractal import FractalSnowflake

For this example, we will use a snowflake server. For production calculations, you will generally be connecting to an external server.

In [2]:
snowflake = FractalSnowflake()

The snowflake sets up a database and server. We can get a client for that server from the snowflake object.

In [3]:
client = snowflake.client()

## Creating the dataset, entries, and specifications

Next, we will create the singlepoint dataset on the server. We will also give it a default tag.

The first argument is the type of dataset. See
[PortalClient.add_dataset](../api/qcportal.rst#qcportal.client.PortalClient.add_dataset) for more options.

In [4]:
ds = client.add_dataset("singlepoint",
                        name="Element Benchmark",
                        description="Variety of calculations on single atoms",
                        default_tag='sp_el_tag')

Now we add dataset entries. For a singlepoint dataset, these correspond to the molecules that the singlepoint calculations run on.

This cell creates ten Molecule objects, one for each of the first 10 elements, with the atom at the origin. It then creates entries for the dataset, and adds them to the dataset.

Dataset entries can have other fields as well. See, for example, [SinglepointDatasetNewEntry](../api/qcportal.datasets.singlepoint.rst#qcportal.datasets.singlepoint.models.SinglepointDatasetNewEntry).

In [5]:
for m in ['h', 'he', 'li', 'be', 'b', 'c', 'n', 'o', 'f', 'ne']:
    mol = Molecule(symbols=[m], geometry=[0.0, 0.0, 0.0])
    
    # Creates an entry from the molecule. The entry contains the molecule and a name,
    # but there are additional fields you can have as well
    entry_name = m + "_atom"
    ent = SinglepointDatasetNewEntry(name=entry_name, molecule=mol)
    ds.add_entries([ent])

We will now create two different specifications and add them to the dataset. The first will be hf/sto-3g, and the second will be mp2/aug-cc-pvtz.

For both, we will increase the maximum number of SCF iterations to 100.

In [6]:
spec_1 = QCSpecification(
            program="psi4",
            driver="energy",
            method="hf",
            basis="sto-3g",
            keywords={"maxiter": 100}
)

spec_2 = QCSpecification(
            program="psi4",
            driver="properties",
            method="mp2",
            basis="aug-cc-pvtz",
            keywords={"maxiter": 100}
)

ds.add_specification(name="hf/sto-3g", specification=spec_1)
ds.add_specification(name="mp2/aug-cc-pvtz", specification=spec_2)

## Submitting the computations and checking the status

At this point, we have added specifications and entries,
but have not submitted any calculations yet. We do that with
the `submit()` function.

By default, this submits all calculations, but we could restrict the entries
and specifications that get submitted.

The compute tag for all these computations can be specified here, but by default, the `default_tag` we passed to the `add_dataset` function will be used.

In [7]:
ds.submit()

We can check the status of the calculations on the server with the `status()` function. Note that this will always be computed on the server, and will not use any locally-cached records.

In [16]:
ds.status()

{'hf/sto-3g': {<RecordStatusEnum.complete: 'complete'>: 10},
 'mp2/aug-cc-pvtz': {<RecordStatusEnum.complete: 'complete'>: 5,
  <RecordStatusEnum.error: 'error'>: 5}}
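
The dictionary returned by `status()` maps each specification name to a count of records per status, so it is easy to post-process. As a minimal sketch, the snippet below tallies completed records per specification from a dictionary shaped like the output above. It uses plain strings as status keys so that it runs without a server (the real keys are `RecordStatusEnum` members), and the helper name `completion_summary` is hypothetical, not part of qcportal.

```python
def completion_summary(status):
    """Return {specification name: (number complete, total records)}."""
    summary = {}
    for spec_name, counts in status.items():
        total = sum(counts.values())
        complete = counts.get("complete", 0)
        summary[spec_name] = (complete, total)
    return summary


# Stand-in for the output of ds.status(), with string keys instead of enums
example_status = {
    "hf/sto-3g": {"complete": 10},
    "mp2/aug-cc-pvtz": {"complete": 5, "error": 5},
}

for spec_name, (done, total) in completion_summary(example_status).items():
    print(f"{spec_name}: {done}/{total} complete")
```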

## Entries and specifications

Specifications can be viewed with the `specifications` property, which returns a dictionary keyed by specification name.

In [9]:
ds.specifications

{'hf/sto-3g': SinglepointDatasetSpecification(name='hf/sto-3g', specification=QCSpecification(program='psi4', driver=<SinglepointDriver.energy: 'energy'>, method='hf', basis='sto-3g', keywords={'maxiter': 100}, protocols=AtomicResultProtocols(wavefunction=<WavefunctionProtocolEnum.none: 'none'>, stdout=True, error_correction=ErrorCorrectionProtocol(default_policy=True, policies=None), native_files=<NativeFilesProtocolEnum.none: 'none'>)), description=None),
 'mp2/aug-cc-pvtz': SinglepointDatasetSpecification(name='mp2/aug-cc-pvtz', specification=QCSpecification(program='psi4', driver=<SinglepointDriver.properties: 'properties'>, method='mp2', basis='aug-cc-pvtz', keywords={'maxiter': 100}, protocols=AtomicResultProtocols(wavefunction=<WavefunctionProtocolEnum.none: 'none'>, stdout=True, error_correction=ErrorCorrectionProtocol(default_policy=True, policies=None), native_files=<NativeFilesProtocolEnum.none: 'none'>)), description=None)}

For entries, we can get a list of entry names with the `entry_names` property.

In [10]:
ds.entry_names

['h_atom',
 'he_atom',
 'li_atom',
 'be_atom',
 'b_atom',
 'c_atom',
 'n_atom',
 'o_atom',
 'f_atom',
 'ne_atom']

To get the full information about an entry, use [get_entry()](../api/qcportal.datasets.rst#qcportal.datasets.models.BaseDataset.get_entry). This function will fetch from the server as needed.

By default, this will not fetch the full molecule for the entry. We can force that with `include=['molecule']`.

In [11]:
ds.get_entry('h_atom', include=['molecule'])

SinglepointDatasetEntry(name='h_atom', comment=None, molecule=Molecule(name='H', formula='H', hash='512c204'), additional_keywords={}, attributes={}, molecule_id=1, local_results=None)

We can iterate over all the entries with
[iterate_entries()](../api/qcportal.datasets.rst#qcportal.datasets.models.BaseDataset.iterate_entries).
This function returns a Python generator and automatically fetches entry information as needed.

In [12]:
for entry in ds.iterate_entries(include=['molecule']):
    print(entry.name, entry.molecule.get_hash())

h_atom 512c204fbb415052dbcf3bca37cde209edb05c6c
he_atom b3855c64e9f61158f5e449e2df7b79bf1fa599d7
li_atom 276d7fc85bb9f9a56b45e238b277ca7033701d47
be_atom 323841f39301b6ba786dc777fc00b772585748b4
b_atom 52710bae9ed5f58616c108ecebb5156cf06eb5d7
c_atom 95903295c6c2e250e62e0b930ae0916f8ee82c3d
n_atom 4549a18cc99231565da1764ccd20855198628d1b
o_atom 61591a7367f341bba8c2e9a8afa15f740b651efe
f_atom 3fdfbc30f87150456ef640e32982a81bbae6fcdc
ne_atom d078cfcbf6d52ba080868acb0560296bb7c4542a
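
To illustrate what "fetching as needed" means for a generator, here is a small self-contained sketch of lazy, batched iteration. It is only an analogy: whether qcportal fetches in batches like this is an assumption, and `iterate_lazily`, `fake_fetch`, and `batch_size` are hypothetical names that stand in for the real client and server.

```python
def iterate_lazily(names, fetch_batch, batch_size=2):
    """Yield one item per name, calling fetch_batch only when a new batch is needed."""
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        # One (simulated) round trip retrieves the whole batch
        for item in fetch_batch(batch):
            yield item


fetch_calls = []


def fake_fetch(batch):
    # Stand-in for a server request; records which names were requested
    fetch_calls.append(list(batch))
    return [name.upper() for name in batch]


gen = iterate_lazily(["h_atom", "he_atom", "li_atom"], fake_fetch)

first = next(gen)                    # triggers only the first batch
calls_after_first = len(fetch_calls)

rest = list(gen)                     # consuming the rest triggers the second batch
print(first, calls_after_first, rest, len(fetch_calls))
```

Nothing is requested until the generator is consumed, and later batches are requested only when iteration reaches them.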


## Getting and iterating over records

Records are indexed by the entry name and the specification name. As with entries, a single record can be obtained with [get_record()](../api/qcportal.datasets.rst#qcportal.datasets.models.BaseDataset.get_record).

In [13]:
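rec = ds.get_record("h_atom", "hf/sto-3g")
print(rec.id)

# properties is None until the computation has completed, so guard the access
# to avoid an AttributeError on records that are still waiting or running
if rec.properties is not None:
    print(rec.properties.return_energy)

3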

When you need information about many records, you can iterate over all of them with
[iterate_records()](../api/qcportal.datasets.rst#qcportal.datasets.models.BaseDataset.iterate_records).

This function returns a generator that yields tuples of three values: the entry name, the specification name, and the record.
It also fetches record information automatically as needed.

[iterate_records()](../api/qcportal.datasets.rst#qcportal.datasets.models.BaseDataset.iterate_records) has some useful additional arguments, such as `status`, which restricts iteration to records with a particular status. That is useful here because some computations have not finished or have errored.

In [None]:
for entry_name, spec_name, record in ds.iterate_records(status='complete'):
    print(entry_name, spec_name, record.properties.return_energy)
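
The (entry name, specification name) indexing and the status filter can be mimicked with plain Python objects. The sketch below is not qcportal code: `ToyRecord` and `iterate_toy_records` are hypothetical stand-ins, and the statuses are made up for illustration (the two energies are taken from the hf/sto-3g values shown later on this page).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRecord:
    status: str
    return_energy: Optional[float] = None


def iterate_toy_records(records, status=None):
    """Yield (entry_name, spec_name, record), optionally filtered by status."""
    for (entry_name, spec_name), record in records.items():
        if status is None or record.status == status:
            yield entry_name, spec_name, record


# Records keyed by (entry name, specification name)
toy_records = {
    ("h_atom", "hf/sto-3g"): ToyRecord("complete", -0.466582),
    ("he_atom", "hf/sto-3g"): ToyRecord("complete", -2.807913),
    ("h_atom", "mp2/aug-cc-pvtz"): ToyRecord("error"),
}

for entry_name, spec_name, record in iterate_toy_records(toy_records, status="complete"):
    print(entry_name, spec_name, record.return_energy)
```

Filtering before touching `return_energy` avoids reading results from records that do not have any yet.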

## Compiling a pandas dataframe

One common task is to create a pandas dataframe with values that you have computed. For this, you can use
[compile_values()](../api/qcportal.datasets.rst#qcportal.datasets.models.BaseDataset.compile_values).

The first argument of this function is a callable which is applied to all (completed) records, and is used to extract the values stored in the dataframe. The function then iterates over all records, applies that function,
and creates the pandas dataframe for you.

In [14]:
df = ds.compile_values(lambda r: r.properties.return_energy, 'total energy')

In [15]:
print(df)

specification  hf/sto-3g
entry                   
b_atom        -24.149117
h_atom         -0.466582
he_atom        -2.807913
o_atom        -73.661918
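
Conceptually, the result is a table with one row per entry and one column per specification, filled in only for completed records. The following sketch reproduces that idea with nested dictionaries instead of pandas. It is not the library's implementation: `compile_toy_values` and the record dictionaries are hypothetical, and the `running` status of `c_atom` is made up for illustration.

```python
def compile_toy_values(records, value_call):
    """Build {entry_name: {spec_name: value}} from completed records only."""
    table = {}
    for (entry_name, spec_name), record in records.items():
        if record["status"] != "complete":
            continue  # skip records that have no result yet
        table.setdefault(entry_name, {})[spec_name] = value_call(record)
    return table


# Stand-in records keyed by (entry name, specification name)
toy_records = {
    ("b_atom", "hf/sto-3g"): {"status": "complete", "return_energy": -24.149117},
    ("h_atom", "hf/sto-3g"): {"status": "complete", "return_energy": -0.466582},
    ("c_atom", "hf/sto-3g"): {"status": "running", "return_energy": None},
}

table = compile_toy_values(toy_records, lambda r: r["return_energy"])

for entry_name, row in table.items():
    print(entry_name, row)
```

Entries without a completed record are simply absent from the table, which matches the partial dataframe shown above.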
