# Creating training and benchmarking datasets
This notebook will guide you through the process of creating the physical property training and benchmarking datasets used in the development of the transferable double exponential force field (DE-FF).

## Physical properties

First, we start with the optimisation datasets pulled from ThermoML on which the DE-FF was trained. The majority of the dataset can be found in `physical-data-sets/sage-train-v1.json`, which is taken directly from the Sage training [repo](https://github.com/openforcefield/openff-sage). The [script](https://github.com/openforcefield/openff-sage/blob/main/data-set-curation/physical-property/optimizations/curate-training-set.py) to reproduce this dataset can also be found in that repo.

To make the complete training set, we need to extract pure water properties from ThermoML and add them to the Sage training set.

In [None]:
import pandas
from openff.evaluator.datasets.curation.components import filtering, selection, thermoml
from openff.evaluator.datasets.curation.components.selection import State, TargetState
from openff.evaluator.datasets.curation.workflow import (
    CurationWorkflow,
    CurationWorkflowSchema,
)
from openff.evaluator.datasets import PhysicalPropertyDataSet
import os

Make a new directory to store the generated datasets so as not to overwrite the reference datasets provided in `physical-data-sets`.

In [None]:
new_ouput_folder = "remade-data-sets"
os.makedirs(new_ouput_folder, exist_ok=True)

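Load ThermoML and filter it for pure water density data close to atmospheric pressure (101.325 kPa), then select the data points nearest to six target temperatures between 281.15 K and 361.15 K.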

In [None]:
data_frame = CurationWorkflow.apply(
        pandas.DataFrame(),
        CurationWorkflowSchema(
            component_schemas=[
                thermoml.ImportThermoMLDataSchema(cache_file_name="physical-data-sets/thermol.csv"),
                filtering.FilterByNComponentsSchema(n_components=[1]),
                filtering.FilterDuplicatesSchema(),
                filtering.FilterByPropertyTypesSchema(property_types=["Density"]),
                filtering.FilterByTemperatureSchema(
                    minimum_temperature=280.0, maximum_temperature=370
                ),
                filtering.FilterByPressureSchema(
                    minimum_pressure=99.9, maximum_pressure=101.4
                ),
                filtering.FilterBySmilesSchema(smiles_to_include=["O"]),
                selection.SelectDataPointsSchema(
                    target_states=[
                        TargetState(
                            property_types=[
                                ("Density", 1),
                            ],
                            states=[
                                State(
                                    temperature=temperature,
                                    pressure=101.325,
                                    mole_fractions=(1.0,),
                                )
                                for temperature in [
                                    281.15,
                                    298.15,
                                    313.15,
                                    329.15,
                                    345.15,
                                    361.15,
                                ]
                            ],
                        )
                    ]
                ),
            ]
        ),
        n_processes=4,
    )

density_data = PhysicalPropertyDataSet.from_pandas(data_frame=data_frame)
with open(os.path.join(new_ouput_folder, "pure_water_rho.json"), "w") as output:
    output.write(density_data.json())

This will produce a new pure-water-only dataset at `remade-data-sets/pure_water_rho.json`, which can be viewed as a table using the `to_pandas` method.

In [None]:
density_data.to_pandas()
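The selection step above asks for one density point per target state. As a rough illustration of that idea only, the sketch below picks the closest measured temperature for each target. It is a simplification and not the openff-evaluator implementation; the helper `select_nearest` and the measured temperatures are hypothetical.

```python
def select_nearest(measured, targets):
    """Pick, for each target, the closest measured value that is still unused."""
    remaining = list(measured)
    selected = []
    for target in targets:
        if not remaining:
            break
        closest = min(remaining, key=lambda value: abs(value - target))
        selected.append(closest)
        remaining.remove(closest)
    return selected


# Hypothetical measurement temperatures in K (not taken from ThermoML).
measured_temperatures = [280.5, 283.15, 298.15, 300.0, 313.2, 330.0, 345.0, 360.0, 365.0]
# Target temperatures in K, as used in the curation workflow above.
target_temperatures = [281.15, 298.15, 313.15, 329.15, 345.15, 361.15]

selected = select_nearest(measured_temperatures, target_temperatures)
print(selected)
```

With six target states, at most six data points are selected, which is why the pure water dataset is expected to contribute six rows.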

Now we can create the fitting dataset by combining the pure water and Sage datasets. First, load the Sage data and check the number of entries; we expect 1032 rows.

In [None]:
sage_data = PhysicalPropertyDataSet.from_json("physical-data-sets/sage-train-v1.json")
sage_data.to_pandas()

Add in the water densities and make sure the number of rows has increased to 1038. Then save the result as an evaluator input file at `remade-data-sets/sage_and_water_rho.json`.

In [None]:
sage_data.add_properties(*density_data.properties)
sage_data.to_pandas()
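Rather than reading the row counts off the displayed tables, the expected values can be asserted directly. The following is a minimal sketch: the helper `check_row_count` is hypothetical, and in the notebook the `actual` argument would be something like `len(sage_data.to_pandas())`.

```python
def check_row_count(actual, expected, label):
    """Raise a ValueError if a dataset does not have the expected number of rows."""
    if actual != expected:
        raise ValueError(f"{label}: expected {expected} rows, found {actual}")
    return actual


# Row counts quoted in this notebook: 1032 Sage rows plus 6 water densities.
n_sage_rows = 1032
n_water_rows = 6
n_combined = check_row_count(n_sage_rows + n_water_rows, 1038, "Sage + water")
print(n_combined)
```

Failing early here avoids silently writing an incomplete training set to disk.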

In [None]:
with open(os.path.join(new_ouput_folder, "sage_and_water_rho.json"), "w") as output:
    output.write(sage_data.json())

## Physical property benchmarks

Here we will construct the benchmark hydration free energy dataset by re-filtering the Sage [set](https://github.com/openforcefield/openff-sage/blob/main/data-set-curation/physical-property/benchmarks/data-sets/sage-fsolv-test-v1.json); see the original filtering [script](https://github.com/openforcefield/openff-sage/blob/main/data-set-curation/physical-property/benchmarks/curate-fsolv-test-set.py) for details on its construction. This was done by creating a list of solutes for which we have both an aqueous and a non-aqueous solvation free energy measurement. The list of solutes is included in `physical-data-sets/filtered-mnsol.txt`.

The MNsol non-aqueous solvation free energy test set can be constructed by filtering for records containing these same solutes. Due to licensing issues, only a text file with the IDs of the records used from MNsol is included, at `physical-data-sets/filtered-mnsol.txt`. See the original [instructions](https://github.com/openforcefield/openff-sage/blob/2.0.0-rc.1/data-set-curation/physical-property/benchmarks/README.md) on downloading the MNsol dataset and processing it into a CSV format compatible with the openff-evaluator filtering tools. This dataset should then be filtered for the substances in `physical-data-sets/filtered-mnsol.txt` to create the test set.

In [None]:
# load the Sage dataset and collect the unique solutes from the processed MNsol records
sage_fsolv = PhysicalPropertyDataSet.from_json("physical-data-sets/sage-fsolv-test-v1.json")
solutes = set()
with open("physical-data-sets/filtered-mnsol.txt") as mnsol:
    for line in mnsol:
        # skip the header row
        if "Id" in line:
            continue
        data = line.split(",")
        # column 1 holds the solute when column 2 is "Solute"; otherwise use column 3
        solute = data[1] if data[2] == "Solute" else data[3]
        solutes.add(solute)
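The parsing logic above can be tried out without the real file. The sketch below applies the same column rule to a small in-memory sample using the standard `csv` module. The header names, record IDs and rows are hypothetical and only mimic the column order the loop relies on; the helper `extract_solutes` is not part of openff-evaluator.

```python
import csv
import io


def extract_solutes(handle):
    """Collect the unique solute SMILES from a comma-separated record file."""
    solutes = set()
    for row in csv.reader(handle):
        # Skip blank lines and the header row (assumed to start with "Id").
        if not row or row[0] == "Id":
            continue
        # Same rule as in the notebook: column 1 is the solute if column 2 says so.
        solute = row[1] if row[2] == "Solute" else row[3]
        solutes.add(solute)
    return solutes


# Hypothetical sample in the assumed layout, not real MNsol records.
sample = io.StringIO(
    "Id,Component 1,Role 1,Component 2,Role 2\n"
    "example-1,CCO,Solute,CCCCCCCC,Solvent\n"
    "example-2,c1ccccc1,Solvent,CCO,Solute\n"
    "example-3,CC(C)=O,Solvent,CCCO,Solute\n"
)
print(sorted(extract_solutes(sample)))
```

Unlike the loop above, this version only treats a row as the header when its first column is exactly `Id`, so a data row that happens to contain the text "Id" elsewhere would not be skipped.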

In [None]:
# add in 3 missing solutes which were used to expand the hydration free energy test set
for mol in ["CCCO", "CCCCO", "Cc1ccc(O)cc1"]:
    solutes.add(mol)

In [None]:
# keep only the records whose solute is in our list
entries_to_keep = []
for entry in sage_fsolv.properties:
    solute = [component.smiles for component in entry.substance.components if component.role.name == "Solute"][0]
    if solute in solutes:
        entries_to_keep.append(entry)
# we should have 72 entries
len(entries_to_keep)
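If the count differs from the expected 72, it helps to see which requested solutes found no matching record. The sketch below is one way to do that with a set difference; the helper `unmatched_solutes` is hypothetical and the inputs are illustrative stand-ins for the notebook's variables.

```python
def unmatched_solutes(requested, found):
    """Return the requested solutes that never appear among the found ones."""
    return sorted(set(requested) - set(found))


# Illustrative stand-ins: `requested` plays the role of `solutes`, and `found`
# the solute SMILES of the kept entries (duplicates are allowed).
requested = {"CCO", "CCCO", "Cc1ccc(O)cc1"}
found = ["CCO", "CCCO", "CCO"]
print(unmatched_solutes(requested, found))
```

Because the filter compares SMILES as plain strings, an unmatched solute may simply be written differently in the two files rather than being genuinely absent.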

Create an evaluator dataset for the hydration free energies from the entries we want to keep and save it to `remade-data-sets/fsolv-filtered.json`.

In [None]:
# overwrite the stored properties with the filtered entries
sage_fsolv._properties = entries_to_keep
with open(os.path.join(new_ouput_folder, "fsolv-filtered.json"), "w") as output:
    output.write(sage_fsolv.json())