# Tutorial 02 - Estimating Data Sets

In this tutorial we will use molecular simulation to estimate the data set we created in the first tutorial. The
tutorial will cover:

- defining custom calculation schemas for the properties in our data set.
- estimating the data set of properties using an [Evaluator server](../gettingstarted/server.rst) instance.
- retrieving the results from the server and performing some simple analysis.

For the sake of clarity all warnings will be disabled in this tutorial:

In [None]:
import warnings
warnings.filterwarnings('ignore')
import logging
logging.getLogger("openforcefield").setLevel(logging.ERROR)

*Note: If you are running this example in Google Colab you will need to run a setup script:*

In [None]:
# %run colab_setup.ipynb

*Make sure that you are using a GPU accelerated runtime.*

## Loading the Data Set to Estimate

We will begin by loading in the data set which we created in the previous tutorial:

In [None]:
from evaluator.datasets import PhysicalPropertyDataSet

# data_set_path = "filtered_data_set.json"

# If you have not yet completed that tutorial or do not have the data set file 
# available, a copy is provided by the framework:

from evaluator.utils import get_data_filename
data_set_path = get_data_filename("tutorials/tutorial01/filtered_data_set.json")

data_set = PhysicalPropertyDataSet.from_json(data_set_path)

This data set contains our density and $H_{vap}$ measurements for ethanol and isopropanol.

## Defining the Calculation Schemas

After loading the data set, the next step is to define a calculation schema for each type of property
in our set. A calculation schema is the blueprint for how a type of property should be calculated using a particular
[calculation approach](../layers/calculationlayers.rst), such as directly by simulation, by reprocessing cached
simulation data or, in future, a range of other options.

The framework has built-in schemas which define how densities and $H_{vap}$ should be estimated from molecular
simulation, covering every stage from coordinate generation, force field assignment, energy minimisation and
equilibration through to the production simulation and data analysis. All of this functionality is defined in terms
of the built-in [workflow engine](../workflows/workflows.rst), where each of the above steps is implemented as a
separate [workflow task](../workflows/protocols.rst).
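
To make the idea of a workflow assembled from separate tasks more concrete, here is a minimal, purely illustrative sketch in plain Python. It does not use the framework's workflow engine: `make_step` and `run_workflow` are hypothetical names, and each placeholder step only records that it ran rather than doing any real work.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a workflow task: a function which receives the
# state produced by the previous steps and returns an updated copy.
Step = Callable[[Dict[str, object]], Dict[str, object]]


def make_step(name: str) -> Step:
    """Create a placeholder step which only records that it was executed."""

    def step(state: Dict[str, object]) -> Dict[str, object]:
        completed = list(state.get("completed", []))
        completed.append(name)
        return {**state, "completed": completed}

    return step


def run_workflow(steps: List[Step], state: Dict[str, object]) -> Dict[str, object]:
    """Execute each step in order, passing the output of one to the next."""
    for step in steps:
        state = step(state)
    return state


# The stages named in the text above, in the order they are listed.
stage_names = [
    "coordinate generation",
    "force field assignment",
    "energy minimisation",
    "equilibration",
    "production simulation",
    "data analysis",
]

final_state = run_workflow(
    [make_step(name) for name in stage_names], {"substance": "ethanol"}
)
print(final_state["completed"])
```

Keeping each stage as its own unit means that a single stage can be swapped or reconfigured without touching the others, which is the property the schema modifications in this tutorial rely on.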

For the purpose of this tutorial, we are going to modify the default calculation schemas to reduce the number of
molecules included in our simulations, which speeds up the calculations. This step can be skipped entirely if the
default options (which we would normally recommend) are acceptable.

We can extract the default simulation schemas using the `default_simulation_schema()` function:

In [None]:
from evaluator.properties import Density, EnthalpyOfVaporization

density_schema = Density.default_simulation_schema(n_molecules=256)
h_vap_schema = EnthalpyOfVaporization.default_simulation_schema(n_molecules=256)

from evaluator.client import RequestOptions

# Create an options object which defines how the data set should be estimated.
estimation_options = RequestOptions()
# Specify that we only wish to use molecular simulation to estimate the data set.
estimation_options.calculation_layers = ["SimulationLayer"]

# Add our custom schemas, specifying that they should be used by the 'SimulationLayer'
estimation_options.add_schema("SimulationLayer", "Density", density_schema)
estimation_options.add_schema("SimulationLayer", "EnthalpyOfVaporization", h_vap_schema)

In the calls to `default_simulation_schema()` above, we override the default number of molecules to include in each
simulation, reducing the count from 1000 to 256.

We could also use the `default_simulation_schema()` function to set either the absolute or the relative uncertainty
that the property should be estimated to within. If either of these is set, the schemas will be set up to
automatically extend any simulations until the target uncertainty in the property has been met. For our purposes,
however, we won't set any targets, leaving the simulations to run for the default 1 ns.
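
The idea of extending a simulation until a target uncertainty has been met can be illustrated with a small, self-contained sketch. This is not the framework's implementation: `extend_until_converged` and `fake_simulation_block` are hypothetical names, the "simulation" is just a seeded random number generator, and the uncertainty is taken to be the standard error of the mean of uncorrelated samples, which is an assumption made purely for illustration.

```python
import math
import random
import statistics
from typing import Callable, List, Tuple


def extend_until_converged(
    sample_block: Callable[[], List[float]],
    absolute_tolerance: float,
    max_iterations: int = 100,
) -> Tuple[float, float, int]:
    """Keep adding blocks of samples until the standard error of the mean
    drops below ``absolute_tolerance`` (or ``max_iterations`` is reached).

    Returns the mean, its standard error and the number of blocks used.
    """
    samples: List[float] = []

    for iteration in range(1, max_iterations + 1):
        # 'Extend' the simulation by one more block of observations.
        samples.extend(sample_block())

        mean = statistics.fmean(samples)
        # Standard error assuming uncorrelated samples; real simulation data
        # would first need to be decorrelated.
        std_error = statistics.stdev(samples) / math.sqrt(len(samples))

        if std_error <= absolute_tolerance:
            break

    return mean, std_error, iteration


# Hypothetical 'simulation' which returns 50 noisy observations per block.
rng = random.Random(42)


def fake_simulation_block() -> List[float]:
    return [rng.gauss(1.0, 0.1) for _ in range(50)]


mean, std_error, n_blocks = extend_until_converged(fake_simulation_block, 0.005)
print(f"mean={mean:.3f} std_error={std_error:.4f} blocks={n_blocks}")
```

A tighter tolerance requires more blocks, so the run time of such a loop is not known in advance. That is one reason a fixed simulation length is the simpler choice for a tutorial.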

## Launching the Server

In [None]:
from evaluator.utils import setup_timestamp_logging
setup_timestamp_logging()

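With logging set up, we next create the calculation backend which the server will use to run its calculations. Here we use a local Dask cluster with a single worker which has access to one thread and one CUDA-enabled GPU, and then start an Evaluator server on top of it: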

In [None]:
from evaluator.backends import ComputeResources
from evaluator.backends.dask import DaskLocalCluster

calculation_backend = DaskLocalCluster(
    number_of_workers=1,
    resources_per_worker=ComputeResources(
        number_of_threads=1, 
        number_of_gpus=1, 
        preferred_gpu_toolkit=ComputeResources.GPUToolkit.CUDA
    ),
)
calculation_backend.start()

from evaluator.server import EvaluatorServer

evaluator_server = EvaluatorServer(calculation_backend=calculation_backend)
evaluator_server.start(asynchronous=True)

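Because the server is started with `asynchronous=True`, it runs in the background and is ready to receive requests.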

## Estimating the Data Set

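To estimate the data set we specify the force field to use (here the SMIRNOFF force field stored in `openff-1.0.0.offxml`), create a client to communicate with the server, and submit the request. In this example we also request the gradients of each property with respect to the `epsilon` and `rmin_half` vdW parameters matched by the `[#6X4:1]` SMIRKS pattern: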

In [None]:
from evaluator.forcefield import SmirnoffForceFieldSource

force_field_path = "openff-1.0.0.offxml"
force_field_source = SmirnoffForceFieldSource.from_path(force_field_path)

from evaluator.client import EvaluatorClient
evaluator_client = EvaluatorClient()

from evaluator.forcefield import ParameterGradientKey

requested_gradients = [
    ParameterGradientKey(tag="vdW", smirks="[#6X4:1]", attribute="epsilon"),
    ParameterGradientKey(tag="vdW", smirks="[#6X4:1]", attribute="rmin_half"),
]

request, exception = evaluator_client.request_estimate(
    property_set=data_set,
    force_field_source=force_field_source,
    options=estimation_options,
    parameter_gradient_keys=requested_gradients,
)

assert exception is None

# Wait for the results.
results, exception = request.results(synchronous=True, polling_interval=30)
assert exception is None

# Save the results to disk as formatted JSON.
results.json("results.json", True);

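With `synchronous=True`, the call to `request.results` blocks until the server has finished the request, checking on its progress every `polling_interval` seconds. The results are then saved to `results.json`.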
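
The blocking behaviour of `request.results(synchronous=True, polling_interval=30)` can be pictured as a simple polling loop. The sketch below illustrates that pattern in plain Python. It is not the client's actual implementation, and `wait_for_results`, `check_finished` and `max_polls` are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for_results(
    check_finished: Callable[[], Optional[T]],
    polling_interval: float,
    max_polls: int = 1000,
) -> T:
    """Repeatedly call ``check_finished`` until it returns a result.

    ``check_finished`` should return ``None`` while the work is in progress.
    """
    for _ in range(max_polls):
        result = check_finished()
        if result is not None:
            return result
        # Sleep between checks so the server is not queried continuously.
        time.sleep(polling_interval)

    raise TimeoutError("The request did not finish within the allowed polls.")


# Hypothetical stand-in for a server which reports completion on the third check.
responses = iter([None, None, "finished"])
outcome = wait_for_results(lambda: next(responses), polling_interval=0.01)
print(outcome)
```

A longer polling interval means fewer status checks at the cost of a slightly delayed return, which is a reasonable trade when each simulation takes far longer than the interval.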

## Analysing the Results

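The results object groups the properties by their state: those still queued, those which were estimated, and those which could not be estimated, along with any exceptions which were raised. We can check how many entries fall into each group: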

In [None]:
print(len(results.queued_properties))

print(len(results.estimated_properties))

print(len(results.unsuccessful_properties))
print(len(results.exceptions))

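If the calculations ran as expected, every property should appear in `estimated_properties`, with nothing left queued or unsuccessful and no exceptions reported.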
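
The four counts printed above can be gathered into a single summary. The following is a minimal sketch which only assumes an object exposing the four attributes used above. `summarise_results` is a hypothetical helper, the `SimpleNamespace` is a stand-in for a real results object, and its contents are placeholders rather than real output.

```python
from types import SimpleNamespace
from typing import Dict


def summarise_results(results) -> Dict[str, float]:
    """Count the entries in each group of a results-like object."""
    summary: Dict[str, float] = {
        "queued": len(results.queued_properties),
        "estimated": len(results.estimated_properties),
        "unsuccessful": len(results.unsuccessful_properties),
        "exceptions": len(results.exceptions),
    }
    # Assumption: every requested property sits in exactly one of the
    # queued, estimated or unsuccessful groups.
    total = summary["queued"] + summary["estimated"] + summary["unsuccessful"]
    summary["fraction_estimated"] = summary["estimated"] / total if total else 0.0
    return summary


# Placeholder stand-in, not real output from a server.
mock_results = SimpleNamespace(
    queued_properties=[],
    estimated_properties=["property_a", "property_b", "property_c"],
    unsuccessful_properties=["property_d"],
    exceptions=["placeholder exception"],
)

summary = summarise_results(mock_results)
print(summary)
```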

## Conclusion

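In this tutorial we defined custom calculation schemas for the properties in our data set, estimated the data set using an Evaluator server, and retrieved the results from the server for some simple analysis.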