
Taking first go at the property estimator layout #99

Closed
wants to merge 90 commits

Conversation

Lnaden
Collaborator

@Lnaden Lnaden commented Jan 31, 2018

Many functions are still in abstract form, so this serves API purposes only.
`thermoml_parser` was picked up and adapted from `thermopyl`.

Major wall to overcome

Determine how to extract Compound/Mixture info from ThermoML

This is actually a really difficult task since the composition can be defined in several locations:

  • As a `Variable` (controlled quantity that is changing)
  • As a `Constraint` (constant)
  • As a `Property` (measured quantity)

Within any `PureOrMixtureProperty`, which is the parent node of all three of the previous ones, the compounds that participate in the measurement are listed under `Compound`. However, the following can all simultaneously be true across all three sub-nodes:

  • No quantities need be defined, in which case it is a pure fluid (easy)
  • Quantities come in all types (volume, mass, mole)
  • The quantities of all compounds, after accounting for `Variable`, `Constraint`, and `Property`, need not sum to 1
  • Some components can be dependent on each other (e.g. B = 1-A)
  • Subsets of components can be added together to make 1 (e.g. A+B=1 because this is the solvent that C is dissolved into)

Because these rules all seem to be contextually dependent, it becomes much harder to figure out what the correct composition of the measurement should be. Right now I have some of the logic in place to handle this, and it should work for pure fluids, but I have not tested this.
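As a rough illustration of the bookkeeping this requires, here is a minimal sketch (all names here are mine, not ThermoML's) of inferring the leftover mole fraction of the one solvent component that is neither constrained nor varied:

```python
def infer_remaining_fraction(constraint_fractions, variable_fractions):
    """Mole fraction implied for the single component that is neither a
    Constraint nor a Variable (hypothetical helper, mole-based only)."""
    total = sum(constraint_fractions.values()) + sum(variable_fractions.values())
    if total > 1.0:
        raise ValueError("Constraints and variables already exceed a mole fraction of 1")
    return 1.0 - total

# e.g. one component constrained to x=0.2 and a variable component currently at 0,
# so the remaining component implicitly picks up x=0.8
print(infer_remaining_fraction({4: 0.2}, {3: 0.0}))  # 0.8
```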

Known current problems

I have only tested over a small set, but it looks like right now it can only extract the Mole Fraction observable from Gas Chromatography results (tested by going through about half of the JCED set). It does not appear to be extracting anything else right now.

The good news is I figured out how to get around the PyXB version pinning problem!

cc @jchodera @davidlmobley @mrshirts @bmanubay

@davidlmobley
Contributor

Thanks for getting this up, Levi. It'll take me a while to get to try it (I'm offline tomorrow), so this is just to let you know it's on my list. Hopefully Michael and Bryce can look sooner.

@bmanubay

bmanubay commented Jan 31, 2018

This is actually a really difficult task since the composition can be defined in several locations:

As a Variable (controlled quantity that is changing)
As a Constraint (constant)
As a Property (measured quantity)

Hey @Lnaden, do you think you could give a quick example of these to clarify? I'm not sure that I fully understand what the issue is, sorry :(

@Lnaden
Collaborator Author

Lnaden commented Jan 31, 2018

do you think you could give a quick example of these to clarify

Sure, I'll modify an example to have all three. I've added comments (`<!-- -->`) so you can follow the logic:

```xml
<PureOrMixtureData>
    <!-- Define 4 components -->
    <Component>
        <RegNum>
            <nOrgNum>1</nOrgNum>
        </RegNum>
        <nSampleNm>1</nSampleNm>
    </Component>
    <Component>
        <RegNum>
            <nOrgNum>2</nOrgNum>
        </RegNum>
        <nSampleNm>1</nSampleNm>
    </Component>
    <Component>
        <RegNum>
            <nOrgNum>3</nOrgNum>
        </RegNum>
        <nSampleNm>1</nSampleNm>
    </Component>
    <Component>
        <RegNum>
            <nOrgNum>4</nOrgNum>
        </RegNum>
        <nSampleNm>1</nSampleNm>
    </Component>
    <!-- Define 1 property we are measuring, the mole fraction of component 1-->
    <Property>
        <nPropNumber>1</nPropNumber>
        <Property-MethodID>
            <PropertyGroup>
                <CompositionAtPhaseEquilibrium>
                    <ePropName>Mole fraction</ePropName>
                    <sMethodName>gravimetric</sMethodName>
                </CompositionAtPhaseEquilibrium>
            </PropertyGroup>
        <!-- Here is where the Property says it's component 1 -->
        <RegNum>
            <nOrgNum>1</nOrgNum>
        </RegNum>
        </Property-MethodID>
        <PropPhaseID>
            <ePropPhase>Liquid</ePropPhase>
        </PropPhaseID>
    </Property>
    <!-- Define some phase made up of arbitrary components -->
    <PhaseID>
            <ePhase>Liquid</ePhase>
    </PhaseID>
    <!-- Define another phase made up of component 1 -->
    <PhaseID>
            <ePhase>Crystal</ePhase>
        <RegNum>
            <nOrgNum>1</nOrgNum>
        </RegNum>
    </PhaseID>
    <!-- Constrain mole fraction of 4 in the Liquid phase to x=0.2; it will ALWAYS be this now -->
    <Constraint>
        <ConstraintID>
            <ConstraintType>
                <eSolventComposition>Solvent: Mole fraction</eSolventComposition>
            </ConstraintType>
        </ConstraintID>
        <ConstraintPhaseID>
            <eConstraintPhase>Liquid</eConstraintPhase>
            <!-- Here is where the Constraint says it's component 4 -->
            <RegNum>
                <nOrgNum>4</nOrgNum>
            </RegNum>
        </ConstraintPhaseID>
        <nConstraintValue>0.2</nConstraintValue>
        <nConstrDigits>1</nConstrDigits>
    </Constraint>
    <!-- Create a variable we control, in this case, component 3 in a mixture of 2+3+4 -->
    <Variable>
        <nVarNumber>1</nVarNumber> <!-- Index of variable -->
        <VariableID>
            <VariableType>
                <!-- Type of variable -->
                <eSolventComposition>Solvent: Mole fraction</eSolventComposition>
            </VariableType>
        <!-- Here is where the Variable says it's component 3 -->
        <RegNum>
            <nOrgNum>3</nOrgNum>
        </RegNum>
        </VariableID>
        <!-- Here we tag what phase it's in -->
        <VarPhaseID>
            <eVarPhase>Liquid</eVarPhase>
        </VarPhaseID>
        <!-- Here we define what the phase was made out of -->
        <Solvent>
            <!-- This says it has some 2. This is never explicitly defined! It's only implicitly
                 defined, since we constrain 4 and treat 3 as the variable.
            -->
            <RegNum>
                <nOrgNum>2</nOrgNum>
            </RegNum>
            <!-- This says it has some 3, but we also set above that this is the variable we are controlling -->
            <RegNum>
                <nOrgNum>3</nOrgNum>
            </RegNum>
            <!-- This says it has some 4, but we constrained this in another block -->
            <RegNum>
                <nOrgNum>4</nOrgNum>
            </RegNum>
        </Solvent>
    </Variable>
    <NumValues>
        <!-- Variable 1 (Mole fraction of Component 3) set to 0-->
        <VariableValue>
            <nVarNumber>1</nVarNumber>
            <nVarValue>0</nVarValue>
            <nVarDigits>0</nVarDigits>
        </VariableValue>
        <!-- Measured Mole fraction of Component 1 in response to constraints and variables x=0.3491 -->
        <PropertyValue>
            <nPropNumber>1</nPropNumber>
            <nPropValue>0.3491</nPropValue>
            <nPropDigits>4</nPropDigits>
        <PropUncertainty>
            <nUncertAssessNum>1</nUncertAssessNum>
            <nStdUncertValue>.0031419</nStdUncertValue>
        </PropUncertainty>
        </PropertyValue>
        <!-- Mole fraction of component 2 in the mixture is 1 - {sum of Constraints} - {sum of Variables}
             = 1 - 0.2 - 0 = 0.8.
             Remember, Component 1 was NOT defined as a substance in the original Liquid, but its measurement
             changes that composition, and we leave it as an exercise to the reader to figure out the final
             mole fractions. Easy enough here ((1-0.3491) * pre-measurement mole fractions), but still something
             I had to infer after reading the description.
        -->
    </NumValues>
</PureOrMixtureData>
```
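To make that last comment concrete, here is the arithmetic as a short Python sketch (the dictionary layout and names are mine, not part of any parser):

```python
# Pre-measurement solvent composition implied by the file: component 4 is
# constrained to 0.2, variable component 3 is set to 0, and component 2
# implicitly absorbs the remainder (1 - 0.2 - 0 = 0.8).
solvent = {2: 0.8, 3: 0.0, 4: 0.2}

# Measured mole fraction of component 1 at phase equilibrium.
x1 = 0.3491

# Final composition: component 1 takes x1, and the pre-measurement solvent
# fractions are rescaled by whatever is left over.
final = {1: x1, **{k: (1.0 - x1) * v for k, v in solvent.items()}}
print(final)  # {1: 0.3491, 2: 0.52072, 3: 0.0, 4: 0.13018}
```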

@bmanubay

Oooookay, that is much clearer now, and, yes, is a bit frustratingly complicated for something as simple as composition data :/

@davidlmobley
Contributor

My advice on this, @Lnaden, is that we get just the very basic framework working for one or two specific use cases (density, dielectric constant) in the most obvious/easy-to-handle composition cases, then you hand it off to someone else who will:

  • extend to handle other properties
  • work with NIST to figure out how to extract this information more reliably

As I understand it, NIST has ways of automatically handling these, building error models, curating data, etc., so they must be able to extract things reliably and usefully from these files. We just need to get them to teach us how to handle the full diversity...

@Lnaden
Collaborator Author

Lnaden commented Oct 11, 2018

Note to self: dig up the concrete example from what I used above to get in touch with NIST to try and fix.

SimonBoothroyd and others added 10 commits November 12, 2018 17:17
Basic support for local, doi and url loading added.
Imports all compounds and generates a SMIRKS pattern where possible.
Generates a ThermodynamicState (T + p) for each property found.
Maps ThermoML property names and phases onto standardised enum values.
Only loads properties with uncertainties (std. or coverage factor).
Started preparing ThermoMLProperty to inherit from MeasuredProperty.
Swapped enums to be IntFlags ready for bitwise filtering / queries
Extracted all unit detection to single global method for now.
VariableDefinitions and Properties now store which unit they should use
Components now automatically calculate an iupac name
Solvent entries parsed ready for later use
Added basic support for multi-component systems - only mol% for now.
ThermoML now properly and fully inherits MeasureablePhysicalProperty
For now removed the MeasurementMethod design - not sure how this fits in
Added basic filtering of data sets (property and phase).
Parsing phases from strings refactored to a global method.
Phase now detected in both PureOrMixtureData entry or Property entry
IUPAC name generation removed for now as slow / not reliable.
Added support for dielectric constants.
Better logging of unsupported constraints / variables.
Only liquid phase properties are supported for now.
Swapped out deepcopy for cPickle due to heavy performance issues.
ThermoMLDataSet now supports creation from doi / url / path lists
Cleaned up the testing file.
Removed redundant / unused 'first pass' methods.
Update property-estimate to be level with master
packmol now properly sets the box vectors of the output topology
Added a very rough pass at the property-calculator framework.
Moved definition of property calculation templates to XML.
Protocols now track thermodynamic state and substance.
Added rough implementation of extracting properties from calcs.
Fixed incorrect CalculationFidelity implementation.
BuildSmirnoffTopology now calculates charges.
Added rudimentary printing of calculation results.
Cleaned up the calculation protocols.
Added better control over simulation threading.
Removed (hopefully now) deprecated code.
Substance tag now properly includes composition.
@ocmadin

ocmadin commented Nov 27, 2018

So it seems like the design of the PropertyCalculationRunner object is a 'cascading' approach, where properties are calculated using the cheapest method possible, resorting to more expensive options only if necessary. This seems like an appropriate way to do things for property estimation in general, but I want to make sure that this will be compatible with our approach to multi-fidelity sampling.

For the multi-fidelity sampling process, we'll want to be able to have as much control as possible over which level of fidelity we are using (i.e. only MBAR or only surrogate model, etc.), so it would be helpful to have objects for each individual level of fidelity, or options to choose only a single method. We might also want to be able to modify the runner so that it will fit our algorithmic strategy.

Although there are a lot of details left to fill in, making this runner modular and customizable is a priority for us. I'll let @mrshirts chime in if he has any other thoughts.

@SimonBoothroyd
Contributor

Thanks for the comments @ocmadin! It shouldn't be too difficult to add that in - I think the CalculationFidelity Enum could be swapped to an IntFlag, and the run command changed to take one as input, so that

```python
calculator.run(CalculationFidelity.SurrogateModel)
```

would only do surrogate modeling, while

```python
calculator.run(CalculationFidelity.SurrogateModel | CalculationFidelity.Reweighting)
```

would allow both surrogate modeling and reweighting, etc. A couple of if statements in the run method would then handle which fidelities are allowed to run.
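A minimal sketch of what that could look like (the member names beyond `SurrogateModel` and `Reweighting`, and the stubbed bodies, are my assumptions, not the PR's code):

```python
from enum import IntFlag, auto

class CalculationFidelity(IntFlag):
    SurrogateModel = auto()
    Reweighting = auto()
    Simulation = auto()  # hypothetical third tier

class PropertyCalculationRunner:
    def run(self, allowed: CalculationFidelity):
        # Try each fidelity in order of increasing cost, but only if the
        # caller's bitmask allows it.
        if allowed & CalculationFidelity.SurrogateModel:
            print("Estimating via surrogate model...")
        if allowed & CalculationFidelity.Reweighting:
            print("Estimating via reweighting...")
        if allowed & CalculationFidelity.Simulation:
            print("Estimating via direct simulation...")

calculator = PropertyCalculationRunner()
calculator.run(CalculationFidelity.SurrogateModel | CalculationFidelity.Reweighting)
```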

@davidlmobley
Contributor

Also I want to note, @ocmadin, that most people running this are going to want to (and SHOULD) just ask for a calculation of the property they want; your group is going to care deeply about how that property is calculated, and is going to have the job of designing things under the hood so that they get back a reliable, accurate, robust estimate of that property. But most users are not going to know how to select which method to use to calculate the property, and should not be doing so. In fact, eventually this will be something which gets done automatically without anyone even doing it (e.g. we'll come up with a new force field and just calculate a lot of properties to benchmark it, or we'll make a proposed parameter change and want to assess how it impacts things).

In other words your multi-fidelity sampling should be something which is done under the hood; users should not have to know about it/be aware of it.

@mrshirts

that most people running this are going to want to (and SHOULD) just ask for a calculation of the property they want

Hmm. I'm not sure if this is the best way to think about it. I think it's going to take an immense amount of work to get to a process that gives an accurate estimate. In the meantime, which could easily be 3-5 years, the answer will depend on how it's calculated. So if we want to make something that is failsafe in the interim, the answer should be "perform a simulation".

(e.g. we'll come up with a new force field and just calculate a lot of properties to benchmark it,

I think it depends on what "come up with a new force field" means. If we know exactly what the parameters are, then we're probably better off doing a bunch of simulations.

or we'll make a proposed parameter change and want to assess how it impacts things).

I think the multifidelity simulation will be very good at telling us what the impact will be with high accuracy, but potentially not good enough to be gold standard results.

@mrshirts

One other thing to consider is what sort of data will need to be stored for the property estimator in order to have provenance. In all of the multifidelity situations we are dealing with, our estimate will ultimately be calculated from a set of simulations "stitched" together, so the estimate will be a function of those simulations. The property estimator should be able to keep track of which simulations were used to construct its given estimate. Given that it's a lot of data, some sort of pointers to the data will probably make the most sense. The same goes for the reweighting (though that could be redone on the fly at medium expense). And then the Gaussian process / polynomial chaos / neural net surrogate model will need to be stored as well.

The model itself will depend to some extent on the experimental data as well, since we are not planning on generating property(parameter) for ALL parameters, but only for the parameters with higher posterior based on the initial set of experimental data.
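To make the storage question concrete, here is a rough sketch of the kind of record this implies; every name here is hypothetical, not part of the codebase:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EstimateProvenance:
    """Hypothetical record of everything behind a single property estimate."""
    property_id: str
    # Storage keys pointing at the raw simulations that were stitched together.
    simulation_keys: List[str] = field(default_factory=list)
    # Keys for the reweighting inputs (could instead be regenerated on the fly).
    reweighting_keys: List[str] = field(default_factory=list)
    # Key for the serialized surrogate model (GP / polynomial chaos / neural net).
    surrogate_model_key: Optional[str] = None
```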

@davidlmobley
Contributor

Hmm. I'm not sure if this is the best way to think about it. I think it's going to take an immense amount of work to get to a process that gives an accurate estimate. In the meantime, which could easily be 3-5 years, the answer will depend on how it's calculated. So if we want to make something that is failsafe in the interim, the answer should be "perform a simulation".

Right, so that's exactly what I'm saying: until you guys have this science worked out, you will want this basically to just perform a simulation. When you have it worked out, it would do something more complex. Or if you have it worked out for subsets of cases (e.g. the code decides "these parameters are very similar to those parameters I ran another simulation with, and I can accurately reweight in this case because I was asked for a density and I know this amount of parameter perturbation is safe for density calculations...") then it can do that.

the answer will depend on how it's calculated

So we absolutely have to avoid that. This needs to just encapsulate the known, validated way of accurately computing the specified property. Unless we have a better/faster way of doing it that will give us the same answer, we should be using the gold standard approach.

I think the multifidelity simulation will be very good at telling us what the impact will be with high accuracy, but potentially not good enough to be gold standard results.

Right. But the idea is that it should be able to help you assess whether a new simulation is needed, right?

The other issues you raise, about what ultimately needs to be stored for provenance, are likely very important and as we start to get the science worked out for multifidelity simulations then we're obviously going to have to think through how to do that.

SimonBoothroyd added 22 commits January 18, 2019 11:47
Tidy up .gitignore
Temporary fix for missing amber files when running pytest
Implemented pymbar's timeseries methods for vector quantities
Added methods to extract uncorrelated time series from a calculation
Added base methods needed to extract / store trajectories from calculations.
General code cleanup / bug fixes.
Implemented a simple abstraction layer for storing calculation data.
Added a storage implementation which stores files as pickle objects on the local disk.
Fixed a bug where Quantity * scalar returns a scalar in the simulation layer.
Changed the components namespace to workflows.
Added mechanism for storing simulation data.
Fixed a bug in the statistical inefficiency calculation when data is one dimensional.
Refactored protocol decorators so that they can now take arguments.
Automated the protocol merge behaviour.
Begun removing legacy protocol input / parameter code.
Added checks to ensure type consistency between protocol I/O
Fixed issue where deserialized PolymorphicDataType were returned as the base type, not PolymorphicDataType
Fixed issue where serialized protocol schemas lost information about their types.
…morphicDataType class.

Finished a first pass implementation of the ConditionalGroup multiple condition rewrite.
Added simple sanity checks to ensure correct merging behaviour.
…t the same time.

Fixed a bug where merged protocols' original uuids were overwritten
Fixed a bug where calculation ProtocolPaths weren't being correctly updated when merging took place.
Fixed a type checking issue where the system saw int and int64 as different.
Available simulation resources (e.g. number of threads / gpus) now passed to calculation tasks.
Fixed tornado client not fully disconnecting when querying server / submitting jobs.
Added better provenance information to the returned calculation results.
…penMM

Density now calculated directly from the statistics and not the trajectory as a test case.
or get the number of components in a mixture:

```python
ncomponents = mixture.ncomponents
```
Member

In other parts of the code, I think we've been naming properties like this `n_components`.


```python
# Define mixture
mixture = Mixture()
mixture.addComponent('water', mole_fraction=0.2)
```
Member

We've been trying to stick to PEP8 here, so we should change `CapWords` methods to `snake_case`.
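For instance, the call quoted above would become (assuming the signature is otherwise unchanged):

```python
mixture.add_component('water', mole_fraction=0.2)
```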

```python
property_estimator = client.PropertyEstimator()
property_estimator.submit_computations(data_set, force_field)

property_server.run_until_complete()
```
Member

I'd suggest we refine the API names for `submit_computations()` and `run_until_complete()` a bit. I think `PropertyEstimator.estimate(dataset, forcefield)` would be much clearer for designating a blocking call, with `.estimate_nonblocking()` or `.estimate_async()` as the non-blocking call.
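A minimal sketch of the shape being suggested (the class body below is a stand-in with stubbed internals, not the actual implementation):

```python
class PropertyEstimator:
    """Hypothetical sketch of the naming suggested above."""

    def _submit(self, dataset, forcefield):
        # Stand-in for sending the request off to the server.
        return {"dataset": dataset, "forcefield": forcefield, "status": "queued"}

    def estimate(self, dataset, forcefield):
        """Blocking call: submit and wait for the finished estimates."""
        request = self._submit(dataset, forcefield)
        request["status"] = "complete"  # stand-in for waiting on the server
        return request

    def estimate_async(self, dataset, forcefield):
        """Non-blocking call: return a handle the caller can poll later."""
        return self._submit(dataset, forcefield)
```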

@andrrizzi
Collaborator

@SimonBoothroyd should we close this PR?

@SimonBoothroyd
Contributor

I think so - everything from here should be in the new repo now!

@andrrizzi andrrizzi closed this Feb 15, 2019
@SimonBoothroyd SimonBoothroyd deleted the property-estimator branch June 14, 2019 13:52