
Taking first go at the property estimator layout #99

Closed
wants to merge 90 commits

Conversation

Lnaden
Collaborator

@Lnaden Lnaden commented Jan 31, 2018

Many functions are still in abstract form, so this serves API purposes only.
`thermoml_parser` was picked up and adapted from `thermopyl`.

Major wall to overcome

Determine how to extract Compound/Mixture info from ThermoML

This is actually a really difficult task since the composition can be defined in several locations:

  • As a `Variable` (controlled quantity that is changing)
  • As a `Constraint` (constant)
  • As a `Property` (measured quantity)

Within any `PureOrMixtureProperty`, which is the parent node of all three of the previous ones, the compounds that participate in the measurement are listed under `Compound`. However, the following can all simultaneously be true across all three sub-nodes:

  • No quantities need be defined, in which case it is a pure fluid (easy)
  • Quantities come in all types (volume, mass, mole)
  • The quantities of all compounds, after accounting for `Variable`, `Constraint`, and `Property`, need not sum to 1
  • Some components can be dependent on each other (e.g. B = 1-A)
  • Subsets of components can be added together to make 1 (e.g. A+B=1 because this is the solvent that C is dissolved into)

Because these rules all seem to be contextually dependent, it becomes much harder to figure out what the correct composition of the measurement should be. Right now I have some of the logic in place to handle this, and it should work for pure fluids, but I have not tested this.
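As a rough illustration of the bookkeeping this requires, here is a minimal sketch (all names here are mine, not ThermoML's) of inferring the leftover mole fraction of the one solvent component that is neither constrained nor varied:

```python
def infer_remaining_fraction(constraint_fractions, variable_fractions):
    """Mole fraction implied for the single component that is neither a
    Constraint nor a Variable (hypothetical helper, mole-based only)."""
    total = sum(constraint_fractions.values()) + sum(variable_fractions.values())
    if total > 1.0:
        raise ValueError("Constraints and variables already exceed a mole fraction of 1")
    return 1.0 - total

# e.g. one component constrained to x=0.2 and a variable component currently at 0,
# so the remaining component implicitly picks up x=0.8
print(infer_remaining_fraction({4: 0.2}, {3: 0.0}))  # 0.8
```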

Known current problems

I have only tested over a small set, but it looks like right now it can only extract the Mole Fraction observable from Gas Chromatography results (tested by going through about half of the JCED set). It does not appear to be extracting anything else right now.

The good news is I figured out how to get around the PyXB version pinning problem!

cc @jchodera @davidlmobley @mrshirts @bmanubay

@davidlmobley
Contributor

Thanks for getting this up, Levi. It'll take me a while to get to try it (I'm offline tomorrow), so this is just to let you know it's on my list. Hopefully Michael and Bryce can look sooner.

@bmanubay

bmanubay commented Jan 31, 2018

This is actually a really difficult task since the composition can be defined in several locations:

As a Variable (controlled quantity that is changing)
As a Constraint (constant)
As a Property (measured quantity)

Hey @Lnaden, do you think you could give a quick example of these to clarify? I'm not sure that I fully understand what the issue is, sorry :(

@Lnaden
Collaborator Author

Lnaden commented Jan 31, 2018

do you think you could give a quick example of these to clarify

Sure, I'll modify an example to have all three. I've added comments (`<!-- -->`) so you can follow the logic:

```xml
<PureOrMixtureData>
    <!-- Define 4 components -->
    <Component>
        <RegNum>
            <nOrgNum>1</nOrgNum>
        </RegNum>
        <nSampleNm>1</nSampleNm>
    </Component>
    <Component>
        <RegNum>
            <nOrgNum>2</nOrgNum>
        </RegNum>
        <nSampleNm>1</nSampleNm>
    </Component>
    <Component>
        <RegNum>
            <nOrgNum>3</nOrgNum>
        </RegNum>
        <nSampleNm>1</nSampleNm>
    </Component>
    <Component>
        <RegNum>
            <nOrgNum>4</nOrgNum>
        </RegNum>
        <nSampleNm>1</nSampleNm>
    </Component>
    <!-- Define 1 property we are measuring, the mole fraction of component 1-->
    <Property>
        <nPropNumber>1</nPropNumber>
        <Property-MethodID>
            <PropertyGroup>
                <CompositionAtPhaseEquilibrium>
                    <ePropName>Mole fraction</ePropName>
                    <sMethodName>gravimetric</sMethodName>
                </CompositionAtPhaseEquilibrium>
            </PropertyGroup>
        <!-- Here is where the Property says it's component 1 -->
        <RegNum>
            <nOrgNum>1</nOrgNum>
        </RegNum>
        </Property-MethodID>
        <PropPhaseID>
            <ePropPhase>Liquid</ePropPhase>
        </PropPhaseID>
    </Property>
    <!-- Define some phase made up of arbitrary components -->
    <PhaseID>
            <ePhase>Liquid</ePhase>
    </PhaseID>
    <!-- Define another phase made up of component 1 -->
    <PhaseID>
            <ePhase>Crystal</ePhase>
        <RegNum>
            <nOrgNum>1</nOrgNum>
        </RegNum>
    </PhaseID>
    <!-- Constrain mole fraction of 4 in the Liquid phase to x=0.2; it will ALWAYS be this now -->
    <Constraint>
        <ConstraintID>
            <ConstraintType>
                <eSolventComposition>Solvent: Mole fraction</eSolventComposition>
            </ConstraintType>
        </ConstraintID>
        <ConstraintPhaseID>
            <eConstraintPhase>Liquid</eConstraintPhase>
            <!-- Here is where the Constraint says it's component 4 -->
            <RegNum>
                <nOrgNum>4</nOrgNum>
            </RegNum>
        </ConstraintPhaseID>
        <nConstraintValue>0.2</nConstraintValue>
        <nConstrDigits>1</nConstrDigits>
    </Constraint>
    <!-- Create a variable we control, in this case, component 3 in a mixture of 2+3+4 -->
    <Variable>
        <nVarNumber>1</nVarNumber> <!-- Index of variable -->
        <VariableID>
            <VariableType>
                <!-- Type of variable -->
                <eSolventComposition>Solvent: Mole fraction</eSolventComposition>
            </VariableType>
        <!-- Here is where the Variable says it's component 3 -->
        <RegNum>
            <nOrgNum>3</nOrgNum>
        </RegNum>
        </VariableID>
        <!-- Here we tag what phase it's in -->
        <VarPhaseID>
            <eVarPhase>Liquid</eVarPhase>
        </VarPhaseID>
        <!-- Here we define what the phase was made out of -->
        <Solvent>
            <!-- This says it has some 2. This is never explicitly defined! It's only implicitly
                 defined, since we constrain 4 and treat 3 as the variable.
            -->
            <RegNum>
                <nOrgNum>2</nOrgNum>
            </RegNum>
            <!-- This says it has some 3, but we also set above that this is the variable we are controlling -->
            <RegNum>
                <nOrgNum>3</nOrgNum>
            </RegNum>
            <!-- This says it has some 4, but we constrained this in another block -->
            <RegNum>
                <nOrgNum>4</nOrgNum>
            </RegNum>
        </Solvent>
    </Variable>
    <NumValues>
        <!-- Variable 1 (Mole fraction of Component 3) set to 0-->
        <VariableValue>
            <nVarNumber>1</nVarNumber>
            <nVarValue>0</nVarValue>
            <nVarDigits>0</nVarDigits>
        </VariableValue>
        <!-- Measured Mole fraction of Component 1 in response to constraints and variables x=0.3491 -->
        <PropertyValue>
            <nPropNumber>1</nPropNumber>
            <nPropValue>0.3491</nPropValue>
            <nPropDigits>4</nPropDigits>
        <PropUncertainty>
            <nUncertAssessNum>1</nUncertAssessNum>
            <nStdUncertValue>.0031419</nStdUncertValue>
        </PropUncertainty>
        </PropertyValue>
        <!-- Mole fraction of component 2 in the mixture is 1 - {sum of Constraints} - {sum of Variables}
             = 1 - 0.2 - 0 = 0.8.
             Remember, Component 1 was NOT defined as a substance in the original Liquid, but its measurement
             changes that composition, and we leave it as an exercise to the reader to figure out the final
             mole fractions. Easy enough here ((1-0.3491) * pre-measurement mole fractions), but still something
             I had to infer after reading the description.
        -->
    </NumValues>
</PureOrMixtureData>
```
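To make that last comment concrete, here is the arithmetic as a short Python sketch (the dictionary layout and names are mine, not part of any parser):

```python
# Pre-measurement solvent composition implied by the file: component 4 is
# constrained to 0.2, variable component 3 is set to 0, and component 2
# implicitly absorbs the remainder (1 - 0.2 - 0 = 0.8).
solvent = {2: 0.8, 3: 0.0, 4: 0.2}

# Measured mole fraction of component 1 at phase equilibrium.
x1 = 0.3491

# Final composition: component 1 takes x1, and the pre-measurement solvent
# fractions are rescaled by whatever is left over.
final = {1: x1, **{k: (1.0 - x1) * v for k, v in solvent.items()}}
print(final)  # {1: 0.3491, 2: 0.52072, 3: 0.0, 4: 0.13018}
```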

@bmanubay

Oooookay, that is much clearer now, and, yes, is a bit frustratingly complicated for something as simple as composition data :/

@davidlmobley
Contributor

My advice on this, @Lnaden, is that we get just the very basic framework working for one or two specific use cases (density, dielectric constant) in the most obvious/easy-to-handle composition cases, then you hand it off to someone else who will:

  • extend to handle other properties
  • work with NIST to figure out how to extract this information more reliably

As I understand it, NIST has ways of automatically handling these, building error models, curating data, etc., so they must be able to extract things reliably and usefully from these files. We just need to get them to teach us how to handle the full diversity...

@Lnaden
Collaborator Author

Lnaden commented Oct 11, 2018

Note to self: dig up the concrete example from what I used above to get in touch with NIST to try and fix.

SimonBoothroyd and others added 10 commits November 12, 2018 17:17
Basic support for local, doi and url loading added.
Imports all compounds and generates a SMIRKS pattern where possible.
Generates a ThermodynamicState (T + p) for each property found.
Maps ThermoML property names and phases onto standardised enum values.
Only loads properties with uncertainties (std. or coverage factor).
Started preparing ThermoMLProperty to inherit from MeasuredProperty.
Swapped enums to be IntFlags ready for bitwise filtering / queries
Extracted all unit detection to single global method for now.
VariableDefinitions and Properties now store which unit they should use
Components now automatically calculate an iupac name
Solvent entries parsed ready for later use
Added basic support for multi-component systems - only mol% for now.
ThermoML now properly and fully inherits MeasureablePhysicalProperty
For now removed the MeasurementMethod design - not sure how this fits in
Added basic filtering of data sets (property and phase).
Parsing phases from strings refactored to a global method.
Phase now detected in both PureOrMixtureData entry or Property entry
IUPAC name generation removed for now as slow / not reliable.
Added support for dielectric constants.
Better logging of unsupported constraints / variables.
Only liquid phase properties are supported for now.
Swapped out deepcopy for cPickle due to heavy performance issues.
ThermoMLDataSet now supports creation from doi / url / path lists
Cleaned up the testing file.
Removed redundant / unused 'first pass' methods.
Update property-estimate to be level with master
packmol now properly sets the box vectors of the output topology
Added a very rough pass at the property-calculator framework.
Moved definition of property calculation templates to XML.
Protocols now track thermodynamic state and substance.
Added rough implementation of extracting properties from calcs.
Fixed incorrect CalculationFidelity implementation.
BuildSmirnoffTopology now calculates charges.
Added rudimentary printing of calculation results.
Cleaned up the calculation protocols.
Added better control over simulation threading.
Removed (hopefully now) deprecated code.
Substance tag now properly includes composition.
@ocmadin

ocmadin commented Nov 27, 2018

So it seems like the design of the PropertyCalculationRunner object is a 'cascading' approach, where properties are calculated using the cheapest method possible, resorting to more expensive options only if necessary. This seems like an appropriate way to do things for property estimation in general, but I want to make sure that this will be compatible with our approach to multi-fidelity sampling.

For the multi-fidelity sampling process, we'll want to be able to have as much control as possible over which level of fidelity we are using (i.e. only MBAR or only surrogate model, etc.), so it would be helpful to have objects for each individual level of fidelity, or options to choose only a single method. We might also want to be able to modify the runner so that it will fit our algorithmic strategy.

Although there are a lot of details left to fill in, making this runner modular and customizable is a priority for us. I'll let @mrshirts chime in if he has any other thoughts.

@SimonBoothroyd
Contributor

Thanks for the comments @ocmadin! It shouldn't be too difficult to add that in - I think the CalculationFidelity Enum could be swapped to an IntFlag, and the run command changed to take one as input, so that

```python
calculator.run(CalculationFidelity.SurrogateModel)
```

would only do surrogate modeling, while

```python
calculator.run(CalculationFidelity.SurrogateModel | CalculationFidelity.Reweighting)
```

would allow both surrogate modeling and reweighting, etc. A couple of if statements in the run method would then handle which fidelities are allowed to run.
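A minimal sketch of what that could look like (the member names beyond `SurrogateModel` and `Reweighting`, and the stubbed bodies, are my assumptions, not the PR's code):

```python
from enum import IntFlag, auto

class CalculationFidelity(IntFlag):
    SurrogateModel = auto()
    Reweighting = auto()
    Simulation = auto()  # hypothetical third tier

class PropertyCalculationRunner:
    def run(self, allowed: CalculationFidelity):
        # Try each fidelity in order of increasing cost, but only if the
        # caller's bitmask allows it.
        if allowed & CalculationFidelity.SurrogateModel:
            print("Estimating via surrogate model...")
        if allowed & CalculationFidelity.Reweighting:
            print("Estimating via reweighting...")
        if allowed & CalculationFidelity.Simulation:
            print("Estimating via direct simulation...")

calculator = PropertyCalculationRunner()
calculator.run(CalculationFidelity.SurrogateModel | CalculationFidelity.Reweighting)
```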

@davidlmobley
Contributor

Also I want to note, @ocmadin, that most people running this are going to want to (and SHOULD) just ask for a calculation of the property they want; your group is going to care deeply about how that property is calculated, and is going to have the job of designing things under the hood so that they get back a reliable, accurate, robust estimate of that property. But most users are not going to know how to select which method to use to calculate the property, and should not be doing so. In fact, eventually this will be something which gets done automatically without anyone even doing it (e.g. we'll come up with a new force field and just calculate a lot of properties to benchmark it, or we'll make a proposed parameter change and want to assess how it impacts things).

In other words your multi-fidelity sampling should be something which is done under the hood; users should not have to know about it/be aware of it.

@mrshirts

that most people running this are going to want to (and SHOULD) just ask for a calculation of the property they want

Hmm. I'm not sure if this is the best way to think about it. I think it's going to take an immense amount of work to get to a process that gives an accurate estimate. In the meantime, which could easily be 3-5 years, the answer will depend on how it's calculated. So if we want to make something that is failsafe in the interim, the answer should be "perform a simulation".

(e.g. we'll come up with a new force field and just calculate a lot of properties to benchmark it,

I think it depends on what "come up with a new force field" means. If we know exactly what the parameters are, then we're probably better off doing a bunch of simulations.

or we'll make a proposed parameter change and want to assess how it impacts things).

I think the multifidelity simulation will be very good at telling us what the impact will be with high accuracy, but potentially not good enough to be gold standard results.

@mrshirts

One other thing to consider is what sort of data will need to be stored for the property estimator in order to have provenance. In all of the multifidelity situations we are dealing with, our estimate will ultimately be calculated from a set of simulations "stitched" together, so the estimate will be a function of those simulations. The property estimator should be able to keep track of which simulations were used to construct its given estimate. Given that it's a lot of data, some sort of pointers to the data will probably make the most sense. The same goes for the reweighting (though that could be redone on the fly at medium expense). And then the Gaussian process / polynomial chaos / neural net surrogate model will need to be stored as well.

The model itself will depend to some extent on the experimental data as well, since we are not planning on generating property(parameter) for ALL parameters, but only for the parameters with higher posterior based on the initial set of experimental data.
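To make the storage question concrete, here is a rough sketch of the kind of record this implies; every name here is hypothetical, not part of the codebase:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EstimateProvenance:
    """Hypothetical record of everything behind a single property estimate."""
    property_id: str
    # Storage keys pointing at the raw simulations that were stitched together.
    simulation_keys: List[str] = field(default_factory=list)
    # Keys for the reweighting inputs (could instead be regenerated on the fly).
    reweighting_keys: List[str] = field(default_factory=list)
    # Key for the serialized surrogate model (GP / polynomial chaos / neural net).
    surrogate_model_key: Optional[str] = None
```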

@davidlmobley
Contributor

Hmm. I'm not sure if this is the best way to think about it. I think it's going to take an immense amount of work to get to a process that gives an accurate estimate. In the meantime, which could easily be 3-5 years, the answer will depend on how it's calculated. So if we want to make something that is failsafe in the interim, the answer should be "perform a simulation".

Right, so that's exactly what I'm saying: until you guys have this science worked out, you will want this basically to just perform a simulation. When you have it worked out, it would do something more complex. Or if you have it worked out for subsets of cases (e.g. the code decides "these parameters are very similar to those parameters I ran another simulation with, and I can accurately reweight in this case because I was asked for a density and I know this amount of parameter perturbation is safe for density calculations...") then it can do that.

the answer will depend on how it's calculated

So we absolutely have to avoid that. This needs to just encapsulate the known, validated way of accurately computing the specified property. Unless we have a better/faster way of doing it that will give us the same answer, we should be using the gold standard approach.

I think the multifidelity simulation will be very good at telling us what the impact will be with high accuracy, but potentially not good enough to be gold standard results.

Right. But the idea is that it should be able to help you assess whether a new simulation is needed, right?

The other issues you raise, about what ultimately needs to be stored for provenance, are likely very important and as we start to get the science worked out for multifidelity simulations then we're obviously going to have to think through how to do that.

SimonBoothroyd added 22 commits January 18, 2019 11:47
Tidy up .gitignore
Temporary fix for missing amber files when running pytest
Implemented pymbar's timeseries methods for vector quantities
Added methods to extract uncorrelated time series from a calculation
Added base methods needed to extract / store trajectories from calculations.
General code cleanup / bug fixes.
Implemented a simple abstraction layer for storing calculation data.
Added a storage implementation which stores files as pickle objects on the local disk.
Fixed a bug where Quantity * scalar returns a scalar in the simulation layer.
Changed the components namespace to workflows.
Added mechanism for storing simulation data.
Fixed a bug in the statistical inefficiency calculation when data is one dimensional.
Refactored protocol decorators so that they can now take arguments.
Automated the protocol merge behaviour.
Begun removing legacy protocol input / parameter code.
Added checks to ensure type consistency between protocol I/O
Fixed issue where deserialized PolymorphicDataType were returned as the base type, not PolymorphicDataType
Fixed issue where serialized protocol schemas lost information about their types.
…morphicDataType class.

Finished a first pass implementation of the ConditionalGroup multiple condition rewrite.
Added simple sanity checks to ensure correct merging behaviour.
…t the same time.

Fixed a bug where merged protocols' original uuids were overwritten
Fixed a bug where calculation ProtocolPaths weren't being correctly updated when merging took place.
Fixed a type checking issue where the system saw int and int64 as different.
Available simulation resources (e.g. number of threads / gpus) now passed to calculation tasks.
Fixed tornado client not fully disconnecting when querying server / submitting jobs.
Added better provenance information to the returned calculation results.
…penMM

Density now calculated directly from the statistics and not the trajectory as a test case.
or get the number of components in a mixture:

```python
ncomponents = mixture.ncomponents
```
Member

In other parts of the code, I think we've been naming properties like this `n_components`.


```python
# Define mixture
mixture = Mixture()
mixture.addComponent('water', mole_fraction=0.2)
```
Member

We've been trying to stick to PEP8 here, so we should change `CapWords` methods to `snake_case`.
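For instance, the call quoted above would become (assuming the signature is otherwise unchanged):

```python
mixture.add_component('water', mole_fraction=0.2)
```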

```python
property_estimator = client.PropertyEstimator()
property_estimator.submit_computations(data_set, force_field)

property_server.run_until_complete()
```
Member

I'd suggest we refine the API names for `submit_computations()` and `run_until_complete()` a bit. I think `PropertyEstimator.estimate(dataset, forcefield)` would be much clearer for designating a blocking call, with `.estimate_nonblocking()` or `.estimate_async()` as the non-blocking call.
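A minimal sketch of the shape being suggested (the class body below is a stand-in with stubbed internals, not the actual implementation):

```python
class PropertyEstimator:
    """Hypothetical sketch of the naming suggested above."""

    def _submit(self, dataset, forcefield):
        # Stand-in for sending the request off to the server.
        return {"dataset": dataset, "forcefield": forcefield, "status": "queued"}

    def estimate(self, dataset, forcefield):
        """Blocking call: submit and wait for the finished estimates."""
        request = self._submit(dataset, forcefield)
        request["status"] = "complete"  # stand-in for waiting on the server
        return request

    def estimate_async(self, dataset, forcefield):
        """Non-blocking call: return a handle the caller can poll later."""
        return self._submit(dataset, forcefield)
```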

@andrrizzi
Collaborator

@SimonBoothroyd should we close this PR?

@SimonBoothroyd
Contributor

I think so - everything from here should be in the new repo now!

@andrrizzi andrrizzi closed this Feb 15, 2019
@SimonBoothroyd SimonBoothroyd deleted the property-estimator branch June 14, 2019 13:52