[WIP] Reference by UUIDs #443
First step towards an adaptive sampling scheme.
How is this related to adaptive sampling?
Because I might use it later for adaptive sampling :) It has more to do with what I need in order to do adaptive sampling, for example on a cluster. So it is a step towards making OPS more usable for the MSM_TIS adaptive sampling...
Hmmm, this was more work than expected. I now have something working where you can select how netcdfplus handles internal references: either by UUID or by integer reference. UUIDs are disabled by default. This means you can switch to multi-file support if you want to. Using UUIDs makes objects unique across multiple files, but adds some overhead, although the tests run in almost the same time. The mstis_analysis uses 1.5 additional seconds for 500 MC steps. I guess the benefit for large systems outweighs this, but for small and test systems the current implementation (a single file and short references) is much better. In addition, one extra UUID per object is required. For a million objects this is (currently) 36 MiB more. I can get this down to 16 MiB by storing the UUID as bytes rather than as a string. The implementation can still be improved, but in general this works.
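For reference, the per-object sizes behind those figures can be checked with the standard library `uuid` module. This is only a back-of-the-envelope sketch: it counts the raw UUID payload and ignores any per-variable overhead that the storage backend may add. The name `n_objects` is just an illustration.

```python
import uuid

u = uuid.uuid4()

as_string = str(u)   # canonical hyphenated text form
as_bytes = u.bytes   # raw 128-bit value

n_objects = 1_000_000  # illustrative object count

# Raw payload only; the storage format may add its own overhead.
string_total = len(as_string) * n_objects
bytes_total = len(as_bytes) * n_objects

# The byte form is lossless: the original UUID can be rebuilt from it.
restored = uuid.UUID(bytes=as_bytes)

print(len(as_string), "characters per UUID as a string")
print(len(as_bytes), "bytes per UUID in binary form")
print(string_total, "vs", bytes_total, "bytes for", n_objects, "objects")
```

A string UUID takes 36 characters and the binary form 16 bytes, which is where the 36 versus 16 ratio comes from.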
Trying single trajectory files now. Seems to work very nicely and should make the distributed trajectory generation simpler.
Closes #98
This implements the choice to use UUIDs to reference objects in the storage. Usually, for pickling, an object is referenced in a storage by its type and its index in the store.
Now the user can choose between the old (faster and simpler) way of referencing and UUIDs. Because looking up UUIDs adds significant overhead, this does not make sense for the ToyEngine on a local machine. If you run large systems, the overhead is negligible.
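To illustrate the difference between the two referencing schemes, here is a minimal sketch. The classes `Item`, `IndexedStore` and `UUIDStore` are hypothetical and are not the netcdfplus API; they only show why an integer reference is tied to one store while a UUID is not, and where the extra lookup comes from.

```python
import uuid


class Item:
    """Hypothetical stored object that carries its own UUID."""

    def __init__(self, value):
        self.value = value
        self.uuid = uuid.uuid4()


class IndexedStore:
    """Hypothetical store: a reference is (type name, integer index)."""

    def __init__(self):
        self._objects = []

    def save(self, obj):
        self._objects.append(obj)
        return (type(obj).__name__, len(self._objects) - 1)

    def load(self, ref):
        _, index = ref
        return self._objects[index]  # direct positional access


class UUIDStore:
    """Hypothetical store: a reference is the object's UUID."""

    def __init__(self):
        self._objects = []
        self._position = {}  # UUID -> index, the extra lookup step

    def save(self, obj):
        self._position[obj.uuid] = len(self._objects)
        self._objects.append(obj)
        return obj.uuid

    def load(self, key):
        return self._objects[self._position[key]]


a, b = Item("a"), Item("b")

# Integer references depend on the order in which each store was filled.
first, second = IndexedStore(), IndexedStore()
first.save(a)
ref_b_first = first.save(b)
ref_b_second = second.save(b)

# The same UUID resolves to the same object in every store that holds it.
uuid_first, uuid_second = UUIDStore(), UUIDStore()
uuid_first.save(a)
uuid_first.save(b)
uuid_second.save(b)
```

In this toy version the UUID lookup is a dictionary access; in a file-backed store the corresponding search is presumably more expensive, which would be the overhead mentioned above.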
The benefit is that you can load/save and analyze objects independent of the storage. I will add the possibility to have a split storage. This way you can also run jobs independently and later join the data to one big dataset while maintaining the connections between objects.
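As a rough sketch of why UUID references survive splitting and joining, consider storages modelled as plain dictionaries keyed by UUID. The functions `new_record`, `split`, `join` and `dangling` are hypothetical helpers written for this example, not the utility functions of this PR, and the record layout is an assumption.

```python
import uuid


def new_record(kind, refs=()):
    """Hypothetical record: a kind label plus UUIDs of referenced records."""
    return {"uuid": uuid.uuid4(), "kind": kind, "refs": tuple(refs)}


def split(storage, kinds):
    """Return (selected, rest): records whose kind is in `kinds`, and the others."""
    selected = {k: r for k, r in storage.items() if r["kind"] in kinds}
    rest = {k: r for k, r in storage.items() if r["kind"] not in kinds}
    return selected, rest


def join(*storages):
    """Merge storages; an identical UUID means the same object, so duplicates collapse."""
    merged = {}
    for storage in storages:
        merged.update(storage)
    return merged


def dangling(storage):
    """List referenced UUIDs that cannot be resolved inside `storage`."""
    return [ref for rec in storage.values() for ref in rec["refs"]
            if ref not in storage]


s1 = new_record("snapshot")
s2 = new_record("snapshot")
traj = new_record("trajectory", refs=[s1["uuid"], s2["uuid"]])
storage = {rec["uuid"]: rec for rec in (s1, s2, traj)}

snapshots, rest = split(storage, {"snapshot"})
unresolved_before = dangling(rest)   # trajectory points at the other part
rejoined = join(snapshots, rest)
unresolved_after = dangling(rejoined)
```

After the split, the trajectory record still names its snapshots by UUID even though they live in the other part; joining the parts makes every reference resolvable again without rewriting any of them.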
Missing
Currently, stored and cached CVs cannot be saved in a file that contains no snapshots. The utility function will therefore only work for systems whose snapshots use kinetics and statics for storage. The split into two files works such that one file contains all trajectories, snapshots, statics and kinetics, while the other contains everything except statics and kinetics.
There is a utility function to split storages and a function to join them.
Move to other PRs