Output document for large MD output files #515
Comments
Copying across comments from @naik-aakash in #703.
@naik-aakash later highlighted that it is possible to store only a subset of the ionic steps. @JaGeo suggested setting a sensible default.
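As a minimal sketch of what storing only a subset of the ionic steps could look like: the helper below keeps every n-th step plus the final one. The function name `subsample_steps`, the parameter `store_every` and its default of 10 are hypothetical and not taken from atomate2 or emmet.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample_steps(steps: Sequence[T], store_every: int = 10) -> List[T]:
    """Keep every `store_every`-th step and always keep the final step.

    `store_every=10` is an illustrative default, not a project setting.
    """
    if store_every < 1:
        raise ValueError("store_every must be >= 1")
    kept = list(steps[::store_every])
    # The last configuration is usually needed to restart or analyse the run,
    # so append it if the stride did not already land on it.
    if len(steps) > 0 and (len(steps) - 1) % store_every != 0:
        kept.append(steps[-1])
    return kept
```

With 25 steps and the default stride this keeps steps 0, 10, 20 and 24.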
Thinking about it further:
Just an update on this: @gpetretto added an option to store the trajectory either as ionic steps or as a Trajectory object. If a Trajectory object is used, it will automatically be put in the datastore. See: materialsproject/emmet#886. One option would be to convert automatically from ionic steps to a Trajectory object depending on the size of the ionic_steps field?
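A size-based switch of this kind could be sketched roughly as follows. Everything here is an assumption for illustration: the function name, the 8 MiB threshold, and the use of the JSON-encoded length as a proxy for the stored size. It does not reflect how emmet actually decides.

```python
import json

# Hypothetical threshold, chosen to stay well below the 16 MB limit
# that MongoDB places on a single document.
MAX_INLINE_BYTES = 8 * 1024 * 1024


def choose_trajectory_storage(ionic_steps, max_inline_bytes=MAX_INLINE_BYTES):
    """Return "ionic_steps" for small runs and "trajectory" for large ones.

    The JSON-encoded length is only a rough proxy for the stored size.
    """
    size = len(json.dumps(ionic_steps).encode("utf-8"))
    return "ionic_steps" if size <= max_inline_bytes else "trajectory"
```

Note that measuring the size this way serializes the whole field once, so it does not remove the memory cost of parsing long runs.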
I think that one of the main issues, even with the solution that I have implemented, is that it requires reading the full output and creating a trajectory object in memory. With ML force fields (be it VASP ML or ASE + ML force fields) the simulation time can be extremely long and the number of final configurations huge. If storing a whole file as an output is an option, I also believe that it would be better to have a true file store for this purpose. The current file-like stores present in maggma (e.g.
vaspout.h5 and vasprun.xml might be comparable in size for long MD runs, but I'm not familiar enough with vaspout.h5, and it's only supported in VASP 6.x when an optional compiler flag is specified. Still, we probably should add vaspout.h5 support to pymatgen, since there's none at present. For forcefield MD, one only needs the structures (+ velocities + temperatures) to establish the trajectory. That's unfortunately the majority of the data stored in a trajectory, but it removes the need for storing energies, forces, and stresses.
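To get a feel for the sizes involved, here is a back-of-the-envelope estimate of the raw storage for positions, optional velocities and one temperature per frame. The function name and the assumptions (fixed cell, 8-byte floats, no compression, no metadata) are illustrative only.

```python
def trajectory_bytes(n_frames, n_atoms, with_velocities=True, bytes_per_float=8):
    """Estimate raw bytes for positions (+ velocities) and one temperature per frame.

    Assumes a fixed cell, so the lattice is not stored per frame, and
    ignores compression, species labels and other metadata.
    """
    vectors_per_atom = 2 if with_velocities else 1
    floats_per_frame = n_atoms * 3 * vectors_per_atom + 1  # +1 for temperature
    return n_frames * floats_per_frame * bytes_per_float


# One million frames of a 1000-atom cell, positions and velocities:
print(trajectory_bytes(1_000_000, 1000) / 1e9, "GB")
```

Under these assumptions a million frames of a 1000-atom system already take about 48 GB, or about 24 GB with positions only.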
Just to clarify: while I think that it would be beneficial for many reasons, I understand that switching to parsing the hdf5 file for all the outputs would require a considerable amount of work, and it is not what I had in mind.
Definitely agree with @gpetretto. The pymatgen Trajectory object and its serialized version are (more or less) fine for relatively small structures and short MD (i.e. what you do with plain ab initio MD) but become unusable for larger MD runs. As @gpetretto mentioned, any time you want to perform a long MD run (e.g. with the VASP on-the-fly ML feature, or with an ML or classical force field), parsing and storing the result as a Trajectory will be a pain, if it is possible at all. In such cases a direct file-based approach is inevitable, and most likely it will initially be code dependent. In the long term it might be possible to have a common format (and some job performing the conversion for codes which do not support it), but I would not count on that from the beginning. In any case, the possibility of storing plain files would also be very useful for other types of files (e.g. in abinit, there are binary files that we'd rather store as plain files instead of data/JSON objects).
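To illustrate the file-based idea in general terms, the sketch below writes frames to disk one at a time and reads them back lazily, so the full trajectory never has to sit in memory. The JSON-lines format and the names `write_frames` and `read_frames` are assumptions made for this example; they are not an existing maggma, pymatgen or atomate2 API, and a real implementation would more likely use a binary format.

```python
import json
from typing import Iterable, Iterator


def write_frames(path, frames: Iterable[dict]) -> int:
    """Write one JSON object per line; only one frame is held in memory at a time."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for frame in frames:
            fh.write(json.dumps(frame) + "\n")
            count += 1
    return count


def read_frames(path, stride: int = 1) -> Iterator[dict]:
    """Lazily yield every `stride`-th frame from a file written by write_frames."""
    with open(path, encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            if i % stride == 0:
                yield json.loads(line)
```

Because `frames` can be a generator fed directly by the parser or the MD driver, the writer's memory use stays flat regardless of run length.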
This is currently our solution in
A related question: when running a multi-step classical MD workflow without high-throughput software, the directory structure would ideally look something like this:
Here, each stage has its own trajectory and metadata, making it easy to inspect manually. Generating this sort of file structure would be very convenient for running a few simulations, especially for folks new to atomate2 who don't want to use automated builders. I'm curious whether other multi-step MD workflows support this, or whether it would be desirable?
I would like to start a discussion about how to handle the outputs in cases where the generated output files are very large. To be more explicit, I am mainly referring to long MD simulations with VASP, run either with pure DFT or with an ML-FF generated by active learning.
In these cases the parsing of the outputs usually performed in TaskDoc.from_directory() could be rather expensive in terms of time and allocated memory. Here are a few points that may be worth considering:
Trajectory using from_structures. Would it be worth finding a more optimized way to extract the Trajectory alone, without going through the allocation of all the Structure objects?
IonicStep is generated, and would likely not make much sense to have it in the final document. How to handle the output data in that case?
Given that the current parsing suffers from the above limitations, what could be an acceptable improvement? Define an ad hoc output document for MD in general? An output document specific to ML-FF calculations with VASP? Or just leave things as they are for standard MD and try improving the current TaskDoc to properly handle ML-FF simulations, adding some fields and leaving others empty?
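On the question of extracting the trajectory data without allocating per-step Structure objects, one possible direction is to copy raw coordinates straight into a flat numeric buffer. The sketch below uses only the standard library; the names `collect_coords` and `frame` are hypothetical, and it assumes a fixed number of atoms and that the parser can yield plain per-step coordinate lists.

```python
from array import array


def collect_coords(steps, n_atoms):
    """Pack per-step coordinates into one flat array of doubles.

    `steps` is any iterable yielding, per frame, a list of n_atoms [x, y, z]
    triples. No intermediate per-frame objects are kept.
    """
    flat = array("d")
    n_frames = 0
    for coords in steps:
        if len(coords) != n_atoms:
            raise ValueError("frame %d has %d atoms, expected %d"
                             % (n_frames, len(coords), n_atoms))
        for xyz in coords:
            flat.extend(xyz)
        n_frames += 1
    return flat, n_frames


def frame(flat, n_atoms, i):
    """Return frame `i` as a list of [x, y, z] lists."""
    start = i * n_atoms * 3
    return [list(flat[start + 3 * a: start + 3 * a + 3]) for a in range(n_atoms)]
```

The same layout maps directly onto a NumPy array of shape (n_frames, n_atoms, 3) if NumPy is available.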