Output document for large MD output files #515

gpetretto · 2023-09-09T09:16:10Z

I would like to start a discussion about how to handle the outputs when the generated output files are very large. To be more explicit, I am mainly referring to long MD simulations with VASP, run either with pure DFT or with an ML-FF generated by active learning.
In these cases the parsing of the outputs usually performed in TaskDoc.from_directory() can be rather expensive in terms of time and allocated memory. Here are a few points that may be worth considering:

For an MD run, is it worth parsing all the data that are parsed for a simple SCF or relaxation (e.g. all the content of the OUTCAR and vasprun.xml, including all ionic and electronic steps)? This matters both for the time required to parse and for the space occupied in the DB.
The main output of an MD run is likely the Trajectory, where only the coordinates are stored. Currently all the structures are instantiated first and then used to build the Trajectory using from_structures. Would it be worth finding a more optimized way to extract the Trajectory alone, without allocating all the Structure objects?
VASP optionally produces an hdf5 output file. Would it be better and possible to use that to parse the outputs, if present, instead of the other text files? It should be way faster and easier to deal with.
The data in the vasprun.xml produced by ML-FF in VASP differ from the standard ones. For example, no equivalent of the IonicStep is generated, and it would likely not make much sense to have it in the final document. How should the output data be handled in that case?
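To make the second point concrete, here is a minimal sketch of extracting only the positions from a vasprun.xml in a streaming fashion, without building a Structure per step. It assumes the standard layout in which each ionic step is a `<calculation>` element containing `structure/varray[@name='positions']`; as noted above, ML-FF runs may not follow this layout. The function name `iter_positions` is hypothetical and not part of pymatgen or emmet.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, List


def iter_positions(source) -> Iterator[List[List[float]]]:
    """Yield the fractional coordinates of each ionic step, one step at a time.

    `source` is a file name or a binary file object. Assumes (hypothetically
    for ML-FF runs) the standard <calculation>/<structure>/<varray> nesting.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "calculation":
            continue
        varray = elem.find("structure/varray[@name='positions']")
        if varray is not None:
            yield [[float(x) for x in v.text.split()] for v in varray.findall("v")]
        # Free the children of this step so memory use stays roughly constant.
        elem.clear()
```

Because the generator yields one frame at a time, a caller could write each frame straight to a file or an array instead of holding the whole trajectory in memory.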

Given that the current parsing suffers from the above limitations, what could be an acceptable improvement? Define an ad hoc output document for MD in general? An output document specific to ML-FF calculations with VASP? Or just leave things as they are for standard MD and try improving the current TaskDoc to properly handle ML-FF simulations, adding some fields and leaving others empty?


utf · 2024-02-15T10:49:59Z

Copying across comments from @naik-aakash in #703.

Relax jobs using forcefields fail because the forcefield taskdoc stores all the ionic steps, which pushes the document over the size limit

Steps to reproduce

Run relax jobs using mace forcefield with Fire optimizer for a cell containing more than 1000 atoms with a force requirement of 0.0001

Possible solution

Modify the default interval arg of the relax maker from 1 to 100 or a larger number (this should work: we tested ASE's LBFGS optimizer instead of the default Fire optimizer, and the same job then runs without any issue because LBFGS stores 100 steps at most)

Add an arg to allow users to skip storing ionic step data in the taskdoc
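As an illustration of the first suggestion, the sketch below keeps every `interval`-th ionic step and always retains the final one, since that step holds the relaxed geometry. The helper `subsample_steps` is hypothetical and only shows the idea; it is not how the relax maker's interval argument is implemented.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample_steps(steps: Sequence[T], interval: int) -> List[T]:
    """Keep every `interval`-th step and always keep the final step."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    kept = list(steps[::interval])
    # Append the last step when the stride did not already land on it.
    if steps and (len(steps) - 1) % interval != 0:
        kept.append(steps[-1])
    return kept
```

With 250 steps and an interval of 100, this keeps steps 0, 100, 200 and 249.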

Any thoughts on this or other ideas to tackle this issue are welcome.

@naik-aakash later highlighted that only storing a subset of the ionic steps is possible.

@JaGeo suggested setting a sensible default.

JaGeo · 2024-02-15T11:00:15Z

Thinking about it further:
having an option to store all ionic steps, even for long optimizations, could also be interesting as potential training data for pre-training other machine learning models. Thus, I would now vote for an option where we can skip this or put these trajectories in a (data) store. In any case, we would need to solve this for any kind of MD application with forcefields via ase (currently completely missing as a job).

utf · 2024-02-16T12:43:41Z

Just an update on this, but @gpetretto added an option to store the trajectory as ionic steps or a Trajectory object. If a trajectory object is used, this will automatically get put in the datastore. See: materialsproject/emmet#886

One option would be to automatically convert from ionic steps to trajectory object depending on the size of the ionic_steps field?
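A rough sketch of such an automatic switch, assuming the ionic steps are available as JSON-serializable dictionaries. The JSON length is only a proxy for the BSON size, and the function names and the safety margin of half the 16 MB MongoDB document limit are assumptions made for illustration.

```python
import json
from typing import Any, Dict, List

BSON_LIMIT_BYTES = 16 * 1024 * 1024  # MongoDB's maximum BSON document size


def estimated_size_bytes(ionic_steps: List[Dict[str, Any]]) -> int:
    """Approximate the stored size by the length of the JSON encoding."""
    return len(json.dumps(ionic_steps).encode("utf-8"))


def choose_storage(
    ionic_steps: List[Dict[str, Any]],
    limit_bytes: int = BSON_LIMIT_BYTES // 2,  # assumed safety margin
) -> str:
    """Return "trajectory" when the steps are too large to store inline."""
    if estimated_size_bytes(ionic_steps) > limit_bytes:
        return "trajectory"
    return "ionic_steps"
```

Note that this check still requires the steps to be in memory, so it addresses the database limit but not the parsing cost.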

gpetretto · 2024-02-16T16:02:42Z

I think that one of the main issues, even with the solution that I have implemented, is that it requires reading the full output and creating a trajectory object in memory. With ML force fields (be it with VASP ML or ASE+ML force fields) the simulation time can be extremely long and the number of final configurations huge.
I have seen that people can easily generate a vasprun.xml of ~10GB, which takes a very long time and a large amount of memory to parse. I suppose similar cases with very large output files may happen for the ASE trajectory generated with ML forcefields (#722) and with OpenMM flows (#717) as well.
Under such conditions wouldn't it be better to directly store the output in a raw format? Just as an example, for VASP it could be the HDF5 output that would also offer a much faster parsing, for ML forcefields it could be the ASE Trajectory object.
Maybe @esoteric-ephemera and @orionarcher have some comment on this, having worked on MD workflows?

If storing a whole file as an output is an option, I also believe that it would be better to have a true file store for this purpose. The current file-like stores present in maggma (e.g. S3Store, GridFSStore) require reading the file from disk and inserting it in a dictionary that is then passed as input to the Store. This is necessary to comply with the generic Store API. However, in this way they do not take advantage of the option of streaming the file directly to the service, without reading it first. This option would be advantageous if one needs to upload the content of large files. We discussed this with @davidwaroquiers and we believe it could be interesting to define a different kind of file store base object for these purposes, with a different API compared to the maggma Store, for example one exposing methods like put and get.
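To illustrate what a put/get style interface could look like, here is a minimal sketch with a local-directory backend that copies files in chunks rather than loading them fully into memory. The class names `FileStore` and `LocalFileStore` and their signatures are hypothetical; they are not part of maggma or jobflow.

```python
import shutil
from abc import ABC, abstractmethod
from pathlib import Path


class FileStore(ABC):
    """Hypothetical base class for stores that handle whole files."""

    @abstractmethod
    def put(self, src: Path, key: str) -> None:
        """Upload the file at `src` under the name `key`."""

    @abstractmethod
    def get(self, key: str, dest: Path) -> None:
        """Download the file stored under `key` to `dest`."""


class LocalFileStore(FileStore):
    """Toy backend that streams files into a local directory."""

    def __init__(self, root: Path, chunk_size: int = 1024 * 1024) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.chunk_size = chunk_size

    def put(self, src: Path, key: str) -> None:
        # copyfileobj moves the data chunk by chunk, never the whole file at once.
        with open(src, "rb") as fin, open(self.root / key, "wb") as fout:
            shutil.copyfileobj(fin, fout, self.chunk_size)

    def get(self, key: str, dest: Path) -> None:
        with open(self.root / key, "rb") as fin, open(dest, "wb") as fout:
            shutil.copyfileobj(fin, fout, self.chunk_size)
```

A remote backend would replace the local writes with the streaming upload and download calls of the corresponding service while keeping the same two methods.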

esoteric-ephemera · 2024-02-16T17:56:00Z

vaspout.h5 and vasprun.xml might be comparable in size for long MD runs, but I'm not familiar enough with vaspout.h5, and it's only supported in VASP 6.x when an optional compiler flag is specified

Still, we probably should add vaspout.h5 support to pymatgen, since there's none at present

For forcefield MD, one only needs the structures (+ velocities + temperatures) to establish the trajectory. Unfortunately that is the majority of the data stored in a trajectory, but it removes the need to store energies, forces, and stresses

gpetretto · 2024-02-19T12:19:44Z

Just to clarify, while I think that it would be beneficial for many reasons, I understand that switching to the parsing of the hdf5 file for all the outputs would require a considerable amount of work and it is not what I had in mind.
At this point I am more focused on issues that may be related to MD simulations, and in particular ML-MD. In that case I would consider either storing the vaspout.h5 directly or at least using the HDF5 output file to extract the trajectory and possibly store it as a file, rather than as a serialized json. The serialized json would likely require always loading the whole trajectory in memory.
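The memory argument can be shown with a toy example: when frames are stored as fixed-size binary records, a single frame can be read by seeking to its offset, whereas a serialized json has to be decoded as a whole. The format below is purely illustrative (it is neither the vaspout.h5 layout nor .dcd), and the function names are hypothetical.

```python
import struct
from typing import BinaryIO, List, Sequence

HEADER = struct.Struct("<II")  # number of frames, number of atoms


def write_frames(fh: BinaryIO, frames: Sequence[Sequence[Sequence[float]]]) -> None:
    """Write all frames as little-endian doubles after a small header."""
    n_atoms = len(frames[0])
    fh.write(HEADER.pack(len(frames), n_atoms))
    for frame in frames:
        flat = [coord for atom in frame for coord in atom]
        fh.write(struct.pack(f"<{3 * n_atoms}d", *flat))


def read_frame(fh: BinaryIO, index: int) -> List[List[float]]:
    """Read one frame by seeking to it, without touching the other frames."""
    fh.seek(0)
    n_frames, n_atoms = HEADER.unpack(fh.read(HEADER.size))
    if not 0 <= index < n_frames:
        raise IndexError(index)
    frame_bytes = 3 * n_atoms * 8  # three doubles of 8 bytes per atom
    fh.seek(HEADER.size + index * frame_bytes)
    flat = struct.unpack(f"<{3 * n_atoms}d", fh.read(frame_bytes))
    return [list(flat[i:i + 3]) for i in range(0, len(flat), 3)]
```

HDF5 datasets offer the same kind of partial reads with far more structure, which is one reason a file-based trajectory is attractive here.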

davidwaroquiers · 2024-02-20T10:00:29Z

Definitely agree with @gpetretto. The pymatgen Trajectory object and its serialized version are (more or less) fine for relatively small structures and MD (i.e. what you do with plain ab initio MD) but become unusable when going to larger MD. As @gpetretto mentioned, any time you want to perform a long MD (e.g. with the VASP on-the-fly ML feature or with an ML or classical force field), parsing and storing as a Trajectory will be a pain, if possible at all. In such cases a direct file-based approach is inevitable, and most likely it will initially be code dependent. In the long term it might be possible to have a common format (and some job performing the conversion for codes which do not support it) but I would not count on that from the beginning. Anyway, the possibility to store plain files would also be very useful for other types of files (e.g. in abinit, there are binary files that we'd rather store as plain files instead of data/json objects).

orionarcher · 2024-02-20T18:35:43Z

Quoting @gpetretto: "The current file-like stores present in maggma (e.g. S3Store, GridFSStore) require reading the file from disk, inserting in a dictionary that is then passed as input to the Store."

This is currently our solution in atomate2.openmm. We use OpenMM to write the trajectory in binary .dcd format, serialize that trajectory to a Python bytes object, and then store that in an S3 bucket. Our trajectories often contain tens of thousands of atoms over many hundreds of timesteps, making it infeasible to save them in an .xml or a pandas.DataFrame. I would love to have a cleaner way to stream files to S3, as @gpetretto suggests.

A related question. When running a multi-step classical MD workflow without high-throughput software the directory structure would ideally look something like this:

my_md_flow/
    0_energy_minimization/
        trajectory.dcd
        simulation_info.json
    1_pressure_equilibration/
        trajectory.dcd
        simulation_info.json
    2_annealing/
        trajectory.dcd
        simulation_info.json
    3_production/
        trajectory.dcd
        simulation_info.json

Here, each stage has its own trajectory and metadata, making it easy to inspect manually. Generating this sort of file structure would be very convenient for running a few simulations, especially for folks new to atomate2 who don't want to use automated builders. I'm curious whether other multi-step MD workflows support this, or whether it would be desirable?
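For reference, a layout like the one above could be generated with a few lines of standard-library Python. This is only a sketch: `create_stage_dirs` is a hypothetical helper, not an atomate2 function, and it writes just the metadata file, leaving the trajectory to the MD code.

```python
import json
from pathlib import Path
from typing import Dict, List


def create_stage_dirs(root: Path, stages: List[str], info: Dict[str, dict]) -> List[Path]:
    """Create one numbered directory per stage with its simulation_info.json."""
    created = []
    for index, name in enumerate(stages):
        stage_dir = Path(root) / f"{index}_{name}"
        stage_dir.mkdir(parents=True, exist_ok=True)
        # Stages without metadata get an empty JSON object.
        metadata = info.get(name, {})
        (stage_dir / "simulation_info.json").write_text(json.dumps(metadata, indent=2))
        created.append(stage_dir)
    return created
```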

Zhuoying added the discussion API design discussions label Oct 17, 2023

gpetretto mentioned this issue Oct 26, 2023

[Feature Request]: Better handling of parsed trajectory in VASP calculations materialsproject/emmet#872

Open

utf mentioned this issue Feb 15, 2024

BUG: Forcefield relaxmaker jobs fails due to BSONObj size exceeds 16MB for large structures #703

Closed

JaGeo mentioned this issue Feb 16, 2024

Forcefield molecular dynamics and forcefield refactor #722

Merged

This was referenced Feb 26, 2024

Overhead from filewriting Matgenix/jobflow-remote#79

Open

Dealing with large structures in the phonon workflow for forcefields #754

Open

gpetretto mentioned this issue Mar 15, 2024

Using files from other jobs Matgenix/jobflow-remote#94

Open

gpetretto mentioned this issue Apr 10, 2024

Store files as output materialsproject/jobflow#585

Draft
