Created HDF5 based dataset #227

peastman · 2022-06-27T22:30:09Z

This is a rough draft of the HDF5 based dataset discussed in #214. It can handle arbitrarily large datasets, even ones that are too large to fit in memory. Are you interested in developing this into a supported feature?

Description

I added a HDF5Dataset class that extends AtomicDataset. It loads data from one or more HDF5 files with the format described at https://github.com/torchmd/torchmd-net/blob/main/torchmdnet/datasets/hdf.py. The only data fields it currently supports are positions, atom types, energies, and forces. Others could easily be added.

There's a lot that could be cleaned up. I've only implemented the features that were needed for my own work. That's especially true of statistics(), which only computes a subset of the available statistics. It could be extended to compute the rest of them, but in the long term it might be better to rework the main statistics routine to support streaming calculations so it can work with any dataset class.

The class constructs AtomicData objects as needed in get(), which adds overhead. You can partly compensate for that by increasing dataloader_num_workers to use multiple processes.

Computing statistics on large datasets can be slow. I use dataset_statistics_stride to reduce the cost.

Motivation and Context

All of the existing dataset classes require all the data to be held in memory at once. This limits how much data can be used for training.
Resolves: #214.

How Has This Been Tested?

I used it to train a model and it generally seems to work, but I haven't done extensive testing or written unit tests yet.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds or improves functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation improvement (updates to user guides, docstrings, or developer docs)

Checklist:

My code follows the code style of this project and has been formatted using black.
All new and existing tests passed, including on GPU (if relevant).
I have added tests that cover my changes (if relevant).
The option documentation (docs/options) has been updated with new or changed options.
I have updated CHANGELOG.md.
I have updated the documentation (if relevant).

Linux-cpp-lisp · 2022-06-29T05:33:16Z

Hi @peastman ,

Thanks very much for posting this! I am away from code development right now and unfortunately will not be able to take a closer look for a little while, but at a glance this looks like a promising start.

You can partly compensate for that by increasing dataloader_num_workers to use multiple processes.

Did you find this to be true in practice? I've had trouble getting anything but slowdowns from adding in the multiprocessing.

peastman · 2022-06-29T16:02:19Z

Thanks! There's no hurry. Whenever you get around to looking at it.

I did find some benefit to using more workers. My GPU utilization increased from 45% with one worker to 55% with three workers. By increasing the batch size I was able to increase it further, up to about 85%.

peastman · 2023-03-02T22:41:47Z

Can I bump this issue? The inability to work with large datasets is really a showstopper for me.

Linux-cpp-lisp · 2023-03-03T22:49:49Z

nequip/data/_dataset/_hdf5_dataset.py

+        self.file_name = file_name
+        self.r_max = extra_fixed_fields["r_max"]
+        self.index = None
+        self.num_molecules = 0


-> num_frames

nequip/data/_dataset/_hdf5_dataset.py

Linux-cpp-lisp · 2023-03-03T23:00:30Z

nequip/data/_dataset/_hdf5_dataset.py

+                if mode == "rms":
+                    total += np.sum(values * values)
+                    count += len(values.flatten())
+                else:
+                    length = len(values.flatten())
+                    if mode == "per_atom_mean_std":
+                        values /= len(data[0][i])
+                    sample_mean = np.mean(values)
+                    new_mean = (total[0] * count + sample_mean * length) / (
+                        count + length
+                    )
+                    total[1] += (
+                        length * (sample_mean - total[0]) * (sample_mean - new_mean)
+                    )
+                    total[0] = new_mean
+                    count += length


See https://github.com/mir-group/pytorch_runstats / https://pytorch-runstats.readthedocs.io/en/latest/index.html

which can do this for rms and mean already with unit tests and correct numerics. We already depend on this library.

Easy to do an OK std with two-pass using this (https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Two-pass_algorithm) or to implement Welford's online algorithm into torch_runstats maybe (https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm)?

Or can just return NaN for the std if you only ever use the mean for now.

Linux-cpp-lisp · 2023-03-03T23:06:32Z

Hi @peastman ,

I've just brought this branch up-to-date with develop and would be very happy to get it merged for the next release. I've left a few comments.

rework the main statistics routine to support streaming calculations so it can work with any dataset class.

For the simple subset of statistics implemented here, I think this can be done without too much work. The only problem is that it would call .get() repeatedly to stream over AtomicData objects so the overhead might be high with your neighborlist calculation.

I think the best solution for that is to go to the root of the problem: if you want to implement an alternative neighborlist backend (torch-nl or matscipy?) that could solve it.

But as long as this implementation correctly errors out on cases it does not support, we can also just merge it for now and come back to it later.

nequip/data/_dataset/_hdf5_dataset.py

peastman · 2023-03-03T23:12:33Z

Thanks! I'll try to get the changes in shortly.

Linux-cpp-lisp · 2023-03-03T23:27:15Z

Great, thanks @peastman !

peastman · 2023-03-03T23:54:08Z

After the updates, I now get an exception when I try to train a model with a HDF5 dataset. Here's the output.

Number of weights: 254968
Number of trainable weights: 254968
! Starting training ...
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1660070785140/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [0,0,0], thread: [72,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1660070785140/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [0,0,0], thread: [80,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.

and then after many repetitions of similar lines,

Traceback (most recent call last):
  File "/home/peastman/miniconda3/envs/torchmd-net/bin/nequip-train", line 33, in <module>
    sys.exit(load_entry_point('nequip', 'console_scripts', 'nequip-train')())
  File "/home/peastman/workspace/nequip/nequip/scripts/train.py", line 102, in main
    trainer.train()
  File "/home/peastman/workspace/nequip/nequip/train/trainer.py", line 779, in train
    self.epoch_step()
  File "/home/peastman/workspace/nequip/nequip/train/trainer.py", line 912, in epoch_step
    self.batch_step(
  File "/home/peastman/workspace/nequip/nequip/train/trainer.py", line 809, in batch_step
    out = self.model(data_for_loss)
  File "/home/peastman/miniconda3/envs/torchmd-net/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/peastman/workspace/nequip/nequip/nn/_graph_model.py", line 112, in forward
    data = self.model(new_data)
  File "/home/peastman/miniconda3/envs/torchmd-net/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/peastman/workspace/nequip/nequip/nn/_rescale.py", line 144, in forward
    data = self.model(data)
  File "/home/peastman/miniconda3/envs/torchmd-net/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/peastman/workspace/nequip/nequip/nn/_grad_output.py", line 310, in forward
    data = self.func(data)
  File "/home/peastman/miniconda3/envs/torchmd-net/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/peastman/workspace/nequip/nequip/nn/_graph_mixin.py", line 356, in forward
    input = module(input)
  File "/home/peastman/miniconda3/envs/torchmd-net/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/peastman/workspace/nequip/nequip/nn/embedding/_one_hot.py", line 41, in forward
    one_hot = torch.nn.functional.one_hot(
RuntimeError: CUDA error: device-side assert triggered

Any idea what would cause that?

Linux-cpp-lisp · 2023-03-04T00:01:15Z

Yeah we've seen this before when the there are atom indexes in the edge_index that are larger than the number of atoms. (Or more generally, when there are invalid indexes in edge_index or possibly batch.)

peastman · 2023-03-04T00:17:05Z

Where would the invalid indices come from? Nothing in this class sets them. The AtomicData objects get created like this:

        args = {
            "pos": data[1][i],
            "r_max": self.r_max,
            AtomicDataDict.ATOM_TYPE_KEY: data[0][i],
            AtomicDataDict.TOTAL_ENERGY_KEY: data[2][i],
        }
        if self.has_forces:
            args[AtomicDataDict.FORCE_KEY] = data[3][i]
        return AtomicData.from_points(**args)

peastman · 2023-03-07T01:10:28Z

I made most of the changes, and also added unit tests. The one thing I haven't done is switch statistics() to use pytorch_runstats, since it doesn't support standard deviation. If that feature is added to it, it should be easy to make statistics() use it. In the mean time, all the tests in TestStatistics are now being run on it to verify the results are correct.

peastman · 2023-03-08T19:37:53Z

The tests are failing because h5py isn't installed in the test environment. I can fix it in either of two ways. 1) Add h5py as a dependency in setup.py so it will always get installed, or 2) make it skip those tests if it isn't installed. Which do you prefer? I suggest 1, but I don't want to add a dependency without checking first.

Linux-cpp-lisp · 2023-03-08T20:36:59Z

The one thing I haven't done is switch statistics() to use pytorch_runstats, since it doesn't support standard deviation. If that feature is added to it, it should be easy to make statistics() use it.

Online one-pass stddev might be tricky, but I can look later at how bad it would be to implement Welford's algorithm in the simple case.

The tests are failing because h5py isn't installed in the test environment. I can fix it in either of two ways. 1) Add h5py as a dependency in setup.py so it will always get installed, or 2) make it skip those tests if it isn't installed. Which do you prefer? I suggest 1, but I don't want to add a dependency without checking first.

I don't want to add it as a mandatory dependency, so why don't we skipif the tests on h5py but also just add a pip install h5py to the test workflow YAMLs? That way it still gets tested. (My understanding is that h5py isn't pure Python, which means inevitably on some clusters installing it will be a unnecessary mess.)

peastman · 2023-03-08T20:48:38Z

It already uses Welford's algorithm. It's mainly a matter of upstreaming the code to pytorch_runstats, though that will also add some complexity since pytorch_runstats supports features that aren't needed here.

why don't we skipif the tests on h5py but also just add a pip install h5py to the test workflow YAMLs?

Sounds good.

Linux-cpp-lisp · 2023-03-09T00:44:30Z

It already uses Welford's algorithm. It's mainly a matter of upstreaming the code to pytorch_runstats, though that will also add some complexity since pytorch_runstats supports features that aren't needed here.

Ah I see, great--- you mean accumulate_by and reduce_dims? Reduce dims should be easy and we can just not support accumulate_by for std for now, just throw NotImplementedError, I think that's fine.

peastman · 2023-03-14T20:26:55Z

This should be ready to merge, unless you need other changes first.

Linux-cpp-lisp · 2023-03-15T14:26:45Z

Looks good to me, no need to wait on statistics stuff to do a first merge--- I'll get this merged later today. Thanks @peastman !

Linux-cpp-lisp · 2023-03-17T20:27:10Z

Merged! If I manage to get around to implementing Welford's into runstats, we can revisit migrating to that and making this a "first class" dataset in terms of feature support in another PR later.

Thanks for your efforts on this @peastman !

peastman · 2023-03-17T22:05:39Z

Thank you!

Added HDF5 dataset

70759ee

Linux-cpp-lisp added 2 commits March 3, 2023 17:18

Merge branch 'develop' into hdf5

391413c

refactor

54029bb

Linux-cpp-lisp reviewed Mar 3, 2023

View reviewed changes

nequip/data/_dataset/_hdf5_dataset.py Show resolved Hide resolved

peastman added 3 commits March 3, 2023 17:45

Updates based on review comments

726fccd

Bug fixes

cc304e4

Test cases for HDF5Dataset

01f29f0

peastman and others added 2 commits March 7, 2023 15:04

Fixed formatting

8ba9763

Merge branch 'develop' into hdf5

c572d1d

Fixed test failures

cbe036c

Linux-cpp-lisp approved these changes Mar 15, 2023

View reviewed changes

changelog

747e39a

Linux-cpp-lisp merged commit 3d14cbe into mir-group:develop Mar 17, 2023

peastman deleted the hdf5 branch March 17, 2023 22:05

Linux-cpp-lisp mentioned this pull request May 10, 2024

0.6.0 #429

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Created HDF5 based dataset #227

Created HDF5 based dataset #227

peastman commented Jun 27, 2022 •

edited by Linux-cpp-lisp

Loading

Linux-cpp-lisp commented Jun 29, 2022

peastman commented Jun 29, 2022

peastman commented Mar 2, 2023

Linux-cpp-lisp Mar 3, 2023

Linux-cpp-lisp Mar 3, 2023

Linux-cpp-lisp commented Mar 3, 2023 •

edited

Loading

peastman commented Mar 3, 2023

Linux-cpp-lisp commented Mar 3, 2023

peastman commented Mar 3, 2023

Linux-cpp-lisp commented Mar 4, 2023 •

edited

Loading

peastman commented Mar 4, 2023

peastman commented Mar 7, 2023

peastman commented Mar 8, 2023

Linux-cpp-lisp commented Mar 8, 2023 •

edited

Loading

peastman commented Mar 8, 2023

Linux-cpp-lisp commented Mar 9, 2023 •

edited

Loading

peastman commented Mar 14, 2023

Linux-cpp-lisp commented Mar 15, 2023

Linux-cpp-lisp commented Mar 17, 2023

peastman commented Mar 17, 2023

Created HDF5 based dataset #227

Created HDF5 based dataset #227

Conversation

peastman commented Jun 27, 2022 • edited by Linux-cpp-lisp Loading

Description

Motivation and Context

How Has This Been Tested?

Types of changes

Checklist:

Linux-cpp-lisp commented Jun 29, 2022

peastman commented Jun 29, 2022

peastman commented Mar 2, 2023

Linux-cpp-lisp Mar 3, 2023

Choose a reason for hiding this comment

Linux-cpp-lisp Mar 3, 2023

Choose a reason for hiding this comment

Linux-cpp-lisp commented Mar 3, 2023 • edited Loading

peastman commented Mar 3, 2023

Linux-cpp-lisp commented Mar 3, 2023

peastman commented Mar 3, 2023

Linux-cpp-lisp commented Mar 4, 2023 • edited Loading

peastman commented Mar 4, 2023

peastman commented Mar 7, 2023

peastman commented Mar 8, 2023

Linux-cpp-lisp commented Mar 8, 2023 • edited Loading

peastman commented Mar 8, 2023

Linux-cpp-lisp commented Mar 9, 2023 • edited Loading

peastman commented Mar 14, 2023

Linux-cpp-lisp commented Mar 15, 2023

Linux-cpp-lisp commented Mar 17, 2023

peastman commented Mar 17, 2023

peastman commented Jun 27, 2022 •

edited by Linux-cpp-lisp

Loading

Linux-cpp-lisp commented Mar 3, 2023 •

edited

Loading

Linux-cpp-lisp commented Mar 4, 2023 •

edited

Loading

Linux-cpp-lisp commented Mar 8, 2023 •

edited

Loading

Linux-cpp-lisp commented Mar 9, 2023 •

edited

Loading