
Created HDF5 based dataset #227

Merged 10 commits into mir-group:develop on Mar 17, 2023
Conversation

@peastman (Contributor) commented Jun 27, 2022

This is a rough draft of the HDF5 based dataset discussed in #214. It can handle arbitrarily large datasets, even ones that are too large to fit in memory. Are you interested in developing this into a supported feature?

Description

I added an HDF5Dataset class that extends AtomicDataset. It loads data from one or more HDF5 files with the format described at https://github.com/torchmd/torchmd-net/blob/main/torchmdnet/datasets/hdf.py. The only data fields it currently supports are positions, atom types, energies, and forces. Others could easily be added.
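For concreteness, here is a rough sketch of writing a file in that layout. The group and dataset names and shapes are assumptions based on the linked hdf.py, which remains the authoritative description of the format:

    import h5py
    import numpy as np

    # Hypothetical example of the assumed layout: one HDF5 group per molecule,
    # holding "types", "pos", "energy", and (optionally) "forces" datasets.
    n_frames, n_atoms = 100, 20
    with h5py.File("train.h5", "w") as f:
        mol = f.create_group("molecule_0")
        mol["types"] = np.zeros(n_atoms, dtype=np.int64)       # atom types
        mol["pos"] = np.random.randn(n_frames, n_atoms, 3)     # positions
        mol["energy"] = np.random.randn(n_frames, 1)           # total energies
        mol["forces"] = np.random.randn(n_frames, n_atoms, 3)  # optional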

There's a lot that could be cleaned up. I've only implemented the features that were needed for my own work. That's especially true of statistics(), which only computes a subset of the available statistics. It could be extended to compute the rest of them, but in the long term it might be better to rework the main statistics routine to support streaming calculations so it can work with any dataset class.

The class constructs AtomicData objects as needed in get(), which adds overhead. You can partly compensate for that by increasing dataloader_num_workers to use multiple processes.

Computing statistics on large datasets can be slow. I use dataset_statistics_stride to reduce the cost.
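For reference, a minimal sketch of how these knobs might look in a training config (the option names are the ones mentioned above; the values are illustrative only):

    dataloader_num_workers: 3       # extra worker processes for get() overhead
    dataset_statistics_stride: 10   # only sample every 10th frame for statistics
    batch_size: 16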

Motivation and Context

All of the existing dataset classes require all the data to be held in memory at once. This limits how much data can be used for training.
Resolves: #214.

How Has This Been Tested?

I used it to train a model and it generally seems to work, but I haven't done extensive testing or written unit tests yet.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds or improves functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation improvement (updates to user guides, docstrings, or developer docs)

Checklist:

  • My code follows the code style of this project and has been formatted using black.
  • All new and existing tests passed, including on GPU (if relevant).
  • I have added tests that cover my changes (if relevant).
  • The option documentation (docs/options) has been updated with new or changed options.
  • I have updated CHANGELOG.md.
  • I have updated the documentation (if relevant).

@Linux-cpp-lisp (Collaborator)

Hi @peastman ,

Thanks very much for posting this! I am away from code development right now and unfortunately will not be able to take a closer look for a little while, but at a glance this looks like a promising start.

> You can partly compensate for that by increasing dataloader_num_workers to use multiple processes.

Did you find this to be true in practice? I've had trouble getting anything but slowdowns from adding in the multiprocessing.

@peastman (Contributor, Author)

Thanks! There's no hurry. Whenever you get around to looking at it.

I did find some benefit to using more workers. My GPU utilization increased from 45% with one worker to 55% with three workers. By increasing the batch size I was able to increase it further, up to about 85%.

@peastman (Contributor, Author) commented Mar 2, 2023

Can I bump this issue? The inability to work with large datasets is really a showstopper for me.

self.file_name = file_name
self.r_max = extra_fixed_fields["r_max"]
self.index = None
self.num_molecules = 0
Review comment from @Linux-cpp-lisp (Collaborator):

-> num_frames

(Three further review threads on nequip/data/_dataset/_hdf5_dataset.py were marked resolved.)
Comment on lines 111 to 126
if mode == "rms":
    # Accumulate the sum of squares; rms = sqrt(total / count) at the end.
    total += np.sum(values * values)
    count += len(values.flatten())
else:
    length = len(values.flatten())
    if mode == "per_atom_mean_std":
        # Normalize by the number of atoms in this frame.
        values /= len(data[0][i])
    # Chunked Welford-style update: merge this chunk's mean into the running
    # mean (total[0]) and update the accumulated squared deviations (total[1]).
    sample_mean = np.mean(values)
    new_mean = (total[0] * count + sample_mean * length) / (count + length)
    total[1] += length * (sample_mean - total[0]) * (sample_mean - new_mean)
    total[0] = new_mean
    count += length
Review comment from @Linux-cpp-lisp (Collaborator):

See https://github.com/mir-group/pytorch_runstats / https://pytorch-runstats.readthedocs.io/en/latest/index.html

which can do this for rms and mean already with unit tests and correct numerics. We already depend on this library.

It would be easy to get an OK std with a two-pass approach using this (https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Two-pass_algorithm), or maybe to implement Welford's online algorithm in torch_runstats (https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm)?

Or can just return NaN for the std if you only ever use the mean for now.
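For reference, a minimal sketch of Welford's online update in the scalar case (the textbook algorithm from the linked article, not the torch_runstats API):

    def welford_update(count, mean, m2, x):
        # One online step: fold sample x into the running mean and M2,
        # the running sum of squared deviations from the current mean.
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # note: uses the updated mean
        return count, mean, m2

    count, mean, m2 = 0, 0.0, 0.0
    for x in [1.0, 2.0, 4.0]:
        count, mean, m2 = welford_update(count, mean, m2, x)
    std = (m2 / count) ** 0.5  # population std; use (count - 1) for sample std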

@Linux-cpp-lisp (Collaborator) commented Mar 3, 2023

Hi @peastman ,

I've just brought this branch up-to-date with develop and would be very happy to get it merged for the next release. I've left a few comments.

> rework the main statistics routine to support streaming calculations so it can work with any dataset class.

For the simple subset of statistics implemented here, I think this can be done without too much work. The only problem is that it would call .get() repeatedly to stream over AtomicData objects, so the overhead might be high with your neighborlist calculation.

I think the best solution for that is to go to the root of the problem: if you want to implement an alternative neighborlist backend (torch-nl or matscipy?) that could solve it.

But as long as this implementation correctly errors out on cases it does not support, we can also just merge it for now and come back to it later.
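As a rough illustration of the alternative-backend idea above (assumed usage of matscipy's neighbour_list; not code from this PR):

    from ase.build import molecule
    from matscipy.neighbours import neighbour_list

    atoms = molecule("H2O")
    atoms.center(vacuum=10.0)  # give the molecule a cell for the neighbor search
    # i, j are sender/receiver atom indices for all pairs within the cutoff
    i, j = neighbour_list("ij", atoms, cutoff=5.0)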

@peastman (Contributor, Author) commented Mar 3, 2023

Thanks! I'll try to get the changes in shortly.

@Linux-cpp-lisp (Collaborator)

Great, thanks @peastman !

@peastman (Contributor, Author) commented Mar 3, 2023

After the updates, I now get an exception when I try to train a model with an HDF5 dataset. Here's the output.

Number of weights: 254968
Number of trainable weights: 254968
! Starting training ...
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1660070785140/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [0,0,0], thread: [72,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1660070785140/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [0,0,0], thread: [80,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.

and then after many repetitions of similar lines,

Traceback (most recent call last):
  File "/home/peastman/miniconda3/envs/torchmd-net/bin/nequip-train", line 33, in <module>
    sys.exit(load_entry_point('nequip', 'console_scripts', 'nequip-train')())
  File "/home/peastman/workspace/nequip/nequip/scripts/train.py", line 102, in main
    trainer.train()
  File "/home/peastman/workspace/nequip/nequip/train/trainer.py", line 779, in train
    self.epoch_step()
  File "/home/peastman/workspace/nequip/nequip/train/trainer.py", line 912, in epoch_step
    self.batch_step(
  File "/home/peastman/workspace/nequip/nequip/train/trainer.py", line 809, in batch_step
    out = self.model(data_for_loss)
  File "/home/peastman/miniconda3/envs/torchmd-net/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/peastman/workspace/nequip/nequip/nn/_graph_model.py", line 112, in forward
    data = self.model(new_data)
  File "/home/peastman/miniconda3/envs/torchmd-net/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/peastman/workspace/nequip/nequip/nn/_rescale.py", line 144, in forward
    data = self.model(data)
  File "/home/peastman/miniconda3/envs/torchmd-net/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/peastman/workspace/nequip/nequip/nn/_grad_output.py", line 310, in forward
    data = self.func(data)
  File "/home/peastman/miniconda3/envs/torchmd-net/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/peastman/workspace/nequip/nequip/nn/_graph_mixin.py", line 356, in forward
    input = module(input)
  File "/home/peastman/miniconda3/envs/torchmd-net/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/peastman/workspace/nequip/nequip/nn/embedding/_one_hot.py", line 41, in forward
    one_hot = torch.nn.functional.one_hot(
RuntimeError: CUDA error: device-side assert triggered

Any idea what would cause that?

@Linux-cpp-lisp (Collaborator) commented Mar 4, 2023

Yeah, we've seen this before when there are atom indexes in the edge_index that are larger than the number of atoms. (Or more generally, when there are invalid indexes in edge_index or possibly batch.)
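A quick hypothetical sanity check for this failure mode (assuming PyG-style data with pos and edge_index tensors):

    def check_edge_index(data):
        # Every entry of edge_index must address a real atom.
        num_atoms = data.pos.shape[0]
        assert int(data.edge_index.min()) >= 0, "negative atom index"
        assert int(data.edge_index.max()) < num_atoms, (
            "edge_index references a nonexistent atom"
        )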

@peastman (Contributor, Author) commented Mar 4, 2023

Where would the invalid indices come from? Nothing in this class sets them. The AtomicData objects get created like this:

        # data holds per-frame arrays: data[0] = atom types, data[1] = positions,
        # data[2] = energies, data[3] = forces (if present); i is the frame index.
        args = {
            "pos": data[1][i],
            "r_max": self.r_max,
            AtomicDataDict.ATOM_TYPE_KEY: data[0][i],
            AtomicDataDict.TOTAL_ENERGY_KEY: data[2][i],
        }
        if self.has_forces:
            args[AtomicDataDict.FORCE_KEY] = data[3][i]
        return AtomicData.from_points(**args)

@peastman (Contributor, Author) commented Mar 7, 2023

I made most of the changes and also added unit tests. The one thing I haven't done is switch statistics() to use pytorch_runstats, since it doesn't support standard deviation. If that feature is added, it should be easy to make statistics() use it. In the meantime, all the tests in TestStatistics are now being run on it to verify the results are correct.

@peastman (Contributor, Author) commented Mar 8, 2023

The tests are failing because h5py isn't installed in the test environment. I can fix it in one of two ways: 1) add h5py as a dependency in setup.py so it always gets installed, or 2) skip those tests if it isn't installed. Which do you prefer? I suggest 1, but I don't want to add a dependency without checking first.

@Linux-cpp-lisp (Collaborator) commented Mar 8, 2023

> The one thing I haven't done is switch statistics() to use pytorch_runstats, since it doesn't support standard deviation. If that feature is added, it should be easy to make statistics() use it.

Online one-pass stddev might be tricky, but I can look later at how bad it would be to implement Welford's algorithm in the simple case.

> The tests are failing because h5py isn't installed in the test environment. I can fix it in one of two ways: 1) add h5py as a dependency in setup.py so it always gets installed, or 2) skip those tests if it isn't installed. Which do you prefer? I suggest 1, but I don't want to add a dependency without checking first.

I don't want to add it as a mandatory dependency, so why don't we skipif the tests on h5py but also just add a pip install h5py to the test workflow YAMLs? That way it still gets tested. (My understanding is that h5py isn't pure Python, which means that on some clusters installing it will inevitably be an unnecessary mess.)
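A minimal sketch of the skip-if-missing approach using pytest's standard helper (the test body here is a placeholder, not code from this PR):

    import pytest

    h5py = pytest.importorskip("h5py")  # skips this module if h5py is absent

    def test_hdf5_file_roundtrip(tmp_path):
        path = tmp_path / "data.h5"
        with h5py.File(path, "w") as f:
            f.create_group("molecule_0")
        with h5py.File(path, "r") as f:
            assert "molecule_0" in f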

@peastman (Contributor, Author) commented Mar 8, 2023

It already uses Welford's algorithm. It's mainly a matter of upstreaming the code to pytorch_runstats, though that will also add some complexity since pytorch_runstats supports features that aren't needed here.

> why don't we skipif the tests on h5py but also just add a pip install h5py to the test workflow YAMLs?

Sounds good.

@Linux-cpp-lisp (Collaborator) commented Mar 9, 2023

> It already uses Welford's algorithm. It's mainly a matter of upstreaming the code to pytorch_runstats, though that will also add some complexity since pytorch_runstats supports features that aren't needed here.

Ah, I see, great. You mean accumulate_by and reduce_dims? reduce_dims should be easy, and we can just not support accumulate_by for std for now (just throw NotImplementedError); I think that's fine.
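A sketch of the suggested guard (hypothetical parameter names, mirroring the torch_runstats options mentioned above):

    def check_supported(reduction: str, accumulate_by=None):
        # Reject the not-yet-supported combination discussed above.
        if reduction == "std" and accumulate_by is not None:
            raise NotImplementedError(
                "accumulate_by is not yet supported for std"
            )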

@peastman (Contributor, Author)

This should be ready to merge, unless you need other changes first.

@Linux-cpp-lisp (Collaborator)

Looks good to me; no need to wait on the statistics stuff to do a first merge. I'll get this merged later today. Thanks @peastman!

@Linux-cpp-lisp Linux-cpp-lisp merged commit 3d14cbe into mir-group:develop Mar 17, 2023
@Linux-cpp-lisp (Collaborator)

Merged! If I manage to get around to implementing Welford's into runstats, we can revisit migrating to that and making this a "first class" dataset in terms of feature support in another PR later.

Thanks for your efforts on this @peastman !

@peastman (Contributor, Author)

Thank you!

@peastman peastman deleted the hdf5 branch March 17, 2023 22:05
@Linux-cpp-lisp Linux-cpp-lisp mentioned this pull request May 10, 2024