
Memory Leak when Slicing Dataset #1176

Closed

colehurwitz opened this issue Feb 14, 2019 · 19 comments

Comments

@colehurwitz

To whom it may concern,

I recently ran into an issue where slicing an HDF5 dataset (the same way I would slice a numpy array) kept filling up my RAM until I had to kill the program.

Here are the commands I entered:

dataset = h5py.File(dataset_directory + recording_name)
print(dataset['3BData/Raw'][0:1000:2])

I basically tried to slice out every other element in the dataset, but the read never completed and it filled my entire RAM. Here are the dataset details:

<HDF5 dataset "Raw": shape (3224391680,), type "<u2">

Here are the specifications I am using:

python -c 'import h5py; print(h5py.version.info)'
h5py 2.8.0
HDF5 1.10.2
Python 3.7.1 (default, Oct 23 2018, 19:19:42)
[GCC 7.3.0]
sys.platform linux
sys.maxsize 9223372036854775807
numpy 1.15.3

@aparamon
Member

aparamon commented Feb 15, 2019

Hi @colehurwitz31!
How much RAM do you have on your instance?
Your slice (every other element in the dataset) needs ~3GB RAM.

@colehurwitz
Author

I had about 70 GB of RAM (I was working on a remote server with plenty of space). When I did the slicing procedure, memory usage quickly increased from 5 GB all the way up to 70 GB before I terminated the program.

@tacaswell
Member

What is the chunking on that dataset?

Can you pull up the whole dataset successfully?

@colehurwitz
Author

I can pull up the whole dataset successfully. I am not sure what the chunking is though.
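
For reference, h5py exposes the chunk layout directly on the dataset object. A minimal sketch, reusing the path variables from the original report:

import h5py

with h5py.File(dataset_directory + recording_name, "r") as f:
    dset = f["3BData/Raw"]
    print(dset.chunks)       # chunk shape, or None if the layout is contiguous
    print(dset.compression)  # e.g. "gzip", or None if uncompressed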

@tacaswell
Member

RE chunking see https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/index.html

print(dataset['3BData/Raw'][0:1000:2]) # fails
print(dataset['3BData/Raw'][0:1000][::2])  # works?

My guess is something is wrong in

def __getitem__(self, args):

@colehurwitz31 Can you provide a script to generate a file that fails in this way (random data should be fine).

This is super weird and troubling...

@vasole
Contributor

vasole commented Feb 17, 2019

I'm trying to imagine what could make one work and the other one not.

The only thing that comes to my mind is that in the second case one reads a contiguous buffer and takes one element out of two, while in the first case one may be forced to allocate a destination buffer and copy into it element by element. If reading each element forces the library to read a big chunk of data instead of just that one element, that could explain the huge memory usage.

@vasole
Contributor

vasole commented Feb 17, 2019

This reproduces the issue on my Windows machine. The code has to be run twice.

Of course I would never do such things; it was just to reproduce the problem.

import os
import h5py
import numpy

fname = "dummy.h5"
length = 100000000
if not os.path.exists(fname):
    # First run: generate a dataset stored as a single huge gzip'd chunk.
    data = numpy.random.random(length)
    h5 = h5py.File(fname, "w")
    dset = h5.create_dataset("data", shape=(length,),
                             dtype=numpy.float64,
                             chunks=(length,),
                             compression="gzip")
    dset[:] = data
    h5.close()
    data = None
    print("GENERATED")
else:
    # Second run: the contiguous read plus numpy striding works, but the
    # same strided selection at the dataset level blows up.
    h5 = h5py.File(fname, "r")
    print("READING SAFE")
    print(h5["/data"][0:length][::2])
    print("READING UNSAFE")
    print(h5["/data"][0:length:2])
    h5.close()

@tacaswell
Member

I have a story for this:

  • in the "read it all" case, we pull up the one chunk from the file, copy it to the waiting numpy buffer, and then use numpy striding to get back a view of every other element.
  • in the slice-at-dataset-level case we are constructing an hdf5 selector that is going to walk through and fill in just the data we need
    • we may not be constructing that selector in the optimal way?
    • because there is only one (very large) chunk, what may be happening is: we pull it up to get the first element, then because it does not fit in the chunk cache it gets mostly discarded; for the second element the chunk is not in the cache, so it gets pulled up again, the second element is read out, and because it again does not fit in the cache it gets mostly discarded, and so on. Eventually we end up with N copies of the single chunk in some sort of purgatory and OOM the machine.

Not clear if the error is on the h5py side or the hdf5 side.
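
If that chunk-cache story is right, two things may work around it on the h5py side: read the wanted range contiguously and stride in numpy, or enlarge the raw-data chunk cache when opening the file (the rdcc_nbytes argument is available in h5py 2.9+). A rough sketch of both, not a verified fix for the underlying bug:

import h5py

# Workaround 1: contiguous read, then numpy striding
# (the "[0:length][::2]" pattern that already works above).
with h5py.File("dummy.h5", "r") as f:
    every_other = f["/data"][0:1000][::2]

# Workaround 2 (h5py >= 2.9): a chunk cache large enough to hold the
# single huge chunk, so it is not fetched again for every element.
with h5py.File("dummy.h5", "r", rdcc_nbytes=1024**3) as f:
    every_other = f["/data"][0:1000:2]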

@colehurwitz
Author

Thanks for making a script to reproduce the error @vasole and thanks @tacaswell for looking into it. Hopefully a good solution is found!

@aparamon
Member

Apparently, the problem is in the HDF5 library; reported upstream.

@colehurwitz
Author

colehurwitz commented Feb 20, 2019

That is good to know! Is there a place where I can follow updates for this library? Thanks for the help!

@aparamon
Member

aparamon commented Feb 20, 2019

@colehurwitz31 I have sent a report to https://forum.hdfgroup.org but it's currently down so I can't get the link...
Hopefully it is back up soon, the report is read, and a JIRA issue is created by the HDF Group members.

@epourmal

Hello, H5Pyers,

Looks like the forum is down and I reported it to our sysadmin. It is a little bit early here ;-)

Just one comment: please make sure that you close all handles. It is a known issue with the HDF5 hyperslab selection code that internal data structures grow and are not released until the library is closed. Please send us a C example that reproduces the issue.

Thank you!
Elena
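
On the h5py side, a minimal way to follow that advice is to close the file handle explicitly (or use the file as a context manager) so that no HDF5 identifiers stay open between reads. A small sketch:

import h5py

f = h5py.File("dummy.h5", "r")
try:
    subset = f["/data"][0:1000:2]
finally:
    f.close()  # releases the file handle and any identifiers it owns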

@aparamon
Member

@epourmal Good morning! :-)
Here you go:

#include <stdio.h>
#include <stdlib.h>
#include "hdf5.h"

int main() {
   const hsize_t size = 100000000L;
   hid_t fapl, file, dataset, dcpl, memspace, dataspace;
   herr_t status;
   hsize_t start = 0;
   hsize_t stride = 2;
   hsize_t halfsize = size/2;
   float *data;

   printf("Creating data file...");
   file = H5Fcreate("dummy.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

   dataspace = H5Screate_simple(1, &size, NULL);
   dcpl = H5Pcreate(H5P_DATASET_CREATE);
   status = H5Pset_chunk(dcpl, 1, &size);
   status = H5Pset_deflate(dcpl, 3);

   dataset = H5Dcreate2(file, "data", H5T_INTEL_F32, dataspace,
                        H5P_DEFAULT, dcpl, H5P_DEFAULT);

   status = H5Sclose(dataspace);
   status = H5Pclose(dcpl);

   data = malloc(sizeof(float)*size);
   for(hsize_t i=0; i<size; i++)
     data[i] = (float)rand()/(float)RAND_MAX;
   memspace = H5Screate_simple(1, &size, NULL);
   dataspace = H5Dget_space(dataset);
   status = H5Dwrite(dataset, H5T_NATIVE_FLOAT, memspace, dataspace, H5P_DEFAULT, data);
   status = H5Sclose(dataspace);
   status = H5Sclose(memspace);
   free(data);

   status = H5Dclose(dataset);
   status = H5Fclose(file);
   printf(" done.\n");

   printf("Loading data...");
   file = H5Fopen("dummy.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
   dataset = H5Dopen2(file, "data", H5P_DEFAULT);

   data = malloc(sizeof(float)*halfsize);
   memspace = H5Screate_simple(1, &halfsize, NULL);
   dataspace = H5Dget_space(dataset);
   status = H5Sselect_hyperslab(dataspace, H5S_SELECT_SET, &start, &stride, &halfsize, NULL);
   status = H5Dread(dataset, H5T_NATIVE_FLOAT, memspace, dataspace, H5P_DEFAULT, data);
   status = H5Sclose(dataspace);
   status = H5Sclose(memspace);
   free(data);

   status = H5Dclose(dataset);
   status = H5Fclose(file);
   printf(" done.\n");
}
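
For anyone who wants to try the snippet, it should build with the h5cc compiler wrapper shipped with HDF5, e.g. h5cc repro.c -o repro (where repro.c is whatever name the file is saved under).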

@epourmal

Thank you! Forwarded to our Helpdesk.

Which version of HDF5 are you using? Did you try the same program without compression? Which version of zlib is used? More detailed information will help.

Elena

@aparamon
Member

aparamon commented Feb 20, 2019

@epourmal Reproducible for me on Windows, HDF5 1.10.4.
Commenting out just
status = H5Pset_deflate(dcpl, 3);
doesn't help; only removing
status = H5Pset_chunk(dcpl, 1, &size);
makes it run nicely.
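
As a practical aside (an assumption on my side, not something tested in this thread): if the data can be rewritten, using many modest chunks instead of one chunk spanning the whole dataset keeps compression while avoiding the pathological layout, e.g.:

import h5py
import numpy

length = 100000000
data = numpy.random.random(length)

with h5py.File("dummy_rechunked.h5", "w") as f:
    # Modest chunks (or chunks=True to let h5py pick a shape) instead of
    # a single chunk covering the entire dataset.
    f.create_dataset("data", data=data, chunks=(1000000,), compression="gzip")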

@aparamon
Member

Please track upstream report at
https://jira.hdfgroup.org/browse/HDFFV-10709

@tacaswell
Member

I'm going to close this as it is an upstream bug.

@epourmal

Interesting but doesn't make sense :-) We will investigate.
