Very slow slicing of virtual datasets with chunked source data #1597

Open · takluyver opened this issue Jul 28, 2020 · 7 comments

@takluyver (Member)

I came across a case where reading a virtual dataset is very slow: more than an order of magnitude slower than reading the same data by iterating over it in a Python loop. Scripts to reproduce this are below. I suspect this is coming from HDF5 itself (cc @epourmal), but I haven't yet tried to reproduce it without Python.

Version info:

h5py    2.10.0
HDF5    1.12.0
Python  3.8.5 | packaged by conda-forge | (default, Jul 24 2020, 01:25:15) 
[GCC 7.5.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.19.1
create.py
import h5py
import numpy as np

layout = h5py.VirtualLayout((1000, 16, 512, 128), dtype=np.uint32)

for i in range(16):
    arr = np.full((1000, 512, 128), i, dtype=np.uint32)
    with h5py.File(f'{i}.h5', 'w') as f:
        ds = f.create_dataset('a', data=arr, chunks=(1, 512, 128))
        layout[:, i] = h5py.VirtualSource(ds)

with h5py.File('vds.h5', 'w') as f:
    f.create_virtual_dataset('a', layout)
read.py
import h5py
import numpy as np
import time

print(h5py.version.info)

f = h5py.File('vds.h5', 'r')
ds = f['a']

t0 = time.perf_counter()
arr1 = ds[:50]
t1 = time.perf_counter()
print(f"Slicing: {t1 - t0:.03f} s")

arr2 = np.zeros((50,) + ds.shape[1:], dtype=ds.dtype)
for i in range(50):
    arr2[i] = ds[i]
t2 = time.perf_counter()
print(f"Loop: {t2 - t1:.03f} s")

np.testing.assert_array_equal(arr1, arr2)

I would expect slicing (ds[:50]) to be at least as fast as reading the same data in a loop, but I consistently see slicing take about 13 seconds and the loop take about 0.25 s. I don't see this when reading a chunked dataset directly, nor with a virtual dataset whose source data is contiguous.
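
For comparison, a minimal sketch of the contiguous case (the script and file names here are just illustrative): omitting chunks= in create_dataset gives h5py's default contiguous layout, and slicing a virtual dataset built on such sources does not show the slowdown.
create_contiguous.py
import h5py
import numpy as np

# Same layout as create.py, but the source datasets are written contiguously
# (no chunks= argument), which is the case where slicing stays fast.
layout = h5py.VirtualLayout((1000, 16, 512, 128), dtype=np.uint32)

for i in range(16):
    arr = np.full((1000, 512, 128), i, dtype=np.uint32)
    with h5py.File(f'contig_{i}.h5', 'w') as f:
        ds = f.create_dataset('a', data=arr)  # contiguous: no chunks specified
        layout[:, i] = h5py.VirtualSource(ds)

with h5py.File('vds_contig.h5', 'w') as f:
    f.create_virtual_dataset('a', layout)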

@dallanto originally noticed this with real data. I've reproduced it using sample data, HDF5 1.12 and h5py master.

@takluyver (Member, Author)

I can reproduce this in C by modifying the h5_read.c example, so it's definitely coming from HDF5.

My modification of the C code is below. It works with the files made by create.py above. To test it:

h5cc -o h5_read h5_read.c
time ./h5_read
h5_read.c
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 * Copyright by The HDF Group.                                               *
 * Copyright by the Board of Trustees of the University of Illinois.         *
 * All rights reserved.                                                      *
 *                                                                           *
 * This file is part of HDF5.  The full HDF5 copyright notice, including     *
 * terms governing use, modification, and redistribution, is contained in    *
 * the COPYING file, which can be found at the root of the source code       *
 * distribution tree, or in https://support.hdfgroup.org/ftp/HDF5/releases.  *
 * If you do not have access to either file, you may request a copy from     *
 * help@hdfgroup.org.                                                        *
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */

#include <stdio.h>
#include <stdlib.h>
#include "hdf5.h"

#define H5FILE_NAME        "vds.h5"
#define DATASETNAME "a"
#define NX 50           /* output buffer dimensions */
#define NY 16
#define NZ  512
#define NZZ 128
#define RANK         4

int
main (void)
{
    hid_t       file, dataset;         /* handles */
    hid_t       datatype, dataspace;
    hid_t       memspace;
    H5T_class_t t_class;                 /* data type class */
    H5T_order_t order;                 /* data order */
    size_t      size;                  /*
				        * size of the data element
				        * stored in file
				        */
    hsize_t     dims_out[4];           /* dataset dimensions */
    herr_t      status;

    int *data_out; /* output buffer */

    hsize_t      count[4];              /* size of the hyperslab in the file */
    hsize_t      offset[4];             /* hyperslab offset in the file */
    int          status_n, rank;

    data_out = malloc(NX * NY * NZ * NZZ * sizeof(int));

    /*
     * Open the file and the dataset.
     */
    file = H5Fopen(H5FILE_NAME, H5F_ACC_RDONLY, H5P_DEFAULT);
    dataset = H5Dopen2(file, DATASETNAME, H5P_DEFAULT);

    /*
     * Get datatype and dataspace handles and then query
     * dataset class, order, size, rank and dimensions.
     */
    datatype  = H5Dget_type(dataset);     /* datatype handle */
    t_class     = H5Tget_class(datatype);
    if (t_class == H5T_INTEGER) printf("Data set has INTEGER type \n");
    order     = H5Tget_order(datatype);
    if (order == H5T_ORDER_LE) printf("Little endian order \n");

    size  = H5Tget_size(datatype);
    printf(" Data size is %d \n", (int)size);

    dataspace = H5Dget_space(dataset);    /* dataspace handle */
    rank      = H5Sget_simple_extent_ndims(dataspace);
    status_n  = H5Sget_simple_extent_dims(dataspace, dims_out, NULL);
    printf("rank %d, dimensions %lu x %lu x %lu x %lu \n", rank,
	   (unsigned long)(dims_out[0]), (unsigned long)(dims_out[1]),
       (unsigned long)(dims_out[2]), (unsigned long)(dims_out[3]));

    /*
     * Define hyperslab in the dataset.
     */
    offset[0] = 0;
    offset[1] = 0;
    offset[2] = 0;
    offset[3] = 0;
    count[0]  = NX;
    count[1]  = NY;
    count[2]  = NZ;
    count[3]  = NZZ;
    status = H5Sselect_hyperslab(dataspace, H5S_SELECT_SET, offset, NULL,
				 count, NULL);
    printf("Selected in dataset %d\n", H5Sget_select_npoints(dataspace));

    /*
     * Define the memory dataspace.
     */
    memspace = H5Screate_simple(RANK, count, NULL);
    status = H5Sselect_all(memspace);
    printf("Selected in memspace %d\n", H5Sget_select_npoints(memspace));

    /*
     * Read data from hyperslab in the file into the hyperslab in
     * memory.
     */
    status = H5Dread(dataset, H5T_NATIVE_INT, memspace, dataspace,
		     H5P_DEFAULT, data_out);

    /*
     * Close/release resources.
     */
    free(data_out);
    H5Tclose(datatype);
    H5Dclose(dataset);
    H5Sclose(dataspace);
    H5Sclose(memspace);
    H5Fclose(file);

    return 0;
}

@epourmal

We do have reports about VDS performance. I entered issue HDFFV-11124 for this one. Unfortunately, I don't think we will have the bandwidth to investigate before October. Any help with profiling and more in-depth analysis would be highly appreciated.
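
For example, one inexpensive data point would be how the slicing time scales with the number of rows selected; a minimal sketch of that (the script name is just illustrative), assuming the vds.h5 file from create.py above:
scaling_check.py
import time

import h5py

# Time slices of increasing size on the virtual dataset, to see how the read
# time grows with the number of source chunks touched by the selection.
with h5py.File('vds.h5', 'r') as f:
    ds = f['a']
    for n in (1, 5, 10, 25, 50):
        t0 = time.perf_counter()
        _ = ds[:n]
        t1 = time.perf_counter()
        print(f"ds[:{n}]: {t1 - t0:.3f} s")
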
Thank you for reporting!
Elena

@takluyver (Member, Author)

Thanks Elena! I'm not sure if I'll get time to delve into it further, but it's good to know that it's tracked, in any case.

@epourmal

One thing I noticed is that there is a datatype conversion, since the memory buffer is native int while the file data is unsigned int, but I think the main issue comes from the hyperslab selection. We do have similar reports and were already planning to look into optimizations. It is on our radar and has high priority.

@dallanto

Thank you both. Although I noticed this with h5py and real data, I couldn't have stated the problem as clearly as @takluyver did.

@takluyver (Member, Author)

Thanks, well spotted. I just tried using unsigned int and H5T_NATIVE_UINT, and the timings are still much the same.
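
The equivalent check on the h5py side would be reading into a preallocated buffer of the file's exact dtype, so no conversion can be involved there either; a minimal sketch of that (the script name is just illustrative), assuming vds.h5 from create.py above:
read_direct_check.py
import h5py
import numpy as np

# Read the first 50 frames into a preallocated uint32 buffer via read_direct,
# so the memory type matches the file type exactly and no datatype conversion
# takes place.
with h5py.File('vds.h5', 'r') as f:
    ds = f['a']
    out = np.empty((50,) + ds.shape[1:], dtype=ds.dtype)
    ds.read_direct(out, source_sel=np.s_[:50])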

@epourmal

Thank you for checking! Yes, I wouldn't expect datatype conversion to be a big contributor to the performance drop.
