Very slow slicing of virtual datasets with chunked source data #1597
Comments
I can reproduce this in C by modifying the `h5_read.c` example. My modification of the C code is below. It works with the files made by
h5_read.c:

```c
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 * Copyright by The HDF Group.                                               *
 * Copyright by the Board of Trustees of the University of Illinois.         *
 * All rights reserved.                                                      *
 *                                                                           *
 * This file is part of HDF5.  The full HDF5 copyright notice, including     *
 * terms governing use, modification, and redistribution, is contained in    *
 * the COPYING file, which can be found at the root of the source code       *
 * distribution tree, or in https://support.hdfgroup.org/ftp/HDF5/releases.  *
 * If you do not have access to either file, you may request a copy from     *
 * help@hdfgroup.org.                                                        *
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */

#include <stdio.h>
#include <stdlib.h>
#include "hdf5.h"

#define H5FILE_NAME "vds.h5"
#define DATASETNAME "a"
#define NX   50  /* output buffer dimensions */
#define NY   16
#define NZ   512
#define NZZ  128
#define RANK 4

int
main(void)
{
    hid_t       file, dataset;  /* handles */
    hid_t       datatype, dataspace;
    hid_t       memspace;
    H5T_class_t t_class;        /* datatype class */
    H5T_order_t order;          /* data order */
    size_t      size;           /* size of the data element stored in file */
    hsize_t     dims_out[4];    /* dataset dimensions */
    herr_t      status;
    int        *data_out;       /* output buffer */
    hsize_t     count[4];       /* size of the hyperslab in the file */
    hsize_t     offset[4];      /* hyperslab offset in the file */
    int         status_n, rank;

    data_out = malloc(NX * NY * NZ * NZZ * sizeof(int));

    /*
     * Open the file and the dataset.
     */
    file    = H5Fopen(H5FILE_NAME, H5F_ACC_RDONLY, H5P_DEFAULT);
    dataset = H5Dopen2(file, DATASETNAME, H5P_DEFAULT);

    /*
     * Get datatype and dataspace handles, then query the
     * dataset's class, order, size, rank and dimensions.
     */
    datatype = H5Dget_type(dataset); /* datatype handle */
    t_class  = H5Tget_class(datatype);
    if (t_class == H5T_INTEGER)
        printf("Data set has INTEGER type\n");
    order = H5Tget_order(datatype);
    if (order == H5T_ORDER_LE)
        printf("Little endian order\n");
    size = H5Tget_size(datatype);
    printf("Data size is %d\n", (int)size);

    dataspace = H5Dget_space(dataset); /* dataspace handle */
    rank      = H5Sget_simple_extent_ndims(dataspace);
    status_n  = H5Sget_simple_extent_dims(dataspace, dims_out, NULL);
    printf("rank %d, dimensions %lu x %lu x %lu x %lu\n", rank,
           (unsigned long)(dims_out[0]), (unsigned long)(dims_out[1]),
           (unsigned long)(dims_out[2]), (unsigned long)(dims_out[3]));

    /*
     * Define the hyperslab in the dataset.
     */
    offset[0] = 0;
    offset[1] = 0;
    offset[2] = 0;
    offset[3] = 0;
    count[0]  = NX;
    count[1]  = NY;
    count[2]  = NZ;
    count[3]  = NZZ;
    status = H5Sselect_hyperslab(dataspace, H5S_SELECT_SET, offset, NULL,
                                 count, NULL);
    printf("Selected in dataset %lld\n",
           (long long)H5Sget_select_npoints(dataspace));

    /*
     * Define the memory dataspace.
     */
    memspace = H5Screate_simple(RANK, count, NULL);
    status   = H5Sselect_all(memspace);
    printf("Selected in memspace %lld\n",
           (long long)H5Sget_select_npoints(memspace));

    /*
     * Read the data from the hyperslab in the file into the
     * hyperslab in memory.
     */
    status = H5Dread(dataset, H5T_NATIVE_INT, memspace, dataspace,
                     H5P_DEFAULT, data_out);

    /*
     * Close/release resources.
     */
    free(data_out);
    H5Tclose(datatype);
    H5Dclose(dataset);
    H5Sclose(dataspace);
    H5Sclose(memspace);
    H5Fclose(file);

    return 0;
}
```
We do have reports about VDS performance. I entered an issue HDFFV-11124 for this one. Unfortunately, I don't think we will have the bandwidth to investigate before October. Any help with profiling and more in-depth analysis would be highly appreciated.
Thanks Elena! I'm not sure if I'll get time to delve into it further, but it's good to know that it's tracked, in any case.
One thing I noticed is that there is a datatype conversion, since the memory buffer is native int while the stored data is unsigned int, but I think the main issue comes from the hyperslab selection. We do have similar reports and were already planning to look into optimizations. It is on our radar and has high priority.
Thank you both. I noticed this with h5py and real data, but I wouldn't have been able to state the problem as clearly as @takluyver did.
Thanks, well spotted. I just tried using
Thank you for checking! Yes, I wouldn't expect datatype conversion to be a huge contributor to the performance drop.
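For reference, one way to rule out datatype conversion in h5py is to allocate the output buffer with the file's exact dtype and read into it with `Dataset.read_direct`. This is a minimal sketch against a tiny stand-in file (the file and dataset names here are placeholders, not the real data from this issue):

```python
import h5py
import numpy as np

# Tiny stand-in file; the real case in this issue is the vds.h5 virtual dataset
with h5py.File("conv_demo.h5", "w") as f:
    f.create_dataset("a", data=np.arange(24, dtype="u4").reshape(2, 3, 4))

with h5py.File("conv_demo.h5", "r") as f:
    ds = f["a"]
    # Allocating with the on-disk dtype (uint32 here) means HDF5 performs
    # no datatype conversion during the read
    out = np.empty(ds.shape, dtype=ds.dtype)
    ds.read_direct(out)

print(out.dtype)  # uint32
```

Reading via plain slicing into a buffer of a different native type is what triggers the conversion path that was being discussed.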
I came across a case where reading a virtual dataset is very slow - more than an order of magnitude slower than reading the same data by iterating over it in a Python loop. Scripts to reproduce this are below. I suspect this is coming from HDF5 itself (cc @epourmal), but I haven't yet tried to reproduce it without Python.
Version info:
create.py
read.py
I would expect slicing (`ds[:50]`) to be at least as fast as reading the same data in a loop, but I consistently see slicing taking about 13 seconds and the loop taking about 0.25 s. I don't see this when reading a chunked dataset directly, nor with a virtual dataset where the source data is contiguous.

@dallanto originally noticed this with real data. I've reproduced it using sample data, HDF5 1.12 and h5py master.
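The create.py and read.py scripts themselves are collapsed above, so the following is only a rough sketch of the kind of reproduction being described: chunked source files combined into a virtual dataset, then one big slice timed against a per-entry Python loop. All names and shapes here are assumptions (shapes shrunk from the 50 x 16 x 512 x 128 used in the C example):

```python
import time

import h5py
import numpy as np

N_FILES = 50               # entries along axis 0, as in the C example's NX
SRC_SHAPE = (16, 64, 32)   # per-file shape, shrunk for illustration

# create.py-style step: chunked source files plus a virtual dataset over them
layout = h5py.VirtualLayout(shape=(N_FILES,) + SRC_SHAPE, dtype="u4")
for i in range(N_FILES):
    fname = f"src_{i:03d}.h5"
    with h5py.File(fname, "w") as f:
        # chunks= makes the source data chunked, the case that is slow here
        f.create_dataset("data", data=np.full(SRC_SHAPE, i, dtype="u4"),
                         chunks=(1, 64, 32))
    layout[i] = h5py.VirtualSource(fname, "data", shape=SRC_SHAPE)
with h5py.File("vds.h5", "w") as f:
    f.create_virtual_dataset("a", layout)

# read.py-style step: one big slice vs. an explicit Python loop
with h5py.File("vds.h5", "r") as f:
    ds = f["a"]
    t0 = time.perf_counter()
    sliced = ds[:N_FILES]                               # reported slow path
    t1 = time.perf_counter()
    looped = np.stack([ds[i] for i in range(N_FILES)])  # reported fast path
    t2 = time.perf_counter()
    print(f"slice: {t1 - t0:.3f}s  loop: {t2 - t1:.3f}s")

assert (sliced == looped).all()
```

Both paths read identical data; only the selection handed to HDF5 differs, which is what makes the timing gap surprising.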