
Unable to read dataset larger than Integer.MAX_VALUE bytes #485

Open
Tracked by #4
jbellis opened this issue Jul 30, 2023 · 1 comment
Labels
bug Something isn't working

Comments


jbellis commented Jul 30, 2023

Exception: Failed to map data buffer for dataset '/train'
        at org.example.Texmex.lambda$main$3(Texmex.java:110)
        at java.base/java.lang.Thread.run(Thread.java:1623)
Caused by: io.jhdf.exceptions.HdfException: Failed to map data buffer for dataset '/train'
        at io.jhdf.dataset.ContiguousDatasetImpl.getDataBuffer(ContiguousDatasetImpl.java:44)
        at io.jhdf.dataset.DatasetBase.getData(DatasetBase.java:133)
        at org.example.Texmex.computeRecallFor(Texmex.java:70)
        at org.example.Texmex.lambda$main$3(Texmex.java:108)
        ... 1 more
Caused by: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
        at java.base/sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:1185)
        at io.jhdf.storage.HdfFileChannel.mapNoOffset(HdfFileChannel.java:74)
        at io.jhdf.storage.HdfFileChannel.map(HdfFileChannel.java:66)
        at io.jhdf.dataset.ContiguousDatasetImpl.getDataBuffer(ContiguousDatasetImpl.java:40)

The dataset in question is 3848008288 bytes (http://ann-benchmarks.com/deep-image-96-angular.hdf5).

jamesmudd (Owner) commented

Thanks for raising this and providing a sample file. This is currently a limitation with contiguous datasets; it would be possible to split the mapping up and read contiguous datasets more like chunked datasets are read. In theory this would also be a nice way to parallelise the reading and gain performance.
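
For illustration, here is a rough sketch of how a region larger than Integer.MAX_VALUE could be mapped as several segments rather than a single map() call. This is not jhdf code; mapLargeRegion and the segment handling are hypothetical.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class LargeMappingSketch {
	// Hypothetical helper: maps [offset, offset + size) as multiple read-only
	// segments, each no larger than Integer.MAX_VALUE bytes.
	static MappedByteBuffer[] mapLargeRegion(FileChannel channel, long offset, long size) throws IOException {
		int segmentCount = (int) ((size + Integer.MAX_VALUE - 1L) / Integer.MAX_VALUE);
		MappedByteBuffer[] segments = new MappedByteBuffer[segmentCount];
		long position = offset;
		long remaining = size;
		for (int i = 0; i < segmentCount; i++) {
			long segmentSize = Math.min(remaining, Integer.MAX_VALUE);
			segments[i] = channel.map(FileChannel.MapMode.READ_ONLY, position, segmentSize);
			position += segmentSize;
			remaining -= segmentSize;
		}
		return segments;
	}
}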

In the meantime you could try slice reading, via Dataset#getData(long[] sliceOffset, int[] sliceDimensions).

Some code like this seems to work (it is definitely not optimal; it takes about 30 seconds to read on my system):

import io.jhdf.HdfFile;
import io.jhdf.api.Dataset;

import java.lang.reflect.Array;
import java.nio.file.Paths;

public class ReadDataset {
	public static void main(String[] args) {
		try (HdfFile hdfFile = new HdfFile(Paths.get("/path/to/deep-image-96-angular.hdf5"))) {
			Dataset dataset = hdfFile.getDatasetByPath("/train");
			int[] dimensions = dataset.getDimensions();
			float[][] data = (float[][]) Array.newInstance(dataset.getJavaType(), dimensions);
			// Read one row per getData call so each slice stays well under Integer.MAX_VALUE bytes
			for (int i = 0; i < dimensions[0]; i++) {
				data[i] = ((float[][]) dataset.getData(new long[]{i, 0}, new int[]{1, dimensions[1]}))[0];
			}
			System.out.println("Finished read");
		}
	}
}
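
A slightly faster variation on the same idea would be to read several rows per getData call instead of one. This is an untested sketch; the batch size of 4096 rows is an arbitrary choice.

// Inside the try-with-resources block above, replacing the per-row loop.
int batch = 4096;
for (int i = 0; i < dimensions[0]; i += batch) {
	int rows = Math.min(batch, dimensions[0] - i);
	float[][] slice = (float[][]) dataset.getData(new long[]{i, 0}, new int[]{rows, dimensions[1]});
	System.arraycopy(slice, 0, data, i, rows);
}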
