
Feature Request: cache chunk data #15

Closed

AStupidBear opened this issue Apr 9, 2020 · 4 comments

@AStupidBear

It would be more efficient to cache the chunk currently in use when iterating over the data with a for loop, because each read has non-negligible overhead (e.g. in HDF5.jl).

@meggart (Owner) commented Apr 9, 2020

This is exactly what this package is for, and it is what all the implemented methods for reduction, broadcast, etc. already do. Did you run into a case where you still get problems with this package?

@AStupidBear (Author)

A = DiskArray(hdf5_dataset)
for t in 1:size(A, 2), n in 1:size(A, 1)
    complex_operations_for(A[n, t])
end

Will this still be efficient? Each getindex will still invoke one A.ds[...] call with non-negligible overhead. If we cached the block currently in use, the next contiguous getindex could be served directly from the cached block.
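
Roughly what I have in mind, as an untested sketch (CachedDiskArray and chunk_containing are hypothetical names, not existing code):

mutable struct CachedDiskArray{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::A                              # underlying disk-backed array
    cache::Array{T,N}                      # data of the most recently read block
    cacherange::NTuple{N,UnitRange{Int}}   # index range the cache covers
end

Base.size(c::CachedDiskArray) = size(c.parent)

function Base.getindex(c::CachedDiskArray{T,N}, I::Vararg{Int,N}) where {T,N}
    if !all(map(in, I, c.cacherange))
        # cache miss: read the whole block containing I from disk
        c.cacherange = chunk_containing(c.parent, I)   # hypothetical helper
        c.cache = c.parent[c.cacherange...]
    end
    # cache hit: serve the element from the block already in memory
    c.cache[map((i, r) -> i - first(r) + 1, I, c.cacherange)...]
end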

In my use case, it's hard to formulate the above complex operations using broadcast or map without breaking the entire code base.

@meggart (Owner) commented Apr 11, 2020

No, this example would not be efficient. However, for example:

A = DiskArray(hdf5_dataset)
broadcast(A) do a
    complex_operations_for(a)
end

would be efficient and processed chunk by chunk. Adding map and foreach is on the list as well. The problem with your code is that, even if it did caching, it would not respect the chunking of the dataset. Assume your dataset has size 1000x1000 and consists of chunks of size 100x100: operating column by column would be very inefficient, because every chunk would be read 100 times.
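
For the 1000x1000 example, a loop that respects the chunking would look roughly like this (the chunk size is hard-coded purely for illustration; in practice it comes from the dataset):

A = DiskArray(hdf5_dataset)
for jr in Iterators.partition(1:size(A, 2), 100), ir in Iterators.partition(1:size(A, 1), 100)
    block = A[ir, jr]                      # one disk read per 100x100 chunk
    for j in axes(block, 2), i in axes(block, 1)
        complex_operations_for(block[i, j])
    end
end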

If you really insist on avoiding map, foreach, and broadcast, one could think about defining a stateful eachindex, which returns an iterator containing a chunk cache. This would make code like the following possible:

A = DiskArray(hdf5_dataset)
for i in eachindex(A)
    complex_operations_for(A[i])
end

I have actually thought about implementing this, but I think there might be many problems with the implementation. In addition, I have not yet found a real-world use case that I could not express with the broadcast-like constructs mentioned above.
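
For concreteness, one (untested) way such a chunk-aware iterator could be sketched is below; eachchunk_ranges is a hypothetical stand-in for whatever exposes the chunk grid, and a real design would still have to solve the problems mentioned above:

function chunkpairs(A)
    Channel() do ch
        for ranges in eachchunk_ranges(A)          # hypothetical chunk-grid iterator
            block = A[ranges...]                   # one disk read per chunk
            for ci in CartesianIndices(block)
                gi = CartesianIndex(Tuple(ci) .+ first.(ranges) .- 1)
                put!(ch, gi => block[ci])          # global index => cached value
            end
        end
    end
end

for (i, a) in chunkpairs(A)
    complex_operations_for(a)                      # values come from the cached block
end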

@meggart (Owner) commented Apr 11, 2020

Another way to approach this would of course be to use https://github.com/JuliaCollections/LRUCache.jl, at the cost of a Dict lookup for every access to the Array. However, as I said, so far I would like to keep single-element access zero-overhead, at the cost of having to use some functional programming.
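
To make the trade-off concrete, a rough sketch of that approach (chunk size, key type and element type are illustrative assumptions, not a worked-out design):

using LRUCache

const chunkcache = LRU{NTuple{2,UnitRange{Int}}, Matrix{Float64}}(maxsize = 16)

function cached_getindex(A, i, j; chunksize = (100, 100))
    # index ranges of the chunk containing (i, j)
    key = ntuple(2) do d
        k = d == 1 ? i : j
        lo = (k - 1) ÷ chunksize[d] * chunksize[d] + 1
        lo:min(lo + chunksize[d] - 1, size(A, d))
    end
    block = get!(() -> A[key...], chunkcache, key)        # disk read only on a cache miss
    block[i - first(key[1]) + 1, j - first(key[2]) + 1]   # the Dict lookup is the per-element cost
end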
