
Feature Request: cache chunk data #15

Closed

AStupidBear opened this issue Apr 9, 2020 · 4 comments

@AStupidBear

It would be more efficient to cache the chunk currently in use when iterating over the data with a for loop, because each read has non-negligible overhead (e.g. in HDF5.jl).

@meggart (Owner) commented Apr 9, 2020

This is exactly what this package is for, and it is what all the implemented methods for reduction, broadcast, etc. already do. Did you run into a case where you still get problems with this package?

@AStupidBear (Author)

A = DiskArray(hdf5_dataset)
for t in 1:size(A, 2), n in 1:size(A, 1)
    complex_operations_for(A[n, t])
end

Will this still be efficient? Each getindex will still invoke one A.ds[...] call with non-negligible overhead. If we cached the block currently in use, the next contiguous getindex could be served directly from the cached block.
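
Roughly what I have in mind, as an untested sketch (CachedDiskArray and chunk_containing are hypothetical names, not existing code):

mutable struct CachedDiskArray{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::A                              # underlying disk-backed array
    cache::Array{T,N}                      # data of the most recently read block
    cacherange::NTuple{N,UnitRange{Int}}   # index range the cache covers
end

Base.size(c::CachedDiskArray) = size(c.parent)

function Base.getindex(c::CachedDiskArray{T,N}, I::Vararg{Int,N}) where {T,N}
    if !all(map(in, I, c.cacherange))
        # cache miss: read the whole block containing I from disk
        c.cacherange = chunk_containing(c.parent, I)   # hypothetical helper
        c.cache = c.parent[c.cacherange...]
    end
    # cache hit: serve the element from the block already in memory
    c.cache[map((i, r) -> i - first(r) + 1, I, c.cacherange)...]
end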

In my use case, it's hard to formulate the above complex operations using broadcast or map without breaking the entire code base.

@meggart (Owner) commented Apr 11, 2020

No, this example would not be efficient. However, for example:

A = DiskArray(hdf5_dataset)
broadcast(A) do a
    complex_operations_for(a)
end

would be efficient and processed chunk by chunk. Adding map and foreach is on the list as well. The problem with your code is that, even if it did caching, it would not respect the chunking of the dataset. Assume your dataset has size 1000x1000 and consists of chunks of size 100x100: operating column by column would be very inefficient, because every chunk would be read 100 times.
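
For the 1000x1000 example, a loop that respects the chunking would look roughly like this (the chunk size is hard-coded purely for illustration; in practice it comes from the dataset):

A = DiskArray(hdf5_dataset)
for jr in Iterators.partition(1:size(A, 2), 100), ir in Iterators.partition(1:size(A, 1), 100)
    block = A[ir, jr]                      # one disk read per 100x100 chunk
    for j in axes(block, 2), i in axes(block, 1)
        complex_operations_for(block[i, j])
    end
end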

If you really insist on avoiding map, foreach, and broadcast, one could think about defining a stateful eachindex, which returns an iterator containing a chunk cache. This would make code like the following possible:

A = DiskArray(hdf5_dataset)
for i in eachindex(A)
    complex_operations_for(A[i])
end

I have actually thought about implementing this, but I think there might be many problems with the implementation. In addition, I have not yet found a real-world use case that I could not express with the broadcast-like constructs mentioned above.
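
For concreteness, one (untested) way such a chunk-aware iterator could be sketched is below; eachchunk_ranges is a hypothetical stand-in for whatever exposes the chunk grid, and a real design would still have to solve the problems mentioned above:

function chunkpairs(A)
    Channel() do ch
        for ranges in eachchunk_ranges(A)          # hypothetical chunk-grid iterator
            block = A[ranges...]                   # one disk read per chunk
            for ci in CartesianIndices(block)
                gi = CartesianIndex(Tuple(ci) .+ first.(ranges) .- 1)
                put!(ch, gi => block[ci])          # global index => cached value
            end
        end
    end
end

for (i, a) in chunkpairs(A)
    complex_operations_for(a)                      # values come from the cached block
end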

@meggart (Owner) commented Apr 11, 2020

Another way to approach this would of course be to use https://github.com/JuliaCollections/LRUCache.jl, at the cost of a Dict lookup for every access to the Array. However, as I said, so far I would like to keep single-element access zero-overhead, at the cost of having to use some functional programming.
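
To make the trade-off concrete, a rough sketch of that approach (chunk size, key type and element type are illustrative assumptions, not a worked-out design):

using LRUCache

const chunkcache = LRU{NTuple{2,UnitRange{Int}}, Matrix{Float64}}(maxsize = 16)

function cached_getindex(A, i, j; chunksize = (100, 100))
    # index ranges of the chunk containing (i, j)
    key = ntuple(2) do d
        k = d == 1 ? i : j
        lo = (k - 1) ÷ chunksize[d] * chunksize[d] + 1
        lo:min(lo + chunksize[d] - 1, size(A, d))
    end
    block = get!(() -> A[key...], chunkcache, key)        # disk read only on a cache miss
    block[i - first(key[1]) + 1, j - first(key[2]) + 1]   # the Dict lookup is the per-element cost
end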
