Revamp Dask Array creation logic to be fully blockwise #105

Closed

gjoseph92 opened this issue Dec 16, 2021 · 2 comments

@gjoseph92 (Owner)

When dask/dask#7417 gets in, it may both break the current dask-array-creation hacks and open the door for a much, much simpler approach: just use `da.from_array`.

We didn't use `da.from_array` originally, and went through all the current rigamarole, because `from_array` generates a low-level graph, which can be enormous (and slow) for the large datasets we load. But once `from_array` uses Blockwise, it will be far simpler and more efficient.

We'll just switch to an array-like class that wraps the asset table and other parameters, and whose `__getitem__` basically calls `fetch_raster_window`. However, it's probably worth thinking about combining all of this into the Reader protocol in some way.
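
As a rough sketch (the `RasterStack` name, the `fetch_raster_window` signature, and the stub bodies here are illustrative, not stackstac's actual code), such an array-like only needs `shape`, `dtype`, `ndim`, and a slice-based `__getitem__`:

```python
import numpy as np
import dask.array as da


def fetch_raster_window(asset_table, index, shape, dtype):
    """Stub: a real implementation would open the assets selected by
    ``index[:2]`` with rasterio and do a windowed read of ``index[2:]``."""
    out_shape = tuple(len(range(*sl.indices(s))) for sl, s in zip(index, shape))
    return np.zeros(out_shape, dtype=dtype)


class RasterStack:
    """Array-like wrapping the (time, band) asset table; dask slices it lazily."""

    def __init__(self, asset_table, shape, dtype="float64"):
        self.asset_table = asset_table  # (time, band) table of asset hrefs/metadata
        self.shape = shape              # (time, band, y, x)
        self.dtype = np.dtype(dtype)
        self.ndim = len(shape)

    def __getitem__(self, index):
        # `index` is a tuple of slices selecting (time, band, y, x)
        return fetch_raster_window(self.asset_table, index, self.shape, self.dtype)


stack = RasterStack(np.empty((2, 3), dtype=object), shape=(2, 3, 1024, 1024))
# With blockwise from_array (dask/dask#7417), this becomes one Blockwise layer
# rather than a materialized task per chunk. fancy=False restricts indexing to
# plain slices, which is all the __getitem__ above handles.
arr = da.from_array(stack, chunks=(1, 1, 512, 512), fancy=False, name="stac-stack")
```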

This will also make it easier to do #62 (and may even inadvertently solve it).

@gjoseph92 (Owner, Author)

Thinking about this a bit more, we may want to stick with the current approach (or a form of it).

Currently, we make one initial array of (time, band): the "asset table". Each asset (file) is its own chunk. We then map a function over this array that opens each file.

Then we make a (y, x) array where each chunk holds the slice needed for a windowed read covering that spatial region. This is currently a low-level graph, but it could easily become blockwise using a `BlockwiseDep` (see the sketch after the next paragraph).

When you broadcast those two arrays together, you get the (time, band, y, x) final array.
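
A minimal sketch of that two-array broadcast, with the file-opening and windowed-read steps stubbed out (chunk sizes, helper names, and the object array of slices are illustrative, not the real implementation):

```python
import numpy as np
import dask.array as da

# (time, band) asset table: one chunk (and eventually one open Dataset) per asset.
T, B = 2, 3
asset_table = np.empty((T, B), dtype=object)  # e.g. asset hrefs/metadata
asset_da = da.from_array(asset_table, chunks=1, name="asset-table")


def open_asset(block):
    # Stub: a real version would return an open rasterio Dataset/Reader,
    # so each Dataset gets its own dask key.
    return block


readers = asset_da.map_blocks(open_asset, dtype=object)

# (y, x) array of windowed-read slices: one chunk per spatial tile.
chunks_y, chunks_x = (512, 512), (512, 512)
yo, xo = np.cumsum((0,) + chunks_y), np.cumsum((0,) + chunks_x)
windows = np.empty((len(chunks_y), len(chunks_x)), dtype=object)
for i in range(len(chunks_y)):
    for j in range(len(chunks_x)):
        windows[i, j] = (slice(yo[i], yo[i + 1]), slice(xo[j], xo[j + 1]))
windows_da = da.from_array(windows, chunks=1, name="windows")


def read_window(reader_block, window_block):
    # Stub for fetch_raster_window: a windowed read from one asset.
    ys, xs = window_block[0, 0]
    return np.zeros((1, 1, ys.stop - ys.start, xs.stop - xs.start), dtype="float64")


# Broadcasting "tb" against "yx" yields the final (time, band, y, x) array.
rasters = da.blockwise(
    read_window, "tbyx",
    readers, "tb",
    windows_da, "yx",
    adjust_chunks={"y": chunks_y, "x": chunks_x},
    dtype="float64",
)
assert rasters.shape == (T, B, 1024, 1024)
```

If I'm reading the dask internals right, `dask.layers.ArraySliceDep((chunks_y, chunks_x))` is a `BlockwiseDep` producing these same per-block slices, which would replace the materialized `windows` object array with a blockwise construction.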

The nice things are:

  1. The Reader (rasterio Dataset) for each file is a separate dask key. This is reasonable: Datasets are not lightweight things to move around.

  2. Opening each Dataset gets its own task.

  3. Reading from the dataset depends on the dataset's key (as opposed to implicitly opening the dataset). This means Dask will prefer to schedule reads from a given Dataset onto workers that already have that dataset open.

    Without this, I'd worry that we'd basically end up re-opening every dataset on every worker.

If we used `from_array` with one giant array-like object, we'd kind of be saying that opening any particular file, at any particular coordinates, has equal cost. Opening two different files at the same coordinates would appear equivalent to opening the same file at two different coordinates, which is clearly not true.

@gjoseph92 (Owner, Author)

Addressed in #116
