Dask support for first, last, first_n and last_n reductions #1214

ianthomas23 · 2023-05-11T17:21:03Z

This fixes the Dask parts of issues #1182 and #1207. CUDA support will follow in a separate PR, as this one is already too big.

Full list of example reductions this supports on CPU with and without Dask:

first('value') - returns single value per pixel
last('value') - ditto
first_n('value', n=3) - returns n=3 values per pixel
last_n('value', n=3) - ditto
where(first('value')) - returns single row index per pixel
where(last('value')) - ditto
where(first_n('value', n=3)) - returns n=3 row indexes per pixel
where(last_n('value', n=3)) - ditto
where(first('value'), 'other') - returns single other per pixel
where(last('value'), 'other') - ditto
where(first_n('value', n=3), 'other') - returns n=3 others per pixel
where(last_n('value', n=3), 'other') - ditto
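
To make the return shapes in the list above concrete, here is a minimal pure-Python sketch of the per-pixel semantics. It is a toy model and not datashader code: the `rows` data, the helper names, and the "most recent first" ordering for the last-n case are all assumptions made for illustration, and NaN handling is ignored entirely.

```python
# Toy model (hypothetical data): each row has already been assigned to a pixel.
rows = [
    # (pixel, value, other)
    (0, 10.0, "a"),
    (1, 20.0, "b"),
    (0, 30.0, "c"),
    (0, 40.0, "d"),
    (1, 50.0, "e"),
]
VALUE, OTHER = 1, 2  # column positions within each row tuple


def first_n_row_indexes(rows, n):
    """Return {pixel: row indexes of the first n rows landing in that pixel}."""
    result = {}
    for row_index, row in enumerate(rows):
        indexes = result.setdefault(row[0], [])
        if len(indexes) < n:
            indexes.append(row_index)
    return result


def last_n_row_indexes(rows, n):
    """Same for the last n rows; ordering (most recent first) is an assumption."""
    result = {}
    for row_index in range(len(rows) - 1, -1, -1):
        indexes = result.setdefault(rows[row_index][0], [])
        if len(indexes) < n:
            indexes.append(row_index)
    return result


def lookup(indexes_per_pixel, rows, column):
    """Turn per-pixel row indexes into per-pixel values from the given column."""
    return {
        pixel: [rows[i][column] for i in indexes]
        for pixel, indexes in indexes_per_pixel.items()
    }


# Analogue of where(first_n('value', n=2)): row indexes per pixel.
where_first_n = first_n_row_indexes(rows, n=2)
# Analogue of first_n('value', n=2): the values at those row indexes.
first_n_values = lookup(where_first_n, rows, VALUE)
# Analogue of where(first_n('value', n=2), 'other'): another column, same rows.
first_n_others = lookup(where_first_n, rows, OTHER)
# Analogue of where(last_n('value', n=2)) and last_n('value', n=2).
where_last_n = last_n_row_indexes(rows, n=2)
last_n_values = lookup(where_last_n, rows, VALUE)
```

The single-value reductions correspond to n=1 in this model: the row index alone is enough to recover either the reduced column or any other column.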

Summary of changes:

Some reduction behaviour is now dependent on whether supplied data is partitioned (dask or CUDA) or not, so extra arguments cuda and partitioned are passed through the class hierarchy.
I have factored out some common first and last code into a private base class _first_or_last to simplify, and done the same for first_n and last_n with _first_n_or_last_n.
The fundamental addition is that first('value') using Dask is implemented as where(_min_row_index(), 'value'). This takes advantage of the existing where machinery and a new class _min_row_index which is very similar to a min reduction except that it accepts a virtual row index column, not a real column supplied by the user. I could have extended the min class to cover this, but I thought it better to keep the new functionality in a new private class. It is possible to instantiate the private class of course, although no user code should do this, and I have done this in the test suite. There are equivalent new private classes for last, first_n and last_n.
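
The idea of reducing on a row index and then looking up the value can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the datashader implementation: the function names and data are hypothetical, partitions are plain lists, and each row carries a global row index. Presumably the appeal of a minimum over row indexes is that the combined result does not depend on the order in which partitions are processed.

```python
def min_row_index_per_pixel(partition):
    """Reduce one partition to {pixel: smallest global row index seen}.

    partition is a list of (global_row_index, pixel, value) tuples.
    """
    result = {}
    for row_index, pixel, _value in partition:
        if pixel not in result or row_index < result[pixel]:
            result[pixel] = row_index
    return result


def combine_min(partials):
    """Combine per-partition results by taking the minimum per pixel."""
    combined = {}
    for partial in partials:
        for pixel, row_index in partial.items():
            if pixel not in combined or row_index < combined[pixel]:
                combined[pixel] = row_index
    return combined


# Hypothetical data: (pixel, value) per row, in original row order.
rows = [(0, 10.0), (1, 20.0), (0, 30.0), (1, 40.0), (2, 50.0)]
indexed = [(i, pixel, value) for i, (pixel, value) in enumerate(rows)]

# Two partitions, deliberately processed out of order.
partitions = [indexed[3:], indexed[:3]]
partials = [min_row_index_per_pixel(p) for p in partitions]

# Analogue of _min_row_index(): first row index per pixel.
first_row = combine_min(partials)
# Analogue of where(_min_row_index(), 'value'): look up the value at that row.
first_value = {pixel: rows[i][1] for pixel, i in first_row.items()}
```

Swapping the minimum for a maximum gives the corresponding sketch for last.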

I still need to recheck that trying to use CUDA with these reductions raises informative error messages rather than crashing.

codecov · 2023-05-11T18:00:11Z

Codecov Report

Merging #1214 (25bd257) into main (8092f4d) will increase coverage by 0.31%.
The diff coverage is 95.09%.

@@            Coverage Diff             @@
##             main    #1214      +/-   ##
==========================================
+ Coverage   84.52%   84.83%   +0.31%     
==========================================
  Files          35       35              
  Lines        8369     8561     +192     
==========================================
+ Hits         7074     7263     +189     
- Misses       1295     1298       +3

Impacted Files	Coverage Δ
datashader/data_libraries/pandas.py	`100.00% <ø> (ø)`
datashader/reductions.py	`84.68% <93.15%> (+1.56%)`	⬆️
datashader/compiler.py	`91.07% <100.00%> (+0.50%)`	⬆️
datashader/data_libraries/dask.py	`95.23% <100.00%> (+0.07%)`	⬆️
datashader/data_libraries/dask_xarray.py	`98.95% <100.00%> (ø)`
datashader/utils.py	`81.86% <100.00%> (+2.61%)`	⬆️

jbednar

Whew! That didn't look fun, but I'm very glad to see it. I didn't spot any issues, but I may or may not have glazed over when reviewing certain bits. :-)

ianthomas23 · 2023-05-16T11:39:08Z

> I may or may not have glazed over when reviewing certain bits. :-)

It hadn't occurred to me to be entertaining. I can try that approach, but it may come at the expense of code accuracy 😁

ianthomas23 · 2023-05-16T11:40:51Z

I've rebased it on main to pick up the fast CUDA mutex PR, and added a couple of error messages. There are no functional code changes, so I will merge when CI is green.

ianthomas23 added the enhancement label May 11, 2023

ianthomas23 added this to the v0.14.5 milestone May 11, 2023

jbednar approved these changes May 11, 2023

ianthomas23 added 9 commits May 15, 2023 08:40

New _max_row_index and _min_row_index reductions

42011a8

first and last reductions using dask on cpu

72bf0e9

New _max_n_row_index and _min_n_row_index reductions

56114ff

first_n and last_n reductions using dask on cpu

2764640

Correct where(first) and similar

df70993

Handle valid/invalid values in append call using where

a1b7bce

Improved tests

a3d238b

Handle checking NaNs in a separate column

46edb15

Better cuda error messages

25bd257

ianthomas23 mentioned this pull request May 16, 2023

Consistent handling of NaNs in where reductions #1215

Closed

ianthomas23 merged commit d385061 into holoviz:main May 16, 2023
16 checks passed

ianthomas23 deleted the 1182_first_last_dask_cuda branch May 16, 2023 12:46

This was referenced May 16, 2023

Dask and CUDA support for first and last reductions #1182

Closed

Dask and CUDA support for first_n and last_n reductions #1207

Closed
