
Optimized datashader aggregation of NdOverlays #1430

Merged

merged 5 commits into master from datashader_ndoverlay_opt on May 12, 2017

Conversation

3 participants
philippjfr (Member) commented May 11, 2017

This PR provides major optimizations when using the datashader operations to aggregate multiple objects in an NdOverlay with the `count`, `sum`, and `mean` operations. Each Element is aggregated separately and the individual aggregates are summed. A small complication is that NaNs have to be replaced by zeros before summing and masked again at the end. `mean` is supported by dividing the summed `sum` aggregate by the summed `count` aggregate. This avoids the large memory and performance overhead of concatenating multiple dataframes together. I'm still working on adding an optimization for `count_cat`, but it should also be fairly straightforward.
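In outline, the combination step works like this. Below is a minimal NumPy sketch of the idea only, not the PR's actual implementation; `aggregate_overlay` and its signature are hypothetical, and each input is assumed to be a per-element pair of `count` and `sum` grids with NaN marking empty pixels:

```python
import numpy as np

def aggregate_overlay(aggregates, operation="count"):
    """Combine per-element 2D aggregates into one overall aggregate.

    `aggregates` is a list of (count_grid, sum_grid) pairs, one per
    NdOverlay element. NaNs mark empty pixels; they are zeroed so the
    grids can be summed, then restored at the end via a mask.
    """
    counts = [np.nan_to_num(c) for c, _ in aggregates]
    sums = [np.nan_to_num(s) for _, s in aggregates]
    total_count = np.sum(counts, axis=0)
    total_sum = np.sum(sums, axis=0)
    mask = total_count == 0              # pixels no element touched
    if operation == "count":
        result = total_count
    elif operation == "sum":
        result = total_sum
    elif operation == "mean":
        # mean = summed sums / summed counts, as described above
        with np.errstate(invalid="ignore", divide="ignore"):
            result = total_sum / total_count
    else:
        raise ValueError(f"Unsupported operation: {operation}")
    result = result.astype(float)
    result[mask] = np.nan                # restore NaN for empty pixels
    return result
```

The key point is that `mean` never needs its own pass: it falls out of the already-summed `sum` and `count` grids.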

jbednar (Member) commented May 11, 2017

Excellent, thanks! I'm too tired to try to parse the code, but is it using the ability of datashader to compute multiple aggregations in a single pass?

philippjfr (Member) commented May 11, 2017

> Excellent, thanks! I'm too tired to try to parse the code, but is it using the ability of datashader to compute multiple aggregations in a single pass?

No, but even so it's still faster, which is perhaps a bit surprising. I'll do some more profiling tomorrow.

philippjfr (Member) commented May 11, 2017

I was wrong: on its own it's slightly slower, but once you include the concatenation step it still wins out massively on both performance and memory load.

philippjfr (Member) commented May 11, 2017

That means it's perhaps still worth optimizing `get_agg_data` directly to ensure it only has to be run once. It's particularly worth testing how well concatenating multiple dask dataframes performs.
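For reference, the concatenation approach being compared against can be sketched as follows. This is a hypothetical helper, not code from the PR; the NaN separator row keeps a line aggregator from connecting the last point of one curve to the first point of the next:

```python
import numpy as np
import pandas as pd

def concat_with_nan_separators(dfs, x="x", y="y"):
    """Concatenate per-curve dataframes into one frame, inserting a
    NaN row between consecutive curves so they render as separate
    lines when aggregated in a single pass."""
    sep = pd.DataFrame({x: [np.nan], y: [np.nan]})
    parts = []
    for i, df in enumerate(dfs):
        if i:                         # separator before every curve but the first
            parts.append(sep)
        parts.append(df)
    return pd.concat(parts, ignore_index=True)
```

The memory cost of materializing this combined frame is exactly the overhead the summed-aggregates approach avoids.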

philippjfr (Member) commented May 11, 2017

`count_cat` is now optimized as well.
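One way the `count_cat` case can be handled is by stacking per-element count grids along a new category axis, rather than running a categorical aggregation over one concatenated frame. A hypothetical sketch of that idea, with an invented helper name:

```python
import numpy as np

def count_cat_from_elements(counts_by_key):
    """Assemble a categorical count aggregate from per-element count
    grids: one (ny, nx) grid per NdOverlay key, stacked along a new
    trailing category axis to give a (ny, nx, ncat) array."""
    keys = list(counts_by_key)
    stacked = np.stack(
        [np.nan_to_num(counts_by_key[k]) for k in keys], axis=-1
    )
    return keys, stacked
```

Since each overlay key already corresponds to one category, no per-row category column is ever needed.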

philippjfr (Member) commented May 11, 2017

Here are some benchmarks. The data are 12 curves of increasing length, where 1 minute is equivalent to 60000*60 samples. The four conditions compare line aggregation of multiple curves either by summing the aggregates (the new approach) or by aggregating over concatenated curves separated by NaNs.

[Two benchmark plots: bokeh_plot 47, bokeh_plot 53]

You can see that the new approach is generally slightly slower than aggregating over already concatenated lines, but it scales much better when using dask.

jlstevens (Member) commented May 12, 2017

@philippjfr Thanks for fixing the warning!

Is it now ready to merge or is there something else you wish to do first?

philippjfr (Member) commented May 12, 2017

Yes, this is ready to merge now. Further optimizations can come in later PRs.

jlstevens (Member) commented May 12, 2017

Great! Merging.

jlstevens merged commit a99833f into master May 12, 2017

4 checks passed

- continuous-integration/travis-ci/pr: The Travis CI build passed
- continuous-integration/travis-ci/push: The Travis CI build passed
- coverage/coveralls: Coverage remained the same at 78.859%
- s3-reference-data-cache: Test data is cached.

philippjfr deleted the datashader_ndoverlay_opt branch May 25, 2017
