Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(dask): avoid triggering compute for dynamic limit/offset #8747

Conversation

mfatihaktas
Copy link
Contributor

Description of changes

Aims to close #8186.

@mfatihaktas mfatihaktas self-assigned this Mar 22, 2024
@mfatihaktas mfatihaktas added performance Issues related to ibis's performance dask The Dask backend labels Mar 22, 2024
@mfatihaktas mfatihaktas force-pushed the perf/dask/avoid-triggering-compute-for-offset branch from 3630b55 to eec1af8 Compare March 22, 2024 20:27
@mfatihaktas mfatihaktas changed the title perf(dask): avoid triggering compute for dynamic limit/offset perf(dask): avoid triggering compute for dynamic limit/offset [WIP] Mar 22, 2024
@mfatihaktas mfatihaktas force-pushed the perf/dask/avoid-triggering-compute-for-offset branch from eec1af8 to 544eb4a Compare March 22, 2024 21:03
@cpcloud
Copy link
Member

cpcloud commented Mar 25, 2024

With the caveat that I don't know much about Dask, the code no longer looks like it obviously triggers compute.

@jcrist @ncclementi Can one of y'all review?

@cpcloud cpcloud added this to the 9.0 milestone Mar 25, 2024
Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is better than what's there currently, but we could still do better in the long run:

  • For the common case of literal limits and no offset we should be able to take a more efficient path of calling ddf.head(n, npartitions=-1). We're currently hampered by the rewrites in the pandas backend always upcasting these scalars to tables with a single element (making the literal scalar nature opaque to the dask backend).
  • Even for dynamic limits we should be able to rely on ddf.head(n, npartitions=-1), but that also only works if the input is represented as scalar and not a table.
  • Only the case where offset is non-0 requires something like what we're doing here (though again if offset is a scalar instead of a table there are more ergonomic ways of spelling this). IMO offset is a rare enough case that if we wanted to error at compile time for non-zero offset in the dask backend that would be fine.

Anyway, this should be fine to go in as is.

@cpcloud cpcloud marked this pull request as ready for review March 26, 2024 10:34
@cpcloud cpcloud changed the title perf(dask): avoid triggering compute for dynamic limit/offset [WIP] perf(dask): avoid triggering compute for dynamic limit/offset Mar 26, 2024
@cpcloud cpcloud merged commit b3e27eb into ibis-project:main Mar 26, 2024
82 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dask The Dask backend performance Issues related to ibis's performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

perf(dask): avoid triggering compute for dynamic limit/offset
3 participants