
[query] Extremely large jobs often run out of memory on the driver #14584

Open · patrick-schultz opened this issue Jun 18, 2024 · 2 comments

@patrick-schultz (Collaborator) commented:

Spark breaks down when a job has too many partitions. We should modify the implementation of CollectDistributedArray on the Spark backend to automatically break up jobs that exceed some threshold number of partitions into a few sequential smaller jobs. This would have a large impact on groups like AoU, who run Hail on the biggest datasets and currently have to work around this issue by trial and error.
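
A minimal sketch of this chunking idea, assuming CollectDistributedArray on the Spark backend ultimately does something like `sc.parallelize(contexts).map(f).collect()`; the names `collectInChunks` and `maxPartitionsPerJob` are hypothetical, not existing Hail API:

```scala
import scala.reflect.ClassTag

import org.apache.spark.SparkContext

// Minimal sketch, not Hail's actual implementation: split the per-partition
// contexts into chunks of at most `maxPartitionsPerJob` and run one Spark job
// per chunk, sequentially, so no single job exceeds the partition threshold.
def collectInChunks[C: ClassTag, R: ClassTag](
    sc: SparkContext,
    contexts: IndexedSeq[C],
    maxPartitionsPerJob: Int
)(f: C => R): IndexedSeq[R] = {
  contexts
    .grouped(maxPartitionsPerJob)  // sequential chunks of contexts
    .flatMap { chunk =>
      // still one partition per context within the chunk, but the job
      // submitted to Spark never has more than maxPartitionsPerJob partitions
      sc.parallelize(chunk, numSlices = chunk.length).map(f).collect().toIndexedSeq
    }
    .toIndexedSeq
}
```

This keeps the semantics of a single collect while bounding the per-job partition count; the chunks run one after another, so the driver only tracks one bounded job at a time.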

@chrisvittal (Collaborator) commented:

Part 1 is #14590; making this some sort of default will be part 2.

@chrisvittal changed the title from "[query] Automatically break up big spark jobs" to "[query] Extremely large jobs often run out of memory on the driver" on Oct 7, 2024
@chrisvittal (Collaborator) commented:

Some discussion from 10/7:

- Maybe use fast external storage to hold job results and query them afterwards, so that we never materialize all the results on the driver while the job is running (a rough sketch follows below).
- The call caching framework may help here.
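
A rough sketch of that external-storage direction, under the assumption that each task's result can be serialized to bytes; `runAndSpill`, `resultDir`, and the per-partition file layout are placeholders for illustration, not an existing Hail or call-caching API:

```scala
import scala.reflect.ClassTag

import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext

// Sketch only: each task writes its serialized result to a per-partition file
// and returns just the file path, so the driver never materializes all results
// in memory while the job runs; it can read or query them back later on demand.
def runAndSpill[C: ClassTag](
    sc: SparkContext,
    contexts: IndexedSeq[C],
    resultDir: String
)(f: C => Array[Byte]): IndexedSeq[String] = {
  sc.parallelize(contexts, numSlices = contexts.length)
    .mapPartitionsWithIndex { (i, it) =>
      // fresh Hadoop Configuration per task; in practice the job's actual
      // filesystem configuration would need to be shipped to the executors
      val conf = new org.apache.hadoop.conf.Configuration()
      val path = new Path(s"$resultDir/part-$i")
      val out = path.getFileSystem(conf).create(path)
      try it.foreach(ctx => out.write(f(ctx)))
      finally out.close()
      Iterator.single(path.toString)  // driver collects only small path strings
    }
    .collect()
    .toIndexedSeq
}
```

The driver (or a later stage) could then read or query the per-partition files lazily, which is where something like the call caching framework might plug in.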
