Improve optimizer to traverse the dask graph from the requested key #3

Open
maurosilber opened this issue Feb 2, 2022 · 0 comments
Labels
enhancement New feature or request

Comments

@maurosilber
Owner

Currently, the optimizer step traverses the full dask graph in no particular order:

https://github.com/maurosilber/pipeline/blob/0cac8b8954b4def43e593040dced79807ac37f3a/pipeline/storage.py#L50-L58

It would be better to start the traversal from the requested key(s) and follow their dependencies. A task does not need to be checked if all of its dependents are already stored and will be loaded.

For instance, consider the following graph, a -> b -> c, where b is already stored.

dsk = {
    "a": (task_a,),
    "b": (task_b, "a"),
    "c": (task_c, "b"),
}

If we request c, which is not stored, we need to check b, which is stored and is therefore loaded. We then don't need to check a, which is simply removed from the graph.

optimized_dsk = {
    # "a": (task_a,),
    "b": (load, task_b),
    "c": (task_c, "b"),
}

We could adapt the implementation of dask.optimization.cull.
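
As a rough illustration of the idea, here is a minimal sketch of a cull-style traversal that starts at the requested keys and stops descending as soon as it reaches a stored task. It is not the current code in pipeline/storage.py and it does not call dask itself. The names optimize_from_keys, direct_dependencies and is_stored are hypothetical, load is a placeholder for whatever loader the storage layer provides, and tasks are assumed to be flat tuples of the form (func, *args) as in the example above.

```python
def load(func):
    """Hypothetical placeholder for the storage layer's loader."""
    raise NotImplementedError


def direct_dependencies(task, dsk):
    """Return the arguments of a flat task tuple that are keys of dsk."""
    if not isinstance(task, tuple):
        return []
    deps = []
    for arg in task[1:]:
        try:
            if arg in dsk:
                deps.append(arg)
        except TypeError:
            # Unhashable arguments (e.g. lists) cannot be keys; skip them.
            pass
    return deps


def optimize_from_keys(dsk, keys, is_stored):
    """Walk dsk from the requested keys, replacing stored tasks by loads.

    is_stored(key) is an assumed callback telling whether the result of
    key is already in storage. Dependencies of a stored task are never
    visited, so tasks that are only reachable through stored tasks are
    dropped from the result without being checked.
    """
    optimized = {}
    stack = list(keys)
    while stack:
        key = stack.pop()
        if key in optimized:
            continue
        task = dsk[key]
        if isinstance(task, tuple) and is_stored(key):
            # Stored: load it and do not descend into its dependencies.
            optimized[key] = (load, task[0])
            continue
        optimized[key] = task
        stack.extend(direct_dependencies(task, dsk))
    return optimized


# Usage with the a -> b -> c example, where only b is stored.
def task_a(): ...
def task_b(a): ...
def task_c(b): ...

dsk = {
    "a": (task_a,),
    "b": (task_b, "a"),
    "c": (task_c, "b"),
}

checked = []  # records which keys were actually checked against storage


def is_stored(key):
    checked.append(key)
    return key == "b"


optimized_dsk = optimize_from_keys(dsk, ["c"], is_stored)
```

In this sketch, only c and b are checked against storage, and a never enters the optimized graph. Nested task arguments and shared dependencies that are reachable through both stored and non-stored tasks would need more care, which is where reusing the dask implementation should help.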

@maurosilber maurosilber added the enhancement New feature or request label Feb 2, 2022