Improve resume pipeline suggestion for SequentialRunner #1795
Conversation
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Force-pushed from 122839f to af405ed
Really great PR! ⭐ Thank you for doing that performance analysis. Completely agree with you that the performance hit here is perfectly acceptable for any pipeline of a realistic size. Having more than 1000 nodes between persisted datasets is not going to happen. In hindsight, given that people tend to over- rather than under-persist datasets, probably it's not even common to have pipelines with more than 10 such nodes. So all good here 👍
Also, just for reference so you have rough numbers in your head: I guess you were doing your time-complexity testing with fake nodes that don't really do any heavy data processing (which makes total sense from the point of view of testing). But in reality, if you're running a pipeline with 100 nodes that actually do useful things, that will take far longer than 0.3s, more like several minutes. So the increase in runtime incurred by adding this feature would be even smaller in relative terms for a real-world pipeline.
Just one question from before that's left over but very happy to approve here! 🙂
does this work for parallel runner? There's some funky stuff going on there with `SharedMemoryDataSet`, so it would be good to check it still works.
Great work tackling a tough issue! I don't have much concern, since this feature wasn't doing much before; even if it doesn't work for all cases it will still be an improvement. But like Antony said, it would be good to check whether it works for the other `Runner`s.
Thanks for the re-review! You're right, I dropped this one somewhere; I'm sorry. I've been looking into this and I'll give an update when I've finished.
Cool, no worries. As @noklam says, if it doesn't work then it's not a showstopper. I'm happy to merge with it just working on sequential runner, and we can fall back on the previous, inferior suggestion.
Alright, I finished my investigation into `ParallelRunner`. It is possible to implement the new scheme proposed in this PR for `ParallelRunner`. Unfortunately, it isn't of much use, since the sequence in which nodes are run (and the resulting exception is reached) is not deterministic for `ParallelRunner`. This causes problems for both the new and the existing logic for generating suggestions. For example, with the existing logic, one run will produce one resume message:
Another, identical run will (stochastically) produce a different message:
Similar results are seen for the new logic implemented in this PR: one message is correct while the other isn't. Since these conflicting messages occur with roughly the same frequency, I don't think we should suggest a resume command at all for `ParallelRunner` at the moment. @noklam @AntonyMilneQB it would be good to hear your thoughts on this. If you agree with me, I will turn off this feature for `ParallelRunner`.
I am happy that this is added just for `SequentialRunner`. Note that there may be two sources of non-deterministic behavior:
It's impossible to have a deterministic node execution order for `ParallelRunner`.
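The scheduling non-determinism being discussed can be illustrated with a minimal sketch (hypothetical node names, plain `concurrent.futures` rather than Kedro's actual runner machinery): when several failing "nodes" run concurrently, which failure surfaces first varies from run to run, so any resume suggestion derived from the first observed failure is unstable.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_node(name):
    # Simulate a node whose runtime varies between runs.
    time.sleep(random.uniform(0, 0.05))
    if name in ("node_A", "node_B"):
        raise RuntimeError(name)
    return name

def first_failure(nodes):
    # Return whichever failing node happens to finish first.
    # Re-running this can return node_A on one run and node_B on another,
    # which is exactly why the suggested resume command flip-flops.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(run_node, n) for n in nodes]
        for fut in as_completed(futures):
            try:
                fut.result()
            except RuntimeError as exc:
                return str(exc)
    return None
```

Calling `first_failure(["node_A", "node_B", "node_C"])` repeatedly returns either `"node_A"` or `"node_B"`, with no stable winner.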
Thanks for the feedback @noklam and @AntonyMilneQB! It's much appreciated. @noklam, thanks for the hint in 1. Regarding 2, you're right about this: the execution order is inherently indeterminate. Nonetheless, I think we can at least reach a deterministic 'solution' (in this case, the correct warning) using join(s). I will open an issue and explain my thinking.
…kedro-org/kedro into feat/improve-resume-scenario-suggestion
* Add _find_first_persistent_ancestors and stubs for supporting functions
* Add body to _enumerate_parents
* Add function to check persistence of node outputs
* Modify _suggest_resume_scenario to use _find_first_persistent_ancestors
* Pass catalog to self._suggest_resume_scenario
* Track and return all ancestor nodes that must be re-run during DFS
* Integrate DFS with original _suggest_resume_scenario
* Implement backwards-DFS strategy on all boundary nodes
* Switch to multi-node start BFS approach to finding persistent ancestors
* Add a useful error message if no nodes ran
* Add docstrings to new functions
* Add catalog argument to self._suggest_resume_scenario
* Modify exception_fn to allow it to take multiple arguments
* Add test for AbstractRunner._suggest_resume_scenario
* Add docstring for _suggest_resume_scenario
* Improve formatting
* Move new functions out of AbstractRunner
* Remove bare except
* Fix broad except clause
* Access datasets __dict__ using vars()
* Sort imports
* Improve resume message
* Add a space to resume suggestion message
* Modify DFS logic to eliminate possible queue duplicates
* Modify catalog.datasets to catalog._data_sets w/ disabled linter warning
* Move all pytest fixtures to conftest.py
* Modify all instances of Pipeline to pipeline
* Fix typo in the name of TestSequentialRunnerBranchedPipeline
* Remove spurious assert in save of persistent_dataset_catalog
* Replace instantiations of Pipeline with pipeline
* Modify test_suggest_resume_scenario fixture to use node names
* Add disable=unused-argument to _save
* Remove resume suggestion for ParallelRunner
* Remove spurious try / except
Description
Resolves #1477
Development notes
After a failed run, Kedro suggests a command to the user:
You can resume the pipeline run by adding the following argument to your previous command: --from-nodes "node4_B"
Before this PR, the suggested command would run from the last nodes to be executed, regardless of whether their inputs were persisted. If any input to the listed nodes is not persisted, the resumed run immediately fails again.
After this PR, the suggested command runs from the closest successfully executed nodes with persisted inputs:
You can resume the pipeline run from the nearest nodes with persisted inputs by adding the following argument to your previous command: --from-nodes "node1_B,node1_A"
This is achieved by performing a breadth-first search, starting at the last successfully executed nodes. This backward search yields a set of the nearest nodes that have persisted inputs.
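The backward search described above can be sketched roughly as follows (a simplified illustration, not the actual Kedro implementation: the pipeline is modelled as plain dicts, with hypothetical node names matching the example messages, and `has_persisted_inputs` standing in for the catalog check against `MemoryDataSet`):

```python
from collections import deque

# Hypothetical X-shaped pipeline fragment: for each node, the nodes that
# produce its inputs.
parents = {
    "node4_B": ["node3_B"],
    "node3_B": ["node2_B"],
    "node2_B": ["node1_A", "node1_B"],
    "node1_A": [],
    "node1_B": [],
}

# Whether all of a node's inputs are persisted (free inputs or
# non-memory catalog entries). In Kedro this would be derived from the
# DataCatalog; here it is hard-coded for illustration.
has_persisted_inputs = {
    "node4_B": False,
    "node3_B": False,
    "node2_B": False,
    "node1_A": True,
    "node1_B": True,
}

def nearest_persistent_ancestors(start_nodes):
    """Walk backwards (BFS) from the given nodes until every branch
    reaches a node whose inputs are all persisted."""
    queue = deque(start_nodes)
    seen = set(start_nodes)
    resume_from = set()
    while queue:
        node = queue.popleft()
        if has_persisted_inputs[node]:
            resume_from.add(node)  # safe restart point for this branch
            continue
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return resume_from
```

With this toy pipeline, `nearest_persistent_ancestors(["node4_B"])` yields `{"node1_A", "node1_B"}`, matching the two nodes named in the example suggestion above.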
Six tests are added to the `test_sequential_runner` test suite to cover different cases on an X-shaped pipeline.
Limitations
This change is a significant improvement, but there are still two important limitations:
- The new suggestion logic is only enabled for `SequentialRunner`; it has been removed for `ParallelRunner` (see the discussion above).
- A dataset is considered persistent if it is not a `MemoryDataSet`. This definition has limitations; it does not account for custom datasets that are not persisted.

In the future, I think it would be a good idea to add a method to the API of `AbstractDataSet` that checks for persistence. I would love to hear thoughts on this.
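Such an API extension could look roughly like the sketch below. These are hypothetical, heavily simplified stand-ins for the real Kedro classes (the method name `_exists_after_run` is invented for illustration); the last class shows the case the current `isinstance` check misses.

```python
class AbstractDataSet:
    def _exists_after_run(self) -> bool:
        """Hypothetical hook: whether data written by this dataset
        survives the current process, i.e. can seed a resumed run."""
        return True  # most datasets write to disk or external storage


class MemoryDataSet(AbstractDataSet):
    def _exists_after_run(self) -> bool:
        return False  # held in memory only; lost when the run dies


class MyCustomCacheDataSet(AbstractDataSet):
    """A custom dataset that is not a MemoryDataSet but is still
    non-persistent: an isinstance(ds, MemoryDataSet) check would
    wrongly treat it as a safe resume point, while this hook would not."""

    def _exists_after_run(self) -> bool:
        return False
```

The resume-suggestion search could then ask each dataset directly instead of special-casing `MemoryDataSet`.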
Checklist
- Added a description of this change in the `RELEASE.md` file