Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve resume pipeline suggestion for SequentialRunner #1795

Merged
merged 38 commits into from
Sep 2, 2022

Conversation

jmholzer
Copy link
Contributor

@jmholzer jmholzer commented Aug 18, 2022

Description

Resolves #1477

Development notes

After a failed run, Kedro suggests a command to the user:

You can resume the pipeline run by adding the following argument to your previous command: --from-nodes "node4_B"

Before this PR, the suggested command will run from the last nodes to be executed, regardless of whether their input was persisted or not. If any of the inputs to the listed nodes is not persisted, the run immediately fails again.

After this PR, the suggested command will run from the closest successfully executed nodes with persisted inputs:

You can resume the pipeline run from the nearest nodes with persisted inputs by adding the following argument to your previous command: --from-nodes "node1_B,node1_A"

This is achieved by performing a breadth-first search, starting at the last successfully executed nodes. This backward search yields a set of the nearest nodes that have persisted inputs.

Six tests are added to the test_sequential_runner test suite to test different cases on an X-shaped pipeline.

Limitations

This change is a significant improvement, but there are still two important limitations:

  • Persisted inputs are defined to be any that are not MemoryDataSets. This definition has limitations; it does not account for custom datasets that are not persisted.
  • Neither the approach in this PR nor the previous one handle the case where nodes append to datasets. Running these nodes repeatedly could have unintended consequences.

In the future, I think it would be a good idea to add a method to the API of AbstractDataSet that checks for persistence. I would love to hear thoughts on this.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
@jmholzer jmholzer force-pushed the feat/improve-resume-scenario-suggestion branch from 122839f to af405ed Compare August 19, 2022 09:54
@jmholzer jmholzer marked this pull request as ready for review August 19, 2022 10:29
@jmholzer jmholzer requested a review from idanov as a code owner August 19, 2022 10:29
Copy link
Contributor

@antonymilne antonymilne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really great PR! ⭐ Thank you for doing that performance analysis. Completely agree with you that the performance hit here is perfectly acceptable for any pipeline of a realistic size. Having more than 1000 nodes between persisted datasets is not going to happen. In hindsight, given that people tend to over- rather than under-persist datasets, probably it's not even common to have pipelines with more than 10 such nodes. So all good here 👍

Also, just for reference so you have rough numbers in your head: I guess you were doing your time complexity testing with fake nodes that don't really do any heavy data processing (which makes total sense from the point of view of testing). But in reality if you're running a pipeline with 100 nodes that actual do useful things then that will take way longer than 0.3s, like several minutes. So the increase in runtime incurred by adding this feature would be even smaller in relative terms for a real world pipeline.

Just one question from before that's left over but very happy to approve here! 🙂

does this work for parallel runner? There's some funky stuff going on there with SharedMemoryDataSet so would be good to check it still works.

Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work tackling a tough issue! I don't have much concern since this feature wasn't doing much before, even if it doesn't work for all cases it will still be an improvement. But like Antony said it would be good to check if it works for other Runner.

@jmholzer
Copy link
Contributor Author

jmholzer commented Sep 1, 2022

Just one question from before that's left over but very happy to approve here! 🙂

does this work for parallel runner? There's some funky stuff going on there with SharedMemoryDataSet so would be good to check it still works.

Thanks for the re-review! You're right I dropped this one somewhere, I'm sorry. I've been looking into this, I'll give an update when I've finished. _SharedMemoryDataset could be a problem.

@antonymilne
Copy link
Contributor

antonymilne commented Sep 1, 2022

Thanks for the re-review! You're right I dropped this one somewhere, I'm sorry. I've been looking into this, I'll give an update when I've finished. _SharedMemoryDataset could be a problem.

Cool, no worries. As @noklam says, if it doesn't work then it's not a showstopper. I'm happy to merge with it just working on sequential runner and we can fall back on using the previous inferior _suggest_resume_scenario for the parallel runner case if it's not easy to fix. Would be nice to have it working for parallel runner too, but it's not worth spending a huge amount of time on.

@jmholzer
Copy link
Contributor Author

jmholzer commented Sep 2, 2022

Alright, I finished my investigation into ParallelRunner.

It is possible to implement the new scheme proposed in this PR for ParallelRunner, though it involves some workarounds due to _SharedMemoryDataSet.

Unfortunately, it isn't of much use, since the sequence in which nodes are run (and the resulting exception is reached) is not deterministic for ParallelRunner. This causes problems for both the new and the existing logic for generating suggestions. For example, with the existing logic a run with ParallelRunner will produce the message:

You can resume the pipeline run by adding the following argument to your previous command:
--from-nodes "node3_B,node4_A"

Another identical run will (stochastically) produce the message:

You can resume the pipeline run by adding the following argument to your previous command:
--from-nodes "node4_A"

Similar results are seen for the new logic implemented in this PR. One message is correct while the other isn't. Since these conflicting messages occur with roughly the same frequency, I don't think we should be suggesting a resume command at all at the moment for ParallelRunner. I think implementing this will first require that the order of execution of nodes is made deterministic, which is a large enough task to be a separate PR.

@noklam @AntonyMilneQB it would be good to hear your thoughts on this. If you agree with me, I will turn off this feature for ParallelRunner for the time being in a new commit, merge this PR and then write up an issue.

@antonymilne
Copy link
Contributor

That sounds like a perfect plan, thanks very much @jmholzer. Note that until recently the sequential runner was also not deterministic in the order of running nodes (something @noklam fixed). I don't know if the same sort of fix would be relevant for the parallel runner.

@noklam
Copy link
Contributor

noklam commented Sep 2, 2022

I am happy that this is added just for SequentialRunner.

Note that there may be 2 sources of non-deterministic behavior:

  1. Kedro itself order the nodes in a non-deterministic (Used to be the case with SequentialRunner due to some set operation) -> Distributed these nodes into subprocesses.
  2. The nature of parallelism, the order of execution depends on if the computation is finished or not, so I think is non-deterministic by nature.

It's impossible to have deterministic nodes execution order for ParallelRunner, but there may be things that can be more deterministic for 1.

@jmholzer
Copy link
Contributor Author

jmholzer commented Sep 2, 2022

Thanks for the feedback @noklam and @AntonyMilneQB! It's much appreciated.

@noklam thanks for the hint in 1. Regarding 2, you're right about this, the execution order is inherently indeterminate. Nonetheless I think we can at least reach a deterministic 'solution' (in this case, the correct warning) using join(s). I will open an issue and explain my thinking.

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
…kedro-org/kedro into feat/improve-resume-scenario-suggestion

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
@jmholzer jmholzer merged commit 6428dd9 into main Sep 2, 2022
@jmholzer jmholzer deleted the feat/improve-resume-scenario-suggestion branch September 2, 2022 14:59
@noklam noklam changed the title Improve resume pipeline suggestion Improve resume pipeline suggestion for SequentialRunner Sep 20, 2022
nickolasrm pushed a commit to ProjetaAi/kedro that referenced this pull request Oct 26, 2022
* Add _find_first_persistent_ancestors and stubs for supporting functions.

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add body to _enumerate_parents.

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add function to check persistence of node outputs.

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify _suggest_resume_scenario to use _find_first_persistent_ancestors

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Pass catalog to self._suggest_resume_scenario

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Track and return all ancestor nodes that must be re-run during DFS.

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Integrate DFS with original _suggest_resume_scenario.

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Implement backwards-DFS strategy on all boundary nodes.

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Switch to multi-node start BFS approach to finding persistent ancestors.

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add a useful error message if no nodes ran.

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add docstrings to new functions.

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add catalog argument to self._suggest_resume_scenario

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify exception_fn to allow it to take multiple arguments

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add test for AbstractRunner._suggest_resume_scenario

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add docstring for _suggest_resume_scenario

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Improve formatting

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Move new functions out of AbstractRunner

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove bare except

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix broad except clause

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Access datasets __dict__ using vars()

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Sort imports

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Improve resume message

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add a space to resume suggestion message

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify DFS logic to eliminate possible queue duplicates

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify catalog.datasets to catalog._data_sets w/ disabled linter warning

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Move all pytest fixtures to conftest.py

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify all instances of Pipeline to pipeline

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix typo in the name of TestSequentialRunnerBranchedPipeline

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove spurious assert in save of persistent_dataset_catalog

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Replace instantiations of Pipeline with pipeline

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify test_suggest_resume_scenario fixture to use node names

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add disable=unused-argument to _save

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove resume suggestion for ParallelRunner

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove spurious try / except

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: nickolasrm <nickolasrochamachado@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants