Improve resume pipeline suggestion for SequentialRunner #1795

jmholzer · 2022-08-18T14:09:39Z

Description

Resolves #1477

Development notes

After a failed run, Kedro suggests a command to the user:

You can resume the pipeline run by adding the following argument to your previous command: --from-nodes "node4_B"

Before this PR, the suggested command ran from the last nodes to be executed, regardless of whether their inputs were persisted. If any input to the listed nodes was not persisted, the resumed run immediately failed again.

After this PR, the suggested command will run from the closest successfully executed nodes with persisted inputs:

You can resume the pipeline run from the nearest nodes with persisted inputs by adding the following argument to your previous command: --from-nodes "node1_B,node1_A"

This is achieved by performing a breadth-first search, starting at the last successfully executed nodes. This backward search yields a set of the nearest nodes that have persisted inputs.
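To make the search concrete, here is a minimal sketch of a backward breadth-first search over a toy pipeline. It is not the code from this PR: the function name `find_nearest_persisted_ancestors`, the plain-dict graph representation, and the dataset names are all assumptions made for illustration.

```python
from collections import deque


def find_nearest_persisted_ancestors(start_nodes, parents, node_inputs, persisted):
    """Walk backwards from start_nodes and return the nearest nodes whose
    inputs are all persisted (illustrative helper, not the Kedro function)."""
    queue = deque(start_nodes)
    visited = set(start_nodes)
    resume_from = set()
    while queue:
        node = queue.popleft()
        if all(ds in persisted for ds in node_inputs[node]):
            # Every input can be reloaded, so the run can restart here.
            resume_from.add(node)
            continue
        # Otherwise the parents must be re-run to regenerate the inputs.
        for parent in parents[node]:
            if parent not in visited:
                visited.add(parent)
                queue.append(parent)
    return resume_from


# Toy X-shaped pipeline: two branches merge in node2 and split again.
parents = {
    "node1_A": [],
    "node1_B": [],
    "node2": ["node1_A", "node1_B"],
    "node3_A": ["node2"],
    "node3_B": ["node2"],
}
node_inputs = {
    "node1_A": ["ds0_A"],
    "node1_B": ["ds0_B"],
    "node2": ["ds1_A", "ds1_B"],
    "node3_A": ["ds2_A"],
    "node3_B": ["ds2_B"],
}

# Only the raw inputs are persisted; everything else lives in memory.
only_raw = find_nearest_persisted_ancestors(
    ["node3_B"], parents, node_inputs, persisted={"ds0_A", "ds0_B"}
)

# If ds2_B is persisted as well, the search stops immediately at node3_B.
with_ds2_b = find_nearest_persisted_ancestors(
    ["node3_B"], parents, node_inputs, persisted={"ds0_A", "ds0_B", "ds2_B"}
)
```

In the first case the search has to go all the way back to `node1_A` and `node1_B`; in the second it stops at `node3_B`. The `visited` set keeps a node from being queued twice when branches share an ancestor.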

Six tests are added to the test_sequential_runner test suite to test different cases on an X-shaped pipeline.

Limitations

This change is a significant improvement, but there are still two important limitations:

Persisted inputs are defined to be any that are not MemoryDataSets. This definition has limitations; it does not account for custom datasets that are not persisted.
Neither the approach in this PR nor the previous one handles the case where nodes append to datasets. Re-running such nodes could have unintended consequences.

In the future, I think it would be a good idea to add a method to the API of AbstractDataSet that checks for persistence. I would love to hear thoughts on this.
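The first limitation and the proposed fix can be sketched side by side. The classes below are stand-ins defined only for this example (they are not imported from Kedro), and the `is_persistent` method is hypothetical: it shows one possible shape for the idea, not an existing API.

```python
class AbstractDataSet:
    """Stand-in base class for this sketch; not the real Kedro class."""

    def is_persistent(self) -> bool:
        # Hypothetical method: datasets are assumed persistent by default.
        return True


class MemoryDataSet(AbstractDataSet):
    """Stand-in for an in-memory dataset."""

    def is_persistent(self) -> bool:
        return False


class CSVDataSet(AbstractDataSet):
    """Stand-in for a dataset backed by a file on disk."""


class CustomCacheDataSet(AbstractDataSet):
    """A made-up custom dataset that keeps its data in memory only."""

    def is_persistent(self) -> bool:
        return False


def persisted_by_type(dataset: AbstractDataSet) -> bool:
    # The definition described above: anything that is not a MemoryDataSet.
    return not isinstance(dataset, MemoryDataSet)


def persisted_by_method(dataset: AbstractDataSet) -> bool:
    # The alternative: ask the dataset itself.
    return dataset.is_persistent()


datasets = [MemoryDataSet(), CSVDataSet(), CustomCacheDataSet()]
by_type = [persisted_by_type(ds) for ds in datasets]
by_method = [persisted_by_method(ds) for ds in datasets]
```

The type check treats `CustomCacheDataSet` as persisted even though it is not, which is exactly the case the first limitation describes. Letting each dataset report its own persistence removes that blind spot, at the cost of requiring custom datasets to override the default.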

Checklist

Read the contributing guidelines
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

antonymilne

Really great PR! ⭐ Thank you for doing that performance analysis. Completely agree with you that the performance hit here is perfectly acceptable for any pipeline of a realistic size. Having more than 1000 nodes between persisted datasets is not going to happen. In hindsight, given that people tend to over- rather than under-persist datasets, probably it's not even common to have pipelines with more than 10 such nodes. So all good here 👍

Also, just for reference so you have rough numbers in your head: I guess you were doing your time complexity testing with fake nodes that don't really do any heavy data processing (which makes total sense from the point of view of testing). But in reality if you're running a pipeline with 100 nodes that actually do useful things then that will take way longer than 0.3s, like several minutes. So the increase in runtime incurred by adding this feature would be even smaller in relative terms for a real world pipeline.

Just one question from before that's left over but very happy to approve here! 🙂

does this work for parallel runner? There's some funky stuff going on there with SharedMemoryDataSet so would be good to check it still works.

noklam

Great work tackling a tough issue! I don't have much concern since this feature wasn't doing much before; even if it doesn't work for all cases it will still be an improvement. But like Antony said, it would be good to check if it works for the other runners.

jmholzer · 2022-09-01T16:47:14Z

Just one question from before that's left over but very happy to approve here! 🙂

does this work for parallel runner? There's some funky stuff going on there with SharedMemoryDataSet so would be good to check it still works.

Thanks for the re-review! You're right I dropped this one somewhere, I'm sorry. I've been looking into this, I'll give an update when I've finished. _SharedMemoryDataset could be a problem.

antonymilne · 2022-09-01T16:51:30Z

Thanks for the re-review! You're right I dropped this one somewhere, I'm sorry. I've been looking into this, I'll give an update when I've finished. _SharedMemoryDataset could be a problem.

Cool, no worries. As @noklam says, if it doesn't work then it's not a showstopper. I'm happy to merge with it just working on sequential runner and we can fall back on using the previous inferior _suggest_resume_scenario for the parallel runner case if it's not easy to fix. Would be nice to have it working for parallel runner too, but it's not worth spending a huge amount of time on.

jmholzer · 2022-09-02T11:37:57Z

Alright, I finished my investigation into ParallelRunner.

It is possible to implement the new scheme proposed in this PR for ParallelRunner, though it involves some workarounds due to _SharedMemoryDataSet.

Unfortunately, it isn't of much use, since the sequence in which nodes are run (and the resulting exception is reached) is not deterministic for ParallelRunner. This causes problems for both the new and the existing logic for generating suggestions. For example, with the existing logic a run with ParallelRunner will produce the message:

You can resume the pipeline run by adding the following argument to your previous command:
--from-nodes "node3_B,node4_A"

Another identical run will (stochastically) produce the message:

You can resume the pipeline run by adding the following argument to your previous command:
--from-nodes "node4_A"

Similar results are seen for the new logic implemented in this PR. One message is correct while the other isn't. Since these conflicting messages occur with roughly the same frequency, I don't think we should be suggesting a resume command at all at the moment for ParallelRunner. I think implementing this will first require that the order of execution of nodes is made deterministic, which is a large enough task to be a separate PR.

@noklam @AntonyMilneQB it would be good to hear your thoughts on this. If you agree with me, I will turn off this feature for ParallelRunner for the time being in a new commit, merge this PR and then write up an issue.

antonymilne · 2022-09-02T12:42:32Z

That sounds like a perfect plan, thanks very much @jmholzer. Note that until recently the sequential runner was also not deterministic in the order of running nodes (something @noklam fixed). I don't know if the same sort of fix would be relevant for the parallel runner.

noklam · 2022-09-02T13:18:16Z

I am happy that this is added just for SequentialRunner.

Note that there may be 2 sources of non-deterministic behavior:

1. Kedro itself orders the nodes in a non-deterministic way (this used to be the case with SequentialRunner due to a set operation) and then distributes these nodes into subprocesses.
2. The nature of parallelism: the order of execution depends on when each computation finishes, so I think it is non-deterministic by nature.

It's impossible to have a deterministic node execution order for ParallelRunner, but there may be things that can be made more deterministic for 1.
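The second source can be illustrated with a small sketch. It uses threads from the standard library as a stand-in for ParallelRunner's subprocesses, and the node names and durations are made up; the point is only that completion order follows how long each task takes, not the order in which tasks were submitted.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_node(name, seconds):
    """Pretend to run a node that takes `seconds` to finish."""
    time.sleep(seconds)
    return name


def completion_order(durations):
    """Submit all nodes at once and list them in the order they finish."""
    with ThreadPoolExecutor(max_workers=len(durations)) as pool:
        futures = [
            pool.submit(run_node, name, seconds)
            for name, seconds in durations.items()
        ]
        return [future.result() for future in as_completed(futures)]


# node3_B is submitted first but is slower, so it usually finishes last.
submitted = {"node3_B": 0.3, "node4_A": 0.0}
finished = completion_order(submitted)
```

With real workloads the durations are not fixed, so the set of nodes that have completed when an exception is raised can differ between otherwise identical runs.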


jmholzer · 2022-09-02T14:18:37Z

Thanks for the feedback @noklam and @AntonyMilneQB! It's much appreciated.

@noklam thanks for the hint in 1. Regarding 2, you're right about this, the execution order is inherently indeterminate. Nonetheless I think we can at least reach a deterministic 'solution' (in this case, the correct warning) using join(s). I will open an issue and explain my thinking.


* Add _find_first_persistent_ancestors and stubs for supporting functions.
* Add body to _enumerate_parents.
* Add function to check persistence of node outputs.
* Modify _suggest_resume_scenario to use _find_first_persistent_ancestors
* Pass catalog to self._suggest_resume_scenario
* Track and return all ancestor nodes that must be re-run during DFS.
* Integrate DFS with original _suggest_resume_scenario.
* Implement backwards-DFS strategy on all boundary nodes.
* Switch to multi-node start BFS approach to finding persistent ancestors.
* Add a useful error message if no nodes ran.
* Add docstrings to new functions.
* Add catalog argument to self._suggest_resume_scenario
* Modify exception_fn to allow it to take multiple arguments
* Add test for AbstractRunner._suggest_resume_scenario
* Add docstring for _suggest_resume_scenario
* Improve formatting
* Move new functions out of AbstractRunner
* Remove bare except
* Fix broad except clause
* Access datasets __dict__ using vars()
* Sort imports
* Improve resume message
* Add a space to resume suggestion message
* Modify DFS logic to eliminate possible queue duplicates
* Modify catalog.datasets to catalog._data_sets w/ disabled linter warning
* Move all pytest fixtures to conftest.py
* Modify all instances of Pipeline to pipeline
* Fix typo in the name of TestSequentialRunnerBranchedPipeline
* Remove spurious assert in save of persistent_dataset_catalog
* Replace instantiations of Pipeline with pipeline
* Modify test_suggest_resume_scenario fixture to use node names
* Add disable=unused-argument to _save
* Remove resume suggestion for ParallelRunner
* Remove spurious try / except

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: nickolasrm <nickolasrochamachado@gmail.com>

jmholzer added 23 commits August 19, 2022 10:53

Add _find_first_persistent_ancestors and stubs for supporting functions.

426824a

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Add body to _enumerate_parents.

f90daf7

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Add function to check persistence of node outputs.

e1cb2e3

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Modify _suggest_resume_scenario to use _find_first_persistent_ancestors

18a6105

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Pass catalog to self._suggest_resume_scenario

7753486

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Track and return all ancestor nodes that must be re-run during DFS.

a402aa7

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Integrate DFS with original _suggest_resume_scenario.

699a9f5

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Implement backwards-DFS strategy on all boundary nodes.

a49a6f7

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Switch to multi-node start BFS approach to finding persistent ancestors.

7955d0d

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Add a useful error message if no nodes ran.

68764f7

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Add docstrings to new functions.

74c60f7

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Add catalog argument to self._suggest_resume_scenario

958fb91

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Modify exception_fn to allow it to take multiple arguments

d61a19b

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Add test for AbstractRunner._suggest_resume_scenario

8724923

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Add docstring for _suggest_resume_scenario

f57c431

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Improve formatting

9fda4c0

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Move new functions out of AbstractRunner

3a79059

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Remove bare except

13063dd

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Fix broad except clause

f29bbf5

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Access datasets __dict__ using vars()

01d5ab0

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Sort imports

1dae5e7

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Improve resume message

d572896

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Add a space to resume suggestion message

af405ed

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

jmholzer force-pushed the feat/improve-resume-scenario-suggestion branch from 122839f to af405ed Compare August 19, 2022 09:54

Merge branch 'main' into feat/improve-resume-scenario-suggestion

d1b6693

jmholzer marked this pull request as ready for review August 19, 2022 10:29

jmholzer requested a review from idanov as a code owner August 19, 2022 10:29

jmholzer requested review from merelcht, AhdraMeraliQB and antonymilne August 19, 2022 10:30

antonymilne approved these changes Sep 1, 2022

antonymilne requested review from AhdraMeraliQB and noklam September 1, 2022 13:04

noklam approved these changes Sep 1, 2022

jmholzer requested review from noklam and antonymilne September 2, 2022 11:39

antonymilne approved these changes Sep 2, 2022

noklam approved these changes Sep 2, 2022

jmholzer and others added 2 commits September 2, 2022 14:56

Remove resume suggestion for ParallelRunner

99e0bb2

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Merge branch 'main' into feat/improve-resume-scenario-suggestion

85f7609

jmholzer added 2 commits September 2, 2022 15:23

Remove spurious try / except

a74fef6

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Merge branch 'feat/improve-resume-scenario-suggestion' of github.com:…

74d0cec

…kedro-org/kedro into feat/improve-resume-scenario-suggestion Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

jmholzer merged commit 6428dd9 into main Sep 2, 2022

jmholzer deleted the feat/improve-resume-scenario-suggestion branch September 2, 2022 14:59

noklam mentioned this pull request Sep 5, 2022

kedro run CLI incorrectly splits the names of nodes at commas #1828

Closed

This was referenced Sep 5, 2022

Add resume suggestion to parallel runner #1830

Open

Replace Pipeline with pipeline across all tests #1833

Closed

noklam mentioned this pull request Sep 6, 2022

Workflow of debugging Kedro pipeline in notebook #1832

Open

3 tasks

noklam changed the title ~~Improve resume pipeline suggestion~~ Improve resume pipeline suggestion for SequentialRunner Sep 20, 2022

rashidakanchwala mentioned this pull request Sep 20, 2022

Create massive pipeline to test with flowchart on Kedro-viz kedro-org/kedro-viz#1064

Open

1 task

jmholzer mentioned this pull request Oct 6, 2022

Add an attribute to dataset classes to flag persistence #1910

Closed

ondrejzacha mentioned this pull request Sep 3, 2023

Improve resume pipeline suggestions #3002

Closed

AhdraMeraliQB mentioned this pull request Jan 5, 2024

Create QA Kedro test projects for stress testing and performance and evaluation #3489

Closed
