
Make temp-dir-executed checkpoint experiments return results to the workspace #8668

Merged
1 commit merged into main from fix8612 on Dec 26, 2022

Conversation

karajan1001
Contributor

fix: #8612
For checkpoint experiments, users may sometimes want to stop a run early to reduce variance. But currently, if we interrupt or kill the experiment it is marked as failed, and all of the completed checkpoints are removed, because we clean up the running directory right after the process fails.

  1. Raise CheckpointKilledError instead of StageCmdFailedError if at least one checkpoint has been committed (a minimal sketch of this selection follows the list).
  2. The temp-dir executor will continue collecting data if the checkpoint stage was interrupted.
  3. Raise a warning if a checkpoint stage was incomplete and the downstream stages were not run.
  4. Add a new functional test for this.
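
To make point 1 above concrete, here is a minimal sketch of the error-selection idea. The exception classes are defined locally as stand-ins for DVC's StageCmdFailedError / CheckpointKilledError, and raise_for_interrupted_run / checkpoints_committed are illustrative names rather than DVC's actual API.

class StageCmdFailedError(Exception):
    """Stand-in: the stage command failed without producing anything useful."""

class CheckpointKilledError(StageCmdFailedError):
    """Stand-in: the command was killed after committing one or more checkpoints."""

def raise_for_interrupted_run(checkpoints_committed: int) -> None:
    # If at least one checkpoint was committed, signal a killed checkpoint
    # run so callers know there are partial results worth collecting.
    if checkpoints_committed > 0:
        raise CheckpointKilledError("stage interrupted after committing checkpoints")
    # Otherwise nothing useful was produced; treat it as a plain failure.
    raise StageCmdFailedError("stage command failed")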

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@karajan1001 added the A: experiments (Related to dvc exp) and bugfix (fixes bug) labels on Dec 7, 2022
@karajan1001 self-assigned this on Dec 7, 2022
except Exception:  # pylint: disable=broad-except
    logger.exception(
        "Error running '%s' task, '%s' will be aborted",
        task.name,
        task.stage,
    )
    Monitor.kill(task.proc)
    task.killed.set()
@karajan1001
Contributor Author
Previously, task.killed would only be set if the checkpoint handling (not the actual training process) raised an exception. The new task.update will be set if at least one checkpoint has been committed; that is the condition under which we need to collect the checkpoint result even if the run did not finish all of its iterations.
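
As a rough, self-contained illustration of that condition (not DVC's actual classes; Task, on_checkpoint_commit, and should_collect are hypothetical names):

import threading
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    killed: threading.Event = field(default_factory=threading.Event)
    updated: threading.Event = field(default_factory=threading.Event)

def on_checkpoint_commit(task: Task) -> None:
    # Called each time a checkpoint is committed; records that at least
    # one checkpoint exists, independently of whether the run later dies.
    task.updated.set()

def should_collect(task: Task) -> bool:
    # Collect partial results whenever at least one checkpoint was
    # committed, even if the task was subsequently killed.
    return task.updated.is_set()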

@codecov

codecov bot commented Dec 7, 2022

Codecov Report

Base: 93.52% // Head: 93.52% // No change to project coverage 👍

Coverage data is based on head (28848c5) compared to base (28848c5).
Patch has no changes to coverable lines.

❗ Current head 28848c5 differs from pull request most recent head a36c00c. Consider uploading reports for the commit a36c00c to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8668   +/-   ##
=======================================
  Coverage   93.52%   93.52%           
=======================================
  Files         457      457           
  Lines       36139    36139           
  Branches     5229     5229           
=======================================
  Hits        33800    33800           
  Misses       1836     1836           
  Partials      503      503           


☔ View full report at Codecov.

Comment on lines 234 to 239
if notrun:
    logger.warning(
        "Some of the stages '%s' were not processed because "
        "something wrong occurred in the previous stages",
        ",".join([stage.addressing for stage in notrun]),
    )
Contributor
I don't think we actually need to log this, given that we don't log additional un-run stages when a normal error occurs for any stage.

Contributor
To follow up on this: if we decide we would like to log the skipped stages, it would be cleaner to just adjust the loop to use

for i, stage in enumerate(steps):
    try:
        ...  # run the stage as before
    except CheckpointKilledError:
        ...
        logger.warning(
            "skipped stages '%s'",
            ", ".join(s.addressing for s in steps[i + 1:]),
        )
        break

rather than creating an additional list of un-run stages and continuing the loop.

dvc/repo/reproduce.py: two further review threads (outdated, resolved)
@pmrowla
Contributor

pmrowla commented Dec 8, 2022

@karajan1001 I'm actually not sure that the checkpoint handling behavior belongs in stage.run/repo.reproduce at all. I think there is some confusion in the original issue. Right now we really only need to make --temp behave the same way as --queue (so it should always cleanup/collect the tempdir executor for checkpoints, which will make it save the successful iterations). see: #8612 (comment)
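
For illustration only, a tiny sketch of the "always clean up/collect the tempdir executor" behavior described here, assuming a hypothetical run callable and treating checkpoint-* files as the committed artifacts; this is not DVC's actual executor API:

import shutil
import tempfile
from pathlib import Path

def run_then_always_collect(run) -> list:
    # Run the experiment in a fresh temp directory ...
    tmp_dir = Path(tempfile.mkdtemp(prefix="exp-"))
    try:
        run(tmp_dir)  # may fail or be interrupted partway through
    except Exception:
        pass  # a real implementation would record and report the failure
    # ... then collect whatever checkpoint artifacts were committed,
    # regardless of whether the run succeeded ...
    collected = [p.name for p in tmp_dir.glob("checkpoint-*")]
    # ... and only afterwards remove the temp directory.
    shutil.rmtree(tmp_dir, ignore_errors=True)
    return collected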

@karajan1001
Contributor Author

[image attached]

@dberenbaum
Contributor

@karajan1001 Maybe there is someone else in @iterative/dvc who can review while @pmrowla is out so we can get it merged?

Contributor

@dberenbaum left a comment

LGTM!

@karajan1001
Contributor Author

> @karajan1001 Maybe there is someone else in @iterative/dvc who can review while @pmrowla is out so we can get it merged?

One problem remains for --queue experiments: on failure, it returns a result containing only the failed task, and all of the checkpoints are lost.

fix: iterative#8612
For checkpoint experiments, users may sometimes want to stop a run early
to reduce variance. But currently, if we interrupt/kill the experiment it
will be marked as failed, and all of the completed checkpoints will be
removed, as we clean up the running directory right after the process
fails.

1. We raise CheckpointKilledError instead of StageCmdFailedError if at
   least one checkpoint has been committed.
2. The temp-dir executor will continue collecting data if the checkpoint
   stage was interrupted.
3. Raise a warning if a checkpoint stage was incomplete and the downstream
   stages were not run.
4. Add a new functional test for this.
@dberenbaum
Contributor

> One problem remains for --queue experiments: on failure, it returns a result containing only the failed task, and all of the checkpoints are lost.

@karajan1001 Is this still an issue?

@karajan1001 karajan1001 deleted the fix8612 branch December 30, 2022 07:12
@karajan1001
Contributor Author

> One problem remains for --queue experiments: on failure, it returns a result containing only the failed task, and all of the checkpoints are lost.
>
> @karajan1001 Is this still an issue?

This behavior is different from --temp, where we return the completed checkpoint results even if the task failed. I think the --temp behavior is more reasonable.

Labels: A: experiments (Related to dvc exp), bugfix (fixes bug)
Projects: None yet
Development: Successfully merging this pull request may close these issues:
exp: Checkpoints created during dvc exp run --temp run are lost after failure (e.g., kill -9)
4 participants