
Make temp-dir-executed checkpoint experiments return results to the workspace #8668

Merged
1 commit merged into main from fix8612 on Dec 26, 2022

Conversation

karajan1001
Contributor

fix: #8612
For checkpoint experiments, users may sometimes want to stop a run early to reduce variance. But currently, if we interrupt or kill the experiment it is marked as failed, and all of the completed checkpoints are removed, because we clean up the running directory right after the process fails.

  1. Raise CheckpointKilledError instead of StageCmdFailedError if at least one checkpoint has been committed (a minimal sketch of this selection follows the list).
  2. The temp-dir executor will continue collecting data if the checkpoint stage was interrupted.
  3. Raise a warning if a checkpoint stage was incomplete and the downstream stages were not run.
  4. Add a new functional test for this.
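
To make point 1 above concrete, here is a minimal sketch of the error-selection idea. The exception classes are defined locally as stand-ins for DVC's StageCmdFailedError / CheckpointKilledError, and raise_for_interrupted_run / checkpoints_committed are illustrative names rather than DVC's actual API.

class StageCmdFailedError(Exception):
    """Stand-in: the stage command failed without producing anything useful."""

class CheckpointKilledError(StageCmdFailedError):
    """Stand-in: the command was killed after committing one or more checkpoints."""

def raise_for_interrupted_run(checkpoints_committed: int) -> None:
    # If at least one checkpoint was committed, signal a killed checkpoint
    # run so callers know there are partial results worth collecting.
    if checkpoints_committed > 0:
        raise CheckpointKilledError("stage interrupted after committing checkpoints")
    # Otherwise nothing useful was produced; treat it as a plain failure.
    raise StageCmdFailedError("stage command failed")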

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@karajan1001 added the A: experiments (Related to dvc exp) and bugfix (fixes bug) labels on Dec 7, 2022
@karajan1001 self-assigned this on Dec 7, 2022
except Exception:  # pylint: disable=broad-except
    logger.exception(
        "Error running '%s' task, '%s' will be aborted",
        task.name,
        task.stage,
    )
    Monitor.kill(task.proc)
    task.killed.set()
@karajan1001
Contributor Author
Previously, task.killed would only be set if the checkpoint handling (not the actual training process) raised an exception. The new task.update will be set if at least one checkpoint has been committed; that is the condition under which we need to collect the checkpoint result even if the run did not finish all of its iterations.
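
As a rough, self-contained illustration of that condition (not DVC's actual classes; Task, on_checkpoint_commit, and should_collect are hypothetical names):

import threading
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    killed: threading.Event = field(default_factory=threading.Event)
    updated: threading.Event = field(default_factory=threading.Event)

def on_checkpoint_commit(task: Task) -> None:
    # Called each time a checkpoint is committed; records that at least
    # one checkpoint exists, independently of whether the run later dies.
    task.updated.set()

def should_collect(task: Task) -> bool:
    # Collect partial results whenever at least one checkpoint was
    # committed, even if the task was subsequently killed.
    return task.updated.is_set()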

@codecov

codecov bot commented Dec 7, 2022

Codecov Report

Base: 93.52% // Head: 93.52% // No change to project coverage 👍

Coverage data is based on head (28848c5) compared to base (28848c5).
Patch has no changes to coverable lines.

❗ Current head 28848c5 differs from pull request most recent head a36c00c. Consider uploading reports for the commit a36c00c to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8668   +/-   ##
=======================================
  Coverage   93.52%   93.52%           
=======================================
  Files         457      457           
  Lines       36139    36139           
  Branches     5229     5229           
=======================================
  Hits        33800    33800           
  Misses       1836     1836           
  Partials      503      503           


☔ View full report at Codecov.

Comment on lines 234 to 239
if notrun:
    logger.warning(
        "Some of the stages '%s' were not processed because "
        "something wrong occurred in the previous stages",
        ",".join([stage.addressing for stage in notrun]),
    )
Contributor
I don't think we actually need to log this, given that we don't log additional un-run stages when a normal error occurs for any stage.

Contributor
To follow up on this: if we decide we would like to log the skipped stages, it would be cleaner to just adjust the loop to use

for i, stage in enumerate(steps):
    try:
        ...  # run the stage as before
    except CheckpointKilledError:
        ...
        logger.warning(
            "skipped stages '%s'",
            ", ".join(s.addressing for s in steps[i + 1:]),
        )
        break

rather than creating an additional list of un-run stages and continuing the loop.

dvc/repo/reproduce.py: two further review threads (outdated, resolved)
@pmrowla
Contributor

pmrowla commented Dec 8, 2022

@karajan1001 I'm actually not sure that the checkpoint handling behavior belongs in stage.run/repo.reproduce at all. I think there is some confusion in the original issue. Right now we really only need to make --temp behave the same way as --queue (so it should always cleanup/collect the tempdir executor for checkpoints, which will make it save the successful iterations). see: #8612 (comment)
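
For illustration only, a tiny sketch of the "always clean up/collect the tempdir executor" behavior described here, assuming a hypothetical run callable and treating checkpoint-* files as the committed artifacts; this is not DVC's actual executor API:

import shutil
import tempfile
from pathlib import Path

def run_then_always_collect(run) -> list:
    # Run the experiment in a fresh temp directory ...
    tmp_dir = Path(tempfile.mkdtemp(prefix="exp-"))
    try:
        run(tmp_dir)  # may fail or be interrupted partway through
    except Exception:
        pass  # a real implementation would record and report the failure
    # ... then collect whatever checkpoint artifacts were committed,
    # regardless of whether the run succeeded ...
    collected = [p.name for p in tmp_dir.glob("checkpoint-*")]
    # ... and only afterwards remove the temp directory.
    shutil.rmtree(tmp_dir, ignore_errors=True)
    return collected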

@karajan1001
Contributor Author

[image attached]

@dberenbaum
Contributor

@karajan1001 Maybe there is someone else in @iterative/dvc who can review while @pmrowla is out so we can get it merged?

Contributor

@dberenbaum left a comment

LGTM!

@karajan1001
Contributor Author

> @karajan1001 Maybe there is someone else in @iterative/dvc who can review while @pmrowla is out so we can get it merged?

One problem remains for --queue experiments: on failure, it returns a result containing only the failed task, and all of the checkpoints are lost.

fix: iterative#8612
For checkpoint experiments, users may sometimes want to stop a run early
to reduce variance. But currently, if we interrupt/kill the experiment it
will be marked as failed, and all of the completed checkpoints will be
removed, as we clean up the running directory right after the process
fails.

1. We raise CheckpointKilledError instead of StageCmdFailedError if at
   least one checkpoint has been committed.
2. The temp-dir executor will continue collecting data if the checkpoint
   stage was interrupted.
3. Raise a warning if a checkpoint stage was incomplete and the downstream
   stages were not run.
4. Add a new functional test for this.
@dberenbaum
Contributor

> One problem remains for --queue experiments: on failure, it returns a result containing only the failed task, and all of the checkpoints are lost.

@karajan1001 Is this still an issue?

@karajan1001 karajan1001 deleted the fix8612 branch December 30, 2022 07:12
@karajan1001
Contributor Author

> One problem remains for --queue experiments: on failure, it returns a result containing only the failed task, and all of the checkpoints are lost.
>
> @karajan1001 Is this still an issue?

This behavior is different from --temp, where we return the completed checkpoint results even if the task failed. I think the --temp behavior is more reasonable.

Labels: A: experiments (Related to dvc exp), bugfix (fixes bug)
Projects: None yet
Development: Successfully merging this pull request may close these issues:
exp: Checkpoints created during dvc exp run --temp run are lost after failure (e.g., kill -9)
4 participants