checkpoints: num(ber)/epoch awareness #113

casperdcl · 2021-06-15T17:38:30Z

Solves two problems:

1. dvc exp run && dvc exp run shouldn't re-run everything a second time.
   - Yes, the user code can check this, but so what? The same argument applies to non-checkpointed dvc repro && dvc repro. DVC should not re-run.
2. Interrupting dvc exp run (e.g. due to a runner timeout) and resuming shouldn't restart from checkpoint zero.

- Part of: DVC feature requests cml#560
- Possible solution: Iterable wrapper for dvclive #68


pmrowla · 2021-06-16T00:11:06Z

I'm not sure I understand the issue here?

For checkpoints dvc exp run && dvc exp run should already not be restarting anything. The second exp run should already resume from the last generated checkpoint, even if the first run was interrupted.

edit: this might be a side effect of iterative/dvc#6180 - the checkpoint resume feature requires the dvc.lock for the checkpoint you want to resume, but if CML isn't seeing dvc.lock files for previously generated checkpoints, then that's why DVC will restart from checkpoint 0 each time

casperdcl · 2021-06-16T13:02:33Z

This is more about the user code not having to look at the workspace to determine what checkpoint it's currently on. Ideally DVC itself should determine the checkpoint number and not bother running the user code if it's at the final checkpoint already.

dberenbaum · 2021-06-16T13:48:37Z

> This is more about the user code not having to look at the workspace to determine what checkpoint it's currently on. Ideally DVC itself should determine the checkpoint number and not bother running the user code if it's at the final checkpoint already.

This makes sense to me if the user code does not include iterations or make_checkpoint and instead trains a single epoch and exits. In that case, running the stage a fixed number of times makes sense. This seems like the cleanest workflow, but it also seems atypical for many users, who may create checkpoints inside a callback to a fit method that iterates over a defined number of epochs.

If the user code includes some iteration over make_checkpoint, this idea seems confusing, since the user code will define the number of iterations. What if the user code has fewer or more iterations than the number of checkpoints defined in dvc? In iterative/dvc#6104 (reply in thread), @pmrowla suggested how user code can track the number of epochs using a parameter (that may not be considered a dependency?). However, that will still run the checkpoints stage every time; the stage may just exit immediately without doing anything if the max number of checkpoints has been reached.

dberenbaum · 2021-06-16T20:53:57Z

@casperdcl What do you think about a workflow like this?

for _ in dvclive.range(100):
    ....
    dvclive.log("acc", acc)

dvclive.range could start from the tracked epoch number and exit immediately if there are no remaining epochs. If dvc/dvclive handles the iteration, then this makes sense to me.

See #68
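
A minimal sketch of how such a resumable range could behave, assuming the last completed epoch is persisted in a small JSON file. The name `resumable_range` and the `state_path` file are hypothetical stand-ins chosen for illustration; this is not the dvclive API, only one possible reading of the proposal.

```python
import json
import os
import tempfile


def resumable_range(total, state_path):
    """Yield the epochs in [0, total) that have not been completed yet.

    Hypothetical stand-in for the proposed dvclive.range: progress is
    stored in a JSON file so that a later run resumes where this one stopped.
    """
    start = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            start = json.load(f)["completed"]
    for epoch in range(start, total):
        yield epoch
        # Reached only once the caller asks for the next epoch, i.e. after
        # the loop body for `epoch` finished. An interrupted epoch is
        # therefore not recorded and is repeated on the next run.
        with open(state_path, "w") as f:
            json.dump({"completed": epoch + 1}, f)


if __name__ == "__main__":
    state = os.path.join(tempfile.mkdtemp(), "state.json")
    for epoch in resumable_range(5, state):
        if epoch == 2:
            break  # simulate an interruption during epoch 2
        print("first run, epoch", epoch)
    for epoch in resumable_range(5, state):
        print("second run, epoch", epoch)
```

In this sketch a third run would find nothing left to do and the loop body would never execute, which matches the "exit immediately if there are no remaining epochs" behavior described above.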

casperdcl · 2021-06-17T11:47:40Z

yup that looks like a clean solution

dberenbaum · 2021-06-17T18:57:59Z

@pmrowla @pared Thoughts?

pared · 2021-06-18T09:26:51Z

Do I understand this use case correctly?

1. The user uses dvclive.range to define the max number of epochs.
2. If, for some reason, execution stops and checkpoints are used, dvclive should remember the latest epoch number.
3. Upon the next run, dvclive should pick up from the latest checkpoint.

dberenbaum · 2021-07-05T17:09:40Z

Should we move this to dvclive? Is there any underlying dvc issue to keep open if this dvclive feature is added?

daavoo · 2021-07-16T08:28:46Z

> @casperdcl What do you think about a workflow like this?
>
>     for _ in dvclive.range(100):
>         ....
>         dvclive.log("acc", acc)
>
> dvclive.range could start from the tracked epoch number and exit immediately if there are no remaining epochs. If dvc/dvclive handles the iteration, then this makes sense to me.
>
> See iterative/dvclive#68

I'm not sure how someone using an ML framework for training could adapt their training code to use this.

dberenbaum · 2021-07-16T16:36:36Z

Since dvclive has a resume option, can this be handled using a dvclive.step property that can be passed to the framework? For example, model.fit(initial_epoch=dvclive.step) in keras.
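
To illustrate the idea without depending on any framework: in Keras, `fit(initial_epoch=k, epochs=n)` trains epochs `k` through `n - 1`, so `epochs` acts as the index of the final epoch rather than a count of additional epochs. The sketch below mimics that contract with a toy `fit` function. The `saved_step` variable is a hypothetical stand-in for the proposed `dvclive.step` value, which is assumed here to hold the number of epochs already completed.

```python
def fit(initial_epoch=0, epochs=1):
    """Toy stand-in for a framework fit method with Keras-like semantics.

    Trains epochs initial_epoch .. epochs - 1 and returns the ones it ran.
    """
    trained = []
    for epoch in range(initial_epoch, epochs):
        # A real framework would run one pass over the data here.
        trained.append(epoch)
    return trained


# Hypothetical value restored from a previous, interrupted run.
saved_step = 3

# Resuming: only the remaining epochs are trained.
print(fit(initial_epoch=saved_step, epochs=5))  # [3, 4]

# Nothing is left to train once the saved step reaches the target.
print(fit(initial_epoch=5, epochs=5))  # []
```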

casperdcl · 2021-07-16T16:40:29Z

not sure; dvclive documentation doesn't really exist but sounds like it would work

dberenbaum · 2021-07-16T18:54:35Z

Sorry @casperdcl, I was responding to @daavoo. I still think having dvclive.range() would be nice for a manual training loop.

> not sure; dvclive documentation doesn't really exist but sounds like it would work

Not for long! iterative/dvc.org#2632

casperdcl · 2021-07-16T20:28:35Z

Ah I see. Well for step in dvclive.range(...) would make dvclive.step a bit redundant.

dberenbaum · 2021-07-16T21:04:04Z

dvclive.step is for use with frameworks where the code looks like model.fit(initial_epoch=dvclive.step, epochs=100) instead of for step in dvclive.range(100): train(model). There's no manual training loop, which I assumed was the concern raised by @daavoo.

daavoo · 2021-07-19T07:58:45Z

> dvclive.step is for use with frameworks where the code looks like model.fit(initial_epoch=dvclive.step, epochs=100) instead of for step in dvclive.range(100): train(model). There's no manual training loop, which I assumed was the concern raised by @daavoo.

Yes, my concerns were about ML frameworks where there is no easy way to integrate dvclive.step. I was also worried about consistency between the different dvclive<>ML framework integrations, given that not every ML framework supports "resuming" or specifying an initial epoch other than 0.

> Ah I see. Well for step in dvclive.range(...) would make dvclive.step a bit redundant.

For some ML frameworks, where users write their own training loop, something like the code block below could be used instead of the hypothetical dvclive.range (unless we want to wrap additional functionality under dvclive.range):

while dvclive.get_step() < 100:
    # train for one epoch

Anyhow, it looks like adding a new public method dvclive.get_step() would be trivial and has many potential uses, cc @dberenbaum
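
A minimal sketch of that loop, assuming a step counter that survives between runs. The `StepTracker` class, its `next_step` method, and the `budget` argument are hypothetical helpers written for this example; they only imitate the proposed `get_step()` call and are not the dvclive implementation.

```python
import os
import tempfile


class StepTracker:
    """Hypothetical step counter persisted to a plain-text file."""

    def __init__(self, path):
        self.path = path

    def get_step(self):
        # A missing file means no epoch has been completed yet.
        if not os.path.exists(self.path):
            return 0
        with open(self.path) as f:
            return int(f.read())

    def next_step(self):
        # Record that one more epoch has finished.
        step = self.get_step() + 1
        with open(self.path, "w") as f:
            f.write(str(step))


def train(tracker, max_epochs, budget=None):
    """Run epochs until max_epochs; `budget` simulates an interruption."""
    ran = []
    while tracker.get_step() < max_epochs:
        if budget is not None and len(ran) >= budget:
            break  # e.g. a runner timeout
        ran.append(tracker.get_step())  # train for one epoch here
        tracker.next_step()
    return ran


if __name__ == "__main__":
    tracker = StepTracker(os.path.join(tempfile.mkdtemp(), "step.txt"))
    print(train(tracker, 5, budget=2))  # [0, 1]
    print(train(tracker, 5))            # [2, 3, 4]
    print(train(tracker, 5))            # []
```

Because the loop condition reads the persisted step, a re-run after completion does no work, and a re-run after an interruption continues from the last recorded epoch.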

dberenbaum transferred this issue from iterative/dvc Jul 16, 2021

daavoo added the discussion (requires active participation to reach a conclusion) and feature request labels Jul 19, 2021

pmrowla mentioned this issue Jul 28, 2021: Push running checkpoint to remote iterative/dvc#6332 (merged)

dberenbaum mentioned this issue Aug 5, 2021: dvclive.init: Future of arguments #128 (closed)

daavoo added the p1-important (include in the next sprint) label Aug 24, 2021

daavoo self-assigned this Aug 25, 2021

daavoo added this to To do in DVC 24 Aug - 07 Sep 2021 via automation Aug 25, 2021

daavoo moved this from To do to In progress in DVC 24 Aug - 07 Sep 2021 Aug 25, 2021

daavoo mentioned this issue Aug 25, 2021: Added dvclive.get_step #142 (merged)

daavoo closed this as completed in #142 Aug 27, 2021

DVC 24 Aug - 07 Sep 2021 automation moved this from In progress to Done Aug 27, 2021

daavoo removed this from Done in DVC 24 Aug - 07 Sep 2021 Aug 30, 2021
