start: intro to experiment checkpoints #2518

Closed
iesahin opened this issue Jun 1, 2021 · 10 comments
Labels
A: docs Area: user documentation (gatsby-theme-iterative) C: start Content of /doc/start type: discussion Requires active participation to reach a conclusion.

Comments

@iesahin
Contributor

iesahin commented Jun 1, 2021

The basic checkpoints (without dvclive, make_checkpoint, or signal-file) seem to be under-covered in the docs. We can add them to experiments using dvc stage add -c model ... or by editing dvc.yaml, and this is probably the easiest way to start with checkpoints.

The Checkpoints Tutorial covers the dvclive usage. We also need documents for signal-file and make_checkpoint but they may be considered advanced.

This is related to #2496

Related iterative/katacoda-scenarios#62

@jorgeorpinel

This comment was marked as outdated.

@shcheklein shcheklein added the A: docs Area: user documentation (gatsby-theme-iterative) label Jun 2, 2021
@iesahin
Contributor Author

iesahin commented Jun 2, 2021

I wrote a first draft for the document in #2528.

I think we discussed this; at least, I mentioned it as my plan during last week's meeting.

Can you clarify what do basic, signal-file, etc mean? They're in back quotes so I assume there's a specific meaning, are they branches of an example repo? Please link to give full context.

These are the different ways of using checkpoints, and also tags in the get-started-checkpoints repository. DVCLive is covered in the UG document, but the other three ways are not covered in detail.

For the GS level, this basic stuff should be OK. We may need to update the code in the UG document to conform to get-started-checkpoints, and add two other documents related to signal-file and python-api (or make_checkpoint() as you may have seen).

@dberenbaum
Contributor

@iesahin There was an issue in #2292, which was why the basic branch was developed. TBH I'm no longer sure the basic branch is worth supporting as its own workflow since it seems like a pretty unrealistic one, and we have so many to cover. One of the goals of basic was to make it easier to teach checkpoints, but I worry it might actually just add confusion. What do you think?

@iesahin
Contributor Author

iesahin commented Jun 2, 2021

I think your comment in #2292 still seems valid, @dberenbaum, and basic checkpoints may have both an educational and an introductory benefit.

basic checkpoints provide a means to use the checkpoints without changing the code. I think this is valuable by itself. Introducing other ways of using checkpoints needs to alter the code, add code snippets, etc.

Also, I'm not sure that a typical user will need more than one checkpoint in the pipeline. At the end, I'll add a note that if you want to use checkpoints in some other way, you can do so with the other methods: (1) DVCLive provides automated metrics tracking, (2) you can save arbitrary checkpoints with make_checkpoint, and (3) if you want to use checkpoints in R/Julia/Java/C++, you can do so with signal files.

It might add confusion, but I think the other ways are too much for introductory material. I wouldn't add make_checkpoint based tutorial to Get Started.
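
For readers unfamiliar with the signal-file approach mentioned above, here is a rough sketch of the handshake as it could look from the training script's side. It is not taken from the DVC docs. The signal file location (.dvc/tmp/DVC_CHECKPOINT), the helper name signal_checkpoint, and the "create the file, then wait until it is removed" protocol are all assumptions based on how DVC 2.x appeared to behave, so verify them against the DVC version you use. It is written in Python only for illustration; the same steps apply in any language.

```python
import time
from pathlib import Path

# Assumption: DVC 2.x watches for this file, relative to the repository root.
SIGNAL_FILE = Path(".dvc") / "tmp" / "DVC_CHECKPOINT"


def signal_checkpoint(repo_root: Path, poll: float = 0.1, timeout: float = 60.0) -> bool:
    """Create the signal file and wait until something removes it.

    Hypothetical helper: returns True if the file disappeared before
    `timeout` seconds elapsed, and False otherwise.
    """
    signal = repo_root / SIGNAL_FILE
    signal.parent.mkdir(parents=True, exist_ok=True)
    signal.touch()  # ask the watcher to record a checkpoint

    deadline = time.monotonic() + timeout
    while signal.exists():  # assumed: the watcher deletes the file when done
        if time.monotonic() > deadline:
            return False
        time.sleep(poll)
    return True
```

A training loop would call signal_checkpoint(Path(".")) after writing its model file for the epoch, and continue only once it returns.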

@dberenbaum
Contributor

I wouldn't add make_checkpoint based tutorial to Get Started.

My concern would be that make_checkpoint or dvclive are more common or useful workflows, and we should start with whatever we think a typical user would want. Maybe introducing the basic single checkpoint works if we can find a way to make clear that users can inject checkpoints into loops and callbacks and refer to the user guide section that explains in detail.

@shcheklein
Member

basic checkpoints provide a means to use the checkpoints without changing the code.

does it require making your code do one (or some number) of epochs at a time?

@iesahin
Contributor Author

iesahin commented Jun 8, 2021

basic checkpoints provide a means to use the checkpoints without changing the code.

does it require making your code do one (or some number) of epochs at a time?

Ah, yes, that may be necessary if the code has no way to resume training where it left off. Nevertheless, I think this is easier to explain than make_checkpoint or signal-file.
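
To make the point about resuming concrete, here is a minimal sketch of a training script that fits the basic workflow: it contains no DVC-specific code, loads the previous state if the checkpoint file exists, runs a fixed number of epochs, and saves the state again. The function name train, the JSON state file, and the toy weight update are all hypothetical stand-ins for a real model and training step.

```python
import json
from pathlib import Path


def train(checkpoint: Path, epochs: int = 1) -> dict:
    """Run `epochs` more epochs, resuming from `checkpoint` if it exists."""
    if checkpoint.exists():
        # Resume where the previous run left off.
        state = json.loads(checkpoint.read_text())
    else:
        # First run: start from scratch.
        state = {"epoch": 0, "weight": 0.0}

    for _ in range(epochs):
        # Stand-in for one real training epoch.
        state["weight"] += 0.5 * (1.0 - state["weight"])
        state["epoch"] += 1

    # Overwrite the checkpoint file so the next run can pick it up.
    checkpoint.write_text(json.dumps(state))
    return state
```

Calling train(Path("model.json")) repeatedly advances the same state one epoch at a time, which is the behavior the checkpoint output in dvc.yaml would rely on.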

@iesahin iesahin added the C: start Content of /doc/start label Oct 20, 2021
@jorgeorpinel jorgeorpinel changed the title from "start: Write an introductory document for checkpoints" to "start: intro to experiment checkpoints" Sep 22, 2022
@jorgeorpinel jorgeorpinel added type: discussion Requires active participation to reach a conclusion. status: stale You've been groomed! labels Sep 22, 2022
@jorgeorpinel
Contributor

jorgeorpinel commented Sep 22, 2022

@dberenbaum do you think we still want to introduce checkpoints at the Get Started level? To me it sounds like the feature is not at that level of maturity, but I'm not sure. If the answer is no, feel free to close this. Thanks.

@jorgeorpinel
Contributor

jorgeorpinel commented Sep 22, 2022

Some comments for the record (maybe we can address these points at least):

We also need documents for signal-file and make_checkpoint

Indeed, signal files are barely mentioned in https://dvc.org/doc/user-guide/experiment-management/running-experiments#checkpoint-experiments, notably not even mentioned in https://dvc.org/doc/user-guide/experiment-management/checkpoints, and mentioned a bit more (but still not explained) in https://dvc.org/doc/dvclive/dvclive-with-dvc#dvclive-with-dvc. Is this something we want to document going forward, though?

make_checkpoint is mentioned in https://dvc.org/doc/user-guide/experiment-management/checkpoints#registering-checkpoints-in-your-code but only somewhat explained in its ref, https://dvc.org/doc/api-reference/make_checkpoint.

@jorgeorpinel jorgeorpinel removed the status: stale You've been groomed! label Sep 22, 2022
@dberenbaum
Contributor

I have not seen anyone ask about language-agnostic checkpoints, so I wouldn't prioritize "signal files" until someone does.


4 participants