Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Push running checkpoint to remote #6332

Merged
merged 24 commits into from Aug 12, 2021
Merged

Conversation

karajan1001
Copy link
Contributor

@karajan1001 karajan1001 commented Jul 19, 2021

Fix #6182

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

@karajan1001 karajan1001 requested a review from a team as a code owner July 19, 2021 12:06
@karajan1001 karajan1001 requested a review from efiop July 19, 2021 12:06
@karajan1001 karajan1001 marked this pull request as draft July 19, 2021 12:06
@karajan1001 karajan1001 requested a review from pmrowla July 19, 2021 12:07
Copy link
Contributor

@pmrowla pmrowla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, just a couple small comments

dvc/repo/experiments/executor/base.py Outdated Show resolved Hide resolved
dvc/repo/experiments/executor/base.py Outdated Show resolved Hide resolved
@@ -464,6 +466,20 @@ def checkpoint_callback(
exp_rev = cls.commit(
scm, exp_hash, exp_name=name, force=force, checkpoint=True
)

git_remote = os.environ.get("DVC_EXP_AUTO_PUSH", None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put this env var into dvc/env.py and use it as a var.

Copy link
Contributor

@casperdcl casperdcl Jul 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also what about a different meaning, e.g. DVC_EXP_AUTO_PUSH=10 meaning "push every 10 checkpoints?"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@casperdcl
DVC_EXP_AUTO_PUSH is the remote git repo for auto push. Maybe this name needs to be modified. But it is better to discuss this problem in the PR for dvc.org

Copy link
Contributor

@casperdcl casperdcl Jul 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR for dvc.org

I don't follow. I understand what you're using DVC_EXP_AUTO_PUSH for. I'm asking for:

  • DVC_EXP_AUTO_PUSH to be changed to mean something else (push every N epochs)
  • or a new variable DVC_EXP_AUTO_PUSH_EVERY added (push every N epochs)
  • or 2 different variables with less ambiguous names DVC_EXP_AUTOPUSH_CHECKPOINT=0, DVC_EXP_AUTOPUSH_REMOTE=""

Basically I think a "push every N epochs" config should be supported in this PR rather than in a follow-up so that we can keep API consistent. WDYT @efiop @pmrowla?

Copy link
Contributor

@pmrowla pmrowla Jul 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@casperdcl
Is "push every N commits" actually a requirement for CML? This seems like an additional feature request that's unrelated to being able to recover checkpoints from a dead CI runner.

And it seems like there's still some behavior discussion that would need to be worked out there, i.e. what happens if I have it set to 10 and my experiment generates less than 10 commits, should DVC not push anything?

It also seems like something that should maybe be tied to dvclive and not DVC itself, since DVC doesn't even have a real concept of epoch count/iteration awareness (related dvclive ticket: iterative/dvclive#113). Since dvclive is aware of actual epoch counts, it can just call experiments.push at the appropriate time without the need for adding more env vars

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • DVC_EXP_GIT_REMOTE is used for the default exp push git remote. Both dvc exp push and auto push can use it (But the auto push cann't work without it, while dvc exp push can run git remote arguments).

But it is a totally new feature, involving changes on dvc exp push should not be mixed in this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Configuring multiple git remotes is a pretty typical use case

ok in that case I'd go with

  1. Split into two env vars

i.e. DVC_EXP_CHECKPOINT_PUSH:bool and DVC_EXP_GIT_REMOTE:str?

Regarding:

I think that the behavior of auto push should go with dvc exp push

This also makes sense to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also not specific to checkpoints, regular experiments will end be pushed the same way if the env var is set.

I think we need to keep CHECKPOINT out of the names of these env vars for this reason.


Splitting into two env vars makes sense, but it does make the UX more complex. I think it's worth making sure this is really necessary before creating extra complexity for hypothetical future needs, so I have a couple follow-ups:

Why not rename the env var to DVC_EXP_GIT_REMOTE and make no other changes? That provides future flexibility but also keeps the UX concise.

Alternatively, I think we could reconsider having a default remote.

Configuring multiple git remotes is a pretty typical use case (original/upstream repo vs my fork/origin). If I'm working on a fork and want my run experiments pushed to the original repo but not my fork, I need to be able to configure the remote.

  1. Is this likely in CI?
  2. Even locally, are users likely to do git push -u upstream mybranch? My typical workflow would be to pull from upstream and push to origin by default.
  3. If we aren't following Git default remote configuration, why not document that origin is the default instead of requiring it to be explicit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this likely in CI?

in CI probably not

Even locally, are users likely to do git push -u upstream mybranch? My typical workflow would be to pull from upstream and push to origin by default.

I do this regularly with DVC - we have CI configured to run a different set of actions on pull requests vs branches pushed directly to iterative/dvc, so there are times when I want to push a certain branch for specific features (that will require the stricter branch tests) to upstream rather than my personal fork

If we aren't following Git default remote configuration, why not document that origin is the default instead of requiring it to be explicit?

This is an option, although as someone who is going to have to answer support questions regarding "why is the default for this experiment origin instead of the default I set on my git branch", my preference would be to continue requiring the explicit argument

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dberenbaum

Alternatively, I think we could reconsider having a default remote.

I think having a default remote for the dvc exp push is reasonable, both dvc push and git push have default remotes.
How about adding it to the dvc.config, it is more manageable.

And if a default remote for dvc exp push is setted, the only env arguments here to control the auto push process is DVC_EXP_CHECKPOINT_PUSH:bool.

I don't think users will directly use auto push without using dvc exp push previously. If they are familiar with dvc exp push, and knows how to set up a default remote. The auto push configuration is quite a simple one.

karajan1001 and others added 4 commits July 27, 2021 09:58
Co-authored-by: Peter Rowlands (λ³€κΈ°ν˜Έ) <peter@pmrowla.com>
1. move env name to dvc/env.py.
2. add some tests for it.
dvc/repo/experiments/executor/base.py Outdated Show resolved Hide resolved
dvc/repo/experiments/executor/base.py Outdated Show resolved Hide resolved
dvc/repo/experiments/executor/base.py Outdated Show resolved Hide resolved
tests/func/experiments/test_checkpoints.py Outdated Show resolved Hide resolved
1. change the behaviour of self remote
2. do not use string DVC_EXP_AUTO_PUSH
3. downgrade the logger level in auto push
4. use full branch ref
@karajan1001 karajan1001 marked this pull request as ready for review July 27, 2021 10:10
@karajan1001 karajan1001 requested a review from pmrowla July 27, 2021 10:28
Co-authored-by: Peter Rowlands (λ³€κΈ°ν˜Έ) <peter@pmrowla.com>
Comment on lines 462 to 464
f"try to auto checkpoints to {git_remote} which is the "
"running repository dvc cache will be pushed to the "
"default remote while git references will not be pushed"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this message. Maybe something like

Try to save the experiment status to {git_remote} after every checkpoint.
It's data will be uploaded to the default remote (if any).

Not sure what Git refecences we're talking about. Meaning the experiments themselves? If so what gets saved to {git_remote} ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are git pushing the currently running experiment ref. In this condition, git_remote points to the current git repo itself, so nothing actually happens on the git side. However, when this env var is set, we also automate dvc pushing used cache objects and run cache for the currently running experiment as well.

So the side effect here is that nothing will be "pushed" anywhere on the git side of things, but DVC outputs and run cache will still be pushed to the default configured DVC remote.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The warning message should probably just be something like

'{git_remote}' points to the current Git repo, experiment Git refs will not be pushed. DVC cache and run cache will automatically be pushed to the default DVC remote (if any) on each experiment commit.

@karajan1001
Copy link
Contributor Author

karajan1001 commented Aug 3, 2021

In summary,

  1. split the current env into two (confirm the usage with CML team).
  2. create a new ticket for the default dvc exp push git remote.
  3. auto push on every several epochs -- a new feature not to discuss here (also a new ticket ) .

had I missed something?

@casperdcl
Copy link
Contributor

@iterative/cml

@karajan1001 karajan1001 changed the title [WIP] Push running checkpoint to remote Push running checkpoint to remote Aug 4, 2021
dvc/repo/experiments/executor/base.py Outdated Show resolved Hide resolved
tests/func/experiments/test_checkpoints.py Show resolved Hide resolved
karajan1001 and others added 2 commits August 10, 2021 16:45
Co-authored-by: Peter Rowlands (λ³€κΈ°ν˜Έ) <peter@pmrowla.com>
Copy link
Contributor

@pmrowla pmrowla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good for the most part, I think this should be the last set of changes to do

tests/func/experiments/test_checkpoints.py Outdated Show resolved Hide resolved
dvc/scm/git/backend/base.py Outdated Show resolved Hide resolved
dvc/scm/git/backend/dulwich.py Outdated Show resolved Hide resolved
karajan1001 and others added 4 commits August 11, 2021 14:02
Co-authored-by: Peter Rowlands (λ³€κΈ°ν˜Έ) <peter@pmrowla.com>
Co-authored-by: Peter Rowlands (λ³€κΈ°ν˜Έ) <peter@pmrowla.com>
@casperdcl casperdcl added A: experiments Related to dvc exp p1-important Important, aka current backlog of things to do product: VSCode Integration with VSCode extension labels Aug 11, 2021
@classmethod
def _validate_remotes(cls, dvc: "Repo", git_remote: Optional[str]):

if git_remote == dvc.root_dir:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to have this explicit check? Would this qualify as a valid git remote? Is there some reason to think a user might set DVC_EXP_GIT_REMOTE to their local project's root dir?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why fail for an invalid Git remote but only warn for this scenario?

Copy link
Contributor

@pmrowla pmrowla Aug 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

git_remote is just a variable that is either a path/URL to a git repo (including local ones) or a remote name. If this check is hit it means it's a local path that points directly to the current DVC root directory, which in most cases is a valid git repo path.

This behavior can be useful - if you do DVC_EXP_GIT_REMOTE=. the end result would be that we auto-push DVC cache for your local experiment runs. (The git push step becomes a no-op, the same thing that happens if you do a CLI git push to your own current repo directory)


if git_remote == dvc.root_dir:
logger.warning(
f"'{git_remote}' points to the current Git repo, experiment "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

                f"'{git_remote}' points to the current Git repo, experiment "

I'm not sure this is helpful since a valid remote would also point to the current Git repo, right? Maybe something like "local workspace" makes more sense than "current Git repo"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The local workspace is your current Git repo. https://github.com/my/repo.git is not your current Git repo - it's a remote Git repo, and potentially where your current Git repo is cloned from, but "current repo" would always be your current local workspace repo

@karajan1001 karajan1001 merged commit 6dd435e into iterative:master Aug 12, 2021
@karajan1001 karajan1001 deleted the fix6182 branch August 12, 2021 07:17
@casperdcl
Copy link
Contributor

thanks @karajan1001 - is there a follow-up PR in https://github.com/iterative/dvc.org/pulls?

@karajan1001
Copy link
Contributor Author

karajan1001 commented Aug 13, 2021

@casperdcl related iterative/dvc.org#2672

skshetry added a commit to skshetry/dvc that referenced this pull request Aug 13, 2021
* 'python310' of github.com:skshetry/dvc:
  Unpin networkx
  Remove a special queued experiments (iterative#6393)
  build(deps): bump google-cloud-storage from 1.41.1 to 1.42.0 (iterative#6415)
  Push running checkpoint to remote (iterative#6332)
@efiop efiop added the feature is a feature label Aug 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: experiments Related to dvc exp feature is a feature p1-important Important, aka current backlog of things to do product: VSCode Integration with VSCode extension
Projects
None yet
Development

Successfully merging this pull request may close these issues.

checkpoints: flag to exp push && push
7 participants