Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exp run: new code (or any untracked files) can't be queued [QA] #5801

Closed
jorgeorpinel opened this issue Apr 12, 2021 · 13 comments
Closed

exp run: new code (or any untracked files) can't be queued [QA] #5801

jorgeorpinel opened this issue Apr 12, 2021 · 13 comments
Labels
discussion requires active participation to reach a conclusion feature request Requesting a new feature

Comments

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Apr 12, 2021

Extracted from #5593

In the docs we say "Before running an experiment, you'll probably want to make modifications such as data and code updates, or hyperparameter tuning." Is this the intended behavior?

Because, if we see dvc.yaml as code — and I think we do as we one of DVC's principles is pipeline codification — then this statement isn't completely true when it comes to queueing experiments.

Specifically, if you create (or modify) dvc.yaml between queued runs and then try to --run-all, you get errors. Example:

# With brand-new DVC repo:

$ dvc stage add -n hi -o hi "echo hey > hi"
Creating 'dvc.yaml'
Adding stage 'hi' in 'dvc.yaml'
...
$ dvc exp run
... # works

$ dvc exp run --queue
Queued experiment '16c7340' for future execution.
$ dvc stage add -fn hi -o hello "echo hi > hello"
Modifying stage 'hi' in 'dvc.yaml'
...
$ dvc exp run --queue
Queued experiment '41791e2' for future execution.

$ dvc exp run --run-all
!ERROR: 'dvc.yaml' does not exist
ERROR: 'dvc.yaml' does not exist
ERROR: Failed to reproduce experiment '41791e2'
ERROR: Failed to reproduce experiment '16c7340'

It works with regular experiments though, which can access all workspace files (not in a tmp dir).

@jorgeorpinel jorgeorpinel added the discussion requires active participation to reach a conclusion label Apr 12, 2021
@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Apr 12, 2021

@pmrowla's reply (#5593 (comment)):

The issue with creating a new dvc.yaml file is the same as creating any new code file for experiments. In order for any new (completely untracked) files to be included in the queued experiment, you must at least stage them with git add (but they do not need to be committed) (it looks like this is not currently documented, see: #5029). For workspace runs since the files are present in the workspace itself, this issue doesn't show up.

The alternative would be for us to explicitly include all of the untracked files in our experiment git commits. Doing so will likely result in us git-committing objects which do not belong in git (like large binary files, or things like venv dirs or pycache dirs that the user forgot to git ignore), or files which the user was explicitly not tracking in DVC or git for a reason (like authentication credentials).

@dberenbaum's reply (#5593 (comment)):

Files not tracked by Git will only be reflected in workspace experiments and not in temp experiments. Workspace experiments will use whatever is in the workspace, but temp experiments copy files to the temp dir by checking them out with Git. In this case, I think it's unclear what the expected behavior is.

@jorgeorpinel
Copy link
Contributor Author

the alternative would be for us to explicitly include all of the untracked files in our experiment git commits. Doing so will likely result in us git-committing objects which do not belong in git

Yes, I see @pmrowla. We could also just copy all the workspace contents to the tmp dir without ever committing untracked files, but then the resulting exp commit could be missing files needed to reproduce its results. Tricky! I guess queued/temp runs are fundamentally different from workspace runs in some ways.

Should def. document this...

temp experiments copy files to the temp dir by checking them out with Git. In this case, I think it's unclear what the expected behavior is

I don't think it's just a git checkout rn @dberenbaum, since git-staged files from the workspace are also copied over. Anyway: yeah, it's wasn't obvious at first, but now I see that we shouldn't silently include untracked files to tmp exp dirs.

Maybe adding an option like exp run --queue/temp --all-files to explicitly include untracked files (somewhat similar to git commit --all) would be best? See #5029 (comment).

@shcheklein
Copy link
Member

A related edge case in this discussion - do we capture the --local DVC config when we run experiments in background?

@shcheklein
Copy link
Member

Before running an experiment, you'll probably want to make modifications such as data and code updates, or hyperparameter tuning." Is this the intended behavior?

btw, not sure what is the problem with this phrase specifically?

it looks this is a duplicate/related to this one #5800 - same problem pretty much

@pmrowla
Copy link
Contributor

pmrowla commented Apr 13, 2021

A related edge case in this discussion - do we capture the --local DVC config when we run experiments in background?

We do not, the issue here is similar to where we do not respect the local config when doing certain commands which use our erepo clone functionality.

it looks this is a duplicate/related to this one #5800 - same problem pretty much

And yes, this is the same issue.


We can copy untracked files (and things like the local config) directly into the temp directory, but that adds the question of how to "stage" this copy for multiple queued runs. Some options here would be:

  • Include these untracked files in our queued stash commit.
    • This was actually the behavior used in a real early experiments feature iteration: experiments: support binary and untracked files in experiment patches #4452
    • It turned out to be not such a good idea since we potentially end up shoving large binary files into a git commit, and git is bad at storing large binary files
    • This also does potentially shove things like not-gitignored auth credentials into a git commit object (even though it should be garbage collected eventually, this is still bad practice for DVC)
  • Copy the files into the temp directory at queue time, and have them sit there until the experiment is actually executed
    • This is also not ideal since we will potentially end up storing multiple copies of large binary files on disk

One other potential thing that I came up with at the time of that early PR was #4452 (comment)

As a potential future improvement, we could maybe consider doing something like generating a directory containing all of the untracked files generated by an experiment run (along with a manifest or something containing the original relpaths for the untracked files), and then storing that as an experiment-specific dvc-tracked directory.

This would fix the "storing multiple copies of large binary files" issue, but the downside would be that these untracked files end up cached by DVC until the next time the user runs gc (we could consider using some separate experiments specific cache dir as well I suppose)


Finally, none of these solutions would directly address the issue in #5800, where the user has gitignored files that also need to be present in the workspace in order to run their pipeline. In this case, it's credentials, but in theory, stuff like (properly gitignored) venvs also needs to be copied over as well.


This issue is also related to cloud execution: iterative/enhancement-proposals#3

Running experiments on remote machines or in CI will also require making sure that these files exist on the remote machine (i.e. the same issue as making sure they exist in the local temp directory)

@jorgeorpinel jorgeorpinel changed the title exp run: not all changes to "code" can be queued [QA] exp run: new code can be queued [QA] Apr 13, 2021
@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Apr 13, 2021

do we capture the --local DVC config?
We do not

What could that affect? Remotes shouldn't be a problem. Wrong cache link types is the only significant problem I can think of rn. (Do we need a separate issue for this?)

not sure what is the problem with this phrase specifically?

True @shcheklein, prob. not the specific text we'll need to update. Code changes (including to dvc.yaml) do get picked up by queued exps. I updated this issues' title for clarity.

it looks this is a duplicate/related to this one #5800
yes, this is the same issue.

I don't think it's the same problem (maybe same solution thugh?). I'm suggesting a new option to copy untracked files into tmp dirs (which can currently be done, mostly, by git-staging them, see #5029), but that EXCLUDES git/dvc-ignored ones to my mind. In #5800 they're requesting certain git-ignored files to be copied over as well.

@jorgeorpinel jorgeorpinel added the feature request Requesting a new feature label Apr 13, 2021
@jorgeorpinel jorgeorpinel changed the title exp run: new code can be queued [QA] exp run: new code can't be queued [QA] Apr 13, 2021
@jorgeorpinel jorgeorpinel changed the title exp run: new code can't be queued [QA] exp run: new code (or any untracked files) can't be queued [QA] Apr 13, 2021
@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Apr 13, 2021

generating a directory containing all of the untracked files generated by an experiment run (along with a manifest...), and then storing that... using some separate experiments specific cache dir

I like the direction of this thinking @pmrowla. How about just copying untracked files to tmp exp dirs (IMO preferably with a new --all-files option so that we totally decouple this from Git) BUT WITHOUT git-committing them to the exp. Then, instead of deleting the tmp dir, keep it around. Or at least save those untracked files somewhere in the DVC cache like in your idea. Do this in such a way (manifest file?) that exp apply can restore them to the workspace as untracked files (which should help address #5800), so they never get to Git 🙂

Running experiments on remote machines or in CI will also require making sure that these files exist on the remote machine

➕. More of a reason to figure this out (at some point).

@dberenbaum
Copy link
Contributor

dberenbaum commented Apr 13, 2021

Excluding the gitignore issue in #5800, one option is to leave as is and document. If we're only focused on the narrow issue here of new code or other files that should be Git-tracked, users ought to be able to work around this limitation pretty easily with git add, even if it feels a bit clunky and inconsistent. We can always revisit if needed. Thoughts?

EDIT: I missed that this has already been proposed in #5029 . Do we need a docs issue created? Can we convert this ticket into a discussion until there are some other actions decided?

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Apr 14, 2021

Yep, documentation for #5029 was missed. If we're not addressing this issue soon, I hope that can be done for now.

users ought to be able to work around this limitation pretty easily with git add
We can always revisit...
Can we convert this ticket into a discussion?

Sure. And this issue is basically about revisiting the current behavior at this point. Summarizing my reasons:

  • It's a bit strange to use a Git command as a proxy UI to a DVC feature.
    You would potentially git add/remove between queueing experiments just to signal something to DVC.
  • Git-staged files end up committed in the exp (which can be a security concern, esp. with git add --force).
  • You can't include empty dirs via git add (not a problem if stage add: fails with dirs that will be created later by the cmd(s) #5802 is addressed).

Proposed actions:

  1. Adding an option to exp run is better than relying on git add, as a first step.
  2. On top of that, a new caching mechanism so that untracked files included to tmp dirs don't get git-committed, but can still be restored by exp apply, should resolve most of the concerns here.

@pmrowla
Copy link
Contributor

pmrowla commented Apr 14, 2021

  • It's a bit strange to use a Git command as a proxy UI to a DVC feature.
    You would potentially git add/remove between queueing experiments just to signal something to DVC.

You already have to git add and git commit things for other DVC features to work properly.

  1. Adding an option to exp run is better than relying on git add, as a first step.

I still disagree with this, IMO the user should have to be specific about what files they want to stage, and we should not provide an option to blanket add all untracked files.

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Apr 14, 2021

You already have to git add and git commit things for other DVC features to work properly.

This is the only situation where DVC bases anything off staged-only files AFAIK. Git-committing yes, as we use Git revisions (for versioning); Git-adding by itself, not that I know of.

IMO the user should have to be specific about what files they want to stage

Sure. The UI can be different e.g. some sort of exp.config file (a manifest like you suggested), or new field in dvc.yaml to list which untracked files to include in the exp. I wouldn't call it "staging" or "adding" though, since ideally we never want those in Git. I don't think we should use Git at all here.

@pmrowla
Copy link
Contributor

pmrowla commented Apr 14, 2021

I guess I don't really see the difference between "requiring users to run 2 git commands first in order for some DVC functionality to work" and "requiring users to run 1 git command first in order for some DVC functionality to work".

e:
basically

git add dvc.lock
git commit
dvc diff <rev that was just committed> <some other rev> # diff data from one version of dvc.lock against another version of dvc.lock

is the now established workflow for using dvc diff.

This is the only situation where DVC bases anything off staged-only files AFIK. Git-committing yes, as we use Git revisions (for versioning); Git-adding by itself, not that I know of.

Sure, this is true. And experiments is a new feature and is the only situation where DVC is trying to execute a pipeline outside of the user's workspace. Since experiments is brand new, there is no established workflow for experiments at all. If we want to tell users that the workflow for using experiments is

git add my_new_model_code.py dvc.yaml
dvc exp run --queue # queue experiment containing my brand new and untracked python and dvc.yaml files

we can, because experiments is new and we can set the "standard practice" workflow to whatever we want it to be.

If we wanted to, we could even document that using git add for newly untracked files should be considered best practice all of time (even though it is technically not required for workspace runs).

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Apr 14, 2021

difference between "requiring users to run 2 git commands ... and "requiring users to run 1 git command

I think it's not about Git commands. We integrate with Git versions ("revisions") regardless of what Git interface was used to create them. I don't think we should integrate with the Git index ("staging area").

experiments is new and we can set the "standard practice" workflow
document that using git add for newly untracked files should be considered best practice

We could. But a workflow like git add + dvc exp run seems too coupled IMO. Almost like we're trying to replace Git.

I could be wrong. Let's make this a discussion and invite others to weight in? For now BTW, #5800 (comment) seems like it's becoming the same issue indeed (a request along the line of @pmrowla's exp manifest/cache idea and to my new option suggestion, I think).

@iterative iterative locked and limited conversation to collaborators Apr 14, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
discussion requires active participation to reach a conclusion feature request Requesting a new feature
Projects
None yet
Development

No branches or pull requests

4 participants