Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

experiments: ignore untracked files #5029

Merged
merged 10 commits into from Dec 4, 2020

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Dec 4, 2020

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

  • Untracked files are now ignored when populating experiment executor. Only workspace changes to git-tracked files will be passed into the executor.
  • To explicitly pass untracked files (such as a new code classes/files) into the experiment, the file should be staged as a new file via git add <file> or git add --intent-to-add <file> before running the experiment.
  • Only tracked files (including any "new" files which were staged via git add) will be included in the experiment commit.
    • This is a change from prior behavior (experiments: support binary and untracked files in experiment patchesΒ #4452). Support for the inclusion of untracked files should no longer be needed, as the types of intermediate files which were discussed in the original issue, should now be explicitly handled as checkpoint outs.
    • DVC outputs/artifacts will always be included in the experiment commit, this includes dvc.lock files and uncached (git-tracked) outputs which did not exist in git prior to running the experiment.

This change is needed to avoid issues where users are not .gitignoreing things properly (such as build directories, __pycache__, pyc files, etc). Otherwise, all of these types of run/build artifiacts would be included in the experiment git commits.

@pmrowla pmrowla added A: experiments Related to dvc exp bugfix fixes bug labels Dec 4, 2020
@pmrowla pmrowla self-assigned this Dec 4, 2020
@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Apr 12, 2021

Interesting! This needs documenting πŸ™‚ (unchecked the box for now).

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Apr 12, 2021

To explicitly pass untracked files, the file should be staged as a new file via git add or git add --intent-to-add before running the experiment.

This is a great option but IMO the UI isn't very intuitive (using Git operations to determine DVC behavior). How about introducing a new option like exp run --queue/temp --all-files (similar to git commit --all) to include untracked files (except those git-ignored) for queued/temp executors? Or --with-untracked, or some other name. Cc @dberenbaum

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Apr 12, 2021

BTW you can't stage an empty dir so that's still a limitation now, while exp run --queue/temp --all-files could copy empty dirs in the tmp exp dirs (e.g. placeholders for future outputs, rel. #5802).

Lmk if I should open a separate issue about all this...

@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 13, 2021

You can't stage or add empty dirs in git at all, as git does not track directories, it only tracks files.

How about introducing a new option like exp run --queue/temp --all-files (similar to git commit --all) to include untracked files (except those git-ignored) for queued/temp executors?

This is dangerous for DVC and not a good idea IMO. If users have not explicitly git-ignored things like auth credentials, they will end up committed and in git history. Likewise, if we may also end up shoving things like cached python bytecode or entire venv directories into git history, which is also really bad practice.

I'd say this is one of those cases where we have to force the user to be explicit, even if it is inconvenient, because the alternative is providing a convenient way for users inadvertently to do something extremely dangerous.

@jorgeorpinel
Copy link
Contributor

This is dangerous for DVC... If users have not explicitly git-ignored things like auth credentials, they will end up committed and in git history...
one of those cases where we have to force the user to be explicit

That's why I'm suggesting an explicit option, so users have to know what they're doing). We already allow the danger via git add --force (e.g. as a workaround for #5800). I think I'm just proposing a better UI that's decoupled from Git (and which also supports empty dirs).

we may also end up shoving things like cached python bytecode or entire venv directories into git history, which is also really bad practice

I definitely see your points but philosophically I think we don't need to protect users from themselves.

@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 13, 2021

I definitely see your points but philosophically I think we don't need to protect users from themselves.

I'm not so sure about this given that the original reason for this PR was that users trying out the experiments feature ended up being confused as to why __pycache__ and assorted .pyc files ended up in their git history after running experiments in this way

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Apr 13, 2021

OK. For now would you like to send a docs update proposal with this info (current behavior)? I can take it over once it's up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: experiments Related to dvc exp bugfix fixes bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dvc repro -e does not work with files which are not tracked by git
2 participants