dvc exp run --temp: dvc-tracked dependencies are not checked out #10056

Open
ralfbanisch opened this issue Oct 27, 2023 · 10 comments
@ralfbanisch

ralfbanisch commented Oct 27, 2023

Bug Report

Issue name

dvc exp run --temp: dvc-tracked dependencies are not checked out

Description

dvc exp run --temp will not dvc checkout the dependency file.txt from file.txt.dvc if file.txt has been modified.

Reproduce

  1. Set up a git repo with the following minimal structure:
├── code
│   └── test.py
├── data
│   └── file.txt
├── dvc.yaml
└── README.md

code/test.py just prints the contents of data/file.txt, and file.txt contains the single line "foo". file.txt is gitignored; all other files are git tracked. The content of dvc.yaml is:

❯ cat dvc.yaml
stages:
  generate_data:
    cmd: python3 code/test.py
    deps:
      - data/file.txt
  2. dvc init && dvc add data/file.txt && git add data/file.txt.dvc data/.gitignore
  3. dvc exp run --dry
  4. dvc exp run --temp -> prints "foo", as expected.
  5. rm data/file.txt && dvc exp run --temp -> checks out file.txt and prints "foo", as expected.
'data/file.txt.dvc' didn't change, skipping
Running stage 'generate_data':
> python3 code/test.py
foo
  6. touch data/file.txt && dvc exp run --temp -> fails to dvc checkout file.txt and runs with an empty file.txt instead.
WARNING: 'data/file.txt' is empty.
Running stage 'generate_data':
> python3 code/test.py
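
For reference, here is a minimal sketch of what code/test.py might look like, assuming it does nothing more than read the dependency and print it. The helper name read_dependency is hypothetical, and the demo uses a throwaway directory rather than the real data/file.txt so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def read_dependency(path):
    """Return the text of the dependency file (hypothetical helper)."""
    return Path(path).read_text()


# Demo in a temporary directory instead of the real repo layout.
with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / "file.txt"

    # State recorded by file.txt.dvc in the report: a single line "foo".
    dep.write_text("foo\n")
    committed = read_dependency(dep)

    # State after `rm` followed by `touch`: the file exists but is empty.
    dep.write_text("")
    empty = read_dependency(dep)

print(repr(committed), repr(empty))
```

With the empty file the script prints nothing, which matches the blank output of the last step above.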

Expected

dvc exp run --temp should not copy file.txt to the temporary folder, since it is not git tracked; instead, it should dvc checkout file.txt from the local cache, just as it does when file.txt is not present at all.

Environment information

dvc==3.27.0

Output of dvc doctor:

dvc doctor
DVC version: 3.27.0 (pip)
-------------------------
Platform: Python 3.8.12 on Linux-5.15.0-87-generic-x86_64-with-glibc2.17
Subprojects:
        dvc_data = 2.18.2
        dvc_objects = 1.0.1
        dvc_render = 0.5.3
        dvc_task = 0.3.0
        scmrepo = 1.4.0
Supports:
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2023.6.0, boto3 = 1.26.76),
        ssh (sshfs = 2023.4.1)
Config:
        Global: /home/ralf/.config/dvc
        System: /etc/xdg/xdg-ubuntu/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/nvme0n1p5
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/cae49d9183304046b1da8acdeb50f42f

Additional Information (if any):

@pmrowla
Contributor

pmrowla commented Oct 27, 2023

This is expected behavior. dvc exp run runs the experiment using the state of your current workspace. If you have a modified DVC dependency (in this case the empty file), DVC will use that modified state when it runs the experiment; it does not revert to the last DVC-committed state from your .dvc file. Internally, DVC does the equivalent of dvc commit prior to running the experiment.

Checking out the dependency when it does not exist at all (but is tracked in the .dvc file) is handled as a special case, rather than assuming you intended to commit the deletion of that dependency.

@ralfbanisch
Author

Well if it's expected ok, I do find it confusing though. Especially the difference in behaviour when you compare what happens if file.txt is not present (it gets dvc checked out and the experiment runs) vs. when it is modified (where the modified file is then copied to the tmp dir and used).

You also state here in your docs that "Git-ignored files/dirs are excluded from queued/temp runs" and this is a git-ignored file, so I did expect it to be in fact excluded. https://dvc.org/doc/user-guide/experiment-management/running-experiments#how-are-experiments-isolated

@pmrowla
Contributor

pmrowla commented Oct 27, 2023

We can clarify that in the docs, but that really means "git-ignored files which are also not tracked by DVC" (since all DVC-tracked files are git-ignored).

If you had file.txt listed as a pipeline stage dependency, git-ignored it, and also did not track it with DVC (meaning it has no corresponding file.txt.dvc file), then you would get the behavior where file.txt would not be copied into the temp experiment workspace. In that scenario there would also be no version of the file to dvc checkout, so you would explicitly have to use -C/--copy-paths if you wanted to copy a specific version of file.txt into the temp workspace.

@lefos99

lefos99 commented Oct 30, 2023

Well if it's expected ok, I do find it confusing though. Especially the difference in behaviour when you compare what happens if file.txt is not present (it gets dvc checked out and the experiment runs) vs. when it is modified (where the modified file is then copied to the tmp dir and used).

I agree with @ralfbanisch on this one. So all in all, imagine you have a file.txt (git ignored) and a file.txt.dvc and you would have three different cases:

  1. file.txt doesn't exist and file.txt.dvc exists -> when you run dvc exp run --temp (isolated experiment), the experiment uses the correct file.txt (the one recorded in the .dvc file), because of the internal dvc checkout file.txt.dvc. ✔️
  2. file.txt exists, is unmodified, and file.txt.dvc exists -> when you run dvc exp run --temp (isolated experiment), the experiment uses the correct file.txt (the one recorded in the .dvc file). ✔️
  3. file.txt exists but is modified, and file.txt.dvc exists -> when you run dvc exp run --temp (isolated experiment), the experiment uses the modified file.txt, which differs from what the .dvc file indicates. ❌

To me 1 and 2 are contradictory to 3, because it's unclear who the winner is each time. Is it the actual txt file? Is it the .dvc file, which will be checked out? 🤷‍♀️

@pmrowla
Contributor

pmrowla commented Oct 30, 2023

The distinction is that DVC runs the dvc commit internally, so file.txt.dvc is modified, and the temp workspace will contain the modified file.txt.dvc. Inside the temp workspace, the dvc checkout step then uses the modified file.txt.dvc which results in checking out the modified file.txt.

This is consistent with what happens if you have unstaged changes in any other git-tracked file. The experiment will contain any unstaged and uncommitted changes to git-tracked files and will also contain any unstaged and uncommitted changes to DVC-tracked files (with the caveat that the DVC tracked file itself is used as the source of truth and not the .dvc file).
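
The rule described in this thread can be condensed into a toy model. This is only a sketch of the behavior as explained above, not DVC's implementation; the function name and its arguments are hypothetical, and file contents are passed as plain strings.

```python
from typing import Optional


def content_used_by_experiment(workspace: Optional[str], recorded: str) -> str:
    """Toy model of which file.txt content an experiment run ends up with.

    workspace: current content of file.txt, or None if the file is missing.
    recorded:  content that file.txt.dvc points to in the cache.
    """
    if workspace is None:
        # Special case: a missing file is checked out from the .dvc file
        # instead of being treated as an intentional deletion.
        return recorded
    # Otherwise the workspace file is the source of truth: the equivalent of
    # `dvc commit` updates the .dvc file first, so the later checkout
    # reproduces the workspace content, modified or not.
    return workspace


cases = {
    "missing": content_used_by_experiment(None, "foo\n"),
    "unchanged": content_used_by_experiment("foo\n", "foo\n"),
    "modified": content_used_by_experiment("", "foo\n"),
}
for name, content in cases.items():
    print(f"{name}: {content!r}")
```

In this model the .dvc file only decides the outcome when file.txt is absent, which is the asymmetry being discussed.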

Let's say you have a git tracked params.yaml in your workspace as well as a git tracked python stage train.py. If you make modifications to those params or to train.py in your workspace, but do not stage or commit them in git, and then run dvc exp run --temp, would you expect the experiment to contain those modifications? Or would you expect DVC to revert them before running the experiment?

We used to have a dvc exp run --reset flag which was tied to the old checkpoints behavior, but we could consider bringing it back to indicate that you want DVC to reset any modifications to all git- and dvc-tracked files (so DVC would run the experiment with everything reverted to the last git-committed state).

cc @dberenbaum

@ralfbanisch
Author

ralfbanisch commented Oct 30, 2023

Ok, so I agree that if I make changes to the git-tracked files train.py and/or params.yaml, but don't stage or commit them to git, and then run dvc exp run --temp, then I expect the experiment to contain those modifications.

I think the confusion here is in case of the dvc-tracked file.txt, it's not clear what the source of truth is - file.txt or file.txt.dvc? After all, I could have presented the behaviour above also by modifying file.txt.dvc so that it points to some different version of file.txt in the cache, which perhaps contains the line bar instead of foo. In line with the above behaviour for git-tracked files (file.txt.dvc is git tracked), I would fully expect that the modification to file.txt.dvc gets applied to the experiment, the other version of file.txt gets checked out, and the experiment returns bar. That is not what happens. I have lost quite some compute time because this behaviour surprised me, and other members in my group did so as well.

It is good to understand now why it happens like this (because of dvc commit under the hood) and I can work around it, however from the user perspective I still find it confusing.
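
To make "modified" concrete: a .dvc file records a content hash of the tracked file, so the workspace file counts as modified once its hash no longer matches the recorded one. The sketch below assumes a plain MD5 over the file bytes; the helper names are hypothetical, and DVC's real status check involves more than this comparison.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of raw file bytes."""
    return hashlib.md5(data).hexdigest()


def is_modified(workspace_bytes: bytes, recorded_md5: str) -> bool:
    """True if the workspace content no longer matches the recorded hash."""
    return md5_hex(workspace_bytes) != recorded_md5


# Hash as it would be recorded for a file containing the single line "foo".
recorded = md5_hex(b"foo\n")

unchanged = is_modified(b"foo\n", recorded)  # same bytes as recorded
emptied = is_modified(b"", recorded)         # empty file left by `touch`
print(unchanged, emptied)
```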

@dberenbaum
Contributor

I have lost quite some compute time because this behaviour surprised me, and other members in my group did so as well.

@ralfbanisch Could you explain your actual workflow? The minimal example is great, but this is now more of a product discussion where your real use case is more important. How did the files get to be in a modified/missing state in the workspace, and why do you want to ignore whatever changes you made there?

@dberenbaum added the "awaiting response" label Oct 30, 2023
@dberenbaum
Contributor

@pmrowla The surprising part to me is that it works differently without --temp. In a workspace run, does that mean we don't do dvc commit but still do dvc checkout?

@ralfbanisch
Author

ralfbanisch commented Oct 30, 2023

For me the behaviour is actually the same without --temp, the difference is only that I see then after dvc exp run that file.txt.dvc has a staged change.

The use case for me was to generate data on model performance as a function of dataset size in an ablation experiment. file.txt contains a list of images (subset from the full dataset in the dvc cache) which are used to train the model. My workflow was something like this:

  • generate file_subset{n}_seed{s}.txt by sampling n images from full dataset with random seed s.
  • dvc add file_subset{n}_seed{s}.txt
  • modify file.txt.dvc so that it will point to file_subset{n}_seed{s}.txt
  • dvc exp run --temp/--queue
  • repeat

Since I never modified file.txt or explicitly included the dvc checkout after modifying file.txt.dvc, I ended up with many experiments that had identical dependencies, instead of a different dependency for each experiment, as I had intended. I could have of course just modified file.txt in the first place, knowing what I know now. I thought explicitly adding my modified dependency to the dvc cache and modifying the "dependency pointer" file.txt.dvc was the more appropriate way.

@dberenbaum
Contributor

For me the behaviour is actually the same without --temp, the difference is only that I see then after dvc exp run that file.txt.dvc has a staged change.

My mistake there. I misread the issue.

I could have of course just modified file.txt in the first place, knowing what I know now. I thought explicitly adding my modified dependency to the dvc cache and modifying the "dependency pointer" file.txt.dvc was the more appropriate way.

In general, the expectation is that you manage the actual data, and dvc manages the .dvc files. In other words, file.txt should be the source of truth for dvc exp run and dvc will update file.txt.dvc accordingly. If there is somewhere that you think it would help to better clarify, we can try to improve the docs around this.
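
Under that expectation, the ablation workflow could write each sampled subset directly into file.txt before queuing a run. The following is a sketch only: write_subset, the image names, and the subset size are made up for illustration, and the demo writes to a temporary directory rather than a real DVC repo.

```python
import random
import tempfile
from pathlib import Path


def write_subset(all_images, n, seed, target):
    """Sample n images with a fixed seed and write them to target, one per line."""
    rng = random.Random(seed)  # local RNG, so each seed is reproducible
    subset = rng.sample(list(all_images), n)
    Path(target).write_text("\n".join(subset) + "\n")
    return subset


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "file.txt"
    images = [f"img_{i:04d}.jpg" for i in range(100)]  # hypothetical dataset

    first = write_subset(images, 10, seed=0, target=target)
    again = write_subset(images, 10, seed=0, target=target)
    other = write_subset(images, 10, seed=1, target=target)
    written = target.read_text().splitlines()
    # In a real repo, each write would be followed by
    # `dvc exp run --temp` or `dvc exp run --queue`, so that every
    # experiment picks up the file.txt content present at that moment.

print(len(first), first == again, written == other)
```

Because file.txt itself changes between runs, each experiment records a different dependency and there is no need to edit file.txt.dvc by hand.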

The distinction is that DVC runs the dvc commit internally, so file.txt.dvc is modified, and the temp workspace will contain the modified file.txt.dvc. Inside the temp workspace, the dvc checkout step then uses the modified file.txt.dvc which results in checking out the modified file.txt.

@pmrowla What happens here when file.txt is missing?

@skshetry removed the "awaiting response" label Mar 25, 2024