exp run behaves differently than repro for stages without dependencies #6718

Closed
iesahin opened this issue Sep 30, 2021 · 3 comments
Labels
A: experiments (Related to dvc exp), p3-nice-to-have (It should be done this or next sprint)

iesahin commented Sep 30, 2021

Bug Report

Description

When I create a stage without dependencies, dvc exp run runs it the first time but not in subsequent runs. We discussed this with @skshetry, and he said it's a side effect of using dvc checkout + dvc repro. I believe the outputs from the previous dvc exp run are checked out, so the stages are skipped because their outputs appear identical to those in dvc.lock. This may lead to subtle (or not so subtle) bugs in experimentation.

I think stages without dependencies should always run, in both dvc repro and dvc exp run, unless the user explicitly wants otherwise.

Reproduce

When you run the following code:

dir=/tmp/$RANDOM && mkdir -p "$dir" && cd "$dir"  # same effect as oh-my-zsh's `take`
git init
dvc init
dvc stage add -n stage1 -o output1.txt 'echo $RANDOM >> output1.txt'
dvc stage add -n stage2 -o output2.txt -d output1.txt 'echo $RANDOM >> output2.txt'
git add . 
git commit -m "dvc init"
dvc exp run -n $RANDOM #1
rm -f output1.txt
dvc exp run -n $RANDOM #2

the second dvc exp run doesn't run the pipeline and reports:

$ dvc exp run
Stage 'stage1' didn't change, skipping
Stage 'stage2' didn't change, skipping

If I instead use dvc repro after deleting output1.txt, the pipeline runs.

$ rm output1.txt
rm: remove regular file 'output1.txt'? y

$ dvc repro
Running stage 'stage1':
> echo $RANDOM >> output1.txt
Updating lock file 'dvc.lock'

Running stage 'stage2':
> echo $RANDOM >> output2.txt
Updating lock file 'dvc.lock'

To track the changes with git, run:

        git add dvc.lock
Use `dvc push` to send your updates to remote storage.

Expected

I'd expect dvc exp run and dvc repro to behave identically for stages without dependencies. I believe a stage without dependencies (like stage1 above) is meant to always run. Otherwise there is no clear condition for running it, and there is no point in creating a stage that won't be run.

Environment information

This is dvc 2.7.4.

@iesahin iesahin added the "A: experiments" label Sep 30, 2021
Contributor

pmrowla commented Oct 1, 2021

For reference, this does also affect dvc repro. If the output file already exists, dvc repro will not re-run the stage without any dependencies.

You can reproduce this behavior by just running dvc repro twice. (Or by doing dvc repro; rm output.txt; dvc checkout; dvc repro to make it more like what exp run does)

$ cat dvc.yaml
stages:
  rand:
    cmd: echo $RANDOM > output.txt
    outs:
    - output.txt

$ dvc repro
Running stage 'rand':
> echo $RANDOM > output.txt
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

        git add dvc.lock
Use `dvc push` to send your updates to remote storage.

$ dvc repro
Stage 'rand' didn't change, skipping
Data and pipelines are up to date.

IMO this is expected behavior for both repro and exp run. If the user has a stage (with or without deps) that is intended to always be re-run, it should explicitly be marked as always_changed: true.
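As a sketch of what that marking could look like for the rand stage shown above, the snippet below writes a dvc.yaml with always_changed: true added. The stage name, command, and output path are copied from the example; only the always_changed line is new, and overwriting dvc.yaml in the current directory is an assumption made for illustration.

```shell
# Write a dvc.yaml in which the dependency-free stage is flagged to always re-run.
# NOTE: this overwrites any dvc.yaml in the current directory (illustrative only).
cat > dvc.yaml <<'EOF'
stages:
  rand:
    cmd: echo $RANDOM > output.txt
    always_changed: true
    outs:
    - output.txt
EOF
```

With this flag set, the stage should be treated as changed on every dvc repro or dvc exp run, regardless of whether output.txt already matches dvc.lock.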

We could consider changing the current default behavior to always assume stages without deps are always_changed: true, but I think we would need to make that change for both exp run and repro (and not solely for exp run).

@dberenbaum

Contributor

pmrowla commented Oct 1, 2021

To clarify, the default behavior for DVC pipelines is to assume that stages are deterministic (always_changed: false). This applies whether or not the stage has any dependencies. Having no deps just means that the empty/None dependency state {} (rather than the typical {'some_dep.txt': 'abcd1234...'} file/param:hash dep state) still gets mapped to a single deterministic output.

If the stage is non-deterministic (as in this example case where the stage generates a random number), it's on the user to mark it as such with --always-changed/always_changed: true.
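The reasoning above can be illustrated with a toy model. This is not DVC's implementation: state_hash and needs_run are hypothetical helpers, cksum stands in for whatever hashing DVC uses, and output checks are ignored. It only shows that an empty dependency state hashes to the same value every time, so a stage with no deps looks unchanged unless it is flagged as always changed.

```shell
# Toy model (NOT DVC code): decide whether a stage needs to run.

# Hash a list of "path:hash" dependency entries.
# An empty list always produces the same value.
state_hash() {
  printf '%s\n' "$@" | sort | cksum | cut -d' ' -f1
}

# needs_run RECORDED_HASH ALWAYS_CHANGED [DEP_ENTRY...]
# Returns 0 (run) if the stage is flagged always_changed or its
# dependency state differs from the recorded one; 1 (skip) otherwise.
needs_run() {
  local recorded=$1 always_changed=$2
  shift 2
  [ "$always_changed" = "true" ] && return 0
  [ "$(state_hash "$@")" != "$recorded" ]
}

# State recorded after the first run of a stage with no deps.
recorded=$(state_hash)

if needs_run "$recorded" false; then echo "run"; else echo "skip"; fi  # prints: skip
if needs_run "$recorded" true;  then echo "run"; else echo "skip"; fi  # prints: run
```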

related: #2378

Contributor

dberenbaum commented Oct 1, 2021

There are two separate issues:

  1. It's unclear whether stages without dependencies should always run. @iesahin Do you want to open a separate issue for this? I don't think it's part of the issue title.
  2. exp run and repro behave differently if an output is modified. As @skshetry explained, this occurs because exp run will check out the old output before running repro. This will occur for any type of stage, but stages without dependencies don't rely on the run-cache, so they will always be run by repro if the output changes, whereas other stages might still get skipped if they are found in the run-cache.
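The second point can be approximated with plain commands, following the sequence suggested earlier in the thread (dvc repro; rm output.txt; dvc checkout; dvc repro). This is only a rough sketch of the described behavior, not what exp run does internally: emulate_exp_run is a hypothetical helper, and output.txt assumes the single-stage rand pipeline from the earlier comment.

```shell
# Rough emulation of the checkout-then-repro sequence described in the thread.
emulate_exp_run() {
  dvc repro          # first run: executes the stage and records it in dvc.lock
  rm -f output.txt   # the workspace output is removed (or modified)
  dvc checkout       # the output recorded in dvc.lock is restored
  dvc repro          # per the thread, the stage is now expected to be skipped
}

# Only invoke when the dvc CLI is installed and a pipeline file is present.
if command -v dvc >/dev/null 2>&1 && [ -f dvc.yaml ]; then
  emulate_exp_run
fi
```

Dropping the dvc checkout step gives the plain repro case from the original report, where the missing output causes the stage to run again.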

@dberenbaum dberenbaum added the "p3-nice-to-have" label Feb 17, 2023
@mattseddon mattseddon closed this as not planned Mar 5, 2024