Allow to skip certain stages when dependencies are missing #10019

Open
Luux opened this issue Oct 13, 2023 · 3 comments
Labels
A: pipelines (Related to the pipelines feature), awaiting response (we are waiting for your reply, please respond! :)), feature request (Requesting a new feature), triage (Needs to be triaged)

Comments

@Luux

Luux commented Oct 13, 2023

If we add a new dataset, we want to run all the data processing steps, but for example skip the evaluation as we do not have labels yet. We still want to utilize the foreach functionality to iterate through our different datasets.

For this scenario, it would be helpful to have a kind of deps which specifies "if not present, skip the stage instead of throwing an error, but behave just as normal deps otherwise". Currently, we have to add dummy files manually in order to be able to run dvc repro.

We can use dvc repro --keep-going for now, but this does not differentiate between missing dependencies and other errors that might occur. Also, in cases like my example above, we want to treat the current state of the pipeline as clean.

@shcheklein
Member

How about --allow-missing? Can it be applied in this case, @Luux? https://dvc.org/doc/command-reference/repro#example-only-pull-pipeline-data-as-needed

@shcheklein added the A: pipelines, awaiting response, feature request, and triage labels Oct 13, 2023
@dberenbaum
Contributor

Why is it not enough to specify a target like dvc repro train so that the pipeline stops before the eval stage?

@Luux
Author

Luux commented Dec 1, 2023

@dberenbaum We want the entire pipeline to be clean. The idea of our pipeline is that we want to define everything relevant in our pipeline configuration (dvc.yaml), so that we just need to run dvc repro and do not have to think about anything else to get our data up-to-date from a user perspective. If dvc repro or dvc repro --dry is clean, this means we know that everything is fine.

If our datasets variable consists of a list of 4 datasets, and we have a stage eval with a foreach loop, this would result in

eval@dataset1
eval@dataset2
eval@dataset3
eval@dataset4

But dataset4 might not have the required labels.yaml yet, so it fails. Of course we could ignore errors, but dvc repro would still mark eval@dataset4 as dirty. To change that, we want to add a flag to the labels.yaml dependency that leads dvc to simply not consider the stage if it is missing. This means, in this scenario, dvc repro should only consider

eval@dataset1
eval@dataset2
eval@dataset3

as well as the corresponding downstream stages (maybe somehow mark/log the skipped ones as not considered). Nevertheless, dvc repro should be clean afterwards unless the file is added later on. Basically, we'd need a distinction between "dependency does not exist" and "dependency exists, but is just not pulled to our local machine because we do not need it right now"; the latter seems to be more the purpose of --allow-missing.
If the former is not feasible (as you'd need to check whether a file is actually referenced somewhere and therefore should exist at least on the remote), another option would be to allow subtractive foreach loops/variables along the lines of "foreach $datasets except dataset4".
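
To make the subtractive foreach idea concrete, here is a minimal Python sketch of the semantics being asked for. It is not a DVC API: expand_foreach and its exclude parameter are hypothetical names used only for illustration, and the only thing taken from DVC is the stage@item naming shown in the lists above.

```python
def expand_foreach(stage, items, exclude=()):
    """Expand a foreach stage into per-item stage names, minus the excluded items.

    Hypothetical helper: it only mimics the proposed
    "foreach $datasets except dataset4" behaviour.
    """
    excluded = set(exclude)
    # Keep the original order of items; drop anything listed in `exclude`.
    return [f"{stage}@{item}" for item in items if item not in excluded]


datasets = ["dataset1", "dataset2", "dataset3", "dataset4"]
print(expand_foreach("eval", datasets, exclude=["dataset4"]))
# ['eval@dataset1', 'eval@dataset2', 'eval@dataset3']
```

With an empty exclude list, the sketch yields all four eval stages, which matches today's behaviour.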

Currently, the way to mimic this is either to create empty files and handle this case within our data handling package/script/program all the way down, or to define a separate variable datasets_for_eval that consists of just the first three datasets. Neither is very elegant in my eyes.
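
For reference, the datasets_for_eval workaround could be derived automatically instead of being maintained by hand. The following is a rough sketch under stated assumptions, not part of DVC: it assumes a data/<dataset>/labels.yaml layout, and datasets_with_labels is a hypothetical helper name. How the resulting list gets back into the pipeline configuration is left open.

```python
from pathlib import Path


def datasets_with_labels(datasets, data_root="data", labels_name="labels.yaml"):
    """Return the datasets whose labels file exists, keeping the input order.

    Assumed layout (hypothetical): <data_root>/<dataset>/<labels_name>
    """
    root = Path(data_root)
    return [name for name in datasets if (root / name / labels_name).is_file()]


if __name__ == "__main__":
    datasets = ["dataset1", "dataset2", "dataset3", "dataset4"]
    # The result is the list that would be stored as datasets_for_eval.
    print(datasets_with_labels(datasets))
```

This still leaves two variables to keep in sync, so it only automates the workaround rather than replacing the requested feature.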
