dvc.yaml: added a new option to dvc.yaml to avoid unprotecting files #7072
add the flag append_only to the pipeline to avoid unprotecting the files while persist flag is true. Fixes iterative#6562.
Documenting the changes in iterative/dvc#7072
Hi @rgferrari! Thanks for the contribution.
Could you please include some tests for this new flag? You can use the existing tests for the persist flag as a reference: https://github.com/iterative/dvc/search?p=1&q=persist
The new flag should also be added in a few additional places (e.g. dvc/stage/serialize.py, dvc/stage/utils.py, dvc/command/stage.py, ...). As with the tests, you can use the search URL above to look for persist usages.
Another general question: should it actually be named append-only? The current name seems to apply only to a persistent directory (where "append" refers to adding new files to that directory).
If I have a single-file persistent output and I want to append to it (meaning write data to the end of that single file), this flag would prevent me from doing that.
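The distinction between the two meanings of "append" can be sketched in Python. This is only an illustration, assuming that "protected" means the file has no write permission bits; the paths and the demo helper are hypothetical and nothing here calls DVC itself.

```python
import os
import stat
import tempfile


def is_write_protected(path):
    """Return True if no write permission bit is set on path."""
    return stat.S_IMODE(os.stat(path).st_mode) & 0o222 == 0


def demo():
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical persistent directory output with one existing file.
        out_dir = os.path.join(root, "out_dir")
        os.mkdir(out_dir)
        existing = os.path.join(out_dir, "part-0.txt")
        with open(existing, "w") as fobj:
            fobj.write("old data\n")

        # Stand-in for "keeping the output protected": drop the write bits.
        os.chmod(existing, 0o444)

        # Meaning 1: add a new file to the directory. The protected file is untouched.
        added = os.path.join(out_dir, "part-1.txt")
        with open(added, "w") as fobj:
            fobj.write("new data\n")

        # Meaning 2: append data to the protected file itself.
        try:
            with open(existing, "a") as fobj:
                fobj.write("more data\n")
            append_ok = True
        except PermissionError:
            append_ok = False

        result = (is_write_protected(existing), os.path.exists(added), append_ok)
        os.chmod(existing, 0o644)  # restore write access before cleanup
        return result


protected, added_new_file, appended_in_place = demo()
print(protected, added_new_file, appended_in_place)
```

For an unprivileged user on a POSIX system the in-place append is typically refused, while adding a new file to the directory succeeds. That is the case where the append-only name fits directories but not single files.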
dvc/stage/__init__.py (outdated)
      for out in self.outs:
-         if (out.persist or out.checkpoint or out.live) and not force:
+         if (
+             (out.persist and not out.append_only)
This change would make it so that instead of being unprotected, the output is removed entirely (meaning it would be treated as a non-persistent output)
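A minimal model of this point, written as a sketch and not as DVC's actual code: Out is a hypothetical stand-in class, the "unprotect"/"remove" results mirror the two outcomes described in the comment above, and the rest of the new condition is an assumed completion, since the excerpt from dvc/stage/__init__.py is cut off after its first clause.

```python
from dataclasses import dataclass


@dataclass
class Out:
    """Hypothetical stand-in for a stage output; not DVC's real class."""

    persist: bool = False
    checkpoint: bool = False
    live: bool = False
    append_only: bool = False


def action_before(out, force=False):
    # Pre-change condition, as quoted in the diff excerpt.
    if (out.persist or out.checkpoint or out.live) and not force:
        return "unprotect"
    return "remove"


def action_after(out, force=False):
    # ASSUMPTION: the truncated condition continues with the same
    # checkpoint/live clauses and the same "and not force" guard.
    if (
        (out.persist and not out.append_only)
        or out.checkpoint
        or out.live
    ) and not force:
        return "unprotect"
    return "remove"


out = Out(persist=True, append_only=True)
print(action_before(out), action_after(out))
```

Under these assumptions, an output with both persist and append_only set moves from the "unprotect" branch to the "remove" branch, so it is handled like a non-persistent output and is not kept protected in place.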
Thanks, @rgferrari, for the contribution. The title of your PR is a bit misleading, as it adds a new option to dvc.yaml and .dvc files, not a new flag to dvc repro.
The discussion on #6562 seems to focus on making this the default or adding a new CLI flag to dvc repro, so I'd really like us to move that discussion forward first and only then commit to either the new flag or the new option in dvc.yaml, as we usually don't break compatibility on dvc.yaml.
Could you please share your workflow, the issues that you are having, and your proposal there?
Hi everyone, I appreciate all the feedback.
Regarding @pmrowla's question about the name: yes, it really doesn't make sense when applied to files. Do you think a name like keep_protect would make more sense? About the 'if' change, you are right: it deletes the folder and writes it again. I didn't notice that. However, I haven't found any other part of the code that links the persist option to unprotecting, so I need to look into it further.
Now, about @skshetry's comment: the PR's title is indeed wrong, sorry. I'll change it later to "a new option to dvc.yaml...". My proposal was to create a dvc.yaml option because, as @dberenbaum said, it makes more sense to set a dvc.yaml option once than to write it again on the command line at every execution.
@iterative/dvc what is the progress here?
add the flag keep_protect to the dvc.yaml options to avoid unprotecting the files while persist flag is true.
@iterative/dvc Pinging about this PR. Could anyone take a look at this?
Apologies for such a long wait on this PR. We were expecting a lot of the data management work to become clearer after our refactoring with dvc-data.
Unfortunately, I have to decline this PR and would like to continue the discussion in #6562. The reason is that we already have a few options for modifying data-management behavior, like persist and cache, which are themselves not in a good state (for example, cache: false still caches). This option feels like one more flag to modify behavior, and we may need more of them to change the behavior that dvc-data provides.
Another reason is that once we commit to this PR, we have to support these flags for a long time, as we usually don't break the dvc.yaml file format. We'd like to encourage discussion on #6562 and see if we can even make this the default.
Thanks for contributing to DVC, and again, sorry for the very long silence on your PR.
Closing per #7072 (review)
add the flag append_only to the pipeline to avoid unprotecting the files while persist flag is true.
Fixes #6562.
I have followed the Contributing to DVC checklist.
If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible.
Link to the dvc.org PR: iterative/dvc.org#3056