Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New cache type: read-only hardlink #799

Closed
dmpetrov opened this issue Jun 21, 2018 · 6 comments
Closed

New cache type: read-only hardlink #799

dmpetrov opened this issue Jun 21, 2018 · 6 comments
Assignees
Labels
enhancement Enhances DVC question I have a question?
Milestone

Comments

@dmpetrov
Copy link
Member

dmpetrov commented Jun 21, 2018

It is often confusing that user can edit the data files and it breaks immutable cache. Often it is not clear for users. An example #754.

The data file read-only semantic should be clear. Let's:

  1. Introduce a new CacheType - read-only hardlink ro-hardlink.
  2. Make ro-hardlink as the default type for file systems without reflinks support.

btw... It should be easy to implement (2) without breaking the back compatibility - just keep hardlink as the default type and create a config file with ro-hardlink in new dvc versions.

We should understand that ro-hardlink will change the expected Git-like behavior (dvc will change data file permissions) but it will protect immutable cache which is more important.

@dmpetrov dmpetrov added the question I have a question? label Jun 21, 2018
@dmpetrov dmpetrov added this to the 0.9.8 milestone Jun 21, 2018
@efiop efiop modified the milestones: 0.9.8, 0.9.9 Jun 21, 2018
@efiop efiop added the enhancement Enhances DVC label Jun 21, 2018
@efiop
Copy link
Contributor

efiop commented Jul 19, 2018

I think a better way to achieve this would be to introduce 'protected' mode for dvc repository. I.e. dvc config core.protected true will tell dvc to set read-only permissions on data files(both hard and symlinks). Plus we would introduce dvc protect/abandon(similar to git annex lock/unlock, but our dvc lock\unlock are already taken by stage locking) that will allow to set and unset read-only permissions on individual files. dvc abandon file.dvc will replace links to read-only cache with copies that will have write-permissions thus keeping the cache safe.

@dmpetrov
Copy link
Member Author

dmpetrov commented Aug 5, 2018

It looks like this feature should become a default behavior in dvc 1.0 (with the dvc commit). @efiop what do you think?

Protect\Abandon - does it affects all the cache types?

  • Hardlink - yes
  • Reflink - no
  • Symlinks - yes?

@efiop
Copy link
Contributor

efiop commented Aug 5, 2018

@dmpetrov I agree, we should definitely consider making it a default behaviour for 1.0 . And yes, only hardlinks and symlinks should be protected by default.

@dmpetrov
Copy link
Member Author

dmpetrov commented Aug 5, 2018

So, 2 cache types (Hardlink and Symlinks) need the new parameter (Protect\Abandon) for the cache immutability and other 2 types (Reflinks and Copy) do not need that.

As a result, DVC can protect immutable data files in two possible ways:

  1. Introduce new cache types ro-hardlink and ro-symlink (and use them by default).
  2. Introduce a global config parameter Protection = Protect|Abandon for hardlink and symlink instead.
    2.1. Add an ability to change hardlink and symlink protection settings per file - dvc protect\abandon file.dvc. To me, it is hard to imagine a realistic scenario for this feature, but I like the flexibility it provides. It is probably okay if we hide this feature from basic use cases.

@efiop could you please clarify if my understanding is correct?

@efiop
Copy link
Contributor

efiop commented Aug 5, 2018

Introduce new cache types ro-hardlink and ro-symlink (and use them by default).

Correct, the pros for this one is that it fits nicely in the currect cache.type logic. We will still have to introduce something similar to dvc protect|abandon to provide a convenient way to modify a file if it is protected.

Introduce a global config parameter Protection = Protect|Abandon for hardlink and symlink instead.
2.1. Add an ability to change hardlink and symlink protection settings per file - dvc protect\abandon file.dvc. To me, it is hard to imagine a realistic scenario for this feature, but I like the flexibility it provides. It is probably okay if we hide this feature from basic use cases.

I was actually thinking about something like core.protected = true|false, but it is essentially the same thing. I am also considering it affecting all cache types, just for the simple symmetry, but I'm not yet sure. Will reconsider all the options while preparing a PR for this. My current draft is using core.protected for all the cache types and also provides dvc protect + dvc abandon CLI commands. But I definitely need to give it a better thought.

@dmpetrov
Copy link
Member Author

dmpetrov commented Aug 5, 2018

Good! It looks like the protected config param is a better solution. However, two additional commands dvc protect/abandon for a very rare use case (why I need to make a data file mutable?) look like an overkill.

We should think more about how to avoid these commands.

@efiop efiop self-assigned this Oct 26, 2018
efiop added a commit to efiop/dvc that referenced this issue Oct 26, 2018
This is an opt-in mode, that will make data in the workspace
read-only, thus protecting cache from corruption when hardlinks or
symlinks are used. User can use `dvc unprotect file` command, to
replace read-only link to cache with an editable copy. We can use
this for now to decide if we want to go with a `unified workflow`
in the future or if we want to set unprotected `reflink,copy`
by default.

Fixes iterative#799

Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhances DVC question I have a question?
Projects
None yet
Development

No branches or pull requests

2 participants