
Data files protection #1599

Closed
shcheklein opened this issue Feb 9, 2019 · 36 comments

@shcheklein
Member

commented Feb 9, 2019

I would like to bring more community attention to an important topic: protecting data artifacts that are under DVC control. Before we decide to implement a certain workflow (more on this below), it would be great if everyone shared their thoughts on which option looks more appealing and why.

First, let me state the problem: a short version and a longer one, which you can skip if you feel you understand the short one. So bear with me, please; I'll try to give more explanation along the way.

Short version:

  • data files in the project workspace that are under DVC control are linked (usually via hardlink, sometimes reflink or symlink; see the long version for a detailed explanation) with their counterparts in the local cache (.dvc/cache). This is an optimization that avoids copying, to save time and space.
  • as a result, you need to be extra careful when you want to modify or overwrite those files: you need to remember to run dvc unprotect or dvc remove first.
  • by default there is no protection whatsoever, which means it's easy to corrupt the cache without even noticing.


Long version:

When you do `dvc add` or `dvc checkout`, or any other operation that results in DVC taking files under its control, DVC calculates the `md5` of each file and puts the file into the local cache (`.dvc/cache`). This is done to save the file, and is semantically similar to `git commit`.

The naive way to perform this operation would be something like this:

  $ md5 data.tgz
  xyz1234567890abcdefgh
  $ cp data.tgz .dvc/cache/xy/z1234567890abcdefgh

This basically creates a full copy of the file (`xy/z1234567890abcdefgh`) which is addressable by its md5 sum.

Instead, DVC tries to optimize this operation and runs something like this:

  $ md5 data.tgz
  xyz1234567890abcdefgh
  $ mv data.tgz .dvc/cache/xy/z1234567890abcdefgh
  $ link .dvc/cache/xy/z1234567890abcdefgh data.tgz

And your workspace looks like this now:

...
data.tgz   -----> .dvc/cache/xy/z1234567890abcdefgh
...

The goal is to avoid copying files! This gives you speed and saves space - two great benefits, especially if you deal with GBs of data. Nice, right? :)

To give you even more detail, this link operation internally tries the different types of links one by one (from the very best to the worst):

  • reflink - also known as copy-on-write (CoW). Two files share the same content; only when one of them is updated does the FS create a copy of the changed blocks. All of this is seamless from the user's perspective. This is the best type of link, but it's supported only on the most modern filesystems - btrfs, APFS, ReFS on Windows Server.
  • hardlink - a similar idea, but no copying happens on write - both files change together.
  • symlink - a regular symlink, no magic. If you write to one file, you see the changes in both.
  • copy - if none of the link types work, DVC will simply clone the file. The slowest and most space-consuming option, but as safe as reflinks, because the two files are independent.
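To make the fallback concrete, here is a minimal Python sketch of such a chain (an illustration only, not DVC's actual implementation; reflink is omitted because the Python standard library has no portable reflink call, and `cache_and_link` is a made-up name):

```python
import os
import shutil

def cache_and_link(workspace_path, cache_path,
                   strategies=("hardlink", "symlink", "copy")):
    """Move a file into the cache, then recreate it in the workspace
    using the first link strategy that works (best to worst)."""
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    shutil.move(workspace_path, cache_path)
    for strategy in strategies:
        try:
            if strategy == "hardlink":
                os.link(cache_path, workspace_path)
            elif strategy == "symlink":
                os.symlink(os.path.abspath(cache_path), workspace_path)
            else:  # plain copy always works, but costs time and space
                shutil.copy2(cache_path, workspace_path)
            return strategy
        except OSError:  # e.g. hardlinks can't cross filesystem boundaries
            continue
    raise RuntimeError("no link strategy worked")
```

On the same filesystem the hardlink succeeds immediately; across filesystems it raises `OSError` and the chain falls through to the next strategy.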

To summarize:

| Link type | No file copy (saves space) | Fast | Safe to edit (dvc unprotect not required) | Cache on a different FS |
| --- | --- | --- | --- | --- |
| Reflink (CoW) | ✓ | ✓ | ✓ | ✗ |
| Hardlink | ✓ | ✓ | ✗ | ✗ |
| Symlink | ✓ | ✓ | ✗ | ✓ |
| Copy | ✗ | ✗ | ✓ | ✓ |

This optimization with links does not come for free. As you can see from the "Safe to edit" column above, not all link types support updating files directly in the workspace. It results in a workflow like this. Let's imagine we have a file some.csv in the project. Before it was DVC-managed, nothing prevented us from adding more entries, modifying it in any other way, or even rewriting it (for example, if there is a script that generates it). Now, right after we run dvc add some.csv, even though from the user's perspective it looks like the same regular some.csv, it's not that simple file anymore - it might be a link, and it might no longer be safe to edit.

Before:

(diagram: the project workspace before DVC - some.csv is a regular, standalone file)

Copy:

(diagram: copy - some.csv and the cache entry have independent content)

It's safe to edit `some.csv` - it has its own copy of the content. But creating that copy obviously takes time and space.

Hardlink:

(diagram: hardlink - some.csv and the cache entry share the same content)

It's not safe to edit. Both files share the same content. If you edit some.csv, the cache file whose md5 serves as its address will be corrupted. Basically, it becomes inconsistent - the path no longer corresponds to the content.
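This corruption scenario is easy to reproduce with a few lines of Python (the paths and file names here are hypothetical; the snippet just mimics the cache layout described above):

```python
import hashlib
import os
import tempfile

workdir = tempfile.mkdtemp()          # stands in for the project root
data = os.path.join(workdir, "some.csv")
with open(data, "w") as f:
    f.write("a,b\n1,2\n")

# "dvc add": move the file into the cache under its md5, hardlink it back
md5 = hashlib.md5(open(data, "rb").read()).hexdigest()
cache_entry = os.path.join(workdir, "cache", md5[:2], md5[2:])
os.makedirs(os.path.dirname(cache_entry))
os.rename(data, cache_entry)
os.link(cache_entry, data)

# a "harmless" edit in the workspace...
with open(data, "a") as f:
    f.write("3,4\n")

# ...silently corrupts the cache: the entry's name no longer matches its content
actual = hashlib.md5(open(cache_entry, "rb").read()).hexdigest()
print(actual == md5)  # False
```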

To mitigate this problem, we had to introduce two things:

  1. A protected mode, which makes all files under DVC control read-only and prevents cache corruption.
  2. The dvc unprotect command - syntactic sugar that removes the read-only mode and, if needed, makes a full copy so the file is safe to edit.
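Conceptually, the protection boils down to toggling write-permission bits; here is a simplified sketch (the real dvc unprotect may additionally replace a hardlink/symlink with a private copy, and `protect`/`unprotect` here are illustrative names, not DVC's API):

```python
import os
import stat

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH

def protect(path):
    """Drop all write bits, as the protected mode does for tracked files."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~WRITE_BITS)

def unprotect(path):
    """Give the owner write access back (dvc unprotect may additionally
    turn a link back into an independent copy before doing this)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IWUSR)
```

Any script that then opens a protected file for writing gets a PermissionError from the OS - which is exactly the failure mode discussed below.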



Bottom line: having files unprotected by default, plus a mandatory dvc unprotect/dvc remove workflow to modify files, creates problems and is confusing: #799, #599 (comment), #1524, etc.

Possible solutions:

There are two possible ways to make it consistent and safe for users:

1️⃣ Enable the protected=true mode by default. All files under DVC control become read-only. Users must run dvc unprotect (or dvc remove, depending on the use case) to modify, overwrite, or replace a file:

  • Prevents cache corruption. Users won't be able to edit a read-only file without explicitly changing its permissions.
  • Workflow breaks. Users have to learn and remember to run dvc unprotect before modifying any file under DVC control. It means a very regular data workflow looks different.
  • Scripts that edit/overwrite files will start raising exceptions if dvc unprotect has not been run. This might be confusing and frustrating: imagine a script that takes a few days to train a model failing at the very end with an IOException, only because we put the model file under DVC control.
  • The workflow is a bit complicated but always unified. Files are always protected by default, and you always have to remember about the protection.
  • Even on modern file systems like APFS (macOS) or XFS (the default on some Linux distros), users will have to use the dvc unprotect workflow even though it's not required there.

2️⃣ Enable cache type=reflink,copy by default. This means that users on most widespread filesystems, which do not support reflinks (ext4, NTFS, etc.), will experience a performance downgrade and increased space consumption. Let users opt in to the hardlink/symlink + protected (dvc unprotect workflow) advanced mode only if it's needed - for example, when they have to deal with GBs of data. Another option for these advanced users is to switch to a filesystem that supports reflinks - btrfs or APFS:

  • Prevents cache corruption. The reflink and copy cache types are safe and do not require any additional commands.
  • Performance penalty on commonly used filesystems (ext4 on Linux, NTFS on Windows, older macOS systems) because CoW links (reflinks) are not supported there. We can print a message about the advanced mode if files are too big and adding them to DVC takes too long.
  • Space penalty for the same reason. Again, we can mitigate this to some extent with messages.
  • Potentially confusing, because a project that works very fast on a modern system (macOS) will be slow on Linux or Windows.
  • The workflow stays the same as the workflow without DVC. No need to learn or run dvc unprotect unless you have GBs of data and really need to optimize.

Finally, the question is: which one should we pick as the default? 1️⃣ vs 2️⃣? Am I missing arguments for or against either option? Am I missing other ways to solve this nasty issue? Please share your thoughts, vote for one of the options, and explain your motivation. It's extremely important to hear from you guys!

@dmpetrov dmpetrov added the question label Feb 9, 2019

@shcheklein

Member Author

commented Feb 12, 2019

@tdeboissiere @sotte @ternaus @ophiry @colllin @villasv @polvoazul @drorata @Casyfill Guys, would love to have your feedback on this discussion. I have tried to explain it as simply as possible. Basically, it's a tradeoff between performance and UX simplicity in the default DVC setup. As the most active users, you should definitely have something to say on this.

@sotte

Contributor

commented Feb 12, 2019

Not an answer to your question, but maybe relevant.

When not using DVC I set all my data to be immutable (locally as well as remote storage buckets). That means I can only add data (newer version, transformations, etc.), I can access all versions of the data at the same time, and I can not change existing data. Protecting data from accidental changes/corruption is super important and a requirement for reproducibility.
As a side note: Git objects are read-only.

Regarding copy: assume you have to copy >100GB, that would be a deal breaker. It just takes way too long.

@drorata


commented Feb 12, 2019

Wonderful thread. Thanks for bringing it and for inviting me.

If I understand correctly, (2) means that linking will use reflink if possible and otherwise copy. My remarks below assume this understanding.

There is no magic solution between performance and protection. As our data is so valuable, its protection should come first.

I would favor (2) as it does not break workflows. I would also bring this fundamental issue to users' attention, ideally as part of the installation process. If a user prefers option (1), they can switch to it manually. In that case it is important that unprotecting a file is also accessible from within Python, so scripts can be adapted to the needed flow.

As a last resort, users can go back to the current default behavior - but this should be an active choice, and the ramifications should be clearly communicated.

@villasv

Contributor

commented Feb 12, 2019

I'm down for [1]. Every workflow that breaks is one that was trying to corrupt the cache - at least that's my read at first glance.

[2] could be a first stage of the adoption plan for [1], though. Instead of enforcing protection right away, make it a best practice in the tutorials, let the community absorb that, and only then break their workflows, because they should have known better. No one should expect their workflows not to break across a new DVC major release anyway.

@shcheklein

Member Author

commented Feb 12, 2019

@sotte thank you, it's really valuable! Let me clarify a few questions:

> I can access all versions of the data at the same time, and I can not change existing data. Protecting data from accidental changes/corruption is super important and a requirement for reproducibility.

What about end results? Let's imagine we have a script that produces a model file or a metrics file. Do you make your scripts specifically create a new copy every time? Have you ever had a script that was overwriting some results? In dvc terms it could look something like this:

dvc run -d data -o model.pkl -m metrics.json train.py data

Basically the difference between the two options means that the first might fail at the very end of the training cycle if files already exist (both model.pkl and metrics.json will be set to read-only after the first run). User has to remember to run dvc unprotect before running the script again, or alternatively run it with dvc repro that takes care of unprotecting the outputs first.

> Regarding copy: assume you have to copy >100GB, that would be a deal breaker. It just takes way too long.

Totally! Do you think a message that an advanced mode is available would be enough in this scenario? Basically, we can detect that files are large and would take too long to copy, and warn the user.

It might actually be an improvement to both options - introduce a flag that specifies which cache type is used per output. For example, for dvc add we could always use 1 - protected, space-efficient; for dvc run metric outputs, always 2.

What do you think about this?

@shcheklein

Member Author

commented Feb 12, 2019

@drorata thanks! You are absolutely right, protection first. Just to clarify: both proposed solutions do protect files; the tradeoff is performance vs UX complexity in the default DVC setup. What do you think about the possible improvement, btw (see the last paragraph of my previous answer)?

@shcheklein

Member Author

commented Feb 12, 2019

@villasv thanks! "Every workflow that breaks is trying to corrupt cache" - I'm not sure I got that right, could you clarify a little bit please?

@sotte

Contributor

commented Feb 12, 2019

> @sotte thank you, it's really valuable! Let me clarify a few questions:
>
> > I can access all versions of the data at the same time, and I can not change existing data. Protecting data from accidental changes/corruption is super important and a requirement for reproducibility.
>
> What about end results? Let's imagine we have a script that produces a model file or a metrics file. Do you make your scripts specifically create a new copy every time? Have you ever had a script that was overwriting some results?

Remember, the setup I described does not use DVC.
Model files (or metrics or any artifact) are just data, so I can write them, but never change them. All my scripts/tools fail when they try to edit/overwrite an existing file (the files are write protected after an experiment). Running an experiment that creates a model looks something like this:

python experiment/some_project/train_file_xyz.py --lr 0.7 <more parameters>
# Creating experiment <UUID>...
# Storing artifacts under data/experiments/some_project/train_file_xyz/<UUID>
# Train...and create artifacts and metrics...
# Make all data under data/experiments/some_project/train_file_xyz/<UUID> immutable
# Uploading artifacts to remote bucket...

I normally create a lot of artifacts and metrics. The tracking tool also tracks the git version and parameters. DVC's metric tracking system never cut it for me; we use some external services for tracking parameters and metrics.

> In dvc terms it could look something like this:
> dvc run -d data -o model.pkl -m metrics.json train.py data
>
> Basically the difference between the two options means that the first might fail at the very end of the training cycle if files already exist (both model.pkl and metrics.json will be set to read-only after the first run).

In my case the data would be written to another location automatically. I'd argue that a failed training is also a valuable training :)

> User has to remember to run dvc unprotect before running the script again, or alternatively run it with dvc repro, which takes care of unprotecting the outputs first.

Wait if dvc repro takes care of it automatically, then everything is fine. Just use dvc repro :) I think I'm missing something.

> > Regarding copy: assume you have to copy >100GB, that would be a deal breaker. It just takes way too long.
>
> Totally! Do you think a message that an advanced mode is available would be enough in this scenario? Basically, we can detect that files are large and would take too long to copy, and warn the user.
>
> It might actually be an improvement to both options - introduce a flag that specifies which cache type is used per output. For example, for dvc add we could always use 1 - protected, space-efficient; for dvc run metric outputs, always 2.
>
> What do you think about this?

I have to say that I'm relatively happy with my workflow right now and haven't used DVC for our bigger projects. I also don't see myself overwriting files, so options 1 and 2 are not really relevant for me. I want access to different versions at the same time - this is one of DVC's big design decisions that I disagree with.
That being said, I really like DVC :)

@villasv

Contributor

commented Feb 12, 2019

@shcheklein

> @villasv thanks! "Every workflow that breaks is trying to corrupt cache" - I'm not sure I got that right, could you clarify a little bit please?

These two downsides:

  • Workflow breaks. User has to learn and remember to run dvc unprotect before modifying any file under DVC control. It means that a very regular workflow with data looks different.

  • Scripts that edit/overwrite files now will start raising exceptions if dvc unprotect has not been run. Might be confusing and frustrating to users. Let's imagine that a script that takes a few days to train a model fails at the very end with an IOException only because we put the model file under DVC control.

These shouldn't be happening anyway, so you'll only break people's projects if they were already broken, in the sense that they were likely causing cache corruption.

@shcheklein

Member Author

commented Feb 12, 2019

@villasv gotcha! I was thinking more about new users - it's not about backward compatibility. They already have a certain way of writing training scripts and don't use dvc at all. Now they want to start using it. The first barrier they hit is this advanced dvc unprotect and caching stuff, and failures in their existing scripts.

@villasv

Contributor

commented Feb 12, 2019

@shcheklein Absolutely. My point is that the barrier has always been there. Before, the barrier was a frustrating experience with caches being corrupted. The barrier then becomes a human-readable error in the CLI, very well detailed in a tutorial/FAQ.

It was quite a barrier for me, I had to change all my scripts to make sure none of them would even reuse the fs node even if they were truncating with > redirection. But at least if I started with that protected mode today, I would understand what was going on immediately.

@shcheklein

Member Author

commented Feb 12, 2019

@villasv yep, agreed. It's a little bit nuanced, because it's hard to make the error human-readable - users' code will be failing with some IOException and we don't control that :(

What about option [2]? It also gives protection (by using copy, or the advanced reflinks if they are available). Imagine yourself doing a first project with DVC or migrating an existing project: what would be more painful - copying (with a message that a more efficient mode is available, if it takes too long and the files are big), or hitting random IO errors and not being able to edit/move files?

@villasv

Contributor

commented Feb 12, 2019

My first few DVC projects dealt with datasets small enough that using copy would be fine, so option [2] would definitely have been easier. I was lucky, though.

But again, I maintain that both are good options. Both enhance protection. The trade-off is between harder adoption and lower default performance. If I were a new user today, I can imagine each scenario:

  • Option 1
    I dvc add my datasets, put my first dvc run command in place, and everything is looking good. Suddenly, when I try to run one of the scripts, I see some cryptic error because the file is read-only. WTF? I google dvc script {whatever error} and find a GH issue / dvc doc page saying that my script is trying to edit the file in-place, which is bad, but if the migration effort is too big all I have to do is turn protection off.
    I turn it off, life goes on. I learn that I shouldn't be doing that, and will sleep thinking about something that is left unprotected.

  • Option 2
    I dvc add my datasets - geez, this is slow. I put my first dvc run command in place - quite a response time, huh. Well, ok, let's keep putting in all these commands. It keeps giving me warnings that I'm running stuff unprotected and duplicating data. Geez, I didn't do anything to deserve that. I google the warning; the Internet tells me that I'm using the copy cache and that sucks, but I'm stuck with it until I adjust my scripts. Oh well, am I going to receive that warning forever? It seems so...

@colllin


commented Feb 12, 2019

Thank you for inviting me to this thread.

I'm sensitive to the scenario you describe in [1] where a days-long script results in an IOException due to forgetting to unprotect an output file. I feel that it will be difficult to justify the value of DVC if/when work is lost in this way.

At work, I deal exclusively with annoyingly-large image datasets, so I'm also sensitive to the "copy" strategy in [2], mostly because my manager is sensitive to EBS costs. It appears that 2 out of 3 downsides to [2] are related to large datasets.

Assuming that my team and I will usually end up on filesystems which don't support reflinks, I think I could survive the "copy" strategy in [2] if there was a mechanism for caching directly to the dvc remote, i.e. dvc add data.tgz --push. Such a mechanism would hash data.tgz, update data.tgz.dvc, and then upload data.tgz to the remote, without creating a copy in the local cache. At that point, my cached copy would be safe (on my remote), and my local copy would be safe to update. Uploading would be slower than creating a local cached copy, but if you intend to eventually upload the copy to your remote anyway, then uploading is faster than both copying to the local cache and later uploading it, and there is the added benefit of eliminating the extra disk space requirements. We would probably need to protect the file during the hash & upload operation.

I would probably prefer this direct-to-remote cache mechanism over everyone remembering to upgrade their file system in option [2], or remembering to unprotect their dvc-tracked files in option [1].

I might need help thinking through any other ramifications of this behavior.

More on my thinking (click to expand)

I agree that it is of utmost importance to protect the integrity of the data which is added to dvc. I think if dvc has only one guarantee, it should be that if I dvc add a data file, git commit the *.dvc marker file to git, git push, and dvc push, then I (or any of my teammates) will be able to later recover the exact state of my repository and data file by checking out the same git commit and pulling from the dvc remote. With git, git commit makes the file recoverable on this machine, and git push makes the file recoverable across machines. With dvc, it seems that dvc add could/does make the file recoverable on this machine, and dvc push makes the file recoverable across machines.

In [2], I imagine that dvc add would create a copy of the added file in the local cache, and dvc push would upload the local cache to the remote. For large datasets, a command such as dvc add data.tgz --push could combine these actions, which would immediately upload the file to the remote without creating a local copy.

If I decided to dvc add without --push, then in order to arrive in the same disk state, I would later need to dvc push; dvc gc, which would upload the cached data file and then remove it, since I already have another (unprotected, possibly modified) copy of that data file. Here, I'm using dvc gc as a way to say "remove anything from the cache which is safely uploaded to the remote", which may not be what it does right now.

@mroutis

Collaborator

commented Feb 13, 2019

Comments

1️⃣

> Scripts that edit/overwrite files now will start raising exceptions if dvc unprotect has not been run. Might be confusing and frustrating to users. Let's imagine that a script that takes a few days to train a model fails at the very end with an IOException only because we put the model file under DVC control.

I guess it would be something like the following:

sleep 1000 && echo "hello" > hello
dvc add hello # chmod -w hello
dvc run -d hello -o greetings "cp hello greetings"
sleep 1000 && echo "hola" > hello
# permission denied: hello

Indeed, it is awful to have an exception like that after waiting for so long. The question is: why wouldn't you have something like the following instead?

dvc run -o hello "sleep 1000 && echo hello > hello"
dvc run -d hello -o greetings "cp hello greetings"
# sed -i '/cmd/ s/hello/hola/' hello.dvc
dvc repro hello.dvc

> Workflow breaks. User has to learn and remember to run dvc unprotect before modifying any file under DVC control. It means that a very regular workflow with data looks different.

If you are modifying a protected file by hand, the worst that could happen is seeing a message like:

E45: 'readonly' option is set (add ! to override)            

But you won't lose any data.
If your text editor is scriptable, you can work around it by defining a command that runs dvc unprotect before writing the data to the file.

" NOTE: This is only an example, not an accurate implementation
autocmd BufWritePre * if !&modifiable | :!dvc unprotect <afile>

> Even on modern file systems like APFS (macOS) or XFS (the default on some Linux distros), users will have to use the dvc unprotect workflow even though it's not required there.

Why can't DVC be smarter about this one? It could use a cross-platform implementation to check the filesystem in use: https://github.com/giampaolo/psutil
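On Linux, for example, even the standard library is enough for a rough check (a sketch that assumes matching the longest mountpoint prefix in /proc/mounts is acceptable; psutil, linked above, would make this cross-platform):

```python
import os

def fs_type(path):
    """Best-effort filesystem type for `path`, by matching the longest
    mountpoint prefix in /proc/mounts (Linux only; a sketch, not DVC code)."""
    path = os.path.realpath(path)
    best_mount, best_type = "", "unknown"
    try:
        with open("/proc/mounts") as mounts:
            for line in mounts:
                _, mountpoint, fstype = line.split()[:3]
                if path.startswith(mountpoint) and len(mountpoint) > len(best_mount):
                    best_mount, best_type = mountpoint, fstype
    except OSError:
        pass  # not Linux - a real implementation would use psutil here
    return best_type

# DVC could then pick a sensible default, e.g. try reflinks only where
# the filesystem is known to support them:
# cache_type = "reflink,copy" if fs_type(".") in ("btrfs", "xfs", "apfs") else "copy"
```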


2️⃣

> Performance penalty on commonly used file-systems (ext4 on Linux, NTFS on Windows, older Mac OS systems) because CoW (reflink) links are not supported on them. We can write a message about the advanced mode if files are too big and it takes too long to add them to DVC.

I really like the idea of informing the user when DVC is doing some heavy work (copying big files); "friendly" approaches are always better (you can't assume users will read the manual looking to improve their experience with the tool).

Conclusion

My vote goes to number 2️⃣. I value the ease of use it offers, and that you can have efficiency as an opt-in (with a recommendation from the tool when the cache starts to grow).

@drorata


commented Feb 13, 2019

@shcheklein Just to make sure I understand: the difference between [2] and the current behavior is that currently, if reflink is not available, the fallback is another available link type which is not safe. Am I right?

The performance price of copy is divided in two, (a) time and (b) space, and it has to be paid only when dealing with large/huge datasets. Now, time is only an issue when checking out different data from history. The space price is limited to the working copy. Say I have a dataset X (1GB) which evolved into Y (2GB) and then into Z (3GB); then my cache size will be 6GB but my working directory only 3GB. When reverting to the X state, the cache size won't change and only my working directory will. Am I right? In this case, @colllin's boss should not be worried; the storage costs will be the same. If I'm right on this one, the only real concern is the time issue.

The more I think about it, the more the price of copy seems marginal, and therefore option [2] is indeed my favorite. It is worth making sure that whenever this approach charges its price (time spent copying), the user is notified that there are more efficient options (which bring some overhead in terms of changing workflows/scripts/etc.).

@shcheklein

Member Author

commented Feb 14, 2019

@drorata thank you for your thoughtful analysis! :)

> if reflink is not available the fallback would be another available linking which is not safe. Am I right?

Yep! To be precise: it's not that the link itself is unsafe; it's that we don't put files under protection (make them read-only) for the link types that require it.

> The space price is only limited to the working copy.

You are absolutely right again! It's (1 x ALL VERSIONS + 2 x CURRENT VERSION). I still think it can be a limiting factor in certain cases, like @colllin's, but I would love to hear from him. It looks like for them the difference is paying for 2TB vs paying for 1TB of EBS.

@dmpetrov

Member

commented Feb 14, 2019

Great discussion! Thank you guys for your thoughtful opinions.

It is not an easy choice... The 2️⃣nd option definitely provides a better user experience, while the 1️⃣st one is better for large datasets.

DVC was initially designed with large-dataset scenarios in mind, and this is why it (still) prioritizes hardlink over copy (the default mode is reflink,hardlink,symlink,copy). However, we see that the large-dataset scenario is not the most common use case.

Yet another opinion

I think it might be a good choice to set the 2️⃣nd option as the default for the most common use case, but give an easy way to switch to the 1️⃣st option. The logic behind this: the best user experience for most users, but if you have large data files, please do an extra step.

Issues

There are a few issues related to this opinion/choice:

  1. Large-file owners might not realize that the optimization is available. This can be easily solved; as Ivan said in the initial comment, "We can write a message... if files are too big and it takes too long".

  2. People with large datasets do not necessarily need to optimize all files - it is fine to copy an ML model (even 500Mb). It might be nice to have a per-file option for 1️⃣. Ivan has already mentioned this:

> introduce a flag that specifies what cache type is used per output. For example, for dvc add we can always specify 1 - protected, space efficient, for dvc run metric outputs always 2.

It can look like a flag for dvc add and dvc run:

$ dvc add images.tgz
[===                                      ]  8%  (it is slow - consider '-l' option)
^C   # <-- cancel the command

$ dvc add -l images.tgz   # <-- no copy. faster!

$ dvc run -d images.tgz -o images/ tar zxf images.tgz
Copying 'images' to workspace.
[==                                       ]  5%  (it is slow - consider '-l' option for outputs)

$ dvc run -d images.tgz -l images/ tar zxf images.tgz   # <-- no copy, faster!

So, this approach gives the ability to switch to 1️⃣ for a particular file. It requires some development - specifying the cache type in DVC-files, not in the config file.

What do you guys think about this combination: 2️⃣ by default and 1️⃣ per file?

  1. We use protect/unprotect terminology just because DVC does not protect data right now. Everyone agrees that protection is a must-have in DVC (the only question is the strategy). So, the terminology won't be relevant once we start using one of these by default. We need to use terms like "optimized files" or "large files" instead.

  2. @sotte brought up a good point about a default protection mode for all datasets as well as models. I think the user should decide if they need protection and turn it on themselves. However, DVC does not do a good job here. It seems like a bug:

$ cp ~/Download/1.txt ~/Download/2.txt .
$ chmod a-w 2.txt
$ dvc add 1.txt 2.txt
$ ls -l 1.txt 2.txt
-rw-r--r--@ 2 dmitry  staff    53K Feb 13 19:04 1.txt
-rw-r--r--@ 2 dmitry  staff    53K Feb 13 19:04 2.txt   # <-- permissions were lost!
@drorata


commented Feb 14, 2019

@shcheklein wrote:

> > The space price is only limited to the working copy.
>
> You are absolutely right again! It's (1 x ALL VERSIONS + 2 x CURRENT VERSION). I still think it can be a limiting factor in certain cases, like @colllin's, but I would love to hear from him. It looks like for them the difference is paying for 2TB vs paying for 1TB of EBS.

I believe the correct "formula" is 1 x (ALL VERSIONS - CURRENT VERSION) + 2 x CURRENT VERSION. Either way, as far as I get it, the size of the remote storage is not influenced by the method used. Furthermore, the space difference is only reflected in the factor 2 of the current version; this is in a way O(1) and can be ignored. Lastly, the space performance is only affecting the local machine(s) and @colllin's boss should not be worried.
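For concreteness, both formulas agree once expanded; a quick sanity check with hypothetical sizes (in GB, assuming 1300 GB of total versions of which 300 GB is the current one):

```shell
# hypothetical sizes, in GB
ALL_VERSIONS=1300   # every cached version, current included
CURRENT=300         # the checked-out version
# 1 x (ALL - CURRENT) + 2 x CURRENT  ==  ALL + CURRENT
echo "copy strategy: $(( (ALL_VERSIONS - CURRENT) + 2 * CURRENT )) GB"  # -> 1600 GB
echo "link strategy: $(( ALL_VERSIONS )) GB"                            # -> 1300 GB
```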

@dmpetrov In your example, just to make sure, is dvc run -d images.tgz -o images/ tar zxf images.tgz slow due to the copying of the decompressed directory?

Personally, I am worried that the on-a-file-level solution introduces too much complexity for a second-order problem. I think KISS is to be preferred: play it safe with option 2️⃣ and make sure the user is aware that performance can be improved, but then there's a workflow-complexity trade-off.

@dmpetrov

Member

commented Feb 14, 2019

@drorata slow means copying, of course, and we can easily measure that. We cannot judge the performance of the command itself.

It is a good point regarding complexity. We should think carefully if DVC can get enough benefits from this complexity increase. We can easily implement per-repository level protection and keep the per-file level as a feature request and see if there is a demand for that. It won't break backward compatibility.

@drorata


commented Feb 14, 2019

@dmpetrov doesn't option 2️⃣ address the per-repository issue?

@dmpetrov

Member

commented Feb 14, 2019

@drorata Both are per-repository (the strategy is defined in the config file). There is a possibility to make them per-file (override the strategy in the DVC-file).

@colllin


commented Feb 14, 2019

@shcheklein Yes, that sums it up — the difference is paying for 2TB EBS vs. 1TB EBS. EBS costs (without DVC) account for 15-20% of our annual AWS bill.

Regarding @drorata's comment, I don't think this should be dismissed as O(1). While this extra copy is constant in size with respect to the number of versions, it grows linearly with the size of my data, effectively doubling my disk requirements. If there is no (built-in) way to offload this cached copy (per my proposal above), then if I modify my dataset, after I dvc add the new version I will have 3 copies of my dataset until I dvc gc, further increasing the disk space required to meet peak demand.

For reference, where data.tgz is 141GB, working from an EC2 instance, EBS is 750GB gp2 with 2250 IOPS, uploading at ~100 MiB/s:

$ time cp data.tgz temp.tgz
real	19m16.066s
user	0m0.360s
sys	2m28.980s
$ time aws s3 cp temp.tgz s3://iterative-ai/deleteme_doitnow.tgz
real	22m9.277s
user	45m30.816s
sys	17m53.392s

It surprises me, but it appears that the copy+upload approximately doubles the time compared to upload alone, and I think it speaks to the validity of my proposal of caching directly to the dvc remote. Again, this gives me more control over my time and disk requirements, without adding steps to non-dvc-specific workflows. To clarify, now that I understand better what I was proposing, I think I proposed two features:

  • Allow me to garbage-collect copies of the current version in the local cache which have been safely uploaded to the remote. This frees up disk space and reduces the peak demand to 2x rather than 3x.
  • Allow me to cache directly to the remote, which saves total time, and further reduces peak demand for disk space to 1x.

Possibly dvc could warn when copying large files, and in dvc status when large files are cached. This would give an opportunity to link to docs pertaining to best practices when working with large files.

I understand that the use cases I'm discussing might be rare, and I fully expect you to take that into consideration.

@drorata


commented Feb 14, 2019

I'm missing something: how (and why) is the remote cache influenced by the linking/copying strategy of DVC? If @colllin has a dataset which evolves and gets bigger, then each iteration has a unique copy in the cache, and this is the only copy in the cache (local and remote alike). Each new version of the dataset has to be pushed to the remote cache upon push. Isn't it so? I don't understand how the remote storage sneaked into the discussion.

@colllin


commented Feb 14, 2019

The remote cache is not influenced by the linking/copying strategy of DVC. The remote cache is only relevant due to my observations/assumptions that (a) the remote cache is generally the true/safe destination which fulfills the full promise of DVC and that (b) I rarely revisit past versions — I need to have access to them, but it doesn't need to be immediate. Thus, in the context of option 2️⃣, I don't really care about having any local cache once it is safe in my remote cache, especially at the expense of time (copying) and disk space (copying plus dvc-imposed local cache behavior). So my proposal prioritizes the use of the remote cache in order to reduce the local time and disk space requirements. I'm not suggesting anything about space requirements for the remote cache — it's taken for granted that it grows with the number of versions and size of your data files.

@colllin


commented Feb 14, 2019

I've also been maybe-too-indirectly suggesting that, if the main drawbacks of option 2️⃣ are related to large files, then 1️⃣ solves that by linking them and imposing a protect/unprotect workflow, whereas my proposal might offer an alternative which doesn't impose any changes to normal workflows outside of dvc-specific workflows, while minimizing the negative impacts of the "copy" strategy for large datasets. It's totally fine if it's not interesting — my goal right now isn't to push my proposal, but to make sure it is understood.

@drorata


commented Feb 14, 2019

@colllin I think we're slowly converging :)

Can you clarify what you meant with the x1, x2 and x3 factors:

  • Allow me to garbage-collect copies of the current version in the local cache which have been safely uploaded to the remote. This frees up disk space and reduces the peak demand to 2x rather than 3x.
  • Allow me to cache directly to the remote, which saves total time, and further reduces peak demand for disk space to 1x.
@colllin


commented Feb 14, 2019

I'll try. Based on my understanding of the "copy" strategy in option 2️⃣, the best case scenario, meaning that you've just run dvc gc, is that you have a total of 2 copies of your data locally:

  • a cached copy somewhere in .dvc/
  • a working copy somewhere in your project directories

Now imagine that you want to modify this file and add the new version to dvc. You modify or overwrite your working copy, you dvc add the result, and now you have 3 total copies of your data on disk locally:

  • a cached copy of the previous version somewhere in .dvc/
  • a cached copy of the new version somewhere in .dvc/
  • a working copy somewhere in your project directory

This is the "peak demand" situation I was referring to, where in order for dvc to work I need enough room for 3 copies of my data on disk. Sometime later, I can run dvc gc to reduce me back to the original "best case" situation which has 2 total copies on my local disk.
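The copy-vs-link disk cost is easy to see with du, which counts a hard-linked inode only once; a small throwaway file stands in for the dataset here (paths are hypothetical):

```shell
set -e
rm -rf space-demo && mkdir space-demo
# a 10 MiB stand-in for the working copy
dd if=/dev/zero of=space-demo/workspace.bin bs=1M count=10 2>/dev/null
cp space-demo/workspace.bin space-demo/cache-copy.bin   # "copy" strategy: +10M on disk
ln space-demo/workspace.bin space-demo/cache-link.bin   # "link" strategy: +0, same inode
du -sh space-demo                                       # ~20M total, not ~30M
```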

@drorata


commented Feb 15, 2019

Thanks for the clarification. I still think the copy strategy has an "O(1)" local-space complexity. A linking strategy would only save a single copy of your (local) data. Otherwise, it is the same. In my mind, the scenario of dynamic/evolving data is very common, and one ends up rather quickly with multiple versions (or evolution snapshots) of the "same" dataset --- thus, in the cache, there are going to be many copies; one for each stage. In this case, the single copy in the workspace (needed when using the copy strategy) is marginal and that's why I call it O(1).

If you're short of space on your local machine, then you probably want to persist the cache remotely and garbage collect old stages/copies/versions/snapshots locally. In this case, when using copy you indeed end up with two local copies of the data. If we're talking about a huge set, this can be an issue and directly caching to the remote can be a nice feature. However, I believe this use case is out of the scope of this discussion. I think dvc is considering the remote cache as (a) a point of sharing and (b) a backup of the local data. Using the remote cache for a third purpose, as a data backbone, is yet another (interesting and valid) use case, but it is orthogonal to the copy vs. link discussion.

Lastly, regardless of the chosen strategy, the remote storage costs won't change and they depend on how much "old" data you want to keep.

@colllin


commented Feb 15, 2019

I see that when you're not working with large datasets, or when you have more than enough disk space, you're typically keeping a substantial local cache, treating the remote as a backup, and the extra disk allocation for the copied working copy can be considered an O(1) addition, where the rest of your cache is O(v) for the number of versions. I think the missing factor here is s, the size of your dataset on disk. In that case, your local cache is O(v*s), and the working-copy addition in 2️⃣ is O(s). When s is small, we can ignore it, as you described. In my case, s is large enough that we cannot ignore it.

In this situation, I'm actually doing the opposite — I'm minimizing v to the point that it is negligible, so that I'm operating in more of an O(s) regime. In other words, I'm minimizing my local cache so as not to allocate 100s of GB of disk space for rarely-needed copies of large files. Minimizing v is accomplished in a straightforward manner by running dvc gc on a regular basis.

Then, since s in my situation is far from negligible, I am concerned about the proposed behavior of the copy strategy in option 2️⃣. As described in the original issue, the main drawbacks of option 2️⃣ are related to working with large datasets. You seem to be dismissing these issues outright. If the main downsides of option 2️⃣ are related to working with large datasets, then it could be understood that option 1️⃣ is mainly provided as a workaround particular to large datasets. And this seems to be how it was interpreted, with ideas such as allowing us to choose option 1️⃣ at the file-level so as to minimize the drawbacks of option 1️⃣ — namely the perceived UX issues.

So, for me, the significant change from the current behavior to proposed option 2️⃣ is that I would now need to allocate a minimum of 3x the size of my working copy in order to meet the peak disk space required for dvc to function properly. Even if we take the EBS costs as negligible, this will still likely be surprising to users, and will certainly result in headaches around migrating to larger EBS volumes and trying to understand why I need to maintain a 1TB volume in order to work with a 300GB dataset. My proposal directly addresses these concerns, is a potential complete replacement for the proposed option 1️⃣, and in my understanding, my proposed strategy provides better UX and better control over the use of my machine's disk space and time.

@shcheklein

Member Author

commented Feb 16, 2019

@colllin it feels like in your case the initial data set is actually an external dependency. Can we do something like this to avoid caching (and even storing a single copy of the tarball on your machine):

dvc run -d s3://your-bucket/images.tgz -O images "aws s3 cp s3://your-bucket/images.tgz - | tar -xzvf -"

would it solve the problem?

I assume you do use cache for all the subsequent steps. And space is not a big issue for them, is it correct?
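The streaming idea can be sketched without DVC or S3 (hypothetical local paths): piping the archive straight into tar means the extracted files are the only copy written to disk at the destination:

```shell
set -e
rm -rf stream-demo && mkdir -p stream-demo/src stream-demo/out
echo "pixels" > stream-demo/src/img1.dat
tar -czf stream-demo/images.tgz -C stream-demo/src .
# analogous to: aws s3 cp s3://bucket/images.tgz - | tar -xzf -
# the tarball is never duplicated at the destination
cat stream-demo/images.tgz | tar -xz -C stream-demo/out
ls stream-demo/out
```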

@guysmoilov

Contributor

commented Feb 19, 2019

I would vote for 2️⃣, since it's less likely to surprise a newbie. It should lessen friction in the beginning; there's already plenty to be careful of when you're starting out, without adding cognitive load.

After you're done experimenting and learning, feel more comfortable, and want to optimize your disk usage (in the specific case where you're on a system that does not support reflinks!), then you can switch to "advanced mode". Kind of like the pythonic approach, make things easy until there's good enough reason to make them hard.
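As an aside, reflink support can be probed directly; a sketch assuming GNU coreutils cp (macOS's BSD cp uses -c instead, and the clone fails on filesystems like ext4):

```shell
# probe whether the filesystem supports copy-on-write clones
echo hi > probe.txt
if cp --reflink=always probe.txt probe-clone.txt 2>/dev/null; then
  echo "reflinks supported"
else
  echo "no reflink support; DVC would fall back to the next cache type"
fi
```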

@drorata


commented Feb 24, 2019

If @colllin is trying to minimize the number of versions O(v), then what is the need for dvc in his workflow?

A working directory contains one (and only one) copy of each item from the cache. You have to have enough space to host these items, regardless of whether it is on a local machine or some cloud instance. The cache's size is the total of the sizes of all items across all their versions. This is naturally much larger than your local copy. If you need/want to have many versions, be prepared to pay for the storage. I don't see a way around it. If you're short on space but your data is huge, you will have to reduce the cache size.

The strategy used by dvc is rather secondary, and it won't magically allow you to enjoy both many versions and large volumes. Furthermore, @colllin actually suggests that the cache is a minor consideration, since he runs dvc gc regularly.

@bayethiernodiop


commented Mar 20, 2019

Thanks for this thread. I feel a little dumb :( because, being new to DVC and just trying to understand it, I was thinking there was no way anyone could prefer option 2 (I wasn't even considering it).
Doing the first examples, I didn't see any use case that wouldn't prefer option 1 (just versioning data and changing scripts), so I thought it would be better to protect files by default so that the cache is safe and there is no need to copy the data. However, reading this thread clearly shows the problems, like an error occurring during the final step of training due to file protection.
I actually think both are good options, but option 2 should be the default when using run (by automatically removing protection on the used files), and option 1 should be the default when doing manual modifications on a dataset, to avoid unwanted cache corruption.
So I think options 1 and 2 are not exclusive and can live together.

@efiop

Member

commented Apr 2, 2019

Related #1821

@efiop

Member

commented May 12, 2019

Guys, thanks a lot for this amazing discussion! We've decided to go with reflink,copy by default and a proper warning if copying takes too long. We will also soon prepare a doc describing it in great detail. reflink,copy is the default starting from 0.40.0, and you can always switch back to the old behavior with dvc config cache.type reflink,hardlink,symlink,copy.
