-
Notifications
You must be signed in to change notification settings - Fork 343
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dangerous default settings cause backups to fail entirely on any problems reading a single file/directory #690
Comments
There are 2 camps here, I personally prefer not failing by default, but an argument can be made that by succeeding silently the user may not even notice their files are not backed up. @ntolia and @julio-lopez who may have more comments. |
By 'succeeding silently' are you talking about exit codes? I don't see a reason to do that either. Taking three different approaches into account:
Both (1) and (2) are dangerous, and I don't know why anyone would want them. Especially by default. (3) is really the only sensible option for a backup program. This is how every other backup program I've used works. Is there any reason not to do (3) ? |
@LowSun Thanks for the feedback. The main rationale behind the current implementation / default is to avoid (2) above. (3) sounds reasonable. Concrete modifications to improve the current behavior:
|
Given that backups are often the last resort option, it makes sense to fail hard and early to prevent data loss. This is especially true for new users of kopia who will often not read all the documentation before starting to use the tool. For better UX, I would argue that fail fast should remain as default with an option, that needs to be explicitly set, to ignore failures. |
@ntolia I really can't understand your reasoning at all here. You're literally causing extra data loss for zero benefit. (3) still warns the user just as much as (1) does... but without causing even more unnecessary data loss. Backups are something that people set up and leave to run automated. Sure, "fail hard and early" makes sense in many areas of programming, however automated backups is definitely not one of them. If you think "fail hard and early" applies to backup software, I'm sorry to say that you really shouldn't be working on, or influencing decisions in backup software that other people will be relying on. Sorry that I'm being so harsh here, it's not something that I want to do when it comes to people creating open source software, but in this instance your opinion on this is just outright scary for something as fundamentally important as backup software. And it leaves me very concerned as to what other similar decisions may have been made in the software that I'm not even aware of yet. The realities are that:
That is a point in favor of having safe defaults that don't intentionally CAUSE data loss, i.e. the reason to do (3). I don't see how you could think this is a point in favor of breaking users' backups intentionally by default. I have been reading the documentation, and I didn't find any mention of this default setting. And I still can't find any documentation on how to disable it. But it sounds like disabling it is just going to ignore errors entirely, which is also a problem anyway.
"Ignoring failures" (2) really has nothing to do with the main point of this issue. I'm not arguing in favor of ignoring anything. I'm just saying you shouldn't intentionally cause backups to fail entirely, unless a user has specifically opted into that... and I can't even see a reason why any responsible person that cares about their backups would want that anyway. Maybe it's handy to save some time when you're developing/testing/debugging the program, but it doesn't make any sense for actually doing real backups, especially automated ones that are going to fail when people aren't always sitting there watching all the time, and don't necessarily have a bunch of time on their hands to deal with figuring out the file exclusion rules (both how the rules are set up in Kopia + also the which rules the user needs for their files). Again, sorry to be so abrupt here. I didn't think anything more needed to be said until seeing the comment above from @ntolia - but I'm just really concerned that anyone working on / influencing backup software would think that "fail hard and early" (or even silently ignoring errors) should be applied to automated backup software by default. This is shaping up to be some really awesome software. But please take the advice of myself and @julio-lopez - as this issue is a real showstopper for real world use for now. I wish you all the best, and do very much appreciate the work you do in creating and sharing open source software. Thanks for reading (sorry, way too long I know!). |
@LowSun - It doesn't seem like anyone is advocating for "silently ignoring errors". As for default behavior, I'm not sure I share your views on the "danger" of "fail hard and early". Whether you encounter an error after 5 seconds or 5 hours doesn't ultimately matter, apart from time saved by failing early. Even if you end up with a "usable" snapshot after one or more errors, you still need to determine, either by manual examination or by building automated parsing of the error output, whether the snapshot is viable. And you must have the presence of mind to do so at the time. Something to consider if (3) is default: "hey my repo has a perfectly valid snapshot from 18 months ago, but when I restored it yesterday, When you default to failing early, you at least hold a newbie to the minimum bar of asserting that they acknowledge (by providing the appropriate flag) the potential for data loss when bypassing errors. |
Well apparently that's what changing this setting does currently. But other than that, yeah it's not the main issue here.
Indeed. What matters is whether your backups actually work. And currently with (1) as default, they don't. As soon as it has a minor IO problem, it just stops backing anything up at all, including all following data that can be backed up perfectly well. Yes that's ok when you're sitting there watching it and testing things, but it's definitely not ok when backups are meant to be doing their thing automatically.
Ok. But having 99% of your backup work (everything except the problematic files) is better than it just cancelling at 1% and not backing up anything more at all. Having a usable snapshot is better than an unusable one. That's why no other reasonable backup system just cancels your backup entirely for minor expected issues like locked files/permission issues... which are always to be expected for any unattended backup software. Name another backup program that does this?
This is AUTOMATED backup software that you're supposed to be able to set up and then rely on. It should just carry on doing what it can, and yes also report any problems too. The premise that your backups should just stop working altogether until you've managed to figure out and predict every future minor issue that might ever occur is incredibly reckless.
I don't even really get what your point is here. If they're not monitoring the logs to begin with, then that's the exact scenario where (1) is really going to screw them over. There's absolutely zero benefits to (1) or (2) in that scenario where they were ignoring the logs / exit codes all along. In that scenario, under (1) they wouldn't have been able to restore anything at all, which is a lot worse than not being able to restore one specific file. You're proving my point for me.
The default setting itself is literally the cause of the data loss in this case. If it wasn't that way, then the data loss wouldn't actually be occurring to begin with. The idea that you would intentionally cause data loss by default because of some abstract notion of "you at least hold a newbie to the minimum bar of asserting that they acknowledge" is just willfully harmful. If you want somebody to "acknowledge" something, then you need to tell them that up front and have them confirm they understand that. Randomly failing at runtime depending on what they happening to be doing at the time (and assuming they're watching) is not a reasonable way to expect people to "acknowledge" something. Especially when that "acknowledgement" has zero tangible actual benefits that help anyone with anything at all, and actively causes the exact problem that you were supposedly "warning" them about to begin with. Like I pointed out repeatedly, there are many reasons why backups will have inconsistent IO issues that you can't always predict. The idea that data that can be backed up, should just be skipped altogether and your automated backup system in production should just fail entirely is absolutely ridiculous. Not a big deal if this hasn't occurred to anyone yet, and I didn't report this to argue with people... but the fact that now even with all these issues pointed out, two people are arguing in favor of intentionally causing data loss when common IO issues occur on a single file is just stupefying to me. |
Good discussion, thanks folks, I agree we should be doing (3) by default, while also always(*) saving the failed partial snapshot records in the history. This will be visible to both users who monitor their backup exit codes (scripts, cron) and those that don't (UI will show failed snapshot in the list). (*) There's still an edge case where the root of the snapshot cannot even be opened (whether it's a file or directory). In this case I think we must fail hard (both CLI and UI) without saving the empty snapshot record at all, because other parts of Kopia currently expect all snapshots to have a valid roots. @julio-lopez if we add --fail-fast flag, does that mean we can get rid of error policies altogether? or should "fail fast" be a policy setting? |
We've been sitting on this issue for too long, we should address this before 0.8 release: One last tweak: how about we repurpose the current "ignore" policies to report warnings instead of errors? We will have three possible behaviors when a file can't be read: a) a) will be the new default in 0.8 which is identical to (3) proposal above. @julio-lopez thoughts? I'm currently working on refactoring the snapshot/estimate UX and this change will fit in my next PR. |
Works. Tagging @pavannd1 as well. |
We discussed this offline and came up with environment variable (something like |
Fixes kopia#690 This is a breaking change for folks who are expecting snapshots to fail quickly without writing a snapshot manifest in case of an error. Before this change, any source read failure would cause the entire snapshot to fail (and not write a snapshot manifest as a result), unless `ignoreFileErrors` or `ignoreDirectoryErrors` was set. The new behavior is to continue snapshotting remaining files and directories (this can be disabled by passing `--fail-fast` flag or setting `KOPIA_SNAPSHOT_FAIL_FAST=1` environment variable) and defer returning an error until the very end. After snapshotting we will always attempt to write the snapshot manifest (except when the root of the snapshot itself cannot be opened). In case of a fail-fast error, the manifest will be marked as 'partial' and the directory tree will contain only partial set of files. In case of any errors, the manifest (and each directory object) will list the number if failures and no more than 10 examples of failed files/directories along with their respective errors. Once the snapshot is complete we will return non-zero exit code to the operating system if there were any fatal errors during snapshotting. With this change we are repurposing `ignoreFileErrors` and `ignoreDirectoryErrors` to designate some errors as non-fatal. Non-fatal errors are reported as warnings in the logs and will not cause a non-zero exit code to be returned.
Fixes #690 This is a breaking change for folks who are expecting snapshots to fail quickly without writing a snapshot manifest in case of an error. Before this change, any source read failure would cause the entire snapshot to fail (and not write a snapshot manifest as a result), unless `ignoreFileErrors` or `ignoreDirectoryErrors` was set. The new behavior is to continue snapshotting remaining files and directories (this can be disabled by passing `--fail-fast` flag or setting `KOPIA_SNAPSHOT_FAIL_FAST=1` environment variable) and defer returning an error until the very end. After snapshotting we will always attempt to write the snapshot manifest (except when the root of the snapshot itself cannot be opened). In case of a fail-fast error, the manifest will be marked as 'partial' and the directory tree will contain only partial set of files. In case of any errors, the manifest (and each directory object) will list the number if failures and no more than 10 examples of failed files/directories along with their respective errors. Once the snapshot is complete we will return non-zero exit code to the operating system if there were any fatal errors during snapshotting. With this change we are repurposing `ignoreFileErrors` and `ignoreDirectoryErrors` to designate some errors as non-fatal. Non-fatal errors are reported as warnings in the logs and will not cause a non-zero exit code to be returned.
It seems that using the default settings, your backup will fail entirely as soon as it hits any problems reading any file or directory. e.g. with an error like:
Is there a reason for this?
This seems incredibly dangerous to me for a backup program. Why would anybody want their backups to fail entirely due to an issue with a single directory/file? Let alone it be the default setting?
The text was updated successfully, but these errors were encountered: