
Dangerous default settings cause backups to fail entirely on any problems reading a single file/directory #690

Closed
LowSun opened this issue Oct 21, 2020 · 11 comments · Fixed by #840
Labels
breaking-change PRs that introduce breaking changes ux User eXperience

Comments

@LowSun

LowSun commented Oct 21, 2020

It seems that with the default settings, your backup fails entirely as soon as it hits any problem reading any file or directory, e.g. with an error like:

error: encountered 1 errors:
unable to process directory "some_folder_name":   lock: The process cannot access the file because another process has locked a portion of the file., try --help

Is there a reason for this?

This seems incredibly dangerous to me for a backup program. Why would anybody want their backups to fail entirely due to an issue with a single directory/file? Let alone have it be the default setting?

@jkowalski
Contributor

There are two camps here. I personally prefer not failing by default, but an argument can be made that by succeeding silently, the user may not even notice their files are not backed up.

cc @ntolia and @julio-lopez, who may have more comments.

@LowSun
Author

LowSun commented Oct 21, 2020

By 'succeeding silently' are you talking about exit codes? I don't see a reason to do that either.

There are three different approaches to consider:

  • (1) Cause the entire backup to fail and give an error exit code (this is what it does currently by default)
  • (2) Continue the backup and silently give a success exit code - no reason to do this
  • (3) Output error messages to STDERR, continue the rest of the backup, and give an error exit code.

Both (1) and (2) are dangerous, and I don't know why anyone would want them. Especially by default.

(3) is really the only sensible option for a backup program. This is how every other backup program I've used works.

Is there any reason not to do (3)?
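For illustration, here's a minimal Go sketch of what (3) could look like (the structure is hypothetical, not Kopia's actual code): report each read error to STDERR as it happens, keep walking the tree, and return a non-zero exit code at the end if anything failed.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

func main() {
	errCount := 0

	// Walk the tree; a read failure on one entry must not abort the walk.
	filepath.WalkDir("/data", func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			// Approach (3): report to STDERR and keep going.
			fmt.Fprintf(os.Stderr, "error: unable to process %q: %v\n", path, err)
			errCount++
			return nil // returning nil lets WalkDir continue with the rest
		}
		// ... read and back up the entry here ...
		return nil
	})

	// A non-zero exit code still signals the partial failure to cron/scripts.
	if errCount > 0 {
		fmt.Fprintf(os.Stderr, "error: encountered %d errors\n", errCount)
		os.Exit(1)
	}
}
```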

@julio-lopez
Collaborator

julio-lopez commented Oct 21, 2020

  • (1) Cause the entire backup to fail and give an error exit code (this is what it does currently by default)
  • (2) Continue the backup and silently give a success exit code - no reason to do this
  • (3) Output error messages to STDERR, continue the rest of the backup, and give an error exit code.

@LowSun Thanks for the feedback. The main rationale behind the current implementation / default is to avoid (2) above. (3) sounds reasonable.

Concrete modifications to improve the current behavior:

  • Change the implementation to allow (3). IIRC, currently ignoring errors results in (2).
  • Add a flag to allow (1) for scenarios where "fail fast" may be desirable.

@julio-lopez julio-lopez added the ux User eXperience label Oct 21, 2020
@ntolia
Collaborator

ntolia commented Oct 21, 2020

Given that backups are often the last-resort option, it makes sense to fail hard and early to prevent data loss. This is especially true for new users of Kopia, who will often not read all the documentation before starting to use the tool. For better UX, I would argue that fail-fast should remain the default, with an explicitly set option to ignore failures.

@LowSun
Author

LowSun commented Oct 22, 2020

it makes sense to fail hard and early to prevent data loss

@ntolia I really can't understand your reasoning at all here. You're literally causing extra data loss for zero benefit.

(3) still warns the user just as much as (1) does... but without causing even more unnecessary data loss. Backups are something that people set up and leave to run unattended.

Sure, "fail hard and early" makes sense in many areas of programming, however automated backups is definitely not one of them.

If you think "fail hard and early" applies to backup software, I'm sorry to say that you really shouldn't be working on, or influencing decisions in, backup software that other people will be relying on. Sorry that I'm being so harsh here; it's not something I want to do when it comes to people creating open source software, but in this instance your opinion is just outright scary for something as fundamentally important as backup software. And it leaves me very concerned about what other similar decisions may have been made in the software that I'm not even aware of yet.

The realities are that:

  • Backups regularly have issues with reading files that come and go, on any OS, but especially on Windows with file locking. I've been refining my exclusion list literally for years for dirs under C:\Users where inconsistently locked files cause issues with backup software. The exclusion list is never going to be 100% complete. If I'd been using Kopia with its default setting, my backups would basically never have worked at all... or worse yet, worked when first set up, then unexpectedly broken sometime later.
  • You can't just assume that the backup is going to forever run as smoothly as it did the first time you manually ran it and watched it.
  • Not everyone properly monitors their backup systems/scripts. So many users will never even be aware that their backups were failing for months/years until they try to restore something. Notification/monitoring systems fail too. Is that bad? Yes definitely, but it happens. But it's very reasonable to assume that backup software wouldn't be written to intentionally fail entirely on a minor hiccup on a single file that probably wasn't important anyway.
  • And even when you are aware of certain specific files causing issues, you might not be able to fix them immediately, especially if you're using this on production servers. This is additionally complicated by the fact that Kopia uses fairly unusual methods of ignoring files that aren't properly documented and can't easily be previewed with a dry run, at least as far as I've been able to figure out.

This is especially true for new users of kopia who will often not read all the documentation before starting to use the tool.

That is a point in favor of having safe defaults that don't intentionally CAUSE data loss, i.e. the reason to do (3). I don't see how you could think this is a point in favor of breaking users' backups intentionally by default.

I have been reading the documentation, and I didn't find any mention of this default setting. And I still can't find any documentation on how to disable it. But it sounds like disabling it is just going to ignore errors entirely, which is also a problem anyway.

that needs to be explicitly set, to ignore failures

"Ignoring failures" (2) really has nothing to do with the main point of this issue. I'm not arguing in favor of ignoring anything. I'm just saying you shouldn't intentionally cause backups to fail entirely, unless a user has specifically opted into that... and I can't even see a reason why any responsible person that cares about their backups would want that anyway. Maybe it's handy to save some time when you're developing/testing/debugging the program, but it doesn't make any sense for actually doing real backups, especially automated ones that are going to fail when people aren't always sitting there watching all the time, and don't necessarily have a bunch of time on their hands to deal with figuring out the file exclusion rules (both how the rules are set up in Kopia + also the which rules the user needs for their files).

Again, sorry to be so abrupt here. I didn't think anything more needed to be said until seeing the comment above from @ntolia - but I'm just really concerned that anyone working on / influencing backup software would think that "fail hard and early" (or even silently ignoring errors) should be applied to automated backup software by default.

This is shaping up to be some really awesome software. But please take the advice of @julio-lopez and myself, as this issue is a real showstopper for real-world use for now.

I wish you all the best, and do very much appreciate the work you do in creating and sharing open source software. Thanks for reading (sorry, way too long I know!).

@redgoat650
Contributor

@LowSun - It doesn't seem like anyone is advocating for "silently ignoring errors".

As for default behavior, I'm not sure I share your views on the "danger" of "fail hard and early". Whether you encounter an error after 5 seconds or 5 hours doesn't ultimately matter, apart from the time saved by failing early. Even if you end up with a "usable" snapshot after one or more errors, you still need to determine, either by manual examination or by building automated parsing of the error output, whether the snapshot is viable. And you must have the presence of mind to do so at the time.

Something to consider if (3) is default: "hey my repo has a perfectly valid snapshot from 18 months ago, but when I restored it yesterday, /file/that/definitely/should/be/there was missing. No, I don't have any error output logs from last year, why do you ask?"

When you default to failing early, you at least hold a newbie to the minimum bar of asserting that they acknowledge (by providing the appropriate flag) the potential for data loss when bypassing errors.

@LowSun
Author

LowSun commented Oct 22, 2020

It doesn't seem like anyone is advocating for "silently ignoring errors".

Well, apparently that's what changing this setting currently does. But other than that, yeah, it's not the main issue here.

Whether you encounter an error after 5 seconds or 5 hours doesn't ultimately matter, apart from time saved by failing early.

Indeed.

What matters is whether your backups actually work.

And currently, with (1) as the default, they don't. As soon as it hits a minor IO problem, it just stops backing anything up at all, including all the following data that could be backed up perfectly well. Yes, that's ok when you're sitting there watching it and testing things, but it's definitely not ok when backups are meant to be doing their thing automatically.

Even if you end up with a "usable" snapshot after one or more errors, you still need to determine, either by manual examination or by building automated parsing of the error output, whether the snapshot is viable.

Ok. But having 99% of your backup work (everything except the problematic files) is better than it just cancelling at 1% and not backing up anything more at all. Having a usable snapshot is better than an unusable one.

That's why no other reasonable backup system just cancels your backup entirely for minor issues like locked files or permission errors... which are always to be expected with any unattended backup software. Name another backup program that does this?

And you must have the presence of mind to do so at the time.

This is AUTOMATED backup software that you're supposed to be able to set up and then rely on. It should just carry on doing what it can, and yes also report any problems too. The premise that your backups should just stop working altogether until you've managed to figure out and predict every future minor issue that might ever occur is incredibly reckless.

Something to consider if (3) is default: "hey my repo has a perfectly valid snapshot from 18 months ago, but when I restored it yesterday, /file/that/definitely/should/be/there was missing. No, I don't have any error output logs from last year, why do you ask?"

I don't even really get what your point is here. If they're not monitoring the logs to begin with, then that's the exact scenario where (1) is really going to screw them over. There are absolutely zero benefits to (1) or (2) in that scenario where they were ignoring the logs / exit codes all along. In that scenario, under (1) they wouldn't have been able to restore anything at all, which is a lot worse than not being able to restore one specific file. You're proving my point for me.

When you default to failing early, you at least hold a newbie to the minimum bar of asserting that they acknowledge (by providing the appropriate flag) the potential for data loss when bypassing errors.

The default setting itself is literally the cause of the data loss in this case. If it wasn't that way, then the data loss wouldn't actually be occurring to begin with. The idea that you would intentionally cause data loss by default because of some abstract notion of "you at least hold a newbie to the minimum bar of asserting that they acknowledge" is just willfully harmful. If you want somebody to "acknowledge" something, then you need to tell them that up front and have them confirm they understand it. Randomly failing at runtime depending on what they happen to be doing at the time (and assuming they're watching) is not a reasonable way to expect people to "acknowledge" something. Especially when that "acknowledgement" has zero tangible benefits that help anyone with anything at all, and actively causes the exact problem that you were supposedly "warning" them about to begin with.

Like I pointed out repeatedly, there are many reasons why backups will have inconsistent IO issues that you can't always predict. The idea that data that can be backed up should just be skipped altogether, and that your automated backup system in production should just fail entirely, is absolutely ridiculous. Not a big deal if this hasn't occurred to anyone yet, and I didn't report this to argue with people... but the fact that now, even with all these issues pointed out, two people are arguing in favor of intentionally causing data loss when common IO issues occur on a single file is just stupefying to me.

@jkowalski
Contributor

Good discussion, thanks folks.

I agree we should be doing (3) by default, while also always(*) saving the failed partial snapshot records in the history. This will be visible both to users who monitor their backup exit codes (scripts, cron) and to those who don't (the UI will show the failed snapshot in the list).

(*) There's still an edge case where the root of the snapshot cannot even be opened (whether it's a file or directory). In this case I think we must fail hard (both CLI and UI) without saving the empty snapshot record at all, because other parts of Kopia currently expect all snapshots to have a valid root.

@julio-lopez if we add a --fail-fast flag, does that mean we can get rid of error policies altogether? Or should "fail fast" be a policy setting?
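To make that concrete, here's a rough Go sketch (walkAndUpload and saveManifest are hypothetical stand-ins, not the actual implementation): the root failing to open is the one hard failure; everything else still produces a saved, possibly-failed snapshot record.

```go
package snapshotter // hypothetical sketch, not Kopia's actual code

import (
	"fmt"
	"os"
)

// Hypothetical stand-ins for the real upload/manifest machinery.
func walkAndUpload(f *os.File) int            { return 0 } // uploads entries, returns failure count
func saveManifest(root string, numErrors int) {}           // persists the (possibly failed) snapshot record

// snapshotRoot sketches the proposed behavior: if the root itself cannot
// be opened, fail hard and save nothing; otherwise always save a snapshot
// record and surface read errors through the returned error / exit code.
func snapshotRoot(root string) error {
	f, err := os.Open(root)
	if err != nil {
		// Edge case (*): no usable root, so no snapshot record at all.
		return fmt.Errorf("unable to open snapshot root: %w", err)
	}
	defer f.Close()

	numErrors := walkAndUpload(f)
	saveManifest(root, numErrors) // saved even when some entries failed

	if numErrors > 0 {
		return fmt.Errorf("encountered %d errors", numErrors)
	}
	return nil
}
```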

@jkowalski
Contributor

We've been sitting on this issue for too long; we should address this before the 0.8 release:

One last tweak: how about we repurpose the current "ignore" policies to report warnings instead of errors?

We will have three possible behaviors when a file can't be read:

a) fail-fast==false && ignoreFileErrors==false - reports the error, continues the snapshot, final exit code != 0
b) fail-fast==true && ignoreFileErrors==false - reports the error, immediately fails the snapshot, final exit code != 0
c) fail-fast==true|false && ignoreFileErrors==true - reports a warning (not an error), continues the snapshot, final exit code == 0

a) will be the new default in 0.8, which is identical to proposal (3) above.
b) can be set in 0.8 to revert to the 0.7.3 behavior (arguably broken).
c) may still be useful if you're trying to back up a directory that has a file that's occasionally locked but you don't want to fail the cron job.
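A minimal Go sketch of the a)/b)/c) decision table above (handleReadError and the surrounding names are illustrative, not the actual implementation):

```go
package snapshotter // hypothetical

import "log"

// handleReadError maps the (fail-fast, ignoreFileErrors) settings onto the
// a/b/c behaviors: stop says whether to abort the snapshot right away,
// fatal says whether the final exit code should be non-zero.
func handleReadError(failFast, ignoreFileErrors bool, err error) (stop, fatal bool) {
	if ignoreFileErrors {
		// (c): report a warning, keep going, exit code stays 0.
		log.Printf("warning: %v", err)
		return false, false
	}
	// (a) and (b): report an error; the final exit code will be non-zero.
	log.Printf("error: %v", err)
	// Only (b), fail-fast, aborts the snapshot immediately.
	return failFast, true
}
```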

@julio-lopez thoughts?

I'm currently working on refactoring the snapshot/estimate UX and this change will fit in my next PR.

@julio-lopez julio-lopez added the breaking-change PRs that introduce breaking changes label Feb 16, 2021
@julio-lopez
Collaborator

Works.
We'd need to tag it as a "breaking change" since there are downstream projects that rely on the current default behavior.

Tagging @pavannd1 as well.

@jkowalski
Contributor

We discussed this offline and came up with an environment variable (something like KOPIA_SNAPSHOT_FAIL_FAST) which would provide the default for a kopia invocation.
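As a sketch of that pattern using Go's standard flag package (the wiring here is an assumption, not Kopia's actual CLI code): the environment variable seeds the flag's default, and passing the flag explicitly still overrides it.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	// The environment variable seeds the default; an explicit
	// --fail-fast / --fail-fast=false on the command line still wins.
	defaultFailFast := os.Getenv("KOPIA_SNAPSHOT_FAIL_FAST") == "1"
	failFast := flag.Bool("fail-fast", defaultFailFast, "abort the snapshot on the first read error")
	flag.Parse()

	fmt.Println("fail-fast:", *failFast)
}
```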

jkowalski added a commit to jkowalski/kopia that referenced this issue Feb 17, 2021
Fixes kopia#690

This is a breaking change for folks who are expecting snapshots to fail
quickly without writing a snapshot manifest in case of an error.

Before this change, any source read failure would cause the entire
snapshot to fail (and not write a snapshot manifest as a result),
unless `ignoreFileErrors` or `ignoreDirectoryErrors` was set.

The new behavior is to continue snapshotting remaining files and
directories (this can be disabled by passing the `--fail-fast` flag or
setting `KOPIA_SNAPSHOT_FAIL_FAST=1` environment variable) and defer
returning an error until the very end.

After snapshotting we will always attempt to write the snapshot manifest
(except when the root of the snapshot itself cannot be opened). In case
of a fail-fast error, the manifest will be marked as 'partial' and
the directory tree will contain only a partial set of files.

In case of any errors, the manifest (and each directory object) will
list the number of failures and no more than 10 examples of failed
files/directories along with their respective errors.

Once the snapshot is complete, we will return a non-zero exit code to the
operating system if there were any fatal errors during snapshotting.

With this change we are repurposing `ignoreFileErrors` and
`ignoreDirectoryErrors` to designate some errors as non-fatal.
Non-fatal errors are reported as warnings in the logs and will not
cause a non-zero exit code to be returned.
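The "number of failures plus at most 10 examples" accounting described in the commit message could look roughly like this (hypothetical type names, not Kopia's actual manifest format):

```go
package snapshotter // hypothetical

const maxReportedEntries = 10

// EntryError records one failed file or directory.
type EntryError struct {
	Path  string
	Error string
}

// ErrorStats keeps the full failure count but stores at most
// maxReportedEntries concrete examples for the manifest.
type ErrorStats struct {
	NumFailed int
	Errors    []EntryError
}

// Add registers a failure, capping the stored examples at ten.
func (s *ErrorStats) Add(path string, err error) {
	s.NumFailed++
	if len(s.Errors) < maxReportedEntries {
		s.Errors = append(s.Errors, EntryError{Path: path, Error: err.Error()})
	}
}
```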
julio-lopez pushed a commit that referenced this issue Feb 17, 2021