btrfs-sxbackup can cause btrfs deadlock #17

Closed
bhelm opened this issue Jun 10, 2015 · 7 comments

Comments

@bhelm
Contributor

bhelm commented Jun 10, 2015

This problem has more to do with btrfs than with btrfs-sxbackup. I'm still opening this issue to warn other users about it, and because btrfs-sxbackup could avoid the problem.

When btrfs-sxbackup starts while another instance is already running against the same target, it does not detect this and tries to delete the temp directories on both sides. This causes a deadlock on Linux kernel versions < 4.0.3 (see https://www.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.0.3).

This deadlock leaves the .sxbackup directory in an unusable state; only a reboot resolves it.

btrfs-sxbackup could warn and refuse to run a backup job that appears to be already running, since most users do not want to run two instances of the same job at the same time.

@lkraav

lkraav commented Jun 10, 2015

Thanks. I have been wondering what happens when a remote transfer job takes more than an hour and the next job begins.

@masc3d
Owner

masc3d commented Jun 10, 2015

thanks for the heads up. the expected result in this case is:

INFO preparing environment
ERROR Command '['bash', '-c', 'if [ -d "/.sxbackup/.temp"* ]; then btrfs sub del "/.sxbackup/.temp"*; fi']' returned non-zero exit status 1
ERROR: cannot delete '/.sxbackup/.temp.675e3049b71349b2aa8dc19cac6b349b' - Operation not permitted

A temporary subvolume can exist for various reasons, e.g. unclean termination due to a kill, reset, or power outage. Refusing to run essentially means delegating the cleanup to the user in all those other cases, which is (imho) a bad idea. Best to use a kernel version that has the fix.

@masc3d masc3d closed this as completed Jun 10, 2015
@bhelm
Contributor Author

bhelm commented Jun 11, 2015

My suggestion is to let btrfs-sxbackup check for itself whether it is already running, because it should not depend on (the currently buggy) btrfs to do so. btrfs-sxbackup should not run twice. Even if running it twice makes it fail at a lower level, that is imho not the clean way to handle it. The "Operation not permitted" message is not exactly a "btrfs-sxbackup is already running".

Also, please note that Linux 4.0.3+ is rarely seen on production systems. Debian does not even offer a prebuilt package for amd64. This bug hit me three times on different systems running Linux 3.17 to 4.0.2, causing silent backup failures and reboots.

@lkraav

lkraav commented Jun 11, 2015

@bhelm has a strong argument.

@masc3d
Owner

masc3d commented Jun 11, 2015

Multiple instances of btrfs-sxbackup are allowed, since you could run multiple jobs at the same time.
I cannot replicate this on 4.0.1 (at all). I see delays, but no deadlocks. How do you replicate it?

The issue is already resolved at the kernel level, and I will not implement a workaround because of the update woes of specific distros. But you could look into https://ma.ttias.be/prevent-cronjobs-from-overlapping-in-linux/ if you have to.

@bhelm
Contributor Author

bhelm commented Jul 8, 2015

> Multiple instances of btrfs-sxbackup are allowed, since you could run multiple jobs at the same time.

That is right, but that is not the problem I'm talking about. The problem sometimes occurs when you run multiple instances of the SAME job. This can cause deadlocks on older kernels, and btrfs-sxbackup does not support this use case either.

btrfs itself should have no problem if you create multiple snapshots of the same volume and transfer them simultaneously, but btrfs-sxbackup deletes the first, partially transferred snapshot before it starts another transfer (without noticing that the transfer belonging to that snapshot is still active).

This could be avoided by checking for already running btrfs-sxbackup instances on a per-job basis and refusing to start another transfer in that case. I'm now using the Linux flock utility to achieve this. Having this functionality natively in btrfs-sxbackup would make job creation a bit easier and would prevent new users from running the same job twice by accident, and from running into that btrfs deadlock bug on common stable kernels, which requires a reboot to fix.
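
For reference, a minimal Python sketch of the wrapper approach described above: run a command only while holding an exclusive, non-blocking lock on a lock file, roughly what the flock utility does in non-blocking mode. The function name `run_exclusively` and the lock file path are hypothetical and not part of btrfs-sxbackup; the wrapped command is whatever invocation the job already uses.

```python
import fcntl
import os
import subprocess


def run_exclusively(lock_path, command):
    """Run `command` only if no other process holds the lock on `lock_path`.

    Returns the command's exit status, or None if the lock is already held
    (i.e. another run of the same job is presumably still active).
    `run_exclusively` is a hypothetical helper, not a btrfs-sxbackup API.
    """
    # Open (or create) the lock file. The flock lock belongs to this open
    # file and is dropped when the descriptor is closed or the process dies.
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            # LOCK_NB: fail immediately instead of waiting for the other run.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        # The lock stays held by this process while the command runs.
        return subprocess.call(command)
    finally:
        os.close(fd)
```

A cron entry could call this with one lock file per job, so that different jobs still run in parallel while a second run of the same job is skipped and can be reported with a clear message.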

@masc3d
Owner

masc3d commented Aug 6, 2015

OK. If you provide a good implementation using fcntl, I will pull from you, but I don't have time to implement it for now.
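
A rough sketch of what such an fcntl-based per-job lock might look like, under stated assumptions: the names `job_lock`, `lock_path_for`, and `JobAlreadyRunning`, the lock directory, and the idea of keying the lock on the destination path are all hypothetical and not taken from btrfs-sxbackup. It uses `fcntl.lockf`, whose locks are held per process and released by the kernel when the process exits, so a killed run should not leave a stale lock behind.

```python
import contextlib
import errno
import fcntl
import hashlib
import os


class JobAlreadyRunning(RuntimeError):
    """Raised when another process holds the lock for the same job."""


def lock_path_for(destination, lock_dir):
    # Hypothetical naming scheme: one lock file per job destination.
    digest = hashlib.sha256(destination.encode("utf-8")).hexdigest()[:16]
    return os.path.join(lock_dir, "sxbackup-" + digest + ".lock")


@contextlib.contextmanager
def job_lock(destination, lock_dir="/tmp"):
    """Hold an exclusive per-job lock for the duration of the with-block."""
    fd = os.open(lock_path_for(destination, lock_dir),
                 os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            # Non-blocking: report a running job instead of waiting for it.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno in (errno.EACCES, errno.EAGAIN):
                raise JobAlreadyRunning(
                    "backup job for %r is already running" % destination
                ) from None
            raise
        yield
    finally:
        # Closing the descriptor releases the lock.
        os.close(fd)
```

If every run of a job takes this lock first, a leftover temporary subvolume found while holding the lock cannot belong to a live run of that job, so cleanup after a kill or power outage could stay automatic while a concurrent run of the same job gets an explicit "already running" error. Different jobs map to different lock files and can still run at the same time.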
