Unable to run btrfs send after deduplication #50

Closed
AlexFreeland opened this issue Mar 30, 2015 · 36 comments

@AlexFreeland

Hi there,
After running "duperemove -drv" on a btrfs filesystem and taking a readonly snapshot I am no longer able to back it up off disk using btrfs send. The following error is thrown: "ERROR: send ioctl failed with -5: Input/output error"

Checking dmesg shows a corresponding "BTRFS error (device sdc): did not find backref in send_root. inode=708396, offset=131072, disk_byte=379846254592 found extent=379846254592"

The send command still works for the previous snapshot (prior to deduplication) and btrfs scrub shows no errors.

I have tested under Ubuntu 14.04.1 and 14.10 using btrfs-tools versions 3.12 and 3.14.1, respectively, and the issue persists. I am running duperemove v0.09.1.

Is this a known issue? Is there a workaround? Cheers

@markfasheh
Owner

Hi, to be honest this sounds like some sort of file system corruption. Have you been able to run btrfsck against the disk in question? What does it report?

@AlexFreeland
Author

btrfs check --repair /dev/sdc
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
enabling repair mode
Checking filesystem on /dev/sdc
UUID: 9453d881-041f-431f-a810-0ac9105e93db
cache and super generation don't match, space cache will be invalidated
found 347049165171 bytes used err is 0
total csum bytes: 741816336
total tree bytes: 5194465280
total fs tree bytes: 4176560128
total extent tree bytes: 220987392
btree space waste bytes: 881550719
file data blocks allocated: 1657671786496
referenced 973693386752
Btrfs v3.14.1

I am still able to successfully run send on a snapshot taken immediately prior to running duperemove so the issue certainly seems to be related to the deduplication process. I've tested this several times with the same result.

FWIW if I do a btrfs inspect-internal inode-resolve -v on the reported inode and stat the afflicted file I see that the modify and change times have been updated as reported in #51

Not sure if this is related to the issue I'm having or not but trying to give you as much information to work with as possible.
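
For anyone following along, the inode lookup mentioned above could be sketched in shell roughly like this. It is only an illustration: `extract_inode` is a hypothetical helper name, the sample line is the dmesg message quoted earlier in this thread, and /drive/2 is assumed to be the mount point of the affected filesystem.

```shell
# Pull the inode number out of a "did not find backref in send_root" line.
# extract_inode is a hypothetical helper, not part of btrfs-progs.
extract_inode() {
    printf '%s\n' "$1" | sed -n 's/.*inode=\([0-9][0-9]*\).*/\1/p'
}

# Sample line copied from the dmesg output quoted in this thread.
line='BTRFS error (device sdc): did not find backref in send_root. inode=708396, offset=131072, disk_byte=379846254592 found extent=379846254592'
inode=$(extract_inode "$line")
echo "$inode"

# Resolve the inode to a path and inspect its timestamps (run as root;
# /drive/2 is an assumed mount point, <path> is whatever inode-resolve prints):
# btrfs inspect-internal inode-resolve -v "$inode" /drive/2
# stat <path>
```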

@markfasheh
Owner

The time change means that we did indeed run on the files so that helps, thanks :)

So for some reason I didn't realize that this was happening to you right after a dedupe run. That is definitely not expected behavior though I'm not sure it's the duperemove userspace causing it. Give me some time (it's nearing the end of my day) to try to reproduce this locally on my machine.

In the meantime, what kernel are you using? The output of 'uname -a' would tell me this. Also if you were to cut and paste the exact commands you use to reproduce this it would speed things up on my end.

@AlexFreeland
Author

On the Ubuntu 14.04.1 machine:
Linux TestbedA 3.13.0-39-generic #66-Ubuntu SMP Tue Oct 28 13:30:27 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
On the 14.10 machine:
Linux TestbedB 3.16.0-33-generic #44-Ubuntu SMP Thu Mar 12 12:19:35 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

To reproduce the issue on a newly created btrfs volume (mounted at /drive/2) I run the following commands:

cd /drive/2
mkdir .backup
btrfs receive -f /drive/1/originalFilesystem .backup
cp -aR .backup/originalFilesystem/* ./
btrfs subvolume snapshot -r ./ .backup/pre
duperemove -drv $(ls -1)
btrfs subvolume snapshot -r ./ .backup/post

At this point, running btrfs send on .backup/pre will succeed; however, running it on .backup/post will fail with the error posted above.
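
The pre/post comparison could be scripted along these lines. This is a rough sketch only: `check_send` and the `BTRFS` variable are names made up for this example, and the snapshot paths are the ones from the steps above.

```shell
# BTRFS can be overridden to point at a different binary (hypothetical knob).
BTRFS=${BTRFS:-btrfs}

# Try to send a read-only snapshot, discard the stream, and report the result.
check_send() {
    if "$BTRFS" send "$1" > /dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "FAILED: $1"
        return 1
    fi
}

# Usage (as root, from /drive/2):
# check_send .backup/pre
# check_send .backup/post
# dmesg | tail -n 5    # look for "did not find backref in send_root"
```

Discarding the stream to /dev/null is enough to trigger the error, since the failure happens while the kernel walks the snapshot, not while the stream is received.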

@mac-linux-free

I also have this problem... please tell me how to fix send/receive.

@mac-linux-free

OK, I fixed it by doing a full balance... but that can't be right, can it?

@AlexFreeland
Author

I can also confirm that performing a balance allows me to use send as normal. Hopefully this can provide some clue as to what's occurring... :)

@Ferroin

Ferroin commented Apr 8, 2015

Personally, this sounds like a problem with the in-kernel implementation of IOC_BTRFS_EXTENT_SAME. Has anyone reported this on linux-btrfs@vger.kernel.org?

@markfasheh
Owner

I'm also pretty sure it's a kernel bug, there's nothing userspace should be able to do to fail something like this. The extent same ioctl uses the clone code beneath some safety checks.

Clone is also used to do reflinks, so we could exercise (mostly) the same path just by making 'cp --reflink=always' copies of a file and then trying to btrfs send that subvolume. If you can do that and tell me whether btrfs send still breaks, that would help narrow it down a bit.

Re the btrfs list, feel free to send them a bug report. You might want to CC me if you do that so I can help out with it.

@AlexFreeland
Author

Hi Mark,
I performed the following test:

cd /drive/2
mkdir .backup
btrfs receive -f /drive/1/originalFilesystem .backup
cp -aR --reflink=always .backup/originalFilesystem/* ./
btrfs subvolume snapshot -r ./ .backup/reflink

Running btrfs send on .backup/reflink completes without error. Hope this helps narrow things down.

@mac-linux-free

New info: btrfs balance only works if no other snapshots are present. That's bad... how can I migrate snapshots in btrfs?

@mac-linux-free

New info: btrfs send and receive now work with btrfs-progs 4.0 on both sides :) even after an (un)successful dedup.

@shyblower

@mac-linux-free: unfortunately I cannot confirm this! Running on Linux kernel 4.0.1 with btrfs-progs 4.0, I still get the error when trying to send a snapshot of a volume after having run duperemove on that volume.

@mac-linux-free

ok ... we are waiting for 4.1 :)

@mac-linux-free

It seems that kernel 4.1 and btrfs-progs 4.1 lead to a successful deduplication... but you have to balance your source btrfs pool first.

@shyblower

I've just balanced my 6TB backup target... it took weeks :/

@mac-linux-free

...hoping I don't have to do it on my 110TB source :\

@markfasheh
Owner

This sounds like it was fixed upstream - and to my knowledge the fix wasn't directly related to dedupe (let me know if otherwise). Going to close for now.

@mac-linux-free

Sorry to open this again. This is still not fixed with kernel 4.1.2 and btrfs-progs 4.1. After an unsuccessful dedup:

Kernel processed data (excludes target files): 32.0G
Comparison of extent info shows a net change in shared extents of: 0.0

the send / receive does not work:

BTRFS error (device vdb): did not find backref in send_root. inode=17588, offset=1703936, disk_byte=20751888384 found extent=20751888384

I fixed send/receive again with a full balance. But how to dedup?

@doudou

doudou commented Aug 7, 2015

Same here. Ran a dedup and got the send_root bug. The problem is that balance currently randomly crashes my machine (with a hard lock). I am deleting the offending inodes one by one ...

@markfasheh
Owner

Oops, OK, I'm going to keep this one open, with the same comment I gave in issue #87:

"Regarding the btrfs error you're seeing during send, there isn't anything that duperemove is doing directly which would cause this behavior. My guess is that send is broken (again) or that the clone code (used in the kernel for dedupe) corrupted something on disk. Have you asked about this on the btrfs list?"

I'll try to reproduce this week (on vacation) and take it to the list if I can't figure out immediately what's causing it.

@markfasheh reopened this Aug 10, 2015
@markfasheh
Owner

OK, I tried this a few times on 4.2-rc7 and was not able to reproduce the issue. Here's an example of what I was doing (I tried a few combinations of subvolumes and file trees):

fstest1:/ # uname -r
4.2.0-rc7+
fstest1:/ # cp -a /boot /btrfs/boot
fstest1:/ # cp -a /boot /btrfs/boot.1
fstest1:/ # duperemove -rdh /btrfs/boot /btrfs/boot.1 | tail -n 5
[0x13fe800] Dedupe 1 extents (id: 53d6821b) with target: (0.0, 57.8M), "/btrfs/boot/vmlinux-4.2.0-rc4+.gz"
[0x13fe8a0] Try to dedupe extents with id 7100870a
[0x13fe8a0] Dedupe 1 extents (id: 7100870a) with target: (0.0, 57.9M), "/btrfs/boot/vmlinux-4.2.0-rc6+.gz.old"
Kernel processed data (excludes target files): 529.6M
Comparison of extent info shows a net change in shared extents of: 388.4M
fstest1:/ # btrfs su sn -r /btrfs/ /btrfs/rosnap
Create a readonly snapshot of '/btrfs/' in '/btrfs/rosnap'
fstest1:/ # btrfs send /btrfs/rosnap/ > /dev/null
At subvol /btrfs/rosnap/
fstest1:/ # dmesg | tail -n 5
[360097.205463] BTRFS: device fsid 2ef606d8-6f69-460a-8f2e-ebdf477c2aba devid 1 transid 3 /dev/vdb1
[360100.424094] BTRFS info (device vdb1): disk space caching is enabled
[360100.424099] BTRFS: has skinny extents
[360100.424101] BTRFS: flagging fs with big metadata feature
[360100.429689] BTRFS: creating UUID tree

Does something like this reproduce it for you all or did I miss a step?

@mac-linux-free

Sorry to say it is still not working.

Linux fbo-fs-02 4.2.0-1.el7.elrepo.x86_64 #1 SMP Sun Aug 30 21:25:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

#dmesg
[18047.639439] BTRFS error (device vdc): did not find backref in send_root. inode=146553, offset=12058624, disk_byte=178733858816 found extent=178733858816

This error occurs on the fileserver which I had deduped before.

@doudou

doudou commented Sep 3, 2015

"This error occurs on the fileserver which I had deduped before."

Could it be that the error is due to the previous dedup?

By the way, it seems that btrfs-progs 4.1.2 can recover from these errors. I had one filesystem with those, and running btrfs check --repair removed them.

@mac-linux-free

No, I did not dedup until kernel 4.2 was finally released. I also do a nightly send/receive backup to a different server, and after the dedup my scripted backup failed. I also know how to recover from the error with just an online btrfs balance.

@markfasheh
Owner

I tried my test on 4.1 and didn't hit anything. Can anyone here give me a test case which reproduces this from a fresh file system? I doubt I'll be able to find this without a reproducer :(

@mac-linux-free

I do not understand this. It is happening on one server only. I had 2 new installations last week and these errors did not occur on the new servers. Perhaps I should try the btrfs check --repair option instead of balancing only.

How could I debug this?

@markfasheh
Owner

I'd start with btrfsck if you haven't already.

@mac-linux-free

I have one more piece of information. While duperemove is running (in the background), you should avoid running btrfs send/receive. I also found that OOM errors sometimes occurred on big filesystems.

Both of these things lead to the error above.
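
One way to keep a dedupe run and a send/receive backup from overlapping is to wrap both in the same lock, for example with `flock` from util-linux. This is a sketch under assumptions, not a tested fix for the error: `run_exclusive`, the lock file path and the snapshot and backup paths in the usage lines are all hypothetical.

```shell
# Lock file shared by the dedupe job and the backup job (hypothetical path).
LOCK=${LOCK:-/var/lock/btrfs-maintenance.lock}

# Run a command while holding an exclusive lock on $LOCK. A second caller
# using the same lock file blocks until the first command has finished.
run_exclusive() {
    flock "$LOCK" "$@"
}

# Usage (paths are examples only):
# run_exclusive duperemove -dr /mnt/files
# run_exclusive sh -c 'btrfs send /mnt/files/.snap | ssh backuphost btrfs receive /backup'
```

This only serializes the two jobs. It does nothing about the OOM errors mentioned above.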

@mac-linux-free

The problem still exists on kernel 4.2.3 and btrfs-progs 4.2.2... I have many snapshots and run duperemove with the -x switch. Is that the right way to do it, or do I have to run duperemove on the whole pool?

@mac-linux-free

Perhaps I should dedup on the pool level and not at the mounted subvol? (/mnt/btrfs/files instead of /mnt/files) ... I'm testing.

@ddiss

ddiss commented Feb 11, 2016

I raised a duplicate bug for this issue at:
https://bugzilla.suse.com/show_bug.cgi?id=966259

Filipe mentioned that this was fixed in the 4.3 kernel via:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d6589101b67a55107652050dfbf414403a93e351
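
To see whether a machine is running a kernel at or above 4.3, a version comparison along these lines may help. It is a sketch only: `version_at_least` is a hypothetical helper that relies on GNU `sort -V`, and distribution kernels may carry backported fixes (or lack them) regardless of the version number, so the result is a first hint, not proof.

```shell
# Succeed if version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) option.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strip the distribution suffix, e.g. 3.16.0-33-generic becomes 3.16.0.
kernel=$(uname -r | cut -d- -f1)

if version_at_least "$kernel" 4.3; then
    echo "kernel $kernel is 4.3 or newer"
else
    echo "kernel $kernel is older than 4.3"
fi
```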

@markfasheh
Owner

Closing as this was fixed in the upstream kernel, thanks to Filipe Manana.

@samcv

samcv commented Mar 25, 2018

This does not seem to be fixed. At least I just encountered it. I ran duperemove and now send causes OOM which then causes a kernel panic after it kills every single program and cannot kill any more. Running rebalance and hopefully that will get things working again.

@ddiss

ddiss commented Mar 25, 2018

@samcv which kernel version are you using? I'd suggest raising the issue on the linux-btrfs@vger.kernel.org mailing list if you're hitting it with a recent kernel.

@samcv

samcv commented Mar 25, 2018

@ddiss I am using 4.15.10

I just did a full rebalance and now send/receive works fine.
