
Unable to recv stream from 0.7.0-rc3 on 0.6.5.8, stuck on 100% CPU #5999

Closed
aither64 opened this issue Apr 10, 2017 · 7 comments

@aither64 commented Apr 10, 2017

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Scientific Linux |
| Distribution Version | 6 |
| Linux Kernel | 2.6.32-042stab120.20 (OpenVZ) |
| Architecture | x86_64 |
| ZFS Version | 0.7.0-rc3_154_g7b0dc2a and 0.6.5.8-1 |
| SPL Version | 0.7.0-rc3_7_gbf8abea and 0.6.5.8-1 |

Describe the problem you're observing

It's not possible to receive a stream created with ZFS 0.7 on ZFS 0.6. zfs recv writes most (or all) of the data, as I can see the used property increasing. When it should finish, it instead keeps spinning at 100% CPU in the R state. The process is stuck and cannot be killed, even with SIGKILL.

Describe how to reproduce the problem

First, with ZFS 0.7:

zfs send src-tank@snapshot > stream.dat

Then with ZFS 0.6:

zfs recv -F dst-tank < stream.dat

For example, the stream in 50.dat.gz was created with ZFS 0.7, and every time I try to receive it on ZFS 0.6, the problem occurs:

[root@node1-zfs.dev1.prg.vpsfree.cz]
 ~ # zcat 50.dat.gz | zfs recv -v -F vz/test/50
receiving full stream of vz/private/50@test into vz/test/50@test

This process never finishes.

Include any warning/errors/backtraces from the system logs

No error is shown, the zfs recv process is just stuck.

[root@node1-zfs.dev1.prg.vpsfree.cz]
 ~ # for i in {1..5} ; do cat /proc/8599/stack ; echo '---' ; sleep 2 ; done
[<ffffffff81079a95>] __cond_resched+0x65/0x70
[<ffffffffa030e899>] dbuf_find+0x59/0x1a0 [zfs]
[<ffffffffa0311d09>] __dbuf_hold_impl+0xb9/0x4f0 [zfs]
[<ffffffffa03121bd>] dbuf_hold_impl+0x7d/0xb0 [zfs]
[<ffffffffa032bc10>] dnode_next_offset_level+0xe0/0x3d0 [zfs]
[<ffffffffa032e66e>] dnode_next_offset+0x10e/0x230 [zfs]
[<ffffffffa031c1f5>] dmu_object_next+0x45/0x60 [zfs]
[<ffffffffa0322d27>] restore_freeobjects+0x57/0xb0 [zfs]
[<ffffffffa0323a94>] dmu_recv_stream+0x664/0xb50 [zfs]
[<ffffffffa0399e80>] zfs_ioc_recv+0x210/0xbf0 [zfs]
[<ffffffffa03994b5>] zfsdev_ioctl+0x4c5/0x500 [zfs]
[<ffffffff811d6892>] vfs_ioctl+0x22/0xa0
[<ffffffff811d6a34>] do_vfs_ioctl+0x84/0x5b0
[<ffffffff811d6faf>] sys_ioctl+0x4f/0x80
[<ffffffff8100b1a2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffff81079a95>] __cond_resched+0x65/0x70
[<ffffffffa030e899>] dbuf_find+0x59/0x1a0 [zfs]
[<ffffffffa0311d09>] __dbuf_hold_impl+0xb9/0x4f0 [zfs]
[<ffffffffa03121bd>] dbuf_hold_impl+0x7d/0xb0 [zfs]
[<ffffffffa032bc10>] dnode_next_offset_level+0xe0/0x3d0 [zfs]
[<ffffffffa032e66e>] dnode_next_offset+0x10e/0x230 [zfs]
[<ffffffffa031c1f5>] dmu_object_next+0x45/0x60 [zfs]
[<ffffffffa0322d27>] restore_freeobjects+0x57/0xb0 [zfs]
[<ffffffffa0323a94>] dmu_recv_stream+0x664/0xb50 [zfs]
[<ffffffffa0399e80>] zfs_ioc_recv+0x210/0xbf0 [zfs]
[<ffffffffa03994b5>] zfsdev_ioctl+0x4c5/0x500 [zfs]
[<ffffffff811d6892>] vfs_ioctl+0x22/0xa0
[<ffffffff811d6a34>] do_vfs_ioctl+0x84/0x5b0
[<ffffffff811d6faf>] sys_ioctl+0x4f/0x80
[<ffffffff8100b1a2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffff81079a95>] __cond_resched+0x65/0x70
[<ffffffff811aa8b7>] kmem_cache_alloc_node_trace+0x127/0x220
[<ffffffff811aaa85>] __kmalloc_node+0x65/0x70
[<ffffffffa02a8b8e>] spl_kmem_zalloc+0x8e/0x190 [spl]
[<ffffffffa031218f>] dbuf_hold_impl+0x4f/0xb0 [zfs]
[<ffffffffa032bc10>] dnode_next_offset_level+0xe0/0x3d0 [zfs]
[<ffffffffa032e66e>] dnode_next_offset+0x10e/0x230 [zfs]
[<ffffffffa031c1f5>] dmu_object_next+0x45/0x60 [zfs]
[<ffffffffa0322d27>] restore_freeobjects+0x57/0xb0 [zfs]
[<ffffffffa0323a94>] dmu_recv_stream+0x664/0xb50 [zfs]
[<ffffffffa0399e80>] zfs_ioc_recv+0x210/0xbf0 [zfs]
[<ffffffffa03994b5>] zfsdev_ioctl+0x4c5/0x500 [zfs]
[<ffffffff811d6892>] vfs_ioctl+0x22/0xa0
[<ffffffff811d6a34>] do_vfs_ioctl+0x84/0x5b0
[<ffffffff811d6faf>] sys_ioctl+0x4f/0x80
[<ffffffff8100b1a2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffff81079a95>] __cond_resched+0x65/0x70
[<ffffffff811aa8b7>] kmem_cache_alloc_node_trace+0x127/0x220
[<ffffffff811aaa85>] __kmalloc_node+0x65/0x70
[<ffffffffa02a8b8e>] spl_kmem_zalloc+0x8e/0x190 [spl]
[<ffffffffa031218f>] dbuf_hold_impl+0x4f/0xb0 [zfs]
[<ffffffffa032bc10>] dnode_next_offset_level+0xe0/0x3d0 [zfs]
[<ffffffffa032e66e>] dnode_next_offset+0x10e/0x230 [zfs]
[<ffffffffa031c1f5>] dmu_object_next+0x45/0x60 [zfs]
[<ffffffffa0322d27>] restore_freeobjects+0x57/0xb0 [zfs]
[<ffffffffa0323a94>] dmu_recv_stream+0x664/0xb50 [zfs]
[<ffffffffa0399e80>] zfs_ioc_recv+0x210/0xbf0 [zfs]
[<ffffffffa03994b5>] zfsdev_ioctl+0x4c5/0x500 [zfs]
[<ffffffff811d6892>] vfs_ioctl+0x22/0xa0
[<ffffffff811d6a34>] do_vfs_ioctl+0x84/0x5b0
[<ffffffff811d6faf>] sys_ioctl+0x4f/0x80
[<ffffffff8100b1a2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffff81079a95>] __cond_resched+0x65/0x70
[<ffffffffa030e899>] dbuf_find+0x59/0x1a0 [zfs]
[<ffffffffa0311d09>] __dbuf_hold_impl+0xb9/0x4f0 [zfs]
[<ffffffffa03121bd>] dbuf_hold_impl+0x7d/0xb0 [zfs]
[<ffffffffa032bc10>] dnode_next_offset_level+0xe0/0x3d0 [zfs]
[<ffffffffa032e66e>] dnode_next_offset+0x10e/0x230 [zfs]
[<ffffffffa031c1f5>] dmu_object_next+0x45/0x60 [zfs]
[<ffffffffa0322d27>] restore_freeobjects+0x57/0xb0 [zfs]
[<ffffffffa0323a94>] dmu_recv_stream+0x664/0xb50 [zfs]
[<ffffffffa0399e80>] zfs_ioc_recv+0x210/0xbf0 [zfs]
[<ffffffffa03994b5>] zfsdev_ioctl+0x4c5/0x500 [zfs]
[<ffffffff811d6892>] vfs_ioctl+0x22/0xa0
[<ffffffff811d6a34>] do_vfs_ioctl+0x84/0x5b0
[<ffffffff811d6faf>] sys_ioctl+0x4f/0x80
[<ffffffff8100b1a2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---

On another system with a different kernel (2.6.32-042stab120.11), it looks like this:

[root@node12.prg.vpsfree.cz]
 ~ # for i in {1..5} ; do cat /proc/527723/stack ; echo '---' ; sleep 2 ; done    
[<ffffffff81079a95>] __cond_resched+0x65/0x70
[<ffffffffa02f7994>] dbuf_find+0x154/0x1a0 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffff81079a95>] __cond_resched+0x65/0x70
[<ffffffffa02f7994>] dbuf_find+0x154/0x1a0 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffffa02fb1d0>] dbuf_hold_impl+0x90/0xb0 [zfs]
[<ffffffffa0314c10>] dnode_next_offset_level+0xe0/0x3d0 [zfs]
[<ffffffffa031766e>] dnode_next_offset+0x10e/0x230 [zfs]
[<ffffffffa03051f5>] dmu_object_next+0x45/0x60 [zfs]
[<ffffffffa030bd27>] restore_freeobjects+0x57/0xb0 [zfs]
[<ffffffffa030ca94>] dmu_recv_stream+0x664/0xb50 [zfs]
[<ffffffffa0382e80>] zfs_ioc_recv+0x210/0xbf0 [zfs]
[<ffffffffa03824b5>] zfsdev_ioctl+0x4c5/0x500 [zfs]
[<ffffffff811d6892>] vfs_ioctl+0x22/0xa0
[<ffffffff811d6a34>] do_vfs_ioctl+0x84/0x5b0
[<ffffffff811d6faf>] sys_ioctl+0x4f/0x80
[<ffffffff8100b1a2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffff81079a95>] __cond_resched+0x65/0x70
[<ffffffffa02f7994>] dbuf_find+0x154/0x1a0 [zfs]
[<ffffffff811aacdb>] kfree+0x5b/0x250
[<ffffffffffffffff>] 0xffffffffffffffff
---
@loli10K (Contributor) commented Apr 10, 2017

restore_freeobjects

possible duplicate of #5699?

@aither64 (Author) commented Apr 12, 2017

Yes, it seems to be the same issue. So I followed OpenZFS 6393 to commit e6d3a84, which is in 0.7.0. Could the receive_freeobjects fix (as stated in illumos-discuss) be backported to ZoL 0.6.5.10? In 0.6.5, the function is indeed named restore_freeobjects.

I tried to extract the fix from 6393 and came up with this patch. It seems to work for me, recv finishes correctly. I don't know if it has some other consequences though.

@loli10K (Contributor) commented Apr 12, 2017

Could the fix of receive_freeobjects (as stated in illumos-discuss) be backported to ZoL 0.6.5.10?

That fix (which doesn't seem to work as intended, see #5699 (comment)) is needed on the sending side, not the receiving end: backporting to 0.6.5.10 would not help in this case.

@Jontem commented Jul 3, 2017

Hi!

I have the same problem, with the latest FreeNAS (FreeBSD 11-STABLE) as the sending side and Ubuntu 16.04 with the latest stable ZoL release as the receiving side.

I've read this issue and #5699 a couple of times, but I'm not totally clear on whether there is a way of opting out of free records. This commit seems to be related: freenas/os@d29ea66

According to this thread, it seems unlikely, especially this comment: https://illumos.topicbox.com/groups/discuss/discussions/T03e37d6ecdac991f-Mbe84d215678afc7bd84d9a68

I've tested on 0.7.0-rc4 and it works. Is there a release date planned for the stable 0.7? I've searched but couldn't find one.

@behlendorf (Contributor) commented Jul 3, 2017

I've created a project for 0.7.0-rc5; the only major change outstanding is finalizing the safeimport (multi-modifier protection) code.

@gmelikov (Member) commented Aug 9, 2017

The 0.7 branch is stable now; closing as a duplicate of #5699.

@gmelikov closed this Aug 9, 2017
6 participants: @behlendorf, @gmelikov, @loli10K, @aither64, @Jontem, and others