Test case: zpool_destroy_001_pos #6145

Closed
behlendorf opened this issue May 19, 2017 · 7 comments
Labels
Component: Test Suite (Indicates an issue with the test framework or a test case), Status: Stale (No recent activity for issue)

Comments

@behlendorf (Contributor)

System information

Type Version/Name
Distribution Name Amazon Linux
Distribution Version AMI release 2017.03
Linux Kernel 4.9.27-14.31.amzn1.x86_64
Architecture x86_64
ZFS Version zfs-0.7.0-rc4-17-ga32df59
SPL Version 0.7.0-rc4

Describe the problem you're observing

Reliable deadlock when running zpool_destroy_001_pos. This test case creates a zpool layered on a ZVOL, which is known to be problematic. Commits 5559ba0 and 0778358 were designed to allow this configuration and appear to work reliably on other platforms, but not in this environment.
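
For illustration, the pool-on-ZVOL layering the test exercises looks roughly like the following (a minimal sketch; pool and device names are examples rather than the test's exact variables):

    # Ordinary pool on a partition, then a sparse volume inside it.
    zpool create tank1 /dev/loop0p2
    zfs create -s -V 150m tank1/vol
    # Second pool layered on top of the ZVOL -- this is the problematic step.
    zpool create tank2 /dev/zvol/tank1/vol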

Describe how to reproduce the problem

  • Boot an m3.large instance in EC2 running the latest Amazon Linux AMI.
  • Enable support for partitions on loopback devices and reboot.
    echo "options loop max_part=15" >/etc/modprobe.d/loop.conf
  • Update the AMI and install the needed packages to build ZFS.
  • Build the spl and zfs source with debugging enabled (master branch); a consolidated command sketch for these setup steps follows this list.
  • Run sudo ./scripts/zfs-helpers.sh -iv to install the zfs udev rules.
  • Run sudo ./scripts/zfs.sh to load the zfs kernel modules.
  • Modify the zpool_destroy_001_pos.ksh test case to run the original non-Linux version of the test:
diff --git a/tests/zfs-tests/tests/functional/cli_root/zpool_destroy/zpool_destroy_001_pos.ksh b/tests/zfs-tests/tests/functional/cli_root/zpool_destroy/zpool_destroy_001_pos.ksh
index 428765e..200ea76 100755
--- a/tests/zfs-tests/tests/functional/cli_root/zpool_destroy/zpool_destroy_001_pos.ksh
+++ b/tests/zfs-tests/tests/functional/cli_root/zpool_destroy/zpool_destroy_001_pos.ksh
@@ -73,14 +73,14 @@ partition_disk $SLICE_SIZE $DISK 2
 
 create_pool "$TESTPOOL" "${DISK}${SLICE_PREFIX}${SLICE0}"
 
-if is_linux; then
-       # Layering a pool on a zvol can deadlock and isn't supported.
-       create_pool "$TESTPOOL2" "${DISK}${SLICE_PREFIX}${SLICE1}"
-else
+#if is_linux; then
+#      # Layering a pool on a zvol can deadlock and isn't supported.
+#      create_pool "$TESTPOOL2" "${DISK}${SLICE_PREFIX}${SLICE1}"
+#else
        create_pool "$TESTPOOL1" "${DISK}${SLICE_PREFIX}${SLICE1}"
        log_must zfs create -s -V $VOLSIZE $TESTPOOL1/$TESTVOL
        create_pool "$TESTPOOL2" "${ZVOL_DEVDIR}/$TESTPOOL1/$TESTVOL"
-fi
+#fi
 
 typeset -i i=0
  • Run ./scripts/zfs-tests.sh -vx -t ./tests/zfs-tests/tests/functional/cli_root/zpool_destroy/zpool_destroy_001_pos.ksh.
  • System should deadlock immediately.
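
For reference, the setup steps above (from enabling loopback partitions through loading the modules) can be scripted roughly as follows. This is a sketch only; the yum package list and configure options are assumptions and may need adjusting for the AMI:

    # Allow partitions on loopback devices, then reboot.
    echo "options loop max_part=15" | sudo tee /etc/modprobe.d/loop.conf
    sudo reboot

    # After reboot: update and install build dependencies (package list is a guess).
    sudo yum update -y
    sudo yum install -y gcc make autoconf automake libtool kernel-devel libuuid-devel zlib-devel

    # Build spl and zfs (master branch) with debugging enabled.
    (cd spl && ./autogen.sh && ./configure --enable-debug && make -s -j$(nproc))
    (cd zfs && ./autogen.sh && ./configure --enable-debug --with-spl=../spl && make -s -j$(nproc))

    # Install the udev rules and load the modules.
    (cd zfs && sudo ./scripts/zfs-helpers.sh -iv && sudo ./scripts/zfs.sh)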

Include any warning/errors/backtraces from the system logs

[   79.571848] ZFS: Loaded module v0.7.0-rc4_17_ga32df59 (DEBUG mode), ZFS pool version 5000, ZFS filesystem version 5
[   88.306058] loop: module loaded
[   88.776354]  loop0: p1
[   89.212604]  loop0: p1 p2
[   89.596101] SPL: using hostid 0x00000000
[   90.057163]  zd0: p1 p9
[  123.868062] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 33s!
[  123.875947] Showing busy workqueues and worker pools:
[  123.879551] workqueue events: flags=0x0
[  123.882503]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  123.886495]     pending: vmstat_shepherd, push_to_pool
[  123.894217] workqueue events_power_efficient: flags=0x80
[  123.898104]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  123.902096]     pending: neigh_periodic_work
[  123.922839] workqueue kblockd: flags=0x18
[  123.926263]   pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=2/256
[  123.930255]     pending: blk_mq_timeout_work, blk_mq_timeout_work
[  123.939473] workqueue vmstat: flags=0xc
[  123.942804]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  123.946789]     pending: vmstat_update
[  123.953876] workqueue ipv6_addrconf: flags=0x40008
[  123.957801]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
[  123.961792]     pending: addrconf_verify_work [ipv6]
[  149.216056] INFO: rcu_sched self-detected stall on CPU
[  149.216056] 	0-...: (14736 ticks this GP) idle=0a9/140000000000001/0 softirq=8042/8042 fqs=7126 
[  149.216056] 	 (t=14750 jiffies g=1102 c=1101 q=806)
[  149.216056] Task dump for CPU 0:
[  149.216056] blkid           R  running task        0  4497   3136 0x00000008
[  149.216056]  ffff8801e7a03ce8 ffffffff8109d6a7 0000000000000000 0000000000000000
[  149.216056]  ffff8801e7a03d00 ffffffff8109fff9 ffffffff81a54100 ffff8801e7a03d30
[  149.216056]  ffffffff81173a40 ffff8801e7a18940 ffffffff81a54100 0000000000000000
[  149.216056] Call Trace:
[  149.216056]  <IRQ> 
[  149.216056]  [<ffffffff8109d6a7>] sched_show_task+0xd7/0x140
[  149.216056]  [<ffffffff8109fff9>] dump_cpu_task+0x39/0x40
[  149.216056]  [<ffffffff81173a40>] rcu_dump_cpu_stacks+0x80/0xbc
[  149.216056]  [<ffffffff810d12bf>] rcu_check_callbacks+0x6ef/0x850
[  149.216056]  [<ffffffff810a0a51>] ? account_system_time+0x81/0x110
[  149.216056]  [<ffffffff810a0ce0>] ? account_process_tick+0x60/0x170
[  149.216056]  [<ffffffff810e60b0>] ? tick_sched_do_timer+0x30/0x30
[  149.216056]  [<ffffffff810d770f>] update_process_times+0x2f/0x60
[  149.216056]  [<ffffffff810e5af6>] tick_sched_handle.isra.13+0x36/0x50
[  149.216056]  [<ffffffff810e60ed>] tick_sched_timer+0x3d/0x70
[  149.216056]  [<ffffffff810d8296>] __hrtimer_run_queues+0xd6/0x230
[  149.216056]  [<ffffffff810d8718>] hrtimer_interrupt+0xa8/0x1a0
[  149.216056]  [<ffffffff81021b6f>] xen_timer_interrupt+0x1f/0x30
[  149.216056]  [<ffffffff810c41cc>] __handle_irq_event_percpu+0x3c/0x1a0
[  149.216056]  [<ffffffff810c4353>] handle_irq_event_percpu+0x23/0x60
[  149.216056]  [<ffffffff810c7f5a>] handle_percpu_irq+0x3a/0x50
[  149.216056]  [<ffffffff810c3502>] generic_handle_irq+0x22/0x30
[  149.216056]  [<ffffffff813972f8>] evtchn_2l_handle_events+0x238/0x240
[  149.216056]  [<ffffffff81394b63>] __xen_evtchn_do_upcall+0x43/0x80
[  149.216056]  [<ffffffff813968c0>] xen_evtchn_do_upcall+0x30/0x50
[  149.216056]  [<ffffffff81534732>] xen_hvm_callback_vector+0x82/0x90
[  149.216056]  <EOI> 
[  149.216056]  [<ffffffff81533400>] ? _raw_spin_lock+0x10/0x30
[  149.216056]  [<ffffffffa056c4f6>] ? zvol_open+0xa6/0x3a0 [zfs]
[  149.216056]  [<ffffffff81234334>] __blkdev_get+0xc4/0x420
[  149.216056]  [<ffffffff8123391e>] ? bdget+0x3e/0x130
[  149.216056]  [<ffffffff81234506>] __blkdev_get+0x296/0x420
[  149.216056]  [<ffffffff8123483f>] blkdev_get+0x1af/0x300
[  149.216056]  [<ffffffff81234a3b>] blkdev_open+0x5b/0x70
[  149.216056]  [<ffffffff811f6f93>] do_dentry_open+0x213/0x310
[  149.216056]  [<ffffffff812349e0>] ? blkdev_get_by_dev+0x50/0x50
[  149.216056]  [<ffffffff811f82af>] vfs_open+0x4f/0x70
[  149.216056]  [<ffffffff8120408b>] ? may_open+0x9b/0x100
[  149.216056]  [<ffffffff81207459>] path_openat+0x529/0x1300
[  149.216056]  [<ffffffff811b7402>] ? page_add_file_rmap+0x52/0x150
[  149.216056]  [<ffffffff81176976>] ? filemap_map_pages+0x366/0x3a0
[  149.216056]  [<ffffffff81209dce>] do_filp_open+0x7e/0xd0
[  149.216056]  [<ffffffff8102a649>] ? __switch_to+0x1f9/0x5d0
[  149.216056]  [<ffffffff81217bc0>] ? __alloc_fd+0xb0/0x170
[  149.216056]  [<ffffffff811f8655>] do_sys_open+0x115/0x1f0
[  149.216056]  [<ffffffff811f874e>] SyS_open+0x1e/0x20
[  149.216056]  [<ffffffff81533677>] entry_SYSCALL_64_fastpath+0x1a/0xa9
[  242.652110] INFO: task lt-zpool:4363 blocked for more than 120 seconds.
[  242.658901]       Tainted: P           OE   4.9.27-14.31.amzn1.x86_64 #1
[  242.665595] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.673689] lt-zpool        D    0  4363   3372 0x00000000
[  242.680394]  ffff8801e227e180 0000000000000000 ffff8801e662bb00 ffff8801d2b98000
[  242.690713]  ffff8801e7a57bc0 ffffc90003b839f8 ffffffff8152ed4c ffffc90003b839c8
[  242.700941]  ffffffff8152f45e ffffc90003b83a20 ffff8801d2b98000 ffff8801e5bb96dc
[  242.711462] Call Trace:
[  242.714602]  [<ffffffff8152ed4c>] ? __schedule+0x23c/0x680
[  242.720233]  [<ffffffff8152f45e>] ? schedule_preempt_disabled+0xe/0x10
[  242.726812]  [<ffffffff8152f1c6>] schedule+0x36/0x80
[  242.732045]  [<ffffffff8152f45e>] schedule_preempt_disabled+0xe/0x10
[  242.738617]  [<ffffffff81530cb5>] __mutex_lock_slowpath+0x95/0x110
[  242.744922]  [<ffffffff81530d47>] mutex_lock+0x17/0x27
[  242.750449]  [<ffffffff812342cf>] __blkdev_get+0x5f/0x420
[  242.756244]  [<ffffffff8123488f>] blkdev_get+0x1ff/0x300
[  242.761600]  [<ffffffff8121a2c4>] ? mntput+0x24/0x40
[  242.766930]  [<ffffffff8120356e>] ? path_put+0x1e/0x30
[  242.772304]  [<ffffffff81234bc3>] blkdev_get_by_path+0x53/0x90
[  242.778509]  [<ffffffffa0501f24>] vdev_disk_open+0x3b4/0x420 [zfs]
[  242.784856]  [<ffffffff812152ca>] ? iput+0x8a/0x200
[  242.790043]  [<ffffffffa04feaae>] vdev_open+0x12e/0x730 [zfs]
[  242.796170]  [<ffffffffa04ff105>] vdev_open_children+0x55/0x170 [zfs]
[  242.802863]  [<ffffffffa0511540>] vdev_root_open+0x50/0x110 [zfs]
[  242.809212]  [<ffffffffa04feaae>] vdev_open+0x12e/0x730 [zfs]
[  242.815298]  [<ffffffffa04ff282>] vdev_create+0x22/0xa0 [zfs]
[  242.821367]  [<ffffffffa03f449c>] ? zfs_allocatable_devs+0x5c/0x80 [zcommon]
[  242.828408]  [<ffffffffa04e75f5>] spa_create+0x435/0xaa0 [zfs]
[  242.834551]  [<ffffffffa037e725>] ? nvlist_add_uint64+0x35/0x40 [znvpair]
[  242.841280]  [<ffffffffa052ab61>] ? zfs_fill_zplprops_impl+0x231/0x410 [zfs]
[  242.848309]  [<ffffffffa053014e>] zfs_ioc_pool_create+0x12e/0x230 [zfs]
[  242.855010]  [<ffffffffa0307eff>] ? strdup+0x3f/0x60 [spl]
[  242.860858]  [<ffffffffa0530e39>] zfsdev_ioctl+0x4d9/0x600 [zfs]
[  242.867149]  [<ffffffff8120ccd6>] do_vfs_ioctl+0x96/0x5b0
[  242.872660]  [<ffffffff8106100a>] ? __do_page_fault+0x24a/0x4a0
[  242.878770]  [<ffffffff8120d269>] SyS_ioctl+0x79/0x90
[  242.884182]  [<ffffffff81533677>] entry_SYSCALL_64_fastpath+0x1a/0xa9
behlendorf added the Component: Test Suite label May 19, 2017
@bprotopopov (Contributor)

Not sure if I can get my hands on an m3.large instance, so I will try other environments. Looking at the stack traces above, I believe:

  • blkid takes bdev->bd_mutex, then the zvol-level mutex, and blocks
  • zpool takes spa_namespace_lock, then tries to take bdev->bd_mutex, and blocks

It seems there should be another thread that took the zvol-level lock and blocked (or did not release it?)

@behlendorf (Contributor, Author)

@bprotopopov thanks. It looks like Amazon's kernel is largely a stock 4.9.27 kernel with some additional drivers and a few small bug fixes; there doesn't appear to be any significant change in this area of the code. This should be reproducible elsewhere, but it does seem to be easy to reproduce in this environment.

Here's the full console log from another instance of the deadlock, issue-6145.txt. It looks like we've got three processes involved: blkid:7776, lt-zpool:7624, and losetup:7859.

@bprotopopov (Contributor)

@behlendorf
I followed the instructions above and ran:

[build@localhost zfs-bprotopopov]$ ./scripts/zfs-tests.sh -vvvx -t ./tests/zfs-tests/tests/functional/cli_root/zpool_destroy/zpool_destroy_001_pos.ksh

--- Cleanup ---
Removing pool(s):      
Removing dm(s):       
Removing loopback(s):  
Removing files(s):     

--- Configuration ---
Runfile:         /var/tmp/zfs-tests.16362.run
STF_TOOLS:       /home/build/code/git/zfs-bprotopopov/tests/test-runner
STF_SUITE:       /home/build/code/git/zfs-bprotopopov/tests/zfs-tests
STF_PATH:        /home/build/code/git/zfs-bprotopopov/bin
FILEDIR:         /var/tmp
FILES:           /var/tmp/file-vdev0 /var/tmp/file-vdev1 /var/tmp/file-vdev2
LOOPBACKS:       /dev/loop0 /dev/loop1 /dev/loop2 
DISKS:           loop0 loop1 loop2 
NUM_DISKS:       3
FILESIZE:        4G
Keep pool(s):    rpool
Missing util(s): pax 

/home/build/code/git/zfs-bprotopopov/tests/test-runner/cmd/test-runner.py  -c /var/tmp/zfs-tests.16362.run -i /home/build/code/git/zfs-bprotopopov/tests/zfs-tests
Test: /home/build/code/git/zfs-bprotopopov/tests/zfs-tests/tests/functional/cli_root/zpool_destroy/zpool_destroy_001_pos.ksh (run as root) [00:08] [PASS]

Results Summary
PASS	   1

Running Time:	00:00:08
Percent passed:	100.0%
Log directory:	/var/tmp/test_results/20170522T171406

[build@localhost zfs-bprotopopov]$ uname -a
Linux localhost.localdomain 4.11.2-1.el7.elrepo.x86_64 #1 SMP Sun May 21 19:31:34 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
[build@localhost zfs-bprotopopov]$ git status
# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#	modified:   tests/zfs-tests/tests/functional/cli_root/zpool_destroy/zpool_destroy_001_pos.ksh
#
no changes added to commit (use "git add" and/or "git commit -a")
[build@localhost zfs-bprotopopov]$ git diff
diff --git a/tests/zfs-tests/tests/functional/cli_root/zpool_destroy/zpool_destroy_001_pos.ksh b/tests/zfs-tests/tests/functional/cli_root
index 428765e..b41f916 100755
--- a/tests/zfs-tests/tests/functional/cli_root/zpool_destroy/zpool_destroy_001_pos.ksh
+++ b/tests/zfs-tests/tests/functional/cli_root/zpool_destroy/zpool_destroy_001_pos.ksh
@@ -73,14 +73,14 @@ partition_disk $SLICE_SIZE $DISK 2
 
 create_pool "$TESTPOOL" "${DISK}${SLICE_PREFIX}${SLICE0}"
 
-if is_linux; then
+#if is_linux; then
        # Layering a pool on a zvol can deadlock and isn't supported.
-       create_pool "$TESTPOOL2" "${DISK}${SLICE_PREFIX}${SLICE1}"
-else
+#      create_pool "$TESTPOOL2" "${DISK}${SLICE_PREFIX}${SLICE1}"
+#else
        create_pool "$TESTPOOL1" "${DISK}${SLICE_PREFIX}${SLICE1}"
        log_must zfs create -s -V $VOLSIZE $TESTPOOL1/$TESTVOL
        create_pool "$TESTPOOL2" "${ZVOL_DEVDIR}/$TESTPOOL1/$TESTVOL"
-fi
+#fi
 
 typeset -i i=0
 while (( i < ${#datasets[*]} )); do
[build@localhost zfs-bprotopopov]$ 
[build@localhost zfs-bprotopopov]$ uname -a
Linux localhost.localdomain 4.11.2-1.el7.elrepo.x86_64 #1 SMP Sun May 21 19:31:34 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

@bprotopopov (Contributor)

The merged log of the test:

[build@localhost zfs-bprotopopov]$ more /var/tmp/test_results/20170522T171406/zpool_destroy/zpool_destroy_001_pos.ksh/merged 
ASSERTION: 'zpool destroy <pool>' can destroy a specified pool.
SUCCESS: parted /dev/loop0 -s -- mklabel gpt
Error: Partition doesn't exist.
SUCCESS: parted /dev/loop0 -s -- mklabel gpt
SUCCESS: parted /dev/loop0 -s -- mkpart part0 1cyl 63cyl
SUCCESS: parted /dev/loop0 -s -- mkpart part1 63cyl 125cyl
SUCCESS: zpool create -f testpool loop0p1
SUCCESS: zpool create -f testpool1 loop0p2
SUCCESS: zfs create -s -V 150m testpool1/testvol
SUCCESS: zpool create -f testpool2 /dev/zvol/testpool1/testvol
SUCCESS: poolexists testpool
SUCCESS: zpool destroy testpool
SUCCESS: poolexists testpool exited 1
SUCCESS: poolexists testpool2
SUCCESS: zpool destroy testpool2
SUCCESS: poolexists testpool2 exited 1
NOTE: Performing local cleanup via log_onexit (cleanup)
SUCCESS: zfs destroy -f testpool1/testvol
SUCCESS: parted /dev/loop0 -s -- mklabel gpt
'zpool destroy <pool>' executes successfully
[build@localhost zfs-bprotopopov]$

@bprotopopov (Contributor)

Hi @behlendorf,
I ran the same test in a loop 10 times and all runs passed, using a VirtualBox-based VM with 4 cores and 2 GB of memory, kernel 4.11.2-1.el7.elrepo.x86_64.
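
For reference, looping the test can be done with something like the following (a sketch using the same invocation as above):

    for i in $(seq 1 10); do
        ./scripts/zfs-tests.sh -vvvx -t \
            ./tests/zfs-tests/tests/functional/cli_root/zpool_destroy/zpool_destroy_001_pos.ksh || break
    done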

behlendorf added a commit to behlendorf/zfs that referenced this issue May 25, 2017
Run zpool_destroy_001_pos 50 times with a 30 seconds timeout.

Requires-builders: test
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#6145
@behlendorf (Contributor, Author)

@bprotopopov thanks for trying to reproduce this. I haven't been able to reproduce the problem outside of Amazon's EC2 environment either. One possible way to tackle this would be to open a PR which enables zpool_destroy_001_pos and adds some additional debugging. That should reproduce the issue, since the buildbot runs everything in EC2.

As a proof of concept I've opened PR #6168, which runs zpool_destroy_001_pos 50 times. I've also added a trap to dump the internal ZFS debug log and the dmesg output; it can be extended with additional debugging. Feel free to lean on the buildbot if it's helpful for identifying the root cause.
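
For context, the kind of exit trap described would look roughly like this in the test suite's ksh style (a sketch, not the actual PR code; it assumes the ZFS debug kstat at /proc/spl/kstat/zfs/dbgmsg is available, which requires a debug build):

    # Dump the internal ZFS debug log and dmesg when the test exits,
    # so a run that deadlocks or times out leaves something useful behind.
    function dump_debug_state
    {
            echo "=== /proc/spl/kstat/zfs/dbgmsg ==="
            cat /proc/spl/kstat/zfs/dbgmsg
            echo "=== dmesg ==="
            dmesg
    }
    log_onexit dump_debug_state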

stale bot commented Aug 25, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale label Aug 25, 2020
stale bot closed this as completed Nov 23, 2020