concurrent invocations of ndctl can cause linux panic #96

Open
etsaur4 opened this issue Jun 6, 2019 · 4 comments

Comments

@etsaur4

commented Jun 6, 2019

Raising a discussion from the linux-nvdimm alias so that it can be tracked as a GitHub issue.

https://lists.01.org/pipermail/linux-nvdimm/2019-May/021385.html

The problem still exists in 5.2 RC2

The problem is fairly easy to reproduce, in as little as 10 minutes.
Run the following in parallel, for example in separate terminals.
In term #1, #3, #5, type:
while true; do ndctl create-namespace -m devdax -s 48G; done
In term #2, #4, #6, type:
while true; do ndctl destroy-namespace all -f; done

Even a simple invocation will eventually lead to a panic, though it can take hours. For example,
in term #1 run the script:
#!/bin/bash
while /bin/true
do
    ndctl destroy-namespace -f all
    date
    for R in $(ndctl list -R | jq -r ".[] | .dev")
    do
        for i in {1..10}
        do
            ndctl create-namespace -r "$R" -s 8g -m devdax
        done
    done
done
in term #2 type:
while /bin/true; do ndctl list; done

Running that same terminal #1 script in 2 separate terminals, thereby creating 2 concurrent destroy/create loops, will usually result in a panic within an hour.

@etsaur4

Author

commented Jun 7, 2019

Update: 5.2 RC2 plus patches, such as the one for issue 91, also exhibits the problem. The stack is the same as the one in the nvdimm alias.

[ 376.581650] CPU: 20 PID: 1950 Comm: kworker/u130:14 Not tainted 4.14.35-1923.el7uek.x86_64 #2
[ 376.591165] Hardware name: Oracle Corporation ORACLE SERVER X8-2/ASM, MB, X7-2, BIOS 51020101 05/07/2019
[ 376.601755] Workqueue: events_unbound async_run_entry_fn
[ 376.607683] task: ffff9e78fa63bd80 task.stack: ffffc2348fb74000
[ 376.614292] RIP: 0010:kernfs_find_ns+0x18/0xbf
[ 376.619250] RSP: 0018:ffffc2348fb77d20 EFLAGS: 00010246
[ 376.625081] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000ffffffff
[ 376.633045] RDX: 0000000000000000 RSI: ffffffffa8eb5ac1 RDI: 0000000000000000
[ 376.641010] RBP: ffffc2348fb77d40 R08: 0000000000000000 R09: ffff9e61f9f48000
[ 376.648973] R10: 000000000000005c R11: 00000000000000a6 R12: ffffffffa8eb5ac1
[ 376.656938] R13: 0000000000000000 R14: ffffffffa8eb5ac1 R15: ffff9e7905fad208
[ 376.664902] FS: 0000000000000000(0000) GS:ffff9e791ef00000(0000) knlGS:0000000000000000
[ 376.673933] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 376.680347] CR2: 0000000000000070 CR3: 000000156b40a002 CR4: 00000000007606e0
[ 376.688311] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 376.696273] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 376.704238] PKRU: 55555554
[ 376.707255] Call Trace:
[ 376.709987] kernfs_find_and_get_ns+0x31/0x52
[ 376.714848] sysfs_unmerge_group+0x1d/0x57
[ 376.719422] dpm_sysfs_remove+0x22/0x5c
[ 376.723706] device_del+0x5a/0x325
[ 376.727502] device_unregister+0x1a/0x58
[ 376.731886] nd_async_device_unregister+0x22/0x30 [libnvdimm]
[ 376.738299] async_run_entry_fn+0x3e/0x169
[ 376.742870] process_one_work+0x169/0x3a6
[ 376.747345] worker_thread+0x4d/0x3e5
[ 376.751434] kthread+0x105/0x138
[ 376.755035] ? rescuer_thread+0x380/0x375
[ 376.759510] ? kthread_bind+0x20/0x15
[ 376.763600] ret_from_fork+0x24/0x49
[ 376.767588] Code: 24 08 48 83 42 40 01 5b 41 5c 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 49 89 f6 41 55 49 89 d5 31 d2 41 54 53 <0f> b7 47 70 48 8b 5f 48 66 c1 e8 05 83 e0 01 4d 85 ed 0f b6 c8
[ 376.788686] RIP: kernfs_find_ns+0x18/0xbf RSP: ffffc2348fb77d20
[ 376.795293] CR2: 0000000000000070

@djbw

Member

commented Jun 11, 2019

I'm able to readily reproduce this. Concurrent ndctl seems to be triggering double-free (double device-unregistration events). Still looking to narrow down all the scenarios where double unregistration occurs.

ColinIanKing pushed a commit to ColinIanKing/linux-next-mirror that referenced this issue Jun 20, 2019

libnvdimm/bus: Prevent duplicate device_unregister() calls
A multithreaded namespace creation/destruction stress test currently
fails with signatures like the following:

    sysfs group 'power' not found for kobject 'dax1.1'
    RIP: 0010:sysfs_remove_group+0x76/0x80
    Call Trace:
     device_del+0x73/0x370
     device_unregister+0x16/0x50
     nd_async_device_unregister+0x1e/0x30 [libnvdimm]
     async_run_entry_fn+0x39/0x160
     process_one_work+0x23c/0x5e0
     worker_thread+0x3c/0x390

    BUG: kernel NULL pointer dereference, address: 0000000000000020
    RIP: 0010:klist_put+0x1b/0x6c
    Call Trace:
     klist_del+0xe/0x10
     device_del+0x8a/0x2c9
     ? __switch_to_asm+0x34/0x70
     ? __switch_to_asm+0x40/0x70
     device_unregister+0x44/0x4f
     nd_async_device_unregister+0x22/0x2d [libnvdimm]
     async_run_entry_fn+0x47/0x15a
     process_one_work+0x1a2/0x2eb
     worker_thread+0x1b8/0x26e

Use the kill_device() helper to atomically resolve the race of multiple
threads issuing kill, device_unregister(), requests.

Reported-by: Jane Chu <jane.chu@oracle.com>
Reported-by: Erwin Tsaur <erwin.tsaur@oracle.com>
Fixes: 4d88a97 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...")
Cc: <stable@vger.kernel.org>
Link: pmem/ndctl#96
Tested-by: Tested-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

avalluri added a commit to avalluri/pmem-CSI that referenced this issue Jul 11, 2019

DeviceManager/ndctl: Synchronize all ndctl calls
As ndctl does not support concurrent volume creation and deletions we have to
synchronize all ndctl calls in our layer

For reference: pmem/ndctl#96

avalluri added a commit to avalluri/pmem-CSI that referenced this issue Jul 12, 2019

DeviceManager/ndctl: Synchronize all ndctl calls
As ndctl does not support concurrent volume creation and deletions we have to
synchronize all ndctl calls in our layer

For reference: pmem/ndctl#96

This also fixes the deadlock introduced by recent change
6a8d1ee FlushDevice() itself handles the
mutex(volumeMutex) locking and unlocking, so callers(LVM,ndctl) should not lock
explicitly while calling this utility method.
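
For callers that drive ndctl from shell scripts rather than from a Go layer, the same "serialize every ndctl call" workaround can be sketched with flock(1). This is only an illustration, not what pmem-CSI does: it assumes flock from util-linux is installed, and the names `ndctl_serialized`, `NDCTL_LOCK` and `NDCTL_BIN`, as well as the lock file path, are hypothetical.

```shell
#!/bin/bash
# Hypothetical lock file; any path that every cooperating caller can open works.
NDCTL_LOCK="${NDCTL_LOCK:-/tmp/ndctl.lock}"
# Overridable so the wrapper can be exercised on a machine without ndctl.
NDCTL_BIN="${NDCTL_BIN:-ndctl}"

# Run one ndctl command while holding an exclusive lock on $NDCTL_LOCK.
# flock(1) blocks until the lock is free, runs the command, and the lock
# is released when the command exits. The command's exit status is returned.
ndctl_serialized() {
    flock "$NDCTL_LOCK" "$NDCTL_BIN" "$@"
}
```

With this in place, the loops from the reproducer would call, for example, `ndctl_serialized create-namespace -m devdax -s 48G` instead of invoking ndctl directly. The lock only coordinates callers that go through the wrapper, so it is a userspace workaround and not a substitute for the kernel fix referenced in this issue.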

ColinIanKing pushed a commit to ColinIanKing/linux-next-mirror that referenced this issue Jul 25, 2019

libnvdimm/bus: Prevent duplicate device_unregister() calls
A multithreaded namespace creation/destruction stress test currently
fails with signatures like the following:

    sysfs group 'power' not found for kobject 'dax1.1'
    RIP: 0010:sysfs_remove_group+0x76/0x80
    Call Trace:
     device_del+0x73/0x370
     device_unregister+0x16/0x50
     nd_async_device_unregister+0x1e/0x30 [libnvdimm]
     async_run_entry_fn+0x39/0x160
     process_one_work+0x23c/0x5e0
     worker_thread+0x3c/0x390

    BUG: kernel NULL pointer dereference, address: 0000000000000020
    RIP: 0010:klist_put+0x1b/0x6c
    Call Trace:
     klist_del+0xe/0x10
     device_del+0x8a/0x2c9
     ? __switch_to_asm+0x34/0x70
     ? __switch_to_asm+0x40/0x70
     device_unregister+0x44/0x4f
     nd_async_device_unregister+0x22/0x2d [libnvdimm]
     async_run_entry_fn+0x47/0x15a
     process_one_work+0x1a2/0x2eb
     worker_thread+0x1b8/0x26e

Use the kill_device() helper to atomically resolve the race of multiple
threads issuing kill, device_unregister(), requests.

Reported-by: Jane Chu <jane.chu@oracle.com>
Reported-by: Erwin Tsaur <erwin.tsaur@oracle.com>
Fixes: 4d88a97 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...")
Cc: <stable@vger.kernel.org>
Link: pmem/ndctl#96
Tested-by: Tested-by: Jane Chu <jane.chu@oracle.com>
Link: https://lore.kernel.org/r/156341207846.292348.10435719262819764054.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

woodsts pushed a commit to woodsts/linux-stable that referenced this issue Jul 31, 2019

libnvdimm/bus: Prevent duplicate device_unregister() calls
commit 8aac0e2 upstream.

A multithreaded namespace creation/destruction stress test currently
fails with signatures like the following:

    sysfs group 'power' not found for kobject 'dax1.1'
    RIP: 0010:sysfs_remove_group+0x76/0x80
    Call Trace:
     device_del+0x73/0x370
     device_unregister+0x16/0x50
     nd_async_device_unregister+0x1e/0x30 [libnvdimm]
     async_run_entry_fn+0x39/0x160
     process_one_work+0x23c/0x5e0
     worker_thread+0x3c/0x390

    BUG: kernel NULL pointer dereference, address: 0000000000000020
    RIP: 0010:klist_put+0x1b/0x6c
    Call Trace:
     klist_del+0xe/0x10
     device_del+0x8a/0x2c9
     ? __switch_to_asm+0x34/0x70
     ? __switch_to_asm+0x40/0x70
     device_unregister+0x44/0x4f
     nd_async_device_unregister+0x22/0x2d [libnvdimm]
     async_run_entry_fn+0x47/0x15a
     process_one_work+0x1a2/0x2eb
     worker_thread+0x1b8/0x26e

Use the kill_device() helper to atomically resolve the race of multiple
threads issuing kill, device_unregister(), requests.

Reported-by: Jane Chu <jane.chu@oracle.com>
Reported-by: Erwin Tsaur <erwin.tsaur@oracle.com>
Fixes: 4d88a97 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...")
Cc: <stable@vger.kernel.org>
Link: pmem/ndctl#96
Tested-by: Tested-by: Jane Chu <jane.chu@oracle.com>
Link: https://lore.kernel.org/r/156341207846.292348.10435719262819764054.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

gregkh added a commit to gregkh/linux that referenced this issue Aug 9, 2019

libnvdimm/bus: Prevent duplicate device_unregister() calls
commit 8aac0e2 upstream.

A multithreaded namespace creation/destruction stress test currently
fails with signatures like the following:

    sysfs group 'power' not found for kobject 'dax1.1'
    RIP: 0010:sysfs_remove_group+0x76/0x80
    Call Trace:
     device_del+0x73/0x370
     device_unregister+0x16/0x50
     nd_async_device_unregister+0x1e/0x30 [libnvdimm]
     async_run_entry_fn+0x39/0x160
     process_one_work+0x23c/0x5e0
     worker_thread+0x3c/0x390

    BUG: kernel NULL pointer dereference, address: 0000000000000020
    RIP: 0010:klist_put+0x1b/0x6c
    Call Trace:
     klist_del+0xe/0x10
     device_del+0x8a/0x2c9
     ? __switch_to_asm+0x34/0x70
     ? __switch_to_asm+0x40/0x70
     device_unregister+0x44/0x4f
     nd_async_device_unregister+0x22/0x2d [libnvdimm]
     async_run_entry_fn+0x47/0x15a
     process_one_work+0x1a2/0x2eb
     worker_thread+0x1b8/0x26e

Use the kill_device() helper to atomically resolve the race of multiple
threads issuing kill, device_unregister(), requests.

Reported-by: Jane Chu <jane.chu@oracle.com>
Reported-by: Erwin Tsaur <erwin.tsaur@oracle.com>
Fixes: 4d88a97 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...")
Cc: <stable@vger.kernel.org>
Link: pmem/ndctl#96
Tested-by: Tested-by: Jane Chu <jane.chu@oracle.com>
Link: https://lore.kernel.org/r/156341207846.292348.10435719262819764054.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>

mrchapp pushed a commit to mrchapp/linux that referenced this issue Aug 9, 2019

libnvdimm/bus: Prevent duplicate device_unregister() calls
commit 8aac0e2 upstream.

A multithreaded namespace creation/destruction stress test currently
fails with signatures like the following:

    sysfs group 'power' not found for kobject 'dax1.1'
    RIP: 0010:sysfs_remove_group+0x76/0x80
    Call Trace:
     device_del+0x73/0x370
     device_unregister+0x16/0x50
     nd_async_device_unregister+0x1e/0x30 [libnvdimm]
     async_run_entry_fn+0x39/0x160
     process_one_work+0x23c/0x5e0
     worker_thread+0x3c/0x390

    BUG: kernel NULL pointer dereference, address: 0000000000000020
    RIP: 0010:klist_put+0x1b/0x6c
    Call Trace:
     klist_del+0xe/0x10
     device_del+0x8a/0x2c9
     ? __switch_to_asm+0x34/0x70
     ? __switch_to_asm+0x40/0x70
     device_unregister+0x44/0x4f
     nd_async_device_unregister+0x22/0x2d [libnvdimm]
     async_run_entry_fn+0x47/0x15a
     process_one_work+0x1a2/0x2eb
     worker_thread+0x1b8/0x26e

Use the kill_device() helper to atomically resolve the race of multiple
threads issuing kill, device_unregister(), requests.

Reported-by: Jane Chu <jane.chu@oracle.com>
Reported-by: Erwin Tsaur <erwin.tsaur@oracle.com>
Fixes: 4d88a97 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...")
Cc: <stable@vger.kernel.org>
Link: pmem/ndctl#96
Tested-by: Tested-by: Jane Chu <jane.chu@oracle.com>
Link: https://lore.kernel.org/r/156341207846.292348.10435719262819764054.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>

isjerryxiao added a commit to archlinux-jerry/Amlogic_s905-kernel that referenced this issue Aug 17, 2019

libnvdimm/bus: Prevent duplicate device_unregister() calls
commit 8aac0e2338916e273ccbd438a2b7a1e8c61749f5 upstream.

A multithreaded namespace creation/destruction stress test currently
fails with signatures like the following:

    sysfs group 'power' not found for kobject 'dax1.1'
    RIP: 0010:sysfs_remove_group+0x76/0x80
    Call Trace:
     device_del+0x73/0x370
     device_unregister+0x16/0x50
     nd_async_device_unregister+0x1e/0x30 [libnvdimm]
     async_run_entry_fn+0x39/0x160
     process_one_work+0x23c/0x5e0
     worker_thread+0x3c/0x390

    BUG: kernel NULL pointer dereference, address: 0000000000000020
    RIP: 0010:klist_put+0x1b/0x6c
    Call Trace:
     klist_del+0xe/0x10
     device_del+0x8a/0x2c9
     ? __switch_to_asm+0x34/0x70
     ? __switch_to_asm+0x40/0x70
     device_unregister+0x44/0x4f
     nd_async_device_unregister+0x22/0x2d [libnvdimm]
     async_run_entry_fn+0x47/0x15a
     process_one_work+0x1a2/0x2eb
     worker_thread+0x1b8/0x26e

Use the kill_device() helper to atomically resolve the race of multiple
threads issuing kill, device_unregister(), requests.

Reported-by: Jane Chu <jane.chu@oracle.com>
Reported-by: Erwin Tsaur <erwin.tsaur@oracle.com>
Fixes: 4d88a97 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...")
Cc: <stable@vger.kernel.org>
Link: pmem/ndctl#96
Tested-by: Tested-by: Jane Chu <jane.chu@oracle.com>
Link: https://lore.kernel.org/r/156341207846.292348.10435719262819764054.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>