Deadlock behavior seen with Lustre during memory allocation #15786

Open
jeyaga opened this issue Jan 17, 2024 · 4 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@jeyaga commented Jan 17, 2024

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Amazon Linux 2 |
| Distribution Version | |
| Kernel Version | 5.10 |
| Architecture | arm64 |
| OpenZFS Version | 2.1.7 |

Describe the problem you're observing

A deadlock is seen when Lustre runs on ZFS on low-memory instances. The issue occurs when Lustre calls ZFS APIs to write or update data: the memory allocation inside the API performs inline (direct) reclaim to free pages, and the reclaim in turn calls back into Lustre, resulting in a deadlock. The stack trace posted below shows a Lustre RPC thread calling sa_update, which calls spl_kmem_zalloc; spl_kmem_zalloc tries to free memory inline, which re-enters Lustre and deadlocks.

Describe how to reproduce the problem

The issue happens on systems with low memory. Another trigger condition is having a Lustre client mount on the Lustre metadata server. The issue does not happen if all ZFS memory allocations are modified to use KM_NOSLEEP, which avoids the inline reclaim. Are there implications to using KM_NOSLEEP for all ZFS allocations, or is there a better way to avoid running into this scenario?

Include any warning/errors/backtraces from the system logs

[<0>] __switch_to+0x80/0xb0
[<0>] obd_get_mod_rpc_slot+0x310/0x578 [obdclass]
[<0>] ptlrpc_get_mod_rpc_slot+0x3c/0x60 [ptlrpc]
[<0>] mdc_close+0x220/0xe5c [mdc]
[<0>] lmv_close+0x1ac/0x480 [lmv]
[<0>] ll_close_inode_openhandle+0x398/0xc8c [lustre]
[<0>] ll_md_real_close+0xa8/0x288 [lustre]
[<0>] ll_clear_inode+0x1a4/0x7e0 [lustre]
[<0>] ll_delete_inode+0x74/0x260 [lustre]
[<0>] evict+0xe0/0x23c
[<0>] dispose_list+0x5c/0x7c
[<0>] prune_icache_sb+0x68/0xa0
[<0>] super_cache_scan+0x158/0x1c0
[<0>] do_shrink_slab+0x19c/0x360
[<0>] shrink_slab+0xc0/0x144
[<0>] shrink_node_memcgs+0x1e4/0x240
[<0>] shrink_node+0x154/0x5e0
[<0>] shrink_zones+0x9c/0x220
[<0>] do_try_to_free_pages+0xb0/0x300
[<0>] try_to_free_pages+0x128/0x260
[<0>] __alloc_pages_slowpath.constprop.0+0x3e0/0x7fc
[<0>] __alloc_pages_nodemask+0x2bc/0x310
[<0>] alloc_pages_current+0x90/0x148
[<0>] allocate_slab+0x3cc/0x4f0
[<0>] new_slab_objects+0xa4/0x164
[<0>] ___slab_alloc+0x1b8/0x304
[<0>] __slab_alloc+0x28/0x60
[<0>] __kmalloc_node+0x140/0x3e0
[<0>] spl_kmem_alloc_impl+0xd4/0x134 [spl]
[<0>] spl_kmem_zalloc+0x20/0x38 [spl]
[<0>] sa_modify_attrs+0xfc/0x368 [zfs]
[<0>] sa_attr_op+0x144/0x3d4 [zfs]
[<0>] sa_bulk_update_impl+0x6c/0x110 [zfs]
[<0>] sa_update+0x8c/0x170 [zfs]
[<0>] __osd_sa_xattr_update+0x12c/0x270 [osd_zfs]
[<0>] osd_object_sa_dirty_rele+0x1a0/0x1a4 [osd_zfs]
[<0>] osd_trans_stop+0x370/0x720 [osd_zfs]
[<0>] top_trans_stop+0xb4/0x1024 [ptlrpc]
[<0>] lod_trans_stop+0x70/0x108 [lod]
[<0>] mdd_trans_stop+0x3c/0x2ec [mdd]
[<0>] mdd_create_data+0x43c/0x73c [mdd]
[<0>] mdt_create_data+0x224/0x39c [mdt]
[<0>] mdt_mfd_open+0x2e0/0xdc0 [mdt]
[<0>] mdt_finish_open+0x558/0x840 [mdt]
[<0>] mdt_open_by_fid_lock+0x468/0xb24 [mdt]
[<0>] mdt_reint_open+0x824/0x1fb8 [mdt]
[<0>] mdt_reint_rec+0x168/0x300 [mdt]
[<0>] mdt_reint_internal+0x5e8/0x9e0 [mdt]
[<0>] mdt_intent_open+0x170/0x43c [mdt]
[<0>] mdt_intent_opc+0x16c/0x65c [mdt]
[<0>] mdt_intent_policy+0x234/0x3b8 [mdt]
[<0>] ldlm_lock_enqueue+0x4b0/0x97c [ptlrpc]
[<0>] ldlm_handle_enqueue0+0xa20/0x1b2c [ptlrpc]
[<0>] tgt_enqueue+0x88/0x2d4 [ptlrpc]
jeyaga added the Type: Defect (Incorrect behavior, e.g. crash, hang) label Jan 17, 2024
@amotin (Member) commented Jan 17, 2024

KM_NOSLEEP makes sense only if the caller is ready to handle allocation errors, like packet loss in a network stack. ZFS cannot just return ENOMEM errors randomly, so it would not be a fix, but a permanent disaster.
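
For illustration, here is a minimal sketch of what a KM_NOSLEEP caller has to do (spl_kmem_zalloc(), spl_kmem_free() and the KM_* flags are the existing SPL API; the surrounding function and its error handling are hypothetical):

```c
#include <sys/kmem.h>	/* spl_kmem_zalloc(), spl_kmem_free(), KM_* flags */

/*
 * Hypothetical caller.  With KM_SLEEP the allocation never fails but may
 * enter direct reclaim (the deadlock in this issue); with KM_NOSLEEP it
 * never reclaims, but the caller must cope with a NULL return.  Most ZFS
 * call sites, e.g. sa_modify_attrs() in the backtrace above, have no
 * sensible way to recover from that.
 */
static int
example_nosleep_alloc(size_t size)
{
	void *buf = spl_kmem_zalloc(size, KM_NOSLEEP);

	if (buf == NULL)
		return (ENOMEM);	/* caller must be able to back out */

	/* ... use buf ... */

	spl_kmem_free(buf, size);
	return (0);
}
```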

@behlendorf (Contributor) commented

This same issue comes up in quite a few places in the ZFS code. To handle it we added two functions, spl_fstrans_mark() and spl_fstrans_unmark(). They set PF_MEMALLOC_NOIO and are used to wrap critical sections where inline/direct memory reclaim could deadlock (like above). Within the section the kernel memory allocation function isn't allowed to perform the reclaim itself, which avoids the issue.

It looks to me like the best fix here will be to update the Lustre code to use these wrappers for any area where a deadlock like this is possible.
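
For reference, a minimal sketch of the wrapper pattern (spl_fstrans_mark(), spl_fstrans_unmark() and fstrans_cookie_t are the real SPL interfaces; the surrounding function is hypothetical):

```c
#include <sys/kmem.h>	/* spl_fstrans_mark(), spl_fstrans_unmark() */

/*
 * Hypothetical critical section.  While the cookie is held the task runs
 * with PF_MEMALLOC_NOIO set, so kernel allocations made on this path will
 * not perform filesystem/IO reclaim and therefore cannot call back into
 * the filesystem (Lustre in this case).
 */
static void
example_critical_section(void)
{
	fstrans_cookie_t cookie;

	cookie = spl_fstrans_mark();

	/* calls into ZFS that may allocate memory, e.g. sa_update() */

	spl_fstrans_unmark(cookie);
}
```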

@jeyaga (Author) commented Jan 19, 2024

> This same issue comes up in quite a few places in the ZFS code. To handle it we added two functions, spl_fstrans_mark() and spl_fstrans_unmark(). They set PF_MEMALLOC_NOIO and are used to wrap critical sections where inline/direct memory reclaim could deadlock (like above). Within the section the kernel memory allocation function isn't allowed to perform the reclaim itself, which avoids the issue.
>
> It looks to me like the best fix here will be to update the Lustre code to use these wrappers for any area where a deadlock like this is possible.

Thanks, will check out spl_fstrans_mark()/spl_fstrans_unmark().

@ryao (Contributor) commented Jan 19, 2024

> This same issue comes up in quite a few places in the ZFS code. To handle it we added two functions, spl_fstrans_mark() and spl_fstrans_unmark(). They set PF_MEMALLOC_NOIO and are used to wrap critical sections where inline/direct memory reclaim could deadlock (like above). Within the section the kernel memory allocation function isn't allowed to perform the reclaim itself, which avoids the issue.
>
> It looks to me like the best fix here will be to update the Lustre code to use these wrappers for any area where a deadlock like this is possible.

Agreed. The solution here is to modify Lustre to wrap its calls into ZFS with spl_fstrans_mark() and spl_fstrans_unmark() wherever Lustre cannot safely recurse into itself.
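
As a rough sketch only (not actual Lustre code; the helper name and signature below are made up for illustration), the osd-zfs call path from the backtrace could be wrapped along these lines:

```c
#include <sys/kmem.h>	/* spl_fstrans_mark(), spl_fstrans_unmark() */

/* Hypothetical stand-in for the osd-zfs path that ends up in sa_update()
 * (see the backtrace above). */
static int osd_update_xattr_sa(void *obj);

/*
 * Sketch of the fix on the Lustre side: mark the task before calling into
 * ZFS so any allocation done under the call runs with PF_MEMALLOC_NOIO and
 * cannot re-enter Lustre through direct reclaim.
 */
static int
osd_update_xattr_sa_noreclaim(void *obj)
{
	fstrans_cookie_t cookie;
	int rc;

	cookie = spl_fstrans_mark();
	rc = osd_update_xattr_sa(obj);	/* ends up in sa_update() */
	spl_fstrans_unmark(cookie);

	return (rc);
}
```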
