Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent deadlock in arc_read in Linux memory reclaim callback #9987

Merged

Conversation

markroper
Copy link
Contributor

Motivation and Context

When using zfs with Lustre, an arc_read can trigger kernel memory allocation
that in turn leads to a memory reclaim callback and a deadlock within a
single zfs process.

Description

This change uses spl_fstrans_mark and
spl_trans_unmark to prevent the deadlock
(https://zfsonlinux.topicbox.com/groups/zfs-devel/T4db2c705ec1804ba).
The stack trace observed is:

 #0 [ffffc9002b98adc8] __schedule at ffffffff81610f2e
 #1 [ffffc9002b98ae68] schedule at ffffffff81611558
 #2 [ffffc9002b98ae70] schedule_preempt_disabled at ffffffff8161184a
 #3 [ffffc9002b98ae78] __mutex_lock at ffffffff816131e8
 #4 [ffffc9002b98af18] arc_buf_destroy at ffffffffa0bf37d7 [zfs]
 #5 [ffffc9002b98af48] dbuf_destroy at ffffffffa0bfa6fe [zfs]
 #6 [ffffc9002b98af88] dbuf_evict_one at ffffffffa0bfaa96 [zfs]
 #7 [ffffc9002b98afa0] dbuf_rele_and_unlock at ffffffffa0bfa561 [zfs]
 #8 [ffffc9002b98b050] dbuf_rele_and_unlock at ffffffffa0bfa32b [zfs]
 #9 [ffffc9002b98b100] osd_object_delete at ffffffffa0b64ecc [osd_zfs]
#10 [ffffc9002b98b118] lu_object_free at ffffffffa06d6a74 [obdclass]
#11 [ffffc9002b98b178] lu_site_purge_objects at ffffffffa06d7fc1 [obdclass]
#12 [ffffc9002b98b220] lu_cache_shrink_scan at ffffffffa06d81b8 [obdclass]
#13 [ffffc9002b98b278] shrink_slab at ffffffff811ca9d8
#14 [ffffc9002b98b338] shrink_node at ffffffff811cfd94
#15 [ffffc9002b98b3b8] do_try_to_free_pages at ffffffff811cfe63
#16 [ffffc9002b98b408] try_to_free_pages at ffffffff811d01c4
#17 [ffffc9002b98b488] __alloc_pages_slowpath at ffffffff811be7f2
#18 [ffffc9002b98b580] __alloc_pages_nodemask at ffffffff811bf3ed
#19 [ffffc9002b98b5e0] new_slab at ffffffff81226304
#20 [ffffc9002b98b638] ___slab_alloc at ffffffff812272ab
#21 [ffffc9002b98b6f8] __slab_alloc at ffffffff8122740c
#22 [ffffc9002b98b708] kmem_cache_alloc at ffffffff81227578
#23 [ffffc9002b98b740] spl_kmem_cache_alloc at ffffffffa048a1fd [spl]
#24 [ffffc9002b98b780] arc_buf_alloc_impl at ffffffffa0befba2 [zfs]
#25 [ffffc9002b98b7b0] arc_read at ffffffffa0bf0924 [zfs]
#26 [ffffc9002b98b858] dbuf_read at ffffffffa0bf9083 [zfs]
#27 [ffffc9002b98b900] dmu_buf_hold_by_dnode at ffffffffa0c04869 [zfs]

How Has This Been Tested?

This change has been built and subjected to stability and performance tests as the backing file system for Lustre 2.10.* file systems.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

@markroper markroper force-pushed the linux-memory-reclaim-deadlock-fix branch from 6ac3449 to ac4b105 Compare February 12, 2020 13:38
@behlendorf behlendorf added Status: Code Review Needed Ready for review and testing Component: Memory Management kernel memory management labels Feb 12, 2020
@codecov
Copy link

codecov bot commented Feb 12, 2020

Codecov Report

Merging #9987 into master will decrease coverage by 12%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9987      +/-   ##
==========================================
- Coverage      79%      67%     -12%     
==========================================
  Files         384      302      -82     
  Lines      121683   104452   -17231     
==========================================
- Hits        96584    69987   -26597     
- Misses      25099    34465    +9366
Flag Coverage Δ
#kernel ?
#user 67% <100%> (ø) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8e9e90b...3a5c365. Read the comment docs.

@degremont
Copy link

buildbot/Kernel.org Built-in x86_64 (BUILD) test failing is totally unrelated. Likely more an issue in the test environment.
Is it safe now to retrigger them? How could we retrigger this test?

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will resolve the deadlock but it also somewhat violates the existing layering. Normally we call spl_fstrans_check() when setting up a new kthread if it's going to call the DMU interfaces, or for system calls it's done in our registered VFS callbacks. Feel free to update the PR, but before moving forward with this solution let's check with @adilger about potentially calling this on the Lustre side when they setup the threads which access the DMU. There may be other cases this doesn't cover.

test failing is totally unrelated.

That's right. The failures are due to some recent changes in the mainline kernel, what will be the 5.6 kernel when released. The needed compatibility code for this was added last week as commit 795699a and 0dd7364. If you rebase this PR on the latest version of the master branch with those commits, and force update the PR you can trigger a new CI run which will build fine.

module/zfs/arc.c Outdated Show resolved Hide resolved
@markroper markroper force-pushed the linux-memory-reclaim-deadlock-fix branch from ac4b105 to 3a5c365 Compare February 14, 2020 14:01
Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After giving this some thought I think it's reasonable for us to handle this here. This way we'll catch it for all DMU consumers and Lustre versions. I'm OK with this PR as is.

As an aside it looks like memalloc_noio_save() / memalloc_noio_restore() are now available in all supported kernels (back to 3.10). So at some point in the future we could probably consider simplifying our compatibility wrapper functions.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Mar 10, 2020
Using zfs with Lustre, an arc_read can trigger kernel memory allocation
that in turn leads to a memory reclaim callback and a deadlock within a
single zfs process. This change uses spl_fstrans_mark and
spl_trans_unmark to prevent the reclaim attempt and the deadlock
(https://zfsonlinux.topicbox.com/groups/zfs-devel/T4db2c705ec1804ba).
The stack trace observed is:

    __schedule at ffffffff81610f2e
    schedule at ffffffff81611558
    schedule_preempt_disabled at ffffffff8161184a
    __mutex_lock at ffffffff816131e8
    arc_buf_destroy at ffffffffa0bf37d7 [zfs]
    dbuf_destroy at ffffffffa0bfa6fe [zfs]
    dbuf_evict_one at ffffffffa0bfaa96 [zfs]
    dbuf_rele_and_unlock at ffffffffa0bfa561 [zfs]
    dbuf_rele_and_unlock at ffffffffa0bfa32b [zfs]
    osd_object_delete at ffffffffa0b64ecc [osd_zfs]
    lu_object_free at ffffffffa06d6a74 [obdclass]
    lu_site_purge_objects at ffffffffa06d7fc1 [obdclass]
    lu_cache_shrink_scan at ffffffffa06d81b8 [obdclass]
    shrink_slab at ffffffff811ca9d8
    shrink_node at ffffffff811cfd94
    do_try_to_free_pages at ffffffff811cfe63
    try_to_free_pages at ffffffff811d01c4
    __alloc_pages_slowpath at ffffffff811be7f2
    __alloc_pages_nodemask at ffffffff811bf3ed
    new_slab at ffffffff81226304
    ___slab_alloc at ffffffff812272ab
    __slab_alloc at ffffffff8122740c
    kmem_cache_alloc at ffffffff81227578
    spl_kmem_cache_alloc at ffffffffa048a1fd [spl]
    arc_buf_alloc_impl at ffffffffa0befba2 [zfs]
    arc_read at ffffffffa0bf0924 [zfs]
    dbuf_read at ffffffffa0bf9083 [zfs]
    dmu_buf_hold_by_dnode at ffffffffa0c04869 [zfs]

Signed-off-by: Mark Roper <markroper@gmail.com>
@markroper markroper force-pushed the linux-memory-reclaim-deadlock-fix branch from 3a5c365 to a66ca64 Compare March 11, 2020 12:59
@behlendorf behlendorf merged commit 1e9231a into openzfs:master Mar 12, 2020
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Apr 15, 2020
Using zfs with Lustre, an arc_read can trigger kernel memory allocation
that in turn leads to a memory reclaim callback and a deadlock within a
single zfs process. This change uses spl_fstrans_mark and
spl_trans_unmark to prevent the reclaim attempt and the deadlock
(https://zfsonlinux.topicbox.com/groups/zfs-devel/T4db2c705ec1804ba).
The stack trace observed is:

    __schedule at ffffffff81610f2e
    schedule at ffffffff81611558
    schedule_preempt_disabled at ffffffff8161184a
    __mutex_lock at ffffffff816131e8
    arc_buf_destroy at ffffffffa0bf37d7 [zfs]
    dbuf_destroy at ffffffffa0bfa6fe [zfs]
    dbuf_evict_one at ffffffffa0bfaa96 [zfs]
    dbuf_rele_and_unlock at ffffffffa0bfa561 [zfs]
    dbuf_rele_and_unlock at ffffffffa0bfa32b [zfs]
    osd_object_delete at ffffffffa0b64ecc [osd_zfs]
    lu_object_free at ffffffffa06d6a74 [obdclass]
    lu_site_purge_objects at ffffffffa06d7fc1 [obdclass]
    lu_cache_shrink_scan at ffffffffa06d81b8 [obdclass]
    shrink_slab at ffffffff811ca9d8
    shrink_node at ffffffff811cfd94
    do_try_to_free_pages at ffffffff811cfe63
    try_to_free_pages at ffffffff811d01c4
    __alloc_pages_slowpath at ffffffff811be7f2
    __alloc_pages_nodemask at ffffffff811bf3ed
    new_slab at ffffffff81226304
    ___slab_alloc at ffffffff812272ab
    __slab_alloc at ffffffff8122740c
    kmem_cache_alloc at ffffffff81227578
    spl_kmem_cache_alloc at ffffffffa048a1fd [spl]
    arc_buf_alloc_impl at ffffffffa0befba2 [zfs]
    arc_read at ffffffffa0bf0924 [zfs]
    dbuf_read at ffffffffa0bf9083 [zfs]
    dmu_buf_hold_by_dnode at ffffffffa0c04869 [zfs]

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Roper <markroper@gmail.com>
Closes openzfs#9987
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Apr 22, 2020
Using zfs with Lustre, an arc_read can trigger kernel memory allocation
that in turn leads to a memory reclaim callback and a deadlock within a
single zfs process. This change uses spl_fstrans_mark and
spl_trans_unmark to prevent the reclaim attempt and the deadlock
(https://zfsonlinux.topicbox.com/groups/zfs-devel/T4db2c705ec1804ba).
The stack trace observed is:

    __schedule at ffffffff81610f2e
    schedule at ffffffff81611558
    schedule_preempt_disabled at ffffffff8161184a
    __mutex_lock at ffffffff816131e8
    arc_buf_destroy at ffffffffa0bf37d7 [zfs]
    dbuf_destroy at ffffffffa0bfa6fe [zfs]
    dbuf_evict_one at ffffffffa0bfaa96 [zfs]
    dbuf_rele_and_unlock at ffffffffa0bfa561 [zfs]
    dbuf_rele_and_unlock at ffffffffa0bfa32b [zfs]
    osd_object_delete at ffffffffa0b64ecc [osd_zfs]
    lu_object_free at ffffffffa06d6a74 [obdclass]
    lu_site_purge_objects at ffffffffa06d7fc1 [obdclass]
    lu_cache_shrink_scan at ffffffffa06d81b8 [obdclass]
    shrink_slab at ffffffff811ca9d8
    shrink_node at ffffffff811cfd94
    do_try_to_free_pages at ffffffff811cfe63
    try_to_free_pages at ffffffff811d01c4
    __alloc_pages_slowpath at ffffffff811be7f2
    __alloc_pages_nodemask at ffffffff811bf3ed
    new_slab at ffffffff81226304
    ___slab_alloc at ffffffff812272ab
    __slab_alloc at ffffffff8122740c
    kmem_cache_alloc at ffffffff81227578
    spl_kmem_cache_alloc at ffffffffa048a1fd [spl]
    arc_buf_alloc_impl at ffffffffa0befba2 [zfs]
    arc_read at ffffffffa0bf0924 [zfs]
    dbuf_read at ffffffffa0bf9083 [zfs]
    dmu_buf_hold_by_dnode at ffffffffa0c04869 [zfs]

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Roper <markroper@gmail.com>
Closes openzfs#9987
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Apr 22, 2020
Using zfs with Lustre, an arc_read can trigger kernel memory allocation
that in turn leads to a memory reclaim callback and a deadlock within a
single zfs process. This change uses spl_fstrans_mark and
spl_trans_unmark to prevent the reclaim attempt and the deadlock
(https://zfsonlinux.topicbox.com/groups/zfs-devel/T4db2c705ec1804ba).
The stack trace observed is:

    __schedule at ffffffff81610f2e
    schedule at ffffffff81611558
    schedule_preempt_disabled at ffffffff8161184a
    __mutex_lock at ffffffff816131e8
    arc_buf_destroy at ffffffffa0bf37d7 [zfs]
    dbuf_destroy at ffffffffa0bfa6fe [zfs]
    dbuf_evict_one at ffffffffa0bfaa96 [zfs]
    dbuf_rele_and_unlock at ffffffffa0bfa561 [zfs]
    dbuf_rele_and_unlock at ffffffffa0bfa32b [zfs]
    osd_object_delete at ffffffffa0b64ecc [osd_zfs]
    lu_object_free at ffffffffa06d6a74 [obdclass]
    lu_site_purge_objects at ffffffffa06d7fc1 [obdclass]
    lu_cache_shrink_scan at ffffffffa06d81b8 [obdclass]
    shrink_slab at ffffffff811ca9d8
    shrink_node at ffffffff811cfd94
    do_try_to_free_pages at ffffffff811cfe63
    try_to_free_pages at ffffffff811d01c4
    __alloc_pages_slowpath at ffffffff811be7f2
    __alloc_pages_nodemask at ffffffff811bf3ed
    new_slab at ffffffff81226304
    ___slab_alloc at ffffffff812272ab
    __slab_alloc at ffffffff8122740c
    kmem_cache_alloc at ffffffff81227578
    spl_kmem_cache_alloc at ffffffffa048a1fd [spl]
    arc_buf_alloc_impl at ffffffffa0befba2 [zfs]
    arc_read at ffffffffa0bf0924 [zfs]
    dbuf_read at ffffffffa0bf9083 [zfs]
    dmu_buf_hold_by_dnode at ffffffffa0c04869 [zfs]

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Roper <markroper@gmail.com>
Closes openzfs#9987
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Apr 28, 2020
Using zfs with Lustre, an arc_read can trigger kernel memory allocation
that in turn leads to a memory reclaim callback and a deadlock within a
single zfs process. This change uses spl_fstrans_mark and
spl_trans_unmark to prevent the reclaim attempt and the deadlock
(https://zfsonlinux.topicbox.com/groups/zfs-devel/T4db2c705ec1804ba).
The stack trace observed is:

    __schedule at ffffffff81610f2e
    schedule at ffffffff81611558
    schedule_preempt_disabled at ffffffff8161184a
    __mutex_lock at ffffffff816131e8
    arc_buf_destroy at ffffffffa0bf37d7 [zfs]
    dbuf_destroy at ffffffffa0bfa6fe [zfs]
    dbuf_evict_one at ffffffffa0bfaa96 [zfs]
    dbuf_rele_and_unlock at ffffffffa0bfa561 [zfs]
    dbuf_rele_and_unlock at ffffffffa0bfa32b [zfs]
    osd_object_delete at ffffffffa0b64ecc [osd_zfs]
    lu_object_free at ffffffffa06d6a74 [obdclass]
    lu_site_purge_objects at ffffffffa06d7fc1 [obdclass]
    lu_cache_shrink_scan at ffffffffa06d81b8 [obdclass]
    shrink_slab at ffffffff811ca9d8
    shrink_node at ffffffff811cfd94
    do_try_to_free_pages at ffffffff811cfe63
    try_to_free_pages at ffffffff811d01c4
    __alloc_pages_slowpath at ffffffff811be7f2
    __alloc_pages_nodemask at ffffffff811bf3ed
    new_slab at ffffffff81226304
    ___slab_alloc at ffffffff812272ab
    __slab_alloc at ffffffff8122740c
    kmem_cache_alloc at ffffffff81227578
    spl_kmem_cache_alloc at ffffffffa048a1fd [spl]
    arc_buf_alloc_impl at ffffffffa0befba2 [zfs]
    arc_read at ffffffffa0bf0924 [zfs]
    dbuf_read at ffffffffa0bf9083 [zfs]
    dmu_buf_hold_by_dnode at ffffffffa0c04869 [zfs]

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Roper <markroper@gmail.com>
Closes openzfs#9987
tonyhutter pushed a commit that referenced this pull request May 12, 2020
Using zfs with Lustre, an arc_read can trigger kernel memory allocation
that in turn leads to a memory reclaim callback and a deadlock within a
single zfs process. This change uses spl_fstrans_mark and
spl_trans_unmark to prevent the reclaim attempt and the deadlock
(https://zfsonlinux.topicbox.com/groups/zfs-devel/T4db2c705ec1804ba).
The stack trace observed is:

    __schedule at ffffffff81610f2e
    schedule at ffffffff81611558
    schedule_preempt_disabled at ffffffff8161184a
    __mutex_lock at ffffffff816131e8
    arc_buf_destroy at ffffffffa0bf37d7 [zfs]
    dbuf_destroy at ffffffffa0bfa6fe [zfs]
    dbuf_evict_one at ffffffffa0bfaa96 [zfs]
    dbuf_rele_and_unlock at ffffffffa0bfa561 [zfs]
    dbuf_rele_and_unlock at ffffffffa0bfa32b [zfs]
    osd_object_delete at ffffffffa0b64ecc [osd_zfs]
    lu_object_free at ffffffffa06d6a74 [obdclass]
    lu_site_purge_objects at ffffffffa06d7fc1 [obdclass]
    lu_cache_shrink_scan at ffffffffa06d81b8 [obdclass]
    shrink_slab at ffffffff811ca9d8
    shrink_node at ffffffff811cfd94
    do_try_to_free_pages at ffffffff811cfe63
    try_to_free_pages at ffffffff811d01c4
    __alloc_pages_slowpath at ffffffff811be7f2
    __alloc_pages_nodemask at ffffffff811bf3ed
    new_slab at ffffffff81226304
    ___slab_alloc at ffffffff812272ab
    __slab_alloc at ffffffff8122740c
    kmem_cache_alloc at ffffffff81227578
    spl_kmem_cache_alloc at ffffffffa048a1fd [spl]
    arc_buf_alloc_impl at ffffffffa0befba2 [zfs]
    arc_read at ffffffffa0bf0924 [zfs]
    dbuf_read at ffffffffa0bf9083 [zfs]
    dmu_buf_hold_by_dnode at ffffffffa0c04869 [zfs]

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Roper <markroper@gmail.com>
Closes #9987
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
Using zfs with Lustre, an arc_read can trigger kernel memory allocation
that in turn leads to a memory reclaim callback and a deadlock within a
single zfs process. This change uses spl_fstrans_mark and
spl_trans_unmark to prevent the reclaim attempt and the deadlock
(https://zfsonlinux.topicbox.com/groups/zfs-devel/T4db2c705ec1804ba).
The stack trace observed is:

    __schedule at ffffffff81610f2e
    schedule at ffffffff81611558
    schedule_preempt_disabled at ffffffff8161184a
    __mutex_lock at ffffffff816131e8
    arc_buf_destroy at ffffffffa0bf37d7 [zfs]
    dbuf_destroy at ffffffffa0bfa6fe [zfs]
    dbuf_evict_one at ffffffffa0bfaa96 [zfs]
    dbuf_rele_and_unlock at ffffffffa0bfa561 [zfs]
    dbuf_rele_and_unlock at ffffffffa0bfa32b [zfs]
    osd_object_delete at ffffffffa0b64ecc [osd_zfs]
    lu_object_free at ffffffffa06d6a74 [obdclass]
    lu_site_purge_objects at ffffffffa06d7fc1 [obdclass]
    lu_cache_shrink_scan at ffffffffa06d81b8 [obdclass]
    shrink_slab at ffffffff811ca9d8
    shrink_node at ffffffff811cfd94
    do_try_to_free_pages at ffffffff811cfe63
    try_to_free_pages at ffffffff811d01c4
    __alloc_pages_slowpath at ffffffff811be7f2
    __alloc_pages_nodemask at ffffffff811bf3ed
    new_slab at ffffffff81226304
    ___slab_alloc at ffffffff812272ab
    __slab_alloc at ffffffff8122740c
    kmem_cache_alloc at ffffffff81227578
    spl_kmem_cache_alloc at ffffffffa048a1fd [spl]
    arc_buf_alloc_impl at ffffffffa0befba2 [zfs]
    arc_read at ffffffffa0bf0924 [zfs]
    dbuf_read at ffffffffa0bf9083 [zfs]
    dmu_buf_hold_by_dnode at ffffffffa0c04869 [zfs]

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Roper <markroper@gmail.com>
Closes openzfs#9987
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Memory Management kernel memory management Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants