
Fix zio_change_priority() lock inversion #7301

Conversation

behlendorf
Contributor

Description

A deadlock in the I/O pipeline involving zio_change_priority() is possible because the zio->io_lock and vdev->vq_lock can be acquired in inconsistent orders.

---  THREAD 1 ---
arc_read()
  zio_nowait()
    zio_vdev_io_start()
      vdev_queue_io() <- mutex_enter(vq->vq_lock)
        vdev_queue_io_to_issue()
          vdev_queue_aggregate()
            zio_execute()
              zio_vdev_io_assess()
                zio_wait_for_children() <- mutex_enter(zio->io_lock)

--- THREAD 2 --- (inverse order)
arc_read()
  zio_change_priority() <- mutex_enter(zio->io_lock)
    vdev_queue_change_io_priority() <- mutex_enter(vq->vq_lock)

This patch resolves the lock inversion by using mutex_tryenter() to detect when the vq->vq_lock is contended in the zio_change_priority() call path. When contended, all the locks are released and their acquisition is retried.

Motivation and Context

Any deadlock like this in the pipeline could manifest itself as a hang. It may explain issues like #7241 and #7059, which have been observed in master. This specific issue was introduced in a8b2e30 (zfs-0.7.0-223-ga8b2e30) and does not impact the 0.7 release branch.

How Has This Been Tested?

The issue was reliably reproducible with the sequential_reads test case from the perf-regression tests.

DISKS="vdb vdc vdd" ./scripts/zfs-tests.sh -r perf-regression.run -vx -T perf

With this patch applied the issue can no longer be reproduced. Initial performance results are promising, though several long-latency outliers remain.

job: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
...
fio-3.1
Starting 64 threads

job: (groupid=0, jobs=64): err= 0: pid=22006: Wed Mar 14 00:05:19 2018
   read: IOPS=117, BW=117MiB/s (123MB/s)(69.2GiB/604548msec)
    clat (msec): min=2, max=40781, avg=542.60, stdev=1336.27
     lat (msec): min=2, max=40781, avg=542.61, stdev=1336.27
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    4], 10.00th=[    5], 20.00th=[    7],
     | 30.00th=[   11], 40.00th=[   22], 50.00th=[   65], 60.00th=[  155],
     | 70.00th=[  300], 80.00th=[  625], 90.00th=[ 1687], 95.00th=[ 2802],
     | 99.00th=[ 5604], 99.50th=[ 7617], 99.90th=[15234], 99.95th=[17113],
     | 99.99th=[17113]
   bw (  KiB/s): min= 1706, max=26892, per=4.71%, avg=5651.91, stdev=3823.11, samples=25665
   iops        : min=    1, max=   26, avg= 5.32, stdev= 3.75, samples=25665
  lat (msec)   : 4=6.19%, 10=22.88%, 20=9.91%, 50=8.67%, 100=7.02%
  lat (msec)   : 250=12.59%, 500=9.97%, 750=4.74%, 1000=2.87%, 2000=6.93%
  lat (msec)   : >=2000=8.23%
  cpu          : usr=0.00%, sys=0.97%, ctx=1379109, majf=0, minf=16384
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=70844,0,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=117MiB/s (123MB/s), 117MiB/s-117MiB/s (123MB/s-123MB/s), io=69.2GiB (74.3GB), run=604548-604548msec

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label Mar 14, 2018
@behlendorf behlendorf requested a review from tcaputi March 14, 2018 17:35
@codecov

codecov bot commented Mar 14, 2018

Codecov Report

Merging #7301 into master will increase coverage by 0.07%.
The diff coverage is 80%.

@@            Coverage Diff             @@
##           master    #7301      +/-   ##
==========================================
+ Coverage   76.36%   76.43%   +0.07%     
==========================================
  Files         328      328              
  Lines      104023   104033      +10     
==========================================
+ Hits        79439    79521      +82     
+ Misses      24584    24512      -72
Flag Coverage Δ
#kernel 76.45% <92.3%> (+0.39%) ⬆️
#user 65.68% <0%> (+0.24%) ⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update de4f8d5...bae0254.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf behlendorf force-pushed the zio_change_priority-lock-inversion branch from fce6f52 to bae0254 Compare March 15, 2018 01:31
@behlendorf behlendorf mentioned this pull request Mar 15, 2018
13 tasks
@behlendorf behlendorf closed this Mar 15, 2018
@behlendorf behlendorf deleted the zio_change_priority-lock-inversion branch April 19, 2021 20:08