
ZFS I/O throttle unnecessarily throttles I/O on a lightly loaded system #160

Closed
calmh opened this Issue · 6 comments

Jakob Borg

It seems that I/O is being unnecessarily throttled when the system is
otherwise mostly idle in terms of I/O.

Intuitively I would expect a zone to be able to burst to pretty much all
of the available I/O capacity, much like CPU utilization is (or at least
can be) handled.

Going by http://dtrace.org/blogs/wdp/2011/03/our-zfs-io-throttle/ it
seems that the actual expected behavior is that a zone gets throttled
when its I/O utilization significantly exceeds the average.

In my use case (a mostly idle system, i.e. a few things running, but
none of them generating very much I/O), the average will be close to
zero. It would be nice if this didn't result in harsh throttling when a
zone actually wants to perform some I/O.
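
For reference, the throttle's effect can be watched per zone with
vfsstat: the d/s column is the number of delayed VFS operations per
second, and del_t is the average delay per delayed operation in
microseconds. A minimal invocation from the global zone, using the same
flags as the captures further down:

[root@anto ~]# vfsstat -ZzM 5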

This is my setup:

[root@anto ~]# vmadm list -o uuid,type,ram,state,zfs_io_priority
UUID                                  TYPE  RAM      STATE             IO_PRIORITY
81035ab6-9827-448d-9208-87eeb7d35891  KVM   256      running           100
0d6e2251-aa11-452b-afb7-e43c8e7bfe1c  KVM   512      running           100
183b9ce8-38c6-460b-a5d1-9bb139be0f34  OS    512      running           100
1bbb6dd3-6abd-47b9-8485-a7afe0b727d5  OS    512      running           100
8061c37a-5ec3-47e8-80d7-da9fa204fde1  OS    512      running           100
8491ba2d-1ffc-4219-879d-d8f7384780d9  KVM   512      stopped           100
b2535e73-0892-4183-9e02-0255c6dde661  OS    512      running           100
ba1f8d92-2ca2-4f01-9939-fa2de11a535e  OS    512      running           100
e39c6382-3c22-4896-95b7-3e6515a6b26e  OS    512      running           100
556004f2-36ec-4483-8e1e-5af188f0db7c  KVM   768      running           100
f993583b-2a17-4751-948d-b454f6ba720f  KVM   768      running           100
7dc0f886-5faa-4534-a68e-8277e167464e  KVM   2048     running           100
bc3d93e3-ee95-471a-8341-02881794fcfa  OS    32768    stopped           100

[root@anto ~]# zpool status
  pool: zones
 state: ONLINE
  scan: scrub repaired 0 in 8h49m with 0 errors on Sat Jan  5 01:25:38 2013
config:

        NAME                       STATE     READ WRITE CKSUM
        zones                      ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            c0t50014EE25B7FC628d0  ONLINE       0     0     0
            c0t50014EE205CF84F8d0  ONLINE       0     0     0
          mirror-1                 ONLINE       0     0     0
            c0t50014EE205CF7101d0  ONLINE       0     0     0
            c0t50014EE206561484d0  ONLINE       0     0     0
          mirror-3                 ONLINE       0     0     0
            c0t50014EE2B0D583F5d0  ONLINE       0     0     0
            c0t50014EE25B7CCB38d0  ONLINE       0     0     0
        logs
          c2t1d0                   ONLINE       0     0     0
          c2t2d0                   ONLINE       0     0     0

errors: No known data errors

Doing a bonnie++ test in a zone (ba1f8d) gives the results below. (The
system has 16G RAM; I'm purposely setting the size down in bonnie++ to
make the test take less time. I'm only interested in the perceived I/O
performance from inside the zone, not in measuring what the disks can
sustain in the worst case.)

jb@zlogin:~/tmp % bonnie++ -f -r 4096
Writing intelligently...done
...
Delete files in random order...done.
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zlogin.nym.se    8G           135610  84 41693  81           238721  24 +++++ +++
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 26387  95 +++++ +++ +++++ +++ 23028  99 +++++ +++ 30239  91
zlogin.nym.se,8G,,,135610,84,41693,81,,,238721,24,+++++,+++,16,26387,95,+++++,+++,+++++,+++,23028,99,+++++,+++,30239,91
jb@zlogin:~/tmp % 

For my purposes at the moment, the 135.6 and 41.7 MB/s figures are
the interesting ones.

vfsstat inside the zone looked like this during the test:

jb@zlogin:~ % vfsstat -M 5
  r/s   w/s  Mr/s  Mw/s ractv wactv read_t writ_t  %r  %w   d/s  del_t zone
 66.3  32.4   0.4   0.4   0.0   0.0    0.1    0.1   0   0  27.4   73.1 ba1f8d92 (4)
 24.5 12640.2   0.0  98.7   0.0   0.6    0.4    0.0   1  63 12642.0   36.6 ba1f8d92 (4)
 21.7 12883.0   0.0 100.6   0.0   1.0    0.0    0.1   0  99 12884.8   60.0 ba1f8d92 (4)
 21.9 11178.8   0.0  87.3   0.0   1.0    0.0    0.1   0  99 11180.0   64.1 ba1f8d92 (4)
 21.7 10300.8   0.0  80.5   0.0   1.0    0.0    0.1   0  99 10302.0   80.0 ba1f8d92 (4)
 21.7 11857.2   0.0  92.6   0.0   1.0    0.0    0.1   0  99 11858.6   69.4 ba1f8d92 (4)
 21.7 15072.9   0.0 117.8   0.0   1.0    0.0    0.1   0  99 15074.5   50.0 ba1f8d92 (4)
 21.7 27519.1   0.0 215.0   0.0   1.0    0.0    0.0   0  98 27522.7   19.6 ba1f8d92 (4)
 21.7 38525.8   0.0 301.0   0.0   1.0    0.0    0.0   0  97 38530.8   10.0 ba1f8d92 (4)
 22.1 34036.6   0.0 265.9   0.0   0.9    0.1    0.0   0  92 34040.8   11.0 ba1f8d92 (4)
 21.7 14490.0   0.0 113.2   0.0   1.0    0.0    0.1   0  96 14491.8   47.5 ba1f8d92 (4)
 21.7 7946.7   0.0  62.1   0.0   1.1    0.0    0.1   0 107 7947.3  103.4 ba1f8d92 (4)
 21.7 7296.7   0.0  57.0   0.0   1.0    0.0    0.1   0  99 7297.3  118.0 ba1f8d92 (4)
1812.0 6849.4  14.0  53.5   0.1   0.9    0.1    0.1   9  90 6962.1  117.6 ba1f8d92 (4)
6138.2 6116.5  47.8  47.8   0.2   0.8    0.0    0.1  20  78 6499.5  111.0 ba1f8d92 (4)
6328.9 6307.0  49.3  49.3   0.2   0.8    0.0    0.1  20  78 6701.8  106.8 ba1f8d92 (4)
6486.5 6465.0  50.6  50.5   0.2   0.8    0.0    0.1  22  76 6870.3   96.4 ba1f8d92 (4)
6951.8 6929.9  54.2  54.1   0.3   0.7    0.0    0.1  25  73 7363.6   89.4 ba1f8d92 (4)
5540.7 5519.0  43.2  43.1   0.3   0.7    0.0    0.1  26  72 5864.6  109.6 ba1f8d92 (4)
4750.6 4728.7  37.0  36.9   0.2   0.8    0.0    0.2  18  81 5025.0  138.3 ba1f8d92 (4)
4793.7 4771.9  37.3  37.3   0.2   0.8    0.0    0.2  18  81 5070.8  145.6 ba1f8d92 (4)
 ...

zpool iostat in the GZ:

[root@anto ~]# zpool iostat -v 5
...
                              capacity     operations    bandwidth
pool                       alloc   free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
zones                      1.31T  1.41T  1.29K  2.18K  10.2M  22.6M
  mirror                    482G   446G    320    621  2.50M  3.98M
    c0t50014EE25B7FC628d0      -      -     64    238  1.61M  3.99M
    c0t50014EE205CF84F8d0      -      -     69    258  1.75M  3.99M
  mirror                    527G   401G    471    639  3.68M  4.15M
    c0t50014EE205CF7101d0      -      -     84    156  2.63M  4.16M
    c0t50014EE206561484d0      -      -     93    175  2.79M  4.17M
  mirror                    335G   593G    526    850  4.03M  5.43M
    c0t50014EE2B0D583F5d0      -      -    123    254  3.79M  5.43M
    c0t50014EE25B7CCB38d0      -      -     82    248  2.09M  5.43M
logs                           -      -      -      -      -      -
  c2t1d0                    784K  37.2G      0     60      0  4.59M
  c2t2d0                    940K  37.2G      0     61      0  4.46M
-------------------------  -----  -----  -----  -----  -----  -----

                              capacity     operations    bandwidth
pool                       alloc   free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
zones                      1.31T  1.41T      1    521  3.30K  5.36M
  mirror                    482G   446G      0    118      0   770K
    c0t50014EE25B7FC628d0      -      -      0     83      0   774K
    c0t50014EE205CF84F8d0      -      -      0     85      0   774K
  mirror                    527G   401G      1    172  1.96K  1.04M
    c0t50014EE205CF7101d0      -      -      0     93  1.34K  1.04M
    c0t50014EE206561484d0      -      -      0     90    640  1.03M
  mirror                    335G   593G      0    205  1.34K  1.20M
    c0t50014EE2B0D583F5d0      -      -      0    134    274  1.20M
    c0t50014EE25B7CCB38d0      -      -      0    116  1.07K  1.20M
logs                           -      -      -      -      -      -
  c2t1d0                    864K  37.2G      0     12      0  1.14M
  c2t2d0                   1000K  37.2G      0     12      0  1.23M
-------------------------  -----  -----  -----  -----  -----  -----

iostat -x in the GZ:

[root@anto ~]# iostat -x 5
...
                  extended device statistics                  
device      r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
lofi1       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
ramdisk1    0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd0         0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd1         1.2  269.2    2.3 2119.4  0.0  0.2    0.7   0   5 
sd2         0.0  166.3    0.0 1244.8  0.0  0.1    0.7   0   3 
sd3         0.0  147.1    0.0 1484.8  0.0  0.2    1.4   0   5 
sd4         1.0  269.6    1.7 2106.6  0.0  0.2    0.8   0   6 
sd5         0.0  162.5    0.0 1244.8  0.0  0.2    1.0   0   4 
sd6         0.0  147.3    0.0 1484.8  0.0  0.2    1.3   0   4 
sd7         0.0    7.2    0.0   32.0  0.0  0.0    0.0   0   0 
sd8         0.0    7.2    0.0   38.4  0.0  0.0    0.1   0   0 
                  extended device statistics                  
device      r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
lofi1       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
ramdisk1    0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd0         0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd1         0.8  348.0    1.6 2622.4  0.0  0.2    0.7   0   6 
sd2         0.0  261.5    0.0 1988.4  0.0  0.2    0.8   0   6 
sd3         0.0  296.0    0.0 2348.1  0.0  0.2    0.7   0   6 
sd4         0.6  341.0    1.2 2635.2  0.0  0.2    0.7   0   7 
sd5         0.0  249.7    0.0 1988.4  0.0  0.2    0.8   0   5 
sd6         0.0  299.2    0.0 2348.1  0.0  0.2    0.7   0   6 
sd7         0.0    6.0    0.0   31.2  0.0  0.0    0.0   0   0 
sd8         0.0    6.8    0.0   24.0  0.0  0.0    0.0   0   0 
^C
[root@anto ~]# 

The latter two are mostly there to show that we're not exactly busy.
Also, I'm guessing the bonnie++ data is fairly compressible, maybe even
mostly zeroes.
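
If compression is what explains the gap between the VFS-level rates in
vfsstat and the disk-level rates in zpool iostat / iostat, it should
show up in the dataset's compression ratio. A quick check (assuming the
zone's data lives under zones/<uuid>, as is usual on SmartOS):

[root@anto ~]# zfs get -r compression,compressratio zones/ba1f8d92-2ca2-4f01-9939-fa2de11a535e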

Increasing the I/O priority to a very large number gives better results:

[root@anto ~]# vmadm update ba1f8d92-2ca2-4f01-9939-fa2de11a535e zfs_io_priority=100000000
Successfully updated ba1f8d92-2ca2-4f01-9939-fa2de11a535e
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zlogin.nym.se    8G           391652  56 112447  56           248690  20 +++++ +++
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
zlogin.nym.se,8G,,,391652,56,112447,56,,,248690,20,+++++,+++,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++

This is closer to the expected numbers; vfsstat still shows some
throttling, but less than previously.

  r/s   w/s  Mr/s  Mw/s ractv wactv read_t writ_t  %r  %w   d/s  del_t zone
 69.6  37.2   0.4   0.4   0.0   0.0    0.1    0.1   0   0  30.8   74.2 ba1f8d92 (4)
 23.5 10582.5   0.0  82.7   0.0   0.3    0.0    0.0   0  32 4067.5   25.8 ba1f8d92 (4)
 21.7 42496.7   0.0 332.0   0.0   1.0    0.0    0.0   0  97 14792.8   25.0 ba1f8d92 (4)
 21.7 53883.6   0.0 421.0   0.0   1.0    0.0    0.0   0  97 7828.0   10.0 ba1f8d92 (4)
 21.7 41576.2   0.0 324.8   0.0   1.0    0.0    0.0   0  97 17985.4   10.0 ba1f8d92 (4)
1335.3 61493.0  10.3 480.4   0.1   0.9    0.0    0.0   5  91   0.0    0.0 ba1f8d92 (4)
27504.2 27482.5 214.8 214.7   0.8   0.2    0.0    0.0  75  20   2.0   10.0 ba1f8d92 (4)
26188.1 26166.4 204.5 204.4   0.8   0.2    0.0    0.0  75  20 2004.2    8.9 ba1f8d92 (4)
11848.1 11826.4  92.4  92.4   0.6   0.4    0.0    0.0  55  43 12565.9   28.4 ba1f8d92 (4)
7640.5 7618.6  59.6  59.5   0.5   0.5    0.1    0.1  45  53 8094.7   61.7 ba1f8d92 (4)
6706.4 6684.5  52.3  52.2   0.4   0.6    0.1    0.1  38  60 7102.5   83.1 ba1f8d92 (4)
9075.5 9053.5  70.8  70.7   0.3   0.7    0.0    0.1  30  68 9619.6   68.2 ba1f8d92 (4)
10412.6 10390.7  81.2  81.2   0.4   0.6    0.0    0.1  37  61 11040.1   51.4 ba1f8d92 (4)
12986.8 12964.9 101.3 101.3   0.4   0.6    0.0    0.0  41  56 13775.0   36.4 ba1f8d92 (4)
13348.5 13327.0 104.2 104.1   0.6   0.4    0.0    0.0  56  41 14160.2   23.9 ba1f8d92 (4)
15183.8 15161.5 118.5 118.4   0.5   0.5    0.0    0.0  52  45 16109.0   22.3 ba1f8d92 (4)
14781.6 14760.1 115.4 115.3   0.5   0.5    0.0    0.0  50  47 15682.4   24.3 ba1f8d92 (4)
9786.3 9764.4  76.3  76.3   0.5   0.5    0.0    0.1  47  50 10374.5   44.2 ba1f8d92 (4)
12352.4 12330.7  96.4  96.3   0.4   0.6    0.0    0.0  42  55 13101.5   37.4 ba1f8d92 (4)
16446.5 16424.8 128.4 128.3   0.5   0.4    0.0    0.0  52  44 17451.2   19.8 ba1f8d92 (4)
21371.8 13521.9 166.8 105.6   0.6   0.4    0.0    0.0  61  35 14856.5   19.0 ba1f8d92 (4)
Jakob Borg

This is all on joyent_20121228T011955Z, by the way.

While doing some further testing I ran into something weird. If I set the I/O priority of one zone to 100000000 as above, that zone gets better performance. But another zone still at 100 also gets massively improved performance. That is:

zone a prio = 100
zone b prio = 100
idle system, slowish I/O on both

zone a prio = 100000000
zone b prio = 100
idle system, much much faster I/O on both, individually...!?
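
A rough way to reproduce the above, assuming two otherwise idle OS zones (prompts and UUIDs below are placeholders) and the same bonnie++ invocation as before:

# Baseline: both zones at the default priority; run bonnie++ in zone b
# as a non-root user, exactly as above.
jb@zone-b:~ % bonnie++ -f -r 4096

# Then bump only zone a from the GZ and rerun the same test in zone b.
[root@anto ~]# vmadm update <uuid-of-zone-a> zfs_io_priority=100000000
jb@zone-b:~ % bonnie++ -f -r 4096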

Jakob Borg

On request, here's iostat -xn 1 (http://2c8571f5a45abf71.paste.se/) and vfsstat -ZzM 1 (http://738725d59aec0dc9.paste.se/) from the GZ while running bonnie++ in the zone ba1f; both were started at the same time.

Bill Pijewski

I'll look into this, but for now, I have some initial questions/observations based on what you're seeing.

The vfsstat output could be a little misleading since the columns do not use the same units: read_t and writ_t are in milliseconds, and del_t is in microseconds. The largest delay for any given I/O is 100 microseconds (200 microseconds if the pool is determined to be "busy", where "busy" corresponds roughly to 80% %b from iostat). Since (most) writes are asynchronous in nature, the throttle primarily tries to reduce/cap the number of IOPS rather than inject severe (10ms+) latencies into any given operation. Your vfsstat output indicates that there was some throttling, but since del_t was not pegged at 100 or 200, the amount your zone was being throttled was changing based on ambient I/O from other zones.
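
To put d/s and del_t together: their product is the total delay injected per second, so for a single-threaded workload it is roughly the fraction of wall-clock time spent sleeping in the throttle. A rough one-liner for the vfsstat -M layout shown above (d/s in column 11, del_t in column 12, zone name in column 13):

jb@zlogin:~ % vfsstat -M 5 | awk '$13 == "ba1f8d92" { printf "%.0f%% of each second in throttle delays\n", $11 * $12 / 1e6 * 100 }'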

If you're comfortable poking at the live kernel with mdb -kw (i.e. this is not something you'd regularly want to do on a production system, but I have done it on my lab machine with no hiccups), you can try disabling the I/O throttle and rerunning some or all of your tests. You can do this with:

# mdb -kw
> zfs_zone_delay_enable/w 0
zfs_zone_delay_enable:          0x1     =       0x0

and turn it back on with:

# mdb -kw
> zfs_zone_delay_enable/w 1
zfs_zone_delay_enable:          0       =       0x1

Is bonnie++ a single-threaded workload? Either way, if you're able to disable the throttle, I would verify that you can max out the system's I/O throughput with it disabled. It'll be telling to compare the bonnie++ results with the throttle disabled and then re-enabled.

Finally, I'd be careful about the ambient I/O from other zones, especially the KVM ones. I've gotten some inconsistent results from ambient I/O even on systems with no load. You may want to stop the other zones on the system, or at least keep a vfsstat -Z running in the background, to make sure other zones aren't affecting things too much.
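
The whole sequence can be run in one go from the GZ; this is only a sketch (the mdb writes are exactly the ones above, and the vfsstat log is there to catch ambient I/O from the other zones):

[root@anto ~]# vfsstat -ZzM 5 > /var/tmp/vfsstat-throttle-test.log &
[root@anto ~]# echo 'zfs_zone_delay_enable/w 0' | mdb -kw
[root@anto ~]# zlogin ba1f8d92-2ca2-4f01-9939-fa2de11a535e   # run bonnie++ -f -r 4096 as before, then exit
[root@anto ~]# echo 'zfs_zone_delay_enable/w 1' | mdb -kw
[root@anto ~]# kill %1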

Jakob Borg

Hi, thanks! I'll do some experiments with it turned off. See also the links I posted in the comment above; to me (though I'm no expert at interpreting them) this looks like the throttle doing maximal throttling, with no I/O from the other zones going on. (That is, maybe a more clear-cut case than the first runs you commented on.)

Jakob Borg

And yeah, single-threaded. So probably a worst-case scenario for the I/O throttle performance-wise: the workload is limited to 10,000 VFS IOPS (minus whatever time the operations themselves take) times whatever size of I/O it does. This might just be a pathological case; I'm just surprised to see the throttle kick in when I wouldn't have thought it necessary.
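
As a back-of-envelope check on that (assuming every write eats the full 100 us delay and ignoring the writes' own service time): a single thread can then issue at most 10,000 delayed writes per second, and at the ~8 KB per VFS write seen in the first vfsstat capture that comes out around 78 MB/s, which is in the same ballpark as that capture's slower rows, where del_t sits above 100.

# 1 s / 100 us = 10,000 delayed writes/s; MB per write taken from the
# vfsstat capture above (98.7 MB/s at 12,640 w/s).
jb@zlogin:~ % echo 'scale=4; (1000000/100) * (98.7/12640.2)' | bc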

Keith M Wesolowski
Owner

This has almost certainly been addressed under OS-2576 (not public, unfortunately) in change 034675d. Please reopen if this is still happening with that in place.
