ZFS I/O throttle unnecessarily throttles I/O on a lightly loaded system #160

Closed
calmh opened this Issue Jan 6, 2013 · 13 comments

7 participants

@calmh

It seems that I/O is being unnecessarily throttled when the system is
otherwise mostly idle in terms of I/O.

Intuitively I would expect a zone to be able to burst to pretty much all
of the available I/O capacity, much like CPU utilization is (or can be,
at least) handled.

Going by http://dtrace.org/blogs/wdp/2011/03/our-zfs-io-throttle/ it
seems that the actual expected behavior is that a zone gets throttled
when its I/O utilization significantly exceeds the average.

In my use case, a mostly idle system (i.e. a few things running, but none
of them generating very much I/O), the average will be close to zero.
It would be nice if this didn't result in harsh throttling when a zone
actually wants to perform some I/O.

This is my setup:

[root@anto ~]# vmadm list -o uuid,type,ram,state,zfs_io_priority
UUID                                  TYPE  RAM      STATE             IO_PRIORITY
81035ab6-9827-448d-9208-87eeb7d35891  KVM   256      running           100
0d6e2251-aa11-452b-afb7-e43c8e7bfe1c  KVM   512      running           100
183b9ce8-38c6-460b-a5d1-9bb139be0f34  OS    512      running           100
1bbb6dd3-6abd-47b9-8485-a7afe0b727d5  OS    512      running           100
8061c37a-5ec3-47e8-80d7-da9fa204fde1  OS    512      running           100
8491ba2d-1ffc-4219-879d-d8f7384780d9  KVM   512      stopped           100
b2535e73-0892-4183-9e02-0255c6dde661  OS    512      running           100
ba1f8d92-2ca2-4f01-9939-fa2de11a535e  OS    512      running           100
e39c6382-3c22-4896-95b7-3e6515a6b26e  OS    512      running           100
556004f2-36ec-4483-8e1e-5af188f0db7c  KVM   768      running           100
f993583b-2a17-4751-948d-b454f6ba720f  KVM   768      running           100
7dc0f886-5faa-4534-a68e-8277e167464e  KVM   2048     running           100
bc3d93e3-ee95-471a-8341-02881794fcfa  OS    32768    stopped           100

[root@anto ~]# zpool status
  pool: zones
 state: ONLINE
  scan: scrub repaired 0 in 8h49m with 0 errors on Sat Jan  5 01:25:38 2013
config:

        NAME                       STATE     READ WRITE CKSUM
        zones                      ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            c0t50014EE25B7FC628d0  ONLINE       0     0     0
            c0t50014EE205CF84F8d0  ONLINE       0     0     0
          mirror-1                 ONLINE       0     0     0
            c0t50014EE205CF7101d0  ONLINE       0     0     0
            c0t50014EE206561484d0  ONLINE       0     0     0
          mirror-3                 ONLINE       0     0     0
            c0t50014EE2B0D583F5d0  ONLINE       0     0     0
            c0t50014EE25B7CCB38d0  ONLINE       0     0     0
        logs
          c2t1d0                   ONLINE       0     0     0
          c2t2d0                   ONLINE       0     0     0

errors: No known data errors

Doing a bonnie++ test in a zone (ba1f8d) gives the results below. (The
system has 16G RAM; I'm purposely setting it lower in bonnie++ to make
the test take less time. I'm only interested in the perceived I/O
performance from the zone, not in measuring what the disks can sustain
in the worst case.)

jb@zlogin:~/tmp % bonnie++ -f -r 4096
Writing intelligently...done
...
Delete files in random order...done.
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zlogin.nym.se    8G           135610  84 41693  81           238721  24 +++++ +++
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 26387  95 +++++ +++ +++++ +++ 23028  99 +++++ +++ 30239  91
zlogin.nym.se,8G,,,135610,84,41693,81,,,238721,24,+++++,+++,16,26387,95,+++++,+++,+++++,+++,23028,99,+++++,+++,30239,91
jb@zlogin:~/tmp % 

For my purposes at the moment, the interesting figures are the 135.6 MB/s
sequential block write and the 41.7 MB/s rewrite rates.

vfsstat inside the zone looked like this during the test:

jb@zlogin:~ % vfsstat -M 5
  r/s   w/s  Mr/s  Mw/s ractv wactv read_t writ_t  %r  %w   d/s  del_t zone
 66.3  32.4   0.4   0.4   0.0   0.0    0.1    0.1   0   0  27.4   73.1 ba1f8d92 (4)
 24.5 12640.2   0.0  98.7   0.0   0.6    0.4    0.0   1  63 12642.0   36.6 ba1f8d92 (4)
 21.7 12883.0   0.0 100.6   0.0   1.0    0.0    0.1   0  99 12884.8   60.0 ba1f8d92 (4)
 21.9 11178.8   0.0  87.3   0.0   1.0    0.0    0.1   0  99 11180.0   64.1 ba1f8d92 (4)
 21.7 10300.8   0.0  80.5   0.0   1.0    0.0    0.1   0  99 10302.0   80.0 ba1f8d92 (4)
 21.7 11857.2   0.0  92.6   0.0   1.0    0.0    0.1   0  99 11858.6   69.4 ba1f8d92 (4)
 21.7 15072.9   0.0 117.8   0.0   1.0    0.0    0.1   0  99 15074.5   50.0 ba1f8d92 (4)
 21.7 27519.1   0.0 215.0   0.0   1.0    0.0    0.0   0  98 27522.7   19.6 ba1f8d92 (4)
 21.7 38525.8   0.0 301.0   0.0   1.0    0.0    0.0   0  97 38530.8   10.0 ba1f8d92 (4)
 22.1 34036.6   0.0 265.9   0.0   0.9    0.1    0.0   0  92 34040.8   11.0 ba1f8d92 (4)
 21.7 14490.0   0.0 113.2   0.0   1.0    0.0    0.1   0  96 14491.8   47.5 ba1f8d92 (4)
 21.7 7946.7   0.0  62.1   0.0   1.1    0.0    0.1   0 107 7947.3  103.4 ba1f8d92 (4)
 21.7 7296.7   0.0  57.0   0.0   1.0    0.0    0.1   0  99 7297.3  118.0 ba1f8d92 (4)
1812.0 6849.4  14.0  53.5   0.1   0.9    0.1    0.1   9  90 6962.1  117.6 ba1f8d92 (4)
6138.2 6116.5  47.8  47.8   0.2   0.8    0.0    0.1  20  78 6499.5  111.0 ba1f8d92 (4)
6328.9 6307.0  49.3  49.3   0.2   0.8    0.0    0.1  20  78 6701.8  106.8 ba1f8d92 (4)
6486.5 6465.0  50.6  50.5   0.2   0.8    0.0    0.1  22  76 6870.3   96.4 ba1f8d92 (4)
6951.8 6929.9  54.2  54.1   0.3   0.7    0.0    0.1  25  73 7363.6   89.4 ba1f8d92 (4)
5540.7 5519.0  43.2  43.1   0.3   0.7    0.0    0.1  26  72 5864.6  109.6 ba1f8d92 (4)
4750.6 4728.7  37.0  36.9   0.2   0.8    0.0    0.2  18  81 5025.0  138.3 ba1f8d92 (4)
4793.7 4771.9  37.3  37.3   0.2   0.8    0.0    0.2  18  81 5070.8  145.6 ba1f8d92 (4)
 ...

zpool iostat in the GZ:

[root@anto ~]# zpool iostat -v 5
...
                              capacity     operations    bandwidth
pool                       alloc   free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
zones                      1.31T  1.41T  1.29K  2.18K  10.2M  22.6M
  mirror                    482G   446G    320    621  2.50M  3.98M
    c0t50014EE25B7FC628d0      -      -     64    238  1.61M  3.99M
    c0t50014EE205CF84F8d0      -      -     69    258  1.75M  3.99M
  mirror                    527G   401G    471    639  3.68M  4.15M
    c0t50014EE205CF7101d0      -      -     84    156  2.63M  4.16M
    c0t50014EE206561484d0      -      -     93    175  2.79M  4.17M
  mirror                    335G   593G    526    850  4.03M  5.43M
    c0t50014EE2B0D583F5d0      -      -    123    254  3.79M  5.43M
    c0t50014EE25B7CCB38d0      -      -     82    248  2.09M  5.43M
logs                           -      -      -      -      -      -
  c2t1d0                    784K  37.2G      0     60      0  4.59M
  c2t2d0                    940K  37.2G      0     61      0  4.46M
-------------------------  -----  -----  -----  -----  -----  -----

                              capacity     operations    bandwidth
pool                       alloc   free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
zones                      1.31T  1.41T      1    521  3.30K  5.36M
  mirror                    482G   446G      0    118      0   770K
    c0t50014EE25B7FC628d0      -      -      0     83      0   774K
    c0t50014EE205CF84F8d0      -      -      0     85      0   774K
  mirror                    527G   401G      1    172  1.96K  1.04M
    c0t50014EE205CF7101d0      -      -      0     93  1.34K  1.04M
    c0t50014EE206561484d0      -      -      0     90    640  1.03M
  mirror                    335G   593G      0    205  1.34K  1.20M
    c0t50014EE2B0D583F5d0      -      -      0    134    274  1.20M
    c0t50014EE25B7CCB38d0      -      -      0    116  1.07K  1.20M
logs                           -      -      -      -      -      -
  c2t1d0                    864K  37.2G      0     12      0  1.14M
  c2t2d0                   1000K  37.2G      0     12      0  1.23M
-------------------------  -----  -----  -----  -----  -----  -----

iostat -x in the GZ:

[root@anto ~]# iostat -x 5
...
                  extended device statistics                  
device      r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
lofi1       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
ramdisk1    0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd0         0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd1         1.2  269.2    2.3 2119.4  0.0  0.2    0.7   0   5 
sd2         0.0  166.3    0.0 1244.8  0.0  0.1    0.7   0   3 
sd3         0.0  147.1    0.0 1484.8  0.0  0.2    1.4   0   5 
sd4         1.0  269.6    1.7 2106.6  0.0  0.2    0.8   0   6 
sd5         0.0  162.5    0.0 1244.8  0.0  0.2    1.0   0   4 
sd6         0.0  147.3    0.0 1484.8  0.0  0.2    1.3   0   4 
sd7         0.0    7.2    0.0   32.0  0.0  0.0    0.0   0   0 
sd8         0.0    7.2    0.0   38.4  0.0  0.0    0.1   0   0 
                  extended device statistics                  
device      r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
lofi1       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
ramdisk1    0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd0         0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd1         0.8  348.0    1.6 2622.4  0.0  0.2    0.7   0   6 
sd2         0.0  261.5    0.0 1988.4  0.0  0.2    0.8   0   6 
sd3         0.0  296.0    0.0 2348.1  0.0  0.2    0.7   0   6 
sd4         0.6  341.0    1.2 2635.2  0.0  0.2    0.7   0   7 
sd5         0.0  249.7    0.0 1988.4  0.0  0.2    0.8   0   5 
sd6         0.0  299.2    0.0 2348.1  0.0  0.2    0.7   0   6 
sd7         0.0    6.0    0.0   31.2  0.0  0.0    0.0   0   0 
sd8         0.0    6.8    0.0   24.0  0.0  0.0    0.0   0   0 
^C
[root@anto ~]# 

The latter two are mostly to show that the disks are not exactly busy.
Also, I'm guessing the bonnie++ data is fairly compressible, maybe even
mostly zeroes.
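
One way to check that compressibility guess from the GZ is to look at the
dataset's compression setting and achieved ratio (the zones/<uuid> dataset
path below is the conventional SmartOS layout and an assumption on my part):

[root@anto ~]# zfs get compression,compressratio zones/ba1f8d92-2ca2-4f01-9939-fa2de11a535e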

Increasing the I/O priority to a very large number gives better results:

[root@anto ~]# vmadm update ba1f8d92-2ca2-4f01-9939-fa2de11a535e zfs_io_priority=100000000
Successfully updated ba1f8d92-2ca2-4f01-9939-fa2de11a535e
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zlogin.nym.se    8G           391652  56 112447  56           248690  20 +++++ +++
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
zlogin.nym.se,8G,,,391652,56,112447,56,,,248690,20,+++++,+++,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++

This is closer to the expected numbers. vfsstat still shows some
throttling, but less than previously:

  r/s   w/s  Mr/s  Mw/s ractv wactv read_t writ_t  %r  %w   d/s  del_t zone
 69.6  37.2   0.4   0.4   0.0   0.0    0.1    0.1   0   0  30.8   74.2 ba1f8d92 (4)
 23.5 10582.5   0.0  82.7   0.0   0.3    0.0    0.0   0  32 4067.5   25.8 ba1f8d92 (4)
 21.7 42496.7   0.0 332.0   0.0   1.0    0.0    0.0   0  97 14792.8   25.0 ba1f8d92 (4)
 21.7 53883.6   0.0 421.0   0.0   1.0    0.0    0.0   0  97 7828.0   10.0 ba1f8d92 (4)
 21.7 41576.2   0.0 324.8   0.0   1.0    0.0    0.0   0  97 17985.4   10.0 ba1f8d92 (4)
1335.3 61493.0  10.3 480.4   0.1   0.9    0.0    0.0   5  91   0.0    0.0 ba1f8d92 (4)
27504.2 27482.5 214.8 214.7   0.8   0.2    0.0    0.0  75  20   2.0   10.0 ba1f8d92 (4)
26188.1 26166.4 204.5 204.4   0.8   0.2    0.0    0.0  75  20 2004.2    8.9 ba1f8d92 (4)
11848.1 11826.4  92.4  92.4   0.6   0.4    0.0    0.0  55  43 12565.9   28.4 ba1f8d92 (4)
7640.5 7618.6  59.6  59.5   0.5   0.5    0.1    0.1  45  53 8094.7   61.7 ba1f8d92 (4)
6706.4 6684.5  52.3  52.2   0.4   0.6    0.1    0.1  38  60 7102.5   83.1 ba1f8d92 (4)
9075.5 9053.5  70.8  70.7   0.3   0.7    0.0    0.1  30  68 9619.6   68.2 ba1f8d92 (4)
10412.6 10390.7  81.2  81.2   0.4   0.6    0.0    0.1  37  61 11040.1   51.4 ba1f8d92 (4)
12986.8 12964.9 101.3 101.3   0.4   0.6    0.0    0.0  41  56 13775.0   36.4 ba1f8d92 (4)
13348.5 13327.0 104.2 104.1   0.6   0.4    0.0    0.0  56  41 14160.2   23.9 ba1f8d92 (4)
15183.8 15161.5 118.5 118.4   0.5   0.5    0.0    0.0  52  45 16109.0   22.3 ba1f8d92 (4)
14781.6 14760.1 115.4 115.3   0.5   0.5    0.0    0.0  50  47 15682.4   24.3 ba1f8d92 (4)
9786.3 9764.4  76.3  76.3   0.5   0.5    0.0    0.1  47  50 10374.5   44.2 ba1f8d92 (4)
12352.4 12330.7  96.4  96.3   0.4   0.6    0.0    0.0  42  55 13101.5   37.4 ba1f8d92 (4)
16446.5 16424.8 128.4 128.3   0.5   0.4    0.0    0.0  52  44 17451.2   19.8 ba1f8d92 (4)
21371.8 13521.9 166.8 105.6   0.6   0.4    0.0    0.0  61  35 14856.5   19.0 ba1f8d92 (4)
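
To confirm that the new priority took effect, it can also be read back in
the GZ (vmadm get prints the zone's JSON; the json filter below is an
assumption on my part):

[root@anto ~]# vmadm get ba1f8d92-2ca2-4f01-9939-fa2de11a535e | json zfs_io_priority
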
@calmh

This is all on joyent_20121228T011955Z, by the way.

While doing some further testing I ran into something weird. If I set the I/O priority of one zone to 100000000 as above, that zone gets better performance. But another zone, still at 100, also gets massively improved performance. That is:

zone a prio = 100
zone b prio = 100
idle system, slowish I/O on both

zone a prio = 100000000
zone b prio = 100
idle system, much much faster I/O on both, individually...!?

@calmh

On request, here's iostat -xn 1 (http://2c8571f5a45abf71.paste.se/) and vfsstat -ZzM 1 (http://738725d59aec0dc9.paste.se/) from the GZ while running bonnie++ in the zone ba1f, both started at the same time.

@pijewski

I'll look into this, but for now, I have some initial questions/observations based on what you're seeing.

The vfsstat output could be a little misleading since the columns do not use the same units: read_t and writ_t are in milliseconds, while del_t is in microseconds. The largest delay for any given I/O is 100 microseconds (200 microseconds if the pool is determined to be "busy", where "busy" corresponds roughly to 80% %b from iostat). Since (most) writes are asynchronous in nature, the throttle primarily tries to reduce/cap the number of IOPS rather than to inject severe (10ms+) latencies into any given operation. Your vfsstat report indicates that there was some throttling, but since del_t was not pegged at 100 or 200, the amount your zone was being throttled was changing based on ambient I/O from other zones.
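
To put rough numbers on that cap (my own back-of-the-envelope, assuming the delays are applied serially to a single-threaded issuer and roughly 8 KB per write, which is what the w/s vs. Mw/s ratio in the vfsstat output above works out to):

1 s / 100 us per delayed op = at most ~10,000 delayed ops/s (normal ceiling)
1 s / 200 us per delayed op = at most  ~5,000 delayed ops/s (pool "busy")
10,000 ops/s x 8 KB         = roughly 80 MB/s upper bound for a single-threaded writer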

If you're comfortable poking at the live kernel with mdb -kw (i.e. this is not something you'd regularly want to do on a production system, but I have done it on my lab machine with no hiccups), you can try disabling the I/O throttle and rerunning some or all of your tests. You can do this with:

# mdb -kw
> zfs_zone_delay_enable/w 0
zfs_zone_delay_enable:          0x1     =       0x0

and turn it back on with:

# mdb -kw
> zfs_zone_delay_enable/w 1
zfs_zone_delay_enable:          0       =       0x1
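
The same toggle can also be done non-interactively by piping the write to mdb (the /W format assumes the tunable is a 4-byte int):

# echo 'zfs_zone_delay_enable/W 0' | mdb -kw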

Is bonnie++ a single-threaded workload? Either way, if you're able to disable the throttle, I would verify that you're able to max out the I/O throughput of the system with the throttle disabled. It'll be telling to compare the bonnie++ results with the throttle disabled and then reenabled.
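
One way to check whether a single writer is the limiting factor once the throttle is off is to run two instances in parallel against separate directories (a rough sketch; -d just points bonnie++ at a test directory) and see whether the aggregate throughput scales:

$ mkdir run1 run2
$ bonnie++ -f -r 4096 -d run1 & bonnie++ -f -r 4096 -d run2 &
$ wait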

Finally, I'd be careful about the ambient I/O from other zones, especially the KVM ones. I've gotten some inconsistent results from ambient I/O even on systems that had no load. You may want to stop the other zones on the system, or at least keep a vfsstat -Z running in the background to make sure other zones aren't affecting things too much.
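
For example, using only the tools already shown in this thread, capture per-zone VFS activity for the whole run and review it afterwards:

# vfsstat -ZM 5 > /var/tmp/vfsstat-bonnie.log &
(run the bonnie++ test inside the zone)
# kill %1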

@calmh

Hi, thanks! I'll do some experiments with it turned off. See also the links I posted in the comment above; to me (though I'm no expert at interpreting them) this looks like the throttle doing maximal throttling, with no I/O from the other zones going on. (That is, maybe more clear-cut than the first runs that you commented on.)

@calmh

And yeah, single-threaded. So probably a worst-case scenario for the I/O throttle performance-wise, being limited to 10,000 VFS IOPS (minus whatever time they actually take) times whatever I/O size it uses. This might just be a pathological case; I'm just surprised to see the throttle kick in when I wouldn't think it necessary.

@wesolows

This has almost certainly been addressed under OS-2576 (not public, unfortunately) in change 034675d740e420a003b4fabded65e37aeacf0c7f. Please reopen if this is still happening with that in place.

@wesolows wesolows closed this Jan 15, 2014
@razamatan

@wesolows: does this mean that the public version of SmartOS doesn't have the fix?

@rmustacc
Joyent member

No, just that the bug's information is not available. The commit is in the source tree for illumos-joyent; see the git SHA that @wesolows put there.

@matthiasg

@rmustacc @wesolows @razamatan @calmh can anyone tell me which public SmartOS build has the aforementioned fix included?

@MerlinDMC

@matthiasg joyent/illumos-joyent@034675d ... if you look at the tag, it should be in releases from 20131128 and newer

@matthiasg
@MerlinDMC

I had to edit my comment as I still had the latest tag in the clipboard. The fix should have been in for a good year now, since release 20131128.

@matthiasg

@pijewski I am experiencing a del_t pegged at 200 when writing and 100 when reading for the duration of a bonnie run. This is in a zone on an otherwise totally idle machine. The machine has about 28 other zones, all of which are idle at present.

This is with "Live Image": "20150219T102159Z". I had thought there would be little or no I/O throttling in this build when no other zone was busy.

I looked up this thread again since I am experiencing delays on a production server whenever one machine needs to rebuild its databases from disk and the ZFS cache is not pre-filled, it seems.

What is the best way to proceed? Is it possible to totally and persistently deactivate the throttle for some time?
What exactly is the algorithm for the throttle, and how is it affected by other idle zones (idle as in not doing any I/O whatsoever)?

EDIT: I disabled the throttle with echo zfs_zone_delay_enable/W0 | mdb -kw and I get a del_t of 0 as expected, and performance increased quite a bit:

Before:

[root@machine ~]# bonnie -s 2000
File './Bonnie.67711', size: 2097152000
Writing with putc()...done
Rewriting...done
Writing intelligently...done
Reading with getc()...done
Reading intelligently...done
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         2000 23902 17.8 46134 61.7 43469 62.9 149166 97.4 3528339 99.4 11883.6 20.8

After:

[root@machine ~]# bonnie -s 2000
File './Bonnie.68270', size: 2097152000
Writing with putc()...done
Rewriting...done
Writing intelligently...done
Reading with getc()...done
Reading intelligently...done
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         2000 42461 25.8 102161  9.2 42056  6.8 172552 99.7 3674030 99.7 209369.3 292.2
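
For completeness, the throttle can be turned back on the same way as in the mdb example earlier in the thread:

# echo 'zfs_zone_delay_enable/W 1' | mdb -kw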
