Terribly slow virtio write speeds in Windows guest and recommendation of latest Virtio drivers #320

Closed
bassu opened this issue Apr 22, 2014 · 25 comments

@bassu

bassu commented Apr 22, 2014

SmartOS with default zfs params (either dual mirror or raidz1 with or without SLOG):

  1. Virtio drivers signed by Joyent give 2 MB/s sequential writes.
  2. Virtio drivers from Red Hat give 23 MB/s sequential writes.

Writes outside the guest, or in a Linux guest, average 150 MB/s.

I have compared this to the Joyent Cloud, where I can see they are using the old
Joyent-signed Virtio drivers, yet sequential writes in Windows instances over
there are 160 MB/s.

This is easily reproducible with the latest SmartOS and is independent of the hardware used.

zfs_zone_delay_enable was disabled during all tests.
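
For anyone trying to reproduce this, that tunable can be checked and flipped with mdb on the live kernel. A minimal sketch, assuming the symbol is a plain 32-bit integer tunable:

  echo "zfs_zone_delay_enable/D" | mdb -k      # print the current value
  echo "zfs_zone_delay_enable/W 0" | mdb -kw   # write 0 (disabled) into the running kernel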

zpool iostat -v

                              capacity     operations    bandwidth
pool                       alloc   free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
zones                      2.84T  4.41T      0  1.55K      0  29.2M
  raidz1                   2.84T  4.41T      0  1.55K      0  29.2M
    c0t50014EE2089F83E2d0      -      -      0  1.55K      0  9.86M
    c0t50014EE208A21B1Ad0      -      -      0  1.55K      0  9.86M
    c0t50014EE25DF5259Fd0      -      -      0  1.55K      0  9.86M
    c0t50014EE2B34A1BF1d0      -      -      0  1.55K      0  9.86M
-------------------------  -----  -----  -----  -----  -----  -----

[screenshot: slow-virtio]

@kylegato

My Setup:

WD Velociraptor Drives 10K RPM
64 GB DDR3 ECC
Dual Xeon 5520s

[root@smartos ~]# zpool status
  pool: zones
 state: ONLINE
  scan: scrub repaired 0 in 0h55m with 0 errors on Mon Apr 21 19:14:58 2014
config:

        NAME        STATE     READ WRITE CKSUM
        zones       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0

errors: No known data errors
[root@smartos ~]# 

[screenshot]

@ccrusius

ccrusius commented May 1, 2014

Same here. I have a 48-core, 2.4 GHz machine with a zpool on a 7.2k RPM mirror. On Windows guests, both 7 Enterprise and 2008 Standard, my disk throughput is abysmally low. Network throughput does not fare much better, capping at about 150 Mbps. This is using the Red Hat virtio drivers.

[screenshot]

@rmustacc
Contributor

rmustacc commented May 4, 2014

There has been a lot of conversation about this in IRC; unfortunately I don't know whether the folks who were asking about it there are the same folks who have commented on this bug. For whatever reason, we don't see the same data. As I've mentioned in IRC, the folks who are seeing this are going to have to drive the investigation and help us understand where the latency is coming from so we can get to the bottom of this.

Specifically, you should start by understanding what the ZFS-layer latency is as seen by QEMU. In other words, what is the I/O latency that QEMU sees from ZFS for a given request? That will immediately help us rule one area of the problem in or out. The best way to gather this data is with DTrace.
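
For example, a rough sketch along these lines (the execname "qemu-system-x86_64" is an assumption -- adjust it to whatever ps shows for your KVM instances) prints a latency histogram for every write-family syscall the QEMU process issues, which is a reasonable first approximation of the latency QEMU sees from ZFS:

  dtrace -n '
    syscall::*write*:entry
    /execname == "qemu-system-x86_64"/
    { self->ts = timestamp; }

    syscall::*write*:return
    /self->ts/
    {
      @["write syscall latency (ns)"] = quantize(timestamp - self->ts);
      self->ts = 0;
    }

    tick-30s { printa(@); trunc(@); exit(0); }'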

Once that data is firmly in hand for everyone's unique case, then we can go from there. Keep in mind that differences in hardware capabilities, etc. may change things greatly here.

What we're trying to do is break this down into one of several areas to focus our attention on. There could be a problem with how the host is issuing the I/O that QEMU requests and how it is syncing it out. Note that the I/O patterns will look different for Windows, for other KVM guests, and for a zone, so it's important to focus specifically on the latency the Windows guest is seeing. In addition to the FS I/O, there is the question of how much time the QEMU process on the host spends waiting to service the I/O: e.g., the length of time from when QEMU learns about the I/O to when it begins to service it, and the length of time from the I/O completing to QEMU notifying the guest. These lengths of time will help us understand what is going on in the process, where we can still easily observe what's happening.

After that, the next thing to understand is how Windows is issuing I/O requests and what it is seeing. But before we go and dig into that, we should take a stab at what is much easier to observe.

@ccrusius

ccrusius commented May 5, 2014

If someone could post a DTrace script to run, with instructions, that would be helpful. I would give SmartOS developers access to the machine if I could, but as it stands it sits behind a corporate firewall.

@IanCollins

I'm pretty sure this is the old lack-of-log-device problem that has been discussed on the mailing list several times. Just to double check (all my previous testing was with Linux guests), I tried the CrystalDiskMark test in a Windows 2008 VM. The pool on my test box is a stripe of 4 mirrors of 2.5" 7200 rpm laptop drives, so it's certainly not flash! Anyway, without the logs the write performance was a blistering 12 MB/s. With two 100 GB Intel 3700s added as logs, this increased to 140 MB/s, more than an order of magnitude improvement (see the zpool add sketch at the end of this comment).

It is worth running "iostat -xtcMn" on the host while running these tests to see how busy the pool disks are. Without a log, I typically see (for one mirror):

r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
0.0  167.5    0.0    8.4  0.0  0.4    0.0    2.5   0  16 c0t5000C500616B8891d0
0.0  166.9    0.0    8.4  0.0  0.4    0.0    2.5   0  16 c0t5000C500616B1721d0
0.0  787.1    0.0   50.3 394.1  2.7  500.7    3.4  23  32 zones

When I add the logs:

The two log devices:
 r/s    w/s  Mr/s  Mw/s  wait actv wsvc_t asvc_t  %w  %b device
 0.0 1609.7   0.0  69.7   0.1  0.2    0.1    0.1   7  18 c2t0d0
 0.0 1611.7   0.0  69.8   0.1  0.2    0.0    0.1   6  18 c2t1d0

One of the mirrors plus the pool totals:
  r/s    w/s  Mr/s  Mw/s  wait actv wsvc_t asvc_t  %w  %b device
  3.2  534.0   0.0  34.0   0.0  1.1    0.0    2.1   0  41 c0t5000C500616B8891d0
  6.3  556.7   0.0  34.0   0.0  1.2    0.0    2.1   0  40 c0t5000C500616B1721d0
 13.1 6392.7   0.1 410.6 141.8  9.3   22.1    1.5  61  84 zones
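
For anyone wanting to try the same thing, log devices are added to an existing pool with zpool add. A sketch only -- the pool and device names below are illustrative, taken from the output above:

  zpool add zones log mirror c2t0d0 c2t1d0   # attach a mirrored SLOG to the "zones" pool
  zpool status zones                         # confirm the log vdev shows up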

@IanCollins

Here is a screenshot of the CrystalDiskMark data:

[screenshot: CrystalDiskMark results]

The read numbers are pretty meaningless; the data would have been read from RAM due to the small file size. The 4K QD32 random write really shows the gains from log devices:

r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
0.0 2585.8    0.0   20.0  0.0  0.1    0.0    0.0   1   8 c2t0d0
0.0 2585.8    0.0   20.0  0.0  0.1    0.0    0.0   1   8 c2t1d0

@bassu
Author

bassu commented Oct 5, 2014

@IanCollins According to the official OpenZFS documentation, the maximum size of a separate ZIL device should be half of the available RAM.

So unless you have 200 GB of RAM in that machine, throwing in a 100 GB log device doesn't make much sense. I added two 20 GB Samsung 840 Pros and couldn't get past 60 MB/s on sequential writes, which I guess has to do with running more than a few VMs with write-intensive usage on the same machine, so the throughput gets divided (maybe not the case in your tests).

Which, unfortunately, is still subpar compared to what Linux VMs give you on the same host.

@IanCollins

The ZIL device can be as big as you want. Only a small part will actually be used (cf. the comment in the documentation: "because that is the maximum amount of potential in-play data that can be stored"), so the overall size doesn't really matter. You'll probably get longer life out of a bigger device if its wear levelling works well. I use 200 GB 3700s on production systems simply because they have double the IOPS of the 100 GB part.

Samsung 840 Pros aren't really write-optimised SSDs and I don't think they have power-fail protection. I've found them fine as cache devices, but not as logs. For log devices you need to look for high sustained random write IOPS, durability and power-fail protection. The 3700s push all the right buttons. If they aren't good enough, consider a RAM-based ZIL, such as a ZeusRAM.

@bassu
Author

bassu commented Oct 5, 2014

Hmm.
ZeusRAMs -- yes, I shall get those when I live in a beach house somewhere on Long Island. LOL. 😀

The problem is that no one these days wants to throw more money at hardware to solve a software problem, one that doesn't exist in Linux VMs on the same host doing the same synchronous I/O.

A problem which, needless to say, also does NOT exist in other KVM implementations!

@davefinster

I see a similar increase. Without a log device, my pool, which consists of 2 x RAIDZ2 vdevs with 8 x 15k SAS disks each, can only do 34.72 MB/s. Adding two SSDs similar to the log devices @IanCollins is using (I have the 100 GB instead of the 200 GB, so not as good), that figure jumps to 270 MB/s.

@IanCollins I am however curious how you're pushing those sorts of read rates. Although they aren't representative of the disks, the highest CrystalDiskMark has ever shown me is 1050 MB/s (even when I had a cache drive).

@IanCollins

@bassu The "problem" isn't with SmartOS KVM, it's the way ZFS handles synchronous writes.

You would probably observe the same numbers with KVM on Linux if the host used ZFS on Linux. SmartOS KVM could probably be set up to lie and not wait for writes to complete, or to add a RAM cache for the virtual drive. You could get the same effect by disabling the ZIL (DON'T! http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29). Both would boost performance at the expense of data integrity. Or you could add an appropriate SSD.

@IanCollins

@davefinster See my note about reading from RAM!

@bassu
Author

bassu commented Oct 5, 2014

@IanCollins I didn't see this problem with ZFS on Linux (although there were many other problems).
Yes, I had to resort to running some of the non-critical VMs with "Linux-style" I/O by disabling sync.

@davefinster

@IanCollins I did! That's why I knew it wasn't representative of the disks, but rather of the ARC. What I'm wondering is what factors would influence the read performance from the ARC?

@bassu
Author

bassu commented Oct 5, 2014

@davefinster Use a good SATA3 HBA. That’s most of it.
@IanCollins ran the benchmark on an idle machine (see the %b in his iostat output).

@davefinster

@bassu I'm using an LSI SAS 9207 HBA - but Ian was referring to the ARC in RAM, not the L2ARC. I also ran my tests on an idle machine.

@ghost

ghost commented Oct 6, 2014

@bassu "The problem is, no one these days wants to throw in more money on hardware to solve a software problem, which, doesn't exist in Linux VMs on the same host doing the same synchronous IO."

@IanCollins "@bassu The "problem" isn't with SmartOS KVM, its the way ZFS handles synchronous writes."

In truth, it's neither. This is not a software problem, it's a design choice. We have chosen to have KVM's writes to zvols treated synchronously. Any implementation making this same design choice will have similar performance characteristics for a given identical backing store. ZFS is not really relevant here; it's just doing what it's told.

The tradeoff here is correctness/durability vs perceived performance. It's no different from having a database call fsync() on transaction commit boundaries: if you do it, your clients see fewer transactions per second; if you don't, there are several failure classes within the system that will allow committed transaction state to be lost. Since the backing store for a guest may be used to store transactional state (we have no way to know what you want to use the guest for), the safe choice here is to do I/O synchronously.

There is a third way, but it requires that the guest observe certain semantics with respect to its virtual block devices, specifically the use of SYNCHRONIZE CACHE or an analogous protocol-appropriate command. This is what ZFS does, and it is what makes the use of (hardware) disk write caches safe (provided the disk provides the documented standard semantics). If you are absolutely sure that your guest's filesystem(s), block layer, SCSI/ATA layer (if applicable), and drivers all flush the write cache on transaction boundaries, then it would be safe to change KVM to provide a device that emulates a disk with the write cache enabled. Of course, you would also have to make sure that everything in KVM required for this to work is in place and working correctly, as well as in virtio, assuming you're using that instead of ATA or SCSI.

There is no reason this cannot work; however, the combination of numerous known-bad guest OSs and a lack of testing and verification of the KVM and virtio code in this regard makes enabling this behaviour expensive and risky. You're welcome to do so if you know your guest is bug-free in this area, or to simply disable synchronous writes entirely on the relevant zvol if your guest is entirely stateless (or disposable). But in general, the balance of engineering judgment favours the approach we've taken.

Instead of asking why we're slow, you could as well ask why your GNU/Linux distributor's KVM I/O configuration is unsafe. They're two sides of the same coin -- and unfortunately, the common buggy guests we're accommodating with our design choice here are none other than GNU/Linux.

All that said, I still haven't seen compelling evidence that the problem being reported here has anything to do with this.

@IanCollins

@wesolows Your answer is obviously correct and complete, Keith, but the design choice (which I completely agree with) does highlight the issue with ZFS and synchronous writes. I'm sure other filesystems give better numbers by sacrificing data integrity, but the perception will remain that SmartOS KVM is "slow". Maybe education (through the wiki?) is the best solution to this recurring "issue"? I also think the installer could offer more help, offering a choice between capacity (raidz2) and performance (mirrors) pool configurations rather than defaulting to raidz2, which is seldom the best choice.

Just to lay the "SmartOS KVM is slow" stuff to rest, here are the numbers for the same system with the ZIL disabled:

[screenshot: CrystalDiskMark results with ZIL disabled]

Not bad for a puny little pool on a pimped-up consumer PC!

@bassu
Author

bassu commented Oct 6, 2014

Just a note for others about what @IanCollins did:
don't disable sync on servers running critical apps or anything requiring transactions, otherwise you may lose data on power or system failures.

@bassu
Author

bassu commented Oct 6, 2014

Man page reference:

Since it is a _performance vs safety_ choice, here are the direct implications of disabling synchronous I/O, from the ZFS reference:

sync=disabled
  Synchronous requests are disabled.  File system transactions
  only commit to stable storage on the next DMU transaction group
  commit which can be many seconds.  This option gives the
  highest performance. 

  However, it is very dangerous as ZFS
  is ignoring the synchronous transaction demands of
  applications such as databases or NFS.

  Setting sync=disabled on the currently active root or /var
  file system may result in out-of-spec behavior, application data
  loss and increased vulnerability to replay attacks.

  This option does *NOT* affect ZFS on-disk consistency.
  Administrators should only use this when these risks are understood.
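
For reference, the property is per dataset, so it can be limited to the zvol backing a disposable or stateless guest rather than applied pool-wide (the dataset name below is only illustrative):

  zfs get sync zones/<vm-uuid>-disk0            # check the current setting
  zfs set sync=disabled zones/<vm-uuid>-disk0   # trade safety for speed -- see the risks above
  zfs set sync=standard zones/<vm-uuid>-disk0   # revert to the default behaviour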

@bassu bassu closed this as completed Jan 17, 2016
@matthiasg

What are the drivers currently accepted as the fastest ones for KVM Windows? I tried the stable and the latest from Fedora and they don't really impress. (This is with an all-SSD pool, with and without a ZIL.)

@sjorge
Contributor

sjorge commented Aug 15, 2019 via email

@matthiasg

matthiasg commented Aug 15, 2019 via email

@matthiasg

@sjorge well, it seems bhyve still has some issues, at least in the documentation department. I simply get a black screen when connecting with VNC (no password set), bhyve does not yet support providing a DHCP server so there are no assigned IPs, and booting from a CD image is also not documented as far as I can see.

Does anybody have any experience booting a Windows KVM virtio boot disk in bhyve?

@sjorge
Contributor

sjorge commented Aug 15, 2019 via email
