Performance skew on VMs (Hyper-V and VMware) based on virtual RAM #139

Karl-WE opened this issue Jul 2, 2020 · 1 comment

Karl-WE commented Jul 2, 2020

There are reports that the latest version of diskspd produces non-reproducible performance "issues" depending on the RAM size of the VM. It seems that with less RAM the tool is less aggressive about utilizing it.
Please don't get distracted by the other topics in the linked thread; they have no relation to diskspd.exe.

Quote:

Yes, those panels look similar - scary.

However, the caching is ON by default. I guess I could try turning the caching OFF for a volume and seeing if the results worsened.

But all of this does not explain why I can get nearly 50% more/better throughput and IOPS with a heavy write workload simply by going from 64GB of vRAM to 128-192GB of vRAM on the same W2019 machine, running in the exact same infrastructure. I can look at the memory use and tell it's using the cache, but when I go to 128GB, it obviously uses a lot more, and more aggressively. I mean, nothing else is changing at all! I can get the same results using W2012/W2016 with only 64GB, which is much more reasonable.

This is in the earlier postings but emphasizing the memory use here....

Image: https://content.spiceworksstatic.com/service.community/p/post_images/0000401220/5efcdfa9/attached_image/itperf-ixxxx-W006-NC0W2019-MemoryUseCapture-20200701.jpg

https://community.spiceworks.com/topic/post/8907183

Thanks for looking into it

Detailed information:
source: https://community.spiceworks.com/topic/post/8905088 and following

I apologize for the length of this posting and the time it has taken to write this up.

I will have to post in 2 parts.

While this thread is mostly about remote I/O, we are seeing very poor results with "local I/O". Here's what we reproduced and know so far.

BLUF – Surprise! Surprise! W2019’s I/O performance is memory dependent!
Problem: On a Windows 2012R2 VM, our team achieved ~ 5400 MiB/sec with 5 CPUs writing large 64K-block I/O locally on an 8-CPU VM. That is ~ 1100 MiB/sec/CPU. On a Windows 2019DC VM, using the same test workloads in the same infrastructure, we were able to achieve only ~ 3200 MiB/sec. The difference of ~ 2200 MiB/sec is significant for heavy I/O workloads. Across 5 CPUs, Windows 2012R2 consistently delivers another ~ 440 MiB/sec/CPU! When we attempted to use Windows 2019DC in our CICD pipelines, the observed low I/O rates increased our run times into wholly unacceptable ranges!

Solution A: Install a minimum of 128GB of memory to improve W2019 I/O performance. When we used 192GB of memory, results remained consistent – even with overlapping OS cache flushes of the 50GB random (incompressible) files.

Solution B: Allow engineers to move existing W2012R2DC VM’s to the new infrastructure and continue using W2012R2DC VMs “as-is”.

Solution C: Continue pursuing our open Microsoft case for answers on how to make W2019DC with 64GiB of memory perform as well as or better than W2012R2DC with 64GiB of memory. Clearly, with what we've learned thus far, we cannot do that today.

We are pursuing Solutions B and C.

Why?

B) Project leads concluded that increasing each VM’s memory footprint by 2x-3x to gain the I/O performance needed for CICD pipelines is not cost effective. Plus, there are resource constraints (in a now-shortened timeline) to retool and retest applications for a W2019 CICD pipeline.

C) Our case remains open and active with Microsoft’s Engineering team.

Big Thanks!

Before we deep-dive, I want to say “Thank You!” to the following team who assisted in testing, migrating VMs, granting access, advancing theories, and testing independently.

· (removed for privacy)

· …and anyone I’ve omitted!

TL;DR;

After 6 months of internet mining; testing in at least 3 different environments; testing setups with varying storage providers (Pure, Dell EMC XtremIO); testing connectivity options (10GbE, 40GbE, 100GbE, 32Gb FC); varying server hardware (Dell/Lenovo); updating VMware hardware, tools and drivers; testing at two site locations; and opening tickets with Microsoft, we were at an impasse. Microsoft could not “recreate” our results, and we clearly had results Microsoft could not explain. Another key driver was that we had a single customer running W2019STD VMs which were delivering outstanding I/O performance! His VMs were easily delivering 2x+ what we were seeing with our setup and W2019DC VMs. His VMs were lapping us around the track without breaking a sweat!

We needed desperately to find that setup’s “magic pixie dust!”

Well folks, after 1000’s of test runs and countless tweaks, it dawned on the other engineer and me that the only crazy thing we had not tested was that his VMs were running with 192GB while ours were at 64GB. We joked, “…could that be the root cause?” Our 64GB VM had at least 60GB of free memory, so when running our simple diskspd workload there was plenty of memory for OS cache, even with our difficult 50GB random and incompressible file. W2019 should be sitting pretty with plenty of file cache, or so I thought. “OK – long shot – let’s go for it!”

Update 26 Jun 2020

· We expanded testing to include ReFS formatted storage at 4KB & 64KB. Earlier testing used NTFS formatted storage with 4KB & 64KB block sizes (a formatting sketch follows this update). Results collected from ReFS follow the NTFS results – very poor performance at 64GB and much improved performance at 128-192GB of vRAM. To me, that suggests the underlying file system is not the root cause of our poor results.

· As of 26 Jun 2020, we’ve had no positive updates from Microsoft. We plan to escalate again after the July 4th holidays.
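As a side note, here is a minimal sketch of the kind of formatting behind those 4KB/64KB NTFS and ReFS passes; the drive letter and volume labels are placeholders rather than our actual volumes, and Format-Volume wipes the target:

# Reformat the test volume (one variant at a time, before each test pass). Destroys existing data!
Format-Volume -DriveLetter T -FileSystem NTFS -AllocationUnitSize 4096  -NewFileSystemLabel 'perf-ntfs-4k'  -Force
Format-Volume -DriveLetter T -FileSystem NTFS -AllocationUnitSize 65536 -NewFileSystemLabel 'perf-ntfs-64k' -Force
Format-Volume -DriveLetter T -FileSystem ReFS -AllocationUnitSize 4096  -NewFileSystemLabel 'perf-refs-4k'  -Force
Format-Volume -DriveLetter T -FileSystem ReFS -AllocationUnitSize 65536 -NewFileSystemLabel 'perf-refs-64k' -Force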

Background – Where’s the performance?
In February 2020, our teams began investigating why a major system upgrade project was stalling. Our excellent IT engineers had carefully designed a brand new, what should have been highly performing, environment. At the new site, we deployed leveraging 100GbE iSCSI with single-hop networking targeting all-flash storage, running on high-frequency Cascade Lake pCPUs with BIOS tuning for maximum performance. The shocker was that the CICD build jobs running on the new setup were taking much longer than on the existing, older systems. To investigate, we began problem isolation and comparisons of “What’s different in our new vs. old setups?” Turns out – a lot. New storage, new networking, iSCSI vs. FC, different server vendors, different pCPUs, different OSes, different VMware versions, different patch levels, different locations, …. OK – you get it. So, we backtracked and created a W2019DC VM on the existing setup, and performance was horrible there as well. That’s when we learned it probably was not the hardware, though we could not take that off the table yet.

Falling back on the general rule that ~ 98% of all “performance problems” can be traced to some level of storage issues, we started there. For example, excessive latencies, too many hops, the wrong MTU sizes, out-of-order frame delivery, generic off-brand cabling, adapter and system firmware versions, BIOS settings, poor HBA or NIC tuning, … not easy to sort.

Our action plan was to leverage storage I/O benchmarks that present difficult, write-heavy, large-block workloads which stress almost any storage system (with a few exceptions…). We targeted the VM’s “local storage” to keep the tests simpler. We learned quickly that with large-block (64K), heavy-write (100%) workloads, W2019DC’s I/O performance was horrible when compared to a similar W2012R2 VM. This was consistent on any storage platform, any network, and any location. W2019DC performed badly right beside the same W2012R2 VM on the same hardware, same networks, using the same storage. We were, select your own descriptive modifier for emphasis, baffled!

Our Epiphany
When we increased the total memory on our W2019STD test box from 64GB to 192GB, we hit the I/O performance jackpot! I nearly fell outta my seat with my head in my hands wondering, “How @(@(@ COULD THIS BE TRUE!?”

Further testing has proven a consistent correlation between W2019 I/O performance and the amount of installed memory. We can easily demonstrate huge I/O variances between a Windows 2012 R2 VM with 64GB of memory and a Windows 2019DC VM with 64GB. We can erase those variances by increasing the memory of the W2019STD/DC VM. When we provide the same W2019DC VM with 128GB-192GB of memory, I/O performance improves dramatically with large-block, heavy-write I/O and nears our best W2012R2 results!
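As a side note, here is a minimal sketch of how the file cache and free memory can be watched while a run is in flight; these are standard Windows memory counters, and the 5-second interval with 13 samples (roughly one 65-second run) is arbitrary:

# Sample the OS file cache and available memory every 5 seconds during a diskspd run.
Get-Counter -Counter '\Memory\Cache Bytes', '\Memory\Available MBytes' -SampleInterval 5 -MaxSamples 13 |
    ForEach-Object { $_.CounterSamples | Select-Object Path, CookedValue }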

Color me confused
How could this happen? It has been reported widely that Microsoft completely re-wrote the W2019 I/O stack. So, what did the Microsoft engineers do to impart such poor performance? I don’t know. You see, during our tests probably 55GB of the 64GB of memory is free and not in use. One other thing we know is that Microsoft changed the default registry value for “LargeSystemCache” between W2012R2 (1) and W2019DC (0). We tried setting the W2019 value to (1) early on, but it seemed to have no impact at 64GB. I believe this is not a coincidence.
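For reference, a minimal sketch of how that value can be inspected and changed from an elevated PowerShell prompt; this only reflects the tweak we tried (a reboot is generally recommended after changing it), not a recommended fix:

# Read the current LargeSystemCache setting (1 on our W2012R2, 0 on our W2019 systems).
$mm = 'HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management'
(Get-ItemProperty -Path $mm -Name LargeSystemCache).LargeSystemCache
# Set it to 1 to favor the system cache, as W2012R2 does; requires admin rights and a reboot.
Set-ItemProperty -Path $mm -Name LargeSystemCache -Value 1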

We opened a Microsoft case in early April 2020. We have shared all the earlier findings and tests along with this new revelation with Microsoft’s representative. Thus far, no answers. Frankly, the engineer seemed as surprised at our proven results as we were.

What’s diskspd and why use it?

  1. We selected diskspd because it is a powerful Microsoft developed tool for stressing I/O on Windows OSes.

  2. We wanted to “remove doubts” whether using a 3rd party tool like FIO, CrystalDiskMark, ATTO, IOMETER, or any other stress tests were responsible for the varying results.

  3. diskspd is powerful and I love that it tracks the tail latencies, which show the wild variances in the I/O stream. Those are early canaries of impending problems.

  4. Download diskspd here -> https://gallery.technet.microsoft.com/DiskSpd-A-Robust-Storage-6ef84e62

  5. More about diskspd:

  6. https://dzone.com/articles/server-and-storage-io-0

  7. https://blog.purestorage.com/so-long-sqlio-hello-diskspd/

  8. https://blog.sqlterritory.com/2018/03/27/how-to-use-diskspd-to-check-io-subsystem-performance/

  9. https://storageioblog.com/server-storage-io-benchmarking-tools-microsoft-diskspd-part/

  10. The diskspd command we used and why. Be very careful with upper/lower-case parameters, as Microsoft re-used the same letters in different cases for different functions!

  11. .\diskspd -L -b64k -c50G -F5 -d65 -Z4M -Sb -w100 C:\temp\itperf-50GBNoCompress01.txt

  12. -L = Collect and display the critical latency stats

  13. -b64k -c50G = Create a 50GB test file and write with a large (64K) block size

  14. -F5 = Run with 5 threads (of 8 CPUs). So this is a compromise. Normally, our compute engines consume all available OS CPUs with 1 writer thread per CPU. For this test, it is an 8-CPU VM, and since this workload is not 100% stats, 5 of 8 is my compromise.

  15. -d65 = Run for 65 seconds, the minimum duration we use to stress the system harder.

  16. -Z4M = use a 4MB buffer with RANDOM data to feed writes.

  17. -Sb = Leverage the OS’s cache vs. direct write-through, which bypasses the OS file cache (see the cached vs. uncached comparison sketch after this list).

  18. -w100 = Perform 100% writes (most difficult workload)

  19. C|D|E:\… = Target file 01, 02, 03, ….
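To separate OS-cache effects from the raw storage path, here is a minimal comparison sketch using the same target file; -Su disables diskspd's software caching and -Sh additionally disables hardware write caching:

# Baseline: buffered writes through the OS file cache (the command used above).
.\diskspd -L -b64k -c50G -F5 -d65 -Z4M -Sb -w100 C:\temp\itperf-50GBNoCompress01.txt
# Comparison: software caching disabled, so results reflect the storage path rather than RAM.
.\diskspd -L -b64k -c50G -F5 -d65 -Z4M -Su -w100 C:\temp\itperf-50GBNoCompress01.txt
# Strictest: software caching and hardware write caching both disabled (write-through).
.\diskspd -L -b64k -c50G -F5 -d65 -Z4M -Sh -w100 C:\temp\itperf-50GBNoCompress01.txt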

Other conclusions and findings:
These findings surfaced from our tests regardless of the W2012 vs. W2019 comparison.

  1. CiscoAMP

  2. Running CiscoAMP consumes 3%-6% of the system’s I/O performance whether the target write directory is excluded by CiscoAMP’s setup or not!

  3. Not installing CiscoAMP yielded a ~ 3-4% immediate performance improvement.

  4. CiscoAMP installs “file filters” into the OS’s filter stack.

  5. Updating VMware’s Hardware (vHW) and Tools (vT) in VMware VMs can have a significant impact.

  6. Updating Windows drivers to current, after step 2, is also important to performance.

  7. We saw gains of 4.5% – 19.5% after updating Windows drivers, vT and vHW in W2012 testing.
    
  8. These are well understood best practices but often fall by the wayside.

  9. Far too often, only “security fixes” are applied by IT. This approach omits critical performance driver updates which are not security related. IT should apply ALL fixes, not just security fixes!

  10. Using the default LSI SCSI driver vs. the VMware-optimized PVSCSI driver had a smaller impact on heavy I/O performance than I anticipated.

  11. Testing using ReFS vs. NTFS shows no significant change in the 64GB vs. 128GB-192GB results. Overall, ReFS shows higher latencies than NTFS with both 4KB and 64KB formatting. I would not use ReFS based on these results.

  12. Performance in the new environment (Single Hop 100GbE, Pure Storage, tuned CascadeLakes) is better when compared to the same W2012R2 VM in the older original environment.

  13. When using a smaller MTU of 1514 vs. an MTU of 9000, performance improved by 3.2%. This is contrary to expectations, especially with large-file sequential workloads being slammed down the vNICs and pNICs. And yes, we verified a large MTU frame would pass from host to storage (see the jumbo-frame check sketch after this list).

  14. On W2012, using the PVSCSI driver with a GPT/64K format decreased performance by ~ 10% compared to the LSI/4K C: defaults. This was very unexpected and quite contrary to expectations! Using a 64K formatted block size on a large 1TB volume should have yielded better performance from the better size alignment. However, because these are remote storage systems, something else could be in play. This should be retested on local physical hardware in the future.

  15. More to come…
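On item 13, here is a minimal sketch of the kind of jumbo-frame verification we mean; the IP address is a placeholder for your storage/iSCSI portal:

# From the Windows guest: a do-not-fragment ping sized for a 9000-byte MTU
# (8972 bytes of ICMP payload + 20-byte IP header + 8-byte ICMP header = 9000).
ping -f -l 8972 192.168.100.10    # replace with your storage target IP
# The equivalent check from an ESXi host shell (not PowerShell) would be:
# vmkping -d -s 8972 192.168.100.10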

Prove it!
Below are screen caps of various recent runs comparing them to known baselines. We typically ran 10-20 samples to smooth human and system inconsistencies. We have multiple pages of these going back to early February 2020 testing.
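For context, a minimal sketch of how repeated samples like these can be captured for later comparison, assuming the same command as above; the run count and output file names are only illustrative:

# Run the workload 10 times and keep each full diskspd report.
1..10 | ForEach-Object {
    .\diskspd -L -b64k -c50G -F5 -d65 -Z4M -Sb -w100 C:\temp\itperf-50GBNoCompress01.txt |
        Out-File -FilePath ("C:\temp\diskspd-sample{0:D2}.txt" -f $_)
}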

Screen caps with notations
W2012R2/nc0w001/iSCSI/100GbE/AMP Baseline

Image: https://content.spiceworksstatic.com/service.community/p/post_images/0000401020/5efa618d/attached_image/itperf-Ixxxx-W001-NC0W2012-BestBaseline-20200619.jpg

W2019STD/nc0w006/iSCSI/100GbE/NoAMP/64GB
Image: https://content.spiceworksstatic.com/service.community/p/post_images/0000401021/5efa61ad/attached_image/itperf-Ixxxx-W006-NC0W2019-64GB-20200608-02.jpg

W2019STD/nc0w006/iSCSI/100GbE/AMP/96GB – serialized, cache cleared
Image: https://content.spiceworksstatic.com/service.community/p/post_images/0000401022/5efa61c7/attached_image/itperf-Ixxxx-W006-NC0W2019-96GB-20200608-Serial01-02.jpg

W2019STD/nc0w006/iSCSI/100GbE/AMP/96GB – cache overlaps, poor results – easy to push over the edge into poor performance again.
Image: https://content.spiceworksstatic.com/service.community/p/post_images/0000401023/5efa61dd/attached_image/itperf-Ixxxx-W006-NC0W2019-96GB-20200608-Serial02-02.jpg

W2019STD/nc0w006/iSCSI/100GbE/AMP/128GB

Image: https://content.spiceworksstatic.com/service.community/p/post_images/0000401024/5efa620c/attached_image/itperf-Ixxxx-W006-NC0W2019-128GB-20200608-02.jpg

Part 2:

W2019STD/nc0w006/iSCSI/100GbE/AMP/192GB
Image: https://content.spiceworksstatic.com/service.community/p/post_images/0000401025/5efa62dc/attached_image/itperf-Ixxxx-W006-NC0W2019-192GB-20200608-02a.jpg

Summary runs XLS comparisons (NTFS ONLY)
Image: https://content.spiceworksstatic.com/service.community/p/post_images/0000401026/5efa62fd/attached_image/itperf-Ixxxx-W006-XLS-NC0W2019-64GB-192GB-Blur-20200609.jpg

Summary runs XLS comparisons (ReFS vs NTFS)
Image: https://content.spiceworksstatic.com/service.community/p/post_images/0000401027/5efa6318/attached_image/itperf-Ixxxx-W006-NC0W2019-ReFS-20200623.jpg


gmobley commented Jul 2, 2020

Hi, I'll be glad to provide more information. Sorry for the long posting, which covers 5+ months of digging. If you read "BLUF – Surprise! Surprise! W2019’s I/O performance is memory dependent!" that should provide a good summary. At this point, I do not believe there is a problem with diskspd. I am convinced diskspd has helped us identify a serious issue in W2019. I'd say this issue against diskspd should be closed unless we have good reason to believe otherwise. Peace, and stay safe!
