Correctly benchmark using FIO w/ examples #46
Given what we know about the kernel IO path issues and the bottleneck on the kernel / OS disk itself, I need to re-run / fix the tests here: https://stian.tech/disk-performance-on-aks-part-1/
The author does a fantastic job breaking down the performance characterization you would do for most IaaS / non-cloud environments, but missed the underlying behavior w.r.t. the NAS/local disk and other issues. For example, the spiky performance shown across all the volume tests is an example of what happens when the OS disk is throttled and kernel call latency spikes exponentially.
When the OS / kernel itself is locked in IO wait, even the offloaded data paths become constrained, because sockets et al. and the container IO itself live on the (limited) OS disk. As writes to ANY disk / PVC increase, you pay a non-zero IOPS cost on the OS disk.
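As a concrete sketch of the kind of re-run test this implies (the mount point, sizes, and runtimes here are illustrative assumptions, not taken from the original post), a fio job can target the PVC-mounted data disk with direct I/O, so page-cache traffic doesn't hide the OS-disk cost described above:

```ini
; sketch of a fio job against a PVC-mounted data disk --
; directory, size, and runtime are hypothetical values
[global]
ioengine=libaio
direct=1            ; bypass the page cache so cached writes don't mask real IOPS
time_based=1
runtime=120
group_reporting=1

[randwrite-pvc]
directory=/mnt/pvc  ; hypothetical PVC mount point
rw=randwrite
bs=4k
iodepth=32
numjobs=4
size=4G
```

Running the same job once against the PVC mount and once against a path on the OS disk makes the throttling asymmetry visible in the latency percentiles fio reports.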
This is why the saturation metrics need to be watched (per the USE method) at the node, pool, and cluster levels.
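A minimal way to watch the saturation side of this while a benchmark runs (assuming a Linux node; the field position comes from the documented `/proc/stat` layout) is to sample the cumulative iowait counter. If it climbs rapidly during a PVC test, the node's OS disk, not the volume under test, is the likely bottleneck:

```shell
# Cumulative iowait jiffies are the 5th value after "cpu" in /proc/stat
# (fields: user nice system idle iowait ...). Sample before and after a
# benchmark run; a large delta means the node spent that time blocked on IO.
iowait=$(awk '/^cpu /{print $6}' /proc/stat)
echo "cumulative iowait jiffies: ${iowait}"
```

In practice you would sample this per node in the pool (or just watch `iostat -x`) and correlate spikes with the dips in the fio results.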
@jnoller Thanks so much for the compliments. You are right, I did not have any knowledge of what's happening under the hood of a cloud provider, so I approached it as well as I could with the tools I had at the time.
I'm adding a header to the post pointing to this issue so readers get a better picture. Hopefully I'll have time to go back and re-do it with the new info at hand :)
@StianOvrevage Feel free to pull me in / ping me. The big thing isn't the cloud provider logic per se; there's a large piece missing in the 'cloud native' world around systems engineering / performance / reliability testing under production scale. A lot of what I see in the community as a whole is simulation based: it tests, say, specific functionality, but not the behavior.
I think this is wider than AKS / a single vendor :\