The author does a fantastic job breaking down the performance characterization you would do for most IaaS / non-cloud environments, but missed the underlying behavior with respect to the NAS / local disk and other issues. For example, the spiky performance shown in all of the volume tests is what you see when the OS disk is throttled and kernel call latency spikes exponentially.
When the OS / kernel itself is locked in IO wait, even the offloaded data paths become constrained, since the sockets and the container IO itself live on the (limited) OS disk. As writes to ANY disk / PVC increase, you pay a non-zero IOPS cost on the OS disk.
This is why saturation metrics need to be watched (per the USE method) at the node, pool, and cluster levels.
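For anyone who wants to watch this directly, here's a minimal sketch (Python, reading /proc on the node; the device name `sda` is an assumption, adjust it to whatever the node's actual OS disk is) of the kind of USE-style saturation check I mean, sampling CPU iowait and OS-disk busy time together:

```python
#!/usr/bin/env python3
"""Minimal USE-style saturation probe for a Linux node (sketch only).

Samples /proc/stat (CPU iowait) and /proc/diskstats (per-device busy
time) over an interval. OS_DISK = "sda" is an assumption -- substitute
the node's actual OS disk device.
"""
import time

OS_DISK = "sda"  # assumption: the throttled OS disk
INTERVAL = 5.0   # seconds between samples

def read_cpu():
    # First line of /proc/stat: cpu user nice system idle iowait irq ...
    with open("/proc/stat") as f:
        parts = f.readline().split()
    ticks = [int(x) for x in parts[1:]]
    return sum(ticks), ticks[4]  # (total ticks, iowait ticks)

def read_disk_busy_ms(dev):
    # Field 13 of each /proc/diskstats line is cumulative ms spent doing I/O.
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                return int(fields[12])
    raise ValueError(f"device {dev} not found in /proc/diskstats")

t0, w0 = read_cpu()
b0 = read_disk_busy_ms(OS_DISK)
while True:
    time.sleep(INTERVAL)
    t1, w1 = read_cpu()
    b1 = read_disk_busy_ms(OS_DISK)
    iowait_pct = 100.0 * (w1 - w0) / max(t1 - t0, 1)
    util_pct = 100.0 * (b1 - b0) / (INTERVAL * 1000.0)
    print(f"iowait={iowait_pct:5.1f}%  {OS_DISK} busy={util_pct:5.1f}%")
    t0, w0, b0 = t1, w1, b1
```

Run it on the node (or from a privileged pod with the host's /proc visible) while a volume benchmark is going: when iowait and OS-disk busy % climb together with load on a *data* disk, you're seeing the shared IOPS cost described above.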
@jnoller Thanks so much for the compliments. You are right, I did not have any knowledge of what's happening under the hood of the cloud provider, so I approached it as best I could with the tools I had at the time.
I'm adding a header to the post pointing to this issue so readers get a better picture. Hopefully I'll have time to go back and redo it with the new info at hand :)
@StianOvrevage Feel free to pull me in / ping me. The big thing isn't the cloud provider logic per se - there's a large piece missing in the 'cloud native' world around systems engineering / performance / reliability testing at production scale. A lot of what I see in the community as a whole is simulation-based testing that exercises, say, specific functionality, but not the behavior under load.
I think this is wider than AKS / a single vendor :\
Any update here? I am running into some IO issues on AKS, and after some research it looks like there is not much information around. I'll be running some tests myself, so any input on AKS / cloud specifics would be helpful.
Given what we know about the kernel IO path issues and the bottleneck on the kernel / OS disk itself, I need to re-run / fix the tests here: https://stian.tech/disk-performance-on-aks-part-1/
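Roughly, the re-run needs to pair the fio job against the PVC with the saturation probe above, so the OS-disk impact shows up in the same window as the volume numbers. A sketch of what I mean (paths and job parameters are illustrative, not the article's exact jobs; assumes fio is installed and the PVC is mounted at /mnt/pvc):

```python
#!/usr/bin/env python3
"""Sketch: run a fio random-write job against a PVC and pull the latency
percentiles from its JSON output. TARGET and the job parameters are
assumptions, not a reproduction of the original article's jobs."""
import json
import subprocess

TARGET = "/mnt/pvc/fio-test"  # assumption: a file on the mounted PVC

result = subprocess.run(
    [
        "fio",
        "--name=pvc-randwrite",
        f"--filename={TARGET}",
        "--rw=randwrite",
        "--bs=4k",
        "--iodepth=32",
        "--ioengine=libaio",
        "--direct=1",
        "--size=1g",
        "--runtime=60",
        "--time_based",
        "--output-format=json",
    ],
    capture_output=True, text=True, check=True,
)
job = json.loads(result.stdout)["jobs"][0]["write"]
# The clat_ns percentiles are where the throttling shows up: a wide gap
# between p50 and p99 is the "spiky" behavior from the article.
pct = job["clat_ns"]["percentile"]
print(f"IOPS={job['iops']:.0f}  "
      f"p50={pct['50.000000'] / 1e3:.0f}us  "
      f"p99={pct['99.000000'] / 1e3:.0f}us")
```

With the node-side probe running alongside, you can then correlate the p99 spikes on the PVC with iowait / OS-disk busy % on the node rather than attributing them to the data disk alone.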