Correctly benchmark using FIO w/ examples #46

Open
jnoller opened this issue Feb 10, 2020 · 2 comments

Comments

@jnoller (Owner) commented Feb 10, 2020

Given what we know about the kernel IO path issues and the bottleneck on the kernel / OS disk itself, I need to re-run and fix the tests here: https://stian.tech/disk-performance-on-aks-part-1/

The author does a fantastic job breaking down the performance characterization you would do for most IaaS / non-cloud environments, but misses the underlying behavior w.r.t. the NAS / local disk and other issues. For example, the spiky performance shown in all of the volume tests is what you see when the OS disk is throttled and kernel call latency spikes sharply.

When the OS / kernel itself is stuck in IO wait, even the offloaded data paths become constrained, because the sockets et al. and the container IO itself live on the (limited) OS disk. As writes to ANY disk / PVC increase, you pay a non-zero IOPS cost on the OS disk.
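One way to see that cost is to compare per-device counters from `/proc/diskstats` before and after a benchmark run. The following is only a minimal sketch: the helper names are hypothetical, and which device is the OS disk and which is the PVC (for example `sda` and `sdc`) is an assumption that varies per node.

```python
# Sketch: per-device IOPS and busy fraction from two /proc/diskstats snapshots.
# Column positions follow the kernel's diskstats layout: after major, minor
# and device name come reads completed (index 3), writes completed (index 7)
# and milliseconds spent doing IO (index 12).

def parse_diskstats(text):
    """Return {device: (reads_completed, writes_completed, io_ticks_ms)}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or truncated lines
        stats[fields[2]] = (int(fields[3]), int(fields[7]), int(fields[12]))
    return stats


def device_load(before, after, interval_s):
    """Compare two snapshots taken interval_s seconds apart."""
    first = parse_diskstats(before)
    second = parse_diskstats(after)
    load = {}
    for name, (reads1, writes1, ticks1) in second.items():
        if name not in first:
            continue  # device appeared between snapshots
        reads0, writes0, ticks0 = first[name]
        load[name] = {
            "iops": ((reads1 - reads0) + (writes1 - writes0)) / interval_s,
            "util": (ticks1 - ticks0) / (interval_s * 1000.0),
        }
    return load


def read_diskstats():
    """Read the live counters on a Linux node."""
    with open("/proc/diskstats") as handle:
        return handle.read()
```

Taking one `read_diskstats()` snapshot before the fio run and one after, then passing both to `device_load`, shows whether the OS disk picks up IO while the benchmark targets a different volume.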

This is why saturation metrics, in the USE (utilization, saturation, errors) sense, need to be watched at the node, pool and cluster levels.
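At the node level, a rough proxy for "the kernel is stuck in IO wait" is the iowait share of CPU time in `/proc/stat`. This is a minimal sketch with hypothetical helper names. It covers a single node only, and iowait is a coarse signal, not a full saturation metric.

```python
# Sketch: share of CPU time spent in iowait between two /proc/stat samples.
# The aggregate "cpu" line lists: user nice system idle iowait irq softirq
# steal guest guest_nice. Guest time is already counted in user/nice, so only
# the first eight columns are summed.
import time


def parse_cpu_line(line):
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in fields[1:]]


def iowait_fraction(before, after):
    """Fraction of elapsed CPU time spent waiting on IO (0.0 to 1.0)."""
    first = parse_cpu_line(before)
    second = parse_cpu_line(after)
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0  # no time elapsed between samples
    return deltas[4] / total


def sample_node_iowait(interval_s=1.0):
    """Take two live samples on a Linux node, interval_s seconds apart."""
    def read_cpu_line():
        with open("/proc/stat") as handle:
            return handle.readline()
    before = read_cpu_line()
    time.sleep(interval_s)
    return iowait_fraction(before, read_cpu_line())
```

Running `sample_node_iowait()` in a loop during a benchmark gives a per-node series that can be lined up against the latency spikes in the fio output.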

@StianOvrevage commented Feb 24, 2020

@jnoller Thanks so much for the compliments. You are right: I had no knowledge of what happens under the hood at a cloud provider, so I approached it as well as I could with the tools I had at the time.

I'm adding a header to the post pointing to this issue so readers get a better picture. Hopefully I'll have time to go back and re-do it with the new info at hand :)

@jnoller (Owner, Author) commented Feb 24, 2020

@StianOvrevage Feel free to pull me in / ping me. The big thing isn't the cloud provider logic per se. There's a large piece missing in the 'cloud native' world around systems engineering, performance and reliability testing at production scale. A lot of what I see in the community as a whole is simulation-based testing that covers, say, specific functionality, but not the behavior.

I think this is wider than AKS / a single vendor :\

Projects: IO planning (In progress)