Heapster must be tested to ensure that it meets our v1.0 scalability goals: 100-node clusters (#3876) running 30-50 pods each (#4188), i.e. on the order of 3,000-5,000 pods total. A soak test might also be very helpful.
Heapster needs to expose some metrics to aid in scalability testing.
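As a rough illustration of the kind of metrics endpoint we have in mind, here's a minimal sketch using Go's standard-library `expvar` package; the metric names and port here are hypothetical, not anything Heapster exposes today:

```go
// Hypothetical sketch: expose internal Heapster counters over HTTP via
// expvar. Metric names and port are illustrative only.
package main

import (
	"expvar"
	"net/http"
	"time"
)

var (
	// Cumulative count of metric points pushed to the sink.
	pointsExported = expvar.NewInt("heapster.points_exported")
	// Last observed end-to-end scrape latency, in milliseconds.
	scrapeLatencyMs = expvar.NewInt("heapster.scrape_latency_ms")
)

// recordScrape would be called after each poll of the nodes.
func recordScrape(start time.Time, points int) {
	pointsExported.Add(int64(points))
	scrapeLatencyMs.Set(int64(time.Since(start) / time.Millisecond))
}

func main() {
	// expvar registers itself on http.DefaultServeMux at /debug/vars.
	go http.ListenAndServe(":8082", nil)
	recordScrape(time.Now(), 0) // placeholder; the real poll loop would call this
	select {}
}
```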
Abhi (@ArtfulCoder) and I will work on this together.
Ideally we'd like to measure the following metrics:
These internal metrics require a Heapster instrumentation infrastructure that doesn't yet exist. We'll therefore treat this as lower priority and likely a post-v1.0 task.
Instead we'll focus on getting a baseline of the following basic process metrics for Heapster:
These are available today because Heapster is run in a container.
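For example, here is a minimal sketch (assuming cgroup v1, which these clusters run) of reading those per-container process stats from inside the container:

```go
// Sketch: sample the basic process metrics available for a containerized
// Heapster straight from its own cgroup v1 files.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readUint parses a cgroup file containing a single integer value.
func readUint(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	// From inside the container, these paths refer to its own cgroup:
	// cumulative CPU time in nanoseconds, and current memory usage in bytes.
	cpuNs, _ := readUint("/sys/fs/cgroup/cpuacct/cpuacct.usage")
	memBytes, _ := readUint("/sys/fs/cgroup/memory/memory.usage_in_bytes")
	fmt.Printf("cpu=%d ns, mem=%d bytes\n", cpuNs, memBytes)
}
```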
For a first stab at this, we plan on doing the following:
Abhi and I set up a GCE cluster with 4 nodes yesterday. We scheduled 275 pods (1 container each) on the cluster. Within an hour, Heapster stopped sending data to GCM because we hit quota limits:
By morning, the error had switched to:
To bypass this, I will try to get the quota increased; in the meantime, I'll set up a script to scrape docker stats off the machine directly.
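Something along these lines (a throwaway sketch; it assumes the docker CLI on the node supports `docker stats --no-stream` without explicit container names):

```go
// Sketch: periodically shell out to `docker stats --no-stream` and append
// the raw, timestamped table to a log file for offline analysis.
package main

import (
	"os"
	"os/exec"
	"time"
)

func main() {
	f, err := os.OpenFile("docker_stats.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// One-shot snapshot of all running containers every 30 seconds.
	for {
		if out, err := exec.Command("docker", "stats", "--no-stream").Output(); err == nil {
			f.WriteString(time.Now().Format(time.RFC3339) + "\n")
			f.Write(out)
		}
		time.Sleep(30 * time.Second)
	}
}
```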
The quota issue is expected.
Here is the initial set of results.
Heapster and InfluxDB memory usage grows proportionally with the number of pods/containers.
(Charts: Heapster memory usage; InfluxDB memory usage; Fluentd/ElasticSearch memory usage.)
CPU usage is reported as a cumulative counter and thus increases over time, so the rate of increase is what matters. The most interesting signal is the relative rate of use over time: in particular, InfluxDB appears to use almost an order of magnitude more CPU than ElasticSearch and even Heapster (I saw InfluxDB consuming, at times, 80% of the CPU on its machine).
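To make "rate of increase" concrete, here's a tiny sketch of how two cumulative samples turn into the utilization figure quoted above:

```go
// Sketch: derive average CPU utilization from a cumulative counter.
package main

import (
	"fmt"
	"time"
)

// cpuUtilization converts two cumulative CPU-time samples (in nanoseconds)
// into average utilization over the interval, as a fraction of one core.
func cpuUtilization(prevNs, currNs uint64, elapsed time.Duration) float64 {
	return float64(currNs-prevNs) / float64(elapsed.Nanoseconds())
}

func main() {
	// 8e8 ns of CPU consumed in a 1s window => 0.8 cores, i.e. the ~80%
	// peak observed for InfluxDB.
	fmt.Printf("%.0f%% of one core\n", 100*cpuUtilization(0, 8e8, time.Second))
}
```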
The Heapster, ElasticSearch, InfluxDB, and Grafana containers never restarted during the test.
These appear to be the numbers Google is using to hit their 100-node 1.0 goal, per perf testing done under kubernetes/kubernetes#5880. This looks like 12X less data, and I've been finding InfluxDB unresponsive somewhere between 10 and 20 nodes, so maybe this is all the breathing room we need.
These appear to be the numbers Google is using to hit their 100-node 1.0 goal, per perf testing done under kubernetes/kubernetes#5880. The defaults are a 10s poll interval and 5s resolution, so this should back off load by about an order of magnitude. TODO: drop the verbose flag once finished debugging.
These appear to be the numbers Google is using to hit their 100-node 1.0 goal, per perf testing done under kubernetes/kubernetes#5880. The defaults are a 10s poll interval and 5s resolution, so this should back off load by about an order of magnitude. We're using `avoidColumns=true` to force Heapster to avoid additional columns and instead append all metadata into the series names. It makes the series names ugly and hard to aggregate on the Grafana side, but it wildly reduces CPU load. I guess that's why the InfluxDB docs recommend more series with fewer points over fewer series with more points. Grafana's kraken dashboard has been updated to use the new series.
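To illustrate the trade-off (the naming scheme below is hypothetical, not Heapster's exact format):

```go
// Illustrative only: fold metadata into the series name, as avoidColumns=true
// does conceptually, instead of storing it in InfluxDB columns.
package main

import "fmt"

func seriesName(metric, pod, container string) string {
	// Each unique pod/container pair becomes its own cheap, append-only series.
	return fmt.Sprintf("%s/pod:%s/container:%s", metric, pod, container)
}

func main() {
	fmt.Println(seriesName("cpu/usage_ns", "frontend-7x2kq", "nginx"))
	// With avoidColumns=false you'd instead get one "cpu/usage_ns" series
	// carrying pod/container columns: easier to aggregate in Grafana, but
	// far more expensive for InfluxDB to write.
}
```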
This is pretty old. Check out http://blog.kubernetes.io/2016/07/kubernetes-updates-to-performance-and-scalability-in-1.3.html
Check out https://github.com/kubernetes/community/blob/master/sig-scalability/README.md; the SIG Scalability folks should be able to give you the current information/plans and address any issues you're having with the published numbers.