cf for k8s v0.6.0
- To produce a sizing guide for the resource usage of each control plane component.
- To find the scalability limits of the system.
- To find out how many apps the system can handle.
- AWS
- Kubernetes Version 1.17.8
- Worker Nodes - 30
- Machine type - 8cpu_30gb
Version: v0.6.0
- Deployed Prometheus and Grafana for monitoring.
- Configured basic Kubernetes monitoring (kube-state-metrics, node-exporter, cAdvisor).
- Set up monitoring of the Istio control plane components.
- Scraped metrics for capi.
- Configured a script and a Pushgateway to collect push, start, and route time metrics.
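Because the availability script pushes its timings rather than exposing them, Prometheus has to scrape the Pushgateway with `honor_labels` enabled so the pushed labels survive. A minimal sketch of such a scrape config, assuming the Pushgateway runs as a service named `pushgateway` on its default port (both names are assumptions, not taken from the actual deployment):

```yaml
# Sketch only: Prometheus scrape_configs fragment for the Pushgateway.
scrape_configs:
  - job_name: pushgateway
    honor_labels: true          # keep the job/instance labels set by the pushing script
    static_configs:
      - targets: ['pushgateway:9091']   # assumed service name and default port
```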
- Used the diego-stress-tests framework to push applications, running it on a separate 8cpu_30gb virtual machine. Modified the tests to push source-code apps instead of binaries.
- Start timeout is set to 10 minutes, to tolerate application staging failures.
- Configured each application to generate 1 req/sec and 1 log/sec to simulate a realistic workload.
- Deploys mixed application sizes and frequently crashing apps in a 1:20 ratio.
- Max in flight (concurrent pushes): 10.
- To keep an eye on push, start, and route availability, we created a script that pushes a nodejs app with `--no-start`, starts the application, checks that the route is available, and then deletes the application. Each operation is timed, and the metrics are pushed to Prometheus via the Pushgateway.
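The availability probe described above can be sketched as follows. This is an illustrative reconstruction, not the actual script: the metric names, the probe app name, the route URL, and the Pushgateway address are all assumptions.

```python
# Sketch of the push/start/route availability probe. Assumes the `cf` CLI
# is on PATH and a Pushgateway is reachable at PUSHGATEWAY_URL.
import subprocess
import time
import urllib.request

PUSHGATEWAY_URL = "http://localhost:9091"  # assumption: default Pushgateway port


def timed(cmd):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return time.monotonic() - start


def format_metrics(push_s, start_s, route_s):
    """Render the three timings in the Prometheus text exposition format."""
    return (
        f"cf_push_seconds {push_s}\n"
        f"cf_start_seconds {start_s}\n"
        f"cf_route_seconds {route_s}\n"
    )


def push_to_gateway(body, job="cf-availability"):
    """POST the metrics body to the Pushgateway under the given job name."""
    req = urllib.request.Request(
        f"{PUSHGATEWAY_URL}/metrics/job/{job}",
        data=body.encode(),
        method="POST",
    )
    urllib.request.urlopen(req)


def probe(app="probe-app"):
    # push --no-start -> start -> check route -> delete, timing each step
    push_s = timed(["cf", "push", app, "--no-start"])
    start_s = timed(["cf", "start", app])
    route_s = timed(["curl", "-sf", f"https://{app}.example.com"])  # assumed route
    subprocess.run(["cf", "delete", app, "-f"], check=True)
    push_to_gateway(format_metrics(push_s, start_s, route_s))
```

Pushing to a Pushgateway (rather than exposing an endpoint) fits here because each probe run is a short-lived batch job that exits before Prometheus could scrape it.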
The following resource usages were captured per replica at the point when we reached 1600 apps.
![](https://github.com/perf-cfk8s/docs/raw/master/v0.6.0/resource_usage.png)
During the scalability tests we monitored the resource usage and logs of the control plane components. Our approach was to start with one replica of each component and, whenever resource usage approached soft limits or failures became frequent, decide whether to scale that component in or out.
During the entire scale-out test, cf-apiserver was the only component that needed to scale horizontally.
We started with one replica of cf-apiserver and began pushing apps; whenever the apiserver started returning 503 responses to pushes, we added one more replica. The following are the AI (app instance) counts at which we increased the cf-apiserver replica count.
| cf-apiserver replicas | AI count at which we scaled |
|---|---|
| 1 | 170 |
| 2 | 520 |
| 3 | 700 |
| 4 | 940 |
| 5 | 1600 |
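From the scaling points in the table, one can derive how many additional app instances each extra cf-apiserver replica absorbed before 503s reappeared. A small sketch of that arithmetic:

```python
# Scaling points from the table: (cf-apiserver replicas, AI count at scale-up).
scale_points = [(1, 170), (2, 520), (3, 700), (4, 940), (5, 1600)]

# App instances absorbed between successive scale-ups.
increments = [b[1] - a[1] for a, b in zip(scale_points, scale_points[1:])]
print(increments)            # [350, 180, 240, 660]
print(sum(increments) // 4)  # rough average AIs per added replica
```

The increments are uneven, so the average is only a rough planning figure, not a hard per-replica capacity.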
Issue Raised: capi-k8s-release #67
During the course of the test we observed the eirini-events pod consuming more than 10Gi of memory, after which it restarted automatically and its memory usage dropped. We suspect a memory leak and raised an issue for it.
Issue Raised: Eirini #175
![](https://github.com/perf-cfk8s/docs/raw/master/v0.6.0/eirini-events.png)
The next component with high memory consumption was log-cache. It consumed more than 15Gi, draining the node's memory completely, so we scaled it to two replicas, although we are not sure whether log-cache is designed to scale up or scale out. Fluentd hit an OOM kill; we raised its memory limit to 3Gi, after which it remained stable.
Issue Raised: cf-k8s-logging #36
![](https://github.com/perf-cfk8s/docs/raw/master/v0.6.0/log-cache.png)
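The Fluentd fix amounts to a container resource-limit bump. A minimal sketch of the relevant fragment, assuming a standard Kubernetes container spec; the 3Gi limit is from the test above, while the request value and surrounding manifest structure are assumptions, not the actual cf-k8s-logging manifests:

```yaml
# Sketch only: raising the Fluentd container memory limit to 3Gi.
resources:
  limits:
    memory: 3Gi   # value that sustained the workload in this test
  requests:
    memory: 1Gi   # assumed request; the report only mentions the limit
```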
Issues raised: #70
- These tests used an in-cluster Postgres; next we will configure an external DB.
- To stress kpack more, in the next runs we will try to avoid image caching.