cf for k8s 1.0.0

Scalability tests on cf-for-k8s 1.0.0

Goals

  • Find the maximum number of application instances the system can handle.
  • Identify the scalability limits of the control-plane components.
  • Measure the resource consumption of the control-plane components.
  • Simulate a production-like system with applications constantly generating logs and requests.

cf-for-k8s Setup:

Version: v1.0.0

Infra

  • AWS
  • Kubernetes Version 1.19.2
  • Worker Nodes - 30
  • Machine type - 8cpu_30gb

Monitoring

  • Deployed Prometheus and Grafana for monitoring.
  • Configured basic Kubernetes monitoring (kube-state-metrics, node-exporter, cAdvisor).
  • Configured monitoring of the Istio control-plane components.
  • Scraped metrics from capi, uaa, and metrics-proxy (a sketch for sanity-checking the scrape targets follows this list).
  • Configured a script and a Pushgateway to collect push, start, and route-time metrics.
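
To sanity-check this scrape configuration before a run, a small query against Prometheus' HTTP API can confirm each job is being scraped. A minimal sketch, assuming a Prometheus service URL and job names that are illustrative rather than the exact names used in this test:

```python
# Minimal sketch: verify that the configured scrape targets are healthy.
# The Prometheus URL and job names are assumptions for illustration.
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc.cluster.local:9090"  # assumption
JOBS = ["kube-state-metrics", "node-exporter", "cadvisor", "capi", "uaa", "metrics-proxy"]  # assumed job names

def job_is_up(job: str) -> bool:
    # `up` is 1 when Prometheus successfully scraped the target on its last attempt.
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": f'up{{job="{job}"}}'},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return bool(results) and all(sample["value"][1] == "1" for sample in results)

for job in JOBS:
    print(f"{job}: {'UP' if job_is_up(job) else 'DOWN or not scraped'}")
```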

Load Pattern:

  • Used the diego-stress-tests framework to push applications. Set up diego-stress-tests on a separate 8 CPU / 30 GB virtual machine. Modified the tests to push source-code apps instead of binaries.
  • Start timeout is set to 10 minutes, to accommodate application staging failures.
  • Configured each application to generate 10 req/sec and 10 logs/sec to simulate a realistic workload (a sketch of such a workload generator follows this list).
  • The framework deploys mixed application sizes, with frequently crashing apps at a 1:20 ratio.
  • Max in flight: 20 concurrent pushes.
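
The per-app workload is driven by the pushed test apps themselves; the following is only a hypothetical sketch of what "10 req/sec and 10 logs/sec per app" looks like, with the app route as an assumed placeholder:

```python
# Hypothetical per-app workload generator: ~10 log lines/sec and ~10 requests/sec
# against the app's route. The actual test apps were pushed by diego-stress-tests;
# the route below is an assumption for illustration.
import time
import requests

APP_ROUTE = "http://sample-app.apps.example.com/"  # assumed route
RATE_PER_SEC = 10

while True:
    start = time.time()
    for i in range(RATE_PER_SEC):
        print(f"workload log line {i}", flush=True)   # stdout logs are picked up by the platform
        try:
            requests.get(APP_ROUTE, timeout=1)        # request routed through istio-ingressgateway
        except requests.RequestException:
            pass  # failures are ignored; this only generates load
    # Sleep out the remainder of the second to hold roughly 10/sec.
    time.sleep(max(0.0, 1.0 - (time.time() - start)))
```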

Route Availability:

  • To keep an eye on push, start, and route availability, we created a script that pushes a Node.js app with --no-start, starts the application, checks that the route is available, and then deletes the application. The time taken for each operation is measured and pushed to Prometheus via the Pushgateway. A sketch of this probe follows.
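
A minimal sketch of such a probe, assuming the app name, route, Pushgateway address, and metric names (the real script's names are not shown here):

```python
# Sketch of the availability probe: push --no-start -> start -> wait for route -> delete,
# with each timing pushed to Prometheus via the Pushgateway.
import subprocess
import time
import requests
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

APP = "route-probe"                                            # assumed app name
ROUTE = "http://route-probe.apps.example.com/"                 # assumed route
PUSHGATEWAY = "pushgateway.monitoring.svc.cluster.local:9091"  # assumed address

def timed(cmd):
    start = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - start

registry = CollectorRegistry()
push_time = Gauge("probe_push_seconds", "Time for cf push --no-start", registry=registry)
start_time = Gauge("probe_start_seconds", "Time for cf start", registry=registry)
route_time = Gauge("probe_route_seconds", "Time until the route answers 200", registry=registry)

# Assumes the current directory contains the Node.js app source.
push_time.set(timed(["cf", "push", APP, "--no-start"]))
start_time.set(timed(["cf", "start", APP]))

t0 = time.time()
while True:
    try:
        if requests.get(ROUTE, timeout=5).status_code == 200:
            break
    except requests.RequestException:
        pass
    time.sleep(1)
route_time.set(time.time() - t0)

subprocess.run(["cf", "delete", APP, "-f"], check=True)
push_to_gateway(PUSHGATEWAY, job="route_availability_probe", registry=registry)
```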

Resource Utilization of Control-Plane Components:

We started with the default configuration for all components. Based on our previous experience, we adjusted the CPU and memory resources of some components to handle the load. The following resource usages were captured on the replicas at the point when we reached 1000 apps.
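
A sketch of the kind of Prometheus queries used to snapshot control-plane resource usage at that point, using the cAdvisor metrics configured above; the Prometheus URL and namespace label are assumptions:

```python
# Snapshot per-pod CPU and memory usage for the control-plane namespace.
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc.cluster.local:9090"  # assumption
QUERIES = {
    "cpu_cores": 'sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="cf-system"}[5m]))',
    "memory_bytes": 'sum by (pod) (container_memory_working_set_bytes{namespace="cf-system"})',
}

for name, query in QUERIES.items():
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    print(f"--- {name} ---")
    for sample in resp.json()["data"]["result"]:
        print(sample["metric"].get("pod", "<unknown>"), sample["value"][1])
```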

Observations:

  • We started with 5 replicas of cf-api-server and istio-ingressgateway.

  • Due to high load on istio-ingressgateway, its CPU was getting throttled, so we scaled up the replicas over time.

  • At 750 applications, we scaled cf-api-server to 7 replicas as we observed a lot of 500 responses.

  • Beyond 750 applications, we observed the usual failovers from cf-api-server, which we reported in issue #67.

  • After 1000 applications, we saw TCP timeout errors from log-cache and istio-ingressgateway. We will dig deeper with the respective teams to find the root cause.

Improvements:

  • log-cache memory consumption is drastically reduced compared with previous releases (8Gi), even with a large volume of logs from applications. We will check with the team whether any design changes contributed to this improvement, or whether it comes from reducing the logs for each application, which is discussed here.

Recommendation:

🎁 With the right number of replicas, 1.0.0 can work well for environments targeting 500-700 application instances.

🎁 Up to 7000 logs/sec and 7000 req/sec.

🎁 Up to 20 concurrent pushes.

Next Steps:

  • Reach out to the respective component teams (capi, log-cache, and routing) to resolve these issues.