
cf for k8s v0.6.0

Scalability tests on the cf-for-k8s v0.6.0 alpha release.

Goal

  • To produce a sizing guide for each control plane component's resource usage.
  • To find the scalability limits.
  • To find out how many apps can be handled by the system.

Infra

  • AWS
  • Kubernetes Version 1.17.8
  • Worker Nodes - 30
  • Machine type - 8cpu_30gb

cf-for-k8s Setup:

Version: v0.6.0

Monitoring

  • Deployed Prometheus and Grafana for monitoring.
  • Configured basic Kubernetes monitoring (kube-state-metrics, node-exporter, cAdvisor).
  • Configured monitoring for the Istio control plane components.
  • Scraped metrics from CAPI.
  • Configured a script and a Pushgateway to collect push, start, and route time metrics.

Load Pattern:

  • Used the diego-stress-tests framework to push applications. Set up diego-stress-tests on a separate 8cpu_30gb virtual machine and modified the tests to push source-code apps instead of binaries.
  • Start timeout is set to 10 minutes, to tolerate application staging failures.
  • Configured each application to generate 1 req/sec and 1 log/sec to simulate a realistic workload (a request-driver sketch follows this list).
  • The framework deploys mixed application sizes, with frequently crashing apps in a 1:20 ratio.
  • Max in flight (concurrent pushes) --> 10.
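
One simple way to drive a steady 1 req/sec against each app is a small external request driver. The sketch below shows the idea, not the exact tooling used in these tests: the `requests` library and the `routes.txt` file (one app route URL per line) are illustrative assumptions, and it covers only the request half of the workload.

```python
# request_driver.py -- sketch of a 1 req/sec-per-app request driver.
# Assumptions (illustrative): the requests library and a routes.txt file
# listing one app route URL per line.
import threading
import time

import requests


def drive(route: str, interval: float = 1.0) -> None:
    """Send one GET per interval to a single app route, forever."""
    while True:
        started = time.time()
        try:
            requests.get(route, timeout=5)
        except requests.RequestException:
            pass  # a real driver would count failures
        # Sleep out the remainder of the interval so each app sees ~1 req/sec.
        time.sleep(max(0.0, interval - (time.time() - started)))


if __name__ == "__main__":
    with open("routes.txt") as f:
        routes = [line.strip() for line in f if line.strip()]
    # One lightweight thread per app route.
    for route in routes:
        threading.Thread(target=drive, args=(route,), daemon=True).start()
    while True:
        time.sleep(60)
```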

Route Availability:

  • To keep an eye on push, start, and route availability, we created a script that pushes a Node.js app with --no-start, starts the application, checks that the route is available, and then deletes the application. The time for each operation is measured and pushed to Prometheus via the Pushgateway (a sketch follows below).
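
The script itself is not reproduced on this page; the following is a minimal sketch of the idea. It assumes the `cf` CLI is already targeted and logged in, a Node.js app under `./node-app`, an apps domain of `apps.example.com`, the `prometheus_client` library, and a Pushgateway reachable at `pushgateway:9091`, all of which are illustrative rather than the exact values used in the test.

```python
# route_availability_probe.py -- sketch of the push / start / route-check / delete probe.
# Assumptions (illustrative, not the real setup): cf CLI targeted and logged in,
# app source in ./node-app, apps domain apps.example.com, Pushgateway at pushgateway:9091.
import subprocess
import time

import requests
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

APP = "availability-probe"
ROUTE = f"https://{APP}.apps.example.com"


def timed(cmd):
    """Run a cf command and return how long it took in seconds."""
    start = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - start


def wait_for_route(url, timeout=600):
    """Poll the app route until it answers 200 or the timeout expires."""
    start = time.time()
    while time.time() - start < timeout:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return time.time() - start
        except requests.RequestException:
            pass
        time.sleep(2)
    raise TimeoutError(f"route {url} not available after {timeout}s")


registry = CollectorRegistry()
push_time = Gauge("cf_push_seconds", "Time for cf push --no-start", registry=registry)
start_time = Gauge("cf_start_seconds", "Time for cf start", registry=registry)
route_time = Gauge("cf_route_seconds", "Time until the route responds", registry=registry)

push_time.set(timed(["cf", "push", APP, "-p", "./node-app", "--no-start"]))
start_time.set(timed(["cf", "start", APP]))
route_time.set(wait_for_route(ROUTE))
subprocess.run(["cf", "delete", APP, "-f", "-r"], check=True)

# Push the three durations to Prometheus via the Pushgateway.
push_to_gateway("pushgateway:9091", job="route_availability", registry=registry)
```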

Resource Utilization of Control Plane Components

Resource usage of each control plane component was captured per replica at the point when we reached 1600 apps (a sample Prometheus query is sketched below).
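
For reference, per-pod CPU and memory can be pulled from the monitoring stack with Prometheus queries along these lines; the `cf-system` namespace label and the Prometheus address are assumptions about the install, not values confirmed by this test.

```python
# control_plane_usage.py -- sketch of reading control plane pod usage from Prometheus.
# Assumptions: Prometheus reachable at http://prometheus:9090 and the control plane
# running in the cf-system namespace; adjust both for your install.
import requests

PROM = "http://prometheus:9090/api/v1/query"

QUERIES = {
    # Working-set memory per pod, in bytes (cAdvisor metric).
    "memory": 'sum by (pod) (container_memory_working_set_bytes{namespace="cf-system", container!=""})',
    # CPU usage per pod, in cores, averaged over the last 5 minutes.
    "cpu": 'sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="cf-system", container!=""}[5m]))',
}

for name, query in QUERIES.items():
    result = requests.get(PROM, params={"query": query}).json()["data"]["result"]
    print(f"--- {name} ---")
    for sample in result:
        print(sample["metric"]["pod"], sample["value"][1])
```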

Test Outcomes

During the scalability tests we continuously monitored the resource usage and logs of the control plane components. The plan was to start each component with one replica and, whenever its resource usage approached soft limits or failures became frequent, decide whether to scale that component in or out.

Across the entire scale-out test, cf-apiserver was the only component that needed to scale horizontally.

cf-apiserver observations:

We started with one replica of cf-apiserver and began pushing apps. Whenever the apiserver started returning 503 responses for pushes, we added one more replica. The table below shows the app instance (AI) counts at which we increased the cf-apiserver replica count.

| cf-apiserver replicas | AI count at which we scaled |
|-----------------------|-----------------------------|
| 1                     | 170                         |
| 2                     | 520                         |
| 3                     | 700                         |
| 4                     | 940                         |
| 5                     | 1600                        |

Issue Raised: capi-k8s-release #67
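
The scale-up itself was a manual step during the test. As a rough sketch, the "503s seen, so add one replica" rule could be expressed with a Prometheus query plus the Kubernetes Python client as below; the `cf-api-server` deployment name, the `cf-system` namespace, and the Istio metric labels are assumptions, not details taken from this test.

```python
# scale_cf_api.py -- sketch of the "+1 replica when pushes see 503s" rule.
# Assumptions: Prometheus at http://prometheus:9090 exposing Istio request metrics,
# cf-apiserver deployed as "cf-api-server" in the cf-system namespace, kubeconfig access.
import requests
from kubernetes import client, config

PROM = "http://prometheus:9090/api/v1/query"
QUERY = ('sum(rate(istio_requests_total{destination_service_name="cf-api-server",'
         'response_code="503"}[5m]))')


def error_rate() -> float:
    """Return the 503 rate (req/s) seen by the CF API over the last 5 minutes."""
    result = requests.get(PROM, params={"query": QUERY}).json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def scale_up(name="cf-api-server", namespace="cf-system") -> int:
    """Add one replica to the deployment and return the new replica count."""
    apps = client.AppsV1Api()
    scale = apps.read_namespaced_deployment_scale(name, namespace)
    scale.spec.replicas += 1
    apps.patch_namespaced_deployment_scale(name, namespace, scale)
    return scale.spec.replicas


if __name__ == "__main__":
    config.load_kube_config()
    if error_rate() > 0:
        print("cf-api-server replicas now:", scale_up())
```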

Eirini Events High Memory Consumption:

During the course of the test, the eirini-events pod consumed a large amount of memory (> 10Gi), after which it restarted automatically and memory usage dropped. We suspect a memory leak and raised an issue for it.

Issue Raised: Eirini #175

Logging:

The next component with high memory consumption was log-cache. It was consuming beyond 15Gi and draining the node's entire memory, so we scaled it to two replicas, although we are not sure whether log-cache is designed to scale up or scale out. Fluentd went OOM as well; we raised its memory limit to 3Gi, after which it held up.

Issue Raised: cf-k8s-logging #36
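
One way to apply such a memory-limit bump is a strategic merge patch on the Fluentd daemonset, sketched below with the Kubernetes Python client; the daemonset/container name `fluentd` and the `cf-system` namespace are assumptions about the install, not values taken from this test.

```python
# bump_fluentd_memory.py -- sketch of raising the Fluentd memory limit to 3Gi.
# Assumptions: daemonset and container both named "fluentd" in the cf-system
# namespace; adjust the names to match the actual cf-for-k8s install.
from kubernetes import client, config

PATCH = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {"name": "fluentd", "resources": {"limits": {"memory": "3Gi"}}}
                ]
            }
        }
    }
}

if __name__ == "__main__":
    config.load_kube_config()
    client.AppsV1Api().patch_namespaced_daemon_set("fluentd", "cf-system", PATCH)
```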

Maximum Concurrent Pushes:

Issue Raised: #70

Stay Tuned for:

  • In these tests we used the in-cluster Postgres. Next we will configure an external DB.
  • To stress kpack further, in the next runs we will try to avoid image caching.