
cf for k8s v0.7.0


Scalability tests on cf-for-k8s 0.7.0

🎉 🎉 🎉 To start with a happy note: in this release we were able to push 100 apps within 5 minutes. :clap:

🎶 Coming together is a beginning, Keeping together is a progress, Working together is a success. 🎶

Outcomes

We were able to successfully deploy up to 1200 applications.

The memory leak issue previously reported in eirini-events#175 has been resolved.

Changes Made Since Previous Test Runs

  • We introduced an external database (AWS RDS).

  • Changed the istio-ingressgateway to run as a Deployment.
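A quick way to confirm the second change took effect (the istio-system namespace is an assumption based on the default istio install):

# The gateway should now be listed as a Deployment rather than a per-node workload
kubectl -n istio-system get deployment istio-ingressgateway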

CAPI Observations

As in previous scale tests, bottlenecks appeared in CAPI on the way to 1000 apps. After ~500 app pushes we started getting EOF errors for /v3/processes/UUID/stats, because of which most of the pushes failed. Issue Comments

Waiting for API to complete processing files...
Request error: Get https://api.example.com/v3/processes/dd934616-a406-439e-868d-f4ee7b240e6e/stats: EOF
TIP: If you are behind a firewall and require an HTTP proxy, verify the https_proxy environment variable is correctly set. Else, check your network connection.
Instances starting...
Stats unavailable: Stats server temporarily unavailable.
FAILED

When we do cf curl /v3/processes/UUID/stats, the response takes around 18 seconds. These are the error logs we could relate to the above issue:

{"timestamp":"2020-10-02T03:19:09.907623891Z","message":"Started GET \"/v3/processes/a42bab9e-d570-4593-a9f5-67c7f80892bf/stats\" for user: 4d427335-c1c0-4c15-87b8-c5fff7eba27c, ip: 10.250.43.143 with vcap-request-id: 5b1ea345-eebe-4aca-9c06-a99e7391ccb7::2515f881-0f2a-46df-8dec-388501673725 at 2020-10-02 03:19:09 UTC","log_level":"info","source":"cc.api","data":{"request_guid":"5b1ea345-eebe-4aca-9c06-a99e7391ccb7::2515f881-0f2a-46df-8dec-388501673725"},"thread_id":47248886568360,"fiber_id":70325610875860,"process_id":1,"file":"/workspace/middleware/request_logs.rb","lineno":28,"method":"call"}
{"timestamp":"2020-10-02T03:19:16.027511805Z","message":"stats_for_app.error","log_level":"info","source":"cc.diego.instances_reporter","data":{"request_guid":"5b1ea345-eebe-4aca-9c06-a99e7391ccb7::2515f881-0f2a-46df-8dec-388501673725","error":"No running instances"},"thread_id":47248886568360,"fiber_id":70325610875860,"process_id":1,"file":"/workspace/lib/cloud_controller/diego/reporters/instances_stats_reporter.rb","lineno":61,"method":"rescue in stats_for_app"}

Up to 600 applications we were pushing with a max_in_flight of 20; because of these frequent failures we reduced concurrent pushes to 10 to keep the failure rate to a minimum.

In the nginx container of the cf-api-server we frequently observed this connection-close error, along with other similar issues and a latency issue, which we reported here.

epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while sending request to upstream, client: 127.0.0.1, server: , request: "GET /healthz HTTP/1.1", upstream: "http://unix:/data/cloud_controller_ng/cloud_controller.sock:/healthz", host: "localhost:80"
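A hedged way to reproduce the /healthz latency from inside the nginx container itself (the cf-system namespace and the availability of curl in the image are assumptions):

# Hit the health endpoint that nginx proxies to the cloud_controller unix socket
kubectl -n cf-system exec deploy/cf-api-server -c nginx -- \
  curl -s -o /dev/null -w 'status=%{http_code} total=%{time_total}s\n' http://localhost:80/healthz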

Resource Consumption

We started the tests with the default requests and limits assigned to the containers, with 5 replicas of cf-api-server and one replica of istio-ingressgateway initially.

When the app count was reaching 400, CPU got throttled (2 cores) for the istio-ingressgateway, so we scaled it horizontally over the subsequent runs.
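The throttling is visible in the cAdvisor metrics we already scrape; a rough sketch of checking it and then scaling out (namespace and replica count are illustrative, not the exact values we ended up with):

# PromQL to spot sustained CPU throttling on the gateway pods:
#   rate(container_cpu_cfs_throttled_seconds_total{pod=~"istio-ingressgateway.*"}[5m])
# Scale the gateway Deployment horizontally once throttling is sustained
kubectl -n istio-system scale deployment istio-ingressgateway --replicas=4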

We observed that most of the memory limits and requests set in the PRs Eirini #173 and capi #65 for the control plane components were not sufficient, which led to OOMs. We scaled the resources accordingly to handle the load. These results give better insights for defining scaling interfaces. Happy to contribute.
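As a sketch of the kind of adjustment this means (deployment name, namespace, and values are illustrative, not the numbers we settled on):

# Raise memory requests/limits on a control plane Deployment that was getting OOM-killed
kubectl -n cf-system set resources deployment cf-api-server \
  --requests=memory=1Gi --limits=memory=2Gi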

The following table gives the components' resource consumption at 1000 apps.
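(Independently of the Grafana dashboards, a point-in-time spot check of per-container usage can be taken via the metrics API, if available; the namespace is an assumption.)

kubectl top pods -n cf-system --containers
kubectl top nodes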

Infra

  • AWS
  • Kubernetes Version 1.17.8
  • Worker Nodes - 30
  • Machine type - 8cpu_30gb
  • RDS - db.m5.xlarge

Monitoring

  • Deployed Prometheus and Grafana for monitoring.
  • Configured basic Kubernetes monitoring (kube-state-metrics, node-exporter, cAdvisor).
  • Istio control plane components monitoring.
  • Scraped metrics for CAPI.
  • Configured a script and Pushgateway to collect push, start, and route time metrics.
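The timing metrics reach Prometheus through the Pushgateway's plain-text push interface; a minimal example of the mechanism (metric name, job label, and Pushgateway address are placeholders):

# Push one timing sample to the Pushgateway; Prometheus then scrapes it from there
echo "cf_push_duration_seconds 42.3" | \
  curl -s --data-binary @- http://pushgateway.monitoring.svc:9091/metrics/job/cf_scale_test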

Load Pattern:

  • Used the Diego stress tests framework to push applications. Set up diego-stress-tests on a separate 8cpu_30gb virtual machine. Modified the tests to push source-code apps instead of binaries.
  • Start timeout is set to 10 mins, to handle application staging failures.
  • Configured each application to generate 1 req/sec and 1 log/sec to simulate a real workload scenario.
  • It deploys mixed application sizes and frequently crashing apps in a 1:20 ratio.
  • Max in flight --> 20 | 10 (concurrent pushes); see the sketch after this list.
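The framework drives the concurrency itself; purely as an illustration of what a max-in-flight cap of 10 means, a bash sketch with placeholder app names and source directory (not the diego-stress-tests tooling):

# Push 100 apps with at most 10 cf push processes running at once
# (in practice each worker would need its own CF_HOME to avoid CLI config contention)
seq 1 100 | xargs -P 10 -I{} cf push "scale-app-{}" -p ./sample-node-app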

Route Availability:

  • To keep an eye on push, start, and route availability, we created a script which pushes a nodejs app with --no-start --> starts the application --> checks that the route is available --> deletes the application. The time is measured for each operation, and we push the metrics to Prometheus using the Pushgateway.
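A trimmed-down sketch of that loop (app name, route URL, and Pushgateway address are placeholders; the real script also records failures):

#!/bin/bash
# Measure push, start, and route-availability times for a throwaway nodejs app
APP=availability-probe
start=$(date +%s); cf push "$APP" --no-start -p ./node-app; push_t=$(( $(date +%s) - start ))
start=$(date +%s); cf start "$APP"; start_t=$(( $(date +%s) - start ))
start=$(date +%s)
until curl -sfk "https://${APP}.apps.example.com/" >/dev/null; do sleep 1; done
route_t=$(( $(date +%s) - start ))
cf delete -f "$APP"
# Ship the three timings to Prometheus via the Pushgateway
cat <<EOF | curl -s --data-binary @- http://pushgateway.monitoring.svc:9091/metrics/job/route_availability
cf_push_seconds ${push_t}
cf_start_seconds ${start_t}
cf_route_available_seconds ${route_t}
EOF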