Optimise failure scenario #393

Closed
PavelVlha opened this issue Jun 26, 2023 · 7 comments · Fixed by #402

PavelVlha commented Jun 26, 2023

  • When killing a pod, wait for it to be up again before killing another (see the shell sketch after this list)
    • measure the time it takes for the pod to come up, and possibly also capture the log just after it becomes ready, so we have both the timing and the messages
  • Save the logs before killing the pod (in a .zip file using the command line)
  • Pull all KC logs before the end of the test run
  • Capture the complete pod resource before killing it and at the end of the run
  • Find bottlenecks
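
A minimal shell sketch of the kill-and-wait loop, assuming an `oc` login, a "keycloak" namespace, pods labelled app=keycloak, a StatefulSet that recreates each pod under the same name, and the zip CLI being available (all of these are illustrative assumptions, not taken from kc-chaos.sh):

while true; do
  POD=$(oc get pods -n keycloak -l app=keycloak -o name | shuf -n 1)
  NAME=${POD##*/}
  oc logs -n keycloak "$NAME" > "${NAME}-$(date +%s).log"              # save the logs before the kill
  oc get pod -n keycloak "$NAME" -o yaml > "${NAME}-$(date +%s).yaml"  # capture the complete pod resource
  START=$(date +%s)
  oc delete pod -n keycloak "$NAME"                                    # blocks until the old pod is gone
  until oc wait pod/"$NAME" -n keycloak --for=condition=Ready --timeout=10s 2>/dev/null; do
    sleep 2                                                            # replacement pod comes back under the same name
  done
  echo "${NAME} ready again after $(( $(date +%s) - START ))s"
  zip -q chaos-logs.zip "${NAME}"-*.log "${NAME}"-*.yaml               # archive the captured artifacts
done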

ahus1 commented Jul 3, 2023

@kami619 / @mhajas - when using the EC2 instances and some updated infrastructure, I was able to deliver the expected load in a "medium"-sized scenario:

Setup of the system:

  • 100,000 clients
  • 10,000 users
  • 6 Keycloak pods, 15 DB connections for each pod
  • kc-chaos.sh running continuously
  • 5 EC2 instances running the load against the cluster
  • OpenShift load balancer configured for round-robin
  • OTEL enabled

Test 1:

  • 1000 client credential grants per second
  • Measurement period of 10 minutes
./benchmark.sh eu-west-1 --scenario=keycloak.scenario."authentication.ClientSecret" \
    --server-url=https://... \
    --users-per-sec=1000  \
    --measurement=600 \
    --realm-name=realm-0 \
    --clients-per-realm=10000

Results: works, but there are still some high peaks when a pod is restarted, presumably because a new pod is put into load balancing with its full share of the traffic. Due to that, requests queue up for that pod quite a bit. AFAIK there is no way to slow-start a pod.

What didn't work:

  • Having only 3 pods of Keycloak running. This led to overload scenarios (possibly at the moment when restarted pods were put back into the load balancing).
  • Using "leastconn" routing instead of round-robin; it led to a very uneven load balancing which increased over time. The hope was to have the same number of active connections on each pod, so that any queue caused by slow responses during startup would ensure a new pod gets only part of the load. It is unclear what a "connection" means in the context of "leastconn" when used with TLS passthrough (a sketch for switching the route's balancing strategy follows this list).

Test 2:

  • 280 authentication codes per second
  • 100% of users log out again
  • Measurement period of 10 minutes
./benchmark.sh eu-west-1 --scenario=keycloak.scenario."authentication.AuthorizationCode" \
    --server-url=https://... \
    --users-per-sec=280  \
    --measurement=600 \
    --realm-name=realm-0 \
    --users-per-realm=1000

Results: the latencies tend to build up, and there are very high peaks when a pod is restarted, presumably because a new pod is put into load balancing with its full share, plus additional trouble as the other Keycloak instances try to connect to the pod that has just been terminated? As we're randomly restarting pods, it seems that those pods which are not restarted build up a high JVM GC overhead (gradually up to 10+%), and at the end of the test we see the latencies increase.
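
A rough way to watch that GC overhead per pod is to scrape the JVM metrics directly; a sketch assuming metrics are enabled on the Keycloak build and served under /metrics on the pod's HTTP port, with an illustrative pod name (all of this is an assumption, not part of the setup above):

oc port-forward -n keycloak pod/keycloak-0 8080:8080 &
PF_PID=$!
sleep 2
curl -s http://localhost:8080/metrics | grep '^jvm_gc_pause_seconds'   # accumulated GC pause times and counts
kill "$PF_PID"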

Test 3:

  • 280 authentication codes per second
  • 1% of users log out again
  • Measurement period of 10 minutes
./benchmark.sh eu-west-1 --scenario=keycloak.scenario."authentication.AuthorizationCode" \
    --server-url=https://... \
    --users-per-sec=280  \
    --measurement=600 \
    --logout-percentage=1 \
    --realm-name=realm-0 \
    --users-per-realm=1000

Results: with an increasing number of sessions, the request latencies grow. At around 100,000 sessions created, the pods restart repeatedly, possibly due to the liveness checks timing out.
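
One quick way to confirm whether the restarts are indeed driven by failing liveness probes is to look at the pod events and the configured probe settings; a sketch assuming a "keycloak" namespace and an app=keycloak label (both illustrative):

oc get events -n keycloak --field-selector reason=Unhealthy --sort-by=.lastTimestamp   # probe failures, if any
oc describe pod -n keycloak -l app=keycloak | grep -E 'Liveness|Restart Count'         # probe timeouts and restart counts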

Diagrams for the client credentials:

OK scenario:

[screenshots]

Scenario with "leastconn", where one Keycloak picks up a lot of requests, and things slow down/queue up:

[screenshot]

Diagrams for the authentication code:

Not OK scenario, increased latencies after a pod restart, plus increasing latencies over time:

[screenshots]

Response time histogram over time:

[screenshots]

Not OK scenario, system failure after creating ~100000 sessions:

[screenshots]


ahus1 commented Jul 4, 2023

Created a spin-off issue #410 to analyze the problems around user sessions.


kami619 commented Jul 5, 2023

Thanks for the detailed analysis of the desired capacity runs for Keycloak, @ahus1.

Having only 3 pods of Keycloak running. This led to overload scenarios (possibly at the moment when restarted pods were put back into the load balancing).

I am interested to know more about the sizing change from 3 to 6 Keycloak pods to accommodate the overload, and where the bottleneck was observed. Does it mean we hit the maximum capacity possible with 3 Keycloak pods in this case, and what metrics were achieved with that system?


ahus1 commented Jul 5, 2023

Does it mean we hit the maximum capacity possible with 3 Keycloak pods in this case?

When using only three pods and restarting one of them, that instance receives one third of the total load the moment it becomes ready. This is quite a peak, and with the JVM not yet warmed up it leads to a high spike of concurrent requests, which overloads the pod. With 6 pods, a restarting pod receives only one sixth of the load, which is a smaller peak.
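For example, at the 1,000 client credential grants per second from Test 1, a newly ready pod immediately takes roughly 333 requests per second with 3 pods, but only about 167 requests per second with 6 pods.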

This is caused by the "roundrobin" strategy which we are using by default. As far as I understand the docs, a "leastconn" distribution should deliver better results, as each Keycloak would process the same number of active sessions - still, this didn't show up in the results. Maybe this only works with reencrypt and not passthrough. We might need to consult an expert on this.

For the metrics, refer to the screenshots above. Please re-run the tests in a more controlled way than I did. For the Authentication Code scenario, there is an additional spin-off issue, as the results were not as good as anticipated.


kami619 commented Jul 5, 2023

This is caused by the "roundrobin" strategy which we are using by default. As far as I understand the docs, a "leastconn" distribution should deliver better results, as each Keycloak would process the same number of active sessions - still, this didn't show up in the results. Maybe this only works with reencrypt and not passthrough. We might need to consult an expert on this.

Thanks @ahus1 for the additional detail on this. I agree that we might need more info on the ingress/load balancer configuration to understand this better.


kami619 commented Jul 20, 2023

We were able to confirm @ahus1's previous load test outcome: the Client Secret scenario works as expected against a similarly set up Keycloak deployment, needs approximately 1 vCPU for every 250 users-per-sec of throughput, and it would be wise to allocate 2 vCPU per 250 users-per-sec to accommodate any JDBC spikes that may occur during pod restarts.
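As a worked example of that sizing rule: the 1,000 client credential grants per second from Test 1 would need roughly 1000 / 250 = 4 vCPU as a baseline, or 8 vCPU with the suggested 2x headroom for restart spikes.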

We spent quite a bit of time understanding the peaks for the Authorization Code scenario, and we found that:

  • Without any failures we were able to exceed the target: with 100,000 sessions already in the ISPN cache, we were able to achieve 50 users-per-sec of throughput per 1 vCPU for the Authorization Code scenario, with response times under 300 ms for the endpoints.
  • With simulated failures we found CPU throttling to be the main concern when pod restarts happen; it would make sense to allocate 2x the vCPU of a standard load to accommodate any resource utilization spikes (see the sketch after this list for one way to spot throttling).
  • In addition to that, we found an ISPN cache communication issue which occurs randomly when a pod restarts and fails to come back up cleanly due to a failure in the cluster-wide communication.
  • We found a couple more cluster-wide communication issues, which we suspect will be fixed by the ISPN fix from the above issue; we will retest once we have a build for it and log additional tickets as needed.
  • For a 6-pod Keycloak setup, the pods recover from a failure within 45 seconds with OTEL enabled and 25 seconds with OTEL disabled, as long as the cluster doesn't experience other secondary issues such as CPU throttling or ISPN communication problems.
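
A sketch for spotting the CPU throttling mentioned above, assuming the cluster's Prometheus scrapes cAdvisor metrics and the pods run in a "keycloak" namespace (both are assumptions, and the names are illustrative):

oc adm top pods -n keycloak                              # current CPU/memory usage per pod
# PromQL for the OpenShift monitoring console, showing throttled CPU periods per pod:
#   sum by (pod) (rate(container_cpu_cfs_throttled_periods_total{namespace="keycloak"}[5m]))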

Overall I think we achieved what we set out to do with the single OCP cluster failure tests.


ahus1 commented Jul 21, 2023

Thank you @kami619 - this resolves it for me. Added #438 as a follow-up as the cluster failures we've seen are troubling me.
