Optimise failure scenario #393

Closed
PavelVlha opened this issue Jun 26, 2023 · 7 comments · Fixed by #402

PavelVlha commented Jun 26, 2023

  • When killing a pod, wait for it to be up again before killing another (see the shell sketch after this list)
    • measure the time it takes for the pod to come up, and possibly also capture the log just after it becomes ready, so we have both the timing and the messages
  • Save the logs before killing the pod (in a .zip file using the command line)
  • Pull all KC logs before the end of the test run
  • Capture the complete pod resource before killing it and at the end of the run
  • Find bottlenecks
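
A minimal shell sketch of the kill-and-wait loop, assuming an `oc` login, a "keycloak" namespace, pods labelled app=keycloak, a StatefulSet that recreates each pod under the same name, and the zip CLI being available (all of these are illustrative assumptions, not taken from kc-chaos.sh):

while true; do
  POD=$(oc get pods -n keycloak -l app=keycloak -o name | shuf -n 1)
  NAME=${POD##*/}
  oc logs -n keycloak "$NAME" > "${NAME}-$(date +%s).log"              # save the logs before the kill
  oc get pod -n keycloak "$NAME" -o yaml > "${NAME}-$(date +%s).yaml"  # capture the complete pod resource
  START=$(date +%s)
  oc delete pod -n keycloak "$NAME"                                    # blocks until the old pod is gone
  until oc wait pod/"$NAME" -n keycloak --for=condition=Ready --timeout=10s 2>/dev/null; do
    sleep 2                                                            # replacement pod comes back under the same name
  done
  echo "${NAME} ready again after $(( $(date +%s) - START ))s"
  zip -q chaos-logs.zip "${NAME}"-*.log "${NAME}"-*.yaml               # archive the captured artifacts
done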

ahus1 commented Jul 3, 2023

@kami619 / @mhajas - when using the EC2 instances and some updated infrastructure, I was able to deliver the expected load in a "medium"-sized scenario:

Setup of the system:

  • 100,000 clients
  • 10,000 users
  • 6 Keycloak pods, 15 DB connections for each pod
  • kc-chaos.sh running continuously
  • 5 EC2 instances running the load against the cluster
  • OpenShift load balancer configured for round-robin
  • OTEL enabled

Test 1:

  • 1000 client credential grants per second
  • Measurement period of 10 minutes
./benchmark.sh eu-west-1 --scenario=keycloak.scenario."authentication.ClientSecret" \
    --server-url=https://... \
    --users-per-sec=1000  \
    --measurement=600 \
    --realm-name=realm-0 \
    --clients-per-realm=10000

Results: works, but there are still some high peaks when a pod is restarted, presumably because a new pod is put into load balancing with its full share of the traffic. Due to that, requests queue up for that pod quite a bit. AFAIK there is no way to slow-start a pod.

What didn't work:

  • Having only 3 pods of Keycloak running. This led to overload scenarios (possibly at the moment when restarted pods were put back into the load balancing).
  • Using "leastconn" routing instead of round-robin; it led to a very uneven load balancing which increased over time. The hope was to have the same number of active connections on each pod, so that any queue caused by slow responses during startup would ensure a new pod gets only part of the load. It is unclear what a "connection" means in the context of "leastconn" when used with TLS passthrough (a sketch for switching the route's balancing strategy follows this list).

Test 2:

  • 280 authentication codes per second
  • 100% of users log out again
  • Measurement period of 10 minutes
./benchmark.sh eu-west-1 --scenario=keycloak.scenario."authentication.AuthorizationCode" \
    --server-url=https://... \
    --users-per-sec=280  \
    --measurement=600 \
    --realm-name=realm-0 \
    --users-per-realm=1000

Results: the latencies tend to build up, and there are very high peaks when a pod is restarted, presumably because a new pod is put into load balancing with its full share, plus additional trouble as the other Keycloak instances try to connect to the pod that has just been terminated? As we're randomly restarting pods, it seems that those pods which are not restarted build up a high JVM GC overhead (gradually up to 10+%), and at the end of the test we see the latencies increase.
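
A rough way to watch that GC overhead per pod is to scrape the JVM metrics directly; a sketch assuming metrics are enabled on the Keycloak build and served under /metrics on the pod's HTTP port, with an illustrative pod name (all of this is an assumption, not part of the setup above):

oc port-forward -n keycloak pod/keycloak-0 8080:8080 &
PF_PID=$!
sleep 2
curl -s http://localhost:8080/metrics | grep '^jvm_gc_pause_seconds'   # accumulated GC pause times and counts
kill "$PF_PID"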

Test 3:

  • 280 authentication codes per second
  • 1% of users log out again
  • Measurement period of 10 minutes
./benchmark.sh eu-west-1 --scenario=keycloak.scenario."authentication.AuthorizationCode" \
    --server-url=https://... \
    --users-per-sec=280  \
    --measurement=600 \
    --logout-percentage=1 \
    --realm-name=realm-0 \
    --users-per-realm=1000

Results: with an increasing number of sessions, the request latencies grow. At around 100,000 sessions created, the pods restart repeatedly, possibly due to the liveness checks timing out.
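
One quick way to confirm whether the restarts are indeed driven by failing liveness probes is to look at the pod events and the configured probe settings; a sketch assuming a "keycloak" namespace and an app=keycloak label (both illustrative):

oc get events -n keycloak --field-selector reason=Unhealthy --sort-by=.lastTimestamp   # probe failures, if any
oc describe pod -n keycloak -l app=keycloak | grep -E 'Liveness|Restart Count'         # probe timeouts and restart counts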

Diagrams for the client credentials:

OK scenario:

[screenshots]

Scenario with "leastconn", where one Keycloak picks up a lot of requests, and things slow down/queue up:

[screenshot]

Diagrams for the authentication code:

Not OK scenario, increased latencies after a pod restart, plus increasing latencies over time:

[screenshots]

Response time histogram over time:

[screenshots]

Not OK scenario, system failure after creating ~100000 sessions:

[screenshots]


ahus1 commented Jul 4, 2023

Created a spin-off issue #410 to analyze the problems around user sessions.


kami619 commented Jul 5, 2023

Thanks for the detailed analysis of the desired capacity runs for Keycloak, @ahus1.

Having only 3 pods of Keycloak running. This led to overload scenarios (possibly at the moment when restarted pods were put back into the load balancing).

I am interested to know more about the sizing change from 3 to 6 Keycloak pods to accommodate the overload, and where the bottleneck was observed. Does it mean we hit the maximum capacity possible with 3 Keycloak pods in this case, and what metrics were achieved with that system?


ahus1 commented Jul 5, 2023

Does it mean we hit the maximum capacity possible with 3 Keycloak pods in this case?

When using only three pods and restarting one of them, that instance receives one third of the total load the moment it becomes ready. This is quite a peak, and with the JVM not yet warmed up it leads to a high spike of concurrent requests, which overloads the pod. With 6 pods, a restarting pod receives only one sixth of the load, which is a smaller peak.
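For example, at the 1,000 client credential grants per second from Test 1, a newly ready pod immediately takes roughly 333 requests per second with 3 pods, but only about 167 requests per second with 6 pods.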

This is caused by the "roundrobin" strategy which we are using by default. As far as I understand the docs, a "leastconn" distribution should deliver better results, as each Keycloak would process the same number of active sessions - still, this didn't show up in the results. Maybe this only works with reencrypt and not passthrough. We might need to consult an expert on this.

For the metrics, refer to the screenshots above. Please re-run the tests in a more controlled way than I did. For the Authentication Code scenario, there is an additional spin-off issue, as the results were not as good as anticipated.


kami619 commented Jul 5, 2023

This is caused by the "roundrobin" strategy which we are using by default. As far as I understand the docs, a "leastconn" distribution should deliver better results, as each Keycloak would process the same number of active sessions - still, this didn't show up in the results. Maybe this only works with reencrypt and not passthrough. We might need to consult an expert on this.

Thanks @ahus1 for the additional detail on this. I agree that we might need more info on the ingress/load balancer configuration to understand this better.


kami619 commented Jul 20, 2023

We were able to confirm @ahus1's previous load test outcome: the Client Secret scenario works as expected against a similarly set up Keycloak deployment, needs approximately 1 vCPU for every 250 users-per-sec of throughput, and it would be wise to allocate 2 vCPU per 250 users-per-sec to accommodate any JDBC spikes that may occur during pod restarts.
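As a worked example of that sizing rule: the 1,000 client credential grants per second from Test 1 would need roughly 1000 / 250 = 4 vCPU as a baseline, or 8 vCPU with the suggested 2x headroom for restart spikes.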

We spent quite a bit of time understanding the peaks for the Authorization Code scenario, and we found that:

  • Without any failures we were able to exceed the target: with 100,000 sessions already in the ISPN cache, we were able to achieve 50 users-per-sec of throughput per 1 vCPU for the Authorization Code scenario, with response times under 300 ms for the endpoints.
  • With simulated failures we found CPU throttling to be the main concern when pod restarts happen; it would make sense to allocate 2x the vCPU of a standard load to accommodate any resource utilization spikes (see the sketch after this list for one way to spot throttling).
  • In addition to that, we found an ISPN cache communication issue which occurs randomly when a pod restarts and fails to come back up cleanly due to a failure in the cluster-wide communication.
  • We found a couple more cluster-wide communication issues, which we suspect will be fixed by the ISPN fix from the above issue; we will retest once we have a build for it and log additional tickets as needed.
  • For a 6-pod Keycloak setup, the pods recover from a failure within 45 seconds with OTEL enabled and 25 seconds with OTEL disabled, as long as the cluster doesn't experience other secondary issues such as CPU throttling or ISPN communication problems.
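
A sketch for spotting the CPU throttling mentioned above, assuming the cluster's Prometheus scrapes cAdvisor metrics and the pods run in a "keycloak" namespace (both are assumptions, and the names are illustrative):

oc adm top pods -n keycloak                              # current CPU/memory usage per pod
# PromQL for the OpenShift monitoring console, showing throttled CPU periods per pod:
#   sum by (pod) (rate(container_cpu_cfs_throttled_periods_total{namespace="keycloak"}[5m]))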

Overall I think we achieved what we set out to do with the single OCP cluster failure tests.


ahus1 commented Jul 21, 2023

Thank you @kami619 - this resolves it for me. Added #438 as a follow-up as the cluster failures we've seen are troubling me.
