Optimise failure scenario #393
@kami619 / @mhajas - when using the EC2 instances and some updated infrastructure, I was able to deliver the expected load in a "medium" sized scenario.
Setup of the system:
Test 1:
Results: works, but there are still some high peaks when a pod is restarted. Presumably this is due to a new pod being put into load-balancing with its full share right away; because of that, requests queue up for that pod quite a bit. AFAIK there is no way to slow-start a pod. What didn't work:
Test 2:
Results: the latencies tend to build up, and there are very high peaks when a pod is restarted. Presumably this is again due to a new pod being put into load-balancing with its full share, plus additional trouble as the other Keycloak instances try to connect to the pod that has just been terminated? As we're randomly restarting pods, it seems that the pods which are not restarted build up a high JVM GC overhead (gradually up to 10+%), and towards the end of the test we see the latencies increase.
Test 3:
Results: with an increasing number of sessions, the latencies of the requests grow longer. At around 100,000 sessions created, the pods restart repeatedly, possibly due to the liveness checks timing out (a possible probe adjustment is sketched below).
Diagrams for the client credentials (screenshots):
- OK scenario.
- Scenario with "leastconn", where one Keycloak picks up a lot of requests, and things slow down/queue up.
Diagrams for the authentication code (screenshots):
- Not OK scenario: increased latencies after a pod restart, plus increasing latencies over time.
- Response time histogram over time.
- Not OK scenario: system failure after creating ~100,000 sessions.
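Editor's note: one possible mitigation for the repeated restarts under a high session count would be to give the liveness probe more headroom before the kubelet kills an overloaded pod. The following is only a sketch under assumptions: the health path and port follow Keycloak's Quarkus health endpoints, and the concrete numbers are illustrative, not derived from these test runs.

```yaml
# Fragment of a Keycloak container spec; the container name, probe path
# and port are illustrative assumptions, not taken from the test setup.
containers:
  - name: keycloak
    livenessProbe:
      httpGet:
        path: /health/live   # assumes Keycloak's health endpoints are enabled
        port: 8080           # illustrative; depends on how health is exposed
      timeoutSeconds: 5      # give a loaded pod more time to answer a check
      periodSeconds: 10
      failureThreshold: 6    # tolerate ~1 minute of failed checks before restarting
```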
Created a spin-off issue #410 to analyze the problems around user sessions.
thanks for the detailed analysis of the desired capacity runs for Keycloak @ahus1
I am interested to know more about the sizing change from 3 to 6 Keycloak pods to accommodate the overload, and where the bottleneck was observed. Does it mean we have hit the maximum capacity possible with 3 Keycloak pods in this case, and what were the metrics achieved with that system?
When using only three pods and restarting one of them, that instance receives 1/3rd of the total load the moment it becomes ready. This is quite a peak, and with the JVM not yet warmed up, it leads to a high spike of concurrent requests which overloads the pod. With 6 pods, a restarted pod receives only 1/6th of the load, which is a smaller peak. This is caused by the "roundrobin" strategy which we are using by default. From my understanding of the docs, a "leastconn" distribution should deliver better results, as each Keycloak would then process roughly the same number of active connections - still, this didn't show up in the results. Maybe this only works with reencrypt and not with passthrough routes. We might need to get an expert on this. For the metrics, refer to the screenshots above. Please re-run the tests in a more controlled way than I did. For the Authentication Code scenario, there is an additional spin-off issue, as the results were not as good as anticipated.
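Editor's note: for reference, a minimal sketch of what switching the balancing strategy could look like, assuming the default HAProxy-based OpenShift router; the route and service names are illustrative, and whether the annotation takes effect for passthrough TLS routes would still need to be confirmed.

```yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: keycloak            # illustrative name, not necessarily the route used in these tests
  annotations:
    # Ask the HAProxy router to balance on active connections
    # instead of the default round-robin distribution.
    haproxy.router.openshift.io/balance: leastconn
spec:
  to:
    kind: Service
    name: keycloak          # illustrative service name
  tls:
    termination: passthrough
```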
thanks @ahus1 for the additional detail on this. I agree that we might need more info on the ingress/load balancer configuration to understand this better.
We were able to confirm @ahus1's previous load test outcome that the Client Secret scenario works as expected against a similarly set up Keycloak application. It needs approximately 1 vCPU for every 250 users-per-sec of throughput, and it would be wise to allocate 2 vCPUs for that run rate to accommodate any JDBC spikes that may occur during pod restarts (see the sizing sketch after this comment). We spent quite a bit of time understanding the peak for Authorization Code, and we found that:
Overall I think we achieved what we set out to do with the single OCP cluster related failure tests.
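Editor's note: to make the sizing rule above concrete, one reading of the numbers is roughly 1 vCPU requested per 250 users/sec of Client Secret throughput, with about twice that as the limit to absorb JDBC spikes during pod restarts. A hypothetical container resource block for a pod expected to serve ~500 users/sec could then look like this; the values are illustrative and not taken from the actual test configuration.

```yaml
# Fragment of a Keycloak container spec; CPU figures follow the reading above,
# memory values are purely illustrative placeholders.
containers:
  - name: keycloak
    resources:
      requests:
        cpu: "2"        # ~500 users/sec at ~250 users/sec per vCPU
        memory: "1Gi"
      limits:
        cpu: "4"        # headroom for JDBC/connection spikes during pod restarts
        memory: "1Gi"
```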