
Active/Active XSite fencing #29303

Open
4 tasks
ryanemerson opened this issue May 2, 2024 · 5 comments · May be fixed by keycloak/keycloak-benchmark#819

Comments

@ryanemerson
Contributor

ryanemerson commented May 2, 2024

To ensure that both sites in an Active/Active deployment do not continue to service requests when a split-brain occurs between the sites, we need to implement a mechanism that takes one of the sites offline when such a scenario is detected.

Keycloak XSite Health Probe

Originally we envisaged that the Keycloak pods would expose a dedicated /xsite-check endpoint to serve as a health probe that could be periodically executed by the load balancer. However, this approach has the following limitations:

  • As we're deploying only two sites, no quorum on view membership exists, so when a split-brain occurs it's probable that both sites' xsite-check will fail, leading to both sites being taken offline if this is used directly by a LoadBalancer.
  • A new non-blocking check to probe the Infinispan site view status would need to be implemented in Keycloak.
  • Periodically executing the xsite-check via a mechanism other than the LoadBalancer would require another long-lived deployment, e.g. an EC2 instance or Fargate container, further increasing cost and complexity.

Proposal

Instead of implementing a dedicated xsite-check, we can leverage Prometheus and the Infinispan metrics to determine when a split brain has occurred.

  1. Prometheus alerts and AlertManager are used to trigger a webhook when the xsite site metric transitions from online to offline.
  2. AWS API Gateway provides the webhook URL, forwarding events for processing.
  3. AWS Lambda processes the event (see the sketch below).
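
For illustration only, here's a minimal sketch of what step 3 could look like, assuming a Python Lambda behind an API Gateway proxy integration; the handler body and the 'reporter' label are hypothetical, not part of this proposal:

```python
import json

def lambda_handler(event, context):
    """Hypothetical entry point invoked via the API Gateway webhook URL.

    With a proxy integration, the Alertmanager notification arrives as a
    JSON string in event["body"].
    """
    payload = json.loads(event["body"])

    # Alertmanager sends grouped notifications; only act on firing alerts.
    if payload.get("status") != "firing":
        return {"statusCode": 200, "body": "ignored"}

    for alert in payload.get("alerts", []):
        # 'reporter' is an assumed label identifying the site whose
        # xsite metric transitioned from online to offline.
        site = alert.get("labels", {}).get("reporter", "unknown")
        print(f"Split-brain alert fired, reported by site: {site}")
        # The fencing decision itself is described under "Lambda Logic".

    return {"statusCode": 200, "body": "processed"}
```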

Lambda Logic

At stage 3 the Lambda checks the state of the LoadBalancer and determines whether both sites are healthy.

If both sites are healthy, then a site configured by the user is taken offline by updating the /lb-check path in the Loadbalancer config (the same as with Active/Passive Lambda).

If one site is unhealthy, then there is nothing to do, as either:

  1. That site has failed
  2. The Lambda is processing the second event triggered by another site

To ensure that only a single Lambda is executed at any one time, it may be necessary to leverage AWS SQS or, if that's not possible, implement some form of locking in the Lambda.
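
As a rough sketch of the decision flow above (assuming Python; the helper functions and the SITE_TO_FENCE environment variable are placeholders, and a real implementation would talk to the LoadBalancer via the AWS SDK):

```python
import os

def site_is_healthy(site: str) -> bool:
    # Placeholder: a real implementation would ask the LoadBalancer
    # whether the site's /lb-check is currently passing.
    return True

def take_site_offline(site: str) -> None:
    # Placeholder: a real implementation would update the site's /lb-check
    # path in the LoadBalancer config so it starts failing, as with the
    # Active/Passive Lambda.
    print(f"Fencing site {site}")

def handle_split_brain_alert(sites: list[str]) -> str:
    """Decision logic executed once the split-brain alert reaches the Lambda."""
    healthy = [s for s in sites if site_is_healthy(s)]

    if len(healthy) == len(sites):
        # Both sites look healthy to the LoadBalancer, so this is a genuine
        # split-brain: take the user-configured site offline.
        victim = os.environ.get("SITE_TO_FENCE", sites[0])
        take_site_offline(victim)
        return f"split-brain detected, {victim} taken offline"

    # At least one site is already unhealthy: either that site has failed,
    # or we are processing the second event triggered by the other site.
    return "nothing to do"
```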

Advantages

  • No additional Keycloak implementation required
  • Infinispan metrics are already being scraped
  • Prometheus alerts are the de facto K8s standard
  • Only steps 2 & 3 change if we support multiple cloud providers

Disadvantages

  • Users must utilise Prometheus Alerts
  • Webhook endpoint required

Missing Pieces

  • ISPN-16043 Infinispan metric for global site status
  • Prometheus Alert
  • API Gateway
  • Lambda
@ahus1 ahus1 changed the title Active/Active XSite ring fencing Active/Active XSite fencing May 6, 2024
@ahus1 ahus1 transferred this issue from keycloak/keycloak-benchmark May 6, 2024
@kami619
Contributor

kami619 commented May 6, 2024

Thanks for the details on this @ryanemerson. I have a few questions:

1 - Do we need X number of Lambdas to be deployed and maintained for X number of sites as per this proposal?

2 - Does the statement below indicate that manual intervention is required in a split-brain scenario?

If both sites are healthy, then a site configured by the user is taken offline by updating the /lb-check path in the Loadbalancer config (the same as with Active/Passive Lambda).

3 - In the missing pieces we also want to add some mechanism to handle concurrent execution issues, such as a queue (AWS SQS), right? But having SQS also brings in the quirks of how the events are handled, such as handling in-flight message situations.

4 - Concurrency issues due to a race condition between two or more Lambdas could also create weird situations with the health status of the xsite check. Is it not possible to think of a quorum-based solution, which would be more reliable and reduce the probability of false alerts? To me that seems a more scalable approach when we have an increased number of sites in an HA landscape.

@sventorben
Contributor

I totally get the need for this, but relying on Prometheus to set this up isn't really that easy, right? Most load balancers provide some kind of health-check probing mechanism based on HTTP endpoints.

Wouldn't it be way easier for KC adopters to leverage such a feature instead of relying on Prometheus being available and implementing steps 2 and 3 outlined above?

@ryanemerson
Contributor Author

1 - Do we need X number of Lambdas to be deployed and maintained for X number of sites as per this proposal?

For Multi-AZ deployments we can leverage a single Lambda per region.

2 - Does the statement below indicate that manual intervention is required in a split-brain scenario?

If both sites are healthy, then a site configured by the user is taken offline by updating the /lb-check path in the Loadbalancer config (the same as with Active/Passive Lambda).

Yes. It's necessary for the SRE to determine the appropriate course of action to recover from split-brain, e.g. sync site state before bringing the offline site back online.

3 - In the missing pieces we also want to add some mechanism to handle concurrent execution issues, such as a queue (AWS SQS), right? But having SQS also brings in the quirks of how the events are handled, such as handling in-flight message situations.

Yes, that's right. I need to investigate this more.

4 - Concurrency issues due to a race condition between two or more Lambdas could also create weird situations with the health status of the xsite check. Is it not possible to think of a quorum-based solution, which would be more reliable and reduce the probability of false alerts? To me that seems a more scalable approach when we have an increased number of sites in an HA landscape.

I agree that race conditions between concurrent Lambda executions are a concern if we're unable to limit the number of simultaneous Lambda executions to 1.

Unfortunately a quorum-based solution at the Keycloak/Infinispan level is not possible for the Multi-AZ Active/Active scenario, as we only have two members.

We have previously discussed the possibility of mandating that Active/Active deployments require 3 AZs, i.e. Active/Active/Active; however, this has the following disadvantages:

  • Increased cost
  • Increased latency per Keycloak operation as we rely on SYNC XSite replication between all 3 sites
  • Not all AWS regions have 3 AZs

@ryanemerson
Contributor Author

ryanemerson commented May 7, 2024

I totally get the need for this, but relying on Prometheus to set this up isn't really that easy, right? Most load balancers provide some kind of health-check probing mechanism based on HTTP endpoints.

Wouldn't it be way easier for KC adopters to leverage such a feature instead of relying on Prometheus being available and implementing steps 2 and 3 outlined above?

I agree it would be much simpler; however, with two sites this solution is not possible. As we're deploying only two sites, no quorum on view membership exists, so when a split-brain occurs it's probable that both sites' xsite-check will fail, leading to both sites being taken offline and all availability being lost.

With a health-check based solution where we exclusively rely on a site's P2P view of the world, it's also not possible to differentiate between a split-brain and a crashed site.

@ryanemerson
Contributor Author

3 - In the missing pieces we also want to add some mechanism to handle concurrent execution issues, such as a queue (AWS SQS), right? But having SQS also brings in the quirks of how the events are handled, such as handling in-flight message situations.

Yes, that's right. I need to investigate this more.

Luckily, there's a much simpler solution than SQS. We can simply set the Reserved Concurrency of the Lambda to 1, which will prevent concurrent executions of the function.

https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
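
If we go this route, the cap could also be applied programmatically; a sketch with boto3 (the function name below is a placeholder):

```python
import boto3

lambda_client = boto3.client("lambda")

# Limit the fencing function to a single concurrent execution so two
# split-brain events can never be processed in parallel.
lambda_client.put_function_concurrency(
    FunctionName="xsite-fencing",  # placeholder name
    ReservedConcurrentExecutions=1,
)
```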

ryanemerson added a commit to ryanemerson/keycloak-benchmark that referenced this issue May 20, 2024
- User alert routing enabled on ROSA clusters

- PrometheusRule used to trigger AWS Lambda webhook in the event of a
  split-brain so that only a single site remains in the global accelerator endpoints

- Global Accelerator scripts refactored to use OpenTofu when creating
  AWS resources

- Task created to deploy/undeploy Active/Active

- Task created to simulate split-brain scenarios

- 'active-active' flag added to GH actions to differentiate between
  active/passive and active/active deployments

Signed-off-by: Ryan Emerson <remerson@redhat.com>
@ryanemerson ryanemerson reopened this May 29, 2024
ryanemerson added a commit to ryanemerson/keycloak-benchmark that referenced this issue May 30, 2024
- User alert routing enabled on ROSA clusters

- PrometheusRule used to trigger AWS Lambda webhook in the event of a
  split-brain so that only a single site remains in the global accelerator endpoints

- Global Accelerator scripts refactored to use OpenTofu when creating
  AWS resources

- Task created to deploy/undeploy Active/Active

- Task created to simulate split-brain scenarios

- 'active-active' flag added to GH actions to differentiate between
  active/passive and active/active deployments

Signed-off-by: Ryan Emerson <remerson@redhat.com>
ryanemerson added a commit to ryanemerson/keycloak-benchmark that referenced this issue May 30, 2024
- User alert routing enabled on ROSA clusters

- PrometheusRule used to trigger AWS Lambda webhook in the event of a
  split-brain so that only a single site remains in the global accelerator endpoints

- Global Accelerator scripts refactored to use OpenTofu when creating
  AWS resources

- Task created to deploy/undeploy Active/Active

- Task created to simulate split-brain scenarios

- 'active-active' flag added to GH actions to differentiate between
  active/passive and active/active deployments

- 'active-active' and 'active-passive' tags added to crossdc-tests to
  allow different behaviours/tests to be executed for the given
  deployment type.

- Daily scheduled job updated to run tests against both active/passive
  and active/active deployments

Signed-off-by: Ryan Emerson <remerson@redhat.com>
ryanemerson added a commit to ryanemerson/keycloak-benchmark that referenced this issue Jun 6, 2024
- User alert routing enabled on ROSA clusters

- PrometheusRule used to trigger AWS Lambda webhook in the event of a
  split-brain so that only a single site remains in the global accelerator endpoints

- Global Accelerator scripts refactored to use OpenTofu when creating
  AWS resources

- Task created to deploy/undeploy Active/Active

- Task created to simulate split-brain scenarios

- 'active-active' flag added to GH actions to differentiate between
  active/passive and active/active deployments

- 'active-active' and 'active-passive' tags added to crossdc-tests to
  allow different behaviours/tests to be executed for the given
  deployment type.

- Active/Active specific test cases added. Testsuite now interacts
  directly with k8s clusters in order to have greater control over
  deployments being tested. This is necessary so that we can simulate
  split-brain scenarios between sites.

- Daily scheduled job updated to run tests against both active/passive
  and active/active deployments

Signed-off-by: Ryan Emerson <remerson@redhat.com>
Co-authored-by: Michal Hajas <mhajas@redhat.com>
Co-authored-by: Pedro Ruivo <pruivo@users.noreply.github.com>