
Active/Active XSite fencing #29303

Open
4 tasks
ryanemerson opened this issue May 2, 2024 · 5 comments · May be fixed by keycloak/keycloak-benchmark#819

Comments

@ryanemerson
Contributor

ryanemerson commented May 2, 2024

To ensure that both sites in an Active/Active deployment do not continue to service requests when a split-brain occurs between the sites, we need to implement a mechanism that takes one of the sites offline when such a scenario is detected.

Keycloak XSite Health Probe

Originally we envisaged that the Keycloak pods would expose a dedicated /xsite-check endpoint to serve as a health probe that could be periodically executed by the load balancer. However, this approach has the following limitations:

  • As we're deploying only two sites, no quorum on view membership exists, so when a split-brain occurs it's probable that both sites' xsite-check will fail, leading to both sites being taken offline if this is used directly by a LoadBalancer.
  • A new non-blocking check to probe the Infinispan site view status would need to be implemented in Keycloak.
  • Periodically executing the xsite-check via a mechanism other than the LoadBalancer would require another long-lived deployment, e.g. an EC2 instance or Fargate container, further increasing cost and complexity.

Proposal

Instead of implementing a dedicated xsite-check, we can leverage Prometheus and the Infinispan metrics to determine when a split brain has occurred.

  1. Prometheus alerts and AlertManager are used to trigger a webhook when the xsite site metric transitions from online to offline.
  2. AWS API Gateway provides the webhook URL, forwarding events for processing.
  3. AWS Lambda processes the event (see the sketch below).
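
For illustration only, here's a minimal sketch of what step 3 could look like, assuming a Python Lambda behind an API Gateway proxy integration; the handler body and the 'reporter' label are hypothetical, not part of this proposal:

```python
import json

def lambda_handler(event, context):
    """Hypothetical entry point invoked via the API Gateway webhook URL.

    With a proxy integration, the Alertmanager notification arrives as a
    JSON string in event["body"].
    """
    payload = json.loads(event["body"])

    # Alertmanager sends grouped notifications; only act on firing alerts.
    if payload.get("status") != "firing":
        return {"statusCode": 200, "body": "ignored"}

    for alert in payload.get("alerts", []):
        # 'reporter' is an assumed label identifying the site whose
        # xsite metric transitioned from online to offline.
        site = alert.get("labels", {}).get("reporter", "unknown")
        print(f"Split-brain alert fired, reported by site: {site}")
        # The fencing decision itself is described under "Lambda Logic".

    return {"statusCode": 200, "body": "processed"}
```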

Lambda Logic

At stage 3 the Lambda checks the state of the LoadBalancer and determines whether both sites are healthy.

If both sites are healthy, then a site configured by the user is taken offline by updating the /lb-check path in the Loadbalancer config (the same as with Active/Passive Lambda).

If one site is unhealthy, then there is nothing to do, as either:

  1. That site has failed
  2. The Lambda is processing the second event triggered by another site

To ensure that only a single Lambda is executed at any one time, it may be necessary to leverage AWS SQS or, if that's not possible, implement some form of locking in the Lambda.
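
As a rough sketch of the decision flow above (assuming Python; the helper functions and the SITE_TO_FENCE environment variable are placeholders, and a real implementation would talk to the LoadBalancer via the AWS SDK):

```python
import os

def site_is_healthy(site: str) -> bool:
    # Placeholder: a real implementation would ask the LoadBalancer
    # whether the site's /lb-check is currently passing.
    return True

def take_site_offline(site: str) -> None:
    # Placeholder: a real implementation would update the site's /lb-check
    # path in the LoadBalancer config so it starts failing, as with the
    # Active/Passive Lambda.
    print(f"Fencing site {site}")

def handle_split_brain_alert(sites: list[str]) -> str:
    """Decision logic executed once the split-brain alert reaches the Lambda."""
    healthy = [s for s in sites if site_is_healthy(s)]

    if len(healthy) == len(sites):
        # Both sites look healthy to the LoadBalancer, so this is a genuine
        # split-brain: take the user-configured site offline.
        victim = os.environ.get("SITE_TO_FENCE", sites[0])
        take_site_offline(victim)
        return f"split-brain detected, {victim} taken offline"

    # At least one site is already unhealthy: either that site has failed,
    # or we are processing the second event triggered by the other site.
    return "nothing to do"
```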

Advantages

  • No additional Keycloak implementation required
  • Infinispan metrics are already being scraped
  • Prometheus alerts are the de facto K8s standard
  • Only steps 2 & 3 change if we support multiple cloud providers

Disadvantages

  • Users must utilise Prometheus Alerts
  • Webhook endpoint required

Missing Pieces

  • ISPN-16043 Infinispan metric for global site status
  • Prometheus Alert
  • API Gateway
  • Lambda
@ahus1 ahus1 changed the title Active/Active XSite ring fencing Active/Active XSite fencing May 6, 2024
@ahus1 ahus1 transferred this issue from keycloak/keycloak-benchmark May 6, 2024
@kami619
Contributor

kami619 commented May 6, 2024

Thanks for the details on this @ryanemerson. I have a few questions:

1 - Do we need X number of Lambdas to be deployed and maintained for X number of sites as per this proposal?

2 - Does the statement below indicate that manual intervention is required in a split-brain scenario?

If both sites are healthy, then a site configured by the user is taken offline by updating the /lb-check path in the Loadbalancer config (the same as with Active/Passive Lambda).

3 - In the missing pieces we also want to add some mechanism to handle concurrent execution issues, such as a queue (AWS SQS), right? But having SQS also brings in the quirks of how the events are handled, such as handling in-flight message situations.

4 - Concurrency issues due to a race condition between two or more Lambdas could also create weird situations with the health status of the xsite check. Is it not possible to think of a quorum-based solution, which would be more reliable and reduce the probability of false alerts? To me that seems a more scalable approach when we have an increased number of sites in an HA landscape.

@sventorben
Contributor

I totally get the need for this, but relying on Prometheus to set this up isn't really that easy, right? Most load balancers provide some kind of health-check probing mechanism based on HTTP endpoints.

Wouldn't it be way easier for KC adopters to leverage such a feature instead of relying on Prometheus being available and implementing steps 2 and 3 outlined above?

@ryanemerson
Contributor Author

1 - Do we need X number of Lambdas to be deployed and maintained for X number of sites as per this proposal?

For Multi-AZ deployments we can leverage a single Lambda per region.

2 - Does the statement below indicate that manual intervention is required in a split-brain scenario?

If both sites are healthy, then a site configured by the user is taken offline by updating the /lb-check path in the Loadbalancer config (the same as with Active/Passive Lambda).

Yes. It's necessary for the SRE to determine the appropriate course of action to recover from split-brain, e.g. sync site state before bringing the offline site back online.

3 - In the missing pieces we also want to add some mechanism to handle concurrent execution issues, such as a queue (AWS SQS), right? But having SQS also brings in the quirks of how the events are handled, such as handling in-flight message situations.

Yes, that's right. I need to investigate this more.

4 - Concurrency issues due to a race condition between two or more Lambdas could also create weird situations with the health status of the xsite check. Is it not possible to think of a quorum-based solution, which would be more reliable and reduce the probability of false alerts? To me that seems a more scalable approach when we have an increased number of sites in an HA landscape.

I agree that race conditions between concurrent Lambda executions are a concern if we're unable to limit the number of simultaneous Lambda executions to 1.

Unfortunately a quorum-based solution at the Keycloak/Infinispan level is not possible for the Multi-AZ Active/Active scenario, as we only have two members.

We have previously discussed the possibility of mandating that Active/Active deployments require 3 AZs, i.e. Active/Active/Active; however, this has the following disadvantages:

  • Increased cost
  • Increased latency per Keycloak operation as we rely on SYNC XSite replication between all 3 sites
  • Not all AWS regions have 3 AZs

@ryanemerson
Contributor Author

ryanemerson commented May 7, 2024

I totally get the need for this, but relying on Prometheus to set this up isn't really that easy, right? Most load balancers provide some kind of health-check probing mechanism based on HTTP endpoints.

Wouldn't it be way easier for KC adopters to leverage such a feature instead of relying on Prometheus being available and implementing steps 2 and 3 outlined above?

I agree it would be much simpler; however, with two sites this solution is not possible. As we're deploying only two sites, no quorum on view membership exists, so when a split-brain occurs it's probable that both sites' xsite-check will fail, leading to both sites being taken offline and all availability being lost.

With a health-check based solution where we exclusively rely on a site's P2P view of the world, it's also not possible to differentiate between a split-brain and a crashed site.

@ryanemerson
Contributor Author

3 - In the missing pieces we also want to add some mechanism to handle concurrent execution issues, such as a queue (AWS SQS), right? But having SQS also brings in the quirks of how the events are handled, such as handling in-flight message situations.

Yes, that's right. I need to investigate this more.

Luckily, there's a much simpler solution than SQS. We can simply set the Reserved Concurrency of the Lambda to 1, which will prevent concurrent executions of the function.

https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
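
If we go this route, the cap could also be applied programmatically; a sketch with boto3 (the function name below is a placeholder):

```python
import boto3

lambda_client = boto3.client("lambda")

# Limit the fencing function to a single concurrent execution so two
# split-brain events can never be processed in parallel.
lambda_client.put_function_concurrency(
    FunctionName="xsite-fencing",  # placeholder name
    ReservedConcurrentExecutions=1,
)
```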

ryanemerson added a commit to ryanemerson/keycloak-benchmark that referenced this issue May 20, 2024
- User alert routing enabled on ROSA clusters

- PrometheusRule used to trigger AWS Lambda webhook in the event of a
  split-brain so that only a single site remains in the global accelerator endpoints

- Global Accelerator scripts refactored to use OpenTofu when creating
  AWS resources

- Task created to deploy/undeploy Active/Active

- Task created to simulate split-brain scenarios

- 'active-active' flag added to GH actions to differentiate between
  active/passive and active/active deployments

Signed-off-by: Ryan Emerson <remerson@redhat.com>
@ryanemerson ryanemerson reopened this May 29, 2024
ryanemerson added a commit to ryanemerson/keycloak-benchmark that referenced this issue May 30, 2024
- User alert routing enabled on ROSA clusters

- PrometheusRule used to trigger AWS Lambda webhook in the event of a
  split-brain so that only a single site remains in the global accelerator endpoints

- Global Accelerator scripts refactored to use OpenTofu when creating
  AWS resources

- Task created to deploy/undeploy Active/Active

- Task created to simulate split-brain scenarios

- 'active-active' flag added to GH actions to differentiate between
  active/passive and active/active deployments

Signed-off-by: Ryan Emerson <remerson@redhat.com>
ryanemerson added a commit to ryanemerson/keycloak-benchmark that referenced this issue May 30, 2024
- User alert routing enabled on ROSA clusters

- PrometheusRule used to trigger AWS Lambda webhook in the event of a
  split-brain so that only a single site remains in the global accelerator endpoints

- Global Accelerator scripts refactored to use OpenTofu when creating
  AWS resources

- Task created to deploy/undeploy Active/Active

- Task created to simulate split-brain scenarios

- 'active-active' flag added to GH actions to differentiate between
  active/passive and active/active deployments

- 'active-active' and 'active-passive' tags added to crossdc-tests to
  allow different behaviours/tests to be executed for the given
  deployment type.

- Daily scheduled job updated to run tests against both active/passive
  and active/active deployments

Signed-off-by: Ryan Emerson <remerson@redhat.com>
ryanemerson added a commit to ryanemerson/keycloak-benchmark that referenced this issue Jun 6, 2024
- User alert routing enabled on ROSA clusters

- PrometheusRule used to trigger AWS Lambda webhook in the event of a
  split-brain so that only a single site remains in the global accelerator endpoints

- Global Accelerator scripts refactored to use OpenTofu when creating
  AWS resources

- Task created to deploy/undeploy Active/Active

- Task created to simulate split-brain scenarios

- 'active-active' flag added to GH actions to differentiate between
  active/passive and active/active deployments

- 'active-active' and 'active-passive' tags added to crossdc-tests to
  allow different behaviours/tests to be executed for the given
  deployment type.

- Active/Active specific test cases added. Testsuite now interacts
  directly with k8s clusters in order to have greater control over
  deployments being tested. This is necessary so that we can simulate
  split-brain scenarios between sites.

- Daily scheduled job updated to run tests against both active/passive
  and active/active deployments

Signed-off-by: Ryan Emerson <remerson@redhat.com>
Co-authored-by: Michal Hajas <mhajas@redhat.com>
Co-authored-by: Pedro Ruivo <pruivo@users.noreply.github.com>