Active/Active XSite fencing. Resolves keycloak#29303 #819

ryanemerson · 2024-05-16T09:55:52Z

Changes

User alert routing enabled on ROSA clusters
PrometheusRule used to trigger AWS Lambda webhook in the event of a split-brain so that only a single site remains in the global accelerator endpoints
Global Accelerator scripts refactored to use OpenTofu when creating AWS resources
Task created to deploy/undeploy Active/Active
Task created to simulate split-brain scenarios
'active-active' flag added to GH actions to differentiate between active/passive and active/active deployments

Global Accelerator Provisioning

The global accelerator provisioning uses a hybrid approach for creating AWS resources. The NLBs required for the accelerator endpoints are created via Kubernetes LoadBalancer services in each of the nodes. This is done because it is much simpler than explicitly provisioning an NLB for each site using OpenTofu. Consequently, the OpenTofu accelerator module simply references these existing NLBs via data sources so that we can add them to the accelerator endpoint group.
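As a rough illustration of that last step, the sketch below builds the endpoint list for an accelerator endpoint group from NLB ARNs that already exist. The function name, the example ARNs and the equal weighting are assumptions made for illustration; the module itself does this in OpenTofu via data sources, not in Python. The dictionary keys follow the `EndpointConfigurations` shape of the AWS Global Accelerator API.

```python
from typing import Dict, List


def build_endpoint_configurations(nlb_arns: List[str], weight: int = 128) -> List[Dict[str, object]]:
    """Return one endpoint entry per existing NLB, in a stable order."""
    if not nlb_arns:
        raise ValueError("at least one NLB ARN is required")
    # Deduplicate while keeping order so a repeated lookup does not add a site twice.
    unique_arns = list(dict.fromkeys(nlb_arns))
    return [{"EndpointId": arn, "Weight": weight} for arn in unique_arns]


# Hypothetical ARNs standing in for the NLBs created by the LoadBalancer services.
SITE_A = "arn:aws:elasticloadbalancing:eu-west-1:111111111111:loadbalancer/net/site-a/0000000000000001"
SITE_B = "arn:aws:elasticloadbalancing:eu-west-1:111111111111:loadbalancer/net/site-b/0000000000000002"

endpoints = build_endpoint_configurations([SITE_A, SITE_B])
```

Giving both sites the same weight is only one possible choice; it reflects the idea that neither site is primary in an Active/Active deployment.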

Testing

Provision an active/active deployment:

gh workflow run rosa-multi-az-cluster-create.yml -f activeActive=true -f clusterPrefix= -f region=

Inspect the AWS Global Accelerator console and ensure that the endpoint group contains two endpoints, one for each site.
Simulate a split-brain scenario:

cd provision/infinispan
PREFIX= ROSA_CLUSTER_NAME_1=$PREFIX-a ROSA_CLUSTER_NAME_2=$PREFIX-b NAMESPACE=runner-keycloak task crossdc-split

Navigate to the OpenShift Console and ensure an event was fired: go to Observe -> Alerting and apply the "user" filter. A "SiteOffline" event should have been fired.
Inspect the AWS Global Accelerator console and ensure that the endpoint group now only contains a single endpoint.
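The expected outcome of the last step, an endpoint group reduced to a single site, can be sketched as a pure function. This is not the code in stonith_lambda.py; the function name and endpoint identifiers are hypothetical, and how the surviving site is chosen is left to the caller because it is not described here.

```python
from typing import Dict, List


def fence_to_single_site(endpoints: List[Dict[str, object]], surviving_endpoint_id: str) -> List[Dict[str, object]]:
    """Keep only the endpoint of the site that should stay reachable."""
    remaining = [e for e in endpoints if e["EndpointId"] == surviving_endpoint_id]
    if not remaining:
        # Refuse to produce an empty group, which would take every site offline.
        raise ValueError(f"unknown endpoint: {surviving_endpoint_id}")
    return remaining


# Hypothetical endpoint group before and after a simulated split-brain.
before = [
    {"EndpointId": "nlb-site-a", "Weight": 128},
    {"EndpointId": "nlb-site-b", "Weight": 128},
]
after = fence_to_single_site(before, "nlb-site-a")
```

The guard against an empty result is a design assumption: a fencing step that removed both sites would be worse than the split-brain it is meant to resolve.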

TODO

Still missing:

Infinispan 15.0.4.Final: ISPN-16043 Metric for JGroups cross-site view infinispan/infinispan#12368
Scaling Benchmark Integration
Webhook authentication
Documentation
Infinispan 15.0.5.Final: [15.0.x] ISPN-16154 Cross site view change event logs stale view infinispan/infinispan#12484

ahus1 · 2024-05-27T11:40:00Z

@ryanemerson - I see that the metric vendor_jgroups_site_view_status is now available in our cluster. It is present on all Infinispan nodes (assuming that all of them are site masters, then?) and it is 1 all the time, even if we take the second site offline during the setup of our data. This surprises me a bit, though I might not understand the full meaning of that metric.

ryanemerson · 2024-05-28T08:28:11Z

I see that the metric vendor_jgroups_site_view_status is now available in our cluster. It is present on all Infinispan nodes (assuming that all of them are site masters, then?) and it is 1 all the time, even if we take the second site offline during the setup of our data. This surprises me a bit, though I might not understand the full meaning of that metric.

Adding a comment here for interested parties who were not present for our discussion yesterday.

The vendor_jgroups_site_view_status metric represents the status of the JGroups site view. It returns 0 if a site is unreachable, 1 if it is reachable and 2 if its status is unknown. Marking an Infinispan site offline has no impact on this metric, as that is implemented at a higher level within Infinispan and does not change the JGroups site view.
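For anyone consuming the metric in scripts, the three values can be captured in a small enum. This is a minimal sketch; the class and function names are made up for illustration, and treating unexpected samples as unknown is an assumption rather than documented Infinispan behaviour.

```python
from enum import IntEnum


class SiteViewStatus(IntEnum):
    UNREACHABLE = 0
    REACHABLE = 1
    UNKNOWN = 2


def describe_site_view(value: float) -> SiteViewStatus:
    """Map a raw vendor_jgroups_site_view_status sample to a named status."""
    try:
        return SiteViewStatus(int(value))
    except (ValueError, OverflowError):
        # Assumption: any sample outside 0..2 (or NaN/inf) is treated as unknown.
        return SiteViewStatus.UNKNOWN
```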

ryanemerson · 2024-05-28T08:35:57Z

Documentation Changes Required

In order for us to support Active/Active deployments we need to update the following items in the Keycloak HA guide:

Building Blocks

We need to introduce Active/Active equivalents of the two guides above (^).

Blueprints

Deploy an AWS Route 53 loadbalancer: This needs to be replaced with a guide on how to deploy an AWS Global Accelerator.

Operational Procedures

Fail over to the secondary site: We no longer have the notion of a Primary and Backup site, so this needs to be replaced entirely. Instead we should provide a guide on how to correctly remove one of the Active sites.
Switch over to the secondary site: No longer relevant, see above.
Recover from an out-of-sync passive site: This needs to be replaced with an equivalent guide that explains how to re-sync an active site that has been offline for some time. The procedure for doing the synchronisation should be the same, so we can embed a partial .adoc file here to reduce repetition.
Switch back to the primary site: No longer relevant

We should also add the following procedures:

"Recover from Active failure": Detail how to re-sync data and re-add an endpoint to the AWS Accelerator so both sites are available

Proposal

Update the existing https://www.keycloak.org/high-availability/introduction page to link to dedicated Active/Passive and Active/Active overview pages, each of which links to architecture-specific Concepts, Building blocks and Operational procedures. Many of the building blocks will be re-usable, e.g. Deploy Keycloak for HA with the Keycloak Operator.
Add the required Active/Active guides
Only include "Multi-site Deployments", "Active/Passive Overview" and "Active/Active Overview" thumbnails at https://www.keycloak.org/guides#high-availability

ryanemerson · 2024-05-30T16:21:20Z

I've updated the crossdc-tests and associated actions so that the functional tests are executed against both Active/Active and Active/Passive deployments. Because the two deployment types have different semantics, and not all tests are applicable to both, I have created two tag annotations to control which tests are triggered: @ActiveActive and @ActivePassive. For example, FailoverTest#logoutUserWithFailoverTest will fail with Active/Active clusters as it expects a failover to occur from an Active to a Passive cluster.

mhajas

Nice work @ryanemerson!! I like the implementation. It seems you thought this through properly. I added a few comments but in general I think this setup is great.

One more thing I am missing in the PR is how to return the endpoint to the global accelerator? Should we have a task for that? Should we add a new test for this? We would need to teach functional tests how to do this.

...chmark-crossdc-tests/src/test/java/org/keycloak/benchmark/crossdc/EntityReplicationTest.java

provision/infinispan/Utils.yaml

provision/infinispan/ispn-helm/templates/infinispan-alerts.yaml

.github/workflows/rosa-run-crossdc-func-tests.yml

.github/workflows/rosa-multi-az-cluster-create.yml

provision/opentofu/modules/aws/accelerator/src/stonith_lambda.py

pruivo

I have some minor comments.

...k-benchmark-crossdc-tests/src/test/java/org/keycloak/benchmark/crossdc/client/AWSClient.java

doc/kubernetes/modules/ROOT/pages/running/bring-active-site-online.adoc

.github/workflows/rosa-cluster-auto-provision-on-schedule.yml

doc/kubernetes/modules/ROOT/pages/running/split-brain-stonith.adoc

provision/openshift/Taskfile.yaml

ryanemerson · 2024-06-06T09:10:20Z

Thanks for the review @pruivo. My intention was to add the TODO parts today, I just pushed the "WIP" commit so that I had a backup.

ryanemerson · 2024-06-06T17:14:14Z

Operational guides added for Take Site Offline and Bring Site Online, as well as a building block to Deploy an AWS Lambda to guard against Split-Brain.

We still need to add operational guides on how to synchronize site state, but I think we first need to decide how users should do that. They could have conflicting state, as there's a window during split-brain where both sites will be active (before the split is detected and the STONITH Lambda fires). cc @pruivo

- User alert routing enabled on ROSA clusters
- PrometheusRule used to trigger AWS Lambda webhook in the event of a split-brain so that only a single site remains in the global accelerator endpoints
- Global Accelerator scripts refactored to use OpenTofu when creating AWS resources
- Task created to deploy/undeploy Active/Active
- Task created to simulate split-brain scenarios
- 'active-active' flag added to GH actions to differentiate between active/passive and active/active deployments
- 'active-active' and 'active-passive' tags added to crossdc-tests to allow different behaviours/tests to be executed for the given deployment type.
- Active/Active specific test cases added. Testsuite now interacts directly with k8s clusters in order to have greater control over deployments being tested. This is necessary so that we can simulate split-brain scenarios between sites.
- Daily scheduled job updated to run tests against both active/passive and active/active deployments

Signed-off-by: Ryan Emerson <remerson@redhat.com>
Co-authored-by: Michal Hajas <mhajas@redhat.com>
Co-authored-by: Pedro Ruivo <pruivo@users.noreply.github.com>
Signed-off-by: Ryan Emerson <remerson@redhat.com>

mhajas

Thank you @ryanemerson! The PR looks great! I added a few comments.

doc/kubernetes/modules/ROOT/pages/running/bring-active-site-online.adoc

doc/kubernetes/modules/ROOT/pages/running/loadbalancing.adoc

doc/kubernetes/modules/ROOT/pages/running/split-brain-stonith.adoc

provision/infinispan/ispn-helm/values.yaml

...cloak-benchmark-crossdc-tests/src/test/java/org/keycloak/benchmark/crossdc/FailoverTest.java

...chmark-crossdc-tests/src/test/java/org/keycloak/benchmark/crossdc/client/DatacenterInfo.java

...enchmark-crossdc-tests/src/test/java/org/keycloak/benchmark/crossdc/AbstractCrossDCTest.java

mhajas

Nice work @ryanemerson! Thank you for addressing all the comments. I believe this is ready for merging, but I would wait for keycloak/keycloak#29474 to be merged first so that we minimize the number of failures tomorrow, should any occur.

If the Protostream change won't land today, we can merge this one.

Edit: I have set the DCO check to pass as the commits will be signed off correctly after squashing.

ahus1 · 2024-06-11T17:16:54Z

Protostream will not land today, so merging this one.

ryanemerson force-pushed the active_active_fencing branch 5 times, most recently from 7c02adc to 587ceda Compare May 21, 2024 08:14

ryanemerson mentioned this pull request May 21, 2024

Upgrade to Infinispan 15.0.4.Final #824

Merged

ryanemerson force-pushed the active_active_fencing branch 5 times, most recently from 88742c9 to 2ba7f43 Compare May 30, 2024 16:17

mhajas reviewed May 30, 2024

ryanemerson force-pushed the active_active_fencing branch 3 times, most recently from fae67a8 to faa1c2e Compare June 5, 2024 13:05

pruivo reviewed Jun 6, 2024

ryanemerson force-pushed the active_active_fencing branch from 47fc8d7 to 2ea6f20 Compare June 6, 2024 15:44

ryanemerson force-pushed the active_active_fencing branch from 07d8f6c to 4e198d1 Compare June 7, 2024 13:27

ryanemerson force-pushed the active_active_fencing branch from 4e198d1 to d612cda Compare June 10, 2024 09:36

mhajas reviewed Jun 10, 2024

ryanemerson and others added 2 commits June 10, 2024 14:27

Update doc/kubernetes/modules/ROOT/pages/running/split-brain-stonith.…

9649304

…adoc Co-authored-by: Michal Hajas <mhajas@redhat.com>

Michal feedback

5ffd611

Signed-off-by: Ryan Emerson <remerson@redhat.com>

ryanemerson marked this pull request as ready for review June 10, 2024 13:57

mhajas approved these changes Jun 11, 2024

Add basic authentication to Lambda webhook

ca53e87

Signed-off-by: Ryan Emerson <remerson@redhat.com>

ahus1 merged commit f30cebc into keycloak:main Jun 11, 2024
2 of 3 checks passed

This was referenced Jul 1, 2024

Active/Active XSite fencing keycloak/keycloak#29303

Closed

Active/Active x-site Infinispan and loadbalancer handling keycloak/keycloak#30971

Open
