Active/Active XSite fencing. Resolves keycloak#29303 #819

Merged
merged 4 commits into from
Jun 11, 2024

Contributor

@ryanemerson ryanemerson commented May 16, 2024

Resolves keycloak/keycloak#29303

Changes

  • User alert routing enabled on ROSA clusters

  • PrometheusRule used to trigger AWS Lambda webhook in the event of a
    split-brain so that only a single site remains in the global accelerator endpoints

  • Global Accelerator scripts refactored to use OpenTofu when creating
    AWS resources

  • Task created to deploy/undeploy Active/Active

  • Task created to simulate split-brain scenarios

  • 'active-active' flag added to GH actions to differentiate between
    active/passive and active/active deployments

Global Accelerator Provisioning

The global accelerator provisioning uses a hybrid approach for creating AWS resources. The NLBs required for the accelerator endpoints are created via Kubernetes LoadBalancer services in each of the nodes. This is much simpler than explicitly provisioning an NLB for each site using OpenTofu. Consequently, the OpenTofu accelerator module simply references these existing NLBs via data sources so that we can add them to the accelerator endpoint group.
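The "reference existing NLBs" idea can be sketched outside OpenTofu as a lookup followed by a reshaping step. The following is a minimal Python illustration only, not the module's implementation: the input record shape, the `accelerator` tag key, and the weight of 128 are all assumptions made for the example.

```python
from typing import Dict, List


def endpoint_configurations(load_balancers: List[Dict], accelerator_tag: str) -> List[Dict]:
    """Pick already-provisioned NLBs and shape them as endpoint-group entries.

    `load_balancers` is a simplified, hypothetical record list (ARN, type, tags);
    nothing is created here, existing resources are only referenced.
    """
    selected = [
        lb
        for lb in load_balancers
        if lb.get("Type") == "network"
        and lb.get("Tags", {}).get("accelerator") == accelerator_tag  # hypothetical tag key
    ]
    # Sort for a stable result; the weight value is an assumption for this sketch.
    return [
        {"EndpointId": lb["LoadBalancerArn"], "Weight": 128}
        for lb in sorted(selected, key=lambda lb: lb["LoadBalancerArn"])
    ]
```

The point of the design is that the load balancers are owned by the Kubernetes services, so the accelerator side only has to discover them and never manages their lifecycle.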

Testing

  1. Provision an active/active deployment:

     gh workflow run rosa-multi-az-cluster-create.yml -f activeActive=true -f clusterPrefix= -f region=

  2. Inspect the AWS Global Accelerator console and ensure that the endpoint group contains two endpoints, one for each site.

  3. Simulate a split-brain scenario:

     cd provision/infinispan
     PREFIX= ROSA_CLUSTER_NAME_1=$PREFIX-a ROSA_CLUSTER_NAME_2=$PREFIX-b NAMESPACE=runner-keycloak task crossdc-split

  4. Navigate to the OpenShift Console and ensure an event was fired: go to Observe -> Alerting and apply the "user" filter. A "SiteOffline" event should have been fired.

  5. Inspect the AWS Global Accelerator console and ensure that the endpoint group now contains only a single endpoint.
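The two console inspections in the steps above could also be checked programmatically. The sketch below is a hedged illustration: it assumes the response dictionary has the shape returned by the boto3 `globalaccelerator` client's `describe_endpoint_group` call, and the helper names are invented for this example.

```python
from typing import Dict, List


def endpoint_ids(describe_response: Dict) -> List[str]:
    """Extract endpoint IDs from a describe_endpoint_group-style response."""
    group = describe_response["EndpointGroup"]
    return [d["EndpointId"] for d in group.get("EndpointDescriptions", [])]


def has_expected_endpoints(describe_response: Dict, expected: int) -> bool:
    """True when the endpoint group holds exactly `expected` endpoints."""
    return len(endpoint_ids(describe_response)) == expected


# Hand-written sample responses (hypothetical ARNs) for before and after the split.
before_split = {"EndpointGroup": {"EndpointDescriptions": [
    {"EndpointId": "arn:nlb-site-a"}, {"EndpointId": "arn:nlb-site-b"}]}}
after_split = {"EndpointGroup": {"EndpointDescriptions": [
    {"EndpointId": "arn:nlb-site-a"}]}}
```

Before the split `has_expected_endpoints(before_split, 2)` should hold, and after the fencing Lambda has run the same check should hold with `1`.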

TODO

Still missing:

@ryanemerson ryanemerson force-pushed the active_active_fencing branch 5 times, most recently from 7c02adc to 587ceda Compare May 21, 2024 08:14
@ahus1
Contributor

ahus1 commented May 27, 2024

@ryanemerson - I see that the metric vendor_jgroups_site_view_status is now available in our cluster. It is present on all Infinispan nodes (assuming that all of them are site masters, then?) and it is 1 all the time, even if we take the second site offline during the setup of our data. This surprises me a bit, though I might not have grasped the full meaning of that metric.

@ryanemerson
Contributor Author

I see that the metric vendor_jgroups_site_view_status is now available in our cluster. It is present on all Infinispan nodes (assuming that all of them are site masters, then?) and it is 1 all the time, even if we take the second site offline during the setup of our data. This surprises me a bit, though I might not have grasped the full meaning of that metric.

Adding a comment here for interested parties who were not present for our discussion yesterday.

The vendor_jgroups_site_view_status metric represents the status of the JGroups site view. It returns 0 if a site is unreachable, 1 if it is reachable, and 2 if its status is unknown. Marking an Infinispan site offline has no impact on this metric, because that is implemented at a higher level within Infinispan and does not change the JGroups site view.
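To make the three values concrete, here is a small Python sketch that maps a scraped sample of the metric to a named status. The enum and function names are hypothetical, and treating only the value 0 as alert-worthy is an assumption for illustration, not something stated in this PR.

```python
from enum import IntEnum


class SiteViewStatus(IntEnum):
    """Values of vendor_jgroups_site_view_status as described above."""
    UNREACHABLE = 0
    REACHABLE = 1
    UNKNOWN = 2


def interpret(sample: float) -> SiteViewStatus:
    """Convert a raw metric sample into a named status (raises on other values)."""
    return SiteViewStatus(int(sample))


def indicates_site_offline(sample: float) -> bool:
    """Assumption: only an unreachable site view counts as 'offline'."""
    return interpret(sample) is SiteViewStatus.UNREACHABLE
```

Note that an administratively offline Infinispan site would still report `REACHABLE` here, which matches the behaviour observed in the comment above.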

@ryanemerson
Contributor Author

ryanemerson commented May 28, 2024

Documentation Changes Required

In order for us to support Active/Active deployments we need to update the following items in the Keycloak HA guide:

Building Blocks

We need to introduce Active/Active equivalents of the two guides above.

Blueprints

Operational Procedures

We should also add the following procedures:

  • "Recover from Active failure": Detail how to re-sync data and re-add an endpoint to the AWS Accelerator so both sites are available
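The ordering implied by that procedure (re-sync first, then re-add the endpoint) can be sketched as a guard. This is a hypothetical Python illustration; the function name, the `state_synced` flag, and the endpoint identifiers are assumptions, and the actual recovery steps were still to be documented at this point.

```python
from typing import List


def endpoints_after_recovery(current: List[str], recovered: str, state_synced: bool) -> List[str]:
    """Return the endpoint list once a recovered site is re-added.

    Refuses to re-add the site until its data has been re-synchronised,
    so that clients are never routed to a site holding stale state.
    """
    if not state_synced:
        raise ValueError("re-sync site state before re-adding the endpoint")
    if recovered in current:
        return list(current)  # already present, nothing to do
    return list(current) + [recovered]
```

For example, `endpoints_after_recovery(["site-a"], "site-b", state_synced=True)` yields both sites again, while passing `state_synced=False` raises instead of exposing the recovered site.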

Proposal

@ryanemerson ryanemerson force-pushed the active_active_fencing branch 5 times, most recently from 88742c9 to 2ba7f43 Compare May 30, 2024 16:17
@ryanemerson
Contributor Author

I've updated the crossdc-tests and associated actions so that the functional tests are executed against both Active/Active and Active/Passive deployments. Because the two deployment types have different semantics, and not all tests are applicable to both, I have created two tag annotations to control which tests are triggered: @ActiveActive and @ActivePassive. For example, FailoverTest#logoutUserWithFailoverTest will fail with Active/Active clusters, as it expects a failover to occur from an Active to a Passive cluster.
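The selection behaviour of the two tags can be illustrated independently of the test framework. The Python sketch below only models the filtering; it is not how the Java testsuite implements it. The tag strings follow the 'active-active' and 'active-passive' names from the commit message, and every test name other than `logoutUserWithFailoverTest` is made up for the example.

```python
from typing import Dict, List, Set


def select_tests(tests: Dict[str, Set[str]], deployment_tag: str) -> List[str]:
    """Return the names of tests tagged for the given deployment type."""
    return sorted(name for name, tags in tests.items() if deployment_tag in tags)


# Hypothetical test inventory: name -> set of tags.
TESTS = {
    "logoutUserWithFailoverTest": {"active-passive"},
    "splitBrainFencingTest": {"active-active"},  # hypothetical name
    "loginLogoutTest": {"active-active", "active-passive"},  # hypothetical name
}
```

With this inventory, an Active/Active run would skip `logoutUserWithFailoverTest`, while a test carrying both tags runs against either deployment type.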

Contributor

@mhajas mhajas left a comment

Nice work @ryanemerson!! I like the implementation. It seems you thought this through properly. I added a few comments but in general I think this setup is great.

One more thing I am missing in the PR is how to return the endpoint to the global accelerator? Should we have a task for that? Should we add a new test for this? We would need to teach functional tests how to do this.

@ryanemerson ryanemerson force-pushed the active_active_fencing branch 3 times, most recently from fae67a8 to faa1c2e Compare June 5, 2024 13:05
Contributor

@pruivo pruivo left a comment

I have some minor comments.

@ryanemerson
Contributor Author

Thanks for the review @pruivo. My intention was to add the TODO parts today; I just pushed the "WIP" commit so that I had a backup.

@ryanemerson
Contributor Author

ryanemerson commented Jun 6, 2024

Operational guides added for Take Site Offline and Bring Site Online, as well as a building block to Deploy an AWS Lambda to guard against Split-Brain.

We still need to add operational guides on how to synchronize site state, but I think we first need to decide how users should do that. They could have conflicting state, because there is a window during split-brain where both sites are active (before the split is detected and the STONITH Lambda fires). /cc @pruivo
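The fencing decision itself can be sketched as a pure function. This is a hedged Python illustration, not the Lambda shipped in this PR: it assumes each site reports the other as offline during a split, that the first alert processed wins, and that endpoints can be identified by a site name; all names are hypothetical.

```python
from typing import List


def endpoints_after_fencing(current: List[str], reporter: str) -> List[str]:
    """Return the endpoints that remain after a SiteOffline alert from `reporter`.

    During a split-brain both sites may raise the alert. The first alert
    handled keeps the reporting site; later alerts become no-ops because
    only a single endpoint is left.
    """
    if len(current) <= 1:
        return list(current)  # already fenced, ignore further alerts
    if reporter not in current:
        return list(current)  # unknown reporter, change nothing
    return [reporter]


endpoints = ["site-a", "site-b"]
after_first = endpoints_after_fencing(endpoints, "site-a")
after_second = endpoints_after_fencing(after_first, "site-b")
```

Even with this guard, both sites serve traffic between the start of the split and the first alert being handled, which is the window of potentially conflicting state described above.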

- User alert routing enabled on ROSA clusters

- PrometheusRule used to trigger AWS Lambda webhook in the event of a
  split-brain so that only a single site remains in the global accelerator endpoints

- Global Accelerator scripts refactored to use OpenTofu when creating
  AWS resources

- Task created to deploy/undeploy Active/Active

- Task created to simulate split-brain scenarios

- 'active-active' flag added to GH actions to differentiate between
  active/passive and active/active deployments

- 'active-active' and 'active-passive' tags added to crossdc-tests to
  allow different behaviours/tests to be executed for the given
  deployment type.

- Active/Active specific test cases added. Testsuite now interacts
  directly with k8s clusters in order to have greater control over
  deployments being tested. This is necessary so that we can simulate
  split-brain scenarios between sites.

- Daily scheduled job updated to run tests against both active/passive
  and active/active deployments

Signed-off-by: Ryan Emerson <remerson@redhat.com>
Co-authored-by: Michal Hajas <mhajas@redhat.com>
Co-authored-by: Pedro Ruivo <pruivo@users.noreply.github.com>
Signed-off-by: Ryan Emerson <remerson@redhat.com>
Contributor

@mhajas mhajas left a comment

Thank you @ryanemerson! The PR looks great! I added a few comments.

ryanemerson and others added 2 commits June 10, 2024 14:27
…adoc

Co-authored-by: Michal Hajas <mhajas@redhat.com>
Signed-off-by: Ryan Emerson <remerson@redhat.com>
@ryanemerson ryanemerson marked this pull request as ready for review June 10, 2024 13:57
Contributor

@mhajas mhajas left a comment

Nice work @ryanemerson! Thank you for addressing all the comments. I believe this is ready for merging, but I would wait for keycloak/keycloak#29474 to be merged first so that we minimize the number of failures tomorrow, should any occur.

If the Protostream change won't land today, we can merge this one.

Edit: I have set the DCO check to pass as the commits will be signed off correctly after squashing.

Signed-off-by: Ryan Emerson <remerson@redhat.com>
@ahus1 ahus1 merged commit f30cebc into keycloak:main Jun 11, 2024
2 of 3 checks passed
@ahus1
Contributor

ahus1 commented Jun 11, 2024

Protostream will not land today, so merging this one.

Development

Successfully merging this pull request may close these issues.

Active/Active XSite fencing
4 participants