Active/Active XSite fencing. Resolves keycloak#29303 #819
@ryanemerson - I see that the metric |
Adding a comment here for interested parties who were not present for our discussion yesterday. The |
Documentation Changes Required
In order for us to support Active/Active deployments we need to update the following items in the Keycloak HA guide:
Building Blocks
We need to introduce an equivalent of the ^ two guides for Active/Active guides.
Blueprints
Operational Procedures
We should also add the following procedures:
Proposal
I've updated the crossdc-tests and associated actions so that the functional tests are executed against both Active/Active and Active/Passive deployments. Because the two deployment types have different semantics, and not all tests are applicable to both, I have created two tag annotations to control which tests are triggered: |
Nice work @ryanemerson!! I like the implementation. It seems you thought this through properly. I added a few comments but in general I think this setup is great.
One more thing I am missing in the PR is how to return the endpoint to the global accelerator. Should we have a task for that? Should we add a new test for it? We would need to teach the functional tests how to do this.
...chmark-crossdc-tests/src/test/java/org/keycloak/benchmark/crossdc/EntityReplicationTest.java
provision/infinispan/ispn-helm/templates/infinispan-alerts.yaml
provision/opentofu/modules/aws/accelerator/src/stonith_lambda.py
I have some minor comments.
...k-benchmark-crossdc-tests/src/test/java/org/keycloak/benchmark/crossdc/client/AWSClient.java
doc/kubernetes/modules/ROOT/pages/running/bring-active-site-online.adoc
doc/kubernetes/modules/ROOT/pages/running/split-brain-stonith.adoc
Thanks for the review @pruivo. My intention was to add the TODO parts today; I just pushed the "WIP" commit so that I had a backup.
Operational guides added for
We still need to add operational guides on how to synchronize site state, but I think we first need to decide how users should do that. They could have conflicting state, because there is a window during split-brain where both sites are active (before the split is detected and the STONITH Lambda fires). cc @pruivo.
- User alert routing enabled on ROSA clusters
- PrometheusRule used to trigger AWS Lambda webhook in the event of a split-brain so that only a single site remains in the global accelerator endpoints
- Global Accelerator scripts refactored to use OpenTofu when creating AWS resources
- Task created to deploy/undeploy Active/Active
- Task created to simulate split-brain scenarios
- 'active-active' flag added to GH actions to differentiate between active/passive and active/active deployments
- 'active-active' and 'active-passive' tags added to crossdc-tests to allow different behaviours/tests to be executed for the given deployment type.
- Active/Active specific test cases added. Testsuite now interacts directly with k8s clusters in order to have greater control over deployments being tested. This is necessary so that we can simulate split-brain scenarios between sites.
- Daily scheduled job updated to run tests against both active/passive and active/active deployments

Signed-off-by: Ryan Emerson <remerson@redhat.com>
Co-authored-by: Michal Hajas <mhajas@redhat.com>
Co-authored-by: Pedro Ruivo <pruivo@users.noreply.github.com>
Thank you @ryanemerson! The PR looks great! I added a few comments.
doc/kubernetes/modules/ROOT/pages/running/bring-active-site-online.adoc
doc/kubernetes/modules/ROOT/pages/running/split-brain-stonith.adoc
...cloak-benchmark-crossdc-tests/src/test/java/org/keycloak/benchmark/crossdc/FailoverTest.java
...chmark-crossdc-tests/src/test/java/org/keycloak/benchmark/crossdc/client/DatacenterInfo.java
...enchmark-crossdc-tests/src/test/java/org/keycloak/benchmark/crossdc/AbstractCrossDCTest.java
…adoc Co-authored-by: Michal Hajas <mhajas@redhat.com>
Signed-off-by: Ryan Emerson <remerson@redhat.com>
Nice work @ryanemerson! Thank you for addressing all the comments. I believe this is ready for merging, but I would wait for keycloak/keycloak#29474 to be merged first so that we minimize the number of failures tomorrow, should any occur.
If the Protostream change won't land today, we can merge this one.
Edit: I have set the DCO check to pass as the commits will be signed off correctly after squashing.
Signed-off-by: Ryan Emerson <remerson@redhat.com>
Protostream will not land today, so merging this one.
Resolves keycloak/keycloak#29303
Changes
User alert routing enabled on ROSA clusters
PrometheusRule used to trigger AWS Lambda webhook in the event of a split-brain so that only a single site remains in the global accelerator endpoints
Global Accelerator scripts refactored to use OpenTofu when creating AWS resources
Task created to deploy/undeploy Active/Active
Task created to simulate split-brain scenarios
'active-active' flag added to GH actions to differentiate between active/passive and active/active deployments
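As a rough illustration of the PrometheusRule-to-Lambda flow listed above, the Python sketch below shows how a handler could react to an Alertmanager-style webhook payload and shrink the endpoint group to a single site. It is not the contents of `stonith_lambda.py`: the function names, the "lowest endpoint ID survives" rule, and the injected `client` are assumptions made for this example. The client is only expected to expose `describe_endpoint_group` and `update_endpoint_group` in the shape of boto3's `globalaccelerator` client.

```python
ALERT_NAME = "SiteOffline"  # alert name taken from the Testing notes in this PR


def firing_site_offline(payload):
    """Return True if the webhook payload contains a firing SiteOffline alert."""
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if alert.get("status") == "firing" and labels.get("alertname") == ALERT_NAME:
            return True
    return False


def surviving_endpoints(endpoint_ids):
    """Pick the single endpoint to keep.

    Assumption for this sketch: the lowest endpoint ID wins, so that two
    sites handling the same split reach the same decision independently.
    """
    return sorted(endpoint_ids)[:1]


def handle_alert(payload, client, endpoint_group_arn):
    """Reduce the endpoint group to one endpoint when a split-brain alert fires."""
    if not firing_site_offline(payload):
        return None  # nothing to do for resolved or unrelated alerts

    group = client.describe_endpoint_group(EndpointGroupArn=endpoint_group_arn)
    ids = [d["EndpointId"] for d in group["EndpointGroup"]["EndpointDescriptions"]]
    keep = surviving_endpoints(ids)

    if len(ids) > 1:
        # Only call the update API when there is actually something to remove.
        client.update_endpoint_group(
            EndpointGroupArn=endpoint_group_arn,
            EndpointConfigurations=[{"EndpointId": i} for i in keep],
        )
    return keep
```

Passing the client in as a parameter keeps the decision logic testable without AWS credentials; in a real Lambda it would typically be created once with `boto3.client("globalaccelerator", region_name="us-west-2")`.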
Global Accelerator Provisioning
Global accelerator provisioning uses a hybrid approach for creating AWS resources. The NLBs required for the accelerator endpoints are created via Kubernetes LoadBalancer services in each of the nodes, as this is much simpler than explicitly provisioning NLBs for each site using OpenTofu. Consequently, the OpenTofu accelerator module simply references these existing NLBs via data sources so that they can be added to the accelerator endpoint group.
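The lookup step can be pictured with a small Python analogue. This is only a sketch of the idea of "find the NLB that Kubernetes already created and reuse its ARN"; the PR does this with OpenTofu data sources, and how those data sources select the NLB is not shown here. Matching on the service's external hostname, and the helper name `nlb_arn_for_hostname`, are assumptions for the example. The input is the `LoadBalancers` list as returned by the elbv2 `describe_load_balancers` call.

```python
def nlb_arn_for_hostname(load_balancers, hostname):
    """Return the ARN of the network load balancer whose DNS name matches hostname.

    load_balancers: list of dicts with "Type", "DNSName" and "LoadBalancerArn"
    keys, as found under "LoadBalancers" in an elbv2 describe_load_balancers
    response. Returns None when no NLB matches.
    """
    wanted = hostname.rstrip(".").lower()  # DNS names are case-insensitive
    for lb in load_balancers:
        if lb.get("Type") == "network" and lb.get("DNSName", "").lower() == wanted:
            return lb["LoadBalancerArn"]
    return None
```

The resulting ARN is what would be registered as an endpoint in the accelerator endpoint group, one per site.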
Testing
Inspect the AWS Global Accelerator console and ensure that the endpoint group contains two endpoints, one for each site.
Simulate a split-brain scenario:
Navigate to the OpenShift Console and ensure an event was fired: go to Observe -> Alerting and apply the "user" filter. A "SiteOffline" event should have been fired.
Inspect the AWS Global Accelerator console and ensure that the endpoint group now only contains a single endpoint.
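The two console checks above (two endpoints before the split, one afterwards) could also be scripted. The sketch below is one possible way to do that, assuming a client shaped like boto3's `globalaccelerator` client; the helper names and the endpoint group ARN passed in are hypothetical and not part of this PR.

```python
def endpoint_ids(client, endpoint_group_arn):
    """Return the endpoint IDs currently registered in the endpoint group."""
    resp = client.describe_endpoint_group(EndpointGroupArn=endpoint_group_arn)
    return [d["EndpointId"] for d in resp["EndpointGroup"]["EndpointDescriptions"]]


def check_endpoint_count(client, endpoint_group_arn, expected):
    """Raise AssertionError unless the group holds exactly `expected` endpoints."""
    ids = endpoint_ids(client, endpoint_group_arn)
    if len(ids) != expected:
        raise AssertionError(
            "expected %d endpoint(s), found %d: %s" % (expected, len(ids), ids)
        )
    return ids


# Example usage against AWS (requires boto3 and credentials):
#   import boto3
#   ga = boto3.client("globalaccelerator", region_name="us-west-2")
#   check_endpoint_count(ga, "<endpoint-group-arn>", 2)  # before the split
#   check_endpoint_count(ga, "<endpoint-group-arn>", 1)  # after the alert fires
```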
TODO
Still missing: