

Development: Add safeguards against split Hazelcast clusters #8473

Merged
15 commits merged into develop from feature/hazelcast-split-brain-protections on May 2, 2024

Conversation

Hialus
Member

@Hialus Hialus commented Apr 25, 2024

Checklist

General

Server

  • Important: I implemented the changes with very good performance and avoided unnecessary database calls.
  • I strictly followed the server coding and design guidelines.
  • I added multiple integration tests (Spring) related to the features (with high test coverage) — not possible via JUnit.
  • I documented the Java code using JavaDoc style.

Motivation and Context

Currently, when starting Artemis, the Hazelcast cluster may not form properly and some nodes end up in separate clusters of varying sizes (1+). There is currently no way to fix this except restarting the affected nodes. So far the workaround was to start the nodes one by one with a delay of 20 seconds, which made deployments tedious.

Description

I added two changes to address this:

  1. A scheduled task that regularly checks whether the service registry contains hosts that are not in the Hazelcast cluster, and adds them if so.
  2. A basic configuration for split-brain handling. This helps the cluster recover if it splits into multiple clusters at runtime.

As a small side quest I also added the possibility for nodes to use different Hazelcast ports. This could theoretically replace the old local dev cluster handling, though I left that in place for now.
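The reconciliation logic of change 1 can be sketched roughly as follows. This is a minimal illustration, not the actual Artemis implementation; the class and method names are hypothetical:

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the scheduled reconciliation described in change 1:
// compare the hosts known to the service registry with the members of the
// Hazelcast cluster and report the ones missing from the cluster, which a
// scheduled task could then attempt to add. Names are illustrative only.
class ClusterConnectivityCheck {

    /** Returns registry hosts that are not yet part of the Hazelcast cluster, sorted for stable output. */
    static List<String> missingMembers(Set<String> registryHosts, Set<String> clusterMembers) {
        return registryHosts.stream()
                .filter(host -> !clusterMembers.contains(host))
                .sorted()
                .toList();
    }
}
```

In the real setup, a Spring `@Scheduled` method would run such a check periodically and tell Hazelcast to connect to each missing host.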

Steps for Testing

Scenario 1

Prerequisites:

  • 1 Admin account on a multi-node test server (ideally TS3)
  1. Deploy to the test server
  2. Go to the admin health page (via the "Server Administration" dropdown in the navbar)
  3. Click on the eye on the right of the "Hazelcast" row
  4. If there are 3 members: you are done
  5. If not, wait a minute, refresh the page, and repeat from step 3
  6. Note: it can take several minutes until the cluster heals itself, as Hazelcast only checks for a split brain every 2 minutes
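The roughly two-minute healing window mentioned in step 6 corresponds to Hazelcast's split-brain merge timer, which is controlled by system properties. A hedged configuration sketch (the property names are standard Hazelcast properties; the 120-second values are an assumption about how this PR configures them):

```java
// Hazelcast periodically searches for split clusters and merges them.
// These properties control the first check after startup and the interval
// between subsequent checks; 120 seconds matches the ~2-minute window above.
Config config = new Config();
config.setProperty("hazelcast.merge.first.run.delay.seconds", "120");
config.setProperty("hazelcast.merge.next.run.delay.seconds", "120");
```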

Scenario 2

Prerequisites:

  • Ideally do this on Staging ICL
  1. Deploy to the staging primary node and directly afterwards (possibly even while the first deployment is still running) to the secondary node (it is important to start them in quick succession to break the cluster on startup)
  2. Perform actions that store data in the Hazelcast cluster (e.g. continuously trigger a lot of builds on an ICL server)
  3. In the beginning not all builds will be processed (you can check this in the build queue of the course by continuously reloading to see the states of the different nodes)
  4. Once the cluster has merged (check in the admin area), the builds will be processed and the build queue is the same across instances

Testserver States

Note

These badges show the state of the test servers.
Green = Currently available, Red = Currently locked

(test server status badges)
Review Progress

Performance Review

  • I (as a reviewer) confirm that the server changes (in particular related to database calls) are implemented with a very good performance

Code Review

  • Code Review 1
  • Code Review 2

Manual Tests

  • Test 1
  • Test 2

Screenshots

How the health panel should look:
(screenshot: Hazelcast health panel)

Summary by CodeRabbit

  • New Features

    • Enhanced cluster connectivity checks with scheduled tasks in the system configuration.
    • Implemented split brain protection for improved system resilience.
  • Style

    • Reformatted code indentation in BuildJobContainerService to improve readability.

@Hialus Hialus self-assigned this Apr 25, 2024
@github-actions github-actions bot added the server Pull requests that update Java code. (Added Automatically!) label Apr 25, 2024
@Hialus Hialus changed the title Development: (WIP) Add safeguards for split Hazelcast clusters Development: (WIP) Add safeguards against split Hazelcast clusters Apr 25, 2024
@Hialus Hialus changed the title Development: (WIP) Add safeguards against split Hazelcast clusters Development: Add safeguards against split Hazelcast clusters Apr 25, 2024
@Hialus Hialus requested a review from Mtze April 27, 2024 02:07
krusche
krusche previously approved these changes Apr 27, 2024
Member

@krusche krusche left a comment


Code changes look good to me 👍 great work!

@krusche krusche added this to the 7.0.3 milestone Apr 27, 2024
pzdr7
pzdr7 previously approved these changes Apr 29, 2024
Contributor

@pzdr7 pzdr7 left a comment


Tested on TS3 (cluster was split, but healed after a while) and LocalCI staging (I encountered two build queue states that resolved eventually)

bensofficial
bensofficial previously approved these changes May 1, 2024

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Out of diff range and nitpick comments (4)
src/main/java/de/tum/in/www1/artemis/config/CacheConfiguration.java (4)

Line range hint 158-252: The configuration of the Hazelcast instance is comprehensive, covering various scenarios like local and production environments. However, consider documenting the implications of disabling multicast and auto-detection, as well as the choice of TCP/IP config, to ensure clarity for future maintainers.


Line range hint 253-261: The configuration of the Hazelcast queue for local CI environments is appropriate. Consider adding more detailed comments explaining the choice of backup count and the role of the priority comparator class for better understanding.


Line range hint 262-267: The serializer configuration for Path objects is crucial for caching files in the Hazelcast cluster. Consider adding comments to explain its importance and the choice of serializer implementation for future maintainers.


Line range hint 268-270: The configuration of the key generator using Git and Build properties is a good practice for distinguishing cache entries. Consider documenting the rationale behind using prefixed keys to enhance clarity for future maintainers.
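For context on the reviewer's note about disabling multicast and auto-detection in favor of TCP/IP joining: a generic Hazelcast join configuration of that shape looks roughly like the following (a sketch of standard Hazelcast API usage, not the actual Artemis CacheConfiguration; the member address is a placeholder):

```java
// Generic Hazelcast TCP/IP join sketch: with multicast and auto-detection
// disabled, members only join via the explicitly listed addresses, which
// makes cluster formation deterministic across environments.
Config config = new Config();
JoinConfig join = config.getNetworkConfig().getJoin();
join.getMulticastConfig().setEnabled(false);
join.getAutoDetectionConfig().setEnabled(false);
join.getTcpIpConfig().setEnabled(true).addMember("192.168.1.10:5701");
```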

coderabbitai[bot]
coderabbitai bot previously approved these changes May 1, 2024
Member

@bensofficial bensofficial left a comment


Reapprove

Member

@krusche krusche left a comment


Code looks good to me 👍

@krusche krusche merged commit 3b2d6b8 into develop May 2, 2024
26 of 28 checks passed
@krusche krusche deleted the feature/hazelcast-split-brain-protections branch May 2, 2024 22:46
Labels
enhancement ready for review server Pull requests that update Java code. (Added Automatically!)

5 participants