

Development: Add safeguards against split Hazelcast clusters #8473

Merged
15 commits merged into develop from feature/hazelcast-split-brain-protections on May 2, 2024

Conversation

Hialus
Member

@Hialus Hialus commented Apr 25, 2024

Checklist

General

Server

  • Important: I implemented the changes with very good performance and avoided unnecessary database calls.
  • I strictly followed the server coding and design guidelines.
  • I added multiple integration tests (Spring) related to the features (with high test coverage) — not possible via JUnit.
  • I documented the Java code using JavaDoc style.

Motivation and Context

Currently, when starting Artemis, the Hazelcast cluster may not form properly and some nodes end up in separate clusters of varying sizes (1+). There is currently no way to fix this except restarting the affected nodes. So far the workaround was to start the nodes one by one with a delay of 20 seconds, which made deployments tedious.

Description

I added two changes to address this:

  1. A scheduled task that regularly checks whether the service registry contains hosts that are not in the Hazelcast cluster, and adds them if so.
  2. A basic configuration for split-brain handling. This helps the cluster recover if it splits into multiple clusters at runtime.

As a small side quest I also added the possibility for nodes to use different Hazelcast ports. This could theoretically replace the old local dev cluster handling, though I left that in place for now.
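The reconciliation logic of change 1 can be sketched roughly as follows. This is a minimal illustration, not the actual Artemis implementation; the class and method names are hypothetical:

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the scheduled reconciliation described in change 1:
// compare the hosts known to the service registry with the members of the
// Hazelcast cluster and report the ones missing from the cluster, which a
// scheduled task could then attempt to add. Names are illustrative only.
class ClusterConnectivityCheck {

    /** Returns registry hosts that are not yet part of the Hazelcast cluster, sorted for stable output. */
    static List<String> missingMembers(Set<String> registryHosts, Set<String> clusterMembers) {
        return registryHosts.stream()
                .filter(host -> !clusterMembers.contains(host))
                .sorted()
                .toList();
    }
}
```

In the real setup, a Spring `@Scheduled` method would run such a check periodically and tell Hazelcast to connect to each missing host.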

Steps for Testing

Scenario 1

Prerequisites:

  • 1 Admin account on a multi-node test server (ideally TS3)
  1. Deploy to the test server
  2. Go to the admin health page (via the "Server Administration" dropdown in the navbar)
  3. Click on the eye on the right of the "Hazelcast" row
  4. If there are 3 members: you are done
  5. If not, wait a minute, refresh the page, and repeat from step 3
  6. Note: it can take several minutes until the cluster heals itself, as Hazelcast only checks for a split brain every 2 minutes
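The roughly two-minute healing window mentioned in step 6 corresponds to Hazelcast's split-brain merge timer, which is controlled by system properties. A hedged configuration sketch (the property names are standard Hazelcast properties; the 120-second values are an assumption about how this PR configures them):

```java
// Hazelcast periodically searches for split clusters and merges them.
// These properties control the first check after startup and the interval
// between subsequent checks; 120 seconds matches the ~2-minute window above.
Config config = new Config();
config.setProperty("hazelcast.merge.first.run.delay.seconds", "120");
config.setProperty("hazelcast.merge.next.run.delay.seconds", "120");
```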

Scenario 2

Prerequisites:

  • Ideally do this on Staging ICL
  1. Deploy to the staging primary node and directly afterwards (possibly even while the first deployment is still running) to the secondary node (it is important to start them in quick succession to break the cluster on startup)
  2. Perform actions that store data in the Hazelcast cluster (e.g. continuously trigger a lot of builds on an ICL server)
  3. In the beginning not all builds will be processed (you can check this in the build queue of the course by continuously reloading to see the states of the different nodes)
  4. Once the cluster has merged (check in the admin area), the builds will be processed and the build queue is the same across instances

Testserver States

Note

These badges show the state of the test servers.
Green = Currently available, Red = Currently locked

(test server status badges)
Review Progress

Performance Review

  • I (as a reviewer) confirm that the server changes (in particular related to database calls) are implemented with a very good performance

Code Review

  • Code Review 1
  • Code Review 2

Manual Tests

  • Test 1
  • Test 2

Screenshots

How the health panel should look:
(screenshot: Hazelcast health panel)

Summary by CodeRabbit

  • New Features

    • Enhanced cluster connectivity checks with scheduled tasks in the system configuration.
    • Implemented split brain protection for improved system resilience.
  • Style

    • Reformatted code indentation in BuildJobContainerService to improve readability.

@Hialus Hialus self-assigned this Apr 25, 2024
@github-actions github-actions bot added the server Pull requests that update Java code. (Added Automatically!) label Apr 25, 2024
@Hialus Hialus changed the title Development: (WIP) Add safeguards for split Hazelcast clusters Development: (WIP) Add safeguards against split Hazelcast clusters Apr 25, 2024
@Hialus Hialus changed the title Development: (WIP) Add safeguards against split Hazelcast clusters Development: Add safeguards against split Hazelcast clusters Apr 25, 2024
@Hialus Hialus requested a review from Mtze April 27, 2024 02:07
krusche
krusche previously approved these changes Apr 27, 2024
Member

@krusche krusche left a comment


Code changes look good to me 👍 great work!

@krusche krusche added this to the 7.0.3 milestone Apr 27, 2024
pzdr7
pzdr7 previously approved these changes Apr 29, 2024
Contributor

@pzdr7 pzdr7 left a comment


Tested on TS3 (cluster was split, but healed after a while) and LocalCI staging (I encountered two build queue states that resolved eventually)

bensofficial
bensofficial previously approved these changes May 1, 2024

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Out of diff range and nitpick comments (4)
src/main/java/de/tum/in/www1/artemis/config/CacheConfiguration.java (4)

Line range hint 158-252: The configuration of the Hazelcast instance is comprehensive, covering various scenarios like local and production environments. However, consider documenting the implications of disabling multicast and auto-detection, as well as the choice of TCP/IP config, to ensure clarity for future maintainers.


Line range hint 253-261: The configuration of the Hazelcast queue for local CI environments is appropriate. Consider adding more detailed comments explaining the choice of backup count and the role of the priority comparator class for better understanding.


Line range hint 262-267: The serializer configuration for Path objects is crucial for caching files in the Hazelcast cluster. Consider adding comments to explain its importance and the choice of serializer implementation for future maintainers.


Line range hint 268-270: The configuration of the key generator using Git and Build properties is a good practice for distinguishing cache entries. Consider documenting the rationale behind using prefixed keys to enhance clarity for future maintainers.
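For context on the reviewer's note about disabling multicast and auto-detection in favor of TCP/IP joining: a generic Hazelcast join configuration of that shape looks roughly like the following (a sketch of standard Hazelcast API usage, not the actual Artemis CacheConfiguration; the member address is a placeholder):

```java
// Generic Hazelcast TCP/IP join sketch: with multicast and auto-detection
// disabled, members only join via the explicitly listed addresses, which
// makes cluster formation deterministic across environments.
Config config = new Config();
JoinConfig join = config.getNetworkConfig().getJoin();
join.getMulticastConfig().setEnabled(false);
join.getAutoDetectionConfig().setEnabled(false);
join.getTcpIpConfig().setEnabled(true).addMember("192.168.1.10:5701");
```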

coderabbitai[bot]
coderabbitai bot previously approved these changes May 1, 2024
Member

@bensofficial bensofficial left a comment


Reapprove

Member

@krusche krusche left a comment


Code looks good to me 👍

@krusche krusche merged commit 3b2d6b8 into develop May 2, 2024
26 of 28 checks passed
@krusche krusche deleted the feature/hazelcast-split-brain-protections branch May 2, 2024 22:46
Labels
enhancement ready for review server Pull requests that update Java code. (Added Automatically!)

5 participants