FISH-7640 Improved Hazelcast functionality as it relates to CP subsystem #6318

Merged: 3 commits, Sep 1, 2023

Conversation

lprimak
Contributor

@lprimak lprimak commented Jun 26, 2023

Description

Probably more of a bug fix than an enhancement.

  • Auto-promotes a new cluster member into the CP subsystem when necessary (see the API sketch below)
  • Auto-resets the CP subsystem when it gets into an unusable state (e.g. after members leave and rejoin and too few CP members remain)
  • Handles distributed lock exceptions gracefully when the CP subsystem is enabled

See hazelcast/hazelcast#24897 for context
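For reference, the auto-promote and auto-reset behaviours build on Hazelcast's CP Subsystem management API. The sketch below is illustrative only and assumes Hazelcast 5.x; the class and method names (CpRecoverySketch, promoteIfNeeded, resetCpSubsystem) are made up for the example and are not the PR's actual code in HazelcastCore:

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.CPSubsystemManagementService;

final class CpRecoverySketch {

    // Promote this node into the CP subsystem if it is not already a CP member.
    static void promoteIfNeeded(HazelcastInstance hz) {
        CPSubsystemManagementService mgmt =
                hz.getCPSubsystem().getCPSubsystemManagementService();
        if (mgmt.getLocalCPMember() == null) {
            mgmt.promoteToCPMember().toCompletableFuture().join();
        }
    }

    // Last-resort recovery: wipes CP state and re-forms the CP subsystem from the
    // currently available members. Only appropriate when the CP subsystem has lost
    // its majority and cannot recover on its own.
    static void resetCpSubsystem(HazelcastInstance hz) {
        hz.getCPSubsystem().getCPSubsystemManagementService()
                .reset().toCompletableFuture().join();
    }
}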

Documentation updates needed

  • New system property hazelcast.cp-subsystem.auto-promote (boolean, defaults to true) - enables / disables the auto-promote functionality
  • New system property hazelcast.auto-partition-group (boolean, temporarily defaults to false) - enables / disables the auto-partition-group functionality
  • Document the existing system property hazelcast.cp-subsystem.cp-member-count (int, defaults to zero), which enables the CP subsystem without additional Hazelcast configuration (see the sketch below for how these properties are typically consumed)
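A minimal sketch of how the properties above would typically be read and applied at bootstrap, assuming standard Hazelcast 5.x configuration APIs; the helper class (CpPropertySketch) is hypothetical and the defaults simply mirror the descriptions above:

import com.hazelcast.config.Config;
import com.hazelcast.config.cp.CPSubsystemConfig;

final class CpPropertySketch {

    static void applyCpProperties(Config config) {
        // hazelcast.cp-subsystem.cp-member-count: defaults to zero (CP disabled);
        // a value of 3 or more enables the CP subsystem.
        int cpMemberCount = Integer.parseInt(
                System.getProperty("hazelcast.cp-subsystem.cp-member-count", "0"));
        if (cpMemberCount > 0) {
            CPSubsystemConfig cpConfig = config.getCPSubsystemConfig();
            cpConfig.setCPMemberCount(cpMemberCount);
        }

        // hazelcast.cp-subsystem.auto-promote: defaults to true; gates the
        // auto-promotion behaviour added by this PR.
        boolean autoPromote = Boolean.parseBoolean(
                System.getProperty("hazelcast.cp-subsystem.auto-promote", "true"));

        // hazelcast.auto-partition-group: temporarily defaults to false
        // (see the workaround note below).
        boolean autoPartitionGroup = Boolean.parseBoolean(
                System.getProperty("hazelcast.auto-partition-group", "false"));

        // autoPromote and autoPartitionGroup would then drive the behaviour
        // described in this PR.
    }
}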

Temporary workaround for Hazelcast issue

Auto partition groups are now disabled by default, and a system property has been added to control whether they are enabled.
They are currently disabled by default because of issues caused by hazelcast/hazelcast#25100.
The default should be switched back on once Payara upgrades to a Hazelcast release containing the fix.

@lprimak
Contributor Author

lprimak commented Jun 28, 2023

jenkins test

@lprimak lprimak marked this pull request as ready for review June 28, 2023 17:18
@lprimak lprimak force-pushed the CP_SUBSYSTEM_FAILURE_FIXES branch from b7f1846 to c5ee720 Compare July 6, 2023 17:52
@lprimak
Contributor Author

lprimak commented Jul 6, 2023

jenkins test

@lprimak
Contributor Author

lprimak commented Jul 6, 2023

@Pandrex247 ready to go

@Pandrex247 Pandrex247 self-requested a review July 18, 2023 16:25
@Pandrex247 Pandrex247 changed the title Improved Hazelcast functionality as it relates to CP subsystem FISH-7640 Improved Hazelcast functionality as it relates to CP subsystem Jul 18, 2023
@Pandrex247 Pandrex247 added the PR: CLA CLA submitted on PR by the contributor label Jul 20, 2023
@Pandrex247
Member

Do you have a simple test scenario for this? I'm not particularly familiar with the CP subsystem (I can't even find what "CP" stands for!)

I started a domain with 2 instances, set hazelcast.cp-subsystem.cp-member-count to 3 at the domain level, and deployed my test clustered singleton app.
What I seem to get is the DAS constantly spamming We are FOLLOWER and there is no current leader. Will start new election round... which I assume shouldn't be the case - I'd have thought at least one instance should promote itself to be the leader?

@lprimak
Contributor Author

lprimak commented Jul 21, 2023

Did you set the system properties for the instances as well (not just the DAS)? You need both

@lprimak
Contributor Author

lprimak commented Jul 21, 2023

What you are doing is what I did :) I tested by restarting the nodes.

@Pandrex247
Member

Did you set the system properties for the instances as well (not just the DAS)? You need both

D'oh! I'd misconfigured it - thought I'd set it at the domain level but I'd set it specifically on the DAS.

I still seem to be able to get it to start spamming the logs - does this behaviour sound right?
The logs below indicate there's a bug here.

  • I have a domain with 2 instances (Insty1 and Insty2), with hazelcast.cp-subsystem.cp-member-count set to 3 at the domain level.
  • Instances are started in the following order: DAS, Insty1, Insty2
  • Upon initial start it seems Insty2 became the leader
  • Stopping Insty2 had Insty1 become the leader
    • Log statements in the DAS:
      • Status is set to: UPDATING_GROUP_MEMBER_LIST
      • Current leader RaftEndpoint{uuid='94a8a45d-cc30-4b23-a326-032e1f26c514'} is not member anymore. Will start new election round...
      • Cannot REMOVE RaftEndpoint{uuid='94a8a45d-cc30-4b23-a326-032e1f26c514'} because expected members commit index: 0 is different than group members commit index: 6
      • Removing connection to endpoint [172.28.160.1]:5901 Cause => java.io.IOException {Connection refused: no further information to address /172.28.160.1:5901}, Error-Count: 5
      • Data Grid Instance Removed 94a8a45d-cc30-4b23-a326-032e1f26c514 from Address /172.28.160.1:5901
    • Logs from Insty1:
      • Status is set to: UPDATING_GROUP_MEMBER_LIST
      • Current leader RaftEndpoint{uuid='94a8a45d-cc30-4b23-a326-032e1f26c514'} is not member anymore. Will start new election round...
      • Removing connection to endpoint [172.28.160.1]:5901 Cause => java.io.IOException {Connection refused: no further information to address /172.28.160.1:5901}, Error-Count: 5
      • Member [172.28.160.1]:5901 - 94a8a45d-cc30-4b23-a326-032e1f26c514 is suspected to be dead for reason: No connection
      • Data Grid Instance Removed 94a8a45d-cc30-4b23-a326-032e1f26c514 from Address /172.28.160.1:5901
  • Restarting Insty2 seems to show a NumberFormatException - Auto CP Promotion Failure
    • java.lang.NumberFormatException: null
      at java.base/java.lang.Integer.parseInt(Integer.java:614)
      at java.base/java.lang.Integer.parseInt(Integer.java:770)
      at fish.payara.nucleus.hazelcast.HazelcastCore.lambda$autoPromoteCPMembers$1(HazelcastCore.java:836)
      at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      at java.base/java.lang.Thread.run(Thread.java:829)
      
    • Both the DAS and Insty1 otherwise show it re-added to the data grid with no other complaints or messages
  • Stopping Insty1 now seems to cause issues
    • From the DAS:
      • CPMember{uuid=c28c8122-9c2d-4281-9c51-7a8a985f36a0, address=[172.28.160.1]:5900} is directly removed as there are only 2 CP members
      • Missing CP member in active CP members: [CPMember{uuid=b9324f05-b0fa-4ce3-ac13-ab89ffeebd65, address=[172.28.160.1]:4900}] for CPGroupInfo{id=CPGroupId{name='METADATA', seed=0, groupId=0}, initialMembers=[RaftEndpoint{uuid='94a8a45d-cc30-4b23-a326-032e1f26c514'}, RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], membersCommitIndex=6, members=[RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], status=ACTIVE} java.lang.IllegalStateException: Missing CP member in active CP members: [CPMember{uuid=b9324f05-b0fa-4ce3-ac13-ab89ffeebd65, address=[172.28.160.1]:4900}] for CPGroupInfo{id=CPGroupId{name='METADATA', seed=0, groupId=0}, initialMembers=[RaftEndpoint{uuid='94a8a45d-cc30-4b23-a326-032e1f26c514'}, RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], membersCommitIndex=6, members=[RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], status=ACTIVE}
      • Missing CP member in active CP members: [CPMember{uuid=b9324f05-b0fa-4ce3-ac13-ab89ffeebd65, address=[172.28.160.1]:4900}] for CPGroupInfo{id=CPGroupId{name='METADATA', seed=0, groupId=0}, initialMembers=[RaftEndpoint{uuid='94a8a45d-cc30-4b23-a326-032e1f26c514'}, RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], membersCommitIndex=6, members=[RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], status=ACTIVE} java.lang.IllegalStateException: Missing CP member in active CP members: [CPMember{uuid=b9324f05-b0fa-4ce3-ac13-ab89ffeebd65, address=[172.28.160.1]:4900}] for CPGroupInfo{id=CPGroupId{name='METADATA', seed=0, groupId=0}, initialMembers=[RaftEndpoint{uuid='94a8a45d-cc30-4b23-a326-032e1f26c514'}, RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], membersCommitIndex=6, members=[RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], status=ACTIVE}
      • Removing connection to endpoint [172.28.160.1]:5900 Cause => java.io.IOException {Connection refused: no further information to address /172.28.160.1:5900}, Error-Count: 5
      • Data Grid Instance Removed c28c8122-9c2d-4281-9c51-7a8a985f36a0 from Address /172.28.160.1:5900
      • Exception in thread "hz.zealous_brahmagupta.cached.thread-2"
      • Missing CP member in active CP members: [CPMember{uuid=b9324f05-b0fa-4ce3-ac13-ab89ffeebd65, address=[172.28.160.1]:4900}] for CPGroupInfo{id=CPGroupId{name='METADATA', seed=0, groupId=0}, initialMembers=[RaftEndpoint{uuid='94a8a45d-cc30-4b23-a326-032e1f26c514'}, RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], membersCommitIndex=6, members=[RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], status=ACTIVE}
  • Upon restart of Insty1:
    • Same NumberFormatException that Insty2 ran into
    • From the DAS:
      • Demoting to FOLLOWER since not received acks from majority recently...
      • DAS starts spamming We are FOLLOWER and there is no current leader. Will start new election round...
    • Nothing of note from Insty2 - just shows Insty1 rejoining the data grid.

@lprimak
Contributor Author

lprimak commented Jul 24, 2023

@Pandrex247 Good job! You found a bug :) I had set that variable to zero or one (depending on my Hazelcast version) and never tested without it set.
I will fix it.
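
For context, the NumberFormatException above comes from Integer.parseInt receiving null when the property is unset. The sketch below shows the kind of guard that avoids it; it is illustrative only (the SafeIntProperty helper is hypothetical) and is not necessarily the exact fix pushed in this PR:

final class SafeIntProperty {

    // Read an optional integer system property so that an unset or malformed
    // value falls back to a default instead of reaching Integer.parseInt as null.
    static int read(String name, int defaultValue) {
        String raw = System.getProperty(name);
        if (raw == null || raw.trim().isEmpty()) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException nfe) {
            return defaultValue;
        }
    }
}

// Usage: int cpMemberCount = SafeIntProperty.read("hazelcast.cp-subsystem.cp-member-count", 0);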

@lprimak lprimak force-pushed the CP_SUBSYSTEM_FAILURE_FIXES branch from c27f594 to f73003f Compare July 24, 2023 19:44
@lprimak
Contributor Author

lprimak commented Jul 24, 2023

jenkins test

@lprimak
Contributor Author

lprimak commented Jul 24, 2023

@Pandrex247 Fixed and I tested on a clean install of Payara with no parameters.

The warnings when restarting instances with CP enabled are normal, including the repeating messages about trying to re-establish CP continuity when fewer than 3 nodes are available.

See https://docs.hazelcast.com/hazelcast/5.3/cp-subsystem/management for more info

@lprimak
Contributor Author

lprimak commented Jul 24, 2023

Argh... found another bug or two... stay tuned

@lprimak lprimak marked this pull request as draft July 24, 2023 21:19
@lprimak lprimak force-pushed the CP_SUBSYSTEM_FAILURE_FIXES branch from f73003f to a735848 Compare July 25, 2023 00:38
@lprimak
Contributor Author

lprimak commented Jul 25, 2023

@Pandrex247 Ok, fixed now, for real this time :)

@lprimak lprimak marked this pull request as ready for review July 25, 2023 00:38
- handles distributed lock exceptions well when CP subsystem is enabled
- performs auto-cp-subsystem-reset when CP gets into unusable state
@lprimak lprimak force-pushed the CP_SUBSYSTEM_FAILURE_FIXES branch from a735848 to 1bd5c57 Compare July 25, 2023 00:57
@lprimak
Contributor Author

lprimak commented Jul 25, 2023

jenkins test

@Pandrex247
Member

The DAS doesn't seem to be able to restart with this option turned on?

Essentially the same setup as above, but when you run the stop-domain command to stop the DAS (or attempt to restart it via the admin console) it starts spamming the log with Member [172.28.160.1]:4900 - 0d9feda6-19ed-481f-aff9-ea62be15fd2c this requested to shutdown but still in partition table and refuses to shut down.

Does there need to be some sort of shutdown hook override?

@lprimak
Contributor Author

lprimak commented Jul 25, 2023

CP is finicky about data loss and will try to prevent it.
It will eventually shut down, but it will spew and complain bitterly. There is nothing to be done here; it's just how the system works.
The app constantly pings the clustered lock (the Stock service), which is the root cause of this.
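
For reference, a shutdown hook along the lines mentioned above would go through Hazelcast's CP management API to take the local member out of the CP subsystem before the instance stops. A minimal sketch, assuming Hazelcast 5.x; the CpShutdownSketch helper is hypothetical and this is not part of this PR:

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.CPMember;
import com.hazelcast.cp.CPSubsystemManagementService;

final class CpShutdownSketch {

    // Remove this node from the CP subsystem before shutting it down, so the
    // remaining CP members do not keep waiting on a member that is going away.
    // Requires a reachable majority of the CP group to succeed.
    static void leaveCpSubsystem(HazelcastInstance hz) {
        CPSubsystemManagementService mgmt =
                hz.getCPSubsystem().getCPSubsystemManagementService();
        CPMember local = mgmt.getLocalCPMember();
        if (local != null) {
            mgmt.removeCPMember(local.getUuid()).toCompletableFuture().join();
        }
    }
}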

@lprimak
Contributor Author

lprimak commented Jul 26, 2023

On second thought, let me do some more tests. This shouldn't happen with 3 instances, but it does.
You may have found another bug. Thanks!

@lprimak
Contributor Author

lprimak commented Jul 27, 2023

Update:

I did the same test without CP enabled (without the PR) and I get the same error.
I think it's a bug in Hazelcast, nothing to do with this PR.
Yay! yay :(


OK, I did some more tests. Even with the "new code" that's in this PR disabled, by setting the
hazelcast.cp-subsystem.auto-promote system property to false on all nodes, the shutdown log spam is still happening.

My conclusion is that Payara works better with this PR currently than without, so it's worth merging even if the issue remains.

The problem is that I can't reproduce this via my standalone reproducer app so I can't really report this bug to Hazelcast in any useful way :(

@lprimak
Contributor Author

lprimak commented Jul 27, 2023

Furthermore, this test fails with no apps deployed either :) Yay! The bug is not in this PR :)

@lprimak
Contributor Author

lprimak commented Jul 28, 2023

Found it. Looks like a regression somewhere in my code. It's not this PR's functionality, but I'll work on a fix here. It's about 50/50 that it's a bug in Hazelcast, but I'm not quite sure yet. It's in the discovery SPI and the interaction with it.

@lprimak
Contributor Author

lprimak commented Jul 28, 2023

Or, better yet, feel free to merge this one; the above error has nothing to do with it, and I can fix it in a separate PR.

@lprimak
Contributor Author

lprimak commented Jul 30, 2023

Issue (with PR) filed with Hazelcast: hazelcast/hazelcast#25100

@lprimak
Contributor Author

lprimak commented Jul 30, 2023

I can't fix that issue here. This PR can be merged with no regressions.
I will think about whether there is any way to work around it in the meantime (in a separate PR).
Thanks!

@lprimak
Contributor Author

lprimak commented Jul 30, 2023

jenkins test

@lprimak
Contributor Author

lprimak commented Jul 30, 2023

@Pandrex247 I have disabled auto partition groups by default, which was triggering the Hazelcast bug. It works now... for realz :)

Comment on lines 534 to 538
if (Boolean.parseBoolean(System.getProperty("hazelcast.auto-partition-group", "false"))) {
PartitionGroupConfig partitionGroupConfig = config.getPartitionGroupConfig();
partitionGroupConfig.setEnabled(true);
partitionGroupConfig.setGroupType(PartitionGroupConfig.MemberGroupType.SPI);
}
Member

Isn't this changing the default behaviour even with the CP subsystem turned off?
Now the partitions will default to "member", with each instance being in its own partition group instead of being determined by the DomainDiscoveryStrategy, which utilises the instance group attribute and is optionally host-aware.

Without this it won't be host-aware, and the instance groups will be ignored for partitioning purposes - I'm not convinced this is an improvement.

@lprimak
Contributor Author

lprimak commented Aug 7, 2023

Isn't this changing the default behaviour even with the CP subsystem turned off?

Yes. The partitioning SPI bug is in Hazelcast and is present in current Payara. It has nothing to do with CP.

Now the partitions will default to "member"

Incorrect. See https://github.com/payara/Payara/blob/eb8effef2c5984843f469fa564bfd42522cf1f22/nucleus/payara-modules/hazelcast-bootstrap/src/main/java/fish/payara/nucleus/hazelcast/HazelcastCore.java#L431C32-L431C32

Unless "host aware" is checked, everything will be in one partition group, which is the correct default behavior.

Of course, once Hazelcast merges my fix, we can put this back to "turn on by default" in Payara

@lprimak
Contributor Author

lprimak commented Aug 21, 2023

Looks like my fix for hazelcast/hazelcast#25100 has been merged by Hazelcast.

@Pandrex247
Member

Looks like it is marked for Hazelcast 5.4.0; I'll try to keep an eye out.

I still haven't got around to looking into what you've said; I'll try to make some time over the coming days. Unless you're now intending to change the behaviour back? I don't know how far away Hazelcast 5.4.0 is.

@lprimak
Contributor Author

lprimak commented Aug 22, 2023

change the behaviour back

that's the big question, isn't it? I am inclined to change it back now. What do you think?

@Pandrex247
Member

I think it makes sense to change it back - it keeps this PR a bit cleaner with respect to not changing the default behaviour.

@lprimak
Contributor Author

lprimak commented Aug 24, 2023

ok, I'll change it back

@lprimak
Contributor Author

lprimak commented Aug 24, 2023

jenkins test

@lprimak
Contributor Author

lprimak commented Aug 31, 2023

At this stage, I think this is safe to merge now. Thank you!
