FISH-7640 Improved Hazelcast functionality as it relates to CP subsystem #6318

Merged: 3 commits, Sep 1, 2023

Conversation

lprimak
Contributor

@lprimak lprimak commented Jun 26, 2023

Description

Probably more of a bug fix than an enhancement.

  • Auto-promotes a new cluster member into the CP subsystem when necessary (see the API sketch below)
  • Auto-resets the CP subsystem when it gets into an unusable state (e.g. after members leave and rejoin and too few CP members remain)
  • Handles distributed lock exceptions gracefully when the CP subsystem is enabled

See hazelcast/hazelcast#24897 for context
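For reference, the auto-promote and auto-reset behaviours build on Hazelcast's CP Subsystem management API. The sketch below is illustrative only and assumes Hazelcast 5.x; the class and method names (CpRecoverySketch, promoteIfNeeded, resetCpSubsystem) are made up for the example and are not the PR's actual code in HazelcastCore:

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.CPSubsystemManagementService;

final class CpRecoverySketch {

    // Promote this node into the CP subsystem if it is not already a CP member.
    static void promoteIfNeeded(HazelcastInstance hz) {
        CPSubsystemManagementService mgmt =
                hz.getCPSubsystem().getCPSubsystemManagementService();
        if (mgmt.getLocalCPMember() == null) {
            mgmt.promoteToCPMember().toCompletableFuture().join();
        }
    }

    // Last-resort recovery: wipes CP state and re-forms the CP subsystem from the
    // currently available members. Only appropriate when the CP subsystem has lost
    // its majority and cannot recover on its own.
    static void resetCpSubsystem(HazelcastInstance hz) {
        hz.getCPSubsystem().getCPSubsystemManagementService()
                .reset().toCompletableFuture().join();
    }
}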

Documentation updates needed

  • New system property hazelcast.cp-subsystem.auto-promote (boolean, defaults to true) - enables / disables the auto-promote functionality
  • New system property hazelcast.auto-partition-group (boolean, temporarily defaults to false) - enables / disables the auto-partition-group functionality
  • Document the existing system property hazelcast.cp-subsystem.cp-member-count (int, defaults to zero), which enables the CP subsystem without additional Hazelcast configuration (see the sketch below for how these properties are typically consumed)
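A minimal sketch of how the properties above would typically be read and applied at bootstrap, assuming standard Hazelcast 5.x configuration APIs; the helper class (CpPropertySketch) is hypothetical and the defaults simply mirror the descriptions above:

import com.hazelcast.config.Config;
import com.hazelcast.config.cp.CPSubsystemConfig;

final class CpPropertySketch {

    static void applyCpProperties(Config config) {
        // hazelcast.cp-subsystem.cp-member-count: defaults to zero (CP disabled);
        // a value of 3 or more enables the CP subsystem.
        int cpMemberCount = Integer.parseInt(
                System.getProperty("hazelcast.cp-subsystem.cp-member-count", "0"));
        if (cpMemberCount > 0) {
            CPSubsystemConfig cpConfig = config.getCPSubsystemConfig();
            cpConfig.setCPMemberCount(cpMemberCount);
        }

        // hazelcast.cp-subsystem.auto-promote: defaults to true; gates the
        // auto-promotion behaviour added by this PR.
        boolean autoPromote = Boolean.parseBoolean(
                System.getProperty("hazelcast.cp-subsystem.auto-promote", "true"));

        // hazelcast.auto-partition-group: temporarily defaults to false
        // (see the workaround note below).
        boolean autoPartitionGroup = Boolean.parseBoolean(
                System.getProperty("hazelcast.auto-partition-group", "false"));

        // autoPromote and autoPartitionGroup would then drive the behaviour
        // described in this PR.
    }
}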

Temporary workaround for Hazelcast issue

Auto partition groups are now disabled by default, and a system property has been added to control whether they are enabled.
They are currently disabled by default because of issues caused by hazelcast/hazelcast#25100.
The default should be switched back on once Payara upgrades to a Hazelcast release containing the fix.

@lprimak
Contributor Author

lprimak commented Jun 28, 2023

jenkins test

@lprimak lprimak marked this pull request as ready for review June 28, 2023 17:18
@lprimak lprimak force-pushed the CP_SUBSYSTEM_FAILURE_FIXES branch from b7f1846 to c5ee720 Compare July 6, 2023 17:52
@lprimak
Contributor Author

lprimak commented Jul 6, 2023

jenkins test

@lprimak
Contributor Author

lprimak commented Jul 6, 2023

@Pandrex247 ready to go

@Pandrex247 Pandrex247 self-requested a review July 18, 2023 16:25
@Pandrex247 Pandrex247 changed the title Improved Hazelcast functionality as it relates to CP subsystem FISH-7640 Improved Hazelcast functionality as it relates to CP subsystem Jul 18, 2023
@Pandrex247 Pandrex247 added the PR: CLA CLA submitted on PR by the contributor label Jul 20, 2023
@Pandrex247
Member

Do you have a simple test scenario for this? I'm not particularly familiar with the CP subsystem (I can't even find what "CP" stands for!)

I started a domain with 2 instances, set hazelcast.cp-subsystem.cp-member-count to 3 at the domain level, and deployed my test clustered singleton app.
What I seem to get is the DAS constantly spamming We are FOLLOWER and there is no current leader. Will start new election round... which I assume shouldn't be the case - I'd have thought at least one instance should promote itself to be the leader?

@lprimak
Contributor Author

lprimak commented Jul 21, 2023

Did you set the system properties for the instances as well (not just the DAS)? You need both

@lprimak
Contributor Author

lprimak commented Jul 21, 2023

What you are doing is what I did :) I tested by restarting the nodes.

@Pandrex247
Member

Did you set the system properties for the instances as well (not just the DAS)? You need both

D'oh! I'd misconfigured it - thought I'd set it at the domain level but I'd set it specifically on the DAS.

I still seem to be able to get it to start spamming the logs - does this behaviour sound right?
The logs below indicate there's a bug here.

  • I have a domain with 2 instances (Insty1 and Insty2), with hazelcast.cp-subsystem.cp-member-count set to 3 at the domain level.
  • Instances are started in the following order: DAS, Insty1, Insty2
  • Upon initial start it seems Insty2 became the leader
  • Stopping Insty2 had Insty1 become the leader
    • Log statements in the DAS:
      • Status is set to: UPDATING_GROUP_MEMBER_LIST
      • Current leader RaftEndpoint{uuid='94a8a45d-cc30-4b23-a326-032e1f26c514'} is not member anymore. Will start new election round...
      • Cannot REMOVE RaftEndpoint{uuid='94a8a45d-cc30-4b23-a326-032e1f26c514'} because expected members commit index: 0 is different than group members commit index: 6
      • Removing connection to endpoint [172.28.160.1]:5901 Cause => java.io.IOException {Connection refused: no further information to address /172.28.160.1:5901}, Error-Count: 5
      • Data Grid Instance Removed 94a8a45d-cc30-4b23-a326-032e1f26c514 from Address /172.28.160.1:5901
    • Logs from Insty1:
      • Status is set to: UPDATING_GROUP_MEMBER_LIST
      • Current leader RaftEndpoint{uuid='94a8a45d-cc30-4b23-a326-032e1f26c514'} is not member anymore. Will start new election round...
      • Removing connection to endpoint [172.28.160.1]:5901 Cause => java.io.IOException {Connection refused: no further information to address /172.28.160.1:5901}, Error-Count: 5
      • Member [172.28.160.1]:5901 - 94a8a45d-cc30-4b23-a326-032e1f26c514 is suspected to be dead for reason: No connection
      • Data Grid Instance Removed 94a8a45d-cc30-4b23-a326-032e1f26c514 from Address /172.28.160.1:5901
  • Restarting Insty2 seems to show a NumberFormatException - Auto CP Promotion Failure
    • java.lang.NumberFormatException: null
      at java.base/java.lang.Integer.parseInt(Integer.java:614)
      at java.base/java.lang.Integer.parseInt(Integer.java:770)
      at fish.payara.nucleus.hazelcast.HazelcastCore.lambda$autoPromoteCPMembers$1(HazelcastCore.java:836)
      at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      at java.base/java.lang.Thread.run(Thread.java:829)
      
    • Both the DAS and Insty1 otherwise show it re-added to the data grid with no other complaints or messages
  • Stopping Insty1 now seems to cause issues
    • From the DAS:
      • CPMember{uuid=c28c8122-9c2d-4281-9c51-7a8a985f36a0, address=[172.28.160.1]:5900} is directly removed as there are only 2 CP members
      • Missing CP member in active CP members: [CPMember{uuid=b9324f05-b0fa-4ce3-ac13-ab89ffeebd65, address=[172.28.160.1]:4900}] for CPGroupInfo{id=CPGroupId{name='METADATA', seed=0, groupId=0}, initialMembers=[RaftEndpoint{uuid='94a8a45d-cc30-4b23-a326-032e1f26c514'}, RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], membersCommitIndex=6, members=[RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], status=ACTIVE} java.lang.IllegalStateException: Missing CP member in active CP members: [CPMember{uuid=b9324f05-b0fa-4ce3-ac13-ab89ffeebd65, address=[172.28.160.1]:4900}] for CPGroupInfo{id=CPGroupId{name='METADATA', seed=0, groupId=0}, initialMembers=[RaftEndpoint{uuid='94a8a45d-cc30-4b23-a326-032e1f26c514'}, RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], membersCommitIndex=6, members=[RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], status=ACTIVE}
      • Missing CP member in active CP members: [CPMember{uuid=b9324f05-b0fa-4ce3-ac13-ab89ffeebd65, address=[172.28.160.1]:4900}] for CPGroupInfo{id=CPGroupId{name='METADATA', seed=0, groupId=0}, initialMembers=[RaftEndpoint{uuid='94a8a45d-cc30-4b23-a326-032e1f26c514'}, RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], membersCommitIndex=6, members=[RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], status=ACTIVE} java.lang.IllegalStateException: Missing CP member in active CP members: [CPMember{uuid=b9324f05-b0fa-4ce3-ac13-ab89ffeebd65, address=[172.28.160.1]:4900}] for CPGroupInfo{id=CPGroupId{name='METADATA', seed=0, groupId=0}, initialMembers=[RaftEndpoint{uuid='94a8a45d-cc30-4b23-a326-032e1f26c514'}, RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], membersCommitIndex=6, members=[RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], status=ACTIVE}
      • Removing connection to endpoint [172.28.160.1]:5900 Cause => java.io.IOException {Connection refused: no further information to address /172.28.160.1:5900}, Error-Count: 5
      • Data Grid Instance Removed c28c8122-9c2d-4281-9c51-7a8a985f36a0 from Address /172.28.160.1:5900
      • Exception in thread "hz.zealous_brahmagupta.cached.thread-2"
      • Missing CP member in active CP members: [CPMember{uuid=b9324f05-b0fa-4ce3-ac13-ab89ffeebd65, address=[172.28.160.1]:4900}] for CPGroupInfo{id=CPGroupId{name='METADATA', seed=0, groupId=0}, initialMembers=[RaftEndpoint{uuid='94a8a45d-cc30-4b23-a326-032e1f26c514'}, RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], membersCommitIndex=6, members=[RaftEndpoint{uuid='b9324f05-b0fa-4ce3-ac13-ab89ffeebd65'}, RaftEndpoint{uuid='c28c8122-9c2d-4281-9c51-7a8a985f36a0'}], status=ACTIVE}
  • Upon restart of Insty1:
    • Same NumberFormatException that Insty2 ran into
    • From the DAS:
      • Demoting to FOLLOWER since not received acks from majority recently...
      • DAS starts spamming We are FOLLOWER and there is no current leader. Will start new election round...
    • Nothing of note from Insty2 - just shows Insty1 rejoining the data grid.

@lprimak
Contributor Author

lprimak commented Jul 24, 2023

@Pandrex247 Good job! You found a bug :) I had set that variable to zero or one (depending on my Hazelcast version) and never tested without it set.
I will fix it.
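
For context, the NumberFormatException above comes from Integer.parseInt receiving null when the property is unset. The sketch below shows the kind of guard that avoids it; it is illustrative only (the SafeIntProperty helper is hypothetical) and is not necessarily the exact fix pushed in this PR:

final class SafeIntProperty {

    // Read an optional integer system property so that an unset or malformed
    // value falls back to a default instead of reaching Integer.parseInt as null.
    static int read(String name, int defaultValue) {
        String raw = System.getProperty(name);
        if (raw == null || raw.trim().isEmpty()) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException nfe) {
            return defaultValue;
        }
    }
}

// Usage: int cpMemberCount = SafeIntProperty.read("hazelcast.cp-subsystem.cp-member-count", 0);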

@lprimak lprimak force-pushed the CP_SUBSYSTEM_FAILURE_FIXES branch from c27f594 to f73003f Compare July 24, 2023 19:44
@lprimak
Contributor Author

lprimak commented Jul 24, 2023

jenkins test

@lprimak
Contributor Author

lprimak commented Jul 24, 2023

@Pandrex247 Fixed and I tested on a clean install of Payara with no parameters.

The warnings when restarting instances with CP enabled are normal, including the repeating messages about trying to re-establish CP continuity when fewer than 3 nodes are available.

See https://docs.hazelcast.com/hazelcast/5.3/cp-subsystem/management for more info

@lprimak
Contributor Author

lprimak commented Jul 24, 2023

Argh... found another bug or two... stay tuned

@lprimak lprimak marked this pull request as draft July 24, 2023 21:19
@lprimak lprimak force-pushed the CP_SUBSYSTEM_FAILURE_FIXES branch from f73003f to a735848 Compare July 25, 2023 00:38
@lprimak
Contributor Author

lprimak commented Jul 25, 2023

@Pandrex247 Ok, fixed now, for real this time :)

@lprimak lprimak marked this pull request as ready for review July 25, 2023 00:38
- handles distributed lock exceptions well when CP subsystem is enabled
- performs auto-cp-subsystem-reset when CP gets into unusable state
@lprimak lprimak force-pushed the CP_SUBSYSTEM_FAILURE_FIXES branch from a735848 to 1bd5c57 Compare July 25, 2023 00:57
@lprimak
Contributor Author

lprimak commented Jul 25, 2023

jenkins test

@Pandrex247
Member

The DAS doesn't seem to be able to restart with this option turned on?

Essentially the same setup as above, but when you run the stop-domain command to stop the DAS (or attempt to restart it via the admin console) it starts spamming the log with Member [172.28.160.1]:4900 - 0d9feda6-19ed-481f-aff9-ea62be15fd2c this requested to shutdown but still in partition table and refuses to shut down.

Does there need to be some sort of shutdown hook override?

@lprimak
Contributor Author

lprimak commented Jul 25, 2023

CP is finicky about data loss and will try to prevent it.
It will eventually shut down, but it will spew and complain bitterly. There is nothing to be done here; it's just how the system works.
The app constantly pings the clustered lock (the Stock service), which is the root cause of this.
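
For reference, a shutdown hook along the lines mentioned above would go through Hazelcast's CP management API to take the local member out of the CP subsystem before the instance stops. A minimal sketch, assuming Hazelcast 5.x; the CpShutdownSketch helper is hypothetical and this is not part of this PR:

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.CPMember;
import com.hazelcast.cp.CPSubsystemManagementService;

final class CpShutdownSketch {

    // Remove this node from the CP subsystem before shutting it down, so the
    // remaining CP members do not keep waiting on a member that is going away.
    // Requires a reachable majority of the CP group to succeed.
    static void leaveCpSubsystem(HazelcastInstance hz) {
        CPSubsystemManagementService mgmt =
                hz.getCPSubsystem().getCPSubsystemManagementService();
        CPMember local = mgmt.getLocalCPMember();
        if (local != null) {
            mgmt.removeCPMember(local.getUuid()).toCompletableFuture().join();
        }
    }
}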

@lprimak
Contributor Author

lprimak commented Jul 26, 2023

On second thought, let me do some more tests. This shouldn't happen with 3 instances, but it does.
You may have found another bug. Thanks!

@lprimak
Contributor Author

lprimak commented Jul 27, 2023

Update:

I did the same test without CP enabled (without the PR) and I get the same error.
I think it's a bug in Hazelcast, nothing to do with this PR.
Yay! yay :(


OK, I did some more tests. Even with the "new code" that's in this PR disabled, by setting the
hazelcast.cp-subsystem.auto-promote system property to false on all nodes, the shutdown log spam is still happening.

My conclusion is that Payara works better with this PR currently than without, so it's worth merging even if the issue remains.

The problem is that I can't reproduce this via my standalone reproducer app so I can't really report this bug to Hazelcast in any useful way :(

@lprimak
Contributor Author

lprimak commented Jul 27, 2023

Furthermore, this test fails with no apps deployed either :) Yay! The bug is not in this PR :)

@lprimak
Contributor Author

lprimak commented Jul 28, 2023

Found it. Looks like a regression somewhere in my code. It's not this PR's functionality, but I'll work on a fix here. It's about 50/50 that it's a bug in Hazelcast, but I'm not quite sure yet. It's in the discovery SPI and the interaction with it.

@lprimak
Contributor Author

lprimak commented Jul 28, 2023

Or, better yet, feel free to merge this one; the above error has nothing to do with it, and I can fix it in a separate PR.

@lprimak
Contributor Author

lprimak commented Jul 30, 2023

Issue (with PR) filed with Hazelcast: hazelcast/hazelcast#25100

@lprimak
Contributor Author

lprimak commented Jul 30, 2023

I can't fix that issue here. This PR can be merged with no regressions.
I will think about whether there is any way to work around it in the meantime (in a separate PR).
Thanks!

@lprimak
Contributor Author

lprimak commented Jul 30, 2023

jenkins test

@lprimak
Contributor Author

lprimak commented Jul 30, 2023

@Pandrex247 I have disabled auto partition groups by default, which was triggering the Hazelcast bug. It works now... for realz :)

Comment on lines 534 to 538
if (Boolean.parseBoolean(System.getProperty("hazelcast.auto-partition-group", "false"))) {
PartitionGroupConfig partitionGroupConfig = config.getPartitionGroupConfig();
partitionGroupConfig.setEnabled(true);
partitionGroupConfig.setGroupType(PartitionGroupConfig.MemberGroupType.SPI);
}
Member

Isn't this changing the default behaviour even with the CP subsystem turned off?
Now the partitions will default to "member", with each instance being in its own partition group instead of being determined by the DomainDiscoveryStrategy, which utilises the instance group attribute and is optionally host-aware.

Without this it won't be host-aware, and the instance groups will be ignored for partitioning purposes - I'm not convinced this is an improvement.

@lprimak
Contributor Author

lprimak commented Aug 7, 2023

Isn't this changing the default behaviour even with the CP subsystem turned off?

Yes. The partitioning SPI bug is in Hazelcast and is present in current Payara. It has nothing to do with CP.

Now the partitions will default to "member"

Incorrect. See https://github.com/payara/Payara/blob/eb8effef2c5984843f469fa564bfd42522cf1f22/nucleus/payara-modules/hazelcast-bootstrap/src/main/java/fish/payara/nucleus/hazelcast/HazelcastCore.java#L431C32-L431C32

Unless "host aware" is checked, everything will be in one partition group, which is the correct default behavior.

Of course, once Hazelcast merges my fix, we can put this back to "turn on by default" in Payara

@lprimak
Contributor Author

lprimak commented Aug 21, 2023

Looks like my fix for hazelcast/hazelcast#25100 has been merged by Hazelcast.

@Pandrex247
Member

Looks like it is marked for Hazelcast 5.4.0; I'll try to keep an eye out.

I still haven't got around to looking into what you've said; I'll try to make some time over the coming days. Unless you're now intending to change the behaviour back? I don't know how far away Hazelcast 5.4.0 is.

@lprimak
Contributor Author

lprimak commented Aug 22, 2023

change the behaviour back

that's the big question, isn't it? I am inclined to change it back now. What do you think?

@Pandrex247
Member

I think it makes sense to change it back - it keeps this PR a bit cleaner with respect to not changing the default behaviour.

@lprimak
Contributor Author

lprimak commented Aug 24, 2023

ok, I'll change it back

@lprimak
Contributor Author

lprimak commented Aug 24, 2023

jenkins test

@lprimak
Contributor Author

lprimak commented Aug 31, 2023

At this stage, I think this is safe to merge now. Thank you!
