Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix WAN events fired after split-brain merges where values don't change [HZ-2620] #24928

Merged
merged 14 commits into from Aug 10, 2023

Conversation

JamesHazelcast
Copy link
Contributor

@JamesHazelcast JamesHazelcast commented Jun 29, 2023

This commit alters the returned type of the RecordStore#merge and AbstractCacheRecordStore#merge methods to provide a more verbose result for the merge operation; this allows us to maintain all of the existing functionality, while also allowing the WAN replication stage in MergeOperation and CacheMergeOperation to check this more verbose response and avoid sending a WAN Replication event when the response is VALUES_ARE_EQUAL. The key to this change is that it only changes WAN replication behaviour in this circumstance, and all existing functionality (such as calling entry listeners) remains.

I introduced separate response classes for IMap vs ICache as their functionality and results offered differ slightly; we could get away with using a shared response class and adjusting the internals, but to allow for easy expansion and possible divergence in functionality between the two data structures in the future, I opted to use separate classes.

I also introduced the same logic handling for backup replica handling (skip WAN replication for these replications when the primary replica was already skipped).

This solution does break the existing API for the #merge function in RecordStore and ICacheRecordStore, but both classes are within our impl packages, and so we make no API guarantees about them.

Refer to HZ-2620 for detailed breakdown
Fixes https://hazelcast.atlassian.net/browse/HZ-2620
EE PR: https://github.com/hazelcast/hazelcast-enterprise/pull/6197

Breaking changes (list specific methods/types/messages):

  • API: ICacheRecordStore#merge and RecordStore#merge (and implementing classes) functions have had their return type changed. N.B. all changes are within internal impl packages.

The current mechanics of WAN replication mean that `MergeOperation` (for
`IMap) and `CacheMergeOperation` (for `ICache`) invocations generate a
WAN replication event for all results besides no data present on both
sides of the merge. This means that a scenario where the values of the
merge are identical, and only expiry metadata is updated, still causes
WAN replication to take place.

This is not desired behaviour as it may introduce unexpected results on
the WAN replication target, since the source cluster has not actually
had any data changed, it simply recovered from a split-brain scenario.
One example of this would be where a target cluster processes WAN
replicated data from a source cluster, then deletes it from its local
data set once processed. A split-brain merge on the source cluster would
cause data that has already been processed on the target cluster to be
re-introduced for unnecessary processing.

This commit aims to resolve this situation by altering the returned type
of the `RecordStore#merge` and `AbstractCacheRecordStore#merge` methods
to provide a more verbose result for the merge operation; this allows us
to maintain all of the existing functionality, while also allowing the
WAN replication stage in `MergeOperation` and `CacheMergeOperation` to
check this more verbose response and avoid sending a WAN Replication
event when the response is `VALUES_ARE_EQUAL`. The key to this change is
that it only changes WAN replication behaviour in this circumstance, and
all existing functionality (such as calling entry listeners) remains.

I introduced separate response classes for `IMap` vs `ICache` as their
functionality and results offered differ slightly; we could get away with
using a shared response class and adjusting the internals, but to allow
for easy expansion and possible divergence in functionality between the
two data structures in the future, I opted to use separate classes.

This solution does break the existing API for the `#merge` function in
`RecordStore` and `ICacheRecordStore`, but both classes are within our
`impl` packages, and so we make no API guarantees about them.
(I may have had `quick` profile selected in my verify run)
@hazelcast hazelcast deleted a comment from hz-devops-test Jun 29, 2023
@hazelcast hazelcast deleted a comment from hz-devops-test Jun 29, 2023
@hazelcast hazelcast deleted a comment from hz-devops-test Jun 29, 2023
WIP solution that needs feedback and more thought before final approach
is decided upon. Without this addition, we get unnecessary WAN events
being replicated for backups where we didn't WAN replicate the primary.
@hazelcast hazelcast deleted a comment from hz-devops-test Jun 30, 2023
@hazelcast hazelcast deleted a comment from hz-devops-test Jun 30, 2023
@hazelcast hazelcast deleted a comment from hz-devops-test Jun 30, 2023
This alternative approach is more performant and maintains backwards compatibility.
@hazelcast hazelcast deleted a comment from hz-devops-test Jun 30, 2023
@vbekiaris vbekiaris requested review from vbekiaris and removed request for blazember July 3, 2023 08:43

private Map<Data, CacheRecord> cacheRecords;
private Map<Data, WanWrappedCacheRecord> cacheRecords;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about declaring cacheRecords as Map<Data, ? extends CacheRecord>?
This way we wouldn't impose record-wrapping to all operations (eg loadAll etc) and their serialized form. The disadvantage is there will be an additional complexity (instanceof check, conditionally run/serialize the isWanReplicated logic), so maybe it's not a great trade-off. wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd actually like to do what I originally thought about doing - converting the Cache backup operations to use a list of pairs, like Map operations do; that way I don't need to have any wrapper at all, and everything can be dealt with in the same manner as Map operations are.

For PoC purposes and as it wasn't difficult to do, I made the changes in this commit: eccd294

Due to all backup operations being wrapped, we can't use our serde trick of appending to the end here.
Currently, Cache uses a Map to store key/value pairs pending backup
operations - changing it to a "pairs" List like IMap operations use
allows easier integration for things like "non-WAN replicated keys"
while also making ICache/IMap mechanics more familiar.
@hazelcast hazelcast deleted a comment from hz-devops-test Jul 3, 2023
@hazelcast hazelcast deleted a comment from hz-devops-test Jul 3, 2023
@hazelcast hazelcast deleted a comment from hz-devops-test Jul 3, 2023
@hazelcast hazelcast deleted a comment from hz-devops-test Jul 3, 2023
@hazelcast hazelcast deleted a comment from hz-devops-test Jul 3, 2023
@hazelcast hazelcast deleted a comment from hz-devops-test Jul 3, 2023
@vbekiaris vbekiaris self-requested a review July 6, 2023 14:20
@ahmetmircik
Copy link
Member

Could you please remind what was the rational of this change and the user complaint if there is any?

This PR removes WAN replication for split-brain merges where the
values are equal, but their metadata may still have changed - in
this case we should WAN replicate the event so this metadata can
be updated on target clusters as well.
@hz-devops-test
Copy link

The job Hazelcast-pr-EE-compiler of your PR failed. (Hazelcast internal details: build log, artifacts).
Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log file
--------------------------
---------SUMMARY----------
--------------------------
[ERROR] COMPILATION ERROR : 
--------------------------
[ERROR]   reason: class file for aQute.bnd.annotation.Resolution not found
--------------------------
---------ERRORS-----------
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/cache/impl/hidensity/nativememory/HiDensityNativeMemoryCacheRecordStore.java:[305,4] error: method does not override or implement a method from a supertype
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/cache/impl/hidensity/nativememory/HiDensityNativeMemoryCacheRecordStore.java:[309,79] error: incompatible types: CacheMergeResponse cannot be converted to HiDensityNativeMemoryCacheRecord
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/cache/impl/hidensity/nativememory/HotRestartHiDensityNativeMemoryCacheRecordStore.java:[30,7] error: merge(CacheMergeTypes,SplitBrainMergePolicy,Object>,CallerProvenance) in HiDensityNativeMemoryCacheRecordStore cannot implement merge(CacheMergeTypes,SplitBrainMergePolicy,Object>,CallerProvenance) in ICacheRecordStore
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/cache/impl/operation/WanCacheMergeOperation.java:[38,46] error: incompatible types: CacheMergeResponse cannot be converted to CacheRecord
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/cache/impl/hidensity/operation/CacheMergeOperation.java:[68,52] error: incompatible types: CacheMergeResponse cannot be converted to CacheRecord
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/cache/impl/hidensity/nativememory/HiDensityNativeMemoryCacheRecordStore.java:[306,23] error: merge(CacheMergeTypes,SplitBrainMergePolicy,Object>,CallerProvenance) in HiDensityNativeMemoryCacheRecordStore cannot implement merge(CacheMergeTypes,SplitBrainMergePolicy,Object>,CallerProvenance) in ICacheRecordStore
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/cache/impl/hidensity/nativememory/HiDensityNativeMemoryCacheRecordStore.java:[305,4] error: method does not override or implement a method from a supertype
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/cache/impl/hidensity/nativememory/HiDensityNativeMemoryCacheRecordStore.java:[309,79] error: incompatible types: CacheMergeResponse cannot be converted to HiDensityNativeMemoryCacheRecord
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/cache/impl/hidensity/nativememory/HotRestartHiDensityNativeMemoryCacheRecordStore.java:[30,7] error: merge(CacheMergeTypes,SplitBrainMergePolicy,Object>,CallerProvenance) in HiDensityNativeMemoryCacheRecordStore cannot implement merge(CacheMergeTypes,SplitBrainMergePolicy,Object>,CallerProvenance) in ICacheRecordStore
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/cache/impl/operation/WanCacheMergeOperation.java:[38,46] error: incompatible types: CacheMergeResponse cannot be converted to CacheRecord
--------------------------
[ERROR] /home/jenkins/jenkins_slave/workspace/Hazelcast-pr-EE-compiler/hazelcast-enterprise/hazelcast-enterprise/src/main/java/com/hazelcast/cache/impl/hidensity/operation/CacheMergeOperation.java:[68,52] error: incompatible types: CacheMergeResponse cannot be converted to CacheRecord
--------------------------

@hazelcast hazelcast deleted a comment from hz-devops-test Jul 12, 2023
@hazelcast hazelcast deleted a comment from hz-devops-test Jul 12, 2023
@JamesHazelcast JamesHazelcast merged commit 35c3d35 into hazelcast:master Aug 10, 2023
7 of 8 checks passed
@JamesHazelcast JamesHazelcast deleted the fix/5.4/hz-2620 branch September 21, 2023 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants