Gathering MultiMap stats breaks split-brain healing with LatestAccessMergePolicy #16001

Closed
blazember opened this issue Nov 13, 2019 · 0 comments · Fixed by #16004

@blazember blazember commented Nov 13, 2019

Gathering MultiMap statistics for diagnostics (3.x) and for the metrics system (4.x) is counted as an access. This breaks split-brain healing with LatestAccessMergePolicy and may let the wrong side of the split win the merge.

Analysis
Merging MultiMaps is done per container rather than per entry. MultiMapContainer therefore has a lastAccessTime field that LatestAccessMergePolicy relies on. This field is updated every time the MultiMapContainer is accessed. MultiMapService#getStats() also updates lastAccessTime, which breaks LatestAccessMergePolicy, because stats collection should not count as an access.
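
To make the failure mode concrete, here is a minimal, self-contained sketch. The classes ContainerSketch, ClusterSideSketch and LatestAccessSketch are hypothetical stand-ins, not Hazelcast classes, and the "later access wins" comparison is an assumption about how a container-level latest-access merge decides. Timestamps are passed in explicitly so the example is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for MultiMapContainer (not the Hazelcast class).
final class ContainerSketch {
    private long lastAccessTime;

    // Records an access at a caller-supplied timestamp to keep the demo deterministic.
    void access(long now) {
        lastAccessTime = now;
    }

    long getLastAccessTime() {
        return lastAccessTime;
    }
}

// Hypothetical stand-in for the containers held by one side of a split cluster.
final class ClusterSideSketch {
    private final Map<String, ContainerSketch> containers = new HashMap<>();

    // Every lookup counts as an access, mirroring the behaviour described above.
    ContainerSketch getContainer(String name, long now) {
        ContainerSketch container = containers.computeIfAbsent(name, k -> new ContainerSketch());
        container.access(now);
        return container;
    }

    // Stats gathering that goes through the access-counting lookup.
    long gatherStats(String name, long now) {
        return getContainer(name, now).getLastAccessTime();
    }

    // Reads the timestamp without touching it; used only to inspect the result.
    long peekLastAccessTime(String name) {
        return containers.get(name).getLastAccessTime();
    }
}

final class LatestAccessSketch {
    // Assumed container-level rule: the side with the later access wins.
    static String winner(ClusterSideSketch a, ClusterSideSketch b, String name) {
        return a.peekLastAccessTime(name) > b.peekLastAccessTime(name) ? "A" : "B";
    }

    public static void main(String[] args) {
        ClusterSideSketch sideA = new ClusterSideSketch();
        ClusterSideSketch sideB = new ClusterSideSketch();

        sideA.getContainer("orders", 100L); // user access on side A
        sideB.getContainer("orders", 200L); // later user access on side B
        System.out.println("before stats: " + winner(sideA, sideB, "orders"));

        sideA.gatherStats("orders", 300L);  // monitoring only, no user access
        System.out.println("after stats: " + winner(sideA, sideB, "orders"));
    }
}
```

In this sketch side B holds the most recent user access, yet after a single stats collection on side A the decision flips to A, which is the kind of wrong merge described above.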

Other accesses
MultiMapService#getOrCreateCollectionContainerWithoutAccess already exists to avoid this problem in some other cases, such as during migration or in backup operations. The fix for this issue should introduce a method that gets the MultiMapContainer without accessing or creating it.
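
A rough sketch of what such a lookup could look like, assuming a partition-level map of containers keyed by name. TrackedContainer, PartitionSketch and all method names here are hypothetical and chosen for illustration only; the real Hazelcast classes carry more state and different signatures.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical, simplified container with only the fields this sketch needs.
final class TrackedContainer {
    private long lastAccessTime;
    private int size;

    void access(long now) {
        lastAccessTime = now;
    }

    void addEntry() {
        size++;
    }

    long getLastAccessTime() {
        return lastAccessTime;
    }

    int size() {
        return size;
    }
}

// Hypothetical stand-in for a per-partition container registry.
final class PartitionSketch {
    private final ConcurrentMap<String, TrackedContainer> containers = new ConcurrentHashMap<>();

    // Regular path: creates the container if needed and records the access.
    TrackedContainer getOrCreate(String name, long now) {
        TrackedContainer container = containers.computeIfAbsent(name, k -> new TrackedContainer());
        container.access(now);
        return container;
    }

    // Read-only path: no creation and no access; returns null when absent.
    TrackedContainer getWithoutAccessOrCreate(String name) {
        return containers.get(name);
    }

    // Stats gathering built on the read-only path.
    int entryCountForStats(String name) {
        TrackedContainer container = getWithoutAccessOrCreate(name);
        return container == null ? 0 : container.size();
    }

    boolean exists(String name) {
        return containers.containsKey(name);
    }
}
```

With this shape, collecting stats for a name that was never used neither creates an empty container nor moves lastAccessTime on an existing one, so the merge decision depends only on real accesses.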

@blazember blazember added this to the 4.0 milestone Nov 13, 2019
@blazember blazember self-assigned this Nov 13, 2019
blazember added a commit to blazember/hazelcast that referenced this issue Nov 13, 2019
Gathering multimap statistics calls
MultiMapPartitionContainer#getMultiMapContainer(), which updates the
lastAccessTime of the returned container and thereby breaks the
latest access merge policy. This is fixed by no longer counting
statistics gathering as an access. Retrieving the container on backups
for the given partition is also no longer considered an access.

Fixes hazelcast#16001
blazember added a commit that referenced this issue Nov 13, 2019
Gathering multimap statistics calls
MultiMapPartitionContainer#getMultiMapContainer(), which updates the
lastAccessTime of the returned container and thereby breaks the
latest access merge policy. This is fixed by no longer counting
statistics gathering as an access. Retrieving the container on backups
for the given partition is also no longer considered an access.

Fixes #16001
blazember added a commit to blazember/hazelcast that referenced this issue Nov 15, 2019
Follow-up on hazelcast#16003 that fixed hazelcast#16001 by introducing a second parameter
to `MultiMapPartitionContainer#getMultiMapContainer(String,boolean)` to
indicate whether or not the container should be `access()`'ed. This
broke `AnswerTest`, which can't be fixed easily since there is no such
method in HZ 3.11. This is fixed by making the aforementioned method
private, adding back the previously existing
`getMultiMapContainer(String name)`, and introducing
`getMultiMapContainerWithoutAccess(String name)`, both of which delegate
to the private `getMultiMapContainer(String,boolean)`.

Fixes hazelcast#16011
blazember added a commit that referenced this issue Nov 15, 2019
Follow-up on #16003 that fixed #16001 by introducing a second parameter
to MultiMapPartitionContainer#getMultiMapContainer(String,boolean) to
indicate whether or not the container should be access()'ed. This broke
AnswerTest, which can't be fixed easily since
getMultiMapContainer(String,boolean) doesn't exist in HZ 3.11. Therefore,
the test is fixed by making the aforementioned method private, adding
back the previously existing getMultiMapContainer(String name), and
introducing getMultiMapContainerWithoutAccess(String name), both of which
delegate to the private getMultiMapContainer(String,boolean).

Fixes #16011
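
The delegation described in the commit message above could look roughly like the following. Only the three method names and signatures come from the commit message; the class names, the access counter, the create helper, and the assumption that a missing container yields null are all hypothetical and exist just to make the sketch runnable.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical container that only counts how often it was accessed.
final class AccessCountingContainer {
    private int accessCount;

    void access() {
        accessCount++;
    }

    int getAccessCount() {
        return accessCount;
    }
}

// Hypothetical stand-in for MultiMapPartitionContainer (not the Hazelcast class).
final class PartitionContainerDelegationSketch {
    private final ConcurrentMap<String, AccessCountingContainer> containers = new ConcurrentHashMap<>();

    // Illustration-only helper to register a container.
    void create(String name) {
        containers.putIfAbsent(name, new AccessCountingContainer());
    }

    // Previously existing signature, kept so callers written against it still compile.
    public AccessCountingContainer getMultiMapContainer(String name) {
        return getMultiMapContainer(name, true);
    }

    // New variant for stats gathering and backups: does not count as an access.
    public AccessCountingContainer getMultiMapContainerWithoutAccess(String name) {
        return getMultiMapContainer(name, false);
    }

    // Private worker both public methods delegate to.
    // Assumption: a missing container yields null rather than being created.
    private AccessCountingContainer getMultiMapContainer(String name, boolean isAccess) {
        AccessCountingContainer container = containers.get(name);
        if (container != null && isAccess) {
            container.access();
        }
        return container;
    }
}
```

Keeping the boolean overload private means the public surface stays compatible with code written against the single-argument method, while the intent of each call site is visible in the method name rather than in a flag.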