
ICMP block test started to report data loss. #12405

Closed
Danny-Hazelcast opened this issue Feb 21, 2018 · 4 comments · Fixed by #12691

Comments

@Danny-Hazelcast
Contributor

This test has started to report data loss:

https://hazelcast-l337.ci.cloudbees.com/view/icmp/job/block-icmp/26/console

Previously, this test had been passing since it was introduced to test ICMP.

/disk1/jenkins/workspace/block-icmp/3.10-SNAPSHOT/2018_02_20-15_36_53/block-icmp Failed

fail HzMember2HZAA split_validate_map hzcmd.map.multi.SizeAssert threadId=0 global.AssertionException: mapBak1_block-icmp0 size 60 != 100

18:18:20 http://54.82.84.143/~jenkins/workspace/block-icmp/3.10-SNAPSHOT/2018_02_20-15_36_53/block-icmp

In the test we fill the HD and heap maps/caches with data, and validate that the data is present.

Then we block ICMP communication between the 2 (AA) members and the remaining 3 (BB) members; after some time we restore ICMP communication.

The validation then shows there has been data loss.
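For context, a minimal sketch of the fill/validate flow the test performs (this is not the actual hzcmd test code; the map name and expected size are taken from the failing assertion above, everything else is illustrative):

```java
// Minimal sketch of the fill/validate flow, assuming a plain IMap;
// the map name and expected size come from the failure above, the rest is illustrative.
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class SplitValidateSketch {
    static final String MAP_NAME = "mapBak1_block-icmp0"; // map named in the failing assertion
    static final int EXPECTED_SIZE = 100;                 // entries written before the split

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Fill phase: write a known number of entries before ICMP is blocked.
        IMap<Integer, Integer> map = hz.getMap(MAP_NAME);
        for (int i = 0; i < EXPECTED_SIZE; i++) {
            map.put(i, i);
        }

        // Validate phase, run after ICMP is restored and the cluster has healed.
        // The reported failure corresponds to this assertion seeing 60 != 100.
        int actual = map.size();
        if (actual != EXPECTED_SIZE) {
            throw new AssertionError(MAP_NAME + " size " + actual + " != " + EXPECTED_SIZE);
        }
    }
}
```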

@mdogan
Contributor

mdogan commented Feb 22, 2018

After initial investigation of the logs (http://54.82.84.143/~jenkins/workspace/block-icmp/3.10-SNAPSHOT/2018_02_20-15_36_53/block-icmp), this might be an issue related to the new approach of joining as a lite member and then promoting to a data member.

  • Initial cluster ABCDE;
Members {size:5, ver:5} [
	Member [10.0.0.89]:5701 - 8bd860ed-0b41-41d7-ab04-7b08259b9380
	Member [10.0.0.247]:5701 - 2ba7b373-1d08-404f-bc29-c95c5604ffc2 this
	Member [10.0.0.197]:5701 - 8beb29aa-b29a-46b6-92b7-823928bd88be
	Member [10.0.0.93]:5701 - 6fed7fde-776d-441a-858a-096cf39386e6
	Member [10.0.0.159]:5701 - 3d52a399-b8c3-4e07-9b32-09a703e7cf59
]
  • Then they split into 4 clusters: AB, C, D and E;
Members {size:2, ver:8} [
	Member [10.0.0.89]:5701 - 8bd860ed-0b41-41d7-ab04-7b08259b9380
	Member [10.0.0.247]:5701 - 2ba7b373-1d08-404f-bc29-c95c5604ffc2 this
]
Members {size:1, ver:6} [
	Member [10.0.0.93]:5701 - 6fed7fde-776d-441a-858a-096cf39386e6 this
]
Members {size:1, ver:7} [
	Member [10.0.0.159]:5701 - 3d52a399-b8c3-4e07-9b32-09a703e7cf59 this
]
Members {size:1, ver:8} [
	Member [10.0.0.197]:5701 - 8beb29aa-b29a-46b6-92b7-823928bd88be this
]
  • Then they all merge back into a single cluster, ABCDE, but notice that C, D and E are still lite members;
Members {size:5, ver:11} [
	Member [10.0.0.89]:5701 - 8bd860ed-0b41-41d7-ab04-7b08259b9380
	Member [10.0.0.247]:5701 - 2ba7b373-1d08-404f-bc29-c95c5604ffc2 this
	Member [10.0.0.93]:5701 - a1fa8ab3-f94d-4654-a3e4-7ea17fab75c1 lite
	Member [10.0.0.197]:5701 - 02cbc474-d6b0-4d78-a2c6-0bdbaed5948a lite
	Member [10.0.0.159]:5701 - 0c5d97cf-dda2-4541-8cc6-84ac3a4ce4f3 lite
]
  • Then they split again as AB and CDE. The important part is that C, D and E are all lite members, because they are still trying to merge their data into the cluster they had merged with.
Members {size:2, ver:14} [
	Member [10.0.0.89]:5701 - 8bd860ed-0b41-41d7-ab04-7b08259b9380
	Member [10.0.0.247]:5701 - 2ba7b373-1d08-404f-bc29-c95c5604ffc2 this
]
Members {size:3, ver:12} [
	Member [10.0.0.93]:5701 - a1fa8ab3-f94d-4654-a3e4-7ea17fab75c1 this lite
	Member [10.0.0.197]:5701 - 02cbc474-d6b0-4d78-a2c6-0bdbaed5948a lite
	Member [10.0.0.159]:5701 - 0c5d97cf-dda2-4541-8cc6-84ac3a4ce4f3 lite
]
  • Since C, D and E are lite members, they cannot merge their data; after the merge operations fail and they get promoted to data members, they start with empty partitions.
Members {size:3, ver:13} [
	Member [10.0.0.93]:5701 - a1fa8ab3-f94d-4654-a3e4-7ea17fab75c1 this
	Member [10.0.0.197]:5701 - 02cbc474-d6b0-4d78-a2c6-0bdbaed5948a lite
	Member [10.0.0.159]:5701 - 0c5d97cf-dda2-4541-8cc6-84ac3a4ce4f3 lite
]
Members {size:3, ver:14} [
	Member [10.0.0.93]:5701 - a1fa8ab3-f94d-4654-a3e4-7ea17fab75c1 this
	Member [10.0.0.197]:5701 - 02cbc474-d6b0-4d78-a2c6-0bdbaed5948a
	Member [10.0.0.159]:5701 - 0c5d97cf-dda2-4541-8cc6-84ac3a4ce4f3 lite
]
Members {size:3, ver:15} [
	Member [10.0.0.93]:5701 - a1fa8ab3-f94d-4654-a3e4-7ea17fab75c1 this
	Member [10.0.0.197]:5701 - 02cbc474-d6b0-4d78-a2c6-0bdbaed5948a
	Member [10.0.0.159]:5701 - 0c5d97cf-dda2-4541-8cc6-84ac3a4ce4f3
]
  • The same scenario happens on the AB side too for the next merge: they merge into CDE as lite members, but then they get split again;
Members {size:5, ver:17} [
	Member [10.0.0.93]:5701 - a1fa8ab3-f94d-4654-a3e4-7ea17fab75c1
	Member [10.0.0.197]:5701 - 02cbc474-d6b0-4d78-a2c6-0bdbaed5948a
	Member [10.0.0.159]:5701 - 0c5d97cf-dda2-4541-8cc6-84ac3a4ce4f3
	Member [10.0.0.247]:5701 - 34324d2d-d778-4524-bff2-e06562ca1d1e this lite
	Member [10.0.0.89]:5701 - 187ec52c-97d9-4d40-8edc-0e24e8eda039 lite
]
Members {size:2, ver:18} [
	Member [10.0.0.247]:5701 - 34324d2d-d778-4524-bff2-e06562ca1d1e this lite
	Member [10.0.0.89]:5701 - 187ec52c-97d9-4d40-8edc-0e24e8eda039 lite
]
  • Sample logs from failing merge operations;

2018-02-20 12:48:51,142 WARN [hz._hzInstance_1_HZ.cached.thread-13]: [10.0.0.247]:5701 [HZ] [3.10-SNAPSHOT] Error while running map merge operation: com.hazelcast.partition.NoDataMemberInClusterException: Target of invocation cannot be found! Partition owner is null but partitions can't be assigned since all nodes in the cluster are lite members.

2018-02-20 12:46:09,440 WARN [hz._hzInstance_1_HZ.async.thread-2]: [10.0.0.247]:5701 [HZ] [3.10-SNAPSHOT] Error while running cache merge operation: Target of invocation cannot be found! Partition owner is null but partitions can't be assigned since all nodes in the cluster are lite members.
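To make the state above easier to observe, here is a small sketch (illustrative only, not part of the test or the fix) that prints which members are still lite after the merge, using the public Cluster API:

```java
// Illustrative only: print which members are still lite after the merge.
// A lite member cannot own partitions, which is why the merge operations
// above fail once every member of the sub-cluster is lite.
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.Member;

public class LiteMemberCheck {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        for (Member member : hz.getCluster().getMembers()) {
            System.out.println(member.getAddress() + " lite=" + member.isLiteMember());
        }
    }
}
```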

@metanet
Contributor

metanet commented Feb 22, 2018

Good analysis. I think we also ignore lite members in the cluster safety checks.
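A minimal sketch of the kind of check this implies, assuming a test-side safety check should require at least one data member before trusting isClusterSafe(); the method names are from the public Hazelcast 3.x API, but combining them this way is an assumption, not the actual fix:

```java
// Illustrative only: a test-side safety check that refuses to treat an
// all-lite cluster as safe.
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class SafetyCheckSketch {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // At least one member must be a data member; otherwise no partition
        // can be owned and a "safe" result would be meaningless.
        boolean hasDataMember = hz.getCluster().getMembers().stream()
                .anyMatch(m -> !m.isLiteMember());

        // isClusterSafe() checks that partitions are assigned and backups are in sync.
        boolean safe = hasDataMember && hz.getPartitionService().isClusterSafe();
        System.out.println("cluster safe for validation: " + safe);
    }
}
```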

@ahmetmircik
Member

@Danny-Hazelcast we have merged a possible fix for the issue; can you please re-run the test?

@Danny-Hazelcast
Contributor Author

@ahmetmircik yes, re-running; if the jobs start to fail again, I will open another issue.

@mmedenjak added the Source: Community label (PR or issue was opened by a community user) on Jan 28, 2020