Describe the bug
When scaling down extremely fast, a tombstone can still go missing. The TestSingleBinaryWithMemberlistScaling integration test occasionally reproduces this with the default values, e.g. memberlist-tombstone-with-debug.log
It appears that the final messages from the instance being scaled down are sent the expected number of times, but the intended recipients are themselves shutting down. This is not trivial to fix because memberlist gives us no feedback as to whether our messages were actually received. Possible solutions:
Somehow monitor for failed sends and re-send until some number of successful sends is achieved
Send out tombstone messages more times (e.g. a form of retransmit multiplier specifically for tombstones)
To Reproduce
Run the TestSingleBinaryWithMemberlistScaling test a few times.
make ./cmd/cortex/.uptodate
go test -timeout=1h -count=20 -v -tags=requires_docker ./integration -run "^TestSingleBinaryWithMemberlistScaling$"
Tweaking the scaling numbers in the test makes it fail more often.
The original fix was to check `memberlist_client_cluster_members_count`
to only scale down the next instance once the previous has been removed
everywhere. However, this does not inhibit the test enough to ensure
that tombstones are not lost, because the memberlist membership
propagates much quicker than ring membership. Therefore, instead wait
until every instance has seen the tombstone before removing another
instance.
This unfortunately takes the teeth out of the test, so the alternative is
to skip it or remove the test entirely, until issue #44 is fixed.
Expected behavior
The test doesn't fail.
Environment:
Additional Context
(Origin: cortexproject/cortex#4360)