
ReliableTopic stops working, without exceptions or warnings, after partition migration back to the previous owner node (where it was created and the listener attached) #13602

Closed
DimaGusev opened this issue Aug 19, 2018 · 0 comments


Test case:

1) The first node starts and attaches a MessageListener.
2) Start new nodes -> the partition holding the topic is migrated to other nodes.
3) Start new nodes or shut down a node -> the partition holding the topic is migrated back to the first node.

After that, none of the message listeners attached to the ReliableTopic receive messages.

Example for Hazelcast 3.10.4:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class ReliableTopicMigrationRepro {
    public static void main(String... args) throws InterruptedException {
        String topicName = "topic";
        // The first node creates the reliable topic and attaches the listener.
        HazelcastInstance first = Hazelcast.newHazelcastInstance();
        first.getReliableTopic(topicName).addMessageListener(message ->
                System.out.println("Received:" + message.getMessageObject()));
        // A second node joins (the partition migrates away) and then shuts down
        // (the partition migrates back to the first node).
        HazelcastInstance second = Hazelcast.newHazelcastInstance();
        second.shutdown();
        while (true) {
            Thread.sleep(5000);
            first.getReliableTopic(topicName).publish("Hello");
        }
    }
}

In this example, the ReliableTopic "topic" maps to partition id 85 and is originally created on the first instance.
When the second instance starts, Hazelcast migrates partitions [0-134] to the second node.
When the second instance shuts down, Hazelcast migrates partitions [0-134] back to the first node.
After that we publish to the topic, but the listener never receives the messages.
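
A quick way I used to sanity-check which partition the topic lands on (only a sketch; it assumes the 3.x convention that a reliable topic named "topic" is backed by a ringbuffer named "_hz_rb_topic", and that the partition is derived from that name):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.Partition;

public class TopicPartitionCheck {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        // Assumption: the reliable topic "topic" is backed by a ringbuffer
        // named "_hz_rb_topic" (3.x naming convention), so its partition is
        // derived from that name.
        Partition partition = hz.getPartitionService().getPartition("_hz_rb_topic");
        System.out.println("partition id = " + partition.getPartitionId()
                + ", owner = " + partition.getOwner());
        hz.shutdown();
    }
}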

Technical details:

After hours of debugging I finally found the root cause and a reproduction sequence on a local environment.
When we attach the MessageListener, Hazelcast generates a GenericOperation and RingbufferService creates a RingbufferContainer.
Next, Hazelcast generates a PartitionInvocation with a ReadManyOperation and executes it locally; OperationParkerImpl adds a wait key for this operation.
When Hazelcast migrates the partition away, it removes the parked operations from OperationParkerImpl.waitSetMap and responds with a PartitionMigratingException, but the RingbufferContainer still exists.
On the PartitionMigratingException, the ReadManyOperation is retried, but as a remote call. The important thing is that it is still the same ReadManyOperation instance, and it keeps a reference to the RingbufferContainer on the local node.
When, some time later, Hazelcast migrates the partition back to our node, a ReplicationOperation is generated and sent to the first node. After deserialization a new RingbufferContainer is created and the old container is replaced: ReplicationOperation.run -> RingbufferService.addRingbuffer -> put.
A PartitionMigratingException is generated for the ReadManyOperation again, and the operation is retried locally and added back to OperationParkerImpl.waitSetMap.
After all of that we have an inconsistency: there are two RingbufferContainers, the current one held by RingbufferService and the old one held by the ReadManyOperation, because Hazelcast reuses the same ReadManyOperation instance while switching the call from local to remote and back.
When we publish a message, Hazelcast adds the new element to the RingbufferContainer in RingbufferService and notifies the ReadManyOperation, but the notification has no effect because the RingbufferContainer referenced by the ReadManyOperation has no elements.
The notification is cut short by the return statement in WaitSet.unpark when entry.shouldWait() returns true, which also prevents any other, correct ReadManyOperation from being executed.
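
To see the resulting inconsistency in isolation, here is a minimal, self-contained model of the pattern described above. This is plain Java, not Hazelcast code: Service, Container and ParkedRead are hypothetical stand-ins for RingbufferService, the RingbufferContainer and the parked ReadManyOperation.

import java.util.HashMap;
import java.util.Map;

public class StaleContainerDemo {

    // Stand-in for RingbufferContainer: just counts published elements.
    static class Container {
        int size;
    }

    // Stand-in for RingbufferService: maps a name to the current container.
    static class Service {
        final Map<String, Container> containers = new HashMap<>();

        Container getOrCreate(String name) {
            return containers.computeIfAbsent(name, n -> new Container());
        }

        // Migration back replaces the container, like addRingbuffer -> put.
        void replaceOnMigration(String name) {
            containers.put(name, new Container());
        }
    }

    // Stand-in for the parked ReadManyOperation: it keeps a direct reference
    // to the container it was created against.
    static class ParkedRead {
        final Container container;

        ParkedRead(Container container) {
            this.container = container;
        }

        boolean shouldWait() {
            return container.size == 0;
        }
    }

    public static void main(String[] args) {
        Service service = new Service();
        ParkedRead read = new ParkedRead(service.getOrCreate("topic"));

        // The partition migrates away and back: the service now holds a NEW
        // container, but the parked read still points at the OLD one.
        service.replaceOnMigration("topic");

        // A publish goes to the current container in the service...
        service.getOrCreate("topic").size++;

        // ...but the parked read checks its stale container and keeps waiting,
        // so it is never unparked and the listener never sees the message.
        System.out.println("shouldWait = " + read.shouldWait()); // prints true
    }
}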

Summary

We use Hazelcast 3.8 in our project and we can always reproduce this issue after starting the servers in a particular order with HOST_AWARE partitioning; nothing works once the environment is up and running. I also checked that the problem exists in all versions >= 3.8.
I think this is quite a serious issue, because the MessageListener stops working in such a simple case, which causes data loss without anything in the log files.
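
For reference, this is roughly how HOST_AWARE partition grouping is enabled in 3.x (a programmatic sketch; the <partition-group> element in hazelcast.xml is the declarative equivalent):

import com.hazelcast.config.Config;
import com.hazelcast.config.PartitionGroupConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class HostAwareConfigExample {
    public static void main(String[] args) {
        Config config = new Config();
        // Group members by host so that primary and backup copies of a
        // partition never end up on the same machine.
        config.getPartitionGroupConfig()
              .setEnabled(true)
              .setGroupType(PartitionGroupConfig.MemberGroupType.HOST_AWARE);
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}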
