Suspicious behaviour of testRecovery_singleInstanceRemaining #5923

Closed
jerrinot opened this Issue Aug 11, 2015 · 5 comments

@jerrinot
Contributor

https://github.com/hazelcast/hazelcast/blob/08857da91e497365437700243db026fda4c5c31a/hazelcast/src/test/java/com/hazelcast/xa/HazelcastXATest.java#L152 seems to have two operational modes:

  • fast mode (<10s)
  • slow mode (>2 minutes)

There is nothing in between. My guess is that either our transaction recovery or the test itself is racy: sometimes a transaction is recovered successfully, but sometimes it waits for a timeout (2 minutes).

https://hazelcast-l337.ci.cloudbees.com/job/Hazelcast-3.x-OpenJDK6/com.hazelcast$hazelcast/623/testReport/com.hazelcast.xa/HazelcastXATest/testRecovery_singleInstanceRemaining/history/

@jerrinot jerrinot self-assigned this Aug 11, 2015
@jerrinot jerrinot added this to the 3.6 milestone Aug 11, 2015
@jerrinot
Contributor

It seems like a side-effect of #5602 - the recent changes in the lock operations.

I can see that the transaction is recovered, and even committed, but the record is not unlocked because of this check. When the entry is not unlocked, the get() operation waits for the transaction timeout, hence the 2-minute slow mode.

@jerrinot
Contributor

I can reproduce and detect the problem reliably by annotating the testRecovery_singleInstanceRemaining test with

    @Test(timeout = 60 * 1000 * 1)
    @Repeat(100)

When I randomize the initial callId on the members, the issue goes away.

Now the question is how to fix this properly. I could split the referenceId into lockReferenceId and unlockReferenceId, which would probably fix this particular test. But I'm afraid it's not a real solution, as a transaction rollback would still be affected. @mdogan: What's your view?

@jerrinot jerrinot added a commit to jerrinot/hazelcast that referenced this issue Aug 13, 2015
@jerrinot jerrinot Fixes #5923
See hazelcast#5923 for details
f55b8d9
@mdogan
Member
mdogan commented Aug 13, 2015

I don't understand how lock and unlock can be using the same callId. Aren't they separate invocations?

@jerrinot
Contributor

They are. However, in a transactional map the two invocations might be initiated by different members, and they might have the same callId purely by chance. That's why randomizing the initial callId makes the problem go away.
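
To make the collision concrete, here is a minimal sketch. It is not Hazelcast code: `Member`, `LockRecord`, and `unlockLost` are hypothetical names, and the duplicate check is an assumed simplification of the referenceId check discussed above. It only shows how two members whose call-id sequences start at the same value can produce an unlock that looks like a repeat of the lock.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only; none of these types exist in Hazelcast.
final class CallIdCollisionSketch {

    /** Stand-in for a member that hands out call ids from its own local sequence. */
    static final class Member {
        private final AtomicLong callIdSeq;

        Member(long initialCallId) {
            this.callIdSeq = new AtomicLong(initialCallId);
        }

        long nextCallId() {
            return callIdSeq.incrementAndGet();
        }
    }

    /** Stand-in for a lock record that remembers the reference id of the last applied operation. */
    static final class LockRecord {
        private boolean locked;
        private long lastReferenceId = -1L;

        void lock(long referenceId) {
            locked = true;
            lastReferenceId = referenceId;
        }

        /** Returns false when the unlock is dropped as a presumed duplicate. */
        boolean unlock(long referenceId) {
            // Assumed simplification: an operation carrying the same reference id
            // as the last applied one is treated as a retry and ignored.
            if (referenceId == lastReferenceId) {
                return false;
            }
            locked = false;
            lastReferenceId = referenceId;
            return true;
        }

        boolean isLocked() {
            return locked;
        }
    }

    /** Member A locks, member B unlocks; returns true if the record is still locked afterwards. */
    static boolean unlockLost(long initialCallIdA, long initialCallIdB) {
        Member a = new Member(initialCallIdA);
        Member b = new Member(initialCallIdB);
        LockRecord record = new LockRecord();
        record.lock(a.nextCallId());
        record.unlock(b.nextCallId());
        return record.isLocked();
    }

    public static void main(String[] args) {
        // Both sequences start at 0, so both operations carry id 1: the unlock is dropped.
        System.out.println("same initial callId, unlock lost: " + unlockLost(0L, 0L));
        // Different starting points yield different ids: the unlock is applied.
        System.out.println("different initial callId, unlock lost: " + unlockLost(0L, 1000L));
    }
}
```

In this sketch, different starting values only make a collision unlikely; they do not rule it out, which matches the concern that randomizing the initial callId hides the problem rather than fixing it.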

@mdogan
Member
mdogan commented Aug 13, 2015

Ah I see. lock and unlock can be called by different members in transactions. For normal lock operations, this cannot happen.

Maybe we can use some different reference-id (different from call-id) for transactional locks. Or we can remove reference check for them...
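
The second option (removing the reference check for transactional locks) could look roughly like the sketch below. This is an assumption-laden illustration, not the actual patch: `LockRecord`, its `transactional` flag, and the shape of the duplicate check are all hypothetical.

```java
// Hypothetical illustration only; not the real Hazelcast lock implementation.
final class TransactionalUnlockSketch {

    static final class LockRecord {
        private boolean locked;
        private boolean transactional;
        private long lastReferenceId = -1L;

        void lock(long referenceId, boolean transactional) {
            this.locked = true;
            this.transactional = transactional;
            this.lastReferenceId = referenceId;
        }

        /** Returns false when the unlock is dropped as a presumed duplicate. */
        boolean unlock(long referenceId) {
            boolean looksLikeDuplicate = referenceId == lastReferenceId;
            // For transactional locks the reference id is ignored, because lock and
            // unlock may come from different members with coinciding ids.
            if (!transactional && looksLikeDuplicate) {
                return false;
            }
            locked = false;
            lastReferenceId = referenceId;
            return true;
        }

        boolean isLocked() {
            return locked;
        }
    }

    public static void main(String[] args) {
        LockRecord tx = new LockRecord();
        tx.lock(1L, true);
        // Same reference id, but transactional: the unlock is still applied.
        System.out.println("transactional unlock applied: " + tx.unlock(1L));

        LockRecord plain = new LockRecord();
        plain.lock(1L, false);
        // Same reference id on a normal lock: treated as a duplicate and dropped.
        System.out.println("plain unlock applied: " + plain.unlock(1L));
    }
}
```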

@jerrinot jerrinot added a commit to jerrinot/hazelcast that referenced this issue Aug 14, 2015
@jerrinot jerrinot Fix #5923
For transactional locks the referenceId is not taken into consideration as lock/unlock operations might
be initiated by different members -> they might have the same referenceId and the unlock might be lost.

See the #5923 and #5954 for details.
335f8c0
@jerrinot jerrinot added a commit to jerrinot/hazelcast that referenced this issue Aug 14, 2015
@jerrinot jerrinot Fix #5923
For transactional locks the referenceId is not taken into consideration as lock/unlock operations might
be initiated by different members -> they might have the same referenceId and the unlock might be lost.

See the #5923 and #5954 for details.
d88fb19
@jerrinot jerrinot closed this in #5954 Aug 17, 2015