Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WAN operations can fail if cluster clocks are different #13301

Closed
sertugkaya opened this Issue Jun 14, 2018 · 4 comments

Comments

Projects
None yet
6 participants
@sertugkaya
Copy link

sertugkaya commented Jun 14, 2018

Difference between cluster clocks of target and sender can completely block WAN communication, thus filling up the WAN queues on the source cluster. Such case can produce this logs below. Maybe we need to avoid checking some operation timeouts in case of WAN operations.

Jun 13, 2018 4:17:49 PM com.hazelcast.enterprise.wan.replication.WanBatchReplication
WARNING: [x.x.x.x]:5701 [groupname] [3.9] WanOperation got rejected before execution due to not starting within the operation-call-timeout of: 60000 ms. Current time: 2018-06-13 16:17:49.221. Start time: 2018-06-13 16:17:49.139. Total elapsed time: 82 ms. Invocation{op=com.hazelcast.enterprise.wan.operation.WanOperation{serviceName='hz:core:wanReplicationService', identityHash=465534748, partitionId=-1, replicaIndex=0, callId=-75401, invocationTime=1528906442889 (2018-06-13 16:14:02.889), waitTimeout=-1, callTimeout=60000}, tryCount=1, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1528906669139, firstInvocationTime='2018-06-13 16:17:49.139', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[x.x.x.x]:5701, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=2078, /x.x.x.x:37543->/x.x.x.x:5701, endpoint=[x.x.x.x]:5701, alive=true, type=MEMBER]}
com.hazelcast.core.OperationTimeoutException: WanOperation got rejected before execution due to not starting within the operation-call-timeout of: 60000 ms. Current time: 2018-06-13 16:17:49.221. Start time: 2018-06-13 16:17:49.139. Total elapsed time: 82 ms. Invocation{op=com.hazelcast.enterprise.wan.operation.WanOperation{serviceName='hz:core:wanReplicationService', identityHash=465534748, partitionId=-1, replicaIndex=0, callId=-75401, invocationTime=1528906442889 (2018-06-13 16:14:02.889), waitTimeout=-1, callTimeout=60000}, tryCount=1, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1528906669139, firstInvocationTime='2018-06-13 16:17:49.139', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[x.x.x.x]:5701, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=2078, /x.x.x.x:37543->/x.x.x.x:5701, endpoint=[x.x.x.x]:5701, alive=true, type=MEMBER]}
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:164)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:104)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrowIfException(InvocationFuture.java:79)
        at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:155)
        at com.hazelcast.spi.impl.AbstractInvocationFuture.join(AbstractInvocationFuture.java:136)
        at com.hazelcast.enterprise.wan.replication.AbstractWanReplication.invokeOnWanTarget(AbstractWanReplication.java:154)
        at com.hazelcast.enterprise.wan.replication.WanBatchReplication$BatchStripedRunnable.run(WanBatchReplication.java:218)
        at com.hazelcast.util.executor.StripedExecutor$Worker.process(StripedExecutor.java:225)
        at com.hazelcast.util.executor.StripedExecutor$Worker.run(StripedExecutor.java:208)

@mmedenjak mmedenjak added this to the 3.12 milestone Aug 27, 2018

@dbrimley dbrimley modified the milestones: 3.12, 3.13 Nov 8, 2018

@yozank

This comment has been minimized.

Copy link
Contributor

yozank commented Dec 3, 2018

@dbrimley

This comment has been minimized.

Copy link
Member

dbrimley commented Dec 3, 2018

The entire error message is very confusing. Shouldn't we display the cluster clocks time rather than wall clocks?

@mmedenjak

This comment has been minimized.

Copy link
Contributor

mmedenjak commented Dec 3, 2018

@dbrimley yes. Or, we should explicitly say where we are showing the cluster clock and where we are showing the system clock. Here, we are reporting the system clock where the actual timeout check is using the cluster clock.

@yozank yozank modified the milestones: 3.13, 3.11.2 Dec 18, 2018

@sertugkaya

This comment has been minimized.

Copy link
Author

sertugkaya commented Dec 19, 2018

As discussed this will be fixed in the next maintenance by the quick fix of exluding WanOperation from timeout checks as JoinOperations are excluded here

mmedenjak pushed a commit to mmedenjak/hazelcast that referenced this issue Dec 27, 2018

Matko Medenjak
Ignore call timeout check for WAN operations
Call timeout check is based on the cluster clock. In the case of WAN,
the difference between the cluster clocks of the source and target
cluster is included in the timeout and when the cluster clocks are too
far apart, the WAN operations might get rejected immediately, thus
causing WAN to fail indefinitely. We ignore the call timeout check for
WAN operations, similarly as we do for join operations.

Fixes: hazelcast#13301

@mmedenjak mmedenjak modified the milestones: 3.11.2, 3.12 Dec 27, 2018

@mmedenjak mmedenjak modified the milestones: 3.12, 3.11.2 Dec 27, 2018

mmedenjak added a commit that referenced this issue Jan 3, 2019

Ignore call timeout check for WAN operations (#14332)
Call timeout check is based on the cluster clock. In the case of WAN,
the difference between the cluster clocks of the source and target
cluster is included in the timeout and when the cluster clocks are too
far apart, the WAN operations might get rejected immediately, thus
causing WAN to fail indefinitely. We ignore the call timeout check for
WAN operations, similarly as we do for join operations.

Fixes: #13301

mmedenjak pushed a commit to mmedenjak/hazelcast that referenced this issue Jan 7, 2019

Matko Medenjak
Ignore call timeout check for WAN operations (hazelcast#14332)
Call timeout check is based on the cluster clock. In the case of WAN,
the difference between the cluster clocks of the source and target
cluster is included in the timeout and when the cluster clocks are too
far apart, the WAN operations might get rejected immediately, thus
causing WAN to fail indefinitely. We ignore the call timeout check for
WAN operations, similarly as we do for join operations.

Fixes: hazelcast#13301

(cherry picked from commit 6a5bc11)

mmedenjak added a commit that referenced this issue Jan 7, 2019

Ignore call timeout check for WAN operations (#14332) (#14345)
Call timeout check is based on the cluster clock. In the case of WAN,
the difference between the cluster clocks of the source and target
cluster is included in the timeout and when the cluster clocks are too
far apart, the WAN operations might get rejected immediately, thus
causing WAN to fail indefinitely. We ignore the call timeout check for
WAN operations, similarly as we do for join operations.

Fixes: #13301

(cherry picked from commit 6a5bc11)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.