Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WAN operations can fail if cluster clocks are different #13301

Closed
sertugkaya opened this issue Jun 14, 2018 · 4 comments · Fixed by #14332
Closed

WAN operations can fail if cluster clocks are different #13301

sertugkaya opened this issue Jun 14, 2018 · 4 comments · Fixed by #14332

Comments

@sertugkaya
Copy link
Contributor

@sertugkaya sertugkaya commented Jun 14, 2018

Difference between cluster clocks of target and sender can completely block WAN communication, thus filling up the WAN queues on the source cluster. Such case can produce this logs below. Maybe we need to avoid checking some operation timeouts in case of WAN operations.

Jun 13, 2018 4:17:49 PM com.hazelcast.enterprise.wan.replication.WanBatchReplication
WARNING: [x.x.x.x]:5701 [groupname] [3.9] WanOperation got rejected before execution due to not starting within the operation-call-timeout of: 60000 ms. Current time: 2018-06-13 16:17:49.221. Start time: 2018-06-13 16:17:49.139. Total elapsed time: 82 ms. Invocation{op=com.hazelcast.enterprise.wan.operation.WanOperation{serviceName='hz:core:wanReplicationService', identityHash=465534748, partitionId=-1, replicaIndex=0, callId=-75401, invocationTime=1528906442889 (2018-06-13 16:14:02.889), waitTimeout=-1, callTimeout=60000}, tryCount=1, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1528906669139, firstInvocationTime='2018-06-13 16:17:49.139', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[x.x.x.x]:5701, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=2078, /x.x.x.x:37543->/x.x.x.x:5701, endpoint=[x.x.x.x]:5701, alive=true, type=MEMBER]}
com.hazelcast.core.OperationTimeoutException: WanOperation got rejected before execution due to not starting within the operation-call-timeout of: 60000 ms. Current time: 2018-06-13 16:17:49.221. Start time: 2018-06-13 16:17:49.139. Total elapsed time: 82 ms. Invocation{op=com.hazelcast.enterprise.wan.operation.WanOperation{serviceName='hz:core:wanReplicationService', identityHash=465534748, partitionId=-1, replicaIndex=0, callId=-75401, invocationTime=1528906442889 (2018-06-13 16:14:02.889), waitTimeout=-1, callTimeout=60000}, tryCount=1, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1528906669139, firstInvocationTime='2018-06-13 16:17:49.139', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[x.x.x.x]:5701, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=2078, /x.x.x.x:37543->/x.x.x.x:5701, endpoint=[x.x.x.x]:5701, alive=true, type=MEMBER]}
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:164)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:104)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrowIfException(InvocationFuture.java:79)
        at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:155)
        at com.hazelcast.spi.impl.AbstractInvocationFuture.join(AbstractInvocationFuture.java:136)
        at com.hazelcast.enterprise.wan.replication.AbstractWanReplication.invokeOnWanTarget(AbstractWanReplication.java:154)
        at com.hazelcast.enterprise.wan.replication.WanBatchReplication$BatchStripedRunnable.run(WanBatchReplication.java:218)
        at com.hazelcast.util.executor.StripedExecutor$Worker.process(StripedExecutor.java:225)
        at com.hazelcast.util.executor.StripedExecutor$Worker.run(StripedExecutor.java:208)
@mmedenjak mmedenjak added this to the 3.12 milestone Aug 27, 2018
@dbrimley dbrimley removed this from the 3.12 milestone Nov 8, 2018
@dbrimley dbrimley added this to the 3.13 milestone Nov 8, 2018
@yozank
Copy link
Contributor

@yozank yozank commented Dec 3, 2018

@dbrimley
Copy link
Member

@dbrimley dbrimley commented Dec 3, 2018

The entire error message is very confusing. Shouldn't we display the cluster clocks time rather than wall clocks?

@mmedenjak
Copy link
Contributor

@mmedenjak mmedenjak commented Dec 3, 2018

@dbrimley yes. Or, we should explicitly say where we are showing the cluster clock and where we are showing the system clock. Here, we are reporting the system clock where the actual timeout check is using the cluster clock.

@yozank yozank removed this from the 3.13 milestone Dec 18, 2018
@yozank yozank added this to the 3.11.2 milestone Dec 18, 2018
@sertugkaya
Copy link
Contributor Author

@sertugkaya sertugkaya commented Dec 19, 2018

As discussed this will be fixed in the next maintenance by the quick fix of exluding WanOperation from timeout checks as JoinOperations are excluded here

mmedenjak added a commit to mmedenjak/hazelcast that referenced this issue Dec 27, 2018
Call timeout check is based on the cluster clock. In the case of WAN,
the difference between the cluster clocks of the source and target
cluster is included in the timeout and when the cluster clocks are too
far apart, the WAN operations might get rejected immediately, thus
causing WAN to fail indefinitely. We ignore the call timeout check for
WAN operations, similarly as we do for join operations.

Fixes: hazelcast#13301
@mmedenjak mmedenjak removed this from the 3.11.2 milestone Dec 27, 2018
@mmedenjak mmedenjak added this to the 3.12 milestone Dec 27, 2018
@mmedenjak mmedenjak removed this from the 3.12 milestone Dec 27, 2018
@mmedenjak mmedenjak added this to the 3.11.2 milestone Dec 27, 2018
mmedenjak added a commit that referenced this issue Jan 3, 2019
Call timeout check is based on the cluster clock. In the case of WAN,
the difference between the cluster clocks of the source and target
cluster is included in the timeout and when the cluster clocks are too
far apart, the WAN operations might get rejected immediately, thus
causing WAN to fail indefinitely. We ignore the call timeout check for
WAN operations, similarly as we do for join operations.

Fixes: #13301
mmedenjak added a commit to mmedenjak/hazelcast that referenced this issue Jan 7, 2019
Call timeout check is based on the cluster clock. In the case of WAN,
the difference between the cluster clocks of the source and target
cluster is included in the timeout and when the cluster clocks are too
far apart, the WAN operations might get rejected immediately, thus
causing WAN to fail indefinitely. We ignore the call timeout check for
WAN operations, similarly as we do for join operations.

Fixes: hazelcast#13301

(cherry picked from commit 6a5bc11)
mmedenjak added a commit that referenced this issue Jan 7, 2019
Call timeout check is based on the cluster clock. In the case of WAN,
the difference between the cluster clocks of the source and target
cluster is included in the timeout and when the cluster clocks are too
far apart, the WAN operations might get rejected immediately, thus
causing WAN to fail indefinitely. We ignore the call timeout check for
WAN operations, similarly as we do for join operations.

Fixes: #13301

(cherry picked from commit 6a5bc11)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

6 participants