
When minority of members removed from network, read/write failures occur on cluster #12240

Closed
lazerion opened this issue Feb 1, 2018 · 12 comments

Comments

@lazerion (Contributor) commented Feb 1, 2018

Cluster information

Size: 5
Version: Hazelcast 3.10-SNAPSHOT (20180201 - 31dcc3e)
Client Count: 1
Client Docker: FROM openjdk:8-slim (debian)
Member Docker: FROM openjdk:8-jre (debian)
Client log level: DEBUG [slf4j]
Member log level: ALL [jdk]

Steps:

  1. Sanity check: wait for the initial cluster size to reach 5
  2. Begin generating traffic from the client to the cluster
  • 10 threads with Callables, each waiting a configured timeout (200ms / 100ms / 50ms) after every write -> read -> test predicate verification (see the sketch after this list)
  • The client's owner connection is on the majority side
  3. Remove hz-3 and hz-4 from the network
  4. Wait for the cluster size to become 3
  5. Tear down
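
For illustration, a minimal sketch of what each traffic thread does, assuming a Hazelcast client instance is passed in; this is not the actual suite code, and the class name, structure name and timeout handling are hypothetical:

    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IAtomicLong;

    import java.util.concurrent.Callable;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch of the per-thread write -> read -> test loop.
    public class TrafficTaskSketch implements Callable<Boolean> {
        private final HazelcastInstance client;
        private final long timeoutMillis; // 200 ms, 100 ms or 50 ms in the attached runs

        TrafficTaskSketch(HazelcastInstance client, long timeoutMillis) {
            this.client = client;
            this.timeoutMillis = timeoutMillis;
        }

        @Override
        public Boolean call() throws Exception {
            IAtomicLong counter = client.getAtomicLong("default");
            boolean write = counter.incrementAndGet() > 0;  // write step
            boolean read = counter.get() > 0;               // read step
            boolean test = counter.decrementAndGet() == 0;  // test step
            TimeUnit.MILLISECONDS.sleep(timeoutMillis);     // wait before the next round
            return write && read && test;
        }
    }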

We expected no failures during read/write operations against the majority side of the cluster.
However, we occasionally observed read/write failures.
Logs for the client/members and the configurations are attached, organized by client-side timeout. However, no strong correlation was observed between the timeouts and the failure counts.

Failures were observed on the following data structures:

ISemaphore
ISet
MultiMap
IAtomicLong

Archive.zip

@lazerion changed the title from "When minority of members removed from network, read/write failures occurs on cluster" to "When minority of members removed from network, read/write failures occur on cluster" Feb 1, 2018
@mmedenjak (Contributor) commented Feb 2, 2018

Can you share the link to the test?

occasionally we got read/write failures

How do these failures manifest? Does the client get some kind of exception?

Can you also clean up the client logs? I see lots of logs unrelated to the client instance, e.g.

2018-02-01T18:37:14.146058481Z 18:37:14:145 [qtp1201506992-11] DEBUG HttpConnection - HttpConnection@53bdfcee[p=HttpParser{s=START,0 of -1},g=HttpGenerator@74cf1af2{s=START}]=>HttpChannelOverHttp@3f0200
fe{r=1,c=false,a=IDLE,uri=null}<-SocketChannelEndPoint@4506eebd{/172.22.0.1:56050<->/172.22.0.7:4567,OPEN,fill=-,flush=-,to=9/3600000}{io=0/0,kio=0,kro=1}->HttpConnection@53bdfcee[p=HttpParser{s=START,0
 of -1},g=HttpGenerator@74cf1af2{s=START}]=>HttpChannelOverHttp@3f0200fe{r=1,c=false,a=IDLE,uri=null} parsed false HttpParser{s=START,0 of -1}
2018-02-01T18:37:14.146423743Z 18:37:14:146 [qtp1201506992-11] DEBUG HttpConnection - releaseRequestBuffer HttpConnection@53bdfcee[p=HttpParser{s=START,0 of -1},g=HttpGenerator@74cf1af2{s=START}]=>HttpChannelOverHttp@3f0200fe{r=1,c=false,a=IDLE,uri=null}<-SocketChannelEndPoint@4506eebd{/172.22.0.1:56050<->/172.22.0.7:4567,OPEN,fill=-,flush=-,to=10/3600000}{io=0/0,kio=0,kro=1}->HttpConnection@53bdfcee[p=HttpParser{s=START,0 of -1},g=HttpGenerator@74cf1af2{s=START}]=>HttpChannelOverHttp@3f0200fe{r=1,c=false,a=IDLE,uri=null}
2018-02-01T18:37:14.146783361Z 18:37:14:146 [qtp1201506992-11] DEBUG AbstractConnection - fillInterested HttpConnection@53bdfcee[p=HttpParser{s=START,0 of -1},g=HttpGenerator@74cf1af2{s=START}]=>HttpChannelOverHttp@3f0200fe{r=1,c=false,a=IDLE,uri=null}<-SocketChannelEndPoint@4506eebd{/172.22.0.1:56050<->/172.22.0.7:4567,OPEN,fill=-,flush=-,to=10/3600000}{io=0/0,kio=0,kro=1}->HttpConnection@53bdfcee[p=HttpParser{s=START,0 of -1},g=HttpGenerator@74cf1af2{s=START}]=>HttpChannelOverHttp@3f0200fe{r=1,c=false,a=IDLE,uri=null}
2018-02-01T18:37:14.147534295Z 18:37:14:147 [qtp1201506992-11] DEBUG FillInterest - FillInterest@522d6bf0{true,AC.ReadCB@53bdfcee{HttpConnection@53bdfcee[p=HttpParser{s=START,0 of -1},g=HttpGenerator@74cf1af2{s=START}]=>HttpChannelOverHttp@3f0200fe{r=1,c=false,a=IDLE,uri=null}<-SocketChannelEndPoint@4506eebd{/172.22.0.1:56050<->/172.22.0.7:4567,OPEN,fill=FI,flush=-,to=0/3600000}{io=0/0,kio=0,kro=1}->HttpConnection@53bdfcee[p=HttpParser{s=START,0 of -1},g=HttpGenerator@74cf1af2{s=START}]=>HttpChannelOverHttp@3f0200fe{r=1,c=false,a=IDLE,uri=null}}} register AC.ReadCB@53bdfcee{HttpConnection@53bdfcee[p=HttpParser{s=START,0 of -1},g=HttpGenerator@74cf1af2{s=START}]=>HttpChannelOverHttp@3f0200fe{r=1,c=false,a=IDLE,uri=null}<-SocketChannelEndPoint@4506eebd{/172.22.0.1:56050<->/172.22.0.7:4567,OPEN,fill=FI,flush=-,to=1/3600000}{io=0/0,kio=0,kro=1}->HttpConnection@53bdfcee[p=HttpParser{s=START,0 of -1},g=HttpGenerator@74cf1af2{s=START}]=>HttpChannelOverHttp@3f0200fe{r=1,c=false,a=IDLE,uri=null}}
2018-02-01T18:37:14.148006009Z 18:37:14:147 [qtp1201506992-11] DEBUG ChannelEndPoint - changeInterests p=false 0->1 for SocketChannelEndPoint@4506eebd{/172.22.0.1:56050<->/172.22.0.7:4567,OPEN,fill=FI,flush=-,to=1/3600000}{io=0/1,kio=0,kro=1}->HttpConnection@53bdfcee[p=HttpParser{s=START,0 of -1},g=HttpGenerator@74cf1af2{s=START}]=>HttpChannelOverHttp@3f0200fe{r=1,c=false,a=IDLE,uri=null}
@lazerion (Contributor, Author) commented Feb 2, 2018

@mmedenjak thanks for the reply.

The test repo can be found here
Predicates are built in this builder
Predicates are tested in this callable

During the failures the client side does not report an exception; we collect mismatches between the expected and actual predicate results.

For each data structure we built, there are simple predicate functions that we verify in QuorumTask executions. All data structures are called from an executor service, so the predicates are tested concurrently.
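
Roughly, that concurrent execution looks like the following sketch (not the actual QuorumTask code; the pool size, structure names and mismatch counting here are assumptions for illustration):

    import com.hazelcast.client.HazelcastClient;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IAtomicLong;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Hypothetical sketch: predicate-verifying callables submitted to an executor
    // service, with mismatches collected instead of exceptions being thrown.
    public class ConcurrentPredicateSketch {
        public static void main(String[] args) throws Exception {
            HazelcastInstance client = HazelcastClient.newHazelcastClient();
            ExecutorService pool = Executors.newFixedThreadPool(10);

            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (int i = 0; i < 10; i++) {
                IAtomicLong counter = client.getAtomicLong("counter-" + i); // illustrative names
                tasks.add(() -> counter.incrementAndGet() > 0   // write predicate
                        && counter.get() > 0                    // read predicate
                        && counter.decrementAndGet() == 0);     // test predicate
            }

            int mismatches = 0;
            for (Future<Boolean> result : pool.invokeAll(tasks)) {
                if (!result.get()) {
                    mismatches++; // expected/actual mismatch, logged as a warning on the client
                }
            }
            System.out.println("predicate mismatches: " + mismatches);

            pool.shutdown();
            client.shutdown();
        }
    }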

I attached logs for another, newer run with IAtomicLong failures reported on the client side. This time the client logs are cleared of noise and only the com.hazelcast package is at debug level. The members' log properties are the same as before.

predicate-fails-50ms-2.zip

@lazerion (Contributor, Author) commented Feb 12, 2018

Test code repo transferred to https://github.com/hazelcast/hazelcast-qe

@mmedenjak added this to the 3.10 milestone Feb 13, 2018
@mmedenjak (Contributor) commented Feb 14, 2018

So when it says

2018-02-02T08:55:41.951071809Z 08:55:41:950 [pool-1-thread-6] WARN QuorumTask - com.hazelcast.core.IAtomicLong write fails
2018-02-02T08:55:41.952340493Z 08:55:41:951 [pool-1-thread-6] WARN QuorumTask - com.hazelcast.core.IAtomicLong read fails
2018-02-02T08:55:41.953772932Z 08:55:41:953 [pool-1-thread-6] WARN QuorumTask - com.hazelcast.core.IAtomicLong test fails

does it mean that these checks have failed?

return QuorumTask.builder()
                .write(it -> atomicLong.incrementAndGet() > 0)
                .read(it -> atomicLong.get() > 0)
                .test(it -> atomicLong.decrementAndGet() == 0)

And this is using just one atomic long?
final IAtomicLong atomicLong = client.getAtomicLong("default");

Where is the test code?

@lazerion (Contributor, Author) commented Feb 14, 2018

@mmedenjak I added documentation for the suite.

Regarding your questions: yes.

@mmedenjak (Contributor) commented Feb 14, 2018

Where can I see which test exactly is failing in this issue? There are several of them in AcceptanceTest.

@lazerion (Contributor, Author) commented Feb 14, 2018

@mmedenjak you can find it here.

@mmedenjak (Contributor) commented Feb 14, 2018

Can you always add this to the issue description when creating the GitHub issue, so the test code is easy to find for someone analysing the failure?

@metanet (Contributor) commented Mar 7, 2018

I have analyzed this failure. For instance, the following code shows how IAtomicLong is tested:

    private QuorumTask.QuorumTaskBuilder atomicLong() {
        final IAtomicLong atomicLong = client.getAtomicLong("default");
        atomicLong.set(0);
        return QuorumTask.builder()
                .write(it -> atomicLong.incrementAndGet() > 0)
                .read(it -> atomicLong.get() > 0)
                .test(it -> atomicLong.decrementAndGet() == 0)
                .name(IAtomicLong.class.getName());
    }

The client performs the write -> read -> test loop on the cluster and expects all steps to return true all the time, including the period after the split when it talks to the majority sub-cluster. However, our current replication mechanism is asynchronous. If the target partition is not owned by a node that ends up in the majority sub-cluster, this approach can fail because of the mutations in either the write() or the test() step: for example, an incrementAndGet() acknowledged just before the split by an owner that falls into the minority side may not yet have been replicated to the backup that is promoted on the majority side, so a later get() or decrementAndGet() observes a stale value.

For quorum tests, I think we should only test two scenarios:

  • do we avoid getting a quorum exception when the cluster is sufficiently large and we shouldn't get one?
  • do we eventually get a quorum exception when we fall below the required number of nodes? (sketched below)

@lazerion confirmed that we already have tests for both of these scenarios. Any other correctness test is possibly irrelevant for now.
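
For illustration only, a minimal sketch of the second scenario, assuming the Hazelcast 3.x quorum API (QuorumConfig, QuorumType, MapConfig.setQuorumName, QuorumException) and an IMap guarded by a member-count quorum; the quorum name, map name and sleep are illustrative, not taken from the actual suite:

    import com.hazelcast.config.Config;
    import com.hazelcast.config.QuorumConfig;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.quorum.QuorumException;
    import com.hazelcast.quorum.QuorumType;

    // Hypothetical sketch: expect writes to fail with QuorumException once the
    // member count eventually drops below the configured quorum size.
    public class QuorumLossSketch {
        public static void main(String[] args) throws InterruptedException {
            Config config = new Config();
            QuorumConfig quorumConfig = new QuorumConfig("threeMembers", true, 3); // illustrative name
            quorumConfig.setType(QuorumType.READ_WRITE); // guard both reads and writes
            config.addQuorumConfig(quorumConfig);
            config.getMapConfig("default").setQuorumName("threeMembers");

            HazelcastInstance m1 = Hazelcast.newHazelcastInstance(config);
            HazelcastInstance m2 = Hazelcast.newHazelcastInstance(config);
            HazelcastInstance m3 = Hazelcast.newHazelcastInstance(config);

            m1.getMap("default").put("k", "v"); // quorum present: no exception expected

            m3.shutdown();
            m2.shutdown(); // only one member left, below the quorum size

            // Quorum loss is detected eventually, so a real test would poll/assert with a timeout.
            Thread.sleep(5_000);
            try {
                m1.getMap("default").put("k", "v2");
                System.out.println("unexpected: write succeeded without quorum");
            } catch (QuorumException expected) {
                System.out.println("got the expected QuorumException");
            }

            m1.shutdown();
        }
    }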

I suggest two options: we can either close this issue or move it to 3.11. Once we have the Raft implementation behind the atomic data structures, we can write proper correctness tests for them.

cc @jerrinot @mdogan @mmedenjak

@mmedenjak (Contributor) commented Mar 8, 2018

I vote to close it. Once we have our raft impl, we will have a separate test suite for it.

@lazerion can you adapt the tests to work as Basri suggested?

@lazerion (Contributor, Author) commented Mar 8, 2018

@mmedenjak done.

@metanet (Contributor) commented Mar 8, 2018

Closing the issue then. Thx guys

@metanet closed this Mar 8, 2018