
Weird split-brain like issue with Hazelcast 3.8.4 #11533

Closed
dimas opened this issue Oct 7, 2017 · 10 comments
@dimas dimas commented Oct 7, 2017

First, some background: a distributed map in a large cluster running in AWS. Many clients perform the same operation - they invoke an entry processor on entries to modify them, and the result from the entry processor indicates whether the entry is present in the map. If the entry was not there, a new value is computed and stored into the map. So the map is just a cache for our values. Given that re-calculation is quite expensive, we obviously depend on the cache not giving false negatives.
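Roughly, the pattern looks like the sketch below (the map name, the processor and the compute step are simplified illustrations, not our actual code):

import java.util.Map;

import com.hazelcast.core.IMap;
import com.hazelcast.map.AbstractEntryProcessor;

public class CacheLookup {

    // Entry processor that runs on the member owning the key and reports whether the entry exists.
    static class PresenceCheck extends AbstractEntryProcessor<String, String> {
        @Override
        public Object process(Map.Entry<String, String> entry) {
            return entry.getValue() != null;
        }
    }

    static String getOrCompute(IMap<String, String> cache, String key) {
        Boolean present = (Boolean) cache.executeOnKey(key, new PresenceCheck());
        if (present == null || !present) {
            String value = expensiveRecalculation(key); // cache miss: recompute and store
            cache.put(key, value);
            return value;
        }
        return cache.get(key);
    }

    static String expensiveRecalculation(String key) {
        return "computed-" + key; // placeholder for the real, expensive computation
    }
}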

We had an incident with that map which is somewhat difficult to describe. If anything, it looked very much like a split-brain situation.

From Hazelcast logs it looked like a single node (10.0.134.190) lost contact with the master (10.0.128.138) first and then with the rest of the cluster. However, the real mystery is the reaction of the clients. Judging from what happened next, all clients lost visibility of all the entries in the map except for entries that were owned by that node, 10.0.134.190. Because that node initially only had a small fraction of entries, clients had to recompute all the values and put them into the map.

We collect metrics via JMX, and the size metric for the map as reported by 10.0.134.190 dropped to the number of entries on that node itself (suggesting the node believed it was the only member in the cluster). Lots of "partitionId=XXXX is completely lost!" messages logged by 10.0.134.190 support that.
The size reported by all the other members only slightly decreased, by the number of entries owned by 10.0.134.190 (so the rest of the cluster held together and believed it had lost only one member - 10.0.134.190).
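(For reference, those two numbers correspond to what each member reports locally; a minimal sketch of reading them on a member, with "cache" as an illustrative map name:)

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.monitor.LocalMapStats;

public class MapMetrics {
    // Per-member view of the map, matching the JMX attributes we chart.
    static void logMetrics(HazelcastInstance hz) {
        IMap<String, String> map = hz.getMap("cache");
        LocalMapStats stats = map.getLocalMapStats();
        long ownedEntryCount = stats.getOwnedEntryCount(); // entries owned by this member
        int size = map.size();                             // map size as seen from this member's cluster view
        System.out.println("localOwnedEntryCount=" + ownedEntryCount + ", size=" + size);
    }
}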

Then localOwnedEntryCount on 10.0.134.190 started growing rapidly, indicating that all the clients started adding new values only to 10.0.134.190. This is where the real mystery lies: when one member drops out of the cluster, why would clients choose to talk only to it instead of talking to the rest of the cluster? We use "smart" Java clients, so they are normally connected to all members, not just one. My only guess is that 10.0.134.190 somehow told the clients that from its view all other members were dead, and the clients somehow believed that instead of using information from the master and the rest of the cluster, which they had not lost contact with!

That continued for a few minutes and then everything just went back to normal - localOwnedEntryCount on 10.0.134.190 first dropped to zero and then quickly returned to its initial value.

So in the end the cluster kind of recovered, but the damage was already done. Events like this completely defeat the purpose of a cache in Hazelcast when you can lose 90% of your cache in a matter of seconds...

PS: I have the full Hazelcast log of the incident as well as all telemetry collected with JMX. I probably do not want to publish all of it here but can email details to Hazelcast developers if needed.

@mmedenjak mmedenjak added this to the 3.9.1 milestone Oct 9, 2017
@mmedenjak mmedenjak commented Oct 9, 2017

Hi @dimas !

Regarding the members, it definitely looks like one member got separated from the rest of the cluster and declared the other members' data to be lost, while the other members declared the data on 10.0.134.190 to be lost.
For starters, what is your configured backup count for the map (async and sync)?

@dimas dimas commented Oct 9, 2017

We are not using backups so both are zero.
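For reference, the relevant part of our map configuration looks roughly like the sketch below (the map name is illustrative):

import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;

public class MapSetup {
    public static void main(String[] args) {
        Config config = new Config();
        MapConfig mapConfig = new MapConfig("cache");
        mapConfig.setBackupCount(0);      // no synchronous backups
        mapConfig.setAsyncBackupCount(0); // no asynchronous backups
        config.addMapConfig(mapConfig);
    }
}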

@dimas dimas commented Oct 11, 2017

@mmedenjak, anything else of interest there?

@gokhanoner gokhanoner commented Oct 18, 2017

@dimas

From Hazelcast logs it looked like a single node (10.0.134.190) lost contact with the master (10.0.128.138) first and then with the rest of the cluster. However, the real mystery is the reaction of the clients. Judging from what happened next, all clients lost visibility of all the entries in the map except for entries that were owned by that node, 10.0.134.190. Because that node initially only had a small fraction of entries, clients had to recompute all the values and put them into the map.

This indicates that the members other than 10.0.134.190 are isolated as a group and can communicate with each other, while the clients and 10.0.134.190 cannot communicate with them. And since you don't have any backups configured, it's logical that only the data local to 10.0.134.190 is available.

This is where the real mystery lies: when one member drops out of the cluster, why would clients choose to talk only to it instead of talking to the rest of the cluster? We use "smart" Java clients, so they are normally connected to all members, not just one. My only guess is that 10.0.134.190 somehow told the clients that from its view all other members were dead, and the clients somehow believed that instead of using information from the master and the rest of the cluster, which they had not lost contact with!

When the clients connect to the cluster, they try the addresses in their config and, from that list, select an owner connection at random. If you only have 10.0.134.190 in the client XML, then all clients select this node as the owner connection and get cluster-related info from that member. And when a split-brain occurs, 10.0.134.190 informs the clients that it is the only member in the cluster.

What you can do is define a quorum for your maps. Please see: http://docs.hazelcast.org/docs/3.8.6/manual/html-single/index.html#configuring-quorum
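A minimal programmatic sketch (the quorum name, map name and the size of 3 are illustrative; pick a size that suits your cluster):

import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.QuorumConfig;

public class QuorumSetup {
    public static void main(String[] args) {
        Config config = new Config();

        // Reject map operations unless at least 3 members are present in this member's cluster view.
        QuorumConfig quorumConfig = new QuorumConfig("atLeastThreeMembers", true, 3);
        config.addQuorumConfig(quorumConfig);

        // Attach the quorum to the map.
        MapConfig mapConfig = new MapConfig("cache");
        mapConfig.setQuorumName("atLeastThreeMembers");
        config.addMapConfig(mapConfig);
    }
}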

@dimas dimas commented Oct 18, 2017

@gokhanoner , thanks very much for your reply, I had lost hope already :)

This indicates that the members other than 10.0.134.190 are isolated as a group and can communicate with each other, while the clients and 10.0.134.190 cannot communicate with them.

What tells you that the clients lost connectivity to all nodes? I cannot see anything like that in the client logs.
Also, I believe we are using smart clients, which means they are connected to all cluster members, not just one of them.

netstat -nt | grep :5701 | wc -l

on a client machine shows me a connection to each cluster member.

When the clients connect to the cluster, they try the addresses in their config and, from that list, select an owner connection at random. If you only have 10.0.134.190 in the client XML, then all clients select this node as the owner connection and get cluster-related info from that member. And when a split-brain occurs, 10.0.134.190 informs the clients that it is the only member in the cluster.

Interesting. We have a DNS A record with all cluster member IP addresses in it, and the clients are configured with that DNS entry. I assume Hazelcast does not get ALL entries from it and just tries to connect to that host name, so it is down to the DNS resolver to choose which one it will be. After the initial connection, the client learns about all other cluster members and connects to them directly.
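For reference, our client setup is roughly equivalent to the sketch below (the hostname is illustrative; smart routing is the default anyway):

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.core.HazelcastInstance;

public class ClientSetup {
    public static void main(String[] args) {
        ClientConfig clientConfig = new ClientConfig();
        // Single DNS A record that round-robins over all member IPs.
        clientConfig.getNetworkConfig().addAddress("hazelcast.internal:5701");
        // Smart routing makes the client connect to every member directly
        // once it has learned the member list via the owner connection.
        clientConfig.getNetworkConfig().setSmartRouting(true);

        HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);
    }
}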

I personally find it unlikely that all 60 Amazon EC2 machines would resolve the DNS hostname to the same IP address, but I will try investigating that possibility. (If I just ping that hostname from a client machine, it pings a different IP each time.)

Is there a way to learn which member each client considers "the master" ?

@gokhanoner gokhanoner commented Oct 18, 2017

@dimas, when you start the client, it selects a node at random and connects to it, and that node then sends the member list to the client. The client then opens connections to all the other nodes. In this case, even if the client has connections to all members, if 10.0.134.190 is the owner connection and it reports that it is the only member in the cluster, the client believes it and drops the connections to the other members.

We have seen issues with DNS resolution before. Are you using the AWS discovery plugin?

"Owner connection" info is not available as an API, you can get it by casting to internal classes. Or when the client first starts, you can learn form below log:

Oct 18, 2017 3:34:07 PM com.hazelcast.client.connection.ClientConnectionManager
INFO: hz.client_0 [oep-test] [3.8.1] Setting ClientConnection{alive=true, connectionId=1, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/127.0.0.1:57284 remote=/127.0.0.1:5701]}, remoteEndpoint=[192.168.2.56]:5701, lastReadTime=2017-10-18 15:34:07.553, lastWriteTime=2017-10-18 15:34:07.522, closedTime=never, lastHeartbeatRequested=never, lastHeartbeatReceived=never, connected server version=3.8.1} as owner  with principal ClientPrincipal{uuid='13287246-5747-46c1-b128-8acaed081f61', ownerUuid='a792aa20-a74b-4521-bec1-528f69c822de'

ownerUuid is the UUID of that member, and you can check the member list to get the address of that member.
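For example, a small sketch that maps the ownerUuid from that log line to a member address (assuming a client-side HazelcastInstance):

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.Member;

public class OwnerLookup {
    // Find the address of the member whose UUID matches the ownerUuid printed by the client.
    static String ownerAddress(HazelcastInstance client, String ownerUuid) {
        for (Member member : client.getCluster().getMembers()) {
            if (ownerUuid.equals(member.getUuid())) {
                return member.getAddress().toString();
            }
        }
        return null; // owner is not in the current member list
    }
}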

@dimas dimas commented Oct 18, 2017

Oh, this is all very interesting! I could not find that information in the client logs, but your message made me look more carefully.

  1. We do use AWS discovery, but only for the nodes to find each other. We do not use AWS discovery on the client and just provide it with the DNS name of the broker, which resolves to all member IP addresses.

  2. You are right, the client logs the owner connection when it starts:

com.hazelcast.client.spi.impl.ClusterListenerSupport INFO: hz.client_0 [] [3.8.4] Trying to connect to [10.0.143.80]:5701 as owner member
com.hazelcast.client.connection.ClientConnectionManager INFO: hz.client_0 [] [3.8.4] Setting ClientConnection{alive=true, connectionId=1, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/10.0.146.4:53410 remote=/10.0.143.80:5701]}, remoteEndpoint=[10.0.143.80]:5701, lastReadTime=2017-10-04 22:16:05.745, lastWriteTime=2017-10-04 22:16:05.738, closedTime=never, lastHeartbeatRequested=never, lastHeartbeatReceived=never, connected server version=3.8.4} as owner  with principal ClientPrincipal{uuid='76d40fe1-3cdf-45b9-90b0-12c38392e69e', ownerUuid='3e73e46e-d2b4-425c-a833-da53cbaa7bac'}
  3. However, quick grepping across all the client logs for "as owner with principal" shows that only a fraction of clients connected to 10.0.134.190.

So while I admit the whole concept of an "owner connection" is news to me (and thanks for explaining!), it does not look like all these clients used 10.0.134.190 as the owner...

@sancar sancar commented Nov 1, 2017

@dimas What is the situation with this issue?
First of all, the behaviour you describe seems odd. Clients choosing a different owner should continue to work with the rest of the cluster, not with 10.0.134.190.
If you are sure all clients continued to talk only to 10.0.134.190, then you need to look into the logs of the clients that chose a different owner. What are they printing as the member list when the incident happens, and who is the owner?

@dimas dimas commented Nov 2, 2017

@sancar , hi.
The situation has not changed in any way :)

  1. According to the initial client logs, the initial "owner" connection was different on each broker, which is probably expected given the round-robin nature of our DNS A record pointing to all cluster members.

  2. I cannot see any confirmation in the client logs that they made any other connection the "owner" one. I am not sure whether that information is logged at that moment or what it looks like. I searched the client logs for the word "owner" and could not find anything.

  3. "What are they printing as member list when incident happens and who is the owner" - again, I do not see the member list logged by the clients at all. Is it a DEBUG-level message? We only keep logs at level INFO and above.

However, throughout the incident I see many clients complaining about losing the connection to another node, 10.0.129.134, with errors like:

com.hazelcast.spi.exception.TargetDisconnectedException: Heartbeat timed out to ClientConnection{alive=true, connectionId=22, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/10.0.173.146:45246 remote=/10.0.129.134:5701]}, remoteEndpoint=[10.0.129.134]:5701, lastReadTime=2017-10-25 18:05:21.871, lastWriteTime=2017-10-25 18:05:11.503, closedTime=never, lastHeartbeatRequested=2017-10-25 18:05:54.217, lastHeartbeatReceived=2017-10-25 18:04:36.821, connected server version=3.8.4}
	at com.hazelcast.client.spi.impl.ClientInvocationServiceSupport$CleanResourcesTask.notifyException(ClientInvocationServiceSupport.java:231)
	at com.hazelcast.client.spi.impl.ClientInvocationServiceSupport$CleanResourcesTask.run(ClientInvocationServiceSupport.java:214)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at com.hazelcast.util.executor.LoggingScheduledExecutor$LoggingDelegatingFuture.run(LoggingScheduledExecutor.java:138)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
	at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64)
	at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80)
	at ------ submitted from ------.(Unknown Source)
	at com.hazelcast.client.spi.impl.ClientInvocationFuture.resolveAndThrowIfException(ClientInvocationFuture.java:95)
	at com.hazelcast.client.spi.impl.ClientInvocationFuture.resolveAndThrowIfException(ClientInvocationFuture.java:32)
	at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:155)
	at com.hazelcast.client.spi.ClientProxy.invokeOnPartition(ClientProxy.java:154)
	at com.hazelcast.client.spi.ClientProxy.invoke(ClientProxy.java:148)
	at com.hazelcast.client.proxy.ClientMapProxy.executeOnKeyInternal(ClientMapProxy.java:1257)
	at com.hazelcast.client.proxy.ClientMapProxy.executeOnKey(ClientMapProxy.java:1251)
...

All these messages from the clients refer only to that node - 10.0.129.134.

As I said, I can provide logs for the server and clients.

Cheers

@sancar sancar commented Nov 16, 2017

Hi @dimas ,

This issue is not suitable to follow on GitHub anymore, so I am closing it. Can you post the question in our Google group, referencing this link? You can also share the logs by mail so that we can take a look ourselves.
Also, the member list is printed at INFO level.
