Merge after a Split-Brain Network doesn't work #9358

Closed
lvca opened this issue Dec 2, 2016 · 17 comments

@lvca lvca commented Dec 2, 2016

I've experienced a critical problem using just one map with 10 items and 3 nodes. After isolating 1 node, the merge occurs for only some of the keys, not all of them. To be sure, I've also logged the merge calls from my own merge policy, and for some of the keys the merge is never called.

After the merge is completed (Lifecycle=MERGED), the content of the map on the rejoined server is different. I verified this by printing the values after the merge.

@jerrinot jerrinot added this to the 3.8 milestone Dec 2, 2016

@jerrinot jerrinot commented Dec 2, 2016

@lvca: Can you share your configuration? Do you have reading from backup enabled?
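
For context on the question above, here is a minimal sketch of where reading from backups is switched on in a Hazelcast 3.x map configuration. The map name `example-map` is a hypothetical placeholder, not a name from this thread. With `read-backup-data` set to `true`, a member may serve reads from its local backup copy, which could make values look different between members right after a merge.

```xml
<hazelcast xmlns="http://www.hazelcast.com/schema/config">
    <!-- "example-map" is a placeholder name used only for illustration -->
    <map name="example-map">
        <backup-count>1</backup-count>
        <!-- false is the default; true allows reads from local backups -->
        <read-backup-data>true</read-backup-data>
    </map>
</hazelcast>
```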

@jerrinot jerrinot commented Dec 2, 2016

@lvca are you using IMap or ReplicatedMap?

@lvca lvca commented Dec 2, 2016

I'm using IMap. This is my config:

<?xml version="1.0" encoding="UTF-8"?>
<!-- ~ Copyright (c) 2008-2012, Hazel Bilisim Ltd. All Rights Reserved. ~ 
	~ Licensed under the Apache License, Version 2.0 (the "License"); ~ you may 
	not use this file except in compliance with the License. ~ You may obtain 
	a copy of the License at ~ ~ http://www.apache.org/licenses/LICENSE-2.0 ~ 
	~ Unless required by applicable law or agreed to in writing, software ~ distributed 
	under the License is distributed on an "AS IS" BASIS, ~ WITHOUT WARRANTIES 
	OR CONDITIONS OF ANY KIND, either express or implied. ~ See the License for 
	the specific language governing permissions and ~ limitations under the License. -->

<hazelcast
        xsi:schemaLocation="http://www.hazelcast.com/schema/config hazelcast-config-3.3.xsd"
        xmlns="http://www.hazelcast.com/schema/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <group>
        <name>orientdb</name>
        <password>orientdb</password>
    </group>
    <properties>
        <property name="hazelcast.mancenter.enabled">false</property>
        <property name="hazelcast.memcache.enabled">false</property>
        <property name="hazelcast.rest.enabled">false</property>
        <property name="hazelcast.wait.seconds.before.join">2</property>
        <property name="hazelcast.operation.thread.count">1</property>
        <property name="hazelcast.operation.generic.thread.count">1</property>
        <property name="hazelcast.client.event.thread.count">1</property>
        <property name="hazelcast.event.thread.count">1</property>
        <property name="hazelcast.heartbeat.interval.seconds">5</property>
        <property name="hazelcast.max.no.heartbeat.seconds">30</property>
        <property name="hazelcast.icmp.enabled">true</property>
        <property name="hazelcast.icmp.timeout">5000</property>
        <property name="hazelcast.icmp.ttl">3</property>
    </properties>
    <network>
        <port auto-increment="false">2434</port>
        <join>
            <multicast enabled="false">
                <multicast-group>224.2.2.3</multicast-group>
                <multicast-port>2434</multicast-port>
                <multicast-timeout-seconds>1</multicast-timeout-seconds>
            </multicast>
            <tcp-ip enabled="true" connection-timeout-seconds="3">
                <member>127.0.0.1:2435</member>
                <member>127.0.0.1:2436</member>
            </tcp-ip>
        </join>
        <interfaces enabled="true">
            <interface>127.0.0.1</interface>
        </interfaces>
    </network>
</hazelcast>

And at runtime I apply this:

hazelcastConfig.getMapConfig(CONFIG_REGISTEREDNODES).setBackupCount(6);
hazelcastConfig.getMapConfig(OHazelcastDistributedMap.ORIENTDB_MAP).setMergePolicy(OHazelcastMergeStrategy.class.getName());

We have 2 problems here:

  1. the merge strategy is not called for a value
  2. the map's callback is never called (registered with hzMap.addEntryListener(this, true), where this implements EntryAddedListener<String, Object>, EntryRemovedListener<String, Object>, MapClearedListener, EntryUpdatedListener<String, Object>)
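
One way to narrow down the first problem might be to declare the map settings in XML and temporarily swap in one of Hazelcast's built-in merge policies, to see whether every key is then merged. The sketch below assumes Hazelcast 3.x configuration syntax. The map names are hypothetical placeholders, because the values of CONFIG_REGISTEREDNODES and ORIENTDB_MAP are not shown in this thread, and `PutIfAbsentMapMergePolicy` stands in for the custom OHazelcastMergeStrategy purely as a diagnostic.

```xml
<hazelcast xmlns="http://www.hazelcast.com/schema/config">
    <!-- placeholder for the map behind CONFIG_REGISTEREDNODES -->
    <map name="registered-nodes-placeholder">
        <!-- 6 is the maximum backup count Hazelcast accepts -->
        <backup-count>6</backup-count>
    </map>
    <!-- placeholder for the map behind ORIENTDB_MAP -->
    <map name="orientdb-map-placeholder">
        <!-- built-in policy used here only to rule out the custom one -->
        <merge-policy>com.hazelcast.map.merge.PutIfAbsentMapMergePolicy</merge-policy>
    </map>
</hazelcast>
```

If all 10 keys merge with a built-in policy, the custom policy class becomes the more likely suspect. If some keys are still skipped, the problem probably lies elsewhere.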

@lvca lvca commented Dec 5, 2016

Hi guys, testing with a network of 4 nodes, I isolated 2 servers from the network, then reconnected only one of them, and the merge worked. Then I reconnected the 2nd node, and it never merges. This is the log on the merging server:

2016-12-05 13:55:39:629 INFO  [10.0.1.16]:2434 [orientdb] [3.6.5] Accepting socket connection from /10.0.1.19:64685 [SocketAcceptorThread]
2016-12-05 13:55:39:632 INFO  [10.0.1.16]:2434 [orientdb] [3.6.5] Established socket connection between /10.0.1.16:2434 and /10.0.1.19:64685 [TcpIpConnectionManager]

And this is the log on one of the other 3 servers:

2016-12-05 13:58:09:552 INFO  [10.0.1.19]:2434 [orientdb] [3.6.5] Established socket connection between /10.0.1.19:64685 and /10.0.1.16:2434 [TcpIpConnectionManager]
2016-12-05 13:58:09:561 INFO  [10.0.1.19]:2434 [orientdb] [3.6.5] Address[10.0.1.16]:2434 should merge to this node , because : currentDataMemberCount > joinMessage.getDataMemberCount() [3 > 1] [MulticastJoiner]
2016-12-05 14:00:09:669 INFO  [10.0.1.19]:2434 [orientdb] [3.6.5] Address[10.0.1.16]:2434 should merge to this node , because : currentDataMemberCount > joinMessage.getDataMemberCount() [3 > 1] [MulticastJoiner]
2016-12-05 14:02:09:485 INFO  [10.0.1.19]:2434 [orientdb] [3.6.5] Address[10.0.1.16]:2434 should merge to this node , because : currentDataMemberCount > joinMessage.getDataMemberCount() [3 > 1] [MulticastJoiner]

@lvca lvca commented Dec 5, 2016

Any idea why the re-merging of the cluster doesn't work?

@jerrinot jerrinot commented Dec 5, 2016

@lvca: Does 2nd node have addresses of the remaining nodes in its configuration?
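
To illustrate what the question is getting at: with TCP/IP discovery, an isolated member can only find the rest of the cluster through the addresses listed in its own configuration. The sketch below assumes the two addresses that appear in the logs above and the port from the posted config. The other two members are not named in this thread, so they are left as a comment.

```xml
<hazelcast xmlns="http://www.hazelcast.com/schema/config">
    <network>
        <port auto-increment="false">2434</port>
        <join>
            <multicast enabled="false"/>
            <tcp-ip enabled="true" connection-timeout-seconds="3">
                <!-- addresses taken from the log lines in this thread -->
                <member>10.0.1.16:2434</member>
                <member>10.0.1.19:2434</member>
                <!-- the remaining members would be listed here;
                     their addresses are not given in the thread -->
            </tcp-ip>
        </join>
    </network>
</hazelcast>
```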

@lvca lvca commented Dec 5, 2016

I'm using Multicast, and the 2nd node was connected to the other 2 nodes.

@jerrinot jerrinot commented Dec 5, 2016

@lvca: right, I missed the multicast part in the logs. Can you reproduce it with debug logging?

@lvca lvca commented Dec 5, 2016

@jerrinot How can I activate it? I'm testing with direct IP, and it looks like the node is able to rejoin after more than 3 minutes. By the way, how can I speed up the rejoin process? Any additional settings?

@jerrinot jerrinot commented Dec 5, 2016

@lvca: it depends on your logging framework. Hazelcast uses JUL by default.

I tend to use log4j or slf4j/logback. You can switch logging via the hazelcast.logging.type property - just set it to log4j or slf4j and then use config for your logging framework. com.hazelcast set to DEBUG/TRACE is a good start.
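
As a minimal sketch, the logging type can be set in the same `<properties>` section as the configuration posted earlier in this thread. The value `slf4j` is just one of the options mentioned above. It can also be passed as a JVM system property, `-Dhazelcast.logging.type=slf4j`.

```xml
<hazelcast xmlns="http://www.hazelcast.com/schema/config">
    <properties>
        <!-- route Hazelcast logging through slf4j instead of the default JUL -->
        <property name="hazelcast.logging.type">slf4j</property>
    </properties>
</hazelcast>
```

The log level for the `com.hazelcast` logger is then set in the chosen framework's own configuration file, not in the Hazelcast config.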

You can use this property to change the frequency of the split-brain handler.
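
The link target in the comment above is not preserved here, so the exact property is an assumption. In Hazelcast 3.x the split-brain handler's timing is controlled by `hazelcast.merge.first.run.delay.seconds` (default 300) and `hazelcast.merge.next.run.delay.seconds` (default 120), and the two-minute spacing of the "should merge to this node" log lines above is consistent with the second default. A sketch with illustrative values, not recommendations:

```xml
<hazelcast xmlns="http://www.hazelcast.com/schema/config">
    <properties>
        <!-- delay before the first split-brain check after startup (default 300) -->
        <property name="hazelcast.merge.first.run.delay.seconds">30</property>
        <!-- interval between subsequent split-brain checks (default 120) -->
        <property name="hazelcast.merge.next.run.delay.seconds">15</property>
    </properties>
</hazelcast>
```

Lower values make a split cluster attempt to merge sooner, at the cost of more frequent merge checks.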

@jerrinot jerrinot commented Dec 6, 2016

hi @lvca, did you manage to reproduce it with debug logging?

@jerrinot jerrinot commented Dec 12, 2016

hi @lvca, do you have the debug logs you could share?

@lvca lvca commented Dec 12, 2016

Hey @jerrinot, I've used the setting you suggested and now the merge runs faster. I'm testing with direct IP addresses; I had no time to retry with multicast, sorry.

@jerrinot jerrinot commented Dec 12, 2016

@lvca: I understand. Thanks for the heads-up. When you have a moment, I would really appreciate the multicast logs. Happy Graphing! :)

@metanet metanet commented Jan 6, 2017

I think this issue may be related to the cluster merge mechanism, not the map service. Maybe the last node, which could not merge back, still has the master of the bigger side in its member list. For that reason, it may be ignoring split-brain join requests coming from the bigger side.

More logs would be very useful.

@jerrinot jerrinot commented Jan 19, 2017

I am moving this, as we do not have enough logging to conclude anything.
@lvca: Feel free to stop by our Gitter if you manage to reproduce it.

@jerrinot jerrinot modified the milestones: 3.9, 3.8 Jan 19, 2017

@jerrinot jerrinot commented Jul 6, 2017

@lvca: I am closing the issue. Feel free to re-open it when you have additional logs / findings.

@jerrinot jerrinot closed this Jul 6, 2017