[discovery] multicast discovery doesn't work after loading some data #4721

Closed
pikazlou opened this issue Mar 4, 2015 · 7 comments
@pikazlou commented Mar 4, 2015

I have two nodes with default configs.
If I start one node and then, after some time, start another, everything works.
But if I start one node, load it with some data (5000 entries into one map, 865 KB overall) and then start another node, they can't see each other. Nothing appears in the logs, there are no connection attempts, and the second node starts as if the first node weren't running at all.
I tried increasing multicast-timeout-seconds to 10 seconds - no effect.

Update: I tried switching to TCP/IP discovery - the bug is not reproduced.
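
For reference, the TCP/IP join config I switched to looks roughly like this in hazelcast.xml (a sketch; FIRST_NODE/SECOND_NODE are placeholders for the real member addresses):

    <join>
        <multicast enabled="false"/>
        <tcp-ip enabled="true">
            <!-- placeholder addresses; list the actual cluster members here -->
            <member>FIRST_NODE</member>
            <member>SECOND_NODE</member>
        </tcp-ip>
    </join>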

@pikazlou (Author) commented Mar 4, 2015

My configuration (default from the distribution):

<hazelcast xsi:schemaLocation="http://www.hazelcast.com/schema/config hazelcast-config-3.4.xsd"
           xmlns="http://www.hazelcast.com/schema/config"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <group>
        <name>dev</name>
        <password>dev-pass</password>
    </group>
    <management-center enabled="true">http://localhost:8082/mancenter</management-center>
    <network>
        <port auto-increment="true" port-count="100">5701</port>
        <outbound-ports>
            <!--
            Allowed port range when connecting to other nodes.
            0 or * means use system provided port.
            -->
            <ports>0</ports>
        </outbound-ports>
        <join>
            <multicast enabled="true">
                <multicast-group>224.2.2.3</multicast-group>
                <multicast-port>54327</multicast-port>
            </multicast>
            <tcp-ip enabled="false">
                <interface>127.0.0.1</interface>
            </tcp-ip>
            <aws enabled="false">
                <access-key>my-access-key</access-key>
                <secret-key>my-secret-key</secret-key>
                <!--optional, default is us-east-1 -->
                <region>us-west-1</region>
                <!--optional, default is ec2.amazonaws.com. If set, region shouldn't be set as it will override this property -->
                <host-header>ec2.amazonaws.com</host-header>
                <!-- optional, only instances belonging to this group will be discovered, default will try all running instances -->
                <security-group-name>hazelcast-sg</security-group-name>
                <tag-key>type</tag-key>
                <tag-value>hz-nodes</tag-value>
            </aws>
        </join>
        <interfaces enabled="false">
            <interface>10.10.1.*</interface>
        </interfaces>
        <ssl enabled="false"/>
        <socket-interceptor enabled="false"/>
        <symmetric-encryption enabled="false">
            <!--
               encryption algorithm such as
               DES/ECB/PKCS5Padding,
               PBEWithMD5AndDES,
               AES/CBC/PKCS5Padding,
               Blowfish,
               DESede
            -->
            <algorithm>PBEWithMD5AndDES</algorithm>
            <!-- salt value to use when generating the secret key -->
            <salt>thesalt</salt>
            <!-- pass phrase to use when generating the secret key -->
            <password>thepass</password>
            <!-- iteration count to use when generating the secret key -->
            <iteration-count>19</iteration-count>
        </symmetric-encryption>
    </network>
    <partition-group enabled="false"/>
    <executor-service name="default">
        <pool-size>16</pool-size>
        <!--Queue capacity. 0 means Integer.MAX_VALUE.-->
        <queue-capacity>0</queue-capacity>
    </executor-service>
    <queue name="default">
        <!--
            Maximum size of the queue. When a JVM's local queue size reaches the maximum,
            all put/offer operations will get blocked until the queue size
            of the JVM goes down below the maximum.
            Any integer between 0 and Integer.MAX_VALUE. 0 means
            Integer.MAX_VALUE. Default is 0.
        -->
        <max-size>0</max-size>
        <!--
            Number of backups. If 1 is set as the backup-count for example,
            then all entries of the map will be copied to another JVM for
            fail-safety. 0 means no backup.
        -->
        <backup-count>1</backup-count>

        <!--
            Number of async backups. 0 means no backup.
        -->
        <async-backup-count>0</async-backup-count>

        <empty-queue-ttl>-1</empty-queue-ttl>
    </queue>
    <map name="default">
        <!--
           Data type that will be used for storing recordMap.
           Possible values:
           BINARY (default): keys and values will be stored as binary data
           OBJECT : values will be stored in their object forms
           NATIVE : values will be stored in non-heap region of JVM
        -->
        <in-memory-format>BINARY</in-memory-format>

        <!--
            Number of backups. If 1 is set as the backup-count for example,
            then all entries of the map will be copied to another JVM for
            fail-safety. 0 means no backup.
        -->
        <backup-count>1</backup-count>
        <!--
            Number of async backups. 0 means no backup.
        -->
        <async-backup-count>0</async-backup-count>
        <!--
            Maximum number of seconds for each entry to stay in the map. Entries that are
            older than <time-to-live-seconds> and not updated for <time-to-live-seconds>
            will get automatically evicted from the map.
            Any integer between 0 and Integer.MAX_VALUE. 0 means infinite. Default is 0.
        -->
        <time-to-live-seconds>0</time-to-live-seconds>
        <!--
            Maximum number of seconds for each entry to stay idle in the map. Entries that are
            idle(not touched) for more than <max-idle-seconds> will get
            automatically evicted from the map. Entry is touched if get, put or containsKey is called.
            Any integer between 0 and Integer.MAX_VALUE. 0 means infinite. Default is 0.
        -->
        <max-idle-seconds>0</max-idle-seconds>
        <!--
            Valid values are:
            NONE (no eviction),
            LRU (Least Recently Used),
            LFU (Least Frequently Used).
            NONE is the default.
        -->
        <eviction-policy>NONE</eviction-policy>
        <!--
            Maximum size of the map. When max size is reached,
            map is evicted based on the policy defined.
            Any integer between 0 and Integer.MAX_VALUE. 0 means
            Integer.MAX_VALUE. Default is 0.
        -->
        <max-size policy="PER_NODE">0</max-size>
        <!--
            When max. size is reached, specified percentage of
            the map will be evicted. Any integer between 0 and 100.
            If 25 is set for example, 25% of the entries will
            get evicted.
        -->
        <eviction-percentage>25</eviction-percentage>
        <!--
            Minimum time in milliseconds which should pass before checking
            if a partition of this map is evictable or not.
            Default value is 100 millis.
        -->
        <min-eviction-check-millis>100</min-eviction-check-millis>
        <!--
            While recovering from split-brain (network partitioning),
            map entries in the small cluster will merge into the bigger cluster
            based on the policy set here. When an entry merges into the
            cluster, there might already be an existing entry with the same key.
            Values of these entries might differ for that same key.
            Which value should be set for the key? The conflict is resolved by
            the policy set here. The default policy is PutIfAbsentMapMergePolicy.

            There are built-in merge policies such as:
            com.hazelcast.map.merge.PassThroughMergePolicy; entry will be overwritten if a merging entry exists for the key.
            com.hazelcast.map.merge.PutIfAbsentMapMergePolicy; entry will be added if the merging entry doesn't exist in the cluster.
            com.hazelcast.map.merge.HigherHitsMapMergePolicy; entry with the higher hits wins.
            com.hazelcast.map.merge.LatestUpdateMapMergePolicy; entry with the latest update wins.
        -->
        <merge-policy>com.hazelcast.map.merge.PutIfAbsentMapMergePolicy</merge-policy>

    </map>

    <multimap name="default">
        <backup-count>1</backup-count>
        <value-collection-type>SET</value-collection-type>
    </multimap>

    <list name="default">
        <backup-count>1</backup-count>
    </list>

    <set name="default">
        <backup-count>1</backup-count>
    </set>

    <jobtracker name="default">
        <max-thread-size>0</max-thread-size>
        <!-- Queue size 0 means number of partitions * 2 -->
        <queue-size>0</queue-size>
        <retry-count>0</retry-count>
        <chunk-size>1000</chunk-size>
        <communicate-stats>true</communicate-stats>
        <topology-changed-strategy>CANCEL_RUNNING_OPERATION</topology-changed-strategy>
    </jobtracker>

    <semaphore name="default">
        <initial-permits>0</initial-permits>
        <backup-count>1</backup-count>
        <async-backup-count>0</async-backup-count>
    </semaphore>

    <serialization>
        <portable-version>0</portable-version>
    </serialization>

    <services enable-defaults="true"/>

</hazelcast>
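
For completeness, the multicast-timeout-seconds change I mentioned above would sit inside the multicast element; a sketch (the config pasted above shows the default, without this element):

    <multicast enabled="true">
        <multicast-group>224.2.2.3</multicast-group>
        <multicast-port>54327</multicast-port>
        <!-- the 10-second timeout I tried; the default is 2 seconds -->
        <multicast-timeout-seconds>10</multicast-timeout-seconds>
    </multicast>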
@hasancelik (Contributor) commented Mar 4, 2015

Thanks for reporting your issue. Please share the following information with us to help us resolve it quickly and efficiently.

  • Hazelcast version that you are using (e.g. 3.4, also specify minor release or latest snapshot)
  • Java version (also JVM parameters can be added)
  • Operating System (for Linux kernel version will be helpful)
  • Logs and Stack Traces (if available)
  • Steps to reproduce (detailed description of steps to reproduce your issue)
    • Unit Test (if you can include a unit test which reproduces your issue, we would be grateful)
  • Integration module versions if available (e.g. Tomcat, Jetty, Spring, Hibernate)
    • Detailed configuration information (web.xml, Hibernate configuration, Spring context.xml etc.)
@pikazlou (Author) commented Mar 4, 2015

Hazelcast version: 3.4.1
Hazelcast configuration: same for both nodes, pasted above
Java version: Oracle 1.7.0_75
Operating System: CentOS 6.5, kernel 2.6.32-431.29.2.el6.x86_64
Logs: see below
Steps to reproduce: see description

@pikazlou (Author) commented Mar 4, 2015

Logs (one more detail - I'm connecting to mancenter, which is started on the first node)

First node (real IPs replaced with FIRST_NODE and SECOND_NODE, real client IP replaced with CLIENT_IP)

# ./server.sh 
JAVA_HOME environment variable not available.
Path to Java : /usr/bin/java
########################################
# RUN_JAVA=/usr/bin/java
# JAVA_OPTS=
# starting now....
########################################
Mar 04, 2015 11:28:06 AM com.hazelcast.config.XmlConfigLocator
INFO: Loading 'hazelcast.xml' from working directory.
Mar 04, 2015 11:28:07 AM com.hazelcast.instance.DefaultAddressPicker
INFO: [LOCAL] [dev] [3.4.1] Prefer IPv4 stack is true.
Mar 04, 2015 11:28:07 AM com.hazelcast.instance.DefaultAddressPicker
INFO: [LOCAL] [dev] [3.4.1] Picked Address[FIRST_NODE]:5701, using socket ServerSocket[addr=/0:0:0:0:0:0:0:0,localport=5701], bind any local is true
Mar 04, 2015 11:28:07 AM com.hazelcast.spi.OperationService
INFO: [FIRST_NODE]:5701 [dev] [3.4.1] Backpressure is disabled
Mar 04, 2015 11:28:07 AM com.hazelcast.spi.impl.BasicOperationScheduler
INFO: [FIRST_NODE]:5701 [dev] [3.4.1] Starting with 2 generic operation threads and 4 partition operation threads.
Mar 04, 2015 11:28:07 AM com.hazelcast.system
INFO: [FIRST_NODE]:5701 [dev] [3.4.1] Hazelcast 3.4.1 (20150213 - 780828c) starting at Address[FIRST_NODE]:5701
Mar 04, 2015 11:28:07 AM com.hazelcast.system
INFO: [FIRST_NODE]:5701 [dev] [3.4.1] Copyright (C) 2008-2014 Hazelcast.com
Mar 04, 2015 11:28:07 AM com.hazelcast.instance.Node
INFO: [FIRST_NODE]:5701 [dev] [3.4.1] Creating MulticastJoiner
Mar 04, 2015 11:28:07 AM com.hazelcast.core.LifecycleService
INFO: [FIRST_NODE]:5701 [dev] [3.4.1] Address[FIRST_NODE]:5701 is STARTING
Mar 04, 2015 11:28:13 AM com.hazelcast.cluster.impl.MulticastJoiner
INFO: [FIRST_NODE]:5701 [dev] [3.4.1] 


Members [1] {
    Member [FIRST_NODE]:5701 this
}

Mar 04, 2015 11:28:13 AM com.hazelcast.core.LifecycleService
INFO: [FIRST_NODE]:5701 [dev] [3.4.1] Address[FIRST_NODE]:5701 is STARTED
Mar 04, 2015 11:28:13 AM com.hazelcast.management.ManagementCenterService
INFO: [FIRST_NODE]:5701 [dev] [3.4.1] Hazelcast will connect to Hazelcast Management Center on address: 
http://localhost:8082/mancenter
Mar 04, 2015 11:28:13 AM com.hazelcast.partition.InternalPartitionService
INFO: [FIRST_NODE]:5701 [dev] [3.4.1] Initializing cluster partition table first arrangement...
Mar 04, 2015 11:28:37 AM com.hazelcast.nio.tcp.SocketAcceptor
INFO: [FIRST_NODE]:5701 [dev] [3.4.1] Accepting socket connection from /CLIENT_IP:53443
Mar 04, 2015 11:28:37 AM com.hazelcast.nio.tcp.TcpIpConnectionManager
INFO: [FIRST_NODE]:5701 [dev] [3.4.1] Established socket connection between /FIRST_NODE:5701 and CLIENT_IP/CLIENT_IP:53443
Mar 04, 2015 11:28:37 AM com.hazelcast.client.impl.client.AuthenticationRequest
INFO: [FIRST_NODE]:5701 [dev] [3.4.1] Received auth from Connection [/FIRST_NODE:5701 -> CLIENT_IP/CLIENT_IP:53443], endpoint=null, live=true, type=JAVA_CLIENT, successfully authenticated
Mar 04, 2015 11:28:37 AM com.hazelcast.nio.tcp.SocketAcceptor
INFO: [FIRST_NODE]:5701 [dev] [3.4.1] Accepting socket connection from /CLIENT_IP:53444
Mar 04, 2015 11:28:37 AM com.hazelcast.nio.tcp.TcpIpConnectionManager
INFO: [FIRST_NODE]:5701 [dev] [3.4.1] Established socket connection between /FIRST_NODE:5701 and CLIENT_IP/CLIENT_IP:53444
Mar 04, 2015 11:28:37 AM com.hazelcast.client.impl.client.AuthenticationRequest
INFO: [FIRST_NODE]:5701 [dev] [3.4.1] Received auth from Connection [/FIRST_NODE:5701 -> CLIENT_IP/CLIENT_IP:53444], endpoint=null, live=true, type=JAVA_CLIENT, successfully authenticated

Second node (real IPs replaced with FIRST_NODE and SECOND_NODE)

# ./server.sh 
JAVA_HOME environment variable not available.
Path to Java : /usr/bin/java
########################################
# RUN_JAVA=/usr/bin/java
# JAVA_OPTS=
# starting now....
########################################
Mar 04, 2015 11:34:27 AM com.hazelcast.config.XmlConfigLocator
INFO: Loading 'hazelcast.xml' from working directory.
Mar 04, 2015 11:34:27 AM com.hazelcast.instance.DefaultAddressPicker
INFO: [LOCAL] [dev] [3.4.1] Prefer IPv4 stack is true.
Mar 04, 2015 11:34:27 AM com.hazelcast.instance.DefaultAddressPicker
INFO: [LOCAL] [dev] [3.4.1] Picked Address[SECOND_NODE]:5701, using socket ServerSocket[addr=/0:0:0:0:0:0:0:0,localport=5701], bind any local is true
Mar 04, 2015 11:34:27 AM com.hazelcast.spi.OperationService
INFO: [SECOND_NODE]:5701 [dev] [3.4.1] Backpressure is disabled
Mar 04, 2015 11:34:27 AM com.hazelcast.spi.impl.BasicOperationScheduler
INFO: [SECOND_NODE]:5701 [dev] [3.4.1] Starting with 2 generic operation threads and 4 partition operation threads.
Mar 04, 2015 11:34:27 AM com.hazelcast.system
INFO: [SECOND_NODE]:5701 [dev] [3.4.1] Hazelcast 3.4.1 (20150213 - 780828c) starting at Address[SECOND_NODE]:5701
Mar 04, 2015 11:34:27 AM com.hazelcast.system
INFO: [SECOND_NODE]:5701 [dev] [3.4.1] Copyright (C) 2008-2014 Hazelcast.com
Mar 04, 2015 11:34:27 AM com.hazelcast.instance.Node
INFO: [SECOND_NODE]:5701 [dev] [3.4.1] Creating MulticastJoiner
Mar 04, 2015 11:34:27 AM com.hazelcast.core.LifecycleService
INFO: [SECOND_NODE]:5701 [dev] [3.4.1] Address[SECOND_NODE]:5701 is STARTING
Mar 04, 2015 11:34:33 AM com.hazelcast.cluster.impl.MulticastJoiner
INFO: [SECOND_NODE]:5701 [dev] [3.4.1] 


Members [1] {
    Member [SECOND_NODE]:5701 this
}

Mar 04, 2015 11:34:33 AM com.hazelcast.core.LifecycleService
INFO: [SECOND_NODE]:5701 [dev] [3.4.1] Address[SECOND_NODE]:5701 is STARTED
Mar 04, 2015 11:34:33 AM com.hazelcast.management.ManagementCenterService
INFO: [SECOND_NODE]:5701 [dev] [3.4.1] Hazelcast will connect to Hazelcast Management Center on address: 
http://FIRST_NODE:8082/mancenter
Mar 04, 2015 11:34:33 AM com.hazelcast.partition.InternalPartitionService
INFO: [SECOND_NODE]:5701 [dev] [3.4.1] Initializing cluster partition table first arrangement...
@enesakar removed the PENDING label Nov 2, 2015
@andyrhee commented May 5, 2016

I am experiencing a similar issue with 3.5 and JDK 1.8. Multicast discovery only works if the two nodes are started within a small interval (< 1 min); otherwise the second node fails to find the master node and forms its own cluster.
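
If this is purely a timing race, one knob worth trying (an untested guess on my part, not a confirmed fix) is extending the join wait via properties in hazelcast.xml:

    <properties>
        <!-- untested guess: give the joiner longer to find a master before
             it forms its own cluster (defaults are 5 and 20 seconds) -->
        <property name="hazelcast.wait.seconds.before.join">20</property>
        <property name="hazelcast.max.wait.seconds.before.join">60</property>
    </properties>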

@mmedenjak changed the title multicast discovery doesn't work after loading some data [discovery] multicast discovery doesn't work after loading some data Jul 13, 2017
@mmedenjak added this to the 3.9 milestone Jul 13, 2017
@emrahkocaman self-assigned this Aug 16, 2017
@emrahkocaman (Contributor) commented Aug 16, 2017

@andyrhee @pikazlou Hazelcast 3.4 and 3.5 are not maintained anymore. Could you please give it a try with Hazelcast 3.8.4 to see if the problem still exists?

@ahmetmircik (Member) commented Aug 17, 2017

Please reopen this if it still exists after retrying with 3.8.4.
