Unable to connect to any address in the config #11194

Closed
abhinavsinha09 opened this issue Aug 23, 2017 · 8 comments

Comments

@abhinavsinha09 commented Aug 23, 2017

Hi All,

I'm using Hazelcast 3.8, and the issue occurs in production after 4-5 hours of continuous execution.
I created 4 IMaps to cache values, and they worked fine for 4-5 hours. After about 5 hours, however, the client could no longer connect to the cache. I also reproduced this in a lower environment using JMeter.

As far as I can tell, it only happens when the client is invoked continuously for about 4 hours. If there are breaks in the load (for example, only front-end users invoking the client, with no delta load or initial load running), it works fine.

I'd appreciate your help.

Error:

Error while creating Hazelcast client: Unable to connect to any address in the config! The following addresses were tried: [localhost/127.0.0.1:5703, localhost/127.0.0.1:5702, localhost/127.0.0.1:5701]

hazelcast.xml is configured as follows:

<?xml version="1.0" encoding="UTF-8"?>
<hazelcast
	xsi:schemaLocation="http://www.hazelcast.com/schema/config hazelcast-config-3.8.xsd"
	xmlns="http://www.hazelcast.com/schema/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

	<properties>
		<property name="hazelcast.health.monitoring.level">OFF</property>
		<property name="hazelcast.logging.type">none</property>
		<property name="hazelcast.property.foo">value</property>
		<property name="hazelcast.health.monitoring.level">OFF</property>
		<property name="hazelcast.client.max.no.heartbeat.seconds">5000</property>
		<property name="hazelcast.client.heartbeat.interval">5000</property>
		<property name="hazelcast.client.heartbeat.timeout">60000</property>
		<property name="hazelcast.client.event.thread.count">5</property>
		<property name="hazelcast.client.event.queue.capacity">1000000</property>
		<property name="hazelcast.client.invocation.timeout.seconds">120</property>
		<property name="hazelcast.operation.priority.generic.thread.count">1</property>
	</properties>
	<management-center enabled="false" />
	<network>
		<port auto-increment="true">5701</port>
		<join>
			<multicast enabled="false">
				<multicast-group>224.2.2.3</multicast-group>
				<multicast-port>54327</multicast-port>
			</multicast>
			<tcp-ip enabled="true">
				<interface>localhost</interface>
			</tcp-ip>
			<aws enabled="false" />
		</join>
		<interfaces enabled="true">
			<interface>127.0.0.*</interface>       
		</interfaces>

	</network>

	<map name="usergroupMap">
		<backup-count>0</backup-count>
		<async-backup-count>1</async-backup-count>
		<time-to-live-seconds>21600</time-to-live-seconds>
		<max-idle-seconds>21600</max-idle-seconds>
		<eviction-policy>NONE</eviction-policy>
		<eviction-percentage>25</eviction-percentage>
		<merge-policy>com.hazelcast.map.merge.PassThroughMergePolicy
		</merge-policy>
	</map>
	<map name="jurisdictionMap">
		<backup-count>0</backup-count>
		<async-backup-count>1</async-backup-count>
		<time-to-live-seconds>21600</time-to-live-seconds>
		<max-idle-seconds>21600</max-idle-seconds>
		<eviction-policy>NONE</eviction-policy>
		<eviction-percentage>25</eviction-percentage>
		<merge-policy>com.hazelcast.map.merge.PassThroughMergePolicy
		</merge-policy>
	</map>
	<map name="restrictedPartyMap">
		<backup-count>0</backup-count>
		<async-backup-count>1</async-backup-count>
		<time-to-live-seconds>21600</time-to-live-seconds>
		<max-idle-seconds>21600</max-idle-seconds>
		<eviction-policy>NONE</eviction-policy>
		<eviction-percentage>25</eviction-percentage>
		<merge-policy>com.hazelcast.map.merge.PassThroughMergePolicy
		</merge-policy>
	</map>
	<map name="restrictedSourceMap">
		<backup-count>0</backup-count>
		<async-backup-count>1</async-backup-count>
		<time-to-live-seconds>21600</time-to-live-seconds>
		<max-idle-seconds>21600</max-idle-seconds>
		<eviction-policy>NONE</eviction-policy>
		<eviction-percentage>25</eviction-percentage>
		<merge-policy>com.hazelcast.map.merge.PassThroughMergePolicy
		</merge-policy>
	</map>

</hazelcast>

Below is from the diagnostic logs:

23-8-2017 13:45:43 Metrics[
                          classloading.loadedClassesCount=42,223
                          classloading.totalLoadedClassesCount=42,223
                          classloading.unloadedClassCount=0
                          client.endpoint.count=675
                          client.endpoint.totalRegistrations=1,140
                          cluster.clock.clusterTimeDiff=0
                          cluster.clock.maxClusterTimeDiff=0
                          event.eventQueueSize=0
                          event.eventsProcessed=576
                          executor.[hz:async].queueSize=0
                          executor.[hz:client-query].queueSize=0
                          executor.[hz:client].queueSize=0
                          executor.[hz:cluster:event].queueSize=0
                          executor.[hz:cluster].queueSize=0
                          executor.[hz:scheduled:cqc:6ef4cb88-dee5-42f1-bdbf-c57e0c4e74e3].queueSize=0
                          executor.[hz:scheduled].queueSize=0
                          executor.[hz:system].queueSize=0
                          file.partition[user.home].freeSpace=34,953,764,864
                          file.partition[user.home].totalSpace=42,949,672,960
                          file.partition[user.home].usableSpace=34,953,764,864
                          gc.majorCount=0
                          gc.majorTime=0
                          gc.minorCount=0
                          gc.minorTime=0
                          gc.unknownCount=41
                          gc.unknownTime=7,411
                          operation.callTimeoutCount=0
                          operation.completedCount=11,950
                          operation.failedBackups=0
                          operation.invocations.backupTimeouts=0
                          operation.invocations.normalTimeouts=0
                          operation.invocations.pending=0
                          operation.invocations.responses[backup]=0
                          operation.invocations.responses[error]=0
                          operation.invocations.responses[missing]=0
                          operation.invocations.responses[normal]=0
                          operation.invocations.responses[timeout]=0
                          operation.operationTimeoutCount=0
                          operation.priorityQueueSize=0
                          operation.queueSize=0
                          operation.responseQueueSize=0
                          operation.retryCount=0
                          os.freeSwapSpaceSize=11,515,752,448
                          os.processCpuLoad=0.03482150961004884
                          os.systemLoadAverage=535.0006103515625
                          os.totalSwapSpaceSize=17,179,869,184
                          proxy.createdCount=1,860
                          proxy.destroyedCount=0
                          runtime.availableProcessors=4
                          runtime.freeMemory=756,657,632
                          runtime.maxMemory=21,021,851,648
                          runtime.totalMemory=1,363,673,088
                          runtime.uptime=1,247,022
                          runtime.usedMemory=607,015,456
                          tcp.connection.acceptedSocketCount=0
                          tcp.connection.activeCount=675
                          tcp.connection.clientCount=675
                          tcp.connection.count=0
                          tcp.connection.textCount=0
                          thread.daemonThreadCount=7,015
                          thread.peakThreadCount=10,547
                          thread.threadCount=10,431
                          thread.totalStartedThreadCount=17,979
                          transactions.commitCount=0
                          transactions.rollbackCount=0
                          transactions.startCount=0]
23-8-2017 13:45:43 SlowOperations[]
23-8-2017 13:45:43 HazelcastInstance[
                          thisAddress=[localhost]:5701
                          isRunning=true
                          isLite=false
                          joined=true
                          nodeState=ACTIVE
                          clusterId=de1a3200-a8f7-49ba-8f91-f090fe41bb1c
                          clusterSize=1
                          isMaster=true
                          masterAddress=[localhost]:5701
                          Members[
                                  [localhost]:5701]]
@pveentjer changed the title from "Haze" to "Unable to connect to any address in the config" on Aug 23, 2017

@pveentjer (Member) commented Aug 23, 2017

Can you run with the following settings on the server?

-Dhazelcast.diagnostics.enabled=true
-Dhazelcast.diagnostics.metric.level=info
-Dhazelcast.diagnostics.invocation.sample.period.seconds=30
-Dhazelcast.diagnostics.pending.invocations.period.seconds=30
-Dhazelcast.diagnostics.slowoperations.period.seconds=30
-Dhazelcast.diagnostics.overloaded.connections.period.seconds=30

And with the following on the client:

-Dhazelcast.diagnostics.enabled=true
-Dhazelcast.diagnostics.metric.level=info
-Dhazelcast.diagnostics.overloaded.connections.period.seconds=30

@pveentjer (Member) commented Aug 23, 2017

client.endpoint.count=675

You have 675 clients connected to a single server?

thread.threadCount=10,431
thread.daemonThreadCount=7,015

You have more than 10k threads (7k of them are daemon threads). That doesn't look very healthy.

runtime.availableProcessors=4

I see you have 4 cores, which is fewer than a modern mobile phone has these days. So I guess you are running in a virtualized environment?
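
To put the quoted metrics side by side, the following is a minimal sketch that parses `name=value` lines in the format of the diagnostics log and derives a threads-per-client ratio. `MetricsRatio` and its methods are hypothetical helpers written for illustration, not part of Hazelcast. With the posted values (10,431 threads, 675 client endpoints) the ratio is roughly 15 threads per connected client, which would be consistent with each load-test client starting its own thread pools inside the same JVM. The thread does not confirm that the clients share the member's JVM, so treat this as a hint only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hazelcast: parses "name=value" metric lines
// as they appear in the diagnostics log and compares thread and client counts.
class MetricsRatio {

    // Parses lines such as "thread.threadCount=10,431" into a name -> value map.
    // Thousands separators are stripped; non-integer values are skipped.
    static Map<String, Long> parse(String metricsText) {
        Map<String, Long> metrics = new LinkedHashMap<>();
        for (String line : metricsText.split("\\R")) {
            int eq = line.lastIndexOf('=');
            if (eq < 0) {
                continue; // header lines such as "Metrics[" carry no value
            }
            String name = line.substring(0, eq).trim();
            String value = line.substring(eq + 1).trim()
                    .replace(",", "")
                    .replace("]", "");
            try {
                metrics.put(name, Long.parseLong(value));
            } catch (NumberFormatException ignored) {
                // e.g. os.processCpuLoad is a fraction, not an integer; skip it
            }
        }
        return metrics;
    }

    // Number of live threads per registered client endpoint.
    static double threadsPerClient(Map<String, Long> metrics) {
        return (double) metrics.get("thread.threadCount")
                / metrics.get("client.endpoint.count");
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "client.endpoint.count=675",
                "thread.threadCount=10,431",
                "thread.daemonThreadCount=7,015",
                "runtime.availableProcessors=4");
        Map<String, Long> m = parse(sample);
        System.out.println("threads per client: " + threadsPerClient(m));
        System.out.println("threads per core:   "
                + (double) m.get("thread.threadCount") / m.get("runtime.availableProcessors"));
    }
}
```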

@abhinavsinha09 (Author) commented Aug 23, 2017

@pveentjer - The diagnostic trace is from the dev environment, not production, and I used JMeter to simulate that many clients.

@pveentjer (Member) commented Aug 23, 2017

You probably have a setup problem, because you don't want to test with that many clients unless you have 675 client machines. Also look at my other comments; the number of threads is ridiculously high.
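
One common way to avoid creating a client per request or per test thread is to share a single instance per JVM. The following is a minimal, library-independent sketch of that idea; `SharedInstance` is a hypothetical helper and the `Object` in the demo is a stand-in for the real client. Whether the reporter's application actually creates one client per call is an assumption that the thread does not confirm.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Generic lazy holder (hypothetical, for illustration): the supplied factory
// runs at most once, and every caller receives the same instance afterwards.
final class SharedInstance<T> {
    private final Supplier<T> factory;
    private volatile T instance;

    SharedInstance(Supplier<T> factory) {
        this.factory = factory;
    }

    T get() {
        T result = instance;
        if (result == null) {
            synchronized (this) {
                result = instance;
                if (result == null) {
                    result = factory.get();
                    instance = result;
                }
            }
        }
        return result;
    }
}

class SharedInstanceDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger created = new AtomicInteger();
        // Stand-in for an expensive client. In a real application the factory
        // might be () -> HazelcastClient.newHazelcastClient(clientConfig).
        SharedInstance<Object> client = new SharedInstance<>(() -> {
            created.incrementAndGet();
            return new Object();
        });

        Thread[] workers = new Thread[50];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(client::get);
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.println("factory calls: " + created.get());
    }
}
```

With this shape, a load test with many worker threads still exercises the cache concurrently, but the member sees one client connection per JVM instead of one per worker.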

@mmedenjak (Contributor) commented Aug 23, 2017

Hi @abhinavsinha09!
As Peter mentioned, the thread and client counts are probably too high. You have a cluster of three members? Do you expect to have up to 2000 clients connected to the cluster? Can you check why the thread count is so high? Are you spawning threads? You can also take a thread dump on each member and analyse where the threads are coming from.

As for the client disconnect, we should take a look at the server logs at the time the issue started to see why the client could not connect to the server. But I would first try the suggestions Peter and I mentioned.
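
As a lightweight complement to a full thread dump, the following sketch counts live threads in the current JVM grouped by name pattern, using only the JDK's `Thread.getAllStackTraces()`. `ThreadCensus` is a hypothetical helper; the assumption is that threads belonging to the same pool share a name that differs only in its numeric parts, so digits are collapsed to `#` before counting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical diagnostic helper: groups thread names by pattern and counts them.
class ThreadCensus {

    // Collapses every run of digits to '#', so "pool-3-thread-17" and
    // "pool-8-thread-2" both map to "pool-#-thread-#".
    static String groupOf(String threadName) {
        return threadName.replaceAll("\\d+", "#");
    }

    static Map<String, Integer> countByGroup(Iterable<String> threadNames) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String name : threadNames) {
            counts.merge(groupOf(name), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            names.add(t.getName());
        }
        // Print the 20 largest groups, biggest first.
        countByGroup(names).entrySet().stream()
                .sorted((a, b) -> Integer.compare(b.getValue(), a.getValue()))
                .limit(20)
                .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}
```

Running this inside the affected JVM (for example from a debug endpoint) shows at a glance which thread families dominate the count.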

@abhinavsinha09 (Author) commented Aug 23, 2017

Thanks @pveentjer @mmedenjak, I'll debug it in a production-like environment and share the details.

@sancar added this to the 3.8.6 milestone on Aug 23, 2017

@abhinavsinha09 (Author) commented Sep 1, 2017

@pveentjer: I figured out the issue that is occurring in production.
Root cause: I've set time-to-live-seconds to 21600 (6 hours). When the cache is automatically refreshed at the 21600-second mark and the client is invoked by multiple threads at the same time, the cache cannot be re-created, which results in this error:

[9/1/17 12:01:06:335 AEST] 000149d4 ClusterListen W com.hazelcast.client.spi.impl.ClusterListenerSupport hz.client_6615 [dev] [3.8] Unable to get alive cluster connection, try in 2998 ms later, attempt 2 of 2.

I was able to reproduce this scenario in a non-prod environment with the following configuration:
Number of threads = 80

Below is the code:
static class UserGroupListener implements EntryEvictedListener<String, String> {

    // Called once per evicted entry; each call rebuilds the whole map.
    @Override
    public void entryEvicted(EntryEvent<String, String> event) {
        try {
            IMap<String, ArrayList<String>> mapping = instance
                    .getMap("usergroupMap");
            createUserGroupIMap(mapping);
        } catch (IOException | NamingException | SQLException e) {
            log.error("[User Group Map - entryEvicted] "
                    + "Error while re-creating maps: " + e.getMessage());
        }
    }
}
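
Since the listener above triggers a rebuild for every evicted entry, many entries expiring together at the TTL boundary could start many overlapping rebuilds. The following is a minimal sketch of one way to collapse such a burst into a single reload: a guard that allows only one reload at a time and enforces a minimum interval between reloads. `ReloadGuard` is a hypothetical helper, not a Hazelcast API, and the thread does not confirm that this resolves the reported error.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongSupplier;

// Hypothetical guard, not a Hazelcast API: runs the reload at most once at a
// time and skips requests that arrive too soon after the previous reload.
final class ReloadGuard {
    private final AtomicBoolean running = new AtomicBoolean(false);
    private final Runnable reload;
    private final long minIntervalMillis;
    private final LongSupplier clock; // injectable so the guard can be tested
    private volatile boolean hasRun;
    private volatile long lastFinished;

    ReloadGuard(Runnable reload, long minIntervalMillis, LongSupplier clock) {
        this.reload = reload;
        this.minIntervalMillis = minIntervalMillis;
        this.clock = clock;
    }

    // Returns true if this call performed the reload, false if it was skipped.
    boolean tryReload() {
        if (!running.compareAndSet(false, true)) {
            return false; // another reload is already in flight
        }
        try {
            if (hasRun && clock.getAsLong() - lastFinished < minIntervalMillis) {
                return false; // a reload finished very recently
            }
            reload.run();
            lastFinished = clock.getAsLong();
            hasRun = true;
            return true;
        } finally {
            running.set(false);
        }
    }
}

class ReloadGuardDemo {
    public static void main(String[] args) {
        AtomicInteger reloads = new AtomicInteger();
        // The Runnable stands in for the real rebuild (createUserGroupIMap above).
        ReloadGuard guard = new ReloadGuard(
                reloads::incrementAndGet, 60_000, System::currentTimeMillis);

        // Simulate a burst of eviction callbacks arriving back to back.
        for (int i = 0; i < 1_000; i++) {
            guard.tryReload();
        }
        System.out.println("reloads performed: " + reloads.get());
    }
}
```

In a listener, `entryEvicted` would then call `guard.tryReload()` instead of rebuilding the map directly, so a burst of eviction events results in one rebuild rather than one per entry.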

@abhinavsinha09 (Author) commented Sep 4, 2017

Can someone please let me know how to fix this issue permanently?
