Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report of distributed objects present in cluster not showing same values #8743

Closed
cstamas opened this issue Aug 18, 2016 · 5 comments

Comments

Projects
None yet
2 participants
@cstamas
Copy link

commented Aug 18, 2016

I have a cluster with two members:

jvm 1    | 2016-08-18 16:50:24,170+0200 INFO  [hz._hzInstance_1_hzgroup.generic-operation.thread-3] *SYSTEM com.hazelcast.internal.cluster.ClusterService - [192.168.99.1]:5701 [hzgroup] [3.7] 
jvm 1    | 
jvm 1    | Members [2] {
jvm 1    |  Member [192.168.99.1]:5702 - fddb47c2-65ba-4f12-90ad-81111c78e375
jvm 1    |  Member [192.168.99.1]:5701 - 174f8d5f-0e28-40d4-a479-5fbb610c1ab5 this
jvm 1    | }
jvm 1    | 
jvm 1    | 2016-08-18 16:50:26,183+0200 INFO  [jetty-main-1] *SYSTEM com.hazelcast.internal.jmx.ManagementService - [192.168.99.1]:5701 [hzgroup] [3.7] Hazelcast JMX agent enabled.
jvm 1    | 2016-08-18 16:50:26,284+0200 INFO  [jetty-main-1] *SYSTEM com.hazelcast.core.LifecycleService - [192.168.99.1]:5701 [hzgroup] [3.7] [192.168.99.1]:5701 is STARTED

Both of them have JMX agents enabled, monitoring them using JConsole, here is what it tells me:
cluster-jmx
cluster-jmx2

One of the members claims there are no ISemaphore instances in cluster while other lists many of them. Isn't it suppose to have the two member JMX agents show me what cluster have? Actually, what does the cluster have, are those semaphores existing or not?

Hazelcast version: 3.7
Java: Oracle Java8
OS: dev OSX, prod Linux

@cstamas

This comment has been minimized.

Copy link
Author

commented Aug 18, 2016

Um, worse. Seems JMX is right, but cluster is confused...

Current cluster standing:

nodeA
vm 1    | 2016-08-18 17:11:59,270+0200 INFO  [hz._hzInstance_1_hzgroup.IO.thread-Acceptor] *SYSTEM com.hazelcast.nio.tcp.SocketAcceptorThread - [192.168.99.1]:5701 [hzgroup] [3.7] Accepting socket connection from /192.168.99.1:61459
jvm 1    | 2016-08-18 17:11:59,271+0200 INFO  [hz._hzInstance_1_hzgroup.cached.thread-4] *SYSTEM com.hazelcast.nio.tcp.TcpIpConnectionManager - [192.168.99.1]:5701 [hzgroup] [3.7] Established socket connection between /192.168.99.1:5701 and /192.168.99.1:61459
jvm 1    | 2016-08-18 17:12:01,101+0200 INFO  [hz-semaphore-remover] *SYSTEM org.sonatype.sisu.locks.HazelcastResourceLockFactory - HZ-Semaphores: removed=1447, unused=1447, local=2439, total=2439
jvm 1    | 2016-08-18 17:12:05,280+0200 INFO  [hz._hzInstance_1_hzgroup.priority-generic-operation.thread-0] *SYSTEM com.hazelcast.internal.cluster.ClusterService - [192.168.99.1]:5701 [hzgroup] [3.7] 
jvm 1    | 
jvm 1    | Members [2] {
jvm 1    |  Member [192.168.99.1]:5701 - 05a96ebb-e428-48de-b2f2-eafab09c7e92 this
jvm 1    |  Member [192.168.99.1]:5702 - 5300615d-627f-4e70-a3d7-bf097bab1d77
jvm 1    | }
jvm 1    | 
jvm 1    | 2016-08-18 17:12:05,432+0200 INFO  [hz._hzInstance_1_hzgroup.cached.thread-4] *SYSTEM com.hazelcast.internal.partition.InternalPartitionService - [192.168.99.1]:5701 [hzgroup] [3.7] Remaining migration tasks in queue => 1
jvm 1    | 2016-08-18 17:12:05,546+0200 INFO  [hz._hzInstance_1_hzgroup.migration] *SYSTEM com.hazelcast.internal.partition.impl.MigrationManager - [192.168.99.1]:5701 [hzgroup] [3.7] Re-partitioning cluster data... Migration queue size: 271
jvm 1    | 2016-08-18 17:12:07,104+0200 INFO  [hz._hzInstance_1_hzgroup.migration] *SYSTEM com.hazelcast.internal.partition.impl.MigrationThread - [192.168.99.1]:5701 [hzgroup] [3.7] All migration tasks have been completed, queues are empty.


nodeB
jvm 1    | 2016-08-18 17:11:58,367+0200 INFO  [jetty-main-1] *SYSTEM com.hazelcast.config.XmlConfigLocator - Loading 'hazelcast.xml' from classpath.
jvm 1    | 2016-08-18 17:11:58,594+0200 INFO  [jetty-main-1] *SYSTEM com.hazelcast.instance.DefaultAddressPicker - [LOCAL] [hzgroup] [3.7] Prefer IPv4 stack is true.
jvm 1    | 2016-08-18 17:11:58,609+0200 INFO  [jetty-main-1] *SYSTEM com.hazelcast.instance.DefaultAddressPicker - [LOCAL] [hzgroup] [3.7] Picked [192.168.99.1]:5702, using socket ServerSocket[addr=/0.0.0.0,localport=5702], bind any local is true
jvm 1    | 2016-08-18 17:11:58,620+0200 INFO  [jetty-main-1] *SYSTEM com.hazelcast.system - [192.168.99.1]:5702 [hzgroup] [3.7] Hazelcast 3.7 (20160817 - 1302600) starting at [192.168.99.1]:5702
jvm 1    | 2016-08-18 17:11:58,620+0200 INFO  [jetty-main-1] *SYSTEM com.hazelcast.system - [192.168.99.1]:5702 [hzgroup] [3.7] Copyright (c) 2008-2016, Hazelcast, Inc. All Rights Reserved.
jvm 1    | 2016-08-18 17:11:58,620+0200 INFO  [jetty-main-1] *SYSTEM com.hazelcast.system - [192.168.99.1]:5702 [hzgroup] [3.7] Configured Hazelcast Serialization version : 1
jvm 1    | 2016-08-18 17:11:58,731+0200 INFO  [jetty-main-1] *SYSTEM com.hazelcast.spi.impl.operationservice.impl.BackpressureRegulator - [192.168.99.1]:5702 [hzgroup] [3.7] Backpressure is disabled
jvm 1    | 2016-08-18 17:11:59,064+0200 INFO  [jetty-main-1] *SYSTEM com.hazelcast.instance.Node - [192.168.99.1]:5702 [hzgroup] [3.7] Creating MulticastJoiner
jvm 1    | 2016-08-18 17:11:59,068+0200 INFO  [jetty-main-1] *SYSTEM com.hazelcast.core.LifecycleService - [192.168.99.1]:5702 [hzgroup] [3.7] [192.168.99.1]:5702 is STARTING
jvm 1    | 2016-08-18 17:11:59,153+0200 INFO  [jetty-main-1] *SYSTEM com.hazelcast.spi.impl.operationexecutor.impl.OperationExecutorImpl - [192.168.99.1]:5702 [hzgroup] [3.7] Starting 8 partition threads
jvm 1    | 2016-08-18 17:11:59,153+0200 INFO  [jetty-main-1] *SYSTEM com.hazelcast.spi.impl.operationexecutor.impl.OperationExecutorImpl - [192.168.99.1]:5702 [hzgroup] [3.7] Starting 5 generic threads (1 dedicated for priority tasks)
jvm 1    | 2016-08-18 17:11:59,157+0200 INFO  [jetty-main-1] *SYSTEM com.hazelcast.nio.tcp.nonblocking.NonBlockingIOThreadingModel - [192.168.99.1]:5702 [hzgroup] [3.7] TcpIpConnectionManager configured with Non Blocking IO-threading model: 3 input threads and 3 output threads
jvm 1    | 2016-08-18 17:11:59,262+0200 INFO  [jetty-main-1] *SYSTEM com.hazelcast.internal.cluster.impl.MulticastJoiner - [192.168.99.1]:5702 [hzgroup] [3.7] Trying to join to discovered node: [192.168.99.1]:5701
jvm 1    | 2016-08-18 17:11:59,268+0200 INFO  [hz._hzInstance_1_hzgroup.cached.thread-2] *SYSTEM com.hazelcast.nio.tcp.InitConnectionTask - [192.168.99.1]:5702 [hzgroup] [3.7] Connecting to /192.168.99.1:5701, timeout: 0, bind-any: true
jvm 1    | 2016-08-18 17:11:59,275+0200 INFO  [hz._hzInstance_1_hzgroup.cached.thread-2] *SYSTEM com.hazelcast.nio.tcp.TcpIpConnectionManager - [192.168.99.1]:5702 [hzgroup] [3.7] Established socket connection between /192.168.99.1:61459 and /192.168.99.1:5701
jvm 1    | 2016-08-18 17:12:05,287+0200 INFO  [hz._hzInstance_1_hzgroup.priority-generic-operation.thread-0] *SYSTEM com.hazelcast.internal.cluster.ClusterService - [192.168.99.1]:5702 [hzgroup] [3.7] 
jvm 1    | 
jvm 1    | Members [2] {
jvm 1    |  Member [192.168.99.1]:5701 - 05a96ebb-e428-48de-b2f2-eafab09c7e92
jvm 1    |  Member [192.168.99.1]:5702 - 5300615d-627f-4e70-a3d7-bf097bab1d77 this
jvm 1    | }
jvm 1    | 
jvm 1    | 2016-08-18 17:12:07,298+0200 INFO  [jetty-main-1] *SYSTEM com.hazelcast.internal.jmx.ManagementService - [192.168.99.1]:5702 [hzgroup] [3.7] Hazelcast JMX agent enabled.
jvm 1    | 2016-08-18 17:12:07,407+0200 INFO  [jetty-main-1] *SYSTEM com.hazelcast.core.LifecycleService - [192.168.99.1]:5702 [hzgroup] [3.7] [192.168.99.1]:5702 is STARTED

Seemingly all fine and dandy, but I extended my logging to report what HazelcastInstance.getDistributedObjects() method report regarding semaphores.

nodeA: total of 1448 semaphores
nodeB: total of 0 semaphores

Code to query distributed objects on both members:

int semaphoreCount = 0;
Collection<DistributedObject> distributedObjects = instance.getDistributedObjects();
for (DistributedObject distributedObject : distributedObjects) {
  if (distributedObject instanceof ISemaphore) {
    semaphoreCount++;
  }
}

There is no exception, error or any peculiarity in the logs.

@cstamas cstamas changed the title JMX report of members inconsistent? Report of distributed objects present in cluster not showing same values Aug 18, 2016

@cstamas

This comment has been minimized.

Copy link
Author

commented Aug 18, 2016

So, this seems it was a split-brain situation but Hazelcast was unaware of it?

As I wrote above:
nodeA report: HZ-Semaphores: local=0, total=1448 (so it saw 1448 “remote” semaphores, not stored on nodeA)
nodeB report: HZ-Semaphores: local=0, total=0 (so it saw NO semaphores in system)

Then I shut down nodeB.

Suddenly on nodeA:
HZ-Semaphores: local=1448, total=1448 (the 1448 semaphore it saw arrived to a local partition)

Code to figure out is DistributedObject local or not:

instance.getPartitionService().getPartition(distributedObject.getPartitionKey()).getOwner().localMember();
@jerrinot

This comment has been minimized.

Copy link
Contributor

commented Aug 18, 2016

@cstamas: many thanks for another great report! now it should be quite easy to figure out what's going on.

@cstamas

This comment has been minimized.

Copy link
Author

commented Aug 18, 2016

So, here is a test that I think reproduces this:

package org.sonatype.sisu.locks;

import java.util.ArrayList;
import java.util.List;

import org.sonatype.sisu.litmus.testsupport.TestSupport;

import com.hazelcast.core.DistributedObject;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ISemaphore;
import com.hazelcast.core.Member;
import com.hazelcast.core.Partition;
import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class HazelcastTest
    extends TestSupport
{
  private final Logger log = LoggerFactory.getLogger(HazelcastTest.class);

  private final List<HazelcastInstance> instances = new ArrayList<>();

  @Test
  public void joinLeaveAndSemaphores() throws InterruptedException {
    instances.add(Hazelcast.newHazelcastInstance());
    instances.add(Hazelcast.newHazelcastInstance());
    final ISemaphore s1 = instances.get(0).getSemaphore("test1");
    final ISemaphore s2 = instances.get(1).getSemaphore("test2");

    report("1");
    Thread.sleep(1000);
    report("2");

    instances.add(Hazelcast.newHazelcastInstance());
    final ISemaphore s3 = instances.get(2).getSemaphore("test1");
    s3.destroy();

    report("3");
    Thread.sleep(1000);
    report("4");

    HazelcastInstance instance = instances.remove(0);
    instance.shutdown();

    report("5");
    Thread.sleep(1000);
    report("6");
  }

  private void report(final String marker) {
    for (HazelcastInstance instance : instances) {
      int total = 0;
      int local = 0;
      for (DistributedObject distributedObject : instance.getDistributedObjects()) {
        if (isSemaphore(distributedObject)) {
          total++;
          if (isLocallyStored(instance, distributedObject)) {
            local++;
          }
        }
      }
      log.info("{}: {}: total={}, local={}", instance.getCluster().getLocalMember().getUuid(), marker, total, local);
    }
  }

  private boolean isSemaphore(final DistributedObject distributedObject) {
    return distributedObject instanceof ISemaphore;
  }

  private boolean isLocallyStored(final HazelcastInstance instance, final DistributedObject distributedObject) {
    Partition partition = instance.getPartitionService().getPartition(distributedObject.getPartitionKey());
    if (partition != null) {
      Member owner = partition.getOwner();
      if (owner != null) {
        return owner.localMember();
      }
    }
    return false; // lie but be on safe side
  }
}

Note: odd and even (so, "1" and "2", "3" and "4") reports are same done twice but with a pause after the odd, to leave some propagation time if needed.

Cleaned up output (from HZ logs, just the test output is left):

37b6e52c-8e76-4da4-a334-c876e9e16a9f: 1: total=1, local=0
3e51adc4-fe80-4606-84f7-9f8fe0165af9: 1: total=2, local=0

37b6e52c-8e76-4da4-a334-c876e9e16a9f: 2: total=2, local=1
3e51adc4-fe80-4606-84f7-9f8fe0165af9: 2: total=2, local=1

37b6e52c-8e76-4da4-a334-c876e9e16a9f: 3: total=1, local=1
3e51adc4-fe80-4606-84f7-9f8fe0165af9: 3: total=1, local=0
ed6730f8-5a33-4671-914d-b4a295d4a51e: 3: total=0, local=0

37b6e52c-8e76-4da4-a334-c876e9e16a9f: 4: total=1, local=1
3e51adc4-fe80-4606-84f7-9f8fe0165af9: 4: total=1, local=0
ed6730f8-5a33-4671-914d-b4a295d4a51e: 4: total=0, local=0

3e51adc4-fe80-4606-84f7-9f8fe0165af9: 5: total=1, local=1
ed6730f8-5a33-4671-914d-b4a295d4a51e: 5: total=0, local=0

3e51adc4-fe80-4606-84f7-9f8fe0165af9: 6: total=1, local=1
ed6730f8-5a33-4671-914d-b4a295d4a51e: 6: total=0, local=0

(Full output is here https://gist.github.com/cstamas/9c39aa74bf0c923d0126b89b54082830)

So, IMHO step 4 introduced this state: one member cluster does not "see" the semaphore in system, while the other two does.

@jerrinot jerrinot self-assigned this Aug 19, 2016

@jerrinot jerrinot added this to the 3.8 milestone Aug 19, 2016

jerrinot added a commit to jerrinot/hazelcast that referenced this issue Aug 19, 2016

Always populate proxy info in post join operations
Reasoning:
Even when a proxy does not need any initialization it has to be send
the other members during merge. Otherwise some objects won't be
visible in `getDistributedObjects()`

Fixed hazelcast#8743

jerrinot added a commit to jerrinot/hazelcast that referenced this issue Aug 19, 2016

Always populate proxy info in post join operations
Reasoning:
Even when a proxy does not need any initialization it has to be send
the other members during merge. Otherwise some objects won't be
visible in `getDistributedObjects()`

Fixes hazelcast#8743

@jerrinot jerrinot added the Team: Core label Aug 19, 2016

jerrinot added a commit to jerrinot/hazelcast that referenced this issue Aug 19, 2016

Always populate proxy info in post join operations
Reasoning:
Even when a proxy does not need any initialization it has to be send
the other members during merge. Otherwise some objects won't be
visible in `getDistributedObjects()`

Fixes hazelcast#8743
@jerrinot

This comment has been minimized.

Copy link
Contributor

commented Aug 20, 2016

fix candidate: #8752

jerrinot added a commit to jerrinot/hazelcast that referenced this issue Aug 23, 2016

Always populate proxy info in post join operations
Reasoning:
Even when a proxy does not need any initialization it has to be send
the other members during merge. Otherwise some objects won't be
visible in `getDistributedObjects()`

Fixes hazelcast#8743

jerrinot added a commit to jerrinot/hazelcast that referenced this issue Dec 2, 2016

Always populate proxy info in post join operations
Reasoning:
Even when a proxy does not need any initialization it has to be send
the other members during merge. Otherwise some objects won't be
visible in `getDistributedObjects()`

Fixes hazelcast#8743

jerrinot added a commit to jerrinot/hazelcast that referenced this issue Dec 2, 2016

Always populate proxy info in post join operations
Reasoning:
Even when a proxy does not need any initialization it has to be send
the other members during merge. Otherwise some objects won't be
visible in `getDistributedObjects()`

Fixes hazelcast#8743
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.