
Event lost and QueryCache#tryRecover() triggered in a loop even if it is successful and data is not lost #12928

Closed
acieplak opened this issue Apr 19, 2018 · 6 comments

Comments


@acieplak acieplak commented Apr 19, 2018

Hello,
Recently, I've found an issue with QueryCache which might be another edge case of #12572.
Hazelcast version: 3.10-20180416
What I do:

  1. Run one Hazelcast instance
  2. Run a second Hazelcast instance
  3. Submit a task to all members that puts some values into the IMap
  4. Kill the first node and wait some time to let repartitioning finish
  5. Submit a task to all members that modifies the existing entries in the IMap

As a result, I see that Hazelcast keeps trying to recover data even though repartitioning was fine previously (just after killing the first node):

Members {size:1, ver:3} [
Member [192.168.1.38]:5702 - 28203b3d-7f4f-4304-a284-cfa3720b490d this
]

16:25:32,497 INFO MembershipManager:65 - [192.168.1.38]:5702 [dev] [3.10-SNAPSHOT] Mastership is claimed with: MembersView{version=3, members=[MemberInfo{address=[192.168.1.38]:5702, uuid=28203b3d-7f4f-4304-a284-cfa3720b490d, liteMember=false, memberListJoinVersion=2}]}
16:25:32,498 INFO InternalPartitionService:65 - [192.168.1.38]:5702 [dev] [3.10-SNAPSHOT] Fetching most recent partition table! my version: 678
16:25:32,498 INFO InternalPartitionService:65 - [192.168.1.38]:5702 [dev] [3.10-SNAPSHOT] Most recent partition table version: 678
16:25:32,498 INFO TransactionManagerService:65 - [192.168.1.38]:5702 [dev] [3.10-SNAPSHOT] Committing/rolling-back alive transactions of Member [192.168.1.38]:5701 - 655b5d83-c113-4ee5-bdc1-28faea92f405, UUID: 655b5d83-c113-4ee5-bdc1-28faea92f405
16:25:32,530 INFO MigrationManager:65 - [192.168.1.38]:5702 [dev] [3.10-SNAPSHOT] Partition balance is ok, no need to re-partition cluster data...

And the logs after killing the node and submitting the new task:

16:25:53,967 WARN SubscriberAccumulator:50 - Event lost detected for partitionId=183, expectedSequence=4 but foundSequence=2, cacheSize=200
16:25:53,968 WARN MyHazelcast:56 - lost event 67e6177c-fc59-4ba2-a986-fd26cbc3090b from member null partition 263, trying to recover

This appears about every 5 seconds for lots of partitionIds. What is worth mentioning: recovery looks successful, as QueryCache#tryRecover() returns true.

Here is my configuration:

    static HazelcastInstance hazelcastInstance() {
        HazelcastInstance hazelcastInstance = Hazelcast.newHazelcastInstance(configuration());
        addEventLostListener(hazelcastInstance);
        log.info("started hazelcast node");
        return hazelcastInstance;
    }

    private static void addEventLostListener(HazelcastInstance hazelcastInstance) {
        IMap<Long, String> map = hazelcastInstance.getMap(MAP_NAME);
        final QueryCache<Long, String> qc = map.getQueryCache(MAP_NAME);
        String id = qc.addEntryListener(eventListenerFor(qc), false);
        log.info("added event lost listener with {} id", id);
    }

    private static Config configuration() {
        return new Config()
                .setProperty("hazelcast.logging.type", "slf4j")
                .addMapConfig(new MapConfig(MAP_NAME)
                        .setBackupCount(1)
                        .setReadBackupData(true)
                        .addQueryCacheConfig(new QueryCacheConfig()
                                .setName(MAP_NAME)
                                .setPredicateConfig(new PredicateConfig(INSTANCE))
                                .setIncludeValue(true)
                                .setPopulate(true)
                                .setDelaySeconds(0)
                                .setBufferSize(1_000)
                        )
                );
    }

    private static EventLostListener eventListenerFor(QueryCache qc) {
        return event -> {
            log.warn("lost event {} from member {} partition {}, trying to recover", event.getName(), event.getMember(), event.getPartitionId());
            boolean recoveryFailed = !qc.tryRecover();
            if (recoveryFailed) {
                log.error("unable to recover lost event {} from member {} partition {}", event.getName(), event.getMember(), event.getPartitionId());
            }
        };
    }

And my main test method:

    public static void main(String[] args) throws Exception {
        HazelcastInstance hazelcast = hazelcastInstance();
        IMap<Long, String> iMap = hazelcast.getMap(MAP_NAME);
        iMap.clear();

        log.info("create some entries on all members");
        IExecutorService executor = hazelcast.getExecutorService(EXECUTOR_SERVICE);
        Map<Member, Future<Void>> tasks = executor.submitToAllMembers(new PutTask(100));
        for (Future<Void> future : tasks.values()) {
            future.get();
        }
        log.info("entries created, map size is {}", iMap.size());

        log.info("gonna sleep for {}s, please kill secondary node", 15);
        sleep(15_000);

        log.info("about to submit dummy task on map size {}", iMap.size());
        Map<Member, Future<Void>> dummyTasks2 = executor.submitToAllMembers(new ChangeTask("||change-two"));
        for (Future<Void> future : dummyTasks2.values()) {
            future.get();
        }
        log.info("everything done");
    }
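
The PutTask and ChangeTask referenced above are plain serializable callables, roughly like this (an illustrative sketch rather than my exact classes; the member-local instance is injected via HazelcastInstanceAware):

    // Illustrative sketch of the tasks submitted above; the real classes differ only in details.
    static class PutTask implements Callable<Void>, Serializable, HazelcastInstanceAware {
        private final int count;
        private transient HazelcastInstance hazelcast;

        PutTask(int count) {
            this.count = count;
        }

        @Override
        public void setHazelcastInstance(HazelcastInstance hazelcast) {
            this.hazelcast = hazelcast;
        }

        @Override
        public Void call() {
            IMap<Long, String> map = hazelcast.getMap(MAP_NAME);
            for (long i = 0; i < count; i++) {
                map.put(i, "value-" + i); // creates entries -> ADDED events feed the query cache
            }
            return null;
        }
    }

    static class ChangeTask implements Callable<Void>, Serializable, HazelcastInstanceAware {
        private final String suffix;
        private transient HazelcastInstance hazelcast;

        ChangeTask(String suffix) {
            this.suffix = suffix;
        }

        @Override
        public void setHazelcastInstance(HazelcastInstance hazelcast) {
            this.hazelcast = hazelcast;
        }

        @Override
        public Void call() {
            IMap<Long, String> map = hazelcast.getMap(MAP_NAME);
            for (Long key : map.keySet()) {
                map.put(key, map.get(key) + suffix); // updates entries -> UPDATED events feed the query cache
            }
            return null;
        }
    }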

Additional info:
I put 100 objects into the map and have QueryCacheConfig#bufferSize set to 1000. When I use a smaller buffer size and many more objects in the map, I get logs saying that recovery was unsuccessful.

Regards


@ahmetmircik ahmetmircik commented Apr 26, 2018

Hi @acieplak, thanks for the reproducer, I managed to reproduce it.

Actually, in the case of a node kill, we cannot be sure whether or not there was event loss. It can be fine, as in your scenario, but it is also possible to lose events from the internal buffers that feed the query cache if event batching is used (since events have no backups in Hazelcast). From this perspective, due to the uncertainty about event loss, your case seems to be an irrecoverable situation, and as a possible fix you can try to create a new query cache in these situations (roughly as sketched below).
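
For example, something along these lines; this is only a rough sketch (recreateQueryCache is a made-up helper, TruePredicate.INSTANCE is assumed to be the INSTANCE referenced in your PredicateConfig, and reusing the same cache name right after destroy() would need to be verified):

    // Rough sketch of the workaround: instead of calling tryRecover(), drop the broken
    // query cache and build a fresh, fully populated one from the current map contents.
    private static QueryCache<Long, String> recreateQueryCache(HazelcastInstance hazelcastInstance) {
        IMap<Long, String> map = hazelcastInstance.getMap(MAP_NAME);
        QueryCache<Long, String> broken = map.getQueryCache(MAP_NAME);
        if (broken != null) {
            broken.destroy(); // releases the local and remote resources of the broken cache
        }
        // requesting the cache again with an explicit predicate repopulates it from the map
        return map.getQueryCache(MAP_NAME, TruePredicate.INSTANCE, true);
    }

You would call something like this from your EventLostListener instead of qc.tryRecover().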


@ahmetmircik ahmetmircik commented Apr 27, 2018

Closing this one according to the explanation above.


@TomaszGaweda TomaszGaweda commented May 10, 2018

@ahmetmircik Hi, I'll continue the thread from the PR: I don't quite understand why the PR was not merged.

Imagine a situation: a cluster with 10 nodes, each holding a query cache copy of one partitioned cache. Only one node goes down and the whole cluster becomes unstable, because the nodes keep trying to recover events.

Your approach from the PR seems to be OK: if a node stops responding and partition migration occurs, then we also allow the query cache sequence to restart. It will just restart the broken query caches.

Could you please explain why this may break anything? I'm possibly missing something.


@ahmetmircik ahmetmircik commented May 10, 2018

Hi, the current behavior, when a query cache (QC) cannot recover itself, is to stop it. If we solve this problem as in my PR, the QC will continue to work even when there is event loss and will not stop. This is a behavior change; that is what I was trying to explain.


@TomaszGaweda TomaszGaweda commented May 10, 2018

@ahmetmircik Yes, but then you must stop all the nodes, because they have gone "crazy". We are not able to detect that recovery was not successful: in the case presented by @acieplak, tryRecover returns true. So the whole tryRecover mechanism currently doesn't make sense, and users must re-create all query caches on any error.

Isn't it better to allow the QC to continue working? Or, instead of just resetting the accumulator, silently recreate QCs in the Migration Service?

The second approach can be done by users, but it is annoying, and from the point of view of a Coherence user, where the identical case is not a problem and the QC continues working, it's just not user-friendly and not trustworthy: you never know whether it's working or not, as even the listeners can end up in an infinite loop.


@ahmetmircik ahmetmircik commented May 11, 2018

@TomaszGaweda Reopening this issue for the "tryRecover returns true" problem; it seems it was missed before.

Regarding making the QC continue working, I think it's worth rethinking that behavior change for 3.11. We can probably make this change. I will update this issue with the latest status.

Thanks for making this visible.
