
Fix #311. DiscoveryClientNameResolver leak and performance issue. #313

Closed
wants to merge 2 commits

Conversation

wangzw

@wangzw wangzw commented Dec 13, 2019

  1. Move the work of updating the service instance list to DiscoveryClientResolverFactory to reduce the workload on Eureka.
  2. Set the listener of DiscoveryClientNameResolver to null to avoid leaking the Channel.
  3. Add register/unregister listener methods to DiscoveryClientResolverFactory (see the sketch below).

@ST-DDT

Fixes #311
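
For illustration, a minimal sketch of the register/unregister API described in point 3. The class and field names loosely follow the snippets discussed later in this thread, but this is an assumption-laden sketch, not the actual patch:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import io.grpc.NameResolver;

// Sketch only: a factory that keeps one listener set per service name, so the
// instance-list updates can be fetched once and fanned out to all channels.
public class DiscoveryClientResolverFactorySketch {

    private final Map<String, Set<NameResolver.Listener>> listenerMap = new HashMap<>();

    /** Registers a listener for the given service name and triggers a refresh for it. */
    public synchronized void registerListener(String name, NameResolver.Listener listener) {
        listenerMap.computeIfAbsent(name, n -> new HashSet<>()).add(listener);
        refresh(name); // push the currently known instances to the new listener
    }

    /** Unregisters a listener so the factory no longer holds a reference to its channel. */
    public synchronized void unregisterListener(String name, NameResolver.Listener listener) {
        final Set<NameResolver.Listener> listeners = listenerMap.get(name);
        if (listeners != null) {
            listeners.remove(listener);
            if (listeners.isEmpty()) {
                listenerMap.remove(name);
            }
        }
    }

    private void refresh(String name) {
        // resolve the service instances and notify the registered listeners (omitted here)
    }
}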

@wangzw wangzw force-pushed the master branch 2 times, most recently from 4f984c0 to ae51cdd on December 13, 2019 13:12
@ST-DDT ST-DDT added bug Something does not work as expected enhancement A feature request or improvement labels Dec 13, 2019
@ST-DDT ST-DDT added this to the 2.6.2 milestone Dec 13, 2019
@ST-DDT
Collaborator

ST-DDT commented Dec 13, 2019

Do you want a preliminary review/feedback or only after you're done?
(there is no point in reviewing something that is going to change anyway)

@wangzw wangzw changed the title WIP: Fix #311. DiscoveryClientNameResolver leak and performance issue. Fix #311. DiscoveryClientNameResolver leak and performance issue. Dec 14, 2019
@wangzw
Author

wangzw commented Dec 14, 2019

The patch has been deployed to our staging environment for testing and is ready for review. @ST-DDT

@wangzw
Author

wangzw commented Dec 15, 2019

(Screenshot 2019-12-15 21:45:32)

Heap usage before and after this patch. Object leak has been fixed.

@ST-DDT
Collaborator

ST-DDT commented Dec 15, 2019

Looks good.
I will review it tomorrow.

@ST-DDT ST-DDT self-assigned this Dec 16, 2019
Collaborator

@ST-DDT ST-DDT left a comment


All in all a very good improvement; there are just a few things I would like to address.

@ST-DDT
Collaborator

ST-DDT commented Dec 16, 2019

There might be a rare case where listeners don't receive their first (second) update.
If serviceInstanceList is not empty, then there won't be a force update.
In combination with a running resolve (that stores the previous listenerList), it will omit the new listener and thus exclude it from being updated with the new config. The needsToUpdateConnections check further excludes it from ever receiving that update until the server list changes again.

@ST-DDT
Collaborator

ST-DDT commented Dec 17, 2019

I created a patch with some fixes/changes/ideas.
HashDataInc/grpc-spring-boot-starter@master...ST-DDT:pr/313

Unfortunately I forgot to commit them one by one, so the commit looks somewhat messy.
Not tested yet.

@wangzw
Author

wangzw commented Dec 17, 2019

There might be a rare case where listeners don't receive their first (second) update.

Is there any reason why a listener would not receive its first (second) update? If there is such a case, the current implementation has the same issue.

A new listener is notified when it registers, or it is put into the list that is passed to the Resolver before a force refresh, so it will be notified with either an error or a success at the force refresh. I do not think there is a case where the update is missed.

@wangzw
Author

wangzw commented Dec 17, 2019

I created a patch with some fixes/changes/ideas.
HashDataInc/grpc-spring-boot-starter@master...ST-DDT:pr/313

Unfortunately I forgot to commit them one by one, so the commit looks somewhat messy.
Not tested yet.

I have reviewed your patch and it improves things a lot, but I have one concern.

You notify all listeners inside the synchronized block, and I am not sure whether listeners are notified asynchronously or not. In our system there may be thousands of listeners or even more. What a listener does when it is notified is outside the resolver's control; a listener implementation may do heavy work or block on a lock in its callback, which would cause a performance issue. That's why I make a copy of the listener list and pass it into the Resolver.
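
For illustration, the pattern being described: take a snapshot of the listener set while holding the lock and run the potentially slow callbacks outside of it. A sketch only; the surrounding class, the listenerMap field, and the imports are assumed:

// Sketch: snapshot the listeners under the lock, notify them outside of it, so a
// slow listener callback cannot block other threads on the factory's monitor.
private void notifyListeners(String name, List<EquivalentAddressGroup> targets) {
    final Set<Listener> snapshot;
    synchronized (this) {
        snapshot = new HashSet<>(listenerMap.getOrDefault(name, Collections.emptySet()));
    }
    for (final Listener listener : snapshot) {
        listener.onAddresses(targets, Attributes.EMPTY); // may be arbitrarily slow
    }
}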

@ST-DDT
Collaborator

ST-DDT commented Dec 17, 2019

There might be a rare case where listeners don't receive their first (second) update.

Is there any reason why a listener would not receive its first (second) update? If there is such a case, the current implementation has the same issue.

The issue is that cache updates aren't distributed to newly registered listeners (a race condition).

I will try to explain the issue using code:

public final synchronized void registerListener(String name, Listener listener) { // name = "service", listener = listener2
    listenerMap.computeIfAbsent(name, n -> Sets.newHashSet()).add(listener); // listener2 gets added to [listener1]

    List<ServiceInstance> instances = serviceInstanceMap.get(name); // instances = ["10.0.0.1"]

    if (instances != null) { // true
        List<EquivalentAddressGroup> targets = convert(name, instances); // targets = ["10.0.0.1"]

        if (targets.isEmpty()) { // false
            // serviceInstanceMap.remove(name);
        } else {
            listener.onAddresses(targets, Attributes.EMPTY); // Addresses set to ["10.0.0.1"]
        }
    }

    refresh(name);
}
public final synchronized void refresh(String name) { // name = "service"
    Future<List<ServiceInstance>> future = discoverClientTasks.get(name); // future = future12345 (running)

    // no resolver is running with this service name.
    if (CollectionUtils.isEmpty(listenerMap.get(name))) { // false
        // return;
    }

    // exit when resolving but not a force refresh
    if (resolving(future) && !forceRefresh(name)) { // resolving && no force refresh => true
        return; // Exit
    }

    [...]
}

Next refresh interval:

public final synchronized void refresh(String name) { // name = "service"
    Future<List<ServiceInstance>> future = discoverClientTasks.get(name); // future = future12345 (done)

    // no resolver is running with this service name.
    if (CollectionUtils.isEmpty(listenerMap.get(name))) { // false 
        // return;
    }

    // exit when resolving but not a force refresh
    if (resolving(future) && !forceRefresh(name)) { // false && true => false
        // return;
    }

    // update cached instances when not a force refresh.
    if (!forceRefresh(name)) { // true
        List<ServiceInstance> instances = getResolveResult(future); // instances = ["10.0.0.2"]

        if (instances != KEEP_PREVIOUS) {
            serviceInstanceMap.put(name, instances); // ["10.0.0.2"] replaces ["10.0.0.1"] and listener2 not updated
        }
    }

    discoverClientTasks.put(name,
            executor.submit(new Resolve(name, 
                    Sets.newHashSet(listenerMap.get(name)), // [listener1, listener2]
                    Lists.newArrayList(serviceInstanceMap.get(name)) // ["10.0.0.2"]
            )) // future67890
    ); // future67890 replaces future12345
}
// future67890
// name = "service"
private List<ServiceInstance> resolveInternal() {
    final List<ServiceInstance> newInstanceList = client.getInstances(name); // newInstanceList = ["10.0.0.2"]
    if (CollectionUtils.isEmpty(newInstanceList)) { // false
        [...]
    }
    if (!needsToUpdateConnections(newInstanceList)) { // still ["10.0.0.2"] => no need to update => true
        return KEEP_PREVIOUS; // Exit => listener2 will never be updated with ["10.0.0.2"]
    }
    [...]
}

I hope this explains the issue.

EDIT: On further thought, the issue happens if a listener is registered after the first resolution and the last resolution returned a different result than the cached one.


While writing the above explanation I actually found a more serious problem, but that is easy to fix.

If for some reason the connection to the discovery service fails, then notifyStatus will mark the listener with the error, but it does not reset the cache, which causes future refreshes (successful, but with the old state) to return early at needsToUpdateConnections. So the listeners won't recover from the one-time failure until the target's addresses change.

Fix:

private void notifyStatus(Status status) {
    serviceInstanceMap.put(name, ImmutableList.of()); // <--- Reset cache (needs synchronization)
    for (Listener listener : savedListenerList) {
        listener.onError(status);
    }
}
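
One way the synchronization mentioned in the comment above could look (a sketch under the assumption that the Resolve task is an inner class of the factory and that serviceInstanceMap is guarded by the factory's monitor; this is not part of the posted patch):

private void notifyStatus(Status status) {
    // Reset the cached instances under the factory's lock so that the next successful
    // refresh is not filtered out by needsToUpdateConnections; report the error outside it.
    synchronized (DiscoveryClientResolverFactory.this) {
        serviceInstanceMap.put(name, ImmutableList.of());
    }
    for (final Listener listener : savedListenerList) {
        listener.onError(status);
    }
}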

@ST-DDT
Collaborator

ST-DDT commented Dec 17, 2019

You notify all listeners inside the synchronized block, and I am not sure whether listeners are notified asynchronously or not. In our system there may be thousands of listeners or even more. What a listener does when it is notified is outside the resolver's control; a listener implementation may do heavy work or block on a lock in its callback, which would cause a performance issue.

According to the javadocs:

 /**
   * Receives address updates.
   *
   * <p>All methods are expected to return quickly. // <---
   *
   * @since 1.0.0
   */
  @ExperimentalApi("https://github.com/grpc/grpc-java/issues/1770")
  @ThreadSafe
  public interface Listener {

So it shouldn't take "that" long.

That's why I make a copy of the listener list and pass it into the Resolver.

If we create a copy, then we might run into the above cache issue again (start resolve, register listener, complete resolve; the copy does not contain the newly registered listener and thus it does not get updated).

We could try to avoid that by invoking force refreshes when registering new listeners, but this could cause performance issues as well.

I also considered removing the "nothing has changed" check, but I'm not sure about the performance implications here (maybe this forces all connections to be reopened, which would be mind-blowingly expensive). (See also: grpc/grpc-java#6524)
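
For reference, the kind of "nothing has changed" check under discussion, in simplified form (a sketch only; the real comparison may also look at instance metadata, and the instanceList field is assumed):

// Simplified sketch: only report an update when the freshly resolved instance list
// differs from the previously known one (compared here by host and port only).
private boolean needsToUpdateConnections(final List<ServiceInstance> newInstanceList) {
    if (this.instanceList.size() != newInstanceList.size()) {
        return true;
    }
    for (final ServiceInstance instance : newInstanceList) {
        final boolean known = this.instanceList.stream()
                .anyMatch(old -> old.getHost().equals(instance.getHost())
                        && old.getPort() == instance.getPort());
        if (!known) {
            return true;
        }
    }
    return false;
}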

@ST-DDT
Collaborator

ST-DDT commented Dec 17, 2019

I just got a reply from: grpc/grpc-java#6523 (comment)

The onAddresses() might be removed some time in the near future, so do not use it in your new code.

So we should probably update our code here as well.
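
For context, the replacement API is NameResolver.Listener2, which receives a ResolutionResult instead of the raw address list. A sketch of the equivalent call (not code from this PR; targets and listener2 are assumed to exist in the surrounding code):

// Sketch: notifying a Listener2 instead of calling Listener.onAddresses(...).
final NameResolver.ResolutionResult result = NameResolver.ResolutionResult.newBuilder()
        .setAddresses(targets)           // List<EquivalentAddressGroup>
        .setAttributes(Attributes.EMPTY)
        .build();
listener2.onResult(result);              // replaces listener.onAddresses(targets, Attributes.EMPTY)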

@ST-DDT
Collaborator

ST-DDT commented Dec 19, 2019

@wangzw Do you have time to fix this in the next few days?

@wangzw
Author

wangzw commented Dec 22, 2019

@wangzw Do you have time to fix this in the next few days?

I will do it this week.

@ST-DDT
Collaborator

ST-DDT commented Dec 23, 2019

I updated my branch to also include a unit test

@John24y
Contributor

John24y commented Dec 26, 2019

The memory leak does exist, but the performance issue seems unrelated to DiscoveryClientResolver.
Every time a gRPC channel establishes a connection (it disconnects when idle), it creates a new NameResolver, and DiscoveryClientResolverFactory never releases the NameResolvers it has created, which causes the memory leak. This can be solved simply by passing the NameResolver, at creation time, a Runnable that is invoked on shutdown and cleans up the resolver's record in the factory. Moving the code into the factory is not a good fit.
Heartbeats are handled by the DiscoveryClient, there is only one DiscoveryClient in the whole application, and the heartbeat frequency is determined by configuration, so it has nothing to do with how many NameResolvers there are; querying the service list from the DiscoveryClient is also served from memory. Even if a service has multiple providers, the client actually has only one channel and one NameResolver for that service, so there is no need to group NameResolvers by service name and refresh them together. This whole block of code looks like it changes rather a lot.
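
For illustration, a minimal sketch of the cleanup approach suggested here (the class and field names are made up for the sketch): the factory hands every resolver it creates a Runnable that removes the factory's bookkeeping entry, and the resolver runs it from shutdown().

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: a resolver that invokes a cleanup hook on shutdown, and a factory
// that uses that hook to drop its reference, avoiding the leak described above.
class TrackedResolver {
    private Runnable cleanupHook = () -> { };

    void setCleanupHook(Runnable cleanupHook) {
        this.cleanupHook = cleanupHook;
    }

    void shutdown() {
        // release the listener / stop refresh tasks here, then let the factory forget us
        cleanupHook.run();
    }
}

class TrackingFactory {
    private final Set<TrackedResolver> resolvers = ConcurrentHashMap.newKeySet();

    TrackedResolver newResolver() {
        final TrackedResolver resolver = new TrackedResolver();
        resolver.setCleanupHook(() -> resolvers.remove(resolver));
        resolvers.add(resolver);
        return resolver;
    }
}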

@wangzw
Author

wangzw commented Dec 28, 2019

Even if a service has multiple providers, the client actually has only one channel and one NameResolver for that service, so there is no need to group NameResolvers by service name and refresh them together. This whole block of code looks like it changes rather a lot.

Each channel has a resolver, and each resolver sends requests to Eureka independently, so if there are many channels, Eureka will receive many duplicate requests from different resolvers. Moving DiscoveryClient::getInstances from the resolver to DiscoveryClientResolverFactory, or any other singleton object, will therefore reduce the number of requests significantly.

@wangzw
Author

wangzw commented Dec 28, 2019

@ST-DDT I have merged your patch and made some small modifications. I moved the code that notifies the listeners out of the synchronized block.

Collaborator

@ST-DDT ST-DDT left a comment


Even if a service has multiple providers, the client actually has only one channel and one NameResolver for that service, so there is no need to group NameResolvers by service name and refresh them together. This whole block of code looks like it changes rather a lot.

Each channel has a resolver, and each resolver sends requests to Eureka independently, so if there are many channels, Eureka will receive many duplicate requests from different resolvers. Moving DiscoveryClient::getInstances from the resolver to DiscoveryClientResolverFactory, or any other singleton object, will therefore reduce the number of requests significantly.

Do you have any metrics/data that show how long each call takes? I looked at the sources and it looks like it uses a local cache (inside Eureka's discovery client, so it does not make any HTTP calls at all) that gets asynchronously updated every 30 seconds and can thus return almost instantly.

Comment on lines +425 to +427
for (final Listener2 listener : listeners) {
listener.onResult(result);
}
Collaborator


You need to change this to preserve the if (serviceList != KEEP_PREVIOUS) { logic.
E.g. by adding an else statement above that returns early.
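
A sketch of the suggested shape (serviceList, listeners, and KEEP_PREVIOUS are the names visible in the diff above; the result construction and the convert(...) helper are assumptions for illustration):

// Sketch of the suggested early return:
if (serviceList == KEEP_PREVIOUS) {
    return; // keep the previous addresses and do not notify the listeners again
}
final ResolutionResult result = ResolutionResult.newBuilder()
        .setAddresses(convert(serviceList)) // convert(...) stands in for building the address groups
        .build();
for (final Listener2 listener : listeners) {
    listener.onResult(result);
}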

Author


If serviceList == KEEP_PREVIOUS is true, listeners should be empty, and the loop completes quickly.

@John24y
Contributor

John24y commented Dec 29, 2019

Even if a service has multiple providers, the client actually has only one channel and one NameResolver for that service, so there is no need to group NameResolvers by service name and refresh them together. This whole block of code looks like it changes rather a lot.

Each channel has a resolver, and each resolver sends requests to Eureka independently, so if there are many channels, Eureka will receive many duplicate requests from different resolvers. Moving DiscoveryClient::getInstances from the resolver to DiscoveryClientResolverFactory, or any other singleton object, will therefore reduce the number of requests significantly.

The service list is queried from the local cache.

@wangzw
Author

wangzw commented Dec 31, 2019

Even if a service has multiple providers, the client actually has only one channel and one NameResolver for that service, so there is no need to group NameResolvers by service name and refresh them together. This whole block of code looks like it changes rather a lot.

Each channel has a resolver, and each resolver sends requests to Eureka independently, so if there are many channels, Eureka will receive many duplicate requests from different resolvers. Moving DiscoveryClient::getInstances from the resolver to DiscoveryClientResolverFactory, or any other singleton object, will therefore reduce the number of requests significantly.

Do you have any metrics/data that show how long each call takes? I looked at the sources and it looks like it uses a local cache (inside Eureka's discovery client, so it does not make any HTTP calls at all) that gets asynchronously updated every 30 seconds and can thus return almost instantly.

I see intensive Eureka request messages in the log; I am not sure whether these requests are served from the cache or not.

(Screenshot: untitled)

@ST-DDT
Collaborator

ST-DDT commented Jan 4, 2020

I will wait for a response on grpc/grpc-java#6524 (comment) before possibly merging this PR, as eagerly updating the target addresses could cause connection attempts to all addresses (the behavior is currently not specified by grpc).

@ST-DDT ST-DDT modified the milestones: 2.6.2, 2.7.0 Jan 4, 2020
@ST-DDT ST-DDT modified the milestones: 2.7.0, Future Feb 23, 2020
@ST-DDT
Collaborator

ST-DDT commented Aug 1, 2020

The bug itself has been fixed with #320.
As the remaining changes only affect a non-recommended setup that isn't handled by grpc-java's NameResolvers either, I will close this PR until grpc-java implements grpc/grpc-java#6524.

Thanks for bringing the issue to our attention as well as contributing a lot to solving it.

@ST-DDT ST-DDT closed this Aug 1, 2020
@ST-DDT ST-DDT removed this from the Future milestone Aug 1, 2020