
Fix #311. DiscoveryClientNameResolver leak and performance issue. #313

Closed
wants to merge 2 commits

Conversation

wangzw

@wangzw wangzw commented Dec 13, 2019

  1. Move the work of updating the service instance list to DiscoveryClientResolverFactory to reduce the workload on Eureka.
  2. Set the listener of DiscoveryClientNameResolver to null to avoid leaking the Channel.
  3. Add register/unregister listener methods to DiscoveryClientResolverFactory (see the sketch below).

@ST-DDT

Fixes #311
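
For illustration, a minimal sketch of the register/unregister API described in point 3. The class and field names loosely follow the snippets discussed later in this thread, but this is an assumption-laden sketch, not the actual patch:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import io.grpc.NameResolver;

// Sketch only: a factory that keeps one listener set per service name, so the
// instance-list updates can be fetched once and fanned out to all channels.
public class DiscoveryClientResolverFactorySketch {

    private final Map<String, Set<NameResolver.Listener>> listenerMap = new HashMap<>();

    /** Registers a listener for the given service name and triggers a refresh for it. */
    public synchronized void registerListener(String name, NameResolver.Listener listener) {
        listenerMap.computeIfAbsent(name, n -> new HashSet<>()).add(listener);
        refresh(name); // push the currently known instances to the new listener
    }

    /** Unregisters a listener so the factory no longer holds a reference to its channel. */
    public synchronized void unregisterListener(String name, NameResolver.Listener listener) {
        final Set<NameResolver.Listener> listeners = listenerMap.get(name);
        if (listeners != null) {
            listeners.remove(listener);
            if (listeners.isEmpty()) {
                listenerMap.remove(name);
            }
        }
    }

    private void refresh(String name) {
        // resolve the service instances and notify the registered listeners (omitted here)
    }
}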

@wangzw wangzw force-pushed the master branch 2 times, most recently from 4f984c0 to ae51cdd on December 13, 2019 13:12
@ST-DDT ST-DDT added bug Something does not work as expected enhancement A feature request or improvement labels Dec 13, 2019
@ST-DDT ST-DDT added this to the 2.6.2 milestone Dec 13, 2019
@ST-DDT
Collaborator

ST-DDT commented Dec 13, 2019

Do you want a preliminary review/feedback or only after you're done?
(there is no point in reviewing something that is going to change anyway)

@wangzw wangzw changed the title WIP: Fix #311. DiscoveryClientNameResolver leak and performance issue. Fix #311. DiscoveryClientNameResolver leak and performance issue. Dec 14, 2019
@wangzw
Author

wangzw commented Dec 14, 2019

The patch has been deployed to our staging environment for testing and is ready for review. @ST-DDT

@wangzw
Author

wangzw commented Dec 15, 2019

(Screenshot 2019-12-15 21:45:32)

Heap usage before and after this patch. Object leak has been fixed.

@ST-DDT
Collaborator

ST-DDT commented Dec 15, 2019

Looks good.
I will review it tomorrow.

@ST-DDT ST-DDT self-assigned this Dec 16, 2019
Collaborator

@ST-DDT ST-DDT left a comment


All in all a very good improvement; there are just a few things I would like to address.

@ST-DDT
Collaborator

ST-DDT commented Dec 16, 2019

There might be a rare case where listeners don't receive their first (second) update.
If serviceInstanceList is not empty, then there won't be a force update.
In combination with a running resolve (that stores the previous listenerList), it will omit the new listener and thus exclude it from being updated with the new config. The needsToUpdateConnections check further excludes it from ever receiving that update until the server list changes again.

@ST-DDT
Collaborator

ST-DDT commented Dec 17, 2019

I created a patch with some fixes/changes/ideas.
HashDataInc/grpc-spring-boot-starter@master...ST-DDT:pr/313

Unfortunately I forgot to commit them one by one, so the commit looks somewhat messy.
Not tested yet.

@wangzw
Author

wangzw commented Dec 17, 2019

There might be a rare case where listeners don't receive their first (second) update.

Is there any reason why a listener would not receive its first (second) update? If there is such a case, the current implementation has the same issue.

A new listener is notified when it registers, or it is put into the list that is passed to the Resolver before a force refresh, so it will be notified with either an error or a success at the force refresh. I do not think there is a case where the update is missed.

@wangzw
Author

wangzw commented Dec 17, 2019

I created a patch with some fixes/changes/ideas.
HashDataInc/grpc-spring-boot-starter@master...ST-DDT:pr/313

Unfortunately I forgot to commit them one by one, so the commit looks somewhat messy.
Not tested yet.

I have reviewed your patch and it improves things a lot, but I have one concern.

You notify all listeners inside the synchronized block, and I am not sure whether listeners are notified asynchronously or not. In our system there may be thousands of listeners or even more. What a listener does when it is notified is outside the resolver's control; a listener implementation may do heavy work or block on a lock in its callback, which would cause a performance issue. That's why I make a copy of the listener list and pass it into the Resolver.
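
For illustration, the pattern being described: take a snapshot of the listener set while holding the lock and run the potentially slow callbacks outside of it. A sketch only; the surrounding class, the listenerMap field, and the imports are assumed:

// Sketch: snapshot the listeners under the lock, notify them outside of it, so a
// slow listener callback cannot block other threads on the factory's monitor.
private void notifyListeners(String name, List<EquivalentAddressGroup> targets) {
    final Set<Listener> snapshot;
    synchronized (this) {
        snapshot = new HashSet<>(listenerMap.getOrDefault(name, Collections.emptySet()));
    }
    for (final Listener listener : snapshot) {
        listener.onAddresses(targets, Attributes.EMPTY); // may be arbitrarily slow
    }
}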

@ST-DDT
Collaborator

ST-DDT commented Dec 17, 2019

There might be a rare case where listeners don't receive their first (second) update.

Is there any reason why a listener would not receive its first (second) update? If there is such a case, the current implementation has the same issue.

The issue is that cache updates aren't distributed to newly registered listeners (a race condition).

I will try to explain the issue using code:

public final synchronized void registerListener(String name, Listener listener) { // name = "service", listener = listener2
    listenerMap.computeIfAbsent(name, n -> Sets.newHashSet()).add(listener); // listener2 gets added to [listener1]

    List<ServiceInstance> instances = serviceInstanceMap.get(name); // instances = ["10.0.0.1"]

    if (instances != null) { // true
        List<EquivalentAddressGroup> targets = convert(name, instances); // targets = ["10.0.0.1"]

        if (targets.isEmpty()) { // false
            // serviceInstanceMap.remove(name);
        } else {
            listener.onAddresses(targets, Attributes.EMPTY); // Addresses set to ["10.0.0.1"]
        }
    }

    refresh(name);
}
public final synchronized void refresh(String name) { // name = "service"
    Future<List<ServiceInstance>> future = discoverClientTasks.get(name); // future = future12345 (running)

    // no resolver is running with this service name.
    if (CollectionUtils.isEmpty(listenerMap.get(name))) { // false
        // return;
    }

    // exit when resolving but not a force refresh
    if (resolving(future) && !forceRefresh(name)) { // resolving && no force refresh => true
        return; // Exit
    }

    [...]
}

Next refresh interval:

public final synchronized void refresh(String name) { // name = "service"
    Future<List<ServiceInstance>> future = discoverClientTasks.get(name); // future = future12345 (done)

    // no resolver is running with this service name.
    if (CollectionUtils.isEmpty(listenerMap.get(name))) { // false 
        // return;
    }

    // exit when resolving but not a force refresh
    if (resolving(future) && !forceRefresh(name)) { // false && true => false
        // return;
    }

    // update cached instances when not a force refresh.
    if (!forceRefresh(name)) { // true
        List<ServiceInstance> instances = getResolveResult(future); // instances = ["10.0.0.2"]

        if (instances != KEEP_PREVIOUS) {
            serviceInstanceMap.put(name, instances); // ["10.0.0.2"] replaces ["10.0.0.1"] and listener2 not updated
        }
    }

    discoverClientTasks.put(name,
            executor.submit(new Resolve(name, 
                    Sets.newHashSet(listenerMap.get(name)), // [listener1, listener2]
                    Lists.newArrayList(serviceInstanceMap.get(name)) // ["10.0.0.2"]
            )) // future67890
    ); // future67890 replaces future12345
}
// future67890
// name = "service"
private List<ServiceInstance> resolveInternal() {
    final List<ServiceInstance> newInstanceList = client.getInstances(name); // newInstanceList = ["10.0.0.2"]
    if (CollectionUtils.isEmpty(newInstanceList)) { // false
        [...]
    }
    if (!needsToUpdateConnections(newInstanceList)) { // still ["10.0.0.2"] => no need to update => true
        return KEEP_PREVIOUS; // Exit => listener2 will never be updated with ["10.0.0.2"]
    }
    [...]
}

I hope this explains the issue.

EDIT: On further thought, the issue happens if a listener is registered after the first resolution and the last resolution returned a different result than the cached one.


While writing the above explanation I actually found a more serious problem, but that is easy to fix.

If for some reason the connection to the discovery service fails, then notifyStatus will mark the listener with the error, but it does not reset the cache, which causes future refreshes (successful, but with the old state) to return early at needsToUpdateConnections. So the listeners won't recover from the one-time failure until the target's addresses change.

Fix:

private void notifyStatus(Status status) {
    serviceInstanceMap.put(name, ImmutableList.of()); // <--- Reset cache (needs synchronization)
    for (Listener listener : savedListenerList) {
        listener.onError(status);
    }
}
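
One way the synchronization mentioned in the comment above could look (a sketch under the assumption that the Resolve task is an inner class of the factory and that serviceInstanceMap is guarded by the factory's monitor; this is not part of the posted patch):

private void notifyStatus(Status status) {
    // Reset the cached instances under the factory's lock so that the next successful
    // refresh is not filtered out by needsToUpdateConnections; report the error outside it.
    synchronized (DiscoveryClientResolverFactory.this) {
        serviceInstanceMap.put(name, ImmutableList.of());
    }
    for (final Listener listener : savedListenerList) {
        listener.onError(status);
    }
}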

@ST-DDT
Collaborator

ST-DDT commented Dec 17, 2019

You notify all listeners inside the synchronized block, and I am not sure whether listeners are notified asynchronously or not. In our system there may be thousands of listeners or even more. What a listener does when it is notified is outside the resolver's control; a listener implementation may do heavy work or block on a lock in its callback, which would cause a performance issue.

According to the javadocs:

 /**
   * Receives address updates.
   *
   * <p>All methods are expected to return quickly. // <---
   *
   * @since 1.0.0
   */
  @ExperimentalApi("https://github.com/grpc/grpc-java/issues/1770")
  @ThreadSafe
  public interface Listener {

So it shouldn't take "that" long.

That's why I make a copy of the listener list and pass it into the Resolver.

If we create a copy, then we might run into the above cache issue again (start resolve, register listener, complete resolve; the copy does not contain the newly registered listener and thus it does not get updated).

We could try to avoid that by invoking force refreshes when registering new listeners, but this could cause performance issues as well.

I also considered removing the "nothing has changed" check, but I'm not sure about the performance implications here (maybe this forces all connections to be reopened, which would be mind-blowingly expensive). (See also: grpc/grpc-java#6524)
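
For reference, the kind of "nothing has changed" check under discussion, in simplified form (a sketch only; the real comparison may also look at instance metadata, and the instanceList field is assumed):

// Simplified sketch: only report an update when the freshly resolved instance list
// differs from the previously known one (compared here by host and port only).
private boolean needsToUpdateConnections(final List<ServiceInstance> newInstanceList) {
    if (this.instanceList.size() != newInstanceList.size()) {
        return true;
    }
    for (final ServiceInstance instance : newInstanceList) {
        final boolean known = this.instanceList.stream()
                .anyMatch(old -> old.getHost().equals(instance.getHost())
                        && old.getPort() == instance.getPort());
        if (!known) {
            return true;
        }
    }
    return false;
}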

@ST-DDT
Collaborator

ST-DDT commented Dec 17, 2019

I just got a reply from: grpc/grpc-java#6523 (comment)

The onAddresses() might be removed some time in the near future, so do not use it in your new code.

So we should probably update our code here as well.
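
For context, the replacement API is NameResolver.Listener2, which receives a ResolutionResult instead of the raw address list. A sketch of the equivalent call (not code from this PR; targets and listener2 are assumed to exist in the surrounding code):

// Sketch: notifying a Listener2 instead of calling Listener.onAddresses(...).
final NameResolver.ResolutionResult result = NameResolver.ResolutionResult.newBuilder()
        .setAddresses(targets)           // List<EquivalentAddressGroup>
        .setAttributes(Attributes.EMPTY)
        .build();
listener2.onResult(result);              // replaces listener.onAddresses(targets, Attributes.EMPTY)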

@ST-DDT
Collaborator

ST-DDT commented Dec 19, 2019

@wangzw Do you have time to fix this in the next few days?

@wangzw
Author

wangzw commented Dec 22, 2019

@wangzw Do you have time to fix this in the next few days?

I will do it this week.

@ST-DDT
Collaborator

ST-DDT commented Dec 23, 2019

I updated my branch to also include a unit test

@John24y
Contributor

John24y commented Dec 26, 2019

The memory leak does exist, but the performance issue seems unrelated to DiscoveryClientResolver.
Every time a gRPC channel establishes a connection (it disconnects when idle), it creates a new NameResolver, and DiscoveryClientResolverFactory never releases the NameResolvers it has created, which causes the memory leak. This can be solved simply by passing the NameResolver, at creation time, a Runnable that is invoked on shutdown and cleans up the resolver's record in the factory. Moving the code into the factory is not a good fit.
Heartbeats are handled by the DiscoveryClient, there is only one DiscoveryClient in the whole application, and the heartbeat frequency is determined by configuration, so it has nothing to do with how many NameResolvers there are; querying the service list from the DiscoveryClient is also served from memory. Even if a service has multiple providers, the client actually has only one channel and one NameResolver for that service, so there is no need to group NameResolvers by service name and refresh them together. This whole block of code looks like it changes rather a lot.
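
For illustration, a minimal sketch of the cleanup approach suggested here (the class and field names are made up for the sketch): the factory hands every resolver it creates a Runnable that removes the factory's bookkeeping entry, and the resolver runs it from shutdown().

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: a resolver that invokes a cleanup hook on shutdown, and a factory
// that uses that hook to drop its reference, avoiding the leak described above.
class TrackedResolver {
    private Runnable cleanupHook = () -> { };

    void setCleanupHook(Runnable cleanupHook) {
        this.cleanupHook = cleanupHook;
    }

    void shutdown() {
        // release the listener / stop refresh tasks here, then let the factory forget us
        cleanupHook.run();
    }
}

class TrackingFactory {
    private final Set<TrackedResolver> resolvers = ConcurrentHashMap.newKeySet();

    TrackedResolver newResolver() {
        final TrackedResolver resolver = new TrackedResolver();
        resolver.setCleanupHook(() -> resolvers.remove(resolver));
        resolvers.add(resolver);
        return resolver;
    }
}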

@wangzw
Author

wangzw commented Dec 28, 2019

Even if a service has multiple providers, the client actually has only one channel and one NameResolver for that service, so there is no need to group NameResolvers by service name and refresh them together. This whole block of code looks like it changes rather a lot.

Each channel has a resolver, and each resolver sends requests to Eureka independently, so if there are many channels, Eureka will receive many duplicate requests from different resolvers. Moving DiscoveryClient::getInstances from the resolver to DiscoveryClientResolverFactory, or any other singleton object, will therefore reduce the number of requests significantly.

@wangzw
Author

wangzw commented Dec 28, 2019

@ST-DDT I have merged your patch and made some small modifications. I moved the code that notifies the listeners out of the synchronized block.

Collaborator

@ST-DDT ST-DDT left a comment


Even if a service has multiple providers, the client actually has only one channel and one NameResolver for that service, so there is no need to group NameResolvers by service name and refresh them together. This whole block of code looks like it changes rather a lot.

Each channel has a resolver, and each resolver sends requests to Eureka independently, so if there are many channels, Eureka will receive many duplicate requests from different resolvers. Moving DiscoveryClient::getInstances from the resolver to DiscoveryClientResolverFactory, or any other singleton object, will therefore reduce the number of requests significantly.

Do you have any metrics/data that show how long each call takes? I looked at the sources and it looks like it uses a local cache (inside Eureka's discovery client, so it does not make any HTTP calls at all) that gets asynchronously updated every 30 seconds and can thus return almost instantly.

Comment on lines +425 to +427
for (final Listener2 listener : listeners) {
listener.onResult(result);
}
Collaborator


You need to change this to preserve the if (serviceList != KEEP_PREVIOUS) { logic.
E.g. by adding an else statement above that returns early.
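
A sketch of the suggested shape (serviceList, listeners, and KEEP_PREVIOUS are the names visible in the diff above; the result construction and the convert(...) helper are assumptions for illustration):

// Sketch of the suggested early return:
if (serviceList == KEEP_PREVIOUS) {
    return; // keep the previous addresses and do not notify the listeners again
}
final ResolutionResult result = ResolutionResult.newBuilder()
        .setAddresses(convert(serviceList)) // convert(...) stands in for building the address groups
        .build();
for (final Listener2 listener : listeners) {
    listener.onResult(result);
}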

Author


If serviceList == KEEP_PREVIOUS is true, listeners should be empty, and the loop completes quickly.

@John24y
Contributor

John24y commented Dec 29, 2019

Even if a service has multiple providers, the client actually has only one channel and one NameResolver for that service, so there is no need to group NameResolvers by service name and refresh them together. This whole block of code looks like it changes rather a lot.

Each channel has a resolver, and each resolver sends requests to Eureka independently, so if there are many channels, Eureka will receive many duplicate requests from different resolvers. Moving DiscoveryClient::getInstances from the resolver to DiscoveryClientResolverFactory, or any other singleton object, will therefore reduce the number of requests significantly.

The service list is queried from the local cache.

@wangzw
Author

wangzw commented Dec 31, 2019

Even if a service has multiple providers, the client actually has only one channel and one NameResolver for that service, so there is no need to group NameResolvers by service name and refresh them together. This whole block of code looks like it changes rather a lot.

Each channel has a resolver, and each resolver sends requests to Eureka independently, so if there are many channels, Eureka will receive many duplicate requests from different resolvers. Moving DiscoveryClient::getInstances from the resolver to DiscoveryClientResolverFactory, or any other singleton object, will therefore reduce the number of requests significantly.

Do you have any metrics/data that show how long each call takes? I looked at the sources and it looks like it uses a local cache (inside Eureka's discovery client, so it does not make any HTTP calls at all) that gets asynchronously updated every 30 seconds and can thus return almost instantly.

I see intensive Eureka request messages in the log; I am not sure whether these requests are served from the cache or not.

(Screenshot: untitled)

@ST-DDT
Collaborator

ST-DDT commented Jan 4, 2020

I will wait for a response on grpc/grpc-java#6524 (comment) before possibly merging this PR, as eagerly updating the target addresses could cause connection attempts to all addresses (the behavior is currently not specified by grpc).

@ST-DDT ST-DDT modified the milestones: 2.6.2, 2.7.0 Jan 4, 2020
@ST-DDT ST-DDT modified the milestones: 2.7.0, Future Feb 23, 2020
@ST-DDT
Collaborator

ST-DDT commented Aug 1, 2020

The bug itself has been fixed with #320.
As the remaining changes only affect a non-recommended setup that isn't handled by grpc-java's NameResolvers either, I will close this PR until grpc-java implements grpc/grpc-java#6524.

Thanks for bringing the issue to our attention as well as contributing a lot to solving it.

@ST-DDT ST-DDT closed this Aug 1, 2020
@ST-DDT ST-DDT removed this from the Future milestone Aug 1, 2020