fix(sub-channels): clear the cleanup interval when all channels are unrefed. #1083

JrSchild · 2019-10-24T09:52:01Z

This forces a cleanup of subchannels whenever a channel is closed and disables the background task if there are no more refed subchannels. This ensures that if all clients using grpc are closed, there are no background tasks running.

Co-authored-by: Natan Sągol <m@merlinnot.com>

linux-foundation-easycla · 2019-10-24T09:52:04Z

The committers are authorized under a signed CLA.

✅ Joram Ruitenschild (20cbfc7, 93f8169, 5f271de, 821c9ab)
✅ Natan Sągol (e51a740, 61d7e77, 0353dbf)

thelinuxfoundation · 2019-10-24T09:52:07Z

Thank you for your pull request. Before we can look at your contribution, we need to ensure all contributors are covered by a Contributor License Agreement.

After the following items are addressed, please respond with a new comment here, and the automated system will re-verify.

User @JrSchild isn't covered by a CLA. They will need to complete the form at https://identity.linuxfoundation.org/projects/cncf

Regards,
CLA GitHub bot

JrSchild · 2019-10-24T09:59:51Z

I signed it

murgatroid99 · 2019-10-24T17:02:23Z

Have you confirmed that this problem is not just a Jest bug? My own tests always terminate quickly without even explicitly closing channels.

merlinnot · 2019-10-24T17:52:44Z

There might be a bit of an overstatement here: it probably won't fix #1083. However, it will still fix one of the issues we're experiencing.

This setInterval won't prevent the process from exiting. However, it will keep running and consuming memory (preventing garbage collection of many objects, as it keeps a reference to a subchannel pool) if:

you don't use gRPC anymore, but don't exit the process.
you import the file multiple times, clearing require cache - which is a common practice in tests
you run it in more interesting scenarios, like using multiple Node.js VMs in one process.
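
The require-cache scenario can be illustrated with a minimal sketch. It does not use grpc-js: it assumes a stand-in module (written to a temporary file, with the made-up name `fake-lib.js` and a made-up `__timers` global) that starts an unrefed interval at load time, the way a module-level cleanup timer would.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical stand-in for a library that starts a timer when it is loaded.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'require-cache-demo-'));
const file = path.join(dir, 'fake-lib.js');
fs.writeFileSync(
  file,
  [
    'globalThis.__timers = globalThis.__timers || [];',
    '// Module-level interval: unrefed, so it never keeps the process alive.',
    'globalThis.__timers.push(setInterval(() => {}, 10000).unref());',
  ].join('\n'),
);

const loadFresh = () => {
  // Dropping the cache entry makes the next require() re-execute the module.
  delete require.cache[require.resolve(file)];
  require(file);
};

loadFresh();
loadFresh();
loadFresh();

// Each load left its own pending interval behind.
const pendingTimers = globalThis.__timers.length;
console.log(`Pending intervals after 3 loads: ${pendingTimers}`);

// Tidy up so the demo itself leaves nothing behind.
globalThis.__timers.forEach(timer => clearInterval(timer));
fs.rmSync(dir, { recursive: true, force: true });
```

Every reload starts another interval, and each one keeps whatever its callback references alive until it is cleared.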

We're trying to introduce some gradual changes to help resolve issues with the library we are facing at the moment. I hope that this change is small enough to be reviewed with confidence and given the examples above you'll see a point of incorporating it into the library.

We tested it on our internal projects too and it seems to work perfectly fine.

murgatroid99 · 2019-10-24T19:28:12Z

Clearing the require cache is something I hadn't considered. I'm OK with turning the timer on and off based on whether there is at least one subchannel in the pool, but I don't like forcing the pool to clear out every time a channel is closed. Part of the point here is to allow subchannel reuse to improve connection time, including when a channel is closed and then a new one is opened. If all of the channels are closed, then the subchannel pool should still get completely cleared out the next time the timer runs.

merlinnot · 2019-10-24T19:39:21Z

Great. To align on this, so me and my colleagues can improve the PR tomorrow:

The expectation is that instead of trying to unref as much as possible and check if we unrefed everything, we should first check if we can unref everything and do so only if that's the case?

murgatroid99 · 2019-10-24T19:51:16Z

I'm not suggesting any logic change. I'm suggesting using the existing logic, but turning the timer on if there is a subchannel in the pool and off when the last subchannel is removed from the pool. A natural consequence of this is that if there are no open channels, every subchannel will be removed and then the timer will be disabled.

We can also talk about changing the REF_CHECK_INTERVAL number. The current choice was arbitrary. A lower number cleans everything up faster, but also uses more CPU.

merlinnot · 2019-10-24T20:45:05Z

I'm not sure if that would solve the issue. Let me clarify my expectations first, so we're on the same page:

If all gRPC clients are closed (using the close method), all resources are freed immediately, synchronously.

Do you agree that it should be the case?

murgatroid99 · 2019-10-24T22:19:57Z

No, I do not agree with that. It is beneficial to have subchannels stick around for a brief time after the last channel closes in case a new channel is created.

merlinnot · 2019-10-25T16:23:20Z

Could you share the use cases that you have in mind that would benefit from the current behavior?

My understanding is that it would help if users are rapidly opening and deliberately closing connections, using the exact same options (addresses, credentials, ...). Are there some other use cases?

murgatroid99 · 2019-10-25T22:59:59Z

Yes, the use-case for this is that someone closes a client connected to a specific target, and then later (but not much later) creates a new identical client with the same target.

merlinnot · 2019-10-28T16:43:00Z

Do you feel it's an expected way of using the library? Or is it an optimization towards users misusing it?

I'm starting to wonder if we should have this timer at all, maybe we should only clean it up when a client is explicitly closed?

murgatroid99 · 2019-10-28T16:58:40Z

The timer is still needed to handle clients that aren't closed explicitly and subchannels that become unused that were previously used by still-active channels.

merlinnot · 2019-10-28T17:01:55Z

Thanks for the explanation.

Do you think that explicitly closing the client and recreating it repeatedly is an expected way of using this library?

murgatroid99 · 2019-10-28T20:20:47Z

No, closing and recreating identical clients is not an expected or intended use of this library. I'm retracting my previous comment here; I agree now that there is not significant inherent value in preserving subchannels past the channel lifetime.

Still, I believe that the timer is necessary for other usage patterns and that cleaning up the subchannel pool on channel close should be essentially unnecessary. So far, as far as I am aware, the only benefit to cleaning up immediately is to avoid this problem with Jest, and I do not want to start addressing issues in other libraries and tools in our code.

My preference then is to keep the functionality simple by only having the timer. I do still support turning the timer off while the pool has no subchannels. In addition, memory could be further saved by creating a new empty pool object every time the subchannel pool becomes empty.

merlinnot · 2019-10-29T06:30:08Z

Maybe using Jest in the issue wasn’t the best idea, I didn’t intend to focus on it. It affects any testing setup if you clean require cache or you have memory leak detection implemented.

Conceptually it seems that “I have no resources allocated if I cleaned everything up” would be an expected behavior. Maybe we could find some middle ground here, so we can have this behavior, but it’s implemented in a way that is more to your liking? Do you feel that the current implementation is too complex? Maybe we should split it in two parts, so that it’s easier to review?

murgatroid99 · 2019-10-29T20:22:46Z

I simply don't think it's necessary to clean everything up as soon as channels close. I don't think there is a significant impact even if you clean the require cache a hundred times in one process, and I expect that this will be noise in most leak checks. I really don't think it matters that much if resources are cleaned up immediately after a channel closes, as opposed to several seconds later. They're already unrefed, so they won't hold the process open.

merlinnot · 2019-10-29T20:52:46Z

I understand that it doesn’t matter for you. I think we have both pros and cons here.

Pros:

memory leak detection tools are happy
there’s lower memory consumption when you close the client, but don’t exit the process
there’s less CPU used when you close the client, but don’t exit the process
doesn’t leave any resources behind when require cache is cleaned

Cons:

the current implementation is changed, possibly causing an increase in complexity

Are there any things I should add to this list?

Maybe we could work together to come up with an implementation that is simpler and cleaner? I tried to make the smallest number of changes possible and to take care of naming and code structure, but I have a feeling that it's not up to your standards. Could you provide some more detailed feedback if you feel we can improve this PR in such a way that you'd be comfortable with merging it?

murgatroid99 · 2019-10-29T21:29:25Z

Here's how I see it:

Option 1 (the current behavior of this PR): Use the timer and a call at channel close to clear out the pool. Enable the timer only when the pool contains at least one subchannel. Pros:

Slightly less memory usage for a few seconds after each channel is closed.
Leak detection tools may be happier if the process ends immediately after closing a channel.
If channels are explicitly closed, no resources are held for a few seconds after the require cache is cleared.

Option 2 (my proposal): Use only the timer to clear out the pool. Enable the timer only when the pool contains at least one subchannel. Pros:

Slightly less CPU usage when closing a channel.
Simpler behavior.

For me the relative simplicity of the second option outweighs the brief and occasional value of the first.

merlinnot · 2019-10-30T02:21:11Z

For us, as a user of this library, a solution proposed by you would force us to either:

wait for 10 seconds after each test
use fake timers in every single test

The first option would dramatically lower our productivity and increase costs of CIs. The second option would increase complexity of the test setup and might interfere with timers used by other libraries, which might result in unexpected behavior and inability to test real behavior of these libraries.

Maybe we could make eager cleanup optional if you’re concerned by additional CPU usage?

grpc.setOptions({ eagerSubchannelCleanup: true })

I’m trying to find a solution that would allow my company to keep using this library.

murgatroid99 · 2019-10-30T17:21:16Z

I don't understand these requirements you are describing. My own tests finish promptly without even explicitly closing clients, because these objects we're discussing are unrefed. And I don't see what a fake timer would do.

merlinnot · 2019-10-30T17:32:36Z

I understand that your particular testing setup exits and I do acknowledge that unrefing allows the process to exit cleanly if there are no other tasks in the event loop.

However:

If you were using memory leak detection tools that compare memory usage before and after each test, you'd have to either wait for 10 seconds or use fake timers to forcefully run the timer (assuming we go with your option: clean up on an interval, not eagerly).
If you were clearing the require cache, you'd have a similar situation, where multiple of these timers would pile up over 10 seconds (if a test takes 0.5 s, you'd have a maximum of 20 of these timers running).

murgatroid99 · 2019-10-30T17:41:59Z

Are you saying that you are currently using one of these memory leak detection tools and the only excess memory usage it is detecting is those Subchannel instances and the associated objects that they own?

And I honestly don't see the problem with temporarily having 20+ timers pending. They might consume a little CPU when they run the last time but then that's it.

merlinnot · 2019-10-30T17:49:06Z

No, I'm not saying it's the only leak it detects.

As my company is using this library, we'd like to contribute to it the best way we can. As an example, I recently provided a meaningful reproduction that made it possible to fix an issue we and other users were facing: #1085 (that's what is being released today).

This PR provides a solution to one of the issues which is affecting us. We'd like to continue investing into this library by both providing reproductions of issues and creating pull requests to fix some of these issues if we are able to pinpoint a root cause.

murgatroid99 · 2019-10-30T18:18:24Z

I am just trying to understand what actual problem would be solved by the change you propose.

If the goal is to fix Jest reporting an error because it doesn't understand that the timer is unrefed as reported in #1080, that should be taken up with Jest; I don't want to publish code that primarily addresses bugs in other libraries.

If there is some other problem separate from #1080 that is preventing tests from finishing cleanly, tell me what that problem is. It may not even be related to the subchannel pool.

If this library was blocking the adoption of leak checking in tests because it was the only memory sticking around at the end of the test, I would consider that a strong reason to make this change or one like it. But if it's just one leak among many then I don't think it's grpc's responsibility to make fixes for that purpose.

If the existence of many pending timers as a result of clearing the require cache is causing significant problems with running tests, tell me what those problems are and I'll be more inclined to make this change.

merlinnot · 2019-10-30T21:38:41Z

Sure. Here's an illustration: simple JavaScript that you can actually execute with the --expose-gc flag (node --expose-gc ./script.js):

const assert = require('assert');

let heapUsed = NaN;

const beforeEach = () => {
  global.gc();

  heapUsed = process.memoryUsage().heapUsed;
};

const afterEach = () => {
  global.gc();

  console.log(`Before: ${heapUsed}, after: ${process.memoryUsage().heapUsed}.`);
};

const definitelyNotLeaking = () => {
  const buffer = Array.from({ length: 1024 * 1024 });

  assert(buffer.length === 1048576, "Math doesn't work anymore.");
};

const definitelyLeaking = () => {
  const buffer = Array.from({ length: 1024 * 1024 });
  setTimeout(() => {
    console.log(
      `Hey look, test finished, but I'm still running! - ${buffer.length}`,
    );
  }, 500).unref();
};

beforeEach();
definitelyNotLeaking();
afterEach();

beforeEach();
definitelyNotLeaking();
afterEach();

beforeEach();
definitelyNotLeaking();
afterEach();

beforeEach();
definitelyLeaking();
afterEach();

beforeEach();
definitelyLeaking();
afterEach();

You should see an output similar to:

Before: 3502640, after: 3448280.
Before: 3489640, after: 3481360.
Before: 3487184, after: 3484216.
Before: 3484328, after: 11887168.
Before: 11887280, after: 20277448.

As you can see, the memory usage grows radically with setTimeout, even though it's unrefed. That's because it holds references to some objects defined outside of its callback, preventing these from being garbage collected. The same happens in our case, where this timer not only consumes memory by itself, but also keeps all of the referenced objects in memory.

From what I've seen, most memory leak detection tools use V8's C++ API directly, as it exposes more functionality, but I hope this simple example will help you understand the issue.

murgatroid99 · 2019-10-31T00:10:58Z

OK, yes, I do understand that a timer that closes over things will hold those things in memory for as long as the timer is pending. And I know that unrefing the timer doesn't change that because that is addressing a different problem. And I do understand that you can write a function that checks how much memory is allocated and see that more memory is allocated.

What I am trying to understand here is the actual practical impact of these temporary memory usage spikes, in terms of test failures or integration problems or application issues or users' bug reports or whatever.

merlinnot · 2019-10-31T05:57:35Z

If you have such a tool enabled, every test which uses gRPC, even if cleaned up correctly (.close etc.), will fail the memory check unless you (given we'd start disabling the timer in the way you proposed):

wait 10 seconds after each test
use fake timers to run this timer forcefully

What’s more, if you clean require cache, memory allocated by gRPC will start to pile up (as I described above, if the test execution takes half a second, you have up to 20 pending timers). In our scenario tests are running on a 32-core VM in full parallelization and isolation. That means that we’d have 32 * 20 = 640 gRPCs allocated at the same time, which is a significant additional memory pressure.

I didn’t see any other library that would behave in such a way, that after deliberately closing it, resources would remain allocated (even for a few seconds). Do you know some other libraries that do so?

murgatroid99 · 2019-11-01T17:31:16Z

I understand that if you were to use such a tool and set it up that way, its report would include those objects. I just want to know what the actual impact of this is. Is this preventing you from using such a memory leak detection tool effectively?

As for the memory usage, subchannels use something on the order of kilobytes or tens of kilobytes of memory. So in the situation you describe, you will see a peak of something on the order of several megabytes of total excess memory usage. That shouldn't be a problem on just about any modern system.
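
The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the 10-second interval, the 0.5-second test, and the 32 workers come from the thread, while the 10 KB per pool is an assumed value picked from the "kilobytes or tens of kilobytes" range, with one subchannel per leftover pool.

```javascript
// Figures quoted in the thread.
const cleanupIntervalSeconds = 10; // delay before the timer clears the pool
const testDurationSeconds = 0.5; // example test duration
const parallelWorkers = 32; // one worker per core on the CI VM

// Assumption: 10 KB retained per leftover pool (one subchannel each).
const bytesPerPool = 10 * 1000;

const pendingPoolsPerWorker = cleanupIntervalSeconds / testDurationSeconds;
const pendingPools = pendingPoolsPerWorker * parallelWorkers;
const peakMegabytes = (pendingPools * bytesPerPool) / 1e6;

console.log({ pendingPoolsPerWorker, pendingPools, peakMegabytes });
```

Under these assumptions the peak is a few megabytes; a larger per-subchannel size or more subchannels per pool scales it linearly.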

merlinnot · 2019-11-01T19:32:46Z

Yes, it does prevent usage of such tools. It's very hard to determine where a leak comes from, as it's usually a chain of references from one thing to another. You probably know how hard it sometimes is to understand a memory leak with a debugger. It's just as hard for these tools to determine where a memory leak comes from, so you just get a "yes/no" response. At least I personally have never come across a tool that allows you to exclude some paths.

Situation with native grpc was a little different, as it's a native add-on and memory leak detection tools don't keep track of these, only actual JS heap.

So at the moment I can't enable memory leak detection for any tests, as we're using Google SDKs extensively, especially Firestore and PubSub (with emulators). Having an actually clean state (synchronously) would allow me and other developers to use these tools. It's really useful for us to detect memory leaks in the CI :) I know at my company we're a little paranoid about tests, but given the speed we're developing at (sometimes dozens of features and deployments daily) and having a "master = production" mentality, we really need to take good care of our tests.

We're trying our best to contribute to open source, especially the tools we use daily. You can take a look at my public history (e.g. 25 merged and 1 open PR to firebase-functions) and my company's public profile with PubSub, Firestore and Flatbuffers containers, GitHub Actions etc.

We created this PR as it's an important functionality for us and we prefer to contribute to the community instead of just using forks, as we believe it's a more sustainable way for the community and our company. We also believe that it would ease adoption of memory leak detection tools for others, which might be a good idea for many.

If this implementation is not something that is acceptable for you, maybe you could propose a different solution (e.g. what I mentioned above, to add it as an option)? We'd be happy to discuss it and adapt our approach.

murgatroid99

I'm sorry for all of the back-and-forth. I feel sufficiently convinced now that cleaning the pool up when channels close is the best way to improve this behavior for some use-cases.

murgatroid99 · 2019-11-01T21:55:54Z

packages/grpc-js/src/channel.ts

@@ -288,6 +288,10 @@ export class ChannelImplementation implements Channel {
  close() {
    this.resolvingLoadBalancer.destroy();
    this.updateState(ConnectivityState.SHUTDOWN);
+
+    if (this.subchannelPool !== undefined) {


Type-wise, this.subchannelPool can't be undefined, so this check is redundant.

Fixed in 93f8169.

murgatroid99 · 2019-11-01T22:12:08Z

packages/grpc-js/src/subchannel-pool.ts

+    const allSubchannelsUnrefed = this.unrefUnusedSubchannels();
+
+    if (allSubchannelsUnrefed && this.cleanupTimer !== undefined) {
+      clearInterval(this.cleanupTimer);


The condition should be that the timer is started if there is at least one subchannel in the pool, and stopped if there are no longer any subchannels in the pool. The timer should not be disabled just because a single channel is closed. So, this check should be in unrefUnusedSubchannels. That allows a simplification where unrefUnusedSubchannels returns nothing, and is public, and is the function that Channel#close calls directly.

Good one.
Fixed in 5f271de
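
The shape requested in this review comment can be sketched roughly as follows. This is not the grpc-js source: the `SubchannelPoolSketch` and `ChannelSketch` classes and the `refCount` field are hypothetical stand-ins, the 10-second interval is assumed from the figure mentioned in the thread, and only the names `cleanupTimer`, `ensureCleanupTask`, and `unrefUnusedSubchannels` are taken from the diff.

```javascript
const REF_CHECK_INTERVAL = 10000; // ms; assumed from the "10 seconds" in the thread

class SubchannelPoolSketch {
  constructor() {
    this.subchannels = new Set();
    this.cleanupTimer = null; // null means "no cleanup task running"
  }

  // Start the periodic cleanup only if it is not already running.
  ensureCleanupTask() {
    if (this.cleanupTimer === null) {
      this.cleanupTimer = setInterval(
        () => this.unrefUnusedSubchannels(),
        REF_CHECK_INTERVAL,
      );
      // Unrefed so the timer never keeps the process alive on its own.
      this.cleanupTimer.unref();
    }
  }

  add(subchannel) {
    this.subchannels.add(subchannel);
    this.ensureCleanupTask();
  }

  // Public and returns nothing: drops unused subchannels and stops the
  // timer once the pool is empty.
  unrefUnusedSubchannels() {
    for (const subchannel of this.subchannels) {
      if (subchannel.refCount === 0) {
        this.subchannels.delete(subchannel);
      }
    }
    if (this.subchannels.size === 0 && this.cleanupTimer !== null) {
      clearInterval(this.cleanupTimer);
      this.cleanupTimer = null;
    }
  }
}

class ChannelSketch {
  constructor(pool) {
    this.pool = pool;
    this.subchannel = { refCount: 1 };
    pool.add(this.subchannel);
  }

  close() {
    this.subchannel.refCount -= 1;
    // Eager cleanup: do not wait for the next timer tick.
    this.pool.unrefUnusedSubchannels();
  }
}

const pool = new SubchannelPoolSketch();
const first = new ChannelSketch(pool);
const second = new ChannelSketch(pool);

first.close();
const timerRunningAfterFirstClose = pool.cleanupTimer !== null;

second.close();
const timerRunningAfterLastClose = pool.cleanupTimer !== null;

console.log({ timerRunningAfterFirstClose, timerRunningAfterLastClose });
```

Closing one of two channels leaves the timer running because the pool is not yet empty; closing the last one empties the pool and stops the timer in the same call.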

murgatroid99 · 2019-11-01T22:20:14Z

packages/grpc-js/src/subchannel-pool.ts

+  /**
+   * A timer of a task performing a periodic subchannel cleanup.
+   */
+  private cleanupTimer: NodeJS.Timer | undefined;


I slightly prefer not allowing undefined here, and instead having a separate boolean indicating whether the timer is running. I find it cleaner type-wise to allow as few types as possible. The downside is that the constructor needs to do cleanupTimer=... and clearInterval(cleanupTimer) and starting the timer is slightly more code, but the upside is that you never need to check the value of cleanupTimer itself.

I understand that you prefer it to be more explicit, rather than having undefined to indicate that it's not running? I'd personally prefer to also destroy the reference to the timer if we're not running it anymore, as it ideally should not be used anymore. I thought we might be more explicit, but destroy the reference too:

type CleanupTask = { running: true; timer: NodeJS.Timer } | { running: false };

WDYT?

The reason I like keeping the timer reference is it allows the code for disabling the timer to work even if the timer is already disabled, which simplifies code:

clearInterval(timer); running = false;
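
The alternative described in the comment above (keep a timer reference at all times plus a separate boolean) might look like the sketch below. It is illustrative only: `CleanupTaskSketch` and its fields are hypothetical names, and it relies on the fact that calling `clearInterval` on an already-cleared timer does nothing.

```javascript
class CleanupTaskSketch {
  constructor(task, intervalMs) {
    this.task = task;
    this.intervalMs = intervalMs;
    // Create and immediately cancel a timer so the field always holds a
    // timer object and never needs a null or undefined check.
    this.timer = setInterval(() => {}, intervalMs);
    clearInterval(this.timer);
    this.running = false;
  }

  start() {
    if (!this.running) {
      this.timer = setInterval(this.task, this.intervalMs).unref();
      this.running = true;
    }
  }

  // Safe to call at any time: clearing an already-cleared interval is a no-op.
  stop() {
    clearInterval(this.timer);
    this.running = false;
  }
}

const cleanup = new CleanupTaskSketch(() => {}, 10000);
cleanup.stop(); // never started: still fine
cleanup.start();
const runningAfterStart = cleanup.running;
cleanup.stop();
cleanup.stop(); // already stopped: still fine
console.log(`running: ${cleanup.running}`);
```

The trade-off is the one raised in the thread: `stop()` needs no guard, but the instance keeps a reference to a dead timer object.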

Hi @murgatroid99,

I couldn't find any documentation explaining whether the callback of setInterval is garbage collected when clearInterval is called on its timer. This makes me uncertain whether we should keep the timer on the instance or remove it. I consulted with my colleagues and no one was certain about it, so I think it might cause more confusion than clarity.

What do you think, maybe we could leave it as is?

That's a fair point. One alternative would be to do the same thing I suggested in the constructor and replace it with a do-nothing interval and then immediately cancel it, but that kind of removes the benefit I mentioned so I'm OK with keeping it as it is here.

OK, actually, this is a little minor, but can we have the alternate type be null instead of undefined? I dislike using undefined as an explicit type, and null would be more consistent with other code in this project. It shouldn't change much; this line can just be private cleanupTimer: NodeJS.Timer | null = null.

I changed it to null and merged master, as far as I remember you made a fix to macOS builds.

murgatroid99 · 2019-11-01T22:32:24Z

packages/grpc-js/src/subchannel-pool.ts

+   * Ensure that the cleanup task is spawned.
+   */
+  ensureCleanupTask(): void {
+    if (this.global === true && this.cleanupTimer === undefined) {


=== true shouldn't be needed here. global is always a boolean.

Fixed in 93f8169.

merlinnot · 2019-11-01T22:42:56Z

Thanks for the feedback! We'll rebase on master and address your comments on Monday :)

murgatroid99 · 2019-11-01T22:52:43Z

I don't think you need to rebase. The relevant parts of these files haven't changed since you opened this PR.

JrSchild · 2019-11-06T14:23:41Z

@murgatroid99 Thank you for your feedback, this should be ready for another review.

thelinuxfoundation · 2019-11-06T19:32:37Z

Thank you for your pull request. Before we can look at your contribution, we need to ensure all contributors are covered by a Contributor License Agreement.

After the following items are addressed, please respond with a new comment here, and the automated system will re-verify.

User @merlinnot isn't covered by a CLA. They will need to complete the form at https://identity.linuxfoundation.org/projects/cncf

Regards,
CLA GitHub bot

merlinnot · 2019-11-06T19:37:35Z

@thelinuxfoundation I signed it.

merlinnot · 2019-11-06T19:40:55Z

@murgatroid99 Could you re-run Kokoro?

merlinnot · 2019-11-06T21:30:47Z

Yay, thanks! 😃

murgatroid99 · 2019-11-07T18:00:38Z

I have published this change in grpc-js 0.6.11.

JrSchild · 2019-11-07T19:10:34Z

Thanks!

fix(sub-channels): clear the cleanup interval when all channels are unrefed

20cbfc7

Co-authored-by: Natan Sągol <m@merlinnot.com>

merlinnot mentioned this pull request Oct 24, 2019

Expose an API to close channels googleapis/nodejs-firestore#769

Closed

murgatroid99 added the kokoro:run label Oct 24, 2019

kokoro-team removed the kokoro:run label Oct 24, 2019

murgatroid99 requested changes Nov 1, 2019

JrSchild added 2 commits November 5, 2019 09:59

refactor: simplify if statements

93f8169

fix: cancel the cleanup task inside the unrefUnusedSubchannels function

5f271de

fix: correct comments

821c9ab

murgatroid99 added the kokoro:run label Nov 6, 2019

kokoro-team removed the kokoro:run label Nov 6, 2019

refactor: use null instead of undefined to indicate that cleanupTimer is stopped

e51a740

merlinnot added 2 commits November 6, 2019 20:38

Merge branch 'master' of github.com:grpc/grpc-node into JrSchild/master

61d7e77

fix: correctly initialize cleanupTimer

0353dbf

murgatroid99 added the kokoro:run label Nov 6, 2019

kokoro-team removed the kokoro:run label Nov 6, 2019

murgatroid99 approved these changes Nov 6, 2019

murgatroid99 merged commit a7567f0 into grpc:master Nov 6, 2019

murgatroid99 mentioned this pull request Nov 7, 2019

grpc-js: Bump to 0.6.11 #1165

Merged

merlinnot mentioned this pull request Feb 4, 2020

Publisher Client close() method googleapis/nodejs-pubsub#817

Closed

lock bot locked as resolved and limited conversation to collaborators Feb 5, 2020
