
Subscriber maxes out CPU after exactly 60 minutes on GKE #890

Closed
ctavan opened this issue Feb 19, 2020 · 24 comments
Labels
  • api: pubsub (Issues related to the googleapis/nodejs-pubsub API.)
  • priority: p2 (Moderately-important priority. Fix may not be included in next release.)
  • type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

Comments

@ctavan
Contributor

ctavan commented Feb 19, 2020

Environment details

  • OS: GKE
  • Node.js version: 12.14.1
  • npm version: –
  • @google-cloud/pubsub version: pubsub@1.5.0 / grpc-js@0.6.16

Steps to reproduce

  1. Start a pod that subscribes to an idle subscription
  2. Wait 60 minutes
  3. Observe the node process eat up all available CPU

[Screenshot 2020-02-19 12:54:29: graph of the pod's CPU usage over time]

I have a pod that subscribes to a subscription which is 100% idle (i.e. no messages published on that topic) during the observation period.

Exactly 60 minutes after the pod started, it eats up all available CPU until the pod is deleted manually to force a restart.

This behavior does not appear when using the C++ gRPC bindings, as suggested in #850 (comment).
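
For reference, switching over to the C++ bindings just means installing the native grpc package and passing it into the client constructor; roughly like this (a sketch of the comparison setup, not our exact code):

const grpc = require('grpc');
const { PubSub } = require('@google-cloud/pubsub');

// Hand the native bindings to the client so they are used instead of @grpc/grpc-js.
const pubsub = new PubSub({ grpc });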

I have also enabled debug output through

GRPC_TRACE=all
GRPC_VERBOSITY=DEBUG
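
These are plain environment variables on the container. Setting them at the very top of the entry file, before the library is first required, should have the same effect, since grpc-js reads them when it loads; a sketch, not what we actually run:

// Must run before @google-cloud/pubsub (and thus @grpc/grpc-js) is required.
process.env.GRPC_TRACE = 'all';
process.env.GRPC_VERBOSITY = 'DEBUG';
const { PubSub } = require('@google-cloud/pubsub');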

I've put the full debug output (as fetched from Google Cloud Logging) of the entire lifetime of the pod in this gist.

There is no further output after the last line even though the pod kept running for a while, as can be seen from the graph above: the bump in CPU usage is the period during which the pod eats up all available CPU without logging anything. (Times in the logs correspond to the times in the graph above with a 1h timezone difference, so 10:51 in the logs = 11:51 in the graph.)

I can reproduce this problem with any of my subscriber pods.

I believe it is very likely that this problem also falls into the category of #868.

product-auto-label bot added the api: pubsub label on Feb 19, 2020
@hefekranz

The same is happening to us. We thought it was a bad implementation on our side...

@codersbrew

We are also running into the same issue. We've been seeing a lot of strange behavior with this library.

@feywind
Collaborator

feywind commented Feb 19, 2020

Thanks everyone, especially for that log output! I find this to be particularly suspicious:

2020-02-19T10:10:14.382Z | call_stream | [6] write() called with message of length 0

I'm working on trying to build some kind of repro to pass on to the grpc team.

We have only one Node library that still defaults to the gRPC C++ bindings at this point (spanner), so there has been some hesitation about switching this one back to them by default, even though the problem seems to be affecting a lot of people. (Also, the C++ bindings require node-gyp, which may not be available in some environments, so adding a hard dependency on them is potentially problematic.)

The current thought is that the issue may have something to do with this library's more extensive use of streaming APIs, so progress is being made on that front. Thanks for the patience and debugging help!

feywind added the priority: p2 and type: bug labels on Feb 19, 2020
@crivera

crivera commented Feb 19, 2020

We are seeing a similar issue where our node process gets killed if a subscription has been idle for 60 minutes - without any logs.

@murgatroid99

Can you clarify when you restarted the process relative to the final timestamped log message? Was the log cut off because you restarted, or was there a time gap after the last log message and before the manual restart?

@murgatroid99

The log you posted shows that the grpc library establishes a connection and some requests are made, then exactly an hour later the connection drops and a new connection is established (this is a normal thing that the grpc library is designed to handle). Then, right after the new connection is established, the log cuts off, saying that the process was restarted. I'm wondering if you can see timestamps for when GKE restarts the process, and when that is relative to the log lines in the file.

The reason is that I'm wondering if the problem I should be looking for is more like "something weird happened while reconnecting and that triggered GKE to restart it" or "the process entered an infinite loop without logging anything, and GKE eventually detected it".

@ctavan
Contributor Author

ctavan commented Feb 20, 2020

> Can you clarify when you restarted the process relative to the final timestamped log message? Was the log cut off because you restarted, or was there a time gap after the last log message and before the manual restart?

I only restarted about 20 mins after the last log message (see the graph in the original post, where the bump in CPU usage goes down again).

Between the last log message in my gist and the restart, there were no further log messages retrievable from kubectl or Google Cloud Logging, and the process consumed all the CPU that Kubernetes allowed.

@ctavan
Contributor Author

ctavan commented Feb 20, 2020

> The log you posted shows that the grpc library establishes a connection and some requests are made, then exactly an hour later the connection drops and a new connection is established (this is a normal thing that the grpc library is designed to handle). Then, right after the new connection is established, the log cuts off, saying that the process was restarted. I'm wondering if you can see timestamps for when GKE restarts the process, and when that is relative to the log lines in the file.
>
> The reason is that I'm wondering if the problem I should be looking for is more like "something weird happened while reconnecting and that triggered GKE to restart it" or "the process entered an infinite loop without logging anything, and GKE eventually detected it".

My observation is: after 60 minutes of repeated log output every 30 seconds, all log output stops and the node process jumps to maximum available CPU usage for no apparent reason. The log output in the gist shows everything retrieved from that pod, which kept running for about 20 minutes after the last line of logs.

@ctavan
Contributor Author

ctavan commented Feb 20, 2020

Just to make sure we're not missing any details, our initialization logic looks like this:

const { PubSub } = require('@google-cloud/pubsub');
const getenv = require('getenv'); // assuming the npm "getenv" package

const pubsub = new PubSub();
const subscription = pubsub.subscription(subscriptionName, {
  flowControl: {
    // Cap buffered message bytes at half of the pod's memory limit.
    maxBytes: Math.floor(getenv.int('MEMORY_LIMIT_BYTES') * 0.5),
  },
});
subscription.on('message', handleMessage);

where MEMORY_LIMIT_BYTES comes from the Kubernetes pod spec resource limits.

@ctavan
Contributor Author

ctavan commented Feb 20, 2020

@murgatroid99 I tried it with earlier versions of @grpc/grpc-js, and the problem only appears with @grpc/grpc-js@0.6.16. It did not appear with 0.6.12, 0.6.14, or 0.6.15.

I have updated the gist to include the log output of the same pod with the earlier versions of grpc-js.

@murgatroid99

I have published @grpc/grpc-js version 0.6.17 with a possible mitigation for the issue and some more logging that should hopefully be relevant. Can you try that out and share the resulting logs?

@ctavan
Contributor Author

ctavan commented Feb 21, 2020

Whoops, I'm having trouble downloading the logs; there are megabytes of:

2020-02-21T20:23:04.844Z | channel | Failed to start call on picked subchannel 74.125.140.95:443. Retrying pick
2020-02-21T20:23:04.861Z | channel | Pick result: COMPLETE subchannel: 74.125.140.95:443 status: undefined undefined
2020-02-21T20:23:04.880Z | channel | Failed to start call on picked subchannel 74.125.140.95:443. Retrying pick
2020-02-21T20:23:04.884Z | channel | Pick result: COMPLETE subchannel: 74.125.140.95:443 status: undefined undefined
2020-02-21T20:23:04.889Z | channel | Failed to start call on picked subchannel 74.125.140.95:443. Retrying pick
2020-02-21T20:23:04.904Z | channel | Pick result: COMPLETE subchannel: 74.125.140.95:443 status: undefined undefined
2020-02-21T20:23:04.923Z | channel | Failed to start call on picked subchannel 74.125.140.95:443. Retrying pick
2020-02-21T20:23:04.940Z | channel | Pick result: COMPLETE subchannel: 74.125.140.95:443 status: undefined undefined

during the period where the process runs with maxed-out CPU. So basically tryPick seems to be running in an infinite loop.

I can highly recommend not running @grpc/grpc-js version 0.6.17 with debug flags enabled 😉

(Edited to include multiple subsequent log outputs to illustrate that they are only milliseconds apart)

@ctavan
Contributor Author

ctavan commented Feb 21, 2020

@murgatroid99 I managed to download 140MB of logs from Google Cloud Logging and put the part up to the point where the infinite loop begins into the gist.

@feywind
Collaborator

feywind commented Feb 21, 2020

@ctavan Thanks for the continued debug info! This is actually starting to sound like something we had discussed as a possible cause. Basically, after some time (I think it might've been 30 minutes) the stream is closed, and we attempt to reconnect it. If something prevents that connection, it's not supposed to busy-loop reconnecting, but that might be what it's doing.

A gRPC fix might be useful if that's what it is (grpc/grpc-node#1271), but we also talked about putting a workaround into the nodejs-pubsub library if that proves more elusive.

@murgatroid99

I had a feeling that the problem was something like that, but I tried to mitigate a slightly different potential issue. I didn't realize it would loop that tightly and generate that many log lines, though I do want to note that in normal circumstances, when you're not encountering this bug, you'll usually get about one extra log line per request.

What's happening here is that a connection has been established, but something is in an invalid state that is preventing it from actually starting a stream, so it's just repeatedly retrying. The reason it only happens on 0.6.16 and above is that it's caused by grpc/grpc-node#1251, which was supposed to fix googleapis/nodejs-firestore#890. I'll try to get a new version out on Monday that fixes this. For now, if you can force it to use 0.6.15 you'll probably be fine.
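
If your dependency tree won't resolve to 0.6.15 on its own, one way to force it (assuming you use Yarn) is a resolutions entry in package.json, roughly:

"resolutions": {
  "@grpc/grpc-js": "0.6.15"
}

With npm, installing @grpc/grpc-js@0.6.15 as a direct dependency may accomplish the same thing through hoisting, as long as the version range declared by google-gax still accepts it.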

@murgatroid99

I have now published @grpc/grpc-js version 0.6.18. I'm pretty sure I've fixed it this time, and if not, I've added some more relevant logging. Can you try that out?

@ctavan
Contributor Author

ctavan commented Feb 25, 2020

@murgatroid99 I've upgraded to @grpc/grpc-js@0.6.18 and the specific problem described in this issue is now gone. Great work!

For the sake of completeness, I have added the log output of the pod (still running after 160 minutes) to the gist.

I'll leave this issue open in case you want to update the transitive dependencies of @google-cloud/pubsub so that it pulls in at least @grpc/grpc-js 0.6.18.
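
In case anyone else wants to verify which grpc-js version their deployment actually resolved, a quick check from inside the app (assuming the package's package.json is resolvable, which it is for these versions):

// Prints the version of @grpc/grpc-js that this process resolves.
console.log(require('@grpc/grpc-js/package.json').version);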

@feywind
Collaborator

feywind commented Feb 25, 2020

@ctavan I will take a look at updating the versions in nodejs-pubsub. Thank you for all the help in testing this stuff!

@jperasmus

Had a similar issue and switched to the older C++ grpc client in the meantime. It seems to be behaving.

@feywind
Collaborator

feywind commented May 5, 2020

@jperasmus Have you tried the latest version of the library lately? It pulls in grpc-js 0.6.18. grpc-js also released 1.x recently, and I'm planning to pull that in after pubsub has its 2.0 release.

@jperasmus

@feywind I did try with the latest version at the time, which was after Christoph posted about 0.6.18. I will try with the latest v1 grpc-js client on one of our test environments after v2 of this library includes it.

@feywind
Collaborator

feywind commented Jul 21, 2020

@jperasmus Checking back in - have you had better luck with the grpc-js client 1.x? That should be picked up here now (by way of gax updates).

@jperasmus

Hi @feywind, thanks for following up. I actually moved to a different company where we are not using this package anymore, but before I left, I ran an experiment for almost two weeks with the v1 grpc-js client in a test environment, and it was substantially more stable than the old grpc client. Basically, where the grpc client would throw at least one connection issue per day in production, the grpc-js client had no connection issues in the test environment under similar, though slightly lower, load than production. As far as I am aware, grpc-js is now running in production at my old company.

yoshi-automation added the 🚨 This issue needs some love. label on Aug 17, 2020
@feywind
Collaborator

feywind commented Aug 18, 2020

Okay, cool! If anyone is still running into this with grpc-js, feel free to re-open.

feywind closed this as completed on Aug 18, 2020
feywind removed the 🚨 This issue needs some love. label on Aug 18, 2020