Backlog spikes reported every hour #850

Closed
pjm17971 opened this issue Jan 12, 2020 · 13 comments
Assignees
Labels
  • api: pubsub (Issues related to the googleapis/nodejs-pubsub API.)
  • priority: p1 (Important issue which blocks shipping the next release. Will be fixed prior to next release.)
  • 🚨 (This issue needs some love.)
  • type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

Comments

@pjm17971

We run many workers on GKE that essentially pull messages from a Pubsub subscription at a rate of about 120k/min, do some processing and write the result into Bigtable. As part of a recent change we upgraded this Pubsub client library from 0.29.1 to 1.2 and immediately started to see alerts.

What we started to see after this upgrade was spikes in the reported backlog (and oldest unacked message age). These happened hourly. However, our service appeared not to suffer, and continued to output its product at a steady rate.

Here is an overview of this upgrade as seen in Stackdriver (the period running Pubsub v1.2 is highlighted in red; after 10am Friday we reverted JUST the pubsub version and the process returned to normal):

[Screenshot: Screen Shot 2020-01-11 at 11 22 37 AM]

Zooming in to approximately Friday 12am until noon, with the backlog at the top and the oldest message age at the bottom:

[Screenshot: Screen Shot 2020-01-11 at 5 08 05 PM]

It is pretty clearly something that happens every hour.

I know there's another GitHub issue for memory spikes, but at least as far as we can tell that's not the case for us. In fact, I don't think we actually saw a real impact on processing output. This assessment is based on: 1) we didn't see lag in our downstream client, which is usually the case with actual backlogs, and 2) we didn't see an increase in worker CPU when the backlog recovered. The biggest problem is that we use these alerts on this subscription as our main indicator that the process may be experiencing problems.

Environment details

  • OS: GKE
  • Node.js version: 12.x
  • npm version: 6.11.3
  • @google-cloud/pubsub version: 1.2

Steps to reproduce

Not sure we do anything special. Each worker instance creates a simple client which processes messages. We run this on GKE, with one node instance per pod. Approximately 64 workers are all pulling from the same subscription. We generally just ack messages regardless of successful processing because in this application it's ok to just drop them (a separate process will take care of it).
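
For reference, here is a minimal sketch of what each worker does (the subscription name and the processing function are placeholders, not our real code):

const {PubSub} = require('@google-cloud/pubsub');

// Placeholder for the real work (processing + Bigtable write).
async function processMessage(data) {
  // ... do some processing ...
}

const pubsub = new PubSub();
// Hypothetical subscription name; all ~64 workers attach to the same subscription.
const subscription = pubsub.subscription('my-subscription');

subscription.on('message', async message => {
  try {
    await processMessage(message.data);
  } finally {
    // Ack regardless of success; a separate process takes care of dropped items.
    message.ack();
  }
});

subscription.on('error', err => {
  console.error('Subscription error:', err);
});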

Hope this is helpful. Let me know if I can provide any additional data.

Thanks!

@pjm17971 pjm17971 changed the title from "Backlog spikes every hour" to "Backlog spikes reported every hour" Jan 12, 2020
@yoshi-automation yoshi-automation added the triage me (I really want to be triaged.) label Jan 13, 2020
@feywind feywind added the external (This issue is blocked on a bug with the actual product.), priority: p1, and type: bug labels and removed the triage me label Jan 13, 2020
@feywind
Collaborator

feywind commented Jan 13, 2020

@pjm17971 Hi Peter, we've been seeing a number of issues lately that involve upgrading the nodejs-pubsub version from one of the 0.x versions to the latest one. The common suggestion I've seen is to try switching to the C++ gRPC bindings, so you might give that a try.

const {PubSub} = require('@google-cloud/pubsub');
// The 'grpc' package provides the native C++ bindings; passing it to the
// constructor replaces the default pure-JavaScript grpc-js implementation.
const grpc = require('grpc');
const pubsub = new PubSub({grpc});

(Or however you instantiate the PubSub object, you can provide grpc to replace its implementation.)

I'll look at this some more, but I wanted to give you something quick to try!

@pmcnr-hx

pmcnr-hx commented Jan 14, 2020

We have seen similar behaviour when using the pure JavaScript @grpc/grpc-js module that is the default in versions later than 0.29.1, and I can confirm that this is solved, as suggested above, by opting in to the native-bindings gRPC module instead.

@pjm17971
Author

@feywind Thanks for the info. If we give it a try I'll report back here on whether or not it worked. Is there an issue on the gRPC JavaScript side that we can track so we'll know when to switch this back out?

@feywind
Collaborator

feywind commented Jan 16, 2020

@pjm17971 I'm actually pretty new on the team, so I don't have a map of everything that's going on in my head yet :) but there have been a number of issues lately with gRPC vs grpc-js. I'll be on the lookout for a useful grpc issue for you.

@bcoe Do you know of one off the top of your head?

@feywind
Collaborator

feywind commented Jan 16, 2020

@pjm17971 If you have a chance, can you try turning on the GRPC debug flags with these environment variables?

GRPC_TRACE=all
GRPC_VERBOSITY=DEBUG

Specifically with the configuration described as problematic in the original post, if possible.
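
For a Node worker, one way to do that is a sketch like this (assuming the variables only need to be in the environment before the client library loads; setting them in the pod/container spec works just as well):

// Set the gRPC debug flags before the Pub/Sub client (and grpc-js) is loaded.
process.env.GRPC_TRACE = 'all';
process.env.GRPC_VERBOSITY = 'DEBUG';

const {PubSub} = require('@google-cloud/pubsub');
const pubsub = new PubSub();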

Thanks!

@google-cloud-label-sync google-cloud-label-sync bot added the api: pubsub label Jan 30, 2020
@ctavan
Contributor

ctavan commented Feb 19, 2020

Just chiming in to let you know that I'm experiencing precisely the same behavior as @pjm17971 with a couple of subscribers running on GKE.

Here are the versions and the respective behavior that I have been observing:

2020-01-26  pubsub-0.31.1 grpc-js-0.5.2  --> No spikes
2020-02-17  pubsub-1.1.5  grpc-js-0.6.9  --> Spikes
2020-02-17  pubsub-1.5.0  grpc-js-0.6.16 --> Maxed out CPU on idle subscriptions (will report in separate issue)
2020-02-17  pubsub-1.1.5  grpc-js-0.6.9  --> Spikes (reverted due to above issues)
2020-02-18  pubsub-1.5.0  grpc-1.24.2    --> No spikes since then (using C++ grpc instead of grpc-js)

Providing the C++ grpc bindings to the PubSub() constructor as suggested by @feywind seems to be a viable workaround so far.


That said, with the pubsub-1.1.5 grpc-js-0.6.9 combo I occasionally got precisely this error crashing my node process: googleapis/nodejs-firestore#739

node[1]: ../src/node_http2.cc:892:ssize_t node::http2::Http2Session::ConsumeHTTP2Data(): Assertion `(flags_ & SESSION_STATE_READING_STOPPED) != (0)' failed.
 1: 0x9bcb80 node::Abort() [node]
 2: 0x9bcc07  [node]
 3: 0x9f253a node::http2::Http2Session::ConsumeHTTP2Data() [node]
 4: 0x9f2ec0 node::http2::Http2Session::OnStreamRead(long, uv_buf_t const&) [node]
 5: 0xad941b node::TLSWrap::ClearOut() [node]
 6: 0xadb878 node::TLSWrap::OnStreamRead(long, uv_buf_t const&) [node]
 7: 0xa71e66 node::LibuvStreamWrap::OnUvRead(long, uv_buf_t const*) [node]
 8: 0x12baaa9  [node]
 9: 0x12bb0d0  [node]
10: 0x12c0c08  [node]
11: 0x12af54b uv_run [node]
12: 0xa00127 node::NodeMainInstance::Run() [node]
13: 0x993dd8 node::Start(int, char**) [node]
14: 0x7fc6d2cf42e1 __libc_start_main [/lib/x86_64-linux-gnu/libc.so.6]
15: 0x932cb5  [node]

@pjm17971 have you been observing these errors as well?

I was not able to reliably correlate these crashes with the spikes in unacked messages though.

Also, according to this grpc issue the crash problem might have been fixed in grpc-js-0.6.13, which I haven't been able to verify yet because at least grpc-js-0.6.16 seems to be causing a whole different range of problems… Will let you know if I find out more.

@ctavan
Contributor

ctavan commented Feb 19, 2020

I just realized that in the original issue description the spikes appear exactly on the full hour. That is not the case in my setup, where the spikes appear at irregular intervals, around 2-5 times per day.

@ctavan
Contributor

ctavan commented Feb 21, 2020

Just a quick confirmation that the issue is gone for me since switching back to the C++ bindings. I need to wait for #890 before I can try out grpc-js again.

@pjm17971
Author

@feywind Sorry I didn't reply, but it's kind of a non-update from my end: we went back to the C++ gRPC implementation and our process behaved fine. We couldn't reproduce the problem easily in a non-production environment, so we did not attempt to deploy again with the bad library version. We can try a new version of grpc-js down the road once a fix is believed to be in place and report back on whether it works for us. Sounds like #890 has this problem covered.

@feywind
Collaborator

feywind commented Feb 28, 2020

Thanks, everyone! There's a PR out now to update to the grpc-js version that should fix the issues in #890. I'll update here again once that's merged; an update to nodejs-pubsub should then pull it in.

@feywind feywind removed the external (This issue is blocked on a bug with the actual product.) label Feb 28, 2020
@yoshi-automation yoshi-automation added the 🚨 (This issue needs some love.) label Feb 28, 2020
@ctavan
Contributor

ctavan commented Mar 3, 2020

I have removed the C++ bindings again and upgraded to @grpc/grpc-js@0.6.18 (which fixed #890 for me). I have not observed a single spike for 24h now, a situation I have never seen with @grpc/grpc-js@0.6.9.

So for the time being I conclude that

@google-cloud/pubsub@1.5.0
@grpc/grpc-js@0.6.18

is finally a stable setup for my use case again!

@pjm17971 I think the above versions are a setup that is worth trying out.

@feywind
Collaborator

feywind commented Mar 3, 2020

Thanks for the update! I'm working on getting the default grpc-js version pushed up to 0.6.18. There might be a next tag coming up that will pull in 0.7.0, but the timing on that is TBD.

@feywind
Collaborator

feywind commented Apr 23, 2020

grpc-js is up to 0.6.18 in the last few releases, and this issue hasn't seen much traffic lately, so I'm going to go ahead and close it. Please feel free to re-open/comment if necessary.

@feywind feywind closed this as completed Apr 23, 2020