
Requests seem not to respond anymore after upgrading to 1.10.2 #2690

Closed

julienfouilhe opened this issue Mar 14, 2024 · 14 comments

Comments

@julienfouilhe

julienfouilhe commented Mar 14, 2024

Problem description

I have a microservice that uses grpc-js both to serve requests and to make requests to other services.
After upgrading from 1.10.1 to 1.10.2, we noticed that a lot of requests were no longer going through. The request latency graph showed a spike shortly after we picked up the grpc-js upgrade; we downgraded immediately and the service went back to normal.

[Screenshot, 2024-03-14: request latency graph showing a spike shortly after the grpc-js upgrade]

Reproduction steps

I haven't tried to reproduce it locally yet; it just occurred, and I thought it would be best to report it immediately.
It also seems hard to reproduce: our pre-production environment did not encounter any issues, so I suspect it takes a certain number of requests per second before it happens.

Environment

  • OS name, version and architecture: docker node:20.8.1-alpine image
  • Node version: 20.8.1
  • Node installation method: docker
  • Package name and version: @grpc/grpc-js@1.10.2

Additional context

Client and server libraries are generated using protobuf-ts@2.9.3
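For context, the outbound calls look roughly like this. This is a minimal sketch, not our actual code: the proto file, service, and method names (example.proto, ExampleService, doSomething) are placeholders, and the real stubs are generated by protobuf-ts.

```ts
import * as grpc from '@grpc/grpc-js';
import * as protoLoader from '@grpc/proto-loader';

// Placeholder proto; the real clients are generated by protobuf-ts.
const packageDef = protoLoader.loadSync('example.proto');
const proto = grpc.loadPackageDefinition(packageDef) as any;

const client = new proto.example.ExampleService(
  'other-service:50051',
  grpc.credentials.createInsecure()
);

// A per-call deadline turns a silent hang into an explicit DEADLINE_EXCEEDED.
const deadline = new Date(Date.now() + 5_000);
client.doSomething({ id: 1 }, { deadline }, (err: grpc.ServiceError | null, res: unknown) => {
  if (err) {
    console.error('call failed:', err.code, err.details);
  } else {
    console.log('response:', res);
  }
});
```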

@murgatroid99
Member

Can you narrow down whether this is a problem on the request-handling side or the request-making side?

@julienfouilhe
Author

julienfouilhe commented Mar 14, 2024

It seems to be on the "making requests" side, as I can see logs coming in for the grpc-js server, but the service it's making requests to does not receive the requests (this other microservice is written in Rust and therefore does not run grpc-js).
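For anyone trying to narrow this down on their own services: grpc-js has built-in debug tracing controlled by environment variables. A sketch, assuming the standard GRPC_VERBOSITY/GRPC_TRACE variables; they must be set before grpc-js is first loaded:

```ts
// Usually set when launching the process, e.g.:
//   GRPC_VERBOSITY=DEBUG GRPC_TRACE=channel,subchannel,call_stream node app.js
//
// Setting them in code only works before @grpc/grpc-js is loaded, so use
// require here (ESM imports are hoisted above these assignments):
process.env.GRPC_VERBOSITY = 'DEBUG';
process.env.GRPC_TRACE = 'channel,subchannel,call_stream';
const grpc = require('@grpc/grpc-js');
```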

@acdcjunior

Hey, we faced a similar issue after upgrading to 1.10.2. The best we know so far: after some time (not a lot), new requests just hang without getting any data from the server.
We have many instances of this service using grpc-js, and eventually all of them begin to fail and never recover. It could be related to idleness, since they don't all start failing at the same time.

Reverting the version made the issue go away.

@Ganitzsh

Ganitzsh commented Mar 15, 2024

I can confirm this; we had the exact same issue with multiple services running on GCP. Downgrading was the fix. Hopefully this gets addressed soon.

@udnes99

udnes99 commented Mar 15, 2024

Confirmed. We experienced the exact same issue, also for multiple services running on GCP.
It especially affected the Datastore Node client, where few requests succeeded because transactions would time out.

@julienfouilhe
Author

Yep, I forgot to mention it, but my services are also running on GCP. More specifically on Cloud Run, which does not allocate CPU outside of request handling (maybe that's a hint?).
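Since Cloud Run throttles CPU outside of request handling, channels can sit idle for long stretches. A common mitigation for that class of problem (a sketch only, not a confirmed fix for this regression) is to enable client-side keepalives via standard channel options:

```ts
import * as grpc from '@grpc/grpc-js';
import * as protoLoader from '@grpc/proto-loader';

// Standard grpc-js channel options; the values here are illustrative.
const channelOptions: grpc.ChannelOptions = {
  'grpc.keepalive_time_ms': 30_000,         // ping the server every 30s
  'grpc.keepalive_timeout_ms': 10_000,      // fail if no ping ack within 10s
  'grpc.keepalive_permit_without_calls': 1, // keep pinging even when idle
};

// Placeholder service as before; options go in the third constructor argument.
const proto = grpc.loadPackageDefinition(protoLoader.loadSync('example.proto')) as any;
const client = new proto.example.ExampleService(
  'other-service:50051',
  grpc.credentials.createInsecure(),
  channelOptions
);
```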

@murgatroid99
Member

I just published version 1.10.3 with a change that reverts my best guess for the cause of this problem. Please try it out.

@jeffijoe

We're experiencing a similar issue using Google Cloud Pub/Sub, which is gRPC-based, with the same symptoms: we're getting DEADLINE_EXCEEDED, or the client says it waited too long for response data.

Pushing an update to our backend now with 1.10.3 in the hope that it fixes things. We noticed that it happened during times of inactivity, so #2677 could also be a culprit, but we'll know more in the next few hours/days, as it was very random.

@murgatroid99
Member

@jeffijoe can you be more specific about what part of #2677 you think might cause this problem? If you're talking about the session idle timeout change, that shouldn't be relevant here, because this bug is on the client side and that was a server-side change.
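For reference, a minimal sketch of what a server-side session idle timeout looks like, assuming grpc.max_connection_idle_ms is the option behind that change (treat the option name and value as an assumption, not something confirmed from #2677):

```ts
import * as grpc from '@grpc/grpc-js';

// Assumed knob for the server session idle timeout: connections with no
// active calls for longer than this are closed by the server.
const server = new grpc.Server({
  'grpc.max_connection_idle_ms': 5 * 60_000, // close sessions idle for 5 min
});
```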

@jeffijoe

@murgatroid99 I was mostly skimming that one; I saw "idle" mentioned and figured it could be related, considering how often we've observed this during downtime (overnight), so don't mind me 😅

@michaelAtCoalesce

Is this issue the same as what we are seeing? We are seeing hangs in our usage of Firestore; after turning on the grpc traces, we see v1.10.2 in the logs.

firebase/firebase-admin-node#2495

@julienfouilhe
Author

@michaelAtCoalesce it seems to be the same issue yes.

@FredrikAugust

Is this fixed by 1.10.3?

@jeffijoe

We haven't seen the same issue reoccurring since we upgraded to 1.10.3.
