Persistent GRPC Internal Failure after 2-6 hours - How to enhance auto recovery #153

Closed
pdykes opened this issue Jul 6, 2021 · 3 comments

Comments

@pdykes

pdykes commented Jul 6, 2021

After 2-6 hours, every CRUD call starts returning a persistent error that bubbles up from gRPC and does not clear until the instance is restarted:

"GRPCInternalError: 13 INTERNAL: Received RST_STREAM with code 2 triggered by internal client error: read ETIMEDOUT"

After experimenting and fully validating that networking is NOT the issue, I'm looking for advice on how best to configure the Node etcd3 client for long-running transactional use. I have 3 instances and see the issue in all of them (they are configured identically and run the same code levels, so that is not surprising).

Based on last week's testing of the latest version of this module, it appears that after 2-8 hours the client hits some gRPC timeout, and from that point on the CRUD methods return this error. The watcher also becomes unstable. A second question: is the answer to configure gRPC directly or to go through etcd3? (I was wondering whether etcd3 tunes gRPC itself, in which case configuring gRPC independently may be a bad thing to pursue.)

If I restart my Kubernetes pods, it works again. I have looked at the docs on recovery, but I am looking for a keep-alive at the gRPC level, set via the etcd3 API/config, that would make the client code more resilient.

Thanks
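
For anyone landing here with the same keep-alive question, here is a minimal sketch of what passing gRPC keep-alive settings through the client options might look like. It assumes the etcd3 client forwards a `grpcOptions` object to the underlying gRPC channel and that the channel honors `grpc.keepalive_time_ms` and `grpc.keepalive_timeout_ms`. The helper name `buildKeepaliveOptions` and the default values are hypothetical, chosen only for illustration, and nothing in this thread confirms that keep-alive settings resolve the ETIMEDOUT error.

```typescript
// Subset of client options used in this sketch. `grpcOptions` is assumed to be
// forwarded unchanged to the gRPC channel by the etcd3 client.
interface KeepaliveClientOptions {
  hosts: string | string[];
  grpcOptions: Record<string, number>;
}

// Hypothetical helper: builds an options object with gRPC keep-alive pings.
// The defaults (30 s ping interval, 10 s ping timeout) are illustrative only.
function buildKeepaliveOptions(
  hosts: string | string[],
  keepaliveTimeMs: number = 30000,
  keepaliveTimeoutMs: number = 10000,
): KeepaliveClientOptions {
  return {
    hosts,
    grpcOptions: {
      // How often to send a keep-alive ping on the connection.
      "grpc.keepalive_time_ms": keepaliveTimeMs,
      // How long to wait for the ping ack before treating the connection as dead.
      "grpc.keepalive_timeout_ms": keepaliveTimeoutMs,
    },
  };
}

// Usage (requires the etcd3 package, not imported here):
//   import { Etcd3 } from "etcd3";
//   const client = new Etcd3(buildKeepaliveOptions(["127.0.0.1:2379"]));
```

Keeping the settings in one options object means gRPC is configured through etcd3 rather than alongside it, which avoids the "configuring orthogonally" concern raised above.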

@pdykes
Author

pdykes commented Jul 10, 2021

Follow up:

I wrapped the lease creation and lease.put in a try/catch, so I can catch the error and try again. However, the second put attempt against the lease always gets a circuit breaker exception. Any advice?
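
One possible reading of the circuit breaker exception is that the retry happens immediately, while the breaker is still open and rejecting calls. A hedged sketch of a retry that waits between attempts follows. The helper `retryWithBackoff`, its parameters, and the delays are hypothetical and not part of etcd3, and whether a delay is enough depends on the client's fault-handling policy, which this thread does not describe.

```typescript
// Resolve after the given number of milliseconds.
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: runs `operation` up to `maxAttempts` times, doubling the
// wait after each failure (baseDelayMs, 2x, 4x, ...). Rethrows the last error
// if every attempt fails.
async function retryWithBackoff<T>(
  operation: (attempt: number) => Promise<T>,
  maxAttempts: number = 5,
  baseDelayMs: number = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Give an open circuit breaker time to cool down before the next try.
        await sleep(baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  throw lastError;
}

// Usage (assumes an existing etcd3 `client`; the key, value and TTL are
// placeholders). A fresh lease is created on every attempt instead of
// reusing the one that failed:
//   await retryWithBackoff(async () => {
//     const lease = client.lease(10);
//     await lease.put("some/key").value("some-value");
//   });
```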

@pdykes
Author

pdykes commented Aug 2, 2021

FYI, this issue continues with the latest build, pulling in clean dependencies. Tracing assistance or any suggestions for the Node library would be great. Thanks.

@pdykes
Author

pdykes commented Aug 10, 2021

I wanted to update folks. I looked over the @grpc/grpc-js changes for the pure JavaScript client: there were some issues in versions before 1.3.5, and a 1.3.6 version was released recently. I took all my builds, ensured they were on 1.3.6 rather than an earlier version, and restarted all testing. So far, so good: the ETIMEDOUT error described above has pretty much disappeared. I noticed a 1.3.7 was released over the weekend. I am behind, so I'm going to stick with 1.3.6 for now, but this is a heads-up in case anyone is following this. It took a painful 6 weeks to find that this seemed to fix the issue.

@pdykes pdykes closed this as completed Aug 10, 2021