grpc servers hanging with a "connection attempt timed out before receiving SETTINGS frame" error #36256
Comments
We also ran py-spy dump and py-spy record on the process while it was in this bad state. All the worker threads were idle (all in the middle of doing networking things like posting to a requests session, which is not a surprise since our work is pretty I/O bound), and when we ran a
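For reference, the kind of py-spy invocations described above look roughly like this; the PID is a placeholder and the flags are just the common ones, not necessarily the exact commands that were run:

    # print a one-off stack dump of every thread in the stuck process
    py-spy dump --pid <PID>
    # sample the process for 60 seconds and write a flame graph
    py-spy record --pid <PID> --duration 60 -o profile.svg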
One stuck thread was also in the middle of opening its own channel to another grpc service (that's expected in the code that we're running in the server API), although I have no particular reason to believe that's significant or caused the issue. The rest were all doing requests connection work.
Actually a bit of overlap with #36098 here potentially, depending on what they meant by “freeze”, since one of the threads that we dumped was trying to create a grpc client. Could also be a coincidence.
Just coming out of debugging a stalling Apache Beam pipeline running on Google Cloud Dataflow and nailing it down to a dependency bump from grpcio 1.60.0 to 1.62.1. Profiling revealed all the walltime was spent past this wait https://github.com/grpc/grpc/blob/v1.62.1/src/python/grpcio/grpc/_channel.py#L959 ending up with what looks like deadlocking on a Python threading lock. So something might have happened between these two versions.
@DerRidda thanks for corroborating. We have some early evidence that rolling back just to 1.62.0 may be sufficient to resolve the problem, but not conclusive yet.
Scratch that - we are still reproducing on 1.62.0 - trying 1.60.0 next. The first report we received of this issue hanging processes was also on February 9th, about a week after 1.60.1 was released. |
Also not actually fixed on my end after all. Might be a combination of dependencies evolving around grpc. Will try more next week. |
@gibsondan Judging from the Apache Beam SDK, it might actually be grpcio 1.59.3 that I have running as the known-unaffected version. See: apache/beam#30867 (comment) I can't test this now, but I strongly recommend starting there; it might be that all of 1.6x is affected by whatever this issue is.
The user I've been working with hasn't seen the issue happen since downgrading to 1.60.0, and it was happening pretty reliably before. Still monitoring to see if it pops up again.
Hi all, based on the error message, it appears there might be an issue with either the transport or internet connection. To determine if this is a regression, we'll need more logs. Please enable those env vars to collect logs from gRPC core:
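The specific variables requested didn't survive in this copy of the thread; the usual pair for collecting gRPC core logs looks like this (the trace targets listed here are only an example and can be adjusted):

    GRPC_VERBOSITY=debug
    GRPC_TRACE=http,tcp,connectivity_state,call_error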
I'm not in a position to run that since I don't have a local reproduction myself, but I will recommend it the next time somebody reports this. Maybe you are able to do that, @DerRidda?
So yeah, I managed to beat my application into shape with a mixture of dropping the protobuf and grpcio versions to known good ones, actually older versions than that. I am pretty sure 1.60.0 is also fine, as is 1.59.3, but that alone wasn't it for me. I used a big hammer and fixed probably more dependencies into place than needed:
    grpcio = "1.59.3"
    protobuf = "4.25.1"
    googleapis-common-protos = "1.61.0"
    google-cloud-core = "2.3.3"
    google-api-core = "2.14.0"
    google-api-python-client = "2.109.0"
The top two might be most interesting for a generic use case outside of Apache Beam on Dataflow or the GCP ecosystem in general, @gibsondan. Could be related to #36247?
Summary: While we don't have a conclusive answer to the sporadic reports of grpc server hangs, evidence is mounting to support a pin:
- At least one user who was reliably hitting the hang in grpc/grpc#36256 reported it going away after downgrading to 1.60.0 (I went one version lower here, to 1.59.3, because I didn't want to pin in the middle of a minor version)
- The comments on that issue from other users also report earlier versions helping (although they also reported needing to downgrade some other dependencies too - would want more data points to support that before we add more pins)
Test Plan: BK
I have a fairly reliable repro, and some details in apache/beam#30867 (comment)
@tvalentyn Thanks for the excellent debugging!
I can probably give you access to a VM with a stuck process.
so far I was not able to repro in a simpler setup |
particularly:
looks like we might need to have debug symbols for cygrpc.cpython-38-x86_64-linux-gnu.so to get more info |
also, i think the process i looked at got "unstuck" after 90 min or so. |
Will try to rebuild grpc dependency with debug symbols as follows
and supply it in a custom container for the dataflow job.
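The exact build steps weren't preserved above; a plausible sketch of forcing a from-source grpcio build with debug symbols (assuming a pip-based image build; the version and flags here are illustrative) would be:

    # force a source build of grpcio and keep debug info in the extension
    GRPC_PYTHON_BUILD_WITH_CYTHON=1 CFLAGS="-g -O0" \
        pip install --no-binary grpcio grpcio==1.62.1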
I cannot reproduce the error when I use the build of gRPC from sources. I tried setting
Could we make resolution of this issue a blocker for release v1.63.0?
The C++ team needs this release to unblock GCS; is there any particular reason you want to block the release? From the context it looks like it's not a regression introduced in v1.63.0 and can be temporarily resolved by pinning to a lower version.
Mostly to not have to add an upper bound on grpcio, since that can sometimes make dependency resolution complicated down the road. OK, we'll add an upper bound for now then.
I would love to know what version would work as an upper bound if we have any leads on that front.
I can repro apache/beam#30927 on grpcio==1.60.0 and cannot repro on 1.59.x, so I'll set the bound to "grpcio<1.60.0"
Scratch that, I can still repro on
As @DerRidda mentions in #36256 (comment), this might involve multiple dependencies, or some other dependency may be at play in this regression.
Ok, the issue i have been investigating so far appears to match: googleapis/python-bigtable#949, which will be fixed in the upcoming release of grpcio. Any of the following mitigations help:
I've also been getting the same error in my clients with
However, not all clients receive the error - only around 30% of them. I've also been using gRPC 1.62.1 in Python 3.11. I've tried changing the version of grpcio for the server and client; I also get this error in versions 1.56.0, 1.58.0, 1.63.0rc1 and 1.63.0rc2. UPDATE: I wasn't properly closing the gRPC server, and that was leaving a process using the same port, causing the error. Make sure you are properly closing/killing all processes related to gRPC in Python...
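Since the root cause in that case was an orphaned server process holding the port, a minimal sketch of a clean start/stop cycle for a gRPC Python server might look like this (the port and grace period are arbitrary examples, and servicer registration is omitted):

    from concurrent import futures
    import grpc

    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    server.add_insecure_port("[::]:50051")  # example port
    # servicers would be registered here before start()
    server.start()
    try:
        server.wait_for_termination()
    except KeyboardInterrupt:
        pass
    finally:
        # allow in-flight RPCs up to 5 seconds to finish, then release the port
        server.stop(grace=5).wait()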
We have backported the fix to v1.62.2; updating to this version or above should fix the issue. Please open a new issue if you still have issues using v1.62.2 or later versions.
Thank you for the fix!
What version of gRPC and what language are you using?
1.62.1, python
What operating system (Linux, Windows,...) and version?
Linux
What runtime / compiler are you using (e.g. python version or version of gcc)
python 3.11
What did you do?
We run a project that is built on grpc and involves running a grpc server.
In the last few months, several different users have reported an issue that always has the same commonalities: the grpc server process is still running and its threads still appear to be ready to serve requests when inspected via py-spy, but calls to the server fail with the error "connection attempt timed out before receiving SETTINGS frame". This is different than the error message I'm used to seeing when a gRPC server is totally inaccessible or is down - I'm more accustomed to an error message like "Failed to connect to remote host: Connection refused".
Unfortunately I do not have a simple or reliable repro for this, but I'm wondering if you all have any recommendations for additional debugging flags we could add or more information that would be helpful to get to the bottom of what might be going on here - or if this error message clearly indicates that we are hitting some timeout with a value that we could tune. Thanks in advance for any guidance you can provide.
What did you expect to see?
A running grpc server
What did you see instead?
A "hanging" grpc server that returns "connection attempt timed out before receiving SETTINGS frame"
Anything else we should know about your project / environment?
If it's helpful context, the way we initialize our grpc server can be found here - we pass a ThreadPoolExecutor into a new grpc server object: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_grpc/server.py#L1184-L1192
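For readers who don't want to follow the link, the pattern described is the standard one of handing an explicit ThreadPoolExecutor to grpc.server(); the sketch below is generic and not the dagster code itself (the worker count and server options are made-up examples):

    from concurrent import futures
    import grpc

    # the executor bounds how many RPCs the server handles concurrently
    executor = futures.ThreadPoolExecutor(max_workers=16)
    server = grpc.server(
        executor,
        options=[("grpc.max_receive_message_length", 50 * 1024 * 1024)],
    )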