New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stream: retry write on EPROTOTYPE on OS X #482
Conversation
FWIW I also tested this change on OS X 10.9 (where |
I followed the links you mention and the change looks ok. If I'm reading the blog post correctly, this happens when the send is processed while the socket is being torn down, not after it was abruptly closed; can you please clarify that in the commit message? Any chance you could write a test for this? |
Can you add some in-source comments as well? It's a rather obscure bug, it's not obvious from just looking at the code what the EPROTOTYPE check is for. |
At least on OS X 10.10 "Yosemite", an EPROTOTYPE can occur when trying to write to a socket that is shutting down. By retrying the write after EPROTOTYPE, we correctly get EPIPE.
428e338
to
94067e4
Compare
@bnoordhuis Comments added. @saghul To be honest, I'm not sure how to trigger this at the libuv layer. |
@mscdex From the blog post linked in the issue:
Maybe you can try to port this one to libuv. |
@saghul Right, but I am not able to get the timing right or something when I tried my hand at it. I'm not very familiar with the libuv internals. I always get |
Since it seems inherently racey, maybe we could let this one slide :-) @bnoordhuis WDYT? |
I guess so. LGTM. |
At least on OS X 10.10 "Yosemite", an EPROTOTYPE can occur when trying to write to a socket that is shutting down. By retrying the write after EPROTOTYPE, we correctly get EPIPE. PR-URL: #482 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: Saúl Ibarra Corretgé <saghul@gmail.com>
Hey all, I think this workaround might be the cause of a bigger issue on newer versions on MacOSX (I've tested it on BigSur and Monterey). If a certain kind of network extension (AppProxy or TransparentProxy extension) is enabled, any sockets that were opened prior to the extension becoming active will return an EPROTOTYPE error that is not a transient error. See https://developer.apple.com/forums/thread/681135 for more info. This is not a libuv specific issue -- see the ssh fix committed upstream referenced in the above post. This affects Electron based apps such as Slack and Signal, and it's pretty easily reproducible. With a flow-based VPN, merely starting/stopping an audio call on Slack, or just starting the network extension after Signal has started, will cause a 100% CPU usage pattern. On Signal, the activity monitor and CPU profile clearly show most of the time is spent inside libuv / uv_write2. Running dtruss on the process shows a large number of write() syscalls failing with errno=41 (EPROTOTYPE). For Slack, however, I haven't been able to get a clean indication from dtruss / Instruments as to why it's spinning. It's more of a conjecture that the two issues are related. Should the fix in this PR at least be conditional on the OSX version? Looking at the comment tokio-rs/mio#1364 (comment), newer MacOS versions might not be hit by this issue. Since documentation is very scarce on this, and this fix has been copied onto other projects (Ruby, dotnet/runtime, etc.), I believe there should at least be a guard against a potential infinite loop here. :) What do you all think? |
macOS never ceases to be a clusterfuck of software bugs... For a company with the resources of Apple you'd think they could at least keep something as basic as an operating system in working shape. Let me summarize and see if my understanding is correct:
Is that an accurate summary? |
@bnoordhuis I believe that's accurate based on my reading so far. Re (3): I would reword that as "If some OS features are turned on, write() calls on sockets that were connected before the OS feature was enabled, return EPROTOTYPE errors." That said, while I share your frustration with Apple's lack of documentation and transparency about these bugs, the workaround in this PR could have been better in the sense that it could have guarded against the infinite loop, rather than make a strong assumption about the (undocumented) transient nature of the EPROTOTYPE error. But I see the tension here, it's unclear how many retries are needed before we give up and bubble up the error! :) @stephentoub was spot on with his review comment on dotnet/corefx repository: https://github.com/dotnet/corefx/pull/37208/files/116dd13f10017f2991aefa9ad21273395526f1ff?w=1#diff-6c0d8c7b971aa7240f2b70aeb9fb718ed207b0fee11b8bb4c2bb95496a818187. |
Indeed, that turns it into a flaky bug and that's arguably worse. I'm unsure what the best way forward is. Give up after x amount of wall clock time has passed? That fixes the infinite loop but introduces a throughput choke point. |
@bnoordhuis Some ideas..
It might not be a throughput choke point because the error seems to indicate that the socket must be closed and reopened, so it's better to bubble this error up... After reopening the socket, things ought to succeed. This typically corresponds to VPN enable/disable system events that usually disrupt live connections anyway. |
@libuv/collaborators This needs your input, see ^. I'm inclined to just remove the workaround and see what happens. |
It's been reported in the past that OS X 10.10, because of a race condition in the XNU kernel, sometimes returns a transient EPROTOTYPE error when trying to write to a socket. Libuv handles that by retrying the operation until it succeeds or fails with a different error. Recently it's been reported that current versions of the operating system formerly known as OS X fail permanently with EPROTOTYPE under certain conditions, resulting in an infinite loop. Because Apple isn't exactly forthcoming with bug fixes or even details, I'm opting to simply remove the workaround and have the error bubble up. Refs: libuv#482
I gave in to my inclination and opened #3405. |
We can't realistically claim to support 10.7 or any version that Apple no longer supports so let's bump the baseline to something more realistic. Refs: libuv#482 Refs: libuv#3405
It's been reported in the past that OS X 10.10, because of a race condition in the XNU kernel, sometimes returns a transient EPROTOTYPE error when trying to write to a socket. Libuv handles that by retrying the operation until it succeeds or fails with a different error. Recently it's been reported that current versions of the operating system formerly known as OS X fail permanently with EPROTOTYPE under certain conditions, resulting in an infinite loop. Because Apple isn't exactly forthcoming with bug fixes or even details, I'm opting to simply remove the workaround and have the error bubble up. Refs: #482
macOS versions 10.10 and 10.15 - and presumbaly 10.11 to 10.14, too - have a bug where a race condition causes the kernel to return EPROTOTYPE because the socket isn't fully constructed. It's probably the result of the peer closing the connection and that is why libuv translates it to ECONNRESET. Previously, libuv retried until the EPROTOTYPE error went away but some VPN software causes the same behavior except the error is permanent, not transient, turning the retry mechanism into an infinite loop. Refs: libuv#482 Refs: libuv#3405
I reproduced the EPROTOTYPE on macos 10.15 today:
(error code -41 is EPROTOTYPE negated, that's how libuv passes around errors.) I've opened #3413 to map the error to ECONNRESET, which is an expected error and probably appropriate given the circumstances. |
macOS versions 10.10 and 10.15 - and presumbaly 10.11 to 10.14, too - have a bug where a race condition causes the kernel to return EPROTOTYPE because the socket isn't fully constructed. It's probably the result of the peer closing the connection and that is why libuv translates it to ECONNRESET. Previously, libuv retried until the EPROTOTYPE error went away but some VPN software causes the same behavior except the error is permanent, not transient, turning the retry mechanism into an infinite loop. Refs: #482 Refs: #3405
We can't realistically claim to support 10.7 or any version that Apple no longer supports so let's bump the baseline to something more realistic. Refs: libuv#482 Refs: libuv#3405
It's been reported in the past that OS X 10.10, because of a race condition in the XNU kernel, sometimes returns a transient EPROTOTYPE error when trying to write to a socket. Libuv handles that by retrying the operation until it succeeds or fails with a different error. Recently it's been reported that current versions of the operating system formerly known as OS X fail permanently with EPROTOTYPE under certain conditions, resulting in an infinite loop. Because Apple isn't exactly forthcoming with bug fixes or even details, I'm opting to simply remove the workaround and have the error bubble up. Refs: libuv#482
macOS versions 10.10 and 10.15 - and presumbaly 10.11 to 10.14, too - have a bug where a race condition causes the kernel to return EPROTOTYPE because the socket isn't fully constructed. It's probably the result of the peer closing the connection and that is why libuv translates it to ECONNRESET. Previously, libuv retried until the EPROTOTYPE error went away but some VPN software causes the same behavior except the error is permanent, not transient, turning the retry mechanism into an infinite loop. Refs: libuv#482 Refs: libuv#3405
We can't realistically claim to support 10.7 or any version that Apple no longer supports so let's bump the baseline to something more realistic. Refs: libuv#482 Refs: libuv#3405
Original commit log follows: darwin: remove EPROTOTYPE error workaround (nodejs#3405) It's been reported in the past that OS X 10.10, because of a race condition in the XNU kernel, sometimes returns a transient EPROTOTYPE error when trying to write to a socket. Libuv handles that by retrying the operation until it succeeds or fails with a different error. Recently it's been reported that current versions of the operating system formerly known as OS X fail permanently with EPROTOTYPE under certain conditions, resulting in an infinite loop. Because Apple isn't exactly forthcoming with bug fixes or even details, I'm opting to simply remove the workaround and have the error bubble up. Refs: libuv/libuv#482 Fixes: nodejs#43916
Original commit log follows: darwin: translate EPROTOTYPE to ECONNRESET (nodejs#3413) macOS versions 10.10 and 10.15 - and presumbaly 10.11 to 10.14, too - have a bug where a race condition causes the kernel to return EPROTOTYPE because the socket isn't fully constructed. It's probably the result of the peer closing the connection and that is why libuv translates it to ECONNRESET. Previously, libuv retried until the EPROTOTYPE error went away but some VPN software causes the same behavior except the error is permanent, not transient, turning the retry mechanism into an infinite loop. Refs: libuv/libuv#482 Refs: libuv/libuv#3405 Fixes: nodejs#43916
Original commit log follows: darwin: remove EPROTOTYPE error workaround (libuv/libuv#3405) It's been reported in the past that OS X 10.10, because of a race condition in the XNU kernel, sometimes returns a transient EPROTOTYPE error when trying to write to a socket. Libuv handles that by retrying the operation until it succeeds or fails with a different error. Recently it's been reported that current versions of the operating system formerly known as OS X fail permanently with EPROTOTYPE under certain conditions, resulting in an infinite loop. Because Apple isn't exactly forthcoming with bug fixes or even details, I'm opting to simply remove the workaround and have the error bubble up. Refs: libuv/libuv#482 Fixes: nodejs#43916
Original commit log follows: darwin: translate EPROTOTYPE to ECONNRESET (libuv/libuv#3413) macOS versions 10.10 and 10.15 - and presumbaly 10.11 to 10.14, too - have a bug where a race condition causes the kernel to return EPROTOTYPE because the socket isn't fully constructed. It's probably the result of the peer closing the connection and that is why libuv translates it to ECONNRESET. Previously, libuv retried until the EPROTOTYPE error went away but some VPN software causes the same behavior except the error is permanent, not transient, turning the retry mechanism into an infinite loop. Refs: libuv/libuv#482 Refs: libuv/libuv#3405 Fixes: nodejs#43916
Original commit log follows: darwin: remove EPROTOTYPE error workaround (libuv/libuv#3405) It's been reported in the past that OS X 10.10, because of a race condition in the XNU kernel, sometimes returns a transient EPROTOTYPE error when trying to write to a socket. Libuv handles that by retrying the operation until it succeeds or fails with a different error. Recently it's been reported that current versions of the operating system formerly known as OS X fail permanently with EPROTOTYPE under certain conditions, resulting in an infinite loop. Because Apple isn't exactly forthcoming with bug fixes or even details, I'm opting to simply remove the workaround and have the error bubble up. Refs: libuv/libuv#482 Fixes: #43916 PR-URL: #43950 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Original commit log follows: darwin: translate EPROTOTYPE to ECONNRESET (libuv/libuv#3413) macOS versions 10.10 and 10.15 - and presumbaly 10.11 to 10.14, too - have a bug where a race condition causes the kernel to return EPROTOTYPE because the socket isn't fully constructed. It's probably the result of the peer closing the connection and that is why libuv translates it to ECONNRESET. Previously, libuv retried until the EPROTOTYPE error went away but some VPN software causes the same behavior except the error is permanent, not transient, turning the retry mechanism into an infinite loop. Refs: libuv/libuv#482 Refs: libuv/libuv#3405 Fixes: #43916 PR-URL: #43950 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Original commit log follows: darwin: remove EPROTOTYPE error workaround (libuv/libuv#3405) It's been reported in the past that OS X 10.10, because of a race condition in the XNU kernel, sometimes returns a transient EPROTOTYPE error when trying to write to a socket. Libuv handles that by retrying the operation until it succeeds or fails with a different error. Recently it's been reported that current versions of the operating system formerly known as OS X fail permanently with EPROTOTYPE under certain conditions, resulting in an infinite loop. Because Apple isn't exactly forthcoming with bug fixes or even details, I'm opting to simply remove the workaround and have the error bubble up. Refs: libuv/libuv#482 Fixes: #43916 PR-URL: #43950 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Original commit log follows: darwin: translate EPROTOTYPE to ECONNRESET (libuv/libuv#3413) macOS versions 10.10 and 10.15 - and presumbaly 10.11 to 10.14, too - have a bug where a race condition causes the kernel to return EPROTOTYPE because the socket isn't fully constructed. It's probably the result of the peer closing the connection and that is why libuv translates it to ECONNRESET. Previously, libuv retried until the EPROTOTYPE error went away but some VPN software causes the same behavior except the error is permanent, not transient, turning the retry mechanism into an infinite loop. Refs: libuv/libuv#482 Refs: libuv/libuv#3405 Fixes: #43916 PR-URL: #43950 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Original commit log follows: darwin: remove EPROTOTYPE error workaround (libuv/libuv#3405) It's been reported in the past that OS X 10.10, because of a race condition in the XNU kernel, sometimes returns a transient EPROTOTYPE error when trying to write to a socket. Libuv handles that by retrying the operation until it succeeds or fails with a different error. Recently it's been reported that current versions of the operating system formerly known as OS X fail permanently with EPROTOTYPE under certain conditions, resulting in an infinite loop. Because Apple isn't exactly forthcoming with bug fixes or even details, I'm opting to simply remove the workaround and have the error bubble up. Refs: libuv/libuv#482 Fixes: #43916 PR-URL: #43950 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Original commit log follows: darwin: translate EPROTOTYPE to ECONNRESET (libuv/libuv#3413) macOS versions 10.10 and 10.15 - and presumbaly 10.11 to 10.14, too - have a bug where a race condition causes the kernel to return EPROTOTYPE because the socket isn't fully constructed. It's probably the result of the peer closing the connection and that is why libuv translates it to ECONNRESET. Previously, libuv retried until the EPROTOTYPE error went away but some VPN software causes the same behavior except the error is permanent, not transient, turning the retry mechanism into an infinite loop. Refs: libuv/libuv#482 Refs: libuv/libuv#3405 Fixes: #43916 PR-URL: #43950 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Original commit log follows: darwin: remove EPROTOTYPE error workaround (libuv/libuv#3405) It's been reported in the past that OS X 10.10, because of a race condition in the XNU kernel, sometimes returns a transient EPROTOTYPE error when trying to write to a socket. Libuv handles that by retrying the operation until it succeeds or fails with a different error. Recently it's been reported that current versions of the operating system formerly known as OS X fail permanently with EPROTOTYPE under certain conditions, resulting in an infinite loop. Because Apple isn't exactly forthcoming with bug fixes or even details, I'm opting to simply remove the workaround and have the error bubble up. Refs: libuv/libuv#482 Fixes: #43916 PR-URL: #43950 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Original commit log follows: darwin: translate EPROTOTYPE to ECONNRESET (libuv/libuv#3413) macOS versions 10.10 and 10.15 - and presumbaly 10.11 to 10.14, too - have a bug where a race condition causes the kernel to return EPROTOTYPE because the socket isn't fully constructed. It's probably the result of the peer closing the connection and that is why libuv translates it to ECONNRESET. Previously, libuv retried until the EPROTOTYPE error went away but some VPN software causes the same behavior except the error is permanent, not transient, turning the retry mechanism into an infinite loop. Refs: libuv/libuv#482 Refs: libuv/libuv#3405 Fixes: #43916 PR-URL: #43950 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Original commit log follows: darwin: remove EPROTOTYPE error workaround (libuv/libuv#3405) It's been reported in the past that OS X 10.10, because of a race condition in the XNU kernel, sometimes returns a transient EPROTOTYPE error when trying to write to a socket. Libuv handles that by retrying the operation until it succeeds or fails with a different error. Recently it's been reported that current versions of the operating system formerly known as OS X fail permanently with EPROTOTYPE under certain conditions, resulting in an infinite loop. Because Apple isn't exactly forthcoming with bug fixes or even details, I'm opting to simply remove the workaround and have the error bubble up. Refs: libuv/libuv#482 Fixes: nodejs#43916 PR-URL: nodejs#43950 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Original commit log follows: darwin: translate EPROTOTYPE to ECONNRESET (libuv/libuv#3413) macOS versions 10.10 and 10.15 - and presumbaly 10.11 to 10.14, too - have a bug where a race condition causes the kernel to return EPROTOTYPE because the socket isn't fully constructed. It's probably the result of the peer closing the connection and that is why libuv translates it to ECONNRESET. Previously, libuv retried until the EPROTOTYPE error went away but some VPN software causes the same behavior except the error is permanent, not transient, turning the retry mechanism into an infinite loop. Refs: libuv/libuv#482 Refs: libuv/libuv#3405 Fixes: nodejs#43916 PR-URL: nodejs#43950 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Original commit log follows: darwin: remove EPROTOTYPE error workaround (libuv/libuv#3405) It's been reported in the past that OS X 10.10, because of a race condition in the XNU kernel, sometimes returns a transient EPROTOTYPE error when trying to write to a socket. Libuv handles that by retrying the operation until it succeeds or fails with a different error. Recently it's been reported that current versions of the operating system formerly known as OS X fail permanently with EPROTOTYPE under certain conditions, resulting in an infinite loop. Because Apple isn't exactly forthcoming with bug fixes or even details, I'm opting to simply remove the workaround and have the error bubble up. Refs: libuv/libuv#482 Fixes: nodejs/node#43916 PR-URL: nodejs/node#43950 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Original commit log follows: darwin: translate EPROTOTYPE to ECONNRESET (libuv/libuv#3413) macOS versions 10.10 and 10.15 - and presumbaly 10.11 to 10.14, too - have a bug where a race condition causes the kernel to return EPROTOTYPE because the socket isn't fully constructed. It's probably the result of the peer closing the connection and that is why libuv translates it to ECONNRESET. Previously, libuv retried until the EPROTOTYPE error went away but some VPN software causes the same behavior except the error is permanent, not transient, turning the retry mechanism into an infinite loop. Refs: libuv/libuv#482 Refs: libuv/libuv#3405 Fixes: nodejs/node#43916 PR-URL: nodejs/node#43950 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Original commit log follows: darwin: remove EPROTOTYPE error workaround (libuv/libuv#3405) It's been reported in the past that OS X 10.10, because of a race condition in the XNU kernel, sometimes returns a transient EPROTOTYPE error when trying to write to a socket. Libuv handles that by retrying the operation until it succeeds or fails with a different error. Recently it's been reported that current versions of the operating system formerly known as OS X fail permanently with EPROTOTYPE under certain conditions, resulting in an infinite loop. Because Apple isn't exactly forthcoming with bug fixes or even details, I'm opting to simply remove the workaround and have the error bubble up. Refs: libuv/libuv#482 Fixes: #43916 PR-URL: #43950 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Original commit log follows: darwin: translate EPROTOTYPE to ECONNRESET (libuv/libuv#3413) macOS versions 10.10 and 10.15 - and presumbaly 10.11 to 10.14, too - have a bug where a race condition causes the kernel to return EPROTOTYPE because the socket isn't fully constructed. It's probably the result of the peer closing the connection and that is why libuv translates it to ECONNRESET. Previously, libuv retried until the EPROTOTYPE error went away but some VPN software causes the same behavior except the error is permanent, not transient, turning the retry mechanism into an infinite loop. Refs: libuv/libuv#482 Refs: libuv/libuv#3405 Fixes: #43916 PR-URL: #43950 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
At least on OS X 10.10 "Yosemite", an
EPROTOTYPE
can occur when trying to write to a stream that was abruptly closed. By retrying the write onEPROTOTYPE
, we correctly getEPIPE
.Related to nodejs/node#2382
Also, this issue is documented elsewhere in the wild.