debugpy sometimes fails to start up #1064

roblourens · 2022-09-23T02:22:00Z

Before creating a new issue, please check the FAQ to see if your question is answered there.

Environment data

debugpy version: 1.6.3
OS and version: ubuntu-latest
Python version (& distribution if applicable, e.g. Anaconda): 3.8
Using VS Code or Visual Studio: VS Code

Continuing my investigation of flakiness in VS Code's run-by-line, here's an issue that I've seen a few times in CI (https://github.com/microsoft/vscode-jupyter/actions/runs/3102518942/jobs/5024891548)

From VS Code's side, we try to start debugging, and send an "initialize" request.

The debugpy logs show that the server spawns the adapter, then waits for it to call back on some port. The adapter starts, tries to call back, but it seems that the server does not get that request. At least this is my interpretation of what I'm seeing. I'm interested in any ideas you have about what could be going on. Maybe it's a case of trying to use a port that turns out to not be free (I filed an issue related to that in the ipykernel repo)

Does_not_stop_in_other_cell_-_18-debugpy.adapter-10546.log
Does_not_stop_in_other_cell_-_18-debugpy.pydevd.10530.log
Does_not_stop_in_other_cell_-_18-debugpy.server-10530.log

int19h · 2022-09-26T21:18:24Z

It's interesting that the adapter doesn't report any error from sending the port information to the server. Nor did it get blocked - there wouldn't be the "all debug servers disconnected" message in the log if that were the case. So it did successfully connect to something on that port, and sent a TCP packet there.

This seems to imply that something else hijacked that port. The server tries to prevent port clashes by using an ephemeral port and letting the OS pick the port number:

debugpy/src/debugpy/server/api.py

Line 158 in 01b1c7b

endpoints_listener = sockets.create_server("127.0.0.1", 0, timeout=10)

(On Windows, we also set SO_EXCLUSIVEADDRUSE, but there's no equivalent on other platforms that I know of.)

However, this would require for the hijack to happen between the listener socket being opened, and the adapter process spawned and getting to the point in its startup where it tries to connect to said port. Which is a pretty short interval, so if that's the case, I wouldn't expect it to happen often.

I'm not well versed in the Linux networking stack, but perhaps there's some system-wide way to log which processes are opening ports for listening?

roblourens · 2022-10-14T00:47:56Z

I picked up this failure on a CI machine, which may be busy with random processes taking different ports. But saw it more than once, and if it happened there, can happen somewhere else too. Is there anything else I can do on my end to help?

int19h · 2022-10-19T20:18:28Z

I think part of the problem is that we use SO_REUSEADDR for all our server sockets. We originally added this back when there was only one socket (for DAP), because users were hitting issues whereby the debuggee crashing or being force-killed resulted in a lingering port allocation for a couple more minutes before the OS timeout would kick in, during which debugpy --listen on the same port would result in an error.

However, this logic doesn't apply to ephemeral ports that we use for internal communication - yet we use the same helper function to create listener sockets for them. I did some more digging on the exact effect of this combination, and it turns out that SO_REUSEADDR can reclaim a port from a bound open socket so long as it isn't listening; when port=0, this means applies to all potential candidates for the newly allocated port. So if two processes concurrently request an ephemeral port for a socket with SO_REUSEADDR set, there's a window for a race between bind() and listen() on that socket.

I'm not 100% sure that this is the root cause, but it seems the most likely candidate, especially if you run tests that involve debugpy concurrently. So let's fix that first and see if it helps.

…rosoft#1064)

roblourens · 2022-10-20T15:11:04Z

especially if you run tests that involve debugpy concurrently

Tests are run one at a time, but I'm not sure there's a guarantee that the previous test will be done shutting down before the next one starts. Thanks for looking into this!

int19h · 2022-10-20T17:23:16Z

The tentative fix is merged now; let me know if this helped.

roblourens · 2022-11-28T17:39:16Z

I think that did it- a few hundred runs in CI and I haven't seen this again yet. Thanks!

int19h added the bug Something isn't working label Sep 27, 2022

int19h self-assigned this Oct 19, 2022

int19h pushed a commit to int19h/debugpy that referenced this issue Oct 20, 2022

Do not use SO_REUSEADDR et all when listening on ephemeral ports (mic…

84248fd

…rosoft#1064)

int19h pushed a commit that referenced this issue Oct 20, 2022

Do not use SO_REUSEADDR et all when listening on ephemeral ports (#1064)

e7fb50a

int19h closed this as completed Jan 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

debugpy sometimes fails to start up #1064

debugpy sometimes fails to start up #1064

roblourens commented Sep 23, 2022

int19h commented Sep 26, 2022

roblourens commented Oct 14, 2022

int19h commented Oct 19, 2022

roblourens commented Oct 20, 2022

int19h commented Oct 20, 2022

roblourens commented Nov 28, 2022 •

edited

Loading

debugpy sometimes fails to start up #1064

debugpy sometimes fails to start up #1064

Comments

roblourens commented Sep 23, 2022

Environment data

int19h commented Sep 26, 2022

roblourens commented Oct 14, 2022

int19h commented Oct 19, 2022

roblourens commented Oct 20, 2022

int19h commented Oct 20, 2022

roblourens commented Nov 28, 2022 • edited Loading

roblourens commented Nov 28, 2022 •

edited

Loading