Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

debugpy sometimes fails to start up #1064

Closed
roblourens opened this issue Sep 23, 2022 · 6 comments
Closed

debugpy sometimes fails to start up #1064

roblourens opened this issue Sep 23, 2022 · 6 comments
Assignees
Labels
bug Something isn't working

Comments

@roblourens
Copy link
Member

Before creating a new issue, please check the FAQ to see if your question is answered there.

Environment data

  • debugpy version: 1.6.3
  • OS and version: ubuntu-latest
  • Python version (& distribution if applicable, e.g. Anaconda): 3.8
  • Using VS Code or Visual Studio: VS Code

Continuing my investigation of flakiness in VS Code's run-by-line, here's an issue that I've seen a few times in CI (https://github.com/microsoft/vscode-jupyter/actions/runs/3102518942/jobs/5024891548)

From VS Code's side, we try to start debugging, and send an "initialize" request.

The debugpy logs show that the server spawns the adapter, then waits for it to call back on some port. The adapter starts, tries to call back, but it seems that the server does not get that request. At least this is my interpretation of what I'm seeing. I'm interested in any ideas you have about what could be going on. Maybe it's a case of trying to use a port that turns out to not be free (I filed an issue related to that in the ipykernel repo)

Does_not_stop_in_other_cell_-_18-debugpy.adapter-10546.log
Does_not_stop_in_other_cell_-_18-debugpy.pydevd.10530.log
Does_not_stop_in_other_cell_-_18-debugpy.server-10530.log

@int19h
Copy link
Contributor

int19h commented Sep 26, 2022

It's interesting that the adapter doesn't report any error from sending the port information to the server. Nor did it get blocked - there wouldn't be the "all debug servers disconnected" message in the log if that were the case. So it did successfully connect to something on that port, and sent a TCP packet there.

This seems to imply that something else hijacked that port. The server tries to prevent port clashes by using an ephemeral port and letting the OS pick the port number:

endpoints_listener = sockets.create_server("127.0.0.1", 0, timeout=10)

(On Windows, we also set SO_EXCLUSIVEADDRUSE, but there's no equivalent on other platforms that I know of.)

However, this would require for the hijack to happen between the listener socket being opened, and the adapter process spawned and getting to the point in its startup where it tries to connect to said port. Which is a pretty short interval, so if that's the case, I wouldn't expect it to happen often.

I'm not well versed in the Linux networking stack, but perhaps there's some system-wide way to log which processes are opening ports for listening?

@int19h int19h added the bug Something isn't working label Sep 27, 2022
@roblourens
Copy link
Member Author

I picked up this failure on a CI machine, which may be busy with random processes taking different ports. But saw it more than once, and if it happened there, can happen somewhere else too. Is there anything else I can do on my end to help?

@int19h
Copy link
Contributor

int19h commented Oct 19, 2022

I think part of the problem is that we use SO_REUSEADDR for all our server sockets. We originally added this back when there was only one socket (for DAP), because users were hitting issues whereby the debuggee crashing or being force-killed resulted in a lingering port allocation for a couple more minutes before the OS timeout would kick in, during which debugpy --listen on the same port would result in an error.

However, this logic doesn't apply to ephemeral ports that we use for internal communication - yet we use the same helper function to create listener sockets for them. I did some more digging on the exact effect of this combination, and it turns out that SO_REUSEADDR can reclaim a port from a bound open socket so long as it isn't listening; when port=0, this means applies to all potential candidates for the newly allocated port. So if two processes concurrently request an ephemeral port for a socket with SO_REUSEADDR set, there's a window for a race between bind() and listen() on that socket.

I'm not 100% sure that this is the root cause, but it seems the most likely candidate, especially if you run tests that involve debugpy concurrently. So let's fix that first and see if it helps.

@roblourens
Copy link
Member Author

especially if you run tests that involve debugpy concurrently

Tests are run one at a time, but I'm not sure there's a guarantee that the previous test will be done shutting down before the next one starts. Thanks for looking into this!

@int19h
Copy link
Contributor

int19h commented Oct 20, 2022

The tentative fix is merged now; let me know if this helped.

@roblourens
Copy link
Member Author

roblourens commented Nov 28, 2022

I think that did it- a few hundred runs in CI and I haven't seen this again yet. Thanks!

@int19h int19h closed this as completed Jan 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants