TaskExecutor should release port immediately prior to launching shell #365
Comments
@hungj can we close this issue given it's resolved?
Fixing this issue is a good way to mitigate the chance of a race condition causing port contention. However, the race condition still exists, though the window where it might happen has been reduced. I think we could eliminate the race condition entirely between different TaskExecutors running on the same machine as follows:
I think this would work, assuming:
+1 to @erwa, this sounds like a good enough synchronization mechanism for now.
If there's a large TonY app (e.g. >200 executors), it's possible for one TaskExecutor to set up its port long before another. By then the first TaskExecutor will have acquired and released a port, and the longer the gap in setup time between two executors, the more likely it is that some other TaskExecutor (from a different application) will try to set up on the same port.
The fix is for TaskExecutor to hold onto the port up until immediately before the underlying TF process is launched (or at least hold on to the port until the TaskExecutor receives the cluster spec, meaning all TaskExecutors in the same job have been set up) so that we block this port while the rest of the TaskExecutors are being allocated/setting up.
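A minimal sketch of the idea, written in Python for brevity rather than as TonY's actual code. `ReservedPort` and `launch_with_reserved_port` are hypothetical names introduced only for illustration. The reservation keeps a bound socket open so no other process can take the port, and closes it only immediately before the child process is started.

```python
import socket
import subprocess
import sys


class ReservedPort:
    """Hypothetical helper: holds a port by keeping a bound socket open."""

    def __init__(self, host=""):
        # Binding to port 0 lets the OS pick a free port; keeping the
        # socket open blocks other processes from binding the same port.
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self._sock.bind((host, 0))
        self.port = self._sock.getsockname()[1]

    def release(self):
        # Safe to call more than once.
        if self._sock is not None:
            self._sock.close()
            self._sock = None


def launch_with_reserved_port(reservation, command, **popen_kwargs):
    """Release the port as late as possible, then start the child process.

    Assumption for this sketch: the child receives the port number as its
    last command-line argument and binds it itself.
    """
    reservation.release()
    return subprocess.Popen(command + [str(reservation.port)], **popen_kwargs)


if __name__ == "__main__":
    reservation = ReservedPort()
    # ... wait here for the cluster spec while the port stays blocked ...
    child = launch_with_reserved_port(
        reservation,
        [sys.executable, "-c", "import sys; print('child got port', sys.argv[1])"],
    )
    child.wait()
```

A short window still remains between `release()` and the child binding the port, so this narrows the race rather than removing it, as noted in the comments above.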