
TaskExecutor should release port immediately prior to launching shell #365

Closed
hungj opened this issue Sep 4, 2019 · 3 comments

hungj (Contributor) commented Sep 4, 2019

If there's a large TonY app (e.g. >200 executors), it's possible that one TaskExecutor sets up its port long before another. By the time the later executor is set up, the first TaskExecutor will have acquired and released its port, and the longer the setup gap between two executors, the more likely some other TaskExecutor (from a different application) will try to set up on the same port.

The fix is for the TaskExecutor to hold onto the port until immediately before the underlying TF process is launched (or at least until the TaskExecutor receives the cluster spec, meaning all TaskExecutors in the same job have been set up), so that the port stays blocked while the rest of the TaskExecutors are being allocated and set up.
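A minimal sketch of that idea, assuming a hypothetical `PortHoldSketch` class (the class, method names, and `TF_PORT` variable are illustrative, not TonY's actual API): the TaskExecutor keeps the `ServerSocket` open while the other executors set up, so the OS keeps the port reserved, and only closes it right before exec'ing the process.

```java
import java.io.IOException;
import java.net.ServerSocket;

public class PortHoldSketch {
    private ServerSocket reserved;

    // Bind an ephemeral port and keep the socket open so no other process
    // on this host can grab the port while other executors are setting up.
    public int reservePort() throws IOException {
        reserved = new ServerSocket(0);
        return reserved.getLocalPort();
    }

    // Called only after the cluster spec has been received, immediately
    // before launching the underlying process: the race window shrinks to
    // the gap between close() and the child process binding the port.
    public void releaseAndLaunch(String command) throws IOException {
        int port = reserved.getLocalPort();
        reserved.close();
        ProcessBuilder pb = new ProcessBuilder("bash", "-c", command);
        pb.environment().put("TF_PORT", String.valueOf(port));
        pb.inheritIO();
        pb.start();
    }
}
```

The race is narrowed rather than eliminated: another process can still bind the port in the instant between `close()` and the child process starting up.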

burgerkingeater (Contributor) commented

@hungj can we close this issue given it's resolved?

@erwa erwa closed this as completed Sep 6, 2019
erwa (Contributor) commented Sep 7, 2019

Fixing this issue is a good way to reduce the chance of a race condition causing port contention. However, the race condition still exists; only the window in which it can occur has been shrunk. I think we could eliminate the race condition entirely between TaskExecutors running on the same machine as follows:

  • Rather than finding an arbitrary free port, each TaskExecutor could iterate over a range of ports (either a randomly chosen range or one specified in config).
  • For each candidate port, write a temp file port_XXX to some common location (which could be specified in tony-site.xml).
  • If the write succeeds, try to open a socket on the port as a sanity check. If either step fails, go back to the previous step and try another port.
  • Release the port and set the port environment variable.
  • Spawn the process.
  • When the TaskExecutor exits, the temp file port_XXX should be cleaned up automatically.
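The steps above could be sketched roughly as follows (`PortFileLockSketch`, the lock directory, and the port range are illustrative assumptions; in practice they would come from tony-site.xml). The key property is that `File.createNewFile()` is atomic, so only one TaskExecutor on the host can claim a given port_XXX marker:

```java
import java.io.File;
import java.io.IOException;
import java.net.ServerSocket;

public class PortFileLockSketch {
    // Try ports in [lo, hi); return the claimed port, or -1 if none is free.
    public static int claimPort(File lockDir, int lo, int hi) throws IOException {
        for (int port = lo; port < hi; port++) {
            File marker = new File(lockDir, "port_" + port);
            // createNewFile atomically creates the file only if it does not
            // exist, so exactly one TaskExecutor on this host wins the port.
            if (marker.createNewFile()) {
                marker.deleteOnExit(); // best-effort cleanup on normal JVM exit
                try (ServerSocket ss = new ServerSocket(port)) {
                    // Sanity check passed; the socket is closed on return,
                    // but the marker file keeps blocking other TaskExecutors.
                    return port;
                } catch (IOException bindFailed) {
                    // Some non-TonY process holds the port; release the
                    // marker and try the next candidate.
                    marker.delete();
                }
            }
        }
        return -1;
    }
}
```

Note that `deleteOnExit()` covers only normal JVM shutdown, which is why the SIGKILL caveat below still matters.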

I think this would work, assuming:

  • Other processes and applications are not using the same ports. This shouldn't be a problem in general: MR and Spark either use well-defined ports or dynamic ports, and the dynamic port range should be configurable, so as long as the TaskExecutors use a port range that doesn't overlap with it, there shouldn't be a conflict.
  • The TaskExecutor process isn't SIGKILL'd (if it is, the temp file cleanup won't happen). I forget whether YARN tries SIGTERM before sending SIGKILL when a user kills an application or when nmClientAsync.stopContainerAsync is called. For the latter, we could replace the stopContainerAsync() call with a call to our own TaskExecutor RPC endpoint telling the TaskExecutor to shut down and do any necessary cleanup. For the former, leftover port_XXX temp files just mean those ports won't be used by TaskExecutors until the files are cleaned up (and we could ensure they are, e.g. with a tmpwatch script).

oliverhu (Member) commented Sep 9, 2019

+1 to @erwa; this sounds like a good enough synchronization mechanism for now.

4 participants