Kernel shutdown is not working correctly in YARN Cluster mode #1116
Through binary-search trials across commits, I've determined that this issue was introduced with the merge of #1093. It is not obvious which of that PR's changes would affect the areas where the issue occurs. These areas include (but may not be limited to):

I'll be looking more closely at the PR's changes, but if others see anything obvious, please let us know via comments on this issue. Thank you!
I finally found the issue. It was due to this change in

Here's the code block: `enterprise_gateway/etc/kernel-launchers/python/scripts/launch_ipykernel.py`, lines 593 to 607 at commit ec0365b.
This was one of the last things I checked because it just didn't seem likely, but here's what is happening (although I can't explain everything)... The change in 1093 added some logging when

What I can't explain is why what appears to be another call to

Launcher stdout
By moving this block of code into
Wow, thank you @kevin-bates for investigating and for finding the issue/solution.
Well - got some bad news. In looking into the CI logs, I also see this same symptom with the Scala kernel and just reproduced it! I wonder if it too is getting an unexpected exception. I don't understand why EG would get a second connection payload since it doesn't appear an entirely new launch is occurring. Sigh!!
Interesting. The Scala scenario is reproducible against #1092 - although the Python case wasn't. I believe this is because the Python kernel (in 1092) was closing more quickly, but with 1093 (due to the extra exceptions?), it wasn't. This opens the candidate set to the PRs prior to 1092. I will first confirm that 2.6.0 doesn't have this issue, then again use a binary search to locate where the issue was introduced. At least this appears to be something in the EG stack and not launcher-specific at this point.
(Sorry for the spam) This is only seen in YARN, but I suppose it could also happen with other resource managers that try to auto-restart failed driver applications. In this case, because these kernels are launched within the YARN cluster, there's an inherent race condition between the time the listener (launcher) is requested to shut down and the time the resource manager can terminate the application. In cases where the launcher terminates before the RM, the RM thinks it needs to restart the application. Since the response address is now long-lived, the EG log gets another response (with different ports) dumped into it. Because the kernel's shutdown has taken a while, the extra shutdown measures are taken (which also appear in the log). Meanwhile, the original shutdown request for the YARN application itself - sent via the web API - has been received, and the YARN application (which now includes two attempts) is finally shut down. From an end-user perspective, everything looks normal. Only the DEBUG logs are a bit odd when this additional "received payload" message is encountered.

Moving forward:

Any objections can be discussed on the ensuing PR or here.
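The defensive side of this can be sketched in miniature. Below is a hypothetical, simplified model - the class and method names are illustrative only, not EG's actual code - showing the idea that once shutdown begins, a second connection payload from a restarted YARN attempt should be dropped rather than applied:

```python
import json

class KernelSession:
    """Hypothetical sketch (not EG's actual class) of guarding the
    long-lived response address against late connection payloads."""

    def __init__(self, kernel_id):
        self.kernel_id = kernel_id
        self.connection_info = None
        self.shutting_down = False

    def on_connection_payload(self, raw_payload):
        """Accept a payload only if shutdown has not begun."""
        if self.shutting_down:
            # A restarted YARN attempt re-sent its ports; drop the payload
            # instead of overwriting a kernel that is being torn down.
            return False
        self.connection_info = json.loads(raw_payload)
        return True

# First payload registers normally; one arriving mid-shutdown is dropped.
session = KernelSession("kernel-abc")
print(session.on_connection_payload('{"shell_port": 52000}'))  # True
session.shutting_down = True
print(session.on_connection_payload('{"shell_port": 53001}'))  # False
```

This only masks the log noise, of course; the real fix discussed below is preventing the RM from re-attempting the application at all.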
I have found a general solution based on the fact that the YARN Resource Manager's auto-restart functionality is indeed racing against the kernel's shutdown when the kernel is launched in Spark cluster mode. We can negate the RM's auto-restart by setting

Since Jupyter already has automatic restarts built into the framework (which can also be disabled if necessary), we really don't need the RM getting in the way and triggering this race condition. As a result, I would like to

I'm also finding some bugs in how errors are handled when stopping the listener (again, on shutdowns) and would like to address those in the same PR.
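The exact setting is elided above; for reference, Spark's documented knob for capping YARN application attempts is `spark.yarn.maxAppAttempts` (itself bounded by YARN's `yarn.resourcemanager.am.max-attempts`). A sketch of what a kernelspec's `SPARK_OPTS` might look like with attempts capped at one - this is my assumption about the shape of the fix, not a quote from the PR:

```shell
# Assumed fix: cap YARN application attempts at 1 so the RM never races
# kernel shutdown with an auto-restart. Jupyter's own restarter still
# handles genuine kernel failures.
export SPARK_OPTS="--master yarn --deploy-mode cluster \
  --conf spark.yarn.maxAppAttempts=1"
```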
While preparing for the 3.0 alpha/beta release, I'm seeing an issue related to kernel shutdowns. It primarily occurs with YARN kernels, but I suspect it might be more general. This needs to be resolved prior to 3.0.
The symptom is that the launcher appears to be restarting: EG logs that a new connection-info response has come in from the launcher - which is unexpected. I suspect a race condition may have been introduced such that the server (EG) is detecting that the kernel has terminated before the kernel manager marks the kernel as shut down. This results in the restarter thinking the kernel has died and sending an automatic restart. Since the kernel manager is not in a state appropriate for restarts, it gets confused and things wind up in an invalid state.
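The suspected race can be modeled in a few lines. This is a hypothetical, simplified sketch (names are mine, not EG's or jupyter_client's): the restarter should treat a death as restartable only when shutdown was not already requested, which is exactly the ordering that appears to be violated here.

```python
class KernelLifecycleSketch:
    """Hypothetical, simplified model of the suspected race: the restarter
    polls liveness, but must not restart a kernel whose shutdown the
    kernel manager itself initiated."""

    def __init__(self):
        self.alive = True
        self.shutdown_requested = False  # set as soon as shutdown begins

    def request_shutdown(self):
        self.shutdown_requested = True
        self.alive = False  # the process may exit before bookkeeping completes

    def restarter_poll(self):
        """Periodic liveness check: restart only unexpected deaths."""
        if not self.alive and not self.shutdown_requested:
            return "restart"
        return "no-op"

# Expected death: shutdown was requested, so no spurious auto-restart.
km = KernelLifecycleSketch()
km.request_shutdown()
print(km.restarter_poll())  # no-op

# Unexpected death: the restarter legitimately restarts.
crashed = KernelLifecycleSketch()
crashed.alive = False
print(crashed.restarter_poll())  # restart
```

The bug described above corresponds to the restarter observing `alive == False` in the window before `shutdown_requested` is effectively visible to it.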
Here's the log output. The first block of lines is expected, the rest are not...
I'll be looking into this, but I haven't had a chance to compare recent diffs. Any suggestions are welcome.