
Kernel shutdown is not working correctly in YARN Cluster mode #1116

Closed
kevin-bates opened this issue Jun 16, 2022 · 7 comments · Fixed by #1117

@kevin-bates
Member

While preparing for the 3.0 alpha/beta release, I'm seeing an issue relative to kernel shutdowns. It primarily occurs with YARN kernels, but I suspect it might be more general. This needs to be resolved prior to 3.0.

The symptom is that the launcher appears to be restarting, because EG logs that a new connection info response has come in from the launcher - which is unexpected. I suspect a race condition may have been introduced such that the server (EG) is detecting that the kernel has terminated prior to the kernel manager marking the kernel as shut down. This results in the restarter thinking the kernel has died - triggering an automatic restart. Since the kernel manager is not in a state where a restart is expected, it gets confused and things wind up in an invalid state.

Here's the log output. The first block of lines is expected, the rest are not...

[D 2022-06-16 14:08:02.951 EnterpriseGatewayApp] Clearing buffer for 7c9fff71-83e7-4a7d-b55b-46e1853ed4d0
[I 2022-06-16 14:08:02.952 EnterpriseGatewayApp] Kernel shutdown: 7c9fff71-83e7-4a7d-b55b-46e1853ed4d0
[D 2022-06-16 14:08:02.955 EnterpriseGatewayApp] Signal (2) sent via gateway communication port.
[D 2022-06-16 14:08:02.957 EnterpriseGatewayApp] Shutdown request sent to listener via gateway communication port.

[D 2022-06-16 14:08:09.854 EnterpriseGatewayApp] Received payload 'eyJ2ZXJzaW9uIjogMSwgImtleSI6ICJHT2sxWkM5K2hHM1M3ekYzWlFFL3RnR0ZXSDhSa2s1a2xLbUp4c0xXSEI0VjFYYTQ3S09MRTYzWU1HaE5xMjF1TVdhU2VRRHowYXp4OVlsaHhkSGxVUk9nRGRBZ1QxaFpLaVRFK1I0WWlENXcwSmkvT2ZrdHp4Q2RkdkhVbHFrQzlZM3JOL3Rmc204djh1M0ZENVVOR002VXlkMWhYVEo1T0l6MzlTbVI2WU09IiwgImNvbm5faW5mbyI6ICIzMEdIaHdONnVhSkJsdTVGUVlPTUo2L3E1Q2NvQ3MxekhoWndvcVU3MkxQVUFXaVZnSnNQOHBzRGJycEZ0c01yMmliS3dMS0dpdW9SMldiZ01BWGx1REZsNlJJTVFwTmYycDdEU2tlRFNlYTF0YnFYM3VjR0xFWGJhcTdxcFB5bU90Z1VZRWVTa1FmdnQ4d0V0bFJqR05wNEgyUFBFcDN5RTdnZnAvczZjUWY2OXBzUjBETXc0MFREVFlzd0xMNlZZc3Fmd0xMNE5LNHVrV0pMbElET2xpeU9OTE45QXk1SEk2Nzk3R0JTSllZZEZaVnBpV0N0cWxhV3VKWFE3QTFYeElMcFhEWUJZeWM5V0daNkVJS0ZHR0pROTFSRjJ3bTQxdmZmbXp3NkU3cWFseDVDQWlvV1Y4blA3OTZmS25LeXJLd0dRZEdBQTBXTkxSckhjaHJkbWprRy9yd3V0aVZ4Q2Zubmx3YTFkc2NmOVlvL2x5Y2VFNHlFUVY1ZHN2YmI1eUxtWm54eHYzODNtTXhFY0k5YTRPKzZaWVFUR01IbWxWYjY1TUhJTWMxQU5Kak16UmdEeHp1T3grUDNWbVVEUjZuMjR6Y1JBQTMwb1BxVklNQ3dvUT09In0='
[D 2022-06-16 14:08:09.854 EnterpriseGatewayApp] Version 1 payload received.
[D 2022-06-16 14:08:09.857 EnterpriseGatewayApp] Decrypted payload '{'shell_port': 57059, 'iopub_port': 52699, 'stdin_port': 52885, 'control_port': 60819, 'hb_port': 46147, 'ip': '0.0.0.0', 'key': 'a6a3a110-8c52-4633-bca5-8e7571926a47', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'pid': 6911, 'pgid': 6407, 'comm_port': 44059, 'kernel_id': '7c9fff71-83e7-4a7d-b55b-46e1853ed4d0'}'
[E 2022-06-16 14:08:09.857 EnterpriseGatewayApp] Kernel id '{}' has not been registered and will not be processed!
[D 2022-06-16 14:08:10.477 EnterpriseGatewayApp] Kernel is taking too long to finish, terminating
[D 2022-06-16 14:08:10.477 EnterpriseGatewayApp] BaseProcessProxy.terminate(): None
[D 2022-06-16 14:08:10.753 EnterpriseGatewayApp] Polling every 30 seconds for kernels idle > 43200 seconds...
[D 2022-06-16 14:08:10.754 EnterpriseGatewayApp] kernel_id=7c9fff71-83e7-4a7d-b55b-46e1853ed4d0, kernel_name=spark_python_yarn_cluster, last_activity=2022-06-16 14:06:50.468986+00:00
[D 2022-06-16 14:08:17.966 EnterpriseGatewayApp] Kernel is taking too long to finish, killing
[D 2022-06-16 14:08:18.374 EnterpriseGatewayApp] YarnClusterProcessProxy.kill, application ID: application_1655388136881_0001, kernel ID: 7c9fff71-83e7-4a7d-b55b-46e1853ed4d0, state: KILLED, result: None
[D 2022-06-16 14:08:18.401 EnterpriseGatewayApp] YarnClusterProcessProxy.cleanup: Clearing possible defunct process, pid=2707...
[I 220616 14:08:18 web:2243] 204 DELETE /api/kernels/7c9fff71-83e7-4a7d-b55b-46e1853ed4d0 (192.168.65.3) 15461.88ms

I'll be looking into this, but I haven't had a chance to compare recent diffs. Any suggestions are welcome.

@kevin-bates kevin-bates added this to the v3.0 milestone Jun 16, 2022
@kevin-bates
Member Author

Through binary search trials across commits, I found that this issue was introduced with the merge of #1093.

The PR's changes to the areas in which an issue could occur are not obvious. These areas include (but may not be limited to): services/kernels/remotemanager.py, services/processproxies/{processproxy,yarn}.py, and etc/kernel-launchers/python/scripts/launch_ipykernel.py.

I'll be looking more closely at the PR changes but if others see anything obvious, please let us know via comments on this issue. Thank you!

@kevin-bates
Member Author

I finally found the issue. It was due to this change in launch_ipykernel.py...

611,612c604,607
<     except Exception:
<         pass
---
>     except Exception as e:
>         logger.warning(
>             f"Could not delete connection file '{connection_file}' at exit due to error: {e}"
>         )

Here's the code block...

# launch the IPython kernel instance
start_ipython(
    locals(),
    cluster_type=cluster_type,
    connection_file=connection_file,
    ip=ip,
    kernel_class_name=kernel_class_name,
)
try:
    os.remove(connection_file)
except Exception as e:
    logger.warning(
        f"Could not delete connection file '{connection_file}' at exit due to error: {e}"
    )

This was one of the last things I checked because it just didn't seem likely, but here's what is happening (although I can't explain everything)...

The os.remove(connection_file) has probably been a problem for some time and produces a NameError: name 'os' is not defined error in the launcher's output (which isn't typically viewed). Since this is the last bit of code before the module exits, it really isn't noticeable (nor was it noticed). Although the file does include import os, os is removed from the namespace during initialize_namespace() when setting up how the Spark context is established. As a result, references to it cannot be resolved when that code runs.

The change in 1093 added some logging when os.remove() fails (although this kind of failure was not anticipated). Again, because the namespace is essentially rebuilt just prior to starting the kernel, the logger and connection_file references also cannot be resolved.
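
To make the failure mode concrete, here's a minimal, self-contained sketch (hypothetical code, not the launcher's actual implementation) of how running that cleanup code against a rebuilt namespace that no longer contains os or logger reproduces the chained NameErrors shown in the launcher stdout below:

# Hypothetical sketch: the cleanup snippet, executed against a rebuilt namespace.
cleanup_code = """
try:
    os.remove(connection_file)  # raises NameError: name 'os' is not defined
except Exception as e:
    logger.warning(f"Could not delete connection file '{connection_file}': {e}")  # NameError: 'logger'
"""

# Simulates the effect of initialize_namespace(): only some names survive the rebuild.
rebuilt_namespace = {"connection_file": "/tmp/kernel-xxxx.json"}

try:
    exec(cleanup_code, rebuilt_namespace)
except NameError as err:
    # The NameError for 'logger' escapes, raised "during handling" of the
    # NameError for 'os' - exactly the shape of the traceback below.
    print(f"Cleanup failed: {err}")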

What I can't explain is why what appears to be another call to return_connection_info() occurs. It's like something must have popped the call stack and wound up in another location of the code. This is where the Received payload message is coming from in the EG log in the original description above. Here's the stdout captured by the launcher in this case...

Launcher stdout

[D 2022-06-17 18:56:20,739.739 launch_ipykernel] Using connection file '/tmp/kernel-25a4f4ce-97bb-4a88-a112-09198446db1c_h20yg7h_.json'.
[I 2022-06-17 18:56:20,740.740 launch_ipykernel] Signal socket bound to host: 0.0.0.0, port: 51345
[D 2022-06-17 18:56:20,740.740 launch_ipykernel] JSON Payload 'b'{"shell_port": 41965, "iopub_port": 38485, "stdin_port": 56499, "control_port": 37107, "hb_port": 40997, "ip": "0.0.0.0", "key": "1394a872-ca31-4a76-869c-c2a4de0722c6", "transport": "tcp", "signature_scheme": "hmac-sha256", "kernel_name": "", "pid": 5198, "pgid": 4695, "comm_port": 51345, "kernel_id": "25a4f4ce-97bb-4a88-a112-09198446db1c"}'
[D 2022-06-17 18:56:20,755.755 launch_ipykernel] Encrypted Payload 'b'eyJ2ZXJzaW9uIjogMSwgImtleSI6ICJuY3hhVDk3T2R5L0VzUmxVWnpBeVRCK3dxVGNNRjFFKzhIbGhRZTlTTVhtZEFyTkdFWGYzaHBEeFJ3UGxEM1Fjcy9wMnMxTHZ6OHZ5VTh0RzZQVGl6OEg0dVgxekxGZnd5SVYxbnJSRlVrL2tzUjN1ZkF0NXR4TTJrc2lQTmdaL1g3dFlUMUVqMVltdTBxMlFJT2FTMDZRTGxoZnpJWUFHcmxGeFd6dXRGWTQ9IiwgImNvbm5faW5mbyI6ICJHKzcva244c1RFYjRXUzRwRHpsU0o3cFpFTjZDdDBkSDl4UC9MdUZmajRVMnV2dm9hSlEza3VHK1pJWkJ1NndTdmk2ejN6WldJNjRpTEExdlNxNmVGbTRyVkNiNm95c3ZmWU5Fem1Pb3RwaFZITnhEaXFMbHJyK2RCQSt2UC9OMFdNaTZHQklrbGd0VEdqSmZUOUR0clozTUVSYy81VEVvOC96RTNoZ3BkbzYwMzY4WjV1U2lLZ0U5N0ZHY01GUGkyTUJ2Y3JnVXdCcDR1VUIvWUpTbFZlcmRMUG9wVWpZYm1BMnIvUHA5dTBMQnNsbXFlYm9mWlgzQ2NmM2xISHR2NjBPYjRyYzkwWDlDS2RmUGRSU3lxdjZQOXhiS2hJZzk2amo5Qm9EOTByTmYyNXUwcVEvLzc2c3dXWVcyUkRVVkZSNDVZVXY3S3h1WEZlU1FzTkE0ZVJNeXV2Rjc2T1BSVUNQUEppck1TeDBXQVg2QldFWDFTS2d2UGRKblRGVDU3cFNPV3NCcFNBWXliZStCQWJ3SXJJaFJ3YlVYelZFZnY2QkNzdmlUNStQLzEwcFdROFY5TmFnUW9mTnA2WEVCODFZaE5HU05LTU5qbXlLSWhHOWpDdz09In0='
/opt/conda/lib/python3.7/site-packages/IPython/paths.py:67: UserWarning: IPython parent '/home' is not a writable location, using a temp directory.
" using a temp directory.".format(parent))
NOTE: When using the ipython kernel entry point, Ctrl-C will not work.

To exit, you will have to explicitly quit this process, by either sending
"quit" from a client, or using Ctrl-\ in UNIX-like environments.

To read more about this, see ipython/ipython#2049

To connect another client to this kernel, use:
--existing /tmp/kernel-25a4f4ce-97bb-4a88-a112-09198446db1c_h20yg7h_.json
[I 2022-06-17 18:56:46,157.157 launch_ipykernel] server_listener got request: {'signum': 2}
[I 2022-06-17 18:56:46,158.158 launch_ipykernel] server_listener got request: {'shutdown': 1}
Traceback (most recent call last):
File "launch_ipykernel.py", line 606, in
os.remove(connection_file)
NameError: name 'os' is not defined

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "launch_ipykernel.py", line 608, in
logger.warning(
NameError: name 'logger' is not defined
[D 2022-06-17 18:56:53,294.294 launch_ipykernel] Using connection file '/tmp/kernel-25a4f4ce-97bb-4a88-a112-09198446db1c_2vokj94x.json'.
[I 2022-06-17 18:56:53,295.295 launch_ipykernel] Signal socket bound to host: 0.0.0.0, port: 59979
[D 2022-06-17 18:56:53,296.296 launch_ipykernel] JSON Payload 'b'{"shell_port": 34917, "iopub_port": 51997, "stdin_port": 50251, "control_port": 44411, "hb_port": 38765, "ip": "0.0.0.0", "key": "766bf903-9603-46e1-91d5-b83bcde6ba7c", "transport": "tcp", "signature_scheme": "hmac-sha256", "kernel_name": "", "pid": 6878, "pgid": 6376, "comm_port": 59979, "kernel_id": "25a4f4ce-97bb-4a88-a112-09198446db1c"}'
[D 2022-06-17 18:56:53,324.324 launch_ipykernel] Encrypted Payload 'b'eyJ2ZXJzaW9uIjogMSwgImtleSI6ICJsQVVKcXA3cmROYThVV1pVQk9vbXhtbDludWQ5aklXWXU2TlZMUUgrS0RBSHFkVUNSV1ZuRjN0S1ZFOXl4OWhUM3kzZGNGTE9jUktCZVNCODd6WkE5ZVg2YWNsanE4ZU05NVIxeXdzYk1MTENCcEIxdnhyTSs5cUdDS251UEluL2x2M2E2N0FvNXdoWGt3enBuQVo0WnI1WkQvdFFpcmhnbDlLVlRxWjlPS0U9IiwgImNvbm5faW5mbyI6ICJxazVGZ204SXdPd241d3RCMVpuczJZeVRoUmF6V3Q0cmk0Tlp5UGRwMnQzaVphNm9RZTFLVUhGSHR0SjN3N2ZJQlJGRWhCK0tab3V6RkVNbHlHN0xkQjRwZTA4dE1Pd0RKMDVKL2NVQjl2N2NsRUNBMXR0bytwK1R6LzFnalozMnFsR0oyaGZUV0llZEVjdmZGMHhQNjNnMVZwcXMraDh5QlFFNmZHQkpXR1B5YlM1Y0k4WjBKbUduSWFxTjdlQUU3TzNSQ0xFNjMwd2p0cDE1RjNSMldaTkhBaUQ5WjFUQlBPZk42NlhmY3l3ai9TeEtUK1NoZlQ0L2dPQ2FUbEI2M0tIaGNjR0IvNFprbGdydjVCNUdnVTlhSHA1d0pwNXczVUh3bFNiM2pZaUoreTVLSWppVEtiamhPdThXb0hoWmZBUEp3TWsxM2ZLZzFSN1Yya3pnWDd3VTRieit6a3ZGTUxkYzZMalMvZHVNR3oxUkRHb1Bva1RjZktubGU0cEIxOEJmTExUellwUzZPdEhTQ1oyT29uT3dLN09NRWVwYmE4NlR3dXdYM2dsZmZaaFcxVGtWem85by9tMkU0WmxqNHh1akFVWk81RkYxbFYyTEFUVGVLQT09In0='
/opt/conda/lib/python3.7/site-packages/IPython/paths.py:67: UserWarning: IPython parent '/home' is not a writable location, using a temp directory.
" using a temp directory.".format(parent))
NOTE: When using the ipython kernel entry point, Ctrl-C will not work.

To exit, you will have to explicitly quit this process, by either sending
"quit" from a client, or using Ctrl-\ in UNIX-like environments.

To read more about this, see ipython/ipython#2049

To connect another client to this kernel, use:
--existing /tmp/kernel-25a4f4ce-97bb-4a88-a112-09198446db1c_2vokj94x.json

By moving this block of code into start_ipython(), re-importing os, and using print, the shutdown behavior is as expected. A pull request is forthcoming.
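
For reference, a minimal sketch of the shape of that fix (my assumption of how it could look, not the exact code in the forthcoming PR): perform the cleanup inside start_ipython() with a local import of os, and fall back to print() since the module-level logger may not resolve:

def start_ipython(namespace, cluster_type, connection_file, ip, kernel_class_name):
    # ... existing kernel startup logic elided ...

    import os  # local import, so the name resolves even after the namespace rebuild

    try:
        os.remove(connection_file)
    except Exception as e:
        # print() rather than logger, since module-level names may no longer resolve here
        print(f"Could not delete connection file '{connection_file}' at exit due to error: {e}")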

@lresende
Member

Wow, thank you @kevin-bates for investigating and for finding the issue/solution.

@kevin-bates
Member Author

Well - got some bad news. While looking into the CI logs, I also see this same symptom with the Scala kernel and just reproduced it! I wonder if it, too, is getting an unexpected exception. I don't understand why EG would get a second connection payload since it doesn't appear that an entirely new launch is occurring. Sigh!!

@kevin-bates
Member Author

Interesting. The Scala scenario is reproducible against #1092 - although the Python case wasn't. I believe this is because the Python kernel (in 1092) was closing more quickly, but with 1093 (due to the extra exceptions?), it wasn't. This widens the candidate set to the PRs prior to 1092. I will first confirm that 2.6.0 doesn't have this issue, then, again, use a binary search to locate where the issue may have been introduced. At least this appears to be something in the EG stack and not launcher-specific at this point.

@kevin-bates
Member Author

kevin-bates commented Jun 17, 2022

(Sorry for the spam)
I'm pretty sure I know what's happening now. The issue is also reproducible in 2.6.0, and I suspect it has always been happening. It's more visible in 2.6.0, after we introduced the single response address changes where the response address is essentially long-lived in EG.

This is only seen with YARN, but I suppose it could happen with any resource manager that tries to auto-restart failed driver applications. In this case, because these kernels are launched within the YARN cluster, there's an inherent race condition between the time the listener (launcher) is requested to shut down and the time the resource manager can terminate the application.

In cases where the launcher terminates prior to the RM, the RM thinks it needs to restart the application. Since the response address is now long-lived, the EG log gets another response (with different ports) dumped into it. Because the kernel's shutdown has taken a while, the extra shutdown measures are taken (which also appear in the log). Meanwhile, the original shutdown request for the YARN application itself - sent via the web API - has been received, and the YARN application (which now includes two attempts) is finally shut down.

From an end-user perspective, everything looks normal. Only the DEBUG logs are a bit odd when this additional "received payload" message is encountered.

Moving forward:
I think I'd like to submit a PR for the python launcher changes, but leave this issue open. I will change the title such that "regression" isn't implied, but also add that this is related to YARN cluster behavior.

Any objections can be discussed on the ensuing PR or here.

@kevin-bates kevin-bates changed the title [REGRESSION] Kernel shutdown is not working correctly Kernel shutdown is not working correctly in YARN Cluster mode Jun 17, 2022
@kevin-bates
Member Author

I have found a general solution, based on the fact that the YARN Resource Manager's auto-restart functionality is indeed racing against the kernel's shutdown when the kernel is launched in Spark cluster mode.

We can negate the RM's auto-restart by setting spark.yarn.maxAppAttempts=1. This configurable value defaults to yarn.resourcemanager.am.max-attempts - which defaults to 2. Setting the value to 1 instructs the RM to not make any restart attempts.

Since Jupyter already has automatic restarts built into the framework (which can also be disabled if necessary), we really don't need the RM getting in the way and triggering this race condition. As a result, I would like to add --conf spark.yarn.maxAppAttempts=1 to the SPARK_OPTS in the various Spark-based kernels corresponding to YARN.
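
Purely for illustration (this is not the actual kernelspec/run.sh wiring; only the spark.yarn.maxAppAttempts=1 setting comes from the discussion above), a sketch of where the extra flag would land when the spark-submit command is assembled for a YARN-cluster kernel:

# Illustrative only: assemble a spark-submit invocation that disables RM re-attempts.
spark_opts = [
    "--master", "yarn",
    "--deploy-mode", "cluster",
    "--conf", "spark.yarn.maxAppAttempts=1",  # prevent the RM from re-attempting the driver
]
spark_submit_cmd = ["spark-submit", *spark_opts, "launch_ipykernel.py"]
print(" ".join(spark_submit_cmd))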

I'm also finding some bugs with how errors are handled when stopping the listener (again on shutdowns) and would like to address those in the same PR.
