Failed to start kernels (Spark Operator) keeps trying to connect indefinitely #1266

Closed
lresende opened this issue Mar 3, 2023 · 0 comments · Fixed by #1271
lresende commented Mar 3, 2023

Description

When a Spark Operator/CRD kernel fails to start on the Spark Operator side, the gateway keeps polling for a connection indefinitely.

Expected behavior

I believe there are two issues here:

  • When a submitted CRD fails, the gateway should recognize the kernel as failed
  • If startup fails for any other reason, the gateway should stop polling after a timeout or a maximum number of retries and mark the kernel as dead
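
To make the two expectations above concrete, here is a minimal sketch of a bounded polling loop. It is not the Enterprise Gateway implementation: `wait_for_kernel`, `KernelStartupError`, the `get_status` callback, and the default timeout are all hypothetical names and values chosen for illustration, and the state strings are assumptions about what the operator might report.

```python
import time
from typing import Callable, Optional

# Assumed state names; the values actually reported by the operator may differ.
FAILED_STATES = {"FAILED", "SUBMISSION_FAILED"}
RUNNING_STATE = "RUNNING"


class KernelStartupError(RuntimeError):
    """Raised when the kernel cannot be confirmed as started (hypothetical)."""


def wait_for_kernel(
    get_status: Callable[[], Optional[str]],
    timeout: float = 30.0,
    poll_interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll get_status until the application runs; return the attempt count.

    Raises KernelStartupError if the application reports a failed state
    or if the timeout expires first, so the loop always terminates.
    """
    deadline = clock() + timeout
    attempt = 0
    while True:
        attempt += 1
        status = get_status()
        if status == RUNNING_STATE:
            return attempt
        # Expectation 1: a failed submission is recognized immediately.
        if status in FAILED_STATES:
            raise KernelStartupError(
                f"application reported '{status}' on attempt {attempt}"
            )
        # Expectation 2: any other stall ends once the deadline passes.
        if clock() >= deadline:
            raise KernelStartupError(
                f"timed out after {attempt} attempts (last status: {status})"
            )
        sleep(poll_interval)
```

A caller would catch `KernelStartupError`, mark the kernel as dead, and clean up the submitted resource instead of continuing to poll. Injecting `clock` and `sleep` is only there so the loop can be exercised without real delays.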

Logs

[D 2023-03-03 16:26:26.175 EnterpriseGatewayApp] 193: Waiting to connect to k8s sparkapplication in namespace 'spark-applications'. Name: 'some-user-68415e17-3ced-497f-86de-9c59285e2fec-driver', Status: 'None', Pod IP: 'None', KernelID: '68415e17-3ced-497f-86de-9c59285e2fec'
[D 2023-03-03 16:26:26.190 EnterpriseGatewayApp] Nudge: attempt 77 on kernel 68415e17-3ced-497f-86de-9c59285e2fec
[D 2023-03-03 16:26:26.690 EnterpriseGatewayApp] 194: Waiting to connect to k8s sparkapplication in namespace 'spark-applications'. Name: 'some-user-68415e17-3ced-497f-86de-9c59285e2fec-driver', Status: 'None', Pod IP: 'None', KernelID: '68415e17-3ced-497f-86de-9c59285e2fec'
[D 2023-03-03 16:26:26.695 EnterpriseGatewayApp] Nudge: attempt 78 on kernel 68415e17-3ced-497f-86de-9c59285e2fec
[D 2023-03-03 16:26:27.208 EnterpriseGatewayApp] 195: Waiting to connect to k8s sparkapplication in namespace 'spark-applications'. Name: 'some-user-68415e17-3ced-497f-86de-9c59285e2fec-driver', Status: 'None', Pod IP: 'None', KernelID: '68415e17-3ced-497f-86de-9c59285e2fec'
[D 2023-03-03 16:26:27.213 EnterpriseGatewayApp] Nudge: attempt 79 on kernel 68415e17-3ced-497f-86de-9c59285e2fec
[D 2023-03-03 16:26:27.729 EnterpriseGatewayApp] 196: Waiting to connect to k8s sparkapplication in namespace 'spark-applications'. Name: 'some-user-68415e17-3ced-497f-86de-9c59285e2fec-driver', Status: 'None', Pod IP: 'None', KernelID: '68415e17-3ced-497f-86de-9c59285e2fec'
[W 2023-03-03 16:26:27.734 EnterpriseGatewayApp] Nudge: attempt 80 on kernel 68415e17-3ced-497f-86de-9c59285e2fec
[D 2023-03-03 16:26:28.247 EnterpriseGatewayApp] 197: Waiting to connect to k8s sparkapplication in namespace 'spark-applications'. Name: 'some-user-68415e17-3ced-497f-86de-9c59285e2fec-driver', Status: 'None', Pod IP: 'None', KernelID: '68415e17-3ced-497f-86de-9c59285e2fec'
[D 2023-03-03 16:26:28.252 EnterpriseGatewayApp] Nudge: attempt 81 on kernel 68415e17-3ced-497f-86de-9c59285e2fec
[D 2023-03-03 16:26:28.765 EnterpriseGatewayApp] 198: Waiting to connect to k8s sparkapplication in namespace 'spark-applications'. Name: 'some-user-68415e17-3ced-497f-86de-9c59285e2fec-driver', Status: 'None', Pod IP: 'None', KernelID: '68415e17-3ced-497f-86de-9c59285e2fec'
[D 2023-03-03 16:26:28.770 EnterpriseGatewayApp] Nudge: attempt 82 on kernel 68415e17-3ced-497f-86de-9c59285e2fec
[D 2023-03-03 16:26:29.286 EnterpriseGatewayApp] 199: Waiting to connect to k8s sparkapplication in namespace 'spark-applications'. Name: 'some-user-68415e17-3ced-497f-86de-9c59285e2fec-driver', Status: 'None', Pod IP: 'None', KernelID: '68415e17-3ced-497f-86de-9c59285e2fec'
[D 2023-03-03 16:26:29.287 EnterpriseGatewayApp] Nudge: attempt 83 on kernel 68415e17-3ced-497f-86de-9c59285e2fec
[D 2023-03-03 16:26:29.789 EnterpriseGatewayApp] Nudge: attempt 84 on kernel 68415e17-3ced-497f-86de-9c59285e2fec
[D 2023-03-03 16:26:29.807 EnterpriseGatewayApp] 200: Waiting to connect to k8s sparkapplication in namespace 'spark-applications'. Name: 'some-user-68415e17-3ced-497f-86de-9c59285e2fec-driver', Status: 'None', Pod IP: 'None', KernelID: '68415e17-3ced-497f-86de-9c59285e2fec'
[D 2023-03-03 16:26:30.295 EnterpriseGatewayApp] Nudge: attempt 85 on kernel 68415e17-3ced-497f-86de-9c59285e2fec
[D 2023-03-03 16:26:30.327 EnterpriseGatewayApp] 201: Waiting to connect to k8s sparkapplication in namespace 'spark-applications'. Name: 'some-user-68415e17-3ced-497f-86de-9c59285e2fec-driver', Status: 'None', Pod IP: 'None', KernelID: '68415e17-3ced-497f-86de-9c59285e2fec'
[D 2023-03-03 16:26:30.798 EnterpriseGatewayApp] Nudge: attempt 86 on kernel 68415e17-3ced-497f-86de-9c59285e2fec
[D 2023-03-03 16:26:30.846 EnterpriseGatewayApp] 202: Waiting to connect to k8s sparkapplication in namespace 'spark-applications'. Name: 'some-user-68415e17-3ced-497f-86de-9c59285e2fec-driver', Status: 'None', Pod IP: 'None', KernelID: '68415e17-3ced-497f-86de-9c59285e2fec'