Spark batch jobs are not terminating #627

Closed

suvashishtha opened this issue Sep 19, 2019 · 10 comments

Comments

@suvashishtha

Recently after cluster upgrade, spark jobs were not starting because of the issue:
#591

I have changed the kubernetes-client jar from 3.0.0 to 4.4.2 in the docker image as mentioned in the comments. Now Spark batch jobs are not terminating after execution. Both driver and executor are stuck in the running state.

Spark operator version: 2.4.0
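
For anyone hitting the same thing, this is roughly what the image change looks like as a Dockerfile sketch. The base image name, jar path, and exact jar filenames below are assumptions; adjust them to your own build:

FROM gcr.io/spark-operator/spark:v2.4.0
# Swap the bundled fabric8 kubernetes-client 3.0.0 jar for 4.4.2 so the driver
# can talk to newer Kubernetes API servers (see #591). Paths and versions assumed.
RUN rm /opt/spark/jars/kubernetes-client-3.0.0.jar
ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.4.2/kubernetes-client-4.4.2.jar /opt/spark/jars/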

@liyinan926
Collaborator

You probably need to look into your executor/driver logs to figure out why.

@liyinan926
Collaborator

Closing this. Feel free to reopen if needed.

@locona

locona commented Oct 17, 2019

@suvashishtha
I faced the same problem

Have you made any progress?

@rkautoid

rkautoid commented Jul 9, 2020

@liyinan926 I am facing the same problem. Executor logs show the job as finished. I tried to explicitly kill the application using sys.exit(0), but the driver and the executor are still in the running state. Any pointers? I am using v2.4.4 of the operator.
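
In case it is useful, here is a minimal sketch of what the explicit shutdown looks like on our side (assuming a PySpark job; whether this avoids the hang is exactly what is in question):

from pyspark.sql import SparkSession
import sys

spark = SparkSession.builder.appName("batch-job").getOrCreate()
try:
    # ... the actual batch work runs here ...
    pass
finally:
    # spark.stop() should ask Kubernetes to tear down the executor pods before exit.
    spark.stop()
sys.exit(0)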

@ghost

ghost commented Aug 7, 2020

@liyinan926 I am having this issue when running the k8s operator, writing as a Delta table, and using Ozone. The logs are fine; there is nothing there to raise suspicion. The executor sent all information back to the driver. When I describe the app, the status is running, but the message below has them in a pending state.
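
For reference, this is how I am checking the status (app name and namespace are placeholders):

# Status as reported by the operator's SparkApplication resource
kubectl describe sparkapplication <app-name> -n <namespace>
kubectl get sparkapplication <app-name> -n <namespace> -o jsonpath='{.status.applicationState.state}'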

@jkleckner
Contributor

@dustinTDG / @rkautoid Can you grab thread dumps from the executors page of the Spark UI?

I'm wondering if this is related to the problem fixed [1] in the 3.0 branch of Spark but not yet backported to the 2.4 branch.
That was SPARK-24266 [2], which fixed the loss of an HTTP_GONE event from the k8s control plane.

[1] apache/spark#28423
[2] https://issues.apache.org/jira/browse/SPARK-24266
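
If the Spark UI is not reachable, a rough equivalent is to pull the dumps straight from the pods (pod names, and jstack being available in the image, are assumptions):

kubectl get pods -l spark-role=executor
kubectl exec <executor-pod> -- jstack 1 > executor-threads.txt
kubectl exec <driver-pod> -- jstack 1 > driver-threads.txt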

@puneetloya

puneetloya commented Aug 10, 2020

I see this error a lot in the batch jobs:

{"level":"WARN","timestamp":"2020-08-10 19:17:35,985","thread":"OkHttp https://kubernetes.default.svc/...","source":"io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager", "line":"209","message":"Exec Failure"}
java.io.EOFException
	at okio.RealBufferedSource.require(RealBufferedSource.java:61)
	at okio.RealBufferedSource.readByte(RealBufferedSource.java:74)
	at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I do think it's related to the above issue. The batch job starts, and the driver is able to spin up new executors, communicate with them, and get the job done, but it cannot clean them up.

This is with Spark 2.4.5 and Kubernetes versions 1.15 and 1.16 with multiple Kubernetes masters.

The above message repeats every 10 seconds. Let me know if it's not related.

@jkleckner
Contributor

It looks a bit different from what I see. For me, it appears to get stuck at the very end of writing data to Bigtable, in the very last task of a job. Our partner is working to backport the fix I mentioned, and I will let you know if that addresses the hang.

@michael-will

Any update?

@jkleckner
Contributor

Those patches got merged into the 2.4, 3.0, 3.1, and master branches.
We are still using v2.4.8 for now; it has the fix and works for us.

apache/spark#30283
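
For anyone following along, picking up the fix is mostly a matter of bumping the Spark version and image in the application spec. A rough SparkApplication sketch (the image name and the rest of the spec are placeholders for your own setup):

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: my-batch-job
spec:
  type: Scala
  mode: cluster
  sparkVersion: "2.4.8"
  # Image built from a Spark 2.4.8 distribution, which contains the backported fix.
  image: "my-registry/spark:v2.4.8"
  mainClass: org.example.Main
  mainApplicationFile: "local:///opt/spark/jars/my-job.jar"
  driver:
    cores: 1
    memory: "1g"
    serviceAccount: spark
  executor:
    instances: 2
    cores: 1
    memory: "1g"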
