Spark batch jobs are not terminating #627

Closed

suvashishtha opened this issue Sep 19, 2019 · 10 comments

Comments

@suvashishtha

Recently after cluster upgrade, spark jobs were not starting because of the issue:
#591

I have changed the kubernetes-client jar from 3.0.0 to 4.4.2 in the docker image as mentioned in the comments. Now Spark batch jobs are not terminating after execution. Both driver and executor are stuck in the running state.

Spark operator version: 2.4.0
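
For anyone hitting the same thing, this is roughly what the image change looks like as a Dockerfile sketch. The base image name, jar path, and exact jar filenames below are assumptions; adjust them to your own build:

FROM gcr.io/spark-operator/spark:v2.4.0
# Swap the bundled fabric8 kubernetes-client 3.0.0 jar for 4.4.2 so the driver
# can talk to newer Kubernetes API servers (see #591). Paths and versions assumed.
RUN rm /opt/spark/jars/kubernetes-client-3.0.0.jar
ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.4.2/kubernetes-client-4.4.2.jar /opt/spark/jars/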

@liyinan926
Collaborator

You probably need to look into your executor/driver logs to figure out why.

@liyinan926
Collaborator

Closing this. Feel free to reopen if needed.

@locona

locona commented Oct 17, 2019

@suvashishtha
I faced the same problem

Have you made any progress?

@rkautoid

rkautoid commented Jul 9, 2020

@liyinan926 I am facing the same problem. Executor logs show the job as finished. I tried to explicitly kill the application using sys.exit(0), but the driver and the executor are still in the running state. Any pointers? I am using v2.4.4 of the operator.
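
In case it is useful, here is a minimal sketch of what the explicit shutdown looks like on our side (assuming a PySpark job; whether this avoids the hang is exactly what is in question):

from pyspark.sql import SparkSession
import sys

spark = SparkSession.builder.appName("batch-job").getOrCreate()
try:
    # ... the actual batch work runs here ...
    pass
finally:
    # spark.stop() should ask Kubernetes to tear down the executor pods before exit.
    spark.stop()
sys.exit(0)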

@ghost

ghost commented Aug 7, 2020

@liyinan926 I am having this issue when running the k8s operator, writing as a Delta table, and using Ozone. The logs are fine; there is nothing there to raise suspicion. The executor sent all information back to the driver. When I describe the app, the status is running, but the message below has them in a pending state.
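
For reference, this is how I am checking the status (app name and namespace are placeholders):

# Status as reported by the operator's SparkApplication resource
kubectl describe sparkapplication <app-name> -n <namespace>
kubectl get sparkapplication <app-name> -n <namespace> -o jsonpath='{.status.applicationState.state}'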

@jkleckner
Contributor

@dustinTDG / @rkautoid Can you grab thread dumps from the executors page of the Spark UI?

I'm wondering if this is related to the problem fixed [1] in the 3.0 branch of Spark but not yet backported to the 2.4 branch.
That was SPARK-24266 [2], which fixed the loss of an HTTP_GONE event from the k8s control plane.

[1] apache/spark#28423
[2] https://issues.apache.org/jira/browse/SPARK-24266
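
If the Spark UI is not reachable, a rough equivalent is to pull the dumps straight from the pods (pod names, and jstack being available in the image, are assumptions):

kubectl get pods -l spark-role=executor
kubectl exec <executor-pod> -- jstack 1 > executor-threads.txt
kubectl exec <driver-pod> -- jstack 1 > driver-threads.txt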

@puneetloya

puneetloya commented Aug 10, 2020

I see this error a lot in the batch jobs:

{"level":"WARN","timestamp":"2020-08-10 19:17:35,985","thread":"OkHttp https://kubernetes.default.svc/...","source":"io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager", "line":"209","message":"Exec Failure"}
java.io.EOFException
	at okio.RealBufferedSource.require(RealBufferedSource.java:61)
	at okio.RealBufferedSource.readByte(RealBufferedSource.java:74)
	at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I do think it's related to the above issue. The batch job starts, and the driver is able to spin up new executors, communicate with them, and get the job done, but it cannot clean them up.

This is with Spark 2.4.5 and Kubernetes versions 1.15 and 1.16 with multiple Kubernetes masters.

The above message repeats every 10 seconds. Let me know if it's not related.

@jkleckner
Contributor

It looks a bit different from what I see. For me, it appears to get stuck at the very end of writing data to Bigtable, in the very last task of a job. Our partner is working to backport the fix I mentioned, and I will let you know if that addresses the hang.

@michael-will

Any update?

@jkleckner
Contributor

Those patches got merged into the 2.4, 3.0, 3.1, and master branches.
We are still using v2.4.8 for now; it has the fix and works for us.

apache/spark#30283
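
For anyone following along, picking up the fix is mostly a matter of bumping the Spark version and image in the application spec. A rough SparkApplication sketch (the image name and the rest of the spec are placeholders for your own setup):

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: my-batch-job
spec:
  type: Scala
  mode: cluster
  sparkVersion: "2.4.8"
  # Image built from a Spark 2.4.8 distribution, which contains the backported fix.
  image: "my-registry/spark:v2.4.8"
  mainClass: org.example.Main
  mainApplicationFile: "local:///opt/spark/jars/my-job.jar"
  driver:
    cores: 1
    memory: "1g"
    serviceAccount: spark
  executor:
    instances: 2
    cores: 1
    memory: "1g"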
