Spark batch jobs are not terminating #627
Comments
You probably need to look into your executor/driver logs to figure out why.
Closing this. Feel free to reopen if needed.
@suvashishtha Have you made any progress?
@liyinan926 I am facing the same problem. The executor logs show the job as finished. I tried to explicitly kill the application using sys.exit(0), but the driver and the executor are still in the running state. Any pointers? I am using v2.4.4 of the operator.
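For reference, a minimal sketch of an explicit shutdown (the app name and job body are hypothetical, and this is an assumption about the usual cause rather than a confirmed fix for this issue): it is stopping the SparkSession, not exiting the Python process, that releases the executors.

```python
# Minimal PySpark sketch (assumptions: Spark 2.4.x; the job body and app
# name are hypothetical). spark.stop() tears down the SparkContext, which
# asks Kubernetes to delete the executor pods; sys.exit(0) alone only ends
# the Python process and does not by itself release the executors.
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-job").getOrCreate()
try:
    pass  # actual job logic would go here
finally:
    spark.stop()  # release executors and let the driver shut down cleanly

sys.exit(0)
```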
@liyinan926 I am having this issue when running the k8s operator, writing a Delta table, and using Ozone. The logs are fine; there is nothing there to raise suspicion. The executor sent all information back to the driver. When I describe the app, the status is running, but the message below has them in a pending state.
@dustinTDG / @rkautoid Can you grab thread dumps from the executors page of the Spark UI? I'm wondering if this is related to the problem fixed [1] in the 3.0 branch of Spark but not yet backported to the 2.4 branch. [1] apache/spark#28423
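As an aside, and assuming the stock JDK-based Spark images (an assumption, not something stated in this thread): if the Spark UI is unreachable, an equivalent thread dump can usually be captured from inside an executor pod with `kubectl exec <executor-pod> -- jstack 1`, since the executor JVM normally runs as PID 1 in the container.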
I see this error a lot in the batch jobs:
I do think it's related to the above issue. The batch job starts; the driver is able to spin up new executors, communicate with them, and get the job done, but cannot clean them up. This is with Spark 2.4.5 and Kubernetes versions 1.15 and 1.16 with multiple Kubernetes masters. The above message repeats every 10 seconds. Let me know if it's not related.
It looks a bit different from what I see. For me, it appears to get stuck at the very end of writing data to Bigtable, in the very last task of a job. Our partner is working to backport the fix I mentioned, and I will let you know if that addresses the hang.
Any update?
Those patches got merged into the 2.4, 3.0, 3.1, and master branches.
Original issue:
Recently, after a cluster upgrade, Spark jobs were not starting because of this issue:
#591
I changed the kubernetes-client jar from 3.0.0 to 4.4.2 in the Docker image, as mentioned in the comments there. Now Spark batch jobs are not terminating after execution. Both the driver and the executor are stuck in the running state.
Spark operator version: 2.4.0
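For anyone reproducing the workaround described above, a rough sketch (assuming the upstream Spark images, which keep the bundled jars in `$SPARK_HOME/jars`; exact paths may differ in custom images): remove `kubernetes-client-3.0.0.jar` from that directory in the Docker image and add `kubernetes-client-4.4.2.jar` in its place, i.e. the artifact published under the Maven coordinates `io.fabric8:kubernetes-client:4.4.2`.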