Include run-spark-pi-local.sh to demonstrate questions

DO NOT RUN THIS ON A PRODUCTION SYSTEM
For example, use docker.app or minikube with a clean local k8s server.

Add examples/run-spark-pi-local.sh, which runs the spark-pi example on a local k8s cluster from the perspective of a novice user.

This script adds helm and the spark-operator, runs spark-pi, displays the status, then tears down spark-operator and helm.

Ideally a proper script would contain no calls to sleep and would use primitives to synchronize instead.

It was created to ask the questions:

- How can you launch a spark application and then reliably wait for it to finish?
  - This needs to be race free.
- For a restart=Never application, what are the application states that indicate completion, just COMPLETED or FAILED?
  - Is there documentation about application states?
- How can you know whether the application succeeded or failed?
  - Does COMPLETED imply success as the driver pod exit code should?
- Why does the SparkPi example show all executor state as FAILED?
  - I've heard that this happens if sys.exit(0) is not called, which supposedly should be avoided. Why doesn't spark.stop() cause executors to exit cleanly?
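One possible way to approach the first question is to poll the application state with a bounded number of attempts instead of sleeping for a fixed time. The sketch below is only an illustration: `wait_for_terminal_state` is a hypothetical helper, not part of kubectl or the spark-operator, and it assumes that COMPLETED and FAILED are the only terminal states, which is exactly what the second question leaves open. It still sleeps between polls, but it synchronizes on the observed state rather than on a guessed delay.

```shell
#!/bin/bash
# Sketch only: wait_for_terminal_state is a hypothetical helper.
# Usage: wait_for_terminal_state MAX_ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
# COMMAND must print the current application state, for example:
#   kubectl get -n "$spark_namespace" sparkapplications spark-pi \
#       -o jsonpath='{.status.applicationState.state}'
wait_for_terminal_state() {
    local max_attempts=$1
    local interval_seconds=$2
    shift 2
    local attempt=0
    local state
    while [ "$attempt" -lt "$max_attempts" ]; do
        # A failing command (for example, the resource does not exist yet)
        # counts as "no state yet" instead of aborting the wait.
        state=$("$@" 2>/dev/null) || state=""
        case "$state" in
            # Assumption: COMPLETED and FAILED are the only terminal states.
            COMPLETED|FAILED)
                echo "$state"
                return 0
                ;;
        esac
        attempt=$((attempt + 1))
        sleep "$interval_seconds"
    done
    # No terminal state was observed within the allowed attempts.
    echo "TIMEOUT"
    return 1
}
```

Because the state command is passed in as arguments, the same loop could wrap the `kubectl get ... -o jsonpath` query that the script already uses for `statusCode`, and a missing resource is simply retried, which avoids the start-up race.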
jkleckner · Dec 5, 2019 · 4fae3d4
1 parent 866698f
commit 4fae3d4
Showing 1 changed file with 77 additions and 0 deletions.
diff --git a/examples/run-spark-pi-local.sh b/examples/run-spark-pi-local.sh
@@ -0,0 +1,77 @@
+#!/bin/bash
+
+# DO NOT RUN THIS ON A PRODUCTION SYSTEM
+# For example use docker.app or minikube with a clean local k8s server
+# This script adds helm, the spark-operator, runs spark-pi, displays the status, then tears down spark-operator and helm
+
+# Note that if you use a spark_namespace other than default you need to modify spark-pi.yaml to match
+# To use a dedicated namespace instead, set spark_namespace=spark here.
+spark_namespace=default
+
+echo The challenge below is to eliminate all of the sleep invocations and
+echo make this script run to completion.
+
+# === In minikube or local docker.app (is the only real difference not enabling the webhook and adding/removing helm?)
+set -x
+
+helm init && sleep 30
+
+if [ "$spark_namespace" != "default" ] ; then
+    kubectl create ns  "$spark_namespace" && sleep 5
+fi
+
+helm install incubator/sparkoperator --name spark-test --namespace spark-operator --set sparkJobNamespace="$spark_namespace" --set enableMetrics=false \
+    && sleep 10
+
+kubectl apply --validate=true -f spark-pi.yaml \
+    && sleep 10
+
+{
+echo 'The key question of this example is how to wait reliably for spark-pi to start and finish, and then know whether it worked.'
+echo 'Note that logs -f knows how to wait until completion...'
+echo 'Note also that there is a race condition: this logs command will fail if the job has not started yet.'
+} 2> /dev/null
+
+kubectl logs -f -n "$spark_namespace" spark-pi-driver
+
+exitCode=$(kubectl get -n "$spark_namespace" pod/spark-pi-driver -o=jsonpath='{.status.containerStatuses[*].state.terminated.exitCode}')
+{
+  echo "exitCode is $exitCode"
+} 2> /dev/null
+
+# kubectl get -n $spark_namespace sparkapplications spark-pi -o yaml
+
+kubectl get -n "$spark_namespace" sparkapplications spark-pi -o jsonpath='{"ApplicationState:"}{.status.applicationState.state}{"\nExecutorState:"}{.status.executorState.*}{"\n"}'
+
+statusCode=$(kubectl get -n "$spark_namespace" sparkapplications spark-pi -o jsonpath='{.status.applicationState.state}')
+{
+echo "statusCode is $statusCode"
+echo 'Does a statusCode of COMPLETED imply success in the same way that an exitCode of 0 does?'
+echo 'Why is the statusCode for the executors FAILED?'
+echo "Shouldn't the spark.stop() call cause the executor to exit cleanly?"
+} 2> /dev/null
+
+helm list
+
+kubectl delete sparkapplication -n "$spark_namespace" spark-pi && sleep 15
+
+helm list
+
+{
+echo 'Note that helm delete of spark-test does not remove the spark-operator namespace so it is not idempotent.'
+echo 'This is understandable since the namespace might have pre-existed and might be used elsewhere.'
+} 2> /dev/null
+helm delete --purge spark-test && sleep 15
+
+helm list
+
+if [ "$spark_namespace" != "default" ] ; then
+    kubectl delete ns  "$spark_namespace" && sleep 10
+fi
+
+helm reset && sleep 20
+
+
+kubectl delete ns spark-operator && sleep 10
+
+exit 0
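
The script above prints `exitCode` and `statusCode` but always ends with `exit 0`. Until it is clear whether COMPLETED alone implies success, one conservative option is to require both signals to agree. The sketch below is an illustration only: `spark_app_succeeded` is a hypothetical helper, and treating an empty or multi-valued exit code as failure is an assumption rather than documented spark-operator behavior.

```shell
#!/bin/bash
# Sketch only: spark_app_succeeded is a hypothetical helper.
# Usage: spark_app_succeeded APPLICATION_STATE DRIVER_EXIT_CODE
# Returns 0 only when the application state and the driver exit code
# both indicate success.
spark_app_succeeded() {
    local state=$1
    local exit_code=$2
    # An empty exit code means the driver container has not terminated
    # (or the jsonpath query matched nothing). A value such as "0 0" from
    # a multi-container pod also fails this check, which errs on the
    # side of reporting failure.
    if [ "$state" = "COMPLETED" ] && [ "$exit_code" = "0" ]; then
        return 0
    fi
    return 1
}

# Possible use at the end of the script, in place of the unconditional exit 0:
#   if spark_app_succeeded "$statusCode" "$exitCode"; then exit 0; else exit 1; fi
```

With this check the script's own exit status would reflect the outcome of spark-pi, so a caller could tell a failed run from a successful one.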