# Submitting a Spark Job

Besides working with a Spark cluster interactively, for example via the PySpark console or a Jupyter notebook, a typical way of running Spark programs is to submit them as a [batch job](https://en.wikipedia.org/wiki/Batch_processing): we send a Python script to the cluster with the `spark-submit` command. This chapter explains how to do this.

## Configuration

Make sure that PySpark uses the right Python version for the following examples by setting the environment variable `PYSPARK_PYTHON` to Python 3. A way to do this with Python:

In [None]:
import os
os.environ["PYSPARK_PYTHON"] = "python3"
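
As an alternative to the fixed name `python3`, the variable can point at the interpreter that is running the notebook itself. This is only a sketch: it assumes that the same path is valid wherever the Spark workers run, which holds in local mode but not necessarily on a multi-node cluster.

```python
import os
import sys

# Assumption: the workers can find an interpreter at this exact path
# (true in local mode; on a real cluster the path may differ per node).
os.environ["PYSPARK_PYTHON"] = sys.executable

print(os.environ["PYSPARK_PYTHON"])
```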

## spark-submit

`spark-submit` is the command-line tool for sending jobs to a Spark cluster. It supports Spark jobs written in Java, Scala, or Python. Additionally, it offers various configuration options for tuning the performance of the job.

In [None]:
!spark-submit --help
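
Configuration options are passed to `spark-submit` as flags placed before the script path, for example `--master` and `--conf key=value`. The following sketch only assembles such a command line and prints it; it does not run anything. The helper `build_spark_submit_command` is a hypothetical name introduced for illustration, and the master URL and property value are example values, not recommendations.

```python
import shlex


def build_spark_submit_command(script, master=None, conf=None, script_args=()):
    """Assemble a spark-submit command line as a list of arguments.

    Hypothetical helper for illustration; it only builds the list.
    """
    cmd = ["spark-submit"]
    if master is not None:
        cmd += ["--master", master]
    # Each Spark property becomes its own "--conf key=value" pair.
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    # Options must come before the script; anything after it goes to the script.
    cmd.append(script)
    cmd.extend(script_args)
    return cmd


cmd = build_spark_submit_command(
    "scripts/spark_job_minimal.py",
    master="local[2]",                  # example value: two local worker threads
    conf={"spark.ui.enabled": "false"},  # example property
)
print(shlex.join(cmd))
```

Outside of a notebook, where the `!` shortcut is not available, a list like `cmd` could be handed to `subprocess.run(cmd, check=True)` to launch the job.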

## A Minimal Spark Job

The following is a minimal example of a Python script containing a Spark job:

In [None]:
%%file scripts/spark_job_minimal.py

from pyspark import SparkContext, SparkConf

SPARK_APP_NAME='sparkjob_template'
conf = SparkConf().setAppName(SPARK_APP_NAME) 
spark_context = SparkContext(conf=conf)

#----------------------
# TODO: replace with your Spark code
rdd = spark_context.range(100)
#----------------------

spark_context.stop() # don't forget to cleanly shut down


`%%file` is an IPython "cell magic" (an alias for `%%writefile`) that writes the contents of the cell to the given file. The target directory, here `scripts`, must already exist. Once the file is written, we can submit it directly to the cluster:

In [None]:
!spark-submit scripts/spark_job_minimal.py

## More Convenience

Here is a slightly more elaborate and convenient example that serves nicely as a template for writing Spark jobs. `contextlib` lets us use the `with` statement to create and close a Spark context concisely and cleanly, and to keep that setup separate from the actual Spark program.

In [None]:
%%file scripts/spark_job_template.py

SPARK_APP_NAME='sparkjob_template'

from contextlib import contextmanager
from pyspark import SparkContext, SparkConf

@contextmanager
def use_spark_context(appName):
    conf = SparkConf().setAppName(appName) 
    spark_context = SparkContext(conf=conf)

    try:
        print("starting", appName)
        yield spark_context
    finally:
        # always executed, even if the code inside the with block raises
        spark_context.stop()
        print("stopped", appName)


with use_spark_context(appName=SPARK_APP_NAME) as sc:
    #----------------------
    # TODO: replace with your Spark code
    rdd = sc.range(100)
    #----------------------

In [None]:
!spark-submit scripts/spark_job_template.py
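
The point of the `try`/`finally` inside `use_spark_context` is that `stop()` runs even when the job code raises an exception. The following Spark-free sketch illustrates this behaviour. `FakeContext` and `use_fake_context` are hypothetical stand-ins for `SparkContext` and `use_spark_context`, so the example runs without a cluster.

```python
from contextlib import contextmanager


class FakeContext:
    """Hypothetical stand-in for SparkContext that only records stop()."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


@contextmanager
def use_fake_context():
    context = FakeContext()
    try:
        yield context
    finally:
        # Runs on normal exit and when the with-body raises.
        context.stop()


try:
    with use_fake_context() as fc:
        raise ValueError("job failed")  # simulate a failing job
except ValueError as error:
    print("caught:", error)

print("stopped:", fc.stopped)
```

The exception still propagates out of the `with` block, so a failing job is not silently swallowed; the context is merely shut down first.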

## Exercise: Pi Approximation as Spark Job

Now it's your turn. Remember the $\pi$ approximation program? Wrap this into a Spark job script and submit it to the cluster!

In [None]:
%%file scripts/pi_approximation_job.py

# TODO: write a job for the pi approximation program and run it via `spark-submit`


---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_