

This Docker image helps you run Spark in cluster mode with one master and a variable number of slave (worker) nodes.


  1. Set up Docker and docker-compose first.
  2. Build the image using the included Dockerfile: docker-compose build
  3. Spin up a Spark cluster with 1 master and 2 slaves (as an example): docker-compose up --scale master=1 --scale slave=2
  4. Verify that the cluster is running by going to http://docker-machine-ip:8080. Note: docker-machine-ip stands for the address of the Docker host, which is usually localhost on a native Docker install. If you are running Docker on OS X or Windows through docker-machine, use the Docker host VM's IP address instead. You can get that IP address by running docker-machine ip.
  5. Verify that the Jupyter notebook server is running by going to http://docker-machine-ip:8888
  6. Destroy the cluster: docker-compose down
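
The browser checks in steps 4 and 5 can also be scripted. The following is a minimal sketch, not part of this repository, that only tests whether the ports named above (8080 for the Spark master UI, 8888 for Jupyter, 7077 for the Spark master) accept TCP connections. The helper names `port_open` and `check_cluster` are hypothetical, and an open port does not prove that the service behind it is healthy.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def check_cluster(host, ports=(8080, 8888, 7077)):
    """Map each port to whether it is reachable on the given host."""
    return {port: port_open(host, port) for port in ports}


# Example usage (replace the host with the output of `docker-machine ip`
# when Docker runs inside a VM):
#   for port, ok in check_cluster("localhost").items():
#       print(port, "reachable" if ok else "NOT reachable")
```

Running the check right after docker-compose up may report closed ports for a few seconds while the containers are still starting, so retrying once or twice is reasonable.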


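import pyspark

# Point the driver at the Spark master started by docker-compose.
conf = pyspark.SparkConf()
conf.setMaster("spark://<docker machine IP>:7077")

sc = pyspark.SparkContext(conf=conf)

# Sum the integers 0..99 on the cluster; prints 4950.
rdd = sc.parallelize(range(100))
print(rdd.reduce(lambda x, y: x + y))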
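
To avoid editing the master address by hand, the same test can read the host from the environment. This is a sketch under stated assumptions: the environment variable `SPARK_MASTER_HOST` and the helpers `spark_master_url` and `sum_on_cluster` are hypothetical names introduced here, not something this image defines. Only the port 7077 and the PySpark calls come from the example above.

```python
import os


def spark_master_url(host=None, port=7077):
    """Build a spark:// master URL.

    Falls back to the SPARK_MASTER_HOST environment variable (a hypothetical
    name used only in this sketch), then to localhost.
    """
    host = host or os.environ.get("SPARK_MASTER_HOST", "localhost")
    return f"spark://{host}:{port}"


def sum_on_cluster(n=100):
    """Sum the integers 0..n-1 on the cluster and return the result."""
    import pyspark  # imported lazily so the URL helper works without PySpark

    conf = pyspark.SparkConf().setMaster(spark_master_url())
    sc = pyspark.SparkContext(conf=conf)
    try:
        return sc.parallelize(range(n)).sum()
    finally:
        # Release the executors even if the job fails.
        sc.stop()
```

Calling `sum_on_cluster()` against a running cluster should return 4950, the same value the example above prints. Stopping the context in a `finally` block releases the cluster's resources so that a later run can acquire them.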


Support still needs to be added for the following components:

  • Scala
  • PySpark
  • HDFS
  • Zeppelin
  • Jupyter
  • Instructions on setting up in Azure/AWS with Docker Swarm
  • Run containers in some kind of process manager