Running Spark on Amazon EC2
Clone this wiki locally
This guide describes how to get Spark running on an EC2 cluster. It assumes you have already signed up for Amazon EC2 account on the Amazon Web Services site.
For Spark 0.5
Spark now includes some EC2 Scripts for launching and managing clusters on EC2. You can typically launch a cluster in about five minutes. Follow the instructions at this link for details.
For older versions of Spark
Older versions of Spark use the EC2 launch scripts included in Mesos. You can use them as follows:
- Download Mesos using the instructions on the Mesos wiki. There's no need to compile it.
- Launch a Mesos EC2 cluster following the EC2 guide on the Mesos wiki. (Essentially, this involves setting some environment variables and running a Python script.)
- Log into your EC2 cluster's master node using
mesos-ec2 -k <keypair> -i <key-file> login cluster-name.
- Go into the
root's home directory.
- Run either
spark-shellor another Spark program, setting the Mesos master to use to
master@<ec2-master-node>:5050. You can also find this master URL in the file
~/mesos-ec2/cluster-urlin newer versions of Mesos.
- Use the Mesos web UI at
http://<ec2-master-node>:8080to view the status of your job.
Using a Newer Spark Version
The Spark EC2 machines may not come with the latest version of Spark. To use a newer version, you can run
git pull to pull in
/root/spark to pull in the latest version of Spark from
git, and build it using
sbt/sbt compile. You will also need to copy it to all the other nodes in the cluster using
Accessing Data in S3
Spark's file interface allows it to process data in Amazon S3 using the same URI formats that are supported for Hadoop. You can specify a path in S3 as input through a URI of the form
<id> is your Amazon access key ID and
<secret> is your Amazon secret access key. Note that you should escape any
/ characters in the secret key as
%2F. Full instructions can be found on the Hadoop S3 page.
In addition to using a single input file, you can also use a directory of files as input by simply giving the path to the directory.