
Using Elastic Map-Reduce in Wukong

Initial Setup

  1. Sign up for Elastic MapReduce and S3 at Amazon AWS.
  2. Download the Amazon elastic-mapreduce runner: either the official version or the infochimps fork (which has support for Ruby 1.9).
  3. Create a bucket and path to hold your EMR logs, scripts and other ephemera, with (say) ‘/wukong’ as a scoping path within that bucket; you will refer to it with an s3:// path (see the notes below about s3n:// vs. s3:// URLs).
  4. Copy the contents of wukong/examples/emr/dot_wukong_dir to ~/.wukong (see the sketch after this list).
  5. Edit emr.yaml and credentials.json, adding your keys where appropriate and following the other instructions. Start with a single-node m1.small cluster, as you’ll probably have some false starts before the flow of logging in, checking the logs, etc. becomes clear.
  6. You should now be good to launch a program. We’ll give it the --alive flag so that the machine sticks around if there are any issues:
./elastic_mapreduce_example.rb --run=emr --alive s3:// s3://
  7. If you visit the AWS console you should now see a jobflow with two steps. The first sets up debugging for the job; the second is your Hadoop task.
  8. The AWS console also shows the public IP of the master node. You can log in to the machine directly:
  ssh -i /path/to/your/keypair.pem
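
As a quick recap of steps 4 and 5, the shell commands look roughly like this (assuming you checked wukong out into the current directory and ~/.wukong does not already exist):

  cp -r wukong/examples/emr/dot_wukong_dir ~/.wukong
  ${EDITOR:-vi} ~/.wukong/emr.yaml ~/.wukong/credentials.json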


Lorkbong (named after the staff carried by Sun Wukong) is a very simple example Heroku app that lets you show job status or launch a new job, either by visiting a special URL or by triggering a rake task. Get its code from the Lorkbong repository.

s3n:// vs. s3:// URLs

Many external tools use a URI convention to address files in S3; they typically use the ‘s3://’ scheme, which makes a lot of sense. A typical path (bucket name hypothetical) looks like:
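
  s3://your-bucket/path/to/your-file.tsv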

Hadoop can maintain an HDFS on Amazon S3: it uses a block structure and has optimizations for streaming, no file size limitation, and other goodness. However, only Hadoop tools can interpret the contents of those blocks; to everything else it just looks like a soup of blocks labelled block_-8675309 and so forth. Hadoop unfortunately chose the ‘s3://’ scheme for URIs in this filesystem, so a path like the following (bucket name hypothetical) names a file in the block store, not a plain S3 object:
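
  s3://your-bucket/some/hdfs/path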

Hadoop is happy to read S3 native files; ‘native’ as in, you can look at them with a browser and upload and download them with any S3 tool out there. There’s a 5GB limit on file size, and in some cases a performance hit (but not, in our experience, enough to worry about). You refer to these files with the ‘s3n://’ scheme (‘n’ as in ‘native’), for example (bucket name hypothetical):
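
  s3n://your-bucket/path/to/your-file.tsv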

Wukong will coerce things to the right scheme when it knows what that scheme should be (e.g. code should be s3n://); it will otherwise leave the path alone. Specifically, if you use a URI scheme for input and output paths you must use ‘s3n://’ for normal S3 files. A launch with explicit input and output paths would therefore look like this (bucket name and paths hypothetical):
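
  ./elastic_mapreduce_example.rb --run=emr s3n://your-bucket/wukong/input s3n://your-bucket/wukong/output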

Advanced Tips n’ Tricks for common usage

Direct access to logs using your browser

Each Hadoop component exposes a web dashboard for you to access. Use the following ports:

  • 9100: Job tracker (master only)
  • 9101: Namenode (master only)
  • 9102: Datanodes
  • 9103: Task trackers

However, they will only respond to web requests from within the cluster’s private subnet. You can browse the cluster by creating a persistent tunnel to the Hadoop master node and configuring your browser to use it as a proxy.
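
For example, with the proxy described below in place and a master node whose private address is (hypothetically) 10.244.1.2, the job tracker dashboard would be at:

  http://10.244.1.2:9100/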

Create a tunneling proxy to your cluster

To create a tunnel from your local machine to the master node, substitute the keypair and the master node’s address into this command:

  ssh -i ~/.wukong/keypairs/KEYPAIR.pem -f -N -D 6666 -o StrictHostKeyChecking=no -o "ConnectTimeout=10" -o "ServerAliveInterval=60" -o "ControlPath=none" ubuntu@MASTER_NODE_PUBLIC_IP

The command will silently background itself if it succeeds.
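
Flag by flag, these are standard OpenSSH options: -i selects your keypair, -f backgrounds ssh after authenticating, -N tunnels only (runs no remote command), and -D 6666 opens a SOCKS proxy on local port 6666 whose traffic is forwarded through the master node; the ServerAliveInterval keepalive helps the tunnel persist.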

Make your browser use the proxy (but only for cluster machines)

You can access basic information by pointing your browser at a Proxy Auto-Configuration (PAC) file that routes cluster addresses through the tunnel. You’ll have issues if you browse around, though, because many of the in-page links refer to addresses that only resolve within the cluster’s private network.

Set up FoxyProxy

To fix this, use the FoxyProxy extension. It allows you to manage multiple proxy configurations and to use the proxy for DNS resolution (curing the private-address problem).

Once you’ve installed the FoxyProxy extension and restarted Firefox:

  • Set FoxyProxy to ‘Use Proxies based on their pre-defined patterns and priorities’
  • Create a new proxy, called ‘EC2 Socks Proxy’ or something
  • Automatic proxy configuration URL:
  • Under ‘General’, check yes for ‘Perform remote DNS lookups on host’
  • Add the following URL patterns as ‘whitelist’ using ‘Wildcards’ (not regular expression):
  • *.compute-*.internal*
  • *ec2.internal*
  • *domu*.internal*
  • *ec2*.amazonaws.com*
  • *://10.*

And this one as blacklist:

  • https://us-**

Pulling logs to your local machine

s3cmd sync s3:// /tmp/emr_log/
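
With a hypothetical log bucket and path, the full command would look like:

  s3cmd sync s3://your-bucket/wukong/emr_log/ /tmp/emr_log/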
