Hadoop For The Impatient

Paul Houle edited this page Aug 16, 2013 · 8 revisions

The main part of Infovore is bakemono, a Hadoop application. Thus, running Hadoop is a prerequisite for running Infovore. Creating your first Hadoop cluster can be time-consuming and frustrating, but here are some hints for getting started.

Here you'll find an explanation of what a Hadoop application is, specific instructions for running bakemono in the Amazon Elastic Map Reduce cloud, how to run it in your own Hadoop cluster, and the steps that I took to construct a successful development Hadoop cluster.

What is a Hadoop application?

A Hadoop application is an executable JAR, not all that different from a UNIX or Windows executable. The method Main.main(String[] argv) starts as an ordinary Java application that has the Hadoop API available; it uses the API to create and submit one or more jobs to the cluster.
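To make that concrete, here is a minimal sketch of such a main() against the Hadoop 1.x API. This is a hypothetical driver, not bakemono's actual code; the class name is made up and the mapper/reducer wiring is elided.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical minimal driver; a real application like bakemono parses its
// own subcommands (e.g. "run pse3") before building a job like this.
public class MinimalDriver {
    public static void main(String[] argv) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "minimal-example");
        job.setJarByClass(MinimalDriver.class);
        // A real application would set its mapper and reducer classes here:
        // job.setMapperClass(MyMapper.class);
        // job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(argv[0]));   // e.g. /rdfInput
        FileOutputFormat.setOutputPath(job, new Path(argv[1])); // e.g. /rdfOutput
        // Submit the job to the cluster and block until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Everything before waitForCompletion() runs as plain Java on the submitting machine; only the map and reduce tasks run on the cluster.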

Like other executables, a Hadoop application receives command line arguments; if you have your own Hadoop cluster, you can type:

hadoop jar bakemono/target/bakemono-2.0-SNAPSHOT-job.jar run pse3 /rdfInput /rdfOutput 

bakemono receives the arguments

run pse3 /rdfInput /rdfOutput 

which causes bakemono to run the pse3 application, taking input from /rdfInput in HDFS and sending output to /rdfOutput.

Running a job in a managed service, such as Amazon Elastic Map/Reduce, is pretty much the same, except that you pass the name of the JAR and the command line arguments in a slightly different way.

Running Bakemono in Amazon EMR

With Amazon Elastic Map Reduce, you can run bakemono jobs without provisioning any hardware or even compiling a binary. The cost of processing a data set of the size of Freebase can range from $2-$20 depending on what you are doing, in an environment where you can access S3 storage for free and have access to Amazon's very fat data pipes.

If you go to the Amazon EMR Console and create a "Custom JAR" job, you can enter the name of a JAR, such as the one published in our public S3 bucket,


and you can then run an application with command line arguments like

run freebaseRDFPrefilter s3n://freebase-dumps/freebase-rdf-2013-08-11-00-00/ s3n://your-bucket/expanded-rdf-2013-08-11-00-00/

where your-bucket is a bucket that belongs to you. This job runs in about 40 minutes if you use an m1.medium as a control node and 3 c1.xlarge(s) as compute nodes. The s3n: prefixes on the input and output paths indicate that the data is stored as a set of split files in S3 that look a lot like the split files visible in HDFS. Amazon automatically wires in your API key, so the application is automatically authorized to use your assets and our requester-pays assets. (If you're working in the us-east zone, you pay nothing to access our assets, since the transfer cost is zero.)

It takes about 3 minutes for the standard Amazon Hadoop edition to boot up. If you make a mistake (say, your-bucket doesn't exist or doesn't belong to you), it will take you 3 minutes to find out. It will, however, cost a whole hour of compute time, because EC2 rounds up to the next hour. This is a small loss with a small cluster, but it could hurt your pocketbook with a large one. Thus, make your mistakes with small clusters, use automation, and develop your discipline so you can run big jobs without a hitch.

Running it in your cluster

If you're fortunate enough to have a Hadoop cluster, such as a development or departmental cluster, you can run bakemono there directly. The S3 binary mentioned above is public, and can be downloaded from


You can install this JAR and run something like

hadoop jar bakemono-t20130815a-job.jar s3n://freebase-dumps/freebase-rdf-2013-08-11-00-00/ /myLocalCopy

/myLocalCopy goes into your HDFS (or other filesystem). This may be slow, because it is downloading over the internet; however, it would take just as long to download by other means. This special edition of the data set is available for public download, so this will work without S3 authorization keys.

Feel free to copy this data with another tool, install it into another filesystem and process it that way too.
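For instance, if your cluster is running, Hadoop's standard distcp tool can do the copy in parallel across the compute nodes; the destination path below is the same hypothetical one used above.

```shell
# Copy the public dump from S3 into HDFS, parallelized across the cluster
hadoop distcp s3n://freebase-dumps/freebase-rdf-2013-08-11-00-00/ /myLocalCopy
```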

My development cluster

I had several under-utilized Windows computers at my house that I wanted to build a cluster on. Although Hortonworks does have a distribution of Hadoop for Windows, Hadoop has a strong UNIX bias, so I thought the easiest way to get started was to use virtualization.

The machines in my cluster are all 4-core Intel machines with Nehalem or Core 2 processors. They have 8GB (Mac Mini Server), 24GB (my old work machine) and 32GB respectively (my new laptop).

I installed VirtualBox on all three machines, and most importantly, configured each virtual machine with a "bridged" virtual network adapter with a permanent IP address on the LAN. Hadoop uses the unique IP address of each host as a host id, and it will be confused by DHCP, NAT or anything other than a simple static IP address. Although you could set up your own DNS server, I hardcoded names for the machines and put these into the /etc/hosts files on the Linux guests and in the `C:\Windows\System32\Drivers\etc\hosts` file on the Windows hosts. (If you're running Win 8, Windows Defender will revert changes that you make to the hosts file; follow these instructions to prevent that.)
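A hosts file for a three-node cluster might look like the following; the names and addresses here are made up, but every machine in the cluster needs the same mappings, and each must match the static IP you assigned in VirtualBox.

```
192.168.1.101   hadoop-master
192.168.1.102   hadoop-node1
192.168.1.103   hadoop-node2
```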

When I install Linux, I create a small (25GB) virtual disk for the root partition and then a larger (200GB) virtual disk for Hadoop storage, mounting that as /hadoop/. Make sure that both of these disks are allocated with "Fixed Size Storage"; you can tell VirtualBox to allocate space on demand, but that badly hurts I/O performance. (Think of files that are fragmented in Windows and then fragmented again in Linux.)
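If you prefer the command line to the GUI, a fixed-size disk can be created with something like the following; the filename is hypothetical and the flags are from the VirtualBox 4.x era, so check your version's VBoxManage help.

```shell
# Create a 200GB pre-allocated (fixed-size) virtual disk for /hadoop
VBoxManage createhd --filename hadoop-data.vdi --size 204800 --variant Fixed
```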

I allocated 8GB of RAM to Linux on the two machines with a lot of RAM, and just 4GB on the smallest machine. Look through all of the VirtualBox settings to make sure you're getting full performance: for instance, there is a slider to control how many CPU cores participate in the VM, and if you leave it at the default of one, you are shortchanging yourself.

The network is all full-duplex gigabit Ethernet through a switch. I've been very happy with the TrendNet TEG-S80G switch after trialing several others. If you use Wi-Fi or 100-megabit Ethernet, you really need to get on cables or upgrade, because the performance difference is palpable.

I installed Ubuntu 13.04 on the three machines and I installed Hadoop version 1.1.2 directly from the Apache site. There are a number of vendors that offer pre-packaged virtual machines, but I develop for Linux as a server platform, so I like having control of the environment.

With the nodes provisioned as above, I followed the Official Hadoop Installation Instructions, and being well prepared, things went smoothly. You'll almost certainly want to bump mapred.tasktracker.{map|reduce}.tasks.maximum up from the defaults if you have more than 2 CPUs.
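As an illustration, on a 4-core node you might put something like this in mapred-site.xml; the exact values are guesses and should be tuned to your workload and RAM.

```xml
<configuration>
  <!-- Run up to 4 map tasks per node (default is 2) -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- Run up to 2 reduce tasks per node -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
</configuration omitted in a real file only once>
```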

If one of your (physical) machines is running low on space, I've found that e-SATA hard drives perform comparably to internal drives.