Haruhi
Haruhi is a Java application for managing Hadoop jobs and Hadoop clusters. It lets you submit a job to an existing cluster, or to a temporarily provisioned cluster in Amazon Elastic MapReduce. From the source code directory, you can configure your shell for haruhi by writing
$ source haruhi/path.sh
Now I can run a task on my development cluster with
$ haruhi run job pse3 /input /output
(This works by running the local hadoop binary.) I can run the same task in Amazon EMR by typing
$ haruhi -clusterId largeAwsCluster run job pse3 s3://my-bucket/input s3://my-bucket/output
When Haruhi starts, Spring looks for configuration in two places: (i) an applicationContext.xml contained in the Haruhi jar, and (ii) the file $HOME/.haruhi/applicationContext.xml, if it exists. Definitions in your local file override those in the jar.
If you wish to use Amazon Web Services, you need only configure your access credentials in your local applicationContext.xml:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans.xsd
                           http://www.springframework.org/schema/context
                           http://www.springframework.org/schema/context/spring-context.xsd">

    <bean name="awsCredentials" class="com.amazonaws.auth.BasicAWSCredentials">
        <constructor-arg value="access Key Id"/>
        <constructor-arg value="secret Access Key"/>
    </bean>

</beans>
Although the configuration can involve wordy XML, it's usually as simple as that, while still making it possible (with no dev time on my part) for you to create your own custom configurations. In the case of EMR, for instance, you have full access to the model objects describing your cluster, so you are free to build out additional instance groups and populate them with spot instances.
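As a sketch of what that could look like, the EMR SDK's InstanceGroupConfig model object can be declared as an ordinary Spring bean in your personal applicationContext.xml; how you attach it to your cluster definition depends on the cluster bean you are extending, so the bean name and wiring below are only illustrative:
<!-- Illustrative only: a group of four spot task instances built from the AWS
     SDK model class. Wire it into your cluster bean's list of instance groups;
     the exact property name depends on that bean's class. -->
<bean name="spotTaskGroup"
      class="com.amazonaws.services.elasticmapreduce.model.InstanceGroupConfig">
    <property name="instanceRole" value="TASK"/>
    <property name="instanceType" value="m1.small"/>
    <property name="instanceCount" value="4"/>
    <property name="market" value="SPOT"/>
    <property name="bidPrice" value="0.03"/>
</bean>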
The following text was copied from the usage information of the Haruhi App:
To submit a job to the JobApplication do the following:
haruhi run job [options] jar_args ...
The system will pass on any arguments beyond the options to
the Hadoop application. The system will use default options for the cluster
and JAR configuration unless you override them with the following options:
-clusterId <clusterId>
-jarId <jarId>
Both of these arguments are Spring bean names. If you want to add new configurations, the application searches
$HOME/.haruhi/applicationContext.xml
where you can override existing bean definitions or define new ones.
A Hadoop job consists of two things: (a) a JAR file and (b) a set of command-line arguments for the main application built into the JAR. Nothing happens unless we have a cluster, so we also need to supply a cluster that will run the job.
The defaultCluster, which runs the job using the 'hadoop' executable on your machine, is used if you don't specify a cluster id. If the hadoop command works in the environment of the Java process, you are all set and have nothing to configure.
The defaultJar is, by default, the 2.0-SNAPSHOT version of bakemono. If you are running Hadoop locally, this JAR will be found in your $HOME/.m2 directory after you run mvn install. If you are running in EMR, the JAR will be looked for at
{s3JarPath}/{artifactId}-{version}-{classifier}
which would typically be
s3://bakemono-public/bakemono-2.0-SNAPSHOT-job.jar
This won't be my development snapshot; instead, I'll try to put something reasonably stable there, such as a copy of the last tag.
If you want some guarantee that you're running a certain version in the public repository, try
$ haruhi -clusterId largeAwsCluster -jarId t20130902 run job pse3 s3://my-bucket/input s3://my-bucket/output
but note that all of this is configured in the applicationContext.xml like this:
<!-- The space-separated name attribute registers this bean under both the name
     "bakemonoJar" and the alias "defaultJar". -->
<bean name="bakemonoJar defaultJar" class="com.ontology2.haruhi.MavenManagedJar">
    <property name="groupId" value="com.ontology2" />
    <property name="artifactId" value="bakemono" />
    <property name="version" value="2.0-SNAPSHOT" />
    <property name="classifier" value="job" />
    <property name="headArguments">
        <list>
            <value>run</value>
        </list>
    </property>
</bean>
<bean name="t20130902" parent="bakemonoJar">
    <property name="version" value="t20130902" />
</bean>
If you are doing development in S3, create your own bucket to put your JARs in and specify it like this in your personal applicationContext.xml:
<bean name="awsSoftwareBucket" class="java.lang.String">
    <constructor-arg value="s3://your-software-bucket/" />
</bean>
You can then deploy your latest JARs to your bucket like this:
$ s3cmd put bakemono/target/bakemono-2.0-SNAPSHOT.jar s3://your-software-bucket/
Haruhi is independent of Hadoop, bakemono and the many dependencies that could come from those sources.
The HaruhiShell inherits from the CentipedeShell defined in the centipede framework and is the most complex use yet made of the CentipedeShell; in fact, there is functionality in the HaruhiShell that ought to be backported to the CentipedeShell, and it all should be moved there, firmed up, and tested.
JobApp uses a PeekingIterator to eat dash-options from command lines. I like that as an approach to the getopt problem, but there is an issue with it.
Some kind of input validation is necessary for large jobs run on AWS. For instance, the largeAwsCluster provisions 13 machines at about 70 cents/hour. Amazon rounds up to the next hour, so if your job fails at initialization, you get dinged $9, which gets painful, particularly when you could have made the same mistake for $0.075 with a single m1.small.
There are answers for this, such as writing shell scripts and "being careful", but it's just a matter of time before you fail a big job. Some kind of validation system would make using the system less stressful.
The most common problems, by far, are people entering the wrong name for the input file or pointing to an output path that already exists, and a system that prevents this (even if imperfect) will save a lot of pain and $.
So far I have looked at three alternatives.
One of them is to submit a cheap 'probe' job that runs in a tiny AWS cluster, validates the configuration, then shuts down. This insurance costs 7.5 cents and less than 10 minutes of wall-clock time. It may not work for Hadoop applications in general, but we could add a validate action to bakemono and organize bakemono apps so this validation is possible.
Another one is to run the real validation step inside Haruhi (which means making it bakemono- and hadoop-dependent, which I don't want but might accept).
A third is to simulate the validation step inside Haruhi. If we had machine-readable getopts() that would let us identify input and output files, for instance, the problem would be solved. Even something stupid based on regexes could eliminate 75-80% of the pain. Metadata can be easily exported by having bakemono compile a special JAR with a -client classifier.
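If bakemono did publish such a client JAR, declaring it would be a one-liner using the parent-bean trick from above; this is only a sketch of the idea, since nothing in Haruhi consumes such an artifact today:
<!-- Hypothetical: a metadata-only artifact built with a "client" classifier.
     How Haruhi would load and query it is an open design question. -->
<bean name="bakemonoClientJar" parent="bakemonoJar">
    <property name="classifier" value="client" />
</bean>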
The last two have the advantage of interactive (few second) turnaround, which makes developers more productive.
Right now all of the classes in Haruhi are organized into a single package. If I decide to move things around into smaller packages, this would break your Spring configuration files, and I wouldn't want to do that to myself, never mind you. Perhaps this is a commitment to keep Haruhi small (see Extensibility below!)
Almost certainly, some mechanism for setting Hadoop-wide parameters will be necessary someday, because I've already run into situations where I've wanted to change them.
There's also a tension between the natural data model of AWS and how I want things to work for users. For instance, a Cluster can be configured to either shut down or not shut down when all steps are completed. It would be nice to be able to specify this as a command-line option (thinking of it really as a property of the "job"). On the other hand it ought to be easy to make something that starts up a persistent cluster with an AWS shell.
Long builds drive me up the wall, and I'm almost annoyed enough to create profiles to separate a Haruhi build from a Bakemono build. Haruhi could be a good candidate for forming a 2.0 release and leaving it alone for a while so we can take some -SNAPSHOT dependencies out.
Extensibility
Just as the HaruhiShell extends the CentipedeShell, you can write your own shell that extends the HaruhiShell. By writing Java classes and wiring them up through Spring, you can define new cluster types, commands, etc.
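Even without writing Java, your personal applicationContext.xml can define new jar beans that you then select with -jarId. The coordinates below are placeholders for your own artifact, and the headArguments value simply mirrors the bakemonoJar bean above, so adjust both to suit your application:
<bean name="myJar" class="com.ontology2.haruhi.MavenManagedJar">
    <!-- Placeholder Maven coordinates: point these at your own job JAR. -->
    <property name="groupId" value="com.example" />
    <property name="artifactId" value="my-hadoop-app" />
    <property name="version" value="1.0-SNAPSHOT" />
    <property name="classifier" value="job" />
    <!-- Fixed arguments prepended before your own, analogous to "run" in
         the bakemonoJar bean; adjust for your application's main class. -->
    <property name="headArguments">
        <list>
            <value>run</value>
        </list>
    </property>
</bean>
You could then submit with something like
$ haruhi -jarId myJar run job arg1 arg2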