GitHub - pranab/fluxua: A simple easy to use Hadoop map reduce workflow engine

Introduction
============
Fluxua is a simple workflow driver for Hadoop map reduce jobs. It's non-intrusive: your map reduce
implementation does not need to extend any special class and can be plugged into the workflow as is. There
are other map reduce workflow engines, like Oozie and Cascading, that you can use. Fluxua is intended to be a
simple, small-footprint alternative.

Architecture
============
The Hadoop map reduce jobs are defined as nodes in a DAG (directed acyclic graph). Edges represent the dependencies
between the jobs. The workflow driver topologically orders the DAG and executes the jobs in the proper order, starting
with the jobs at the root nodes of the DAG.
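
To make the ordering step concrete, here is a minimal sketch of a topological sort using Kahn's algorithm. It is not taken from the Fluxua source. The class name TopoOrderSketch, the method order and the job names a to d are hypothetical, and the sketch assumes the dependencies are given as a map from each job to the list of jobs it depends on.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Fluxua implementation.
class TopoOrderSketch {

    /**
     * Returns the jobs ordered so that every job comes after all of its parents.
     * deps maps a job name to the names of the jobs it depends on.
     */
    public static List<String> order(Map<String, List<String>> deps) {
        Map<String, Integer> pending = new LinkedHashMap<>();        // unfinished parents per job
        Map<String, List<String>> children = new LinkedHashMap<>();  // parent -> dependent jobs
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            pending.put(e.getKey(), e.getValue().size());
            for (String parent : e.getValue()) {
                pending.putIfAbsent(parent, 0);   // a parent listed only as a dependency is a root
                children.computeIfAbsent(parent, k -> new ArrayList<>()).add(e.getKey());
            }
        }

        // Root nodes have no parents and can be scheduled first.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }

        List<String> ordered = new ArrayList<>();
        while (!ready.isEmpty()) {
            String job = ready.remove();
            ordered.add(job);
            // Completing a job releases each child whose parents are now all done.
            for (String child : children.getOrDefault(job, List.of())) {
                if (pending.merge(child, -1, Integer::sum) == 0) {
                    ready.add(child);
                }
            }
        }
        if (ordered.size() != pending.size()) {
            throw new IllegalStateException("dependency graph contains a cycle");
        }
        return ordered;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("a", List.of());
        deps.put("b", List.of("a"));
        deps.put("c", List.of("a"));
        deps.put("d", List.of("b", "c"));
        System.out.println(order(deps));   // prints [a, b, c, d]
    }
}
```

Jobs b and c have no dependency on each other, so any order that keeps a first and d last would be equally valid. The sketch simply follows the insertion order of the map.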

Each MR job is executed from a separate thread. The executing thread communicates with the driver through a blocking 
queue. All MR jobs at intermediate DAG nodes should do blocking executions of the job, so that the dependent jobs can
be launched only after the parent jobs have completed.
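
The thread and queue interaction can be sketched roughly as follows. This is an illustration under stated assumptions, not the Fluxua code. The class QueueDriverSketch, its methods and the job names are hypothetical, and each job is modelled as a plain Runnable. In real use the Runnable would wrap a blocking Hadoop call such as Job.waitForCompletion(true); that wiring is my assumption and is left out here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration only; not the Fluxua implementation.
class QueueDriverSketch {

    // Starts one job on its own thread; the thread reports (name, success) on the queue.
    private static void launch(String name, Runnable body,
                               BlockingQueue<Map.Entry<String, Boolean>> done) {
        new Thread(() -> {
            boolean ok = true;
            try {
                body.run();   // stands in for a blocking map reduce job
            } catch (RuntimeException e) {
                ok = false;
            }
            done.add(Map.entry(name, ok));
        }, name).start();
    }

    /** Runs every job after its parents have finished; returns the completion order. */
    public static List<String> run(Map<String, List<String>> deps, Map<String, Runnable> jobs) {
        BlockingQueue<Map.Entry<String, Boolean>> done = new LinkedBlockingQueue<>();
        Map<String, Integer> pending = new LinkedHashMap<>();   // unfinished parents per job
        int running = 0;
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            int parents = new HashSet<>(e.getValue()).size();
            pending.put(e.getKey(), parents);
            if (parents == 0) {   // root node: launch right away
                launch(e.getKey(), jobs.get(e.getKey()), done);
                running++;
            }
        }

        List<String> completed = new ArrayList<>();
        while (completed.size() < deps.size()) {
            if (running == 0) {
                throw new IllegalStateException("cycle or unknown parent job");
            }
            Map.Entry<String, Boolean> msg;
            try {
                msg = done.take();   // the driver blocks here until some job reports back
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("driver interrupted", ex);
            }
            running--;
            if (!msg.getValue()) {
                throw new IllegalStateException("job failed: " + msg.getKey());
            }
            completed.add(msg.getKey());
            // Launch each dependent job whose parents have now all completed.
            for (Map.Entry<String, List<String>> e : deps.entrySet()) {
                if (e.getValue().contains(msg.getKey())
                        && pending.merge(e.getKey(), -1, Integer::sum) == 0) {
                    launch(e.getKey(), jobs.get(e.getKey()), done);
                    running++;
                }
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("a", List.of());
        deps.put("b", List.of("a"));
        deps.put("c", List.of("a"));
        deps.put("d", List.of("b", "c"));
        Map<String, Runnable> jobs = new LinkedHashMap<>();
        for (String name : deps.keySet()) {
            jobs.put(name, () -> System.out.println("running " + name));
        }
        System.out.println("completed: " + run(deps, jobs));
    }
}
```

The sketch also shows why blocking execution matters. If a job's Runnable returned before the underlying map reduce job had finished, the completion message would reach the driver too early and the dependent jobs would start against incomplete output.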

Configuration
=============
The jobs and the workflow are defined in a JSON file. The configuration has three main sections. The first
section has the system configuration. The second section has the configuration of all the jobs. The last section has
the flow definitions in terms of jobs. There is an example from a data mining project in the resource directory.


Sample Shell Script
===================
Here is a sample shell script for using the driver. The entry point is the class org.fluxua.driver.JobDriver.

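# Fluxua driver jar
DR_JAR=/home/pranab/Projects/fluxua/target/fluxua-1.0.jar
# Application jar (here, the zaal project)
JAR=/home/pranab/Projects/zaal/target/zaal-1.0.jar
# Driver entry point
CL=org.fluxua.driver.JobDriver
# Workflow configuration file (JSON)
CONFIG=/home/pranab/Projects/zaal/zaal.json
# Make both jars visible to Hadoop
export HADOOP_CLASSPATH=$DR_JAR:$JAR
echo $HADOOP_CLASSPATH
# Remove output directories left over from earlier runs
hadoop fs -rmr /zaal/dor/output
hadoop fs -rmr /zaal/bcl/output
# Launch the workflow
hadoop jar $JAR $CL -c $CONFIG -f bayesian -i sample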

Additional Info
===============
More details can be found in my blog post:
http://pkghosh.wordpress.com/2011/05/22/hadoop-orchestration