Wukong

Ruby on Hadoop: efficient, effective Hadoop streaming & bulk data processing. Write micro scripts for terabyte-scale data.


Wukong is a toolkit for rapid, agile development of data applications at any scale.

The core concept in Wukong is a Processor. Wukong processors are simple Ruby classes that do one thing and do it well. This codebase implements processors and other core Wukong classes and provides a tool, wu-local, to run and combine processors on the command-line.

Wukong's larger theme is powerful black boxes, beautiful glue. The Wukong ecosystem consists of other tools which run Wukong processors in various topologies across a variety of different backends. Code written in Wukong can be easily ported between environments and frameworks: local command-line scripts on your laptop instantly turn into powerful jobs running in Hadoop.

Here is a list of various other projects which you may also want to peruse when trying to understand the full Wukong experience:

  • wukong-hadoop: Run Wukong processors as mappers and reducers within the Hadoop framework. Model Hadoop jobs locally before you run them.
  • wonderdog: Connect Wukong processors running within Hadoop to Elasticsearch as either a source or sink for data.
  • wukong-deploy: Orchestrate Wukong and other wu-tools together to support an application running on the Infochimps Platform.

For a more holistic perspective also see the Infochimps Platform Community Edition (FIXME: link to this) which combines all the Wukong tools together into a jetpack which fits comfortably over the shoulders of developers.

Writing Simple Processors

The fundamental unit of computation in Wukong is the processor. A processor is a Ruby class which

  • subclasses Wukong::Processor (use the Wukong.processor method as sugar for this)
  • defines a process method which takes an input record, does something, and calls yield on the output

Here's a processor that reverses each input record:

# in string_reverser.rb
Wukong.processor(:string_reverser) do
  def process string
    yield string.reverse
  end
end

When you're developing your application, run your processors on the command line on flat input files using wu-local:

$ cat novel.txt
It was the best of times, it was the worst of times.

$ cat novel.txt | wu-local string_reverser.rb
.semit fo tsrow eht saw ti ,semit fo tseb eht saw tI

You can use yield as often (or never) as you need. Here's a more complicated example to illustrate:

# in processors.rb

Wukong.processor(:tokenizer) do
  def process line
    line.split.each { |token| yield token }
  end
end

Wukong.processor(:starts_with) do

  field :letter, String, :default => 'a'

  def process word
    yield word if word =~ /^#{letter}/
  end
end

Let's start by running the tokenizer. We've defined two processors in the file processors.rb and neither one is named processors, so we have to tell wu-local explicitly which processor to run.

$ cat novel.txt | wu-local processors.rb --run=tokenizer

You can combine the output of one processor with another right in the shell. Let's add the starts_with filter and also pass in the field letter, defined in that processor:

$ cat novel.txt | wu-local processors.rb --run=tokenizer | wu-local processors.rb --run=starts_with --letter=t

Wanting to match on a regular expression is such a common task that Wukong has a built-in "widget" called regexp that you can use directly:

$ cat novel.txt | wu-local processors.rb --run=tokenizer | wu-local regexp --match='^t'

There are many more simple widgets like these.

Combining Processors into Dataflows

Chaining together processors which each do one thing well mimics the tried and true UNIX pipeline. Wukong lets you define these pipelines more formally as a dataflow. Here's the dataflow for the last example:

# in find_t_words.rb
Wukong.dataflow(:find_t_words) do
  tokenizer > regexp(match: /^t/)
end

The DSL Wukong provides for combining processors is designed to be similar to the process of developing them on the command line. You can run this dataflow directly

$ cat novel.txt | wu-local find_t_words.rb

and it works exactly like before.
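
Conceptually, a dataflow is just function composition over a record stream. Here's a plain-Ruby sketch of what the chaining above accomplishes (the names and lambdas are illustrative; this is not Wukong's actual implementation):

```ruby
# Each stage maps one record to an array of zero or more output records,
# which mimics a processor that may yield any number of times.
tokenizer = ->(line) { line.split }
t_filter  = ->(word) { word =~ /^t/ ? [word] : [] }

line   = "it was the best of times"
result = tokenizer.call(line).flat_map { |w| t_filter.call(w) }
# result == ["the", "times"]
```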


The process method for a Processor must accept a String argument and yield String output (or objects that respond to to_s appropriately).
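
That contract can be sketched in plain Ruby without Wukong installed (MiniProcessor here is a hypothetical stand-in, not the real Wukong::Processor):

```ruby
# A stand-in illustrating the process/yield contract: process takes one
# input record and may yield zero or more output records to its block.
class MiniProcessor
  def process(record)
    yield record.to_s.upcase  # anything yielded is to_s'd downstream
  end
end

outputs = []
MiniProcessor.new.process("hello") { |out| outputs << out }
# outputs == ["HELLO"]
```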

Coming Soon: The ability to define consumes and emits to automatically handle serialization and deserialization.


Wukong has a number of built-in widgets that are useful for scaffolding your dataflows.


Serializers are widgets which don't change the semantic meaning of a record, merely its representation. Here's a list:

  • to_json, from_json for turning records into JSON or parsing JSON into records
  • to_tsv, from_tsv for turning Array records into TSV or parsing TSV into Array records
  • pretty for pretty printing JSON inputs
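
In plain Ruby (using the stdlib json library rather than Wukong's widgets), the transformations these serializers perform look roughly like this:

```ruby
require 'json'

# to_json / from_json: record <-> JSON string
record = { "word" => "times", "count" => 2 }
json   = JSON.dump(record)
back   = JSON.parse(json)   # back == record

# to_tsv / from_tsv: Array record <-> tab-separated line
row  = ["times", 2]
tsv  = row.join("\t")       # "times\t2"
cols = tsv.split("\t")      # ["times", "2"] -- note fields come back as strings
```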

When you're writing processors that are capable of running in isolation, you'll want to ensure that you deserialize and serialize records on the way in and out, like this:

require 'multi_json'

Wukong.processor(:on_my_own) do
  def process json
    obj = MultiJson.load(json)

    # do something with obj...

    yield MultiJson.dump(obj)
  end
end

For processors which will only run inside a dataflow, you can optimize by not doing any (de)serialization except at the very beginning and end:

Wukong.dataflow(:complicated) do
  from_json > proc_1 > proc_2 > proc_3 ... proc_n > to_json
end

In this approach, no serialization is done between processors.

General Purpose

There are several general purpose processors which implement common patterns on input and output data. These are most useful within the context of a dataflow definition.

  • null does what you think it doesn't
  • map performs some block on each record
  • flatten flattens the input array
  • filter, select, reject only let certain records through based on a block
  • regexp, not_regexp only pass records matching (or not matching) a regular expression
  • limit only lets some number of records pass
  • logger sends events to the local log stream
  • extract extracts some part of each input event
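
As a rough plain-Ruby analogy (not the widgets' actual implementation), several of these stream operations correspond to familiar Enumerable methods:

```ruby
records = ["the times", "a tale", "two cities"]

tokens  = records.flat_map(&:split)        # like map + flatten
t_words = tokens.select { |w| w =~ /^t/ }  # like regexp / select
limited = t_words.take(2)                  # like limit
# limited == ["the", "times"]
```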

Some of these widgets can be used directly, perhaps with some arguments:

Wukong.dataflow(:log_everything) do
  proc_1 > proc_2 > ... > logger
end

Wukong.dataflow(:log_everything_important) do
  proc_1 > proc_2 > ... > regexp(match: /important/i) > logger
end

Other widgets require a block to define their action:

Wukong.dataflow(:log_everything_important) do
  parser > select { |record| record.priority =~ /important/i } > logger
end


There is a selection of widgets that perform aggregative operations like counting, sorting, and summing.

  • count emits a final count of all input records
  • sort can sort input streams
  • group will group records by extracting some part of each and give a count of each group's size
  • moments will emit more complicated statistics (mean, std. dev.) on the group given some other value to measure
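
The group and moments operations can be approximated in plain Ruby like this (a sketch of the computations, not Wukong's implementation):

```ruby
words = %w[able about able above able about]

# group: count occurrences of each distinct record
groups = words.tally  # requires Ruby >= 2.7
# groups == {"able"=>3, "about"=>2, "above"=>1}

# moments: mean and standard deviation of some measured value
values = [1.0, 2.0, 3.0, 4.0]
mean   = values.sum / values.size
var    = values.map { |v| (v - mean)**2 }.sum / values.size
stddev = Math.sqrt(var)  # ~1.118
```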

Here's an example of sorting data right on the command line:

$ head tokens.txt | wu-local sort

Try adding group:

$ head tokens.txt | wu-local sort | wu-local group
{:group=>"abhor", :count=>1}
{:group=>"abide", :count=>2}
{:group=>"able", :count=>3}
{:group=>"about", :count=>3}
{:group=>"above", :count=>1}

You can also use these within a more complicated dataflow:

Wukong.dataflow(:word_count) do
  tokenize > remove_stopwords > sort > group
end