Swineherd with a magic ruby to make pig fly.

SwineHerd

Swineherd enhances the brute power of Apache Pig with the elegance and flexibility of Ruby.

Core features

  • Easily write idempotent and dependent pig jobs using rake
  • Write pig scripts using eruby templates

Rake Example

You can use the PigTask class to define rake tasks that understand how to talk to Pig. And, since it’s still rake, these tasks can play nicely with any other rake tasks in your workflow.



    require 'swineherd' ; include Swineherd

    PigTask.new_pig_task(:foopig, 'foo.pig') do |options|
      options[:inputs]           = {:in  => 'foo.tsv'}
      options[:outputs]          = {:out => '/tmp/foo.tsv'}
      options[:extra_pig_params] = {:n   => '1L'}
    end

The above example defines a new rake task called foopig. To run it, simply do:


    rake -f /path/to/rakefile foopig

and watch it go.

Dependencies are easy too:


    task :foopig => [:other_task1, :other_task2]
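
To illustrate how rake orders dependent tasks, here is a minimal sketch that uses plain rake only, with no Swineherd involved. The task bodies are stand-ins that just record their names; the `order` array is a hypothetical helper added for illustration.

```ruby
require 'rake'

# Records the order in which tasks actually run (illustrative only).
order = []

# Two upstream tasks, standing in for earlier steps of a workflow.
task(:other_task1) { order << :other_task1 }
task(:other_task2) { order << :other_task2 }

# A plain rake task standing in for the Pig job; it declares both
# upstream tasks as prerequisites.
task(:foopig => [:other_task1, :other_task2]) { order << :foopig }

# Invoking :foopig runs its prerequisites first, each at most once.
Rake::Task[:foopig].invoke
puts order.inspect
```

Rake invokes the prerequisites in the order they are listed before running the task itself, so a Pig task declared this way should only start after the tasks it depends on have finished.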

ERB example

You can also use the PigScript class which allows you to use eruby templates like this:


    require 'swineherd' ; include Swineherd

    PigScript.new('foo.pig.erb', {:key => value}).run
    

Here a pig script is created, the values of the passed-in options hash are substituted into the erb template, and the pig script is run.
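
To make the substitution step concrete, here is a minimal sketch using only Ruby's standard ERB library (Ruby 2.5 or later for `result_with_hash`). It does not use Swineherd and is not necessarily how PigScript is implemented; the template text, the `render_pig` helper, and the option names `input`, `output`, and `n` are all assumptions made for illustration.

```ruby
require 'erb'

# Hypothetical template text; in practice this would live in a file
# such as foo.pig.erb.
template = <<~PIG
  data = LOAD '<%= input %>' AS (id:long, name:chararray);
  top  = LIMIT data <%= n %>;
  STORE top INTO '<%= output %>';
PIG

# Render a template by substituting an options hash into it. Each key
# of the hash becomes a local variable inside the template.
def render_pig(template, options)
  ERB.new(template).result_with_hash(options)
end

script = render_pig(template, input: 'foo.tsv', output: '/tmp/foo.tsv', n: 10)
puts script
```

Because the rendered script is an ordinary string, the same template can be reused with different inputs, outputs, or parameters simply by passing a different hash.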

Features

  • Code dependent pig jobs using rake
  • Sane handling of options
  • Check all outputs, don’t run at all if they exist
  • Write your pig scripts using erb templates
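
The output check can be sketched in plain Ruby as follows. This is an illustration of the idea, not Swineherd's own code: the `run_unless_complete` helper is hypothetical, and it only checks the local filesystem, whereas real Pig outputs often live on HDFS.

```ruby
require 'tmpdir'
require 'fileutils'

# Run the given block only when at least one expected output is missing.
# Returns :ran or :skipped so the caller can tell what happened.
# A job with no declared outputs is always run.
def run_unless_complete(outputs)
  return :skipped if !outputs.empty? && outputs.all? { |path| File.exist?(path) }
  yield
  :ran
end

dir = Dir.mktmpdir
out = File.join(dir, 'foo.tsv')

# First call: the output is missing, so the stand-in job runs and writes it.
first = run_unless_complete([out]) { File.write(out, "1\tfoo\n") }

# Second call: the output already exists, so the block is never executed.
second = run_unless_complete([out]) { raise 'job should have been skipped' }

puts "first: #{first}, second: #{second}"
FileUtils.remove_entry(dir)
```

Skipping work whose outputs already exist is what makes a workflow safe to re-run after a partial failure: completed steps are passed over and only the missing ones execute.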

Why?

Simply, Pig needs more buttons.

  • A Pig script should play nicely with other tasks in a data workflow
  • A Pig script should not run if its output data already exists; the workflow should simply pass on to the next task
  • Passing in options should be much easier
  • Pig doesn’t have an “include” statement
  • A minor change in a data model shouldn’t require rewriting of every pig script in a workflow