Rubydoop - Simple Ruby Sugar for Hadoop Streaming

Example - Inverted Index

Input: file@linenum\tline
(where line may include tabs or spaces, and likely has many words)

Desired output: Each word, stripped of punctuation, paired with a comma-delimited list of file@linenum locations for quick lookup.
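
To make the input and output shapes concrete, here is a small plain-Ruby sketch that uses neither Rubydoop nor Hadoop. The sample records and the `build_index` helper are made up for illustration, and the punctuation-trimming regex is one reasonable assumption about what "stripped of punctuation" means here.

```ruby
# Hypothetical sample records in the file@linenum\tline format.
SAMPLE = [
  "a.txt@1\tHello, world",
  "a.txt@2\tgoodbye world",
  "b.txt@1\t(hello) again"
].freeze

# Build word => [locations] in memory; illustrative helper only.
def build_index(records)
  index = Hash.new { |hash, key| hash[key] = [] }
  records.each do |record|
    location, line = record.split("\t", 2)
    line.to_s.split(/\s+/).each do |raw|
      # Lowercase, then trim leading and trailing non-letter characters.
      word = raw.downcase.gsub(/\A[^a-z]+|[^a-z]+\z/, '')
      index[word] << location unless word.empty?
    end
  end
  index
end

# Print one "word<TAB>comma-delimited locations" line per word.
build_index(SAMPLE).sort.each do |word, locations|
  puts "#{word}\t#{locations.join(',')}"
end
# again    b.txt@1
# goodbye  a.txt@2
# hello    a.txt@1,b.txt@1
# world    a.txt@1,a.txt@2
```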

inverted-index.rb


#!/usr/bin/ruby
require "rubydoop"

HADOOP_HOME = "/usr/local/hadoop"

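map do |location, line|
  line.split(/\s+/).each do |raw|
    # Lowercase, then trim leading and trailing non-letter characters.
    word = raw.downcase.gsub(/\A[^a-z]+|[^a-z]+\z/, '')
    # Skip tokens that are empty once punctuation is removed.
    next if word.empty?
    emit word, location
  end
end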

reduce do |key, values|
  emit key, values.join(",")
end

Running

./inverted-index.rb start

Assuming your Hadoop environment is already set up, this launches a streaming job that uses the map and reduce blocks defined in the script.

Testing/Simulating

./inverted-index.rb simulate test-file.txt

This runs a poor man's local MapReduce, equivalent to:

cat test-file.txt | ./inverted-index.rb map | sort | ./inverted-index.rb reduce
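
The `sort` in the middle matters because a streaming reducer reads tab-separated key/value lines and relies on equal keys being adjacent. The following is a minimal sketch of that convention in plain Ruby, not Rubydoop's actual implementation; `map_record`, `reduce_sorted`, and the sample records are hypothetical names and data used only for illustration.

```ruby
# Map step: turn one "location\tline" record into "word\tlocation" lines.
def map_record(record)
  location, line = record.chomp.split("\t", 2)
  line.to_s.split(/\s+/).reject(&:empty?).map { |w| "#{w.downcase}\t#{location}" }
end

# Reduce step: input must already be sorted so equal keys are adjacent.
def reduce_sorted(lines)
  lines.map { |l| l.chomp.split("\t", 2) }
       .chunk_while { |a, b| a[0] == b[0] }
       .map { |group| "#{group.first[0]}\t#{group.map { |pair| pair[1] }.join(',')}" }
end

records = ["f@1\tb a", "f@2\ta c"] # hypothetical input records
mapped  = records.flat_map { |r| map_record(r) }

# Sorting plays the role of the shell `sort` between the two phases.
puts reduce_sorted(mapped.sort)
# a	f@1,f@2
# b	f@1
# c	f@2
```

Without the sort, the two `a` lines would not be adjacent and the reducer would emit `a` twice with partial location lists.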