- Simple Ruby Sugar for Hadoop Streaming
- Inverted Index
Input: each record pairs a file@linenum location with a line of text (the line may include tabs or spaces, and likely has many words).
Desired output: Each word, stripped of punctuation, paired with a comma-delimited list of file@linenum locations for quick lookup.
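    #!/usr/bin/ruby
    require "rubydoop"

    HADOOP_HOME = "/usr/local/hadoop"

    # Map: emit each normalized word with the location it came from.
    map do |location, line|
      line.split(/\s+/).each do |word|
        next unless word.strip.length > 0
        # Lowercase, then drop a leading "(" and one trailing non-letter.
        emit word.strip.downcase.gsub(/^\(|[^a-zA-Z]$/, ''), location
      end
    end

    # Reduce: join all locations for a word into a comma-delimited list.
    reduce do |key, values|
      emit key, values.join(",")
    end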
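To see what the map and reduce blocks compute without Hadoop or rubydoop installed, here is a minimal plain-Ruby sketch. The helper names `normalize` and `build_index` and the sample records are hypothetical and not part of rubydoop; only the split and the regex are copied from the script above.

```ruby
# Hypothetical helper: the same normalization the map block applies.
def normalize(word)
  word.strip.downcase.gsub(/^\(|[^a-zA-Z]$/, '')
end

# Hypothetical helper: takes [location, line] pairs and returns a hash of
# word => comma-delimited locations, all in memory.
def build_index(records)
  locations_by_word = Hash.new { |hash, key| hash[key] = [] }

  # "Map" step: one (word, location) pair per word on each line.
  records.each do |location, line|
    line.split(/\s+/).each do |word|
      next if word.strip.empty?
      locations_by_word[normalize(word)] << location
    end
  end

  # "Reduce" step: join each word's locations with commas.
  locations_by_word.each_with_object({}) do |(word, locations), index|
    index[word] = locations.join(",")
  end
end

# Made-up sample input, for illustration only.
records = [
  ["a.txt@1", "Hello, world"],
  ["a.txt@2", "(hello again)"]
]
index = build_index(records)
index.each { |word, locations| puts "#{word}\t#{locations}" }
```

Note that the regex removes only one trailing non-letter character, so a token such as `again).` would keep its closing parenthesis.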
Assuming your Hadoop environment is all set up, this script will fire up a task with the appropriate map and reduce functions.
To try it locally first, use the simulate command:

    ./inverted-index.rb simulate test-file.txt

This executes a poor man's local MapReduce, equivalent to:

    cat test-file.txt | ./inverted-index.rb map | sort | ./inverted-index.rb reduce
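The pipeline above works because `sort` puts identical keys on adjacent lines, so the reduce side only has to group consecutive lines. Here is a rough sketch of that grouping in plain Ruby, assuming tab-separated `key<TAB>value` lines (the Hadoop Streaming default separator). The method name `reduce_sorted` and the sample lines are hypothetical; this is not how rubydoop is necessarily implemented.

```ruby
# Hypothetical sketch: group already-sorted "key\tvalue" lines by key and
# join each key's values with commas, as the reduce block does.
def reduce_sorted(lines)
  # Split on the first tab only, so values may themselves contain tabs.
  pairs = lines.map { |line| line.chomp.split("\t", 2) }

  # chunk groups consecutive elements whose block result (the key) is equal.
  pairs.chunk { |key, _value| key }.map do |key, group|
    values = group.map { |_key, value| value }
    "#{key}\t#{values.join(',')}"
  end
end

# Made-up map output, for illustration only.
mapped = [
  "hello\ta.txt@1",
  "world\ta.txt@1",
  "hello\ta.txt@2",
  "again\ta.txt@2"
]
output = reduce_sorted(mapped.sort)
puts output
```

In a real streaming reducer the lines would come from standard input instead of an array, and grouping consecutive lines avoids holding every key in memory at once.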