Big cleanup of the examples/ directory
Philip (flip) Kromer committed Jan 28, 2011
1 parent bfaccc9 commit 947156b
Showing 38 changed files with 204 additions and 299 deletions.
32 changes: 32 additions & 0 deletions CHANGELOG.textile
@@ -1,3 +1,35 @@
h2. Wukong v2.0.0

h4. Important changes

* Passing options to streamers is now deprecated. Use @Settings@ instead.

* By default, a Streamer now runs a periodic monitor that logs (to STDERR unless configured otherwise) every 10_000 lines or every 30 seconds.

* Examples cleaned up; they should all run.

h4. Simplified syntax

* You can now pass Script.new an *instance* of Streamer to use as the mapper or reducer.
* Added some experimental sugar:

<pre>
#!/usr/bin/env ruby
require 'wukong/script'

LineStreamer.map do |line|
emit line.reverse
end.run
</pre>

Note that you can now tweet a wukong script.

* It's now recommended that at the top of a wukong script you say
<pre>
require 'wukong/script'
</pre>
Among other benefits, this lets you refer to wukong streamers without a namespace prefix (LineStreamer rather than Wukong::Streamer::LineStreamer).

h2. Wukong v1.5.4

* EMR support now works very well
70 changes: 58 additions & 12 deletions README.textile
@@ -19,18 +19,6 @@ The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com
* Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
* "More info":http://mrflip.github.com/wukong/moreinfo.html

h2. Imminent Changes

I'm pushing to release "Wukong 3.0 the actual 1.0 release".

* For reducing/uniqing, a notion of mutable_fields and immutable_fields and extrinsic_fields: two objects compare the same/differently if their mutable fields compare the same/differently
* Methods on TypedStruct to

* Make to_flat(false) the default, with the sort_fields / partition_fields defaulting to 2 each and very prominently documented
* Standardize the notion that wukong classes have a "key"; by default, it will be to_a.first for Structs/TypedStructs. This shouldn't break anything.
* May make some things that are derived classes into mixin'ed modules
* Will probably change the name of AccumulatingReducer into just Accumulator, and have all Accumulator-derived classes include Accumulator; I'll make sure the old names continue to work though.


h2. Help!

@@ -193,6 +181,64 @@ You'd end up with
@newman @elaine @jerry @kramer
</code></pre>

h2. Gotchas

h4. RecordStreamer dies on blank lines with "wrong number of arguments"

If your lines don't always have a full complement of fields, and you define #process() to take a fixed list of named arguments, Ruby raises an ArgumentError ("wrong number of arguments") whenever a line has too few or too many fields:

<pre>
class MyUnhappyMapper < Wukong::Streamer::RecordStreamer
  # this will fail if the line has more or fewer than 3 fields:
  def process x, y, z
    p [x, y, z]
  end
end
</pre>
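
The failure can be reproduced without Hadoop or Wukong. The sketch below is plain Ruby, not Wukong's own code: process stands in for a streamer's method, try_line is a hypothetical helper, and it assumes the fields reach process as splatted arguments, which is what the error message suggests.

```ruby
# Stand-in for a streamer's #process (hypothetical; not Wukong's own code).
def process(x, y, z)
  [x, y, z]
end

# Hypothetical helper: split a line on tabs, splat the fields into process,
# and report the error as a string instead of crashing.
def try_line(line)
  process(*line.split("\t"))
rescue ArgumentError => e
  "ArgumentError: #{e.message}"
end

p try_line("a\tb\tc")  # three fields: the call succeeds
p try_line("a")        # one field: wrong number of arguments
p try_line("")         # blank line: zero fields, same error
```

A blank line splits into an empty array, so process is called with no arguments at all; that is why blank lines trigger the error.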

The cleanest way I know to fix this is to override recordize, which always returns an array of fields:

<pre>
class MyHappyMapper < Wukong::Streamer::RecordStreamer
  # extracts three fields always; any missing fields are nil, any extra fields discarded
  # @example
  #   recordize("a")             # ["a", nil, nil]
  #   recordize("a\tb\tc")       # ["a", "b", "c"]
  #   recordize("a\tb\tc\td")    # ["a", "b", "c"]
  def recordize raw_record
    x, y, z = super(raw_record)
    [x, y, z]
  end

  # Now all lines produce exactly three args
  def process x, y, z
    p [x, y, z]
  end
end
</pre>

If you want to preserve any extra fields, use the limit argument to split:

<pre>
class MyMoreThanHappyMapper < Wukong::Streamer::RecordStreamer
  # extracts three fields always; any missing fields are nil, the final field will contain a tab-separated string of all trailing fields
  # @example
  #   recordize("a")             # ["a", nil, nil]
  #   recordize("a\tb\tc")       # ["a", "b", "c"]
  #   recordize("a\tb\tc\td")    # ["a", "b", "c\td"]
  def recordize raw_record
    # String#split with a limit of 3 leaves everything after the second tab in the last field
    x, y, z = raw_record.split("\t", 3)
    [x, y, z]
  end

  # Now all lines produce exactly three args
  def process x, y, z
    p [x, y, z]
  end
end
</pre>
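
Both fixes rest on two plain-Ruby behaviors: multiple assignment pads missing values with nil and drops extras, and the limit argument to String#split stops splitting once that many fields exist. A minimal sketch, independent of Wukong (the helper names first_three and first_three_keep_rest are made up for illustration):

```ruby
# Pad or truncate to exactly three fields using multiple assignment.
def first_three(line)
  x, y, z = line.split("\t")
  [x, y, z]
end

# Same, but a split limit of 3 keeps all trailing fields
# together (tab-separated) in the last element.
def first_three_keep_rest(line)
  x, y, z = line.split("\t", 3)
  [x, y, z]
end

p first_three("a")                      # ["a", nil, nil]
p first_three("a\tb\tc\td")             # ["a", "b", "c"]
p first_three_keep_rest("a\tb\tc\td")   # ["a", "b", "c\td"]
p first_three("")                       # [nil, nil, nil]
```

Either way a blank line becomes three nils, so a three-argument process never sees the wrong number of arguments.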


h2. Why is it called Wukong?

Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill:
28 changes: 17 additions & 11 deletions Rakefile
@@ -32,18 +32,24 @@ rescue LoadError
puts "Jeweler (or a dependency) not available. Install it with: gem install jeweler"
end

require 'spec/rake/spectask'
Spec::Rake::SpecTask.new(:spec) do |spec|
spec.libs << 'lib' << 'spec'
spec.spec_files = FileList['spec/**/*_spec.rb']
end
Spec::Rake::SpecTask.new(:rcov) do |spec|
spec.libs << 'lib' << 'spec'
spec.pattern = 'spec/**/*_spec.rb'
spec.rcov = true
begin
require 'spec/rake/spectask'
Spec::Rake::SpecTask.new(:spec) do |spec|
spec.libs << 'lib' << 'spec'
spec.spec_files = FileList['spec/**/*_spec.rb']
end
Spec::Rake::SpecTask.new(:rcov) do |spec|
spec.libs << 'lib' << 'spec'
spec.pattern = 'spec/**/*_spec.rb'
spec.rcov = true
end
task :spec => :check_dependencies
task :default => :spec
rescue LoadError
task :spec do
abort "rspec is not available. In order to run rspec, you must: sudo gem install rspec"
end
end
task :spec => :check_dependencies
task :default => :spec

begin
require 'reek/rake_task'
8 changes: 0 additions & 8 deletions TODO.textile
@@ -1,13 +1,5 @@



* add GEM_PATH to hadoop_recycle_env

* Hadoop_command function received an array for the input_path parameter

** We should be able to specify comma *or* space separated paths; the last
space-separated path in Settings.rest becomes the output file, the others are
used as the input_file list.

* Make configliere Settings and streamer_instance.options() be the same
thing. (instead of almost-but-confusingly-not-always the same thing).
56 changes: 0 additions & 56 deletions examples/count_keys.rb

This file was deleted.

57 changes: 0 additions & 57 deletions examples/count_keys_at_mapper.rb

This file was deleted.

35 changes: 14 additions & 21 deletions examples/network_graph/breadth_first_search.rb
@@ -1,6 +1,6 @@
#!/usr/bin/env ruby
$: << ENV['WUKONG_PATH']
require 'wukong'
$: << File.dirname(__FILE__)+'/../lib'
require 'wukong/script'

#
# Use this script to do a Breadth-First Search (BFS) of a graph.
@@ -9,19 +9,18 @@
# ./make_paths --head=[path_in_key] --tail=[path_out_key] --out_rsrc=[combined_path_key]
#
# For example, given an edge list in the file '1path.tsv' that looks like
# 1path n1 n2
# 1path n1 n3
# 1path n1 n2
# 1path n1 n3
# ... and so forth ...
# you can run
# for t in 1 2 3 4 5 6 7 8 9 ; do next=$((t+1)) ; time cat 1path.tsv "${t}path.tsv" | ./make_paths.rb --map --head="1path" --tail="${t}path" | sort -u | ./make_paths.rb --reduce --out_rsrc="${next}path" | sort -u > "${next}path.tsv" ; done
# to do a 9-deep breadth-first search.
#
module Gen1HoodEdges
class Mapper < Wukong::Streamer::Base
attr_accessor :head, :tail
def initialize options
self.head = options[:head]
self.tail = options[:tail]
class Mapper < Wukong::Streamer::RecordStreamer
def initialize
@head = Settings[:head]
@tail = Settings[:tail]
end
def process rsrc, *nodes
yield [ nodes.last, 'i', nodes[0..-2] ] if (rsrc == self.head)
@@ -37,8 +36,8 @@ def process rsrc, *nodes
#
class Reducer < Wukong::Streamer::AccumulatingReducer
attr_accessor :paths_in, :out_rsrc
def initialize options
self.out_rsrc = options[:out_rsrc]
def initialize
self.out_rsrc = Settings[:out_rsrc]
end
# clear the list of incoming paths
def start! *args
@@ -63,17 +62,11 @@ def get_key mid, *_
mid
end
end

class Script < Wukong::Script
def default_options
super.merge :sort_fields => 2, :partition_fields => 1
end
end

end

# Execute the script
Gen1HoodEdges::Script.new(
Wukong.run(
Gen1HoodEdges::Mapper,
Gen1HoodEdges::Reducer
).run
Gen1HoodEdges::Reducer,
:sort_fields => 2, :partition_fields => 1
)