Merge branch 'master' of github.com:mrflip/wukong

2 parents 78eac5a + 7dacf58
commit c18627bd4d8dbb0912e0d8d5c888fc7432b80a65
Philip (flip) Kromer committed Jan 28, 2011
Showing with 351 additions and 1,335 deletions.
  1. +32 −0 CHANGELOG.textile
  2. +58 −12 README.textile
  3. +17 −11 Rakefile
  4. +0 −8 TODO.textile
  5. +1 −1 VERSION
  6. +11 −0 bin/setcat
  7. +0 −56 examples/count_keys.rb
  8. +0 −57 examples/count_keys_at_mapper.rb
  9. +14 −21 examples/network_graph/breadth_first_search.rb
  10. +22 −13 examples/network_graph/gen_multi_edge.rb
  11. +1 −1 examples/pagerank/pagerank.rb
  12. +6 −10 examples/pagerank/pagerank_initialize.rb
  13. +6 −16 examples/sample_records.rb
  14. +0 −4 examples/server_logs/apache_log_parser.rb
  15. +3 −2 examples/size.rb
  16. +9 −11 examples/{ → stats}/binning_percentile_estimator.rb
  17. +2 −2 examples/{ → stats}/rank_and_bin.rb
  18. +0 −18 examples/store/chunked_store_example.rb
  19. +11 −14 examples/stupidly_simple_filter.rb
  20. +16 −36 examples/word_count.rb
  21. +7 −3 lib/wukong.rb
  22. +2 −15 lib/wukong/and_pig.rb
  23. +0 −81 lib/wukong/dfs.rb
  24. +0 −122 lib/wukong/keystore/cassandra_conditional_outputter.rb
  25. +0 −24 lib/wukong/keystore/redis_db.rb
  26. +0 −137 lib/wukong/keystore/tyrant_db.rb
  27. +0 −145 lib/wukong/keystore/tyrant_notes.textile
  28. +7 −28 lib/wukong/logger.rb
  29. +0 −25 lib/wukong/models/graph.rb
  30. +0 −7 lib/wukong/monitor.rb
  31. +0 −23 lib/wukong/monitor/chunked_store.rb
  32. +0 −34 lib/wukong/monitor/periodic_logger.rb
  33. +0 −70 lib/wukong/monitor/periodic_monitor.rb
  34. +24 −9 lib/wukong/periodic_monitor.rb
  35. +0 −104 lib/wukong/rdf.rb
  36. +20 −16 lib/wukong/script.rb
  37. +30 −29 lib/wukong/script/hadoop_command.rb
  38. +44 −2 lib/wukong/streamer/base.rb
  39. +0 −61 lib/wukong/streamer/cassandra_streamer.rb
  40. +3 −8 lib/wukong/streamer/count_keys.rb
  41. +0 −26 lib/wukong/streamer/count_lines.rb
  42. +0 −25 lib/wukong/streamer/counting_reducer.rb
  43. +2 −2 lib/wukong/streamer/filter.rb
  44. +3 −3 lib/wukong/streamer/list_reducer.rb
  45. +0 −22 lib/wukong/streamer/preprocess_with_pipe_streamer.rb
  46. +0 −21 lib/wukong/wukong_class.rb
  47. 0 {examples → old}/cassandra_streaming/avromapper.rb
  48. 0 {examples → old}/cassandra_streaming/berlitz_for_cassandra.textile
  49. 0 {examples → old}/cassandra_streaming/cassandra.avpr
  50. 0 {examples → old}/cassandra_streaming/cassandra_random_partitioner.rb
  51. 0 {examples → old}/cassandra_streaming/catter.sh
  52. 0 {examples → old}/cassandra_streaming/client_interface_notes.textile
  53. 0 {examples → old}/cassandra_streaming/client_schema.avpr
  54. 0 {examples → old}/cassandra_streaming/client_schema.textile
  55. BIN {examples → old}/cassandra_streaming/foofile.avr
  56. 0 {examples → old}/cassandra_streaming/pymap.sh
  57. 0 {examples → old}/cassandra_streaming/pyreduce.sh
  58. 0 {examples → old}/cassandra_streaming/smutation.avpr
  59. 0 {examples → old}/cassandra_streaming/streamer.sh
  60. 0 {examples → old}/cassandra_streaming/struct_loader.rb
  61. 0 {examples → old}/cassandra_streaming/tuning.textile
  62. 0 {examples → old}/keystore/cassandra_batch_test.rb
  63. 0 {examples → old}/keystore/conditional_outputter_example.rb
@@ -1,3 +1,35 @@
+h2. Wukong v2.0.0
+
+h4. Important changes
+
+* Passing options to streamers is now deprecated. Use @Settings@ instead.
+
+* By default, a Streamer now has a periodic monitor that logs (to STDERR, unless told otherwise) every 10_000 lines or every 30 seconds.
+
+* Examples have been cleaned up and should all run.
+
+h4. Simplified syntax
+
+* You can now pass Script.new an *instance* of Streamer to use as mapper or reducer.
+* Added some experimental sugar:
+
+ <pre>
+ #!/usr/bin/env ruby
+ require 'wukong/script'
+
+ LineStreamer.map do |line|
+ emit line.reverse
+ end.run
+ </pre>
+
+ Note that you can now tweet a wukong script.
+
+* It's now recommended that at the top of a wukong script you say
+ <pre>
+ require 'wukong/script'
+ </pre>
+ Among other benefits, this lets you refer to wukong streamers without prefix.
+
h2. Wukong v1.5.4
* EMR support now works very well
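The periodic-monitor entry in the changelog above (log every 10_000 lines or every 30 seconds) can be pictured with a minimal plain-Ruby sketch. The class name @SketchMonitor@, its keyword arguments, and the injectable clock are assumptions made for illustration; this is not the code in lib/wukong/periodic_monitor.rb.

```ruby
# Hypothetical stand-in for a periodic monitor: it reports progress whenever
# either a line-count interval or a time interval has elapsed.
class SketchMonitor
  attr_reader :count, :reports

  # line_interval and time_interval mirror the defaults named in the changelog;
  # out and clock are injectable so the behavior can be checked without waiting.
  def initialize(line_interval: 10_000, time_interval: 30, out: $stderr, clock: -> { Time.now.to_f })
    @line_interval = line_interval
    @time_interval = time_interval
    @out           = out
    @clock         = clock
    @count         = 0
    @reports       = 0
    @started_at    = @clock.call
    @last_report   = @started_at
  end

  # Call once per input line.
  def periodically
    @count += 1
    now = @clock.call
    due = (@count % @line_interval).zero? || (now - @last_report) >= @time_interval
    return unless due
    @last_report = now
    @reports += 1
    @out.puts format("%d lines in %.1f s", @count, now - @started_at)
  end
end
```

A streamer loop would call @periodically@ once per record; because reports go to STDERR, they stay out of the data written to STDOUT.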
@@ -19,18 +19,6 @@ The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com
* Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
* "More info":http://mrflip.github.com/wukong/moreinfo.html
-h2. Imminent Changes
-
-I'm pushing to release "Wukong 3.0 the actual 1.0 release".
-
-* For reducing/uniqing, a notion of mutable_fields and immutable_fields and extrinsic_fields: two objects compare the same/differently if their mutable fields compare the same/differently
-* Methods on TypedStruct to
-
- * Make to_flat(false) the default, with the sort_fields / partition_fields defaulting to 2 each and very prominently documented
- * Standardize the notion that wukong classes have a "key"; by default, it will be to_a.first for Structs/TypedStructs. This shouldn't break anything.
- * May make some things that are derived classes into mixin'ed modules
- * Will probably change the name of AccumulatingReducer into just Accumulator, and have all Accumulator-derived classes include Accumulator; I'll make sure the old names continue to work though.
-
h2. Help!
@@ -193,6 +181,64 @@ You'd end up with
@newman @elaine @jerry @kramer
</code></pre>
+h2. Gotchas
+
+h4. RecordStreamer dies on blank lines with "wrong number of arguments"
+
+If your lines don't always have a full complement of fields, and you define #process() to take fixed named arguments, then ruby will complain when some of them don't show up:
+
+<pre>
+ class MyUnhappyMapper < Wukong::Streamer::RecordStreamer
+ # this will fail if the line has more or fewer than 3 fields:
+ def process x, y, z
+ p [x, y, z]
+ end
+ end
+</pre>
+
+The cleanest way I know to fix this is with recordize, which you should recall always returns an array of fields:
+
+<pre>
+ class MyHappyMapper < Wukong::Streamer::RecordStreamer
+ # extracts three fields always; any missing fields are nil, any extra fields discarded
+ # @example
+ # recordize("a") # ["a", nil, nil]
+ # recordize("a\tb\tc") # ["a", "b", "c"]
+ # recordize("a\tb\tc\td") # ["a", "b", "c"]
+ def recordize raw_record
+ x, y, z = super(raw_record)
+ [x, y, z]
+ end
+
+ # Now all lines produce exactly three args
+ def process x, y, z
+ p [x, y, z]
+ end
+ end
+</pre>
+
+If you want to preserve any extra fields, use the extra argument to #split():
+
+<pre>
+ class MyMoreThanHappyMapper < Wukong::Streamer::RecordStreamer
+ # extracts three fields always; any missing fields are nil, the final field will contain a tab-separated string of all trailing fields
+ # @example
+ # recordize("a") # ["a", nil, nil]
+ # recordize("a\tb\tc") # ["a", "b", "c"]
+ # recordize("a\tb\tc\td") # ["a", "b", "c\td"]
+ def recordize raw_record
+ # String#split with a limit of 3 leaves everything after the second tab in the last field
+ x, y, z = raw_record.chomp.split("\t", 3)
+ [x, y, z]
+ end
+
+ # Now all lines produce exactly three args
+ def process x, y, z
+ p [x, y, z]
+ end
+ end
+</pre>
+
+
h2. Why is it called Wukong?
Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill:
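The Gotchas hunk above relies on two plain-Ruby behaviors: multiple assignment pads missing values with nil and drops extras, and String#split accepts a limit that leaves the remainder in the last field. Here is a standalone sketch of both. The helper names @three_fields@ and @three_fields_keep_rest@ are hypothetical and not part of Wukong.

```ruby
# Hypothetical helpers, not part of Wukong; they only illustrate the Ruby behavior.

# Always returns exactly three fields: missing ones are nil, extras are dropped.
def three_fields(raw_record)
  x, y, z = raw_record.chomp.split("\t")
  [x, y, z]
end

# Always returns exactly three fields: the last one keeps any trailing fields,
# still tab-separated, because split stops after producing 3 pieces.
def three_fields_keep_rest(raw_record)
  x, y, z = raw_record.chomp.split("\t", 3)
  [x, y, z]
end

p three_fields("a")                      # ["a", nil, nil]
p three_fields("a\tb\tc\td")             # ["a", "b", "c"]
p three_fields_keep_rest("a\tb\tc\td")   # ["a", "b", "c\td"]
```

A blank line comes out as three nils, so a fixed-arity #process no longer raises "wrong number of arguments".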
@@ -32,18 +32,24 @@ rescue LoadError
puts "Jeweler (or a dependency) not available. Install it with: gem install jeweler"
end
-require 'spec/rake/spectask'
-Spec::Rake::SpecTask.new(:spec) do |spec|
- spec.libs << 'lib' << 'spec'
- spec.spec_files = FileList['spec/**/*_spec.rb']
-end
-Spec::Rake::SpecTask.new(:rcov) do |spec|
- spec.libs << 'lib' << 'spec'
- spec.pattern = 'spec/**/*_spec.rb'
- spec.rcov = true
+begin
+ require 'spec/rake/spectask'
+ Spec::Rake::SpecTask.new(:spec) do |spec|
+ spec.libs << 'lib' << 'spec'
+ spec.spec_files = FileList['spec/**/*_spec.rb']
+ end
+ Spec::Rake::SpecTask.new(:rcov) do |spec|
+ spec.libs << 'lib' << 'spec'
+ spec.pattern = 'spec/**/*_spec.rb'
+ spec.rcov = true
+ end
+ task :spec => :check_dependencies
+ task :default => :spec
+rescue LoadError
+ task :spec do
+ abort "rspec is not available. In order to run rspec, you must: sudo gem install rspec"
+ end
end
-task :spec => :check_dependencies
-task :default => :spec
begin
require 'reek/rake_task'
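The Rakefile hunk above wraps an optional dependency in begin/rescue LoadError, so a missing gem produces a helpful message instead of a Rakefile that fails to load. The same pattern can be sketched in isolation. The helper name @with_optional_library@ and its @fallback@ parameter are hypothetical.

```ruby
# Hypothetical helper: try to load a library; run the block if it loads,
# otherwise call the fallback with the library name.
def with_optional_library(name, fallback: nil)
  require name
  yield if block_given?
  true
rescue LoadError
  fallback.call(name) if fallback
  false
end

# 'set' ships with Ruby, so this branch runs the block.
loaded = with_optional_library('set') { Set.new([1, 2]) }

# A library that does not exist triggers the fallback instead of raising.
missing = with_optional_library(
  'no_such_library_for_this_sketch',
  fallback: ->(lib) { warn "#{lib} is not available; install it to enable this task" }
)
```

In the Rakefile the fallback is a stub @:spec@ task that aborts with install instructions, so @rake spec@ still explains itself when rspec is absent.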
@@ -1,13 +1,5 @@
-
-
-
* add GEM_PATH to hadoop_recycle_env
-* Hadoop_command function received an array for the input_path parameter
-
** We should be able to specify comma *or* space separated paths; the last
space-separated path in Settings.rest becomes the output file, the others are
used as the input_file list.
-
-* Make configliere Settings and streamer_instance.options() be the same
- thing. (instead of almost-but-confusingly-not-always the same thing).
@@ -1 +1 @@
-1.5.4
+2.0.0
@@ -0,0 +1,11 @@
+#!/usr/bin/env bash
+
+#
+# This script is useful for debugging. It dumps your environment to STDERR
+# and otherwise behaves like `cat`.
+#
+
+set >&2
+
+cat
+true
@@ -1,56 +0,0 @@
-#!/usr/bin/env ruby
-$: << File.dirname(__FILE__)+'/../lib'
-require 'wukong'
-require 'wukong/streamer/count_keys'
-require 'wukong/streamer/count_lines'
-
-#
-#
-class CountKeysReducer < Wukong::Streamer::CountLines
- #
- # Taken from the actionpack Rails component ('action_view/helpers/number_helper')
- #
- # Formats a +number+ with grouped thousands using +delimiter+. You
- # can customize the format using optional <em>delimiter</em> and <em>separator</em> parameters.
- # * <tt>delimiter</tt> - Sets the thousands delimiter, defaults to ","
- # * <tt>separator</tt> - Sets the separator between the units, defaults to "."
- #
- # number_with_delimiter(12345678) => 12,345,678
- # number_with_delimiter(12345678.05) => 12,345,678.05
- # number_with_delimiter(12345678, ".") => 12.345.678
- def number_with_delimiter(number, delimiter=",", separator=".")
- begin
- parts = number.to_s.split('.')
- parts[0].gsub!(/(\d)(?=(\d\d\d)+(?!\d))/, "\\1#{delimiter}")
- parts.join separator
- rescue
- number
- end
- end
-
- # Override to look nice
- def formatted_count item, key_count
- key_count_str = number_with_delimiter(key_count.to_i)
- "%-25s\t%12s" % [item, key_count_str]
- end
-end
-
-#
-class CountKeysScript < Wukong::Script
- def map_command
- # Use `cut` to extract the first field
- %Q{ cut -d"\t" -f1 }
- end
-
- #
- # There's just the one field
- #
- def default_options
- super.merge :sort_fields => 1
- end
-end
-
-# Executes the script when run from command line
-if __FILE__ == $0
- CountKeysScript.new(nil, CountKeysReducer).run
-end
@@ -1,57 +0,0 @@
-#!/usr/bin/env ruby
-$: << File.dirname(__FILE__)+'/../lib'
-require 'wukong'
-
-#
-#
-module CountKeys
- #
- class Mapper < Wukong::Streamer::Base
- attr_accessor :keys_count
- def initialize *args
- self.keys_count = {}
- end
- def process key, *args
- key.gsub!(/-.*/, '') # kill off the slug
- self.keys_count[key] ||= 0
- self.keys_count[key] += 1
- end
- def stream *args
- super *args
- self.keys_count.each do |key, count|
- emit [key, count].to_flat
- end
- end
- end
- # Identity Mapper
- class Reducer < Wukong::Streamer::AccumulatingReducer
- attr_accessor :key_count
- require 'active_support'
- require 'action_view/helpers/number_helper'; include ActionView::Helpers::NumberHelper
-
- # Override to look nice
- def formatted_count item, key_count
- key_count_str = number_with_delimiter(key_count.to_i, :delimiter => ',')
- "%-25s\t%12s" % [item, key_count_str]
- end
- def start! *args
- self.key_count = 0
- end
- def accumulate key, count
- self.key_count += count.to_i
- end
- def finalize
- yield formatted_count(key, key_count)
- end
- end
-
- #
- class Script < Wukong::Script
- # There's just the one field
- def default_options
- super.merge :sort_fields => 1, :reduce_tasks => 1
- end
- end
-end
-
-CountKeys::Script.new(CountKeys::Mapper, CountKeys::Reducer).run
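The removed count_keys_at_mapper.rb example tallied keys in a hash inside the mapper and emitted the totals only once its input ended, so the reducer had just a few partial counts per key to sum. Below is a standalone sketch of that idea in plain Ruby. The function names are hypothetical stand-ins for the Wukong streamers, and the slug-stripping step of the original is left out.

```ruby
# Hypothetical plain-Ruby stand-ins for the mapper and reducer; no Wukong required.

# "Mapper": tally the first tab-separated field of each line in memory,
# then return one [key, count] pair per distinct key.
def map_with_local_counts(lines)
  counts = Hash.new(0)
  lines.each do |line|
    key = line.chomp.split("\t").first
    next if key.nil?            # skip blank lines
    counts[key] += 1
  end
  counts.to_a
end

# "Reducer": sum the partial counts that several mappers produced for each key.
def reduce_counts(partials)
  totals = Hash.new(0)
  partials.each { |key, count| totals[key] += count.to_i }
  totals
end

# Two input splits, as two separate map tasks would see them.
split_a  = ["jerry\tx", "elaine\ty", "jerry\tz"]
split_b  = ["jerry\tq", "", "kramer\tr"]
partials = map_with_local_counts(split_a) + map_with_local_counts(split_b)
totals   = reduce_counts(partials)
```

The trade-off is memory: the mapper holds one counter per distinct key it sees, which only works when the key space per split is modest.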
@@ -1,6 +1,6 @@
#!/usr/bin/env ruby
-$: << ENV['WUKONG_PATH']
-require 'wukong'
+$: << File.dirname(__FILE__)+'/../lib'
+require 'wukong/script'
#
# Use this script to do a Breadth-First Search (BFS) of a graph.
@@ -9,19 +9,18 @@
# ./make_paths --head=[path_in_key] --tail=[path_out_key] --out_rsrc=[combined_path_key]
#
# For example, given an edge list in the file '1path.tsv' that looks like
-# 1path n1 n2
-# 1path n1 n3
+# 1path n1 n2
+# 1path n1 n3
# ... and so forth ...
# you can run
# for t in 1 2 3 4 5 6 7 8 9 ; do next=$((t+1)) ; time cat 1path.tsv "${t}path.tsv" | ./make_paths.rb --map --head="1path" --tail="${t}path" | sort -u | ./make_paths.rb --reduce --out_rsrc="${next}path" | sort -u > "${next}path.tsv" ; done
# to do a 9-deep breadth-first search.
#
module Gen1HoodEdges
- class Mapper < Wukong::Streamer::Base
- attr_accessor :head, :tail
- def initialize options
- self.head = options[:head]
- self.tail = options[:tail]
+ class Mapper < Wukong::Streamer::RecordStreamer
+ def initialize
+ @head = Settings[:head]
+ @tail = Settings[:tail]
end
def process rsrc, *nodes
yield [ nodes.last, 'i', nodes[0..-2] ] if (rsrc == self.head)
@@ -37,8 +36,8 @@ def process rsrc, *nodes
#
class Reducer < Wukong::Streamer::AccumulatingReducer
attr_accessor :paths_in, :out_rsrc
- def initialize options
- self.out_rsrc = options[:out_rsrc]
+ def initialize
+ self.out_rsrc = Settings[:out_rsrc]
end
# clear the list of incoming paths
def start! *args
@@ -63,17 +62,11 @@ def get_key mid, *_
mid
end
end
-
- class Script < Wukong::Script
- def default_options
- super.merge :sort_fields => 2, :partition_fields => 1
- end
- end
-
end
# Execute the script
-Gen1HoodEdges::Script.new(
+Wukong.run(
Gen1HoodEdges::Mapper,
- Gen1HoodEdges::Reducer
- ).run
+ Gen1HoodEdges::Reducer,
+ :sort_fields => 2, :partition_fields => 1
+ )
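The hunk above moves streamer configuration from a constructor argument to the global Settings object. Here is a plain-Ruby sketch of that shape. The constant @SKETCH_SETTINGS@ is a Hash standing in for Settings and the class name is hypothetical. The sketch keeps readers for @head@ and @tail@, since the @process@ method shown in the hunk still calls @self.head@.

```ruby
# SKETCH_SETTINGS stands in for Wukong's Settings object (an assumption for
# this sketch); in a real script the values would come from options such as --head.
SKETCH_SETTINGS = { :head => '1path', :tail => '2path' }

# Hypothetical mapper shape: no constructor arguments, configuration is read
# from the shared settings instead.
class SketchPathMapper
  attr_reader :head, :tail   # process reaches head/tail through these readers

  def initialize
    @head = SKETCH_SETTINGS[:head]
    @tail = SKETCH_SETTINGS[:tail]
  end

  # Mirrors the head branch visible in the hunk; the tail branch is not shown
  # there, so it is left out here as well.
  def process(rsrc, *nodes)
    yield [nodes.last, 'i', nodes[0..-2]] if rsrc == head
  end
end

mapper = SketchPathMapper.new
mapper.process('1path', 'n1', 'n2') { |record| p record }   # ["n2", "i", ["n1"]]
```

Reading settings in the constructor means the same class can be handed to a script with no per-instance wiring, which is what makes the shorter @Wukong.run(...)@ call in this hunk possible.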