h1. Wukong

Wukong is Ruby for Hadoop -- it makes "Hadoop":http://hadoop.apache.org/core so easy a chimpanzee can use it.

Treat your dataset like a
* stream of lines when it's efficient to process by lines
* stream of field arrays when it's efficient to deal directly with fields
* stream of lightweight objects when it's efficient to deal with objects

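As a rough illustration of those three views, here is a plain-Ruby sketch that needs neither Wukong nor Hadoop. The @Visit@ struct, the helper names and the sample line are all made up for this example; Wukong's streamer classes are what do the equivalent work in a real script.

```ruby
# Hypothetical record type, invented for this sketch (not part of Wukong).
Visit = Struct.new(:user, :url, :seconds)

# View 1: a line is just a string with the trailing newline removed.
def as_line(raw)
  raw.chomp
end

# View 2: a field array is the line split on tabs.
def as_fields(raw)
  as_line(raw).split("\t")
end

# View 3: a lightweight object wraps the fields in a Struct.
def as_object(raw)
  Visit.new(*as_fields(raw))
end

raw = "alice\t/index.html\t42\n"
p as_line(raw)
p as_fields(raw)
p as_object(raw).url
```

Which view is cheapest depends on the job: the line view does no parsing at all, while the object view costs a little more per record but gives the fields names.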
"RDocs for wukong available at rdoc.info":http://rdoc.info/gems/wukong/frames

Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line.

The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com/wukong Please feel free to add supplemental information to the "wukong wiki.":http://wiki.github.com/mrflip/wukong

* "Install and set up wukong":http://mrflip.github.com/wukong/INSTALL.html
* "Tutorial":http://mrflip.github.com/wukong/tutorial.html
* "Usage notes":http://mrflip.github.com/wukong/usage.html
* "Wutils":http://mrflip.github.com/wukong/wutils.html -- command-line utilities for working with data
* Links and tips for "configuring and working with hadoop":http://mrflip.github.com/wukong/hadoop-tips.html
* Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
* "More info":http://mrflip.github.com/wukong/moreinfo.html

h2. Help!

Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code

h2. Install

** "Main Install and Setup Documentation":http://mrflip.github.com/wukong/INSTALL.html **

h3. Get the code

We're still actively developing wukong. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/wukong

pre. $ git clone git://github.com/mrflip/wukong

A gem is available from "gemcutter:":http://gemcutter.org/gems/wukong

pre. $ sudo gem install wukong --source=http://gemcutter.org

(don't use the gems.github.com version -- it's way out of date.)

You can instead download this project in either "zip":http://github.com/mrflip/wukong/zipball/master or "tar":http://github.com/mrflip/wukong/tarball/master formats.

h3. Dependencies and setup

To finish setting up, see the "detailed setup instructions":http://mrflip.github.com/wukong/INSTALL.html and then read the "usage notes":http://mrflip.github.com/wukong/usage.html

h2. How to write a Wukong script

** "Tutorial By Example":http://mrflip.github.com/wukong/tutorial.html **

Here's a script to count words in a text stream:

<pre><code>    require 'wukong'
    module WordCount
      class Mapper < Wukong::Streamer::LineStreamer
        # Emit each word in the line.
        def process line
          words = line.strip.split(/\W+/).reject(&:blank?)
          words.each{|word| yield [word, 1] }
        end
      end

      class Reducer < Wukong::Streamer::ListReducer
        def finalize
          yield [ key, values.map(&:last).map(&:to_i).sum ]
        end
      end
    end

    Wukong::Script.new(
      WordCount::Mapper,
      WordCount::Reducer
      ).run # Execute the script
</code></pre>

The first class, the Mapper, eats lines and spits out @[word, count]@ records: the word is the _key_, its count is the _value_.

In the reducer, the values for each key are stacked up into a list; then the record(s) yielded by @#finalize@ are emitted. There are many other ways to write the reducer (most of them are better) -- see the ["examples":examples/]

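To see the data flow without a cluster, here is a minimal plain-Ruby sketch of the same map, sort and reduce steps. It does not use Wukong at all: @map_line@ and @reduce_pairs@ are hypothetical helper names, @empty?@ stands in for @blank?@, and the in-memory sort only approximates what Hadoop does between the two phases.

```ruby
# Plain-Ruby stand-ins for the Mapper and Reducer above (illustrative only;
# these helper names are invented and are not part of Wukong).

# Map step: turn one line into [word, 1] pairs.
def map_line(line)
  line.strip.split(/\W+/).reject(&:empty?).map { |word| [word, 1] }
end

# Reduce step: pairs arrive sorted by key, so equal words are adjacent.
# Stack up the pairs for each key, then emit [word, total].
def reduce_pairs(sorted_pairs)
  sorted_pairs.chunk_while { |a, b| a.first == b.first }.map do |group|
    [group.first.first, group.map(&:last).inject(0) { |sum, n| sum + n }]
  end
end

lines  = ["the quick brown fox", "the lazy dog, the end"]
mapped = lines.flat_map { |line| map_line(line) }
counts = reduce_pairs(mapped.sort)   # the sort plays the role of Hadoop's shuffle
counts.each { |word, count| puts [word, count].join("\t") }
```

The reducer never needs the whole dataset in memory, only the run of pairs that share the current key, which is what lets the real job scale.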
h3. Structured data stream

You can also use structs to treat your dataset as a stream of objects:

<pre><code>    require 'wukong'
    require 'my_blog' #defines the blog models
    # structs for our input objects
    Tweet = Struct.new( :id, :created_at, :twitter_user_id,
      :in_reply_to_user_id, :in_reply_to_status_id, :text )
    TwitterUser = Struct.new( :id, :username, :fullname,
      :homepage, :location, :description )
    module TwitBlog
      class Mapper < Wukong::Streamer::RecordStreamer
        # Watch for tweets by me
        MY_USER_ID = 24601
        #
        # If this is a tweet by me, convert it to a Post.
        #
        # If it is a tweet not by me, convert it to a Comment that
        # will be paired with the correct Post.
        #
        # If it is a TwitterUser, convert it to a User record and
        # a user_location record
        #
        def process record
          case record
          when TwitterUser
            user = MyBlog::User.new.merge(record) # grab the fields in common
            user_loc = MyBlog::UserLoc.new(record.id, record.location, nil, nil)
            yield user
            yield user_loc
          when Tweet
            if record.twitter_user_id == MY_USER_ID
              post = MyBlog::Post.new.merge record
              post.link = "http://twitter.com/statuses/show/#{record.id}"
              post.body = record.text
              post.title = record.text[0..65] + "..."
              yield post
            else
              comment = MyBlog::Comment.new.merge record
              comment.body = record.text
              comment.post_id = record.in_reply_to_status_id
              yield comment
            end
          end
        end
      end
    end
    Wukong::Script.new( TwitBlog::Mapper, nil ).run # identity reducer
</code></pre>

h3. Advanced Patterns

Wukong has a good collection of map/reduce patterns. Here's an AccumulatingReducer that takes a long list of key-value pairs and emits, for each key, all its corresponding values in one line.

<pre><code>    #
    # Roll up all values for each key into a single line
    #
    class GroupByReducer < Wukong::Streamer::AccumulatingReducer
      attr_accessor :values

      # Start with an empty list
      def start! *args
        self.values = []
      end

      # Aggregate each value in turn
      def accumulate key, value
        self.values << value
      end

      # Emit the key and all values, tab-separated
      def finalize
        yield [key, values].flatten
      end
    end
</code></pre>

So given adjacency pairs for the following directed friend graph:

<pre><code>
@jerry @elaine
@elaine @jerry
@jerry @kramer
@kramer @jerry
@kramer @bobsacamato
@kramer @newman
@jerry @superman
@newman @kramer
@newman @elaine
@newman @jerry
</code></pre>

You'd end up with

<pre><code>
@elaine @jerry
@jerry @elaine @kramer @superman
@kramer @bobsacamato @jerry @newman
@newman @elaine @jerry @kramer
</code></pre>

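The same roll-up can be imitated in plain Ruby to check the result by hand. This is only a sketch of the start / accumulate / finalize cycle, not Wukong's implementation: @group_by_key@ is an invented name, and sorting whole pairs in memory is an assumption standing in for the sorted stream a reducer receives (it is also why the values come out in order here).

```ruby
# Plain-Ruby imitation of the GroupByReducer's start!/accumulate/finalize
# cycle (illustrative; the method name is invented for this sketch).
def group_by_key(pairs)
  rows = []
  key, values = nil, []
  pairs.sort.each do |k, v|
    if k != key
      rows << [key, values].flatten unless key.nil?   # finalize the previous key
      key, values = k, []                              # start a new key
    end
    values << v                                        # accumulate
  end
  rows << [key, values].flatten unless key.nil?        # finalize the last key
  rows
end

pairs = [
  %w[@jerry @elaine], %w[@elaine @jerry], %w[@jerry @kramer],
  %w[@kramer @jerry], %w[@kramer @bobsacamato], %w[@kramer @newman],
  %w[@jerry @superman], %w[@newman @kramer], %w[@newman @elaine],
  %w[@newman @jerry],
]
group_by_key(pairs).each { |row| puts row.join("\t") }
```

Only one key's values are held at a time, so the pattern works on inputs far larger than memory as long as no single key has an enormous number of values.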
h2. Gotchas

h4. RecordStreamer dies on blank lines with "wrong number of arguments"

If your lines don't always have a full complement of fields, and you define @#process@ to take fixed named arguments, then Ruby will complain when some of them don't show up:

<pre>
class MyUnhappyMapper < Wukong::Streamer::RecordStreamer
  # this will fail if the line has more or fewer than 3 fields:
  def process x, y, z
    p [x, y, z]
  end
end
</pre>

The cleanest way I know to fix this is to override @recordize@, which always returns an array of fields:

<pre>
class MyHappyMapper < Wukong::Streamer::RecordStreamer
  # extracts three fields always; any missing fields are nil, any extra fields discarded
  # @example
  #   recordize("a")             # ["a", nil, nil]
  #   recordize("a\tb\tc")       # ["a", "b", "c"]
  #   recordize("a\tb\tc\td")    # ["a", "b", "c"]
  def recordize raw_record
    x, y, z = super(raw_record)
    [x, y, z]
  end

  # Now all lines produce exactly three args
  def process x, y, z
    p [x, y, z]
  end
end
</pre>

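The failure and the fix both come down to ordinary Ruby behaviour, which the following sketch shows without Wukong. The top-level @process@ and @three_fields@ methods are stand-ins invented for the example; the assumption is simply that the streamer splats the recordized fields into @#process@.

```ruby
# Stand-in for a #process method that insists on exactly three fields.
def process(x, y, z)
  [x, y, z]
end

# Pad or truncate an arbitrary field list to exactly three entries,
# using the same multiple-assignment trick as MyHappyMapper#recordize.
def three_fields(fields)
  x, y, z = fields
  [x, y, z]
end

short = "a".split("\t")            # one field
long  = "a\tb\tc\td".split("\t")   # four fields

begin
  process(*short)                  # too few arguments
rescue ArgumentError => e
  puts "short line failed: #{e.message}"
end

p process(*three_fields(short))    # missing fields become nil
p process(*three_fields(long))     # extra fields are dropped
```

Multiple assignment never raises on a length mismatch, which is why routing the fields through it makes the method call safe.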
If you want to preserve any extra fields, pass a limit as the second argument to @String#split@:

<pre>
class MyMoreThanHappyMapper < Wukong::Streamer::RecordStreamer
  # extracts three fields always; any missing fields are nil, the final field will contain a tab-separated string of all trailing fields
  # @example
  #   recordize("a")             # ["a", nil, nil]
  #   recordize("a\tb\tc")       # ["a", "b", "c"]
  #   recordize("a\tb\tc\td")    # ["a", "b", "c\td"]
  def recordize raw_record
    # a limit of 3 stops splitting after the second tab
    x, y, z = raw_record.split("\t", 3)
    [x, y, z]
  end

  # Now all lines produce exactly three args
  def process x, y, z
    p [x, y, z]
  end
end
</pre>

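The limit argument is standard @String#split@ behaviour, so it can be tried in isolation. In this sketch @split_keeping_extras@ is a hypothetical helper, and the @chomp@ is an assumption for the case where the raw record still carries its newline.

```ruby
# Split a tab-separated line into exactly `count` fields; the last field
# keeps any remaining tabs. (Helper name invented for this sketch.)
def split_keeping_extras(line, count = 3)
  fields = line.chomp.split("\t", count)
  fields + [nil] * (count - fields.length)   # pad short lines with nil
end

p split_keeping_extras("a")
p split_keeping_extras("a\tb\tc")
p split_keeping_extras("a\tb\tc\td\te")
```

Because the trailing fields stay joined by tabs in the last slot, they can be split again later if a downstream step needs them.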
h2. Why is it called Wukong?

Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill:

bq. Sun Wukong (孙悟空), known in the West as the Monkey King, is the main character in the classical Chinese epic novel Journey to the West. In the novel, he accompanies the monk Xuanzang on the journey to retrieve Buddhist sutras from India.

bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn (8,100 kg) Ruyi Jingu Bang with ease. He also has superb speed, traveling 108,000 li (54,000 kilometers) in one somersault. Sun knows 72 transformations, which allows him to transform into various animals and objects; he is, however, shown with slight problems transforming into other people, since he is unable to complete the transformation of his tail. He is a skilled fighter, capable of holding his own against the best generals of heaven. Each of his hairs possesses magical properties, and is capable of transforming into a clone of the Monkey King himself, or various weapons, animals, and other objects. He also knows various spells in order to command wind, part water, conjure protective circles against demons, freeze humans, demons, and gods alike. -- ["Sun Wukong's Wikipedia entry":http://en.wikipedia.org/wiki/Wukong]

The "Jamie Hewlett / Damon Albarn short":http://news.bbc.co.uk/sport1/hi/olympics/monkey that the BBC made for their 2008 Olympics coverage gives the general idea.

<notextile><div class="toggle"></notextile>

h2. More info

There are many useful examples in the examples/ directory.

h3. Credits

Wukong was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org / "@mrflip":http://twitter.com/mrflip) for the "infochimps project":http://infochimps.org

Patches submitted by:
* gemified by Ben Woosley (ben.woosley with the gmails)
* ruby interpreter path fix by "Yuichiro MASUI":http://github.com/masuidrive - masui at masuidrive.jp - http://blog.masuidrive.jp/

Thanks to:
* "Fredrik Möllerstrand (@lenbust)":http://twitter.com/lenbust for the examples/contrib/jeans working example
* "Brad Heintz":http://www.bradheintz.com/no1thing/talks/ for his early feedback
* "Phil Ripperger":http://blog.pdatasolutions.com for his "wukong in the Amazon AWS cloud":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart tutorial.

h3. Help!

Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code, to info@infochimps.com or call 855-DATA-FUN

You're invited to talk with author Philip (flip) Kromer in a "private consultation":http://www.infochimps.com/free-big-data-consultation?utm_source=git&utm_medium=referral&utm_campaign=consult about your big data project.

<notextile></div></notextile>