h1. Wukong

Wukong is Ruby for Hadoop -- it makes "Hadoop":http://hadoop.apache.org/core so easy a chimpanzee can use it.

Treat your dataset like a
* stream of lines when it's efficient to process by lines
* stream of field arrays when it's efficient to deal directly with fields
* stream of lightweight objects when it's efficient to deal with objects

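As a rough plain-Ruby sketch of those three views (no wukong or Hadoop involved; the @Tweet@ struct, its fields and the sample line are made up for illustration), the same tab-separated record can be handled as a line, a field array or an object:

```ruby
# Hypothetical record layout, used only for this illustration.
Tweet = Struct.new(:id, :user_id, :text)

line = "101\t24601\thello world\n"

# 1. As a line: the raw string, handy for grep-like filtering.
as_line = line.chomp

# 2. As a field array: split on tabs, handy for column-wise work.
as_fields = as_line.split("\t")

# 3. As a lightweight object: named accessors over the same fields.
as_object = Tweet.new(*as_fields)

p as_line         # "101\t24601\thello world"
p as_fields       # ["101", "24601", "hello world"]
p as_object.text  # "hello world"
```

Which view is cheapest depends on the job: filtering rarely needs more than the line, while joins and rewrites usually read better with named fields.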

"RDocs for wukong available at rdoc.info":http://rdoc.info/gems/wukong/frames

Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line.

The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com/wukong Please feel free to add supplemental information to the "wukong wiki.":http://wiki.github.com/mrflip/wukong

* "Install and set up wukong":http://mrflip.github.com/wukong/INSTALL.html
* "Tutorial":http://mrflip.github.com/wukong/tutorial.html
* "Usage notes":http://mrflip.github.com/wukong/usage.html
* "Wutils":http://mrflip.github.com/wukong/wutils.html -- command-line utilities for working with data
* Links and tips for "configuring and working with hadoop":http://mrflip.github.com/wukong/hadoop-tips.html
* Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
* "More info":http://mrflip.github.com/wukong/moreinfo.html

h2. Help!

Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code

h2. Install

** "Main Install and Setup Documentation":http://mrflip.github.com/wukong/INSTALL.html **

h3. Get the code

We're still actively developing wukong. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/wukong

pre. $ git clone git://github.com/mrflip/wukong

A gem is available from "gemcutter:":http://gemcutter.org/gems/wukong

pre. $ sudo gem install wukong --source=http://gemcutter.org

(don't use the gems.github.com version -- it's way out of date.)

You can instead download this project in either "zip":http://github.com/mrflip/wukong/zipball/master or "tar":http://github.com/mrflip/wukong/tarball/master formats.

h3. Dependencies and setup

To finish setting up, see the "detailed setup instructions":http://mrflip.github.com/wukong/INSTALL.html and then read the "usage notes":http://mrflip.github.com/wukong/usage.html

h2. How to write a Wukong script

** "Tutorial By Example":http://mrflip.github.com/wukong/tutorial.html **

Here's a script to count words in a text stream:

<pre><code>    require 'wukong'
    module WordCount
      class Mapper < Wukong::Streamer::LineStreamer
        # Emit each word in the line.
        def process line
          words = line.strip.split(/\W+/).reject(&:blank?)
          words.each{|word| yield [word, 1] }
        end
      end

      class Reducer < Wukong::Streamer::ListReducer
        def finalize
          yield [ key, values.map(&:last).map(&:to_i).sum ]
        end
      end
    end

    Wukong::Script.new(
      WordCount::Mapper,
      WordCount::Reducer
      ).run # Execute the script
</code></pre>

The first class, the Mapper, eats lines and craps @[word, count]@ records: word is the /key/, its count is the /value/.

In the reducer, the values for each key are stacked up into a list; then the record(s) yielded by @#finalize@ are emitted. There are many other ways to write the reducer (most of them are better) -- see the ["examples":examples/]

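To see the data flow without a cluster, here is a minimal plain-Ruby sketch of the same map, sort, reduce sequence. It does not use wukong at all: @map_line@ and @reduce_group@ are hypothetical helper names, @sort_by@ stands in for the sort Hadoop performs between the two phases, and @empty?@ replaces @blank?@ so the snippet needs no extra libraries (it does assume Ruby 2.4 or later for @Array#sum@).

```ruby
# Map step: emit a [word, 1] pair for each word in a line.
def map_line(line)
  line.strip.split(/\W+/).reject(&:empty?).map { |word| [word, 1] }
end

# Reduce step: given one key and its stacked-up records, emit the total.
def reduce_group(key, records)
  [key, records.map(&:last).map(&:to_i).sum]
end

lines = ["the cat sat", "the cat ran"]

# Sorting by key puts all records for one word next to each other,
# which is what the reducer relies on.
pairs  = lines.flat_map { |line| map_line(line) }.sort_by(&:first)
counts = pairs.group_by(&:first).map { |key, records| reduce_group(key, records) }

counts.each { |word, count| puts [word, count].join("\t") }
```

With the two sample lines this prints one tab-separated line per word: cat 2, ran 1, sat 1, the 2.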

h3. Structured data stream

You can also use structs to treat your dataset as a stream of objects:

<pre><code>    require 'wukong'
    require 'my_blog' # defines the blog models
    # structs for our input objects
    Tweet = Struct.new( :id, :created_at, :twitter_user_id,
      :in_reply_to_user_id, :in_reply_to_status_id, :text )
    TwitterUser = Struct.new( :id, :username, :fullname,
      :homepage, :location, :description )
    module TwitBlog
      class Mapper < Wukong::Streamer::RecordStreamer
        # Watch for tweets by me
        MY_USER_ID = 24601
        #
        # If this is a tweet by me, convert it to a Post.
        #
        # If it is a tweet not by me, convert it to a Comment that
        # will be paired with the correct Post.
        #
        # If it is a TwitterUser, convert it to a User record and
        # a user_location record
        #
        def process record
          case record
          when TwitterUser
            user     = MyBlog::User.new.merge(record) # grab the fields in common
            user_loc = MyBlog::UserLoc.new(record.id, record.location, nil, nil)
            yield user
            yield user_loc
          when Tweet
            if record.twitter_user_id == MY_USER_ID
              post = MyBlog::Post.new.merge record
              post.link  = "http://twitter.com/statuses/show/#{record.id}"
              post.body  = record.text
              post.title = record.text[0..65] + "..."
              yield post
            else
              comment = MyBlog::Comment.new.merge record
              comment.body    = record.text
              comment.post_id = record.in_reply_to_status_id
              yield comment
            end
          end
        end
      end
    end
    Wukong::Script.new( TwitBlog::Mapper, nil ).run # identity reducer
</code></pre>

h3. Advanced Patterns

Wukong has a good collection of map/reduce patterns. Here's an AccumulatingReducer that takes a long list of key-value pairs and emits, for each key, all its corresponding values in one line.

<pre><code>    #
    # Roll up all values for each key into a single line
    #
    class GroupByReducer < Wukong::Streamer::AccumulatingReducer
      attr_accessor :values

      # Start with an empty list
      def start! *args
        self.values = []
      end

      # Aggregate each value in turn
      def accumulate key, value
        self.values << value
      end

      # Emit the key and all values, tab-separated
      def finalize
        yield [key, values].flatten
      end
    end
</code></pre>

So given adjacency pairs for the following directed friend graph:

<pre><code>
    @jerry      @elaine
    @elaine     @jerry
    @jerry      @kramer
    @kramer     @jerry
    @kramer     @bobsacamato
    @kramer     @newman
    @jerry      @superman
    @newman     @kramer
    @newman     @elaine
    @newman     @jerry
</code></pre>

You'd end up with

<pre><code>
    @elaine     @jerry
    @jerry      @elaine     @kramer     @superman
    @kramer     @bobsacamato    @jerry  @newman
    @newman     @elaine     @jerry  @kramer
</code></pre>

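The roll-up can be traced by hand in plain Ruby. This is only a sketch of the start / accumulate / finalize rhythm, not wukong's implementation: a full @sort@ of the pairs stands in for Hadoop's shuffle, and the variable names are made up for the illustration.

```ruby
pairs = [
  %w[@jerry @elaine], %w[@elaine @jerry], %w[@jerry @kramer],
  %w[@kramer @jerry], %w[@kramer @bobsacamato], %w[@kramer @newman],
  %w[@jerry @superman], %w[@newman @kramer], %w[@newman @elaine],
  %w[@newman @jerry]
]

rolled_up   = []
current_key = nil
values      = nil

# Sorting makes all records for one key adjacent.
pairs.sort.each do |key, value|
  if key != current_key
    # "finalize" the previous key, then "start!" a fresh list
    rolled_up << [current_key, values].flatten unless current_key.nil?
    current_key = key
    values      = []
  end
  values << value # "accumulate"
end
# "finalize" the last key
rolled_up << [current_key, values].flatten unless current_key.nil?

rolled_up.each { |row| puts row.join("\t") }
```

The four printed lines match the rolled-up listing above. Note that only one key's values are held in memory at a time, which is the point of the accumulating pattern.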

h2. Gotchas

h4. RecordStreamer dies on blank lines with "wrong number of arguments"

If your lines don't always have a full complement of fields, and you define #process() to take fixed named arguments, then ruby will complain when some of them don't show up:

<pre>
    class MyUnhappyMapper < Wukong::Streamer::RecordStreamer
      # this will fail if the line has more or fewer than 3 fields:
      def process x, y, z
        p [x, y, z]
      end
    end
</pre>

The cleanest way I know to fix this is to override recordize, which (as you should recall) always returns an array of fields:

<pre>
    class MyHappyMapper < Wukong::Streamer::RecordStreamer
      # extracts three fields always; any missing fields are nil, any extra fields discarded
      # @example
      #   recordize("a")            # ["a", nil, nil]
      #   recordize("a\tb\tc")      # ["a", "b", "c"]
      #   recordize("a\tb\tc\td")   # ["a", "b", "c"]
      def recordize raw_record
        x, y, z = super(raw_record)
        [x, y, z]
      end

      # Now all lines produce exactly three args
      def process x, y, z
        p [x, y, z]
      end
    end
</pre>

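The nil-padding comes from Ruby's multiple assignment rather than from anything wukong-specific. A small plain-Ruby sketch, where @process_fixed@ and @three_fields@ are made-up helper names and a bare @split("\t")@ stands in for whatever @super@ returns:

```ruby
# A fixed-arity method, like the unhappy #process above.
def process_fixed(x, y, z)
  [x, y, z]
end

# Splatting a one-field record into it raises ArgumentError.
begin
  process_fixed(*"a".split("\t"))
rescue ArgumentError => e
  puts "fixed-arity call failed: #{e.message}"
end

# Multiple assignment pads missing values with nil and drops extras,
# so the result always has exactly three elements.
def three_fields(raw_record)
  x, y, z = raw_record.split("\t")
  [x, y, z]
end

p three_fields("a")           # ["a", nil, nil]
p three_fields("a\tb\tc")     # ["a", "b", "c"]
p three_fields("a\tb\tc\td")  # ["a", "b", "c"]
```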

If you want to preserve any extra fields, use the extra (limit) argument to String#split():

<pre>
    class MyMoreThanHappyMapper < Wukong::Streamer::RecordStreamer
      # extracts three fields always; any missing fields are nil, the final field will contain a tab-separated string of all trailing fields
      # @example
      #   recordize("a")            # ["a", nil, nil]
      #   recordize("a\tb\tc")      # ["a", "b", "c"]
      #   recordize("a\tb\tc\td")   # ["a", "b", "c\td"]
      def recordize raw_record
        x, y, z = raw_record.split("\t", 3)
        [x, y, z]
      end

      # Now all lines produce exactly three args
      def process x, y, z
        p [x, y, z]
      end
    end
</pre>

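The limit argument has one side effect worth knowing about: with a positive limit, @String#split@ also keeps trailing empty fields that a plain split would drop. A quick plain-Ruby check (the helper name @three_fields_keep_rest@ is made up for this sketch):

```ruby
# Split into at most three fields; everything after the second tab
# stays together in the last field.
def three_fields_keep_rest(raw_record)
  x, y, z = raw_record.split("\t", 3)
  [x, y, z]
end

p three_fields_keep_rest("a")           # ["a", nil, nil]
p three_fields_keep_rest("a\tb\tc\td")  # ["a", "b", "c\td"]

# Trailing empty fields: dropped without a limit, kept with one.
p "a\t\t".split("\t")     # ["a"]
p "a\t\t".split("\t", 3)  # ["a", "", ""]
```

So a record with blank trailing columns yields empty strings rather than nils here; normalize them in recordize if your #process cares about the difference.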

h2. Why is it called Wukong?

Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill:

bq. Sun Wukong (孙悟空), known in the West as the Monkey King, is the main character in the classical Chinese epic novel Journey to the West. In the novel, he accompanies the monk Xuanzang on the journey to retrieve Buddhist sutras from India.

bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn (8,100 kg) Ruyi Jingu Bang with ease. He also has superb speed, traveling 108,000 li (54,000 kilometers) in one somersault. Sun knows 72 transformations, which allows him to transform into various animals and objects; he is, however, shown with slight problems transforming into other people, since he is unable to complete the transformation of his tail. He is a skilled fighter, capable of holding his own against the best generals of heaven. Each of his hairs possesses magical properties, and is capable of transforming into a clone of the Monkey King himself, or various weapons, animals, and other objects. He also knows various spells in order to command wind, part water, conjure protective circles against demons, freeze humans, demons, and gods alike. -- ["Sun Wukong's Wikipedia entry":http://en.wikipedia.org/wiki/Wukong]

The "Jamie Hewlett / Damon Albarn short":http://news.bbc.co.uk/sport1/hi/olympics/monkey that the BBC made for their 2008 Olympics coverage gives the general idea.

<notextile><div class="toggle"></notextile>

h2. More info

There are many useful examples in the examples/ directory.

h3. Credits

Wukong was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org / "@mrflip":http://twitter.com/mrflip) for the "infochimps project":http://infochimps.org

Patches submitted by:
* gemified by Ben Woosley (ben.woosley with the gmails)
* ruby interpreter path fix by "Yuichiro MASUI":http://github.com/masuidrive - masui at masuidrive.jp - http://blog.masuidrive.jp/

Thanks to:
* "Fredrik Möllerstrand (@lenbust)":http://twitter.com/lenbust for the examples/contrib/jeans working example
* "Brad Heintz":http://www.bradheintz.com/no1thing/talks/ for his early feedback
* "Phil Ripperger":http://blog.pdatasolutions.com for his "wukong in the Amazon AWS cloud":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart tutorial.

h3. Help!

Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code, to info@infochimps.com, or call 855-DATA-FUN.

Also, you are invited to talk with author Philip (flip) Kromer in a "private consultation":http://www.infochimps.com/free-big-data-consultation?utm_source=git&utm_medium=referral&utm_campaign=consult about your big data project.

<notextile></div></notextile>