h1. Wukong

Wukong makes "Hadoop":http://hadoop.apache.org/core so easy a chimpanzee can use it.

Treat your dataset as a

* stream of lines when it's efficient to process by lines
* stream of field arrays when it's efficient to deal directly with fields
* stream of lightweight objects when it's efficient to deal with objects

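As a rough illustration of those three views, here is a plain-Ruby sketch that does not use Wukong at all. The @Tweet@ struct, its field names, and the helper methods are made up for this example.

```ruby
# Three views of one tab-separated record. The Tweet struct and its field
# names are hypothetical; they only serve this illustration.
Tweet = Struct.new(:id, :user, :text)

# View 1: the raw line. Enough for cheap filters such as a substring match.
def mentions?(line, word)
  line.include?(word)
end

# View 2: an array of fields, split on tabs.
def to_fields(line)
  line.chomp.split("\t")
end

# View 3: a lightweight object with named fields.
def to_tweet(line)
  Tweet.new(*to_fields(line))
end

line = "42\tflip\thello hadoop\n"
puts mentions?(line, "hadoop")
p to_fields(line)
puts to_tweet(line).user
```

Each step up costs a little more per record, which is why it pays to pick the cheapest view that does the job.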

Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line.

The main documentation -- including tutorials and tips for working with big data -- lives on the "Wukong Pages":http://mrflip.github.com/wukong and there is some supplemental information on the "wukong wiki.":http://wiki.github.com/mrflip/wukong

* "Install and set up wukong":http://mrflip.github.com/wukong/INSTALL.html
* "Tutorial":http://mrflip.github.com/wukong/tutorial.html
* "Usage notes":http://mrflip.github.com/wukong/usage.html
* "Wutils":http://mrflip.github.com/wukong/wutils.html -- command-line utilities for working with data
* Links and tips for "configuring and working with hadoop":http://mrflip.github.com/wukong/hadoop-tips.html
* Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
* "More info":http://mrflip.github.com/wukong/moreinfo.html

h2. Install

Wukong is still under active development. The newest version is available at

    http://github.com/mrflip/wukong

A gem is available from "github:":http://gems.github.com

    gem install mrflip-wukong --source=http://gems.github.com

or from "gemcutter":http://gemcutter.org

    gem install wukong --source=http://gemcutter.org

Phil Ripperger has prepared "instructions on getting wukong to work on the Amazon AWS cloud.":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart Thanks Phil!

h2. How to write a Wukong script

Here's a script to count words in a text stream:

<pre><code>    require 'wukong'
    module WordCount
      class Mapper < Wukong::Streamer::LineStreamer
        # Emit each word in the line.
        def process line
          words = line.strip.split(/\W+/).reject(&:blank?)
          words.each{|word| yield [word, 1] }
        end
      end

      class Reducer < Wukong::Streamer::ListReducer
        # Sum the counts collected for this key.
        def finalize
          yield [ key, values.map(&:last).map(&:to_i).sum ]
        end
      end
    end

    Wukong::Script.new(
      WordCount::Mapper,
      WordCount::Reducer
      ).run # Execute the script
</code></pre>

The first class, the Mapper, eats lines and craps @[word, count]@ records: the word is the _key_, its count is the _value_.

In the reducer, the values for each key are stacked up into a list; then the record(s) yielded by @#finalize@ are emitted. There are many other ways to write the reducer (most of them are better) -- see the ["examples":examples/].

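To see the flow end to end without a cluster, here is a plain-Ruby imitation of map, sort, and reduce. It is a sketch, not Wukong's implementation: @map_line@, @reduce_group@, and @word_count@ are hypothetical helpers, and it uses core Ruby's @empty?@ where the script above uses @blank?@.

```ruby
# A plain-Ruby imitation of the map -> sort -> reduce flow; it does not use
# Wukong or Hadoop. map_line, reduce_group and word_count are hypothetical
# helper names invented for this sketch.

# Map step: yield a [word, 1] pair for every word in the line.
def map_line(line)
  line.strip.split(/\W+/).reject(&:empty?).each { |word| yield [word, 1] }
end

# Reduce step: given one key and all of its records, return [key, total].
def reduce_group(key, records)
  [key, records.map(&:last).map(&:to_i).sum]
end

def word_count(lines)
  pairs = []
  lines.each { |line| map_line(line) { |pair| pairs << pair } }
  # Hadoop sorts mapper output by key before the reducer sees it, so all
  # records for one key arrive together. sort + chunk_while stands in for that.
  pairs.sort
       .chunk_while { |a, b| a.first == b.first }
       .map { |group| reduce_group(group.first.first, group) }
end

p word_count(["the cat and the hat", "the end"])
```

The sort in the middle is what makes the reducer simple: every record for a key arrives in one contiguous run.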

h3. Structured data stream

You can also use structs to treat your dataset as a stream of objects:

<pre><code>    require 'wukong'
    require 'my_blog' # defines the blog models
    # structs for our input objects
    Tweet = Struct.new( :id, :created_at, :twitter_user_id,
      :in_reply_to_user_id, :in_reply_to_status_id, :text )
    TwitterUser = Struct.new( :id, :username, :fullname,
      :homepage, :location, :description )
    module TwitBlog
      class Mapper < Wukong::Streamer::RecordStreamer
        # Watch for tweets by me
        MY_USER_ID = 24601
        #
        # If this is a tweet by me, convert it to a Post.
        #
        # If it is a tweet not by me, convert it to a Comment that
        # will be paired with the correct Post.
        #
        # If it is a TwitterUser, convert it to a User record and
        # a user_location record
        #
        def process record
          case record
          when TwitterUser
            user     = MyBlog::User.new.merge(record) # grab the fields in common
            user_loc = MyBlog::UserLoc.new(record.id, record.location, nil, nil)
            yield user
            yield user_loc
          when Tweet
            if record.twitter_user_id == MY_USER_ID
              post = MyBlog::Post.new.merge record
              post.link  = "http://twitter.com/statuses/show/#{record.id}"
              post.body  = record.text
              post.title = record.text[0..65] + "..."
              yield post
            else
              comment = MyBlog::Comment.new.merge record
              comment.body    = record.text
              comment.post_id = record.in_reply_to_status_id
              yield comment
            end
          end
        end
      end
    end
    Wukong::Script.new( TwitBlog::Mapper, nil ).run # identity reducer
</code></pre>

h3. Advanced Patterns

Wukong has a good collection of map/reduce patterns. For example, it's quite common to accumulate all records for a given key and emit some result based on the whole group.

The AccumulatingReducer calls @start!@ on the first record for each key, calls @accumulate@ on every record for that key (including the first), and calls @finalize@ once the last record for that key has been seen.

Here's an AccumulatingReducer that takes a long list of key-value pairs and emits, for each key, all its corresponding values in one line.

<pre><code>    #
    # Roll up all values for each key into a single line
    #
    class GroupByReducer < Wukong::Streamer::AccumulatingReducer
      attr_accessor :values

      # Start with an empty list
      def start! *args
        self.values = []
      end

      # Aggregate each value in turn
      def accumulate key, value
        self.values << value
      end

      # Emit the key and all values, tab-separated
      def finalize
        yield [key, values].flatten
      end
    end
</code></pre>

So given adjacency pairs for the following directed friend graph:

<pre><code>
    @jerry   @elaine
    @elaine  @jerry
    @jerry   @kramer
    @kramer  @jerry
    @kramer  @bobsacamato
    @kramer  @newman
    @jerry   @superman
    @newman  @kramer
    @newman  @elaine
    @newman  @jerry
</code></pre>

You'd end up with

<pre><code>
    @elaine  @jerry
    @jerry   @elaine  @kramer  @superman
    @kramer  @bobsacamato  @jerry  @newman
    @newman  @elaine  @jerry  @kramer
</code></pre>

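To make the order of the callbacks concrete, here is a stand-alone sketch that drives a reducer-shaped object over key-sorted pairs. It is an approximation written for this README, not Wukong's code: @MiniGroupBy@ and @run_reducer@ are hypothetical names, and the real reducer reads tab-separated lines from standard input rather than an array.

```ruby
# A stand-in for the accumulate-by-key lifecycle, written in plain Ruby so it
# runs without Wukong. MiniGroupBy and run_reducer are hypothetical names.
class MiniGroupBy
  attr_accessor :key, :values

  # Called once, on the first record of each key.
  def start!(*_args)
    self.values = []
  end

  # Called for every record of the key, including the first.
  def accumulate(_key, value)
    values << value
  end

  # Called after the last record of the key.
  def finalize
    yield [key, values].flatten
  end
end

# Walk key-sorted pairs, firing the callbacks whenever the key changes.
def run_reducer(reducer, sorted_pairs)
  out = []
  sorted_pairs.each do |key, value|
    if reducer.key != key
      reducer.finalize { |rec| out << rec } unless reducer.key.nil?
      reducer.key = key
      reducer.start!(key, value)
    end
    reducer.accumulate(key, value)
  end
  reducer.finalize { |rec| out << rec } unless reducer.key.nil?
  out
end

pairs = [
  %w[@jerry @elaine], %w[@elaine @jerry], %w[@jerry @kramer],
  %w[@kramer @jerry], %w[@kramer @bobsacamato], %w[@kramer @newman],
  %w[@jerry @superman], %w[@newman @kramer], %w[@newman @elaine],
  %w[@newman @jerry]
]
run_reducer(MiniGroupBy.new, pairs.sort).each { |rec| puts rec.join("\t") }
```

Sorting the pairs first matters: the driver only notices a new group when the key changes, so unsorted input would split one key into several output lines.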

h3. More info

There are many useful examples (including an actually-useful version of the WordCount script) in the @examples/@ directory.

h2. Setup

1. Allow Wukong to discover where his elephant friend lives: either

* set a @$HADOOP_HOME@ environment variable,

* or create a file @config/wukong-site.yaml@ with a line that points to the top-level directory of your hadoop install:

    @:hadoop_home: /usr/local/share/hadoop@

2. Add wukong's @bin/@ directory to your @$PATH@, so that you may use its filesystem shortcuts.

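For the curious, the lookup in step 1 could be sketched roughly as below. This is an illustration only: @find_hadoop_home@ is a hypothetical helper rather than part of Wukong, and letting the environment variable win over the site file is an assumption.

```ruby
require 'yaml'

# Resolve the Hadoop install directory from the two sources named in step 1:
# $HADOOP_HOME first, then :hadoop_home from a YAML site file.
# find_hadoop_home is a hypothetical helper, not part of Wukong's API, and
# the precedence between the two sources is an assumption.
def find_hadoop_home(env: ENV, site_file: 'config/wukong-site.yaml')
  from_env = env['HADOOP_HOME']
  return from_env unless from_env.nil? || from_env.empty?
  return nil unless File.exist?(site_file)

  # The site file uses a symbol key (":hadoop_home:"), so Symbol must be
  # explicitly permitted when loading safely.
  config = YAML.safe_load(File.read(site_file), permitted_classes: [Symbol]) || {}
  config[:hadoop_home]
end

puts find_hadoop_home.inspect
```

Note the leading colon in @:hadoop_home:@. It makes the key a Ruby symbol, which is why the sketch reads @config[:hadoop_home]@ rather than a string key.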

h2. How to run a Wukong script

To run your script using local files and no connection to a hadoop cluster,

@your/script.rb --run=local path/to/input_files path/to/output_dir@

To run the command across a Hadoop cluster,

@your/script.rb --run=hadoop path/to/input_files path/to/output_dir@

You can set the default run mode in the @config/wukong-site.yaml@ file and then pass a bare @--run@ instead of @--run=something@; the script will use the default run mode.

If you're running @--run=hadoop@, all file paths are HDFS paths. If you're running @--run=local@, all file paths are local paths. (The path to @your/script.rb@ itself, of course, lives on the local filesystem.)

You can supply arbitrary command line arguments (they wind up as key-value pairs in the options hash your mapper and reducer receive), and you can use the hadoop syntax to specify more than one input file:

    ./path/to/your/script.rb --any_specific_options --options=can_have_vals \
      --run "input_dir/part_*,input_file2.tsv,etc.tsv" path/to/output_dir

Note that all @--options@ must precede (in any order) all non-options.

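The options-first rule is easy to picture with a small splitter. The sketch below is a hypothetical stand-in, not Wukong's actual parser: leading @--key@ and @--key=value@ arguments become entries in a hash, and everything from the first non-option onward is treated as a path.

```ruby
# A hypothetical argument splitter illustrating the options-first rule.
# This is not Wukong's parser; OPTION_RE and split_args are made-up names.
OPTION_RE = /\A--([\w-]+)(?:=(.*))?\z/m

def split_args(argv)
  options = {}
  rest = argv.dup
  # Consume leading "--key" / "--key=value" arguments; stop at the first
  # argument that is not an option.
  while (m = OPTION_RE.match(rest.first.to_s))
    options[m[1].to_sym] = m[2].nil? ? true : m[2]
    rest.shift
  end
  [options, rest]
end

options, paths = split_args(%w[--run=local --verbose in.tsv out_dir])
p options
p paths
```

Because the loop stops at the first non-option, anything that looks like an option after a path is left in the path list, which is why the ordering matters.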

h2. How to test your scripts

To run the mapper on its own:

    cat ./local/test/input.tsv | ./examples/word_count.rb --map | more

or, if your test data lies on the HDFS,

    hdp-cat test/input.tsv | ./examples/word_count.rb --map | more

Next, graduate to @--run=local@ mode so you can inspect the reducer.

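Because records are handed back with @yield@, a @process@ method can also be exercised from plain Ruby. The sketch below uses a stand-in class (@WordMapper@ is hypothetical and does not inherit from any Wukong streamer) to show one way of capturing everything a mapper-shaped method emits.

```ruby
# A mapper-shaped class in plain Ruby (no Wukong needed) to show how records
# passed to yield can be captured for inspection. WordMapper is hypothetical.
class WordMapper
  def process(line)
    line.strip.split(/\W+/).reject(&:empty?).each { |word| yield [word, 1] }
  end
end

# enum_for wraps the yielding method in an Enumerator, so to_a returns every
# record the method would have emitted for that line.
records = WordMapper.new.enum_for(:process, "to be or not to be").to_a
p records
```

The same trick should carry over to a real streamer's @process@ method, provided the class can be instantiated outside a running script.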

h2. What's up with Wukong::AndPig?

@Wukong::AndPig@ is a small library to more easily generate code for the
"Pig":http://hadoop.apache.org/pig data analysis language. See its
"README":wukong/and_pig/README.textile for more.

h2. Why is it called Wukong?

Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill:

bq. Sun Wukong (孙悟空), known in the West as the Monkey King, is the main character in the classical Chinese epic novel Journey to the West. In the novel, he accompanies the monk Xuanzang on the journey to retrieve Buddhist sutras from India.

bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn (8,100 kg) Ruyi Jingu Bang with ease. He also has superb speed, traveling 108,000 li (54,000 kilometers) in one somersault. Sun knows 72 transformations, which allows him to transform into various animals and objects; he is, however, shown with slight problems transforming into other people, since he is unable to complete the transformation of his tail. He is a skilled fighter, capable of holding his own against the best generals of heaven. Each of his hairs possesses magical properties, and is capable of transforming into a clone of the Monkey King himself, or various weapons, animals, and other objects. He also knows various spells in order to command wind, part water, conjure protective circles against demons, freeze humans, demons, and gods alike. -- ["Sun Wukong's Wikipedia entry":http://en.wikipedia.org/wiki/Wukong]

The "Jamie Hewlett / Damon Albarn short":http://news.bbc.co.uk/sport1/hi/olympics/monkey that the BBC made for their 2008 Olympics coverage gives the general idea.

h2. What tools does Wukong work with?

Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line. We're looking forward to being friends with "martinis":http://datamapper.org and "express trains":http://wiki.rubyonrails.org/rails/pages/ActiveRecord down the road.