h1. Wukong

Wukong makes Hadoop so easy a chimpanzee can use it.

Treat your dataset like a
* stream of lines when it's efficient to process by lines
* stream of field arrays when it's efficient to deal directly with fields
* stream of lightweight objects when it's efficient to deal with objects
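
As a rough illustration of those three views, here is a plain-Ruby sketch that does not use Wukong at all. The @Visit@ struct, the sample line, and the helper names are made up for the example.

```ruby
# Plain Ruby only; Visit, SAMPLE_LINE and these helpers are hypothetical,
# not part of Wukong.
Visit = Struct.new(:user_id, :url, :duration)

SAMPLE_LINE = "42\thttp://example.com/\t17\n"

# 1. As a line: a cheap filter never needs to split the record.
def interesting_line?(line)
  line.include?("example.com")
end

# 2. As a field array: split on tabs and index by position.
def fields_of(line)
  line.chomp.split("\t")
end

# 3. As a lightweight object: name the fields with a Struct.
def visit_of(line)
  Visit.new(*fields_of(line))
end

puts interesting_line?(SAMPLE_LINE)   # filter on the raw line
puts fields_of(SAMPLE_LINE).inspect   # positional access
puts visit_of(SAMPLE_LINE).url        # named access
```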

Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant,
"Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your
command line.

h2. How to write a Wukong script

Here's a script to count words in a text stream:

    require 'wukong'
    module WordCount
      class Mapper < Wukong::Streamer::LineStreamer
        # Emit each word in the line.
        def process line
          words = line.strip.split(/\W+/).reject(&:blank?)
          words.each{|word| yield [word, 1] }
        end
      end

      class Reducer < Wukong::Streamer::ListReducer
        # Sum the counts stacked up for the current key.
        def finalize
          yield [ key, values.map(&:last).map(&:to_i).sum ]
        end
      end
    end

    Wukong::Script.new(
      WordCount::Mapper,
      WordCount::Reducer
      ).run # Execute the script

The first class, the Mapper, eats lines and craps @[word, count]@ records: word
is the /key/, its count is the /value/.

In the reducer, the values for each key are stacked up into a list; then the
record(s) yielded by @#finalize@ are emitted.
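
To make the reducer's behavior concrete, here is a plain-Ruby sketch of the same flow with no Hadoop or Wukong involved: map each line, sort the records by key, stack up the records for each key, then emit one total per key. The method names (@map_line@, @reduce_sorted@, @word_count@) are invented for the illustration, and the sort stands in for the work Hadoop does between the map and reduce phases.

```ruby
# Illustration only: these helpers are hypothetical, not Wukong's API.
# empty? is used instead of blank? so the sketch needs no extra libraries.

# Map step: build a [word, 1] record for every word in the line.
def map_line(line)
  line.strip.split(/\W+/).reject(&:empty?).map { |word| [word, 1] }
end

# Reduce step: records arrive sorted by key, so equal keys are adjacent.
# Stack up each key's records, then produce one [key, total] record per key.
def reduce_sorted(records)
  records.chunk_while { |a, b| a.first == b.first }.map do |values|
    key = values.first.first
    [key, values.map(&:last).sum]
  end
end

def word_count(lines)
  mapped = lines.flat_map { |line| map_line(line) }
  reduce_sorted(mapped.sort_by(&:first))   # the sort stands in for Hadoop's shuffle
end

word_count(["the cat saw the dog", "the dog ran"]).each do |word, count|
  puts [word, count].join("\t")
end
```

The sort is what makes the reducer simple: once equal keys sit next to each other, each key can be totalled in a single pass.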

h3. Structured data stream

You can also use structs to treat your dataset as a stream of objects:

    require 'wukong'
    require 'my_blog' # defines the blog models
    # structs for our input objects
    Tweet = Struct.new( :id, :created_at, :twitter_user_id,
      :in_reply_to_user_id, :in_reply_to_status_id, :text )
    TwitterUser = Struct.new( :id, :username, :fullname,
      :homepage, :location, :description )
    module TwitBlog
      class Mapper < Wukong::Streamer::RecordStreamer
        # Watch for tweets by me
        MY_USER_ID = 24601
        #
        # If this is a tweet by me, convert it to a Post.
        #
        # If it is a tweet not by me, convert it to a Comment that
        # will be paired with the correct Post.
        #
        # If it is a TwitterUser, convert it to a User record and
        # a user_location record
        #
        def process record
          case record
          when TwitterUser
            user = MyBlog::User.new.merge(record) # grab the fields in common
            user_loc = MyBlog::UserLoc.new(record.id, record.location, nil, nil)
            yield user
            yield user_loc
          when Tweet
            if record.twitter_user_id == MY_USER_ID
              post = MyBlog::Post.new.merge record
              post.link = "http://twitter.com/statuses/show/#{record.id}"
              post.body = record.text
              post.title = record.text[0..65] + "..."
              yield post
            else
              comment = MyBlog::Comment.new.merge record
              comment.body = record.text
              comment.post_id = record.in_reply_to_status_id
              yield comment
            end
          end
        end
      end
    end
    Wukong::Script.new( TwitBlog::Mapper, nil ).run # identity reducer
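
This README does not show how a flat line of text becomes a @Tweet@ or a @TwitterUser@. One plausible scheme, assumed here purely for illustration, is that each tab-separated line starts with a tag naming its type. The @RECORD_TYPES@ table and the @line_to_record@ and @record_to_line@ helpers below are hypothetical and are not Wukong's own code.

```ruby
# Sketch under an assumption: each line is "<type>\t<field>\t<field>...".
# RECORD_TYPES, line_to_record and record_to_line are hypothetical helpers.
Tweet = Struct.new(:id, :created_at, :twitter_user_id,
                   :in_reply_to_user_id, :in_reply_to_status_id, :text)
TwitterUser = Struct.new(:id, :username, :fullname,
                         :homepage, :location, :description)

RECORD_TYPES = { "tweet" => Tweet, "twitter_user" => TwitterUser }.freeze

# Turn one tab-separated line into the matching struct.
def line_to_record(line)
  type, *fields = line.chomp.split("\t", -1)   # -1 keeps trailing empty fields
  klass = RECORD_TYPES.fetch(type) { raise ArgumentError, "unknown record type: #{type}" }
  klass.new(*fields)
end

# Flatten a struct back into a tab-separated line.
def record_to_line(record)
  type = RECORD_TYPES.key(record.class)
  [type, *record.to_a].join("\t")
end

record = line_to_record("tweet\t1001\t20090101\t24601\t\t\tHello, elephant")
puts record.class
puts record.text
puts record_to_line(record)
```

With records typed this way, a @case record when Tweet@ dispatch like the one in the mapper works on ordinary Ruby structs.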

h3. More info

There are many useful examples (including an actually-useful version of the
WordCount script) in the examples/ directory.

h2. How to run a Wukong script

    your/script.rb --go path/to/input_files path/to/output_dir

All of the file paths are HDFS paths except your script path, of course, which
is on the local filesystem.

You can supply arbitrary command line arguments (they wind up as key-value pairs
in the options hash your mapper and reducer receive), and you can use the Hadoop
syntax to specify more than one input file:

    ./path/to/your/script.rb --any_specific_options --options=can_have_vals \
      --go "input_dir/part_*,input_file2.tsv,etc.tsv" path/to/output_dir
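
Wukong's own argument handling is not shown here, so the following is only a sketch of the idea: a hypothetical @parse_args@ helper that turns @--flag@ and @--key=value@ arguments into key-value pairs and treats everything else as a positional path.

```ruby
# Hypothetical helper, not Wukong's parser: split ARGV-style arguments
# into an options hash and the remaining positional paths.
def parse_args(argv)
  options = {}
  paths   = []
  argv.each do |arg|
    case arg
    when /\A--([\w-]+)=(.*)\z/ then options[$1.to_sym] = $2     # --key=value
    when /\A--([\w-]+)\z/      then options[$1.to_sym] = true   # bare --flag
    else paths << arg                                           # input or output path
    end
  end
  [options, paths]
end

options, paths = parse_args(%w[--any_specific_options --options=can_have_vals
                               --go input_dir/part_*,input_file2.tsv path/to/output_dir])
puts options.inspect
puts paths.inspect
# A comma-separated input argument names several inputs at once.
puts paths.first.split(",").inspect
```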

h2. How to test your scripts

To run the mapper on its own:

    cat ./local/test/input.tsv | ./examples/word_count.rb --map | more

or, if your test data lies on the HDFS,

    hdp-cat test/input.tsv | ./examples/word_count.rb --map | more
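
Because @process@ yields its output records rather than printing them, the same logic can also be exercised from plain Ruby without a pipe. The sketch below uses a stand-in class, @StandInMapper@, that copies the word-count @process@ method but does not inherit from any Wukong class; whether a real Wukong streamer can be instantiated this way is an open assumption.

```ruby
# StandInMapper is hypothetical: it mimics the word-count mapper's
# process method without depending on Wukong.
class StandInMapper
  def process(line)
    line.strip.split(/\W+/).reject(&:empty?).each { |word| yield [word, 1] }
  end
end

# Collect everything process yields for one input line.
def records_for(mapper, line)
  mapper.enum_for(:process, line).to_a
end

records = records_for(StandInMapper.new, "to be or not to be")
records.each { |word, count| puts "#{word}\t#{count}" }
```

@enum_for@ wraps the yielding method in an Enumerator, so a test can compare the collected records against an expected array.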

h2. What's up with Wukong::AndPig?

@Wukong::AndPig@ is a small library that makes it easier to generate code for the
"Pig":http://hadoop.apache.org/pig data analysis language. See its
"README":wukong/and_pig/README.textile for more.

h2. Why is it called Wukong?

Hadoop, as you may know, is "named after a stuffed
elephant":http://en.wikipedia.org/wiki/Hadoop, and since Wukong was started by
the "infochimps":http://infochimps.org team, we needed a simian analog. Wukong
(the Monkey King), known for his power and agility, is the hero of a famous Chinese
fairytale in which he journeys to the land of the Elephant:

bq. Sun Wukong (孙悟空), known in the West as the Monkey King, is the main
character in the classical Chinese epic novel Journey to the West. In the novel,
he accompanies the monk Xuanzang on the journey to retrieve Buddhist sutras from
India.

bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn
(8,100 kg) Ruyi Jingu Bang with ease. He also has superb speed, traveling
108,000 li (54,000 kilometers) in one somersault. Sun knows 72 transformations,
which allows him to transform into various animals and objects; he is, however,
shown with slight problems transforming into other people, since he is unable to
complete the transformation of his tail. He is a skilled fighter, capable of
holding his own against the best generals of heaven. Each of his hairs possesses
magical properties, and is capable of transforming into a clone of the Monkey
King himself, or various weapons, animals, and other objects. He also knows
various spells in order to command wind, part water, conjure protective circles
against demons, freeze humans, demons, and gods alike. (Journey to the West, Wu
Cheng'en (1500-1582), Translated by Foreign Languages Press, Beijing 1993.) --
/["Sun Wukong's Wikipedia entry":http://en.wikipedia.org/wiki/Wukong]/

p. Sounds about right to us :) The "BBC-produced Jamie Hewlett / Damon Albarn
short":http://news.bbc.co.uk/sport1/hi/olympics/monkey, made for the 2008
Olympics, gives the general idea.

h2. What tools does Wukong work with?

Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant,
"Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your
command line. We're looking forward to being friends with
"martinis":http://datamapper.org and "express
trains":http://wiki.rubyonrails.org/rails/pages/ActiveRecord down the road.