h1. Wukong

Wukong makes Hadoop so easy a chimpanzee can use it.

Treat your dataset like a
* stream of lines when it's efficient to process by lines
* stream of field arrays when it's efficient to deal directly with fields
* stream of lightweight objects when it's efficient to deal with objects

Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant,
"Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your
command line.

h2. How to write a Wukong script

Here's a script to count words in a text stream:

    require 'wukong'
    module WordCount
      class Mapper < Wukong::Streamer::LineStreamer
        # Emit each word in the line.
        def process line
          words = line.strip.split(/\W+/).reject(&:blank?)
          words.each{|word| yield [word, 1] }
        end
      end

      class Reducer < Wukong::Streamer::ListReducer
        # Sum the counts collected for the current key.
        def finalize
          yield [ key, values.map(&:last).map(&:to_i).sum ]
        end
      end
    end

    Wukong::Script.new(
      WordCount::Mapper,
      WordCount::Reducer
    ).run # Execute the script

The first class, the Mapper, eats lines and craps @[word, count]@ records: word
is the _key_, its count is the _value_.

In the reducer, the values for each key are stacked up into a list; then the
record(s) yielded by @#finalize@ are emitted.
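
To make that flow concrete, here is a minimal plain-Ruby sketch of the map, sort, group, and reduce steps, run entirely in memory. It does not use Wukong or Hadoop: the helper names @map_line@, @reduce_group@ and @word_count@ are made up for illustration, and @sort_by@ plus @chunk_while@ merely stand in for the sort Hadoop performs between the two phases.

```ruby
# Plain-Ruby sketch of the map -> sort -> group -> reduce flow.
# None of these helper names come from Wukong; they are illustrative only.

# Mapper step: yield a [word, 1] pair for every word in the line.
def map_line(line)
  line.strip.split(/\W+/).reject(&:empty?).each { |word| yield [word, 1] }
end

# Reducer step: given a key and the list of records that share it,
# yield a single [key, total] record.
def reduce_group(key, records)
  yield [key, records.map(&:last).map(&:to_i).sum]
end

def word_count(lines)
  mapped = []
  lines.each { |line| map_line(line) { |record| mapped << record } }

  # Hadoop sorts mapper output by key before the reducer sees it;
  # sort_by + chunk_while stands in for that step here.
  groups = mapped.sort_by(&:first).chunk_while { |a, b| a.first == b.first }

  reduced = []
  groups.each do |records|
    reduce_group(records.first.first, records) { |record| reduced << record }
  end
  reduced
end

counts = word_count(["the quick brown fox", "jumps over the lazy dog the end"])
counts.each { |word, count| puts [word, count].join("\t") }
```

Each group handed to @reduce_group@ plays the role of the stacked-up list of values for one key, and whatever the block receives is what would be emitted.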

h3. Structured data stream

You can also use structs to treat your dataset as a stream of objects:

@@@
require 'wukong'
require 'my_blog' # defines the blog models
# structs for our input objects
Tweet = Struct.new( :id, :created_at, :twitter_user_id,
  :in_reply_to_user_id, :in_reply_to_status_id, :text )
TwitterUser = Struct.new( :id, :username, :fullname,
  :homepage, :location, :description )
module TwitBlog
  class Mapper < Wukong::Streamer::RecordStreamer
    # Watch for tweets by me
    MY_USER_ID = 24601
    #
    # If this is a tweet by me, convert it to a Post.
    #
    # If it is a tweet not by me, convert it to a Comment that
    # will be paired with the correct Post.
    #
    # If it is a TwitterUser, convert it to a User record and
    # a user_location record
    #
    def process record
      case record
      when TwitterUser
        user     = MyBlog::User.new.merge(record) # grab the fields in common
        user_loc = MyBlog::UserLoc.new(record.id, record.location, nil, nil)
        yield user
        yield user_loc
      when Tweet
        if record.twitter_user_id == MY_USER_ID
          post = MyBlog::Post.new.merge record
          post.link  = "http://twitter.com/statuses/show/#{record.id}"
          post.body  = record.text
          post.title = record.text[0..65] + "..."
          yield post
        else
          comment = MyBlog::Comment.new.merge record
          comment.body    = record.text
          comment.post_id = record.in_reply_to_status_id
          yield comment
        end
      end
    end
  end
end
Wukong::Script.new( TwitBlog::Mapper, nil ).run # identity reducer
@@@
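
The @merge@ calls above "grab the fields in common" between two structs. Plain Ruby's @Struct@ has no @#merge@, so the example relies on one defined elsewhere. As a rough sketch of the idea only, and not of how Wukong or @my_blog@ actually implement it, a field-copying helper could look like this. The @BlogUser@ struct and the @copy_common_fields@ helper are hypothetical names introduced for this illustration.

```ruby
# Hypothetical stand-ins for the input struct and a blog model.
TwitterUser = Struct.new(:id, :username, :fullname, :homepage, :location, :description)
BlogUser    = Struct.new(:id, :username, :fullname, :email)

# Copy every member that both structs define from source into target.
# Illustrative helper only; this is not Wukong's own #merge.
def copy_common_fields(target, source)
  (target.members & source.members).each do |field|
    target[field] = source[field]
  end
  target
end

tw   = TwitterUser.new(24601, "monkeyking", "Sun Wukong", "http://example.com", "Mount Huaguo", "simian")
user = copy_common_fields(BlogUser.new, tw)
puts user.to_a.inspect
```

Fields that exist only on the target (here @email@) stay @nil@, and fields that exist only on the source are ignored.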

h3. More info

There are many useful examples (including an actually-useful version of the
WordCount script) in the examples/ directory.


h2. How to run a Wukong script

    your/script.rb --go path/to/input_files path/to/output_dir

All of the file paths are HDFS paths except your script path, of course, which
is on the local filesystem.

You can supply arbitrary command line arguments (they wind up as key-value pairs
in the options hash your mapper and reducer receive), and you can use the Hadoop
syntax to specify more than one input file:

    ./path/to/your/script.rb --any_specific_options --options=can_have_vals \
      --go "input_dir/part_*,input_file2.tsv,etc.tsv" path/to/output_dir


h2. How to test your scripts

To run the mapper on its own:

    cat ./local/test/input.tsv | ./examples/word_count.rb --map | more

or if your test data lies on the HDFS,

    hdp-cat test/input.tsv | ./examples/word_count.rb --map | more


h2. What's up with Wukong::AndPig?

@Wukong::AndPig@ is a small library to more easily generate code for the
"Pig":http://hadoop.apache.org/pig data analysis language. See its
"README":wukong/and_pig/README.textile for more.

h2. Why is it called Wukong?

Hadoop, as you may know, is "named after a stuffed
elephant":http://en.wikipedia.org/wiki/Hadoop, and since Wukong was started by
the "infochimps":http://infochimps.org team, we needed a simian analog. Wukong
(the Monkey King), known for his power and agility, is the hero of a famous
Chinese fairy tale in which he journeys to the land of the Elephant:

bq. Sun Wukong (孙悟空), known in the West as the Monkey King, is the main
character in the classical Chinese epic novel Journey to the West. In the novel,
he accompanies the monk Xuanzang on the journey to retrieve Buddhist sutras from
India.

bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn
(8,100 kg) Ruyi Jingu Bang with ease. He also has superb speed, traveling
108,000 li (54,000 kilometers) in one somersault. Sun knows 72 transformations,
which allows him to transform into various animals and objects; he is, however,
shown with slight problems transforming into other people, since he is unable to
complete the transformation of his tail. He is a skilled fighter, capable of
holding his own against the best generals of heaven. Each of his hairs possesses
magical properties, and is capable of transforming into a clone of the Monkey
King himself, or various weapons, animals, and other objects. He also knows
various spells in order to command wind, part water, conjure protective circles
against demons, freeze humans, demons, and gods alike. (Journey to the West, Wu
Cheng'en (1500-1582), Translated by Foreign Languages Press, Beijing 1993.) --
/["Sun Wukong's Wikipedia entry":http://en.wikipedia.org/wiki/Wukong]/

p. Sounds about right to us :) The "BBC-produced Jamie Hewlett / Damon Albarn
short":http://news.bbc.co.uk/sport1/hi/olympics/monkey, made for the 2008
Olympics, gives the general idea.

h2. What tools does Wukong work with?

Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant,
"Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your
command line. We're looking forward to being friends with
"martinis":http://datamapper.org and "express
trains":http://wiki.rubyonrails.org/rails/pages/ActiveRecord down the road.