h1. Wukong

Wukong makes "Hadoop":http://hadoop.apache.org/core so easy a chimpanzee can use it.

Treat your dataset like a
* stream of lines when it's efficient to process by lines
* stream of field arrays when it's efficient to deal directly with fields
* stream of lightweight objects when it's efficient to deal with objects
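
The three views above can be pictured with plain Ruby. This is only a sketch of the idea, not Wukong code: the @Visit@ struct and the sample line are made up for illustration.

```ruby
# Plain-Ruby illustration; none of these names come from Wukong.
Visit = Struct.new(:ip, :path, :status)   # hypothetical record type

raw = "10.0.0.1\t/index.html\t200\n"      # made-up sample input

line   = raw.chomp            # 1. a line: enough for grepping or counting
fields = line.split("\t")     # 2. a field array: index columns by position
visit  = Visit.new(*fields)   # 3. a lightweight object: columns get names

puts line.include?("index")
puts fields[2]
puts visit.path
```

Each step costs a little more work per record, which is why it pays to pick the cheapest view that does the job.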

Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line.

h2. How to write a Wukong script

Here's a script to count words in a text stream:

<pre><code>
require 'wukong'
module WordCount
  class Mapper < Wukong::Streamer::LineStreamer
    # Emit each word in the line.
    def process line
      words = line.strip.split(/\W+/).reject(&:blank?)
      words.each{|word| yield [word, 1] }
    end
  end

  class Reducer < Wukong::Streamer::ListReducer
    def finalize
      yield [ key, values.map(&:last).map(&:to_i).sum ]
    end
  end
end

Wukong::Script.new(
  WordCount::Mapper,
  WordCount::Reducer
  ).run # Execute the script
</code></pre>

The first class, the Mapper, eats lines and spits out @[word, count]@ records: the word is the _key_, its count is the _value_.

In the reducer, the values for each key are stacked up into a list; then the record(s) yielded by @#finalize@ are emitted. There are many other ways to write the reducer (most of them are better); see the ["examples":examples/].
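
The stack-up-then-finalize behavior can be sketched in plain Ruby. This is not Wukong's implementation: @finalize_group@ is a hypothetical stand-in for @#finalize@, and the sample records are invented. It assumes the records arrive already sorted by key, as Hadoop delivers them to a reducer.

```ruby
# Plain-Ruby sketch of the accumulate-then-finalize idea; not Wukong's source.
# Mapper output, already sorted by key (made-up sample data).
sorted_records = [
  ["apple", "1"], ["apple", "1"], ["banana", "1"], ["cherry", "1"], ["cherry", "1"]
]

# Hypothetical stand-in for a list-style reducer's #finalize:
# sum the count column of every record stacked up for one key.
def finalize_group(key, values)
  [key, values.map(&:last).map(&:to_i).sum]
end

# Stack up consecutive records that share a key, then finalize each stack.
counts = sorted_records
  .chunk_while { |a, b| a.first == b.first }
  .map { |group| finalize_group(group.first.first, group) }

counts.each { |key, count| puts [key, count].join("\t") }
```

Holding a whole stack in memory is simple but wasteful for a plain count, which is one reason a running-total reducer is usually the better choice.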

h3. Structured data stream

You can also use structs to treat your dataset as a stream of objects:

<pre><code>
require 'wukong'
require 'my_blog' # defines the blog models
# structs for our input objects
Tweet = Struct.new( :id, :created_at, :twitter_user_id,
  :in_reply_to_user_id, :in_reply_to_status_id, :text )
TwitterUser = Struct.new( :id, :username, :fullname,
  :homepage, :location, :description )
module TwitBlog
  class Mapper < Wukong::Streamer::RecordStreamer
    # Watch for tweets by me
    MY_USER_ID = 24601
    #
    # If this is a tweet by me, convert it to a Post.
    #
    # If it is a tweet not by me, convert it to a Comment that
    # will be paired with the correct Post.
    #
    # If it is a TwitterUser, convert it to a User record and
    # a user_location record
    #
    def process record
      case record
      when TwitterUser
        user = MyBlog::User.new.merge(record) # grab the fields in common
        user_loc = MyBlog::UserLoc.new(record.id, record.location, nil, nil)
        yield user
        yield user_loc
      when Tweet
        if record.twitter_user_id == MY_USER_ID
          post = MyBlog::Post.new.merge record
          post.link = "http://twitter.com/statuses/show/#{record.id}"
          post.body = record.text
          post.title = record.text[0..65] + "..."
          yield post
        else
          comment = MyBlog::Comment.new.merge record
          comment.body = record.text
          comment.post_id = record.in_reply_to_status_id
          yield comment
        end
      end
    end
  end
end
Wukong::Script.new( TwitBlog::Mapper, nil ).run # identity reducer
</code></pre>
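
The mapper above calls @merge@ to "grab the fields in common". A rough plain-Ruby picture of that idea follows, assuming it means copying every member two structs share. @copy_common_fields@ and @BlogUser@ are hypothetical names for this sketch, not Wukong or MyBlog API, and the input line is made up.

```ruby
# Illustration only: copy_common_fields is a hypothetical helper, not a Wukong method.
TwitterUser = Struct.new(:id, :username, :fullname, :homepage, :location, :description)
BlogUser    = Struct.new(:id, :username, :homepage, :bio)   # hypothetical target model

# Copy every member the two structs share; members only the target has stay nil.
def copy_common_fields(target, source)
  (target.members & source.members).each do |field|
    target[field] = source[field]
  end
  target
end

# A tab-separated input line becomes a struct by splatting its fields.
line   = "24601\tjvaljean\tJean Valjean\thttp://example.com\tParis\tformer convict"
source = TwitterUser.new(*line.split("\t"))
user   = copy_common_fields(BlogUser.new, source)

puts user.to_a.inspect
```

Fields are read from the line as strings; any type conversion (for example @id@ to an integer) is left out of this sketch.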

h3. More info

There are many useful examples (including an actually-useful version of the WordCount script) in the @examples/@ directory.

h2. Setup

1. Allow Wukong to discover where his elephant friend lives: either

* set a $HADOOP_HOME environment variable,

* or create a file 'config/wukong-site.yaml' with a line that points to the top-level directory of your Hadoop install:

    :hadoop_home: /usr/local/share/hadoop

2. Add Wukong's @bin/@ directory to your $PATH, so that you may use its filesystem shortcuts.
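
As a rough picture of step 1, a lookup along these lines would honor either setting. @find_hadoop_home@ is a hypothetical helper written for this sketch, not Wukong's own code, and the precedence (environment variable first, then the YAML file) is an assumption.

```ruby
require 'yaml'

# Hypothetical helper, not Wukong's own lookup code. The precedence
# (environment variable first, then the YAML file) is an assumption.
def find_hadoop_home(env, site_config_yaml)
  from_env = env['HADOOP_HOME'].to_s
  return from_env unless from_env.empty?
  # An empty or missing config parses to a falsy value; treat it as no settings.
  settings = YAML.load(site_config_yaml.to_s) || {}
  settings[:hadoop_home]
end

# Same content as the config/wukong-site.yaml line shown in step 1.
site_config = ":hadoop_home: /usr/local/share/hadoop\n"

puts find_hadoop_home({}, site_config)
puts find_hadoop_home({ 'HADOOP_HOME' => '/opt/hadoop' }, site_config)
```

Note that the YAML key starts with a colon, so it loads as the Ruby symbol @:hadoop_home@ rather than a string.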

h2. How to run a Wukong script

To run your script using local files and no connection to a Hadoop cluster,

    your/script.rb --run=local path/to/input_files path/to/output_dir

To run the command across a Hadoop cluster,

    your/script.rb --run=hadoop path/to/input_files path/to/output_dir

You can set the default run mode in the config/wukong-site.yaml file and then pass plain @--run@ instead of @--run=something@; the script will use the default.

If you're running @--run=hadoop@, all file paths are HDFS paths. If you're running @--run=local@, all file paths are local paths. (The path to your script, of course, lives on the local filesystem.)

You can supply arbitrary command line arguments (they wind up as key-value pairs in the options hash your mapper and reducer receive), and you can use the Hadoop syntax to specify more than one input file:

    ./path/to/your/script.rb --any_specific_options --options=can_have_vals \
      --run "input_dir/part_*,input_file2.tsv,etc.tsv" path/to/output_dir

Note that all @--options@ must precede (in any order) all non-options.
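
To make the options-before-paths rule concrete, here is a minimal sketch of how such a command line could be split into key-value pairs and paths. @split_args@ is a hypothetical function written for illustration; Wukong's real option handling may differ.

```ruby
# Hypothetical parser for illustration; Wukong's real option handling may differ.
def split_args(argv)
  options = {}
  rest = argv.dup
  # Consume leading --flags; stop at the first argument that is not one.
  while rest.first.to_s.start_with?('--')
    name, value = rest.shift.sub(/\A--/, '').split('=', 2)
    # A bare flag such as --run is recorded as true.
    options[name.to_sym] = value.nil? ? true : value
  end
  [options, rest]
end

argv = ['--any_specific_options', '--options=can_have_vals', '--run',
        'input_dir/part_*,input_file2.tsv,etc.tsv', 'path/to/output_dir']
options, paths = split_args(argv)

puts options.inspect
puts paths.inspect
```

Because parsing stops at the first non-option, anything that looks like a flag after the input path would be treated as a path in this sketch, which is why the ordering matters.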

h2. How to test your scripts

To run the mapper on its own:

    cat ./local/test/input.tsv | ./examples/word_count.rb --map | more

or, if your test data lies on the HDFS,

    hdp-cat test/input.tsv | ./examples/word_count.rb --map | more

Next, graduate to @--run=local@ mode so you can inspect the reducer.

h2. What's up with Wukong::AndPig?

@Wukong::AndPig@ is a small library to more easily generate code for the
"Pig":http://hadoop.apache.org/pig data analysis language. See its
"README":wukong/and_pig/README.textile for more.

h2. Why is it called Wukong?

Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill:

bq. Sun Wukong (孙悟空), known in the West as the Monkey King, is the main character in the classical Chinese epic novel Journey to the West. In the novel, he accompanies the monk Xuanzang on the journey to retrieve Buddhist sutras from India.

bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn (8,100 kg) Ruyi Jingu Bang with ease. He also has superb speed, traveling 108,000 li (54,000 kilometers) in one somersault. Sun knows 72 transformations, which allows him to transform into various animals and objects; he is, however, shown with slight problems transforming into other people, since he is unable to complete the transformation of his tail. He is a skilled fighter, capable of holding his own against the best generals of heaven. Each of his hairs possesses magical properties, and is capable of transforming into a clone of the Monkey King himself, or various weapons, animals, and other objects. He also knows various spells in order to command wind, part water, conjure protective circles against demons, freeze humans, demons, and gods alike. -- ["Sun Wukong's Wikipedia entry":http://en.wikipedia.org/wiki/Wukong]

The "Jamie Hewlett / Damon Albarn short":http://news.bbc.co.uk/sport1/hi/olympics/monkey that the BBC made for their 2008 Olympics coverage gives the general idea.

h2. What tools does Wukong work with?

Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line. We're looking forward to being friends with "martinis":http://datamapper.org and "express trains":http://wiki.rubyonrails.org/rails/pages/ActiveRecord down the road.