h1. Wukong

Wukong makes "Hadoop":http://hadoop.apache.org/core so easy a chimpanzee can use it.

Treat your dataset like a
* stream of lines when it's efficient to process by lines
* stream of field arrays when it's efficient to deal directly with fields
* stream of lightweight objects when it's efficient to deal with objects

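As a rough illustration of those three views, here is a plain-Ruby sketch that does not use Wukong itself. The tab-separated layout, the @Visit@ struct and the helper names are all assumptions made up for this example.

```ruby
# Plain Ruby only; none of these names come from Wukong.
# Hypothetical record layout: ip, path, status, separated by tabs.
Visit = Struct.new(:ip, :path, :status)

# View 1: the record as a raw line (cheap filtering, no parsing).
def mentions?(line, word)
  line.include?(word)
end

# View 2: the record as an array of fields.
def fields_of(line)
  line.chomp.split("\t")
end

# View 3: the record as a lightweight object.
def visit_of(line)
  Visit.new(*fields_of(line))
end

line = "10.0.0.1\t/index.html\t200\n"
puts mentions?(line, "index")   # true
p fields_of(line)               # ["10.0.0.1", "/index.html", "200"]
puts visit_of(line).path        # /index.html
```

Each view costs a little more parsing than the one before it, so the cheapest one that does the job is usually the right choice.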
Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line.

The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com/wukong Please feel free to add supplemental information to the "wukong wiki.":http://wiki.github.com/mrflip/wukong

* "Install and set up wukong":http://mrflip.github.com/wukong/INSTALL.html
* "Tutorial":http://mrflip.github.com/wukong/tutorial.html
* "Usage notes":http://mrflip.github.com/wukong/usage.html
* "Wutils":http://mrflip.github.com/wukong/wutils.html -- command-line utilities for working with data
* Links and tips for "configuring and working with hadoop":http://mrflip.github.com/wukong/hadoop-tips.html
* Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
* "More info":http://mrflip.github.com/wukong/moreinfo.html

h2. Imminent Changes

I'm pushing to release "Wukong 3.0 the actual 1.0 release".

* For reducing/uniqing, a notion of mutable_fields and immutable_fields and extrinsic_fields: two objects compare the same/differently if their mutable fields compare the same/differently
* Methods on TypedStruct to

* Make to_flat(false) the default, with the sort_fields / partition_fields defaulting to 2 each and very prominently documented
* Standardize the notion that wukong classes have a "key"; by default, it will be to_a.first for Structs/TypedStructs. This shouldn't break anything.
* May make some things that are derived classes into mixin'ed modules
* Will probably change the name of AccumulatingReducer into just Accumulator, and have all Accumulator-derived classes include Accumulator; I'll make sure the old names continue to work though.


*

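To make the proposed default key concrete: for a plain Ruby @Struct@, @to_a.first@ is simply the value of the first declared field. The sketch below uses only core Ruby; the @Tweet@ struct and the @default_key@ helper are hypothetical and not part of Wukong.

```ruby
# Core Ruby only. `default_key` is a made-up helper illustrating the
# proposed rule "the key is to_a.first"; it is not Wukong API.
Tweet = Struct.new(:id, :created_at, :text)

def default_key(record)
  record.to_a.first
end

tweet = Tweet.new(42, "2009-11-01", "hello")
p default_key(tweet)   # 42
```

Under that rule, putting the field you want to sort or group on first in the struct definition would be enough to make it the key.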
h2. Help!

Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code

h2. Install

** "Main Install and Setup Documentation":http://mrflip.github.com/wukong/INSTALL.html **

h3. Get the code

We're still actively developing wukong. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/wukong

pre. $ git clone git://github.com/mrflip/wukong

A gem is available from "gemcutter:":http://gemcutter.org/gems/wukong

pre. $ sudo gem install wukong --source=http://gemcutter.org

(don't use the gems.github.com version -- it's way out of date.)

You can instead download this project in either "zip":http://github.com/mrflip/wukong/zipball/master or "tar":http://github.com/mrflip/wukong/tarball/master formats.

h3. Dependencies and setup

To finish setting up, see the "detailed setup instructions":http://mrflip.github.com/wukong/INSTALL.html and then read the "usage notes":http://mrflip.github.com/wukong/usage.html

h2. How to write a Wukong script

** "Tutorial By Example":http://mrflip.github.com/wukong/tutorial.html **

Here's a script to count words in a text stream:

<pre><code>    require 'wukong'
    module WordCount
      class Mapper < Wukong::Streamer::LineStreamer
        # Emit each word in the line.
        def process line
          words = line.strip.split(/\W+/).reject(&:blank?)
          words.each{|word| yield [word, 1] }
        end
      end

      class Reducer < Wukong::Streamer::ListReducer
        # Sum the counts stacked up for this key.
        def finalize
          yield [ key, values.map(&:last).map(&:to_i).sum ]
        end
      end
    end

    Wukong::Script.new(
      WordCount::Mapper,
      WordCount::Reducer
    ).run # Execute the script
</code></pre>

The first class, the Mapper, eats lines and spits out @[word, count]@ records: the word is the _key_ and its count is the _value_.

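To see what the mapper produces without running Hadoop, here is a plain-Ruby sketch of the same tokenizing step. It is a stand-in built on assumptions: @map_line@ is a hypothetical helper, and @reject(&:empty?)@ replaces @blank?@, which plain Ruby strings do not have.

```ruby
# Plain-Ruby stand-in for WordCount::Mapper#process (not Wukong code).
# `map_line` is a hypothetical name; it returns the records instead of
# yielding them one at a time.
def map_line(line)
  words = line.strip.split(/\W+/).reject(&:empty?)
  words.map { |word| [word, 1] }
end

p map_line("the cat saw the other cat")
# [["the", 1], ["cat", 1], ["saw", 1], ["the", 1], ["other", 1], ["cat", 1]]
```

Note that the mapper does no counting of its own: a repeated word simply shows up as several records with a count of 1.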
In the reducer, the values for each key are stacked up into a list; then the record(s) yielded by @#finalize@ are emitted. There are many other ways to write the reducer (most of them are better) -- see the ["examples":examples/].

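One alternative is to keep a running total per key instead of stacking up every value. The sketch below simulates that in plain Ruby on records already sorted by key, which is how a reducer receives them. It is an illustration only: @reduce_sorted@ is a hypothetical helper, not Wukong API, and it is not taken from the examples directory.

```ruby
# Plain-Ruby simulation of a summing reducer (not Wukong code).
# Input: [word, count] records sorted by word, as a reducer would see them.
# Keeps one running total per key rather than a list of values.
def reduce_sorted(records)
  results = []
  current_key = nil
  total = 0
  records.each do |key, count|
    if key != current_key
      # A new key has started: flush the finished one.
      results << [current_key, total] unless current_key.nil?
      current_key = key
      total = 0
    end
    total += count.to_i
  end
  results << [current_key, total] unless current_key.nil?
  results
end

mapped = [["the", 1], ["cat", 1], ["saw", 1], ["the", 1], ["cat", 1]]
p reduce_sorted(mapped.sort)
# [["cat", 2], ["saw", 1], ["the", 2]]
```

Memory use stays constant per key here, which matters when a single word occurs millions of times.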
h3. Structured data stream

You can also use structs to treat your dataset as a stream of objects:

<pre><code>    require 'wukong'
    require 'my_blog' # defines the blog models
    # structs for our input objects
    Tweet = Struct.new( :id, :created_at, :twitter_user_id,
      :in_reply_to_user_id, :in_reply_to_status_id, :text )
    TwitterUser = Struct.new( :id, :username, :fullname,
      :homepage, :location, :description )
    module TwitBlog
      class Mapper < Wukong::Streamer::RecordStreamer
        # Watch for tweets by me
        MY_USER_ID = 24601
        #
        # If this is a tweet by me, convert it to a Post.
        #
        # If it is a tweet not by me, convert it to a Comment that
        # will be paired with the correct Post.
        #
        # If it is a TwitterUser, convert it to a User record and
        # a user_location record
        #
        def process record
          case record
          when TwitterUser
            user = MyBlog::User.new.merge(record) # grab the fields in common
            user_loc = MyBlog::UserLoc.new(record.id, record.location, nil, nil)
            yield user
            yield user_loc
          when Tweet
            if record.twitter_user_id == MY_USER_ID
              post = MyBlog::Post.new.merge record
              post.link = "http://twitter.com/statuses/show/#{record.id}"
              post.body = record.text
              post.title = record.text[0..65] + "..."
              yield post
            else
              comment = MyBlog::Comment.new.merge record
              comment.body = record.text
              comment.post_id = record.in_reply_to_status_id
              yield comment
            end
          end
        end
      end
    end
    Wukong::Script.new( TwitBlog::Mapper, nil ).run # identity reducer
</code></pre>

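Going by its comment, the @merge@ call in the mapper copies over the fields two structs have in common. Core Ruby structs have no such method, so here is a plain-Ruby approximation of the idea. @copy_common_fields@ and @BlogUser@ are hypothetical names, and Wukong's own @merge@ may behave differently.

```ruby
# Plain-Ruby sketch of "grab the fields in common" between two structs.
# `copy_common_fields` is a hypothetical helper; Wukong's own merge is
# not shown in this README, so treat this as an approximation.
TwitterUser = Struct.new(:id, :username, :fullname, :homepage, :location, :description)
BlogUser    = Struct.new(:id, :username, :homepage, :signup_date)

def copy_common_fields(target, source)
  # Only fields declared on both structs are copied; the rest stay nil.
  (target.members & source.members).each do |field|
    target[field] = source[field]
  end
  target
end

twitter_user = TwitterUser.new(7, "monkeyking", "Sun Wukong",
                               "http://example.com", "Huaguo Shan", "simian")
blog_user = copy_common_fields(BlogUser.new, twitter_user)
p blog_user.to_a   # [7, "monkeyking", "http://example.com", nil]
```

Fields that exist only on the target, such as @signup_date@ here, are left for the caller to fill in, just as the mapper sets @post.link@ and @post.body@ by hand.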
h3. Advanced Patterns

Wukong has a good collection of map/reduce patterns. Here's an AccumulatingReducer that takes a long list of key-value pairs and emits, for each key, all its corresponding values in one line.

<pre><code>    #
    # Roll up all values for each key into a single line
    #
    class GroupByReducer < Wukong::Streamer::AccumulatingReducer
      attr_accessor :values

      # Start with an empty list
      def start! *args
        self.values = []
      end

      # Aggregate each value in turn
      def accumulate key, value
        self.values << value
      end

      # Emit the key and all values, tab-separated
      def finalize
        yield [key, values].flatten
      end
    end
</code></pre>

So given adjacency pairs for the following directed friend graph:

<pre><code>
@jerry @elaine
@elaine @jerry
@jerry @kramer
@kramer @jerry
@kramer @bobsacamato
@kramer @newman
@jerry @superman
@newman @kramer
@newman @elaine
@newman @jerry
</code></pre>

You'd end up with

<pre><code>
@elaine @jerry
@jerry @elaine @kramer @superman
@kramer @bobsacamato @jerry @newman
@newman @elaine @jerry @kramer
</code></pre>

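The same roll-up can be checked locally with a few lines of plain Ruby. This is a simulation, not Wukong code: @group_by_key@ is a hypothetical helper, and sorting whole pairs before grouping is an assumption chosen to match the ordering of the output shown.

```ruby
# Plain-Ruby simulation of the group-by roll-up (not Wukong code).
# `group_by_key` is a hypothetical helper. Pairs are sorted first because
# a reducer receives its input sorted by key.
def group_by_key(pairs)
  pairs.sort.group_by(&:first).map do |key, group|
    [key, group.map(&:last)].flatten
  end
end

PAIRS = [
  %w[@jerry @elaine], %w[@elaine @jerry], %w[@jerry @kramer],
  %w[@kramer @jerry], %w[@kramer @bobsacamato], %w[@kramer @newman],
  %w[@jerry @superman], %w[@newman @kramer], %w[@newman @elaine],
  %w[@newman @jerry]
]

# Print one tab-separated line per key.
group_by_key(PAIRS).each { |row| puts row.join("\t") }
```

Running it prints the four rows listed above, one per user who follows someone.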
h2. Why is it called Wukong?

Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill:

bq. Sun Wukong (孙悟空), known in the West as the Monkey King, is the main character in the classical Chinese epic novel Journey to the West. In the novel, he accompanies the monk Xuanzang on the journey to retrieve Buddhist sutras from India.

bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn (8,100 kg) Ruyi Jingu Bang with ease. He also has superb speed, traveling 108,000 li (54,000 kilometers) in one somersault. Sun knows 72 transformations, which allows him to transform into various animals and objects; he is, however, shown with slight problems transforming into other people, since he is unable to complete the transformation of his tail. He is a skilled fighter, capable of holding his own against the best generals of heaven. Each of his hairs possesses magical properties, and is capable of transforming into a clone of the Monkey King himself, or various weapons, animals, and other objects. He also knows various spells in order to command wind, part water, conjure protective circles against demons, freeze humans, demons, and gods alike. -- ["Sun Wukong's Wikipedia entry":http://en.wikipedia.org/wiki/Wukong]

The "Jamie Hewlett / Damon Albarn short":http://news.bbc.co.uk/sport1/hi/olympics/monkey that the BBC made for their 2008 Olympics coverage gives the general idea.

<notextile><div class="toggle"></notextile>

h2. More info

There are many useful examples in the examples/ directory.

h3. Credits

Wukong was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org / "@mrflip":http://twitter.com/mrflip) for the "infochimps project":http://infochimps.org

Patches submitted by:
* gemified by Ben Woosley (ben.woosley with the gmails)
* ruby interpreter path fix by "Yuichiro MASUI":http://github.com/masuidrive - masui at masuidrive.jp - http://blog.masuidrive.jp/

Thanks to:
* "Fredrik Möllerstrand (@lenbust)":http://twitter.com/lenbust for the examples/contrib/jeans working example
* "Brad Heintz":http://www.bradheintz.com/no1thing/talks/ for his early feedback
* "Phil Ripperger":http://blog.pdatasolutions.com for his "wukong in the Amazon AWS cloud":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart tutorial.

h3. Help!

Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code

<notextile></div></notextile>