h1. Wukong

Wukong makes "Hadoop":http://hadoop.apache.org/core so easy a chimpanzee can use it.

Treat your dataset like a
* stream of lines when it's efficient to process by lines
* stream of field arrays when it's efficient to deal directly with fields
* stream of lightweight objects when it's efficient to deal with objects

Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line.
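
The three views of a dataset listed above can be sketched in plain Ruby. This is only an illustration and does not use Wukong itself: the @User@ struct, the helper names, and the sample record are all hypothetical.

```ruby
# Hypothetical record layout, for illustration only (not part of Wukong).
User = Struct.new(:id, :username, :location)

# View 1: the raw line, minus its trailing newline.
def as_line(raw)
  raw.chomp
end

# View 2: an array of tab-separated fields.
def as_fields(raw)
  as_line(raw).split("\t")
end

# View 3: a lightweight object built from those fields.
def as_object(raw)
  User.new(*as_fields(raw))
end

# A made-up record, as it might arrive on STDIN from `cat`.
raw = "24601\texample_user\tSomewhere\n"
puts as_line(raw)
p as_fields(raw)
puts as_object(raw).username
```

Each view costs a little more work per record than the one before it, which is why it helps to pick the cheapest one that fits the job.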

The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com/wukong Please feel free to add supplemental information to the "wukong wiki.":http://wiki.github.com/mrflip/wukong

* "Install and set up wukong":http://mrflip.github.com/wukong/INSTALL.html
* "Tutorial":http://mrflip.github.com/wukong/tutorial.html
* "Usage notes":http://mrflip.github.com/wukong/usage.html
* "Wutils":http://mrflip.github.com/wukong/wutils.html -- command-line utilities for working with data
* Links and tips for "configuring and working with hadoop":http://mrflip.github.com/wukong/hadoop-tips.html
* Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
* "More info":http://mrflip.github.com/wukong/moreinfo.html

h2. Install

** "Main Install and Setup Documentation":http://mrflip.github.com/wukong/INSTALL.html **

h3. Get the code

We're still actively developing wukong. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/wukong

pre. $ git clone git://github.com/mrflip/wukong

A gem is available from "gemcutter:":http://gemcutter.org/gems/wukong

pre. $ sudo gem install wukong --source=http://gemcutter.org

(don't use the gems.github.com version -- it's way out of date.)

You can instead download this project in either "zip":http://github.com/mrflip/wukong/zipball/master or "tar":http://github.com/mrflip/wukong/tarball/master formats.

h3. Dependencies and setup

To finish setting up, see the "detailed setup instructions":http://mrflip.github.com/wukong/INSTALL.html and then read the "usage notes":http://mrflip.github.com/wukong/usage.html

h2. How to write a Wukong script

** "Tutorial By Example":http://mrflip.github.com/wukong/tutorial.html **

Here's a script to count words in a text stream:

<pre><code>    require 'wukong'
    module WordCount
      class Mapper < Wukong::Streamer::LineStreamer
        # Emit each word in the line.
        def process line
          words = line.strip.split(/\W+/).reject(&:blank?)
          words.each{|word| yield [word, 1] }
        end
      end

      class Reducer < Wukong::Streamer::ListReducer
        # values holds every [word, count] record for the current key;
        # emit the key with the sum of its counts.
        def finalize
          yield [ key, values.map(&:last).map(&:to_i).sum ]
        end
      end
    end

    Wukong::Script.new(
      WordCount::Mapper,
      WordCount::Reducer
      ).run # Execute the script
</code></pre>

The first class, the Mapper, consumes lines and emits @[word, count]@ records: the word is the _key_ and its count is the _value_.

In the reducer, the values for each key are stacked up into a list; then the record(s) yielded by @#finalize@ are emitted. There are many other ways to write the reducer (most of them are better) -- see the ["examples":examples/].
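
To see the map, sort, and reduce flow without a Hadoop cluster, the same logic can be approximated in plain Ruby. This is a rough sketch rather than how Wukong runs a script: @map_line@, @reduce_pairs@, and @word_count@ are hypothetical names, @empty?@ stands in for @blank?@, and @Array#sum@ assumes Ruby 2.4 or later.

```ruby
# Plain-Ruby simulation of the map -> sort -> reduce flow.
# These helpers are illustrative only; they are not Wukong classes.

# Map step: emit a [word, 1] pair for every word in the line.
def map_line(line)
  line.strip.split(/\W+/).reject(&:empty?).map { |word| [word, 1] }
end

# Reduce step: given every pair collected for one key, emit [key, total].
def reduce_pairs(key, pairs)
  [key, pairs.map(&:last).map(&:to_i).sum]
end

def word_count(lines)
  pairs = lines.flat_map { |line| map_line(line) }
  pairs.sort_by(&:first)    # stand-in for the sort between map and reduce
       .group_by(&:first)   # stack up the values for each key
       .map { |key, group| reduce_pairs(key, group) }
end

word_count(["the cat sat", "the hat"]).each do |word, count|
  puts "#{word}\t#{count}"
end
```

The reduce step only ever sees the pairs for a single key, which is the property that lets the real job be spread across many machines.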

h3. Structured data stream

You can also use structs to treat your dataset as a stream of objects:

<pre><code>    require 'wukong'
    require 'my_blog' # defines the blog models
    # structs for our input objects
    Tweet = Struct.new( :id, :created_at, :twitter_user_id,
      :in_reply_to_user_id, :in_reply_to_status_id, :text )
    TwitterUser = Struct.new( :id, :username, :fullname,
      :homepage, :location, :description )
    module TwitBlog
      class Mapper < Wukong::Streamer::RecordStreamer
        # Watch for tweets by me
        MY_USER_ID = 24601
        #
        # If this is a tweet by me, convert it to a Post.
        #
        # If it is a tweet not by me, convert it to a Comment that
        # will be paired with the correct Post.
        #
        # If it is a TwitterUser, convert it to a User record and
        # a user_location record
        #
        def process record
          case record
          when TwitterUser
            user     = MyBlog::User.new.merge(record) # grab the fields in common
            user_loc = MyBlog::UserLoc.new(record.id, record.location, nil, nil)
            yield user
            yield user_loc
          when Tweet
            if record.twitter_user_id == MY_USER_ID
              post = MyBlog::Post.new.merge record
              post.link  = "http://twitter.com/statuses/show/#{record.id}"
              post.body  = record.text
              post.title = record.text[0..65] + "..."
              yield post
            else
              comment = MyBlog::Comment.new.merge record
              comment.body    = record.text
              comment.post_id = record.in_reply_to_status_id
              yield comment
            end
          end
        end
      end
    end
    Wukong::Script.new( TwitBlog::Mapper, nil ).run # identity reducer
</code></pre>

h3. Advanced Patterns

Wukong has a good collection of map/reduce patterns. Here's an AccumulatingReducer that takes a long list of key-value pairs and emits, for each key, all its corresponding values in one line.

<pre><code>    #
    # Roll up all values for each key into a single line
    #
    class GroupByReducer < Wukong::Streamer::AccumulatingReducer
      attr_accessor :values

      # Start with an empty list
      def start! *args
        self.values = []
      end

      # Aggregate each value in turn
      def accumulate key, value
        self.values << value
      end

      # Emit the key and all values, tab-separated
      def finalize
        yield [key, values].flatten
      end
    end
</code></pre>

So given adjacency pairs for the following directed friend graph:

<pre><code>
    @jerry @elaine
    @elaine @jerry
    @jerry @kramer
    @kramer @jerry
    @kramer @bobsacamato
    @kramer @newman
    @jerry @superman
    @newman @kramer
    @newman @elaine
    @newman @jerry
</code></pre>

You'd end up with

<pre><code>
    @elaine @jerry
    @jerry @elaine @kramer @superman
    @kramer @bobsacamato @jerry @newman
    @newman @elaine @jerry @kramer
</code></pre>

h2. Why is it called Wukong?

Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill:

bq. Sun Wukong (孙悟空), known in the West as the Monkey King, is the main character in the classical Chinese epic novel Journey to the West. In the novel, he accompanies the monk Xuanzang on the journey to retrieve Buddhist sutras from India.

bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn (8,100 kg) Ruyi Jingu Bang with ease. He also has superb speed, traveling 108,000 li (54,000 kilometers) in one somersault. Sun knows 72 transformations, which allows him to transform into various animals and objects; he is, however, shown with slight problems transforming into other people, since he is unable to complete the transformation of his tail. He is a skilled fighter, capable of holding his own against the best generals of heaven. Each of his hairs possesses magical properties, and is capable of transforming into a clone of the Monkey King himself, or various weapons, animals, and other objects. He also knows various spells in order to command wind, part water, conjure protective circles against demons, freeze humans, demons, and gods alike. -- ["Sun Wukong's Wikipedia entry":http://en.wikipedia.org/wiki/Wukong]

The "Jamie Hewlett / Damon Albarn short":http://news.bbc.co.uk/sport1/hi/olympics/monkey that the BBC made for their 2008 Olympics coverage gives the general idea.