#!/usr/bin/env ruby
require 'rubygems'
require 'wukong/script'

module WordCount
  class Mapper < Wukong::Streamer::LineStreamer
    #
    # Split a string into its constituent words.
    #
    # This is pretty simpleminded:
    # * Downcase the string
    # * Split at any non-alphanumeric boundary, including '_'
    # * However, preserve a trailing contraction or possessive at the end
    #   of a word: 's, 't, 'd, 'm, 're, 've or 'll
    # * Drop any word shorter than three characters
    #
    #   tokenize("Ability is a poor man's wealth #johnwoodenquote")
    #   # => ["ability", "poor", "man's", "wealth", "johnwoodenquote"]
    #
    def tokenize str
      return [] if str.blank?
      str = str.downcase
      # kill off all punctuation except a trailing contraction such as
      # [stuff]'s or [stuff]'t; this includes hyphens (words are split)
      str = str.
        gsub(/[^a-zA-Z0-9\']+/, ' ').
        gsub(/(\w)\'([stdm]|re|ve|ll)\b/, '\1!\2').gsub(/\'/, ' ').gsub(/!/, "'")
      # Busticate at whitespace
      words = str.split(/\s+/)
      # Discard empty strings and words shorter than three characters
      words.reject!{|w| w.length < 3 }
      words
    end

    #
    # Emit each word in each line.
    #
    def process line
      tokenize(line).each{|word| yield [word, 1] }
    end
  end

  #
  # You can stack up all the values in a list then sum them at once.
  #
  # This isn't good style, as it means the whole list is held in memory.
  #
  class Reducer1 < Wukong::Streamer::ListReducer
    def finalize
      yield [ values.map(&:last).map(&:to_i).inject(0){|tot,x| tot+x }, key ]
    end
  end

  #
  # A bit kinder to your memory manager: accumulate the sum record-by-record:
  #
  class Reducer2 < Wukong::Streamer::AccumulatingReducer
    def start!(*args) @key_count = 0 end
    def accumulate(*args) @key_count += 1 end
    def finalize
      yield [ @key_count, key ]
    end
  end

  #
  # ... easiest of all, though: this is common enough that it's already included
  #
  require 'wukong/streamer/count_keys'
  class Reducer3 < Wukong::Streamer::CountKeys
  end
end

# Execute the script
Wukong.run(
  WordCount::Mapper,
  WordCount::Reducer2
)
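
# The map and reduce steps above can be tried without Hadoop or Wukong.
# What follows is a minimal plain-Ruby sketch, not part of Wukong itself:
# the module WordCountSketch and its methods map_lines and reduce_pairs are
# hypothetical names made up for illustration. It assumes the input has
# already been read into an array of lines, uses a nil/empty check in place
# of String#blank?, and sorts the pairs to stand in for the shuffle that
# groups identical keys before the reducer sees them. Output pairs follow
# the [count, key] order that Reducer2 yields.

```ruby
# Hypothetical standalone helpers; none of these names come from Wukong.
module WordCountSketch
  module_function

  # Same tokenizing rules as WordCount::Mapper#tokenize, minus String#blank?.
  def tokenize(str)
    return [] if str.nil? || str.strip.empty?
    str.downcase.
      gsub(/[^a-z0-9']+/, ' ').                  # punctuation becomes a space
      gsub(/(\w)'([stdm]|re|ve|ll)\b/, '\1!\2'). # protect trailing contractions
      gsub(/'/, ' ').                            # drop every other apostrophe
      gsub(/!/, "'").                            # restore the protected ones
      split(/\s+/).
      reject { |w| w.length < 3 }                # also removes empty strings
  end

  # Map phase: one [word, 1] pair per token.
  def map_lines(lines)
    lines.flat_map { |line| tokenize(line).map { |word| [word, 1] } }
  end

  # Reduce phase: sort so equal keys are adjacent, then keep a single
  # running count per key instead of collecting every value in a list.
  def reduce_pairs(pairs)
    counts = []
    pairs.sort_by(&:first).each do |word, n|
      if counts.last && counts.last[1] == word
        counts.last[0] += n
      else
        counts << [n, word]
      end
    end
    counts
  end
end
```

# For example, WordCountSketch.reduce_pairs(WordCountSketch.map_lines(lines))
# returns one [count, word] pair per distinct word, ordered by word.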