This repository has been archived by the owner. It is now read-only.
hadoop ruby/streaming statistically improbable phrases
input.eg
input.simple
util
README
Rakefile
calc_sips_simple.rb
count_total_num_terms.rb
double_value_sum.rb
emit_first_component_as_key.rb
emit_ngrams.rb
emit_terms.rb
emit_unique_ngrams.rb
explode_ngrams.rb
explode_trigrams.rb
explode_trigrams_as_bigrams.rb
first_component_freq.rb
insert_filename_at_start_and_remove_blanks.rb
join_markov_chain.rb
join_trigram_frequency.rb
join_trigram_markov_frequency.rb
least_frequent_trigrams_map.rb
least_frequent_trigrams_reduce.rb
long_value_sum.rb
top_n.rb

README

an on-the-train project that attempts to copy amazon's statistically improbable phrases (sips) calculation.
the data comes from project gutenberg; the jobs run on hadoop streaming with ruby map/reduce functions.
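
for readers new to hadoop streaming, here is a minimal sketch of what a ruby map/reduce pair for counting trigrams might look like. it is not taken from the scripts in this repo: the names map_trigrams and reduce_sum, the tokenising regex and the sample lines are all assumptions made for illustration. the map step emits tab-separated "key, 1" records and the reduce step sums the values of adjacent records with the same key, relying on the sort that hadoop streaming does between the two steps.

```ruby
# Hypothetical helpers; the names are illustrative, not taken from the repository.

# Map step: turn one line of text into "trigram\t1" records.
# Assumption: a term is a run of lowercase letters or apostrophes.
def map_trigrams(line)
  terms = line.downcase.scan(/[a-z']+/)
  terms.each_cons(3).map { |trigram| "#{trigram.join(' ')}\t1" }
end

# Reduce step: sum the counts of records that arrive sorted by key,
# which is how Hadoop streaming delivers them to a reducer.
def reduce_sum(sorted_records)
  totals = []
  sorted_records.each do |record|
    key, value = record.chomp.split("\t")
    if totals.last && totals.last[0] == key
      totals.last[1] += value.to_i
    else
      totals << [key, value.to_i]
    end
  end
  totals.map { |key, sum| "#{key}\t#{sum}" }
end

# Local stand-in for the shuffle phase: map every line, sort, then reduce.
lines = ["the cat sat on the mat", "The cat sat down"]
mapped = lines.flat_map { |line| map_trigrams(line) }
puts reduce_sum(mapped.sort)
```

in a real streaming job each step would live in its own script, reading records from STDIN and writing them to STDOUT; the sort in the last lines only imitates that locally.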

see project page at http://matpalm.com/sip

bash> <start-hadoop-here/>
bash> rake prepare_files input=input.eg # upload 8 file example
bash> rake upload_input
bash> rake calculate_sips 
bash> rake cat dir=least_freq_trigrams
bash> # bask in glow of diy-sips
bash> zcat hadoop-input/*gz | ./calc_sips_simple.rb
bash> # be amazed by how much faster it was to NOT use hadoop
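
to give a feel for why a single process can be enough on a small corpus, here is a rough in-memory sketch. it is not the contents of calc_sips_simple.rb. it assumes, going only by the least_freq_trigrams output name above, that a text's candidate phrases are the trigrams found in the fewest documents of the corpus. the function names, the tokenising regex and the two sample texts are hypothetical.

```ruby
# Hypothetical in-memory version; not the contents of calc_sips_simple.rb.
# Assumption: a text's candidate phrases are its trigrams that appear in
# the fewest documents across the whole corpus.

def trigrams(text)
  text.downcase.scan(/[a-z']+/).each_cons(3).map { |t| t.join(' ') }
end

# docs: hash of document name => full text.
# Returns document name => up to n trigrams, rarest in the corpus first
# (ties broken alphabetically).
def least_frequent_trigrams(docs, n = 3)
  doc_freq = Hash.new(0)
  per_doc = docs.transform_values { |text| trigrams(text).uniq }
  # Count how many documents contain each trigram.
  per_doc.each_value { |grams| grams.each { |g| doc_freq[g] += 1 } }
  per_doc.transform_values do |grams|
    grams.sort_by { |g| [doc_freq[g], g] }.first(n)
  end
end

sample = {
  "a" => "once upon a time there was",
  "b" => "once upon a time in space"
}
least_frequent_trigrams(sample, 2).each do |name, grams|
  puts "#{name}\t#{grams.join(', ')}"
end
```

everything is held in plain hashes, so this only works while the whole corpus fits in memory.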

coming soon: running in the cloud, when does hadoop become worth it...