cc2text

This project converts the web page archives stored in Common Crawl's public data set
into plain-text equivalents of those pages.

To test it locally, run this command pipeline:
./cc2text_map.rb < example_input.txt | ./cc2text_reduce.rb | gzip -c > example_output.arc.gz
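
If you want to see what each stage produces, here are a couple of optional spot-checks. They just reuse the example files from the pipeline above; head is only there to keep the output short:

# Run the mapper on its own to see the intermediate records it emits:
./cc2text_map.rb < example_input.txt | head

# Decompress and inspect the final ARC output:
gzip -dc example_output.arc.gz | head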

To run it on Amazon's Elastic MapReduce service, you can follow steps very similar to these:
http://petewarden.typepad.com/searchbrowser/2012/03/twelve-steps-to-running-your-ruby-code-across-five-billion-web-pages.html
You'll need to add the following to the Extra Args box to get gzipped output files:
-jobconf mapred.output.compress=true -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
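
For reference, those settings correspond to a plain Hadoop Streaming invocation roughly like the one below. This is only a sketch: the streaming jar path and the S3 bucket and paths are placeholders for your own setup, not part of this repo.

# Sketch only: adjust the jar location and S3 paths to match your cluster.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -input s3://your-bucket/input-paths.txt \
  -output s3://your-bucket/cc2text-output \
  -mapper cc2text_map.rb \
  -reducer cc2text_reduce.rb \
  -file cc2text_map.rb \
  -file cc2text_reduce.rb \
  -jobconf mapred.output.compress=true \
  -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec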

Based on original code by Ben Nagy. This example is by Pete Warden, pete@petewarden.com
