petewarden/cc2text
cc2text

This project converts the web page archives stored in Common Crawl's public data set into text equivalents of those same pages.

To test it locally, use this set of commands:

    ./cc2text_map.rb < example_input.txt | ./cc2text_reduce.rb | gzip -c > example_output.arc.gz

To run it on Amazon's Elastic MapReduce service, you can follow very similar steps to these:

http://petewarden.typepad.com/searchbrowser/2012/03/twelve-steps-to-running-your-ruby-code-across-five-billion-web-pages.html

You'll need to add these to the Extra Args box to get gzipped output files:

    -jobconf mapred.output.compress=true
    -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec

Based on original code by Ben Nagy. This example by Pete Warden, pete@petewarden.com.
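The internals of cc2text_map.rb aren't shown here, but the Hadoop-streaming contract it follows is simple: read records on STDIN, write text on STDOUT. A minimal sketch of that shape, assuming a hypothetical html_to_text helper (not part of this repository) that reduces an HTML record to plain text:

    require 'English'

    # Hypothetical helper: strip an HTML record down to plain text.
    # The real mapper parses full ARC records; this only illustrates
    # the HTML-to-text step and the streaming stdin/stdout contract.
    def html_to_text(html)
      # Drop script/style bodies entirely, since their contents aren't prose.
      text = html.gsub(%r{<(script|style)[^>]*>.*?</\1>}mi, ' ')
      # Remove any remaining tags, then collapse runs of whitespace.
      text = text.gsub(/<[^>]+>/, ' ')
      text.gsub(/\s+/, ' ').strip
    end

    if __FILE__ == $PROGRAM_NAME
      puts html_to_text('<html><body><h1>Example</h1><p>Some text.</p></body></html>')
      # As a streaming mapper, it would instead process each input record:
      #   STDIN.each_line { |line| puts html_to_text(line) }
    end

Because Elastic MapReduce only sees stdin and stdout, a script like this can be tested locally by piping a file through it, exactly as the command above does.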
About
An example job that converts Common Crawl archived web pages into text