cc2text

This project converts the web page archives stored in Common Crawl's public data set
into plain-text equivalents of those pages.

To test it locally, run this command pipeline:
./cc2text_map.rb < example_input.txt | ./cc2text_reduce.rb | gzip -c > example_output.arc.gz
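
If you want to see what each stage produces, here are a couple of optional spot-checks. They just reuse the example files from the pipeline above; head is only there to keep the output short:

# Run the mapper on its own to see the intermediate records it emits:
./cc2text_map.rb < example_input.txt | head

# Decompress and inspect the final ARC output:
gzip -dc example_output.arc.gz | head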

To run it on Amazon's Elastic MapReduce service, you can follow steps very similar to these:
http://petewarden.typepad.com/searchbrowser/2012/03/twelve-steps-to-running-your-ruby-code-across-five-billion-web-pages.html
You'll need to add the following to the Extra Args box to get gzipped output files:
-jobconf mapred.output.compress=true -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
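
For reference, those settings correspond to a plain Hadoop Streaming invocation roughly like the one below. This is only a sketch: the streaming jar path and the S3 bucket and paths are placeholders for your own setup, not part of this repo.

# Sketch only: adjust the jar location and S3 paths to match your cluster.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -input s3://your-bucket/input-paths.txt \
  -output s3://your-bucket/cc2text-output \
  -mapper cc2text_map.rb \
  -reducer cc2text_reduce.rb \
  -file cc2text_map.rb \
  -file cc2text_reduce.rb \
  -jobconf mapred.output.compress=true \
  -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec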

Based on original code by Ben Nagy. This example is by Pete Warden, pete@petewarden.com
