CommonCrawl Hello World example


This is a simple library demonstrating how to analyze the CommonCrawl dataset
by implementing the canonical Hadoop Hello World program, a simple word count.
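
For readers new to the idea, here is a minimal sketch of the word count pattern
as a plain shell pipeline. It is not the code in this repository and does not
use Hadoop; the count_words function and the sample input are made up for
illustration. The three stages only loosely mirror the map, shuffle, and reduce
phases of a Hadoop job.

```shell
count_words() {
  # "Map": break the input into one lowercase word per line.
  # "Shuffle": sort so that identical words end up adjacent.
  # "Reduce": collapse each run of identical words into a single count.
  tr -cs '[:alpha:]' '\n' \
    | tr '[:upper:]' '[:lower:]' \
    | sort \
    | uniq -c \
    | awk 'NF == 2 { print $2 "\t" $1 }'   # skip blank tokens, print word<TAB>count
}

# Hypothetical sample input; the real job reads CommonCrawl crawl files.
printf 'hello world\nhello crawl\n' | count_words
```

Each output line is a word followed by a tab and the number of times it occurs
in the input.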

To build

You'll need to have Apache Ant installed. Once you do, just run:

# ant dist

This step will compile the libraries and Hadoop code into an Elastic MapReduce-
friendly JAR at dist/lib/HelloWorld.jar, suitable for use as a custom JAR-based
Elastic MapReduce workflow.
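
As a quick sanity check after the build, something like the following sketch
can confirm that the JAR ended up where this README says it should. The
check_jar helper is hypothetical and is not part of the build; it only tests
that a non-empty file exists at the given path.

```shell
# Path taken from the build description in this README; override JAR if needed.
JAR="${JAR:-dist/lib/HelloWorld.jar}"

check_jar() {
  # Succeeds only if the path is a regular file with a size greater than zero.
  if [ -f "$1" ] && [ -s "$1" ]; then
    echo "found: $1"
  else
    echo "missing: $1 (did 'ant dist' finish without errors?)" >&2
    return 1
  fi
}

# Report the result without aborting the calling shell when the JAR is absent.
check_jar "$JAR" || true
```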

To run locally

You'll need to be running Hadoop. If you don't have it installed, Cloudera
provides a set of OS-specific Hadoop packages that make installation easy;
see their site for details.

Once you've got Hadoop installed, you can use the 'hadoop jar' command to
execute the tutorial code.  Here's the pattern:

hadoop jar <checkout location>/dist/lib/HelloWorld.jar org.commoncrawl.tutorial.HelloWorld <Amazon AWS access key ID> <Amazon AWS secret access key> <CommonCrawl crawl files to use as input> <HDFS output location>
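
To make the argument order concrete, here is a small sketch that fills in the
pattern from five positional arguments. The run_hello_world wrapper and the
DRY_RUN switch are hypothetical conveniences, not part of this project, and
every value in the example call is a placeholder rather than a real credential
or path. By default the wrapper only prints the command it would run.

```shell
run_hello_world() {
  # Arguments, in the same order as the pattern above:
  #   $1 checkout location      $2 AWS access key ID
  #   $3 AWS secret access key  $4 CommonCrawl input files
  #   $5 HDFS output location
  set -- hadoop jar "$1/dist/lib/HelloWorld.jar" \
    org.commoncrawl.tutorial.HelloWorld "$2" "$3" "$4" "$5"

  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"   # dry run (the default): print the command only
  else
    "$@"        # DRY_RUN=0: actually invoke hadoop
  fi
}

# Placeholder values only; substitute your own before setting DRY_RUN=0.
run_hello_world "$HOME/checkout" "MY_ACCESS_KEY_ID" "MY_SECRET_ACCESS_KEY" \
  "INPUT_PLACEHOLDER" "OUTPUT_PLACEHOLDER"
```

Keeping the dry run as the default makes it easy to check the assembled command
before any credentials are passed to Hadoop.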