Code for running Hadoop jobs on Amazon EC2 to analyze the Common Crawl data set. Building this code requires Apache Ant.
To use this code, do the following:
- Clone the repository
- Update `src/com/lavalamp/watprocessing/HadoopRunner.java` to reflect the Common Crawl data you want to process and the S3 bucket where you want to store the results (a rough sketch of the kind of settings involved follows this list)
- Build the JAR file by running `ant compile jar` in the root project directory
- Use the JAR file located at `dist/lib/LavaHadoop.jar` with Hadoop to analyze Common Crawl data
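
For a rough idea of what the HadoopRunner.java edit involves, here is a minimal sketch of a Hadoop job wired to a Common Crawl WAT input path and an S3 output location. The class name, field names, and paths below are illustrative assumptions rather than the actual contents of HadoopRunner.java; check the source for the real entry point and the mapper/reducer classes it configures.

```java
// Illustrative sketch only -- not the actual HadoopRunner.java from this repository.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HadoopRunnerSketch {

    // Common Crawl WAT files to process (assumed path layout; point this at the crawl you want).
    private static final String INPUT_PATH =
            "s3://commoncrawl/crawl-data/CC-MAIN-2014-10/segments/*/wat/*.warc.wat.gz";

    // S3 bucket/prefix where job output will be written (placeholder name).
    private static final String OUTPUT_PATH = "s3://your-results-bucket/wat-output";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Common Crawl WAT processing");
        job.setJarByClass(HadoopRunnerSketch.class);

        // Mapper, reducer, and output key/value classes are omitted here;
        // the real ones live under src/com/lavalamp/watprocessing.
        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```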
You should definitely only run this on Amazon Elastic MapReduce unless you want it to take forever and drain your bank account.
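
One way to run the JAR on EMR is to upload it to S3 and submit it as a step on an existing cluster using the AWS SDK for Java. The sketch below is an assumption-laden example, not part of this repository: the S3 location, step name, and cluster ID are placeholders, and any job arguments would depend on how HadoopRunner.java is written.

```java
// Hypothetical example of submitting dist/lib/LavaHadoop.jar (uploaded to S3) as an EMR step.
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class SubmitEmrStep {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // JAR location is a placeholder -- copy dist/lib/LavaHadoop.jar to your own bucket first.
        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                .withJar("s3://your-bucket/LavaHadoop.jar");

        StepConfig step = new StepConfig()
                .withName("LavaHadoop Common Crawl processing")
                .withActionOnFailure("CONTINUE")
                .withHadoopJarStep(jarStep);

        // Cluster ID is a placeholder for an already-running EMR cluster.
        emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                .withJobFlowId("j-XXXXXXXXXXXXX")
                .withSteps(step));
    }
}
```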
To process the results of Hadoop jobs run with this code, check out the lava-hadoop-processing repository. If you're just interested in the content discovery hit lists produced by this research, take a look at the content-discovery-hit-lists repository.
More details will be available via a blog post on lavalamp's personal blog in the near future.