GitHub - netconstructor/commoncrawl: CommonCrawl Project Repository

#CommonCrawl Support Library

##Overview

The commoncrawl source code repository is used as a distribution vehicle for our custom Hadoop InputFormat (ARCInputFormat located in org.commoncrawl.hadoop.io). Please refer to the CommonCrawl website at http://www.commoncrawl.org/ for more details on how to access our crawl corpus.

See the sample class BasicArcFileReaderSample.java (located in org.commoncrawl.samples) for an example of how to configure the InputFormat. A more detailed example of how to use it in the context of a Hadoop job will be forthcoming.

##Build Notes:

1. Define JAVA_HOME, and make sure you have Ant and Maven installed.
2. Set hadoop.path (in build.properties) to point to your Hadoop folder.
3. Make sure you have the Thrift compiler (version 0.7.0) installed on your system.
4. If you want to use the Google URL Canonicalization library in a Hadoop job, copy the shared libraries under lib/native/{Platform} to /usr/local/lib or equivalent.
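
The prerequisites above can be checked before starting a build. The following is a minimal sketch and is not part of the repository: the function name `check_build_prereqs` is hypothetical, and it assumes the tools are available on your PATH under the command names `ant`, `mvn`, and `thrift`. It only reports what appears to be missing; it does not verify the Thrift version or read build.properties.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): reports which of the
# build prerequisites listed in the notes above appear to be missing.
check_build_prereqs() {
    missing=0
    # JAVA_HOME must be defined and non-empty.
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "missing: JAVA_HOME is not set"
        missing=1
    fi
    # Assumption: the tools are on PATH under their usual command names.
    for tool in ant mvn thrift; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool not found on PATH"
            missing=1
        fi
    done
    # Returns 0 when nothing was reported, 1 otherwise.
    return "$missing"
}
```

For example, `check_build_prereqs && echo "prerequisites look OK"` prints the confirmation only when JAVA_HOME is set and all three tools were found.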

##Sample Usage:

Once commoncrawl.jar has been built, you can run a job or sample via the bin/launcher.sh script. The sample class BasicArcFileReaderSample.java (located in org.commoncrawl.samples) demonstrates how to configure our InputFormat. To run BasicArcFileReaderSample against an ARC file in the corpus (for example, 2010/01/07/18/1262876244253_18.arc.gz), use the following command line:

    bin/launcher.sh org.commoncrawl.samples.BasicArcFileReaderSample {AWS ACCESS KEY} {AWS SECRET KEY} commoncrawl-crawl-002 2010/01/07/18/1262876244253_18.arc.gz
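
To avoid typing the AWS keys on every invocation, the command line above could be wrapped in a small shell function. This is a sketch under stated assumptions and is not part of the repository: `run_arc_sample` is a hypothetical name, the credentials are assumed to be exported as `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, and the `LAUNCHER` override exists only for illustration and dry runs. The argument order simply mirrors the command shown above.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of the repository) around bin/launcher.sh.
# Argument order follows the command line shown above:
#   <class> <access key> <secret key> <bucket> <ARC file path>
run_arc_sample() {
    if [ "$#" -ne 2 ]; then
        echo "usage: run_arc_sample <bucket> <arc-file-path>" >&2
        return 2
    fi
    bucket=$1
    arc_path=$2
    # Assumption: credentials are exported under these variable names.
    if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        echo "error: set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY first" >&2
        return 1
    fi
    # LAUNCHER is an illustrative override; it defaults to the repo script.
    "${LAUNCHER:-bin/launcher.sh}" \
        org.commoncrawl.samples.BasicArcFileReaderSample \
        "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY" \
        "$bucket" "$arc_path"
}
```

Run from the repository root, `run_arc_sample commoncrawl-crawl-002 2010/01/07/18/1262876244253_18.arc.gz` would issue the same command as above. Setting `LAUNCHER=echo` prints the arguments instead of launching the job, which is a convenient way to check them first.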

