Packages the ARCInputFormat used in Common Crawl in a small jar file that can be used in MapReduce jobs. Implements HdfsARCSource. See README for details

noiano/ARCInputFormat

This project extracts only the ARCInputFormat class and its dependencies from the original commoncrawl project. It also implements a new ARCSource, HDFSSource, which allows ARC files to be read from HDFS.

Differences from the original project:

How to compile

To compile the library, first edit the build.proprieties file and set the hadoop.path variable to match your Hadoop setup. Then simply invoke:

ant

You'll find ARCInputFormat.jar ready for use.
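
As a rough illustration of the configuration step, the relevant entry might look like the fragment below. This is only a sketch: it assumes hadoop.path is expected to point at the root of a local Hadoop installation, and /usr/local/hadoop is a hypothetical location that you should replace with your own.

```properties
# Hypothetical example value; replace with the location of your Hadoop installation.
# Assumption: hadoop.path points at the Hadoop root directory so the build can find the Hadoop jars.
hadoop.path=/usr/local/hadoop
```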
