arc2warc-migration-cli

Java-based command line application to convert web archive ARC files to WARC files.

Installation

Install the artifact in your local Maven repository:

cd arc2warc-migration-cli
mvn install

Create an executable jar with dependencies:

cd arc2warc-migration-cli
mvn assembly:single

Usage

usage: hadoop jar target/arc2warc-migration-1.0-SNAPSHOT-jar-with-dependencies.jar [-h] [-i ] [-m] [-o ] [-r ] [-x]
 -h,--help        print this message [optional].
 -i,--input       Path to input. [required].
 -m,--mimeident   Do payload mime type identification. [optional].
 -o,--output      Path to output. [optional].
 -r,--regex       Only input paths matching the regular expression will be processed. [optional].
 -x,--comprwarc   Create compressed WARC file. [optional].

Example:

java -jar hawarp/arc2warc-migration-cli/target/arc2warc-migration-cli-1.0-jar-with-dependencies.jar \
  -i /local/input/directory/ -o /local/output/directory/

To process only files with a specific extension (e.g. ".arc.gz"), pass a regular expression as an input file filter using the -r parameter:

java -jar hawarp/arc2warc-migration-cli/target/arc2warc-migration-cli-1.0-jar-with-dependencies.jar \
  -i /local/input/directory/ -o /local/output/directory/ -r ".*\.arc\.gz"
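
To check which paths a given expression would select before starting a conversion, the filter can be tried out in isolation. The following is a minimal sketch, not the tool's actual code: the class name InputPathFilter, the method filter, and the sample file names are hypothetical, and it assumes the expression must match the whole path (Matcher.matches()), which may differ from what the tool does internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for trying out a -r expression; not part of the tool.
public class InputPathFilter {

    // Returns the paths that the regular expression matches in full.
    // Assumption: whole-string matching; the tool itself may match differently.
    public static List<String> filter(List<String> paths, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> selected = new ArrayList<>();
        for (String path : paths) {
            if (pattern.matcher(path).matches()) {
                selected.add(path);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up sample paths for illustration only.
        List<String> paths = Arrays.asList(
                "/local/input/directory/crawl-001.arc.gz",
                "/local/input/directory/crawl-002.arc",
                "/local/input/directory/notes.txt");
        // Backslashes are doubled in Java source; the command-line value is ".*\.arc\.gz".
        for (String path : filter(paths, ".*\\.arc\\.gz")) {
            System.out.println(path);
        }
    }
}
```

With these sample paths, only the one ending in ".arc.gz" is printed. The escaped dots matter: an unescaped "." matches any character, so a looser expression could select more files than intended.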

Read more about how this tool can be used in a scalable fashion here

This tool deduplicates ARC records by reading the crawler logs stored in the metadata ARC files. Read more about this here