Let your Preservation Tools Scale
ToMaR supports the use of legacy applications in a MapReduce environment by providing third-party tools as User Defined Functions. The application specifically addresses the need for processing large volumes of binary content with existing, content-specific applications. ToMaR provides a generic MapReduce wrapper that can be used with command-line and Java applications, and it supports tools that read their input via local file pointers or via stdin/stdout streams. ToMaR implements a custom InputFormat to take advantage of data locality and can be used both as a standalone MapReduce application and as part of an Apache Pig script. More documentation can be found on the SCAPE project web site.
Simply `git clone` the repository and run `mvn install`. The Hadoop executable jar can be found in `target/tomar-*-with-dependencies.jar`.
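A minimal build session might look like the following sketch; note that the repository URL is an assumption here, so substitute the actual location of the ToMaR sources:

```bash
# Clone the ToMaR sources and build the executable jar.
# NOTE: this repository URL is an assumption; use the actual ToMaR repository.
git clone https://github.com/openplanets/tomar.git
cd tomar
mvn install
# The Hadoop executable jar is now at target/tomar-*-with-dependencies.jar
```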
To run ToMaR you need:

- a running Hadoop 1.0.x installation (standalone, pseudo-distributed, or cluster)
- SCAPE toolspecs on HDFS and the wrapped tools installed on each node
- a control file (see below for details)
ToMaR is invoked as follows:

```
hadoop jar {path-to-jar} -i {control-file} -o {output-dir-for-job} -r {toolspec-repo-dir} -n {lines-per-split}
```
- `path-to-jar` is the path to the ToMaR jar file
- `control-file` is the control file, located on HDFS
- `output-dir-for-job` is the directory on HDFS where output files will be written; the default is `out/{some random number}`
- `toolspec-repo-dir` is a directory on HDFS containing the available toolspecs
- `lines-per-split` configures the number of lines each mapper (worker node) receives for processing; the default is 10
Additionally, you can specify generic options for the Hadoop job, e.g. a custom input format or a reducer class, described here.
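A concrete invocation might look like this sketch; the HDFS paths are illustrative, not fixed names:

```bash
# Illustrative invocation; the control file, output directory and toolspec
# repository paths are examples only.
hadoop jar target/tomar-*-with-dependencies.jar \
    -i /user/you/controlfile.txt \
    -o /user/you/out/job1 \
    -r /user/you/toolspecs \
    -n 10
```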
ToMaR consumes a plain-text control file which describes one tool invocation per line. A control line consists of a pair of a toolspec name and an action of that toolspec. An action is associated with a specific shell command pattern by the toolspec.
In addition to the toolspec-action pair, a control line may contain parameters for the action. These are mapped to the placeholders in the definition of the action. Parameters are specified as a list of `--{placeholder}="{value}"` strings after the toolspec-action pair. For example:
```
fancy-tool do-fancy-thing --fancy-parameter="foo" --another-fancy-parameter="bar"
```
For this control line to function, there must be a toolspec named `fancy-tool` containing the action `do-fancy-thing`, which must have `fancy-parameter` and `another-fancy-parameter` defined in its parameters section. An action's input and output file parameters are specified in the same way. For example:
```
fancy-tool do-fancy-file-stuff --input="hdfs:///fancy-file.foo" --output="hdfs:///fancy-output-file.bar"
```
Again, an input parameter `input` and an output parameter `output` need to be defined in the corresponding sections of `do-fancy-file-stuff`.
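For orientation, a schematic toolspec for the example above might look roughly as follows. This is only a sketch: the authoritative element names and structure are defined by the SCAPE toolspec schema, so treat the markup below as an assumption and consult the shipped toolspecs (e.g. `file.xml`) for the real format.

```xml
<!-- Schematic sketch only; element names are assumptions, not the
     authoritative SCAPE toolspec schema. -->
<tool name="fancy-tool">
  <operations>
    <operation name="do-fancy-thing">
      <!-- ${...} placeholders are filled from the control line's
           --{placeholder}="{value}" parameters -->
      <command>fancy-tool ${fancy-parameter} ${another-fancy-parameter}</command>
      <parameters>
        <parameter name="fancy-parameter"/>
        <parameter name="another-fancy-parameter"/>
      </parameters>
    </operation>
  </operations>
</tool>
```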
Do not use the underscore character ('_') in toolspec, action, or key names. It is only allowed within quoted values.
As an action's command may read from standard input and/or write to standard output, a stdin and/or stdout section should be defined for the action. On the control line, these properties are mapped using the `>` character. For example:
"hdfs:///input-file.foo" > fancy-tool do-fancy-streaming > "hdfs:///output-file.bar"
Prior to executing the action, the wrapper starts reading an input stream from `hdfs:///input-file.foo` and feeds its contents to the command of `do-fancy-streaming`. Correspondingly, the output is redirected to an output stream to `hdfs:///output-file.bar`.
Instead of streaming the command's output to a file, it can be streamed to an action of another toolspec, imitating pipes in the UNIX shell. For example:
"hdfs:///input-file.foo" > fancy-tool do-fancy-streaming | funny-tool do-funny-streaming > "hdfs:///output-file.bar"
This control line results in the output of the command of `do-fancy-streaming` being piped to the command of `do-funny-streaming`. The output of the latter is then redirected to `hdfs:///output-file.bar`.
A control line may contain numerous pipes, but only one input file at the beginning and one output file at the end for file redirection. Independently of this, the piped toolspec-action pairs may contain parameters as explained in the previous section, i.e. including input and output file parameters, as sketched below.
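For instance, a pipe where the second action also takes a parameter could look like this; the toolspec and action names are the illustrative ones from above, and `--funny-parameter` is a hypothetical placeholder assumed to be defined in the `do-funny-streaming` action:

```
"hdfs:///input-file.foo" > fancy-tool do-fancy-streaming | funny-tool do-funny-streaming --funny-parameter="baz" > "hdfs:///output-file.bar"
```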
If a control line produces standard output and there is no final redirection to an output file, the output is written to Hadoop's default output file `part-r-00000`. It contains the job's output key-value pairs; the key is the hash code of the control line.
As examples, this section describes and demonstrates the execution of ToMaR on:
- file identification
- streamed file identification
- PostScript-to-PDF migration of an input PS file to an output PDF file
- streamed-in PostScript-to-PDF migration to a streamed-out PDF file
- streamed-in PS-to-PDF migration with a consecutive piped file identification
- streamed-in PS-to-PDF migration with two consecutive piped file identifications
Each example uses a control file containing just a single control line. In a production environment, of course, there would be thousands of such control lines.
- Make sure the commands `file` and `ps2pdf` are in the path on each node.
- Copy the toolspecs `file.xml` and `ps2pdf.xml` to a directory of your choice on HDFS (e.g. `/user/you/toolspecs/`).
- Copy `ps2pdf-input.ps` to a directory of your choice on HDFS (e.g. `/user/you/input/`).
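One possible way to stage these files, assuming the toolspecs and the input file are in the current local directory:

```bash
# Stage toolspecs and input data on HDFS; local source paths are assumed.
hadoop fs -mkdir /user/you/toolspecs /user/you/input
hadoop fs -put file.xml ps2pdf.xml /user/you/toolspecs/
hadoop fs -put ps2pdf-input.ps /user/you/input/
```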
File identification. Contents of the control file:
```
file identify --input="hdfs:///user/you/input/ps2pdf-input.ps"
```
After running the job, the contents of `part-r-00000` in the output directory are:
```
0 PostScript document text conforming DSC level 3.0, Level 2
```
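To inspect the result of this (and each following) example once the job has finished, one can cat the part file from HDFS; the directory name below stands in for whatever was passed via `-o`:

```bash
# 'out/job1' is a placeholder for the output directory given with -o.
hadoop fs -cat out/job1/part-r-00000
```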
Streamed file identification. Contents of the control file:
```
"hdfs:///user/you/input/ps2pdf-input.ps" > file identify-stdin
```
After running the job, the contents of `part-r-00000` in the output directory are:
```
0 PostScript document text conforming DSC level 3.0, Level 2
```
PostScript-to-PDF migration. Contents of the control file:
```
ps2pdf convert --input="hdfs:///user/you/input/ps2pdf-input.ps" --output="hdfs:///user/you/output/ps2pdf-output.pdf"
```
After running the job, the specified output file location contains the migrated PDF.
Streamed PostScript-to-PDF migration. Contents of the control file:
```
"hdfs:///user/you/input/ps2pdf-input.ps" > ps2pdf convert-streamed > "hdfs:///user/you/output/ps2pdf-output.pdf"
```
After running the job, the specified output file location contains the migrated PDF.
Streamed PS-to-PDF migration with piped file identification. Contents of the control file:
```
"hdfs:///user/you/input/ps2pdf-input.ps" > ps2pdf convert-streamed | file identify-stdin > "hdfs:///user/you/output/file-identified.txt"
```
After running the job, the contents of `file-identified.txt` in the output directory are:
```
PDF document, version 1.4
```
Streamed PS-to-PDF migration with two piped file identifications. Contents of the control file:
```
"hdfs:///user/you/input/ps2pdf-input.ps" > ps2pdf convert-streamed | file identify-stdin | file identify-stdin
```
After running the job, the contents of `part-r-00000` in the output directory are:
```
0 ASCII text
```