Extracts full text from a HT volume stored in Pairtree by concatenating the pages in the correct order, performing optional post-processing to remove hyphenation, empty lines, headers/footers, etc.

htrc/HTRC-Tools-PairtreeToText


HTRC-Tools-PairtreeToText

A tool that extracts the full text of a HathiTrust volume stored in Pairtree by concatenating its pages in the correct order. It can optionally post-process the text to identify running headers, fix end-of-line hyphenation, and reformat the text.

Build

  • To generate a package that can be invoked via a shell script, run:
    sbt stage
    The result is written to the target/universal/stage/ folder.
  • To generate a distributable ZIP package, run:
    sbt dist
    The result is written to the target/universal/ folder.

Run

The following command-line arguments are available:

pairtree-to-text
HathiTrust Research Center
 Main Options:
  -p, --pairtree  <DIR>       The path to the pairtree root hierarchy to process
  -o, --output  <DIR>         The folder where the output will be written to
  -b, --body-only             Remove running headers/footers from the pages
                              before concatenation
  -h, --dehyphenate-at-eol    Remove hyphenation for words occurring at the end
                              of a line
  -l, --para-lines            Join lines such that each paragraph is on a single
                              line

  -c, --codec  <CODEC>        The codec to use for reading the volume
      --log-level  <LEVEL>    The application log level; one of INFO, DEBUG, OFF
  -n, --num-partitions  <N>   The number of partitions to split the input set of
                              HT IDs into, for increased parallelism
      --spark-log  <FILE>     Where to write logging output from Spark to
  -w, --write-pages           Writes each page as a separate text file
      --help                  Show help message
  -v, --version               Show version of this program

 trailing arguments:
  htids (not required)   The file containing the list of HT IDs to process (if
                         omitted, will read from stdin)
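
The options above can be combined as in the following minimal sketch. It uses only flags listed in the help output. Everything else is an assumption made for illustration: the launcher path (the usual sbt-native-packager layout after `sbt stage`), the pairtree and output locations, the one-ID-per-line format of the ID file, and the placeholder HT ID, which does not refer to a real volume.

```shell
#!/bin/sh
# Assumed locations; adjust for your environment.
PAIRTREE_ROOT="${PAIRTREE_ROOT:-/path/to/pairtree}"   # hypothetical pairtree root
OUTPUT_DIR="${OUTPUT_DIR:-./output}"                  # hypothetical output folder
# Assumed launcher path, following the usual sbt-native-packager layout.
LAUNCHER="${LAUNCHER:-target/universal/stage/bin/pairtree-to-text}"

# One HT ID per line (assumed format); this ID is a placeholder, not a real volume.
printf '%s\n' 'mdp.39015000000000' > htids.txt

if [ -x "$LAUNCHER" ]; then
  # Remove running headers/footers (-b), fix end-of-line hyphenation (-h),
  # and join each paragraph onto a single line (-l).
  "$LAUNCHER" -p "$PAIRTREE_ROOT" -o "$OUTPUT_DIR" -b -h -l htids.txt

  # Same run, with the ID list supplied on stdin instead of as a file argument.
  "$LAUNCHER" -p "$PAIRTREE_ROOT" -o "$OUTPUT_DIR" -b -h -l < htids.txt
else
  echo "Launcher not found at $LAUNCHER; run 'sbt stage' first." >&2
fi
```

Adding -w to either invocation should additionally write each page as a separate text file, per the help text above.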
