HTRC-Tools-PairtreeToText

Tool that extracts full text from a HathiTrust volume stored in Pairtree by concatenating the pages in the correct order, optionally performing additional post-processing to identify running headers, fix end-of-line hyphenation, and reformat the text.
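
To make the end-of-line hyphenation fix concrete, here is a deliberately naive sketch in shell (awk). It is not the tool's implementation, which this README does not describe, and the function name dehyphenate_naive is hypothetical. It drops a trailing hyphen and joins that line with the one that follows, so it would also merge genuinely hyphenated compounds that happen to break at a line end.

```shell
# Naive illustration only: join any line ending in "-" with the next line.
dehyphenate_naive() {
  awk '{
    if (sub(/-$/, "")) {
      # Trailing hyphen removed: print without a newline so the next line continues here
      printf "%s", $0
    } else {
      # Ordinary line: emit unchanged
      print
    }
  }'
}

# Example input: a word split across two lines
printf '%s\n' 'The volume was digi-' 'tized in 2010.' | dehyphenate_naive
```

A real dehyphenation step would likely need a dictionary or similar heuristic to decide whether the hyphen belongs to the word.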

Build

To generate a package that can be invoked via a shell script, run:
sbt stage
The result is written to the target/universal/stage/ folder.
To generate a distributable ZIP package, run:
sbt dist
The result is written to the target/universal/ folder.

Run

The following command-line arguments are available:

pairtree-to-text
HathiTrust Research Center
 Main Options:
  -p, --pairtree  <DIR>       The path to the pairtree root hierarchy to process
  -o, --output  <DIR>         The folder where the output will be written to
  -b, --body-only             Remove running headers/footers from the pages
                              before concatenation
  -h, --dehyphenate-at-eol    Remove hyphenation for words occurring at the end
                              of a line
  -l, --para-lines            Join lines such that each paragraph is on a single
                              line

  -c, --codec  <CODEC>        The codec to use for reading the volume
      --log-level  <LEVEL>    The application log level; one of INFO, DEBUG, OFF
  -n, --num-partitions  <N>   The number of partitions to split the input set of
                              HT IDs into, for increased parallelism
      --spark-log  <FILE>     Where to write logging output from Spark to
  -w, --write-pages           Writes each page as a separate text file
      --help                  Show help message
  -v, --version               Show version of this program

 trailing arguments:
  htids (not required)   The file containing the list of HT IDs to process (if
                         omitted, will read from stdin)
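
As an illustration, the options above could be combined as follows. This is a sketch under assumptions: the launcher path target/universal/stage/bin/pairtree-to-text follows the usual layout produced by sbt stage and is not confirmed by this README, and the Pairtree root /data/pairtree, the output folder /data/text, and the volume IDs are hypothetical placeholders.

```shell
# Hypothetical HT IDs, one per line (placeholders, not real volumes)
printf '%s\n' 'mdp.39015000000001' 'uc1.b000000001' > htids.txt

# Assumed launcher location after `sbt stage`; adjust to your build output
LAUNCHER=target/universal/stage/bin/pairtree-to-text

# -b strips running headers/footers, -h removes end-of-line hyphenation,
# -l puts each paragraph on a single line
if [ -x "$LAUNCHER" ]; then
  "$LAUNCHER" -p /data/pairtree -o /data/text -b -h -l htids.txt
else
  echo "launcher not found at $LAUNCHER; run 'sbt stage' first" >&2
fi
```

Because the htids argument is optional, the same list can instead be piped to the tool on stdin.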
