Importing Data Into the Archive

John Abrahams edited this page Feb 4, 2019 · 7 revisions

If you need to import ONLY AoR transcriptions, you can skip to the last step, as the rest should be taken care of already.

The tool is built from source with Maven. You need Git to fetch the code and Maven to build it. The external program ImageMagick must also be installed, both to build the tool and to use it.

  1. Grab the code from GitHub using Git:
    git clone https://github.com/jhu-digital-manuscripts/rosa2.git
  2. From within the newly created rosa2 directory, install the code using Maven.
    mvn install
    Once this command completes successfully, the tool's executable JAR will be located at `rosa-archive/rosa-archive-tool/target/tool.jar`.
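
Before running the two steps above, it can help to confirm that the prerequisites are actually on the PATH. The following is a minimal sketch, not part of the rosa tool: the function names `missing_tools` and `check_build_prereqs` are made up for illustration, and it assumes ImageMagick is detectable through its `convert` command (newer installs may ship `magick` instead).

```shell
#!/bin/sh
# Sketch: report which required build tools are missing from PATH.
# Prints one missing command name per line; prints nothing if all are found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumption: ImageMagick is detected via `convert`; adjust to `magick`
# if that is what your installation provides.
check_build_prereqs() {
    missing=$(missing_tools git mvn convert)
    if [ -n "$missing" ]; then
        printf 'Missing required tools:\n%s\n' "$missing" >&2
        return 1
    fi
}

# Example use (not executed here):
#   check_build_prereqs &&
#       git clone https://github.com/jhu-digital-manuscripts/rosa2.git &&
#       (cd rosa2 && mvn install)
```

The functions only inspect the PATH, so sourcing this file has no side effects until `check_build_prereqs` is called.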

Once you have the JAR, you can start using it to aid in the import process. A list of commands can be found at https://github.com/jhu-digital-manuscripts/rosa2/wiki/rosa-archive-tool

  • External images must be copied to the CIS.
    • rosecollection, biblehistoriale - roheryn.mse.jhu.edu:/sam4/
    • aorcollection, pizancollection - roheryn.mse.jhu.edu:/sam2/
  • Image names must be adjusted to fit the archive naming scheme. Create a file map that maps the original file names to new file names that fit the naming scheme. This works if the original images are assumed to be in order: the front cover + pastedown first, followed by any front matter pages, then the main content pages, any end matter pages, the back pastedown + cover, and any miscellaneous images at the end. Given whether the covers and pastedowns are present and the number of pages in each of these categories, a file map can easily be generated. The files can then be renamed automatically following this map.
    <tool> file-map <collection> <book>
    <tool> rename-images <collection> <book>
  • Generate the image list. Running <tool> update-image-list <collection> <book> automatically generates an image list from the images present in the archive for the specified collection and book. Note that this command requires ImageMagick on the system on which the command is executed.
  • If cropping information is available for the images, the images can now be cropped using the rosa tool. <tool> crop-images <collection> <book> will handle this. Again, ImageMagick is required for this command to work.
  • At this point it is a good idea to record the checksum values for all of the images. The rosa tool will handle this and generate a SHA1SUM file in the book directory. <tool> update <collection> <book> will smartly update any existing SHA1SUM file with new data. The '-force' flag can be used to ignore any current SHA1SUM data and generate a completely new one.
  • Now that the images are in place with the correct names, they can be indexed by the image server. Once the process is complete, they will be accessible on the web through the image server.
  • In order to complete the import process and end up with a valid book in the archive, at a minimum, several more files will need to be copied/created. The description/metadata files and permission files must be present in the archive, one for each language supported by the collection. These files must be added to the checksum file.
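
The tool commands named in the list above can be chained for a single book roughly as follows. This is a sketch under stated assumptions, not a definitive script: `TOOL_JAR`, `DRY_RUN`, `CROP`, `run_tool`, and `import_images` are names introduced here for illustration, the commands may need additional options not shown (see the tool's wiki page), and copying images to the server and indexing them remain manual steps.

```shell
#!/bin/sh
# Sketch of the image import sequence for one book.
# Assumptions (not part of the rosa tool): TOOL_JAR points at the built JAR,
# DRY_RUN=1 prints commands instead of running them, and CROP=1 enables the
# optional cropping step.
TOOL_JAR=${TOOL_JAR:-rosa-archive/rosa-archive-tool/target/tool.jar}

run_tool() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'java -jar %s %s\n' "$TOOL_JAR" "$*"
    else
        java -jar "$TOOL_JAR" "$@"
    fi
}

import_images() {
    collection=$1
    book=$2
    # Build the file map, then rename the images to fit the naming scheme.
    run_tool file-map "$collection" "$book" || return 1
    run_tool rename-images "$collection" "$book" || return 1
    # Generate the image list (requires ImageMagick).
    run_tool update-image-list "$collection" "$book" || return 1
    # Crop only when cropping information is available.
    if [ "${CROP:-0}" = "1" ]; then
        run_tool crop-images "$collection" "$book" || return 1
    fi
    # Record checksums in the book's SHA1SUM file.
    run_tool update "$collection" "$book"
}
```

Running with `DRY_RUN=1` first, for example `DRY_RUN=1 import_images rosecollection ExampleBook` (the book name is hypothetical), prints the commands so they can be reviewed before anything in the archive is touched. Remember to rerun the `update` command after adding the description and permission files so they are included in the checksum file.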

  • Any AoR transcription files can be copied into the archive directory. The newest data should be grabbed from GitHub before being copied over to the archive (https://github.com/livesandletters/aor). Once these XML files are in place, they will likely have to be renamed, like the images, to fit the archive naming scheme. Fortunately, with the file map in place and the images already renamed, you only need to run the <tool> rename-transcriptions <collection> <book> command. Assuming no errors occur, the AoR transcriptions have been imported successfully.
  • NOTE If importing data from the GitHub repo, make sure to validate the data with this tool before trying to rename the transcription files. First, add the required spreadsheets to the base directory of the Git repo as CSV files. You can then use the check-aor command. Example: java -jar tool.jar check-aor <git_repo_directory> runs the data checks against the raw data from GitHub. If this check fails on any file, the rename command will fail on the same file. Such files must be fixed before you can do the import. Other errors related to names in the spreadsheets will not affect the file renaming.
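
The validate-then-rename order described above might be scripted as shown below. This is only a sketch: `import_aor`, `run_tool`, `TOOL_JAR`, and `DRY_RUN` are illustrative names, and it assumes that check-aor exits with a non-zero status when a check fails, which this page does not confirm. Copying the XML files into the archive directory is left out because the archive paths are not given here.

```shell
#!/bin/sh
# Sketch: validate AoR data first, and rename transcriptions only if the
# validation succeeds.
# Assumptions (not part of the rosa tool): TOOL_JAR points at the built JAR,
# DRY_RUN=1 prints commands instead of running them, and check-aor signals
# failure through a non-zero exit status.
TOOL_JAR=${TOOL_JAR:-tool.jar}

run_tool() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'java -jar %s %s\n' "$TOOL_JAR" "$*"
    else
        java -jar "$TOOL_JAR" "$@"
    fi
}

import_aor() {
    repo_dir=$1
    collection=$2
    book=$3
    run_tool check-aor "$repo_dir" || {
        echo "check-aor failed; fix the reported files before renaming" >&2
        return 1
    }
    run_tool rename-transcriptions "$collection" "$book"
}
```

For example, `import_aor /path/to/aor aorcollection ExampleBook` (path and book name are placeholders) stops before renaming if the data checks do not pass.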

With the images indexed and all required files in place, the book is now valid and ready to be used by other tools and viewers. Other files can be added to the book, including transcriptions and annotations. They must follow the naming scheme and must pass the rosa tool validator.