
Cutting an AOR data release

John Abrahams edited this page Jan 29, 2019 · 1 revision

There are several steps to cut a data release.

  • Make sure you have the most up-to-date tool.jar. You can build it from source if you want:
    • git clone https://github.com/jhu-digital-manuscripts/rosa2.git - grab the most up-to-date code
    • cd rosa2/rosa-archive && mvn package - build only the latest tool, skipping all the IIIF stuff
    • The tool should now be at rosa2/rosa-archive/rosa-archive-tool/target/tool.jar and the stats tool should be at rosa2/rosa-archive/rosa-archive-aor-tool/target/aor-stats-tool.jar
  • Make sure you have the latest transcriptions from GitHub: git clone https://github.com/livesandletters/aor.git
  • Make sure you have updated filemap.csv for each known book. These should already be in the archive
  • Make a data-release directory
  • cp -r aor/XMLschema data-release/
  • For each book:
    • mkdir data-release/books/${BOOK_NAME}
    • Get its filemap.csv, metadata.xml, and images.csv, and put them in the new directory
    • Copy transcription files from the git repo into the new directory
    • Use the archive tool to rename the transcriptions: java -jar ${TOOL} rename-transcriptions books ${BOOK_NAME} -D archive.path=${DATA_RELEASE_DIR}
  • Filemaps can be removed now
  • Generate relevant stats
    • mkdir data-release/stats
    • java -jar ${STATS_TOOL} stats ... where the ... is a space-delimited list of the renamed books
  • If you want stats over time, you can use the stats tool to generate stats for each commit to the AOR GitHub repository
    • java -jar ${STATS_TOOL} git-stats ${AOR_GITHUB_URL}
  • A few manual steps must be taken to finish the data release
    • corpus.csv, books.csv, locations.csv, people.csv must all be copied into data-release. These CSV files are simply converted from the Excel spreadsheets in the AOR Google Drive.
    • Run the BagIt utility on the data-release directory: bagit create aor-drxx data-release/*
    • Zip up the new bag: zip -r aor-drxx.zip aor-drxx/
  • Done! You now have an AOR data release ZIP file that can be hosted somewhere.
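
The per-book steps above can be sketched as a pair of shell functions. This is only an illustration under stated assumptions, not the project's own script: the function names stage_book and rename_book are hypothetical, the layout of the archive directory (one subdirectory per book holding filemap.csv, metadata.xml, and images.csv) is assumed, and so is the idea that transcriptions are *.xml files in a single directory of the aor checkout. The rename command itself is taken unchanged from the step list.

```shell
#!/bin/sh
# Sketch only. Assumed (hypothetical) layout:
#   $archive_dir/$book/   holds filemap.csv, metadata.xml, images.csv
#   $transcription_dir/   holds this book's transcription *.xml files

# Stage one book into data-release/books/<book>.
stage_book() {
    book=$1               # book name, i.e. ${BOOK_NAME}
    archive_dir=$2        # existing archive (layout is an assumption)
    transcription_dir=$3  # this book's directory in the aor checkout
    release_dir=$4        # the data-release directory

    dest="$release_dir/books/$book"
    mkdir -p "$dest" || return 1

    # The three per-book files named in the steps above.
    for f in filemap.csv metadata.xml images.csv; do
        cp "$archive_dir/$book/$f" "$dest/" || return 1
    done

    # Copy the transcriptions (assumed to be XML files).
    cp "$transcription_dir"/*.xml "$dest/" || return 1
}

# Rename the staged transcriptions with the archive tool.
# TOOL must point at tool.jar; the command is the one from the step list.
rename_book() {
    java -jar "$TOOL" rename-transcriptions books "$1" -D archive.path="$2"
}
```

With these defined, a release could loop over the known book names, calling stage_book and then rename_book for each, and delete each staged filemap.csv once renaming has finished.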