Cutting an AOR data release
John Abrahams edited this page Jan 29, 2019 · 1 revision
There are several steps to cut a data release.
- Make sure you have the most up-to-date `tool.jar`. You can build it from source if you want:
  - `git clone https://github.com/jhu-digital-manuscripts/rosa2.git` grabs the most up-to-date code.
  - `cd rosa2/rosa-archive && mvn package` builds only the latest tool, skipping all the IIIF stuff.
  - The tool should now be at `rosa2/rosa-archive/rosa-archive-tool/target/tool.jar`, and the stats tool should be at `rosa2/rosa-archive/rosa-archive-aor-tool/target/aor-stats-tool.jar`.
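The later commands refer to `${TOOL}` and `${STATS_TOOL}` without saying where they are set. One way to define them, sketched here under the assumption that the repository was cloned into a directory held in `ROSA2_DIR` (a hypothetical variable, defaulting to `rosa2`), is:

```shell
# ROSA2_DIR is a hypothetical variable: wherever rosa2 was cloned.
ROSA2_DIR="${ROSA2_DIR:-rosa2}"

# Jar locations as given in the build step above.
TOOL="${ROSA2_DIR}/rosa-archive/rosa-archive-tool/target/tool.jar"
STATS_TOOL="${ROSA2_DIR}/rosa-archive/rosa-archive-aor-tool/target/aor-stats-tool.jar"

# Report whether each jar has actually been built.
for jar in "$TOOL" "$STATS_TOOL"; do
  if [ -f "$jar" ]; then
    echo "found: $jar"
  else
    echo "missing: $jar (run 'mvn package' in ${ROSA2_DIR}/rosa-archive)"
  fi
done
```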
- Make sure you have the latest transcriptions from GitHub: `git clone https://github.com/livesandletters/aor.git`
- Make sure you have an updated `filemap.csv` for each known book. These should already be in the archive.
- Make a `data-release` directory, then copy the schema directory into it: `cp -r aor/XMLschema data-release/`
- For each book:
  - `mkdir -p data-release/books/${BOOK_NAME}`
  - Get its `filemap.csv`, `metadata.xml`, and `images.csv`, and put them in the new directory.
  - Copy the transcription files from the git repo into the new directory.
  - Use the archive tool to rename the transcriptions: `java -jar ${TOOL} rename-transcriptions books ${BOOK_NAME} -D archive.path=${DATE_RELEASE_DIR}`
  - The filemaps can be removed now.
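Because the rename step depends on the three per-book files being in place, it may help to check a book directory before running the archive tool. The following is a minimal sketch; `check_book_dir` is a hypothetical helper and not part of the archive tool:

```shell
# Hypothetical helper: report which of the three required files are
# missing from one book directory.
# Returns 0 when all are present, 1 otherwise.
check_book_dir() {
  book_dir="$1"
  status=0
  for f in filemap.csv metadata.xml images.csv; do
    if [ ! -f "${book_dir}/${f}" ]; then
      echo "missing: ${book_dir}/${f}"
      status=1
    fi
  done
  return "$status"
}
```

It could be called as `check_book_dir "data-release/books/${BOOK_NAME}"` just before the rename command, so that a missing file is reported before the tool runs.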
- Generate relevant stats:
  - `mkdir data-release/stats`
  - `java -jar ${STATS_TOOL} stats ...`, where the `...` is a space-delimited list of the renamed books.
  - If you want stats over time, you can use the stats tool to generate stats for each commit to the AOR GitHub repository: `java -jar ${STATS_TOOL} git-stats ${AOR_GITHUB_URL}`
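The space-delimited list of books can be typed by hand, but it could also be derived from the release directory. This sketch assumes that the names the stats tool expects match the directory names under `data-release/books`, which the steps above do not confirm; `list_books` is a hypothetical helper:

```shell
# Hypothetical helper: print the names of the book directories under
# the given directory, separated by single spaces.
list_books() {
  books_dir="$1"
  names=""
  for d in "$books_dir"/*/; do
    # Skip the unexpanded pattern when there are no subdirectories.
    [ -d "$d" ] || continue
    names="${names:+$names }$(basename "$d")"
  done
  echo "$names"
}
```

Under that assumption, the stats command might then be run as `java -jar ${STATS_TOOL} stats $(list_books data-release/books)`.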
- A few manual steps must be taken to finish the data release:
  - `corpus.csv`, `books.csv`, `locations.csv`, and `people.csv` must all be copied into `data-release`. These CSV files are simply converted from the Excel spreadsheets in the AOR Google Drive.
  - Run the BagIt utility on the `data-release` directory: `bagit create aor-drxx data-release/*`
  - Zip up the new bag: `zip -r aor-drxx.zip aor-drxx/`
- Done! You now have an AOR data release ZIP file that can be hosted somewhere.