
Updating AOR data in the archive

jabrah edited this page Aug 25, 2016 · 3 revisions

Note: These are the steps that I follow when updating AOR data. This is certainly not the only way, and probably not the best way, of doing this.

This assumes that the archive has already been set up and the AOR images have already been imported. Transcription data changes fairly rapidly, as the scholars add new transcriptions or correct old ones. Updating the data requires the AOR archive command line tool and access to the archive. You will need Git to grab the latest version of the transcription data and, of course, Java to run the AOR archive tool.

## Getting the archive tool

If you do not already have the AOR archive tool, you can download the source from Git. From source, you will have to build the tool using Maven.

  1. Grab source code:
    git clone https://github.com/jhu-digital-manuscripts/rosa2.git

  2. Navigate to the rosa-archive directory and build the artifacts with Maven. The other modules are not required for the command line archive tool:
    mvn package

  Once the Maven build is complete, the archive tool will be located at rosa-archive-tool/target/tool.jar
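The commands further down this page write `<tool>` as shorthand for the full `java -jar` invocation. As a convenience, a small shell wrapper can stand in for it. This is only a sketch: the function name `aor_tool` and the variable `AOR_TOOL_JAR` are hypothetical, and the default path assumes rosa2 was cloned under `$HOME/workspace`, as in the examples on this page.

```shell
# Path to the built jar. The default is an assumption; adjust it to
# wherever you cloned and built rosa2.
AOR_TOOL_JAR="${AOR_TOOL_JAR:-$HOME/workspace/rosa2/rosa-archive/rosa-archive-tool/target/tool.jar}"

# aor_tool <command> [args...]
# Runs the archive tool jar, passing all arguments through unchanged.
aor_tool() {
    java -jar "$AOR_TOOL_JAR" "$@"
}
```

With this in place, `aor_tool check-aor /home/user/workspace/aor` would be equivalent to typing the full `java -jar` line.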

## Updating the archive

### Grab and check the data

Now that you have the tool ready, you need the latest data to work with. The AOR data is kept in a GitHub repository, so Git is needed to get it. First clone the repository with `git clone https://github.com/livesandletters/aor.git`, or, if you already have the repository, pull the latest changes with `git pull`.

Download the three spreadsheets (Books.xlsx, Locations.xlsx, and People.xlsx) from the AOR Google Drive and put them in the same directory as the AOR Git repository. These spreadsheets must be converted to CSV files, and the file names must be all lower case (for example, books.csv).

At this point, you need to run a check of the data, which looks for XML errors and data inconsistencies in the transcriptions. Using the AOR archive tool, run the command:

    <tool> check-aor <aor-git-repo-dir>

Example:

    java -jar /home/user/workspace/rosa2/rosa-archive/rosa-archive-tool/target/tool.jar check-aor /home/user/workspace/aor

Any errors caused by malformed XML must be fixed, as those files cannot be parsed. Inconsistencies between the transcriptions and spreadsheets do not have to be fixed at this stage.
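The lower-case naming requirement for the CSV files is easy to miss when exporting by hand. As a minimal sketch, the following renames any CSV files in a directory to all-lower-case names before you run (or re-run) the check. The function name `lowercase_csv_names` is hypothetical, and it assumes the spreadsheets have already been exported to CSV by your spreadsheet application.

```shell
# lowercase_csv_names <dir>
# Renames every CSV file directly inside <dir> to an all-lower-case name,
# e.g. Books.csv -> books.csv. Files that are already lower case are untouched.
lowercase_csv_names() {
    local dir="$1" path name lower
    for path in "$dir"/*.[cC][sS][vV]; do
        [ -e "$path" ] || continue    # the glob matched nothing
        name=$(basename "$path")
        lower=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
        if [ "$name" != "$lower" ]; then
            mv "$path" "$dir/$lower"
        fi
    done
}

# Example (path taken from the examples on this page):
#   lowercase_csv_names /home/user/workspace/aor
```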

### Set up the staging area

In a temporary directory (I usually use the name staging), create a new directory for each book, named with the book's archive name.

    mkdir /tmp/staging && cd /tmp/staging
    mkdir PrincetonRB16th11 PrincetonRB16th12 PrincetonRB16th99a UclCastiglione1541 Newberry27495 FolgersHa2 PrincetonK6233 HoughtonSTC11402 PrincetonPA6452 PrincetonPA8550 PrincetonPE1137 PrincetonU101 PrincetonDL45

Copy the transcription data from the AOR Git repository into the matching temporary book directory, as stated in the corpus.csv spreadsheet in the AOR Google Drive. You also need to copy each book's filemap file from the archive in order to properly rename the transcription files.

| Git name | Archive name |
| --- | --- |
| Buchanan_MariaScotorumRegina | PrincetonRB16th11 |
| Bucnahan_AneDetectioun | PrincetonRB16th12 |
| Tusser | PrincetonRB16th99a |
| Castiglione | UclCastiglione1541 |
| Castiglione_Newberry | Newberry27495 |
| Domenichi | FolgersHa2 |
| Freigius | PrincetonK6233 |
| Frontinus | HoughtonSTC11402 |
| Livy | PrincetonPA6452 |
| Melancthon | PrincetonPA8550 |
| Smith | PrincetonPE1137 |
| Machiavelli | PrincetonU101 |
| Olaus | PrincetonDL45 |
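The copy step for the transcriptions can be scripted from the name mapping above. The following is only a sketch and not part of the AOR tooling: the function name `stage_transcriptions` is hypothetical, and it assumes that each book's transcriptions are XML files sitting directly inside a directory named after the Git name in your clone of the AOR repository. The filemap files still have to be copied from the archive separately.

```shell
# stage_transcriptions <aor-repo-dir> <staging-dir>
# Reads "git-name archive-name" pairs from standard input, one pair per line,
# and copies <aor-repo-dir>/<git-name>/*.xml into <staging-dir>/<archive-name>/.
# Assumption: the transcriptions are *.xml files directly under each Git directory.
stage_transcriptions() {
    local repo="$1" staging="$2" git_name archive_name
    while read -r git_name archive_name; do
        [ -z "$git_name" ] && continue    # skip blank lines
        if [ ! -d "$repo/$git_name" ]; then
            echo "missing source directory: $repo/$git_name" >&2
            continue
        fi
        mkdir -p "$staging/$archive_name"
        cp "$repo/$git_name"/*.xml "$staging/$archive_name"/
    done
}

# Example, using two rows of the mapping:
#   stage_transcriptions /home/user/workspace/aor /tmp/staging <<'EOF'
#   Tusser PrincetonRB16th99a
#   Livy PrincetonPA6452
#   EOF
```

Feeding the pairs on standard input keeps the mapping in one place, so it can also be kept in a plain text file and redirected in.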

Now that all of the data is copied into appropriately named directories along with each file map, you can rename the transcriptions.

### Rename transcriptions

Using the AOR archive tool, you can rename all of the transcriptions in one go. Using the archive tool outside of the archive is a bit of a hack, but you can run the rename command in your staging directory:

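    <tool> rename-transcriptions staging -D archive.path=<staging-dir>

Example, assuming the staging directory is '/tmp/staging'. Note that here archive.path is set to the parent directory /tmp, and the directory name staging is passed as the argument:

    java -jar /home/user/workspace/rosa2/rosa-archive/rosa-archive-tool/target/tool.jar rename-transcriptions staging -D archive.path=/tmp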

### Transferring new data to the archive

This assumes that previous operations did not throw any errors.

Now that the data is prepared for the archive, everything can be transferred over. I usually use `rsync -c` for the transfer, assuming the utility is available. This lets me transfer all the transcriptions in one go, regardless of whether the archive is mounted locally or accessed remotely. The `-c` option makes rsync compare the checksums of the source and target files and copy a file only when the checksums differ, so only new and updated files are transferred.
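A possible form of the transfer is sketched below. The function name `sync_to_archive` is hypothetical, and the target directory inside the archive depends on your archive layout, so it is left as a placeholder in the usage comment. The rsync options used (`-r`, `-c`, `-v`, `-n`) are standard.

```shell
# sync_to_archive <staging-dir> <archive-target-dir> [extra rsync options...]
# Copies the contents of the staging directory into the target directory.
sync_to_archive() {
    local src="$1" dest="$2"
    shift 2
    # -r: recurse into the book directories
    # -c: decide what to copy by comparing checksums, not size and modification time
    # -v: list each file that is transferred
    # The trailing slash on the source copies its contents, not the directory itself.
    rsync -rcv "$@" "$src"/ "$dest"/
}

# Preview first with a dry run (-n), then run it for real:
#   sync_to_archive /tmp/staging <archive-target-dir> -n
#   sync_to_archive /tmp/staging <archive-target-dir>
```

The target can be a local path or a remote `user@host:path` location, since rsync handles both the same way.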

### Update archive checksums

Once all files have been transferred to the archive, the archive checksum records must be updated. This can be done with the AOR archive tool, using the update command.