John Abrahams edited this page Feb 19, 2018 · 28 revisions

AoR Statistics tool

Required:

  • Java 7 or newer

Usage:

Using this Java utility should be essentially the same on Windows, Mac, and Linux. It is a command line tool, so it must be run from a command prompt or terminal.

  1. Download the JAR file, if not already done. On newer
  2. Open the Command Prompt on Windows, or a terminal on Unix-based systems.
  3. Navigate to the directory where you want the output files written. On both Windows and Unix-based systems, you can use the 'cd' command to change directories.
  4. Run the JAR file using the java -jar command.
    java -jar <path_to_tool> <command> [options] <args>

Example:
    java -jar ~/Downloads/aor-stats-tool.jar git-stats https://github.com/livesandletters/aor.git
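
If the tool needs to be run from a script rather than typed by hand, the same command line can be assembled programmatically. The following is a minimal sketch in Python, not part of the tool itself; the function names build_command and run_tool are hypothetical. It assumes that java is on the PATH and relies on the behavior described above, namely that output files are written to the directory the command is run from.

```python
import os
import subprocess


def build_command(jar_path, command, *args):
    """Build the argument list for: java -jar <jar_path> <command> <args...>."""
    # subprocess does not expand "~" the way a shell does, so expand it here
    # for the JAR path and for any path-like arguments.
    expanded_args = [os.path.expanduser(a) for a in args]
    return ["java", "-jar", os.path.expanduser(jar_path), command, *expanded_args]


def run_tool(jar_path, command, *args, output_dir="."):
    """Run the tool with output_dir as the working directory (hypothetical helper)."""
    # check=True raises CalledProcessError if the tool exits with an error code.
    return subprocess.run(build_command(jar_path, command, *args),
                          cwd=output_dir, check=True)
```

For example, run_tool("~/Downloads/aor-stats-tool.jar", "git-stats", "https://github.com/livesandletters/aor.git") would correspond to the example command above.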

Valid commands documented on this page: stats, annotation-stats, validate, git-stats.

NOTE: If the command line tool ever seems to be stuck, you can quit early with the keyboard command CTRL + C, which forces the process to stop. When this happens, the tool is unable to clean up any temporary files and folders that it may be using. Subsequent runs of the tool will ignore these temporary files and folders, so they must be cleaned up manually if space is an issue. The names of the temporary folders used by this command line tool all start with git_tmp followed by some random digits.

On Windows, these temporary files and folders should be in C:\Users\<user_name>\AppData\Local\Temp\. Example temp folder: C:\Users\username\AppData\Local\Temp\git_tmp787764397816831906.

On Unix-based systems (Mac and Linux), these temporary files and folders will be in /tmp. Example temp folder: /tmp/git_tmp787764397816831906.

As long as the tool is not running, all of these folders can be safely deleted by hand.
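
Leftover folders can also be located with a short script. The following is a sketch only, not part of the tool; the function name find_leftover_temp_dirs is hypothetical. It lists the directories whose names start with git_tmp directly under a given temp directory. If no directory is given it falls back to Python's own idea of the system temp directory, which may differ from the locations listed above, so passing the path explicitly is safer.

```python
import tempfile
from pathlib import Path


def find_leftover_temp_dirs(temp_root=None):
    """Return directories directly under temp_root whose names start with 'git_tmp'."""
    # Fall back to Python's temp directory; this is an assumption and may not
    # match the location the Java tool actually used.
    root = Path(temp_root) if temp_root is not None else Path(tempfile.gettempdir())
    # Only direct children are inspected; regular files are skipped.
    return sorted(p for p in root.glob("git_tmp*") if p.is_dir())
```

Review the returned list before deleting anything, and delete only while the tool is not running (for example with shutil.rmtree on each path).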


Collect Newest Stats

java -jar <path_to_tool.jar> stats <path_to_book1> <path_to_book2> ...

Example:
java -jar ~/Downloads/aor-stats-tool.jar stats ~/Documents/aor/book1 ~/Documents/aor/book2 ~/Documents/aor/book3

Given a set of directories containing AoR XML transcription files, the tool reports statistics about the transcriptions. Each directory is assumed to correspond to a book, each XML file in a book directory is assumed to correspond to a page, and the XML files sorted alphabetically by name are assumed to be in the page order of the book.
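
The page-ordering assumption can be checked before running the tool. The sketch below is not the tool's own code; the function name pages_in_order is hypothetical, and it assumes a plain character-by-character comparison of file names, since the exact sort used by the tool is not stated here.

```python
from pathlib import Path


def pages_in_order(book_dir):
    """Return the XML files directly inside book_dir, sorted by file name."""
    book = Path(book_dir)
    # Keep only regular files with an .xml extension (case-insensitive).
    xml_files = [p for p in book.iterdir() if p.is_file() and p.suffix.lower() == ".xml"]
    # Plain string sort on the file name; this is the assumed page order.
    return sorted(xml_files, key=lambda p: p.name)
```

Note that with an alphabetical sort a file named page10.xml comes before page2.xml, so page numbers in file names need to be zero-padded for the order to match the book.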

Generated Spreadsheets

The spreadsheets are tables written in CSV (comma-separated values) format, which can be read by any spreadsheet program. They are also simple enough to read in a text editor. Each row has values separated by commas, and a row ends in a newline.

book_totals.csv

book, marginalia, marginalia_words, underlines, underline_words, marks, mark_words, symbols, symbol_words, drawings, numerals, books, people, locations, TOTAL, TOTAL_WORDS, words/anno, words/marg, words/underline

The books table has columns with the specified headers in the following order:

  • book - Identifier for the book (directory name)
  • marginalia - Number of marginalia
  • marginalia_words - Number of words written in marginalia
  • underlines - Number of underlines. Includes underlines in printed text and marginalia.
  • underline_words - Number of words underlined. Includes both underlines in printed text and underlined words in marginalia.
  • marks - Number of marks
  • mark_words - Number of words marked. Not all marks can be associated with words.
  • symbols - Number of symbols
  • symbol_words - Number of symbol words. Not all symbols can be associated with words.
  • drawings - Number of drawings
  • numerals - Number of numerals
  • books - Number of books mentioned in marginalia
  • people - Number of people mentioned in marginalia
  • locations - Number of locations mentioned in marginalia
  • TOTAL - total number of annotations.
  • TOTAL_WORDS - total number of words associated with annotations. Simple sum of word counts for each annotation type.
  • words/anno - mean words per annotation across all annotation types
  • words/marg - mean words per marginalia
  • words/underline - mean words per underline

When counting words, each sequence of characters separated by white space is considered a word.
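
The word-counting rule and the TOTAL_WORDS sum described above can be sketched as follows. This is an illustration, not the tool's implementation; the function names are hypothetical, and the row is assumed to be a mapping keyed by the column headers of book_totals.csv.

```python
def count_words(text):
    """Count words as sequences of characters separated by white space."""
    # str.split() with no arguments splits on any run of white space and
    # ignores leading and trailing white space.
    return len(text.split())


def total_words(row):
    """Sum the per-type word counts of one book_totals.csv row (assumed keys)."""
    word_columns = ["marginalia_words", "underline_words", "mark_words", "symbol_words"]
    # CSV values are read as strings, so convert each to an integer first.
    return sum(int(row[column]) for column in word_columns)
```

Under this rule punctuation attached to a word does not create a new word, so "Nota bene:" counts as two words.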

<book_title>.csv

The name of the file is the directory name holding the book. The following spreadsheets are included:

* Buchanan_AneDetectioun.csv
* Buchanan_MariaScotorumRegina.csv
* Castiglione.csv
* Castiglione_Newberry.csv
* Domenichi.csv
* Freigius.csv
* Frontinus.csv
* Livy.csv
* Machiavelli.csv
* Melancthon.csv
* Olaus.csv
* Smith.csv

page, id, marginalia, marginalia_words, underlines, underline_words, marks, mark_words, symbols, symbol_words, drawings, numerals, books, people, locations

For descriptions of most of the headers, see book_totals.csv.

  • page - non-standard page name. Follows file names of page image/transcription.
  • id - page number in the book. This can sometimes be the same as page.

Vocabulary spreadsheets

This mode of the tool will also generate several vocabulary spreadsheets: one CSV file for each language in each annotation type. All files follow the same naming convention:

    vocab_<annotation_type>_<language_code>.csv

Files:
 * vocab_marginalia_EL.csv
 * vocab_marginalia_EN.csv
 * vocab_marginalia_ES.csv
 * vocab_marginalia_FR.csv
 * vocab_marginalia_IT.csv
 * vocab_marginalia_LA.csv
 * vocab_marks_EL.csv
 * vocab_marks_EN.csv
 * vocab_marks_IT.csv
 * vocab_marks_LA.csv
 * vocab_symbols_EL.csv
 * vocab_symbols_EN.csv
 * vocab_symbols_IT.csv
 * vocab_symbols_LA.csv
 * vocab_underlines_EL.csv
 * vocab_underlines_EN.csv
 * vocab_underlines_IT.csv
 * vocab_underlines_LA.csv

Annotation type can be: marginalia, marks, symbols, or underlines. Language code is a two-letter code identifying a language. See ISO 639-1 for a table of these language codes.
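
When processing many vocabulary files at once, the annotation type and language code can be recovered from the file name. A minimal sketch, assuming names follow the convention above exactly, with a lowercase annotation type and an uppercase two-letter language code as in the listed files; the function name parse_vocab_filename is hypothetical.

```python
import re

# vocab_<annotation_type>_<language_code>.csv, as in the file list above.
VOCAB_NAME = re.compile(r"^vocab_(?P<annotation_type>[a-z]+)_(?P<language_code>[A-Z]{2})\.csv$")


def parse_vocab_filename(name):
    """Return (annotation_type, language_code), or None if the name does not match."""
    match = VOCAB_NAME.match(name)
    if match is None:
        return None
    return match.group("annotation_type"), match.group("language_code")
```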

All spreadsheets contain a list of words, each associated with the number of times that word was used in the AoR corpus.

word, count
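
A vocabulary spreadsheet can be loaded for further analysis with a few lines of Python. This is a sketch under the assumption that each file starts with a header row naming the columns word and count; the function name load_vocab is hypothetical.

```python
import csv
from collections import Counter


def load_vocab(lines):
    """Read word/count rows from an open CSV file (or any iterable of lines)."""
    counts = Counter()
    for row in csv.DictReader(lines):
        # Add rather than assign, so a word listed twice is still totalled.
        counts[row["word"]] += int(row["count"])
    return counts
```

For example, opening vocab_marginalia_LA.csv with open(path, newline="", encoding="utf-8") and passing the file object to load_vocab gives a Counter whose most_common(20) method returns the twenty most frequent words. The UTF-8 encoding is an assumption.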

Annotation Stats

java -jar <path_to_tool.jar> annotation-stats <book1_dir> <book2_dir> ...

type, name, languages, locations, method, signature, image_id, image_index, book_id, marginalia_books, marginalia_people, marginalia_places, text, translation

Annotation spreadsheet

This spreadsheet contains all annotations that have been transcribed across all books in the AoR corpus.

  • type - type of annotation, can be: marginalia, mark, underline, symbol, drawing, errata, numeral.
  • name - some annotation types can take a value only from a known list of values. Relevant annotations will have only ONE name.
  • languages - languages that appear in this annotation, whether from text in the annotation or text that has been referenced in the printed book. An annotation can contain more than one language.
  • locations - where on the page does this annotation appear? Currently this will be only a general region of a page and not exact coordinates. An annotation can have multiple locations, for example, a marginalia can wrap around a page, starting in the page head and spilling over into the page right margin.
  • method - method used to mark the page: chalk, pen, etc.
  • signature - a standard bibliographic identifier for the page. For more information, refer to the Transcribers Manual, Section 1.1.
  • image_id - exact filename of the image of the page on which this annotation exists. This follows the naming scheme of images in the AoR archive (Archive File Naming).
  • image_index - index of the page in the book.
  • book_id - name of the book in the archive
  • marginalia_books - the standardized name(s) of any books referenced in a marginalia annotation. Zero or more books can be listed here.*
  • marginalia_people - the standardized name(s) of any people referenced in a marginalia annotation. Zero or more people can be listed here.*
  • marginalia_places - the standardized name(s) of any locations referenced in a marginalia annotation. Zero or more locations can be listed.*
  • text - any text in the printed book that is referenced by an annotation. For marginalia, the text of the marginal note is also included.
  • translation - for marginalia, the translation of the text of the marginal note.

*Standardized names may not appear in an annotation exactly as listed; the annotation might instead use a variation of the standardized name.
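
Because every row of the annotation spreadsheet is a single annotation, simple tallies can be computed directly from it. The following sketch counts annotations per book and type; it assumes the file has a header row with the column names listed above, and the function name count_by_book_and_type is hypothetical.

```python
import csv
from collections import Counter


def count_by_book_and_type(lines):
    """Count annotations for each (book_id, type) pair in the annotation spreadsheet."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[(row["book_id"], row["type"])] += 1
    return counts
```

Pass an open file object for the annotation spreadsheet, opened with newline="" as the Python csv module recommends.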


Validation

java -jar <path_to_tool.jar> validate -r <aor_directory_github_clone>

The tool can also validate all XML files in a given directory. By default, only the XML files directly in the given directory are checked; the -r flag tells the tool to also check XML files in all subdirectories. The tool lists all of the directories checked and outputs all errors found.
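
For a quick check without the tool, the directory walk can be approximated in Python. This sketch only tests whether each file is well-formed XML; it does not reproduce whatever additional checks the tool's validate command performs. The function name find_malformed_xml and its recursive parameter (which mirrors the -r flag) are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_malformed_xml(directory, recursive=False):
    """Return (path, message) pairs for XML files that fail to parse."""
    root = Path(directory)
    # rglob descends into subdirectories; glob looks at direct children only.
    candidates = root.rglob("*.xml") if recursive else root.glob("*.xml")
    errors = []
    for path in sorted(candidates):
        try:
            ET.parse(str(path))
        except ET.ParseError as exc:
            # Well-formedness errors only; no schema or content checks are done.
            errors.append((path, str(exc)))
    return errors
```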


Git Statistics

java -jar <path_to_tool.jar> git-stats <Github Repo URL>

This command will temporarily clone the GitHub repository and gather statistics for every commit. Output is in the form of two CSV files, books.csv and commits.csv. The output CSV files are placed in the directory from which you run the command.

The time this command takes to finish depends on the computer it runs on and on the number of commits in the repository at the time. Because it iterates through all commits in the repository, the run time will increase slowly as commits are added, likely by no more than a second per added commit.

Note: this command will take a while to complete, possibly 15-20 minutes. As long as you are seeing new output on the command line, it is still working.

commits.csv

This spreadsheet lists all commits made to the AoR GitHub repository. It gives a brief overview of the changes made to the data: commit IDs that link this metadata to more detailed records of changes, the authorship of each change, and counts of the files changed. Parent IDs are given to identify branching behavior in the repository.

commit id, parent id(s), date, author name, author email, commit message, added, modified, deleted, renamed, copied, unreadable
  • commit id - unique ID of the commit
  • parent id(s) - a comma-separated list of all parent commits. In the case of branch merges, there is more than one parent.
  • date - date the commit was made
  • author name
  • author email
  • commit message - message left by the author describing the commit
  • added - number of files added in this commit across all books
  • modified - number of files modified (those files that had existed previously that have been changed)
  • deleted - number of files deleted in this commit
  • renamed - number of files renamed in this commit. Note, moving a file from one directory to another will count as a rename. Ex: /base/file1 -> /base/dir/file1
  • copied - number of files copied in this commit
  • unreadable - number of unreadable files in this commit across all books. These unreadable files necessarily do not contribute to the annotation or word counts because they could not be read or parsed by this tool. This usually occurs because of invalid XML.
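
As an illustration of how the parent id(s) column can be used, the sketch below finds merge commits, that is, commits with more than one parent. It assumes commits.csv has a header row with the column names listed above and that the comma-separated parent list is stored in one quoted CSV field; the function name merge_commits is hypothetical.

```python
import csv


def merge_commits(lines):
    """Return (commit id, [parent ids]) for every commit with more than one parent."""
    merges = []
    for row in csv.DictReader(lines):
        # Split the parent list on commas and drop empty entries,
        # which occur when a commit has no parent.
        parents = [p.strip() for p in row["parent id(s)"].split(",") if p.strip()]
        if len(parents) > 1:
            merges.append((row["commit id"], parents))
    return merges
```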

books.csv

This large spreadsheet contains more detailed information about the history of the data. For each commit found in commits.csv, this spreadsheet contains the numbers of annotations and annotated/transcribed words for each book. For example, the first few entries will be for the most recent commit found in the GitHub repository, so they share the same commit ID. These first lines show the annotation counts and modification information for each book and can be linked to a particular time stamp by referencing commits.csv.

commit id, book, total annotations, total annotated words, marginalia, marginalia words, underlines, underlined words, marks, marked words, symbols, symbol words, drawings, numerals, books, people, locations, added, modified, deleted, renamed, copied, unreadable
  • commit id - links to a single commit in commits.csv
  • book - book id
  • total annotations - total number of annotations in a book, simple sum of marginalia + underlines + marks + symbols + drawings + numerals
  • total annotated words - total number of words associated with annotations. Simple sum of marginalia words + underlined words + marked words + symbol words
  • marginalia, underlines, marks, symbols, drawings, numerals - simple counts of the number of each type of annotation within a book
  • books - number of books referenced in the marginalia
  • people - number of people referenced in the marginalia
  • locations - number of locations referenced in the marginalia
  • added - number of files added to this book in this commit
  • modified - number of files modified in this book in this commit
  • deleted - number of files deleted in this book in this commit
  • renamed - number of files renamed in this book in this commit
  • copied - number of files copied to this book in this commit
  • unreadable - number of files in this book in this commit that were unreadable. These files are invalid XML, so cannot be read/parsed by this tool. These files are not counted and do not contribute to the annotation or word counts.
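
The two spreadsheets can be joined on commit id to follow one book over time. The following is a minimal sketch, assuming both files have header rows with the column names listed above; the function name book_history is hypothetical. The date is kept as the string found in commits.csv because its format is not specified here, and rows are returned in the order they appear in books.csv.

```python
import csv


def book_history(commit_lines, book_lines, book):
    """Return (date, total annotations) pairs for one book, in books.csv order."""
    # Map each commit id to its date using commits.csv.
    dates = {row["commit id"]: row["date"] for row in csv.DictReader(commit_lines)}
    history = []
    for row in csv.DictReader(book_lines):
        if row["book"] == book:
            # dates.get returns None if a commit id is missing from commits.csv.
            history.append((dates.get(row["commit id"]), int(row["total annotations"])))
    return history
```

Pass two open file objects, one for commits.csv and one for books.csv, plus a book id such as "Livy".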