AoR Statistics

AoR Statistics tool

Required:

Java 7 or newer

Usage:

Using this Java utility should be essentially the same on Windows, Mac, and Linux. Because it is a command line tool, it must be run from the command line.

Download the JAR file, if not already done. On newer
Open the Command Prompt if using Windows, or a terminal on Unix-based systems.
Navigate to the location where you want the output files. On both Windows and Unix systems, you can use the 'cd' command to change directories.
Run the executable JAR using the java -jar command.

    java -jar <path_to_tool> <command> [options] <args>

Example:
    java -jar ~/Downloads/aor-stats-tool.jar git-stats https://github.com/livesandletters/aor.git

Valid commands:

NOTE: If at any time the command line tool seems to be stuck, you can quit early with the keyboard command CTRL + C. This forces the process to stop. When this happens, the tool is unable to clean up any temporary files and folders that it may be using. Subsequent runs of the tool will ignore these temp files and folders, so they must be cleaned up manually if space is an issue. The names of the temporary folders used by this command line tool all start with git_tmp followed by some random numbers.

In Windows, these temporary files and folders should be in C:\Users\<user_name>\AppData\Local\Temp\. Example temp folder: C:\Users\username\AppData\Local\Temp\git_tmp787764397816831906.

On Unix-based systems (Mac and Linux), these temporary files and folders will be in /tmp. Example temp folder: /tmp/git_tmp787764397816831906.

As long as the tool is not running, all of these folders can be safely deleted by hand.
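
The manual cleanup described above can also be scripted. The following is a minimal Python sketch, not part of the tool itself. The function names are hypothetical, and it assumes that leftover folders are named git_tmp followed only by digits, as in the examples above. The temp directory is passed in explicitly, so use C:\Users\<user_name>\AppData\Local\Temp or /tmp as appropriate.

```python
import re
import shutil
from pathlib import Path

# Assumption: leftover folders are named "git_tmp" followed only by digits,
# e.g. git_tmp787764397816831906.
GIT_TMP_NAME = re.compile(r"git_tmp\d+")


def find_leftover_dirs(temp_dir):
    """Return leftover git_tmp folders found directly inside temp_dir."""
    return sorted(
        path for path in Path(temp_dir).iterdir()
        if path.is_dir() and GIT_TMP_NAME.fullmatch(path.name)
    )


def remove_leftover_dirs(temp_dir):
    """Delete the leftover folders and return the names that were removed."""
    removed = []
    for path in find_leftover_dirs(temp_dir):
        shutil.rmtree(path)
        removed.append(path.name)
    return removed
```

Only run remove_leftover_dirs while the tool is not running. Calling find_leftover_dirs first shows what would be deleted.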

Collect Newest Stats

    java -jar <path_to_tool.jar> stats <path_to_book1> <path_to_book2> ...

Example:
    java -jar ~/Downloads/aor-stats-tool.jar stats ~/Documents/aor/book1 ~/Documents/aor/book2 ~/Documents/aor/book3

Given a set of directories containing AoR XML transcription files, the tool reports statistics about the transcriptions. Each directory is assumed to correspond to a book, each XML file in a book directory is assumed to correspond to a page, and the XML files, sorted alphabetically by name, are assumed to be in the page order of the book.
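
To see which page order these assumptions imply for a book directory, the file names can be listed in sorted order. This is a small Python sketch with a hypothetical function name. It uses Python's default string sort, which may differ from the tool's own ordering in edge cases such as mixed upper and lower case names.

```python
from pathlib import Path


def pages_in_order(book_dir):
    """Return the XML file names in one book directory, sorted by name."""
    xml_names = [
        path.name for path in Path(book_dir).iterdir()
        if path.is_file() and path.suffix.lower() == ".xml"
    ]
    return sorted(xml_names)
```

Note that a plain alphabetical sort places page10.xml before page2.xml, so page numbers in file names need zero padding for this ordering to match the book.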

Generated Spreadsheets

The spreadsheets are simply tables written in CSV (comma-separated values) format, which can be read by any spreadsheet program. They are also simple enough to read in a text editor. The values in each row are separated by commas, and each row ends in a newline.

book_totals.csv

book	marginalia	marginalia_words	underlines	underline_words	marks	mark_words	symbols	symbol_words	drawings	numerals	books	people	locations		TOTAL	TOTAL_WORDS	words/anno	words/marg	words/underline

The books table has columns with the specified headers in the following order:

book - Identifier for the book (directory name)
marginalia - Number of marginalia
marginalia_words - Number of words written in marginalia
underlines - Number of underlines. Includes underlines in printed text and marginalia.
underline_words - Number of words underlined. Includes both underlines in printed text and underlined words in marginalia.
marks - Number of marks
mark_words - Number of words marked. Not all marks can be associated with words.
symbols - Number of symbols
symbol_words - Number of symbol words. Not all symbols can be associated with words.
drawings - Number of drawings
numerals - Number of numerals
books - Number of books mentioned in marginalia
people - Number of people mentioned in marginalia
locations - Number of locations mentioned in marginalia
TOTAL - total number of annotations.
TOTAL_WORDS - total number of words associated with annotations. Simple sum of word counts for each annotation type.
words/anno - mean words per annotation across all annotation types
words/marg - mean words per marginalia
words/underline - mean words per underline

When counting words, each sequence of characters separated by white space is considered a word.
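
The counting rule above can be illustrated in a few lines of Python. This is only a sketch of the stated rule, not the tool's code, and the function name is hypothetical.

```python
def count_words(text):
    """Count sequences of characters separated by white space."""
    # str.split() with no arguments splits on any run of white space
    # and ignores leading and trailing white space.
    return len(text.split())
```

Under this rule punctuation stays attached to its word, so "Nota bene: vide infra" counts as four words.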

<book_title>.csv

The name of the file is the directory name holding the book. The following spreadsheets are included:

* Buchanan_AneDetectioun.csv
* Buchanan_MariaScotorumRegina.csv
* Castiglione.csv
* Castiglione_Newberry.csv
* Domenichi.csv
* Freigius.csv
* Frontinus.csv
* Livy.csv
* Machiavelli.csv
* Melancthon.csv
* Olaus.csv
* Smith.csv

page	id	marginalia	marginalia_words	underlines	underline_words	marks	mark_words	symbols	symbol_words	drawings	numerals	books	people	locations

For descriptions of most of the headers, see book_totals.csv.

page - non-standard page name. Follows file names of page image/transcription.
id - page number in the book. This can sometimes be the same as page.

Vocabulary spreadsheets

This mode of the tool will also generate several vocabulary spreadsheets: one CSV file for each language in each annotation type. All files follow the same naming convention:

    vocab_<annotation_type>_<language_code>.csv

Files:
 * vocab_marginalia_EL.csv
 * vocab_marginalia_EN.csv
 * vocab_marginalia_ES.csv
 * vocab_marginalia_FR.csv
 * vocab_marginalia_IT.csv
 * vocab_marginalia_LA.csv
 * vocab_marks_EL.csv
 * vocab_marks_EN.csv
 * vocab_marks_IT.csv
 * vocab_marks_LA.csv
 * vocab_symbols_EL.csv
 * vocab_symbols_EN.csv
 * vocab_symbols_IT.csv
 * vocab_symbols_LA.csv
 * vocab_underlines_EL.csv
 * vocab_underlines_EN.csv
 * vocab_underlines_IT.csv
 * vocab_underlines_LA.csv

Annotation type can be: marginalia, underlines, marks, or symbols, as written in the file names above. Language code is a two-letter designation for a language. See ISO 639-1 for a table containing these language codes.
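
When processing many of these files, the annotation type and language code can be read back from the file name. The sketch below is a hypothetical Python helper. It assumes the annotation type is written in lower case letters and the language code in two upper case letters, as in the file list above.

```python
import re

# Assumed shape: vocab_<annotation_type>_<language_code>.csv
VOCAB_NAME = re.compile(r"vocab_([a-z]+)_([A-Z]{2})\.csv")


def parse_vocab_filename(name):
    """Return (annotation_type, language_code), or None if the name does not match."""
    match = VOCAB_NAME.fullmatch(name)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

For example, parse_vocab_filename("vocab_marginalia_LA.csv") gives ("marginalia", "LA"), while other spreadsheets such as book_totals.csv give None.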

All spreadsheets contain a list of words, each associated with the number of times that word was used in the AoR corpus.

word	count
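
A vocabulary spreadsheet can be inspected with any CSV reader. The following Python sketch lists the most frequent words in one file. It assumes the file is UTF-8 encoded and that its first row is the header word,count; the function name is hypothetical.

```python
import csv


def most_common_words(csv_path, limit=10):
    """Return up to `limit` (word, count) pairs, highest count first."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = [(row["word"], int(row["count"])) for row in csv.DictReader(handle)]
    # The sort is stable, so words with equal counts keep their file order.
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:limit]
```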

Annotation Stats

    java -jar <path_to_tool.jar> annotation-stats <book1_dir> <book2_dir> ...

type	name	languages	locations	method	signature	image_id	image_index	book_id	marginalia_books	marginalia_people	marginalia_places	text	translation

Annotation spreadsheet

This spreadsheet contains all annotations that have been transcribed across all books in the AoR corpus.

type - type of annotation, can be: marginalia, mark, underline, symbol, drawing, errata, numeral.
name - some annotation types can take a value only from a known list of values. Relevant annotations will have only ONE name.
languages - languages that appear in this annotation, whether from text in the annotation or text that has been referenced in the printed book. An annotation can contain more than one language.
locations - where on the page does this annotation appear? Currently this will be only a general region of a page and not exact coordinates. An annotation can have multiple locations, for example, a marginalia can wrap around a page, starting in the page head and spilling over into the page right margin.
method - method used to mark the page: chalk, pen, etc
signature - a standard bibliographic identifier for the page. For more information, refer to the Transcribers Manual, Section 1.1
image_id - exact filename of the image of the page on which this annotation exists. This follows the naming scheme of images in the AoR archive (Archive File Naming).
image_index - index of the page in the book.
book_id - name of the book in the archive
marginalia_books - the standardized name(s) of any books referenced in a marginalia annotation. Zero or more books can be listed here.*
marginalia_people - the standardized name(s) of any people referenced in a marginalia annotation. Zero or more people can be listed here.*
marginalia_places - the standardized name(s) of any locations referenced in a marginalia annotation. Zero or more locations can be listed here.*
text - any text in the printed book that is referenced by an annotation. For marginalia, the text of the marginal note is also included.
translation - for marginalia, the translation of the text of the marginal note.

*Standardized names may not appear in an annotation exactly as listed. Instead, the annotation might contain a variation of the standardized name.
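
As an example of working with the annotation spreadsheet, the Python sketch below counts annotations per type, optionally for a single book. It relies only on the type and book_id columns described above. The function name is hypothetical, and because the name of the output file is not stated here, the path is passed in by the caller. UTF-8 encoding is assumed.

```python
import csv
from collections import Counter


def annotations_per_type(csv_path, book_id=None):
    """Count rows per annotation type, optionally restricted to one book."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if book_id is None or row["book_id"] == book_id:
                counts[row["type"]] += 1
    return counts
```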

Validation

    java -jar <path_to_tool.jar> validate -r <aor_directory_github_clone>

The tool can also validate the XML files in a given directory. It will list all of the directories checked and output all errors found.

The validate operation checks the XML files in the given directory. The -r flag tells the tool to also check XML files in all subdirectories.
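
For a quick check without the tool, the Python sketch below walks a directory and reports XML files that cannot be parsed. It is an approximation only: it tests well-formedness and makes no claim to reproduce the checks performed by the validate command. The function name and the recursive parameter, which plays the role of the -r flag, are hypothetical.

```python
import os
import xml.etree.ElementTree as ET


def find_malformed_xml(directory, recursive=False):
    """Return (path, error message) pairs for XML files that fail to parse."""
    problems = []
    for root, dirs, files in os.walk(directory):
        if not recursive:
            dirs[:] = []  # emptying the list stops os.walk from descending
        for name in sorted(files):
            if name.lower().endswith(".xml"):
                path = os.path.join(root, name)
                try:
                    ET.parse(path)
                except ET.ParseError as error:
                    problems.append((path, str(error)))
    return problems
```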

Git Statistics

    java -jar <path_to_tool.jar> git-stats <Github Repo URL>

This command will temporarily clone the GitHub repository and gather statistics for every commit. Output is in the form of two CSV files, books.csv and commits.csv. The output CSV files are placed in the directory from which you run the command.

The time for this command to finish will vary depending on the computer it runs on and when the command is run. Since it iterates through all commits in the repository, the run time will increase slowly as commits are added to the repository. Realistically, it will likely grow by no more than a second per commit added.

Note: this command will take a while to complete, possibly 15-20 minutes. As long as you are seeing new output on the command line, it is still working.

commits.csv

This spreadsheet lists all commits made to the AoR GitHub repository. It gives a brief overview of the changes made to the data: commit IDs that link this metadata to more detailed records of the changes, the authorship of the changes, and counts of the changes made. Parent IDs are given to identify branching behavior in the repository.

commit id	parent id(s)	date	author name	author email	commit message	added	modified	deleted	renamed	copied	unreadable

commit id - unique ID of the commit
parent id(s) - a comma-separated list of all parent commits. In the case of branch merges, there is more than one parent.
date - date the commit was made
author name - name of the author of the commit
author email - email address of the author of the commit
commit message - message left by the author describing the commit
added - number of files added in this commit across all books
modified - number of files modified (those files that had existed previously that have been changed)
deleted - number of files deleted in this commit
renamed - number of files renamed in this commit. Note, moving a file from one directory to another will count as a rename. Ex: /base/file1 -> /base/dir/file1
copied - number of files copied in this commit
unreadable - number of unreadable files in this commit across all books. These unreadable files necessarily do not contribute to the annotation or word counts because they could not be read or parsed by this tool. This usually occurs because of invalid XML.
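
Since merges are the commits with more than one parent, they can be picked out of commits.csv by splitting the parent id(s) column. The Python sketch below assumes that the header row uses the column names exactly as listed above, that the file is UTF-8 encoded, and that the comma-separated parent list is quoted as a single CSV field. The function name is hypothetical.

```python
import csv


def merge_commits(commits_csv):
    """Return (commit id, [parent ids]) for every commit with several parents."""
    merges = []
    with open(commits_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            parents = [p.strip() for p in row["parent id(s)"].split(",") if p.strip()]
            if len(parents) > 1:
                merges.append((row["commit id"], parents))
    return merges
```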

books.csv

This large spreadsheet contains more detailed information regarding the history of the data. For each point in time found in commits.csv, this spreadsheet contains information about numbers of annotations and annotated/transcribed words for each book. For example, the first few entries will be for the most recent commit found in the GitHub repository, so they share the same commit ID. These first lines will show the annotation counts and modification information for each book and can be linked to a particular time stamp by referencing commits.csv.

commit id	book	total annotations	total annotated words	marginalia	marginalia words	underlines	underlined words	marks	marked words	symbols	symbol words	drawings	numerals	books	people	locations	added	modified	deleted	renamed	copied	unreadable

commit id - links to a single commit in commits.csv
book - book id
total annotations - total number of annotations in a book, simple sum of marginalia + underlines + marks + symbols + drawings + numerals
total annotated words - total number of words associated with annotations. Simple sum of marginalia words + underlined words + marked words + symbol words
marginalia, underlines, marks, symbols, drawings, numerals - simple counts of the number of each type of annotation within a book
books - number of books referenced in the marginalia
people - number of people referenced in the marginalia
locations - number of locations referenced in the marginalia
added - number of files added to this book in this commit
modified - number of files modified in this book in this commit
deleted - number of files deleted in this book in this commit
renamed - number of files renamed in this book in this commit
copied - number of files copied to this book in this commit
unreadable - number of files in this book in this commit that were unreadable. These files are invalid XML, so cannot be read/parsed by this tool. These files are not counted and do not contribute to the annotation or word counts.
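
The link between the two spreadsheets can be followed in code. The Python sketch below builds the history of one book by looking up the date of each commit in commits.csv and recomputing the annotation total from the per-type counts, following the sum given for total annotations. It assumes header rows with the column names exactly as listed above and UTF-8 encoding, and it treats dates as opaque strings because their format is not specified here. The function names are hypothetical.

```python
import csv

# Columns summed for "total annotations", as described above.
ANNOTATION_COLUMNS = ["marginalia", "underlines", "marks", "symbols", "drawings", "numerals"]


def load_commit_dates(commits_csv):
    """Map each commit id to its date string."""
    with open(commits_csv, newline="", encoding="utf-8") as handle:
        return {row["commit id"]: row["date"] for row in csv.DictReader(handle)}


def book_history(books_csv, commits_csv, book):
    """Return (date, commit id, annotation total) rows for one book, in file order."""
    dates = load_commit_dates(commits_csv)
    history = []
    with open(books_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if row["book"] != book:
                continue
            total = sum(int(row[column]) for column in ANNOTATION_COLUMNS)
            history.append((dates.get(row["commit id"]), row["commit id"], total))
    return history
```

Comparing the recomputed total with the total annotations column is a simple consistency check on a row.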
