jabrah edited this page Aug 31, 2016 · 30 revisions

In progress text for data release, not an actual data release announcement.

See http://archaeologyofreading.org/downloads/ for the actual data release.


Introduction

A data release contains all archived data from the AoR project with the exception of images. In addition, it includes statistics derived from the transcriptions and from the history of work on them.

The data is arranged in a file system hierarchy distributed as a BagIt package. Releases are numbered and named like aor_drX.zip where X is the number of the release.
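Because the release is a BagIt package, its contents can be checked against the payload manifest after unpacking. The following is a minimal sketch rather than the project's own tooling. It assumes the zip has already been extracted and relies only on the BagIt convention that manifest-<algorithm>.txt files list a checksum and a path relative to the bag root on each line. The function name verify_bag is hypothetical, and which checksum algorithm a given release uses is not stated here, so the code reads it from the manifest file name.

```python
import hashlib
from pathlib import Path


def verify_bag(bag_dir):
    """Return the payload paths that are missing or fail their checksum."""
    bag = Path(bag_dir)
    failures = []
    # Payload manifests are named manifest-<algorithm>.txt (e.g. manifest-md5.txt).
    for manifest in sorted(bag.glob("manifest-*.txt")):
        algorithm = manifest.stem.split("-", 1)[1]
        for line in manifest.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            # Each line holds "<checksum> <path relative to the bag root>".
            expected, rel_path = line.split(None, 1)
            rel_path = rel_path.strip()
            target = bag / rel_path
            if not target.is_file():
                failures.append(rel_path)
                continue
            digest = hashlib.new(algorithm, target.read_bytes()).hexdigest()
            if digest != expected.lower():
                failures.append(rel_path)
    return failures
```

An empty list means every file named in the manifest is present and unchanged. A call might look like verify_bag("aor_dr3"), where the directory name is only an example of an unpacked release.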

Data format

The top-level directory contains data concerning the collection as a whole. A README file gives a short explanation of the release and contains a changelog. Spreadsheets are included in UTF-8 encoded CSV format. The XML schemas and DTDs that were used when creating the transcriptions are also included.

BASE
 |
 ├──books.csv
 ├──corpus.csv
 ├──locations.csv
 ├──people.csv
 ├──README
 ├──XMLschema/
 |    └── (schemas/DTDs)
 ├──books/
 |    ├──PrincetonRB16th12/
 |    |    ├── (XML transcriptions) EX: PrincetonRB16th12.aor.002r.xml
 |    |    ├── PrincetonRB16th12.description.xml
 |    |    └── PrincetonRB16th12.images.csv
 |    ├──PrincetonRB16th11/
 |    |    └── (book contents)
 |    ├──PrincetonRB16th99a/
 |    |    └── (book contents)
 |    ├──UclCastiglione1541/
 |    |    └── (book contents)
 |    ├──Newberry27495/
 |    |    └── (book contents)
 |    ├──FolgersHa2/
 |    |    └── (book contents)
 |    ├──PrincetonK6233/
 |    |    └── (book contents)
 |    ├──HoughtonSTC11402/
 |    |    └── (book contents)
 |    ├──PrincetonPA6452/
 |    |    └── (book contents)
 |    ├──PrincetonU101/
 |    |    └── (book contents)
 |    ├──PrincetonPA8550/
 |    |    └── (book contents)
 |    ├──PrincetonDL45/
 |    |    └── (book contents)
 |    └──PrincetonPE1137/
 |         └── (book contents)
 |
 └──stats/
       ├──annotations.csv
       ├──book_totals.csv
       ├──PrincetonRB16th12.csv
       ├──PrincetonRB16th11.csv
       ├──PrincetonRB16th99a.csv
       ├──UclCastiglione1541.csv
       ├──Newberry27495.csv
       ├──FolgersHa2.csv
       ├──PrincetonK6233.csv
       ├──HoughtonSTC11402.csv
       ├──PrincetonPA6452.csv
       ├──PrincetonU101.csv
       ├──PrincetonPA8550.csv
       ├──PrincetonDL45.csv
       ├──PrincetonPE1137.csv
       ├──vocab_marginalia_EL.csv
       ├──vocab_marginalia_EN.csv
       ├──vocab_marginalia_ES.csv
       ├──vocab_marginalia_FR.csv
       ├──vocab_marginalia_IT.csv
       ├──vocab_marginalia_LA.csv
       ├──vocab_marks_EL.csv
       ├──vocab_marks_EN.csv
       ├──vocab_marks_IT.csv
       ├──vocab_marks_LA.csv
       ├──vocab_symbols_EL.csv
       ├──vocab_symbols_EN.csv
       ├──vocab_symbols_IT.csv
       ├──vocab_symbols_LA.csv
       ├──vocab_underlines_EL.csv
       ├──vocab_underlines_EN.csv
       ├──vocab_underlines_IT.csv
       └──vocab_underlines_LA.csv

Spreadsheets

books.csv

The books spreadsheet lists books that are mentioned in annotations. The first column is used as the standard title of the book. The other columns contain title variants and bibliographic information.

corpus.csv

The corpus spreadsheet contains metadata about the books which are part of the project. In particular the identifier column gives the identifier of the book in the archive.

locations.csv

The locations spreadsheet lists locations that are mentioned in annotations. The first column is used as the standard location name.

people.csv

The people spreadsheet lists people that are mentioned in annotations. The first column is used as the standard person name.
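books.csv, locations.csv, and people.csv share the convention that the first column holds the standard name. A minimal sketch of reading that column is shown here. The function name standard_names is hypothetical, and the presence of a header row is an assumption that can be switched off with has_header=False.

```python
import csv


def standard_names(csv_path, has_header=True):
    """Return the first-column values of a spreadsheet such as people.csv."""
    # "utf-8-sig" reads plain UTF-8 and also tolerates a leading byte-order mark.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        rows = list(csv.reader(handle))
    if has_header and rows:
        rows = rows[1:]  # assumption: the first row holds column names
    # Skip blank rows and rows whose first cell is empty.
    return [row[0].strip() for row in rows if row and row[0].strip()]
```

The returned list can serve as a lookup of standard names when normalising references found in the transcriptions.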

Books

The books subdirectory contains a directory for each book in the collection. Each book directory contains metadata about the book, transcriptions, and the list of images in the book. Each file is prefixed by the book identifier, which is also the name of the book directory. For example, the directory PrincetonK6233 refers to Princeton's Paratitla and contains files like PrincetonK6233.description.xml.

BOOK_ID.description.xml

The book description file contains metadata about the book in a custom XML format. A schema is in progress, but not yet available.

BOOK_ID.images.csv

The images spreadsheet contains an ordered list of page images. The first column is the image identifier, the second is the image width, and the third is the image height. The page images are in reading order.
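The three-column layout described above can be read as follows. This is a sketch under assumptions: the names PageImage and read_images are hypothetical, and because it is not stated whether the file starts with a header row, rows whose width or height is not an integer are simply skipped.

```python
import csv
from typing import NamedTuple


class PageImage(NamedTuple):
    image_id: str
    width: int
    height: int


def read_images(csv_path):
    """Return the page images of one book in file order (reading order)."""
    pages = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if len(row) < 3:
                continue  # blank or incomplete row
            try:
                pages.append(PageImage(row[0], int(row[1]), int(row[2])))
            except ValueError:
                continue  # non-numeric size, e.g. a header row if one exists
    return pages
```

Since the rows are already in reading order, the position of an entry in the returned list gives the page sequence.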

BOOK_ID.aor.IMAGE.xml

The IMAGE portion of the name is the image identifier from the images spreadsheet without the initial prefix of the book identifier and file extension. These files contain detailed information about annotations on the corresponding page in a custom XML format. See the transcriber's manual for more information.
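The naming rule can be expressed as a small helper. This is a sketch only: the function name transcription_name is hypothetical, and the assumed shape of an image identifier (book identifier, a dot, the page label, then a file extension such as .tif) is inferred from the rule above rather than confirmed.

```python
from pathlib import PurePosixPath


def transcription_name(book_id, image_id):
    """Build the transcription file name for one entry of the images spreadsheet."""
    # Drop the file extension (assumed to be present, e.g. ".tif").
    stem = PurePosixPath(image_id).stem
    # Drop the leading book identifier and its separating dot, if present.
    prefix = book_id + "."
    if stem.startswith(prefix):
        stem = stem[len(prefix):]
    return f"{book_id}.aor.{stem}.xml"
```

Under these assumptions, an image identifier of PrincetonRB16th12.002r.tif in book PrincetonRB16th12 maps to PrincetonRB16th12.aor.002r.xml. Not every image necessarily has a transcription file, so the existence of the resulting path should be checked before reading it.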

Statistics

The stats sub-directory contains statistics derived from our transcriptions. As of Data Release 3 the files sit directly in stats/; earlier releases split them into two subdirectories, stats/latest and stats/history.

  • stats/ sub-directory contains data regarding the current state of the transcription data. The data in this folder is described here (https://github.com/jhu-digital-manuscripts/rosa2/wiki/AoR-Statistics#collect-newest-stats).
    • annotations.csv - Collects data for all annotations from all of the AoR books.
    • book_totals.csv - Collects data from other spreadsheets for convenience. This spreadsheet includes total counts for each annotation type and word counts for each book.
    • BOOK_ID.csv - There is one of these spreadsheets for each book in the corpus. It contains data similar to book_totals.csv, but accumulated per page and restricted to a single book.
    • vocab_ANNOTATION-TYPE_LANGUAGE-CODE.csv - There is one file of this type for each annotation type and each language. Example: vocab_marginalia_EN.csv or vocab_underlines_LA.csv. These spreadsheets track the number of times words appear in the transcribed data.
      • Annotation types: marginalia, mark, symbol, underline, drawing, errata, numeral
      • Language codes follow the TWO letter abbreviation standard ISO 639-1. Common language codes in this project are: EN, EL, ES, FR, IT, LA
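The vocabulary file naming pattern above can be split into its parts with a regular expression. This is a minimal sketch: the names VOCAB_NAME and parse_vocab_name are hypothetical, and the pattern assumes a lower-case annotation type followed by an upper-case two-letter language code, as in the file names listed in the directory tree.

```python
import re

# vocab_<annotation type>_<two-letter language code>.csv
VOCAB_NAME = re.compile(r"^vocab_(?P<kind>[a-z]+)_(?P<lang>[A-Z]{2})\.csv$")


def parse_vocab_name(filename):
    """Return (annotation_type, language_code), or None for other files."""
    match = VOCAB_NAME.match(filename)
    if match is None:
        return None
    return match.group("kind"), match.group("lang")
```

For example, parse_vocab_name("vocab_marginalia_EN.csv") returns ("marginalia", "EN"), while a file such as book_totals.csv returns None and can be handled separately.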

Changes between Data Releases 2 and 3

  • stats/history removed: Git history is no longer included
  • stats/latest collapsed: stats files moved directly into stats directory
  • annotations.csv now included with stats
  • README changed from a Markdown file to a plain text file

Download
