This repository has been archived by the owner on Jan 5, 2021. It is now read-only.

File Specifications

Nathan Tallman edited this page Nov 4, 2019 · 39 revisions

File Layout in CHO Batches

  • These file specifications are an extension of Penn State's local digital object guidelines.
  • These file specifications apply only to batches that are imported or exported. Internally, CHO will store files using a machine-efficient layout and naming convention. For preservation storage, CHO will use the Oxford Common File Layout to store individual work and collection packages (1.x feature).
  • Batches
    • A batch contains one or more works. A batch can directly contain only works, not entire collections, bare file sets, or loose files.
    • Each work must be in its own directory inside the bag's data directory.
  • Works
    • Simple works contain all their file sets in one directory.
    • Nested works contain two or more sub-directories, which include file sets. [Subject to change, 1.x feature.]
  • File Sets
    • Normal file sets require a _preservation or _service file to be present. CHO derives normal file sets from the characters that are unique to each filename once the extension and base are removed. These characters are usually an incremental counter such as 0000_000.
    • Representative file sets require an _access file to be present. CHO derives representative file sets from filenames that have identifiers and suffixes but no incremental counters. Representative file sets do not need metadata because the work-level metadata already describes them.
  • File Suffixes (Classes map to PCDM Use [RDF])
    • _preservation = preservation master (master preservation file, e.g. tiff, wav, avi)
    • _preservation-redacted = redacted preservation master
    • _service = service master, access master (master service file for viewers, e.g. jp2, dv)
    • _access = access derivative for user download (e.g. pdf, jpg)
    • _thumb = display thumbnail
    • _text = plain text, transcription, OCR output
    • _caption = time-coded captions for a media file (e.g. vtt, srt)
    • _media = images of the original physical media or housing
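
The file set and suffix rules above can be illustrated with a short Python sketch. This is not CHO's actual implementation. It assumes that the "base" of a filename is the work ID (the name of the work directory) and that whatever sits between the work ID and the suffix is the file set counter. The names split_filename and group_file_sets are hypothetical.

```python
import re
from collections import defaultdict

# Suffixes taken from the list above. The regex requires a "." right after
# the suffix, so "_preservation-redacted" is never mistaken for "_preservation".
SUFFIXES = [
    "preservation-redacted", "preservation", "service", "access",
    "thumb", "text", "caption", "media",
]
SUFFIX_RE = re.compile(r"_(%s)\.[^.]+$" % "|".join(SUFFIXES))


def split_filename(work_id, filename):
    """Return (file_set_key, suffix) for a file that belongs to work_id.

    file_set_key is the part between the work ID and the suffix, for
    example "00001_01". It is "" for files without a counter (assumed
    here to be work-level, i.e. representative, files).
    """
    match = SUFFIX_RE.search(filename)
    if not filename.startswith(work_id) or match is None:
        raise ValueError("unrecognized file name: " + filename)
    key = filename[len(work_id):match.start()].strip("_")
    return key, match.group(1)


def group_file_sets(work_id, filenames):
    """Group file names into {file_set_key: set of suffixes}."""
    groups = defaultdict(set)
    for name in filenames:
        key, suffix = split_filename(work_id, name)
        groups[key].add(suffix)
    return dict(groups)


if __name__ == "__main__":
    names = [
        "pstalt_birds02_00001_01_preservation.wav",
        "pstalt_birds02_00001_02_preservation.wav",
        "pstalt_birds02_access.mp3",
    ]
    for key, suffixes in group_file_sets("pstalt_birds02", names).items():
        print(repr(key), sorted(suffixes))
```

Under these assumptions, a group with a non-empty key would be a normal file set (and should include preservation or service), while the group with the empty key collects the files without a counter. File names whose suffix is not in the list above raise ValueError in this sketch.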

Example Batches

The following are file listings from example CHO batches, which consist of a ZIPped bag and a CSV of metadata for one or more works and file sets.

Simple Work: One File Set, Photograph [MVP]

E.g. One-Sided Photograph

|-- ingest/
    |-- pines_2018-09-03.zip/
        |-- bag-info.txt
        |-- bagit.txt
        |-- manifest-md5.txt
        |-- tagmanifest-md5.txt
        |-- data/
            |-- pst_9999999999/
                |-- pst_9999999999_preservation.tif
                |-- pst_9999999999_preservation-redacted.tif
                |-- pst_9999999999_service.jp2
                |-- pst_9999999999_text.txt
                |-- pst_9999999999_thumb.jpg

| batch_id* | alternate_ids*+ | title*+ | work_type* | home_collection* |
| --- | --- | --- | --- | --- |
| pines_2018-09-03 | pst_9999999999 | Edna Bergleton Thorpe | Still_Image | pst_99999 |

* Required for work metadata.

+ Required for file set metadata.

Simple Work: Many File Sets, Documents [MVP]

E.g. Folder of Manuscript Materials

|-- ingest/
    |-- pstsc_99999_ntt7_2018-10-17.zip/
        |-- bag-info.txt
        |-- bagit.txt
        |-- manifest-md5.txt
        |-- tagmanifest-md5.txt
        |-- data/
            |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c/
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00001_01_preservation.tif
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00001_01_preservation-redacted.tif
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00001_01_service.jp2
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00001_01_text.txt
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00001_01_thumb.jpg
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00001_02_preservation.tif
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00001_02_service.jp2
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00001_02_text.txt
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00001_02_thumb.jpg
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00002_01_preservation.tif
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00002_01_service.jp2
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00002_01_text.txt
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00002_01_thumb.jpg
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00002_02_preservation.tif
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00002_02_service.jp2
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00002_02_text.txt
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00002_02_thumb.jpg
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_access.pdf
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_text.txt
                |-- pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_thumb.jpg

| batch_id* | alternate_ids*+ | title*+ | work_type* | home_collection* |
| --- | --- | --- | --- | --- |
| pstsc_99999_ntt7_2018-10-17 | pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c | Percival Horace Johnson correspondence to Allen Anderson | Document | pstsc_99999 |
| | pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00001_01 | Page 1 | | |
| | pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00001_02 | Page 2 | | |
| | pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00002_01 | Page 3 | | |
| | pstsc_99999_1e172d6ff9h8d032c7e7a8241a56793c_00002_02 | Page 4 | | |

* Required for work metadata.

+ Required for file set metadata.
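
The required-field markers (* and +) can be checked mechanically. The Python sketch below is only an illustration and makes several assumptions: the CSV header uses the column names shown above without the markers, file set rows leave the work-only columns empty, and a row counts as a work row when any of batch_id, work_type, or home_collection is filled. The function names are hypothetical.

```python
import csv
import io

# Columns marked "*" (work metadata) and "+" (file set metadata) above.
WORK_REQUIRED = ["batch_id", "alternate_ids", "title", "work_type", "home_collection"]
FILE_SET_REQUIRED = ["alternate_ids", "title"]
WORK_ONLY = ["batch_id", "work_type", "home_collection"]


def _filled(row, name):
    """True when the column is present and not blank."""
    return bool((row.get(name) or "").strip())


def check_metadata(csv_text):
    """Return {row_number: [missing column names]} for incomplete rows.

    Row numbers count the header as row 1. A row is treated as a work
    row when any work-only column is filled (an assumption of this
    sketch); otherwise it is treated as a file set row.
    """
    problems = {}
    reader = csv.DictReader(io.StringIO(csv_text))
    for number, row in enumerate(reader, start=2):
        is_work = any(_filled(row, name) for name in WORK_ONLY)
        required = WORK_REQUIRED if is_work else FILE_SET_REQUIRED
        missing = [name for name in required if not _filled(row, name)]
        if missing:
            problems[number] = missing
    return problems
```

An empty result means every row carries the fields its row type requires. The sketch does not check that the identifiers match the files in the bag.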

Simple Work: Many File Sets, A/V [MVP]

E.g. Multi-Sided Audio Recording, Joined for Access

|-- ingest/
    |-- birdsong_ntt7_2017-12-18/
        |-- bag-info.txt
        |-- bagit.txt
        |-- manifest-md5.txt
        |-- tagmanifest-md5.txt
        |-- data/
            |-- pstalt_birds02/
                |-- pstalt_birds02_00001_01_preservation.wav
                |-- pstalt_birds02_00001_01_front.jpg
                |-- pstalt_birds02_00001_02_preservation.wav
                |-- pstalt_birds02_00001_02_back.jpg
                |-- pstalt_birds02_service.flac
                |-- pstalt_birds02_access.mp3
                |-- pstalt_birds02_text.txt
                |-- pstalt_birds02_thumb.jpg

| batch_id* | alternate_ids*+ | title*+ | work_type* | home_collection* |
| --- | --- | --- | --- | --- |
| birdsong_ntt7_2017-12-18 | pstalt_birds02 | Altoona area birdsong recording by Wally Walton | Audio | pstal_birds |
| | pstalt_birds02_00001_01 | Cassette 1, Side 1 | | |
| | pstalt_birds02_00001_02 | Cassette 1, Side 2 | | |

* Required for work metadata.

+ Required for file set metadata.
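
The three simple-work listings above share one layout: every payload file sits at data/<work directory>/<file>, and every file name starts with the name of its work directory. A minimal Python sketch of a check for that pattern follows. It is an illustration only, covers simple works (not nested works), and expects file paths relative to the bag root. The function name layout_problems is hypothetical.

```python
from pathlib import PurePosixPath


def layout_problems(paths):
    """Return messages for payload files that break the simple-work layout.

    paths are file paths relative to the bag root, for example
    "data/pst_9999999999/pst_9999999999_thumb.jpg". Tag files outside
    data/ (bagit.txt, the manifests) are ignored.
    """
    problems = []
    for raw in paths:
        parts = PurePosixPath(raw).parts
        if not parts or parts[0] != "data":
            continue  # not a payload file
        if len(parts) != 3:
            problems.append(raw + ": expected data/<work>/<file>")
            continue
        work_id, filename = parts[1], parts[2]
        if not filename.startswith(work_id + "_"):
            problems.append(raw + ": file name does not start with the work ID")
    return problems
```

The path list could come from the bag's payload manifest or from zipfile.ZipFile.namelist() with directory entries filtered out. Either source is an assumption about how a batch would be inspected, not something these specifications prescribe.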


NESTED WORKS ARE NOT PART OF THE MVP. THE SPECS BELOW MAY NOT HAVE BEEN UPDATED SINCE INITIAL DRAFTING AND WILL NEED UPDATING BEFORE 1.X SPRINTING.

Nested Work: Folder of manuscript materials [1.x]

|-- choStaging/
    |-- batchID/
        |-- bag-info.txt
        |-- bagit.txt
        |-- manifest-md5.txt
        |-- tagmanifest-md5.txt
        |-- data/
            |-- workID/
                |-- workID_00001/
                    |-- workID_00001_01_preservation.tif
                    |-- workID_00001_01_preservation-redacted.tif
                    |-- workID_00001_01_service.jp2
                    |-- workID_00001_01_thumb.jpg
                    |-- workID_00001_02_preservation.tif
                    |-- workID_00001_02_service.jp2
                    |-- workID_00001_02_thumb.jpg
                |-- workID_00002/
                    |-- workID_00002_01_preservation.tif
                    |-- workID_00002_01_service.jp2
                    |-- workID_00002_01_thumb.jpg
                    |-- workID_00002_02_preservation.tif
                    |-- workID_00002_02_service.jp2
                    |-- workID_00002_02_thumb.jpg
                |-- workID_service.pdf
                |-- workID_text.txt
                |-- workID_thumb.jpg

| alternate_ids | home_collection | batch_id | work_type | title |
| --- | --- | --- | --- | --- |
| workID | collectionID | batchID | Document | Simple Work |
| workID_00001 | collectionID\|workID | batchID | Document | Nested Work 1 |
| workID_00002 | collectionID\|workID | batchID | Document | Nested Work 2 |

Nested Work: Two-sided audio recording [1.x]

|-- choStaging/
    |-- batchID/
        |-- bag-info.txt
        |-- bagit.txt
        |-- manifest-md5.txt
        |-- tagmanifest-md5.txt
        |-- data/
            |-- workID/
                |-- workID_00001/
                    |-- workID_00001_front.jpg
                    |-- workID_00001_preservation.wav
                    |-- workID_00001_service.flac
                |-- workID_00002/
                    |-- workID_00002_back.jpg
                    |-- workID_00002_preservation.wav
                    |-- workID_00002_service.flac
                |-- workID_service.flac
                |-- workID_access.mp3
                |-- workID_text.txt
                |-- workID_thumb.jpg

| alternate_ids | home_collection | batch_id | work_type | title |
| --- | --- | --- | --- | --- |
| workID | collectionID | batchID | Audio | Simple Work |
| workID_00001 | collectionID\|workID | batchID | Audio | Nested Work 1 |
| workID_00002 | collectionID\|workID | batchID | Audio | Nested Work 2 |