# Preparing Digital Objects for Transfer and Storage Using BagIt

This notebook presents a short introduction to one challenge of digital curation:
how do you know what was sent to you, and how do you know whether what you have is really what was sent?
It explains how those questions can be answered in part with preservation metadata (what it is and how it can be made useful),
and then briefly demonstrates how basic metadata can be created with the help of the
_BagIt_ protocol, which is widely used among digital cultural heritage institutions.
The approach here demonstrates the use of the Python `bagit` library, a tool that is useful in creating and 
validating digital objects in the broader context of preparing data for storage, monitoring it for 
stability over time, and creating trustworthy copies and transfers of digital information. 
While this tool is not a standard piece of office software, it does offer a relatively simple
way to create essential preservation information about groups of digital files, which can be
used by very large cultural heritage organizations (such as the Library of Congress and the California
Digital Library) but also by smaller organizations (you can create this information for any directory
of files using your laptop and this Jupyter notebook).

## Assumptions

This demo was created with a few assumptions. In particular, I'm assuming that you have some level of
comfort with the following:

- working with files on a computer (Windows, Mac, Linux, or other filesystem environments),
- finding and opening files from an Explorer (Windows) or Finder (Mac) graphical interface,
- doing some basic navigation in a shell environment (e.g., Terminal or Bash shell or GitBash), including finding files and directories,
- using the Python programming language, including from an interactive notebook environment (Jupyter notebooks), and
- downloading and operating Jupyter notebooks and repos from GitHub.

All of the concepts, file actions, Python code, and tools are explained below, but understanding the above will help you work through the materials. The activities outlined here are based on a larger sequence of digital curation activities and tools that I developed in 2018 and 2019 while teaching digital curation courses at the University of Maryland iSchool.

## Introduction - How do you create and provide authentic, verifiable digital content? 

To be trustworthy institutions, libraries and archives endeavor to provide reliable
and authentic content. 
The goal of providing authentic information - information that is indeed what it is purported to be,
or that is what it says it is - is a particular challenge when information is being sent between
people or institutions.
The challenge is even greater in the digital environment,
where the stuff of the content (bits and bytes), the infrastructure to store the content
(drives and servers), and the transmission of the content are generally not visible or human readable.
One role of preservation metadata, as discussed in the context of BagIt, is to assist us in
answering the seemingly simple questions that these values raise:

> "Did you get what I sent you?"

> "Do you have the data that I sent you last year? Are you sure it's the same?"

![Data transfer graphic from WikiMedia Commons.](https://upload.wikimedia.org/wikipedia/commons/thumb/4/40/Data-transfer.svg/256px-Data-transfer.svg.png)

Let's say that one of you has a group of files. It might be the digitized pages of a book, a computer-generated copy of the text (perhaps generated by Optical Character Recognition, or OCR), and bibliographic information (publication date, 
author(s), editor(s), copyright, etc). Although these may be discrete files, they are related and
only together create a coherent digital copy of the book. Or perhaps you have digitized a sound recording, and you 
have a group of sound files that are the tracks of the recording, a digital version of liner notes or other
information like a picture of the artist, and publication information. These are a group of sound files, images, and
text information. Or perhaps you have been collecting information posted on social media about Covid vaccination, which might include images, text captions, usernames, posting dates, and accompanying comments. Each of these things, whether a book, an audio recording, or an Instagram or Twitter post, may have multiple files that together create a digital object.

![Sample digital objects for a "book"](images/book-files.png)

![Sample files for a "website"](images/a-website.png)

Now, let's say that you want to preserve these for access later, possibly by yourself but more likely for broader use in an archive or library of the future. Along with a general description of what the things are, you want to
include a complete list of the digital elements that make up the object you are depositing. You also want a way to confirm that the items you've assembled are all received together by the organization that you send them to. And in five or ten years, or more, you want a way to confirm that no problems or events have compromised
the data that you included.

## Context: In the Curation Lifecycle

Before going further to talk about file metadata and what it looks like in BagIt, 
let's place this in the context of the larger realm of digital curation and preservation. 
To start, briefly review the digital curation lifecycle:

![Digital Curation Lifecycle](images/lifecycle_web.png)

BagIt is an approach that can be used to create and check preservation metadata
for files that are being prepared for a digital repository. Where would this be 
important for the curation lifecycle?

- In terms of what we are talking about (the inner orange and yellow circles), the activities 
under discussion here are focused largely on "digital objects," which might include things like discrete
files, groups of files, and other generally static entities; in terms of the model, this does not include databases
or other digital things that might be queried or dynamically generated.
- Along with the digital object, we expect to have some level of description. In this case,
rather than information about what the thing is, think of things like filenames, file size,
date of creation, who has access to a resource, and what they can do with it. This is information that
may be usefully described as administrative and technical metadata; that is, who has access and
what the technical nature of the thing is. (We won't be talking much about descriptive information,
in the sense of describing the content, here.)
- Moving out to the Curate/Preserve ring, most of what is going on here is done to preserve existing digital files. At the point that we are packaging files and creating technical metadata, we might assume that there has already been a choice somewhere along the line that these materials are ones that we want to manage as part of our collection, repository, or other organizational unit. 
- Finally, in the outer magenta-colored ring, which lists digital curation actions, the areas where we might most likely expect to find structures, metadata, and tools like those under discussion today, are the "Create or Receive," "Ingest," "Preservation Action," and "Store." It's possible that fixity information might be used in other areas as well, but it would likely not be as closely connected to a structure like BagIt.    

## What is Fixity?

Fixity refers to whether or not a piece of information has changed. For digital objects, the most widely used technique to establish and evaluate file fixity is creating "[checksums](https://en.wikipedia.org/wiki/Checksum)," which are values calculated algorithmically from the sequence of bits in a file. Various [algorithms](https://en.wikipedia.org/wiki/Cryptographic_hash_function), many of them developed as cryptographic hash functions, may be used, including `MD5`, `SHA-1`, the `SHA-2` family (for example, `SHA-256`), and others. Fixity is often treated as a metadata property, and fixity can be assessed using that metadata: the checksum is recalculated later and compared with the recorded value.

Fixity information can be useful in answering questions like: 

- "Is this file that I'm receiving the same as the one that you sent to me?" 
- "Is this file that we are storing now the same as the one that was received previously (whether a few days, months, or years later)?"

BagIt and its accompanying tools help us create fixity information by generating checksums and connecting them to the digital information that you are sending or storing.
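To make the idea of a checksum concrete, here is a minimal sketch that uses only Python's standard `hashlib` module. The helper name `file_checksum` and the sample file are hypothetical and exist only for this illustration; this is not how the `bagit` library is implemented, only a demonstration of the underlying idea.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration with a throwaway file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "page-001.txt"
    sample.write_bytes(b"It was a dark and stormy night.\n")
    checksum_when_sent = file_checksum(sample)

    # An unchanged file produces the same checksum every time.
    unchanged = file_checksum(sample) == checksum_when_sent

    # Changing even one character produces a different checksum.
    sample.write_bytes(b"It was a dark and stormy night!\n")
    changed = file_checksum(sample) != checksum_when_sent

print("unchanged file matches:", unchanged)
print("altered file detected:", changed)
```

The comparison only tells you that something changed, not what changed or why, which is why the recorded checksum has to travel with the file.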

## One Answer: BagIt, a specification and accompanying tools

[BagIt](https://en.wikipedia.org/wiki/BagIt) is a packaging specification used by many digital libraries to
assemble, document, transfer, and verify digital content. The idea is to place selected content
into a "bag" that conforms to the [BagIt specification](https://tools.ietf.org/html/draft-kunze-bagit-14).
A bag contains a basic description (`bag-info.txt`), a checksum manifest,
and a data payload. The general concept is that the package is "self describing" (the description
acts like a "tag" on a physical bag naming the contents) and verifiable (using the
checksum manifest).

BagIt is explained and defined in a standalone Request for Comments (RFC) document from the 
Internet Engineering Task Force, so it is not only used in libraries and archives, but also
defined in a robust way that is recognized by a broader professional community.

![Screenshot of the BagIt RFC 8493 page, at https://datatracker.ietf.org/doc/html/rfc8493, as of January 2022](images/bagit-rfc-8493.png)

### Use cases

The specification was developed around 2005 and 2006 by the California Digital Library and
Library of Congress, when they were planning to transfer a large amount of archived web content.
The packaging standard was also soon adopted as the basic storage protocol for all digital
assets created by the National Digital Newspaper Program, a US-wide program that created
digital versions of previously microfilmed newspapers. That program receives content from scanning
partners all over the country and aggregates and standardizes the content for an online web presentation.
The resource is browsable at [Chronicling America](http://chroniclingamerica.loc.gov/), and 
a sample BagIt bag for the project is visible [here](http://chroniclingamerica.loc.gov/data/batches/batch_az_acacia_ver01/data/).

Check the resources at the end of this notebook for more examples.

### Using and Assembling Bags

This format has been adopted in many settings. Various digital libraries, including the
[California Digital Library](https://www.cdlib.org/cdlinfo/2008/07/02/bagit-transferring-digital-content/),
[Digital Preservation Network](https://docs.google.com/document/d/1JqKMFn9KfeIMAAEdOGQr6LZPqNWx8Qubi12uoUXi2QU/edit),
and others, have adopted the specification as a generalized format
for submission and transfer of information (a SIP). Moreover, various tools, like
[Exactly](https://www.weareavp.com/products/exactly/), have been developed that use the BagIt standard.

When you assemble the files, it's useful to name them according to a standard pattern,
which would be determined by the repository where you plan to store the bag.

- For example, the State Archives of North Carolina suggest that all BagIt bags should
have a top-level folder name that ends in `_bag` as a quick indicator that a folder
represents a bag (note that while useful, this naming convention does not guarantee
that the folder in question is a **valid** BagIt bag). 
- The Library of Congress Chronicling America project receives bags created by state-level
agencies, which coordinate the scanning of microfilm and the creation of specific metadata for each issue
and newspaper title. The content is bagged (packaged) in batches corresponding to microfilm reels
and then sent to the Library of Congress, where the bags are validated.

Put information into a structure that lets you store it in a coherent way. For example,
it may be sufficient to have all files in one folder. Perhaps you have a series of images in both
high-resolution preservation files and lower-resolution access derivatives.
Other cases may suggest a hierarchical structure of folders. For example,
you may want a series of folders and subfolders that correspond to the chapters of a
book or other smaller semantic units; or perhaps large groupings demand deeper and
more nested hierarchies (e.g., a [pairtree](https://confluence.ucop.edu/display/Curation/PairTree) structure).

Once you have the files organized as you wish to store them, you are ready to create the 
bag. Although a relatively basic process once everything is set up, this step accomplishes 
the complicated work of generating the `bag-info.txt` file and requisite data integrity checksums.
This notebook will demonstrate how to make a bag from an existing directory, but first,
let's take a closer look at what BagIt actually looks like.

### BagIt structure

Data packaged according to the BagIt specification 1.x follows a specific structure:

```
<main folder>/
 | bag-info.txt
 | bagit.txt
 | manifest-sha256.txt
 | tagmanifest-sha256.txt
 \--- data/
       | [payload files]
```

Viewed in a file navigator window (e.g., Finder, Explorer), a bag appears as a folder or directory, and may appear something like this:

![Image of how a BagIt bag may appear on a graphical file system viewer.](images/bag-structure.png)

The `data` subfolder contains the content. The other files in the main folder contain
information about the contents:

- `bagit.txt` identifies the version of the BagIt specification and the character encoding of the tag files
- `bag-info.txt` contains all metadata entered by the packager
- `manifest-sha256.txt` contains a manifest of all the files in the `data` folder and their checksums
- `tagmanifest-sha256.txt` contains a manifest of the tag files in the main folder (such as `bagit.txt`, `bag-info.txt`, and the payload manifest) and their checksums

The purpose of the "Bag Info" file is to transmit metadata about the package, including things like
who sent it, pointers to more robust descriptive information such as a library catalog, and 
information about the specific package like when it was bagged, size, and what tool created it.
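As a rough illustration of this layout, the sketch below checks whether a folder has the three elements that RFC 8493 requires: the `bagit.txt` declaration, a `data` directory, and at least one payload manifest (`bag-info.txt` and the tag manifest are optional). The function name `missing_bag_elements` and the folder name are hypothetical. This only inspects the structure; it does not validate checksums, so a folder that passes is not necessarily a valid bag.

```python
import tempfile
from pathlib import Path


def missing_bag_elements(folder):
    """List the required BagIt elements that are absent from a folder."""
    folder = Path(folder)
    missing = []
    if not (folder / "bagit.txt").is_file():
        missing.append("bagit.txt")
    if not (folder / "data").is_dir():
        missing.append("data/")
    # "manifest-*.txt" matches payload manifests, not "tagmanifest-*.txt".
    if not any(folder.glob("manifest-*.txt")):
        missing.append("manifest-<algorithm>.txt")
    return missing


# Build a throwaway skeleton to show the check in action.
with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp) / "myfirstbag"
    (bag / "data").mkdir(parents=True)
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    before = missing_bag_elements(bag)  # no payload manifest yet

    (bag / "manifest-sha256.txt").write_text("", encoding="utf-8")
    after = missing_bag_elements(bag)

print(before)
print(after)
```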

### BagInfo Metadata

You can think of the "Bag Info" as a name tag on the bag, which has information about who sent it and also what's in the bag. A well-formed bag should include as much information on the “tag” as possible, since this is where we can include information about the source, provenance, and description of the data. According to the BagIt specification, this information should be contained in the `bag-info.txt` file and located in the top level of the bag. 
Bag information is stored in plain text format, and metadata is structured in key:value pairs, where the label
(key) is separated from the information (value) by a colon. 
You can see these key:value pairs if you open the file, and they will look like this:

```
key1-label: value1 value
key2-label: value2 value
key3-label: value3 123
```

* Note that this data pattern is similar to how Python stores information in dictionaries (`{'KEY': 'VALUE'}`)
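To follow up on the dictionary comparison, here is a minimal sketch that reads `label: value` lines into a Python dictionary. The function name `parse_bag_info` and the sample values are hypothetical. Two assumptions are worth flagging: because a label may appear more than once, each label maps to a list of values, and a line that begins with whitespace is treated as a continuation of the previous value. The `bagit` library has its own handling for this file, so treat this only as an illustration of the format.

```python
SAMPLE_BAG_INFO = """\
Source-Organization: Imagination Library
Contact-Name: Jesse Johnston
External-Description: Digitized items from
  an interesting collection.
External-Identifier: ark:/67531/metadc488207
Bagging-Date: 2021-12-02
"""


def parse_bag_info(text):
    """Parse 'label: value' lines into a dict mapping each label to a list of values."""
    fields = {}
    last_label = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] in " \t" and last_label is not None:
            # Indented line: continuation of the previous value.
            fields[last_label][-1] += " " + line.strip()
            continue
        # Split on the first colon only, so values may contain colons.
        label, _, value = line.partition(":")
        last_label = label.strip()
        fields.setdefault(last_label, []).append(value.strip())
    return fields


info = parse_bag_info(SAMPLE_BAG_INFO)
for label, values in info.items():
    print(label, "->", values)
```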

The Bag Info fields are intended to be "human readable" (that is, they can be opened and read as a text
file), and the fields are not validated against any external schema by the `bagit` tools. Later on, we will display the contents of a sample `bag-info.txt` file, which may contain any labels that you want to add. At this point, it is important to note that some Bag Info labels are reserved, and they may take data from specific sources:

| Reserved label (key name) | Description | Required? | Repeatable? | Source | Example |
| --- | --- | --- | --- | --- | --- |
|<td colspan="6">**Descriptive Information (about the bag, provenance, contents, creators, reference identifiers)**</td>|
|Source-Organization: | Organization transferring the content.|||Boilerplate/Curator|Imagination Library|
|Organization-Address: | Mailing address of the source organization.|||Boilerplate/Curator| Fairbanks, Alaska|
|Contact-Name: | Person at the source organization who is responsible for the content transfer.|||Boilerplate/Curator|Jesse Johnston|
|Contact-Email: | Email address of the contact.|||Boilerplate/Curator|hello@some.email|
|External-Description: | A brief explanation of the contents and provenance.|||Customized per bag or larger collection/grouping|Digitized items from an interesting collection.|
|External-Identifier:| A sender-supplied identifier for the bag. For example, a call number, URI, or ARK identifier|||Customized per bag or larger collection/grouping|ark:/67531/metadc488207|
|Bagging-Date: | Date (YYYY-MM-DD) that the content was prepared for transfer.||NO|System|2021-12-02|
|<td colspan="6">**Technical information (bag size, number of files, etc)**</td>|
|Bag-Size: | The size or approximate size of the bag being transferred, followed by an abbreviation such as MB (megabytes), GB (gigabytes), or TB (terabytes).|NO|NO|Bagging application|42600 MB, 42.6 GB, or .043 TB|
|Payload-Oxum: |The "octetstream sum" of the payload, which is intended for the purpose of quickly detecting incomplete bags before performing checksum validation.  This is strictly an optimization, and implementations MUST perform the standard checksum validation process before proclaiming a bag to be valid. This element MUST NOT be present more than once and, if present, MUST be in the form `OctetCount.StreamCount`, where _OctetCount_ is the total number of octets (8-bit bytes) across all payload file content and _StreamCount_ is the total number of payload files.|NO|NO|Bagging application| 1.1|

_Notes:_ 

- Generally, bag key names should not include spaces, so you will see that individual words are separated by hyphens.
- Some of these fields should be created by a person (the digital curator preparing the bag), and some should be created automatically by the system.
- The above table is based on the RFC 8493 spec, but there is more information in the RFC. For example, there are additional reserved bag labels; the ones above were selected for demonstration purposes.
- If you are going to be using BagIt a lot, it would be a good idea to read the full [RFC 8493](https://datatracker.ietf.org/doc/html/rfc8493).

#### Sample baginfo

A sample BagInfo file might look like this:

```
Source-Organization: Data Curation Training Pros, via Library of Congress (LC)
Contact-Name: Anonymous
Contact-Email: hello@some.email
External-Description: These are sample files from the Library of Congress Web Archives that we wanted to structure in BagIt for practice.
External-Identifier: myfiles:documents/test/files/1234
Bagging-Date: 2022-01-08
Payload-Oxum: 26923687.23
```
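The `Payload-Oxum` line in the sample above reads as 26,923,687 bytes spread across 23 payload files. As a minimal sketch of how such a value can be derived, the code below walks a `data` folder, adds up the file sizes, and counts the files. The function name `payload_oxum` and the demo files are hypothetical; bagging tools normally calculate this value for you.

```python
import tempfile
from pathlib import Path


def payload_oxum(bag_dir):
    """Return 'OctetCount.StreamCount' for the files under <bag_dir>/data."""
    data_dir = Path(bag_dir) / "data"
    files = [path for path in data_dir.rglob("*") if path.is_file()]
    octet_count = sum(path.stat().st_size for path in files)
    return f"{octet_count}.{len(files)}"


# Build a tiny throwaway payload to show the calculation.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    (data / "images").mkdir(parents=True)
    (data / "images" / "q172.txt").write_bytes(b"hello")   # 5 bytes
    (data / "notes.txt").write_bytes(b"BagIt demo\n")       # 11 bytes
    oxum = payload_oxum(tmp)

print(oxum)  # 16 bytes in 2 files -> "16.2"
```

Because it only compares a byte total and a file count, this check is quick but coarse: it can reveal an incomplete transfer, but it cannot detect a file whose contents changed without changing its size.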

### Manifests

In addition to the "tag" information in the `bag-info.txt` file, a complete BagIt object
must also include lists of the files you put in the bag. These lists, or **manifests**
in the terminology of BagIt, are titled with the word manifest, then the name of the checksum
algorithm that was used to create them. So, for example, `manifest-md5.txt` is the manifest of all
files in the `data` directory, with their checksums as generated by the MD5 algorithm.
The `bagit` tool creates this information for you, which provides a complete list of
what was in the bag when you packed it, and if someone is unpacking it in another place and time, they
should be able to use this information to confirm that the bag has what you put in it. In addition,
the tool will create checksums for each file in the bag, which means that you can also use the tool 
to check to see if any of the files have changed. Keep in mind that, should something be missing
or changed, this doesn't tell you what that might be or why, but it does tell you that you need to confirm
with the person or organization who sent the information to you. 

The manifest is a text file, and each line in the manifest represents a file in the bag. 
Each line begins with the checksum, followed by 
the path of the corresponding file, relative to the top level of the bag (for example, `data/27613-h/images/q172.png`).

A two-file bag with the file manifest might look something like this:

```
myfirstbag/
   |
   |   manifest-md5.txt
   |    (49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png)
   |    (408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt)
   |
   |   bagit.txt
   |    (BagIt-Version: 1.0                                           )
   |    (Tag-File-Character-Encoding: UTF-8                           )
   |
   \--- data/
        |
        |   27613-h/images/q172.png
        |    (... image bytes ...                                     )
        |
        |   27613-h/images/q172.txt
        |    (... OCR text ...                                        )
```

In addition to the manifest for the content files, the bag should contain a manifest for the bag's tag files
(for example, `tagmanifest-sha256.txt`). This file is present to ensure that the tag information has not been changed, either
by some error in the transfer or storage process or perhaps by mistake at some point.
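
To illustrate how a manifest supports verification, here is a minimal sketch that reads a payload manifest and recalculates the checksum of each listed file. The function names `read_manifest` and `verify_payload` and the demo bag are hypothetical, and the sketch assumes a SHA-256 manifest. It reports missing and altered files only; it does not look for extra files that are absent from the manifest, and it is not a substitute for the validation performed by the `bagit` library.

```python
import hashlib
import tempfile
from pathlib import Path


def read_manifest(manifest_path):
    """Map each file path listed in a manifest to its recorded checksum."""
    entries = {}
    text = Path(manifest_path).read_text(encoding="utf-8")
    for line in text.splitlines():
        if line.strip():
            # Checksum first, then the path; split on the first whitespace run.
            checksum, file_path = line.split(None, 1)
            entries[file_path.strip()] = checksum.lower()
    return entries


def verify_payload(bag_dir, algorithm="sha256"):
    """Return a list of (path, problem) tuples; an empty list means no problems found."""
    bag_dir = Path(bag_dir)
    problems = []
    entries = read_manifest(bag_dir / f"manifest-{algorithm}.txt")
    for file_path, expected in entries.items():
        target = bag_dir / file_path
        if not target.is_file():
            problems.append((file_path, "missing"))
            continue
        actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        if actual != expected:
            problems.append((file_path, "checksum mismatch"))
    return problems


# Build a throwaway two-file bag, then damage it to see what is reported.
with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp)
    (bag / "data").mkdir()
    contents = {"data/q172.png": b"image bytes", "data/q172.txt": b"OCR text"}
    lines = []
    for file_path, payload in contents.items():
        (bag / file_path).write_bytes(payload)
        lines.append(f"{hashlib.sha256(payload).hexdigest()}  {file_path}")
    (bag / "manifest-sha256.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")

    intact = verify_payload(bag)

    (bag / "data/q172.txt").write_bytes(b"OCR text, edited")  # alter one file
    (bag / "data/q172.png").unlink()                          # remove the other
    damaged = verify_payload(bag)

print(intact)
print(damaged)
```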

The next notebook, `01-using-bagit-ipynb`, demonstrates how to 
use the Python `bagit` library to create a valid BagIt object with 
file manifests, checksums, and the data folder structure.