# Creating Basic Technical Metadata for Preservation with BagIt

## Assumptions

This demo was created with a few assumptions. In particular, I'm assuming that you have some level of
comfort with the following:

- working with files on a computer (Windows, Mac, Linux, or other filesystem environments),
- finding and opening files from an Explorer (Windows) or Finder (Mac) graphical interface,
- basic navigation in a shell environment (e.g., Terminal, Bash, or Git Bash), including finding files and directories,
- the Python programming language, including using it from an interactive notebook environment (Jupyter notebooks), and
- downloading and operating Jupyter notebooks and repos from GitHub.

All of the concepts, file actions, Python code, and tools are explained below, but understanding the above will help you work through the materials. The activities outlined here are based on a larger sequence of digital curation activities and tools that I developed in 2018 and 2019 while teaching digital curation courses at the University of Maryland iSchool.

## Introduction

Let's say that one of you has a group of files. It might be the digitized pages of a book, a computer-generated copy of the text (perhaps generated by Optical Character Recognition, or OCR), and bibliographic information (publication date, 
author(s), editor(s), copyright, etc). Although these may be discrete files, they are related and
only together create a coherent digital copy of the book. Or perhaps you have digitized a sound recording, and you 
have a group of sound files that are the tracks of the recording, a digital version of liner notes or other
information like a picture of the artist, and publication information. These are a group of sound files, images, and
text information. Or perhaps you have been collecting information posted on social media about Covid vaccination, which might include images, text captions, usernames, posting dates, and accompanying comments. Each of these things, whether a book, an audio recording, or an Instagram or Twitter post, may have multiple files that together create a digital object.

![file-image here]

Now, let's say that you want to preserve these for access later, possibly by yourself but more likely for broader use in an archive or library of the future. Along with a general description of what the things are, you want to include a complete list of the digital elements that comprise the object. You also want a way to confirm that the items you've assembled are all received together by the organization you send them to. And in five or ten years, or more, you want a way to confirm that no problems or events have compromised the data.

## Metadata and where it fits in the lifecycle

This notebook presents a short introduction to preservation metadata (what it is, why it is important,
and when it may be most useful)
and then briefly demonstrates how basic metadata might be created with the help of the storage
protocol _BagIt_, widely used among digital cultural heritage institutions. The approach here is
to demonstrate the use of the Python `bagit` library, a tool that is useful in creating and
validating digital objects in the broader context of preparing data for storage, monitoring it for
stability over time, and creating trustworthy copies and transfers of digital information.

To begin, let's place this in the context of the larger realm of digital curation and preservation
with a brief review of the digital curation lifecycle:

![Digital Curation Lifecycle](images/lifecycle_web.png)


- In terms of what we are talking about (the inner orange and yellow circles), the activities
discussed here focus largely on "digital objects," which might include discrete
files, groups of files, and other generally static entities. In terms of the model, this does not include databases
or other digital things that might be queried or dynamically generated.
- Along with the digital object, we expect to have some level of description. In this case,
rather than information about what the thing is, think of things like filenames, file size,
date of creation, and who has access to a resource and what they can do with it. This is information that
may be usefully described as administrative and technical metadata: who has access, and
information about the nature of the thing. (We won't be talking much about descriptive information,
in the sense of describing the content, here.)
- Moving out 
- The areas where we might most likely expect to find preservation metadata being created or used are  

## What is packaging?

OAIS describes packaging information as

> that information which, either actually or logically, binds,
identifies and relates the Content Information and PDI. (2-7)

Whereas descriptive information provides useful tags or text for discovering digital information,
the packaging information structures the information used for preservation (such as
fixity checksums, unique identifiers, provenance, etc) and the "content" (the digital bitstream
or content intended for preservation).

To bring information into a digital preservation environment that follows the general
models and functions outlined by the OAIS, the information needs to be formed into a
submission information package (SIP). If this information is large-scale and an archive
hopes to manage it, then consistently structured incoming information makes it more likely
that a system of rules can be used to manage the archival storage consistently and
reliably. For example, the University of North Texas defines their information packages according to
this [prospectus](https://www.library.unt.edu/sites/default/files/documents/digital-libraries-uploads/Appendix_M_UNT_Libraries_OAIS_Information_Package_Specification.pdf). Examples of the information assets that make up
information packages that are ingested and stored for the National Digital Newspaper Program,
which provides the data for the historic newspaper database [Chronicling America](https://chroniclingamerica.loc.gov/), are [described here](http://www.loc.gov/ndnp/guidelines/examples.html).
Some disciplines develop their own tools or standard protocols, and other digital repositories
use tools to prepare information, such as Bagger and Data Accessioner.

## What Is BagIt?

BagIt is a packaging specification used by many digital libraries to assemble, document,
transfer, and verify digital content. The idea is to place selected content
into a "bag" that conforms to the [BagIt specification](https://tools.ietf.org/html/draft-kunze-bagit-14).
A bag contains a basic description (`bag-info.txt`), a checksum manifest,
and a data payload. The general concept is that the package is "self describing" (the description
acts like a "tag" on a physical bag naming the contents) and verifiable (using the
checksum manifest).

This format has been adopted in many settings. Various digital libraries, including the
[California Digital Library](https://www.cdlib.org/cdlinfo/2008/07/02/bagit-transferring-digital-content/),
[Digital Preservation Network](https://docs.google.com/document/d/1JqKMFn9KfeIMAAEdOGQr6LZPqNWx8Qubi12uoUXi2QU/edit),
and others, have adopted the specification as a generalized format
for submission and transfer of information (a SIP). Moreover, various tools, like
[Exactly](https://www.weareavp.com/products/exactly/), have been developed that use the BagIt standard.

### BagIt structure

Data packaged according to version 1.x of the BagIt specification follows this general schema:

```
<main folder>/
 | bag-info.txt
 | bagit.txt
 | manifest-sha256.txt
 | tagmanifest-sha256.txt
 \--- data/
       | [payload files]
```

The `data` subfolder contains the content. The other files in the main folder contain
information about the contents:
 * `bag-info.txt` contains all metadata entered by the packager (information you configure in the `-profile.json` file)
 * `manifest-sha256.txt` contains a manifest of all the files in the `data` folder and their checksums
 * `tagmanifest-sha256.txt` contains a manifest of the files and checksums of the contents of the main folder
 * `bagit.txt` identifies the version of BagIt specification  
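
As a rough illustration of this layout, the following sketch uses only the Python standard library (not the `bagit` library) to report which of the elements listed above are present in a folder. The function name `describe_layout` and the `example_bag` folder are made up for this example, and the file names assume sha256 manifests as in the schema above.

```python
import tempfile
from pathlib import Path

# Top-level files shown in the schema above (sha256 manifests are assumed).
EXPECTED_TAG_FILES = [
    "bagit.txt",
    "bag-info.txt",
    "manifest-sha256.txt",
    "tagmanifest-sha256.txt",
]


def describe_layout(bag_dir):
    """Return a dict mapping each expected element to True (present) or False."""
    root = Path(bag_dir)
    report = {name: (root / name).is_file() for name in EXPECTED_TAG_FILES}
    report["data/"] = (root / "data").is_dir()
    return report


# Demo on a throwaway folder that has only a data/ directory and bagit.txt.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "example_bag"
    (root / "data").mkdir(parents=True)
    (root / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    layout = describe_layout(root)

for name, present in layout.items():
    print(f"{name:26} {'present' if present else 'missing'}")
```

A listing like this only tells you whether the pieces are in place; it says nothing yet about whether the checksums match the files.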

### Assembling files

Bagger allows you to gather files into bags in two ways: using the GUI to select files,
or assembling the files first and then generating the bag structure based on the way you already have them
assembled. We're going to use the latter approach.

Put information into a structure that lets you store it in a coherent way. For example,
it may be sufficient to have all files in one folder. Perhaps you have a series of images in both
high-resolution preservation files and lower-resolution access derivatives.
Other cases may suggest a hierarchical structure of folders. For example,
you may want a series of folders and subfolders that correspond to the chapters of a
book or other smaller semantic units; or perhaps large groupings demand deeper and
more nested hierarchies (e.g., a [pairtree](https://confluence.ucop.edu/display/Curation/PairTree) structure).

When you assemble the files, it's useful to name them according to a standard protocol,
which would be determined by the repository where you plan to store the bag.
For example, the State Archives of North Carolina suggest that all BagIt bags should
have a top-level folder name that ends in `_bag` as a quick indicator that a folder
represents a bag (note that while useful, this naming convention does not guarantee
that the folder in question is a *valid* BagIt bag).

Once you have the files organized as you wish to store them, you are ready to create the
bag. Although a relatively basic process once everything is set up, this step accomplishes
the complicated work of generating the `bag-info.txt` file and the requisite data integrity checksums.
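
To make "data integrity checksums" a little less abstract, here is a minimal sketch of what a payload manifest records: one line per file under `data/`, giving the file's SHA-256 checksum followed by its path relative to the bag folder. It uses only the standard library and is not how the `bagit` library itself is implemented; the helper names (`sha256_of`, `manifest_lines`) and the sample file are made up for this illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Compute a file's SHA-256 checksum, reading in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_lines(bag_dir):
    """Build '<checksum>  <relative path>' lines for every file under data/."""
    root = Path(bag_dir)
    lines = []
    for path in sorted((root / "data").rglob("*")):
        if path.is_file():
            # Manifest paths use forward slashes, relative to the bag folder.
            relative = path.relative_to(root).as_posix()
            lines.append(f"{sha256_of(path)}  {relative}")
    return lines


# Demo on a throwaway folder with a single payload file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "example_bag"
    (root / "data").mkdir(parents=True)
    (root / "data" / "hello.txt").write_bytes(b"hello\n")
    example_manifest = manifest_lines(root)

print("\n".join(example_manifest))
```

Because the checksum depends on every byte of the file, recomputing it later and comparing it with the recorded value is what lets you detect changes.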

## Setup

This activity will use a Python library to prepare a group of sample files according to the BagIt specification. Various other ways of creating bags are discussed below in this notebook. 

In [2]:
# If you don't have bagit installed, install following instructions at https://github.com/LibraryOfCongress/bagit-python
# Alternatively, you can use the magic command on the line below by removing the hashtag and running the cell.
# (When the command below runs, you will see response output appear below this cell as the program downloads and installs.)
#!pip install bagit

Collecting bagit
  Downloading bagit-1.8.1-py2.py3-none-any.whl (35 kB)
Installing collected packages: bagit
Successfully installed bagit-1.8.1


To begin this activity, set up by importing the library:

In [3]:
import bagit

## Bag the Files

The purpose of the “bag” is to record information about the file structure, to create basic information that can demonstrate that the files have not changed, and to provide basic context (information about where the files came from, who filed them, and what they are). It is an open specification, so there are few requirements about how the files are structured. In this case, I am taking all of the files within a specific folder and using the Python bagit library to generate the fixity information; each step is explained throughout the rest of this notebook.

### BagInfo Metadata

A well-formed bag should include as much information on the “tag” as possible, since this is where we can include information about the source and provenance of the data. This “BagInfo” information can be added using arguments in the functions that create bags. This example creates a dictionary of the bag information called `my_BagInfo`, which will be passed as an argument during bag creation. If you use this code, replace the information below with the information appropriate to the project you’re working on.

In [4]:
# create baginfo data
my_BagInfo = {
    'Contact-Name' : 'Jesse Johnston',
    'Contact-Email': 'morskyjezek@gmail.com',
    'External-Description': 'NEH Grant data files downloaded from NEH in December 2020.',
    'Source-Organization': 'National Endowment for the Humanities (NEH)',
    'Source-URL': 'https://securegrants.neh.gov/open/data/',
    'Collected-Date': '2020-12-14'
}

print('Bag Info:\n',my_BagInfo)

Bag Info:
 {'Contact-Name': 'Jesse Johnston', 'Contact-Email': 'morskyjezek@gmail.com', 'External-Description': 'NEH Grant data files downloaded from NEH in December 2020.', 'Source-Organization': 'National Endowment for the Humanities (NEH)', 'Source-URL': 'https://securegrants.neh.gov/open/data/', 'Collected-Date': '2020-12-14'}
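
For orientation, `bag-info.txt` is a plain text file of `Label: value` lines. The sketch below shows roughly how a dictionary like the one above maps onto that format. It is a simplified illustration rather than the library's own code: the helper name `bag_info_lines` is made up, sorting by label is an assumption made for readable output, and long-value line wrapping is ignored. The `bagit` library typically also adds a few fields of its own when it creates a bag, such as `Bagging-Date` and `Payload-Oxum`.

```python
def bag_info_lines(bag_info):
    """Render a metadata dictionary as 'Label: value' lines, sorted by label."""
    return [f"{label}: {value}" for label, value in sorted(bag_info.items())]


# A shortened copy of the metadata above, used only for this illustration.
example_info = {
    "Contact-Name": "Jesse Johnston",
    "Source-Organization": "National Endowment for the Humanities (NEH)",
    "Collected-Date": "2020-12-14",
}

rendered = bag_info_lines(example_info)
print("\n".join(rendered))
```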


### Bagging the files

Now that the tool is available via the imported library and we have created basic metadata for the bag, we can move on to “bag” the files. In this case, the files that we want to bag are in a directory named `neh-grants-data-202012`. We use the `make_bag()` function to make the bag, passing in as arguments the location of the files that we want to bag, the bag info (the `my_BagInfo` dictionary), and, in this case, the UTF-8 text encoding:

In [None]:
# create the bag
bag = bagit.make_bag('neh-grants-data-202012', bag_info = my_BagInfo, encoding='utf-8')

### Examine the Bag

To get an idea of what's in the bag, you can explore the `bag` object and its data.
For example, iterate over `bag.entries.items()` to display a list of the files and their fixity information:

In [5]:
# number each entry, starting from 1, and print its path and sha256 checksum
for line_count, (path, fixity) in enumerate(bag.entries.items(), start=1):
    print(f"{line_count}. path:{path} sha256:{fixity['sha256']}")


You can also read the file contents of the `neh-grants-data-202012/manifest-sha256.txt` thus:

In [None]:
!cat neh-grants-data-202012/manifest-sha256.txt


### Validation

One benefit of using this packaging approach is that it is simple, in the sense that
it only exists as files on a disk or server and does not require any specialized software
to see the files or decompress them. In addition, this approach allows you, as a digital curator,
librarian, or archivist, to receive, store, and preserve digital assets even when you may not
have all of the information about what these assets are or how they might be used. In the words
of the BagIt spec, the contents of a bag are "opaque": it is possible to verify that the
content is accurate whether or not you can display it or render it viewable or processable with software,
and whether or not the contents are subject to rights management or proprietary restrictions.

The specification and structure of a BagIt bag make it possible to check the contents 
without "seeing" them. This is made possible because we can see if the bag is **complete**,
and we can also check to see if the bag is **valid**. 

* A **complete** bag is one that has all of the required 
elements of a bag: a BagIt declaration (`bagit.txt`), a payload (the `data` directory), and a payload manifest
(the list of files and checksums, located in the top-level directory, probably called 
something like `manifest-sha256.txt`). 
* A **valid** bag is one that is complete and for which it is possible to check each file in the
payload, calculate a checksum for it, and verify that the checksum matches the one recorded in the
manifest, indicating that the contents have not changed.
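
The two checks can be sketched with the standard library alone. This is a simplified illustration of the definitions above, not a replacement for the `bagit` library's validation: the names `read_manifest` and `check_bag` are made up, only the sha256 payload manifest is consulted, and files that exist under `data/` but are missing from the manifest are not reported.

```python
import hashlib
import tempfile
from pathlib import Path


def read_manifest(manifest_path):
    """Parse '<checksum> <path>' lines into a {path: checksum} dictionary."""
    entries = {}
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            checksum, path = line.split(None, 1)
            entries[path.strip()] = checksum
    return entries


def check_bag(bag_dir):
    """Return (complete, valid) following the simplified definitions above."""
    root = Path(bag_dir)
    manifest = root / "manifest-sha256.txt"
    complete = (
        (root / "bagit.txt").is_file()
        and (root / "data").is_dir()
        and manifest.is_file()
    )
    if not complete:
        return False, False
    valid = True
    for rel_path, expected in read_manifest(manifest).items():
        target = root / rel_path
        # read_bytes() loads the whole file; fine for a demo, not for huge files.
        if not target.is_file() or hashlib.sha256(target.read_bytes()).hexdigest() != expected:
            valid = False
    return True, valid


# Demo: build a tiny bag by hand, check it, change the payload, check again.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "example_bag"
    (root / "data").mkdir(parents=True)
    payload = root / "data" / "hello.txt"
    payload.write_bytes(b"hello\n")
    (root / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    digest = hashlib.sha256(b"hello\n").hexdigest()
    (root / "manifest-sha256.txt").write_text(f"{digest}  data/hello.txt\n", encoding="utf-8")

    before = check_bag(root)
    payload.write_bytes(b"changed\n")  # simulate corruption or an unrecorded edit
    after = check_bag(root)

print("before change:", before)
print("after change: ", after)
```

In the demo the bag stays complete after the payload file is altered, but it is no longer valid, because the recomputed checksum no longer matches the manifest.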

To assess the bag that was created above, we can again use the `bagit` library: the `bag` object has an `is_valid()` method.
This method checks whether the bag is well formed and whether its contents still match the recorded checksums:

In [None]:
# check to see if the bag is valid
if bag.is_valid():
    print("yay :)")
else:
    print("boo :(")

## Conclusion

Now we have created basic fixity and preservation metadata for this group of files, two types of fixity signatures (sha256 and sha512), and an inventory of the files (see `neh-grants-data-202012/manifest-sha256.txt`). This will serve as a basic description of the original files. In the next activities, we will continue to work with this information, but the original files will remain unchanged and available for further consultation or work beyond this project. All of our work will be to extract, transform, and clean the data that we pull from these files, which will be the basis for further computation, visualization, or other analysis.

## Resources

See these additional resources for more detailed information:
* State Archives of North Carolina, "[Bagger GUI User Guide](https://files.nc.gov/dncr-archives/documents/files/using_bagger.pdf)" (Updated 2012, v. 1.5), available as of March 2018.
* M. Phillips, "[What do we put in our BagIt bag-info.txt files?](https://vphill.com/journal/post/4142/)" (2015).
* UNT Libraries, "[UNT OAIS Information Package Specification](https://www.library.unt.edu/sites/default/files/documents/digital-libraries-uploads/Appendix_M_UNT_Libraries_OAIS_Information_Package_Specification.pdf)" (2015).