# Preparing Digital Objects for Transfer and Storage Using BagIt

This notebook presents a short introduction to one challenge of digital curation:
how do you know what was sent to you, and how do you know that what you have is really what was sent?
It explains how those questions can be answered in part with preservation metadata (what that is, and how it can be made useful),
and then briefly demonstrates how basic metadata can be created with the help of the
_BagIt_ protocol, which is widely used among digital cultural heritage institutions.
The approach here uses the Python `bagit` library, a tool for creating and
validating digital objects in the broader context of preparing data for storage, monitoring it for
stability over time, and creating trustworthy copies and transfers of digital information.
While this tool is not a standard piece of office software, it does offer a relatively simple
way to create essential preservation information about groups of digital files. It is
used by very large cultural heritage organizations (such as the Library of Congress and the California
Digital Library), but it can also be used by smaller organizations: you can create this information for any directory
of files using your laptop and this Jupyter notebook.

## Assumptions

This demo was created with a few assumptions. In particular, I'm assuming that you have some level of
comfort with the following:

- working with files on a computer (Windows, Mac, Linux, or other filesystem environments),
- finding and opening files from an Explorer (Windows) or Finder (Mac) graphical interface,
- basic navigation in a shell environment (e.g., Terminal, Bash, or Git Bash), including finding files and directories,
- the Python programming language, including using it from an interactive notebook environment (Jupyter notebooks), and
- downloading and running Jupyter notebooks and repos from GitHub.

All of the concepts, file actions, Python code, and tools are explained below, but understanding the above will help you work through the materials. The activities outlined here are based on a larger sequence of digital curation activities and tools that I developed in 2018 and 2019 while teaching digital curation courses at the University of Maryland iSchool.

## Introduction - Why is this important? What is the challenge we're trying to solve? 

> "Did you get what I sent you?"

> "Do you have the data that I sent you last year? Are you sure it's the same?"

![Data transfer graphic from WikiMedia Commons.](https://upload.wikimedia.org/wikipedia/commons/thumb/4/40/Data-transfer.svg/256px-Data-transfer.svg.png)

Let's say that one of you has a group of files. It might be the digitized pages of a book, a computer-generated copy of the text (perhaps generated by Optical Character Recognition, or OCR), and bibliographic information (publication date, 
author(s), editor(s), copyright, etc). Although these may be discrete files, they are related and
only together create a coherent digital copy of the book. Or perhaps you have digitized a sound recording, and you 
have a group of sound files that are the tracks of the recording, a digital version of liner notes or other
information like a picture of the artist, and publication information. These are a group of sound files, images, and
text information. Or perhaps you have been collecting information posted on social media about Covid vaccination, which might include images, text captions, usernames, posting dates, and accompanying comments. Each of these things, whether a book, an audio recording, or an Instagram or Twitter post, may have multiple files that together create a digital object.

![Sample digital objects for a "book"](images/book-files.png)

![Sample files for a "website"](images/a-website.png)

Now, let's say that you want to preserve these for access later. Possibly for access by yourself, but likely for broader usage, in an archive or library of the future. Along with a general description of what the things are, you want to 
make sure that you include a list of the digital elements that you're including with the deposit. You want a complete list of the entities that comprise this object, and you also want a way to confirm that the items you've assembled are all received together by the organization that you send them to. And in five or ten years, or more, you want a way to confirm that no problems or events have compromised
the data that you're including.

## Metadata and where it fits in the lifecycle

Before going further to talk about file metadata and what it looks like in BagIt, 
let's place this in the context of the larger realm of digital curation and preservation. 
To start, briefly review the digital curation lifecycle:

![Digital Curation Lifecycle](images/lifecycle_web.png)


- In terms of what we are talking about (the inner orange and yellow circles), the activities 
under discussion here are focused largely on "digital objects," which might include things like discrete
files, groups of files, and other generally static entities; in terms of the model, this does not include databases
or other digital things that might be queried or dynamically generated.
- Along with the digital object, we expect to have some level of description. In this case,
more than information about what the thing is, think of things like filenames, file size,
date of creation, who has access to a resource, and what they can do with it. This information
may be usefully described as administrative and technical metadata; that is, who has access and
what the technical nature of the thing is. (We won't talk much about descriptive information,
in the sense of describing the content, here.)
- Moving out to the Curate/Preserve ring, most of what is going on here is done to preserve existing digital files. At the point that we are packaging files and creating technical metadata, we might assume that there has already been a choice somewhere along the line that these materials are ones that we want to manage as part of our collection, repository, or other organizational unit. 
- Finally, in the outer magenta-colored ring, which lists digital curation actions, the areas where we might most likely expect to find structures, metadata, and tools like those under discussion today, are the "Create or Receive," "Ingest," "Preservation Action," and "Store." It's possible that fixity information might be used in other areas as well, but it would likely not be as closely connected to a structure like BagIt.    

## What is Fixity?

Fixity refers to whether or not a piece of information has changed. For digital objects, the most widely used technique to establish and evaluate file fixity is the creation of "[checksums](https://en.wikipedia.org/wiki/Checksum)," which are values calculated algorithmically from the sequence of bits in a file. Various [algorithms](https://en.wikipedia.org/wiki/Cryptographic_hash_function), many of them developed in cryptography, may be used, including `MD5`, `SHA-1`, the `SHA-2` family (for example, `SHA-256`), and others. Fixity is often treated as a metadata property: the checksum is recorded as metadata, and fixity can later be assessed by recalculating the checksum and comparing it with the recorded value.

Fixity information can be useful in answering questions like: 

- "Is this file that I'm receiving the same as the one that you sent to me?" 
- "Is this file that we are storing now the same as the one that was received previously (whether a few days, months, or years later)?"

BagIt helps us to manage fixity information by defining a standard place to record checksums alongside the digital information that you are sending or storing. Tools such as the Python `bagit` library calculate and check those checksums for you.
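
To make the idea of a checksum concrete, here is a minimal sketch that uses only Python's standard `hashlib` module rather than the `bagit` library. The helper name `sha256_of_file` and the sample file name are my own, chosen for illustration. It calculates a SHA-256 checksum for a throwaway file, then shows that the checksum stays the same while the file is unchanged and differs after a single character is altered.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Work in a temporary directory so that no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "page-001.txt"  # hypothetical file name
    sample.write_bytes(b"The quick brown fox")
    checksum_when_sent = sha256_of_file(sample)

    # An unchanged file produces the same checksum every time.
    unchanged = sha256_of_file(sample) == checksum_when_sent

    # Changing a single character produces a different checksum.
    sample.write_bytes(b"The quick brown fix")
    changed = sha256_of_file(sample) != checksum_when_sent

print("same while unchanged:", unchanged)
print("different after edit:", changed)
```

Reading the file in chunks keeps memory use low for large files such as page scans or audio masters.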

## What Is BagIt?

[BagIt](https://en.wikipedia.org/wiki/BagIt) is a packaging specification used by many digital libraries to
assemble, document, transfer, and verify digital content. The idea is to place selected content
into a "bag" that conforms to the [BagIt specification](https://tools.ietf.org/html/draft-kunze-bagit-14),
which contains a basic description (bag-info.txt), a checksum manifest,
and data payload. The general concept is that the package is "self describing" (the description
acts like a "tag" on a physical bag naming the contents) and verifiable (using the
checksum manifest).

BagIt is explained and defined in a standalone Request for Comments (RFC) document from the
Internet Engineering Task Force, so it is not only used in libraries and archives but is also
specified in a robust way that is recognized by a broader professional community.

![Screenshot of the BagIt RFC 8493 page, at https://datatracker.ietf.org/doc/html/rfc8493, as of January 2022](images/bagit-rfc-8493.png)

### Using and Assembling Bags

This format has been adopted in many settings. Various digital libraries, including the
[California Digital Library](https://www.cdlib.org/cdlinfo/2008/07/02/bagit-transferring-digital-content/),
[Digital Preservation Network](https://docs.google.com/document/d/1JqKMFn9KfeIMAAEdOGQr6LZPqNWx8Qubi12uoUXi2QU/edit),
and others, have adopted the specification as a generalized format
for submission and transfer of information (a Submission Information Package, or SIP). Moreover, various tools, such as
[Exactly](https://www.weareavp.com/products/exactly/), have been developed that use the BagIt standard.

When you assemble the files, it's useful to name them according to a standard pattern,
which would be determined by the repository where you plan to store the bag.

- For example, the State Archives of North Carolina suggest that all BagIt bags should
have a top folder name that ends in `_bag` as a quick indicator that a folder
represents a bag (note that while useful, this naming convention does not guarantee
that the folder in question is a **valid** BagIt bag).
- The Library of Congress Chronicling America project receives bags created by state-level
agencies, which coordinate the scanning of microfilm and the creation of specific metadata for each issue
and newspaper title. The files are bagged (packaged) in batches corresponding to microfilm reels
and then sent to the Library of Congress, where they are validated.

Put information into a structure that lets you store it in a coherent way. For example,
it may be sufficient to have all files in one folder. Perhaps you have a series of images in both
high-resolution preservation files and lower-resolution access derivatives.
Other cases may suggest a hierarchical structure of folders. For example,
you may want a series of folders and subfolders that correspond to the chapters of a
book or other smaller semantic units; or perhaps large groupings demand deeper and
more nested hierarchies (e.g., a [pairtree](https://confluence.ucop.edu/display/Curation/PairTree) structure).

Once you have the files organized as you wish to store them, you are ready to create the 
bag. Although a relatively basic process once everything is set up, this step accomplishes 
the complicated work of generating the `bag-info.txt` file and requisite data integrity checksums.
This notebook will demonstrate how to make a bag from an existing directory, but first,
let's take a closer look at what BagIt actually looks like.

### BagIt structure

Data packaged according to the BagIt specification 1.x follows a specific structure:

```
<main folder>/
 |   bag-info.txt
 |   bagit.txt
 |   manifest-sha256.txt
 |   tagmanifest-sha256.txt
 \--- data/
       |   [payload files]
```

Viewed in a file navigator window (e.g., Finder, Explorer), a bag appears as a folder or directory, and may appear something like this:

![Image of how a BagIt bag may appear on a graphical file system viewer.](images/bag-structure.png)

The `data` subfolder contains the content. The other files in the main folder contain
information about the contents:

- `bagit.txt` identifies the version of BagIt specification  
- `bag-info.txt` contains all metadata entered by the packager
- `manifest-sha256.txt` contains a manifest of all the files in the `data` folder and their checksums
- `tagmanifest-sha256.txt` contains a manifest of the files and checksums of the contents of the main folder

The purpose of the "Bag Info" file is to transmit metadata about the package, including things like
who sent it, pointers to more robust descriptive information such as a library catalog, and 
information about the specific package like when it was bagged, size, and what tool created it.
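
To see how little is needed for this layout, the following sketch assembles a bare-bones bag by hand with the standard library only. It is an illustration of the structure, not a description of how the `bagit` library works internally. The function name `build_minimal_bag`, the folder name `demo_bag`, and the payload file names are hypothetical. The sketch writes only the declaration, the payload, and a payload manifest, and leaves out `bag-info.txt` and the tag manifest.

```python
import hashlib
import tempfile
from pathlib import Path


def build_minimal_bag(bag_dir, payload):
    """Write a bare-bones bag: declaration, payload, and SHA-256 manifest.

    `payload` maps relative file paths to bytes (an input format invented
    for this sketch).
    """
    bag_dir = Path(bag_dir)
    (bag_dir / "data").mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for rel_path, content in sorted(payload.items()):
        target = bag_dir / "data" / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        checksum = hashlib.sha256(content).hexdigest()
        # One manifest line per payload file: checksum, then the path.
        manifest_lines.append(f"{checksum}  data/{rel_path}\n")

    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "".join(manifest_lines), encoding="utf-8"
    )
    return bag_dir


with tempfile.TemporaryDirectory() as tmp:
    bag = build_minimal_bag(
        Path(tmp) / "demo_bag",
        {"book/page-001.txt": b"Chapter 1", "book/page-002.txt": b"Chapter 2"},
    )
    top_level = sorted(p.name for p in bag.iterdir())
    payload_files = sorted(
        p.relative_to(bag).as_posix()
        for p in (bag / "data").rglob("*")
        if p.is_file()
    )

print(top_level)
print(payload_files)
```

In practice you would let a bagging tool do this work, since it also records the Bag Info and the tag manifest for you.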

### BagInfo Metadata

You can think of the "Bag Info" as a name tag on the bag, which has information about who sent it and also what's in the bag. A well-formed bag should include as much information on the “tag” as possible, since this is where we can include information about the source, provenance, and description of the data. According to the BagIt specification, this information should be contained in the `bag-info.txt` file and located in the top level of the bag. 
Bag information is stored in plain text format, and metadata is structured in key:value pairs, where the label
(key) is separated from the information (value) by a colon. 
You can see these key:value pairs if you open the file, and they will look like this:

```
key1-label: value1 value
key2-label: value2 value
key3-label: value3 123
```

* Note that this data pattern is similar to how Python stores information in dictionaries ('KEY':'VALUE')
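
Because the pattern is so close to a dictionary, reading Bag Info text into one takes only a few lines. The sketch below is a simplified illustration (the function name `parse_bag_info` is my own) that splits each line at the first colon only, so that values containing colons, such as an ARK identifier, stay intact. It deliberately ignores repeated labels and values that continue over several lines.

```python
def parse_bag_info(text):
    """Parse simple `Label: value` lines into a dictionary.

    Simplification: repeated labels and multi-line values are not handled.
    """
    info = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at the FIRST colon only; the value may contain more colons.
        label, _, value = line.partition(":")
        info[label.strip()] = value.strip()
    return info


sample_text = """\
Source-Organization: Imagination Library
Contact-Name: Jesse Johnston
External-Identifier: ark:/67531/metadc488207
Bagging-Date: 2021-12-02
"""

bag_info = parse_bag_info(sample_text)
print(bag_info["External-Identifier"])
```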

The Bag Info fields are intended to be "human readable" (that is, they can be opened and read as a text
file), and the fields are not validated against any external schema by the `bagit` tools. Later on, we will display the contents of a sample `bag-info.txt` file, which may contain any labels that you want to add, but at this point it is important to note that some Bag Info labels are reserved, and they may take data from specific sources:

| Reserved label (key name) | Description | Required? | Repeatable? | Source | Example |
| --- | --- | --- | --- | --- | --- |
|<td colspan="6">**Descriptive Information (about the bag, provenance, contents, creators, reference identifiers)**</td>|
|Source-Organization: | Organization transferring the content.|||Boilerplate/Curator|Imagination Library|
|Organization-Address: | Mailing address of the source organization.|||Boilerplate/Curator| Fairbanks, Alaska|
|Contact-Name: | Person at the source organization who is responsible for the content transfer.|||Boilerplate/Curator|Jesse Johnston|
|Contact-Email: | Email address of the contact.|||Boilerplate/Curator|hello@some.email|
|External-Description: | A brief explanation of the contents and provenance.|||Customized per bag or larger collection/grouping|Digitized items from an interesting collection.|
|External-Identifier:| A sender-supplied identifier for the bag. For example, a call number, URI, or ARK identifier|||Customized per bag or larger collection/grouping|ark:/67531/metadc488207|
|Bagging-Date: | Date (YYYY-MM-DD) that the content was prepared for transfer.||NO|System|2021-12-02|
|<td colspan="6">**Technical information (bag size, number of files, etc)**</td>|
|Bag-Size: | The size or approximate size of the bag being transferred, followed by an abbreviation such as MB (megabytes), GB (gigabytes), or TB (terabytes).|NO|NO|Bagging application|42600 MB, 42.6 GB, or .043 TB|
|Payload-Oxum: |The "octetstream sum" of the payload, which is intended for the purpose of quickly detecting incomplete bags before performing checksum validation.  This is strictly an optimization, and implementations MUST perform the standard checksum validation process before proclaiming a bag to be valid. This element MUST NOT be present more than once and, if present, MUST be in the form `OctetCount.StreamCount`, where _OctetCount_ is the total number of octets (8-bit bytes) across all payload file content and _StreamCount_ is the total number of payload files.|NO|NO|Bagging application|1.1|

_Notes:_ 

- Generally, bag label (key) names should not include spaces, so you will see that individual words are separated by hyphens.
- Some of these fields should be created by a person (the digital curator preparing the bag), and some should be created automatically by the system.
- The above table is based on the RFC 8493 spec, but there is more information in the RFC. For example, there are additional reserved bag labels; the ones above were selected for demonstration purposes.
- If you are going to use BagIt a lot, it is a good idea to read the full [RFC 8493](https://datatracker.ietf.org/doc/html/rfc8493).
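
The `Payload-Oxum` value in the table is easy to calculate by hand, which can help when you want a quick sanity check on a bag. Here is a small sketch, using only the standard library, that follows the `OctetCount.StreamCount` definition above: total bytes across all payload files, a period, then the number of payload files. The function name `payload_oxum` and the file names are made up for the example.

```python
import tempfile
from pathlib import Path


def payload_oxum(data_dir):
    """Return 'OctetCount.StreamCount' for all files under data_dir."""
    files = [p for p in Path(data_dir).rglob("*") if p.is_file()]
    octet_count = sum(p.stat().st_size for p in files)
    return f"{octet_count}.{len(files)}"


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    (data_dir / "images").mkdir(parents=True)
    (data_dir / "images" / "q172.txt").write_bytes(b"hello")   # 5 bytes
    (data_dir / "metadata.csv").write_bytes(b"id,title\n")     # 9 bytes
    oxum = payload_oxum(data_dir)

print(oxum)  # 14 bytes in 2 files -> 14.2
```

A mismatch between this value and the one recorded in `bag-info.txt` tells you quickly that files are missing or truncated, without calculating any checksums.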

#### Sample baginfo

A sample BagInfo file might look like this:

```
Source-Organization: Data Curation Training Pros, via Library of Congress (LC)
Contact-Name: Anonymous
Contact-Email: hello@some.email
External-Description: These are sample files from the Library of Congress Web Archives that we wanted to structure in BagIt for practice.
External-Identifier: myfiles:documents/test/files/1234
Bagging-Date: 2022-01-08
Payload-Oxum: 26923687.23
```

### Manifests

In addition to the "tag" information in the `bag-info.txt` file, a complete BagIt object
must also include lists of the files you put in the bag. These lists, or **manifests**
in the terminology of BagIt, are titled with the word manifest, then the name of the checksum
algorithm that was used to create them. So, for example, `manifest-md5.txt` is the manifest of all
files in the `data` directory, with their checksums as generated by the MD5 algorithm.
The `bagit` tool creates this information for you, which provides a complete list of
what was in the bag when you packed it, and if someone is unpacking it in another place and time, they
should be able to use this information to confirm that the bag has what you put in it. In addition,
the tool will create checksums for each file in the bag, which means that you can also use the tool 
to check to see if any of the files have changed. Keep in mind that, should something be missing
or changed, this doesn't tell you what that might be or why, but it does tell you that you need to confirm
with the person or organization who sent the information to you. 

The manifest is a text file, and each line in the manifest represents a file in the bag.
The first item on each line is the checksum, and
the second item is the path of the corresponding file, relative to the top-level directory of the bag.

A two-file bag with the file manifest might look something like this:

```
myfirstbag/
   |
   |   manifest-md5.txt
   |    (49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png)
   |    (408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt)
   |
   |   bagit.txt
   |    (BagIt-Version: 1.0                                           )
   |    (Tag-File-Character-Encoding: UTF-8                           )
   |
   \--- data/
        |
        |   27613-h/images/q172.png
        |    (... image bytes ...                                     )
        |
        |   27613-h/images/q172.txt
        |    (... OCR text ...                                        )
```
In addition to the content files in the bag, the bag should contain a manifest for the bag tags.
This file is present to ensure that the tag information has not been changed, either
by some error in the transfer or storage process or perhaps by mistake at some point.
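
Since a manifest is plain text with one file per line, it can be read with very little code. The following sketch (the function name `parse_manifest` is my own, and this is not how the `bagit` library exposes manifests) turns the two manifest lines from the example above into a dictionary that maps each path to its checksum.

```python
def parse_manifest(text):
    """Map each file path to its checksum, given the text of a manifest."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Checksum first, then whitespace, then the path.
        # Splitting only once keeps paths that contain spaces intact.
        checksum, path = line.split(None, 1)
        entries[path.strip()] = checksum.lower()
    return entries


manifest_text = """\
49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png
408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt
"""

entries = parse_manifest(manifest_text)
for path, checksum in entries.items():
    print(path, checksum)
```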

## Setup

Now we're ready to bag some files. The second half of this notebook demonstrates how
you can use Python tools to create BagIt bags from files on your computer. If you want to follow along,
the instructions are based on the sample files in this Git repository.

This activity will use a Python library to prepare a group of sample files according to the BagIt specification. 
First, if you don't already have the bagit library installed, you may need to get it. You can run the 
following cell to install it with pip, by uncommenting the last line (remove the `#`) and then running the cell.

In [None]:
# If you don't have bagit installed, install following instructions at https://github.com/LibraryOfCongress/bagit-python
# Alternatively, you can run pip from the notebook with the line below by removing the leading # and running the cell.
# (When the command below runs, you will see response output appear below this cell as the program downloads and installs.)
#!pip install bagit

To begin this activity, set up by importing the library:

In [2]:
import bagit

In this exercise, we will also use the system to automatically generate some data about the files.
For the most part, this is all done by the `bagit` library, but this activity will also demonstrate
generating date information from the system. In practice, this might be done through manual entry,
using a date picker in a software tool, or by generating date information from the system,
as shown here. To do this, use the `date` class from Python's `datetime` module:

In [8]:
from datetime import date

## Bag the Files

The purpose of the “bag” is to create information about the file structure, basic information that can demonstrate that the information has not changed, and to provide basic context (information about where the files came from, who filed them, and what they are). It is an open specification, so there are few requirements about how the files are structured. In this case, I am taking all of the files within a specific folder, using the Python bagit library to generate the fixity information, and explaining each step throughout the rest of this notebook.

For demonstration, let's bag the files in the directory `sample-files`. To see how the directory looks now,
use a shell escape (`!`) with the `ls` command, which lists the contents of the directory. The `-F` flag adds a slash
at the end of any entry that is a directory, which is a helpful visual indicator. This is like running a
shell command from inside the notebook:

In [None]:
!ls -F sample-files/

- You should see five directories and one csv file

### Create BagInfo data

Using the Python bagit library, we can create “BagInfo” information by using a Python dictionary. This example creates a dictionary of the bag information called `my_BagInfo`, which will be passed as an argument during bag creation. If you use this code, replace the information below with the information appropriate to the project you’re working on.

The `date` class (imported earlier) will suffice to create date information. If you run this, the following block should print the current date from your system.

In [10]:
# date.today() already returns a date object, so no round trip through a string is needed
dateStamp = date.today()

print(dateStamp)

2022-01-08


Note that the above is a Python `datetime.date` object, so for purposes of our BagIt activity, it must be converted to a string:

In [None]:
type(dateStamp)

In [None]:
# convert the date stamp to a string, which should be formatted as YYYY-MM-DD by default
str(dateStamp)

In [16]:
# create baginfo data

my_BagInfo = {
    'Source-Organization': 'Data Curation Training Pros, via Library of Congress (LC)',
    'Contact-Name' : 'Anonymous', # <- type your name here
    'Contact-Email': 'hello@some.email', # <- type your email here
    'External-Description': 'These are sample files from the Library of Congress Web Archives that we wanted to structure in BagIt for practice.',
    'External-Identifier': 'myfiles:documents/test/files/1234', # <- this would be something like a call number or collection ID, if the content corresponds to a catalog description or digitized item
    'Source-URL': 'https://www.loc.gov/programs/web-archiving/about-this-program/', #this is a reference URL for the collection, in this case doesn't point to each individual file
    'Collected-Date': '2021-10-12',
    'Demonstration-Date': str(dateStamp) #string of date formatted following ISO date standard format YYYY-MM-DD
}

print('Bag Info:\n\n',
      my_BagInfo,
     '\n\nDatatype: ',type(my_BagInfo))

Bag Info:

 {'Source-Organization': 'Data Curation Training Pros, via Library of Congress (LC)', 'Contact-Name': 'Anonymous', 'Contact-Email': 'hello@some.email', 'External-Description': 'These are sample files from the Library of Congress Web Archives that we wanted to structure in BagIt for practice.', 'External-Identifier': 'myfiles:documents/test/files/1234', 'Source-URL': 'https://www.loc.gov/programs/web-archiving/about-this-program/', 'Collected-Date': '2021-10-12', 'Demonstration-Date': '2022-01-08'} 

Datatype:  <class 'dict'>
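
Before bagging, it can be worth checking a Bag Info dictionary for simple slips, such as a label typed with a space instead of a hyphen. The sketch below is one way to do that. The list of expected labels is a hypothetical local policy invented for this example (the `bagit` library does not enforce it), and the function name `check_bag_info` is my own.

```python
import re

# Hypothetical local policy: labels that an (imaginary) repository
# asks every sender to fill in.
EXPECTED_LABELS = [
    "Source-Organization",
    "Contact-Name",
    "Contact-Email",
    "External-Description",
]


def check_bag_info(bag_info, expected=EXPECTED_LABELS):
    """Return a list of human-readable problems found in a Bag Info dict."""
    problems = []
    for label in expected:
        if not str(bag_info.get(label, "")).strip():
            problems.append(f"missing or empty: {label}")
    for label in bag_info:
        if re.search(r"\s", label):
            problems.append(f"label contains whitespace: {label!r}")
    return problems


draft_info = {
    "Source-Organization": "Data Curation Training Pros, via Library of Congress (LC)",
    "Contact-Name": "Anonymous",
    "Contact Email": "hello@some.email",  # deliberate slip: space, not hyphen
    "Collected-Date": "2021-10-12",
}

problems = check_bag_info(draft_info)
for problem in problems:
    print(problem)
```

Here the mistyped label is reported twice, once as a label containing whitespace and once because the expected `Contact-Email` label is missing, alongside the absent `External-Description`.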


### Bagging the files: make_bag()

Now that we have created the basic metadata for the bag (the "Bag Info"), we can move on to “bag” the files. In this case, the files that we wanted to bag are in a directory named `sample-files`. To make a bag, the `bagit` library has a function called `make_bag()`. We can use `help()` to get information about the `make_bag()` function:

In [None]:
help(bagit.make_bag)

This displays the arguments that the function takes. The only required information is the location of the files that you want to bag (i.e., `bag_dir`), which can be provided as a file path. By default, no `bag_info` is provided, but we will provide the information created above. As the help text describes it, running the function will "convert a given directory into a bag," which is the next step.

Use the `make_bag()` function to make the bag, passing in as arguments the location of the files that we want to bag (`sample-files`) and the bag info (the `my_BagInfo` dictionary):

In [None]:
# create the bag

my_bag = bagit.make_bag('sample-files', bag_info = my_BagInfo)

Now I've created a bag, which is accessible as a python object in the `my_bag` variable. 
(More about this later!)
But before we move on, think about the structure of the BagIt object we previously discussed. 
If you created a bag out of the `sample-files` directory, how do you think it has changed? 

- What files would you expect to see in the directory now?
- What additional folder or directory might you expect to see?
- Where would you expect to find the files that were bagged?

Now, take a look at the `sample-files` directory. If the above cell ran correctly and did not return any errors, you should see changes in the `sample-files` directory. 

In [None]:
# display the contents of sample-files directory
!ls -F sample-files

- What changes do you see? 

### What's in the Bag?

To get an idea of what's in the bag, you can explore the `bag` object and its data. Using the shell list command (`ls`), check to see if the required BagIt structure and files have been created:

- Note: run shell commands from the notebook by putting an exclamation point character at the beginning of the line

In [None]:
# check to see, is this bagit? Display the contents of the sample-files directory:
!ls -F sample-files/

In [None]:
# check to see, is this bagit? Is there a data directory? (aka "payload" in the BagIt docs)
!ls -F sample-files/data/

- the `data` directory should include the files that were previously at the top level of the `sample-files` directory

In [None]:
# check to see, is this bagit? The first test is whether or not there's a bagit declaration. Do you see bagit.txt?
!cat sample-files/bagit.txt

In [None]:
# is this bagit? are there bag tags, specified in the bag-info.txt file? do they appear to be valid key:value combinations?
!cat sample-files/bag-info.txt

- Is this the same information that you put in the bag info dictionary?
- What information is here that wasn't in the `my_BagInfo` dictionary?

You can also read the file contents of the `sample-files/manifest-sha256.txt`:

In [None]:
# is this bagit? is there a manifest that lists checksums and files? 
!cat sample-files/manifest-sha256.txt

- for further description of methods for python bagit objects, see the module documentation at https://github.com/LibraryOfCongress/bagit-python  

## Content Validation: is_valid()

One benefit of using this packaging approach is that it is simple, in the sense that
it exists only as files on a disk or server and does not require any specialized software
to see the files or decompress them. In addition, this approach allows you as a digital curator,
librarian, or archivist, to receive, store, and preserve digital assets even when you may not
have all of the information about what these assets are or how they might be used. In the words
of the BagIt spec, the contents of a bag are "opaque"; that is, it is possible to verify that the
content is accurate whether or not you can display it or render it viewable or processable with software,
and whether or not the contents are subject to rights management or proprietary restrictions.

The specification and structure of a BagIt bag make it possible to check the contents 
without "seeing" them. This is made possible because we can see if the bag is **complete**,
and we can also check to see if the bag is **valid**. 

* A **complete** bag is one that has all of the required 
elements of a bag: a BagIt declaration (`bagit.txt`), a payload (the `data` directory), and a payload manifest
(the list of files and checksums, located in the top-level directory, probably called 
something like `manifest-sha256.txt`). 
* A **valid** bag is one that is complete and for which it is possible to check each file in the
payload, calculate a checksum for it, and verify that the checksum is the same as the one listed in the manifest, indicating that the contents have not changed. 
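
The difference between the two terms can be sketched with the standard library alone. This is an illustration of the definitions, not the `bagit` library's implementation, and the function names and sample bag are invented for the example. The completeness check here goes one step beyond the summary above and, following the fuller definition in RFC 8493, also requires that the files on disk and the files listed in the manifest match exactly.

```python
import hashlib
import tempfile
from pathlib import Path


def read_manifest(bag_dir):
    """Return {relative path: checksum} from manifest-sha256.txt."""
    text = (Path(bag_dir) / "manifest-sha256.txt").read_text(encoding="utf-8")
    entries = {}
    for line in text.splitlines():
        if line.strip():
            checksum, path = line.split(None, 1)
            entries[path.strip()] = checksum
    return entries


def is_complete(bag_dir):
    """Required elements exist and payload files match the manifest list."""
    bag_dir = Path(bag_dir)
    required = [bag_dir / "bagit.txt", bag_dir / "manifest-sha256.txt"]
    if not all(p.is_file() for p in required):
        return False
    if not (bag_dir / "data").is_dir():
        return False
    on_disk = {
        p.relative_to(bag_dir).as_posix()
        for p in (bag_dir / "data").rglob("*")
        if p.is_file()
    }
    return on_disk == set(read_manifest(bag_dir))


def is_valid(bag_dir):
    """Complete, and every payload checksum matches the manifest."""
    if not is_complete(bag_dir):
        return False
    return all(
        hashlib.sha256((Path(bag_dir) / path).read_bytes()).hexdigest() == checksum
        for path, checksum in read_manifest(bag_dir).items()
    )


with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp) / "demo_bag"
    (bag / "data").mkdir(parents=True)
    (bag / "data" / "notes.txt").write_bytes(b"original text")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    checksum = hashlib.sha256(b"original text").hexdigest()
    (bag / "manifest-sha256.txt").write_text(
        f"{checksum}  data/notes.txt\n", encoding="utf-8"
    )

    fresh = (is_complete(bag), is_valid(bag))      # complete and valid

    (bag / "data" / "notes.txt").write_bytes(b"modified text")
    altered = (is_complete(bag), is_valid(bag))    # complete, but not valid

    (bag / "data" / "extra.txt").write_bytes(b"")
    extended = (is_complete(bag), is_valid(bag))   # unlisted file: not complete

print(fresh, altered, extended)
```

The middle case is the interesting one: every expected file is present, so the bag is still complete, but a changed checksum means it is no longer valid.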

To assess the bag that was created above, we can again use the `bagit` library, whose `Bag` objects have an `is_valid()` method.
This method checks whether the bag is a well-formed BagIt object whose contents match its manifests. For demonstration, the next two cells use the `sample-bag-1-valid` folder, which is
an already-created bag included in the GitHub repo.

In [28]:
# load the bag
test_bag = bagit.Bag('sample-bag-1-valid/')

In [29]:
# check to see if the bag is valid
if test_bag.is_valid():
    print("yay :)")
else:
    print("boo :(")

yay :)


- what output did you get above?

In [30]:
validity = test_bag.is_valid()

print(validity, type(validity))

True <class 'bool'>


- Note that `is_valid()` returns a boolean value (True/False)
- in a script, this would allow you to do validity testing and create update or correction options

In [31]:
# what if the bag is not valid
not_a_valid_bag = bagit.Bag('sample-bag-2-invalid/')

not_a_valid_bag.is_valid()

False
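
A plain `False` tells you that something is wrong, but not what. When following up with a sender, it helps to know which files are missing, changed, or unexpected. The sketch below shows the underlying comparison using two dictionaries of paths and checksums; the function name `compare_fixity`, the file names, and the shortened checksum values are all made up for illustration. The `bagit` library has its own, more detailed validation reporting, which is described in its documentation.

```python
def compare_fixity(expected, actual):
    """Compare two {path: checksum} mappings and group the differences."""
    expected_paths, actual_paths = set(expected), set(actual)
    return {
        "missing": sorted(expected_paths - actual_paths),
        "unexpected": sorted(actual_paths - expected_paths),
        "changed": sorted(
            path
            for path in expected_paths & actual_paths
            if expected[path] != actual[path]
        ),
    }


# What the manifest says should be in the bag (made-up, shortened checksums).
expected = {
    "data/a.png": "aaa111",
    "data/b.txt": "bbb222",
    "data/c.csv": "ccc333",
}
# What was actually found and recalculated on disk.
actual = {
    "data/a.png": "aaa111",
    "data/b.txt": "fff000",
    "data/new-file.txt": "ddd444",
}

report = compare_fixity(expected, actual)
for category, paths in report.items():
    print(category, paths)
```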

**<<< Wrap Up if short on time >>>**

## Updating contents

In a working situation, for example when an object is being scanned, digital objects
and information may be frequently updated as files are added, metadata is updated, or
other changes are made to finalize content. Sometimes, this means that the payload
information may change after a bag is created. There are additional functions in the
bagit library to update Bag Info, and to update manifests. 

In [None]:
#add a file to the bag
!touch sample-files/data/new-file.txt

In [None]:
# load the changed bag by creating a Bag object with bagit.Bag()
changed_bag = bagit.Bag('sample-files')

# check validation
validation = changed_bag.is_valid()

# display result
print(validation)

### Update the bag info

Add or replace existing Bag Info metadata using the bag.info object like a Python dictionary:

In [None]:
# update bag info
changed_bag.info['Internal-Description'] = 'Updated and added new files.'

# save the updated bag object to filesystem
changed_bag.save()

- check the `bag-info.txt` file

In [None]:
!cat sample-files/bag-info.txt

### Update the bag manifests

The `.save()` method used above does not update the bag manifests. So, if you try to validate
the changed bag at this point, it will still return a `False` result. The bag is not valid yet.

In [None]:
changed_bag.is_valid()

The `save` method does not automatically update manifests because there may be
cases when you just want to update the Bag Info information. 
To update the manifests, add a `manifests=True` argument to the save call:

In [None]:
changed_bag.save(manifests=True)

Now check again to validate the bag:

In [None]:
changed_bag.is_valid()

- What was your result? What result did you expect? 
- If you did not get the result you expected, can you trace back in the cells to see what happened?

**<<< Skip if short on time >>>**

### BagIt tool functions allow for more options

It's also possible to inspect the bag through the `my_bag` object that was created above. For example, iterate over `my_bag.entries`, a dictionary that maps each file path to its fixity values, to display a list of the files and their checksums. The output should be
similar to the earlier cell that showed the contents of the SHA-256 manifest.

In [None]:
# retrieve and display path and fixity information from the bagit python object my_bag

# number the entries starting at 1 and print each checksum with its path
for line_count, (path, fixity) in enumerate(my_bag.entries.items(), start=1):
    print(f"{line_count}. sha256:{fixity['sha256']} path:{path}")
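
Each `sha256` value above is the hex digest of a file's bytes, so a single entry can be recomputed independently with the standard library. In this minimal sketch, `sha256_of` is a hypothetical helper, and the commented comparison assumes `bag_dir` is the directory that `my_bag` was loaded from.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Comparing recorded entries with fresh digests might look like this,
# where bag_dir is the directory my_bag was loaded from (assumption):
# for rel_path, fixity in my_bag.entries.items():
#     print(rel_path, sha256_of(f"{bag_dir}/{rel_path}") == fixity["sha256"])
```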

A more extensive lesson on this topic would include further explanation of tools
within `bagit` that a digital curator may use to check bags, how to research
errors that may occur, and how to update bag manifests when content is changed.

## Conclusion

The above activity showed the steps in creating fixity and basic descriptive information - **metadata** - for this group of files. Using an agreed-upon file packaging specification, like BagIt, allows digital curators 
to create information packages that contain basic information about the contents, and can 
help organizations exchanging content to ensure that the content that was sent was the content that was received.
Moreover, keeping this information together gives a repository, its maintainers, and its users
some assurance that the information held now is the same as what was originally received.

### Strengths

- Adopted and used within many digital collection workflows in major libraries, including the Library of Congress, California Digital Library, large research libraries, and some state and government archives.
- Can be easily opened and understood by standard operating systems and software on the Web, on desktop computers, laptops, and networked systems.
- Relatively simple to update, add, and delete files and information prior to transfer.
- Does not require or rely upon file encoding, compression, or proprietary software to create or to open.
- Reliably creates and confirms the fixity of information and the completeness of digital objects in a group, according to accompanying information.
- The file structure helps to group descriptive metadata with the content it describes, using standard filesystem tools available on most operating systems (Windows, Mac, Linux). 
- Content transmitted and received can easily be opened with standard tools, which could be useful for providing information to patrons and users. 

### Weaknesses

- Does not work well in a production or processing environment. If you are actively creating files, or adding them
to a folder, you don't want to create and update BagIt information every time you make a change. 
- The information in `bag-info.txt` is uncontrolled. While the specification does recommend some labels,
there is considerable room for ambiguity or confusion when different units or organizations
create metadata or bags in various workflows. 
- For robust projects with highly structured and consistent
packages (like the Chronicling America project), bag validation must be accompanied by additional quality 
checks to understand if content is, indeed, "complete" (beyond just matching what was sent). In other words, if you 
are sent incomplete or faulty data, BagIt is not a tool that will help you solve that problem, though it might
(or might not) be one that can help you to identify the problem.
- If used in a digital repository workflow, additional documentation and strategies must accompany the process (see below for how some libraries have managed this sort of distributed environment).

![Data Integrity guidance at Library of Congress, screenshot from January 2022, which illustrates data integrity monitoring process that relies on BagIt, see https://www.loc.gov/programs/digital-collections-management/inventory-and-custody/data-integrity-management/](images/loc-data-integrity.png)

## Resources

See these additional resources for more detailed information:
* B. Lazorchak, ["From There to Here, from Here to There, Digital Content is Everywhere!"](https://blogs.loc.gov/thesignal/2012/01/from-there-to-here-from-here-to-there-digital-content-is-everywhere/), _The Signal_ (3 January 2012).
* State Archives of North Carolina, "[Bagger GUI User Guide](https://files.nc.gov/dncr-archives/documents/files/using_bagger.pdf)" (Updated 2012, v. 1.5), available as of March 2018.
* M. Phillips, ["What do we put in our BagIt bag-info.txt files?"](https://vphill.com/journal/post/4142/) (2015).
* UNT Libraries, ["UNT OAIS Information Package Specification"](https://www.library.unt.edu/sites/default/files/documents/digital-libraries-uploads/Appendix_M_UNT_Libraries_OAIS_Information_Package_Specification.pdf) (2015).