# Using Bagit

This notebook contains all of the steps to use the Python `bagit` library
to create a valid BagIt bag.

## Setup

This notebook demonstrates how to create a BagIt bag from files on your computer,
using a Python module called `bagit`. If you want to follow along,
the instructions explain the process, step by step, for a folder of sample files in this Git repository. 

If you don't already have the `bagit` library installed, install it first. You can run the 
following cell to install it with pip by uncommenting the last line (remove the `#`) and then running the cell.

In [None]:
# If you don't have bagit installed, install following instructions at https://github.com/LibraryOfCongress/bagit-python
# Alternatively, you can use the magic command on the line below by removing the hashtag and running the cell.
# (When the command below runs, you will see response output appear below this cell as the program downloads and installs.)
#!pip install bagit

To begin this activity, set up by importing the library:

In [None]:
import bagit

In this exercise, we will also use the system to automatically generate some data about the files.
For the most part, this is done by the `bagit` library, but this activity will also demonstrate
generating date information from the system. In practice, dates might be entered manually,
chosen with a date picker in a software tool, or generated from the system,
as shown here. To do this, use the `date` class from Python's `datetime` module:

In [None]:
from datetime import date

## Bag the Files

The purpose of the “bag” is to record information about the file structure, to capture fixity information that can demonstrate that the files have not changed, and to provide basic context (information about where the files came from, who filed them, and what they are). BagIt is an open specification with few requirements about how the payload files themselves are structured. In this case, I am taking all of the files within a specific folder, using the Python `bagit` library to generate the fixity information, and explaining each step throughout the rest of this notebook.

For demonstration, let's bag the files in the directory `sample-files`. To see how the directory looks now,
use the notebook's shell escape (`!`) with the `ls` command, which lists the contents of the directory. The `-F` flag adds a slash 
at the end of any entry that is a directory, which is a helpful visual indicator. This is like running a 
shell command from inside the notebook:

In [None]:
!ls -F sample-files/

- You should see five folders and one CSV file

### Create BagInfo data

Using the Python `bagit` library, we can create “BagInfo” information by using a Python dictionary. This example creates a dictionary of the bag information called `my_BagInfo`, which will be passed as an argument during bag creation. If you use this code, replace the information below with the information appropriate to the project you’re working on.

### Small bonus: automate date info

The `date` class (imported earlier) is sufficient to create date information. Running the following cell should print the current date from your system. 

In [None]:
dateStamp = date.today()

print(dateStamp)

Note that the above is a Python `date` object (from the `datetime` module), so for purposes of our BagIt activity, it must be converted to a string:

In [None]:
type(dateStamp)

In [None]:
# convert the date stamp to a string, which should be formatted as YYYY-MM-DD by default
str(dateStamp)
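As a small illustration of the conversion above, the following sketch compares `str()` with the `isoformat()` method of a `date` object, which produce the same text. The fixed date and the variable names (`sample_day`, `as_text`, `as_iso`) are assumptions chosen for the example so that the result is reproducible; they are not part of the notebook's workflow.

```python
from datetime import date

# A fixed date is used here so the result is reproducible;
# in the notebook you would use date.today() instead.
sample_day = date(2021, 10, 12)

# str() and isoformat() both give the ISO-style YYYY-MM-DD text for a date
as_text = str(sample_day)
as_iso = sample_day.isoformat()

print(as_text)
print(as_iso == as_text)
```

Either form can be used for the bag info value; `isoformat()` simply makes the intended format explicit in the code.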

In [None]:
# create baginfo data

my_BagInfo = {
    'Source-Organization': 'Data Curation Training Pros, via Library of Congress (LC)',
    'Contact-Name' : 'Anonymous', # <- type your name here
    'Contact-Email': 'hello@some.email', # <- type your email here
    'External-Description': 'These are sample files from the Library of Congress Web Archives that we wanted to structure in BagIt for practice.',
    'External-Identifier': 'myfiles:documents/test/files/1234', # <- this would be something like a call number or collection ID, if the content corresponds to a catalog description or digitized item
    'Source-URL': 'https://www.loc.gov/programs/web-archiving/about-this-program/', #this is a reference URL for the collection, in this case doesn't point to each individual file
    'Collected-Date': '2021-10-12',
    'Demonstration-Date': str(dateStamp) #string of date formatted following ISO date standard format YYYY-MM-DD
}

print('Bag Info:\n\n',
      my_BagInfo,
     '\n\nDatatype: ',type(my_BagInfo))

### Bagging the files: make_bag()

Now that we have created the basic metadata for the bag (the "Bag Info"), we can move on to “bag” the files. In this case, the files that we wanted to bag are in a directory named `sample-files`. To make a bag, the `bagit` library has a function called `make_bag()`. We can use `help()` to get information about the `make_bag()` function:

In [None]:
help(bagit.make_bag)

This displays what arguments the function takes. The only required information is the location of the files that you want to bag (i.e., `bag_dir`), which can be provided as a file path. By default, no `bag_info` is provided, but we will provide the information created above. As the help text describes it, running the function will "convert a given directory into a bag," which is the next step.

So, use the `make_bag()` function to make the bag, passing in as arguments the location of the files that we want to bag (`sample-files`) and the bag info (the `my_BagInfo` dictionary):

In [None]:
# create the bag

try:
    my_bag = bagit.make_bag('sample-files', bag_info=my_BagInfo)
    print('Success!')
except Exception as err:
    # show why bagging failed instead of hiding the underlying error
    print('No bag was created :(', err)

If the cell runs and you don't see the error message, a bag was created,
which is accessible as a Python object in the `my_bag` variable. 
(More about this later!)
But before we move on, think about the structure of the BagIt object we previously discussed. 
If you created a bag out of the `sample-files` directory, how do you think it has changed? 

- What files would you expect to see in the directory now?
- What additional folder or directory might you expect to see?
- Where would you expect to find the files that were bagged?

Now, take a look at the `sample-files` directory. If the above cell ran correctly and did not return any errors, you should see changes in the `sample-files` directory. 

In [None]:
# display the contents of sample-files directory
!ls sample-files

- What changes do you see? 

### What's in the Bag?

To get an idea what's in the bag, you can explore the `bag` object and its data. Use the shell list command (`ls`) to see if the required bagit structure and files have been created:  

- Note: run shell commands from the notebook by putting an exclamation point character at the beginning of the line

In [None]:
# check to see, is this bagit? Display the contents of the sample-files directory:
!ls sample-files/

In [None]:
# check to see, is this bagit? First test is whether or not there's a bagit declaration. Do you see bagit.txt?
!cat sample-files/bagit.txt

In [None]:
# is this bagit? are there bag tags, specified in the bag-info.txt file? do they appear to be valid key:value combinations?
!cat sample-files/bag-info.txt

- Is this the same information that you put in the bag info dictionary?
- What information is here that wasn't in the `my_BagInfo` dictionary?
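To compare the file against the dictionary programmatically, the text can be read back into a Python dictionary. The following is a minimal sketch, assuming the file consists of `Label: value` lines where an indented line continues the previous value. The function name `parse_bag_info` and the sample text are hypothetical, and the sketch keeps only the last value when a label is repeated; it is not how the `bagit` library itself reads the file.

```python
def parse_bag_info(text):
    """Parse 'Label: value' lines into a dict.

    Lines that begin with whitespace are treated as a continuation
    of the previous value. Repeated labels overwrite earlier ones.
    """
    info = {}
    last_label = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] in " \t" and last_label is not None:
            # continuation line: append to the previous value
            info[last_label] += " " + line.strip()
            continue
        # split on the first colon only, so URLs in values stay intact
        label, _, value = line.partition(":")
        last_label = label.strip()
        info[last_label] = value.strip()
    return info


# Hypothetical sample text, modeled on the labels used in this notebook
sample_text = (
    "Contact-Name: Anonymous\n"
    "Source-URL: https://www.loc.gov/programs/web-archiving/about-this-program/\n"
    "External-Description: These are sample files\n"
    "  structured in BagIt for practice.\n"
)

parsed = parse_bag_info(sample_text)
print(parsed["Contact-Name"])
print(parsed["External-Description"])
```

With the real file, the same function could be applied to the text of `sample-files/bag-info.txt`, and the resulting keys compared with `my_BagInfo.keys()`.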

You can also read the file contents of the `sample-files/manifest-sha256.txt`:

In [None]:
# is this bagit? is there a manifest that lists checksums and files? 
!cat sample-files/manifest-sha256.txt
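Each line of the manifest pairs a checksum with a file path relative to the bag's top-level directory. To show how that pairing supports fixity checking, here is a minimal sketch that recomputes SHA-256 checksums with Python's standard `hashlib` module and compares them to the manifest. The helper names (`sha256_of`, `find_mismatches`) and the tiny stand-in bag built in a temporary directory are assumptions made for illustration; the sketch does not handle missing files or encoded characters in paths, and it is not the `bagit` library's own validation code.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(bag_dir):
    """Return payload paths whose current checksum differs from the manifest."""
    bag_dir = Path(bag_dir)
    mismatches = []
    manifest = bag_dir / "manifest-sha256.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # each manifest line is "<checksum> <relative path>"
        expected, relative_path = line.split(maxsplit=1)
        if sha256_of(bag_dir / relative_path) != expected:
            mismatches.append(relative_path)
    return mismatches


# Build a tiny stand-in bag in a temporary directory (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp)
    (bag / "data").mkdir()
    payload = bag / "data" / "note.txt"
    payload.write_text("hello bag\n", encoding="utf-8")
    (bag / "manifest-sha256.txt").write_text(
        f"{sha256_of(payload)}  data/note.txt\n", encoding="utf-8"
    )

    before_change = find_mismatches(bag)   # nothing has changed yet
    payload.write_text("changed\n", encoding="utf-8")
    after_change = find_mismatches(bag)    # the edited file no longer matches

print(before_change)
print(after_change)
```

The same idea applies to the real bag: if a file under `sample-files/data/` is altered after bagging, its recomputed checksum will no longer match the line recorded in `manifest-sha256.txt`.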

In [None]:
# check to see, is this bagit? Is there a data directory? (aka "payload" in the BagIt docs)
!ls sample-files/data/

- the `data` directory should include the files and folders that were at the top level of `sample-files` before it was bagged

- for further description of methods for python bagit objects, see the module documentation at https://github.com/LibraryOfCongress/bagit-python  

A more extensive lesson on this topic would include further explanation of tools
within `bagit` that a digital curator may use to check bags, how to research
errors that may occur, and how to update bag manifests when content is changed.

## Conclusion

The above activity showed the steps in creating fixity and basic descriptive information (**metadata**) for this group of files. Using an agreed-upon file packaging specification, like BagIt, allows digital curators 
to create information packages that contain basic information about the contents, and can 
help organizations exchanging content to ensure that the content that was sent was the content that was received.
Moreover, keeping this information together gives a repository, its maintainers, and its users 
some assurance that information received now is the same as that originally received.

### Strengths

- Adopted and used within many digital collection workflows in major libraries, including the Library of Congress, California Digital Library, large research libraries, and some state and government archives.
- Can be easily opened and understood by standard operating systems and software on the Web, on desktop computers, laptops, and networked systems.
- Relatively simple to update, add, and delete files and information prior to transfer
- Does not require or rely upon file encoding, compression, or proprietary software to create or to open
- Can reliably create and confirm the fixity of information and the completeness of digital objects in a group, according to accompanying information.
- The file structure helps to group descriptive metadata with the content it describes, using standard filesystem tools available on most operating systems (Windows, Mac, Linux). 
- Content transmitted and received can easily be opened with standard tools, which could be useful for providing information to patrons and users. 

### Weaknesses

- Does not work well in a production or processing environment. If you are actively creating files, or adding them
to a folder, you don't want to create and update BagIt information every time you make a change. 
- The information in `bag-info.txt` is uncontrolled. While the specification does recommend some labels,
there is considerable room for ambiguity or confusion when different units or organizations
are creating metadata or bags in various workflows. 
- For robust projects with highly structured and consistent
packages (like the Chronicling America project), bag validation must be accompanied by additional quality 
checks to understand if content is, indeed, "complete" (beyond just matching what was sent). In other words, if you 
are sent incomplete or faulty data, BagIt is not a tool that will help you solve that problem, though it might
(or might not) be one that can help you to identify the problem.
- If used in a digital repository workflow, additional documentation and strategies must accompany the process (see below for how some libraries have managed this sort of distributed environment).

![Data Integrity guidance at Library of Congress, screenshot from January 2022, which illustrates data integrity monitoring process that relies on BagIt, see https://www.loc.gov/programs/digital-collections-management/inventory-and-custody/data-integrity-management/](images/loc-data-integrity.png)

## Resources

See these additional resources for more detailed information:
* B. Lazorchak, ["From There to Here, from Here to There, Digital Content is Everywhere!"](https://blogs.loc.gov/thesignal/2012/01/from-there-to-here-from-here-to-there-digital-content-is-everywhere/), _The Signal_ (3 January 2012).
* State Archives of North Carolina, "[Bagger GUI User Guide](https://files.nc.gov/dncr-archives/documents/files/using_bagger.pdf)" (Updated 2012, v. 1.5), available as of March 2018.
* M. Phillips, ["What do we put in our BagIt bag-info.txt files?"](https://vphill.com/journal/post/4142/) (2015).
* UNT Libraries, UNT OAIS Information Package Specification (2015), https://www.library.unt.edu/sites/default/files/documents/digital-libraries-uploads/Appendix_M_UNT_Libraries_OAIS_Information_Package_Specification.pdf