# Using BagIt

This notebook is set up with blanks for a teaching demo. If you want to see the 
entire notebook, with examples of how each cell looks when it has completed,
see [a completed Jupyter notebook example here](https://github.com/morskyjezek/bagit-walkthrough-lcwa/blob/main/01a-using-bagit-completed.ipynb).

## Learning Objectives

After completing this lesson, students should be able to:

* Implement the BagIt specification, and examine the file system to see how a BagIt object is structured.
* Identify and use shell tools (`ls`, `cat`) to conduct initial checking and validation of a BagIt object.
* Use the `bagit` Python module to create a BagIt bag, which includes fixity, manifest, and basic descriptive information.

## Some Python Assumptions

- On your computer, Python is working and you know how to access it and run it
- While you can run these tools from a regular `.py` file, this demonstration is built as a "Jupyter notebook," a format made up of a series of boxes ("cells") that can hold runnable Python code, show the output of that code, or display text and visuals for explanation (you don't need to be an expert in this system)
- You know what a Python library is and how to install it
- You have a general understanding of how to use JSON in Python (no need to be an expert)
- You understand that Python treats dates and times as special data types (as do most computer systems)
- If you want to run this demo later in VSCode, you will need to install and activate the "[Jupyter](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter)" extension

## Setup

Now let's look into how we can create a BagIt object for some sample files. 
This notebook will demonstrate how 
to do that using a Python module called `bagit`, from files on your computer. If you want to follow this notebook,
the instructions explain the process, step by step, for a folder of sample files in this Git repository. 

If you don't already have the bagit library installed, you will need to install it. You can run the 
following cell to install it with pip, by uncommenting the last line (remove the `#`) and then running the cell.

In [None]:
# If you don't have bagit installed, install following instructions at https://github.com/LibraryOfCongress/bagit-python
# Alternatively, you can use the magic command on the line below by removing the hashtag and running the cell.
# (When the command below runs, you will see response output appear below this cell as the program downloads and installs.)
#!pip install bagit

To begin, import the bagit library:

In [None]:
# import the bagit library


If you want to explore the "dynamic" creation of dates, as shown below, you will also need the `date` class from the `datetime` module:

In [None]:
# to demonstrate automated creation of metadata, also import the date class


You will also need the `json` library so that you can import the descriptive metadata template,
which is saved as a `.json` file. 😺

## Bag the Files

Now, you will work to transform the `PKG-legacy-files` directory into a BagIt information package. In general, these are the steps: 

1. Look around: what's there now?
2. Create descriptive metadata ("BagInfo")
3. Bag the Files (`make_bag()`)
4. Look around: what happened?

### 1. Look at what's there

We will use the bagit tool to create a valid BagIt object from the directory called `PKG-legacy-files`. First, take a look at what's in this directory.

- Note: run shell commands from the notebook by putting an exclamation point character at the beginning of the line

In [None]:
# use the list command to see what's there


The output of the above cell depends on what you ask the list command to list.
If you are listing `PKG-legacy-files`:

- You should see nine files of various formats

But for the lab, when you are working on `PKG-web-files-small`:

- You should see five folders and one CSV file

### 2. Create Descriptive Metadata (BagInfo)

The bagit library helps to create basic descriptive information called "BagInfo," which is stored in a file called `bag-info.txt`. In the Python environment, this information is held in a Python dictionary, which is later written to `bag-info.txt` when the bag is created. This example uses a variable called `my_BagInfo` for the bag information. Once this metadata is created, it will be added automatically during bag creation. If you use the code below, replace the placeholders in the bag information with information appropriate to the project you're working on.

This demonstration imports the bag information template from the `bag-info-template.json` file.

In [None]:
my_BagInfo
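
One way to fill in the import step is sketched below. It assumes that `bag-info-template.json` sits in the same folder as the notebook and contains a single JSON object; the helper name `load_bag_info` is made up for this illustration and is not part of `bagit`.

```python
import json
from pathlib import Path

# Assumed location: the template sits next to the notebook.
TEMPLATE_PATH = Path("bag-info-template.json")


def load_bag_info(path):
    """Read a JSON file and return its contents as a Python dictionary."""
    with open(path, encoding="utf-8") as template_file:
        bag_info = json.load(template_file)
    if not isinstance(bag_info, dict):
        raise ValueError(f"{path} does not contain a JSON object")
    return bag_info


# Only load the template if it is present, so the cell runs either way.
if TEMPLATE_PATH.exists():
    my_BagInfo = load_bag_info(TEMPLATE_PATH)
    print(my_BagInfo)
else:
    print(f"{TEMPLATE_PATH} not found; use the fallback cell instead")
```

Because `json.load()` turns a JSON object into a `dict`, the result can be edited like any other dictionary, for example `my_BagInfo["Contact-Name"] = "Your Name"`.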

Update the `my_BagInfo` information: add your name!

_Use the following cell if you are having trouble importing the template from the JSON file._

In [None]:
# create baginfo data; if you can't import the JSON file, 
# run this cell after you remove the three ticks above and below the baginfo
# and assign the dictionary to the my_BagInfo variable
'''
{
    "Source-Organization": "Data Curation Training Pros, via Library of Congress (LC)",
    "Contact-Name" : "TYPE YOUR NAME HERE",
    "Contact-Email": "your@email.here",
    "External-Description": "These are sample files from the Library of Congress Web Archives that we wanted to structure in BagIt for practice.",
    "External-Identifier": "myfiles:documents/test/files/1234",
    "Source-URL": "https://www.loc.gov/programs/web-archiving/about-this-program/",
    "Collected-Date": "2021-10-12"
}
'''

Confirm the bag information:

In [None]:
my_BagInfo

In [None]:
print('Datatype: ',type(my_BagInfo))

### Automate date info

The `date` class (imported earlier) is enough to create date information. If you call its `.today()` method, Python returns the current date according to your system clock.

In [None]:
# create the dateStamp variable, check the type


Then, add the date to the bag information under a key called `Demonstration-Date`.

And confirm that the information has been added.

In [None]:
my_BagInfo
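
The date steps above could look roughly like the following sketch. The small `my_BagInfo` dictionary here is only a stand-in so that the example runs on its own, and converting the date to an ISO 8601 string is a choice made for this illustration: it keeps the dictionary made up of plain text values and makes the output format explicit.

```python
from datetime import date

# Stand-in dictionary so the sketch runs on its own; in the notebook,
# my_BagInfo already holds the values from the template.
my_BagInfo = {"Contact-Name": "TYPE YOUR NAME HERE"}

# date.today() returns a datetime.date object, not a string
dateStamp = date.today()
print(type(dateStamp))

# Store the date as text in YYYY-MM-DD form under the new key
my_BagInfo["Demonstration-Date"] = dateStamp.isoformat()
print(my_BagInfo["Demonstration-Date"])
```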

### 3. Bagging the Files

The bagit module includes a function called `make_bag()` to create BagIt objects from a specific path or directory. We will set up the function by providing as arguments the location of the files that we want to bag (`PKG-legacy-files`), with the `bag_info` option to create unique descriptive information using the `my_BagInfo` dictionary:

In [None]:
# create the bag; note that the tool does not give feedback, so use a try/except 
# to create the effect of giving a response message
try:
    # <-- insert make_bag() here
    print('Success!')
except Exception as err:
    print('No bag created 😿', err)

If the cell runs and you don't see the error message, this created a bag,
or to put this in digital curation terms, an information package that conforms to the BagIt specification.
This has actually changed the files on your disk so they are now a BagIt bag.
Think about the structure of the BagIt object we previously discussed. 
How would you expect the directory to have changed? 

- What files would you expect to see now in the directory that was converted into a BagIt structure?
- What additional folder or directory might you expect to see?
- Where would you expect to find the files that were bagged?

Now, take a look at the directory. If the above cell ran correctly and did not return any errors, you should see the changes we discussed.

Note that the bag is now also accessible as a Python object, provided you assigned the result of `make_bag()` to a variable such as `my_bag`. 
(That can be useful if you are validating or updating information, but more about that on another day!)
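
As a rough sketch of the bagging step, the example below wraps `bagit.make_bag()` in a helper and tries it on a throwaway directory instead of the real sample files. The helper name `bag_directory`, the practice directory, and its single file are all invented for this illustration, and the import is guarded so that the cell prints a message when `bagit` is not installed rather than failing.

```python
import tempfile
from pathlib import Path

try:
    import bagit
except ImportError:  # the library is not installed in this environment
    bagit = None


def bag_directory(directory, bag_info):
    """Convert `directory` into a BagIt bag in place and return the Bag object.

    Returns None (after printing a message) if bagging is not possible.
    """
    if bagit is None:
        print("bagit is not installed; run `pip install bagit` first")
        return None
    try:
        bag = bagit.make_bag(str(directory), bag_info=bag_info)
    except Exception as err:  # report the reason instead of hiding it
        print(f"No bag created: {err}")
        return None
    print("Success!")
    return bag


# Try it on a throwaway directory rather than on the real sample files.
with tempfile.TemporaryDirectory() as scratch:
    practice_dir = Path(scratch) / "practice-files"
    practice_dir.mkdir()
    (practice_dir / "hello.txt").write_text("hello, bag\n", encoding="utf-8")
    my_bag = bag_directory(practice_dir, {"Contact-Name": "TYPE YOUR NAME HERE"})
    if my_bag is not None:
        # List what the directory holds after bagging
        print(sorted(entry.name for entry in practice_dir.iterdir()))
```

Practising on a temporary directory is a deliberate choice here: `make_bag()` rearranges the target directory on disk, so a scratch copy lets you experiment without touching `PKG-legacy-files`.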

In [None]:
# display the contents of PKG-legacy-files directory


- What changes do you see? 

### Step by Step: What's in the Bag?

To get an idea of whether this is a complete bag, you can explore the BagIt object and its data using shell commands: 

* Use the shell list command (`ls`) to see if the required bagit structure and files have been created
* Use the `cat` command to display the contents of a file
* Use the `wc` command to count bytes, words, or lines of a file

_Hint: remember that you can use the `!` at the beginning of a line to run a shell program within the notebook._
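
If you prefer to stay in Python, the same first-pass checks can be sketched with `pathlib`. This is only a layout check under simple assumptions, not a validation of the bag: it looks for the bag declaration, a payload manifest, and the `data` directory, plus `bag-info.txt`, which the BagIt specification treats as optional. The function name `check_bag_layout` is made up for this example.

```python
from pathlib import Path


def check_bag_layout(bag_dir):
    """Report which of the expected top-level BagIt entries are present."""
    bag_dir = Path(bag_dir)
    return {
        "bagit.txt": (bag_dir / "bagit.txt").is_file(),
        "bag-info.txt": (bag_dir / "bag-info.txt").is_file(),
        "payload manifest": any(bag_dir.glob("manifest-*.txt")),
        "data directory": (bag_dir / "data").is_dir(),
    }


# Every entry is reported as missing if the directory has not been bagged yet.
for entry, present in check_bag_layout("PKG-legacy-files").items():
    print(f"{entry}: {'found' if present else 'missing'}")
```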

In [None]:
# check to see, is this bagit? Display the contents of the PKG-legacy-files directory:


In [None]:
# check to see, is this bagit? First test is whether or not there's a bagit declaration. do you see bagit.txt?


In [None]:
# is this bagit? are there bag tags, specified in the bag-info.txt file? do they appear to be valid key:value combinations?


- Is this the same information that you put in the bag info dictionary?
- What information is here that wasn't in the `my_BagInfo` dictionary?
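
To compare the file with the dictionary programmatically, a small parser like the one sketched below may help. It assumes the common `Label: value` layout with indented continuation lines, and it keeps only the last value when a label is repeated, so treat it as a teaching aid and not as a complete reader. The names `read_bag_info` and `added_labels` are hypothetical, and the short `my_BagInfo` dictionary is a stand-in.

```python
from pathlib import Path


def read_bag_info(path):
    """Parse a bag-info.txt file into a dictionary of label -> value."""
    tags = {}
    label = None
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and label is not None:
            # Indented lines continue the previous value.
            tags[label] += " " + line.strip()
        elif ":" in line:
            label, value = line.split(":", 1)
            label = label.strip()
            tags[label] = value.strip()
    return tags


def added_labels(original, from_file):
    """Return labels that appear in the file but not in the original dictionary."""
    return sorted(set(from_file) - set(original))


# Stand-in for the dictionary used when the bag was created.
my_BagInfo = {"Contact-Name": "TYPE YOUR NAME HERE"}

info_path = Path("PKG-legacy-files") / "bag-info.txt"
if info_path.exists():
    print(added_labels(my_BagInfo, read_bag_info(info_path)))
```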

You can also read the contents of the `PKG-legacy-files/manifest-sha256.txt` file:

In [None]:
# is this bagit? is there a manifest that lists checksums and files? how many lines?
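
Each manifest line pairs a checksum with a file path, so fixity can be spot-checked by recomputing the checksums. The sketch below uses only `hashlib` from the standard library; the function name `verify_manifest` is invented for this example, it reads each file fully into memory (fine for small samples), and it does not decode percent-encoded characters in file paths.

```python
import hashlib
from pathlib import Path


def verify_manifest(bag_dir, manifest_name="manifest-sha256.txt"):
    """Recompute SHA-256 checksums and compare them with the manifest.

    Returns a list of (relative_path, matches) tuples.
    """
    bag_dir = Path(bag_dir)
    results = []
    manifest_text = (bag_dir / manifest_name).read_text(encoding="utf-8")
    for line in manifest_text.splitlines():
        if not line.strip():
            continue
        # Each line holds the checksum, whitespace, then the file path
        expected, relative_path = line.split(maxsplit=1)
        digest = hashlib.sha256((bag_dir / relative_path).read_bytes()).hexdigest()
        results.append((relative_path, digest == expected.lower()))
    return results


bag_path = Path("PKG-legacy-files")
if (bag_path / "manifest-sha256.txt").exists():
    for name, matches in verify_manifest(bag_path):
        print("OK  " if matches else "FAIL", name)
```

The number of tuples returned should equal the number of lines in the manifest, which is the same count that `wc -l` reports.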


In [None]:
# check to see, is this bagit? Is there a data directory? (aka "payload" in the BagIt docs)


- the `data` directory should contain the files that were originally at the top level of the `PKG-legacy-files` directory

- for further description of methods for python bagit objects, see the module documentation at https://github.com/LibraryOfCongress/bagit-python  

A more extensive lesson on this topic would include further explanation of tools
within `bagit` that a digital curator may use to check bags, how to research
errors that may occur, and how to update bag manifests when content is changed.

## Conclusion

The above activity demonstrates the steps to create fixity information, file manifests, and associated descriptive information - **basic preservation metadata** - for a group of files. Using an agreed-upon file packaging specification, like BagIt, allows digital curators 
to create information packages that contain basic information about the contents, and can 
help organizations exchanging content to ensure that the content that was sent was the content that was received.
Moreover, keeping this information together gives a repository, its maintainers, and its users 
some assurance that information received now is the same as that originally received.

## Resources

See these additional resources for more detailed information:
* B. Lazorchak, ["From There to Here, from Here to There, Digital Content is Everywhere!"](https://blogs.loc.gov/thesignal/2012/01/from-there-to-here-from-here-to-there-digital-content-is-everywhere/), _The Signal_ (3 January 2012).
* State Archives of North Carolina, "[Bagger GUI User Guide](https://files.nc.gov/dncr-archives/documents/files/using_bagger.pdf)" (Updated 2012, v. 1.5), available as of March 2018.
* M. Phillips, ["What do we put in our BagIt bag-info.txt files?"](https://vphill.com/journal/post/4142/) (2015).
* UNT Libraries, UNT OAIS Information Package Specification (2015), https://www.library.unt.edu/sites/default/files/documents/digital-libraries-uploads/Appendix_M_UNT_Libraries_OAIS_Information_Package_Specification.pdf