# 01. Inventory and metadata

This file demonstrates how to use Python and the bagit module
to create metadata about a group of files, inventory the files,
create fixity information for each of the original files, and
then create a BagIt bag for the files.

**Note that this operation has already been performed in this git repository.
There is _no need to perform these steps again_, since that would create another layer
of bag metadata.**

See the reference for the BagIt Python tool at https://github.com/LibraryOfCongress/bagit-python.

To begin, make sure that the bagit module for Python is installed on your system.

## Setup

In [17]:
# If you don't have bagit installed, install following instructions at https://github.com/LibraryOfCongress/bagit-python
# Alternatively, you can use the magic command on the line below by removing the leading hash sign (#) and running the cell.
# (When the command below runs, you will see response output appear below this cell as the program downloads and installs.)
#!pip install bagit

When the bagit module is ready, import it:

In [3]:
import bagit

## Bag the files

The purpose of the "bag" is to record the file structure, fixity information that can demonstrate that the files have not changed, and basic context (where the files came from, who filed them, and what they are). BagIt is an open specification that places few requirements on how the files themselves are structured. In this case, I am taking all of the files within a specific folder and using the Python bagit library to generate the fixity information. Each step is explained throughout the rest of this notebook.
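
To make "fixity information" concrete, here is a minimal sketch that computes a SHA-256 checksum using only Python's standard `hashlib` module. The helper name `sha256_of_file` and the sample file are hypothetical and are not part of bagit, which computes its own checksums when it creates a bag. The sketch only illustrates the idea that changing a single byte of a file produces a different checksum.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstrate on a throwaway file rather than on the real data.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "wb") as handle:
        handle.write(b"id,title\n1,Example grant\n")
    checksum_before = sha256_of_file(sample_path)

    # Append one byte and recompute: the checksum no longer matches.
    with open(sample_path, "ab") as handle:
        handle.write(b"x")
    checksum_after = sha256_of_file(sample_path)

print("unchanged?", checksum_before == checksum_after)
```

Comparing a stored checksum with a freshly computed one is the basic check that a bag's manifest makes possible.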

A well-formed bag should include as much information in its "tag" files as possible,
since this is where we can record the source and provenance of the
data. This "BagInfo" information can be supplied as an argument to the function that
creates the bag. This example creates a dictionary of the bag information called `my_BagInfo`,
which will be passed as an argument during bag creation. If you use this code,
replace the information below with the information appropriate to the project you're working on.

In [11]:
# create baginfo data
my_BagInfo = {
    'Contact-Name' : 'Jesse Johnston',
    'Contact-Email': 'morskyjezek@gmail.com',
    'External-Description': 'NEH Grant data files downloaded from NEH in December 2020.',
    'Source-Organization': 'National Endowment for the Humanities (NEH)',
    'Source-URL': 'https://securegrants.neh.gov/open/data/',
    'Collected-Date': '2020-12-14'
}

print('Bag Info:\n',my_BagInfo)

Bag Info:
 {'Contact-Name': 'Jesse Johnston', 'Contact-Email': 'morskyjezek@gmail.com', 'External-Description': 'NEH Grant data files downloaded from NEH in December 2020.', 'Source-Organization': 'National Endowment for the Humanities (NEH)', 'Source-URL': 'https://securegrants.neh.gov/open/data/', 'Collected-Date': '2020-12-14'}
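
For orientation, the bag information ends up in the bag's `bag-info.txt` tag file as plain `Label: value` lines. The sketch below only illustrates that layout. The function `render_bag_info` and the sample values are hypothetical, and this is not bagit's own code; the library writes the file itself and also adds fields of its own, such as `Bagging-Date` and `Payload-Oxum`.

```python
# Hypothetical sample values; substitute your own project's metadata.
sample_bag_info = {
    "Source-Organization": "Example Organization",
    "Contact-Name": "Example Person",
    "Collected-Date": "2020-12-14",
}


def render_bag_info(bag_info):
    """Render a dict as 'Label: value' lines, sorted by label (illustrative only)."""
    return "".join("%s: %s\n" % (label, bag_info[label]) for label in sorted(bag_info))


rendered = render_bag_info(sample_bag_info)
print(rendered, end="")
```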


Now that we have the tool working and have created basic metadata for the bag, we can move on to "bag" the files. In this case, the files that we want to bag are in a directory named `neh-grants-data-202012`. We use the `make_bag()` function to make the bag, passing in the location of the files that we want to bag, the bag info (the `my_BagInfo` dictionary), and, in this case, the `utf-8` text encoding:

In [12]:
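# create the bag in place: make_bag() moves the files into a data/ subdirectory
# and writes the manifests and bag-info.txt next to it
bag = bagit.make_bag('neh-grants-data-202012', bag_info=my_BagInfo, encoding='utf-8')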

Now, we can use the `is_valid()` method to check that the bag object is a well-formed BagIt bag and that its files match the recorded checksums:

In [13]:
# check to see if the bag is valid
if bag.is_valid():
    print("yay :)")
else:
    print("boo :(")

yay :)


In [16]:
# to get an idea what's in the bag, display a list of the files and fixity information
line_count = 0
for path, fixity in bag.entries.items():
    line_count += 1
    print("%s. path:%s sha256:%s" % (line_count, path, fixity['sha256']))

1. path:data/NEH_Grants1960s.csv sha256:20b521a035307e01fdfe00288806eb80498c9e9d19f02ffa3b5da5a4bfbaab41
2. path:data/NEH_Grants1960s.zip sha256:8ae10fac88a052c0b1c7f9f35028dd2fe58950a43038c6000f9de4ed20d3e3e9
3. path:data/NEH_Grants1960s_Flat.zip sha256:7c3c3464769e9f0918ac6afc271dd4c7cfb7d6098ea6cd510b9cd6ec674cf1b9
4. path:data/NEH_Grants1970s.csv sha256:f1bc38c2124e6dc03fef8663a07b4496131aeba4c50ccd94266403571e097064
5. path:data/NEH_Grants1970s.zip sha256:8fae9380ae095414bb7a2722deac19ac25b250255e4a5ad8ebf50f20ee909777
6. path:data/NEH_Grants1970s_Flat.zip sha256:f671023d9552b95724bfe2b66d643cb7f4a7041180642f1cad971d776194d7a5
7. path:data/NEH_Grants1980s.csv sha256:46c78195f14be0c99ca605550bd0389d9fcd6a9d698ea38f019fab639c16766c
8. path:data/NEH_Grants1980s.zip sha256:a6739f1d981895915debacb65a86ee972d70c16ab3ecd1aeb340e9e64664173d
9. path:data/NEH_Grants1980s_Flat.zip sha256:60d63627583b0c175535a0bb6d6044256b721d3f31ba214db825edebc29ac498
10. path:data/NEH_Grants1990s.csv sha256

Now we have created basic metadata for this group of files, two types of fixity signatures (sha256 and sha512), and an inventory of the files (see `neh-grants-data-202012/manifest-sha256.txt`). This will serve as a basic description of the original files. In the next activities, we will continue to work with this information, but the original files will remain unchanged and available for further consultation or work beyond this project. All of our work will be to extract, transform, and clean the data that we pull from these files, which will be the basis for further computation, visualization, or other analysis.
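
As a rough illustration of how the inventory can be used later, the sketch below reads a `manifest-sha256.txt` file (one `<checksum> <path>` pair per line) and reports any file whose current checksum no longer matches. The functions `read_manifest` and `find_mismatches` are hypothetical helpers written for this example, not part of bagit, and they run against a small stand-in bag built in a temporary directory rather than the real data. The parsing is simplified and ignores special cases such as encoded characters in file names; for real bags, the library's own validation is the safer choice.

```python
import hashlib
import os
import tempfile


def read_manifest(manifest_path):
    """Parse lines of the form '<checksum> <path>' into a {path: checksum} dict."""
    entries = {}
    with open(manifest_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            # Split on the first run of whitespace only; paths may contain spaces.
            checksum, relative_path = line.split(None, 1)
            entries[relative_path] = checksum
    return entries


def find_mismatches(bag_dir, manifest_name="manifest-sha256.txt"):
    """Return the paths whose current SHA-256 differs from the manifest."""
    mismatches = []
    manifest = read_manifest(os.path.join(bag_dir, manifest_name))
    for relative_path, expected in manifest.items():
        # Manifest paths use forward slashes regardless of platform.
        full_path = os.path.join(bag_dir, *relative_path.split("/"))
        # Reads the whole file at once, which is fine for a small demo file.
        with open(full_path, "rb") as handle:
            actual = hashlib.sha256(handle.read()).hexdigest()
        if actual != expected:
            mismatches.append(relative_path)
    return mismatches


# Build a tiny stand-in bag in a temporary directory (not the real data).
with tempfile.TemporaryDirectory() as demo_bag:
    os.mkdir(os.path.join(demo_bag, "data"))
    payload = b"id,title\n1,Example grant\n"
    payload_path = os.path.join(demo_bag, "data", "example.csv")
    with open(payload_path, "wb") as handle:
        handle.write(payload)
    manifest_path = os.path.join(demo_bag, "manifest-sha256.txt")
    with open(manifest_path, "w", encoding="utf-8") as handle:
        handle.write("%s  data/example.csv\n" % hashlib.sha256(payload).hexdigest())

    unchanged = find_mismatches(demo_bag)

    # Simulate corruption by appending a byte, then check again.
    with open(payload_path, "ab") as handle:
        handle.write(b"x")
    after_edit = find_mismatches(demo_bag)

print("before edit:", unchanged)
print("after edit:", after_edit)
```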