# Extract (Harvest) Item Metadata and Files

This notebook demonstrates the next step in the generalized workflow
for developing your reference digital collections.
We will walk through each step in class, but part of this work is self-guided:
you will need to adapt these demonstrations so that they work for your own collections.

## ETL Workflow

The overarching process here follows a generalized "Extract - Transform - Load" (ETL) workflow. This is an abstract model for pulling data from one system, transporting and cleaning it, and outputting it to another system.
While ETL usually refers to database work and data engineering, the goals are the same for our digital collections:
**extract** the metadata from the Library of Congress, change (**transform**) it into structures that make sense to our collection systems (CollectionBuilder and Omeka),
then ingest (**load**) that data and associated content into those systems.
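
As a minimal sketch of the three stages, the following uses a small hard-coded record in place of a live API response. The field names in `raw_response`, the target column names, and the function names are illustrative assumptions, not the Library of Congress schema or a required CollectionBuilder or Omeka template.

```python
import csv
import io
import json

# hypothetical stand-in for a JSON response body (not real API output)
raw_response = '{"item": {"id": "example-1", "title": "Example poster", "date": "1936"}}'

def extract(response_text):
    """Extract: parse the JSON text and pull out the item record."""
    return json.loads(response_text)['item']

def transform(record):
    """Transform: rename source fields to the (assumed) target column names."""
    return {
        'identifier': record['id'],
        'title': record['title'].strip(),
        'date': record.get('date', ''),
    }

def load(rows, out_file):
    """Load: write the transformed rows as CSV to any file-like object."""
    writer = csv.DictWriter(out_file, fieldnames=['identifier', 'title', 'date'])
    writer.writeheader()
    writer.writerows(rows)

# run the three stages in order, writing to an in-memory buffer
buffer = io.StringIO()
load([transform(extract(raw_response))], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping each stage in its own function makes it easier to swap one out later, for example replacing the hard-coded text with a real request.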

## Learning objectives

After completing the course assignment, you should be able to:

* Explain, both conceptually and practically, how collection metadata is made available by a REST API.
* Explain the concept of metadata extraction and transformation.
* Create a structure for documenting metadata practices in a collection or repository (a Metadata Application Profile) and implement that structure for transformations.
* Use programming to work with data supplied by an API in JSON format, and to manage and transform useful parts of that data into CSV format.
* Create ingest-ready collection metadata that conforms to Dublin Core and other digital collection metadata standards, which can be used to load content into another site (in this case, an Omeka S site).

## Introduction: Get Item Metadata and Content

This notebook outlines the second step of the extract process, which gathers the individual items. In this notebook:

**Get item metadata**. Using the list from the previous step (extract collection list) as a source, query each item in the collection to get its details. Save the JSON responses locally so that we can extract information from them in the next steps. (In this example, you will create a maximum of 62 item files, but some items will likely be inaccessible or unavailable. The number may vary when you run this code yourself, since the website's responses can differ from one run to the next.)

### Setup

In [1]:
import csv
import json
import requests

# for working with files
import os
from os.path import join

A helper function to regenerate the collection list from the previously created CSV file.

In [2]:
def regenerate_collection_list(collection_csv):
    """
    Reads a CSV file and returns the data as a list of dictionaries.
    
    Parameters:
    collection_csv (str): The path to the CSV file

    Returns:
    list: A list with one dictionary per row, where each key is a column header and each value is that row's value for the column.
    """

    coll_items = list()

    with open(collection_csv, 'r', newline='', encoding='utf-8') as f:
        data = csv.DictReader(f)

        for row in data:
            row_dict = dict()
            for field in data.fieldnames:
                row_dict[field] = row[field]
            coll_items.append(row_dict)

        return coll_items

In [3]:
collection_csv = os.path.join('..','collection-site-materials','collection_set_list.csv')

collection_set_list = regenerate_collection_list(collection_csv)

In [4]:
collection_set_list[0]

{'image': '/static/portals/free-to-use/public-domain/libraries/libraries-1.jpg',
 'link': '/resource/cph.3f05183/',
 'title': 'For greater knowledge, on more subjects, use your library more often. Illinois WPA Arts Project, 1936-1941. Prints & Photographs Division'}
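
To see what the helper returns, the following sketch runs the same `csv.DictReader` logic on a small in-memory CSV. The sample rows and the `sample_csv` name are made up for illustration. The result is a list with one dictionary per row, keyed by the column headers.

```python
import csv
import io

# made-up sample data with the same three columns as the collection list
sample_csv = io.StringIO(
    'image,link,title\n'
    '/static/a.jpg,/resource/example.001/,First example\n'
    '/static/b.jpg,/resource/example.002/,Second example\n'
)

# each row from DictReader is already keyed by column header
rows = [dict(row) for row in csv.DictReader(sample_csv)]

print(len(rows))
print(rows[0]['link'])
```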

## Get metadata for individual items 

The list of what is in the set serves as your baseline collection information. Next, you want more complete information about each item. Those details are available on the individual item pages, so you now have to query a different location for each item, as specified in the `'link'` field of each entry in the item list.
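
The request URL for an item is the site's base URL plus the item's `'link'` value, with `fo=json` added as a query parameter. The sketch below builds that URL with the standard library only, so it runs without network access. The `build_item_url` name is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def build_item_url(base_url, link, params):
    """Join the site base URL and an item link, merging in query parameters."""
    parts = urlsplit(base_url + link)
    # keep any query string already on the link, then add the new parameters
    query = parse_qsl(parts.query) + list(params.items())
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ''))

url = build_item_url('https://www.loc.gov', '/resource/cph.3f05183/', {'fo': 'json'})
print(url)
```

Passing `params` to `requests.get` performs a similar merge when the request is prepared, which is why the notebook does not build the query string by hand.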

In [5]:
# update baseURL
baseURL = 'https://www.loc.gov'
parameters = {
    'fo' : 'json'
}

The task now is to request metadata for each item. So that the data is reusable, save it locally as a JSON file. In the next blocks, you will create an individual file for each item, saved to a directory named `item-metadata` inside the `collection-site-materials` directory.

If you don't have that directory, you will first need to create it.

In [6]:
# run this cell to confirm that you have a location for the JSON files
item_metadata_directory = os.path.join('..','collection-site-materials','item-metadata')

if os.path.isdir(item_metadata_directory):
    print(item_metadata_directory,'exists')
else:
    os.mkdir(item_metadata_directory)
    print('created',item_metadata_directory)

../collection-site-materials/item-metadata exists
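
As an alternative sketch for this step, `os.makedirs` with `exist_ok=True` creates the directory (and any missing parent directories) only when needed, which can replace the explicit `isdir` check. The example works inside a temporary directory so that it does not touch your project folder, and the `sample_item` record is invented for illustration. It also shows the save-and-reload round trip used for the item JSON files.

```python
import json
import os
import tempfile

# invented record standing in for one item's metadata
sample_item = {'id': 'example-1', 'title': 'Example item'}

with tempfile.TemporaryDirectory() as tmp:
    target_dir = os.path.join(tmp, 'collection-site-materials', 'item-metadata')

    # creates both directory levels; calling it a second time is harmless
    os.makedirs(target_dir, exist_ok=True)
    os.makedirs(target_dir, exist_ok=True)

    # save the record as JSON, then read it back
    out_path = os.path.join(target_dir, 'item_metadata-example-1.json')
    with open(out_path, 'w', encoding='utf-8') as json_file:
        json.dump(sample_item, json_file)
    with open(out_path, 'r', encoding='utf-8') as json_file:
        reloaded = json.load(json_file)

    created = os.path.isdir(target_dir)

print(created, reloaded == sample_item)
```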


Now, with the `collection_set_list`, use the included links to query the API for metadata for each item:

In [7]:
item_count = 0
error_count = 0
file_count = 0

data_directory = 'collection-site-materials'
item_metadata_directory = 'item-metadata'
item_metadata_file_prefix = 'item_metadata'
json_suffix = '.json'

for item in collection_set_list:
    # skip a stray header row, if one made it into the list
    if item['link'] == 'link':
        continue

    # links that contain a query string ('?') are requested the same way as plain links;
    # these resource links could redirect to item pages, but currently don't work
    resource_ID = item['link']
    short_ID = item['link'].split('/')[2]
    item_metadata = requests.get(baseURL + resource_ID, params=parameters)
    item_count += 1
    print('requested',item_metadata.url,item_metadata.status_code)

    # count anything other than a 200 response as an error and move on
    if item_metadata.status_code != 200:
        error_count += 1
        continue

    # parse the response once; a body that is not JSON raises a ValueError
    try:
        item_json = item_metadata.json()
    except ValueError: # basically this catches all of the highsmith photos with hhh in the ID
        error_count += 1
        print('no json found')
        continue

    # save only the 'item' portion of the response
    fout = os.path.join('..', data_directory, item_metadata_directory, item_metadata_file_prefix + '-' + short_ID + json_suffix)
    with open(fout, 'w', encoding='utf-8') as json_file:
        json_file.write(json.dumps(item_json['item']))
    file_count += 1
    print('wrote', fout)

print('--- mini LOG ---')
print('items requested:',item_count)
print('errors:',error_count)
print('files written:',file_count)

requested https://www.loc.gov/resource/cph.3f05183/?fo=json 200
wrote ../collection-site-materials/item-metadata/item_metadata-cph.3f05183.json
requested https://www.loc.gov/resource/highsm.20336/?fo=json 200
wrote ../collection-site-materials/item-metadata/item_metadata-highsm.20336.json
requested https://www.loc.gov/resource/fsa.8d24709/?fo=json 200
wrote ../collection-site-materials/item-metadata/item_metadata-fsa.8d24709.json
requested https://www.loc.gov/resource/highsm.36052/?fo=json 200
wrote ../collection-site-materials/item-metadata/item_metadata-highsm.36052.json
requested https://www.loc.gov/resource/highsm.51772/?fo=json 200
wrote ../collection-site-materials/item-metadata/item_metadata-highsm.51772.json
requested https://www.loc.gov/resource/cph.3b43255/?fo=json 200
wrote ../collection-site-materials/item-metadata/item_metadata-cph.3b43255.json
requested https://www.loc.gov/resource/highsm.20483/?fo=json 200
wrote ../collection-site-materials/item-metadata/item_metadata-hi
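
When harvesting many items, some requests can fail temporarily. The sketch below shows one possible way to pause between attempts and retry a failed request. The `fetch_with_retry` helper, its parameters, and the `fake_fetch` stand-in are all hypothetical; the stand-in simulates responses so that the example runs without network access. In the notebook, the `fetch` argument could be a small function that wraps `requests.get` and returns the status code and body.

```python
import time

def fetch_with_retry(fetch, url, attempts=3, pause_seconds=0.0):
    """Call fetch(url) until it returns status 200 or the attempts run out.

    fetch must return a (status_code, body) tuple. Returns the last result.
    """
    status, body = None, None
    for attempt in range(attempts):
        status, body = fetch(url)
        if status == 200:
            break
        # wait before the next attempt, but not after the final one
        if attempt < attempts - 1:
            time.sleep(pause_seconds)
    return status, body

# stand-in for a real request: fails twice, then succeeds
calls = []

def fake_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        return 503, None
    return 200, '{"item": {}}'

status, body = fetch_with_retry(fake_fetch, 'https://example.org/item/1/?fo=json')
print(status, len(calls))
```

A nonzero `pause_seconds` spaces the attempts out, which is gentler on the server than retrying immediately.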

## Get files for items

This section uses Python's file-handling libraries to save an image file for each item.

First, confirm that there is an `item-files` directory in the `collection-site-materials` folder:

In [8]:
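# NOTE: this absolute path is specific to the original author's machine;
# change it to the folder that contains your own collection-site-materials directory
main_dir = os.path.join('/','Users','jajohnst','Desktop','si676-2025-data')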
project_dir = 'collection-site-materials'
files_dir = 'item-files'
metadata_dir = 'item-metadata'

files_loc = os.path.join(main_dir,project_dir,files_dir)
print('Checking for',files_loc)

# check directory
if os.path.isdir(files_loc):
    print('Files directory exists')
else:
    os.mkdir(files_loc)
    print('Created file directory:',files_loc)

Checking for /Users/jajohnst/Desktop/si676-2025-data/collection-site-materials/item-files
Files directory exists


Since the item metadata is now stored in local files, you no longer
need to request it from the API. You can use the `glob` module to
search for files with filters and wildcards, similar to the `ls` command.

In [9]:
import glob

Create a list of the item metadata files:

In [10]:
search_for_metadata_here = os.path.join('..',project_dir,metadata_dir)

print(search_for_metadata_here)

metadata_file_list = glob.glob(os.path.join(search_for_metadata_here, '*.json'))

print(metadata_file_list)

../collection-site-materials/item-metadata
['../collection-site-materials/item-metadata/item_metadata-cph.3c18157.json', '../collection-site-materials/item-metadata/item_metadata-ppbd.00600.json', '../collection-site-materials/item-metadata/item_metadata-mrg.00785.json', '../collection-site-materials/item-metadata/item_metadata-cph.3f05183.json', '../collection-site-materials/item-metadata/item_metadata-g3851e.ct006252.json', '../collection-site-materials/item-metadata/item_metadata-highsm.43863.json', '../collection-site-materials/item-metadata/item_metadata-ppmsca.18016.json', '../collection-site-materials/item-metadata/item_metadata-highsm.20497.json', '../collection-site-materials/item-metadata/item_metadata-fsa.8b14169.json', '../collection-site-materials/item-metadata/item_metadata-ppmsca.35590.json', '../collection-site-materials/item-metadata/item_metadata-mrg.00788.json', '../collection-site-materials/item-metadata/item_metadata-highsm.34640.json', '../collection-site-material
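
The wildcard search can be tried safely on a throwaway folder. This sketch creates a few empty files with invented names in a temporary directory and then matches only the `.json` files. Because `glob` returns matches in arbitrary order (as in the list above), the example sorts the results to make them repeatable.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # create a few empty files with invented names
    for name in ['item_metadata-a.json', 'item_metadata-b.json', 'notes.txt']:
        open(os.path.join(tmp, name), 'w', encoding='utf-8').close()

    # '*' matches any run of characters within a file name
    matches = glob.glob(os.path.join(tmp, '*.json'))

    # sort the file names so the result is the same on every run
    found = sorted(os.path.basename(path) for path in matches)

print(found)
```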

Identify the image URLs from the stored data:

In [11]:
item_image_urls = list()
count = 0

for item in metadata_file_list:
    with open(item, 'r', encoding='utf-8') as f:
        metadata = json.load(f)
    # 'image_url' holds a list of URLs; take the last entry ([-1]), which is used later as the large image
    # noted this resource for working out index out of range errors: https://rollbar.com/blog/how-to-fix-python-list-index-out-of-range-error-in-for-loops/
    image_url = metadata['image_url'][-1]
    item_image_urls.append(image_url)
    count += 1

print(f'Identified {count} image URLs')

Identified 59 image URLs


In [12]:
item_image_urls

['https://tile.loc.gov/storage-services/service/pnp/cph/3c10000/3c18000/3c18100/3c18157v.jpg#h=824&w=1024',
 'https://tile.loc.gov/storage-services/service/pnp/ppbd/00600/00600v.jpg#h=1024&w=765',
 'https://tile.loc.gov/storage-services/service/pnp/mrg/00700/00785v.jpg#h=697&w=1024',
 'https://tile.loc.gov/storage-services/service/pnp/cph/3f00000/3f05000/3f05100/3f05183v.jpg#h=1024&w=705',
 'https://tile.loc.gov/image-services/iiif/service:gmd:gmd385:g3851:g3851e:ct006252/full/pct:25/0/default.jpg#h=1205&w=1684',
 'https://tile.loc.gov/image-services/iiif/service:pnp:highsm:43800:43863/full/pct:25/0/default.jpg#h=2117&w=2822',
 'https://tile.loc.gov/storage-services/service/pnp/ppmsca/18000/18016v.jpg#h=755&w=1024',
 'https://tile.loc.gov/image-services/iiif/service:pnp:highsm:20400:20497/full/pct:50/0/default.jpg#h=2395&w=2053',
 'https://tile.loc.gov/storage-services/service/pnp/fsa/8b14000/8b14100/8b14169v.jpg#h=783&w=1024',
 'https://tile.loc.gov/storage-services/service/pnp/ppmsca
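
Each URL above ends with a fragment such as `#h=824&w=1024`. Reading `h` and `w` as the image height and width in pixels is an assumption based on the values shown, not something the source confirms. The sketch below separates the fragment from the URL and parses the two numbers using the standard library; `split_image_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urldefrag

def split_image_url(image_url):
    """Return (url_without_fragment, height, width); sizes are None if absent."""
    clean_url, fragment = urldefrag(image_url)
    # the fragment has the same key=value&key=value shape as a query string
    sizes = parse_qs(fragment)
    height = int(sizes['h'][0]) if 'h' in sizes else None
    width = int(sizes['w'][0]) if 'w' in sizes else None
    return clean_url, height, width

example = 'https://tile.loc.gov/storage-services/service/pnp/cph/3c10000/3c18000/3c18100/3c18157v.jpg#h=824&w=1024'
clean_url, height, width = split_image_url(example)
print(clean_url)
print(height, width)
```

The fragment is never sent to the server as part of an HTTP request, so requesting the full URL and requesting `clean_url` retrieve the same file.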

Now, create an updated set list with the image URLs:

In [13]:
collection_set_list_with_images = list()

for item in metadata_file_list:
    with open(item, 'r', encoding='utf-8') as item_info:
        item_metadata = json.load(item_info)

        # add the metadata into a dictionary for each item
        item_metadata_dict = dict()
        item_metadata_dict['item_URI'] = item_metadata['id']
        # not every item has an LCCN; .get() returns None when the key is missing
        item_metadata_dict['lccn'] = item_metadata.get('library_of_congress_control_number')
        item_metadata_dict['title'] = item_metadata['title']
        item_metadata_dict['image_URL_large'] = item_metadata['image_url'][-1]
        
        # add the metadata to the main list
        collection_set_list_with_images.append(item_metadata_dict)

print(collection_set_list_with_images[0])

{'item_URI': 'http://www.loc.gov/item/97511671/', 'lccn': '97511671', 'title': 'Carnegie Library, Sheldon, Iowa', 'image_URL_large': 'https://tile.loc.gov/storage-services/service/pnp/cph/3c10000/3c18000/3c18100/3c18157v.jpg#h=824&w=1024'}
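
Because each item is now a flat dictionary with the same keys, a list like this can be written out as CSV, the format targeted by the learning objectives. The following is a minimal sketch using `csv.DictWriter`, writing to an in-memory buffer rather than a file. The single row is copied from the output above, and the variable names are illustrative.

```python
import csv
import io

rows = [{
    'item_URI': 'http://www.loc.gov/item/97511671/',
    'lccn': '97511671',
    'title': 'Carnegie Library, Sheldon, Iowa',
    'image_URL_large': 'https://tile.loc.gov/storage-services/service/pnp/cph/3c10000/3c18000/3c18100/3c18157v.jpg#h=824&w=1024',
}]

# use the keys of the first row as the column headers
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)

lines = buffer.getvalue().splitlines()
print(lines[0])
```

The writer quotes the title automatically because it contains commas. A `None` value, such as a missing `lccn`, is written as an empty field.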


Finally, make the image requests:

In [14]:
item_count = 0
error_count = 0
file_count = 0

img_file_prefix = 'img_'

for item in collection_set_list_with_images:
    image_URL = item['image_URL_large']
    short_ID = item['item_URI'].split('/')[-2]
    print('... requesting',image_URL)
    item_count += 1

    # if found, save image; otherwise count the failed request as an error
    r = requests.get(image_URL)
    if r.status_code == 200:
        img_out = os.path.join('..',project_dir,files_dir,img_file_prefix + short_ID + '.jpg')
        with open(img_out, 'wb') as file:
            file.write(r.content)
        print('Saved',img_out)
        file_count += 1
    else:
        error_count += 1
        print('Could not retrieve',image_URL,r.status_code)


print('--- mini LOG ---')
print('files requested:',item_count)
print('errors:',error_count)
print('files written:',file_count)

... requesting https://tile.loc.gov/storage-services/service/pnp/cph/3c10000/3c18000/3c18100/3c18157v.jpg#h=824&w=1024
Saved ../collection-site-materials/item-files/img_97511671.jpg
... requesting https://tile.loc.gov/storage-services/service/pnp/ppbd/00600/00600v.jpg#h=1024&w=765
Saved ../collection-site-materials/item-files/img_2015647967.jpg
... requesting https://tile.loc.gov/storage-services/service/pnp/mrg/00700/00785v.jpg#h=697&w=1024
Saved ../collection-site-materials/item-files/img_2017702899.jpg
... requesting https://tile.loc.gov/storage-services/service/pnp/cph/3f00000/3f05000/3f05100/3f05183v.jpg#h=1024&w=705
Saved ../collection-site-materials/item-files/img_98508155.jpg
... requesting https://tile.loc.gov/image-services/iiif/service:gmd:gmd385:g3851:g3851e:ct006252/full/pct:25/0/default.jpg#h=1205&w=1684
Saved ../collection-site-materials/item-files/img_87694100.jpg
... requesting https://tile.loc.gov/image-services/iiif/service:pnp:highsm:43800:43863/full/pct:25/0/defaul
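
Re-running the cell above requests every image again. One possible refinement, sketched here without any network calls, is to derive each output path first and skip items whose image file already exists. `image_path_for` and `pending_downloads` are hypothetical helpers, and the example uses a temporary directory with an empty placeholder file.

```python
import os
import tempfile

def image_path_for(item_uri, files_dir, prefix='img_'):
    """Build the image path from the ID segment of an item URI."""
    short_id = item_uri.rstrip('/').split('/')[-1]
    return os.path.join(files_dir, prefix + short_id + '.jpg')

def pending_downloads(items, files_dir):
    """Return the items whose image file is not yet in files_dir."""
    return [item for item in items
            if not os.path.exists(image_path_for(item['item_URI'], files_dir))]

items = [
    {'item_URI': 'http://www.loc.gov/item/97511671/'},
    {'item_URI': 'http://www.loc.gov/item/2015647967/'},
]

with tempfile.TemporaryDirectory() as tmp:
    # pretend the first image was saved on an earlier run
    open(image_path_for(items[0]['item_URI'], tmp), 'wb').close()
    todo = pending_downloads(items, tmp)

print([item['item_URI'] for item in todo])
```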