# Extracting and Transforming Metadata

This activity has several steps, so the notebook is long. We will go through it together in class; afterward, you should be able to reuse this code as building blocks to query a second collection.

The main steps here are as follows:

1. Get collection list - using the `requests` library, make a request to the Library of Congress API to get the list of items in the "Free to Use" libraries collection. Write this list to a local file called `collection_items_list.csv`.
1. Get item metadata - using the list from the previous step as a source, query each item in the collection to get details about it. Save the JSON responses locally so that you can extract information from them in the next steps. (In this example, you will have around 60 files, with a maximum of 62 as of September 2022. The number may vary when you run this code yourself, since the website may not respond successfully to every request.)
1. Draft a metadata crosswalk - this is an exploratory activity. Take some time to examine one or two sample responses from the previous step and identify the attributes that you want to extract. The goal is to identify the information that you want to import into your Omeka site collection (essentially, you are going to recreate the collection), to see how to extract it from the JSON, and to prepare a test transformation for the next step.
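1. Transform the metadata - test the extraction on one JSON file, then loop through all of the saved files and write the selected fields to a CSV that can be imported into your Omeka site.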

# Get collection list

In [1]:
import csv
import json
import requests

In [26]:
# the trailing slash matters: the collection name is appended directly to this string
endpoint = 'https://www.loc.gov/free-to-use/'
parameters = {
    'fo' : 'json'
}

In [3]:
collection = 'libraries'

In [4]:
collection_list_response = requests.get(endpoint + collection, params=parameters)

In [6]:
collection_json = collection_list_response.json()

In [14]:
for k in collection_json['content']['set']['items']:
    print(k)

{'image': '/static/portals/free-to-use/public-domain/libraries/libraries-1.jpg', 'link': '/resource/cph.3f05183/', 'title': 'For greater knowledge, on more subjects, use your library more often. Illinois WPA Arts Project, 1936-1941. Prints & Photographs Division'}
{'image': '/static/portals/free-to-use/public-domain/libraries/libraries-2.jpg', 'link': '/resource/highsm.20336/', 'title': 'Noyes Library for Young Children. Kensington, Maryland. Photo by Carol M. Highsmith,  2011. Prints & Photographs Division'}
{'image': '/static/portals/free-to-use/public-domain/libraries/libraries-3.jpg', 'link': '/resource/fsa.8d24709/', 'title': 'Bethune-Cookman College. Students in the library reading room, Daytona Beach, Florida. Gordon Parks, 1943. Prints & Photographs Division'}
{'image': '/static/portals/free-to-use/public-domain/libraries/libraries-4.jpg', 'link': '/resource/highsm.36052/', 'title': 'Public library in Antonito,  Colorado, near the New Mexico border. Photo by Carol M. Highsmith,

In [16]:
collection_items_file = 'collection_items_list.csv'
headers = ['image','link','title']

with open(collection_items_file, 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=headers)
    writer.writeheader()
    for item in collection_json['content']['set']['items']:
        writer.writerow(item)
    print('wrote',collection_items_file)

wrote collection_items_list.csv


# Get item metadata

In [29]:
# update endpoint info
endpoint = 'https://www.loc.gov'
parameters = {
    'fo' : 'json'
}

In [34]:
item_count = 0
error_count = 0
item_metadata_file_start = 'item_metadata'
json_suffix = '.json'

with open(collection_items_file, 'r', encoding='utf-8', newline='') as f:
    # DictReader takes the column names from the header row of the file
    reader = csv.DictReader(f)
    for item in reader:
        resource_ID = item['link']
        short_ID = resource_ID.split('/')[2]
        # these resource links could redirect to item pages, but currently don't work
        # as-is: if the link already has a query string, append fo=json with '&';
        # otherwise let requests build the query string from parameters
        if '?' in resource_ID:
            item_metadata = requests.get(endpoint + resource_ID + '&fo=json')
        else:
            item_metadata = requests.get(endpoint + resource_ID, params=parameters)
        print('requested',item_metadata.url,item_metadata.status_code)
        if item_metadata.status_code != 200:
            error_count += 1
            continue
        # parse the response once; skip the item if the body is not valid JSON
        try:
            item_json = item_metadata.json()
        except ValueError:
            error_count += 1
            print('no json found')
            continue
        fout = item_metadata_file_start + '-' + short_ID + json_suffix
        with open(fout, 'w', encoding='utf-8') as json_file:
            json_file.write(json.dumps(item_json['item']))
            print('wrote', fout)
        item_count += 1

print('items requested:',item_count)
print('errors:',error_count)

requested https://www.loc.gov/resource/cph.3f05183/?fo=json 200
wrote item_metadata-cph.3f05183.json
requested https://www.loc.gov/resource/highsm.20336/?fo=json 200
wrote item_metadata-highsm.20336.json
requested https://www.loc.gov/resource/fsa.8d24709/?fo=json 200
wrote item_metadata-fsa.8d24709.json
requested https://www.loc.gov/resource/highsm.36052/?fo=json 200
wrote item_metadata-highsm.36052.json
requested https://www.loc.gov/resource/highsm.51772/?fo=json 200
wrote item_metadata-highsm.51772.json
requested https://www.loc.gov/resource/cph.3b43255/?fo=json 200
wrote item_metadata-cph.3b43255.json
requested https://www.loc.gov/resource/highsm.20483/?fo=json 200
wrote item_metadata-highsm.20483.json
requested https://www.loc.gov/resource/highsm.29207/?fo=json 200
wrote item_metadata-highsm.29207.json
requested https://www.loc.gov/resource/fsa.8b32222/?fo=json 200
wrote item_metadata-fsa.8b32222.json
requested https://www.loc.gov/resource/highsm.64003/?fo=json 200
wrote item_metad
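
The loop above treats links with and without an existing query string differently. As a possible alternative, here is a minimal sketch that uses only the standard library to merge `fo=json` into whatever query string is already present. The function names `build_json_url` and `short_id_from_link`, and the `?st=gallery` sample link, are made up for illustration; they do not come from the Library of Congress documentation.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit, parse_qsl, urlencode

def build_json_url(base, link):
    """Join a relative link to the base URL and make sure fo=json is in the query string."""
    parts = urlsplit(urljoin(base, link))
    # note: dict() keeps only the last value if a parameter is repeated
    query = dict(parse_qsl(parts.query))
    query['fo'] = 'json'
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment))

def short_id_from_link(link):
    """Return the resource identifier, e.g. 'cph.3f05183' from '/resource/cph.3f05183/'."""
    return urlsplit(link).path.split('/')[2]

print(build_json_url('https://www.loc.gov', '/resource/cph.3f05183/'))
# hypothetical link that already carries a query string
print(build_json_url('https://www.loc.gov', '/resource/cph.3f05183/?st=gallery'))
print(short_id_from_link('/resource/cph.3f05183/?st=gallery'))
```

With a helper like this, the two branches of the loop could collapse into a single `requests.get(build_json_url(endpoint, item['link']))` call, assuming the site accepts the merged query string.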

# Write a metadata crosswalk

Below is a start. This will get a bit complicated, but identify at least 10 fields that you want to move into the new site. Consider using Dublin Core, plus at least one field from another schema. I suggest MODS, which is more of a bibliographic schema, allows more granularity than Dublin Core, and is also supported by Omeka. You should also be able to find MODS information for most (if not all) items in any of these sets. For example, in resource `highsm.20336`, the last field in the item metadata is a URL to an `item` page: https://www.loc.gov/item/2012630017/. That item page links to MODS and Dublin Core records.


| source field name | source field path/dict name | target | target namespace | notes |
|-------------------|-----------------------------|--------|------------------|-------|
| title | item['title'] | dc:title | DC Element | Title provided by the original metadata; could also be mapped to MODS:titleInfo:title or other fields in other namespaces |
| date | item['date'] | dc:date | DC Element | This is a 4-digit year; corresponds to date of creation in most cases |
| LC call number | item['item']['call_number'] | dc:identifier | DC Element | Alphanumeric string. A Library of Congress number; should be recorded for source/provenance reasons. |
| LC control number | item['item']['control_number'] | dc:identifier @type=lccn | DC Element with attribute | Corresponds to the Library of Congress Control Number (can be checked at http://lccn.loc.gov/) |
| creator | item['creator'] | dc:creator | DC Element | Should be a name. May be repeated. If possible, are various roles needed, such as 'photographer', 'author', etc.? |
| description | item['description'] / item['summary'] | mods:physicalDescription / dc:abstract | MODS | In the source data, this seems most like physical description, although it might correspond to dc:format or dc:type. Content in the record may come from a controlled vocabulary, such as LC Genre & Form Thesaurus. |
| mime_type | | | DC | |
| notes (may be multiple) | item['notes'] (array) | dcterms:abstract | DC Terms | This appears to be closest to a "summary" or description of the content of the items. |
| source_collection | | | | |
| rights | | | | |
| place | | | | |
| image (link to the full image) | | | | |
| languages | | | | |
| subject_heading | | mods:subject | MODS | |
| format, physical | item['formats'][0]['title'] / also look at item['type'] | mods:physicalDescription:form | MODS | Description of the original physical format of this item (photograph, book, poster). Note: this may not be present or in the same place for the different types of objects in the collection. |
| format | item['format'] | dc:format | DC Element | The basic type of the digital surrogate (e.g., 'image' or 'text') |
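
Once a few rows of the crosswalk are settled, it can help to write them down as data rather than as separate assignment statements. The sketch below is one possible way to do that, not the required approach: the `crosswalk` dictionary and the `apply_crosswalk` function are names invented here, only rows with a simple top-level source path are included, and the sample record is a partial stand-in built from values shown elsewhere in this notebook.

```python
# A few crosswalk rows as data: target field -> key in the item JSON.
# Only rows whose source path is a single top-level key are included.
crosswalk = {
    'dc:title': 'title',
    'dc:date': 'date',
    'dc:creator': 'creator',
    'dcterms:abstract': 'notes',
}

def apply_crosswalk(item, mapping):
    """Return a dict of target fields; sources missing from the item map to None."""
    return {target: item.get(source) for target, source in mapping.items()}

# partial record for demonstration only
sample_item = {'title': 'Carnegie Library, Cordele, Georgia', 'date': '1916'}
print(apply_crosswalk(sample_item, crosswalk))
```

Fields that need nested lookups or clean-up (for example the formats) would still need their own handling.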

# Transformation Part 1: Testing

In this step, search the metadata files, extract target fields, write to CSV.

First, develop a search pattern for identifying the desired JSON files. Here, you create a list of the files that you want to transform, called `list_of_item_metadata_files`. 

**Reminder:** This step builds on your regular expression and shell skills! (Note, however, that these are technically file path expansions, also called glob patterns, not actual regular expressions. The general idea is similar: you create a pattern and ask the computer to respond with a list of results that meet your criteria.)
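
To see how such a pattern behaves without touching the file system, you can try it on a plain list of names. This is a small sketch using the standard library's `fnmatch` module, which implements the same shell-style wildcard rules that `glob` relies on. The last name in the list is made up to show a non-match.

```python
import fnmatch

names = [
    'item_metadata-cph.3f05183.json',
    'item_metadata-highsm.20336.json',
    'collection_items_list.csv',
    'item_metadata-notes.txt',   # made-up name, included to show a non-match
]

# '*' matches any run of characters, so only names with the right prefix and suffix pass
matches = fnmatch.filter(names, 'item_metadata-*.json')
print(matches)
```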

In [37]:
import glob
import os
from os.path import join

In [42]:
current_loc = os.getcwd()

print(current_loc)

/Users/rickypunzalan/Desktop/Desktop - RICARDO’s iMac/digcur/activities


In [46]:
for file in glob.glob('item_metadata-*.json'):
    print(file)

item_metadata-cph.3c18157.json
item_metadata-ppbd.00600.json
item_metadata-mrg.00785.json
item_metadata-cph.3f05183.json
item_metadata-g3851e.ct006252.json
item_metadata-highsm.43863.json
item_metadata-ppmsca.18016.json
item_metadata-ppmsca.17588.json
item_metadata-highsm.20497.json
item_metadata-fsa.8b14169.json
item_metadata-ppmsca.35590.json
item_metadata-mrg.00788.json
item_metadata-highsm.34640.json
item_metadata-highsm.20336.json
item_metadata-fsa.8c22565.json
item_metadata-cph.3b41963.json
item_metadata-ppmsca.15412.json
item_metadata-highsm.32720.json
item_metadata-mrg.00432.json
item_metadata-hhh.hi0135.photos.json
item_metadata-hhh.il0998.sheet.json
item_metadata-det.4a17925.json
item_metadata-highsm.24333.json
item_metadata-ppmscd.00084.json
item_metadata-hhh.ks0072.photos.json
item_metadata-highsm.64003.json
item_metadata-highsm.41101.json
item_metadata-highsm.31350.json
item_metadata-hhh.ok0012.sheet.json
item_metadata-hhh.sc0767.photos.json
item_metadata-hhh.ri0071.photos

In [56]:
# glob.glob already returns a list of matching filenames
list_of_item_metadata_files = glob.glob('item_metadata-*.json')

In [57]:
len(list_of_item_metadata_files)

60
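
If the count is lower than the number of rows in `collection_items_list.csv`, you may want to know which items were not saved. Here is a minimal sketch under the assumption that filenames follow the `item_metadata-<id>.json` convention used in the download loop. The helper name `expected_filename` and the third link are hypothetical.

```python
def expected_filename(link, prefix='item_metadata', suffix='.json'):
    """Build the filename the download loop would use for a given link."""
    return prefix + '-' + link.split('/')[2] + suffix

# links as they appear in collection_items_list.csv; the last one is a made-up example
links = ['/resource/cph.3f05183/', '/resource/highsm.20336/', '/resource/example.00000/']
# stand-in for list_of_item_metadata_files
saved = {'item_metadata-cph.3f05183.json', 'item_metadata-highsm.20336.json'}

missing = [link for link in links if expected_filename(link) not in saved]
print(missing)
```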

In [61]:
# quick duplicate check
list_of_item_metadata_files.sort()

for file in list_of_item_metadata_files:
    print(file)

item_metadata-cph.3b41963.json
item_metadata-cph.3b43255.json
item_metadata-cph.3c18157.json
item_metadata-cph.3f05168.json
item_metadata-cph.3f05183.json
item_metadata-det.4a17925.json
item_metadata-det.4a23603.json
item_metadata-ds.06507.json
item_metadata-ds.06560.json
item_metadata-fsa.8b14169.json
item_metadata-fsa.8b32222.json
item_metadata-fsa.8c22565.json
item_metadata-fsa.8d24709.json
item_metadata-g3851e.ct006252.json
item_metadata-hhh.ak0345.photos.json
item_metadata-hhh.dc0121.photos.json
item_metadata-hhh.hi0135.photos.json
item_metadata-hhh.il0998.sheet.json
item_metadata-hhh.ks0072.photos.json
item_metadata-hhh.me0057.photos.json
item_metadata-hhh.nj0089.photos.json
item_metadata-hhh.nv0134.photos.json
item_metadata-hhh.ok0012.sheet.json
item_metadata-hhh.ri0071.photos.json
item_metadata-hhh.sc0767.photos.json
item_metadata-highsm.04362.json
item_metadata-highsm.18384.json
item_metadata-highsm.18402.json
item_metadata-highsm.20216.json
item_metadata-highsm.20336.json
ite
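
Scanning a sorted list by eye works for 60 files, but a programmatic check scales better. The following sketch counts repeated values with `collections.Counter`; the function name `find_duplicates` and the sample list (with one deliberate repeat) are illustrative assumptions.

```python
from collections import Counter

def find_duplicates(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name, n in counts.items() if n > 1]

sample = ['item_metadata-cph.3b41963.json',
          'item_metadata-cph.3b43255.json',
          'item_metadata-cph.3b41963.json']  # deliberate repeat for the demo
print(find_duplicates(sample))
```

Filenames from a single directory are unique by construction, so the same check may be more useful on the `link` values in the CSV or on the item identifiers you extract later.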

In [107]:
# try first with one file, can you open the json, can you see what elements are in the json?
with open(list_of_item_metadata_files[0], 'r', encoding='utf-8') as item:
    # what are we looking at?
    print('file:',list_of_item_metadata_files[0],'\n')
    
    # load the item data
    item_data = json.load(item)
    
    for element in item_data.keys():
        print(element,':',item_data[element])
    
    # can you get the date?
    print('\ndate:',item_data['date'], type(item_data['date']))
    # can you get the format?
    # use the same top-level 'format' key for both the value and its type
    print('\nformat:',item_data['format'][0], type(item_data['format'][0]))

file: item_metadata-cph.3b41963.json 

_version_ : 1709345645727318016
access_restricted : False
aka : ['https://www.loc.gov/pictures/item/91787443/', 'http://www.loc.gov/item/91787443/', 'http://www.loc.gov/pictures/item/91787443/', 'https://www.loc.gov/pictures/collection/cph/item/91787443/', 'http://www.loc.gov/pictures/collection/cph/item/91787443/', 'http://www.loc.gov/resource/cph.3b41963/', 'http://lccn.loc.gov/91787443', 'http://hdl.loc.gov/loc.pnp/cph.3b41963']
call_number : SSF - Libraries--Georgia--Cordele <item> [P&P]
campaigns : []
control_number : 
created : 2016-04-21T09:17:00Z
created_published : ['[ca. 1916]']
created_published_date : [ca. 1916]
date : 1916
dates : [{'1916': 'https://www.loc.gov/search/?dates=1916/1916&fo=json'}]
description : ['1 photographic print. | Photo shows a group of children posed on and in front of steps, roof and dome draped with stars and stripes banners. A Carnegie grant for $10,000 in 1903 funded this building, with an  additional $7,556 

## Test: Try it with one example

First, try to set up the extract process with one example. This may get more complicated later since you don't know yet if every item has the same metadata attributes in the JSON. But start with some basics and build up from there. 

For a first pass, look out for these items, and find where in the JSON you can locate them:

* 'item_id'
* 'title'
* 'date' 
* 'source_url'
* 'phys_format'
* 'dig_format'
* 'rights'

_Hint: use the JSON viewer in JupyterLab, use an extension in VSCode, or use a browser to look through sample JSON. The block below uses item `cph.3b41963`._

You may need to use try/except patterns to create workarounds for cases where some items may not have exactly the same attributes that you've identified in your test cases.
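
One way to keep those workarounds short is to wrap the try/except pattern in a small helper. This is only a sketch: the name `first_or_default` is invented here, and the sample record is a made-up fragment rather than a real response.

```python
def first_or_default(record, key, default='Not found'):
    """Return the first element of record[key], or default if the key is missing or the list is empty."""
    try:
        return record[key][0]
    except (KeyError, IndexError, TypeError):
        # KeyError: no such key; IndexError: empty list; TypeError: value is None
        return default

# made-up fragment of an item record
sample = {'online_format': ['image'], 'format': [], 'language': None}
print(first_or_default(sample, 'online_format'))
print(first_or_default(sample, 'format'))
print(first_or_default(sample, 'language'))
print(first_or_default(sample, 'rights', default='Undetermined'))
```

Catching specific exceptions, rather than using a bare `except:`, means that unrelated problems (such as a typo in a variable name) still raise an error instead of being silently hidden.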

In [167]:
# set up the containers to create the csv of all the item fields
# file for csv to read out
collection_info_csv = 'collection_items_data.csv'

# set up a list for the columns in your csv; in future, this should be more automated but this works for now as you set up the crosswalk
headers = ['source_file', 'item_id', 'title', 'date', 'source_url', 'phys_format', 'dig_format', 'rights']

# try first with one file
source_file = list_of_item_metadata_files[0]
with open(source_file, 'r', encoding='utf-8') as data:
    # load the item data
    item_data = json.load(data)
    
    # extract the data you want
    # for checking purposes, source_file (set above) records where the info came from
    # make sure there's some unique and stable identifier
    try:
        item_id = item_data['library_of_congress_control_number']
    except KeyError:
        item_id = item_data['url'].split('/')[-2]
    title = item_data['title']
    date = item_data['date']
    source_url = item_data['url']
    try:
        phys_format = item_data['format'][0]
    except (KeyError, IndexError, TypeError):
        phys_format = 'Not found'
    try:
        dig_format = item_data['online_format'][0]
    except (KeyError, IndexError, TypeError):
        dig_format = 'Not found'
    # not written to the CSV yet; .get() returns None instead of raising if the key is missing
    mime_type = item_data.get('mime_type')
    try:
        rights = item_data['rights_information']
    except KeyError:
        rights = 'Undetermined'


    # dictionary for the rows
    row_dict = dict()
    
    # look for the item metadata, assign it to the dictionary; 
    # start with some basic elements likely (already enumerated in the headers list) :
    # source file
    row_dict['source_file'] = source_file
    # identifier
    row_dict['item_id'] = item_id
    # title
    row_dict['title'] = title
    # date
    row_dict['date'] = date
    # link
    row_dict['source_url'] = source_url
    # format
    row_dict['phys_format'] = phys_format
    # digital format
    row_dict['dig_format'] = dig_format
    #rights
    row_dict['rights'] = rights 
    print('created row dictionary:',row_dict)

    # write to the csv; newline='' lets the csv module control line endings
    with open(collection_info_csv, 'w', encoding='utf-8', newline='') as fout:
        writer = csv.DictWriter(fout, fieldnames=headers)
        writer.writeheader()
        writer.writerow(row_dict)
        print('wrote',collection_info_csv)

created row dictionary: {'source_file': 'item_metadata-cph.3b41963.json', 'item_id': '91787443', 'title': 'Carnegie Library, Cordele, Georgia', 'date': '1916', 'source_url': 'https://www.loc.gov/item/91787443/', 'phys_format': {'photo, print, drawing': 'https://www.loc.gov/search/?fa=original_format:photo,+print,+drawing&fo=json'}, 'dig_format': 'image', 'rights': 'No known restrictions on publication.'}
wrote collection_items_data.csv
wrote collection_items_data.csv
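
Notice that `phys_format` in the row dictionary above is itself a dictionary (a label mapped to a search URL), which will not look tidy in a CSV cell. A possible clean-up, sketched here with a hypothetical helper named `label_from_facet`, keeps only the label:

```python
def label_from_facet(value, default='Not found'):
    """Return the label (first key) if value is a non-empty dict, a string as-is, else default."""
    if isinstance(value, dict) and value:
        return next(iter(value))
    if isinstance(value, str):
        return value
    return default

# value copied from the row dictionary printed above
phys_format = {'photo, print, drawing': 'https://www.loc.gov/search/?fa=original_format:photo,+print,+drawing&fo=json'}
print(label_from_facet(phys_format))
print(label_from_facet('Not found'))
```

Whether you keep the label, the URL, or both is a crosswalk decision; this sketch assumes only the label is wanted.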


Next, look for the URL of a medium-sized image of the item in question:

In [182]:
collection_info_csv = 'collection_items_data.csv'

# set up a list for the columns in your csv; in future, this should be more automated but this works for now as you set up the crosswalk
headers = ['source_file', 'item_id', 'title', 'date', 'source_url', 'phys_format', 'dig_format', 'rights']

# try first with one file
with open(list_of_item_metadata_files[0], 'r', encoding='utf-8') as data:
    # load the item data
    item_data = json.load(data)
    
    print(item_data['image_url'][3])

https://tile.loc.gov/storage-services/service/pnp/cph/3b40000/3b41000/3b41900/3b41963r.jpg#h=515&w=640
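
The index `[3]` happens to give a medium-sized image for this item, but other items may list a different number of images. The URL printed above ends with a fragment that states the image height and width (`#h=515&w=640`), so one option is to choose by declared width instead of by position. This is a sketch under that assumption; the function names are invented, and the `example.org` URLs are placeholders that only imitate the fragment pattern.

```python
from urllib.parse import urlsplit, parse_qs

def image_width(url):
    """Read the w= value from a fragment such as '#h=515&w=640'; return None if absent."""
    fragment = parse_qs(urlsplit(url).fragment)
    try:
        return int(fragment['w'][0])
    except (KeyError, ValueError):
        return None

def closest_to_width(urls, target=640):
    """Pick the URL whose declared width is closest to target; None if no URL declares a width."""
    sized = [(abs(image_width(u) - target), u) for u in urls if image_width(u) is not None]
    return min(sized, key=lambda pair: pair[0])[1] if sized else None

# placeholder URLs, not real Library of Congress image links
urls = [
    'https://example.org/img/small.jpg#h=120&w=150',
    'https://example.org/img/medium.jpg#h=515&w=640',
    'https://example.org/img/large.jpg#h=824&w=1024',
    'https://example.org/img/unknown.jpg',
]
print(closest_to_width(urls))
```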


# Transformation Part 2: Write your CSV

The goal of this final step is to create a CSV file that you can import into your Omeka site. It may seem like it has taken a long time to get to this point, but remember: when this works, you will import about 60 items into the site at one time. If you can get all of this to work for an even larger set of materials, you will save quite a lot of time in the future when you need to import items (unless, of course, you get the items piecemeal, which needs a different workflow, but let's leave that aside for now).

Now, try to extend this to the whole set by looping through each of the desired JSON files:

In [185]:
# for purposes of demonstration, use this block to make sure there isn't already a list file:
import os.path
import os

if os.path.isfile('collection_items_data.csv'):
    os.unlink('collection_items_data.csv')
    print('removed collection_items_data.csv')

# clear row_dict (reset it to an empty dictionary)
row_dict = dict()

removed collection_items_data.csv


In [172]:
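import datetime

# use the module-qualified name: the loop below assigns a variable called `date`,
# which would otherwise shadow the imported class if this cell is re-run
date_string_for_today = datetime.date.today().strftime('%Y-%m-%d') # see https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior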

In [186]:
# set up the containers to create the csv & counters 
# file for csv to read out
collection_info_csv = 'collection_items_data.csv'
file_count = 0
items_written = 0
error_count = 0

# add in a couple of extras for Omeka, including item type and date uploaded

# set up a list for the columns in your csv; in future, this should be more automated but this works for now as you set up the crosswalk
headers = ['item_type', 'date_uploaded', 'source_file', 'item_id', 'title', 'date', 'source_url', 'phys_format', 'dig_format', 'rights', 'image_url']

# now, adapt the previous loop to open each file:
for file in list_of_item_metadata_files:
    file_count += 1
    print('opening',file)
    with open(file, 'r', encoding='utf-8') as item:
        # load the item data
        try:
            item_data = json.load(item)
        except:
            print('error loading',file)
            error_count += 1
            continue

        # extract/name the data you want
        # item type
        item_type = 'Item'
        # date uploaded
        date_uploaded = date_string_for_today
        # for checking purposes, add in the source of the info
        source_file = str(file)
        # make sure there's some unique and stable identifier
        try:
            item_id = item_data['library_of_congress_control_number']
        except KeyError:
            item_id = item_data['url'].split('/')[-2]
        title = item_data['title']
        date = item_data['date']
        source_url = item_data['url']
        try:
            phys_format = item_data['format'][0]
        except (KeyError, IndexError, TypeError):
            phys_format = 'Not found'
        try:
            dig_format = item_data['online_format'][0]
        except (KeyError, IndexError, TypeError):
            dig_format = 'Not found'
        # not written to the CSV yet; .get() returns None instead of raising if the key is missing
        mime_type = item_data.get('mime_type')
        try:
            rights = item_data['rights_information']
        except KeyError:
            rights = 'Undetermined'
        try:
            image_url = item_data['image_url'][3]
        except (KeyError, IndexError, TypeError):
            image_url = 'Did not identify a URL.'

        # dictionary for the rows
        row_dict = dict()

        # look for the item metadata, assign it to the dictionary; 
        # start with some basic elements likely (already enumerated in the headers list) :
        # item type
        row_dict['item_type'] = item_type
        # date uploaded
        row_dict['date_uploaded'] = date_uploaded
        # source filename
        row_dict['source_file'] = source_file
        # identifier
        row_dict['item_id'] = item_id
        # title
        row_dict['title'] = title
        # date
        row_dict['date'] = date
        # link
        row_dict['source_url'] = source_url
        # format
        row_dict['phys_format'] = phys_format
        # digital format
        row_dict['dig_format'] = dig_format.capitalize()
        #rights
        row_dict['rights'] = rights
        #image
        row_dict['image_url'] = image_url

        # write to the csv
        # newline='' lets the csv module control line endings (avoids blank rows on Windows)
        with open(collection_info_csv, 'a', encoding='utf-8', newline='') as fout:
            writer = csv.DictWriter(fout, fieldnames=headers)
            if items_written == 0:
                writer.writeheader()
            writer.writerow(row_dict)
            items_written += 1
            print('adding',item_id)

print('\n\n--- LOG ---')
print('wrote',collection_info_csv)
print('with',items_written,'items')
print(error_count,'errors (info not written)')

opening item_metadata-cph.3b41963.json
adding 91787443
opening item_metadata-cph.3b43255.json
adding 89710983
opening item_metadata-cph.3c18157.json
adding 97511671
opening item_metadata-cph.3f05168.json
adding 98508385
opening item_metadata-cph.3f05183.json
adding 98508155
opening item_metadata-det.4a17925.json
adding 2016809661
opening item_metadata-det.4a23603.json
adding 2016815290
opening item_metadata-ds.06507.json
adding 2014650180
opening item_metadata-ds.06560.json
adding 2014647618
opening item_metadata-fsa.8b14169.json
adding 2017762724
opening item_metadata-fsa.8b32222.json
adding 2017770391
opening item_metadata-fsa.8c22565.json
adding 2017815837
opening item_metadata-fsa.8d24709.json
adding 2017843202
opening item_metadata-g3851e.ct006252.json
adding 87694100
opening item_metadata-hhh.ak0345.photos.json
adding ak0345
opening item_metadata-hhh.dc0121.photos.json
adding dc0121
opening item_metadata-hhh.hi0135.photos.json
adding hi0135
opening item_metadata-hhh.il0998.sheet.
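
The loop above reopens the CSV in append mode for every item, which is why the file has to be deleted before each run. An alternative, sketched below, collects the row dictionaries in a list first and then writes them in one pass with the file opened once in write mode. The function name `write_rows` is invented here, the second row contains placeholder values, and an in-memory `io.StringIO` buffer stands in for the real file so the example runs anywhere.

```python
import csv
import io

headers = ['item_id', 'title', 'date']
rows = [
    {'item_id': '91787443', 'title': 'Carnegie Library, Cordele, Georgia', 'date': '1916'},
    {'item_id': 'ak0345', 'title': 'Placeholder title', 'date': ''},  # made-up values
]

def write_rows(fout, fieldnames, rows):
    """Write a header and all rows to an already-open text file object."""
    # extrasaction='ignore' skips any keys that are not listed in fieldnames
    writer = csv.DictWriter(fout, fieldnames=fieldnames, extrasaction='ignore')
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)

# stands in for open(collection_info_csv, 'w', encoding='utf-8', newline='')
buffer = io.StringIO()
written = write_rows(buffer, headers, rows)
print(written, 'rows written')
print(buffer.getvalue())
```

Because the file would be opened with `'w'`, re-running the cell overwrites the previous output, so no separate clean-up step is needed.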