# Extract and Transform OAI-PMH

This notebook provides starter Python code for retrieving and transforming OAI-PMH metadata.

It starts with a sample object from Wayne State University's collections: https://cdm17409.contentdm.oclc.org/digital/collection/rencen/id/315/rec/3. 

## Endpoint formulation

The OAI-PMH endpoint is formed by taking the base URL of the repository
and appending the `oai` path followed by the `oai.php` script, which handles the API requests.
For the item above, the OAI-PMH endpoint is:

`https://cdm17409.contentdm.oclc.org/oai/oai.php`
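
As a rough sketch, the endpoint can also be derived from any item URL with the standard library. The helper name `oai_endpoint_from_url` is hypothetical, and the `/oai/oai.php` path is the pattern described above for this repository; other repository platforms may serve OAI-PMH from a different path.

```python
from urllib.parse import urlsplit, urlunsplit

def oai_endpoint_from_url(item_url, oai_path='/oai/oai.php'):
    """Return the OAI-PMH endpoint for the repository hosting item_url.

    Assumes the repository serves OAI-PMH at /oai/oai.php (hypothetical helper).
    """
    parts = urlsplit(item_url)
    # keep only the scheme and host; drop the item path, query, and fragment
    return urlunsplit((parts.scheme, parts.netloc, oai_path, '', ''))

item_url = 'https://cdm17409.contentdm.oclc.org/digital/collection/rencen/id/315/rec/3'
print(oai_endpoint_from_url(item_url))
```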

### Item metadata URL formulation

OAI-PMH gives each object a standard identifier, which places the object within a repository and an item set (collection) and assigns it a numerical ID. The elements of the identifier are separated by colons (`:`), but note the slash (`/`) inside the object ID, which separates the set from the item number. A possible formula is as follows (where the plus sign (`+`) indicates string concatenation):

`oai: + repo_ID + : + set/item_ID`

Thus, there are three elements: the `oai` prefix, which designates the main namespace,
a repository identifier, and an object identifier (which includes the set designation).

A sample identifier (for the above item) would be: `oai:cdm17409.contentdm.oclc.org:rencen/3`
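
The formula above can be sketched as a pair of small helpers, one to assemble an identifier and one to split it back apart. The function names are hypothetical and exist only for illustration; they assume the three-part `oai:repo:set/item` shape described in this section.

```python
def build_oai_identifier(repo_id, set_id, item_id):
    """Assemble 'oai:<repo_id>:<set_id>/<item_id>'."""
    return f"oai:{repo_id}:{set_id}/{item_id}"

def parse_oai_identifier(identifier):
    """Split an identifier back into (repo_id, set_id, item_id)."""
    # split on the first two colons only, so the object ID stays intact
    scheme, repo_id, local_id = identifier.split(':', 2)
    if scheme != 'oai':
        raise ValueError(f"not an OAI identifier: {identifier!r}")
    # the object ID holds the set and the item number, separated by a slash
    set_id, item_id = local_id.split('/', 1)
    return repo_id, set_id, item_id

identifier = build_oai_identifier('cdm17409.contentdm.oclc.org', 'rencen', '3')
print(identifier)
print(parse_oai_identifier(identifier))
```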

In [167]:
import requests
from lxml import etree

## Get the data for one item

This is mostly for testing and illustration.
Requesting the data for a single item lets you inspect the structure of the response and the fields it contains, which is a useful reference for the later steps.
The following code retrieves the item metadata for item `3` in the `rencen` set from the above OAI endpoint.

**Item information:**

- Item set: `rencen`
- Item ID (within set): `3`
- Goal URL for the request: https://cdm17409.contentdm.oclc.org/oai/oai.php?verb=GetRecord&identifier=oai:cdm17409.contentdm.oclc.org:rencen/3&metadataPrefix=oai_dc

In [52]:
oai_set = 'rencen'  # named oai_set to avoid shadowing Python's built-in set
item_ID = '3'
endpoint = 'https://cdm17409.contentdm.oclc.org/oai/oai.php'
verb = 'GetRecord'
repo_ID = 'cdm17409.contentdm.oclc.org'

In [53]:
parameters = {
    'identifier': 'oai:' + repo_ID + ':' + oai_set + '/' + item_ID,
    'verb': verb,
    'metadataPrefix': 'oai_dc'
}

In [54]:
# make the request
item = requests.get(endpoint, params=parameters)

In [55]:
item.url

'https://cdm17409.contentdm.oclc.org/oai/oai.php?identifier=oai%3Acdm17409.contentdm.oclc.org%3Arencen%2F3&verb=GetRecord&metadataPrefix=oai_dc'
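
The URL built by `requests` differs from the goal URL listed earlier only in parameter order and percent-encoding (`:` becomes `%3A` and `/` becomes `%2F`). A quick check with the standard library, included here only as an illustration, shows that both query strings decode to the same parameters:

```python
from urllib.parse import urlsplit, parse_qs

goal_url = ('https://cdm17409.contentdm.oclc.org/oai/oai.php'
            '?verb=GetRecord'
            '&identifier=oai:cdm17409.contentdm.oclc.org:rencen/3'
            '&metadataPrefix=oai_dc')
requested_url = ('https://cdm17409.contentdm.oclc.org/oai/oai.php'
                 '?identifier=oai%3Acdm17409.contentdm.oclc.org%3Arencen%2F3'
                 '&verb=GetRecord&metadataPrefix=oai_dc')

# parse_qs decodes percent-escapes, so both queries yield the same mapping
goal_params = parse_qs(urlsplit(goal_url).query)
requested_params = parse_qs(urlsplit(requested_url).query)

print(goal_params == requested_params)
print(requested_params['identifier'])
```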

In [25]:
r_metadata = etree.fromstring(item.content)

In [26]:
for element in r_metadata:
    print(element.tag, element.attrib)

{http://www.openarchives.org/OAI/2.0/}responseDate {}
{http://www.openarchives.org/OAI/2.0/}request {'verb': 'GetRecord', 'identifier': 'oai:cdm17409.contentdm.oclc.org:rencen/3', 'metadataPrefix': 'oai_dc'}
{http://www.openarchives.org/OAI/2.0/}GetRecord {}


In [27]:
# find the data:
ns = {
    'oai': 'http://www.openarchives.org/OAI/2.0/'
}

metadata = r_metadata.findall('.//oai:metadata', ns)

# use a separate loop variable so the `item` response object is not overwritten
for record in metadata:
    print(etree.tostring(record, encoding='utf-8').decode('utf-8'))

<metadata xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>View of the atrium in the completed Renaissance Center.</dc:title>
<dc:description>Image of the atrium within the Renaissance Center, looking up at a large concrete structure. Trees, plants, and people can be seen in this picture of a restaurant.</dc:description>
<dc:identifier>rencen_03e</dc:identifier>
<dc:date>1977</dc:date>
<dc:type>still image</dc:type>
<dc:format>photographs</dc:format>
<dc:subject>Renaissance Center (Detroit, Mich.)</dc:subject>
<dc:coverage>Detroit, Michigan</dc:coverage>
<dc:coverage>1970s</dc:coverage>
<dc:relation>Building the Detroit Renaissance Center</dc:relatio

## Extract IDs for all items in the set

The OAI-PMH protocol allows for the retrieval of information about all
items in a set, using the `ListIdentifiers` verb.

For large sets, usually over 200 items, the response is paginated.
The next page can be requested using the `resumptionToken`
provided at the end of each response.
See https://gist.github.com/rlskoeser/880a6f9f20bbaf9202fb for ideas on how to use the token.
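
The pagination logic can be tried offline before touching a live endpoint. The sketch below is only an illustration: the two XML pages are made up, the names `PAGES`, `parse_page`, and `harvest` are hypothetical, and the token value `page-2` is a placeholder. It uses the standard library's `xml.etree.ElementTree` so it runs without extra packages; the `find` and `iterfind` calls used here are also available in `lxml.etree`. In OAI-PMH, the last page of a paginated list carries an empty `resumptionToken` element, so both a missing and an empty element are treated as the end of the list.

```python
import xml.etree.ElementTree as ET

OAI = '{http://www.openarchives.org/OAI/2.0/}'

# two made-up pages standing in for server responses, keyed by the token
# that would be sent to request them (None = the initial request)
PAGES = {
    None: '''<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
      <ListIdentifiers>
        <header><identifier>oai:example.org:demo/0</identifier></header>
        <header><identifier>oai:example.org:demo/1</identifier></header>
        <resumptionToken>page-2</resumptionToken>
      </ListIdentifiers>
    </OAI-PMH>''',
    'page-2': '''<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
      <ListIdentifiers>
        <header><identifier>oai:example.org:demo/2</identifier></header>
        <resumptionToken/>
      </ListIdentifiers>
    </OAI-PMH>''',
}

def parse_page(xml_text):
    """Return (identifiers, next_token) for one ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.iterfind(f'.//{OAI}identifier')]
    token_el = root.find(f'.//{OAI}resumptionToken')
    # a missing element and an empty element both mean "no more pages"
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token

def harvest(fetch):
    """Collect identifiers page by page; fetch(token) returns the XML text."""
    identifiers, token = [], None
    while True:
        ids, token = parse_page(fetch(token))
        identifiers.extend(ids)
        if token is None:
            break
    return identifiers

all_ids = harvest(PAGES.get)
print(all_ids)
```

Passing the fetch step in as a function keeps the loop itself independent of the network, so the same loop could later be given a function that calls `requests.get`.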

In [58]:
endpoint = 'https://cdm17409.contentdm.oclc.org/oai/oai.php'

oai_set = 'rencen'  # avoids shadowing Python's built-in set
verb = 'ListIdentifiers'
resumptionToken = None

In [None]:
# jajohnst's initial code - close but not really working
# doesn't handle the while loop exit correctly,
# doesn't correctly update the 
def get_oai_set_ids(oai_url, oai_set=None, verb='ListIdentifiers'):
    '''This function takes an endpoint URL for an OAI-PMH repository
    and then requests all of the items in a given set. If no set is provided,
    then the set is assumed to be None. For large sets, the function will 
    use the response's resumptionToken.
    
    Returns a list of identifiers for the set.'''

    identifiers = list()

    parameters = {
        'verb': verb,
        'metadataPrefix': 'oai_dc'
        }
    if oai_set is not None:
        parameters['set'] = oai_set

    # make the requests
    r = requests.get(oai_url, params=parameters)

    # parse the responses
    response_xml = etree.fromstring(r.content)

    # get the Token
    resumptionToken = response_xml.find('.//{http://www.openarchives.org/OAI/2.0/}resumptionToken').text
    print('...',resumptionToken)

    while resumptionToken:
        for identifier in response_xml.iterfind('.//{http://www.openarchives.org/OAI/2.0/}identifier'):
            identifiers.append(identifier.text)

'''
    # if resumptionToken, recurse
    if resumptionToken is not None:
        # get the identifiers
        for identifier in response_xml.iterfind('.//{http://www.openarchives.org/OAI/2.0/}identifier'):
            identifiers.append(identifier.text)
        parameters['resumptionToken'] = resumptionToken

    # if no resumptionToken, just get the identifiers
    else:
        for identifier in response_xml.iterfind('.//{http://www.openarchives.org/OAI/2.0/}identifier'):
            identifiers.append(identifier.text)
'''            
    # record and return the identifiers
    return identifiers

In [107]:
# changed and updated by ClaudeAI to add an exit condition for the while loop
def get_oai_set_ids(oai_url, oai_set=None, verb='ListIdentifiers'):
    '''This function takes an endpoint URL for an OAI-PMH repository
    and then requests all of the items in a given set. If no set is provided,
    then the set is assumed to be None. For large sets, the function will 
    use the response's resumptionToken.
    
    Returns a list of identifiers for the set.'''

    identifiers = list()

    parameters = {
        'verb': verb,
        'metadataPrefix': 'oai_dc'
    }
    if oai_set is not None:
        parameters['set'] = oai_set

    # make the initial request
    r = requests.get(oai_url, params=parameters)
    print('initial request ...',r.url)
    response_xml = etree.fromstring(r.content)

    # process first response
    for identifier in response_xml.iterfind('.//{http://www.openarchives.org/OAI/2.0/}identifier'):
        identifiers.append(identifier.text)
    
    # get the initial resumptionToken
    resumption_token_element = response_xml.find('.//{http://www.openarchives.org/OAI/2.0/}resumptionToken')
    # handle None responses, if no resumptionToken
    resumptionToken = resumption_token_element.text if resumption_token_element is not None else None

    while resumptionToken:
        # update parameters with resumptionToken
        parameters = {
            'verb': verb,
            'resumptionToken': resumptionToken
        }

        # make the next request
        r = requests.get(oai_url, params=parameters)
        print('requesting ...',resumptionToken)
        response_xml = etree.fromstring(r.content)

        # process the response
        for identifier in response_xml.iterfind('.//{http://www.openarchives.org/OAI/2.0/}identifier'):
            identifiers.append(identifier.text)

        # get the next resumptionToken
        resumption_token_element = response_xml.find('.//{http://www.openarchives.org/OAI/2.0/}resumptionToken')
        
        # If no resumption token, set to None to exit the loop
        resumptionToken = resumption_token_element.text if resumption_token_element is not None else None

    # return the identifiers
    return identifiers

In [108]:
identifiers = get_oai_set_ids(endpoint, oai_set='rencen')

initial request ... https://cdm17409.contentdm.oclc.org/oai/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=rencen
requesting ... rencen:200:rencen:0000-00-00:9999-99-99:oai_dc


In [109]:
len(identifiers)

323

Looks like it works!

## Request metadata for each of the items

This function uses the list of identifiers to request the full metadata record for each item.
It collects the results in a dictionary keyed by identifier, which can then be written to CSV or JSON.


In [161]:
def oai_get_item_info_from_list(endpoint, identifiers, verb='GetRecord', metadataPrefix='oai_dc'):
    '''Function to get full item metadata for a list of supplied identifiers.
    Requires lxml to parse XML responses.
    Supply: 
      - endpoint (a valid OAI-PMH endpoint URL)
      - identifiers (python list)
      - verb defaults to 'GetRecord' but another 'verb' argument may be provided
      - metadataPrefix defaults to 'oai_dc' but may be provided

    Returns data in a dictionary that can be converted into CSV or JSON.'''

    set_metadata_dict = dict()

    parameters = {
        'verb': verb,
        'metadataPrefix': metadataPrefix
    }

    # namespaces for XML parsing
    ns = {
        'oai': 'http://www.openarchives.org/OAI/2.0/',
        'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
        'dc': 'http://purl.org/dc/elements/1.1/'
    }

    # make requests
    for identifier in identifiers:
        parameters['identifier'] = identifier
        r = requests.get(endpoint, params=parameters)
        print(f"Retrieving ... {identifier}")
        
        # Check if request was successful
        if r.status_code != 200:
            print(f"Error retrieving {identifier}: Status code {r.status_code}")
            continue

        try:
            response_xml = etree.fromstring(r.content)
            
            # Find the metadata section
            item_data = response_xml.find('.//oai:metadata', namespaces=ns)
            
            if item_data is None:
                print(f"No metadata found for {identifier}")
                continue

            # Create a dictionary for this specific identifier
            item_metadata = {}
            
            # Iterate through all DC fields
            for dc_field in item_data.iterfind('.//dc:*', namespaces=ns):
                # Extract the local name (without namespace)
                local_name = etree.QName(dc_field.tag).localname
                
                # Create a dcterms-prefixed key; the source elements are in the
                # dc/elements/1.1 namespace, so the prefix is a relabeling
                key = f"dcterms:{local_name}"
                
                # Handle multiple values for the same field
                if key in item_metadata:
                    # If the key already exists, convert to list or append
                    if isinstance(item_metadata[key], list):
                        item_metadata[key].append(dc_field.text)
                    else:
                        item_metadata[key] = [item_metadata[key], dc_field.text]
                else:
                    item_metadata[key] = dc_field.text
            
            # Store the metadata for this identifier
            set_metadata_dict[identifier] = item_metadata

        except etree.XMLSyntaxError:
            print(f"XML parsing error for {identifier}")
        except Exception as e:
            print(f"Unexpected error processing {identifier}: {e}")

    return set_metadata_dict

In [158]:
oai_get_item_info_from_list(endpoint=endpoint, identifiers=identifiers[:5])

{'oai:cdm17409.contentdm.oclc.org:rencen/0': {'dcterms:title': 'View of ongoing construction of the Renaissance Center towers',
  'dcterms:description': 'Image of the building progress on two of the four towers of the Renaissance Center.  Construction equipment is clearing visible.',
  'dcterms:identifier': ['rencen_05b',
   'http://cdm17409.contentdm.oclc.org/cdm/ref/collection/rencen/id/0'],
  'dcterms:date': ['1973', '1973'],
  'dcterms:type': 'still image',
  'dcterms:format': 'photographs',
  'dcterms:subject': 'Construction industry; Construction equipment; Scaffolding; Cranes, derricks, etc.; Renaissance Center (Detroit, Mich.)',
  'dcterms:coverage': ['Detroit, Michigan', '1970s'],
  'dcterms:relation': 'Building the Detroit Renaissance Center',
  'dcterms:rights': 'Users can cite and link to these materials without obtaining permission. Users can also use the materials for non-commercial educational and research purposes in accordance with fair use. For other uses or to obtain
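
In the output above, single-valued fields are stored as strings while repeated fields are stored as lists. If a uniform shape is easier to work with downstream, each item can be normalized so that every value is a list. This is a minimal sketch under that assumption; the helper name `as_lists` is hypothetical and the sample item is abbreviated.

```python
def as_lists(item_metadata):
    """Return a copy of one item's metadata with every value wrapped in a list."""
    normalized = {}
    for key, value in item_metadata.items():
        if isinstance(value, list):
            normalized[key] = list(value)   # already a list: copy it
        elif value is None:
            normalized[key] = []            # empty element: no values
        else:
            normalized[key] = [value]       # single value: wrap it
    return normalized

sample_item = {
    'dcterms:type': 'still image',
    'dcterms:coverage': ['Detroit, Michigan', '1970s'],
}
print(as_lists(sample_item))
```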

In [162]:
rencen_set_info = oai_get_item_info_from_list(endpoint=endpoint, identifiers=identifiers)

Retrieving ... oai:cdm17409.contentdm.oclc.org:rencen/0
Retrieving ... oai:cdm17409.contentdm.oclc.org:rencen/1
Retrieving ... oai:cdm17409.contentdm.oclc.org:rencen/2
Retrieving ... oai:cdm17409.contentdm.oclc.org:rencen/3
Retrieving ... oai:cdm17409.contentdm.oclc.org:rencen/4
Retrieving ... oai:cdm17409.contentdm.oclc.org:rencen/5
Retrieving ... oai:cdm17409.contentdm.oclc.org:rencen/6
Retrieving ... oai:cdm17409.contentdm.oclc.org:rencen/7
Retrieving ... oai:cdm17409.contentdm.oclc.org:rencen/8
Retrieving ... oai:cdm17409.contentdm.oclc.org:rencen/9
Retrieving ... oai:cdm17409.contentdm.oclc.org:rencen/10
Retrieving ... oai:cdm17409.contentdm.oclc.org:rencen/11
Retrieving ... oai:cdm17409.contentdm.oclc.org:rencen/12
Retrieving ... oai:cdm17409.contentdm.oclc.org:rencen/13
Retrieving ... oai:cdm17409.contentdm.oclc.org:rencen/14
Retrieving ... oai:cdm17409.contentdm.oclc.org:rencen/15
Retrieving ... oai:cdm17409.contentdm.oclc.org:rencen/16
Retrieving ... oai:cdm17409.contentdm.ocl

### Save the dict

Save the dictionary in case you want to use it later as a feeder file or for further transformation.

In [166]:
# write the data to a local file for reference
import json

metadata_file = 'rencen_set_info.json'

with open(metadata_file, 'w', encoding='utf-8') as f:
    json.dump(rencen_set_info, f)
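
If a spreadsheet-friendly file is preferred over JSON, the same dictionary can be flattened to CSV. The following is a minimal sketch, assuming the `{identifier: {field: value-or-list}}` shape returned above. The function name `metadata_dict_to_csv`, the `oai_identifier` column name, and the `'; '` separator for repeated values are illustrative choices, and the sample data is made up.

```python
import csv
import io

def metadata_dict_to_csv(set_metadata, out, sep='; '):
    """Write {identifier: {field: value-or-list}} as CSV rows to file-like `out`."""
    # collect every field name that occurs, preserving first-seen order
    fieldnames = ['oai_identifier']
    for fields in set_metadata.values():
        for key in fields:
            if key not in fieldnames:
                fieldnames.append(key)

    # restval fills in fields that a given item does not have
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval='')
    writer.writeheader()
    for identifier, fields in set_metadata.items():
        row = {'oai_identifier': identifier}
        for key, value in fields.items():
            # repeated fields are stored as lists; join them into one cell
            if isinstance(value, list):
                value = sep.join(v or '' for v in value)
            row[key] = value if value is not None else ''
        writer.writerow(row)

sample = {
    'oai:example.org:demo/0': {
        'dcterms:title': 'First item',
        'dcterms:coverage': ['Detroit, Michigan', '1970s'],
    },
    'oai:example.org:demo/1': {'dcterms:title': 'Second item'},
}

buffer = io.StringIO()
metadata_dict_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write a file instead of an in-memory buffer, open it with `open('rencen_set_info.csv', 'w', newline='', encoding='utf-8')` and pass the file object as `out`. Note that some values, such as the subject field in this set, already contain `; ` internally, so joined cells cannot always be split back into the original values.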