# Playing with OAI-PMH

Some possible exercises with OAI-PMH

In [2]:
import requests
import xml.etree.ElementTree as ET 

This notebook uses the Renaissance Center set from Wayne State Digital Collections.
See https://cdm17409.contentdm.oclc.org/digital/collection/rencen/search

In [3]:
endpoint = 'https://cdm17409.contentdm.oclc.org/oai/oai.php'

Use the initial, basic OAI verbs:

* **Identify** - provide information about the repository
* **ListSets** - list the sets of content that are available in the repository
* **ListRecords** - provide metadata for all of the records in a given format from a given set
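
Each verb is sent as a `verb` query parameter on the same base URL. As a minimal sketch using only the standard library, the three requests used in this notebook could be spelled out as URLs like this. The helper name `oai_url` is hypothetical and exists only for illustration; it is not part of `requests` or any OAI-PMH library.

```python
from urllib.parse import urlencode

endpoint = 'https://cdm17409.contentdm.oclc.org/oai/oai.php'

def oai_url(base, verb, **arguments):
    """Build an OAI-PMH request URL (hypothetical helper for illustration)."""
    return base + '?' + urlencode({'verb': verb, **arguments})

print(oai_url(endpoint, 'Identify'))
print(oai_url(endpoint, 'ListSets'))
print(oai_url(endpoint, 'ListRecords', set='rencen', metadataPrefix='oai_dc'))
```

Passing the same key/value pairs through the `params` argument of `requests.get`, as the cells below do, produces the same query string.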

## Identify

In [4]:
identify = requests.get(endpoint, params={'verb':'Identify'})

In [5]:
identify.status_code

200

In [6]:
identify.text[:200]

'<?xml version="1.0" encoding="UTF-8"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.'

In [7]:
root = ET.fromstring(identify.text)
# ET.fromstring() returns the root Element directly, so no .getroot() call is needed

In [8]:
type(root)

xml.etree.ElementTree.Element

In [9]:
root.tag

'{http://www.openarchives.org/OAI/2.0/}OAI-PMH'

In [10]:
ns = {
    'oai': 'http://www.openarchives.org/OAI/2.0/'
}

In [11]:
for element in root.iter():
    print(element.tag)

{http://www.openarchives.org/OAI/2.0/}OAI-PMH
{http://www.openarchives.org/OAI/2.0/}responseDate
{http://www.openarchives.org/OAI/2.0/}request
{http://www.openarchives.org/OAI/2.0/}Identify
{http://www.openarchives.org/OAI/2.0/}repositoryName
{http://www.openarchives.org/OAI/2.0/}baseURL
{http://www.openarchives.org/OAI/2.0/}protocolVersion
{http://www.openarchives.org/OAI/2.0/}adminEmail
{http://www.openarchives.org/OAI/2.0/}earliestDatestamp
{http://www.openarchives.org/OAI/2.0/}deletedRecord
{http://www.openarchives.org/OAI/2.0/}granularity


In [12]:
Identity = root.find('oai:Identify', namespaces=ns)

In [13]:
Identity.tag

'{http://www.openarchives.org/OAI/2.0/}Identify'

In [14]:
for element in Identity:
    print(element.tag, element.text)

{http://www.openarchives.org/OAI/2.0/}repositoryName CONTENTdm Server Repository
{http://www.openarchives.org/OAI/2.0/}baseURL http://cdm17409.contentdm.oclc.org/oai/oai.php
{http://www.openarchives.org/OAI/2.0/}protocolVersion 2.0
{http://www.openarchives.org/OAI/2.0/}adminEmail digitalcollections@wayne.edu
{http://www.openarchives.org/OAI/2.0/}earliestDatestamp 2022-05-04
{http://www.openarchives.org/OAI/2.0/}deletedRecord transient
{http://www.openarchives.org/OAI/2.0/}granularity YYYY-MM-DD


Dump all of that into a dictionary:

In [15]:
Repo_Identity_Info = dict()

for element in Identity:
    Repo_Identity_Info[element.tag.split('}')[1]] = element.text

In [16]:
Repo_Identity_Info

{'repositoryName': 'CONTENTdm Server Repository',
 'baseURL': 'http://cdm17409.contentdm.oclc.org/oai/oai.php',
 'protocolVersion': '2.0',
 'adminEmail': 'digitalcollections@wayne.edu',
 'earliestDatestamp': '2022-05-04',
 'deletedRecord': 'transient',
 'granularity': 'YYYY-MM-DD'}

Now, any time you need to refer to the basic information about this Repository,
you can refer back to the `Repo_Identity_Info`.
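
The dictionary above depends on removing the `{namespace}` prefix from each tag. A small sketch of that step as a reusable function follows, run against an abbreviated sample response modeled on the output above rather than a live request. The name `local_name` is a hypothetical helper. Unlike `split('}')[1]`, it also accepts a tag that has no namespace.

```python
import xml.etree.ElementTree as ET

# Abbreviated Identify response, modeled on the output above
sample = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    '<Identify>'
    '<repositoryName>CONTENTdm Server Repository</repositoryName>'
    '<protocolVersion>2.0</protocolVersion>'
    '<granularity>YYYY-MM-DD</granularity>'
    '</Identify>'
    '</OAI-PMH>'
)
ns = {'oai': 'http://www.openarchives.org/OAI/2.0/'}

def local_name(tag):
    """Return the tag without its '{namespace}' prefix (hypothetical helper)."""
    return tag.rsplit('}', 1)[-1]

identify_element = ET.fromstring(sample).find('oai:Identify', ns)

# One dictionary entry per child element of <Identify>
info = {local_name(child.tag): child.text for child in identify_element}
print(info)
```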

## ListSets

Now, list the sets...

In [17]:
sets = requests.get(Repo_Identity_Info['baseURL'], params={'verb': 'ListSets'})

In [18]:
sets.status_code

200

In [19]:
sets.text[:100]

'<?xml version="1.0" encoding="UTF-8"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xs'

In [20]:
Repo_Sets = ET.fromstring(sets.text)

In [21]:
for element in Repo_Sets:
    print(element.tag)

{http://www.openarchives.org/OAI/2.0/}responseDate
{http://www.openarchives.org/OAI/2.0/}request
{http://www.openarchives.org/OAI/2.0/}ListSets


In [22]:
SetList = Repo_Sets.find('oai:ListSets', namespaces=ns)

Each set has an element called `setSpec`, a unique string that identifies the set,
and a `setName`, a human-readable name. Let's pull those into a JSON file:

In [23]:
len(SetList)

26

It looks like there are 26 sets!

In [24]:
# oai_set avoids shadowing Python's built-in set type
for oai_set in SetList:
    setID = oai_set.find('oai:setSpec', namespaces=ns).text
    name  = oai_set.find('oai:setName', namespaces=ns).text
    print(setID, ':', name)

vmc : Virtual Motor City
rencen : Building the Detroit Renaissance Center
auto-industry : Changing Face of the Auto Industry
cooper : Dennis Glen Cooper Collection
det-focus : Detroit Focus Quarterly
det-sun-jrnl : Detroit Sunday Journal
digital-dress : Digital Dress Collection
eloise-ramsey : Eloise Ramsey Collection of Literature for Young People
first-heart : First U.S. Human-to-Human Heart Transplant
nightingale : Florence Nightingale Collection
herman-miller : Herman Miller Consortium Collection
lgbt-detroit : LGBT Detroit Records
lincoln-ltrs : The Lincoln Letters
made-in-mich : Made in Michigan Writers Series
mot-programs : Michigan Opera Theatre Archive Programs Collection
mot-images : Michigan Opera Theatre Performance Images
cass-gilbert : Selected Cass Gilbert Architectural Drawings of the Detroit Public Library
shakespeare : Shakespeare Lear Project
toni-swanger : Toni Swanger Papers
ufw-image : United Farm Workers Image Gallery
van-riper : Van Riper Family Correspondence
w

In [25]:
import json

In [26]:
setList = list()

# enumerate() supplies the running count; oai_set avoids shadowing the built-in set
for setCount, oai_set in enumerate(SetList):
    setID = oai_set.find('oai:setSpec', namespaces=ns).text
    setInfo = {
        'number': setCount,
        'setID' : setID,
        'name'  : oai_set.find('oai:setName', namespaces=ns).text
    }
    setList.append(setInfo)
    print(f'added {setID}')

# mode 'w' overwrites the file; appending ('a') on a re-run would leave invalid JSON
with open('setList.json', 'w', encoding='utf-8') as f:
    json.dump(setList, f, indent=2)

added vmc
added rencen
added auto-industry
added cooper
added det-focus
added det-sun-jrnl
added digital-dress
added eloise-ramsey
added first-heart
added nightingale
added herman-miller
added lgbt-detroit
added lincoln-ltrs
added made-in-mich
added mot-programs
added mot-images
added cass-gilbert
added shakespeare
added toni-swanger
added ufw-image
added van-riper
added wayne-open
added wsu-buildings
added wsu-life
added wpa-music
added dte-aerial


Now, you can load set information from the file `setList.json`.

In [27]:
# the with block closes the file once the JSON has been read
with open('setList.json', encoding='utf-8') as f:
    saved_setList = json.load(f)

for item in saved_setList:
    print(item['setID'])

vmc
rencen
auto-industry
cooper
det-focus
det-sun-jrnl
digital-dress
eloise-ramsey
first-heart
nightingale
herman-miller
lgbt-detroit
lincoln-ltrs
made-in-mich
mot-programs
mot-images
cass-gilbert
shakespeare
toni-swanger
ufw-image
van-riper
wayne-open
wsu-buildings
wsu-life
wpa-music
dte-aerial
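
Once the set information is a list of dictionaries, it can be convenient to index it by `setID`. The sketch below uses three entries copied from the output above instead of reading `setList.json`, so it runs on its own. The name `sets_by_id` is an assumption introduced here.

```python
# A few entries in the same shape as the records written to setList.json
saved_setList = [
    {'number': 0, 'setID': 'vmc', 'name': 'Virtual Motor City'},
    {'number': 1, 'setID': 'rencen', 'name': 'Building the Detroit Renaissance Center'},
    {'number': 2, 'setID': 'auto-industry', 'name': 'Changing Face of the Auto Industry'},
]

# Index the list by setSpec so a set name can be found without looping
sets_by_id = {item['setID']: item['name'] for item in saved_setList}

print(sets_by_id['rencen'])
```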


## ListRecords

Finally, use the verb `ListRecords` to view information for each of the items in a given set. 
This one can be slightly more complicated, since some repositories have hundreds of items
in a given set and the OAI-PMH endpoint will typically return them in paginated
results. To work through the individual pages of the response, you will need to look
for the `resumptionToken` element, whose value can be sent back to the endpoint
as a request parameter to receive the next page of results.

This is also more complex because these records are typically shared as Dublin Core
fields, so the namespace dictionary needs entries for the additional namespaces.

In [28]:
Records = requests.get(Repo_Identity_Info['baseURL'], 
                       params={
                           'verb': 'ListRecords', 
                           'set': 'rencen', 
                           'metadataPrefix': 'oai_dc'
                           })

In [29]:
Records.url

'http://cdm17409.contentdm.oclc.org/oai/oai.php?verb=ListRecords&set=rencen&metadataPrefix=oai_dc'

In [30]:
Records.status_code

200

In [31]:
Records.text[:100]

'<?xml version="1.0" encoding="UTF-8"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xs'

In [32]:
recordRoot = ET.fromstring(Records.text)

for element in recordRoot:
    print(element.tag)

{http://www.openarchives.org/OAI/2.0/}responseDate
{http://www.openarchives.org/OAI/2.0/}request
{http://www.openarchives.org/OAI/2.0/}ListRecords


In [33]:
records = recordRoot.find('oai:ListRecords', ns)

In [34]:
len(records)

201

Take a look at the first item in the list of records to get an idea of the structure:

In [35]:
item = ET.tostring(records[0], default_namespace='http://www.openarchives.org/OAI/2.0/', xml_declaration=True, encoding='utf-8').decode('utf-8')
print(item)

<?xml version='1.0' encoding='utf-8'?>
<record xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ns1="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><header><identifier>oai:cdm17409.contentdm.oclc.org:rencen/0</identifier><datestamp>2023-03-20</datestamp><setSpec>rencen</setSpec></header>
<metadata>
<ns1:dc xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>View of ongoing construction of the Renaissance Center towers</dc:title>
<dc:description>Image of the building progress on two of the four towers of the Renaissance Center.  Construction equipment is clearing visible.</dc:description>
<dc:identifier>rencen_05b</dc:identifier>
<dc:date>1973</dc:date>
<dc:type>still image</dc:type>
<dc:format>photographs</dc:format>
<dc:coverage>Detroit, Michigan</dc:coverage>
<dc:coverage>1970s</dc:coverage>
<dc:relation>Building the 

Write out for inspection:

In [36]:
print(str(item))

<?xml version='1.0' encoding='utf-8'?>
<record xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ns1="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><header><identifier>oai:cdm17409.contentdm.oclc.org:rencen/0</identifier><datestamp>2023-03-20</datestamp><setSpec>rencen</setSpec></header>
<metadata>
<ns1:dc xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>View of ongoing construction of the Renaissance Center towers</dc:title>
<dc:description>Image of the building progress on two of the four towers of the Renaissance Center.  Construction equipment is clearing visible.</dc:description>
<dc:identifier>rencen_05b</dc:identifier>
<dc:date>1973</dc:date>
<dc:type>still image</dc:type>
<dc:format>photographs</dc:format>
<dc:coverage>Detroit, Michigan</dc:coverage>
<dc:coverage>1970s</dc:coverage>
<dc:relation>Building the 

One thing that the individual record suggests is that we will need a more complete namespace dictionary. Let's add in the new namespaces:

In [37]:
ns = {
    'oai'   : 'http://www.openarchives.org/OAI/2.0/',
    'dc'    : 'http://purl.org/dc/elements/1.1/',
    'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/'
}

This would also be a good chance to practice your XPath! 

Let's try it. First, loop through the records and find the titles and identifiers; then
find all of the Dublin Core metadata elements.

In [38]:
for elem in recordRoot.iter():
    print(elem.tag)

{http://www.openarchives.org/OAI/2.0/}OAI-PMH
{http://www.openarchives.org/OAI/2.0/}responseDate
{http://www.openarchives.org/OAI/2.0/}request
{http://www.openarchives.org/OAI/2.0/}ListRecords
{http://www.openarchives.org/OAI/2.0/}record
{http://www.openarchives.org/OAI/2.0/}header
{http://www.openarchives.org/OAI/2.0/}identifier
{http://www.openarchives.org/OAI/2.0/}datestamp
{http://www.openarchives.org/OAI/2.0/}setSpec
{http://www.openarchives.org/OAI/2.0/}metadata
{http://www.openarchives.org/OAI/2.0/oai_dc/}dc
{http://purl.org/dc/elements/1.1/}title
{http://purl.org/dc/elements/1.1/}description
{http://purl.org/dc/elements/1.1/}identifier
{http://purl.org/dc/elements/1.1/}date
{http://purl.org/dc/elements/1.1/}type
{http://purl.org/dc/elements/1.1/}format
{http://purl.org/dc/elements/1.1/}coverage
{http://purl.org/dc/elements/1.1/}coverage
{http://purl.org/dc/elements/1.1/}relation
{http://purl.org/dc/elements/1.1/}rights
{http://purl.org/dc/elements/1.1/}date
{http://purl.org/dc/

In [39]:
for title in recordRoot.iter('{http://purl.org/dc/elements/1.1/}title'):
    print(title)

<Element '{http://purl.org/dc/elements/1.1/}title' at 0x7fa33002fe50>
<Element '{http://purl.org/dc/elements/1.1/}title' at 0x7fa330037770>
<Element '{http://purl.org/dc/elements/1.1/}title' at 0x7fa330037db0>
<Element '{http://purl.org/dc/elements/1.1/}title' at 0x7fa330038450>
<Element '{http://purl.org/dc/elements/1.1/}title' at 0x7fa330038a90>
<Element '{http://purl.org/dc/elements/1.1/}title' at 0x7fa33003a130>
<Element '{http://purl.org/dc/elements/1.1/}title' at 0x7fa33003a770>
<Element '{http://purl.org/dc/elements/1.1/}title' at 0x7fa33003adb0>
<Element '{http://purl.org/dc/elements/1.1/}title' at 0x7fa33003e450>
<Element '{http://purl.org/dc/elements/1.1/}title' at 0x7fa33003ea90>
<Element '{http://purl.org/dc/elements/1.1/}title' at 0x7fa330040130>
<Element '{http://purl.org/dc/elements/1.1/}title' at 0x7fa330040770>
<Element '{http://purl.org/dc/elements/1.1/}title' at 0x7fa330040db0>
<Element '{http://purl.org/dc/elements/1.1/}title' at 0x7fa330042450>
<Element '{http://pu

In [40]:
ns

{'oai': 'http://www.openarchives.org/OAI/2.0/',
 'dc': 'http://purl.org/dc/elements/1.1/',
 'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/'}

Use XPath to identify elements in the tree; the title, for example:

In [41]:
for record in records:
    title = record.find('oai:metadata/oai_dc:dc/dc:title', ns).text
    print(title)

View of ongoing construction of the Renaissance Center towers
View of the ground floor atrium in the completed Renaissance Center.
The Renaissance Center towers as viewed from Jefferson Avenue
View of the atrium in the completed Renaissance Center.
View of an unidentified building near the Renaissance Center construction site
View of the Renaissance Center construction site with earth moving equipment
View of unidentified buildings near the Renaissance Center construction site
The Renaissance Center towers with derricks
View of the future site of Hart Plaza and a drill rig, south of the Renaissance Center site
View of support beams and construction equipment at the Renaissance Center building site
View of construction workers preparing the building site of the Renaissance Center
View of early stages of construction of the Renaissance Center complex of buildings
View of the Lafayette Tool and Die parking lot near the Renaissance Center construction site
View of Jefferson Avenue near the

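AttributeError: 'NoneType' object has no attribute 'text'

The loop fails when it reaches a child of `ListRecords` that has no `dc:title`. The last child is the `resumptionToken` element rather than a `record`, so `find()` returns `None` for it, and `None` has no `.text`.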
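
One way to avoid the `AttributeError` above is to loop over `oai:record` children only and to use `findtext()`, which returns a default value instead of `None`. The sketch below runs against a small made-up response (the identifiers `oai:example:rencen/0` and `oai:example:rencen/1` are sample data, not real records), so no request is needed.

```python
import xml.etree.ElementTree as ET

ns = {
    'oai': 'http://www.openarchives.org/OAI/2.0/',
    'dc': 'http://purl.org/dc/elements/1.1/',
    'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
}

# Sample ListRecords response: one full record, one header-only record, one token
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
<ListRecords>
<record><header><identifier>oai:example:rencen/0</identifier></header>
<metadata><oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
 xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title>The Renaissance Center towers with derricks</dc:title>
</oai_dc:dc></metadata></record>
<record><header status="deleted"><identifier>oai:example:rencen/1</identifier></header></record>
<resumptionToken>rencen:200:rencen:0000-00-00:9999-99-99:oai_dc</resumptionToken>
</ListRecords>
</OAI-PMH>"""

records = ET.fromstring(sample).find('oai:ListRecords', ns)

titles = []
# findall('oai:record') skips the resumptionToken child
for record in records.findall('oai:record', ns):
    # findtext() returns the default instead of raising when the path is missing
    title = record.findtext('oai:metadata/oai_dc:dc/dc:title',
                            default='[no title]', namespaces=ns)
    titles.append(title)
    print(title)
```

The header-only record stands in for a deleted record, which carries no `metadata` element; this repository reports `deletedRecord` as `transient`, so such records may appear.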

Build up a loop to look through all of the Dublin Core elements in each record:

In [42]:
for record in records:
    id = record.find('oai:metadata/oai_dc:dc/dc:identifier', ns).text
    print('Item',id)
    for dc in record.iterfind('oai:metadata//dc:*', ns):
        print('  ',dc.text,dc.tag)

Item rencen_05b
   View of ongoing construction of the Renaissance Center towers {http://purl.org/dc/elements/1.1/}title
   Image of the building progress on two of the four towers of the Renaissance Center.  Construction equipment is clearing visible. {http://purl.org/dc/elements/1.1/}description
   rencen_05b {http://purl.org/dc/elements/1.1/}identifier
   1973 {http://purl.org/dc/elements/1.1/}date
   still image {http://purl.org/dc/elements/1.1/}type
   photographs {http://purl.org/dc/elements/1.1/}format
   Detroit, Michigan {http://purl.org/dc/elements/1.1/}coverage
   1970s {http://purl.org/dc/elements/1.1/}coverage
   Building the Detroit Renaissance Center {http://purl.org/dc/elements/1.1/}relation
   Users can cite and link to these materials without obtaining permission. Users can also use the materials for non-commercial educational and research purposes in accordance with fair use. For other uses or to obtain high resolution images, please contact the copyright holder. {ht

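AttributeError: 'NoneType' object has no attribute 'text'

This loop stops with the same `AttributeError` once it reaches a child of `ListRecords` that has no `dc:identifier`, such as the trailing `resumptionToken` element.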
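
Because a Dublin Core field such as `coverage` can repeat within one record, a plain dictionary would keep only the last value. As a sketch, the fields of a single record can instead be gathered into a dictionary of lists. The record below is an abbreviated copy of the first record shown earlier; the `dc:*` wildcard assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

ns = {
    'oai': 'http://www.openarchives.org/OAI/2.0/',
    'dc': 'http://purl.org/dc/elements/1.1/',
    'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
}

# Abbreviated copy of the first record in the rencen set
sample = """<record xmlns="http://www.openarchives.org/OAI/2.0/"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
<header><identifier>oai:cdm17409.contentdm.oclc.org:rencen/0</identifier></header>
<metadata><oai_dc:dc>
<dc:title>View of ongoing construction of the Renaissance Center towers</dc:title>
<dc:identifier>rencen_05b</dc:identifier>
<dc:coverage>Detroit, Michigan</dc:coverage>
<dc:coverage>1970s</dc:coverage>
</oai_dc:dc></metadata>
</record>"""

record = ET.fromstring(sample)

# Collect every value under its local field name; repeated fields accumulate
fields = defaultdict(list)
for dc in record.iterfind('oai:metadata//dc:*', ns):
    fields[dc.tag.rsplit('}', 1)[-1]].append(dc.text)
fields = dict(fields)

print(fields['coverage'])
```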

## More XPath Practice

OAI-PMH responses do not offer many attributes to practice with, but one
value that matters when working with OAI-PMH is the `resumptionToken`.
This is a string that the server returns with paginated results. If you want
all of the results in a set that is split into multiple pages, you will need it.

The following code builds on the **ListRecords** section above. 

In [45]:
Records = requests.get(Repo_Identity_Info['baseURL'], 
                       params={
                           'verb': 'ListRecords', 
                           'set': 'rencen', 
                           'metadataPrefix': 'oai_dc'
                           })

Knowing that the record metadata is in the `ListRecords` element, you can just ask for that element directly:

In [49]:
records = ET.fromstring(Records.text).find('oai:ListRecords', ns)

len(records)

201

Note that `len(records)` reports 201 children, but only 200 of them are records: the last child of the `ListRecords` element is the `resumptionToken`. If you look at the collection online, there are 343 items in this set, so this response is only the first page. To get the full list, you will need the resumptionToken.

In [53]:
resumptionToken = records[-1].text

print(resumptionToken)

rencen:200:rencen:0000-00-00:9999-99-99:oai_dc
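
The OAI-PMH specification also allows the `resumptionToken` element to carry optional attributes such as `completeListSize`, `cursor`, and `expirationDate`. The output above does not show whether this server sends them, so the sample below is an assumption with illustrative values. It is a sketch of how an XPath attribute predicate and `Element.get()` would read them.

```python
import xml.etree.ElementTree as ET

ns = {'oai': 'http://www.openarchives.org/OAI/2.0/'}

# Sample response fragment; the attribute values are illustrative assumptions
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
<ListRecords>
<resumptionToken completeListSize="343" cursor="0">rencen:200:rencen:0000-00-00:9999-99-99:oai_dc</resumptionToken>
</ListRecords>
</OAI-PMH>"""

root = ET.fromstring(sample)

# XPath attribute predicate: match only a token that carries a cursor attribute
token = root.find('.//oai:resumptionToken[@cursor]', ns)

print(token.text)
print(token.get('completeListSize'))       # attribute values are strings
print(token.get('expirationDate', 'n/a'))  # get() accepts a default for absent attributes
```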


To get the full list, you can create a looping function that keeps requesting from the OAI-PMH endpoint for as long as the server returns a resumptionToken:

In [79]:
def get_set(Repo_Identity_Info, set_spec, metadata_type, namespaces):
    records = ''
    baseURL = Repo_Identity_Info['baseURL']
    # the first request names the set and the metadata format
    params = {
        'verb': 'ListRecords',
        'set': set_spec,
        'metadataPrefix': metadata_type
    }
    while True:
        r = requests.get(baseURL, params=params)
        r.raise_for_status()
        data = ET.fromstring(r.text)
        for record in data.findall('.//oai:record', namespaces):
            records = records + ET.tostring(record, 'utf-8').decode('utf-8') + '\n'
        # find() returns None when there is no resumptionToken element,
        # and an empty token marks the last page
        token = data.find('.//oai:resumptionToken', namespaces)
        if token is None or not token.text:
            break
        # resumptionToken is an exclusive argument: send it with the verb only
        params = {'verb': 'ListRecords', 'resumptionToken': token.text}
    return records

In [80]:
record_list = get_set(Repo_Identity_Info, 'rencen', 'oai_dc', namespaces=ns)

In [81]:
with open('OAI-records-list.xml', 'w', encoding='utf-8') as f:
    f.write(record_list)
    print('wrote file')


wrote file


In [84]:
#TODO the above creates an invalid XML document: the records are
#concatenated one after another with no single root element, so the
#parse below does not work

In [83]:
metadata = ET.fromstring(record_list)

for record in metadata:
    print(record.tag)

ParseError: junk after document element: line 18, column 0 (<string>)
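
The `ParseError` above arises because the string holds several top-level elements, while an XML document may have only one. One possible fix, sketched here on two stand-in records rather than the real harvest, is to wrap the concatenated fragments in a single root element before parsing. The wrapper name `records` is an arbitrary choice for this sketch.

```python
import xml.etree.ElementTree as ET

# Two serialized records, concatenated the way get_set returns them
record_list = (
    '<record xmlns="http://www.openarchives.org/OAI/2.0/">'
    '<header><identifier>a</identifier></header></record>\n'
    '<record xmlns="http://www.openarchives.org/OAI/2.0/">'
    '<header><identifier>b</identifier></header></record>\n'
)

# Wrapping the fragments in one root element makes the string well-formed
metadata = ET.fromstring('<records>' + record_list + '</records>')

for record in metadata:
    print(record.tag)
```

The same wrapped string can be written to `OAI-records-list.xml` so that the saved file is also parseable.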

OAI-PMH records tend not to have many attributes, so there is not much XPath
to practice here. For more practice with XPath, see the notebook `xml-xpath-examples-EAD.ipynb` in this repo.