# Working with XML and OAI-PMH

OAI-PMH, the Open Archives Initiative Protocol for Metadata Harvesting, is a metadata protocol used by many digital libraries. It works much like a REST API: it defines standard request rules and response structures, which we can use to construct requests and gather information about the objects in a repository or a single collection. This version of the notebook uses the [xml.etree.ElementTree](https://docs.python.org/3.7/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.iterfind) library.

The following assumes a basic understanding of XML and a basic understanding of HTTP requests and responses. 

Resources: 

* Digital Maryland https://www.digitalmaryland.org/
* OAI-PMH requests http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm
* XPATH reference https://docs.microsoft.com/en-us/previous-versions/dotnet/netframework-4.0/ms256115(v%3dvs.100)
* Python requests module https://2.python-requests.org//en/master/user/quickstart/
* CONTENTdm [support for harvesting via OAI-PMH](https://help.oclc.org/Metadata_Services/CONTENTdm/CONTENTdm_Administration/Server_Administration/020Harvesting#OAI_Support)
* XML Tree Chrome extension [here](https://chrome.google.com/webstore/detail/xml-tree/gbammbheopgpmaagmckhpjbfgdfkpadb?hl=en)

We will explore how to gather and enhance information using a collection managed in a CONTENTdm repository.
One local example is [Digital Maryland](https://www.digitalmaryland.org/), a statewide digitization program and digital collection managed by the Maryland State Library Resource Center and the Enoch Pratt Free Library.
Its OAI-PMH endpoint URL is https://collections.digitalmaryland.org/oai/oai.php

The OAI-PMH protocol defines six types of request, all made over HTTP.
The type of request, and therefore the type of response, is specified by
a `verb` parameter in the request URL. Here are the possible verbs for an OAI-PMH request:

* Identify
* ListIdentifiers
* ListMetadataFormats
* ListSets
* GetRecord
* ListRecords

We will explore `Identify`, `ListIdentifiers`, and `GetRecord`. The others are explained in more 
detail in the OAI documentation linked above. 
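
To make the role of the verb concrete, here is a minimal sketch of how a verb and its arguments become the query string of a request URL. The helper name `build_oai_url` is hypothetical, and it uses only the standard library so that it runs without a network connection; `requests` builds the same kind of URL from its `params` argument.

```python
from urllib.parse import urlencode

# endpoint taken from the text above
ENDPOINT = 'https://collections.digitalmaryland.org/oai/oai.php'

def build_oai_url(endpoint, verb, **arguments):
    """Return the URL for an OAI-PMH request (hypothetical helper)."""
    params = {'verb': verb}
    params.update(arguments)
    return endpoint + '?' + urlencode(params)

# Identify takes no further arguments
identify_url = build_oai_url(ENDPOINT, 'Identify')

# ListIdentifiers needs a metadataPrefix and can be limited to one set
list_url = build_oai_url(ENDPOINT, 'ListIdentifiers', set='btpe', metadataPrefix='oai_dc')

print(identify_url)
print(list_url)
```

Each verb accepts its own arguments; the OAI documentation linked above lists which are required and which are optional.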

First, we'll need to set things up. We need `requests` to make HTTP requests.
To parse the XML responses we will use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) (with
[lxml](https://lxml.de/tutorial.html) as its parser) and, later, the standard library's `xml.etree.ElementTree`.
We will use `csv` later to write output.

In [1]:
import requests
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
import csv

In [2]:
endpoint = 'https://collections.digitalmaryland.org/oai/oai.php'

# this headers dictionary helps the server think we are a web browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}

In [3]:
args = {
    'verb': 'Identify'
}

repositoryInfo = requests.get(endpoint, params=args)

print(repositoryInfo.url)
print(repositoryInfo.encoding)
print('---------------------- response text-------------------------')
print(repositoryInfo.text)

https://collections.digitalmaryland.org/oai/oai.php?verb=Identify
UTF-8
---------------------- response text-------------------------
<?xml version="1.0" encoding="UTF-8"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2021-10-12T15:28:29Z</responseDate><request verb="Identify">http://collections.digitalmaryland.org/oai/oai.php</request><Identify>
      <repositoryName>CONTENTdm Server Repository</repositoryName>
      <baseURL>http://collections.digitalmaryland.org/oai/oai.php</baseURL>
      <protocolVersion>2.0</protocolVersion>
      <adminEmail>digitalmaryland@prattlibrary.org</adminEmail>
      <earliestDatestamp>2006-01-23</earliestDatestamp>
      <deletedRecord>transient</deletedRecord>
      <granularity>YYYY-MM-DD</granularity>
   </Identify>
  </OAI-PMH>


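Although we already have it, we can also read the official OAI endpoint
from the `baseURL` element of the repository's `Identify` response, which we requested above.
To get it, we can parse the XML with BeautifulSoup. Its `lxml` parser lowercases tag names,
which is why the code below refers to `identify` and `baseurl`: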

In [5]:
soup = BeautifulSoup(repositoryInfo.text, 'lxml')

In [6]:
baseurl = soup.identify.baseurl.text

print(baseurl)

http://collections.digitalmaryland.org/oai/oai.php
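
The same lookup works with `xml.etree.ElementTree`, which keeps the original case of tag names but requires the OAI namespace on each tag. The following is a minimal sketch that uses a trimmed copy of the `Identify` response above as a string, so it runs offline; the names `sample_identify`, `identify_root`, and `info` are illustrative.

```python
import xml.etree.ElementTree as ET

OAI_NS = '{http://www.openarchives.org/OAI/2.0/}'

# trimmed copy of the Identify response shown above
sample_identify = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    '<responseDate>2021-10-12T15:28:29Z</responseDate>'
    '<request verb="Identify">http://collections.digitalmaryland.org/oai/oai.php</request>'
    '<Identify>'
    '<repositoryName>CONTENTdm Server Repository</repositoryName>'
    '<baseURL>http://collections.digitalmaryland.org/oai/oai.php</baseURL>'
    '<protocolVersion>2.0</protocolVersion>'
    '<earliestDatestamp>2006-01-23</earliestDatestamp>'
    '<granularity>YYYY-MM-DD</granularity>'
    '</Identify>'
    '</OAI-PMH>'
)

identify_root = ET.fromstring(sample_identify)
identify = identify_root.find(OAI_NS + 'Identify')

# collect each child of <Identify> into a dictionary keyed by its local tag name
info = {child.tag.replace(OAI_NS, ''): child.text for child in identify}

print(info['baseURL'])
print(info['granularity'])
```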


## Identify Collection Items

The repository groups items together into collections. To determine what is
in a collection, we can use the `ListIdentifiers` verb. The response is an
XML document with one header per collection item. If the collection has more than 200 items,
the last element in the response will be a `resumptionToken`, which we can use to request the next page of results.
We will use that later, but for now let's see what's in the collection.

In [7]:
# request the identifiers from the btpe set
args = {
    'verb': 'ListIdentifiers',
    'set': 'btpe',
    'metadataPrefix': 'oai_dc'
}

btpeReq = requests.get(baseurl, params=args, headers=headers)

btpeReq.text

'<?xml version="1.0" encoding="UTF-8"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2021-10-12T15:29:29Z</responseDate><request verb="ListIdentifiers" set="btpe" metadataPrefix="oai_dc">http://collections.digitalmaryland.org/oai/oai.php</request><ListIdentifiers><header status="deleted"><identifier>oai:collections.digitalmaryland.org:btpe/0</identifier><datestamp>2012-05-31</datestamp><setSpec>btpe</setSpec></header><header status="deleted"><identifier>oai:collections.digitalmaryland.org:btpe/1</identifier><datestamp>2012-05-31</datestamp><setSpec>btpe</setSpec></header><header status="deleted"><identifier>oai:collections.digitalmaryland.org:btpe/2</identifier><datestamp>2012-05-31</datestamp><setSpec>btpe</setSpec></header><header status="deleted"><identifier>oai:collections.digitalmaryland.org:btpe/3</id

In [8]:
resp = BeautifulSoup(btpeReq.text, 'lxml')

resumptionToken = resp.resumptiontoken.text

resumptionToken

'btpe:200:btpe:0000-00-00:9999-99-99:oai_dc'

In [9]:
args = {
    'verb': 'ListIdentifiers',
    'resumptionToken': resp.resumptiontoken.text
}

In [10]:
req2 = requests.get(baseurl, params=args, headers=headers)

print(req2.url)

resp2 = BeautifulSoup(req2.text, 'lxml')

print(resp2)

http://collections.digitalmaryland.org/oai/oai.php?verb=ListIdentifiers&resumptionToken=btpe%3A200%3Abtpe%3A0000-00-00%3A9999-99-99%3Aoai_dc
<?xml version="1.0" encoding="UTF-8"?><html><body><oai-pmh xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responsedate>2021-10-12T15:30:19Z</responsedate><request metadataprefix="oai_dc" resumptiontoken="btpe:200:btpe:0000-00-00:9999-99-99:oai_dc" set="btpe" verb="ListIdentifiers">http://collections.digitalmaryland.org/oai/oai.php</request><listidentifiers><header><identifier>oai:collections.digitalmaryland.org:btpe/200</identifier><datestamp>2008-09-22</datestamp><setspec>btpe</setspec></header><header><identifier>oai:collections.digitalmaryland.org:btpe/201</identifier><datestamp>2008-08-14</datestamp><setspec>btpe</setspec></header><header><identifier>oai:collections.digitalmaryland.org:b

In [11]:
args = {
    'verb': 'ListIdentifiers',
    'resumptionToken': resp2.resumptiontoken.text
}

req3 = requests.get(baseurl, params=args, headers=headers)

print(req3.url)

resp3 = BeautifulSoup(req3.text, 'lxml')

print(resp3)

http://collections.digitalmaryland.org/oai/oai.php?verb=ListIdentifiers&resumptionToken=btpe%3A400%3Abtpe%3A0000-00-00%3A9999-99-99%3Aoai_dc
<?xml version="1.0" encoding="UTF-8"?><html><body><oai-pmh xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responsedate>2021-10-12T15:30:33Z</responsedate><request metadataprefix="oai_dc" resumptiontoken="btpe:400:btpe:0000-00-00:9999-99-99:oai_dc" set="btpe" verb="ListIdentifiers">http://collections.digitalmaryland.org/oai/oai.php</request><listidentifiers><header><identifier>oai:collections.digitalmaryland.org:btpe/400</identifier><datestamp>2008-09-22</datestamp><setspec>btpe</setspec></header><header><identifier>oai:collections.digitalmaryland.org:btpe/401</identifier><datestamp>2008-08-07</datestamp><setspec>btpe</setspec></header><header><identifier>oai:collections.digitalmaryland.org:b

In [12]:
for item in resp.find_all('header'):
    print(item.identifier)

<identifier>oai:collections.digitalmaryland.org:btpe/0</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/1</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/2</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/3</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/4</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/5</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/6</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/7</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/8</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/9</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/10</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/11</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/12</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/13</identifier>
<identifier>oai:collections.digitalmaryland.

In [13]:
for item in resp2.find_all('header'):
    print(item.identifier)

<identifier>oai:collections.digitalmaryland.org:btpe/200</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/201</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/202</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/203</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/204</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/205</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/206</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/207</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/208</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/209</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/210</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/211</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/212</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/213</identifier>
<identifier>oai:coll

In [14]:
for item in resp3.find_all('header'):
    print(item.identifier)

<identifier>oai:collections.digitalmaryland.org:btpe/400</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/401</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/402</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/403</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/404</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/405</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/406</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/407</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/408</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/409</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/410</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/411</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/412</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/413</identifier>
<identifier>oai:coll
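
The raw `ListIdentifiers` response above shows that some headers carry `status="deleted"`. Such identifiers still appear in the list, but under OAI-PMH a deleted record has no metadata to retrieve. Here is a minimal sketch of separating the two groups, using a small hand-made sample in the shape of the response rather than live data; the `active` and `deleted` list names are illustrative.

```python
import xml.etree.ElementTree as ET

OAI_NS = '{http://www.openarchives.org/OAI/2.0/}'

# two headers copied from the responses above: one deleted, one active
sample_headers = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    '<ListIdentifiers>'
    '<header status="deleted"><identifier>oai:collections.digitalmaryland.org:btpe/0</identifier>'
    '<datestamp>2012-05-31</datestamp><setSpec>btpe</setSpec></header>'
    '<header><identifier>oai:collections.digitalmaryland.org:btpe/200</identifier>'
    '<datestamp>2008-09-22</datestamp><setSpec>btpe</setSpec></header>'
    '</ListIdentifiers>'
    '</OAI-PMH>'
)

active, deleted = [], []
for header in ET.fromstring(sample_headers).iter(OAI_NS + 'header'):
    identifier = header.find(OAI_NS + 'identifier').text
    # header.get() returns None when the status attribute is absent
    if header.get('status') == 'deleted':
        deleted.append(identifier)
    else:
        active.append(identifier)

print('active:', active)
print('deleted:', deleted)
```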

### Namespaces

Here are a few namespace helpers that may be useful as we parse OAI XML and Dublin Core XML as expressed by CONTENTdm.

In [15]:
DC_NS = '{http://purl.org/dc/elements/1.1/}'
OAI_NS = '{http://www.openarchives.org/OAI/2.0/}'
OAI_DC_NS = '{http://www.openarchives.org/OAI/2.0/oai_dc/}'
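
These strings follow the `{namespace-uri}localname` form that ElementTree uses for tags. The sketch below shows why they are needed: a bare tag name does not match a namespaced element, while the qualified name does, and so does a prefix mapped through a `namespaces` dictionary. The one-element document and the `oai` prefix are illustrative.

```python
import xml.etree.ElementTree as ET

OAI_NS = '{http://www.openarchives.org/OAI/2.0/}'

doc = ET.fromstring(
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    '<responseDate>2021-10-12T15:29:29Z</responseDate>'
    '</OAI-PMH>'
)

# a bare tag name is looked up in no namespace, so nothing matches
bare = doc.find('responseDate')

# the fully qualified name matches
qualified = doc.find(OAI_NS + 'responseDate')

# so does a prefix that the namespaces dictionary maps to the same URI
prefixed = doc.find('oai:responseDate',
                    namespaces={'oai': 'http://www.openarchives.org/OAI/2.0/'})

print(bare)
print(qualified.text)
print(prefixed.text)
```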

### Get the IDs with ElementTree

Above we used BeautifulSoup to get a list of the identifiers.
A dedicated XML parser is more useful when we want to
make more detailed or complex queries, because it preserves tag case and namespaces. In this
example, we will use `xml.etree.ElementTree` to pull the identifiers into a list,
building on the `ListIdentifiers` response above.

In [16]:
# parse the raw bytes so the parser reads the encoding from the XML declaration
root = ET.fromstring(btpeReq.content)

In [17]:
#identify the main tag and namespace
root.tag

'{http://www.openarchives.org/OAI/2.0/}OAI-PMH'

In [18]:
root.attrib

{'{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd'}

In [19]:
# iterating over an ElementTree element yields its direct children
for branch in root:
    print(branch)

<Element '{http://www.openarchives.org/OAI/2.0/}responseDate' at 0x7fedd5ffd180>
<Element '{http://www.openarchives.org/OAI/2.0/}request' at 0x7fedd5ffd270>
<Element '{http://www.openarchives.org/OAI/2.0/}ListIdentifiers' at 0x7fedd5ffd2c0>


In [20]:
btpeReq.url

'http://collections.digitalmaryland.org/oai/oai.php?verb=ListIdentifiers&set=btpe&metadataPrefix=oai_dc'

In [21]:
for branch in root:
    print(branch.tag, branch.keys(), branch.attrib)

{http://www.openarchives.org/OAI/2.0/}responseDate [] {}
{http://www.openarchives.org/OAI/2.0/}request ['verb', 'set', 'metadataPrefix'] {'verb': 'ListIdentifiers', 'set': 'btpe', 'metadataPrefix': 'oai_dc'}
{http://www.openarchives.org/OAI/2.0/}ListIdentifiers [] {}


In [23]:
data = root

In [24]:
for element in data[2][1]:
    print(element.tag, element.text)

{http://www.openarchives.org/OAI/2.0/}identifier oai:collections.digitalmaryland.org:btpe/1
{http://www.openarchives.org/OAI/2.0/}datestamp 2012-05-31
{http://www.openarchives.org/OAI/2.0/}setSpec btpe


In [25]:
for element in data[2].iter(OAI_NS + 'identifier'): 
    print(element.text)

oai:collections.digitalmaryland.org:btpe/0
oai:collections.digitalmaryland.org:btpe/1
oai:collections.digitalmaryland.org:btpe/2
oai:collections.digitalmaryland.org:btpe/3
oai:collections.digitalmaryland.org:btpe/4
oai:collections.digitalmaryland.org:btpe/5
oai:collections.digitalmaryland.org:btpe/6
oai:collections.digitalmaryland.org:btpe/7
oai:collections.digitalmaryland.org:btpe/8
oai:collections.digitalmaryland.org:btpe/9
oai:collections.digitalmaryland.org:btpe/10
oai:collections.digitalmaryland.org:btpe/11
oai:collections.digitalmaryland.org:btpe/12
oai:collections.digitalmaryland.org:btpe/13
oai:collections.digitalmaryland.org:btpe/14
oai:collections.digitalmaryland.org:btpe/15
oai:collections.digitalmaryland.org:btpe/16
oai:collections.digitalmaryland.org:btpe/17
oai:collections.digitalmaryland.org:btpe/18
oai:collections.digitalmaryland.org:btpe/19
oai:collections.digitalmaryland.org:btpe/20
oai:collections.digitalmaryland.org:btpe/21
oai:collections.digitalmaryland.org:btpe/2

In [26]:
# convert the earlier responses into ElementTree Element objects
resp = data

resp2 = ET.fromstring(req2.content)

resp3 = ET.fromstring(req3.content)

In [27]:
identifiers = list()

for item in resp[2].iter(OAI_NS + 'identifier'):
    identifiers.append(item.text)

for item in resp2[2].iter(OAI_NS + 'identifier'):
    identifiers.append(item.text)

for item in resp3[2].iter(OAI_NS + 'identifier'):
    identifiers.append(item.text)

print(len(identifiers))

483


In [28]:
for identifier in identifiers:
    print(identifier)

oai:collections.digitalmaryland.org:btpe/0
oai:collections.digitalmaryland.org:btpe/1
oai:collections.digitalmaryland.org:btpe/2
oai:collections.digitalmaryland.org:btpe/3
oai:collections.digitalmaryland.org:btpe/4
oai:collections.digitalmaryland.org:btpe/5
oai:collections.digitalmaryland.org:btpe/6
oai:collections.digitalmaryland.org:btpe/7
oai:collections.digitalmaryland.org:btpe/8
oai:collections.digitalmaryland.org:btpe/9
oai:collections.digitalmaryland.org:btpe/10
oai:collections.digitalmaryland.org:btpe/11
oai:collections.digitalmaryland.org:btpe/12
oai:collections.digitalmaryland.org:btpe/13
oai:collections.digitalmaryland.org:btpe/14
oai:collections.digitalmaryland.org:btpe/15
oai:collections.digitalmaryland.org:btpe/16
oai:collections.digitalmaryland.org:btpe/17
oai:collections.digitalmaryland.org:btpe/18
oai:collections.digitalmaryland.org:btpe/19
oai:collections.digitalmaryland.org:btpe/20
oai:collections.digitalmaryland.org:btpe/21
oai:collections.digitalmaryland.org:btpe/2

In [29]:
for element in data[2]:
    print(element)

<Element '{http://www.openarchives.org/OAI/2.0/}header' at 0x7fedd5ffd360>
<Element '{http://www.openarchives.org/OAI/2.0/}header' at 0x7fedd5ffd590>
<Element '{http://www.openarchives.org/OAI/2.0/}header' at 0x7fedd5ffd6d0>
<Element '{http://www.openarchives.org/OAI/2.0/}header' at 0x7fedd5ffd810>
<Element '{http://www.openarchives.org/OAI/2.0/}header' at 0x7fedd5ffd950>
<Element '{http://www.openarchives.org/OAI/2.0/}header' at 0x7fedd5ffda90>
<Element '{http://www.openarchives.org/OAI/2.0/}header' at 0x7fedd5ffdbd0>
<Element '{http://www.openarchives.org/OAI/2.0/}header' at 0x7fedd5ffdd10>
<Element '{http://www.openarchives.org/OAI/2.0/}header' at 0x7fedd5ffde50>
<Element '{http://www.openarchives.org/OAI/2.0/}header' at 0x7fedd5ffdf90>
<Element '{http://www.openarchives.org/OAI/2.0/}header' at 0x7fedd6001130>
<Element '{http://www.openarchives.org/OAI/2.0/}header' at 0x7fedd6001270>
<Element '{http://www.openarchives.org/OAI/2.0/}header' at 0x7fedd60013b0>
<Element '{http://www.ope

In [30]:
token = data[2].find(OAI_NS + 'resumptionToken').text

print(token)

btpe:200:btpe:0000-00-00:9999-99-99:oai_dc


To gather every page automatically, we can loop: collect the identifiers, check the response for a `resumptionToken`, and request the next page until no token is returned.

In [31]:
itemList = list()

args = {
    'verb': 'ListIdentifiers',
    'set': 'btpe',
    'metadataPrefix': 'oai_dc'
}

r = requests.get(baseurl, params=args, headers=headers)
r.raise_for_status()
coll_xml = ET.fromstring(r.content)
print('requested',r.url)
print('response',r.status_code)

while True:
    for item in coll_xml[2].iter(OAI_NS + 'identifier'):
        itemList.append(item.text)
    print('appended items from page',r.url)
    # the last page has no resumptionToken (or an empty one),
    # so check for the element before reading its text
    token_element = coll_xml[2].find(OAI_NS + 'resumptionToken')
    if token_element is None or not token_element.text:
        break
    # set up next URL request: only the verb and the token are sent
    args = {'resumptionToken': token_element.text, 'verb': 'ListIdentifiers'}
    # join by hand so the colons in the token are not percent-encoded
    args_string = "&".join("%s=%s" % (k,v) for k,v in args.items())
    r = requests.get(baseurl, params=args_string, headers=headers)
    print('requesting',r.url)
    coll_xml = ET.fromstring(r.content)

requested http://collections.digitalmaryland.org/oai/oai.php?verb=ListIdentifiers&set=btpe&metadataPrefix=oai_dc
response 200
appended items from page http://collections.digitalmaryland.org/oai/oai.php?verb=ListIdentifiers&set=btpe&metadataPrefix=oai_dc
requesting http://collections.digitalmaryland.org/oai/oai.php?resumptionToken=btpe:200:btpe:0000-00-00:9999-99-99:oai_dc&verb=ListIdentifiers
appended items from page http://collections.digitalmaryland.org/oai/oai.php?resumptionToken=btpe:200:btpe:0000-00-00:9999-99-99:oai_dc&verb=ListIdentifiers
requesting http://collections.digitalmaryland.org/oai/oai.php?resumptionToken=btpe:400:btpe:0000-00-00:9999-99-99:oai_dc&verb=ListIdentifiers
appended items from page http://collections.digitalmaryland.org/oai/oai.php?resumptionToken=btpe:400:btpe:0000-00-00:9999-99-99:oai_dc&verb=ListIdentifiers

In [32]:
print(len(itemList))

483
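
The paging loop above can also be packaged as a reusable generator. This is a sketch under assumptions rather than the notebook's own method: `iter_identifiers`, `fetch_page`, and `fake_fetch` are hypothetical names, and `fetch_page` stands in for whatever function performs the HTTP request and returns the response body. Passing the fetch function in keeps the paging logic testable without a network connection; here it is exercised against two small hand-made pages.

```python
import xml.etree.ElementTree as ET

OAI_NS = '{http://www.openarchives.org/OAI/2.0/}'

def iter_identifiers(fetch_page, first_args):
    """Yield every identifier, following resumptionTokens until none remain."""
    args = dict(first_args)
    while True:
        page = ET.fromstring(fetch_page(args))
        for element in page.iter(OAI_NS + 'identifier'):
            yield element.text
        token = page.find('.//' + OAI_NS + 'resumptionToken')
        # stop when the element is missing or empty
        if token is None or not token.text:
            return
        # a follow-up request carries only the verb and the token
        args = {'verb': first_args['verb'], 'resumptionToken': token.text}

# two hand-made pages standing in for real responses
PAGES = {
    None: '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
          '<header><identifier>id-1</identifier></header>'
          '<resumptionToken>page-2</resumptionToken>'
          '</ListIdentifiers></OAI-PMH>',
    'page-2': '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
              '<header><identifier>id-2</identifier></header>'
              '</ListIdentifiers></OAI-PMH>',
}

def fake_fetch(args):
    # look up the canned page by its token (None for the first request)
    return PAGES[args.get('resumptionToken')]

first_request = {'verb': 'ListIdentifiers', 'set': 'btpe', 'metadataPrefix': 'oai_dc'}
found = list(iter_identifiers(fake_fetch, first_request))
print(found)
```

With `requests`, the fetch function could be as small as `lambda args: requests.get(baseurl, params=args, headers=headers).content`.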


In [33]:
for item in itemList:
    print(item)

oai:collections.digitalmaryland.org:btpe/0
oai:collections.digitalmaryland.org:btpe/1
oai:collections.digitalmaryland.org:btpe/2
oai:collections.digitalmaryland.org:btpe/3
oai:collections.digitalmaryland.org:btpe/4
oai:collections.digitalmaryland.org:btpe/5
oai:collections.digitalmaryland.org:btpe/6
oai:collections.digitalmaryland.org:btpe/7
oai:collections.digitalmaryland.org:btpe/8
oai:collections.digitalmaryland.org:btpe/9
oai:collections.digitalmaryland.org:btpe/10
oai:collections.digitalmaryland.org:btpe/11
oai:collections.digitalmaryland.org:btpe/12
oai:collections.digitalmaryland.org:btpe/13
oai:collections.digitalmaryland.org:btpe/14
oai:collections.digitalmaryland.org:btpe/15
oai:collections.digitalmaryland.org:btpe/16
oai:collections.digitalmaryland.org:btpe/17
oai:collections.digitalmaryland.org:btpe/18
oai:collections.digitalmaryland.org:btpe/19
oai:collections.digitalmaryland.org:btpe/20
oai:collections.digitalmaryland.org:btpe/21
oai:collections.digitalmaryland.org:btpe/2

In [34]:
fname = 'btpe-identifiers.csv'
count = 0

with open(fname,'w', encoding='utf-8', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(['number','identifier'])
    for item in itemList:
        count += 1
        csvwriter.writerow([count,item])
    print('wrote csv', fname)

wrote csv btpe-identifiers.csv


## Get Item Information

Use the `GetRecord` verb to retrieve the metadata record for a single item, including its title.

In [35]:
args = {
    'verb': 'GetRecord',
    'metadataPrefix': 'oai_dc',
    'identifier': 'oai:collections.digitalmaryland.org:btpe/482'
}

args_string = "&".join("%s=%s" % (k,v) for k,v in args.items())

print(args_string)

verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:collections.digitalmaryland.org:btpe/482


In [36]:
r.url

'http://collections.digitalmaryland.org/oai/oai.php?resumptionToken=btpe:400:btpe:0000-00-00:9999-99-99:oai_dc&verb=ListIdentifiers'

In [37]:
r = requests.get(baseurl, params=args_string, headers=headers)

print(r.text)

<?xml version="1.0" encoding="UTF-8"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2021-10-12T15:34:26Z</responseDate><request verb="GetRecord" metadataPrefix="oai_dc" identifier="oai:collections.digitalmaryland.org:btpe/482">http://collections.digitalmaryland.org/oai/oai.php</request><GetRecord><record><header><identifier>oai:collections.digitalmaryland.org:btpe/482</identifier><datestamp>2008-08-07</datestamp><setSpec>btpe</setSpec></header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:identifier>btpe2024</dc:identifier>
<dc:title>Baltimore Transit car numbe

In [38]:
# the item metadata introduces Dublin Core as a namespace, and we may need to reference that.
# in ElementTree we can use a dictionary to map prefixes to namespace URIs:

DC_NS = '{http://purl.org/dc/elements/1.1/}'
OAI_NS = '{http://www.openarchives.org/OAI/2.0/}'
OAI_DC_NS = '{http://www.openarchives.org/OAI/2.0/oai_dc/}'

ns = { 
    'dc' : 'http://purl.org/dc/elements/1.1/',
    'oai_pmh' : 'http://www.openarchives.org/OAI/2.0/',
    'oai_dc' : 'http://www.openarchives.org/OAI/2.0/oai_dc/'
}

In [39]:
item_xml = ET.fromstring(r.content)

In [40]:
for branch in item_xml:
    print(branch.tag, branch.attrib)

{http://www.openarchives.org/OAI/2.0/}responseDate {}
{http://www.openarchives.org/OAI/2.0/}request {'verb': 'GetRecord', 'metadataPrefix': 'oai_dc', 'identifier': 'oai:collections.digitalmaryland.org:btpe/482'}
{http://www.openarchives.org/OAI/2.0/}GetRecord {}


In [41]:
for identifier in item_xml[2].iter(OAI_NS + 'identifier'):
    print(identifier.tag, identifier.text)

{http://www.openarchives.org/OAI/2.0/}identifier oai:collections.digitalmaryland.org:btpe/482


In [42]:
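# a bare path in findall() only matches direct children of the root;
# identifier is nested deeper, so this loop prints nothing
for identifier in item_xml.findall('oai_pmh:identifier', namespaces=ns):
    print(identifier)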

In [43]:
item_xml.tag

'{http://www.openarchives.org/OAI/2.0/}OAI-PMH'

In [44]:
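# item_xml is itself the OAI-PMH element and iterfind() searches below it,
# so this loop also prints nothing
for element in item_xml.iterfind('oai_pmh:OAI-PMH', namespaces=ns):
    print(element.tag)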

In [45]:
for item in item_xml[2][0][1][0]:
    print(item.tag, item.text)
    print(len(item))

{http://purl.org/dc/elements/1.1/}identifier btpe2024
0
{http://purl.org/dc/elements/1.1/}title Baltimore Transit car number 5748, loop at Dundalk Avenue at Center Place, Dundalk (NRSH Baltimore Chapter tour)
0
{http://purl.org/dc/elements/1.1/}creator Miller, Edward S., 1920-2010;
0
{http://purl.org/dc/elements/1.1/}subject Baltimore (Md.); Baltimore Transit Company; National Railway Historical Society. Baltimore Chapter; Street-railroads; Streets;
0
{http://purl.org/dc/elements/1.1/}description Photograph of Baltimore Transit car number 5748 at the loop at Dundalk Avenue at Center Place in Dundalk, Baltimore County, Maryland during a tour set up by the Baltimore Chapter of the NRHS (National Railway Historical Society). Car line 26 runs on these tracks.
0
{http://purl.org/dc/elements/1.1/}source Pennsylvania Trolley Museum;
0
{http://purl.org/dc/elements/1.1/}date 1953-09-20;
0
{http://purl.org/dc/elements/1.1/}type Image;
0
{http://purl.org/dc/elements/1.1/}format Digital reproducti
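
Because the Dublin Core elements are all children of `oai_dc:dc`, they can be gathered into a dictionary by tag rather than by position. The output above also shows that this repository ends values with a semicolon and uses semicolons to separate multiple values, so the sketch below splits on them. It runs against a trimmed, hand-copied version of the record above. The `record_to_dict` name is hypothetical, and splitting on `;` is an assumption that would mishandle a value containing a literal semicolon.

```python
import xml.etree.ElementTree as ET

DC_NS = '{http://purl.org/dc/elements/1.1/}'
OAI_DC_NS = '{http://www.openarchives.org/OAI/2.0/oai_dc/}'

# trimmed copy of the GetRecord response shown above
sample_record = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><GetRecord><record>'
    '<header><identifier>oai:collections.digitalmaryland.org:btpe/482</identifier></header>'
    '<metadata>'
    '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:identifier>btpe2024</dc:identifier>'
    '<dc:creator>Miller, Edward S., 1920-2010;</dc:creator>'
    '<dc:subject>Baltimore (Md.); Baltimore Transit Company; Streets;</dc:subject>'
    '<dc:date>1953-09-20;</dc:date>'
    '</oai_dc:dc>'
    '</metadata></record></GetRecord></OAI-PMH>'
)

def record_to_dict(xml_text):
    """Map each Dublin Core field name to a list of its values."""
    record = {}
    dc = ET.fromstring(xml_text).find('.//' + OAI_DC_NS + 'dc')
    if dc is None:
        # no metadata block, for example a deleted record
        return record
    for field in dc:
        name = field.tag.replace(DC_NS, '')
        # split on semicolons and drop the empty piece after a trailing one
        values = [v.strip() for v in (field.text or '').split(';') if v.strip()]
        record.setdefault(name, []).extend(values)
    return record

fields = record_to_dict(sample_record)
print(fields['subject'])
print(fields['date'])
```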

In [46]:
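fname = 'item-records-full.csv'

with open(fname, 'w', encoding='utf-8', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(['oai_identifier', 'dc_identifier', 'title'])
    for oai_id in itemList:
        args = {'verb' : 'GetRecord', 'metadataPrefix' : 'oai_dc', 'identifier' : oai_id}
        r = requests.get(baseurl, params=args, headers=headers)
        item_xml = ET.fromstring(r.content)
        # search by tag instead of a fixed index path: records without metadata
        # (for example, deleted ones) raise IndexError on item_xml[2][0][1][0]
        item_id = item_xml.find('.//' + DC_NS + 'identifier')
        item_title = item_xml.find('.//' + DC_NS + 'title')
        if item_id is None or item_title is None:
            continue
        csvwriter.writerow([oai_id, item_id.text, item_title.text])
    print('wrote csv', fname)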

In [48]:
for title in item_xml[2].findall('.//' + DC_NS + 'title'):
    print(title.text)
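
A fixed index path such as `item_xml[2][0][1][0]` assumes every response contains a full record. Under OAI-PMH, a failed request instead returns one or more `<error>` elements, each with a `code` attribute, as children of the root. Here is a minimal sketch of checking for them before reading metadata; the sample response and its message text are hand-made, and `get_errors` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

OAI_NS = '{http://www.openarchives.org/OAI/2.0/}'

# hand-made error response in the shape the protocol defines
sample_error = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    '<responseDate>2021-10-12T15:34:26Z</responseDate>'
    '<request verb="GetRecord">http://collections.digitalmaryland.org/oai/oai.php</request>'
    '<error code="idDoesNotExist">No matching identifier</error>'
    '</OAI-PMH>'
)

def get_errors(xml_text):
    """Return (code, message) pairs for each <error> in a response."""
    root = ET.fromstring(xml_text)
    return [(e.get('code'), e.text) for e in root.findall(OAI_NS + 'error')]

errors = get_errors(sample_error)
if errors:
    print('request failed:', errors)
```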