# Working with XML and OAI-PMH

OAI-PMH, the Open Archives Initiative Protocol for Metadata Harvesting, is a protocol that many digital libraries use to expose their metadata. It works much like a REST API: it defines a small set of standard requests, sent over HTTP, that we can use to gather information about the objects in a repository or in a single collection.

The following assumes a basic understanding of XML and of HTTP requests and responses.

Resources: 

* Digital Maryland https://www.digitalmaryland.org/
* OAI-PMH requests http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm
* XPATH reference https://docs.microsoft.com/en-us/previous-versions/dotnet/netframework-4.0/ms256115(v%3dvs.100)
* Python requests module https://2.python-requests.org//en/master/user/quickstart/
* CONTENTdm [capacities for harvesting via OAI-PMH](https://help.oclc.org/Metadata_Services/CONTENTdm/CONTENTdm_Administration/Server_Administration/020Harvesting#OAI_Support)
* XML Chrome Extension [here](https://chrome.google.com/webstore/detail/xml-tree/gbammbheopgpmaagmckhpjbfgdfkpadb?hl=en)

We will explore how to gather and enhance information using an example of a collection managed in a CONTENTdm repository. One local example is [Digital Maryland](https://www.digitalmaryland.org/), a statewide digitization program and digital collection managed by the Maryland State Library Resource Center and the Enoch Pratt Free Library. We can find the endpoint URL for OAI-PMH here: https://collections.digitalmaryland.org/oai/oai.php

The OAI-PMH protocol defines six types of request, all made via HTTP. The type of request, and therefore the type of response, is specified by a `verb` argument in the request URL. Here are the possible verbs for an OAI-PMH request:

* Identify
* ListMetadataFormats
* ListSets
* ListIdentifiers
* ListRecords
* GetRecord

We will explore `Identify`, `ListIdentifiers`, and `GetRecord`. The others are explained in more 
detail in the OAI documentation linked above. 
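
As a minimal sketch of how a verb and its arguments end up in the request URL, the hypothetical helper `build_oai_url` below (not part of any library) uses only the standard library's `urllib.parse.urlencode`. The `requests` calls used later in this lesson perform the same encoding when given a `params` dictionary.

```python
from urllib.parse import urlencode

# endpoint taken from the text above
ENDPOINT = 'https://collections.digitalmaryland.org/oai/oai.php'

def build_oai_url(base_url, verb, **arguments):
    """Return an OAI-PMH request URL (hypothetical helper for illustration)."""
    params = {'verb': verb}
    params.update(arguments)
    # urlencode percent-encodes reserved characters such as ':' and '/'
    return base_url + '?' + urlencode(params)

identify_url = build_oai_url(ENDPOINT, 'Identify')
list_url = build_oai_url(ENDPOINT, 'ListIdentifiers', set='btpe', metadataPrefix='oai_dc')
print(identify_url)
print(list_url)
```

Every request carries exactly one `verb`; the remaining arguments depend on which verb is used.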

## XML Concepts

The examples below also presume a basic knowledge of XML. Some of the major concepts and techniques for parsing XML and working with XML data in Python include:

* Understanding basic structure of an XML document, including declaration statement, elements, attributes, and schema concept
* Understanding the XML document as "tree"-like as well as "path"-like
* Concept of schemas and relationship to namespaces referenced in elements (see perhaps http://effbot.org/zone/element-namespaces.htm)

**Actions**
* Creating an XML object that Python can process, using the ElementTree and/or lxml modules
* Iterating through the XML object to find various Elements in a list-like structure (use indices)
* Querying for specific attributes using a dictionary-like reference
* Using initial `iter()` and `findall()` calls to explore the tree
* Using a dictionary to manage namespaces
* Querying for specific tags within namespaces using basic XPath queries
  * requires basic understanding of XPath, similar to the basics of regex that 
  are covered in LibCarpentry
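
The actions listed above can be sketched on a small document. This lesson itself uses lxml; the sketch below instead uses the standard library's `xml.etree.ElementTree`, which offers a similar API for these operations, so that it runs without extra installs. The sample XML is hand-written and only modelled on a `ListIdentifiers` response, and the `oai` prefix is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# abbreviated, hand-written sample modelled on a ListIdentifiers response
sample = b'''<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2019-04-25T12:02:44Z</responseDate>
  <request verb="ListIdentifiers" set="btpe">http://example.org/oai</request>
  <ListIdentifiers>
    <header status="deleted"><identifier>oai:example.org:btpe/0</identifier></header>
    <header><identifier>oai:example.org:btpe/1</identifier></header>
  </ListIdentifiers>
</OAI-PMH>'''

root = ET.fromstring(sample)          # parse bytes into an Element tree

# list-like access: children are reached by index
request = root[1]
# dictionary-like access: attributes live in .attrib
verb = request.attrib['verb']

# a dictionary maps a prefix of our choosing to the namespace URI
ns = {'oai': 'http://www.openarchives.org/OAI/2.0/'}
# basic XPath query: every identifier element anywhere below the root
ids = [el.text for el in root.findall('.//oai:identifier', ns)]

print(root.tag)
print(verb, ids)
```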

First, we'll need to set things up. We need `requests` to make HTTP requests, 
[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and 
[lxml](https://lxml.de/tutorial.html) to parse the XML responses,
and the built-in `csv` module later to write our output.

In [37]:
import requests
from bs4 import BeautifulSoup
from lxml import etree
import csv

In [38]:
endpoint = 'https://collections.digitalmaryland.org/oai/oai.php'

# this headers dictionary helps the server think we are a web browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}

In [4]:
args = {
    'verb': 'Identify'
}

repositoryInfo = requests.get(endpoint, params=args)

print(repositoryInfo.url)
print(repositoryInfo.encoding)
print('---------------------- response text-------------------------')
print(repositoryInfo.text)

https://collections.digitalmaryland.org/oai/oai.php?verb=Identify
UTF-8
---------------------- response text-------------------------
<?xml version="1.0" encoding="UTF-8"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2019-04-25T15:24:02Z</responseDate><request verb="Identify">http://collections.digitalmaryland.org/oai/oai.php</request><Identify>
      <repositoryName>CONTENTdm Server Repository</repositoryName>
      <baseURL>http://collections.digitalmaryland.org/oai/oai.php</baseURL>
      <protocolVersion>2.0</protocolVersion>
      <adminEmail>digitalmaryland@prattlibrary.org</adminEmail>
      <earliestDatestamp>2006-01-23</earliestDatestamp>
      <deletedRecord>transient</deletedRecord>
      <granularity>YYYY-MM-DD</granularity>
   </Identify>
  </OAI-PMH>


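Although we already have it, we can also read the official OAI endpoint
from the repository's `Identify` response, which we requested above. 
To get it, we can parse the response with BeautifulSoup. Note that BeautifulSoup 
here treats the XML as if it were HTML, which is why the tag names in the output 
below are lowercased and wrapped in `<html>` and `<body>` elements: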

In [39]:
# name the parser explicitly; 'lxml' here is an HTML parser, which lowercases tag names
soup = BeautifulSoup(repositoryInfo.text, 'lxml')

soup

<?xml version="1.0" encoding="UTF-8"?><html><body><oai-pmh xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responsedate>2019-04-25T15:24:02Z</responsedate><request verb="Identify">http://collections.digitalmaryland.org/oai/oai.php</request><identify>
<repositoryname>CONTENTdm Server Repository</repositoryname>
<baseurl>http://collections.digitalmaryland.org/oai/oai.php</baseurl>
<protocolversion>2.0</protocolversion>
<adminemail>digitalmaryland@prattlibrary.org</adminemail>
<earliestdatestamp>2006-01-23</earliestdatestamp>
<deletedrecord>transient</deletedrecord>
<granularity>YYYY-MM-DD</granularity>
</identify>
</oai-pmh></body></html>

In [40]:
baseurl = soup.identify.baseurl.text

print(baseurl)

http://collections.digitalmaryland.org/oai/oai.php
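
An XML-aware parser keeps the original capitalisation of the element names (`baseURL` rather than `baseurl`). As a hedged sketch, the block below reads an abbreviated copy of the `Identify` response shown above with the standard library's `xml.etree.ElementTree` (a stand-in chosen so the example runs offline) and collects the child elements into a dictionary. The names `identify_xml` and `repo_info` are introduced here for illustration.

```python
import xml.etree.ElementTree as ET

# Identify response copied (and abbreviated) from the output shown above
identify_xml = b'''<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <repositoryName>CONTENTdm Server Repository</repositoryName>
    <baseURL>http://collections.digitalmaryland.org/oai/oai.php</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>'''

OAI = '{http://www.openarchives.org/OAI/2.0/}'
root = ET.fromstring(identify_xml)
identify = root.find(OAI + 'Identify')

# map each child's local name (namespace stripped) to its text
repo_info = {child.tag.replace(OAI, ''): child.text for child in identify}
print(repo_info['baseURL'])
```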


## Identify Collection Items

The repository groups items together into collections, which OAI-PMH calls sets. To determine what is 
in a collection, we can use the `ListIdentifiers` verb. The response is an 
XML document that lists the identifiers of the items in the collection. Large result lists are split into pages: 
on this server, if the collection has more than 200 items, the last element in the response will be a 
`resumptionToken`, which we can use to request the next page of results. We will use that later, 
but for now let's see what's in the collection.
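
To make the shape of a `ListIdentifiers` page concrete, here is a minimal sketch that sorts the headers on one page into live and deleted records and reads the `resumptionToken`, if there is one. The sample page is hand-written (real pages hold many more headers), and the standard library's `xml.etree.ElementTree` stands in for the parsers used in the rest of this lesson.

```python
import xml.etree.ElementTree as ET

# hand-written sample page; real responses hold many more headers
page = b'''<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header status="deleted"><identifier>oai:example.org:btpe/0</identifier></header>
    <header><identifier>oai:example.org:btpe/200</identifier></header>
    <resumptionToken>btpe:200:btpe:0000-00-00:9999-99-99:oai_dc</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>'''

ns = {'oai': 'http://www.openarchives.org/OAI/2.0/'}
root = ET.fromstring(page)

live, deleted = [], []
for header in root.findall('.//oai:header', ns):
    identifier = header.find('oai:identifier', ns).text
    # deleted records are flagged with status="deleted" on the header
    if header.get('status') == 'deleted':
        deleted.append(identifier)
    else:
        live.append(identifier)

# find() returns None when the element is absent, so test before using .text
token_el = root.find('.//oai:resumptionToken', ns)
next_token = token_el.text if token_el is not None and token_el.text else None
print(live, deleted, next_token)
```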

In [8]:
# request the identifiers from the btpe set
args = {
    'verb': 'ListIdentifiers',
    'set': 'btpe',
    'metadataPrefix': 'oai_dc'
}

btpeReq = requests.get(baseurl, params=args, headers=headers)

btpeReq.text

'<?xml version="1.0" encoding="UTF-8"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2019-04-25T12:02:44Z</responseDate><request verb="ListIdentifiers" set="btpe" metadataPrefix="oai_dc">http://collections.digitalmaryland.org/oai/oai.php</request><ListIdentifiers><header status="deleted"><identifier>oai:collections.digitalmaryland.org:btpe/0</identifier><datestamp>2012-05-31</datestamp><setSpec>btpe</setSpec></header><header status="deleted"><identifier>oai:collections.digitalmaryland.org:btpe/1</identifier><datestamp>2012-05-31</datestamp><setSpec>btpe</setSpec></header><header status="deleted"><identifier>oai:collections.digitalmaryland.org:btpe/2</identifier><datestamp>2012-05-31</datestamp><setSpec>btpe</setSpec></header><header status="deleted"><identifier>oai:collections.digitalmaryland.org:btpe/3</id

In [10]:
# 'lxml' is an HTML parser here, so tag names such as resumptionToken become lowercase
resp = BeautifulSoup(btpeReq.text, 'lxml')

resumptionToken = resp.resumptiontoken.text

resumptionToken

'btpe:200:btpe:0000-00-00:9999-99-99:oai_dc'

In [11]:
args = {
    'verb': 'ListIdentifiers',
    'resumptionToken': resp.resumptiontoken.text
}

In [12]:
req2 = requests.get(baseurl, params=args, headers=headers)

print(req2.url)

resp2 = BeautifulSoup(req2.text, 'lxml')

print(resp2)

http://collections.digitalmaryland.org/oai/oai.php?verb=ListIdentifiers&resumptionToken=btpe%3A200%3Abtpe%3A0000-00-00%3A9999-99-99%3Aoai_dc
<?xml version="1.0" encoding="UTF-8"?><html><body><oai-pmh xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responsedate>2019-04-25T12:03:11Z</responsedate><request metadataprefix="oai_dc" resumptiontoken="btpe:200:btpe:0000-00-00:9999-99-99:oai_dc" set="btpe" verb="ListIdentifiers">http://collections.digitalmaryland.org/oai/oai.php</request><listidentifiers><header><identifier>oai:collections.digitalmaryland.org:btpe/200</identifier><datestamp>2008-09-22</datestamp><setspec>btpe</setspec></header><header><identifier>oai:collections.digitalmaryland.org:btpe/201</identifier><datestamp>2008-08-14</datestamp><setspec>btpe</setspec></header><header><identifier>oai:collections.digitalmaryland.org:b

In [13]:
args = {
    'verb': 'ListIdentifiers',
    'resumptionToken': resp2.resumptiontoken.text
}

req3 = requests.get(baseurl, params=args, headers=headers)

print(req3.url)

resp3 = BeautifulSoup(req3.text, 'lxml')

print(resp3)

http://collections.digitalmaryland.org/oai/oai.php?verb=ListIdentifiers&resumptionToken=btpe%3A400%3Abtpe%3A0000-00-00%3A9999-99-99%3Aoai_dc
<?xml version="1.0" encoding="UTF-8"?><html><body><oai-pmh xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responsedate>2019-04-25T12:03:19Z</responsedate><request metadataprefix="oai_dc" resumptiontoken="btpe:400:btpe:0000-00-00:9999-99-99:oai_dc" set="btpe" verb="ListIdentifiers">http://collections.digitalmaryland.org/oai/oai.php</request><listidentifiers><header><identifier>oai:collections.digitalmaryland.org:btpe/400</identifier><datestamp>2008-09-22</datestamp><setspec>btpe</setspec></header><header><identifier>oai:collections.digitalmaryland.org:btpe/401</identifier><datestamp>2008-08-07</datestamp><setspec>btpe</setspec></header><header><identifier>oai:collections.digitalmaryland.org:b

In [14]:
for item in resp.find_all('header'):
    print(item.identifier)
    

<identifier>oai:collections.digitalmaryland.org:btpe/0</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/1</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/2</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/3</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/4</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/5</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/6</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/7</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/8</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/9</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/10</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/11</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/12</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/13</identifier>
<identifier>oai:collections.digitalmaryland.

In [15]:
for item in resp2.find_all('header'):
    print(item.identifier)

<identifier>oai:collections.digitalmaryland.org:btpe/200</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/201</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/202</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/203</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/204</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/205</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/206</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/207</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/208</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/209</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/210</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/211</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/212</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/213</identifier>
<identifier>oai:coll

In [16]:
for item in resp3.find_all('header'):
    print(item.identifier)

<identifier>oai:collections.digitalmaryland.org:btpe/400</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/401</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/402</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/403</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/404</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/405</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/406</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/407</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/408</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/409</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/410</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/411</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/412</identifier>
<identifier>oai:collections.digitalmaryland.org:btpe/413</identifier>
<identifier>oai:coll

### Get the IDs with lxml

Above we used BeautifulSoup to get a list of the identifiers. 
There are also specialized XML parsers, which are useful when we want to 
make more detailed or complex queries than simple tag lookups. In this 
example, we will use lxml to pull the identifiers into a list, 
building on the `ListIdentifiers` response from above. 

In [20]:
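# pass the raw bytes: lxml rejects a str that carries an encoding declaration,
# and .content avoids slicing the declaration off by hand
data = etree.fromstring(btpeReq.content)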

In [21]:
#identify the main tag and namespace
data.tag

'{http://www.openarchives.org/OAI/2.0/}OAI-PMH'

In [24]:
#lxml gives us a list of elements in the document hierarchy
for element in data:
    print(element)

<Element {http://www.openarchives.org/OAI/2.0/}responseDate at 0x120608eda48>
<Element {http://www.openarchives.org/OAI/2.0/}request at 0x120625f9d48>
<Element {http://www.openarchives.org/OAI/2.0/}ListIdentifiers at 0x12062651b08>


In [25]:
btpeReq.url

'http://collections.digitalmaryland.org/oai/oai.php?verb=ListIdentifiers&set=btpe&metadataPrefix=oai_dc'

In [27]:
for element in data:
    print(element.tag, element.keys())

{http://www.openarchives.org/OAI/2.0/}responseDate []
{http://www.openarchives.org/OAI/2.0/}request ['verb', 'set', 'metadataPrefix']
{http://www.openarchives.org/OAI/2.0/}ListIdentifiers []


In [28]:
for element in data[2][1]:
    print(element.tag, element.text)

{http://www.openarchives.org/OAI/2.0/}identifier oai:collections.digitalmaryland.org:btpe/1
{http://www.openarchives.org/OAI/2.0/}datestamp 2012-05-31
{http://www.openarchives.org/OAI/2.0/}setSpec btpe


In [29]:
for item in data.iter('{*}header'): 
    print(item[0].text)

oai:collections.digitalmaryland.org:btpe/0
oai:collections.digitalmaryland.org:btpe/1
oai:collections.digitalmaryland.org:btpe/2
oai:collections.digitalmaryland.org:btpe/3
oai:collections.digitalmaryland.org:btpe/4
oai:collections.digitalmaryland.org:btpe/5
oai:collections.digitalmaryland.org:btpe/6
oai:collections.digitalmaryland.org:btpe/7
oai:collections.digitalmaryland.org:btpe/8
oai:collections.digitalmaryland.org:btpe/9
oai:collections.digitalmaryland.org:btpe/10
oai:collections.digitalmaryland.org:btpe/11
oai:collections.digitalmaryland.org:btpe/12
oai:collections.digitalmaryland.org:btpe/13
oai:collections.digitalmaryland.org:btpe/14
oai:collections.digitalmaryland.org:btpe/15
oai:collections.digitalmaryland.org:btpe/16
oai:collections.digitalmaryland.org:btpe/17
oai:collections.digitalmaryland.org:btpe/18
oai:collections.digitalmaryland.org:btpe/19
oai:collections.digitalmaryland.org:btpe/20
oai:collections.digitalmaryland.org:btpe/21
oai:collections.digitalmaryland.org:btpe/2

In [30]:
# convert the earlier responses into lxml Element objects
resp = data

resp2 = etree.fromstring(req2.content)

resp3 = etree.fromstring(req3.content)

In [31]:
identifiers = list()

for item in resp.iter('{*}header'):
    identifiers.append(item[0].text)

for item in resp2.iter('{*}header'):
    identifiers.append(item[0].text)

for item in resp3.iter('{*}header'):
    identifiers.append(item[0].text)

print(len(identifiers))

483


In [33]:
for identifier in identifiers:
    print(identifier)

oai:collections.digitalmaryland.org:btpe/0
oai:collections.digitalmaryland.org:btpe/1
oai:collections.digitalmaryland.org:btpe/2
oai:collections.digitalmaryland.org:btpe/3
oai:collections.digitalmaryland.org:btpe/4
oai:collections.digitalmaryland.org:btpe/5
oai:collections.digitalmaryland.org:btpe/6
oai:collections.digitalmaryland.org:btpe/7
oai:collections.digitalmaryland.org:btpe/8
oai:collections.digitalmaryland.org:btpe/9
oai:collections.digitalmaryland.org:btpe/10
oai:collections.digitalmaryland.org:btpe/11
oai:collections.digitalmaryland.org:btpe/12
oai:collections.digitalmaryland.org:btpe/13
oai:collections.digitalmaryland.org:btpe/14
oai:collections.digitalmaryland.org:btpe/15
oai:collections.digitalmaryland.org:btpe/16
oai:collections.digitalmaryland.org:btpe/17
oai:collections.digitalmaryland.org:btpe/18
oai:collections.digitalmaryland.org:btpe/19
oai:collections.digitalmaryland.org:btpe/20
oai:collections.digitalmaryland.org:btpe/21
oai:collections.digitalmaryland.org:btpe/2

In [34]:
token = data.find('.//{*}resumptionToken')

print(token.text)

btpe:200:btpe:0000-00-00:9999-99-99:oai_dc


Rather than requesting each page by hand, we can write a loop that checks each response for a `resumptionToken` and keeps requesting pages until there are none left:

In [35]:
itemList = list()

args = {
    'verb': 'ListIdentifiers',
    'set': 'btpe',
    'metadataPrefix': 'oai_dc'
}

# make the first request; stop here if the server reports an HTTP error
r = requests.get(baseurl, params=args, headers=headers)
r.raise_for_status()
coll_xml = etree.fromstring(r.content)
print('requested',r.url)
print('response',r.status_code)

while True:
    for item in coll_xml.iter('{*}header'):
        itemList.append(item[0].text)
    print('appended items from page',r.url)
    # look for the token before reading its text:
    # the last page has no resumptionToken (or an empty one)
    token = coll_xml.find('.//{*}resumptionToken')
    if token is None or not token.text:
        break
    # set up next URL request
    args = dict()
    args['resumptionToken'] = token.text
    args['verb'] = 'ListIdentifiers'
    args_string = "&".join("%s=%s" % (k,v) for k,v in args.items())
    r = requests.get(baseurl, params=args_string, headers=headers)
    print('requesting',r.url)
    coll_xml = etree.fromstring(r.content)

requested http://collections.digitalmaryland.org/oai/oai.php?verb=ListIdentifiers&set=btpe&metadataPrefix=oai_dc
response 200
appended items from page http://collections.digitalmaryland.org/oai/oai.php?verb=ListIdentifiers&set=btpe&metadataPrefix=oai_dc
requesting http://collections.digitalmaryland.org/oai/oai.php?resumptionToken=btpe:200:btpe:0000-00-00:9999-99-99:oai_dc&verb=ListIdentifiers
appended items from page http://collections.digitalmaryland.org/oai/oai.php?resumptionToken=btpe:200:btpe:0000-00-00:9999-99-99:oai_dc&verb=ListIdentifiers
requesting http://collections.digitalmaryland.org/oai/oai.php?resumptionToken=btpe:400:btpe:0000-00-00:9999-99-99:oai_dc&verb=ListIdentifiers
appended items from page http://collections.digitalmaryland.org/oai/oai.php?resumptionToken=btpe:400:btpe:0000-00-00:9999-99-99:oai_dc&verb=ListIdentifiers

In [36]:
print(len(itemList))

483
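
The paging logic can also be packaged as a generator, which makes it reusable for other sets. This is only a sketch under stated assumptions: `harvest_identifiers`, `make_page`, and `fake_fetch` are hypothetical names, the network is replaced by two canned pages so the example runs offline, and the standard library's `xml.etree.ElementTree` is used in place of lxml.

```python
import xml.etree.ElementTree as ET

NS = {'oai': 'http://www.openarchives.org/OAI/2.0/'}

def harvest_identifiers(fetch, set_spec, metadata_prefix='oai_dc'):
    """Yield identifiers page by page; `fetch(params)` must return XML bytes."""
    params = {'verb': 'ListIdentifiers', 'set': set_spec,
              'metadataPrefix': metadata_prefix}
    while True:
        root = ET.fromstring(fetch(params))
        for el in root.findall('.//oai:header/oai:identifier', NS):
            yield el.text
        token = root.find('.//oai:resumptionToken', NS)
        # stop when the token is missing or empty (both mark the last page)
        if token is None or not token.text:
            return
        # follow-up requests carry only the verb and the token
        params = {'verb': 'ListIdentifiers', 'resumptionToken': token.text}

# stand-in for the network: two canned pages keyed by resumption token
def make_page(ids, token=None):
    headers = ''.join('<header><identifier>%s</identifier></header>' % i for i in ids)
    tail = '<resumptionToken>%s</resumptionToken>' % token if token else ''
    return ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
            + headers + tail + '</ListIdentifiers></OAI-PMH>').encode('utf-8')

PAGES = {None: make_page(['a/0', 'a/1'], token='t2'), 't2': make_page(['a/2'])}

def fake_fetch(params):
    return PAGES[params.get('resumptionToken')]

harvested = list(harvest_identifiers(fake_fetch, 'a'))
print(harvested)
```

Against the live server, `fetch` could be a small function that returns `requests.get(baseurl, params=params, headers=headers).content`.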


In [37]:
for item in itemList:
    print(item)

oai:collections.digitalmaryland.org:btpe/0
oai:collections.digitalmaryland.org:btpe/1
oai:collections.digitalmaryland.org:btpe/2
oai:collections.digitalmaryland.org:btpe/3
oai:collections.digitalmaryland.org:btpe/4
oai:collections.digitalmaryland.org:btpe/5
oai:collections.digitalmaryland.org:btpe/6
oai:collections.digitalmaryland.org:btpe/7
oai:collections.digitalmaryland.org:btpe/8
oai:collections.digitalmaryland.org:btpe/9
oai:collections.digitalmaryland.org:btpe/10
oai:collections.digitalmaryland.org:btpe/11
oai:collections.digitalmaryland.org:btpe/12
oai:collections.digitalmaryland.org:btpe/13
oai:collections.digitalmaryland.org:btpe/14
oai:collections.digitalmaryland.org:btpe/15
oai:collections.digitalmaryland.org:btpe/16
oai:collections.digitalmaryland.org:btpe/17
oai:collections.digitalmaryland.org:btpe/18
oai:collections.digitalmaryland.org:btpe/19
oai:collections.digitalmaryland.org:btpe/20
oai:collections.digitalmaryland.org:btpe/21
oai:collections.digitalmaryland.org:btpe/2

In [39]:
fname = 'btpe-identifiers.csv'
count = 0

with open(fname,'w', encoding='utf-8', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(['number','identifier'])
    for item in itemList:
        count += 1
        csvwriter.writerow([count,item])
    print('wrote csv', fname)

wrote csv btpe-identifiers.csv
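
Each identifier written to the CSV packs three pieces of information into one string: the repository host, the collection alias, and the item number. If separate columns would be useful, a sketch like the following could split them apart. `split_oai_identifier` is a hypothetical helper, and the `oai:<host>:<alias>/<number>` layout is an assumption based on the identifiers printed above; other repositories may format theirs differently.

```python
def split_oai_identifier(identifier):
    """Split 'oai:<host>:<alias>/<number>' into its three parts.

    Hypothetical helper; the layout is assumed from the identifiers
    printed above and may differ on other repositories.
    """
    scheme, host, local_id = identifier.split(':', 2)
    if scheme != 'oai':
        raise ValueError('not an OAI identifier: %r' % identifier)
    alias, _, number = local_id.partition('/')
    return host, alias, int(number)

parts = split_oai_identifier('oai:collections.digitalmaryland.org:btpe/482')
print(parts)
```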


## Get Item Information

Use the `GetRecord` verb to retrieve the metadata record for a single item.

This section also introduces how to handle namespace prefixing with a dictionary
and the use of basic XPath queries.

In [41]:
args = {
    'verb': 'GetRecord',
    'metadataPrefix': 'oai_dc',
    'identifier': 'oai:collections.digitalmaryland.org:btpe/482'
}

args_string = "&".join("%s=%s" % (k,v) for k,v in args.items())

print(args_string)

verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:collections.digitalmaryland.org:btpe/482


In [42]:
r = requests.get(baseurl, params=args_string, headers=headers)

print(r.text)

<?xml version="1.0" encoding="UTF-8"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2019-05-05T14:47:44Z</responseDate><request verb="GetRecord" metadataPrefix="oai_dc" identifier="oai:collections.digitalmaryland.org:btpe/482">http://collections.digitalmaryland.org/oai/oai.php</request><GetRecord><record><header><identifier>oai:collections.digitalmaryland.org:btpe/482</identifier><datestamp>2008-08-07</datestamp><setSpec>btpe</setSpec></header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:identifier>btpe2024</dc:identifier>
<dc:title>Baltimore Transit car numbe

To parse the XML, pass the raw bytes of the response (`r.content`) to lxml and store the resulting 
root element in a variable that we'll call `item_xml`:

In [44]:
item_xml = etree.fromstring(r.content)

We can use the `iter()` method to query basic information using a wildcard namespace (indicated by `{*}` below), which matches the tag name in any namespace:

In [11]:
for identifier in item_xml.iter('{*}identifier'):
    print(identifier.text)

oai:collections.digitalmaryland.org:btpe/482
btpe2024
http://collections.digitalmaryland.org/cdm/ref/collection/btpe/id/482
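
The wildcard query returns identifiers from more than one namespace: the first value comes from the OAI-PMH header and the others from the Dublin Core metadata. The sketch below shows the difference on an abbreviated record modelled on the response above. It is written against the standard library's `xml.etree.ElementTree` rather than lxml, where the `{*}` wildcard in `findall()` needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# abbreviated record modelled on the GetRecord response above
record = b'''<record xmlns="http://www.openarchives.org/OAI/2.0/"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
  <header><identifier>oai:collections.digitalmaryland.org:btpe/482</identifier></header>
  <metadata><dc:identifier>btpe2024</dc:identifier></metadata>
</record>'''

root = ET.fromstring(record)

# '{*}' ignores the namespace, so both kinds of identifier match
any_ns = [el.text for el in root.findall('.//{*}identifier')]

# naming the Dublin Core namespace narrows the match to the metadata element
dc_only = [el.text for el in root.findall('.//{http://purl.org/dc/elements/1.1/}identifier')]

print(any_ns)
print(dc_only)
```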


In [16]:
for title in item_xml.iter('{*}title'):
    print(title.text)

Baltimore Transit car number 5748, loop at Dundalk Avenue at Center Place, Dundalk (NRSH Baltimore Chapter tour)


In [20]:
for creator in item_xml.iter('{*}creator'):
    print(creator.text)

Miller, Edward S., 1920-2010;


However, it is more powerful to handle namespaces explicitly. To do this, we first build a dictionary that assigns a prefix to each namespace. Next, we inspect how the parser sees the tags to determine which prefixes to use. Finally, we use some basic XPath queries to look up specific elements by tag name rather than (as above) by index position in the tree.

In [53]:
# the item metadata introduces dublin core as a namespace, and we may need to reference that.
# in lxml we can use a dictionary to manage valid namespaces:
ns = {
    'OAI-PMH': 'http://www.openarchives.org/OAI/2.0/',
    'dc': 'http://purl.org/dc/elements/1.1/',
    'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
}

In [47]:
# Use for loops like this to reveal the top tags and their namespaces
# Note this also can show possible attributes to query
for item in item_xml:
    print(item.tag, item.attrib)
    try:
        print(item.attrib['verb'])
    except KeyError:
        # this element has no 'verb' attribute
        continue

{http://www.openarchives.org/OAI/2.0/}responseDate {}
{http://www.openarchives.org/OAI/2.0/}request {'verb': 'GetRecord', 'metadataPrefix': 'oai_dc', 'identifier': 'oai:collections.digitalmaryland.org:btpe/482'}
GetRecord
{http://www.openarchives.org/OAI/2.0/}GetRecord {}


In [52]:
for item in item_xml:
    try:
        itemID = item.attrib['identifier']
        print(itemID)
    except KeyError:
        # only the request element carries an 'identifier' attribute
        continue

oai:collections.digitalmaryland.org:btpe/482


In [75]:
# this loop uses findall() and an XPath query to look inside the GetRecord element 
# to find the item's header information; the trailing '/' selects all children of header
for element in item_xml.findall('.//OAI-PMH:GetRecord/OAI-PMH:record/OAI-PMH:header/', namespaces=ns):
    print(element.tag, element.attrib)
    print('  ',element.text)

{http://www.openarchives.org/OAI/2.0/}identifier {}
   oai:collections.digitalmaryland.org:btpe/482
{http://www.openarchives.org/OAI/2.0/}datestamp {}
   2008-08-07
{http://www.openarchives.org/OAI/2.0/}setSpec {}
   btpe


In [70]:
# this loop looks inside the GetRecord element to find the item's record information
for element in item_xml.findall('.//OAI-PMH:GetRecord/OAI-PMH:record/OAI-PMH:metadata/oai_dc:dc/', namespaces=ns):
    print(element.tag, element.attrib)
    print('  ',element.text)    

{http://purl.org/dc/elements/1.1/}identifier {}
   btpe2024
{http://purl.org/dc/elements/1.1/}title {}
   Baltimore Transit car number 5748, loop at Dundalk Avenue at Center Place, Dundalk (NRSH Baltimore Chapter tour)
{http://purl.org/dc/elements/1.1/}creator {}
   Miller, Edward S., 1920-2010;
{http://purl.org/dc/elements/1.1/}subject {}
   Baltimore (Md.); Baltimore Transit Company; National Railway Historical Society. Baltimore Chapter; Street-railroads; Streets;
{http://purl.org/dc/elements/1.1/}description {}
   Photograph of Baltimore Transit car number 5748 at the loop at Dundalk Avenue at Center Place in Dundalk, Baltimore County, Maryland during a tour set up by the Baltimore Chapter of the NRHS (National Railway Historical Society). Car line 26 runs on these tracks.
{http://purl.org/dc/elements/1.1/}source {}
   Pennsylvania Trolley Museum;
{http://purl.org/dc/elements/1.1/}date {}
   1953-09-20;
{http://purl.org/dc/elements/1.1/}type {}
   Image;
{http://purl.org/dc/element

In [76]:
# Make a list of the tags in the metadata element:

item_tags = list()

for item in item_xml.findall('.//OAI-PMH:GetRecord/OAI-PMH:record/OAI-PMH:metadata/oai_dc:dc/', namespaces=ns):
    item_tags.append(item.tag)

print(item_tags)

['{http://purl.org/dc/elements/1.1/}identifier', '{http://purl.org/dc/elements/1.1/}title', '{http://purl.org/dc/elements/1.1/}creator', '{http://purl.org/dc/elements/1.1/}subject', '{http://purl.org/dc/elements/1.1/}description', '{http://purl.org/dc/elements/1.1/}source', '{http://purl.org/dc/elements/1.1/}date', '{http://purl.org/dc/elements/1.1/}type', '{http://purl.org/dc/elements/1.1/}format', '{http://purl.org/dc/elements/1.1/}rights', '{http://purl.org/dc/elements/1.1/}rights', '{http://purl.org/dc/elements/1.1/}identifier']


In [78]:
# if you want to get just the tag names without the namespaces ("qualified names"), 
# treat the returned element tags as strings and split them like this:

item_tags = list()

for item in item_xml.findall('.//OAI-PMH:GetRecord/OAI-PMH:record/OAI-PMH:metadata/oai_dc:dc/', namespaces=ns):
    if item.tag[0] == '{':
        URI, tag = item.tag[1:].split('}')
        item_tags.append(tag)
    else:
        item_tags.append(item.tag)

print(item_tags)

['identifier', 'title', 'creator', 'subject', 'description', 'source', 'date', 'type', 'format', 'rights', 'rights', 'identifier']
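
The same string handling can be wrapped in a small function so it does not have to be repeated in every loop. `local_name` is a hypothetical helper written for this sketch, not part of lxml.

```python
def local_name(tag):
    """Return the tag without its '{namespace}' prefix (hypothetical helper)."""
    # rpartition splits on the last '}'; tags without a namespace pass through
    return tag.rpartition('}')[2]

qualified = ['{http://purl.org/dc/elements/1.1/}title',
             '{http://www.openarchives.org/OAI/2.0/}header',
             'plain']
names = [local_name(t) for t in qualified]
print(names)
```

lxml also offers `etree.QName(tag).localname`, which should give the same result for namespaced tags.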


Assuming (dangerously) that every item has this same set of elements, in this same order,
could we use that list of tags as the list of fields to pull from each item?

In [113]:
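# without a trailing '/', the query returns the dc:title element itself
# (a trailing '/' asks for the title's children, and it has none)
for item in item_xml.findall('.//OAI-PMH:GetRecord/OAI-PMH:record/OAI-PMH:metadata/oai_dc:dc/dc:title', ns):
    print(type(item))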

In [90]:
# then, extending the above list, use those elements to create a dictionary of 
# the metadata contents. Tags such as rights and identifier repeat, so each
# tag maps to a list of values:

item_tags = list()
item_info = dict()

for item in item_xml.findall('.//OAI-PMH:GetRecord/OAI-PMH:record/OAI-PMH:metadata/oai_dc:dc/', namespaces=ns):
    if item.tag[0] == '{':
        URI, tag = item.tag[1:].split('}')
    else:
        tag = item.tag
    if tag not in item_tags:
        item_tags.append(tag)
    # setdefault() starts an empty list the first time a tag is seen
    item_info.setdefault(tag, []).append(item.text)

print(item_info)
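
One way to avoid relying on element order is to look up each field by name and to join repeated values, so that every record produces a row with the same columns. The sketch below is illustrative only: the `FIELDS` list, the `' | '` separator, and the `record_to_row` helper are choices made for this example, the Dublin Core block is hand-written, and the standard library's `xml.etree.ElementTree` stands in for lxml.

```python
import csv
import io
import xml.etree.ElementTree as ET

DC = '{http://purl.org/dc/elements/1.1/}'
FIELDS = ['identifier', 'title', 'creator', 'rights']   # columns chosen for this sketch

# abbreviated, hand-written Dublin Core block with a repeated identifier
dc_xml = b'''<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>btpe2024</dc:identifier>
  <dc:title>Example title</dc:title>
  <dc:identifier>http://example.org/item/482</dc:identifier>
</oai_dc:dc>'''

def record_to_row(dc_element, fields):
    """Build one CSV row; repeated elements are joined with ' | '."""
    row = {}
    for field in fields:
        values = [el.text or '' for el in dc_element.findall(DC + field)]
        row[field] = ' | '.join(values)      # empty string when the field is absent
    return row

row = record_to_row(ET.fromstring(dc_xml), FIELDS)

buffer = io.StringIO()                        # in-memory file, so nothing is written to disk
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Fields that a record lacks come out as empty cells, so records with differing sets of elements still line up in the same columns.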