# Working with OAI-PMH

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
provides an XML-based API over HTTP. It supports the common tasks of
"metadata harvesting": querying a repository, exporting or gathering
records in standard metadata formats, and grouping content into collections
or sets.
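
Each OAI-PMH request is an ordinary HTTP request whose query string names a protocol verb such as `Identify` or `ListRecords`. As a rough sketch of what a harvesting library sends on your behalf, the following builds request URLs with the standard library only. `build_request_url` is a hypothetical helper, not part of any library, and the base URL is the endpoint queried in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "http://jajohnst.si676.si.umich.edu/omeka-s/oai"


def build_request_url(base_url, verb, **arguments):
    """Return the GET URL for an OAI-PMH request (hypothetical helper)."""
    # The verb always goes first; any other arguments follow as query parameters
    params = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(params)}"


print(build_request_url(BASE_URL, "Identify"))
print(build_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc"))
```

Pasting either URL into a browser shows the raw XML that a client library parses for you.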

This notebook explores OAI-PMH with the help of a Python library
called [OAIPMH Scythe](https://afuetterer.github.io/oaipmh-scythe/latest/).

To install the library, you can use `pip`:

In [None]:
# uncomment the following if you need to install
#!python -m pip install oaipmh-scythe

Import the module:

In [1]:
from oaipmh_scythe import Scythe

## Query an OAI-PMH endpoint

As shown below, a client for the endpoint is created with `Scythe()` and can then be reused.
The example also calls the `list_records()` method, which mirrors the `ListRecords` verb. The loop goes through the records in the repository and prints the identifier of each one:

In [None]:
with Scythe("http://jajohnst.si676.si.umich.edu/omeka-s/oai") as scythe:
    records = scythe.list_records()
    for record in records:
        print(record.header.identifier)

oai:jajohnst.si676.si.umich.edu:196
oai:jajohnst.si676.si.umich.edu:198
oai:jajohnst.si676.si.umich.edu:199
oai:jajohnst.si676.si.umich.edu:200
oai:jajohnst.si676.si.umich.edu:201
oai:jajohnst.si676.si.umich.edu:202
oai:jajohnst.si676.si.umich.edu:203
oai:jajohnst.si676.si.umich.edu:204
oai:jajohnst.si676.si.umich.edu:205
oai:jajohnst.si676.si.umich.edu:206
oai:jajohnst.si676.si.umich.edu:207
oai:jajohnst.si676.si.umich.edu:208
oai:jajohnst.si676.si.umich.edu:220
oai:jajohnst.si676.si.umich.edu:221
oai:jajohnst.si676.si.umich.edu:222
oai:jajohnst.si676.si.umich.edu:223
oai:jajohnst.si676.si.umich.edu:228
oai:jajohnst.si676.si.umich.edu:229
oai:jajohnst.si676.si.umich.edu:230
oai:jajohnst.si676.si.umich.edu:231
oai:jajohnst.si676.si.umich.edu:232
oai:jajohnst.si676.si.umich.edu:233
oai:jajohnst.si676.si.umich.edu:234
oai:jajohnst.si676.si.umich.edu:235
oai:jajohnst.si676.si.umich.edu:236
oai:jajohnst.si676.si.umich.edu:246
oai:jajohnst.si676.si.umich.edu:247
oai:jajohnst.si676.si.umich.

## Identify the endpoint

The `identify()` method returns information about the host repository. It mirrors the `Identify` verb.

In [3]:
repository = scythe.identify()
for info in repository:
    print(info)

('repositoryName', ['2025 Omeka'])
('baseURL', ['http://jajohnst.si676.si.umich.edu/omeka-s/oai'])
('protocolVersion', ['2.0'])
('adminEmail', ['jajohnst@umich.edu'])
('earliestDatestamp', ['1970-01-01T00:00:00Z'])
('deletedRecord', ['no'])
('granularity', ['YYYY-MM-DDThh:mm:ssZ'])
('description', [None, None])
('oai-identifier', [None])
('scheme', ['oai'])
('repositoryIdentifier', ['jajohnst.si676.si.umich.edu'])
('delimiter', [':'])
('sampleIdentifier', ['oai:jajohnst.si676.si.umich.edu:1'])
('toolkit', [None])
('title', ['Omeka S OAI-PMH Repository Module'])
('author', [None])
('name', ['John Flatness; Julian Maurice; Daniel Berthereau; and other contributors'])
('email', ['john@zerocrates.org; julian.maurice@biblibre.com; daniel.git@berthereau.net'])
('institution', ['RRCHNM; BibLibre;'])
('version', ['3.4.11'])
('toolkitIcon', ['https://omeka.org/favicon.ico'])
('URL', ['https://gitlab.com/Daniel-KM/Omeka-S-module-OaiPmhRepository'])
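
The `scheme`, `repositoryIdentifier`, `delimiter`, and `sampleIdentifier` values describe how record identifiers are put together in this repository. As a small illustration, an identifier can be split back into those parts. `split_oai_identifier` is a hypothetical helper, not part of oaipmh-scythe, and it treats everything after the second delimiter as the local part.

```python
def split_oai_identifier(identifier, delimiter=":"):
    """Split 'scheme:repositoryIdentifier:localId' into its three parts."""
    # maxsplit=2 keeps any further delimiters inside the local part
    scheme, repository_identifier, local_id = identifier.split(delimiter, 2)
    return scheme, repository_identifier, local_id


print(split_oai_identifier("oai:jajohnst.si676.si.umich.edu:196"))
```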


## Request metadata formats

Determine which metadata formats can be requested from the repository.
The `list_metadata_formats()` method mirrors the `ListMetadataFormats` verb in OAI-PMH:

In [4]:
metadata_formats = scythe.list_metadata_formats()
# Avoid naming the loop variable `format`, which shadows a Python built-in
for metadata_format in metadata_formats:
    print(metadata_format.metadataPrefix)

oai_dc
cdwalite
mets
mods
oai_dcterms
simple_xml


## Request a single record

To request a single record, pass its identifier to `get_record()`, which
mirrors the `GetRecord` verb. This approach requires knowing the identifier
in advance, for example from the `list_records()` output above.

In [5]:
record = scythe.get_record("oai:jajohnst.si676.si.umich.edu:284")

for data in record:
    print(data)

('title', ['[Interior view of library reading room with male and female students sitting at tables, reading, at the Tuskegee Institute]'])
('date', ['1902'])
('identifier', ['2009632000', 'https://www.loc.gov/item/2009632000/'])
('rights', ['No known restrictions on publication.'])
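
Because iterating over a record yields `(field, values)` pairs, the pairs can be collected into a dictionary for easier lookup. The sketch below works on pairs copied from the output above rather than on a live record, so it runs without a network connection. With a live record, `dict(record)` should behave the same way, though that is an assumption based on the iteration shown here.

```python
# Pairs copied from the output above (the long title is left out for brevity)
pairs = [
    ("date", ["1902"]),
    ("identifier", ["2009632000", "https://www.loc.gov/item/2009632000/"]),
    ("rights", ["No known restrictions on publication."]),
]

metadata = dict(pairs)

# Every value is a list because Dublin Core elements may repeat
print(metadata["date"][0])
for value in metadata["identifier"]:
    print(value)
```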


## Retrieving XML data

OAI-PMH responses are transmitted as XML.
As the examples above show, the OAIPMH Scythe library converts the responses
into Python data.
However, it is also possible to request the full XML responses
by passing the `OAIResponseIterator` class as the `iterator` argument.

In [12]:
from oaipmh_scythe.iterator import OAIResponseIterator

scythe = Scythe("http://jajohnst.si676.si.umich.edu/omeka-s/oai", iterator=OAIResponseIterator)
responses = scythe.list_records()

for response in responses:
    print(response.xml)

<Element {http://www.openarchives.org/OAI/2.0/}OAI-PMH at 0x110991640>
<Element {http://www.openarchives.org/OAI/2.0/}OAI-PMH at 0x110a3e180>
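
Printing `response.xml` shows only the element's `repr`, not the markup. To see the XML text, the element has to be serialized. The sketch below demonstrates the idea with the standard library's `xml.etree.ElementTree` on a hand-written fragment, not a real response from this endpoint. The elements returned by the library appear to be lxml elements, whose `etree.tostring()` accepts a similar `encoding="unicode"` argument.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Hand-written stand-in for a response; not actual data from the endpoint
sample = f'<OAI-PMH xmlns="{OAI_NS}"><request verb="ListRecords"/></OAI-PMH>'
root = ET.fromstring(sample)

print(root)  # shows only the element's repr, as in the output above

# Keep the OAI-PMH namespace as the default instead of an "ns0:" prefix
ET.register_namespace("", OAI_NS)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```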


To save a response to a local file:

In [None]:
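# Re-request: the loop above exhausted the iterator. `raw` is already a str.
responses = scythe.list_records()
with open("oai-xml-responses.xml", "w", encoding="utf-8") as f:
    f.write(next(responses).raw)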