This section gives a brief overview of how to use Sickle for querying OAI interfaces.
To connect to an OAI interface, first import the Sickle class:
>>> from sickle import Sickle
Next, initialize the connection by passing the base URL of the OAI interface. In our example, we use the OAI interface of the ELIS repository:
>>> sickle = Sickle('http://elis.da.ulcc.ac.uk/cgi/oai2')
Sickle provides methods for each of the six OAI verbs (ListRecords, GetRecord, Identify, ListSets, ListMetadataFormats, ListIdentifiers). Start with a ListRecords request:
>>> records = sickle.ListRecords(metadataPrefix='oai_dc')
Note that all keyword arguments you provide to this method are passed to the OAI interface as HTTP parameters. The example request therefore sends the parameters verb=ListRecords&metadataPrefix=oai_dc. You can add further parameters, such as an OAI set:
>>> records = sickle.ListRecords(metadataPrefix='oai_dc', set='driver')
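To make the mapping from keyword arguments to HTTP parameters concrete, here is a minimal sketch that uses only the standard library. The helper build_query is hypothetical and not part of Sickle; it only illustrates the idea, and the parameter order or encoding that Sickle actually sends may differ.

```python
from urllib.parse import urlencode


def build_query(verb, **kwargs):
    """Hypothetical helper: combine the OAI verb and keyword arguments
    into a URL query string."""
    params = {'verb': verb}
    params.update(kwargs)
    return urlencode(params)


# The keyword arguments become HTTP parameters next to the verb.
query = build_query('ListRecords', metadataPrefix='oai_dc', set='driver')
print(query)
```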
Since most OAI verbs yield more than one element, their respective Sickle methods return iterator objects which can be used to iterate over the records of a repository:
>>> records = sickle.ListRecords(metadataPrefix='oai_dc')
>>> records.next()
<Record oai:eprints.rclis.org:4088>
Note that this works with all verbs that return more than one element. These are: ListRecords, ListIdentifiers, ListSets, and ListMetadataFormats.
The following example shows how to iterate over the headers returned by ListIdentifiers:
>>> headers = sickle.ListIdentifiers(metadataPrefix='oai_dc')
>>> headers.next()
<Header oai:eprints.rclis.org:4088>
Iterating over the sets returned by ListSets works similarly:
>>> sets = sickle.ListSets()
>>> sets.next()
<Set Status = In Press>
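Since these methods return iterator objects, the usual Python iteration tools should apply to them as well. The sketch below assumes this and uses a plain list iterator as a stand-in for a live repository. The helper first_n and the set names are made up for illustration.

```python
from itertools import islice


def first_n(iterator, n):
    """Return up to n items from any iterator, leaving the rest unconsumed."""
    return list(islice(iterator, n))


# Stand-in for sickle.ListSets(); a real call would contact the repository.
sets = iter(['set-a', 'set-b', 'set-c'])

# Peek at the first two items only.
preview = first_n(sets, 2)
print(preview)

# A for loop consumes whatever is left.
for remaining in sets:
    print(remaining)
```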
If you need to perform selective harvesting by date using the from parameter, you may face the problem that from is a reserved word in Python:
>>> records = sickle.ListRecords(metadataPrefix='oai_dc', from="2012-12-12")
  File "<stdin>", line 1
    records = sickle.ListRecords(metadataPrefix='oai_dc', from="2012-12-12")
                                                             ^
SyntaxError: invalid syntax
Fortunately, you can circumvent this problem by using a dictionary together with the ** operator:
>>> records = sickle.ListRecords(
...     **{'metadataPrefix': 'oai_dc',
...        'from': '2012-12-12'
...     })
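If you harvest by date often, the dictionary approach can be wrapped in a small function. The following is a sketch only: harvest_params and its argument names are hypothetical and not part of Sickle, while from, until, set and metadataPrefix are the OAI-PMH parameter names.

```python
def harvest_params(metadata_prefix, from_date=None, until_date=None, set_spec=None):
    """Hypothetical helper: build a dictionary of OAI-PMH arguments,
    including the reserved word 'from', for use with the ** operator."""
    params = {'metadataPrefix': metadata_prefix}
    if from_date is not None:
        params['from'] = from_date
    if until_date is not None:
        params['until'] = until_date
    if set_spec is not None:
        params['set'] = set_spec
    return params


params = harvest_params('oai_dc', from_date='2012-12-12')
print(params)
# With a Sickle instance, the call would then be:
#     records = sickle.ListRecords(**params)
```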
OAI-PMH allows you to get a single record by using the GetRecord verb:
>>> sickle.GetRecord(identifier='oai:eprints.rclis.org:4088',
...                  metadataPrefix='oai_dc')
<Record oai:eprints.rclis.org:4088>
Sickle supports two harvesting modes that differ in the type of the returned objects. The default mode returns OAI-specific items (records, headers etc.) encoded as Python objects, as seen earlier. If you want to save the whole XML response returned by the server, pass the sickle.iterator.OAIResponseIterator class as the iterator argument when instantiating the Sickle object:
>>> from sickle.iterator import OAIResponseIterator
>>> sickle = Sickle('http://elis.da.ulcc.ac.uk/cgi/oai2', iterator=OAIResponseIterator)
>>> responses = sickle.ListRecords(metadataPrefix='oai_dc')
>>> responses.next()
<OAIResponse ListRecords>
You could then save the returned responses to disk:
>>> with open('response.xml', 'wb') as fp:
...     fp.write(responses.next().raw.encode('utf8'))
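To store every response rather than a single one, you can loop over the iterator and write each response to its own file. The sketch below assumes only that each response exposes its XML text through a raw attribute, as in the example above. The save_responses helper, the file naming scheme and the FakeResponse stand-in are hypothetical and exist only so the example runs without a live repository.

```python
import os
import tempfile


def save_responses(responses, directory):
    """Write each response's raw XML to a numbered file and return the paths."""
    paths = []
    for number, response in enumerate(responses, start=1):
        path = os.path.join(directory, 'response_%04d.xml' % number)
        # Binary mode, because the XML text is encoded to bytes first.
        with open(path, 'wb') as fp:
            fp.write(response.raw.encode('utf8'))
        paths.append(path)
    return paths


class FakeResponse(object):
    """Stand-in that exposes only the .raw attribute used above."""

    def __init__(self, raw):
        self.raw = raw


with tempfile.TemporaryDirectory() as tmp:
    saved = save_responses([FakeResponse('<a/>'), FakeResponse('<b/>')], tmp)
    saved_names = [os.path.basename(p) for p in saved]
    print(saved_names)
```

With a real connection you would pass the iterator returned by ListRecords in place of the list of stand-in objects.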
The ListRecords and ListIdentifiers methods accept an optional parameter ignore_deleted. If set to True, the returned sickle.iterator.OAIItemIterator will skip deleted records/headers:
>>> records = sickle.ListRecords(metadataPrefix='oai_dc', ignore_deleted=True)
Note
This works only with the sickle.iterator.OAIItemIterator. If you use the sickle.iterator.OAIResponseIterator, the resulting OAI responses will still contain the deleted records.
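If you work with raw responses, you therefore have to filter deleted entries yourself. In OAI-PMH, a deleted record is marked with status="deleted" on its header element. The following sketch uses only the standard library to show one way to skip such headers. The sample XML is a made-up, shortened response, and live_identifiers is a hypothetical helper, not part of Sickle.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses, in ElementTree's {uri}tag notation.
OAI_NS = '{http://www.openarchives.org/OAI/2.0/}'

# Made-up, shortened ListIdentifiers response for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:1</identifier>
    </header>
    <header status="deleted">
      <identifier>oai:example.org:2</identifier>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def live_identifiers(xml_text):
    """Return the identifiers of all headers not marked as deleted."""
    root = ET.fromstring(xml_text)
    identifiers = []
    for header in root.iter(OAI_NS + 'header'):
        if header.get('status') == 'deleted':
            continue  # skip deleted entries
        identifiers.append(header.findtext(OAI_NS + 'identifier'))
    return identifiers


print(live_identifiers(SAMPLE))
```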