# Accessing e-rara metadata and fulltexts

## 1 Metadata access via OAI-PMH interface with Polymatheia

The Open Archives Initiative Protocol for Metadata Harvesting (**OAI-PMH**) is a well-established interface that libraries, archives and similar institutions use to deliver their metadata in various formats, both library-specific ones like *MODS* and general ones like *Dublin Core*. Further information on OAI-PMH is available at https://www.openarchives.org/pmh/, and [here](https://sickle.readthedocs.io/en/latest/oaipmh.html) you'll find a brief glossary of OAI concepts.

Example requests to the e-rara OAI interface look like this:

- Identify: https://www.e-rara.ch/oai?verb=Identify
- ListSets: https://www.e-rara.ch/oai?verb=ListSets
- ListMetadataFormats: https://www.e-rara.ch/oai?verb=ListMetadataFormats
- ListIdentifiers:
https://www.e-rara.ch/oai?verb=ListIdentifiers&metadataPrefix=mods&set=bernensia
- GetRecord (a single record):
https://www.e-rara.ch/oai?verb=GetRecord&metadataPrefix=mods&identifier=23216296
- ListRecords (two or more records):
https://www.e-rara.ch/oai?verb=ListRecords&from=1900-01-01&set=bernensia&metadataPrefix=oai_dc

These examples, built from the standard OAI request methods (aka *verbs*) and their parameters, are easy to put together by hand.
But to access individual metadata fields, browser-based retrieval won't be sufficient. So, here we go!
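The example URLs above all follow the same pattern: the base URL, a `verb`, and a few verb-specific parameters. As a minimal sketch (the helper name `build_oai_url` is made up for illustration and is not part of any library), such a URL can be assembled with Python's standard library:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.e-rara.ch/oai"   # base URL taken from the examples above


def build_oai_url(verb, **params):
    """Hypothetical helper: combine an OAI verb and its parameters into a request URL."""
    query = {"verb": verb, **params}      # the verb always comes first
    return "{}?{}".format(BASE_URL, urlencode(query))


url = build_oai_url("ListIdentifiers", metadataPrefix="mods", set="bernensia")
print(url)
```

The printed URL should correspond to the `ListIdentifiers` example in the list above.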

### 1.0 Prerequisites

In [1]:
import os                              # navigate and manipulate file directories
import pandas as pd                    # pandas is the de facto standard Python library for dataframes, i.e. a table-like data format
from IPython.display import IFrame     # embed website views in Jupyter Notebook
print("Successfully imported necessary libraries")

Successfully imported necessary libraries


**Polymatheia** is a Python library to support working with digital library/archive metadata. It supports accessing metadata of different formats from OAI-PMH and also offers methods to handle the retrieved data. The metadata will be turned into a Python-style ['navigable dictionary'](https://polymatheia.readthedocs.io/en/latest/concepts.html), which allows convenient access to certain metadata fields.
Its aim is not necessarily to cover all ways of working with metadata, but to make it easy to undertake most types of tasks and analysis.  
See the [documentation](https://polymatheia.readthedocs.io/en/latest/) of the Polymatheia library.

In [2]:
# uncomment the !pip command to install polymatheia
#!pip install polymatheia          
from polymatheia.data.reader import OAISetReader               # list OAI sets
from polymatheia.data.reader import OAIMetadataFormatReader    # list available metadata formats
from polymatheia.data.reader import OAIRecordReader            # read one metadata record from OAI
from polymatheia.data.writer import PandasDFWriter             # easy transformation of flat data into a dataframe
print("Successfully imported necessary libraries")

Successfully imported necessary libraries


### 1.1 Inspect the OAI interface (oai: 'ListSets', oai: 'ListMetadataFormats')

https://www.e-rara.ch/oai/ will be the **base URL** for all OAI requests. To make life easier we put it into the variable `oai`.

In [3]:
oai = 'https://www.e-rara.ch/oai/'

First, it's good to know **which collections or *sets* are available**. To see the sets in the native OAI interface, let's look at https://www.e-rara.ch/oai?verb=ListSets with the `IFrame` function. Every set has a `setName` and a `setSpec`, a short code for the set that is used in OAI requests.

In [4]:
IFrame('https://www.e-rara.ch/oai?verb=ListSets', width=990, height=300)

That's nice, but how do we download these contents? 'OAISetReader' does this conveniently in Python. Here's how it works.

In [5]:
reader = OAISetReader(oai)             # instantiate ('make') an OAISetReader named reader
# instantiation is a standard procedure in Python, so it's a good idea to get familiar with it

print(type(reader))                    # print the object type of 'reader' for information

<class 'polymatheia.data.reader.OAISetReader'>


In [6]:
for x in reader:                       # for-loop which iterates through the reader-content and prints each
    print(x)                           # note that 'x' is an arbitrary term resp. variable

{
  "setSpec": "frc_g",
  "setName": "BCU Fribourg (GLN)"
}
{
  "setSpec": "elibch",
  "setName": "Alle Bibliotheken"
}
{
  "setSpec": "zbs",
  "setName": "Zentralbibliothek Solothurn"
}
{
  "setSpec": "sbs",
  "setName": "Stadtbibliothek Schaffhausen"
}
{
  "setSpec": "astrorara",
  "setName": "Astronomie-rara"
}
{
  "setSpec": "bau_1",
  "setName": "UB Basel (DSV01)"
}
{
  "setSpec": "kbg",
  "setName": "Kantonsbibliothek Graub\u00fcnden"
}
{
  "setSpec": "nep_r",
  "setName": "Biblioth\u00e8que des Pasteurs, BPU Neuch\u00e2tel (RERO)"
}
{
  "setSpec": "astrozut",
  "setName": "ETH-Bibliothek Z\u00fcrich"
}
{
  "setSpec": "frc_r",
  "setName": "BCU Fribourg (RERO)"
}
{
  "setSpec": "ebs",
  "setName": "Eisenbibliothek Schlatt"
}
{
  "setSpec": "nep_g",
  "setName": "Biblioth\u00e8que des Pasteurs, BPU Neuch\u00e2tel (GLN)"
}
{
  "setSpec": "lg1",
  "setName": "Biblioteca Salita dei Frati, Lugano"
}
{
  "setSpec": "demusmu",
  "setName": "Deutsches Museum, M\u00fcnchen"
}
{
  "setSpec

We might put this together and turn the retrieved data into a *Pandas dataframe* with 'PandasDFWriter'.

In [7]:
reader = OAISetReader(oai)
setspec = []                          # make an empty list named 'setspec'

for x in reader:                 
    setspec.append(x)                 # .append adds all the single reader-contents to the list 'setspec'

print(setspec[0:3])                   # print the first 3 items of the list (of key-value pairs) - just for visualizing

df = PandasDFWriter().write(setspec)  # write list 'setspec' into a Pandas dataframe named 'df'
df                                    # shows 'df' 

[{'setSpec': 'frc_g', 'setName': 'BCU Fribourg (GLN)'}, {'setSpec': 'elibch', 'setName': 'Alle Bibliotheken'}, {'setSpec': 'zbs', 'setName': 'Zentralbibliothek Solothurn'}]


Unnamed: 0,setSpec,setName
0,frc_g,BCU Fribourg (GLN)
1,elibch,Alle Bibliotheken
2,zbs,Zentralbibliothek Solothurn
3,sbs,Stadtbibliothek Schaffhausen
4,astrorara,Astronomie-rara
...,...,...
76,doi,unknown spec
77,notated_music,notated music
78,book,book
79,illustration_document,illustration document


If there are many sets, you might **search for a certain collection by string**. This also helps you **find the short code** `setSpec` that the OAI interface uses for a set, so you can investigate that set further.

In [8]:
# Example: Searching for strings 'bern' or 'Bern' in the 'setName' column
for i in df.index:                                             # for-loop which iterates through 'df' contents
    if 'bern' in df.setName[i] or 'Bern' in df.setName[i]:     # if-condition which looks for 'bern'- or 'Bern'-strings
                                                               # in the 'setName' column
        print(df.loc[i])                                       # print 'df' row, if if-condition is True

setSpec              bes_5
setName    UB Bern (NEBIS)
Name: 26, dtype: object
setSpec              bes_1
setName    UB Bern (DSV01)
Name: 28, dtype: object
setSpec                                            bernensia
setName    Bernensia des 18. bis frühen 20. Jahrhunderts ...
Name: 47, dtype: object
setSpec                        rossica
setName    Rossica Europeana (UB Bern)
Name: 56, dtype: object
setSpec                                   russexil
setName    Russisches Schrifttum im Exil (UB Bern)
Name: 57, dtype: object
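The same kind of search can also be written without an explicit loop. The following is a sketch using pandas' `str.contains` with `case=False`, which covers both 'bern' and 'Bern' in one condition. The small dataframe `sets` is a hand-made sample based on rows from the output above, so the block runs without network access.

```python
import pandas as pd

# Hand-made sample of the set list (three rows copied from the outputs above)
sets = pd.DataFrame({
    "setSpec": ["bes_1", "rossica", "zbs"],
    "setName": ["UB Bern (DSV01)", "Rossica Europeana (UB Bern)", "Zentralbibliothek Solothurn"],
})

# Case-insensitive substring match on the 'setName' column
mask = sets["setName"].str.contains("bern", case=False)
matches = sets[mask]
print(matches)
```

With the real data, `sets` would simply be replaced by the dataframe `df` built from the OAI set list.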


In [9]:
# A nicer view to explore all given sets
df.style

Unnamed: 0,setSpec,setName
0,frc_g,BCU Fribourg (GLN)
1,elibch,Alle Bibliotheken
2,zbs,Zentralbibliothek Solothurn
3,sbs,Stadtbibliothek Schaffhausen
4,astrorara,Astronomie-rara
5,bau_1,UB Basel (DSV01)
6,kbg,Kantonsbibliothek Graubünden
7,nep_r,"Bibliothèque des Pasteurs, BPU Neuchâtel (RERO)"
8,astrozut,ETH-Bibliothek Zürich
9,frc_r,BCU Fribourg (RERO)


It's also very useful to know in which **formats the metadata records** are available. The native interface provides this at the URL https://www.e-rara.ch/oai?verb=ListMetadataFormats. Here, we use the 'OAIMetadataFormatReader'.

As you can see, you can directly select information like `metadataPrefix` and `metadataNamespace` from the retrieved data by **using the dot-notation**. Dot-notation simply appends the wanted sub-element/field after a dot.

In [10]:
reader = OAIMetadataFormatReader(oai)
for formats in reader:                   
    print(formats.metadataPrefix)            # dot-notation: chooses sub-element 'metadataPrefix'

oai_dc
mets
mods
rawmods
epicur


In [11]:
reader = OAIMetadataFormatReader(oai)
[formats.metadataNamespace for formats in reader]   # shorter notation for the for-loops above, which outputs a list

['http://www.openarchives.org/OAI/2.0/oai_dc/',
 'http://www.loc.gov/METS/',
 'http://www.loc.gov/mods/v3',
 'http://www.loc.gov/mods/v3',
 'urn:nbn:de:1111-2004033116']

### 1.2 Inspect metadata records (oai: 'ListRecords')

Retrieving available **metadata in bulk** is simple with the 'OAIRecordReader'. Just specify the following parameters when calling it.

- `metadata_prefix` (mandatory)
- `set_spec` (the short code of the set you want to retrieve): optional; by default, records from *all* sets are retrieved
- `max_records` (the number of records): optional; by default, *all* available records are retrieved

To compare this result with the native OAI interface you might check the top item of 
https://www.e-rara.ch/oai?verb=ListRecords&metadataPrefix=oai_dc&set=bernensia.


In [12]:
reader = OAIRecordReader(oai, metadata_prefix='oai_dc', set_spec='bernensia', max_records=1)
[record for record in reader]      

[{'header': {'identifier': {'_text': 'oai:www.e-rara.ch:1395833'},
   'datestamp': {'_text': '2012-09-26T14:23:16Z'},
   'setSpec': [{'_text': 'bes_1'},
    {'_text': 'journal'},
    {'_text': 'collections'},
    {'_text': 'bernensia'},
    {'_text': 'ch'},
    {'_text': 'ch19'}]},
  'metadata': {'{http://www.openarchives.org/OAI/2.0/oai_dc/}dc': {'_attrib': {'xsi_schemaLocation': 'http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd'},
    'dc_title': {'_text': 'Adressbuch der Stadt Bern'},
    'dc_creator': {'_text': '[s.n.]'},
    'dc_description': [{'_text': '1860 - Jg. 75(1957)'},
     {'_text': 'Mit Stadtplan (zuerst eingeklebt, später als lose Beilage)'}],
    'dc_publisher': {'_text': 'Hallwag'},
    'dc_date': [{'_text': '1860'}, {'_text': '1957'}],
    'dc_type': [{'_text': 'Text'},
     {'_text': 'Periodical'},
     {'_text': 'Zeitschrift'}],
    'dc_format': {'_text': '35 cm'},
    'dc_identifier': [{'_text': 'doi:10.3931/e-rara-4614'},

To access a certain metadata content, you can **follow down the *navigable dictionary* path** with dot-notation, like the following example.

In [13]:
reader = OAIRecordReader(oai, set_spec='bernensia', metadata_prefix='oai_dc', max_records=1)
for record in reader:
    print(record.header.identifier._text)        # compare to the first line of the output above

oai:www.e-rara.ch:1395833


Metadata content is not always a simple flat value like the identifier above. **Many fields in structured metadata formats are lists**, as they hold multiple values. The `header` field `setSpec`, which holds the item's OAI set memberships, is a good example.

In [14]:
reader = OAIRecordReader(oai, set_spec='bernensia', metadata_prefix='oai_dc', max_records=1)
for record in reader:
    print(record.header.setSpec)

[{'_text': 'bes_1'}, {'_text': 'journal'}, {'_text': 'collections'}, {'_text': 'bernensia'}, {'_text': 'ch'}, {'_text': 'ch19'}]


The surrounding square brackets `[ ]` indicate a list (of key-value pairs). To access the items of the list individually, you can use *indexing*, which selects an item by its position in the list (starting at 0).

In [15]:
reader = OAIRecordReader(oai, set_spec='bernensia', metadata_prefix='oai_dc', max_records=1)
for record in reader:
    print(record.header.setSpec[0]._text)
    print(record.header.setSpec[1]._text)
    print(record.header.setSpec[2]._text)
    print(record.header.setSpec[3]._text)
    print(record.header.setSpec[4]._text)
    print(record.header.setSpec[5]._text)

bes_1
journal
collections
bernensia
ch
ch19


To retrieve contents from the `metadata` section, a similar square-bracket access is needed, this time with the qualifying string `'{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'` as the key.

In [16]:
reader = OAIRecordReader(oai, set_spec='bernensia', metadata_prefix='oai_dc', max_records=1)
for record in reader:
    print(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_title._text)
    print('---')
    print(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_identifier[0]._text)
    print(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_identifier[1]._text)
    print(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_identifier[2]._text)

Adressbuch der Stadt Bern
---
doi:10.3931/e-rara-4614
https://www.e-rara.ch/bes_1/doi/10.3931/e-rara-4614
system:99116914771105511


Of course, MODS metadata is far richer in content. The request is easily adjusted via the `metadata_prefix` parameter. But, as you can see in the title fields, it also brings more complexity.

In [17]:
reader = OAIRecordReader(oai, metadata_prefix='mods', set_spec='bernensia', max_records=1)
[record for record in reader]

[{'header': {'identifier': {'_text': 'oai:www.e-rara.ch:1395833'},
   'datestamp': {'_text': '2012-09-26T14:23:16Z'},
   'setSpec': [{'_text': 'bes_1'},
    {'_text': 'journal'},
    {'_text': 'collections'},
    {'_text': 'bernensia'},
    {'_text': 'ch'},
    {'_text': 'ch19'}]},
  'metadata': {'{http://www.loc.gov/mods/v3}mods': {'_attrib': {'version': '3.6',
     'xsi_schemaLocation': 'http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-6.xsd'},
    'mods_titleInfo': [{'mods_title': {'_text': 'Adressbuch der Stadt Bern'}},
     {'_attrib': {'type': 'alternative'},
      'mods_title': {'_text': 'Adressbuch der Stadt Bern einschliesslich Bümpliz, Köniz, Liebefeld...'}},
     {'_attrib': {'type': 'alternative'},
      'mods_title': {'_text': 'Adressbuch der Stadt Bern und Umgebung'}},
     {'_attrib': {'type': 'alternative'},
      'mods_title': {'_text': 'Adress-Kalender der Stadt Bern'}}],
    'mods_typeOfResource': {'_text': 'text'},
    'mods_genre': [{'_text': 

Also, the qualifying string of the `metadata` section has to be adapted to `'{http://www.loc.gov/mods/v3}mods'`.

In [18]:
reader = OAIRecordReader(oai, metadata_prefix='mods', set_spec='bernensia', max_records=1)
for record in reader:
    print(record.header.identifier._text)
    print('---')
    print(record.metadata['{http://www.loc.gov/mods/v3}mods'].mods_titleInfo)
    print('---')
    print(record.metadata['{http://www.loc.gov/mods/v3}mods'].mods_titleInfo[0].mods_title._text)
    print(record.metadata['{http://www.loc.gov/mods/v3}mods'].mods_titleInfo[1].mods_title._text)
    print(record.metadata['{http://www.loc.gov/mods/v3}mods'].mods_titleInfo[2].mods_title._text)
    print(record.metadata['{http://www.loc.gov/mods/v3}mods'].mods_titleInfo[3].mods_title._text)
    print('---')
    print(record.metadata['{http://www.loc.gov/mods/v3}mods'].mods_titleInfo[3]._attrib.type)

oai:www.e-rara.ch:1395833
---
[{'mods_title': {'_text': 'Adressbuch der Stadt Bern'}}, {'_attrib': {'type': 'alternative'}, 'mods_title': {'_text': 'Adressbuch der Stadt Bern einschliesslich Bümpliz, Köniz, Liebefeld...'}}, {'_attrib': {'type': 'alternative'}, 'mods_title': {'_text': 'Adressbuch der Stadt Bern und Umgebung'}}, {'_attrib': {'type': 'alternative'}, 'mods_title': {'_text': 'Adress-Kalender der Stadt Bern'}}]
---
Adressbuch der Stadt Bern
Adressbuch der Stadt Bern einschliesslich Bümpliz, Köniz, Liebefeld...
Adressbuch der Stadt Bern und Umgebung
Adress-Kalender der Stadt Bern
---
alternative


Because drilling down the *navigable dictionary* path can lead to long and complicated commands, which might not be very readable either, there is a neater way to do so with the `get` method of the records.  
And: There is **no issue anymore with single values versus lists and qualifying strings**. Just put the path elements together as a list and pass it to `get`!

Note that if more than one element is retrieved, a result list (in square brackets) is returned.

In [19]:
reader = OAIRecordReader(oai, metadata_prefix='mods', set_spec='bernensia', max_records=1)
for record in reader:
    print(record.get(['header', 'identifier', '_text']))
    print('---')
    print(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_titleInfo', 'mods_title', '_text']))
    print('---')
    print(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_titleInfo', '_attrib', 'type']))

oai:www.e-rara.ch:1395833
---
['Adressbuch der Stadt Bern', 'Adressbuch der Stadt Bern einschliesslich Bümpliz, Köniz, Liebefeld...', 'Adressbuch der Stadt Bern und Umgebung', 'Adress-Kalender der Stadt Bern']
---
[None, 'alternative', 'alternative', 'alternative']


In [20]:
# Also works with the shorter form of for-loops - but mind that it delivers a nested - or 'doubled' - list
reader = OAIRecordReader(oai, set_spec='bernensia', metadata_prefix='oai_dc', max_records=1)

[record.get(['metadata', '{http://www.openarchives.org/OAI/2.0/oai_dc/}dc', 'dc_identifier', '_text']) \
            for record in reader]       # '\' indicates that command proceeds on the next line

[['doi:10.3931/e-rara-4614',
  'https://www.e-rara.ch/bes_1/doi/10.3931/e-rara-4614',
  'system:99116914771105511']]
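To make the behaviour of such a path lookup less opaque, here is a plain-Python sketch of the idea. The function `get_path` is made up for illustration and is an assumption about the general technique, not Polymatheia's code: it follows a list of keys and, whenever it meets a list, applies the rest of the path to every item.

```python
def get_path(data, path):
    """Hypothetical helper: follow 'path' through nested dicts, fanning out over lists."""
    if not path:
        return data                                   # end of the path: return what we found
    if isinstance(data, list):
        return [get_path(item, path) for item in data]
    if isinstance(data, dict):
        if path[0] not in data:
            return None                               # missing key: no value
        return get_path(data[path[0]], path[1:])
    return None                                       # a flat value cannot be descended into


# Shortened sample modelled on the 'mods_titleInfo' output above
mods = {
    "mods_titleInfo": [
        {"mods_title": {"_text": "Adressbuch der Stadt Bern"}},
        {"_attrib": {"type": "alternative"},
         "mods_title": {"_text": "Adress-Kalender der Stadt Bern"}},
    ]
}

titles = get_path(mods, ["mods_titleInfo", "mods_title", "_text"])
types = get_path(mods, ["mods_titleInfo", "_attrib", "type"])
print(titles)
print(types)
```

The `None` in the second result marks the title entry that has no `type` attribute, which matches the pattern seen in the `get` output above.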

Now, it's really easy to access whatever metadata content you like.

For instance, you might be interested in **all responsible persons and bodies**, their role, and GND identifiers...

In [21]:
# First looking at the 'mods_name' section to get an overview of its structure
reader = OAIRecordReader(oai, metadata_prefix='mods', set_spec='bernensia', max_records=2)

[record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_name']) for record in reader]

[None,
 {'_attrib': {'type': 'personal',
   'usage': 'primary',
   'authority': 'gnd',
   'authorityURI': 'http://d-nb.info/gnd/',
   'valueURI': 'http://d-nb.info/gnd/127809651,'},
  'mods_nameIdentifier': {'_text': '(DE-588)127809651,'},
  'mods_namePart': [{'_text': 'Raemy, Alfred de'},
   {'_text': '1825-1909', '_attrib': {'type': 'date'}}],
  'mods_role': [{'mods_roleTerm': {'_text': 'Verfasser',
     '_attrib': {'type': 'text'}}},
   {'mods_roleTerm': {'_text': 'aut',
     '_attrib': {'authority': 'marcrelator', 'type': 'code'}}}]}]

In [22]:
# Selecting the 'mods_name' sub-elements of interest
reader = OAIRecordReader(oai, set_spec='bernensia', metadata_prefix='mods', max_records=10)
for record in reader:
    print(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_name', 'mods_namePart', '_text']))
    print(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_name', '_attrib', 'valueURI']))
    print(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_name', 'mods_role', 'mods_roleTerm', \
                      '_text']))
    print('---')

None
None
None
---
['Raemy, Alfred de', '1825-1909']
http://d-nb.info/gnd/127809651,
['Verfasser', 'aut']
---
Sommerlatt, Christian Vollrath von
http://d-nb.info/gnd/100559921
None
---
Typographische Societäts-Buchhandlung (Bern)
http://d-nb.info/gnd/1086438582,
['Drucker', 'prt']
---
Messerli, Johann Ch.
None
None
---
None
None
None
---
Sterchi, Jakob
None
None
---
['Jenni, Christian Albrecht', '1786-1861']
http://d-nb.info/gnd/1037555503
['Herausgeber', 'edt']
---
None
None
None
---
['Tscharner, Friedrich', ['Haller, Ludwig Albrecht', '1773-1837'], 'Stadtbibliothek Bern']
[None, 'http://d-nb.info/gnd/1037511646', 'http://d-nb.info/gnd/508313-8']
[None, ['Drucker', 'prt'], None]
---
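As the output shows, a lookup can return `None`, a single string, a list, or even a nested list, depending on the record. If you want to process the values uniformly, one option is to normalize them into a flat list first. The helper `flatten` below is a hypothetical sketch, not a Polymatheia function.

```python
def flatten(value):
    """Hypothetical helper: turn None, a single value or (nested) lists into one flat list."""
    if value is None:
        return []
    if isinstance(value, list):
        flat = []
        for item in value:
            flat.extend(flatten(item))    # recurse into nested lists
        return flat
    return [value]


# Values copied from the output above
print(flatten(None))
print(flatten("Sterchi, Jakob"))
print(flatten(["Tscharner, Friedrich", ["Haller, Ludwig Albrecht", "1773-1837"], "Stadtbibliothek Bern"]))
```

Note that this drops `None` entries inside lists, so the positional alignment between names, URIs and roles is lost; keep the original structure if you need to match them up.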


### 1.3 Save and recover complex metadata structures

Before any data is downloaded, let's create a common folder `data` next to our working directory to store the data in.

In [23]:
print(os.getcwd())                                # print current working directory

C:\Users\kwoit\Documents\GitHub\e-rara-access\code


In [24]:
os.chdir(os.pardir)                               # change to parent directory
os.makedirs('data', exist_ok=True)                # make new folder 'data'
os.chdir('data')                                  # change to 'data' folder
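Changing the working directory affects every later relative path in the notebook. As an alternative sketch, the folder can be created and addressed with `pathlib` without changing directories. The function name `ensure_data_dir` is made up, and a temporary directory stands in for the project folder so that the example leaves no files behind.

```python
import tempfile
from pathlib import Path


def ensure_data_dir(base):
    """Hypothetical helper: create a 'data' folder below 'base' and return its path."""
    data_dir = Path(base) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)   # no error if the folder already exists
    return data_dir


with tempfile.TemporaryDirectory() as tmp:        # stand-in for the project folder
    data_dir = ensure_data_dir(tmp)
    created = data_dir.is_dir()
    print(data_dir.name, created)
```

Files would then be addressed as `data_dir / 'some_file.json'` instead of relying on the current working directory.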

To **download a whole bunch of metadata items** in nested formats like MODS, the *JSONWriter* module is very helpful.
It creates a complex folder structure and JSON files to reproduce the structured metadata. And with *JSONReader* one can easily recover the metadata set.

In [25]:
from polymatheia.data.writer import JSONWriter     # also available: CSVReader (for flat data), XMLReader resp. Writer
from polymatheia.data.reader import JSONReader

JSONWriter takes two parameters:
- The first is the name of the directory into which the data should be stored.
- The second is the dot-notated path to each record's identifier (here `header.identifier._text`), which identifies the record when it is stored.

For more clarity, these are the contents of `header.identifier` for the first ten Bernensia records we will refer to:

In [26]:
reader = OAIRecordReader(oai, set_spec='bernensia', metadata_prefix='mods', max_records=10)
for record in reader:
    print(record.header.identifier._text)

oai:www.e-rara.ch:1395833
oai:www.e-rara.ch:1396731
oai:www.e-rara.ch:1397203
oai:www.e-rara.ch:1757425
oai:www.e-rara.ch:1757509
oai:www.e-rara.ch:1757592
oai:www.e-rara.ch:1757931
oai:www.e-rara.ch:1758267
oai:www.e-rara.ch:2069554
oai:www.e-rara.ch:4709578


In [27]:
# Download and save the first ten Bernensia records from MODS format
reader = OAIRecordReader(oai, set_spec='bernensia', metadata_prefix='mods', max_records=10)
writer = JSONWriter('polymatheia', 'header.identifier._text')    # 'polymatheia' = directory to store the data in
writer.write(reader)

In [28]:
# Recover the first ten Bernensia records from local disk
reader = JSONReader('polymatheia')
[record for record in reader]

[{'header': {'identifier': {'_text': 'oai:www.e-rara.ch:1757592'},
   'datestamp': {'_text': '2012-05-08T13:37:52Z'},
   'setSpec': [{'_text': 'bes_1'},
    {'_text': 'journal'},
    {'_text': 'collections'},
    {'_text': 'bernensia'},
    {'_text': 'ch'},
    {'_text': 'ch19'}]},
  'metadata': {'{http://www.loc.gov/mods/v3}mods': {'_attrib': {'version': '3.6',
     'xsi_schemaLocation': 'http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-6.xsd'},
    'mods_titleInfo': {'mods_title': {'_text': 'Hand- und Adressbuch der Bundesstadt Bern'},
     'mods_subTitle': {'_text': 'Verzeichniss der Behörden, Angabe der Häuserbesitzer und Wohnungen, der Handels- und Gewerbstreibenden, der Gesellschaften und Vereine, Tarife und Verordnungen der Verkehrsanstalten u. dgl. m'}},
    'mods_typeOfResource': {'_text': 'text'},
    'mods_genre': [{'_text': 'periodical', '_attrib': {'authority': 'marcgt'}},
     {'_text': 'Text', '_attrib': {'authority': 'rdacontent'}},
     {'_text': 

The stored data **can be used in just the same way** as the directly accessed data, for instance for the `mods_genre` element.

In [29]:
reader = JSONReader('polymatheia')
for record in reader:
    print(record.header.identifier._text)
    print(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_genre', '_text']))
    print('---')

oai:www.e-rara.ch:1757592
['periodical', 'Text', 'Zeitschrift']
---
oai:www.e-rara.ch:1397203
Text
---
oai:www.e-rara.ch:4709578
None
---
oai:www.e-rara.ch:1758267
None
---
oai:www.e-rara.ch:1757931
None
---
oai:www.e-rara.ch:1395833
['periodical', 'Text', 'Zeitschrift']
---
oai:www.e-rara.ch:2069554
Text
---
oai:www.e-rara.ch:1757425
Text
---
oai:www.e-rara.ch:1396731
Text
---
oai:www.e-rara.ch:1757509
Text
---


In [30]:
# One more example: The 'mods_note' element and its '_attrib' sub-element
reader = JSONReader('polymatheia')
for record in reader:
    print(record.header.identifier._text)
    print(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_note', '_text']))
    print(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_note', '_attrib', 'type']))
    print('---')

oai:www.e-rara.ch:1757592
1859
date/sequential designation
---
oai:www.e-rara.ch:1397203
['Bearb. und hrsg. von C. v. Sommerlatt', 'Beigebunden: Ergänzungsheft zu dem Adressbuch der Republik Bern von 1836. Erschienen im April 1839, Thun.']
['statement of responsibility', None]
---
oai:www.e-rara.ch:4709578
['Erster alphabetischer Katalog der damaligen Stadtbibliothek, verfasst vom Oberbibliothekar F. Tscharner', 'Theil 1 (1811): XLVIII, 470 S.; Theil 2 (1811): 508 S.; Theil 3 (1811): 462 S.; Suppl. (1839): XXII, 287 S.; Suppl. 2 (1847): 224 S.; Suppl. 3 (1856): 390 S.', 'Betrifft die Handschrift Cod. 757.I der Burgerbibliothek Bern (S. 212).']
[None, None, None]
---
oai:www.e-rara.ch:1758267
['Hrsg. von Christian Albrecht Jenni ; Fortsetzung der Sammlung der Grabschriften der ... Gottesäcker Monbijou und Rosengarten', 'Herausgeber am Ende des Vorworts von Band 2: "C. A. Jenni"']
['statement of responsibility', None]
---
oai:www.e-rara.ch:1757931
von J. Sterchi
statement of responsibili

Of course, you can also read out certain metadata fields **via basic dot-notation**. But this takes a bit more code to cope with the list vs. single value issue.

In [31]:
reader = JSONReader('polymatheia')
for record in reader:
    print(record.header.identifier._text)
    mods = record.metadata['{http://www.loc.gov/mods/v3}mods']   # MODS section of the record
    if 'mods_genre' in mods:
        if isinstance(mods.mods_genre, list):                    # several genre values
            for genre in mods.mods_genre:
                print(genre._text)
        else:                                                    # a single genre value
            print(mods.mods_genre._text)
    else:
        print(None)
    print('---')

oai:www.e-rara.ch:1757592
periodical
Text
Zeitschrift
---
oai:www.e-rara.ch:1397203
Text
---
oai:www.e-rara.ch:4709578
None
---
oai:www.e-rara.ch:1758267
None
---
oai:www.e-rara.ch:1757931
None
---
oai:www.e-rara.ch:1395833
periodical
Text
Zeitschrift
---
oai:www.e-rara.ch:2069554
Text
---
oai:www.e-rara.ch:1757425
Text
---
oai:www.e-rara.ch:1396731
Text
---
oai:www.e-rara.ch:1757509
Text
---


## 2 Direct metadata access via OAI-PMH 

Unfortunately, the Polymatheia library doesn't offer methods for *all* OAI verbs. For instance, there is no `ListIdentifiers` method (which delivers only the identifiers of a given set) and no `GetRecord` for retrieving the metadata of a certain item using its e-rara ID.

That's where the widely used libraries **requests** and **BeautifulSoup** come into play, and more manual coding is needed.


### 2.0 Prerequisites

In [32]:
# Load the necessary libraries
import requests                                 # request URLs
from bs4 import BeautifulSoup as soup           # webscrape and parse HTML and XML
import lxml                                     # lxml's XML parser is currently the only XML parser supported by bs4;
                                                # call with soup(markup, 'lxml-xml' OR 'xml')
import os                                       # navigate and manipulate file directories
import time                                     # work with time stamps
import json                                     # work with JSON data structures
import pandas as pd                             # pandas is the de facto standard Python library for dataframes
from IPython.display import IFrame              # embed website views in Jupyter Notebook
import math                                     # work with mathematical functions
import re                                       # work with regular expressions
print("Successfully imported necessary libraries")

Successfully imported necessary libraries


### 2.1 Inspect the OAI interface (oai: 'Identify')

https://www.e-rara.ch/oai/ will be the **base URL** for all OAI requests. To make life easier we put it into the variable `oai`.

Furthermore, the very **core of all operations on the OAI interface** will be a small function called `load_xml()`. It simply requests the base URL with the various parameters and decodes the answer to plain XML. Therefore, it can be used with all OAI verbs and their respective parameters.

In [33]:
oai = 'https://www.e-rara.ch/oai/'

In [34]:
def load_xml(params):
    '''
    Accesses the OAI interface according to given parameters and scrapes its HTML/XML content.
    '''
    base_url = oai
    response = requests.get(base_url, params=params)
    output_soup = soup(response.content, "lxml")
    return output_soup

You may use it to read out the basic `Identify` response of the OAI interface.

Note that the parameters passed to the `load_xml` function are the same as in the respective URL `https://www.e-rara.ch/oai?verb=Identify`: `verb` is the parameter key and `Identify` the parameter value. We therefore pass a **parameter key-value pair**, written as a Python dictionary in curly braces.

In [35]:
xml_soup = load_xml({'verb': 'Identify'})
xml_soup

<html><body><oai-pmh xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responsedate>2021-05-11T02:35:27Z</responsedate><request verb="Identify">https://www.e-rara.ch/oai/</request><identify><repositoryname>Visual Library Server</repositoryname><baseurl>https://www.e-rara.ch/oai/</baseurl><protocolversion>2.0</protocolversion><adminemail>issue-erara@library.ethz.ch</adminemail><earliestdatestamp>2009-11-10T09:38:31Z</earliestdatestamp><deletedrecord>no</deletedrecord><granularity>YYYY-MM-DDThh:mm:ssZ</granularity></identify></oai-pmh></body></html>

You can easily compare this with the native view using the `IFrame` function below.

In [36]:
IFrame('https://www.e-rara.ch/oai?verb=Identify', width=990, height=330)
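The output of `load_xml` is wrapped in `<html><body>` and all tag names are lowercased, because the `"lxml"` parser treats the response as HTML. As a sketch of a namespace-aware alternative, the standard library's `xml.etree.ElementTree` keeps the original tag names. The sample string below is a shortened, hand-written reconstruction of the `Identify` response shown above, with the camelCase element names defined by the OAI-PMH specification, so the block runs offline.

```python
import xml.etree.ElementTree as ET

# Shortened sample of an Identify response (reconstructed by hand from the output above)
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2021-05-11T02:35:27Z</responseDate>
  <request verb="Identify">https://www.e-rara.ch/oai/</request>
  <Identify>
    <repositoryName>Visual Library Server</repositoryName>
    <baseURL>https://www.e-rara.ch/oai/</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <earliestDatestamp>2009-11-10T09:38:31Z</earliestDatestamp>
    <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
  </Identify>
</OAI-PMH>"""

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}   # prefix 'oai' is chosen freely here

root = ET.fromstring(SAMPLE)
identify = root.find("oai:Identify", NS)

# Strip the '{namespace}' part from each tag to get plain field names
info = {child.tag.split("}")[1]: child.text for child in identify}
print(info["repositoryName"], info["protocolVersion"])
```

The rest of this notebook keeps using `load_xml` with BeautifulSoup; this variant is only meant to show what the untouched XML structure looks like.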

### 2.2 Access and download certain records (oai: 'GetRecord')

The same can be done with the `GetRecord` verb; here `metadataPrefix` and `identifier` are mandatory parameters. Note that, as the identifier is an integer, you can omit the quotation marks used with the other parameter values. Still, it's often a good idea to keep them.

In [37]:
# Example for accessing a single metadata record
# https://www.e-rara.ch/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=15916138

xml_soup = load_xml({'verb': 'GetRecord', 'metadataPrefix': 'oai_dc', 'identifier': 15916138})
xml_soup

<html><body><oai-pmh xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responsedate>2021-05-11T02:35:27Z</responsedate><request identifier="15916138" metadataprefix="oai_dc" verb="GetRecord">https://www.e-rara.ch/oai/</request><getrecord><record><header><identifier>oai:www.e-rara.ch:15916138</identifier><datestamp>2017-06-15T06:45:13Z</datestamp><setspec>bes_1</setspec><setspec>book</setspec></header><metadata><oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"><dc:title>Freiburg und Bern und die Genfer Messen</dc:title><dc:creator>Ammann, Hektor</dc:creator><dc:subject>Handel</dc:subject><dc:subject>Messe</d

Before downloading, first create a folder for the retrieved metadata.

In [38]:
print(os.getcwd())                       # print current working directory

C:\Users\kwoit\Documents\GitHub\e-rara-access\data


In [39]:
os.makedirs('metadata', exist_ok=True)    # make folder 'metadata'
os.chdir('metadata')                     # change to folder 'metadata'

You might want to **download the metadata record directly** by its e-rara ID and in a specified metadata format. The `download_record()` function does this for you easily. If you choose no format, MODS will be delivered.

In [40]:
def download_record(ID, metadataPrefix = 'mods'):
    '''
    Downloads a certain metadata record from OAI to a single XML file.
    Throws a notice if metadata file already exists and leaves the existing one.
    Parameters:
    ID = E-rara ID of the wanted record.
    metadataPrefix = Metadata format to be delivered. Default value is MODS.
    '''
    path = os.getcwd()
    output_soup = load_xml({'verb': 'GetRecord', 'metadataPrefix': metadataPrefix, 'identifier': ID})
    outfile = path + '/{}.xml'.format(ID) 
    try:
        with open(outfile, mode='x', encoding='utf-8') as f:    # mode 'x' fails if the file already exists
            f.write(output_soup.decode())
        print("Metadata file {}.xml saved".format(ID))
    except FileExistsError:
        print("Metadata file {}.xml exists already".format(ID))

In [41]:
# Example for downloading a single metadata record
# https://www.e-rara.ch/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=15916138

download_record(15916138, 'oai_dc')

Metadata file 15916138.xml saved
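The "exists already" notice of `download_record()` comes from opening the file in mode `'x'` (exclusive creation). The sketch below isolates that behaviour in a temporary folder without contacting the OAI interface; `write_once` is a hypothetical helper name, and the XML strings are placeholders rather than real records.

```python
import os
import tempfile

def write_once(path, text):
    """Write text to path only if the file does not exist yet; return True if written."""
    try:
        with open(path, mode="x", encoding="utf-8") as f:
            f.write(text)
        return True
    except FileExistsError:
        # Mode 'x' refuses to overwrite, so the existing file is left untouched
        return False

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "15916138.xml")
    first = write_once(target, "<record>first</record>")
    second = write_once(target, "<record>second</record>")
    with open(target, encoding="utf-8") as f:
        content = f.read()

print(first, second, content)
```

The second call does not replace the file, which is why re-running a download cell keeps the files from the first run.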


### 2.3 Handle bigger data sets with a resumption token (oai: 'ListIdentifiers')

However, scraping the OAI interface output directly poses a problem with larger data sets. The output is **split into segments of ten records, each presented on its own web page**. Looking at a sample request with the `ListIdentifiers` verb, you will find the `resumptionToken` element, which delivers the `completeListSize` as an attribute and holds the resumption token itself. The [resumption token](http://www.openarchives.org/OAI/openarchivesprotocol.html#FlowControl) is required to access the next segment page, which in turn includes a resumption token for the page after it, and so on.

In [42]:
# Scroll to the end of the page for the resumption token
IFrame('https://www.e-rara.ch/oai?verb=ListIdentifiers&set=bernensia&metadataPrefix=oai_dc', width=990, height=300)
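To make the structure of such a page concrete, here is a minimal sketch that parses a hand-written sample shaped like a `ListIdentifiers` response. The sample is not real server output and its token value is invented; the two record IDs are taken from the `vitruviana` example in this notebook. The sketch uses the standard library's `xml.etree.ElementTree`, which keeps the original element case, whereas the notebook's BeautifulSoup/`lxml` approach lowercases element and attribute names.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written sample shaped like a ListIdentifiers response (token value is made up)
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:www.e-rara.ch:6090286</identifier></header>
    <header><identifier>oai:www.e-rara.ch:6090520</identifier></header>
    <resumptionToken completeListSize="47" cursor="0">example-token</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
root = ET.fromstring(sample)

# Extract the numeric e-rara ID that follows the 'oai:www.e-rara.ch:' prefix
ids = []
for node in root.findall(".//oai:identifier", ns):
    match = re.fullmatch(r"oai:www\.e-rara\.ch:(\d+)", node.text.strip())
    if match:
        ids.append(match.group(1))

# The resumptionToken element carries the list size as an attribute and the token as its text
token_node = root.find(".//oai:resumptionToken", ns)
total = int(token_node.get("completeListSize"))
token = token_node.text

print(ids, total, token)
```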

Because of this, accessing metadata directly from the OAI interface is a bit more complex. With `retrieve_set_metadata()` we create a function that retrieves all metadata records of a set in a certain format and saves the XML files into the following (automatically built) folder structure: metadata > set > format > files.

In [43]:
def retrieve_set_metadata(Set, metadataPrefix = 'mods'):
    '''
    Downloads metadata records of a given set and in a given format from OAI to XML files in a certain folder structure.
    Therefore it
    * requests e-rara OAI-PMH interface according to a set 
    * creates the following folder path: set > format
    * retrieves the set's e-rara IDs
    * retrieves metadata according to set e-rara IDs and given metadata format (default: MODS)
    * writes metadata into single <e_rara_id>.xml files in the metadata > set > format folder.
    Parameters:
    Set = The wanted collection/set.
    metadataPrefix = Metadata format to be delivered. Default value is MODS.
    '''
    start = time.perf_counter()

    # Set parameters to the interface
    base_url = oai
    recordsearch_term = {'verb': 'GetRecord', 'metadataPrefix': metadataPrefix}
    listsearch_term = {'verb': 'ListIdentifiers', 'metadataPrefix': metadataPrefix, 'set': Set}
    
    # Make a folder named like the set, with a subfolder named like the metadata format, to store the files in
    directory = metadataPrefix
    parent_dir = os.getcwd() + '/' + Set
    path = os.path.join(parent_dir, directory)
    try:
        os.makedirs(path, exist_ok = True)
        print("Directory '%s' is already available or created successfully" %directory)
    except OSError as error:
        print("Directory '%s' cannot be created: %s" % (directory, error))
    
        
    # Basic functions
    def load_xml(params):
        '''
        Accesses the OAI interface according to given parameters and scrapes its content.
        '''
        response = requests.get(base_url, params=params)
        output_soup = soup(response.content, "lxml")
        return output_soup

    def download_record(ID):
        '''
        Downloads a certain metadata record from OAI to a single XML file.
        Throws a notice if metadata file already exists and leaves the existing one.
        Parameter:
        ID = E-rara ID of the wanted record.
        '''
        output_soup = load_xml({'verb': 'GetRecord', 'metadataPrefix': metadataPrefix, 'identifier': ID})
        outfile = path + '/{}.xml'.format(ID) 
        try:
            with open(outfile, mode='x', encoding='utf-8') as f:
                f.write(output_soup.decode())
        except FileExistsError:
            print("Metadata file {}.xml exists already".format(ID))

    # Start with the first access to OAI interface - get the item IDs of a set
    xml_soup = load_xml(listsearch_term)

    # Calculate how many requests it takes to page through the results list (ten identifiers per page), print notice
    total = int(xml_soup.resumptiontoken['completelistsize'])
    splits = math.ceil(total / 10)       # true division, so that exact multiples of ten need no extra request
    print(total, 'identifiers to request in ', splits, 'data splits')
    

    for i in range(splits):
        if i == 0:
            # First access for item IDs - first page + information about whole length of results list
            xml_soup_new = load_xml(listsearch_term)      
        else:
            # Following accesses for item IDs
            xml_soup_new = load_xml({'verb': 'ListIdentifiers', 'resumptionToken': resumption_token})

        # Scraping out the e-rara IDs
        ids = [] 
        for ID in [(i.contents[0]) for i in xml_soup_new.find_all('identifier')]:
            match = re.search(r'oai:www\.e-rara\.ch:(\d+)', ID)   # extract the number following 'oai:www.e-rara.ch:'
            if match:
                ids.append(match.group(1))     # first parenthesized subgroup of group() = number

        # Download the MODS metadata records according to retrieved e-rara IDs
        print('Start retrieving metadata for e-rara IDs ', ids)  
        for ID in ids:
            download_record(ID)
        ids = []

        # Update the resumption token to retrieve the next page
        try:
            new_token = xml_soup_new.find('resumptiontoken').get_text()
            resumption_token = new_token
            print('New resumption token:', resumption_token)
        except AttributeError:
            print('Reached end of IDs/results list')       # notice when last page is done

    with os.scandir(path) as entries:
        count = 0
        for entry in entries:
            count += 1       
    print("{} metadata files in {}".format(count, path))
    finish = time.perf_counter()
    print("Finished in {} second(s)".format(round(finish - start, 2)))
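As an alternative to computing the number of pages in advance, paging can also be driven by the resumption token alone: keep requesting until no token comes back. The sketch below shows this pattern against a hypothetical in-memory stand-in for the OAI interface (`FAKE_PAGES` and `fetch_page` are invented for illustration, no network access takes place). It relies on the OAI-PMH convention that the last page carries an empty `resumptionToken`.

```python
# Hypothetical stand-in for the OAI interface: each entry maps a token
# to (IDs on that page, token of the next page)
FAKE_PAGES = {
    None: (["1", "2", "3"], "token-a"),
    "token-a": (["4", "5", "6"], "token-b"),
    "token-b": (["7"], ""),              # an empty token marks the last page
}

def fetch_page(token=None):
    """Return (ids, next_token) for the page addressed by token; None means the first page."""
    return FAKE_PAGES[token]

def iter_all_ids():
    """Yield all IDs by following resumption tokens until none is left."""
    token = None
    while True:
        ids, next_token = fetch_page(token)
        yield from ids
        if not next_token:               # missing or empty token: last page reached
            break
        token = next_token

all_ids = list(iter_all_ids())
print(all_ids)
```

With a real interface, `fetch_page` would wrap the request and the token extraction; the loop itself would stay the same regardless of the server's page size.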

In [44]:
# Just choose the appropriate set shortcut and the metadata format
retrieve_set_metadata('vitruviana', 'oai_dc')

Directory 'oai_dc' is already available or created successfully
47 identifiers to request in  5 data splits
Start retrieving metadata for e-rara IDs  ['6090286', '6090520', '6091472', '6091777', '6092090', '6092470', '6092752', '6092953', '6093667', '6094045']
New resumption token: 0x801963ea7925db6c2d0297bbb4961a1e-cursor_p_3D10_p_26set_p_3Dvitruviana_p_26metadataPrefix_p_3Doai_dc_p_26batch_size_p_3D11
Start retrieving metadata for e-rara IDs  ['6094442', '6102031', '6102445', '6104752', '6105281', '6105669', '6105930', '6106048', '6106428', '6106922']
New resumption token: 0x9b5859275dc1fc81cc0983e74ddf7d0c-cursor_p_3D20_p_26set_p_3Dvitruviana_p_26metadataPrefix_p_3Doai_dc_p_26batch_size_p_3D11
Start retrieving metadata for e-rara IDs  ['6116002', '6117191', '6125369', '6125674', '6126206', '6126535', '6133272', '6150911', '6151451', '6151889']
New resumption token: 0x9b5859275dc1fc81cc0983e74ddf7d0c-cursor_p_3D30_p_26set_p_3Dvitruviana_p_26metadataPrefix_p_3Doai_dc_p_26batch_size_p_

## 3 Download fulltext files from e-rara website

### 3.0 Prerequisites

In [45]:
# Load the necessary libraries
import requests                                 # request URLs
from bs4 import BeautifulSoup as soup           # webscrape and parse HTML and XML
import lxml                                     # lxml's XML parser is currently the only XML parser supported by bs4;
                                                # call with soup(markup, 'lxml-xml') or soup(markup, 'xml')
import os                                       # navigate and manipulate file directories
import time                                     # work with time stamps
import json                                     # work with JSON data structures
import pandas as pd                             # pandas is the de facto standard Python library to work with dataframes
from IPython.display import IFrame              # embed website views in jupyter notebook
import math                                     # work with mathematical functions
import re                                       # work with regular expressions
print("Successfully imported necessary libraries")

Successfully imported necessary libraries


Downloading e-rara fulltexts can be done from the e-rara website. For e-rara items with an available fulltext file, a link is provided in the *Links* > *Download* section of the item page.

In [46]:
IFrame('https://www.e-rara.ch/content/titleinfo/19457887', width=990, height=300)

### 3.1 Download certain fulltext files by e-rara ID

First, a new directory `fulltexts` is created next to the `metadata` folder.

In [47]:
print(os.getcwd())                                # print current working directory

C:\Users\kwoit\Documents\GitHub\e-rara-access\data\metadata


In [48]:
os.chdir(os.pardir)                               # change to parent directory
os.makedirs('fulltexts', exist_ok=True)           # make new folder 'fulltexts'
os.chdir('fulltexts')                             # change to 'fulltexts' folder

A single fulltext file can be retrieved by a given e-rara ID with the following function `download_fulltext()`.

Note that **for fulltexts a different base URL** has to be used - in combination with the given e-rara ID: `https://www.e-rara.ch/download/fulltext/plain/`.
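The URL construction described here can be sketched in isolation as follows. The base URL and the sample IDs come from this notebook; `fulltext_url` is a hypothetical helper name, and nothing is requested from the server.

```python
FULLTEXT_BASE_URL = "https://www.e-rara.ch/download/fulltext/plain/"

def fulltext_url(e_rara_id):
    """Build the plain-text download URL for an e-rara ID given as int or str."""
    return FULLTEXT_BASE_URL + str(e_rara_id)

urls = [fulltext_url(e_rara_id) for e_rara_id in (1396731, "6156847")]
for url in urls:
    print(url)
```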

In [49]:
def download_fulltext(ID):
    '''
    Downloads the fulltext of a certain record from the e-rara website to a TXT file.
    Builds the fulltext URL from the e-rara ID, reads the text and writes it to a TXT file on local disk.
    Parameter:
    ID = E-rara ID of the wanted record.
    '''
    baseurl_fulltext = "https://www.e-rara.ch/download/fulltext/plain/"
    webadd = baseurl_fulltext + str(ID)
    response = requests.get(webadd) 
    soup_out = soup(response.text, 'html.parser')
    outfile = '{}.txt'.format(ID)
    
    try:
        with open(outfile, 'x', encoding='utf-8') as f:
            f.write(soup_out.get_text())
            print("Fulltext file {}.txt saved".format(ID))
    except FileExistsError:
        print("Fulltext file {}.txt exists already".format(ID))
    except OSError:                                   # any other file system problem while writing
        print("Saving fulltext file {}.txt failed".format(ID))

In [50]:
# Retrieving example fulltexts with e-rara IDs
e_rara_ids = [1396731, 6156847, 6094442, 6125674]

for e_rara_id in e_rara_ids: 
    download_fulltext(e_rara_id)

Fulltext file 1396731.txt saved
Fulltext file 6156847.txt saved
Fulltext file 6094442.txt saved
Fulltext file 6125674.txt saved


The files can then be read from local disk.

In [51]:
with open('6156847.txt', 'r', encoding='utf-8') as f:
    fulltext = f.read()
print(fulltext)

m '0W fmm f Vi}' 'R-a- *-» ?;*£.-

4

> 'JP&i-fzi. P*“*^UtLvi*

M. VITRUVII POLLIONIS ARCHITECTURA LIBRI X.

*

M. VITRUVII POLLIONIS DE ARCHITECTURA LIBRI DECEM AD OPTIMAS EDITIONES COLLATI PRAEMITTITUR NOTITIA LITERARIA STUDIIS SOCIETATIS BIPONTlNAE ACCEDIT ANONYMI SCRIPTORIS VETERIS ARCHITECTURAE COMPENDIU ( M CUM INDICIBUS. ARGENTORATI EX XYPOGRAPHIA SOCIETATIS MDGCCVII. Atrt

nr >’ i- :? y a

M. VITRUVII POLLIONIS VITA A BERNARDINO BALDO URBINATE CONSCRIPTA. o mnes  fere artes natura ita conftitutas effe fcimus 9 ut non fatis commode tra&ari valeant, fi qui in illis verfantur, ingenio pariter & manu non fuerint exercitati. Etenim contemplatio, quam Sicopiav  Graeci appellant, oculus quidam eft; ‘srpdi|if vero, hoc eft, ipfa operatio, manus fibi locum atque officium vendiCat. Quam- obrem is, meo iudicio, qui fola meditatione fretus operationem negligit, non inepte cuipiam componi poffet, qui vifu pollens, mancus effiet & fcaevus: is vero qui ufu tantum praeftat, ei qui oculis captu

In [52]:
# Empty file! Resulting from empty fulltext page
with open('6125674.txt', 'r', encoding='utf-8') as f:
    fulltext = f.read()
fulltext

''

In [53]:
IFrame('https://www.e-rara.ch/download/fulltext/plain/6125674', width=990, height=100)

So, if you **don't know which e-rara items have fulltext files available**, you can get rid of the resulting empty files as follows.

In [54]:
print(os.getcwd())                 # print current working directory

C:\Users\kwoit\Documents\GitHub\e-rara-access\data\fulltexts


In [55]:
# Delete empty files in the current working directory
path = os.getcwd()
count = 0
count_empty = 0
for entry in os.scandir(path):
    if not entry.is_file():                # skip subfolders, only files are checked
        continue
    if entry.stat().st_size == 0:
        print("File {} is empty".format(entry.name))
        os.remove(entry.path)
        count_empty += 1
    else:
        count += 1
print("{} empty fulltext files in {} deleted".format(count_empty, path))
print("{} fulltext files in {}".format(count, path))

File 6094442.txt is empty
File 6125674.txt is empty
2 empty fulltext files in C:\Users\kwoit\Documents\GitHub\e-rara-access\data\fulltexts deleted
2 fulltext files in C:\Users\kwoit\Documents\GitHub\e-rara-access\data\fulltexts
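Instead of deleting empty files afterwards, one could also avoid writing them in the first place. The following sketch shows that idea with a hypothetical helper `save_if_nonempty`, tested in a temporary folder with placeholder texts instead of real downloads. Note that it treats whitespace-only text as empty, which is slightly stricter than the file size check above.

```python
import os
import tempfile

def save_if_nonempty(path, text):
    """Write text to path unless it is empty or whitespace only; return True if written."""
    if not text.strip():
        return False
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return True

# Placeholder texts standing in for downloaded fulltexts
samples = {"with_text": "M. VITRUVII POLLIONIS", "empty": "  \n"}

with tempfile.TemporaryDirectory() as tmp:
    results = {
        name: save_if_nonempty(os.path.join(tmp, name + ".txt"), text)
        for name, text in samples.items()
    }
    saved_files = sorted(os.listdir(tmp))

print(results, saved_files)
```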


### 3.2 Download fulltext files by set/collection

Finally, let's build a function `retrieve_set_fulltexts` to retrieve **all fulltexts of a certain e-rara set**! Note that empty fulltext files are also retrieved and stored temporarily, but cleaned up at the end.

In [56]:
def retrieve_set_fulltexts(Set):
    '''
    Downloads fulltexts of a given set from e-rara website to TXT files in a certain folder structure.
    Therefore it
    * requests e-rara OAI-PMH interface according to a set 
    * creates the folder 'set'
    * retrieves the set's e-rara IDs from OAI interface
    * retrieves fulltexts according to set e-rara IDs from e-rara website
    * writes fulltexts into single <e_rara_id>.txt files in the fulltexts > set folder
    * finally checks all fulltext files if they are empty, and deletes those empty files.
    Parameters:
    Set = The wanted collection/set.
    '''
    start = time.perf_counter()

    # Set parameters to the interface
    base_url = oai
    baseurl_fulltext = "https://www.e-rara.ch/download/fulltext/plain/"
    listsearch_term = {'verb': 'ListIdentifiers', 'metadataPrefix': 'oai_dc', 'set': Set}
    
    # Make a folder <fulltexts> with subfolder named like the set to store files in it
    directory = Set
    parent_dir = os.getcwd()
    path = os.path.join(parent_dir, directory)
    try:
        os.makedirs(path, exist_ok = True)
        print("Directory '%s' is already available or created successfully" %directory)
    except OSError as error:
        print("Directory '%s' cannot be created: %s" % (directory, error))
           
    # Basic functions
    def load_xml(params):
        '''
        Accesses the OAI interface according to given parameters and scrapes its content.
        '''
        response = requests.get(base_url, params=params)
        output_soup = soup(response.content, "lxml")
        return output_soup

    def download_fulltext(ID):
        '''
        Downloads fulltext from e-rara website to a single TXT file.
        Throws a notice if fulltext file already exists and leaves it.
        Parameter:
        ID = E-rara ID of the wanted record.
        '''
        webadd = baseurl_fulltext + str(ID)
        response = requests.get(webadd) 
        soup_out = soup(response.text, 'html.parser')
        outfile = path + '/{}.txt'.format(ID) 
        try:
            with open(outfile, 'x', encoding='utf-8') as f:    # mode 'x' leaves an existing file untouched
                f.write(soup_out.get_text())
                print("Fulltext file {}.txt saved".format(ID))
        except FileExistsError:
            print("Fulltext file {}.txt exists already".format(ID))
        except OSError:
            print("Saving fulltext file {}.txt failed".format(ID))
            
    # Start with the first access to OAI interface
    xml_soup = load_xml(listsearch_term)

    # Calculate how many requests it takes to page through the results list (ten identifiers per page), print notice
    total = int(xml_soup.resumptiontoken['completelistsize'])
    splits = math.ceil(total / 10)       # true division, so that exact multiples of ten need no extra request
    print(total, 'identifiers to request in ', splits, 'data splits')
            
            
    for i in range(splits):
        if i == 0:
            # First access to OAI for e-rara IDs - first page + information about whole length of results list
            xml_soup_new = load_xml(listsearch_term)      
        else:
            # Following accesses to OAI for e-rara IDs
            xml_soup_new = load_xml({'verb': 'ListIdentifiers', 'resumptionToken': resumption_token})

        # Scraping out the e-rara IDs
        e_rara_ids = [] 
        for e_rara_id in [(i.contents[0]) for i in xml_soup_new.find_all('identifier')]:
            match = re.search(r'oai:www\.e-rara\.ch:(\d+)', e_rara_id) # extract the number following 'oai:www.e-rara.ch:'
            if match:
                e_rara_ids.append(match.group(1))       # first parenthesized subgroup of group() = number

        # Download the fulltexts according to retrieved e-rara IDs
        print('Start retrieving fulltetxts for e-rara IDs ', e_rara_ids) 
        for e_rara_id in e_rara_ids:
            download_fulltext(e_rara_id)
        e_rara_ids = []

        # Update the resumption token to retrieve the next page
        try:
            new_token = xml_soup_new.find('resumptiontoken').get_text()
            resumption_token = new_token
            print('New resumption token:', resumption_token)
        except AttributeError:
            print('Reached end of IDs/results list')       # notice when last page is done
        
    # Clean up empty files
    count = 0
    count_empty = 0
    for entry in os.scandir(path):
        if os.path.getsize(entry) == 0:
            print("File {} is empty".format(entry.name))
            os.remove(path + '/' + entry.name)
            count_empty += 1
        else: 
            count += 1
    print("{} empty fulltext files in {} deleted".format(count_empty, path))
    print("{} fulltext files in {}".format(count, path))

    finish = time.perf_counter()
    print("Finished in {} second(s)".format(round(finish - start, 2)))
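Counting the result pages from `completelistsize` is a ceiling division. The short sketch below compares two ways of computing it, floor division plus one and `math.ceil` on a true division, for a list of 47 and a list of 50 identifiers. The page size of ten is assumed from the ten-record segments described earlier; the function names are illustrative.

```python
import math

PAGE_SIZE = 10   # assumed page size (ten identifiers per page)

def pages_floor_plus_one(total):
    """Floor division plus one: one page too many when total is a multiple of PAGE_SIZE."""
    return total // PAGE_SIZE + 1

def pages_ceil(total):
    """Ceiling of the true division: the number of pages actually needed."""
    return math.ceil(total / PAGE_SIZE)

for total in (47, 50):
    print(total, pages_floor_plus_one(total), pages_ceil(total))
```

Both variants agree for 47 identifiers (five pages) but differ for 50, where only five pages exist and a sixth request would carry no valid resumption token.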
    

In [57]:
retrieve_set_fulltexts('vitruviana')

Directory 'vitruviana' is already available or created successfully
47 identifiers to request in  5 data splits
Start retrieving fulltetxts for e-rara IDs  ['6090286', '6090520', '6091472', '6091777', '6092090', '6092470', '6092752', '6092953', '6093667', '6094045']
Fulltext file 6090286.txt saved
Fulltext file 6090520.txt saved
Fulltext file 6091472.txt saved
Fulltext file 6091777.txt saved
Fulltext file 6092090.txt saved
Fulltext file 6092470.txt saved
Fulltext file 6092752.txt saved
Fulltext file 6092953.txt saved
Fulltext file 6093667.txt saved
Fulltext file 6094045.txt saved
New resumption token: 0x801963ea7925db6c2d0297bbb4961a1e-cursor_p_3D10_p_26set_p_3Dvitruviana_p_26metadataPrefix_p_3Doai_dc_p_26batch_size_p_3D11
Start retrieving fulltetxts for e-rara IDs  ['6094442', '6102031', '6102445', '6104752', '6105281', '6105669', '6105930', '6106048', '6106428', '6106922']
Fulltext file 6094442.txt saved
Fulltext file 6102031.txt saved
Fulltext file 6102445.txt saved
Fulltext file 61