# e-periodica: accessing metadata and fulltexts

<h1><span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#0-Introduction" data-toc-modified-id="0-Introduction-0">0 Introduction</a></span><ul class="toc-item"><li><span><a href="#0.0-Scope-and-content" data-toc-modified-id="0.0-Scope-and-content-0.1">0.0 Scope and content</a></span></li><li><span><a href="#0.1-E-periodica" data-toc-modified-id="0.1-E-periodica-0.2">0.1 E-periodica</a></span></li><li><span><a href="#0.2-OAI-PMH" data-toc-modified-id="0.2-OAI-PMH-0.3">0.2 OAI-PMH</a></span></li></ul></li><li><span><a href="#1-Metadata-access-with-Polymatheia" data-toc-modified-id="1-Metadata-access-with-Polymatheia-1">1 Metadata access with Polymatheia</a></span><ul class="toc-item"><li><span><a href="#1.0-Prerequisites" data-toc-modified-id="1.0-Prerequisites-1.1">1.0 Prerequisites</a></span></li><li><span><a href="#1.1-Start-with-the-OAI-interface-via-Polymatheia" data-toc-modified-id="1.1-Start-with-the-OAI-interface-via-Polymatheia-1.2">1.1 Start with the OAI interface via Polymatheia</a></span></li><li><span><a href="#1.2-Retrieve-metadata-records-via-Polymatheia" data-toc-modified-id="1.2-Retrieve-metadata-records-via-Polymatheia-1.3">1.2 Retrieve metadata records via Polymatheia</a></span></li><li><span><a href="#1.3-Save--and-recover-complex-metadata-structures" data-toc-modified-id="1.3-Save--and-recover-complex-metadata-structures-1.4">1.3 Save  and recover complex metadata structures</a></span></li></ul></li><li><span><a href="#2-Direct-metadata-access-via-OAI-PMH" data-toc-modified-id="2-Direct-metadata-access-via-OAI-PMH-2">2 Direct metadata access via OAI-PMH</a></span><ul class="toc-item"><li><span><a href="#2.0-Prerequisites" data-toc-modified-id="2.0-Prerequisites-2.1">2.0 Prerequisites</a></span></li><li><span><a href="#2.1-Start-with-the-native-OAI-interface" data-toc-modified-id="2.1-Start-with-the-native-OAI-interface-2.2">2.1 Start with the native OAI interface</a></span></li><li><span><a href="#2.2--Download-metadata-records" 
data-toc-modified-id="2.2--Download-metadata-records-2.3">2.2  Download metadata records</a></span></li><li><span><a href="#2.3-Download-metadata-by-set" data-toc-modified-id="2.3-Download-metadata-by-set-2.4">2.3 Download metadata by set</a></span></li></ul></li><li><span><a href="#3-Download-fulltext-files-from-e-periodica-website" data-toc-modified-id="3-Download-fulltext-files-from-e-periodica-website-3">3 Download fulltext files from e-periodica website</a></span><ul class="toc-item"><li><span><a href="#3.0-Prerequisites" data-toc-modified-id="3.0-Prerequisites-3.1">3.0 Prerequisites</a></span></li><li><span><a href="#3.1-Download-fulltext-files-by-e-periodica-ID" data-toc-modified-id="3.1-Download-fulltext-files-by-e-periodica-ID-3.2">3.1 Download fulltext files by e-periodica ID</a></span></li><li><span><a href="#3.2-Download-fulltext-files-by-set" data-toc-modified-id="3.2-Download-fulltext-files-by-set-3.3">3.2 Download fulltext files by set</a></span></li></ul></li></ul></div>

## 0 Introduction

### 0.0 Scope and content

This Python [Jupyter notebook](https://jupyter.org/) aims to help you with **accessing metadata and fulltexts of the [e-periodica platform](https://www.e-periodica.ch/)**. It uses the OAI-PMH interface of the e-periodica service for retrieving metadata in different formats, and additionally the e-periodica website for downloading fulltexts.

The notebook consists of three parts:
1. Metadata access with Polymatheia
2. Direct metadata access via OAI-PMH
3. Download fulltext files from e-periodica website.

So, there are two ways to access e-periodica metadata. The **first chapter** introduces the Polymatheia library, which allows very convenient requests to the OAI interface by wrapping otherwise more elaborate functions. Working with Polymatheia is an **easy solution for quick access** without going deep into coding.

The **second (and the third) chapter** shows how to access the OAI interface natively. Hence, more code is needed and **some functions will be defined**. You can use these functions without deeper programming skills - but such skills will help if you want to adapt the functions.

You may start from the beginning and walk through the whole notebook or jump to the section that suits you. Also, it's a good idea to play around with the code in the cells and see what happens. Have fun!

Have any comments, questions and the like? Try kathi.woitas[at]ub.unibe.ch.

### 0.1 E-periodica

[E-periodica](https://www.e-periodica.ch/?lang=en) is the online platform for journals from Switzerland. It holds more than 500 freely accessible journals from the 18th century through to the present, covering subjects from natural sciences through architecture, mathematics, history, geography, art and culture to the environment and social policies. You may consult e-periodica's [Terms of Use](https://www.e-periodica.ch/digbib/about3?lang=en) to check the licences of the e-periodica documents.

### 0.2 OAI-PMH

The **Open Archives Initiative Protocol for Metadata Harvesting** (**OAI-PMH**) is a well-known interface through which libraries, archives etc. deliver their metadata in various formats - library-specific ones like *[MODS](http://www.loc.gov/standards/mods/index.html)* as well as common ones like *[Dublin Core](https://www.dublincore.org/specifications/dublin-core/dces/)*.
Further information on OAI-PMH is available [here](http://www.openarchives.org/OAI/openarchivesprotocol.html).

First of all, a few OAI-PMH related concepts should be introduced:

**repository**:
A repository is a server-side application that exposes metadata via OAI-PMH. It can process the *six OAI-PMH request types* aka *OAI verbs*. So, the e-periodica OAI-PMH facility is a repository in this sense. 

**harvester**: OAI-PMH client applications are called harvesters. When you query the OAI-PMH interface and request records, you are *harvesting*.

**resource**: A resource is the object that the delivered metadata is "about". In the case of the e-periodica OAI-PMH interface, the resources referred to are the publications on the e-periodica platform. Note that the resources themselves always lie outside of OAI-PMH.

**record**: A record is the XML-encoded container for the metadata of a single resource (i.e. publication) item. It consists of a header and a metadata section.

**header**:
The record header contains the unique identifier of the record, a datestamp and optionally the set specification.

**metadata**: The record metadata contains the resource (i.e. publication) metadata in a defined metadata format.

**set**: A structure for grouping records for selective harvesting. Sets often refer to collections of thematic scopes/subjects, to collections of different owners/institutions (in case of aggregated content) or to collections of certain publication types.
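
To make the terms *record*, *header* and *metadata* more tangible, here is a minimal sketch that parses a hand-written sample record with Python's standard library. The XML is an assumption for illustration: it is shortened and typed by hand in the shape of an OAI-PMH record (using values that appear later in this notebook), not a live response from e-periodica.

```python
import xml.etree.ElementTree as ET

# Hand-written, shortened sample in the shape of an OAI-PMH record (illustrative only)
SAMPLE_RECORD = """
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:agora.ch:ant-001:1962:1::213</identifier>
    <datestamp>2013-12-09T21:27:22Z</datestamp>
    <setSpec>ddc:720</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:creator>Ritter, Paul</dc:creator>
      <dc:date>1962-03-01</dc:date>
    </oai_dc:dc>
  </metadata>
</record>
"""

# Prefixes used in the search paths below, mapped to their XML namespaces
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

record = ET.fromstring(SAMPLE_RECORD.strip())

# The header holds the identifier, the datestamp and (optionally) the set
header = record.find("oai:header", NS)
identifier = header.findtext("oai:identifier", namespaces=NS)
set_spec = header.findtext("oai:setSpec", namespaces=NS)

# The metadata section holds the description of the resource itself
metadata = record.find("oai:metadata", NS)
creator = metadata.find(".//dc:creator", NS).text

print(identifier, set_spec, creator)
```

The header tells you *which* record you have and which set it belongs to, while the metadata section describes the publication.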

Now let's look at some example requests of the e-periodica OAI interface with the **six OAI verbs**:

- Identify ([specification](http://www.openarchives.org/OAI/openarchivesprotocol.html#Identify)):
https://www.e-periodica.ch/oai?verb=Identify

- ListSets ([spec](http://www.openarchives.org/OAI/openarchivesprotocol.html#ListSets)):
https://www.e-periodica.ch/oai?verb=ListSets

- ListMetadataFormats ([spec](http://www.openarchives.org/OAI/openarchivesprotocol.html#ListMetadataFormats)):
https://www.e-periodica.ch/oai?verb=ListMetadataFormats

- ListIdentifiers ([spec](http://www.openarchives.org/OAI/openarchivesprotocol.html#ListIdentifiers)):
https://www.e-periodica.ch/oai?verb=ListIdentifiers&metadataPrefix=oai_dc&set=ddc:360

- GetRecord ([spec](http://www.openarchives.org/OAI/openarchivesprotocol.html#GetRecord)):
https://www.e-periodica.ch/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:agora.ch:acd-002:1897:5::227 

- ListRecords ([spec](http://www.openarchives.org/OAI/openarchivesprotocol.html#ListRecords)):
https://www.e-periodica.ch/oai?verb=ListRecords&set=ddc:490&metadataPrefix=oai_dc

These examples with the given *parameters* are fairly easy to read - and building similar request URLs is just as easy.
But how do you download the delivered data and work with it? That's the aim of this notebook. So, here we go!
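
As a first step, request URLs like the ones above can be assembled in Python. The following is a minimal sketch using only the standard library; the function name `build_oai_url` is made up for this illustration and is not part of any library used later.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.e-periodica.ch/oai"

def build_oai_url(verb, **params):
    """Return an OAI-PMH request URL for the given verb and parameters (hypothetical helper)."""
    query = {"verb": verb, **params}
    # safe=":" keeps the colons in values such as 'ddc:360' readable
    return BASE_URL + "?" + urlencode(query, safe=":")

url = build_oai_url("ListIdentifiers", metadataPrefix="oai_dc", set="ddc:360")
print(url)
```

The printed URL is the same as the ListIdentifiers example in the list above. Other verbs work the same way, e.g. `build_oai_url("ListSets")`.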

## 1 Metadata access with Polymatheia

### 1.0 Prerequisites

First, some basic Python libraries have to be imported. Just **click on the arrow icon** on the left side of the code cell - or first click into the cell and then press 'Ctrl' + 'Enter' or 'Shift' + 'Enter'. While the code runs, a star symbol appears next to the cell, and when it's done, a number shows up. Most importantly, the resulting output is displayed beneath the code cell.

In [1]:
import os                              # navigate and manipulate file directories
import pandas as pd                    # pandas is the de facto standard Python library for working with dataframes
from IPython.display import IFrame     # embed website views in Jupyter Notebook
print("Successfully imported necessary libraries")

Successfully imported necessary libraries


**Polymatheia** is a Python library to support working with digital library/archive metadata. It supports accessing metadata of different formats from OAI-PMH and also offers methods to handle the retrieved data. The metadata will be turned into a Python-style ['navigable dictionary'](https://polymatheia.readthedocs.io/en/latest/concepts.html), which allows convenient access to certain metadata fields.
Its aim is not necessarily to cover all ways of working with metadata, but to make it easy to undertake most types of tasks and analysis. See the [documentation](https://polymatheia.readthedocs.io/en/latest/) of the Polymatheia library.

If you are using the Polymatheia package **for the first time**, you will need to **install this library**: just remove the `#` from the second line of code, and then execute the cell like the one before.

In [2]:
# uncomment the !pip command to install polymatheia
#!pip install polymatheia
from polymatheia.data.reader import OAISetReader               # list OAI sets
from polymatheia.data.reader import OAIMetadataFormatReader    # list available metadata formats
from polymatheia.data.reader import OAIRecordReader            # read one metadata record from OAI
from polymatheia.data.writer import PandasDFWriter             # easy transformation of flat data into a dataframe
print("Successfully imported necessary libraries")

Successfully imported necessary libraries


https://www.e-periodica.ch/oai/ will be the **base URL** for all OAI requests. To make life easier, we put it into the variable `oai`.

In [3]:
oai = 'https://www.e-periodica.ch/oai/'

### 1.1 Start with the OAI interface via Polymatheia

First, it's good to know **which collections or *sets* are available**. To see the sets in the native OAI interface, let's take a look at https://www.e-periodica.ch/oai?verb=ListSets with the `IFrame` function. Every set has a `setName` and a `setSpec`, a short code for the set that is used as a parameter in OAI requests.

In [4]:
IFrame('https://www.e-periodica.ch/oai?verb=ListSets', width=970, height=300)

That's nice, but how to retrieve these contents as data? Polymatheia's 'OAISetReader' does this conveniently. Here's how it works.

In [5]:
reader = OAISetReader(oai)             # instantiate ('make') an OAISetReader named reader
# 'Instantiation' is a standard procedure with Python, so it's a good idea to get familiar with it.

print(type(reader))                    # print the object type of 'reader' for information

<class 'polymatheia.data.reader.OAISetReader'>


In [6]:
for x in reader:                       # for-loop which iterates through the reader-content and prints each entry
    print(x)                           # note that 'x' is an arbitrary term

{
  "setSpec": "ddc:000",
  "setName": "Information"
}
{
  "setSpec": "ddc:020",
  "setName": "Library & information sciences"
}
{
  "setSpec": "ddc:060",
  "setName": "Associations, organizations & museums"
}
{
  "setSpec": "ddc:070",
  "setName": "News media, journalism & publishing"
}
{
  "setSpec": "ddc:100",
  "setName": "Philosophy"
}
{
  "setSpec": "ddc:200",
  "setName": "Religion"
}
{
  "setSpec": "ddc:230",
  "setName": "Christianity & Christian theology"
}
{
  "setSpec": "ddc:290",
  "setName": "Other religions"
}
{
  "setSpec": "ddc:300",
  "setName": "Social sciences"
}
{
  "setSpec": "ddc:310",
  "setName": "Statistics"
}
{
  "setSpec": "ddc:320",
  "setName": "Political science"
}
{
  "setSpec": "ddc:330",
  "setName": "Economics"
}
{
  "setSpec": "ddc:340",
  "setName": "Law"
}
{
  "setSpec": "ddc:350",
  "setName": "Public administration & military science"
}
{
  "setSpec": "ddc:360",
  "setName": "Social problems & social services"
}
{
  "setSpec": "ddc:370",
  "setNa

We might put this together and then turn the retrieved data into a *Pandas dataframe* with the 'PandasDFWriter' class. A **dataframe** is a table-like data object, which gives a neat overview and is moreover a useful format for further investigation. *Pandas* is the de facto standard Python library for dataframe handling.

In [7]:
reader = OAISetReader(oai)
setspec = []                          # make an empty list named 'setspec'

for x in reader:                 
    setspec.append(x)                 # .append adds all the single reader-contents to the list 'setspec'

print(setspec[0:3])                   # print the first 3 items of the list (of key-value pairs), just to see
print('---')                          # print a separating line

df = PandasDFWriter().write(setspec)  # write list 'setspec' into a Pandas dataframe named 'df'
df                                    # shows 'df' 

[{'setSpec': 'ddc:000', 'setName': 'Information'}, {'setSpec': 'ddc:020', 'setName': 'Library & information sciences'}, {'setSpec': 'ddc:060', 'setName': 'Associations, organizations & museums'}]
---


|   | setSpec | setName |
|---|---------|---------|
| 0 | ddc:000 | Information |
| 1 | ddc:020 | Library & information sciences |
| 2 | ddc:060 | Associations, organizations & museums |
| 3 | ddc:070 | News media, journalism & publishing |
| 4 | ddc:100 | Philosophy |
| 5 | ddc:200 | Religion |
| 6 | ddc:230 | Christianity & Christian theology |
| 7 | ddc:290 | Other religions |
| 8 | ddc:300 | Social sciences |
| 9 | ddc:310 | Statistics |


If there are many sets, you might **search for a certain collection by string**. This also helps to **find the set short code** `setSpec` that the OAI interface uses for further investigation of a certain set.

In [8]:
# Example: Searching for strings 'art' or 'Art' in the 'setName' column
for i in df.index:                                             # for-loop which iterates through 'df' contents
    if 'art' in df.setName[i] or 'Art' in df.setName[i]:       # if-condition which looks for 'art' or 'Art'
                                                               # in the 'setName' column
        print(df.loc[i])                                       # print 'df' row, if if-condition is True

setSpec                     ddc:550
setName    Earth sciences & geology
Name: 28, dtype: object
setSpec               ddc:700
setName    Arts, Architecture
Name: 40, dtype: object
setSpec                      ddc:740
setName    Drawing & decorative arts
Name: 44, dtype: object
setSpec                       ddc:770
setName    Photography & computer art
Name: 45, dtype: object
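
Note that the plain substring test above also matches "Earth sciences & geology", because "Earth" contains the letters "art". If that is not wanted, a regular expression with a word boundary is one possible alternative. The sketch below works on a small hand-copied sample in the shape of the `setspec` list built earlier; the variable names `sample_sets` and `pattern` are made up for this example.

```python
import re

# A few entries in the shape of the 'setspec' list (copied from the output above)
sample_sets = [
    {"setSpec": "ddc:550", "setName": "Earth sciences & geology"},
    {"setSpec": "ddc:700", "setName": "Arts, Architecture"},
    {"setSpec": "ddc:740", "setName": "Drawing & decorative arts"},
    {"setSpec": "ddc:770", "setName": "Photography & computer art"},
]

# \b marks a word boundary, so 'art' has to start a word;
# re.IGNORECASE covers both 'Art' and 'art'
pattern = re.compile(r"\bart", re.IGNORECASE)

matches = [s["setSpec"] for s in sample_sets if pattern.search(s["setName"])]
print(matches)
```

With this pattern the "Earth sciences" set is skipped, while the three art-related sets are still found.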


It's also very useful to know in which **formats the metadata records** are available. The native interface provides this via the URL https://www.e-periodica.ch/oai?verb=ListMetadataFormats. Here, we use the 'OAIMetadataFormatReader' from Polymatheia.

As you can see, you can directly select information like `metadataPrefix` and `metadataNamespace` from the retrieved data by **using the dot-notation**. Dot-notation simply appends the desired sub-element after a dot.

In [9]:
reader = OAIMetadataFormatReader(oai)
for formats in reader:    
    print(formats) 
    print(formats.metadataPrefix)                   # dot-notation: chooses sub-element 'metadataPrefix'

{
  "schema": "http://www.openarchives.org/OAI/2.0/oai_dc.xsd",
  "metadataPrefix": "oai_dc",
  "metadataNamespace": "http://www.openarchives.org/OAI/2.0/oai_dc/"
}
oai_dc


In [10]:
reader = OAIMetadataFormatReader(oai)
[formats.metadataNamespace for formats in reader]   # shorter notation for the for-loops above, which outputs a list

['http://www.openarchives.org/OAI/2.0/oai_dc/']
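
If you wonder how dot-notation on dictionary-like data can work at all, the following sketch shows the general idea in plain Python. The class `DotDict` is a hypothetical helper written for this illustration; it is not Polymatheia's implementation and much simpler than a real navigable dictionary.

```python
class DotDict(dict):
    """Hypothetical helper: a dict whose keys can also be read as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so dict methods
        # such as .keys() or .get() keep working as usual
        try:
            value = self[name]
        except KeyError:
            raise AttributeError(name) from None
        # Wrap nested dicts so that the dot-notation can be chained
        return DotDict(value) if isinstance(value, dict) else value


fmt = DotDict({
    "schema": "http://www.openarchives.org/OAI/2.0/oai_dc.xsd",
    "metadataPrefix": "oai_dc",
    "metadataNamespace": "http://www.openarchives.org/OAI/2.0/oai_dc/",
})
print(fmt.metadataPrefix)

record = DotDict({"header": {"identifier": {"_text": "oai:agora.ch:ant-001:1962:1::213"}}})
print(record.header.identifier._text)
```

A limitation of this simple approach is that keys which collide with dictionary method names (like `items`) or contain special characters can only be reached with square brackets.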

### 1.2 Retrieve metadata records via Polymatheia

Retrieving the available **metadata as a bunch** is simple with the 'OAIRecordReader' class. Just specify the following parameters when creating the 'OAIRecordReader':

- `metadata_prefix`: mandatory
- `set_spec` (the short code of the set you want to retrieve): optional, but without it the records of *all* sets are retrieved - and that's a lot!
- `max_records` (the maximum number of records): optional, but without it *all* available records are retrieved - again, a lot!

To compare this result with the native OAI interface you might check the top item of 
https://www.e-periodica.ch/oai?verb=ListRecords&set=ddc:720&metadataPrefix=oai_dc.


In [11]:
reader = OAIRecordReader(oai, metadata_prefix='oai_dc', set_spec='ddc:720', max_records=1)
[record for record in reader]

[{'header': {'identifier': {'_text': 'oai:agora.ch:ant-001:1962:1::213'},
   'datestamp': {'_text': '2013-12-09T21:27:22Z'},
   'setSpec': {'_text': 'ddc:720'}},
  'metadata': {'{http://www.openarchives.org/OAI/2.0/oai_dc/}dc': {'_attrib': {'xsi_schemaLocation': 'http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd'},
    'dc_title': {'_text': "Man and motor don't mix : segregate them efficiently! = Vermischt Mensch und Motor nicht : trennt sie wirkungsvoll! = Ne mélangez pas l'homme et le moteur : séparez-les efficacement!"},
    'dc_creator': {'_text': 'Ritter, Paul'},
    'dc_subject': {},
    'dc_description': {},
    'dc_publisher': {'_text': 'Graf + Neuhaus'},
    'dc_contributor': {},
    'dc_date': {'_text': '1962-03-01'},
    'dc_type': [{'_text': 'Text'}, {'_text': 'Journal Article'}],
    'dc_source': [{'_text': 'Anthos : Zeitschrift für Landschaftsarchitektur = Une revue pour le paysage'},
     {'_text': '519488-x'},
     {'_text': '000

To access a certain metadata content, you can **follow the *navigable dictionary* path downwards** with dot-notation, as in the following example. The `identifier` element in the record's `header` section holds the record identifier, and the `setSpec` element holds the set short code that was used for selection.

In [12]:
reader = OAIRecordReader(oai, set_spec='ddc:720', metadata_prefix='oai_dc', max_records=1)
for record in reader:
    print(record.header.identifier._text)        # compare to the first lines of the output above
    print(record.header.setSpec._text)

oai:agora.ch:ant-001:1962:1::213
ddc:720


To retrieve contents from the `metadata` section, you have to step through its qualifying key `'{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'`. Because this key contains characters that are not allowed in dot-notation, it is given in square brackets. Some background information: the part in curly braces is the `metadataNamespace` we saw above when retrieving the available metadata formats.

In [13]:
reader = OAIRecordReader(oai, set_spec='ddc:720', metadata_prefix='oai_dc', max_records=1)
for record in reader:
    print(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_title._text)
    print(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_creator._text)
    print(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_publisher._text)
    print(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_date._text)

Man and motor don't mix : segregate them efficiently! = Vermischt Mensch und Motor nicht : trennt sie wirkungsvoll! = Ne mélangez pas l'homme et le moteur : séparez-les efficacement!
Ritter, Paul
Graf + Neuhaus
1962-03-01
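
The shape of the qualifying key, `{namespace}` followed by the element name, is the same notation that Python's standard XML parser uses for namespaced element names. The sketch below demonstrates this with `xml.etree.ElementTree` on a one-line, hand-written XML snippet; whether Polymatheia builds its keys in exactly this way internally is an assumption.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"   # the metadataNamespace from above

# Hand-written snippet: an empty <oai_dc:dc> element bound to that namespace
snippet = '<oai_dc:dc xmlns:oai_dc="' + OAI_DC_NS + '"/>'
element = ET.fromstring(snippet)
print(element.tag)

# The same key can be assembled from the namespace and the element name
key = "{" + OAI_DC_NS + "}dc"
print(key == element.tag)
```

So if you harvest another metadata format, it is the namespace of that format that you would expect inside the curly braces.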


Metadata content is not always a simple flat value like the identifier above. **Some fields in structured metadata formats are lists**, as they hold multiple values. A good example is the `metadata` field `dc_type`, which holds the different types a document falls into.

In [14]:
reader = OAIRecordReader(oai, set_spec='ddc:720', metadata_prefix='oai_dc', max_records=1)
for record in reader:
    print(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_type)

[{'_text': 'Text'}, {'_text': 'Journal Article'}]


The surrounding square brackets `[ ]` indicate a list (here a list of key-value pairs). To access the individual list items, you can use *subsetting*, which selects an item by its position in the list - counting from 0.

In [15]:
reader = OAIRecordReader(oai, set_spec='ddc:720', metadata_prefix='oai_dc', max_records=1)
for record in reader:
    print(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_type[0]._text)
    print(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_type[1]._text)

Text
Journal Article


In [16]:
# Another example with subsetting
reader = OAIRecordReader(oai, set_spec='ddc:720', metadata_prefix='oai_dc', max_records=1)
for record in reader:  
    print(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_identifier[0]._text)
    print(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_identifier[1]._text)
    print(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_identifier[2]._text)

https://www.e-periodica.ch/digbib/view?pid=ant-001:1962:1::213
https://www.e-periodica.ch/cntmng?type=pdf&pid=ant-001:1962:1::213
doi:10.5169/seals-131342


The records retrieved from e-periodica describe **journal articles**. Therefore, the Dublin Core element `dc_source` is highly significant: it carries the **metadata of the periodical** in which the article was published. Eight `dc_source` elements are delivered, and their order is meaningful:

1. Title of the periodical
2. ZDB ID - ID of the Zeitschriftendatenbank (see [example](https://zdb-katalog.de/title.xhtml?idn=011220082&view=full))
3. ISSN
4. Volume
5. Year
6. Issue
7. (empty in the examples shown in this notebook)
8. Start page.

In [17]:
reader = OAIRecordReader(oai, set_spec='ddc:720', metadata_prefix='oai_dc', max_records=1)
for record in reader:  
    print(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_source)

[{'_text': 'Anthos : Zeitschrift für Landschaftsarchitektur = Une revue pour le paysage'}, {'_text': '519488-x'}, {'_text': '0003-5424'}, {'_text': '1'}, {'_text': '1962'}, {'_text': '1'}, {}, {'_text': '7'}]
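
Since the meaning of each `dc_source` value depends only on its position, it can help to pair the values with descriptive names. Here is a minimal sketch in plain Python. The field names and the function `label_source` are made up for this example, and `field_7` is a placeholder because the seventh position is empty in the examples of this notebook.

```python
# Field names follow the order described above; 'field_7' is only a placeholder
SOURCE_FIELDS = ["periodical", "zdb_id", "issn", "volume",
                 "year", "issue", "field_7", "start_page"]

def label_source(values):
    """Pair the positional dc_source values with descriptive names (hypothetical helper)."""
    if len(values) != len(SOURCE_FIELDS):
        raise ValueError("expected " + str(len(SOURCE_FIELDS)) + " dc_source values")
    return dict(zip(SOURCE_FIELDS, values))

# The plain text values of the dc_source elements shown above
source = ["Anthos : Zeitschrift für Landschaftsarchitektur = Une revue pour le paysage",
          "519488-x", "0003-5424", "1", "1962", "1", None, "7"]

labelled = label_source(source)
print(labelled["issn"], labelled["year"], labelled["start_page"])
```

The length check is a deliberate safeguard: if a record ever delivers fewer or more than eight values, the positions can no longer be trusted.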


Because drilling down the *navigable dictionary* path can lead to long and not very readable commands, there is a more concise way: the `get` method of the records. With it, there is **no longer any issue with single values versus lists and qualifying strings**. Just pass the path elements as a list to `get`!

Note that if an element occurs more than once (like `dc_source`), the result is a list in square brackets.

In [18]:
reader = OAIRecordReader(oai, metadata_prefix='oai_dc', set_spec='ddc:720', max_records=1)
for record in reader:
    print(record.get(['metadata', '{http://www.openarchives.org/OAI/2.0/oai_dc/}dc', 'dc_creator', '_text']))
    print('---')
    print(record.get(['metadata', '{http://www.openarchives.org/OAI/2.0/oai_dc/}dc', 'dc_source', '_text']))

Ritter, Paul
---
['Anthos : Zeitschrift für Landschaftsarchitektur = Une revue pour le paysage', '519488-x', '0003-5424', '1', '1962', '1', None, '7']


So it's really easy to access whatever metadata content you like.
This also works with the shorter form of the for-loop. But mind that it delivers a nested - or 'doubled' - list if an element occurs several times, like `dc_source` here.


In [19]:
reader = OAIRecordReader(oai, set_spec='ddc:720', metadata_prefix='oai_dc', max_records=1)

[record.get(['metadata', '{http://www.openarchives.org/OAI/2.0/oai_dc/}dc', 'dc_source', '_text']) \
            for record in reader]                   # '\' indicates that command proceeds on the next line

[['Anthos : Zeitschrift für Landschaftsarchitektur = Une revue pour le paysage',
  '519488-x',
  '0003-5424',
  '1',
  '1962',
  '1',
  None,
  '7']]
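
Because `get` returns a single value where an element occurs once and a list where it occurs several times, code that processes the results has to cope with both shapes. One possible way is to normalise everything to a list first. The helper `as_list` below is a hypothetical name, written in plain Python for this illustration.

```python
def as_list(value):
    """Wrap a value in a list unless it already is one; None becomes an empty list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]

# Values of the kind returned by 'get' in the cells above
single = as_list("Ritter, Paul")
several = as_list(["Text", "Journal Article"])
missing = as_list(None)

print(single, several, missing)
```

After this step every result can be handled with the same loop, e.g. `for creator in as_list(...)`.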

Now let's **create a small dataframe with the creator, title and source** elements.

There is a convenient way to do this, relying on standard Python and Pandas procedures. First retrieve the single elements - as done before - and write them into separate lists (`sources`, `titles`, `creators`). Then bind the lists into a dictionary (a built-in Python data type), and finally turn this dictionary into a dataframe. Done!

In [20]:
reader = OAIRecordReader(oai, set_spec='ddc:290', metadata_prefix='oai_dc', max_records=10)
records = list(reader)                # fetch the records once and reuse them for all three lists

# Make lists from Dublin Core elements
sources = [record.get(['metadata', '{http://www.openarchives.org/OAI/2.0/oai_dc/}dc', 'dc_source', '_text']) \
            for record in records]                 
titles = [record.get(['metadata', '{http://www.openarchives.org/OAI/2.0/oai_dc/}dc', 'dc_title', '_text']) \
            for record in records]
creators = [record.get(['metadata', '{http://www.openarchives.org/OAI/2.0/oai_dc/}dc', 'dc_creator', '_text']) \
            for record in records]
   
# Create a dictionary from the lists and turn the dictionary into a dataframe
dic = {'dc_creator': creators, 'dc_title': titles, 'dc_source': sources} 
df = pd.DataFrame(dic)
df.style

|   | dc_creator | dc_title | dc_source |
|---|------------|----------|-----------|
| 0 | ['Astié, J.F.', 'Dorner, D.J.A.'] | Histoire de la théologie protestante | ["Théologie et philosophie : compte-rendu des principales publications scientifiques à l'étranger", '205427-9', '0259-7152', '1', '1868', None, None, '1'] |
| 1 | ['H.F.A.', 'Fichte, I.-H.'] | Le théisme universel proposé comme but à la théologie spéculative | ["Théologie et philosophie : compte-rendu des principales publications scientifiques à l'étranger", '205427-9', '0259-7152', '1', '1868', None, None, '51'] |
| 2 | ['Buisson, Ferdinand', 'Ritter, Henri'] | Paradoxes philosophiques | ["Théologie et philosophie : compte-rendu des principales publications scientifiques à l'étranger", '205427-9', '0259-7152', '1', '1868', None, None, '53'] |
| 3 | Choisy, Louis | Ecce Homo : coup d'œil sur la vie et l'œuvre de Jésus-Christ | ["Théologie et philosophie : compte-rendu des principales publications scientifiques à l'étranger", '205427-9', '0259-7152', '1', '1868', None, None, '69'] |
| 4 | [s.n.] | François Bacon de Vérulam | ["Théologie et philosophie : compte-rendu des principales publications scientifiques à l'étranger", '205427-9', '0259-7152', '1', '1868', None, None, '113'] |
| 5 | ['Michelet, C.-L.', 'Amiel, H.-F.'] | L'hégélianisme en 1867 et le mouvement philosophique de l'Allemagne depuis trente-cinq ans | ["Théologie et philosophie : compte-rendu des principales publications scientifiques à l'étranger", '205427-9', '0259-7152', '1', '1868', None, None, '130'] |
| 6 | ['Cocorda, Oscar', 'Mazzarella, B.'] | Histoire de la critique | ["Théologie et philosophie : compte-rendu des principales publications scientifiques à l'étranger", '205427-9', '0259-7152', '1', '1868', None, None, '177'] |
| 7 | ['Pasquet, P.', 'Weissaecker, C.'] | De la rédemption | ["Théologie et philosophie : compte-rendu des principales publications scientifiques à l'étranger", '205427-9', '0259-7152', '1', '1868', None, None, '202'] |
| 8 | ['Buisson, F.', 'Ritter, Henri'] | Paradoxes philosophiques [suite] | ["Théologie et philosophie : compte-rendu des principales publications scientifiques à l'étranger", '205427-9', '0259-7152', '1', '1868', None, None, '217'] |
| 9 | ['Claparède, T.', 'Polenz, G. de'] | L'âge héroïque du Calvinisme français | ["Théologie et philosophie : compte-rendu des principales publications scientifiques à l'étranger", '205427-9', '0259-7152', '1', '1868', None, None, '239'] |


With one small further step we can extract specific contents from the various `dc_source` elements, like the title of the journal and the publication year. The code needs only slight adjustment.

In [21]:
reader = OAIRecordReader(oai, set_spec='ddc:350', metadata_prefix='oai_dc', max_records=10)
records = list(reader)             # fetch the records once and reuse them below

# Make lists from Dublin Core elements                 
titles = [record.get(['metadata', '{http://www.openarchives.org/OAI/2.0/oai_dc/}dc', 'dc_title', '_text']) \
            for record in records]
creators = [record.get(['metadata', '{http://www.openarchives.org/OAI/2.0/oai_dc/}dc', 'dc_creator', '_text']) \
            for record in records]

# Create empty lists 'periodicals' and 'years' to fill
periodicals = []                   
years = []
# Fill these lists with the respective dc_source element
for record in records:  
    periodicals.append(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_source[0]._text)
    years.append(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_source[4]._text)

# Create a dictionary from the lists and turn the dictionary into a dataframe
dic = {'dc_creator': creators, 'dc_title': titles, 'periodical': periodicals, 'year': years} 
df = pd.DataFrame(dic)
df.style

|   | dc_creator | dc_title | periodical | year |
|---|------------|----------|------------|------|
| 0 | Straub, Emil | Verpflegungsdienst im Gebirge | Der Fourier : offizielles Organ des Schweizerischen Fourier-Verbandes und des Verbandes Schweizerischer Fouriergehilfen | 1928 |
| 1 | Gubler, Emil | Die Feldbäckerei und das Brot | Der Fourier : offizielles Organ des Schweizerischen Fourier-Verbandes und des Verbandes Schweizerischer Fouriergehilfen | 1928 |
| 2 | Jeangros, X. | Wiedereinrücken! | Der Fourier : offizielles Organ des Schweizerischen Fourier-Verbandes und des Verbandes Schweizerischer Fouriergehilfen | 1928 |
| 3 | [s.n.] | Die Haftbarkeit des Truppenrechnungsführers | Der Fourier : offizielles Organ des Schweizerischen Fourier-Verbandes und des Verbandes Schweizerischer Fouriergehilfen | 1928 |
| 4 | Gubler, Emil | Die Feldbäckerei und das Brot [Fortsetzung und Schluss] | Der Fourier : offizielles Organ des Schweizerischen Fourier-Verbandes und des Verbandes Schweizerischer Fouriergehilfen | 1928 |
| 5 | Riess, Max | Bericht über die verpflegungstaktische Uebung im Gebiete Ringlikon, Uto-Kulm, Döltschi | Der Fourier : offizielles Organ des Schweizerischen Fourier-Verbandes und des Verbandes Schweizerischer Fouriergehilfen | 1928 |
| 6 | [s.n.] | Die Postulate des Schweizerischen Fourierverbandes | Der Fourier : offizielles Organ des Schweizerischen Fourier-Verbandes und des Verbandes Schweizerischer Fouriergehilfen | 1928 |
| 7 | Windlinger, Hermann | Skikurs auf Oberalp : Bericht | Der Fourier : offizielles Organ des Schweizerischen Fourier-Verbandes und des Verbandes Schweizerischer Fouriergehilfen | 1928 |
| 8 | Huber, H. | Die Brieftaube im Dienste unserer Armee | Der Fourier : offizielles Organ des Schweizerischen Fourier-Verbandes und des Verbandes Schweizerischer Fouriergehilfen | 1928 |
| 9 | Bieler, E. | Die Stellung des Fouriers in der Einheit und seine Aufgaben | Der Fourier : offizielles Organ des Schweizerischen Fourier-Verbandes und des Verbandes Schweizerischer Fouriergehilfen | 1928 |


### 1.3 Save  and recover complex metadata structures

Before downloading any data, let's create a folder `data` in our working directory to store it.

In [22]:
print(os.getcwd())                                # print current working directory

C:\Users\kwoit\Documents\GitHub\ds-pytools\web-tools\e-periodica-access


If you need to change your directory, you can easily do this with `os.chdir`. While `os.chdir(...)` with a folder name changes the working directory to that subdirectory, `os.chdir(os.pardir)` changes it to the parent directory. Just uncomment (and, if needed, repeat) the commands you need.

In [23]:
#os.chdir(os.pardir)                              # change to parent directory
#os.chdir(...)                                    # change to subdirectory
os.makedirs('data', exist_ok=True)                # make new folder 'data' - if there isn't already one
os.chdir('data')                                  # change to 'data' folder

To **download a whole bunch of metadata items** in nested formats like *MODS*, the `JSONWriter` from Polymatheia is very helpful.
It creates a folder structure with JSON files that reproduces the structured metadata. With `JSONReader`, the metadata set can easily be recovered.

In [24]:
from polymatheia.data.writer import JSONWriter     # also available: CSVReader (for flat data), XMLReader and Writer
from polymatheia.data.reader import JSONReader

`JSONWriter` takes two parameters:
- The first is the name of the directory in which the data should be stored.
- The second is the dot-notated path to the value that identifies each record, here `header.identifier._text`.
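
To make the dot-notated path more concrete, here is a minimal sketch of how such a path can be resolved against a nested record. The helper `resolve_path` and the sample record are hypothetical illustrations, not part of Polymatheia, which ships its own implementation.

```python
def resolve_path(record, dotted_path):
    """Follow a dot-notated path through nested dictionaries."""
    value = record
    for key in dotted_path.split('.'):
        value = value[key]          # raises KeyError if a step is missing
    return value

# Hand-written sample shaped like a Polymatheia record header
sample = {'header': {'identifier': {'_text': 'oai:agora.ch:arc-001:1998:0::488'}}}
print(resolve_path(sample, 'header.identifier._text'))
```

Each segment of the path selects one nesting level, so `header.identifier._text` ends at the plain identifier string.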

For more clarity, these are the contents of `header.identifier` for the first six records in the *DDC 690* set we will refer to:

In [25]:
reader = OAIRecordReader(oai, set_spec='ddc:690', metadata_prefix='oai_dc', max_records=6)
for record in reader:
    print(record.header.identifier._text)

oai:agora.ch:arc-001:1998:0::488
oai:agora.ch:arc-001:1998:0::489
oai:agora.ch:arc-001:1998:0::490
oai:agora.ch:arc-001:1998:0::491
oai:agora.ch:arc-001:1998:0::492
oai:agora.ch:arc-001:1998:0::493


In [26]:
# Download and save the first six records from Dublin Core format
# 'poly_metadata' = directory to store into
reader = OAIRecordReader(oai, set_spec='ddc:690', metadata_prefix='oai_dc', max_records=6)
writer = JSONWriter('poly_metadata', 'header.identifier._text')    
writer.write(reader)

In [27]:
# Recover the six records from local disk
reader = JSONReader('poly_metadata')
[record for record in reader]

[{'header': {'identifier': {'_text': 'oai:agora.ch:arc-001:1998:0::491'},
   'datestamp': {'_text': '2019-06-19T06:31:19Z'},
   'setSpec': {'_text': 'ddc:690'}},
  'metadata': {'{http://www.openarchives.org/OAI/2.0/oai_dc/}dc': {'_attrib': {'xsi_schemaLocation': 'http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd'},
    'dc_title': {'_text': "Una vocazione per l'architetto"},
    'dc_creator': {'_text': 'Canella, Guido'},
    'dc_subject': {},
    'dc_description': {},
    'dc_publisher': {'_text': 'Edizioni Casagrande SA'},
    'dc_contributor': {},
    'dc_date': {'_text': '1998-03-01'},
    'dc_type': [{'_text': 'Text'}, {'_text': 'Journal Article'}],
    'dc_source': [{'_text': 'Archi : rivista svizzera di architettura, ingegneria e urbanistica = Swiss review of architecture, engineering and urban planning'},
     {},
     {'_text': '1422-5417'},
     {'_text': '-'},
     {'_text': '1998'},
     {'_text': '1'},
     {},
     {'_text': '22'}],

The stored data **can be used in just the same way** as the directly accessed data, for instance to read the `dc_title` element. Note that the six records are no longer returned in their original order.

In [28]:
reader = JSONReader('poly_metadata')
for record in reader:
    print(record.header.identifier._text)
    print(record.get(['metadata', '{http://www.openarchives.org/OAI/2.0/oai_dc/}dc', 'dc_title', '_text']))
    print('---')

oai:agora.ch:arc-001:1998:0::491
Una vocazione per l'architetto
---
oai:agora.ch:arc-001:1998:0::490
La necessità della ricerca
---
oai:agora.ch:arc-001:1998:0::489
Il mestiere com'era : a proposito del mio lavoro a Berlino
---
oai:agora.ch:arc-001:1998:0::488
Sul mestiere dell'architetto
---
oai:agora.ch:arc-001:1998:0::492
Un mestiere negato
---
oai:agora.ch:arc-001:1998:0::493
Il mestiere quotidiano
---


Of course, certain metadata fields can also be read **via basic dot-notation**. This takes a bit more code, though, because a field holds either a single value or a list of values.

In [29]:
reader = JSONReader('poly_metadata')
dc_key = '{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'      # namespaced key of the Dublin Core element
for record in reader:
    print(record.header.identifier._text)
    creators = record.metadata[dc_key].dc_creator
    if not isinstance(creators, list):                          # a single creator is not wrapped in a list
        creators = [creators]
    for creator in creators:
        print(creator._text)
    print('---')

oai:agora.ch:arc-001:1998:0::491
Canella, Guido
---
oai:agora.ch:arc-001:1998:0::490
Koulermos, Panos
---
oai:agora.ch:arc-001:1998:0::489
Grassi, Giorgio
---
oai:agora.ch:arc-001:1998:0::488
Moneo, Rafael
---
oai:agora.ch:arc-001:1998:0::492
Monestiroli, Antonio
---
oai:agora.ch:arc-001:1998:0::493
Giraudi, Sandra
Wettstein, Felix
Caruso, Alberto
---
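
The list vs. single value issue can also be isolated in a tiny helper. The following is only a sketch on plain dictionaries; `as_list` is a hypothetical name, and the samples are hand-written to mirror the two shapes in which `dc_creator` appears.

```python
def as_list(value):
    """Wrap a single value in a list; return lists unchanged."""
    return value if isinstance(value, list) else [value]

# Hand-written samples mirroring the two shapes of `dc_creator`
single = {'_text': 'Canella, Guido'}
several = [{'_text': 'Giraudi, Sandra'}, {'_text': 'Wettstein, Felix'}]

for creators in (single, several):
    print([c['_text'] for c in as_list(creators)])
```

With such a helper, the code that follows can always loop over a list and needs no further case distinction.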


## 2 Direct metadata access via OAI-PMH

Unfortunately, the Polymatheia library doesn't offer methods for *all* OAI verbs. For instance, there is no `ListIdentifiers` method (which delivers only the identifiers of a given set) and no `GetRecord` for retrieving the metadata of a certain item using its e-periodica ID.

That's where the common libraries **requests** and **BeautifulSoup** come into play, at the price of some more manual coding.


### 2.0 Prerequisites

In [30]:
# Load the necessary libraries
import requests                                 # request URLs
from bs4 import BeautifulSoup as soup           # webscrape and parse HTML and XML
import lxml                                     # XML parser supported by bs4
                                                # call with soup(markup, 'lxml-xml' OR 'xml')
import os                                       # navigate and manipulate file directories
import time                                     # work with time stamps
import pandas as pd                             # pandas is the de facto standard Python library to work with dataframes
from IPython.display import IFrame              # embed website views in jupyter notebook
import math                                     # work with mathematical functions
import re                                       # work with regular expressions
print("Successfully imported necessary libraries")

Successfully imported necessary libraries


https://www.e-periodica.ch/oai/ will be the **base URL** for all OAI requests. To make life easier we put it into the variable `oai`.

In [31]:
oai = 'https://www.e-periodica.ch/oai/'

### 2.1 Start with the native OAI interface

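The very **core of all operations on the OAI interface** will be a small function called `load_xml()`. It simply requests the base URL with the given parameters and parses the response with BeautifulSoup. Therefore, it can be used with all OAI verbs and their respective parameters. Note that the `lxml` parser used here is an HTML parser: it wraps the response in `<html><body>` and lowercases all tag names, so an element such as `resumptionToken` has to be addressed as `resumptiontoken` later on.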

In [32]:
def load_xml(params):
    '''
    Accesses the OAI interface according to given parameters and scrapes its content.
    Parameters:
    All available native OAI verbs and parameter/value pairs.
    '''
    base_url = oai
    response = requests.get(base_url, params=params)
    output_soup = soup(response.content, "lxml")
    return output_soup

You may use it to read out the basic `Identify` response of the OAI interface.

Note that the parameters passed to `load_xml()` are the same as in the respective URL `https://www.e-periodica.ch/oai?verb=Identify`: `verb` is the parameter key and `Identify` the parameter value. The function therefore expects **key-value pairs**, which are passed as a Python dictionary, written in curly braces.

In [33]:
xml_soup = load_xml({'verb': 'Identify'})
xml_soup

<?xml version="1.0" encoding="UTF-8"?><html><body><oai-pmh xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responsedate>2021-06-04T10:02:30Z</responsedate>
<request verb="Identify">https://www.e-periodica.ch/oai/dataprovider</request>
<identify>
<repositoryname>repository.prod</repositoryname>
<baseurl>https://www.e-periodica.ch/oai/dataprovider</baseurl>
<protocolversion>2.0</protocolversion>
<adminemail>webmaster@e-periodica.ch</adminemail>
<earliestdatestamp>2013-12-09T21:21:34Z</earliestdatestamp>
<deletedrecord>no</deletedrecord>
<granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
<description>
<oai-identifier xmlns="http://www.openarchives.org/OAI/2.0/oai-identifier" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.openarchives.org/OAI/2.0/oai-identifier http://www.openarchives.org/OAI/2.0/oa

You can easily compare this with the original response, embedded below via `IFrame`.

In [34]:
IFrame('https://www.e-periodica.ch/oai?verb=Identify', width=970, height=330)
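
To see how a parameter dictionary relates to the query string of such a URL, the following sketch builds the address by hand with the standard library. It is only an illustration; `requests` performs a comparable encoding step internally when a dictionary is passed as `params`.

```python
from urllib.parse import urlencode

base_url = 'https://www.e-periodica.ch/oai'
params = {'verb': 'Identify'}

# Join base URL and encoded parameters into one address
url = base_url + '?' + urlencode(params)
print(url)
```

Further key-value pairs in the dictionary are appended to the query string, separated by `&`.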

### 2.2  Download metadata records

The same can be done with the `GetRecord` OAI verb. Here, `metadataPrefix` and `identifier` are mandatory parameters. Since the e-periodica identifier is a string rather than an integer, always put it in quotation marks.

In [35]:
# Example for accessing a single metadata record
# https://www.e-periodica.ch/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:agora.ch:fde-001:1908:1::7

xml_soup = load_xml({'verb': 'GetRecord', 'metadataPrefix': 'oai_dc', \
                     'identifier': 'oai:agora.ch:fde-001:1908:1::7'})
xml_soup

<?xml version="1.0" encoding="UTF-8"?><html><body><oai-pmh xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responsedate>2021-06-04T10:02:30Z</responsedate>
<request identifier="oai:agora.ch:fde-001:1908:1::7" metadataprefix="oai_dc" verb="GetRecord">https://www.e-periodica.ch/oai/dataprovider</request>
<getrecord>
<record>
<header>
<identifier>oai:agora.ch:fde-001:1908:1::7</identifier>
<datestamp>2018-04-12T19:17:43Z</datestamp>
<setspec>ddc:100</setspec>
<setspec>ddc:200</setspec>
<setspec>ddc:320</setspec>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Unser Programm
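
The tag names in the output above are lowercased because the response was parsed with an HTML parser. As an alternative, the sketch below reads single fields in a namespace-aware way with the standard library module `xml.etree.ElementTree`. The XML string is an abbreviated, hand-written sample modelled on the response above, not the full server output.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-written sample modelled on a GetRecord response
sample = '''<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record>
    <header>
      <identifier>oai:agora.ch:fde-001:1908:1::7</identifier>
      <setSpec>ddc:100</setSpec>
      <setSpec>ddc:200</setSpec>
    </header>
    <metadata>
      <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:title>Unser Programm</dc:title>
      </oai_dc:dc>
    </metadata>
  </record></GetRecord>
</OAI-PMH>'''

# Prefixes chosen freely for this sketch; only the namespace URIs matter
ns = {'oai': 'http://www.openarchives.org/OAI/2.0/',
      'dc': 'http://purl.org/dc/elements/1.1/'}

root = ET.fromstring(sample)
identifier = root.find('.//oai:identifier', ns).text
set_specs = [e.text for e in root.findall('.//oai:setSpec', ns)]
title = root.find('.//dc:title', ns).text
print(identifier, set_specs, title)
```

This keeps the original element names such as `setSpec`, at the cost of having to spell out the namespaces.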

Again, before downloading, first create a dedicated folder for the retrieved metadata.

In [36]:
print(os.getcwd())                       # print current working directory

C:\Users\kwoit\Documents\GitHub\ds-pytools\web-tools\e-periodica-access\data


If you need to change your working directory, you can do this with `os.chdir()`. `os.chdir('<folder>')` changes into the given folder (for example a subdirectory), while `os.chdir(os.pardir)` changes to the parent directory. Just uncomment (and repeat, if necessary) the commands you need.

In [37]:
#os.chdir(os.pardir)                      # change to parent directory
#os.chdir(...)                            # change to subdirectory '...'
os.makedirs('metadata', exist_ok=True)    # make folder 'metadata' - if it is not already there
os.chdir('metadata')                      # change to folder 'metadata'

You might want to **download the metadata record directly** by its e-periodica ID. The `download_record()` function does this for you easily.

In [38]:
def download_record(ID, filename):
    '''
    Downloads a certain metadata record from OAI to a single XML file.
    Throws a notice if metadata file already exists and leaves the existing one.
    Parameters:
    ID = E-periodica ID of the desired record.
    filename = File name to choose for the downloaded record.
    '''
    path = os.getcwd()
    output_soup = load_xml({'verb': 'GetRecord', 'metadataPrefix': 'oai_dc', 'identifier': 'oai:agora.ch:' + str(ID)})
    outfile = os.path.join(path, '{}.xml'.format(filename))
    try:
        # mode 'x' creates the file and fails if it already exists
        with open(outfile, mode='x', encoding='utf-8') as f:
            f.write(output_soup.decode())
        print("Metadata file {}.xml saved".format(filename))
    except FileExistsError:
        print("Metadata file {}.xml exists already".format(filename))

In [39]:
# Example for downloading a single metadata record
# https://www.e-periodica.ch/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:agora.ch:fde-001:1908:1::7

download_record('fde-001:1908:1::7', 'freidenker_programm')

Metadata file freidenker_programm.xml saved
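
Note that a file is written whatever the interface returns. According to the OAI-PMH protocol, a failed request (for instance an unknown identifier) is answered with an `<error>` element carrying a `code` attribute. The sketch below shows one possible way to detect this in the raw response body (for example `response.content`) before saving. `oai_error` is a hypothetical helper, and the sample response is hand-written.

```python
import xml.etree.ElementTree as ET

OAI_NS = '{http://www.openarchives.org/OAI/2.0/}'

def oai_error(xml_text):
    """Return (code, message) if the response holds an OAI-PMH error, else None."""
    root = ET.fromstring(xml_text)
    error = root.find(OAI_NS + 'error')
    if error is None:
        return None
    return error.get('code'), (error.text or '').strip()

# Hand-written sample of an error response as defined by the OAI-PMH protocol
sample = ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
          '<error code="idDoesNotExist">No matching identifier</error>'
          '</OAI-PMH>')
print(oai_error(sample))
```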


### 2.3 Download metadata by set

Scraping the OAI interface output directly poses a problem with larger result sets. The output is **split into segments of 100 records, each delivered as a separate response page**. Looking at a sample request with the `ListIdentifiers` verb, you will find the `resumptionToken` element at the end of the response. The [resumption token](http://www.openarchives.org/OAI/openarchivesprotocol.html#FlowControl) is required to request the next segment, which in turn includes a resumption token for the following one, and so on.

In [40]:
# Scroll to the end of the page for the resumption token
IFrame('https://www.e-periodica.ch/oai?verb=ListIdentifiers&set=ddc:340&metadataPrefix=oai_dc', \
       width=970, height=300)
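
The flow control described above boils down to a simple loop: request a page, use its identifiers, and continue with the returned token until none is left. The sketch below shows this pattern without any network access. `iter_identifiers` and `fetch_page` are hypothetical names, and the "pages" and tokens are made up for illustration.

```python
def iter_identifiers(fetch_page):
    """Yield identifiers page by page until no resumption token is left.

    fetch_page(token) is a hypothetical callable returning a tuple
    (identifiers, next_token); next_token is '' or None on the last page.
    """
    token = None                      # None = first request, without a token
    while True:
        identifiers, token = fetch_page(token)
        yield from identifiers
        if not token:                 # empty token marks the final page
            break

# Stand-in for the OAI interface: three tiny "pages" chained by made-up tokens
fake_pages = {None: (['id-1', 'id-2'], 't1'),
              't1': (['id-3', 'id-4'], 't2'),
              't2': (['id-5'], '')}

print(list(iter_identifiers(lambda token: fake_pages[token])))
```

The important detail is that the token always comes from the page that was fetched last, never from the first one.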

Because of this, accessing metadata in bulk directly from the OAI interface is a bit more complex. With `retrieve_set_metadata()` we create a function to **retrieve metadata records of a set** and save the XML files into a newly created folder. As the e-periodica IDs aren't suitable as file names, the downloaded files will be serially numbered.

**WARNING:** Entire e-periodica sets are large! You should limit the number of records to download. Therefore, the function defines a default value of 50 records.

In [41]:
def retrieve_set_metadata(Set, foldername, max_records=50):
    '''
    Downloads metadata records of a given set from OAI to XML files in a certain folder structure.
    Therefore it
    * creates a folder to hold the records
    * requests e-periodica OAI-PMH interface according to a set 
    * retrieves the set's e-periodica IDs
    * downloads Dublin Core metadata records according to IDs
    * writes them into single serially numbered XML files in the folder.
    Parameters:
    Set = The desired OAI set.
    foldername = name of the folder in which the records will be stored.
    max_records = (Maximum) Number of records to retrieve. Default value is 50.
    '''
    start = time.perf_counter()
    number = 0

    # Set parameters to the interface
    base_url = oai
    recordsearch_term = {'verb': 'GetRecord', 'metadataPrefix': 'oai_dc'}
    listsearch_term = {'verb': 'ListIdentifiers', 'metadataPrefix': 'oai_dc', 'set': Set}
    
    # Make a folder for the files named according to parameter 'foldername'
    path = os.getcwd() + '/' + foldername
    try:
        os.makedirs(path, exist_ok = True)
        print("Path {} is already available or created successfully".format(path))
    except OSError as error:
        print("Path {} can not be created".format(path))
    
        
    # Basic functions
    def load_xml(params):
        '''
        Accesses the OAI interface according to given parameters and scrapes its content.
        Parameters:
        All available native OAI verbs and parameter/value pairs.
        '''
        response = requests.get(base_url, params=params)
        output_soup = soup(response.content, "lxml")
        return output_soup

    def download_record(ID):
        '''
        Downloads a certain metadata record from OAI to a single XML file.
        Throws a notice if metadata file already exists and leaves the existing one.
        Parameter:
        ID = E-periodica ID of the desired record.
        '''
        output_soup = load_xml({'verb': 'GetRecord', 'metadataPrefix': 'oai_dc', 'identifier': ID})
        outfile = path + '/{}.xml'.format(number) 
        try:
            # mode 'x' creates the file and fails if it already exists
            with open(outfile, mode='x', encoding='utf-8') as f:
                f.write(output_soup.decode())
        except FileExistsError:
            print("Metadata file {}.xml exists already".format(number))

    # Start with the first access to OAI interface - get the item IDs of a set
    xml_soup = load_xml(listsearch_term)

    while xml_soup is not None and number < max_records:
        # Scraping out the e-periodica IDs of the current page
        ids = [i.get_text() for i in xml_soup.find_all('identifier')]

        # Download the metadata records according to retrieved e-periodica IDs
        print('Retrieving metadata for e-periodica IDs')
        for ID in ids:
            if number >= max_records:
                break
            number += 1
            download_record(ID)

        # Update the resumption token from the current page to retrieve the next page
        token_tag = xml_soup.find('resumptiontoken')
        resumption_token = token_tag.get_text().strip() if token_tag is not None else ''
        if resumption_token:
            print('New resumption token:', resumption_token)
            if number < max_records:
                # Following accesses for item IDs
                xml_soup = load_xml({'verb': 'ListIdentifiers', 'resumptionToken': resumption_token})
        else:
            print('Reached end of IDs/results list')       # notice when last page is accessed
            xml_soup = None

    with os.scandir(path) as entries:
        count = 0
        for entry in entries:
            count += 1       
    print("{} metadata files in {}".format(count, path))
    finish = time.perf_counter()
    print("Finished in {} second(s)".format(round(finish - start, 2)))

In [42]:
# Just choose the appropriate set (e.g. 'ddc:370'), the desired folder name and the maximum number of records
retrieve_set_metadata('ddc:370', 'DDC_370', 10)

Path C:\Users\kwoit\Documents\GitHub\ds-pytools\web-tools\e-periodica-access\data\metadata/DDC_370 is already available or created successfully
Retrieving metadata for e-periodica IDs
New resumption token: SETddc:370ID41517
10 metadata files in C:\Users\kwoit\Documents\GitHub\ds-pytools\web-tools\e-periodica-access\data\metadata/DDC_370
Finished in 2.63 second(s)


## 3 Download fulltext files from e-periodica website

### 3.0 Prerequisites

In [1]:
# Load the necessary libraries
import requests                                 # request URLs
import urllib.request                           # open URLs, e.g. PDF files on URLs

#!pip install pdfplumber   
import pdfplumber                               # read - available - text out from PDFs

from bs4 import BeautifulSoup as soup           # webscrape and parse HTML and XML
import lxml                                     # XML parser supported by bs4
                                                # call with soup(markup, 'lxml-xml' OR 'xml')
import os                                       # navigate and manipulate file directories
import time                                     # work with time stamps
import pandas as pd                             # pandas is the de facto standard Python library to work with dataframes
from IPython.display import IFrame              # embed website views in jupyter notebook
import math                                     # work with mathematical functions
import re                                       # work with regular expressions
print("Successfully imported necessary libraries")

Successfully imported necessary libraries


https://www.e-periodica.ch/oai/ will be the **base URL** for all OAI requests. To make life easier we put it into the variable `oai`.

In [2]:
oai = 'https://www.e-periodica.ch/oai/'

### 3.1 Download fulltext files by e-periodica ID

Downloading e-periodica fulltexts can be done from the e-periodica website. Fulltext is currently only available in PDF format.

In [3]:
IFrame('https://www.e-periodica.ch/cntmng?type=pdf&pid=act-001:1946:3::626', width=970, height=600)

First, a new directory `fulltexts` will be created next to the `metadata` folder.

In [46]:
print(os.getcwd())                                # print current working directory

C:\Users\kwoit\Documents\GitHub\ds-pytools\web-tools\e-periodica-access\data\metadata


If you need to change your working directory, you can do this with `os.chdir()`. `os.chdir('<folder>')` changes into the given folder (for example a subdirectory), while `os.chdir(os.pardir)` changes to the parent directory.

In [47]:
os.chdir(os.pardir)                               # change to parent directory
os.makedirs('fulltexts', exist_ok=True)           # make new folder 'fulltexts'
os.chdir('fulltexts')                             # change to 'fulltexts' folder

A single fulltext file can be retrieved by a given e-periodica ID with the following function `download_fulltext()`. Note that **for fulltexts a different base URL** - in combination with the given e-periodica ID - has to be used: `https://www.e-periodica.ch/cntmng?type=pdf&pid=`. Since e-periodica IDs don't make a suitable fulltext filename, you have to choose one manually.

In [3]:
def download_fulltext(ID, filename):
    '''
    Downloads the PDF file of a certain e-periodica document by its ID.
    Builds with e-periodica ID the fulltext URL, and saves the PDF file on local disk.
    Parameters:
    ID = E-periodica ID of the desired fulltext/PDF file.
    filename = The file name to choose for the retrieved PDF file.
    '''
    baseurl_fulltext = "https://www.e-periodica.ch/cntmng?type=pdf&pid="
    pdf_url = baseurl_fulltext + str(ID)
    response = urllib.request.urlopen(pdf_url)
    outfile = '{}.pdf'.format(filename)
    
    try:
        # mode 'xb' creates the file and fails if it already exists
        with open(outfile, 'xb') as f:
            f.write(response.read())
        print("Fulltext file {} saved".format(outfile))
    except FileExistsError:
        print("Fulltext file {} exists already".format(outfile))
    except OSError:
        print("Saving fulltext file {} failed".format(outfile))

In [None]:
# Retrieving example PDF with e-periodica ID
download_fulltext('act-001:1946:3::626', 'tropenkaufleute') 
download_fulltext('fde-001:1908:1::7', 'freidenker_programm') 
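
Before saving, it can be worth checking that the downloaded bytes really are a PDF. The sketch below relies only on the fact that PDF files start with the signature `%PDF-`. `looks_like_pdf` is a hypothetical helper, and whether e-periodica answers an unknown ID with something other than a PDF is an assumption that has not been verified here.

```python
def looks_like_pdf(data):
    """Check whether a bytes object starts with the PDF file signature."""
    return data.startswith(b'%PDF-')

print(looks_like_pdf(b'%PDF-1.4\n...'))           # beginning of a PDF file
print(looks_like_pdf(b'<html>Not found</html>'))  # e.g. an HTML error page
```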

Now, we might **check the PDF files we've just downloaded**. Of course, you can open the files with your default PDF viewer. But you can also take a somewhat deeper look at them. With the small but mighty Python library *pdfplumber* we can **read out some information from the files**, for instance the technical metadata.

For the full capabilities of *pdfplumber* you might visit https://github.com/jsvine/pdfplumber.

In [4]:
with open('tropenkaufleute.pdf', 'rb') as f:        
    pdf = pdfplumber.open(f)
    print(pdf.metadata)

{'Author': 'Bodmer, Walter', 'Creator': 'Retroseals PDF-Generator', 'Title': 'Schweizer Tropenkaufleute und Plantagenbesitzer in Niederländisch-Westindien im 18. und zu Beginn des 19. Jahrhunderts', 'Producer': 'DynamicPDF for Java v4.0.3', 'CreationDate': 'D:20210614113844Z'}
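
The `CreationDate` value follows the PDF date notation (`D:` followed by year, month, day, hour, minute and second). As a small sketch, it can be converted into a Python `datetime` object. `parse_pdf_date` is a hypothetical helper that only covers the UTC form ending in `Z` seen in the output above, not every variant the PDF format allows.

```python
from datetime import datetime, timezone

def parse_pdf_date(value):
    """Parse a PDF date of the form 'D:YYYYMMDDHHMMSSZ' into a UTC datetime."""
    parsed = datetime.strptime(value, 'D:%Y%m%d%H%M%SZ')
    return parsed.replace(tzinfo=timezone.utc)

print(parse_pdf_date('D:20210614113844Z'))
```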


By defining a small function called `tech_metadata()`, there's an easy way to get a better formatted output.

In [5]:
def tech_metadata(pdf_path):
    '''
    Reads the technical metadata of a PDF formatted file and prints a summary.
    Parameters:
    pdf_path = The path of the PDF file to be read.   
    '''
    with open(pdf_path, 'rb') as f:
        pdf = pdfplumber.open(f)
        md = pdf.metadata   
        num_pages = len(pdf.pages)
        
    txt = f"""
    Information about {pdf_path}: 

    Author: {md['Author']}
    Title: {md['Title']}
    Number of pages: {num_pages}
    Creator: {md['Creator']}
    Producer: {md['Producer']}
    """
    print(txt)

In [6]:
tech_metadata('tropenkaufleute.pdf')
tech_metadata('freidenker_programm.pdf')


    Information about tropenkaufleute.pdf: 

    Author: Bodmer, Walter
    Title: Schweizer Tropenkaufleute und Plantagenbesitzer in Niederländisch-Westindien im 18. und zu Beginn des 19. Jahrhunderts
    Number of pages: 34
    Creator: Retroseals PDF-Generator
    Producer: DynamicPDF for Java v4.0.3
    

    Information about freidenker_programm.pdf: 

    Author: A. F.
    Title: Unser Programm : Was will ein Freidenkerverein in der Schweiz?
    Number of pages: 2
    Creator: Retroseals PDF-Generator
    Producer: DynamicPDF for Java v4.0.3
    


But *pdfplumber* can also **read out the raw text of the pages** stored in the PDF. Let's take a look at the very first page of the *freidenker_programm.pdf* file with a small code snippet. This **first page** is a cover sheet generated by e-periodica that gives an overview of the document's **bibliographic metadata** and **terms of use** (in German).

In [7]:
with open('freidenker_programm.pdf', 'rb') as f:              
    pdf = pdfplumber.open(f)                  # creating a reader object
    first_page = pdf.pages[0]                 # creating a page object from the first PDF page = cover sheet
    print(first_page.extract_text())          # extracting text from the page object

Unser Programm : Was will ein
Freidenkerverein in der Schweiz?
Autor(en): A. F.
Objekttyp: Article
Zeitschrift: Freidenker [1908-1914]
Band (Jahr): 1 (1908)
Heft 1
PDF erstellt am: 15.06.2021
Persistenter Link: http://doi.org/10.5169/seals-405882
Nutzungsbedingungen
Die ETH-Bibliothek ist Anbieterin der digitalisierten Zeitschriften. Sie besitzt keine Urheberrechte an
den Inhalten der Zeitschriften. Die Rechte liegen in der Regel bei den Herausgebern.
Die auf der Plattform e-periodica veröffentlichten Dokumente stehen für nicht-kommerzielle Zwecke in
Lehre und Forschung sowie für die private Nutzung frei zur Verfügung. Einzelne Dateien oder
Ausdrucke aus diesem Angebot können zusammen mit diesen Nutzungsbedingungen und den
korrekten Herkunftsbezeichnungen weitergegeben werden.
Das Veröffentlichen von Bildern in Print- und Online-Publikationen ist nur mit vorheriger Genehmigung
der Rechteinhaber erlaubt. Die systematische Speicherung von Teilen des elektronischen Angebots
auf anderen Se
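
Since the cover sheet uses fixed labels, its bibliographic fields can be pulled out of the extracted text with regular expressions. The following sketch works on the first lines of the output above. The list of labels is an assumption based on this single example, and other cover sheets may differ.

```python
import re

# First lines of the cover sheet text, copied from the output above
cover = """Unser Programm : Was will ein
Freidenkerverein in der Schweiz?
Autor(en): A. F.
Objekttyp: Article
Zeitschrift: Freidenker [1908-1914]
Band (Jahr): 1 (1908)
Heft 1
PDF erstellt am: 15.06.2021
Persistenter Link: http://doi.org/10.5169/seals-405882"""

# Labels assumed from this one cover sheet
labels = ['Autor(en)', 'Objekttyp', 'Zeitschrift', 'Band (Jahr)', 'Persistenter Link']
fields = {}
for label in labels:
    # Match "<label>: <value>" on a line of its own
    match = re.search(r'^' + re.escape(label) + r':\s*(.+)$', cover, flags=re.MULTILINE)
    if match:
        fields[label] = match.group(1).strip()

print(fields)
```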

Of course, you can read out all the pages at once and **get the whole raw text**, which is saved in the PDF file. Furthermore it is very easy to skip the cover sheet in doing so. We might define a small function named `read_pdf()` for printing out the whole raw text of the article.

In [8]:
def read_pdf(pdf_path):
    '''
    Extracts the raw text of a PDF formatted file and prints it.
    Omits the first page of the PDF file, which is a cover sheet and not part of the article's genuine text.
    Parameters:
    pdf_path = The path of the PDF file to be read.   
    '''
    with open(pdf_path, 'rb') as f:                
        pdf = pdfplumber.open(f)
        for i in range(1, len(pdf.pages)):       # start with the second page to skip the first one = cover sheet
            page = pdf.pages[i]                  # creating a page object
            text = page.extract_text()           # extracting text from the page object
            print(text)   
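
Extracted text keeps the line breaks of the printed page, so words hyphenated at a line end stay split (for example `Schweig-` / `hauser` in the article used in this section). The sketch below shows a simple heuristic for rejoining them: a hyphen at a line end is removed only if the next line starts with a lowercase letter, so that compounds like `Niederländisch-Westindien` keep their hyphen. `join_hyphenated` is a hypothetical helper, and the heuristic will not be right in every case.

```python
import re

def join_hyphenated(text):
    """Rejoin words split by a hyphen at a line end if the next line starts lowercase."""
    return re.sub(r'(\w)-\n([a-zäöüß])', r'\1\2', text)

# Short sample assembled from line breaks found in the article text
sample = "mit Valeria Schweig-\nhauser\nNiederländisch-\nWestindien"
print(join_hyphenated(sample))
```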

In [9]:
read_pdf('tropenkaufleute.pdf')

Schweizer Tröpenkaufleute und
Plantagenbesitzer in Niederländisch-Westindien
im 18. und zu Beginn des 19. Jahrhunderts.
Von Walter Bodmer.
Ueber die schweizerische Auswanderung nach Niederländisch-
Westindien im 18. Jahrhundert ist schon einiges veröffentlicht
worden. Indessen war über die Tätigkeit von Schweizer Kaufleuten
und über Schweizer Plantagenbesitzer in jenem Tropengebiete bis
heute wenig bekannt. Nachforschungen, hauptsächlich in Basler
Archiven, haben es dem Verfasser erlaubt, sich sowohl über die
kaufmännische Tätigkeit von um 1740 auf Curaçao
niedergelassenen
Schweizern, wie auch über Schweizer Plantagenbesitz in
Surinam im 18. und zu Beginn des 19. Jahrhunderts ein Bild zu
machen. Sie bilden Gegenstand der vorliegenden Studie. Die
betreffenden Schweizer sind zum Teil infolge ihrer kaufmännischen
Tätigkeit in Amsterdam mit Westindien in Beziehung getreten,
weshalb hier auch auf diese kurz eingegangen werden soll \
Genfer Plantagenbesitzer.
Nach dem Fall Antwerpens im Jahr

Bodmer, Schweizer Tropenkaufleute und Plantagenbesitzer 293
nur sekundäre Bedeutung zukam. Durch die starke Entwicklung
der holländischen Seeschiffahrt sowie infolge der Schwierigkeiten,
die im 17. Jahrhundert dem bilateralen Handel zwischen den
großen europäischen Nationen entgegenstanden, ist Amsterdam
zum Zentrum eines die Produkte der ganzen Welt umfassenden
Zwischenhandels geworden 10.
Auch der Handel auf St. Eustatius und Curaçao trägt um 1740
die charakteristischen Züge des niederländischen Güteraustausches.
Er ist zum größten Teil Zwischenhandel besonderer Art. Es waren
ja die westindischen Inseln Curaçao, Bonaire, Aruba und die Hälfte
von St. Martin sowie St. Eustatius und Saba zum Plantagenbau nur
bedingt geeignet. Sie sind daher im 17. Jahrhundert von der
Niederländisch-Westindischen Kompagnie den Spaniern vor allem
zur Errichtung von Handelsstülzpunkten, sowohl für den legalen
Handel wie für den Schleichhandel mit den spanischen Kolonien
am Rande des Karibischen Meeres, abg

Bodmer, Schweizer Tropenkaufleute und Plantagenbesitzer 297
IL
Daneben ist Hoffmann, der nur ausnahmsweise für eigene
Rechnung Handel trieb, sehr häufig als Kommissionär für auf
St. Eustatius niedergelassene holländische Kaufleute tätig gewesen.
Für diese verkaufte er Güter, die jene direkt oder indirekt aus
Europa, Nordamerika oder von den benachbarten Antilleninseln
erhalten hatten und auf Curacao abzusetzen suchten. Es waren
dies vor allem Lebensmittel (Fleisch, Speck, Mehl, Pataten und
Orangen, ferner Eisenwaren, Madeira-Wein, irländische Talgkerzen
und ausnahmsweise Textilien (Platilles) 23. Dafür handelte
Hoffmann für ihre Rechnung Häute, Kakao und Maulesel ein,
welche Güter sie weiterverkauften 24. Als Agent dieser Kaufleute
vermittelte er auch den Absatz einiger Kolonialprodukte, die sie
pflegte, befanden sich zwei, deren Inhaber aus Basel stammten. Da ist in
erster Linie die Firma A'euve Jean Rudolf Faesch & Cie. zu nennen, der Isaak
Faesch vor 1720 selbst angehört hatte und d

Bodmer, Schweizer Tropenkaufleute und Plantagenbesitzer 301
Allein auch diese Herrlichkeit sollte nicht lange dauern. Die Zahl
der im südlichen Teile des Karibischen Meeres kreuzenden
britischen Kaperschiffe vermehrte sich rasch, was den Schleichhandel
von Curaçao aus immer schwieriger gestalten mußte. Der Verkehr
mit dem General-Kapitanat Caracas blieb stark behindert, und die
Rückkehr der Schmugglerschiffe von Cartagena und Porto Bello
ließ nun plötzlich auf sich warten. Immer mehr Schleichhandel
treibende Barken wurden von den seit dem Rückzug der Cartagena
erfolglos belagernden englischen Flotte mutig gewordenen
spanischen, ferner auch von den von Rhode Island herbeigeeilten
englischen Freibeutern abgefangen. Dies hatte einen völligen
Stillstand des Exportes von Curaçao aus zur Folge, während dort
gleichzeitig infolge der ungehinderten Zufuhr aus Europa und von
den Antillen plötzlich großer Warenüberfluß herrschte, so daß
Hoffmann seine Kommittenten ersuchte, keine weiteren Waren
z

Bodmer, Schweizer Tropenkaufleute und Plantagenbesitzer 305
VIII.
Seit dem 17. Jahrhundert, als die Niederländisch-Westindische
Kompagnie noch das Monopol des Sklavenhandels mit Spanisch-
Amerika innehatte, war Curaçao zu einem Hauptzentrum für diesen
Handel geworden. In der Folge war dann das «Asiento» an
die Franzosen, 1713 an die Engländer übergegangen. Der spanischenglische
Seekrieg brachte den Sklavenhandel auf Curaçao zu
neuer Blüte. Auch der geschäftstüchtige Hoffmann hat sich für
eigene wie für Rechnung seiner auf St. Eustatius niedergelassenen
Kommittenten daran beteiligt. Allerdings konnten sich diese Kauf-
leute die Sklaven nicht in Afrika direkt beschaffen, wie die
Westindische Kompagnie. Sie waren darauf angewiesen, diese den
englischen Sklavenhändlern auf St. Christopher abzukaufen, und
ließen sie hernach durch Mittelsmänner, meist Juden, an die Küste
von Venezuela bringen, wo sie gegen Kakao eingetauscht wurden.
Hoffmann empfahl seinen Korrespondenten, nur junge, kräftig

310 Ada Trop. Ill, 4, 1946 — Kolonisation
sehen Meeres, welche infolge der Blockade und Besetzung durch
die Briten vom Mutterlande abgeschnitten waren. Die holländische
Handelsschiffahrt wurde wiederum durch die Kaperschiffe
der kriegführenden Mächte bedroht.
Neben andern Amsterdamer Beederkaufleuten passierte auch
Johann Jakob Faesch das Mißgeschick, daß sein Schiff mit einer
beträchtlichen Ladung auf der Rückreise von Curaçao nach
Amsterdam, am 27. Oktober 1758, von einem englischen Kaperschiff
aufgegriffen und nach Annapolis (Maryland) gebracht worden ist.
Nach 1760/61 hat er sich um dessen Freigabe bemüht68.
Die Konfiskation des Schiffes und seiner Ladung hat Johann
Jakob zweifellos einen empfindlichen Verlust gebracht. Indessen
war seine finanzielle Lage nun so weit gesichert, daß dieser seine
kaufmännische Tätigkeit auf die Länge nicht mehr zu beeinträchtigen
vermochte. Am 8. Juli 1759 hatte er sich mit Catharina Maria
de Hoy, der Schwester der zweiten Frau seines Bruders Johanne

314 Ada Trop. Ill, 4, 1946 — Kolonisation
Die Krisenzeit hat allerdings Johann Jakob nicht mehr erlebt.
Nach dem Wegzug seines Bruders Johannes führte er zunächst
das Handelshaus in Amsterdam allein weiter. Seine Frau, Catharina
Maria Faesch-de Hoy, starb schon 1765 bei der Geburt des
Sohnes Johann Jakob II an Kindbettfieber. Sechs Jahre später,
im Jahre 1771, zog sich der Witwer mit seinen drei überlebenden
Kindern in seine Vaterstadt Basel zurück. Dort verheiratete er sich
im selben Jahre ein zweites Mal, und zwar mit Valeria Schweig-
hauser, die ihm einen weiteren Sohn und drei Töchter schenkte.
Die Geschäfte in Amsterdam besorgte während der Abwesenheit
des Handelsherrn Faesch dessen Faktor, Johann Christian
Neuhaus. Indessen hat Johann Jakob auch nach 1771 wiederholte
Geschäftsreisen nach Holland unternommen. Sein eigenes
Handelshaus halte er zwar im Jahre 1794 aufgelöst, war jedoch zur
selben Zeit als Associé in das Haus Braunsberg eingetreten,
welches von nun an den Namen «Braun

318 Ada Trop. Ill, 4, 1946 — Kolonisation
ten ab; denn die in Surinam gepflanzte Kaffeesorte, eine Varietät
von «Coffea arabica», wurde von den Konsumenten nicht geschätzt
und erzielte daher keine befriedigenden Preise. Es wurde daher
ernstlich die Einstellung des Betriebes erwogen "3.
Bis zum Jahre 1827 wurde der Faeschsche Planlagenbesitz von
J. J. de Faesch & Cie. in Amsterdam verwaltet. Allein die
Prosperität
dieses Handels- und Bankhauses dauerte nicht unbegrenzte
Zeit. Es geriet Ende der zwanziger Jahre des 19. Jahrhunderts in
finanzielle Schwierigkeiten. Die Erben des Ratsherrn J. J. Faesch
sahen sich daher veranlaßt, die Verwaltung von Hoyland und
Voorburg sowie diejenige ihrer Anteilscheine an anderen Plantagen
der Amsterdamer Firma Moyet & Cie. zu übertragen 8\ Im
Jahre 1850 sind die in Basel, Genf und Amsterdam lebenden
Nachkommen noch Eigentümer der beiden vorgenannten Plantagen
gewesen. 1851 wurde indessen Voorburg veräußert. Zu welchem
Zeitpunkt Hoyland verkauft worden is

Similarly, we can define a function `pdf_to_txt()` that writes the raw text extracted from a PDF directly into a TXT file.

In [10]:
import os    # needed for os.path.splitext() below

def pdf_to_txt(pdf_path):
    '''
    Extracts the raw text of a PDF file and writes it into a TXT file of the same name
    (with the '.txt' file extension instead of '.pdf').
    Omits the first page of the PDF file, which is a cover sheet and not part of the article's genuine text.
    Parameters:
    pdf_path = The path of the PDF file to be read.
    '''
    fulltext = ''
    with pdfplumber.open(pdf_path) as pdf:           # the context manager closes the PDF afterwards
        for page in pdf.pages[1:]:                   # start with the second page to skip the cover sheet
            page_text = page.extract_text() or ''    # extract_text() may return None for pages without text
            fulltext += page_text                    # join the page texts into the whole text

    filename = os.path.splitext(pdf_path)[0]         # path without the file extension
    outfile = filename + '.txt'

    try:
        with open(outfile, 'w', encoding='utf-8') as f:
            f.write(fulltext)
        print("Fulltext file {}.txt saved".format(filename))
    except OSError:
        print("Saving fulltext file {}.txt failed".format(filename))

In [11]:
pdf_to_txt('tropenkaufleute.pdf')

Fulltext file tropenkaufleute.txt saved


In [12]:
# check the content of the generated TXT file
with open('tropenkaufleute.txt', 'r', encoding='utf-8') as f:
    fulltext = f.read()
    print(fulltext)

Schweizer Tröpenkaufleute und
Plantagenbesitzer in Niederländisch-Westindien
im 18. und zu Beginn des 19. Jahrhunderts.
Von Walter Bodmer.
Ueber die schweizerische Auswanderung nach Niederländisch-
Westindien im 18. Jahrhundert ist schon einiges veröffentlicht
worden. Indessen war über die Tätigkeit von Schweizer Kaufleuten
und über Schweizer Plantagenbesitzer in jenem Tropengebiete bis
heute wenig bekannt. Nachforschungen, hauptsächlich in Basler
Archiven, haben es dem Verfasser erlaubt, sich sowohl über die
kaufmännische Tätigkeit von um 1740 auf Curaçao
niedergelassenen
Schweizern, wie auch über Schweizer Plantagenbesitz in
Surinam im 18. und zu Beginn des 19. Jahrhunderts ein Bild zu
machen. Sie bilden Gegenstand der vorliegenden Studie. Die
betreffenden Schweizer sind zum Teil infolge ihrer kaufmännischen
Tätigkeit in Amsterdam mit Westindien in Beziehung getreten,
weshalb hier auch auf diese kurz eingegangen werden soll \
Genfer Plantagenbesitzer.
Nach dem Fall Antwerpens im Jahr

So everything is fine? Unfortunately not.

First, you might notice that the **running headers and footers of the pages as well as their footnotes are included** sequentially in the text output. So keep in mind that in many cases you won't get a spotlessly clean article fulltext fit for human reading. But if you are looking for a text mining resource, this outcome will do the job quite well.

Secondly, you might have **more complex PDF files**, for instance older ones with a **column layout**. Here, *pdfplumber* reaches its limits.

In [13]:
read_pdf('freidenker_programm.pdf')

Organ der Freidenker der deutschen
IreidenkHeerPavuos-sgt'feMagcheebe6rn15e6vionm Zürich I. J1a.hDrgaanuncgrr—190U8o. 1. Abonnement: SchwEeEinirzzsceFhlneru.inmt1mm.2eo0r,na1tA0licuhCs.ltasn.d Fr. 1.50 pro Jahr.
Unser Programm. rialisten im metaphysischen Sinne. Wir predigen keinen Gsti
Achtnng! AtomundüberhauptkeinemetaphysischeWeltanschauung, sondern
an:BPrioefset,faGcehlds6en1d5un6g,enHuanudpTtapuoscshtexZemüprilacrhe.sind zu richten KTuzuahisuuönnystuchsdrAöhlezdarttsEu,rwintaeeirrnefnrvsäieeEfgneawrimerglnlwmnlzieteül,ednlniniteidslttdfilrceegtdamhirltgiul.eeeiacwnGtuihaungofelpngBanhdeueydsfzbisfreewrgieehrseeuaciiniunhtWtsiakcgentinhnegdang.tienhekneFrbehndgDreeaeKrlwavösgirteGßseeäitsnflermatei.Eeennune.vbinnaTeskDancrnaIlihgalunsheilei,dnnmcnrl.oheiupwgeGmhönmoDMeeeei?beGednddisateeeheestöirroseitcndnMemWWedseeiiwmssrnteshssiveseeiloctiotnnghldsnsileHiöcc?sshhtideleeafiestennfsrtt wuDodAVtngyendaienuarrbterdrarsrsestdrtcn.teeehrnaudnldiaslinlesil,iWutaeeesniuejuene,ksnirrfuoen

To handle such challenging PDF files you will have to fall back on a **stronger tool**. *Apache Tika* is a well-known open source toolkit which extracts metadata and text from a wide range of file formats. *Apache Tika* is Java-based, so a Java runtime must be available on your machine. Installing the Python wrapper as shown below gives you a client which, on first use, downloads the Tika REST server and starts it in the background.

See the Apache Tika documentation at https://tika.apache.org/ and the documentation of the Python wrapper at https://pypi.org/project/tika/.

The Tika parser also distinguishes between the file's metadata and its content. Just try the following two short code snippets.

In [14]:
#!pip install tika
from tika import parser

text = parser.from_file('freidenker_programm.pdf')
print(text['metadata'])

{'Author': 'A. F.', 'Content-Type': 'application/pdf', 'Creation-Date': '2021-06-15T10:06:43Z', 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.pdf.PDFParser'], 'X-TIKA:content_handler': 'ToTextContentHandler', 'X-TIKA:embedded_depth': '0', 'X-TIKA:parse_time_millis': '20', 'access_permission:assemble_document': 'true', 'access_permission:can_modify': 'true', 'access_permission:can_print': 'true', 'access_permission:can_print_degraded': 'true', 'access_permission:extract_content': 'true', 'access_permission:extract_for_accessibility': 'true', 'access_permission:fill_in_form': 'true', 'access_permission:modify_annotations': 'true', 'created': '2021-06-15T10:06:43Z', 'creator': 'A. F.', 'dc:creator': 'A. F.', 'dc:format': 'application/pdf; version=1.4', 'dc:title': 'Unser Programm : Was will ein Freidenkerverein in der Schweiz?', 'dcterms:created': '2021-06-15T10:06:43Z', 'meta:author': 'A. F.', 'meta:creation-date': '2021-06-15T10:06:43Z', 'pdf:PDFVersion

In [15]:
with open('freidenker_programm.pdf', 'rb') as f:                      
    text = parser.from_file(f)
    print(text['content'])






































Unser Programm : Was will ein Freidenkerverein in der Schweiz?


Unser Programm : Was will ein
Freidenkerverein in der Schweiz?

Autor(en): A. F.

Objekttyp: Article

Zeitschrift: Freidenker [1908-1914]

Band (Jahr): 1 (1908)

Heft 1

Persistenter Link: http://doi.org/10.5169/seals-405882

PDF erstellt am: 15.06.2021

Nutzungsbedingungen
Die ETH-Bibliothek ist Anbieterin der digitalisierten Zeitschriften. Sie besitzt keine Urheberrechte an
den Inhalten der Zeitschriften. Die Rechte liegen in der Regel bei den Herausgebern.
Die auf der Plattform e-periodica veröffentlichten Dokumente stehen für nicht-kommerzielle Zwecke in
Lehre und Forschung sowie für die private Nutzung frei zur Verfügung. Einzelne Dateien oder
Ausdrucke aus diesem Angebot können zusammen mit diesen Nutzungsbedingungen und den
korrekten Herkunftsbezeichnungen weitergegeben werden.
Das Veröffentlichen von Bildern in Print- und Online-Publikationen ist nur mit vorheriger Genehmigung


Looks fine! But there's one more problem: the extracted text **also includes the text of the added cover sheet**. To deal with this, we first look for a string which separates the cover sheet text from the article's text. As you can see above, the Digital Object Identifier (DOI) link of the document (http://doi.org/10.5169/seals-405882) can serve as this separator. The DOI link appears twice on the cover sheet. So if we split the document at the DOI link, the third part contains the article's raw text only.
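The splitting idea can be tried out without any PDF at hand. The following is a minimal sketch only: the `content` string is a made-up stand-in for the text returned by Tika, and `article_text()` is a hypothetical helper name, not part of this notebook or of any library.

```python
# Hypothetical stand-in for the text Tika returns:
# cover sheet (containing the DOI link twice) followed by the article text
doi = 'http://doi.org/10.5169/seals-405882'
content = (
    'Title on the cover sheet\n'
    'Persistenter Link: ' + doi + '\n'
    'Terms of use ...\n'
    + doi + '\n'
    'First line of the article\nSecond line of the article\n'
)

def article_text(content, doi):
    """Return everything after the second occurrence of the DOI link."""
    parts = content.split(doi, 2)      # maxsplit=2 yields at most three parts
    if len(parts) < 3:
        raise ValueError('DOI link found fewer than two times')
    return parts[2]

print(article_text(content, doi).strip())
```

Using `maxsplit=2` keeps the article text in one piece even if the DOI link should show up again inside the article itself.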

With this in mind, we can define a new function `read_pdf_tika()` that reads the article's text with the Tika parser.

In [16]:
def read_pdf_tika(pdf_path):
    '''
    Extracts the raw text of a PDF file with Apache Tika and prints it without the cover sheet.
    Parameters:
    pdf_path = The path of the PDF file to be read.
    '''
    # Read the cover sheet with pdfplumber to find the DOI link of the document
    with pdfplumber.open(pdf_path) as pdf:
        text_first_page = pdf.pages[0].extract_text() or ''
    match = re.search(r'http://doi\.org/10\.5169/seals-\S+', text_first_page)   # look for the DOI link
    if match is None:
        print("No DOI link found on the cover sheet of {}".format(pdf_path))
        return

    raw = parser.from_file(pdf_path)
    # Split the document at the DOI link (it appears twice on the cover sheet)
    # and choose the third part, which is the article's text
    print(raw['content'].split(match.group(), 2)[2])

In [18]:
read_pdf_tika('freidenker_programm.pdf')




Organ der Freidenker der deutschen
Herausgegeben vom

Ireidenkev-'Merein Zürich
Postfach 6156

I. Jahrgang — Uo. 1.
1. Danucrr 1908

Erscheint monatlich.
Abonnement: Schweiz Fr. 1.20, Ausland Fr. 1.50 pro Jahr.

Einzelnummer 10 Cts.

Achtnng!
Briefe, Geldsendungen und Tauschexemplare sind zu richten

an: Postfach 6156, Hauptpost Zürich.

^os^tt ^<>tt^ tt/e

Kreidenkerverein Zürich.

EiltladlW sill KeneralntrslllMlllllng
ans Sonntag den 12. Januar, nachmittags 2 Mr

m Saale des hintern Sternen iL.'l>tv«5p>ay.

T r a k t a n d e n :
1. Bezug der Beiträge.
2. Verlesen des Protokolls.
3. Wahl des Vorstandes und der Delegierten.
4. Antrag betr. Zeitung und Erhöhung des Beitrags.
5. Statutenänderung.
l». Verschiedenes.

Nach Abwicklung der Traktanden

gemütliches Zusammensein.
Abendessen k la carte.

Wir hoffen auf zahlreichen Besuch, speziell von auswärtigen
Mitgliedern. Anmeldungen für Vorträge 5. erbeten.

Der Vorstand.

Atheismus.
I. H. Mackay.

Vielleicht, wenn einst die müden Augen 

In [71]:
import os    # needed for os.path.splitext() below

def pdf_to_txt_tika(pdf_path):
    '''
    Extracts the raw text of a PDF file with Apache Tika and writes it into a TXT file of the same name
    (with the '.txt' file extension instead of '.pdf').
    Parameters:
    pdf_path = The path of the PDF file to be read.
    '''
    # Read the cover sheet with pdfplumber to find the DOI link of the document
    with pdfplumber.open(pdf_path) as pdf:
        text_first_page = pdf.pages[0].extract_text() or ''
    match = re.search(r'http://doi\.org/10\.5169/seals-\S+', text_first_page)   # look for the DOI link
    if match is None:
        print("No DOI link found on the cover sheet of {}".format(pdf_path))
        return

    text = parser.from_file(pdf_path)
    # Split the document at the DOI link (it appears twice on the cover sheet)
    # and choose the third part, which is the article's text
    fulltext = text['content'].split(match.group(), 2)[2]

    filename = os.path.splitext(pdf_path)[0]         # path without the file extension
    outfile = filename + '.txt'

    try:
        with open(outfile, 'w', encoding='utf-8') as f:
            f.write(fulltext)
        print("Fulltext file {}.txt saved".format(filename))
    except OSError:
        print("Saving fulltext file {}.txt failed".format(filename))

In [72]:
pdf_to_txt_tika('freidenker_programm.pdf')

Fulltext file freidenker_programm.txt saved


In [21]:
# Check the content of the generated TXT file
with open('freidenker_programm.txt', 'r', encoding='utf-8') as f:
    fulltext = f.read()
    print(fulltext)




Organ der Freidenker der deutschen
Herausgegeben vom

Ireidenkev-'Merein Zürich
Postfach 6156

I. Jahrgang — Uo. 1.
1. Danucrr 1908

Erscheint monatlich.
Abonnement: Schweiz Fr. 1.20, Ausland Fr. 1.50 pro Jahr.

Einzelnummer 10 Cts.

Achtnng!
Briefe, Geldsendungen und Tauschexemplare sind zu richten

an: Postfach 6156, Hauptpost Zürich.

^os^tt ^<>tt^ tt/e

Kreidenkerverein Zürich.

EiltladlW sill KeneralntrslllMlllllng
ans Sonntag den 12. Januar, nachmittags 2 Mr

m Saale des hintern Sternen iL.'l>tv«5p>ay.

T r a k t a n d e n :
1. Bezug der Beiträge.
2. Verlesen des Protokolls.
3. Wahl des Vorstandes und der Delegierten.
4. Antrag betr. Zeitung und Erhöhung des Beitrags.
5. Statutenänderung.
l». Verschiedenes.

Nach Abwicklung der Traktanden

gemütliches Zusammensein.
Abendessen k la carte.

Wir hoffen auf zahlreichen Besuch, speziell von auswärtigen
Mitgliedern. Anmeldungen für Vorträge 5. erbeten.

Der Vorstand.

Atheismus.
I. H. Mackay.

Vielleicht, wenn einst die müden Augen 

That looks really good! But one final caveat has to be stated.

Mind that Tika (and presumably most PDF readers) **will parse whole pages by default**. If there is more than one article on a page, the raw text will contain more than the one article you want. That's also the case above. Furthermore, the **constraints regarding running headers, footers and footnotes** noted before apply to the various PDF readers alike.

So, as mentioned before: the outcome makes quite a good text mining resource, but might be confusing here and there for human readers.
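If the running headers get in the way, they could be filtered out heuristically. The following is a rough sketch, not a tested cleaning procedure: it assumes that a running header is the first line of a page and that it starts or ends with a page number, as in the sample output further above. The name `drop_running_header()` and the pattern are illustrative assumptions, and a body line that happens to end in a number (a year, for instance) would be removed as well.

```python
import re

# Assumed pattern: a running header starts or ends with a page number of up to four digits
HEADER = re.compile(r'^(\d{1,4}\s+\S.*|.*\S\s+\d{1,4})$')

def drop_running_header(page_text):
    """Remove the first line of a page if it looks like a running header."""
    first, _, rest = page_text.partition('\n')
    if len(first) < 80 and HEADER.match(first.strip()):
        return rest
    return page_text

page = ('Bodmer, Schweizer Tropenkaufleute und Plantagenbesitzer 301\n'
        'Allein auch diese Herrlichkeit sollte nicht lange dauern. Die Zahl\n'
        'der im südlichen Teile des Karibischen Meeres kreuzenden')
print(drop_running_header(page))
```

Within `pdf_to_txt()`, such a filter could be applied to each page text before the pages are joined. Footnotes and footers are harder to detect and are not covered by this sketch.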

### 3.2 Download fulltext files by set

Finally, let's build a function `retrieve_set_fulltexts` to **retrieve fulltexts of a certain e-periodica set**.

**WARNING**: As with the metadata records, fulltext sets of e-periodica are large, so it's a good idea to limit the number of fulltexts to download. The default number of fulltexts in the function will be 20. Of course, you can also change that easily.
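The effect of such a limit can be sketched independently of the network. In the example below, `FAKE_PAGES` and `fetch_page()` are made-up stand-ins for the OAI-PMH interface (they are assumptions for illustration, not part of e-periodica or of any library). The point is that pages are only requested as long as more IDs are needed.

```python
from itertools import islice

# Hypothetical stand-in for the OAI-PMH interface: resumption token -> (IDs, next token)
FAKE_PAGES = {
    None: (['id-1', 'id-2', 'id-3'], 'token-a'),
    'token-a': (['id-4', 'id-5', 'id-6'], 'token-b'),
    'token-b': (['id-7'], None),          # last page: no further token
}
requested = []                            # records which pages were actually fetched

def fetch_page(token):
    requested.append(token)
    return FAKE_PAGES[token]

def iter_ids(fetch):
    """Yield IDs page by page, following the resumption tokens lazily."""
    token = None
    while True:
        ids, token = fetch(token)
        yield from ids
        if token is None:
            break

first_four = list(islice(iter_ids(fetch_page), 4))
print(first_four)    # the first four IDs
print(requested)     # only the first two pages were requested
```

Because the generator is lazy, the third page is never requested once the limit of four IDs is reached.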

If you need to change your working directory, you can do this with `os.chdir()`. `os.chdir(path)` changes the working directory to the given path (for instance a subdirectory), while `os.chdir(os.pardir)` changes it to the parent directory. `os.getcwd()` shows where you currently are.
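A small sketch of these calls, run inside a temporary directory so that nothing on your disk is touched. The folder names `data` and `fulltexts` are only examples.

```python
import os
import tempfile

start_dir = os.getcwd()
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, 'data', 'fulltexts'))   # example folder names

    os.chdir(os.path.join(base, 'data', 'fulltexts'))      # go to a directory by its path
    in_fulltexts = os.path.basename(os.getcwd())

    os.chdir(os.pardir)                                    # one level up, to the parent directory
    in_parent = os.path.basename(os.getcwd())

    os.chdir(start_dir)                                    # restore the original working directory

print(in_fulltexts, in_parent)
```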

In [52]:
print(os.getcwd()) 

C:\Users\kwoit\Documents\GitHub\ds-pytools\web-tools\e-periodica-access\data\fulltexts


In [40]:
def retrieve_set_fulltexts(Set, foldername, max_fulltexts=20):
    '''
    Downloads PDF fulltexts of a given OAI set (e.g. a DDC class) from the e-periodica website to files in a certain folder.
    To do so, it
    * creates the folder according to the parameter foldername
    * requests the e-periodica OAI-PMH interface for the given OAI set
    * retrieves the set's e-periodica IDs
    * downloads the PDF fulltexts for these IDs from the e-periodica website
    Parameters:
    Set = The desired OAI set.
    foldername = Name of the folder in which the fulltexts will be stored.
    max_fulltexts = Maximum number of fulltexts to retrieve. Default value is 20.
    '''
    start = time.perf_counter()
    number = 0

    # Set parameters to the interface
    base_url = oai
    baseurl_fulltext = "https://www.e-periodica.ch/cntmng?type=pdf&pid="
    listsearch_term = {'verb': 'ListIdentifiers', 'metadataPrefix': 'oai_dc', 'set': Set}
    
    # Make a folder <foldername> to store files in it
    directory = foldername
    parent_dir = os.getcwd()
    path = os.path.join(parent_dir, directory)
    try:
        os.makedirs(path, exist_ok = True)
        print('Path {} is already available or created successfully'.format(path))
    except OSError as error:
        print('Path {} could not be created'.format(path))
           
    # Basic functions
    def load_xml(params):
        '''
        Accesses the OAI interface according to given parameters and scrapes its content.
        '''
        response = requests.get(base_url, params=params)
        output_soup = soup(response.content, "lxml")
        return output_soup

    def download_fulltext(ID):
        '''
        Downloads the PDF file of a certain e-periodica document by its ID.
        Builds the fulltext URL from the e-periodica ID and saves the PDF file on the local disk.
        Parameter:
        ID = E-periodica ID of the desired fulltext/PDF file.
        '''
        pdf_url = baseurl_fulltext + str(ID)
        outfile = os.path.join(path, '{}.pdf'.format(number))
        try:
            # Mode 'wb' overwrites an existing file of the same name
            with urllib.request.urlopen(pdf_url) as response, open(outfile, 'wb') as f:
                f.write(response.read())
        except OSError as error:      # covers network errors (URLError) as well as file errors
            print("Saving fulltext file {}.pdf failed: {}".format(number, error))
            
    # Page through the identifier lists until enough fulltexts are downloaded
    # or the last results page is reached
    params = listsearch_term                     # first access: request by set
    while params and number < max_fulltexts:
        xml_soup = load_xml(params)

        # Scraping out the e-periodica IDs
        ids = []
        for identifier in xml_soup.find_all('identifier'):
            # extract the string following 'oai:agora.ch:'
            match = re.search(r'oai:agora\.ch:(\w{3}-\d{3}:\d{4}:\d+::\d+)', identifier.get_text())
            if match:
                ids.append(match.group(1))       # first parenthesized group = e-periodica ID

        # Download the fulltext files according to retrieved e-periodica IDs
        print('Retrieving PDF fulltexts for e-periodica IDs')
        for ID in ids:
            if number >= max_fulltexts:
                break
            number += 1
            download_fulltext(ID)

        # Update the resumption token to retrieve the next page
        token = xml_soup.find('resumptiontoken')
        if token is not None and token.get_text():
            resumption_token = token.get_text()
            print('New resumption token:', resumption_token)
            params = {'verb': 'ListIdentifiers', 'resumptionToken': resumption_token}
        else:
            print('Reached end of IDs/results list')       # notice when last results page is accessed
            params = None
    
    # Count the PDF files in the folder
    with os.scandir(path) as entries:
        count = sum(1 for entry in entries if entry.name.endswith('.pdf'))
    print("{} fulltext files in {}".format(count, path))
    finish = time.perf_counter()
    print("Finished in {} second(s)".format(round(finish - start, 2)))
    

In [83]:
os.getcwd()

'C:\\Users\\kwoit\\Documents\\GitHub'

In [41]:
retrieve_set_fulltexts('ddc:450', 'DDC_450', 10)

Path C:\Users\kwoit\Documents\GitHub\DDC_450 is already available or created successfully
Retrieving PDF fulltexts for e-periodica IDs
New resumption token: SETddc:450ID52220
10 fulltext files in C:\Users\kwoit\Documents\GitHub\DDC_450
Finished in 28.16 second(s)
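
The step from an OAI identifier to the download URL can also be tried in isolation. This is a minimal sketch: the base URL and the pattern are taken from the function above, while the identifier in the example and the helper name `fulltext_url()` are made up for illustration.

```python
import re

BASEURL_FULLTEXT = 'https://www.e-periodica.ch/cntmng?type=pdf&pid='
ID_PATTERN = re.compile(r'oai:agora\.ch:(\w{3}-\d{3}:\d{4}:\d+::\d+)')

def fulltext_url(oai_identifier):
    """Return the PDF URL for an OAI identifier, or None if it does not match."""
    match = ID_PATTERN.search(oai_identifier)
    if match is None:
        return None
    return BASEURL_FULLTEXT + match.group(1)

# Hypothetical identifier, made up to fit the pattern
print(fulltext_url('oai:agora.ch:abc-001:1886:1::12'))
```

Identifiers that do not fit the pattern yield `None` and can be skipped.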


You can easily **check the technical metadata of the downloaded PDF files** with a small loop over the files in the new *DDC_450* folder, listing them with the `os.listdir()` function and passing each one to `tech_metadata()`.

In [113]:
path = os.path.join(os.getcwd(), 'DDC_450')

for entry in os.listdir(path):
    if entry.endswith('.pdf'):
        tech_metadata(path + '/' + entry)


    Information about C:\Users\kwoit\Documents\GitHub\DDC_450/1.pdf: 

    Author: Bühler, J.A.
    Title: Notizias historicas sur l'origin della Societad Rhaeto-romana
    Number of pages: 38
    Creator: Retroseals PDF-Generator
    Producer: DynamicPDF for Java v4.0.3
    

    Information about C:\Users\kwoit\Documents\GitHub\DDC_450/10.pdf: 

    Author: Kuoni, M.
    Title: Restanzas dil lungatg romonsch en las valladas della Landquard e della Plessur (Vall Portenza, Scanvitg ed entginas confinontas)
    Number of pages: 4
    Creator: Retroseals PDF-Generator
    Producer: DynamicPDF for Java v4.0.3
    

    Information about C:\Users\kwoit\Documents\GitHub\DDC_450/2.pdf: 

    Author: Bühler, J.A.
    Title: L'uniun dels dialects raetho-romans : (Referat per la prima radunanza quartala della Societad Rhaeto-romana, il 21 d. Schanèr 1886)
    Number of pages: 24
    Creator: Retroseals PDF-Generator
    Producer: DynamicPDF for Java v4.0.3
    

    Information about C:\Users\

Finally, you may **process all the PDF files in a given folder batch-wise** and **write TXT files from their raw text**. The general processing loop above can even be written more compactly as a list comprehension. Since the functions return nothing, the cell output ends with a list of `None` values. Both ways, using the lightweight *pdfplumber* library and using the *Apache Tika* wrapper for Python, are shown below with a sample outcome.

In [114]:
# Using pdfplumber library
path = os.path.join(os.getcwd(), 'DDC_450')

[pdf_to_txt(path + '/' + entry) for entry in os.listdir(path) if entry.endswith('.pdf')]

Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/1.txt saved
Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/10.txt saved
Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/2.txt saved
Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/3.txt saved
Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/4.txt saved
Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/5.txt saved
Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/6.txt saved
Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/7.txt saved
Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/8.txt saved
Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/9.txt saved


[None, None, None, None, None, None, None, None, None, None]
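
Note that `os.listdir()` returns the file names in no guaranteed order, and in the listings above `10.pdf` comes before `2.pdf`. If the files should be processed in numeric order, a sketch like the following may help. It assumes that all PDF file names are plain numbers, as generated by `retrieve_set_fulltexts`, and the helper name `pdfs_in_numeric_order()` is made up for illustration.

```python
from pathlib import Path
import tempfile

def pdfs_in_numeric_order(folder):
    """Return the PDF paths of a folder sorted by their numeric file name."""
    # int(p.stem) raises a ValueError for non-numeric names (assumption: names are numbers)
    return sorted(Path(folder).glob('*.pdf'), key=lambda p: int(p.stem))

with tempfile.TemporaryDirectory() as folder:      # stand-in for the DDC_450 folder
    for name in ['1.pdf', '10.pdf', '2.pdf', '1.txt']:
        Path(folder, name).touch()
    names = [p.name for p in pdfs_in_numeric_order(folder)]

print(names)
```

A plain `for` loop over `pdfs_in_numeric_order('DDC_450')`, calling `pdf_to_txt(str(pdf))` for each path, would then process the files in that order and avoids the list of `None` values.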

In [116]:
# Check the content of a generated TXT file
with open('DDC_450/9.txt', 'r', encoding='utf-8') as f:
    fulltext = f.read()
    print(fulltext)

La strcda sur la muiitagua del Fuoru.
Our dalla Val Müstair ans pervegnan solum s-charsas
novas c bain darer legia ün qualche notizia da lo in fin u
Toter da nos fögls publics, usche cha nos regents a Beim sa-
veron poch che chi dvainta vidvart il Buffalora, abain cha eir
in quaist chantunet ün po isolo vivan circa 1500 libers
Svizzers, dels quels la confederaziun non ho zuond bricha da 's
trupager ed ils quels piglian viva part als affers ed ineunters
pü plaschaivels e displaschaivels della patria. Eir füssan lo hom-
mens giuvens, ein podessan bain inserrir nels fögls qualche
notizia supra lur val, lur trafic e lur vita, seh' eis non s' uni-
formessan usche gugent al dit, chi tuna: „Bene vixit qui bene
latuit" ,*) e vairamaing non sto que usche mei in quels lös ed
in quellas vals, dellas quelas ün oda poch. Per uossa non
avains nus in sen da der al benevol lectur üna descripziun
della val Müstair, dimperse solum qualche notizias supra la
nova streda, chi maina our dell' Engiadina in q

In [120]:
# Using Apache Tika parser
path = os.path.join(os.getcwd(), 'DDC_450')

[pdf_to_txt_tika(path + '/' + entry) for entry in os.listdir(path) if entry.endswith('.pdf')]

Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/1.txt saved
Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/10.txt saved
Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/2.txt saved
Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/3.txt saved
Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/4.txt saved
Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/5.txt saved
Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/6.txt saved
Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/7.txt saved
Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/8.txt saved
Fulltext file C:\Users\kwoit\Documents\GitHub\DDC_450/9.txt saved


[None, None, None, None, None, None, None, None, None, None]

In [119]:
# check the content of a generated TXT file
with open('DDC_450/9.txt', 'r', encoding='utf-8') as f:
    fulltext = f.read()
    print(fulltext)




La strcda sur la muiitagua del Fuoru.

Our dalla Val Müstair ans pervegnan solum s-charsas
novas c bain darer legia ün qualche notizia da lo in fin u
Toter da nos fögls publics, usche cha nos regents a Beim sa-
veron poch che chi dvainta vidvart il Buffalora, abain cha eir
in quaist chantunet ün po isolo vivan circa 1500 libers
Svizzers, dels quels la confederaziun non ho zuond bricha da 's

trupager ed ils quels piglian viva part als affers ed ineunters
pü plaschaivels e displaschaivels della patria. Eir füssan lo hom-
mens giuvens, ein podessan bain inserrir nels fögls qualche
notizia supra lur val, lur trafic e lur vita, seh' eis non s' uni-
formessan usche gugent al dit, chi tuna: „Bene vixit qui bene
latuit" ,*) e vairamaing non sto que usche mei in quels lös ed
in quellas vals, dellas quelas ün oda poch. Per uossa non
avains nus in sen da der al benevol lectur üna descripziun
della val Müstair, dimperse solum qualche notizias supra la
nova streda, chi maina our dell' Engiadina