# ISSN matching

This notebook performs a probabilistic matching between periodical metadata from the Royal Library of Belgium (KBR) and a dataset of Belgian periodicals from the ISSN center. The aim is to enrich KBR records with their corresponding ISSN. The matching is performed with [Splink](https://moj-analytical-services.github.io/splink).

We follow the Splink tutorial, beginning with extracting the relevant fields from the input XML data into CSV and standardizing them.

## 1 Get data

### 1.1 Get KBR data
With the back-end catalog system Syracuse we can export MARCXML metadata about all Belgian paper periodicals that do not already have an ISSN with the search query: `TYPN=PERE AND ISSN="" AND ALIE="Belg*"`.

The records have many relevant fields that could lead to potential matches:
* `035$a`: OCoLC identifier
* `041$a`: language of document
* `100$a`: author name (`100$*` for linked author authority)
* `245$a`: title
* `245$b`: remainder of title
* `245$c`: responsibility statement
* `264$a`: place of publication
* `264$b`: name of publisher
* `264$c`: date of production
* `490$a`: series statement
* `650$a`: subject index term (`$*` for linked subject index authority, Belgian Bibliography or FAST)
* `653$a`: uncontrolled index term
* `710$a`: Linked organization authorities (e.g. publishers and printers, separately indicated via MARC relator code in `$4`)
* `856$u`: URL
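
As a rough illustration of how such fields can be pulled out of a MARCXML export, here is a minimal sketch using only the Python standard library. The selection of fields, the column names, the `;` separator for repeated subfields and the sample record are all assumptions made for this example; they are not taken from the actual export or extraction script.

```python
import csv
import io
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical mapping of (MARC tag, subfield code) to CSV column name
FIELDS = {
    ("245", "a"): "title",
    ("245", "b"): "title_remainder",
    ("264", "a"): "place",
    ("264", "b"): "publisher",
}

# Made-up sample record, only for demonstration
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <controlfield tag="001">12345</controlfield>
    <datafield tag="245" ind1="0" ind2="0">
      <subfield code="a">Het voorbeeld</subfield>
      <subfield code="b">een tijdschrift</subfield>
    </datafield>
    <datafield tag="264" ind1=" " ind2="1">
      <subfield code="a">Antwerpen</subfield>
      <subfield code="a">Anvers</subfield>
      <subfield code="b">Voorbeelduitgever</subfield>
    </datafield>
  </record>
</collection>"""


def extract_records(xml_text):
    """Return one dict per MARC record with the selected subfields."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter("{http://www.loc.gov/MARC21/slim}record"):
        row = {"id": ""}
        control = record.find("marc:controlfield[@tag='001']", MARC_NS)
        if control is not None and control.text:
            row["id"] = control.text.strip()
        for (tag, code), column in FIELDS.items():
            path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
            values = [sf.text.strip() for sf in record.findall(path, MARC_NS) if sf.text]
            # Repeated subfields are joined with ';' (an assumption of this sketch)
            row[column] = ";".join(values)
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialize the extracted rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", *FIELDS.values()])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_records(SAMPLE)
csv_text = to_csv(rows)
```

Joining repeated subfields into one cell keeps a single row per record; a 1:n file per field would be an equally valid design.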

### 1.2 Get ISSN-plus data
As the national ISSN center, we were able to create an export of Belgian periodicals via a web interface.

This data is rather limited. The following relevant fields exist:
* `080$a`: Universal Decimal Classification
* `210$a`: Short title
* `245$a`: title
* `260$a`: place of publication
* `856$u`: URL

## 2 Standardize data


### 2.1 Standardize ISSN data

The place column in the ISSN dataframe contains place names in different languages, e.g. `Anvers` (FR) or `Antwerpen` (NL), which both refer to the city of `Antwerp` (EN). We use a local GeoNames database and API to retrieve the uniform English name.

We ran the bash script `enrich-geonames.sh`, which uses the script [geoname-enrichment](https://github.com/MetaBelgica/geoname-enrichment) (which in turn makes use of an [internal API](https://github.com/kbrbe/geonames-lookup)).
Of the `38,589` records that passed the physical periodicals filter, `37,462` contained a place name. The script enriched `33,346` of these (`89%`) in `27` minutes (`22` records per second). For `1,424` nothing was found and for `2,691` multiple API results were reported.
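
The three outcomes reported above (exactly one result, nothing found, multiple results) can be sketched as follows. This is not the logic of `enrich-geonames.sh` or of the internal API: the `lookup` function and the `FAKE_GEONAMES` table are hypothetical stand-ins with made-up entries, used only to show how a place name might be classified and when the original spelling is kept.

```python
from collections import Counter

# Hypothetical stand-in for the GeoNames lookup: lowercased place name
# mapped to the English names returned for it (made-up entries)
FAKE_GEONAMES = {
    "anvers": ["Antwerp"],
    "antwerpen": ["Antwerp"],
    "halle": ["Halle (candidate 1)", "Halle (candidate 2)"],  # ambiguous
}


def lookup(place):
    """Return all candidate English names for a place (hypothetical)."""
    return FAKE_GEONAMES.get(place.strip().lower(), [])


def uniformize(place):
    """Return (name, status); keep the original spelling unless exactly one match exists."""
    candidates = lookup(place)
    if len(candidates) == 1:
        return candidates[0], "enriched"
    if not candidates:
        return place, "not_found"
    return place, "multiple"


places = ["Anvers", "Antwerpen", "Pajottegem", "Halle"]
results = [uniformize(p) for p in places]

# Count how many places fall into each outcome
stats = Counter(status for _, status in results)
```

Keeping the original spelling for the `not_found` and `multiple` cases means no record loses its place information, which matters for the later merge.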

The places for which nothing was found are often recently merged municipalities, such as
* Nazareth-De Pinte
* Pajottegem
* Tongeren-Borgloon

or foreign places for which we had no country information (every record in the dump indicates Belgium as the country)
* Paris
* Montréal

This step aimed to make spellings uniform, not to enrich records with a GeoNames identifier. Consequently, more than the enriched `89%` of records have uniform data. For example, although _Pajottegem_ was not found in GeoNames, it is likely spelled the same way throughout the file because, unlike Antwerp, it has no French translation. In a next step, however, we have to merge the enriched records with those that could not be enriched; otherwise we would only work on a subset.
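
One possible way to combine both groups with pandas is sketched below. The dataframes, the column names (`record_id`, `place`, `place_uniform`) and the sample rows are assumptions for illustration, not the actual output of the enrichment script. Records without a GeoNames match simply keep their original spelling.

```python
import pandas as pd

# Made-up sample data; column names are assumptions of this sketch
enriched = pd.DataFrame({
    "record_id": ["r1", "r2"],
    "place": ["Anvers", "Antwerpen"],
    "place_uniform": ["Antwerp", "Antwerp"],
})
not_enriched = pd.DataFrame({
    "record_id": ["r3", "r4"],
    "place": ["Pajottegem", "Paris"],
})

# Fall back to the original spelling where no uniform name was found
not_enriched = not_enriched.assign(place_uniform=not_enriched["place"])

# Stack both groups so that no record is lost
all_places = pd.concat([enriched, not_enriched], ignore_index=True)
```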

In [None]:
# Merge enriched and not enriched

In [None]:
# We standardized based on the 1:n relationship files; we still have to merge the results into the main dataframe
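
A minimal sketch of how 1:n results might be brought back to one row per record, assuming a main dataframe keyed by a hypothetical `record_id` column and a places file with one row per (record, place) pair. All names and sample values are made up, and joining multiple places with `;` is just one possible choice.

```python
import pandas as pd

# Made-up main dataframe: one row per record
main = pd.DataFrame({
    "record_id": ["r1", "r2", "r3"],
    "title": ["Title one", "Title two", "Title three"],
})

# Made-up 1:n file: a record can have several places
places = pd.DataFrame({
    "record_id": ["r1", "r1", "r2"],
    "place_uniform": ["Antwerp", "Brussels", "Antwerp"],
})

# Collapse the 1:n relationship to one row per record
places_per_record = (
    places.groupby("record_id")["place_uniform"]
    .agg(lambda names: ";".join(sorted(set(names))))
    .reset_index()
)

# Left join keeps records without any place; validate guards against duplicated keys
merged = main.merge(places_per_record, on="record_id", how="left", validate="one_to_one")
```

The left join keeps `r3`, which has no place, with a missing value in `place_uniform`.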

### 2.2 Standardize KBR data