# ISSN matching

This notebook performs probabilistic matching between periodical metadata from the Royal Library of Belgium (KBR) and a dataset of Belgian periodicals from the ISSN center. The aim is to enrich KBR records with their corresponding ISSN. The matching is performed with [Splink](https://moj-analytical-services.github.io/splink).

We follow the Splink tutorial, beginning with extracting the relevant fields from the input XML data into CSV and standardizing them.

In [5]:
# Automatically reload Python modules when they have changed
%load_ext autoreload
%autoreload 2

import utils


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [None]:
from splink.exploratory import completeness_chart
from splink import DuckDBAPI
db_api = DuckDBAPI()

## 1 Get data

### 1.1 Get KBR data
Using the back-end catalog system Syracuse, we can export MARCXML metadata for all Belgian paper periodicals that do not yet have an ISSN, with the search query: `TYPN=PERE AND ISSN="" AND ALIE="Belg*"`.

The records have many relevant fields that could lead to potential matches:
* `035$a`: OCoLC identifier
* `041$a`: language of document
* `100$a`: author name (`100$*` for linked author authority)
* `245$a`: title
* `245$b`: remainder of title
* `245$c`: responsibility statement
* `264$a`: place of publication
* `264$b`: name of publisher
* `264$c`: date of production
* `490$a`: series statement
* `650$a`: subject index term (`$*` for linked subject index authority, Belgian Bibliography or FAST)
* `653$a`: uncontrolled index term
* `710$a`: Linked organization authorities (e.g. publishers and printers, separately indicated via MARC relator code in `$4`)
* `856$u`: URL
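
The field list above can be turned into columns with a few lines of standard-library Python. The following is a minimal sketch, not the implementation in `utils`: the sample record is hypothetical (values borrowed from the KBR output further down, indicators left blank), and the helper name `subfield_values` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Official MARCXML namespace
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical one-record sample; a real Syracuse export holds many records.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="245" ind1=" " ind2=" ">
      <subfield code="a">Le coq rouge</subfield>
      <subfield code="b">revue littéraire</subfield>
    </datafield>
    <datafield tag="264" ind1=" " ind2=" ">
      <subfield code="a">Bruxelles</subfield>
      <subfield code="b">Imprimerie Xavier Havermans</subfield>
    </datafield>
  </record>
</collection>"""


def subfield_values(record, tag, code):
    """Return all values of subfield `code` in datafields with the given `tag`."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [el.text.strip() for el in record.findall(path, MARC_NS) if el.text]


root = ET.fromstring(SAMPLE)
rows = []
for record in root.findall("marc:record", MARC_NS):
    # Fields are repeatable, so every column holds a list of values.
    rows.append({
        "title": subfield_values(record, "245", "a"),
        "titleRemainder": subfield_values(record, "245", "b"),
        "place": subfield_values(record, "264", "a"),
        "publisher": subfield_values(record, "264", "b"),
        "url": subfield_values(record, "856", "u"),
    })

print(rows)
```

Keeping each column as a list mirrors the 1:n nature of MARC fields: a record may carry several publishers or URLs, and an absent field simply yields an empty list.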

### 1.2 Get ISSN-plus data
As the national ISSN center, we were able to create an export of Belgian periodicals via a web interface.

This data is rather limited. The following relevant fields exist:
* `080$a`: Universal Decimal Classification
* `210$a`: Short title
* `245$a`: title
* `260$a`: place of publication
* `260$b`: name of publisher
* `856$u`: URL

## 2 Standardize data

### 2.1 Standardize Place names

The `place` column in the ISSN dataframe contains place names in different languages, e.g. `Anvers` (FR) or `Antwerpen` (NL), which both refer to the city of `Antwerp` (EN). We use a local GeoNames database and API to retrieve the uniform English name.

We ran the bash script `enrich-geonames-issn-plus.sh` that utilizes the script [geoname-enrichment](https://github.com/MetaBelgica/geoname-enrichment) (which makes use of an [internal API](https://github.com/kbrbe/geonames-lookup)).
Of the `38,589` records that passed the physical periodicals filter, `37,462` contained a place name (`97%`). The script enriched `33,346` of these (`89%`) in `27` minutes (`22` records per second). For `1,424` records no GeoNames ID was found, and for `2,691` multiple API results were reported.

The places for which nothing was found are often recently merged municipalities, such as
* Nazareth-De Pinte
* Pajottegem
* Tongeren-Borgloon

or foreign places for which we did not have country information (for all records in the dump, Belgium was indicated as the country):
* Paris
* Montréal

The aim of this step was to standardize spelling, not to enrich records with a GeoNames identifier. The share of records with consistently spelled place names is therefore higher than the `89%` that were enriched. For example, even though _Pajottegem_ was not found in GeoNames, it is likely spelled the same throughout the file, since it has no French translation as the city of Antwerp does. However, in a next step we have to merge the enriched records with those that could not be enriched; otherwise we would only work on a subset.
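
The merge described above could look roughly like the following pandas sketch. It assumes the enrichment results are available as a separate table keyed by the record identifier; the tiny dataframes and the identifiers `a`, `b`, `c` are made up for illustration, and only the column names `autID`, `place`, and `place-enriched` come from this notebook.

```python
import pandas as pd

# Hypothetical main table: every record with its original place name
records = pd.DataFrame({
    "autID": ["a", "b", "c"],
    "place": ["Anvers", "Antwerpen", "Pajottegem"],
})

# Hypothetical enrichment result: only records for which exactly one GeoNames match was found
enriched = pd.DataFrame({
    "autID": ["a", "b"],
    "place-enriched": ["Antwerp", "Antwerp"],
})

# A left merge keeps every record, including those without an enriched place
merged = records.merge(enriched, on="autID", how="left")

# Fall back to the original spelling where no enrichment exists
merged["place-enriched"] = merged["place-enriched"].fillna(merged["place"])

print(merged)
```

With the fallback in place, `Pajottegem` stays in the dataset under its original spelling, so the later matching step works on all records rather than only the enriched subset.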

Similarly, we enriched place names for KBR data. Of `24,673` records with a place name, `20,002` could be enriched in `16` minutes. For `3,497` no GeoNames ID was found, and for `1,173` multiple API results were reported.

For KBR data, common problems include those described above (foreign places or places with several occurrences in Belgium). Additionally, a field often contains two languages, e.g. `Bruxelles = Brussel` or `Brussel = Bruxelles`, as well as values like `= Bruxelles`.
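
One possible way to handle such bilingual values before the GeoNames lookup is to split on `=` and look up each variant separately. This is only a sketch of the idea; the function name `split_place_variants` is hypothetical and not part of `utils`.

```python
def split_place_variants(raw):
    """Split a place field on '=' and drop empty or duplicate parts."""
    variants = []
    for part in raw.split("="):
        name = part.strip()
        if name and name not in variants:
            variants.append(name)
    return variants


for value in ["Bruxelles = Brussel", "Brussel = Bruxelles", "= Bruxelles", "Leuven"]:
    print(value, "->", split_place_variants(value))
```

Each variant can then be sent to the lookup on its own; since `Bruxelles` and `Brussel` refer to the same city, both should resolve to the same uniform name.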

In [20]:
commonColumns = ['title', 'alternateTitle', 'publisher', 'place', 'place-enriched', 'udc', 'url']

In [13]:
dfKBR, dfsKBRCols = utils.createInputDataframe('kbr-data', 'kbr', commonColumns)

In [12]:
# Merge enriched and not enriched
dfKBR

Unnamed: 0,autID,country,udc,title,alternateTitle,titleRemainder,place,publisher,url
0,15145431,['be'],,"[""Bulletin de la Société belge d'études géogra...",['Tijdschrift van de Belgische Vereniging voor...,,['Leuven'],['Universiteit van Leuven. Aardrijkskundig Ins...,['https://www.belgicaperiodicals.be/link/opac/...
1,18584980,['be'],,['Voorname aanwinsten'],,['dienstjaar 1969'],['Brussel'],['Koninklijke Bibliotheek van België'],
2,18585150,['be'],,['Voorname aanwinsten'],,['dienstjaar 1977'],['Brussel'],['Koninklijke Bibliotheek van België'],
3,16474587,['be'],,['Voorname aanwinsten'],['Voorname aanwinsten'],,['Brussel'],['Koninklijke Bibliotheek van België'],['http://www.kbr.be']
4,16474568,['be'],,['Acquisitions majeures'],"['Acquisitions majeures', 'Acquisitions import...",,['Bruxelles'],['Bibliothèque royale de Belgique'],
...,...,...,...,...,...,...,...,...,...
26817,18443022,['be'],,['Rusthuis Aalmoezenier Cuypers te Stabroek'],,['1908-1983'],['Antwerpen'],['Heemkundige Kring van de Antwerpse Polder'],
26818,22482029,['be'],,['Bulletins et comptes rendus de la Société Cl...,,,['Bruxelles'],"['Le Scalpel', 'Imprimerie Raymond Fischlin', ...",
26819,15288181,['be'],,['Bulletin de la Société scientifique et litté...,,,['Tongres'],['imprimerie Collée'],
26820,16796898,['be'],,['Le coq rouge'],,['revue littéraire'],['Bruxelles'],['Imprimerie Xavier Havermans'],


In [22]:
dfISSN, dfsISSNCols = utils.createInputDataframe('issn-plus-data', 'issn-plus', commonColumns + ['keyTitle'])

In [23]:
dfISSN

Unnamed: 0,autID,country,udc,title,alternateTitle,shortTitle,keyTitle,place,publisher,url
0,3041-5543,['be'],['53'],['Annales générales des sciences physiques.'],,['Ann. gen. sci. phys.'],['Annales générales des sciences physiques'],['Bruxelles'],['Weissenbruck'],
1,3041-5608,['be'],"['785.6', '7 (059.3)']",['Curious.'],,['Curious'],['Curious'],['Brugge'],['Concertgebouw Brugge'],
2,3041-5659,['be'],"['631', '636']",['Journal grandes cultures.'],,['J. gd. cult.'],['Journal grandes cultures'],['Antwerpen'],['Prosu Media Producties'],
3,3041-5667,['be'],['005.92'],['Miscellanea archivistica -Studia.'],,['Misc. arch. Stud.'],['Miscellanea archivistica - Studia'],['Bruxelles'],"[""Archives générales du Royaume et Archives de...",
4,3041-5675,['be'],['005.92'],['Miscellanea archivistica - Manuale.'],,['Misc. arch. man.'],['Miscellanea archivistica - Manuale'],['Bruxelles'],"[""Archives générales du Royaume et Archives de...",
...,...,...,...,...,...,...,...,...,...,...
36560,2795-9066,['be'],['82‑1'],['\x98Le \x9cTaureau.'],,,['\x98Le \x9cTaureau'],['Bruxelles'],['Le Taureau'],
36561,0779-3235,['be'],['070.445'],['\x98Le \x9cJournal des enfants.'],,,['\x98Le \x9cJournal des enfants'],['Namur'],['Le journal des enfants'],
36562,3041-7929,['be'],['070.445'],['Pour nos enfants.'],,,['Pour nos enfants'],['Bruxelles'],['Cinéac'],
36563,3041-7619,['be'],"['272/273', '614.253.4']",['Lenteweelde.'],['J.V.K.A. Orgaan voor katholieke studentenakt...,,['Lenteweelde. J.V.K.A. Orgaan voor katholieke...,['Averbode'],['Averbode'],


In [None]:
# We standardized based on the 1:n relationship files; we still have to merge the results into the main dataframe

### 2.2 Standardize KBR data

In [None]:
# Break down classifications such as '654.165 (05) (493.2 B.)'
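
A minimal sketch of such a breakdown, assuming it is enough to separate the main number from the parenthesised auxiliaries. The function name `split_udc` is hypothetical, and the sketch does not interpret the auxiliaries further.

```python
import re

# Matches one parenthesised auxiliary such as "(05)" or "(493.2 B.)"
AUXILIARY = re.compile(r"\(([^)]*)\)")


def split_udc(classification):
    """Return the main UDC number and the list of parenthesised auxiliaries."""
    auxiliaries = [aux.strip() for aux in AUXILIARY.findall(classification)]
    main = AUXILIARY.sub("", classification).strip()
    return main, auxiliaries


for value in ["654.165 (05) (493.2 B.)", "7 (059.3)", "272/273"]:
    print(value, "->", split_udc(value))
```

Comparing only the main number between KBR and ISSN records would make the `udc` column less sensitive to differences in how auxiliaries were recorded.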