# BELTRANS Data Integration Analysis

This notebook contains an analysis of data from the BELTRANS project.

Before this analysis, each data source was loaded into a separate "namespace" of our **Blazegraph** triple store. Data sources already available as RDF were uploaded directly to the triple store; heterogeneous data were first mapped to RDF via [rml.io](https://rml.io).

We use `SPARQL` queries to investigate how the different data sources overlap. The objective is to identify optimal integration strategies that increase the size and interoperability of KBR authority data.

# Table of Contents

1. [Goals](#Goals)
2. [Approach](#Approach)
3. [Integration Steps](#Integration-steps)<br>
  3.1. [Enrich with identifiers](#Enrich-with-identifiers)<br>
  3.2. [Enrich with nationalities](#Enrich-with-nationalities)<br>
  3.3. [Enrich with new authorities](#Enrich-with-new-authorities)<br>

In [1]:
import sys
sys.path.insert(0, './utils/')

from SPARQLWrapper import SPARQLWrapper, TURTLE, JSON
import sparql_utils

# Establish connections to SPARQL endpoints
endpointKBR = SPARQLWrapper('http://wikibase-test-srv01.kbr.be/sparql/namespace/kbr-belgians/sparql')
endpointWikidataBelgians = SPARQLWrapper('http://wikibase-test-srv01.kbr.be/sparql/namespace/wikidata-belgians/sparql')
endpointBnf = SPARQLWrapper('http://wikibase-test-srv01.kbr.be/sparql/namespace/bnf-authors-persons/sparql')
endpointNTA = SPARQLWrapper('http://wikibase-test-srv01.kbr.be/sparql/namespace/kb-authors-nta/sparql')

## Goals
We specify the goals of data integration so we can measure and validate the outcome of our integration activities.

1. We need to get Belgian authors we do not already have in our KBR authors dataset
  * *-> This increases the size of authority data at KBR*
2. We need to enrich KBR authors without nationality with any found nationality
  * *-> This marks authors in our dataset as non-Belgians so that we can give them lower priority in the future*
3. We need to enrich KBR authors without nationality with Belgian nationality if they have it
  * *-> This marks authors in our dataset as Belgians so that we can prioritize them in the future*
4. We need to enrich KBR authors with missing ISNI and VIAF numbers
  * *-> This increases interoperability (similar for the following two goals)*
5. We need to enrich KBR authors with identifiers from other libraries
6. We need to enrich KBR authors with common identifiers such as Wikidata or DBpedia


## Initial statistics
Before integration we measure the following values related to the goals defined above.
We measure the values based on two data sources: an export of Belgian authors from the KBR Syracuse system,
and a Wikidata export of Belgians, optionally including other identifiers, that was manually enriched with KBR identifiers.

In [2]:
# getInitialGoalStats()

# Todo: get the following data once we have access to a complete Syracuse dump
# number authors: 800,000
# number of authors without known nationality
# number of records with ISNI number
# number of records with VIAF number
# number of records with link to BnF
# number of records with link to KB
# number of records with link to DNB
# number of records with link to Wikidata

# number Belgian authors
belgianKBRAuthors = sparql_utils.getNumberOfBelgiansLOCURI(endpointKBR)
belgianKBRAuthorsWithISNI = sparql_utils.getNumberOfBelgiansWithISNI(endpointKBR)
belgianKBRAuthorsWithVIAF = sparql_utils.getNumberOfBelgiansWithIdentifier(endpointKBR, 'VIAF')

belgianWikidata = sparql_utils.getNumberOfBelgiansLOCURI(endpointWikidataBelgians)
belgianWikidataWithKBR = sparql_utils.getNumberOfBelgiansWithIdentifier(endpointWikidataBelgians, 'KBR')
belgianWikidataWithBnF = sparql_utils.getNumberOfBelgiansWithIdentifier(endpointWikidataBelgians, "BnF")
belgianWikidataWithKB = sparql_utils.getNumberOfBelgiansWithIdentifier(endpointWikidataBelgians, "NTA")
belgianWikidataWithDNB = sparql_utils.getNumberOfBelgiansWithIdentifier(endpointWikidataBelgians, "DNB")

# Todo create a data frame to store every value also with percentages to nicely display it


f"{belgianWikidata} Wikidata Belgians, {belgianWikidataWithKBR} with KBR identifier, {belgianWikidataWithBnF} with BnF identifier {belgianWikidataWithKB} with KB identifier, {belgianWikidataWithDNB} with DNB identifier"

'51930 Wikidata Belgians, 2377 with KBR identifier, 6083 with BnF identifier 7583 with KB identifier, 5346 with DNB identifier'

In [3]:
f"The KBR Belgian export contains {belgianKBRAuthors} Belgians from which {belgianKBRAuthorsWithISNI} ({(belgianKBRAuthorsWithISNI*100)/belgianKBRAuthors}%) have an ISNI number and {belgianKBRAuthorsWithVIAF} ({(belgianKBRAuthorsWithVIAF*100)/belgianKBRAuthors}%) have a VIAF number"


'The KBR Belgian export contains 18009 Belgians from which 6509 (36.14303959131545%) have an ISNI number and 219 (1.2160586373479927%) have a VIAF number'
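
The counts printed above are easier to compare when each one is shown next to its share of the respective total. The following is a minimal sketch of such a summary, using only the numbers from the two cell outputs above; the function name `build_rows` and the row layout are illustrative assumptions, not part of `sparql_utils`.

```python
# Counts copied from the cell outputs above.
kbr_total = 18009
kbr_counts = {"ISNI": 6509, "VIAF": 219}

wikidata_total = 51930
wikidata_counts = {"KBR": 2377, "BnF": 6083, "KB (NTA)": 7583, "DNB": 5346}


def build_rows(source, total, counts):
    """Return one row per identifier with its count and percentage of the total."""
    return [
        {
            "source": source,
            "identifier": name,
            "count": count,
            "percent": round(count * 100 / total, 2),
        }
        for name, count in counts.items()
    ]


rows = build_rows("KBR", kbr_total, kbr_counts) + build_rows(
    "Wikidata", wikidata_total, wikidata_counts
)

# Plain-text table; the percentages are rounded to two decimals.
for row in rows:
    print(
        f"{row['source']:<10}{row['identifier']:<10}"
        f"{row['count']:>7}{row['percent']:>8.2f}%"
    )
```

Because `rows` is a list of dictionaries, it can be passed as-is to `pandas.DataFrame(rows)` if a data frame is preferred for display.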


## Approach
We define how we want to achieve the goals using existing data sources.

Each data source is represented as RDF and can be queried using SPARQL. If a data source is not yet represented in RDF we lift it to RDF using the framework [rml.io](https://rml.io).
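
To illustrate what querying a source with SPARQL can look like, here is a minimal sketch of a count helper. The `sparql_utils` module is not shown in this notebook, so this is not its implementation: the function name `count_belgians_with_identifier`, the `ex:` prefix and both predicates are hypothetical placeholders that would need to be replaced by the vocabulary actually used in each namespace.

```python
# Hypothetical sketch: the prefix and predicates are placeholders,
# not the vocabulary actually used in the BELTRANS triple store.
COUNT_QUERY_TEMPLATE = """
PREFIX ex: <http://example.org/vocab#>

SELECT (COUNT(DISTINCT ?person) AS ?total)
WHERE {{
  ?person ex:nationality ex:Belgium ;
          ex:{identifier_property} ?identifier .
}}
"""


def count_belgians_with_identifier(endpoint, identifier_property):
    """Run the count query on a SPARQLWrapper-like endpoint and return an int.

    `endpoint` only needs the methods setQuery, setReturnFormat and query,
    which SPARQLWrapper instances provide.
    """
    query = COUNT_QUERY_TEMPLATE.format(identifier_property=identifier_property)
    endpoint.setQuery(query)
    endpoint.setReturnFormat("json")  # assumed to match SPARQLWrapper's JSON constant
    result = endpoint.query().convert()
    # SPARQL JSON results list one binding per row; a COUNT query yields one row.
    return int(result["results"]["bindings"][0]["total"]["value"])
```

Under these assumptions, a call such as `count_belgians_with_identifier(endpointKBR, "isni")` would return the number of Belgians in that namespace that carry the given identifier.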

Generally, we have two phases. First, we identify missing information based on linking; second, we query information to enrich our KBR data. The second phase may involve manual tasks.

Please note that ISNI and VIAF can be used in two ways: either as identifiers within KBR data and other data sources, which enables linking, or as dedicated data sources, because ISNI (issued by OCLC) and VIAF provide data dumps themselves.

### Complement KBR data with new Belgians
To achieve goal *1.* we can ask the following question per data source to identify relevant data:

* Which authors have Belgian nationality?

We therefore first identify Belgians in the source. However, some of these identified Belgians may already be in our KBR data. Using identifiers such as ISNI or VIAF which are present in the KBR data, we can narrow the initial search result down to a smaller dataset. This smaller dataset can then be analyzed further for overlap with KBR data using automatic entity linking techniques, manual checks, or a combination of both.
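
The narrowing step can be sketched in plain Python as follows. This is only an illustration under assumptions: the record layout, the function name `split_by_known_identifiers` and all identifier values are made-up placeholders, and in practice the inputs would come from SPARQL query results.

```python
# Hypothetical data: identifier values are made-up placeholders.
kbr_identifiers = {("isni", "isni-001"), ("viaf", "viaf-002")}

source_belgians = [
    {"name": "Author A", "identifiers": {"isni": "isni-001"}},
    {"name": "Author B", "identifiers": {"viaf": "viaf-999"}},
    {"name": "Author C", "identifiers": {}},
]


def split_by_known_identifiers(records, known_identifiers):
    """Separate records that share an identifier with KBR from the rest."""
    already_in_kbr, candidates = [], []
    for record in records:
        # Compare (scheme, value) pairs so an ISNI never matches a VIAF number.
        record_ids = set(record["identifiers"].items())
        if record_ids & known_identifiers:
            already_in_kbr.append(record)
        else:
            candidates.append(record)
    return already_in_kbr, candidates


already_in_kbr, candidates = split_by_known_identifiers(
    source_belgians, kbr_identifiers
)
print([r["name"] for r in already_in_kbr])  # ['Author A']
print([r["name"] for r in candidates])      # ['Author B', 'Author C']
```

Records in `candidates` are not necessarily new: they only lack a shared identifier, so they are the ones that still need entity linking or manual checks.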


### Identify unknown nationalities in KBR data
To achieve goals *2.* and *3.* we can ask the following questions per data source to identify relevant data:

* Which KBR authors without nationality are actually Belgians?
* Which KBR authors with a nationality other than Belgian are *also* Belgians?
* Which KBR authors without nationality have a known nationality other than Belgian?

KBR authors with unknown nationality can be supplemented with nationality information by linking them to other data sources, either via library-related identifiers such as ISNI and VIAF or via common identifiers such as Wikidata or DBpedia.
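
Once a KBR author is linked to a record in another source, the three questions above reduce to a comparison of two sets of nationalities. A minimal sketch, assuming nationalities are available as plain strings; the function name and the returned labels are illustrative only:

```python
BELGIUM = "Belgium"


def classify_nationality_update(kbr_nationalities, linked_nationalities):
    """Classify what a linked record adds to the nationality known at KBR."""
    kbr = set(kbr_nationalities)
    linked = set(linked_nationalities)
    if BELGIUM in linked and not kbr:
        return "unknown -> Belgian"            # first question
    if BELGIUM in linked and BELGIUM not in kbr:
        return "also Belgian"                  # second question
    if linked and not kbr:
        return "unknown -> other nationality"  # third question
    return "no change"


print(classify_nationality_update([], ["Belgium"]))                    # unknown -> Belgian
print(classify_nationality_update(["France"], ["France", "Belgium"]))  # also Belgian
print(classify_nationality_update([], ["France"]))                     # unknown -> other nationality
print(classify_nationality_update(["Belgium"], ["Belgium"]))           # no change
```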

### Enrich KBR data with authority-related identifiers
To achieve goal *4.* we can ask the following question per data source to identify relevant data:

* Which (Belgian) KBR authors are present in a data source and have a VIAF or ISNI number?

Since we obviously only use KBR authors without such identifiers as a starting point, we have to link to data sources via other identifiers, such as library identifiers or Wikidata, to find matches and extract possible VIAF and ISNI numbers.
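
The idea of linking through one identifier in order to harvest others can be sketched as follows. All data, the function name `harvest_identifiers` and its parameters are hypothetical; the sketch only proposes values for identifiers an author does not have yet and never overwrites existing ones.

```python
# Hypothetical data: all identifier values are made-up placeholders.
kbr_authors = {
    "kbr-1": {"wikidata": "Q-placeholder-1"},
    "kbr-2": {"wikidata": "Q-placeholder-2", "isni": "isni-existing"},
    "kbr-3": {},
}

# External records indexed by the identifier used for linking.
external_by_wikidata = {
    "Q-placeholder-1": {"isni": "isni-111", "viaf": "viaf-111"},
    "Q-placeholder-2": {"isni": "isni-other", "viaf": "viaf-222"},
}


def harvest_identifiers(authors, external_index, link_scheme, wanted=("isni", "viaf")):
    """Propose missing identifiers for authors that can be linked via link_scheme."""
    proposals = {}
    for author_id, ids in authors.items():
        # Authors without the linking identifier simply get no external record.
        external = external_index.get(ids.get(link_scheme), {})
        missing = {
            scheme: external[scheme]
            for scheme in wanted
            if scheme not in ids and scheme in external
        }
        if missing:
            proposals[author_id] = missing
    return proposals


print(harvest_identifiers(kbr_authors, external_by_wikidata, "wikidata"))
# {'kbr-1': {'isni': 'isni-111', 'viaf': 'viaf-111'}, 'kbr-2': {'viaf': 'viaf-222'}}
```

Passing a different `wanted` tuple would apply the same pattern to other kinds of identifiers.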

### Enrich KBR data with identifiers from other libraries
This is similar to the previous section, but here the identifiers of other libraries are the target rather than the means of finding VIAF and ISNI numbers. The following question can therefore be asked per data source to achieve goal *5.* and identify relevant data:

* Which (Belgian) KBR authors are present in a data source and have other library identifiers?

### Enrich KBR data with common identifiers
There is already a Wikidata dump manually enriched with KBR identifiers. For other sources such as DBpedia we can ask the following question per data source to achieve goal *6.* and identify relevant data:

* Which (Belgian) KBR authors are present in a data source and have a link to a common identifier?


## Integration steps
We apply the aforementioned approach to our data sources. We start with steps to enrich data with identifiers and enrich nationality information before executing steps to add new data because the former steps help to determine what is "new".

### Enrich with identifiers

### Enrich with nationalities


### Enrich with new authorities