# BELTRANS Data Integration Analysis

This notebook contains an analysis of data from the BELTRANS project.

Before this analysis, each data source was loaded into a separate "namespace" of our **Blazegraph** triple store. Data sources that were already in RDF were uploaded directly to the triple store; heterogeneous data were first mapped to RDF via rml.io.


For each data source, we investigate how complete it is with respect to different properties.
The individual analysis steps are implemented as `SPARQL` queries.
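
As a minimal sketch of what "completeness" could mean here, the share of entities carrying a property can be estimated by dividing the number of triples with that property by the number of instances of a class. The helper name `completenessRatio` is hypothetical and not part of the project code. The estimate assumes that each entity has at most one value per property, so multi-valued properties would inflate it. The two counts in the usage example are the `schema:birthDate` and `schema:Person` counts reported for the KB NTA namespace further down in this notebook.

```python
def completenessRatio(propertyCount, instanceCount):
    """Estimate the share of instances that carry a property (hypothetical helper).

    Assumes at most one value per instance. Counts arrive as strings
    in the overview lists, so they are converted to int first.
    """
    instances = int(instanceCount)
    if instances == 0:
        # Avoid a division by zero for empty namespaces
        return 0.0
    return int(propertyCount) / instances

# schema:birthDate triples and schema:Person instances of the KB NTA namespace
ratio = completenessRatio('732968', '2751633')
print(f"birthDate completeness: {ratio:.1%}")
```

A ratio above 1.0 would signal that the one-value-per-entity assumption does not hold for that property, in which case a `COUNT(DISTINCT ?s)` query would be the more reliable measure.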

In [43]:
from SPARQLWrapper import SPARQLWrapper, TURTLE, JSON
import os

bnfDir = "/home/slieber/repos/beltrans-data/data-sources/bnf/"
kbDir = "/home/slieber/repos/beltrans-data/data-sources/kb/"

## Functions
The following blocks define helper functions that are used throughout this analysis. We follow the *don't repeat yourself* (DRY) principle: functionality is broken down into small, reusable functions. This also makes the analysis more readable, because the reading flow is not interrupted by large chunks of code; only functions with self-explanatory names are called.

In [69]:
# ------------------------------------------------------------
def readSPARQLQuery(filename):
    """Read a SPARQL query from file and return the content as a string."""
    with open(filename, 'r') as reader:
        return reader.read()

# ------------------------------------------------------------
def _convertValueCount(result):
    """Convert SPARQL result bindings of a '?value ?count' query to a list of [count, value] pairs."""
    return [[r['count']['value'], r['value']['value']] for r in result['results']['bindings']]

# ------------------------------------------------------------
def getPropertyOverview(sparqlObject):
    """Query the given SPARQL endpoint (SPARQLWrapper object) for all properties and how often each is used."""
    sparqlObject.setQuery(readSPARQLQuery('/home/slieber/repos/beltrans-data/data-sources/get-property-overview.sparql'))
    sparqlObject.setReturnFormat(JSON)
    result = sparqlObject.queryAndConvert()
    return _convertValueCount(result)

# ------------------------------------------------------------
def getClassOverview(sparqlObject):
    """Query the given SPARQL endpoint (SPARQLWrapper object) for all classes and their number of instances."""
    sparqlObject.setQuery(readSPARQLQuery('/home/slieber/repos/beltrans-data/data-sources/get-class-overview.sparql'))
    sparqlObject.setReturnFormat(JSON)
    result = sparqlObject.queryAndConvert()
    return _convertValueCount(result)
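
The `.sparql` files read by these helpers are not shown in this notebook. The sketch below therefore assumes a query of roughly the following shape for the property overview, and illustrates the structure of the SPARQL 1.1 JSON result that `_convertValueCount` consumes. The query text, the constant `PROPERTY_OVERVIEW_QUERY`, and the `sampleResult` dictionary are assumptions for illustration; the two counts are taken from the KB NTA output later in the notebook.

```python
# Hypothetical query text; the actual get-property-overview.sparql may differ.
PROPERTY_OVERVIEW_QUERY = """
SELECT ?value (COUNT(*) AS ?count)
WHERE { ?s ?value ?o }
GROUP BY ?value
ORDER BY DESC(?count)
"""

# Hand-written sample in the shape of a SPARQL 1.1 JSON result
sampleResult = {
    'head': {'vars': ['value', 'count']},
    'results': {'bindings': [
        {'value': {'type': 'uri', 'value': 'http://schema.org/birthDate'},
         'count': {'type': 'literal',
                   'datatype': 'http://www.w3.org/2001/XMLSchema#integer',
                   'value': '732968'}},
        {'value': {'type': 'uri', 'value': 'http://schema.org/deathDate'},
         'count': {'type': 'literal',
                   'datatype': 'http://www.w3.org/2001/XMLSchema#integer',
                   'value': '234985'}},
    ]}
}

def convertValueCount(result):
    """Same logic as _convertValueCount, repeated so this sketch runs on its own."""
    return [[r['count']['value'], r['value']['value']]
            for r in result['results']['bindings']]

rows = convertValueCount(sampleResult)
print(rows)
```

Note that the counts stay strings after conversion, so they need an explicit `int()` before any arithmetic or numeric sorting.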
    

## Koninklijke Bibliotheek van Nederland (KB)

This data source contains information about (i) authors, and (ii) publications.

Regarding **authors**, three different datasets are provided: the thesaurus of Dutch authors (NTA), the digital library of Dutch letters (DBNLA), and organizations from the corporation thesaurus.

Regarding **publications** ...

### Properties from the KB author thesaurus
The following lists the classes and properties in this dataset together with how often they occur, to give a broad overview of the data source.

In [7]:
kbAuthorsNTA = SPARQLWrapper('http://localhost:8090/bigdata/namespace/kb-authors-nta')

In [75]:
getClassOverview(kbAuthorsNTA)

[['2751633', 'http://schema.org/Dataset'],
 ['2751633', 'http://schema.org/Person'],
 ['2751633', 'http://schema.org/WebPage'],
 ['41', 'http://www.w3.org/2000/01/rdf-schema#Resource'],
 ['22', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#Property'],
 ['18', 'http://www.w3.org/2000/01/rdf-schema#Class'],
 ['1', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#List'],
 ['1', 'http://www.w3.org/2000/01/rdf-schema#Datatype']]

In [70]:
getPropertyOverview(kbAuthorsNTA)

[['8254982', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'],
 ['4446148', 'http://schema.org/sameAs'],
 ['2826041', 'http://www.w3.org/2000/01/rdf-schema#label'],
 ['2793771', 'http://schema.org/name'],
 ['2793768', 'http://schema.org/familyName'],
 ['2751633', 'http://data.bibliotheken.nl/def#ppn'],
 ['2751633', 'http://schema.org/dateModified'],
 ['2751633', 'http://schema.org/isBasedOn'],
 ['2751633', 'http://schema.org/isPartOf'],
 ['2751633', 'http://schema.org/license'],
 ['2751633', 'http://schema.org/mainEntity'],
 ['2751633', 'http://schema.org/mainEntityOfPage'],
 ['2751633', 'http://www.w3.org/2002/07/owl#sameAs'],
 ['2734833', 'http://schema.org/givenName'],
 ['942380', 'http://schema.org/alternateName'],
 ['751540', 'http://schema.org/description'],
 ['732968', 'http://schema.org/birthDate'],
 ['234985', 'http://schema.org/deathDate'],
 ['115470', 'http://schema.org/nationality'],
 ['47', 'http://www.w3.org/2000/01/rdf-schema#subClassOf'],
 ['25', 'http://www.w3.org/20
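
The two overviews can be cross-checked against each other. The following is a minimal sketch, assuming the class query counts one row per `rdf:type` triple (the query file itself is not shown): in that case the class counts should add up to the `rdf:type` count of the property overview. The helper name `sumCounts` is hypothetical; the numbers are copied from the two outputs above.

```python
def sumCounts(overview):
    """Add up the string counts of a [[count, IRI], ...] overview list."""
    return sum(int(count) for count, _ in overview)

# Class overview of the kb-authors-nta namespace, copied from the output above
ntaClassOverview = [
    ['2751633', 'http://schema.org/Dataset'],
    ['2751633', 'http://schema.org/Person'],
    ['2751633', 'http://schema.org/WebPage'],
    ['41', 'http://www.w3.org/2000/01/rdf-schema#Resource'],
    ['22', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#Property'],
    ['18', 'http://www.w3.org/2000/01/rdf-schema#Class'],
    ['1', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#List'],
    ['1', 'http://www.w3.org/2000/01/rdf-schema#Datatype'],
]

# rdf:type count from the property overview above
ntaTypeTriples = 8254982

totalInstances = sumCounts(ntaClassOverview)
print(totalInstances, totalInstances == ntaTypeTriples)
```

For this namespace the two numbers agree, which suggests that the class and property overviews were computed over the same set of triples.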

### Properties from the KB DBNLA dataset
The following lists the classes and properties from the **Digitale Bibliotheek voor de Nederlandse Letteren (DBNL)**.

In [None]:
kbAuthorsDBNLA = SPARQLWrapper('http://localhost:8090/bigdata/namespace/kb-authors-dbnla')

In [74]:
getClassOverview(kbAuthorsDBNLA)

[['109739', 'http://schema.org/Dataset'],
 ['109739', 'http://schema.org/WebPage'],
 ['109739', 'http://schema.org/Person'],
 ['41', 'http://www.w3.org/2000/01/rdf-schema#Resource'],
 ['22', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#Property'],
 ['18', 'http://www.w3.org/2000/01/rdf-schema#Class'],
 ['1', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#List'],
 ['1', 'http://www.w3.org/2000/01/rdf-schema#Datatype']]

In [71]:
getPropertyOverview(kbAuthorsDBNLA)

[['329300', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'],
 ['124285', 'http://schema.org/alternateName'],
 ['109739', 'http://schema.org/license'],
 ['109739', 'http://schema.org/mainEntity'],
 ['109739', 'http://schema.org/mainEntityOfPage'],
 ['109739', 'http://schema.org/familyName'],
 ['109739', 'http://schema.org/name'],
 ['109739', 'http://schema.org/isPartOf'],
 ['109739', 'http://schema.org/dateModified'],
 ['109739', 'http://schema.org/identifier'],
 ['109739', 'http://schema.org/gender'],
 ['109739', 'http://schema.org/url'],
 ['109739', 'http://www.w3.org/2000/01/rdf-schema#label'],
 ['109739', 'http://www.w3.org/2002/07/owl#sameAs'],
 ['107021', 'http://schema.org/givenName'],
 ['102230', 'http://schema.org/birthDate'],
 ['59213', 'http://schema.org/hasOccupation'],
 ['53343', 'http://schema.org/deathDate'],
 ['32029', 'http://schema.org/birthPlace'],
 ['23998', 'http://schema.org/deathPlace'],
 ['47', 'http://www.w3.org/2000/01/rdf-schema#subClassOf'],
 ['25', 'http:
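
To see where the two KB author datasets differ, the property lists can be compared as sets. The sketch below uses only an excerpt of the two outputs above, so its result describes that excerpt and not the full namespaces. The helper `propertySet` and the variable names are hypothetical.

```python
def propertySet(overview):
    """Return the set of IRIs of a [[count, IRI], ...] overview list."""
    return {iri for _, iri in overview}

# Excerpt of the kb-authors-nta property overview
ntaExcerpt = [
    ['4446148', 'http://schema.org/sameAs'],
    ['2793771', 'http://schema.org/name'],
    ['751540', 'http://schema.org/description'],
    ['732968', 'http://schema.org/birthDate'],
    ['234985', 'http://schema.org/deathDate'],
    ['115470', 'http://schema.org/nationality'],
]

# Excerpt of the kb-authors-dbnla property overview
dbnlaExcerpt = [
    ['109739', 'http://schema.org/name'],
    ['109739', 'http://schema.org/gender'],
    ['102230', 'http://schema.org/birthDate'],
    ['59213', 'http://schema.org/hasOccupation'],
    ['53343', 'http://schema.org/deathDate'],
    ['32029', 'http://schema.org/birthPlace'],
    ['23998', 'http://schema.org/deathPlace'],
]

nta = propertySet(ntaExcerpt)
dbnla = propertySet(dbnlaExcerpt)

shared = sorted(nta & dbnla)      # properties present in both excerpts
onlyNTA = sorted(nta - dbnla)     # properties seen only in the NTA excerpt
onlyDBNLA = sorted(dbnla - nta)   # properties seen only in the DBNLA excerpt

print('shared:', shared)
print('only NTA:', onlyNTA)
print('only DBNLA:', onlyDBNLA)
```

In practice the full lists returned by `getPropertyOverview` would be passed to `propertySet` instead of the hand-copied excerpts.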

## Bibliothèque nationale de France (BnF)

In [65]:
bnfAuthorsPersons = SPARQLWrapper('http://localhost:8090/bigdata/namespace/kb')

In [73]:
getClassOverview(bnfAuthorsPersons)

[['1965432', 'http://xmlns.com/foaf/0.1/Person'],
 ['1965432', 'http://www.w3.org/2004/02/skos/core#Concept'],
 ['4654', 'http://xmlns.com/foaf/0.1/Document'],
 ['4359', 'http://data.bnf.fr/ontology/bnf-onto/ExpositionVirtuelle']]

In [72]:
getPropertyOverview(bnfAuthorsPersons)

[['4783927', 'http://www.w3.org/2002/07/owl#sameAs'],
 ['3939877', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'],
 ['2162921', 'http://www.w3.org/2000/01/rdf-schema#seeAlso'],
 ['1965432', 'http://data.bnf.fr/ontology/bnf-onto/FRBNF'],
 ['1965432', 'http://purl.org/dc/terms/created'],
 ['1965432', 'http://rdaregistry.info/Elements/u/P61160'],
 ['1965432', 'http://xmlns.com/foaf/0.1/focus'],
 ['1965432', 'http://xmlns.com/foaf/0.1/page'],
 ['1965432', 'http://www.w3.org/2004/02/skos/core#prefLabel'],
 ['1965432', 'http://purl.org/dc/terms/modified'],
 ['1965431', 'http://xmlns.com/foaf/0.1/familyName'],
 ['1891029', 'http://xmlns.com/foaf/0.1/givenName'],
 ['1891029', 'http://xmlns.com/foaf/0.1/name'],
 ['1668932', 'http://purl.org/dc/terms/creator'],
 ['1663837', 'http://rdaregistry.info/Elements/a/P50097'],
 ['1663837', 'http://rdvocab.info/ElementsGr2/countryAssociatedWithThePerson'],
 ['1659347', 'http://rdaregistry.info/Elements/a/P50113'],
 ['1659347', 'http://rdvocab.info/El

## Integrated data quality
