# Integration Analysis

This notebook contains statistics about the data integration for the BELTRANS project.

# Table of Contents

1. [Goals](#Goals)
2. [Approach](#Approach)
3. [Integration Steps](#Integration-steps)<br>
  3.1. [Enrich with identifiers](#Enrich-with-identifiers)<br>
  3.2. [Enrich with nationalities](#Enrich-with-nationalities)<br>
  3.3. [Enrich with new authorities](#Enrich-with-new-authorities)<br>

In [1]:
import sys
import pandas as pd
import matplotlib
sys.path.insert(0, '../utils/')

from SPARQLWrapper import SPARQLWrapper, TURTLE, JSON
import sparql_utils

# Establish a connection to our database
sparqlEndpoint = SPARQLWrapper('http://wikibase-test-srv01.kbr.be/sparql/namespace/integration/sparql')

# Integration steps

One starting point is the set of translations that have already been identified: we can measure their quality and the quality of their related contributors.

With respect to the integration, it is important to measure the following:

## Current number of translations - KBR

The following statistics provide an initial overview:

* how many translations have an ISBN identifier?
* how many translations have a source specified?
* how many translations specify an author?
* how many translations specify a translator?
* how many translations specify an illustrator?
* how many translations specify a scenarist?
* how many translations do not specify a role for contributors?

Missing attributes and contributors can be found by locating the same translations in other data sources.

In [3]:
sparql_utils.getPublicationStatsOverview(sparqlEndpoint)

| Statistic | % | found | total | missing |
|---|---|---|---|---|
| Translations with ISBN identifiers | 90 | 11830 | 13146 | 1316 |
| Translations with specified author | 65 | 8566 | 13146 | 4580 |
| Translations with specified translator | 62 | 8139 | 13146 | 5007 |
| Translations with specified publisher | 90 | 11855 | 13146 | 1291 |
| Specified illustrator | 18 | 2362 | 13146 | 10784 |
| Specified scenarist | 5 | 617 | 13146 | 12529 |
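
The figures above are consistent with `missing` being `total - found` and `%` being the found share rounded to a whole number. The following is a minimal sketch that recomputes both columns from the counts shown. The helper `overview_row` is hypothetical and is not part of `sparql_utils`, whose implementation is not shown in this notebook.

```python
def overview_row(label, found, total):
    """Derive the percentage and missing count for one overview row."""
    if total <= 0:
        raise ValueError("total must be positive")
    return {
        "label": label,
        "%": round(100 * found / total),  # share of translations with the attribute
        "found": found,
        "total": total,
        "missing": total - found,
    }

# Counts copied from the overview output above
TOTAL = 13146
counts = {
    "Translations with ISBN identifiers": 11830,
    "Translations with specified author": 8566,
    "Translations with specified translator": 8139,
    "Translations with specified publisher": 11855,
    "Specified illustrator": 2362,
    "Specified scenarist": 617,
}

rows = [overview_row(label, found, TOTAL) for label, found in counts.items()]
for row in rows:
    print(f"{row['label']}: {row['%']}% found, {row['missing']} missing")
```

Recomputing the derived columns this way is a quick sanity check that the overview is internally consistent before the numbers are compared across integration steps.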


## Current number of related contributors

The following statistics provide an initial overview of linked authorities:

* how many contributors have an ISNI identifier?
* how many contributors have a VIAF identifier?
* how many authors have a Belgian nationality?
* how many authors have no nationality?
* how many illustrators have a Belgian nationality?
* how many scenarists have a Belgian nationality?

On the one hand, missing attributes can be identified by looking up the contributors in the authoritative datasets of the respective identifiers.
On the other hand, further third-party identifiers can be obtained, through which additional missing attributes can be identified.
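
The questions listed above amount to counting contributors that satisfy simple conditions. The sketch below illustrates this on a handful of made-up records. The field names (`roles`, `isni`, `viaf`, `nationality`) and all values are assumptions chosen for illustration and do not reflect the actual schema or data of the integration database.

```python
# Illustrative records only: field names and values are assumptions, not project data.
contributors = [
    {"name": "Contributor A", "roles": {"author"},
     "isni": "0000000000000001", "viaf": None, "nationality": "Belgium"},
    {"name": "Contributor B", "roles": {"author", "illustrator"},
     "isni": None, "viaf": "12345", "nationality": None},
    {"name": "Contributor C", "roles": {"scenarist"},
     "isni": None, "viaf": None, "nationality": "Belgium"},
    {"name": "Contributor D", "roles": {"illustrator"},
     "isni": "0000000000000002", "viaf": "67890", "nationality": "France"},
]

def count_where(records, predicate):
    """Count the records for which the predicate holds."""
    return sum(1 for record in records if predicate(record))

def is_belgian(record, role):
    """True if the contributor has the given role and Belgian nationality."""
    return role in record["roles"] and record["nationality"] == "Belgium"

stats = {
    "with ISNI": count_where(contributors, lambda c: c["isni"] is not None),
    "with VIAF": count_where(contributors, lambda c: c["viaf"] is not None),
    "Belgian authors": count_where(contributors, lambda c: is_belgian(c, "author")),
    "authors without nationality": count_where(
        contributors,
        lambda c: "author" in c["roles"] and c["nationality"] is None,
    ),
    "Belgian illustrators": count_where(contributors, lambda c: is_belgian(c, "illustrator")),
    "Belgian scenarists": count_where(contributors, lambda c: is_belgian(c, "scenarist")),
}

for label, value in stats.items():
    print(f"{label}: {value} of {len(contributors)}")
```

Contributors counted under "authors without nationality" are the natural candidates for a lookup in the authoritative dataset behind one of their identifiers.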

## New translations via contributor lookup or catalog searches

Our initial KBR dataset already provides contributors, along with third-party identifiers for them. This enables us to look those contributors up in other data sources to identify translations. We may find translations we do not yet have, or translations we already have without knowing it, for example because our record lacks an ISBN identifier. Similarly, we may find translations using the search function of other catalogs.

For translations found in other catalogs or datasets, either by searching for known contributors or by searching for translations directly:

* how many translations do we already have?
* how many translations are similar to translations without an ISBN identifier that we already have (e.g. same title and publication year)?
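
The two questions above can be sketched as a small classification step: a found translation is "known" if its ISBN matches one we hold, "similar" if it matches the title and publication year of one of our records without an ISBN, and "new" otherwise. This is a minimal sketch under stated assumptions. The record fields (`title`, `year`, `isbn`), the normalization rules, and the sample records are hypothetical, and the actual matching used in the project may differ.

```python
import re
import unicodedata

def normalize_isbn(isbn):
    """Strip hyphens and whitespace so differently formatted ISBNs compare equal."""
    return re.sub(r"[\s-]", "", isbn).upper() if isbn else None

def normalize_title(title):
    """Lowercase, drop accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]", " ", no_accents.lower())
    return " ".join(cleaned.split())

def classify(candidate, known):
    """Return 'known', 'similar' or 'new' for a translation found elsewhere."""
    known_isbns = {normalize_isbn(k["isbn"]) for k in known if k.get("isbn")}
    # Only our records WITHOUT an ISBN are candidates for a title/year match
    similar_keys = {
        (normalize_title(k["title"]), k["year"]) for k in known if not k.get("isbn")
    }
    isbn = normalize_isbn(candidate.get("isbn"))
    if isbn and isbn in known_isbns:
        return "known"
    if (normalize_title(candidate["title"]), candidate["year"]) in similar_keys:
        return "similar"
    return "new"

# Made-up records for illustration only
known = [
    {"title": "Example Title One", "year": 2015, "isbn": "978-90-0000-000-1"},
    {"title": "Un été à Gand", "year": 2001, "isbn": None},
]
found = [
    {"title": "Example title one", "year": 2015, "isbn": "9789000000001"},
    {"title": "UN ÉTÉ À GAND.", "year": 2001, "isbn": "978-2-0000-0000-2"},
    {"title": "Another Book", "year": 2020, "isbn": None},
]

labels = [classify(candidate, known) for candidate in found]
print(labels)
```

Title and year matching is deliberately kept separate from ISBN matching, since a "similar" hit is only a candidate for manual or further automated verification rather than a confirmed duplicate.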
