# JudaicaLink Interlinking

General approach:
- Interlinking two datasets
    - Pick two datasets.
    - Create `owl:sameAs` links between them, using one or more matching strategies.
    - Save the links as a separate linkset.
- Interlinking all datasets
    - Repeat the step above for every combination of two datasets.
- Creating the link closure
    - For every resource A, inspect each resource B it is linked to via `owl:sameAs`:
        - If B is linked to another resource C that is not directly linked to A, create the new links A -> C and C -> A.
    - Repeat until no further links are found.
    

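The closure step can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the function name `link_closure` and the representation of links as `(a, b)` tuples are assumptions made for the example.

```python
from collections import defaultdict

def link_closure(links):
    """Return the symmetric, transitive closure of (a, b) sameAs pairs.

    Hypothetical helper: `links` is any iterable of 2-tuples of URIs.
    """
    neighbours = defaultdict(set)
    for a, b in links:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    changed = True
    while changed:  # repeat until no further links are found
        changed = False
        for a in list(neighbours):
            for b in list(neighbours[a]):
                # Every C linked to B but not yet to A yields A -> C and C -> A.
                for c in neighbours[b] - neighbours[a] - {a}:
                    neighbours[a].add(c)
                    neighbours[c].add(a)
                    changed = True

    return {(a, b) for a, targets in neighbours.items() for b in targets}
```

For the input `[("a", "b"), ("b", "c")]` the result also contains `("a", "c")` and `("c", "a")`, while unrelated groups of resources stay separate.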

## RDF Helpers


In [48]:
import rdflib
import SPARQLWrapper as sw

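# Namespace prefixes that are prepended to every query as PREFIX declarations.
prefixes = [('skos', 'http://www.w3.org/2004/02/skos/core#')]

# SPARQLWrapper2 requests JSON results and exposes them through `.bindings`.
sparql = sw.SPARQLWrapper2("http://data.judaicalink.org/sparql/query")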

def get_prefixes():
    return "\n".join(["PREFIX {}: <{}>".format(prefix, url) for prefix, url in prefixes])

def sparql_query(q):
    q = get_prefixes() + "\n\n" + q
    sparql.setQuery(q)
    return sparql.query()

def get_named_graphs():
    result = sparql_query('SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o } }')
    return [ b['g'].value for b in result.bindings]


## Linking Helpers

In [41]:

def get_all_resources(dataset):
    query = "SELECT DISTINCT ?s WHERE {{ GRAPH <{0}> {{?s ?p ?o}} }}".format(dataset)
    result = sparql_query(query)
    return [ b['s'].value for b in result.bindings]

def get_labels(uri):
    query = """
    SELECT DISTINCT ?l WHERE {{
    
        {{ <{}> skos:prefLabel ?l  }}
        UNION
        {{ <{}> skos:altLabel ?l  }}
    
    }}
    """.format(uri, uri)
    result = sparql_query(query)
    return [ b['l'].value for b in result.bindings]

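def get_resource_by_label(ds, labels):
    # An empty label list would yield an empty group pattern without a ?s binding.
    if not labels:
        return []

    def escape(label):
        # Escape backslashes and double quotes so the label stays a valid SPARQL literal.
        return label.replace('\\', '\\\\').replace('"', '\\"')

    # One pattern per label, matching it as either preferred or alternative label.
    patterns = ['{{ ?s skos:prefLabel "{0}" }} UNION {{ ?s skos:altLabel "{0}" }}'.format(escape(l)) for l in labels]
    query = """
        SELECT DISTINCT ?s WHERE {{
            GRAPH <{}> {{
                {{
                {}
                }}
            }}
        }}
    """.format(ds, "\n} UNION {\n".join(patterns))
    result = sparql_query(query)
    return [b['s'].value for b in result.bindings]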






## Currently loaded datasets

In [49]:
print("\n".join(get_named_graphs()))

http://data.judaicalink.org/data/yivo
http://data.judaicalink.org/data/2014links
http://data.judaicalink.org/data/interlinks
http://data.judaicalink.org/data/djh
http://data.judaicalink.org/data/rujen
http://data.judaicalink.org/data/dbpedia-persons
http://data.judaicalink.org/data/gnd-persons
http://data.judaicalink.org/data/bhr
http://data.judaicalink.org/data/enjudaica
http://data.judaicalink.org/data/geo-coor
http://data.judaicalink.org/data/hirsch
http://data.judaicalink.org/data/stolpersteine
http://data.judaicalink.org/data/nli
http://data.judaicalink.org/data/ubffm

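To interlink all datasets, every combination of two named graphs has to be visited. A minimal sketch using `itertools.combinations` follows; the helper name `dataset_pairs` and its `exclude` parameter are assumptions made for this example. The `exclude` parameter is there because some graphs (possibly `2014links` and `interlinks`, judging only by their names) may hold links rather than source data.

```python
from itertools import combinations

def dataset_pairs(graphs, exclude=()):
    """Return every unordered pair of graph URIs, skipping excluded ones.

    Hypothetical helper: `graphs` is a list such as the one returned
    by get_named_graphs().
    """
    excluded = set(exclude)
    datasets = [g for g in graphs if g not in excluded]
    return list(combinations(datasets, 2))
```

With the 14 graphs listed above and nothing excluded, this gives 91 pairs, each of which would go through the two-dataset interlinking step once.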

## Pick two datasets

In [50]:
ds1 = 'http://data.judaicalink.org/data/yivo'
ds2 = 'http://data.judaicalink.org/data/dbpedia-persons'
# Third dataset, used further below for a single lookup against the GND persons.
ds3 = 'http://data.judaicalink.org/data/gnd-persons'

## Strategies

### 1: String Match

In [5]:
yivo_resources = get_all_resources(ds1)
len(yivo_resources)

2374

In [6]:
labels = get_labels('http://data.judaicalink.org/data/yivo/Abeles_Shimon')

In [7]:
get_resource_by_label(ds2, labels)

[]

In [8]:
labels

['Abeles, Shim‘on', 'Shim‘on Abeles']

In [10]:
labels[0]


'Abeles, Shim‘on'

In [12]:
shimon = yivo_resources[0]

In [14]:
labels = get_labels(shimon)

In [17]:
get_resource_by_label(ds2, labels)

[]

In [28]:
testlabels = ['ʾLPNDʾRY, ʾHRN BN MŠH', 'another name']

In [31]:
get_resource_by_label(ds2, testlabels)

['http://data.judaicalink.org/data/dbpedia/Aaron_Alfandari']

In [42]:
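# Note: only the matched ds2 resources are collected, not the YIVO resource they belong to.
linked_resources = []
for count, res in enumerate(yivo_resources, start=1):
    # Progress indicator: one dot per resource, running total every 100.
    print('.', end='')
    if count % 100 == 0:
        print(' {}'.format(count))
    labels = get_labels(res)
    try:
        # Collect every ds2 resource that shares a label with the YIVO resource.
        linked_resources.extend(get_resource_by_label(ds2, labels))
    except Exception as e:
        print('Error on {} with these labels: {} ({})'.format(res, labels, e))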


.................................................................................................... 100
.................................................................................................... 200
.................................................................................................... 300
.................................................................................................... 400
.................................................................................................... 500
.................................................................................................... 600
.................................................................................................... 700
.................................................................................................... 800
.................................................................................................... 900
.......................................................

In [45]:
len(linked_resources)

96

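The general approach calls for saving the links as their own linkset. The sketch below renders links as `owl:sameAs` statements in N-Triples using only string formatting. It assumes that links are kept as `(source, target)` URI pairs, which the loop above does not do, and that all URIs are valid IRIs; the function name `links_to_ntriples` is hypothetical.

```python
OWL_SAMEAS = 'http://www.w3.org/2002/07/owl#sameAs'

def links_to_ntriples(pairs):
    """Render (source, target) URI pairs as owl:sameAs N-Triples lines.

    Hypothetical helper: duplicates are dropped and lines are sorted
    so that the output is stable between runs.
    """
    lines = sorted({'<{}> <{}> <{}> .'.format(s, OWL_SAMEAS, t) for s, t in pairs})
    return '\n'.join(lines) + '\n' if lines else ''
```

The returned string could then be written to a file and loaded into its own named graph.
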
In [53]:
get_resource_by_label(ds3,['Abeles, Shim‘on'])

[]