### spaCy DBpedia Spotlight for Dutch

This is an older but widely used entity linker that can serve as a baseline for Dutch entity linking.

See https://github.com/MartinoMensio/spacy-dbpedia-spotlight for the spaCy integration.

1. Note, however, that this entity linker produces DBpedia links, so one challenge is to map these to Wikidata ids.
2. The next step is to evaluate on standard benchmarks.


In [1]:
import spacy

import spacy_dbpedia_spotlight

nlp = spacy.load('nl_core_news_lg')

nlp.add_pipe('dbpedia_spotlight')




<spacy_dbpedia_spotlight.entity_linker.EntityLinker at 0x7f2456de1ff0>

## Testing

Note that the API uses the Dutch DBpedia as its resource and returns links to nl.dbpedia.org.


In [2]:
# doc = nlp('Een minderjarig meisje uit Belgie is opgepakt voor het dreigen met schietpartijen op scholen in Breda.')
# doc = nlp('Ook de keersluizen, zoals die bij Goingarijp, Broek, Joure en Lange Sleat, zijn sinds gisteren gesloten.')
#doc = nlp('Het KNMI heeft daarom voor het noorden code geel afgegeven vanwege de kans op wateroverlast en zware windstoten.')
#doc = nlp('Minister Barry Madlener (VVD) van Infrastructuur en Waterstaat moet ProRail dwingen de problemen met de stationslift in Winsum op te lossen. Dat ziet de raadsfractie van Lokaal Sociaal als het laatste redmiddel voor de jarenlange storingen en overlast rondom de lift.')
doc = nlp('In Georgië is er onrust in de deelrepublieken Abchazië en Zuid-Ossetië.') 

#for w in doc :
#    print(w.text)
    
print([(ent.text, ent.start, ent.end, ent.kb_id_, ent._.dbpedia_raw_result['@similarityScore'], ent._.dbpedia_raw_result['@types']) for ent in doc.ents])


[('Georgië', 1, 2, 'http://nl.dbpedia.org/resource/Georgische_Socialistische_Sovjetrepubliek', '0.9997244742901871', 'Wikidata:Q6256,Schema:Place,Schema:Country,DBpedia:PopulatedPlace,DBpedia:Place,DBpedia:Location,DBpedia:Country'), ('Abchazië', 8, 9, 'http://nl.dbpedia.org/resource/Abchazische_Socialistische_Sovjetrepubliek', '0.9755126219272254', 'Wikidata:Q6256,Schema:Place,Schema:Country,DBpedia:PopulatedPlace,DBpedia:Place,DBpedia:Location,DBpedia:Country'), ('Zuid-Ossetië', 10, 11, 'http://nl.dbpedia.org/resource/Zuid-Ossetië', '0.9988970501701991', 'Wikidata:Q3455524,Schema:Place,DBpedia:PopulatedPlace,DBpedia:Place,DBpedia:Location,DBpedia:Region')]


## Locating the Wikidata id

Unfortunately, following the nl.dbpedia link produces an error, and it is unclear how to access the nl.dbpedia data, either as a page or through the DBpedia SPARQL endpoint.

One alternative is to use the generic (English) DBpedia SPARQL endpoint and owl:sameAs. This is still noisy: for example, 'Belgie' is sameAs _Belgica railway station_ and also _Belgium Wisconsin_. Possible remedies are checking for strict identity of the match (or string edit distance), using the raw result type values (as these include Wikidata and Schema types), or checking the nl label on Wikidata.
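As a small illustration of the "raw result type values" idea, the sketch below pulls the Wikidata entries out of a Spotlight `@types` string. The helper name `wikidata_types` is hypothetical, and the input is the `@types` value shown for Zuid-Ossetië in the test output above. Note that these entries appear to be classes rather than the entity itself (Q6256 is the Wikidata item for "country"), so they could at most be used to check the type of a candidate, not to identify it.

```python
def wikidata_types(types_string):
    """Return the Wikidata QIDs listed in a Spotlight '@types' string.

    The string is comma-separated, with items such as 'Wikidata:Q6256'
    or 'DBpedia:Place'. Items with another prefix are ignored.
    """
    qids = []
    for item in types_string.split(','):
        prefix, _, value = item.strip().partition(':')
        if prefix == 'Wikidata' and value:
            qids.append(value)
    return qids


example = ('Wikidata:Q3455524,Schema:Place,DBpedia:PopulatedPlace,'
           'DBpedia:Place,DBpedia:Location,DBpedia:Region')
print(wikidata_types(example))
```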

Experimenting with label checking below: we need to query both Wikidata and DBpedia, find the Dutch label(s), and prefer the matching cases.

In [3]:
import requests
wikidata_sparql = 'https://query.wikidata.org'

# Double {{}} used in query string to prevent interference with format statement 
def sameAs(nlDbpediaId) :
    encoded = nlDbpediaId.replace('(','%28').replace(')','%29')
    query='''
PREFIX dbnl: <http://nl.dbpedia.org/resource/>

SELECT DISTINCT ?wikidata_id

WHERE {{  ?dbpedia owl:sameAs dbnl:{0} .
          ?dbpedia owl:sameAs ?wikidata_id .
          FILTER( strstarts(str(?wikidata_id), "http://www.wikidata.org")) 
}}'''.format(encoded)
    queryResult = requests.get('https://dbpedia.org/sparql', params={'query': query, 'format': 'json'}).json()
    ids = [] 
    for result in queryResult['results']['bindings'] :
        ids.append(result['wikidata_id']['value'].split('/')[-1])
    return(ids)

def wikidataLabel(id) :
    query = '''
SELECT ?dutchLabel 
WHERE {{ wd:{0} rdfs:label ?dutchLabel .
       FILTER( LANG(?dutchLabel)="nl" )
              }}'''.format(id)
    #print(query)
    queryResult = requests.get('https://query.wikidata.org/sparql', params={'query':query, 'format': 'json'}).json()
    #print(queryResult)
    labels = []
    for result in queryResult['results']['bindings'] :
        labels.append(result['dutchLabel']['value'])
    return labels 
        
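def type_check(ent) : # to be completed 
    # '@types' is a comma-separated string such as 'Wikidata:Q6256,Schema:Place,...'
    wikidata_type = ent._.dbpedia_raw_result['@types'].split(',')[0] # but sometimes there are many wikidata types 
    wikidata_type_key = wikidata_type.split(':')[1]
    return wikidata_type_key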

def find_wikidata_id(ent) :
    key = ent.kb_id_.split('/')[-1]
    if key :
        dbpedia2wikidata = sameAs(key)
        keystring = key.replace('_',' ')
        print(keystring)
        lc_keystring = keystring[0].lower() + keystring[1:] 
        match = 'Wikidata_NotFound'
        for wikiId in dbpedia2wikidata :
            #print(wikiId)
            labels = wikidataLabel(wikiId)
            if keystring in labels : # Keersluis /keersluis (case sensitive) spaces vs underscores, other spelling issues?
                match = wikiId      
            elif lc_keystring in labels :
                match = wikiId
            elif match == 'Wikidata_NotFound' :
                match = wikiId
        return match 
    else :
        return 'Dbpedia_NotFound'
    

In [4]:
for ent in doc.ents :
    find_wikidata_id(ent)

Georgische Socialistische Sovjetrepubliek
Abchazische Socialistische Sovjetrepubliek
Zuid-Ossetië


## Evaluation on WiNNL data

Note that the input has been tokenized, and not always appropriately, i.e. Zuid-Ossetie is split into 3 tokens.

Solution: process the input string with spaCy and the Wikidata lookup as usual. Then iterate over the tokens: if a token matches a prefix of an entity string, continue the match with the upcoming tokens. Note that if the entity contains a space ('Barry Madlener'), the space needs to be skipped.
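The matching idea can be sketched as follows. This is only a minimal illustration: the function name `find_token_span` and the example token lists are assumptions (the real WiNNL tokenization may differ), and it returns the first match only.

```python
def find_token_span(tokens, entity_text):
    """Return (start, end) token indices, end exclusive, whose concatenation
    equals entity_text with all whitespace removed, or None if no span matches."""
    target = ''.join(entity_text.split())  # skip spaces inside the entity
    if not target:
        return None
    for start in range(len(tokens)):
        joined = ''
        for end in range(start, len(tokens)):
            joined += tokens[end]
            if joined == target:
                return (start, end + 1)
            if not target.startswith(joined):
                break  # these tokens are no longer a prefix of the entity
    return None


# Hypothetical tokenization in which 'Zuid-Ossetië' is split into 3 tokens
tokens = ['In', 'Georgië', 'is', 'er', 'onrust', 'in', 'Zuid', '-', 'Ossetië', '.']
print(find_token_span(tokens, 'Zuid-Ossetië'))

# An entity containing a space
print(find_token_span(['Minister', 'Barry', 'Madlener', '(', 'VVD', ')'], 'Barry Madlener'))
```

The returned span can then be compared with the positions of the `B-`/`I-` tags in the `qid` column.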

In [3]:
import pandas



In [4]:
winnl = pandas.read_json('WiNNL/dutch_winnl_data.json')

In [12]:
winnl

| | original | tokens | labels | qid | language | url |
|---|---|---|---|---|---|---|
| 4123 | Dit resulteerde in meer dan 200 bewerkingen op... | [Dit, resulteerde, in, meer, dan, 200, bewerki... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | nl | https://nl.wikinews.org/wiki/2023-11_Nieuwsbri... |
| 4124 | We hebben in een panel van experts van digital... | [We, hebben, in, een, panel, van, experts, van... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | nl | https://nl.wikinews.org/wiki/Wikimedia_Foundat... |
| 4125 | Weyts roept de slachthuissector op de 'rotte a... | [Weyts, roept, de, slachthuissector, op, de, '... | [B-PER, O, O, O, O, O, O, O, O, O, O, O, O, O] | [B-Q2271035, O, O, O, O, O, O, O, O, O, O, O, ... | nl | https://nl.wikinews.org/wiki/Slachthuis_in_Ize... |
| 4126 | Het is al de derde keer dit jaar dat Animal Ri... | [Het, is, al, de, derde, keer, dit, jaar, dat,... | [O, O, O, O, O, O, O, O, O, B-ORG, I-ORG, O, O... | [O, O, O, O, O, O, O, O, O, B-Q41444901, I-Q41... | nl | https://nl.wikinews.org/wiki/Slachthuis_in_Ize... |
| 4127 | Vandaag hield Animal Rights en Bite Back nog e... | [Vandaag, hield, Animal, Rights, en, Bite, Bac... | [O, O, B-ORG, I-ORG, O, B-ORG, I-ORG, O, O, O,... | [O, O, B-Q41444901, I-Q41444901, O, B-Q2158339... | nl | https://nl.wikinews.org/wiki/Slachthuis_in_Ize... |
| ... | ... | ... | ... | ... | ... | ... |
| 5617 | Zeman won in de tweede ronde nipt van de 68-ja... | [Zeman, won, in, de, tweede, ronde, nipt, van,... | [B-PER, O, O, O, O, O, O, O, O, O, O, O, B-PER... | [B-Q29032, O, O, O, O, O, O, O, O, O, O, O, B-... | nl | https://nl.wikinews.org/wiki/Milo%C5%A1_Zeman_... |
| 5618 | Hiermee begint hij aan zijn tweede vijfjaarste... | [Hiermee, begint, hij, aan, zijn, tweede, vijf... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | nl | https://nl.wikinews.org/wiki/Milo%C5%A1_Zeman_... |
| 5619 | Het officieel vastgestelde aantal doden wereld... | [Het, officieel, vastgestelde, aantal, doden, ... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | nl | https://nl.wikinews.org/wiki/Wereldwijd_nu_mee... |
| 5620 | De onrust bereikte een hoogtepunt toen de demo... | [De, onrust, bereikte, een, hoogtepunt, toen, ... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | nl | https://nl.wikinews.org/wiki/Demonstraties_in_... |


## Experiments

__Use only the first 500 items or so (4123 - 4622) to ensure that we do not evaluate on the dev set__

Sometimes an entity has no DBpedia id, and sometimes the mapping to Wikidata fails. Sometimes a named entity is not recognized as such (_Lord Ismay, Rudi van Straten, Universiteit van Luxemburg_). Is Spotlight or spaCy to blame for these misses?

Note that Spotlight annotates any term that has a DBpedia entry, whereas WiNNL only seems to annotate entities. How to zoom in on entities? Options: use the NEC class (an attribute of the ents?), or use the types attribute (if present), though the latter might require quite a bit of manual tuning.
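A minimal sketch of the types-based option, assuming the `@types` value is a comma-separated string as in the test output earlier. The function name `has_allowed_type` and the whitelist `ALLOWED_TYPES` are assumptions for illustration; which DBpedia classes to accept is exactly the part that would need manual tuning.

```python
# Assumed whitelist of DBpedia ontology classes that count as "entities"
ALLOWED_TYPES = {'DBpedia:Person', 'DBpedia:Place', 'DBpedia:Organisation'}


def has_allowed_type(types_string, allowed=ALLOWED_TYPES):
    """Return True if a Spotlight '@types' string contains at least one allowed type."""
    types = {t.strip() for t in types_string.split(',') if t.strip()}
    return bool(types & allowed)


georgia_types = ('Wikidata:Q6256,Schema:Place,Schema:Country,DBpedia:PopulatedPlace,'
                 'DBpedia:Place,DBpedia:Location,DBpedia:Country')
print(has_allowed_type(georgia_types))
print(has_allowed_type(''))
```

With such a filter, an entity would only be passed on to `find_wikidata_id` if `has_allowed_type(ent._.dbpedia_raw_result['@types'])` holds.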

On the other hand, DBpedia misses genitives (_Reagans_).

Evaluation: precision and recall over the two sets of QIDs (system and gold)? That seems accurate enough. Report macro P/R/F1.
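One way to compute these scores is sketched below. It assumes that "macro" means averaging the per-sentence scores, which is an interpretation and not something fixed by the notes; the function names `prf` and `macro_prf` are hypothetical. The two example pairs are the system and gold sets printed for items 4124 and 4130 in the runs below.

```python
def prf(system, gold):
    """Precision, recall and F1 for one sentence, given two sets of QIDs."""
    overlap = len(system & gold)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def macro_prf(pairs):
    """Average the per-sentence P/R/F1 over a list of (system, gold) pairs."""
    scores = [prf(system, gold) for system, gold in pairs]
    if not scores:
        return (0.0, 0.0, 0.0)
    n = len(scores)
    return tuple(sum(score[i] for score in scores) / n for i in range(3))


pairs = [
    ({'Q959755', 'Q781156'}, {'Q959755', 'Q404'}),  # item 4124
    ({'Q7835'}, {'Q7835'}),                          # item 4130
]
print(macro_prf(pairs))
```

Because sets are used, repeated mentions of the same entity in one sentence count once, and placeholder values such as 'Wikidata_NotFound' count as (wrong) system predictions unless they are filtered out first.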

Corpus: these are individual sentences, which makes the task hard (i.e. there is no context: _Zeman, Bush, Weyts_ might be found in DBpedia if referred to by full name, which is likely the case in the full article). Sometimes there are multiple sentences from the same article, but the article is still not guaranteed to be complete (e.g. sentences without a NE are skipped).

_Boris Berezovski (zakenman)_ is in Wikidata as _Boris Berezovski_, with _zakenman_ in the description. So strip the part in parentheses before checking whether the labels are the same?
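Stripping the qualifier could look like the sketch below. The helper name `strip_disambiguation` is hypothetical, and the underscore form of the first key is assumed by analogy with `Rotte_(rivier)`, which appears in the output further down. Only a trailing parenthesized part is removed.

```python
import re


def strip_disambiguation(key):
    """Turn a DBpedia resource key into a plain label.

    Underscores become spaces and a trailing '(...)' qualifier is dropped.
    """
    label = key.replace('_', ' ')
    return re.sub(r'\s*\([^()]*\)\s*$', '', label).strip()


print(strip_disambiguation('Boris_Berezovski_(zakenman)'))
print(strip_disambiguation('Rotte_(rivier)'))
print(strip_disambiguation('Zuid-Ossetië'))
```

The stripped label would only be used for the comparison with the Wikidata labels; the sameAs query still needs the full key.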

Sometimes a Wikidata page with exactly the same name as the entity (_ClubFM_) exists, but sameAs still does not produce a match. Also try a brute-force label search, or even the Wikidata entity search API, as fall-back options?
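A possible shape for the entity search fall-back is sketched here, using the `wbsearchentities` action of the Wikidata API. It has not been run against the WiNNL data; the function names, the User-Agent string and the limit of 5 are assumptions. The sketch uses only the standard library so that it is self-contained, and the last lines parse a hand-made payload in the shape of an API response rather than calling the network.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = 'https://www.wikidata.org/w/api.php'


def search_params(label, language='nl', limit=5):
    """Build the query parameters for a wbsearchentities request."""
    return {'action': 'wbsearchentities', 'search': label, 'language': language,
            'uselang': language, 'type': 'item', 'limit': limit, 'format': 'json'}


def parse_search_results(payload):
    """Extract (QID, label) pairs from a decoded wbsearchentities response."""
    return [(hit['id'], hit.get('label', '')) for hit in payload.get('search', [])]


def search_wikidata(label, language='nl', limit=5):
    """Query the API (network access needed) and return (QID, label) candidates."""
    url = WIKIDATA_API + '?' + urllib.parse.urlencode(search_params(label, language, limit))
    request = urllib.request.Request(url, headers={'User-Agent': 'winnl-notebook-example/0.1'})
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_search_results(json.load(response))


# Hand-made payload, only to show the parsing step without a network call
sample = {'search': [{'id': 'Q7835', 'label': 'Krim'}], 'success': 1}
print(parse_search_results(sample))
```

The search returns ranked candidates rather than a single answer, so some check (exact label match, or the type information) would still be needed before accepting the first hit.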


In [41]:
def winnl_annotations(row) :
    gold = set()
    for token in row['qid'] :
        if token.startswith('B-') :
            qid = token.replace('B-','')
            gold.add(qid)
    return gold 

def evaluate_winnl_items(First,Last) :
    score = {'overlap':0, 'system' : 0, 'gold' : 0}
    for Id in list(range(First,Last)) :
        row = winnl.loc[Id]
        gold = winnl_annotations(row)
        if gold and 'Q404' not in gold : 
            ## old: ['EVENT','GPE','LOC','ORG','PERSON','DATE','TIME']
            system = annotate_text(row['original'], ['LOC','GPE','ORG','PERSON','EVENT'] )
            print(system,gold)
            update_score(system,gold,score)
    print(system & gold)
    print(system.difference(system & gold))
    print(gold.difference(system & gold))
    print_score(score)

def evaluate_winnl_item(Id) :
    item = winnl.loc[Id,['original','qid']]
    print(Id, item['original'])
    parse = nlp(item['original'])
    print(parse.ents)
    #if parse.ents :
    #    print([(ent.text, ent.start_char, ent.end_char, ent.kb_id_, ent._.dbpedia_raw_result['@similarityScore'], ent._.dbpedia_raw_result['@types']) for ent in parse.ents])
    system = set() # do we ever have multiple identical entities, if so, set functions should not be used?
    for ent in parse.ents :  # one case is not-found errors, should they be unique? notfound+string?
        print(ent.text,ent.kb_id_)
        id = find_wikidata_id(ent)
        print(id)
        system.add(id)
    gold = winnl_annotations(item)  # gold QIDs for this item
    print(system,gold)
    Overlap = len(system & gold)
    if system :
        Precision = Overlap / len (system)
    else : 
        Precision = 0
    if gold:
        Recall = Overlap / len(gold)
    else :
        Recall = 0
    return((Precision,Recall))



In [44]:
evaluate_winnl_item(4130)
    

4130 Op de Krim brak in 2014 een crisis uit.
(Krim,)
Krim http://nl.dbpedia.org/resource/Krim
Krim
Q7835
{'Q7835'} {'Q7835'}


(1.0, 1.0)

In [39]:
# time out issue? 4130 works in isolation, not if we start at 4124
def evaluate_winnl_items(First,Last) :
    for Id in list(range(First,Last)) :
        (P,R) = evaluate_winnl_item(Id)
        print(P,R)


In [40]:
evaluate_winnl_items(4124,4134)

We hebben in een panel van experts van digitale rechten (en) bij elkaar gebracht tijdens de SXSW in Austin om dit en gerelateerde kwesties te bespreken.
(SXSW, Austin)
SXSW http://nl.dbpedia.org/resource/South_by_Southwest
South by Southwest
Q959755
Austin http://nl.dbpedia.org/resource/Austin_Motor_Company
Austin Motor Company
Q781156
{'Q959755', 'Q781156'} {'Q959755', 'Q404'}
0.5 0.5
Weyts roept de slachthuissector op de 'rotte appels' eruit te pikken:
(Weyts, rotte)
Weyts 
Dbpedia_NotFound
rotte http://nl.dbpedia.org/resource/Rotte_(rivier)
Rotte (rivier)
Wikidata_NotFound
{'Dbpedia_NotFound', 'Wikidata_NotFound'} {'Q2271035'}
0.0 0.0
Het is al de derde keer dit jaar dat Animal Rights beelden uitbrengt van dierenmishandeling in slachthuizen.
(Animal Rights, dierenmishandeling)
Animal Rights http://nl.dbpedia.org/resource/Animal_Rights
Animal Rights
Wikidata_NotFound
dierenmishandeling http://nl.dbpedia.org/resource/Dierenmishandeling
Dierenmishandeling
Q40053
{'Wikidata_NotFound', '

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
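The JSONDecodeError means that one of the SPARQL responses was not JSON. A plausible cause, though not verified here, is that the endpoint answered with an error or rate-limit page after many requests in quick succession, which would fit the observation that item 4130 works in isolation. The sketch below checks the status before decoding and retries after a pause. The name `fetch_json_with_retry` and its parameters are hypothetical, and `fetch` stands for any zero-argument function that performs the request and returns the status code and body text.

```python
import json
import time


def fetch_json_with_retry(fetch, retries=3, pause=1.0):
    """Call fetch() until it returns HTTP 200 with a JSON body.

    fetch is a zero-argument callable returning (status_code, body_text).
    Returns the decoded JSON, or None if every attempt fails.
    """
    for attempt in range(retries):
        status, body = fetch()
        if status == 200:
            try:
                return json.loads(body)
            except json.JSONDecodeError:
                pass  # body was not JSON, e.g. an HTML error page
        if attempt < retries - 1:
            time.sleep(pause * (attempt + 1))  # wait a bit longer each time
    return None


# Simulated endpoint: the first call is refused, the second succeeds
responses = iter([(429, 'Too Many Requests'),
                  (200, '{"results": {"bindings": []}}')])
print(fetch_json_with_retry(lambda: next(responses), pause=0))
```

In `sameAs` and `wikidataLabel`, `fetch` could wrap the existing `requests.get(...)` call and return `(response.status_code, response.text)`; a `None` result can then be treated as "no ids" or "no labels" instead of aborting the whole evaluation run. Sending a descriptive User-Agent header with the Wikidata requests may also help.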