### Evaluation of Named Entity Linking for Dutch


| System | Damuel | Mewsli-X | WiNNL | MultiNERD |
| ------ | ------ | -------- | ----- | --------- |
| Spacy+Spotlight | | | 0.397/0.453/0.423 | |
| Spacy+wikidata entity finder | 0.391/0.491/0.435 | | 0.589/0.495/0.538 | 0.517/0.602/0.556 |
| BabelFy | | | | |
| BELA | | | | |
| mGenre | | | | |
| mRefined | | | | |

 - Scores are precision/recall/F-score.
 - MultiNERD: over the first 1000 lines (52 sentences), with filtering on NE types.
 - WiNNL: over 100 items (4124-4224). There are some 404 issues in the annotation (those sentences are ignored). Dates do not seem to be annotated. Evaluating single sentences underperforms, as the previous sentence often contains the full name.
 - Damuel: over the first 100 articles (not all with text). There are no NE types, so filtering for named entities only is very hard. Also, on full texts with lots of repeated names, evaluation on unique QIDs (i.e. using sets) might give misleading scores.
 - Damuel with NEC filtering: precision 0.308, recall 0.463, F-score 0.370. With the (noisy) NORP class: precision 0.273, recall 0.474, F-score 0.346. This is even lower than the old approximate approach, which evaluated only on strings that contain a PROPN POS tag. Could it be improved?
 - Damuel with lang=nl, with NORP, with filtering for family names: precision 0.299, recall 0.516, F-score 0.378. Switching the language to nl seems to have a big effect (setting it to en was a mistake in the code): words like 'Niettegenstaande' are now labelled as well, and there are many NORP cases and many no_qid_found cases (exclude these from evaluation?). Without NORP: precision 0.336, recall 0.505, F-score 0.403. Ignoring no_qid_found: precision 0.358, recall 0.505, F-score 0.419.
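
The precision/recall/F-score triples above are computed over unique QIDs. A minimal sketch of such a set-based score is given below; the function name `prf` and the example QIDs are placeholders for illustration, not the actual evaluation code.

```python
def prf(gold, predicted):
    """Set-based precision, recall and F-score over QIDs."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore


# Placeholder QIDs, for illustration only
gold = {"Q1", "Q2", "Q3", "Q4"}
predicted = {"Q1", "Q2", "Q5"}
precision, recall, fscore = prf(gold, predicted)
print(f"precision:{precision:.3f}, recall:{recall:.3f}, fscore:{fscore:.3f}")
```

Because duplicates are collapsed by `set()`, an entity that is mentioned many times in a long text counts only once, which is the effect the Damuel note warns about.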

Open questions: use GERBIL as evaluation platform? (It seems complicated, though.) Other metrics to consider: inKB recall, and accuracy (for experiments where the entities are given).





## Evaluation Datasets 

### WiNNL

1500 sentences from Wikinews.

_WiNNL’s annotation scheme prioritises three core categories of entities: PER, ORG and LOC._ While PER, LOC and ORG are the most frequent, there are also 79 entities from other classes (see stats below). For now, it seems best to include all annotated entities in the evaluation. From the output of an automatic system, however, we might want to keep only PER, LOC and ORG entities (thus potentially increasing precision).

| Count | Tag |
| ----- | --- |
| 2150 | total (B-Q) |
| 745 | B-PER |
| 678 | B-ORG |
| 648 | B-LOC |
| 37 | B-OTH |
| 13 | B-EVT |
| 10 | B-DATE |
| 8 | B-AMB |
| 7 | B-SPE |
| 4 | B-DISEASE |
  

_The system scans through all n-grams of the article text and creates offset-based annotations for each combination of n-grams that matches one of the recognised aliases._ We should probably take this into account in our own linking as well: if the data contains both _Bart De Pauw_ and _De Pauw_, ensure that _De Pauw_ is linked to the same QID as the full name. (Otherwise it might be linked to another person or to the family name.)
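
One way to enforce this within an article is to let a shorter mention inherit the QID of an earlier, longer mention that contains it. The sketch below assumes mentions are available as `(surface string, QID)` pairs in document order; the function names and the QID strings are placeholders, not part of any existing code.

```python
def contains(longer, shorter):
    """True if the token list `shorter` occurs contiguously in `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def propagate_qids(mentions):
    """Give partial names the QID of an earlier, longer mention containing them.

    mentions: list of (surface, qid) pairs in document order.
    """
    seen = []      # (tokens, qid) of the mentions processed so far
    resolved = []
    for surface, qid in mentions:
        tokens = surface.split()
        # prefer the most recent longer mention
        for earlier_tokens, earlier_qid in reversed(seen):
            if len(earlier_tokens) > len(tokens) and contains(earlier_tokens, tokens):
                qid = earlier_qid
                break
        seen.append((tokens, qid))
        resolved.append((surface, qid))
    return resolved


# Placeholder QIDs, for illustration only
mentions = [("Bart De Pauw", "Q_PERSON"), ("Brussel", "Q_CITY"), ("De Pauw", "Q_FAMILY_NAME")]
resolved = propagate_qids(mentions)
print(resolved)
```

This only works when sentences are grouped per article, and it only looks backwards: a partial name that precedes the full name keeps its own link.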

The article URL can be used to ensure that we evaluate on full article texts in this way. (Note that the escape character \ needs to be removed from the string; the encoding of diacritics and other special characters is apparently no problem.)

#### Issues

Note that the most frequent QID is Q404, the Wikidata item describing the HTTP error for non-existing pages. This is clearly a mistake in the annotation process. Either we manually correct these cases (see mail from the developers) or we ignore them in the evaluation (easiest might be to just skip all sentences with a 404 error?).

Some sentences have no B- items annotated (even though the paper says that only sentences with an NE are preserved). Ignore these as well? Or just score them (the effect on macro recall is zero; the effect on macro precision could be negative if the system over-annotates).
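
A minimal sketch of the "skip" option for both issues is given below. It assumes that the `qid` column holds one tag per token, that entity-initial tokens are tagged `B-<QID>`, and that all other tokens have a tag that does not start with `B-`; the helper names and the example tags are made up for illustration.

```python
ERROR_QID = "Q404"  # Wikidata item for the HTTP 404 error


def gold_qids(qid_tags):
    """Return the set of gold QIDs from tags of the form 'B-<QID>'."""
    return {tag[2:] for tag in qid_tags if tag.startswith("B-")}


def is_usable(qid_tags, skip_empty=True):
    """False for sentences with a Q404 link and (optionally) without any gold link."""
    gold = gold_qids(qid_tags)
    if ERROR_QID in gold:
        return False
    if skip_empty and not gold:
        return False
    return True


# Hypothetical tag sequences, for illustration only
sentences = [
    ["B-Q31", "O", "O"],        # usable
    ["B-Q404", "I-Q404", "O"],  # annotation error: skipped
    ["O", "O"],                 # no gold entity: skipped when skip_empty is True
]
kept = [tags for tags in sentences if is_usable(tags)]
print(len(kept), "of", len(sentences), "sentences kept")
```

With `skip_empty=False` the sentences without gold entities stay in, so a system that over-annotates them is still penalised on precision.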


#### Dev set

Either use only the first 500 items or so (4123 - 4622), to ensure that we do not evaluate on the test set, or else (better) split on the basis of article URLs, so that we can do some global optimization for partial names (see above).


In [None]:
import pandas

winnl = pandas.read_json('WiNNL/dutch_winnl_data.json')

winnl

In [None]:
winnl.loc[winnl['url'] == "https://nl.wikinews.org/wiki/Allerlaatste_voor_1900_geboren_persoon_overleden"]

In [None]:
#url = "https://nl.wikinews.org/wiki/Allerlaatste_voor_1900_geboren_persoon_overleden"
#url = "https://nl.wikinews.org/wiki/Catalaans_president_Puigdemont_gevlucht_naar_Belgi%C3%AB"
url = "https://nl.wikinews.org/wiki/Al_Jazeera:_%27poging_tot_vliegtuigkaping_verijdeld%27"

for index, row in winnl.loc[winnl['url'] == url].iterrows():
    # collect the gold QIDs of this sentence (tags of the form 'B-<QID>')
    gold = set()
    for tag in row['qid']:
        if tag.startswith('B-'):
            # strip only the leading 'B-' prefix
            gold.add(tag[len('B-'):])
    print(row['original'], gold)

In [None]:
from collections import defaultdict

# collect urls and count sentences from same article
urls = defaultdict(int)

for index,row in winnl.iterrows() :
    urls[row['url']] += 1 

for (key,val) in urls.items() :
    print(key,val)

# now collect subset with total of 500 sentences (for development and initial evaluation)
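
The last step is still open. One possible sketch, assuming whole articles should be kept together and taken in order of first appearance until the sentence budget is reached; the function name `select_dev_urls` and the toy URLs are placeholders.

```python
from collections import Counter


def select_dev_urls(urls, max_sentences=500):
    """Pick whole articles, in order of first appearance, up to max_sentences."""
    counts = Counter(urls)  # keeps first-appearance order (Python 3.7+)
    selected, total = [], 0
    for url, n_sentences in counts.items():
        if total + n_sentences > max_sentences:
            break  # stop at the first article that does not fit
        selected.append(url)
        total += n_sentences
    return selected, total


# Toy example; on the real data something like the following should work:
#   dev_urls, n = select_dev_urls(winnl['url'])
#   dev = winnl[winnl['url'].isin(dev_urls)]
toy_urls = ["a", "a", "b", "c", "c", "c"]
dev_urls, n = select_dev_urls(toy_urls, max_sentences=3)
print(dev_urls, n)
```

Stopping at the first article that does not fit keeps the development subset contiguous, so the remaining sentences can serve as test set without overlap in articles.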


### MultiNERD

https://github.com/Babelscape/multinerd

MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation).

A silver-quality entity linking corpus that includes Dutch.

Alignment of MultiNERD and Spacy NE classes (NERD counts from nl_dev.tsv):

| Count | NERD | Spacy | Notes |
| ----- | ---- | ----- | ----- |
| 9336 | B-LOC | LOC, GPE | |
| 4865 | B-PER | PERSON | |
| 3587 | B-TIME | TIME, DATE | could include CARDINAL as well |
| 3256 | B-ANIM | | animals, mostly lower case |
| 1281 | B-ORG | ORG | |
| 834 | B-FOOD | | mostly lower case |
| 785 | B-PLANT | | mostly lower case |
| 671 | B-DIS | | mostly lower case |
| 475 | B-EVE | EVENT | |
| 337 | B-MEDIA | WORK_OF_ART | Spel Zonder Grenzen, etc. |
| 163 | B-CEL | | celestial bodies, mix of lower (zon) and upper (Mars) case |
| 142 | B-MYTH | PERSON | mythical figures |
| 47 | B-VEHI | PRODUCT | vehicles |
| 28 | B-BIO | | mostly upper case (Latin medical terms) |
| 19 | B-INST | | instruments, mostly upper case |

Spacy (https://spacy.io/models/nl) NE classes: CARDINAL, DATE, EVENT, FAC (facilities), GPE, LANGUAGE, LAW, LOC, MONEY, NORP (national, religious groups), ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART

Note that years are labelled inconsistently, sometimes as DATE and sometimes as CARDINAL. So we could implement a heuristic where 4-digit CARDINALs are included and resolved to the QID that is of type year, not number.
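
The first half of this heuristic (deciding which CARDINALs to keep) could look roughly as follows. This is only a sketch: the function name is made up, and the second half, resolving the string to the QID of the year rather than the number, is left out because it depends on the linker.

```python
import re

FOUR_DIGITS = re.compile(r"[0-9]{4}")


def is_year_like_cardinal(text, label):
    """True for entities labelled CARDINAL whose text is exactly four digits."""
    return label == "CARDINAL" and FOUR_DIGITS.fullmatch(text.strip()) is not None


# Illustrative (text, label) pairs
examples = [("1998", "CARDINAL"), ("12", "CARDINAL"), ("1998", "DATE"), ("10000", "CARDINAL")]
for text, label in examples:
    print(text, label, is_year_like_cardinal(text, label))
```

A four-digit CARDINAL is not necessarily a year (it could be a quantity), so this filter is expected to add some noise as well.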

#### What to include in the evaluation? 

- From NERD, ignore those classes that are not predominantly named entities: ANIM, FOOD, PLANT, DIS
- From Spacy, ignore those classes that are not annotated by NERD: CARDINAL (?), FAC, LANGUAGE, LAW, MONEY, NORP, ORDINAL, PERCENT, QUANTITY
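
The two exclusion lists translate directly into a pair of filters. The sketch below assumes that gold tags have the form `B-<CLASS>` (as in nl_dev.tsv) and that predictions carry a plain Spacy label; the constant and function names are placeholders.

```python
# Classes excluded from the evaluation, copied from the two lists above
NERD_IGNORED = {"ANIM", "FOOD", "PLANT", "DIS"}
SPACY_IGNORED = {"CARDINAL", "FAC", "LANGUAGE", "LAW", "MONEY",
                 "NORP", "ORDINAL", "PERCENT", "QUANTITY"}


def keep_gold(tag):
    """Keep entity-initial gold tags ('B-<CLASS>') whose class is not ignored."""
    return tag.startswith("B-") and tag[2:] not in NERD_IGNORED


def keep_predicted(label):
    """Keep predicted entities whose Spacy label is not ignored."""
    return label not in SPACY_IGNORED


print([tag for tag in ["B-LOC", "B-ANIM", "B-PER", "O"] if keep_gold(tag)])
print([label for label in ["PERSON", "NORP", "GPE"] if keep_predicted(label)])
```

CARDINAL is listed as ignored here; if 4-digit CARDINALs are to be treated as years, they need to be let through before this filter is applied.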

### Damuel

[Damuel](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-5047?show=full) consists of two components: a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED); and Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. 

Note that the KG restricted to just those entities that have an NE type still contains over 27M entities. Loading the file with all NE classes (1.1G) works on colossus.

#### Issues

Links are for concepts (_zonnewende, paganistisch, runen, Noorse_), not just named entities (_Heinrich Himmler, Dachau, SS_). But these can be obtained from the KG, so why are things like zonnewende or dates (see below) included? So either just evaluate for precision, or else evaluate on those entities that are labelled as PROPN in the token layer.

Requiring PROPN misses only 'Jacobus II van Schotland', as not all of its parts are PROPN. Either check the first token only, or reverse the logic: a link is an entity if at least one of its tokens is PROPN.

Some dates are included as well. These could be included in the evaluation if we can identify them in the annotation. Note that they are listed in the KG as well, but without NE class.

__Alternative__: evaluate on those entities with an NE class (LOC, PER, etc.), but this requires finding them in the (20G) KG or some clever preprocessing. I extracted all QID/NE type pairs from the KG (in all.nec). This can be read into a dict, with the QID as key and the list of NE types as value. Reading the data takes a while (therefore the dict is saved as a pickle file, just 600M), but look-up is efficient (much better than storing the QIDs in a pandas df, for instance). Works on colossus, not tested on a local workstation yet.

As we are evaluating on longer texts, the decision to count only unique links seems somewhat unmotivated. Should we go for all occurrences? (and while we are at it: include string (positions) and evaluate these as well?)
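
To see how much the choice matters, the sketch below scores all occurrences instead of unique links by treating gold and predicted QIDs as multisets. It ignores string positions (a predicted QID is matched to any gold occurrence of the same QID), and the function name and example QIDs are placeholders.

```python
from collections import Counter


def prf_occurrences(gold, predicted):
    """Precision, recall and F-score over all link occurrences (multisets)."""
    gold_counts, predicted_counts = Counter(gold), Counter(predicted)
    # multiset intersection: per QID, the minimum of the two counts
    true_positives = sum((gold_counts & predicted_counts).values())
    n_predicted = sum(predicted_counts.values())
    n_gold = sum(gold_counts.values())
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Placeholder QIDs: Q1 is mentioned three times in the text but found only once
gold = ["Q1", "Q1", "Q1", "Q2"]
predicted = ["Q1", "Q3"]
occ_precision, occ_recall, occ_fscore = prf_occurrences(gold, predicted)
print(occ_precision, occ_recall, occ_fscore)
```

On unique QIDs the same example gives a recall of 0.5 (Q1 found, Q2 missed), while on occurrences the recall drops to 0.25, because the two missed repetitions of Q1 now count as well.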

In [None]:
import pandas 

damuel = pandas.read_json('damuel_1.0_nl/part-00000', lines=True)

In [None]:
damuel

In [None]:
wiki = damuel.loc[34]['wiki']
print(wiki['text'])
# print(wiki['links'])
#print(wiki['tokens'])

#tokenpos = 0
#for token in wiki['tokens'] :
#    for link in wiki['links'] :
#        if link['start'] == tokenpos :
#            try :
#                qid = link['qid']
#            except :
#                qid = 'missing'
#            print(token['upostag'], token['lemma'], qid, link['title'])
#    tokenpos += 1

for link in wiki['links']:
    # tokens covered by this link (start and end are token offsets)
    tokens = wiki['tokens'][link['start']:link['end']]
    upostags = [token['upostag'] for token in tokens]
    lemmas = [token['lemma'] for token in tokens]
    # True if at least one token of the mention is a proper noun
    has_propn = 'PROPN' in upostags
    # not every link has a qid
    qid = link.get('qid', 'missing')
    # if has_propn:
    print(qid, lemmas, upostags, link['title'])
        


In [None]:
import json

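# map each QID to its list of NE types (one JSON object per line in all.nec)
nec_dict = {}
with open('damuel_1.0_nl/damuel_1.0_wikidata/all.nec') as nec_file:
    for line in nec_file:
        nec = json.loads(line)
        nec_dict[nec['qid']] = nec['type']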

In [None]:
import pickle

# save the dict so that it does not have to be rebuilt from all.nec
with open("damuel_1.0_nl/damuel_1.0_wikidata/all_nec_dict.p", "wb") as pickle_file:
    pickle.dump(nec_dict, pickle_file)

In [None]:
nec_dict['Q16701841']

In [None]:
# NE types for every linked QID in one article
nec_types = {}

wiki = damuel.loc[35]['wiki']

for link in wiki['links']:
    if 'qid' not in link:
        # link without a qid: nothing to look up
        continue
    qid = link['qid']
    if qid not in nec_types:
        # QIDs that are not in all.nec have no NE type
        nec_types[qid] = nec_dict.get(qid, 'no_nec')

for key,val in nec_types.items() :
    print(key,val)
