# Named Entity Linking

- what is NEL and what systems we are going to use
- limitations of these systems

## Imports

In [9]:
import os
import json
import codecs
import numpy as np

## Data loader functions

In [178]:
import os
import codecs

# loads raw text files from the impresso dataset
# into a dictionary where keys are filenames and
# values are the actual documents
def load_impresso_data():
    
    data_dir = "../datasets/impresso/raw/"
    docs = {}
    
    for filename in os.listdir(data_dir):
        if ".txt" not in filename:
            continue
        with codecs.open(os.path.join(data_dir, filename), encoding="utf-8") as infile:
            text = infile.read()
        docs[filename] = text
        
    print(f'{len(docs)} documents were read from {data_dir}')
    
    return docs

## NEL with Babelfy

Demo: http://babelfy.org/

### Getting an API Key

- It takes only two minutes
- Create an account at https://babelnet.org/register
- Log in and copy your key from https://babelnet.org/home into the cell below
- Each key gives you up to 1k API calls per day

In [13]:
# here goes your Babelfy API key
babelfy_key = "a1ac4b12-4b56-4226-816d-1e682aa6b1c6"
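
If you would rather not store a personal key inside the notebook, one option is to read it from the environment. This is a minimal sketch, assuming the key was exported under the hypothetical variable name `BABELFY_KEY` before Jupyter was started; neither that name nor the helper `get_api_key` comes from Babelfy.

```python
import os

def get_api_key(env_var="BABELFY_KEY", default=None):
    """Return the API key stored in an environment variable.

    `BABELFY_KEY` is a name chosen for this example, not one
    that Babelfy itself defines.
    """
    key = os.environ.get(env_var, default)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

The key can then be assigned with `babelfy_key = get_api_key()`.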

### Babelfy API

Use with API:
- how to get the key
- where to put it

- Read Quaero test data
- query API
- basic display of results

### Helper functions

In [187]:
import json
import urllib.parse
import urllib.request

# a simple wrapper around Babelfy's HTTP API
def babelfy_request(text, key, lang='AGNOSTIC', annotation_type='ALL', matching='EXACT_MATCHING'):
    service_url = 'https://babelfy.io/v1/disambiguate'
    matching_strategies = ['PARTIAL_MATCHING', 'EXACT_MATCHING']
    annotation_types = [
        'ALL',
        'NAMED_ENTITIES',
        'CONCEPTS'
    ]
    
    # sanity check on input parameters
    assert annotation_type in annotation_types
    assert matching in matching_strategies

    params = {
        'text': text,
        'lang': lang,
        'key': key,
        'annType': annotation_type,
        'match': matching
    }

    params = urllib.parse.urlencode(params)
    params = params.encode('utf8') # POST data must be bytes
    req = urllib.request.Request(service_url, data=params, method='POST')
    resp = urllib.request.urlopen(req)
    data = json.loads(resp.read().decode('utf8'))
    print("response successful")
    return data
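
Since each key allows only a limited number of calls per day, it can help to cache responses on disk while experimenting. The sketch below is one possible approach and not part of Babelfy's API: `cached_call` and the cache directory name are hypothetical, and it assumes the wrapped function takes keyword arguments and returns JSON-serialisable data.

```python
import hashlib
import json
import os

def cached_call(func, cache_dir="babelfy_cache", **kwargs):
    """Call func(**kwargs) once and reuse the stored JSON result afterwards."""
    os.makedirs(cache_dir, exist_ok=True)

    # the cache key is a hash of the (sorted) keyword arguments
    serialised = json.dumps(kwargs, sort_keys=True).encode("utf8")
    digest = hashlib.sha1(serialised).hexdigest()
    cache_path = os.path.join(cache_dir, f"{digest}.json")

    # reuse a previous response if the same arguments were seen before
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf8") as infile:
            return json.load(infile)

    result = func(**kwargs)
    with open(cache_path, "w", encoding="utf8") as outfile:
        json.dump(result, outfile)
    return result
```

With the wrapper above, a cached request would look like `cached_call(babelfy_request, text=some_text, key=babelfy_key, lang='FR')`, where `some_text` stands for any document string.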

In [188]:
def display_babelfy_output(input_doc, b_output):
    """Print Babelfy's output with minimal formatting."""
    
    print(f"Babelfy found {len(b_output)} links:")
    
    for n, entry in enumerate(b_output):
        start_offset = entry['charFragment']['start']
        end_offset = entry['charFragment']['end'] + 1
        surface = input_doc[start_offset:end_offset]
        entity_link = entry['DBpediaURL'] if entry['DBpediaURL'] else entry['BabelNetURL']
        
        print(f"[{n + 1}] {surface} -> {entity_link}")
    return

### Let's test it

`Babelfy` cannot take a document with **already extracted** mentions as input: it always detects the mentions itself.
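
A possible workaround is to run `Babelfy` on the whole text and then keep only the links whose character span coincides with a mention you extracted yourself. The helper below is a sketch under stated assumptions: `links_for_mentions` is a hypothetical name, mentions are given as `(start, end)` offsets with an exclusive end, and each entry carries the `charFragment` field with an inclusive end that `display_babelfy_output` relies on.

```python
def links_for_mentions(b_output, mentions):
    """Keep only the entries whose span equals a known mention span.

    `mentions` is a list of (start, end) character offsets with an
    exclusive end; the `charFragment` end is treated as inclusive,
    hence the + 1.
    """
    wanted = set(mentions)
    kept = []
    for entry in b_output:
        fragment = entry['charFragment']
        span = (fragment['start'], fragment['end'] + 1)
        if span in wanted:
            kept.append(entry)
    return kept


# toy entries standing in for a real response
toy_output = [
    {'charFragment': {'start': 0, 'end': 12}},   # "CONFÉDÉRATION"
    {'charFragment': {'start': 0, 'end': 19}},   # "CONFÉDÉRATION SUISSE"
]
print(links_for_mentions(toy_output, [(0, 20)]))
```

Exact span matching is strict; relaxing it to overlapping spans is a design choice that trades precision for recall.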

In [179]:
# load the impresso dataset
docs = load_impresso_data()

6 documents were read from ../datasets/impresso/raw/


In [181]:
# pick one random test document
one_random_doc = np.random.choice(list(docs.keys()))
test_doc = docs[one_random_doc]

# this prints some basic information about the file
print(f'[{one_random_doc}] Contains {len(test_doc.split())} words.\nBegin: {test_doc[:100]}...')

[GDL-1898-08-02-a-i0008.txt] Contains 1997 words.
Begin: CONFÉDÉRATION SUISSE Simplon. — Un journal du matin annonce que le premier coup de pioche pour le pe...


In [189]:
%%time

# call Babelfy's API and store its JSON response
# into a dictionary
babelfy_output = babelfy_request(test_doc, babelfy_key, lang='FR')

# pretty print the entities/concepts + links found by Babelfy
display_babelfy_output(test_doc, babelfy_output)

response successful
Babelfy found 757 links:
[1] CONFÉDÉRATION -> http://dbpedia.org/resource/Confederation
[2] CONFÉDÉRATION SUISSE -> http://dbpedia.org/resource/Switzerland
[3] SUISSE -> http://dbpedia.org/resource/Switzerland
[4] journal -> http://dbpedia.org/resource/Newspaper
[5] matin -> http://dbpedia.org/resource/Morning
[6] annonce -> http://babelnet.org/rdf/s00082637v
[7] premier -> http://babelnet.org/rdf/s00103006a
[8] coup -> http://babelnet.org/rdf/s00011442n
[9] pioche -> http://dbpedia.org/resource/Pickaxe
[10] aurait -> http://babelnet.org/rdf/s00088733v
[11] été -> http://babelnet.org/rdf/s00083187v
[12] donné -> http://babelnet.org/rdf/s00088815v
[13] lundi -> http://dbpedia.org/resource/Monday
[14] lundi matin -> http://babelnet.org/rdf/s03992674n
[15] matin -> http://dbpedia.org/resource/Morning
[16] forme -> http://dbpedia.org/resource/Pattern_(architecture)
[17] information -> http://dbpedia.org/resource/Information
[18] est -> http://babelnet.org/rdf/s00083187v

In [190]:
babelfy_output = babelfy_request(
    test_doc,
    babelfy_key,
    lang='FR',
    annotation_type='NAMED_ENTITIES'
)
display_babelfy_output(test_doc, babelfy_output)

response successful
Babelfy found 84 links:
[1] CONFÉDÉRATION SUISSE -> http://dbpedia.org/resource/Switzerland
[2] SUISSE -> http://dbpedia.org/resource/Switzerland
[3] Conseil fédéral -> http://dbpedia.org/resource/Federal_Council_(Switzerland)
[4] gouvernement italien -> http://dbpedia.org/resource/Politics_of_Italy
[5] Jura-Simplon -> http://dbpedia.org/resource/Jura–Simplon_Railway
[6] Jean -> http://dbpedia.org/resource/John_the_Apostle
[7] Montreux -> http://dbpedia.org/resource/Montreux
[8] Sixte IV -> http://dbpedia.org/resource/Pope_Sixtus_IV
[9] Zurich -> http://dbpedia.org/resource/Zürich
[10] prise de Rome -> http://dbpedia.org/resource/Capture_of_Rome
[11] La fédération -> http://babelnet.org/rdf/s04844699n
[12] Olten -> http://dbpedia.org/resource/Olten
[13] juin 1898 -> http://babelnet.org/rdf/s16268044n
[14] comité central -> http://dbpedia.org/resource/Central_Committee
[15] Berne -> http://dbpedia.org/resource/Bern
[16] Bienne -> http://dbpedia.org/resource/Biel/Bien

### tweak parameters. run. compare. repeat.

Changing the API's settings changes the output you get.

Parameters you can fiddle with:
- `annotation_type`
- `lang`
- `matching` (try `PARTIAL_MATCHING` and time it with `%%time`)
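
To see what a parameter change actually does, it helps to diff two runs rather than compare the printed lists by eye. This is a minimal sketch, assuming both outputs are lists of entries shaped like the ones `display_babelfy_output` reads (`charFragment`, `DBpediaURL`, `BabelNetURL`). The helper names and the two toy outputs are made up for illustration.

```python
def link_set(b_output):
    """Reduce a response to a set of (start, end, link) triples."""
    links = set()
    for entry in b_output:
        fragment = entry['charFragment']
        link = entry.get('DBpediaURL') or entry.get('BabelNetURL')
        links.add((fragment['start'], fragment['end'], link))
    return links


def compare_outputs(output_a, output_b):
    """Return the links found only in A, only in B, and in both."""
    a, b = link_set(output_a), link_set(output_b)
    return {'only_a': a - b, 'only_b': b - a, 'both': a & b}


# toy responses standing in for two real API calls
run_all = [
    {'charFragment': {'start': 0, 'end': 12},
     'DBpediaURL': 'http://dbpedia.org/resource/Confederation',
     'BabelNetURL': ''},
    {'charFragment': {'start': 0, 'end': 19},
     'DBpediaURL': 'http://dbpedia.org/resource/Switzerland',
     'BabelNetURL': ''},
]
run_ne = [run_all[1]]

diff = compare_outputs(run_all, run_ne)
print(len(diff['only_a']), len(diff['only_b']), len(diff['both']))
```

Applied to real responses, `only_a` would list what one setting finds and the other drops.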

### Test robustness

Try to insert OCR errors before calling `Babelfy`.
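
One simple way to do this is to corrupt a fraction of the characters at random. The function below is an illustrative sketch, not a model of real OCR behaviour: `add_ocr_noise`, its confusion table and the default error rate are all assumptions made for this example.

```python
import random

# a few character confusions of the kind OCR engines tend to produce;
# this table is illustrative, not derived from the impresso data
CONFUSIONS = {
    'e': 'o', 'o': 'e', 'l': '1', 'i': 'l',
    'n': 'u', 'u': 'n', 'c': 'e', 'é': 'e',
}

def add_ocr_noise(text, error_rate=0.05, seed=None):
    """Replace each confusable character with probability `error_rate`."""
    rng = random.Random(seed)
    noisy = []
    for char in text:
        if char in CONFUSIONS and rng.random() < error_rate:
            noisy.append(CONFUSIONS[char])
        else:
            noisy.append(char)
    return "".join(noisy)
```

A noisy copy can be produced with `add_ocr_noise(test_doc, error_rate=0.1, seed=42)` and passed to `babelfy_request` in place of the clean document. Because every substitution is one character for one character, the offsets in the response still line up with the noisy text.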

## NEL with Aida

System demo: [Aida Web](https://gate.d5.mpi-inf.mpg.de/webaida/)

Reference publication: [Hoffart et al. *Robust Disambiguation of Named Entities in Text*](https://domino.mpi-inf.mpg.de/intranet/ag5/ag5publ.nsf/AuthorEditorIndividualView/058d5590b427985bc12578e70045ebbc/$FILE/aida_emnlp.pdf)

There are two ways of using it:
1. let Aida recognize the named entity mentions itself
2. pass already extracted mentions to Aida

### Helper functions

In [230]:
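def read_iob(filepath):
    """
    Read a tab-separated IOB file.

    Returns a list of documents; each document is a list of
    tuples, one per token, holding the columns of that line.
    Lines starting with '#' are skipped and blank lines
    separate documents.
    """
    data = []
    instance = []
    
    with codecs.open(filepath, encoding='utf-8') as input_file:
        lines = input_file.readlines()
    
    for line in lines:
        
        if line.startswith('#'):
            continue
        
        if line == '\n':
            # a blank line closes the current document (if any)
            if instance:
                data.append(instance)
                instance = []
        else:
            line = line.replace('\n', '')
            instance.append(tuple(line.split('\t')))
    
    # keep the last document when the file does not end with a blank line
    if instance:
        data.append(instance)
    
    return data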

In [242]:
def iob2text(iob_data):
    texts = []
    for doc in iob_data:
        tokens = [token[1] for token in doc]
        texts.append(" ".join(tokens))
    return texts

In [333]:
def prepare_aida_input(iob_data):
    """
    Turn one IOB document (a list of token tuples) into a single
    string in which mentions are wrapped in double square brackets,
    ready to be pasted into Aida's input field.
    """
    prepared_input = ""
    
    for n, token in enumerate(iob_data):
        seq, surface, pos_tag, ne_tag = token
        # is the previous token part of a mention that is still open?
        in_mention = n > 0 and not iob_data[n-1][-1].startswith('O')
        
        # a mention ends when an 'O' tag follows it
        if in_mention and ne_tag == 'O':
            prepared_input += ']]'
        
        if n > 0:
            prepared_input += " "
        
        if ne_tag.startswith('B-'):
            # a new mention directly follows another one: close the
            # previous mention (after the separating space) first
            if in_mention:
                prepared_input += ']]'
            prepared_input += f'[[{surface}'
        else:
            prepared_input += surface
    
    # close a mention that runs until the last token
    if iob_data and not iob_data[-1][-1].startswith('O'):
        prepared_input += ']]'
    
    return prepared_input

### Test with `quaero` data

In [243]:
iob_data = read_iob('../datasets/quaero/quaero_test_5000_iob.tsv')

#### Aida does NER

As a first test, we let Aida do the NER itself (it uses a Stanford tagger trained on English material):

In [323]:
# run the code in this cell again (and again)
# to get each time a new (random) document from the quaero dataset

quaero_texts = iob2text(iob_data)
one_random = np.random.choice(quaero_texts)
print(one_random)

Le ministre des finances a soumis au conseil le texte du projet de loi qu' il doit déposer aujourd'hui aux le bureau de la Chambre et qui a pour objet do .


Now you can copy the text above and paste it into Aida's input text field [here](https://gate.d5.mpi-inf.mpg.de/webaida/).

#### We do NER, Aida disambiguates

In [315]:
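# pick a random document by index: np.random.choice expects a
# 1-dimensional array, which a list of token lists is not
one_random = iob_data[np.random.randint(len(iob_data))]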
aida_input = prepare_aida_input(one_random)
test_doc = aida_input
print(test_doc)

[[Economiste distingué]] , [[administrateur habile]] , [[M. Ellena]] a fait ses preuves .


In [331]:
test = [('1', 'A', 'ADP', 'O'),
 ('2', 'la', 'DET', 'O'),
 ('3', 'Chambre', 'NOUN', 'B-org.adm'),
 ('4', ',', 'PUNCT', 'O'),
 ('5', 'M.', 'NOUN', 'B-pers.ind'),
 ('6', 'Constans', 'PROPN', 'I-pers.ind'),
 ('7', 'doit', 'AUX', 'O'),
 ('8', 'répondro', 'VERB', 'O'),
 ('9', "aujourd'hui", 'NOUN', 'B-time.date.rel'),
 ('10', 'a', 'VERB', 'O'),
 ('11', "l'", 'DET', 'O'),
 ('12', 'interpellation', 'NOUN', 'O'),
 ('13', 'du', 'DET', 'O'),
 ('14', 'docteur', 'NOUN', 'B-func.ind'),
 ('15', 'Després', 'ADJ', 'B-pers.ind'),
 ('16', 'sur', 'ADP', 'O'),
 ('17', 'la', 'DET', 'O'),
 ('18', 'laïcisation', 'NOUN', 'O'),
 ('19', 'des', 'DET', 'O'),
 ('20', 'hôpitaux', 'NOUN', 'B-org.ent'),
 ('21', 'do', 'ADP', 'I-org.ent'),
 ('22', 'Paris', 'PROPN', 'I-org.ent'),
 ('23', '.', 'PUNCT', 'O')]

In [334]:
prepare_aida_input(test)

"A la [[Chambre]] , [[M. Constans]] doit répondro [[aujourd'hui]] a l' interpellation du [[docteur ]][[Després]] sur la laïcisation des [[hôpitaux do Paris]] ."

Now you can copy the text above and paste it into Aida's input text field [here](https://gate.d5.mpi-inf.mpg.de/webaida/).

### Test with `impresso` data


In [5]:
def load_flair_data():
    
    data_dir = os.path.expanduser('~/datasets/impresso/flair_output/')
    docs = {}
    
    for filename in os.listdir(data_dir):
        if ".json" not in filename:
            continue
        
        with codecs.open(os.path.join(data_dir, filename), encoding="utf-8") as infile:
            data = json.load(infile)
        docs[filename] = data
        
    print(f'{len(docs)} documents were read from {data_dir}')
    
    return docs

In [10]:
d = load_flair_data()

1 documents were read from /home/jovyan/datasets/impresso/flair_output/


In [11]:
d

{'GDL-1898-08-02-a-i0008.json': [{'text': '— Un journal du matin annonce que le premier coup de pioche pour le percement du Simplon aurait été donné à Briguo lundi matin.',
   'labels': [],
   'entities': [{'text': 'Simplon',
     'start_pos': 81,
     'end_pos': 88,
     'type': 'loc.phys.hydro',
     'confidence': 0.5749263167381287},
    {'text': 'Briguo',
     'start_pos': 108,
     'end_pos': 114,
     'type': 'loc.adm.town',
     'confidence': 0.9882785081863403},
    {'text': 'lundi matin',
     'start_pos': 115,
     'end_pos': 126,
     'type': 'time.date.abs',
     'confidence': 0.5901570022106171}]},
  {'text': "Sous celle forme, l'information est inexacte, attendu que le Conseil fédéral n'a pas encore autorisé le commencement des travaux.",
   'labels': [],
   'entities': [{'text': 'Conseil fédéral',
     'start_pos': 61,
     'end_pos': 76,
     'type': 'org.adm',
     'confidence': 0.9232593178749084}]},
  {'text': 'Il ne pourra le faire que de concert avec le gouvernemen