<a href="https://colab.research.google.com/github/mmessmer/ML-NLP-Projects/blob/master/NEL_Wikipedia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named Entity Linking based on Wikidata
A prototype for identifying and disambiguating named entities using Wikipedia as a knowledge base via the MediaWiki API.

## Structure
* A database of (person) entities is built via the MediaWiki API. For each entry, a page ID, title (name) and extract (short description) are stored.
* NER on the query is done with spaCy.
* For each recognized (person) entity, a list of candidates is created by fuzzy string search over the names.
* A pre-trained BERT model produces semantic embeddings for the candidates and the query sentence.
* Semantic similarity is calculated as cosine similarity.
* The closest-matching candidates are written to JSON.
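
The fuzzy name search in the third step could, for instance, be sketched with the standard library's `difflib`. The function name `fuzzy_candidates`, the threshold of 0.8 and the token-wise comparison are illustrative assumptions and not part of the prototype, which matches names by substring further below.

```python
from difflib import SequenceMatcher

def fuzzy_candidates(mention, titles, threshold=0.8):
    """Return titles that contain at least one token similar to the mention.

    Hypothetical helper: the lower-cased mention is compared with every
    whitespace-separated token of each title.
    """
    mention = mention.lower()
    matches = []
    for title in titles:
        best = max(
            (SequenceMatcher(None, mention, token).ratio()
             for token in title.lower().split()),
            default=0.0,
        )
        if best >= threshold:
            matches.append(title)
    return matches

titles = ["Angela Merkel", "Alexander Merkel", "Helge Braun"]
# A misspelled mention still finds both "Merkel" entries
print(fuzzy_candidates("Merkle", titles))
```

A token-wise comparison keeps short mentions such as a bare surname from being penalized by long full names.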

## Next steps

### Data collection
Use Wikipedia dump

### Creating training and test sets

Train on Wikipedia data with named entities:
get Wikipedia articles on persons together with sentences from internal/external links, and create a training set from them.
Also try the CoNLL-2003 NER data: https://www.clips.uantwerpen.be/conll2003/ner/

### Named entity recognition
Use purpose-trained model

### Training/Fine-tuning language model to classify NEs

One option is the approach of Labusch & Neudecker (2020): Named Entity Disambiguation and Linking on Historic Newspaper OCR with BERT

### TODOs
* Get full list of person entities from Wikipedia
* Fine-tune model on extracted Wiki sentences
* Fine-tune with news or social media sentences
* Train classifier on sentence pairs with identical/non-identical entities
* Extend for other languages
* Extend for different entities
* Extend for different data sources

In [None]:
!pip install sentence_transformers
!pip install sentencepiece
!python -m spacy download de_core_news_sm

import transformers
import spacy
import de_core_news_sm
nlp = de_core_news_sm.load()

import pandas as pd
from sentence_transformers import SentenceTransformer, util
import torch
import json

import requests
import re
import time


## Getting the data
### Exploring "Person" categories in Wikipedia

In [None]:
session = requests.Session()

url = "https://de.wikipedia.org/w/api.php"

params = {
    "action": "query",
    "cmtitle": "Kategorie:Person nach Staatsangehörigkeit",
    "cmlimit": 10,
    "list": "categorymembers",
    "format": "json"
}

response = session.get(url=url, params=params)
response_data = response.json()

country_cats = [c['title'] for c in response_data['query']['categorymembers']]

len(country_cats)

10

### Exploring DBpedia
DBpedia allows querying persons by (exact) name, but the results turned out to be messy and include lists and other unrelated pages.

In [None]:
!pip install SPARQLWrapper
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
query = '''select ?person
where {
  ?person foaf:name "Thomas Müller"@en.
  ?person a foaf:Person.
}
LIMIT 50'''

sparql.setQuery(query)
sparql.setReturnFormat(JSON)
res = sparql.query().convert()
res


{'head': {'link': [], 'vars': ['person']},
 'results': {'bindings': [{'person': {'type': 'uri',
     'value': 'http://dbpedia.org/resource/2013_FIFA_Club_World_Cup_squads'}},
   {'person': {'type': 'uri',
     'value': 'http://dbpedia.org/resource/2014_MLS_All-Star_Game__Thomas_Müller__1'}},
   {'person': {'type': 'uri',
     'value': 'http://dbpedia.org/resource/2016–17_FC_Bayern_Munich_season__Thomas_Müller__1'}},
   {'person': {'type': 'uri',
     'value': 'http://dbpedia.org/resource/List_of_German_football_transfers_summer_2009'}},
   {'person': {'type': 'uri',
     'value': 'http://dbpedia.org/resource/FC_Bayern_Munich__Thomas_Müller__1'}},
   {'person': {'type': 'uri',
     'value': 'http://dbpedia.org/resource/Thomas_Müller'}},
   {'person': {'type': 'uri',
     'value': 'http://dbpedia.org/resource/Thomas_Müller_(SS_officer)'}},
   {'person': {'type': 'uri',
     'value': 'http://dbpedia.org/resource/2007–08_FC_Bayern_Munich_season__Thomas_Müller__1'}},
   {'person': {'type': 
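
The noise in the result above follows a few visible patterns, so a simple post-filter can remove part of it. The sketch below is a heuristic under assumptions: the function name `drop_noisy_resources` is hypothetical, and the two rules (a double underscore marking sub-resources, a `List_of_` prefix) are read off the sample output only. Pages such as squad lists still slip through.

```python
def drop_noisy_resources(bindings, var="person"):
    """Keep only URIs that look like plain article resources."""
    kept = []
    for binding in bindings:
        uri = binding[var]["value"]
        name = uri.rsplit("/", 1)[-1]
        # "__" marks sub-resources such as "..._season__Thomas_Müller__1"
        if "__" in name or name.startswith("List_of_"):
            continue
        kept.append(uri)
    return kept

# A few bindings copied from the query result above
sample = [
    {"person": {"type": "uri", "value": "http://dbpedia.org/resource/2013_FIFA_Club_World_Cup_squads"}},
    {"person": {"type": "uri", "value": "http://dbpedia.org/resource/FC_Bayern_Munich__Thomas_Müller__1"}},
    {"person": {"type": "uri", "value": "http://dbpedia.org/resource/List_of_German_football_transfers_summer_2009"}},
    {"person": {"type": "uri", "value": "http://dbpedia.org/resource/Thomas_Müller"}},
    {"person": {"type": "uri", "value": "http://dbpedia.org/resource/Thomas_Müller_(SS_officer)"}},
]
for uri in drop_noisy_resources(sample):
    print(uri)
```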

### Approach 1: Entity retrieval via local sample taken from MediaWiki API
Back to Wikipedia: get a list of page IDs from a deliberate sample of persons categorized by country (US and German persons).
For persons whose article has an extract (summary), collect these extracts.

#### Improvements
*  Fix issue with missing German data (after letter 'P')
*  Get more persons from more countries
*  Use full text data instead of just extracts
*  Get other NEs

In [None]:
def get_ids_from_categories(categories):
  """Write title and pageid of every member of each category to a TSV file."""
  n_entries = 0
  for category in categories:
    outf_path = './' + category.replace(':','_')+ '.csv'
    # One parameter set for all pages, so the sort direction stays the same
    # across continuation requests
    cc_params = {
        "action": "query",
        "cmtitle": category,
        "cmlimit": "max",
        "list": "categorymembers",
        "cmdir": "desc",
        "format": "json"
    }
    while True:
      response = session.get(url=url, params=cc_params)
      if response.status_code != 200:
        print("Error retrieving pages for category {}.".format(category))
        print(response.reason)
        break
      r = response.json()
      with open(outf_path, 'a', encoding='utf-8') as outf:
        for p in r['query']['categorymembers']:
          n_entries += 1
          outf.write(p['title'] + '\t' + str(p['pageid']) + '\n')
      print("Wrote {} entries.".format(n_entries))

      # The API omits 'continue' on the last page: stop there instead of
      # requesting the same page again
      cont = r.get('continue', {}).get('cmcontinue')
      if not cont:
        break
      cc_params['cmcontinue'] = cont
      time.sleep(1.5)

In [None]:
# To get only approx. 380k entries
sample_countries = [
                    "Kategorie:Deutscher",
                    #"Kategorie:US-Amerikaner"
                    ]
get_ids_from_categories(sample_countries)

Go through list of entries and batchwise get extracts by pageids. A maximum of 1000 characters is retrieved per extract.
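
The MediaWiki API accepts several page IDs in one `pageids` parameter, separated by `|`. The batching described here can be sketched as follows; the helper name `batched_pageids` and the default batch size of 20 are assumptions made for illustration.

```python
def batched_pageids(pageids, batch_size=20):
    """Yield '|'-joined page ID strings, batch_size IDs at a time.

    The final batch may be shorter, so no trailing IDs are lost.
    """
    for start in range(0, len(pageids), batch_size):
        batch = pageids[start:start + batch_size]
        yield "|".join(str(pid) for pid in batch)

# Toy example with a batch size of 2
for joined in batched_pageids([145, 992828, 602284, 161981, 1223031], batch_size=2):
    print(joined)
```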

In [None]:
def extract_text_from_pageids(infile_name, outfile_name, batch_size=20):
  # Read the pageids from the input file (one "title<TAB>pageid" per line)
  pageids = []
  with open(infile_name, encoding='utf-8') as inf:
    for line in inf:
      name, pageid = line.rstrip('\n').split('\t')
      pageids.append(pageid)

  # Split into batches; the last batch may hold fewer than batch_size ids
  batches = [pageids[i:i + batch_size]
             for i in range(0, len(pageids), batch_size)]

  with open(outfile_name, 'a', encoding='utf-8') as outf:
    for n, batch in enumerate(batches, start=1):
      params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "pageids": '|'.join(batch),
        "formatversion": "2",
        "exchars": "1000",
        "explaintext": 1,
        "exsectionformat": "plain",
        "exintro": 1,
      }

      response = session.get(url=url, params=params)
      if response.status_code != 200:
        print("Error retrieving extracts for batch {}.".format(n))
        print(response.reason)
        continue
      r = response.json()
      for page in r['query']['pages']:
        # Pages without an extract are skipped
        if 'extract' not in page:
          continue
        extract = page['extract'].replace('\n','').rstrip('…')
        outf.write('"{}"\t"{}"\t"{}"\n'.format(page['pageid'], page['title'], extract))

      if not n % 10:
        print("Processed batch {} of {}".format(n, len(batches)))
      time.sleep(0.20)

Read results and put them in Pandas DataFrame for next steps

In [None]:
#extract_text_from_pageids('Kategorie_Deutscher.csv','de_person_extracts.csv')
# Extracting previously cut off entries starting at "Pungs"
extract_text_from_pageids('Kategorie_Deutscher_Pungs.csv','de_person_extracts_pungs.csv')


de_persons = pd.read_csv('de_person_extracts_pungs.csv', sep='\t', names=["pageid", "title", "extract"], dtype='string')
de_persons = de_persons.dropna(axis=0)
de_persons_lower = pd.concat([de_persons['pageid'], de_persons['title'].str.lower(), de_persons['extract']],axis=1)
de_persons_lower

### Approach 2: Entity retrieval via Wikipedia search and DBpedia data
Get candidates via the Wikipedia API search and check their entity category via DBpedia. Abstracts with short descriptions are then also retrieved from DBpedia. This approach has several advantages:
- Mostly language-independent
- Data doesn't need to be scraped
- Data is always up to date

Downsides are that the search may slow the pipeline down and that the results depend on the external (black-box) Wikipedia API search algorithm.

#### Improvements:
- Map categories for other entity types (LOC, ORG...)
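
A possible shape for such a mapping is sketched below. It extends the `ent_map` used in `get_dbpedia_abstract` to the spaCy labels LOC and ORG. Pairing LOC with `dbo:Place` and ORG with `dbo:Organisation` is an assumption that has not been checked against the German DBpedia data; `ENT_MAP` and `type_node` are hypothetical names.

```python
DBO = "http://dbpedia.org/ontology/"

# spaCy label -> DBpedia ontology class (LOC and ORG are assumed pairings)
ENT_MAP = {
    "PER": DBO + "Person",
    "LOC": DBO + "Place",
    "ORG": DBO + "Organisation",
}

def type_node(ent_type):
    """Build the node format in which rdf:type values appear in DBpedia JSON."""
    return {"type": "uri", "value": ENT_MAP[ent_type]}

print(type_node("ORG"))
```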

In [None]:
!pip install wikipedia
import wikipedia as wiki



In [109]:
# DBPedia results:

def get_dbpedia_abstract(name,ent_type='PER',lang='de'):
  # Get the name in the wiki URL format
  query = name.replace(' ','_')

  # Search under this node for the person category, located
  # under /resource/<name>/
  rdf_type = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

  # Type name mapping.
  # Type names are under /resource/<name>/<rdf_type>
  ent_map = {
      "PER" : { "type" : "uri", "value" : "http://dbpedia.org/ontology/Person" }
  }
  # Directly under /resource/<name>
  abstract_url = "http://dbpedia.org/ontology/abstract"

  # Get the query (page title) into the right format by replacing spaces 
  # with an underscore and insert into the URL format
  lang_prefix = {'de':'de.','en':''}
  url = "http://{}dbpedia.org/data/{}.json".format(lang_prefix[lang],query)
  res_name = "http://{}dbpedia.org/resource/{}".format(lang_prefix[lang],query)
  session = requests.Session()

  response = session.get(url=url).json()
  if not response:
    #print('No entry found for name "{}"'.format(name))
    return None

  
  try:
    properties = response[res_name]
    # Entity is a person
    if ent_map[ent_type] in properties[rdf_type]:
      # Return abstract in requested language
      for abstract in properties[abstract_url]:
        if abstract['lang'] == lang:
          return abstract['value']
      return None
  except KeyError:
    #print('Matched entry "{}" does not contain required information'.format(url))
    return None
  
def get_candidates(ner_string,ent_type='PER',lang='de',n=15):
  search_results = wiki.search(ner_string,results=n)
  names, abstracts = [], []

  # Retrieve abstract and, if present, add to dict of candidates
  for cand in search_results:
    abstract = get_dbpedia_abstract(cand,ent_type,lang)
    # Keep only candidates for which an abstract was found, i.e. which are
    # of the requested entity type. Wiki API search sometimes returns
    # irrelevant results
    if abstract:
      names.append(cand)
      abstracts.append(abstract)

  # Put everything in dict format suitable for pandas
  cand_list = {'entities':names, 'abstracts':abstracts}
  return cand_list
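
Since the Wikipedia search sometimes returns irrelevant results, candidates could additionally be checked for containing the NE string. The following is a minimal sketch of such a check; the helper name `title_mentions_entity` and the token-based comparison are assumptions, and the function is not wired into `get_candidates`.

```python
def title_mentions_entity(title, ner_string):
    """True if every token of the NE string occurs as a token in the title.

    The comparison is case-insensitive; underscores in wiki-style titles
    are treated as spaces.
    """
    title_tokens = set(title.lower().replace("_", " ").split())
    return all(token in title_tokens for token in ner_string.lower().split())

print(title_mentions_entity("Angela Merkel", "merkel"))
print(title_mentions_entity("Hans Pflügler", "hansi"))
```

Note that a title such as "Kabinett Merkel I" still passes this check, so it only complements the entity type test done via DBpedia.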

## NEL via semantic similarity of contextual embeddings
A string search collects candidates from the small DB of extracts. Embeddings are then computed for each candidate and for the query sentence, and the most similar candidate is chosen.

#### Improvements
*   Train the model on provided data from Wikipedia and related links (e.g. to news sites)
*   Store already retrieved embeddings in persons DF to improve performance
*   Get prior probabilities of entities by collecting frequency statistics and use them for the candidate selection
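
The second improvement, reusing embeddings that were already computed, could look roughly like this. The class `EmbeddingCache` is hypothetical and keeps the embeddings in a dictionary keyed by page ID rather than in the DataFrame; `encode_fn` stands for any function that maps a list of texts to a list of embeddings, for example a wrapper around `model.encode`. A stand-in encoder is used here so that the sketch runs without loading a model.

```python
class EmbeddingCache:
    """Memoize embeddings by page ID so each extract is encoded only once."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn
        self._store = {}

    def get(self, pageids, texts):
        # Encode only the texts whose page ID has not been seen yet
        missing = [(p, t) for p, t in zip(pageids, texts) if p not in self._store]
        if missing:
            new_ids, new_texts = zip(*missing)
            for pid, emb in zip(new_ids, self.encode_fn(list(new_texts))):
                self._store[pid] = emb
        return [self._store[p] for p in pageids]

# Stand-in encoder (assumption): returns the text length as a 1-d "embedding"
calls = []
def fake_encode(texts):
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]

cache = EmbeddingCache(fake_encode)
cache.get(["145", "602284"], ["extract a", "extract bb"])
cache.get(["602284", "161981"], ["extract bb", "extract ccc"])
print(calls)  # the second call encodes only the unseen extract
```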



In [None]:
# Load model suited for semantic similarity, must be trained for German data,
# in this case a multilingual model
# TODO: train model on appropriate data
# distilbert-base-multilingual-cased
model = SentenceTransformer('bert-base-multilingual-cased')

test_sentences = [
                  "Merkel sprach am Sonntag im Bundeskanzleramt mit ihrem Kabinett über das Vorgehen in der Corona-Krise.",
                  "Hansi #Flick: „Es ist wichtig, dass man zufrieden ist mit dem 5:1. Den Sieg haben wir auch in der Höhe verdient.\" #MiaSanMia #FCBayern",
                  "Kanzleramtschef Braun (CDU) ein Jahr nach Pandemiebeginn zu Mängeln der Corona-Warn-App: „Warum muss in Deutschland immer alles der Staat anbieten?“",
                  "Knapp zwei Wochen vor der Landtagswahl in Baden-Württemberg trafen sich Ministerpräsident Kretschmann und seine CDU-Herausforderin Eisenmann zum TV-Duell"
]

### Approach 1 : Via local data from MediaWiki API
NEs are extracted with spaCy. For now, only PER entities are passed to the linking algorithm.
For linking the entities, semantic similarity is used: contextual embeddings of the query sentence and of each candidate's extract are compared via cosine similarity.
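
To make the ranking step concrete, here is a small sketch of cosine similarity and top-k selection. It uses plain Python lists instead of tensors so that it runs without the model; the function names `cosine_similarity` and `rank_candidates` are illustrative, and the notebook itself relies on `util.pytorch_cos_sim` and `torch.topk` for this.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_candidates(query_vec, candidate_vecs, k=5):
    """Return up to k (score, index) pairs, most similar candidate first."""
    scored = [(cosine_similarity(query_vec, vec), i)
              for i, vec in enumerate(candidate_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy 2-d "embeddings": candidate 1 points in the same direction as the query
ranking = rank_candidates([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], k=2)
print(ranking)
```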


In [92]:
from datetime import datetime

# Get matches of passed named entity in the title column, 
# create embeddings for matched sentences and output to results
def nel_mediawiki(sentence, ne, k=5):
  # Find string matches in the (lower-cased) Wiki article titles.
  # regex=False treats the entity as a literal string, not as a pattern
  matched = de_persons_lower.loc[
      de_persons_lower['title'].str.contains(ne, regex=False)]
  if matched.empty:
    return None

  # Create embeddings for wiki candidates and input sentence and calculate
  # semantic similarity based on cosine similarity
  wiki_embeddings = model.encode(matched['extract'].tolist(), batch_size=8, 
                                 convert_to_tensor=True)
  sent_embeddings = model.encode(sentence, batch_size=8,
                                 convert_to_tensor=True)
  cos_scores = util.pytorch_cos_sim(sent_embeddings,wiki_embeddings)[0]
  candidates = torch.topk(cos_scores, min(k, len(matched)))

  # Collect results with top k candidates

  ranking_results = []
  print("Top matches for entity '{}' in query '{}':".format(ne, sentence))
  for score, index in zip(candidates[0], candidates[1]):
    print("Score: {:.3f} Person: {}".format(score,matched['extract'].iloc[int(index)]))
    ranking_results.append({
        "candidate_name" : matched['title'].iloc[int(index)],
        "candidate_pageid" : matched['pageid'].iloc[int(index)],
        "candidate_score": "{:.3f}".format(score)
    })
  return ranking_results


# Get entities via spaCy. Default type of entities returned is persons
# PER persons
# LOC locations
# ORG organizations
def get_all_nes_mediawiki(sentences, entity_types=['PER']):
  docs = list(nlp.pipe(sentences))
  entities = []
  for doc, sent in zip(docs,sentences):
    doc_entities = []
    for ent in doc.ents:
      if ent.label_ in entity_types:
        print("Found entity: " + ent.text + "_" + ent.label_)
        # Search for full string, normalize whitespaces
        full_m = re.sub(r'\s+',' ',ent.text.lower())
        results = nel_mediawiki(sent, full_m)
        if results:
          doc_entities.append({
              "entity" : ent.text,
              "entity_type" : ent.label_,
              "candidates" : results
          })
          continue

        # If full match is not found try with partial string matches
        partial_matches = ent.text.lower().split()
        for m in partial_matches:
          results = nel_mediawiki(sent, m)
          if results:
            doc_entities.append({
                "entity" : ent.text,
                "entity_type" : ent.label_,
                "candidates" : results
            })
    entities.append({
        "sentence" : sent,
        "entities" : doc_entities
    })
  # Write results to JSON file
  with open('query_' + datetime.now().isoformat() + '.json','w', encoding='utf-8') as result_file:
    json.dump(entities,result_file, ensure_ascii=False)
  return entities

In [None]:
results = get_all_nes_mediawiki(test_sentences)
correct = 0
total_entities = 5

# Sentence 1: expected Angela Merkel|145
if results[0]['entities'][0]['candidates'][0]['candidate_pageid'] == '145':
    correct += 1

# Sentence 2: expected Hansi Flick|992828; 
if results[1]['entities'][0]['candidates'][0]['candidate_pageid'] == '992828':
    correct += 1

#Sentence 3: expected Helge Braun|602284;
if results[2]['entities'][0]['candidates'][0]['candidate_pageid'] == '602284':
    correct += 1

# Sentence 4: expected Winfried Kretschmann|161981
if results[3]['entities'][0]['candidates'][0]['candidate_pageid'] == '161981':
    correct += 1

# Sentence 4: expected Susanne Eisenmann|1223031
if results[3]['entities'][1]['candidates'][0]['candidate_pageid'] == '1223031':
    correct += 1

print("Accuracy for correct candidate first: {:.4f}".format(correct/total_entities))

Found entity: Merkel_PER
Top matches for entity 'merkel' in query 'Merkel sprach am Sonntag im Bundeskanzleramt mit ihrem Kabinett über das Vorgehen in der Corona-Krise.':
Score: 0.636 Person: Angela Dorothea Merkel (geb. Kasner; * 17. Juli 1954 in Hamburg) ist eine deutsche Politikerin (CDU). Sie ist seit dem 22. November 2005 Bundeskanzlerin der Bundesrepublik Deutschland. Vom 10. April 2000 bis zum 7. Dezember 2018 war sie CDU-Bundesvorsitzende.Merkel wuchs in der DDR auf und war dort als Physikerin am Zentralinstitut für Physikalische Chemie tätig. Bei der Bundestagswahl am 2. Dezember 1990 errang sie erstmals ein Bundestagsmandat. Bei den folgenden sieben Bundestagswahlen wurde sie in ihrem Wahlkreis in Vorpommern direkt gewählt. Von 1991 bis 1994 war Merkel Bundesministerin für Frauen und Jugend im Kabinett Kohl IV und von 1994 bis 1998 Bundesministerin für Umwelt, Naturschutz und Reaktorsicherheit im Kabinett Kohl V. 1998 bis zu ihrer Wahl zur Bundesvorsitzenden der Partei 2000 
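
The hard-coded checks in the evaluation cell above raise an `IndexError` as soon as spaCy misses an entity or no candidate is found. A more forgiving variant is sketched below. The helper name `top1_accuracy` and the `(sentence index, entity index, pageid)` triples are assumptions; the toy `results` list only mimics the structure returned by `get_all_nes_mediawiki`.

```python
def top1_accuracy(results, expected):
    """Share of expected entities whose top candidate has the right pageid.

    expected: list of (sentence_index, entity_index, pageid) triples.
    Missing sentences, entities or candidates count as errors.
    """
    correct = 0
    for sent_idx, ent_idx, pageid in expected:
        try:
            top = results[sent_idx]['entities'][ent_idx]['candidates'][0]
        except IndexError:
            continue
        if top['candidate_pageid'] == pageid:
            correct += 1
    return correct / len(expected)

# Toy results: the first sentence is linked correctly, the second has no entity
toy_results = [
    {'sentence': 's1', 'entities': [
        {'entity': 'Merkel', 'entity_type': 'PER',
         'candidates': [{'candidate_pageid': '145'}]}]},
    {'sentence': 's2', 'entities': []},
]
toy_expected = [(0, 0, '145'), (1, 0, '992828')]
print("Accuracy for correct candidate first: {:.4f}".format(
    top1_accuracy(toy_results, toy_expected)))
```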

### Approach 2: Via Wikipedia search and DBpedia

In [106]:
from datetime import datetime

# Get candidates for the passed named entity via Wikipedia search,
# create embeddings for their abstracts and rank them against the sentence
def nel_wiki_search(sentence, ne, k=5):
  # Find candidates via Wikipedia search and get abstracts for them
  candidate_list = get_candidates(ne)

  if len(candidate_list['abstracts']) == 0:
    return None

  # Create embeddings for wiki candidates and input sentence and calculate
  # semantic similarity based on cosine similarity
  wiki_embeddings = model.encode(candidate_list['abstracts'], batch_size=8, 
                                 convert_to_tensor=True)
  sent_embeddings = model.encode(sentence, batch_size=8,
                                 convert_to_tensor=True)
  cos_scores = util.pytorch_cos_sim(sent_embeddings,wiki_embeddings)[0]
  candidates = torch.topk(cos_scores, min(k,len(candidate_list['entities'])))

  # Collect results with top k candidates
  ranking_results = []
  print("Top matches for entity '{}' in query '{}':".format(ne, sentence))
  for score, index in zip(candidates[0], candidates[1]):
    print("Score: {:.3f} Person: {}".format(score,candidate_list['entities'][int(index)]))
    ranking_results.append({
        "candidate_name" : candidate_list['entities'][int(index)],
        "candidate_abstract" : candidate_list['abstracts'][int(index)],
        "candidate_score": "{:.3f}".format(score)
    })
  return ranking_results


# Get entities via spaCy. Default type of entities returned is persons
# PER persons
# LOC locations
# ORG organizations
def get_all_nes_wiki_search(sentences, entity_types=['PER']):
  docs = list(nlp.pipe(sentences))
  entities = []
  for doc, sent in zip(docs,sentences):
    doc_entities = []
    for ent in doc.ents:
      if ent.label_ in entity_types:
        print("Found entity: " + ent.text + "_" + ent.label_)
        # Search for full string, normalize whitespaces
        full_m = re.sub(r'\s+',' ',ent.text.lower())
        results = nel_wiki_search(sent, full_m)
        if results:
          doc_entities.append({
              "entity" : ent.text,
              "entity_type" : ent.label_,
              "candidates" : results
          })
          #continue

        # Try with partial string matches
        partial_matches = ent.text.lower().split()
        for m in partial_matches:
          if m == full_m:
            continue
          results = nel_wiki_search(sent, m)
          if results:
            ent_result = {
                "entity" : ent.text,
                "entity_type" : ent.label_,
                "candidates" : results
            }
            if not ent_result in doc_entities:
              doc_entities.append(ent_result)
    entities.append({
        "sentence" : sent,
        "entities" : doc_entities
    })
  # Write results to JSON file
  with open('query_' + datetime.now().isoformat() + '.json','w', encoding='utf-8') as result_file:
    json.dump(entities,result_file, ensure_ascii=False)
  return entities

In [107]:
get_all_nes_wiki_search(test_sentences)

Found entity: Merkel_PER
Top matches for entity 'merkel' in query 'Merkel sprach am Sonntag im Bundeskanzleramt mit ihrem Kabinett über das Vorgehen in der Corona-Krise.':
Score: 0.726 Person: Kabinett Merkel I
Score: 0.725 Person: Kabinett Merkel II
Score: 0.699 Person: Angela Merkel
Score: 0.698 Person: Kabinett Merkel III
Score: 0.494 Person: Alexander Merkel
Found entity: Hansi_PER
Top matches for entity 'hansi' in query 'Hansi #Flick: „Es ist wichtig, dass man zufrieden ist mit dem 5:1. Den Sieg haben wir auch in der Höhe verdient." #MiaSanMia #FCBayern':
Score: 0.618 Person: Hans Pflügler
Found entity: Kanzleramtschef Braun_PER
Top matches for entity 'kanzleramtschef braun' in query 'Kanzleramtschef Braun (CDU) ein Jahr nach Pandemiebeginn zu Mängeln der Corona-Warn-App: „Warum muss in Deutschland immer alles der Staat anbieten?“':
Score: 0.709 Person: Hans Globke
Score: 0.683 Person: Thomas de Maizière
Top matches for entity 'kanzleramtschef' in query 'Kanzleramtschef Braun (CDU

[{'entities': [{'candidates': [{'candidate_abstract': 'Das Kabinett Merkel I war die vom 22. November 2005 bis zum 28. Oktober 2009 amtierende deutsche Bundesregierung in der 16. Legislaturperiode. Grundlage für die Arbeit dieser Regierung war der Koalitionsvertrag „Gemeinsam für Deutschland. Mit Mut und Menschlichkeit.“',
      'candidate_name': 'Kabinett Merkel I',
      'candidate_score': '0.726'},
     {'candidate_abstract': 'Das Kabinett Merkel II war die vom 28. Oktober 2009, dem Tag ihrer Vereidigung, bis zum 17. Dezember 2013 amtierende deutsche Bundesregierung in der 17. Legislaturperiode. Das Kabinett der schwarz-gelben Koalition aus CDU, CSU und FDP stellte sich am 23. beziehungsweise 24. Oktober 2009 der Öffentlichkeit vor; es war das 22. Kabinett der Bundesrepublik Deutschland. Am 22. Oktober 2013 wurde die Bundesregierung durch Bundespräsident Joachim Gauck entlassen, blieb aber bis zur Ernennung der neuen Regierung geschäftsführend im Amt. Der Koalitionsvertrag trug den 

### Test with News API
Run some test queries on the descriptions of German news articles.

In [101]:
news_session = requests.Session()
news_url = 'https://newsapi.org/v2/everything'

# Simple session-based cache for already made requests depending on parameters
cached_requests = {}

In [111]:
import json
import os

def get_news_articles_json(news_query):
  news_params = {
    # The API key is read from the environment instead of being hard-coded;
    # the variable name NEWS_API_KEY is a convention chosen here
    'apiKey' : os.environ['NEWS_API_KEY'],
    'q' : news_query,
    'language' : 'de',
    'sortBy': 'popularity',
    'pageSize': 10,
  }
  # Serialized parameters serve as the cache key
  json_params = json.dumps(news_params, sort_keys=True)
  if json_params in cached_requests:
    print('INFO: Returning cached request')
    return cached_requests[json_params]

  response = news_session.get(url=news_url, params=news_params)
  cached_requests[json_params] = response.json()
  return cached_requests[json_params]

In [110]:
articles = get_news_articles_json('Deutschland')

# Skip articles without a description
news_sentences = [desc['description'] for desc in articles['articles'] if desc['description']]
get_all_nes_wiki_search(news_sentences)

INFO: Returning cached request


['In Deutschland produzierte Serien sind weltweit gefragt. Eine Bundesförderung von bis zu 10 Millionen Euro soll mehr Produktionen nach Deutschland locken.',
 'In der aktuellen Wirtschaftskrise horten die Bürger Rekordsummen, doch einen baldigen einen Konsumboom erwartet niemand – auch nicht nach der Pandemie.',
 'Anders als sein Vorgänger Donald Trump will US-Präsident Joe Biden seine Truppen hierzulande verstärken, auch mit Spezialisten für elektronische Kriegsführung.',
 'Surfen, streamen, spielen: In der Pandemie nutzen die Menschen das Internet deutlich intensiver. Das zeigt sich am Datenvolumen – beim Festnetz und mobil.',
 'Wie sich Manganknollen in der Tiefsee abbauen lassen, untersucht die Bundesanstalt für Geowissenschaften in einem Forschungsprojekt. Umweltschützer warnen.',
 'Amazon bietet das Smart Display Echo Show 10 nun auch in Deutschland an. Neu ist eine Funktion, mit der sich das Zuhause auf dem Handy überwachen lässt.',
 "c't-Chefredakteur Dr. Jürgen Rink spricht m

## Further Approaches
End-to-end entity linking via embeddings of mentions and entities plus a probabilistic mapping of mentions to entities: [Kolitsas et al. (2018)](https://github.com/dalab/end2end_neural_el)

Candidate selection via ANN indexes from a DB, followed by ranking with a BERT classifier trained on sentence pairs with either the same or different entities: [Labusch & Neudecker (2020)](https://github.com/qurator-spk/sbb_ned)
- High precision, at the cost of recall
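
To illustrate what "sentence pairs with either the same or different entities" could mean as training data, here is a minimal sketch. It is not taken from the linked repository; the helper name `build_sentence_pairs`, the input format and the placeholder sentences are assumptions.

```python
from itertools import combinations

def build_sentence_pairs(mentions):
    """Turn (sentence, entity_id) items into labeled sentence pairs.

    Label 1: both sentences mention the same entity; label 0: different ones.
    """
    pairs = []
    for (sent_a, ent_a), (sent_b, ent_b) in combinations(mentions, 2):
        pairs.append((sent_a, sent_b, int(ent_a == ent_b)))
    return pairs

# Placeholder sentences, keyed by Wikipedia page IDs used earlier in the notebook
mentions = [
    ("sentence 1 mentioning entity 145", "145"),
    ("sentence 2 mentioning entity 145", "145"),
    ("sentence 3 mentioning entity 992828", "992828"),
]
for pair in build_sentence_pairs(mentions):
    print(pair)
```

In practice the number of negative pairs grows much faster than the number of positive ones, so some form of sampling would likely be needed.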

