# Naive Entity Linker

In this exercise, you'll create the simplest possible entity linker: one that recognizes city names in text and links them to the corresponding Wikipedia pages.

Since we aim for a minimalistic solution, we won't be dealing with the problem of disambiguation explicitly.  Instead, we construct a surface form dictionary such that each mention is mapped to a single entity.  Then, the task boils down to (i) the construction of the surface form dictionary, and (ii) the recognition of mentions (dictionary keys) in the text.

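We use the part of the Wikipedia URL after the last slash to uniquely represent entities, i.e., as the `entity_ID`. For example, `https://en.wikipedia.org/wiki/Luxembourg_City` yields the entity_ID `Luxembourg_City`.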
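
As a minimal sketch of this convention, the helper below extracts the entity_ID from a Wikipedia URL. The function name `entity_id_from_url` is hypothetical (it is not part of the exercise), and the sketch assumes page titles that contain no slash, which should hold for typical city pages.

```python
from urllib.parse import unquote, urlparse


def entity_id_from_url(url: str) -> str:
    """Returns the part of a Wikipedia URL path after the last slash."""
    path = urlparse(url).path  # e.g., "/wiki/Luxembourg_City"
    # Decode percent-encoded characters so that non-ASCII titles stay readable.
    return unquote(path.rsplit("/", 1)[-1])


print(entity_id_from_url("https://en.wikipedia.org/wiki/Luxembourg_City"))
```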

The goal is to have a minimalistic working solution first, then expand it iteratively.
- For the minimal version, consider only capitals in Europe. (All those cities should have unique names.)
- Expand the list of cities by considering other Wikipedia lists and/or DBpedia properties. At this point, we don't deal with ambiguity; if a city name has already been added to the surface form dictionary, we ignore it.
- Finally, we resolve ambiguity by keeping the more popular sense of each name. Since we don't have link information easily available, we'll use the population of the city as a proxy. That is, if multiple cities share the same name, the surface form dictionary keeps the one with the highest population.
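
The last step could look roughly like the sketch below. It assumes that candidates are already available as `(surface form, entity_ID, population)` triples; how the populations are obtained (e.g., from DBpedia) is left open. The function name `create_sf_dict_by_population` is hypothetical, and the population values are placeholders, not real figures.

```python
from typing import Dict, List, Tuple


def create_sf_dict_by_population(
    candidates: List[Tuple[str, str, int]]
) -> Dict[str, str]:
    """Keeps, for each surface form, the entity with the highest population.

    Args:
        candidates: (surface form, entity_ID, population) triples.

    Returns:
        Dictionary with mention as key and entity_ID as value.
    """
    best: Dict[str, Tuple[str, int]] = {}
    for surface_form, entity_id, population in candidates:
        # Replace the stored entity only if the new one is more populous.
        if surface_form not in best or population > best[surface_form][1]:
            best[surface_form] = (entity_id, population)
    return {sf: entity_id for sf, (entity_id, _) in best.items()}


# Placeholder populations for illustration only (not real figures).
candidates = [
    ("London", "London,_Ontario", 1),
    ("London", "London", 2),
    ("Paris", "Paris", 3),
]
print(create_sf_dict_by_population(candidates))
```

On ties, this sketch keeps whichever entity was seen first, which matches the earlier rule of ignoring names already present in the dictionary.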

In [6]:
import ipytest
import pytest
import wikipediaapi

from typing import Dict, List

ipytest.autoconfig()

## 1) Obtain a list of cities

We collect potential city names from Wikipedia categories. Note that not all the pages will be actual cities. We'll filter those later in the surface form dictionary creation step.

In [3]:
def get_cities(seed_cats: List[str]) -> List[str]:
    """Returns a list of cities represented by their entity_IDs.
    
    Args:
        seed_cats: Names of Wikipedia categories that may be used for collecting cities.
    
    Returns:
        List of entity_IDs.
    """
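    # Note: newer releases of wikipedia-api expect a user agent string as the
    # first argument, followed by the language code; adjust this call if needed.
    wiki_wiki = wikipediaapi.Wikipedia("en")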
    cities = []
    for seed_cat in seed_cats:
        cat = wiki_wiki.page(f"Category:{seed_cat}")
        for c in cat.categorymembers.values():
            if c.ns != wikipediaapi.Namespace.CATEGORY:
                cities.append(c.title)
    
    return cities

In [7]:
cities = get_cities(["Capitals_in_Europe"])

In [9]:
# Look at a sample.
cities[:10]

['European Capital of Culture',
 'European Youth Capital',
 'Amsterdam',
 'Andorra la Vella',
 'Athens',
 'Belgrade',
 'Berlin',
 'Bern',
 'Bratislava',
 'City of Brussels']

## 2) Create a surface form dictionary

We create a dictionary that maps surface forms to entity_IDs (single entity_ID per surface form).

In [10]:
def create_sf_dict(cities: List[str]) -> Dict[str, str]:
    """Creates a surface form dictionary for a given list of cities.
    
    Args:
        cities: List of entity_IDs of cities.
    
    Returns:
        Dictionary with mention as key and entity_ID as value.
    """
    sf_dict = {}
    for city in cities:
        sf_dict[city] = city.replace(" ", "_")
    return sf_dict

In [15]:
sf_dict = create_sf_dict(cities)

In [16]:
sf_dict

{'European Capital of Culture': 'European_Capital_of_Culture',
 'European Youth Capital': 'European_Youth_Capital',
 'Amsterdam': 'Amsterdam',
 'Andorra la Vella': 'Andorra_la_Vella',
 'Athens': 'Athens',
 'Belgrade': 'Belgrade',
 'Berlin': 'Berlin',
 'Bern': 'Bern',
 'Bratislava': 'Bratislava',
 'City of Brussels': 'City_of_Brussels',
 'Bucharest': 'Bucharest',
 'Budapest': 'Budapest',
 'Chișinău': 'Chișinău',
 'Copenhagen': 'Copenhagen',
 'Dublin': 'Dublin',
 'Gibraltar': 'Gibraltar',
 'Helsinki': 'Helsinki',
 'Kyiv': 'Kyiv',
 'Lisbon': 'Lisbon',
 'Ljubljana': 'Ljubljana',
 'London': 'London',
 'Luxembourg City': 'Luxembourg_City',
 'Madrid': 'Madrid',
 'Minsk': 'Minsk',
 'Monaco': 'Monaco',
 'Moscow': 'Moscow',
 'Nicosia': 'Nicosia',
 'Nuuk': 'Nuuk',
 'Oslo': 'Oslo',
 'Paris': 'Paris',
 'Podgorica': 'Podgorica',
 'Prague': 'Prague',
 'Pristina': 'Pristina',
 'Reykjavík': 'Reykjavík',
 'Riga': 'Riga',
 'Rome': 'Rome',
 'City of San Marino': 'City_of_San_Marino',
 'Sarajevo': 'Sarajev

## 3) Perform entity linking

Perform entity linking with the help of the surface form dictionary. In case of overlapping mentions, link the longer one (e.g., if both "Amsterdam" and "New Amsterdam" are present in the surface form dictionary, then the mention "New Amsterdam" should be linked to the latter).

In [17]:
def perform_linking(input_text: str, sf_dict: Dict[str, str]) -> str:
    """Annotates a given input text with entities.
    
    Args:
        input_text: Input text.
        sf_dict: Surface form dictionary, mapping each mention to a canonical entity.
    
    Returns:
        Annotated text where linked entity mentions are marked up as `[mention](entity_ID)`.
    """
    annotated_text = input_text
    # TODO: Complete.
    return annotated_text
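
One possible way to approach the stub above is a single regular expression built from the dictionary keys. This is a sketch under stated assumptions, not the reference solution: the name `link_longest_match` and the toy dictionary are hypothetical. Sorting the surface forms by length puts longer alternatives first, so that among mentions starting at the same position the longest one wins; otherwise the leftmost match is taken.

```python
import re
from typing import Dict


def link_longest_match(input_text: str, sf_dict: Dict[str, str]) -> str:
    """Marks up dictionary mentions in the text as `[mention](entity_ID)`."""
    if not sf_dict:
        return input_text
    # Longer surface forms come first so the alternation prefers them.
    surface_forms = sorted(sf_dict, key=len, reverse=True)
    alternatives = "|".join(re.escape(sf) for sf in surface_forms)
    # The lookarounds prevent matches inside longer words.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    return pattern.sub(
        lambda m: f"[{m.group(0)}]({sf_dict[m.group(0)]})", input_text
    )


# Toy dictionary for illustration (an assumption, not the notebook's sf_dict).
toy_sf_dict = {
    "Amsterdam": "Amsterdam",
    "New Amsterdam": "New_Amsterdam",
    "Luxembourg City": "Luxembourg_City",
}
print(link_longest_match("Have you ever been to New Amsterdam?", toy_sf_dict))
```

Matching is case-sensitive and exact, so variants such as "amsterdam" would not be linked by this sketch.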

### Tests.

Note: the tests assume that `sf_dict` exists as a global variable.

In [19]:
%%run_pytest[clean]

@pytest.mark.parametrize("input_text,correct_value", [
    ("no city mentioned", "no city mentioned"),  # no entity mentioned 
    ("This is an example mention of Amsterdam.",  # single entity
     "This is an example mention of [Amsterdam](Amsterdam)."),
    ("Luxembourg City is located in ...",  # multi-term city name
     "[Luxembourg City](Luxembourg_City) is located in ..."),
    ("London is a popular city name",  # cities sharing the same name
     "[London](London) is a popular city name"),
    ("Have you ever been to New Amsterdam?",  # overlapping entity mentions
     "Have you ever been to [New Amsterdam](New_Amsterdam)?")
])
def test_perform_linking(input_text, correct_value):
    assert perform_linking(input_text, sf_dict) == correct_value

.FFFF                                                                              [100%]
_ test_perform_linking[This is an example mention of Amsterdam.-This is an example mention of [Amsterdam](Amsterdam).] _

input_text = 'This is an example mention of Amsterdam.'
correct_value = 'This is an example mention of [Amsterdam](Amsterdam).'

    @pytest.mark.parametrize("input_text,correct_value", [
        ("no city mentioned", "no city mentioned"),  # no entity mentioned
        ("This is an example mention of Amsterdam.",  # single entity
         "This is an example mention of [Amsterdam](Amsterdam)."),
        ("Luxembourg City is located in ...",  # multi-term city name
         "[Luxembourg City](Luxembourg_City) is located in ..."),
        ("London is a popular city name",  # cities sharing the same name
         "[London](London) is a popular city name"),
        ("Have you ever been to New Amsterdam?",  # overlapping entity mentions
         "Have you ever been to [New Amsterda

## 4) Render links

Assuming that you have added entity annotations to the input text in `[mention](entity_ID)` format, render that text as clickable links.  Essentially, you'll need to replace `[mention](entity_ID)` with `<a href="https://en.wikipedia.org/wiki/{entity_ID}">{mention}</a>` (where `{}` indicates variable placeholders).

In [23]:
def render_links(annotated_text: str) -> str:
    """Renders `[mention](entity_ID)` annotations as HTML links to Wikipedia."""
    # TODO: Complete.
    return annotated_text
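
A minimal sketch of the replacement described above, using a regular expression substitution. The name `render_links_sketch` is hypothetical, and the pattern assumes that neither the mention nor the entity_ID contains brackets or parentheses, so an entity_ID with a parenthesized suffix would need a more careful pattern. No HTML escaping is applied.

```python
import re

# Captures the mention (group 1) and the entity_ID (group 2).
LINK_PATTERN = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def render_links_sketch(annotated_text: str) -> str:
    """Replaces `[mention](entity_ID)` with an HTML anchor to Wikipedia."""
    return LINK_PATTERN.sub(
        r'<a href="https://en.wikipedia.org/wiki/\2">\1</a>', annotated_text
    )


print(render_links_sketch("This is an example mention of [Amsterdam](Amsterdam)."))
```

In a notebook, passing the resulting string to `IPython.display.HTML` should display it with clickable links rather than as raw markup.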

Instead of writing an automated test, we will simply look at the results.

In [25]:
annotated_text = perform_linking("This is an example mention of Amsterdam.", sf_dict)
print(render_links(annotated_text))

This is an example mention of Amsterdam.
