# Naive Entity Linker

In this exercise, you'll create the simplest possible entity linker: one that recognizes city names in text and links them to the corresponding Wikipedia pages.

Since we aim for a minimalistic solution, we won't be dealing with the problem of disambiguation explicitly.  Instead, we construct a surface form dictionary such that each mention is mapped to a single entity.  Then, the task boils down to (i) the construction of the surface form dictionary, and (ii) the recognition of mentions (dictionary keys) in the text.

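We use the part of the Wikipedia URL after the last slash as the unique identifier of an entity, referred to as `entity_ID`. For example, the `entity_ID` of `https://en.wikipedia.org/wiki/Amsterdam` is `Amsterdam`.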
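The extraction of an `entity_ID` from a URL can be sketched as follows. The helper name `entity_id_from_url` is hypothetical and not part of the exercise skeleton; it assumes the URL has no query string or fragment.

```python
def entity_id_from_url(url: str) -> str:
    """Returns the part of a Wikipedia URL after the last slash."""
    # Strip a trailing slash first so that the last path segment is non-empty.
    return url.rstrip("/").rsplit("/", 1)[-1]


print(entity_id_from_url("https://en.wikipedia.org/wiki/Amsterdam"))
print(entity_id_from_url("https://en.wikipedia.org/wiki/Luxembourg_City"))
```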

The goal is to have a minimalistic working solution first, then expand it iteratively.
- For the minimal version, consider only capitals in Europe. (All those cities should have unique names.)
- Expand the list of cities by considering other Wikipedia lists and/or DBpedia properties. At this point, we don't deal with ambiguity: if a city name is already in the surface form dictionary, we skip the new city and keep the existing entry.
- Finally, we resolve ambiguity by keeping the more popular sense of each name. Since we don't have link information easily available, we use the population of the city as a proxy. That is, if multiple cities share the same name, the surface form dictionary keeps the one with the highest population.
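The population-based rule in the last step could look roughly like the sketch below. It assumes the candidates are already available as `(surface_form, entity_ID, population)` tuples; that input shape, the function name `build_sf_dict_by_population`, and the sample entity_IDs and population figures are all made up for illustration.

```python
from typing import Dict, List, Tuple


def build_sf_dict_by_population(
    candidates: List[Tuple[str, str, int]]
) -> Dict[str, str]:
    """Keeps, for each surface form, the entity_ID with the highest population."""
    sf_dict: Dict[str, str] = {}
    best_population: Dict[str, int] = {}
    for surface_form, entity_id, population in candidates:
        # Add a new name, or overwrite an existing one only if this city is larger.
        if surface_form not in sf_dict or population > best_population[surface_form]:
            sf_dict[surface_form] = entity_id
            best_population[surface_form] = population
    return sf_dict


# Placeholder data: the entity_IDs and populations are invented, not real figures.
example_candidates = [
    ("Springfield", "Springfield_(placeholder_A)", 1_000),
    ("Springfield", "Springfield_(placeholder_B)", 5_000),
    ("Shelbyville", "Shelbyville_(placeholder)", 300),
]
print(build_sf_dict_by_population(example_candidates))
```

The earlier "first one wins" step corresponds to dropping the population comparison and only adding names that are not yet in the dictionary.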

In [10]:
from typing import Dict, List

## 1) Obtain a list of cities

In [17]:
def get_cities(seed_cats: List[str]) -> List[str]:
    """Returns a list of cities represented by their entity_IDs.
    
    Args:
        seed_cats: Names of Wikipedia categories that may be used for collecting cities.
    
    Returns:
        List of entity_IDs.
    """
    # TODO: Complete.
    return []

In [18]:
cities = get_cities([])

## 2) Create a surface form dictionary

We create a dictionary that maps surface forms to entity_IDs (single entity_ID per surface form).
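One possible way to derive a surface form from an entity_ID is sketched below. It rests on assumptions that the exercise does not state: that the surface form is the page title with underscores replaced by spaces, and that a trailing parenthetical qualifier can be dropped. The names `surface_form_from_entity_id` and `sketch_sf_dict`, as well as the entity_ID `Amsterdam_(placeholder)`, are hypothetical.

```python
import re
from typing import Dict, List
from urllib.parse import unquote


def surface_form_from_entity_id(entity_id: str) -> str:
    """Turns an entity_ID such as 'Luxembourg_City' into 'Luxembourg City'."""
    # Entity IDs taken from URLs may be percent-encoded.
    name = unquote(entity_id).replace("_", " ")
    # Drop a trailing parenthetical qualifier such as " (city)".
    return re.sub(r"\s*\([^)]*\)$", "", name)


def sketch_sf_dict(entity_ids: List[str]) -> Dict[str, str]:
    """Maps each surface form to the first entity_ID that produces it."""
    sf_dict: Dict[str, str] = {}
    for entity_id in entity_ids:
        # setdefault keeps an existing entry, so later duplicates are ignored.
        sf_dict.setdefault(surface_form_from_entity_id(entity_id), entity_id)
    return sf_dict


print(sketch_sf_dict(["Amsterdam", "Luxembourg_City", "Amsterdam_(placeholder)"]))
# {'Amsterdam': 'Amsterdam', 'Luxembourg City': 'Luxembourg_City'}
```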

In [19]:
def create_sf_dict(cities: List[str]) -> Dict[str, str]:
    """Creates a surface form dictionary for a given list of cities.
    
    Args:
        cities: List of entity_IDs of cities.
    
    Returns:
        Dictionary with mention as key and entity_ID as value.
    """
    # TODO: Complete.
    return {}

In [21]:
sf_dict = create_sf_dict(cities)

## 3) Perform entity linking

Perform entity linking with the help of the surface form dictionary. In case of overlapping mentions, link the longer one (e.g., if both "Amsterdam" and "New Amsterdam" are present in the surface form dictionary, then the mention "New Amsterdam" should be linked to the latter).
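The longest-match rule can be approximated with a single regular expression, as in the sketch below. This is only one of several possible approaches, not the intended solution; the function name `link_longest_match` is hypothetical. It assumes case-sensitive matching and that mentions begin and end with word characters, so names ending in punctuation would need extra handling.

```python
import re
from typing import Dict


def link_longest_match(text: str, sf_dict: Dict[str, str]) -> str:
    """Marks up mentions as [mention](entity_ID), preferring longer mentions."""
    if not sf_dict:
        return text
    # Longer mentions come first so the alternation tries them before shorter ones.
    mentions = sorted(sf_dict, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(m) for m in mentions) + r")\b"
    )
    return pattern.sub(
        lambda match: f"[{match.group(0)}]({sf_dict[match.group(0)]})", text
    )


example_dict = {"Amsterdam": "Amsterdam", "New Amsterdam": "New_Amsterdam"}
print(link_longest_match("New Amsterdam was renamed; Amsterdam was not.", example_dict))
# [New Amsterdam](New_Amsterdam) was renamed; [Amsterdam](Amsterdam) was not.
```

Because `re.sub` replaces non-overlapping matches from left to right, the "Amsterdam" inside an already matched "New Amsterdam" is not linked a second time.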

In [22]:
def perform_linking(input_text: str, sf_dict: Dict[str, str]) -> str:
    """Annotates a given input text with entities.
    
    Args:
        input_text: Input text.
        sf_dict: Surface form dictionary, mapping each mention to a canonical entity.
    
    Returns:
        Annotated text where linked entity mentions are marked up as `[mention](entity_ID)`.
    """
    annotated_text = input_text
    # TODO: Complete.
    return annotated_text

Tests.

In [8]:
# TODO: It is your task to write some tests.

## 4) Render links

Assuming that you have added entity annotations to the input text in `[mention](entity_ID)` format, render that text as clickable links.  Essentially, you'll need to replace `[mention](entity_ID)` with `<a href="https://en.wikipedia.org/wiki/{entity_ID}">{mention}</a>` (where `{}` indicates variable placeholders).
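The replacement described above could be done with a regular expression, as in this sketch. The name `render_links_sketch` is hypothetical. The pattern assumes that mentions contain no `]` and that entity_IDs contain no `)`, so an entity_ID with a parenthetical qualifier would need a more careful pattern.

```python
import re

# Captures the mention (group 1) and the entity_ID (group 2).
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def render_links_sketch(annotated_text: str) -> str:
    """Replaces [mention](entity_ID) with an HTML link to Wikipedia."""
    return ANNOTATION.sub(
        r'<a href="https://en.wikipedia.org/wiki/\2">\1</a>', annotated_text
    )


print(render_links_sketch("This is an example mention of [Amsterdam](Amsterdam)."))
```

In a notebook, the resulting string could be displayed with `IPython.display.HTML` to check that the links are clickable.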

In [23]:
def render_links(annotated_text: str) -> str:
    """Renders `[mention](entity_ID)` annotations as HTML links to Wikipedia."""
    # TODO: Complete.
    return annotated_text

Instead of writing an automated test, we will simply look at the results.

In [25]:
annotated_text = perform_linking("This is an example mention of Amsterdam.", sf_dict)
print(render_links(annotated_text))

This is an example mention of Amsterdam.
