# Entity linking

First draft; the code was copied from the disambiguation_candidates.ipynb notebook.

## 1. Import required software libraries

Entity linking requires a number of software libraries. Importing them may take some time on the first run; later runs will be much faster.

In [1]:
%load_ext autoreload
%autoreload 2

First we import libraries that should already be available: Python standard-library modules, common third-party packages and the local utils module

In [2]:
from dotenv import load_dotenv
import json
from IPython.display import HTML
import os
import requests
import time
import utils

Next we import packages which may require installation on this device

In [3]:
openai = utils.safe_import("openai")
pd = utils.safe_import("pandas")
pydantic = utils.safe_import("pydantic")
spacy = utils.safe_import("spacy")
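
The source of `utils.safe_import` is not shown in this notebook. As a rough illustration of the idea (import a package and install it first when it is missing), a helper of this kind could look like the sketch below. The name `safe_import_sketch` and the pip-based fallback are assumptions, not the actual implementation in `utils`.

```python
import importlib
import subprocess
import sys


def safe_import_sketch(package_name):
    """Import a package, installing it with pip first if it is missing (hypothetical helper)."""
    try:
        return importlib.import_module(package_name)
    except ImportError:
        # Install into the same interpreter that runs this notebook
        subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
        importlib.invalidate_caches()
        return importlib.import_module(package_name)


# json is part of the standard library, so nothing gets installed here
json_module = safe_import_sketch("json")
```

This sketch assumes that the import name and the pip package name are the same, which does not hold for every package.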

Finally we apply the settings required for Google Colab

In [4]:
in_colab = utils.check_google_colab()
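
How `utils.check_google_colab` detects Colab is not shown here. One common approach, given as a sketch under that assumption, is to test whether the `google.colab` package can be found; the function name `check_google_colab_sketch` is hypothetical.

```python
import importlib.util


def check_google_colab_sketch():
    """Return True when the google.colab package can be found (hypothetical helper)."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ModuleNotFoundError:
        # Raised when the parent package "google" itself is missing
        return False


in_colab_sketch = check_google_colab_sketch()
```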

## 2. Read the texts that require processing

The texts should have been processed by the disambiguation_candidates.ipynb notebook. The file read here is an output file of that notebook. We read the texts and the associated metadata and show the first text with its entities.

In [6]:
infile_name = "output_disambiguation_ba25101ddbe8830789bfdfdb3a5ba6312d6853e6.json"

# The with statement closes the file automatically
with open(infile_name, "r") as infile:
    texts_input = json.load(infile)
print({"text_cleaned": texts_input[0]["text_cleaned"], 
       "entities": texts_input[0]["entities"]})


{'text_cleaned': 'Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. Acquired before 1882. C. 115', 'entities': [{'text': 'Anubis', 'label': 'PERSON', 'start_char': 21, 'end_char': 27, 'wikidata_id': {'id': 'Q47534', 'description': 'Egyptian deity of mummification and the afterlife, usually depicted as a man with a canine head', 'model': 'gpt-4o-mini'}}]}
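
The `start_char` and `end_char` fields of each entity are character offsets into `text_cleaned`. A quick sanity check on the input, sketched below, is to confirm that every offset pair selects the entity's own text. The record is an abbreviated copy of the output above (the description field is left out) and `find_offset_mismatches` is a hypothetical helper, not part of `utils`.

```python
# Abbreviated copy of the first record shown above
record = {
    "text_cleaned": "Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. "
                    "Acquired before 1882. C. 115",
    "entities": [{"text": "Anubis", "label": "PERSON", "start_char": 21, "end_char": 27,
                  "wikidata_id": {"id": "Q47534", "model": "gpt-4o-mini"}}],
}


def find_offset_mismatches(record):
    """Return the entities whose offsets do not select their own text."""
    text = record["text_cleaned"]
    return [entity for entity in record["entities"]
            if text[entity["start_char"]:entity["end_char"]] != entity["text"]]


mismatches = find_offset_mismatches(record)
print(len(mismatches))
```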


## 3. Disambiguate entities with GPT

We link each entity to its artefact by sending a prompt with the entity text and the context text to an LLM. We ask the LLM to choose, from a fixed list of six options, the reason why the entity is mentioned in the description of the artefact.

First we define a helper function for creating the LLM prompt

In [56]:
def make_linking_prompt(entity, text):
    """Create an LLM prompt for an entity and the text that mentions it, and return it"""
    return f"""
Considering the following description of a museum artefact:

{text}

retrieve the relationship between this artefact and the following named entity, mentioned in the description:

{entity}

Please answer the following question: Why is this entity mentioned in the description? Please select your answer from the following options:

1. the artefact represents the entity
2. the entity represents the geographical context of the artefact
3. the entity represents the historical context of the artefact
4. the entity is related to an object or person represented in the artefact
5. the entity is a previous owner of the artefact
6. the entity is mentioned for another reason

Answer only with a number. If you choose option 6, you may add a clarifying text
"""

In [67]:
LINKING_CACHE_FILE = "linking_cache.json"
model = "gpt-4o-mini"


openai_api_key = utils.get_openai_api_key()
openai_client = utils.connect_to_openai(openai_api_key)
entities = utils.extract_entities_from_ner_input(texts_input)
linking_cache = utils.read_json_file(LINKING_CACHE_FILE)
for entity in entities:
    entity_text, context = entity["entity_text"], entity["text"]
    # The cache is nested: entity text -> context text -> model -> answer
    cached_answers = linking_cache.get(entity_text, {}).get(context, {})
    if model in cached_answers:
        utils.squeal(f"Retrieving entity \"{entity_text}\" of text {entity['text_id'] + 1} from cache")
        # Cached answers get an extra "model" key
        entity["link"] = cached_answers[model] | {"model": model}
    else:
        utils.squeal(f"Sending entity \"{entity_text}\" of text {entity['text_id'] + 1} to GPT")
        time.sleep(1)  # wait one second before each request
        prompt = make_linking_prompt(entity_text, context)
        entity["link"] = {model: utils.process_text_with_gpt(openai_client, model, prompt)}
        # Store a copy of the answer so later runs can skip the request
        linking_cache.setdefault(entity_text, {}).setdefault(context, {})[model] = dict(entity["link"])
utils.write_json_file(LINKING_CACHE_FILE, linking_cache)
print("Finished processing")

Sending entity "Bagnani" of text 100 to GPT
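
The prompt asks for an option number from 1 to 6, optionally followed by a clarification for option 6. The notebook does not show how the answers are interpreted, so the following is only a sketch of one possible post-processing step. The function `parse_linking_answer`, the short labels in `RELATION_LABELS` and the sample answers are assumptions made for illustration.

```python
import re

# Short labels for the six options in the prompt (wording shortened for this sketch)
RELATION_LABELS = {
    1: "artefact represents the entity",
    2: "geographical context",
    3: "historical context",
    4: "related to a represented object or person",
    5: "previous owner",
    6: "other reason",
}


def parse_linking_answer(answer):
    """Split an answer such as '6. some reason' into option number and clarification."""
    match = re.match(r"\s*([1-6])\b[\s.:)\-]*(.*)", answer, flags=re.DOTALL)
    if match is None:
        return None  # the answer does not start with a valid option number
    option = int(match.group(1))
    clarification = match.group(2).strip() or None
    return {"option": option, "label": RELATION_LABELS[option], "clarification": clarification}


# Made-up answers, not actual model output
plain = parse_linking_answer("1")
clarified = parse_linking_answer("6. mentioned as the author of a catalogue")
invalid = parse_linking_answer("12")
```

Answers that do not start with a single digit between 1 and 6 return `None`, so they can be collected and inspected by hand.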


In [68]:
len(linking_cache)

126
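
The cache used above is a nested dictionary keyed by entity text, then context text, then model name. The sketch below isolates that lookup pattern with a stand-in for the LLM call, to show that a repeated request is answered from the cache. The names `link_with_cache` and `fake_model` are hypothetical, and the stand-in always answers "1".

```python
def link_with_cache(cache, entity_text, context, model, ask_model):
    """Return the cached answer for (entity, context, model), calling ask_model only on a miss."""
    per_model = cache.setdefault(entity_text, {}).setdefault(context, {})
    if model not in per_model:
        per_model[model] = ask_model(entity_text, context)
    return per_model[model]


calls = []


def fake_model(entity_text, context):
    """Stand-in for the real LLM call; records each request and always answers '1'."""
    calls.append(entity_text)
    return "1"


cache = {}
first = link_with_cache(cache, "Anubis", "Statuette of the god Anubis.", "gpt-4o-mini", fake_model)
second = link_with_cache(cache, "Anubis", "Statuette of the god Anubis.", "gpt-4o-mini", fake_model)
print(first, second, len(calls))
```

Because the outer keys are entity texts, the length of such a cache counts distinct entity texts rather than individual requests.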