<a href="https://colab.research.google.com/github/pelagios/llm-lod-enriching-heritage/blob/main/notebooks/tasks/entity_linking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Entity linking

This notebook links the entities found in the previous steps (NER, Disambiguation) to the artefacts.

### Rationale

Making the relations between named entities explicit helps us to better understand the data. In the context of museum artefact descriptions there are two major types of relations: relations among the named entities themselves and relations between named entities and the artefacts. In the main data files used for testing the software, from the Egyptian Museum of Turin, the first type is quite rare. Therefore we focus on finding relations (links) between artefacts and the named entities in their description texts.

We use large language models (LLMs) to derive the relations. To make the task manageable, we have restricted the relations to eleven types:

1. the entity is a person depicted on or by the artefact
2. the entity is a person that created the artefact
3. the entity is a person that discovered the artefact
4. the entity is a person that owned the artefact
5. the entity is a person in power during the period of the creation of the artefact
6. the entity is a location depicted on or by the artefact
7. the entity is a location where the artefact was created
8. the entity is a location where the artefact was produced
9. the entity is a location where the artefact was discovered
10. the entity is the artefact's current location
11. other entity type or other relation between entity and artefact

We include the definitions of these types in the prompt sent to the LLMs and ask them to select the best matching one. We offer each entity to the LLMs separately.
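
The eleven types can be kept in a single lookup table, so that the prompt text and the later interpretation of the model's answer share one definition. The following is a minimal sketch; the names `LINK_TYPES` and `format_link_options` are illustrative assumptions and do not appear in the notebook or in `utils.py`.

```python
# Hypothetical lookup table: numeric id -> link type definition (taken from the list above).
LINK_TYPES = {
    1: "the entity is a person depicted on or by the artefact",
    2: "the entity is a person that created the artefact",
    3: "the entity is a person that discovered the artefact",
    4: "the entity is a person that owned the artefact",
    5: "the entity is a person in power during the period of the creation of the artefact",
    6: "the entity is a location depicted on or by the artefact",
    7: "the entity is a location where the artefact was created",
    8: "the entity is a location where the artefact was produced",
    9: "the entity is a location where the artefact was discovered",
    10: "the entity is the artefact's current location",
    11: "other entity type or other relation between entity and artefact",
}


def format_link_options(link_types):
    """Render the link types as a numbered list, one option per line."""
    return "\n".join(
        f"{type_id}. {definition}"
        for type_id, definition in sorted(link_types.items())
    )


options_text = format_link_options(LINK_TYPES)
print(options_text)
```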

### Processing overview

The process consists of the following steps:

1. Import required software libraries: We import the libraries that the notebook needs
2. Read the texts that require processing: We obtain the input texts from the Disambiguation notebook
3. Linking: The text is sent to GPT with a prompt that instructs it to select the best type for the link between the artefact and the entity
4. Linking visualization: The link type is displayed in text with colour-coded entities
5. Save results: Save the results of the linking process for future processing

### Dependencies

This notebook depends on three files:

* utils.py: helper functions
* output_disambiguation_ba25101ddbe8830789bfdfdb3a5ba6312d6853e6.json: output file of disambiguation task
* linking_cache.json: context-dependent cache of linking analysis performed earlier

Please make sure they are available in this folder so that the notebook can run smoothly. You can download them from GitHub.
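
Before running the notebook, the presence of the three files can be checked with a few lines of Python. This is an optional sketch, not part of the original notebook; the helper name `find_missing_files` is an assumption.

```python
from pathlib import Path

# File names are taken from the list above; the check itself is an illustrative addition.
REQUIRED_FILES = [
    "utils.py",
    "output_disambiguation_ba25101ddbe8830789bfdfdb3a5ba6312d6853e6.json",
    "linking_cache.json",
]


def find_missing_files(file_names, folder="."):
    """Return the names of the files that are not present in folder."""
    return [name for name in file_names if not (Path(folder) / name).is_file()]


missing = find_missing_files(REQUIRED_FILES)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are available")
```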

## 1. Import required software libraries

Entity linking requires importing some standard software libraries. This step may take some time when run for the first time but in successive runs it will be a lot faster.

We start by checking whether the notebook is running on Google Colab. If that is the case, we need to connect to the notebook's environment.

In [1]:
import os

def check_notebook_environment_on_colab():
    """Test if run on Colab; if so, test if the environment is available and install it if not"""
    try:
        import google.colab  # only importable on Google Colab
    except ImportError:
        print("Not running on Google Colab")
        return
    try:
        os.chdir("/content/llm-lod-enriching-heritage/notebooks/tasks")
        print("Found notebook environment")
    except FileNotFoundError:
        # The repository has not been cloned yet in this Colab session
        print("notebook environment not found, installing...")
        !git clone https://github.com/pelagios/llm-lod-enriching-heritage.git
        os.chdir("/content/llm-lod-enriching-heritage/notebooks/tasks")

check_notebook_environment_on_colab()

Not running on Google Colab


Next we import standard libraries which should always be available.

In [2]:
from dotenv import load_dotenv
import json
from IPython.display import HTML
import os
import requests
import time
import utils

Next we import packages which may require installation on this device.

In [3]:
openai = utils.safe_import("openai")
pd = utils.safe_import("pandas")
pydantic = utils.safe_import("pydantic")
spacy = utils.safe_import("spacy")
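
The function `utils.safe_import` is defined in `utils.py`, which is not shown in this notebook. As a rough illustration of the general "import, and install on failure" pattern, a helper of this kind might look like the sketch below. The name `safe_import_sketch` and its behaviour are assumptions, not the actual implementation.

```python
import importlib
import subprocess
import sys


def safe_import_sketch(package_name):
    """Import a package; if that fails, install it with pip and try again.

    Assumption: the pip package name equals the import name, which is not
    true for every package.
    """
    try:
        return importlib.import_module(package_name)
    except ImportError:
        # Install into the environment of the running interpreter
        subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
        importlib.invalidate_caches()
        return importlib.import_module(package_name)


json_module = safe_import_sketch("json")  # already available, so nothing is installed
print(json_module.__name__)
```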

Finally we configure the settings required for Google Colab.

In [4]:
in_colab = utils.check_google_colab()

## 2. Read the texts that require processing

The texts should have been processed by the `ner.ipynb` notebook. The file read here is an output file of the `disambiguation-candidates.ipynb` notebook which in turn processed the `ner.ipynb` output. We read the texts and the associated metadata and show the first text with its entities.

In [5]:
infile_name = "output_disambiguation_ba25101ddbe8830789bfdfdb3a5ba6312d6853e6.json"

with open(infile_name, "r") as infile:
    texts_input = json.load(infile)
print({"text_cleaned": texts_input[0]["text_cleaned"], 
       "entities": texts_input[0]["entities"]})

{'text_cleaned': 'Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. Acquired before 1882. C. 115', 'entities': [{'text': 'Anubis', 'label': 'PERSON', 'start_char': 21, 'end_char': 27, 'wikidata_id': {'id': 'Q47534', 'description': 'Egyptian deity of mummification and the afterlife, usually depicted as a man with a canine head', 'model': 'gpt-4o-mini'}}]}
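
Each entity carries character offsets (`start_char`, `end_char`). A quick sanity check is to confirm that the offsets select the entity text. The sketch below uses the record printed above and assumes that the offsets refer to `text_cleaned`, which holds for this example but is not stated in the notebook; the helper name `find_offset_mismatches` is illustrative.

```python
# Example record copied from the output above (wikidata_id omitted for brevity).
example = {
    "text_cleaned": "Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. "
                    "Acquired before 1882. C. 115",
    "entities": [{"text": "Anubis", "label": "PERSON", "start_char": 21, "end_char": 27}],
}


def find_offset_mismatches(record):
    """Return the entities whose offsets do not select their own text in text_cleaned."""
    text = record["text_cleaned"]
    return [
        entity
        for entity in record["entities"]
        if text[entity["start_char"]:entity["end_char"]] != entity["text"]
    ]


print(find_offset_mismatches(example))
```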


## 3. Link entities with GPT

We link entities to the artefacts by sending a prompt with each entity text, the context text and eleven candidate link types to an LLM. We ask the LLM to return the id associated with the type that best matches the type of link between the entity and the artefact.

First we define three helper functions.

In [6]:
def make_linking_prompt(entity, text):
    """Create an LLM prompt for linking, given an entity text and its context text, and return it"""
    return f"""
Considering the following description of a museum artefact:

{text}

Retrieve the relationship between this artefact and the following named entity, 
mentioned in the description:

{entity}

Please answer the following question: Why is this entity mentioned in the description? 
Please select your answer from the following options:

1. the entity is a person depicted on or by the artefact
2. the entity is a person that created the artefact
3. the entity is a person that discovered the artefact
4. the entity is a person that owned the artefact
5. the entity is a person in power during the period of the creation of the artefact
6. the entity is a location depicted on or by the artefact
7. the entity is a location where the artefact was created
8. the entity is a location where the artefact was produced
9. the entity is a location where the artefact was discovered
10. the entity is the artefact's current location
11. other entity type or other relation between entity and artefact

Answer only with a number. If you choose option 11, you may add a clarification text.
"""
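
The prompt asks for a number, optionally followed by a clarification for option 11, so the raw answer may need light parsing before it is used downstream. The notebook stores the answer as returned; the following is a hedged sketch of one possible parser, and the name `parse_link_answer` is an assumption.

```python
import re


def parse_link_answer(answer, max_type=11):
    """Split an answer such as '11. some clarification' into (type id, clarification).

    Returns (None, answer) when the answer does not start with a valid type id.
    """
    match = re.match(r"\s*(\d+)\s*[.:)\-]?\s*(.*)", answer, flags=re.DOTALL)
    if match is None:
        return None, answer.strip()
    type_id = int(match.group(1))
    if not 1 <= type_id <= max_type:
        return None, answer.strip()
    return type_id, match.group(2).strip()


print(parse_link_answer("1"))
print(parse_link_answer("11. The entity is a dating period"))
print(parse_link_answer("unknown"))
```

Answers that cannot be parsed are returned with `None` as type id, so they can be inspected manually instead of being silently mapped to a link type.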

In [7]:
def add_linking_data_to_texts_input(texts_input, entities):
    """Insert the retrieved linking data into the variable texts_input and return it"""
    entities_per_text = {}
    for entity in entities:
        if entity["text_id"] not in entities_per_text:
            entities_per_text[entity["text_id"]] = {}
        entities_per_text[entity["text_id"]][entity["entity_text"]] = entity
    for text_id, text in enumerate(texts_input):
        for entity in text["entities"]:
            if text_id in entities_per_text and entity["text"] in entities_per_text[text_id]:
                entity["link"] = entities_per_text[text_id][entity["text"]]["link"]
    return texts_input
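
The function above matches entities by the pair (text id, entity text). The toy example below illustrates that matching with a compact equivalent; the data is invented and the helper name `index_links` is an assumption.

```python
def index_links(entities):
    """Build {text_id: {entity_text: link}} from a flat list of entity records."""
    index = {}
    for entity in entities:
        index.setdefault(entity["text_id"], {})[entity["entity_text"]] = entity["link"]
    return index


# Toy data, invented for illustration only.
toy_entities = [
    {"text_id": 0, "entity_text": "Anubis", "link": {"gpt-4o-mini": "1"}},
]
toy_texts = [
    {"entities": [{"text": "Anubis"}, {"text": "Bronze"}]},
]

links = index_links(toy_entities)
for text_id, text in enumerate(toy_texts):
    for entity in text["entities"]:
        link = links.get(text_id, {}).get(entity["text"])
        if link is not None:
            entity["link"] = link

print(toy_texts)
```

Two consequences of this matching are worth keeping in mind: repeated mentions of the same entity text within one description share a single link, and an entity without a retrieved link receives no `link` key at all.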

In [8]:
LINKING_CACHE_FILE = "linking_cache.json"
model = "gpt-4o-mini"


def openai_link_suggestion(model, texts_input):
    """Get a link type for each entity, from the cache if available, otherwise from the LLM"""
    entities = utils.extract_entities_from_ner_input(texts_input)
    linking_cache = utils.read_json_file(LINKING_CACHE_FILE)
    for entity in entities:
        if (entity["entity_text"] in linking_cache and 
            entity["text"] in linking_cache[entity["entity_text"]] and
            model in linking_cache[entity["entity_text"]][entity["text"]]):
            utils.squeal(f"Retrieving entity \"{entity['entity_text']}\" of text {entity['text_id'] + 1} from cache")
            if "link" not in entity: entity["link"] = {}
            entity["link"][model] = linking_cache[entity["entity_text"]][entity["text"]][model]
        else:
            if "openai_client" not in vars():  # connect only once, on the first cache miss
                openai_api_key = utils.get_openai_api_key()
                openai_client = utils.connect_to_openai(openai_api_key)                
            utils.squeal(f"Sending entity \"{entity['entity_text']}\" of text {entity['text_id'] + 1} to GPT")
            time.sleep(1)
            prompt = make_linking_prompt(entity["entity_text"], entity["text"])
            if "link" not in entity: entity["link"] = {}
            entity["link"][model] = utils.process_text_with_gpt(openai_client, model, prompt)
            if entity["entity_text"] not in linking_cache:
                linking_cache[entity["entity_text"]] = {}
            if entity["text"] not in linking_cache[entity["entity_text"]]:
                linking_cache[entity["entity_text"]][entity["text"]] = {}
            linking_cache[entity["entity_text"]][entity["text"]][model] = entity["link"][model]
            utils.write_json_file(LINKING_CACHE_FILE, linking_cache)
    print("Finished processing")
    utils.save_data_to_json_file(linking_cache, file_name=LINKING_CACHE_FILE, in_colab=in_colab)
    return entities
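
The cache used above is a nested dictionary with three levels: entity text, context text and model name. The sketch below shows the same structure with `dict.get` and `dict.setdefault`; the helper names `cache_lookup` and `cache_store` are illustrative assumptions and the data is invented.

```python
def cache_lookup(cache, entity_text, context, model):
    """Return the cached link for (entity, context, model), or None if absent."""
    return cache.get(entity_text, {}).get(context, {}).get(model)


def cache_store(cache, entity_text, context, model, link):
    """Store a link under cache[entity_text][context][model], creating levels as needed."""
    cache.setdefault(entity_text, {}).setdefault(context, {})[model] = link


# Toy cache, invented for illustration.
cache = {}
cache_store(cache, "Anubis", "Statuette of the god Anubis.", "gpt-4o-mini", "1")
print(cache_lookup(cache, "Anubis", "Statuette of the god Anubis.", "gpt-4o-mini"))
print(cache_lookup(cache, "Anubis", "Another description", "gpt-4o-mini"))
```

Because the context text is part of the key, the same entity in a different description counts as a cache miss and is sent to the model again.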

Next, we call GPT to suggest the types of links between the entities and the artefact. We call GPT separately for each entity. If the model has already processed an entity in the same context, we use the link type stored in the cache. The links are collected in the variable `entities` and are later stored in the variable `texts_output`. We show the first item of this variable.

In [9]:
entities = openai_link_suggestion(model, texts_input)
texts_output = add_linking_data_to_texts_input(texts_input, entities)
print({"text_cleaned": texts_output[0]["text_cleaned"],
       "entities": [{"entity_text": entity["text"], 
                     "wikidata_id": entity["wikidata_id"]["id"], 
                     "link": list(entity["link"].values())[0]} for entity in texts_output[0]["entities"]]})

Retrieving entity "Bagnani" of text 100 from cache
Finished processing
️✅ Saved data to file linking_cache_7d1cb217b7ded2bec61ee23fc7daae71fde7b271.json
{'text_cleaned': 'Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. Acquired before 1882. C. 115', 'entities': [{'entity_text': 'Anubis', 'wikidata_id': 'Q47534', 'link': '1'}]}


## 4. Linking visualization

We visualize the results of the linking process by displaying the numeric linking code in superscript next to the entity in its context. Please note that the eleven numeric codes represent the relations between the entity and the artefact and stand for the following:

1. the entity is a person depicted on or by the artefact
2. the entity is a person that created the artefact
3. the entity is a person that discovered the artefact
4. the entity is a person that owned the artefact
5. the entity is a person in power during the period of the creation of the artefact
6. the entity is a location depicted on or by the artefact
7. the entity is a location where the artefact was created
8. the entity is a location where the artefact was produced
9. the entity is a location where the artefact was discovered
10. the entity is the artefact's current location
11. other entity type or other relation between entity and artefact

In [10]:
for text_id, text in enumerate(texts_output):
    if text_id < 3:
        display(HTML(utils.mark_entities_in_text(text["text_llm_output"], text["entities"])))
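
The function `utils.mark_entities_in_text` is defined in `utils.py` and is not shown here. To give an idea of how such a visualization can be built, the sketch below wraps each entity in a highlighted span and appends the link code in superscript. The name `mark_entities_sketch`, the use of `<mark>` and the `"?"` placeholder for missing links are assumptions, not the behaviour of the actual helper.

```python
import html


def mark_entities_sketch(text, entities, model="gpt-4o-mini"):
    """Return HTML in which each entity is highlighted and followed by its link code."""
    parts = []
    position = 0
    for entity in sorted(entities, key=lambda e: e["start_char"]):
        start, end = entity["start_char"], entity["end_char"]
        if start < position:
            continue  # skip entities that overlap an earlier one
        parts.append(html.escape(text[position:start]))
        code = html.escape(str(entity.get("link", {}).get(model, "?")))
        parts.append(f"<mark>{html.escape(text[start:end])}<sup>{code}</sup></mark>")
        position = end
    parts.append(html.escape(text[position:]))
    return "".join(parts)


example_text = "Statuette of the god Anubis. Bronze."
example_entities = [{"start_char": 21, "end_char": 27, "link": {"gpt-4o-mini": "1"}}]
print(mark_entities_sketch(example_text, example_entities))
```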

## 5. Save results

We save the results in a JSON file and the entities as a table in a CSV file. The helper functions used for this are defined in the file `utils.py`.

In [11]:
utils.save_data_to_json_file(texts_output, file_name="output_linking.json", in_colab=in_colab)

️✅ Saved data to file output_linking_34e26bfd19c837e400a5fcb214cd1e7a25304a12.json


In [12]:
utils.save_entities_as_table("entities_table.csv", texts_output)

️✅ Saved data to file entities_table.csv
