<a href="https://colab.research.google.com/github/pelagios/llm-lod-enriching-heritage/blob/main/notebooks/tasks/demo_qwen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Prepare Cultural Heritage Data for Named Entity Analysis

This demo notebook is a combination of the four notebooks on data preparation, named entity recognition (NER), entity disambiguation and entity linking using Qwen. 

### Rationale

This recipe takes some source content and preprocesses it for use across the rest of the cookbook.

You can either:

* upload your own sample text (e.g. transcribed text from a digitised item), or
* run queries to get sample content from the Cleveland Museum of Art API from a keyword (e.g. "Manet") or record ID search

The notebook shows the necessary steps to process the results and format them into a JSON file that can then be fed into a NER process.

### Overview of the process

Most parts of the recipe simply need to be run in a notebook environment such as Colab, Binder or Jupyter Notebook.

1. Initialise the notebook: run the first 'cells' to import the required libraries
2. Fetch input: this is the part where you can provide some specific input, in csv format
3. Preprocess text: clean text, detect language and split the text into tokens and sentences
4. Save the results: combine results into a structured JSON record: text (original and cleaned), language, sentences, tokens and meta information (counts and ids)

The fifth chapter provides alternatives for handling texts from other sources:

5. Alternatives for reading text data: text defined in code, text from file, text from pdf file and text from a website

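As a rough illustration of step 4, the structured record for one text could look like the dictionary below. This is a minimal sketch: the field names follow the description above, while the record id `example-1` and the one-word text are made up for the example.

```python
import json

# Illustrative record for a single, very short text (hypothetical id and content)
record = {
    "meta": {"language_id": "en", "id": "example-1", "data_source": "EMT",
             "char_count": 7, "token_count": 2, "sentence_count": 1},
    "text_original": "Bronze.",
    "text_cleaned": "Bronze.",
    "sentences": [{"id": 0, "start": 0, "end": 7, "text": "Bronze."}],
    "tokens": [{"id": 0, "text": "Bronze", "start": 0, "end": 6,
                "ws": False, "is_punct": False, "sent_id": 0},
               {"id": 1, "text": ".", "start": 6, "end": 7,
                "ws": False, "is_punct": True, "sent_id": 0}],
}

# Serialise the record to JSON, as would be done when saving the results
record_json = json.dumps(record, ensure_ascii=False, indent=2)
print(record_json)
```

The `start` and `end` values are character offsets into `text_cleaned`, so every sentence and token can be traced back to its position in the text.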
### Dependencies

This notebook depends on four files:

* utils.py: helper functions
* input_em.csv: default example input file (table)
* input_wanderer.pdf: example pdf input file
* input_wikipedia.txt: example text input file

Please make sure they are available in this folder so that the notebook can run smoothly. You can download them from [Github](https://github.com/pelagios/llm-lod-enriching-heritage/tree/main/notebooks/tasks).

## 1.1. Install required software libraries

Preprocessing data requires importing some standard software libraries. This step may take some time when run for the first time but in successive runs it will be a lot faster.

We start by checking whether the notebook is running on Google Colab. If that is the case, we need to connect to the notebook's environment.

In [1]:
import os

def check_notebook_environment_on_colab():
    """Test if run on Colab; if so, test if the environment is available and install it if not"""
    try:
        from google.colab import files  # only importable on Google Colab
    except ImportError:
        print("Not running on Google Colab")
        return
    try:
        os.chdir("/content/llm-lod-enriching-heritage/notebooks/tasks")
        print("Found notebook environment")
    except FileNotFoundError:
        print("notebook environment not found, installing...")
        !git clone https://github.com/pelagios/llm-lod-enriching-heritage.git
        os.chdir("/content/llm-lod-enriching-heritage/notebooks/tasks")

check_notebook_environment_on_colab()

Not running on Google Colab


Next, we import libraries that should normally be available, together with the local helper module `utils`.

In [5]:
import copy
import hashlib
import importlib
from IPython.display import clear_output, HTML
import json
import requests
import subprocess
import sys
import time
from typing import List, Dict, Any, Tuple, Optional
import unicodedata
import utils

📦 dotenv not found. Installing...
Collecting dotenv
  Downloading dotenv-0.9.9-py2.py3-none-any.whl.metadata (279 bytes)
Collecting python-dotenv (from dotenv)
  Downloading python_dotenv-1.2.1-py3-none-any.whl.metadata (25 kB)
Downloading dotenv-0.9.9-py2.py3-none-any.whl (1.9 kB)
Downloading python_dotenv-1.2.1-py3-none-any.whl (21 kB)
Installing collected packages: python-dotenv, dotenv
Successfully installed dotenv-0.9.9 python-dotenv-1.2.1
Finished installing dotenv
📦 langid not found. Installing...
Collecting langid
  Downloading langid-1.1.6.tar.gz (1.9 MB)
  Installing build dependencies: started
  Installing 

Next, we import packages which may require installation on this device.

In [6]:
spacy = utils.safe_import("spacy")
langid = utils.safe_import("langid")
regex = utils.safe_import("regex")
pl = utils.safe_import("polars")
pypdf = utils.safe_import("pypdf")

📦 pypdf not found. Installing...
Collecting pypdf
  Downloading pypdf-6.2.0-py3-none-any.whl.metadata (7.1 kB)
Downloading pypdf-6.2.0-py3-none-any.whl (326 kB)
Installing collected packages: pypdf
Successfully installed pypdf-6.2.0
Finished installing pypdf

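The helper `utils.safe_import` is defined in `utils.py` and is not shown in this notebook. A minimal sketch of how such a helper could work, assuming it falls back to `pip install` when the import fails (the name `safe_import_sketch` and the `pip_name` parameter are hypothetical and not part of `utils.py`):

```python
import importlib
import subprocess
import sys

def safe_import_sketch(package_name, pip_name=None):
    """Import a package; if it is missing, try to install it with pip first"""
    try:
        return importlib.import_module(package_name)
    except ImportError:
        print(f"{package_name} not found. Installing...")
        # Use the interpreter that runs the notebook so the package lands in the right environment
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name or package_name])
        importlib.invalidate_caches()
        return importlib.import_module(package_name)

# A standard-library module is already available, so no installation is triggered here
json_module = safe_import_sketch("json")
print(json_module.dumps({"ok": True}))
```

Installing through `sys.executable -m pip` rather than a bare `pip` command avoids installing into a different Python environment than the one the notebook kernel uses.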

Finally, we set the settings required for Google Colab.

In [7]:
in_colab = utils.check_google_colab()

## 1.2. Reading texts from a csv file

We use the artefact descriptions from the Egyptian Museum in Turin as example texts. We use two columns of the file: one with the identifier of the artefact and one with the description text.

First we load the settings from the configuration file and define a function for reading the texts.

In [None]:
import yaml

# Load the notebook settings (museum data source and model parameters)
with open('../config.yml') as yaml_file:
    data = yaml.safe_load(yaml_file)
print(data)


{'museum': {'name': 'egizio', 'file_name': 'input_em.csv', 'data_source': 'EMT', 'id_column_name': 'Inventory Number', 'text_column_name': 'Description'}, 'global': {'architecture': 'mac', 'model_name': 'qwen3:8b', 'max_processed': 6}}

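Because later cells read several keys from this configuration, it can help to check up front that they are present. The sketch below is an optional addition, not part of the original notebook; the function name `find_missing_config_keys` is hypothetical, and the list of required keys is limited to the ones visible in the printed configuration above.

```python
# Keys read from the configuration in the cells of this section and the NER section
REQUIRED_KEYS = {
    "museum": ["file_name", "data_source", "id_column_name", "text_column_name"],
    "global": ["model_name", "max_processed"],
}

def find_missing_config_keys(config, required=REQUIRED_KEYS):
    """Return a list of 'section.key' strings that are missing from the configuration"""
    missing = []
    for section, keys in required.items():
        for key in keys:
            if key not in config.get(section, {}):
                missing.append(f"{section}.{key}")
    return missing

# Example configuration in which 'max_processed' was left out on purpose
example_config = {"museum": {"name": "egizio", "file_name": "input_em.csv", "data_source": "EMT",
                             "id_column_name": "Inventory Number", "text_column_name": "Description"},
                  "global": {"architecture": "mac", "model_name": "qwen3:8b"}}
print(find_missing_config_keys(example_config))
```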

In [None]:
# Settings for the input file, taken from the configuration loaded above
file_name = data['museum']['file_name']
data_source = data['museum']['data_source']
id_column_name = data['museum']['id_column_name']
text_column_name = data['museum']['text_column_name']

def read_emt_data(file_name):
    """Read texts from the Egyptian Museum in Turin from a csv file"""
    try:
        # Keep only the identifier column and the description column
        table_pl = pl.read_csv(file_name).select([id_column_name, text_column_name])
        table_pl.write_csv("tmp.csv")  # write the selected columns to tmp.csv
        return [{"id": row[0], "data_source": data_source, "text_original": row[1]}
                for row in table_pl.iter_rows()]
    except Exception as error:
        print(f"{utils.CHAR_FAILURE} Cannot read data from file {file_name}! ({error})")
        return []

Next we call the function and store the result in the variable `texts`. We show the first text to check if the process was successful. Note that each text record includes the text identifier (`id`) as metadata.

In [26]:
texts = read_emt_data(file_name)
if len(texts) <= 0:
    print(f"{utils.CHAR_FAILURE} No texts found in file {file_name}!")
else:
    print("Text found:", texts[0])

Text found: {'id': 'C. 0115', 'data_source': 'EMT', 'text_original': 'Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. Acquired before 1882. C. 115'}

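For readers without polars, the same selection of an identifier column and a text column can be sketched with the standard-library `csv` module. The function name `read_rows_sketch`, the in-memory sample and its `Extra` column are assumptions made for this example only.

```python
import csv
import io

def read_rows_sketch(csv_file, id_column_name, text_column_name, data_source):
    """Read the id and text columns from an open CSV file and return a list of text records"""
    reader = csv.DictReader(csv_file)
    return [{"id": row[id_column_name],
             "data_source": data_source,
             "text_original": row[text_column_name]}
            for row in reader]

# In-memory sample standing in for input_em.csv (one row, with an extra column that is ignored)
sample_csv = io.StringIO(
    'Inventory Number,Description,Extra\n'
    'C. 0115,"Statuette of the god Anubis. Bronze.",x\n'
)
records = read_rows_sketch(sample_csv, "Inventory Number", "Description", "EMT")
print(records[0])
```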

## 1.3. Preprocess text for named entity analysis

Three steps are performed while preprocessing the texts:

1. Text cleanup: remove non-text characters and mask urls and email addresses
2. Detect the language of the text
3. Split the text into sentences and tokens

We start by defining four functions for performing the preprocessing tasks.

In [19]:
def cleanup_text(text: str) -> str:
    """Cleanup text: remove non-text characters, urls and email addresses"""
    text = unicodedata.normalize("NFC", text)
    text = regex.sub(r"[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]", "", text)
    text = regex.sub(r"[ \t\u00A0]+", " ", text)
    text = regex.sub(r"\s+", " ", text)
    text = regex.sub(r"https?://\S+", "<URL>", text)
    text = regex.sub(r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "<EMAIL>", text)
    return text

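To see what the cleanup does, the sketch below repeats the same steps with the standard-library `re` module instead of `regex` (an assumption made so the example has no dependencies) and applies them to a made-up sample string. Note that urls and email addresses are not deleted but replaced by the placeholders `<URL>` and `<EMAIL>`.

```python
import re
import unicodedata

def cleanup_text_sketch(text):
    """Same steps as cleanup_text, written with the standard-library re module"""
    text = unicodedata.normalize("NFC", text)
    # Remove control characters (tab, newline and carriage return are kept for the next steps)
    text = re.sub(r"[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]", "", text)
    text = re.sub(r"[ \t\u00A0]+", " ", text)                       # spaces, tabs, no-break spaces
    text = re.sub(r"\s+", " ", text)                                # all remaining whitespace runs
    text = re.sub(r"https?://\S+", "<URL>", text)                   # mask urls
    text = re.sub(r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "<EMAIL>", text)   # mask email addresses
    return text

# Made-up sample with a tab, a double space, a no-break space and a control character
sample = "See\thttps://example.org/item/1  or mail\u00a0info@example.org\x07 today"
print(cleanup_text_sketch(sample))
```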
In [20]:
def preprocess_text(text, language_id):
    """Preprocess a single text: divide it in sentences and tokens"""
    try:
        spacy_model = spacy.blank(language_id)
    except Exception:
        # Fall back to spaCy's multi-language pipeline ("xx") for unsupported languages
        print(f"{utils.CHAR_FAILURE} Cannot load model for language {language_id}")
        spacy_model = spacy.blank("xx")
    spacy_model.add_pipe("sentencizer")
    preprocessed_text = spacy_model(text)
    sentences = [{"id": sentence_id, 
                  "start": sentence.start_char, 
                  "end": sentence.end_char, 
                  "text": sentence.text} for sentence_id, sentence in enumerate(preprocessed_text.sents)]
    tok2sent = {token.i: sentence_id for sentence_id, sentence in enumerate(preprocessed_text.sents) 
                                     for token in sentence}
    tokens = [{"id": token.i,
               "text": token.text,
               "start": token.idx,
               "end": token.idx + len(token.text),
               "ws": token.whitespace_ != "",
               "is_punct": token.is_punct,
               "sent_id": tok2sent.get(token.i)} for token in preprocessed_text]
    return sentences, tokens

In [22]:
def preprocess_texts(texts) -> List[Dict[str, Any]]:
    """Preprocess a list of texts and return the results as a list of dictionaries"""
    results: List[Dict[str, Any]] = []
    for index, text in enumerate(texts):
        utils.squeal(f"Processing text {index + 1}")
        sentences, tokens = preprocess_text(text["text_cleaned"], text["language_id"])
        # Copy all non-text fields (id, data source, language) to the meta information
        results.append({"meta": {**{key: text[key] for key in text if not regex.search("text", key)},
                                 "char_count": len(text["text_cleaned"]),
                                 "token_count": len(tokens),
                                 "sentence_count": len(sentences)},
                        "text_original": text["text_original"],
                        "text_cleaned": text["text_cleaned"],
                        "sentences": sentences,
                        "tokens": tokens})
    print("Finished processing")
    return results

In [24]:
def show_example_text(text, skipped_fields=()):
    """Show example text without the skipped fields; skipped tokens are shown abbreviated"""
    text_shown = {key: text[key] for key in text if key not in skipped_fields}
    if "tokens" in skipped_fields:
        # Show only the first three tokens, followed by "..." if there are more
        tokens_shown = text["tokens"][:3] + (["..."] if len(text["tokens"]) > 3 else [])
        text_shown = text_shown | {"tokens": tokens_shown}
    print(text_shown)

Next, we apply the cleanup function to the texts and store the results in the variable `texts_cleaned`. We show the first text to check if the process was successful.

In [27]:
texts_cleaned = [text | {"text_cleaned": cleanup_text(text["text_original"])} for text in texts]
show_example_text(texts_cleaned[0], skipped_fields=["text_original"])

{'id': 'C. 0115', 'data_source': 'EMT', 'text_cleaned': 'Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. Acquired before 1882. C. 115'}


After this, we apply the language detection function to the texts and store the results in the variable `texts_with_language_ids`. Again, we show the first text to check if the process was successful. A parameter `language_id` should appear at the start of the data.

In [29]:
texts_with_language_ids = [{"language_id": utils.detect_text_language(text["text_cleaned"])} | text
                            for text in texts_cleaned]
show_example_text(texts_with_language_ids[0], skipped_fields=["text_original"])

{'language_id': 'en', 'id': 'C. 0115', 'data_source': 'EMT', 'text_cleaned': 'Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. Acquired before 1882. C. 115'}


Finally, we apply the preprocess function to the texts and store the results in the variable `texts_preprocessed`. We show the first text to check if the process was successful. The texts have been divided into sentences and tokens.

In [30]:
texts_preprocessed = preprocess_texts(texts_with_language_ids)
show_example_text(texts_preprocessed[0], skipped_fields=["text_original", "text_cleaned", "tokens"])

Processing text 100
Finished processing
{'meta': {'language_id': 'en', 'id': 'C. 0115', 'data_source': 'EMT', 'char_count': 92, 'token_count': 23, 'sentence_count': 4}, 'sentences': [{'id': 0, 'start': 0, 'end': 28, 'text': 'Statuette of the god Anubis.'}, {'id': 1, 'start': 29, 'end': 36, 'text': 'Bronze.'}, {'id': 2, 'start': 37, 'end': 85, 'text': 'Late Period (722-332 BC).. Acquired before 1882.'}, {'id': 3, 'start': 86, 'end': 92, 'text': 'C. 115'}], 'tokens': [{'id': 0, 'text': 'Statuette', 'start': 0, 'end': 9, 'ws': True, 'is_punct': False, 'sent_id': 0}, {'id': 1, 'text': 'of', 'start': 10, 'end': 12, 'ws': True, 'is_punct': False, 'sent_id': 0}, {'id': 2, 'text': 'the', 'start': 13, 'end': 16, 'ws': True, 'is_punct': False, 'sent_id': 0}, '...']}

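The `start` and `end` values of sentences and tokens are character offsets into the cleaned text. A quick way to verify a preprocessed record is to check that every span reproduces its own text. The sketch below does this for an abbreviated version of the first record; the helper name `find_offset_errors` is hypothetical.

```python
def find_offset_errors(text, spans):
    """Return the spans whose start/end offsets do not reproduce their text"""
    return [span for span in spans if text[span["start"]:span["end"]] != span["text"]]

# Abbreviated version of the first record shown above
text_cleaned = "Statuette of the god Anubis. Bronze."
sentences = [{"id": 0, "start": 0, "end": 28, "text": "Statuette of the god Anubis."},
             {"id": 1, "start": 29, "end": 36, "text": "Bronze."}]
tokens = [{"id": 0, "text": "Statuette", "start": 0, "end": 9},
          {"id": 1, "text": "of", "start": 10, "end": 12},
          {"id": 2, "text": "the", "start": 13, "end": 16}]

# An empty list means that all offsets are consistent with the text
print(find_offset_errors(text_cleaned, sentences + tokens))
```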

# 2. Named Entity Recognition

Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task that involves identifying and classifying named entities (like people, places, organizations) within text. For example, in the sentence "Shakespeare wrote Romeo and Juliet in London", a NER system would identify "Shakespeare" as a person, "Romeo and Juliet" as a work of art, and "London" as a location. NER is crucial for extracting structured information from unstructured text, making it valuable for tasks like information retrieval, question answering, and metadata enrichment. In this notebook, we explore how to perform NER with a Large Language Model.

### Rationale

This notebook demonstrates how to use a locally run Qwen model to perform Named Entity Recognition (NER) by converting input text into annotated markdown format. Rather than using traditional NLP libraries, we leverage a Large Language Model's natural language understanding capabilities to identify and classify named entities. The notebook takes plain text as input and outputs markdown where entities are annotated in the format [Entity](TYPE), such as [London](LOCATION). This approach showcases how LLMs can be used for structured information extraction tasks in cultural heritage metadata enrichment.


### Process Overview

The process consists of the following steps:

1. **Import required software libraries**: We start with importing required software libraries
2. **Read the text that requires processing**: Next we obtain the input text from the preprocessing notebook
3. **Named entity recognition**: The text is sent to the LLM with a prompt that instructs it to identify entities. The LLM marks entities in markdown format: \[entity text\]\(entity type\)
4. **Named entity visualization**: The annotated text is displayed with colour-coded entity highlighting
5. **Save results**: Save the results of named entity recognition for future processing

This approach leverages the LLM's natural language understanding while producing structured, machine-readable output.

### Dependencies

This notebook depends on three files:

* utils.py: helper functions
* output_data_preparation_11f98441067263d80ee1a6bac27babf0f2c6734b.json: output file of data preparation task
* ner_cache.json: context-dependent cache of names found earlier

Please make sure they are available in this folder so that the notebook can run smoothly. You can download them from [Github](https://github.com/pelagios/llm-lod-enriching-heritage/tree/main/notebooks/tasks).

## 2.1. Import required software libraries

This function is needed in different sections, so we define it here.

In [32]:
CONTEXT = """extracted from records of objects in the collection of
             the Egyptian museum, Torino, the Museo Egizio – Torino"""

def make_prompt(texts_input, target_labels):
    """Create an LLM prompt, given a text and target labels and return it"""
    return f"""
Convert the following text {CONTEXT} into a structured markdown format,
where you annotate the entities in the text in the following format:
[Tom](PERSON) went to [New York](PLACE).

Look for the following entity types:
{target_labels}

Do this for the following text:
{texts_input}

Only return the markdown output, nothing else.
"""

## 2.2. Read the text that requires processing

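The texts should have been preprocessed as described in section 1.3; in the separate notebooks this is done by the `data_preparation.ipynb` notebook, which writes an output file. Here we copy the preprocessed texts from the variable `texts_preprocessed`, select the cleaned text of each record and show the first one.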

In [33]:
input_data = copy.deepcopy(texts_preprocessed)
texts_input = [text["text_cleaned"] for text in input_data]
print(texts_input[0])

Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. Acquired before 1882. C. 115


## 2.3. Named entity recognition with Qwen

Here we use [Qwen](https://en.wikipedia.org/wiki/Qwen), a locally-run model developed by the company Alibaba. The model has hardware requirements which your computer may not satisfy: it needs a GPU and about 12 gigabytes of memory. When running this model on Google Colab, it is recommended to use the runtime environment `T4 GPU`. If the model is too slow for you, the model `qwen3:0.6b` can be used as an alternative.

In [41]:
importlib.reload(utils)

<module 'utils' from '/Users/marco/Documents/unito/llm-lod-enriching-heritage/notebooks/tasks/utils.py'>

In [53]:
MODEL = data['global']['model_name']
MAX_PROCESSED = data['global']['max_processed']

texts_output = utils.ollama_run(MODEL, texts_input[:MAX_PROCESSED], make_prompt, in_colab)
print(texts_output[1])

Processing text 6 with model qwen3:8b
[GIN] 2025/11/14 - 11:20:45 | 200 | 42.656292875s |       127.0.0.1 | POST     "/api/generate"
Finished processing
✅ Saved data to file ner_cache_005121f7aa72cf9a7ed1f0b00db3836e9a948352.json
Statue of the goddess Sekhmet. Granodiorite. New Kingdom, 18th Dynasty, reign of [Amenhotep III](PERSON) (1390-1353 BC). [Thebes](LOCATION).. [Drovetti](PERSON) collection (1824). C. 256

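The log line above shows a `POST` request to `/api/generate`, the text generation endpoint of the local Ollama server. The helper `utils.ollama_run` is not shown in this notebook; the sketch below only illustrates what a single non-streaming request to that endpoint could look like. The function names are hypothetical, and the address `http://localhost:11434` is Ollama's usual default, assumed here.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default address of a local Ollama server

def build_generate_payload(model, prompt):
    """Build the JSON body for a non-streaming Ollama /api/generate request"""
    return {"model": model, "prompt": prompt, "stream": False}

def generate_with_ollama(model, prompt, url=OLLAMA_URL, timeout=300):
    """Send a prompt to a running Ollama server and return the generated text"""
    body = json.dumps(build_generate_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]

payload = build_generate_payload("qwen3:8b", "Annotate: Statue of the goddess Sekhmet.")
print(payload)
# generate_with_ollama(...) needs a running Ollama server with the model installed,
# so it is defined but not called in this sketch
```

With `"stream": False` the server returns one JSON object whose `response` field holds the complete answer, which is simpler to handle than the default streamed output.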

## 2.4. Named entity visualization

Here we show the results of named entity recognition in a more readable format.

First we define two helper functions.

In [54]:
def extract_entities_from_markdown(texts_output):
    """Extract the locations and labels of the entities from the markdown and return these"""
    pattern = r'\[([^\]]+)\]\(([^)]+)\)'
    text_prefix = ""
    current_char = 0
    entities = []
    for match in regex.finditer(pattern, texts_output):
        text_prefix += texts_output[current_char: match.start()]
        entities.append({"text": match.group(1),
                         "label": match.group(2),
                         "start_char": len(text_prefix),
                         "end_char": len(text_prefix) + len(match.group(1))})
        text_prefix += match.group(1)
        current_char = match.end()
    text_prefix += texts_output[current_char:]
    return entities, text_prefix

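The following worked example shows what the extraction returns for a short annotated string. It is a self-contained sketch that repeats the logic of `extract_entities_from_markdown` with the standard-library `re` module (an assumption made to avoid dependencies); the sample string is a shortened version of the model output shown earlier.

```python
import re

def extract_entities_sketch(markdown_text):
    """Return (entities, plain_text) for text annotated as [entity](LABEL)"""
    pattern = r"\[([^\]]+)\]\(([^)]+)\)"
    plain_text = ""
    position = 0
    entities = []
    for match in re.finditer(pattern, markdown_text):
        # Copy the unannotated text before the entity, then the entity text itself
        plain_text += markdown_text[position:match.start()]
        entities.append({"text": match.group(1), "label": match.group(2),
                         "start_char": len(plain_text),
                         "end_char": len(plain_text) + len(match.group(1))})
        plain_text += match.group(1)
        position = match.end()
    plain_text += markdown_text[position:]
    return entities, plain_text

annotated = "Reign of [Amenhotep III](PERSON). [Thebes](LOCATION)."
entities, plain_text = extract_entities_sketch(annotated)
print(plain_text)
for entity in entities:
    print(entity)
```

The offsets `start_char` and `end_char` refer to the plain text with the markup removed, not to the annotated markdown, which is why the plain text is rebuilt while the entities are collected.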
In [55]:
def check_llm_output_text(text_llm_output, texts_input):
    """Warn if the text returned by the LLM (without markup) differs from the input text"""
    if text_llm_output != texts_input:
        print(f"{utils.CHAR_FAILURE} Output text of named entity recognition is different from input text!")

Next we call the function to extract the entities from the results, check the output text and visualize the results for the first five texts.

In [56]:
entities_all = []
for index, text in enumerate(texts_output):
    entities, text_llm_output = extract_entities_from_markdown(text)
    entities_all.append({"entities": entities, "text_llm_output": text_llm_output})
    if index < 5:
        check_llm_output_text(text_llm_output, texts_input[index])
        display(HTML(utils.mark_entities_in_text(text_llm_output, entities)))

❌ Output text of named entity recognition is different from input text!


❌ Output text of named entity recognition is different from input text!

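The warning only says that the texts differ, not where. One possible way to inspect the differences is the standard-library `difflib` module. The sketch below is an optional addition, not part of the original notebook, and the function name `describe_differences` is hypothetical.

```python
import difflib

def describe_differences(text_input, text_llm_output):
    """Return a list of (operation, input_fragment, output_fragment) for every changed span"""
    matcher = difflib.SequenceMatcher(a=text_input, b=text_llm_output, autojunk=False)
    return [(tag, text_input[i1:i2], text_llm_output[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]

# Toy strings: the second one contains one extra character
print(describe_differences("abc", "abXc"))
# Identical texts give an empty list
print(describe_differences("same text", "same text"))
```

Applied to an input text and the corresponding `text_llm_output`, this lists the characters that the model dropped, inserted or replaced.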

## 2.5. Save results

Here we add the named entity analysis to the input data. Texts that were not processed get empty results.

In [57]:
for index, entities in enumerate(entities_all):
    input_data[index]["entities"] = entities["entities"]
    input_data[index]["text_llm_output"] = entities["text_llm_output"]
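# Texts beyond the processed ones get empty results; start after the last processed text
for rest_index in range(len(entities_all), len(input_data)):
    input_data[rest_index]["entities"] = []
    input_data[rest_index]["text_llm_output"] = []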

# 3. Disambiguation with Candidates

This notebook disambiguates the entities found by the NER notebook by linking them to WikiData concepts.

### Rationale

When using LLMs, we can leverage context to help determine correct identifiers for entities found. One of the largest challenges with LLMs is getting them to generate the correct identifier for specific entities. Without context, an LLM will confidently generate a believable looking identifier code. When checked, however, users will often find these codes do not exist or are entirely wrong.

We solve this problem with context. LLMs can receive context in one of two ways: either we give them the context or we use an LLM agentically with tools so that it can retrieve the context for itself. Both have their advantages, but both work on the same principle: context allows the LLM to get the correct identifier code so that it does not need to hallucinate one. While hallucinations are still possible, the chances are reduced if we provide a list of options for the LLM to choose from.

In this notebook, we will explore the first of these options, where we provide the LLM with a list of candidates that were generated in the previous data notebook. To make things easier, we have pasted the output from that notebook here.

It is also worth noting that providing the LLM with the necessary context is often considerably cheaper (assuming you are using a paid model) than letting the model agentically query the web or use other tools. We will see this in the next notebook.

### Processing overview

The process consists of the following steps:

1. **Import required software libraries**: We start with importing required software libraries
2. **Read the text that requires processing**: Next we obtain the input text from the NER notebook
3. **Candidate extraction**: For each entity we look up candidate ids with descriptions on wikidata.org
4. **Disambiguation**: The text is sent to an LLM with a prompt that instructs it to disambiguate entities based on the available candidates.
5. **Disambiguation visualization**: The annotated text is displayed with colour-coded entity highlighting
6. **Save results**: Save the results of the disambiguation process for future processing

### Dependencies

This notebook depends on four files:

* utils.py: helper functions
* output_ner_5f12fa7c16d33ea378148197569f999f774f7481.json: output file of NER task
* disambiguation_cache_wikidata.json: context-dependent cache of WikiData information found earlier
* disambiguation_cache_llm.json: context-dependent cache of LLM choices made earlier

Please make sure the files are available in this folder so that the notebook can run smoothly. You can download the files from Github.

## 3.1. Import required software libraries

These two helper functions are needed in different sections, so we define them here.

In [58]:
CONTEXT = """extracted from records of objects in the collection of 
             the Egyptian museum, Torino, the Museo Egizio – Torino"""

def make_prompt(entity, text, candidates):
    """Create an LLM prompt, given an entity, its text and the candidate ids, and return it"""
    return f"""
Disambiguate the entity "{entity}" in the following text {CONTEXT}.

{text}

Here are the candidates, in json format:

{candidates}

Only return the relevant id, nothing else.
"""

In [59]:
def add_wikidata_ids_to_texts_input(texts_input, entities):
    """Insert the retrieved wikidata ids into the variable texts_input and return it"""
    entities_per_text = {}
    for entity in entities:
        if entity["text_id"] not in entities_per_text:
            entities_per_text[entity["text_id"]] = {}
        entities_per_text[entity["text_id"]][entity["entity_text"]] = entity
    for text_id, text in enumerate(texts_input):
        for entity in text["entities"]:
            if text_id in entities_per_text and entity["text"] in entities_per_text[text_id]:
                entity["wikidata_id"] = entities_per_text[text_id][entity["text"]]["wikidata_id"]
    return texts_input

## 3.2. Read the texts that require processing

The texts should have been processed by the named entity recognition step (section 2; `ner.ipynb` in the separate notebooks). Here we copy the texts and the associated metadata from the variable `input_data` and show the first text with its entities.

In [60]:
texts_input = copy.deepcopy(input_data)
print({"text_cleaned": texts_input[0]["text_cleaned"], 
       "entities": texts_input[0]["entities"]})

{'text_cleaned': 'Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. Acquired before 1882. C. 115', 'entities': []}


## 3.3. Disambiguation candidate generation

We need candidate ids for the entities to make the task for the LLM easier. We obtain these candidate ids by searching for the entities on wikidata.org. The search step returns up to seven candidate ids, each with a description text; candidates without a description are dropped. In order not to overload the website, we use a cache for storing entities that were looked up earlier. We also wait for 10 seconds (variable `SLEEP_TIME_FETCH_PAGE`, read from the configuration file) between the searches.

First we define three helper functions.

In [None]:
SLEEP_TIME_FETCH_PAGE = data['global']['sleep_time']
USER_AGENT = data['global']['user_agent']

def query_wikidata_for_single_entity(entity_text):
    """Read Wikidata data on entity_text and return it"""
    headers = {
        "User-Agent": USER_AGENT,
        "Accept-Encoding": "gzip, deflate",
    }
    params = {
        'action': 'wbsearchentities',
        'language': 'en',
        'format': 'json',
        'limit': '7',
        'search': entity_text 
    }
    url = 'https://www.wikidata.org/w/api.php'
    time.sleep(SLEEP_TIME_FETCH_PAGE)
    wikidata_data = requests.get(url, params=params, headers=headers)
    return wikidata_data

In [62]:
def extract_entities_from_ner_input(texts_input):
    """For each entity in the input text return the entity text, context text and context text id""" 
    return [{"entity_text": entity["text"],
             "text_id": index,
             "text": text["text_cleaned"]} 
            for index, text in enumerate(texts_input) for entity in text["entities"]]

In [None]:
DISAMBUGUATION_CACHE_FILE_WIKIDATA = data["utils"]["disambiguation_cache_wikidata"]


def find_wikidata_candidates_for_entities(entities):
    """Lookup candidate ids for the entities on wikidata.org and return them"""
    wikidata_cache = utils.read_json_file(DISAMBUGUATION_CACHE_FILE_WIKIDATA)
    for entity in entities:
        if entity["entity_text"] not in wikidata_cache.keys():
            utils.squeal(f"Looking up entity {entity['entity_text']} (text id {entity['text_id'] + 1}) on WikiData.org...")
            wikidata_data = query_wikidata_for_single_entity(entity["entity_text"])
            wikidata_cache[entity["entity_text"]] = [{"id": candidate["id"], "description": candidate["description"]} 
                                                     for candidate in json.loads(wikidata_data.text)["search"]
                                                     if "description" in candidate.keys()]
            utils.write_json_file(DISAMBUGUATION_CACHE_FILE_WIKIDATA, wikidata_cache)
    print("Finished processing")
    utils.save_data_to_json_file(wikidata_cache, file_name=DISAMBUGUATION_CACHE_FILE_WIKIDATA, in_colab=in_colab)
    for entity in entities:
        entity["candidates"] = wikidata_cache[entity["entity_text"]]
    return entities

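The combination of a cache and a waiting time can be reduced to a small pattern: only contact the remote service on a cache miss, and pause before each request. The sketch below illustrates that pattern in isolation; `cached_lookup` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real Wikidata search so that the example runs offline.

```python
import time

def cached_lookup(key, cache, fetch, sleep_time=0.0):
    """Return cache[key]; call fetch(key) only on a cache miss, after waiting sleep_time seconds"""
    if key not in cache:
        time.sleep(sleep_time)  # be polite to the remote service
        cache[key] = fetch(key)
    return cache[key]

calls = []

def fake_fetch(entity_text):
    """Stand-in for a Wikidata search: records each call and returns a fixed candidate list"""
    calls.append(entity_text)
    return [{"id": "Q42606", "description": "ninth Pharaoh of the Eighteenth dynasty of Egypt"}]

cache = {}
first = cached_lookup("Amenhotep III", cache, fake_fetch)
second = cached_lookup("Amenhotep III", cache, fake_fetch)  # served from the cache
print(len(calls), first == second)
```

Because the waiting time is only spent on cache misses, rerunning the notebook on the same texts causes no new requests and no delay.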
Next we extract the entities from the input data and call a helper function for finding the candidate ids. These will be stored in the `entities` variable. We show the first item of the results.

In [64]:
entities = extract_entities_from_ner_input(texts_input)
entities = find_wikidata_candidates_for_entities(entities)
print({key: entities[0][key] for key in ['entity_text', 'candidates']})

Looking up entity god Harpocrates (text id 4) on WikiData.org...
Finished processing
✅ Saved data to file disambiguation_cache_wikidata_c04128363cdef2ae6a8a1a81c597e38285555a11.json
{'entity_text': 'Amenhotep III', 'candidates': [{'id': 'Q42606', 'description': 'ninth Pharaoh of the Eighteenth dynasty of Egypt'}, {'id': 'Q55018696', 'description': 'operatic character in the opera Akhnaten by Philip Glass; father of Akhenaton'}, {'id': 'Q96185335', 'description': 'Facsimile, Anen (TT 120), Amenhotep III, Queen Tiye, kiosk by Nina de Garis Davies (MET, 33.8.8)'}, {'id': 'Q96185392', 'description': 'Facsimile, Anen, TT 226, Amenhotep III, Mutemwia by Nina de Garis Davies (MET, 15.5.1)'}, {'id': 'Q97778073', 'description': 'scholarly book'}, {'id': 'Q116247775', 'description': 'head, Amenhotep III, Blue Crown at the Metropolitan Museum of Art (MET, 56.138)'}]}


## 3.4. Disambiguate entities with Qwen

Here we use [Qwen](https://en.wikipedia.org/wiki/Qwen), a locally-run model developed by the company Alibaba. The model has hardware requirements which your computer may not satisfy: it needs a GPU and about 12 gigabytes of memory. When running this model on Google Colab, it is recommended to use the runtime environment `T4 GPU`.

First we define two helper functions.

In [65]:
def process_llm_response(response, candidates, model):
    """Clean up the LLM response and return the associated Wikidata candidate"""
    # Reduce the response to the Wikidata id (Q followed by digits) it contains
    wikidata_id = regex.sub(r"^.*(Q\d+).*$", r"\1", response)
    if not regex.search(r"^Q\d+$", wikidata_id):
        return {"id": "", "description": "", "model": model}
    for candidate in candidates:
        if candidate["id"] == wikidata_id:
            return candidate | {"model": model}
    # The id returned by the LLM is not among the candidates
    return {"id": "", "description": "", "model": model}

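LLM responses do not always consist of a bare identifier; they may contain extra words or span several lines. The sketch below is a possible variant of the selection step, not the notebook's own method: it assumes that the last Wikidata id mentioned in the response is the model's answer and uses the standard-library `re` module. The name `select_candidate_sketch` is hypothetical.

```python
import re

def select_candidate_sketch(response, candidates, model):
    """Pick the candidate whose id is the last Wikidata id mentioned in the response"""
    empty = {"id": "", "description": "", "model": model}
    found_ids = re.findall(r"Q\d+", response)
    if not found_ids:
        return empty
    for candidate in candidates:
        if candidate["id"] == found_ids[-1]:
            return {**candidate, "model": model}
    return empty  # the model answered with an id that was not offered

candidates = [{"id": "Q42606", "description": "ninth Pharaoh of the Eighteenth dynasty of Egypt"},
              {"id": "Q97778073", "description": "scholarly book"}]
print(select_candidate_sketch("Reasoning...\nThe best match is Q42606.", candidates, "qwen3:8b"))
print(select_candidate_sketch("Q1", candidates, "qwen3:8b"))
```

Rejecting ids that are not in the candidate list is what keeps a hallucinated identifier from entering the results.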
In [None]:
DISAMBUGUATION_CACHE_FILE_LLM = data["utils"]["disambiguation_cache_llm"]

def ollama_wikidata_id_selection(model, entities):
    """Let the model select a Wikidata id for each entity, using cached choices when available"""
    llm_cache = utils.read_json_file(DISAMBUGUATION_CACHE_FILE_LLM)
    for entity in entities:
        if (entity["entity_text"] in llm_cache and 
            entity["text"] in llm_cache[entity["entity_text"]] and
            model in llm_cache[entity["entity_text"]][entity["text"]]):
            utils.squeal(f"Retrieving entity \"{entity['entity_text']}\" of text {entity['text_id'] + 1} from cache")
            entity["wikidata_id"] = llm_cache[entity["entity_text"]][entity["text"]][model] | {"model": model}
        else:
            if "ollama" in sys.modules:
                ollama = utils.importlib.import_module("ollama")
            else:
                ollama = utils.import_ollama_module()
            utils.install_ollama_model(model, ollama)
            utils.squeal(f"Sending entity \"{entity['entity_text']}\" of text {entity['text_id'] + 1} to {model}")
            prompt = make_prompt(entity["entity_text"], entity["text"], entity["candidates"])
            response = utils.process_text_with_ollama(model, prompt, ollama)
            entity["wikidata_id"] = process_llm_response(response, entity["candidates"], model)
            if entity["entity_text"] not in llm_cache:
                llm_cache[entity["entity_text"]] = {}
            if entity["text"] not in llm_cache[entity["entity_text"]]:
                llm_cache[entity["entity_text"]][entity["text"]] = {}
            llm_cache[entity["entity_text"]][entity["text"]][model] = {key: value 
                                                                       for key, value in entity["wikidata_id"].items()}
            utils.write_json_file(DISAMBUGUATION_CACHE_FILE_LLM, llm_cache)
            time.sleep(2)
    print("Finished processing")
    utils.save_data_to_json_file(llm_cache, file_name=DISAMBUGUATION_CACHE_FILE_LLM, in_colab=in_colab)
    return entities

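The LLM cache used above is a nested dictionary with three levels: entity text, then context text, then model name. The sketch below shows that layout with two small helper functions; the names `store_choice` and `lookup_choice` are hypothetical and the example text is shortened.

```python
def store_choice(llm_cache, entity_text, text, model, choice):
    """Store the choice made by a model for an entity in a given text"""
    llm_cache.setdefault(entity_text, {}).setdefault(text, {})[model] = dict(choice)
    return llm_cache

def lookup_choice(llm_cache, entity_text, text, model):
    """Return the cached choice, or None if this entity/text/model combination is new"""
    return llm_cache.get(entity_text, {}).get(text, {}).get(model)

llm_cache = {}
context_text = "Statue of the goddess Sekhmet. Reign of Amenhotep III."
store_choice(llm_cache, "Amenhotep III", context_text, "qwen3:8b",
             {"id": "Q42606", "description": "ninth Pharaoh of the Eighteenth dynasty of Egypt"})

print(lookup_choice(llm_cache, "Amenhotep III", context_text, "qwen3:8b"))
print(lookup_choice(llm_cache, "Amenhotep III", context_text, "qwen3:0.6b"))
```

Keying the cache on the context text as well as the entity text means that the same name in a different description is disambiguated again, and keying on the model keeps the choices of different models apart.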
In [67]:
MODEL = data['global']['model_name']
MAX_PROCESSED = data['global']['max_processed']
MAX_PROCESSED_ENTITIES = len([entity for entity in entities if entity["text_id"] < MAX_PROCESSED])

processed_entities = ollama_wikidata_id_selection(MODEL, entities[:MAX_PROCESSED_ENTITIES])
texts_output = add_wikidata_ids_to_texts_input(texts_input[:MAX_PROCESSED], processed_entities)
print({key: value for key, value in processed_entities[0].items() if key in ["entity_text", "text", "wikidata_id"]})

Sending entity "Drovetti" of text 5 to qwen3:8b
[GIN] 2025/11/14 - 11:35:10 | 200 | 43.168037208s |       127.0.0.1 | POST     "/api/generate"
Finished processing
✅ Saved data to file disambiguation_cache_llm_82906bd5bcb60d6477641cc8b9126729125d4a6d.json
{'entity_text': 'Amenhotep III', 'text': 'Statue of the goddess Sekhmet. Granodiorite. New Kingdom, 18th Dynasty, reign of Amenhotep III (1390-1353 BC). Thebes.. Drovetti collection (1824). C. 256', 'wikidata_id': {'id': 'Q42606', 'description': 'ninth Pharaoh of the Eighteenth dynasty of Egypt', 'model': 'qwen3:8b'}}


## 3.5. Disambiguation visualization

Here we show the results of the disambiguation task in a more readable format, using the helper function `mark_entities_in_text`. The function is not defined in this notebook: it lives in the file `utils.py` because other notebooks use it as well.

We call the helper function and show the first three texts with their entities and Wikidata IDs. We pass it the text that was output by the named entity recognition step (`text_llm_output`) rather than the cleaned text, because the two sometimes differ.
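
To give an impression of what such a helper does, here is a minimal sketch of marking entity spans in HTML. It is not the code from `utils.py`: the name `mark_entities_sketch`, the `<mark>`/`<sub>` markup and the shortened example text are assumptions for illustration. It only relies on the `start_char`, `end_char` and `wikidata_id` fields that the entities in this notebook carry, and assumes the offsets refer to the text that is passed in.

```python
import html


def mark_entities_sketch(text, entities):
    """Wrap each entity span in <mark> and append its Wikidata id, if present."""
    parts = []
    cursor = 0
    # Walk through the entities from left to right so the offsets stay valid
    for entity in sorted(entities, key=lambda e: e["start_char"]):
        start, end = entity["start_char"], entity["end_char"]
        if start < cursor:
            continue  # skip spans that overlap an earlier entity
        parts.append(html.escape(text[cursor:start]))
        wikidata_id = (entity.get("wikidata_id") or {}).get("id", "")
        suffix = f" <sub>{html.escape(wikidata_id)}</sub>" if wikidata_id else ""
        parts.append(f"<mark>{html.escape(text[start:end])}{suffix}</mark>")
        cursor = end
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)


# Shortened example text; the offsets below were computed for this string
example_text = "Thebes.. Drovetti collection (1824)."
example_entities = [
    {"text": "Drovetti", "start_char": 9, "end_char": 17, "wikidata_id": {"id": "Q822895"}},
    {"text": "Thebes", "start_char": 0, "end_char": 6, "wikidata_id": {"id": "Q101583"}},
]
print(mark_entities_sketch(example_text, example_entities))
```

In a notebook the returned string can be passed to `display(HTML(...))`, in the same way as the output of the real helper.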

In [68]:
for text_id, text in enumerate(texts_output):
    if text_id < 3:
        display(HTML(utils.mark_entities_in_text(text["text_llm_output"], text["entities"])))

# 4. Entity linking

This notebook links the entities found in the previous steps (NER, disambiguation) to the artefacts.

### Rationale

Making the relations between named entities explicit helps us to better understand the data. In museum artefact descriptions there are two major types of relations: relations among named entities and relations between named entities and artefacts. In the main data files used for testing the software, which come from the Egyptian Museum of Turin, the first type is quite rare. We therefore focus on finding links between artefacts and the named entities in their description texts.

We use large language models (LLMs) to derive the relations. To make the task manageable, we have restricted the relations to eleven types:

1. the entity is a person depicted on or by the artefact
2. the entity is a person that created the artefact
3. the entity is a person that discovered the artefact
4. the entity is a person that owned the artefact
5. the entity is a person in power during the period of the creation of the artefact
6. the entity is a location depicted on or by the artefact
7. the entity is a location where the artefact was created
8. the entity is a location where the artefact was produced
9. the entity is a location where the artefact was discovered
10. the entity is the artefact's current location
11. other entity type or other relation between entity and artefact

We include the definitions of these types in the prompt sent to the LLMs and ask them to select the best matching one. We offer each entity to the LLMs separately.
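
Because the prompt asks for a number, optionally followed by a clarification for option 11, the raw answer can be turned into a numeric code afterwards. The sketch below is one possible way to do this; `parse_link_answer` is a hypothetical helper that is not part of `utils.py`, and it assumes that the answer starts with the number. Answers that do not fit this pattern are returned unparsed so they can be inspected by hand.

```python
import re


def parse_link_answer(response):
    """Split an answer such as '4' or '11. royal epithet' into (code, clarification)."""
    match = re.match(r"\s*(\d{1,2})\b[\s.:)\-]*(.*)", response, flags=re.DOTALL)
    if match is None:
        return None, response.strip()  # no leading number found
    code = int(match.group(1))
    if not 1 <= code <= 11:
        return None, response.strip()  # number outside the eleven link types
    return code, match.group(2).strip()


# Made-up answers, for illustration only
for answer in ["4", "11. royal epithet", "The answer is 3", "12"]:
    print(repr(answer), "->", parse_link_answer(answer))
```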

### Processing overview

The process consists of the following steps:

1. Import required software libraries: We start with importing required software libraries
2. Read the text that requires processing: Next we obtain the input text from the Disambiguation notebook
3. Linking: The text is sent to Qwen with a prompt that instructs it to select the best type for the link between the artefact and the entity
4. Linking visualization: The link type is displayed in text with colour-coded entities
5. Save results: Save the results of the linking process for future processing

### Dependencies

This notebook depends on three files:

* utils.py: helper functions
* output_disambiguation_ba25101ddbe8830789bfdfdb3a5ba6312d6853e6.json: output file of disambiguation task
* linking_cache.json: context-dependent cache of linking analysis performed earlier

Please make sure they are available in this folder so that the notebook can run smoothly. You can download them from GitHub.

## 4.1. Import required software libraries

The two helper functions below are needed in different sections, so we define them here.

In [70]:
CONTEXT = """extracted from records of objects in the collection of 
             the Egyptian museum, Torino, the Museo Egizio – Torino"""

def make_linking_prompt(entity, text):
    """Create and return an LLM prompt asking how the entity relates to the artefact described in text"""
    return f"""
Considering the following description of a museum artefact {CONTEXT}:

{text}

Retrieve the relationship between this artefact and the following named entity, 
mentioned in the description:

{entity}

Please answer the following question: Why is this entity mentioned in the description? 
Please select your answer from the following options:

1. the entity is a person depicted on or by the artefact
2. the entity is a person that created the artefact
3. the entity is a person that discovered the artefact
4. the entity is a person that owned the artefact
5. the entity is a person in power during the period of the creation of the artefact
6. the entity is a location depicted on or by the artefact
7. the entity is a location where the artefact was created
8. the entity is a location where the artefact was produced
9. the entity is a location where the artefact was discovered
10. the entity is the artefact's current location
11. other entity type or other relation between entity and artefact

Answer only with a number. If you choose option 11, you may add a clarification.
"""

In [71]:
def add_linking_data_to_texts_input(texts_input, entities):
    """Insert the retrieved linking data into the variable texts_input and return it"""
    entities_per_text = {}
    for entity in entities:
        if entity["text_id"] not in entities_per_text:
            entities_per_text[entity["text_id"]] = {}
        entities_per_text[entity["text_id"]][entity["entity_text"]] = entity
    for text_id, text in enumerate(texts_input):
        for entity in text["entities"]:
            if text_id in entities_per_text and entity["text"] in entities_per_text[text_id]:
                entity["link"] = entities_per_text[text_id][entity["text"]]["link"]
    return texts_input
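
The toy example below illustrates the data shapes this helper works with. It re-implements the same matching logic in compact form under the hypothetical name `attach_links`, and the texts, the model name `toy-model` and the link value are invented for illustration. Note that an entity is matched on the position of its text in the list and on its surface text.

```python
def attach_links(texts, entities):
    """Copy each entity's "link" value onto the matching entity of its text."""
    links = {}
    for entity in entities:
        # Group the links by text id, then by entity surface text
        links.setdefault(entity["text_id"], {})[entity["entity_text"]] = entity["link"]
    for text_id, text in enumerate(texts):
        for entity in text["entities"]:
            if entity["text"] in links.get(text_id, {}):
                entity["link"] = links[text_id][entity["text"]]
    return texts


# Invented toy data: the second text (index 1) mentions two entities
toy_texts = [
    {"entities": []},
    {"entities": [{"text": "Thebes"}, {"text": "Drovetti"}]},
]
toy_entities = [
    {"text_id": 1, "entity_text": "Drovetti", "link": {"toy-model": "4"}},
]
result = attach_links(toy_texts, toy_entities)
print(result[1]["entities"])
```

Two consequences of this matching follow from the logic: the `text_id` of an entity has to equal the index of its text in the list, and two mentions of the same name within one text receive the same link.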

## 4.2. Read the texts that require processing

The texts should have been processed by the named entity recognition and disambiguation steps. In the separate notebooks, the file read here is an output file of the `disambiguation-candidates.ipynb` notebook, which in turn processed the `ner.ipynb` output. In this combined demo we instead make a deep copy of `texts_output` from the disambiguation step. We then show one of the texts (index 1) with its entities.

In [72]:
texts_input = copy.deepcopy(texts_output)
print({"text_cleaned": texts_input[1]["text_cleaned"], 
       "entities": texts_input[1]["entities"]})

{'text_cleaned': 'Statue of the goddess Sekhmet. Granodiorite. New Kingdom, 18th Dynasty, reign of Amenhotep III (1390-1353 BC). Thebes.. Drovetti collection (1824). C. 256', 'entities': [{'text': 'Amenhotep III', 'label': 'PERSON', 'start_char': 81, 'end_char': 94, 'wikidata_id': {'id': 'Q42606', 'description': 'ninth Pharaoh of the Eighteenth dynasty of Egypt', 'model': 'qwen3:8b'}}, {'text': 'Thebes', 'label': 'LOCATION', 'start_char': 111, 'end_char': 117, 'wikidata_id': {'id': 'Q101583', 'description': 'ancient Egyptian city', 'model': 'qwen3:8b'}}, {'text': 'Drovetti', 'label': 'PERSON', 'start_char': 120, 'end_char': 128, 'wikidata_id': {'id': 'Q822895', 'description': 'Italian diplomat, explorer and scholar', 'model': 'qwen3:8b'}}]}


## 4.3. Link entities with Qwen

Here we use [Qwen](https://en.wikipedia.org/wiki/Qwen), a locally-run model developed by the company Alibaba. The model has hardware requirements which your computer may not satisfy: it needs a GPU and about 12 gigabytes of memory. When running this model on Google Colab, it is recommended to use the runtime environment `T4 GPU`.

First we define a helper function:

In [None]:
LINKING_CACHE_FILE = data['utils']['link_cache_file']


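def ollama_link_suggestion(model, texts_input):
    """Select a link type for each entity with an Ollama model, reusing cached answers, and return the entities"""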
    entities = utils.extract_entities_from_ner_input(texts_input)
    linking_cache = utils.read_json_file(LINKING_CACHE_FILE)
    for entity in entities:
        if (entity["entity_text"] in linking_cache and 
            entity["text"] in linking_cache[entity["entity_text"]] and
            model in linking_cache[entity["entity_text"]][entity["text"]]):
            utils.squeal(f"Retrieving entity \"{entity['entity_text']}\" of text {entity['text_id'] + 1} from cache")
            if "link" not in entity: entity["link"] = {}
            entity["link"][model] = linking_cache[entity["entity_text"]][entity["text"]][model]
        else:
            utils.squeal(f"Sending entity \"{entity['entity_text']}\" of text {entity['text_id'] + 1} to {model}")
            if "ollama" in sys.modules:
                ollama = utils.importlib.import_module("ollama")
            else:
                ollama = utils.import_ollama_module()
            utils.install_ollama_model(model, ollama)
            prompt = make_linking_prompt(entity["entity_text"], entity["text"])
            if "link" not in entity: entity["link"] = {}
            entity["link"][model] = utils.process_text_with_ollama(model, prompt, ollama)
            if entity["entity_text"] not in linking_cache:
                linking_cache[entity["entity_text"]] = {}
            if entity["text"] not in linking_cache[entity["entity_text"]]:
                linking_cache[entity["entity_text"]][entity["text"]] = {}
            linking_cache[entity["entity_text"]][entity["text"]][model] = entity["link"][model]
            utils.write_json_file(LINKING_CACHE_FILE, linking_cache)
    print("Finished processing")
    utils.save_data_to_json_file(linking_cache, file_name=LINKING_CACHE_FILE, in_colab=in_colab)
    return entities
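
The cache used above is a nested dictionary keyed by entity text, then by description text, then by model name, so an answer is only reused when the same entity occurs in the same context and is sent to the same model. The sketch below shows this structure in isolation. The helpers `cache_get` and `cache_put` are hypothetical and not part of `utils.py`, and the stored answer `"4"` is a made-up value.

```python
import json


def cache_get(cache, entity_text, text, model):
    """Return the cached answer for this entity, text and model, or None."""
    return cache.get(entity_text, {}).get(text, {}).get(model)


def cache_put(cache, entity_text, text, model, answer):
    """Store an answer, creating the intermediate dictionaries on demand."""
    cache.setdefault(entity_text, {}).setdefault(text, {})[model] = answer


cache = {}
cache_put(cache, "Drovetti", "Drovetti collection (1824).", "qwen3:8b", "4")
print(cache_get(cache, "Drovetti", "Drovetti collection (1824).", "qwen3:8b"))  # 4
print(cache_get(cache, "Drovetti", "another description", "qwen3:8b"))  # None

# The structure contains only strings and dictionaries, so it survives a JSON round trip
restored = json.loads(json.dumps(cache))
print(restored == cache)  # True
```

Using `dict.setdefault` replaces the explicit `if ... not in ...` checks with a single statement; the behaviour is the same.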

Next we apply the helper function to choose the best link type for each entity:

In [None]:
model = data['global']['model_name']
MAX_PROCESSED = data['global']['max_processed']

processed_entities = ollama_link_suggestion(model, texts_input[:MAX_PROCESSED])
texts_output = add_linking_data_to_texts_input(texts_input[:MAX_PROCESSED], processed_entities)
print({"text_cleaned": texts_output[0]["text_cleaned"],
       "entities": [{"entity_text": entity["text"], 
                     "wikidata_id": entity["wikidata_id"], 
                     "link": list(entity["link"].values())[0]} for entity in texts_output[0]["entities"]]})

Sending entity "Drovetti" of text 3 to qwen3:8b
[GIN] 2025/11/14 - 11:39:05 | 200 |    1.136625ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/11/14 - 11:39:56 | 200 | 50.421623625s |       127.0.0.1 | POST     "/api/generate"
Finished processing
✅ Saved data to file linking_cache_97fa74d52115689dc382dba7c347d052c0eb9fbd.json
{'text_cleaned': 'Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. Acquired before 1882. C. 115', 'entities': []}


## 4.4. Linking visualization

We visualize the results of the linking process by displaying the numeric linking code in superscript next to the entity in its context. Please note that the eleven numeric codes represent the relations between the entity and the artefact and stand for the following:

1. the entity is a person depicted on or by the artefact
2. the entity is a person that created the artefact
3. the entity is a person that discovered the artefact
4. the entity is a person that owned the artefact
5. the entity is a person in power during the period of the creation of the artefact
6. the entity is a location depicted on or by the artefact
7. the entity is a location where the artefact was created
8. the entity is a location where the artefact was produced
9. the entity is a location where the artefact was discovered
10. the entity is the artefact's current location
11. other entity type or other relation between entity and artefact

In [75]:
for text_id, text in enumerate(texts_output):
    if text_id < 3:
        display(HTML(utils.mark_entities_in_text(text["text_llm_output"], text["entities"], linking_model=model)))