<a href="https://colab.research.google.com/github/pelagios/llm-lod-enriching-heritage/blob/main/notebooks/tasks/disambiguation_candidates.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Disambiguation with Candidates

This notebook disambiguates the entities found by the NER notebook by linking them to WikiData concepts.

### Rationale

When using LLMs, we can leverage context to help determine the correct identifiers for the entities found. One of the largest challenges with LLMs is getting them to generate the correct identifier for a specific entity. Without context, an LLM will confidently generate a believable-looking identifier code. When checked, however, these codes often turn out not to exist or to point to an entirely different concept.

We solve this problem with context. LLMs can receive context in one of two ways: either we give them the context or we use an LLM agentically with tools so that it can retrieve the context for itself. Both have their advantages, and both work on the same principle: context allows the LLM to pick the correct identifier code so that it does not need to hallucinate one. While hallucinations are still possible, the chances are reduced if we provide the LLM with a list of options to choose from.

In this notebook, we will explore the first of these options, where we provide the LLM with a list of candidates that were generated in the previous data notebook. To make things easier, we have pasted the output from that notebook here.

It is also worth noting that providing the LLM with the necessary context is often considerably cheaper (assuming you are using a paid model) than letting the model agentically query the web or use other tools. We will see this in the next notebook.

### Processing overview

The process consists of the following steps:

1. **Import required software libraries**: We start by importing the required software libraries.
2. **Read the text that requires processing**: Next we obtain the input text from the NER notebook.
3. **Candidate extraction**: We look up candidate ids with descriptions for each entity on wikidata.org.
4. **Disambiguation**: The text is sent to an LLM with a prompt that instructs it to disambiguate entities based on the available candidates.
5. **Disambiguation visualization**: The annotated text is displayed with colour-coded entity highlighting.
6. **Save results**: Save the results of the disambiguation process for future processing.

### Dependencies

This notebook depends on four files:

* utils.py: helper functions
* output_ner_5f12fa7c16d33ea378148197569f999f774f7481.json: output file of NER task
* disambiguation_cache_wikidata.json: context-dependent cache of WikiData information found earlier
* disambiguation_cache_llm.json: context-dependent cache of LLM choices made earlier

Please make sure the files are available in this folder so that the notebook can run smoothly. You can download the files from GitHub.

## 1. Import required software libraries

Disambiguation requires importing some standard software libraries. This step may take some time when run for the first time but in successive runs it will be a lot faster.

We start by checking whether the notebook is running on Google Colab. If that is the case, we need to connect to the notebook's environment.

In [None]:
import os

def check_notebook_environment_on_colab():
    """If running on Colab, switch to the notebook folder, cloning the repository first if needed"""
    try:
        from google.colab import files  # only importable on Google Colab
    except ImportError:
        print("Not running on Google Colab")
        return
    try:
        os.chdir("/content/llm-lod-enriching-heritage/notebooks/tasks")
        print("Found notebook environment")
    except FileNotFoundError:
        print("notebook environment not found, installing...")
        !git clone https://github.com/pelagios/llm-lod-enriching-heritage.git
        os.chdir("/content/llm-lod-enriching-heritage/notebooks/tasks")

check_notebook_environment_on_colab()

Next we import standard libraries which should always be available

In [None]:
from dotenv import load_dotenv
import json
from IPython.display import HTML
import os
import requests
import sys
import time
import utils

Next we import packages which may require installation on this device

In [None]:
openai = utils.safe_import("openai")
pd = utils.safe_import("pandas")
pydantic = utils.safe_import("pydantic")
spacy = utils.safe_import("spacy")

Finally we set settings required for Google Colab

In [None]:
in_colab = utils.check_google_colab()

The following two helper functions are needed in different sections, so we define them here.

In [None]:
CONTEXT = """extracted from records of objects in the collection of 
             the Egyptian museum, Torino, the Museo Egizio – Torino"""

def make_prompt(entity, text, candidates):
    """Create an LLM prompt for an entity, given its context text and candidate ids, and return it"""
    return f"""
Disambiguate the entity "{entity}" in the following text {CONTEXT}.

{text}

Here are the candidates, in json format:

{candidates}

Only return the relevant id, nothing else.
"""

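The prompt tells the LLM that the candidates are in JSON format, but interpolating a Python list into an f-string yields Python's own representation (single quotes) rather than JSON. The following is a minimal sketch of serializing the candidates first. The helper `format_candidates` is hypothetical (it is not part of this notebook or of `utils.py`), and the sample candidates are placeholders, not real lookup results.

```python
import json

def format_candidates(candidates):
    """Serialize candidate dicts as JSON text (hypothetical helper)."""
    # ensure_ascii=False keeps non-ASCII characters readable in the prompt
    return json.dumps(candidates, ensure_ascii=False, indent=2)

# Placeholder candidates for illustration only
sample_candidates = [
    {"id": "Q111", "description": "first placeholder description"},
    {"id": "Q222", "description": "second placeholder description"},
]
candidates_json = format_candidates(sample_candidates)
print(candidates_json)
```

The resulting string could then be passed as the `candidates` argument of `make_prompt`.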
In [None]:
def add_wikidata_ids_to_texts_input(texts_input, entities):
    """Insert the retrieved wikidata ids into the variable texts_input and return it"""
    entities_per_text = {}
    for entity in entities:
        if entity["text_id"] not in entities_per_text:
            entities_per_text[entity["text_id"]] = {}
        entities_per_text[entity["text_id"]][entity["entity_text"]] = entity
    for text_id, text in enumerate(texts_input):
        for entity in text["entities"]:
            if text_id in entities_per_text and entity["text"] in entities_per_text[text_id]:
                entity["wikidata_id"] = entities_per_text[text_id][entity["text"]]["wikidata_id"]
    return texts_input

## 2. Read the texts that require processing

The texts should have been processed by the ner.ipynb notebook. The file read here is an output file of that notebook. We read the texts and the associated metadata and show the first text with its entities.

In [None]:
infile_name = "output_ner_5f12fa7c16d33ea378148197569f999f774f7481.json"

with open(infile_name, "r", encoding="utf-8") as infile:
    # The with block closes the file automatically
    texts_input = json.load(infile)
print({"text_cleaned": texts_input[0]["text_cleaned"], 
       "entities": texts_input[0]["entities"]})

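The later cells read the keys `text_cleaned`, `text_llm_output` and `entities` from every record, and the key `text` from every entity. A quick structural check, sketched below, can catch an unexpected input file early. The helper `find_malformed_records` and the toy records are hypothetical and only illustrate the expected shape.

```python
REQUIRED_TEXT_KEYS = {"text_cleaned", "text_llm_output", "entities"}

def find_malformed_records(texts):
    """Return the indices of records missing a key that this notebook reads."""
    malformed = []
    for index, record in enumerate(texts):
        has_keys = REQUIRED_TEXT_KEYS.issubset(record)
        # Only inspect the entities when the record itself is complete
        entities_ok = has_keys and all("text" in entity for entity in record["entities"])
        if not entities_ok:
            malformed.append(index)
    return malformed

# Toy records for illustration only
toy_texts = [
    {"text_cleaned": "a", "text_llm_output": "a", "entities": [{"text": "a"}]},
    {"text_cleaned": "b", "entities": []},  # text_llm_output is missing
]
bad_indices = find_malformed_records(toy_texts)
print(bad_indices)  # prints [1]
```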
## 3. Disambiguation candidate generation

We need candidate ids for the entities to make the task easier for the LLM. We obtain these candidate ids by searching for the entities on wikidata.org. Each search returns up to seven candidate ids, and we keep those that come with a description text. To avoid overloading the website, we cache entities that were looked up earlier. We also wait 10 seconds (variable `SLEEP_TIME_FETCH_PAGE`) between searches.

First we define three helper functions.

In [None]:
SLEEP_TIME_FETCH_PAGE = 10
USER_AGENT = "Pelagios/0.0 (https://github.com/pelagios/; e.tjongkimsang@esciencecenter.nl) generic-library/0.0"

def query_wikidata_for_single_entity(entity_text):
    """Read Wikidata data on entity_text and return it"""
    headers = {
        "User-Agent": USER_AGENT,
        "Accept-Encoding": "gzip, deflate",
    }
    params = {
        'action': 'wbsearchentities',
        'language': 'en',
        'format': 'json',
        'limit': '7',
        'search': entity_text 
    }
    url = 'https://www.wikidata.org/w/api.php'
    time.sleep(SLEEP_TIME_FETCH_PAGE)
    wikidata_data = requests.get(url, params=params, headers=headers)
    return wikidata_data

In [None]:
def extract_entities_from_ner_input(texts_input):
    """For each entity in the input text return the entity text, context text and context text id""" 
    return [{"entity_text": entity["text"],
             "text_id": index,
             "text": text["text_cleaned"]} 
            for index, text in enumerate(texts_input) for entity in text["entities"]]

In [None]:
DISAMBUGUATION_CACHE_FILE_WIKIDATA = "disambiguation_cache_wikidata.json"


def find_wikidata_candidates_for_entities(entities):
    """Lookup candidate ids for the entities on wikidata.org and return them"""
    wikidata_cache = utils.read_json_file(DISAMBUGUATION_CACHE_FILE_WIKIDATA)
    for entity in entities:
        if entity["entity_text"] not in wikidata_cache:
            utils.squeal(f"Looking up entity {entity['entity_text']} (text id {entity['text_id'] + 1}) on WikiData.org...")
            wikidata_data = query_wikidata_for_single_entity(entity["entity_text"])
            # Keep only the candidates that come with a description
            wikidata_cache[entity["entity_text"]] = [
                {"id": candidate["id"], "description": candidate["description"]}
                for candidate in wikidata_data.json()["search"]
                if "description" in candidate
            ]
            # Write the cache after every lookup so an interrupted run keeps its results
            utils.write_json_file(DISAMBUGUATION_CACHE_FILE_WIKIDATA, wikidata_cache)
    print("Finished processing")
    utils.save_data_to_json_file(wikidata_cache, file_name=DISAMBUGUATION_CACHE_FILE_WIKIDATA, in_colab=in_colab)
    for entity in entities:
        entity["candidates"] = wikidata_cache[entity["entity_text"]]
    return entities
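To make the shape of the search results concrete, here is a small sketch of the extraction step applied to a hand-written stand-in for a decoded `wbsearchentities` response. The helper name `extract_candidates` is hypothetical and the ids and descriptions are placeholders. Using `.get("search", [])` assumes that a failed request yields a JSON object without a `search` key.

```python
def extract_candidates(response_json):
    """Return id/description pairs from a decoded wbsearchentities response."""
    return [
        {"id": item["id"], "description": item["description"]}
        for item in response_json.get("search", [])
        if "description" in item
    ]

# Hand-written stand-in for a decoded API response; values are placeholders
sample_response = {
    "search": [
        {"id": "Q111", "label": "Example", "description": "placeholder description"},
        {"id": "Q222", "label": "Example"},  # no description, so it is skipped
    ]
}
candidates = extract_candidates(sample_response)
print(candidates)
```

An entity whose search results all lack a description ends up with an empty candidate list, so the LLM would have nothing to choose from for that entity.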

Next we extract the entities from the input data and call a helper function for finding the candidate ids. These will be stored in the `entities` variable. We show the first item of the results.

In [None]:
entities = extract_entities_from_ner_input(texts_input)
entities = find_wikidata_candidates_for_entities(entities)
print({key: entities[0][key] for key in ['entity_text', 'candidates']})

## 4. Disambiguate entities with ChatGPT

We disambiguate entities by sending a prompt with each entity text, the context text and the WikiData candidate ids with their descriptions to an LLM. We ask the LLM to return the id associated with the description that best matches the entity in the given context.

First we define a helper function. It creates a prompt for each entity, sends it to the LLM and collects the responses. When the combination of the entity and the context text is available in the cache of previously processed entities, we skip consulting ChatGPT and use the WikiData id stored in the cache.

We then call the function to select the right WikiData id with ChatGPT and add the selections to the `texts_output` variable. Change the value of the `MAX_PROCESSED` variable if you do not want all texts to be processed by the LLM. The results are stored in the `processed_entities` variable. We show the first of the results.

In [None]:
DISAMBUGUATION_CACHE_FILE_LLM = "disambiguation_cache_llm.json"
model = "gpt-4o-mini"

def openai_wikidata_id_selection(model, entities):
    """Select a WikiData id for each entity with an OpenAI model, reusing cached choices, and return the entities"""
    llm_cache = utils.read_json_file(DISAMBUGUATION_CACHE_FILE_LLM)
    openai_client = None  # connect lazily, only when a cache miss occurs
    for entity in entities:
        if (entity["entity_text"] in llm_cache and 
            entity["text"] in llm_cache[entity["entity_text"]] and
            model in llm_cache[entity["entity_text"]][entity["text"]]):
            utils.squeal(f"Retrieving entity \"{entity['entity_text']}\" of text {entity['text_id'] + 1} from cache")
            entity["wikidata_id"] = llm_cache[entity["entity_text"]][entity["text"]][model] | {"model": model}
        else:
            if openai_client is None:
                openai_api_key = utils.get_openai_api_key()
                openai_client = utils.connect_to_openai(openai_api_key)
            utils.squeal(f"Sending entity \"{entity['entity_text']}\" of text {entity['text_id'] + 1} to {model}")
            prompt = make_prompt(entity["entity_text"], entity["text"], entity["candidates"])
            entity["wikidata_id"] = {model: utils.process_text_with_gpt(openai_client, model, prompt)}
            if entity["entity_text"] not in llm_cache:
                llm_cache[entity["entity_text"]] = {}
            if entity["text"] not in llm_cache[entity["entity_text"]]:
                llm_cache[entity["entity_text"]][entity["text"]] = {}
            llm_cache[entity["entity_text"]][entity["text"]][model] = {key: value 
                                                                       for key, value in entity["wikidata_id"].items()}
            utils.write_json_file(DISAMBUGUATION_CACHE_FILE_LLM, llm_cache)
            time.sleep(2)
    print("Finished processing")
    utils.save_data_to_json_file(llm_cache, file_name=DISAMBUGUATION_CACHE_FILE_LLM, in_colab=in_colab)
    return entities

In [None]:
MAX_PROCESSED = 100
MAX_PROCESSED_ENTITIES = len([entity for entity in entities if entity["text_id"] < MAX_PROCESSED])

processed_entities = openai_wikidata_id_selection(model, entities[:MAX_PROCESSED_ENTITIES])
texts_output = add_wikidata_ids_to_texts_input(texts_input[:MAX_PROCESSED], processed_entities)
print({key: value for key, value in processed_entities[0].items() if key in ["entity_text", "text", "wikidata_id"]})
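Because the LLM can still return an id that is not in the candidate list, it can help to check its answer before trusting it. The sketch below assumes the LLM response is available as a plain string, which this notebook does not show. The helper `pick_valid_id` is hypothetical and the candidates are placeholders. WikiData item ids consist of the letter Q followed by a positive integer.

```python
import re

# A Q followed by a positive integer, as a whole word
QID_PATTERN = re.compile(r"\bQ[1-9]\d*\b")

def pick_valid_id(llm_response, candidates):
    """Return the first Q-id in the response that is among the candidates, else None."""
    allowed = {candidate["id"] for candidate in candidates}
    for qid in QID_PATTERN.findall(llm_response):
        if qid in allowed:
            return qid
    return None

# Placeholder candidates for illustration only
toy_candidates = [{"id": "Q111", "description": "placeholder description"}]
print(pick_valid_id("The best match is Q111.", toy_candidates))  # prints Q111
print(pick_valid_id("Q999", toy_candidates))  # prints None
```

A `None` result marks a response that should be inspected by hand or sent to the LLM again, rather than stored as a link.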

## 5. Disambiguation visualization

Here we show the results of the disambiguation task in a more readable format. We use the helper function `mark_entities_in_text` for this. But you will not find the definition of this helper function here: it is defined in the file `utils.py` because it is used by other notebooks as well.

Here we call the helper function and show the first three texts with their entities and the WikiData ids. We do not supply it with the cleaned text but with the text that was output of the named entity recognition (`text_llm_output`), because that is sometimes different from the cleaned text.

In [None]:
for text in texts_output[:3]:
    display(HTML(utils.mark_entities_in_text(text["text_llm_output"], text["entities"])))

## 6. Save results

We save the results in a JSON file.

In [None]:
utils.save_data_to_json_file(texts_output, file_name="output_disambiguation.json", in_colab=in_colab)

## 7. Alternatives for disambiguation

Here we define some alternatives for disambiguation, for example if you do not have an OpenAI API key or if you do not want to share your data with OpenAI.

### 7.1. Disambiguate entities with Qwen

Here we use [Qwen](https://en.wikipedia.org/wiki/Qwen), a locally-run model developed by the company Alibaba. The model has hardware requirements which your computer may not satisfy: it needs a GPU and about 12 gigabytes of memory. When running this model on Google Colab, it is recommended to use the runtime environment `T4 GPU`.

In [None]:
DISAMBUGUATION_CACHE_FILE_LLM = "disambiguation_cache_llm.json"

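def ollama_wikidata_id_selection(model, entities):
    """Select a WikiData id for each entity with a local Ollama model, reusing cached choices, and return the entities"""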
    llm_cache = utils.read_json_file(DISAMBUGUATION_CACHE_FILE_LLM)
    for entity in entities:
        if (entity["entity_text"] in llm_cache and 
            entity["text"] in llm_cache[entity["entity_text"]] and
            model in llm_cache[entity["entity_text"]][entity["text"]]):
            utils.squeal(f"Retrieving entity \"{entity['entity_text']}\" of text {entity['text_id'] + 1} from cache")
            entity["wikidata_id"] = llm_cache[entity["entity_text"]][entity["text"]][model] | {"model": model}
        else:
            utils.squeal(f"Sending entity \"{entity['entity_text']}\" of text {entity['text_id'] + 1} to {model}")
            if "ollama" in sys.modules:
                ollama = utils.importlib.import_module("ollama")
            else:
                ollama = utils.import_ollama_module()
            utils.install_ollama_model(model, ollama)
            prompt = make_prompt(entity["entity_text"], entity["text"], entity["candidates"])
            entity["wikidata_id"] = {model: utils.process_text_with_ollama(model, prompt, ollama)}
            if entity["entity_text"] not in llm_cache:
                llm_cache[entity["entity_text"]] = {}
            if entity["text"] not in llm_cache[entity["entity_text"]]:
                llm_cache[entity["entity_text"]][entity["text"]] = {}
            llm_cache[entity["entity_text"]][entity["text"]][model] = {key: value 
                                                                       for key, value in entity["wikidata_id"].items()}
            utils.write_json_file(DISAMBUGUATION_CACHE_FILE_LLM, llm_cache)
            time.sleep(2)
    print("Finished processing")
    utils.save_data_to_json_file(llm_cache, file_name=DISAMBUGUATION_CACHE_FILE_LLM, in_colab=in_colab)
    return entities

In [None]:
MODEL = "qwen3:8b"
MAX_PROCESSED = 3
MAX_PROCESSED_ENTITIES = len([entity for entity in entities if entity["text_id"] < MAX_PROCESSED])

processed_entities = ollama_wikidata_id_selection(MODEL, entities[:MAX_PROCESSED_ENTITIES])
texts_output = add_wikidata_ids_to_texts_input(texts_input[:MAX_PROCESSED], processed_entities)
print({key: value for key, value in processed_entities[0].items() if key in ["entity_text", "text", "wikidata_id"]})
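Qwen3 models can emit their reasoning wrapped in `<think>` tags before the actual answer. Whether `utils.process_text_with_ollama` already removes this block is not shown in this notebook, so the following is only a sketch under the assumption that the raw response is a string that may contain such a block. The helper `strip_think_block` is hypothetical.

```python
import re

# Non-greedy match so that only the reasoning block itself is removed
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def strip_think_block(response_text):
    """Remove a <think>...</think> block and surrounding whitespace from a response."""
    return THINK_BLOCK.sub("", response_text).strip()

raw_response = "<think>\nThe second candidate fits the context.\n</think>\n\nQ222"
print(strip_think_block(raw_response))  # prints Q222
```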