# Disambiguation with Candidates

This notebook disambiguates the entities found by the NER notebook by linking them to Wikidata concepts.

### Rationale

When using LLMs, we can leverage context to help determine the correct identifiers for the entities found. One of the largest challenges with LLMs is getting them to generate the correct identifier for a specific entity. Without context, an LLM will confidently generate a believable-looking identifier code. When checked, however, these codes often turn out not to exist or to refer to something else entirely.

We solve this problem with context. An LLM can receive context in one of two ways: either we supply the context ourselves, or we use the LLM agentically with tools so that it can retrieve the context for itself. Both have their advantages, and both rest on the same principle: context lets the LLM pick the correct identifier code instead of hallucinating one. Hallucinations are still possible, but they become less likely when we give the LLM a list of options to choose from.

In this notebook, we will explore the first of these options, where we provide the LLM with a list of candidates that were generated in the previous data notebook. To make things easier, we have pasted the output from that notebook here.

It is also worth noting that providing the LLM with the necessary context is often considerably cheaper (assuming you are using a paid model) than letting the model agentically query the web or use other tools. We will see this in the next notebook.

### Processing overview

The process consists of the following steps:

1. **Import required software libraries**: We start by importing the required software libraries
2. **Read the texts that require processing**: Next we read the output of the NER notebook
3. **Candidate generation**: We look up candidate Wikidata ids for each entity on wikidata.org
4. **Disambiguation**: Each entity and its context text are sent to GPT with a prompt that instructs it to disambiguate the entity based on the available candidates
5. **Disambiguation visualization**: The annotated text is displayed with colour-coded entity highlighting
6. **Save results**: The results of the disambiguation process are saved for future processing

## 1. Import required software libraries

Disambiguation requires importing some standard software libraries. This step may take some time when run for the first time but in successive runs it will be a lot faster.

First we import standard libraries, which should always be available:

In [1]:
from dotenv import load_dotenv
import hashlib
import importlib
from IPython.display import clear_output
import json
import os
import requests
import subprocess
import sys
import time

Next we import packages that may need to be installed on this device:

In [2]:
char_package = "📦"
char_success = "✅"
char_failure = "❌"


def safe_import(package_name):
    """Import a package. If it is missing, install it first"""
    try:
        return importlib.import_module(package_name)
    except ImportError:
        print(f"{char_package} {package_name} not found. Installing...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
        print(f"Finished installing {package_name}")
        return importlib.import_module(package_name)


openai = safe_import("openai")
pd = safe_import("pandas")
pydantic = safe_import("pydantic")
spacy = safe_import("spacy")
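
`safe_import` assumes that the name passed to `pip install` is the same as the name of the module. That holds for the four packages above, but not for every package: the `dotenv` module imported in the first cell, for example, is distributed as `python-dotenv`. A minimal sketch of a variant that accepts a separate install name is shown here; the function name `safe_import_as` and the parameter `pip_name` are hypothetical and not part of the notebook.

```python
import importlib
import subprocess
import sys


def safe_import_as(module_name, pip_name=None):
    """Import module_name; if it is missing, install pip_name (default: module_name) first"""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Hypothetical extension: the pip distribution name may differ from the module name
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name or module_name])
        return importlib.import_module(module_name)


# A module that is already available is returned without calling pip
json_module = safe_import_as("json")
print(json_module.__name__)
```

With this variant, the dotenv import could be written as `safe_import_as("dotenv", pip_name="python-dotenv")`.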

Here is code for monitoring the progress of the task:

In [3]:
def squeal(text=None):
    """Clear the output buffer of the current cell and print the given text"""
    clear_output(wait=True)
    if text is not None:
        print(text)

Finally we apply the settings required for Google Colab:

In [4]:
from IPython.display import HTML, display


def set_css():
    """Fix line wrapping of output texts for Google Colab"""
    display(HTML("<style> pre { white-space: pre-wrap; } </style>"))


try:
    from google.colab import files
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    get_ipython().events.register('pre_run_cell', set_css)


## 2. Read the texts that require processing

The texts should have been processed by the ner.ipynb notebook. The file read here is an output file of that notebook. We read the texts and the associated metadata and show the first text with its entities.

In [5]:
infile_name = "output_ner_5f12fa7c16d33ea378148197569f999f774f7481.json"

with open(infile_name, "r") as infile:
    texts_input = json.load(infile)
print({"text_cleaned": texts_input[0]["text_cleaned"], 
       "entities": texts_input[0]["entities"]})

{'text_cleaned': 'Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. Acquired before 1882. C. 115', 'entities': [{'text': 'Anubis', 'label': 'PERSON', 'start_char': 21, 'end_char': 27}]}


## 3. Disambiguation candidate generation

We need candidate ids for the entities to make the task easier for the LLM. We obtain these candidate ids by searching for the entities on wikidata.org. Each search returns up to seven candidates, and we keep the id and description text of every candidate that has a description. In order not to overload the website, we cache the entities that were looked up earlier. We also wait for 10 seconds (variable `SLEEP_TIME_FETCH_PAGE`) before each search.

First we define five helper functions:

In [6]:
SLEEP_TIME_FETCH_PAGE = 10
USER_AGENT = "Pelagios/0.0 (https://github.com/pelagios/; e.tjongkimsang@esciencecenter.nl) generic-library/0.0"

def query_wikidata_for_single_entity(entity_text):
    """Read Wikidata data on entity_text and return it"""
    headers = {
        "User-Agent": USER_AGENT,
        "Accept-Encoding": "gzip, deflate",
    }
    params = {
        'action': 'wbsearchentities',
        'language': 'en',
        'format': 'json',
        'limit': '7',
        'search': entity_text 
    }
    url = 'https://www.wikidata.org/w/api.php'
    time.sleep(SLEEP_TIME_FETCH_PAGE)
    wikidata_data = requests.get(url, params=params, headers=headers)
    return wikidata_data

In [7]:
DISAMBUGUATION_CACHE_FILE_WIKIDATA = "disambiguation_cache_wikidata.json"

def read_json_file(file_name):
    """Read a json file and return its contents"""
    # The with statement closes the file, so no explicit close() is needed
    with open(file_name, "r") as infile:
        json_data = json.load(infile)
    return json_data
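
`read_json_file` raises a `FileNotFoundError` when the file does not exist, which is the situation on a first run before any cache file has been written. The sketch below shows one possible way of handling that case by starting with an empty cache. The function name `read_json_cache` is hypothetical, and the demonstration uses a temporary directory rather than the notebook's real cache files.

```python
import json
import os
import tempfile


def read_json_cache(file_name):
    """Read a JSON cache file; return an empty dict if the file does not exist yet"""
    if not os.path.exists(file_name):
        return {}
    with open(file_name, "r", encoding="utf-8") as infile:
        return json.load(infile)


# Demonstration in a temporary directory, so no real cache file is touched
with tempfile.TemporaryDirectory() as directory:
    cache_path = os.path.join(directory, "cache.json")
    empty_cache = read_json_cache(cache_path)  # the file does not exist yet
    with open(cache_path, "w", encoding="utf-8") as outfile:
        json.dump({"Anubis": []}, outfile)
    filled_cache = read_json_cache(cache_path)  # the file now exists
```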

In [8]:
def write_json_file(file_name, json_data):
    """Write json data to a file with the specified name"""
    json_string = json.dumps(json_data, ensure_ascii=False, indent=2)
    # The with statement closes the file, so no explicit close() is needed
    with open(file_name, "w") as outfile:
        print(json_string, end="", file=outfile)

In [9]:
def extract_entities_from_ner_input(texts_input):
    """For each entity in the input text return the entity text, context text and context text id""" 
    return [{"entity_text": entity["text"],
             "text_id": index,
             "text": text["text_cleaned"]} 
            for index, text in enumerate(texts_input) for entity in text["entities"]]
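
To illustrate what this flattening step produces, the following sketch applies the same list comprehension to a small hand-made input. The sample texts are made up for the illustration and do not come from the NER notebook. Each entity becomes one record, texts without entities produce no records, and `text_id` points back to the position of the source text.

```python
def extract_entities_from_ner_input(texts_input):
    """For each entity in the input text return the entity text, context text and context text id"""
    return [{"entity_text": entity["text"],
             "text_id": index,
             "text": text["text_cleaned"]}
            for index, text in enumerate(texts_input) for entity in text["entities"]]


# Hand-made sample input in the shape used by the notebook (illustrative only)
sample_texts = [
    {"text_cleaned": "Statuette of the god Anubis.", "entities": [{"text": "Anubis"}]},
    {"text_cleaned": "Bronze fragment.", "entities": []},
    {"text_cleaned": "Osiris and Isis.", "entities": [{"text": "Osiris"}, {"text": "Isis"}]},
]

flat_entities = extract_entities_from_ner_input(sample_texts)
for record in flat_entities:
    print(record["text_id"], record["entity_text"])
```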

In [10]:
def find_wikidata_candidates_for_entities(entities):
    """Lookup candidate ids for the entities on wikidata.org and return them"""
    wikidata_cache = read_json_file(DISAMBUGUATION_CACHE_FILE_WIKIDATA)
    for entity in entities:
        if entity["entity_text"] not in wikidata_cache.keys():
            squeal(f"Looking up entity {entity['entity_text']} (text id {entity['text_id'] + 1}) on WikiData.org...")
            wikidata_data = query_wikidata_for_single_entity(entity["entity_text"])
            wikidata_cache[entity["entity_text"]] = [{"id": candidate["id"], 
                                                            "description": candidate["description"]} 
                                                            for candidate in json.loads(wikidata_data.text)["search"]
                                                            if "description" in candidate.keys()]
    write_json_file(DISAMBUGUATION_CACHE_FILE_WIKIDATA, wikidata_cache)
    for entity in entities:
        entity["candidates"] = wikidata_cache[entity["entity_text"]]
    return entities
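
The list comprehension above indexes the `search` key of the reply directly, so a reply without that key would raise a `KeyError`. Below is a minimal sketch of a more defensive extraction step, assuming that a failed request yields a JSON object without a `search` list. The function name `extract_candidates` and the sample reply are hypothetical; the sample is hand-made in the shape of a reply and is not real API output.

```python
def extract_candidates(response_json):
    """Return id/description pairs from a wbsearchentities-style reply"""
    return [{"id": candidate["id"], "description": candidate["description"]}
            for candidate in response_json.get("search", [])
            if "id" in candidate and "description" in candidate]


# Hand-made sample reply (illustrative only); the last entry lacks a description
sample_response = {
    "search": [
        {"id": "Q47534", "description": "Egyptian deity of mummification and the afterlife, "
                                        "usually depicted as a man with a canine head"},
        {"id": "Q145772", "description": "asteroid"},
        {"id": "Q_PLACEHOLDER"},
    ]
}

candidates = extract_candidates(sample_response)
no_candidates = extract_candidates({"error": {"code": "hypothetical"}})
```

Entries without a description are dropped, as in the notebook, and a reply without a `search` list results in an empty candidate list.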

Next we extract the entities from the input data and call a helper function for finding the candidate ids. These will be stored in the `entities` variable. We show the first item of the results.

In [11]:
entities = extract_entities_from_ner_input(texts_input)
entities = find_wikidata_candidates_for_entities(entities)
print({key: entities[0][key] for key in ['entity_text', 'candidates']})

{'entity_text': 'Anubis', 'candidates': [{'id': 'Q14896497', 'description': 'genus of insects'}, {'id': 'Q47534', 'description': 'Egyptian deity of mummification and the afterlife, usually depicted as a man with a canine head'}, {'id': 'Q134301689', 'description': 'anti-web scraping software'}, {'id': 'Q135055514', 'description': 'video game developed by OA Game Studio'}, {'id': 'Q145772', 'description': 'asteroid'}, {'id': 'Q5514020', 'description': 'daemon that sits between the Mail User Agent (MUA) and the Mail Transfer Agent (MTA)'}, {'id': 'Q108330387', 'description': 'British Drag Queen'}]}


## 4. Disambiguate entities with GPT

We disambiguate entities by sending a prompt with each entity text, the context text and the Wikidata candidate ids with their descriptions to an LLM. We ask the LLM to return the id associated with the description that best matches the entity in the given context.
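
The `make_prompt` helper defined in this section inserts the candidate list into an f-string, which renders it in Python's own notation (single quotes) rather than as strict JSON. Whether this matters to the model is not tested here. As a sketch of one alternative, the hypothetical function `make_prompt_json` below serialises the candidates with `json.dumps` before building the same prompt.

```python
import json


def make_prompt_json(entity, text, candidates):
    """Create an LLM prompt in which the candidates are serialised as JSON"""
    candidates_json = json.dumps(candidates, ensure_ascii=False, indent=2)
    return (
        f'Disambiguate the entity "{entity}" in the following text.\n\n'
        f"{text}\n\n"
        "Here are the candidates, in json format:\n\n"
        f"{candidates_json}\n\n"
        "Only return the relevant id, nothing else.\n"
    )


# Two of the candidates found for Anubis in the previous section
sample_candidates = [
    {"id": "Q14896497", "description": "genus of insects"},
    {"id": "Q145772", "description": "asteroid"},
]
prompt = make_prompt_json("Anubis", "Statuette of the god Anubis. Bronze.", sample_candidates)
print(prompt)
```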

First we define four helper functions:

In [12]:
def make_prompt(entity, text, candidates):
    """Create an LLM prompt, given an entity, its context text and the candidates, and return it"""
    return f"""
Disambiguate the entity "{entity}" in the following text.

{text}

Here are the candidates, in json format:

{candidates}

Only return the relevant id, nothing else.
"""

In [13]:
def get_openai_api_key():
    """Extract OpenAI API key from environment or file and return it"""
    load_dotenv()
    openai_api_key = os.getenv("OPENAI_API_KEY")
    if not openai_api_key:
        try:
            # Fall back to a local file named OPENAI_API_KEY
            with open("OPENAI_API_KEY", "r") as infile:
                openai_api_key = infile.read().strip()
        except OSError:
            pass
    if not openai_api_key:
        print(f"{char_failure} no openai_api_key found!")
        return ""
    return openai_api_key

In [14]:
def connect_to_openai(openai_api_key):
    """Connect to OpenAI and return processing space"""
    return openai.OpenAI(api_key=openai_api_key)


In [15]:
def process_text_with_gpt(openai_client, model, prompt):
    """Send text to OpenAI via prompt and return results"""
    try:
        response = openai_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content
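    except Exception as error:
        print(f"{char_failure} GPT call failed: {error}")
        # Return an empty string so that the result type matches a successful call
        return ""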
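
Even with a list of candidates in the prompt, the model may wrap the id in extra text or return an id that was never offered. A minimal sketch of a check on the response is given below. The function name `select_valid_id` is hypothetical, and the sketch assumes that Wikidata item ids have the form `Q` followed by digits, as in the candidate lists shown earlier.

```python
import re


def select_valid_id(response, candidates):
    """Return the first candidate id mentioned in the response, or None"""
    if not isinstance(response, str):
        return None
    candidate_ids = {candidate["id"] for candidate in candidates}
    # Assumption: ids look like Q followed by digits
    for found_id in re.findall(r"Q\d+", response):
        if found_id in candidate_ids:
            return found_id
    return None


sample_candidates = [
    {"id": "Q47534", "description": "Egyptian deity"},
    {"id": "Q145772", "description": "asteroid"},
]

clean = select_valid_id("Q47534", sample_candidates)
wrapped = select_valid_id('The relevant id is "Q47534".', sample_candidates)
unknown = select_valid_id("Q999999999", sample_candidates)
```

An id that is not among the candidates yields `None`, which a caller could treat as "no link" instead of storing an unverified id.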

Next, we create a prompt for each entity, send it to the LLM and collect the responses. When the combination of entity, context text and model is already available in the cache of previously processed entities, we skip the GPT call and use the Wikidata id stored in the cache.

In [16]:
DISAMBUGUATION_CACHE_FILE_GPT = "disambiguation_cache_gpt.json"
model = "gpt-4o-mini"

def openai_wikidata_id_selection(model, entities):
    """Select a Wikidata id for each entity with GPT, using a cache of earlier results"""
    openai_api_key = get_openai_api_key()
    openai_client = connect_to_openai(openai_api_key)
    gpt_cache = read_json_file(DISAMBUGUATION_CACHE_FILE_GPT)
    for entity in entities:
        if (entity["entity_text"] in gpt_cache and
            entity["text"] in gpt_cache[entity["entity_text"]] and
            model in gpt_cache[entity["entity_text"]][entity["text"]]):
            squeal(f"Retrieving entity \"{entity['entity_text']}\" of text {entity['text_id'] + 1} from cache")
            entity["wikidata_id"] = gpt_cache[entity["entity_text"]][entity["text"]][model] | {"model": model}
        else:
            squeal(f"Sending entity \"{entity['entity_text']}\" of text {entity['text_id'] + 1} to GPT")
            prompt = make_prompt(entity["entity_text"], entity["text"], entity["candidates"])
            response = process_text_with_gpt(openai_client, model, prompt)
            # The prompt asks for a bare id; a non-string result (for example after a failed call) becomes an empty id
            selected_id = response.strip() if isinstance(response, str) else ""
            # Look up the description of the selected id among the candidates
            description = next((candidate["description"] for candidate in entity["candidates"]
                                if candidate["id"] == selected_id), "")
            entity["wikidata_id"] = {"id": selected_id, "description": description, "model": model}
            if entity["entity_text"] not in gpt_cache:
                gpt_cache[entity["entity_text"]] = {}
            if entity["text"] not in gpt_cache[entity["entity_text"]]:
                gpt_cache[entity["entity_text"]][entity["text"]] = {}
            # Store the result without the model name, which is already the cache key
            gpt_cache[entity["entity_text"]][entity["text"]][model] = {key: value
                                                                       for key, value in entity["wikidata_id"].items()
                                                                       if key != "model"}
    write_json_file(DISAMBUGUATION_CACHE_FILE_GPT, gpt_cache)
    print("Finished processing")
    return entities
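
The GPT cache is a nested dictionary keyed by entity text, then by context text, then by model name, so the same entity in a different text or with a different model is looked up separately. The following sketch shows this layout with two hypothetical helpers, `cache_get` and `cache_set`, which are not part of the notebook. The stored value follows the shape of the result printed in the next section.

```python
def cache_get(cache, entity_text, text, model):
    """Return the cached result for this entity, context text and model, or None"""
    return cache.get(entity_text, {}).get(text, {}).get(model)


def cache_set(cache, entity_text, text, model, value):
    """Store a result, creating the intermediate levels when needed"""
    cache.setdefault(entity_text, {}).setdefault(text, {})[model] = value


gpt_cache_example = {}
context = "Statuette of the god Anubis. Bronze."
cache_set(gpt_cache_example, "Anubis", context, "gpt-4o-mini",
          {"id": "Q47534", "description": "Egyptian deity"})

hit = cache_get(gpt_cache_example, "Anubis", context, "gpt-4o-mini")
miss_other_model = cache_get(gpt_cache_example, "Anubis", context, "another-model")
miss_other_text = cache_get(gpt_cache_example, "Anubis", "A different text.", "gpt-4o-mini")
```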

Next, we define a helper function that inserts the selected ids into the input texts. We then call the function that selects the right Wikidata id with GPT. The results are stored in the `entities` variable. We show the first of the results.

In [17]:
def add_wikidata_ids_to_texts_input(texts_input, entities):
    """Insert the retrieved wikidata ids into the variable texts_input and return it"""
    entities_per_text = {}
    for entity in entities:
        if entity["text_id"] not in entities_per_text:
            entities_per_text[entity["text_id"]] = {}
        entities_per_text[entity["text_id"]][entity["entity_text"]] = entity
    for text_id, text in enumerate(texts_input):
        for entity in text["entities"]:
            if text_id in entities_per_text and entity["text"] in entities_per_text[text_id]:
                entity["wikidata_id"] = entities_per_text[text_id][entity["text"]]["wikidata_id"]
    return texts_input

In [18]:
entities = openai_wikidata_id_selection(model, entities)
texts_output = add_wikidata_ids_to_texts_input(texts_input, entities)
print({key: value for key, value in entities[0].items() if key in ["entity_text", "text", "wikidata_id"]})

Retrieving entity "Bagnani" of text 100 from cache
Finished processing
{'entity_text': 'Anubis', 'text': 'Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. Acquired before 1882. C. 115', 'wikidata_id': {'id': 'Q47534', 'description': 'Egyptian deity of mummification and the afterlife, usually depicted as a man with a canine head', 'model': 'gpt-4o-mini'}}


## 5. Disambiguation visualization

Here we show the results of the disambiguation task in a more readable format.

First we define a helper function for performing the visualization:

In [19]:
COLORS = {"PERSON": "red", "LOCATION": "green", "OTHER": "blue"}


def mark_entities_in_text(texts_input, entities):
    """Convert the text to HTML with colored entities and return it"""
    for entity in reversed(entities):
        entity_label = entity["label"] if entity["label"] in COLORS.keys() else "OTHER"
        if "wikidata_id" in entity:
            texts_input = texts_input[:entity["end_char"]] + f"<sup>{entity['wikidata_id']['id']}</sup>" + texts_input[entity["end_char"]:]
        texts_input = texts_input[:entity["end_char"]] + "</span>" + texts_input[entity["end_char"]:]
        texts_input = (texts_input[:entity["start_char"]] + 
                      f"<span style=\"border: 1px solid black; color: {COLORS[entity_label]};\">" + 
                      texts_input[entity["start_char"]:])
    return texts_input
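
The function walks through the entities in reverse order because every inserted tag shifts the characters that follow it. Working from the end of the text keeps the `start_char` and `end_char` offsets of the earlier entities valid. The sketch below shows the effect on a made-up example that uses square brackets instead of HTML tags; the function name `insert_tags` and its parameter `from_end` are hypothetical.

```python
def insert_tags(text, spans, from_end=True):
    """Wrap each (start, end) span in square brackets, by default working from the last span to the first"""
    for start, end in sorted(spans, reverse=from_end):
        text = text[:start] + "[" + text[start:end] + "]" + text[end:]
    return text


sample_text = "Osiris and Isis"
sample_spans = [(0, 6), (11, 15)]  # character offsets of "Osiris" and "Isis"

marked = insert_tags(sample_text, sample_spans)
broken = insert_tags(sample_text, sample_spans, from_end=False)
print(marked)  # [Osiris] and [Isis]
print(broken)  # [Osiris] an[d Is]is
```

In the second call the first insertion moves "Isis" two characters to the right, so the original offsets of the second span no longer match.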

Then we call the helper function and show the first three texts with their entities and the WikiData ids:

In [20]:
# Show only the first three texts
for text in texts_output[:3]:
    display(HTML(mark_entities_in_text(text["text_llm_output"], text["entities"])))

## 6. Save results

We save the results in a json file. The file name contains a hash of the file contents.

In [21]:
def save_results(texts, key):
    """Save preprocessed texts in a json file named after a hash of its contents"""
    json_string = json.dumps(texts, ensure_ascii=False, indent=2)
    # Avoid shadowing the built-in function hash()
    content_hash = hashlib.sha1(json_string.encode("utf-8")).hexdigest()
    output_file_name = f"output_{key}{content_hash}.json"
    with open(output_file_name, "w", encoding="utf-8") as output_file:
        print(json_string, end="", file=output_file)
    # The file is closed at this point, so it is complete when downloaded
    if IN_COLAB:
        try:
            files.download(output_file_name)
            print(f"{char_success} Downloaded preprocessed texts to file {output_file_name}")
        except Exception:
            print(f"{char_failure} Downloading preprocessed texts failed!")
    else:
        print(f"{char_success} Saved preprocessed texts to file {output_file_name}")
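
Because the file name is derived from a SHA-1 hash of the serialised results, identical results always map to the same file name and any change in the results gives a new one. This is also how the input file of this notebook can be traced back to a specific NER run. The sketch below isolates the naming step; the function name `content_hash_file_name` is hypothetical and the sample data is made up.

```python
import hashlib
import json


def content_hash_file_name(data, key):
    """Return the output file name for data, based on a SHA-1 hash of its JSON form"""
    json_string = json.dumps(data, ensure_ascii=False, indent=2)
    digest = hashlib.sha1(json_string.encode("utf-8")).hexdigest()
    return f"output_{key}{digest}.json"


name_first = content_hash_file_name([{"text": "a"}], "disambiguation_")
name_same = content_hash_file_name([{"text": "a"}], "disambiguation_")
name_changed = content_hash_file_name([{"text": "b"}], "disambiguation_")
print(name_first == name_same, name_first == name_changed)
```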

In [23]:
save_results(texts_output, "disambiguation_")

️✅ Saved preprocessed texts to file output_disambiguation_ba25101ddbe8830789bfdfdb3a5ba6312d6853e6.json
