# Disambiguation with Candidates

This notebook disambiguates the entities found by the NER notebook by linking them to Wikidata concepts.

### Rationale

When using LLMs, we can leverage context to help determine the correct identifiers for the entities found. One of the largest challenges with LLMs is getting them to generate the correct identifier for a specific entity. Without context, an LLM will confidently generate a believable-looking identifier code. When checked, however, these codes often turn out not to exist or to refer to something else entirely.

We solve this problem with context. An LLM can receive context in one of two ways: either we supply the context ourselves, or we use the LLM agentically with tools so that it can retrieve the context itself. Both have their advantages, and both rest on the same principle: context lets the LLM select the correct identifier code instead of hallucinating one. Hallucinations are still possible, but they become less likely when we give the LLM a list of options to choose from.

In this notebook, we will explore the first of these options, where we provide the LLM with a list of candidates that were generated in the previous data notebook. To make things easier, we have pasted the output from that notebook here.

It is also worth noting that providing the LLM with the necessary context is often considerably cheaper (assuming you are using a paid model) than letting the model agentically query the web or use other tools. We will see this in the next notebook.

### Processing overview

The process consists of the following steps:

1. **Import required software libraries**: We start by importing the required software libraries
2. **Read the texts that require processing**: Next we read the output file of the NER notebook
3. **Candidate extraction**: For each entity we look up candidate ids and their descriptions on wikidata.org
4. **Disambiguation**: Each entity is sent to GPT with its context text and a prompt that instructs the model to pick the best-matching candidate
5. **Save results**: Save the results of the disambiguation process for future processing
6. **Disambiguation visualization**: The annotated text is displayed with colour-coded entity highlighting

## 1. Import required software libraries

Disambiguation requires importing some standard software libraries. This step may take some time when run for the first time but in successive runs it will be a lot faster.

First we import libraries that should already be available in the notebook environment:

In [1]:
from dotenv import load_dotenv
import importlib
from IPython.display import clear_output
import json
import os
import requests
import subprocess
import sys
import time

Next we import packages which may require installation on this device

In [2]:
char_package = "📦"
char_success = "✅"
char_failure = "❌"


def safe_import(package_name):
    """Import a package. If it is missing, install it first"""
    try:
        return importlib.import_module(package_name)
    except ImportError:
        print(f"{char_package} {package_name} not found. Installing...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
        print(f"Finished installing {package_name}")
        return importlib.import_module(package_name)


openai = safe_import("openai")
pd = safe_import("pandas")
pydantic = safe_import("pydantic")
spacy = safe_import("spacy")

Here is code for monitoring the progress of the task:

In [3]:
def squeal(text=None):
    """Clear the output buffer of the current cell and print the given text"""
    clear_output(wait=True)
    if text is not None:
        print(text)

## 2. Read the texts that require processing

The texts should have been processed by the ner.ipynb notebook. The file read here is an output file of that notebook. We read the texts and the associated metadata and show the first text with its entities.

In [4]:
infile_name = "output_ner_5f12fa7c16d33ea378148197569f999f774f7481.json"

# The with statement closes the file automatically
with open(infile_name, "r") as infile:
    texts_input = json.load(infile)
print({"text_cleaned": texts_input[0]["text_cleaned"], 
       "entities": texts_input[0]["entities"]})

{'text_cleaned': 'Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. Acquired before 1882. C. 115', 'entities': [{'text': 'Anubis', 'label': 'PERSON', 'start_char': 21, 'end_char': 27}]}
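
The `start_char` and `end_char` values in this output are character offsets into `text_cleaned`, with the end offset exclusive, so slicing the text with them should give back the entity string. A minimal sanity check along these lines can catch misaligned offsets before disambiguation starts; the helper name `entity_offsets_match` is hypothetical, and the record is copied from the output above.

```python
# Record copied from the printed output above
record = {
    "text_cleaned": "Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. Acquired before 1882. C. 115",
    "entities": [{"text": "Anubis", "label": "PERSON", "start_char": 21, "end_char": 27}],
}


def entity_offsets_match(record):
    """Return True if every entity's offsets slice out exactly its text (hypothetical helper)"""
    text = record["text_cleaned"]
    return all(text[entity["start_char"]:entity["end_char"]] == entity["text"]
               for entity in record["entities"])


print(entity_offsets_match(record))
```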


## 3. Disambiguation candidate generation

We need candidate ids for the entities to make the task easier for the LLM. We obtain these candidate ids by searching for the entities on wikidata.org. Each search returns up to seven candidate ids; we keep those that come with a description text. In order not to overload the website, we store entities that were looked up earlier in a cache file. We also wait for 10 seconds (variable `SLEEP_TIME_FETCH_PAGE`) before each search.

First we define five helper functions:

In [14]:
SLEEP_TIME_FETCH_PAGE = 10
USER_AGENT = "Pelagios/0.0 (https://github.com/pelagios/; e.tjongkimsang@esciencecenter.nl) generic-library/0.0"

def query_wikidata_for_single_entity(entity_text):
    """Read Wikidata data on entity_text and return it"""
    headers = {
        "User-Agent": USER_AGENT,
        "Accept-Encoding": "gzip, deflate",
    }
    params = {
        'action': 'wbsearchentities',
        'language': 'en',
        'format': 'json',
        'limit': '7',
        'search': entity_text 
    }
    url = 'https://www.wikidata.org/w/api.php'
    time.sleep(SLEEP_TIME_FETCH_PAGE)
    wikidata_data = requests.get(url, params=params, headers=headers)
    return wikidata_data
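
To see what the search request looks like without contacting the website, the query string can be assembled offline. This is only a sketch: the helper name `build_wikidata_search_url` is hypothetical, and it mirrors the parameters used in the function above using the standard library instead of `requests`.

```python
from urllib.parse import urlencode


def build_wikidata_search_url(entity_text, limit=7, language="en"):
    """Return the full search URL for entity_text without sending a request (hypothetical helper)"""
    params = {
        "action": "wbsearchentities",
        "language": language,
        "format": "json",
        "limit": str(limit),
        "search": entity_text,
    }
    # urlencode also escapes spaces and special characters in the entity text
    return "https://www.wikidata.org/w/api.php?" + urlencode(params)


print(build_wikidata_search_url("Anubis"))
```

Pasting the printed URL into a browser is a quick way to inspect the raw response for a single entity.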

In [36]:
DISAMBUGUATION_CACHE_FILE_WIKIDATA = "disambiguation_cache_wikidata.json"

def read_json_file(file_name):
    """Read a json file and return its contents"""
    # The with statement closes the file automatically
    with open(file_name, "r") as infile:
        json_data = json.load(infile)
    return json_data
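
Note that reading a cache file fails when the file does not exist yet, for example on the very first run. One possible way around this, sketched below under the assumption that an empty cache is an acceptable starting point, is a tolerant reader; the name `read_json_file_or_default` is hypothetical and not part of the notebook.

```python
import json
import os
import tempfile


def read_json_file_or_default(file_name, default=None):
    """Return the parsed json file, or default (an empty dict if omitted) when the file is absent"""
    try:
        with open(file_name, "r", encoding="utf-8") as infile:
            return json.load(infile)
    except FileNotFoundError:
        return {} if default is None else default


# Usage: a path that does not exist yet yields an empty cache
missing_path = os.path.join(tempfile.mkdtemp(), "disambiguation_cache_wikidata.json")
print(read_json_file_or_default(missing_path))
```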

In [16]:
def write_json_file(file_name, json_data):
    """Write json data to a file with the specified name"""
    json_string = json.dumps(json_data, ensure_ascii=False, indent=2)
    # The with statement closes the file automatically
    with open(file_name, "w") as outfile:
        print(json_string, end="", file=outfile)

In [17]:
def extract_entities_from_ner_input(texts_input):
    """For each entity in the input text return the entity text, context text and context text id""" 
    return [{"entity_text": entity["text"],
             "text_id": index,
             "text": text["text_cleaned"]} 
            for index, text in enumerate(texts_input) for entity in text["entities"]]

In [37]:
def find_wikidata_candidates_for_entities(entities):
    """Look up candidate ids for the entities on wikidata.org and return them"""
    wikidata_cache = read_json_file(DISAMBUGUATION_CACHE_FILE_WIKIDATA)
    for entity in entities:
        # Only query wikidata.org for entities that are not in the cache yet
        if entity["entity_text"] not in wikidata_cache:
            squeal(f"Looking up entity {entity['entity_text']} (text id {entity['text_id'] + 1}) on WikiData.org...")
            wikidata_data = query_wikidata_for_single_entity(entity["entity_text"])
            # Keep only the candidates that come with a description
            wikidata_cache[entity["entity_text"]] = [{"id": candidate["id"],
                                                      "description": candidate["description"]}
                                                     for candidate in json.loads(wikidata_data.text)["search"]
                                                     if "description" in candidate]
    write_json_file(DISAMBUGUATION_CACHE_FILE_WIKIDATA, wikidata_cache)
    for entity in entities:
        entity["candidates"] = wikidata_cache[entity["entity_text"]]
    return entities
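
The filtering step inside this function can be tried out without network access. The sketch below applies the same selection to a hand-written payload; the helper name `candidates_from_search_response` and the id `Q_NO_DESCRIPTION` are made up for illustration, while the other two ids are taken from the example output in this notebook.

```python
def candidates_from_search_response(payload):
    """Keep the id and description of every search hit that has a description"""
    return [{"id": hit["id"], "description": hit["description"]}
            for hit in payload.get("search", [])
            if "description" in hit]


# Hand-written payload for illustration, not a real response
sample_payload = {
    "search": [
        {"id": "Q47534", "description": "Egyptian deity of mummification and the afterlife, usually depicted as a man with a canine head"},
        {"id": "Q145772", "description": "asteroid"},
        {"id": "Q_NO_DESCRIPTION"},  # placeholder hit without a description: dropped
    ]
}

for candidate in candidates_from_search_response(sample_payload):
    print(candidate["id"], "-", candidate["description"])
```

Dropping hits without a description is why an entity can end up with fewer candidates than the search limit.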

Next we extract the entities from the input data and call a helper function for finding the candidate ids. These will be stored in the entities variable. We show the first item of the results.

In [38]:
entities = extract_entities_from_ner_input(texts_input)
entities = find_wikidata_candidates_for_entities(entities)
print({key: entities[0][key] for key in ['entity_text', 'candidates']})

{'entity_text': 'Anubis', 'candidates': [{'id': 'Q14896497', 'description': 'genus of insects'}, {'id': 'Q47534', 'description': 'Egyptian deity of mummification and the afterlife, usually depicted as a man with a canine head'}, {'id': 'Q134301689', 'description': 'anti-web scraping software'}, {'id': 'Q135055514', 'description': 'video game developed by OA Game Studio'}, {'id': 'Q145772', 'description': 'asteroid'}, {'id': 'Q5514020', 'description': 'daemon that sits between the Mail User Agent (MUA) and the Mail Transfer Agent (MTA)'}, {'id': 'Q108330387', 'description': 'British Drag Queen'}]}


## 4. Disambiguate entities with GPT

We disambiguate entities by sending a prompt to an LLM that contains the entity text, the context text and the Wikidata candidate ids with their descriptions. We ask the LLM to return the id associated with the description that best matches the entity in the given context.

First we define four helper functions:

In [32]:
def make_prompt(entity, text, candidates):
    """Create an LLM prompt, given an entity, its context text and its candidates, and return it"""
    # Serialize the candidates so that the prompt really contains json
    return f"""
Disambiguate the entity "{entity}" in the following text.

{text}

Here are the candidates, in json format:

{json.dumps(candidates, ensure_ascii=False, indent=2)}

Only return the relevant id, nothing else.
"""

In [33]:
def get_openai_api_key():
    """Extract OpenAI API key from environment or file and return it"""
    load_dotenv()
    openai_api_key = os.getenv("OPENAI_API_KEY")
    if not openai_api_key:
        try:
            with open("OPENAI_API_KEY", "r") as infile:
                openai_api_key = infile.read().strip()
        except OSError:
            # The key file is optional: fall through to the check below
            pass
    if not openai_api_key:
        print(f"{char_failure} no openai_api_key found!")
        return ""
    return openai_api_key

In [34]:
def connect_to_openai(openai_api_key):
    """Connect to OpenAI and return the client object"""
    return openai.OpenAI(api_key=openai_api_key)


In [50]:
def process_text_with_gpt(openai_client, model, prompt):
    """Send a prompt to OpenAI and return the text of the response (None on failure)"""
    try:
        response = openai_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content
    except Exception as error:
        # Report the reason for the failure instead of hiding it
        print(f"{char_failure} GPT call failed: {error}")
        return None
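
Because the model may still answer with extra words or with an id that is not among the options, it can be useful to check the response against the candidate list before accepting it. The following is a minimal sketch of such a check; the helper name `extract_valid_id` is hypothetical and the notebook itself stores the response unchecked.

```python
import re


def extract_valid_id(response, candidates):
    """Return the first candidate id mentioned in the response, or None (hypothetical helper)"""
    if not isinstance(response, str):
        return None  # for example when the LLM call failed
    candidate_ids = {candidate["id"] for candidate in candidates}
    for found_id in re.findall(r"Q\d+", response):
        if found_id in candidate_ids:
            return found_id
    return None


candidates = [
    {"id": "Q14896497", "description": "genus of insects"},
    {"id": "Q47534", "description": "Egyptian deity of mummification and the afterlife, usually depicted as a man with a canine head"},
]
print(extract_valid_id("Q47534", candidates))        # an id from the list is accepted
print(extract_valid_id("Q99999999999", candidates))  # an id outside the list is rejected
```

A rejected response can then be retried or flagged for manual review instead of being written to the output.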

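Next, we create a prompt for each entity, send it to the LLM and collect the responses. Responses are cached per entity text, context text and model, so rerunning the cell does not repeat calls that were already made.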

In [90]:
entities[0]

{'entity_text': 'Anubis',
 'text_id': 0,
 'text': 'Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. Acquired before 1882. C. 115',
 'candidates': [{'id': 'Q14896497', 'description': 'genus of insects'},
  {'id': 'Q47534',
   'description': 'Egyptian deity of mummification and the afterlife, usually depicted as a man with a canine head'},
  {'id': 'Q134301689', 'description': 'anti-web scraping software'},
  {'id': 'Q135055514',
   'description': 'video game developed by OA Game Studio'},
  {'id': 'Q145772', 'description': 'asteroid'},
  {'id': 'Q5514020',
   'description': 'daemon that sits between the Mail User Agent (MUA) and the Mail Transfer Agent (MTA)'},
  {'id': 'Q108330387', 'description': 'British Drag Queen'}],
 'wikidata_id': {'gpt-4o-mini': {'id': 'Q47534',
   'description': 'Egyptian deity of mummification and the afterlife, usually depicted as a man with a canine head'}}}

In [104]:
DISAMBUGUATION_CACHE_FILE_GPT = "disambiguation_cache_gpt.json"
model = "gpt-4o-mini"

openai_api_key = get_openai_api_key()
openai_client = connect_to_openai(openai_api_key)
gpt_cache = read_json_file(DISAMBUGUATION_CACHE_FILE_GPT)
for entity in entities:
    if (entity["entity_text"] in gpt_cache and 
        entity["text"] in gpt_cache[entity["entity_text"]] and
        model in gpt_cache[entity["entity_text"]][entity["text"]]):
        # Cache hit: copy the stored answer to the entity so it is not lost
        entity["wikidata_id"] = gpt_cache[entity["entity_text"]][entity["text"]][model]
    else:
        squeal(f"Processing text: {entity['text_id'] + 1}; entity: {entity['entity_text']}")
        prompt = make_prompt(entity["entity_text"], entity["text"], entity["candidates"])
        entity["wikidata_id"] = {model: process_text_with_gpt(openai_client, model, prompt)}
        if entity["entity_text"] not in gpt_cache:
            gpt_cache[entity["entity_text"]] = {}
        if entity["text"] not in gpt_cache[entity["entity_text"]]:
            gpt_cache[entity["entity_text"]][entity["text"]] = {}
        gpt_cache[entity["entity_text"]][entity["text"]][model] = entity["wikidata_id"]
write_json_file(DISAMBUGUATION_CACHE_FILE_GPT, gpt_cache)
print("Finished processing")

Finished processing
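
The cache used in this cell is a nested dictionary keyed by entity text, then context text, then model name. The sketch below shows the same lookup and store logic in a compact form using `dict.get` and `dict.setdefault`; the helper names `cache_lookup` and `cache_store` are hypothetical, and for simplicity the example stores the bare id as the cached value.

```python
def cache_lookup(cache, entity_text, context_text, model):
    """Return the cached answer for this entity, context and model, or None"""
    return cache.get(entity_text, {}).get(context_text, {}).get(model)


def cache_store(cache, entity_text, context_text, model, answer):
    """Store an answer, creating the intermediate dictionaries when needed"""
    cache.setdefault(entity_text, {}).setdefault(context_text, {})[model] = answer


cache = {}
cache_store(cache, "Anubis", "Statuette of the god Anubis.", "gpt-4o-mini", "Q47534")
print(cache_lookup(cache, "Anubis", "Statuette of the god Anubis.", "gpt-4o-mini"))
print(cache_lookup(cache, "Anubis", "Some other text", "gpt-4o-mini"))
```

Keying on the context text as well as the entity text matters because the same name can refer to different things in different texts.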


## 5. Save results

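We save the results in a json file. The file name contains a hash of the file contents, so different results end up in different files.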

In [None]:
import hashlib

# Assumption: IN_COLAB and files are not defined elsewhere in this notebook,
# so we detect Google Colab here
try:
    from google.colab import files
    IN_COLAB = True
except ImportError:
    IN_COLAB = False


def save_results(texts, key):
    """Save the results in a json file named after the hash of its contents"""
    json_string = json.dumps(texts, ensure_ascii=False, indent=2)
    content_hash = hashlib.sha1(json_string.encode("utf-8")).hexdigest()
    output_file_name = f"output_{key}{content_hash}.json"
    with open(output_file_name, "w", encoding="utf-8") as output_file:
        print(json_string, end="", file=output_file)
    if IN_COLAB:
        try:
            files.download(output_file_name)
            print(f"{char_success} Downloaded results to file {output_file_name}")
        except Exception:
            print(f"{char_failure} Downloading results failed!")
    else:
        print(f"{char_success} Saved results to file {output_file_name}")
In [72]:
entities[0]

{'entity_text': 'Anubis',
 'text_id': 0,
 'text': 'Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. Acquired before 1882. C. 115',
 'candidates': [{'id': 'Q14896497', 'description': 'genus of insects'},
  {'id': 'Q47534',
   'description': 'Egyptian deity of mummification and the afterlife, usually depicted as a man with a canine head'},
  {'id': 'Q134301689', 'description': 'anti-web scraping software'},
  {'id': 'Q135055514',
   'description': 'video game developed by OA Game Studio'},
  {'id': 'Q145772', 'description': 'asteroid'},
  {'id': 'Q5514020',
   'description': 'daemon that sits between the Mail User Agent (MUA) and the Mail Transfer Agent (MTA)'},
  {'id': 'Q108330387', 'description': 'British Drag Queen'}],
 'wikidata_id': {'gpt-4o-mini': 'Q47534'}}
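
Since the stored result is only an id, it can help to join it back to the candidate list to see which description the model picked. A small sketch, with the hypothetical helper `describe_choice` and an entity trimmed from the output above:

```python
def describe_choice(entity, model):
    """Return the chosen id, its candidate description and its Wikidata URL (hypothetical helper)"""
    chosen_id = entity["wikidata_id"][model]
    descriptions = {candidate["id"]: candidate["description"]
                    for candidate in entity["candidates"]}
    return {
        "id": chosen_id,
        "description": descriptions.get(chosen_id),  # None if the id is not a candidate
        "url": f"https://www.wikidata.org/wiki/{chosen_id}",
    }


# Entity trimmed from the output above
entity = {
    "entity_text": "Anubis",
    "candidates": [
        {"id": "Q14896497", "description": "genus of insects"},
        {"id": "Q47534", "description": "Egyptian deity of mummification and the afterlife, usually depicted as a man with a canine head"},
    ],
    "wikidata_id": {"gpt-4o-mini": "Q47534"},
}
print(describe_choice(entity, "gpt-4o-mini"))
```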

## Old Visualization Code

This bit of code is just for making it easier to visualize our data at the end of the notebook.

In [None]:
import re
import spacy
from spacy.tokens import Doc, Span


def annotated_text_to_spacy_doc(text, nlp=None):
    """
    Converts annotated text in format [Entity](LABEL) to a spaCy Doc with entity spans.
    
    Args:
        text (str): Text with annotations like "[Tom](PERSON) worked for [Microsoft](ORGANIZATION)"
        nlp (spacy.Language, optional): spaCy language model. If None, uses blank English model.
    
    Returns:
        spacy.tokens.Doc: spaCy document with entity spans set
        
    Example:
        >>> text = "[Tom](PERSON) worked for [Microsoft](ORGANIZATION) in 2020 before he lived in [Rome](LOCATION)."
        >>> doc = annotated_text_to_spacy_doc(text)
        >>> spacy.displacy.render(doc, style="ent")
    """
    if nlp is None:
        nlp = spacy.blank("en")
    
    # Pattern to match [text](LABEL) format
    pattern = r'\[([^\]]+)\]\(([^)]+)\)'
    
    # Parse the text to extract tokens and entity information
    tokens = []
    entity_spans = []  # List of (start_token_idx, end_token_idx, label)
    custom_labels = set()
    
    # Split text by the pattern and process each part
    last_end = 0
    token_idx = 0
    
    for match in re.finditer(pattern, text):
        # Add tokens before the entity
        before_entity = text[last_end:match.start()]
        if before_entity.strip():
            # Tokenize the text before the entity
            before_tokens = before_entity.split()
            tokens.extend(before_tokens)
            token_idx += len(before_tokens)
        
        # Add the entity tokens
        entity_text = match.group(1)
        entity_label = match.group(2)
        custom_labels.add(entity_label)
        
        # Tokenize the entity text
        entity_tokens = entity_text.split()
        start_token_idx = token_idx
        tokens.extend(entity_tokens)
        token_idx += len(entity_tokens)
        end_token_idx = token_idx
        
        # Store entity span information
        entity_spans.append((start_token_idx, end_token_idx, entity_label))
        
        last_end = match.end()
    
    # Add any remaining tokens after the last entity
    remaining = text[last_end:]
    if remaining.strip():
        remaining_tokens = remaining.split()
        tokens.extend(remaining_tokens)
    
    # Add custom labels to the NLP model if they don't exist
    if "ner" not in nlp.pipe_names:
        ner = nlp.add_pipe("ner")
    else:
        ner = nlp.get_pipe("ner")
    
    for label in custom_labels:
        ner.add_label(label)
    
    # Create spaces array (True for tokens that should have a space after them)
    # Simple heuristic: all tokens except the last one get a space
    spaces = [True] * len(tokens)
    if tokens:
        spaces[-1] = False
    
    # Create the Doc from tokens
    doc = Doc(nlp.vocab, words=tokens, spaces=spaces)
    
    # Create entity spans
    entities = []
    for start_idx, end_idx, label in entity_spans:
        if start_idx < len(doc) and end_idx <= len(doc):
            span = Span(doc, start_idx, end_idx, label=label)
            entities.append(span)
    
    # Set entities on the document
    doc.ents = entities
    
    return doc


def visualize_annotated_text(text, nlp=None, style="ent", jupyter=True):
    """
    Convenience function to convert annotated text and visualize it with displaCy.
    
    Args:
        text (str): Text with annotations like "[Tom](PERSON) worked for [Microsoft](ORGANIZATION)"
        nlp (spacy.Language, optional): spaCy language model. If None, uses blank English model.
        style (str): displaCy style ("ent" or "dep")
        jupyter (bool): Whether to render for Jupyter notebook
    
    Returns:
        Rendered visualization (HTML string if not in Jupyter)
    """
    doc = annotated_text_to_spacy_doc(text, nlp)
    
    try:
        import spacy
        return spacy.displacy.render(doc, style=style, jupyter=jupyter)
    except ImportError:
        print("spaCy not installed. Please install with: pip install spacy")
        return None
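
The `[Entity](LABEL)` markup that these functions expect can also be inspected without spaCy. The sketch below reuses the same regular expression to list the annotations and to recover the plain text; the helper names `extract_annotations` and `strip_annotations` are hypothetical.

```python
import re

# Same pattern as in annotated_text_to_spacy_doc: [text](LABEL)
ANNOTATION_PATTERN = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def extract_annotations(annotated_text):
    """Return (entity_text, label) pairs found in [Entity](LABEL) markup"""
    return ANNOTATION_PATTERN.findall(annotated_text)


def strip_annotations(annotated_text):
    """Return the plain text with each [Entity](LABEL) replaced by Entity"""
    return ANNOTATION_PATTERN.sub(r"\1", annotated_text)


sample = "[Tom](PERSON) worked for [Microsoft](ORGANIZATION) in 2020 before he lived in [Rome](LOCATION)."
print(extract_annotations(sample))
print(strip_annotations(sample))
```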


In [None]:
print(output_text)

In [None]:
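def parse_json_with_sources(text):
    """Split an LLM response into its fenced json part and the source text that follows it"""
    json_data = text.split("```json")[1]
    # Split only at the first closing fence so that later fences stay in the sources
    json_data, sources = json_data.split("```", 1)
    json_data = json.loads(json_data)
    return json_data, sources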


json_output, sources = parse_json_with_sources(output_text)
print(json_output)

In [None]:
from spacy import displacy
import spacy

In [None]:
doc = annotated_text_to_spacy_doc(TEXT)

In [None]:
displacy.render(doc, style="ent")

In [None]:
output_ents = []
pandas_output = []
for ent in doc.ents:
    found = False
    for item in json_output:
        if item["entity_text"] == ent.text:
            output_ents.append({"start": ent.start_char, "end": ent.end_char, "label": f'{ent.label_} <a href="https://www.wikidata.org/wiki/{item["wikidata_id"]}">{item["wikidata_id"]}</a>'})
            pandas_output.append({"entity_text": item["entity_text"], "label": item["label"], "wikidata_id": item["wikidata_id"], "ent_start": ent.start_char, "ent_end": ent.end_char})
            found = True
    if not found:
        output_ents.append({"start": ent.start_char, "end": ent.end_char, "label": ent.label_})
        pandas_output.append({"entity_text": ent.text, "label": ent.label_, "wikidata_id": None, "ent_start": ent.start_char, "ent_end": ent.end_char})


In [None]:
dic_ents = {
    "text": doc.text,
    "ents": output_ents,
    "title": None
}

displacy.render(dic_ents, manual=True, style="ent")

## Getting the Data as a DataFrame

In [None]:
df = pd.DataFrame(pandas_output)
df

In [None]:
df.to_csv("../../output/entities.csv", index=False)