# Evaluation of three entity tasks

This recipe describes how to perform the evaluation of **named entity recognition (NER)**, **entity disambiguation (ED)** and **entity linking (EL)** output in different scenarios:

1. The user has groundtruth data, i.e. manually verified entity tags for the entities found in a given text. In this case **quantitative evaluation** is possible.  
2. The user does not have groundtruth data, but is willing to manually inspect the NER output in order to spot and flag errors, inconsistencies, hallucinations, etc. In this case, **qualitative evaluation** is necessary. As this process is time-consuming, it can be supported by in-notebook visualizations for quick data inspection.  

Note that the recipe only showcases a subset of the possible approaches, cf. [Variations and alternatives](#scrollTo=-riY-m5r7xv_&line=3&uniqifier=1).

### Rationale

These methods help the user assess the quality of **named entity recognition (NER)**, **entity disambiguation (ED)** and **entity linking (EL)** outputs. This is essential for any application, but especially when communicating with lay people, who often have reservations about new technologies.

The cookbook covers both situations: data that comes with ground truth labels (quantitative evaluation) and data that is not labeled (qualitative evaluation, a.k.a. eyeballing).

To run the quantitative evaluation we use the [`HIPE-scorer`](https://github.com/hipe-eval/HIPE-scorer), a set of Python scripts developed as part of the [HIPE shared task](https://hipe-eval.github.io/), which focused on named entity processing of historical documents. As such, these scripts have certain requirements, for example when it comes to file naming or data format.

The output data can be fed to application recipes for visualizing and analyzing errors, which makes it easier to estimate performance, also for lay people.

### Process overview

The evaluation module takes as input a TSV file in which the first column holds the token and the remaining columns hold the labels assigned to that token.
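As a rough illustration of this layout, the sketch below parses a tiny in-memory TSV with Python's `csv` module. The column names are a subset of the header written by `clean_format` further down in this notebook; the two sample rows and the `_` placeholder in the last column are made up for illustration.

```python
import csv
import io

# Hypothetical two-token sample; real files contain one token per line.
SAMPLE_TSV = (
    "TOKEN\tNE-COARSE-LIT\tNEL-LIT\tMISC\n"
    "Amenhotep\tB-PERSON\tQ158052\t_\n"
    "I.\tI-PERSON\tQ158052\t_\n"
)

# QUOTE_NONE keeps quotation marks inside tokens from being read as field quotes.
reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

for row in rows:
    print(row["TOKEN"], row["NE-COARSE-LIT"], row["NEL-LIT"])
```

Each row is a dictionary keyed by column name, so the token and its labels can be looked up without relying on column positions.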

If the file includes gold labels, the user can perform the quantitative evaluation of the annotated test data. The process uses the following steps:

1. Installing the HIPE scorer
2. Downloading the evaluation data and ground truths
3. Reshaping the data to the format required by the scorer
4. Running the scorer and saving the results

If the file does not include gold labels, the cookbook displays a visualization of the annotations and lets the user give free-text feedback on them.

## 1. Evaluation preparation

The notebook cells in this section contain the definitions of functions that are used further down in the notebook. These cells **must be run**, but you don't need to inspect them closely unless you want to modify the behaviour of this notebook.

First, we import the Python libraries and helper modules used below:

In [14]:
from docopt import docopt
import json
import logging
import os
import sys
sys.path.insert(0, '../tasks')
import utils

Next, we test whether the HIPE-scorer is available. If it is not, we install it:

In [2]:
BASE_DIR = os.getcwd()

if not os.path.exists("HIPE-scorer/clef_evaluation.py"):
    ! git clone https://github.com/enriching-digital-heritage/HIPE-scorer.git
    os.chdir(os.path.join(BASE_DIR, "HIPE-scorer"))
    ! pip install -r requirements.txt
    ! pip install .

Finally, we load the HIPE-scorer:

In [3]:
os.chdir(os.path.join(BASE_DIR, "HIPE-scorer"))
import clef_evaluation

## 2. Compare analysis with gold data

We need helper functions for accessing the HIPE-scorer, for converting the data to the scorer format, and for interpreting the scorer's analysis.

In [7]:
ARGS_VALUES = {'--glue': None, '--help': False, '--hipe_edition': 'hipe-2020',
 '--log': 'log.txt', '--n_best': '1', '--noise-level': None,
 '--original_nel': False, '--skip-check': False, '--suffix': None,
 '--tagset': None, '--task': 'nerc_coarse', '--time-period': None}


def run_scorer(args):
    tasks = ("nerc_coarse", "nerc_fine", "nel")
    if args["--task"] not in tasks:
        msg = "Please restrict to one of the available evaluation tasks: " + ", ".join(
            tasks
        )
        logging.error(msg)
        sys.exit(1)
    logging.debug(f"ARGUMENTS {args}")
    # Defaults first, so that values passed in `args` (e.g. "--task") take precedence
    clef_evaluation.main(args=ARGS_VALUES | args)

In [32]:
def get_label_value(entity, label):
    if label not in entity:
        return ""
    else:
        return entity[label][list(entity[label].keys())[0]]
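To see what this helper returns, here is a small self-contained sketch. It assumes that each label in the prediction file maps to a dictionary and that only the first value in that dictionary is of interest; the inner key `model_a` is a hypothetical placeholder, not a name taken from the data.

```python
def get_label_value(entity, label):
    """Return the first value stored under `label`, or "" if the label is absent."""
    if label not in entity:
        return ""
    return next(iter(entity[label].values()))


# Hypothetical entity; the inner key "model_a" is an assumption.
entity = {"label": "PERSON", "wikidata_id": {"model_a": "Q158052"}}

print(get_label_value(entity, "wikidata_id"))  # Q158052
print(repr(get_label_value(entity, "link")))   # ''
```

Note that this variant, like the original, fails on an empty inner dictionary, so the prediction file is assumed never to contain one.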

In [35]:
PRED_FILE_TASKS = "../tasks/output_linking_34e26bfd19c837e400a5fcb214cd1e7a25304a12.json"

# The `with` block closes the file automatically
with open(os.path.join(BASE_DIR, PRED_FILE_TASKS), "r") as infile:
    pred_data = json.load(infile)

all_tokens = []
for text in pred_data:
    tokens = {token["start"]: {"text": token["text"], "label": "O"}
              for token in text["tokens"]}
    next_id = -1
    for char_id in reversed(tokens.keys()):
        tokens[char_id]["next_id"] = next_id
        next_id = char_id
    for entity in text["entities"]:
        try:
            wikidata_id = get_label_value(entity, "wikidata_id")
            link = get_label_value(entity, "link")
            tokens[entity["start_char"]]["label"] = "B-" + entity["label"]
            tokens[entity["start_char"]]["wikidata_id"] = wikidata_id
            tokens[entity["start_char"]]["link"] = link 
            next_id = tokens[entity["start_char"]]["next_id"]
            while next_id >= 0 and next_id < entity["end_char"]:
                tokens[next_id]["label"] = "I-" + entity["label"]
                tokens[next_id]["wikidata_id"] = wikidata_id
                tokens[next_id]["link"] = link
                next_id = tokens[next_id]["next_id"]
        except Exception as e:
            print(f"warning: no token at character index {e}")
    all_tokens.append(tokens)
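The scorer reads TSV files rather than Python dictionaries, so token dictionaries like the ones built above still need to be serialised. The following is a minimal sketch, not part of the original notebook: `tokens_to_tsv` is a hypothetical helper, the ten-column header is copied from `clean_format` further down, and the `_` placeholder for empty columns is an assumption about what the scorer accepts.

```python
import io

HEADER = ["TOKEN", "NE-COARSE-LIT", "NE-COARSE-METO", "NE-FINE-LIT", "NE-FINE-METO",
          "NE-FINE-COMP", "NE-NESTED", "NEL-LIT", "NEL-METO", "MISC"]


def tokens_to_tsv(documents, out, empty="_"):
    """Write a list of {start_char: token_dict} mappings as tab-separated rows."""
    out.write("\t".join(HEADER) + "\n")
    for doc_id, tokens in enumerate(documents, start=1):
        # One comment line per document, preceded by a blank line
        out.write(f"\n# document_{doc_id}\n")
        for start in sorted(tokens):
            token = tokens[start]
            row = {column: empty for column in HEADER}
            row["TOKEN"] = token["text"]
            row["NE-COARSE-LIT"] = token["label"]
            row["NEL-LIT"] = token.get("wikidata_id") or empty
            out.write("\t".join(row[column] for column in HEADER) + "\n")


# Made-up document in the same shape as the entries of `all_tokens`
demo = [{0: {"text": "Amenhotep", "label": "B-PERSON", "wikidata_id": "Q158052"},
         10: {"text": "I.", "label": "I-PERSON", "wikidata_id": "Q158052"},
         13: {"text": "Limestone", "label": "O"}}]

buffer = io.StringIO()
tokens_to_tsv(demo, buffer)
print(buffer.getvalue())
```

Writing to a file-like object keeps the helper easy to test; passing an open file handle instead of `io.StringIO` would produce a file on disk.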



In [45]:
all_tokens[9]

{0: {'text': 'Cult', 'label': 'O', 'next_id': 5},
 5: {'text': 'statue', 'label': 'O', 'next_id': 12},
 12: {'text': 'of', 'label': 'O', 'next_id': 15},
 15: {'text': 'Amenhotep',
  'label': 'B-PERSON',
  'next_id': 25,
  'wikidata_id': 'Q158052',
  'link': '1'},
 25: {'text': 'I.',
  'label': 'I-PERSON',
  'next_id': 28,
  'wikidata_id': 'Q158052',
  'link': '1'},
 28: {'text': 'Limestone', 'label': 'O', 'next_id': 37},
 37: {'text': '.', 'label': 'O', 'next_id': 39},
 39: {'text': 'New', 'label': 'O', 'next_id': 43},
 43: {'text': 'Kingdom', 'label': 'O', 'next_id': 50},
 50: {'text': ',', 'label': 'O', 'next_id': 52},
 52: {'text': '19th', 'label': 'O', 'next_id': 57},
 57: {'text': 'Dynasty', 'label': 'O', 'next_id': 65},
 65: {'text': '(', 'label': 'O', 'next_id': 66},
 66: {'text': '1292–1190', 'label': 'O', 'next_id': 76},
 76: {'text': 'BC', 'label': 'O', 'next_id': 78},
 78: {'text': ')', 'label': 'O', 'next_id': 79},
 79: {'text': '.', 'label': 'O', 'next_id': 81},
 81: {'tex

In [8]:
PRED_FILE = "example.tsv"
GOLD_FILE = "example.tsv"

os.chdir(os.path.join(BASE_DIR, "HIPE-scorer"))
run_scorer(args={"--ref": GOLD_FILE,
                 "--pred": PRED_FILE,
                 "--task": "nerc_coarse",
                 "--outdir": "RESULT_FOLDER"})



true: 1 pred: 1
data_format_true [[5]]
data_format_pred [[5]]


### Preparation for manual assessment

In [None]:
import pandas as pd
from IPython.display import HTML, display

# Function for visualising the entities with link

def highlight_entities(
    data,
    iob_column = "NE-COARSE-LIT",
    base_url="https://www.wikidata.org/wiki/"
    ):
    # 1) Rebuild the text with spacing rules
    text_parts = []
    for idx, row in data.iterrows():
        tok = row["TOKEN"]
        if "NoSpaceAfter" in row["MISC"]:
            text_parts.append(tok)
        else:
            text_parts.append(tok + " ")
    text = "".join(text_parts)

    # 2) Merge contiguous IOB entities of the same type
    entities = []
    current = None  # {"start_char": int, "end_char": int, "label": str}

    for idx, row in data.iterrows():
        tag = row[iob_column]
        if tag == "_" or tag == "O":
            # close any open entity
            if current is not None:
                entities.append(current)
                current = None
            continue

        # Extract type and prefix
        if tag.startswith("B-"):
            etype = tag[2:]
            eid = row["NEL-LIT"]
            # close previous if open
            if current is not None:
                entities.append(current)
            # start new
            current = {
                "start_char": int(row["start_char"]),
                "end_char": int(row["end_char"]),
                "label": etype,
                "id": eid
            }

        elif tag.startswith("I-"):
            etype = tag[2:]
            eid = row["NEL-LIT"]
            if current is not None and current["label"] == etype:
                # extend current run
                current["end_char"] = int(row["end_char"])
            else:
                # stray I- (no open run or different type) → treat as B-
                if current is not None:
                    entities.append(current)
                current = {
                    "start_char": int(row["start_char"]),
                    "end_char": int(row["end_char"]),
                    "label": etype,
                    "id": eid
                }

        else:
            # Unknown tag → close any open entity
            if current is not None:
                entities.append(current)
                current = None

    # flush any remaining entity
    if current is not None:
        entities.append(current)

    # 3) Render with spans (note: end_char is inclusive → slice to en+1)
    entities.sort(key=lambda e: e["start_char"])
    result = ""
    last_idx = 0

    for e in entities:
        s, en = int(e["start_char"]), int(e["end_char"])
        etext = text[s:en + 1]  # inclusive end
        etype = e.get("label", "Other")
        eid = e.get("id", "")
        color = label_to_color.get(etype, "#dddddd")

        # decide whether to show eid as link or not
        if eid != "_" and eid != "NIL":
            eid_html = f'<a href="{base_url}{eid}">{eid}</a>'
        else:
            eid_html = ""  # if entity linking was not successful no link is shown

        result += text[last_idx:s]
        result += (
            f'<span style="background-color:{color}; color:black; padding:3px 6px; '
            f'border-radius:16px; margin:0 2px; display:inline-block; '
            f'box-shadow: 1px 1px 3px rgba(0,0,0,0.1);">'
            f'{etext}'
            f'<span style="font-size:0.75em; font-weight:normal; margin-left:6px; '
            f'background-color:rgba(0,0,0,0.05); padding:1px 6px; border-radius:12px;">'
            f'{etype} {eid_html}</span></span>'
        )
        last_idx = en + 1  # continue after inclusive end

    result += text[last_idx:]
    return result,text,entities

### Data preparation

In [None]:
# we need a function to convert the British Museum groundtruth data into the format expected by the HIPE scorer.
# The data is found here: http://145.38.185.232/enriching/bm.txt

In [None]:
def get_demo_data():
  import pandas as pd
  # Loading the data
  data = pd.read_csv(
      'https://raw.githubusercontent.com/wjbmattingly/llm-lod-recipes/refs/heads/main/output/sample.tsv',
      sep='\t'
  )

  # Adding the start and end characters per token to the dataframe
  data['start_char'] = 0
  data['end_char'] = 0
  current_char = 0

  for index, row in data.iterrows():
      token = row['TOKEN']
      data.loc[index, 'start_char'] = current_char
      # end_char is inclusive: it points at the last character of the token itself,
      # so a trailing space is not counted as part of the token
      data.loc[index, 'end_char'] = current_char + len(token) - 1
      # Advance past the token, plus one space unless MISC says "NoSpaceAfter"
      if 'NoSpaceAfter' in row['MISC']:
          current_char += len(token)
      else:
          current_char += len(token) + 1

  # Just for testing purposes: adds a Wikidata ID to one entity
  data.loc[0, 'NEL-LIT'] = 'Q1744'
  return data

In [None]:
df = get_demo_data()

In [None]:
# preparing data for groundtruth evaluation
import re

HIPE_COLUMNS = ["TOKEN","NE-COARSE-LIT","NE-COARSE-METO","NE-FINE-LIT","NE-FINE-METO","NE-FINE-COMP","NE-NESTED","NEL-LIT","NEL-METO","MISC"]

def clean_format(input_file,output_file):
  # Both files are closed automatically when the `with` block ends
  with open(input_file) as infile, open(output_file,mode='w') as output:
    output.write('\t'.join(HIPE_COLUMNS)+'\n')
    i = 1
    for line in infile:
      # Underscores are replaced by hyphens throughout the line
      mod_line = re.sub('_','-',line.strip())
      # Turn each document separator into a "# document_<n>" comment line
      if re.search('-DOCSTART- -DOCSTART- -DOCSTART-',mod_line):
        mod_line = re.sub('-DOCSTART- -DOCSTART- -DOCSTART-',f'# document_{i}',mod_line)
        i+=1
      mod_line = mod_line.split(' ')
      if len(mod_line)==2:
        # Document marker: written on its own line, preceded by a blank line
        output.write("\n"+" ".join(mod_line)+"\n")
      elif len(mod_line)>=3:
        # Token line: token, coarse NE tag and entity link; the other columns stay empty
        output.write('\t'.join([mod_line[0],mod_line[1],'-','-','-','-','-',mod_line[2],'-','-'])+'\n')
      # Lines with fewer than two fields (e.g. blank lines) are skipped

In [None]:
clean_format('./data/bm_labels.txt','./data/gold/sample.tsv')
clean_format('./data/bm-2-ner-format.txt','./data/predictions/sample.tsv')

## NER evaluation with groundtruth data by using the HIPE scorer

In [None]:
! python clef_evaluation.py --help

### Running the scorer

 ‼️ In the cell below it is important to note the parameter `--task`. This parameter value needs to be adjusted depending on the task one wants to evaluate (i.e. NER or EL). When evaluating NER we use `--task nerc_coarse`, while for evaluating EL we use `--task nel`.
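As a sketch of how the two tasks differ only in the `--task` value, the hypothetical helper below builds the scorer command as an argument list. It uses only flags that appear in this notebook (`--ref`, `--pred`, `--task`, `--outdir`, `--hipe_edition`); the file paths are the sample paths from the data preparation step, and `scorer_command` itself is not part of the HIPE-scorer.

```python
import shlex


def scorer_command(ref, pred, task, outdir="./data/evaluations/"):
    """Build the clef_evaluation.py call used in this notebook for one task."""
    return ["python", "clef_evaluation.py",
            "--ref", ref,
            "--pred", pred,
            "--task", task,
            "--outdir", outdir,
            "--hipe_edition", "hipe-2020"]


for task in ("nerc_coarse", "nel"):
    cmd = scorer_command("./data/gold/sample.tsv", "./data/predictions/sample.tsv", task)
    # shlex.join quotes arguments where needed so the line can be pasted into a shell
    print(shlex.join(cmd))
```

The resulting list can also be handed to `subprocess.run(cmd, check=True)` instead of using the `!` shell escape, which makes it easier to loop over several tasks.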

In [None]:
import glob
import regex as re

for doc in glob.glob('./data/predictions/*'):
  gold = re.sub('predictions','gold',doc)
  ! python clef_evaluation.py --ref "{gold}" --pred "{doc}" --task nerc_coarse --outdir ./data/evaluations/ --hipe_edition hipe-2020 --log=./data/evaluations/scorer.log


Let's now look at the various bits of data produced by the scorer. They are in the folder specified in the `--outdir` parameter of the scorer.

In [None]:
ls -la ./evaluation_data/evaluations/

Here is an overview of the files created by the scorer:
- `scorer.log` – the log produced by the scorer
- `01_sample_nerc_coarse.tsv` – a TSV file containing the evaluation results for document `01_sample` and task `nerc_coarse`, at different levels of aggregation etc.
- `01_sample_nerc_coarse.json` – a JSON file with a more granular breakdown of the evaluation, which can be useful for error analysis and to better understand systems' performance.

In [None]:
! cat ./evaluation_data/evaluations/scorer.log

In [None]:
import pandas as pd

eval_df = pd.read_csv('./evaluation_data/evaluations/01_sample_nerc_coarse.tsv', sep='\t')
eval_df.drop(columns=['System'], inplace=True)
eval_df.set_index('Evaluation', inplace=True)

Let's print the **micro-averaged** precision, recall and F-score in a **strict** evaluation regime:

In [None]:
eval_df.loc['NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL'][['P', 'R', 'F1']]

Let's print the **micro-averaged** precision, recall and F-score in a **fuzzy** evaluation regime:

In [None]:
# Let's print the micro-averaged precision, recall and F-score in a *fuzzy* evaluation regime
eval_df.loc['NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL'][['P', 'R', 'F1']]
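To make the difference between the two regimes concrete, here is a simplified sketch. It is not the HIPE-scorer's implementation: it assumes entities are `(start, end, label)` tuples with inclusive character offsets, counts a strict match when span and label are identical, and counts a fuzzy match when the labels agree and the spans overlap. The example entities are made up.

```python
def overlaps(a, b):
    """True if two inclusive (start, end, label) spans share at least one character."""
    return max(a[0], b[0]) <= min(a[1], b[1])


def micro_prf(gold, pred, fuzzy=False):
    """Precision, recall and F1 over all entities pooled together."""
    def matches(g, p):
        if g[2] != p[2]:
            return False
        return overlaps(g, p) if fuzzy else (g[0], g[1]) == (p[0], p[1])

    unmatched_gold = list(gold)
    true_positives = 0
    for p in pred:
        hit = next((g for g in unmatched_gold if matches(g, p)), None)
        if hit is not None:
            unmatched_gold.remove(hit)  # each gold entity can be matched only once
            true_positives += 1

    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [(15, 26, "PERSON"), (52, 63, "DATE")]
pred = [(15, 23, "PERSON"), (52, 63, "DATE")]  # the first predicted span is too short

print(micro_prf(gold, pred))              # strict: (0.5, 0.5, 0.5)
print(micro_prf(gold, pred, fuzzy=True))  # fuzzy:  (1.0, 1.0, 1.0)
```

A large gap between the strict and fuzzy scores, as in this toy case, suggests that the system finds the right entities but gets their boundaries wrong.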

There is more data in the TSV file, as can be seen when printing its whole content:

In [None]:
! cat ./evaluation_data/evaluations/01_sample_nerc_coarse.tsv

## Manual assessment of the output

### Displaying the entities

To allow for manual assessment of output quality, the following cells display the identified entities as color-coded HTML annotations. Such insights can complement quantitative statistics and also serve as an independent way to estimate output quality. The latter is especially important in the frequent cases where no gold (or silver or bronze) standard is available against which the NER output can be evaluated. (Other tools and modules provide similar visualisations, e.g. [displaCy](https://spacy.io/usage/visualizers#ent) when using spaCy for NER.)

In [None]:
data = get_demo_data()
data.head()

In [None]:
# Highlighting the identified entities in the form of color-coded annotations with links to authority files where available

# Color palette - add more colors if more labels are used or change them here
colors = ['#F0D9EF', '#FCDCE1', '#FFE6BB', '#E9ECCE', '#CDE9DC', '#C4DFE5', '#D9E5F0', '#F0E6D9', '#E0D9F0', '#E6FFF0', '#9CC5DF']

# Name Labels that should be shown in color, not mentioned labels will be shown in grey (this makes it easier to focus on certain categories if needed)
labels = ["PERSON", "DATE"]

# Mapping each label from the label set to a color from the palette
label_to_color = {label: colors[i % len(colors)] for i, label in enumerate(labels)}

# Generating the HTML - two changes can be made here:
# 1) by default, the column "NE-COARSE-LIT" is used for the entities, this can be changed via the argument "iob_column"
# 2) the entity identifiers are taken from the column "NEL-LIT"; by default, these are expected to be Wikidata identifiers (e.g. Q1744) and are combined with the Wikidata base URL; for another authority file, the base URL can be changed via the argument "base_url"
res,text,entities = highlight_entities(data)

# displaying the annotations
display(HTML(res))

### Giving feedback on the entities

The following cell gives a very simple example of how manual assessment of entities could be integrated into the data. Here, the user is asked for feedback on each identified entity, which then shows up in a designated column.

In [None]:
# Create a new column for manual assessment
data['manual_assessment'] = ''

# Display results again for better overview (no scrolling back and forth)
display(HTML(res))

# Iterate through the identified entities
for e in entities:
    s, en = int(e["start_char"]), int(e["end_char"])
    etext = text[s:en + 1]
    etype = e.get("label", "Other")

    # Ask for feedback on the entity
    feedback = input(f'Is the entity "{etext.strip()}" with label "{etype}" correct? (y/n/feedback): ')

    # Store the feedback for all tokens within the entity span
    for index, row in data.iterrows():
        token_start = int(row["start_char"])
        token_end = int(row["end_char"])
        if max(s, token_start) <= min(en, token_end):
            data.loc[index, 'manual_assessment'] = feedback

# Show the altered data (now with user assessments)
data
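Once feedback has been collected, a simple tally gives a rough estimate of how many entities were judged correct. The sketch below is an illustration only: it assumes one feedback string per entity, treats `y` (in any case) as "correct" and everything else as an error or comment, and uses invented sample answers. `summarise_feedback` is a hypothetical helper.

```python
from collections import Counter


def summarise_feedback(feedback_per_entity):
    """Count the answers and return the share of entities judged correct."""
    counts = Counter(answer.strip().lower() for answer in feedback_per_entity)
    total = sum(counts.values())
    share_correct = counts["y"] / total if total else 0.0
    return counts, share_correct


# Invented answers, one per entity
answers = ["y", "Y", "n", "wrong boundary", "y"]
counts, share = summarise_feedback(answers)

print(counts.most_common())
print(f"{share:.0%} of entities judged correct")
```

Free-text answers such as "wrong boundary" are kept as separate categories, so recurring error types show up in the counts.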

## EL evaluation with groundtruth data

TODO.

## Variations and alternatives

If the user does not have groundtruth data, another option is an **LLM-judge** approach: in the absence of labelled gold data, an LLM is asked to “review” the NER output and to assess its quality.
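A minimal sketch of what such a setup could look like is given below. Everything in it is an assumption made for illustration: the prompt wording, the JSON reply format, and the helper names `build_judge_prompt` and `judge_entities`. The model call is passed in as a plain callable, and `fake_model` is a stand-in that returns a canned reply, so no real LLM API is used.

```python
import json


def build_judge_prompt(text, entities):
    """Format the text and its predicted entities as a review request."""
    listing = "\n".join(
        f'{i}. "{e["text"]}" -> {e["label"]}' for i, e in enumerate(entities, 1)
    )
    return (
        "Review the named entities predicted for the text below.\n"
        f"Text: {text}\n"
        f"Predicted entities:\n{listing}\n"
        'Answer with a JSON list of objects like {"index": 1, "correct": true, "comment": ""}.'
    )


def judge_entities(text, entities, ask_model):
    """Send the prompt through `ask_model` (any callable str -> str) and parse the reply."""
    reply = ask_model(build_judge_prompt(text, entities))
    return json.loads(reply)


def fake_model(prompt):
    """Stand-in for a real LLM call; always returns the same canned verdict."""
    return '[{"index": 1, "correct": true, "comment": ""}]'


verdicts = judge_entities(
    "Cult statue of Amenhotep I.",
    [{"text": "Amenhotep I.", "label": "PERSON"}],
    fake_model,
)
print(verdicts)
```

In practice the judge's verdicts would themselves need spot-checking, for example with the manual feedback loop shown earlier, before being trusted as a quality estimate.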