# Evaluation for named entity recognition (NER) and entity linking (EL)

---



## Summary

This recipe describes how to perform the evaluation of **named entity recognition (NER)** and **entity linking (EL)** output in different scenarios:

1. The user does have groundtruth data, where groundtruth data is the manually verified entity tags of entities found in a given text. In this case **quantitative evaluation** is possible.  
2. The user does not have groundtruth data, but they are willing to manually inspect the NER output in order to spot and flag errors, inconsistencies, hallucinations, etc. In this case, **qualitative evaluation** is necessary. As this process is time consuming, it can be supported by in-notebook visualizations for quick data inspection.
3. Alternatively, a user without groundtruth data can take an **LLM-as-judge** approach, in which an LLM evaluates the NER output in the absence of labelled gold data. The task of the LLM is then to “review” the NER output and to assess its quality.

Note that the recipe only showcases a subset of the possible approaches, cf. [Variations and alternatives](#scrollTo=-riY-m5r7xv_&line=3&uniqifier=1).

## Rationale

These methods help the user assess the quality of **named entity recognition (NER)** and **entity linking (EL)** outputs. This is essential for any application, but especially when communicating with lay people, who often have reservations about new technologies.

The cookbook also allows for evaluation both in a situation in which data comes with ground truth labels (quantitative evaluation) and in a situation where data is not labeled (qualitative evaluation, a.k.a. eye-balling).

To run the quantitative evaluation we use the [`HIPE-scorer`](https://github.com/hipe-eval/HIPE-scorer), a set of Python scripts developed as part of the [HIPE shared task](https://hipe-eval.github.io/), which focuses on named entity processing of historical documents. As such, these scripts have certain requirements, for example when it comes to file naming or data format.

The output of the evaluation can be fed to application recipes for visualizing and analyzing errors, which makes it easier to estimate performance, also for lay people.


## Process overview

The evaluation module takes as input a TSV file in which the first column holds the token and the remaining columns hold the annotations for that token.
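
As a rough illustration of this layout, the following sketch parses a two-token sample with Python's `csv` module. The column names are the ones written by the conversion function later in this notebook; the sample rows and the helper name `read_hipe_tsv` are made up for the example, and lines starting with `#` are assumed to be document markers.

```python
import csv
import io

# Hypothetical two-token sample in the column layout used later in this notebook
SAMPLE = (
    "TOKEN\tNE-COARSE-LIT\tNE-COARSE-METO\tNE-FINE-LIT\tNE-FINE-METO\t"
    "NE-FINE-COMP\tNE-NESTED\tNEL-LIT\tNEL-METO\tMISC\n"
    "# document_1\n"
    "Merian\tB-PER\t-\t-\t-\t-\t-\tQ37093060\t-\t-\n"
    "etching\tO\t-\t-\t-\t-\t-\t-\t-\t-\n"
)

def read_hipe_tsv(handle):
    """Return a list of {column: value} dicts, skipping comment and blank lines."""
    lines = [line for line in handle if line.strip() and not line.startswith("#")]
    # QUOTE_NONE keeps tokens such as a lone double quote intact
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)

rows = read_hipe_tsv(io.StringIO(SAMPLE))
for row in rows:
    print(row["TOKEN"], row["NE-COARSE-LIT"], row["NEL-LIT"])
```

Note that this simple filter would also drop a token line whose token is `#`; a real loader would need a stricter test for document markers.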

If the file includes gold labels, the user can perform a quantitative evaluation of the annotated test data. The process consists of the following steps:
1) Installing the HIPE scorer
2) Downloading the evaluation data and ground truths
3) Reshaping the data to the format required by the scorer
4) Running the scorer and saving the results

If the file does not include gold labels, the cookbook displays a visualization of the annotations and lets the user give free-text feedback on them.

## Preparation

The notebook cells in this section contain the definitions of functions that are used further down in the notebook. These cells **must be run**, but you don't need to inspect them closely unless you want to modify the behaviour of this notebook.

### Preparation for HIPE scorer

In [1]:
! git clone https://github.com/enriching-digital-heritage/HIPE-scorer.git

Cloning into 'HIPE-scorer'...
remote: Enumerating objects: 1027, done.
remote: Counting objects: 100% (130/130), done.
remote: Compressing objects: 100% (99/99), done.
remote: Total 1027 (delta 61), reused 65 (delta 27), pack-reused 897 (from 1)
Receiving objects: 100% (1027/1027), 343.10 KiB | 5.72 MiB/s, done.
Resolving deltas: 100% (652/652), done.


In [2]:
cd HIPE-scorer/


/content/HIPE-scorer


In [3]:
pip install -r requirements.txt

Collecting docopt (from -r requirements.txt (line 1))
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: docopt
  Building wheel for docopt (setup.py) ... done
  Created wheel for docopt: filename=docopt-0.6.2-py2.py3-none-any.whl size=13706 sha256=5d9f33cb35a2748dd8ef1eb712ce27b97a420e0093cc8cfdc2aad728305c99c3
  Stored in directory: /root/.cache/pip/wheels/1a/bf/a1/4cee4f7678c68c5875ca89eaccf460593539805c3906722228
Successfully built docopt
Installing collected packages: docopt
Successfully installed docopt-0.6.2


In [5]:
! pip install .

Processing /content/HIPE-scorer
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: HIPE-scorer
  Building wheel for HIPE-scorer (setup.py) ... done
  Created wheel for HIPE-scorer: filename=HIPE_scorer-2.0-py3-none-any.whl size=15478 sha256=bd2eec34bdfd7860fdf2a0dbfee993a7a9e36440e6a5918f00c5e2e559eb2e61
  Stored in directory: /root/.cache/pip/wheels/6c/70/25/36232846b9cd45c513678a5037cd77f079a0d86c8f80b7a6e7
Successfully built HIPE-scorer
Installing collected packages: HIPE-scorer
Successfully installed HIPE-scorer-2.0


In [4]:
import glob
import regex as re
from collections import defaultdict
import pandas as pd

### Preparation for manual assessment

In [None]:
from IPython.display import HTML, display

# Function for visualising the entities with link

def highlight_entities(
    data,
    iob_column = "NE-COARSE-LIT",
    base_url="https://www.wikidata.org/wiki/"
    ):
    # 1) Rebuild the text with spacing rules
    text_parts = []
    for idx, row in data.iterrows():
        tok = row["TOKEN"]
        if "NoSpaceAfter" in row["MISC"]:
            text_parts.append(tok)
        else:
            text_parts.append(tok + " ")
    text = "".join(text_parts)

    # 2) Merge contiguous IOB entities of the same type
    entities = []
    current = None  # {"start_char": int, "end_char": int, "label": str}

    for idx, row in data.iterrows():
        tag = row[iob_column]
        if tag == "_" or tag == "O":
            # close any open entity
            if current is not None:
                entities.append(current)
                current = None
            continue

        # Extract type and prefix
        if tag.startswith("B-"):
            etype = tag[2:]
            eid = row["NEL-LIT"]
            # close previous if open
            if current is not None:
                entities.append(current)
            # start new
            current = {
                "start_char": int(row["start_char"]),
                "end_char": int(row["end_char"]),
                "label": etype,
                "id": eid
            }

        elif tag.startswith("I-"):
            etype = tag[2:]
            eid = row["NEL-LIT"]
            if current is not None and current["label"] == etype:
                # extend current run
                current["end_char"] = int(row["end_char"])
            else:
                # stray I- (no open run or different type) → treat as B-
                if current is not None:
                    entities.append(current)
                current = {
                    "start_char": int(row["start_char"]),
                    "end_char": int(row["end_char"]),
                    "label": etype,
                    "id": eid
                }

        else:
            # Unknown tag → close any open entity
            if current is not None:
                entities.append(current)
                current = None

    # flush any remaining entity
    if current is not None:
        entities.append(current)

    # 3) Render with spans (note: end_char is inclusive → slice to en+1)
    entities.sort(key=lambda e: e["start_char"])
    result = ""
    last_idx = 0

    for e in entities:
        s, en = int(e["start_char"]), int(e["end_char"])
        etext = text[s:en + 1]  # inclusive end
        etype = e.get("label", "Other")
        eid = e.get("id", "")
        color = label_to_color.get(etype, "#dddddd")

        # decide whether to show eid as link or not
        if eid !="_" and eid !="NIL":
          eid_html = f'<a href="{base_url}{eid}">{eid}</a>'
        else:
          eid_html = ""  # if entity linking was not successful no link is shown

        result += text[last_idx:s]
        result += (
            f'<span style="background-color:{color}; color:black; padding:3px 6px; '
            f'border-radius:16px; margin:0 2px; display:inline-block; '
            f'box-shadow: 1px 1px 3px rgba(0,0,0,0.1);">'
            f'{etext}'
            f'<span style="font-size:0.75em; font-weight:normal; margin-left:6px; '
            f'background-color:rgba(0,0,0,0.05); padding:1px 6px; border-radius:12px;">'
            f'{etype} {eid_html}</span></span>'
        )
        last_idx = en + 1  # continue after inclusive end

    result += text[last_idx:]
    return result,text,entities

### Preparation for Groundtruth Assessment

In [None]:
# we need a function to convert the British Museum groundtruth data into the format expected by the HIPE scorer.
# The data is found here: http://145.38.185.232/enriching/bm.txt

In [None]:
# preparing data for groundtruth evaluation
import regex as re

def clean_format(input_file,output_file):
  # The context manager closes both files, so the output is fully written before it is read again
  with open(input_file) as file, open(output_file,mode='w') as output:
    # header row with the column names used by the HIPE scorer
    output.write('\t'.join(["TOKEN","NE-COARSE-LIT","NE-COARSE-METO","NE-FINE-LIT","NE-FINE-METO","NE-FINE-COMP","NE-NESTED","NEL-LIT","NEL-METO","MISC"])+'\n')
    i = 1  # running document number
    for line in file:
      line = line.strip()
      # underscores are replaced by hyphens throughout the line
      mod_line = re.sub('_','-',line)
      # document separators become comment lines of the form "# document_<n>"
      if re.search('-DOCSTART- -DOCSTART- -DOCSTART-',mod_line):
        mod_line = re.sub('-DOCSTART- -DOCSTART- -DOCSTART-',f'# document_{i}',mod_line)
        i+=1
      mod_line = mod_line.split(' ')
      if len(mod_line)==2:
        # document marker ("#", "document_<n>"), preceded by an empty line
        output.write("\n"+" ".join(mod_line)+"\n")
      elif len(mod_line)>=3:
        # token line: token, NER tag and entity link; all other columns are left empty
        output.write('\t'.join([mod_line[0],mod_line[1],'-','-','-','-','-',mod_line[2],'-','-'])+'\n')
      # lines with fewer than two fields (e.g. blank lines) are skipped
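
To make the conversion more tangible, here is a minimal sketch of what happens to a single token line. It assumes, as the function above does, that an input line holds three space-separated fields (token, NER tag, entity link); the helper name `convert_line` and the sample lines are hypothetical.

```python
def convert_line(line):
    """Turn 'token tag link' into a 10-column, tab-separated HIPE-style row."""
    token, tag, link = line.strip().replace("_", "-").split(" ")
    # Columns: TOKEN, NE-COARSE-LIT, five unused NE columns, NEL-LIT, NEL-METO, MISC
    return "\t".join([token, tag, "-", "-", "-", "-", "-", link, "-", "-"])

print(convert_line("Merian B-PER Q37093060").split("\t"))
print(convert_line("snakes O _").split("\t"))
```

Because the underscore replacement is applied to the whole line, a token that itself contains an underscore is altered as well, which is worth keeping in mind when comparing the converted files with the source data.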

In [None]:
clean_format('./data/bm-ner-gold.txt','./data/gold/sample.tsv')
clean_format('./data/bm-ner-predictions.txt','./data/predictions/sample.tsv')
#

In [None]:
# Step 1: read the converted predictions and extract the column headings
with open('data/predictions/sample.tsv') as f:
    predictions = f.read()
lines = predictions.strip().splitlines()
headings = lines[0].split()

# Step 2: parse documents
documents = defaultdict(list)
current_doc = None

for line in lines[1:]:
    line = line.strip()
    if not line:
        continue
    if line.startswith("# document"):
        current_doc = line.lstrip("# ").strip()
    else:
        row = line.split()
        if current_doc:
            documents[current_doc].append(row)


In [None]:
def get_demo_data(data):

  # Adding the start and end characters per token to the dataframe
  data['start_char'] = 0
  data['end_char'] = 0
  current_char = 0

  for index, row in data.iterrows():
      data.loc[index, 'start_char'] = current_char
      token = row['TOKEN']
      # Check if the next token should not have a space before it
      if index + 1 < len(data) and 'NoSpaceAfter' in data.loc[index, 'MISC']:
          current_char += len(token)
      else:
          current_char += len(token) + 1  # Add 1 for the space after the token

      data.loc[index, 'end_char'] = current_char - 1 # Subtract 1 because end_char is inclusive

  # Just for testing purposes: adds a Wikidata ID to one entity
  data.loc[0, 'NEL-LIT'] = 'Q1744'
  return data

In [None]:
pd.DataFrame(documents['document_5'],columns=headings)

Unnamed: 0,TOKEN,NE-COARSE-LIT,NE-COARSE-METO,NE-FINE-LIT,NE-FINE-METO,NE-FINE-COMP,NE-NESTED,NEL-LIT,NEL-METO,MISC
0,Two,O,-,-,-,-,-,-,-,-
1,examples,O,-,-,-,-,-,-,-,-
2,of,O,-,-,-,-,-,-,-,-
3,snakes,O,-,-,-,-,-,-,-,-
4,",",O,-,-,-,-,-,-,-,-
5,one,O,-,-,-,-,-,-,-,-
6,red,O,-,-,-,-,-,-,-,-
7,and,O,-,-,-,-,-,-,-,-
8,black,O,-,-,-,-,-,-,-,-
9,",",O,-,-,-,-,-,-,-,-


### Preparation for LLMs as judges

In [None]:
!pip install openai
from pydantic import BaseModel
from openai import OpenAI
from google.colab import userdata
import json



In [None]:
def convert_document(mydoc):
    text_tokens = []
    ner_list = []
    nel_dict = {}

    for row in mydoc:
        token = row[0]
        label = row[1]  # NE-COARSE-LIT
        nel = row[7]    # NEL-LIT

        # Build plain text
        text_tokens.append(token)

        # Collect NER if label is not "O"
        if label != "O":
            ner_list.append({"text": token, "label": label.split('-')[-1]})

        # Collect NEL if available
        if nel != "-" and nel != "":
            nel_dict[token] = nel

    text = " ".join(text_tokens)

    return {
        "text": text,
        "NER": ner_list,
        "NEL": nel_dict
    }
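
Note that `convert_document` emits one NER entry per token, so a multi-token name such as "Abraham Bloemaert" reaches the judge as two separate entries. If that is not desired, contiguous tokens can be merged first. The following is a minimal sketch, assuming IOB-style tags; the helper name `merge_entities` is hypothetical and the function is not part of the notebook's pipeline.

```python
def merge_entities(rows):
    """Merge runs of adjacent tokens that carry the same entity type.

    rows: list of (token, tag) pairs with IOB-style tags such as 'B-PER', 'I-PER' or 'O'.
    """
    merged = []
    current = None  # entity under construction, if any
    for token, tag in rows:
        if tag in ("O", "-", "_"):
            current = None  # a non-entity token closes the open entity
            continue
        prefix, _, label = tag.partition("-")
        if current is not None and prefix == "I" and current["label"] == label:
            current["text"] += " " + token  # continue the open entity
        else:
            current = {"text": token, "label": label}  # B- tag or type change starts a new one
            merged.append(current)
    return merged

rows = [("Abraham", "B-PER"), ("Bloemaert", "I-PER"), (",", "O"), ("Surinam", "B-LOC")]
print(merge_entities(rows))
```

One limitation: when predictions only use `I-` tags, two directly adjacent entities of the same type are fused into one.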




In [None]:
# defines schema for using structured output_format (e.g. needed for o1 models)

class Feedback(BaseModel):
    valid: str
    explanation: str

# defines function for calls to OpenAI-API

def build_input(row, task):
  if task == "all":
    input = row["text"] + " NER result: " + str(row["NER"]) + " NEL result: " + str(row["NEL"]) + " RE result: " + str(row["RE"])
  else:
    input = row["text"] + " Result: " + str(row[task])
  return input

def generate(row, model, subtask):

  input = build_input(row, subtask)

  if subtask == "all":
    system_prompt = """You will get texts and corresponding results from Named Entity Recognition, Named Entity Linking, Relation Extraction.
                        Check if the results are valid and give back your verdict (valid: y/n) as well as a short explanation if not valid."""
  elif subtask == "NER":
    system_prompt = """You will get texts and corresponding results from Named Entity Recognition.
                        Check if the results are valid and give back your verdict (valid: y/n) as well as a short explanation if not valid."""
  elif subtask == "NEL":
    system_prompt = """You will get texts and corresponding results from Named Entity Linking.
                        Check if the results are valid and give back your verdict (valid: y/n) as well as a short explanation if not valid."""
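  elif subtask == "RE":
    system_prompt = """You will get texts and corresponding results from Relation Extraction.
                        Check if the results are valid and give back your verdict (valid: y/n) as well as a short explanation if not valid."""
  else:
    # fail early instead of running into an undefined system_prompt further down
    raise ValueError(f"Unknown subtask: {subtask!r} (expected 'all', 'NER', 'NEL' or 'RE')")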

  completion = client.beta.chat.completions.parse(
      model = model, #available: see OpenAI website
      messages=[
          {"role": "system", "content": system_prompt},
          {
              "role": "user",
              "content": input
          }],
      response_format = Feedback # Schema defined in cell above
      )
  return (completion.choices[0].message.content, completion.usage)

## NER evaluation with groundtruth data by using the HIPE scorer

In [None]:
! python clef_evaluation.py --help

Evaluate the systems for the HIPE Shared Task

Usage:
  clef_evaluation.py --pred=<fpath> --ref=<fpath> --task=nerc_coarse [options]
  clef_evaluation.py --pred=<fpath> --ref=<fpath> --task=nerc_fine [options]
  clef_evaluation.py --pred=<fpath> --ref=<fpath> --task=nel [--n_best=<n>] [options]
  clef_evaluation.py -h | --help


Options:
    -h --help               Show this screen.
    -t --task=<type>        Type of evaluation task (nerc_coarse, nerc_fine, nel).
    -e --hipe_edition=<str> Specify the HIPE edition (triggers different set of columns to be considered during eval). Possible values: hipe-2020, hipe-2022 [default: hipe-2020]
    -r --ref=<fpath>        Path to gold standard file in CONLL-U-style format.
    -p --pred=<fpath>       Path to system prediction file in CONLL-U-style format.
    -o --outdir=<dir>       Path to output directory [default: .].
    -l --log=<fpath>        Path to log file.
    -g --original_nel       It splits the NEL boundaries using original CLEF

### Running the scorer for NER Evaluation


 ‼️ In the cell below it is important to note the parameter `--task`. This parameter value needs to be adjusted depending on the task one wants to evaluate (i.e. NER or EL). When evaluating NER we use `--task nerc_coarse`, while for evaluating EL we use `--task nel`.
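
To reduce the risk of mistyping the task name, the command can also be assembled by a small helper. This is only a sketch: the function name `build_scorer_command` and its defaults are assumptions, while the flags and task names are the ones listed in the `--help` output above.

```python
import shlex

# Task names as listed in the scorer's --help output
VALID_TASKS = {"nerc_coarse", "nerc_fine", "nel"}

def build_scorer_command(pred, gold, task, outdir="./data/evaluations/", edition="hipe-2020"):
    """Return the argument list for one clef_evaluation.py run."""
    if task not in VALID_TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(VALID_TASKS)}")
    return [
        "python", "clef_evaluation.py",
        "--ref", gold,
        "--pred", pred,
        "--task", task,
        "--outdir", outdir,
        "--hipe_edition", edition,
    ]

cmd = build_scorer_command("./data/predictions/sample.tsv", "./data/gold/sample.tsv", "nel")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from within the `HIPE-scorer` directory.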

In [None]:
for doc in glob.glob('./data/predictions/*'):
  gold = re.sub('predictions','gold',doc)
  ! python clef_evaluation.py --ref "{gold}" --pred "{doc}" --task nerc_coarse --outdir ./data/evaluations/ --hipe_edition hipe-2020 --log=./data/evaluations/scorer.log


true: 100 pred: 100
data_format_true [[36], [49], [37], [76], [56], [38], [73], [33], [20], [113], [49], [41], [17], [85], [8], [47], [30], [14], [36], [21], [10], [35], [47], [32], [24], [22], [58], [48], [56], [69], [45], [47], [40], [92], [16], [8], [67], [24], [53], [8], [35], [71], [28], [31], [15], [36], [46], [29], [89], [47], [48], [65], [8], [7], [38], [25], [40], [39], [30], [51], [26], [39], [68], [31], [90], [41], [28], [8], [41], [23], [30], [22], [13], [53], [66], [9], [10], [69], [8], [39], [48], [81], [52], [38], [19], [59], [30], [34], [27], [44], [32], [47], [45], [4], [58], [43], [17], [23], [56], [9]]
data_format_pred [[36], [49], [37], [76], [56], [38], [73], [33], [20], [113], [49], [41], [17], [85], [8], [47], [30], [14], [36], [21], [10], [35], [47], [32], [24], [22], [58], [48], [56], [69], [45], [47], [40], [92], [16], [8], [67], [24], [53], [8], [35], [71], [28], [31], [15], [36], [46], [29], [89], [47], [48], [65], [8], [7], [38], [25], [40], [39], [30], [51

Let's now look at the various bits of data produced by the scorer. They are in the folder specified in the `--outdir` parameter of the scorer.

In [None]:
ls -la ./data/evaluations/

total 36
drwxr-xr-x 2 root root  4096 Sep 12 08:53 ./
drwxr-xr-x 5 root root  4096 Sep 12 08:52 ../
-rw-r--r-- 1 root root    71 Sep 12 08:52 .gitignore
-rw-r--r-- 1 root root 16931 Sep 12 08:53 sample_nerc_coarse.json
-rw-r--r-- 1 root root  1343 Sep 12 08:53 sample_nerc_coarse.tsv
-rw-r--r-- 1 root root     0 Sep 12 08:53 scorer.log


Here is an overview of the files created by the scorer:
- `scorer.log` – the log produced by the scorer
- `sample_nerc_coarse.tsv` – a TSV file containing the evaluation results for document `sample` and task `nerc_coarse`, at different levels of aggregation
- `sample_nerc_coarse.json` – a JSON file with a more granular breakdown of the evaluation, which can be useful for error analysis and for better understanding a system's performance

In [None]:
! cat ./data/evaluations/scorer.log

In [None]:
eval_df = pd.read_csv('./data/evaluations/sample_nerc_coarse.tsv', sep='\t')
eval_df.drop(columns=['System'], inplace=True)
eval_df.set_index('Evaluation', inplace=True)

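Let's print the **micro-averaged** precision, recall and F-score in a **strict** evaluation regime. Because the `Label` column is not selected, the three rows returned correspond, in order, to the labels `ALL`, `LOC` and `PER`: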

In [None]:
eval_df.loc['NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL'][['P', 'R', 'F1']]

Evaluation,P,R,F1
NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL,0.632,0.815,0.712
NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL,0.493,0.723,0.586
NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL,0.7,0.852,0.769


Let's print the **micro-averaged** precision, recall and F-score in a **fuzzy** evaluation regime:

In [None]:
# Let's print the micro-averaged precision, recall and F-score in a *fuzzy* evaluation regime
eval_df.loc['NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL'][['P', 'R', 'F1']]

Evaluation,P,R,F1
NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL,0.651,0.84,0.733
NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL,0.493,0.723,0.586
NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL,0.729,0.887,0.8
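
The difference between the two regimes can be illustrated with a simplified sketch. It assumes that *strict* requires identical boundaries and type, while *fuzzy* accepts a prediction of the right type whose span overlaps the gold span; this is an illustration of the idea, not the scorer's actual implementation, and the function name `matches` is hypothetical.

```python
def matches(gold, pred, regime):
    """gold and pred are (start_token, end_token, label) with inclusive token indices."""
    g_start, g_end, g_label = gold
    p_start, p_end, p_label = pred
    if g_label != p_label:
        return False
    if regime == "strict":
        return (g_start, g_end) == (p_start, p_end)
    if regime == "fuzzy":
        return max(g_start, p_start) <= min(g_end, p_end)  # spans share at least one token
    raise ValueError(f"unknown regime: {regime!r}")

gold = (3, 4, "PER")   # e.g. a two-token name
pred = (4, 4, "PER")   # only the second token was tagged
print(matches(gold, pred, "strict"))  # False
print(matches(gold, pred, "fuzzy"))   # True
```

This is why fuzzy scores are never lower than strict ones: every strict match is also a fuzzy match.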


There is more data in the TSV file, as can be seen when printing its whole content:

In [None]:
! cat ./data/evaluations/sample_nerc_coarse.tsv

System	Evaluation	Label	P	R	F1	F1_std	P_std	R_std	TP	FP	FN
	NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL	ALL	0.651	0.84	0.733				136	73	26
	NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL	LOC	0.493	0.723	0.586				34	35	13
	NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL	PER	0.729	0.887	0.8				102	38	13
	NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL	ALL	0.632	0.815	0.712				132	77	30
	NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL	LOC	0.493	0.723	0.586				34	35	13
	NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL	PER	0.7	0.852	0.769				98	42	17
	NE-COARSE-LIT-macro_doc-fuzzy-TIME-ALL-LED-ALL	ALL	0.622	0.832	0.784	0.25	0.396	0.292			
	NE-COARSE-LIT-macro_doc-fuzzy-TIME-ALL-LED-ALL	LOC	0.418	0.734	0.788	0.253	0.439	0.36			
	NE-COARSE-LIT-macro_doc-fuzzy-TIME-ALL-LED-ALL	PER	0.723	0.867	0.909	0.178	0.414	0.305			
	NE-COARSE-LIT-macro_doc-strict-TIME-ALL-LED-ALL	ALL	0.604	0.806	0.761	0.272	0.4	0.309			
	NE-COARSE-LIT-macro_doc-strict-TIME-ALL-LED-ALL	LOC	0.418	0.734	0.788	0.253	0.439	0.36			
	NE-COAR
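
The micro-averaged scores can be recomputed from the `TP`, `FP` and `FN` columns, which is a useful sanity check. The sketch below uses the standard definitions of precision, recall and F1; the function name `precision_recall_f1` is an assumption, and the counts are taken from the `ALL` rows of the TSV above.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from true/false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return round(p, 3), round(r, 3), round(f1, 3)

# Counts from the micro-averaged ALL rows of the TSV above
print(precision_recall_f1(tp=132, fp=77, fn=30))  # strict: (0.632, 0.815, 0.712)
print(precision_recall_f1(tp=136, fp=73, fn=26))  # fuzzy:  (0.651, 0.84, 0.733)
```

The `macro_doc` rows carry standard deviations but no counts, which suggests that they average per-document scores; their F1 therefore cannot be derived from the P and R values in the same row.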

### Running the scorer for NEL Evaluation

In [None]:
for doc in glob.glob('./data/predictions/*'):
  gold = re.sub('predictions','gold',doc)
  ! python clef_evaluation.py --ref "{gold}" --pred "{doc}" --task nel --outdir ./data/evaluations/ --hipe_edition hipe-2020 --log=./data/evaluations/scorer.log


true: 100 pred: 100
data_format_true [[36], [49], [37], [76], [56], [38], [73], [33], [20], [113], [49], [41], [17], [85], [8], [47], [30], [14], [36], [21], [10], [35], [47], [32], [24], [22], [58], [48], [56], [69], [45], [47], [40], [92], [16], [8], [67], [24], [53], [8], [35], [71], [28], [31], [15], [36], [46], [29], [89], [47], [48], [65], [8], [7], [38], [25], [40], [39], [30], [51], [26], [39], [68], [31], [90], [41], [28], [8], [41], [23], [30], [22], [13], [53], [66], [9], [10], [69], [8], [39], [48], [81], [52], [38], [19], [59], [30], [34], [27], [44], [32], [47], [45], [4], [58], [43], [17], [23], [56], [9]]
data_format_pred [[36], [49], [37], [76], [56], [38], [73], [33], [20], [113], [49], [41], [17], [85], [8], [47], [30], [14], [36], [21], [10], [35], [47], [32], [24], [22], [58], [48], [56], [69], [45], [47], [40], [92], [16], [8], [67], [24], [53], [8], [35], [71], [28], [31], [15], [36], [46], [29], [89], [47], [48], [65], [8], [7], [38], [25], [40], [39], [30], [51

In [None]:
eval_df = pd.read_csv('./data/evaluations/sample_nel.tsv', sep='\t')
eval_df.drop(columns=['System'], inplace=True)
eval_df.set_index('Evaluation', inplace=True)

In [None]:
eval_df

Evaluation,Label,P,R,F1,F1_std,P_std,R_std,TP,FP,FN
NEL-LIT-micro-fuzzy-TIME-ALL-LED-ALL-@1,ALL,0.306,0.42,0.354,,,,68.0,154.0,94.0
NEL-LIT-micro-strict-TIME-ALL-LED-ALL-@1,ALL,0.275,0.377,0.318,,,,61.0,161.0,101.0
NEL-LIT-macro_doc-fuzzy-TIME-ALL-LED-ALL-@1,ALL,0.352,0.466,0.429,0.389,0.396,0.418,,,
NEL-LIT-macro_doc-strict-TIME-ALL-LED-ALL-@1,ALL,0.34,0.444,0.412,0.394,0.399,0.42,,,
NEL-METO-micro-fuzzy-TIME-ALL-LED-ALL-@1,ALL,0.0,0.0,0.0,,,,0.0,0.0,0.0
NEL-METO-micro-strict-TIME-ALL-LED-ALL-@1,ALL,0.0,0.0,0.0,,,,0.0,0.0,0.0
NEL-METO-macro_doc-fuzzy-TIME-ALL-LED-ALL-@1,ALL,,,,,,,,,
NEL-METO-macro_doc-strict-TIME-ALL-LED-ALL-@1,ALL,,,,,,,,,


In [None]:
eval_df.loc['NEL-LIT-micro-strict-TIME-ALL-LED-ALL-@1'][['P', 'R', 'F1']]

Unnamed: 0,NEL-LIT-micro-strict-TIME-ALL-LED-ALL-@1
P,0.275
R,0.377
F1,0.318


In [None]:
eval_df.loc['NEL-LIT-micro-fuzzy-TIME-ALL-LED-ALL-@1'][['P', 'R', 'F1']]

Unnamed: 0,NEL-LIT-micro-fuzzy-TIME-ALL-LED-ALL-@1
P,0.306
R,0.42
F1,0.354


## Manual assessment of the output

### Displaying the entities

To allow for manual assessment of output quality, the following cells display the identified entities as color-coded HTML annotations. Such insights into the results can complement quantitative statistics, and they also offer another way to estimate output quality. The latter is especially important in the frequent cases where no gold (or silver or bronze) standard is available against which the NER output can be evaluated. (Note that other tools and modules provide similar visualisations, e.g. [displaCy](https://spacy.io/usage/visualizers#ent) when using spaCy for NER.)

In [None]:
import random

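# Pick one of the parsed documents at random; choosing among the existing keys
# avoids requesting an id that does not exist (e.g. document_0)
doc_id = random.choice(list(documents))

data = pd.DataFrame(documents[doc_id],columns=headings)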
data = get_demo_data(data)
data.head(20)

Unnamed: 0,TOKEN,NE-COARSE-LIT,NE-COARSE-METO,NE-FINE-LIT,NE-FINE-METO,NE-FINE-COMP,NE-NESTED,NEL-LIT,NEL-METO,MISC,start_char,end_char
0,Betrothal,I-PER,-,-,-,-,-,Q1744,-,-,0,9
1,;,O,-,-,-,-,-,-,-,-,10,11
2,a,O,-,-,-,-,-,-,-,-,12,13
3,man,I-PER,-,-,-,-,-,Q6581097,-,-,14,17
4,seen,O,-,-,-,-,-,-,-,-,18,22
5,from,O,-,-,-,-,-,-,-,-,23,27
6,behind,O,-,-,-,-,-,-,-,-,28,34
7,at,O,-,-,-,-,-,-,-,-,35,37
8,right,O,-,-,-,-,-,-,-,-,38,43
9,places,O,-,-,-,-,-,-,-,-,44,50


In [None]:
# Highlighting the identified entities in form of color-coded annotations with links to authority files where available

# Color palette - add more colors if more labels are used or change them here
colors = ['#F0D9EF', '#FCDCE1', '#FFE6BB', '#E9ECCE', '#CDE9DC', '#C4DFE5', '#D9E5F0', '#F0E6D9', '#E0D9F0', '#E6FFF0', '#9CC5DF']

# Labels that should be shown in color; labels not listed here will be shown in grey (this makes it easier to focus on certain categories if needed)
# The labels must match the tag set used in the data (here: PER and LOC)
labels = ["PER", "LOC"]

# Mapping each label from the label set to a color from the palette
label_to_color = {label: colors[i % len(colors)] for i, label in enumerate(labels)}

# Generating the HTML - two changes can be made here:
# 1) by default, the column "NE-COARSE-LIT" is used for the entities, this can be changed via the argument "iob_column"
# 2) the entity identifiers are taken from the column "NEL-LIT"; by default, these are expected to be Wikidata identifiers (e.g. Q1744) and are combined with the Wikidata base URL; for another authority file, the base URL can be changed via the argument "base_url"
res,text,entities = highlight_entities(data)

# displaying the annotations
display(HTML(res))

### Giving feedback on the entities

The following cell gives a very simple example of how manual assessment of entities could be integrated into the data. Here, the user is asked for feedback on each identified entity, which then shows up in a designated column.

In [None]:
# Create a new column for manual assessment
data['manual_assessment'] = ''

# Display results again for better overview (no scrolling back and forth)
display(HTML(res))

# Iterate through the identified entities
for e in entities:
    s, en = int(e["start_char"]), int(e["end_char"])
    etext = text[s:en + 1]
    etype = e.get("label", "Other")

    # Ask for feedback on the entity
    feedback = input(f'Is the entity "{etext.strip()}" with label "{etype}" correct? (y/n/feedback): ')

    # Store the feedback for all tokens within the entity span
    for index, row in data.iterrows():
        token_start = int(row["start_char"])
        token_end = int(row["end_char"])
        if max(s, token_start) <= min(en, token_end):
            data.loc[index, 'manual_assessment'] = feedback

# Show the altered data (now with user assessments)
data

Is the entity "Betrothal" with label "PER" correct? (y/n/feedback): n
Is the entity "man" with label "PER" correct? (y/n/feedback): n
Is the entity "woman" with label "PER" correct? (y/n/feedback): n
Is the entity "Etching" with label "LOC" correct? (y/n/feedback): n


Unnamed: 0,TOKEN,NE-COARSE-LIT,NE-COARSE-METO,NE-FINE-LIT,NE-FINE-METO,NE-FINE-COMP,NE-NESTED,NEL-LIT,NEL-METO,MISC,start_char,end_char,manual_assessment
0,Betrothal,I-PER,-,-,-,-,-,Q1744,-,-,0,9,n
1,;,O,-,-,-,-,-,-,-,-,10,11,
2,a,O,-,-,-,-,-,-,-,-,12,13,
3,man,I-PER,-,-,-,-,-,Q6581097,-,-,14,17,n
4,seen,O,-,-,-,-,-,-,-,-,18,22,
5,from,O,-,-,-,-,-,-,-,-,23,27,
6,behind,O,-,-,-,-,-,-,-,-,28,34,
7,at,O,-,-,-,-,-,-,-,-,35,37,
8,right,O,-,-,-,-,-,-,-,-,38,43,
9,places,O,-,-,-,-,-,-,-,-,44,50,
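
The collected feedback can then be summarised, for instance as the share of entities marked as correct. The sketch below works on a plain list of answers, one per entity, such as the four answers entered above; the helper name `summarise_feedback` is an assumption, and any answer other than `y` or `n` is counted as free-text feedback.

```python
from collections import Counter

def summarise_feedback(answers):
    """Count y / n / free-text answers and return the share of entities marked correct."""
    counts = Counter()
    for answer in answers:
        normalised = answer.strip().lower()
        counts[normalised if normalised in ("y", "n") else "feedback"] += 1
    total = sum(counts.values())
    share_correct = counts["y"] / total if total else 0.0
    return dict(counts), share_correct

# The four answers given in the example run above
print(summarise_feedback(["n", "n", "n", "n"]))  # ({'n': 4}, 0.0)
```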


## LLMs as judges

### Using GPT models (OpenAI API key needed)

Important: The following cells establish access to the OpenAI API, which requires an OpenAI API key. For option 1, the key must be stored in the Google Colab secrets and the notebook must be granted access to that secret. For option 2, the key is pasted directly into the notebook.

In [None]:
# Option 1: use in Colab with API key in Colab Secrets
openai_api = userdata.get('openai_api') # change value in case your secret has a different name

In [None]:
# Option 2: paste in your API key
openai_api = "yourkeyhere"
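
Outside Colab, a third possibility is to read the key from an environment variable, which keeps it out of the notebook file. `OPENAI_API_KEY` is the variable name the OpenAI Python client looks for by default; the helper `get_api_key` below is hypothetical and only a sketch.

```python
import os

def get_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored in the given environment mapping, or raise a clear error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook.")
    return key

# In the notebook this would be used as: openai_api = get_api_key()
print(get_api_key({"OPENAI_API_KEY": "sk-example"}))  # demo with a fake mapping
```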

In [None]:
# Initialise the OpenAI API client
client = OpenAI(api_key=openai_api)

In [None]:
data = [convert_document(documents[x]) for x in documents]
examples_df = pd.DataFrame(data)
examples_df

Unnamed: 0,text,NER,NEL
0,Madonna and child ; the Virgin seated turned t...,"[{'text': 'Madonna', 'label': 'PER'}, {'text':...","{'Madonna': 'Q1744', 'Virgin': 'Q1370', 'Jesus..."
1,Lamentation over the body of Christ ; the Virg...,"[{'text': 'Virgin', 'label': 'PER'}, {'text': ...","{'Christ': 'Q302', 'Virgin': 'Q1370', 'Mary': ..."
2,Plate 10 : Houses . Landscape with a shepherd ...,"[{'text': 'Abraham', 'label': 'PER'}, {'text':...","{'Abraham': 'Q329811', 'Bloemaert': 'Q329811'}"
3,"Two musicians , from a series of six musicians...",[],{}
4,"Two examples of snakes , one red and black , e...","[{'text': 'Merian', 'label': 'PER'}, {'text': ...","{'Merian': 'Q37093060', 'Surinam': 'Q730'}"
...,...,...,...
95,Betrothal ; a man seen from behind at right pl...,"[{'text': 'Betrothal', 'label': 'PER'}, {'text...","{'Betrothal': 'Q157512', 'man': 'Q6581097', 'w..."
96,Four turkeys ; two by a trough and two by a co...,"[{'text': 'Etching', 'label': 'PER'}]",{'Etching': 'Q186986'}
97,Cameo ; amber ; rectangular ; two holes runnin...,"[{'text': 'field', 'label': 'LOC'}]",{'field': 'Q13560407'}
98,"One of 116 drawings in an album of plants , wh...","[{'text': 'Thomas', 'label': 'PER'}, {'text': ...","{'Thomas': 'Q7611100', 'Knowlton': 'Q7611100'}"


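Users can choose both the task they want the LLM to judge ("NER", "NEL", "RE" or "all") and the model they want to use (e.g. "gpt-4o", "gpt-5", "gpt-4o-mini"). Note that "RE" and "all" expect an `RE` entry in each row, which `convert_document` as defined in the preparation section does not produce, so only "NER" and "NEL" work with the data prepared in this notebook. The LLM returns a validity judgement (y/n) and, if it judges the results as not valid, an explanation. These judgements are stored in new columns of the DataFrame. (The prompts are defined in the preparation section and can be adapted there.)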
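
Since the verdict comes back as a free-form string inside a JSON object, it can be worth normalising it before any further analysis. The following is a minimal sketch, assuming the response follows the `Feedback` schema (`valid`, `explanation`); the helper name `parse_verdict` and the set of accepted spellings are assumptions.

```python
import json

def parse_verdict(raw):
    """Parse the judge's JSON answer into (is_valid, explanation)."""
    data = json.loads(raw)
    verdict = str(data.get("valid", "")).strip().lower()
    if verdict in ("y", "yes", "true"):
        is_valid = True
    elif verdict in ("n", "no", "false"):
        is_valid = False
    else:
        is_valid = None  # unexpected answer: leave it for manual inspection
    return is_valid, data.get("explanation") or ""

print(parse_verdict('{"valid": "n", "explanation": "Etching is not a person."}'))
```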

In [None]:
task = "NER"  # choose task to be evaluated
model = "gpt-5"  # choose GPT model to use

# Build dynamic column names
column_valid = f"valid_{task}_{model}"
column_explanation = f"explanation_{task}_{model}"

# Initialize new columns
examples_df[column_valid] = None
examples_df[column_explanation] = None

# Iterate through rows
for index, row in examples_df.iterrows():
    result, usage = generate(row, model, task)
    result_dict = json.loads(result)

    # Store results in the appropriate columns
    examples_df.at[index, column_valid] = result_dict.get("valid")
    examples_df.at[index, column_explanation] = result_dict.get("explanation")

examples_df

| Unnamed: 0 | text | NER | NEL | valid_NER_gpt-4o-mini | explanation_NER_gpt-4o-mini | valid_NEL_gpt-4o-mini | explanation_NEL_gpt-4o-mini | valid_NER_gpt-3.5-turbo | explanation_NER_gpt-3.5-turbo | valid_NER_gpt-5 | explanation_NER_gpt-5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Madonna and child ; the Virgin seated turned t... | [{'text': 'Madonna', 'label': 'PER'}, {'text':... | {'Madonna': 'Q1744', 'Virgin': 'Q1370', 'Jesus... | n | The label 'PER' refers to persons, but 'Virgin... | y | | | | y | |
| 1 | Lamentation over the body of Christ ; the Virg... | [{'text': 'Virgin', 'label': 'PER'}, {'text': ... | {'Christ': 'Q302', 'Virgin': 'Q1370', 'Mary': ... | n | The entity 'Mary Magdalen' should be treated a... | n | The result is not valid because 'Mary' and 'Ma... | | | n | “Mary Magdalen” was incorrectly split into two... |
| 2 | Plate 10 : Houses . Landscape with a shepherd ... | [{'text': 'Abraham', 'label': 'PER'}, {'text':... | {'Abraham': 'Q329811', 'Bloemaert': 'Q329811'} | y | | n | The result incorrectly associates both 'Abraha... | | | n | The person name is split into two entities. It... |
| 3 | Two musicians , from a series of six musicians... | [] | {} | n | The result for named entity recognition is emp... | n | The result is empty, which means no named enti... | | | n | The text contains at least a DATE ("1637") and... |
| 4 | Two examples of snakes , one red and black , e... | [{'text': 'Merian', 'label': 'PER'}, {'text': ... | {'Merian': 'Q37093060', 'Surinam': 'Q730'} | y | | n | The linking of 'Merian' to Q37093060 is approp... | | | y | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Betrothal ; a man seen from behind at right pl... | [{'text': 'Betrothal', 'label': 'PER'}, {'text... | {'Betrothal': 'Q157512', 'man': 'Q6581097', 'w... | n | The results incorrectly classify 'Betrothal', ... | n | The entities 'man' and 'woman' are too generic... | | | n | Incorrect labels: ‘Betrothal’ is an artwork ti... |
| 96 | Four turkeys ; two by a trough and two by a co... | [{'text': 'Etching', 'label': 'PER'}] | {'Etching': 'Q186986'} | n | The label 'PER' (Person) is incorrectly assign... | y | | | | n | 'Etching' is not a person; it’s a common noun/... |
| 97 | Cameo ; amber ; rectangular ; two holes runnin... | [{'text': 'field', 'label': 'LOC'}] | {'field': 'Q13560407'} | n | The term 'field' does not refer to a location ... | n | The result Q13560407 corresponds to a specific... | | | n | 'field' here refers to the background of the e... |
| 98 | One of 116 drawings in an album of plants , wh... | [{'text': 'Thomas', 'label': 'PER'}, {'text': ... | {'Thomas': 'Q7611100', 'Knowlton': 'Q7611100'} | y | | n | The result incorrectly assigns the same entity... | | | n | The person name should be a single span “Thoma... |
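
The loop above assumes that every reply from the model is valid JSON containing the keys `valid` and `explanation`. If a model occasionally returns malformed output, `json.loads` raises an exception and the run stops. The following is a minimal sketch of a more defensive parser; `parse_judgement` is a hypothetical helper that is not part of this recipe, and returning `None` plus an error note as fallback is an assumption about what is convenient for later inspection.

```python
import json


def parse_judgement(raw):
    """Return (valid, explanation) from a judge reply.

    Hypothetical helper: falls back to (None, <error note>) when the
    reply is not a JSON object, so one bad reply does not abort the loop.
    """
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None, f"unparsable reply: {raw!r}"
    if not isinstance(parsed, dict):
        return None, f"unexpected reply type: {type(parsed).__name__}"
    return parsed.get("valid"), parsed.get("explanation")
```

Inside the loop, `result_dict = json.loads(result)` could then be replaced by `valid, explanation = parse_judgement(result)`, with both values written to the corresponding columns. Rows with an empty `valid` value can afterwards be re-run or checked by hand.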


### Open a sample of results from the LLM-as-judge approach

In [7]:
examples_df = pd.read_csv('data/evaluations/LLM_judge_exemplary_results.csv')

examples_df.head()

| Unnamed: 0 | text | NER | NEL | valid_NER_gpt-4o-mini | explanation_NER_gpt-4o-mini | valid_NEL_gpt-4o-mini | explanation_NEL_gpt-4o-mini | valid_NER_gpt-3.5-turbo | explanation_NER_gpt-3.5-turbo | valid_NER_gpt-5 | explanation_NER_gpt-5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Madonna and child ; the Virgin seated turned t... | [{'text': 'Madonna', 'label': 'PER'}, {'text':... | {'Madonna': 'Q1744', 'Virgin': 'Q1370', 'Jesus... | n | The label 'PER' refers to persons, but 'Virgin... | y | | | | y | |
| 1 | Lamentation over the body of Christ ; the Virg... | [{'text': 'Virgin', 'label': 'PER'}, {'text': ... | {'Christ': 'Q302', 'Virgin': 'Q1370', 'Mary': ... | n | The entity 'Mary Magdalen' should be treated a... | n | The result is not valid because 'Mary' and 'Ma... | | | n | “Mary Magdalen” was incorrectly split into two... |
| 2 | Plate 10 : Houses . Landscape with a shepherd ... | [{'text': 'Abraham', 'label': 'PER'}, {'text':... | {'Abraham': 'Q329811', 'Bloemaert': 'Q329811'} | y | | n | The result incorrectly associates both 'Abraha... | | | n | The person name is split into two entities. It... |
| 3 | Two musicians , from a series of six musicians... | [] | {} | n | The result for named entity recognition is emp... | n | The result is empty, which means no named enti... | | | n | The text contains at least a DATE ("1637") and... |
| 4 | Two examples of snakes , one red and black , e... | [{'text': 'Merian', 'label': 'PER'}, {'text': ... | {'Merian': 'Q37093060', 'Surinam': 'Q730'} | y | | n | The linking of 'Merian' to Q37093060 is approp... | | | y | |
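
Since the sample holds judgements from more than one model, it can be useful to check how often the judges agree before trusting any single one. The sketch below is an illustration only: it uses a small stand-in DataFrame whose values are copied from the five rows shown above, and the variable names (`sample`, `invalid_share`, `agreement`, `confusion`) are assumptions rather than part of the recipe.

```python
import pandas as pd

# Judgements copied from the five sample rows shown above.
sample = pd.DataFrame({
    "valid_NER_gpt-4o-mini": ["n", "n", "y", "n", "y"],
    "valid_NER_gpt-5": ["y", "n", "n", "n", "y"],
})

# Share of rows that each judge flagged as not valid ("n").
invalid_share = (sample == "n").mean()

# Share of rows on which the two judges give the same verdict.
agreement = (sample["valid_NER_gpt-4o-mini"] == sample["valid_NER_gpt-5"]).mean()

# Cross-tabulation of the two judges' verdicts.
confusion = pd.crosstab(sample["valid_NER_gpt-4o-mini"], sample["valid_NER_gpt-5"])

print(invalid_share)
print(f"agreement: {agreement:.2f}")
print(confusion)
```

On the full file, the same expressions can be applied to `examples_df` directly. Rows on which the judges disagree are good candidates for manual inspection.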


### Using smaller local models

In [None]:
TBA

### Displaying sentences judged as invalid by LLMs

In [None]:
# Highlight the identified entities as color-coded annotations, with links to authority files where available

# Color palette - add more colors if more labels are used or change them here
colors = ['#F0D9EF', '#FCDCE1', '#FFE6BB', '#E9ECCE', '#CDE9DC', '#C4DFE5', '#D9E5F0', '#F0E6D9', '#E0D9F0', '#E6FFF0', '#9CC5DF']

# Labels to show in color; labels not listed here are shown in grey (this makes it easier to focus on certain categories if needed)
labels = ["PERSON", "DATE"]

# Mapping each label from the label set to a color from the palette
label_to_color = {label: colors[i % len(colors)] for i, label in enumerate(labels)}

# Generate the HTML - two things can be changed here:
# 1) by default, the column "NE-COARSE-LIT" is used for the entities; this can be changed via the argument "iob_column"
# 2) the entity identifiers are taken from the column "NEL-LIT"; by default, these are expected to be Wikidata identifiers (e.g. Q1744) and are combined with the Wikidata base URL; for another authority file, the base URL can be changed via the argument "base_url"
res,text,entities = highlight_entities(data)

# displaying the annotations
display(HTML(res))
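
To restrict the inspection to the sentences that a judge flagged, the judged DataFrame can be filtered on the validity column first. The following is a minimal sketch under stated assumptions: the small `examples_df` is a stand-in built from three of the sample rows shown earlier (texts truncated as in the output), and `invalid_df` is a hypothetical name. How the filtered rows are passed on to `highlight_entities` depends on the input format that function expects and is not shown here.

```python
import pandas as pd

# Stand-in for examples_df, using three of the sample rows shown earlier.
examples_df = pd.DataFrame({
    "text": [
        "Madonna and child ; the Virgin seated turned t...",
        "Lamentation over the body of Christ ; the Virg...",
        "Two examples of snakes , one red and black , e...",
    ],
    "valid_NER_gpt-5": ["y", "n", "y"],
    "explanation_NER_gpt-5": [
        None,
        "“Mary Magdalen” was incorrectly split into two...",
        None,
    ],
})

task = "NER"     # task whose judgements should be inspected
model = "gpt-5"  # judge model whose verdicts should be used

# Same column naming scheme as in the judging cell.
column_valid = f"valid_{task}_{model}"
column_explanation = f"explanation_{task}_{model}"

# Keep only the rows judged as not valid, with the judge's explanation.
invalid_df = examples_df.loc[
    examples_df[column_valid] == "n", ["text", column_explanation]
]

print(invalid_df)
```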

## Variations and alternatives

...