## Evaluate an LLM recognizer using the Presidio Evaluator framework

This notebook demonstrates how to evaluate a Presidio instance using the presidio-evaluator framework. It builds upon [example 4](4_Evaluate_Presidio_Analyzer.ipynb), with changes to the `AnalyzerEngine` instance: the recognizer registry holds a single LLM-based recognizer (`DspyRecognizer`). For more information on customizing the Presidio Analyzer, see the [Presidio Analyzer documentation](https://microsoft.github.io/presidio/analyzer/) or this [tutorial](https://microsoft.github.io/presidio/tutorial/).

Steps:
1. Load dataset from file
2. Simple dataset statistics
3. Define the AnalyzerEngine object (and its parameters)
4. Align the dataset's entities to Presidio's entities
5. Set up the Evaluator object
6. Run experiment
7. Evaluate results
8. Error analysis

In [7]:
# install presidio evaluator via pip if not yet installed

#!pip install presidio-evaluator
#!pip install "presidio-analyzer[transformers]"

In [1]:
from pathlib import Path
from copy import deepcopy
from pprint import pprint
from collections import Counter
from typing import Dict, List
import json
import warnings

warnings.filterwarnings("ignore")

from presidio_evaluator import InputSample
from presidio_evaluator.evaluation import Evaluator, ModelError, Plotter
from presidio_evaluator.models import PresidioAnalyzerWrapper
from presidio_evaluator.experiment_tracking import get_experiment_tracker

import pandas as pd

import dspy

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

%reload_ext autoreload
%autoreload 2
%matplotlib inline

stanza and spacy_stanza are not installed


## 1. Load dataset from file

In [29]:
dataset_name = "synth_dataset_v2.json"
dataset = InputSample.read_dataset_json(Path(Path.cwd().parent, "data", dataset_name))

dataset = dataset[1200:1300]


print(f"Loaded {len(dataset)} samples")

tokenizing input: 100%|██████████| 1500/1500 [00:06<00:00, 229.85it/s]

300





This dataset was generated automatically. For details, see [Synthetic data generation](1_Generate_data.ipynb).

In [30]:
def get_entity_counts(dataset: List[InputSample]) -> Dict:
    """Return a dictionary with counter per entity type."""
    entity_counter = Counter()
    for sample in dataset:
        for tag in sample.tags:
            entity_counter[tag] += 1
    return entity_counter

## 2. Simple dataset statistics

In [31]:
entity_counts = get_entity_counts(dataset)
print("Count per entity:")
pprint(entity_counts.most_common(), compact=True)

print(
    "\nMin and max number of tokens in dataset: "
    f"Min: {min(len(sample.tokens) for sample in dataset)}, "
    f"Max: {max(len(sample.tokens) for sample in dataset)}"
)

print(
    "Min and max sentence length in dataset: "
    f"Min: {min(len(sample.full_text) for sample in dataset)}, "
    f"Max: {max(len(sample.full_text) for sample in dataset)}"
)

Count per entity:
[('O', 4302), ('STREET_ADDRESS', 744), ('PERSON', 317), ('ORGANIZATION', 137),
 ('GPE', 104), ('PHONE_NUMBER', 72), ('TITLE', 35), ('DATE_TIME', 31),
 ('AGE', 19), ('CREDIT_CARD', 17), ('NRP', 14), ('DOMAIN_NAME', 12),
 ('IP_ADDRESS', 11), ('EMAIL_ADDRESS', 9), ('ZIP_CODE', 4), ('IBAN_CODE', 1)]
Min and max number of tokens in dataset: Min: 3, Max: 78
Min and max sentence length in dataset: Min: 9, Max: 406


## 3. Define the AnalyzerEngine object 
In this case, we customize the AnalyzerEngine with an explicitly configured spaCy NLP engine and a recognizer registry that contains only the LLM-based recognizer.

### 3.1 Set up the NlpEngine
The NLP engine is in charge of text processing using spaCy. In this notebook it loads the small English spaCy model `en_core_web_sm` rather than a transformers model.

In [32]:
from presidio_analyzer.nlp_engine import SpacyNlpEngine

# Define which model to use
model_config = [
    {
        "lang_code": "en",
        "model_name": "en_core_web_sm"
    }
]
nlp_engine = SpacyNlpEngine(
    models=model_config,
)
nlp_engine.load()

### 3.2 Set up the relevant recognizers
Add and remove recognizers to fit the dataset at hand.
Here we start from an empty `RecognizerRegistry` and add a single recognizer, the `DspyRecognizer`,
so that every detection in this experiment comes from the LLM-based recognizer.

In [33]:
from presidio_evaluator.models.dspy_recognizer import DspyRecognizer
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry


registry = RecognizerRegistry()

registry.add_recognizer(DspyRecognizer())



### 3.3 Create the AnalyzerEngine object

In [34]:
# Set up the engine, loads the NLP module (spaCy model by default)
# and other PII recognizers
analyzer_engine = AnalyzerEngine(
    nlp_engine=nlp_engine,
    registry=registry,
    default_score_threshold=0.3,
)

pprint("Supported entities for English:")
pprint(analyzer_engine.get_supported_entities("en"), compact=True)

print("\nLoaded recognizers for English:")
pprint(
    [
        rec.name
        for rec in analyzer_engine.registry.get_recognizers("en", all_fields=True)
    ],
    compact=True,
)

print("\nLoaded Context Aware Enhancer:")
print(analyzer_engine.context_aware_enhancer.__class__.__name__)
pprint(json.dumps(analyzer_engine.context_aware_enhancer.__dict__), compact=True)


print("\nLoaded NER models:")
pprint(analyzer_engine.nlp_engine.models)

'Supported entities for English:'
['PHONE_NUMBER', 'DATE_TIME', 'ORGANIZATION', 'IBAN', 'ID', 'GPE', 'PERSON',
 'NRP', 'ADDRESS', 'EMAIL_ADDRESS', 'CREDIT_CARD', 'AGE', 'BANK_ACCOUNT',
 'LOCATION', 'SSN', 'IP_ADDRESS', 'NATIONALITY', 'DOMAIN_NAME',
 'DRIVER_LICENSE', 'CRYPTO']
Loaded recognizers for English:
['DspyRecognizer']
Loaded Context Aware Enhancer:
LemmaContextAwareEnhancer
('{"context_similarity_factor": 0.35, "min_score_with_context_similarity": '
 '0.4, "context_prefix_count": 5, "context_suffix_count": 0}')
Loaded NER models:
[{'lang_code': 'en', 'model_name': 'en_core_web_sm'}]


In [51]:
# Test Analyzer
#text = "Yesterday in Mt. Sinai AP: Dana Silver, 79 years old female was complaining of stomach pain. Her ID is 154555"
text = """Please return to 35108 ul. Zamkowa 33 Apt. 406
Beerze
, OV
 Netherlands 02032 in case of an issue."""
res = analyzer_engine.analyze(text=text, language="en", return_decision_process=True)
print(text)
for result in res:
    print(
        f"\nEntity: {result.entity_type}, Text: {text[result.start:result.end]}"  # \n\nAnalysis explanation:"
    )
    #pprint(result.analysis_explanation)

Please return to 35108 ul. Zamkowa 33 Apt. 406
Beerze
, OV
 Netherlands 02032 in case of an issue.
Entity: STREET_ADDRESS, Text: 35108 ul. Zamkowa 33 Apt. 406
Entity: GPE, Text: Beerze
Entity: GPE, Text: OV
Entity: GPE, Text: Netherlands
Entity: ZIP_CODE, Text: 02032


## 4. Align the dataset's entities to Presidio's entities

The entity names used in the dataset may differ from the entity names Presidio can detect.
For example, a dataset might label a name as PER while Presidio returns PERSON. To compare predicted values with the annotated ones and gather metrics, the entity names must be aligned. Consider changing the mapping if your dataset and/or Presidio instance supports different entity types.
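As a rough illustration of what alignment means, the sketch below maps dataset tags to Presidio entity names with a plain dictionary. It is not the implementation of `Evaluator.align_entity_types`. The names `example_mapping` and `align_tags` are hypothetical, and the behavior chosen for unmapped tags (raise an error, or keep the tag when `allow_missing` is set) is an assumption made for this example only.

```python
from typing import Dict, List

# Hypothetical mapping from dataset labels to Presidio entity names.
example_mapping: Dict[str, str] = {
    "PER": "PERSON",
    "STREET_ADDRESS": "LOCATION",
    "GPE": "LOCATION",
    "O": "O",
}


def align_tags(
    tags: List[str], mapping: Dict[str, str], allow_missing: bool = False
) -> List[str]:
    """Translate each tag through the mapping.

    Unmapped tags raise an error unless allow_missing is True,
    in which case they are kept unchanged (an assumption of this sketch).
    """
    aligned = []
    for tag in tags:
        if tag in mapping:
            aligned.append(mapping[tag])
        elif allow_missing:
            aligned.append(tag)
        else:
            raise ValueError(f"No mapping found for tag {tag!r}")
    return aligned


aligned_tags = align_tags(["PER", "O", "GPE"], example_mapping)
print(aligned_tags)
```

After alignment, the annotated tags and the predicted tags share one vocabulary, so a token annotated as `PER` and predicted as `PERSON` counts as a match rather than an error.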

In [None]:
entities_mapping = PresidioAnalyzerWrapper.presidio_entities_map  # default mapping
entities_mapping["STREET_ADDRESS"] = "LOCATION"

print("Entities mapping:")
pprint(entities_mapping)

dataset = Evaluator.align_entity_types(
    dataset, entities_mapping=entities_mapping, allow_missing_mappings=False
)
new_entity_counts = get_entity_counts(dataset)
print("\nCount per entity after alignment:")
pprint(new_entity_counts.most_common(), compact=True)

dataset_entities = list(new_entity_counts.keys())

Entities mapping:
{'ADDRESS': 'LOCATION',
 'AGE': 'AGE',
 'BIRTHDAY': 'DATE_TIME',
 'CITY': 'LOCATION',
 'CREDIT_CARD': 'CREDIT_CARD',
 'CREDIT_CARD_NUMBER': 'CREDIT_CARD',
 'DATE': 'DATE_TIME',
 'DATE_OF_BIRTH': 'DATE_TIME',
 'DATE_TIME': 'DATE_TIME',
 'DOB': 'DATE_TIME',
 'DOMAIN': 'URL',
 'DOMAIN_NAME': 'URL',
 'EMAIL': 'EMAIL_ADDRESS',
 'EMAIL_ADDRESS': 'EMAIL_ADDRESS',
 'FACILITY': 'LOCATION',
 'FIRST_NAME': 'PERSON',
 'GPE': 'LOCATION',
 'HCW': 'PERSON',
 'HOSP': 'ORGANIZATION',
 'HOSPITAL': 'ORGANIZATION',
 'IBAN': 'IBAN_CODE',
 'IBAN_CODE': 'IBAN_CODE',
 'ID': 'ID',
 'IP_ADDRESS': 'IP_ADDRESS',
 'LAST_NAME': 'PERSON',
 'LOC': 'LOCATION',
 'LOCATION': 'LOCATION',
 'NAME': 'PERSON',
 'NATIONALITY': 'NRP',
 'NORP': 'NRP',
 'NRP': 'NRP',
 'O': 'O',
 'ORG': 'ORGANIZATION',
 'ORGANIZATION': 'ORGANIZATION',
 'PATIENT': 'PERSON',
 'PATORG': 'ORGANIZATION',
 'PER': 'PERSON',
 'PERSON': 'PERSON',
 'PHONE': 'PHONE_NUMBER',
 'PHONE_NUMBER': 'PHONE_NUMBER',
 'PREFIX': 'TITLE',
 'SSN': 'US_S

## 5. Set up the Evaluator object

In [37]:
# Set up the experiment tracker to log the experiment for reproducibility
experiment = get_experiment_tracker()

# Create the evaluator object
evaluator = Evaluator(model=analyzer_engine)


# Track model and dataset params
params = {"dataset_name": dataset_name, "model_name": evaluator.model.name}
params.update(evaluator.model.to_log())
experiment.log_parameters(params)
experiment.log_dataset_hash(dataset)
experiment.log_parameter("entity_mappings", json.dumps(entities_mapping))

--------
Entities supported by this Presidio Analyzer instance:
PHONE_NUMBER, DATE_TIME, ORGANIZATION, IBAN, ID, GPE, PERSON, NRP, ADDRESS, EMAIL_ADDRESS, CREDIT_CARD, AGE, BANK_ACCOUNT, LOCATION, SSN, IP_ADDRESS, NATIONALITY, DOMAIN_NAME, DRIVER_LICENSE, CRYPTO


## 6. Run experiment

In [38]:
%%time

## Run experiment

evaluation_results = evaluator.evaluate_all(dataset)
results = evaluator.calculate_score(evaluation_results)

# Track experiment results
experiment.log_metrics(results.to_log())
entities, confmatrix = results.to_confusion_matrix()
experiment.log_confusion_matrix(matrix=confmatrix, labels=entities)

# end experiment
experiment.end()

# Note that the experiment params and metrics are saved locally

Running model PresidioAnalyzerWrapper on dataset...
Finished running model on dataset
saving experiment data to experiment_20250123-080711.json
CPU times: user 7.79 s, sys: 551 ms, total: 8.34 s
Wall time: 10h 15min 27s


## 7. Evaluate results

In [39]:
# Plot output
plotter = Plotter(
    results=results,
    model_name=evaluator.model.name,
    # save_as="png",
    beta=2,
)

plotter.plot_scores()

In [40]:
pprint(
    {
        "PII F": results.pii_f,
        "PII recall": results.pii_recall,
        "PII precision": results.pii_precision,
    }
)

{'PII F': 0.6545102184637068,
 'PII precision': 0.9477040816326531,
 'PII recall': 0.6075224856909239}
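The reported `PII F` value is consistent with an F-beta score computed from the PII precision and recall with `beta=2`, the value passed to the `Plotter`. With `beta=2`, recall is weighted more heavily than precision. The sketch below recomputes the score from the two numbers printed above; `f_beta` is a hypothetical helper written for this example, not a function of presidio-evaluator.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing was detected correctly
    beta_sq = beta**2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Values copied from the output above.
precision = 0.9477040816326531
recall = 0.6075224856909239

score = f_beta(precision, recall, beta=2)
print(round(score, 4))
```

Because recall dominates at `beta=2`, the high precision (about 0.95) does little to offset the lower recall (about 0.61), and the combined score stays close to the recall value.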


## 8. Error analysis

Now let's look into results to understand what's behind the metrics we're getting.
Note that evaluation is never perfect. Some things to consider:

1. There's often a mismatch between the annotated span and the predicted span, which isn't necessarily a mistake. For example: `<Southern France>` compared with `Southern <France>`. In the second text, the word `Southern` was not annotated/predicted as part of the entity, but that's not necessarily an error.
2. Token based evaluation (which is used here) counts the number of true positive / false positive / false negative tokens. Some entities might be broken into more tokens than others. For example, the phone number `222-444-1234` could be broken into five different tokens, whereas `Krishna` would be broken into one token, resulting in phone numbers having more influence on metrics than names.
3. The synthetic dataset used here isn't representative of a real dataset. Consider using more realistic datasets for evaluation.
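The first two points can be made concrete with a minimal sketch of token-level counting. It is an illustration only, not the logic used by the `Evaluator`: `token_level_counts` is a hypothetical helper, the tokenization of the phone number into five tokens is assumed, and tokens tagged with the wrong entity type are counted in a separate `mislabeled` bucket as a simplification.

```python
from collections import Counter
from typing import List


def token_level_counts(annotated: List[str], predicted: List[str]) -> Counter:
    """Count token-level outcomes for one sample.

    Both lists hold one tag per token; "O" means "not an entity".
    """
    if len(annotated) != len(predicted):
        raise ValueError("annotated and predicted must have the same length")
    counts: Counter = Counter()
    for gold, pred in zip(annotated, predicted):
        if gold == "O" and pred == "O":
            continue  # true negative, not counted here
        if gold == pred:
            counts["tp"] += 1
        elif gold == "O":
            counts["fp"] += 1
        elif pred == "O":
            counts["fn"] += 1
        else:
            counts["mislabeled"] += 1  # entity found, but with another type
    return counts


# Point 1: <Southern France> annotated, Southern <France> predicted.
span_counts = token_level_counts(["LOCATION", "LOCATION"], ["O", "LOCATION"])

# Point 2: a missed phone number split into five tokens (assumed tokenization)
# versus a missed single-token name.
phone_counts = token_level_counts(["PHONE_NUMBER"] * 5, ["O"] * 5)
name_counts = token_level_counts(["PERSON"], ["O"])

print(span_counts, phone_counts, name_counts)
```

In this sketch the partially matched span yields one true positive and one false negative, and the missed phone number contributes five false negatives where the missed name contributes one.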

In [41]:
plotter.plot_confusion_matrix(entities=entities, confmatrix=confmatrix)
# plotter.plot_confusion_matrix(entities=entities, confmatrix=confmatrix, save_as="png")

In [42]:
plotter.plot_most_common_tokens()

### 8a. False positives
#### Most common false positive tokens:

In [43]:
ModelError.most_common_fp_tokens(results.model_errors)

Most common false positive tokens:
[('Sweden', 6),
 ('Cyprus', 5),
 ('University', 4),
 ('ESPOO', 4),
 ('Czech', 4),
 ('Republic', 4),
 ('Spain', 4),
 ('AA', 3),
 ('Austria', 3),
 ('Italy', 3)]
---------------
Example sentence with each FP token:
	- Dataweave

Fuglie 41 Popović Dam
 Suite 860
 UPPLANDS VÄSBY
 Sweden 56687 (`Sweden` pred as GPE)
	- Please return to 94941 2505 Heatherleigh Suite 620 Apt. 581, ΠΑΦΟΣ, Cyprus (Greek) 13475 in case of an issue. (`Cyprus` pred as GPE)
	- Walker began writing as a teenager, publishing her first story, "The Dimensions of a Shadow", in 1988 while studying English and journalism at the University of KVALØYSLETTA. (`University` pred as ORGANIZATION)
	- Lopez, Santos and Coleman is a design agency based in ESPOO. (`ESPOO` pred as GPE)
	- How do I change the address linked to my credit card to 2046 50 Point Walter Road Suite 067 Apt. 162, Chomutice u Horic v Podkrkonoší, Czech Republic 33156? (`Czech` pred as GPE)
	- How do I change the address link

#### More FP analysis

In [44]:
fps_df = ModelError.get_fps_dataframe(results.model_errors, entity=["PERSON"])
fps_df[["full_text", "token", "annotation", "prediction"]].head(20)

| | full_text | token | annotation | prediction |
|---|---|---|---|---|
| 0 | Dr. Burke is a 53 year old man who grew up in KNIVSTA. | Dr. | O | PERSON |
| 1 | For my take on Ms. Seleznyova, see Guilty Pleasures: 5 Musicians Of The 70s You're Supposed To Hate (But Secretly Love) | Ms. | O | PERSON |
| 2 | My name appears incorrectly on credit card statement could you please correct it to Dr. Justin Beet? | Dr. | O | PERSON |
| 3 | card number 4933870304038678414 is lost, can you please send a new one to USNS Sekerková\nFPO AA 65728? I am in Akureyri for a business trip | USNS | LOCATION | PERSON |
| 4 | card number 4933870304038678414 is lost, can you please send a new one to USNS Sekerková\nFPO AA 65728? I am in Akureyri for a business trip | Sekerková | LOCATION | PERSON |
| 5 | card number 4933870304038678414 is lost, can you please send a new one to USNS Sekerková\nFPO AA 65728? I am in Akureyri for a business trip | USNS | LOCATION | PERSON |
| 6 | card number 4933870304038678414 is lost, can you please send a new one to USNS Sekerková\nFPO AA 65728? I am in Akureyri for a business trip | Sekerková | LOCATION | PERSON |


### 8b. False negatives (FN)

#### Most common false negative examples + a few samples with FN

In [48]:
ModelError.most_common_fn_tokens(results.model_errors, n=50)

Most common false negative tokens:
[('ul', 6),
 ('AA', 6),
 ('Cyprus', 6),
 ('Czech', 6),
 ('Sweden', 6),
 ('Greek', 5),
 ('PSC', 5),
 ('APO', 5),
 ('FPO', 4),
 ('AE', 4),
 ('Road', 4),
 ('ESPOO', 4),
 ('Republic', 4),
 ('Spain', 4),
 ('Rua', 3),
 ('Austria', 3),
 ('Italy', 3),
 ('Hungary', 3),
 ('Estonia', 3),
 ('Netherlands', 3),
 ('del', 2),
 ('94', 2),
 ('Nicosia', 2),
 ('Josef', 2),
 ('Roma', 2),
 ('131', 2),
 ('Thursday', 2),
 ('41', 2),
 ('74', 2),
 ('BUNDALAGUAH', 2),
 ('VIC', 2),
 ('204', 2),
 ('22', 2),
 ('83', 2),
 ('Swedish', 2),
 ('DPO', 2),
 ('American', 2),
 ('Poland', 2),
 ('33', 2),
 ('Drive', 2),
 ('New', 2),
 ('Zealand', 2),
 ('55', 2),
 ('Thaddeus', 2),
 ('Brazil', 2),
 ('Greenlander', 2),
 ('Finland', 2),
 ('South', 2),
 ('Africa', 2),
 ('Nuova', 1)]
---------------
Example sentence with each FN token:
	- >Oline Mikaelsen
>Roadify Transit
>Oline Mikaelsen
>772 ul. Narewska 94
>Apt. 478
>Poznań
>Poland 35403 (`ul` annotated as LOCATION)
	- What is your address? it i

#### More FN analysis

In [46]:
fns_df = ModelError.get_fns_dataframe(results.model_errors, entity=["PERSON"])

In [47]:
fns_df[["full_text", "token", "annotation", "prediction"]].head(20)

| | full_text | token | annotation | prediction |
|---|---|---|---|---|
| 0 | >Oline Mikaelsen\n>Roadify Transit\n>Oline Mikaelsen\n>772 ul. Narewska 94\n>Apt. 478\n>Poznań\n>Poland 35403 | Oline | PERSON | O |
| 1 | >Oline Mikaelsen\n>Roadify Transit\n>Oline Mikaelsen\n>772 ul. Narewska 94\n>Apt. 478\n>Poznań\n>Poland 35403 | Mikaelsen | PERSON | O |
| 2 | When they weren't singing about Hobbits, satanic felines and interstellar journeys, they were singing about the verses from Terence Vassiliev's Cautionary Tales. Is there a better example of unbridled creativity than early Vassiliev? | Vassiliev | PERSON | O |
| 3 | When they weren't singing about Hobbits, satanic felines and interstellar journeys, they were singing about the verses from Alice Goldberg's Cautionary Tales. Is there a better example of unbridled creativity than early Goldberg? | Goldberg | PERSON | O |
| 4 | Josef had given Josef his address: 6138 Via Roma 131 | Josef | PERSON | O |
| 5 | Josef had given Josef his address: 6138 Via Roma 131 | Josef | PERSON | O |
| 6 | Walker began writing as a teenager, publishing her first story, "The Dimensions of a Shadow", in 1988 while studying English and journalism at the University of KVALØYSLETTA. | Walker | PERSON | O |
| 7 | Leverett: What a wife.\nDanielle: Remember me, Alexander? When I killed your brother, I talked just like this!\nJonathan: You saved my life! How can I ever repay you? | Leverett | PERSON | O |
| 8 | Gijsbrecht: "Who are you?"\nAshley:"I'm Melvin's daughter". | Gijsbrecht | PERSON | O |
| 9 | Adomo began writing as a teenager, publishing her first story, "The Dimensions of a Shadow", in 2001 while studying English and journalism at the University of BASHALL TOWN. | Adomo | PERSON | O |
