## Evaluate a custom Presidio Analyzer using the Presidio Evaluator framework

This notebook demonstrates how to evaluate a Presidio instance using the presidio-evaluator framework. It builds upon [example 4](4_Evaluate_Presidio_Analyzer.ipynb), with changes to the `PresidioAnalyzer` instance to improve detection accuracy. For more information on customizing the Presidio Analyzer, see the [Presidio Analyzer documentation](https://microsoft.github.io/presidio/analyzer/) or this [tutorial](https://microsoft.github.io/presidio/tutorial/).

Steps:
1. Load dataset from file
2. Simple dataset statistics
3. Define the AnalyzerEngine object (and its parameters)
4. Align the dataset's entities to Presidio's entities
5. Set up the Evaluator object
6. Run experiment
7. Evaluate results
8. Error analysis

In [2]:
# install presidio evaluator via pip if not yet installed

# %pip install presidio-evaluator
# %pip install "presidio-analyzer[transformers]"

In [3]:
from pathlib import Path
from pprint import pprint
from collections import Counter
from typing import Dict, List
import json
import warnings
warnings.filterwarnings('ignore')

from presidio_evaluator import InputSample
from presidio_evaluator.evaluation import ModelError, Plotter, SpanEvaluator
from presidio_evaluator.models import PresidioAnalyzerWrapper
from presidio_evaluator.experiment_tracking import get_experiment_tracker

import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

%reload_ext autoreload
%autoreload 2
%matplotlib inline

stanza and spacy_stanza are not installed
Flair is not installed by default


## 1. Load dataset from file

In [3]:
dataset_name = "synth_dataset.json"
dataset = InputSample.read_dataset_json(Path(Path.cwd().parent, "sample_data", dataset_name))

print(len(dataset))

tokenizing input:   0%|          | 0/1500 [00:00<?, ?it/s]

loading model en_core_web_sm


tokenizing input: 100%|██████████| 1500/1500 [00:05<00:00, 290.42it/s]

1500





This dataset was generated automatically. For details, see [Synthetic data generation](1_Generate_data.ipynb).

## 2. Simple dataset statistics

In [4]:
def get_entity_counts(dataset: List[InputSample]) -> Dict:
    """Return a dictionary with counter per entity type."""
    entity_counter = Counter()
    for sample in dataset:
        for tag in sample.tags:
            entity_counter[tag] += 1
    return entity_counter


In [5]:
entity_counts = get_entity_counts(dataset)
print("Count per entity:")
pprint(entity_counts.most_common(), compact=True)

token_counts = [len(sample.tokens) for sample in dataset]
text_lengths = [len(sample.full_text) for sample in dataset]

print(
    "\nMin and max number of tokens in dataset: "
    f"Min: {min(token_counts)}, Max: {max(token_counts)}"
)
print(
    "Min and max sentence length in dataset: "
    f"Min: {min(text_lengths)}, Max: {max(text_lengths)}"
)

Count per entity:
[('O', 19307), ('STREET_ADDRESS', 2906), ('PERSON', 1470),
 ('ORGANIZATION', 1118), ('GPE', 495), ('PHONE_NUMBER', 470),
 ('DATE_TIME', 217), ('CREDIT_CARD', 134), ('TITLE', 91), ('AGE', 77),
 ('NRP', 66), ('US_SSN', 50), ('ZIP_CODE', 45), ('EMAIL_ADDRESS', 44),
 ('IBAN_CODE', 29), ('DOMAIN_NAME', 29), ('IP_ADDRESS', 8),
 ('US_DRIVER_LICENSE', 5)]

Min and max number of tokens in dataset: Min: 2, Max: 86
Min and max sentence length in dataset: Min: 9, Max: 447


## 3. Define the AnalyzerEngine object (and its parameters)

In [None]:
from models_config import analyzer as analyzer_engine

Fetching 21 files: 100%|██████████| 21/21 [00:00<00:00, 158133.54it/s]

Device set to use cpu
Device set to use cpu


In [7]:
pprint("Supported entities for English:")
pprint(analyzer_engine.get_supported_entities("en"), compact=True)

print("\nLoaded recognizers for English:")
pprint([rec.name for rec in analyzer_engine.registry.get_recognizers("en", all_fields=True)], compact=True)

print("\nLoaded Context Aware Enhancer:")
print(analyzer_engine.context_aware_enhancer.__class__.__name__)
pprint(json.dumps(analyzer_engine.context_aware_enhancer.__dict__), compact=True)


print("\nLoaded NER Models:")
pprint(analyzer_engine.nlp_engine.models)

print("\ndefault score threshold:")
pprint(analyzer_engine.default_score_threshold)

print("\naggregation strategy:")
pprint(analyzer_engine.nlp_engine.ner_model_configuration.aggregation_strategy)

pprint(analyzer_engine.nlp_engine.ner_model_configuration.model_to_presidio_entity_mapping)

'Supported entities for English:'
['US_PASSPORT', 'US_DRIVER_LICENSE', 'PERSON', 'CRYPTO', 'ORGANIZATION',
 'TITLE', 'URL', 'EMAIL', 'US_BANK_NUMBER', 'MEDICAL_LICENSE', 'ZIP_CODE',
 'CREDIT_CARD', 'US_ITIN', 'LOCATION', 'IBAN_CODE', 'AGE', 'ID',
 'EMAIL_ADDRESS', 'DATE_TIME', 'PHONE_NUMBER', 'IP_ADDRESS', 'US_SSN']

Loaded recognizers for English:
['CreditCardRecognizer', 'UsBankRecognizer', 'UsLicenseRecognizer',
 'UsItinRecognizer', 'UsPassportRecognizer', 'UsSsnRecognizer',
 'CryptoRecognizer', 'DateRecognizer', 'EmailRecognizer', 'IbanRecognizer',
 'IpRecognizer', 'MedicalLicenseRecognizer', 'PhoneRecognizer', 'UrlRecognizer',
 'TransformersRecognizer', 'TitlesRecognizer', 'ZipCodeRecognizer',
 'LocationDenylist', 'AgeRecognizer']

Loaded Context Aware Enhancer:
LemmaContextAwareEnhancer
('{"context_similarity_factor": 0.35, "min_score_with_context_similarity": '
 '0.4, "context_prefix_count": 10, "context_suffix_count": 10}')

Loaded NER Models:
[{'lang_code': 'en',
  'model_name

In [8]:
# Test Analyzer
text="Yesterday in Mt. Sinai AP: Dana Silver, 79 years old female was complaining of stomach pain. Her ID is 154555"
res = analyzer_engine.analyze(text=text, 
                              language="en", 
                              return_decision_process=True)
for result in res:
    print(f"\nEntity: {result.entity_type}, Text: {text[result.start:result.end]}\n\nAnalysis explanation:")
    pprint(result.analysis_explanation)


Entity: LOCATION, Text: AP

Analysis explanation:
{'recognizer': 'LocationDenylist', 'pattern_name': 'deny_list', 'pattern': '(?:^|(?<=\\W))(APO|PSC|AA|Cyprus\\ \\(Greek\\)|ul|AE|DPO|AP|nan)(?:(?=\\W)|$)', 'original_score': 1.0, 'score': 1.0, 'textual_explanation': 'Detected by `LocationDenylist` using pattern `deny_list`', 'score_context_improvement': 0, 'supportive_context_word': '', 'validation_result': None, 'regex_flags': regex.I|M|S}

Entity: PERSON, Text: Silver,

Analysis explanation:
{'recognizer': 'TransformersRecognizer', 'pattern_name': None, 'pattern': None, 'original_score': np.float32(0.99171305), 'score': np.float32(0.99171305), 'textual_explanation': "Identified as PERSON by Transformers's Named Entity Recognition", 'score_context_improvement': 0, 'supportive_context_word': '', 'validation_result': None, 'regex_flags': None}

Entity: PERSON, Text: Dana

Analysis explanation:
{'recognizer': 'TransformersRecognizer', 'pattern_name': None, 'pattern': None, 'original_scor
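
The `LocationDenylist` explanation above reports a regular expression built from a deny list. The following minimal sketch, using only Python's standard `re` module, reproduces the shape of that pattern with a shortened term list to show why the standalone token `AP` in the test sentence is flagged. It approximates the logged pattern and is not Presidio's own code; the `deny_list` variable here is an illustrative subset.

```python
import re

# Illustrative subset of the deny-list terms shown in the logged pattern.
deny_list = ["APO", "PSC", "AA", "AE", "DPO", "AP"]

# Same shape as the logged pattern: a term must be preceded by start-of-text
# or a non-word character, and followed by a non-word character or end-of-text.
pattern = re.compile(
    r"(?:^|(?<=\W))("
    + "|".join(re.escape(term) for term in deny_list)
    + r")(?:(?=\W)|$)",
    flags=re.IGNORECASE | re.MULTILINE | re.DOTALL,
)

text = (
    "Yesterday in Mt. Sinai AP: Dana Silver, 79 years old female "
    "was complaining of stomach pain. Her ID is 154555"
)

# Collect (matched text, start offset, end offset) for every hit.
matches = [(m.group(1), m.start(1), m.end(1)) for m in pattern.finditer(text)]
print(matches)
```

Because the pattern only checks the characters around a term, any two-letter abbreviation on the list matches regardless of its meaning in the sentence, which is one source of false positives to watch for during error analysis.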

## 4. Align the dataset's entities to Presidio's entities

The entity names in the dataset may differ from the names of the entities Presidio detects.
For example, a dataset might label a name as PER while Presidio returns PERSON. To compare predictions with annotations and compute metrics, the entity names must first be aligned. Consider changing the mapping if your dataset or Presidio instance supports different entity types.
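
The idea of alignment can be illustrated with a small standalone sketch. The helper `align_tags` and the `example_mapping` dictionary below are hypothetical and exist only for illustration; they are not the presidio-evaluator implementation. The sketch assumes that, when missing mappings are allowed, unmapped labels are left unchanged.

```python
from typing import Dict, List


def align_tags(
    tags: List[str], mapping: Dict[str, str], allow_missing: bool = True
) -> List[str]:
    """Translate dataset labels into the label names used by the model."""
    aligned = []
    for tag in tags:
        if tag == "O":
            # Non-entity tokens are never remapped.
            aligned.append(tag)
        elif tag in mapping:
            aligned.append(mapping[tag])
        elif allow_missing:
            # Assumption: unmapped labels pass through unchanged.
            aligned.append(tag)
        else:
            raise ValueError(f"No mapping defined for label {tag!r}")
    return aligned


# Hypothetical mapping from dataset labels to Presidio-style labels.
example_mapping = {"PER": "PERSON", "GPE": "LOCATION"}
aligned_tags = align_tags(["O", "PER", "PER", "O", "GPE", "AGE"], example_mapping)
print(aligned_tags)
```

Raising on unmapped labels (`allow_missing=False`) is useful when you want to be sure every dataset label has been considered before computing metrics.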

In [1]:
# Copy the default mapping so the class-level dictionary is not modified in place
entities_mapping = PresidioAnalyzerWrapper.presidio_entities_map.copy()
# Add titles and zip codes as we have recognizers for those
entities_mapping["TITLE"] = "TITLE"
entities_mapping["PREFIX"] = "TITLE"
entities_mapping["ZIP_CODE"] = "LOCATION"  # To avoid conflating zip codes with addresses
entities_mapping["NRP"] = "ORGANIZATION"  # We don't have an NRP recognizer with this setup
# entities_mapping["LOCATION"] = "GPE"


print("Entities mapping:")
pprint(entities_mapping)

dataset = SpanEvaluator.align_entity_types(
    dataset, 
    entities_mapping=entities_mapping, 
    allow_missing_mappings=True
)
new_entity_counts = get_entity_counts(dataset)
print("\nCount per entity after alignment:")
pprint(new_entity_counts.most_common(), compact=True)

dataset_entities = list(new_entity_counts.keys())


## 5. Set up the Evaluator object

In [10]:
# Set up the experiment tracker to log the experiment for reproducibility
# experiment = get_experiment_tracker()

# Create the evaluator object
evaluator = SpanEvaluator(model=analyzer_engine, iou_threshold=0.7)
evaluator?


# Track model and dataset params
params = {"dataset_name": dataset_name, "model_name": evaluator.model.name}
params.update(evaluator.model.to_log())
# experiment.log_parameters(params)
# experiment.log_dataset_hash(dataset)
# experiment.log_parameter("entity_mappings", json.dumps(entities_mapping))

--------
Entities supported by this Presidio Analyzer instance:
US_PASSPORT, US_DRIVER_LICENSE, PERSON, CRYPTO, ORGANIZATION, TITLE, URL, EMAIL, US_BANK_NUMBER, MEDICAL_LICENSE, ZIP_CODE, CREDIT_CARD, US_ITIN, LOCATION, IBAN_CODE, AGE, ID, EMAIL_ADDRESS, DATE_TIME, PHONE_NUMBER, IP_ADDRESS, US_SSN


[31mType:[39m           SpanEvaluator
[31mString form:[39m    <presidio_evaluator.evaluation.span_evaluator.SpanEvaluator object at 0x328a5d5d0>
[31mFile:[39m           /opt/miniconda3/envs/deid-pipeline/lib/python3.11/site-packages/presidio_evaluator/evaluation/span_evaluator.py
[31mDocstring:[39m      Evaluates PII detection using span-based fuzzy matching with character-level Intersection over Union (IoU).
[31mInit docstring:[39m
Initialize the SpanEvaluator for evaluating pii entities detection results.

:param iou_threshold: Minimum Intersection over Union (IoU) threshold for considering spans as matching.
                    Value between 0 and 1, where higher values require more overlap (default: 0.5)
:param skip_words: Optional list of custom skip words to ignore during token normalization,
                    should also include punctuation marks.
                 If None, uses skip words from skipwords.py (default: None).
                 Pass an empty list ([]) to 

In [11]:
LABEL_NORM = {
    "GPE": "LOCATION",
    "LOC": "LOCATION",
    "NORP": "NRP",
}

# def normalize_dataset(dataset):
#     for sample in dataset:
#         for ann in sample.get_tags():  # or sample.entities / sample.labels
#             pprint(ann)
#             # ann.entity_type = LABEL_NORM.get(ann.entity_type, ann.entity_type)
#     return dataset

sample = dataset[1]
sample.get_tags()[1]

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-IP_ADDRESS']

In [12]:
evaluator.align_entity_types

<function presidio_evaluator.evaluation.base_evaluator.BaseEvaluator.align_entity_types(input_samples: List[presidio_evaluator.data_objects.InputSample], entities_mapping: Dict[str, str] = None, allow_missing_mappings: bool = False) -> List[presidio_evaluator.data_objects.InputSample]>

## 6. Run experiment

In [13]:
%%time

## Run experiment

# dataset_norm = normalize_dataset(dataset)

evaluation_results = evaluator.evaluate_all(dataset)
results = evaluator.calculate_score(evaluation_results)

# Track experiment results
# experiment.log_metrics(results.to_log())
# entities, confmatrix = results.to_confusion_matrix()
# experiment.log_confusion_matrix(matrix=confmatrix, 
                                # labels=entities)

# end experiment
# experiment.end()

# Note that the experiment params and metrics are saved locally

Running model PresidioAnalyzerWrapper on dataset...
Finished running model on dataset
CPU times: user 36.8 s, sys: 8.78 s, total: 45.6 s
Wall time: 42 s


## 7. Evaluate results

In [14]:
pprint({"PII F":results.pii_f, "PII recall": results.pii_recall, "PII precision": results.pii_precision})

{'PII F': 0.856934306569343,
 'PII precision': 0.8827067669172932,
 'PII recall': 0.8507246376811595}
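
The relationship between the three numbers can be checked with the standard F-beta formula. The sketch below is a plain-Python helper (`f_beta` is a name chosen here, not a presidio-evaluator function). The reported `PII F` is consistent with beta = 2, which weights recall more heavily than precision; this is inferred from the numbers above rather than read from the evaluator's configuration.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_squared = beta ** 2
    return (1 + beta_squared) * precision * recall / (beta_squared * precision + recall)


# Values copied from the output above.
precision = 0.8827067669172932
recall = 0.8507246376811595

f1 = f_beta(precision, recall, beta=1)
f2 = f_beta(precision, recall, beta=2)
print(f"F1: {f1:.6f}")
print(f"F2: {f2:.6f}")
```

A recall-weighted score is a common choice for PII detection, where a missed entity is usually more costly than a spurious one.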


## 8. Error analysis

Now let's look into results to understand what's behind the metrics we're getting.
Note that evaluation is never perfect. Some things to consider:

1. There's often a mismatch between the annotated span and the predicted span, which isn't necessarily a mistake. For example: `<Southern France>` compared with `Southern <France>`. In the second text, the word `Southern` was not annotated/predicted as part of the entity, but that's not necessarily an error.
1. The synthetic dataset used here isn't representative of a real dataset. Consider using more realistic datasets for evaluation.
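
To make the span-mismatch point concrete, here is a minimal sketch of character-level Intersection over Union (IoU) applied to the `Southern France` example. The helper `char_iou` is hypothetical and only approximates what the evaluator does; `SpanEvaluator` also normalizes tokens and skip words, so its values may differ.

```python
from typing import Tuple


def char_iou(span_a: Tuple[int, int], span_b: Tuple[int, int]) -> float:
    """Character-level IoU of two (start, end) spans with exclusive ends."""
    start_a, end_a = span_a
    start_b, end_b = span_b
    intersection = max(0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union else 0.0


text = "Southern France"
annotated = (0, len(text))                    # <Southern France>
predicted = (text.index("France"), len(text))  # Southern <France>

iou = char_iou(annotated, predicted)
print(f"IoU: {iou:.2f}, counts as a match at threshold 0.7: {iou >= 0.7}")
```

With an `iou_threshold` of 0.7, as used in this notebook, such a partial overlap would not be counted as a match under this simplified calculation, even though the prediction is arguably reasonable.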

### 8a. False positives
#### Most common false positive tokens:

In [15]:
ModelError.most_common_fp_tokens(results.model_errors)

Most common false positive tokens:
[('metallica', 10),
 ('co.ltd', 9),
 ('czech', 9),
 ('501(c)3', 8),
 ('manager', 7),
 ('dutch', 7),
 ('8 16', 7),
 ('dafne mascarenas william martinsson carmen gordon franziska rothstein '
  'kaikou shimoda',
  7),
 ('lena røise brenda conway laura fernandes debra neal nicholas echeverri', 7),
 ('italian', 6)]
---------------
Example sentence with each FP token:
	- Metallica (`metallica` pred as O)
	- CO.LTD (`co.ltd` pred as O)
	- Czech (`czech` pred as O)
	- 501(c)3 (`501(c)3` pred as AGE)
	- manager (`manager` pred as O)
	- Dutch (`dutch` pred as O)
	- 8 16 (`8 16` pred as AGE)
	- Dafne Mascarenas William Martinsson Carmen Gordon , Franziska Rothstein Kaikou Shimoda (`dafne mascarenas william martinsson carmen gordon franziska rothstein kaikou shimoda` pred as O)
	- Lena Røise Brenda Conway Laura Fernandes , Debra Neal Nicholas Echeverri (`lena røise brenda conway laura fernandes debra neal nicholas echeverri` pred as O)
	- Italian (`italian` pred 


#### More FP analysis

In [16]:
fps_df = ModelError.get_fps_dataframe(results.model_errors, entity=["AGE"])
fps_df[["full_text", "token", "annotation", "prediction"]].head(20)

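| | full_text | token | annotation | prediction |
|---|---|---|---|---|
| 0 | 5 | 5 | O | AGE |
| 1 | 501(c)3 | 501(c)3 | O | AGE |
| 2 | 6 | 6 | O | AGE |
| 3 | 501(c)3 | 501(c)3 | O | AGE |
| 4 | 8 16 | 8 16 | O | AGE |
| 5 | 501(c)3 | 501(c)3 | O | AGE |
| 6 | 501(c)3 | 501(c)3 | O | AGE |
| 7 | 501(c)3 | 501(c)3 | O | AGE |
| 8 | 5 | 5 | O | AGE |
| 9 | 5 | 5 | O | AGE |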
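
A quick way to see which tokens drive the `AGE` false positives is to count them. The sketch below uses only the ten token values shown in the output above, copied into a plain list so that it runs without the evaluation results. With the real DataFrame, `fps_df["token"].value_counts()` should give the equivalent summary.

```python
from collections import Counter

# Token column of the AGE false positives shown above, in row order.
fp_tokens = [
    "5", "501(c)3", "6", "501(c)3", "8 16",
    "501(c)3", "501(c)3", "501(c)3", "5", "5",
]

token_counts = Counter(fp_tokens)
for token, count in token_counts.most_common():
    print(f"{token!r}: {count}")
```

Half of the rows come from the single string `501(c)3`, which suggests that a targeted fix to the age recognizer's pattern or context words might remove a large share of these errors.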


### 8b. False negatives (FN)

#### Most common false negative examples + a few samples with FN


In [17]:
ModelError.most_common_fn_tokens(results.model_errors, n=15)

Most common false negative tokens:
[('czech', 11),
 ('dutch', 7),
 ('brazil', 6),
 ('italian', 6),
 ('slovenian', 5),
 ('2017', 5),
 ('danish', 5),
 ('2024', 4),
 ('american', 4),
 ('french', 3),
 ('russian', 3),
 ('greenlander', 3),
 ('monday', 3),
 ('1970', 3),
 ('spanish', 3)]
---------------
Example sentence with each FN token:
	- Czech (`czech` annotated as O)
	- Dutch (`dutch` annotated as O)
	- Brazil (`brazil` annotated as O)
	- Italian (`italian` annotated as O)
	- Slovenian (`slovenian` annotated as O)
	- 2017 (`2017` annotated as O)
	- Danish (`danish` annotated as O)
	- 2024 (`2024` annotated as O)
	- American (`american` annotated as O)
	- French (`french` annotated as O)
	- Russian (`russian` annotated as O)
	- Greenlander (`greenlander` annotated as O)
	- Monday (`monday` annotated as O)
	- 1970 (`1970` annotated as O)
	- Spanish (`spanish` annotated as O)



#### More FN analysis

In [18]:
fns_df = ModelError.get_fns_dataframe(results.model_errors, entity=["PERSON"])

In [19]:
fns_df[["full_text", "token", "annotation", "prediction"]].head(20)

| | full_text | token | annotation | prediction |
|---|---|---|---|---|
| 0 | Ruby and Bem rakpart 81 . | ruby bem rakpart 81 | PERSON | LOCATION |
| 1 | Seeclickfix | seeclickfix | PERSON | ORGANIZATION |
| 2 | Jubilant Industries Ltd SSgA SPDR ETFs Europe I Public Limited Company- SPDR Barclays 3 - 7 Year Euro Corporate Bond UCITS ETF | jubilant industries ssga spdr etfs europe public limited company- spdr barclays 3 7 euro corporate bond ucits etf | PERSON | ORGANIZATION |
| 3 | Importio | importio | PERSON | ORGANIZATION |
| 4 | Ruby and Bem rakpart 81 . | ruby bem rakpart 81 | PERSON | LOCATION |
| 5 | Seeclickfix | seeclickfix | PERSON | ORGANIZATION |
| 6 | Jubilant Industries Ltd SSgA SPDR ETFs Europe I Public Limited Company- SPDR Barclays 3 - 7 Year Euro Corporate Bond UCITS ETF | jubilant industries ssga spdr etfs europe public limited company- spdr barclays 3 7 euro corporate bond ucits etf | PERSON | ORGANIZATION |
| 7 | Importio | importio | PERSON | ORGANIZATION |


In [28]:
text = "Born in Poland. Czech ancestry."
doc = analyzer_engine.nlp_engine.process_text(text, language="en")
print(doc.entities)

[Poland.]


In [29]:
text = "Born in Poland. Czech ancestry."

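# Use a separate name so the evaluation `results` from step 6 are not overwritten
analyzer_results = analyzer_engine.analyze(text=text, language="en", return_decision_process=True)
for r in analyzer_results:
    print(r.entity_type, r.start, r.end, repr(text[r.start:r.end]), r.score)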


LOCATION 8 15 'Poland.' 0.77821887
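
The predicted span above includes the trailing period (`'Poland.'`). One possible post-processing step is to trim punctuation and whitespace from span boundaries before comparing spans. The helper `trim_span` below is a hypothetical sketch and not part of Presidio or presidio-evaluator; the evaluator's skip-word normalization may already cover this case.

```python
import string
from typing import Tuple

STRIP_CHARS = string.punctuation + string.whitespace


def trim_span(text: str, start: int, end: int) -> Tuple[int, int]:
    """Shrink (start, end) so the span neither starts nor ends with punctuation."""
    while start < end and text[start] in STRIP_CHARS:
        start += 1
    while end > start and text[end - 1] in STRIP_CHARS:
        end -= 1
    return start, end


text = "Born in Poland. Czech ancestry."
new_start, new_end = trim_span(text, 8, 15)
print(new_start, new_end, repr(text[new_start:new_end]))
```

Trimming changes only the boundaries of detected spans. It does not address the missed `Czech` mention, which would need a recognizer or model that covers nationalities.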
