## Evaluate a custom Presidio Analyzer using the Presidio Evaluator framework

This notebook demonstrates how to evaluate a Presidio instance using the presidio-evaluator framework. It builds upon [example 4](4_Evaluate_Presidio_Analyzer.ipynb), with changes to the `PresidioAnalyzer` instance to improve detection accuracy. For more information on customizing the Presidio Analyzer, see the [Presidio Analyzer documentation](https://microsoft.github.io/presidio/analyzer/) or this [tutorial](https://microsoft.github.io/presidio/tutorial/).

Steps:
1. Load dataset from file
2. Simple dataset statistics
3. Define the AnalyzerEngine object (and its parameters)
4. Align the dataset's entities to Presidio's entities
5. Set up the Evaluator object
6. Run experiment
7. Evaluate results
8. Error analysis

In [None]:
# install presidio evaluator via pip if not yet installed

#!pip install presidio-evaluator
#!pip install "presidio-analyzer[transformers]"

In [29]:
from pathlib import Path
from pprint import pprint
from collections import Counter
from typing import Dict, List
import json

from presidio_evaluator import InputSample
from presidio_evaluator.evaluation import Evaluator, ModelError
from presidio_evaluator.models import PresidioAnalyzerWrapper
from presidio_evaluator.experiment_tracking import get_experiment_tracker

import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

%reload_ext autoreload
%autoreload 2
%matplotlib inline

stanza and spacy_stanza are not installed
Flair is not installed by default
Flair is not installed


## 1. Load dataset from file

In [56]:
dataset_name = "data2.json"
dataset = InputSample.read_dataset_json(Path(Path.cwd().parent, "data modification", dataset_name))

print(len(dataset))



tokenizing input: 100%|██████████| 15/15 [00:00<00:00, 52.87it/s]

15





## 2. Simple dataset statistics

In [31]:
def get_entity_counts(dataset: List[InputSample]) -> Dict:
    """Return a dictionary with counter per entity type."""
    entity_counter = Counter()
    for sample in dataset:
        for tag in sample.tags:
            entity_counter[tag] += 1
    return entity_counter


In [32]:
entity_counts = get_entity_counts(dataset)
print("Count per entity:")
pprint(entity_counts.most_common(), compact=True)

print("\nMin and max number of tokens in dataset: "
      f"Min: {min(len(sample.tokens) for sample in dataset)}, "
      f"Max: {max(len(sample.tokens) for sample in dataset)}")

print("Min and max sentence length in dataset: "
      f"Min: {min(len(sample.full_text) for sample in dataset)}, "
      f"Max: {max(len(sample.full_text) for sample in dataset)}")

print("\nExample InputSample:")
print(dataset[1])

Count per entity:
[('O', 381), ('USERAGENT', 13), ('DATE', 9), ('IP', 5), ('FIRSTNAME', 4),
 ('AMOUNT', 4), ('PASSWORD', 4), ('ZIPCODE', 3), ('EMAIL', 3),
 ('CURRENCYCODE', 3), ('JOBTITLE', 3), ('USERNAME', 3), ('CURRENCYNAME', 2),
 ('PIN', 2), ('CURRENCYSYMBOL', 2), ('JOBTYPE', 2), ('STATE', 2), ('STREET', 2),
 ('SECONDARYADDRESS', 2), ('CITY', 2), ('JOBAREA', 1), ('ETHEREUMADDRESS', 1),
 ('GENDER', 1), ('AGE', 1), ('ACCOUNTNUMBER', 1), ('IBAN', 1), ('URL', 1),
 ('BUILDINGNUMBER', 1), ('CREDITCARDNUMBER', 1), ('VEHICLEVIN', 1)]

Min and max number of tokens in dataset: Min: 14, Max: 46
Min and max sentence length in dataset: Min: 92, Max: 262

Example InputSample:
Full text: Jessyca, you should compare our performance to the industry averages. This includes leads, conversion rates, bounce rates, page views, average spend per customer, and customer acquisition costs. Send a report to Roosevelt_Kshlerin@yahoo.com.
Spans: [Span(type: FIRSTNAME, value: Jessyca, char_span: [0: 7]), Span(ty

In [33]:
print("A few example sentences containing each entity:\n")
for entity in entity_counts.keys():
    samples = [sample for sample in dataset if entity in set(sample.tags)]
    if len(samples) > 1 and entity != "O":
        print(f"Entity: <{entity}> two example sentences:\n"
              f"\n1) {samples[0].full_text}"
              f"\n2) {samples[1].full_text}"
              f"\n------------------------------------\n")

A few example sentences containing each entity:

Entity: <FIRSTNAME> two example sentences:

1) Jessyca, you should compare our performance to the industry averages. This includes leads, conversion rates, bounce rates, page views, average spend per customer, and customer acquisition costs. Send a report to Roosevelt_Kshlerin@yahoo.com.
2) Dear Amely, due to the high demand, the early bird registration fees ₱889218 for the upcoming Intellectual Property Law webinar have been extended to 1929-11-13T02:09:09.402Z. Kindly apply at your earliest convenience.
------------------------------------

Entity: <EMAIL> two example sentences:

1) Jessyca, you should compare our performance to the industry averages. This includes leads, conversion rates, bounce rates, page views, average spend per customer, and customer acquisition costs. Send a report to Roosevelt_Kshlerin@yahoo.com.
2) Opportunities for Developer role in Saarland. Check out the details sent to Lura28@gmail.com.
-------------------

## 3. Define the AnalyzerEngine object (and its parameters)

In [34]:
from presidio_analyzer import AnalyzerEngine
# Loading the vanilla Analyzer Engine, with the default NER model.
analyzer_engine = AnalyzerEngine(default_score_threshold=0.4)

print("Supported entities for English:")
pprint(analyzer_engine.get_supported_entities("en"), compact=True)

print("\nLoaded recognizers for English:")
pprint([rec.name for rec in analyzer_engine.registry.get_recognizers("en", all_fields=True)], compact=True)

print("\nLoaded NER models:")
pprint(analyzer_engine.nlp_engine.models)

Supported entities for English:
['US_ITIN', 'DATE_TIME', 'NRP', 'IN_PAN', 'LOCATION', 'US_PASSPORT', 'UK_NHS',
 'CRYPTO', 'AU_MEDICARE', 'AU_ACN', 'SG_NRIC_FIN', 'CREDIT_CARD', 'IN_VOTER',
 'IN_AADHAAR', 'IN_PASSPORT', 'IN_VEHICLE_REGISTRATION', 'US_SSN', 'IBAN_CODE',
 'AU_TFN', 'US_DRIVER_LICENSE', 'AU_ABN', 'EMAIL_ADDRESS', 'URL',
 'US_BANK_NUMBER', 'ORGANIZATION', 'MEDICAL_LICENSE', 'IP_ADDRESS', 'PERSON',
 'PHONE_NUMBER']

Loaded recognizers for English:
['CreditCardRecognizer', 'UsBankRecognizer', 'UsLicenseRecognizer',
 'UsItinRecognizer', 'UsPassportRecognizer', 'UsSsnRecognizer', 'NhsRecognizer',
 'SgFinRecognizer', 'AuAbnRecognizer', 'AuAcnRecognizer', 'AuTfnRecognizer',
 'AuMedicareRecognizer', 'InPanRecognizer', 'InAadhaarRecognizer',
 'InVehicleRegistrationRecognizer', 'InPassportRecognizer', 'CryptoRecognizer',
 'DateRecognizer', 'EmailRecognizer', 'IbanRecognizer', 'IpRecognizer',
 'MedicalLicenseRecognizer', 'PhoneRecognizer', 'UrlRecognizer',
 'InVoterRecognizer', 'Sp

## 4. Align the dataset's entities to Presidio's entities

In [7]:

presidio_entities_map1 = dict(
    FIRSTNAME="PERSON",
    LASTNAME="PERSON",
    MIDDLENAME="PERSON",
    PERSON="PERSON",

    DATE="DATE_TIME",
    TIME="DATE_TIME",
    DOB="DATE_TIME",
    DATE_TIME="DATE_TIME",

    EMAIL="EMAIL_ADDRESS",
    EMAIL_ADDRESS="EMAIL_ADDRESS",

    PREFIX="TITLE",
    TITLE="TITLE",

    URL="URL",

    STREET="LOCATION",
    STATE="LOCATION",
    CITY="LOCATION",
    COUNTY="LOCATION",
    SECONDARYADDRESS="LOCATION",
    LOCATION="LOCATION",

    PHONEIMEI="PHONE_NUMBER",
    PHONENUMBER="PHONE_NUMBER",
    PHONE_NUMBER="PHONE_NUMBER",

    IPV4="IP_ADDRESS",
    IPV6="IP_ADDRESS",
    IP="IP_ADDRESS",
    IP_ADDRESS="IP_ADDRESS",

    CREDITCARDNUMBER="CREDIT_CARD",
    CREDIT_CARD="CREDIT_CARD",

    ZIPCODE="ZIP_CODE",
    ZIP_CODE="ZIP_CODE",

    COMPANYNAME="ORGANIZATION",
    ORGANIZATION="ORGANIZATION",

    IBAN="IBAN_CODE",
    IBAN_CODE="IBAN_CODE",

    SSN="US_SSN",
    US_SSN="US_SSN",

    AGE="AGE",

    AMOUNT="O",
    USERNAME="O",
    JOBTITLE="O",
    JOBAREA="O",
    ACCOUNTNAME="O",
    ACCOUNTNUMBER="O",
    JOBTYPE="O",
    BUILDINGNUMBER="O",
    CURRENCYSYMBOL="O",
    PASSWORD="O",
    SEX="O",
    GENDER="O",
    BITCOINADDRESS="O",
    MASKEDNUMBER="O",
    USERAGENT="O",
    CURRENCY="O",
    ETHEREUMADDRESS="O",
    NEARBYGPSCOORDINATE="O",
    CREDITCARDISSUER="O",
    ORDINALDIRECTION="O",
    MAC="O",
    VEHICLEVRM="O",
    EYECOLOR="O",
    CREDITCARDCVV="O",
    HEIGHT="O",
    LITECOINADDRESS="O",
    VEHICLEVIN="O",
    CURRENCYCODE="O",
    CURRENCYNAME="O",
    BIC="O",
    PIN="O",
    O="O",
)
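
Three of the targets in this mapping (`ZIP_CODE`, `TITLE` and `AGE`) do not appear in the list of supported entities printed by the analyzer earlier, so the vanilla engine has no recognizer that could predict them. A quick check like the one below can surface such gaps before the experiment is run. It is a minimal sketch: `unsupported_targets` is a hypothetical helper, not part of presidio-evaluator, and the two inputs are small illustrative subsets of the mapping and of the supported-entities list.

```python
from typing import Dict, Iterable, Set


def unsupported_targets(mapping: Dict[str, str], supported: Iterable[str]) -> Set[str]:
    """Return mapping targets (other than "O") that the analyzer does not support."""
    supported_set = set(supported)
    return {
        target
        for target in mapping.values()
        if target != "O" and target not in supported_set
    }


# Illustrative subsets only (assumption: the real inputs are much longer)
mapping_subset = {
    "FIRSTNAME": "PERSON",
    "ZIPCODE": "ZIP_CODE",
    "AGE": "AGE",
    "PREFIX": "TITLE",
    "AMOUNT": "O",
}
supported_subset = ["PERSON", "DATE_TIME", "LOCATION"]

missing = unsupported_targets(mapping_subset, supported_subset)
print(sorted(missing))
```

In the notebook, the same helper could be called with `presidio_entities_map1` and `analyzer_engine.get_supported_entities("en")`.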







In [36]:
#entities_mapping=PresidioAnalyzerWrapper.presidio_entities_map 
entities_mapping = presidio_entities_map1
print("Using this mapping between the dataset and Presidio's entities:")
pprint(entities_mapping, compact=True)


dataset = Evaluator.align_entity_types(
    dataset, 
    entities_mapping=entities_mapping, 
    allow_missing_mappings=True
)
new_entity_counts = get_entity_counts(dataset)
print("\nCount per entity after alignment:")
pprint(new_entity_counts.most_common(), compact=True)

dataset_entities = list(new_entity_counts.keys())


Using this mapping between the dataset and Presidio's entities:
{'ACCOUNTNAME': 'O',
 'ACCOUNTNUMBER': 'O',
 'AGE': 'AGE',
 'AMOUNT': 'O',
 'BIC': 'O',
 'BITCOINADDRESS': 'O',
 'BUILDINGNUMBER': 'O',
 'CITY': 'LOCATION',
 'COMPANYNAME': 'ORGANIZATION',
 'COUNTY': 'LOCATION',
 'CREDITCARDCVV': 'O',
 'CREDITCARDISSUER': 'O',
 'CREDITCARDNUMBER': 'CREDIT_CARD',
 'CREDIT_CARD': 'CREDIT_CARD',
 'CURRENCY': 'O',
 'CURRENCYCODE': 'O',
 'CURRENCYNAME': 'O',
 'CURRENCYSYMBOL': 'O',
 'DATE': 'DATE_TIME',
 'DATE_TIME': 'DATE_TIME',
 'DOB': 'DATE_TIME',
 'EMAIL': 'EMAIL_ADDRESS',
 'EMAIL_ADDRESS': 'EMAIL_ADDRESS',
 'ETHEREUMADDRESS': 'O',
 'EYECOLOR': 'O',
 'FIRSTNAME': 'PERSON',
 'GENDER': 'O',
 'HEIGHT': 'O',
 'IBAN': 'IBAN_CODE',
 'IBAN_CODE': 'IBAN_CODE',
 'IP': 'IP_ADDRESS',
 'IPV4': 'IP_ADDRESS',
 'IPV6': 'IP_ADDRESS',
 'IP_ADDRESS': 'IP_ADDRESS',
 'JOBAREA': 'O',
 'JOBTITLE': 'O',
 'JOBTYPE': 'O',
 'LASTNAME': 'PERSON',
 'LITECOINADDRESS': 'O',
 'LOCATION': 'LOCATION',
 'MAC': 'O',
 'MASKED

In [37]:
print(dataset[0])

Full text: 89200-3325 schools are next in line for education reform pilot program. Mobility team, prepare accordingly!
Spans: [Span(type: ZIP_CODE, value: 89200-3325, char_span: [0: 10]), Span(type: O, value: Mobility, char_span: [72: 80])]



## 5. Set up the Evaluator object

In [38]:
# Set up the experiment tracker to log the experiment for reproducibility
experiment = get_experiment_tracker()
 
# Create a wrapper for Presidio to be used within the presidio-evaluator framework
model = PresidioAnalyzerWrapper(analyzer_engine, 
                                entity_mapping=entities_mapping)

# Create the evaluator object
evaluator = Evaluator(model=model)


# Track model and dataset params
params = {"dataset_name": dataset_name, "model_name": model.name}
params.update(model.to_log())
experiment.log_parameters(params)
experiment.log_dataset_hash(dataset)
experiment.log_parameter("entity_mappings", json.dumps(entities_mapping))

--------
Entities supported by this Presidio Analyzer instance:
US_ITIN, DATE_TIME, NRP, IN_PAN, LOCATION, US_PASSPORT, UK_NHS, CRYPTO, AU_MEDICARE, AU_ACN, SG_NRIC_FIN, CREDIT_CARD, IN_VOTER, IN_AADHAAR, IN_PASSPORT, IN_VEHICLE_REGISTRATION, US_SSN, IBAN_CODE, AU_TFN, US_DRIVER_LICENSE, AU_ABN, EMAIL_ADDRESS, URL, US_BANK_NUMBER, ORGANIZATION, MEDICAL_LICENSE, IP_ADDRESS, PERSON, PHONE_NUMBER


## 6. Run experiment

In [39]:
## Run experiment

evaluation_results = evaluator.evaluate_all(dataset)
results = evaluator.calculate_score(evaluation_results)

# Track experiment results
experiment.log_metrics(results.to_log())
entities, confmatrix = results.to_confusion_matrix()
experiment.log_confusion_matrix(matrix=confmatrix, 
                                labels=entities)

# Plot output
plotter = evaluator.Plotter(model=model, 
                            results=results, 
                            output_folder = ".", 
                            model_name = model.name, 
                            beta = 2)


# end experiment
experiment.end()

Mapping entity values using this dictionary: {'FIRSTNAME': 'PERSON', 'LASTNAME': 'PERSON', 'MIDDLENAME': 'PERSON', 'PERSON': 'PERSON', 'DATE': 'DATE_TIME', 'TIME': 'DATE_TIME', 'DOB': 'DATE_TIME', 'DATE_TIME': 'DATE_TIME', 'EMAIL': 'EMAIL_ADDRESS', 'EMAIL_ADDRESS': 'EMAIL_ADDRESS', 'PREFIX': 'TITLE', 'TITLE': 'TITLE', 'URL': 'URL', 'STREET': 'LOCATION', 'STATE': 'LOCATION', 'CITY': 'LOCATION', 'COUNTY': 'LOCATION', 'SECONDARYADDRESS': 'LOCATION', 'LOCATION': 'LOCATION', 'PHONEIMEI': 'PHONE_NUMBER', 'PHONENUMBER': 'PHONE_NUMBER', 'PHONE_NUMBER': 'PHONE_NUMBER', 'IPV4': 'IP_ADDRESS', 'IPV6': 'IP_ADDRESS', 'IP': 'IP_ADDRESS', 'IP_ADDRESS': 'IP_ADDRESS', 'CREDITCARDNUMBER': 'CREDIT_CARD', 'CREDIT_CARD': 'CREDIT_CARD', 'ZIPCODE': 'ZIP_CODE', 'ZIP_CODE': 'ZIP_CODE', 'COMPANYNAME': 'ORGANIZATION', 'ORGANIZATION': 'ORGANIZATION', 'IBAN': 'IBAN_CODE', 'IBAN_CODE': 'IBAN_CODE', 'SSN': 'US_SSN', 'US_SSN': 'US_SSN', 'AGE': 'AGE', 'AMOUNT': 'O', 'USERNAME': 'O', 'JOBTITLE': 'O', 'JOBAREA': 'O', 'AC
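
The plotter is created with `beta = 2`. Assuming it reports the standard F-beta score, this weights recall more heavily than precision, which suits PII detection, where a missed entity is usually costlier than a false alarm. The sketch below computes the standard formula; `f_beta` is a hypothetical helper written for illustration, and the precision and recall values are made up, not results from this experiment.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta ** 2) * precision * recall / denominator


# Two mirrored cases: with beta = 2 the high-recall model scores higher
high_recall = f_beta(precision=0.5, recall=1.0)
high_precision = f_beta(precision=1.0, recall=0.5)

print(f"F2 with P=0.5, R=1.0: {high_recall:.3f}")
print(f"F2 with P=1.0, R=0.5: {high_precision:.3f}")
```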

In [40]:

print(dataset[0])

Full text: 89200-3325 schools are next in line for education reform pilot program. Mobility team, prepare accordingly!
Spans: [Span(type: ZIP_CODE, value: 89200-3325, char_span: [0: 10]), Span(type: O, value: Mobility, char_span: [72: 80])]



## 7. Evaluate results

In [None]:
plotter.plot_scores()

In [None]:
plotter.plot_confusion_matrix(entities=entities, confmatrix=confmatrix)

## 8. Error analysis

In [359]:
plotter.plot_most_common_tokens()

In [1]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="lakshyakh93/deberta_finetuned_pii")


In [25]:
text = "My name is John and I live in California."
output = pipe(text, aggregation_strategy="first")
output

[{'entity_group': 'FIRSTNAME',
  'score': 0.95468575,
  'word': ' John',
  'start': 10,
  'end': 15},
 {'entity_group': 'STATE',
  'score': 0.98806274,
  'word': ' California.',
  'start': 29,
  'end': 41}]
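
Note that the spans returned by the pipeline include a leading space, and the `STATE` span also swallows the trailing period. They also use the dataset's label set rather than Presidio's. Before such predictions are compared with Presidio-style annotations, they could be normalized along the following lines. This is a minimal sketch: `trim_span` and `to_presidio_spans` are hypothetical helpers, the set of characters to strip is an assumption, and `mapping` is a two-entry subset of the dataset-to-Presidio mapping.

```python
from typing import Dict, List, Tuple

# Text and predictions copied from the cell above
text = "My name is John and I live in California."
pipeline_output = [
    {"entity_group": "FIRSTNAME", "score": 0.95468575, "word": " John", "start": 10, "end": 15},
    {"entity_group": "STATE", "score": 0.98806274, "word": " California.", "start": 29, "end": 41},
]

# Subset of the dataset-to-Presidio mapping (illustrative)
mapping = {"FIRSTNAME": "PERSON", "STATE": "LOCATION"}


def trim_span(text: str, start: int, end: int, strip_chars: str = " .,;:!?") -> Tuple[int, int]:
    """Shrink [start, end) so it neither starts nor ends with a stripped character."""
    while start < end and text[start] in strip_chars:
        start += 1
    while end > start and text[end - 1] in strip_chars:
        end -= 1
    return start, end


def to_presidio_spans(
    text: str, predictions: List[dict], mapping: Dict[str, str]
) -> List[Tuple[str, str, int, int]]:
    """Map entity labels and trim span boundaries; unmapped labels become "O"."""
    spans = []
    for pred in predictions:
        start, end = trim_span(text, pred["start"], pred["end"])
        entity = mapping.get(pred["entity_group"], "O")
        spans.append((entity, text[start:end], start, end))
    return spans


spans = to_presidio_spans(text, pipeline_output, mapping)
print(spans)
# [('PERSON', 'John', 11, 15), ('LOCATION', 'California', 30, 40)]
```

Stripping punctuation at the boundaries is a blunt rule: it would also truncate an entity that legitimately ends with one of these characters, so the character set should be chosen with the dataset in mind.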

In [57]:
with open('data2.json', 'r') as file:
    data = json.load(file)

In [68]:
text = data[0]['full_text']
output = pipe(text, aggregation_strategy="first")
output

[{'entity_group': 'ZIPCODE',
  'score': 0.9969326,
  'word': ' 89200-3325',
  'start': 0,
  'end': 10},
 {'entity_group': 'JOBAREA',
  'score': 0.993117,
  'word': ' Mobility',
  'start': 71,
  'end': 80}]
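
The predicted `JOBAREA` span starts at character 71, whereas the span printed for the same sample earlier covers `[72: 80]`, because the model includes the leading space. A strict exact-match comparison would treat the two as different spans. One way to tolerate such offsets is to match spans by character overlap, sketched here with intersection over union (IoU). `span_iou` is a hypothetical helper and the 0.5 threshold is an arbitrary assumption; this is not necessarily how presidio-evaluator scores predictions.

```python
from typing import Tuple


def span_iou(a: Tuple[int, int], b: Tuple[int, int]) -> float:
    """Character-level intersection over union of two [start, end) spans."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union else 0.0


gold = (72, 80)       # "Mobility" as annotated in the dataset
predicted = (71, 80)  # " Mobility" as returned by the pipeline

iou = span_iou(gold, predicted)
is_match = iou >= 0.5  # assumed threshold, tune per use case
print(f"IoU = {iou:.3f}, match = {is_match}")
```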

In [23]:
def evaluate_(dataset_name, analyzer, entities_mapping):
    """Load a dataset, evaluate the chosen analyzer on it, and plot the results.

    Relies on the imports from the first cell.
    Only the "presidio_analyzer" option is implemented.
    """
    if analyzer != "presidio_analyzer":
        raise ValueError(f"Unsupported analyzer: {analyzer!r}")

    from presidio_analyzer import AnalyzerEngine

    dataset = InputSample.read_dataset_json(
        Path(Path.cwd().parent, "data modification", dataset_name)
    )

    analyzer_engine = AnalyzerEngine(default_score_threshold=0.4)
    model = PresidioAnalyzerWrapper(analyzer_engine, entity_mapping=entities_mapping)
    evaluator = Evaluator(model=model)

    # Use a single tracker so parameters and metrics are logged to the same experiment
    experiment = get_experiment_tracker()
    params = {"dataset_name": dataset_name, "model_name": model.name}
    params.update(model.to_log())
    experiment.log_parameters(params)
    experiment.log_dataset_hash(dataset)
    experiment.log_parameter("entity_mappings", json.dumps(entities_mapping))

    # Align the dataset's entity types to Presidio's
    dataset = Evaluator.align_entity_types(
        dataset, entities_mapping=entities_mapping, allow_missing_mappings=True
    )

    # Run the evaluation and log the results
    evaluation_results = evaluator.evaluate_all(dataset)
    results = evaluator.calculate_score(evaluation_results)
    experiment.log_metrics(results.to_log())
    entities, confmatrix = results.to_confusion_matrix()
    experiment.log_confusion_matrix(matrix=confmatrix, labels=entities)
    experiment.end()

    # Plot scores, the confusion matrix, and the most common error tokens
    plotter = evaluator.Plotter(model=model,
                                results=results,
                                output_folder=".",
                                model_name=model.name,
                                beta=2)
    plotter.plot_scores()
    plotter.plot_confusion_matrix(entities=entities, confmatrix=confmatrix)
    plotter.plot_most_common_tokens()
        
 
    

In [24]:
evaluate_("data2.json", "presidio_analyzer", presidio_entities_map1)

tokenizing input: 100%|██████████| 15/15 [00:00<00:00, 103.43it/s]


--------
Entities supported by this Presidio Analyzer instance:
PERSON, IBAN_CODE, SG_NRIC_FIN, AU_ABN, US_PASSPORT, AU_ACN, US_ITIN, ORGANIZATION, NRP, CREDIT_CARD, MEDICAL_LICENSE, IN_PASSPORT, URL, LOCATION, US_SSN, DATE_TIME, PHONE_NUMBER, AU_MEDICARE, US_DRIVER_LICENSE, IP_ADDRESS, CRYPTO, US_BANK_NUMBER, IN_VEHICLE_REGISTRATION, AU_TFN, IN_AADHAAR, IN_PAN, EMAIL_ADDRESS, IN_VOTER, UK_NHS
Mapping entity values using this dictionary: {'FIRSTNAME': 'PERSON', 'LASTNAME': 'PERSON', 'MIDDLENAME': 'PERSON', 'PERSON': 'PERSON', 'DATE': 'DATE_TIME', 'TIME': 'DATE_TIME', 'DOB': 'DATE_TIME', 'DATE_TIME': 'DATE_TIME', 'EMAIL': 'EMAIL_ADDRESS', 'EMAIL_ADDRESS': 'EMAIL_ADDRESS', 'PREFIX': 'TITLE', 'TITLE': 'TITLE', 'URL': 'URL', 'STREET': 'LOCATION', 'STATE': 'LOCATION', 'CITY': 'LOCATION', 'COUNTY': 'LOCATION', 'SECONDARYADDRESS': 'LOCATION', 'LOCATION': 'LOCATION', 'PHONEIMEI': 'PHONE_NUMBER', 'PHONENUMBER': 'PHONE_NUMBER', 'PHONE_NUMBER': 'PHONE_NUMBER', 'IPV4': 'IP_ADDRESS', 'IPV6': 'IP_AD