# PII Detection Evaluation with Presidio Research
In this notebook, we will demonstrate how to evaluate the performance of PII (Personally Identifiable Information) detection using the `presidio-research` library.


## Import Libraries

First, we need to import the necessary libraries:

In [35]:
# Import libraries
from collections import Counter
from presidio_evaluator import InputSample, Span
from presidio_evaluator.evaluation import Evaluator, ModelError
from presidio_evaluator.models import PresidioAnalyzerWrapper
from copy import deepcopy
import pandas as pd

`Counter` is a class from Python's standard `collections` module that counts occurrences of hashable items, such as the elements of a list.
`InputSample` and `Span` are classes from the `presidio_evaluator` package that represent a text sample and the PII spans within it, respectively.

## Load Data

Next, we will load our data. This data should be in the form of a list of `InputSample` objects. Each `InputSample` object represents a piece of text and contains a list of `Span` objects that represent the ground truth spans of PII in the text. 

In this example, I create an `InputSample` whose `full_text` contains two sentences: "My name is Trang Nguyen. I live in France." The ground truth spans are declared in the `spans` parameter. Because the sample has no ground truth tokens or tags, which the evaluation needs, we set `create_tags_from_span=True` so that `full_text` is tokenized and each token is tagged from the spans. By default, tokenization uses the `en_core_web_sm` model (the `token_model_version` parameter) and the tags follow the IO scheme. You can change these parameters to use a different tokenization model or tagging scheme.

In [66]:
# Define the data as an InputSample object
sample = InputSample(
    full_text="My name is Trang Nguyen. I live in France.",
    spans=[
        # Character offsets into full_text: full_text[11:23] == "Trang Nguyen"
        Span(
            start_position=11,
            end_position=23,
            entity_value="Trang Nguyen",
            entity_type="PERSON",
        ),
        # full_text[35:41] == "France"
        Span(
            start_position=35,
            end_position=41,
            entity_value="France",
            entity_type="LOCATION",
        ),
    ],
    # No ground truth tokens or tags are supplied, so derive them from the spans
    create_tags_from_span=True,
)
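Counting character offsets by hand is error-prone, so it can help to derive them from the entity strings. The following is a minimal sketch in plain Python, not part of presidio-research: `locate` is a hypothetical helper, and it assumes that `end_position` is exclusive, as in a Python slice, so that `full_text[start:end]` equals the entity value.

```python
from typing import Tuple


def locate(text: str, value: str) -> Tuple[int, int]:
    """Return (start, end) of the first occurrence of value in text.

    Hypothetical helper for illustration; end is exclusive, like a slice.
    """
    start = text.find(value)
    if start == -1:
        raise ValueError(f"{value!r} not found in text")
    return start, start + len(value)


full_text = "My name is Trang Nguyen. I live in France."

for value in ("Trang Nguyen", "France"):
    start, end = locate(full_text, value)
    # Sanity check: slicing with the offsets returns the entity value
    assert full_text[start:end] == value
    print(value, start, end)
```

Note that `str.find` only returns the first occurrence, so a text that mentions the same value more than once needs extra care.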

The next snippet iterates over the tokens in the `sample` object and their corresponding tags, then counts how many tokens carry each tag.

`sample.tokens` holds the tokens, which are the individual words and punctuation marks from the `full_text` of the `InputSample` object. `sample.tags` is a list of tags, where each tag gives the entity type of the corresponding token. For example, a tag of 'PERSON' indicates that the token is part of a person's name, while a tag of 'O' indicates that the token is not part of a PII entity.

In [67]:
for token, tag in zip(sample.tokens, sample.tags):
    print({token: tag})

entity_counter = Counter()
for tag in sample.tags:
    entity_counter[tag] += 1
print("Count per entity:")
print(entity_counter.most_common())

{My: 'O'}
{name: 'O'}
{is: 'O'}
{Trang: 'PERSON'}
{Nguyen: 'PERSON'}
{.: 'O'}
{I: 'O'}
{live: 'O'}
{in: 'O'}
{France: 'LOCATION'}
{.: 'O'}
Count per entity:
[('O', 8), ('PERSON', 2), ('LOCATION', 1)]
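To make the relation between character spans and token tags more concrete, here is a simplified sketch of IO tagging in plain Python. It is not the presidio-research implementation: it uses a regular expression instead of a spaCy tokenizer, and `io_tags` is a hypothetical function that labels a token with an entity type when the token starts inside a span (with an exclusive end offset) and with 'O' otherwise.

```python
import re
from collections import Counter
from typing import List, Tuple


def io_tags(text: str, spans: List[Tuple[int, int, str]]) -> Tuple[List[str], List[str]]:
    """Tokenize text and assign one IO tag per token.

    spans holds (start, end, entity_type) tuples with an exclusive end.
    """
    tokens, tags = [], []
    # Words, or single non-space punctuation characters
    for match in re.finditer(r"\w+|[^\w\s]", text):
        label = "O"
        for start, end, entity_type in spans:
            if start <= match.start() < end:
                label = entity_type
                break
        tokens.append(match.group())
        tags.append(label)
    return tokens, tags


text = "My name is Trang Nguyen. I live in France."
person_start = text.find("Trang Nguyen")
location_start = text.find("France")
spans = [
    (person_start, person_start + len("Trang Nguyen"), "PERSON"),
    (location_start, location_start + len("France"), "LOCATION"),
]

tokens, tags = io_tags(text, spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
print(Counter(tags).most_common())
```

On this sentence the sketch produces the same eleven tokens and the same tag counts as the output above, but a real tokenizer can split text differently (for example around contractions or hyphens), so the two should not be expected to agree in general.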


The presidio-research package provides several wrapper models for running and evaluating PII detectors, such as CRF, spaCy, Stanza, Flair, Presidio, and Azure Text Analytics. In this article, I focus on evaluating Presidio's PII detection capability.

First, we declare a `PresidioAnalyzerWrapper` object, which wraps Presidio's PII recognizer so that it can be used for prediction and evaluation.

In [68]:

model_name = "Presidio Analyzer"
model = PresidioAnalyzerWrapper()

Entities supported by this Presidio Analyzer instance:
SG_NRIC_FIN, IN_AADHAAR, UK_NHS, IP_ADDRESS, PERSON, EMAIL_ADDRESS, URL, IN_VEHICLE_REGISTRATION, CRYPTO, US_DRIVER_LICENSE, US_ITIN, ORGANIZATION, DATE_TIME, US_BANK_NUMBER, AU_TFN, PHONE_NUMBER, LOCATION, AGE, AU_ABN, EMAIL, US_PASSPORT, NRP, AU_ACN, US_SSN, ID, IBAN_CODE, CREDIT_CARD, MEDICAL_LICENSE, AU_MEDICARE, IN_PAN


The output above lists the entity types that this `PresidioAnalyzerWrapper` instance currently supports.

You can also run PII detection directly with the model we just declared, using the following snippet. The model returns a list of tags, one for each token in the text. Each tag is either "O" (non-PII) or the entity type (e.g., "PERSON", "LOCATION") if the token is part of a PII entity.

In [69]:
pii_prediction = model.predict(sample)
print("PII detection output by using PresidioAnalyzerWrapper model")
print(sample.tokens)
print(pii_prediction)

PII detection output by using PresidioAnalyzerWrapper model
My name is Trang Nguyen. I live in France.
['O', 'O', 'O', 'PERSON', 'PERSON', 'O', 'O', 'O', 'O', 'LOCATION', 'O']
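A flat list of tags is easier to read once consecutive tokens with the same tag are grouped back into entities. The sketch below does this with `itertools.groupby` from the standard library. The token and tag lists are copied from the output above, and `group_entities` is a hypothetical helper, not a presidio-research function.

```python
from itertools import groupby
from typing import List, Tuple

tokens = ["My", "name", "is", "Trang", "Nguyen", ".", "I", "live", "in", "France", "."]
tags = ["O", "O", "O", "PERSON", "PERSON", "O", "O", "O", "O", "LOCATION", "O"]


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge runs of identically tagged tokens into (entity_type, text) pairs."""
    entities = []
    for tag, group in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((tag, " ".join(token for token, _ in group)))
    return entities


entities = group_entities(tokens, tags)
print(entities)
# [('PERSON', 'Trang Nguyen'), ('LOCATION', 'France')]
```

One limitation follows from the IO scheme itself: two adjacent entities of the same type are indistinguishable from one longer entity, so this grouping would merge them.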


Now you can run the evaluation with the following snippet:

In [70]:

evaluator = Evaluator(model=model)

evaluation_results = evaluator.evaluate_all(dataset=[sample])
results = evaluator.calculate_score(evaluation_results)


Evaluating <class 'presidio_evaluator.models.presidio_analyzer_wrapper.PresidioAnalyzerWrapper'>: 100%|██████████| 1/1 [00:00<00:00, 79.77it/s]


In [60]:
entities, confmatrix = results.to_confusion_matrix()

print("Confusion matrix:")
print(pd.DataFrame(confmatrix, columns=entities, index=entities))

print("Precision and recall")
print(results)

Confusion matrix:
          LOCATION  O  PERSON
LOCATION         1  0       0
O                0  6       0
PERSON           0  0       2
Precision and recall
              Entity           Precision              Recall   Number of samples
            LOCATION             100.00%             100.00%                   1
              PERSON             100.00%             100.00%                   2
                 PII             100.00%             100.00%                   3
PII F measure: 100.00%
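The confusion matrix counts 6 'O' tokens, while the tag list earlier contains 8. This is presumably because the evaluator skips some tokens, such as the two periods, but I have not verified this against the library source.

To show how token-level precision and recall follow from gold and predicted tags, here is a minimal sketch in plain Python. It is not the presidio-research implementation: `token_scores` and `f_beta` are hypothetical helpers, and `beta` is left as a parameter because the value the library uses for its F measure is not stated in this notebook. The predicted tags deliberately miss one PERSON token so that the scores are not all 100%.

```python
from collections import Counter
from typing import Dict, List, Tuple


def token_scores(gold: List[str], predicted: List[str]) -> Dict[str, Tuple[float, float]]:
    """Return {entity: (precision, recall)} computed per token."""
    counts = Counter(zip(gold, predicted))  # keys are (gold_tag, predicted_tag)
    entities = sorted({tag for tag in gold + predicted if tag != "O"})
    scores = {}
    for entity in entities:
        true_positives = counts[(entity, entity)]
        predicted_total = sum(n for (_, p), n in counts.items() if p == entity)
        gold_total = sum(n for (g, _), n in counts.items() if g == entity)
        precision = true_positives / predicted_total if predicted_total else float("nan")
        recall = true_positives / gold_total if gold_total else float("nan")
        scores[entity] = (precision, recall)
    return scores


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    denominator = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0


gold = ["O"] * 6 + ["PERSON", "PERSON", "LOCATION"]
predicted = ["O"] * 6 + ["PERSON", "O", "LOCATION"]  # one PERSON token missed

scores = token_scores(gold, predicted)
for entity, (precision, recall) in scores.items():
    print(f"{entity}: precision={precision:.2%}, recall={recall:.2%}, "
          f"F1={f_beta(precision, recall):.2%}")
```

Here LOCATION keeps 100% precision and recall, while PERSON keeps 100% precision but drops to 50% recall because one of its two tokens was predicted as 'O'.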
